### Describe the enhancement requested
`PlainValuesReader` (used for PLAIN-encoded INT32, INT64, FLOAT, and DOUBLE columns) currently
reads each value through a `LittleEndianDataInputStream` wrapper around a `ByteBufferInputStream`.
Per value, `readInt()` performs four separate virtual `in.read()` calls and manually reassembles
the result with bit shifts:
```java
public final int readInt() throws IOException {
    int ch4 = in.read();
    int ch3 = in.read();
    int ch2 = in.read();
    int ch1 = in.read();
    if ((ch1 | ch2 | ch3 | ch4) < 0) throw new EOFException();
    return ((ch1 << 24) + (ch2 << 16) + (ch3 << 8) + (ch4 << 0));
}
```
The `LittleEndianDataInputStream.readInt()` method itself carries a years-old TODO comment
suggesting exactly this kind of replacement (a `ByteBuffer`-backed implementation).
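To make the equivalence concrete, the manual shift reassembly and a single little-endian `ByteBuffer.getInt()` yield the same value. This is a minimal standalone demo, not Parquet code:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EndianEquivalenceDemo {
    public static void main(String[] args) {
        byte[] raw = {0x78, 0x56, 0x34, 0x12}; // little-endian encoding of 0x12345678

        // Per-byte reassembly, mirroring what the stream wrapper does today
        int manual = (raw[0] & 0xFF)
                | (raw[1] & 0xFF) << 8
                | (raw[2] & 0xFF) << 16
                | (raw[3] & 0xFF) << 24;

        // Single read through a little-endian ByteBuffer
        int direct = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN).getInt();

        System.out.println(manual == direct && direct == 0x12345678); // true
    }
}
```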
### Proposal
In `PlainValuesReader.initFromPage()`, obtain the page data as a single contiguous `ByteBuffer`
via `stream.slice(stream.available())` with `ByteOrder.LITTLE_ENDIAN`, and call the corresponding
`ByteBuffer` accessor directly per value:

- `readInteger()` → `buffer.getInt()`
- `readLong()` → `buffer.getLong()`
- `readFloat()` → `buffer.getFloat()`
- `readDouble()` → `buffer.getDouble()`
- `skip(n)` → `buffer.position(buffer.position() + n * typeSize)`
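The shape of the proposed hot path can be sketched as follows. Names and framing here are approximate, not the actual `PlainValuesReader` class hierarchy, and the buffer setup stands in for the `stream.slice(...)` call described above:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch of the INT32 variant of the proposed reader.
public class PlainIntSketch {
    private ByteBuffer buffer;

    // In the real proposal, the buffer would come from
    // stream.slice(stream.available()) inside initFromPage().
    public void initFromPage(ByteBuffer pageData) {
        buffer = pageData.duplicate().order(ByteOrder.LITTLE_ENDIAN);
    }

    public int readInteger() {
        return buffer.getInt(); // advances position by 4; no per-byte assembly
    }

    public void skip(int n) {
        buffer.position(buffer.position() + n * 4); // typeSize == 4 for INT32
    }

    public static void main(String[] args) {
        ByteBuffer page = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN);
        page.putInt(10).putInt(20).putInt(30);
        page.flip();

        PlainIntSketch reader = new PlainIntSketch();
        reader.initFromPage(page);
        System.out.println(reader.readInteger()); // 10
        reader.skip(1);                           // skips 20
        System.out.println(reader.readInteger()); // 30
    }
}
```

The long/float/double variants would differ only in the accessor and the `typeSize` used by `skip()`.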
`ByteBuffer.getInt()` with the appropriate byte order is a HotSpot intrinsic that compiles to a
single unaligned load instruction on x86/ARM: no virtual dispatch, no per-byte reassembly, no
checked `IOException` on the per-value path.
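The exception point is worth spelling out: `ByteBuffer` accessors throw the unchecked `BufferUnderflowException` at end of data, so per-value call sites need no `throws` clause or try/catch, unlike `InputStream`-based reads. A small standalone illustration (not Parquet code):

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class UncheckedReadDemo {
    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(7);
        buf.flip();

        System.out.println(buf.getInt()); // 7 -- no checked IOException possible

        try {
            buf.getInt(); // past the limit
        } catch (BufferUnderflowException e) {
            // unchecked, so it can be handled once at page granularity if desired
            System.out.println("underflow");
        }
    }
}
```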
`ByteBufferInputStream.slice()` already handles both the single-buffer (zero-copy view) and
multi-buffer (single contiguous copy) cases transparently. In practice, page data is almost
always a single contiguous buffer.
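The zero-copy view idea (and one practical caveat) can be illustrated with plain `java.nio` rather than Parquet's own `ByteBufferInputStream`: `ByteBuffer.slice()` returns a view over the same backing bytes but resets the byte order to big-endian, so `LITTLE_ENDIAN` must be reapplied on the slice:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class SliceViewDemo {
    public static void main(String[] args) {
        ByteBuffer page = ByteBuffer.allocate(16);
        page.order(ByteOrder.LITTLE_ENDIAN).putLong(1L).putLong(2L);
        page.flip();

        // A slice is a view over the same bytes: no copy in the single-buffer case.
        // Caveat: slice() resets the byte order to BIG_ENDIAN, so LITTLE_ENDIAN
        // must be set again before calling getLong()/getInt() on the slice.
        ByteBuffer view = page.slice().order(ByteOrder.LITTLE_ENDIAN);

        System.out.println(view.getLong());               // 1
        System.out.println(view.array() == page.array()); // true: shared backing array
    }
}
```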
### Benchmark results
`IntEncodingBenchmark.decodePlain` (100,000 INT32 values per invocation, JMH `-wi 3 -i 5 -f 1`):

| Pattern | Before (ops/s) | After (ops/s) | Speedup |
|---|---|---|---|
| SEQUENTIAL | 92,918,297 | 1,143,149,235 | 12.3x |
| RANDOM | 92,126,888 | 1,147,547,093 | 12.5x |
| LOW_CARDINALITY | 93,005,451 | 1,142,666,760 | 12.3x |
| HIGH_CARDINALITY | 93,312,596 | 1,144,681,876 | 12.3x |
The speedup is consistent across data patterns because the bottleneck is entirely in the
per-value dispatch overhead, not the data itself. All four numeric plain reader types
(int, long, float, double) benefit equally.
End-to-end on `ReadBenchmarks.readFile` (6-column schema, 100k rows), this contributes to the
overall read-path improvement of 5–10% measured when combined with related decode optimizations.
### Side effects
After this change, `LittleEndianDataInputStream` has no remaining production usages in the
codebase; the only remaining references are one test (`TestColumnChunkPageWriteStore`) and two
TODO comments. A follow-up could remove or deprecate the class.
### Validation
All 573 `parquet-column` tests pass with the change applied.
### Component(s)
Core