Describe the enhancement requested
PlainValuesWriter (used for PLAIN-encoded INT32, INT64, FLOAT, DOUBLE, and BINARY
columns) currently writes each value through two layers of abstraction:
PlainValuesWriter -> LittleEndianDataOutputStream -> CapacityByteArrayOutputStream
Per writeInt(), LittleEndianDataOutputStream decomposes the int into 4 bytes
in a temporary writeBuffer[8] array and calls out.write(writeBuffer, 0, 4),
which dispatches through the OutputStream chain into CapacityByteArrayOutputStream.
That path performs:
- 4 byte-shift operations for little-endian decomposition
- 1 intermediate writeBuffer[8] array write
- 2 levels of virtual dispatch
- 1 bounds check in write(byte[], off, len)
- 1 System.arraycopy for 4 bytes
Since CapacityByteArrayOutputStream already buffers into ByteBuffer slabs
internally, the entire chain can be collapsed into a single ByteBuffer.putInt()
call, which is a HotSpot intrinsic that compiles to a single unaligned store on
x86/ARM when the buffer is in LITTLE_ENDIAN order.
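The equivalence this relies on can be checked in isolation. The sketch below is illustrative only: the class name LittleEndianWriteSketch and its two helper methods are hypothetical, and viaShifts only approximates the kind of byte-shift decomposition described above rather than reproducing the actual LittleEndianDataOutputStream code. It shows that a manual little-endian decomposition and a ByteBuffer.putInt() on a LITTLE_ENDIAN buffer yield the same four bytes.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical helper class, not part of parquet-java.
class LittleEndianWriteSketch {

    // Manual decomposition: shift each byte into a scratch array, then copy
    // the 4 used bytes out (stands in for write(byte[], off, len)).
    static byte[] viaShifts(int v) {
        byte[] writeBuffer = new byte[8];
        writeBuffer[0] = (byte) v;
        writeBuffer[1] = (byte) (v >>> 8);
        writeBuffer[2] = (byte) (v >>> 16);
        writeBuffer[3] = (byte) (v >>> 24);
        byte[] out = new byte[4];
        System.arraycopy(writeBuffer, 0, out, 0, 4);
        return out;
    }

    // Direct store: one putInt() on a buffer configured as LITTLE_ENDIAN.
    static byte[] viaPutInt(int v) {
        ByteBuffer slab = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
        slab.putInt(v);
        return slab.array();
    }

    public static void main(String[] args) {
        int v = 0x12345678;
        // Both paths should produce the bytes 78 56 34 12.
        System.out.println(Arrays.equals(viaShifts(v), viaPutInt(v)));
    }
}
```

Note that a freshly allocated ByteBuffer defaults to BIG_ENDIAN, which is why the byte order has to be set explicitly on each new buffer.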
Proposal
- In CapacityByteArrayOutputStream:
  - Set ByteOrder.LITTLE_ENDIAN on newly allocated slabs in addSlab().
  - Add writeInt(int) and writeLong(long) methods that call
    currentSlab.putInt(v) / currentSlab.putLong(v) directly, with a single
    remaining-check that grows the slab if needed.
- In PlainValuesWriter:
  - Remove the LittleEndianDataOutputStream field entirely.
  - writeInteger(v) -> arrayOut.writeInt(v)
  - writeLong(v) -> arrayOut.writeLong(v)
  - writeFloat(v) -> arrayOut.writeInt(Float.floatToIntBits(v))
  - writeDouble(v) -> arrayOut.writeLong(Double.doubleToLongBits(v))
  - writeBytes(Binary v) -> arrayOut.writeInt(v.length()); v.writeTo(arrayOut);
  - getBytes() no longer needs to flush a buffering layer.
  - close() no longer closes the defunct stream.
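The shape of the proposed write path can be sketched as follows. This is a minimal standalone illustration, not the actual CapacityByteArrayOutputStream: the class SlabWriterSketch, its fixed slab size, and its toByteArray() helper are all assumptions made for the example, and the real class has its own slab sizing and growth policy. Here "grow" is simplified to opening a fresh slab whenever the value does not fit in the current one.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a slab-backed output stream.
class SlabWriterSketch {
    private final List<ByteBuffer> slabs = new ArrayList<>();
    private final int slabSize;
    private ByteBuffer currentSlab;

    SlabWriterSketch(int slabSize) {
        if (slabSize < Long.BYTES) {
            throw new IllegalArgumentException("slab must hold at least one long");
        }
        this.slabSize = slabSize;
        addSlab();
    }

    // Every new slab is switched to LITTLE_ENDIAN once, at allocation time.
    private void addSlab() {
        currentSlab = ByteBuffer.allocate(slabSize).order(ByteOrder.LITTLE_ENDIAN);
        slabs.add(currentSlab);
    }

    // One remaining-check, then a direct store into the slab.
    void writeInt(int v) {
        if (currentSlab.remaining() < Integer.BYTES) {
            addSlab();
        }
        currentSlab.putInt(v);
    }

    void writeLong(long v) {
        if (currentSlab.remaining() < Long.BYTES) {
            addSlab();
        }
        currentSlab.putLong(v);
    }

    // Floating-point values reuse the integer paths via their bit patterns.
    void writeFloat(float v) {
        writeInt(Float.floatToIntBits(v));
    }

    void writeDouble(double v) {
        writeLong(Double.doubleToLongBits(v));
    }

    // Concatenates only the written portion of each slab.
    byte[] toByteArray() {
        int total = 0;
        for (ByteBuffer s : slabs) {
            total += s.position();
        }
        byte[] out = new byte[total];
        int off = 0;
        for (ByteBuffer s : slabs) {
            System.arraycopy(s.array(), 0, out, off, s.position());
            off += s.position();
        }
        return out;
    }
}
```

In this sketch a value is never split across two slabs, so a slab may end with a few unused bytes; toByteArray() skips them by copying only up to each slab's position.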
What was eliminated per writeInt call:
- 4 byte-shift operations for little-endian decomposition
- 1 intermediate writeBuffer[8] array write
- 2 levels of virtual dispatch
- 1 bounds check in write(byte[], off, len)
- 1 System.arraycopy for 4 bytes
Replaced with:
- 1 remaining-check on the slab ByteBuffer
- 1 ByteBuffer.putInt() call (single JVM intrinsic, ~1 store instruction on
  little-endian architectures)
Benchmark results
IntEncodingBenchmark.encodePlain (100,000 INT32 values per invocation, JMH
-wi 3 -i 5 -f 1):
| Pattern | Before (ops/s) | After (ops/s) | Improvement |
|---|---|---|---|
| SEQUENTIAL | 26,817,451 | 52,953,193 | +97.5% (2.0x) |
| RANDOM | 28,517,312 | 37,774,036 | +32.5% |
| LOW_CARDINALITY | 28,705,158 | 52,819,678 | +84.0% |
| HIGH_CARDINALITY | 28,595,519 | 37,862,571 | +32.4% |
The improvement varies by pattern: SEQUENTIAL and LOW_CARDINALITY see 1.8x to
2.0x because the slab putInt() path has highly predictable branching (the slab
rarely runs out for sequential writes). RANDOM and HIGH_CARDINALITY still
improve by about 32%.
The same code path also benefits writeLong(), writeFloat(), writeDouble(),
and the length prefix written by writeBytes(Binary).
Decode round-trip verified: re-reading the encoded data with PlainValuesReader
produces identical values at ~1.15B ops/s.
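A round-trip check of this kind can be sketched without any Parquet classes. The example below is an assumption-laden illustration: PlainRoundTripSketch and its methods are hypothetical names, and plain ByteBuffer reads stand in for PlainValuesReader. It covers both fixed-width INT32 values and the BINARY layout of a 4-byte little-endian length prefix followed by the raw bytes.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical round-trip helper, not part of parquet-java.
class PlainRoundTripSketch {

    // Writes each int as 4 little-endian bytes, back to back.
    static byte[] encodeInts(int[] values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Integer.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN);
        for (int v : values) {
            buf.putInt(v);
        }
        return buf.array();
    }

    // Reads the bytes back with the same byte order.
    static int[] decodeInts(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        int[] out = new int[bytes.length / Integer.BYTES];
        for (int i = 0; i < out.length; i++) {
            out[i] = buf.getInt();
        }
        return out;
    }

    // Writes a 4-byte little-endian length prefix, then the payload bytes.
    static byte[] encodeBinary(byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + payload.length)
                .order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(payload.length);
        buf.put(payload);
        return buf.array();
    }

    // Reads the length prefix, then exactly that many payload bytes.
    static byte[] decodeBinary(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        byte[] out = new byte[buf.getInt()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) {
        int[] values = {0, 1, -1, Integer.MAX_VALUE, Integer.MIN_VALUE};
        System.out.println(Arrays.equals(values, decodeInts(encodeInts(values))));

        byte[] payload = "parquet".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(payload, decodeBinary(encodeBinary(payload))));
    }
}
```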
Validation
All 573 parquet-column tests and 308 parquet-common tests pass with the
change applied.
Component(s)
Core