Optimize PlainValuesWriter by writing directly to ByteBuffer slabs (up to 2x encode speedup) #3495

@iemejia

Describe the enhancement requested

PlainValuesWriter (used for PLAIN-encoded INT32, INT64, FLOAT, DOUBLE, and BINARY
columns) currently writes each value through two layers of abstraction:

PlainValuesWriter -> LittleEndianDataOutputStream -> CapacityByteArrayOutputStream

Per writeInt(), LittleEndianDataOutputStream decomposes the int into 4 bytes
in a temporary writeBuffer[8] array and calls out.write(writeBuffer, 0, 4),
which dispatches through the OutputStream chain into CapacityByteArrayOutputStream.
That path performs:

  • 4 byte-shift operations for little-endian decomposition
  • 1 intermediate writeBuffer[8] array write
  • 2 levels of virtual dispatch
  • 1 bounds check in write(byte[], off, len)
  • 1 System.arraycopy for 4 bytes

Since CapacityByteArrayOutputStream already buffers into ByteBuffer slabs
internally, the entire chain can be collapsed into a single ByteBuffer.putInt()
call, which is a HotSpot intrinsic that compiles to a single unaligned store on
x86/ARM when the buffer is in LITTLE_ENDIAN order.
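To illustrate the equivalence this relies on, here is a minimal sketch comparing a hand-rolled little-endian decomposition (modelled on the description above, not copied from LittleEndianDataOutputStream) with a single putInt() on a LITTLE_ENDIAN ByteBuffer. The class and method names are hypothetical and exist only for this example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical demo class; not part of parquet-java.
class LittleEndianPaths {

    // Old-style path: shift the int into a scratch array, then copy 4 bytes out.
    static byte[] viaShifts(int v) {
        byte[] writeBuffer = new byte[8];
        writeBuffer[0] = (byte) v;
        writeBuffer[1] = (byte) (v >>> 8);
        writeBuffer[2] = (byte) (v >>> 16);
        writeBuffer[3] = (byte) (v >>> 24);
        byte[] out = new byte[4];
        System.arraycopy(writeBuffer, 0, out, 0, 4);
        return out;
    }

    // Collapsed path: one putInt() on a buffer set to little-endian order.
    static byte[] viaPutInt(int v) {
        ByteBuffer slab = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
        slab.putInt(v);
        return slab.array();
    }

    public static void main(String[] args) {
        int v = 0x12345678;
        System.out.println(Arrays.toString(viaShifts(v)));              // [120, 86, 52, 18]
        System.out.println(Arrays.equals(viaShifts(v), viaPutInt(v)));  // true
    }
}
```

Both paths produce the same bytes, so the on-disk PLAIN encoding should not change; only the number of steps per value does.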

Proposal

  1. In CapacityByteArrayOutputStream:

    • Set ByteOrder.LITTLE_ENDIAN on newly allocated slabs in addSlab().
    • Add writeInt(int) and writeLong(long) methods that call
      currentSlab.putInt(v) / currentSlab.putLong(v) directly, with a single
      remaining-check that grows the slab if needed.
  2. In PlainValuesWriter:

    • Remove the LittleEndianDataOutputStream field entirely.
    • writeInteger(v) -> arrayOut.writeInt(v)
    • writeLong(v) -> arrayOut.writeLong(v)
    • writeFloat(v) -> arrayOut.writeInt(Float.floatToIntBits(v))
    • writeDouble(v) -> arrayOut.writeLong(Double.doubleToLongBits(v))
    • writeBytes(Binary v) -> arrayOut.writeInt(v.length()); v.writeTo(arrayOut);
    • getBytes() no longer needs to flush a buffering layer.
    • close() no longer closes the defunct stream.
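The proposal above can be sketched roughly as follows. This is a simplified stand-in, not the actual CapacityByteArrayOutputStream or PlainValuesWriter code: the class name SlabSketch, the fixed slab size, the byte[] argument in place of Binary, and the choice to start a fresh slab rather than split a value across two slabs are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a slab-backed output; not the real parquet-java class.
class SlabSketch {
    private final int slabSize;
    private final List<ByteBuffer> slabs = new ArrayList<>();
    private ByteBuffer currentSlab;

    SlabSketch(int slabSize) {
        this.slabSize = slabSize;
        addSlab(slabSize);
    }

    // Every new slab is little-endian, so putInt/putLong emit PLAIN byte order.
    private void addSlab(int minSize) {
        currentSlab = ByteBuffer.allocate(Math.max(slabSize, minSize))
                .order(ByteOrder.LITTLE_ENDIAN);
        slabs.add(currentSlab);
    }

    // Single remaining-check, then a direct putInt.
    void writeInt(int v) {
        if (currentSlab.remaining() < Integer.BYTES) addSlab(Integer.BYTES);
        currentSlab.putInt(v);
    }

    void writeLong(long v) {
        if (currentSlab.remaining() < Long.BYTES) addSlab(Long.BYTES);
        currentSlab.putLong(v);
    }

    // Floating-point values reuse the integer paths via their bit patterns.
    void writeFloat(float v) { writeInt(Float.floatToIntBits(v)); }
    void writeDouble(double v) { writeLong(Double.doubleToLongBits(v)); }

    // Length prefix followed by the raw bytes (byte[] stands in for Binary here).
    void writeBytes(byte[] v) {
        writeInt(v.length);
        if (currentSlab.remaining() < v.length) addSlab(v.length);
        currentSlab.put(v);
    }

    // Concatenate only the written portion of each slab.
    byte[] toByteArray() {
        int total = 0;
        for (ByteBuffer s : slabs) total += s.position();
        byte[] out = new byte[total];
        int off = 0;
        for (ByteBuffer s : slabs) {
            System.arraycopy(s.array(), 0, out, off, s.position());
            off += s.position();
        }
        return out;
    }

    public static void main(String[] args) {
        SlabSketch w = new SlabSketch(6);  // tiny slab to force a second slab
        w.writeInt(1);
        w.writeInt(2);
        System.out.println(Arrays.toString(w.toByteArray()));  // [1, 0, 0, 0, 2, 0, 0, 0]
    }
}
```

The real change may handle slab growth and values that straddle a slab boundary differently; the point of the sketch is only that each write becomes one remaining-check plus one put call.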

What was eliminated per writeInt call:

  • 4 byte-shift operations for little-endian decomposition
  • 1 intermediate writeBuffer[8] array write
  • 2 levels of virtual dispatch
  • 1 bounds check in write(byte[], off, len)
  • 1 System.arraycopy for 4 bytes

Replaced with:

  • 1 remaining-check on the slab ByteBuffer
  • 1 ByteBuffer.putInt() call (single JVM intrinsic, ~1 store instruction on
    little-endian architectures)

Benchmark results

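IntEncodingBenchmark.encodePlain (100,000 INT32 values per invocation, JMH
-wi 3 -i 5 -f 1, i.e. 3 warmup iterations, 5 measurement iterations, 1 fork):

| Pattern          | Before (ops/s) | After (ops/s) | Improvement   |
|------------------|---------------:|--------------:|---------------|
| SEQUENTIAL       |     26,817,451 |    52,953,193 | +97.5% (2.0x) |
| RANDOM           |     28,517,312 |    37,774,036 | +32.5%        |
| LOW_CARDINALITY  |     28,705,158 |    52,819,678 | +84.0%        |
| HIGH_CARDINALITY |     28,595,519 |    37,862,571 | +32.4%        |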

The improvement varies by pattern: SEQUENTIAL and LOW_CARDINALITY see roughly
1.8x to 2.0x because the slab putInt() path has highly predictable branching
(slab rarely runs out for sequential writes). RANDOM and HIGH_CARDINALITY still
see a solid improvement of about +32%.
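For reference, the Improvement figures follow from (after - before) / before. A small sketch that recomputes them from the throughput numbers reported above (the class and method names are hypothetical):

```java
import java.util.Locale;

// Hypothetical helper; recomputes the Improvement column from the reported ops/s.
class Speedup {

    // Relative throughput gain in percent.
    static double improvementPercent(long before, long after) {
        return (after - before) * 100.0 / before;
    }

    static String format(long before, long after) {
        return String.format(Locale.ROOT, "%+.1f%%", improvementPercent(before, after));
    }

    public static void main(String[] args) {
        System.out.println("SEQUENTIAL       " + format(26_817_451L, 52_953_193L)); // +97.5%
        System.out.println("RANDOM           " + format(28_517_312L, 37_774_036L)); // +32.5%
        System.out.println("LOW_CARDINALITY  " + format(28_705_158L, 52_819_678L)); // +84.0%
        System.out.println("HIGH_CARDINALITY " + format(28_595_519L, 37_862_571L)); // +32.4%
    }
}
```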

The same code path also benefits writeLong(), writeFloat(), writeDouble(),
and the length prefix written by writeBytes(Binary).

Decode round-trip verified: re-reading the encoded data with PlainValuesReader
produces identical values at ~1.15B ops/s.
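The shape of such a round-trip check can be sketched with plain ByteBuffers. This does not use PlainValuesWriter or PlainValuesReader; the class PlainRoundTrip and its methods are hypothetical and only mirror the idea of writing little-endian INT32 values and reading them back.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical round-trip check; stands in for writer + reader, not the real classes.
class PlainRoundTrip {

    // Encode each int as 4 little-endian bytes, back to back.
    static byte[] encode(int[] values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Integer.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN);
        for (int v : values) {
            buf.putInt(v);
        }
        return buf.array();
    }

    // Decode by reading 4 little-endian bytes per value.
    static int[] decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        int[] out = new int[bytes.length / Integer.BYTES];
        for (int i = 0; i < out.length; i++) {
            out[i] = buf.getInt();
        }
        return out;
    }

    public static void main(String[] args) {
        int[] values = {0, 1, -1, Integer.MAX_VALUE, Integer.MIN_VALUE};
        int[] decoded = decode(encode(values));
        System.out.println(Arrays.equals(values, decoded));  // true
    }
}
```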

Validation

All 573 parquet-column tests and 308 parquet-common tests pass with the
change applied.

Component(s)

Core
