
Optimize PlainValuesReader by reading directly from ByteBuffer (12x decode speedup) #3493

@iemejia

Description


Describe the enhancement requested

PlainValuesReader (used for PLAIN-encoded INT32, INT64, FLOAT, and DOUBLE columns) currently
reads each value through a LittleEndianDataInputStream wrapper around a ByteBufferInputStream.
Per value, readInt() performs 4 separate virtual in.read() calls and manually reassembles the
result with bit shifts:

public final int readInt() throws IOException {
    int ch4 = in.read();
    int ch3 = in.read();
    int ch2 = in.read();
    int ch1 = in.read();
    if ((ch1 | ch2 | ch3 | ch4) < 0) throw new EOFException();
    return ((ch1 << 24) + (ch2 << 16) + (ch3 << 8) + (ch4 << 0));
}

The LittleEndianDataInputStream.readInt() method itself carries a TODO comment from years ago
suggesting exactly this kind of replacement (a ByteBuffer-backed implementation).

Proposal

In PlainValuesReader.initFromPage(), obtain the page data as a single contiguous ByteBuffer
via stream.slice(stream.available()) with ByteOrder.LITTLE_ENDIAN, and call the corresponding
ByteBuffer accessor directly per value:

  • readInteger() → buffer.getInt()
  • readLong() → buffer.getLong()
  • readFloat() → buffer.getFloat()
  • readDouble() → buffer.getDouble()
  • skip(n) → buffer.position(buffer.position() + n * typeSize)

ByteBuffer.getInt() with the appropriate byte order is a HotSpot intrinsic that compiles to a
single unaligned load instruction on x86/ARM — no virtual dispatch, no per-byte assembly, no
checked IOException on the per-value path.
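To illustrate the equivalence (this is a standalone demo, not code from the patch): the per-byte reassembly and a single ByteBuffer accessor with the byte order set once produce the same value from the same page bytes.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class LittleEndianDemo {
    public static void main(String[] args) {
        // Little-endian encoding of 0x12345678, as it would appear in a PLAIN page
        byte[] page = {0x78, 0x56, 0x34, 0x12};

        // Manual per-byte reassembly, in the style of LittleEndianDataInputStream.readInt()
        int manual = ((page[3] & 0xFF) << 24) | ((page[2] & 0xFF) << 16)
                   | ((page[1] & 0xFF) << 8)  |  (page[0] & 0xFF);

        // Single accessor call; the JIT compiles this to one unaligned load
        int direct = ByteBuffer.wrap(page).order(ByteOrder.LITTLE_ENDIAN).getInt();

        System.out.println(manual == direct && direct == 0x12345678); // true
    }
}
```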

ByteBufferInputStream.slice() already handles both single-buffer (zero-copy view) and
multi-buffer (single contiguous copy) cases transparently. In practice page data is almost
always a single contiguous buffer.
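A minimal sketch of the proposed reader shape for the INT32 case (hypothetical class and method names; the real patch would live inside PlainValuesReader and obtain its buffer from stream.slice(stream.available())):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch, not the actual patch: a plain INT32 reader backed
// directly by a little-endian ByteBuffer instead of a stream wrapper.
public class PlainIntReaderSketch {
    private ByteBuffer buffer;

    // In Parquet this would be initFromPage(), with the buffer coming from
    // ByteBufferInputStream.slice(available()).
    public void initFromPage(ByteBuffer pageData) {
        this.buffer = pageData.duplicate().order(ByteOrder.LITTLE_ENDIAN);
    }

    public int readInteger() {
        return buffer.getInt(); // intrinsic: no virtual dispatch, no IOException
    }

    public void skip(int n) {
        // Advance past n values without decoding them
        buffer.position(buffer.position() + n * Integer.BYTES);
    }

    public static void main(String[] args) {
        ByteBuffer page = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN);
        page.putInt(7).putInt(8).putInt(9);
        page.flip();

        PlainIntReaderSketch reader = new PlainIntReaderSketch();
        reader.initFromPage(page);
        System.out.println(reader.readInteger()); // 7
        reader.skip(1);                           // skip the value 8
        System.out.println(reader.readInteger()); // 9
    }
}
```

The long/float/double variants are identical except for the accessor and the element size used in skip().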

Benchmark results

IntEncodingBenchmark.decodePlain (100,000 INT32 values per invocation, JMH -wi 3 -i 5 -f 1):

Pattern            Before (ops/s)    After (ops/s)    Speedup
SEQUENTIAL             92,918,297    1,143,149,235      12.3x
RANDOM                 92,126,888    1,147,547,093      12.5x
LOW_CARDINALITY        93,005,451    1,142,666,760      12.3x
HIGH_CARDINALITY       93,312,596    1,144,681,876      12.3x

The speedup is consistent across data patterns because the bottleneck is entirely in the
per-value dispatch overhead, not the data itself. All four numeric plain reader types
(int, long, float, double) benefit equally.

End-to-end on ReadBenchmarks.readFile (6-column schema, 100k rows), this change contributes to an
overall read-path improvement of 5–10%, measured in combination with related decode optimizations.

Side effects

After this change, LittleEndianDataInputStream has no remaining production usages in the
codebase — only one test reference (TestColumnChunkPageWriteStore) and two TODO comments.
A follow-up could remove or deprecate the class.

Validation

All 573 parquet-column tests pass with the change applied.

Component(s)

Core
