
Optimize PlainValuesReader by reading directly from ByteBuffer (12x decode speedup) #3493

@iemejia

Description


Describe the enhancement requested

PlainValuesReader (used for PLAIN-encoded INT32, INT64, FLOAT, and DOUBLE columns) currently
reads each value through a LittleEndianDataInputStream wrapper around a ByteBufferInputStream.
Per value, readInt() performs 4 separate virtual in.read() calls and manually reassembles the
result with bit shifts:

public final int readInt() throws IOException {
    int ch4 = in.read();
    int ch3 = in.read();
    int ch2 = in.read();
    int ch1 = in.read();
    if ((ch1 | ch2 | ch3 | ch4) < 0) throw new EOFException();
    return ((ch1 << 24) + (ch2 << 16) + (ch3 << 8) + (ch4 << 0));
}

The LittleEndianDataInputStream.readInt() method itself carries a TODO comment from years ago
suggesting exactly this kind of replacement (a ByteBuffer-backed implementation).

Proposal

In PlainValuesReader.initFromPage(), obtain the page data as a single contiguous ByteBuffer
via stream.slice(stream.available()) with ByteOrder.LITTLE_ENDIAN, and call the corresponding
ByteBuffer accessor directly per value:

  • readInteger() → buffer.getInt()
  • readLong() → buffer.getLong()
  • readFloat() → buffer.getFloat()
  • readDouble() → buffer.getDouble()
  • skip(n) → buffer.position(buffer.position() + n * typeSize)

ByteBuffer.getInt() with the appropriate byte order is a HotSpot intrinsic that compiles to a
single unaligned load instruction on x86/ARM — no virtual dispatch, no per-byte assembly, no
checked IOException on the per-value path.
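To illustrate the equivalence (this is a standalone demo, not code from the patch): the per-byte reassembly and a single ByteBuffer accessor with the byte order set once produce the same value from the same page bytes.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class LittleEndianDemo {
    public static void main(String[] args) {
        // Little-endian encoding of 0x12345678, as it would appear in a PLAIN page
        byte[] page = {0x78, 0x56, 0x34, 0x12};

        // Manual per-byte reassembly, in the style of LittleEndianDataInputStream.readInt()
        int manual = ((page[3] & 0xFF) << 24) | ((page[2] & 0xFF) << 16)
                   | ((page[1] & 0xFF) << 8)  |  (page[0] & 0xFF);

        // Single accessor call; the JIT compiles this to one unaligned load
        int direct = ByteBuffer.wrap(page).order(ByteOrder.LITTLE_ENDIAN).getInt();

        System.out.println(manual == direct && direct == 0x12345678); // true
    }
}
```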

ByteBufferInputStream.slice() already handles both single-buffer (zero-copy view) and
multi-buffer (single contiguous copy) cases transparently. In practice page data is almost
always a single contiguous buffer.
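A minimal sketch of the proposed reader shape for the INT32 case (hypothetical class and method names; the real patch would live inside PlainValuesReader and obtain its buffer from stream.slice(stream.available())):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch, not the actual patch: a plain INT32 reader backed
// directly by a little-endian ByteBuffer instead of a stream wrapper.
public class PlainIntReaderSketch {
    private ByteBuffer buffer;

    // In Parquet this would be initFromPage(), with the buffer coming from
    // ByteBufferInputStream.slice(available()).
    public void initFromPage(ByteBuffer pageData) {
        this.buffer = pageData.duplicate().order(ByteOrder.LITTLE_ENDIAN);
    }

    public int readInteger() {
        return buffer.getInt(); // intrinsic: no virtual dispatch, no IOException
    }

    public void skip(int n) {
        // Advance past n values without decoding them
        buffer.position(buffer.position() + n * Integer.BYTES);
    }

    public static void main(String[] args) {
        ByteBuffer page = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN);
        page.putInt(7).putInt(8).putInt(9);
        page.flip();

        PlainIntReaderSketch reader = new PlainIntReaderSketch();
        reader.initFromPage(page);
        System.out.println(reader.readInteger()); // 7
        reader.skip(1);                           // skip the value 8
        System.out.println(reader.readInteger()); // 9
    }
}
```

The long/float/double variants are identical except for the accessor and the element size used in skip().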

Benchmark results

IntEncodingBenchmark.decodePlain (100,000 INT32 values per invocation, JMH -wi 3 -i 5 -f 1):

Pattern            Before (ops/s)    After (ops/s)    Speedup
SEQUENTIAL             92,918,297    1,143,149,235      12.3x
RANDOM                 92,126,888    1,147,547,093      12.5x
LOW_CARDINALITY        93,005,451    1,142,666,760      12.3x
HIGH_CARDINALITY       93,312,596    1,144,681,876      12.3x

The speedup is consistent across data patterns because the bottleneck is entirely in the
per-value dispatch overhead, not the data itself. All four numeric plain reader types
(int, long, float, double) benefit equally.

End-to-end on ReadBenchmarks.readFile (6-column schema, 100k rows), this change contributes to an
overall read-path improvement of 5–10%, measured in combination with related decode optimizations.

Side effects

After this change, LittleEndianDataInputStream has no remaining production usages in the
codebase — only one test reference (TestColumnChunkPageWriteStore) and two TODO comments.
A follow-up could remove or deprecate the class.

Validation

All 573 parquet-column tests pass with the change applied.

Component(s)

Core
