Optimize PlainValuesWriter by writing directly to ByteBuffer slabs (up to 2x encode speedup)

### Describe the enhancement requested

`PlainValuesWriter` (used for PLAIN-encoded INT32, INT64, FLOAT, DOUBLE, and BINARY
columns) currently writes each value through two layers of abstraction:

```
PlainValuesWriter -> LittleEndianDataOutputStream -> CapacityByteArrayOutputStream
```

On each `writeInt()` call, `LittleEndianDataOutputStream` decomposes the int into
4 bytes in a reusable `writeBuffer[8]` scratch array and calls
`out.write(writeBuffer, 0, 4)`, which dispatches through the `OutputStream` chain
into `CapacityByteArrayOutputStream`. That path performs:

- 4 byte-shift operations for little-endian decomposition
- 1 intermediate `writeBuffer[8]` array write
- 2 levels of virtual dispatch
- 1 bounds check in `write(byte[], off, len)`
- 1 `System.arraycopy` for 4 bytes

Since `CapacityByteArrayOutputStream` already buffers into `ByteBuffer` slabs
internally, the entire chain can be collapsed into a single `ByteBuffer.putInt()`
call. On current JDKs HotSpot treats this as an intrinsic and can compile it to a
single unaligned store on x86/ARM when the buffer is in `LITTLE_ENDIAN` order.
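
To make the equivalence concrete, here is a minimal sketch comparing the two ways of producing the 4 little-endian bytes. The class and method names (`LittleEndianPaths`, `viaByteShifts`, `viaPutInt`) are hypothetical and exist only for illustration; the first method imitates the byte-shift path described above rather than reproducing the actual Parquet source.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical illustration class; not part of Parquet.
final class LittleEndianPaths {

    // Imitates the current path: shift the int into a scratch array,
    // then copy the 4 bytes into the destination.
    static byte[] viaByteShifts(int v) {
        byte[] writeBuffer = new byte[8];
        writeBuffer[0] = (byte) v;
        writeBuffer[1] = (byte) (v >>> 8);
        writeBuffer[2] = (byte) (v >>> 16);
        writeBuffer[3] = (byte) (v >>> 24);
        byte[] out = new byte[4];
        System.arraycopy(writeBuffer, 0, out, 0, 4);
        return out;
    }

    // Proposed path: one putInt() on a little-endian ByteBuffer.
    static byte[] viaPutInt(int v) {
        ByteBuffer slab = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
        slab.putInt(v);
        return slab.array();
    }

    public static void main(String[] args) {
        int v = 0x12345678;
        // Both paths should yield the same byte sequence.
        System.out.println(Arrays.equals(viaByteShifts(v), viaPutInt(v)));
    }
}
```

Both methods emit the least significant byte first, so the on-disk PLAIN encoding is unchanged by the proposal; only the number of steps taken to produce it differs.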

### Proposal

1. In `CapacityByteArrayOutputStream`:
   - Set `ByteOrder.LITTLE_ENDIAN` on newly allocated slabs in `addSlab()`.
   - Add `writeInt(int)` and `writeLong(long)` methods that call
     `currentSlab.putInt(v)` / `currentSlab.putLong(v)` directly, with a single
     remaining-check that grows the slab if needed.

2. In `PlainValuesWriter`:
   - Remove the `LittleEndianDataOutputStream` field entirely.
   - `writeInteger(v)` -> `arrayOut.writeInt(v)`
   - `writeLong(v)` -> `arrayOut.writeLong(v)`
   - `writeFloat(v)` -> `arrayOut.writeInt(Float.floatToIntBits(v))`
   - `writeDouble(v)` -> `arrayOut.writeLong(Double.doubleToLongBits(v))`
   - `writeBytes(Binary v)` -> `arrayOut.writeInt(v.length()); v.writeTo(arrayOut);`
   - `getBytes()` no longer needs to flush a buffering layer.
   - `close()` no longer closes the defunct stream.
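
The two steps above can be sketched roughly as follows. This is a simplified stand-in, not the real `CapacityByteArrayOutputStream`: the class name `SlabSketch`, the fixed slab size, and the policy of starting a fresh slab when a value does not fit are all assumptions made to keep the example short. The actual class has its own slab growth strategy.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a slab-backed output; not the real Parquet class.
final class SlabSketch {
    private final List<ByteBuffer> slabs = new ArrayList<>();
    private final int slabSize;
    private ByteBuffer currentSlab;
    private int size; // total bytes written across all slabs

    SlabSketch(int slabSize) {
        if (slabSize < Long.BYTES) {
            throw new IllegalArgumentException("slab must hold at least one long");
        }
        this.slabSize = slabSize;
        addSlab();
    }

    // New slabs are little-endian so putInt/putLong emit PLAIN byte order.
    private void addSlab() {
        currentSlab = ByteBuffer.allocate(slabSize).order(ByteOrder.LITTLE_ENDIAN);
        slabs.add(currentSlab);
    }

    void writeInt(int v) {
        // Single remaining-check; simplification: unused tail bytes are skipped.
        if (currentSlab.remaining() < Integer.BYTES) addSlab();
        currentSlab.putInt(v);
        size += Integer.BYTES;
    }

    void writeLong(long v) {
        if (currentSlab.remaining() < Long.BYTES) addSlab();
        currentSlab.putLong(v);
        size += Long.BYTES;
    }

    // Floating-point values reuse the integer paths via their raw bit patterns.
    void writeFloat(float v) { writeInt(Float.floatToIntBits(v)); }
    void writeDouble(double v) { writeLong(Double.doubleToLongBits(v)); }

    // Concatenates only the written portion of each slab.
    byte[] toByteArray() {
        ByteBuffer out = ByteBuffer.allocate(size);
        for (ByteBuffer slab : slabs) {
            ByteBuffer view = slab.duplicate();
            view.flip(); // limit = bytes written, position = 0
            out.put(view);
        }
        return out.array();
    }
}
```

The point of the sketch is that each write costs one `remaining()` comparison plus one `put*` call, with no intermediate array and no `OutputStream` dispatch.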

Eliminated per `writeInt` call:

- 4 byte-shift operations for little-endian decomposition
- 1 intermediate `writeBuffer[8]` array write
- 2 levels of virtual dispatch
- 1 bounds check in `write(byte[], off, len)`
- 1 `System.arraycopy` for 4 bytes

Replaced with:

- 1 remaining-check on the slab `ByteBuffer`
- 1 `ByteBuffer.putInt()` call (single JVM intrinsic, ~1 store instruction on
  little-endian architectures)

### Benchmark results

`IntEncodingBenchmark.encodePlain` (100,000 INT32 values per invocation, JMH
`-wi 3 -i 5 -f 1`):

| Pattern          | Before (ops/s) | After (ops/s) | Improvement |
|------------------|---------------:|--------------:|------------:|
| SEQUENTIAL       |     26,817,451 |    52,953,193 | **+97.5% (2.0x)** |
| RANDOM           |     28,517,312 |    37,774,036 | **+32.5%** |
| LOW_CARDINALITY  |     28,705,158 |    52,819,678 | **+84.0%** |
| HIGH_CARDINALITY |     28,595,519 |    37,862,571 | **+32.4%** |

The improvement varies by pattern: SEQUENTIAL and LOW_CARDINALITY see 1.8x to 2.0x
because the slab `putInt()` path has highly predictable branching (the slab rarely
runs out during sequential writes). RANDOM and HIGH_CARDINALITY still see a solid
improvement of about +32%.
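
For readers who want to check the Improvement column, it is the relative throughput gain, `(after / before - 1) * 100`. A small sketch of that arithmetic, using the figures from the table (the class name `Improvement` and its `percent` helper are hypothetical):

```java
import java.util.Locale;

// Hypothetical helper for recomputing the Improvement column.
final class Improvement {

    // Relative gain of `after` over `before`, formatted like "+97.5%".
    static String percent(long before, long after) {
        double gain = ((double) after / before - 1.0) * 100.0;
        return String.format(Locale.ROOT, "%+.1f%%", gain);
    }

    public static void main(String[] args) {
        System.out.println("SEQUENTIAL       " + percent(26_817_451L, 52_953_193L));
        System.out.println("RANDOM           " + percent(28_517_312L, 37_774_036L));
        System.out.println("LOW_CARDINALITY  " + percent(28_705_158L, 52_819_678L));
        System.out.println("HIGH_CARDINALITY " + percent(28_595_519L, 37_862_571L));
    }
}
```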

The same code path also benefits `writeLong()`, `writeFloat()`, `writeDouble()`,
and the length prefix written by `writeBytes(Binary)`.

Decode round-trip verified: re-reading the encoded data with `PlainValuesReader`
produces identical values at ~1.15B ops/s.
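
A round-trip check of this kind can be approximated without Parquet at all. The sketch below is an assumption-laden stand-in: it uses plain `ByteBuffer` calls instead of `PlainValuesWriter` and `PlainValuesReader`, and the class and method names (`PlainRoundTrip`, `encodeInts`, and so on) are hypothetical. It covers INT32 values and a length-prefixed binary value.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical round-trip helper; does not use the Parquet reader or writer.
final class PlainRoundTrip {

    // Writes each int as 4 little-endian bytes.
    static byte[] encodeInts(int[] values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Integer.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN);
        for (int v : values) buf.putInt(v);
        return buf.array();
    }

    // Reads the ints back in the same byte order.
    static int[] decodeInts(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        int[] out = new int[bytes.length / Integer.BYTES];
        for (int i = 0; i < out.length; i++) out[i] = buf.getInt();
        return out;
    }

    // Binary value: 4-byte little-endian length prefix, then the payload.
    static byte[] encodeBinary(byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + payload.length)
                .order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(payload.length);
        buf.put(payload);
        return buf.array();
    }

    // Reads the length prefix, then exactly that many payload bytes.
    static byte[] decodeBinary(byte[] encoded) {
        ByteBuffer buf = ByteBuffer.wrap(encoded).order(ByteOrder.LITTLE_ENDIAN);
        byte[] payload = new byte[buf.getInt()];
        buf.get(payload);
        return payload;
    }
}
```

Feeding the output of each `encode*` method into its matching `decode*` method should return the original input, which is the property the round-trip verification relies on.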

### Validation

All 573 `parquet-column` tests and 308 `parquet-common` tests pass with the
change applied.

### Component(s)

Core
