
GH-3493: Optimize PlainValuesReader with direct ByteBuffer reads and batch methods#3560

Open
iemejia wants to merge 2 commits into apache:master from iemejia:perf-plain-bulk-batch

Conversation

@iemejia
Member

@iemejia iemejia commented May 13, 2026

Summary

  • Replace LittleEndianDataInputStream wrapper with direct ByteBuffer reads using LITTLE_ENDIAN byte order in PlainValuesReader, eliminating per-value virtual dispatch overhead (4 in.read() calls + manual bit shifts → single ByteBuffer.get*() JVM intrinsic).
  • Add batch read methods (readIntegers, readFloats, readLongs, readDoubles) that use bulk typed-buffer view reads (e.g. buffer.asIntBuffer().get(dest, offset, count)) to bypass per-value bounds checks and position updates.
  • Page data is obtained as a single contiguous ByteBuffer via ByteBufferInputStream.slice(available), which handles both single-buffer (zero-copy view) and multi-buffer (copy into contiguous buffer) cases transparently.
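As a rough sketch (not the actual PlainValuesReader code), the per-value change amounts to replacing stream-style byte assembly with a single get on a LITTLE_ENDIAN-ordered buffer, which HotSpot can compile to one intrinsic load:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class LittleEndianReadSketch {
  public static void main(String[] args) {
    // Little-endian encoding of 0x12345678.
    byte[] page = {0x78, 0x56, 0x34, 0x12};

    // Before: four byte reads plus manual shifts, as a
    // LittleEndianDataInputStream-style wrapper would do.
    int manual = (page[0] & 0xFF)
        | ((page[1] & 0xFF) << 8)
        | ((page[2] & 0xFF) << 16)
        | ((page[3] & 0xFF) << 24);

    // After: one getInt() on a LITTLE_ENDIAN buffer.
    ByteBuffer buf = ByteBuffer.wrap(page).order(ByteOrder.LITTLE_ENDIAN);
    int direct = buf.getInt();

    System.out.println(manual == direct && direct == 0x12345678); // true
  }
}
```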

Benchmark Results

Per-value read optimization (100k INT32 values, JMH):

  Pattern           Before (ops/s)   After (ops/s)   Speedup
  SEQUENTIAL        427,630,411   5,397,298,681     12.6x
  RANDOM            431,052,072   5,437,926,758     12.6x
  LOW_CARDINALITY   423,443,685   5,477,810,011     12.9x
  HIGH_CARDINALITY  426,405,891   5,485,493,740     12.9x

Batch read methods (PlainDecodingBenchmark, 100K values, pre-allocated arrays):

  Type    Per-value (ops/s)  Batch (ops/s)  Speedup
  INT32        5,454M          28,256M       +418%
  FLOAT        5,407M          25,798M       +377%
  INT64        5,408M           8,088M        +50%
  DOUBLE       7,404M           7,965M         +8%

All 573 parquet-column tests pass.

iemejia added 2 commits May 13, 2026 14:11
Replace the LittleEndianDataInputStream wrapper with direct ByteBuffer
access using LITTLE_ENDIAN byte order in PlainValuesReader. Each
read{Integer,Long,Float,Double}() previously dispatched through 4
in.read() calls per value and assembled the result with manual bit
shifts; it now compiles to a single ByteBuffer get*() JVM intrinsic.

In initFromPage, the page data is obtained as a single contiguous
ByteBuffer via ByteBufferInputStream.slice(available). The
ByteBufferInputStream.slice() method handles both single-buffer
(zero-copy view) and multi-buffer (copy into contiguous buffer) cases
transparently. In practice page data is almost always a single
contiguous buffer.
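The single- vs multi-buffer behavior described above can be sketched with plain NIO. This is a simplification, not the actual ByteBufferInputStream.slice implementation, and the helper name `contiguous` is hypothetical:

```java
import java.nio.ByteBuffer;
import java.util.List;

public class SliceSketch {
  // Hypothetical helper: return a contiguous buffer over the page bytes.
  // One backing buffer -> zero-copy slice; several -> copy into one buffer.
  static ByteBuffer contiguous(List<ByteBuffer> buffers, int length) {
    if (buffers.size() == 1) {
      ByteBuffer only = buffers.get(0).duplicate(); // zero-copy view
      only.limit(only.position() + length);
      return only.slice();
    }
    ByteBuffer copy = ByteBuffer.allocate(length);
    for (ByteBuffer b : buffers) {
      copy.put(b.duplicate());
    }
    copy.flip();
    return copy;
  }

  public static void main(String[] args) {
    ByteBuffer a = ByteBuffer.wrap(new byte[] {1, 2});
    ByteBuffer b = ByteBuffer.wrap(new byte[] {3, 4});
    ByteBuffer merged = contiguous(List.of(a, b), 4);
    System.out.println(merged.remaining()); // 4
  }
}
```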

Benchmark (IntEncodingBenchmark.decodePlain, 100k INT32 values per
invocation, JMH -wi 3 -i 5 -f 1):

  Pattern           Before (ops/s)   After (ops/s)   Speedup
  SEQUENTIAL        427,630,411   5,397,298,681     12.6x
  RANDOM            431,052,072   5,437,926,758     12.6x
  LOW_CARDINALITY   423,443,685   5,477,810,011     12.9x
  HIGH_CARDINALITY  426,405,891   5,485,493,740     12.9x

The improvement is consistent regardless of data distribution because
the bottleneck was entirely in the dispatch overhead. All four numeric
plain reader types (int, long, float, double) benefit equally.

All 573 parquet-column tests pass.
… reads

Add readIntegers/readFloats/readLongs/readDoubles batch methods to all
PlainValuesReader inner classes. All four types use bulk typed-buffer
view reads (e.g. buffer.asIntBuffer().get(dest, offset, count)) which
bypass per-value bounds checks and position updates.
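A minimal sketch of that bulk pattern, assuming a page buffer already positioned at the start of the values (standalone example, not the PlainValuesReader code itself):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class BatchReadSketch {
  public static void main(String[] args) {
    // Hypothetical page holding four little-endian INT32 values.
    ByteBuffer page = ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN);
    for (int i = 1; i <= 4; i++) {
      page.putInt(i * 10);
    }
    page.flip();

    int[] dest = new int[4];
    // Bulk copy through a typed view: one bounds check for the whole
    // transfer instead of one per value.
    page.asIntBuffer().get(dest, 0, dest.length);
    // The IntBuffer view has its own position, so advance the byte
    // buffer manually if later reads continue from the same buffer.
    page.position(page.position() + dest.length * Integer.BYTES);

    System.out.println(Arrays.toString(dest)); // [10, 20, 30, 40]
  }
}
```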

Benchmark results (PlainDecodingBenchmark, 100K values, pre-allocated arrays):

  Type    Per-value (ops/s)  Batch (ops/s)  Speedup
  INT32        5,454M          28,256M       +418%
  FLOAT        5,407M          25,798M       +377%
  INT64        5,408M           8,088M        +50%
  DOUBLE       7,404M           7,965M         +8%