
GH-3511: Add JMH encoding benchmarks and fix parquet-benchmarks shaded jar#3512

Open
iemejia wants to merge 27 commits into apache:master from iemejia:perf-benchmarks

Conversation

@iemejia
Member

@iemejia iemejia commented Apr 19, 2026

Summary

Resolves #3511.

The parquet-benchmarks shaded jar built from current master is non-functional — it fails at runtime with RuntimeException: Unable to find the resource: /META-INF/BenchmarkList. This PR fixes that and adds 11 JMH benchmarks covering the encode/decode paths exercised by the open performance PRs, so reviewers can reproduce the reported numbers.

What's broken on master

parquet-benchmarks/pom.xml is missing two pieces of configuration:

  • maven-compiler-plugin lacks the annotationProcessorPaths / annotationProcessors config for jmh-generator-annprocess, so the JMH annotation processor never runs and META-INF/BenchmarkList / META-INF/CompilerHints are never generated.
  • maven-shade-plugin lacks AppendingTransformer entries for those two resources, so even if they were generated, they would be dropped during shading.

Both problems are fixed in this PR.
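For reference, the two pom.xml additions look roughly like this (a sketch, not the exact diff; ${jmh.version} is assumed to be defined in the project's properties, and the transformer entries go inside the shade plugin's existing <transformers> element):

```xml
<!-- maven-compiler-plugin: run the JMH annotation processor so that
     META-INF/BenchmarkList and META-INF/CompilerHints are generated -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-compiler-plugin</artifactId>
  <configuration>
    <annotationProcessorPaths>
      <path>
        <groupId>org.openjdk.jmh</groupId>
        <artifactId>jmh-generator-annprocess</artifactId>
        <version>${jmh.version}</version>
      </path>
    </annotationProcessorPaths>
  </configuration>
</plugin>

<!-- maven-shade-plugin: keep (and merge) the generated JMH resources -->
<transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
  <resource>META-INF/BenchmarkList</resource>
</transformer>
<transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
  <resource>META-INF/CompilerHints</resource>
</transformer>
```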

Benchmarks added

11 new files in parquet-benchmarks/src/main/java/org/apache/parquet/benchmarks/:

Benchmark                               Coverage
IntEncodingBenchmark                    int encode/decode: PLAIN, DELTA_BINARY_PACKED, BYTE_STREAM_SPLIT, RLE, DICTIONARY
BinaryEncodingBenchmark                 Binary write/read paths, parameterized on length and cardinality
ByteStreamSplitEncodingBenchmark /
ByteStreamSplitDecodingBenchmark        BSS for float / double / int / long
FixedLenByteArrayEncodingBenchmark      FLBA encode/decode
FileReadBenchmark / FileWriteBenchmark  CPU-focused file-level benchmarks
RowGroupFlushBenchmark                  Flush path
ConcurrentReadWriteBenchmark            Multi-threaded read/write throughput
BlackHoleOutputFile                     OutputFile that discards bytes — isolates CPU from I/O
TestDataFactory                         Shared data-generation utilities
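The BlackHoleOutputFile idea can be sketched with a plain java.io analogue (DiscardingStream is a hypothetical name, not the parquet-java class): writes advance a position counter but the bytes themselves are thrown away, so a write benchmark measures encoding CPU rather than filesystem I/O.

```java
import java.io.OutputStream;

// Hypothetical stdlib analogue of BlackHoleOutputFile's stream: every write
// is counted but the payload is discarded.
public class DiscardingStream extends OutputStream {
    private long bytesWritten;

    @Override
    public void write(int b) {
        bytesWritten++; // discard the byte, keep only the position
    }

    @Override
    public void write(byte[] buf, int off, int len) {
        bytesWritten += len; // bulk writes advance the position in one step
    }

    public long getPos() {
        return bytesWritten; // writers query the position to lay out pages
    }

    public static void main(String[] args) {
        DiscardingStream out = new DiscardingStream();
        out.write(new byte[1024], 0, 1024);
        out.write(7);
        System.out.println(out.getPos()); // 1025
    }
}
```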

Validation

After this PR, the shaded jar is runnable and registers 87 benchmarks:

$ ./mvnw clean package -pl parquet-benchmarks -DskipTests \
    -Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true
$ java -jar parquet-benchmarks/target/parquet-benchmarks.jar -l | wc -l
87

Sanity check — IntEncodingBenchmark.decodePlain reproduces the master baseline cited in #3493/#3494 (~91M ops/s on JDK 21, JMH 1.37, 3 warmup + 5 measurement iterations):

Benchmark                            (dataPattern)   Mode  Cnt         Score         Error  Units
IntEncodingBenchmark.decodePlain        SEQUENTIAL  thrpt    5  93528419.575 ± 1472148.214  ops/s
IntEncodingBenchmark.decodePlain            RANDOM  thrpt    5  90908523.483 ± 1978982.394  ops/s
IntEncodingBenchmark.decodePlain   LOW_CARDINALITY  thrpt    5  92672978.255 ± 2071927.851  ops/s
IntEncodingBenchmark.decodePlain  HIGH_CARDINALITY  thrpt    5  90770177.655 ± 2427904.955  ops/s

Out of scope (deferred)

Modernization of the existing ReadBenchmarks / WriteBenchmarks / NestedNullWritingBenchmarks (Hadoop-free LocalInputFile, parameterization, JMH-idiomatic state setup) is a separate concern and will be proposed in a follow-up PR.

Follow-up

Once this lands, each open perf PR (#3494, #3496, #3500, #3504, #3506, #3510) will be updated with a one-line "How to reproduce" snippet referencing the relevant *Benchmark class.

iemejia added a commit to iemejia/parquet-java that referenced this pull request Apr 20, 2026
Replace fastutil's *2IntLinkedOpenHashMap with the plain *2IntOpenHashMap
plus a separate primitive-typed list to track insertion order in the five
dictionary writers (binary, long, double, float, int).

The Linked variant was used because the dictionary page must be emitted
in insertion order, but it pays an avoidable cost on every put: two extra
long fields per slot (prev, next), 3-4 scattered writes per insert to fix
up the doubly-linked list, and re-stitching on rehash. None of this is
vectorizable. With the plain map plus an append-only list, the hash map
is a pure id lookup with the smallest possible slot, and the list is
contiguous and cache-friendly to iterate at flush time.

Both candidates are fastutil primitive-keyed maps, so this is not a
boxing change. The win is structural: an ordering guarantee that was
being paid for on every insert is replaced with an explicit append-only
list that provides it more cheaply.
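The structural change can be illustrated with stdlib collections standing in for fastutil's primitive-keyed maps (OrderedDictionary is a hypothetical sketch, not the parquet-java writer): the hash map resolves value-to-id only, while an append-only list records first-insertion order for the dictionary page.

```java
import java.util.ArrayList;
import java.util.HashMap;

// Sketch of "plain map + append-only order list" replacing a linked map.
public class OrderedDictionary {
    private final HashMap<Long, Integer> ids = new HashMap<>();
    private final ArrayList<Long> insertionOrder = new ArrayList<>();

    public int idFor(long value) {
        Integer id = ids.get(value);
        if (id == null) {
            id = ids.size();           // next dense dictionary id
            ids.put(value, id);
            insertionOrder.add(value); // remember page-emission order
        }
        return id;
    }

    // At flush time the dictionary page iterates the contiguous list,
    // not the hash map, so emission order matches insertion order.
    public ArrayList<Long> pageValues() {
        return insertionOrder;
    }

    public static void main(String[] args) {
        OrderedDictionary dict = new OrderedDictionary();
        System.out.println(dict.idFor(42));    // 0
        System.out.println(dict.idFor(7));     // 1
        System.out.println(dict.idFor(42));    // 0 (already present)
        System.out.println(dict.pageValues()); // [42, 7]
    }
}
```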

Benchmark results (BinaryEncodingBenchmark.encodeDictionary,
IntEncodingBenchmark.encodeDictionary - added in apache#3512):

  - encodeDictionary (binary, high cardinality, short strings): +23-42%
  - encodeDictionary (int, high cardinality):                   ~+2x
  - low-cardinality cases: flat (linked-list overhead doesn't matter
    when there are few inserts)

No public API change. No file format change. Behavior is identical:
dictionary pages emit values in the same order.

Validation: parquet-column 573 tests pass. Built with
-Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true.
Contributor

@steveloughran steveloughran left a comment


commented. I'm only just learning effective JMH myself; these all LGTM. The one request is that temp files are put under target/ so that a ./run.sh all command puts everything in the existing temporary directory tree.

}

@Benchmark
@OperationsPerInvocation(VALUE_COUNT)
Contributor


Didn't know of this trick until I saw your PR; adopted in #3452 earlier today.

* A no-op {@link OutputFile} that discards all written data.
* Useful for isolating CPU/encoding cost from filesystem I/O in write benchmarks.
*/
public final class BlackHoleOutputFile implements OutputFile {
Contributor


Does this act as a black hole for the benchmarks? Or would passing the JMH Blackhole into the constructor and using it on L62 and L67 be better?


@Setup(Level.Trial)
public void setup() throws IOException {
tempFile = File.createTempFile("parquet-read-bench-", ".parquet");
Contributor


There's a constant BenchmarkFiles.TARGET_DIR which defines the dir for benchmarks; it puts them under target/ so Maven will clean them up. I used that in my PR so killing a test run in my IDE didn't leave cruft around... I'd recommend the same.

iemejia added a commit to iemejia/parquet-java that referenced this pull request May 1, 2026
…luesWriter

Two related changes in the DELTA_BYTE_ARRAY write path:

1. DeltaLengthByteArrayValuesWriter: drop the unused LittleEndianDataOutputStream
   wrapper. Binary.writeTo(arrayOut) works directly with the underlying
   CapacityByteArrayOutputStream; the LE wrapper added an extra layer of
   dispatch on every value but never used any LE functionality
   (writeInt/writeLong/etc.). Add a new writeBytes(byte[], int, int) overload
   so callers that already have the raw bytes can avoid allocating a Binary
   wrapper.

2. DeltaByteArrayWriter: tighten suffixWriter field type to
   DeltaLengthByteArrayValuesWriter (it's always constructed as one) so the
   new writeBytes(byte[], int, int) overload is callable. Replace the suffix
   call with the raw-bytes overload, eliminating the per-value Binary.slice()
   allocation.
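The overload pattern in point 1 can be sketched with stdlib streams (LengthPrefixedWriter and its toy length prefix are illustrative, not the parquet-java API): when the caller already holds raw bytes, a (byte[], off, len) overload writes the sub-range directly instead of first allocating a wrapper object around it.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Sketch: a raw-bytes overload that avoids a per-value wrapper allocation.
public class LengthPrefixedWriter {
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();

    // Wrapper-based path: the caller must materialize a full byte[] first.
    public void writeValue(byte[] wrapped) {
        writeBytes(wrapped, 0, wrapped.length);
    }

    // New overload: writes a sub-range of an existing buffer with no copy.
    public void writeBytes(byte[] buf, int off, int len) {
        out.write(len);           // toy one-byte length prefix (len < 256)
        out.write(buf, off, len);
    }

    public int size() {
        return out.size();
    }

    public static void main(String[] args) {
        byte[] page = "prefixsuffix".getBytes(StandardCharsets.US_ASCII);
        LengthPrefixedWriter w = new LengthPrefixedWriter();
        w.writeBytes(page, 6, 6);     // write "suffix" without slicing a copy
        System.out.println(w.size()); // 7 (1 length byte + 6 payload bytes)
    }
}
```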

Benchmark results (BinaryEncodingBenchmark.encodeDeltaByteArray and
encodeDeltaLengthByteArray, added in apache#3512):

  - encodeDeltaByteArray (LOW cardinality, len=10):  +33% to +55%
  - encodeDeltaLengthByteArray (LOW card, len=10):   +18% to +21%
  - long-string cases: flat (per-value alloc amortized away)

No public API change. No file format change.

Validation: parquet-column 573 tests pass. Built with
-Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true.
@iemejia iemejia force-pushed the perf-benchmarks branch from 404ed02 to 19f343e on May 11, 2026 at 22:14
public void readIntegers(int[] dest, int offset, int count) {
try {
// Batch-decode dictionary IDs, then batch-lookup
int[] ids = new int[count];
Contributor


You could have a lambda expression to catch and translate all of these, as per org.apache.hadoop.util.functional.FunctionalIO.
If you are happy with UncheckedIOException, you can use that as is, and submit a PR saying "this shouldn't be private", which it shouldn't be.

iemejia added 7 commits May 13, 2026 21:31
… shaded jar

The parquet-benchmarks pom is missing the JMH annotation-processor
configuration and the AppendingTransformer entries for BenchmarkList /
CompilerHints. As a result, the shaded jar built from master fails at
runtime with "Unable to find the resource: /META-INF/BenchmarkList".

This commit:

- Fixes parquet-benchmarks/pom.xml so the shaded jar is runnable: adds
  jmh-generator-annprocess to maven-compiler-plugin's annotation
  processor paths, and adds AppendingTransformer entries for
  META-INF/BenchmarkList and META-INF/CompilerHints to the shade plugin.

- Adds 11 JMH benchmarks covering the encode/decode paths used by the
  pending performance optimization PRs (apache#3494, apache#3496, apache#3500, apache#3504,
  apache#3506, apache#3510), so reviewers can reproduce the reported numbers and
  detect regressions:

    IntEncodingBenchmark, BinaryEncodingBenchmark,
    ByteStreamSplitEncodingBenchmark, ByteStreamSplitDecodingBenchmark,
    FixedLenByteArrayEncodingBenchmark, FileReadBenchmark,
    FileWriteBenchmark, RowGroupFlushBenchmark,
    ConcurrentReadWriteBenchmark, BlackHoleOutputFile, TestDataFactory.

After this change the shaded jar registers 87 benchmarks (previously a
working build registered 0, and a default build produced a jar that was
not runnable at all).
Pre-generate deterministic rows for the file and concurrent benchmarks so row construction does not skew the timed section, and make the encoding benchmarks include real dictionary-page and dictionary-decode work instead of only value buffers. Split synthetic RLE dictionary-index decoding into its own benchmark and encode generated binary payloads as UTF-8 explicitly so benchmark inputs stay consistent across runs and platforms.
Make the dictionary encode/decode benchmarks symmetric by routing both
sides through a shared EncodedDictionary helper, guard against the
dictionary writer falling back to plain encoding (which previously NPE'd
in BinaryEncodingBenchmark setup for high-cardinality long strings),
and drop redundant close() calls after toDictPageAndClose().

Share the pre-generated row array across threads in
ConcurrentReadWriteBenchmark via Scope.Benchmark, eliminating 4x heap
duplication and a now-unnecessary ThreadData inner class.

Centralize the RNG seed as TestDataFactory.DEFAULT_SEED and add
seed-overload variants for the int and binary generators so generators
in the same setup no longer share a Random and silently depend on call
order. Wrap the RLE encoder in try-with-resources and validate that
LOW_CARDINALITY_DISTINCT fits within the configured bit width.
Benchmarks raw compress/decompress throughput for each supported codec
(SNAPPY, ZSTD, LZ4_RAW, GZIP) at page sizes 8KB, 64KB, and 256KB using
the heap-based CodecFactory path. Input data mixes sequential, repeated,
low-range random, and full random patterns for realistic compression ratios.
- Add RLE encodeDictionaryIds benchmark to cover par9 encoder
  pack32Values fast path (previously only decode was benchmarked)
- Trim CompressionBenchmark page sizes to boundary conditions
  (64K, 1MB) to cut redundant mid-points
- Increase FileRead/FileWriteBenchmark SS iterations (warmup 3->5,
  measurement 5->10) for better statistical stability
- Increase RowGroupFlushBenchmark iterations (warmup 2->3,
  measurement 3->5) for improved confidence with 2 param combos
iemejia added 20 commits May 13, 2026 21:31
Restore decodeDictionaryIdsBatch and decodeValuesReaderBatch from the
original cherry-pick, now that the par13 batch read APIs are available.
Add decodeValuesReader for production-path (ValuesReader wrapper) coverage.

Five RLE benchmark methods now cover par9 and par13:
- encodeDictionaryIds: RLE encoder pack32Values fast path
- decodeDictionaryIds: per-value RLE decoder
- decodeDictionaryIdsBatch: batch RLE decoder via readInts()
- decodeValuesReader: per-value via ValuesReader wrapper
- decodeValuesReaderBatch: batch via ValuesReader.readIntegers()
…/DOUBLE)

Covers per-value and batch decode paths for PlainValuesReader across
all four numeric primitive types. Uses pre-allocated destination arrays
to avoid per-invocation allocation noise in batch measurements.
Add decodeFloatBatch, decodeDoubleBatch, decodeIntBatch, decodeLongBatch
benchmarks with pre-allocated destination arrays to measure readXxx(dest,
offset, count) throughput for all four BSS primitive types.
BooleanEncodingBenchmark exercises both encoding paths across six data
patterns: ALL_TRUE, ALL_FALSE, ALTERNATING, RANDOM, MOSTLY_TRUE_99,
MOSTLY_FALSE_99.

Key findings (100K values):
  Encode: V1 PLAIN is data-independent (~880M ops/s). V2 RLE ranges
  from 2,344M (ALL_FALSE, +166%) to 192M (RANDOM, -78%).
  Decode: V2 RLE always >= V1 PLAIN — from +154% (ALL_FALSE) to +7%
  (ALTERNATING). The RLE decode penalty for random data is negligible.

The severe RLE encode penalty for random data (4.6x slower than PLAIN)
suggests the V1/V2 split is well-justified: V2 RLE is ideal for the
common case of skewed boolean columns, while V1 PLAIN is safer for
high-entropy data.
…2 RLE)

Adds decodePlainV1Batch and decodeRleV2Batch benchmark methods that
exercise the new readBooleans() batch API. Uses a pre-allocated boolean[]
destination array to isolate decode throughput from allocation overhead.
…ypes

Covers encode + decode (scalar and batch) paths for all four type-specific
dictionary implementations: Long2IntOpenHashMap, Float2IntOpenHashMap,
Double2IntOpenHashMap, and Object2IntOpenHashMap (for FLBA). Two data
patterns exercise low-cardinality (100 distinct values, ~100% hit rate)
and high-cardinality (all unique, stresses hash map growth).

Also adds TestDataFactory generators for long[], float[], double[], and
fixed-length Binary[] data with configurable cardinality.

Characterization results (100K values, JDK 25, Compiler Blackholes):
- Batch decode shows +60-67% over scalar for LONG/FLOAT/DOUBLE
- HIGH_CARDINALITY encode is 6-7x slower than LOW (hash map pressure)
- FLBA encode is 14-108M ops/s (Binary hashing overhead)
…ecode

Rewrites FixedLenByteArrayEncodingBenchmark from a single encodePlain()
method to full coverage of all four FLBA-supported encodings (PLAIN,
DELTA_BYTE_ARRAY, BYTE_STREAM_SPLIT, DICTIONARY) with both encode and
decode benchmarks. Adds parameterized fixedLength (2=FLOAT16, 12=INT96,
16=UUID) and dataPattern (RANDOM, LOW_CARDINALITY) axes.

Characterization results (100K values, JDK 25, fixedLength=16):
- Dictionary decode: 543M ops/s (fastest, avoids 16B copy per value)
- PLAIN decode: 184M ops/s (slice + Binary wrapping)
- BSS/Delta decode: ~87M ops/s (byte scatter/prefix overhead)
- BSS excels at fixedLength=2: 368M ops/s (trivial 2-stream transpose)
LZ4_RAW was optimized (+47-77% decompress throughput) and has a
micro-benchmark in CompressionBenchmark, but was missing from the
end-to-end file read/write benchmarks. Adding it enables direct
comparison with SNAPPY, ZSTD, and GZIP at the full-file level.
Add encodePlainV1Batch and encodeRleV2Batch benchmarks that exercise
the new writeBooleans() batch encoding path, complementing the existing
scalar encode benchmarks.
…LOAT/DOUBLE)

New PlainEncodingBenchmark class with scalar vs batch comparison for all four
numeric types. Also adds encodePlainBatch to IntEncodingBenchmark for consistency.
…dePlainBatch)

Exercises the new readBinaries()/writeBinaries() batch APIs for FIXED_LEN_BYTE_ARRAY
PLAIN encoding. Results: decode batch +165-245%, encode batch +19-81%.
- Add brotli-codec dependency to parquet-benchmarks (profile-gated, x86_64 only)
- Include BROTLI in @Param codec list alongside SNAPPY, ZSTD, LZ4_RAW, GZIP
- Add jitpack.io repository for brotli-codec resolution
Bypass the Hadoop BrotliCodec/stream wrapper for BROTLI compression and
decompression by using org.meteogroup.jbrotli's native JNI bindings directly
with ByteBuffer support via reflection (brotli-codec remains runtime scope).
This eliminates intermediate buffer copies and the BrotliStreamCompressor
state machine overhead.

Changes:
- DirectCodecFactory: Add BrotliDirectCompressor (quality=1, matching Hadoop
  default) and BrotliDirectDecompressor using one-shot jbrotli API via reflection
- Load native library eagerly with graceful fallback to Hadoop codec path
- CompressionBenchmark: Switch from heap CodecFactory to DirectCodecFactory
  to benchmark the actual production code path

Results at 64KB page size:
- Compress: 6,746 -> 9,662 ops/s (1.43x speedup)
- Decompress: 2,534 -> 2,786 ops/s (1.10x speedup)
Replace per-value getXxx(offset) loops with position()+asXxxBuffer().get()
bulk copy in readFloats/readDoubles/readIntegers/readLongs. The decoded
data buffer is a contiguous heap byte[] in LE order, making view buffer
bulk reads a single memcpy via Unsafe.copyMemory.

Benchmark results (100K values, BSS FLOAT batch):
  Before: ~1,228M ops/s
  After:  ~1,442M ops/s (+17%)

INT32/INT64/DOUBLE show negligible change because BSS invocation cost is
dominated by page transposition in initFromPage, not the read loop.
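The bulk-read pattern described above can be shown in a self-contained sketch (BulkFloatRead is a hypothetical name; the real change lives in the BSS reader): a contiguous little-endian byte[] is decoded into a float[] through an asFloatBuffer() view, which the JIT compiles into a single bulk copy instead of a per-value loop.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch of view-buffer bulk decoding of little-endian float data.
public class BulkFloatRead {
    public static void readFloats(byte[] page, int byteOffset,
                                  float[] dest, int off, int count) {
        ByteBuffer buf = ByteBuffer.wrap(page).order(ByteOrder.LITTLE_ENDIAN);
        buf.position(byteOffset);
        buf.asFloatBuffer().get(dest, off, count); // one bulk transfer
    }

    public static void main(String[] args) {
        ByteBuffer src = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN);
        src.putFloat(1.5f).putFloat(-2.0f).putFloat(3.25f);
        float[] dest = new float[3];
        readFloats(src.array(), 0, dest, 0, 3);
        System.out.println(dest[0] + " " + dest[1] + " " + dest[2]);
        // 1.5 -2.0 3.25
    }
}
```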
…ns()

Replace ByteBitPackingValuesReader delegation in BooleanPlainValuesReader
with direct bit extraction from the page byte[]. The scalar path uses a
single array access + shift + mask instead of the 8-element int[] buffer
and packer dispatch. The batch path (readBooleans) unrolls 8 booleans per
byte with constant masks.

For RLE (V2), add a native readBooleans() method that uses Arrays.fill
for RLE runs (constant-time for uniform data) and direct int-to-boolean
conversion for packed groups, avoiding the intermediate int[] allocation
of the readInts() path.

Benchmark results (1M values, JDK 25, Compiler Blackholes):
- V1 PLAIN scalar: ~620M -> ~1,528-1,618M ops/s (+150%)
- V1 PLAIN batch:  ALL_TRUE/FALSE ~5,000M (+680%), RANDOM 2,757M (+337%)
- V2 RLE batch:    ALL_TRUE/FALSE ~190B (fill), RANDOM 1,335M (+93%)
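The Arrays.fill fast path for RLE runs can be reduced to a toy sketch (RleBooleanRuns is illustrative, not the parquet-java reader): a run of a single boolean value decodes as one fill over the destination range, making uniform data constant-work per run rather than per value.

```java
import java.util.Arrays;

// Sketch: decoding an RLE boolean run is a single Arrays.fill.
public class RleBooleanRuns {
    public static void decodeRun(boolean[] dest, int off,
                                 int runLength, boolean value) {
        Arrays.fill(dest, off, off + runLength, value);
    }

    public static void main(String[] args) {
        boolean[] dest = new boolean[8];
        decodeRun(dest, 0, 5, true);   // run of five trues
        decodeRun(dest, 5, 3, false);  // run of three falses
        System.out.println(Arrays.toString(dest));
        // [true, true, true, true, true, false, false, false]
    }
}
```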
Replace the per-bit unrolled extraction loop with a static boolean[256][8]
lookup table + System.arraycopy. Each byte maps to its 8 pre-decoded
booleans, and the 8-byte copy is emitted by HotSpot as a single 64-bit
load/store pair — the boolean equivalent of asIntBuffer().get() for ints.

For RLE PACKED groups (bitWidth=1), bypass the int[] intermediate and
read directly from the raw packed bytes via the same lookup table.

This makes batch decode throughput independent of data pattern:
- V1 PLAIN batch RANDOM: 2,757M -> 5,047M ops/s (+83%)
- V2 RLE batch RANDOM: 1,335M -> 1,618M ops/s (+21%)
- V2 RLE batch MOSTLY_TRUE_99: 3,205M -> 3,745M ops/s (+17%)
- Uniform patterns (ALL_TRUE/FALSE): unchanged (still Arrays.fill)
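The lookup-table trick can be sketched in a few lines (BooleanLut is a hypothetical name, and LSB-first bit order is assumed, matching Parquet's bit-packing convention): each possible byte maps to its 8 pre-decoded booleans, so decoding one byte becomes a table index plus an 8-element System.arraycopy.

```java
import java.util.Arrays;

// Sketch: boolean[256][8] lookup table for byte-at-a-time boolean decode.
public class BooleanLut {
    private static final boolean[][] LUT = new boolean[256][8];
    static {
        for (int b = 0; b < 256; b++) {
            for (int bit = 0; bit < 8; bit++) {
                LUT[b][bit] = ((b >>> bit) & 1) != 0; // LSB-first
            }
        }
    }

    public static void decode(byte[] packed, boolean[] dest) {
        for (int i = 0; i < packed.length; i++) {
            // One table lookup + one 8-element copy per byte.
            System.arraycopy(LUT[packed[i] & 0xFF], 0, dest, i * 8, 8);
        }
    }

    public static void main(String[] args) {
        boolean[] dest = new boolean[8];
        decode(new byte[] {0b0000_0101}, dest); // bits 0 and 2 set
        System.out.println(Arrays.toString(dest));
        // [true, false, true, false, false, false, false, false]
    }
}
```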
…king

Refactor BooleanPlainValuesWriter to pack bits directly into bytes
instead of delegating through ByteBitPackingValuesWriter and the generic
int[8]-based ByteBasedBitPackingEncoder. Add batch writeBooleans() API
to ValuesWriter with optimized overrides:

- PLAIN: processes 8 booleans at a time into single bytes with OR/shift,
  eliminating the per-value method call chain and int[] intermediate.
- RLE: pre-scans for runs >= 8 to emit RLE directly, fills partial
  bit-packed groups from run boundaries to avoid spurious padding.

PLAIN scalar improves +69% (890M -> 1,500M ops/s) from the refactoring.
PLAIN batch: +184% over old scalar (2,528M for RANDOM).
RLE batch: +278% for ALL_FALSE, +95% for MOSTLY_*, +36% for ALTERNATING.
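The direct PLAIN packing path boils down to OR-ing eight booleans into one byte (BooleanPacker is a hypothetical sketch with LSB-first order assumed, not the parquet-java writer), with no intermediate int[] buffer or per-value call chain.

```java
// Sketch: pack 8 booleans into a single byte, LSB first.
public class BooleanPacker {
    public static byte pack8(boolean[] values, int off) {
        int b = 0;
        for (int bit = 0; bit < 8; bit++) {
            if (values[off + bit]) {
                b |= 1 << bit; // set the bit for each true value
            }
        }
        return (byte) b;
    }

    public static void main(String[] args) {
        boolean[] v = {true, false, true, false, false, false, false, true};
        System.out.println(pack8(v, 0) & 0xFF); // 133 (0b10000101)
    }
}
```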
…riteDoubles with bulk ByteBuffer view transfers

Add bulk write methods to CapacityByteArrayOutputStream (writeInts, writeLongs,
writeFloats, writeDoubles) that use IntBuffer/LongBuffer/FloatBuffer/DoubleBuffer
view puts to transfer entire arrays in one operation, amortizing capacity checks
across the batch. Add corresponding batch APIs to ValuesWriter (with scalar
default) and optimized overrides in PlainValuesWriter.

Performance improvement (100K values, JDK 25):
  INT32:  566M -> 2,809M ops/s (+396%)
  FLOAT:  540M -> 2,818M ops/s (+422%)
  INT64:  479M -> 1,306M ops/s (+173%)
  DOUBLE: 442M -> 1,275M ops/s (+189%)
- ValuesReader.readBinaries() / ValuesWriter.writeBinaries() default impls
- FixedLenByteArrayPlainValuesReader: bulk slice() with fixed-offset Binary views
- FixedLenByteArrayPlainValuesWriter: chunked bulk write() amortizing stream overhead
- ByteStreamSplitValuesReader: optimized array-based decode with unrolled loops
  for element sizes 2, 4, 8, 12, 16
- ByteStreamSplitValuesReaderForFLBA: batch readBinaries() with single advanceByteOffset
- FixedLenByteArrayEncodingBenchmark: full FLBA benchmark suite with batch variants
- Add TestDataFactory and BenchmarkEncodingUtils helper classes
- Fix JMH annotation processor config in pom.xml for Maven Compiler 3.14+
… writes

Replace per-value scatterBytes() in FixedLenByteArrayByteStreamSplitValuesWriter
with a BATCH_SIZE=64 buffered scatter pattern:
- Accumulate byte values into per-stream batch buffers
- Flush as bulk write(byte[], 0, count) to each stream
- Eliminates N*elementSize individual stream.write(byte) calls per batch
- Adds writeBinaries() batch override for FLBA BSS writer

Performance improvement: FLBA size=2 +85%, size=16 +160% (vs per-byte scatter).
@iemejia iemejia force-pushed the perf-benchmarks branch from b58fc2a to 165bf49 on May 13, 2026 at 19:31
Development

Successfully merging this pull request may close these issues.

Add JMH benchmarks for encoding/decoding paths and fix parquet-benchmarks shaded jar