GH-3511: Add JMH encoding benchmarks and fix parquet-benchmarks shaded jar#3512
iemejia wants to merge 27 commits into
Conversation
Replace fastutil's *2IntLinkedOpenHashMap with the plain *2IntOpenHashMap plus a separate primitive-typed list to track insertion order in the five dictionary writers (binary, long, double, float, int).

The Linked variant was used because the dictionary page must be emitted in insertion order, but it pays an avoidable cost on every put: two extra long fields per slot (prev, next), 3-4 scattered writes per insert to fix up the doubly-linked list, and re-stitching on rehash. None of this is vectorizable. With the plain map plus an append-only list, the hash map is a pure id lookup with the smallest possible slot, and the list is contiguous and cache-friendly to iterate at flush time. Both candidates are fastutil primitive-keyed maps, so this is not a boxing change. The win is structural: an ordering guarantee that was being paid for on every insert is replaced with an explicit append-only list that provides it more cheaply.

Benchmark results (BinaryEncodingBenchmark.encodeDictionary, IntEncodingBenchmark.encodeDictionary - added in apache#3512):
- encodeDictionary (binary, high cardinality, short strings): +23-42%
- encodeDictionary (int, high cardinality): ~+2x
- low-cardinality cases: flat (linked-list overhead doesn't matter when there are few inserts)

No public API change. No file format change. Behavior is identical: dictionary pages emit values in the same order.

Validation: parquet-column 573 tests pass. Built with -Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true.
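For illustration, a minimal sketch of the map-plus-list pattern described above, shown for the binary writer case; the class and field names (ids, order, idFor) are illustrative, not the writer's actual members:

```java
import it.unimi.dsi.fastutil.objects.Object2IntOpenHashMap;
import it.unimi.dsi.fastutil.objects.ObjectArrayList;
import org.apache.parquet.io.api.Binary;

class InsertionOrderedDictionarySketch {
  // Plain hash map: value -> dictionary id. No per-slot prev/next links to maintain.
  private final Object2IntOpenHashMap<Binary> ids = new Object2IntOpenHashMap<>();
  // Append-only list that records insertion order for the dictionary page.
  private final ObjectArrayList<Binary> order = new ObjectArrayList<>();

  InsertionOrderedDictionarySketch() {
    ids.defaultReturnValue(-1);
  }

  int idFor(Binary value) {
    int id = ids.getInt(value);
    if (id == -1) {
      id = ids.size();
      Binary copy = value.copy(); // copy so the dictionary does not alias reused buffers
      ids.put(copy, id);
      order.add(copy);
    }
    return id;
  }

  // At flush time, the dictionary page is written by iterating `order`,
  // which is contiguous and already in insertion order.
  Iterable<Binary> dictionaryPageValues() {
    return order;
  }
}
```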
steveloughran
left a comment
I'm only just learning effective JMH myself; these all LGTM. The one request is that temp files are put under target/ so that a ./run.sh all command puts everything in the existing temporary directory tree.
}

@Benchmark
@OperationsPerInvocation(VALUE_COUNT)
I didn't know of this trick until I saw your PR; adopted it in #3452 earlier today.
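For readers unfamiliar with the trick, a minimal self-contained sketch of how @OperationsPerInvocation is used; VALUE_COUNT and the loop body are placeholders rather than the benchmark's actual code. JMH divides the measured time by the annotation value, so the reported score is per encoded value instead of per method call:

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.OperationsPerInvocation;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class OperationsPerInvocationSketch {
  static final int VALUE_COUNT = 100_000;
  int[] values = new int[VALUE_COUNT];

  @Benchmark
  @OperationsPerInvocation(VALUE_COUNT)
  public long encodeAllValues() {
    long acc = 0;
    for (int i = 0; i < VALUE_COUNT; i++) {
      acc += values[i]; // stand-in for writer.writeInteger(values[i])
    }
    return acc; // returning the accumulator keeps the loop from being eliminated
  }
}
```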
 * A no-op {@link OutputFile} that discards all written data.
 * Useful for isolating CPU/encoding cost from filesystem I/O in write benchmarks.
 */
public final class BlackHoleOutputFile implements OutputFile {
Does this act as a black hole for the benchmarks? Or would passing the JMH Blackhole into the constructor and having it used on L62 and L67 be better?
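For reference, a hedged sketch of the variant the comment suggests: an OutputFile whose stream hands every write to an injected JMH Blackhole so the JIT cannot elide the encoding work. The class name and constructor are illustrative, not taken from the PR:

```java
import java.io.IOException;
import org.apache.parquet.io.OutputFile;
import org.apache.parquet.io.PositionOutputStream;
import org.openjdk.jmh.infra.Blackhole;

public final class BlackholeBackedOutputFile implements OutputFile {
  private final Blackhole blackhole;

  public BlackholeBackedOutputFile(Blackhole blackhole) {
    this.blackhole = blackhole;
  }

  @Override
  public PositionOutputStream create(long blockSizeHint) throws IOException {
    return new PositionOutputStream() {
      private long pos;

      @Override
      public long getPos() {
        return pos;
      }

      @Override
      public void write(int b) {
        blackhole.consume(b); // hand every byte to JMH so it cannot be optimized away
        pos++;
      }

      @Override
      public void write(byte[] b, int off, int len) {
        blackhole.consume(b); // consuming the array reference defeats dead-code elimination
        pos += len;
      }
    };
  }

  @Override
  public PositionOutputStream createOrOverwrite(long blockSizeHint) throws IOException {
    return create(blockSizeHint);
  }

  @Override
  public boolean supportsBlockSize() {
    return false;
  }

  @Override
  public long defaultBlockSize() {
    return 0;
  }
}
```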
@Setup(Level.Trial)
public void setup() throws IOException {
  tempFile = File.createTempFile("parquet-read-bench-", ".parquet");
There's a constant BenchmarkFiles.TARGET_DIR which defines the directory for benchmark files; it puts them under target/ so Maven will clean them up. I used that in my PR so that killing a test run in my IDE didn't leave cruft around... I'd recommend the same.
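A minimal sketch of the requested change, using only the standard File.createTempFile(prefix, suffix, directory) overload; the directory path string is illustrative (the real constant is the BenchmarkFiles.TARGET_DIR mentioned above):

```java
import java.io.File;
import java.io.IOException;

class TempFileUnderTargetSketch {
  File createBenchmarkTempFile() throws IOException {
    File targetDir = new File("target/tests/parquet-benchmarks"); // illustrative path
    if (!targetDir.exists() && !targetDir.mkdirs()) {
      throw new IOException("Could not create " + targetDir);
    }
    // Placing the file under target/ lets `mvn clean` remove it even if a run is killed.
    File tempFile = File.createTempFile("parquet-read-bench-", ".parquet", targetDir);
    tempFile.deleteOnExit();
    return tempFile;
  }
}
```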
…luesWriter

Two related changes in the DELTA_BYTE_ARRAY write path:

1. DeltaLengthByteArrayValuesWriter: drop the unused LittleEndianDataOutputStream wrapper. Binary.writeTo(arrayOut) works directly with the underlying CapacityByteArrayOutputStream; the LE wrapper added an extra layer of dispatch on every value but never used any LE functionality (writeInt/writeLong/etc.). Add a new writeBytes(byte[], int, int) overload so callers that already have the raw bytes can avoid allocating a Binary wrapper.

2. DeltaByteArrayWriter: tighten the suffixWriter field type to DeltaLengthByteArrayValuesWriter (it's always constructed as one) so the new writeBytes(byte[], int, int) overload is callable. Replace the suffix call with the raw-bytes overload, eliminating the per-value Binary.slice() allocation.

Benchmark results (BinaryEncodingBenchmark.encodeDeltaByteArray and encodeDeltaLengthByteArray, added in apache#3512):
- encodeDeltaByteArray (LOW cardinality, len=10): +33% to +55%
- encodeDeltaLengthByteArray (LOW card, len=10): +18% to +21%
- long-string cases: flat (per-value alloc amortized away)

No public API change. No file format change.

Validation: parquet-column 573 tests pass. Built with -Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true.
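A hedged, self-contained sketch of the shape of the new overload; the real writer delta-packs the length and appends the payload to a CapacityByteArrayOutputStream, both stood in here by ByteArrayOutputStream so the snippet compiles on its own:

```java
import java.io.ByteArrayOutputStream;
import org.apache.parquet.io.api.Binary;

class DeltaLengthWriterSketch {
  private final ByteArrayOutputStream lengthsOut = new ByteArrayOutputStream(); // stand-in for the length stream
  private final ByteArrayOutputStream arrayOut = new ByteArrayOutputStream();   // stand-in for the byte stream

  // Existing path: callers hand over a Binary wrapper.
  void writeBytes(Binary v) {
    writeBytes(v.getBytes(), 0, v.length());
  }

  // New overload: callers that already hold raw bytes skip the Binary allocation.
  void writeBytes(byte[] bytes, int offset, int length) {
    lengthsOut.write(length);              // the real writer delta-packs the length; stand-in here
    arrayOut.write(bytes, offset, length); // payload goes straight to the byte stream
  }
}
```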
public void readIntegers(int[] dest, int offset, int count) {
  try {
    // Batch-decode dictionary IDs, then batch-lookup
    int[] ids = new int[count];
You could have a lambda expression to catch and translate all of these, as per org.apache.hadoop.util.functional.FunctionalIO.
If you are happy with UncheckedIOException, you can use that as is, and submit a PR saying "this shouldn't be private", which it shouldn't be.
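A self-contained illustration of that pattern: translate IOException to UncheckedIOException once via a small functional helper instead of repeating try/catch in every read method. The helper below is written inline so the snippet compiles on its own (the Hadoop FunctionalIO utility offers the same idea); the decoder call in the usage comment is hypothetical:

```java
import java.io.IOException;
import java.io.UncheckedIOException;

final class IOTranslation {
  @FunctionalInterface
  interface IOCall<T> {
    T apply() throws IOException;
  }

  static <T> T unchecked(IOCall<T> call) {
    try {
      return call.apply();
    } catch (IOException e) {
      throw new UncheckedIOException(e); // single translation point for all decode paths
    }
  }

  // Usage sketch: the batch-read body becomes a one-liner, e.g.
  // int[] ids = unchecked(() -> decoder.readInts(count));
}
```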
… shaded jar

The parquet-benchmarks pom is missing the JMH annotation-processor configuration and the AppendingTransformer entries for BenchmarkList / CompilerHints. As a result, the shaded jar built from master fails at runtime with "Unable to find the resource: /META-INF/BenchmarkList".

This commit:
- Fixes parquet-benchmarks/pom.xml so the shaded jar is runnable: adds jmh-generator-annprocess to maven-compiler-plugin's annotation processor paths, and adds AppendingTransformer entries for META-INF/BenchmarkList and META-INF/CompilerHints to the shade plugin.
- Adds 11 JMH benchmarks covering the encode/decode paths used by the pending performance optimization PRs (apache#3494, apache#3496, apache#3500, apache#3504, apache#3506, apache#3510), so reviewers can reproduce the reported numbers and detect regressions: IntEncodingBenchmark, BinaryEncodingBenchmark, ByteStreamSplitEncodingBenchmark, ByteStreamSplitDecodingBenchmark, FixedLenByteArrayEncodingBenchmark, FileReadBenchmark, FileWriteBenchmark, RowGroupFlushBenchmark, ConcurrentReadWriteBenchmark, BlackHoleOutputFile, TestDataFactory.

After this change the shaded jar registers 87 benchmarks (was 0 from a working build, or unrunnable at all from a default build).
Pre-generate deterministic rows for the file and concurrent benchmarks so row construction does not skew the timed section, and make the encoding benchmarks include real dictionary-page and dictionary-decode work instead of only value buffers. Split synthetic RLE dictionary-index decoding into its own benchmark and encode generated binary payloads as UTF-8 explicitly so benchmark inputs stay consistent across runs and platforms.
Make the dictionary encode/decode benchmarks symmetric by routing both sides through a shared EncodedDictionary helper, guard against the dictionary writer falling back to plain encoding (which previously NPE'd in BinaryEncodingBenchmark setup for high-cardinality long strings), and drop redundant close() calls after toDictPageAndClose(). Share the pre-generated row array across threads in ConcurrentReadWriteBenchmark via Scope.Benchmark, eliminating 4x heap duplication and a now-unnecessary ThreadData inner class. Centralize the RNG seed as TestDataFactory.DEFAULT_SEED and add seed-overload variants for the int and binary generators so generators in the same setup no longer share a Random and silently depend on call order. Wrap the RLE encoder in try-with-resources and validate that LOW_CARDINALITY_DISTINCT fits within the configured bit width.
Benchmarks raw compress/decompress throughput for each supported codec (SNAPPY, ZSTD, LZ4_RAW, GZIP) at page sizes 8KB, 64KB, and 256KB using the heap-based CodecFactory path. Input data mixes sequential, repeated, low-range random, and full random patterns for realistic compression ratios.
- Add RLE encodeDictionaryIds benchmark to cover par9 encoder pack32Values fast path (previously only decode was benchmarked)
- Trim CompressionBenchmark page sizes to boundary conditions (64K, 1MB) to cut redundant mid-points
- Increase FileRead/FileWriteBenchmark SS iterations (warmup 3->5, measurement 5->10) for better statistical stability
- Increase RowGroupFlushBenchmark iterations (warmup 2->3, measurement 3->5) for improved confidence with 2 param combos
Restore decodeDictionaryIdsBatch and decodeValuesReaderBatch from the original cherry-pick, now that the par13 batch read APIs are available. Add decodeValuesReader for production-path (ValuesReader wrapper) coverage.

Five RLE benchmark methods now cover par9 and par13:
- encodeDictionaryIds: RLE encoder pack32Values fast path
- decodeDictionaryIds: per-value RLE decoder
- decodeDictionaryIdsBatch: batch RLE decoder via readInts()
- decodeValuesReader: per-value via ValuesReader wrapper
- decodeValuesReaderBatch: batch via ValuesReader.readIntegers()
…/DOUBLE) Covers per-value and batch decode paths for PlainValuesReader across all four numeric primitive types. Uses pre-allocated destination arrays to avoid per-invocation allocation noise in batch measurements.
Add decodeFloatBatch, decodeDoubleBatch, decodeIntBatch, decodeLongBatch benchmarks with pre-allocated destination arrays to measure readXxx(dest, offset, count) throughput for all four BSS primitive types.
BooleanEncodingBenchmark exercises both encoding paths across six data patterns: ALL_TRUE, ALL_FALSE, ALTERNATING, RANDOM, MOSTLY_TRUE_99, MOSTLY_FALSE_99.

Key findings (100K values):
- Encode: V1 PLAIN is data-independent (~880M ops/s). V2 RLE ranges from 2,344M (ALL_FALSE, +166%) to 192M (RANDOM, -78%).
- Decode: V2 RLE always >= V1 PLAIN, from +154% (ALL_FALSE) to +7% (ALTERNATING). The RLE decode penalty for random data is negligible.

The severe RLE encode penalty for random data (4.6x slower than PLAIN) suggests the V1/V2 split is well-justified: V2 RLE is ideal for the common case of skewed boolean columns, while V1 PLAIN is safer for high-entropy data.
…2 RLE) Adds decodePlainV1Batch and decodeRleV2Batch benchmark methods that exercise the new readBooleans() batch API. Uses a pre-allocated boolean[] destination array to isolate decode throughput from allocation overhead.
…ypes

Covers encode + decode (scalar and batch) paths for all four type-specific dictionary implementations: Long2IntOpenHashMap, Float2IntOpenHashMap, Double2IntOpenHashMap, and Object2IntOpenHashMap (for FLBA). Two data patterns exercise low-cardinality (100 distinct values, ~100% hit rate) and high-cardinality (all unique, stresses hash map growth). Also adds TestDataFactory generators for long[], float[], double[], and fixed-length Binary[] data with configurable cardinality.

Characterization results (100K values, JDK 25, Compiler Blackholes):
- Batch decode shows +60-67% over scalar for LONG/FLOAT/DOUBLE
- HIGH_CARDINALITY encode is 6-7x slower than LOW (hash map pressure)
- FLBA encode is 14-108M ops/s (Binary hashing overhead)
…ecode

Rewrites FixedLenByteArrayEncodingBenchmark from a single encodePlain() method to full coverage of all four FLBA-supported encodings (PLAIN, DELTA_BYTE_ARRAY, BYTE_STREAM_SPLIT, DICTIONARY) with both encode and decode benchmarks. Adds parameterized fixedLength (2=FLOAT16, 12=INT96, 16=UUID) and dataPattern (RANDOM, LOW_CARDINALITY) axes.

Characterization results (100K values, JDK 25, fixedLength=16):
- Dictionary decode: 543M ops/s (fastest, avoids 16B copy per value)
- PLAIN decode: 184M ops/s (slice + Binary wrapping)
- BSS/Delta decode: ~87M ops/s (byte scatter/prefix overhead)
- BSS excels at fixedLength=2: 368M ops/s (trivial 2-stream transpose)
LZ4_RAW was optimized (+47-77% decompress throughput) and has a micro-benchmark in CompressionBenchmark, but was missing from the end-to-end file read/write benchmarks. Adding it enables direct comparison with SNAPPY, ZSTD, and GZIP at the full-file level.
Add encodePlainV1Batch and encodeRleV2Batch benchmarks that exercise the new writeBooleans() batch encoding path, complementing the existing scalar encode benchmarks.
…LOAT/DOUBLE) New PlainEncodingBenchmark class with scalar vs batch comparison for all four numeric types. Also adds encodePlainBatch to IntEncodingBenchmark for consistency.
…dePlainBatch) Exercises the new readBinaries()/writeBinaries() batch APIs for FIXED_LEN_BYTE_ARRAY PLAIN encoding. Results: decode batch +165-245%, encode batch +19-81%.
- Add brotli-codec dependency to parquet-benchmarks (profile-gated, x86_64 only)
- Include BROTLI in @Param codec list alongside SNAPPY, ZSTD, LZ4_RAW, GZIP
- Add jitpack.io repository for brotli-codec resolution
Bypass the Hadoop BrotliCodec/stream wrapper for BROTLI compression and decompression by using org.meteogroup.jbrotli's native JNI bindings directly with ByteBuffer support via reflection (brotli-codec remains runtime scope). This eliminates intermediate buffer copies and the BrotliStreamCompressor state machine overhead.

Changes:
- DirectCodecFactory: add BrotliDirectCompressor (quality=1, matching Hadoop default) and BrotliDirectDecompressor using the one-shot jbrotli API via reflection
- Load the native library eagerly, with graceful fallback to the Hadoop codec path
- CompressionBenchmark: switch from heap CodecFactory to DirectCodecFactory to benchmark the actual production code path

Results at 64KB page size:
- Compress: 6,746 -> 9,662 ops/s (1.43x speedup)
- Decompress: 2,534 -> 2,786 ops/s (1.10x speedup)
Replace per-value getXxx(offset) loops with position()+asXxxBuffer().get() bulk copy in readFloats/readDoubles/readIntegers/readLongs. The decoded data buffer is a contiguous heap byte[] in LE order, making view buffer bulk reads a single memcpy via Unsafe.copyMemory.

Benchmark results (100K values, BSS FLOAT batch):
- Before: ~1,228M ops/s
- After: ~1,442M ops/s (+17%)

INT32/INT64/DOUBLE show negligible change because BSS invocation cost is dominated by page transposition in initFromPage, not the read loop.
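A hedged sketch of the bulk-copy idea for the FLOAT case; the field names (decodedPage, nextOffset) are illustrative, not the reader's actual members:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class BssBatchReadSketch {
  private byte[] decodedPage; // filled after de-interleaving the BSS streams in initFromPage
  private int nextOffset;     // byte offset of the next unread value

  public void readFloats(float[] dest, int offset, int count) {
    ByteBuffer.wrap(decodedPage, nextOffset, count * Float.BYTES)
        .order(ByteOrder.LITTLE_ENDIAN)   // Parquet stores values little-endian
        .asFloatBuffer()
        .get(dest, offset, count);        // one bulk copy instead of count scalar reads
    nextOffset += count * Float.BYTES;
  }
}
```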
…ns()

Replace ByteBitPackingValuesReader delegation in BooleanPlainValuesReader with direct bit extraction from the page byte[]. The scalar path uses a single array access + shift + mask instead of the 8-element int[] buffer and packer dispatch. The batch path (readBooleans) unrolls 8 booleans per byte with constant masks.

For RLE (V2), add a native readBooleans() method that uses Arrays.fill for RLE runs (constant-time for uniform data) and direct int-to-boolean conversion for packed groups, avoiding the intermediate int[] allocation of the readInts() path.

Benchmark results (1M values, JDK 25, Compiler Blackholes):
- V1 PLAIN scalar: ~620M -> ~1,528-1,618M ops/s (+150%)
- V1 PLAIN batch: ALL_TRUE/FALSE ~5,000M (+680%), RANDOM 2,757M (+337%)
- V2 RLE batch: ALL_TRUE/FALSE ~190B (fill), RANDOM 1,335M (+93%)
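A hedged sketch of the scalar fast path for V1 PLAIN booleans: one array access, shift, and mask per value, with no intermediate int[8] buffer. Field names are illustrative, and the batch method is shown as a plain loop rather than the unrolled version described above:

```java
class BooleanPlainReadSketch {
  private byte[] page;  // PLAIN boolean page: 8 values per byte, least-significant bit first
  private int bitIndex; // index of the next boolean to read

  public boolean readBoolean() {
    boolean value = ((page[bitIndex >> 3] >> (bitIndex & 7)) & 1) != 0;
    bitIndex++;
    return value;
  }

  // Batch variant: same extraction, applied count times without per-value dispatch.
  public void readBooleans(boolean[] dest, int offset, int count) {
    for (int i = 0; i < count; i++) {
      int bit = bitIndex + i;
      dest[offset + i] = ((page[bit >> 3] >> (bit & 7)) & 1) != 0;
    }
    bitIndex += count;
  }
}
```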
Replace the per-bit unrolled extraction loop with a static boolean[256][8] lookup table + System.arraycopy. Each byte maps to its 8 pre-decoded booleans, and the 8-byte copy is emitted by HotSpot as a single 64-bit load/store pair, the boolean equivalent of asIntBuffer().get() for ints. For RLE PACKED groups (bitWidth=1), bypass the int[] intermediate and read directly from the raw packed bytes via the same lookup table.

This makes batch decode throughput independent of data pattern:
- V1 PLAIN batch RANDOM: 2,757M -> 5,047M ops/s (+83%)
- V2 RLE batch RANDOM: 1,335M -> 1,618M ops/s (+21%)
- V2 RLE batch MOSTLY_TRUE_99: 3,205M -> 3,745M ops/s (+17%)
- Uniform patterns (ALL_TRUE/FALSE): unchanged (still Arrays.fill)
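A hedged sketch of the lookup-table idea; the table layout (LSB-first) matches the PLAIN bit order, but the method signature is illustrative:

```java
class BooleanLookupTableSketch {
  // Each possible byte value maps to its 8 pre-decoded booleans.
  private static final boolean[][] DECODED = new boolean[256][8];
  static {
    for (int b = 0; b < 256; b++) {
      for (int bit = 0; bit < 8; bit++) {
        DECODED[b][bit] = ((b >> bit) & 1) != 0; // LSB-first, matching PLAIN bit order
      }
    }
  }

  // Batch decode: one table index plus an 8-element arraycopy per byte,
  // independent of the data pattern.
  static void decode(byte[] page, int firstByte, boolean[] dest, int destOffset, int byteCount) {
    for (int i = 0; i < byteCount; i++) {
      System.arraycopy(DECODED[page[firstByte + i] & 0xFF], 0, dest, destOffset + i * 8, 8);
    }
  }
}
```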
…king

Refactor BooleanPlainValuesWriter to pack bits directly into bytes instead of delegating through ByteBitPackingValuesWriter and the generic int[8]-based ByteBasedBitPackingEncoder. Add a batch writeBooleans() API to ValuesWriter with optimized overrides:
- PLAIN: processes 8 booleans at a time into single bytes with OR/shift, eliminating the per-value method call chain and int[] intermediate.
- RLE: pre-scans for runs >= 8 to emit RLE directly, fills partial bit-packed groups from run boundaries to avoid spurious padding.

PLAIN scalar improves +69% (890M -> 1,500M ops/s) from the refactoring. PLAIN batch: +184% over old scalar (2,528M for RANDOM). RLE batch: +278% for ALL_FALSE, +95% for MOSTLY_*, +36% for ALTERNATING.
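A hedged sketch of the PLAIN batch packing path: eight booleans are OR/shifted into one byte per group. The output sink is a stand-in (ByteArrayOutputStream) and the partial-group handling is elided:

```java
import java.io.ByteArrayOutputStream;

class BooleanPlainWriteSketch {
  private final ByteArrayOutputStream out = new ByteArrayOutputStream(); // stand-in sink

  void writeBooleans(boolean[] values, int offset, int count) {
    int i = offset;
    int end = offset + count;
    while (end - i >= 8) {
      int packed = 0;
      for (int bit = 0; bit < 8; bit++) {
        if (values[i + bit]) {
          packed |= 1 << bit; // LSB-first, matching the PLAIN layout
        }
      }
      out.write(packed); // one byte per 8 booleans, no per-value dispatch
      i += 8;
    }
    // A trailing partial group (< 8 values) would be buffered until more values
    // arrive or the page is flushed; elided in this sketch.
  }
}
```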
…riteDoubles with bulk ByteBuffer view transfers

Add bulk write methods to CapacityByteArrayOutputStream (writeInts, writeLongs, writeFloats, writeDoubles) that use IntBuffer/LongBuffer/FloatBuffer/DoubleBuffer view puts to transfer entire arrays in one operation, amortizing capacity checks across the batch. Add corresponding batch APIs to ValuesWriter (with scalar default) and optimized overrides in PlainValuesWriter.

Performance improvement (100K values, JDK 25):
- INT32: 566M -> 2,809M ops/s (+396%)
- FLOAT: 540M -> 2,818M ops/s (+422%)
- INT64: 479M -> 1,306M ops/s (+173%)
- DOUBLE: 442M -> 1,275M ops/s (+189%)
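A hedged sketch of the bulk writeInts idea; the real CapacityByteArrayOutputStream manages a list of slabs, so the single resizable byte[] here is a stand-in:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class BulkWriteSketch {
  private byte[] slab = new byte[1 << 20];
  private int position;

  void writeInts(int[] values, int offset, int count) {
    int bytes = count * Integer.BYTES;
    ensureCapacity(bytes); // one capacity check for the whole batch, not per value
    ByteBuffer.wrap(slab, position, bytes)
        .order(ByteOrder.LITTLE_ENDIAN) // Parquet PLAIN is little-endian
        .asIntBuffer()
        .put(values, offset, count);    // single bulk transfer of the entire array slice
    position += bytes;
  }

  private void ensureCapacity(int additional) {
    if (position + additional > slab.length) {
      byte[] bigger = new byte[Math.max(slab.length * 2, position + additional)];
      System.arraycopy(slab, 0, bigger, 0, position);
      slab = bigger;
    }
  }
}
```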
- ValuesReader.readBinaries() / ValuesWriter.writeBinaries() default impls
- FixedLenByteArrayPlainValuesReader: bulk slice() with fixed-offset Binary views
- FixedLenByteArrayPlainValuesWriter: chunked bulk write() amortizing stream overhead
- ByteStreamSplitValuesReader: optimized array-based decode with unrolled loops for element sizes 2, 4, 8, 12, 16
- ByteStreamSplitValuesReaderForFLBA: batch readBinaries() with single advanceByteOffset
- FixedLenByteArrayEncodingBenchmark: full FLBA benchmark suite with batch variants
- Add TestDataFactory and BenchmarkEncodingUtils helper classes
- Fix JMH annotation processor config in pom.xml for Maven Compiler 3.14+
… writes

Replace per-value scatterBytes() in FixedLenByteArrayByteStreamSplitValuesWriter with a BATCH_SIZE=64 buffered scatter pattern:
- Accumulate byte values into per-stream batch buffers
- Flush as bulk write(byte[], 0, count) to each stream
- Eliminates N*elementSize individual stream.write(byte) calls per batch
- Adds writeBinaries() batch override for FLBA BSS writer

Performance improvement: FLBA size=2 +85%, size=16 +160% (vs per-byte scatter).
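A hedged sketch of the buffered scatter; the names (BATCH_SIZE, streams, scatter) follow the commit message, while the stream type is a stand-in:

```java
import java.io.ByteArrayOutputStream;

class BssBatchedScatterSketch {
  private static final int BATCH_SIZE = 64;
  private final int elementSize;
  private final ByteArrayOutputStream[] streams; // one stream per byte position of the element
  private final byte[][] batch;                  // per-stream staging buffers
  private int batched;                           // values currently staged

  BssBatchedScatterSketch(int elementSize) {
    this.elementSize = elementSize;
    this.streams = new ByteArrayOutputStream[elementSize];
    this.batch = new byte[elementSize][BATCH_SIZE];
    for (int i = 0; i < elementSize; i++) {
      streams[i] = new ByteArrayOutputStream();
    }
  }

  void scatter(byte[] value) { // value.length == elementSize
    for (int i = 0; i < elementSize; i++) {
      batch[i][batched] = value[i];
    }
    if (++batched == BATCH_SIZE) {
      flushBatch();
    }
  }

  void flushBatch() {
    for (int i = 0; i < elementSize; i++) {
      streams[i].write(batch[i], 0, batched); // one bulk write per stream per batch
    }
    batched = 0;
  }
}
```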
Summary
Resolves #3511.
The parquet-benchmarks shaded jar built from current master is non-functional: it fails at runtime with RuntimeException: Unable to find the resource: /META-INF/BenchmarkList. This PR fixes that and adds 11 JMH benchmarks covering the encode/decode paths exercised by the open performance PRs, so reviewers can reproduce the reported numbers.

What's broken on master

parquet-benchmarks/pom.xml is missing two pieces of configuration:
- maven-compiler-plugin lacks the annotationProcessorPaths/annotationProcessors config for jmh-generator-annprocess, so the JMH annotation processor never runs and META-INF/BenchmarkList / META-INF/CompilerHints are never generated.
- maven-shade-plugin lacks AppendingTransformer entries for those two resources, so even if generated they would be dropped during shading.

Both problems are fixed in this PR.
Benchmarks added
11 new files in parquet-benchmarks/src/main/java/org/apache/parquet/benchmarks/:
- IntEncodingBenchmark
- BinaryEncodingBenchmark
- ByteStreamSplitEncodingBenchmark / ByteStreamSplitDecodingBenchmark
- FixedLenByteArrayEncodingBenchmark
- FileReadBenchmark / FileWriteBenchmark
- RowGroupFlushBenchmark
- ConcurrentReadWriteBenchmark
- BlackHoleOutputFile: an OutputFile that discards bytes, isolating CPU from I/O
- TestDataFactory

Validation
After this PR, the shaded jar is runnable and registers 87 benchmarks:
Sanity check: IntEncodingBenchmark.decodePlain reproduces the master baseline cited in #3493/#3494 (~91M ops/s on JDK 21, JMH 1.37, 3 warmup + 5 measurement iterations).

Out of scope (deferred)
Modernization of the existing ReadBenchmarks / WriteBenchmarks / NestedNullWritingBenchmarks (Hadoop-free LocalInputFile, parameterization, JMH-idiomatic state setup) is a separate concern and will be proposed in a follow-up PR.
Once this lands, each open perf PR (#3494, #3496, #3500, #3504, #3506, #3510) will be updated with a one-line "How to reproduce" snippet referencing the relevant
*Benchmark class.