PARQUET-3479: Add configuration to disable early dictionary compression check #3556
Open
yadavay-amzn wants to merge 1 commit into
Problem
FallbackValuesWriter calls isCompressionSatisfying() after the first page to decide whether dictionary encoding is worthwhile. With modern page-index defaults (~20k rows per page), this check fires too early for moderate-cardinality columns: dictionary encoding gets abandoned before enough data has accumulated to show its benefit, resulting in significantly larger files.

As reported in #3479, a column with 1M int64 values mod 32768 produces 8.4 MB with the premature fallback vs 2.2 MB when dictionary encoding is preserved.
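The size gap can be approximated with some back-of-the-envelope arithmetic. The sketch below is a simplified model, not parquet-java code: it assumes the column holds the sequence i % 32768, that dictionary ids are bit-packed at the minimum width, and it ignores page headers, RLE run headers, and any compression codec. The class and method names are hypothetical.

```java
public class DictionarySizeEstimate {
    // Bits needed to store dictionary ids in the range [0, distinct).
    public static int bitWidth(long distinct) {
        return distinct <= 1 ? 0 : 64 - Long.numberOfLeadingZeros(distinct - 1);
    }

    // Distinct values among the first `rows` values of the sequence i % cardinality.
    public static long distinctValues(long rows, long cardinality) {
        return Math.min(rows, cardinality);
    }

    // Plain encoding: 8 bytes per int64 value.
    public static long plainBytes(long rows) {
        return rows * Long.BYTES;
    }

    // Dictionary encoding: the dictionary itself plus bit-packed ids.
    public static long dictionaryEncodedBytes(long rows, long cardinality) {
        long distinct = distinctValues(rows, cardinality);
        long dictionaryBytes = distinct * Long.BYTES;
        long idBytes = (rows * bitWidth(distinct) + 7) / 8;
        return dictionaryBytes + idBytes;
    }

    // Simplified version of the "is the dictionary paying off?" question.
    public static boolean dictionaryLooksWorthwhile(long rows, long cardinality) {
        return dictionaryEncodedBytes(rows, cardinality) < plainBytes(rows);
    }

    public static void main(String[] args) {
        long cardinality = 32_768;
        for (long rows : new long[] {20_000, 1_000_000}) {
            System.out.printf("rows=%d plain=%d dict=%d worthwhile=%b%n",
                    rows,
                    plainBytes(rows),
                    dictionaryEncodedBytes(rows, cardinality),
                    dictionaryLooksWorthwhile(rows, cardinality));
        }
    }
}
```

Under these assumptions, the first 20,000 rows are all distinct, so the dictionary alone is as large as the plain data and the check fails. Over the full 1M rows the same model gives roughly 2.1 MB against 8 MB plain, which is in line with the figures reported in the issue.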
Fix
Add a configurable property ParquetProperties.isDictionaryEarlyCheckEnabled() (default: true for backward compatibility) that controls whether the first-page compression check is performed in FallbackValuesWriter.getBytes().

When disabled, dictionary encoding is only abandoned when the dictionary itself exceeds size limits (shouldFallBack()), not based on the first-page compression ratio.

Changes
ParquetProperties: added dictionaryEarlyCheckEnabled field, getter, and builder method
FallbackValuesWriter: added overloaded of() factory and constructor accepting the flag; guarded the isCompressionSatisfying call
DefaultValuesWriterFactory: passes the config through to FallbackValuesWriter.of()
TestFallbackValuesWriter: verifies dictionary encoding is preserved when the check is disabled
Testing
parquet-column tests unaffected (default true preserves existing behavior)
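For reviewers, the intended behavior of the guard described under Fix can be modelled roughly as follows. This is a standalone sketch under stated assumptions, not the code in this PR: the class EarlyCheckGuardSketch and its finishPage method are hypothetical, the cost test is reduced to "ids plus dictionary smaller than raw", and the size-limit fallback is evaluated once per page here purely to keep the example short.

```java
public class EarlyCheckGuardSketch {
    // Mirrors the new property; the PR keeps the default at true.
    private final boolean earlyCheckEnabled;
    private boolean fellBack = false;
    private boolean firstPage = true;

    public EarlyCheckGuardSketch(boolean earlyCheckEnabled) {
        this.earlyCheckEnabled = earlyCheckEnabled;
    }

    // Simplified cost test: the dictionary pays off only if the encoded ids
    // plus the dictionary are smaller than the raw values.
    public static boolean isCompressionSatisfying(long rawBytes, long encodedBytes, long dictionaryBytes) {
        return encodedBytes + dictionaryBytes < rawBytes;
    }

    // Called once per page in this model. Returns true if dictionary
    // encoding is still in use after the page has been processed.
    public boolean finishPage(long rawBytes, long encodedBytes, long dictionaryBytes, long maxDictionaryBytes) {
        if (!fellBack) {
            // Size-limit fallback: active regardless of the flag.
            boolean dictionaryTooBig = dictionaryBytes > maxDictionaryBytes;
            // First-page compression check: only when the flag is enabled.
            boolean poorFirstPage = earlyCheckEnabled
                    && firstPage
                    && !isCompressionSatisfying(rawBytes, encodedBytes, dictionaryBytes);
            if (dictionaryTooBig || poorFirstPage) {
                fellBack = true;
            }
        }
        firstPage = false;
        return !fellBack;
    }

    public static void main(String[] args) {
        long limit = 1L << 20; // 1 MiB dictionary limit, chosen for illustration
        // First page of 20k distinct int64 values: 160,000 raw bytes,
        // 37,500 bytes of ids, 160,000 bytes of dictionary.
        System.out.println(new EarlyCheckGuardSketch(true).finishPage(160_000, 37_500, 160_000, limit));
        System.out.println(new EarlyCheckGuardSketch(false).finishPage(160_000, 37_500, 160_000, limit));
    }
}
```

With the flag enabled the model abandons the dictionary after the first page. With it disabled the dictionary survives until its size exceeds the limit, which is the behavior the new TestFallbackValuesWriter case is described as verifying.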