PARQUET-3479: Add configuration to disable early dictionary compression check#3556

Open
yadavay-amzn wants to merge 1 commit into apache:master from yadavay-amzn:fix/parquet_3479

Contributor

@yadavay-amzn yadavay-amzn commented May 11, 2026

Problem

FallbackValuesWriter calls isCompressionSatisfying() after the first page to decide whether dictionary encoding is worthwhile. With modern page-index defaults (~20k rows per page), this check fires too early for moderate-cardinality columns — dictionary encoding gets abandoned before enough data has accumulated to show its benefit, resulting in significantly larger files.

As reported in #3479, a column of 1M int64 values, each taken mod 32768, produces 8.4 MB with the premature fallback versus 2.2 MB when dictionary encoding is preserved.
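To make the numbers concrete, here is a rough back-of-the-envelope model of the first-page check. It is only a sketch under stated assumptions: that the check keeps dictionary encoding when the encoded indices plus the dictionary are smaller than the raw data, that indices are plainly bit-packed, and that the first ~20k values of the column are all distinct (as they would be for sequential values mod 32768). The class and method names are hypothetical and do not come from parquet-java.

```java
// Back-of-the-envelope model of a first-page dictionary check.
// Assumption (not stated in this PR): dictionary encoding is kept only if
// encodedIndexBytes + dictionaryBytes < rawBytes.
final class FirstPageCheckSketch {

    /** Bits needed to index a dictionary with the given number of entries (at least 1). */
    static int bitWidth(int distinctValues) {
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(distinctValues - 1));
    }

    /** Approximate bit-packed size of the dictionary indices, ignoring run headers. */
    static long encodedIndexBytes(long rows, int distinctValues) {
        return (rows * bitWidth(distinctValues) + 7) / 8;
    }

    /** True if dictionary encoding looks worthwhile under the assumed rule. */
    static boolean satisfies(long rows, int distinctValues, int bytesPerValue) {
        long rawBytes = rows * bytesPerValue;
        long dictionaryBytes = (long) distinctValues * bytesPerValue;
        return encodedIndexBytes(rows, distinctValues) + dictionaryBytes < rawBytes;
    }

    public static void main(String[] args) {
        // First page: ~20k int64 rows, every value still new to the dictionary.
        System.out.println("first page only: " + satisfies(20_000, 20_000, 8));
        // Whole column: 1M int64 rows, 32768 distinct values.
        System.out.println("whole column:    " + satisfies(1_000_000, 32_768, 8));
    }
}
```

Under these assumptions the first page alone fails the check, because the dictionary is as large as the raw data, while the full column would need roughly 2.1 MB for indices plus dictionary. That is in the same range as the 2.2 MB reported above, though the real writer's sizes will differ in detail.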

Fix

Add a configurable property ParquetProperties.isDictionaryEarlyCheckEnabled() (default: true for backward compatibility) that controls whether the first-page compression check is performed in FallbackValuesWriter.getBytes().

When disabled, dictionary encoding is only abandoned when the dictionary itself exceeds size limits (shouldFallBack()), not based on the first-page compression ratio.
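The intended decision logic can be summarized with a small toy model. This is a sketch, not the parquet-java source: the class, enum, and method names are hypothetical, and the three booleans stand in for the configuration flag, the result of isCompressionSatisfying(), and the result of shouldFallBack() respectively.

```java
// Toy model of the fallback decision described above; not the parquet-java code.
final class EarlyCheckGuardSketch {

    enum Encoding { DICTIONARY, PLAIN }

    /**
     * Decides which encoding survives after the first page.
     *
     * @param earlyCheckEnabled     stands in for the new configuration flag
     * @param compressionSatisfying stands in for the first-page compression check
     * @param dictionaryTooLarge    stands in for the dictionary size limit check
     */
    static Encoding chooseAfterFirstPage(boolean earlyCheckEnabled,
                                         boolean compressionSatisfying,
                                         boolean dictionaryTooLarge) {
        // The size limit always applies, regardless of the flag.
        if (dictionaryTooLarge) {
            return Encoding.PLAIN;
        }
        // The first-page ratio only matters when the early check is enabled.
        if (earlyCheckEnabled && !compressionSatisfying) {
            return Encoding.PLAIN;
        }
        return Encoding.DICTIONARY;
    }

    public static void main(String[] args) {
        // Default behavior: a poor first-page ratio abandons the dictionary.
        System.out.println(chooseAfterFirstPage(true, false, false));
        // Check disabled: the same first page keeps dictionary encoding.
        System.out.println(chooseAfterFirstPage(false, false, false));
        // Check disabled, but the dictionary outgrew its limit: still falls back.
        System.out.println(chooseAfterFirstPage(false, false, true));
    }
}
```

The point of the model is that disabling the flag removes only the ratio-based branch; the size-limit fallback is unchanged.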

Changes

  • ParquetProperties: added dictionaryEarlyCheckEnabled field, getter, and builder method
  • FallbackValuesWriter: added an overloaded of() factory and a constructor accepting the flag; guarded the isCompressionSatisfying() call
  • DefaultValuesWriterFactory: passes the config through to FallbackValuesWriter.of()
  • New test TestFallbackValuesWriter: verifies dictionary encoding is preserved when the check is disabled

Testing

  • New unit tests pass (2/2)
  • Existing parquet-column tests unaffected (default true preserves existing behavior)
