Arrow batch-size estimate breaks the bounded-memory guarantee on skewed/wide frames #161

@27Bslash6

Summary

_bounded_chunksize derives a single uniform average, bytes_per_row = table.nbytes // table.num_rows, and caps batches by row count only. For variable-width or skewed frames (e.g. a JSON-blob column, or wide rows clustered together), a single record batch can far exceed the 8 MiB target. Because zstd compression runs per batch, the compressor's working set spikes with it, which is exactly the OOM regime this feature is meant to prevent.

Evidence

  • src/cachekit/serializers/arrow_serializer.py:65-68 (estimate), :223-227 (max_chunksize is rows-only), :52-54/:218-222 (stated per-batch bound)
  • Empirical: 100k tiny rows + 50 wide 2 MiB cells → chunksize 7966 rows → one batch ~100 MiB vs 8 MiB target (~12.5x overshoot)
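
The empirical numbers above can be approximated with plain arithmetic. This is a rough model, not the code in arrow_serializer.py: `estimated_chunksize` is a hypothetical stand-in that assumes the row cap is the target divided by the average row size, and the 5-byte tiny-row size is a guess chosen for illustration. It also assumes the 50 wide rows sit next to each other, so they all land in one batch.

```python
MIB = 1024 * 1024
TARGET_BATCH_BYTES = 8 * MIB  # per-batch target stated in the issue


def estimated_chunksize(total_bytes, num_rows, target=TARGET_BATCH_BYTES):
    """Hypothetical model of a rows-only cap from a uniform average."""
    bytes_per_row = max(1, total_bytes // num_rows)
    return max(1, target // bytes_per_row)


TINY_ROW_BYTES = 5  # assumption, not taken from the issue
tiny_rows, wide_rows, wide_cell = 100_000, 50, 2 * MIB

total_bytes = tiny_rows * TINY_ROW_BYTES + wide_rows * wide_cell
chunksize = estimated_chunksize(total_bytes, tiny_rows + wide_rows)

# Worst case: all wide rows are contiguous and fall inside one batch,
# with the remaining slots in that batch filled by tiny rows.
worst_batch = wide_rows * wide_cell + (chunksize - wide_rows) * TINY_ROW_BYTES
overshoot = worst_batch / TARGET_BATCH_BYTES

print(chunksize, round(overshoot, 1))  # prints: 7966 12.5
```

The average is dominated by the 100k tiny rows, so the row cap stays large enough to swallow every wide row at once.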

Impact

Silently undermines the peak-RSS bound for the target workload. No data-integrity impact; no test coverage for chunking.

Fix

Byte-aware batching: accumulate batches by estimated bytes (or cap by both rows and a byte budget). Add a skewed-frame memory regression test.
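
One possible shape for the "cap by both rows and a byte budget" variant, as a minimal sketch only. `byte_aware_slices` and its `max_rows` default are hypothetical names and values, not part of cachekit, and the sketch assumes a per-row byte estimate is already available; how to obtain that estimate cheaply from an Arrow table is left open here.

```python
from typing import Iterable, Iterator, Tuple

MIB = 1024 * 1024
TARGET_BATCH_BYTES = 8 * MIB  # per-batch target stated in the issue


def byte_aware_slices(
    row_sizes: Iterable[int],
    byte_budget: int = TARGET_BATCH_BYTES,
    max_rows: int = 65_536,  # hypothetical row cap
) -> Iterator[Tuple[int, int]]:
    """Yield (offset, length) row slices capped by both bytes and rows.

    A single row larger than byte_budget still gets its own batch,
    so the budget is a bound only for rows that individually fit.
    """
    start = rows = batch_bytes = 0
    for size in row_sizes:
        # Close the current batch if this row would break either cap.
        if rows and (batch_bytes + size > byte_budget or rows >= max_rows):
            yield start, rows
            start += rows
            rows = batch_bytes = 0
        rows += 1
        batch_bytes += size
    if rows:
        yield start, rows


# Skewed frame: 50 wide 2 MiB rows clustered between tiny 5-byte rows.
sizes = [5] * 50_000 + [2 * MIB] * 50 + [5] * 50_000
slices = list(byte_aware_slices(sizes))
largest = max(sum(sizes[off:off + n]) for off, n in slices)
```

Each (offset, length) pair could then be passed to something like pyarrow's Table.slice(offset, length) before compression. The same skewed input also looks like a reasonable starting point for the regression test.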
