Summary
_bounded_chunksize estimates a single uniform average, bytes_per_row = table.nbytes // table.num_rows, and caps batches by row count only. For variable-width or skewed frames (e.g. a JSON-blob column, or wide rows clustered together) a single record batch can far exceed the 8 MiB target. Because zstd compression runs per batch, the compressor working set spikes with it, which is exactly the OOM regime this feature targets.
Evidence
- src/cachekit/serializers/arrow_serializer.py:65-68 (estimate), :223-227 (max_chunksize is rows-only), :52-54 and :218-222 (stated per-batch bound)
- Empirical: 100k tiny rows + 50 wide 2 MiB cells → chunksize 7966 rows → one batch ~100 MiB vs 8 MiB target (~12.5x overshoot)
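The overshoot can be approximated with plain arithmetic. This is a rough sketch, not the serializer's code: the 8-byte tiny-row size and the chunksize = target // bytes_per_row form are assumptions, so the row count lands near the measured 7966 rather than matching it exactly.

```python
MIB = 1024 * 1024
TARGET_BATCH_BYTES = 8 * MIB  # per-batch target stated in this issue

# Assumed frame shape: 100k tiny rows, then 50 wide rows clustered together.
TINY_ROWS, TINY_BYTES = 100_000, 8  # 8 bytes per tiny row is an assumption
WIDE_ROWS, WIDE_BYTES = 50, 2 * MIB

total_bytes = TINY_ROWS * TINY_BYTES + WIDE_ROWS * WIDE_BYTES
num_rows = TINY_ROWS + WIDE_ROWS

# Uniform average, as in the estimate described in the summary.
bytes_per_row = total_bytes // num_rows

# Assumed form of the rows-only cap derived from that average.
chunksize = TARGET_BATCH_BYTES // bytes_per_row

# Worst case: every wide row falls into the same batch of `chunksize` rows.
worst_batch_bytes = WIDE_ROWS * WIDE_BYTES + (chunksize - WIDE_ROWS) * TINY_BYTES
overshoot = worst_batch_bytes / TARGET_BATCH_BYTES

print(f"bytes_per_row={bytes_per_row} chunksize={chunksize}")
print(f"worst batch ~{worst_batch_bytes / MIB:.1f} MiB, {overshoot:.1f}x target")
```

The average is dominated by the 50 wide cells, so the row cap is far too generous for the region of the table where those cells actually sit.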
Impact
Silently undermines the peak-RSS bound for the target workload. No data-integrity impact; no test coverage for chunking.
Fix
Byte-aware batching: accumulate batches by estimated bytes (or cap by both rows and a byte budget). Add a skewed-frame memory regression test.
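A minimal sketch of the "cap by both rows and a byte budget" variant, in pure Python. plan_batches and its parameters are hypothetical names, not existing cachekit API, and it assumes per-row byte estimates are available (how to obtain them cheaply from an Arrow table is left open here).

```python
from typing import Iterable, List, Tuple

MIB = 1024 * 1024


def plan_batches(
    row_sizes: Iterable[int], byte_budget: int, max_rows: int
) -> List[Tuple[int, int]]:
    """Greedily split rows into contiguous [start, stop) batches.

    A batch is closed before a row that would push it past `byte_budget`
    or past `max_rows`. A single row larger than the budget still gets
    a batch of its own, so the bound is max(byte_budget, largest row).
    """
    if byte_budget <= 0 or max_rows <= 0:
        raise ValueError("byte_budget and max_rows must be positive")

    batches: List[Tuple[int, int]] = []
    start = 0      # first row of the batch being built
    acc = 0        # estimated bytes accumulated in that batch
    count = 0      # total rows seen so far
    for index, size in enumerate(row_sizes):
        rows_in_batch = index - start
        if rows_in_batch > 0 and (
            acc + size > byte_budget or rows_in_batch >= max_rows
        ):
            batches.append((start, index))
            start, acc = index, 0
        acc += size
        count = index + 1
    if count > start:
        batches.append((start, count))
    return batches


# Skewed frame from the evidence section (8-byte tiny rows are an assumption).
row_sizes = [8] * 100_000 + [2 * MIB] * 50
batches = plan_batches(row_sizes, byte_budget=8 * MIB, max_rows=8_000)
largest = max(sum(row_sizes[a:b]) for a, b in batches)
print(f"{len(batches)} batches, largest ~{largest / MIB:.2f} MiB")
```

Each (start, stop) pair could then be turned into a slice with pyarrow's Table.slice(offset, length) before compression. The same skewed frame would also serve as the input for the proposed memory regression test, asserting that no batch exceeds the byte budget.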