perf(serialization): project all_stats wire payload to displayed stats only (#880) #883
paddymul wants to merge 2 commits
…layed keys
The all_stats payload currently serializes the entire merged_sd (every stat for every column), but the frontend only reads the pinned-row values (looked up by primary_key_val) and the histogram bins. This test asserts that the wire payload drops the dead weight (value_counts, histogram_args, memory_usage, the is_* typing flags) while keeping what the UI reads, and that the full merged_sd stays intact on the dataflow. Fails today. (#880)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s only
Derive the frontend-needed stat-key set from the active styling classes (pinned_rows primary_key_vals, ? scope-prefix stripped) plus the histogram bins the color-map rule reads, and project merged_sd down to those keys before sd_to_parquet_b64. The full merged_sd stays on the dataflow for styling regeneration and server-side use (sort, column_config); only the wire copy shrinks.
- serialization_utils.project_sd: pure per-column key filter
- styling_core.wire_stat_keys: the displayed-key allowlist
- dataflow._sd_to_jsondf: the single projection choke point; the widget and polars _sd_to_jsondf now delegate here
For a 20-stat numeric frame this trims the per-column wire payload from ~43 stats to ~18. (#880)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
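The per-column key filter described in the commit message can be sketched as follows. This is a minimal illustration, not the code from the PR: it assumes merged_sd is a plain dict mapping column name to a dict of stat name to value, and the sample stats and values are made up.

```python
from typing import Any, Dict, Iterable

# Assumed shape: {column_name: {stat_name: stat_value}}.
SummaryDict = Dict[str, Dict[str, Any]]


def project_sd(sd: SummaryDict, keep_keys: Iterable[str]) -> SummaryDict:
    """Return a new summary dict whose per-column dicts hold only keep_keys.

    The input is not mutated: new outer and inner dicts are built.
    """
    keep = set(keep_keys)
    return {
        col: {stat: val for stat, val in stats.items() if stat in keep}
        for col, stats in sd.items()
    }


# Illustrative data only; the real merged_sd carries many more stats.
merged_sd = {
    "a": {"mean": 2.0, "histogram_bins": [0, 1, 2], "value_counts": {1: 3}},
    "b": {"mean": 5.0, "memory_usage": 128},
}
wire_sd = project_sd(merged_sd, {"mean", "histogram_bins"})
```

Building new dicts rather than deleting keys in place is what lets the full merged_sd stay on the dataflow while only the wire copy shrinks.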
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f43b90f512
    wire_stats = _wire_stat_names(w.df_data_dict['all_stats'])
    # Dead weight is gone from the wire ...
    assert 'value_counts' not in wire_stats
Include the implementation before asserting trimmed wire stats
In this commit the production serializers are unchanged: DataFlow._sd_to_jsondf, BuckarooWidgetBase._sd_to_jsondf, and PolarsBuckarooWidget._sd_to_jsondf still pass the complete sd directly to sd_to_parquet_b64(sd), which emits every {column}__{stat} in the schema. As a result, once the test environment has pandas/pyarrow installed, wire_stats will still include value_counts (and the other dead-weight keys) from merged_sd, so this new test fails in CI rather than documenting a shipped optimization.
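To make the reviewer's point concrete, here is a rough sketch of how stat names could be recovered from flattened {column}__{stat} schema names. The helper below is hypothetical: it is not the _wire_stat_names used in the test (which presumably also decodes the parquet payload), and the schema names are invented for illustration.

```python
from typing import Iterable, Set


def stat_names_from_schema(schema_names: Iterable[str], sep: str = "__") -> Set[str]:
    """Collect the stat part of each '{column}__{stat}' schema name.

    Splits on the last separator so a column name that itself contains
    '__' still yields the trailing stat name (an assumption of this sketch).
    """
    stats = set()
    for name in schema_names:
        _column, found, stat = name.rpartition(sep)
        if found:
            stats.add(stat)
    return stats


# An unprojected payload still carries every stat of every column.
full_schema = ["a__mean", "a__value_counts", "a__histogram_bins", "b__mean"]
full_stats = stat_names_from_schema(full_schema)
```

With the serializers unchanged, a check like `'value_counts' not in full_stats` fails, which is the CI failure the review describes.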
📦 TestPyPI package published
pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.10.dev26724573893
or with uv:
uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.10.dev26724573893
MCP server for Claude Code
claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.14.10.dev26724573893" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table
Closes #880.
Summary
all_stats is built by sd_to_parquet_b64(merged_sd) and serializes the entire merged_sd: every stat for every column. The frontend only reads two things out of it: the histogram-bin arrays the color-map rule bins against (histogram_bins / histogram_log_bins, gridUtils.ts), and the per-column pinned-row values it looks up by primary_key_val (gridUtils.ts extractPinnedRows). Everything else is shipped to the browser and never read: value_counts (a whole pd.Series), histogram_args, memory_usage, the is_* typing flags, and the heuristic *_frac cleaning stats.
This PR projects merged_sd down to just the stats the frontend reads before serialization. The full merged_sd stays on the dataflow for styling regeneration and server-side use (infinite-scroll sort, column_config); only the wire copy shrinks.
What changed
- serialization_utils.project_sd(sd, keep_keys): pure per-column key filter (input not mutated).
- styling_core.wire_stat_keys(styling_classes, extra_pinned_rows): the displayed-key allowlist, i.e. histogram_bins / histogram_log_bins ∪ every pinned-row primary_key_val (the ? optional/scope prefix stripped, matching stripOptionalPinnedKey) across the active styling classes and any runtime pinned_rows override.
- dataflow._sd_to_jsondf: the single projection choke point (project_sd(sd, wire_stat_keys(...))). The widget-level and polars _sd_to_jsondf now delegate to the dataflow so the projection lives in one place; the redundant polars override is removed.
Impact
For a 20-column numeric frame the per-column wire payload drops from ~43 stats to ~18 (the issue measures the standard-frame all_stats parquet at 332 KB → and 3.2 MB for a 191-col frame; most of that is the dead keys above). The result is a smaller initial_state / traitlet payload, less hyparquet parse work on first paint, and smaller persisted first-load cache bundles (#877).
Conflict with #876
Checked against refactor/dfstats-v1-removal (#876, the v1 dfstats removal) via a dry-run merge: the feature code is disjoint, and buckaroo_widget.py, polars_buckaroo.py, and basic_widget_test.py all auto-merge cleanly. The only collision is one test function, test_polars_all_stats in tests/unit/polars_basic_widget_test.py, which both PRs rewrite (this PR adds pinned rows so the displayed stats survive projection; #876 ports its analysis classes off v1 PlDfStats). The manual resolution is trivial whichever PR lands first.
Testing
- test_all_stats_wire_payload_trimmed_to_displayed_keys (integration: wire trimmed, merged_sd intact), test_project_sd_keeps_only_requested_keys, test_wire_stat_keys_unions_pinned_rows_and_histogram_bins.
- test_polars_all_stats updated to the projected wire.
- ruff + paddy_format --check clean.
🤖 Generated with Claude Code
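The allowlist derivation described under "What changed" might look roughly like the sketch below. Everything about the shapes is an assumption for illustration: styling classes are taken to expose a pinned_rows list of dicts with a primary_key_val entry, the ? prefix is treated as a single leading character, and DemoStyling is a made-up class, not one from buckaroo.

```python
from typing import Any, Dict, Iterable, List, Optional, Set

# The two histogram stats the color-map rule is said to read.
HISTOGRAM_KEYS = frozenset({"histogram_bins", "histogram_log_bins"})

PinnedRow = Dict[str, Any]


def strip_optional_prefix(key: str) -> str:
    """Drop one leading '?' (assumed to mirror stripOptionalPinnedKey)."""
    return key[1:] if key.startswith("?") else key


def wire_stat_keys(
    styling_classes: Iterable[type],
    extra_pinned_rows: Optional[List[PinnedRow]] = None,
) -> Set[str]:
    """Union the histogram keys with every pinned-row primary_key_val."""
    keys = set(HISTOGRAM_KEYS)
    row_sources = [getattr(cls, "pinned_rows", []) for cls in styling_classes]
    row_sources.append(extra_pinned_rows or [])
    for rows in row_sources:
        for row in rows:
            keys.add(strip_optional_prefix(row["primary_key_val"]))
    return keys


class DemoStyling:
    """Hypothetical styling class, used only for this example."""

    pinned_rows = [{"primary_key_val": "dtype"}, {"primary_key_val": "?mean"}]


keys = wire_stat_keys([DemoStyling], extra_pinned_rows=[{"primary_key_val": "null_count"}])
```

The resulting set is what a per-column filter such as project_sd would receive as keep_keys, so any stat not pinned and not a histogram-bin array is left off the wire.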