perf(serialization): project all_stats wire payload to displayed stats only (#880) #883

Open
paddymul wants to merge 2 commits into main from feat/880-trim-wire-stats

@paddymul
Collaborator

Closes #880.

Summary

all_stats is built by sd_to_parquet_b64(merged_sd) and serializes the entire merged_sd — every stat for every column. The frontend only reads two things out of it: the histogram-bin arrays the color-map rule bins against (histogram_bins / histogram_log_bins, gridUtils.ts), and the per-column pinned-row values it looks up by primary_key_val (gridUtils.ts extractPinnedRows). Everything else — value_counts (a whole pd.Series), histogram_args, memory_usage, the is_* typing flags, the heuristic *_frac cleaning stats — is shipped to the browser and never read.

This projects merged_sd down to just the stats the frontend reads before serialization. The full merged_sd stays on the dataflow for styling regeneration and server-side use (infinite-scroll sort, column_config); only the wire copy shrinks.

What changed

  • serialization_utils.project_sd(sd, keep_keys) — pure per-column key filter (input not mutated).
  • styling_core.wire_stat_keys(styling_classes, extra_pinned_rows) — the displayed-key allowlist: histogram_bins/histogram_log_bins ∪ every pinned-row primary_key_val (the ? optional/scope prefix stripped, matching stripOptionalPinnedKey) across the active styling classes and any runtime pinned_rows override.
  • dataflow._sd_to_jsondf — the single projection choke point (project_sd(sd, wire_stat_keys(...))). The widget-level and polars _sd_to_jsondf now delegate to the dataflow so the projection lives in one place; the redundant polars override is removed.
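
The two helpers described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: it assumes merged_sd is a plain dict of per-column stat dicts, that styling classes expose a pinned_rows list of dicts carrying primary_key_val, and that the optional/scope marker is a single leading "?". The DemoStyling class and the strip_optional_prefix helper are hypothetical names introduced only for this example.

```python
from typing import Any, Dict, Iterable, List, Optional, Set

SummaryDict = Dict[str, Dict[str, Any]]

# The two histogram stats the summary says the color-map rule bins against.
HISTOGRAM_KEYS = frozenset({"histogram_bins", "histogram_log_bins"})


def strip_optional_prefix(key: str) -> str:
    """Hypothetical helper: drop an assumed single leading '?' marker."""
    return key[1:] if key.startswith("?") else key


def project_sd(sd: SummaryDict, keep_keys: Iterable[str]) -> SummaryDict:
    """Return a new dict holding only keep_keys per column; sd is not mutated."""
    keep = set(keep_keys)
    return {
        col: {k: v for k, v in stats.items() if k in keep}
        for col, stats in sd.items()
    }


def wire_stat_keys(
    styling_classes: Iterable[Any],
    extra_pinned_rows: Optional[List[Dict[str, Any]]] = None,
) -> Set[str]:
    """Union the histogram keys with every pinned-row primary_key_val."""
    keys = set(HISTOGRAM_KEYS)
    pinned: List[Dict[str, Any]] = []
    for cls in styling_classes:
        # Assumption: each styling class carries a pinned_rows list.
        pinned.extend(getattr(cls, "pinned_rows", []))
    pinned.extend(extra_pinned_rows or [])
    for row in pinned:
        keys.add(strip_optional_prefix(row["primary_key_val"]))
    return keys


class DemoStyling:
    """Hypothetical styling class used only for this example."""
    pinned_rows = [{"primary_key_val": "dtype"}, {"primary_key_val": "?mean"}]


merged_sd = {
    "a": {
        "dtype": "int64",
        "mean": 2.0,
        "value_counts": [1, 2, 3],
        "histogram_bins": [0, 1, 2],
        "memory_usage": 128,
    }
}
wire_sd = project_sd(merged_sd, wire_stat_keys([DemoStyling]))
```

In this sketch wire_sd keeps only dtype, mean, and histogram_bins for column a, while merged_sd itself still holds value_counts and memory_usage for server-side use.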

Impact

For a 20-column numeric frame the per-column wire payload drops from ~43 stats to ~18 (the issue measures the standard-frame all_stats parquet at 332 KB, and at 3.2 MB for a 191-col frame; most of that is the dead keys above). The result is a smaller initial_state/traitlet payload, less hyparquet parse work on first paint, and smaller persisted first-load cache bundles (#877).
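
As a rough worked example of the stat counts quoted above (a sketch only, assuming every column carries the same number of stats; it counts serialized fields, not bytes, so it says nothing direct about the KB figures):

```python
# Approximate figures quoted in the PR description.
columns = 20
stats_before = 43  # stats per column on the wire before projection
stats_after = 18   # stats per column on the wire after projection

# Total per-column stat fields serialized for the whole frame.
fields_before = columns * stats_before
fields_after = columns * stats_after

# Fraction of stat fields that no longer cross the wire.
fraction_removed = 1 - stats_after / stats_before

print(fields_before, fields_after, round(fraction_removed, 2))
# prints: 860 360 0.58
```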

Conflict with #876

Checked against refactor/dfstats-v1-removal (#876, the v1 dfstats removal) via a dry-run merge: the feature code is disjoint — buckaroo_widget.py, polars_buckaroo.py, and basic_widget_test.py all auto-merge cleanly. The only collision is one test function, test_polars_all_stats in tests/unit/polars_basic_widget_test.py, which both PRs rewrite (this PR adds pinned rows so the displayed stats survive projection; #876 ports its analysis classes off v1 PlDfStats). Trivial manual resolution whichever lands first.

Testing

  • New: test_all_stats_wire_payload_trimmed_to_displayed_keys (integration — wire trimmed, merged_sd intact), test_project_sd_keeps_only_requested_keys, test_wire_stat_keys_unions_pinned_rows_and_histogram_bins.
  • Adapted test_polars_all_stats to the projected wire.
  • Full unit suite green (1175 passed, 6 skipped); ruff + paddy_format --check clean.

🤖 Generated with Claude Code

paddymul and others added 2 commits May 31, 2026 17:05
…layed keys

The all_stats payload currently serializes the entire merged_sd — every
stat for every column — but the frontend only reads the pinned-row values
(looked up by primary_key_val) and the histogram bins. This test asserts
the wire payload drops the dead weight (value_counts, histogram_args,
memory_usage, the is_* typing flags) while keeping what the UI reads, and
that the full merged_sd stays intact on the dataflow. Fails today. (#880)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s only

Derive the frontend-needed stat-key set from the active styling classes
(pinned_rows primary_key_vals, ? scope-prefix stripped) plus the histogram
bins the color-map rule reads, and project merged_sd down to those keys
before sd_to_parquet_b64. The full merged_sd stays on the dataflow for
styling regeneration and server-side use (sort, column_config); only the
wire copy shrinks.

- serialization_utils.project_sd: pure per-column key filter
- styling_core.wire_stat_keys: the displayed-key allowlist
- dataflow._sd_to_jsondf: the single projection choke point; the widget
  and polars _sd_to_jsondf now delegate here

For a 20-column numeric frame this trims the per-column wire payload from
~43 stats to ~18. (#880)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f43b90f512


wire_stats = _wire_stat_names(w.df_data_dict['all_stats'])
# Dead weight is gone from the wire ...
assert 'value_counts' not in wire_stats

P1: Include the implementation before asserting trimmed wire stats

In this commit the production serializers are unchanged: DataFlow._sd_to_jsondf, BuckarooWidgetBase._sd_to_jsondf, and PolarsBuckarooWidget._sd_to_jsondf still pass the complete sd directly to sd_to_parquet_b64(sd), which emits every {column}__{stat} in the schema. As a result, once the test environment has pandas/pyarrow installed, wire_stats will still include value_counts (and the other dead-weight keys) from merged_sd, so this new test fails in CI rather than documenting a shipped optimization.
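
To illustrate the {column}__{stat} naming the comment refers to, here is a minimal sketch of recovering the set of stat names from schema field names. It is not the test's _wire_stat_names helper, which is not shown here; stat_names_from_fields is a hypothetical function, and it assumes stat names themselves never contain a double underscore.

```python
from typing import Iterable, Set


def stat_names_from_fields(schema_fields: Iterable[str]) -> Set[str]:
    """Collect the stat part of every '{column}__{stat}' field name."""
    names: Set[str] = set()
    for field in schema_fields:
        # Split on the LAST '__' so column names containing '__' still work.
        _column, sep, stat = field.rpartition("__")
        if sep:  # skip fields that do not follow the naming scheme
            names.add(stat)
    return names


# Hypothetical field names for a two-column frame.
fields = ["a__value_counts", "a__mean", "b__mean", "index"]
wire_stats = stat_names_from_fields(fields)
```

With an unprojected payload like this one, 'value_counts' is present in wire_stats, which is the condition the review comment says makes the new assertion fail.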

@github-actions
Contributor

github-actions Bot commented May 31, 2026

📦 TestPyPI package published

pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.10.dev26724573893

or with uv:

uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.10.dev26724573893

MCP server for Claude Code

claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.14.10.dev26724573893" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table

📖 Docs preview

🎨 Storybook preview

Development

Successfully merging this pull request may close these issues.

Project summary_stats wire payload to only the stats the frontend reads

1 participant