Remove all vestiges of dfstats_v1 (full @stat port) by paddymul · Pull Request #876 · buckaroo-data/buckaroo

paddymul · 2026-05-31T15:57:21Z

Removes the entire legacy v1 DfStats / ColAnalysis-execution stack now that the v2 @stat StatPipeline is the only analysis path. (Replaces #875 — rebased onto current main for a clean diff.)

Removed

v1 executor: DfStats / AnalysisPipeline (deleted analysis_management.py), PlDfStats / PolarsAnalysisPipeline + the polars produce_* helpers, and the v1 order_analysis / check_solvable ordering.
v1 adapter: v1_adapter.py, the ColAnalysis branch in _normalize_inputs, and the v1_computed / spread_dict_result StatFunc flags.
compat shims: process_df_v1_compat / process_table_v1_compat / _find_v1_class. The DfStatsV2 / PlDfStatsV2 / XorqDfStatsV2 wrappers now call process_df / process_table and build the ErrDict via a new errors_to_errdict.
dead v1 ColAnalysis stat classes: TypingStats, DefaultSummaryStats, ComputedDefaultSummaryStats, PdCleaningStats, Histogram, HeuristicFracs, the cleaning-op-gen classes, and customizations/heuristics.py (the shared helper functions are kept).

Ported to @stat

BuckarooWidget, the headless server, and the pandas autocleaning configs run on PD_ANALYSIS_V2; added a cleaning_gen_ops @stat for the default autoclean op.
_normalize_inputs accepts structural ColAnalysis classes (styling, post-processing) as no-ops and rejects leftover v1 stat classes with a clear error.
All affected tests ported to @stat / StatPipeline.process_df; v1-mechanic and v1-vs-v2 parity tests dropped where covered by test_paf_v2 / test_pd_stats_v2.
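The accept/reject behaviour described for _normalize_inputs can be pictured with the sketch below. It is not buckaroo's code: the ColAnalysis stand-in, the stat decorator's is_stat marker, the "overrides series_summary" test, and the normalize_inputs function are all hypothetical names chosen for illustration.

```python
class ColAnalysis:
    """Hypothetical stand-in for the v1-style base class."""

    @staticmethod
    def series_summary(series):
        return {}


class StylingOnly(ColAnalysis):
    """Structural class: carries config only, overrides no stat method."""
    pinned_rows = []


class LegacyStats(ColAnalysis):
    """Leftover v1 stat class: still computes a summary itself."""

    @staticmethod
    def series_summary(series):
        return {"length": len(series)}


def stat(func):
    """Hypothetical marker decorator standing in for @stat."""
    func.is_stat = True
    return func


@stat
def length(series):
    return len(series)


def computes_stats(klass):
    # Assumption: a class counts as a "stat class" if it overrides series_summary.
    return klass.series_summary is not ColAnalysis.series_summary


def normalize_inputs(inputs):
    """Keep @stat functions, skip structural classes, reject v1 stat classes."""
    stat_funcs = []
    for item in inputs:
        if getattr(item, "is_stat", False):
            stat_funcs.append(item)
        elif isinstance(item, type) and issubclass(item, ColAnalysis):
            if computes_stats(item):
                raise TypeError(
                    f"{item.__name__} is a v1 stat class; port it to @stat"
                )
            # Structural class: accepted, contributes no stat functions.
        else:
            raise TypeError(f"unsupported analysis input: {item!r}")
    return stat_funcs
```

In this sketch, normalize_inputs([length, StylingOnly]) keeps only length, while passing LegacyStats raises a TypeError that names the offending class.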

Polars autocleaning (on v2)

Re-implemented on the v2 pipeline (it previously only worked through the v1 PlDfStats executor): pl_cleaning_stats @stat + PL_AUTOCLEAN_DEFAULT_V2, a re-added PolarsAutocleaning (PlDfStatsV2 + polars make_origs), and PolarsBuckarooWidget wired to it.

Two latent regressions fixed (exposed by routing the pandas widget through @stat)

base_summary_stats short-circuits value_counts on unhashable (list/dict/set) columns, restoring the fix for #843 ("perf: DefaultSummaryStats O(n²) on object columns with unhashable values (lists/dicts/sets)").
make_origs checks add_orig by truthiness, so @stat cleaning funcs returning add_orig=False no longer add spurious _orig columns.
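The first fix can be illustrated with a minimal, library-free sketch. It assumes the short-circuit simply skips counting when any value is a list, dict, or set; the function names here are hypothetical stand-ins, and the real base_summary_stats works on pandas data rather than plain lists.

```python
from collections import Counter

# Container types that cannot be hashed and so cannot be used as count keys.
UNHASHABLE_TYPES = (list, dict, set)


def has_unhashable_values(values):
    """Return True if any value is a list, dict, or set."""
    return any(isinstance(v, UNHASHABLE_TYPES) for v in values)


def value_counts_or_empty(values):
    """Count values, but short-circuit to an empty result for unhashable data."""
    if has_unhashable_values(values):
        # Skip the expensive (and failing) counting path entirely.
        return {}
    return dict(Counter(values))
```

The guard runs once per column, so hashable columns still get full counts while columns of lists, dicts, or sets return immediately.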

Notes

The docs/example-notebooks/mo-autocleaning.py sample still references a removed v1 cleaning class; it's a docs notebook (not imported, not in CI).
Full unit suite green (1050 passed, 0 failed); ruff + paddy_format --check clean.

🤖 Generated with Claude Code

@stat

Production now runs entirely on the v2 @stat StatPipeline, so remove the legacy v1 pluggable-analysis stack:

- delete the v1 executor: DfStats/AnalysisPipeline (analysis_management.py), PlDfStats/PolarsAnalysisPipeline + polars produce_* helpers, and the v1 order_analysis/check_solvable ordering
- delete the v1 ColAnalysis->StatFunc adapter (v1_adapter.py) and the v1_computed / spread_dict_result StatFunc flags
- drop process_df_v1_compat / process_table_v1_compat / _find_v1_class; DfStatsV2 / PlDfStatsV2 / XorqDfStatsV2 call process_df/process_table and build the ErrDict via the new errors_to_errdict helper
- wire BuckarooWidget, the headless server, and the pandas autocleaning configs onto PD_ANALYSIS_V2 (@stat); add a cleaning_gen_ops @stat for the default autoclean op
- _normalize_inputs accepts structural ColAnalysis classes (styling, post-processing) as no-ops and rejects leftover v1 stat classes

Fixes two latent issues exposed by routing the pandas widget through @stat: base_summary_stats short-circuits value_counts on unhashable columns (#843), and make_origs checks add_orig by truthiness so the @stat cleaning funcs' add_orig=False no longer adds spurious _orig columns.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@stat

Replace v1 ColAnalysis / DfStats usage in the tests with @stat functions run through StatPipeline.process_df / DfStatsV2:

- histogram, analysis, skip_columns, pd_stats_v2, paf_v2: use PD_ANALYSIS_V2 / @stat funcs; drop the v1-vs-v2 parity (TestBackwardCompat) and v1-adapter tests now covered by the dedicated v2 suites
- autocleaning (pd / heuristic / scoped / sd_cache): configs use the @stat autoclean lists; local cleaning-op generators ported to @stat
- polars analysis-management / categorical-histogram: drive the kept column-executor path (polars_series_stats_from_select_result / ColumnExecutorDataflow) instead of the removed v1 polars executor
- polars autocleaning SD-channel/search tests run through PandasAutocleaning
- delete analysis_management_test.py (v1 executor, covered by test_paf_v2); keep the live utils tests from pluggable_analysis_framework_test.py

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@stat

Restore polars autocleaning on the v2 pipeline (it previously relied on the removed v1 PlDfStats executor running select_clauses PolarsAnalysis):

- add pl_cleaning_stats @stat (int-parse fraction from a polars Series) and PL_AUTOCLEAN_DEFAULT_V2, reusing the backend-agnostic cleaning_gen_ops
- re-add PolarsAutocleaning (PlDfStatsV2 executor + polars make_origs that rebuilds the cleaned frame, add_orig checked by truthiness)
- PolarsBuckarooWidget uses PolarsAutocleaning so cleaning runs through the polars path
- restore the polars autocleaning tests (int-parse stats, op generation, handle_ops_and_clean, codegen, make_origs) on the @stat path

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
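As a rough illustration of what an "int-parse fraction" measures, the sketch below computes it over a plain Python list rather than a polars Series. The function name, the handling of None, and the choice of non-null values as the denominator are assumptions made for this example, not details taken from pl_cleaning_stats.

```python
def int_parse_frac(values):
    """Fraction of non-null values whose string form parses as an integer."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        # Nothing to parse; report 0.0 rather than dividing by zero.
        return 0.0

    def parses_as_int(value):
        try:
            int(str(value).strip())
            return True
        except ValueError:
            return False

    parsed = sum(1 for v in non_null if parses_as_int(v))
    return parsed / len(non_null)
```

A column such as ["1", "2", "x", None] scores two parseable values out of three non-null ones, the kind of ratio a cleaning-op generator can compare against a threshold before proposing an integer-cast op.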

@stat

Now that production and tests run on @stat, delete the superseded v1 ColAnalysis stat classes, keeping the pure helper functions they shared:

- analysis.py: drop TypingStats / DefaultSummaryStats / ComputedDefaultSummaryStats / PdCleaningStats (keep get_mode, _has_unhashable_values, probable_datetime)
- histogram.py: drop the Histogram class (keep categorical_histogram / numeric_histogram, used by pd_stats_v2 / polars_analysis)
- pd_fracs.py: drop HeuristicFracs and the Conservative/Aggressive cleaning classes (keep the frac helper functions)
- delete customizations/heuristics.py (BaseHeuristicCleaningGenOps + the unused invert_rewritten_orig)

geopandas (TypingStats) and the docs/example-notebooks autocleaning sample still reference these; both are handled by the separate geopandas-removal PR and are not in CI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-31T15:59:01Z

📦 TestPyPI package published

pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.10.dev26719133492

or with uv:

uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.10.dev26719133492

MCP server for Claude Code

claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.14.10.dev26719133492" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table

📖 Docs preview

🎨 Storybook preview

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 450fdd7d4e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-31T16:02:01Z

                clauses.append(cleaned_df[col])
-                clauses.append(raw_df[col].alias(col+"_orig"))
+                clauses.append(raw_df[col].alias(col + "_orig"))


Use original names for v2 polars cleaning keys

When Polars autocleaning is run through PlDfStatsV2, the cleaning_sd entries are keyed by rewritten summary names like a/b, while cleaned_df and raw_df still have the user's original column names. For any dataframe whose columns are not already named a, b, etc., this indexes a non-existent column and handle_ops_and_clean(..., cleaning_method='default', ...) raises ColumnNotFoundError; use sd['orig_col_name'] (as the pandas path does) when building these clauses.

Useful? React with 👍 / 👎.

Fixed in 177bdb6. Confirmed the failure (a column not named a → ColumnNotFoundError: "a" not found from handle_ops_and_clean(..., cleaning_method='default')). PolarsAutocleaning.make_origs now indexes cleaned_df/raw_df by sd['orig_col_name'] (mirroring PandasAutocleaning.make_origs) instead of the rewritten a/b/c sd key, with the same not-in-columns / index guards plus a dedupe. Added a regression test (test_autoclean_preserves_original_column_names) that autocleans a column named realname and asserts realname/realname_orig.

PolarsAutocleaning.make_origs iterated cleaning_sd (keyed by buckaroo's internal a/b/c names) and indexed cleaned_df / raw_df, which carry the user's original column names, with those keys. Any column not literally named a/b/c raised ColumnNotFoundError from handle_ops_and_clean(..., cleaning_method='default').

Index by sd['orig_col_name'] (mirroring PandasAutocleaning.make_origs), with the same not-in-columns / 'index' guards plus a dedupe. Adds a regression test that cleans a column named 'realname'.

Addresses the codex P2 review comment on #876.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
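The key mismatch and the guards mentioned above can be sketched without polars. The function below only plans output column names; its name, the exact guard order, and the decision to always emit the cleaned column are assumptions for illustration, not the actual make_origs body.

```python
def plan_output_columns(frame_columns, cleaning_sd):
    """Plan cleaned/_orig column names, indexing by the original column name.

    cleaning_sd is keyed by rewritten names ("a", "b", ...); each entry is
    assumed to carry the user's column name under "orig_col_name".
    """
    planned = []
    seen = set()
    for _rewritten_key, sd in cleaning_sd.items():
        col = sd.get("orig_col_name")
        # Guards: missing name, the index pseudo-column, unknown columns, repeats.
        if col is None or col == "index" or col not in frame_columns or col in seen:
            continue
        seen.add(col)
        planned.append(col)
        # Truthiness check: add_orig=False (or absent) adds no _orig column.
        if sd.get("add_orig"):
            planned.append(col + "_orig")
    return planned


cleaning_sd = {
    "a": {"orig_col_name": "realname", "add_orig": True},
    "b": {"orig_col_name": "other", "add_orig": False},
    "c": {"orig_col_name": "index", "add_orig": True},
}
columns = plan_output_columns(["realname", "other"], cleaning_sd)
```

Indexing the frame with the rewritten key "a" would fail for a frame whose columns are realname and other; looking up orig_col_name first avoids that, and the truthiness check keeps other from gaining an _orig twin.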

- remove dead tests/unit/polars_error_handling_comparison.py: it imported the deleted polars_produce_series_df and was not collected by pytest
- pd_stats_v2: import jlisp/heuristic_lang at module top and drop the silent ImportError->None fallback that would quietly empty the aggressive/conservative autoclean sets (these are core in-package modules, not optional deps)
- stat_pipeline: fix stale v1_adapter reference in the module docstring

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

paddymul and others added 4 commits May 31, 2026 11:54

paddymul mentioned this pull request May 31, 2026

Remove all vestiges of dfstats_v1 (full @stat port) #875

Closed

paddymul temporarily deployed to testpypi May 31, 2026 15:58 — with GitHub Actions Inactive

chatgpt-codex-connector Bot reviewed May 31, 2026

paddymul temporarily deployed to testpypi May 31, 2026 16:12 — with GitHub Actions Inactive

This was referenced May 31, 2026

Autocleaning emits a spurious safe_int op on unhashable (list/dict/set) columns #878

Open

Rebuild autocleaning op-generation now that it rides on the @stat pipeline #879

Open

paddymul temporarily deployed to testpypi May 31, 2026 17:15 — with GitHub Actions Inactive

paddymul mentioned this pull request May 31, 2026

perf(serialization): project all_stats wire payload to displayed stats only (#880) #883

Open

paddymul wants to merge 6 commits into main from refactor/dfstats-v1-removal
