Remove all vestiges of dfstats_v1 (full @stat port) #876
Production now runs entirely on the v2 @stat StatPipeline, so remove the legacy v1 pluggable-analysis stack:

- delete the v1 executor: DfStats/AnalysisPipeline (analysis_management.py), PlDfStats/PolarsAnalysisPipeline + polars produce_* helpers, and the v1 order_analysis/check_solvable ordering
- delete the v1 ColAnalysis->StatFunc adapter (v1_adapter.py) and the v1_computed / spread_dict_result StatFunc flags
- drop process_df_v1_compat / process_table_v1_compat / _find_v1_class; DfStatsV2 / PlDfStatsV2 / XorqDfStatsV2 call process_df/process_table and build the ErrDict via the new errors_to_errdict helper
- wire BuckarooWidget, the headless server, and the pandas autocleaning configs onto PD_ANALYSIS_V2 (@stat); add a cleaning_gen_ops @stat for the default autoclean op
- _normalize_inputs accepts structural ColAnalysis classes (styling, post-processing) as no-ops and rejects leftover v1 stat classes

Fixes two latent issues exposed by routing the pandas widget through @stat: base_summary_stats short-circuits value_counts on unhashable columns (#843), and make_origs checks add_orig by truthiness so the @stat cleaning funcs' add_orig=False no longer adds spurious _orig columns.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
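The value_counts short-circuit mentioned above can be illustrated with a minimal stdlib sketch. This is not buckaroo's code: `has_unhashable_values` and `value_counts_or_empty` are hypothetical names, a plain list stands in for a pandas column, and returning an empty dict on the short-circuit path is an assumption made for the example.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def has_unhashable_values(values: Iterable[Any]) -> bool:
    """Return True if any value cannot be hashed (list, dict, set, ...).

    Hypothetical helper for illustration; not buckaroo's _has_unhashable_values.
    """
    for value in values:
        try:
            hash(value)
        except TypeError:
            return True
    return False


def value_counts_or_empty(values: Iterable[Any]) -> Dict[Any, int]:
    """Count values, skipping the count entirely for unhashable columns.

    Assumption: the short-circuit yields an empty result rather than
    attempting any fallback counting strategy.
    """
    values = list(values)
    if has_unhashable_values(values):
        return {}
    return dict(Counter(values))
```

The point of checking first is that the expensive or failing counting step is never attempted on columns of lists, dicts, or sets.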
Replace v1 ColAnalysis / DfStats usage in the tests with @stat functions run through StatPipeline.process_df / DfStatsV2:

- histogram, analysis, skip_columns, pd_stats_v2, paf_v2: use PD_ANALYSIS_V2 / @stat funcs; drop the v1-vs-v2 parity (TestBackwardCompat) and v1-adapter tests now covered by the dedicated v2 suites
- autocleaning (pd / heuristic / scoped / sd_cache): configs use the @stat autoclean lists; local cleaning-op generators ported to @stat
- polars analysis-management / categorical-histogram: drive the kept column-executor path (polars_series_stats_from_select_result / ColumnExecutorDataflow) instead of the removed v1 polars executor
- polars autocleaning SD-channel/search tests run through PandasAutocleaning
- delete analysis_management_test.py (v1 executor, covered by test_paf_v2); keep the live utils tests from pluggable_analysis_framework_test.py

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Restore polars autocleaning on the v2 pipeline (it previously relied on the removed v1 PlDfStats executor running select_clauses PolarsAnalysis):

- add pl_cleaning_stats @stat (int-parse fraction from a polars Series) and PL_AUTOCLEAN_DEFAULT_V2, reusing the backend-agnostic cleaning_gen_ops
- re-add PolarsAutocleaning (PlDfStatsV2 executor + polars make_origs that rebuilds the cleaned frame, add_orig checked by truthiness)
- PolarsBuckarooWidget uses PolarsAutocleaning so cleaning runs through the polars path
- restore the polars autocleaning tests (int-parse stats, op generation, handle_ops_and_clean, codegen, make_origs) on the @stat path

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
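To make "int-parse fraction" concrete, here is a small stdlib sketch of the idea. It is only an illustration: `int_parse_fraction` is a hypothetical name, a plain list stands in for a polars Series, and using the non-null count as the denominator is an assumption, not a statement about what pl_cleaning_stats does.

```python
from typing import Any, Iterable


def int_parse_fraction(values: Iterable[Any]) -> float:
    """Fraction of non-null values whose string form parses as an int.

    Hypothetical stand-in for illustration only. Assumption: nulls are
    excluded from the denominator and an all-null column yields 0.0.
    """
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0
    parsed = 0
    for value in non_null:
        try:
            int(str(value).strip())
            parsed += 1
        except ValueError:
            # "x", "1.5", "" and similar strings are not int-parseable
            pass
    return parsed / len(non_null)
```

A cleaning-op generator could then compare this fraction against a threshold to decide whether a string column is a candidate for an integer cast.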
Now that production and tests run on @stat, delete the superseded v1 ColAnalysis stat classes, keeping the pure helper functions they shared:

- analysis.py: drop TypingStats / DefaultSummaryStats / ComputedDefaultSummaryStats / PdCleaningStats (keep get_mode, _has_unhashable_values, probable_datetime)
- histogram.py: drop the Histogram class (keep categorical_histogram / numeric_histogram, used by pd_stats_v2 / polars_analysis)
- pd_fracs.py: drop HeuristicFracs and the Conservative/Aggressive cleaning classes (keep the frac helper functions)
- delete customizations/heuristics.py (BaseHeuristicCleaningGenOps + the unused invert_rewritten_orig)

geopandas (TypingStats) and the docs/example-notebooks autocleaning sample still reference these; both are handled by the separate geopandas-removal PR and are not in CI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
📦 TestPyPI package published

pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.10.dev26719133492

or with uv:

uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.10.dev26719133492

MCP server for Claude Code

claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.14.10.dev26719133492" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 450fdd7d4e
    clauses.append(cleaned_df[col])
    clauses.append(raw_df[col].alias(col+"_orig"))
    clauses.append(raw_df[col].alias(col + "_orig"))
Use original names for v2 polars cleaning keys
When Polars autocleaning is run through PlDfStatsV2, the cleaning_sd entries are keyed by rewritten summary names like a/b, while cleaned_df and raw_df still have the user's original column names. For any dataframe whose columns are not already named a, b, etc., this indexes a non-existent column and handle_ops_and_clean(..., cleaning_method='default', ...) raises ColumnNotFoundError; use sd['orig_col_name'] (as the pandas path does) when building these clauses.
Fixed in 177bdb6. Confirmed the failure (a column not named a → ColumnNotFoundError: "a" not found from handle_ops_and_clean(..., cleaning_method='default')). PolarsAutocleaning.make_origs now indexes cleaned_df/raw_df by sd['orig_col_name'] (mirroring PandasAutocleaning.make_origs) instead of the rewritten a/b/c sd key, with the same not-in-columns / index guards plus a dedupe. Added a regression test (test_autoclean_preserves_original_column_names) that autocleans a column named realname and asserts realname/realname_orig.
PolarsAutocleaning.make_origs iterated cleaning_sd (keyed by buckaroo's internal a/b/c names) and indexed cleaned_df / raw_df — which carry the user's original column names — with those keys. Any column not literally named a/b/c raised ColumnNotFoundError from handle_ops_and_clean(..., cleaning_method='default'). Index by sd['orig_col_name'] (mirroring PandasAutocleaning.make_origs), with the same not-in-columns / 'index' guards plus a dedupe. Adds a regression test that cleans a column named 'realname'. Addresses the codex P2 review comment on #876. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
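The shape of the fix can be sketched without polars. The following is a minimal illustration under stated assumptions, not the actual PolarsAutocleaning.make_origs: `output_columns` is a hypothetical function, the summary dict is a plain dict keyed by internal names, and the frame is represented only by its list of column names.

```python
from typing import Any, Dict, List


def output_columns(cleaning_sd: Dict[str, Dict[str, Any]],
                   frame_columns: List[str]) -> List[str]:
    """Plan output column names using each entry's original column name.

    Hypothetical sketch: iterate the summary dict (keyed a/b/c) but look
    columns up by sd['orig_col_name'], skip 'index' and names missing
    from the frame, dedupe, and add an _orig column only when add_orig
    is truthy (so add_orig=False adds nothing).
    """
    out: List[str] = []
    seen = set()
    for sd in cleaning_sd.values():
        col = sd.get("orig_col_name")
        if col is None or col == "index" or col not in frame_columns:
            continue
        if col in seen:
            continue
        seen.add(col)
        out.append(col)
        if sd.get("add_orig"):
            out.append(col + "_orig")
    return out


# Internal keys (a, b, c) differ from the user's column names.
example_sd = {
    "a": {"orig_col_name": "realname", "add_orig": True},
    "b": {"orig_col_name": "other", "add_orig": False},
    "c": {"orig_col_name": "index", "add_orig": True},
}
plan = output_columns(example_sd, ["realname", "other"])
```

Indexing by the dict key instead of `orig_col_name` would look up "a" in a frame that only has "realname", which is the lookup failure described in the review comment.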
- remove dead tests/unit/polars_error_handling_comparison.py; it imported the deleted polars_produce_series_df and was not collected by pytest
- pd_stats_v2: import jlisp/heuristic_lang at module top and drop the silent ImportError->None fallback that would quietly empty the aggressive/conservative autoclean sets (these are core in-package modules, not optional deps)
- stat_pipeline: fix stale v1_adapter reference in the module docstring

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
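The hazard of a silent ImportError fallback can be shown with a generic stdlib sketch. Both functions below are hypothetical and unrelated to pd_stats_v2's real code; they only contrast the two import patterns.

```python
import importlib
from typing import List


def load_names_silently(module_name: str) -> List[str]:
    """Anti-pattern: a failed import quietly degrades to an empty list."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return []  # the caller cannot tell "no ops" from "broken import"
    return [name for name in dir(module) if not name.startswith("_")]


def load_names_strictly(module_name: str) -> List[str]:
    """Preferred for core modules: let the ImportError propagate."""
    module = importlib.import_module(module_name)
    return [name for name in dir(module) if not name.startswith("_")]
```

For a genuinely optional dependency the first pattern can be reasonable; for in-package modules the second surfaces a broken install immediately instead of producing empty configuration.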
Removes the entire legacy v1 DfStats / ColAnalysis-execution stack now that the v2 @stat StatPipeline is the only analysis path. (Replaces #875; rebased onto current main for a clean diff.)

Removed

- DfStats / AnalysisPipeline (deleted analysis_management.py), PlDfStats / PolarsAnalysisPipeline + the polars produce_* helpers, and the v1 order_analysis / check_solvable ordering.
- v1_adapter.py, the ColAnalysis branch in _normalize_inputs, and the v1_computed / spread_dict_result StatFunc flags.
- process_df_v1_compat / process_table_v1_compat / _find_v1_class. The DfStatsV2 / PlDfStatsV2 / XorqDfStatsV2 wrappers now call process_df / process_table and build the ErrDict via a new errors_to_errdict.
- ColAnalysis stat classes: TypingStats, DefaultSummaryStats, ComputedDefaultSummaryStats, PdCleaningStats, Histogram, HeuristicFracs, the cleaning-op-gen classes, and customizations/heuristics.py (the shared helper functions are kept).

Ported to @stat

- BuckarooWidget, the headless server, and the pandas autocleaning configs run on PD_ANALYSIS_V2; added a cleaning_gen_ops @stat for the default autoclean op.
- _normalize_inputs accepts structural ColAnalysis classes (styling, post-processing) as no-ops and rejects leftover v1 stat classes with a clear error.
- Tests run through @stat / StatPipeline.process_df; v1-mechanic and v1-vs-v2 parity tests dropped where covered by test_paf_v2 / test_pd_stats_v2.

Polars autocleaning (on v2)

Re-implemented on the v2 pipeline (it previously only worked through the v1 PlDfStats executor): pl_cleaning_stats @stat + PL_AUTOCLEAN_DEFAULT_V2, a re-added PolarsAutocleaning (PlDfStatsV2 + polars make_origs), and PolarsBuckarooWidget wired to it.

Two latent regressions fixed (exposed by routing the pandas widget through @stat)

- base_summary_stats short-circuits value_counts on unhashable (list/dict/set) columns; restores #843 (perf: DefaultSummaryStats O(n²) on object columns with unhashable values (lists/dicts/sets)).
- make_origs checks add_orig by truthiness, so @stat cleaning funcs returning add_orig=False no longer add spurious _orig columns.

Notes

- The docs/example-notebooks/mo-autocleaning.py sample still references a removed v1 cleaning class; it's a docs notebook (not imported, not in CI).
- ruff + paddy_format --check clean.