Remove all vestiges of dfstats_v1 (full @stat port) #876

Open
paddymul wants to merge 6 commits into main from refactor/dfstats-v1-removal

@paddymul
Collaborator

Removes the entire legacy v1 DfStats / ColAnalysis-execution stack now that the v2 @stat StatPipeline is the only analysis path. (Replaces #875 — rebased onto current main for a clean diff.)

Removed

  • v1 executor: DfStats / AnalysisPipeline (deleted analysis_management.py), PlDfStats / PolarsAnalysisPipeline + the polars produce_* helpers, and the v1 order_analysis / check_solvable ordering.
  • v1 adapter: v1_adapter.py, the ColAnalysis branch in _normalize_inputs, and the v1_computed / spread_dict_result StatFunc flags.
  • compat shims: process_df_v1_compat / process_table_v1_compat / _find_v1_class. The DfStatsV2 / PlDfStatsV2 / XorqDfStatsV2 wrappers now call process_df / process_table and build the ErrDict via a new errors_to_errdict.
  • dead v1 ColAnalysis stat classes: TypingStats, DefaultSummaryStats, ComputedDefaultSummaryStats, PdCleaningStats, Histogram, HeuristicFracs, the cleaning-op-gen classes, and customizations/heuristics.py (the shared helper functions are kept).

Ported to @stat

  • BuckarooWidget, the headless server, and the pandas autocleaning configs run on PD_ANALYSIS_V2; added a cleaning_gen_ops @stat for the default autoclean op.
  • _normalize_inputs accepts structural ColAnalysis classes (styling, post-processing) as no-ops and rejects leftover v1 stat classes with a clear error.
  • All affected tests ported to @stat / StatPipeline.process_df; v1-mechanic and v1-vs-v2 parity tests dropped where covered by test_paf_v2 / test_pd_stats_v2.
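
The accept/reject behavior described for _normalize_inputs can be pictured with a small sketch. This is not buckaroo's implementation: the stand-in classes, the stat_func marker, and the use of provides_defaults to tell a structural class from a leftover v1 stat class are all assumptions made for illustration.

```python
class ColAnalysis:
    """Stand-in for the v1 base class; how the real code inspects it is assumed."""
    provides_defaults = {}


class StylingOnly(ColAnalysis):
    """Structural class: contributes styling, computes no stats."""


class LegacyStats(ColAnalysis):
    """Leftover v1 stat class: declares stats it would compute."""
    provides_defaults = {"mean": 0}


def stat_func(fn):
    """Hypothetical marker standing in for the @stat decorator."""
    fn.is_stat = True
    return fn


@stat_func
def null_count(values):
    return sum(v is None for v in values)


def normalize_inputs(items):
    """Keep @stat functions, skip structural classes, reject v1 stat classes."""
    funcs = []
    for item in items:
        if getattr(item, "is_stat", False):
            funcs.append(item)
        elif isinstance(item, type) and issubclass(item, ColAnalysis):
            if item.provides_defaults:
                raise TypeError(
                    f"{item.__name__} is a v1 stat class; port it to @stat"
                )
            # Structural class: accepted, but contributes no stat functions.
        else:
            raise TypeError(f"unsupported analysis input: {item!r}")
    return funcs


accepted = normalize_inputs([null_count, StylingOnly])
```

In this sketch, passing LegacyStats raises a TypeError naming the class, which is the kind of clear error the bullet above refers to.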

Polars autocleaning (on v2)

Re-implemented on the v2 pipeline (it previously only worked through the v1 PlDfStats executor): pl_cleaning_stats @stat + PL_AUTOCLEAN_DEFAULT_V2, a re-added PolarsAutocleaning (PlDfStatsV2 + polars make_origs), and PolarsBuckarooWidget wired to it.

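Two latent regressions fixed (exposed by routing the pandas widget through @stat)

  • base_summary_stats short-circuits value_counts on unhashable columns (#843).
  • make_origs checks add_orig by truthiness, so the @stat cleaning funcs' add_orig=False no longer adds spurious _orig columns.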

Notes

  • The docs/example-notebooks/mo-autocleaning.py sample still references a removed v1 cleaning class; it's a docs notebook (not imported, not in CI).
  • Full unit suite green (1050 passed, 0 failed); ruff + paddy_format --check clean.

🤖 Generated with Claude Code

paddymul and others added 4 commits May 31, 2026 11:54
Production now runs entirely on the v2 @stat StatPipeline, so remove the
legacy v1 pluggable-analysis stack:

- delete the v1 executor: DfStats/AnalysisPipeline (analysis_management.py),
  PlDfStats/PolarsAnalysisPipeline + polars produce_* helpers, and the v1
  order_analysis/check_solvable ordering
- delete the v1 ColAnalysis->StatFunc adapter (v1_adapter.py) and the
  v1_computed / spread_dict_result StatFunc flags
- drop process_df_v1_compat / process_table_v1_compat / _find_v1_class;
  DfStatsV2 / PlDfStatsV2 / XorqDfStatsV2 call process_df/process_table and
  build the ErrDict via the new errors_to_errdict helper
- wire BuckarooWidget, the headless server, and the pandas autocleaning
  configs onto PD_ANALYSIS_V2 (@stat); add a cleaning_gen_ops @stat for the
  default autoclean op
- _normalize_inputs accepts structural ColAnalysis classes (styling,
  post-processing) as no-ops and rejects leftover v1 stat classes

Fixes two latent issues exposed by routing the pandas widget through @stat:
base_summary_stats short-circuits value_counts on unhashable columns (#843),
and make_origs checks add_orig by truthiness so the @stat cleaning funcs'
add_orig=False no longer adds spurious _orig columns.
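
To illustrate the add_orig fix, here is a minimal sketch of why a truthiness check matters. The previous check is not shown in this thread; a key-presence test is assumed here as one plausible way add_orig=False could still produce an _orig column. The shape of cleaning_sd is also hypothetical.

```python
# Hypothetical summary-dict entries; the real buckaroo structures may differ.
cleaning_sd = {
    "a": {"orig_col_name": "price", "add_orig": True},
    "b": {"orig_col_name": "qty", "add_orig": False},
    "c": {"orig_col_name": "name"},  # flag absent
}


def cols_needing_orig_by_presence(sd):
    """Assumed buggy variant: any entry that merely has the key is opted in."""
    return [v["orig_col_name"] for v in sd.values() if "add_orig" in v]


def cols_needing_orig_by_truthiness(sd):
    """Fixed variant: only entries whose flag is truthy get an _orig column."""
    return [v["orig_col_name"] for v in sd.values() if v.get("add_orig")]


by_presence = cols_needing_orig_by_presence(cleaning_sd)
by_truthiness = cols_needing_orig_by_truthiness(cleaning_sd)
```

With the presence check, the qty column would gain a spurious qty_orig even though its flag is False; the truthiness check selects only price.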

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Replace v1 ColAnalysis / DfStats usage in the tests with @stat functions
run through StatPipeline.process_df / DfStatsV2:

- histogram, analysis, skip_columns, pd_stats_v2, paf_v2: use PD_ANALYSIS_V2
  / @stat funcs; drop the v1-vs-v2 parity (TestBackwardCompat) and v1-adapter
  tests now covered by the dedicated v2 suites
- autocleaning (pd / heuristic / scoped / sd_cache): configs use the @stat
  autoclean lists; local cleaning-op generators ported to @stat
- polars analysis-management / categorical-histogram: drive the kept
  column-executor path (polars_series_stats_from_select_result /
  ColumnExecutorDataflow) instead of the removed v1 polars executor
- polars autocleaning SD-channel/search tests run through PandasAutocleaning
- delete analysis_management_test.py (v1 executor, covered by test_paf_v2);
  keep the live utils tests from pluggable_analysis_framework_test.py

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Restore polars autocleaning on the v2 pipeline (it previously relied on
the removed v1 PlDfStats executor running select_clauses PolarsAnalysis):

- add pl_cleaning_stats @stat (int-parse fraction from a polars Series)
  and PL_AUTOCLEAN_DEFAULT_V2, reusing the backend-agnostic cleaning_gen_ops
- re-add PolarsAutocleaning (PlDfStatsV2 executor + polars make_origs that
  rebuilds the cleaned frame, add_orig checked by truthiness)
- PolarsBuckarooWidget uses PolarsAutocleaning so cleaning runs through the
  polars path
- restore the polars autocleaning tests (int-parse stats, op generation,
  handle_ops_and_clean, codegen, make_origs) on the @stat path

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Now that production and tests run on @stat, delete the superseded v1
ColAnalysis stat classes, keeping the pure helper functions they shared:

- analysis.py: drop TypingStats / DefaultSummaryStats /
  ComputedDefaultSummaryStats / PdCleaningStats (keep get_mode,
  _has_unhashable_values, probable_datetime)
- histogram.py: drop the Histogram class (keep categorical_histogram /
  numeric_histogram, used by pd_stats_v2 / polars_analysis)
- pd_fracs.py: drop HeuristicFracs and the Conservative/Aggressive cleaning
  classes (keep the frac helper functions)
- delete customizations/heuristics.py (BaseHeuristicCleaningGenOps +
  the unused invert_rewritten_orig)

geopandas (TypingStats) and the docs/example-notebooks autocleaning sample
still reference these; both are handled by the separate geopandas-removal PR
and are not in CI.
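
Of the helpers kept in analysis.py, _has_unhashable_values is presumably the kind of guard that the value_counts short-circuit for #843 relies on. Its real implementation is not shown in this thread, so the following is only a sketch of the idea using plain Python lists and collections.Counter instead of pandas; the function names and the empty-dict fallback are assumptions.

```python
from collections import Counter


def has_unhashable_values(values):
    """Return True if any value cannot be hashed (for example a list or dict)."""
    for v in values:
        try:
            hash(v)
        except TypeError:
            return True
    return False


def safe_value_counts(values):
    """Count values, short-circuiting to an empty result for unhashable data."""
    if has_unhashable_values(values):
        # Assumed fallback: skip counting instead of letting a TypeError escape.
        return {}
    return dict(Counter(values))


hashable_counts = safe_value_counts(["x", "y", "x"])
unhashable_counts = safe_value_counts([[1, 2], [3]])
```

Without the guard, Counter would raise TypeError on the list-valued column, which is analogous to the failure mode a value_counts call hits on unhashable cells.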

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

github-actions Bot commented May 31, 2026

📦 TestPyPI package published

pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.10.dev26719133492

or with uv:

uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.10.dev26719133492

MCP server for Claude Code

claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.14.10.dev26719133492" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table

📖 Docs preview

🎨 Storybook preview

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 450fdd7d4e

Comment thread buckaroo/polars_buckaroo.py Outdated
Comment on lines +52 to +53
  clauses.append(cleaned_df[col])
- clauses.append(raw_df[col].alias(col+"_orig"))
+ clauses.append(raw_df[col].alias(col + "_orig"))

P2: Use original names for v2 polars cleaning keys

When Polars autocleaning is run through PlDfStatsV2, the cleaning_sd entries are keyed by rewritten summary names like a/b, while cleaned_df and raw_df still have the user's original column names. For any dataframe whose columns are not already named a, b, etc., this indexes a non-existent column and handle_ops_and_clean(..., cleaning_method='default', ...) raises ColumnNotFoundError; use sd['orig_col_name'] (as the pandas path does) when building these clauses.

Collaborator Author

Fixed in 177bdb6. Confirmed the failure (a column not named a → ColumnNotFoundError: "a" not found, raised from handle_ops_and_clean(..., cleaning_method='default')). PolarsAutocleaning.make_origs now indexes cleaned_df/raw_df by sd['orig_col_name'] (mirroring PandasAutocleaning.make_origs) instead of the rewritten a/b/c sd key, with the same not-in-columns / index guards plus a dedupe. Added a regression test (test_autoclean_preserves_original_column_names) that autocleans a column named realname and asserts realname/realname_orig.
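
The shape of that fix can be sketched without polars. In the sketch below the frames are plain dicts of column name to values, the cleaning_sd layout is hypothetical, and make_origs_sketch is an illustrative stand-in, not the PolarsAutocleaning.make_origs code from the commit.

```python
# Stand-in "frames": plain dicts of column name -> values (not polars objects).
raw_df = {"realname": ["1", "2", "x"]}
cleaned_df = {"realname": [1, 2, None]}

# Hypothetical cleaning summary keyed by rewritten names, as the review describes.
cleaning_sd = {"a": {"orig_col_name": "realname", "add_orig": True}}


def make_origs_sketch(raw, cleaned, sd):
    """Pair each cleaned column with its raw values under '<name>_orig'."""
    out = {}
    for rewritten_key, col_sd in sd.items():
        # Look the column up by its original name, not by the rewritten key.
        col = col_sd.get("orig_col_name", rewritten_key)
        # Guards named in the fix: unknown columns, the index, and repeats.
        if col not in cleaned or col == "index" or col in out:
            continue
        out[col] = cleaned[col]
        if col_sd.get("add_orig"):
            out[col + "_orig"] = raw[col]
    return out


result = make_origs_sketch(raw_df, cleaned_df, cleaning_sd)
```

Indexing cleaned_df by the rewritten key "a" would fail here with a KeyError, the dict analogue of the ColumnNotFoundError reported in the review.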

PolarsAutocleaning.make_origs iterated cleaning_sd (keyed by buckaroo's
internal a/b/c names) and indexed cleaned_df / raw_df — which carry the
user's original column names — with those keys. Any column not literally
named a/b/c raised ColumnNotFoundError from
handle_ops_and_clean(..., cleaning_method='default').

Index by sd['orig_col_name'] (mirroring PandasAutocleaning.make_origs),
with the same not-in-columns / 'index' guards plus a dedupe. Adds a
regression test that cleans a column named 'realname'.

Addresses the codex P2 review comment on #876.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- remove dead tests/unit/polars_error_handling_comparison.py — it imported the
  deleted polars_produce_series_df and was not collected by pytest
- pd_stats_v2: import jlisp/heuristic_lang at module top and drop the silent
  ImportError->None fallback that would quietly empty the aggressive/conservative
  autoclean sets (these are core in-package modules, not optional deps)
- stat_pipeline: fix stale v1_adapter reference in the module docstring
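
The hazard behind the pd_stats_v2 change can be shown with a small sketch of the silent-fallback pattern. The loader functions and the op names are hypothetical and only stand in for the jlisp/heuristic_lang imports; they are not taken from the buckaroo source.

```python
import importlib


def load_optional(module_name):
    """Fallback pattern: swallow ImportError and hand back None."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def load_required(module_name):
    """Fail-fast pattern: let ImportError propagate to the caller."""
    return importlib.import_module(module_name)


def build_autoclean_set(lang_module):
    """Return cleaning-op names, or nothing when the module is missing."""
    if lang_module is None:
        return []  # silently empty: no error, no warning
    return ["strip_int_parse", "regular_int_parse"]  # hypothetical op names


silently_empty = build_autoclean_set(load_optional("no_such_module_xyz_123"))
populated = build_autoclean_set(load_required("json"))
```

For a core in-package module, the fail-fast form surfaces a broken install at import time instead of producing an autoclean configuration that quietly does nothing.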

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>