Remove all vestiges of dfstats_v1 (full @stat port) by paddymul · Pull Request #875 · buckaroo-data/buckaroo

paddymul · 2026-05-31T15:49:09Z

Removes the entire legacy v1 DfStats / ColAnalysis-execution stack now that the v2 @stat StatPipeline is the only analysis path.

Removed

v1 executor: DfStats / AnalysisPipeline (deleted analysis_management.py), PlDfStats / PolarsAnalysisPipeline + the polars produce_* helpers, and the v1 order_analysis / check_solvable ordering.
v1 adapter: v1_adapter.py, the ColAnalysis branch in _normalize_inputs, and the v1_computed / spread_dict_result StatFunc flags.
compat shims: process_df_v1_compat / process_table_v1_compat / _find_v1_class. The DfStatsV2 / PlDfStatsV2 / XorqDfStatsV2 wrappers now call process_df / process_table and build the ErrDict via a new errors_to_errdict.
dead v1 ColAnalysis stat classes: TypingStats, DefaultSummaryStats, ComputedDefaultSummaryStats, PdCleaningStats, Histogram, HeuristicFracs, the cleaning-op-gen classes, and customizations/heuristics.py (the shared helper functions are kept).

Ported to @stat

BuckarooWidget, the headless server, and the pandas autocleaning configs now run on PD_ANALYSIS_V2; added a cleaning_gen_ops @stat for the default autoclean op. Polars/xorq were already @stat.
_normalize_inputs accepts structural ColAnalysis classes (styling, post-processing) as no-ops and rejects leftover v1 stat classes with a clear error.
All affected tests ported to @stat / StatPipeline.process_df; v1-mechanic and v1-vs-v2 parity tests dropped where covered by test_paf_v2 / test_pd_stats_v2.

Polars autocleaning

Re-implemented on the v2 pipeline (it previously only worked through the v1 PlDfStats executor): pl_cleaning_stats @stat + PL_AUTOCLEAN_DEFAULT_V2, a re-added PolarsAutocleaning (PlDfStatsV2 + polars make_origs), and PolarsBuckarooWidget wired to it.

Two latent regressions fixed (exposed by routing the pandas widget through @stat)

base_summary_stats short-circuits value_counts on unhashable (list/dict/set) columns, restoring #843 ("perf: DefaultSummaryStats O(n²) on object columns with unhashable values (lists/dicts/sets)").
make_origs checks add_orig by truthiness, so @stat cleaning funcs returning add_orig=False no longer add spurious _orig columns.
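A minimal sketch of the add_orig truthiness check described above. It is not the buckaroo implementation: the function names and the dict-shaped cleaning metadata are hypothetical, and the key-presence variant is only one assumed way the spurious _orig columns could have arisen.

```python
def columns_needing_orig(cleaning_meta):
    """Return the columns whose cleaning metadata asks for an `_orig` copy.

    `cleaning_meta` maps column name -> dict of cleaning results
    (a hypothetical shape, used only for this illustration).
    """
    return [
        col
        for col, meta in cleaning_meta.items()
        # Truthiness check: a missing key, None, and False all mean "no copy".
        if meta.get("add_orig")
    ]


def columns_needing_orig_by_presence(cleaning_meta):
    """Key-presence variant, shown only to illustrate the spurious columns."""
    # add_orig=False still has the key, so the column is (wrongly) selected.
    return [col for col, meta in cleaning_meta.items() if "add_orig" in meta]


meta = {
    "price": {"add_orig": True},
    "qty": {"add_orig": False},
    "name": {},
}
```

With this metadata the truthiness check selects only `price`, while the key-presence variant also selects `qty`.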

Notes

geopandas_buckaroo.py (TypingStats) and docs/example-notebooks/mo-autocleaning.py still reference removed v1 classes; both are handled by the separate geopandas-removal PR (which merges first) and are not in CI.
Full unit suite green (1022 passed); ruff + paddy_format --check clean.

🤖 Generated with Claude Code

…and xorq backends

Extends compare.py with per-column summary statistics and outer-join diffs across three backends:

- pandas: vectorised over all columns: one isnull pass, one nunique pass, one numeric agg across all numeric columns at once (instead of calling mean/min/max/sum per-column).
- polars: three .select() calls; head_diff reads only N rows via pl.scan_parquet lazy scan, nothing materialised beyond the head.
- xorq: one ibis aggregate expression covering every column, executed once per parquet file; total row count + all null/distinct/numeric stats in a single DuckDB query with no materialisation.

New public API (all gated on the matching optional extra):

- _column_summaries_pd / _infer_keys: pandas internals
- _column_summaries_polars / _infer_keys_polars [buckaroo[polars]]
- _column_summaries_xorq / _infer_keys_xorq [buckaroo[xorq]]
- stats_diff / head_diff / key_diff: pandas
- stats_diff_polars / head_diff_polars / key_diff_polars
- stats_diff_xorq / head_diff_xorq / key_diff_xorq

col_join_dfs is unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace the <=50%-cardinality key heuristic (which mis-picked low-cardinality categoricals and exploded the outer join) with _detect_pk_xorq/_rank_pk_xorq: approximate-PK detection at a uniqueness threshold with a max-duplicate-group guard against many-to-many blowup, all as streaming xorq aggregates.

Make the xorq diff functions accept a path *or* an expression (_as_expr), so a diff composes expr1 join expr2 and each side resolves its own cache rather than reaching for a result.parquet.

Rewrite key_diff_xorq from raw DuckDB SQL to pure ibis, add a keys= param to skip detection when the caller already knows the join key, and _align_backends to unify two independently-loaded expressions only when they differ.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
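The approximate-PK idea in this commit message (a uniqueness threshold plus a max-duplicate-group guard) can be sketched in plain Python. This is an in-memory illustration, not the streaming xorq aggregates the commit describes; `rank_pk_candidates` and both default thresholds are hypothetical.

```python
from collections import Counter


def rank_pk_candidates(rows, uniqueness_threshold=0.95, max_dup_group=5):
    """Rank columns as approximate primary keys.

    rows: list of dicts that all share the same keys.
    A column qualifies when distinct/total >= uniqueness_threshold AND its
    largest group of duplicates is no bigger than max_dup_group; the second
    condition guards against a many-to-many blowup in the join.
    """
    total = len(rows)
    if total == 0:
        return []
    candidates = []
    for col in rows[0]:
        counts = Counter(row[col] for row in rows)
        uniqueness = len(counts) / total
        largest_group = max(counts.values())
        if uniqueness >= uniqueness_threshold and largest_group <= max_dup_group:
            candidates.append((col, uniqueness, largest_group))
    # Most unique first; ties broken by the smaller duplicate group.
    candidates.sort(key=lambda c: (-c[1], c[2]))
    return candidates


rows = [{"id": i, "category": "a" if i % 2 else "b"} for i in range(10)]
best = rank_pk_candidates(rows)
```

Here `category` has only two distinct values in ten rows, so it falls below the threshold and only `id` survives as a candidate.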

Add a skip_columns option that makes the stat pipeline emit a column's structural metadata (name/dtype) but compute no stat expressions for it, so the column's data is never scanned.

Backend-agnostic: StatPipeline (pandas + polars) and XorqStatPipeline (xorq) both honour it, threaded through DfStatsV2 / PlDfStatsV2 / XorqDfStatsV2. Surface it as a skip_stat_columns option on CustomizableDataflow (used by both _get_summary_sd paths) and accept it on the /load_expr handler.

Explicit opt-in, separate from init_sd keys, so existing partial-init_sd display hints still get their stats computed. Lets a comparison/diff reuse each source column's already-cached stats instead of recomputing over the join.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
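A rough sketch of the skip_columns behaviour described in this commit message: skipped columns keep their structural metadata but their values are never read. The `summarize` function and the table shape are hypothetical and stand in for the real StatPipeline, whose API is not shown here.

```python
def summarize(table, skip_columns=()):
    """Build a per-column summary dict.

    table maps column name -> {"dtype": str, "values": list}
    (an assumed shape, so dtype is available without touching the data).
    """
    skip = set(skip_columns)
    summary = {}
    for name, col in table.items():
        # Structural metadata is always emitted.
        entry = {"name": name, "dtype": col["dtype"]}
        if name not in skip:
            # Only non-skipped columns have their values scanned.
            values = col["values"]
            entry["null_count"] = sum(v is None for v in values)
            entry["distinct_count"] = len({v for v in values if v is not None})
        summary[name] = entry
    return summary


table = {
    "a": {"dtype": "int64", "values": [1, None, 1]},
    "b": {"dtype": "string", "values": ["x", "y"]},
}
out = summarize(table, skip_columns=["b"])
```

Column `b` comes back with only `name` and `dtype`, which is what would let a diff reuse stats that were already cached for the source column.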

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@stat

Production now runs entirely on the v2 @stat StatPipeline, so remove the legacy v1 pluggable-analysis stack:

- delete the v1 executor: DfStats/AnalysisPipeline (analysis_management.py), PlDfStats/PolarsAnalysisPipeline + polars produce_* helpers, and the v1 order_analysis/check_solvable ordering
- delete the v1 ColAnalysis->StatFunc adapter (v1_adapter.py) and the v1_computed / spread_dict_result StatFunc flags
- drop process_df_v1_compat / process_table_v1_compat / _find_v1_class; DfStatsV2 / PlDfStatsV2 / XorqDfStatsV2 call process_df/process_table and build the ErrDict via the new errors_to_errdict helper
- wire BuckarooWidget, the headless server, and the pandas autocleaning configs onto PD_ANALYSIS_V2 (@stat); add a cleaning_gen_ops @stat for the default autoclean op
- _normalize_inputs accepts structural ColAnalysis classes (styling, post-processing) as no-ops and rejects leftover v1 stat classes

Fixes two latent issues exposed by routing the pandas widget through @stat: base_summary_stats short-circuits value_counts on unhashable columns (#843), and make_origs checks add_orig by truthiness so the @stat cleaning funcs' add_orig=False no longer adds spurious _orig columns.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
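The value_counts short-circuit mentioned in this commit message can be illustrated without pandas. This is a sketch under assumptions, not the base_summary_stats code: `has_unhashable_values` and `safe_value_counts` are hypothetical names, and a `Counter` stands in for value_counts.

```python
from collections import Counter


def has_unhashable_values(values):
    """True if any value is a list, dict, or set (stops at the first hit)."""
    return any(isinstance(v, (list, dict, set)) for v in values)


def safe_value_counts(values):
    """Count values, or return None when the column cannot be hashed.

    Returning early avoids attempting a count that would raise TypeError
    or fall back to a slow comparison-based path.
    """
    if has_unhashable_values(values):
        return None
    return Counter(values)


hashable_counts = safe_value_counts(["a", "b", "a"])
unhashable_counts = safe_value_counts([[1], [2], [1]])
```

The caller then treats `None` as "no value counts for this column" and skips any stats that depend on them.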

@stat

Replace v1 ColAnalysis / DfStats usage in the tests with @stat functions run through StatPipeline.process_df / DfStatsV2:

- histogram, analysis, skip_columns, pd_stats_v2, paf_v2: use PD_ANALYSIS_V2 / @stat funcs; drop the v1-vs-v2 parity (TestBackwardCompat) and v1-adapter tests now covered by the dedicated v2 suites
- autocleaning (pd / heuristic / scoped / sd_cache): configs use the @stat autoclean lists; local cleaning-op generators ported to @stat
- polars analysis-management / categorical-histogram: drive the kept column-executor path (polars_series_stats_from_select_result / ColumnExecutorDataflow) instead of the removed v1 polars executor
- polars autocleaning SD-channel/search tests run through PandasAutocleaning
- delete analysis_management_test.py (v1 executor, covered by test_paf_v2); keep the live utils tests from pluggable_analysis_framework_test.py

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@stat

Restore polars autocleaning on the v2 pipeline (it previously relied on the removed v1 PlDfStats executor running select_clauses PolarsAnalysis):

- add pl_cleaning_stats @stat (int-parse fraction from a polars Series) and PL_AUTOCLEAN_DEFAULT_V2, reusing the backend-agnostic cleaning_gen_ops
- re-add PolarsAutocleaning (PlDfStatsV2 executor + polars make_origs that rebuilds the cleaned frame, add_orig checked by truthiness)
- PolarsBuckarooWidget uses PolarsAutocleaning so cleaning runs through the polars path
- restore the polars autocleaning tests (int-parse stats, op generation, handle_ops_and_clean, codegen, make_origs) on the @stat path

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@stat

Now that production and tests run on @stat, delete the superseded v1 ColAnalysis stat classes, keeping the pure helper functions they shared:

- analysis.py: drop TypingStats / DefaultSummaryStats / ComputedDefaultSummaryStats / PdCleaningStats (keep get_mode, _has_unhashable_values, probable_datetime)
- histogram.py: drop the Histogram class (keep categorical_histogram / numeric_histogram, used by pd_stats_v2 / polars_analysis)
- pd_fracs.py: drop HeuristicFracs and the Conservative/Aggressive cleaning classes (keep the frac helper functions)
- delete customizations/heuristics.py (BaseHeuristicCleaningGenOps + the unused invert_rewritten_orig)

geopandas (TypingStats) and the docs/example-notebooks autocleaning sample still reference these; both are handled by the separate geopandas-removal PR and are not in CI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

https://github.com/buckaroo-data/buckaroo/blob/85ab2038e48eb011ab142ba11240248ee7bceb8a/customizations/pd_stats_v2.py#L340-L341
Restore the removed typing stat export

The old TypingStats symbol is still imported by buckaroo/geopandas_buckaroo.py:8, which is reached by the registered GeoPandas display formatter in widget_utils.py:87-90 and by documented GeoPandas notebooks. After this port removes the class from customizations.analysis, displaying or directly importing the GeoPandas widget raises ImportError before any widget can render; either update that import to the v2 stat functions or keep a compatibility alias/export.


chatgpt-codex-connector · 2026-05-31T15:52:45Z

+    a_only_col = next((c for c in a.columns if c not in keys), None)
+    b_col_after = f"{a_only_col}_after" if a_only_col else None
+    if a_only_col and b_col_after in merged.columns:


Use side markers for Polars key diffs

When the two Polars frames do not share the same first non-key column, b_col_after is absent and this falls through to only_before = only_after = 0 and matched = len(merged), so a full outer diff with keys present on only one side is reported as fully matched. This also happens for key-only frames; add explicit side marker columns before the join or probe independent a/b columns so one-sided keys are counted correctly.
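The side-marker suggestion can be sketched independently of Polars. The version below is a plain-Python illustration of the counting logic only, assuming rows are dicts; `key_diff_counts` and the `in_a` / `in_b` marker names are hypothetical and not part of compare.py.

```python
def key_diff_counts(a_rows, b_rows, keys):
    """Count keys present only in a, only in b, and in both.

    Each side sets its own marker, so the result does not depend on the two
    frames sharing any non-key column, and key-only frames work too.
    """
    def key_of(row):
        return tuple(row[k] for k in keys)

    # Emulates a full outer join on the keys, carrying one marker per side.
    merged = {}
    for row in a_rows:
        merged.setdefault(key_of(row), {"in_a": False, "in_b": False})["in_a"] = True
    for row in b_rows:
        merged.setdefault(key_of(row), {"in_a": False, "in_b": False})["in_b"] = True

    marks = list(merged.values())
    return {
        "only_before": sum(m["in_a"] and not m["in_b"] for m in marks),
        "only_after": sum(m["in_b"] and not m["in_a"] for m in marks),
        "matched": sum(m["in_a"] and m["in_b"] for m in marks),
    }


# Key-only frames: the case the review says is reported as fully matched.
a = [{"id": 1}, {"id": 2}]
b = [{"id": 2}, {"id": 3}, {"id": 4}]
counts = key_diff_counts(a, b, keys=["id"])
```

In a dataframe backend the same idea would add a constant marker column to each side before the outer join and count nulls in those markers afterwards.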


paddymul · 2026-05-31T15:57:29Z

Superseded by #876 — rebased onto current main (after #872 and the geopandas removal #874) so the diff shows only the dfstats_v1 removal. Closing this one.

paddymul and others added 9 commits May 29, 2026 19:06

fix: remove unused ibis import flagged by ruff

54cd1e6

test(compare): drop dead scaffolding and semicolon flagged by ruff

273c9a7

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector Bot reviewed May 31, 2026


paddymul mentioned this pull request May 31, 2026

Remove all vestiges of dfstats_v1 (full @stat port) #876

Open

paddymul closed this May 31, 2026

paddymul deleted the refactor/remove-dfstats-v1 branch May 31, 2026 15:57

paddymul wants to merge 9 commits into main from refactor/remove-dfstats-v1