Remove all vestiges of dfstats_v1 (full @stat port) #875

Closed
paddymul wants to merge 9 commits into main from refactor/remove-dfstats-v1

@paddymul
Collaborator

Removes the entire legacy v1 DfStats / ColAnalysis-execution stack now that the v2 @stat StatPipeline is the only analysis path.

Removed

  • v1 executor: DfStats / AnalysisPipeline (deleted analysis_management.py), PlDfStats / PolarsAnalysisPipeline + the polars produce_* helpers, and the v1 order_analysis / check_solvable ordering.
  • v1 adapter: v1_adapter.py, the ColAnalysis branch in _normalize_inputs, and the v1_computed / spread_dict_result StatFunc flags.
  • compat shims: process_df_v1_compat / process_table_v1_compat / _find_v1_class. The DfStatsV2 / PlDfStatsV2 / XorqDfStatsV2 wrappers now call process_df / process_table and build the ErrDict via a new errors_to_errdict.
  • dead v1 ColAnalysis stat classes: TypingStats, DefaultSummaryStats, ComputedDefaultSummaryStats, PdCleaningStats, Histogram, HeuristicFracs, the cleaning-op-gen classes, and customizations/heuristics.py (the shared helper functions are kept).

Ported to @stat

  • BuckarooWidget, the headless server, and the pandas autocleaning configs now run on PD_ANALYSIS_V2; added a cleaning_gen_ops @stat for the default autoclean op. Polars/xorq were already @stat.
  • _normalize_inputs accepts structural ColAnalysis classes (styling, post-processing) as no-ops and rejects leftover v1 stat classes with a clear error.
  • All affected tests ported to @stat / StatPipeline.process_df; v1-mechanic and v1-vs-v2 parity tests dropped where covered by test_paf_v2 / test_pd_stats_v2.

Polars autocleaning

Re-implemented on the v2 pipeline (it previously only worked through the v1 PlDfStats executor): pl_cleaning_stats @stat + PL_AUTOCLEAN_DEFAULT_V2, a re-added PolarsAutocleaning (PlDfStatsV2 + polars make_origs), and PolarsBuckarooWidget wired to it.

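Two latent regressions fixed (exposed by routing the pandas widget through @stat)

  • base_summary_stats short-circuits value_counts on unhashable columns (#843).
  • make_origs checks add_orig by truthiness, so the @stat cleaning funcs' add_orig=False no longer adds spurious _orig columns.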

Notes

  • geopandas_buckaroo.py (TypingStats) and docs/example-notebooks/mo-autocleaning.py still reference removed v1 classes; both are handled by the separate geopandas-removal PR (which merges first) and are not in CI.
  • Full unit suite green (1022 passed); ruff + paddy_format --check clean.

🤖 Generated with Claude Code

paddymul and others added 9 commits May 29, 2026 19:06
…and xorq backends

Extends compare.py with per-column summary statistics and outer-join diffs
across three backends:

  pandas  — vectorised over all columns: one isnull pass, one nunique pass,
            one numeric agg across all numeric columns at once (instead of
            calling mean/min/max/sum per-column).
  polars  — three .select() calls; head_diff reads only N rows via
            pl.scan_parquet lazy scan, nothing materialised beyond the head.
  xorq    — one ibis aggregate expression covering every column, executed
            once per parquet file; total row count + all null/distinct/
            numeric stats in a single DuckDB query with no materialisation.
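
As a rough illustration of the pandas strategy described above (not the code in compare.py), the three vectorised passes could look like the sketch below. The function name column_summaries_sketch and the output column names are made up for this example.

```python
import pandas as pd


def column_summaries_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary built from three whole-frame passes.

    Hypothetical helper for illustration; names are not from buckaroo.
    """
    # Pass 1: null counts for every column at once.
    null_count = df.isnull().sum()
    # Pass 2: distinct counts for every column at once (NaN excluded by default).
    distinct_count = df.nunique()

    out = pd.DataFrame({"null_count": null_count, "distinct_count": distinct_count})

    # Pass 3: a single agg over all numeric columns instead of per-column calls.
    numeric = df.select_dtypes(include="number")
    if numeric.shape[1] > 0:
        # agg returns one row per function; transpose so rows are columns of df.
        numeric_stats = numeric.agg(["mean", "min", "max", "sum"]).T
        out = out.join(numeric_stats)  # non-numeric columns get NaN here
    return out
```

Non-numeric columns keep their null and distinct counts and simply carry NaN in the numeric stat columns.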

New public API (all gated on the matching optional extra):
  _column_summaries_pd / _infer_keys          — pandas internals
  _column_summaries_polars / _infer_keys_polars  [buckaroo[polars]]
  _column_summaries_xorq / _infer_keys_xorq      [buckaroo[xorq]]
  stats_diff / head_diff / key_diff           — pandas
  stats_diff_polars / head_diff_polars / key_diff_polars
  stats_diff_xorq / head_diff_xorq / key_diff_xorq

col_join_dfs is unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the <=50%-cardinality key heuristic (which mis-picked low-cardinality
categoricals and exploded the outer join) with _detect_pk_xorq/_rank_pk_xorq:
approximate-PK detection at a uniqueness threshold with a max-duplicate-group
guard against many-to-many blowup, all as streaming xorq aggregates.
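
To make the idea concrete, here is a minimal pandas sketch of threshold-based key detection with a duplicate-group guard. It is not the xorq implementation (which the commit describes as streaming aggregates); the function name and both default thresholds are assumptions chosen for the example.

```python
import pandas as pd


def rank_pk_candidates_sketch(
    df: pd.DataFrame,
    min_uniqueness: float = 0.99,  # assumed threshold, not from the source
    max_dup_group: int = 5,        # assumed guard value, not from the source
) -> list[tuple[str, float, int]]:
    """Return (column, uniqueness, largest_group) for plausible join keys."""
    n_rows = len(df)
    if n_rows == 0:
        return []

    ranked = []
    for col in df.columns:
        counts = df[col].value_counts(dropna=True)
        if counts.empty:
            continue
        uniqueness = len(counts) / n_rows    # distinct values per row
        largest_group = int(counts.max())    # worst-case fan-out in a join
        # A column qualifies only if it is nearly unique AND no single value
        # repeats often enough to blow up a many-to-many join.
        if uniqueness >= min_uniqueness and largest_group <= max_dup_group:
            ranked.append((col, uniqueness, largest_group))

    # Most unique first; ties broken by the smaller duplicate group.
    ranked.sort(key=lambda item: (-item[1], item[2]))
    return ranked
```

A low-cardinality categorical fails the uniqueness test outright, which is the failure mode the old <=50%-cardinality heuristic had.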

Make the xorq diff functions accept a path *or* an expression (_as_expr), so a
diff composes expr1 join expr2 and each side resolves its own cache rather
than reaching for a result.parquet. Rewrite key_diff_xorq from raw DuckDB SQL
to pure ibis, add a keys= param to skip detection when the caller already
knows the join key, and _align_backends to unify two independently-loaded
expressions only when they differ.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a skip_columns option that makes the stat pipeline emit a column's
structural metadata (name/dtype) but compute no stat expressions for it, so
the column's data is never scanned. Backend-agnostic: StatPipeline (pandas +
polars) and XorqStatPipeline (xorq) both honour it, threaded through
DfStatsV2 / PlDfStatsV2 / XorqDfStatsV2.

Surface it as a skip_stat_columns option on CustomizableDataflow (used by both
_get_summary_sd paths) and accept it on the /load_expr handler. Explicit
opt-in, separate from init_sd keys, so existing partial-init_sd display hints
still get their stats computed. Lets a comparison/diff reuse each source
column's already-cached stats instead of recomputing over the join.
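
The behaviour described here can be sketched in a few lines of pandas. This is only an illustration of the idea, not StatPipeline itself: the function name, the dictionary keys, and the two example stats are assumptions.

```python
import pandas as pd


def summarize_sketch(df: pd.DataFrame, skip_columns=()) -> dict:
    """Emit name/dtype for every column, but compute stats only for
    columns that are not skipped. Hypothetical helper for illustration."""
    skip = set(skip_columns)
    summary = {}
    for col in df.columns:
        # Structural metadata: reading the dtype does not scan the data.
        entry = {"name": col, "dtype": str(df[col].dtype)}
        if col not in skip:
            ser = df[col]
            # Only non-skipped columns pay for a pass over their values.
            entry["null_count"] = int(ser.isna().sum())
            entry["distinct_count"] = int(ser.nunique())
        summary[col] = entry
    return summary
```

A caller that already holds cached stats for a column can pass it in skip_columns and merge the cached values into the result afterwards.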

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Production now runs entirely on the v2 @stat StatPipeline, so remove the
legacy v1 pluggable-analysis stack:

- delete the v1 executor: DfStats/AnalysisPipeline (analysis_management.py),
  PlDfStats/PolarsAnalysisPipeline + polars produce_* helpers, and the v1
  order_analysis/check_solvable ordering
- delete the v1 ColAnalysis->StatFunc adapter (v1_adapter.py) and the
  v1_computed / spread_dict_result StatFunc flags
- drop process_df_v1_compat / process_table_v1_compat / _find_v1_class;
  DfStatsV2 / PlDfStatsV2 / XorqDfStatsV2 call process_df/process_table and
  build the ErrDict via the new errors_to_errdict helper
- wire BuckarooWidget, the headless server, and the pandas autocleaning
  configs onto PD_ANALYSIS_V2 (@stat); add a cleaning_gen_ops @stat for the
  default autoclean op
- _normalize_inputs accepts structural ColAnalysis classes (styling,
  post-processing) as no-ops and rejects leftover v1 stat classes

Fixes two latent issues exposed by routing the pandas widget through @stat:
base_summary_stats short-circuits value_counts on unhashable columns (#843),
and make_origs checks add_orig by truthiness so the @stat cleaning funcs'
add_orig=False no longer adds spurious _orig columns.
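
A minimal sketch of both kinds of guard, under stated assumptions: safe_value_counts_sketch, has_unhashable_sketch and wants_orig_sketch are hypothetical names, and the metadata dict shape is invented. The source does not show what the earlier add_orig check looked like; a key-presence test is just one way such a bug can arise.

```python
import pandas as pd


def has_unhashable_sketch(ser: pd.Series) -> bool:
    """True if any non-null value cannot be hashed (e.g. a list or dict)."""
    for val in ser.dropna():
        try:
            hash(val)
        except TypeError:
            return True
    return False


def safe_value_counts_sketch(ser: pd.Series) -> pd.Series:
    """Short-circuit instead of letting value_counts raise TypeError."""
    if ser.dtype == object and has_unhashable_sketch(ser):
        return pd.Series(dtype="int64")  # empty result for unhashable data
    return ser.value_counts()


def wants_orig_sketch(op_meta: dict) -> bool:
    """Truthiness check: add_orig=False and a missing key both mean 'no'.

    A presence check such as `"add_orig" in op_meta` would wrongly treat
    add_orig=False as a request for an _orig column.
    """
    return bool(op_meta.get("add_orig"))
```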

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace v1 ColAnalysis / DfStats usage in the tests with @stat functions
run through StatPipeline.process_df / DfStatsV2:

- histogram, analysis, skip_columns, pd_stats_v2, paf_v2: use PD_ANALYSIS_V2
  / @stat funcs; drop the v1-vs-v2 parity (TestBackwardCompat) and v1-adapter
  tests now covered by the dedicated v2 suites
- autocleaning (pd / heuristic / scoped / sd_cache): configs use the @stat
  autoclean lists; local cleaning-op generators ported to @stat
- polars analysis-management / categorical-histogram: drive the kept
  column-executor path (polars_series_stats_from_select_result /
  ColumnExecutorDataflow) instead of the removed v1 polars executor
- polars autocleaning SD-channel/search tests run through PandasAutocleaning
- delete analysis_management_test.py (v1 executor, covered by test_paf_v2);
  keep the live utils tests from pluggable_analysis_framework_test.py

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Restore polars autocleaning on the v2 pipeline (it previously relied on
the removed v1 PlDfStats executor running select_clauses PolarsAnalysis):

- add pl_cleaning_stats @stat (int-parse fraction from a polars Series)
  and PL_AUTOCLEAN_DEFAULT_V2, reusing the backend-agnostic cleaning_gen_ops
- re-add PolarsAutocleaning (PlDfStatsV2 executor + polars make_origs that
  rebuilds the cleaned frame, add_orig checked by truthiness)
- PolarsBuckarooWidget uses PolarsAutocleaning so cleaning runs through the
  polars path
- restore the polars autocleaning tests (int-parse stats, op generation,
  handle_ops_and_clean, codegen, make_origs) on the @stat path

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Now that production and tests run on @stat, delete the superseded v1
ColAnalysis stat classes, keeping the pure helper functions they shared:

- analysis.py: drop TypingStats / DefaultSummaryStats /
  ComputedDefaultSummaryStats / PdCleaningStats (keep get_mode,
  _has_unhashable_values, probable_datetime)
- histogram.py: drop the Histogram class (keep categorical_histogram /
  numeric_histogram, used by pd_stats_v2 / polars_analysis)
- pd_fracs.py: drop HeuristicFracs and the Conservative/Aggressive cleaning
  classes (keep the frac helper functions)
- delete customizations/heuristics.py (BaseHeuristicCleaningGenOps +
  the unused invert_rewritten_orig)

geopandas (TypingStats) and the docs/example-notebooks autocleaning sample
still reference these; both are handled by the separate geopandas-removal PR
and are not in CI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

https://github.com/buckaroo-data/buckaroo/blob/85ab2038e48eb011ab142ba11240248ee7bceb8a/customizations/pd_stats_v2.py#L340-L341
[P2] Restore the removed typing stat export

The old TypingStats symbol is still imported by buckaroo/geopandas_buckaroo.py:8, which is reached by the registered GeoPandas display formatter in widget_utils.py:87-90 and by documented GeoPandas notebooks. After this port removes the class from customizations.analysis, displaying or directly importing the GeoPandas widget raises ImportError before any widget can render; either update that import to the v2 stat functions or keep a compatibility alias/export.

Comment thread buckaroo/compare.py
Comment on lines +190 to +192
a_only_col = next((c for c in a.columns if c not in keys), None)
b_col_after = f"{a_only_col}_after" if a_only_col else None
if a_only_col and b_col_after in merged.columns:

[P2] Use side markers for Polars key diffs

When the two Polars frames do not share the same first non-key column, b_col_after is absent and this falls through to only_before = only_after = 0 and matched = len(merged), so a full outer diff with keys present on only one side is reported as fully matched. This also happens for key-only frames; add explicit side marker columns before the join or probe independent a/b columns so one-sided keys are counted correctly.

@paddymul
Collaborator Author

Superseded by #876 — rebased onto current main (after #872 and the geopandas removal #874) so the diff shows only the dfstats_v1 removal. Closing this one.

@paddymul paddymul closed this May 31, 2026
@paddymul paddymul deleted the refactor/remove-dfstats-v1 branch May 31, 2026 15:57
1 participant