Remove all vestiges of dfstats_v1 (full @stat port) #875
…and xorq backends
Extends compare.py with per-column summary statistics and outer-join diffs
across three backends:
pandas — vectorised over all columns: one isnull pass, one nunique pass,
one numeric agg across all numeric columns at once (instead of
calling mean/min/max/sum per-column).
polars — three .select() calls; head_diff reads only N rows via
pl.scan_parquet lazy scan, nothing materialised beyond the head.
xorq — one ibis aggregate expression covering every column, executed
once per parquet file; total row count + all null/distinct/
numeric stats in a single DuckDB query with no materialisation.
New public API (all gated on the matching optional extra):
_column_summaries_pd / _infer_keys — pandas internals
_column_summaries_polars / _infer_keys_polars [buckaroo[polars]]
_column_summaries_xorq / _infer_keys_xorq [buckaroo[xorq]]
stats_diff / head_diff / key_diff — pandas
stats_diff_polars / head_diff_polars / key_diff_polars
stats_diff_xorq / head_diff_xorq / key_diff_xorq
col_join_dfs is unchanged.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
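The pandas strategy described in this commit (one isnull pass, one nunique pass, one numeric agg across all numeric columns) can be sketched roughly as follows. `column_summaries` is a hypothetical stand-in, not the actual `_column_summaries_pd` implementation, and the output dict layout is an assumption made for illustration.

```python
import pandas as pd


def column_summaries(df: pd.DataFrame) -> dict:
    """Hypothetical helper: per-column stats from three whole-frame passes."""
    null_counts = df.isnull().sum()      # one isnull pass over every column
    distinct_counts = df.nunique()       # one nunique pass (nulls excluded by default)
    numeric = df.select_dtypes(include="number")
    # One agg call covering all numeric columns at once, instead of
    # calling mean/min/max/sum separately for each column.
    numeric_stats = None if numeric.empty else numeric.agg(["mean", "min", "max", "sum"])

    summaries = {}
    for col in df.columns:
        entry = {
            "dtype": str(df[col].dtype),
            "null_count": int(null_counts[col]),
            "distinct_count": int(distinct_counts[col]),
        }
        if numeric_stats is not None and col in numeric_stats.columns:
            entry.update({stat: float(v) for stat, v in numeric_stats[col].items()})
        summaries[col] = entry
    return summaries


df = pd.DataFrame(
    {"id": [1, 2, 3, 4], "score": [1.5, None, 2.5, 4.0], "city": ["a", "b", "a", None]}
)
summary = column_summaries(df)
```

The per-column loop only reads from the precomputed results, so the frame itself is scanned a fixed number of times regardless of column count.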
Replace the <=50%-cardinality key heuristic (which mis-picked low-cardinality categoricals and exploded the outer join) with _detect_pk_xorq/_rank_pk_xorq: approximate-PK detection at a uniqueness threshold with a max-duplicate-group guard against many-to-many blowup, all as streaming xorq aggregates. Make the xorq diff functions accept a path *or* an expression (_as_expr), so a diff composes expr1 join expr2 and each side resolves its own cache rather than reaching for a result.parquet. Rewrite key_diff_xorq from raw DuckDB SQL to pure ibis, add a keys= param to skip detection when the caller already knows the join key, and _align_backends to unify two independently-loaded expressions only when they differ. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
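As an illustration of the approximate-PK idea in this commit, here is a minimal plain-Python sketch. It is not the `_detect_pk_xorq` / `_rank_pk_xorq` code, which runs as streaming xorq aggregates; the function name, parameter names, and default thresholds below are all assumptions.

```python
from collections import Counter


def rank_pk_candidates(rows, columns, min_uniqueness=0.95, max_dup_group=2):
    """Rank single-column join-key candidates (hypothetical helper).

    A column qualifies when distinct/total >= min_uniqueness and no single
    value repeats more than max_dup_group times. The second condition is
    the guard against a many-to-many join blowing up.
    """
    total = len(rows)
    ranked = []
    for col in columns:
        counts = Counter(row[col] for row in rows)
        uniqueness = len(counts) / total if total else 0.0
        largest_group = max(counts.values(), default=0)
        if uniqueness >= min_uniqueness and largest_group <= max_dup_group:
            ranked.append((uniqueness, col))
    # Most unique first; ties broken by column name for determinism.
    return [col for _, col in sorted(ranked, key=lambda t: (-t[0], t[1]))]


# 100 rows: order_id is unique except for one duplicate; status has 2 values.
rows = [{"order_id": i, "status": "open" if i % 2 else "closed"} for i in range(99)]
rows.append({"order_id": 0, "status": "open"})

candidates = rank_pk_candidates(rows, ["order_id", "status"])
strict = rank_pk_candidates(rows, ["order_id", "status"], max_dup_group=1)
```

Here `status` is rejected on both counts (2% uniqueness, groups of about 50 rows), which is the kind of low-cardinality categorical the old heuristic reportedly mis-picked.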
Add a skip_columns option that makes the stat pipeline emit a column's structural metadata (name/dtype) but compute no stat expressions for it, so the column's data is never scanned. Backend-agnostic: StatPipeline (pandas + polars) and XorqStatPipeline (xorq) both honour it, threaded through DfStatsV2 / PlDfStatsV2 / XorqDfStatsV2. Surface it as a skip_stat_columns option on CustomizableDataflow (used by both _get_summary_sd paths) and accept it on the /load_expr handler. Explicit opt-in, separate from init_sd keys, so existing partial-init_sd display hints still get their stats computed. Lets a comparison/diff reuse each source column's already-cached stats instead of recomputing over the join. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
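The skip_columns behaviour described above (structural metadata emitted, no data scanned) can be sketched in plain Python. Everything here is hypothetical: `summarize`, the `schema` dict, and the `load_column` callback stand in for the real StatPipeline machinery, which this sketch does not reproduce.

```python
def summarize(schema, load_column, skip_columns=()):
    """schema: column name -> dtype string; load_column: name -> list of values."""
    skip = set(skip_columns)
    summary = {}
    for name, dtype in schema.items():
        entry = {"name": name, "dtype": dtype}  # structural metadata only
        if name not in skip:
            values = load_column(name)          # the only place data is read
            non_null = [v for v in values if v is not None]
            entry["null_count"] = len(values) - len(non_null)
            entry["distinct_count"] = len(set(non_null))
        summary[name] = entry
    return summary


data = {"id": [1, 2, 3], "payload": ["x", None, "x"]}
scanned = []  # records which columns were actually read


def load_column(name):
    scanned.append(name)
    return data[name]


result = summarize(
    {"id": "int64", "payload": "string"}, load_column, skip_columns=["payload"]
)
```

After the call, `scanned` contains only `"id"`, while `result["payload"]` still carries its name and dtype, so a caller can merge in previously cached stats for that column.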
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Production now runs entirely on the v2 @stat StatPipeline, so remove the legacy v1 pluggable-analysis stack:
- delete the v1 executor: DfStats/AnalysisPipeline (analysis_management.py), PlDfStats/PolarsAnalysisPipeline + polars produce_* helpers, and the v1 order_analysis/check_solvable ordering
- delete the v1 ColAnalysis->StatFunc adapter (v1_adapter.py) and the v1_computed / spread_dict_result StatFunc flags
- drop process_df_v1_compat / process_table_v1_compat / _find_v1_class; DfStatsV2 / PlDfStatsV2 / XorqDfStatsV2 call process_df/process_table and build the ErrDict via the new errors_to_errdict helper
- wire BuckarooWidget, the headless server, and the pandas autocleaning configs onto PD_ANALYSIS_V2 (@stat); add a cleaning_gen_ops @stat for the default autoclean op
- _normalize_inputs accepts structural ColAnalysis classes (styling, post-processing) as no-ops and rejects leftover v1 stat classes
Fixes two latent issues exposed by routing the pandas widget through @stat: base_summary_stats short-circuits value_counts on unhashable columns (#843), and make_origs checks add_orig by truthiness so the @stat cleaning funcs' add_orig=False no longer adds spurious _orig columns.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
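The two latent fixes named in this commit can be illustrated with a small plain-Python sketch. These are not the buckaroo functions: `has_unhashable_values` only mirrors the idea behind the `_has_unhashable_values` helper, and representing a cleaning op's result as a dict with an `add_orig` key is an assumption.

```python
from collections import Counter


def has_unhashable_values(values):
    """True if any value cannot be hashed (list, dict, set, ...)."""
    for v in values:
        try:
            hash(v)
        except TypeError:
            return True
    return False


def value_counts_or_none(values):
    # Short-circuit: skip counting entirely when values cannot be hashed,
    # rather than falling back to a slow comparison-based path.
    if has_unhashable_values(values):
        return None
    return Counter(values)


def wants_orig(op_result):
    # Truthiness, not key presence: add_orig=False and a missing key
    # both mean "do not add an _orig column".
    return bool(op_result.get("add_orig"))
```

A key-presence check such as `"add_orig" in op_result` would treat `add_orig=False` as a request for an `_orig` column, which matches the spurious-column symptom the commit describes.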
Replace v1 ColAnalysis / DfStats usage in the tests with @stat functions run through StatPipeline.process_df / DfStatsV2:
- histogram, analysis, skip_columns, pd_stats_v2, paf_v2: use PD_ANALYSIS_V2 / @stat funcs; drop the v1-vs-v2 parity (TestBackwardCompat) and v1-adapter tests now covered by the dedicated v2 suites
- autocleaning (pd / heuristic / scoped / sd_cache): configs use the @stat autoclean lists; local cleaning-op generators ported to @stat
- polars analysis-management / categorical-histogram: drive the kept column-executor path (polars_series_stats_from_select_result / ColumnExecutorDataflow) instead of the removed v1 polars executor
- polars autocleaning SD-channel/search tests run through PandasAutocleaning
- delete analysis_management_test.py (v1 executor, covered by test_paf_v2); keep the live utils tests from pluggable_analysis_framework_test.py
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Restore polars autocleaning on the v2 pipeline (it previously relied on the removed v1 PlDfStats executor running select_clauses PolarsAnalysis):
- add pl_cleaning_stats @stat (int-parse fraction from a polars Series) and PL_AUTOCLEAN_DEFAULT_V2, reusing the backend-agnostic cleaning_gen_ops
- re-add PolarsAutocleaning (PlDfStatsV2 executor + polars make_origs that rebuilds the cleaned frame, add_orig checked by truthiness)
- PolarsBuckarooWidget uses PolarsAutocleaning so cleaning runs through the polars path
- restore the polars autocleaning tests (int-parse stats, op generation, handle_ops_and_clean, codegen, make_origs) on the @stat path
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
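To make "int-parse fraction" concrete, here is a minimal plain-Python sketch of such a statistic. It is not the pl_cleaning_stats implementation, which works on a polars Series. The function name is hypothetical, and using the non-null count as the denominator is an assumption.

```python
def int_parse_fraction(values):
    """Fraction of non-null values whose string form parses as an int."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0

    def parses(v):
        try:
            int(str(v).strip())
            return True
        except ValueError:
            return False

    return sum(parses(v) for v in non_null) / len(non_null)


mostly_ints = int_parse_fraction(["1", " 2 ", "x", None])
no_ints = int_parse_fraction(["1.5", "abc"])
```

A cleaning-op generator could compare a fraction like this against a threshold to decide whether to propose an integer cast for the column.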
Now that production and tests run on @stat, delete the superseded v1 ColAnalysis stat classes, keeping the pure helper functions they shared:
- analysis.py: drop TypingStats / DefaultSummaryStats / ComputedDefaultSummaryStats / PdCleaningStats (keep get_mode, _has_unhashable_values, probable_datetime)
- histogram.py: drop the Histogram class (keep categorical_histogram / numeric_histogram, used by pd_stats_v2 / polars_analysis)
- pd_fracs.py: drop HeuristicFracs and the Conservative/Aggressive cleaning classes (keep the frac helper functions)
- delete customizations/heuristics.py (BaseHeuristicCleaningGenOps + the unused invert_rewritten_orig)
geopandas (TypingStats) and the docs/example-notebooks autocleaning sample still reference these; both are handled by the separate geopandas-removal PR and are not in CI.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
💡 Codex Review
https://github.com/buckaroo-data/buckaroo/blob/85ab2038e48eb011ab142ba11240248ee7bceb8a/customizations/pd_stats_v2.py#L340-L341
Restore the removed typing stat export
The old TypingStats symbol is still imported by buckaroo/geopandas_buckaroo.py:8, which is reached by the registered GeoPandas display formatter in widget_utils.py:87-90 and by documented GeoPandas notebooks. After this port removes the class from customizations.analysis, displaying or directly importing the GeoPandas widget raises ImportError before any widget can render; either update that import to the v2 stat functions or keep a compatibility alias/export.
a_only_col = next((c for c in a.columns if c not in keys), None)
b_col_after = f"{a_only_col}_after" if a_only_col else None
if a_only_col and b_col_after in merged.columns:
Use side markers for Polars key diffs
When the two Polars frames do not share the same first non-key column, b_col_after is absent and this falls through to only_before = only_after = 0 and matched = len(merged), so a full outer diff with keys present on only one side is reported as fully matched. This also happens for key-only frames; add explicit side marker columns before the join or probe independent a/b columns so one-sided keys are counted correctly.
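The side-marker suggestion can be sketched without any dataframe library. The function below is a hypothetical plain-Python stand-in for a key diff, not the key_diff_polars code: it tags each key with the side or sides it came from, so the counts do not depend on which non-key columns either input has. It counts distinct keys rather than joined rows, which is a simplifying assumption.

```python
def key_diff(a_rows, b_rows, keys):
    """Count keys present only before, only after, or on both sides."""

    def key_of(row):
        return tuple(row[k] for k in keys)

    # Side markers: every key records which inputs it appeared in.
    sides = {}
    for row in a_rows:
        sides.setdefault(key_of(row), set()).add("a")
    for row in b_rows:
        sides.setdefault(key_of(row), set()).add("b")

    return {
        "only_before": sum(1 for s in sides.values() if s == {"a"}),
        "only_after": sum(1 for s in sides.values() if s == {"b"}),
        "matched": sum(1 for s in sides.values() if s == {"a", "b"}),
    }


# Key-only inputs: there is no non-key column to probe, yet the counts are right.
before = [{"id": 1}, {"id": 2}]
after = [{"id": 2}, {"id": 3}]
diff = key_diff(before, after, keys=["id"])
```

In a dataframe join the same effect would come from adding a constant marker column to each side before the outer join and testing those markers for nulls afterwards.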
Removes the entire legacy v1 DfStats / ColAnalysis-execution stack now that the v2 @stat StatPipeline is the only analysis path.

Removed
- DfStats / AnalysisPipeline (deleted analysis_management.py), PlDfStats / PolarsAnalysisPipeline + the polars produce_* helpers, and the v1 order_analysis / check_solvable ordering.
- v1_adapter.py, the ColAnalysis branch in _normalize_inputs, and the v1_computed / spread_dict_result StatFunc flags.
- process_df_v1_compat / process_table_v1_compat / _find_v1_class. The DfStatsV2 / PlDfStatsV2 / XorqDfStatsV2 wrappers now call process_df / process_table and build the ErrDict via a new errors_to_errdict.
- ColAnalysis stat classes: TypingStats, DefaultSummaryStats, ComputedDefaultSummaryStats, PdCleaningStats, Histogram, HeuristicFracs, the cleaning-op-gen classes, and customizations/heuristics.py (the shared helper functions are kept).

Ported to @stat
- BuckarooWidget, the headless server, and the pandas autocleaning configs now run on PD_ANALYSIS_V2; added a cleaning_gen_ops @stat for the default autoclean op. Polars/xorq were already @stat.
- _normalize_inputs accepts structural ColAnalysis classes (styling, post-processing) as no-ops and rejects leftover v1 stat classes with a clear error.
- Tests now use @stat / StatPipeline.process_df; v1-mechanic and v1-vs-v2 parity tests dropped where covered by test_paf_v2 / test_pd_stats_v2.

Polars autocleaning
Re-implemented on the v2 pipeline (it previously only worked through the v1 PlDfStats executor): pl_cleaning_stats @stat + PL_AUTOCLEAN_DEFAULT_V2, a re-added PolarsAutocleaning (PlDfStatsV2 + polars make_origs), and PolarsBuckarooWidget wired to it.

Two latent regressions fixed (exposed by routing the pandas widget through @stat)
- base_summary_stats short-circuits value_counts on unhashable (list/dict/set) columns; restores #843 (perf: DefaultSummaryStats O(n²) on object columns with unhashable values (lists/dicts/sets)).
- make_origs checks add_orig by truthiness, so @stat cleaning funcs returning add_orig=False no longer add spurious _orig columns.

Notes
- geopandas_buckaroo.py (TypingStats) and docs/example-notebooks/mo-autocleaning.py still reference removed v1 classes; both are handled by the separate geopandas-removal PR (which merges first) and are not in CI.
- ruff + paddy_format --check clean.

🤖 Generated with Claude Code