MRB-650 Add score maps by jonasbhend · Pull Request #92 · MeteoSwiss/evalml

jonasbhend · 2026-01-07T10:38:09Z

Opt-in score maps for runs and baselines

Adds a pipeline that produces temporally aggregated spatial verification maps (per-grid-point BIAS / RMSE / MAE / STDE) for both model runs and baselines. The computation is heavy, so it is opt-in via config.

What's new

Config

New optional experiment.scoremaps block controls the feature. Set enabled: true to produce score maps alongside the standard pipeline; omit the block or set enabled: false to skip it entirely (existing configs are unaffected).
Fields: params, leadtimes, scores, regions, seasons, init_hours.
leadtimes is validated at config-load time against every participant's forecast steps: requesting a lead time a run or baseline does not produce fails fast with a clear message (e.g. 36 h against an ICON-CH1 baseline with steps 0/33/6). Accumulated params (TOT_PREC) require a lead time of at least one accumulation period.
JSON schema regenerated; ScoreMapsConfig forbids unknown keys.
Replaces the earlier --maps CLI flag, which has been removed in favour of this config option.
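
The fail-fast lead-time check described in the Config notes above can be illustrated with a small, dependency-free sketch. It is not the validator in src/evalml/config.py: the function names, the `accumulation_period` argument and the reading of a steps string such as 0/33/6 as start/end/step in hours (end inclusive) are all assumptions made for illustration.

```python
from __future__ import annotations


def parse_steps(spec: str) -> list[int]:
    """Expand a 'start/end/step' string (hours) into explicit lead times."""
    start, end, step = (int(part) for part in spec.split("/"))
    if step <= 0:
        raise ValueError(f"step must be positive, got {step}")
    # Assumption: the end value is inclusive when it falls on the step grid.
    return list(range(start, end + 1, step))


def validate_leadtimes(
    requested: list[int],
    participants: dict[str, str],
    accumulation_period: int | None = None,
) -> None:
    """Raise ValueError naming every participant that cannot serve a lead time."""
    problems = []
    if accumulation_period is not None:
        # Accumulated params need at least one full accumulation period.
        too_short = sorted(lt for lt in requested if lt < accumulation_period)
        if too_short:
            problems.append(
                f"lead times {too_short} are shorter than the "
                f"{accumulation_period} h accumulation period"
            )
    for name, spec in participants.items():
        missing = sorted(set(requested) - set(parse_steps(spec)))
        if missing:
            problems.append(
                f"{name} (steps {spec}) does not produce lead times {missing}"
            )
    if problems:
        # Report everything at once so the config can be fixed in one pass.
        raise ValueError("; ".join(problems))
```

With participants `{"icon-ch1": "0/33/6", "run-a": "0/120/6"}` (hypothetical names), requesting `[36]` raises an error that mentions only `icon-ch1`, while `[6, 24]` passes silently.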

Workflow

Score-map targets are appended to experiment_all only when experiment.scoremaps.enabled is true.
New rules verification_scoremaps (runs, GRIB input) and verification_scoremaps_baseline (baselines, ICON GRIB archive or INCA), plus plot_scoremaps / plot_scoremaps_baseline for the map plots.
Score-map data: data/{runs,baselines}/…/scoremaps/{param}_{leadtime}_{truth_hash}.nc; plots: results/{experiment}/scoremaps/{runs,baselines}/.
A single shared score colormap is used across RMSE/MAE/STDE for comparability.

Script (workflow/scripts/verification_scoremaps.py)

Single unified script handling run (GRIB) and baseline (ICON GRIB archive / INCA NetCDF) inputs via mutually exclusive --run_root / --baseline_root.
Streaming aggregation: the accumulators stream over init times, so no per-init-time error fields are written to disk.
Computes BIAS/RMSE/MAE/STDE plus an N (sample-count) field, stratified by season (DJF/MAM/JJA/SON/all) and init hour (00/06/12/18/all).
--reftimes restricts processing to the configured hindcast period (essential for baselines, whose archive is a continuous time series).
TOT_PREC is de-accumulated over the [lead − period, lead] window; a missing step-0 field is synthesised as a zero initial condition when step 0 is requested.
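
As a rough illustration of the streaming aggregation described above, the sketch below keeps only running sums per grid point and derives BIAS, RMSE, MAE, STDE and N at the end. It is not the code in verification_scoremaps.py. The class name is hypothetical, plain Python lists stand in for gridded arrays, NaN values are assumed to be excluded from the sample count, and STDE is assumed to be the population standard deviation of the error, sqrt(MSE - BIAS^2).

```python
import math


class ScoreAccumulator:
    """Running sums for per-grid-point scores over a stream of init times."""

    def __init__(self, n_points: int) -> None:
        self.n = [0] * n_points          # sample count per grid point
        self.sum_err = [0.0] * n_points  # sum of errors
        self.sum_abs = [0.0] * n_points  # sum of absolute errors
        self.sum_sq = [0.0] * n_points   # sum of squared errors

    def update(self, forecast, truth) -> None:
        """Fold one init time into the sums; no error field is kept."""
        for i, (f, t) in enumerate(zip(forecast, truth)):
            if math.isnan(f) or math.isnan(t):
                continue  # assumption: missing values do not count as samples
            err = f - t
            self.n[i] += 1
            self.sum_err[i] += err
            self.sum_abs[i] += abs(err)
            self.sum_sq[i] += err * err

    def finalize(self) -> dict:
        """Turn the running sums into score fields plus the N field."""
        scores = {"BIAS": [], "RMSE": [], "MAE": [], "STDE": [], "N": list(self.n)}
        for n, s, a, q in zip(self.n, self.sum_err, self.sum_abs, self.sum_sq):
            if n == 0:
                for name in ("BIAS", "RMSE", "MAE", "STDE"):
                    scores[name].append(math.nan)
                continue
            bias, mse = s / n, q / n
            scores["BIAS"].append(bias)
            scores["RMSE"].append(math.sqrt(mse))
            scores["MAE"].append(a / n)
            # Clamp at zero to absorb tiny negative values from round-off.
            scores["STDE"].append(math.sqrt(max(mse - bias * bias, 0.0)))
        return scores
```

Stratification by season and init hour could then be handled by keeping one such accumulator per (season, init hour) key and updating the matching ones for each init time, which keeps memory proportional to the number of strata rather than the number of init times.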

Testing

Unit tests for the TOT_PREC de-accumulation / step-0 handling.
Dry-run gating: feature off → no score-map jobs; on → correct job expansion; the leadtimes validator rejects unsupported lead times.
Numerical regression against reference score maps: instantaneous params (T_2M, TD_2M) reproduce the references bit-identically (sample counts exact); TOT_PREC validated independently against a hand computation (de-accumulate → regrid → difference) to float32 precision.
Full 2-init end-to-end run (verification → aggregation → score maps → plots → dashboard) completes cleanly.
Dry-run sweep of all example configs, with and without score maps, builds correctly on the current tree (with main merged in).

Deferred to follow-up PRs

GRIB loader per-call overhead (TODO marker in data_input/__init__.py).
Wind direction / vector visualisation.
Reorganisation of pre-existing scalar plots under plots/.
Further data_input consolidation (owned by the refactor/data-io branch).
Nicer country polygons for map plots.
Plot-every-pixel rendering.

Authors

Co-authored-by: Louis Frey <louis.frey@meteoswiss.ch>
Co-authored-by: Francesco Zanetta <francesco.zanetta@meteoswiss.ch>
Co-authored-by: Jonas Bhend <jonas.bhend@meteoswiss.ch>

Louis-Frey · 2026-06-15T09:49:15Z

Update:

Since the last round I ran a full regression test of the score-maps path (instantaneous params reproduce the pre-merge reference maps bit-identically; TOT_PREC validated end-to-end against an independent hand computation). It surfaced four bugs, now fixed:

af061b9 — TOT_PREC step-0 handling: the first-lead-time [0, period] window crashed (and full-range loads silently returned all-NaN precip at the first lead time).
3fe2531 — run-side score-maps rule was missing the ECCODES definitions export (empty maps) and hardcoded output/ instead of the configured output_root.
780cae8 — score-map plotting broke after the earthkit/loader migration (coord rename + Colormap-object rejection).
6e7d591 — score-map lead times are now resolved per participant: a 36h map is no longer scheduled for an ICON-CH1 baseline (steps 0/33/6), and leadtimes: all no longer plans doomed jobs.

The PR is ready in my view. Please check again @dnerini @jonasbhend

dnerini

Nice work, thanks! A few more comments from my side.

jonasbhend

Hi @Louis-Frey, here are a few more comments on the open PR. I have also taken it for a spin to get a better understanding of what is happening, but I still have to look more closely at the results (filenames, rules, configs and such). Let me know if you need help or clarification.

Address dnerini's review: rename experiment.score_maps -> scoremaps and default it to None (mirrors scorecards), so the block can be omitted. Update the leadtimes validator to read scoremaps and skip when None, and the Snakefile consumer (SCORE_MAPS_CONFIG -> SCOREMAPS_CONFIGS) to be None-safe. Regenerate config.schema.json; rename the config YAML keys.

The scoremaps script is serial: it loops over init times one at a time and accumulates running statistics, with no multiprocessing. The only parallel-capable step (cKDTree.query) runs with workers=1. The 24-core request was unused and, since Snakemake fans these out as many independent tasks, multiplied wasted cores across the whole DAG. Two cores leaves headroom for incidental GRIB-decode / dask I/O overlap.

Per review on MRB-650: identify scoremaps results by the truth config hash (TRUTH_HASH, from #179) rather than the human truth label, so results are pinned to the actual truth config. Applied as a filename suffix (matching the verif_{TRUTH_HASH}.nc convention) to both the run and baseline scoremaps outputs and their plot-rule inputs; the run path previously carried no truth identifier at all. Drops the now-redundant {label} directory level from the baseline path.

Completes the reviewer's request to reference the truth hash in both output and logs: the run scoremaps output was hashed in the prior commit, this adds it to the matching log filename (the baseline log was already hashed). Also applies snakemake-fmt line wrapping.

Raise the "cannot form accumulation window" error based on the input step count (< 2) before computing tp.diff("step"), rather than detecting an empty result afterwards. Avoids the misleading "Disaggregating..." log line when the error path is taken.
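
To make the de-accumulation and the early error concrete, here is a minimal sketch under stated assumptions. It is not the loader code: the function name is hypothetical, plain lists stand in for the `tp` fields, windows are formed between consecutive requested steps, and a missing step-0 field is synthesised as zeros only when step 0 is requested.

```python
from __future__ import annotations


def deaccumulate(
    accumulated: dict[int, list[float]], steps: list[int]
) -> dict[int, list[float]]:
    """Convert accumulated-since-init totals into per-window amounts.

    accumulated maps a step (hours) to the accumulated values per grid point;
    steps lists the requested steps. The result maps each window's end step
    to the amount accumulated since the previous requested step.
    """
    # Check the input step count first, before any differencing is attempted.
    if len(steps) < 2:
        raise ValueError(
            f"cannot form accumulation window: need at least 2 steps, got {len(steps)}"
        )
    fields = dict(accumulated)
    if 0 in steps and 0 not in fields and fields:
        # Nothing has accumulated at init time, so step 0 is a zero field.
        n_points = len(next(iter(fields.values())))
        fields[0] = [0.0] * n_points
    missing = [s for s in steps if s not in fields]
    if missing:
        raise KeyError(f"no accumulated field for steps {missing}")
    ordered = sorted(steps)
    return {
        end: [late - early for early, late in zip(fields[start], fields[end])]
        for start, end in zip(ordered, ordered[1:])
    }
```

For example, with accumulated fields at 6 h and 12 h only, requesting steps `[0, 6, 12]` yields the 0 to 6 h and 6 to 12 h amounts, whereas requesting `[6]` alone fails immediately with the "cannot form accumulation window" message.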

Explain that this is an internal helper (external callers use load_forecast_data) and that `files` and `steps` are complementary, not redundant: `files` are what exists on disk, while `steps` carries the requested lead times for TOT_PREC de-accumulation — needed because the step-0 field is omitted by anemoi-inference and synthesised rather than loaded. Addresses reviewer confusion about apparent duplication.

Add model_config = {"extra": "forbid"} so a misspelled or wrong key (e.g. `metrics:` instead of `scores:`) raises a validation error instead of being silently ignored and falling back to defaults. Matches the convention already used by RunConfig, ScorecardConfig, etc.

STDE had no colormap entry and fell through to the parameter's field colormap (e.g. absolute-temperature scale) instead of an error-magnitude scale. Replace the 16 byte-identical RMSE+MAE entries with a single generic "{param}.score.map" (sequential Reds) used by any error-magnitude score, and have the score-map lookup fall back to it before the field colormap. BIAS stays special-cased with its diverging map. An explicit "{param}.{score}.map" still overrides the fallback, so a bespoke STDE map can be added later without losing this default.
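
The lookup order described in this commit can be sketched as follows. This is an illustration, not the code in src/plotting/colormap_defaults.py: the registry is a plain dict of colormap names, the function name is hypothetical, `"{param}.field.map"` is a made-up key standing in for the parameter's field colormap, and `"RdBu_r"` is assumed as the diverging BIAS map.

```python
def resolve_score_colormap(registry: dict, param: str, score: str) -> str:
    """Pick a colormap name for a score map, most specific entry first."""
    explicit = f"{param}.{score}.map"
    if explicit in registry:
        return registry[explicit]  # a bespoke entry always wins
    if score == "BIAS":
        return "RdBu_r"  # assumption: BIAS keeps a diverging map
    generic = f"{param}.score.map"
    if generic in registry:
        return registry[generic]  # shared error-magnitude map (RMSE/MAE/STDE)
    # Last resort: the parameter's own field colormap (hypothetical key).
    return registry[f"{param}.field.map"]
```

With a registry containing `"T_2M.score.map": "Reds"`, STDE and RMSE for T_2M both resolve to `"Reds"`, and adding a `"T_2M.STDE.map"` entry later would override that default for STDE only.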

Pick up additionalProperties: false in the generated JSON schema, the schema equivalent of the `extra: forbid` added to ScoreMapsConfig. The config change landed without regenerating the schema, so the pydantic-schema pre-commit hook flagged it as out of date.

Score maps silently skipped inits whose forecast or truth was missing, so a run map could cover more inits than a baseline map, making them not comparable. Match the rest of evalml (verification_metrics fails per init) and require the full configured init set:

- runs: up-front check that every configured reftime has a GRIB output dir (the old code silently filtered missing dirs out)
- truth: up-front check that every required valid time is in the truth zarr, reporting all missing at once
- baselines: a forecast that fails to load is now a hard error instead of a warn-and-skip

Remove n_skip bookkeeping and the n_skipped output attr; raise instead of returning silently when nothing was processed.
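
The "check everything up front and report all missing at once" pattern from this commit might look roughly like the sketch below. The helper names are hypothetical and the truth store is represented by a plain collection of datetimes rather than a zarr time coordinate.

```python
from datetime import datetime, timedelta


def required_valid_times(reftimes, leadtimes_h):
    """All valid times needed to verify every init at every lead time."""
    return sorted(
        {ref + timedelta(hours=lt) for ref in reftimes for lt in leadtimes_h}
    )


def check_complete(kind, required, available):
    """Raise one error listing every missing item instead of skipping it."""
    available = set(available)
    missing = [item for item in required if item not in available]
    if missing:
        listing = ", ".join(str(item) for item in missing)
        raise ValueError(
            f"{len(missing)} of {len(required)} {kind} missing: {listing}"
        )
```

Running the check before the streaming loop starts means that a run map and a baseline map are either both built from the full configured init set or not built at all.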

Louis-Frey · 2026-06-17T15:11:02Z

All comments raised before are addressed. Please check again @dnerini @jonasbhend. If there are no further issues / comments raised, I suggest you can proceed to testing, @jonasbhend. After you complete, I will re-run my comprehensive test engine and merge if nothing surfaces.

jonasbhend · 2026-06-18T12:51:36Z

Good for me, I tested it. Does what I expect.

dnerini

looking good for me too, thanks @Louis-Frey for the hard work!

Louis-Frey · 2026-06-18T18:02:40Z

Good. Comprehensive tests complete and passing. Updated PR description. Merging.

Louis-Frey force-pushed the MRB-650-Maps-simplified branch from 2185fd6 to 9eb4643 on January 22, 2026 12:43

jonasbhend and others added 29 commits January 27, 2026 16:28

Simplified way of implementing fields

2b49526

Exclude spatial data from being plotted and included in dashboard

a456b1d

delete intermediate verification files

0ec286f

Fix typo

28692d6

include score components for maps

f3dcf0d

Revert "Exclude spatial data from being plotted and included in dashboard" This reverts commit cdefa16.

99dac52

remove source dimension from scores

edcca5b

clean up

0ec8a0f

New rule and plotting file for plotting maps of summary statistics. (No changes to code yet.)

51f22a6

Obvious changes to the new plotting rule for maps of summary statistics.

366249d

Some more changes (preliminary, to be continued).

f531395

Further changes to plotting scripts.

32c7ecf

First version of colour maps finished. For Bias, RMSE and MAE map plots.

c2ab645

Better comments in the colour map code.

27b91e2

Better Comments, some further changes to code.

1b2e670

Added back instances of lead time.

52dd1e7

New function for loading netCDF files added to compat.py

ddc5883

Marimo app cell for loading data from .nc

35396a7

Remove .nc-loading function again, do it with earthkit instead.

e8a92aa

All kinds of changes. Co-Development session with Francesco. Got a long way towards the png plots.
Co-authored-by: Francesco Zanetta <francesco.zanetta@meteoswiss.ch>

06be6fa

Generalized to the other non-trivial (non-wind) variables.

61c728e

Some changes to plotting script.

761f8d9

Plotting region now dynamical (but not yet properly working). Output written to .png now working.

1cd4dbf

Dynamic Regions now working.

cdb2ccd

Store results under experiment hash.

558689c

Introduced new domain "switzerland_small" for more detailed inspection of results at smaller spatial scale.

986a7ee

Reverse Red-Blue colour maps for bias.

7f70fb4

Preliminary changes to plotting script for getting symmetric colour map for bias.

50e899f

Temporarily changed plotting script back to original to see if all of it still works.

a0aefc0

dnerini requested changes Jun 15, 2026

Comment thread src/evalml/config.py Outdated

Comment thread workflow/Snakefile Outdated

Comment thread workflow/Snakefile Outdated

dnerini added 3 commits June 15, 2026 16:43

Simplify expansion of score map outputs

509808e

Remove support for 'leadtimes=all'

f38c2ee

Add validator for leadtimes in score maps

202b651

jonasbhend commented Jun 15, 2026

Comment thread workflow/rules/verification.smk Outdated

Comment thread workflow/rules/verification.smk Outdated

Comment thread src/data_input/__init__.py Outdated

Comment thread src/data_input/__init__.py

Comment thread workflow/Snakefile Outdated

jonasbhend commented Jun 15, 2026

Comment thread src/data_input/__init__.py

jonasbhend commented Jun 16, 2026

Comment thread workflow/rules/verification.smk Outdated

Louis-Frey added 6 commits June 16, 2026 17:26

Rename "score_maps" -> "scoremaps" throughout.

d919b51

Merge remote-tracking branch 'origin/main' into MRB-650-Maps-simplified

0cd0684

dnerini changed the title from "MRB-650 maps simplified" to "MRB-650 Add score maps" on Jun 16, 2026

jonasbhend commented Jun 17, 2026

Comment thread src/evalml/config.py

jonasbhend commented Jun 17, 2026

Comment thread workflow/scripts/verification_scoremaps.py Outdated

jonasbhend commented Jun 17, 2026

Comment thread src/plotting/colormap_defaults.py

Louis-Frey added 6 commits June 17, 2026 14:22

dnerini approved these changes Jun 18, 2026

Merge remote-tracking branch 'origin/main' into MRB-650-Maps-simplified

66a6e4b

Louis-Frey merged commit b04418a into main Jun 18, 2026
4 checks passed
