MRB-650 Add score maps#92

Merged
Louis-Frey merged 147 commits into
mainfrom
MRB-650-Maps-simplified
Jun 18, 2026

@jonasbhend jonasbhend commented Jan 7, 2026

Contributor

Opt-in score maps for runs and baselines

Adds a pipeline that produces temporally aggregated spatial verification maps (per-grid-point BIAS / RMSE / MAE / STDE) for both model runs and baselines. The computation is heavy, so it is opt-in via config.

What's new

Config

  • New optional experiment.scoremaps block controls the feature. Set enabled: true to produce score maps alongside the standard pipeline; omit the block or set enabled: false to skip it entirely (existing configs are unaffected).
  • Fields: params, leadtimes, scores, regions, seasons, init_hours.
  • leadtimes is validated at config-load time against every participant's forecast steps: requesting a lead time a run or baseline does not produce fails fast with a clear message (e.g. 36 h against an ICON-CH1 baseline with steps 0/33/6). Accumulated params (TOT_PREC) require a lead time of at least one accumulation period.
  • JSON schema regenerated; ScoreMapsConfig forbids unknown keys.
  • Replaces the earlier --maps CLI flag, which has been removed in favour of this config option.
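The config checks above can be sketched roughly as follows. This is a minimal stdlib illustration, not the project's pydantic model: `ScoreMapsSketch`, `parse_scoremaps`, `expand_steps`, `validate_leadtimes` and the `accum_period` argument are hypothetical names, and the reading of `0/33/6` as an inclusive start/stop/step in hours is an assumption.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ScoreMapsSketch:
    """Stdlib stand-in for the scoremaps block; field names follow the PR text."""

    enabled: bool = False
    params: list = field(default_factory=list)
    leadtimes: list = field(default_factory=list)
    scores: list = field(default_factory=list)
    regions: list = field(default_factory=list)
    seasons: list = field(default_factory=list)
    init_hours: list = field(default_factory=list)


def parse_scoremaps(block):
    """Return None when the block is omitted; reject unknown keys otherwise."""
    if block is None:
        return None
    known = {f.name for f in fields(ScoreMapsSketch)}
    unknown = sorted(set(block) - known)
    if unknown:
        raise ValueError(f"unknown scoremaps keys: {unknown}")
    return ScoreMapsSketch(**block)


def expand_steps(spec):
    """Expand 'start/stop/step' into hours (ASSUMPTION: stop is inclusive)."""
    start, stop, step = (int(part) for part in spec.split("/"))
    return list(range(start, stop + 1, step))


def validate_leadtimes(cfg, participant_steps, accum_period=None):
    """Fail fast if any participant cannot serve a requested lead time."""
    for name, spec in participant_steps.items():
        available = set(expand_steps(spec))
        missing = [lt for lt in cfg.leadtimes if lt not in available]
        if missing:
            raise ValueError(
                f"{name} (steps {spec}) does not produce lead times {missing} h"
            )
    # Accumulated params need at least one full accumulation period.
    for param, period in (accum_period or {}).items():
        if param in cfg.params and any(lt < period for lt in cfg.leadtimes):
            raise ValueError(f"{param} needs lead times of at least {period} h")


cfg = parse_scoremaps({"enabled": True, "params": ["T_2M"], "leadtimes": [6, 36]})
try:
    validate_leadtimes(cfg, {"ICON-CH1": "0/33/6"})
except ValueError as err:
    print(err)  # the 36 h request is rejected at load time
```

Failing at load time rather than at job run time means an unsupported lead time never reaches the workflow planner.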

Workflow

  • Score-map targets are appended to experiment_all only when experiment.scoremaps.enabled is true.
  • New rules verification_scoremaps (runs, GRIB input) and verification_scoremaps_baseline (baselines, ICON GRIB archive or INCA), plus plot_scoremaps / plot_scoremaps_baseline for the map plots.
  • Score-map data: data/{runs,baselines}/…/scoremaps/{param}_{leadtime}_{truth_hash}.nc; plots: results/{experiment}/scoremaps/{runs,baselines}/.
  • A single shared score colormap is used across RMSE/MAE/STDE for comparability.
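For the shared colormap, the lookup order described in this PR (explicit per-score entry, then the generic score entry, then the field colormap) might look like the sketch below. The key templates `{param}.{score}.map` and `{param}.score.map` come from the commit messages; the function name, the plain-dict registry, the `{param}.field` key and the colormap names are assumptions made for illustration.

```python
def resolve_score_colormap(registry, param, score):
    """Pick a colormap for a score map.

    Order: explicit '{param}.{score}.map', then the generic
    '{param}.score.map' for error-magnitude scores, then the field colormap.
    """
    explicit = f"{param}.{score}.map"
    if explicit in registry:
        return registry[explicit]
    # BIAS is signed, so it must not fall back to the sequential score map.
    if score != "BIAS":
        generic = f"{param}.score.map"
        if generic in registry:
            return registry[generic]
    return registry[f"{param}.field"]  # hypothetical key for the field colormap


# Hypothetical registry: values are just colormap names here.
registry = {
    "T_2M.BIAS.map": "RdBu_r",
    "T_2M.score.map": "Reds",
    "T_2M.field": "viridis",
}
for score in ("BIAS", "RMSE", "MAE", "STDE"):
    print(score, resolve_score_colormap(registry, "T_2M", score))
```

Because the explicit key is checked first, a bespoke STDE map can be added later without touching the generic default.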

Script (workflow/scripts/verification_scoremaps.py)

  • Single unified script handling run (GRIB) and baseline (ICON GRIB archive / INCA NetCDF) inputs via mutually exclusive --run_root / --baseline_root.
  • Streaming aggregation: the accumulators stream over init times, so no per-init-time error fields are written to disk.
  • Computes BIAS/RMSE/MAE/STDE plus an N (sample-count) field, stratified by season (DJF/MAM/JJA/SON/all) and init hour (00/06/12/18/all).
  • --reftimes restricts processing to the configured hindcast period (essential for baselines, whose archive is a continuous time series).
  • TOT_PREC is de-accumulated over the [lead − period, lead] window; a missing step-0 field is synthesised as a zero initial condition when step 0 is requested.
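The streaming aggregation can be illustrated with running per-grid-point sums. The sketch below is plain Python over flat lists rather than the script's actual arrays. It assumes STDE is the population standard deviation of the error (so STDE² = RMSE² − BIAS²) and that season and init-hour strata are combined as a cross product; both are assumptions, as are all names.

```python
import math
from datetime import datetime

SEASONS = {12: "DJF", 1: "DJF", 2: "DJF", 3: "MAM", 4: "MAM", 5: "MAM",
           6: "JJA", 7: "JJA", 8: "JJA", 9: "SON", 10: "SON", 11: "SON"}


class ScoreAccumulator:
    """Running per-grid-point sums; no per-init error field is kept."""

    def __init__(self, n_points):
        self.n = [0] * n_points
        self.sum_err = [0.0] * n_points
        self.sum_abs = [0.0] * n_points
        self.sum_sq = [0.0] * n_points

    def update(self, forecast, truth):
        for i, (fcst, obs) in enumerate(zip(forecast, truth)):
            if math.isnan(fcst) or math.isnan(obs):
                continue  # missing values do not count towards N
            err = fcst - obs
            self.n[i] += 1
            self.sum_err[i] += err
            self.sum_abs[i] += abs(err)
            self.sum_sq[i] += err * err

    def finalize(self):
        out = {"BIAS": [], "RMSE": [], "MAE": [], "STDE": [], "N": list(self.n)}
        for n, s_err, s_abs, s_sq in zip(self.n, self.sum_err, self.sum_abs, self.sum_sq):
            if n == 0:
                for name in ("BIAS", "RMSE", "MAE", "STDE"):
                    out[name].append(math.nan)
                continue
            bias, mse = s_err / n, s_sq / n
            out["BIAS"].append(bias)
            out["RMSE"].append(math.sqrt(mse))
            out["MAE"].append(s_abs / n)
            # Clamp at zero to absorb rounding error in mse - bias**2.
            out["STDE"].append(math.sqrt(max(mse - bias * bias, 0.0)))
        return out


def strata(init_time):
    """Every (season, init hour) bucket that an init time contributes to."""
    season, hour = SEASONS[init_time.month], f"{init_time.hour:02d}"
    return [(season, hour), (season, "all"), ("all", hour), ("all", "all")]


def aggregate(samples, n_points):
    """samples yields (init_time, forecast, truth), one init at a time."""
    accumulators = {}
    for init_time, forecast, truth in samples:
        for key in strata(init_time):
            accumulators.setdefault(key, ScoreAccumulator(n_points)).update(forecast, truth)
    return {key: acc.finalize() for key, acc in accumulators.items()}


samples = [
    (datetime(2025, 1, 1, 0), [1.0, 2.0], [0.0, 2.0]),
    (datetime(2025, 7, 1, 12), [3.0, 2.0], [0.0, 4.0]),
]
result = aggregate(samples, n_points=2)
print(result[("all", "all")])
```

Memory use depends on the grid size and the number of strata, not on the number of init times.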

Testing

  • Unit tests for the TOT_PREC de-accumulation / step-0 handling.
  • Dry-run gating: feature off → no score-map jobs; on → correct job expansion; the leadtimes validator rejects unsupported lead times.
  • Numerical regression against reference score maps: instantaneous params (T_2M, TD_2M) reproduce the references bit-identically (sample counts exact); TOT_PREC validated independently against a hand computation (de-accumulate → regrid → difference) to float32 precision.
  • Full 2-init end-to-end run (verification → aggregation → score maps → plots → dashboard) completes cleanly.
  • Dry-run sweep of all example configs, with and without score maps, builds correctly on the current tree (with main merged in).
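As an illustration of what the de-accumulation unit tests exercise, here is a minimal sketch of the `[lead − period, lead]` window with a synthesised zero field at step 0. It uses plain dicts and lists instead of the real data structures; `deaccumulate` and its signature are hypothetical, and the 6 h period in the usage lines is only an example value.

```python
def deaccumulate(accumulated, lead, period):
    """Precipitation summed over [lead - period, lead].

    accumulated maps step (hours) to the accumulated-since-init field,
    given here as a flat list with one value per grid point.
    """
    if period <= 0:
        raise ValueError("accumulation period must be positive")
    start = lead - period
    if start < 0:
        raise ValueError(f"lead time {lead} h is shorter than one period ({period} h)")
    fields = dict(accumulated)  # work on a copy; the input stays untouched
    if start == 0 and 0 not in fields and lead in fields:
        # Step 0 is missing: synthesise a zero initial condition.
        fields[0] = [0.0] * len(fields[lead])
    # Check the window before differencing, so the error is raised up front.
    present = [step for step in (start, lead) if step in fields]
    if len(present) < 2:
        raise ValueError(
            f"cannot form accumulation window [{start}, {lead}] h "
            f"from steps {sorted(fields)}"
        )
    return [end - begin for begin, end in zip(fields[start], fields[lead])]


acc = {6: [1.0, 2.0], 12: [1.5, 5.0]}  # step 0 deliberately absent
print(deaccumulate(acc, lead=6, period=6))
print(deaccumulate(acc, lead=12, period=6))
```

Checking the window before taking the difference mirrors the ordering described in the commit messages: the error path is taken before any de-accumulation work starts.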

Deferred to follow-up PRs

  • GRIB loader per-call overhead (TODO marker in data_input/__init__.py).
  • Wind direction / vector visualisation.
  • Reorganisation of pre-existing scalar plots under plots/.
  • Further data_input consolidation (owned by the refactor/data-io branch).
  • Nicer country polygons for map plots.
  • Plot-every-pixel rendering.

Authors

Co-authored-by: Louis Frey <louis.frey@meteoswiss.ch>
Co-authored-by: Francesco Zanetta <francesco.zanetta@meteoswiss.ch>
Co-authored-by: Jonas Bhend <jonas.bhend@meteoswiss.ch>

@Louis-Frey Louis-Frey force-pushed the MRB-650-Maps-simplified branch from 2185fd6 to 9eb4643 Compare January 22, 2026 12:43
jonasbhend and others added 29 commits January 27, 2026 16:28
summary statistics. (No changes to code yet.)
For Bias, RMSE and MAE map plots.
Francesco. Got a long way towards the png plots.

Co-authored-by: Francesco Zanetta <francesco.zanetta@meteoswiss.ch>
properly working). Output written to .png now
working.
detailed inspection of results at smaller spatial
scale.
@Louis-Frey

Contributor

Update:

Since the last round I ran a full regression test of the score-maps path (instantaneous params reproduce the pre-merge reference maps bit-identically; TOT_PREC validated end-to-end against an independent hand computation). It surfaced four bugs, now fixed:

af061b9 — TOT_PREC step-0 handling: the first-lead-time [0, period] window crashed (and full-range loads silently returned all-NaN precip at the first lead time).
3fe2531 — run-side score-maps rule was missing the ECCODES definitions export (empty maps) and hardcoded output/ instead of the configured output_root.
780cae8 — score-map plotting broke after the earthkit/loader migration (coord rename + Colormap-object rejection).
6e7d591 — score-map lead times are now resolved per participant: a 36h map is no longer scheduled for an ICON-CH1 baseline (steps 0/33/6), and leadtimes: all no longer plans doomed jobs.

The PR is ready in my view. Please check again @dnerini @jonasbhend

@dnerini dnerini left a comment

Member


Nice work, thanks! A few more comments from my side.

Comment thread src/evalml/config.py Outdated
Comment thread workflow/Snakefile Outdated
Comment thread workflow/Snakefile Outdated

@jonasbhend jonasbhend left a comment

Contributor Author


Hi @Louis-Frey, here are a few more comments on the open PR. I have also taken it for a spin to get a better understanding of what is happening, but I still have to look more closely at the results (filenames, rules, configs and such). Let me know if you need help or clarification.

Comment thread workflow/rules/verification.smk Outdated
Comment thread workflow/rules/verification.smk Outdated
Comment thread src/data_input/__init__.py Outdated
Comment thread src/data_input/__init__.py
Comment thread workflow/Snakefile Outdated
Comment thread src/data_input/__init__.py
Comment thread workflow/rules/verification.smk Outdated
Address dnerini's review: rename experiment.score_maps -> scoremaps and
default it to None (mirrors scorecards), so the block can be omitted.
Update the leadtimes validator to read scoremaps and skip when None, and
the Snakefile consumer (SCORE_MAPS_CONFIG -> SCOREMAPS_CONFIGS) to be
None-safe. Regenerate config.schema.json; rename the config YAML keys.

The scoremaps script is serial: it loops over init times one at a time
and accumulates running statistics, with no multiprocessing. The only
parallel-capable step (cKDTree.query) runs with workers=1. The 24-core
request was unused and, since Snakemake fans these out as many
independent tasks, multiplied wasted cores across the whole DAG. Two
cores leaves headroom for incidental GRIB-decode / dask I/O overlap.

Per review on MRB-650: identify scoremaps results by the truth config
hash (TRUTH_HASH, from #179) rather than the human truth label, so
results are pinned to the actual truth config. Applied as a filename
suffix (matching the verif_{TRUTH_HASH}.nc convention) to both the run
and baseline scoremaps outputs and their plot-rule inputs; the run path
previously carried no truth identifier at all. Drops the now-redundant
{label} directory level from the baseline path.

Completes the reviewer's request to reference the truth hash in both
output and logs: the run scoremaps output was hashed in the prior
commit, this adds it to the matching log filename (the baseline log
was already hashed). Also applies snakemake-fmt line wrapping.
@dnerini dnerini changed the title MRB-650 maps simplified MRB-650 Add score maps Jun 16, 2026
Comment thread src/evalml/config.py
Comment thread workflow/scripts/verification_scoremaps.py Outdated
Comment thread src/plotting/colormap_defaults.py
Raise the "cannot form accumulation window" error based on the input
step count (< 2) before computing tp.diff("step"), rather than detecting
an empty result afterwards. Avoids the misleading "Disaggregating..."
log line when the error path is taken.

Explain that this is an internal helper (external callers use
load_forecast_data) and that `files` and `steps` are complementary, not
redundant: `files` are what exists on disk, while `steps` carries the
requested lead times for TOT_PREC de-accumulation, needed because the
step-0 field is omitted by anemoi-inference and synthesised rather than
loaded. Addresses reviewer confusion about apparent duplication.

Add model_config = {"extra": "forbid"} so a misspelled or wrong key
(e.g. `metrics:` instead of `scores:`) raises a validation error
instead of being silently ignored and falling back to defaults.
Matches the convention already used by RunConfig, ScorecardConfig, etc.

STDE had no colormap entry and fell through to the parameter's field
colormap (e.g. absolute-temperature scale) instead of an error-magnitude
scale. Replace the 16 byte-identical RMSE+MAE entries with a single
generic "{param}.score.map" (sequential Reds) used by any error-magnitude
score, and have the score-map lookup fall back to it before the field
colormap. BIAS stays special-cased with its diverging map. An explicit
"{param}.{score}.map" still overrides the fallback, so a bespoke STDE map
can be added later without losing this default.

Pick up additionalProperties: false in the generated JSON schema, the
schema equivalent of the `extra: forbid` added to ScoreMapsConfig. The
config change landed without regenerating the schema, so the
pydantic-schema pre-commit hook flagged it as out of date.

Score maps silently skipped inits whose forecast or truth was missing,
so a run map could cover more inits than a baseline map, making them not
comparable. Match the rest of evalml (verification_metrics fails per
init) and require the full configured init set:

- runs: up-front check that every configured reftime has a GRIB output
  dir (the old code silently filtered missing dirs out)
- truth: up-front check that every required valid time is in the truth
  zarr, reporting all missing at once
- baselines: a forecast that fails to load is now a hard error instead
  of a warn-and-skip

Remove n_skip bookkeeping and the n_skipped output attr; raise instead
of returning silently when nothing was processed.
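
The up-front completeness checks from the last commit message could be sketched like this. It is a simplified illustration with sets of datetimes standing in for GRIB output directories and the truth zarr; `check_complete` and its arguments are hypothetical names, and the exception types are assumptions.

```python
from datetime import datetime, timedelta


def check_complete(reftimes, leadtimes_h, run_reftimes, truth_times):
    """Require the full configured init set before any processing starts."""
    # Runs: every configured reftime must have forecast output.
    missing_runs = sorted(set(reftimes) - set(run_reftimes))
    if missing_runs:
        raise FileNotFoundError(
            "no forecast output for reftimes: "
            + ", ".join(t.isoformat() for t in missing_runs)
        )
    # Truth: collect every required valid time and report all gaps at once.
    required = {ref + timedelta(hours=lead) for ref in reftimes for lead in leadtimes_h}
    missing_truth = sorted(required - set(truth_times))
    if missing_truth:
        raise ValueError(
            "truth is missing valid times: "
            + ", ".join(t.isoformat() for t in missing_truth)
        )


reftimes = [datetime(2025, 1, 1, 0), datetime(2025, 1, 1, 12)]
truth = {datetime(2025, 1, 1, 6), datetime(2025, 1, 1, 18)}
check_complete(reftimes, [6], run_reftimes=reftimes, truth_times=truth)  # passes
```

Raising on the first incomplete input keeps run and baseline maps on the same set of inits, which is what makes them comparable.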
@Louis-Frey

Contributor

All comments raised before are addressed. Please check again @dnerini @jonasbhend. If there are no further issues / comments raised, I suggest you can proceed to testing, @jonasbhend. After you complete, I will re-run my comprehensive test engine and merge if nothing surfaces.

@jonasbhend

Contributor Author

> All comments raised before are addressed. Please check again @dnerini @jonasbhend. If there are no further issues / comments raised, I suggest you can proceed to testing, @jonasbhend. After you complete, I will re-run my comprehensive test engine and merge if nothing surfaces.

Good for me, I tested it. Does what I expect.

@dnerini dnerini left a comment

Member


looking good for me too, thanks @Louis-Frey for the hard work!

@Louis-Frey

Contributor

Good. Comprehensive tests complete and passing. Updated PR description. Merging.

@Louis-Frey Louis-Frey merged commit b04418a into main Jun 18, 2026
4 checks passed
4 participants