MRB-650 Add score maps#92
Conversation
2185fd6 to
9eb4643
Compare
…oard" This reverts commit cdefa16.
summary statistics. (No changes to code yet.)
of summary statistics.
For Bias, RMSE and MAE map plots.
earthkit instead.
Francesco. Got a long way towards the png plots. Co-authored-by: Francesco Zanetta <francesco.zanetta@meteoswiss.ch>
properly working). Output written to .png now working.
detailed inspection of results at smaller spatial scale.
symmetric colour map for bias.
to see if all of it still works.
|
Update: Since the last round I ran a full regression test of the score-maps path (instantaneous params reproduce the pre-merge reference maps bit-identically; TOT_PREC validated end-to-end against an independent hand computation). It surfaced four bugs, now fixed: af061b9 — TOT_PREC step-0 handling: the first-lead-time [0, period] window crashed (and full-range loads silently returned all-NaN precip at the first lead time). The PR is ready in my view. Please check again @dnerini @jonasbhend |
dnerini
left a comment
There was a problem hiding this comment.
Nice work, thanks! A few more comments from my side.
jonasbhend
left a comment
There was a problem hiding this comment.
Hi @Louis-Frey, here are a few more comments on the open PR. I have also taken it for a spin, to get a better understanding of what is happening, still have to look a bit more closely into results (filenames, rules, configs and such). Let me know if you need help / clarification.
Address dnerini's review: rename experiment.score_maps -> scoremaps and default it to None (mirrors scorecards), so the block can be omitted. Update the leadtimes validator to read scoremaps and skip when None, and the Snakefile consumer (SCORE_MAPS_CONFIG -> SCOREMAPS_CONFIGS) to be None-safe. Regenerate config.schema.json; rename the config YAML keys.
The scoremaps script is serial: it loops over init times one at a time and accumulates running statistics, with no multiprocessing. The only parallel-capable step (cKDTree.query) runs with workers=1. The 24-core request was unused and, since Snakemake fans these out as many independent tasks, multiplied wasted cores across the whole DAG. Two cores leaves headroom for incidental GRIB-decode / dask I/O overlap.
Per review on MRB-650: identify scoremaps results by the truth config hash (TRUTH_HASH, from #179) rather than the human truth label, so results are pinned to the actual truth config. Applied as a filename suffix (matching the verif_{TRUTH_HASH}.nc convention) to both the run and baseline scoremaps outputs and their plot-rule inputs; the run path previously carried no truth identifier at all. Drops the now-redundant {label} directory level from the baseline path.
Completes the reviewer's request to reference the truth hash in both output and logs: the run scoremaps output was hashed in the prior commit, this adds it to the matching log filename (the baseline log was already hashed). Also applies snakemake-fmt line wrapping.
Raise the "cannot form accumulation window" error based on the input
step count (< 2) before computing tp.diff("step"), rather than detecting
an empty result afterwards. Avoids the misleading "Disaggregating..."
log line when the error path is taken.
Explain that this is an internal helper (external callers use load_forecast_data) and that `files` and `steps` are complementary, not redundant: `files` are what exists on disk, while `steps` carries the requested lead times for TOT_PREC de-accumulation — needed because the step-0 field is omitted by anemoi-inference and synthesised rather than loaded. Addresses reviewer confusion about apparent duplication.
Add model_config = {"extra": "forbid"} so a misspelled or wrong key
(e.g. `metrics:` instead of `scores:`) raises a validation error
instead of being silently ignored and falling back to defaults.
Matches the convention already used by RunConfig, ScorecardConfig, etc.
STDE had no colormap entry and fell through to the parameter's field
colormap (e.g. absolute-temperature scale) instead of an error-magnitude
scale. Replace the 16 byte-identical RMSE+MAE entries with a single
generic "{param}.score.map" (sequential Reds) used by any error-magnitude
score, and have the score-map lookup fall back to it before the field
colormap. BIAS stays special-cased with its diverging map. An explicit
"{param}.{score}.map" still overrides the fallback, so a bespoke STDE map
can be added later without losing this default.
Pick up additionalProperties: false in the generated JSON schema, the schema equivalent of the `extra: forbid` added to ScoreMapsConfig. The config change landed without regenerating the schema, so the pydantic- schema pre-commit hook flagged it as out of date.
Score maps silently skipped inits whose forecast or truth was missing, so a run map could cover more inits than a baseline map, making them not comparable. Match the rest of evalml (verification_metrics fails per init) and require the full configured init set: - runs: up-front check that every configured reftime has a GRIB output dir (the old code silently filtered missing dirs out) - truth: up-front check that every required valid time is in the truth zarr, reporting all missing at once - baselines: a forecast that fails to load is now a hard error instead of a warn-and-skip Remove n_skip bookkeeping and the n_skipped output attr; raise instead of returning silently when nothing was processed.
|
All comments raised before are addressed. Please check again @dnerini @jonasbhend. If there are no further issues / comments raised, I suggest you can proceed to testing, @jonasbhend. After you complete, I will re-run my comprehensive test engine and merge if nothing surfaces. |
Good for me, I tested it. Does what I expect. |
dnerini
left a comment
There was a problem hiding this comment.
looking good for me too, thanks @Louis-Frey for the hard work!
|
Good. Comprehensive tests complete and passing. Updated PR description. Merging. |
Opt-in score maps for runs and baselines
Adds a pipeline that produces temporally aggregated spatial verification maps (per-grid-point BIAS / RMSE / MAE / STDE) for both model runs and baselines. The computation is heavy, so it is opt-in via config.
What's new
Config
experiment.scoremapsblock controls the feature. Setenabled: trueto produce score maps alongside the standard pipeline; omit the block or setenabled: falseto skip it entirely (existing configs are unaffected).params,leadtimes,scores,regions,seasons,init_hours.leadtimesis validated at config-load time against every participant's forecaststeps: requesting a lead time a run or baseline does not produce fails fast with a clear message (e.g. 36 h against an ICON-CH1 baseline with steps0/33/6). Accumulated params (TOT_PREC) require a lead time of at least one accumulation period.ScoreMapsConfigforbids unknown keys.--mapsCLI flag, which has been removed in favour of this config option.Workflow
experiment_allonly whenexperiment.scoremaps.enabledis true.verification_scoremaps(runs, GRIB input) andverification_scoremaps_baseline(baselines, ICON GRIB archive or INCA), plusplot_scoremaps/plot_scoremaps_baselinefor the map plots.data/{runs,baselines}/…/scoremaps/{param}_{leadtime}_{truth_hash}.nc; plots:results/{experiment}/scoremaps/{runs,baselines}/.Script (
workflow/scripts/verification_scoremaps.py)--run_root/--baseline_root.--reftimesrestricts processing to the configured hindcast period (essential for baselines, whose archive is a continuous time series).[lead − period, lead]window; a missing step-0 field is synthesised as a zero initial condition when step 0 is requested.Testing
mainmerged in).Deferred to follow-up PRs
data_input/__init__.py).plots/.data_inputconsolidation (owned by therefactor/data-iobranch).Authors
Co-authored-by: Louis Frey louis.frey@meteoswiss.ch
Co-authored-by: Francesco Zanetta francesco.zanetta@meteoswiss.ch
Co-authored-by: Jonas Bhend jonas.bhend@meteoswiss.ch