diff --git a/report/_quarto/.gitignore b/R/.gitignore similarity index 100% rename from report/_quarto/.gitignore rename to R/.gitignore diff --git a/R/README.md b/R/README.md new file mode 100644 index 0000000..37e87d1 --- /dev/null +++ b/R/README.md @@ -0,0 +1,69 @@ +# R/ + +Analysis scripts for evaluating COVID-19 forecast accuracy across models. Scripts fall into two categories: pipeline scripts run in sequence to produce results, and utility scripts called by the pipeline. + +## Pipeline scripts + +Run in this order to reproduce the analysis: + +| Script | Purpose | Inputs | Outputs | +|---|---|---|---| +| `process-score.R` | Score forecasts using WIS on natural and log scales; normalises predictions and observations to per-100,000 population | `data/covid19-forecast-hub-europe.parquet`, `data/observed-*.csv`, `data/populations.csv` | `data/scores-raw-case.csv`, `data/scores-raw-death.csv` | +| `process-data.R` | Join scores with covariates (model classification, variant phase, observed incidence, trend) into a single analysis-ready dataframe | `data/scores-raw-*.csv`, `data/observed-*.csv`, `data/model-classification.csv` | In-memory dataframe via `process_data()` | +| `analysis-descriptive.R` | Descriptive summaries: bootstrap confidence intervals, publication-ready Table 1, score distribution ridge plots, time-series plots | `process_data()` output | Plot and table objects | +| `analysis-model.R` | Fit GAMM to WIS isolating effects of model structure (`Method`) and geographic scope (`CountryTargets`) after adjusting for confounders | `process_data()` output | `output/results.rds`, `output/plots/check_*.pdf` | +| `dag-check.R` | Define and visualise causal DAG; derive minimal adjustment set for the exposure–outcome relationship | — | Plot objects | +| `plot-model-flow.R` | Generate model inclusion/exclusion flowchart applying sequential eligibility criteria | `data/covid19-forecast-hub-europe.parquet` | `output/plots/flowchart.png` | +| `plot-model-results.R` | Forest plots of GAMM random effects (unadjusted and adjusted) by model, method, and target scope | `output/results.rds`, `process_data()` output | ggplot objects | + +## Utility scripts + +Called by pipeline scripts; not intended to be run directly. + +| Script | Key functions | Purpose | +|---|---|---| +| `utils-data.R` | `get_forecasts()`, `download_obs()`, `download_pop()` | Read forecast parquet, download observed incidence and population data from JHU/ECDC sources | +| `utils-metadata.R` | `get_metadata_processed()`, `write_metadata()` | Fetch model metadata from GitHub; read and write qualitative model classifications to/from Google Sheets | +| `utils-variants.R` | `classify_variant_phases()` | Classify dominant SARS-CoV-2 variant per country-week using ECDC, UK UKHSA, and Swiss surveillance data | + +## Data flow + +``` +covid19-forecast-hub-europe.parquet + │ + ▼ +utils-data.R ──────────────────────────────► data/observed-*.csv + │ data/populations.csv + │ + ▼ +process-score.R ───────────────────────────► data/scores-raw-case.csv + │ data/scores-raw-death.csv + │ + ├── data/model-classification.csv ◄── utils-metadata.R (Google Sheets) + ├── data/observed-*.csv + └── utils-variants.R (ECDC/UKHSA/CH) + │ + ▼ + process-data.R ── process_data() ──► in-memory analysis dataframe + │ + ├──► analysis-descriptive.R ── tables, plots + │ + └──► analysis-model.R ────────► output/results.rds + output/plots/check_*.pdf + │ + ▼ + plot-model-results.R ── forest plots +``` + +## Model specification + +The main model (`analysis-model.R`) fits a GAMM via `mgcv::bam()`: + +- **Family:** Gaussian with log link +- **Fitting:** fREML with `discrete = TRUE` +- **Fixed effects of interest:** `Method` (model structure), `CountryTargets` (geographic scope) +- **Adjusting for:** `Incidence` (cubic spline), `Trend`, `Horizon`, `VariantPhase`, `Location`, `Model` +- **Random effects basis:** `re` for Location, VariantPhase, and Model +- **Outcomes:** cases and deaths scored separately on natural and log scales + +Univariate models are fitted first (one term each) to obtain unadjusted estimates; a joint model containing all terms yields adjusted estimates. Both are saved to `output/results.rds`. diff --git a/R/analysis-model.R b/R/analysis-model.R index ca1101b..8176696 100644 --- a/R/analysis-model.R +++ b/R/analysis-model.R @@ -47,7 +47,7 @@ model_wis <- function(scoring_scale = "log", output_dir = "output") { model = wis ~ s(Model, bs = "re") ) - # Full joint model + # Full model m.formula_joint <- wis ~ s(Method, bs = "re") + s(CountryTargets, bs = "re") + diff --git a/R/dag-check.R b/R/dag-check.R index 152b36f..e3b1d36 100644 --- a/R/dag-check.R +++ b/R/dag-check.R @@ -1,67 +1,45 @@ -# Define DAG for model structure and WIS -# Variables: -# model_structure: model structure (exposure) -# wis: weighted interval score (outcome) -# team_resources: resources available to the team (latent) -# transmission_dynamics: underlying transmission dynamics (latent) -# country_targets: whether the model forecasts for one or multiple countries -# variant: dominant variant phase -# country: country target location -# trend: increasing, decreasing, or stable epidemic trend -# horizon: forecast horizon - +# Define DAG: methods -> score library(dagitty) library(ggdag) library(ggplot2) -dag <- dagitty('dag { - model_structure[exposure] - wis [outcome] - model_team_resources[latent] - transmission_dynamics[latent] - - model_team_resources -> model_structure - model_team_resources -> wis - model_team_resources -> country_targets - - model_country_targets -> model_structure - model_country_targets -> wis - - country -> variant - variant -> transmission_dynamics - transmission_dynamics -> country - transmission_dynamics -> trend - - transmission_dynamics -> wis - variant -> wis - trend -> wis - country -> wis - horizon -> wis - - model_structure -> wis -}') - -cat("\n--- Minimal adjustment sets for TOTAL effect of model_structure on wis ---\n") -print(adjustmentSets(dag, effect = "total")) - -cat("\n--- Minimal adjustment sets for DIRECT effect of model_structure on wis ---\n") -print(adjustmentSets(dag, effect = "direct")) +dag <- dagitty(' +dag { +bb="-6.421,-7.624,6.826,6.874" +"Accuracy score" [outcome,pos="5.506,2.291"] +"Epidemic dynamics" [latent,pos="1.743,-0.912"] +"Model structure" [exposure,pos="-2.105,2.885"] +"Modeller strategy" [latent,pos="-2.133,-1.027"] +"Observed incidence" [adjusted,pos="3.234,0.948"] +"Single/multi-country" [pos="0.210,1.926"] +Country [adjusted,pos="-0.231,-0.950"] +Horizon [adjusted,pos="2.155,1.907"] +Model [adjusted,pos="1.658,3.000"] +Prediction [pos="3.418,3.058"] +Trend [adjusted,pos="3.262,-0.164"] +Variant [adjusted,pos="0.210,-1.794"] +"Epidemic dynamics" -> "Observed incidence" +"Epidemic dynamics" -> Trend +"Model structure" -> "Single/multi-country" +"Model structure" -> Model +"Modeller strategy" -> "Model structure" +"Modeller strategy" -> "Single/multi-country" +"Observed incidence" -> "Accuracy score" +"Observed incidence" <-> Trend +"Single/multi-country" -> Prediction +Country -> "Epidemic dynamics" +Country -> "Observed incidence" +Horizon -> Prediction +Model -> Prediction +Prediction -> "Accuracy score" +Variant -> "Epidemic dynamics" +} +') -cat("\n--- Implied conditional independencies ---\n") -print(impliedConditionalIndependencies(dag)) +tidy_ggdag <- tidy_dagitty(dag) -cat("\n--- All paths from model_structure to wis (open/closed unconditional) ---\n") -print(paths(dag, from = "model_structure", to = "wis")) - -cat("\n--- Paths given the total-effect adjustment set ---\n") -adj <- adjustmentSets(dag, effect = "total") -if (length(adj) > 0) { - print(paths(dag, from = "model_structure", to = "wis", Z = adj[[1]])) -} +p_adj <- ggdag_adjustment_set(tidy_ggdag, node_size = 14) + + theme(legend.position = "bottom") p <- ggdag_status(dag, text = FALSE, use_labels = "name") + - theme_dag() - -dir.create("report/supplement", recursive = TRUE, showWarnings = FALSE) -ggsave("report/supplement/dag.pdf", p, width = 8, height = 6) -cat("\nDAG figure saved to report/supplement/dag.pdf\n") + theme_dag_blank() diff --git a/report/_quarto/_background.qmd b/report/_quarto/_background.qmd deleted file mode 100644 index 171782e..0000000 --- a/report/_quarto/_background.qmd +++ /dev/null @@ -1,32 +0,0 @@ -The predictive accuracy of infectious disease forecasts varies in time, space, and between forecasting models. -When building models for prediction, forecasters have the choice of a range of underlying model structures, with complementary strengths and weaknesses and varying ability to tune a particular model to the forecasting target [(1–3)](https://www.zotero.org/google-docs/?XemggF). -However, it has been difficult to evaluate the impact of these choices on resulting differences in accurate performance. -A better understanding would offer insights into the utility of different models, supporting both longer-term model development and short-term selection for accurate prediction during outbreak responses. - -A fundamental difference among forecasting models is in the modeller's choice of underlying structure. -This can range on a spectrum of statistical, to mechanistic, reflecting the level of prior understanding and belief in how transmission unfolds in the included population. -While statistical methods can be purely data-driven, more mechanistic models develop explicit cause-and-effect hypotheses about biological or epidemiological drivers of transmission dynamics. -Previous work has compared performance between these approaches, finding no evidence of substantial variation in predictive accuracy between these model structures [(4–6)](https://www.zotero.org/google-docs/?piAg9a). - -An alternative way of looking at model development is to consider a model's specificity towards the target setting to which the forecasts are being applied. -One approach to model building is to create a model able to produce forecasts for any given setting, given only updated data. -This is the case, for example, for typical time-series forecasting and most semi-mechanistic models. -A different approach is to adapt the model structure itself to the predictive target, for example to emphasise different mechanistic processes or to exploit differences in the availability and quality of multiple data sources. -Such context-specific models may be more policy-relevant [(7,8)](https://www.zotero.org/google-docs/?vo794b), and a study of COVID-19 forecasts in Germany and Poland suggested the two best performing models were those most tuned to the national context [(9)](https://www.zotero.org/google-docs/?7QAdmm). - -However, it has been difficult to establish clear associations between different choices in model development, such as model structure and specificity, and the predictive performance of resulting forecasts. -Challenges include a lack of comparability between forecasts [(10)](https://www.zotero.org/google-docs/?8yn1XI), a small sample size of participating modelling teams [(11)](https://www.zotero.org/google-docs/?ZgO6IM), and the need to account for varying predictive difficulty among multiple forecast targets with differing outbreak characteristics [(12)](https://www.zotero.org/google-docs/?qtFeFg). - -A number of collaborative epidemic forecasting efforts have aimed to support such analyses. -Such "Hubs" collate predictions from diverse participating modellers, often publishing these projections in real-time during an epidemic. -Alongside informing public health, this enables like-for-like comparisons of predictive performance [(13)](https://www.zotero.org/google-docs/?7pIGoa). -While there may be inherent limits to predictability [(14)](https://www.zotero.org/google-docs/?CZumUU), several factors have been suggested to contribute to general predictive difficulty. -For example, infectious disease dynamics may be easier to predict in the short term [(15)](https://www.zotero.org/google-docs/?2K90bo), when they are more stable [(9)](https://www.zotero.org/google-docs/?KaFMKc), among larger population sizes [(16)](https://www.zotero.org/google-docs/?7uCfbD), reported with timely surveillance [(6)](https://www.zotero.org/google-docs/?YLVYXN), or have leading indicators in complementary data sources [(17)](https://www.zotero.org/google-docs/?gzkBxT). - -Analyses of forecast performance have typically used a univariate approach to describing differences in accuracy between models. -However, such findings could be better supported by using more formal models of association. -Such models are also able to capture co-varying factors affecting overall predictability [(11,16,18)](https://www.zotero.org/google-docs/?R3Ah2J). -Accounting for these factors may support more robust inferences about underlying differences in forecast performance. - -Here, we develop a generalised additive mixed effect model, with forecast scores as an outcome while including covariate indicators of predictive difficulty, in the context of short-term forecasts of COVID-19 in Europe. -We use this to evaluate the effects of model structure and geographic specificity on predictive performance. diff --git a/report/_quarto/_methods.qmd b/report/_quarto/_methods.qmd deleted file mode 100644 index 0fb8223..0000000 --- a/report/_quarto/_methods.qmd +++ /dev/null @@ -1,93 +0,0 @@ -**Study design** - -*Forecast targets* We consider incidence of COVID-19 cases and deaths between 8 March 2021 to 10 March 2023, using data collated by Johns Hopkins University (JHU). -Data comprised cumulative reported cases or deaths reported daily for each of the 32 European countries [(20)](https://www.zotero.org/google-docs/?cDnqeb). -We calculated the incident weekly count of cases or deaths from the cumulative count, and used the dataset available as of 10 March 2023, when JHU stopped reporting new data. - -*Data collection* We collected forecast data from the European COVID-19 Forecast Hub. -This is a public platform that collated real-time weekly forecasts for 32 European countries [(19)](https://www.zotero.org/google-docs/?5K4uXI). -The platform solicited forecasts for between one and four weeks ahead incidence of COVID-19 cases, deaths, or hospitalisation. -Forecasters were optionally able to express uncertainty by reporting a distribution of up to 23 probabilistic quantiles for each prediction. -Forecasters also contributed a standard set of metadata describing their team and model. -The Hub provided continuous real-time visualisation of each forecasting team's performance against observed data over the study period. - -*Participation* Forecasting teams were recruited to the Hub using existing networks and word of mouth. -Participation was voluntary and open, and any forecaster meeting formatting requirements was eligible for inclusion in the Hub platform. -The number and identity of forecasters contributing to the Hub project therefore varied over time. -Over 50 modelling teams contributed prospective forecasts during the study period. - -*Inclusion criteria* In this study, we included forecasts from between 8 March 2021 to 10 March 2023. -This includes all forecasts contributed to the Hub until the end of JHU data reporting. -We excluded forecasts of hospitalisations, which experienced multiple changes in source data during the study period. -We excluded forecasts that did not report the full set of 23 quantiles, in order to ensure fair comparison among probabilistic model results (see model contributions in Supplement). -We also excluded baseline and ensemble models created by the Hub team. - -*Performance evaluation* We evaluated forecasts against observed data using the Weighted Interval Score (WIS) [(21)](https://www.zotero.org/google-docs/?T3es1k). -The WIS is a proper scoring rule and a generalisation of the continuous ranked probability score (CRPS). -This assigns each probabilistic forecast a score that reflects the forecast's dispersion, overprediction, and underprediction. - -**Model building** - -We considered the outcome (absolute WIS of each forecast) as the outcome of a set of interrelated factors, considering model structure as our main effect of interest. -We also considered the potential for bias (confounding) in this relationship, as well as those affecting the variance of the outcome and therefore the precision of the effect estimate. -For clarity and to illustrate our approach, we include a directed acyclic graph to represent these assumptions in the Supplement. -However, we emphasise that in this work, our model building strategy is exploratory rather than inferential. -We suggest substantial further work is needed to support inferential approaches to forecast evaluation (see Discussion and Supplement). - -*Exposure: model structure* We categorised each model in terms of its design structure. -We assigned a classification according to the following types (in order of decreasing human input): "Judgement" (ensembles of human-judgement based predictions), "Agent-based" (with detailed individual-level interactions), "Mechanistic" (broadly based on compartmental models), "Semi-mechanistic" (statistical models with epidemiologically informed elements, e.g. delay distributions and/or generation intervals), "Statistical" (context-agnostic time-series models). - -*Confounding: forecasting target* We considered the potential for bias from the number of targets a forecasting team participated in predicting. -We hypothesised that this might influence both a forecasting team's choice of model structure, and their realised forecast performance. -This would introduce confounding when estimating the effect of model structure on performance. -For example, more data-driven model structures are typically more amenable to producing forecasts for multiple targets, compared to bespoke mechanistic approaches tailored to a single epidemiological context. -At the same time, a forecaster contributing to many targets simulatenously may be less able to provide consistent quality assurance across targets, potentially impacting overall forecast performance. -To address this, we used a simple binary classification of whether each forecaster provided predictions at any time for one single location target or more than one location target. -We used a binary classification to address data sparsity (few teams submitting for between 2 and 32 countries). -We note that this is a simplistic approach, and that forecasters could change the number of targets they submitted over time. -We also note that selection of epidemiological target (case and/or death incidence) may act in the same way, but we did not include this in this analysis. - -*Precision: epidemiological context* We considered contextual factors that influence the variance of forecast performance across model-target pairs. -We suggest that accounting for these factors improves the precision around the estimated effect of model structure, without impacting its point estimate. -Firstly, we included forecast horizon to account for increasing difficulty of forecast performance further into the future. -We also accounted for some variation in the epidemic transmission dynamics contributing to each observed forecast target. -This included the trend of incidence in the target week for each location, which we categorised as "Stable", "Decreasing", or "Increasing". -This was based on the difference over a three-week moving average of incidence (with a change of +/-5% as "Stable"; see Supplement). -We also considered differing transmission characteristics with each Sars-Cov2 variant. -For this, we used publicly available sequence data to classify each target location-week according to the dominant circulating variant. -This created a discrete sequence of phases (see Supplement) (\#ref-variant-data). -We considered extraneous variation at the country level, to account for, for example, effects of different transmission probabilities with national policy. - -**Data processing** - -We downloaded all data from the open-source European COVID-19 Forecast Hub. - -Four members of the research team (KS, SF, FB, JM) were tasked with classifying models by their structure, with at least two investigators classifying each model. -Each rater worked independently, and was asked to read each modelling teams' short description of their methods included in Hub metadata, and any additional links provided therein, such as citations or websites with model details. -The majority vote determined the final classification with any potential ties flagged for an additional independent classification by another investigator. -Two investigators (SF, KS) separately reviewed all classifications in discussion for any final revisions (see Supplement). - -We normalised both forecasts and observations to incidence per 100,000 population. -We then log-transformed both forecasts and observations before scoring. -This transform leads to more comparable scores across locations and times, and can be interpreted as evaluating the ability of models to predict the growth rate [(22,23)](https://www.zotero.org/google-docs/?NuBRI2). - -**Analysis** - -We used summary statistics to describe WIS by model structure and country target specificity. -We considered the WIS of each individual forecast as an outcome variable, separately for cases and deaths, and fitted a univariate model for each covariate in order to create a comparison between effects from an "unadjusted" approach, or a fully adjusted model. - -We then fitted a generalised additive mixed effect model (GAMM). -We used a Gaussian error model with a log-link (see Supplement). -We added a small amount (1e-7, or less than 1/1000th of the smallest positive value in the data) to all values in order to avoid zeros. -Covariates of interest included model structure, and whether a model targeted one or multiple country targets, each modelled with random effects. -We included covariates for additional factors affecting precision. -We included the length of horizon in weeks as a factor smooth spline, separately for each forecasting model. -We included epidemic trend; dominant variant phase; and each country location, as separate random effects. -Last, we added a random effect for each individual model, to account for unknown variation beyond the factors considered here. - -**Model fitting** - -We fitted models with the *mgcv* package in R [(24)](https://www.zotero.org/google-docs/?Jv7oFh), using fast restricted maximum likelihood estimation (fREML). -To assess confounding, we fitted univariate models to contrast unadjusted and adjusted estimates of the partial effect of model structure on WIS. -We checked model fit with visual inspection of residuals via Q-Q plots. -All data and code are publicly available online at [(25)](https://www.zotero.org/google-docs/?qHy24c), with details in Supplement. diff --git a/report/_quarto/supplement/Supplement.pdf b/report/_quarto/supplement/Supplement.pdf deleted file mode 100644 index cf7b1da..0000000 Binary files a/report/_quarto/supplement/Supplement.pdf and /dev/null differ diff --git a/report/_quarto/supplement/_supplement.qmd b/report/_quarto/supplement/_supplement.qmd deleted file mode 100644 index f219e6b..0000000 --- a/report/_quarto/supplement/_supplement.qmd +++ /dev/null @@ -1,184 +0,0 @@ ---- -title: "The influence of model structure and geographic specificity on predictive accuracy among European COVID-19 forecasts" -subtitle: "Supplementary information" -output: - bookdown::pdf_document2 ---- - -```{r set-up, include=FALSE} -library(here) -library(dplyr) -library(readr) -library(tidyr) -library(ggplot2) -library(gratia) -library(patchwork) -source(here("R", "process-data.R")) -source(here("R", "analysis-descriptive.R")) -source(here("R", "plot-model-results.R")) -theme_set(theme_classic()) -knitr::opts_chunk$set( - eval = TRUE, echo = FALSE, - message = FALSE, warning = FALSE -) -``` - - -# Code and data availability - -## Code - -The codebase for this paper is publicly available at: - -- Github: -- Zenodo with DOI: - -Comments and code contributions are welcome - please use Github [Issues](https://github.com/epiforecasts/eval-by-method/issues). - -Please cite code using: - -- Katharine Sherratt & Sebastian Funk. (2025). epiforecasts/eval-by-method: Zenodo. - -## Source data - -Forecast and observed data were sourced from the European COVID-19 Forecast Hub, available to view at . All Hub data are now archived at: - - - Github: - - Zenodo with DOI: - -Data for this work were downloaded on 30th May 2023. These data are available in the Github repository for this paper at: - -```{r load-data} -# Load data -scores <- process_data(scoring_scale = "log") -ensemble <- scores |> - filter(grepl("EuroCOVIDhub-ensemble", Model)) -scores <- scores |> - filter(!grepl("EuroCOVIDhub-ensemble", Model)) -n_forecasts <- nrow(scores) -``` - -\newpage - -# Model characteristics - -## Eligibility criteria - -```{r model-flow, fig.cap="Eligibility criteria for models contributing case (left) and death (right) forecasts to the European COVID-19 Forecast Hub, March 2021 - March 2023"} -# source(here("R", "plot-model-flow.R")) -# flow_chart <- create_model_flow() -knitr::include_graphics(here("output", "plots", "flowchart.png")) -``` - -## Model characteristics - -```{r metadata} -table_metadata(scores) |> - select(-Description) |> - kable(caption = "Model characteristics contributing to the European COVID-19 Forecast Hub, by method used, number of countries targeted, and number of forecasts contributed.") -``` - -```{r classification} -model_classifier <- classify_models(return_majority = FALSE) |> - distinct(model, classification, raters, votes, agreement) |> - group_by(model) |> - mutate(final = classification[which.max(votes)]) |> - pivot_wider(names_from = classification, values_from = votes, values_fill = 0) |> - filter(model %in% unique(scores$Model)) |> - select(Model = model, `Final classification` = final, - Agreement = agreement, `Total raters` = raters, everything()) |> - arrange(Agreement, Model) - -model_classifier |> - kable(caption = "Classification of models by number of raters in total and agreement on model structure") -``` - -\newpage - -# Statistical methods - -## Epidemic trend identification - -We retrospectively categorised each week as “Stable”, “Decreasing”, or “Increasing”, based on the difference over a three-week moving average of incidence (with a change of +/-5% as “Stable”). - -```{r trends,fig.cap="Trends (cases)", fig.height = 8, fig.width = 10} -scores |> - filter(epi_target == "Cases") |> - trends_plot() + - ggtitle("Cases") -``` - -```{r death-trends,fig.cap="Trends (deaths)", fig.height = 8, fig.width = 10} -scores |> - filter(epi_target == "Deaths") |> - trends_plot() + - ggtitle("Deaths") -``` - -\newpage - -## Variant phase identification - -```{r variant-phases, fig.cap="Variant phases identified by dominant variant in each location and week"} -# see: R/utils-variants.R -variant_phases <- classify_variant_phases() -variant_phases |> - ggplot(aes(x = target_end_date, y = location, fill = VariantPhase)) + - geom_tile() + - scale_fill_brewer(type = "qual", palette = 2) + - labs(x = NULL, y = "Location", fill = "Variant phase") + - theme_minimal() + - theme(legend.position = "bottom") -``` - -Genomic surveillance data were obtained from three sources: ECDC (covering 30 European countries), UKHSA (Great Britain), and the Swiss Federal Office of Public Health (Switzerland). Variant lineages were mapped to six named phases in expected chronological order: Alpha, Delta, Omicron-BA.1, Omicron-BA.2, Omicron-BA.4/5, and Omicron-BQ/XBB. For each country, we identified the first week in which each named variant exceeded 50% of sequenced samples. We enforced chronological ordering by removing any out-of-sequence phases, then expanded phase assignments to all weeks by filling forward and backward from observed transition dates. This per-location approach accounts for the fact that variant dominance dates differed substantially across European countries. Where genomic surveillance data were too sparse to identify a transition (Hungary), we supplemented with epidemiological reports to set the Alpha-to-Delta transition date. - -\newpage - -## Model fitting - -```{r model-wis} -results <- readRDS(here("output", "results.rds")) -``` - -### Model formula - -`r results$formula` - -### Model diagnostics - -#### Cases - -```{r gamm-diagnostics-cases} -# QQ plot, residuals -knitr::include_graphics(here("output", "plots", "check_Cases.pdf")) -``` - -#### Deaths - -```{r gamm-diagnostics-deaths} -# QQ plot, residuals -knitr::include_graphics(here("output", "plots", "check_Deaths.pdf")) -``` - -## Model results - -### Difference in effects for each modelled variable - -```{r plot-model-effects, fig.height = 10, fig.cap = "Effect of counfounding factors on forecast performance (WIS), showing epidemic trend, country location, and dominant variant phase. An effect <0 indicates improved forecast performance relative to other forecast targets. Effects of each variable shown from a univariate model (unadjusted), and in a joint (adjusted) model accounting for all variables."} -plot_effects(results$effects, - variables = c("Trend", "Location", "VariantPhase")) -``` - -### Results on the natural scale - -We present results from scoring forecast error on the natural scale (difference between observation and prediction in count of case or death incidence), from which the WIS is calculated. - -```{r scores-natural} -scores <- process_data(scoring_scale = "natural") -ensemble <- scores |> - filter(grepl("EuroCOVIDhub-ensemble", Model)) -scores <- scores |> - filter(!grepl("EuroCOVIDhub-", Model)) -print_table1(scores) -``` diff --git a/report/_quarto/supplement/dag.pdf b/report/_quarto/supplement/dag.pdf deleted file mode 100644 index 7baa848..0000000 Binary files a/report/_quarto/supplement/dag.pdf and /dev/null differ diff --git a/report/figures/gam-dag.drawio.svg b/report/figures/gam-dag.drawio.svg new file mode 100644 index 0000000..7e64044 --- /dev/null +++ b/report/figures/gam-dag.drawio.svg @@ -0,0 +1,4 @@ + + + +
Latent

Ancestor

Outcome

Exposure

Model

Model structure

Forecast accuracy

Single/multi-country

Epidemic dynamics
Incidence

Country

Variant

Trend

Task difficulty

Horizon

Modeller  strategy
\ No newline at end of file diff --git a/report/manuscript.qmd b/report/manuscript.qmd index a26eec1..ff0cb0b 100644 --- a/report/manuscript.qmd +++ b/report/manuscript.qmd @@ -1,5 +1,5 @@ --- -title: "The influence of model structure and geographic specificity on predictive accuracy among European COVID-19 forecasts" +title: "The influence of model structure on predictive accuracy among European COVID-19 forecasts" format: html: default docx: default @@ -21,19 +21,19 @@ Email: katharine.sherratt@lshtm.ac.uk ### Abstract - + ### Background - + ### Methods -{{< include _quarto/_methods.qmd >}} +{{< include quarto/_methods.qmd >}} ### Results -{{< include _quarto/_results.qmd >}} +{{< include quarto/_results.qmd >}} ### Discussion diff --git a/report/notebook.qmd b/report/notebook.qmd index 9f0d86f..c548f0c 100644 --- a/report/notebook.qmd +++ b/report/notebook.qmd @@ -41,6 +41,10 @@ Future work (See: future-work.qmd) ## notes +### 28 April 2026 + +- [propensity] match scores by similarity of forecast target? + ### 2 April 2026 - are there any examples of simpson's paradox where the ensemble changes direction from all component forecasts? diff --git a/report/quarto/.gitignore b/report/quarto/.gitignore new file mode 100644 index 0000000..075b254 --- /dev/null +++ b/report/quarto/.gitignore @@ -0,0 +1 @@ +/.quarto/ diff --git a/report/_quarto/_abstract.qmd b/report/quarto/_abstract.qmd similarity index 100% rename from report/_quarto/_abstract.qmd rename to report/quarto/_abstract.qmd diff --git a/report/quarto/_background.qmd b/report/quarto/_background.qmd new file mode 100644 index 0000000..24d653a --- /dev/null +++ b/report/quarto/_background.qmd @@ -0,0 +1,29 @@ +Forecasters addressing a forecast target may choose among many approaches to building a predictive model. +This choice demands a trade-off among complementary strengths and weaknesses of different methods [(1–3)](https://www.zotero.org/google-docs/?XemggF), which may impact predictive accuracy. +However, it has been difficult to evaluate how this choice of method influences resulting forecast performance. +One concern is that such evaluations typically summarise performance across both models and forecast targets simultaneously. +However, each unique forecast target has varying predictive difficulty. +This means that comparisons may conflate differences in predictive power with differences in target difficulty. + +One fundamental choice in model building is the choice of underlying model structure. +This structure might range from purely data-driven statistical methods, to mechanistic models that embed explicit hypotheses about biological or epidemiological drivers of transmission. +Different structural approaches may be better suited to different purposes, for example scenario projection, policy analysis, or real-time nowcasting, and forecasting accuracy is only one lens through which to evaluate them. +Previous work comparing predictive performance between these approaches found no evidence of substantial differences between model structures [(4–6)](https://www.zotero.org/google-docs/?piAg9a). +However, it is not clear whether this null result reflects a valid equivalence, or an artefact of comparisons that did not adequately account for varying target difficulty. + +At the same time, a forecaster may be faced with multiple forecast targets of varying difficulty, and may choose to handle this problem in different ways. +One approach is to build a model able to produce forecasts for any given setting given only updated data, as is typical of statistical time-series and many semi-mechanistic models. +A different approach is to adapt the model itself to a specific target, for example emphasising particular mechanistic processes or exploiting local data sources. +Such target-specific models may be more policy-relevant [(7,8)](https://www.zotero.org/google-docs/?vo794b), and a study of COVID-19 forecasts in Germany and Poland found the two best-performing models were those most tuned to the national context [(9)](https://www.zotero.org/google-docs/?7QAdmm). +Notably, model structure and specificity are not independent. +Data-driven approaches are typically more amenable to forecasting many settings at once, while bespoke mechanistic models are more easily tailored to a single context. +We suggest that any estimated effect of model structure needs to account for this relationship. + +Establishing associations between model development choices and predictive performance is also complicated by limited comparability between forecasts [(10)](https://www.zotero.org/google-docs/?8yn1XI), small samples of independent models [(11)](https://www.zotero.org/google-docs/?ZgO6IM), and the many factors that drive predictive difficulty across targets: dynamics are generally easier to predict in the short term [(15)](https://www.zotero.org/google-docs/?2K90bo), when stable [(9)](https://www.zotero.org/google-docs/?KaFMKc), in larger populations [(16)](https://www.zotero.org/google-docs/?7uCfbD), with timely surveillance [(6)](https://www.zotero.org/google-docs/?YLVYXN), or with informative leading indicators [(17)](https://www.zotero.org/google-docs/?gzkBxT), though inherent limits to predictability remain [(14)](https://www.zotero.org/google-docs/?CZumUU). +Collaborative forecasting Hubs, which collate predictions from diverse modellers and publish them in real time, help address comparability by enabling like-for-like evaluation across models [(13)](https://www.zotero.org/google-docs/?7pIGoa); what remains is to adjust for the co-varying influences on predictability that confound simple comparisons [(11,16,18)](https://www.zotero.org/google-docs/?R3Ah2J). + +Here, we develop a generalised additive mixed effects model that jointly adjusts for the key sources of variation in predictive difficulty, applying it to short-term COVID-19 forecasts across Europe. +Our target estimand is the controlled direct effect of model structure on performance. +This is the partial contribution that is not transmitted through which specific model a team built, or how many countries they chose to forecast, as we hold target characteristics and geographic specificity at their observed values. +We include a directed acyclic graph in the Supplement to represent our assumed causal structure. +This work is exploratory rather than formally causal; we do not have the ability to fully test our assumptions, and we consider a fuller causal treatment to be valuable future work. diff --git a/report/_quarto/_discussion.qmd b/report/quarto/_discussion.qmd similarity index 95% rename from report/_quarto/_discussion.qmd rename to report/quarto/_discussion.qmd index b8e5719..5bd4b13 100644 --- a/report/_quarto/_discussion.qmd +++ b/report/quarto/_discussion.qmd @@ -36,7 +36,9 @@ In particular, different model structures may be more or less well equipped to p For example, models incorporating underlying epidemic mechanisms may be less responsive to shorter term data artefacts but relatively better able to demonstrate the impact of longer term dynamics. Explicitly accounting for such interactions could be a useful insight from future work. -A fuller causal treatment of this question is left to future work. +Without adjustment, comparisons of model structure are confounded by the epidemiological characteristics of the targets each model happened to forecast. +We suggest that accounting for these factors is necessary, not merely desirable, for drawing meaningful comparisons across predictive models. +However, a fuller causal treatment of this question is left to future work. The analysis presented here is exploratory, aimed at describing variation in forecast performance across model structures once the most consequential sources of confounding are accounted for. A subsequent analysis could treat the causal estimand as primary: specifying a directed acyclic graph to derive the minimal sufficient adjustment set, stating the target estimand and contrasts precisely, examining positivity across model-structure by country-target strata, handling differential non-response explicitly, and reporting a quantitative bias analysis (such as an E-value) for unmeasured confounding. A natural candidate for unmeasured confounding in this setting is the latent capacity of a forecasting team to incorporate domain expertise, which may drive both the choice of model structure and the achievable forecast accuracy. diff --git a/report/quarto/_methods.qmd b/report/quarto/_methods.qmd new file mode 100644 index 0000000..cb43eaa --- /dev/null +++ b/report/quarto/_methods.qmd @@ -0,0 +1,107 @@ +**Study design** + +*Forecasting target* + +We consider incidence of COVID-19 cases and deaths between 8 March 2021 to 10 March 2023, using data collated by Johns Hopkins University (JHU). +Data comprised cumulative reported cases or deaths reported daily for 32 European countries [(20)](https://www.zotero.org/google-docs/?cDnqeb). +We calculated the incident weekly count of cases or deaths from the cumulative count, and used the dataset available as of 10 March 2023, when JHU stopped reporting new data. + +We collected forecasts contributed to the public European COVID-19 Forecast Hub [(19)](https://www.zotero.org/google-docs/?5K4uXI). +The platform solicited real-time forecasts for between one and four weeks ahead incidence of COVID-19 cases, deaths, or hospitalisation. +Forecasters were optionally able to express uncertainty by reporting a distribution of up to 23 probabilistic quantiles for each prediction. +The Hub provided continuous real-time visualisation of each forecasting team's performance against observed data over the study period. + +*Participation* + +Forecasting teams were recruited to the Hub using existing networks and ECDC publicity. +Any forecaster was eligible to participate, and there were no selection criteria. +All participation was voluntary and unrenumerated. +Forecasters contributed a standard set of metadata describing their team and model, and uploaded forecasts weekly. +Forecasts were validated against minimal formatting requirements for quantile intervals and that values were positive integers. + +For this study, we collected all forecasts from between 8 March 2021 to 10 March 2023. +We excluded forecasts of hospitalisations, which experienced multiple changes in source data during the study period. +We excluded forecasts that did not report the full set of 23 quantiles, in order to ensure fair comparison among probabilistic model results. +We also excluded baseline and ensemble models created by the Hub team. +We report a STROBE flow diagram of participation in Supplement. + +*Performance evaluation* + +We evaluated forecasts against observed data using the Weighted Interval Score (WIS) [(21)](https://www.zotero.org/google-docs/?T3es1k). +The WIS is a proper scoring rule and a generalisation of the continuous ranked probability score (CRPS). +This assigns each probabilistic forecast a score that reflects the forecast's dispersion, overprediction, and underprediction. + +Prior to forecast scoring, we normalised both forecasts and observations to incidence per 100,000 population. +We then log-transformed both forecasts and observations before scoring. +This transform leads to more comparable scores across locations and times, and can be interpreted as evaluating the ability of models to predict the growth rate [(22,23)](https://www.zotero.org/google-docs/?NuBRI2). +To avoid zeros, we added a small amount (1e-7, or less than 1/1000th of the smallest positive value in the data) to all values. + +*Participant model structures* + +We qualitatively categorised each model in terms of its structure using participant' self-reported metadata. +This comprised an optional description of methods, and links to model code, website or publication. +See Supplement for metadata schema and example. + +We assigned a classification according to the following types: "Agent-based" (with detailed individual-level interactions), "Mechanistic" (broadly based on compartmental models), "Semi-mechanistic" (statistical models with epidemiological elements, e.g. delay distributions and/or generation intervals), "Statistical" (context-agnostic time-series models), "Judgement" (ensembles of human-judgement based predictions). + +Four members of the research team (KS, SF, FB, JM) were tasked with classifying models by their structure, with at least two investigators classifying each model. +Each rater worked independently, and was asked to read each modelling teams' short description of their methods included in Hub metadata, and any additional links provided therein, such as citations or websites with model details. +The majority vote determined the final classification with any potential ties flagged for an additional independent classification by another investigator. +Two investigators (SF, KS) separately reviewed all classifications in discussion for any final revisions (see Supplement). + +**Analysis** + +*Model features* + +We classified each model as targeting either a single location or more than one location. +We used a binary classification rather than a count to address data sparsity (few teams submitted for between 2 and 32 countries) and because forecasters could change the number of targets they submitted over time. +Conditioning on country-target specificity partially blocks the pathway from model structure to performance via task difficulty, consistent with our intended framing of a controlled direct effect. + +We additionally included a random effect for individual model identity, separating the between-method-class contribution from variation between individual model implementations. +We did not attempt to adjust for unobserved characteristics of the forecasting team, such as capacity, infrastructure, or experience, which plausibly shape both model structure choices and realised forecast quality. + +*Target features* + +We stratified all analysis by epidemiological target of incident cases or deaths. +We accounted for contextual features of the forecast target that influenced the variance of forecast performance across model-target pairs. +Adjusting for these narrows the variance around the estimated effect of model structure. + +We first included forecast horizon to account for the increasing difficulty of forecasting further into the future. +We then accounted for variation in epidemic transmission dynamics underlying each forecast target. +The transmission state is itself unobserved, so we adjusted for observed downstream effects (the absolute incidence in the target week and its trend) and observed upstream causes (the country location and the dominant variant phase). +We categorised trend as "Stable", "Decreasing", or "Increasing", based on the difference over a three-week moving average of incidence (with a change of +/-5% as "Stable"; see Supplement). +For variant phase, we used publicly available sequence data to classify each target location-week according to the dominant circulating SARS-CoV-2 variant, producing a discrete sequence of phases (see Supplement) (#ref-variant-data). +We used the country of each forecast target as a categorical covariate, capturing differences in transmission dynamics shaped by, for example, national policy or population structure that are not otherwise captured by incidence, trend, or variant phase. + +*Model building* + +We used a heirarchical model structure, fit separately for each epidemiological target (cases, deaths). +We treated the WIS of each forecast $i$ as Gaussian with a log link, so that the linear predictor $\eta_i$ acts multiplicatively on expected score: + +$$ +\mathrm{WIS}_i \sim \mathcal{N}(\mu_i,\, \sigma^2), \qquad \log \mu_i = \eta_i. +$$ + +The linear predictor decomposed into a fixed intercept, a sum of random-effect contributions from categorical covariates, and two smooth terms for continuous covariates: + +$$ +\eta_i = \beta_0 \;+\; \underbrace{\sum_{g \in \mathcal{G}} u^{g}_{j_g(i)}}_{\text{random effects}} \;+\; \underbrace{f_{\text{Inc}}(\mathrm{Incidence}_i) \;+\; f^{\text{Horizon}}_{k(i)}(\mathrm{Horizon}_i)}_{\text{smooths}}. +$$ + +Here $j_g(i)$ denotes the level of grouping factor $g$ observed for forecast $i$. +We assume random effects are drawn independently from a zero-mean Gaussian with its own variance: + +$$ +u^{g}_{j} \sim \mathcal{N}(0,\, \sigma^2_g), \qquad g \in \mathcal{G} = \{\text{Method},\, \text{CountryTargets},\, \text{Trend},\, \text{Variant},\, \text{Location},\, \text{Model}\}. +$$ + +The two smooths captured continuous covariates. +$f_{\text{Inc}}$ was a penalised thin-plate regression spline over (log-transformed) incidence, and $\{f^{\text{Horizon}}_k\}$ was a factor-smooth interaction (sum-to-zero constrained, with $k=3$ basis functions per forecasting model), allowing the horizon effect to vary by model. + +*Model fitting* + +We used summary statistics to describe WIS by model structure and country target specificity. +To assess confounding, we fitted univariate models to contrast unadjusted and adjusted estimates of the partial effect of model structure on WIS. +We used the *mgcv* package in R [(24)](https://www.zotero.org/google-docs/?Jv7oFh), using fast restricted maximum likelihood estimation (fREML). +We checked model fit with visual inspection of residuals via Q-Q plots. +All data and code are publicly available online at [(25)](https://www.zotero.org/google-docs/?qHy24c), with details in Supplement. diff --git a/report/_quarto/_references.qmd b/report/quarto/_references.qmd similarity index 100% rename from report/_quarto/_references.qmd rename to report/quarto/_references.qmd diff --git a/report/_quarto/_results.qmd b/report/quarto/_results.qmd similarity index 100% rename from report/_quarto/_results.qmd rename to report/quarto/_results.qmd diff --git a/report/quarto/supplement/_supplement.qmd b/report/quarto/supplement/_supplement.qmd new file mode 100644 index 0000000..f2e8915 --- /dev/null +++ b/report/quarto/supplement/_supplement.qmd @@ -0,0 +1,257 @@ +--- +title: "The influence of model structure on predictive accuracy among European COVID-19 forecasts" +subtitle: "Supplementary information" +output: + bookdown::pdf_document2 +--- + +```{r} +#| label: set-up +#| include: false + +library(here) +library(dplyr) +library(readr) +library(tidyr) +library(ggplot2) +library(gratia) +library(patchwork) +source(here("R", "process-data.R")) +source(here("R", "analysis-descriptive.R")) +source(here("R", "plot-model-results.R")) +theme_set(theme_classic()) +knitr::opts_chunk$set( + eval = TRUE, echo = FALSE, + message = FALSE, warning = FALSE +) +``` + +## Code and data availability + +### Code + +The codebase for this paper is publicly available at: + +- Github: +- Zenodo with DOI: + +Comments and code contributions are welcome - please use Github [Issues](https://github.com/epiforecasts/eval-by-method/issues). + +Please cite code using: + +- Katharine Sherratt & Sebastian Funk. + (2025). + epiforecasts/eval-by-method: Zenodo. + + +### Source data + +Forecast and observed data were sourced from the European COVID-19 Forecast Hub, available to view at . +All Hub data are now archived at: + +- Github: +- Zenodo with DOI: + +Data for this work were downloaded on 30th May 2023. +These data are available in the Github repository for this paper at: + +```{r} +#| label: load-data + +# Load data +scores <- process_data(scoring_scale = "log") +ensemble <- scores |> + filter(grepl("EuroCOVIDhub-ensemble", Model)) +scores <- scores |> + filter(!grepl("EuroCOVIDhub-ensemble", Model)) +n_forecasts <- nrow(scores) +``` + +\newpage + +## Participating model characteristics + +### Eligibility criteria + +```{r} +#| label: model-flow +#| fig-cap: "Eligibility criteria for models contributing case (left) and death (right) forecasts to the European COVID-19 Forecast Hub, March 2021 - March 2023" + +# source(here("R", "plot-model-flow.R")) +# flow_chart <- create_model_flow() +knitr::include_graphics(here("output", "plots", "flowchart.png")) +``` + +### Model characteristics + +```{r} +#| label: metadata + +table_metadata(scores) |> + select(-Description) |> + kable(caption = "Model characteristics contributing to the European COVID-19 Forecast Hub, by method used, number of countries targeted, and number of forecasts contributed.") +``` + +```{r} +#| label: classification + +model_classifier <- classify_models(return_majority = FALSE) |> + distinct(model, classification, raters, votes, agreement) |> + group_by(model) |> + mutate(final = classification[which.max(votes)]) |> + pivot_wider(names_from = classification, values_from = votes, values_fill = 0) |> + filter(model %in% unique(scores$Model)) |> + select(Model = model, `Final classification` = final, + Agreement = agreement, `Total raters` = raters, everything()) |> + arrange(Agreement, Model) + +model_classifier |> + kable(caption = "Classification of models by number of raters in total and agreement on model structure") +``` + +\newpage + +## Statistical methods + +### Assumptions + +To illustrate our approach to forecast evaluation, we show the assumptions underlying our adjustment strategy as a directed acyclic graph (DAG). +We emphasise that this work is exploratory rather than inferential, and we suggest substantial further work is needed to support inferential approaches to forecast evaluation. +Latent nodes capture unobserved constructs for which we use observed variables as partial proxies. + +```{r} +#| label: dag +#| fig-cap: "Directed acyclic graph of assumed causal relationships among forecast performance (wis), model structure (Method), and covariates." + +source(here::here("R", "dag-check.R")) +p +``` + +```{r} +#| label: dag-adjustment +#| fig-cap: "Minimal adjustment sets implied by the DAG for estimating the total effect of Method on wis. Note that the joint model fitted in the main analysis goes beyond this set by also conditioning on country-target specificity and individual model identity, both of which are downstream of Method in the DAG; this targets a controlled direct effect rather than the total effect." + +p_adj +``` + +The joint model fitted in the main analysis adjusts for task-difficulty variables (Trend, Horizon, Incidence, Location, VariantPhase), which block backdoor paths from epidemic dynamics into the score, and additionally conditions on country-target specificity (CountryTargets) and individual model identity (Model). +Both of the latter are descendants of Method in the DAG: CountryTargets sits on a Method → Task difficulty → score pathway, and Model sits on a Method → Model → score pathway. +Adjusting for them therefore does not estimate the total effect of Method on WIS but a controlled direct effect — the partial association between Method and log-WIS with these mediators held fixed at their observed values. +Residual confounding from unobserved team resources, which feed into the latent *Modeller strategy* node, is not addressed by this adjustment and is the principal limitation of any causal reading of these estimates. +A more extended discussion of estimand choice for forecast-hub evaluations, including the distinction between total effects, direct effects, and controlled direct effects in this setting, is given in the accompanying methodological sketch (`dag-reasoning-sketch.qmd`). + +### Epidemic trend identification + +We retrospectively categorised each week as "Stable", "Decreasing", or "Increasing", based on the difference over a three-week moving average of incidence (with a change of +/-5% as "Stable"). + +```{r} +#| label: trends +#| fig-cap: "Trends (cases)" +#| fig-height: 8 +#| fig-width: 10 + +scores |> + filter(epi_target == "Cases") |> + trends_plot() + + ggtitle("Cases") +``` + +```{r} +#| label: death-trends +#| fig-cap: "Trends (deaths)" +#| fig-height: 8 +#| fig-width: 10 + +scores |> + filter(epi_target == "Deaths") |> + trends_plot() + + ggtitle("Deaths") +``` + +\newpage + +### Variant phase identification + +```{r} +#| label: variant-phases +#| fig-cap: "Variant phases identified by dominant variant in each location and week" + +# see: R/utils-variants.R +variant_phases <- classify_variant_phases() +variant_phases |> + ggplot(aes(x = target_end_date, y = location, fill = VariantPhase)) + + geom_tile() + + scale_fill_brewer(type = "qual", palette = 2) + + labs(x = NULL, y = "Location", fill = "Variant phase") + + theme_minimal() + + theme(legend.position = "bottom") +``` + +Genomic surveillance data were obtained from three sources: ECDC (covering 30 European countries), UKHSA (Great Britain), and the Swiss Federal Office of Public Health (Switzerland). +Variant lineages were mapped to six named phases in expected chronological order: Alpha, Delta, Omicron-BA.1, Omicron-BA.2, Omicron-BA.4/5, and Omicron-BQ/XBB. +For each country, we identified the first week in which each named variant exceeded 50% of sequenced samples. +We enforced chronological ordering by removing any out-of-sequence phases, then expanded phase assignments to all weeks by filling forward and backward from observed transition dates. +This per-location approach accounts for the fact that variant dominance dates differed substantially across European countries. +Where genomic surveillance data were too sparse to identify a transition (Hungary), we supplemented with epidemiological reports to set the Alpha-to-Delta transition date. + +\newpage + +### Model fitting + +```{r} +#| label: model-wis + +results <- readRDS(here("output", "results.rds")) +``` + +#### Model formula + +`r results$formula` + +#### Model diagnostics + +##### Cases + +```{r} +#| label: gamm-diagnostics-cases + +# QQ plot, residuals +knitr::include_graphics(here("output", "plots", "check_Cases.pdf")) +``` + +##### Deaths + +```{r} +#| label: gamm-diagnostics-deaths + +# QQ plot, residuals +knitr::include_graphics(here("output", "plots", "check_Deaths.pdf")) +``` + +## Results + +### Difference in effects for each modelled variable + +```{r} +#| label: plot-model-effects +#| fig-height: 10 +#| fig-cap: "Effect of counfounding factors on forecast performance (WIS), showing epidemic trend, country location, and dominant variant phase. An effect <0 indicates improved forecast performance relative to other forecast targets. Effects of each variable shown from a univariate model (unadjusted), and in a joint (adjusted) model accounting for all variables." + +plot_effects(results$effects, + variables = c("Trend", "Location", "VariantPhase")) +``` + +### Results on the natural scale + +We present results from scoring forecast error on the natural scale (difference between observation and prediction in count of case or death incidence), from which the WIS is calculated. + +```{r} +#| label: scores-natural + +scores <- process_data(scoring_scale = "natural") +ensemble <- scores |> + filter(grepl("EuroCOVIDhub-ensemble", Model)) +scores <- scores |> + filter(!grepl("EuroCOVIDhub-", Model)) +print_table1(scores) +``` diff --git a/report/quarto/supplement/dag-reasoning-sketch.qmd b/report/quarto/supplement/dag-reasoning-sketch.qmd new file mode 100644 index 0000000..4acedc0 --- /dev/null +++ b/report/quarto/supplement/dag-reasoning-sketch.qmd @@ -0,0 +1,101 @@ +--- +title: "Using DAGs to reason about forecast accuracy in collaborative hubs" +subtitle: "Sketch for a separate methodological piece" +format: html +--- + +## Motivation + +Forecast hubs generate a peculiar dataset: many *teams* run many *models* over many *targets* and *horizons*, and each combination yields a realised score. Researchers want to ask causal questions of this dataset — "does method class X produce more accurate forecasts than Y?", "does using mobility data improve calibration?" — but the data are observational, exposures are bundled, and the most important confounder (team capacity) is unmeasured. + +Directed acyclic graphs (DAGs) are an under-used tool for this setting. This sketch proposes a short methodological paper that develops DAG-based reasoning for forecast-hub evaluation, using the European COVID-19 Forecast Hub as a worked example. + +## Argument in one paragraph + +Most published forecast-hub evaluations report scores stratified by model attributes and treat differences as substantive. This conflates predictive ranking with causal inference. A DAG forces explicit reasoning about (i) what the *target* causal estimand is — total effect of a strategy choice, direct effect holding execution constant, or controlled direct effect under a counterfactual recalibration; (ii) which task-difficulty variables are confounders versus precision variables; (iii) which apparently-useful adjustments are colliders that bias the comparison (ensemble inclusion, submission persistence). The European Hub is a useful demonstration case because the team has iterated on multiple DAG formulations, each surfacing a different identification problem. + +## Proposed structure + +### 1. The problem + +- Hub evaluations report differences in scores across model attributes; readers (and modellers) interpret these as evidence about *what works*. +- The leap from "X scores better than Y" to "choosing X causes better forecasts" requires assumptions that are rarely stated. +- DAGs make those assumptions auditable. + +### 2. A minimal causal vocabulary for forecast evaluation + +- Exposure: a *feature of a modelling strategy* (method class, data inputs, calibration approach, geographic scope, update cadence, uncertainty representation). +- Outcome: a realised score (WIS, log score, coverage). Note: the score is one realisation from a distribution; expected score is the estimand. +- Unit: forecast-instance, nested within model, nested within team. +- Confounder taxonomy: team-level (latent capacity), task-level (calendar time, location, variant phase, horizon, incidence, trend), and selection-induced (submission persistence, ensemble inclusion). + +### 3. Total vs direct effects + +This is the section the European Hub work makes vivid. + +- **Total effect** of method class on WIS = effect through *all* downstream pathways, including any post-hoc recalibration the team applies because of their method choice. This is what most readers think they're getting. +- **Direct effect** = effect not mediated by validation effort (recalibration, ensembling). This requires controlling for those mediators and risks overadjustment. +- **Controlled direct effect** under a counterfactual ("what if every team applied the same recalibration?") is often what we *should* want to estimate, but is not identifiable without an explicit model of the mediator. + +The Schlesinger (1979) triad — qualification, verification, validation — gives a useful sub-structure of the Strategy node: + +- *Qualification* choices (method class, state space, data inputs) sit upstream; +- *Verification* (code correctness, pipeline integrity) is mostly latent and produces tail-risk; +- *Validation* effort (recalibration, ensembling, backtesting) sits downstream, between qualification and realised WIS, and is itself a strategy choice. + +The total effect of a qualification choice flows partly through validation effort. Whether to block that pathway depends on the question. + +### 4. A minimum adjustment set for the European Hub + +For the effect of method class on log WIS, we propose: + +- **Block task-difficulty backdoors**: calendar time, location, variant phase, horizon, incidence, trend. +- **Partially block team-capacity backdoor**: capacity proxies (number of locations, number of targets, submission consistency). All are themselves strategy features, so this is overadjustment-aware. +- **Do not condition on**: ensemble membership (collider), submission persistence (collider on the team→accuracy path), Model identity (absorbs the exposure). +- **Random effects**: Location, VariantPhase as crossed random effects to capture residual task heterogeneity; Team (not Model) as the cluster for capacity. + +What this gives us: a *bounded* estimate of the between-team effect, with the residual capacity confounding quantified via E-values or similar bias analysis. + +What it does not give us: the within-team controlled direct effect. That requires a different design (same team running multiple methods over the same period — rare but does occur in the Hub). + +### 5. DAGs we have iterated through + +This is the empirical/historical section, drawing on the project's actual development. + +- **DAG v1**: Method → WIS, with Location/Horizon/Trend as confounders. Team capacity invisible. *Problem identified*: open backdoor through capacity. +- **DAG v2**: Capacity proxied by CountryTargets; Method → WIS adjusted for CountryTargets. *Problem identified*: CountryTargets is itself a strategy feature, so adjustment partially blocks the effect of geographic scope. +- **DAG v3** (this paper): Strategy node decomposed into qualification/verification/validation; Calendar time elevated to top-level confounder; explicit collider warnings (ensemble inclusion, submission persistence); Verification-failure node added to absorb pipeline-bug variance. + +Each iteration revealed an assumption the previous version had buried. The methodological point: *a single DAG is rarely the answer; a sequence of DAGs documents which assumptions were testable and which weren't.* + +### 6. What is and isn't identifiable from hub data + +A short table: + +| Question | Identifiable? | Why / why not | +|---|---|---| +| Within-team change in calibration approach over time | Yes | Capacity differenced out | +| Between-team method-class comparison, capacity proxies adjusted | Bounded | Residual capacity confounding; quantify with E-values | +| Marginal causal effect of "choosing mechanistic over statistical" | No | No instrument, no within-team variation, capacity unmeasured | +| Effect of ensembling on WIS | Mediator-on-pathway | Depends on whether qualification choice is held fixed | + +### 7. Practical recommendations for hub evaluators + +- State the estimand before the model. +- Draw the DAG and publish it with the evaluation. +- Distinguish reporting (predictive ranking) from inference (causal claims). +- Treat capacity proxies as partial; report bias bounds. +- Avoid colliders that look like obvious adjustments (ensemble inclusion is the worst offender). + +## Output format options + +- A short methodological paper (~3000 words) for *Epidemics*, *International Journal of Forecasting*, or *American Journal of Epidemiology* (methods section). +- A commentary tied to the main project paper, in the same journal. +- A blog post / preprint as a lower-stakes first version. + +## Open questions for the methodological piece + +- Is the right comparison case the European Hub alone, or a parallel treatment using the US Hub for external validity? +- Should it engage with the wider scoring-rule literature (proper scores as estimands) or stay narrowly causal? +- Do we want simulation evidence (generate a known DGP, show that naive comparisons recover the wrong sign) or stay at the conceptual level? +- How much space for the Schlesinger framing — a paragraph, a section, or a structuring device throughout? diff --git a/report/_quarto/supplement/density.jpg b/report/quarto/supplement/density.jpg similarity index 100% rename from report/_quarto/supplement/density.jpg rename to report/quarto/supplement/density.jpg diff --git a/report/_quarto/supplement/diagnostics.jpg b/report/quarto/supplement/diagnostics.jpg similarity index 100% rename from report/_quarto/supplement/diagnostics.jpg rename to report/quarto/supplement/diagnostics.jpg diff --git a/report/_quarto/supplement/estimates.jpg b/report/quarto/supplement/estimates.jpg similarity index 100% rename from report/_quarto/supplement/estimates.jpg rename to report/quarto/supplement/estimates.jpg diff --git a/report/_quarto/supplement/fig-anomalies.pdf b/report/quarto/supplement/fig-anomalies.pdf similarity index 100% rename from report/_quarto/supplement/fig-anomalies.pdf rename to report/quarto/supplement/fig-anomalies.pdf diff --git a/report/_quarto/future-work.qmd b/report/quarto/supplement/future-work.qmd similarity index 100% rename from report/_quarto/future-work.qmd rename to report/quarto/supplement/future-work.qmd