Add automatic Downstream Evaluation to Modalities by CYHSM · Pull Request #450 · Modalities/modalities

CYHSM · 2026-06-09T12:28:37Z

What does this PR do?

This PR adds a downstream evaluation pipeline that hooks directly into the training loop callbacks. The model is first converted to Hugging Face (HF) format and then run through an evaluation tool (Olmes here, but any tool can be used).

General Changes

ModelConverter: Converts model to HF
DownstreamEvaluator: Uses the HF model to run the Olmes evaluation
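
As a rough illustration of how the two components could hook into the training callbacks, the sketch below gates conversion and evaluation on the step interval and the global rank. The names `IntervalGate` and `should_run` are hypothetical and not taken from this PR. Restricting the work to rank 0 is also an assumption, based only on `global_rank` and `eval_interval` appearing in the config.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntervalGate:
    """Decides whether a callback should fire at a given step (hypothetical helper)."""

    eval_interval: int  # e.g. the checkpointing or evaluation interval in steps
    global_rank: int    # rank of the current process

    def should_run(self, num_train_steps_done: int) -> bool:
        # Assumption: only rank 0 converts/evaluates, and only on interval boundaries.
        if self.global_rank != 0:
            return False
        if num_train_steps_done <= 0:
            return False
        return num_train_steps_done % self.eval_interval == 0


gate = IntervalGate(eval_interval=100, global_rank=0)
steps_that_trigger = [s for s in range(0, 301, 50) if gate.should_run(s)]
print(steps_that_trigger)
```

With a gate like this, a conversion step and an evaluation step can share the same scheduling logic while using different intervals, which matches the two separate `eval_interval` settings in the config.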

Example Addition to YAML:

model_converter:
  component_key: model_converter
  variant_key: default
  config:
    command_template: "CUDA_VISIBLE_DEVICES=7 python src/modalities/conversion/gpt2/convert_gpt2.py {modalities_config} {output_dir} --checkpoint_path {checkpoint_path} > {checkpoint_path}/conversion.log 2>&1"
    checkpoint_dir: ${settings.paths.experiments_root_path}/${settings.experiment_id}
    global_rank: ${settings.cuda_env.global_rank}
    eval_interval: ${settings.intervals.checkpointing_interval_in_steps}

downstream_evaluator:
  component_key: downstream_evaluator
  variant_key: default
  config:
    tokenizer:
      instance_key: tokenizer
      pass_type: BY_REFERENCE
    tasks:
      - "minerva_math_algebra:bpb::olmes"
      - "minerva_math_counting_and_probability:bpb::olmes"
      - "minerva_math_geometry:bpb::olmes"
      - "minerva_math_intermediate_algebra:bpb::olmes"
      - "minerva_math_number_theory:bpb::olmes"
      - "minerva_math_prealgebra:bpb::olmes"
      - "minerva_math_precalculus:bpb::olmes"
      - "arc_challenge:rc::olmes:full"
      - "arc_easy:rc::olmes:full"
      - "hellaswag:rc::olmes:full"
      - "winogrande:rc::olmes:full"
      - "socialiqa:rc::olmes:full"
      - "piqa:rc::olmes:full"
      - "qasper_yesno:rc::olmes"
      - "lambada"
      - "arc_challenge:rc:bpb::olmes:full"
      - "arc_easy:rc:bpb::olmes:full"
      - "hellaswag:rc:bpb::olmes:full"
      - "winogrande:rc:bpb::olmes:full"
      - "socialiqa:rc:bpb::olmes:full"
      - "piqa:rc:bpb::olmes:full"
      - "qasper_yesno:rc:bpb::olmes"
      - "lambada:bpb"
      - "gsm8k::olmes"
    eval_interval: ${settings.intervals.evaluation_interval_in_steps}
    checkpoint_dir: ${settings.paths.experiments_root_path}/${settings.experiment_id}
    global_rank: ${settings.cuda_env.global_rank}
    olmes_command_template: "CUDA_VISIBLE_DEVICES=7 . /home/markus_frey/Github/olmes/.venv/bin/activate && olmes --model {hf_model_dir} --model-args '{{\"trust_remote_code\": true}}' --task {tasks} --limit 8 --output-dir {hf_model_dir}/olmes_eval_{step} > {hf_model_dir}/olmes_eval_{step}.log 2>&1"

Both components only run if they are added to the YAML config; otherwise the normal pipeline is unchanged. This PR also supersedes #448, so this one can be merged instead of #448.
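
The command templates in the YAML above use placeholders such as `{hf_model_dir}`, `{tasks}`, and `{step}`, plus doubled braces around the JSON passed to `--model-args`. The doubled braces suggest expansion via Python's `str.format`, though that is an assumption about the implementation. A minimal sketch of such an expansion follows; the function name `build_olmes_command`, the example path, and joining the tasks with spaces are all hypothetical.

```python
def build_olmes_command(template: str, hf_model_dir: str, tasks: list[str], step: int) -> str:
    """Fill a command template via str.format (assumed expansion mechanism)."""
    # Assumption: tasks are handed to --task as a space-separated list.
    return template.format(hf_model_dir=hf_model_dir, tasks=" ".join(tasks), step=step)


# Shortened template modeled on the YAML above. "{{" and "}}" are str.format
# escapes that become literal braces in the final command.
template = (
    "olmes --model {hf_model_dir} --model-args '{{\"trust_remote_code\": true}}' "
    "--task {tasks} --output-dir {hf_model_dir}/olmes_eval_{step}"
)

command = build_olmes_command(
    template,
    hf_model_dir="/tmp/run/hf_model",  # hypothetical path
    tasks=["arc_easy:rc::olmes:full", "lambada"],
    step=1000,
)
print(command)
```

Without the doubled braces, `str.format` would treat the JSON object as a placeholder and raise an error, which is why the template escapes them.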

Checklist before submitting final PR

My PR is minimal and addresses one issue in isolation
I have merged the latest version of the target branch into this feature branch
I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
I have run a sample config for model training
I have checked that all tests run through (python tests/tests.py)
I have updated the internal changelog (CHANGELOG_DEV.md)

CYHSM added 8 commits June 2, 2026 09:28

extracted clean logging code from hpc branch

f66dc37

Initial commit for evaluation with olmes and checkpoint conversion

aa55562

initial commit for automated evals

7fbc010

added script to precache tasks for compute nodes without internet access

43f08f9

bf running tests on leonardo

4623f57

final fixes for downstream evaluation implementation

5248de3

Merge branch 'main' into downstream-evaluator

65722b0

fixed conversion script to be strict

db434d2

CYHSM wants to merge 8 commits into Modalities:main from CYHSM:downstream-evaluator
