Add automatic Downstream Evaluation to Modalities #450

Open
CYHSM wants to merge 8 commits into Modalities:main from CYHSM:downstream-evaluator

@CYHSM CYHSM commented Jun 9, 2026

Contributor

What does this PR do?

This PR adds a downstream evaluation pipeline that hooks directly into the training loop callbacks. The model is first converted to Hugging Face (HF) format and then run through an evaluation tool (Olmes here, but any tool can be used).

General Changes

  • ModelConverter: Converts the model to HF format
  • DownstreamEvaluator: Uses the HF model to run the Olmes evaluation
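
The ordering and gating of the two components can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `IntervalCallback` and `run_callbacks` are hypothetical names, and the sketch assumes that each component fires every `eval_interval` steps and only on global rank 0 (the config exposes `eval_interval` and `global_rank`, but how they are used is an assumption here).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class IntervalCallback:
    """Hypothetical helper: runs `action` every `eval_interval` steps.

    Assumption: only global rank 0 performs the work; other ranks skip it.
    """

    name: str
    eval_interval: int
    global_rank: int
    action: Callable[[int], None]

    def should_run(self, step: int) -> bool:
        # Skip step 0 and every step that is not a multiple of the interval.
        return self.global_rank == 0 and step > 0 and step % self.eval_interval == 0

    def __call__(self, step: int) -> bool:
        if not self.should_run(step):
            return False
        self.action(step)
        return True


def run_callbacks(callbacks: List[IntervalCallback], step: int) -> List[str]:
    """Call each callback in list order and return the names of those that ran."""
    return [cb.name for cb in callbacks if cb(step)]


if __name__ == "__main__":
    log: List[str] = []
    # The converter comes first so the evaluator finds an HF model to load.
    callbacks = [
        IntervalCallback("model_converter", 100, 0, lambda s: log.append(f"convert@{s}")),
        IntervalCallback("downstream_evaluator", 100, 0, lambda s: log.append(f"evaluate@{s}")),
    ]
    for step in range(1, 201):
        run_callbacks(callbacks, step)
    print(log)
```

Note that the example config takes the two intervals from different settings (checkpointing vs. evaluation interval). The sketch does not handle a step where an evaluation is due but no converted checkpoint exists.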

Example Addition to YAML:

model_converter:
  component_key: model_converter
  variant_key: default
  config:
    command_template: "CUDA_VISIBLE_DEVICES=7 python src/modalities/conversion/gpt2/convert_gpt2.py {modalities_config} {output_dir} --checkpoint_path {checkpoint_path} > {checkpoint_path}/conversion.log 2>&1"
    checkpoint_dir: ${settings.paths.experiments_root_path}/${settings.experiment_id}
    global_rank: ${settings.cuda_env.global_rank}
    eval_interval: ${settings.intervals.checkpointing_interval_in_steps}

downstream_evaluator:
  component_key: downstream_evaluator
  variant_key: default
  config:
    tokenizer:
      instance_key: tokenizer
      pass_type: BY_REFERENCE
    tasks:
      - "minerva_math_algebra:bpb::olmes"
      - "minerva_math_counting_and_probability:bpb::olmes"
      - "minerva_math_geometry:bpb::olmes"
      - "minerva_math_intermediate_algebra:bpb::olmes"
      - "minerva_math_number_theory:bpb::olmes"
      - "minerva_math_prealgebra:bpb::olmes"
      - "minerva_math_precalculus:bpb::olmes"
      - "arc_challenge:rc::olmes:full"
      - "arc_easy:rc::olmes:full"
      - "hellaswag:rc::olmes:full"
      - "winogrande:rc::olmes:full"
      - "socialiqa:rc::olmes:full"
      - "piqa:rc::olmes:full"
      - "qasper_yesno:rc::olmes"
      - "lambada"
      - "arc_challenge:rc:bpb::olmes:full"
      - "arc_easy:rc:bpb::olmes:full"
      - "hellaswag:rc:bpb::olmes:full"
      - "winogrande:rc:bpb::olmes:full"
      - "socialiqa:rc:bpb::olmes:full"
      - "piqa:rc:bpb::olmes:full"
      - "qasper_yesno:rc:bpb::olmes"
      - "lambada:bpb"
      - "gsm8k::olmes"
    eval_interval: ${settings.intervals.evaluation_interval_in_steps}
    checkpoint_dir: ${settings.paths.experiments_root_path}/${settings.experiment_id}
    global_rank: ${settings.cuda_env.global_rank}
    olmes_command_template: "CUDA_VISIBLE_DEVICES=7 . /home/markus_frey/Github/olmes/.venv/bin/activate && olmes --model {hf_model_dir} --model-args '{{\"trust_remote_code\": true}}' --task {tasks} --limit 8 --output-dir {hf_model_dir}/olmes_eval_{step} > {hf_model_dir}/olmes_eval_{step}.log 2>&1"
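
The doubled braces in `--model-args '{{...}}'` suggest the command templates are filled with Python's `str.format`, where `{{` and `}}` produce literal braces. The sketch below shows how such a template could be rendered. It is an illustration under assumptions: `render_command` is a hypothetical helper, joining the task list with spaces is a guess at how `{tasks}` is filled, and the `CUDA_VISIBLE_DEVICES` and virtualenv prefix from the config is left out for brevity.

```python
import shlex
from typing import List

# Shortened copy of the template from the YAML above (after YAML unescaping).
OLMES_TEMPLATE = (
    "olmes --model {hf_model_dir} "
    "--model-args '{{\"trust_remote_code\": true}}' "
    "--task {tasks} --limit 8 "
    "--output-dir {hf_model_dir}/olmes_eval_{step} "
    "> {hf_model_dir}/olmes_eval_{step}.log 2>&1"
)


def render_command(template: str, hf_model_dir: str, tasks: List[str], step: int) -> str:
    """Fill the placeholders; `{{` and `}}` in the template become `{` and `}`."""
    return template.format(
        # shlex.quote leaves simple paths and task names untouched and
        # quotes anything containing spaces or shell metacharacters.
        hf_model_dir=shlex.quote(hf_model_dir),
        tasks=" ".join(shlex.quote(task) for task in tasks),  # assumption: space-separated
        step=step,
    )


if __name__ == "__main__":
    command = render_command(
        OLMES_TEMPLATE,
        hf_model_dir="/tmp/ckpt/hf",  # hypothetical path
        tasks=["arc_easy:rc::olmes:full", "lambada"],
        step=100,
    )
    print(command)
```

Because the rendered string contains a shell redirection, it would have to be executed through a shell (for example `subprocess.run(command, shell=True)`), which is why quoting the substituted values matters.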

Both components run only if they are added to the YAML config; otherwise the normal pipeline is unchanged. This PR also supersedes #448, so we can merge this one instead.

Checklist before submitting final PR

  • My PR is minimal and addresses one issue in isolation
  • I have merged the latest version of the target branch into this feature branch
  • I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
  • I have run a sample config for model training
  • I have checked that all tests run through (python tests/tests.py)
  • I have updated the internal changelog (CHANGELOG_DEV.md)
