Add Harbor Dockerfile and multi-step support #1865
Draft
gabeorlanski wants to merge 7 commits into
fbc6288 to 483b08d
Description
Adds Harbor Dockerfile and multi-step task support across the Harbor implementations.
Key changes:
harbor-v1 and experimental HarborEnv.
harbor-v1.
Type of Change
Testing
uv run pytest locally.
Ran:
uv run ruff check verifiers/harbor.py verifiers/v1/tasksets/harbor_v1 verifiers/envs/experimental/harbor_env/env.py tests/v1/test_harbor_taskset.py
uv run pytest tests/v1/test_harbor_taskset.py tests/test_harbor_env_mcp.py -q
uv run pytest tests/v1 -q -k 'not e2e'
Note: I did not run the full uv run pytest suite locally because the full v1 e2e matrix includes environment-dependent cases such as Modal.
Checklist
Additional Notes
This keeps framework changes out of the PR and confines implementation to Harbor/shared-Harbor code paths plus tests/docs.
Note
Add Dockerfile build support and multi-step task scoring to Harbor taskset
DockerfilePolicy to taskset.py with build, ignore, and error modes; build constructs a deterministic local Docker image (tagged vf-harbor-*) from the task's Dockerfile context with content-hash caching.
HarborTask and parse_task to carry multi-step metadata (HarborStep list, multi_step_reward_strategy, workdir), with aggregate harness/scoring timeouts computed as sums (unbounded if any step has no timeout).
HarborTaskset.solved and introduces run_verifier/record_step_results to evaluate steps sequentially, support early stopping on min_reward failure, and aggregate results using mean or final strategy into trace.info.
instruction.md is absent.
dockerfile_policy changes to build; existing callers using --taskset.ignore-dockerfile must migrate to --taskset.dockerfile-policy ignore.
Macroscope summarized fbc6288. (Automatic summaries will resume when PR exits draft mode or review begins).
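The three mechanisms named in the summary above (content-hash image tags, summed timeouts, and sequential step scoring with early stopping) can be illustrated with a rough sketch. This is not the code from this PR: `Step`, `context_tag`, `total_timeout`, and `score_steps` are hypothetical names, the 16-character digest length is arbitrary, and taking the mean over only the steps that actually ran is an assumption about how early stopping interacts with the mean strategy.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional, Sequence


@dataclass
class Step:  # hypothetical stand-in for a HarborStep-like record
    name: str
    timeout: Optional[float] = None     # seconds; None means no limit
    min_reward: Optional[float] = None  # stop early if reward falls below this


def context_tag(context_dir: Path, prefix: str = "vf-harbor-") -> str:
    """Derive a deterministic image tag from the build context's contents."""
    digest = hashlib.sha256()
    # Sort paths so the hash does not depend on directory listing order.
    for path in sorted(p for p in context_dir.rglob("*") if p.is_file()):
        digest.update(path.relative_to(context_dir).as_posix().encode())
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return prefix + digest.hexdigest()[:16]


def total_timeout(steps: Sequence[Step]) -> Optional[float]:
    """Sum per-step timeouts; unbounded (None) if any step has no timeout."""
    timeouts = [s.timeout for s in steps]
    if any(t is None for t in timeouts):
        return None
    return sum(timeouts)


def score_steps(
    steps: Sequence[Step],
    verify: Callable[[Step], float],
    strategy: str = "mean",
) -> float:
    """Run steps in order, stop on a min_reward failure, then aggregate."""
    rewards = []
    for step in steps:
        reward = verify(step)
        rewards.append(reward)
        if step.min_reward is not None and reward < step.min_reward:
            break  # early stop: later steps are never verified
    if not rewards:
        return 0.0
    if strategy == "final":
        return rewards[-1]
    if strategy == "mean":
        # Assumption: average only over the steps that were evaluated.
        return sum(rewards) / len(rewards)
    raise ValueError(f"unknown strategy: {strategy!r}")
```

Because the tag depends only on relative paths and file bytes, an unchanged Dockerfile context maps to the same tag, so a caller could skip the build when an image with that tag already exists locally.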