feat(v1): add Trace.last_reply and adopt it across v1 envs #1897
Merged
Approvability
Verdict: Approved
Adds a convenience property
Add a `last_reply` property on `vf.Trace`: the stripped text of the final sampled assistant message, or `""` when there are no assistant messages or the last one has no text. It is a typed convenience over the `(assistant_messages[-1].content or "").strip()` idiom every scoring reward repeats, and it is safe by default (it returns `""` instead of raising `IndexError` on an empty trace; `AssistantMessage.content` is `str | None`).

Adopt `trace.last_reply` in the v1 envs and docs that read the model's final reply as text for scoring/judging: reverse-text, gsm8k, glossary, deepwiki, code-golf, scratchpad, wiki-search, and the GUIDE snippets. Each previously hand-rolled `assistant_messages[-1].content if assistant_messages else ""` (or the unguarded `[-1].content`), so the swap also removes the empty-trace footgun.

Envs that need the full message list (alphabet-sort iterates every turn; color-codeword takes the message objects) are left on `assistant_messages`.

Verified: env tasksets import against the patched source; ruff clean.

Amp-Thread-ID: https://ampcode.com/threads/T-019f130a-103e-72fc-8182-e1a43db7d403
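The property described in the commit message can be sketched as follows. This is a minimal illustration, not the actual `verifiers` source: `AssistantMessage` and `Trace` here are simplified stand-in dataclasses (assumptions), and only the `last_reply` expression follows what the PR states.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AssistantMessage:
    # Stand-in for the real message type; content may be None.
    content: str | None = None


@dataclass
class Trace:
    # Stand-in for vf.Trace, holding only the sampled assistant messages.
    assistant_messages: list[AssistantMessage] = field(default_factory=list)

    @property
    def last_reply(self) -> str:
        # Stripped text of the final assistant message, or "" when the
        # trace is empty or the final message has no text.
        msgs = self.assistant_messages
        return (msgs[-1].content or "").strip() if msgs else ""


two_turns = Trace([AssistantMessage(" a1 "), AssistantMessage("a2")])
print(repr(two_turns.last_reply))  # 'a2'
print(repr(Trace().last_reply))    # ''
```

Keeping the guard inside the property means each reward reads one attribute instead of repeating the empty-list and `None` checks.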
a5ed55f to d76f20a
Description
Adds `Trace.last_reply` (the stripped text of the final sampled assistant message, or `""` when there are no assistant messages or the last one has no text) and adopts it across the v1 envs and docs that read the model's final reply as text for scoring/judging.

Every scoring reward that wants "the answer" hand-rolled the same idiom (`assistant_messages[-1].content if assistant_messages else ""`, or the unguarded `[-1].content`). `last_reply` is the typed convenience over that, and it is safe by default: it returns `""` instead of raising `IndexError` on an empty trace, and `AssistantMessage.content` is `str | None`, so the `None` case is handled too.

Type of Change
What changed
- `verifiers/v1/trace.py`: new `last_reply` property (`@property`, next to `assistant_messages`): `(msgs[-1].content or "").strip() if msgs else ""`.
- `tests/v1/test_trace.py`: `test_last_reply` covers three cases: newest sampled turn wins (`" a1 "` then `"a2"` → `"a2"`), no turns → `""`, `None` content → `""`.
- Envs now on `trace.last_reply` (each previously hand-rolled the guard): `reverse-text-v1`, `gsm8k-v1`, `glossary-v1`, `deepwiki-v1`, `code-golf-v1`, `scratchpad-v1`, `wiki-search-v1`.
- `verifiers/v1/GUIDE.md`: the canonical reverse-text reward and two other snippets now use `trace.last_reply`.
- `alphabet-sort-v1` iterates every assistant turn (`[m.content or "" for m in trace.assistant_messages]`); `color-codeword-v1` takes the `AssistantMessage` objects themselves. Neither is "last reply" semantics, so they stay on `assistant_messages`.

Testing
- `tests/v1/test_trace.py::test_last_reply` (passes; full `test_trace.py` green, 4/4).
- Import check: env tasksets import against the patched source (with `Trace.last_reply`).
- `ruff check` clean on all changed files.
- `@pytest.mark.e2e` env runs (real model rollouts + runtimes) are not executed locally; the change is a read-only property with no new imports and is covered by the unit test + import check.

Checklist
Additional Notes
Companion to the Prime CLI v1 work (PrimeIntellect-ai/prime#766).
`last_reply` is purely additive; the env swaps are behavior-preserving (the only behavioral change is an outer `.strip()` on the final reply, which is benign for the substring/regex/judge checks these rewards use, and matches what the GUIDE's reverse-text example already did).

Note
Low Risk
Additive trace helper plus mechanical reward refactors; only scoring input normalization changes (strip / empty-string fallback), with no auth, runtime, or wire-format changes.
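To illustrate why the strip normalization is low risk for substring and unanchored regex checks, here is a small sketch. The reply text and the pattern are made-up examples, not taken from any of the envs; the last check shows the one kind of comparison (exact equality) where outer whitespace does matter.

```python
import re

# Hypothetical model reply with leading/trailing whitespace.
raw = "  The answer is 42.\n"
stripped = raw.strip()

# Substring check: outer whitespace does not change the result.
substring_same = ("42" in raw) == ("42" in stripped)

# Unanchored regex search: same capture before and after stripping.
pattern = re.compile(r"answer is (\d+)")
raw_match = pattern.search(raw).group(1)
stripped_match = pattern.search(stripped).group(1)

# Exact equality is where stripping makes a difference.
exact_raw = raw == "The answer is 42."
exact_stripped = stripped == "The answer is 42."

print(substring_same, raw_match, stripped_match, exact_raw, exact_stripped)
```

A reward that compares the whole reply for exact equality, or uses a regex anchored to the raw string boundaries, could score differently after the swap, so those would be the cases worth checking by hand.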
Overview
Introduces `Trace.last_reply` on `verifiers/v1/trace.py` as the canonical way to read the model's final sampled assistant text: whitespace-stripped `content`, or `""` when there are no assistant messages or content is `None`.

Seven v1 tasksets (`code-golf`, `deepwiki`, `glossary`, `gsm8k`, `reverse-text`, `scratchpad`, `wiki-search`) and `verifiers/v1/GUIDE.md` examples now use `trace.last_reply` instead of hand-rolled `assistant_messages[-1].content` guards.

Scoring and stop hooks therefore get normalized, safe final-reply text (including an outer `.strip()`); empty traces no longer risk `IndexError` on `[-1]`.

Reviewed by Cursor Bugbot for commit d76f20a.
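The kind of reward refactor summarized above might look like the sketch below. The reward functions, their signatures, and the `Msg`/`Trace` stand-ins are hypothetical and only illustrate the before/after shape; they are not copied from any taskset.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Msg:
    # Hypothetical stand-in for AssistantMessage.
    content: str | None


@dataclass
class Trace:
    # Hypothetical stand-in for vf.Trace.
    assistant_messages: list[Msg] = field(default_factory=list)

    @property
    def last_reply(self) -> str:
        msgs = self.assistant_messages
        return (msgs[-1].content or "").strip() if msgs else ""


def reward_unguarded(trace: Trace, answer: str) -> float:
    # Old idiom: raises IndexError when there are no assistant messages.
    return float(answer in trace.assistant_messages[-1].content)


def reward_with_last_reply(trace: Trace, answer: str) -> float:
    # New idiom: an empty trace scores 0.0 instead of raising.
    return float(answer in trace.last_reply)


empty = Trace()
try:
    reward_unguarded(empty, "42")
except IndexError:
    print("unguarded idiom raised IndexError on an empty trace")
print(reward_with_last_reply(empty, "42"))  # 0.0
```

The unguarded form also fails with a `TypeError` when `content` is `None`, which the property avoids by falling back to `""`.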
Note
Add `Trace.last_reply` property and adopt it across v1 environments
- Adds `Trace.last_reply` to trace.py as a convenience property returning the last assistant message content, whitespace-stripped, or an empty string when no assistant messages exist.
- Replaces `trace.assistant_messages[-1].content` access in reward functions across the seven affected v1 environments (`code_golf`, `deepwiki`, `glossary`, `gsm8k`, `reverse_text`, `scratchpad`, `wiki_search`) with `trace.last_reply`.

Macroscope summarized d76f20a.