specdec_bench modelopt release#1886
Conversation
Signed-off-by: talora <talora@nvidia.com>
|
/claude review |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (31)
👮 Files not reviewed due to content moderation or server errors (31)
📝 Walkthrough
🚥 Pre-merge checks | ❌ 1❌ Failed checks (1 inconclusive)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1886 +/- ##
==========================================
+ Coverage 70.21% 75.12% +4.90%
==========================================
Files 515 515
Lines 57244 57244
==========================================
+ Hits 40196 43002 +2806
+ Misses 17048 14242 -2806
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Claude review — 2 CRITICAL, 0 IMPORTANT, 0 SUGGESTION.
Scope: examples-only PR (31 files, all under examples/specdec_bench/; no modelopt/ changes). Reviewed the high-risk source files: run.py, run_utils.py, all four engine wrappers (sglang.py, trtllm_torch_api.py, vllm.py, auto_deploy.py), models/base.py, runners, the metrics, and the dataset changes. Did not deeply review the README/markdown or sweep_example.yaml (docs/config).
Most impactful findings:
-
[CRITICAL] SGLANG engine constructor raises
TypeErroron every run (models/sglang.py:43).speculative_algorithmis read via.get()so it stays inkwargs, thenengine_kwargs = dict(..., speculative_algorithm=speculative_algorithm, ..., **kwargs)passes the keyword twice, producing a duplicate-keywordTypeError. It also shadows the remapped value (e.g. MTP to EAGLE) and leaksparallel_drafting(always sent byrun.py) intosgl.Engine. Needs.pop()plus handling ofparallel_drafting. -
[CRITICAL] AutoDeploy output-format regression (
models/auto_deploy.py:69). Unlike the collector-based engines, AutoDeploy still returns nested[beam][chunk][token]asoutput_idswith nochunk_lengths.run.pycallsdecode_chat(tokenizer, output_ids[0])on a list-of-lists viatokenizer.decode, which fails/garbles for the AUTO_DEPLOY path. Convert it toCumulativeTokenCollectoror flatten and passchunk_lengths.
Risk assessment: Medium-high. The refactor to the record-based metrics pipeline and CumulativeTokenCollector is coherent, and the sweep / error-handling additions look solid, but the two engine wrappers above are broken on their primary code paths (SGLANG unconditionally, AutoDeploy on decode). Both are self-contained fixes. The metric-interface change is an acknowledged intentional internal-API break, and the run.py entrypoint stays backward compatible. Recommend addressing the two CRITICALs before merge.
| output_dict["output_logits"] = None | ||
| output_dict["token_times"] = timing | ||
| return output_dict | ||
| return build_model_output(reformatted_output_ids, timing) |
There was a problem hiding this comment.
[CRITICAL Algorithm] Output-format contract is inconsistent with the other engines. The collector-based models (TRTLLM/vLLM/SGLANG) now return output_ids as a flat per-beam token list (collector.output_ids()), with chunking carried separately in chunk_lengths. Here AutoDeploy still returns the nested reformatted_output_ids ([beam][chunk][token]) as output_ids and passes no chunk_lengths.
run.py then calls decode_chat(tokenizer, output_tokens["output_ids"][0]), i.e. tokenizer.decode(...) on beam 0. For the collector models that's a flat list[int] (works); for AutoDeploy it's a list[list[int]] (list of chunks), which tokenizer.decode cannot handle — the AUTO_DEPLOY engine path produces a decode error / garbage output. (Metrics happen to survive because Metric.chunk_lengths() detects the nested shape, but decoding does not.)
Suggest converting AutoDeploy to CumulativeTokenCollector like the other engines, or flattening tokens per beam and passing the per-chunk lengths via build_model_output(flat_output_ids, timing, chunk_lengths=...).
What does this PR do?
Type of change: new features
New release of specdec_bench including the following features:
Usage
Example on how to use the agentic split:
Testing
Tested the following scenarios:
Before your PR is "Ready for review"
Make sure you read and follow Contributor guidelines and your commits are signed (
git commit -s -S).Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded
trust_remote_code=True,torch.load(..., weights_only=False),pickle, etc.).run.pybut is not backward compatible in term of the internal APIs in specdec_bench including changes to the metrics interfaces.CONTRIBUTING.md: N/AAdditional Information