Add deployment test cases, fix deployment-related issues, and remove private models#1873
Add deployment test cases, fix deployment-related issues, and remove private models#1873nvSiruiW wants to merge 24 commits into
Conversation
Signed-off-by: Sirui Wang <siruiw@nvidia.com>
Signed-off-by: Sirui Wang <siruiw@nvidia.com>
Signed-off-by: Sirui Wang <siruiw@nvidia.com>
Signed-off-by: Sirui Wang <siruiw@nvidia.com>
Signed-off-by: Sirui Wang <siruiw@nvidia.com>
Signed-off-by: Sirui Wang <siruiw@nvidia.com>
Signed-off-by: Sirui Wang <siruiw@nvidia.com>
Signed-off-by: Sirui Wang <siruiw@nvidia.com>
Signed-off-by: Sirui Wang <siruiw@nvidia.com>
TRT-LLM 1.3.0rc17 changed the default execution path for NVFP4 MoE models, triggering TRT engine JIT compilation (>2 hours) instead of the PyTorch eager path used in rc16 (~105s). Explicitly set backend="pytorch" to restore the previous behavior. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sirui Wang <siruiw@nvidia.com>
Signed-off-by: Sirui Wang <siruiw@nvidia.com>
Signed-off-by: Sirui Wang <siruiw@nvidia.com>
Signed-off-by: Sirui Wang <siruiw@nvidia.com>
Signed-off-by: Sirui Wang <siruiw@nvidia.com>
…g-FP8 Signed-off-by: Sirui Wang <siruiw@nvidia.com>
…g-NVFP4 Signed-off-by: Sirui Wang <siruiw@nvidia.com>
Add FP8 and NVFP4 variants of Nemotron-3-Nano-Omni-30B-A3B-Reasoning to the flashinfer attention backend branch in deploy_utils.py. These models use a Mamba hybrid architecture requiring page_size=1, which MambaRadixCache v1 enforces — the default page_size=64 causes an AssertionError at engine init. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sirui Wang <siruiw@nvidia.com>
Signed-off-by: Sirui Wang <siruiw@nvidia.com>
Kept upstream backend restrictions for new models (K2.6, Qwen3.5/3.6, Gemma-4, GLM-5.1, MiniMax-M3). Kept stashed chat template fix for TRT-LLM deploy to apply chat templates when tokenizer has chat_template. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove 13 models no longer being tested: Kimi-K2-Instruct-NVFP4, Llama-4-Maverick-17B-128E-Instruct-NVFP4, Mixtral-8x7B-Instruct-v0.1-FP8/NVFP4, Qwen3-30B-A3B-Eagle3, Llama-3.1-70B/405B-Medusa-FP8, EAGLE3-NVIDIA-Nemotron-3-Nano-30B-A3B-BF16, gemma-3-12b/27b-it-NVFP4/FP8, QwQ-32B-NVFP4. Also fix missing AutoTokenizer import in trtllm deploy path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Raise examples default timeout to 3600s, add per-test overrides: 14400s for Nemotron-3-Ultra-550B and BF16 variants, 7200s for Nemotron. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
📝 WalkthroughWalkthroughThis PR updates backend deploy test execution and prompt rendering, increases the default examples timeout, and revises the HF PTQ deployment matrix with added, removed, and new standalone model deployment tests. ChangesDeploy test execution and backend generation updates
Estimated code review effort: 3 (Moderate) | ~25 minutes Example test timeout and deployment matrix updates
Estimated code review effort: 2 (Simple) | ~15 minutes Caution Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional.
❌ Failed checks (1 error, 1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 2
🧹 Nitpick comments (2)
tests/conftest.py (1)
50-58: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick winBroad default timeout bump may mask hangs in unrelated example tests.
Raising
_DEFAULT_TIMEOUT["examples"]from 300s to 3600s applies to every test undertests/examples, not just the new large-model deploy tests. Since the large models already get explicit@pytest.mark.timeout(...)overrides (3600/7200/14400s) intest_deploy.py, the smaller/faster examples tests lose their tighter 300s safety net and could hang for up to an hour before being caught.Consider keeping the default lower and relying on explicit per-test
@pytest.mark.timeoutoverrides for the newly added heavy deploy tests (as is already partly done), consistent with CONTRIBUTING.md's guidance thattests/examplesintegration tests "should not take more than a few minutes to run."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/conftest.py` around lines 50 - 58, The examples timeout default is now too broad and weakens hang detection for unrelated tests. Revert the `_DEFAULT_TIMEOUT["examples"]` baseline in `tests/conftest.py` to the tighter default and rely on explicit per-test `@pytest.mark.timeout(...)` overrides in `test_deploy.py` for the heavy deploy cases. Keep the broader timeout only on the specific large-model deploy tests so `tests/examples` remains covered by its usual safety net.Source: Coding guidelines
tests/examples/hf_ptq/test_deploy.py (1)
516-516: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚖️ Poor tradeoffMulti-hour deploy tests conflict with
tests/examplescost-profile guidance.
@pytest.mark.timeout(14400),timeout(7200), andtimeout(3600)allow individual tests to run for up to 4 hours. CONTRIBUTING.md documentstests/examplesas integration tests that "should not take more than a few minutes to run," and the path instructions call out flagging "Tests placed in the wrong directory for their cost profile" as an important issue.If these very-large-model deploy tests are intentionally exempted from that budget (e.g., they only run in a dedicated nightly/release CI lane), consider documenting that exception, or moving them to a group whose cost profile matches (mirroring how
tests/gpu*are separated fromtests/unit).Also applies to: 654-656, 725-727, 741-743
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/examples/hf_ptq/test_deploy.py` at line 516, The deploy tests in the hf_ptq example suite use very long timeout markers, which conflicts with the expected cost profile for tests/examples. Review the affected test functions in test_deploy.py and either reduce these timeout values to match the “few minutes” guidance or move/exempt the large-model cases into a dedicated slow/nightly test group; if they must remain here, add explicit documentation for the exception. Use the timeout decorators and the test_deploy test functions as the locating symbols.Sources: Coding guidelines, Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/_test_utils/deploy_utils.py`:
- Line 152: The new in-function imports in deploy_utils are not following the
module import conventions: move tempfile to module scope, and for AutoTokenizer
either import it at the top with the rest of the imports or add a short inline
comment in the relevant helper explaining that it is intentionally deferred
because it is an optional/heavy dependency. Update the affected helpers in
deploy_utils so the import placement is consistent and explicitly justified
where deferred.
- Line 477: The output normalization in the deploy test currently falls back to
str(output), which can hide unexpected backend shapes and make invalid responses
look like valid text. Update the logic around generated_text in the deploy_utils
test helper to accept only the expected dict and string forms, and treat any
other output type as a test failure instead of coercing it. Use the existing
output handling branch in the test utility to validate the shape explicitly.
---
Nitpick comments:
In `@tests/conftest.py`:
- Around line 50-58: The examples timeout default is now too broad and weakens
hang detection for unrelated tests. Revert the `_DEFAULT_TIMEOUT["examples"]`
baseline in `tests/conftest.py` to the tighter default and rely on explicit
per-test `@pytest.mark.timeout(...)` overrides in `test_deploy.py` for the heavy
deploy cases. Keep the broader timeout only on the specific large-model deploy
tests so `tests/examples` remains covered by its usual safety net.
In `@tests/examples/hf_ptq/test_deploy.py`:
- Line 516: The deploy tests in the hf_ptq example suite use very long timeout
markers, which conflicts with the expected cost profile for tests/examples.
Review the affected test functions in test_deploy.py and either reduce these
timeout values to match the “few minutes” guidance or move/exempt the
large-model cases into a dedicated slow/nightly test group; if they must remain
here, add explicit documentation for the exception. Use the timeout decorators
and the test_deploy test functions as the locating symbols.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b3edc916-3c3a-4d6d-8bf1-105de14aa937
📒 Files selected for processing (3)
tests/_test_utils/deploy_utils.pytests/conftest.pytests/examples/hf_ptq/test_deploy.py
| eagle3_one_model: bool, | ||
| ) -> None: | ||
| """Run deploy in a subprocess and print its stdout/stderr so pytest capture=tee-sys captures to DB.""" | ||
| import tempfile |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win
Move or justify the new in-function imports.
tempfile should be imported at module scope. For AutoTokenizer, either move it to the top or add a brief comment explaining that it is intentionally deferred for an optional/heavy dependency.
As per path instructions, imports inside tests need explicit justification when deferred. As per coding guidelines, imports should stay at the top unless there is a concrete reason.
Also applies to: 328-328, 456-456
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/_test_utils/deploy_utils.py` at line 152, The new in-function imports
in deploy_utils are not following the module import conventions: move tempfile
to module scope, and for AutoTokenizer either import it at the top with the rest
of the imports or add a short inline comment in the relevant helper explaining
that it is intentionally deferred because it is an optional/heavy dependency.
Update the affected helpers in deploy_utils so the import placement is
consistent and explicitly justified where deferred.
Sources: Coding guidelines, Path instructions
| f"Expected {len(COMMON_PROMPTS)} outputs, got {len(outputs)}" | ||
| ) | ||
| for i, output in enumerate(outputs): | ||
| generated_text = output["text"] if isinstance(output, dict) else str(output) |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Avoid converting unexpected SGLang outputs with str().
str(output) makes any unexpected object look like valid non-empty text, so this test can pass while the backend returns the wrong shape. Accept only the expected dict/string forms.
Proposed fix
- generated_text = output["text"] if isinstance(output, dict) else str(output)
+ generated_text = output.get("text") if isinstance(output, dict) else output
assert isinstance(generated_text, str), f"Output {i} text is not a string"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| generated_text = output["text"] if isinstance(output, dict) else str(output) | |
| generated_text = output.get("text") if isinstance(output, dict) else output | |
| assert isinstance(generated_text, str), f"Output {i} text is not a string" |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/_test_utils/deploy_utils.py` at line 477, The output normalization in
the deploy test currently falls back to str(output), which can hide unexpected
backend shapes and make invalid responses look like valid text. Update the logic
around generated_text in the deploy_utils test helper to accept only the
expected dict and string forms, and treat any other output type as a test
failure instead of coercing it. Use the existing output handling branch in the
test utility to validate the shape explicitly.
| # A test can override its cap by adding ``@pytest.mark.timeout(...)`` | ||
| _DEFAULT_TIMEOUT = { | ||
| "examples": 300, | ||
| "examples": 3600, |
There was a problem hiding this comment.
Can you make this default to 300 as before but overwritable by some env variable like MODELOPT_QA_TEST_TIMEOUT so you can set to very high timeout only for QA tests
There was a problem hiding this comment.
That way you wont have to have per-test timeout override in test_deploy also
|
/claude review |
|
@nvSiruiW please address coderabbit / claude comments and also fix CI checks (need to sign your commits as per CONTRIBUTING.md |
| @@ -324,25 +380,28 @@ def _deploy_vllm_impl(self): | |||
| quantization=quantization_method, | |||
| tensor_parallel_size=self.tensor_parallel_size, | |||
| trust_remote_code=True, | |||
| max_model_len=4096, | |||
| ) | |||
There was a problem hiding this comment.
[IMPORTANT Compatibility] The non-eagle vLLM path unconditionally sets quantization="modelopt" (or modelopt_fp4 when the id contains "fp4"), with no branch for an unquantized checkpoint. This PR adds test_nvidia_nemotron_3_ultra_550b_a55b_bf16 (nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16, backend=("trtllm", "vllm", "sglang")) — a plain BF16 model whose id contains neither "fp4" nor any quant marker. It will be loaded with quantization="modelopt", which expects ModelOpt quant scales in the checkpoint and will fail on a BF16 model.
Why it matters: the newly added BF16 test will fail (or silently misconfigure) on the vLLM backend — and the same applies to the SGLang path below (_deploy_sglang_impl, lines 412-413), which uses the identical quantization="modelopt" default. The TRT-LLM path is fine since it doesn't pass a quantization arg.
Suggestion: gate the quantization argument on whether the model is actually quantized, e.g. only pass quantization=... when the id indicates fp8/fp4/nvfp4, and omit it (or pass None) for BF16/unquantized checkpoints.
| @pytest.mark.parametrize( | ||
| "command", | ||
| [ | ||
| *ModelDeployerList( | ||
| model_id="nvidia/Wan2.2-T2V-A14B-Diffusers-NVFP4", | ||
| backend=("trtllm", "sglang"), | ||
| tensor_parallel_size=1, | ||
| mini_sm=100, | ||
| ), | ||
| ], | ||
| ids=idfn, | ||
| ) | ||
| def test_wan2_2_t2v_a14b_diffusers_nvfp4(command): | ||
| command.run() |
There was a problem hiding this comment.
[SUGGESTION] Wan2.2-T2V-A14B-Diffusers is a text-to-video diffusion model, and diffusiongemma-26B-A4B-it (below) is a diffusion model — but both are routed through ModelDeployer, whose backend implementations (_deploy_trtllm_impl / _deploy_sglang_impl / _deploy_vllm_impl) call llm.generate(prompts) and then assert output.outputs[0].text (trtllm/vllm) or output["text"] (sglang) is a non-empty string. A T2V model does not emit .text — it produces video latents — so if these engines even load it via the text-LLM API, the isinstance(generated_text, str) / .strip() assertions would not exercise anything meaningful (and the SGLang str(output) fallback would mask a non-text return by stringifying it).
Why it matters: these cases may pass vacuously or fail confusingly depending on how each engine handles a diffusion checkpoint through the text API.
Suggestion: confirm these engines expose a text-generation API for T2V/diffusion checkpoints; if not, give diffusion models a dedicated deploy path (or explicitly skip the text-only assertions) rather than reusing the token-generation path. If it is intentional and supported, a one-line comment noting that would help.
There was a problem hiding this comment.
Claude Review Summary
Scope: Test-only PR (3 files, all under tests/): tests/_test_utils/deploy_utils.py, tests/conftest.py, tests/examples/hf_ptq/test_deploy.py. Reviewed all three in full. No modelopt/ source or public API changed.
Findings by severity: CRITICAL: 0 · IMPORTANT: 1 · SUGGESTION: 1
Most impactful
[IMPORTANT Compatibility] The new test_nvidia_nemotron_3_ultra_550b_a55b_bf16 case runs an unquantized BF16 checkpoint through the vLLM and SGLang deploy paths, both of which unconditionally pass quantization=modelopt (id has no fp4/fp8 marker). quantization=modelopt expects ModelOpt quant scales in the checkpoint, so this newly added test looks likely to fail or misconfigure on those two backends. Suggest gating the quantization arg on whether the model id indicates a quantized checkpoint. (Inline on deploy_utils.py.)
Suggestion
[SUGGESTION] The new Wan2.2-T2V and diffusiongemma diffusion cases are routed through the text-generation ModelDeployer path, which asserts a non-empty .text output. A T2V/diffusion model does not emit text, so these assertions may pass vacuously or fail confusingly (the SGLang str(output) fallback can mask a non-text return). Worth confirming these engines expose a text API for these checkpoints, or giving diffusion models a dedicated path. (Inline on test_deploy.py.)
Not duplicated
CodeRabbit already raised the broad examples timeout bump (300 to 3600s), the multi-hour per-test timeout markers vs the tests/examples cost profile, the deferred in-function imports, and the str(output) fallback — I defer to those and did not repeat them. Also note the CI reminder to sign commits (git commit -s -S) per CONTRIBUTING.md.
Risk
Low overall — no production code changes. The one IMPORTANT item is a likely-broken new test rather than a regression to existing behavior.
| @@ -324,25 +380,28 @@ def _deploy_vllm_impl(self): | |||
| quantization=quantization_method, | |||
| tensor_parallel_size=self.tensor_parallel_size, | |||
| trust_remote_code=True, | |||
| max_model_len=4096, | |||
| ) | |||
There was a problem hiding this comment.
[IMPORTANT Compatibility] The non-eagle vLLM path unconditionally sets quantization="modelopt" (or modelopt_fp4 when the id contains "fp4"), with no branch for an unquantized checkpoint. This PR adds test_nvidia_nemotron_3_ultra_550b_a55b_bf16 (nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16, backend=("trtllm", "vllm", "sglang")) — a plain BF16 model whose id contains neither "fp4" nor any quant marker. It will be loaded with quantization="modelopt", which expects ModelOpt quant scales in the checkpoint and will fail on a BF16 model.
Why it matters: the newly added BF16 test will fail (or silently misconfigure) on the vLLM backend — and the same applies to the SGLang path below (_deploy_sglang_impl, lines 412-413), which uses the identical quantization="modelopt" default. The TRT-LLM path is fine since it doesn't pass a quantization arg.
Suggestion: gate the quantization argument on whether the model is actually quantized, e.g. only pass quantization=... when the id indicates fp8/fp4/nvfp4, and omit it (or pass None) for BF16/unquantized checkpoints.
| @pytest.mark.parametrize( | ||
| "command", | ||
| [ | ||
| *ModelDeployerList( | ||
| model_id="nvidia/Wan2.2-T2V-A14B-Diffusers-NVFP4", | ||
| backend=("trtllm", "sglang"), | ||
| tensor_parallel_size=1, | ||
| mini_sm=100, | ||
| ), | ||
| ], | ||
| ids=idfn, | ||
| ) | ||
| def test_wan2_2_t2v_a14b_diffusers_nvfp4(command): | ||
| command.run() |
There was a problem hiding this comment.
[SUGGESTION] Wan2.2-T2V-A14B-Diffusers is a text-to-video diffusion model, and diffusiongemma-26B-A4B-it (below) is a diffusion model — but both are routed through ModelDeployer, whose backend implementations (_deploy_trtllm_impl / _deploy_sglang_impl / _deploy_vllm_impl) call llm.generate(prompts) and then assert output.outputs[0].text (trtllm/vllm) or output["text"] (sglang) is a non-empty string. A T2V model does not emit .text — it produces video latents — so if these engines even load it via the text-LLM API, the isinstance(generated_text, str) / .strip() assertions would not exercise anything meaningful (and the SGLang str(output) fallback would mask a non-text return by stringifying it).
Why it matters: these cases may pass vacuously or fail confusingly depending on how each engine handles a diffusion checkpoint through the text API.
Suggestion: confirm these engines expose a text-generation API for T2V/diffusion checkpoints; if not, give diffusion models a dedicated deploy path (or explicitly skip the text-only assertions) rather than reusing the token-generation path. If it is intentional and supported, a one-line comment noting that would help.
What does this PR do?
Type of change: ?
Usage
# Add a code snippet demonstrating how to use thisTesting
Before your PR is "Ready for review"
Make sure you read and follow Contributor guidelines and your commits are signed (
git commit -s -S).Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded
trust_remote_code=True,torch.load(..., weights_only=False),pickle, etc.).CONTRIBUTING.md: ✅ / ❌ / N/AAdditional Information
Summary by CodeRabbit