Add deployment test cases, fix deployment-related issues, and remove private models by nvSiruiW · Pull Request #1873 · NVIDIA/Model-Optimizer

nvSiruiW · 2026-07-01T09:36:17Z

What does this PR do?

Type of change: ?

Usage

# Add a code snippet demonstrating how to use this

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

Is this change backward compatible?: ✅ / ❌ / N/A
If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
Did you write any new necessary tests?: ✅ / ❌ / N/A
Did you update Changelog?: ✅ / ❌ / N/A
Did you get Claude approval on this PR?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

Bug Fixes
- Improved deployment test reliability by standardizing prompt rendering and validating generated outputs consistently across backends.
New Features
- Expanded deployment test coverage with additional model variants (including reasoning, diffusion, and larger models) and updated coverage for speculative execution scenarios.
Tests
- Increased the default timeout for example tests to reduce false failures during longer runs.

Signed-off-by: Sirui Wang <siruiw@nvidia.com>

TRT-LLM 1.3.0rc17 changed the default execution path for NVFP4 MoE models, triggering TRT engine JIT compilation (>2 hours) instead of the PyTorch eager path used in rc16 (~105s). Explicitly set backend="pytorch" to restore the previous behavior. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sirui Wang <siruiw@nvidia.com>

Signed-off-by: Sirui Wang <siruiw@nvidia.com>

…g-FP8 Signed-off-by: Sirui Wang <siruiw@nvidia.com>

…g-NVFP4 Signed-off-by: Sirui Wang <siruiw@nvidia.com>

Add FP8 and NVFP4 variants of Nemotron-3-Nano-Omni-30B-A3B-Reasoning to the flashinfer attention backend branch in deploy_utils.py. These models use a Mamba hybrid architecture requiring page_size=1, which MambaRadixCache v1 enforces — the default page_size=64 causes an AssertionError at engine init. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Signed-off-by: Sirui Wang <siruiw@nvidia.com>

Kept upstream backend restrictions for new models (K2.6, Qwen3.5/3.6, Gemma-4, GLM-5.1, MiniMax-M3). Kept stashed chat template fix for TRT-LLM deploy to apply chat templates when tokenizer has chat_template. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Remove 13 models no longer being tested: Kimi-K2-Instruct-NVFP4, Llama-4-Maverick-17B-128E-Instruct-NVFP4, Mixtral-8x7B-Instruct-v0.1-FP8/NVFP4, Qwen3-30B-A3B-Eagle3, Llama-3.1-70B/405B-Medusa-FP8, EAGLE3-NVIDIA-Nemotron-3-Nano-30B-A3B-BF16, gemma-3-12b/27b-it-NVFP4/FP8, QwQ-32B-NVFP4. Also fix missing AutoTokenizer import in trtllm deploy path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Raise examples default timeout to 3600s, add per-test overrides: 14400s for Nemotron-3-Ultra-550B and BF16 variants, 7200s for Nemotron. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

copy-pr-bot · 2026-07-01T09:36:21Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-07-01T09:38:49Z

📝 Walkthrough

Walkthrough

This PR updates backend deploy test execution and prompt rendering, increases the default examples timeout, and revises the HF PTQ deployment matrix with added, removed, and new standalone model deployment tests.

Changes

Deploy test execution and backend generation updates

Layer / File(s)	Summary
Subprocess execution rework `tests/_test_utils/deploy_utils.py`	`_run_deploy_via_subprocess` builds a project-root-inclusive `PYTHONPATH`, launches TensorRT-LLM via `mpirun` with a temp-file script, launches other backends via `python -c`, and captures combined output before checking return codes.
Backend prompt generation updates `tests/_test_utils/deploy_utils.py`	TensorRT-LLM adds `backend="pytorch"` and tokenizer chat-template rendering; vLLM adds `max_model_len=4096` and switches to `llm.chat`; SGLang extends its allowlist and uses tokenizer-rendered prompts with normalized output validation and printing.

Estimated code review effort: 3 (Moderate) | ~25 minutes

Example test timeout and deployment matrix updates

Layer / File(s)	Summary
Default examples timeout increase `tests/conftest.py`	The default timeout for the examples test group increases from 300 seconds to 3600 seconds.
Deployment matrix parameter updates `tests/examples/hf_ptq/test_deploy.py`	HF PTQ parametrizations add and remove `ModelDeployerList` entries across DeepSeek, Llama, Qwen, Kimi, GLM, MiniMax, and Medusa/Eagle3 blocks, and add a timeout decorator before `test_llama_nemotron`.
New standalone deployment tests `tests/examples/hf_ptq/test_deploy.py`	Adds standalone tests for Nemotron 3 Ultra 550B A55B, Wan2.2 T2V Diffusers, DiffusionGemma 26B A4B, and Nemotron 3 Nano Omni Reasoning, each with a single-item `ModelDeployerList` parametrization.

Estimated code review effort: 2 (Simple) | ~15 minutes

Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

Ignore

❌ Failed checks (1 error, 1 warning)

Check name	Status	Explanation	Resolution
Security Anti-Patterns	❌ Error	tests/_test_utils/deploy_utils.py hardcodes trust_remote_code=True in tokenizer/backend loads with no safety justification.	Expose trust_remote_code as a caller option defaulting to False, or add explicit codeowner-approved justification if remote code loading is required.
Docstring Coverage	⚠️ Warning	Docstring coverage is 38.46% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the PR’s main themes: added deployment tests, deployment fixes, and removal of private/deprecated models.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 2

🧹 Nitpick comments (2)

tests/conftest.py (1)
50-58: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win

Broad default timeout bump may mask hangs in unrelated example tests.

Raising _DEFAULT_TIMEOUT["examples"] from 300s to 3600s applies to every test under tests/examples, not just the new large-model deploy tests. Since the large models already get explicit @pytest.mark.timeout(...) overrides (3600/7200/14400s) in test_deploy.py, the smaller/faster examples tests lose their tighter 300s safety net and could hang for up to an hour before being caught.

Consider keeping the default lower and relying on explicit per-test @pytest.mark.timeout overrides for the newly added heavy deploy tests (as is already partly done), consistent with CONTRIBUTING.md's guidance that tests/examples integration tests "should not take more than a few minutes to run."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/conftest.py` around lines 50 - 58, The examples timeout default is now
too broad and weakens hang detection for unrelated tests. Revert the
`_DEFAULT_TIMEOUT["examples"]` baseline in `tests/conftest.py` to the tighter
default and rely on explicit per-test `@pytest.mark.timeout(...)` overrides in
`test_deploy.py` for the heavy deploy cases. Keep the broader timeout only on
the specific large-model deploy tests so `tests/examples` remains covered by its
usual safety net.
Source: Coding guidelines
tests/examples/hf_ptq/test_deploy.py (1)
516-516: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚖️ Poor tradeoff

Multi-hour deploy tests conflict with tests/examples cost-profile guidance.

@pytest.mark.timeout(14400), timeout(7200), and timeout(3600) allow individual tests to run for up to 4 hours. CONTRIBUTING.md documents tests/examples as integration tests that "should not take more than a few minutes to run," and the path instructions call out flagging "Tests placed in the wrong directory for their cost profile" as an important issue.

If these very-large-model deploy tests are intentionally exempted from that budget (e.g., they only run in a dedicated nightly/release CI lane), consider documenting that exception, or moving them to a group whose cost profile matches (mirroring how tests/gpu* are separated from tests/unit).

Also applies to: 654-656, 725-727, 741-743
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/examples/hf_ptq/test_deploy.py` at line 516, The deploy tests in the
hf_ptq example suite use very long timeout markers, which conflicts with the
expected cost profile for tests/examples. Review the affected test functions in
test_deploy.py and either reduce these timeout values to match the “few minutes”
guidance or move/exempt the large-model cases into a dedicated slow/nightly test
group; if they must remain here, add explicit documentation for the exception.
Use the timeout decorators and the test_deploy test functions as the locating
symbols.
Sources: Coding guidelines, Path instructions

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/_test_utils/deploy_utils.py`:
- Line 152: The new in-function imports in deploy_utils are not following the
module import conventions: move tempfile to module scope, and for AutoTokenizer
either import it at the top with the rest of the imports or add a short inline
comment in the relevant helper explaining that it is intentionally deferred
because it is an optional/heavy dependency. Update the affected helpers in
deploy_utils so the import placement is consistent and explicitly justified
where deferred.
- Line 477: The output normalization in the deploy test currently falls back to
str(output), which can hide unexpected backend shapes and make invalid responses
look like valid text. Update the logic around generated_text in the deploy_utils
test helper to accept only the expected dict and string forms, and treat any
other output type as a test failure instead of coercing it. Use the existing
output handling branch in the test utility to validate the shape explicitly.

---

Nitpick comments:
In `@tests/conftest.py`:
- Around line 50-58: The examples timeout default is now too broad and weakens
hang detection for unrelated tests. Revert the `_DEFAULT_TIMEOUT["examples"]`
baseline in `tests/conftest.py` to the tighter default and rely on explicit
per-test `@pytest.mark.timeout(...)` overrides in `test_deploy.py` for the heavy
deploy cases. Keep the broader timeout only on the specific large-model deploy
tests so `tests/examples` remains covered by its usual safety net.

In `@tests/examples/hf_ptq/test_deploy.py`:
- Line 516: The deploy tests in the hf_ptq example suite use very long timeout
markers, which conflicts with the expected cost profile for tests/examples.
Review the affected test functions in test_deploy.py and either reduce these
timeout values to match the “few minutes” guidance or move/exempt the
large-model cases into a dedicated slow/nightly test group; if they must remain
here, add explicit documentation for the exception. Use the timeout decorators
and the test_deploy test functions as the locating symbols.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b3edc916-3c3a-4d6d-8bf1-105de14aa937

📥 Commits

Reviewing files that changed from the base of the PR and between a05850b and 2444a81.

📒 Files selected for processing (3)

tests/_test_utils/deploy_utils.py
tests/conftest.py
tests/examples/hf_ptq/test_deploy.py

coderabbitai · 2026-07-01T09:46:51Z

    eagle3_one_model: bool,
 ) -> None:
    """Run deploy in a subprocess and print its stdout/stderr so pytest capture=tee-sys captures to DB."""
+    import tempfile


📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Move or justify the new in-function imports.

tempfile should be imported at module scope. For AutoTokenizer, either move it to the top or add a brief comment explaining that it is intentionally deferred for an optional/heavy dependency.

As per path instructions, imports inside tests need explicit justification when deferred. As per coding guidelines, imports should stay at the top unless there is a concrete reason.

Also applies to: 328-328, 456-456

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/_test_utils/deploy_utils.py` at line 152, The new in-function imports in deploy_utils are not following the module import conventions: move tempfile to module scope, and for AutoTokenizer either import it at the top with the rest of the imports or add a short inline comment in the relevant helper explaining that it is intentionally deferred because it is an optional/heavy dependency. Update the affected helpers in deploy_utils so the import placement is consistent and explicitly justified where deferred.

Sources: Coding guidelines, Path instructions

coderabbitai · 2026-07-01T09:46:51Z

+            f"Expected {len(COMMON_PROMPTS)} outputs, got {len(outputs)}"
+        )
+        for i, output in enumerate(outputs):
+            generated_text = output["text"] if isinstance(output, dict) else str(output)


🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Avoid converting unexpected SGLang outputs with str().

str(output) makes any unexpected object look like valid non-empty text, so this test can pass while the backend returns the wrong shape. Accept only the expected dict/string forms.

Proposed fix

- generated_text = output["text"] if isinstance(output, dict) else str(output) + generated_text = output.get("text") if isinstance(output, dict) else output assert isinstance(generated_text, str), f"Output {i} text is not a string"

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

generated_text = output["text"] if isinstance(output, dict) else str(output)

generated_text = output.get("text") if isinstance(output, dict) else output

assert isinstance(generated_text, str), f"Output {i} text is not a string"

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/_test_utils/deploy_utils.py` at line 477, The output normalization in the deploy test currently falls back to str(output), which can hide unexpected backend shapes and make invalid responses look like valid text. Update the logic around generated_text in the deploy_utils test helper to accept only the expected dict and string forms, and treat any other output type as a test failure instead of coercing it. Use the existing output handling branch in the test utility to validate the shape explicitly.

kevalmorabia97 · 2026-07-01T13:16:18Z

 # A test can override its cap by adding ``@pytest.mark.timeout(...)``
 _DEFAULT_TIMEOUT = {
-    "examples": 300,
+    "examples": 3600,


Can you make this default to 300 as before but overwritable by some env variable like MODELOPT_QA_TEST_TIMEOUT so you can set to very high timeout only for QA tests

That way you wont have to have per-test timeout override in test_deploy also

kevalmorabia97 · 2026-07-01T13:17:02Z

/claude review

kevalmorabia97 · 2026-07-01T13:17:38Z

@nvSiruiW please address coderabbit / claude comments and also fix CI checks (need to sign your commits as per CONTRIBUTING.md

claude · 2026-07-01T13:21:19Z

@@ -324,25 +380,28 @@ def _deploy_vllm_impl(self):
                quantization=quantization_method,
                tensor_parallel_size=self.tensor_parallel_size,
                trust_remote_code=True,
+                max_model_len=4096,
            )


[IMPORTANT Compatibility] The non-eagle vLLM path unconditionally sets quantization="modelopt" (or modelopt_fp4 when the id contains "fp4"), with no branch for an unquantized checkpoint. This PR adds test_nvidia_nemotron_3_ultra_550b_a55b_bf16 (nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16, backend=("trtllm", "vllm", "sglang")) — a plain BF16 model whose id contains neither "fp4" nor any quant marker. It will be loaded with quantization="modelopt", which expects ModelOpt quant scales in the checkpoint and will fail on a BF16 model.

Why it matters: the newly added BF16 test will fail (or silently misconfigure) on the vLLM backend — and the same applies to the SGLang path below (_deploy_sglang_impl, lines 412-413), which uses the identical quantization="modelopt" default. The TRT-LLM path is fine since it doesn't pass a quantization arg.

Suggestion: gate the quantization argument on whether the model is actually quantized, e.g. only pass quantization=... when the id indicates fp8/fp4/nvfp4, and omit it (or pass None) for BF16/unquantized checkpoints.

claude · 2026-07-01T13:21:22Z

+@pytest.mark.parametrize(
+    "command",
+    [
+        *ModelDeployerList(
+            model_id="nvidia/Wan2.2-T2V-A14B-Diffusers-NVFP4",
+            backend=("trtllm", "sglang"),
+            tensor_parallel_size=1,
+            mini_sm=100,
+        ),
+    ],
+    ids=idfn,
+)
+def test_wan2_2_t2v_a14b_diffusers_nvfp4(command):
+    command.run()


[SUGGESTION] Wan2.2-T2V-A14B-Diffusers is a text-to-video diffusion model, and diffusiongemma-26B-A4B-it (below) is a diffusion model — but both are routed through ModelDeployer, whose backend implementations (_deploy_trtllm_impl / _deploy_sglang_impl / _deploy_vllm_impl) call llm.generate(prompts) and then assert output.outputs[0].text (trtllm/vllm) or output["text"] (sglang) is a non-empty string. A T2V model does not emit .text — it produces video latents — so if these engines even load it via the text-LLM API, the isinstance(generated_text, str) / .strip() assertions would not exercise anything meaningful (and the SGLang str(output) fallback would mask a non-text return by stringifying it).

Why it matters: these cases may pass vacuously or fail confusingly depending on how each engine handles a diffusion checkpoint through the text API.

Suggestion: confirm these engines expose a text-generation API for T2V/diffusion checkpoints; if not, give diffusion models a dedicated deploy path (or explicitly skip the text-only assertions) rather than reusing the token-generation path. If it is intentional and supported, a one-line comment noting that would help.

claude

Claude Review Summary

Scope: Test-only PR (3 files, all under tests/): tests/_test_utils/deploy_utils.py, tests/conftest.py, tests/examples/hf_ptq/test_deploy.py. Reviewed all three in full. No modelopt/ source or public API changed.

Findings by severity: CRITICAL: 0 · IMPORTANT: 1 · SUGGESTION: 1

Most impactful

[IMPORTANT Compatibility] The new test_nvidia_nemotron_3_ultra_550b_a55b_bf16 case runs an unquantized BF16 checkpoint through the vLLM and SGLang deploy paths, both of which unconditionally pass quantization=modelopt (id has no fp4/fp8 marker). quantization=modelopt expects ModelOpt quant scales in the checkpoint, so this newly added test looks likely to fail or misconfigure on those two backends. Suggest gating the quantization arg on whether the model id indicates a quantized checkpoint. (Inline on deploy_utils.py.)

Suggestion

[SUGGESTION] The new Wan2.2-T2V and diffusiongemma diffusion cases are routed through the text-generation ModelDeployer path, which asserts a non-empty .text output. A T2V/diffusion model does not emit text, so these assertions may pass vacuously or fail confusingly (the SGLang str(output) fallback can mask a non-text return). Worth confirming these engines expose a text API for these checkpoints, or giving diffusion models a dedicated path. (Inline on test_deploy.py.)

Not duplicated

CodeRabbit already raised the broad examples timeout bump (300 to 3600s), the multi-hour per-test timeout markers vs the tests/examples cost profile, the deferred in-function imports, and the str(output) fallback — I defer to those and did not repeat them. Also note the CI reminder to sign commits (git commit -s -S) per CONTRIBUTING.md.

Risk

Low overall — no production code changes. The one IMPORTANT item is a likely-broken new test rather than a regression to existing behavior.

claude · 2026-07-01T13:22:34Z

@@ -324,25 +380,28 @@ def _deploy_vllm_impl(self):
                quantization=quantization_method,
                tensor_parallel_size=self.tensor_parallel_size,
                trust_remote_code=True,
+                max_model_len=4096,
            )


[IMPORTANT Compatibility] The non-eagle vLLM path unconditionally sets quantization="modelopt" (or modelopt_fp4 when the id contains "fp4"), with no branch for an unquantized checkpoint. This PR adds test_nvidia_nemotron_3_ultra_550b_a55b_bf16 (nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16, backend=("trtllm", "vllm", "sglang")) — a plain BF16 model whose id contains neither "fp4" nor any quant marker. It will be loaded with quantization="modelopt", which expects ModelOpt quant scales in the checkpoint and will fail on a BF16 model.

Why it matters: the newly added BF16 test will fail (or silently misconfigure) on the vLLM backend — and the same applies to the SGLang path below (_deploy_sglang_impl, lines 412-413), which uses the identical quantization="modelopt" default. The TRT-LLM path is fine since it doesn't pass a quantization arg.

Suggestion: gate the quantization argument on whether the model is actually quantized, e.g. only pass quantization=... when the id indicates fp8/fp4/nvfp4, and omit it (or pass None) for BF16/unquantized checkpoints.

claude · 2026-07-01T13:22:34Z

+@pytest.mark.parametrize(
+    "command",
+    [
+        *ModelDeployerList(
+            model_id="nvidia/Wan2.2-T2V-A14B-Diffusers-NVFP4",
+            backend=("trtllm", "sglang"),
+            tensor_parallel_size=1,
+            mini_sm=100,
+        ),
+    ],
+    ids=idfn,
+)
+def test_wan2_2_t2v_a14b_diffusers_nvfp4(command):
+    command.run()


[SUGGESTION] Wan2.2-T2V-A14B-Diffusers is a text-to-video diffusion model, and diffusiongemma-26B-A4B-it (below) is a diffusion model — but both are routed through ModelDeployer, whose backend implementations (_deploy_trtllm_impl / _deploy_sglang_impl / _deploy_vllm_impl) call llm.generate(prompts) and then assert output.outputs[0].text (trtllm/vllm) or output["text"] (sglang) is a non-empty string. A T2V model does not emit .text — it produces video latents — so if these engines even load it via the text-LLM API, the isinstance(generated_text, str) / .strip() assertions would not exercise anything meaningful (and the SGLang str(output) fallback would mask a non-text return by stringifying it).

Why it matters: these cases may pass vacuously or fail confusingly depending on how each engine handles a diffusion checkpoint through the text API.

Suggestion: confirm these engines expose a text-generation API for T2V/diffusion checkpoints; if not, give diffusion models a dedicated deploy path (or explicitly skip the text-only assertions) rather than reusing the token-generation path. If it is intentional and supported, a one-line comment noting that would help.

nvSiruiW and others added 23 commits July 1, 2026 02:34

Add deploy test case for nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4

4e37c06

Signed-off-by: Sirui Wang <siruiw@nvidia.com>

Add deploy test case for nvidia/Qwen3.6-35B-A3B-NVFP4

20df239

Signed-off-by: Sirui Wang <siruiw@nvidia.com>

Add deploy test case for nvidia/DeepSeek-V4-Pro-NVFP4

02f0456

Signed-off-by: Sirui Wang <siruiw@nvidia.com>

Add deploy test case for nvidia/Qwen3.5-122B-A10B-NVFP4

af118a9

Signed-off-by: Sirui Wang <siruiw@nvidia.com>

Add deploy test case for nvidia/Gemma-4-26B-A4B-NVFP4

b2e47ba

Signed-off-by: Sirui Wang <siruiw@nvidia.com>

Add deploy test case for nvidia/GLM-5.1-NVFP4

c15f913

Signed-off-by: Sirui Wang <siruiw@nvidia.com>

Add deploy test case for nvidia/Kimi-K2.6-NVFP4

9d874db

Signed-off-by: Sirui Wang <siruiw@nvidia.com>

Add deploy test case for nvidia/Wan2.2-T2V-A14B-Diffusers-NVFP4

99781f2

Signed-off-by: Sirui Wang <siruiw@nvidia.com>

Add deploy test case for nvidia/Wan2.2-T2V-A14B-Diffusers-FP8

e8eee56

Signed-off-by: Sirui Wang <siruiw@nvidia.com>

Add deploy test case for nvidia/DeepSeek-V4-Flash-NVFP4

72d1c4e

Signed-off-by: Sirui Wang <siruiw@nvidia.com>

Add deploy test case for nvidia/MiniMax-M3-NVFP4

04bfb27

Signed-off-by: Sirui Wang <siruiw@nvidia.com>

Add deploy test case for nvidia/Kimi-K2.6-Eagle3

b1f11c9

Signed-off-by: Sirui Wang <siruiw@nvidia.com>

Add deploy test case for nvidia/diffusiongemma-26B-A4B-it-NVFP4

6071669

Signed-off-by: Sirui Wang <siruiw@nvidia.com>

Add deploy test case for nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasonin…

2fb6697

…g-FP8 Signed-off-by: Sirui Wang <siruiw@nvidia.com>

Add deploy test case for nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasonin…

aa2f34b

…g-NVFP4 Signed-off-by: Sirui Wang <siruiw@nvidia.com>

Add deploy test case for nvidia/GLM-5.2-NVFP4

90a951d

Signed-off-by: Sirui Wang <siruiw@nvidia.com>

Add deploy test case for nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16

044c8e9

Signed-off-by: Sirui Wang <siruiw@nvidia.com>

Add max_model_len=4096 to vLLM deploy to limit KV cache allocation

f8c1907

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Increase timeouts for large model deploy tests

2444a81

Raise examples default timeout to 3600s, add per-test overrides: 14400s for Nemotron-3-Ultra-550B and BF16 variants, 7200s for Nemotron. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

nvSiruiW requested a review from a team as a code owner July 1, 2026 09:36

coderabbitai Bot reviewed Jul 1, 2026

View reviewed changes

kevalmorabia97 reviewed Jul 1, 2026

View reviewed changes

claude Bot reviewed Jul 1, 2026

View reviewed changes

Add deploy cases for Medusa and Kimi Eagle3

0add6bb

	generated_text = output["text"] if isinstance(output, dict) else str(output)
	generated_text = output.get("text") if isinstance(output, dict) else output
	assert isinstance(generated_text, str), f"Output {i} text is not a string"

Uh oh!

Conversation

nvSiruiW commented Jul 1, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Usage

Testing

Before your PR is "Ready for review"

Additional Information

Summary by CodeRabbit

Uh oh!

copy-pr-bot Bot commented Jul 1, 2026

Uh oh!

coderabbitai Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Pre-merge checks failed

❌ Failed checks (1 error, 1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

kevalmorabia97 Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

kevalmorabia97 Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

kevalmorabia97 commented Jul 1, 2026

Uh oh!

kevalmorabia97 commented Jul 1, 2026

Uh oh!

claude Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Uh oh!

claude Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

nvSiruiW commented Jul 1, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jul 1, 2026 •

edited

Loading