Skip to content

Add deployment test cases, fix deployment-related issues, and remove private models#1873

Open
nvSiruiW wants to merge 24 commits into
NVIDIA:mainfrom
noeyy-mino:auto/add-cases
Open

Add deployment test cases, fix deployment-related issues, and remove private models#1873
nvSiruiW wants to merge 24 commits into
NVIDIA:mainfrom
noeyy-mino:auto/add-cases

Conversation

@nvSiruiW

@nvSiruiW nvSiruiW commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

Type of change: ?

Usage

# Add a code snippet demonstrating how to use this

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A
  • Did you get Claude approval on this PR?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

  • Bug Fixes
    • Improved deployment test reliability by standardizing prompt rendering and validating generated outputs consistently across backends.
  • New Features
    • Expanded deployment test coverage with additional model variants (including reasoning, diffusion, and larger models) and updated coverage for speculative execution scenarios.
  • Tests
    • Increased the default timeout for example tests to reduce false failures during longer runs.

nvSiruiW and others added 23 commits July 1, 2026 02:34
Signed-off-by: Sirui Wang <siruiw@nvidia.com>
Signed-off-by: Sirui Wang <siruiw@nvidia.com>
Signed-off-by: Sirui Wang <siruiw@nvidia.com>
Signed-off-by: Sirui Wang <siruiw@nvidia.com>
Signed-off-by: Sirui Wang <siruiw@nvidia.com>
Signed-off-by: Sirui Wang <siruiw@nvidia.com>
Signed-off-by: Sirui Wang <siruiw@nvidia.com>
Signed-off-by: Sirui Wang <siruiw@nvidia.com>
TRT-LLM 1.3.0rc17 changed the default execution path for NVFP4 MoE
models, triggering TRT engine JIT compilation (>2 hours) instead of
the PyTorch eager path used in rc16 (~105s). Explicitly set
backend="pytorch" to restore the previous behavior.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sirui Wang <siruiw@nvidia.com>
Signed-off-by: Sirui Wang <siruiw@nvidia.com>
Signed-off-by: Sirui Wang <siruiw@nvidia.com>
Signed-off-by: Sirui Wang <siruiw@nvidia.com>
Signed-off-by: Sirui Wang <siruiw@nvidia.com>
…g-FP8

Signed-off-by: Sirui Wang <siruiw@nvidia.com>
…g-NVFP4

Signed-off-by: Sirui Wang <siruiw@nvidia.com>
Add FP8 and NVFP4 variants of Nemotron-3-Nano-Omni-30B-A3B-Reasoning to
the flashinfer attention backend branch in deploy_utils.py. These models
use a Mamba hybrid architecture requiring page_size=1, which MambaRadixCache
v1 enforces — the default page_size=64 causes an AssertionError at engine init.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sirui Wang <siruiw@nvidia.com>
Signed-off-by: Sirui Wang <siruiw@nvidia.com>
Kept upstream backend restrictions for new models (K2.6, Qwen3.5/3.6,
Gemma-4, GLM-5.1, MiniMax-M3). Kept stashed chat template fix for
TRT-LLM deploy to apply chat templates when tokenizer has chat_template.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove 13 models no longer being tested: Kimi-K2-Instruct-NVFP4,
Llama-4-Maverick-17B-128E-Instruct-NVFP4, Mixtral-8x7B-Instruct-v0.1-FP8/NVFP4,
Qwen3-30B-A3B-Eagle3, Llama-3.1-70B/405B-Medusa-FP8,
EAGLE3-NVIDIA-Nemotron-3-Nano-30B-A3B-BF16, gemma-3-12b/27b-it-NVFP4/FP8,
QwQ-32B-NVFP4. Also fix missing AutoTokenizer import in trtllm deploy path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Raise examples default timeout to 3600s, add per-test overrides:
14400s for Nemotron-3-Ultra-550B and BF16 variants, 7200s for Nemotron.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@nvSiruiW nvSiruiW requested a review from a team as a code owner July 1, 2026 09:36
@copy-pr-bot

copy-pr-bot Bot commented Jul 1, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

This PR updates backend deploy test execution and prompt rendering, increases the default examples timeout, and revises the HF PTQ deployment matrix with added, removed, and new standalone model deployment tests.

Changes

Deploy test execution and backend generation updates

Layer / File(s) Summary
Subprocess execution rework
tests/_test_utils/deploy_utils.py
_run_deploy_via_subprocess builds a project-root-inclusive PYTHONPATH, launches TensorRT-LLM via mpirun with a temp-file script, launches other backends via python -c, and captures combined output before checking return codes.
Backend prompt generation updates
tests/_test_utils/deploy_utils.py
TensorRT-LLM adds backend="pytorch" and tokenizer chat-template rendering; vLLM adds max_model_len=4096 and switches to llm.chat; SGLang extends its allowlist and uses tokenizer-rendered prompts with normalized output validation and printing.

Estimated code review effort: 3 (Moderate) | ~25 minutes

Example test timeout and deployment matrix updates

Layer / File(s) Summary
Default examples timeout increase
tests/conftest.py
The default timeout for the examples test group increases from 300 seconds to 3600 seconds.
Deployment matrix parameter updates
tests/examples/hf_ptq/test_deploy.py
HF PTQ parametrizations add and remove ModelDeployerList entries across DeepSeek, Llama, Qwen, Kimi, GLM, MiniMax, and Medusa/Eagle3 blocks, and add a timeout decorator before test_llama_nemotron.
New standalone deployment tests
tests/examples/hf_ptq/test_deploy.py
Adds standalone tests for Nemotron 3 Ultra 550B A55B, Wan2.2 T2V Diffusers, DiffusionGemma 26B A4B, and Nemotron 3 Nano Omni Reasoning, each with a single-item ModelDeployerList parametrization.

Estimated code review effort: 2 (Simple) | ~15 minutes


Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

  • Ignore

❌ Failed checks (1 error, 1 warning)

Check name Status Explanation Resolution
Security Anti-Patterns ❌ Error tests/_test_utils/deploy_utils.py hardcodes trust_remote_code=True in tokenizer/backend loads with no safety justification. Expose trust_remote_code as a caller option defaulting to False, or add explicit codeowner-approved justification if remote code loading is required.
Docstring Coverage ⚠️ Warning Docstring coverage is 38.46% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the PR’s main themes: added deployment tests, deployment fixes, and removal of private/deprecated models.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 2

🧹 Nitpick comments (2)
tests/conftest.py (1)

50-58: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win

Broad default timeout bump may mask hangs in unrelated example tests.

Raising _DEFAULT_TIMEOUT["examples"] from 300s to 3600s applies to every test under tests/examples, not just the new large-model deploy tests. Since the large models already get explicit @pytest.mark.timeout(...) overrides (3600/7200/14400s) in test_deploy.py, the smaller/faster examples tests lose their tighter 300s safety net and could hang for up to an hour before being caught.

Consider keeping the default lower and relying on explicit per-test @pytest.mark.timeout overrides for the newly added heavy deploy tests (as is already partly done), consistent with CONTRIBUTING.md's guidance that tests/examples integration tests "should not take more than a few minutes to run."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/conftest.py` around lines 50 - 58, The examples timeout default is now
too broad and weakens hang detection for unrelated tests. Revert the
`_DEFAULT_TIMEOUT["examples"]` baseline in `tests/conftest.py` to the tighter
default and rely on explicit per-test `@pytest.mark.timeout(...)` overrides in
`test_deploy.py` for the heavy deploy cases. Keep the broader timeout only on
the specific large-model deploy tests so `tests/examples` remains covered by its
usual safety net.

Source: Coding guidelines

tests/examples/hf_ptq/test_deploy.py (1)

516-516: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚖️ Poor tradeoff

Multi-hour deploy tests conflict with tests/examples cost-profile guidance.

@pytest.mark.timeout(14400), timeout(7200), and timeout(3600) allow individual tests to run for up to 4 hours. CONTRIBUTING.md documents tests/examples as integration tests that "should not take more than a few minutes to run," and the path instructions call out flagging "Tests placed in the wrong directory for their cost profile" as an important issue.

If these very-large-model deploy tests are intentionally exempted from that budget (e.g., they only run in a dedicated nightly/release CI lane), consider documenting that exception, or moving them to a group whose cost profile matches (mirroring how tests/gpu* are separated from tests/unit).

Also applies to: 654-656, 725-727, 741-743

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/examples/hf_ptq/test_deploy.py` at line 516, The deploy tests in the
hf_ptq example suite use very long timeout markers, which conflicts with the
expected cost profile for tests/examples. Review the affected test functions in
test_deploy.py and either reduce these timeout values to match the “few minutes”
guidance or move/exempt the large-model cases into a dedicated slow/nightly test
group; if they must remain here, add explicit documentation for the exception.
Use the timeout decorators and the test_deploy test functions as the locating
symbols.

Sources: Coding guidelines, Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/_test_utils/deploy_utils.py`:
- Line 152: The new in-function imports in deploy_utils are not following the
module import conventions: move tempfile to module scope, and for AutoTokenizer
either import it at the top with the rest of the imports or add a short inline
comment in the relevant helper explaining that it is intentionally deferred
because it is an optional/heavy dependency. Update the affected helpers in
deploy_utils so the import placement is consistent and explicitly justified
where deferred.
- Line 477: The output normalization in the deploy test currently falls back to
str(output), which can hide unexpected backend shapes and make invalid responses
look like valid text. Update the logic around generated_text in the deploy_utils
test helper to accept only the expected dict and string forms, and treat any
other output type as a test failure instead of coercing it. Use the existing
output handling branch in the test utility to validate the shape explicitly.

---

Nitpick comments:
In `@tests/conftest.py`:
- Around line 50-58: The examples timeout default is now too broad and weakens
hang detection for unrelated tests. Revert the `_DEFAULT_TIMEOUT["examples"]`
baseline in `tests/conftest.py` to the tighter default and rely on explicit
per-test `@pytest.mark.timeout(...)` overrides in `test_deploy.py` for the heavy
deploy cases. Keep the broader timeout only on the specific large-model deploy
tests so `tests/examples` remains covered by its usual safety net.

In `@tests/examples/hf_ptq/test_deploy.py`:
- Line 516: The deploy tests in the hf_ptq example suite use very long timeout
markers, which conflicts with the expected cost profile for tests/examples.
Review the affected test functions in test_deploy.py and either reduce these
timeout values to match the “few minutes” guidance or move/exempt the
large-model cases into a dedicated slow/nightly test group; if they must remain
here, add explicit documentation for the exception. Use the timeout decorators
and the test_deploy test functions as the locating symbols.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b3edc916-3c3a-4d6d-8bf1-105de14aa937

📥 Commits

Reviewing files that changed from the base of the PR and between a05850b and 2444a81.

📒 Files selected for processing (3)
  • tests/_test_utils/deploy_utils.py
  • tests/conftest.py
  • tests/examples/hf_ptq/test_deploy.py

eagle3_one_model: bool,
) -> None:
"""Run deploy in a subprocess and print its stdout/stderr so pytest capture=tee-sys captures to DB."""
import tempfile

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Move or justify the new in-function imports.

tempfile should be imported at module scope. For AutoTokenizer, either move it to the top or add a brief comment explaining that it is intentionally deferred for an optional/heavy dependency.

As per path instructions, imports inside tests need explicit justification when deferred. As per coding guidelines, imports should stay at the top unless there is a concrete reason.

Also applies to: 328-328, 456-456

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/_test_utils/deploy_utils.py` at line 152, The new in-function imports
in deploy_utils are not following the module import conventions: move tempfile
to module scope, and for AutoTokenizer either import it at the top with the rest
of the imports or add a short inline comment in the relevant helper explaining
that it is intentionally deferred because it is an optional/heavy dependency.
Update the affected helpers in deploy_utils so the import placement is
consistent and explicitly justified where deferred.

Sources: Coding guidelines, Path instructions

f"Expected {len(COMMON_PROMPTS)} outputs, got {len(outputs)}"
)
for i, output in enumerate(outputs):
generated_text = output["text"] if isinstance(output, dict) else str(output)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Avoid converting unexpected SGLang outputs with str().

str(output) makes any unexpected object look like valid non-empty text, so this test can pass while the backend returns the wrong shape. Accept only the expected dict/string forms.

Proposed fix
-            generated_text = output["text"] if isinstance(output, dict) else str(output)
+            generated_text = output.get("text") if isinstance(output, dict) else output
             assert isinstance(generated_text, str), f"Output {i} text is not a string"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
generated_text = output["text"] if isinstance(output, dict) else str(output)
generated_text = output.get("text") if isinstance(output, dict) else output
assert isinstance(generated_text, str), f"Output {i} text is not a string"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/_test_utils/deploy_utils.py` at line 477, The output normalization in
the deploy test currently falls back to str(output), which can hide unexpected
backend shapes and make invalid responses look like valid text. Update the logic
around generated_text in the deploy_utils test helper to accept only the
expected dict and string forms, and treat any other output type as a test
failure instead of coercing it. Use the existing output handling branch in the
test utility to validate the shape explicitly.

Comment thread tests/conftest.py
# A test can override its cap by adding ``@pytest.mark.timeout(...)``
_DEFAULT_TIMEOUT = {
"examples": 300,
"examples": 3600,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make this default to 300 as before but overwritable by some env variable like MODELOPT_QA_TEST_TIMEOUT so you can set to very high timeout only for QA tests

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That way you wont have to have per-test timeout override in test_deploy also

@kevalmorabia97

Copy link
Copy Markdown
Collaborator

/claude review

@kevalmorabia97

Copy link
Copy Markdown
Collaborator

@nvSiruiW please address coderabbit / claude comments and also fix CI checks (need to sign your commits as per CONTRIBUTING.md

Comment on lines 374 to 384
@@ -324,25 +380,28 @@ def _deploy_vllm_impl(self):
quantization=quantization_method,
tensor_parallel_size=self.tensor_parallel_size,
trust_remote_code=True,
max_model_len=4096,
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[IMPORTANT Compatibility] The non-eagle vLLM path unconditionally sets quantization="modelopt" (or modelopt_fp4 when the id contains "fp4"), with no branch for an unquantized checkpoint. This PR adds test_nvidia_nemotron_3_ultra_550b_a55b_bf16 (nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16, backend=("trtllm", "vllm", "sglang")) — a plain BF16 model whose id contains neither "fp4" nor any quant marker. It will be loaded with quantization="modelopt", which expects ModelOpt quant scales in the checkpoint and will fail on a BF16 model.

Why it matters: the newly added BF16 test will fail (or silently misconfigure) on the vLLM backend — and the same applies to the SGLang path below (_deploy_sglang_impl, lines 412-413), which uses the identical quantization="modelopt" default. The TRT-LLM path is fine since it doesn't pass a quantization arg.

Suggestion: gate the quantization argument on whether the model is actually quantized, e.g. only pass quantization=... when the id indicates fp8/fp4/nvfp4, and omit it (or pass None) for BF16/unquantized checkpoints.

Comment on lines +659 to +672
@pytest.mark.parametrize(
"command",
[
*ModelDeployerList(
model_id="nvidia/Wan2.2-T2V-A14B-Diffusers-NVFP4",
backend=("trtllm", "sglang"),
tensor_parallel_size=1,
mini_sm=100,
),
],
ids=idfn,
)
def test_wan2_2_t2v_a14b_diffusers_nvfp4(command):
command.run()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[SUGGESTION] Wan2.2-T2V-A14B-Diffusers is a text-to-video diffusion model, and diffusiongemma-26B-A4B-it (below) is a diffusion model — but both are routed through ModelDeployer, whose backend implementations (_deploy_trtllm_impl / _deploy_sglang_impl / _deploy_vllm_impl) call llm.generate(prompts) and then assert output.outputs[0].text (trtllm/vllm) or output["text"] (sglang) is a non-empty string. A T2V model does not emit .text — it produces video latents — so if these engines even load it via the text-LLM API, the isinstance(generated_text, str) / .strip() assertions would not exercise anything meaningful (and the SGLang str(output) fallback would mask a non-text return by stringifying it).

Why it matters: these cases may pass vacuously or fail confusingly depending on how each engine handles a diffusion checkpoint through the text API.

Suggestion: confirm these engines expose a text-generation API for T2V/diffusion checkpoints; if not, give diffusion models a dedicated deploy path (or explicitly skip the text-only assertions) rather than reusing the token-generation path. If it is intentional and supported, a one-line comment noting that would help.

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Review Summary

Scope: Test-only PR (3 files, all under tests/): tests/_test_utils/deploy_utils.py, tests/conftest.py, tests/examples/hf_ptq/test_deploy.py. Reviewed all three in full. No modelopt/ source or public API changed.

Findings by severity: CRITICAL: 0 · IMPORTANT: 1 · SUGGESTION: 1

Most impactful

[IMPORTANT Compatibility] The new test_nvidia_nemotron_3_ultra_550b_a55b_bf16 case runs an unquantized BF16 checkpoint through the vLLM and SGLang deploy paths, both of which unconditionally pass quantization=modelopt (id has no fp4/fp8 marker). quantization=modelopt expects ModelOpt quant scales in the checkpoint, so this newly added test looks likely to fail or misconfigure on those two backends. Suggest gating the quantization arg on whether the model id indicates a quantized checkpoint. (Inline on deploy_utils.py.)

Suggestion

[SUGGESTION] The new Wan2.2-T2V and diffusiongemma diffusion cases are routed through the text-generation ModelDeployer path, which asserts a non-empty .text output. A T2V/diffusion model does not emit text, so these assertions may pass vacuously or fail confusingly (the SGLang str(output) fallback can mask a non-text return). Worth confirming these engines expose a text API for these checkpoints, or giving diffusion models a dedicated path. (Inline on test_deploy.py.)

Not duplicated

CodeRabbit already raised the broad examples timeout bump (300 to 3600s), the multi-hour per-test timeout markers vs the tests/examples cost profile, the deferred in-function imports, and the str(output) fallback — I defer to those and did not repeat them. Also note the CI reminder to sign commits (git commit -s -S) per CONTRIBUTING.md.

Risk

Low overall — no production code changes. The one IMPORTANT item is a likely-broken new test rather than a regression to existing behavior.

Comment on lines 374 to 384
@@ -324,25 +380,28 @@ def _deploy_vllm_impl(self):
quantization=quantization_method,
tensor_parallel_size=self.tensor_parallel_size,
trust_remote_code=True,
max_model_len=4096,
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[IMPORTANT Compatibility] The non-eagle vLLM path unconditionally sets quantization="modelopt" (or modelopt_fp4 when the id contains "fp4"), with no branch for an unquantized checkpoint. This PR adds test_nvidia_nemotron_3_ultra_550b_a55b_bf16 (nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16, backend=("trtllm", "vllm", "sglang")) — a plain BF16 model whose id contains neither "fp4" nor any quant marker. It will be loaded with quantization="modelopt", which expects ModelOpt quant scales in the checkpoint and will fail on a BF16 model.

Why it matters: the newly added BF16 test will fail (or silently misconfigure) on the vLLM backend — and the same applies to the SGLang path below (_deploy_sglang_impl, lines 412-413), which uses the identical quantization="modelopt" default. The TRT-LLM path is fine since it doesn't pass a quantization arg.

Suggestion: gate the quantization argument on whether the model is actually quantized, e.g. only pass quantization=... when the id indicates fp8/fp4/nvfp4, and omit it (or pass None) for BF16/unquantized checkpoints.

Comment on lines +659 to +672
@pytest.mark.parametrize(
"command",
[
*ModelDeployerList(
model_id="nvidia/Wan2.2-T2V-A14B-Diffusers-NVFP4",
backend=("trtllm", "sglang"),
tensor_parallel_size=1,
mini_sm=100,
),
],
ids=idfn,
)
def test_wan2_2_t2v_a14b_diffusers_nvfp4(command):
command.run()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[SUGGESTION] Wan2.2-T2V-A14B-Diffusers is a text-to-video diffusion model, and diffusiongemma-26B-A4B-it (below) is a diffusion model — but both are routed through ModelDeployer, whose backend implementations (_deploy_trtllm_impl / _deploy_sglang_impl / _deploy_vllm_impl) call llm.generate(prompts) and then assert output.outputs[0].text (trtllm/vllm) or output["text"] (sglang) is a non-empty string. A T2V model does not emit .text — it produces video latents — so if these engines even load it via the text-LLM API, the isinstance(generated_text, str) / .strip() assertions would not exercise anything meaningful (and the SGLang str(output) fallback would mask a non-text return by stringifying it).

Why it matters: these cases may pass vacuously or fail confusingly depending on how each engine handles a diffusion checkpoint through the text API.

Suggestion: confirm these engines expose a text-generation API for T2V/diffusion checkpoints; if not, give diffusion models a dedicated deploy path (or explicitly skip the text-only assertions) rather than reusing the token-generation path. If it is intentional and supported, a one-line comment noting that would help.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants