[feat]: support wan2.2 s2v, support dist infer, pose-audio #1113

Merged
helloyongyang merged 2 commits into main from s2v
Jun 2, 2026

Conversation

Collaborator

@Watebear Watebear commented Jun 2, 2026

No description provided.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces support for the Wan2.2 S2V (Speech-to-Video) model, adding configuration files, a dedicated runner, networks, and utilities for audio encoding, audio injection, and frame packing. The review feedback highlights several key improvements: fixing an AttributeError in weight synchronization within casual_audio.py, replacing a float comparison with an integer frame count check in audio_encoder.py to avoid precision issues, dynamically determining latent channels in framepack.py, ensuring multi-device portability in pre_infer.py by avoiding hardcoded device strings, removing an unused variable in transformer_infer.py, and correcting a type annotation syntax error in wan_causal_audio_module.py.

Comment on lines +32 to +33
dst.final_linear.weight.data.copy_(src.final_linear.weight.t())
dst.final_linear.bias.data.copy_(src.final_linear.bias)

high

src.final_linear is registered as an MMWeight module, which wraps the underlying PyTorch tensors in WeightTensor objects. Accessing .weight and .bias directly on src.final_linear returns these wrapper objects rather than raw PyTorch tensors, which will cause an AttributeError when calling .t() or copying data. You should access .tensor on them, matching how other weights are copied in this function.

Suggested change
- dst.final_linear.weight.data.copy_(src.final_linear.weight.t())
- dst.final_linear.bias.data.copy_(src.final_linear.bias)
+ dst.final_linear.weight.data.copy_(src.final_linear.weight.tensor.t())
+ dst.final_linear.bias.data.copy_(src.final_linear.bias.tensor)
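
The failure mode described in this comment can be reproduced without the real classes. The sketch below is only an illustration: `Matrix` and this `WeightTensor` are hypothetical stand-ins written for the example, not lightx2v's actual implementations. The one assumption carried over from the comment is that the wrapper keeps the raw data in `.tensor` and does not forward tensor methods such as `.t()`.

```python
class Matrix:
    """Hypothetical stand-in for a 2-D tensor; supports only transpose via t()."""

    def __init__(self, rows):
        self.rows = [list(row) for row in rows]

    def t(self):
        # zip(*rows) regroups the values column by column, i.e. a transpose.
        return Matrix(zip(*self.rows))


class WeightTensor:
    """Hypothetical wrapper: keeps the raw data in .tensor, forwards nothing."""

    def __init__(self, tensor):
        self.tensor = tensor


wrapped = WeightTensor(Matrix([[1, 2, 3], [4, 5, 6]]))

# Calling a tensor method on the wrapper fails, because the wrapper has no t().
try:
    wrapped.t()
    direct_call_failed = False
except AttributeError:
    direct_call_failed = True

# Unwrapping first reaches the object that actually implements t().
transposed = wrapped.tensor.t()
```

If the real wrapper behaves like this stand-in, unwrapping with `.tensor` before calling `.t()` is what makes the copy work.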

Comment on lines +14 to +15
if required_duration > total_frames / original_fps:
    raise ValueError("required_duration must be less than video length")

medium

Comparing float durations (required_duration > total_frames / original_fps) can lead to precision issues (e.g., when they are mathematically equal but float representation makes one slightly larger), causing unexpected ValueError exceptions. Comparing the integer frame counts (required_origin_frames > total_frames) is much more robust.

Suggested change
- if required_duration > total_frames / original_fps:
-     raise ValueError("required_duration must be less than video length")
+ if required_origin_frames > total_frames:
+     raise ValueError("required_duration must be less than video length")
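
A small sketch of the precision problem, for readers who want to see it trigger. How `required_duration` is computed in the PR is not shown on this page, so the example assumes it was built by multiplying a frame count by a 0.1 s frame interval. The helper names `fits_by_duration` and `fits_by_frames` are hypothetical and exist only for this illustration.

```python
def fits_by_duration(required_duration, total_frames, original_fps):
    """Float check in the style of the original condition."""
    return not (required_duration > total_frames / original_fps)


def fits_by_frames(required_origin_frames, total_frames):
    """Integer check in the style of the suggested condition."""
    return not (required_origin_frames > total_frames)


total_frames = 3
original_fps = 10
required_origin_frames = 3

# Assumption for illustration: the duration was built by multiplying the frame
# count by a 0.1 s frame interval, which is not exactly representable in binary.
required_duration = required_origin_frames * 0.1

float_ok = fits_by_duration(required_duration, total_frames, original_fps)
frames_ok = fits_by_frames(required_origin_frames, total_frames)
```

Here the clip needs exactly as many frames as the video has, yet the float check rejects it because `3 * 0.1` evaluates to slightly more than `3 / 10`. The integer check accepts it.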


for m in motion_latents:
    lat_height, lat_width = m.shape[2], m.shape[3]
    padd_lat = torch.zeros(16, zip_frame_buckets.sum(), lat_height, lat_width, device=m.device, dtype=m.dtype)

medium

The number of latent channels is hardcoded to 16. To make the frame packing logic robust to different VAE architectures or configurations, it is better to dynamically use m.shape[0] instead of a magic number.

Suggested change
- padd_lat = torch.zeros(16, zip_frame_buckets.sum(), lat_height, lat_width, device=m.device, dtype=m.dtype)
+ padd_lat = torch.zeros(m.shape[0], zip_frame_buckets.sum(), lat_height, lat_width, device=m.device, dtype=m.dtype)
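
To make the shape reasoning concrete, here is a plain-Python sketch that computes only the padding shape, with no tensors involved. It assumes each motion latent is laid out as (channels, frames, height, width), which is inferred from the snippet reading height and width from `m.shape[2]` and `m.shape[3]`. The function name, the bucket sizes, and the example shapes are hypothetical.

```python
def padded_latent_shape(latent_shape, bucket_sizes):
    """Return the zero-padding shape for one motion latent laid out as (C, T, H, W)."""
    channels, _, lat_height, lat_width = latent_shape
    # Channels come from the latent itself instead of a fixed 16.
    return (channels, sum(bucket_sizes), lat_height, lat_width)


# Hypothetical bucket sizes and latent shapes, chosen only for the example.
buckets = [1, 2, 16]
shape_16ch = padded_latent_shape((16, 4, 60, 104), buckets)
shape_48ch = padded_latent_shape((48, 4, 30, 52), buckets)
```

With a hardcoded 16, the second case would produce a padding buffer whose channel dimension does not match the latent it is meant to hold.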

Comment thread lightx2v/models/networks/wan/infer/s2v/pre_infer.py
from lightx2v.models.networks.wan.infer.transformer_infer import WanTransformerInfer
from lightx2v_platform.base.global_var import AI_DEVICE

torch_device_module = getattr(torch, AI_DEVICE)

medium

torch_device_module is defined but never used anywhere in this file. It can be safely removed to clean up the code and avoid potential AttributeError on unsupported devices.
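
Because a module-level statement runs at import time, even an unused lookup like this one can fail before any other code in the file executes. The sketch below illustrates that with a hypothetical stand-in object instead of the real `torch` module. `fake_torch`, `resolve_backend`, and the device names are assumptions made for the example.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the torch module: only a "cuda" backend is exposed.
fake_torch = SimpleNamespace(cuda="cuda backend module")


def resolve_backend(module, device_name):
    """Look up a backend attribute the way getattr(torch, AI_DEVICE) does."""
    return getattr(module, device_name)


found = resolve_backend(fake_torch, "cuda")

# An unknown device name raises as soon as the lookup runs.
try:
    resolve_backend(fake_torch, "other_device")
    lookup_failed = False
except AttributeError:
    lookup_failed = True
```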



class MotionEncoder_tc(nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int, num_heads=int, need_global=True, dtype=None, device=None):

medium

In the method signature, num_heads=int uses the assignment operator = instead of a colon : for the type annotation. This makes the default value of num_heads the int class itself, which will cause runtime errors (e.g., in rearrange) if the argument is ever omitted. It should be written as num_heads: int.

Suggested change
- def __init__(self, in_dim: int, hidden_dim: int, num_heads=int, need_global=True, dtype=None, device=None):
+ def __init__(self, in_dim: int, hidden_dim: int, num_heads: int, need_global=True, dtype=None, device=None):
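
The difference between `=` and `:` in a signature is easy to check in isolation. The sketch below uses two hypothetical minimal classes, `WithDefault` and `WithAnnotation`, not the real `MotionEncoder_tc`. It shows that the `=` form silently accepts a missing argument and only fails later, when the value is used as a number.

```python
import inspect


class WithDefault:
    # "num_heads=int" sets the default VALUE to the built-in int class.
    def __init__(self, hidden_dim: int, num_heads=int):
        self.hidden_dim = hidden_dim
        self.num_heads = num_heads


class WithAnnotation:
    # "num_heads: int" is a type hint; the argument is now required.
    def __init__(self, hidden_dim: int, num_heads: int):
        self.hidden_dim = hidden_dim
        self.num_heads = num_heads


default_value = inspect.signature(WithDefault.__init__).parameters["num_heads"].default
silently_wrong = WithDefault(hidden_dim=64).num_heads  # no error yet, but holds the int class

# The mistake only surfaces later, when the value is used as a number.
try:
    64 // silently_wrong
    arithmetic_failed = False
except TypeError:
    arithmetic_failed = True

# With the annotation, omitting the argument fails immediately at construction.
try:
    WithAnnotation(hidden_dim=64)
    construction_failed = False
except TypeError:
    construction_failed = True
```

Failing at construction puts the error next to its cause, which is the practical benefit of the corrected signature.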

@Watebear Watebear force-pushed the s2v branch 2 times, most recently from 24d4f7e to cd75cec June 2, 2026 09:19
@helloyongyang helloyongyang merged commit 92b2f1b into main Jun 2, 2026
2 checks passed
@helloyongyang helloyongyang deleted the s2v branch June 2, 2026 09:49