105 changes: 104 additions & 1 deletion docs/source/en/api/pipelines/cosmos3.md
@@ -32,7 +32,7 @@ From one model you can:
- Generate physically plausible video worlds from text, images, or action inputs (image-to-video, text-to-video, action-conditioned video generation).
- Reason about physical properties like motion, causality, and spatial relationships.
- Predict future video and action sequences from the current state.
-- Transfer scenes across viewpoints and conditions with structural control *(coming soon)*.
+- Transfer scenes across viewpoints and conditions with structural control (edge, blur, depth, segmentation, world-scenario maps).

Under the hood, a single `Cosmos3OmniTransformer` runs a Qwen-style language model in parallel with a diffusion generation pathway: text tokens flow through a causal "understanding" stream while video and sound latents flow through a bi-directionally-attended "generation" stream, joined by a 3D multimodal RoPE. See the [Cosmos World Foundation Model Platform paper](https://huggingface.co/papers/2501.03575) for the architectural background.

@@ -371,6 +371,109 @@ export_to_video(result.video, "cosmos3_v2v.mp4", fps=24, macro_block_size=1)
</hfoption>
</hfoptions>

## Transfer (structural control)

Transfer generates a target clip that follows a **precomputed control video**, a spatial control signal of one of these kinds: edge (Canny), blur, depth, segmentation, or a world-scenario map (WSM). Pass it through `control_videos=` as a mapping from hint name to a loaded video. The pipeline resizes, temporally pads, normalizes, and VAE-encodes the control map into a clean conditioning item placed before the noisy target, then generates the target to match it. Transfer is video-only (no `image`, `video`, `action`, or `enable_sound`), and the prompt is a pre-upsampled JSON caption (see [Prompt upsampling](#prompt-upsampling)).

Diffusers does not ship the control assets. Ready-made ones (a control video + matching `prompt.json` per hint, plus a shared `negative_prompt.json`) live in the [Cosmos cookbook](https://github.com/NVIDIA/cosmos/tree/main/cookbooks/cosmos3/generator/transfer/assets). For the edge example below, download them into a local `assets/` folder:

```bash
base=https://github.com/NVIDIA/cosmos/raw/refs/heads/main/cookbooks/cosmos3/generator/transfer/assets
mkdir -p assets/edge
curl -sL "$base/edge/control_edge.mp4" -o assets/edge/control_edge.mp4
curl -sL "$base/edge/prompt.json" -o assets/edge/prompt.json
curl -sL "$base/negative_prompt.json" -o assets/negative_prompt.json
```
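
For scripted setups, the same download can be sketched in Python. This is a minimal sketch, not part of Diffusers: `transfer_asset_urls` and `download_assets` are hypothetical helper names, and the `<hint>/control_<hint>.mp4` layout is only confirmed above for `edge`; for other hints it is an assumption that should be checked against the cookbook folder.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Same base URL as the curl snippet above.
BASE = "https://github.com/NVIDIA/cosmos/raw/refs/heads/main/cookbooks/cosmos3/generator/transfer/assets"


def transfer_asset_urls(hint: str, base: str = BASE) -> dict[str, str]:
    """Map local destination paths to source URLs for one hint.

    ASSUMPTION: every hint follows the `<hint>/control_<hint>.mp4` naming
    that the edge example uses.
    """
    return {
        f"assets/{hint}/control_{hint}.mp4": f"{base}/{hint}/control_{hint}.mp4",
        f"assets/{hint}/prompt.json": f"{base}/{hint}/prompt.json",
        "assets/negative_prompt.json": f"{base}/negative_prompt.json",
    }


def download_assets(hint: str, root: str = ".") -> None:
    """Download the control video and prompts for `hint` below `root`."""
    for dest, url in transfer_asset_urls(hint).items():
        path = Path(root) / dest
        path.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, str(path))
```

Calling `download_assets("edge")` would then produce the same `assets/` layout that the example below reads from.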

Guidance uses a nested control/text classifier-free-guidance blend. `guidance_scale` is the usual text CFG scale; `control_guidance`, when set to a value other than `1.0`, additionally amplifies the control signal. Recommended starting values per hint (matching the Cosmos Framework defaults):

| Hint | `guidance_scale` | `control_guidance` | `flow_shift` | Geometry |
| --- | --- | --- | --- | --- |
| Edge / Blur / Depth | 3.0 | 1.5 | 10.0 | 121 frames @ 30 FPS |
| Segmentation | 3.0 | 2.0 | 10.0 | 121 frames @ 30 FPS |
| World scenario (WSM) | 1.0 | 3.0 | 10.0 | 101 frames @ 10 FPS |
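
To build intuition for how the two scales interact, the snippet below sketches one plausible way to nest a control blend inside a text blend. It is an illustration under stated assumptions, not the pipeline's actual computation: the function name `nested_cfg`, the three prediction inputs, and the order of the two blends are all hypothetical. The one property it shares with the description above is that `control_guidance=1.0` reduces to ordinary text CFG.

```python
def nested_cfg(uncond, text_only, text_and_control, guidance_scale, control_guidance):
    """Illustrative nested CFG over plain Python lists of floats.

    ASSUMPTION: the model is evaluated three times (unconditional, text only,
    text plus control). The real pipeline may combine its predictions differently.
    """
    # Inner blend: push the text-conditioned prediction further along the
    # direction that the control signal adds.
    controlled = [t + control_guidance * (tc - t) for t, tc in zip(text_only, text_and_control)]
    # Outer blend: the usual text classifier-free guidance.
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, controlled)]
```

With `control_guidance=1.0` the inner blend returns `text_and_control` unchanged, so only `guidance_scale` has an effect; larger values extrapolate further along the control direction before the text blend is applied.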

Depth, segmentation, and WSM control maps must be precomputed by external models; edge and blur maps can be produced offline with any Canny or blur tool. Each shipped cookbook config uses a single hint. The pipeline also accepts several entries in `control_videos` to combine hints, but this is not a tuned or validated cookbook path; in that case set `guidance_scale` and `control_guidance` explicitly, because the per-hint defaults above assume a single hint. Long clips are generated autoregressively in chunks of `num_video_frames_per_chunk` frames and stitched automatically.
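
The chunked generation mentioned above can be pictured with a small planning helper. This is a sketch under assumptions, not the pipeline's internal logic: `plan_chunks` is a hypothetical name, and it assumes that each chunk after the first reuses the last `num_conditional_frames` frames of the previous chunk as conditioning, so consecutive chunks overlap by that many frames.

```python
def plan_chunks(num_frames, frames_per_chunk=None, num_conditional_frames=1):
    """Return (start, end) frame ranges, end exclusive, covering `num_frames`.

    ASSUMPTION: consecutive chunks overlap by `num_conditional_frames` frames.
    """
    # No chunk size, or a chunk large enough for the whole clip: one chunk.
    if frames_per_chunk is None or frames_per_chunk >= num_frames:
        return [(0, num_frames)]
    if not 0 <= num_conditional_frames < frames_per_chunk:
        raise ValueError("num_conditional_frames must be in [0, frames_per_chunk).")

    # Each later chunk contributes this many new frames.
    stride = frames_per_chunk - num_conditional_frames
    chunks = []
    start = 0
    while True:
        end = min(start + frames_per_chunk, num_frames)
        chunks.append((start, end))
        if end == num_frames:
            return chunks
        start += stride
```

For example, under these assumptions a 121-frame clip with 61-frame chunks and one carried-over frame splits into two chunks that share frame 60.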

<hfoptions id="model">
<hfoption id="Nano">

```python
import json
import torch
from diffusers import Cosmos3OmniPipeline
from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler
from diffusers.utils import export_to_video, load_video

# Downloaded into assets/ from the Cosmos cookbook (see the curl snippet above).
json_prompt = json.load(open("assets/edge/prompt.json"))
negative_prompt = json.load(open("assets/negative_prompt.json"))
control_edge = load_video("assets/edge/control_edge.mp4")

pipe = Cosmos3OmniPipeline.from_pretrained(
    "nvidia/Cosmos3-Nano", torch_dtype=torch.bfloat16, device_map="cuda"
)
pipe.scheduler = UniPCMultistepScheduler.from_config(
    pipe.scheduler.config, flow_shift=10.0, use_karras_sigmas=False
)

result = pipe(
    prompt=json.dumps(json_prompt),
    negative_prompt=json.dumps(negative_prompt),
    control_videos={"edge": control_edge},
    num_frames=121,
    height=720,
    width=1280,
    fps=30.0,
    num_inference_steps=35,
    guidance_scale=3.0,
    control_guidance=1.5,
)
# macro_block_size=1 allows arbitrary frame sizes (Cosmos3 outputs are not always divisible by 16).
export_to_video(result.video, "cosmos3_transfer_edge.mp4", fps=30, macro_block_size=1)
```

</hfoption>
<hfoption id="Super">

```python
import json
import torch
from diffusers import Cosmos3OmniPipeline
from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler
from diffusers.utils import export_to_video, load_video

# Downloaded into assets/ from the Cosmos cookbook (see the curl snippet above).
json_prompt = json.load(open("assets/edge/prompt.json"))
negative_prompt = json.load(open("assets/negative_prompt.json"))
control_edge = load_video("assets/edge/control_edge.mp4")

pipe = Cosmos3OmniPipeline.from_pretrained(
    "nvidia/Cosmos3-Super", torch_dtype=torch.bfloat16, device_map="cuda"
)
pipe.scheduler = UniPCMultistepScheduler.from_config(
    pipe.scheduler.config, flow_shift=10.0, use_karras_sigmas=False
)

result = pipe(
    prompt=json.dumps(json_prompt),
    negative_prompt=json.dumps(negative_prompt),
    control_videos={"edge": control_edge},
    num_frames=121,
    height=720,
    width=1280,
    fps=30.0,
    num_inference_steps=35,
    guidance_scale=3.0,
    control_guidance=1.5,
)
# macro_block_size=1 allows arbitrary frame sizes (Cosmos3 outputs are not always divisible by 16).
export_to_video(result.video, "cosmos3_transfer_edge.mp4", fps=30, macro_block_size=1)
```

</hfoption>
</hfoptions>
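
The examples above hard-code the edge values. To switch hints without retyping them, the recommended-values table earlier in this section can be kept as a small lookup. This is a convenience sketch rather than a Diffusers API: `TransferDefaults`, `TRANSFER_DEFAULTS`, and `call_kwargs` are hypothetical names, and the short keys `seg` and `wsm` follow the `--transfer-hint` choices of the example script.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferDefaults:
    guidance_scale: float
    control_guidance: float
    flow_shift: float
    num_frames: int
    fps: float


# Values copied from the recommended-values table (single-hint use only).
_EDGE_LIKE = TransferDefaults(3.0, 1.5, 10.0, 121, 30.0)
TRANSFER_DEFAULTS = {
    "edge": _EDGE_LIKE,
    "blur": _EDGE_LIKE,
    "depth": _EDGE_LIKE,
    "seg": TransferDefaults(3.0, 2.0, 10.0, 121, 30.0),
    "wsm": TransferDefaults(1.0, 3.0, 10.0, 101, 10.0),
}


def call_kwargs(hint: str) -> dict:
    """Keyword arguments for the pipeline call; flow_shift goes to the scheduler instead."""
    d = TRANSFER_DEFAULTS[hint]
    return {
        "num_frames": d.num_frames,
        "fps": d.fps,
        "guidance_scale": d.guidance_scale,
        "control_guidance": d.control_guidance,
    }
```

In the examples above, `**call_kwargs("edge")` would stand in for the four hard-coded arguments, and `TRANSFER_DEFAULTS["edge"].flow_shift` would be passed to `UniPCMultistepScheduler.from_config`.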

## Video-to-video with sound

When the checkpoint carries a `sound_tokenizer`, add `enable_sound=True` to the video-to-video call to jointly generate a synchronized audio track. The waveform is returned alongside the video and can be muxed into the MP4 with [`~utils.encode_video`].
116 changes: 115 additions & 1 deletion examples/cosmos3/inference_cosmos3.py
@@ -21,6 +21,13 @@
Video-to-video:
    python inference_cosmos3.py --prompt "..." --video-path /path/to/video.mp4

Transfer (ready-made control_*.mp4 + prompt.json are hosted in the Cosmos cookbook; --control-path / --prompt
accept URLs or local paths: https://github.com/NVIDIA/cosmos/tree/main/cookbooks/cosmos3/generator/transfer/assets):
    base=https://github.com/NVIDIA/cosmos/raw/refs/heads/main/cookbooks/cosmos3/generator/transfer/assets
    python inference_cosmos3.py --prompt "$(curl -sL $base/edge/prompt.json)" \
        --transfer-hint edge --control-path $base/edge/control_edge.mp4 \
        --guidance-scale 3.0 --control-guidance 1.5 --flow-shift 10.0 --num-frames 121 --fps 30

Text-to-video-with-sound (requires a sound-capable checkpoint):
    python inference_cosmos3.py --prompt "..." --enable-sound
"""
@@ -62,6 +69,11 @@ def _load_action(path: str | None):
def main():
    parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
    parser.add_argument("--prompt", required=True, help="Text prompt.")
    parser.add_argument(
        "--negative-prompt",
        default=None,
        help="Optional negative prompt text.",
    )
    parser.add_argument(
        "--model",
        choices=sorted(HF_REPOS),
@@ -89,6 +101,60 @@ def main():
        default="first",
        help="Take the video-to-video conditioning frames from the first or last of the source clip (default: first).",
    )
    parser.add_argument(
        "--transfer-hint",
        action="append",
        choices=["edge", "blur", "depth", "seg", "wsm"],
        default=None,
        help="Enable transfer with a control hint. Repeat (paired with --control-path) to combine multiple hints.",
    )
    parser.add_argument(
        "--control-path",
        action="append",
        default=None,
        help="URL or local path to a precomputed control video, paired in order with each --transfer-hint.",
    )
    parser.add_argument(
        "--control-guidance",
        type=float,
        default=1.0,
        help="Transfer control-CFG scale (recommended 1.5 for edge/blur/depth, 2.0 for seg, 3.0 for wsm).",
    )
    parser.add_argument(
        "--control-guidance-interval",
        default=None,
        help="Comma-separated [lo,hi] timestep window for control guidance (default: applied at every step).",
    )
    parser.add_argument(
        "--guidance-interval",
        default=None,
        help="Comma-separated [lo,hi] timestep window for text guidance in transfer (default: every step).",
    )
    parser.add_argument(
        "--num-conditional-frames",
        type=int,
        default=1,
        help="Frames carried over from the previous chunk as conditioning (transfer multi-chunk).",
    )
    parser.add_argument(
        "--num-first-chunk-conditional-frames",
        type=int,
        default=0,
        help="Leading frames of --video-path used to condition the first transfer chunk (requires --video-path).",
    )
    parser.add_argument(
        "--num-video-frames-per-chunk",
        type=int,
        default=None,
        help="Max frames generated per autoregressive transfer chunk (default: whole clip in one chunk).",
    )
    parser.add_argument(
        "--no-share-vision-temporal-positions",
        dest="share_vision_temporal_positions",
        action="store_false",
        default=True,
        help="Give control maps and the target distinct temporal mRoPE positions instead of sharing them (transfer).",
    )
    parser.add_argument("--output", default=".", help="Directory to save generated video/image/audio files.")
    parser.add_argument(
        "--height",
@@ -198,7 +264,52 @@ def main():
    output_dir.mkdir(parents=True, exist_ok=True)
    generator = torch.Generator().manual_seed(args.seed) if args.seed is not None else None

-    if args.action_mode is not None:
    def _parse_interval(value):
        if value is None:
            return None
        parts = [float(v) for v in value.split(",") if v.strip()]
        if len(parts) != 2:
            raise ValueError(f"Expected a comma-separated [lo,hi] interval, got {value!r}.")
        return (parts[0], parts[1])

    if args.transfer_hint is not None:
        control_paths = args.control_path or []
        if len(control_paths) != len(args.transfer_hint):
            raise ValueError("Pass one --control-path per --transfer-hint, in matching order.")
        control_videos = {hint: load_video(path) for hint, path in zip(args.transfer_hint, control_paths)}
        # `--video-path` is an OPTIONAL RGB prefix that only seeds the first chunk, and is consulted solely when
        # --num-first-chunk-conditional-frames > 0. It is unrelated to the control hints (which always drive transfer).
        conditioning_video = None
        if args.num_first_chunk_conditional_frames > 0:
            if args.video_path is None:
                raise ValueError(
                    "--num-first-chunk-conditional-frames > 0 requires --video-path (an RGB prefix clip)."
                )
            conditioning_video = load_video(args.video_path)
        elif args.video_path is not None:
            print("Ignoring --video-path: it only applies when --num-first-chunk-conditional-frames > 0.")
        result = pipeline(
            prompt=args.prompt,
            negative_prompt=args.negative_prompt,
            control_videos=control_videos,
            video=conditioning_video,
            num_frames=args.num_frames if args.num_frames != 189 else None,
            height=args.height,
            width=args.width,
            fps=args.fps,
            num_inference_steps=args.num_inference_steps,
            guidance_scale=args.guidance_scale,
            control_guidance=args.control_guidance,
            control_guidance_interval=_parse_interval(args.control_guidance_interval),
            guidance_interval=_parse_interval(args.guidance_interval),
            num_conditional_frames=args.num_conditional_frames,
            num_first_chunk_conditional_frames=args.num_first_chunk_conditional_frames,
            num_video_frames_per_chunk=args.num_video_frames_per_chunk,
            share_vision_temporal_positions=args.share_vision_temporal_positions,
            generator=generator,
            enable_safety_check=not args.no_safety_check,
        )
+    elif args.action_mode is not None:
        if args.vision_path is None:
            raise ValueError("--vision-path must point to a conditioning video for action modes.")
        if args.action_chunk_size is None:
@@ -207,6 +318,7 @@ def main():
        raw_actions = _load_action(args.action_path) if args.action_mode == "forward_dynamics" else None
        result = pipeline(
            prompt=args.prompt,
            negative_prompt=args.negative_prompt,
            action=CosmosActionCondition(
                mode=args.action_mode,
                chunk_size=args.action_chunk_size,
@@ -234,6 +346,7 @@ def main():
        )
        result = pipeline(
            prompt=args.prompt,
            negative_prompt=args.negative_prompt,
            video=video,
            condition_frame_indexes_vision=condition_frame_indexes_vision,
            condition_video_keep=args.condition_video_keep,
@@ -253,6 +366,7 @@ def main():
        image = load_image(args.vision_path) if args.vision_path is not None else None
        result = pipeline(
            prompt=args.prompt,
            negative_prompt=args.negative_prompt,
            image=image,
            num_frames=args.num_frames,
            height=args.height,