Three knobs, in priority order
Every shot has three audio surfaces:
- Native audio: what the video model emits on its own. Some models speak; some don’t. The shot’s native_audio.mode controls whether that’s used, dropped, or mixed.
- Attached tracks: voiceovers, music beds, and uploaded files wired into the shot through the workflow graph.
- Captions: word-timed transcriptions on voiceover tracks, styled per subtitleStyle, burned into the final render.
native_audio
| Mode | Render output |
|---|---|
| off | Silence the model’s baked audio. Attached tracks are the only sound. |
| native | Use only the model’s baked audio. Attached tracks are skipped. |
| mix | Blend the model’s audio with attached tracks. |
native fits most cases, since most users want the audio that came with the
shot. Switch to mix when you have a professional voiceover that
should sit on top of the baked dialogue, or to off when the baked
audio is wrong for your context.
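The mode table can be read as a small selection function. The sketch below is illustrative only: the helper name select_audio_sources, the "native" label, and the list-of-labels return shape are assumptions made for this example, not part of the API. Only the three mode names come from the table above.

```python
from typing import List

VALID_MODES = ("off", "native", "mix")


def select_audio_sources(mode: str, attached_tracks: List[str]) -> List[str]:
    """Return the audio sources that reach the render for a given mode.

    Hypothetical helper: "native" stands for the model's baked audio and
    attached_tracks is a list of track labels.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unknown native_audio mode: {mode!r}")
    if mode == "off":
        # Baked audio is silenced; attached tracks are the only sound.
        return list(attached_tracks)
    if mode == "native":
        # Only the baked audio is used; attached tracks are skipped.
        return ["native"]
    # mix: baked audio blended with every attached track.
    return ["native", *attached_tracks]
```

For example, a shot with one voiceover and one music bed yields ["native"] under native and ["native", "vo", "music"] under mix.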
audio_tracks
A derived list. The server walks the workflow graph, finds every audio node connected to the shot, and produces a track object for each one, every time it answers GET /audio.
This means there is no separate “track table” that can drift from the
graph. If a human deletes a voiceover node, its track is gone on the
next read. If an agent calls attach_track, the next read includes
it. Same source of truth for both sides.
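To make the "derived, not stored" idea concrete, here is a minimal sketch of such a graph walk. Everything in it is an assumption for illustration: the Node shape, the set of audio node kinds, the edge map (node id to the ids wired into it), and the choice to follow connections transitively. The real server may define "connected" differently.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

# Assumed node kinds that count as audio; not taken from the real schema.
AUDIO_KINDS = {"voiceover", "music", "upload"}


@dataclass(frozen=True)
class Node:
    id: str
    kind: str


def derive_tracks(
    shot_id: str,
    nodes: Dict[str, Node],
    inputs: Dict[str, List[str]],
) -> List[dict]:
    """Recompute the track list from the graph on every call.

    inputs maps a node id to the ids of the nodes wired into it.
    Nothing is cached, so the result cannot drift from the graph.
    """
    seen: Set[str] = set()
    stack = list(inputs.get(shot_id, []))
    tracks: List[dict] = []
    while stack:
        node_id = stack.pop()
        if node_id in seen:
            continue
        seen.add(node_id)
        node = nodes.get(node_id)
        if node is None:
            continue  # dangling edge: the node was deleted
        if node.kind in AUDIO_KINDS:
            tracks.append({"node_id": node.id, "kind": node.kind})
        stack.extend(inputs.get(node_id, []))
    # Sort so repeated reads of the same graph return the same order.
    return sorted(tracks, key=lambda t: t["node_id"])
```

Because the list is rebuilt from the graph each time, removing a voiceover node and calling derive_tracks again simply returns a list without it; there is no second record to clean up.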
Track-id determinism
Track IDs are reproducible: an agent can compute the ID it will have before the track is created. That makes pre-flight planning possible (and gives idempotency keys somewhere natural to anchor on).
Subtitle pipeline
When a voiceover track has word-timed text and a non-off
subtitleStyle, the render pipeline burns word-level captions
directly into the video.
Available styles: off, minimal, tiktok, bold, cinematic.
Agents typically pick tiktok for short-form and cinematic for
narrative. Picking the wrong style won’t break anything (it’s pure
presentation), but the right style adds a lot of polish for free.
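The burn-in condition described above can be written as a single predicate. This is a sketch under stated assumptions: the function name should_burn_captions, the Word shape, and the "voiceover" kind label are hypothetical. Only the five style names and the rule itself (voiceover track, word-timed text, non-off style) come from this section.

```python
from typing import List, Optional, TypedDict

SUBTITLE_STYLES = ("off", "minimal", "tiktok", "bold", "cinematic")


class Word(TypedDict):
    # Assumed shape of one word-timed entry, in seconds.
    text: str
    start: float
    end: float


def should_burn_captions(
    track_kind: str,
    words: Optional[List[Word]],
    subtitle_style: str,
) -> bool:
    """True when the render would burn word-level captions for this track."""
    if subtitle_style not in SUBTITLE_STYLES:
        raise ValueError(f"unknown subtitleStyle: {subtitle_style!r}")
    has_word_timing = bool(words)  # None or an empty list means no captions
    return (
        track_kind == "voiceover"
        and has_word_timing
        and subtitle_style != "off"
    )
```

Under this reading, switching a voiceover's style between tiktok and cinematic never changes whether captions appear, only how they look; setting it to off is the one value that removes them.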