
Three knobs, in priority order

Every shot has three audio surfaces:
  1. Native audio: what the video model emits on its own. Some models speak; some don’t. The shot’s native_audio.mode controls whether that audio is used, dropped, or mixed.
  2. Attached tracks: voiceovers, music beds, and uploaded files wired into the shot through the workflow graph.
  3. Captions: word-timed transcriptions on voiceover tracks, styled per subtitleStyle and burned into the final render.

native_audio

"native_audio": { "mode": "mix", "volume": 0.6 }
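| Mode   | Render output                                                           |
|--------|-------------------------------------------------------------------------|
| off    | Silence the model’s baked audio. Attached tracks are the only sound.    |
| native | Use only the model’s baked audio. Attached tracks are skipped.          |
| mix    | Blend the model’s audio with attached tracks.                           |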
The default is native, since most users want the audio that came with the shot. Switch to mix when you have a professional voiceover that should sit on top of the baked dialogue, or to off when the baked audio is wrong for your context.
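The three modes amount to a small selection rule. Here is a minimal sketch in Python, assuming a hypothetical helper named audible_sources (it is not part of the API) that lists which sources reach the render for a given mode:

```python
from typing import List

VALID_MODES = {"off", "native", "mix"}

def audible_sources(mode: str, attached_track_ids: List[str]) -> List[str]:
    """Return the sources heard in the render for a native_audio mode.

    Hypothetical helper for illustration; "native" stands for the model's
    baked audio, and the other entries are attached track identifiers.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unknown native_audio mode: {mode!r}")
    sources: List[str] = []
    if mode in ("native", "mix"):
        sources.append("native")            # baked audio is kept
    if mode in ("off", "mix"):
        sources.extend(attached_track_ids)  # attached tracks are kept
    return sources

print(audible_sources("mix", ["voiceover", "music"]))
```

With mode set to native, the attached tracks are left out of the result even if they are wired into the shot, which matches the "skipped" behaviour described for that mode.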

audio_tracks

audio_tracks is a derived list, not stored state. Every time the server answers GET /audio, it walks the workflow graph, finds every audio node connected to the shot, and produces a track object for each one. This means there is no separate “track table” that can drift from the graph. If a human deletes a voiceover node, its track is gone on the next read. If an agent calls attach_track, the next read includes it. Humans and agents share the same source of truth.
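The idea of deriving tracks on every read can be sketched as follows. The graph shape used here (a nodes mapping plus a list of from/to edges) and the function name derive_audio_tracks are assumptions made for illustration; the real graph format is not described in this section, and "connected" is simplified to a direct edge.

```python
from typing import Dict, List

def derive_audio_tracks(graph: Dict, shot_id: str) -> List[Dict]:
    """Recompute the track list from the graph; nothing is stored.

    Assumed (hypothetical) graph shape:
      graph["nodes"]: node_id -> {"type": ..., "kind": ...}
      graph["edges"]: list of {"from": node_id, "to": node_id}
    """
    nodes = graph["nodes"]
    tracks = []
    for edge in graph["edges"]:
        node = nodes.get(edge["from"])
        # Keep only audio nodes that still exist and feed this shot.
        if edge["to"] == shot_id and node is not None and node["type"] == "audio":
            tracks.append({"node_id": edge["from"], "kind": node["kind"]})
    return tracks

graph = {
    "nodes": {
        "vo_1": {"type": "audio", "kind": "voiceover"},
        "img_1": {"type": "image", "kind": "reference"},
    },
    "edges": [
        {"from": "vo_1", "to": "shot_1"},
        {"from": "img_1", "to": "shot_1"},
    ],
}

print(derive_audio_tracks(graph, "shot_1"))
del graph["nodes"]["vo_1"]                   # a human deletes the voiceover node
print(derive_audio_tracks(graph, "shot_1"))  # the track is gone on the next read
```

Because the list is rebuilt from the graph on each call, there is no second copy of the data that could fall out of sync.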

Track-id determinism

Track IDs are reproducible: an agent can compute the ID a track will have before the track is created. That makes pre-flight planning possible and gives idempotency keys a natural anchor.
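The exact derivation of a track ID is not spelled out in this section. Purely as an illustration of what "reproducible" buys you, the sketch below assumes an ID is a hash of the shot ID and the audio node ID. The function name, the trk_ prefix, and the hashing scheme are all hypothetical.

```python
import hashlib

def predicted_track_id(shot_id: str, node_id: str) -> str:
    """Compute a deterministic ID from stable inputs.

    ASSUMPTION: the real scheme is not documented here; this only shows
    that the same inputs always yield the same ID.
    """
    digest = hashlib.sha256(f"{shot_id}:{node_id}".encode("utf-8")).hexdigest()
    return "trk_" + digest[:12]

# The same inputs give the same ID on every call, before any track exists.
first = predicted_track_id("shot_1", "vo_1")
second = predicted_track_id("shot_1", "vo_1")
print(first == second)
```

Under a scheme like this, an agent could compute the ID up front, reuse it as an idempotency key when attaching the track, and safely retry without creating duplicates.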

Subtitle pipeline

When a voiceover track has word-timed text and a non-off subtitleStyle, the render pipeline burns word-level captions directly into the video.
{
  "transcribedWords": [
    { "word": "Hello",   "start": 0.10, "end": 0.42 },
    { "word": "world",   "start": 0.45, "end": 0.78 }
  ],
  "subtitleStyle": "tiktok"
}
Available styles: off, minimal, tiktok, bold, cinematic. Agents typically pick tiktok for short-form and cinematic for narrative. Picking the wrong style won’t break anything, since it is pure presentation, but the right style adds a lot of polish for free.
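The burn-in condition (word-timed text present and a style other than off) can be written as a small check. This is a sketch only: the function name caption_cues is hypothetical, and treating a missing subtitleStyle as off is an assumption.

```python
from typing import Dict, List, Tuple

STYLES = {"off", "minimal", "tiktok", "bold", "cinematic"}

def caption_cues(track: Dict) -> List[Tuple[float, float, str]]:
    """Return (start, end, word) cues to burn in, or [] if none apply."""
    style = track.get("subtitleStyle", "off")  # assumed default when absent
    if style not in STYLES:
        raise ValueError(f"unknown subtitleStyle: {style!r}")
    words = track.get("transcribedWords") or []
    if style == "off" or not words:
        return []  # no captions without both a style and timed words
    return [(w["start"], w["end"], w["word"]) for w in words]

track = {
    "transcribedWords": [
        {"word": "Hello", "start": 0.10, "end": 0.42},
        {"word": "world", "start": 0.45, "end": 0.78},
    ],
    "subtitleStyle": "tiktok",
}
for start, end, word in caption_cues(track):
    print(f"{start:.2f}-{end:.2f} {word}")
```

Changing subtitleStyle to off, or removing transcribedWords, makes the function return an empty list, which corresponds to a render with no captions.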

Putting it together

A shot with a professional voiceover, a music bed under it, and the shot’s baked dialogue blended in:
{
  "clip_id": "shot_1",
  "native_audio": { "mode": "mix", "volume": 0.6 },
  "audio_tracks": [
    { "kind": "voiceover", "subtitleStyle": "tiktok", "volume": 1.0 },
    { "kind": "music",     "ducking": true,           "volume": 0.4 }
  ]
}
Render output: baked dialogue at 60% under the voiceover at full volume, music at 40% and ducked a further 8 dB whenever the voiceover speaks, and captions burned in.
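To get a feel for what "40% and a further 8 dB" means, the arithmetic can be worked through in Python. This sketch assumes that volume is a linear amplitude multiplier and that the ducking attenuation is applied on top of it; neither detail is confirmed in this section.

```python
import math

def db_to_gain(db: float) -> float:
    """Convert a decibel change to a linear amplitude factor."""
    return 10 ** (db / 20)

def gain_to_db(gain: float) -> float:
    """Convert a linear amplitude factor to decibels."""
    return 20 * math.log10(gain)

music_volume = 0.4   # from the example shot
duck_db = -8.0       # extra attenuation while the voiceover speaks

# ASSUMPTION: ducking multiplies the track's linear volume.
ducked_gain = music_volume * db_to_gain(duck_db)

print(f"music, voiceover silent:   {music_volume:.3f} ({gain_to_db(music_volume):.2f} dB)")
print(f"music, voiceover speaking: {ducked_gain:.3f} ({gain_to_db(ducked_gain):.2f} dB)")
```

Under these assumptions the music sits at roughly 0.16 of full scale (about -16 dB) while the voiceover is speaking, and returns to 0.4 (about -8 dB) in the gaps.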