The audio model - Lavendly

Three knobs, in priority order

Every shot has three audio surfaces:

Native audio: what the video model emits on its own. Some models speak; some don’t. The shot’s native_audio.mode controls whether that audio is used, dropped, or mixed.
Attached tracks: voiceovers, music beds, and uploaded files wired into the shot through the workflow graph.
Captions: word-timed transcriptions on voiceover tracks, styled per subtitleStyle and burned into the final render.

native_audio

"native_audio": { "mode": "mix", "volume": 0.6 }

Mode	Render output
`off`	Silence the model’s baked audio. Attached tracks are the only sound.
`native`	Use only the model’s baked audio. Attached tracks are skipped.
`mix`	Blend the model’s audio with attached tracks.

The default is native, because most users want the audio that came with the shot. Switch to mix when you have a professional voiceover that should sit on top of the baked dialogue, or to off when the baked audio is wrong for your context.
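The mode table can be read as a small lookup from mode to the surfaces that end up audible. The sketch below only restates that table in Python; the function name and the returned dictionary shape are illustrative assumptions, not part of the Lavendly API.

```python
# Illustrative only: restates the native_audio mode table as a lookup.
# The function name and return shape are hypothetical.
MODE_TABLE = {
    "off": {"native": False, "attached": True},     # baked audio silenced
    "native": {"native": True, "attached": False},  # attached tracks skipped
    "mix": {"native": True, "attached": True},      # both blended
}


def audible_sources(mode: str = "native") -> dict:
    """Return which audio surfaces are heard for a given native_audio.mode."""
    if mode not in MODE_TABLE:
        raise ValueError(f"unknown native_audio mode: {mode!r}")
    return dict(MODE_TABLE[mode])
```

Calling audible_sources() with no argument mirrors the documented default of native.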

audio_tracks

A derived list. The server walks the workflow graph, finds every audio node connected to the shot, and produces a track object for each one, every time it answers GET /audio. This means there is no separate “track table” that can drift from the graph. If a human deletes a voiceover node, its track is gone on the next read. If an agent calls attach_track, the next read includes it. Same source of truth for both sides.
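A rough sketch of what "derived" means here, assuming a simplified graph with a nodes map and a list of directed edges. The graph shape, the field names, the "upload" kind, and the restriction to direct edges are all assumptions made for illustration; the real server may walk the graph differently.

```python
# Hypothetical graph shape: {"nodes": {id: {"type": ...}}, "edges": [{"from": id, "to": id}]}
AUDIO_KINDS = {"voiceover", "music", "upload"}  # "upload" is an assumed kind name


def derive_tracks(graph: dict, shot_id: str) -> list:
    """Rebuild the track list from the graph on every call; nothing is stored."""
    nodes = graph["nodes"]
    tracks = []
    for edge in graph["edges"]:
        if edge["to"] != shot_id:
            continue  # edge feeds a different shot
        node = nodes.get(edge["from"])
        if node is None or node["type"] not in AUDIO_KINDS:
            continue  # node was deleted, or is not an audio node
        tracks.append({"node_id": edge["from"], "kind": node["type"]})
    return tracks
```

Because the list is recomputed on each read, deleting a node from the graph removes its track from the next result without any separate cleanup step.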

Track-id determinism

Track IDs are reproducible: an agent can compute the ID it will have before the track is created. That makes pre-flight planning possible (and gives idempotency keys somewhere natural to anchor on).
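This page does not spell out the actual ID scheme, so the following is only one way a reproducible ID could work: a name-based UUID over the shot ID and the audio node ID. The namespace string, the "trk_" prefix, and the choice of inputs are assumptions for illustration.

```python
import uuid

# Hypothetical namespace; the real scheme is not documented on this page.
TRACK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/tracks")


def predicted_track_id(shot_id: str, node_id: str) -> str:
    """Compute a stable ID from inputs the agent already knows before creation."""
    name = f"{shot_id}:{node_id}"
    return "trk_" + uuid.uuid5(TRACK_NAMESPACE, name).hex[:16]
```

Any pure function of known inputs has the same property: the agent can compute the ID up front and reuse it as an idempotency key when retrying a request.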

Subtitle pipeline

When a voiceover track has word-timed text and a non-off subtitleStyle, the render pipeline burns word-level captions directly into the video.

{
  "transcribedWords": [
    { "word": "Hello",   "start": 0.10, "end": 0.42 },
    { "word": "world",   "start": 0.45, "end": 0.78 }
  ],
  "subtitleStyle": "tiktok"
}

Available styles: off, minimal, tiktok, bold, cinematic. Agents typically pick tiktok for short-form and cinematic for narrative. Picking the wrong style won’t break anything, since it’s pure presentation, but the right style adds a lot of polish for free.
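The burn-in condition described in this section (a voiceover track, word-timed text, and a style other than off) can be expressed as a small predicate. This is a sketch: the function name is hypothetical, and treating a missing subtitleStyle as off is an assumption.

```python
SUBTITLE_STYLES = {"off", "minimal", "tiktok", "bold", "cinematic"}


def captions_enabled(track: dict) -> bool:
    """True when the track would get word-level captions burned in."""
    style = track.get("subtitleStyle", "off")  # assumed default when absent
    if style not in SUBTITLE_STYLES:
        raise ValueError(f"unknown subtitleStyle: {style!r}")
    has_words = bool(track.get("transcribedWords"))
    return track.get("kind") == "voiceover" and has_words and style != "off"
```

A pre-flight check like this lets an agent confirm that a track has word timings before it relies on captions appearing in the render.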

Putting it together

A shot with a professional voiceover, a music bed under it, and the shot’s baked dialogue blended in:

{
  "clip_id": "shot_1",
  "native_audio": { "mode": "mix", "volume": 0.6 },
  "audio_tracks": [
    { "kind": "voiceover", "subtitleStyle": "tiktok", "volume": 1.0 },
    { "kind": "music",     "ducking": true,           "volume": 0.4 }
  ]
}

Render output: baked dialogue at 60% under the voiceover at full volume, music at 40% and ducked a further 8 dB whenever the voiceover speaks, and captions burned in.
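To get a feel for what an 8 dB duck does to the music bed, the arithmetic can be worked through as below. It assumes that volume is a linear amplitude multiplier and that the duck is an amplitude change in decibels (20·log10); neither is confirmed on this page, and the function names are hypothetical.

```python
def db_to_gain(db: float) -> float:
    """Convert an amplitude change in decibels to a linear multiplier."""
    return 10 ** (db / 20)


def ducked_volume(volume: float, duck_db: float = 8.0) -> float:
    """Linear volume of a track while it is ducked by duck_db decibels."""
    return volume * db_to_gain(-duck_db)
```

Under those assumptions, ducked_volume(0.4) comes out at roughly 0.16, so the music sits at about 16% of full scale while the voiceover speaks and returns to 40% in the gaps.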
