
# The audio model

> How native audio and attached tracks compose. One graph, no hidden state.

## Three knobs, in priority order

Every shot has three audio surfaces:

1. **Native audio**: what the video model emits on its own. Some
   models speak; some don't. The shot's `native_audio.mode` controls
   whether that audio is used, dropped, or mixed.
2. **Attached tracks**: voiceovers, music beds, and uploaded files
   wired into the shot through the workflow graph.
3. **Captions**: word-timed transcriptions on voiceover tracks,
   styled per `subtitleStyle` and burned into the final render.

## native\_audio

```json theme={null}
"native_audio": { "mode": "mix", "volume": 0.6 }
```

| Mode     | Render output                                                        |
| -------- | -------------------------------------------------------------------- |
| `off`    | Silence the model's baked audio. Attached tracks are the only sound. |
| `native` | Use only the model's baked audio. Attached tracks are skipped.       |
| `mix`    | Blend the model's audio with attached tracks.                        |

The default is `native`, because most users want the audio that came
with the shot. Switch to `mix` when you have a professional voiceover
that should sit on top of the baked dialogue, or to `off` when the
baked audio is wrong for your context.
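The mode table can be read as a small decision function. The sketch below is illustrative only: `audible_sources` and its return labels are hypothetical names, not part of the API, and it assumes that `mix` with no attached tracks simply yields the native audio.

```python
def audible_sources(mode: str, has_attached_tracks: bool) -> list[str]:
    """Return which audio sources reach the render for a native_audio mode.

    Hypothetical helper mirroring the mode table; not an API call.
    """
    if mode == "off":
        # Baked audio is silenced; attached tracks are the only sound.
        return ["attached"] if has_attached_tracks else []
    if mode == "native":
        # Only the baked audio is used; attached tracks are skipped.
        return ["native"]
    if mode == "mix":
        # Baked audio is blended with whatever tracks are attached.
        return ["native", "attached"] if has_attached_tracks else ["native"]
    raise ValueError(f"unknown native_audio mode: {mode!r}")


print(audible_sources("native", has_attached_tracks=True))
print(audible_sources("mix", has_attached_tracks=True))
```

Note that under `native`, an attached voiceover stays in the graph but does not sound in the render.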

## audio\_tracks

A derived list. The server walks the workflow graph, finds every audio
node connected to the shot, and produces a track object for each one,
every time it answers `GET /audio`.

This means there is no separate "track table" that can drift from the
graph. If a human deletes a voiceover node, its track is gone on the
next read. If an agent calls `attach_track`, the next read includes
it. Same source of truth for both sides.
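A minimal sketch of what "derived, not stored" means in practice. The graph shape here (`nodes`, `edges`, `from`, `to`, `type`) is an assumption made for illustration, and the sketch only follows direct edges into the shot; the real server-side walk may differ.

```python
def derive_audio_tracks(graph: dict, shot_id: str) -> list[dict]:
    """Recompute the track list from the graph on every call.

    The graph layout is hypothetical: a dict of nodes keyed by ID plus a
    list of directed edges. Nothing is cached between calls.
    """
    nodes = graph["nodes"]
    tracks = []
    for edge in graph["edges"]:
        if edge["to"] != shot_id:
            continue
        node = nodes.get(edge["from"])
        # Skip dangling edges and non-audio nodes.
        if node is None or node["type"] != "audio":
            continue
        tracks.append({"node_id": edge["from"], "kind": node["kind"]})
    return tracks


graph = {
    "nodes": {
        "vo_1": {"type": "audio", "kind": "voiceover"},
        "bed_1": {"type": "audio", "kind": "music"},
        "img_1": {"type": "image", "kind": "reference"},
    },
    "edges": [
        {"from": "vo_1", "to": "shot_1"},
        {"from": "bed_1", "to": "shot_1"},
        {"from": "img_1", "to": "shot_1"},
    ],
}

before = derive_audio_tracks(graph, "shot_1")  # voiceover + music
del graph["nodes"]["vo_1"]                     # a human deletes the node
after = derive_audio_tracks(graph, "shot_1")   # music only
```

Because the list is rebuilt from the graph each time, there is nothing to invalidate or migrate when the graph changes.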

## Track-id determinism

Track IDs are reproducible: an agent can compute the ID a track *will*
have before that track is created. That makes pre-flight planning
possible and gives idempotency keys a natural place to anchor.
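This page does not say how the ID is derived, so the following is only a sketch of one common way to get reproducible IDs: hash the stable inputs. The `trk_` prefix, the choice of inputs (`shot_id`, `node_id`), and the function name are all hypothetical and should not be taken as the server's actual scheme.

```python
import hashlib


def predicted_track_id(shot_id: str, node_id: str) -> str:
    """Derive a stable ID from inputs known before the track exists.

    Hypothetical scheme for illustration; the real derivation is not
    documented on this page.
    """
    digest = hashlib.sha256(f"{shot_id}:{node_id}".encode("utf-8")).hexdigest()
    return f"trk_{digest[:16]}"


# The same inputs always give the same ID, so an agent can plan against
# it (or use it as an idempotency key) before calling attach_track.
planned = predicted_track_id("shot_1", "vo_1")
retried = predicted_track_id("shot_1", "vo_1")
other = predicted_track_id("shot_1", "bed_1")
```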

## Subtitle pipeline

When a voiceover track has word-timed text and a non-`off`
`subtitleStyle`, the render pipeline burns word-level captions
directly into the video.

```json theme={null}
{
  "transcribedWords": [
    { "word": "Hello",   "start": 0.10, "end": 0.42 },
    { "word": "world",   "start": 0.45, "end": 0.78 }
  ],
  "subtitleStyle": "tiktok"
}
```

Available styles: `off`, `minimal`, `tiktok`, `bold`, `cinematic`.
Agents typically pick `tiktok` for short-form and `cinematic` for
narrative. Picking the wrong style won't break anything, since it is
pure presentation, but the right style adds a lot of polish for free.
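The burn-in condition above has two parts, and both must hold. A small sketch of that check, using the field names from the example payload; the helper name and the choice to treat a missing `subtitleStyle` as `off` are assumptions.

```python
KNOWN_STYLES = {"off", "minimal", "tiktok", "bold", "cinematic"}


def captions_enabled(track: dict) -> bool:
    """True when a voiceover track would get burned-in captions.

    Hypothetical helper: requires word-timed text AND a non-off style.
    A missing subtitleStyle is treated as "off" (an assumption).
    """
    style = track.get("subtitleStyle", "off")
    if style not in KNOWN_STYLES:
        raise ValueError(f"unknown subtitleStyle: {style!r}")
    words = track.get("transcribedWords") or []
    return bool(words) and style != "off"


track = {
    "transcribedWords": [
        {"word": "Hello", "start": 0.10, "end": 0.42},
        {"word": "world", "start": 0.45, "end": 0.78},
    ],
    "subtitleStyle": "tiktok",
}

with_captions = captions_enabled(track)
without_style = captions_enabled({**track, "subtitleStyle": "off"})
without_words = captions_enabled({"subtitleStyle": "tiktok"})
```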

## Putting it together

A shot with a professional voiceover, a music bed under it, and the
shot's baked dialogue blended in:

```json theme={null}
{
  "clip_id": "shot_1",
  "native_audio": { "mode": "mix", "volume": 0.6 },
  "audio_tracks": [
    { "kind": "voiceover", "subtitleStyle": "tiktok", "volume": 1.0 },
    { "kind": "music",     "ducking": true,           "volume": 0.4 }
  ]
}
```

Render output: baked dialogue at 60% under the voiceover at full
volume, music at 40% and ducked a further 8 dB whenever the voiceover
speaks, and captions burned in.
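To get a feel for what 8 dB of ducking does to the music bed, the arithmetic can be worked through directly. This assumes `volume` is a linear amplitude multiplier, which this page does not state; the helper names are hypothetical.

```python
def db_to_gain(db: float) -> float:
    """Convert a decibel change to a linear amplitude factor."""
    return 10 ** (db / 20)


def ducked_volume(volume: float, duck_db: float = 8.0) -> float:
    """Effective linear volume while the voiceover is speaking.

    Assumes `volume` is a linear amplitude gain and that ducking
    attenuates it by `duck_db` decibels.
    """
    return volume * db_to_gain(-duck_db)


music_idle = 0.4
music_ducked = ducked_volume(music_idle)  # roughly 0.16
print(f"music: {music_idle:.2f} idle, {music_ducked:.2f} while ducked")
```

Under that assumption, the music drops from 40% to about 16% of full scale while the voiceover is speaking, then returns to 40%.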
