Model
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| audio_url | string | Yes | - | URL of the audio file to transcribe |
| language_code | string | No | auto-detect | ISO 639-1/639-3 language code |
| diarize | boolean | No | false | Identify who is speaking (speaker diarization) |
| num_speakers | integer | No | auto | Max number of speakers (up to 32) |
| diarization_threshold | number | No | 0.22 | Higher = fewer speakers, lower = more. Only used when diarize is true and num_speakers is not set |
| timestamps_granularity | string | No | word | "none", "word", or "character" |
| tag_audio_events | boolean | No | true | Tag events like (laughter), (music), (footsteps) |
| entity_detection | string/array | No | - | Detect entities: "all", or specific types like "pii", "phi", "pci", "offensive_language" |
| temperature | number | No | 0 | Randomness (0–2). Higher = more diverse output |
| seed | integer | No | - | Seed for deterministic results (0–2147483647) |
| keyterms | array | No | - | Words/phrases to boost recognition accuracy (max 100, each under 50 chars) |
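
The constraints in the table above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the documented API: the helper name `build_transcription_params` is hypothetical, the lower bound of 1 for `num_speakers` is an assumption, and the function only assembles a parameter dictionary. It does not call any endpoint, because the endpoint and authentication details are not given here.

```python
from typing import Any, Optional, Sequence, Union

GRANULARITIES = {"none", "word", "character"}
MAX_SEED = 2_147_483_647


def build_transcription_params(  # hypothetical helper, not a documented API call
    audio_url: str,
    *,
    language_code: Optional[str] = None,
    diarize: bool = False,
    num_speakers: Optional[int] = None,
    diarization_threshold: Optional[float] = None,
    timestamps_granularity: str = "word",
    tag_audio_events: bool = True,
    entity_detection: Union[str, Sequence[str], None] = None,
    temperature: float = 0.0,
    seed: Optional[int] = None,
    keyterms: Optional[Sequence[str]] = None,
) -> dict[str, Any]:
    """Validate values against the parameter table and return a request body."""
    if not audio_url:
        raise ValueError("audio_url is required")
    if timestamps_granularity not in GRANULARITIES:
        raise ValueError(f"timestamps_granularity must be one of {sorted(GRANULARITIES)}")
    # Upper bound of 32 is from the table; the lower bound of 1 is an assumption.
    if num_speakers is not None and not 1 <= num_speakers <= 32:
        raise ValueError("num_speakers must be between 1 and 32")
    if not 0 <= temperature <= 2:
        raise ValueError("temperature must be between 0 and 2")
    if seed is not None and not 0 <= seed <= MAX_SEED:
        raise ValueError(f"seed must be between 0 and {MAX_SEED}")
    if keyterms is not None:
        if len(keyterms) > 100:
            raise ValueError("keyterms accepts at most 100 entries")
        if any(len(term) >= 50 for term in keyterms):
            raise ValueError("each keyterm must be under 50 characters")

    params: dict[str, Any] = {
        "audio_url": audio_url,
        "diarize": diarize,
        "timestamps_granularity": timestamps_granularity,
        "tag_audio_events": tag_audio_events,
        "temperature": temperature,
    }
    # Leave unset optional fields out so the service applies its own defaults.
    optional = {
        "language_code": language_code,
        "num_speakers": num_speakers,
        "entity_detection": entity_detection,
        "seed": seed,
    }
    params.update({key: value for key, value in optional.items() if value is not None})
    if keyterms is not None:
        params["keyterms"] = list(keyterms)
    # The threshold only applies when diarize is true and num_speakers is not set.
    if diarize and num_speakers is None and diarization_threshold is not None:
        params["diarization_threshold"] = diarization_threshold
    return params
```

For example, `build_transcription_params("https://example.com/call.mp3", diarize=True, keyterms=["Acme"])` returns a dictionary that can be serialized as the JSON request body. The URL is a placeholder.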
Example - Basic Transcription
Example - With Diarization
Response
Completed Response
Word Object
Each word in the `words` array contains:
| Field | Type | Description |
|---|---|---|
| text | string | The transcribed text |
| start | number | Start time in seconds |
| end | number | End time in seconds |
| type | string | "word", "spacing", or audio event type |
| speaker_id | string | Speaker identifier (when diarize is enabled) |
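
When `diarize` is enabled, the word objects can be folded into speaker turns. The sketch below uses a hypothetical helper, `group_by_speaker`, and assumes two things the table does not confirm: that entries of type "spacing" carry the whitespace between words in their `text` field, and that the `words` array is in chronological order.

```python
from typing import Any


def group_by_speaker(words: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Merge consecutive word objects with the same speaker_id into turns."""
    turns: list[dict[str, Any]] = []
    for word in words:
        speaker = word.get("speaker_id")  # absent when diarize is disabled
        if turns and turns[-1]["speaker_id"] == speaker:
            # Same speaker: extend the current turn.
            turns[-1]["text"] += word["text"]
            turns[-1]["end"] = word["end"]
        else:
            # Speaker changed (or first entry): start a new turn.
            turns.append(
                {
                    "speaker_id": speaker,
                    "start": word["start"],
                    "end": word["end"],
                    "text": word["text"],
                }
            )
    for turn in turns:
        # Drop leading/trailing spacing left at turn boundaries.
        turn["text"] = turn["text"].strip()
    return turns
```

Each returned turn keeps the `start` of its first entry and the `end` of its last, so the result can be rendered directly as a timestamped transcript.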
