Transcribe audio files to text using ElevenLabs Scribe v2 with word-level timestamps, speaker diarization, and entity detection.

Model

elevenlabs/speech-to-text

Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| audio_url | string | Yes | - | URL of the audio file to transcribe |
| language_code | string | No | auto-detect | ISO 639-1/639-3 language code |
| diarize | boolean | No | false | Identify who is speaking (speaker diarization) |
| num_speakers | integer | No | auto | Max number of speakers (up to 32) |
| diarization_threshold | number | No | 0.22 | Higher = fewer speakers, lower = more. Only used when diarize is true and num_speakers is not set |
| timestamps_granularity | string | No | word | "none", "word", or "character" |
| tag_audio_events | boolean | No | true | Tag events like (laughter), (music), (footsteps) |
| entity_detection | string/array | No | - | Detect entities: "all", or specific types like "pii", "phi", "pci", "offensive_language" |
| temperature | number | No | 0 | Randomness (0–2). Higher = more diverse output |
| seed | integer | No | - | Seed for deterministic results (0–2147483647) |
| keyterms | array | No | - | Words/phrases to boost recognition accuracy (max 100, each under 50 chars) |
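
Several parameters above carry limits that are cheap to check before submitting a task. The following is a minimal Python sketch of such a check; `build_input` is a hypothetical helper, not part of any SDK, and the lower bound of 1 for num_speakers is an assumption (the table only states the upper bound).

```python
from typing import Any

GRANULARITIES = {"none", "word", "character"}


def build_input(audio_url: str, **options: Any) -> dict:
    """Return an `input` object, raising ValueError when a documented limit is exceeded."""
    if not audio_url:
        raise ValueError("audio_url is required")

    num_speakers = options.get("num_speakers")
    # Assumption: at least 1 speaker; the table only documents "up to 32".
    if num_speakers is not None and not 1 <= num_speakers <= 32:
        raise ValueError("num_speakers must be between 1 and 32")

    temperature = options.get("temperature")
    if temperature is not None and not 0 <= temperature <= 2:
        raise ValueError("temperature must be between 0 and 2")

    seed = options.get("seed")
    if seed is not None and not 0 <= seed <= 2147483647:
        raise ValueError("seed must be between 0 and 2147483647")

    granularity = options.get("timestamps_granularity")
    if granularity is not None and granularity not in GRANULARITIES:
        raise ValueError("timestamps_granularity must be none, word, or character")

    keyterms = options.get("keyterms")
    if keyterms is not None:
        if len(keyterms) > 100:
            raise ValueError("keyterms accepts at most 100 entries")
        if any(len(term) >= 50 for term in keyterms):
            raise ValueError("each keyterm must be under 50 characters")

    # Options are passed through unchanged once they pass the checks.
    return {"audio_url": audio_url, **options}
```

Calling `build_input("https://example.com/meeting.mp3", diarize=True, num_speakers=3)` returns a dictionary that can be used as the `input` field of the request body.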

Example - Basic Transcription

curl -X POST https://api.unifically.com/v1/tasks \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "elevenlabs/speech-to-text",
    "input": {
      "audio_url": "https://example.com/audio.mp3"
    }
  }'

Example - With Diarization

curl -X POST https://api.unifically.com/v1/tasks \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "elevenlabs/speech-to-text",
    "input": {
      "audio_url": "https://example.com/meeting.mp3",
      "diarize": true,
      "num_speakers": 3,
      "language_code": "en",
      "timestamps_granularity": "word",
      "tag_audio_events": true,
      "entity_detection": "all",
      "keyterms": ["ElevenLabs", "API"]
    }
  }'
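
For callers who prefer Python over curl, the same POST can be assembled with the standard library. This is a sketch that mirrors the endpoint, headers, and body of the curl examples above; the helper names `make_request` and `submit` are illustrative only.

```python
import json
import urllib.request

API_URL = "https://api.unifically.com/v1/tasks"
MODEL = "elevenlabs/speech-to-text"


def make_request(api_key: str, task_input: dict) -> urllib.request.Request:
    """Build the POST request without sending it."""
    body = json.dumps({"model": MODEL, "input": task_input}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


def submit(api_key: str, task_input: dict) -> dict:
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(make_request(api_key, task_input)) as resp:
        return json.load(resp)
```

Keeping request construction separate from sending makes it possible to inspect or log the payload before any network call is made.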

Response

{
  "code": 200,
  "success": true,
  "data": {
    "task_id": "abc123def456",
    "status": "pending"
  }
}
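
Because the initial response only reports a pending task, a client has to check the task again until its status becomes "completed". This page does not describe how a task is retrieved, so the sketch below takes that step as a caller-supplied function: `fetch_task` is a hypothetical callable that returns the response JSON for a task_id. Only the "pending" and "completed" statuses appear in this document; any other terminal status would need to be passed in through `done_statuses`.

```python
import time
from typing import Callable


def wait_for_task(
    fetch_task: Callable[[str], dict],
    task_id: str,
    interval: float = 2.0,
    timeout: float = 300.0,
    done_statuses: tuple = ("completed",),
) -> dict:
    """Poll until the task reaches a status in done_statuses, then return its `data`."""
    deadline = time.monotonic() + timeout
    while True:
        data = fetch_task(task_id)["data"]
        if data["status"] in done_statuses:
            return data
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"task {task_id} still {data['status']!r} after {timeout}s"
            )
        # Wait before asking again to avoid hammering the API.
        time.sleep(interval)
```

The interval and timeout defaults are arbitrary illustrative values and should be tuned to the length of the audio being transcribed.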

Completed Response

{
  "code": 200,
  "success": true,
  "data": {
    "task_id": "abc123def456",
    "status": "completed",
    "output": {
      "language_code": "eng",
      "language_probability": 0.97,
      "text": "Hello world",
      "words": [
        {
          "text": "Hello",
          "start": 0.0,
          "end": 0.5,
          "type": "word",
          "speaker_id": "speaker_0"
        },
        {
          "text": " ",
          "start": 0.5,
          "end": 0.6,
          "type": "spacing"
        },
        {
          "text": "world",
          "start": 0.6,
          "end": 1.0,
          "type": "word",
          "speaker_id": "speaker_0"
        }
      ]
    }
  }
}

Word Object

Each item in the words array (words, spacing, and audio events) contains:
| Field | Type | Description |
| --- | --- | --- |
| text | string | The transcribed text |
| start | number | Start time in seconds |
| end | number | End time in seconds |
| type | string | "word", "spacing", or audio event type |
| speaker_id | string | Speaker identifier (when diarize is enabled) |
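
When diarize is enabled, the words array can be folded into per-speaker turns. The sketch below is one possible approach, not an official utility: `speaker_turns` is a hypothetical helper, and it assumes that items without a speaker_id (such as the spacing item in the sample response) belong to the turn in progress.

```python
def speaker_turns(words: list[dict]) -> list[dict]:
    """Merge consecutive items from the same speaker into turns."""
    turns: list[dict] = []
    for item in words:
        speaker = item.get("speaker_id")
        # Items with no speaker_id are attached to the current turn.
        if turns and (speaker is None or speaker == turns[-1]["speaker_id"]):
            turn = turns[-1]
            turn["text"] += item["text"]
            # Trailing spacing does not extend the turn's end time.
            if item["type"] != "spacing":
                turn["end"] = item["end"]
        else:
            turns.append(
                {
                    "speaker_id": speaker,
                    "start": item["start"],
                    "end": item["end"],
                    "text": item["text"],
                }
            )
    for turn in turns:
        turn["text"] = turn["text"].strip()
    return turns
```

Applied to the words array in the completed response above, this yields a single turn for speaker_0 spanning 0.0 to 1.0 seconds with the text "Hello world".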