Transcribe audio files to text using ElevenLabs AI with word-level timestamps and speaker detection.

Model

elevenlabs/speech-to-text

Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| audio_url | string | Yes | - | URL of the audio file to transcribe |
| tag_audio_events | boolean | No | true | Tag audio events like laughter |
| include_subtitles | boolean | No | false | Include subtitle timing |
| keyterms | string | No | - | Comma-separated keywords for better accuracy |

Example - Basic Transcription

curl -X POST https://api.unifically.com/v1/tasks \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "elevenlabs/speech-to-text",
    "input": {
      "audio_url": "https://example.com/audio.mp3"
    }
  }'

Example - With Options

curl -X POST https://api.unifically.com/v1/tasks \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "elevenlabs/speech-to-text",
    "input": {
      "audio_url": "https://example.com/audio.mp3",
      "tag_audio_events": true,
      "include_subtitles": false,
      "keyterms": "ElevenLabs,API,transcription"
    }
  }'
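
The same request can be assembled from Python. The following is a minimal sketch using only the standard library; it assumes the endpoint, headers, and body shape shown in the curl examples above. The helper name build_request and its keyword arguments are illustrative, not part of the API.

```python
import json
import urllib.request

API_URL = "https://api.unifically.com/v1/tasks"  # endpoint taken from the curl examples


def build_request(api_key, audio_url, keyterms=None,
                  tag_audio_events=None, include_subtitles=None):
    """Build (but do not send) the POST request for a transcription task.

    build_request is a hypothetical helper, not part of the API.
    """
    task_input = {"audio_url": audio_url}
    # Optional parameters are sent only when set, so the documented
    # defaults apply otherwise.
    if tag_audio_events is not None:
        task_input["tag_audio_events"] = tag_audio_events
    if include_subtitles is not None:
        task_input["include_subtitles"] = include_subtitles
    if keyterms:
        # keyterms is a single comma-separated string, not a JSON array.
        task_input["keyterms"] = ",".join(keyterms)
    body = {"model": "elevenlabs/speech-to-text", "input": task_input}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
```

To send it, pass the returned object to urllib.request.urlopen and parse the response body with json.load. Building the request separately from sending it keeps the payload easy to inspect before any network call is made.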

Response

{
  "code": 200,
  "success": true,
  "data": {
    "task_id": "abc123def456",
    "status": "pending"
  }
}
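
Because the task starts as "pending", a client has to check back until the status becomes "completed". The sketch below shows one way to structure that wait. How a task is fetched is not described on this page, so fetch_task is a caller-supplied, hypothetical function that returns the parsed JSON for a task; only the "pending" and "completed" statuses shown here are assumed to exist.

```python
import time


def wait_for_completion(fetch_task, task_id, interval=2.0, timeout=300.0):
    """Poll until the task reports status "completed" or the timeout expires.

    fetch_task is a hypothetical callable: it takes a task_id and returns
    the parsed JSON response for that task. The polling interval and the
    timeout are arbitrary defaults, not documented values.
    """
    deadline = time.monotonic() + timeout
    while True:
        response = fetch_task(task_id)
        data = response.get("data", {})
        if data.get("status") == "completed":
            return data
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} still {data.get('status')!r}")
        time.sleep(interval)
```

Any failure status the API may report is not handled here; a real client would likely want to stop polling on such a status rather than wait for the timeout.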

Completed Response

{
  "code": 200,
  "success": true,
  "data": {
    "task_id": "abc123def456",
    "status": "completed",
    "output": {
      "language_code": "eng",
      "language_probability": 0.97,
      "words": [
        {
          "text": "Hello",
          "start": 0.0,
          "end": 0.5,
          "type": "word",
          "speaker_id": "speaker_0"
        },
        {
          "text": " ",
          "start": 0.5,
          "end": 0.6,
          "type": "spacing",
          "speaker_id": "speaker_0"
        },
        {
          "text": "world",
          "start": 0.6,
          "end": 1.0,
          "type": "word",
          "speaker_id": "speaker_0"
        }
      ],
      "file_duration_seconds": 20.5
    }
  }
}
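
The completed response carries the transcript as individual entries rather than as one string. As a minimal sketch, assuming the output shape shown above, the plain text can be rebuilt by concatenating the "word" and "spacing" entries in order. The function name transcript_text is illustrative.

```python
def transcript_text(output):
    """Join word and spacing entries in order, skipping audio event entries."""
    return "".join(
        entry["text"]
        for entry in output["words"]
        if entry["type"] in ("word", "spacing")
    )


# Sample data mirroring the "output" object of the completed response above.
sample_output = {
    "language_code": "eng",
    "words": [
        {"text": "Hello", "start": 0.0, "end": 0.5, "type": "word", "speaker_id": "speaker_0"},
        {"text": " ", "start": 0.5, "end": 0.6, "type": "spacing", "speaker_id": "speaker_0"},
        {"text": "world", "start": 0.6, "end": 1.0, "type": "word", "speaker_id": "speaker_0"},
    ],
}

print(transcript_text(sample_output))  # prints "Hello world"
```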

Word Object

Each word in the words array contains:
| Field | Type | Description |
|---|---|---|
| text | string | The transcribed text |
| start | number | Start time in seconds |
| end | number | End time in seconds |
| type | string | "word", "spacing", or audio event type |
| speaker_id | string | Speaker identifier for multi-speaker audio |
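
The start, end, and speaker_id fields make it possible to group the transcript into speaker turns. The following is a rough sketch that assumes every entry carries the fields listed above; the function name speaker_segments and the shape of the returned dictionaries are illustrative choices.

```python
def speaker_segments(words):
    """Merge consecutive entries from the same speaker into one segment each."""
    segments = []
    for entry in words:
        if segments and segments[-1]["speaker_id"] == entry["speaker_id"]:
            # Same speaker as the previous entry: extend the open segment.
            segments[-1]["text"] += entry["text"]
            segments[-1]["end"] = entry["end"]
        else:
            # Speaker changed (or first entry): start a new segment.
            segments.append({
                "speaker_id": entry["speaker_id"],
                "start": entry["start"],
                "end": entry["end"],
                "text": entry["text"],
            })
    return segments
```

Each segment keeps the start time of its first entry and the end time of its last, so the result can be printed as a timestamped dialogue. Audio event entries are merged in like any other entry; filter on type first if they should be left out.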

Pricing

| Unit | Price |
|---|---|
| Per second of audio | $0.001056 |
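
As a quick worked example, the cost of a file can be estimated from its duration. This sketch assumes billing is strictly linear in seconds, with no rounding or minimum charge, which this page does not confirm; estimate_cost is an illustrative name.

```python
PRICE_PER_SECOND = 0.001056  # USD per second of audio, from the pricing table


def estimate_cost(duration_seconds):
    """Estimate the price in USD, assuming purely linear per-second billing."""
    return duration_seconds * PRICE_PER_SECOND


# The 20.5-second file from the completed response example:
print(f"${estimate_cost(20.5):.6f}")   # prints "$0.021648"
# One hour of audio:
print(f"${estimate_cost(3600):.4f}")   # prints "$3.8016"
```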

Notes

  • Output is text data, not audio
  • Supports multiple speakers with automatic speaker detection
  • Use keyterms to improve accuracy for specific words or names
  • Audio events (like laughter) are tagged when tag_audio_events is enabled