Speech-to-Text - Unifically

Transcribe audio files to text using ElevenLabs AI with word-level timestamps and speaker detection.

Model

elevenlabs/speech-to-text

Parameters

Parameter	Type	Required	Default	Description
`audio_url`	string	Yes	-	URL of the audio file to transcribe
`tag_audio_events`	boolean	No	`true`	Tag audio events like laughter
`include_subtitles`	boolean	No	`false`	Include subtitle timing
`keyterms`	string	No	-	Comma-separated keywords for better accuracy

Example - Basic Transcription

curl -X POST https://api.unifically.com/v1/tasks \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "elevenlabs/speech-to-text",
    "input": {
      "audio_url": "https://example.com/audio.mp3"
    }
  }'

Example - With Options

curl -X POST https://api.unifically.com/v1/tasks \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "elevenlabs/speech-to-text",
    "input": {
      "audio_url": "https://example.com/audio.mp3",
      "tag_audio_events": true,
      "include_subtitles": false,
      "keyterms": "ElevenLabs,API,transcription"
    }
  }'
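
The same request can be sketched in Python using only the standard library. The endpoint, headers, model name, and input fields come from the curl examples above; the helper names `build_payload` and `submit_task` are illustrative and not part of any official client. The sketch assumes that omitting an optional field lets the documented default apply.

```python
import json
import urllib.request

API_URL = "https://api.unifically.com/v1/tasks"  # endpoint from the curl examples


def build_payload(audio_url, keyterms=None, tag_audio_events=None, include_subtitles=None):
    """Build the JSON body, omitting options left as None (assumed to fall back to defaults)."""
    task_input = {"audio_url": audio_url}
    if tag_audio_events is not None:
        task_input["tag_audio_events"] = tag_audio_events
    if include_subtitles is not None:
        task_input["include_subtitles"] = include_subtitles
    if keyterms:
        # keyterms is documented as a single comma-separated string
        task_input["keyterms"] = ",".join(term.strip() for term in keyterms)
    return {"model": "elevenlabs/speech-to-text", "input": task_input}


def submit_task(api_key, payload):
    """POST the payload and return the decoded JSON response (performs a network call)."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Build the body only; call submit_task("YOUR_API_KEY", payload) to actually send it.
payload = build_payload(
    "https://example.com/audio.mp3",
    keyterms=["ElevenLabs", "API", "transcription"],
    tag_audio_events=True,
    include_subtitles=False,
)
print(json.dumps(payload, indent=2))
```

Keeping payload construction separate from the network call makes it easy to inspect the body before any request is sent.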

Response

{
  "code": 200,
  "success": true,
  "data": {
    "task_id": "abc123def456",
    "status": "pending"
  }
}
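
A minimal sketch of reading the task identifier out of this response. The function name `extract_task_id` is hypothetical, and the shape of a failure response is not documented here, so the check on `success` is an assumption.

```python
import json


def extract_task_id(response_body):
    """Return data.task_id, raising if the response does not report success."""
    # Assumption: failed requests carry a falsy "success" field.
    if not response_body.get("success"):
        raise RuntimeError(f"Task creation failed: {response_body!r}")
    return response_body["data"]["task_id"]


# The sample body shown above
sample = json.loads("""
{
  "code": 200,
  "success": true,
  "data": {
    "task_id": "abc123def456",
    "status": "pending"
  }
}
""")

task_id = extract_task_id(sample)
print(task_id, sample["data"]["status"])
```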

Completed Response

{
  "code": 200,
  "success": true,
  "data": {
    "task_id": "abc123def456",
    "status": "completed",
    "output": {
      "language_code": "eng",
      "language_probability": 0.97,
      "words": [
        {
          "text": "Hello",
          "start": 0.0,
          "end": 0.5,
          "type": "word",
          "speaker_id": "speaker_0"
        },
        {
          "text": " ",
          "start": 0.5,
          "end": 0.6,
          "type": "spacing",
          "speaker_id": "speaker_0"
        },
        {
          "text": "world",
          "start": 0.6,
          "end": 1.0,
          "type": "word",
          "speaker_id": "speaker_0"
        }
      ],
      "file_duration_seconds": 20.5
    }
  }
}

Word Object

Each entry in the `words` array (a word, a spacing, or an audio event) contains:

Field	Type	Description
`text`	string	The transcribed text
`start`	number	Start time in seconds
`end`	number	End time in seconds
`type`	string	`"word"`, `"spacing"`, or audio event type
`speaker_id`	string	Speaker identifier for multi-speaker audio
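
Because spacing entries carry their own text, a readable transcript can be rebuilt by concatenating `text` values in order. The sketch below also merges consecutive entries by the same `speaker_id` into turns; `speaker_turns` is an illustrative helper, and skipping entries of any other type is a design choice for this example rather than documented behavior.

```python
def speaker_turns(words):
    """Merge consecutive word/spacing entries by the same speaker into (speaker_id, text) turns."""
    turns = []
    for entry in words:
        if entry["type"] not in ("word", "spacing"):
            continue  # leave audio events out of the transcript text
        speaker = entry.get("speaker_id")
        if turns and turns[-1][0] == speaker:
            turns[-1][1].append(entry["text"])
        else:
            turns.append((speaker, [entry["text"]]))
    return [(speaker, "".join(parts).strip()) for speaker, parts in turns]


# The words array from the completed response above
words = [
    {"text": "Hello", "start": 0.0, "end": 0.5, "type": "word", "speaker_id": "speaker_0"},
    {"text": " ", "start": 0.5, "end": 0.6, "type": "spacing", "speaker_id": "speaker_0"},
    {"text": "world", "start": 0.6, "end": 1.0, "type": "word", "speaker_id": "speaker_0"},
]

turns = speaker_turns(words)
for speaker, text in turns:
    print(f"{speaker}: {text}")
```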

Pricing

Unit	Price
Per second of audio	$0.001056
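
As a rough worked example, the 20.5-second file from the completed response would cost about $0.0216 at this rate. The sketch below multiplies the per-second price by `file_duration_seconds`; it assumes no rounding or minimum charge, since billing granularity is not stated here, and `estimate_cost` is an illustrative name.

```python
from decimal import Decimal

PRICE_PER_SECOND = Decimal("0.001056")  # USD, from the pricing table


def estimate_cost(duration_seconds):
    """Estimate the charge in USD for a given audio duration (no rounding applied)."""
    # Go through str() so a float like 20.5 converts to an exact decimal value.
    return PRICE_PER_SECOND * Decimal(str(duration_seconds))


cost = estimate_cost(20.5)  # file_duration_seconds from the completed response
print(f"Estimated cost: ${cost}")
```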

Notes

Output is text data, not audio
Supports multiple speakers with automatic speaker detection
Use `keyterms` to improve accuracy for specific words or names
Audio events (like laughter) are tagged when `tag_audio_events` is enabled
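
One way to pull tagged audio events out of a result is to keep every entry whose `type` is neither `"word"` nor `"spacing"`. The exact type string and text used for events are not documented on this page, so the `"audio_event"` type and `"(laughter)"` text in the sample data below are made up for illustration.

```python
def audio_events(words):
    """Return entries whose type is neither "word" nor "spacing"."""
    return [entry for entry in words if entry["type"] not in ("word", "spacing")]


sample_words = [
    {"text": "Hello", "start": 0.0, "end": 0.5, "type": "word", "speaker_id": "speaker_0"},
    # Hypothetical event entry: the type and text values here are assumptions
    {"text": "(laughter)", "start": 0.5, "end": 1.2, "type": "audio_event", "speaker_id": "speaker_0"},
    {"text": " ", "start": 1.2, "end": 1.3, "type": "spacing", "speaker_id": "speaker_0"},
]

events = audio_events(sample_words)
for event in events:
    print(f'{event["start"]:.1f}s to {event["end"]:.1f}s: {event["text"]}')
```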
