AWS Transcribe: Speech to Text at Scale

What it does

Amazon Transcribe is a fully managed automatic speech recognition (ASR) service. You give it audio, it gives you text. Two operating modes:

Batch transcription — upload a media file to S3, call StartTranscriptionJob, poll until done, retrieve a JSON transcript. The job runs asynchronously and writes results to S3. Use this for recorded calls, podcasts, video subtitles, and archived audio.

Streaming transcription — send audio in real time over HTTP/2 or WebSockets via StartStreamTranscription. Transcripts arrive incrementally as the audio flows. Use this for live captioning, real-time call monitoring, and voice interfaces.

In November 2023, AWS launched a new speech foundation model: a multi-billion-parameter model trained on millions of hours of multilingual audio. It delivered a 20–50% accuracy improvement across most languages and expanded support to over 100 languages. Existing API calls use the new model automatically, with no migration required.

How we use it

We run a FastAPI backend that accepts audio file uploads and returns transcribed text. The flow is straightforward:

import boto3
import asyncio
import uuid

class TranscriptionService:
    def __init__(self):
        self.s3 = boto3.client("s3", region_name="ap-south-1")
        self.transcribe = boto3.client("transcribe", region_name="ap-south-1")
        self.bucket = "livsyt-transcription-qa"

    async def transcribe_audio(self, audio_bytes, filename, content_type):
        job_name = f"transcribe-{uuid.uuid4().hex[:12]}"
        s3_key = f"temp-audio/{job_name}/{filename}"

        # Upload to S3
        await asyncio.to_thread(
            self.s3.put_object,
            Bucket=self.bucket,
            Key=s3_key,
            Body=audio_bytes,
            ContentType=content_type,
        )

        # Start transcription with automatic language detection
        await asyncio.to_thread(
            self.transcribe.start_transcription_job,
            TranscriptionJobName=job_name,
            Media={"MediaFileUri": f"s3://{self.bucket}/{s3_key}"},
            IdentifyLanguage=True,
            LanguageOptions=[
                "en-US", "es-US", "fr-FR", "de-DE", "it-IT",
                "pt-BR", "ja-JP", "ko-KR", "zh-CN", "ar-SA",
                "hi-IN", "ru-RU", "nl-NL", "pl-PL", "tr-TR",
            ],
        )

        # Poll until complete
        transcript = await self._wait_for_job(job_name)
        return transcript

Three design decisions worth noting:

Automatic language detection. We pass IdentifyLanguage=True with a 15-language hint list instead of hardcoding a language. The hint list improves detection speed and accuracy — AWS recommends 2–5 languages, but up to 15 works. This is critical for a multilingual user base.

Async wrapping with asyncio.to_thread. boto3 is synchronous. Wrapping every call in asyncio.to_thread prevents the FastAPI event loop from blocking during S3 uploads and Transcribe API calls.

Cleanup in a finally block. The excerpt above omits it, but the temporary S3 object is deleted in a finally block regardless of success or failure. The Transcribe job itself is also deleted after the result is retrieved. Job records otherwise persist for 90 days, and cleaning them up avoids confusion when debugging.
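The service excerpt calls _wait_for_job and relies on cleanup without showing either. A minimal sketch of what they might look like follows. The function names, poll interval, and timeout are assumptions, not the production code; get_transcription_job, delete_transcription_job, and delete_object are the standard boto3 calls.

```python
import asyncio


async def wait_for_job(transcribe, job_name, poll_seconds=2.0, timeout_seconds=600.0):
    """Poll get_transcription_job until the job completes, fails, or times out."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout_seconds
    while True:
        # boto3 is synchronous, so each poll runs in a worker thread.
        response = await asyncio.to_thread(
            transcribe.get_transcription_job, TranscriptionJobName=job_name
        )
        job = response["TranscriptionJob"]
        status = job["TranscriptionJobStatus"]
        if status == "COMPLETED":
            return job
        if status == "FAILED":
            raise RuntimeError(job.get("FailureReason", "transcription failed"))
        if loop.time() >= deadline:
            raise TimeoutError(f"job {job_name} still {status} after {timeout_seconds}s")
        await asyncio.sleep(poll_seconds)


async def cleanup(s3, transcribe, bucket, s3_key, job_name):
    """Best-effort removal of the temporary S3 object and the job record."""
    steps = (
        (s3.delete_object, {"Bucket": bucket, "Key": s3_key}),
        (transcribe.delete_transcription_job, {"TranscriptionJobName": job_name}),
    )
    for call, kwargs in steps:
        try:
            await asyncio.to_thread(call, **kwargs)
        except Exception:
            pass  # cleanup should never mask the original error
```

In the service method, the upload, job start, and wait would sit inside a try block, with cleanup awaited in the matching finally.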

The router validates content types (WAV, MP3, OGG, WebM) and enforces a 50MB size limit before the audio reaches the service:

_ALLOWED_AUDIO_TYPES = {
    "audio/wav", "audio/wave", "audio/x-wav",
    "audio/mpeg", "audio/mp3",
    "audio/ogg", "audio/webm",
}

@router.post("/api/transcribe")
async def transcribe_audio(
    file: UploadFile,
    user=Depends(get_current_user_required),
):
    if file.content_type not in _ALLOWED_AUDIO_TYPES:
        raise HTTPException(415, "Unsupported audio format")
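    # Read one byte past the limit so oversized uploads are rejected
    # instead of being silently truncated to 50MB.
    max_bytes = 50 * 1024 * 1024
    audio_bytes = await file.read(max_bytes + 1)
    if len(audio_bytes) > max_bytes:
        raise HTTPException(413, "Audio file exceeds the 50MB limit")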
    result = await service.transcribe_audio(
        audio_bytes, file.filename, file.content_type
    )
    return {"success": True, "transcription": result["text"],
            "language": result["language"]}

Languages

Amazon Transcribe supports over 100 languages as of the November 2023 foundation model release. Here are the languages most relevant for application development, with their batch and streaming availability:

+--------------------+-------+------------+-----------+
| Language           | Code  | Batch      | Streaming |
+--------------------+-------+------------+-----------+
| English (US)       | en-US | Yes        | Yes       |
+--------------------+-------+------------+-----------+
| English (UK)       | en-GB | Yes        | Yes       |
+--------------------+-------+------------+-----------+
| English            | en-AU | Yes        | Yes       |
| (Australian)       |       |            |           |
+--------------------+-------+------------+-----------+
| English (Indian)   | en-IN | Yes        | Yes       |
+--------------------+-------+------------+-----------+
| Spanish (US)       | es-US | Yes        | Yes       |
+--------------------+-------+------------+-----------+
| Spanish (Spain)    | es-ES | Yes        | Yes       |
+--------------------+-------+------------+-----------+
| French             | fr-FR | Yes        | Yes       |
+--------------------+-------+------------+-----------+
| French (Canadian)  | fr-CA | Yes        | Yes       |
+--------------------+-------+------------+-----------+
| German             | de-DE | Yes        | Yes       |
+--------------------+-------+------------+-----------+
| Italian            | it-IT | Yes        | Yes       |
+--------------------+-------+------------+-----------+
| Portuguese         | pt-BR | Yes        | Yes       |
| (Brazilian)        |       |            |           |
+--------------------+-------+------------+-----------+
| Japanese           | ja-JP | Yes        | Yes       |
+--------------------+-------+------------+-----------+
| Korean             | ko-KR | Yes        | Yes       |
+--------------------+-------+------------+-----------+
| Chinese            | zh-CN | Yes        | Yes       |
| (Simplified)       |       |            |           |
+--------------------+-------+------------+-----------+
| Chinese            | zh-TW | Yes        | Yes       |
| (Traditional)      |       |            |           |
+--------------------+-------+------------+-----------+
| Hindi              | hi-IN | Yes        | Yes       |
+--------------------+-------+------------+-----------+
| Arabic (Saudi)     | ar-SA | Yes        | Yes       |
+--------------------+-------+------------+-----------+
| Dutch              | nl-NL | Yes        | Yes       |
+--------------------+-------+------------+-----------+
| Polish             | pl-PL | Yes        | Yes       |
+--------------------+-------+------------+-----------+
| Russian            | ru-RU | Yes        | Yes       |
+--------------------+-------+------------+-----------+
| Turkish            | tr-TR | Yes        | No        |
+--------------------+-------+------------+-----------+
| Swedish            | sv-SE | Yes        | Yes       |
+--------------------+-------+------------+-----------+
| Thai               | th-TH | Yes        | Yes       |
+--------------------+-------+------------+-----------+
| Vietnamese         | vi-VN | Yes        | Yes       |
+--------------------+-------+------------+-----------+

The full list includes dozens more — Afrikaans, Basque, Catalan, Czech, Danish, Finnish, Greek, Hebrew, Hungarian, Indonesian, Latvian, Malay, Norwegian, Romanian, Serbian, Slovak, Swahili, Ukrainian, and many others. Batch-only languages include Bengali, Gujarati, Kannada, Malayalam, Marathi, Punjabi, Tamil, Telugu, and several Central Asian languages.

For the complete reference, see the Supported languages page in the Amazon Transcribe Developer Guide.

Automatic language identification

Rather than requiring the caller to specify a language, Transcribe can detect the language automatically. Two modes:

Single-language identification — detects the dominant language in the audio. Provide a LanguageOptions hint list (2–5 recommended) to improve speed and accuracy.

Multi-language identification — detects all languages present and produces a multilingual transcript with per-language duration metrics. Supports 37 languages. Useful for meetings where participants switch between languages.
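As a rough sketch, the two modes differ by one request flag and by where the detected languages appear in the completed job record. The helper names below are hypothetical. IdentifyLanguage, IdentifyMultipleLanguages, LanguageOptions, LanguageCode, and LanguageCodes are fields of the StartTranscriptionJob request and the returned job record.

```python
def language_id_kwargs(mode, hints=None):
    """Build the language-identification part of a StartTranscriptionJob request."""
    if mode == "single":
        kwargs = {"IdentifyLanguage": True}
    elif mode == "multi":
        kwargs = {"IdentifyMultipleLanguages": True}
    else:
        raise ValueError("mode must be 'single' or 'multi'")
    if hints:
        kwargs["LanguageOptions"] = list(hints)
    return kwargs


def detected_languages(job):
    """Return [(language_code, duration_seconds_or_None)] from a completed job record."""
    if "LanguageCodes" in job:  # populated for multi-language identification
        return [
            (item["LanguageCode"], item.get("DurationInSeconds"))
            for item in job["LanguageCodes"]
        ]
    return [(job["LanguageCode"], None)]
```

The returned dict is meant to be merged into the other job parameters, for example start_transcription_job(TranscriptionJobName=..., Media=..., **language_id_kwargs("single", ["en-US", "hi-IN"])).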

Audio formats

+-----------------+-------+-----------+
| Format          | Batch | Streaming |
+-----------------+-------+-----------+
| FLAC            | Yes   | Yes       |
+-----------------+-------+-----------+
| MP3             | Yes   | No        |
+-----------------+-------+-----------+
| MP4 / M4A       | Yes   | No        |
+-----------------+-------+-----------+
| WAV             | Yes   | No        |
+-----------------+-------+-----------+
| Ogg / Opus      | Yes   | Yes       |
+-----------------+-------+-----------+
| WebM            | Yes   | No        |
+-----------------+-------+-----------+
| AMR             | Yes   | No        |
+-----------------+-------+-----------+
| PCM (16-bit LE) | No    | Yes       |
+-----------------+-------+-----------+

For batch transcription, FLAC and WAV give the best accuracy. MP3’s lossy compression degrades quality, especially at low bitrates.

For streaming, FLAC is recommended. PCM (signed 16-bit little-endian) is accepted at sample rates from 8,000 Hz to 48,000 Hz; 16,000 Hz is the sweet spot between quality and bandwidth. Telephony audio typically arrives at 8,000 Hz, which is also supported.

Features

+-------------------------+--------------+------------+------------------------------+
| Feature                 | Batch        | Streaming  | Notes                        |
+-------------------------+--------------+------------+------------------------------+
| Custom vocabulary       | Yes          | Yes        | Pronunciation hints for      |
|                         |              |            | domain terms                 |
+-------------------------+--------------+------------+------------------------------+
| Custom language models  | Yes          | Yes        | Train on your text corpus    |
|                         |              |            | (10K+ words)                 |
+-------------------------+--------------+------------+------------------------------+
| Auto language detection | Yes          | Yes        | Single or multi-language     |
+-------------------------+--------------+------------+------------------------------+
| PII redaction           | en-US, es-US | 16 locales | Replaces PII with [PII]      |
+-------------------------+--------------+------------+------------------------------+
| Speaker diarization     | Yes          | Yes        | Up to 30 speakers            |
+-------------------------+--------------+------------+------------------------------+
| Channel identification  | Yes          | Yes        | 2 channels max               |
+-------------------------+--------------+------------+------------------------------+
| Subtitle generation     | Yes          | No         | WebVTT and SRT output        |
+-------------------------+--------------+------------+------------------------------+
| Toxicity detection      | Yes          | No         | en-US only                   |
+-------------------------+--------------+------------+------------------------------+
| Vocabulary filtering    | Yes          | Yes        | Mask, remove, or tag words   |
+-------------------------+--------------+------------+------------------------------+

Custom vocabulary

Improves recognition of domain-specific terms: brand names, technical acronyms, proper nouns. You provide a table with columns for Phrase, DisplayAs, IPA (pronunciation), and SoundsLike.

Phrase          DisplayAs       SoundsLike
kubernetes      Kubernetes      koo-ber-net-eez
livsyt          LivSYT          liv-sit
nginx           NGINX           engine-x

Up to 100 vocabularies per account, 50KB each, 256 characters max per entry. Created via CreateVocabulary and referenced by name in the job parameters.

Custom language models (CLM)

A step beyond custom vocabulary. CLMs teach the ASR model how words appear in context — not just pronunciation, but co-occurrence patterns. You provide training data (up to 2GB of text — transcripts, technical documents, domain-specific content) and optional tuning data (up to 200MB).

Best results require 10,000+ words of in-domain transcript text. Up to 10 CLMs per account; 3 can train concurrently.

PII redaction

Replaces personally identifiable information with [PII] in the transcript text. Configurable entity types — you can redact names and credit card numbers while keeping addresses visible, or vice versa.

Available for batch and streaming in en-US and es-US. Streaming-only redaction extends to 16 locales including en-GB, fr-FR, de-DE, it-IT, pt-BR, and es-ES.

Important: PII redaction is ML-based and may miss some instances. It is not sufficient for HIPAA de-identification on its own.
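For batch jobs, redaction is switched on through the ContentRedaction parameter of StartTranscriptionJob. The helper below is a hypothetical sketch that only builds the request fragment. The entity type names shown (NAME, CREDIT_DEBIT_NUMBER) are values from the API's PiiEntityTypes list.

```python
def redaction_kwargs(entity_types=None, keep_unredacted=False, language_code="en-US"):
    """Build the PII-redaction part of a StartTranscriptionJob request."""
    redaction = {
        "RedactionType": "PII",
        # "redacted_and_unredacted" also writes the original transcript.
        "RedactionOutput": "redacted_and_unredacted" if keep_unredacted else "redacted",
    }
    if entity_types:
        # Restrict redaction to the listed entity types; omit to redact all.
        redaction["PiiEntityTypes"] = list(entity_types)
    return {"LanguageCode": language_code, "ContentRedaction": redaction}


# Redact names and card numbers while leaving addresses visible.
request_fragment = redaction_kwargs(["NAME", "CREDIT_DEBIT_NUMBER"])
```

The fragment sets LanguageCode explicitly because, as noted above, batch redaction is limited to en-US and es-US.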

Speaker diarization

Distinguishes up to 30 speakers (increased from 10 in May 2024). Labels each speech segment as spk_0 through spk_29. The output includes per-word speaker assignments with timestamps.
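One common use of the speaker labels is computing talk time per speaker. The sketch below assumes the batch transcript JSON has already been loaded into a dict and reads the results.speaker_labels.segments list, where timestamps are strings. The sample data is invented for illustration.

```python
from collections import defaultdict


def talk_time_by_speaker(transcript):
    """Sum segment durations (in seconds) for each speaker label."""
    totals = defaultdict(float)
    for segment in transcript["results"]["speaker_labels"]["segments"]:
        duration = float(segment["end_time"]) - float(segment["start_time"])
        totals[segment["speaker_label"]] += duration
    return dict(totals)


# Illustrative input only; a real transcript carries many more fields.
sample = {
    "results": {
        "speaker_labels": {
            "segments": [
                {"speaker_label": "spk_0", "start_time": "0.0", "end_time": "2.5"},
                {"speaker_label": "spk_1", "start_time": "2.5", "end_time": "4.0"},
                {"speaker_label": "spk_0", "start_time": "4.0", "end_time": "5.0"},
            ]
        }
    }
}
print(talk_time_by_speaker(sample))
```

To get these labels in the output, the job has to be started with Settings={"ShowSpeakerLabels": True, "MaxSpeakerLabels": n}.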

Subtitle generation

Batch only. Produces WebVTT (.vtt) and/or SubRip (.srt) alongside the JSON transcript. Set OutputStartIndex=1 — the default is 0, which breaks most subtitle players.

Toxicity detection

Categorizes toxic speech content using ML. Outputs category labels and confidence scores. Currently en-US only, batch only.

Streaming deep dive

Streaming transcription sends audio in real time and receives transcripts incrementally. Three transport options:

AWS SDKs — the recommended approach. Handles protocol details automatically.

HTTP/2 — bidirectional streaming to transcribestreaming.<region>.amazonaws.com. Audio frames go upstream, transcript events come downstream.

WebSocket — connect via a presigned URL. Useful for browser-based applications where HTTP/2 bidirectional streaming is not available.

Partial results and stabilization

Transcripts arrive as a stream of TranscriptEvent objects. Each contains Results[] with IsPartial: true or IsPartial: false. Words are revised as more context arrives — earlier words may change until the segment closes.

Partial-result stabilization lets you lock words in place sooner. Enable it with EnablePartialResultsStabilization=true and set a stability level:

High — faster lockdown, slightly lower accuracy. Each locked word gets "Stable": true and will not change.
Medium — balanced.
Low — highest accuracy, more revisions before locking.

This matters for live captioning where viewers need to see committed text quickly, not words that keep changing.
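A caption renderer might use the flags roughly as follows: show the full text for final results, and for partial results show only the leading run of words already marked stable. This is a sketch over dict-shaped results with the field names from the streaming response (IsPartial, Alternatives, Items, Stable). Stopping at the first unstable word is a conservative assumption of this example, not documented service behavior.

```python
def committed_text(result):
    """Text safe to display: all of a final result, or the stable prefix of a partial."""
    alternative = result["Alternatives"][0]
    if not result["IsPartial"]:
        return alternative["Transcript"]
    words = []
    for item in alternative.get("Items", []):
        if not item.get("Stable"):
            break  # everything from here on may still be revised
        if item.get("Type") == "punctuation" and words:
            words[-1] += item["Content"]  # attach punctuation to the previous word
        else:
            words.append(item["Content"])
    return " ".join(words)
```

With stabilization disabled, no item carries Stable: true, so partial results yield an empty string and captions update only when a segment closes.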

Recommended audio settings

Sample rate: 16,000 Hz (best quality/bandwidth tradeoff)
Telephony: 8,000 Hz
Chunk duration: 50–200ms
Chunk size formula: chunk_bytes = (chunk_ms / 1000) × sample_rate × 2
Encode silence as zero bytes, never drop silent frames
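The chunk size formula is easy to get wrong by a factor of two, so here it is as a small sketch. The function names are illustrative; the factor of 2 in the formula is the 16-bit sample width, and the example assumes mono audio unless told otherwise.

```python
def chunk_bytes(chunk_ms, sample_rate_hz, bytes_per_sample=2, channels=1):
    """Bytes in one chunk of PCM audio of the given duration."""
    return chunk_ms * sample_rate_hz * bytes_per_sample * channels // 1000


def iter_chunks(pcm, size):
    """Split a PCM byte string into chunks of at most `size` bytes."""
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]


size_16k = chunk_bytes(100, 16000)   # 100 ms at 16 kHz -> 3200 bytes
size_8k = chunk_bytes(100, 8000)     # 100 ms of telephony audio -> 1600 bytes
silence = bytes(size_16k)            # 100 ms of silence: all zero bytes
```

Sending the zero-filled silence chunk during pauses keeps the stream's timing intact instead of dropping frames.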

Streaming example (Node.js)

import {
  TranscribeStreamingClient,
  StartStreamTranscriptionCommand,
} from "@aws-sdk/client-transcribe-streaming";
import { createReadStream } from "fs";

const client = new TranscribeStreamingClient({ region: "us-east-1" });

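async function* audioStream() {
  // 3,200 bytes = 100 ms of 16 kHz, 16-bit mono PCM. The default 64 KB
  // read size would send chunks of roughly 2 seconds each.
  const stream = createReadStream("audio.pcm", { highWaterMark: 3200 });
  for await (const chunk of stream) {
    yield { AudioEvent: { AudioChunk: chunk } };
  }
}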

const response = await client.send(
  new StartStreamTranscriptionCommand({
    LanguageCode: "en-US",
    MediaEncoding: "pcm",
    MediaSampleRateHertz: 16000,
    AudioStream: audioStream(),
    EnablePartialResultsStabilization: true,
    PartialResultsStability: "medium",
  })
);

for await (const event of response.TranscriptResultStream) {
  if (event.TranscriptEvent) {
    const results = event.TranscriptEvent.Transcript.Results;
    for (const result of results) {
      if (!result.IsPartial) {
        console.log(result.Alternatives[0].Transcript);
      }
    }
  }
}

Service variants

+----------------+---------------------------------+----------------------------------+
| Variant        | Use case                        | Key difference                   |
+----------------+---------------------------------+----------------------------------+
| Standard       | General transcription           | Widest language support, all     |
|                |                                 | features                         |
+----------------+---------------------------------+----------------------------------+
| Call Analytics | Contact centers                 | Sentiment, categories,           |
|                |                                 | summarization, 2-channel         |
|                |                                 | required                         |
+----------------+---------------------------------+----------------------------------+
| Medical        | Clinical dictation/conversation | Medical vocabulary,              |
|                |                                 | HIPAA-eligible                   |
+----------------+---------------------------------+----------------------------------+
| HealthScribe   | Clinical documentation          | Generates SOAP notes from        |
|                |                                 | conversations                    |
+----------------+---------------------------------+----------------------------------+

Call Analytics

Purpose-built for contact centers. Requires two-channel audio (agent on channel 0, customer on channel 1). Produces sentiment analysis per turn, talk time metrics, interruption counts, and custom category matching based on keywords or sentiment thresholds.

In November 2023, AWS added generative call summarization — AI-generated summaries of issues, action items, and outcomes. Available as an add-on at $0.0024/minute.

Medical Transcribe

Trained on medical vocabulary. Supports dictation mode (clinician dictating notes) and conversation mode (multi-speaker clinical dialogue). Specialties include cardiology, neurology, oncology, primary care, urology. HIPAA-eligible.

HealthScribe

The newest tier (2023, actively developed). Combines ASR with generative AI to produce full clinical documentation — not just transcripts, but structured SOAP notes, classified dialogue sections, and extracted medical terms. Supports 22 specialties and 7 note templates (SOAP, BIRP, SIRP, DAP, and others). Streaming support added January 2025.

Pricing

+--------------------+-------------------+---------------+------------------+
| Provider           | Streaming ($/min) | Batch ($/min) | Min billing unit |
+--------------------+-------------------+---------------+------------------+
| AWS Transcribe     | $0.024            | $0.024        | 15 seconds       |
+--------------------+-------------------+---------------+------------------+
| Deepgram Nova-3    | $0.0077           | $0.0043       | 1 second         |
+--------------------+-------------------+---------------+------------------+
| Google STT v2      | $0.016            | $0.003        | 1 second         |
+--------------------+-------------------+---------------+------------------+
| Azure AI Speech    | $0.0167           | $0.003        | 1 second         |
+--------------------+-------------------+---------------+------------------+
| OpenAI Whisper API | N/A               | $0.006        | ~1 min file      |
+--------------------+-------------------+---------------+------------------+
| AssemblyAI         | ~$0.0025          | $0.0045       | 1 second         |
+--------------------+-------------------+---------------+------------------+

AWS charges per second of audio processed, with a 15-second minimum per request. This is the most important pricing detail to understand. If your users send 3-second voice notes, you are paying for 15 seconds each time. For short-utterance workloads, this creates an effective 2–5x cost uplift compared to providers that bill per actual second.

Standard transcription tiers (US East):

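+--------+-----------------+-----------------+
| Tier   | Monthly minutes | Rate per minute |
+--------+-----------------+-----------------+
| Tier 1 | First 250,000   | $0.024          |
+--------+-----------------+-----------------+
| Tier 2 | Next 750,000    | $0.015          |
+--------+-----------------+-----------------+
| Tier 3 | Over 1,000,000  | $0.0102         |
+--------+-----------------+-----------------+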
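The tiers are marginal, so only the minutes above each threshold get the lower rate. A small sketch using the rates listed here makes that concrete. It ignores the free tier and the 15-second minimum.

```python
# (minutes in tier, rate per minute), in the order the tiers apply.
TIERS = [(250_000, 0.024), (750_000, 0.015), (float("inf"), 0.0102)]


def monthly_cost(minutes):
    """Cost in dollars for a month's standard transcription minutes."""
    remaining, total = minutes, 0.0
    for size, rate in TIERS:
        used = min(remaining, size)
        total += used * rate
        remaining -= used
        if remaining <= 0:
            break
    return total


for minutes in (250_000, 1_000_000, 1_500_000):
    print(f"{minutes:>9,} min -> ${monthly_cost(minutes):,.2f}")
```

At 1.5 million minutes the blended rate works out to about $0.0149 per minute, well above the Tier 3 headline rate.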

Free tier: 60 minutes per month for the first 12 months.

Call Analytics: $0.030/min at Tier 1, dropping to $0.0138/min above 1M minutes.

Medical Transcribe: ~$0.075/min (~$4.50 per hour-long session).

Limits

+--------------------+-------------------+------------+
| Limit              | Value             | Adjustable |
+--------------------+-------------------+------------+
| Max audio duration | 4 hours (14,400s) | No         |
+--------------------+-------------------+------------+
| Max file size      | 2 GB              | No         |
+--------------------+-------------------+------------+
| Audio channels     | 2                 | No         |
+--------------------+-------------------+------------+
| Concurrent batch   | 250               | Yes        |
| jobs               |                   |            |
+--------------------+-------------------+------------+
| Concurrent streams | 25                | Yes        |
+--------------------+-------------------+------------+
| Max speakers       | 30                | No         |
| (diarization)      |                   |            |
+--------------------+-------------------+------------+
| Custom             | 100 per account   | Yes        |
| vocabularies       |                   |            |
+--------------------+-------------------+------------+
| Custom language    | 10 per account    | Yes        |
| models             |                   |            |
+--------------------+-------------------+------------+
| Job record         | 90 days           | No         |
| retention          |                   |            |
+--------------------+-------------------+------------+
| Min audio duration | 500ms             | No         |
+--------------------+-------------------+------------+

The concurrent streaming limit of 25 is the one that bites first in production. If you have more than 25 simultaneous users streaming audio, additional requests will be throttled. This limit is adjustable via Service Quotas.
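Until a quota increase comes through, one option is to cap concurrency on your own side so excess sessions wait rather than get throttled. A minimal sketch with asyncio.Semaphore follows. The function name is hypothetical, and each session factory stands in for whatever coroutine opens and drains one Transcribe stream.

```python
import asyncio


async def run_with_stream_limit(session_factories, max_streams=25):
    """Run streaming sessions concurrently, never more than max_streams at once."""
    slots = asyncio.Semaphore(max_streams)

    async def guarded(factory):
        async with slots:  # waits here while all slots are in use
            return await factory()

    return await asyncio.gather(*(guarded(f) for f in session_factories))
```

This only limits streams opened by a single process. With several backend instances, the cap has to be divided between them or enforced in shared state.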

AWS Transcribe vs. alternatives

+-------------------------+--------------------------+-------------------+-------------------------+
| Dimension               | AWS Transcribe           | Deepgram          | Whisper (OpenAI)        |
+-------------------------+--------------------------+-------------------+-------------------------+
| Languages               | 100+                     | ~36               | 99                      |
+-------------------------+--------------------------+-------------------+-------------------------+
| Streaming               | Yes (HTTP/2, WebSocket)  | Yes               | No                      |
+-------------------------+--------------------------+-------------------+-------------------------+
| Self-hostable           | No                       | No (cloud)        | Yes (open-source model) |
+-------------------------+--------------------------+-------------------+-------------------------+
| Speaker diarization     | Up to 30                 | Yes               | No (needs pyannote)     |
+-------------------------+--------------------------+-------------------+-------------------------+
| PII redaction           | Built-in                 | Built-in (Redact) | No                      |
+-------------------------+--------------------------+-------------------+-------------------------+
| Custom vocabulary       | Yes                      | Yes (Keywords)    | No                      |
+-------------------------+--------------------------+-------------------+-------------------------+
| Free tier               | 60 min/month (12 months) | Pay-per-use       | $0.006/min (API)        |
+-------------------------+--------------------------+-------------------+-------------------------+
| Batch speed (1hr audio) | ~5 minutes               | ~20 seconds       | 10-30 min (API)         |
+-------------------------+--------------------------+-------------------+-------------------------+
| AWS integration         | Native (S3, Lambda,      | External          | External                |
|                         | EventBridge)             |                   |                         |
+-------------------------+--------------------------+-------------------+-------------------------+

When to choose AWS Transcribe:

You are already on AWS and need S3/Lambda/EventBridge integration without cross-cloud networking
You need Call Analytics features (sentiment, categories, summarization) as a managed service
You need medical transcription or HealthScribe for clinical documentation
Your compliance requirements mandate data staying within AWS

When to consider Deepgram:

Batch speed matters — Deepgram processes 1 hour of audio in ~20 seconds vs. ~5 minutes for AWS
Short utterance workloads — Deepgram bills per actual second with no minimum
Price sensitivity at scale — Deepgram Nova-3 batch is $0.0043/min vs. $0.024/min

When to consider Whisper:

You need self-hosted transcription (data cannot leave your infrastructure)
You have GPU capacity and can tolerate higher latency
You do not need streaming — Whisper is batch-only
You want a single model that handles 99 languages without per-language configuration

Common architectures

Batch pipeline: S3 event → Transcribe → Lambda

Audio uploaded to S3
  → S3 Event Notification triggers Lambda
    → Lambda calls StartTranscriptionJob
      → Transcribe writes JSON to output S3 bucket
        → S3 Event Notification triggers processing Lambda
          → Lambda reads transcript, stores in DynamoDB

This is the standard pattern for call recording archives, podcast transcription, and video subtitle pipelines. Grant Transcribe read access on the input bucket and write access on the output bucket via IAM. Always set OutputBucketName: if it is omitted, Transcribe writes to a service-managed bucket and returns a presigned URL that expires after 15 minutes.
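The first Lambda in that chain could look roughly like this. It is a sketch under assumptions: the OUTPUT_BUCKET environment variable, the default bucket name, and the job-naming scheme are all hypothetical. The event shape is the standard S3 notification, whose object keys arrive URL-encoded.

```python
import os
import re
import urllib.parse
import uuid

# Hypothetical configuration: the bucket that receives transcripts.
OUTPUT_BUCKET = os.environ.get("OUTPUT_BUCKET", "example-transcripts")


def build_job_request(record, output_bucket):
    """Turn one S3 event record into StartTranscriptionJob parameters."""
    bucket = record["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
    # Job names allow only letters, digits, '.', '_' and '-'.
    stem = re.sub(r"[^0-9a-zA-Z._-]", "-", key.rsplit("/", 1)[-1])[:150]
    return {
        "TranscriptionJobName": f"{stem}-{uuid.uuid4().hex[:8]}",
        "Media": {"MediaFileUri": f"s3://{bucket}/{key}"},
        "IdentifyLanguage": True,
        "OutputBucketName": output_bucket,
    }


def handler(event, context, transcribe=None):
    """Lambda entry point: start one transcription job per uploaded object."""
    if transcribe is None:
        import boto3  # imported lazily so the builder can be tested without AWS
        transcribe = boto3.client("transcribe")
    for record in event["Records"]:
        transcribe.start_transcription_job(**build_job_request(record, OUTPUT_BUCKET))
```

The random suffix keeps job names unique when the same file name is uploaded twice, since Transcribe rejects duplicate job names.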

Real-time: browser → backend → Transcribe Streaming

Browser microphone (getUserMedia)
  → WebAudio API → PCM chunks
    → WebSocket to your backend (API Gateway + Lambda or EC2)
      → HTTP/2 stream to transcribestreaming.<region>.amazonaws.com
        → Transcript events back to browser via WebSocket/SSE

Proxy through a server-side component. Direct browser-to-Transcribe requires a presigned WebSocket URL with SigV4 signing — possible but puts credential management in client-side code.

Call Analytics: post-call analysis

Contact center recording (Amazon Connect or external)
  → S3 (two-channel audio: agent ch0, customer ch1)
    → StartCallAnalyticsJob
      → Output: transcript + sentiment + categories + summary
        → EventBridge → Lambda → Dashboard

Call Analytics requires two-channel audio. If your recordings are mono, you cannot use this variant — use standard transcription with speaker diarization instead.
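That routing decision can be made before any job is started. The sketch below inspects a WAV header with the standard library and builds the channel mapping described above. The function names and return labels are hypothetical, and it assumes WAV input. ChannelDefinitions and ParticipantRole are parameters of StartCallAnalyticsJob.

```python
import io
import wave


def choose_job_type(wav_bytes):
    """Pick Call Analytics for two-channel WAV audio, standard transcription otherwise."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        channels = wav.getnchannels()
    return "call_analytics" if channels == 2 else "standard_with_diarization"


def call_analytics_kwargs(job_name, media_uri, role_arn):
    """Parameters for StartCallAnalyticsJob with agent on ch0 and customer on ch1."""
    return {
        "CallAnalyticsJobName": job_name,
        "Media": {"MediaFileUri": media_uri},
        "DataAccessRoleArn": role_arn,
        "ChannelDefinitions": [
            {"ChannelId": 0, "ParticipantRole": "AGENT"},
            {"ChannelId": 1, "ParticipantRole": "CUSTOMER"},
        ],
    }
```

If the recording platform puts the customer on channel 0, swap the two ParticipantRole values. Sentiment and talk-time metrics are attributed by role, so a wrong mapping silently inverts them.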

The 15-second billing trap

This deserves its own section because it is the single most common source of unexpected Transcribe costs.

AWS bills a minimum of 15 seconds per batch transcription job. If your average voice note is 5 seconds, you are paying 3x the actual audio duration. At $0.024/minute, a 5-second clip costs $0.006 (the price of 15 seconds), not $0.002 (the price of 5 seconds).

At 100,000 clips per month averaging 5 seconds each:

Actual audio: ~139 hours → $200
Billed audio: ~417 hours → $600
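The same arithmetic as a sketch you can rerun with your own clip counts and durations. It uses the Tier 1 rate from this article and treats every clip as having the average duration, which is a simplification: in a real distribution, clips longer than 15 seconds pay no uplift.

```python
RATE_PER_MINUTE = 0.024   # Tier 1 standard rate
MINIMUM_SECONDS = 15      # minimum billed duration per request


def monthly_bill(clips, avg_seconds, minimum_seconds=MINIMUM_SECONDS):
    """Monthly cost in dollars when each clip is billed for at least the minimum."""
    billed_seconds = clips * max(avg_seconds, minimum_seconds)
    return billed_seconds / 60 * RATE_PER_MINUTE


actual = monthly_bill(100_000, 5, minimum_seconds=0)  # cost of the audio itself
billed = monthly_bill(100_000, 5)                     # cost with the 15 s minimum
print(f"actual ${actual:,.0f}, billed ${billed:,.0f}, uplift {billed / actual:.1f}x")
```

The uplift is simply 15 divided by the average clip length whenever that length is under 15 seconds, so 3-second clips cost 5x and 10-second clips cost 1.5x.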

Three ways to mitigate:

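Use streaming for short utterances and keep one session open across several of them, so the 15-second per-request minimum is paid once per stream rather than once per clip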
Batch multiple clips into a single longer file with silence between them, then split the transcript by timestamp
Switch providers — Deepgram and AssemblyAI bill per actual second

Amazon Transcribe is the obvious choice when you are already on AWS and need managed ASR that integrates with S3, Lambda, and EventBridge without leaving the ecosystem. The foundation model update in November 2023 closed the accuracy gap with competitors across most languages. But the 15-second billing minimum is a tax on short-utterance workloads, and the $0.024/minute rate is roughly 4–6x the batch price of the Whisper API ($0.006/minute) or Deepgram Nova-3 ($0.0043/minute). Know your workload shape before committing.