This Docker Compose configuration runs the AssemblyAI streaming services as a standalone self-hosted stack.
Two compose files are shipped. Pick the one that matches the model you want to serve — they are mutually exclusive (run one at a time):
| File | Models served | GPU requirement |
|---|---|---|
| `docker-compose.yml` | Universal English + Multilingual streaming | NVIDIA T4+ per ASR container |
| `docker-compose.u3pro.yml` | U3 Pro | 24 GB+ VRAM (e.g. L4, A10, A100); image bundles ~14 GB of weights |
To switch between stacks, run docker compose down (or docker compose -f docker-compose.u3pro.yml down) before starting the other.
Both stacks include:
- streaming-api: Gateway API service handling WebSocket connections.
- streaming-asr-lb: nginx load balancer for ASR services with header-based routing.
- license-and-usage-proxy: License validation and usage reporting service.
ASR backends differ by stack:
- Universal stack (`docker-compose.yml`): `streaming-asr-english` and `streaming-asr-multilang`.
- U3 Pro stack (`docker-compose.u3pro.yml`): `streaming-asr-u3pro`.
Universal stack (docker-compose.yml):
Websocket client → streaming-api:8080 (WebSocket)
│
├─ Usage reporting ───────→ license-and-usage-proxy:8080 [if usage-based billing] ────→ https://usage-tracker.assemblyai.com
│ │
├─ License validation ─────────┘
│
└─ ASR requests ───────→ streaming-asr-lb:80 → Header-based routing (X-Model-Version):
├── en-default → streaming-asr-english:50051 (gRPC)
└── ml-default → streaming-asr-multilang:50051 (gRPC)
U3 Pro stack (docker-compose.u3pro.yml):
Websocket client → streaming-api:8080 (WebSocket)
│
├─ Usage reporting ───────→ license-and-usage-proxy:8080 [if usage-based billing] ────→ https://usage-tracker.assemblyai.com
│ │
├─ License validation ─────────┘
│
└─ ASR requests ───────→ streaming-asr-lb:80 → Header-based routing (X-Model-Version):
└── u3-pro → streaming-asr-u3pro:50051 (gRPC)
Both stacks share the same nginx_streaming_asr.conf, which routes by X-Model-Version header. Each stack only deploys the backends it needs — websocket clients should use a speech_model query parameter value that routes to an available backend.
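Because the client's speech_model choice determines which backend the load balancer targets, a session only succeeds if that backend is deployed in the running stack. The URL construction can be sketched as follows (the helper name and any extra parameters are illustrative; only the speech_model query parameter comes from this document):

```python
from urllib.parse import urlencode

def session_url(endpoint: str, speech_model: str, **extra) -> str:
    """Build a streaming session URL (hypothetical helper).

    speech_model selects the ASR backend behind streaming-asr-lb; pass a
    value that routes to a backend deployed in the running stack.
    """
    query = urlencode({"speech_model": speech_model, **extra})
    return f"{endpoint}?{query}"

# e.g. session_url("ws://localhost:8080", "u3-rt-pro")
# -> "ws://localhost:8080?speech_model=u3-rt-pro"
```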
- AssemblyAI license: Valid for the streaming self-hosted product.
- Docker & Docker Compose: Ensure Docker and Docker Compose are installed.
- GPU Support: NVIDIA Container Toolkit for GPU-enabled services.
- AWS Access: Valid AWS credentials to pull images from ECR.
1.1 Verify NVIDIA drivers are installed:
nvidia-smi
1.2 Install NVIDIA Container Toolkit:
Follow the NVIDIA Container Toolkit installation guide to set up GPU support for Docker.
1.3 Verify the Docker runtime has GPU access:
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
# Login to ECR to pull container images
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 344839248844.dkr.ecr.us-west-2.amazonaws.com
Use the reference .env.example file to create a .env file with container image references:
Set the image variables relevant to the stack you plan to run:
# Required for both stacks:
STREAMING_API_IMAGE=<CUSTOM_IMAGE>
LICENSE_AND_USAGE_PROXY_IMAGE=<CUSTOM_IMAGE>
# Required for the universal stack (docker-compose.yml):
STREAMING_ASR_ENGLISH_IMAGE=<CUSTOM_IMAGE>
STREAMING_ASR_MULTILANG_IMAGE=<CUSTOM_IMAGE>
# Required for the U3 Pro stack (docker-compose.u3pro.yml):
STREAMING_ASR_U3PRO_IMAGE=<CUSTOM_IMAGE>
Ensure you have your AssemblyAI license file in the current working directory as license.jwt, or modify the LICENSE_FILE_PATH environment variable in the relevant Docker Compose file to point to your license file location.
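As a sanity check before bringing a stack up, you can verify that the .env file defines every variable the chosen stack needs. A minimal sketch (the variable names come from the listing above; the helper itself is illustrative):

```python
COMMON_VARS = ["STREAMING_API_IMAGE", "LICENSE_AND_USAGE_PROXY_IMAGE"]
REQUIRED_IMAGE_VARS = {
    "universal": ["STREAMING_ASR_ENGLISH_IMAGE", "STREAMING_ASR_MULTILANG_IMAGE"],
    "u3pro": ["STREAMING_ASR_U3PRO_IMAGE"],
}

def missing_env_vars(env_text: str, stack: str) -> list:
    """Return required variables absent from the .env contents."""
    defined = {
        line.split("=", 1)[0].strip()
        for line in env_text.splitlines()
        if "=" in line and not line.lstrip().startswith("#")
    }
    return [v for v in COMMON_VARS + REQUIRED_IMAGE_VARS[stack] if v not in defined]
```

For example, `missing_env_vars(open(".env").read(), "u3pro")` should return an empty list before you run the U3 Pro stack.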
Pick the stack you want to run. Both use the same streaming-api, load balancer, and license proxy — they differ only in the ASR backend.
For the U3 Pro stack, websocket clients should set query parameter speech_model to "u3-rt-pro" so the load balancer routes to the U3 Pro backend.
Universal stack (English + Multilingual):
docker compose up -d
docker compose logs -f
# Check service status
docker compose ps
# Stop services before switching stacks
docker compose down
U3 Pro stack:
docker compose -f docker-compose.u3pro.yml up -d
docker compose -f docker-compose.u3pro.yml logs -f
# Check service status
docker compose -f docker-compose.u3pro.yml ps
# Stop services before switching stacks
docker compose -f docker-compose.u3pro.yml down
- WebSocket: ws://localhost:8080
A Python example script is provided to demonstrate how to stream audio to the self-hosted stack.
Note: You can initiate a session as soon as the relevant ASR container is healthy. streaming-asr-english and streaming-asr-multilang log "Ready to serve!" when ready (typically ~2 min). streaming-asr-u3pro logs "U3Pro ASR Server ready!" when ready (typically ~5 min).
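If you script your deployment, the readiness markers above can be checked against container logs. A sketch (the marker strings are the ones quoted in the note; the helper itself is illustrative):

```python
READY_MARKERS = {
    "streaming-asr-english": "Ready to serve!",
    "streaming-asr-multilang": "Ready to serve!",
    "streaming-asr-u3pro": "U3Pro ASR Server ready!",
}

def is_ready(service: str, log_text: str) -> bool:
    """True once the service's readiness marker has appeared in its logs."""
    return READY_MARKERS[service] in log_text

# Feed it output captured from e.g. `docker compose logs streaming-asr-english`.
```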
Change the current directory to the streaming_example directory:
cd streaming_example
Create a fresh Python virtual environment and activate it:
python -m venv streaming_venv
source streaming_venv/bin/activate
Install the required packages to run the example script:
pip install -r requirements.txt
The example script (example_with_prerecorded_audio_file.py) accepts several CLI arguments:
Basic usage:
- Universal stack English:
python example_with_prerecorded_audio_file.py --audio-file "example_audio_file.wav" --endpoint "ws://localhost:8080" --speech-model "universal-streaming-english"
- Universal stack Multilingual:
python example_with_prerecorded_audio_file.py --audio-file "example_audio_file.wav" --endpoint "ws://localhost:8080" --speech-model "universal-streaming-multilingual"
- U3 Pro stack:
python example_with_prerecorded_audio_file.py --audio-file "example_audio_file.wav" --endpoint "ws://localhost:8080" --speech-model "u3-rt-pro"
Command-line arguments:
| Argument | Description | Default |
|---|---|---|
| `--audio-file` | Path to the audio file to transcribe | `example_audio_file.wav` |
| `--endpoint` | WebSocket endpoint URL | `ws://localhost:8080` |
| `--speech-model` | Speech model to use (e.g., `universal-streaming-multilingual`) | `` |
View help:
python example_with_prerecorded_audio_file.py --help
ASR Load Balancer (nginx_streaming_asr.conf):
- gRPC proxying to ASR services.
- Routes to the English or Multilang model based on the X-Model-Version header value.
The license-and-usage-proxy service supports two billing modes based on your AssemblyAI license:
If your license is configured for flat billing, usage tracking is disabled. No additional configuration is required.
If your license is configured for usage-based billing, the proxy will automatically report usage data to AssemblyAI's usage tracker service. You must configure the following environment variable in the docker-compose.yml for the license-and-usage-proxy service:
environment:
  - USAGE_TRACKING_API_KEY=<your-api-key>
Important Notes:
- For the API key, any key retrieved from the AssemblyAI dashboard can be used.
- At startup, the proxy validates connectivity by registering with AssemblyAI's https://usage-tracker.assemblyai.com.
- If connectivity validation fails, the proxy will shut down.
- Usage data is batched and reported every few seconds.
- The proxy automatically retries failed requests up to several times.
Critical Behavior: If https://usage-tracker.assemblyai.com becomes unreachable and all retry attempts fail (after 5-60 minutes), the license-and-usage-proxy service will terminate itself. This is a fail-safe mechanism to ensure usage data integrity. Your service orchestrator should be configured to automatically replace the container with a new one.
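With Docker Compose as the orchestrator, the simplest way to satisfy the replace-on-termination requirement is a restart policy on the proxy service. A sketch (adjust to your own compose file; only the service name comes from this document):

```yaml
services:
  license-and-usage-proxy:
    restart: always   # re-create the container if it terminates after exhausting retries
```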
Monitoring Recommendations:
- Monitor the proxy's logs for warnings about failed usage reporting attempts.
- Set up alerts for proxy restarts, which may indicate persistent connectivity issues.
- If the in-memory usage queue size exceeds 1000 items, the proxy will log a warning suggesting upscaling.
# Container status
docker compose ps
# Resource usage
docker stats
# Check nginx configurations
docker compose exec streaming-asr-lb nginx -t
# Restart specific service (universal stack)
docker compose restart streaming-api
docker compose restart streaming-asr-english
docker compose restart streaming-asr-multilang
# Restart specific service (U3 Pro stack)
docker compose -f docker-compose.u3pro.yml restart streaming-asr-u3pro
- Deployment Strategy: We recommend Blue/Green deployments to avoid disrupting ongoing sessions. Once you have fully shifted traffic to the new color, wait at least 3 hours (the maximum session duration) before shutting down the old color so that no sessions are disrupted.
- Resource Allocation: We recommend allocating 1 CPU per container with at least 2GB of RAM for better hardware utilization. For example, it's better to have 4 containers with 1 CPU and 2GB RAM each rather than 1 container with 4 CPU and 8GB RAM.
- Autoscaling: We recommend setting up autoscaling based on the number of active sessions. A container with 1 CPU can generally handle around 32 concurrent sessions.
- Monitoring: Always monitor the logs during deployment to catch any potential issues early.
- Dependencies: For successful startup, the service depends on the license-and-usage-proxy service being up and running.
- Configuration: You can enable features like TLS encryption and structured logging via environment variables.
- Health Checks: Use the healthcheck command provided in the docker-compose.yml to monitor container health.
- Usage Reporting Behavior: After each session completes, the streaming-api reports usage to the license-and-usage-proxy with automatic retries on failure. Monitor the logs for any messages at warning level or above.
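The sizing guidance above (1 CPU / 2 GB per container, roughly 32 concurrent sessions each) translates into a simple autoscaling target. A sketch (the per-container figure is the rough number quoted above; treat it as a starting point, not a guarantee):

```python
import math

SESSIONS_PER_CONTAINER = 32  # rough figure for a 1-CPU streaming-api container

def containers_needed(active_sessions: int,
                      per_container: int = SESSIONS_PER_CONTAINER) -> int:
    """Minimum container count for the current session load (always >= 1)."""
    return max(1, math.ceil(active_sessions / per_container))

# containers_needed(100) -> 4
```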
- Deployment Strategy: Do gradual rollouts to ensure stability. Consider implementing monitoring and alerting for service restarts.
- Resource Allocation: We recommend allocating 1 CPU per container with at least 2GB of RAM for better hardware utilization. For example, it's better to have 4 containers with 1 CPU and 2GB RAM each rather than 1 container with 4 CPU and 8GB RAM.
- Monitoring: Always monitor logs during deployment to catch any potential issues early. You can set up an alert based on the responses of the /v1/status endpoint to alert you on any license issues. For usage-based billing, also monitor for usage reporting warnings and service restarts.
- Dependencies:
  - For successful startup, the service depends on having a valid license mounted on the container filesystem. To mount it, set the LICENSE_FILE_PATH environment variable to point to the license file path on the host machine.
  - For usage-based billing, the service also requires connectivity to https://usage-tracker.assemblyai.com at startup. If connectivity validation fails, the container will terminate. Ensure the USAGE_TRACKING_API_KEY environment variable is properly configured.
- Health Checks: Use the healthcheck command provided in the docker-compose.yml to monitor container health.
- Usage Reporting Resilience:
- Network connectivity to the https://usage-tracker.assemblyai.com endpoint must be reliable for production deployments with usage-based billing.
- Run at least a few containers behind a load balancer to ensure high availability.
The /v1/status endpoint provides real-time information about the license validation state:
Endpoint: GET /v1/status
Response Schema:
{
"state": "Ready | Connected | TrustBased | Failed",
"last_successful_checkin": "2025-01-01T12:00:00.000000Z",
"trust_expiration": "2025-01-05T12:00:00.000000Z"
}
State Descriptions:
- Ready: Initial state when the service starts, before any license validation has occurred.
- Connected: The last license validation check was successful.
- TrustBased: The last license validation check failed, but the request was within the trust window grace period, so services remain operational.
- Failed: The last license validation check failed and the trust window has expired. streaming-api containers will shut down and stop serving requests.
Fields:
- state: Current license validation state.
- last_successful_checkin: ISO 8601 timestamp of the last successful license validation (null if never successful).
- trust_expiration: ISO 8601 timestamp when the trust window expires (null if no successful validation yet).
Recommended Alerts:
- Alert when state transitions to TrustBased (indicates license validation issues).
- Critical alert when state is Failed (services will shut down).
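These alerting rules can be wired up by polling GET /v1/status and mapping the state field to an alert severity. A sketch (the state names are from the schema above; the severity labels and helper are illustrative):

```python
SEVERITY_BY_STATE = {
    "Ready": "ok",            # startup, no validation attempted yet
    "Connected": "ok",
    "TrustBased": "warning",  # validation failing, grace period active
    "Failed": "critical",     # trust window expired, services shut down
}

def alert_severity(status_response: dict) -> str:
    """Map a /v1/status response body to an alert severity."""
    return SEVERITY_BY_STATE.get(status_response.get("state"), "unknown")

# alert_severity({"state": "TrustBased"}) -> "warning"
```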
- Deployment Strategy: Do gradual rollouts to ensure stability. Both Blue/Green and rolling deployments are good strategies, as the streaming-api can reconnect to a new streaming-asr container if a persistent connection gets disrupted with minimal state loss.
- Hardware Requirements: The services can run on NVIDIA T4 or newer GPUs. We recommend allocating at least 4 CPU and 16GB of RAM per container.
- Autoscaling: You can set up autoscaling based on the number of active sessions. A container with recommended hardware can generally handle up to 28 concurrent sessions.
- Monitoring: Always monitor logs during deployment to catch any potential issues early.
- Health Checks: Use the healthcheck command provided in the Docker Compose file to monitor container health.
- Deployment Strategy: Do gradual rollouts to ensure stability. Both Blue/Green and rolling deployments are good strategies, as the streaming-api can reconnect to a new streaming-asr-u3-pro container if a persistent connection gets disrupted with minimal state loss.
- Hardware Requirements: NVIDIA L4 / A10 / A100 / L40S / H100 or equivalent with at least 24 GB VRAM. The container also needs ~14 GB of disk for the bundled model weights.
- Autoscaling: You can set up autoscaling based on the number of active sessions. A container using L40S GPU can generally handle up to 40 concurrent sessions.
- Monitoring: Always monitor logs during deployment to catch any potential issues early.
- Health Checks: Use the healthcheck command provided in the Docker Compose file to monitor container health.
This release introduces the U3 Pro self-hosted stack (docker-compose.u3pro.yml), which serves the U3 Pro streaming model. U3 Pro delivers significant improvements over the universal English model on complex entities, short utterances, and end-of-turn (EOT) latency, and is targeted at voice agent scenarios.
Hardware: NVIDIA L4 / A10 / A100 / L40S / H100 (24 GB+ VRAM).
Highlights of U3 Pro behavior delivered with this release:
- New transcription prompt ("Transcribe verbatim with standard punctuation. Include filler words and incomplete utterances.") — 22% reduction in voice-agent hallucinations, 10% WER and 29% short-utterance error-rate reduction on voice-agent traffic, 5% improvement on medical, and improved EP F1.
- Continuous partials during long turns — partials are emitted incrementally instead of being delayed; turns now stitch up to 60s instead of hard-cutting at 16s/32s.
- Early partial at 750ms of detected speech for faster UI feedback.
- continuous_partials query parameter — clients can opt into continuous partials during long turns.
- Structured logging — both the U3 Pro ASR server and the universal ASR server now honor USE_STRUCTURED_LOGGING, matching the streaming-api behavior.
- Various logging and metrics improvements across the streaming-api and ASR services.
- Bug fixes and stability improvements.
A new English model is released, which produces already-formatted outputs directly and delivers large quality gains on digits, telephony, medical, and CI segments:
- 34% improvement on digit sequence error rate (DSER)
- 17% improvement on telephony WER
- 12% average improvement on medical WER
- 10% average improvement on CI segments WER
- ~2.4% absolute F1 score improvement on keyterms prompting
- ~70% absolute improvement in timestamp accuracy — resolves overlapping and zero-duration word issues
- Error and Warning WebSocket message types — Dedicated message types that let clients distinguish actionable errors from non-fatal warnings without relying on close codes.
- Configuration echoed in SessionBegins — The SessionBegins message now includes the resolved session configuration so clients can verify applied settings.
- Explicit speech-model selection — Clients explicitly select the speech model at session start.
- More specific WebSocket close codes for session termination scenarios, making client-side error handling more precise.
- Improved word_finalized events — All word finalizations are emitted (not only the last word of a turn).
- Various logging, metrics, and observability improvements across the streaming-api and ASR services.
- Bug fixes and stability improvements.
Major improvements to short utterance handling and hallucination reduction:
- 100% reduction in hallucinations
- 12.8% improvement on short utterances — better performance for voice agent use cases
- 7.39% improvement on digit sequence error rate
- 1.75% improvement on proper nouns
- 0.46% improvement on CI segments
- 0.39% improvement on accented speech
- Context biasing support - Customers can now use context biasing (model-based biasing) with the multilingual model
- Increased concurrent session handling per container, leading to reduced deployment costs
- Improved observability for the license-and-usage-proxy service
- Various bug fixes and stability improvements