Add cascade latency levers and streaming robustness#140
Open
sauhardjain wants to merge 2 commits into
Open
Conversation
- 'cartesia' now maps to Cartesia's latest ink-2 (CartesiaTurnsSTTService): server-driven endpointing, so ModelConfig auto-forces external turn strategies + VAD off. The older ink-whisper is preserved as 'cartesia-multilingual' (standard VAD / smart-turn). - Cartesia STT declares the 16 kHz pipeline input rate (STT_INPUT_SAMPLE_RATE), not SAMPLE_RATE (24 kHz, the TTS output rate). The base STTService doesn't resample, so 24 kHz mislabels 16 kHz audio (~1.5x fast/pitched) and garbles spelled letters / confirmation codes. Other STT providers are unchanged. - pipecat_server logs ink-2 eager-end / resume / committed-end diagnostics; only committed turn boundaries drive aggregation.
…ming robustness
CASCADE-only; every lever defaults off, so the canonical config is unchanged unless set.
- pre_tool_speech {off, auto}: 'auto' adds a write-aware lead-in directive (disclose cost +
confirm before a write action); no deterministic fillers.
- llm_streaming: complete_stream() streams Chat-Completions tokens to TTS sentence-by-sentence;
Responses-API deployments fall back to non-streaming (one warning).
- parallel_tool_calls: tri-state ModelConfig knob, forwarded only when tools are present.
- _pair_orphaned_tool_calls: pair an assistant tool_call left unanswered by transfer_to_agent or
a barge-in, so Responses-API models don't 400 ("No tool output found") on the next turn.
- _record_partial_streamed_output: record already-spoken streamed text on interruption/failure
rather than dropping it or speaking a generic error over it.
- truncate_to_spoken: match across TTS segments so streaming doesn't truncate scored transcripts.
Author
|
cc @fanny-riols as discussed last week! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR adds opt-in latency controls for cascade (STT -> LLM -> TTS) systems and fixes audit/history edge cases that can distort benchmark results. The default cascade configuration remains unchanged unless these flags are enabled.
Motivation
The cascade harness can add avoidable latency when it waits for a full LLM response before sending text to TTS, or when tool calls leave the caller in silence. It also does not expose the parallel tool-call setting used by the ElevenAgents cascade assistant configuration, which makes cross-harness comparisons harder to interpret.
Interrupted and tool-heavy turns need careful handling as well. An interrupted transfer can leave an assistant tool call without a matching tool result, and streamed speech can reach TTS before the LLM call is cancelled or fails. This PR keeps the next model request valid and keeps the audit log aligned with emitted audio.
Changes
EVA_MODEL__PRE_TOOL_SPEECHto let the model produce a brief lead-in before tool calls. The lead-in is model-generated, not templated filler.EVA_MODEL__LLM_STREAMINGfor sentence-level Chat Completions streaming to TTS. Responses API deployments warn and use the existing non-streaming path.EVA_MODEL__PARALLEL_TOOL_CALLSto forward the provider setting when tools are present. Leaving it unset preserves provider defaults; setting it tofalsematches the ElevenAgents assistant config..env.exampleand the experiment setup guide.Reviewing
Run one cascade record with the new flags unset and confirm that default behavior is unchanged.
Then run a tool-using cascade record with
EVA_MODEL__PRE_TOOL_SPEECH=auto,EVA_MODEL__LLM_STREAMING=true, andEVA_MODEL__PARALLEL_TOOL_CALLS=false. Check that speech can begin before the final LLM response is complete, that the audit log records the assistant response, and that interrupted streamed speech is preserved instead of replaced by a generic error.