Add cascade latency levers and streaming robustness #140

Open
sauhardjain wants to merge 2 commits into ServiceNow:main from sauhardjain:pr/cascade-orchestration
@sauhardjain sauhardjain commented Jun 8, 2026

Summary

This PR adds opt-in latency controls for cascade (STT -> LLM -> TTS) systems and fixes audit/history edge cases that can distort benchmark results. The default cascade configuration remains unchanged unless these flags are enabled.

Motivation

The cascade harness can add avoidable latency when it waits for a full LLM response before sending text to TTS, or when tool calls leave the caller in silence. It also does not expose the parallel tool-call setting used by the ElevenAgents cascade assistant configuration, which makes cross-harness comparisons harder to interpret.

Interrupted and tool-heavy turns need careful handling as well. An interrupted transfer can leave an assistant tool call without a matching tool result, and streamed speech can reach TTS before the LLM call is cancelled or fails. This PR keeps the next model request valid and keeps the audit log aligned with emitted audio.

Changes

  • Adds EVA_MODEL__PRE_TOOL_SPEECH to let the model produce a brief lead-in before tool calls. The lead-in is model-generated, not templated filler.
  • Adds EVA_MODEL__LLM_STREAMING for sentence-level Chat Completions streaming to TTS. On Responses API deployments, the harness logs a warning and uses the existing non-streaming path.
  • Adds EVA_MODEL__PARALLEL_TOOL_CALLS to forward the provider setting when tools are present. Leaving it unset preserves provider defaults; setting it to false matches the ElevenAgents assistant config.
  • Repairs orphaned assistant tool calls before replaying history to the model, while ignoring malformed non-string tool result IDs.
  • Records any streamed text that was already emitted if a stream is cancelled or fails mid-turn.
  • Updates transcript processing so fully spoken streamed responses can span multiple TTS segments.
  • Documents the new flags in .env.example and the experiment setup guide.
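To illustrate the sentence-level streaming described above, here is a minimal sketch of how a token stream could be cut into sentences before each one is handed to TTS. It is not the PR's complete_stream() implementation: the function name stream_sentences and the boundary rule (terminal punctuation followed by whitespace) are assumptions made for the example.

```python
import re
from typing import Iterable, Iterator

# Assumed boundary rule: '.', '!' or '?' followed by whitespace ends a sentence.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def stream_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as the token stream has completed it."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence; the last part
        # may still be growing, so it stays in the buffer.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    # Flush whatever remains when the stream ends.
    if buffer.strip():
        yield buffer.strip()
```

With this shape, the first sentence can be sent to TTS while the model is still generating the rest of the response, which is where the latency saving would come from. A punctuation-only rule will also split on abbreviations such as "Dr. Smith", so a real implementation may need a more careful boundary check.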

Reviewing

Run one cascade record with the new flags unset and confirm that default behavior is unchanged.

Then run a tool-using cascade record with EVA_MODEL__PRE_TOOL_SPEECH=auto, EVA_MODEL__LLM_STREAMING=true, and EVA_MODEL__PARALLEL_TOOL_CALLS=false. Check that speech can begin before the final LLM response is complete, that the audit log records the assistant response, and that interrupted streamed speech is preserved instead of replaced by a generic error.
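For reviewers checking the EVA_MODEL__PARALLEL_TOOL_CALLS behavior, the following sketch shows one way a tri-state setting could be parsed and forwarded only when tools are present. The helper names parse_tri_state and build_request_kwargs and the accepted string values are hypothetical; only the parallel_tool_calls request field and the unset/true/false semantics come from the PR description.

```python
from typing import Any, Optional


def parse_tri_state(raw: Optional[str]) -> Optional[bool]:
    """Map an env string to True, False, or None (unset keeps provider defaults)."""
    if raw is None or raw.strip() == "":
        return None
    value = raw.strip().lower()
    if value in {"true", "1", "yes"}:  # assumed spellings
        return True
    if value in {"false", "0", "no"}:  # assumed spellings
        return False
    raise ValueError(f"expected true/false or unset, got {raw!r}")


def build_request_kwargs(
    tools: list[dict[str, Any]], parallel_tool_calls: Optional[bool]
) -> dict[str, Any]:
    """Forward parallel_tool_calls only when tools exist and the flag is set."""
    kwargs: dict[str, Any] = {}
    if tools:
        kwargs["tools"] = tools
        if parallel_tool_calls is not None:
            kwargs["parallel_tool_calls"] = parallel_tool_calls
    return kwargs
```

Omitting the key when the flag is unset, rather than sending a default value, is what keeps the request identical to the pre-PR request in the unset case.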

- 'cartesia' now maps to Cartesia's latest ink-2 (CartesiaTurnsSTTService): server-driven
  endpointing, so ModelConfig auto-forces external turn strategies + VAD off. The older
  ink-whisper is preserved as 'cartesia-multilingual' (standard VAD / smart-turn).
- Cartesia STT declares the 16 kHz pipeline input rate (STT_INPUT_SAMPLE_RATE), not SAMPLE_RATE
  (24 kHz, the TTS output rate). The base STTService doesn't resample, so 24 kHz mislabels 16 kHz
  audio (~1.5x fast/pitched) and garbles spelled letters / confirmation codes. Other STT
  providers are unchanged.
- pipecat_server logs ink-2 eager-end / resume / committed-end diagnostics; only committed turn
  boundaries drive aggregation.
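The "~1.5x fast/pitched" figure in the sample-rate note above follows from the ratio of the declared rate to the real one. A small worked example, assuming the two rates stated in the commit message (the helper names are illustrative only):

```python
STT_INPUT_SAMPLE_RATE = 16_000  # Hz, pipeline input rate per the commit message
SAMPLE_RATE = 24_000            # Hz, TTS output rate per the commit message


def playback_speedup(actual_rate_hz: int, declared_rate_hz: int) -> float:
    """Speed factor a receiver perceives when it trusts the declared rate."""
    return declared_rate_hz / actual_rate_hz


def perceived_duration_s(num_samples: int, declared_rate_hz: int) -> float:
    """Duration the receiver assigns to a buffer of num_samples samples."""
    return num_samples / declared_rate_hz


# One second of real 16 kHz audio, mislabeled as 24 kHz.
speedup = playback_speedup(STT_INPUT_SAMPLE_RATE, SAMPLE_RATE)
duration = perceived_duration_s(STT_INPUT_SAMPLE_RATE, SAMPLE_RATE)
```

Here speedup is 1.5 and one second of audio is interpreted as roughly 0.67 seconds, which is consistent with the garbled spelled letters and confirmation codes the commit describes.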
…ming robustness

CASCADE-only; every lever defaults off, so the canonical config is unchanged unless set.

- pre_tool_speech {off, auto}: 'auto' adds a write-aware lead-in directive (disclose cost +
  confirm before a write action); no deterministic fillers.
- llm_streaming: complete_stream() streams Chat-Completions tokens to TTS sentence-by-sentence;
  Responses-API deployments fall back to non-streaming (one warning).
- parallel_tool_calls: tri-state ModelConfig knob, forwarded only when tools are present.
- _pair_orphaned_tool_calls: pair an assistant tool_call left unanswered by transfer_to_agent or
  a barge-in, so Responses-API models don't 400 ("No tool output found") on the next turn.
- _record_partial_streamed_output: record already-spoken streamed text on interruption/failure
  rather than dropping it or speaking a generic error over it.
- truncate_to_spoken: match across TTS segments so streaming doesn't truncate scored transcripts.
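The orphaned tool-call repair can be pictured with the sketch below. It is an illustration only, not the PR's _pair_orphaned_tool_calls: it assumes Chat Completions style message dicts, and the placeholder wording and the function name pair_orphaned_tool_calls are made up for the example.

```python
from typing import Any

# Hypothetical placeholder text for a tool call that never produced a result.
PLACEHOLDER = "Tool call was interrupted before a result was produced."


def pair_orphaned_tool_calls(messages: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return a copy of the history where every assistant tool call has a result."""
    # Collect IDs that already have a tool result; non-string IDs are ignored.
    answered = {
        m["tool_call_id"]
        for m in messages
        if m.get("role") == "tool" and isinstance(m.get("tool_call_id"), str)
    }
    repaired: list[dict[str, Any]] = []
    for message in messages:
        repaired.append(message)
        if message.get("role") != "assistant":
            continue
        for call in message.get("tool_calls") or []:
            call_id = call.get("id")
            if isinstance(call_id, str) and call_id not in answered:
                # Insert a synthetic result directly after the assistant message.
                repaired.append(
                    {"role": "tool", "tool_call_id": call_id, "content": PLACEHOLDER}
                )
                answered.add(call_id)
    return repaired
```

Running the repair on an already-repaired history adds nothing, so it should be safe to apply before every model request.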
@sauhardjain (Author)

cc @fanny-riols as discussed last week!
