feat: expressiveness mode, stateless Instructions, structured LLM output (#5635)
theomonnom wants to merge 12 commits into main
Conversation
- Add expressiveness flag (Agent + AgentSession) that auto-injects TTS markup instructions and speaker context into LLM system messages
- Rework Instructions from str subclass to stateless class with common/audio/text fields. No Pydantic dependency, no runtime state.
- Add AgentInstructions with expressiveness templates; WorkflowInstructions replaces InstructionParts
- Add TTS Markup inner class (llm_instructions + to_text) with shared _provider_format.py for Cartesia/ElevenLabs
- Add RecognizeStream.context + SpeakerContext protocol for STT metadata
- Privatize AudioRecognition, expose only stt_context
- Add llm_output_format class-level attribute for structured LLM output with streaming JSON partial parsing
- Add llm.Response annotation and ChatMessage.llm_output field
- Validate that all llm_output_format fields have defaults at class definition
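The last two items (structured output whose fields are validated to have defaults when the class is defined) can be sketched as follows. `OutputFormat` and `Reply` are hypothetical names for illustration, not the framework API:

```python
class OutputFormat:
    """Illustrative base class: rejects subclasses whose annotated fields
    lack defaults, so a partially parsed JSON stream can always be
    materialized into a usable object."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        for name in cls.__dict__.get("__annotations__", {}):
            if not hasattr(cls, name):
                raise TypeError(
                    f"llm_output_format field {name!r} must have a default"
                )


class Reply(OutputFormat):  # OK: every field has a default
    text: str = ""
    mood: str = "neutral"


try:
    class BadReply(OutputFormat):
        text: str  # no default -> rejected when the class is defined
except TypeError as err:
    print(err)
```

Because the check runs in `__init_subclass__`, a missing default fails at import time rather than mid-stream.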
BufferedTokenStream now holds back tokens that contain unclosed XML tags, preventing sentence splits inside markup like <spell>U.S.A.</spell>. The batch path in blingfire also merges split-tag sentences. Removes the unused TagAwareBuffer; the tokenizer now handles this natively. Fixes AgentConfigUpdate.instructions to use str instead of Instructions. Adds 21 regression tests for batch + streaming with all TTS tag patterns.
Covers batch + streaming paths with: self-closing tags, wrapping tags, periods in attributes and content, abbreviations (U.S.A., N.A.S.A.), phoneme with IPA/arpabet, chunk boundary splits, char-by-char streaming, unicode (French, Chinese, emoji), mixed tags, and a realistic multi-sentence conversation.
…cised Blingfire doesn't split tiny fragments. Tests now use realistic multi-sentence content inside tags so splits actually trigger and the XML-aware merge is verified.
bea7e82 to c36f944
```diff
-if text_transforms:
-    input = _apply_text_transforms(input, text_transforms)
+# text transforms only apply to plain text mode (no structured output)
+input = _apply_text_transforms(input, text_transforms)  # type: ignore[arg-type]
```
🔴 Text transforms crash at runtime when llm_output_format sends BaseModel objects through the TTS pipeline
When llm_output_format is set on an Agent, _llm_inference_task (generation.py:219-225) sends BaseModel objects through text_ch. These flow into _tts_inference_task where _apply_text_transforms is applied unconditionally at line 325. The default text transforms (filter_markdown and filter_emoji) perform string operations like buffer += chunk (filters.py:103) and EMOJI_PATTERN.sub("", chunk) (filters.py:156) that will raise TypeError when chunk is a BaseModel instead of str.
Since DEFAULT_TTS_TEXT_TRANSFORMS = ["filter_markdown", "filter_emoji"] is always active by default, any Agent using llm_output_format will crash at runtime unless the user explicitly sets tts_text_transforms=None. The comment at line 324 acknowledges the incompatibility but no guard is implemented.
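Until a real guard lands, its shape could look like this (a sketch only; the function name and transform signatures are assumptions, and the actual pipeline transforms are streaming):

```python
from typing import Any, Callable


def apply_text_transforms_safely(
    chunk: Any, transforms: list[Callable[[str], str]]
) -> Any:
    # structured-output chunks (e.g. pydantic BaseModel instances) pass
    # through untouched; string transforms only ever see str
    if not isinstance(chunk, str):
        return chunk
    for transform in transforms:
        chunk = transform(chunk)
    return chunk
```

With a check like this, `filter_markdown` and `filter_emoji` would never receive a `BaseModel` even when `llm_output_format` is set.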
```diff
 current_span.set_attribute(trace_types.ATTR_SPEECH_ID, speech_handle.id)
 if instructions is not None:
-    current_span.set_attribute(trace_types.ATTR_INSTRUCTIONS, instructions)
+    current_span.set_attribute(trace_types.ATTR_INSTRUCTIONS, str(instructions))
```
this only adds the common part to the trace but not the version used for this turn?
Fixed — trace now shows the modality-resolved text, not just the common part.
```diff
-chat_ctx.add_message(role="system", content=[instructions])
+# re-resolve instructions for the current turn's modality
+turn_modality = speech_handle.input_details.modality
+turn_instructions = instructions if instructions is not None else self._agent.instructions
```
is it expected that the turn instructions replace the original instructions entirely?
…failures

- ChatContext only stores str, never Instructions objects
- Per-turn modality resolution only when Instructions has audio/text variants
- Plain str instructions pass through unchanged (no re-resolution)
- Revert unintended fake_llm changes
- Fix add_message to resolve Instructions to str
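The first three bullets amount to a small resolution step; a sketch with stand-in classes (not the framework's actual code):

```python
class StubInstructions:
    # stand-in for the framework's Instructions class (assumption)
    def __init__(self, common: str, audio: str = "", text: str = "") -> None:
        self.common, self.audio, self.text = common, audio, text

    def render(self, modality: str) -> str:
        # combine the common part with the modality-specific variant
        extra = self.audio if modality == "audio" else self.text
        return f"{self.common}\n{extra}".strip()


def resolve_instructions(instructions, modality: str) -> str:
    # plain str passes through unchanged; Instructions objects resolve
    # per the current turn's modality, so ChatContext only ever stores str
    if isinstance(instructions, str):
        return instructions
    return instructions.render(modality)
```

The key invariant is that whatever reaches `ChatContext` has already been flattened to `str`.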
```diff
 # If the start and end indices are not available, we attempt to locate the token within the text using str.find.  # noqa: E501
 TokenizeCallable = Callable[[str], list[str] | list[tuple[str, int, int]]]

+_XML_TAG_RE = re.compile(r"<(/?)(\w+)[^>]*(/?)\s*>")
```
🟡 XML tag regex fails to detect self-closing tags with attributes due to greedy [^>] consuming the /*
The regex `_XML_TAG_RE = re.compile(r"<(/?)(\w+)[^>]*(/?)\s*>")` at token_stream.py:14 uses `[^>]*`, which greedily matches all characters except `>`, including the `/` that precedes `>` in self-closing tags. For any self-closing tag with attributes (e.g. `<emotion value="happy"/>`, `<speed ratio="1.5"/>`, `<break time="1s"/>`), group(3) is always `""` instead of `"/"`, so is_self_closing is always False.
This causes _has_unclosed_xml_tags() to incorrectly return True for text containing self-closing TTS markup tags, making both the batch tokenizer (blingfire.py:49) and the streaming tokenizer (token_stream.py:82) overly conservative — they merge sentences that don't need merging, increasing TTS latency by sending larger text chunks.
```diff
-_XML_TAG_RE = re.compile(r"<(/?)(\w+)[^>]*(/?)\s*>")
+_XML_TAG_RE = re.compile(r"<(/?)(\w+)[^>]*?(/?)\s*>")
```
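The difference is easy to verify: with the greedy `[^>]*`, group 3 never captures the trailing slash of a self-closing tag that has attributes, while the lazy `[^>]*?` leaves it for the capture group.

```python
import re

greedy = re.compile(r"<(/?)(\w+)[^>]*(/?)\s*>")
lazy = re.compile(r"<(/?)(\w+)[^>]*?(/?)\s*>")

tag = '<emotion value="happy"/>'

# greedy [^>]* consumes the "/", so the self-closing group stays empty
assert greedy.match(tag).group(3) == ""

# lazy [^>]*? stops before the "/", so it is captured
assert lazy.match(tag).group(3) == "/"

# plain opening and closing tags behave the same under both patterns
assert lazy.match("<spell>").group(3) == ""
assert lazy.match("</spell>").group(1) == "/"
```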
```python
Instructions("You are a helpful assistant.")

@property
def audio(self) -> str:
```
this is breaking? IMO we should keep this just as a wrapper.. it's much easier to write instructions.text instead of instructions.as_modality('text')
Instructions was supposed to be in beta, I'm not sure if anybody is using it
```python
    rtc.EventEmitter[Literal["metrics_collected", "error"] | TEvent],
    Generic[TEvent],
):
    class Markup:
```
nit, not sure about the name. it wraps a TTS. why not expose these on TTS itself?
It is the case tho?
tts.markup?
```python
    llm: NotGivenOr[llm.LLM | llm.RealtimeModel | LLMModels | str | None] = NOT_GIVEN,
    tts: NotGivenOr[tts.TTS | TTSModels | str | None] = NOT_GIVEN,
    mcp_servers: NotGivenOr[list[mcp.MCPServer] | None] = NOT_GIVEN,
    expressiveness: NotGivenOr[bool] = NOT_GIVEN,
```
if a user wanted to override how they prompt the LLM for expressiveness. where should they do it?
should this be a bool | ExpressivenessOptions?
They do it inside the new AgentInstructions class
```python
    return self._interruption_detection

@property
def expressiveness(self) -> NotGivenOr[bool]:
```
if we want options, then it'd be better to always return options vs a bool
```python
    str(instructions) if not isinstance(instructions, str) else instructions
)

class _SafeFormatter(string.Formatter):
```
…with render(), improved provider prompts

- ExpressivenessOptions moved to agent_session.py as a TypedDict with DEFAULT_EXPRESSIVENESS_OPTIONS
- Instructions: removed format/as_modality/__add__, added render(modality, data) returning str
- Instructions: added resolve_template() static method for workflow modality-aware composition
- safe_render utility in utils/misc.py with nested dict → SimpleNamespace conversion and error logging with full dotted paths
- Template data uses explicit dicts with proper namespaces (tts.markup.llm_instructions, audio_recognition.stt_context.emotion)
- AudioRecognition.llm_instructions() method matching the tts.markup.llm_instructions() API
- Cartesia prompt: complete 62-emotion list, examples, XML format explained
- ElevenLabs prompt: normalization rules, SSML tags, examples
- Removed _concat_optional, _safe_format, AgentInstructions
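The commit's `safe_render` could look roughly like this (an illustrative implementation; only the names `safe_render`, `_SafeFormatter`, and the dict → SimpleNamespace idea come from the commit message):

```python
import logging
import string
from types import SimpleNamespace
from typing import Any


def _to_namespace(value: Any) -> Any:
    # recursively convert dicts to attribute-style namespaces so templates
    # can use dotted paths like {tts.markup.llm_instructions}
    if isinstance(value, dict):
        return SimpleNamespace(**{k: _to_namespace(v) for k, v in value.items()})
    return value


class _SafeFormatter(string.Formatter):
    # missing fields render as "" and are logged with their full dotted path
    def get_field(self, field_name: str, args: Any, kwargs: Any) -> Any:
        try:
            return super().get_field(field_name, args, kwargs)
        except (AttributeError, KeyError, IndexError):
            logging.warning("missing template field: %s", field_name)
            return "", field_name


def safe_render(template: str, data: dict[str, Any]) -> str:
    namespaces = {k: _to_namespace(v) for k, v in data.items()}
    return _SafeFormatter().vformat(template, (), namespaces)
```

With this shape, a typo in a template path logs the full dotted name instead of raising mid-session.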
Devin Review found 1 new potential issue.
🐛 1 issue in files not directly in the diff
🐛 AgentConfigUpdate raises ValidationError when Agent.instructions is an Instructions object (livekit-agents/livekit/agents/voice/agent_activity.py:771)
At agent_activity.py:770-771, self._agent.instructions (typed as str | Instructions) is passed directly to llm.AgentConfigUpdate(instructions=...), whose field is typed str | None. The old Instructions class was a str subclass and had a custom __get_pydantic_core_schema__, so Pydantic accepted it. The refactored Instructions is a plain class with neither, so Pydantic v2 rejects it with ValidationError: Input should be a valid string. This crashes any agent created with Agent(instructions=Instructions(...)) when the activity starts.
View 12 additional findings in Devin Review.
…Labs v3)

<expression value="..."/> is the XML bridge for providers that use [] brackets natively. The LLM always generates XML; plugins convert it to the native format before sending to the API.

- Cartesia: native XML, no conversion needed
- ElevenLabs v2: native SSML, no conversion
- ElevenLabs v3: <expression> → [laughs], [whispers], etc.
- Inworld TTS 2: <expression> → [say excitedly], [laugh], etc.

Added TTS.Markup.convert() method, convert_expression_tags() and strip_bracket_tags() helpers, and complete provider prompts with examples.
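The `<expression>` → bracket conversion can be approximated with a single substitution (a sketch; the real `convert_expression_tags()` may handle more cases):

```python
import re

_EXPRESSION_RE = re.compile(r'<expression\s+value="([^"]+)"\s*/>')


def convert_expression_tags(text: str) -> str:
    # the LLM always emits XML; providers that use bracket tags
    # natively (e.g. ElevenLabs v3) get [value] instead
    return _EXPRESSION_RE.sub(r"[\1]", text)


converted = convert_expression_tags('<expression value="laughs"/> That is great!')
# converted == "[laughs] That is great!"
```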
…ting self._markup Base TTS.__init__ calls self.Markup(self) automatically. Plugins just define their Markup inner class — no manual self._markup assignment needed.
Summary

- Reworked `Instructions` from `str` subclass to plain class with `common`/`audio`/`text` fields
- `RecognizeStream.context` + `SpeakerContext` protocol; only `stt_context` exposed
- `llm_output_format` with `llm.Response` annotation, streaming JSON partial parsing
- `TTS.Markup` inner class, shared `_provider_format.py` for Cartesia/ElevenLabs
- `BufferedTokenStream` holds back tokens with unclosed XML tags (53 regression tests)
- `InstructionParts` removed

Expressiveness mode
The framework injects system messages telling the LLM about available TTS tags:
The LLM then uses markup naturally. Markup is stripped from transcripts and chat history:
LLM output: `<emotion value="sad"/> I understand how you feel.`
Sent to TTS: `<emotion value="sad"/> I understand how you feel.`
Transcript: `I understand how you feel.`
Chat history: `I understand how you feel.`

Custom templates and per-plugin overrides:
ElevenLabs example with normalization:
Stateless Instructions
Reworked from `str` subclass to plain class. No Pydantic, no runtime state.

Hierarchy: `Instructions` → `AgentInstructions` → `WorkflowInstructions`. `InstructionParts` removed, replaced by `WorkflowInstructions(AgentInstructions)`.

STT speaker context + AudioRecognition
STT plugins set metadata on their stream. Accessible anywhere on the Agent:
`AudioRecognition` is now a public class but all fields and methods are private; only `stt_context` is exposed.

Structured LLM output
All fields must have defaults, validated at class definition via `__init_subclass__`. The LLM is configured for structured output; JSON is streamed and partially parsed via `pydantic_core.from_json(allow_partial=True)`. `tts_node` receives `BaseModel` chunks (explicit opt-in: existing custom `tts_node` implementations that only handle `str` are unaffected unless `llm_output_format` is set). Parsed output is stored on `ChatMessage.llm_output`.

XML-aware sentence tokenizer
`BufferedTokenStream` now holds back tokens that contain unclosed XML tags, preventing sentence splits inside markup like `<spell>U.S.A.</spell>`. The Blingfire batch path also merges split-tag sentences. 53 regression tests cover self-closing tags, wrapping tags, decimals in attributes, nested tags, chunk boundary splits, unicode, and a realistic multi-sentence conversation.
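The hold-back decision reduces to an unclosed-tag check over the buffered text; a minimal sketch (the real `_has_unclosed_xml_tags()` and regex may differ):

```python
import re

# lazy [^>]*? so the trailing "/" of self-closing tags is captured
_XML_TAG_RE = re.compile(r"<(/?)(\w+)[^>]*?(/?)\s*>")


def has_unclosed_xml_tags(text: str) -> bool:
    depth = 0
    for m in _XML_TAG_RE.finditer(text):
        is_closing = m.group(1) == "/"
        is_self_closing = m.group(3) == "/"
        if is_self_closing:
            continue  # <emotion value="sad"/> never blocks a split
        depth += -1 if is_closing else 1
    return depth > 0


print(has_unclosed_xml_tags("He said <spell>U.S.A."))          # hold back
print(has_unclosed_xml_tags("He said <spell>U.S.A.</spell>"))  # safe to emit
```

While this returns True the tokenizer keeps buffering instead of emitting a sentence, so a period inside markup can never produce a split.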