feat: expressiveness mode, stateless Instructions, structured LLM output (#5635)
theomonnom wants to merge 12 commits into main
Conversation
- Add expressiveness flag (Agent + AgentSession) that auto-injects TTS markup instructions and speaker context into LLM system messages
- Rework Instructions from str subclass to stateless class with common/audio/text fields. No Pydantic dependency, no runtime state.
- Add AgentInstructions with expressiveness templates; WorkflowInstructions replaces InstructionParts
- Add TTS Markup inner class (llm_instructions + to_text) with shared _provider_format.py for Cartesia/ElevenLabs
- Add RecognizeStream.context + SpeakerContext protocol for STT metadata
- Privatize AudioRecognition, expose only stt_context
- Add llm_output_format class-level attribute for structured LLM output with streaming JSON partial parsing
- Add llm.Response annotation and ChatMessage.llm_output field
- Validate that all llm_output_format fields have defaults at class definition
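The last two items (structured output whose fields are validated to have defaults when the class is defined) can be sketched as follows. `OutputFormat` and `Reply` are hypothetical names for illustration, not the framework API:

```python
class OutputFormat:
    """Illustrative base class: rejects subclasses whose annotated fields
    lack defaults, so a partially parsed JSON stream can always be
    materialized into a usable object."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        for name in cls.__dict__.get("__annotations__", {}):
            if not hasattr(cls, name):
                raise TypeError(
                    f"llm_output_format field {name!r} must have a default"
                )


class Reply(OutputFormat):  # OK: every field has a default
    text: str = ""
    mood: str = "neutral"


try:
    class BadReply(OutputFormat):
        text: str  # no default -> rejected when the class is defined
except TypeError as err:
    print(err)
```

Because the check runs in `__init_subclass__`, a missing default fails at import time rather than mid-stream.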
BufferedTokenStream now holds back tokens that contain unclosed XML tags, preventing sentence splits inside markup like <spell>U.S.A.</spell>. The batch path in blingfire also merges split-tag sentences. Removes the unused TagAwareBuffer; the tokenizer now handles this natively. Fixes AgentConfigUpdate.instructions to use str instead of Instructions. Adds 21 regression tests for batch + streaming with all TTS tag patterns.
Covers batch + streaming paths with: self-closing tags, wrapping tags, periods in attributes and content, abbreviations (U.S.A., N.A.S.A.), phoneme with IPA/arpabet, chunk boundary splits, char-by-char streaming, unicode (French, Chinese, emoji), mixed tags, and a realistic multi-sentence conversation.
…cised Blingfire doesn't split tiny fragments. Tests now use realistic multi-sentence content inside tags so splits actually trigger and the XML-aware merge is verified.
bea7e82 to c36f944
```diff
-if text_transforms:
-    input = _apply_text_transforms(input, text_transforms)
+# text transforms only apply to plain text mode (no structured output)
+input = _apply_text_transforms(input, text_transforms)  # type: ignore[arg-type]
```
🔴 Text transforms crash at runtime when llm_output_format sends BaseModel objects through the TTS pipeline
When llm_output_format is set on an Agent, _llm_inference_task (generation.py:219-225) sends BaseModel objects through text_ch. These flow into _tts_inference_task where _apply_text_transforms is applied unconditionally at line 325. The default text transforms (filter_markdown and filter_emoji) perform string operations like buffer += chunk (filters.py:103) and EMOJI_PATTERN.sub("", chunk) (filters.py:156) that will raise TypeError when chunk is a BaseModel instead of str.
Since DEFAULT_TTS_TEXT_TRANSFORMS = ["filter_markdown", "filter_emoji"] is always active by default, any Agent using llm_output_format will crash at runtime unless the user explicitly sets tts_text_transforms=None. The comment at line 324 acknowledges the incompatibility but no guard is implemented.
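Until a real guard lands, its shape could look like this (a sketch only; the function name and transform signatures are assumptions, and the actual pipeline transforms are streaming):

```python
from typing import Any, Callable


def apply_text_transforms_safely(
    chunk: Any, transforms: list[Callable[[str], str]]
) -> Any:
    # structured-output chunks (e.g. pydantic BaseModel instances) pass
    # through untouched; string transforms only ever see str
    if not isinstance(chunk, str):
        return chunk
    for transform in transforms:
        chunk = transform(chunk)
    return chunk
```

With a check like this, `filter_markdown` and `filter_emoji` would never receive a `BaseModel` even when `llm_output_format` is set.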
```diff
 current_span.set_attribute(trace_types.ATTR_SPEECH_ID, speech_handle.id)
 if instructions is not None:
-    current_span.set_attribute(trace_types.ATTR_INSTRUCTIONS, instructions)
+    current_span.set_attribute(trace_types.ATTR_INSTRUCTIONS, str(instructions))
```
this only adds the common part to the trace but not the version used for this turn?
Fixed — trace now shows the modality-resolved text, not just the common part.
```diff
-chat_ctx.add_message(role="system", content=[instructions])
+# re-resolve instructions for the current turn's modality
+turn_modality = speech_handle.input_details.modality
+turn_instructions = instructions if instructions is not None else self._agent.instructions
```
is it expected that the turn instructions replace the original instructions entirely?
…failures

- ChatContext only stores str, never Instructions objects
- Per-turn modality resolution only when Instructions has audio/text variants
- Plain str instructions pass through unchanged (no re-resolution)
- Revert unintended fake_llm changes
- Fix add_message to resolve Instructions to str
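The first three bullets amount to a small resolution step; a sketch with stand-in classes (not the framework's actual code):

```python
class StubInstructions:
    # stand-in for the framework's Instructions class (assumption)
    def __init__(self, common: str, audio: str = "", text: str = "") -> None:
        self.common, self.audio, self.text = common, audio, text

    def render(self, modality: str) -> str:
        # combine the common part with the modality-specific variant
        extra = self.audio if modality == "audio" else self.text
        return f"{self.common}\n{extra}".strip()


def resolve_instructions(instructions, modality: str) -> str:
    # plain str passes through unchanged; Instructions objects resolve
    # per the current turn's modality, so ChatContext only ever stores str
    if isinstance(instructions, str):
        return instructions
    return instructions.render(modality)
```

The key invariant is that whatever reaches `ChatContext` has already been flattened to `str`.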
```diff
 # If the start and end indices are not available, we attempt to locate the token within the text using str.find.  # noqa: E501
 TokenizeCallable = Callable[[str], list[str] | list[tuple[str, int, int]]]

+_XML_TAG_RE = re.compile(r"<(/?)(\w+)[^>]*(/?)\s*>")
```
🟡 XML tag regex fails to detect self-closing tags with attributes due to greedy [^>] consuming the /*
The regex `_XML_TAG_RE = re.compile(r"<(/?)(\w+)[^>]*(/?)\s*>")` at token_stream.py:14 uses `[^>]*`, which greedily matches all characters except `>`, including the `/` that precedes `>` in self-closing tags. For any self-closing tag with attributes (e.g. `<emotion value="happy"/>`, `<speed ratio="1.5"/>`, `<break time="1s"/>`), group(3) is always `""` instead of `"/"`, so is_self_closing is always False.
This causes _has_unclosed_xml_tags() to incorrectly return True for text containing self-closing TTS markup tags, making both the batch tokenizer (blingfire.py:49) and the streaming tokenizer (token_stream.py:82) overly conservative — they merge sentences that don't need merging, increasing TTS latency by sending larger text chunks.
```diff
-_XML_TAG_RE = re.compile(r"<(/?)(\w+)[^>]*(/?)\s*>")
+_XML_TAG_RE = re.compile(r"<(/?)(\w+)[^>]*?(/?)\s*>")
```
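The difference is easy to verify: with the greedy `[^>]*`, group 3 never captures the trailing slash of a self-closing tag that has attributes, while the lazy `[^>]*?` leaves it for the capture group.

```python
import re

greedy = re.compile(r"<(/?)(\w+)[^>]*(/?)\s*>")
lazy = re.compile(r"<(/?)(\w+)[^>]*?(/?)\s*>")

tag = '<emotion value="happy"/>'

# greedy [^>]* consumes the "/", so the self-closing group stays empty
assert greedy.match(tag).group(3) == ""

# lazy [^>]*? stops before the "/", so it is captured
assert lazy.match(tag).group(3) == "/"

# plain opening and closing tags behave the same under both patterns
assert lazy.match("<spell>").group(3) == ""
assert lazy.match("</spell>").group(1) == "/"
```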
```python
Instructions("You are a helpful assistant.")

@property
def audio(self) -> str:
```
this is breaking? IMO we should keep this just as a wrapper.. it's much easier to write instructions.text instead of instructions.as_modality('text')
Instructions was supposed to be in beta, I'm not sure if anybody is using it
```python
    rtc.EventEmitter[Literal["metrics_collected", "error"] | TEvent],
    Generic[TEvent],
):
    class Markup:
```
nit, not sure about the name. it wraps a TTS. why not expose these on TTS itself?
It is the case tho?
tts.markup?
```python
    llm: NotGivenOr[llm.LLM | llm.RealtimeModel | LLMModels | str | None] = NOT_GIVEN,
    tts: NotGivenOr[tts.TTS | TTSModels | str | None] = NOT_GIVEN,
    mcp_servers: NotGivenOr[list[mcp.MCPServer] | None] = NOT_GIVEN,
    expressiveness: NotGivenOr[bool] = NOT_GIVEN,
```
if a user wanted to override how they prompt the LLM for expressiveness. where should they do it?
should this be a bool | ExpressivenessOptions?
They do it inside the new AgentInstructions class
```python
    return self._interruption_detection

@property
def expressiveness(self) -> NotGivenOr[bool]:
```
if we want options, then it'd be better to always return options vs a bool
```python
    str(instructions) if not isinstance(instructions, str) else instructions
)

class _SafeFormatter(string.Formatter):
```
…with render(), improved provider prompts

- ExpressivenessOptions moved to agent_session.py as a TypedDict with DEFAULT_EXPRESSIVENESS_OPTIONS
- Instructions: removed format/as_modality/__add__, added render(modality, data) returning str
- Instructions: added resolve_template() static method for workflow modality-aware composition
- safe_render utility in utils/misc.py with nested dict → SimpleNamespace conversion and error logging with full dotted paths
- Template data uses explicit dicts with proper namespaces (tts.markup.llm_instructions, audio_recognition.stt_context.emotion)
- AudioRecognition.llm_instructions() method matching the tts.markup.llm_instructions() API
- Cartesia prompt: complete 62-emotion list, examples, XML format explained
- ElevenLabs prompt: normalization rules, SSML tags, examples
- Removed _concat_optional, _safe_format, AgentInstructions
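The commit's `safe_render` could look roughly like this (an illustrative implementation; only the names `safe_render`, `_SafeFormatter`, and the dict → SimpleNamespace idea come from the commit message):

```python
import logging
import string
from types import SimpleNamespace
from typing import Any


def _to_namespace(value: Any) -> Any:
    # recursively convert dicts to attribute-style namespaces so templates
    # can use dotted paths like {tts.markup.llm_instructions}
    if isinstance(value, dict):
        return SimpleNamespace(**{k: _to_namespace(v) for k, v in value.items()})
    return value


class _SafeFormatter(string.Formatter):
    # missing fields render as "" and are logged with their full dotted path
    def get_field(self, field_name: str, args: Any, kwargs: Any) -> Any:
        try:
            return super().get_field(field_name, args, kwargs)
        except (AttributeError, KeyError, IndexError):
            logging.warning("missing template field: %s", field_name)
            return "", field_name


def safe_render(template: str, data: dict[str, Any]) -> str:
    namespaces = {k: _to_namespace(v) for k, v in data.items()}
    return _SafeFormatter().vformat(template, (), namespaces)
```

With this shape, a typo in a template path logs the full dotted name instead of raising mid-session.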
Devin Review found 1 new potential issue.
🐛 1 issue in files not directly in the diff
🐛 AgentConfigUpdate raises ValidationError when Agent.instructions is an Instructions object (livekit-agents/livekit/agents/voice/agent_activity.py:771)
At agent_activity.py:770-771, self._agent.instructions (typed as str | Instructions) is passed directly to llm.AgentConfigUpdate(instructions=...), whose field is typed str | None. The old Instructions class was a str subclass and had a custom __get_pydantic_core_schema__, so Pydantic accepted it. The refactored Instructions is a plain class with neither, so Pydantic v2 rejects it with ValidationError: Input should be a valid string. This crashes any agent created with Agent(instructions=Instructions(...)) when the activity starts.
View 12 additional findings in Devin Review.
…Labs v3)

<expression value="..."/> is the XML bridge for providers that use [] brackets natively. The LLM always generates XML; plugins convert it to the native format before sending to the API.

- Cartesia: native XML, no conversion needed
- ElevenLabs v2: native SSML, no conversion
- ElevenLabs v3: <expression> → [laughs], [whispers], etc.
- Inworld TTS 2: <expression> → [say excitedly], [laugh], etc.

Added TTS.Markup.convert() method, convert_expression_tags() and strip_bracket_tags() helpers, and complete provider prompts with examples.
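The `<expression>` → bracket conversion can be approximated with a single substitution (a sketch; the real `convert_expression_tags()` may handle more cases):

```python
import re

_EXPRESSION_RE = re.compile(r'<expression\s+value="([^"]+)"\s*/>')


def convert_expression_tags(text: str) -> str:
    # the LLM always emits XML; providers that use bracket tags
    # natively (e.g. ElevenLabs v3) get [value] instead
    return _EXPRESSION_RE.sub(r"[\1]", text)


converted = convert_expression_tags('<expression value="laughs"/> That is great!')
# converted == "[laughs] That is great!"
```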
…ting self._markup Base TTS.__init__ calls self.Markup(self) automatically. Plugins just define their Markup inner class — no manual self._markup assignment needed.
Summary

- Reworked `Instructions` from `str` subclass to plain class with `common`/`audio`/`text` fields
- `RecognizeStream.context` + `SpeakerContext` protocol; only `stt_context` exposed
- `llm_output_format` with `llm.Response` annotation, streaming JSON partial parsing
- `TTS.Markup` inner class, shared `_provider_format.py` for Cartesia/ElevenLabs
- `BufferedTokenStream` holds back tokens with unclosed XML tags (53 regression tests)
- `InstructionParts` removed

Expressiveness mode
The framework injects system messages telling the LLM about available TTS tags:
The LLM then uses markup naturally. Markup is stripped from transcripts and chat history:
LLM output: `<emotion value="sad"/> I understand how you feel.`
Sent to TTS: `<emotion value="sad"/> I understand how you feel.`
Transcript: `I understand how you feel.`
Chat history: `I understand how you feel.`

Custom templates and per-plugin overrides:
ElevenLabs example with normalization:
Stateless Instructions
Reworked from `str` subclass to plain class. No Pydantic, no runtime state.

Hierarchy: `Instructions` → `AgentInstructions` → `WorkflowInstructions`. `InstructionParts` removed, replaced by `WorkflowInstructions(AgentInstructions)`.

STT speaker context + AudioRecognition
STT plugins set metadata on their stream. Accessible anywhere on the Agent:
`AudioRecognition` is now a public class but all fields and methods are private; only `stt_context` is exposed.

Structured LLM output
All fields must have defaults, validated at class definition via `__init_subclass__`. The LLM is configured for structured output; JSON is streamed and partially parsed via `pydantic_core.from_json(allow_partial=True)`. `tts_node` receives `BaseModel` chunks (explicit opt-in: existing custom `tts_node` implementations that only handle `str` are unaffected unless `llm_output_format` is set). Parsed output is stored on `ChatMessage.llm_output`.

XML-aware sentence tokenizer
`BufferedTokenStream` now holds back tokens that contain unclosed XML tags, preventing sentence splits inside markup like `<spell>U.S.A.</spell>`. The Blingfire batch path also merges split-tag sentences. 53 regression tests cover self-closing tags, wrapping tags, decimals in attributes, nested tags, chunk boundary splits, unicode, and a realistic multi-sentence conversation.
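The hold-back decision reduces to an unclosed-tag check over the buffered text; a minimal sketch (the real `_has_unclosed_xml_tags()` and regex may differ):

```python
import re

# lazy [^>]*? so the trailing "/" of self-closing tags is captured
_XML_TAG_RE = re.compile(r"<(/?)(\w+)[^>]*?(/?)\s*>")


def has_unclosed_xml_tags(text: str) -> bool:
    depth = 0
    for m in _XML_TAG_RE.finditer(text):
        is_closing = m.group(1) == "/"
        is_self_closing = m.group(3) == "/"
        if is_self_closing:
            continue  # <emotion value="sad"/> never blocks a split
        depth += -1 if is_closing else 1
    return depth > 0


print(has_unclosed_xml_tags("He said <spell>U.S.A."))          # hold back
print(has_unclosed_xml_tags("He said <spell>U.S.A.</spell>"))  # safe to emit
```

While this returns True the tokenizer keeps buffering instead of emitting a sentence, so a period inside markup can never produce a split.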