
feat: expressiveness mode, stateless Instructions, structured LLM output#5635

Open
theomonnom wants to merge 12 commits into main from theo/expressiveness-mode

Conversation


@theomonnom theomonnom commented May 4, 2026

Summary

  • Expressiveness mode — auto-injects TTS markup instructions + speaker context into LLM, strips markup from transcripts
  • Stateless Instructions — reworked from str subclass to plain class with common/audio/text
  • STT speaker context — RecognizeStream.context + SpeakerContext protocol
  • AudioRecognition — now public, all fields/methods private except stt_context
  • Structured LLM output — llm_output_format with llm.Response annotation, streaming JSON partial parsing
  • TTS markup — TTS.Markup inner class, shared _provider_format.py for Cartesia/ElevenLabs
  • XML-aware tokenizer — BufferedTokenStream holds back tokens with unclosed XML tags (53 regression tests)
  • WorkflowInstructions — replaces InstructionParts

Expressiveness mode

from livekit.agents import Agent, AgentSession, inference

agent = Agent(
    instructions="You are an empathetic therapist.",
    expressiveness=True,
    stt=inference.STT("deepgram/nova-3"),
    llm=inference.LLM("openai/gpt-4o"),
    tts=inference.TTS("cartesia/sonic-3"),
)
session = AgentSession()
await session.start(agent, room=room)

The framework injects system messages telling the LLM about available TTS tags:

The TTS supports the following formatting capabilities...
<emotion value="EMOTION"/> where EMOTION is one of: neutral, angry, excited...
<speed ratio="VALUE"/>, <volume ratio="VALUE"/>, <break time="1s"/>...

The LLM then uses markup naturally. Markup is stripped from transcripts and chat history:

| Path | Text |
| --- | --- |
| LLM output | `<emotion value="sad"/> I understand how you feel.` |
| TTS receives | `<emotion value="sad"/> I understand how you feel.` |
| Transcript | I understand how you feel. |
| Chat history | I understand how you feel. |
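The stripping step can be sketched with a small regex helper (hypothetical sketch, not the PR's actual implementation, which lives in the TTS markup layer):

```python
import re

# Matches opening, closing, and self-closing tags like <emotion value="sad"/>,
# <spell>, </spell>. Hypothetical helper — not the PR's actual code.
_TAG_RE = re.compile(r"</?\w+[^>]*>")

def strip_markup(text: str) -> str:
    """Remove TTS markup tags and tidy up leftover whitespace."""
    stripped = _TAG_RE.sub("", text)
    return re.sub(r"\s{2,}", " ", stripped).strip()
```

For example, `strip_markup('<emotion value="sad"/> I understand how you feel.')` yields the transcript text shown in the table above.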

Custom templates and per-plugin overrides:

from livekit.agents.llm.chat_context import AgentInstructions
from livekit.plugins import cartesia

# Custom framing for the injected instructions
agent = Agent(
    instructions=AgentInstructions(
        "You are helpful.",
        tts_instructions_template="Use speech markup sparingly:\n\n{tts_instructions}",
        audio_recognition_instructions_template="Speaker: {speaker_context}",
    ),
    expressiveness=True,
    tts=inference.TTS("cartesia/sonic-3"),
    llm=inference.LLM("openai/gpt-4o"),
)

# Override specific parts of a plugin's default instructions
tts = cartesia.TTS(
    instruction_parts=cartesia.InstructionParts(
        constraints="Only use emotion tags. Never use speed or volume."
    )
)

ElevenLabs example with normalization:

from livekit.plugins import elevenlabs

agent = Agent(
    instructions="You are a friendly customer support agent.",
    expressiveness=True,
    llm=inference.LLM("openai/gpt-4o"),
    tts=elevenlabs.TTS(model="eleven_flash_v2"),
)

# LLM receives ElevenLabs-specific instructions:
#   "Normalize numbers and symbols for spoken clarity..."
#   "$42.50 → forty-two dollars and fifty cents"
#   "SSML: <break time="1.5s"/>, <phoneme alphabet="cmu-arpabet" ph="...">word</phoneme>"
#
# LLM outputs: Hold on, let me check. <break time="1.5s"/> Your total is forty-two dollars.
# Transcript:  Hold on, let me check. Your total is forty-two dollars.

Stateless Instructions

Reworked from str subclass to plain class. No Pydantic, no runtime state.

from livekit.agents.llm.chat_context import Instructions

# Simple — same for all modalities
Instructions("You are helpful.")

# Modality-aware — common text + per-modality additions
instr = Instructions(
    "You are a helpful assistant.",
    audio="Keep responses short for voice.",
    text="Use markdown formatting.",
)
instr.as_modality("audio")  # "You are a helpful assistant.\n\nKeep responses short for voice."
instr.as_modality("text")   # "You are a helpful assistant.\n\nUse markdown formatting."
str(instr)                   # "You are a helpful assistant."

Hierarchy: Instructions → AgentInstructions → WorkflowInstructions

InstructionParts removed, replaced by WorkflowInstructions(AgentInstructions).

STT speaker context + AudioRecognition

STT plugins set metadata on their stream. Accessible anywhere on the Agent:

from pydantic import BaseModel
from livekit.agents.stt import SpeakerContext

# STT plugin defines its own context model
class MySpeakerProfile(BaseModel):
    emotion: str | None = None
    gender: str | None = None

    def to_instructions(self) -> str:
        parts = []
        if self.emotion:
            parts.append(f"Emotion: {self.emotion}")
        if self.gender:
            parts.append(f"Gender: {self.gender}")
        return "\n".join(parts)

# Plugin sets it during recognition:
self.context = MySpeakerProfile(emotion="happy", gender="female")

# Agent reads it anywhere — nodes, tools, callbacks:
self.audio_recognition.stt_context  # MySpeakerProfile instance or None

AudioRecognition is now a public class but all fields and methods are private — only stt_context is exposed.
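The SpeakerContext protocol presumably requires nothing beyond a to_instructions() method; a sketch of the contract as implied by the example above (assumed, not the actual definition):

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class SpeakerContext(Protocol):
    """Anything an STT plugin attaches as stream context; the framework only
    needs to render it into LLM instructions. Sketch of the assumed contract."""

    def to_instructions(self) -> str: ...

# Any class with a matching method satisfies the protocol — no inheritance needed.
class SimpleProfile:
    def __init__(self, emotion: str) -> None:
        self.emotion = emotion

    def to_instructions(self) -> str:
        return f"Emotion: {self.emotion}"
```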

Structured LLM output

from pydantic import BaseModel
from livekit.agents import Agent, llm
from livekit.plugins import openai
from livekit.agents import inference

class TherapistOutput(BaseModel):
    emotion: str | None = None
    therapeutic_technique: str | None = None
    response: llm.Response = ""

class TherapistAgent(Agent):
    llm_output_format = TherapistOutput

agent = TherapistAgent(
    instructions="You are an empathetic therapist.",
    llm=openai.LLM(),
    tts=inference.TTS("cartesia/sonic-3"),
)

All fields must have defaults — validated at class definition via __init_subclass__. LLM is configured for structured output, JSON is streamed and partially parsed via pydantic_core.from_json(allow_partial=True).
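pydantic_core.from_json(allow_partial=True) tolerates truncated JSON, which is what makes fields available before the response text finishes streaming. A small illustration (the wiring into the LLM stream is the PR's, not shown here):

```python
from pydantic_core import from_json

# A truncated chunk of streamed structured output: the "response" string is
# still incomplete, but earlier fields are already parseable.
partial = '{"emotion": "empathetic", "therapeutic_technique": "active listening", "response": "I under'
data = from_json(partial, allow_partial=True)
# Complete fields are available immediately; the incomplete trailing string is
# dropped (pass allow_partial="trailing-strings" to keep the partial text).
```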

tts_node receives BaseModel chunks (explicit opt-in — existing custom tts_node implementations that only handle str are unaffected unless llm_output_format is set):

from collections.abc import AsyncIterable

class MyAgent(Agent):
    llm_output_format = TherapistOutput

    async def tts_node(
        self, text: AsyncIterable[TherapistOutput], model_settings  # type: ignore[override]
    ):
        async def inspect() -> AsyncIterable[TherapistOutput]:
            async for chunk in text:
                chunk.emotion                # "empathetic" — populated before first text token
                chunk.therapeutic_technique  # "active listening"
                chunk.response               # text delta (accumulated)
                yield chunk  # re-yield so the default node still receives every chunk

        return Agent.default.tts_node(self, inspect(), model_settings)

Parsed output stored on ChatMessage.llm_output:

result = await session.run(user_input="I'm having a terrible day")
msg = result.expect.next_event(type="message").event().item
msg.text_content  # "I understand how you feel..."
msg.llm_output    # TherapistOutput(emotion="empathetic", response="...")

XML-aware sentence tokenizer

BufferedTokenStream now holds back tokens that contain unclosed XML tags, preventing sentence splits inside markup like <spell>U.S.A.</spell>. Blingfire batch path also merges split-tag sentences. 53 regression tests covering self-closing tags, wrapping tags, decimals in attributes, nested tags, chunk boundary splits, unicode, and a realistic multi-sentence conversation.
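The hold-back check can be sketched as a stack over complete tags (illustrative sketch only; the real BufferedTokenStream also has to handle tags split across chunk boundaries):

```python
import re

# Lazy [^>]*? so the optional "/" of self-closing tags lands in group 3.
_XML_TAG_RE = re.compile(r"<(/?)(\w+)[^>]*?(/?)\s*>")

def has_unclosed_xml_tags(text: str) -> bool:
    """Return True when text contains an opening tag without its close."""
    stack: list[str] = []
    for closing, name, self_closing in _XML_TAG_RE.findall(text):
        if self_closing:        # <break time="1s"/> — nothing to track
            continue
        if closing:             # </spell> — pop the matching opener
            if stack and stack[-1] == name:
                stack.pop()
        else:                   # <spell> — hold back until its close arrives
            stack.append(name)
    return bool(stack)
```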

- Add expressiveness flag (Agent + AgentSession) that auto-injects TTS
  markup instructions and speaker context into LLM system messages
- Rework Instructions from str subclass to stateless class with
  common/audio/text fields. No Pydantic dependency, no runtime state.
- Add AgentInstructions with expressiveness templates, WorkflowInstructions
  replaces InstructionParts
- Add TTS Markup inner class (llm_instructions + to_text) with shared
  _provider_format.py for Cartesia/ElevenLabs
- Add RecognizeStream.context + SpeakerContext protocol for STT metadata
- Privatize AudioRecognition, expose only stt_context
- Add llm_output_format class-level attribute for structured LLM output
  with streaming JSON partial parsing
- Add llm.Response annotation, ChatMessage.llm_output field
- Validate all llm_output_format fields have defaults at class definition
@chenghao-mou chenghao-mou requested a review from a team May 4, 2026 03:31
theomonnom added 5 commits May 3, 2026 20:37
BufferedTokenStream now holds back tokens that contain unclosed XML tags,
preventing sentence splits inside markup like <spell>U.S.A.</spell>.
Batch path in blingfire also merges split-tag sentences.

Removes unused TagAwareBuffer — tokenizer handles it natively.
Fixes AgentConfigUpdate.instructions to use str instead of Instructions.
21 regression tests for batch + streaming with all TTS tag patterns.
Covers batch + streaming paths with: self-closing tags, wrapping tags,
periods in attributes and content, abbreviations (U.S.A., N.A.S.A.),
phoneme with IPA/arpabet, chunk boundary splits, char-by-char streaming,
unicode (French, Chinese, emoji), mixed tags, and a realistic
multi-sentence conversation.
…cised

Blingfire doesn't split tiny fragments. Tests now use realistic
multi-sentence content inside tags so splits actually trigger and
the XML-aware merge is verified.
@theomonnom theomonnom force-pushed the theo/expressiveness-mode branch from bea7e82 to c36f944 (May 4, 2026 03:58)

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 1 new potential issue.

View 11 additional findings in Devin Review.


Comment on lines 323 to +325
if text_transforms:
input = _apply_text_transforms(input, text_transforms)
# text transforms only apply to plain text mode (no structured output)
input = _apply_text_transforms(input, text_transforms) # type: ignore[arg-type]

@devin-ai-integration devin-ai-integration Bot May 4, 2026


🔴 Text transforms crash at runtime when llm_output_format sends BaseModel objects through the TTS pipeline

When llm_output_format is set on an Agent, _llm_inference_task (generation.py:219-225) sends BaseModel objects through text_ch. These flow into _tts_inference_task where _apply_text_transforms is applied unconditionally at line 325. The default text transforms (filter_markdown and filter_emoji) perform string operations like buffer += chunk (filters.py:103) and EMOJI_PATTERN.sub("", chunk) (filters.py:156) that will raise TypeError when chunk is a BaseModel instead of str.

Since DEFAULT_TTS_TEXT_TRANSFORMS = ["filter_markdown", "filter_emoji"] is always active by default, any Agent using llm_output_format will crash at runtime unless the user explicitly sets tts_text_transforms=None. The comment at line 324 acknowledges the incompatibility but no guard is implemented.

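One possible guard for the issue above (illustrative only, not the PR's fix): skip string transforms for non-str chunks so structured output passes through untouched.

```python
# Sketch: apply text transforms only to plain-str chunks, passing structured
# BaseModel chunks through unchanged. Names here are stand-ins, not the real ones.
def apply_text_transforms_safe(chunk, transforms):
    if not isinstance(chunk, str):
        return chunk  # structured output: string transforms don't apply
    for transform in transforms:
        chunk = transform(chunk)
    return chunk
```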

current_span.set_attribute(trace_types.ATTR_SPEECH_ID, speech_handle.id)
if instructions is not None:
current_span.set_attribute(trace_types.ATTR_INSTRUCTIONS, instructions)
current_span.set_attribute(trace_types.ATTR_INSTRUCTIONS, str(instructions))
Contributor


this only adds the common part to the trace but not the version used for this turn?

Member Author


Fixed — trace now shows the modality-resolved text, not just the common part.

Comment on lines +2426 to +2478
chat_ctx.add_message(role="system", content=[instructions])
# re-resolve instructions for the current turn's modality
turn_modality = speech_handle.input_details.modality
turn_instructions = instructions if instructions is not None else self._agent.instructions
Contributor


is that expected for replacing the original instructions with the turn instructions entirely?

…failures

- ChatContext only stores str, never Instructions objects
- Per-turn modality resolution only when Instructions has audio/text variants
- Plain str instructions pass through unchanged (no re-resolution)
- Revert unintended fake_llm changes
- Fix add_message to resolve Instructions to str

@devin-ai-integration devin-ai-integration Bot left a comment


Devin Review found 1 new potential issue.

View 11 additional findings in Devin Review.


# If the start and end indices are not available, we attempt to locate the token within the text using str.find. # noqa: E501
TokenizeCallable = Callable[[str], list[str] | list[tuple[str, int, int]]]

_XML_TAG_RE = re.compile(r"<(/?)(\w+)[^>]*(/?)\s*>")
Contributor


🟡 XML tag regex fails to detect self-closing tags with attributes due to greedy [^>] consuming the /*

The regex _XML_TAG_RE = re.compile(r"<(/?)(\w+)[^>]*(/?)\s*>") at token_stream.py:14 uses [^>]* which greedily matches all characters except >, including the / that precedes > in self-closing tags. For any self-closing tag with attributes (e.g. <emotion value="happy"/>, <speed ratio="1.5"/>, <break time="1s"/>), group(3) is always "" instead of "/", so is_self_closing is always False.

This causes _has_unclosed_xml_tags() to incorrectly return True for text containing self-closing TTS markup tags, making both the batch tokenizer (blingfire.py:49) and the streaming tokenizer (token_stream.py:82) overly conservative — they merge sentences that don't need merging, increasing TTS latency by sending larger text chunks.

Suggested change
_XML_TAG_RE = re.compile(r"<(/?)(\w+)[^>]*(/?)\s*>")
_XML_TAG_RE = re.compile(r"<(/?)(\w+)[^>]*?(/?)\s*>")
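The greedy-vs-lazy difference is easy to check directly (both patterns are from the finding above):

```python
import re

greedy = re.compile(r"<(/?)(\w+)[^>]*(/?)\s*>")
lazy = re.compile(r"<(/?)(\w+)[^>]*?(/?)\s*>")

tag = '<emotion value="happy"/>'
# Greedy [^>]* swallows the trailing "/" so group 3 is empty...
assert greedy.search(tag).group(3) == ""
# ...while the lazy variant leaves the "/" for group 3.
assert lazy.search(tag).group(3) == "/"
```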

Instructions("You are a helpful assistant.")

@property
def audio(self) -> str:
Member


this is breaking? IMO we should keep this just as a wrapper.. it's much easier to write instructions.text instead of instructions.as_modality('text')

Member Author


Instructions was supposed to be in beta, I'm not sure if anybody is using it

rtc.EventEmitter[Literal["metrics_collected", "error"] | TEvent],
Generic[TEvent],
):
class Markup:
Member


nit, not sure about the name. it wraps a TTS. why not expose these on TTS itself?

Member Author


It is the case tho?

tts.markup?

llm: NotGivenOr[llm.LLM | llm.RealtimeModel | LLMModels | str | None] = NOT_GIVEN,
tts: NotGivenOr[tts.TTS | TTSModels | str | None] = NOT_GIVEN,
mcp_servers: NotGivenOr[list[mcp.MCPServer] | None] = NOT_GIVEN,
expressiveness: NotGivenOr[bool] = NOT_GIVEN,
Member


if a user wanted to override how they prompt the LLM for expressiveness. where should they do it?

should this be a bool | ExpressivenessOptions?

Member Author

@theomonnom theomonnom May 4, 2026


They do it inside the new AgentInstructions class

return self._interruption_detection

@property
def expressiveness(self) -> NotGivenOr[bool]:
Member


if we want options, then it'd be better to always return options vs a bool

str(instructions) if not isinstance(instructions, str) else instructions
)

class _SafeFormatter(string.Formatter):
Member


nit: should this be util?

…with render(), improved provider prompts

- ExpressivenessOptions moved to agent_session.py as TypedDict with DEFAULT_EXPRESSIVENESS_OPTIONS
- Instructions: removed format/as_modality/__add__, added render(modality, data) returning str
- Instructions: added resolve_template() static method for workflow modality-aware composition
- safe_render utility in utils/misc.py with nested dict→SimpleNamespace, error logging with full dotted paths
- Template data uses explicit dicts with proper namespaces (tts.markup.llm_instructions, audio_recognition.stt_context.emotion)
- AudioRecognition.llm_instructions() method matching tts.markup.llm_instructions() API
- Cartesia prompt: complete 62 emotion list, examples, XML format explained
- ElevenLabs prompt: normalization rules, SSML tags, examples
- Removed _concat_optional, _safe_format, AgentInstructions

@devin-ai-integration devin-ai-integration Bot left a comment


Devin Review found 1 new potential issue.

🐛 1 issue in files not directly in the diff

🐛 AgentConfigUpdate raises ValidationError when Agent.instructions is an Instructions object (livekit-agents/livekit/agents/voice/agent_activity.py:771)

At agent_activity.py:770-771, self._agent.instructions (typed as str | Instructions) is passed directly to llm.AgentConfigUpdate(instructions=...), whose field is typed str | None. The old Instructions class was a str subclass and had a custom __get_pydantic_core_schema__, so Pydantic accepted it. The refactored Instructions is a plain class with neither, so Pydantic v2 rejects it with ValidationError: Input should be a valid string. This crashes any agent created with Agent(instructions=Instructions(...)) when the activity starts.

View 12 additional findings in Devin Review.


theomonnom added 4 commits May 4, 2026 21:16
…Labs v3)

<expression value="..."/> is the XML bridge for providers that use []
brackets natively. The LLM always generates XML, plugins convert to
native format before sending to API.

- Cartesia: native XML, no conversion needed
- ElevenLabs v2: native SSML, no conversion
- ElevenLabs v3: <expression> → [laughs], [whispers], etc.
- Inworld TTS 2: <expression> → [say excitedly], [laugh], etc.

Added TTS.Markup.convert() method, convert_expression_tags() and
strip_bracket_tags() helpers, complete provider prompts with examples.
…ting self._markup

Base TTS.__init__ calls self.Markup(self) automatically. Plugins just
define their Markup inner class — no manual self._markup assignment needed.