Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 3 additions & 2 deletions configs/prompts/judge.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -640,7 +640,7 @@ judge:
}}
]

agent_speech_fidelity:
tts_fidelity:
user_prompt: |
You are an expert evaluator judging the fidelity of this audio file against the intended text.
You will listen to one audio clip and verify that the spoken content faithfully reproduces the intended text, with special attention to TTS-critical entities.
Expand Down Expand Up @@ -737,7 +737,8 @@ judge:
}}
]
}}
s2s_user_prompt: |
agent_speech_fidelity:
user_prompt: |
You are an expert evaluator checking the **speech clarity and articulation** of entities spoken by an AI voice agent.

You will receive:
Expand Down
11 changes: 6 additions & 5 deletions docs/metrics/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -36,7 +36,7 @@ Measures whether the agent accomplished the user's goal correctly:
|--------|------|-------------|-------------|
| [`task_completion`](task_completion.md) | Deterministic | Speech Recognition, Language Model | Binary pass/fail via scenario DB state hash comparison (0-1) |
| [`faithfulness`](faithfulness.md) | Judge (Claude Opus) | Speech Recognition (audio-native only), Language Model | Faithfulness to information, policies, and instructions (1-3) |
| [`agent_speech_fidelity`](agent_speech_fidelity.md) | Audio Judge (Gemini) | Speech Synthesis | Whether assistant speech audio matches intended text (0-1) |
| [`agent_speech_fidelity`](agent_speech_fidelity.md) | Audio Judge (Gemini) | Speech Synthesis | Whether key entities in the assistant's speech audio matches intended text (0-1) |

### Experience (3 metrics)

Expand All @@ -48,12 +48,13 @@ Measures the quality of the user's conversational experience:
| [`conciseness`](conciseness.md) | Judge | Language Model | Whether responses are appropriately concise for voice (1-3) |
| [`conversation_progression`](conversation_progression.md) | Judge | Language Model | Whether assistant moves conversation forward without repetition (1-3) |

### Diagnostic (6 metrics)
### Diagnostic (7 metrics)

Metrics that help isolate root causes of failures. These provide signals for understanding what went wrong, but are not directly used in final evaluation scores.

| Metric | Type | Capabilities | Description |
|--------|------|-------------|-------------|
| [`tts_fidelity`](tts_fidelity.md) | Audio Judge (Gemini) | Speech Synthesis | Whether assistant speech audio matches intended text (0-1). **Opt-in** — excluded from the default run; enable via `--metrics tts_fidelity`. |
| [`authentication_success`](authentication_success.md) | Deterministic | Speech Recognition, Language Model | Whether get_reservation was called successfully (0-1) |
| [`response_speed`](response_speed.md) | Deterministic | VAD, Pipeline | Latency between user utterance end and assistant response start (seconds) |
| [`speakability`](speakability.md) | Judge | Language Model | Whether text is voice-friendly and appropriate for TTS (0-1) |
Expand All @@ -80,7 +81,7 @@ Each metric implements `BaseMetric.compute(context: MetricContext) -> MetricScor
**LLM-as-Judge:**
- Integer/boolean ratings (not decimals) to avoid precision issues
- Structured prompts in `configs/prompts/judge.yaml`
- GPT-5.2 for text judges, Gemini 3.1 Pro for audio judges, Claude Opus for faithfulness
- GPT-5.2 for text judges, Gemini 3 Flash for audio judges, Claude Opus for faithfulness

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


**Audio Evaluation:**
- Audio encoded as base64 WAV, sent to Gemini via LiteLLM
Expand Down Expand Up @@ -122,10 +123,10 @@ python main.py \
--run-id <existing_run_id> \
--metrics turn_taking,conciseness,conversation_progression

# Run diagnostic metrics
# Run diagnostic metrics (tts_fidelity is opt-in and only runs when named explicitly)
python main.py \
--run-id <existing_run_id> \
--metrics authentication_success,response_speed,speakability,stt_wer,tool_call_validity,transcription_accuracy_key_entities
--metrics authentication_success,response_speed,speakability,stt_wer,tool_call_validity,transcription_accuracy_key_entities,tts_fidelity

# Run validation metrics
python main.py \
Expand Down
52 changes: 20 additions & 32 deletions docs/metrics/agent_speech_fidelity.md
Original file line number Diff line number Diff line change
@@ -1,56 +1,42 @@
# Agent Speech Fidelity

> **Accuracy Metric**: If the agent's spoken audio doesn't match what it intended to say, the user receives incorrect information regardless of how good the text reasoning was.
> **Accuracy Metric**: If the agent's spoken audio garbles or misstates key information, the user receives incorrect information regardless of how good the text reasoning was.

## Overview

Audio-based metric that evaluates whether the assistant's **spoken audio** accurately represents the intended text, using Gemini for multimodal analysis. This metric evaluates the speech output regardless of how it was produced — whether by a separate TTS engine or generated directly by an audio-native model. Specifically, it checks that all words from the intended text are present (no missing words), no extra words were added (no insertions), words are spoken correctly (no substitutions), and key entities are accurately conveyed (dates, names, numbers, codes, addresses).
Audio-based metric that evaluates whether the assistant's **spoken audio** accurately represents the **key entities** (dates, names, numbers, codes, addresses, etc.), using Gemini for multimodal analysis.

To keep the EVA score **apples-to-apples across all pipeline setups**, the same entity-focused metric runs for every pipeline type — cascade, S2S, and audio-LLM. It does not require any intended text.

### Capabilities Measured

- **Speech Synthesis**: Measures whether the TTS engine (cascade) or the model's direct audio generation (audio-native) accurately produces the intended text as spoken audio.
- **Speech Synthesis**: Measures whether the assistant's spoken audio accurately represents the key entities.

## How It Works

### Evaluation Method

- **Type**: Audio Judge (multimodal LLM with audio input)
- **Model**: Gemini 3.1 Pro
- **Model**: Gemini 3 Flash
- **Granularity**: Per-turn (each assistant turn evaluated independently)

### Input Data

Uses the following MetricContext fields:
- `audio_assistant_path`: Path to assistant-only audio file
- `intended_assistant_turns`: What the assistant intended to say

### Audio-Native vs Cascade

The evaluation is the same in both cases — compare `intended_assistant_turns` against the actual spoken audio. The only difference is where the intended text comes from:

- **Cascade**: The intended text is the input to the TTS engine (i.e., the LLM's text output).
- **Audio-native (S2S, S2T+TTS):** The intended text is the text output that the audio-native model returns alongside its generated speech.
- `conversation_trace`: User utterances and tool responses are kept as-is (the sources of the entities to listen for); assistant turns are **redacted** to a placeholder so the judge evaluates articulation, not whether the agent gave the "right" answer.

### Evaluation Methodology

The judge compares intended text against spoken audio, focusing on:

- **TTS-critical entities**: Names, dates, times, codes, dollar amounts, flight numbers — these are the highest-priority items
- **Error types**: Missing words, added words, wrong words, entity errors

**Special handling:**
- Minor pronunciation variations that don't change meaning are acceptable
- Filler words (um, uh) that don't affect core content are ignored
- Interruption tags (e.g., `[likely cut off by user]`, `[assistant interrupts]`) are non-spoken metadata in the intended text — words in regions flagged by these tags as likely not spoken are not penalized
- Missing words at END of LAST turn only are not penalized (audio cutoff)
The judge receives the agent audio plus a redacted conversation trace and, for each assistant turn, checks whether the spoken audio accurately represents the key entities: Names, dates, times, codes, dollar amounts, flight numbers, etc. Turns with no entities to evaluate are flagged (`has_entities = false`) and **excluded** from scoring.

### Scoring

- **Scale**: 0-1 (binary per turn)
- 1: High Fidelity — audio accurately says all words from intended text
- 0: Low Fidelity — missing, added, or wrong words detected
- 1: Entities clearly articulated
- 0: An entity is unclear, garbled, or wrongly articulated
- **Normalization**: Already 0-1 scale
- **Aggregation**: Mean across all assistant turns
- **Aggregation**: Mean across scored assistant turns (turns with no entities are skipped)

## Example Output

Expand All @@ -62,25 +48,27 @@ The judge compares intended text against spoken audio, focusing on:
"details": {
"aggregation": "mean",
"num_turns": 7,
"num_evaluated": 7,
"per_turn_ratings": {"0": 1, "1": 1, "2": 1, "3": 0, "4": 1, "5": 1, "6": 1},
"num_evaluated": 4,
"num_skipped_no_entities": 3,
"per_turn_ratings": {"0": 1, "1": 1, "3": 0, "5": 1},
"per_turn_has_entities": {"0": true, "1": true, "2": false, "3": true},
"per_turn_explanations": {
"3": "Missing word: intended 'flight SW102' but audio said 'flight SW12'. Key entity error."
"3": "Confirmation code unclear: heard 'ZK three F F' but trace shows 'ZK3FFW'."
}
}
}
```

## Related Metrics

- [user_speech_fidelity.md](user_speech_fidelity.md) - Same metric for the user simulator side
- [tts_fidelity.md](tts_fidelity.md) - Stricter, word-for-word diagnostic metric, only for pipelines that expose intended text
- [faithfulness.md](faithfulness.md) - Faithfulness for the text layer: evaluates whether the assistant's responses are grounded in instructions, policies, and tool results
- [speakability.md](speakability.md) - Checks if text is voice-friendly (upstream concern)

## Implementation Details

- **File**: `src/eva/metrics/accuracy/agent_speech_fidelity.py`
- **Class**: `AgentSpeechFidelityMetric`
- **File**: `src/eva/metrics/accuracy/speech_fidelity.py`
- **Class**: `SpeechFidelityMetric`
- **Base Class**: `SpeechFidelityBaseMetric` → `AudioJudgeMetric`
- **Prompt**: `configs/prompts/judge.yaml` under `judge.agent_speech_fidelity`
- **Configuration**: `audio_judge_model` (default: Gemini 3.1 Pro), `aggregation` (default: "mean")
- **Configuration**: `audio_judge_model` (default: Gemini 3 Flash), `aggregation` (default: "mean")
82 changes: 82 additions & 0 deletions docs/metrics/tts_fidelity.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,82 @@
# TTS Fidelity

> **Diagnostic Metric**: If the agent's spoken audio doesn't match what it intended to say, the user receives incorrect information regardless of how good the text reasoning was.

## Overview

Audio-based metric that evaluates whether the assistant's **spoken audio** accurately represents the intended text, using Gemini for multimodal analysis. This metric evaluates the speech output regardless of how it was produced — whether by a separate TTS engine or generated directly by an audio-native model. Specifically, it checks that all words from the intended text are present (no missing words), no extra words were added (no insertions), words are spoken correctly (no substitutions), and key entities are accurately conveyed (dates, names, numbers, codes, addresses).

> [!NOTE]
> By default, this diagnostic metric is excluded. Enable it explicitly with `--metrics tts_fidelity` (or include it in a comma-separated `--metrics` list).

### Capabilities Measured

- **Speech Synthesis**: Measures whether the TTS engine (cascade) or the model's direct audio generation (audio-native) accurately produces the intended text as spoken audio.

## How It Works

### Evaluation Method

- **Type**: Audio Judge (multimodal LLM with audio input)
- **Model**: Gemini 3 Flash
- **Granularity**: Per-turn (each assistant turn evaluated independently)

### Input Data

Uses the following MetricContext fields:
- `audio_assistant_path`: Path to assistant-only audio file
- `intended_assistant_turns`: What the assistant intended to say

### Evaluation Methodology

The judge compares intended text against spoken audio, focusing on:

- **TTS-critical entities**: Names, dates, times, codes, dollar amounts, flight numbers — these are the highest-priority items
- **Error types**: Missing words, added words, wrong words, entity errors

**Special handling:**
- Minor pronunciation variations that don't change meaning are acceptable
- Filler words (um, uh) that don't affect core content are ignored
- Interruption tags (e.g., `[likely cut off by user]`, `[assistant interrupts]`) are non-spoken metadata in the intended text — words in regions flagged by these tags as likely not spoken are not penalized
- Missing words at END of LAST turn only are not penalized (audio cutoff)

### Scoring

- **Scale**: 0-1 (binary per turn)
- 1: High Fidelity — audio accurately says all words from intended text
- 0: Low Fidelity — missing, added, or wrong words detected
- **Normalization**: Already 0-1 scale
- **Aggregation**: Mean across all assistant turns

## Example Output

```json
{
"name": "tts_fidelity",
"score": 0.875,
"normalized_score": 0.875,
"details": {
"aggregation": "mean",
"num_turns": 7,
"num_evaluated": 7,
"per_turn_ratings": {"0": 1, "1": 1, "2": 1, "3": 0, "4": 1, "5": 1, "6": 1},
"per_turn_explanations": {
"3": "Missing word: intended 'flight SW102' but audio said 'flight SW12'. Key entity error."
}
}
}
```

## Related Metrics

- [user_speech_fidelity.md](user_speech_fidelity.md) - Same metric for the user simulator side
- [faithfulness.md](faithfulness.md) - Faithfulness for the text layer: evaluates whether the assistant's responses are grounded in instructions, policies, and tool results
- [speakability.md](speakability.md) - Checks if text is voice-friendly (upstream concern)

## Implementation Details

- **File**: `src/eva/metrics/diagnostic/tts_fidelity.py`
- **Class**: `TTSFidelityMetric`
- **Base Class**: `SpeechFidelityBaseMetric` → `AudioJudgeMetric`
- **Prompt**: `configs/prompts/judge.yaml` under `judge.tts_fidelity`
- **Configuration**: `audio_judge_model` (default: Gemini 3 Flash), `aggregation` (default: "mean")
4 changes: 2 additions & 2 deletions docs/metrics/user_speech_fidelity.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ Audio-based validation metric that evaluates whether the user simulator's **spok
### Evaluation Method

- **Type**: Audio Judge (multimodal LLM with audio input)
- **Model**: Gemini 3.1 Pro
- **Model**: Gemini 3 Flash
- **Granularity**: Per-turn (each user turn evaluated independently)

### Input Data
Expand Down Expand Up @@ -67,5 +67,5 @@ This metric uses a 1-3 scale instead of binary 0-1 (like agent speech fidelity)
- **Prompt location**: `configs/prompts/judge.yaml` under `judge.user_speech_fidelity`
- Uses the same speech fidelity prompt structure as `agent_speech_fidelity` but with `evaluation_mode="user"` and user turns
- **Configuration options**:
- `audio_judge_model`: LLM model (default: Gemini 3.1 Pro)
- `audio_judge_model`: LLM model (default: Gemini 3 Flash)
- `aggregation`: Aggregation method (default: "mean")
2 changes: 1 addition & 1 deletion src/eva/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -11,4 +11,4 @@

# Bump metrics_version when changes affect metric computation (metrics code,
# judge prompts, pricing tables, postprocessor).
metrics_version = "2.1.2"
metrics_version = "2.2.0"
6 changes: 2 additions & 4 deletions src/eva/metrics/accuracy/__init__.py
Original file line number Diff line number Diff line change
@@ -1,13 +1,11 @@
"""Task completion metrics - measuring whether the agent accomplished the user's goal."""

from . import agent_speech_fidelity # noqa
from . import agent_speech_fidelity_s2s # noqa
from . import faithfulness # noqa
from . import speech_fidelity # noqa
from . import task_completion # noqa

__all__ = [
"agent_speech_fidelity",
"agent_speech_fidelity_s2s",
"faithfulness",
"speech_fidelity",
"task_completion",
]
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
"""Agent speech fidelity metric for S2S models — entity-focused evaluation.
"""Agent speech fidelity metric — entity-focused, pipeline-agnostic evaluation.

For S2S (speech-to-speech) models, there is no intended text to compare against.
Instead, this metric verifies that key entities spoken by the agent (from tool
Because S2S (speech-to-speech) models expose no intended text to compare against,
this metric instead verifies that key entities spoken by the agent (from tool
responses and user utterances) are accurate by sending a redacted conversation
trace alongside the agent audio to Gemini.
"""
Expand All @@ -10,13 +10,15 @@
from typing import Any

from eva.metrics.base import MetricContext
from eva.metrics.registry import register_metric
from eva.metrics.speech_fidelity_base import SpeechFidelityBaseMetric
from eva.metrics.utils import aggregate_per_turn_scores, normalize_rating, resolve_turn_id
from eva.models.results import MetricScore


class AgentSpeechFidelityS2SMetric(SpeechFidelityBaseMetric):
"""Audio-based entity fidelity metric for S2S agent speech.
@register_metric
class SpeechFidelityMetric(SpeechFidelityBaseMetric):
"""Audio-based entity fidelity metric for agent speech.

Evaluates whether key entities (from tool responses and user utterances) are
spoken correctly by the agent, without requiring intended text.
Expand All @@ -25,8 +27,8 @@ class AgentSpeechFidelityS2SMetric(SpeechFidelityBaseMetric):
"""

name = "agent_speech_fidelity"
version = "v0.2"
description = "Audio-based evaluation of agent entity fidelity for S2S models"
version = "v0.4"

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In AgentSpeechFidelityMetric (the other variant of the metric), we had version = "v0.3", so I'm bumping above that.

description = "Audio-based evaluation of agent entity fidelity"
category = "accuracy"
role = "assistant"
rating_scale = (0, 1)
Expand Down Expand Up @@ -63,7 +65,6 @@ async def compute(self, context: MetricContext) -> MetricScore:
audio_b64 = self.encode_audio_segment(audio_segment)

prompt = self.get_judge_prompt(
prompt_key="s2s_user_prompt",
conversation_trace_formatted=trace_formatted,
expected_language=context.language_display_name,
)
Expand Down Expand Up @@ -145,7 +146,6 @@ async def compute(self, context: MetricContext) -> MetricScore:
avg_rating = sum(valid_ratings) / len(valid_ratings) if valid_ratings else None

details: dict[str, Any] = {
"variant": "s2s",
"aggregation": self.aggregation,
"num_turns": num_turns,
"num_evaluated": len(valid_ratings),
Expand Down
1 change: 1 addition & 0 deletions src/eva/metrics/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -167,6 +167,7 @@ class BaseMetric(ABC):
metric_type: MetricType = MetricType.CODE # Override in subclasses
pass_at_k_threshold: float = 0.5 # Normalized score threshold for pass@k pass/fail
exclude_from_pass_at_k: bool = False # Set True for metrics not suitable for pass@k
exclude_from_default_metrics: bool = False
supported_pipeline_types: frozenset[PipelineType] = frozenset(PipelineType) # Pipeline types this metric supports
# Bump on intentional logic changes; MetricsRunner stamps this onto every MetricScore
# produced by compute(). Required on all concrete subclasses — drift test enforces.
Expand Down
2 changes: 2 additions & 0 deletions src/eva/metrics/diagnostic/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@
from . import stt_wer # noqa
from . import tool_call_validity # noqa
from . import transcription_accuracy_key_entities # noqa
from . import tts_fidelity # noqa

__all__ = [
"authentication_success",
Expand All @@ -18,4 +19,5 @@
"stt_wer",
"tool_call_validity",
"transcription_accuracy_key_entities",
"tts_fidelity",
]
Loading
Loading