ServiceNow · JosephMarinier · Jun 10, 2026
diff --git a/configs/prompts/judge.yaml b/configs/prompts/judge.yaml
@@ -640,7 +640,7 @@ judge:
         }}
       ]
 
-  agent_speech_fidelity:
+  tts_fidelity:
     user_prompt: |
         You are an expert evaluator judging the fidelity of this audio file against the intended text. 
         You will listen to one audio clip and verify that the spoken content faithfully reproduces the intended text, with special attention to TTS-critical entities.
@@ -737,7 +737,8 @@ judge:
             }}
           ]
         }}
-    s2s_user_prompt: |
+  agent_speech_fidelity:
+    user_prompt: |
         You are an expert evaluator checking the **speech clarity and articulation** of entities spoken by an AI voice agent.
 
         You will receive:

diff --git a/docs/metrics/README.md b/docs/metrics/README.md
@@ -36,7 +36,7 @@ Measures whether the agent accomplished the user's goal correctly:
 |--------|------|-------------|-------------|
 | [`task_completion`](task_completion.md) | Deterministic | Speech Recognition, Language Model | Binary pass/fail via scenario DB state hash comparison (0-1) |
 | [`faithfulness`](faithfulness.md) | Judge (Claude Opus) | Speech Recognition (audio-native only), Language Model | Faithfulness to information, policies, and instructions (1-3) |
-| [`agent_speech_fidelity`](agent_speech_fidelity.md) | Audio Judge (Gemini) | Speech Synthesis | Whether assistant speech audio matches intended text (0-1) |
+| [`agent_speech_fidelity`](agent_speech_fidelity.md) | Audio Judge (Gemini) | Speech Synthesis | Whether key entities in the assistant's speech audio are articulated accurately (0-1) |
 
 ### Experience (3 metrics)
 
@@ -48,12 +48,13 @@ Measures the quality of the user's conversational experience:
 | [`conciseness`](conciseness.md) | Judge | Language Model | Whether responses are appropriately concise for voice (1-3) |
 | [`conversation_progression`](conversation_progression.md) | Judge | Language Model | Whether assistant moves conversation forward without repetition (1-3) |
 
-### Diagnostic (6 metrics)
+### Diagnostic (7 metrics)
 
 Metrics that help isolate root causes of failures. These provide signals for understanding what went wrong, but are not directly used in final evaluation scores.
 
 | Metric | Type | Capabilities | Description |
 |--------|------|-------------|-------------|
+| [`tts_fidelity`](tts_fidelity.md) | Audio Judge (Gemini) | Speech Synthesis | Whether assistant speech audio matches intended text (0-1). **Opt-in** — excluded from the default run; enable via `--metrics tts_fidelity`. |
 | [`authentication_success`](authentication_success.md) | Deterministic | Speech Recognition, Language Model | Whether get_reservation was called successfully (0-1) |
 | [`response_speed`](response_speed.md) | Deterministic | VAD, Pipeline | Latency between user utterance end and assistant response start (seconds) |
 | [`speakability`](speakability.md) | Judge | Language Model | Whether text is voice-friendly and appropriate for TTS (0-1) |
@@ -80,7 +81,7 @@ Each metric implements `BaseMetric.compute(context: MetricContext) -> MetricScor
 **LLM-as-Judge:**
 - Integer/boolean ratings (not decimals) to avoid precision issues
 - Structured prompts in `configs/prompts/judge.yaml`
-- GPT-5.2 for text judges, Gemini 3.1 Pro for audio judges, Claude Opus for faithfulness
+- GPT-5.2 for text judges, Gemini 3 Flash for audio judges, Claude Opus for faithfulness
 
 **Audio Evaluation:**
 - Audio encoded as base64 WAV, sent to Gemini via LiteLLM
@@ -122,10 +123,10 @@ python main.py \
     --run-id <existing_run_id> \
     --metrics turn_taking,conciseness,conversation_progression
 
-# Run diagnostic metrics
+# Run diagnostic metrics (tts_fidelity is opt-in and only runs when named explicitly)
 python main.py \
     --run-id <existing_run_id> \
-    --metrics authentication_success,response_speed,speakability,stt_wer,tool_call_validity,transcription_accuracy_key_entities
+    --metrics authentication_success,response_speed,speakability,stt_wer,tool_call_validity,transcription_accuracy_key_entities,tts_fidelity
 
 # Run validation metrics
 python main.py \

diff --git a/docs/metrics/agent_speech_fidelity.md b/docs/metrics/agent_speech_fidelity.md
@@ -1,56 +1,42 @@
 # Agent Speech Fidelity
 
-> **Accuracy Metric**: If the agent's spoken audio doesn't match what it intended to say, the user receives incorrect information regardless of how good the text reasoning was.
+> **Accuracy Metric**: If the agent's spoken audio garbles or misstates key information, the user receives incorrect information regardless of how good the text reasoning was.
 
 ## Overview
 
-Audio-based metric that evaluates whether the assistant's **spoken audio** accurately represents the intended text, using Gemini for multimodal analysis. This metric evaluates the speech output regardless of how it was produced — whether by a separate TTS engine or generated directly by an audio-native model. Specifically, it checks that all words from the intended text are present (no missing words), no extra words were added (no insertions), words are spoken correctly (no substitutions), and key entities are accurately conveyed (dates, names, numbers, codes, addresses).
+Audio-based metric that evaluates whether the assistant's **spoken audio** accurately represents the **key entities** (dates, names, numbers, codes, addresses, etc.), using Gemini for multimodal analysis.
+
+To keep the EVA score **apples-to-apples across all pipeline setups**, the same entity-focused metric runs for every pipeline type — cascade, S2S, and audio-LLM. It does not require any intended text.
 
 ### Capabilities Measured
 
-- **Speech Synthesis**: Measures whether the TTS engine (cascade) or the model's direct audio generation (audio-native) accurately produces the intended text as spoken audio.
+- **Speech Synthesis**: Measures whether the assistant's spoken audio accurately represents the key entities.
 
 ## How It Works
 
 ### Evaluation Method
 
 - **Type**: Audio Judge (multimodal LLM with audio input)
-- **Model**: Gemini 3.1 Pro
+- **Model**: Gemini 3 Flash
 - **Granularity**: Per-turn (each assistant turn evaluated independently)
 
 ### Input Data
 
 Uses the following MetricContext fields:
 - `audio_assistant_path`: Path to assistant-only audio file
-- `intended_assistant_turns`: What the assistant intended to say
-
-### Audio-Native vs Cascade
-
-The evaluation is the same in both cases — compare `intended_assistant_turns` against the actual spoken audio. The only difference is where the intended text comes from:
-
-- **Cascade**: The intended text is the input to the TTS engine (i.e., the LLM's text output).
-- **Audio-native (S2S, S2T+TTS):** The intended text is the text output that the audio-native model returns alongside its generated speech.
+- `conversation_trace`: User utterances and tool responses are kept as-is (the sources of the entities to listen for); assistant turns are **redacted** to a placeholder so the judge evaluates articulation, not whether the agent gave the "right" answer.
 
 ### Evaluation Methodology
 
-The judge compares intended text against spoken audio, focusing on:
-
-- **TTS-critical entities**: Names, dates, times, codes, dollar amounts, flight numbers — these are the highest-priority items
-- **Error types**: Missing words, added words, wrong words, entity errors
-
-**Special handling:**
-- Minor pronunciation variations that don't change meaning are acceptable
-- Filler words (um, uh) that don't affect core content are ignored
-- Interruption tags (e.g., `[likely cut off by user]`, `[assistant interrupts]`) are non-spoken metadata in the intended text — words in regions flagged by these tags as likely not spoken are not penalized
-- Missing words at END of LAST turn only are not penalized (audio cutoff)
+The judge receives the agent audio plus a redacted conversation trace and, for each assistant turn, checks whether the spoken audio accurately represents the key entities: names, dates, times, codes, dollar amounts, flight numbers, etc. Turns with no entities to evaluate are flagged (`has_entities = false`) and **excluded** from scoring.
 
 ### Scoring
 
 - **Scale**: 0-1 (binary per turn)
-  - 1: High Fidelity — audio accurately says all words from intended text
-  - 0: Low Fidelity — missing, added, or wrong words detected
+  - 1: Entities clearly articulated
+  - 0: An entity is unclear, garbled, or wrongly articulated
 - **Normalization**: Already 0-1 scale
-- **Aggregation**: Mean across all assistant turns
+- **Aggregation**: Mean across scored assistant turns (turns with no entities are skipped)
 
 ## Example Output
 
@@ -62,25 +48,27 @@ The judge compares intended text against spoken audio, focusing on:
   "details": {
     "aggregation": "mean",
     "num_turns": 7,
-    "num_evaluated": 7,
-    "per_turn_ratings": {"0": 1, "1": 1, "2": 1, "3": 0, "4": 1, "5": 1, "6": 1},
+    "num_evaluated": 4,
+    "num_skipped_no_entities": 3,
+    "per_turn_ratings": {"0": 1, "1": 1, "3": 0, "5": 1},
+    "per_turn_has_entities": {"0": true, "1": true, "2": false, "3": true, "4": false, "5": true, "6": false},
     "per_turn_explanations": {
-      "3": "Missing word: intended 'flight SW102' but audio said 'flight SW12'. Key entity error."
+      "3": "Confirmation code unclear: heard 'ZK three F F' but trace shows 'ZK3FFW'."
     }
   }
 }
 ```
 
 ## Related Metrics
 
-- [user_speech_fidelity.md](user_speech_fidelity.md) - Same metric for the user simulator side
+- [tts_fidelity.md](tts_fidelity.md) - Stricter, word-for-word diagnostic metric, only for pipelines that expose intended text
 - [faithfulness.md](faithfulness.md) - Faithfulness for the text layer: evaluates whether the assistant's responses are grounded in instructions, policies, and tool results
 - [speakability.md](speakability.md) - Checks if text is voice-friendly (upstream concern)
 
 ## Implementation Details
 
-- **File**: `src/eva/metrics/accuracy/agent_speech_fidelity.py`
-- **Class**: `AgentSpeechFidelityMetric`
+- **File**: `src/eva/metrics/accuracy/speech_fidelity.py`
+- **Class**: `SpeechFidelityMetric`
 - **Base Class**: `SpeechFidelityBaseMetric` → `AudioJudgeMetric`
 - **Prompt**: `configs/prompts/judge.yaml` under `judge.agent_speech_fidelity`
-- **Configuration**: `audio_judge_model` (default: Gemini 3.1 Pro), `aggregation` (default: "mean")
+- **Configuration**: `audio_judge_model` (default: Gemini 3 Flash), `aggregation` (default: "mean")
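
The redaction step described in `agent_speech_fidelity.md` can be illustrated with a minimal sketch. It is not the code in `speech_fidelity.py`: the `TraceEntry` type, the role labels, and the placeholder string are all assumptions made for this example. It only shows the idea that user and tool entries pass through unchanged while assistant content is replaced.

```python
from dataclasses import dataclass

# Hypothetical placeholder text; the real prompt may use a different marker.
REDACTED = "[assistant turn redacted]"


@dataclass(frozen=True)
class TraceEntry:
    """One entry of a conversation trace (assumed shape, for illustration only)."""

    role: str  # assumed labels: "user", "tool", or "assistant"
    content: str


def redact_assistant_turns(trace: list[TraceEntry]) -> list[TraceEntry]:
    """Return a new trace where assistant content is replaced by a placeholder.

    User utterances and tool responses are kept as-is, since they are the
    sources of the entities the judge should listen for.
    """
    return [
        TraceEntry(entry.role, REDACTED) if entry.role == "assistant" else entry
        for entry in trace
    ]


trace = [
    TraceEntry("user", "My confirmation code is ZK3FFW."),
    TraceEntry("tool", '{"confirmation": "ZK3FFW", "flight": "SW102"}'),
    TraceEntry("assistant", "I found your booking ZK3FFW on flight SW102."),
]
redacted = redact_assistant_turns(trace)
```

Because the assistant text is hidden, the judge can only compare what it hears against entities that came from the user or from tools, which is what keeps the check about articulation rather than answer correctness.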
diff --git a/docs/metrics/tts_fidelity.md b/docs/metrics/tts_fidelity.md
@@ -0,0 +1,82 @@
+# TTS Fidelity
+
+> **Diagnostic Metric**: If the agent's spoken audio doesn't match what it intended to say, the user receives incorrect information regardless of how good the text reasoning was.
+
+## Overview
+
+Audio-based metric that evaluates whether the assistant's **spoken audio** accurately represents the intended text, using Gemini for multimodal analysis. This metric evaluates the speech output regardless of how it was produced — whether by a separate TTS engine or generated directly by an audio-native model. Specifically, it checks that all words from the intended text are present (no missing words), no extra words were added (no insertions), words are spoken correctly (no substitutions), and key entities are accurately conveyed (dates, names, numbers, codes, addresses).
+
+> [!NOTE]
+> By default, this diagnostic metric is excluded. Enable it explicitly with `--metrics tts_fidelity` (or include it in a comma-separated `--metrics` list).
+
+### Capabilities Measured
+
+- **Speech Synthesis**: Measures whether the TTS engine (cascade) or the model's direct audio generation (audio-native) accurately produces the intended text as spoken audio.
+
+## How It Works
+
+### Evaluation Method
+
+- **Type**: Audio Judge (multimodal LLM with audio input)
+- **Model**: Gemini 3 Flash
+- **Granularity**: Per-turn (each assistant turn evaluated independently)
+
+### Input Data
+
+Uses the following MetricContext fields:
+- `audio_assistant_path`: Path to assistant-only audio file
+- `intended_assistant_turns`: What the assistant intended to say
+
+### Evaluation Methodology
+
+The judge compares intended text against spoken audio, focusing on:
+
+- **TTS-critical entities**: Names, dates, times, codes, dollar amounts, flight numbers — these are the highest-priority items
+- **Error types**: Missing words, added words, wrong words, entity errors
+
+**Special handling:**
+- Minor pronunciation variations that don't change meaning are acceptable
+- Filler words (um, uh) that don't affect core content are ignored
+- Interruption tags (e.g., `[likely cut off by user]`, `[assistant interrupts]`) are non-spoken metadata in the intended text — words in regions flagged by these tags as likely not spoken are not penalized
+- Missing words at END of LAST turn only are not penalized (audio cutoff)
+
+### Scoring
+
+- **Scale**: 0-1 (binary per turn)
+  - 1: High Fidelity — audio accurately says all words from intended text
+  - 0: Low Fidelity — missing, added, or wrong words detected
+- **Normalization**: Already 0-1 scale
+- **Aggregation**: Mean across all assistant turns
+
+## Example Output
+
+```json
+{
+  "name": "tts_fidelity",
+  "score": 0.857,
+  "normalized_score": 0.857,
+  "details": {
+    "aggregation": "mean",
+    "num_turns": 7,
+    "num_evaluated": 7,
+    "per_turn_ratings": {"0": 1, "1": 1, "2": 1, "3": 0, "4": 1, "5": 1, "6": 1},
+    "per_turn_explanations": {
+      "3": "Missing word: intended 'flight SW102' but audio said 'flight SW12'. Key entity error."
+    }
+  }
+}
+```
+
+## Related Metrics
+
+- [user_speech_fidelity.md](user_speech_fidelity.md) - Same metric for the user simulator side
+- [faithfulness.md](faithfulness.md) - Faithfulness for the text layer: evaluates whether the assistant's responses are grounded in instructions, policies, and tool results
+- [speakability.md](speakability.md) - Checks if text is voice-friendly (upstream concern)
+
+## Implementation Details
+
+- **File**: `src/eva/metrics/diagnostic/tts_fidelity.py`
+- **Class**: `TTSFidelityMetric`
+- **Base Class**: `SpeechFidelityBaseMetric` → `AudioJudgeMetric`
+- **Prompt**: `configs/prompts/judge.yaml` under `judge.tts_fidelity`
+- **Configuration**: `audio_judge_model` (default: Gemini 3 Flash), `aggregation` (default: "mean")
diff --git a/docs/metrics/user_speech_fidelity.md b/docs/metrics/user_speech_fidelity.md
@@ -11,7 +11,7 @@ Audio-based validation metric that evaluates whether the user simulator's **spok
 ### Evaluation Method
 
 - **Type**: Audio Judge (multimodal LLM with audio input)
-- **Model**: Gemini 3.1 Pro
+- **Model**: Gemini 3 Flash
 - **Granularity**: Per-turn (each user turn evaluated independently)
 
 ### Input Data
@@ -67,5 +67,5 @@ This metric uses a 1-3 scale instead of binary 0-1 (like agent speech fidelity)
 - **Prompt location**: `configs/prompts/judge.yaml` under `judge.user_speech_fidelity`
   - Uses the same speech fidelity prompt structure as `agent_speech_fidelity` but with `evaluation_mode="user"` and user turns
 - **Configuration options**:
-  - `audio_judge_model`: LLM model (default: Gemini 3.1 Pro)
+  - `audio_judge_model`: LLM model (default: Gemini 3 Flash)
   - `aggregation`: Aggregation method (default: "mean")
diff --git a/src/eva/__init__.py b/src/eva/__init__.py
@@ -11,4 +11,4 @@
 
 # Bump metrics_version when changes affect metric computation (metrics code,
 # judge prompts, pricing tables, postprocessor).
-metrics_version = "2.1.2"
+metrics_version = "2.2.0"
diff --git a/src/eva/metrics/accuracy/__init__.py b/src/eva/metrics/accuracy/__init__.py
@@ -1,13 +1,11 @@
 """Task completion metrics - measuring whether the agent accomplished the user's goal."""
 
-from . import agent_speech_fidelity  # noqa
-from . import agent_speech_fidelity_s2s  # noqa
 from . import faithfulness  # noqa
+from . import speech_fidelity  # noqa
 from . import task_completion  # noqa
 
 __all__ = [
-    "agent_speech_fidelity",
-    "agent_speech_fidelity_s2s",
     "faithfulness",
+    "speech_fidelity",
     "task_completion",
 ]
diff --git a/...ics/accuracy/agent_speech_fidelity_s2s.py → src/eva/metrics/accuracy/speech_fidelity.py b/...ics/accuracy/agent_speech_fidelity_s2s.py → src/eva/metrics/accuracy/speech_fidelity.py
@@ -1,7 +1,7 @@
-"""Agent speech fidelity metric for S2S models — entity-focused evaluation.
+"""Agent speech fidelity metric — entity-focused, pipeline-agnostic evaluation.
 
-For S2S (speech-to-speech) models, there is no intended text to compare against.
-Instead, this metric verifies that key entities spoken by the agent (from tool
+Because S2S (speech-to-speech) models expose no intended text to compare against,
+this metric instead verifies that key entities spoken by the agent (from tool
 responses and user utterances) are accurate by sending a redacted conversation
 trace alongside the agent audio to Gemini.
 """
@@ -10,13 +10,15 @@
 from typing import Any
 
 from eva.metrics.base import MetricContext
+from eva.metrics.registry import register_metric
 from eva.metrics.speech_fidelity_base import SpeechFidelityBaseMetric
 from eva.metrics.utils import aggregate_per_turn_scores, normalize_rating, resolve_turn_id
 from eva.models.results import MetricScore
 
 
-class AgentSpeechFidelityS2SMetric(SpeechFidelityBaseMetric):
-    """Audio-based entity fidelity metric for S2S agent speech.
+@register_metric
+class SpeechFidelityMetric(SpeechFidelityBaseMetric):
+    """Audio-based entity fidelity metric for agent speech.
 
     Evaluates whether key entities (from tool responses and user utterances) are
     spoken correctly by the agent, without requiring intended text.
@@ -25,8 +27,8 @@ class AgentSpeechFidelityS2SMetric(SpeechFidelityBaseMetric):
     """
 
     name = "agent_speech_fidelity"
-    version = "v0.2"
-    description = "Audio-based evaluation of agent entity fidelity for S2S models"
+    version = "v0.4"
+    description = "Audio-based evaluation of agent entity fidelity"
     category = "accuracy"
     role = "assistant"
     rating_scale = (0, 1)
@@ -63,7 +65,6 @@ async def compute(self, context: MetricContext) -> MetricScore:
             audio_b64 = self.encode_audio_segment(audio_segment)
 
             prompt = self.get_judge_prompt(
-                prompt_key="s2s_user_prompt",
                 conversation_trace_formatted=trace_formatted,
                 expected_language=context.language_display_name,
             )
@@ -145,7 +146,6 @@ async def compute(self, context: MetricContext) -> MetricScore:
             avg_rating = sum(valid_ratings) / len(valid_ratings) if valid_ratings else None
 
             details: dict[str, Any] = {
-                "variant": "s2s",
                 "aggregation": self.aggregation,
                 "num_turns": num_turns,
                 "num_evaluated": len(valid_ratings),

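The aggregation in the hunk above (mean over valid ratings, `None` when there are none) combined with the documented skipping of turns without entities can be sketched as follows. This is a standalone illustration, not the repository's `aggregate_per_turn_scores`; the function name `aggregate_entity_ratings` and the shape of its inputs are assumptions.

```python
from typing import Any, Optional


def aggregate_entity_ratings(
    ratings: dict[int, int],
    has_entities: dict[int, bool],
) -> dict[str, Any]:
    """Average binary per-turn ratings, ignoring turns flagged as having no entities.

    `ratings` maps turn id -> 0 or 1 for turns the judge rated.
    `has_entities` maps every turn id -> whether the turn had entities to check.
    """
    # Keep only ratings for turns that actually contained entities.
    scored = [
        rating
        for turn_id, rating in sorted(ratings.items())
        if has_entities.get(turn_id, False)
    ]
    skipped = sum(1 for flag in has_entities.values() if not flag)
    # Mirror the guard in the diff: no valid ratings means no score.
    score: Optional[float] = sum(scored) / len(scored) if scored else None
    return {
        "score": score,
        "num_turns": len(has_entities),
        "num_evaluated": len(scored),
        "num_skipped_no_entities": skipped,
    }


result = aggregate_entity_ratings(
    ratings={0: 1, 1: 1, 3: 0, 5: 1},
    has_entities={0: True, 1: True, 2: False, 3: True, 4: False, 5: True, 6: False},
)
print(result["score"])  # 0.75 (three of the four scored turns passed)
```

Skipped turns do not count as passes or failures, so a conversation with few entity-bearing turns is scored over a smaller denominator rather than being inflated by trivially correct turns.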
diff --git a/src/eva/metrics/base.py b/src/eva/metrics/base.py
@@ -167,6 +167,7 @@ class BaseMetric(ABC):
     metric_type: MetricType = MetricType.CODE  # Override in subclasses
     pass_at_k_threshold: float = 0.5  # Normalized score threshold for pass@k pass/fail
     exclude_from_pass_at_k: bool = False  # Set True for metrics not suitable for pass@k
+    exclude_from_default_metrics: bool = False  # Set True for opt-in metrics that run only when named via --metrics
     supported_pipeline_types: frozenset[PipelineType] = frozenset(PipelineType)  # Pipeline types this metric supports
     # Bump on intentional logic changes; MetricsRunner stamps this onto every MetricScore
     # produced by compute(). Required on all concrete subclasses — drift test enforces.
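
How a runner might honor `exclude_from_default_metrics` is not shown in this diff, so the following is only a sketch under assumptions: the `REGISTRY` dict, the `select_metrics` helper, and the stub classes are hypothetical, and it assumes the `tts_fidelity` metric sets the flag to `True`, as the docs' "opt-in" wording suggests.

```python
from typing import Optional


class BaseMetric:
    """Stub standing in for eva.metrics.base.BaseMetric (illustration only)."""

    name = "base"
    exclude_from_default_metrics = False


class ResponseSpeed(BaseMetric):
    name = "response_speed"


class TTSFidelity(BaseMetric):
    name = "tts_fidelity"
    exclude_from_default_metrics = True  # assumed: opt-in diagnostic metric


# Hypothetical registry keyed by metric name.
REGISTRY = {cls.name: cls for cls in (ResponseSpeed, TTSFidelity)}


def select_metrics(requested: Optional[list[str]]) -> list[str]:
    """Resolve which metrics to run.

    An explicit list is honored as given, so opt-in metrics run when named.
    With no list, every registered metric runs except the opt-in ones.
    """
    if requested:
        unknown = [name for name in requested if name not in REGISTRY]
        if unknown:
            raise ValueError(f"Unknown metrics: {unknown}")
        return list(requested)
    return [
        name
        for name, cls in REGISTRY.items()
        if not cls.exclude_from_default_metrics
    ]


default_run = select_metrics(None)
explicit_run = select_metrics(["tts_fidelity"])
```

Keeping the flag on the class rather than in a separate allowlist means a new opt-in metric only needs one attribute set next to its other metadata.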

diff --git a/src/eva/metrics/diagnostic/__init__.py b/src/eva/metrics/diagnostic/__init__.py
@@ -8,6 +8,7 @@
 from . import stt_wer  # noqa
 from . import tool_call_validity  # noqa
 from . import transcription_accuracy_key_entities  # noqa
+from . import tts_fidelity  # noqa
 
 __all__ = [
     "authentication_success",
@@ -18,4 +19,5 @@
     "stt_wer",
     "tool_call_validity",
     "transcription_accuracy_key_entities",
+    "tts_fidelity",
 ]