diff --git a/configs/prompts/judge.yaml b/configs/prompts/judge.yaml index 86fa9ce3..a7a7f2db 100644 --- a/configs/prompts/judge.yaml +++ b/configs/prompts/judge.yaml @@ -640,7 +640,7 @@ judge: }} ] - agent_speech_fidelity: + tts_fidelity: user_prompt: | You are an expert evaluator judging the fidelity of this audio file against the intended text. You will listen to one audio clip and verify that the spoken content faithfully reproduces the intended text, with special attention to TTS-critical entities. @@ -737,7 +737,8 @@ judge: }} ] }} - s2s_user_prompt: | + agent_speech_fidelity: + user_prompt: | You are an expert evaluator checking the **speech clarity and articulation** of entities spoken by an AI voice agent. You will receive: diff --git a/docs/metrics/README.md b/docs/metrics/README.md index 65a26c55..277a1e7d 100644 --- a/docs/metrics/README.md +++ b/docs/metrics/README.md @@ -36,7 +36,7 @@ Measures whether the agent accomplished the user's goal correctly: |--------|------|-------------|-------------| | [`task_completion`](task_completion.md) | Deterministic | Speech Recognition, Language Model | Binary pass/fail via scenario DB state hash comparison (0-1) | | [`faithfulness`](faithfulness.md) | Judge (Claude Opus) | Speech Recognition (audio-native only), Language Model | Faithfulness to information, policies, and instructions (1-3) | -| [`agent_speech_fidelity`](agent_speech_fidelity.md) | Audio Judge (Gemini) | Speech Synthesis | Whether assistant speech audio matches intended text (0-1) | +| [`agent_speech_fidelity`](agent_speech_fidelity.md) | Audio Judge (Gemini) | Speech Synthesis | Whether key entities in the assistant's speech audio are accurately articulated (0-1) | ### Experience (3 metrics) @@ -48,12 +48,13 @@ Measures the quality of the user's conversational experience: | [`conciseness`](conciseness.md) | Judge | Language Model | Whether responses are appropriately concise for voice (1-3) | | [`conversation_progression`](conversation_progression.md) 
| Judge | Language Model | Whether assistant moves conversation forward without repetition (1-3) | -### Diagnostic (6 metrics) +### Diagnostic (7 metrics) Metrics that help isolate root causes of failures. These provide signals for understanding what went wrong, but are not directly used in final evaluation scores. | Metric | Type | Capabilities | Description | |--------|------|-------------|-------------| +| [`tts_fidelity`](tts_fidelity.md) | Audio Judge (Gemini) | Speech Synthesis | Whether assistant speech audio matches intended text (0-1). **Opt-in** — excluded from the default run; enable via `--metrics tts_fidelity`. | | [`authentication_success`](authentication_success.md) | Deterministic | Speech Recognition, Language Model | Whether get_reservation was called successfully (0-1) | | [`response_speed`](response_speed.md) | Deterministic | VAD, Pipeline | Latency between user utterance end and assistant response start (seconds) | | [`speakability`](speakability.md) | Judge | Language Model | Whether text is voice-friendly and appropriate for TTS (0-1) | @@ -80,7 +81,7 @@ Each metric implements `BaseMetric.compute(context: MetricContext) -> MetricScor **LLM-as-Judge:** - Integer/boolean ratings (not decimals) to avoid precision issues - Structured prompts in `configs/prompts/judge.yaml` -- GPT-5.2 for text judges, Gemini 3.1 Pro for audio judges, Claude Opus for faithfulness +- GPT-5.2 for text judges, Gemini 3 Flash for audio judges, Claude Opus for faithfulness **Audio Evaluation:** - Audio encoded as base64 WAV, sent to Gemini via LiteLLM @@ -122,10 +123,10 @@ python main.py \ --run-id \ --metrics turn_taking,conciseness,conversation_progression -# Run diagnostic metrics +# Run diagnostic metrics (tts_fidelity is opt-in and only runs when named explicitly) python main.py \ --run-id \ - --metrics authentication_success,response_speed,speakability,stt_wer,tool_call_validity,transcription_accuracy_key_entities + --metrics 
authentication_success,response_speed,speakability,stt_wer,tool_call_validity,transcription_accuracy_key_entities,tts_fidelity # Run validation metrics python main.py \ diff --git a/docs/metrics/agent_speech_fidelity.md b/docs/metrics/agent_speech_fidelity.md index 07d9dddd..ad52a058 100644 --- a/docs/metrics/agent_speech_fidelity.md +++ b/docs/metrics/agent_speech_fidelity.md @@ -1,56 +1,42 @@ # Agent Speech Fidelity -> **Accuracy Metric**: If the agent's spoken audio doesn't match what it intended to say, the user receives incorrect information regardless of how good the text reasoning was. +> **Accuracy Metric**: If the agent's spoken audio garbles or misstates key information, the user receives incorrect information regardless of how good the text reasoning was. ## Overview -Audio-based metric that evaluates whether the assistant's **spoken audio** accurately represents the intended text, using Gemini for multimodal analysis. This metric evaluates the speech output regardless of how it was produced — whether by a separate TTS engine or generated directly by an audio-native model. Specifically, it checks that all words from the intended text are present (no missing words), no extra words were added (no insertions), words are spoken correctly (no substitutions), and key entities are accurately conveyed (dates, names, numbers, codes, addresses). +Audio-based metric that evaluates whether the assistant's **spoken audio** accurately represents the **key entities** (dates, names, numbers, codes, addresses, etc.), using Gemini for multimodal analysis. + +To keep the EVA score **apples-to-apples across all pipeline setups**, the same entity-focused metric runs for every pipeline type — cascade, S2S, and audio-LLM. It does not require any intended text. ### Capabilities Measured -- **Speech Synthesis**: Measures whether the TTS engine (cascade) or the model's direct audio generation (audio-native) accurately produces the intended text as spoken audio. 
+- **Speech Synthesis**: Measures whether the assistant's spoken audio accurately represents the key entities. ## How It Works ### Evaluation Method - **Type**: Audio Judge (multimodal LLM with audio input) -- **Model**: Gemini 3.1 Pro +- **Model**: Gemini 3 Flash - **Granularity**: Per-turn (each assistant turn evaluated independently) ### Input Data Uses the following MetricContext fields: - `audio_assistant_path`: Path to assistant-only audio file -- `intended_assistant_turns`: What the assistant intended to say - -### Audio-Native vs Cascade - -The evaluation is the same in both cases — compare `intended_assistant_turns` against the actual spoken audio. The only difference is where the intended text comes from: - -- **Cascade**: The intended text is the input to the TTS engine (i.e., the LLM's text output). -- **Audio-native (S2S, S2T+TTS):** The intended text is the text output that the audio-native model returns alongside its generated speech. +- `conversation_trace`: User utterances and tool responses are kept as-is (the sources of the entities to listen for); assistant turns are **redacted** to a placeholder so the judge evaluates articulation, not whether the agent gave the "right" answer. 
### Evaluation Methodology -The judge compares intended text against spoken audio, focusing on: - -- **TTS-critical entities**: Names, dates, times, codes, dollar amounts, flight numbers — these are the highest-priority items -- **Error types**: Missing words, added words, wrong words, entity errors - -**Special handling:** -- Minor pronunciation variations that don't change meaning are acceptable -- Filler words (um, uh) that don't affect core content are ignored -- Interruption tags (e.g., `[likely cut off by user]`, `[assistant interrupts]`) are non-spoken metadata in the intended text — words in regions flagged by these tags as likely not spoken are not penalized -- Missing words at END of LAST turn only are not penalized (audio cutoff) +The judge receives the agent audio plus a redacted conversation trace and, for each assistant turn, checks whether the spoken audio accurately represents the key entities: names, dates, times, codes, dollar amounts, flight numbers, etc. Turns with no entities to evaluate are flagged (`has_entities = false`) and **excluded** from scoring. 
### Scoring - **Scale**: 0-1 (binary per turn) - - 1: High Fidelity — audio accurately says all words from intended text - - 0: Low Fidelity — missing, added, or wrong words detected + - 1: Entities clearly articulated + - 0: An entity is unclear, garbled, or wrongly articulated - **Normalization**: Already 0-1 scale -- **Aggregation**: Mean across all assistant turns +- **Aggregation**: Mean across scored assistant turns (turns with no entities are skipped) ## Example Output @@ -62,10 +48,12 @@ The judge compares intended text against spoken audio, focusing on: "details": { "aggregation": "mean", "num_turns": 7, - "num_evaluated": 7, - "per_turn_ratings": {"0": 1, "1": 1, "2": 1, "3": 0, "4": 1, "5": 1, "6": 1}, + "num_evaluated": 4, + "num_skipped_no_entities": 3, + "per_turn_ratings": {"0": 1, "1": 1, "3": 0, "5": 1}, + "per_turn_has_entities": {"0": true, "1": true, "2": false, "3": true, "4": false, "5": true, "6": false}, "per_turn_explanations": { - "3": "Missing word: intended 'flight SW102' but audio said 'flight SW12'. Key entity error." + "3": "Confirmation code unclear: heard 'ZK three F F' but trace shows 'ZK3FFW'." 
} } } @@ -73,14 +61,14 @@ The judge compares intended text against spoken audio, focusing on: ## Related Metrics -- [user_speech_fidelity.md](user_speech_fidelity.md) - Same metric for the user simulator side +- [tts_fidelity.md](tts_fidelity.md) - Stricter, word-for-word diagnostic metric, only for pipelines that expose intended text - [faithfulness.md](faithfulness.md) - Faithfulness for the text layer: evaluates whether the assistant's responses are grounded in instructions, policies, and tool results - [speakability.md](speakability.md) - Checks if text is voice-friendly (upstream concern) ## Implementation Details -- **File**: `src/eva/metrics/accuracy/agent_speech_fidelity.py` -- **Class**: `AgentSpeechFidelityMetric` +- **File**: `src/eva/metrics/accuracy/speech_fidelity.py` +- **Class**: `SpeechFidelityMetric` - **Base Class**: `SpeechFidelityBaseMetric` → `AudioJudgeMetric` - **Prompt**: `configs/prompts/judge.yaml` under `judge.agent_speech_fidelity` -- **Configuration**: `audio_judge_model` (default: Gemini 3.1 Pro), `aggregation` (default: "mean") +- **Configuration**: `audio_judge_model` (default: Gemini 3 Flash), `aggregation` (default: "mean") diff --git a/docs/metrics/tts_fidelity.md b/docs/metrics/tts_fidelity.md new file mode 100644 index 00000000..e031914b --- /dev/null +++ b/docs/metrics/tts_fidelity.md @@ -0,0 +1,82 @@ +# TTS Fidelity + +> **Diagnostic Metric**: If the agent's spoken audio doesn't match what it intended to say, the user receives incorrect information regardless of how good the text reasoning was. + +## Overview + +Audio-based metric that evaluates whether the assistant's **spoken audio** accurately represents the intended text, using Gemini for multimodal analysis. This metric evaluates the speech output regardless of how it was produced — whether by a separate TTS engine or generated directly by an audio-native model. 
Specifically, it checks that all words from the intended text are present (no missing words), no extra words were added (no insertions), words are spoken correctly (no substitutions), and key entities are accurately conveyed (dates, names, numbers, codes, addresses). + +> [!NOTE] +> By default, this diagnostic metric is excluded. Enable it explicitly with `--metrics tts_fidelity` (or include it in a comma-separated `--metrics` list). + +### Capabilities Measured + +- **Speech Synthesis**: Measures whether the TTS engine (cascade) or the model's direct audio generation (audio-native) accurately produces the intended text as spoken audio. + +## How It Works + +### Evaluation Method + +- **Type**: Audio Judge (multimodal LLM with audio input) +- **Model**: Gemini 3 Flash +- **Granularity**: Per-turn (each assistant turn evaluated independently) + +### Input Data + +Uses the following MetricContext fields: +- `audio_assistant_path`: Path to assistant-only audio file +- `intended_assistant_turns`: What the assistant intended to say + +### Evaluation Methodology + +The judge compares intended text against spoken audio, focusing on: + +- **TTS-critical entities**: Names, dates, times, codes, dollar amounts, flight numbers — these are the highest-priority items +- **Error types**: Missing words, added words, wrong words, entity errors + +**Special handling:** +- Minor pronunciation variations that don't change meaning are acceptable +- Filler words (um, uh) that don't affect core content are ignored +- Interruption tags (e.g., `[likely cut off by user]`, `[assistant interrupts]`) are non-spoken metadata in the intended text — words in regions flagged by these tags as likely not spoken are not penalized +- Missing words at END of LAST turn only are not penalized (audio cutoff) + +### Scoring + +- **Scale**: 0-1 (binary per turn) + - 1: High Fidelity — audio accurately says all words from intended text + - 0: Low Fidelity — missing, added, or wrong words detected +- 
**Normalization**: Already 0-1 scale +- **Aggregation**: Mean across all assistant turns + +## Example Output + +```json +{ + "name": "tts_fidelity", + "score": 0.857, + "normalized_score": 0.857, + "details": { + "aggregation": "mean", + "num_turns": 7, + "num_evaluated": 7, + "per_turn_ratings": {"0": 1, "1": 1, "2": 1, "3": 0, "4": 1, "5": 1, "6": 1}, + "per_turn_explanations": { + "3": "Missing word: intended 'flight SW102' but audio said 'flight SW12'. Key entity error." + } + } +} +``` + +## Related Metrics + +- [user_speech_fidelity.md](user_speech_fidelity.md) - Same metric for the user simulator side +- [faithfulness.md](faithfulness.md) - Faithfulness for the text layer: evaluates whether the assistant's responses are grounded in instructions, policies, and tool results +- [speakability.md](speakability.md) - Checks if text is voice-friendly (upstream concern) + +## Implementation Details + +- **File**: `src/eva/metrics/diagnostic/tts_fidelity.py` +- **Class**: `TTSFidelityMetric` +- **Base Class**: `SpeechFidelityBaseMetric` → `AudioJudgeMetric` +- **Prompt**: `configs/prompts/judge.yaml` under `judge.tts_fidelity` +- **Configuration**: `audio_judge_model` (default: Gemini 3 Flash), `aggregation` (default: "mean") diff --git a/docs/metrics/user_speech_fidelity.md b/docs/metrics/user_speech_fidelity.md index 4427acb3..afe0b643 100644 --- a/docs/metrics/user_speech_fidelity.md +++ b/docs/metrics/user_speech_fidelity.md @@ -11,7 +11,7 @@ Audio-based validation metric that evaluates whether the user simulator's **spok ### Evaluation Method - **Type**: Audio Judge (multimodal LLM with audio input) -- **Model**: Gemini 3.1 Pro +- **Model**: Gemini 3 Flash - **Granularity**: Per-turn (each user turn evaluated independently) ### Input Data @@ -67,5 +67,5 @@ This metric uses a 1-3 scale instead of binary 0-1 (like agent speech fidelity) - **Prompt location**: `configs/prompts/judge.yaml` under `judge.user_speech_fidelity` - Uses the same speech fidelity prompt 
structure as `agent_speech_fidelity` but with `evaluation_mode="user"` and user turns - **Configuration options**: - - `audio_judge_model`: LLM model (default: Gemini 3.1 Pro) + - `audio_judge_model`: LLM model (default: Gemini 3 Flash) - `aggregation`: Aggregation method (default: "mean") diff --git a/src/eva/__init__.py b/src/eva/__init__.py index 166c4b55..eb2eaa6a 100644 --- a/src/eva/__init__.py +++ b/src/eva/__init__.py @@ -11,4 +11,4 @@ # Bump metrics_version when changes affect metric computation (metrics code, # judge prompts, pricing tables, postprocessor). -metrics_version = "2.1.2" +metrics_version = "2.2.0" diff --git a/src/eva/metrics/accuracy/__init__.py b/src/eva/metrics/accuracy/__init__.py index ab2fff74..78adb8e3 100644 --- a/src/eva/metrics/accuracy/__init__.py +++ b/src/eva/metrics/accuracy/__init__.py @@ -1,13 +1,11 @@ """Task completion metrics - measuring whether the agent accomplished the user's goal.""" -from . import agent_speech_fidelity # noqa -from . import agent_speech_fidelity_s2s # noqa from . import faithfulness # noqa +from . import speech_fidelity # noqa from . import task_completion # noqa __all__ = [ - "agent_speech_fidelity", - "agent_speech_fidelity_s2s", "faithfulness", + "speech_fidelity", "task_completion", ] diff --git a/src/eva/metrics/accuracy/agent_speech_fidelity_s2s.py b/src/eva/metrics/accuracy/speech_fidelity.py similarity index 95% rename from src/eva/metrics/accuracy/agent_speech_fidelity_s2s.py rename to src/eva/metrics/accuracy/speech_fidelity.py index 7ec2979b..ea34dfde 100644 --- a/src/eva/metrics/accuracy/agent_speech_fidelity_s2s.py +++ b/src/eva/metrics/accuracy/speech_fidelity.py @@ -1,7 +1,7 @@ -"""Agent speech fidelity metric for S2S models — entity-focused evaluation. +"""Agent speech fidelity metric — entity-focused, pipeline-agnostic evaluation. -For S2S (speech-to-speech) models, there is no intended text to compare against. 
-Instead, this metric verifies that key entities spoken by the agent (from tool +Because S2S (speech-to-speech) models expose no intended text to compare against, +this metric instead verifies that key entities spoken by the agent (from tool responses and user utterances) are accurate by sending a redacted conversation trace alongside the agent audio to Gemini. """ @@ -10,13 +10,15 @@ from typing import Any from eva.metrics.base import MetricContext +from eva.metrics.registry import register_metric from eva.metrics.speech_fidelity_base import SpeechFidelityBaseMetric from eva.metrics.utils import aggregate_per_turn_scores, normalize_rating, resolve_turn_id from eva.models.results import MetricScore -class AgentSpeechFidelityS2SMetric(SpeechFidelityBaseMetric): - """Audio-based entity fidelity metric for S2S agent speech. +@register_metric +class SpeechFidelityMetric(SpeechFidelityBaseMetric): + """Audio-based entity fidelity metric for agent speech. Evaluates whether key entities (from tool responses and user utterances) are spoken correctly by the agent, without requiring intended text. 
@@ -25,8 +27,8 @@ class AgentSpeechFidelityS2SMetric(SpeechFidelityBaseMetric): """ name = "agent_speech_fidelity" - version = "v0.2" - description = "Audio-based evaluation of agent entity fidelity for S2S models" + version = "v0.4" + description = "Audio-based evaluation of agent entity fidelity" category = "accuracy" role = "assistant" rating_scale = (0, 1) @@ -63,7 +65,6 @@ async def compute(self, context: MetricContext) -> MetricScore: audio_b64 = self.encode_audio_segment(audio_segment) prompt = self.get_judge_prompt( - prompt_key="s2s_user_prompt", conversation_trace_formatted=trace_formatted, expected_language=context.language_display_name, ) @@ -145,7 +146,6 @@ async def compute(self, context: MetricContext) -> MetricScore: avg_rating = sum(valid_ratings) / len(valid_ratings) if valid_ratings else None details: dict[str, Any] = { - "variant": "s2s", "aggregation": self.aggregation, "num_turns": num_turns, "num_evaluated": len(valid_ratings), diff --git a/src/eva/metrics/base.py b/src/eva/metrics/base.py index 9b491024..ca15d8b6 100644 --- a/src/eva/metrics/base.py +++ b/src/eva/metrics/base.py @@ -167,6 +167,7 @@ class BaseMetric(ABC): metric_type: MetricType = MetricType.CODE # Override in subclasses pass_at_k_threshold: float = 0.5 # Normalized score threshold for pass@k pass/fail exclude_from_pass_at_k: bool = False # Set True for metrics not suitable for pass@k + exclude_from_default_metrics: bool = False supported_pipeline_types: frozenset[PipelineType] = frozenset(PipelineType) # Pipeline types this metric supports # Bump on intentional logic changes; MetricsRunner stamps this onto every MetricScore # produced by compute(). Required on all concrete subclasses — drift test enforces. diff --git a/src/eva/metrics/diagnostic/__init__.py b/src/eva/metrics/diagnostic/__init__.py index 02fced4a..cfbb5af9 100644 --- a/src/eva/metrics/diagnostic/__init__.py +++ b/src/eva/metrics/diagnostic/__init__.py @@ -8,6 +8,7 @@ from . import stt_wer # noqa from . 
import tool_call_validity # noqa from . import transcription_accuracy_key_entities # noqa +from . import tts_fidelity # noqa __all__ = [ "authentication_success", @@ -18,4 +19,5 @@ "stt_wer", "tool_call_validity", "transcription_accuracy_key_entities", + "tts_fidelity", ] diff --git a/src/eva/metrics/accuracy/agent_speech_fidelity.py b/src/eva/metrics/diagnostic/tts_fidelity.py similarity index 76% rename from src/eva/metrics/accuracy/agent_speech_fidelity.py rename to src/eva/metrics/diagnostic/tts_fidelity.py index a7b9b318..90f9b01b 100644 --- a/src/eva/metrics/accuracy/agent_speech_fidelity.py +++ b/src/eva/metrics/diagnostic/tts_fidelity.py @@ -1,9 +1,10 @@ -"""Agent speech fidelity metric using audio + LLM judge (Gemini).""" +"""TTS fidelity diagnostic metric using audio + LLM judge (Gemini).""" from eva.metrics.base import MetricContext from eva.metrics.registry import register_metric from eva.metrics.speech_fidelity_base import SpeechFidelityBaseMetric from eva.metrics.utils import build_per_category_rate_sub_metrics +from eva.models.config import PipelineType from eva.models.results import MetricScore _SPEECH_FIDELITY_FAILURE_MODES = ( @@ -16,7 +17,7 @@ @register_metric -class AgentSpeechFidelityMetric(SpeechFidelityBaseMetric): +class TTSFidelityMetric(SpeechFidelityBaseMetric): """Audio-based speech fidelity metric for agent using Gemini. Evaluates whether the agent's spoken audio accurately represents the intended text. @@ -24,13 +25,15 @@ class AgentSpeechFidelityMetric(SpeechFidelityBaseMetric): Evaluates each agent turn for missing, added, or incorrect words. 
""" - name = "agent_speech_fidelity" + name = "tts_fidelity" version = "v0.3" - description = "Audio-based evaluation of agent speech fidelity to the intended text" - category = "accuracy" + description = "Diagnostic metric: TTS fidelity to the intended text" + category = "diagnostic" role = "assistant" + exclude_from_pass_at_k = True + exclude_from_default_metrics = True + supported_pipeline_types = frozenset({PipelineType.CASCADE, PipelineType.AUDIO_LLM}) rating_scale = (0, 1) - pass_at_k_threshold = 0.95 def build_sub_metrics( self, diff --git a/src/eva/metrics/registry.py b/src/eva/metrics/registry.py index 5c6c87b6..a09731bb 100644 --- a/src/eva/metrics/registry.py +++ b/src/eva/metrics/registry.py @@ -71,8 +71,8 @@ def create( return metric_class(config=config) def list_metrics(self) -> list[str]: - """Get list of all registered metric names.""" - return list(self._metrics.keys()) + """Get the names of metrics that run by default.""" + return [name for name, cls in self._metrics.items() if not cls.exclude_from_default_metrics] def get_all(self) -> dict[str, type[BaseMetric]]: """Get all registered metrics.""" diff --git a/src/eva/metrics/runner.py b/src/eva/metrics/runner.py index 77881a38..0133b1b9 100644 --- a/src/eva/metrics/runner.py +++ b/src/eva/metrics/runner.py @@ -11,7 +11,6 @@ import yaml -from eva.metrics.accuracy.agent_speech_fidelity_s2s import AgentSpeechFidelityS2SMetric from eva.metrics.aggregation import ( compute_record_aggregates, compute_run_level_aggregates, @@ -141,13 +140,6 @@ def __init__( else: logger.warning(f"Metric '{name}' not found, skipping") - # For S2S pipelines, swap agent_speech_fidelity with entity-focused variant - if self._pipeline_type == PipelineType.S2S: - self.metrics = [ - AgentSpeechFidelityS2SMetric(config=m.config) if m.name == "agent_speech_fidelity" else m - for m in self.metrics - ] - logger.info(f"Metrics runner initialized with {len(self.metrics)} metrics") def _load_agent_config(self) -> dict[str, Any]: diff 
--git a/src/eva/metrics/signatures.py b/src/eva/metrics/signatures.py index f7ca1f36..16a84cc8 100644 --- a/src/eva/metrics/signatures.py +++ b/src/eva/metrics/signatures.py @@ -27,8 +27,8 @@ def _all_concrete_versioned_metric_classes() -> dict[str, type[BaseMetric]]: """Walk BaseMetric subclasses; return concrete classes that set a version. - Keyed on class qualname so co-named classes (e.g., the cascade vs S2S - variants of `agent_speech_fidelity`) get distinct entries. + Keyed on class qualname (not metric name) so each concrete class gets a + distinct entry even if two ever shared a `name`. """ result: dict[str, type[BaseMetric]] = {} diff --git a/src/eva/metrics/speech_fidelity_base.py b/src/eva/metrics/speech_fidelity_base.py index 0e5e9cc1..d6ca93e0 100644 --- a/src/eva/metrics/speech_fidelity_base.py +++ b/src/eva/metrics/speech_fidelity_base.py @@ -67,7 +67,6 @@ async def compute(self, context: MetricContext) -> MetricScore: intended_turns_formatted = self._format_intended_turns(intended_turns) prompt = self.get_judge_prompt( - prompt_key="user_prompt", intended_turns_formatted=intended_turns_formatted, expected_language=context.language_display_name, ) diff --git a/tests/fixtures/metric_signatures.json b/tests/fixtures/metric_signatures.json index a8c2023a..7be38f09 100644 --- a/tests/fixtures/metric_signatures.json +++ b/tests/fixtures/metric_signatures.json @@ -1,16 +1,4 @@ { - "AgentSpeechFidelityMetric": { - "name": "agent_speech_fidelity", - "prompt_hash": "c3a02ab03f06", - "source_hash": "4bb84934ffb3", - "version": "v0.3" - }, - "AgentSpeechFidelityS2SMetric": { - "name": "agent_speech_fidelity", - "prompt_hash": "c3a02ab03f06", - "source_hash": "db35af2de8a3", - "version": "v0.2" - }, "AuthenticationSuccessMetric": { "name": "authentication_success", "prompt_hash": null, @@ -71,6 +59,18 @@ "source_hash": "75a203410ec1", "version": "v0.2" }, + "SpeechFidelityMetric": { + "name": "agent_speech_fidelity", + "prompt_hash": "c614dd12d4fe", + 
"source_hash": "0ada0bb6e360", + "version": "v0.4" + }, + "TTSFidelityMetric": { + "name": "tts_fidelity", + "prompt_hash": "c3a02ab03f06", + "source_hash": "2be67c42f3d0", + "version": "v0.3" + }, "TaskCompletion": { "name": "task_completion", "prompt_hash": null, diff --git a/tests/unit/metrics/test_speech_fidelity_s2s.py b/tests/unit/metrics/test_agent_speech_fidelity.py similarity index 98% rename from tests/unit/metrics/test_speech_fidelity_s2s.py rename to tests/unit/metrics/test_agent_speech_fidelity.py index f5e13cb8..7ef385c1 100644 --- a/tests/unit/metrics/test_speech_fidelity_s2s.py +++ b/tests/unit/metrics/test_agent_speech_fidelity.py @@ -1,4 +1,4 @@ -"""Tests for agent_speech_fidelity S2S variant (entity-focused evaluation).""" +"""Tests for agent_speech_fidelity metric.""" import json import logging @@ -6,7 +6,7 @@ import pytest -from eva.metrics.accuracy.agent_speech_fidelity_s2s import AgentSpeechFidelityS2SMetric +from eva.metrics.accuracy.speech_fidelity import SpeechFidelityMetric from eva.models.config import PipelineType from .conftest import make_judge_metric, make_metric_context @@ -20,7 +20,7 @@ def make_judge_response(turns: list[dict]) -> str: @pytest.fixture def s2s_metric(): return make_judge_metric( - AgentSpeechFidelityS2SMetric, + SpeechFidelityMetric, mock_llm=True, logger_name="test_agent_speech_fidelity_s2s", ) @@ -230,7 +230,6 @@ async def test_all_high_fidelity(self, s2s_metric): assert result.normalized_score == 1.0 assert result.details["num_turns"] == 2 assert result.details["num_evaluated"] == 2 - assert result.details["variant"] == "s2s" assert result.error is None @pytest.mark.asyncio diff --git a/tests/unit/metrics/test_registry.py b/tests/unit/metrics/test_registry.py index 4e889872..15acb9dc 100644 --- a/tests/unit/metrics/test_registry.py +++ b/tests/unit/metrics/test_registry.py @@ -25,6 +25,16 @@ async def compute(self, context: MetricContext) -> MetricScore: return MetricScore(name=self.name, score=0.5, 
normalized_score=0.5) +class ExcludedFakeMetric(BaseMetric): + name = "excluded_fake_metric" + description = "An excluded fake metric" + metric_type = "code" + exclude_from_default_metrics = True + + async def compute(self, context: MetricContext) -> MetricScore: + return MetricScore(name=self.name, score=0.42, normalized_score=0.42) + + class TestMetricRegistry: def setup_method(self): self.registry = MetricRegistry() @@ -71,8 +81,11 @@ def test_create_unknown_returns_none(self): def test_list_metrics(self): self.registry.register(FakeMetric) self.registry.register(AnotherFakeMetric) + self.registry.register(ExcludedFakeMetric) names = self.registry.list_metrics() assert set(names) == {"fake_metric", "another_fake"} + # The excluded_fake_metric is still resolvable by name for explicit --metrics selection. + assert self.registry.get("excluded_fake_metric") is ExcludedFakeMetric def test_get_all_returns_copy(self): self.registry.register(FakeMetric) diff --git a/tests/unit/metrics/test_speech_fidelity.py b/tests/unit/metrics/test_tts_fidelity.py similarity index 98% rename from tests/unit/metrics/test_speech_fidelity.py rename to tests/unit/metrics/test_tts_fidelity.py index 36a1e7d0..1a550cd7 100644 --- a/tests/unit/metrics/test_speech_fidelity.py +++ b/tests/unit/metrics/test_tts_fidelity.py @@ -1,4 +1,4 @@ -"""Tests for agent_speech_fidelity and user_speech_fidelity metrics.""" +"""Tests for tts_fidelity and user_speech_fidelity metrics.""" import json import logging @@ -7,7 +7,7 @@ import pytest from google.api_core import exceptions as google_exceptions -from eva.metrics.accuracy.agent_speech_fidelity import AgentSpeechFidelityMetric +from eva.metrics.diagnostic.tts_fidelity import TTSFidelityMetric from eva.metrics.validation.user_speech_fidelity import UserSpeechFidelityMetric from .conftest import make_judge_metric, make_metric_context @@ -21,7 +21,7 @@ def make_judge_response(turns: list[dict]) -> str: @pytest.fixture def agent_metric(): return 
make_judge_metric( - AgentSpeechFidelityMetric, + TTSFidelityMetric, mock_llm=True, logger_name="test_agent_speech_fidelity", ) @@ -54,11 +54,11 @@ class TestClassAttributes: """Verify subclass metadata is set correctly.""" def test_agent_metric_attributes(self, agent_metric): - assert agent_metric.name == "agent_speech_fidelity" - assert agent_metric.category == "accuracy" + assert agent_metric.name == "tts_fidelity" + assert agent_metric.category == "diagnostic" assert agent_metric.role == "assistant" assert agent_metric.rating_scale == (0, 1) - assert agent_metric.pass_at_k_threshold == 0.95 + assert agent_metric.exclude_from_pass_at_k is True def test_user_metric_attributes(self, user_metric): assert user_metric.name == "user_speech_fidelity" @@ -244,7 +244,7 @@ async def test_failure_modes_produce_rate_sub_metrics(self, agent_metric): assert result.sub_metrics["garbled_hallucination_rate"].score == 0.0 assert result.sub_metrics["insertion_hallucination_rate"].score == 0.0 assert result.sub_metrics["wrong_language_rate"].score == 0.0 - assert result.sub_metrics["entity_error_rate"].name == "agent_speech_fidelity.entity_error_rate" + assert result.sub_metrics["entity_error_rate"].name == "tts_fidelity.entity_error_rate" assert result.sub_metrics["entity_error_rate"].details == { "count": 1, "num_rated": 2,