Always run entity fidelity by JosephMarinier · Pull Request #146 · ServiceNow/eva

JosephMarinier · 2026-06-10T16:28:43Z

I recommend reviewing the commits individually.

Copy docs/metrics/agent_speech_fidelity.md to docs/metrics/tts_fidelity.md so the following commit is clearer.
Always run entity fidelity to compare all types of pipelines apples to apples, and run TTS fidelity as a diagnostic metric when applicable (cascade and audio LLM).
Exclude the TTS fidelity diagnostic metric from the default metrics.
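
To make the intended behaviour concrete, here is a minimal sketch of the selection logic described above. It is not the code in this PR: `select_metrics`, `include_diagnostics`, the pipeline labels, and the `tts_fidelity` metric name (guessed from the docs filename) are all assumptions for illustration.

```python
# Hypothetical pipeline labels; the real identifiers in the repo may differ.
TTS_PIPELINES = {"cascade", "audio_llm"}


def select_metrics(pipeline_type: str, include_diagnostics: bool = False) -> list[str]:
    """Return the metric names to run for a pipeline (illustrative only)."""
    # Entity fidelity always runs, so every pipeline type is comparable.
    metrics = ["agent_speech_fidelity"]
    # TTS fidelity is diagnostic: opt-in, and only where a TTS engine is involved.
    if include_diagnostics and pipeline_type in TTS_PIPELINES:
        metrics.append("tts_fidelity")
    return metrics
```

Under these assumptions, `select_metrics("cascade")` returns only the entity fidelity metric, and the TTS diagnostic appears only when it is explicitly requested for a cascade or audio-LLM pipeline.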


JosephMarinier · 2026-06-10T22:40:05Z

-The evaluation is the same in both cases — compare `intended_assistant_turns` against the actual spoken audio. The only difference is where the intended text comes from:
-
- **Cascade**: The intended text is the input to the TTS engine (i.e., the LLM's text output).
- **Audio-native (S2S, S2T+TTS):** The intended text is the text output that the audio-native model returns alongside its generated speech.


Am I following correctly? This passage was outdated: both cascade and audio-LLM pipelines feed the "intended" text to the TTS engine.

JosephMarinier · 2026-06-10T22:42:23Z

 - Integer/boolean ratings (not decimals) to avoid precision issues
 - Structured prompts in `configs/prompts/judge.yaml`
-- GPT-5.2 for text judges, Gemini 3.1 Pro for audio judges, Claude Opus for faithfulness
+- GPT-5.2 for text judges, Gemini 3 Flash for audio judges, Claude Opus for faithfulness


That seems to be outdated since we merged "Update audio judge model to gemini 3 flash due to cost constraints" (#84).

JosephMarinier · 2026-06-10T22:44:14Z

+    Keyed on class qualname (not metric name) so each concrete class gets a
+    distinct entry even if two ever shared a `name`.


Now that there is a single class for agent_speech_fidelity, we could also consider keying on the metric name instead of the class name. Any thoughts?
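
For context on the trade-off, a small sketch of the two keying options. The classes and `build_registry` helper are hypothetical stand-ins, not the registry in this repo; they only show what happens when two classes share a `name`.

```python
class VariantA:  # hypothetical metric class
    name = "agent_speech_fidelity"


class VariantB:  # hypothetical second class sharing the same metric name
    name = "agent_speech_fidelity"


def build_registry(classes, key):
    """Map key(cls) -> cls; a later class silently overwrites an equal key."""
    registry = {}
    for cls in classes:
        registry[key(cls)] = cls
    return registry


# Keyed on class qualname: both classes keep a distinct entry.
by_qualname = build_registry([VariantA, VariantB], lambda cls: cls.__qualname__)

# Keyed on metric name: the shared name collapses to a single entry.
by_name = build_registry([VariantA, VariantB], lambda cls: cls.name)
```

With a single class per metric name the two keyings behave the same, so the choice mostly comes down to which key is more convenient for lookups.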

JosephMarinier · 2026-06-10T22:45:31Z

    name = "agent_speech_fidelity"
-    version = "v0.2"
-    description = "Audio-based evaluation of agent entity fidelity for S2S models"
+    version = "v0.4"


In AgentSpeechFidelityMetric (the other variant of the metric), we had version = "v0.3", so I'm bumping above that.
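
As a sanity check on the bump, a minimal sketch of picking the next minor version above both variants. `parse_version` and `bump_minor_above` are hypothetical helpers, and the sketch assumes version tags always have the form `v<major>.<minor>`.

```python
def parse_version(tag: str) -> tuple[int, ...]:
    """Turn a tag like 'v0.3' into a comparable tuple like (0, 3)."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))


def bump_minor_above(*tags: str) -> str:
    """Return the next minor version above the highest of the given tags."""
    major, minor = max(parse_version(tag) for tag in tags)[:2]
    return f"v{major}.{minor + 1}"


# The two previous variants were at v0.2 and v0.3.
new_version = bump_minor_above("v0.2", "v0.3")
```

Here `new_version` comes out as `"v0.4"`, matching the value in the diff.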

JosephMarinier requested a review from gabegma June 10, 2026 16:28

JosephMarinier self-assigned this Jun 10, 2026

JosephMarinier added 3 commits June 10, 2026 17:10

Copy docs/metrics/agent_speech_fidelity.md to docs/metrics/tts_fideli…

38822cb

…ty.md

Always run entity fidelity

5742228

to compare all types of pipelines apples to apples, and run TTS fidelity as a diagnostic metric when applicable (cascade and audio LLM).

Exclude TTS fidelity diagnostic metric from default metrics

1fc4136

JosephMarinier force-pushed the joseph/always-run-entity-fidelity branch from ef84e56 to 1fc4136 Compare June 10, 2026 21:50

JosephMarinier marked this pull request as ready for review June 10, 2026 21:55

JosephMarinier wants to merge 3 commits into main from joseph/always-run-entity-fidelity
