Always run entity fidelity #146
Open
JosephMarinier wants to merge 3 commits into
to compare all types of pipelines apples to apples, and run TTS fidelity as a diagnostic metric when applicable (cascade and audio LLM).
ef84e56 to
1fc4136
JosephMarinier
commented
Jun 10, 2026
Comment on lines -29 to -32
The evaluation is the same in both cases — compare `intended_assistant_turns` against the actual spoken audio. The only difference is where the intended text comes from:

- **Cascade**: The intended text is the input to the TTS engine (i.e., the LLM's text output).
- **Audio-native (S2S, S2T+TTS):** The intended text is the text output that the audio-native model returns alongside its generated speech.
Collaborator
Author
Am I following? That was outdated, and both cascade and audio-LLM feed the "intended" text to the TTS engine.
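To make the intended behaviour concrete, here is a minimal sketch of the selection rule from the PR description: entity fidelity always runs, and TTS fidelity is added only for cascade and audio-LLM pipelines. It assumes entity fidelity corresponds to the `agent_speech_fidelity` metric; `PipelineKind`, `metrics_for`, and the `tts_fidelity` name are hypothetical, chosen for illustration rather than taken from the repository.

```python
from enum import Enum


class PipelineKind(Enum):
    """Hypothetical pipeline categories; the real code may name these differently."""

    CASCADE = "cascade"
    AUDIO_LLM = "audio_llm"
    S2S = "s2s"


# Pipelines that pass an "intended" text through a TTS engine (assumption
# based on the PR description: cascade and audio LLM).
TTS_PIPELINES = frozenset({PipelineKind.CASCADE, PipelineKind.AUDIO_LLM})


def metrics_for(kind: PipelineKind) -> list[str]:
    """Return the fidelity metrics to run for a given pipeline kind."""
    # Entity fidelity always runs so every pipeline is compared the same way.
    metrics = ["agent_speech_fidelity"]
    # TTS fidelity is a diagnostic metric, run only when a TTS engine is involved.
    if kind in TTS_PIPELINES:
        metrics.append("tts_fidelity")
    return metrics
```

With this rule, `metrics_for(PipelineKind.S2S)` yields only the entity fidelity metric, while the two TTS-backed kinds also get the diagnostic one.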
JosephMarinier
commented
Jun 10, 2026
- Integer/boolean ratings (not decimals) to avoid precision issues
- Structured prompts in `configs/prompts/judge.yaml`
- GPT-5.2 for text judges, Gemini 3.1 Pro for audio judges, Claude Opus for faithfulness
- GPT-5.2 for text judges, Gemini 3 Flash for audio judges, Claude Opus for faithfulness
Collaborator
Author
That seems to be outdated since we did "Update audio judge model to gemini 3 flash due to cost constraints" (#84).
JosephMarinier
commented
Jun 10, 2026
Comment on lines +30 to +31
Keyed on class qualname (not metric name) so each concrete class gets a
distinct entry even if two ever shared a `name`.
Collaborator
Author
Now that there is a single class for agent_speech_fidelity, we could also consider keying with the metric name instead of the class name. Any thoughts?
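For reference, a small sketch of the trade-off between the two keying schemes. The class names and the `build_registry` helper are hypothetical and only illustrate the collision case the docstring guards against; they are not the repository's actual registry code.

```python
class Metric:
    """Hypothetical base class carrying a metric name."""

    name = ""


class AgentSpeechFidelity(Metric):
    name = "agent_speech_fidelity"


class OtherAgentSpeechFidelity(Metric):
    # A second concrete class that (hypothetically) shares the same name.
    name = "agent_speech_fidelity"


def build_registry(classes, key):
    """Map key(cls) -> cls; a later class silently overwrites an equal key."""
    registry = {}
    for cls in classes:
        registry[key(cls)] = cls
    return registry


classes = [AgentSpeechFidelity, OtherAgentSpeechFidelity]

# Keyed on qualname: every concrete class keeps its own entry.
by_qualname = build_registry(classes, key=lambda cls: cls.__qualname__)

# Keyed on metric name: the two classes collide and only the last survives.
by_name = build_registry(classes, key=lambda cls: cls.name)
```

If only one class per metric name can exist, both schemes produce the same number of entries, and keying on the name makes lookups by metric name direct.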
JosephMarinier
commented
Jun 10, 2026
name = "agent_speech_fidelity"
version = "v0.2"
description = "Audio-based evaluation of agent entity fidelity for S2S models"
version = "v0.4"
Collaborator
Author
In AgentSpeechFidelityMetric (the other variant of the metric), we had version = "v0.3", so I'm bumping above that.
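As a quick sanity check on the bump, a sketch that compares the version strings numerically. `parse_version` is a hypothetical helper, and it assumes versions always follow the `v<major>.<minor>` shape seen in this diff.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn a string such as "v0.4" into a comparable tuple such as (0, 4)."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


# Versions previously used by the two variants of the metric.
previous_versions = ["v0.2", "v0.3"]
new_version = "v0.4"

# The merged metric's version should be above both earlier variants.
is_bumped = all(
    parse_version(new_version) > parse_version(old) for old in previous_versions
)
```

Comparing parsed tuples rather than raw strings keeps the check correct once a minor version reaches two digits, where plain string comparison would order "v0.10" before "v0.9".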
I recommend reviewing the commits individually.