Always run entity fidelity #146

Open
JosephMarinier wants to merge 3 commits into main from joseph/always-run-entity-fidelity

JosephMarinier commented Jun 10, 2026

Collaborator

I recommend reviewing the commits individually.

JosephMarinier requested a review from gabegma June 10, 2026 16:28
JosephMarinier self-assigned this Jun 10, 2026
Always run entity fidelity to compare all types of pipelines apples to apples, and run TTS fidelity as a diagnostic metric when applicable (cascade and audio LLM).
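
The split described here, entity fidelity for every pipeline and TTS fidelity only for cascade and audio-LLM pipelines, can be sketched as a small selection function. This is a minimal sketch under stated assumptions, not the PR's implementation: the function name `metrics_to_run` and the pipeline-type strings are hypothetical, and the metric names are borrowed from the `agent_speech_fidelity` name and the `tts_fidelity.md` doc file that appear in this PR.

```python
# Hypothetical pipeline-type labels; the real identifiers may differ.
TTS_PIPELINES = {"cascade", "audio_llm"}


def metrics_to_run(pipeline_type: str) -> list:
    """Return the fidelity metrics to run for one pipeline type."""
    # Entity fidelity always runs, so every pipeline type is scored the same way.
    metrics = ["agent_speech_fidelity"]
    # TTS fidelity is diagnostic and only applies when a TTS engine is involved.
    if pipeline_type in TTS_PIPELINES:
        metrics.append("tts_fidelity")
    return metrics


for kind in ("cascade", "audio_llm", "s2s"):
    print(kind, metrics_to_run(kind))
```

Keeping the always-on metric outside the conditional is what makes the comparison across pipeline types like for like.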
JosephMarinier force-pushed the joseph/always-run-entity-fidelity branch from ef84e56 to 1fc4136 June 10, 2026 21:50
JosephMarinier marked this pull request as ready for review June 10, 2026 21:55
Comment thread docs/metrics/tts_fidelity.md Outdated
Comment on lines -29 to -32
The evaluation is the same in both cases — compare `intended_assistant_turns` against the actual spoken audio. The only difference is where the intended text comes from:

- **Cascade**: The intended text is the input to the TTS engine (i.e., the LLM's text output).
- **Audio-native (S2S, S2T+TTS):** The intended text is the text output that the audio-native model returns alongside its generated speech.

Collaborator Author

Am I following? That was outdated, and both cascade and audio-LLM feed the "intended" text to the TTS engine.

Comment thread docs/metrics/README.md
  - Integer/boolean ratings (not decimals) to avoid precision issues
  - Structured prompts in `configs/prompts/judge.yaml`
- - GPT-5.2 for text judges, Gemini 3.1 Pro for audio judges, Claude Opus for faithfulness
+ - GPT-5.2 for text judges, Gemini 3 Flash for audio judges, Claude Opus for faithfulness

Collaborator Author

Comment on lines +30 to +31
Keyed on class qualname (not metric name) so each concrete class gets a
distinct entry even if two ever shared a `name`.

Collaborator Author

Now that there is a single class for agent_speech_fidelity, we could also consider keying with the metric name instead of the class name. Any thoughts?
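
To make the trade-off concrete, here is a minimal sketch of the two keying choices. The classes and the `build_registry` helper are hypothetical and only illustrate the idea; they are not the repository's code. Keying on `__qualname__` keeps two classes apart even when they share a `name`, while keying on `name` collapses them into one entry.

```python
# Hypothetical metric classes; illustrative only, not taken from the PR's code.
class AgentSpeechFidelityMetric:
    name = "agent_speech_fidelity"
    version = "v0.4"


class OtherAgentSpeechFidelityMetric:
    # A second class that happens to share the same metric name.
    name = "agent_speech_fidelity"
    version = "v0.3"


def build_registry(classes, key):
    """Map key(cls) -> cls; on a key clash the later class overwrites the earlier one."""
    return {key(cls): cls for cls in classes}


classes = [OtherAgentSpeechFidelityMetric, AgentSpeechFidelityMetric]

# Keyed on class qualname: one entry per concrete class.
by_qualname = build_registry(classes, key=lambda cls: cls.__qualname__)
# Keyed on metric name: classes sharing a name collapse into one entry.
by_name = build_registry(classes, key=lambda cls: cls.name)

print(sorted(by_qualname))
print(sorted(by_name))
```

With a single class per metric name, both keys give the same number of entries, so the choice mostly comes down to which identifier callers are expected to look up.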

  name = "agent_speech_fidelity"
- version = "v0.2"
- description = "Audio-based evaluation of agent entity fidelity for S2S models"
+ version = "v0.4"

Collaborator Author

In AgentSpeechFidelityMetric (the other variant of the metric), we had version = "v0.3", so I'm bumping above that.
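
As an aside on checking that a bump really lands above every earlier variant: a minimal sketch, assuming version tags keep the `vMAJOR.MINOR` shape seen here. The `parse_version` helper is hypothetical and not part of the repository. Comparing parsed tuples rather than raw strings avoids the case where `"v0.10"` sorts below `"v0.9"`.

```python
def parse_version(tag: str) -> tuple:
    """Turn a tag such as "v0.4" into (0, 4) so versions compare numerically."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))


# The merged metric's version should exceed both earlier variants.
previous = ["v0.2", "v0.3"]
new = "v0.4"
print(all(parse_version(new) > parse_version(old) for old in previous))
```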

