pydantic strip wrapper tokens #321

Open

colinbennettbrain wants to merge 5 commits into main from
colin/pydantic-strip-wrapper-tokens

Conversation

@colinbennettbrain
Contributor

#312 retyped wrapper spans (agent_run, model_request, streaming wrappers) from LLM to TASK, but they kept logging the same prompt_tokens/completion_tokens/tokens as their nested leaf `chat <model>` span. The server derives estimated_cost per-span from tokens + metadata.model (brainstore estimated_cost.rs), rolls up trace totals with `coalesce_add` over every non-scorer span regardless of type (summary.rs accumulate_metrics), and sums experiment-level token/cost over all non-scorer spans without filtering on span_type='llm' (summary.ts experimentScanSpanSummary). Retyping to TASK therefore did not stop double-counting on any of those three axes.
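The three rollup paths share one property: none of them filter on span_type, so a wrapper span that still carries token metrics is summed on top of its nested leaf. A toy illustration (the span dict shape and names here are invented, not the server's actual data model):

```python
# Two spans from one trace: a retyped TASK wrapper and its nested LLM leaf,
# both still logging the same token count.
spans = [
    {"span_type": "TASK", "name": "agent_run", "metrics": {"tokens": 100}},
    {"span_type": "LLM", "name": "chat gpt-4o", "metrics": {"tokens": 100}},
]

def trace_total(spans, key="tokens"):
    # Mirrors the coalesce_add-style rollup: every non-scorer span counts,
    # with no span_type='llm' filter.
    return sum(s["metrics"].get(key, 0) for s in spans if s["span_type"] != "SCORER")

print(trace_total(spans))  # 200 -- double-counted despite the TASK retype
```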

Route every wrapper log site through a new `_wrapper_span_metrics` helper that emits only {start, end, duration, optional time_to_first_token}. Leaf `chat <model>` spans (from `_wrap_concrete_model_class` and `_DirectStreamWrapper` when span_type=LLM) keep full `_extract_response_metrics`. `_DirectStreamWrapper` now branches on span_type since it serves as both leaf and wrapper. Delete now-dead `_extract_usage_metrics` and `_extract_stream_usage_metrics`.

Flip existing cassette-backed assertions (`test_agent_run_async`, `test_agent_run_sync`, `test_agent_run_stream_async`, `test_agent_with_tools`, `test_agent_run_stream_sync`) to assert `prompt_tokens` / `completion_tokens` / `tokens` / `prompt_cached_tokens` are absent from wrapper spans and present only on the leaf. No cassette re-recording needed -- the change is purely in post-processing.
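The flipped assertions amount to a placement check like the following sketch (the span dict shape is invented for illustration; the real tests walk the exported span logs):

```python
TOKEN_KEYS = ("prompt_tokens", "completion_tokens", "tokens", "prompt_cached_tokens")

def check_token_placement(spans):
    # Token metrics must be present on the leaf `chat <model>` span and
    # absent from every wrapper span.
    for span in spans:
        has_tokens = any(k in span["metrics"] for k in TOKEN_KEYS)
        if span["is_leaf"]:
            assert has_tokens, f"leaf span {span['name']} lost its token metrics"
        else:
            assert not has_tokens, f"wrapper span {span['name']} still logs tokens"

check_token_placement([
    {"name": "agent_run", "is_leaf": False, "metrics": {"duration": 1.2}},
    {"name": "chat gpt-4o", "is_leaf": True, "metrics": {"tokens": 100, "prompt_tokens": 60}},
])
```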

colinbennettbrain and others added 5 commits April 16, 2026 22:20
Stripping wrapper-span metrics in 6a6cb76 deleted `_extract_usage_metrics`,
which had been the only call site that read `usage.input_audio_tokens`,
`usage.output_audio_tokens`, and dict-style `usage.details`. Leaf chat spans
went through `_extract_response_metrics`, which (a) never knew about the audio
fields and (b) reached for `usage.details.reasoning_tokens` as an attribute.
pydantic_ai's `RequestUsage.details` is `dict[str, int]` (see
pydantic_ai/usage.py and `models/openai.py:_map_usage` populating
`details['reasoning_tokens']` from `output_tokens_details` plus spreading
`completion_tokens_details.model_dump()`). `hasattr(dict, "reasoning_tokens")`
is `False`, so the existing branch silently dropped reasoning tokens on every
leaf span -- including for o1/o3.
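The failure mode reproduces with any plain dict:

```python
# RequestUsage.details is dict[str, int], so the old attribute-style
# hasattr check never fired:
details = {"reasoning_tokens": 64, "cached_tokens": 32}

print(hasattr(details, "reasoning_tokens"))  # False: dicts expose keys, not attributes
print(details.get("reasoning_tokens"))       # 64: key access is what works
```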

`_extract_response_metrics` now reads:
  - prompt_audio_tokens   <- usage.input_audio_tokens
  - completion_audio_tokens <- usage.output_audio_tokens
  - completion_reasoning_tokens <- details["reasoning_tokens"]
  - prompt_cached_tokens  <- details["cached_tokens"] (overrides cache_read_tokens
    when both present, matching the pre-strip helper)
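The mappings above can be sketched as follows -- `FakeUsage` is a stand-in for pydantic_ai's `RequestUsage` (only the field names come from this commit message; the helper here is illustrative, not the real `_extract_response_metrics`):

```python
from dataclasses import dataclass, field

@dataclass
class FakeUsage:
    # Stand-in for pydantic_ai's RequestUsage; fields per the mapping above.
    input_audio_tokens: int = 0
    output_audio_tokens: int = 0
    cache_read_tokens: int = 0
    details: dict = field(default_factory=dict)

def extract_audio_and_reasoning(usage) -> dict:
    metrics = {}
    if usage.input_audio_tokens:
        metrics["prompt_audio_tokens"] = usage.input_audio_tokens
    if usage.output_audio_tokens:
        metrics["completion_audio_tokens"] = usage.output_audio_tokens
    if usage.cache_read_tokens:
        metrics["prompt_cached_tokens"] = usage.cache_read_tokens
    details = usage.details or {}
    if "reasoning_tokens" in details:
        metrics["completion_reasoning_tokens"] = details["reasoning_tokens"]
    if "cached_tokens" in details:
        # details["cached_tokens"] overrides cache_read_tokens when both present
        metrics["prompt_cached_tokens"] = details["cached_tokens"]
    return metrics

usage = FakeUsage(input_audio_tokens=10, output_audio_tokens=5,
                  cache_read_tokens=7,
                  details={"reasoning_tokens": 64, "cached_tokens": 32})
print(extract_audio_and_reasoning(usage))
```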

Adds a synthesized-RequestUsage unit test covering all of the above. Rewrites
`test_reasoning_tokens_extraction` to use `RequestUsage` instead of `MagicMock`
-- the mock satisfied attribute-style `hasattr` lookups and was the reason the
broken `details.reasoning_tokens` branch passed CI while never firing in
production. Adds wrapper-no-tokens regressions to test_agent_run_stream_events,
test_direct_model_request_stream, and test_direct_model_request_stream_sync,
mirroring the assertions added to non-stream agent_run tests in 6a6cb76.
pylint's astroid doesn't narrow Optional via 'assert x is not None',
so subscripting tripped unsubscriptable-object. Assign through a
typed local to make the type explicit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
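The workaround pattern looks like this (the function and names are illustrative, not from the PR):

```python
from typing import Optional

def first_label(labels: Optional[list[str]]) -> str:
    assert labels is not None
    # pylint's astroid does not narrow Optional through the assert above,
    # so `labels[0]` would trip unsubscriptable-object; assigning through
    # an explicitly typed local makes the non-Optional type visible to it.
    narrowed: list[str] = labels
    return narrowed[0].upper()
```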