pydantic strip wrapper tokens #321

Open

colinbennettbrain wants to merge 5 commits into main from
colin/pydantic-strip-wrapper-tokens

Conversation

@colinbennettbrain
Contributor

#312 retyped wrapper spans (agent_run, model_request, streaming wrappers) from LLM to TASK, but they kept logging the same prompt_tokens/completion_tokens/tokens as their nested leaf `chat <model>` span. The server derives estimated_cost per-span from tokens + metadata.model (brainstore estimated_cost.rs), rolls up trace totals with `coalesce_add` over every non-scorer span regardless of type (summary.rs accumulate_metrics), and sums experiment-level token/cost over all non-scorer spans without filtering on span_type='llm' (summary.ts experimentScanSpanSummary). Retyping to TASK therefore did not stop double-counting on any of those three axes.
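The three rollup paths share one property: none of them filter on span_type, so a wrapper span that still carries token metrics is summed on top of its nested leaf. A toy illustration (the span dict shape and names here are invented, not the server's actual data model):

```python
# Two spans from one trace: a retyped TASK wrapper and its nested LLM leaf,
# both still logging the same token count.
spans = [
    {"span_type": "TASK", "name": "agent_run", "metrics": {"tokens": 100}},
    {"span_type": "LLM", "name": "chat gpt-4o", "metrics": {"tokens": 100}},
]

def trace_total(spans, key="tokens"):
    # Mirrors the coalesce_add-style rollup: every non-scorer span counts,
    # with no span_type='llm' filter.
    return sum(s["metrics"].get(key, 0) for s in spans if s["span_type"] != "SCORER")

print(trace_total(spans))  # 200 -- double-counted despite the TASK retype
```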

Route every wrapper log site through a new `_wrapper_span_metrics` helper that emits only {start, end, duration, optional time_to_first_token}. Leaf `chat <model>` spans (from `_wrap_concrete_model_class` and `_DirectStreamWrapper` when span_type=LLM) keep full `_extract_response_metrics`. `_DirectStreamWrapper` now branches on span_type since it serves as both leaf and wrapper. Delete now-dead `_extract_usage_metrics` and `_extract_stream_usage_metrics`.

Flip existing cassette-backed assertions (`test_agent_run_async`, `test_agent_run_sync`, `test_agent_run_stream_async`, `test_agent_with_tools`, `test_agent_run_stream_sync`) to assert `prompt_tokens` / `completion_tokens` / `tokens` / `prompt_cached_tokens` are absent from wrapper spans and present only on the leaf. No cassette re-recording needed -- the change is purely in post-processing.
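The flipped assertions amount to a placement check like the following sketch (the span dict shape is invented for illustration; the real tests walk the exported span logs):

```python
TOKEN_KEYS = ("prompt_tokens", "completion_tokens", "tokens", "prompt_cached_tokens")

def check_token_placement(spans):
    # Token metrics must be present on the leaf `chat <model>` span and
    # absent from every wrapper span.
    for span in spans:
        has_tokens = any(k in span["metrics"] for k in TOKEN_KEYS)
        if span["is_leaf"]:
            assert has_tokens, f"leaf span {span['name']} lost its token metrics"
        else:
            assert not has_tokens, f"wrapper span {span['name']} still logs tokens"

check_token_placement([
    {"name": "agent_run", "is_leaf": False, "metrics": {"duration": 1.2}},
    {"name": "chat gpt-4o", "is_leaf": True, "metrics": {"tokens": 100, "prompt_tokens": 60}},
])
```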

colinbennettbrain and others added 5 commits April 16, 2026 22:20
Stripping wrapper-span metrics in 6a6cb76 deleted `_extract_usage_metrics`,
which had been the only call site that read `usage.input_audio_tokens`,
`usage.output_audio_tokens`, and dict-style `usage.details`. Leaf chat spans
went through `_extract_response_metrics`, which (a) never knew about the audio
fields and (b) reached for `usage.details.reasoning_tokens` as an attribute.
pydantic_ai's `RequestUsage.details` is `dict[str, int]` (see
pydantic_ai/usage.py and `models/openai.py:_map_usage` populating
`details['reasoning_tokens']` from `output_tokens_details` plus spreading
`completion_tokens_details.model_dump()`). `hasattr(dict, "reasoning_tokens")`
is `False`, so the existing branch silently dropped reasoning tokens on every
leaf span -- including for o1/o3.
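The failure mode reproduces with any plain dict:

```python
# RequestUsage.details is dict[str, int], so the old attribute-style
# hasattr check never fired:
details = {"reasoning_tokens": 64, "cached_tokens": 32}

print(hasattr(details, "reasoning_tokens"))  # False: dicts expose keys, not attributes
print(details.get("reasoning_tokens"))       # 64: key access is what works
```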

`_extract_response_metrics` now reads:
  - prompt_audio_tokens   <- usage.input_audio_tokens
  - completion_audio_tokens <- usage.output_audio_tokens
  - completion_reasoning_tokens <- details["reasoning_tokens"]
  - prompt_cached_tokens  <- details["cached_tokens"] (overrides cache_read_tokens
    when both present, matching the pre-strip helper)
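The mappings above can be sketched as follows -- `FakeUsage` is a stand-in for pydantic_ai's `RequestUsage` (only the field names come from this commit message; the helper here is illustrative, not the real `_extract_response_metrics`):

```python
from dataclasses import dataclass, field

@dataclass
class FakeUsage:
    # Stand-in for pydantic_ai's RequestUsage; fields per the mapping above.
    input_audio_tokens: int = 0
    output_audio_tokens: int = 0
    cache_read_tokens: int = 0
    details: dict = field(default_factory=dict)

def extract_audio_and_reasoning(usage) -> dict:
    metrics = {}
    if usage.input_audio_tokens:
        metrics["prompt_audio_tokens"] = usage.input_audio_tokens
    if usage.output_audio_tokens:
        metrics["completion_audio_tokens"] = usage.output_audio_tokens
    if usage.cache_read_tokens:
        metrics["prompt_cached_tokens"] = usage.cache_read_tokens
    details = usage.details or {}
    if "reasoning_tokens" in details:
        metrics["completion_reasoning_tokens"] = details["reasoning_tokens"]
    if "cached_tokens" in details:
        # details["cached_tokens"] overrides cache_read_tokens when both present
        metrics["prompt_cached_tokens"] = details["cached_tokens"]
    return metrics

usage = FakeUsage(input_audio_tokens=10, output_audio_tokens=5,
                  cache_read_tokens=7,
                  details={"reasoning_tokens": 64, "cached_tokens": 32})
print(extract_audio_and_reasoning(usage))
```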

Adds a synthesized-RequestUsage unit test covering all of the above. Rewrites
`test_reasoning_tokens_extraction` to use `RequestUsage` instead of `MagicMock`
-- the mock satisfied attribute-style `hasattr` lookups and was the reason the
broken `details.reasoning_tokens` branch passed CI while never firing in
production. Adds wrapper-no-tokens regressions to test_agent_run_stream_events,
test_direct_model_request_stream, and test_direct_model_request_stream_sync,
mirroring the assertions added to non-stream agent_run tests in 6a6cb76.
pylint's astroid doesn't narrow Optional via 'assert x is not None',
so subscripting tripped unsubscriptable-object. Assign through a
typed local to make the type explicit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
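The workaround pattern looks like this (the function and names are illustrative, not from the PR):

```python
from typing import Optional

def first_label(labels: Optional[list[str]]) -> str:
    assert labels is not None
    # pylint's astroid does not narrow Optional through the assert above,
    # so `labels[0]` would trip unsubscriptable-object; assigning through
    # an explicitly typed local makes the non-Optional type visible to it.
    narrowed: list[str] = labels
    return narrowed[0].upper()
```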