
fix(pydantic): retype wrapper spans as task to stop cost double-counting#312

Merged
Abhijeet Prasad (AbhiPrasad) merged 1 commit into main from colin-pydantic_ai on Apr 16, 2026
Conversation

@colinbennettbrain (Contributor) commented:

Agent wrapper spans (`agent_run`, `agent_run_sync`, `agent_run_stream*`, `agent_to_cli_sync`, `model_request*`) were tagged `type=llm` and logged the same usage metrics as their nested leaf `chat <model>` span. Experiment aggregations that sum metrics across `type=llm` spans therefore counted a single provider call twice (wrapper + leaf), inflating reported tokens and cost ~2x for a single-turn agent run and more for multi-turn runs.

Retag every wrapper span as `SpanTypeAttribute.TASK`; only the leaf `chat <model>` span emitted by `_wrap_concrete_model_class` stays `LLM`. `_DirectStreamWrapper` is used both as a wrapper (`direct.model_request_stream`) and as a leaf (`Model.request_stream`), so it gains a `span_type` parameter defaulting to `LLM`; wrapper callers pass `TASK` explicitly.
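The dual-role wrapper can be sketched roughly like this (a minimal illustration; `SpanTypeAttribute` and `_DirectStreamWrapper` are names taken from the PR, the rest is assumed and not the actual SDK implementation):

```python
from enum import Enum


class SpanTypeAttribute(Enum):
    LLM = "llm"
    TASK = "task"


class _DirectStreamWrapper:
    # span_type defaults to LLM for the leaf use (Model.request_stream);
    # wrapper callers (direct.model_request_stream) pass TASK explicitly.
    def __init__(self, stream, span_type=SpanTypeAttribute.LLM):
        self.stream = stream
        self.span_type = span_type


leaf = _DirectStreamWrapper(stream=None)                                       # leaf: stays LLM
wrapper = _DirectStreamWrapper(stream=None, span_type=SpanTypeAttribute.TASK)  # wrapper: TASK
print(leaf.span_type.value, wrapper.span_type.value)  # llm task
```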

Test coverage: flip the existing wrapper-type assertions to `TASK` (leaf `chat_span` assertions stay `LLM`) and extend the cassette-backed `test_agent_run_async` with a regression check that exactly one `type=llm` span exists and that prompt/completion tokens summed across `llm` spans equal the leaf's values.
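The shape of that regression check, sketched over plain dicts standing in for exported spans (the real test inspects cassette-backed SDK spans; field names and token counts here are illustrative):

```python
# One TASK wrapper plus one LLM leaf -- the shape produced by a single-turn agent run.
spans = [
    {"span_attributes": {"type": "task"}, "metrics": {}},  # agent_run wrapper
    {"span_attributes": {"type": "llm"},
     "metrics": {"prompt_tokens": 57, "completion_tokens": 12}},  # leaf chat <model>
]

llm_spans = [s for s in spans if s["span_attributes"]["type"] == "llm"]
assert len(llm_spans) == 1  # exactly one provider call is counted

# Summing across llm spans must reproduce the leaf's values exactly.
leaf = llm_spans[0]
for key in ("prompt_tokens", "completion_tokens"):
    total = sum(s["metrics"].get(key, 0) for s in llm_spans)
    assert total == leaf["metrics"][key]
print("ok")
```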


Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@AbhiPrasad Abhijeet Prasad (AbhiPrasad) changed the title from "retype wrapper spans as task to stop cost double-counting" to "fix(pydantic): retype wrapper spans as task to stop cost double-counting" on Apr 16, 2026
@AbhiPrasad Abhijeet Prasad (AbhiPrasad) merged commit 31d7c07 into main Apr 16, 2026
84 checks passed
@AbhiPrasad Abhijeet Prasad (AbhiPrasad) deleted the colin-pydantic_ai branch April 16, 2026 19:10
colinbennettbrain added a commit that referenced this pull request Apr 16, 2026
#312 retyped wrapper spans (agent_run, model_request, streaming
wrappers) from LLM to TASK, but they kept logging the same
prompt_tokens/completion_tokens/tokens as their nested leaf
`chat <model>` span. The server derives estimated_cost per-span from
tokens+metadata.model (brainstore estimated_cost.rs), rolls up trace
totals with `coalesce_add` over every non-scorer span regardless of
type (summary.rs accumulate_metrics), and sums experiment-level
token/cost over all non-scorer spans without filtering on
span_type='llm' (summary.ts experimentScanSpanSummary). Retyping to
TASK therefore did not stop double-counting on any of those three
axes.
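A toy model of why the retype alone did not help (assumption: this only mimics the rollup behavior the commit message attributes to summary.rs and summary.ts; it is not the server code): trace and experiment totals sum over every non-scorer span regardless of `span_type`, so a wrapper that still logs tokens is still counted.

```python
# After #312: wrapper is TASK, but both spans still carry the same token metrics.
spans = [
    {"type": "task", "scorer": False, "metrics": {"tokens": 69}},  # retyped wrapper
    {"type": "llm",  "scorer": False, "metrics": {"tokens": 69}},  # leaf chat <model>
]

# Rollup sums every non-scorer span, with no filter on type.
total = sum(s["metrics"].get("tokens", 0) for s in spans if not s["scorer"])
print(total)  # 138 -- the single provider call is still counted twice
```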

Route every wrapper log site through a new `_wrapper_span_metrics`
helper that emits only {start, end, duration, optional
time_to_first_token}. Leaf `chat <model>` spans (from
_wrap_concrete_model_class and _DirectStreamWrapper when
span_type=LLM) keep full _extract_response_metrics.
`_DirectStreamWrapper` now branches on span_type since it serves as
both leaf and wrapper. Delete now-dead `_extract_usage_metrics` and
`_extract_stream_usage_metrics`.

Flip existing cassette-backed assertions (test_agent_run_async,
test_agent_run_sync, test_agent_run_stream_async, test_agent_with_tools,
test_agent_run_stream_sync) to assert prompt_tokens / completion_tokens
/ tokens / prompt_cached_tokens are absent from wrapper spans and
present only on the leaf. No cassette re-recording needed -- the
change is purely in post-processing.
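The flipped assertions reduce to checks of this shape (spans here are illustrative dicts, not the real test fixtures):

```python
TOKEN_KEYS = ("prompt_tokens", "completion_tokens", "tokens", "prompt_cached_tokens")

# Wrapper spans now carry timing only; the leaf keeps the full usage metrics.
wrapper_metrics = {"start": 1.0, "end": 2.0, "duration": 1.0}
leaf_metrics = {"prompt_tokens": 57, "completion_tokens": 12, "tokens": 69}

for key in TOKEN_KEYS:
    assert key not in wrapper_metrics  # absent from every wrapper span
assert all(k in leaf_metrics for k in ("prompt_tokens", "completion_tokens", "tokens"))
print("ok")
```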

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>