feat(streaming): emit OTel metrics for ttft, tps, token counts #347
Open
Conversation
…counts

Adds six metrics to TemporalStreamingModel.get_response so applications that configure an OTel MeterProvider can see streaming-call behavior without per-app instrumentation:

- agentex.llm.ttft (histogram, ms): time from request start to first content delta. Captured on the first ResponseTextDeltaEvent / ResponseReasoningTextDeltaEvent / ResponseReasoningSummaryTextDeltaEvent.
- agentex.llm.tps (histogram, tokens/s): output_tokens / stream_duration. Use 1/tps for time-per-output-token (TPOT).
- agentex.llm.input_tokens / output_tokens / cached_input_tokens / reasoning_tokens (counters): pulled from the captured Responses API Usage at end-of-stream. Cache hit rate is computed at query time as rate(cached_input_tokens) / rate(input_tokens).

Why

- The data was already captured (line 854, captured_usage = response.usage) but never emitted as metrics. Apps could only see total LLM call duration, not the meaningful breakdowns.
- Doing this in the SDK rather than in each app means every consumer of TemporalStreamingModel gets the metrics for free.
- Cardinality is bounded — only `model` is a metric attribute. Resource attributes (service.name, k8s.*, etc.) come from the application's configured OTel resource, so cross-app comparisons work cleanly in Mimir/Prometheus.

The meter is a no-op when no MeterProvider is configured, so this is safe for apps that don't run with OTel.
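For orientation, here is a minimal sketch of how the end-of-stream recording could look with the OpenTelemetry Python metrics API. The instrument names and the single `model` attribute come from this commit message; the function name, parameters, and the Usage field access (which assumes the OpenAI Responses API Usage shape) are illustrative, not the PR's actual code.

```python
from opentelemetry import metrics

# Returns a no-op meter if the application never configures a MeterProvider.
meter = metrics.get_meter("agentex.llm")

ttft_hist = meter.create_histogram("agentex.llm.ttft", unit="ms")
tps_hist = meter.create_histogram("agentex.llm.tps", unit="{token}/s")
output_tokens_ctr = meter.create_counter("agentex.llm.output_tokens", unit="{token}")
# input_tokens, cached_input_tokens, and reasoning_tokens counters are created the same way.


def record_end_of_stream(model: str, ttft_ms: float, gen_seconds: float, usage) -> None:
    """Record metrics once the stream completes and Usage has been captured."""
    attrs = {"model": model}  # the only metric attribute, keeping cardinality bounded
    ttft_hist.record(ttft_ms, attributes=attrs)
    if gen_seconds > 0:
        tps_hist.record(usage.output_tokens / gen_seconds, attributes=attrs)
    # Field names below assume the OpenAI Responses API Usage object.
    output_tokens_ctr.add(usage.output_tokens, attributes=attrs)
```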
Three changes from PR #347 review:

1. Move stream_start_perf above responses.create() so ttft captures the full request-to-first-token latency (HTTP round-trip + model TTFB), not just post-connect event-loop delay. Drop the duplicate assignment that was being overwritten after the await.
2. Track last_token_at alongside first_token_at and use the generation window (first→last delta) as the tps denominator. Total stream duration was inflated by tool-call argument deltas, reasoning events, and stream_update awaits — making tps under-report the model's actual generation speed for agentic responses.
3. Lazy-create the OTel instruments inside _StreamingMetrics rather than at import time, so a MeterProvider configured *after* this module is imported still binds correctly. Apps that bootstrap OTel in a startup hook (common with lazy-init patterns) would otherwise silently send to the no-op provider.
- Bookmark first/last token timestamps on ResponseFunctionCallArgumentsDeltaEvent too, so the tps generation window covers all event types whose tokens land in usage.output_tokens. Previously the numerator counted argument tokens but the denominator excluded their generation time, inflating tps for tool-heavy responses.
- Lifted the bookmarking out of the text-delta branch into a single up-front check covering all four token-producing event types — cleaner than duplicating across branches.
- Documented the single-token skip case (window collapses to 0) inline at the guard. TPS is undefined for a one-token response, so emitting nothing is correct; the comment makes the intent visible to future readers.
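A rough sketch of the bookmarking and window logic described in these two commits, under assumptions: the function name and structure are illustrative, only two of the openai event classes are imported here (the reasoning-delta classes named above would be added to the tuple, with import paths varying by openai SDK version), and the real code in temporal_streaming_model.py is organized differently.

```python
import time

from openai.types.responses import (
    ResponseFunctionCallArgumentsDeltaEvent,
    ResponseTextDeltaEvent,
)

# The reasoning text / reasoning summary delta event classes named in this PR
# would be included here as well.
TOKEN_PRODUCING_EVENTS = (
    ResponseTextDeltaEvent,
    ResponseFunctionCallArgumentsDeltaEvent,
)


async def measure_stream(stream, stream_start_perf: float):
    """stream_start_perf is captured with time.perf_counter() just before the
    responses.create(...) call, so TTFT includes the full HTTP round-trip."""
    first_token_at = last_token_at = None

    async for event in stream:
        # Single up-front check covering all token-producing event types.
        if isinstance(event, TOKEN_PRODUCING_EVENTS):
            now = time.perf_counter()
            if first_token_at is None:
                first_token_at = now
            last_token_at = now
        # ... existing event handling / stream_update awaits ...

    if first_token_at is None:
        return None, None  # no content deltas were seen
    ttft_s = first_token_at - stream_start_perf
    window_s = last_token_at - first_token_at
    # tps would be usage.output_tokens / window_s, skipped when window_s == 0
    # (TPS is undefined for a one-token response).
    return ttft_s, window_s
```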
…t counter
Two follow-up changes from the PR review:
1. Move the LLM metric instruments from _StreamingMetrics in
temporal_streaming_model.py to a new module:
agentex.lib.core.observability.llm_metrics
Public API: get_llm_metrics() returns a singleton LLMMetrics with
the same six instruments (ttft, tps, input_tokens, output_tokens,
cached_input_tokens, reasoning_tokens) plus a new requests counter.
This makes the temporal+openai_agents plugin one of several future
call sites — the sync ACP path and the Claude SDK plugin can
record to the same instruments without redefining names, units,
or descriptions. Keeps cross-provider naming consistent.
2. Add agentex.llm.requests counter with a status label so 429s,
5xxs, timeouts, and other failures are observable on the SDK
side without scraping logs. classify_status() maps exception
types to a small fixed set (success / rate_limit / server_error
/ client_error / timeout / network_error / other_error) by class
name, so it works across OpenAI, Anthropic, and other provider
SDKs that use similar exception naming.
Recorded in two places: success path (alongside token counters)
and the existing get_response except handler (so terminal
failures emit a counter event before re-raising).
Cardinality remains bounded — model + status (7 values) on the
counter; all other metrics keep just `model`.
Comment on lines 1057 to 1065 in src/agentex/lib/core/temporal/plugins/openai_agents/models/temporal_streaming_model.py
```python
except Exception as e:
    logger.error(f"Error using Responses API: {e}")
    # Emit a request-counter event so 429s, 5xxs, timeouts, etc. are
    # observable on the SDK side. Status histograms / token counters
    # only fire on successful completion above.
    get_llm_metrics().requests.add(
        1, {"model": self.model_name, "status": classify_status(e)}
    )
    raise
```
**OTel exception in error handler replaces original LLM exception.** If `get_llm_metrics().requests.add()` throws (e.g., a misbehaving exporter), the `raise` is never reached and the caller receives an OTel SDK exception instead of the original LLM error. Retry logic or circuit breakers that inspect the exception type (`RateLimitError`, `APITimeoutError`, etc.) would receive an unexpected OTel exception, silently breaking them. Wrapping in a bare `except` prevents this.
Suggested change

```suggestion
except Exception as e:
    logger.error(f"Error using Responses API: {e}")
    # Emit a request-counter event so 429s, 5xxs, timeouts, etc. are
    # observable on the SDK side. Status histograms / token counters
    # only fire on successful completion above.
    try:
        get_llm_metrics().requests.add(
            1, {"model": self.model_name, "status": classify_status(e)}
        )
    except Exception:
        pass
    raise
```
joshua0chen approved these changes on May 7, 2026
What this adds

Six OTel metrics to `TemporalStreamingModel.get_response`:

- `agentex.llm.ttft` (histogram, ms) — time from request start to first content delta
- `agentex.llm.tps` (histogram, tokens/s) — output tokens per second
- `agentex.llm.input_tokens` / `output_tokens` / `cached_input_tokens` / `reasoning_tokens` (counters)

Cache hit rate computed at query time: `rate(cached_input_tokens) / rate(input_tokens)`. Time-per-output-token (TPOT) computed as `1/tps`.

Why in the SDK rather than per-app

The data is already captured at line 854 (`captured_usage = response.usage`) — every app loses out on it because nothing emits it as metrics. Doing this in the SDK means every consumer of `TemporalStreamingModel` gets the metrics for free, with consistent naming.

The meter is a no-op when no `MeterProvider` is configured, so this is safe for apps that don't run with OTel.

Cardinality

Only `model` is a metric attribute. Resource attributes (`service.name`, `k8s.*`, etc.) come from the application's configured OTel resource, so cross-app comparisons work cleanly in Mimir/Prometheus.

Test plan

Notes

TTFT is measured client-side (request start → first delta seen). It includes any client-side queuing / network setup. Useful for end-user perceived latency, less useful for raw model latency.
Greptile Summary
This PR adds six OTel metrics (ttft, tps, input_tokens, output_tokens, cached_input_tokens, reasoning_tokens) plus a requests counter to TemporalStreamingModel.get_response, centralized in a new llm_metrics.py module with lazy initialization. The TTFT timer is correctly placed before responses.create() and TPS uses a generation window denominator. One new P1 was found: in the except error handler, an OTel exception from requests.add() replaces the original LLM exception before raise executes, breaking caller retry/circuit-breaker logic. A P2 was also flagged for non-thread-safe singleton initialization in llm_metrics.py.
Confidence Score: 3/5
Not safe to merge: a P1 in the error handler can replace the original LLM exception with an OTel exception, breaking caller retry logic.
Two unguarded OTel call sites remain — one on the success path (flagged in previous threads) and a new one on the error path, where an OTel exception replaces the original LLM exception before `raise`. Multiple P1s lower the score below the 4/5 ceiling.
temporal_streaming_model.py lines 1017-1065 (both metrics emission blocks need try/except guards); llm_metrics.py singleton init.
Important Files Changed
Comments Outside Diff (1)
src/agentex/lib/core/temporal/plugins/openai_agents/models/temporal_streaming_model.py, lines 635-637

`stream_start_perf` is assigned on line 689, after `await self.client.responses.create(...)` has already returned (lines 637–673). That `await` completes only after the TCP connection, the full HTTP request send, and the server returning response headers — easily 200–800 ms for a typical LLM call. As a result, `agentex.llm.ttft` will systematically under-report TTFT relative to its documented definition ("time from streaming-request start to first content token"). Moving the bookmark to just before the `responses.create` call captures the full client-perceived TTFT.
Reviews (4): Last reviewed commit: "review (stas): extract llm metrics to co..."