
feat(streaming): emit OTel metrics for ttft, tps, token counts #347

Open
x wants to merge 4 commits into main from x/llm-metrics-instrumentation

Conversation

@x (Contributor) commented May 7, 2026

What this adds

Six OTel metrics to TemporalStreamingModel.get_response:

  • agentex.llm.ttft (histogram, ms) — time from request start to first content delta
  • agentex.llm.tps (histogram, tokens/s) — output tokens per second
  • agentex.llm.input_tokens / output_tokens / cached_input_tokens / reasoning_tokens (counters)

Cache hit rate computed at query time: rate(cached_input_tokens) / rate(input_tokens).
Time-per-output-token (TPOT) computed as 1/tps.
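
For concreteness, a minimal sketch of the new module's shape, using the standard opentelemetry.metrics API. The instrument names and units follow the list above, and the LLMMetrics / get_llm_metrics names follow the later review commits; the implementation details (meter name, lazy-init pattern) are illustrative rather than a copy of the PR's code, and the requests counter added later in review is omitted:

```python
from opentelemetry import metrics


class LLMMetrics:
    """Streaming-LLM instruments, created lazily so a MeterProvider configured
    after import still binds. If no provider is ever configured, get_meter()
    hands back a no-op meter and every record()/add() below is a cheap no-op."""

    def __init__(self) -> None:
        meter = metrics.get_meter("agentex.llm")  # meter name is illustrative
        self.ttft = meter.create_histogram(
            "agentex.llm.ttft", unit="ms",
            description="Time from request start to first content delta",
        )
        self.tps = meter.create_histogram(
            "agentex.llm.tps", unit="tokens/s",
            description="Output tokens per second over the generation window",
        )
        self.input_tokens = meter.create_counter("agentex.llm.input_tokens", unit="tokens")
        self.output_tokens = meter.create_counter("agentex.llm.output_tokens", unit="tokens")
        self.cached_input_tokens = meter.create_counter("agentex.llm.cached_input_tokens", unit="tokens")
        self.reasoning_tokens = meter.create_counter("agentex.llm.reasoning_tokens", unit="tokens")


_llm_metrics: LLMMetrics | None = None


def get_llm_metrics() -> LLMMetrics:
    """Lazy singleton accessor (public API per this PR's later commits)."""
    global _llm_metrics
    if _llm_metrics is None:
        _llm_metrics = LLMMetrics()
    return _llm_metrics
```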

Why in the SDK rather than per-app

The data is already captured at line 854 (captured_usage = response.usage) — every app loses out on it because nothing emits it as metrics. Doing this in the SDK means every consumer of TemporalStreamingModel gets the metrics for free, with consistent naming.

The meter is a no-op when no MeterProvider is configured, so this is safe for apps that don't run with OTel.

Cardinality

Only model is a metric attribute. Resource attributes (service.name, k8s.*, etc.) come from the application's configured OTel resource, so cross-app comparisons work cleanly in Mimir/Prometheus.
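
For reference, this is roughly the app-side wiring that makes that work: standard OTel SDK calls, with the service name and OTLP exporter choice as placeholders for whatever the app already configures:

```python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# service.name (and k8s.* injected by the collector/operator) become resource
# attributes on every metric, so per-metric labels can stay at just {"model"}.
resource = Resource.create({"service.name": "my-agent-app"})  # placeholder name
reader = PeriodicExportingMetricReader(OTLPMetricExporter())  # endpoint via OTEL_EXPORTER_OTLP_* env vars
metrics.set_meter_provider(MeterProvider(resource=resource, metric_readers=[reader]))
# Skip all of this and the SDK's metric calls quietly hit a no-op meter.
```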

Test plan

  • Build and deploy to a dev environment with OTel configured
  • Confirm metrics flow to Mimir / Prometheus with expected labels
  • Verify TTFT readings are sensible (~500 ms–3 s for typical OpenAI/Foundry calls)
  • Verify TPS aligns with hand-calculation from existing duration + token counts

Notes

TTFT is measured client-side (request start → first delta seen). It includes any client-side queuing / network setup. Useful for end-user perceived latency, less useful for raw model latency.
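
The placement is easiest to see in a stripped-down sketch (the get_llm_metrics() accessor is the one introduced by this PR; the delta check stands in for the specific event classes the real code matches):

```python
import time


async def stream_with_ttft(client, model_name: str, **request_kwargs):
    """Illustrative sketch of the TTFT measurement, not the SDK's actual code."""
    llm_metrics = get_llm_metrics()
    start = time.perf_counter()  # started before the HTTP round-trip (per the review fix)
    stream = await client.responses.create(stream=True, **request_kwargs)

    first_token_at = None
    async for event in stream:
        # Simplified stand-in: the real code checks specific text / reasoning /
        # tool-call-argument delta event classes.
        if getattr(event, "type", "").endswith(".delta") and first_token_at is None:
            first_token_at = time.perf_counter()
            llm_metrics.ttft.record(
                (first_token_at - start) * 1000.0, {"model": model_name}
            )
        yield event
```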

Greptile Summary

This PR adds six OTel metrics (ttft, tps, input_tokens, output_tokens, cached_input_tokens, reasoning_tokens) plus a requests counter to TemporalStreamingModel.get_response, centralized in a new llm_metrics.py module with lazy initialization. The TTFT timer is correctly placed before responses.create() and TPS uses a generation window denominator. One new P1 was found: in the except error handler, an OTel exception from requests.add() replaces the original LLM exception before raise executes, breaking caller retry/circuit-breaker logic. A P2 was also flagged for non-thread-safe singleton initialization in llm_metrics.py.

Confidence Score: 3/5

Not safe to merge: a P1 in the error handler can replace the original LLM exception with an OTel exception, breaking caller retry logic.

Two unguarded OTel call sites remain: one on the success path (flagged in previous threads) and a new one on the error path, where an OTel exception can replace the original LLM exception before raise. Multiple P1s keep the score below the 4/5 ceiling.

temporal_streaming_model.py lines 1017-1065 (both metrics emission blocks need try/except guards); llm_metrics.py singleton init.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/agentex/lib/core/observability/llm_metrics.py | New module introducing lazy-init OTel instruments (7 counters/histograms) with a classify_status helper; singleton initialization is not thread-safe under concurrent calls. |
| src/agentex/lib/core/observability/__init__.py | Empty package init file — no issues. |
| src/agentex/lib/core/temporal/plugins/openai_agents/models/temporal_streaming_model.py | Adds TTFT/TPS timers and token-count metric emission; an OTel exception in the error handler can replace the original LLM exception, breaking caller retry logic; the success metrics block is also unguarded (previous thread). |

Comments Outside Diff (1)

  1. src/agentex/lib/core/temporal/plugins/openai_agents/models/temporal_streaming_model.py, lines 635–637

    P1 TTFT timer starts after the HTTP round-trip, not at request start. stream_start_perf is assigned on line 689, after await self.client.responses.create(...) has already returned (line 637–673). That await completes only after the TCP connection, the full HTTP request send, and the server returning response headers — easily 200–800 ms for a typical LLM call. As a result, agentex.llm.ttft will systematically under-report TTFT relative to its documented definition ("time from streaming-request start to first content token"). Moving the bookmark to just before the responses.create call captures the full client-perceived TTFT.


Review Issues (2)

### Issue 1 of 2
src/agentex/lib/core/temporal/plugins/openai_agents/models/temporal_streaming_model.py:1057-1065
**OTel exception in error handler replaces original LLM exception.** If `get_llm_metrics().requests.add()` throws (e.g., a misbehaving exporter), the `raise` is never reached and the caller receives an OTel SDK exception instead of the original LLM error. Retry logic or circuit breakers that inspect the exception type (`RateLimitError`, `APITimeoutError`, etc.) would receive an unexpected OTel exception, silently breaking them. Wrapping in a bare `except` prevents this.

```suggestion
            except Exception as e:
                logger.error(f"Error using Responses API: {e}")
                # Emit a request-counter event so 429s, 5xxs, timeouts, etc. are
                # observable on the SDK side. Status histograms / token counters
                # only fire on successful completion above.
                try:
                    get_llm_metrics().requests.add(
                        1, {"model": self.model_name, "status": classify_status(e)}
                    )
                except Exception:
                    pass
                raise
```

### Issue 2 of 2
src/agentex/lib/core/observability/llm_metrics.py:80-88
`get_llm_metrics()` is not safe for concurrent initialization in a multi-threaded runtime. Two threads reaching the `if _llm_metrics is None` check simultaneously will each create a `LLMMetrics` instance, and whichever finishes second wins — leaving dangling instrument handles from the first. Using `functools.lru_cache` on a module-level function is the idiomatic Python way to get thread-safe lazy singletons without an explicit lock.

```suggestion
import functools

@functools.lru_cache(maxsize=None)
def get_llm_metrics() -> LLMMetrics:
    """Return the LLM metrics singleton, creating it on first use.

    ``lru_cache`` provides thread-safe one-time initialization without an
    explicit lock — safe for both threaded and async callers.
    """
    return LLMMetrics()
```

Reviews (4): Last reviewed commit: "review (stas): extract llm metrics to co..."

Greptile also left 1 inline comment on this PR.

…counts

Adds six metrics to TemporalStreamingModel.get_response so applications
that configure an OTel MeterProvider can see streaming-call behavior
without per-app instrumentation:

- agentex.llm.ttft (histogram, ms): time from request start to first
  content delta. Captured on the first ResponseTextDeltaEvent /
  ResponseReasoningTextDeltaEvent / ResponseReasoningSummaryTextDeltaEvent.
- agentex.llm.tps (histogram, tokens/s): output_tokens / stream_duration.
  Use 1/tps for time-per-output-token (tpot).
- agentex.llm.input_tokens / output_tokens / cached_input_tokens /
  reasoning_tokens (counters): pulled from the captured ResponsesAPI
  Usage at end-of-stream. Cache hit rate is computed at query time as
  rate(cached_input_tokens) / rate(input_tokens).

Why
- The data was already captured (line 854 captured_usage = response.usage)
  but never emitted as metrics. Apps could only see total LLM call
  duration, not the meaningful breakdowns.
- Doing this in the SDK rather than each app means every consumer of
  TemporalStreamingModel gets the metrics for free.
- Cardinality is bounded — only `model` is a metric attribute. Resource
  attributes (service.name, k8s.*, etc.) come from the application's
  configured OTel resource, so cross-app comparisons work cleanly in
  Mimir/Prometheus.

The meter is a no-op when no MeterProvider is configured, so this is
safe for apps that don't run with OTel.
x added 2 commits May 7, 2026 12:40
Three changes from PR #347 review:

1. Move stream_start_perf above responses.create() so ttft captures the
   full request-to-first-token latency (HTTP round-trip + model TTFB),
   not just post-connect event-loop delay. Drop the duplicate assignment
   that was being overwritten after the await.

2. Track last_token_at alongside first_token_at and use the generation
   window (first→last delta) as the tps denominator. Total stream
   duration was inflated by tool-call argument deltas, reasoning events,
   and stream_update awaits — making tps under-report the model's
   actual generation speed for agentic responses.

3. Lazy-create the OTel instruments inside _StreamingMetrics rather
   than at import time, so a MeterProvider configured *after* this
   module is imported still binds correctly. Apps that bootstrap OTel
   in a startup hook (common with lazy-init patterns) would otherwise
   silently send to the no-op provider.
- Bookmark first/last token timestamps on ResponseFunctionCallArgumentsDeltaEvent
  too so the tps generation window covers all event types whose tokens land in
  usage.output_tokens. Previously the numerator counted argument tokens but
  the denominator excluded their generation time, inflating tps for tool-heavy
  responses.

- Lifted the bookmarking out of the text-delta branch into a single up-front
  check covering all four token-producing event types — cleaner than
  duplicating across branches.

- Documented the single-token skip case (window collapses to 0) inline at the
  guard. TPS is undefined for a one-token response so emitting nothing is
  correct; the comment makes the intent visible to future readers.
…t counter

Two follow-up changes from the PR review:

1. Move the LLM metric instruments from _StreamingMetrics in
   temporal_streaming_model.py to a new module:
     agentex.lib.core.observability.llm_metrics

   Public API: get_llm_metrics() returns a singleton LLMMetrics with
   the same six instruments (ttft, tps, input_tokens, output_tokens,
   cached_input_tokens, reasoning_tokens) plus a new requests counter.

   This makes the temporal+openai_agents plugin one of several future
   call sites — the sync ACP path and the Claude SDK plugin can
   record to the same instruments without redefining names, units,
   or descriptions. Keeps cross-provider naming consistent.

2. Add agentex.llm.requests counter with a status label so 429s,
   5xxs, timeouts, and other failures are observable on the SDK
   side without scraping logs. classify_status() maps exception
   types to a small fixed set (success / rate_limit / server_error
   / client_error / timeout / network_error / other_error) by class
   name, so it works across OpenAI, Anthropic, and other provider
   SDKs that use similar exception naming.

   Recorded in two places: success path (alongside token counters)
   and the existing get_response except handler (so terminal
   failures emit a counter event before re-raising).

Cardinality remains bounded — model + status (7 values) on the
counter; all other metrics keep just `model`.
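
A classifier in that spirit could look like the sketch below; the exact class-name substrings and their ordering in the real classify_status() may differ:

```python
def classify_status(error: BaseException | None = None) -> str:
    """Map an exception to one of a small, fixed set of status label values.

    Matching on the class name (rather than importing provider SDK exception
    types) keeps this working across OpenAI, Anthropic, and similar SDKs.
    Illustrative sketch; not a copy of the PR's helper.
    """
    if error is None:
        return "success"
    name = type(error).__name__.lower()
    if "ratelimit" in name:
        return "rate_limit"
    if "timeout" in name:
        return "timeout"
    if "connection" in name or "network" in name:
        return "network_error"
    if "internalserver" in name or "serviceunavailable" in name or "servererror" in name:
        return "server_error"
    if any(s in name for s in ("badrequest", "authentication", "permission", "notfound", "unprocessable")):
        return "client_error"
    return "other_error"
```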
Comment on lines 1057 to 1065
            except Exception as e:
                logger.error(f"Error using Responses API: {e}")
                # Emit a request-counter event so 429s, 5xxs, timeouts, etc. are
                # observable on the SDK side. Status histograms / token counters
                # only fire on successful completion above.
                get_llm_metrics().requests.add(
                    1, {"model": self.model_name, "status": classify_status(e)}
                )
                raise

P1 OTel exception in error handler replaces original LLM exception. If get_llm_metrics().requests.add() throws (e.g., a misbehaving exporter), the raise is never reached and the caller receives an OTel SDK exception instead of the original LLM error. Retry logic or circuit breakers that inspect the exception type (RateLimitError, APITimeoutError, etc.) would receive an unexpected OTel exception, silently breaking them. Wrapping in a bare except prevents this.

Suggested change

            except Exception as e:
                logger.error(f"Error using Responses API: {e}")
                # Emit a request-counter event so 429s, 5xxs, timeouts, etc. are
                # observable on the SDK side. Status histograms / token counters
                # only fire on successful completion above.
                try:
                    get_llm_metrics().requests.add(
                        1, {"model": self.model_name, "status": classify_status(e)}
                    )
                except Exception:
                    pass
                raise
