
feat(streaming): emit OTel metrics for ttft, tps, token counts #347

Open
x wants to merge 4 commits into main from x/llm-metrics-instrumentation

Conversation

@x (Contributor) commented May 7, 2026

What this adds

Six OTel metrics to TemporalStreamingModel.get_response:

  • agentex.llm.ttft (histogram, ms) — time from request start to first content delta
  • agentex.llm.tps (histogram, tokens/s) — output tokens per second
  • agentex.llm.input_tokens / output_tokens / cached_input_tokens / reasoning_tokens (counters)

Cache hit rate computed at query time: rate(cached_input_tokens) / rate(input_tokens).
Time-per-output-token (TPOT) computed as 1/tps.
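
For concreteness, a minimal sketch of the new module's shape, using the standard opentelemetry.metrics API. The instrument names and units follow the list above, and the LLMMetrics / get_llm_metrics names follow the later review commits; the implementation details (meter name, lazy-init pattern) are illustrative rather than a copy of the PR's code, and the requests counter added later in review is omitted:

```python
from opentelemetry import metrics


class LLMMetrics:
    """Streaming-LLM instruments, created lazily so a MeterProvider configured
    after import still binds. If no provider is ever configured, get_meter()
    hands back a no-op meter and every record()/add() below is a cheap no-op."""

    def __init__(self) -> None:
        meter = metrics.get_meter("agentex.llm")  # meter name is illustrative
        self.ttft = meter.create_histogram(
            "agentex.llm.ttft", unit="ms",
            description="Time from request start to first content delta",
        )
        self.tps = meter.create_histogram(
            "agentex.llm.tps", unit="tokens/s",
            description="Output tokens per second over the generation window",
        )
        self.input_tokens = meter.create_counter("agentex.llm.input_tokens", unit="tokens")
        self.output_tokens = meter.create_counter("agentex.llm.output_tokens", unit="tokens")
        self.cached_input_tokens = meter.create_counter("agentex.llm.cached_input_tokens", unit="tokens")
        self.reasoning_tokens = meter.create_counter("agentex.llm.reasoning_tokens", unit="tokens")


_llm_metrics: LLMMetrics | None = None


def get_llm_metrics() -> LLMMetrics:
    """Lazy singleton accessor (public API per this PR's later commits)."""
    global _llm_metrics
    if _llm_metrics is None:
        _llm_metrics = LLMMetrics()
    return _llm_metrics
```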

Why in the SDK rather than per-app

The data is already captured at line 854 (captured_usage = response.usage) — every app loses out on it because nothing emits it as metrics. Doing this in the SDK means every consumer of TemporalStreamingModel gets the metrics for free, with consistent naming.

The meter is a no-op when no MeterProvider is configured, so this is safe for apps that don't run with OTel.

Cardinality

Only model is a metric attribute. Resource attributes (service.name, k8s.*, etc.) come from the application's configured OTel resource, so cross-app comparisons work cleanly in Mimir/Prometheus.
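
For reference, this is roughly the app-side wiring that makes that work: standard OTel SDK calls, with the service name and OTLP exporter choice as placeholders for whatever the app already configures:

```python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# service.name (and k8s.* injected by the collector/operator) become resource
# attributes on every metric, so per-metric labels can stay at just {"model"}.
resource = Resource.create({"service.name": "my-agent-app"})  # placeholder name
reader = PeriodicExportingMetricReader(OTLPMetricExporter())  # endpoint via OTEL_EXPORTER_OTLP_* env vars
metrics.set_meter_provider(MeterProvider(resource=resource, metric_readers=[reader]))
# Skip all of this and the SDK's metric calls quietly hit a no-op meter.
```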

Test plan

  • Build and deploy to a dev environment with OTel configured
  • Confirm metrics flow to Mimir / Prometheus with expected labels
  • Verify TTFT readings are sensible (~500 ms–3 s for typical OpenAI/Foundry calls)
  • Verify TPS aligns with hand-calculation from existing duration + token counts

Notes

TTFT is measured client-side (request start → first delta seen). It includes any client-side queuing / network setup. Useful for end-user perceived latency, less useful for raw model latency.
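
The placement is easiest to see in a stripped-down sketch (the get_llm_metrics() accessor is the one introduced by this PR; the delta check stands in for the specific event classes the real code matches):

```python
import time


async def stream_with_ttft(client, model_name: str, **request_kwargs):
    """Illustrative sketch of the TTFT measurement, not the SDK's actual code."""
    llm_metrics = get_llm_metrics()
    start = time.perf_counter()  # started before the HTTP round-trip (per the review fix)
    stream = await client.responses.create(stream=True, **request_kwargs)

    first_token_at = None
    async for event in stream:
        # Simplified stand-in: the real code checks specific text / reasoning /
        # tool-call-argument delta event classes.
        if getattr(event, "type", "").endswith(".delta") and first_token_at is None:
            first_token_at = time.perf_counter()
            llm_metrics.ttft.record(
                (first_token_at - start) * 1000.0, {"model": model_name}
            )
        yield event
```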

Greptile Summary

This PR adds six OTel metrics (ttft, tps, input_tokens, output_tokens, cached_input_tokens, reasoning_tokens) plus a requests counter to TemporalStreamingModel.get_response, centralized in a new llm_metrics.py module with lazy initialization. The TTFT timer is correctly placed before responses.create() and TPS uses a generation window denominator. One new P1 was found: in the except error handler, an OTel exception from requests.add() replaces the original LLM exception before raise executes, breaking caller retry/circuit-breaker logic. A P2 was also flagged for non-thread-safe singleton initialization in llm_metrics.py.

Confidence Score: 3/5

Not safe to merge: a P1 in the error handler can replace the original LLM exception with an OTel exception, breaking caller retry logic.

Two unguarded OTel call sites remain: one on the success path (flagged in previous threads) and a new one on the error path, where an OTel exception can replace the original LLM exception before raise. Multiple P1s keep the score below the 4/5 ceiling.

temporal_streaming_model.py lines 1017-1065 (both metrics emission blocks need try/except guards); llm_metrics.py singleton init.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/agentex/lib/core/observability/llm_metrics.py | New module introducing lazy-init OTel instruments (7 counters/histograms) with a classify_status helper; singleton initialization is not thread-safe under concurrent calls. |
| src/agentex/lib/core/observability/__init__.py | Empty package init file — no issues. |
| src/agentex/lib/core/temporal/plugins/openai_agents/models/temporal_streaming_model.py | Adds TTFT/TPS timers and token-count metric emission; an OTel exception in the error handler can replace the original LLM exception, breaking caller retry logic; the success metrics block is also unguarded (previous thread). |

Comments Outside Diff (1)

  1. src/agentex/lib/core/temporal/plugins/openai_agents/models/temporal_streaming_model.py, lines 635–637

    P1 TTFT timer starts after the HTTP round-trip, not at request start. stream_start_perf is assigned on line 689, after await self.client.responses.create(...) has already returned (line 637–673). That await completes only after the TCP connection, the full HTTP request send, and the server returning response headers — easily 200–800 ms for a typical LLM call. As a result, agentex.llm.ttft will systematically under-report TTFT relative to its documented definition ("time from streaming-request start to first content token"). Moving the bookmark to just before the responses.create call captures the full client-perceived TTFT.


Review Issues (2)

### Issue 1 of 2
src/agentex/lib/core/temporal/plugins/openai_agents/models/temporal_streaming_model.py:1057-1065
**OTel exception in error handler replaces original LLM exception.** If `get_llm_metrics().requests.add()` throws (e.g., a misbehaving exporter), the `raise` is never reached and the caller receives an OTel SDK exception instead of the original LLM error. Retry logic or circuit breakers that inspect the exception type (`RateLimitError`, `APITimeoutError`, etc.) would receive an unexpected OTel exception, silently breaking them. Wrapping in a bare `except` prevents this.

```suggestion
            except Exception as e:
                logger.error(f"Error using Responses API: {e}")
                # Emit a request-counter event so 429s, 5xxs, timeouts, etc. are
                # observable on the SDK side. Status histograms / token counters
                # only fire on successful completion above.
                try:
                    get_llm_metrics().requests.add(
                        1, {"model": self.model_name, "status": classify_status(e)}
                    )
                except Exception:
                    pass
                raise
```

### Issue 2 of 2
src/agentex/lib/core/observability/llm_metrics.py:80-88
`get_llm_metrics()` is not safe for concurrent initialization in a multi-threaded runtime. Two threads reaching the `if _llm_metrics is None` check simultaneously will each create a `LLMMetrics` instance, and whichever finishes second wins — leaving dangling instrument handles from the first. Using `functools.lru_cache` on a module-level function is the idiomatic Python way to get thread-safe lazy singletons without an explicit lock.

```suggestion
import functools

@functools.lru_cache(maxsize=None)
def get_llm_metrics() -> LLMMetrics:
    """Return the LLM metrics singleton, creating it on first use.

    ``lru_cache`` provides thread-safe one-time initialization without an
    explicit lock — safe for both threaded and async callers.
    """
    return LLMMetrics()
```

Reviews (4): Last reviewed commit: "review (stas): extract llm metrics to co..."

Greptile also left 1 inline comment on this PR.

…counts

Adds six metrics to TemporalStreamingModel.get_response so applications
that configure an OTel MeterProvider can see streaming-call behavior
without per-app instrumentation:

- agentex.llm.ttft (histogram, ms): time from request start to first
  content delta. Captured on the first ResponseTextDeltaEvent /
  ResponseReasoningTextDeltaEvent / ResponseReasoningSummaryTextDeltaEvent.
- agentex.llm.tps (histogram, tokens/s): output_tokens / stream_duration.
  Use 1/tps for time-per-output-token (tpot).
- agentex.llm.input_tokens / output_tokens / cached_input_tokens /
  reasoning_tokens (counters): pulled from the captured ResponsesAPI
  Usage at end-of-stream. Cache hit rate is computed at query time as
  rate(cached_input_tokens) / rate(input_tokens).

Why
- The data was already captured (line 854 captured_usage = response.usage)
  but never emitted as metrics. Apps could only see total LLM call
  duration, not the meaningful breakdowns.
- Doing this in the SDK rather than each app means every consumer of
  TemporalStreamingModel gets the metrics for free.
- Cardinality is bounded — only `model` is a metric attribute. Resource
  attributes (service.name, k8s.*, etc.) come from the application's
  configured OTel resource, so cross-app comparisons work cleanly in
  Mimir/Prometheus.

The meter is a no-op when no MeterProvider is configured, so this is
safe for apps that don't run with OTel.
x added 2 commits May 7, 2026 12:40
Three changes from PR #347 review:

1. Move stream_start_perf above responses.create() so ttft captures the
   full request-to-first-token latency (HTTP round-trip + model TTFB),
   not just post-connect event-loop delay. Drop the duplicate assignment
   that was being overwritten after the await.

2. Track last_token_at alongside first_token_at and use the generation
   window (first→last delta) as the tps denominator. Total stream
   duration was inflated by tool-call argument deltas, reasoning events,
   and stream_update awaits — making tps under-report the model's
   actual generation speed for agentic responses.

3. Lazy-create the OTel instruments inside _StreamingMetrics rather
   than at import time, so a MeterProvider configured *after* this
   module is imported still binds correctly. Apps that bootstrap OTel
   in a startup hook (common with lazy-init patterns) would otherwise
   silently send to the no-op provider.
- Bookmark first/last token timestamps on ResponseFunctionCallArgumentsDeltaEvent
  too so the tps generation window covers all event types whose tokens land in
  usage.output_tokens. Previously the numerator counted argument tokens but
  the denominator excluded their generation time, inflating tps for tool-heavy
  responses.

- Lifted the bookmarking out of the text-delta branch into a single up-front
  check covering all four token-producing event types — cleaner than
  duplicating across branches.

- Documented the single-token skip case (window collapses to 0) inline at the
  guard. TPS is undefined for a one-token response so emitting nothing is
  correct; the comment makes the intent visible to future readers.
…t counter

Two follow-up changes from the PR review:

1. Move the LLM metric instruments from _StreamingMetrics in
   temporal_streaming_model.py to a new module:
     agentex.lib.core.observability.llm_metrics

   Public API: get_llm_metrics() returns a singleton LLMMetrics with
   the same six instruments (ttft, tps, input_tokens, output_tokens,
   cached_input_tokens, reasoning_tokens) plus a new requests counter.

   This makes the temporal+openai_agents plugin one of several future
   call sites — the sync ACP path and the Claude SDK plugin can
   record to the same instruments without redefining names, units,
   or descriptions. Keeps cross-provider naming consistent.

2. Add agentex.llm.requests counter with a status label so 429s,
   5xxs, timeouts, and other failures are observable on the SDK
   side without scraping logs. classify_status() maps exception
   types to a small fixed set (success / rate_limit / server_error
   / client_error / timeout / network_error / other_error) by class
   name, so it works across OpenAI, Anthropic, and other provider
   SDKs that use similar exception naming.

   Recorded in two places: success path (alongside token counters)
   and the existing get_response except handler (so terminal
   failures emit a counter event before re-raising).

Cardinality remains bounded — model + status (7 values) on the
counter; all other metrics keep just `model`.
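
A classifier in that spirit could look like the sketch below; the exact class-name substrings and their ordering in the real classify_status() may differ:

```python
def classify_status(error: BaseException | None = None) -> str:
    """Map an exception to one of a small, fixed set of status label values.

    Matching on the class name (rather than importing provider SDK exception
    types) keeps this working across OpenAI, Anthropic, and similar SDKs.
    Illustrative sketch; not a copy of the PR's helper.
    """
    if error is None:
        return "success"
    name = type(error).__name__.lower()
    if "ratelimit" in name:
        return "rate_limit"
    if "timeout" in name:
        return "timeout"
    if "connection" in name or "network" in name:
        return "network_error"
    if "internalserver" in name or "serviceunavailable" in name or "servererror" in name:
        return "server_error"
    if any(s in name for s in ("badrequest", "authentication", "permission", "notfound", "unprocessable")):
        return "client_error"
    return "other_error"
```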
Comment on lines 1057 to 1065
            except Exception as e:
                logger.error(f"Error using Responses API: {e}")
                # Emit a request-counter event so 429s, 5xxs, timeouts, etc. are
                # observable on the SDK side. Status histograms / token counters
                # only fire on successful completion above.
                get_llm_metrics().requests.add(
                    1, {"model": self.model_name, "status": classify_status(e)}
                )
                raise

P1 OTel exception in error handler replaces original LLM exception. If get_llm_metrics().requests.add() throws (e.g., a misbehaving exporter), the raise is never reached and the caller receives an OTel SDK exception instead of the original LLM error. Retry logic or circuit breakers that inspect the exception type (RateLimitError, APITimeoutError, etc.) would receive an unexpected OTel exception, silently breaking them. Wrapping in a bare except prevents this.

Suggested change

            except Exception as e:
                logger.error(f"Error using Responses API: {e}")
                # Emit a request-counter event so 429s, 5xxs, timeouts, etc. are
                # observable on the SDK side. Status histograms / token counters
                # only fire on successful completion above.
                try:
                    get_llm_metrics().requests.add(
                        1, {"model": self.model_name, "status": classify_status(e)}
                    )
                except Exception:
                    pass
                raise
