fix(workflow): preserve event.output for cached LlmAgent results#5587

Open
ValentinCordonnier wants to merge 1 commit into google:v2 from ValentinCordonnier:fix/5553-preserve-event-output-on-resume
Conversation

@ValentinCordonnier

Fixes #5553


Summary

Runner._consume_event_queue was nulling event.output before persisting any non-partial event whose node_info.message_as_output was True. The intent looked like a denormalization step ("the LLM text already lives in event.content, no need to store it twice"), but it broke resume rehydration.

On the live path, process_llm_agent_output populates event.output with the validated dict (or the raw text, when there is no output_schema); after the strip, the persisted event lost that value. On resume, _reconstruct_node_state walked the persisted events, found event.output is None, and fell through to child.output = event.content, handing the orchestrator a raw genai.types.Content instead of the dict. Anything subscripting that result (lab["findings"], Schema.model_validate(out), …) crashed with 'Content' object is not subscriptable, but only after a HITL pause, even though the same code had worked on the initial pass.

The diagnosis in #5553 by @samarth1224 already pinned the two functions involved; this PR is the minimal change.

Changes

  • src/google/adk/runners.py: removed the four-line strip block in _consume_event_queue that copied the event and assigned event.output = None for non-partial message_as_output events. With the strip gone, event.output survives the persist / rehydrate roundtrip and _reconstruct_node_state's primary branch (if event.output is not None: child.output = event.output) returns the same Python object type the live path produced.
  • tests/unittests/workflow/test_workflow_hitl.py: added a regression test that pauses a rerun_on_resume=True orchestrator on a RequestInput, resumes, and subscripts the cached LlmAgent output. Fails on v2 HEAD with the reported TypeError; passes after this change.
  • tests/unittests/workflow/test_workflow_dynamic_nodes.py: updated test_workflow_resume_does_not_rerun_completed_llm_agent. Its previous assertion agent_event.output is None (commented "Verify that runners.py cleared the output") was asserting the buggy behavior as if intentional. Replaced with a positive check that the persisted output equals the agent's emitted text. The test's actual semantic guarantee — the LLM is not re-fired on resume — still holds via the existing agent_runs_again == [] assertion.

Motivation

Any @node(rerun_on_resume=True) orchestrator that subscripts an LlmAgent result via ctx.run_node and survives a RequestInput pause is silently broken on v2. Graph-edge propagation of LlmAgent outputs across resume has the same failure mode (the rehydrated child.output is a Content instead of the schema dict), so this isn't only a dynamic-orchestrator problem. The fix unblocks the entire HITL-with-structured-output pattern, which is a core v2 use case.

What does not change

  • Live-path semantics. _node_runner._enqueue_event runs before this strip in the queue and was already reading event.output correctly; non-resume runs are unaffected.
  • Caching policy. rerun_on_resume, dedup of completed nodes, and the _dynamic_node_scheduler state machine are all unchanged. The fix only changes what value gets returned from the cache, not whether the cache is hit.
  • The event.content fallback in _reconstruct_node_state is preserved (it now handles only legacy persisted events).
  • No public API surface changes. Event.output: Any | None is unchanged; no contract said it had to be None for LLM events.

Downsides

  • Slightly larger persisted events for LlmAgent final responses. The validated dict is stored alongside the original text in content.parts. For output_schema agents that's roughly the JSON serialization of the dict; for plain-text agents it's roughly the response text duplicated. The previous saving was illusory — rehydration was returning a different object type than the live path, which is the bug we're fixing — but storage-cost-sensitive deployments will see the bytes-per-event grow.
  • Plugin behavior change. Plugins that read event.output for LLM events used to see None; they will now see the dict / string. Any plugin that branched on event.output is None to detect "this is the final LLM event" needs to switch to event.node_info.message_as_output (unchanged, set by the live path). I couldn't find any in-tree plugin doing this, but third-party plugins exist.
  • Old persisted sessions stay broken. Sessions persisted by pre-fix runners still have output=None for these events; resuming them post-fix continues to hit the event.content fallback. That's not a regression — those sessions were broken before — but cross-version resume is not magically fixed. The fallback branch could be removed in a follow-up once legacy sessions age out.
  • One existing test asserted the buggy behavior and had to be updated (see the third bullet under Changes). Reviewers should sanity-check that the replacement assertion still expresses the test's true semantic intent.

The runner stripped event.output to None before persisting any
non-partial event with node_info.message_as_output=True. On resume,
_reconstruct_node_state fell back to event.content (a raw
genai.types.Content) for the cached child output, so subscripting the
result of ctx.run_node on an LlmAgent across a HITL pause crashed with
"'Content' object is not subscriptable". Removing the strip lets the
validated dict produced by process_llm_agent_output survive the
persist/rehydrate roundtrip.

Closes google#5553

google-cla Bot commented May 4, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

adk-bot added the core [Component] This issue is related to the core interface and implementation label May 4, 2026
adk-bot commented May 4, 2026

Response from ADK Triaging Agent

Hello @ValentinCordonnier, thank you for creating this detailed pull request!

It looks like the CLA check is failing. Before we can review your contribution, you'll need to sign the Contributor License Agreement. You can do so by following the instructions at https://cla.developers.google.com/.

Thanks!
