fix(workflow): preserve event.output for cached LlmAgent results #5587
Open
ValentinCordonnier wants to merge 1 commit into google:v2 from
Conversation
The runner stripped `event.output` to `None` before persisting any non-partial event with `node_info.message_as_output=True`. On resume, `_reconstruct_node_state` fell back to `event.content` (a raw `genai.types.Content`) for the cached child output, so subscripting the result of `ctx.run_node` on an `LlmAgent` across a HITL pause crashed with `'Content' object is not subscriptable`. Removing the strip lets the validated dict produced by `process_llm_agent_output` survive the persist/rehydrate roundtrip. Closes google#5553
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
Collaborator
Response from ADK Triaging Agent

Hello @ValentinCordonnier, thank you for creating this detailed pull request! It looks like the CLA check is failing. Before we can review your contribution, you'll need to sign the Contributor License Agreement. You can do so by following the instructions at https://cla.developers.google.com/. Thanks!
Fixes #5553
Summary
`Runner._consume_event_queue` was nulling `event.output` before persisting any non-partial event whose `node_info.message_as_output` was `True`. The intent looked like a denormalization step ("the LLM text already lives in `event.content`, no need to store it twice"), but it broke resume rehydration. On the live path, `process_llm_agent_output` populates `event.output` with the validated dict (or raw text, when no `output_schema`); after the strip, the persisted event lost that value. On resume, `_reconstruct_node_state` walked the persisted events, found `event.output is None`, and fell through to `child.output = event.content`, handing the orchestrator a raw `genai.types.Content` instead of the dict. Anything subscripting that result (`lab["findings"]`, `Schema.model_validate(out)`, …) crashed with `'Content' object is not subscriptable` only after a HITL pause, even though the same code worked on the initial pass.

The diagnosis in #5553 by @samarth1224 already pinned the two functions involved; this PR is the minimal change.
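For concreteness, the broken pattern looks roughly like this. It is a minimal sketch assembled from the names used in this PR (`@node`, `ctx.run_node`, `RequestInput`, `lab["findings"]`); the imports, agent configuration, and schema are illustrative assumptions, not code from the repo:

```python
# Illustrative repro sketch; only @node, ctx.run_node, RequestInput, and the
# failing subscript come from this PR -- everything else is assumed.
from pydantic import BaseModel
from google.adk.agents import LlmAgent  # assumed import path


class LabReport(BaseModel):  # hypothetical output schema
    findings: str


lab_agent = LlmAgent(
    name="lab",
    model="gemini-2.0-flash",  # any model; illustrative
    output_schema=LabReport,
)


@node(rerun_on_resume=True)
async def orchestrator(ctx):
    lab = await ctx.run_node(lab_agent)
    # Live pass: lab is the validated dict from process_llm_agent_output.
    _ = await ctx.run_node(request_input_node)  # hypothetical RequestInput node;
                                                # HITL pause, session persists here
    # On resume the orchestrator reruns and ctx.run_node returns the cached
    # result. Pre-fix, the persisted event had output=None, so rehydration
    # handed back event.content (genai.types.Content) and this line raised
    # TypeError: 'Content' object is not subscriptable.
    return lab["findings"]
```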
Changes
- `src/google/adk/runners.py`: removed the four-line strip block in `_consume_event_queue` that copied the event and assigned `event.output = None` for non-partial `message_as_output` events (see the sketch after this list). With the strip gone, `event.output` survives the persist/rehydrate roundtrip and `_reconstruct_node_state`'s primary branch (`if event.output is not None: child.output = event.output`) returns the same Python object type the live path produced.
- `tests/unittests/workflow/test_workflow_hitl.py`: added a regression test that pauses a `rerun_on_resume=True` orchestrator on a `RequestInput`, resumes, and subscripts the cached `LlmAgent` output. It fails on `v2` HEAD with the reported `TypeError` and passes after this change.
- `tests/unittests/workflow/test_workflow_dynamic_nodes.py`: updated `test_workflow_resume_does_not_rerun_completed_llm_agent`. Its previous assertion `agent_event.output is None` (commented "Verify that runners.py cleared the output") was asserting the buggy behavior as if intentional. Replaced it with a positive check that the persisted output equals the agent's emitted text. The test's actual semantic guarantee, that the LLM is not re-fired on resume, still holds via the existing `agent_runs_again == []` assertion.
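The removed block looked roughly like this; it is reconstructed from the description above, not copied from the diff, so treat the exact condition and copy call as guesses:

```python
# Approximate shape of the four-line strip removed from
# Runner._consume_event_queue (reconstructed, not verbatim):
if not event.partial and event.node_info and event.node_info.message_as_output:
    event = event.model_copy()  # "copied the event" per the change description
    event.output = None         # the strip this PR deletes; the persisted
                                # event then loses the validated dict
```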
Motivation

Any `@node(rerun_on_resume=True)` orchestrator that subscripts an `LlmAgent` result via `ctx.run_node` and survives a `RequestInput` pause is silently broken on `v2`. Graph-edge propagation of `LlmAgent` outputs across resume has the same failure mode (the rehydrated `child.output` is a `Content` instead of the schema dict), so this isn't only a dynamic-orchestrator problem. The fix unblocks the entire HITL-with-structured-output pattern, which is a core `v2` use case.

What does not change
- `_node_runner._enqueue_event` runs before this strip in the queue and was already reading `event.output` correctly; non-resume runs are unaffected.
- `rerun_on_resume`, dedup of completed nodes, and the `_dynamic_node_scheduler` state machine are all unchanged. The fix only changes what value gets returned from the cache, not whether the cache is hit.
- The `event.content` fallback in `_reconstruct_node_state` is preserved; it now handles only legacy persisted events (shape sketched below).
- `Event.output: Any | None` is unchanged; no contract said it had to be `None` for LLM events.
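The preserved fallback mentioned in the third bullet, again as a shape sketch inferred from this description rather than the literal source:

```python
# _reconstruct_node_state's output selection (approximate shape):
if event.output is not None:
    child.output = event.output   # primary branch: same object type as the
                                  # live path; reachable again after this fix
else:
    child.output = event.content  # legacy fallback: raw genai.types.Content,
                                  # now only hit by pre-fix persisted sessions
```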
Downsides

- Persisted events grow for `LlmAgent` final responses. The validated dict is stored alongside the original text in `content.parts`. For `output_schema` agents that's roughly the JSON serialization of the dict; for plain-text agents it's roughly the response text duplicated. The previous saving was illusory, since rehydration was returning a different object type than the live path, which is the bug we're fixing, but storage-cost-sensitive deployments will see the bytes-per-event grow.
- Observers reading `event.output` for LLM events used to see `None`; they will now see the dict / string. Any plugin that branched on `event.output is None` to detect "this is the final LLM event" needs to switch to `event.node_info.message_as_output` (unchanged, set by the live path); a migration sketch follows this list. I couldn't find any in-tree plugin doing this, but third-party plugins exist.
- Sessions persisted before this fix still carry `output=None` for these events; resuming them post-fix continues to hit the `event.content` fallback. That's not a regression, since those sessions were broken before, but cross-version resume is not magically fixed. The fallback branch could be removed in a follow-up once legacy sessions age out.
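For the plugin-facing downside, the migration is mechanical; a sketch, assuming a hypothetical plugin that used the `None` check to spot final LLM events:

```python
# Before this PR: relied on the strip, so it breaks once output survives.
def is_final_llm_event_old(event) -> bool:
    return not event.partial and event.output is None

# After: message_as_output is set by the live path and untouched by this PR,
# so the check is stable across versions.
def is_final_llm_event(event) -> bool:
    return bool(event.node_info and event.node_info.message_as_output)
```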