fix(core): add clickAndHold screenshot output parity #2015
BABTUNA wants to merge 1 commit into browserbase:main
This PR is from an external contributor and must be approved by a stagehand team member with write access before CI can run.
No issues found across 4 files
Confidence score: 5/5
- Automated review surfaced no issues in the provided summaries.
- No files require special attention.
Architecture diagram
sequenceDiagram
participant LLM as AI Model/Agent
participant Tool as ClickAndHoldTool
participant Page as Browser Page
participant SH as ScreenshotHandler
participant MP as MessageProcessor
Note over LLM, MP: Tool Execution Flow
LLM->>Tool: execute(coordinates, duration)
Tool->>Page: processCoordinates()
Tool->>Page: mouse.down() / mouse.up()
activate Tool
Tool->>SH: NEW: waitAndCaptureScreenshot(page)
SH-->>Tool: screenshotBase64
deactivate Tool
Tool-->>LLM: CHANGED: ClickAndHoldToolResult (inc. screenshot)
Note over LLM, MP: Model Response Formatting
LLM->>Tool: toModelOutput(result)
alt NEW: Result has screenshot
Tool-->>LLM: Return [Text Content, Media Content]
else Error
Tool-->>LLM: Return [Text Error JSON]
end
Note over LLM, MP: Context Window Management (Cleanup)
LLM->>MP: processMessages(history)
loop For each Vision Action Message
MP->>MP: Identify click/type/dragAndDrop
MP->>MP: CHANGED: Identify clickAndHold
opt Message is old/redundant
MP->>MP: Prune screenshotBase64 (keep text only)
end
end
MP-->>LLM: Compressed message history
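The tool execution portion of the diagram (press, hold, release, then capture) can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the `HoldablePage` interface, its method names, the `HoldResult` shape, and the `success`/`error` fields are all hypothetical stand-ins, and the real Stagehand page API and `waitAndCaptureScreenshot` helper may differ.

```typescript
// Hypothetical minimal page surface; the real Stagehand page API may differ.
interface HoldablePage {
  mouseDown(x: number, y: number): Promise<void>;
  mouseUp(x: number, y: number): Promise<void>;
  wait(ms: number): Promise<void>;
  captureScreenshotBase64(): Promise<string>;
}

// Field names describe/duration/coordinates/screenshotBase64 come from the PR
// description; success and error are assumptions added for illustration.
interface HoldResult {
  success: boolean;
  describe?: string;
  duration?: number;
  coordinates?: [number, number];
  screenshotBase64?: string;
  error?: string;
}

async function clickAndHoldSketch(
  page: HoldablePage,
  describe: string,
  coordinates: [number, number],
  duration: number,
): Promise<HoldResult> {
  const [x, y] = coordinates;
  try {
    await page.mouseDown(x, y);
    await page.wait(duration); // hold for the requested time
    await page.mouseUp(x, y);
    // Capture only after release so the image reflects the post-action state.
    const screenshotBase64 = await page.captureScreenshotBase64();
    return { success: true, describe, duration, coordinates, screenshotBase64 };
  } catch (err) {
    // On failure, return a text-only error result with no screenshot.
    return {
      success: false,
      error: err instanceof Error ? err.message : String(err),
    };
  }
}
```

Taking the page as an interface keeps the sketch testable with a fake page and avoids tying it to any particular browser driver.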
why
clickAndHold was inconsistent with the other coordinate-based vision tools (click, type, dragAndDrop):
- no post-action screenshot in its result
- no matching toModelOutput (text + media output)
This made model feedback less reliable and broke parity with existing vision tool behavior.
what changed
- clickAndHold tool now:
  - returns a structured result (describe, duration, coordinates, screenshotBase64)
  - provides toModelOutput with text + media output (matching other vision tools)
- ClickAndHoldToolResult type in public agent types.
- clickAndHold added to the vision-action compression list so older screenshots from this tool are pruned consistently with other vision tools.
- Unit test verifying that clickAndHold is treated as a vision action for screenshot compression.
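As a rough illustration of the structured result and the text + media output listed above, a sketch might look like the following. The four result fields are taken from this description; the `success` and `error` fields, the `ContentPart` and `ModelOutput` shapes, the `image/png` media type, and the name `toModelOutputSketch` are assumptions made for the example and are not confirmed by the PR.

```typescript
// Assumed result shape: describe/duration/coordinates/screenshotBase64 are
// named in the PR description; success and error are illustrative additions.
interface ClickAndHoldResultSketch {
  success: boolean;
  describe?: string;
  duration?: number;
  coordinates?: number[];
  screenshotBase64?: string;
  error?: string;
}

// Assumed content-part shapes for a text + media tool output.
type ContentPart =
  | { type: "text"; text: string }
  | { type: "media"; mediaType: string; data: string };

interface ModelOutput {
  type: "content";
  value: ContentPart[];
}

function toModelOutputSketch(result: ClickAndHoldResultSketch): ModelOutput {
  if (!result.success) {
    // Error branch: a single text part carrying the error as JSON.
    const text = JSON.stringify({ success: false, error: result.error });
    return { type: "content", value: [{ type: "text", text }] };
  }
  // Keep the base64 payload out of the text part; it goes in the media part.
  const { screenshotBase64, ...rest } = result;
  const parts: ContentPart[] = [{ type: "text", text: JSON.stringify(rest) }];
  if (screenshotBase64) {
    parts.push({ type: "media", mediaType: "image/png", data: screenshotBase64 });
  }
  return { type: "content", value: parts };
}
```

Separating the image from the text part means the text stays small and readable, and a later compression pass can drop the media part without touching the action summary.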
test plan
- prettier --check on touched files
- eslint on touched files
- tsc -p packages/core/tsconfig.json --noEmit
- packages/core/tests/unit/message-processing.test.ts
Summary by cubic
Brings clickAndHold to parity with other vision tools by capturing a post-action screenshot and emitting proper model output. Improves feedback quality and enables consistent screenshot compression.
- toModelOutput (text + optional image), matching click, type, and dragAndDrop.
- Structured result: describe, duration, coordinates, screenshotBase64.
- clickAndHold treated as a vision action for screenshot compression; added a unit test.
- ClickAndHoldToolResult type.
Written for commit 85fb256. Summary will update on new commits.
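For readers unfamiliar with screenshot compression, the idea of treating clickAndHold as a vision action can be sketched as below. This is a simplified illustration under stated assumptions: the `ToolMessage` shape, the `pruneOldScreenshots` name, and the "keep only the most recent N screenshots" policy are hypothetical, and the actual message processor in packages/core may decide what counts as old or redundant differently.

```typescript
// Hypothetical simplified message shape; real agent messages are richer.
interface ToolMessage {
  toolName: string;
  text: string;
  screenshotBase64?: string;
}

// Tool names treated as vision actions. click, type, and dragAndDrop are the
// existing ones named in the PR; clickAndHold is the addition.
const VISION_ACTIONS = new Set(["click", "type", "dragAndDrop", "clickAndHold"]);

// Assumed policy: keep screenshots on the most recent `keep` vision-action
// messages and reduce older ones to text only.
function pruneOldScreenshots(history: ToolMessage[], keep = 1): ToolMessage[] {
  let remaining = keep;
  const out: ToolMessage[] = [];
  // Walk newest to oldest so the newest screenshots are the ones retained.
  for (const msg of [...history].reverse()) {
    const hasVisionShot =
      VISION_ACTIONS.has(msg.toolName) && msg.screenshotBase64 !== undefined;
    if (hasVisionShot && remaining > 0) {
      remaining--;
      out.push(msg);
    } else if (hasVisionShot) {
      out.push({ toolName: msg.toolName, text: msg.text }); // drop the image
    } else {
      out.push(msg);
    }
  }
  return out.reverse(); // restore chronological order
}
```

Without clickAndHold in the set, its screenshots would never be pruned under this kind of policy, so every hold action would keep a full image in the context window.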