Evals overhaul by miguelg719 · Pull Request #2011 · browserbase/stagehand

miguelg719 · 2026-04-18T23:57:03Z

why

what changed

test plan

Summary by cubic

Rebuilt the evals system into a two‑tier core + bench framework with adapter‑abstracted tools, auto‑discovered tasks, a unified Braintrust runner, and a stable TUI CLI. Adds deterministic core tasks, runner‑provided Chrome/Browserbase targets, Braintrust report tooling, a new CLI “experiments” command, and TUI updates (startup‑warning suppression and a bin shim).

New Features
- Core framework (contracts, assertions, metrics, unified runner) with deterministic tasks under packages/evals/core/tasks/**; task auto‑discovery via defineCoreTask/defineBenchTask.
- Tool adapters: understudy_code, playwright_code, cdp_code, playwright_mcp, chrome_devtools_mcp, browse_cli; runner‑provided targets for local Chrome and Browserbase (startup profiles, session artifacts, cleanup).
- TUI CLI: bundled dist/cli with bin shim at packages/evals/bin/evals; legacy CLI via pnpm evals:old; noisy startup warnings suppressed; quieter logs via EvalLogger echo toggle; verbose default added; new experiments command to compare Braintrust runs.
- Braintrust reporting: data lib for experiment comparisons and report:core script; README updated to use WebTailBench; .gitignore ignores Playwright/MCP artifacts; deps add playwright and vitest.
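One way marker-based auto-discovery can work is sketched below. It is a minimal illustration, not the PR's implementation: `sketchDefineTask`, `isTaskDefinition`, `collectTasks`, and the `kind: "eval-task"` brand are hypothetical stand-ins, and the real `defineCoreTask`/`defineBenchTask` may behave differently.

```typescript
// Hypothetical sketch of marker-based task discovery; names are illustrative only.
type Tier = "core" | "bench";

interface TaskDefinition {
  readonly kind: "eval-task"; // brand that separates tasks from helper modules
  readonly tier: Tier;
  readonly name: string;
}

// Stand-in for a define* helper: it only tags the metadata with the brand.
function sketchDefineTask(tier: Tier, meta: { name: string }): TaskDefinition {
  return { kind: "eval-task", tier, name: meta.name };
}

function isTaskDefinition(value: unknown): value is TaskDefinition {
  return (
    typeof value === "object" &&
    value !== null &&
    (value as { kind?: unknown }).kind === "eval-task"
  );
}

// `modules` maps a file path to its loaded module namespace, for example the
// result of a dynamic import() performed by a filesystem walker.
function collectTasks(
  modules: Record<string, { default?: unknown }>,
): TaskDefinition[] {
  const tasks: TaskDefinition[] = [];
  const seen: Record<string, true> = {};
  for (const file of Object.keys(modules)) {
    const candidate = modules[file]?.default;
    if (!isTaskDefinition(candidate)) continue; // skip helpers and non-task files
    const key = `${candidate.tier}:${candidate.name}`;
    if (seen[key]) throw new Error(`Duplicate task "${key}" in ${file}`);
    seen[key] = true;
    tasks.push(candidate);
  }
  return tasks;
}
```

Keeping the collector free of filesystem calls means the walking and importing step can be swapped or tested separately.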

Migration
- Tasks are no longer listed in evals.config.json; they’re discovered from the filesystem.
- Bench tasks must default‑export defineBenchTask({ name }) under packages/evals/tasks/bench/**; core tasks live under packages/evals/core/tasks/** using defineCoreTask.
- Use the new CLI (pnpm evals ...); the legacy interface is available via pnpm evals:old.
- Set BROWSERBASE_API_KEY (and optionally BROWSERBASE_PROJECT_ID); set CHROME_PATH if auto‑detect fails; EVAL_ENV is read at runtime to choose Browserbase vs local.
- Default categories were trimmed; update filters if you used regression_llm_providers or llm_clients.
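The environment-variable notes above can be sketched as a small resolver. This is a hedged illustration: `resolveTarget` and `EvalTarget` are hypothetical names, and the rule that `EVAL_ENV=browserbase` (case-insensitive) selects Browserbase while anything else means local is an assumption, since the accepted values are not stated here.

```typescript
// Hypothetical sketch: choose an eval target from environment variables.
type EvalTarget =
  | { kind: "browserbase"; apiKey: string; projectId?: string | undefined }
  | { kind: "local"; chromePath?: string | undefined };

type Env = Record<string, string | undefined>;

function resolveTarget(env: Env): EvalTarget {
  // Assumption: "browserbase" (any casing) selects Browserbase; all else is local.
  const mode = (env["EVAL_ENV"] ?? "").toLowerCase();
  if (mode === "browserbase") {
    const apiKey = env["BROWSERBASE_API_KEY"];
    if (!apiKey) {
      throw new Error("BROWSERBASE_API_KEY is required for Browserbase runs");
    }
    // The project id is optional, matching the migration note.
    return { kind: "browserbase", apiKey, projectId: env["BROWSERBASE_PROJECT_ID"] };
  }
  // CHROME_PATH is only an override; undefined means "rely on auto-detect".
  return { kind: "local", chromePath: env["CHROME_PATH"] };
}
```

Under Node, a caller would pass `process.env`. Taking the environment as a parameter means the selection happens at call time rather than at import time.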

Written for commit 8d38a2a. Summary will update on new commits.

Entire-Checkpoint: f268fd2c0c30

Entire-Checkpoint: af001fddfe1b

changeset-bot · 2026-04-18T23:57:07Z

⚠️ No Changeset found

Latest commit: 8d38a2a

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


Entire-Checkpoint: 674f064a1270

pirate · 2026-04-20T18:53:31Z

+  waitForTimeout(ms: number): Promise<void>;
+
+  locator(selector: string): CoreLocatorHandle;
+
+  click(target: string | ActionTarget): Promise<void>;
+  click(x: number, y: number): Promise<void>;
+
+  hover(target: string | ActionTarget): Promise<void>;
+  hover(x: number, y: number): Promise<void>;
+
+  scroll(x: number, y: number, deltaX: number, deltaY: number): Promise<void>;
+
+  type(text: string): Promise<void>;
+  type(
+    target: string | ActionTarget | FocusedTarget,
+    text: string,
+  ): Promise<void>;
+
+  press(key: string): Promise<void>;
+  press(
+    target: string | ActionTarget | FocusedTarget,
+    key: string,
+  ): Promise<void>;
+
+  represent?(opts?: RepresentationOpts): Promise<PageRepresentation>;


This is a lot of re-encoding of our internal contracts, and it seems like it would break whenever our implementation changes (e.g. much of this would probably have to be rewritten for v4).

Is there any way we can discover this shape dynamically from existing code?
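One possible answer to the question above is to derive the contract type from the existing class with TypeScript's `Pick` utility type instead of restating each signature. The sketch below is only an illustration under stated assumptions: `ExistingPage`, `CorePageContract`, and `fillAndSubmit` are hypothetical stand-ins, not Stagehand's actual classes or this PR's code.

```typescript
// Hypothetical stand-in for an existing page implementation.
class ExistingPage {
  readonly log: string[] = [];

  async click(x: number, y: number): Promise<void> {
    this.log.push(`click:${x},${y}`);
  }

  async type(text: string): Promise<void> {
    this.log.push(`type:${text}`);
  }

  async press(key: string): Promise<void> {
    this.log.push(`press:${key}`);
  }

  // Internal method that eval adapters should not depend on.
  async internalReset(): Promise<void> {
    this.log.length = 0;
  }
}

// The contract is derived from the class, so a signature change in
// ExistingPage propagates here without a hand-written copy.
type CorePageContract = Pick<ExistingPage, "click" | "type" | "press">;

// Adapter-side helper that depends only on the derived contract.
async function fillAndSubmit(page: CorePageContract, text: string): Promise<void> {
  await page.click(10, 20);
  await page.type(text);
  await page.press("Enter");
}
```

With this approach only the list of method names is maintained by hand. A renamed or removed method becomes a compile error at the `Pick`, and changed parameters show up at the call sites, instead of the contract drifting silently.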

miguelg719 added 8 commits April 6, 2026 10:06

unclean evals overhaul

6a1b535

use wrapper function for all evals

904e005

Initial abstraction for core runner

735ebc3

updates on sprint 2

bbe5871

add more tool adapters and fixes

cc99e06

Entire-Checkpoint: f268fd2c0c30

add browse cli as a tool

88edd02

proper mcp adapters

9a2125b

updates and report generation

ea106d4

Entire-Checkpoint: af001fddfe1b

miguelg719 and others added 6 commits April 18, 2026 16:58

remove eval planning docs from branch

ffd3b55

Entire-Checkpoint: 674f064a1270

updates

bd0347b

Delete packages/core/examples/visa.ts

7f93e13

stable TUI

24af652

TUI updates

c76fa38

experiments command added

8d38a2a

pirate reviewed Apr 20, 2026

pirate approved these changes Apr 20, 2026

miguelg719 wants to merge 14 commits into main from evals-overhaul


2 participants
