Evals overhaul #2011

Draft
miguelg719 wants to merge 14 commits into main from evals-overhaul

miguelg719 commented Apr 18, 2026

why

what changed

test plan


Summary by cubic

Rebuilt the evals system into a two‑tier core + bench framework with adapter‑abstracted tools, auto‑discovered tasks, a unified Braintrust runner, and a stable TUI CLI. Adds deterministic core tasks, runner‑provided Chrome/Browserbase targets, Braintrust report tooling, a new CLI “experiments” command, and TUI updates (startup‑warning suppression and a bin shim).

  • New Features

    • Core framework (contracts, assertions, metrics, unified runner) with deterministic tasks under packages/evals/core/tasks/**; task auto‑discovery via defineCoreTask/defineBenchTask.
    • Tool adapters: understudy_code, playwright_code, cdp_code, playwright_mcp, chrome_devtools_mcp, browse_cli; runner‑provided targets for local Chrome and Browserbase (startup profiles, session artifacts, cleanup).
    • TUI CLI: bundled dist/cli with bin shim at packages/evals/bin/evals; legacy CLI via pnpm evals:old; noisy startup warnings suppressed; quieter logs via EvalLogger echo toggle; verbose default added; new experiments command to compare Braintrust runs.
    • Braintrust reporting: data lib for experiment comparisons and report:core script; README updated to use WebTailBench; .gitignore ignores Playwright/MCP artifacts; deps add playwright and vitest.
  • Migration

    • Tasks are no longer listed in evals.config.json; they’re discovered from the filesystem.
    • Bench tasks must default‑export defineBenchTask({ name }) under packages/evals/tasks/bench/**; core tasks live under packages/evals/core/tasks/** using defineCoreTask.
    • Use the new CLI (pnpm evals ...); the legacy interface is available via pnpm evals:old.
    • Set BROWSERBASE_API_KEY (and optionally BROWSERBASE_PROJECT_ID); set CHROME_PATH if auto‑detect fails; EVAL_ENV is read at runtime to choose Browserbase vs local.
    • Default categories were trimmed; update filters if you used regression_llm_providers or llm_clients.
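The environment contract in the migration notes can be sketched roughly as follows. This is a minimal illustration, not code from the PR: `resolveTarget`, `EvalTarget`, and `Env` are hypothetical names, and the assumption that `EVAL_ENV` takes the value `BROWSERBASE` to select the hosted target is not confirmed by the summary.

```typescript
// Hypothetical helper, not part of the PR: it only illustrates the env contract above.
type EvalTarget =
  | { kind: "browserbase"; apiKey: string; projectId?: string | undefined }
  | { kind: "local"; chromePath?: string | undefined };

type Env = Record<string, string | undefined>;

function resolveTarget(env: Env): EvalTarget {
  // Assumption: EVAL_ENV=BROWSERBASE selects Browserbase; anything else means local Chrome.
  const wantsBrowserbase = (env.EVAL_ENV ?? "").toUpperCase() === "BROWSERBASE";

  if (wantsBrowserbase) {
    const apiKey = env.BROWSERBASE_API_KEY;
    if (!apiKey) {
      throw new Error("BROWSERBASE_API_KEY is required when EVAL_ENV=BROWSERBASE");
    }
    // BROWSERBASE_PROJECT_ID is optional per the migration notes.
    return { kind: "browserbase", apiKey, projectId: env.BROWSERBASE_PROJECT_ID };
  }

  // CHROME_PATH is only needed when auto-detection fails, so it may be undefined here.
  return { kind: "local", chromePath: env.CHROME_PATH };
}
```

Passing the environment in as a plain record (for example `process.env` at the call site) keeps the lookup at runtime, as the notes describe, and makes the branch easy to exercise in a unit test.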

Written for commit 8d38a2a. Summary will update on new commits.

changeset-bot bot commented Apr 18, 2026

⚠️ No Changeset found

Latest commit: 8d38a2a

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


Comment on lines +93 to +117
waitForTimeout(ms: number): Promise<void>;

locator(selector: string): CoreLocatorHandle;

click(target: string | ActionTarget): Promise<void>;
click(x: number, y: number): Promise<void>;

hover(target: string | ActionTarget): Promise<void>;
hover(x: number, y: number): Promise<void>;

scroll(x: number, y: number, deltaX: number, deltaY: number): Promise<void>;

type(text: string): Promise<void>;
type(
  target: string | ActionTarget | FocusedTarget,
  text: string,
): Promise<void>;

press(key: string): Promise<void>;
press(
  target: string | ActionTarget | FocusedTarget,
  key: string,
): Promise<void>;

represent?(opts?: RepresentationOpts): Promise<PageRepresentation>;

This is a lot of re-encoding of our internal contracts. It seems like it would break whenever our implementation changes (e.g. much of this would probably have to be rewritten for v4).

Is there any way we can discover this shape dynamically from existing code?
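One way to approach this, sketched under assumptions: derive the contract type from the implementation with `Pick`, so overloads and parameter types follow the source class instead of being retyped by hand. `InternalPage` below is a hypothetical stand-in for the existing page class (the real one is not shown in this thread), and `CORE_METHODS`, `CorePageHandle`, and `asCoreHandle` are illustrative names only.

```typescript
// Hypothetical stand-in for the existing page implementation.
class InternalPage {
  log: string[] = [];

  click(target: string): Promise<void>;
  click(x: number, y: number): Promise<void>;
  async click(a: string | number, b?: number): Promise<void> {
    this.log.push(typeof a === "string" ? `click ${a}` : `click ${a},${b}`);
  }

  async type(text: string): Promise<void> {
    this.log.push(`type ${text}`);
  }

  // Not part of the eval contract; excluded from the derived type below.
  async internalOnly(): Promise<void> {}
}

// The only hand-maintained piece: the list of method names the evals rely on.
const CORE_METHODS = ["click", "type"] as const;

// Signatures (including both click overloads) are derived, not re-encoded.
type CorePageHandle = Pick<InternalPage, (typeof CORE_METHODS)[number]>;

// Structural typing lets the real page satisfy the derived contract directly.
function asCoreHandle(page: InternalPage): CorePageHandle {
  return page;
}
```

With this shape, a signature change in the implementation flows into `CorePageHandle` automatically, and a renamed or removed method becomes a compile error at the `CORE_METHODS` list rather than silent drift. Whether this fits depends on how far the adapter contracts are meant to diverge from the internal page API.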
