Entire-Checkpoint: f268fd2c0c30
Entire-Checkpoint: af001fddfe1b
Entire-Checkpoint: 674f064a1270
waitForTimeout(ms: number): Promise<void>;

locator(selector: string): CoreLocatorHandle;

click(target: string | ActionTarget): Promise<void>;
click(x: number, y: number): Promise<void>;

hover(target: string | ActionTarget): Promise<void>;
hover(x: number, y: number): Promise<void>;

scroll(x: number, y: number, deltaX: number, deltaY: number): Promise<void>;

type(text: string): Promise<void>;
type(
  target: string | ActionTarget | FocusedTarget,
  text: string,
): Promise<void>;

press(key: string): Promise<void>;
press(
  target: string | ActionTarget | FocusedTarget,
  key: string,
): Promise<void>;

represent?(opts?: RepresentationOpts): Promise<PageRepresentation>;
This is a lot of re-encoding of our internal contracts. It seems like it would break whenever our implementation changes (e.g. much of this would probably have to be rewritten for v4).
Is there any way we can discover this shape dynamically from existing code?
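One possible answer to the question above, sketched under assumptions: instead of re-declaring each signature by hand, the adapter-facing type could be derived from the existing implementation with `Pick`, so that signature changes (including overloads) propagate at compile time. `ExamplePage`, `DerivedPageHandle`, and `runSteps` are hypothetical names used only for illustration; the real page class and its members may look different.

```typescript
// Hypothetical stand-in for the real page implementation (an assumption,
// not the project's actual class). It records calls instead of driving a browser.
class ExamplePage {
  private log: string[] = [];

  async waitForTimeout(ms: number): Promise<void> {
    this.log.push(`wait:${ms}`);
  }

  async type(text: string): Promise<void> {
    this.log.push(`type:${text}`);
  }

  async press(key: string): Promise<void> {
    this.log.push(`press:${key}`);
  }

  // Public, but deliberately left out of the derived handle below.
  history(): readonly string[] {
    return this.log;
  }
}

// Derive the handle from the class rather than re-encoding each signature.
// If a method's signature changes on ExamplePage, this type changes with it.
type DerivedPageHandle = Pick<ExamplePage, "waitForTimeout" | "type" | "press">;

// Code written against the derived handle only sees the picked methods.
async function runSteps(page: DerivedPageHandle): Promise<void> {
  await page.waitForTimeout(10);
  await page.type("hello");
  await page.press("Enter");
}
```

The trade-off is that only the list of method names is maintained by hand; a renamed or removed method then fails type-checking at the `Pick` rather than silently drifting from the implementation.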
why
what changed
test plan
Summary by cubic
Rebuilt the evals system into a two‑tier core + bench framework with adapter‑abstracted tools, auto‑discovered tasks, a unified Braintrust runner, and a stable TUI CLI. Adds deterministic core tasks, runner‑provided Chrome/Browserbase targets, Braintrust report tooling, a new CLI “experiments” command, and TUI updates (startup‑warning suppression and a bin shim).
New Features
packages/evals/core/tasks/**; task auto‑discovery via defineCoreTask/defineBenchTask.
understudy_code, playwright_code, cdp_code, playwright_mcp, chrome_devtools_mcp, browse_cli; runner‑provided targets for local Chrome and Browserbase (startup profiles, session artifacts, cleanup).
dist/cli with bin shim at packages/evals/bin/evals; legacy CLI via pnpm evals:old; noisy startup warnings suppressed; quieter logs via EvalLogger echo toggle; verbose default added; new experiments command to compare Braintrust runs.
report:core script; README updated to use WebTailBench; .gitignore ignores Playwright/MCP artifacts; deps add playwright and vitest.
Migration
evals.config.json; they’re discovered from the filesystem.
defineBenchTask({ name }) under packages/evals/tasks/bench/**; core tasks live under packages/evals/core/tasks/** using defineCoreTask.
pnpm evals ...); the legacy interface is available via pnpm evals:old.
BROWSERBASE_API_KEY (and optionally BROWSERBASE_PROJECT_ID); set CHROME_PATH if auto‑detect fails; EVAL_ENV is read at runtime to choose Browserbase vs local.
regression_llm_providers or llm_clients.
Written for commit 8d38a2a. Summary will update on new commits.
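The Migration notes say that EVAL_ENV is read at runtime to choose Browserbase vs local, with BROWSERBASE_API_KEY, BROWSERBASE_PROJECT_ID, and CHROME_PATH as related settings. A minimal sketch of how such a selection might look follows. Only the four variable names come from the summary; `resolveEvalTarget`, `EvalTarget`, the accepted EVAL_ENV values ("BROWSERBASE" / "LOCAL"), and the default to local are assumptions for illustration, not the runner's actual code.

```typescript
// Environment passed in explicitly so the function stays pure and testable.
type EnvVars = Record<string, string | undefined>;

// Hypothetical result type; the real runner's target shape is not shown in the summary.
type EvalTarget =
  | { kind: "browserbase"; apiKey: string; projectId: string | undefined }
  | { kind: "local"; chromePath: string | undefined };

function resolveEvalTarget(env: EnvVars): EvalTarget {
  // Assumption: EVAL_ENV is "BROWSERBASE" or "LOCAL" (case-insensitive),
  // and anything other than "BROWSERBASE" is treated as local.
  const mode = (env.EVAL_ENV ?? "LOCAL").toUpperCase();

  if (mode === "BROWSERBASE") {
    const apiKey = env.BROWSERBASE_API_KEY;
    if (!apiKey) {
      throw new Error("BROWSERBASE_API_KEY is required when EVAL_ENV=BROWSERBASE");
    }
    // The project ID is optional per the summary.
    return { kind: "browserbase", apiKey, projectId: env.BROWSERBASE_PROJECT_ID };
  }

  // CHROME_PATH is optional; undefined here means "rely on auto-detection".
  return { kind: "local", chromePath: env.CHROME_PATH };
}
```

In a Node process this would typically be called as `resolveEvalTarget(process.env)` at the moment a run starts, rather than at import time, so that the value is picked up at runtime as the summary describes.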