Skip to content

explorer: search perf-smoke baseline (#167)#173

Merged
rdhyee merged 7 commits intoisamplesorg:mainfrom
rdhyee:explorer-search-perf-smoke
May 8, 2026
Merged

explorer: search perf-smoke baseline (#167)#173
rdhyee merged 7 commits intoisamplesorg:mainfrom
rdhyee:explorer-search-perf-smoke

Conversation

@rdhyee
Copy link
Copy Markdown
Contributor

@rdhyee rdhyee commented May 8, 2026

Summary

Track 1a of #165 — establishes a measured browser-side baseline for the current live search before any substrate work begins.

  • Instruments `doSearch()` with `performance.mark` / `performance.measure` and emits a structured `isamples.search` console event on every completion (success or error). Captures `elapsed_ms`, `results_count`, `bytes_transfer`, `bytes_body`, per-URL `seen_urls`, and filter composition flags.
  • Search measures appended to the existing `?perf=1` panel.
  • New `tests/test_search_perf.py` drives the locked canonical query set, captures the events, and writes a JSON artifact to `tests/search_baseline_<UTC_DATE>.json`.

No behavior change to the search itself.

Baseline numbers (committed in this PR)

Local Quarto preview, 2026-05-08. `tests/search_baseline_2026-05-08.json`. All 9 canonical queries completed cleanly.

label term results cold (ms) warm (ms)
single-common `pottery` 50 8683 1940
single-rare `basalt` 50 11824 2400
multi-term `pottery Cyprus` 0 5071 2560
no-hit `xyzzyqqqplugh` 0 4870 2554
wildcard-pct `100%` 0 5178 2574
wildcard-under `_test` 0 18520 2529
diacritic `Çatalhöyük` 0 4415 2592
composed-source `pottery` × OPENCONTEXT 50 4964 1368
composed-source-material + material[0] 0 4331 2106

Headlines (also posted as a comment on #167):

  • Current ILIKE-on-samples_map_lite already misses every v1 contract budget (cold 4-12s vs ≤2s target; warm 1.4-2.6s vs ≤500ms target). The static substrate prototype must substantially beat these.
  • `pottery Cyprus` returns 0 results because samples_map_lite has 0 Cyprus matches in label/place_name — Cyprus only appears in description. Direct evidence for Explorer FTS Track 1b: Honesty fix for query-spec / live mismatch #168 honesty-fix Direction A.
  • `Çatalhöyük` returns 0 — current ILIKE doesn't strip diacritics. The v1 tokenizer in Explorer FTS Track 3: Offline index builder + tokenizer regression set #170 fixes this.
  • Wildcard literals (`100%`, `_test`) return 0 cleanly without errors — `escapeIlikePattern` is correct.
  • `bytes_transfer = 0` and `seen_urls = []` everywhere — no fetches during the search window. DuckDB-WASM caches parquet metadata at page load and uses HTTP range reads from cache. Implication: "bytes per search" is a substrate-world metric, not a current-world metric. Worth pinning in Explorer FTS Track 2: search_index_v1 contract doc #169 budget framing.

Canonical query set (locked in #167)

`pottery`, `basalt`, `pottery Cyprus`, `xyzzyqqqplugh`, `100%`, `_test`, `Çatalhöyük`, `pottery` × OPENCONTEXT only, `pottery` × OPENCONTEXT × material[0].

Each query opens a fresh browser context for a true file-cold measurement, then re-runs on the same page for a warm measurement.

Open design questions (recorded on #167)

  1. Log channel: defaulted to `console.log(JSON.stringify(...))`. Alternative `CustomEvent` deferred unless non-Playwright observers appear.
  2. Bytes-transferred: defaulted to `performance.getEntriesByType('resource')` summing `transferSize`, falling back to `encodedBodySize`. Cross-origin `transferSize` requires `Timing-Allow-Origin` on R2; the empty bytes in the baseline indicate that R2 may not be setting TAO, but more likely that DuckDB-WASM caches metadata at load time. Worth verifying on a deployed preview.

Test plan

Out of scope

Refs

#165, #167, #174 (deferred concerns from review)

🤖 Generated with Claude Code

rdhyee and others added 3 commits May 8, 2026 07:01
…amplesorg#167)

Adds per-search instrumentation:
- performance.mark + measure around the DuckDB query
- structured console.log on every completion (success or error) with
  elapsed_ms, results_count, bytes_transfer, bytes_body, and filter
  composition flags
- search measures appended to the existing ?perf=1 panel

Cross-origin transferSize is 0 without Timing-Allow-Origin on R2;
encodedBodySize is reported as a fallback so we capture something
regardless of TAO. Validation of TAO on data.isamples.org tracked in
the issue thread.

No behavior change to the search itself. Track 1a of isamplesorg#165.

Refs isamplesorg#165, isamplesorg#167.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…esorg#167)

tests/test_search_perf.py drives the locked canonical query set against
the deployed (or local-preview) Explorer, captures the structured
isamples.search console events emitted by the new instrumentation, and
writes a per-run JSON artifact to tests/search_baseline_<UTC_DATE>.json.

Each query opens a fresh browser context to capture true file-cold
timings, then runs the same query a second time on the same page for a
warm measurement. Filter-composition queries set source-only filters
before searching.

Slow (~8-15 min). Run as a release-time benchmark, not on every CI
invocation. The baseline numbers will be posted as a comment on isamplesorg#167
once the instrumentation hits a deployed preview.

Refs isamplesorg#165, isamplesorg#167.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review fixes against the originally-opened PR. Two files affected:

explorer.qmd
- Capture per-URL byte data (seen_urls list) in the structured search
  log so analysis can post-hoc filter concurrent fetches that aren't
  attributable to the search itself, rather than only summed bytes.

tests/test_search_perf.py
- Replace deprecated dt.datetime.utcnow() with dt.datetime.now(UTC).
- Replace fixed 800ms wait_for_timeout after facet changes with a poll
  on '.facet-count.recomputing' clearing — the cross-filter handler
  has a 250ms debounce plus an async cluster reload that can exceed 1s
  cold; fixed waits flake on slow networks.
- Add a composed-source-material query that exercises the
  facetFilterSQL() pid-IN-subquery path, not just sourceFilterSQL().
  First material checkbox is used to avoid hard-coding URIs that may
  rotate between data snapshots.
- Tighten the final assertion: refuse to treat a benchmark with any
  silent failures as valid (was '>= half'). Partial JSON still written
  for diagnosis.

Refs isamplesorg#165, isamplesorg#167.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rdhyee
Copy link
Copy Markdown
Contributor Author

rdhyee commented May 8, 2026

Self-review fixes applied

Force-pushed (rebased onto upstream main + 1 fix-up commit). PR is now 2 commits, scope is purely `explorer.qmd` (+72/-1) and `tests/test_search_perf.py` (+263).

Dropped from PR: the unrelated `provenance/scancode-2026-05-04.json` commit that had been sitting unpushed on local main; carried in by accident when this branch was cut from main. Available on local main for a separate provenance PR.

Code fixes:

  • `explorer.qmd`: capture per-URL `seen_urls` list in the structured search log (rather than only summed `bytes_transfer` / `bytes_body`) so analysis can post-hoc filter concurrent fetches not attributable to the search.
  • `tests/test_search_perf.py`:
    • replace deprecated `dt.datetime.utcnow()` with `dt.datetime.now(UTC)` (Python 3.12+ compat)
    • replace fixed 800ms `wait_for_timeout` after facet changes with `wait_for_function` polling on `.facet-count.recomputing` clearing — the cross-filter handler debounces 250ms then runs an async cluster reload that can exceed 1s cold; fixed waits flake on slow networks
    • add `composed-source-material` query to canonical set; exercises the `facetFilterSQL()` pid-IN-subquery path (uses "first material checkbox" rather than hard-coded URI to survive data-snapshot rotation)
    • tighten final assertion: refuse to treat a benchmark with any silent failures as valid (was `>= half`); partial JSON still on disk for diagnosis

Deferred (filed as #174):

…amplesorg#167)

The material filter <div id='materialFilterBody'> ships with
display:none (explorer.qmd:280); Playwright's click times out because
the element is not visible. Add a one-time expansion of the section
header before iterating checkboxes.

Surfaced by the first perf-smoke run against local preview: 8/9
queries completed cleanly, only composed-source-material failed with
the visible-element timeout. The strict assertion added in commit
aec3a2b correctly caught the silent failure.

Refs isamplesorg#165, isamplesorg#167.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All 9 canonical queries completed cleanly after the material-section
expand fix (commit 8023c20).

Headline: current ILIKE-on-samples_map_lite already misses every v1
contract budget (cold 4-12s vs ≤2s target; warm 1.4-2.6s vs ≤500ms).
'pottery Cyprus' returns 0 results because samples_map_lite has 0
Cyprus matches (Cyprus only appears in description, which lite
doesn't carry). 'Çatalhöyük' returns 0 because ILIKE doesn't strip
diacritics. All this is direct evidence supporting isamplesorg#168 honesty-fix
direction A and substrate-substantially-better-than-ILIKE in isamplesorg#169.

Refs isamplesorg#165, isamplesorg#167.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rdhyee
Copy link
Copy Markdown
Contributor Author

rdhyee commented May 8, 2026

Review findings from this pass:

  1. ?perf=1 panel will not show searches run after initial render. explorer.qmd only reads existing performance.measure entries while the OJS perfPanel cell renders once after phase1; later doSearch() calls log to console but never update the panel DOM. This undercuts the PR claim that search measures are appended to the panel per invocation. Fix by appending/updating the perf panel from doSearch() or wiring an explicit refresh path.

  2. _wait_for_facet_settle() can return before the async facet/source work has even started. The helper waits for no .recomputing, but the app schedules recompute after a 250ms debounce and source changes call refreshFacetCounts() only after await loadRes(...) completes. The benchmark can start the search while cluster reload/facet recompute is still running, contaminating the composed-filter timings. The harness needs a positive “work started then finished” signal, or an explicit app-side settle marker.

  3. _measure_one_query() leaks browser contexts on failures. Since the outer loop catches exceptions and continues, a timeout leaves a context/page alive and can skew later measurements. Wrap the body in try/finally: context.close().

Verified only with python -m py_compile tests/test_search_perf.py; I did not run the full perf-smoke.

…busy flag, context cleanup)

Codex review of isamplesorg#173 surfaced three issues, all fixed here.

explorer.qmd:
- doSearch finally block now appends a row to the ?perf=1 panel per
  search (panel renders once at boot from existing measure entries; this
  hooks each subsequent search). The PR claim "search measures are
  appended to the existing ?perf=1 panel per invocation" is now true.
- Source filter and material/context/object_type change handlers wrap
  their async work in try/finally with body.classList 'explorer-busy'
  set during the chain. The 300ms post-refreshFacetCounts wait is
  intentional: refreshFacetCounts schedules a 250ms debounce that THEN
  sets .recomputing on facet count spans; we hold busy until that fires
  so external observers see a coherent "work started, work finished"
  signal instead of racing the debounce.
- handleFacetFilterChange is now async and awaits loadViewportSamples
  in the point-mode branch (previously fire-and-forget). Required so
  the busy flag honestly covers all triggered work.

tests/test_search_perf.py:
- _wait_for_facet_settle is now two-phase: wait for body:not(.explorer-
  busy), then wait for .facet-count:not(.recomputing). Polling either
  signal alone races the app's two-stage settle.
- _measure_one_query wraps its body in try/finally on context.close()
  so a TimeoutError in _run_search no longer leaks the browser
  context.

Refs isamplesorg#165, isamplesorg#173.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rdhyee
Copy link
Copy Markdown
Contributor Author

rdhyee commented May 8, 2026

Codex review fixes (commit `76e37b6`)

All three findings addressed:

Medium #1 (panel updates per search): `doSearch` finally block now appends a row to the `?perf=1` panel per search. The panel renders once at boot from existing measure entries; this hooks each subsequent search so the panel stays current. The PR claim "search measures are appended to the existing ?perf=1 panel per invocation" is now true.

Medium #2 (busy-flag race): Source filter and material/context/object_type change handlers wrap their async work in try/finally with `body.classList` 'explorer-busy' set during the chain. The 300 ms post-`refreshFacetCounts` wait is intentional: `refreshFacetCounts` schedules a 250 ms debounce that THEN sets `.recomputing` on facet count spans; we hold busy until that fires so external observers see a coherent "work started → work finished" signal instead of racing the debounce. `handleFacetFilterChange` is now async and awaits `loadViewportSamples` in the point-mode branch (was fire-and-forget) so the busy flag honestly covers all triggered work.

`_wait_for_facet_settle` in the test is now two-phase: wait for `body:not(.explorer-busy)`, then wait for `.facet-count:not(.recomputing)`. Polling either signal alone races the app's two-stage settle.

Low #3 (context leak on timeout): `_measure_one_query` body wrapped in try/finally on `context.close()`. A TimeoutError in `_run_search` no longer leaks the browser context.

Open question on #176: agreed — the spec MUST clause is not closed by doc honesty alone. Filed the UI-hint fix as commit `006d335` on PR #176.

Diff: +112/-43.

Codex round-2 review caught that the binary body.classList 'explorer-
busy' flag races when multiple change handlers overlap: e.g.
_apply_source_filter clicks 3 source checkboxes back-to-back, each
firing an independent async handler, each removing the same class on
finally. The first/fastest handler clears busy while later handlers'
loadRes() and facet recompute work is still running, so
_wait_for_facet_settle() returns early and the search runs against a
partially-loaded state.

Replace with a depth-counted acquire/release pair: class added on the
0→1 transition, cleared on the 1→0 transition, Math.max-guarded
against double-release.

Both handlers (source filter, material/context/object_type) now use
busyAcquire/busyRelease in their try/finally blocks.

Refs isamplesorg#173.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rdhyee
Copy link
Copy Markdown
Contributor Author

rdhyee commented May 8, 2026

Round 2 fix (commit `3b97e34`)

You're right — the binary class racing on overlapping handlers was the bug behind the bug. Replaced with a depth counter:

```js
let _busyDepth = 0;
function busyAcquire() {
if (_busyDepth === 0) document.body.classList.add('explorer-busy');
_busyDepth++;
}
function busyRelease() {
_busyDepth = Math.max(0, _busyDepth - 1);
if (_busyDepth === 0) document.body.classList.remove('explorer-busy');
}
```

Class is added on the 0→1 transition and cleared only on the 1→0 transition. `Math.max`-guarded against double-release. Both handlers (source filter, material/context/object_type) now use `busyAcquire`/`busyRelease` in their try/finally blocks.

This means `_apply_source_filter` clicking 3 source checkboxes back-to-back now correctly holds the busy flag through the LAST handler's full work, not just the first/fastest.

Diff: +22/-4.

@rdhyee rdhyee merged commit 362698a into isamplesorg:main May 8, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant