Explorer FTS Track 4: Browser query prototype + benchmark #171

@rdhyee

Description

Updated 2026-05-08 per Codex review (rounds 1 + 2) on #165. §5 added in round 1; round 2 reframed it as a BM25 oracle + escalation trigger, not a substitute for a hosted-search comparison. Hosted search remains a permanent contingency, not a question DuckDB FTS closes.

Sub-issue of #165. Depends on #170 (offline builder).

Goal

Implement the browser-side query path against the v1 substrate, behind a feature flag, and run the canonical benchmark to compare against the v1 contract budgets and the curated benchmark from #169.

Scope

1. Browser query path

  • New cell in explorer.qmd (or extracted module): searchSubstrate(term).
  • Steps per multi-term query:
    1. Tokenize term using the JS tokenizer from #170.
    2. Apply the query-time stopword policy (per #169 §3): drop curated stopwords from the bag-of-words AND.
    3. For each surviving token: resolve shard URL via the partition function; fetch the shard via DuckDB-WASM HTTP range read.
    4. Compute per-token postings with BM25 contribution (using DF + doc_len from the substrate).
    5. AND-combine across tokens (intersect pid sets, sum scores).
    6. Apply field weights from query-side config (per #169 §5).
    7. Sort by score, take top 50.
    8. Keyed pid IN (...) join back to samples_map_lite.parquet for display fields (label, source, lat, lng, place_name).
    9. Compose with sourceFilterSQL() and facetFilterSQL() (existing helpers in explorer.qmd:377 / :528).
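
Steps 4, 5, and 7 above can be sketched in isolation. This is a minimal, hypothetical sketch of the AND-combine stage, assuming each token's shard read has already been reduced to a `Map` of `pid → BM25 contribution`; `searchSubstrate`, shard fetching, and field weighting are out of scope here, and the helper name `andCombine` is illustrative, not from the codebase.

```javascript
// AND-combine per-token BM25 score maps: intersect pid sets, sum scores,
// sort descending, keep the top K (default 50 per step 7).
function andCombine(perTokenScores, topK = 50) {
  if (perTokenScores.length === 0) return [];
  // Start from the smallest postings map so the intersection loop is short.
  const sorted = [...perTokenScores].sort((a, b) => a.size - b.size);
  const [smallest, ...rest] = sorted;
  const combined = [];
  for (const [pid, score] of smallest) {
    let total = score;
    let present = true;
    for (const m of rest) {
      const s = m.get(pid);
      if (s === undefined) { present = false; break; } // pid missing → drop (AND)
      total += s; // sum BM25 contributions across tokens
    }
    if (present) combined.push({ pid, score: total });
  }
  combined.sort((a, b) => b.score - a.score);
  return combined.slice(0, topK);
}
```

The surviving pids then feed the keyed `pid IN (...)` display join in step 8.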

2. Feature flag
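
A minimal sketch of the flag check, assuming the flag is carried as `?fts=v1` in the page URL (per the acceptance criteria in §6); the function name is illustrative. Anything other than `fts=v1` falls back to the existing ILIKE path.

```javascript
// Returns true only when the page URL carries ?fts=v1.
function ftsV1Enabled(href) {
  const params = new URL(href).searchParams;
  return params.get("fts") === "v1";
}
```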

3. Benchmark run (browser-path)

The following per-query metrics are required in the benchmark JSON. This list is the contract between #171 (which produces the data) and #172 (which consumes it as hard-fail inputs). Adding a hard-fail in #172 without adding the metric here means the gate would evaluate against absent data.

| metric | source | feeds #172 hard-fail |
| --- | --- | --- |
| cold latency (ms) | `performance.measure('search-…')` | performance gate |
| warm-repeat-same-query latency (ms) | second invocation, same page | performance gate |
| warm-new-query-after-warm-up latency (ms) | different query, same page | performance gate |
| filter-composed cold latency (ms) | per-query × per-filter combo | performance gate |
| bytes transferred (cold + warm) | per #167 instrumentation | performance gate |
| results count | length of result rows | non-empty checks |
| top-K result PIDs (K = 50) | the ranked PID list | top-K overlap quality gates |
| top-K overlap vs hand-labeled (top-3, top-10) | computed in test harness | quality gate |
| top-K overlap vs DuckDB FTS oracle (top-3, top-10) | computed in test harness | quality gate |
| top-3 PIDs for concept-only queries (`ceramic`, `bone`, `mammal`, +1-2) | substrate run | concept-only top-3 relevance hard-fail |
| top-10 Jaccard between `pottery from Cyprus` and `pottery Cyprus` | two runs, computed | stopword near-equivalence hard-fail |
| top-K identity for `pottery pottery cyprus` vs `pottery cyprus` | two runs, computed | duplicate-term hard-fail |
| empty/short/long-token query outcome (state, fetch count, elapsed) | special test cases | edge-length-token hard-fail |
| all-stopword query (`a the of`) outcome | special test case | controlled-empty-state hard-fail |
| display-join-miss count (substrate hit, no samples_map_lite row) | per query | missing-display-join hard-fail |
| filter-composition top-K identity preservation per (query, filter) pair | hand-labeled expected filtered top-K | filter-composition hard-fail |
| tokenizer-parity check: every benchmark term tokenized identically by Python and JS | side-channel, run once per benchmark | tokenizer-parity hard-fail |
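
The latency rows can be captured with the User Timing API. A minimal sketch, assuming the `search-…` mark/measure naming from the table; the actual names in explorer.qmd may differ.

```javascript
// Wraps one search invocation in performance marks and returns both the
// result and the measured elapsed time (feeds the cold/warm latency rows).
function timedSearch(queryId, fn) {
  performance.mark(`search-${queryId}-start`);
  const result = fn();
  performance.mark(`search-${queryId}-end`);
  const measure = performance.measure(
    `search-${queryId}`,
    `search-${queryId}-start`,
    `search-${queryId}-end`
  );
  return { result, elapsedMs: measure.duration };
}
```

Cold vs warm is then a matter of when the wrapper is called: first invocation on a fresh page is cold, subsequent invocations are warm.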

4. Worst-case + concept-label coverage

  • Worst-case composition: rare token + 2 facets + source filter, cold.
  • Concept-only queries from the curated benchmark (ceramic, bone, mammal) must return non-empty results — verifies the v1 substrate dereferences vocabulary labels correctly. Failing any concept-only query is a hard fail, regardless of latency.
  • Stopword-heavy queries (pottery from Cyprus) must return non-empty results — verifies query-time stopword removal works.
  • Wildcard literals (%, _) handled by tokenizer (no ILIKE escape — substrate path doesn't use ILIKE).
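
The stopword near-equivalence check above reduces to a top-K Jaccard between two ranked PID lists. A minimal sketch (helper name illustrative):

```javascript
// Jaccard similarity of the top-K prefixes of two ranked PID lists, e.g.
// results for "pottery from Cyprus" vs "pottery Cyprus" with k = 10.
function topKJaccard(pidsA, pidsB, k = 10) {
  const a = new Set(pidsA.slice(0, k));
  const b = new Set(pidsB.slice(0, k));
  const inter = [...a].filter((p) => b.has(p)).length;
  const union = new Set([...a, ...b]).size;
  return union === 0 ? 1 : inter / union; // two empty lists count as identical
}
```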

5. DuckDB FTS local-only relevance oracle

Run a parallel relevance evaluation against a known-good BM25 system as a v1 quality oracle and escalation trigger. This anchors "does our static substrate approximate BM25 over the same document projection?" — it does not answer "is static-browser the right product boundary for good search?"
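
The overlap numbers the oracle feeds (top-3 / top-10 rows in the §3 table) can be computed as a simple overlap@K. A minimal sketch, assuming overlap is defined as the fraction of the oracle's top-K that also appears in the substrate's top-K; if the harness defines it differently, adjust accordingly.

```javascript
// Fraction of the oracle's top-K PIDs also present in the substrate's top-K.
function overlapAtK(substratePids, oraclePids, k) {
  const ours = new Set(substratePids.slice(0, k));
  const oracle = oraclePids.slice(0, k);
  const hits = oracle.filter((p) => ours.has(p)).length;
  return oracle.length === 0 ? 1 : hits / oracle.length;
}
```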

What this oracle does NOT cover (and therefore does NOT close the hosted-search question):

  • Richer analyzers (language-specific tokenizers, n-gram, stemming variations)
  • Phrase search with positional indexes
  • Typo tolerance / fuzzy matching
  • Tuning + explainability of relevance (boost knobs, explain traces)
  • API latency under composition with other filters at scale
  • v2+ field-growth ergonomics (adding 6+ entity-derived fields without rebuild pain)

These are reasons hosted search remains a permanent contingency — see §7 below + #172 NO-GO framing.

6. Acceptance criteria for the prototype

This issue ships the prototype + benchmark data. It does not decide GO/NO-GO — that's #172.

  • Substrate path implemented behind ?fts=v1
  • Composes correctly with source + facet filters
  • Worst-case composition test passes
  • All concept-only benchmark queries return non-empty results AND have their top-3 PIDs recorded
  • Stopword-heavy queries return non-empty results AND record top-10 Jaccard vs the stopword-stripped form
  • Wildcard literals handled
  • Benchmark JSON contains every metric listed in the §3 metrics-contract table (the data #172 consumes)
  • Top-3 / top-10 overlap numbers vs hand-labeled set AND vs DuckDB FTS reference posted to this issue

7. Hosted-search backend as a permanent contingency (not just NO-GO downstream)

The DuckDB FTS oracle in §5 anchors v1 quality. It does not close the hosted-search question. Hosted search (Solr / Meilisearch / Typesense / equivalent) remains a contingency for either of the following triggers:

  • (a) v1 GO/NO-GO failure in #172 — substrate misses budgets or quality thresholds.
  • (b) v2+ quality requirements that exceed what a static substrate can deliver — e.g., phrase search, typo tolerance, richer analyzer pipelines, or v2 field growth that pushes the static substrate over its byte budget.

Either trigger fires the same downstream issue: Explorer FTS Track 6: Hosted-search backend. A v1 GO does not close (b); it just means we ship the static substrate and revisit hosted search when v2 requirements demand it.

Out of scope

Refs

#165, #169, #170, #172, PR #95
