Explorer FTS Track 4: Browser query prototype + benchmark #171

@rdhyee

Description

Updated 2026-05-08 per Codex review (rounds 1 + 2) on #165. §5 added in round 1; round 2 reframed it as a BM25 oracle + escalation trigger, not a substitute for a hosted-search comparison. Hosted search remains a permanent contingency, not a question DuckDB FTS closes.

Sub-issue of #165. Depends on #170 (offline builder).

Goal

Implement the browser-side query path against the v1 substrate, behind a feature flag, and run the canonical benchmark to compare against the v1 contract budgets and the curated benchmark from #169.

Scope

1. Browser query path

  • New cell in explorer.qmd (or extracted module): searchSubstrate(term).
  • Steps per multi-term query:
    1. Tokenize term using the JS tokenizer from #170.
    2. Apply the query-time stopword policy (per #169 §3): drop curated stopwords from the bag-of-words AND.
    3. For each surviving token: resolve shard URL via the partition function; fetch the shard via DuckDB-WASM HTTP range read.
    4. Compute per-token postings with BM25 contribution (using DF + doc_len from the substrate).
    5. AND-combine across tokens (intersect pid sets, sum scores).
    6. Apply field weights from query-side config (per #169 §5).
    7. Sort by score, take top 50.
    8. Keyed pid IN (...) join back to samples_map_lite.parquet for display fields (label, source, lat, lng, place_name).
    9. Compose with sourceFilterSQL() and facetFilterSQL() (existing helpers in explorer.qmd:377 / :528).
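
Steps 4, 5, and 7 above can be sketched in isolation. This is a minimal, hypothetical sketch of the AND-combine stage, assuming each token's shard read has already been reduced to a `Map` of `pid → BM25 contribution`; `searchSubstrate`, shard fetching, and field weighting are out of scope here, and the helper name `andCombine` is illustrative, not from the codebase.

```javascript
// AND-combine per-token BM25 score maps: intersect pid sets, sum scores,
// sort descending, keep the top K (default 50 per step 7).
function andCombine(perTokenScores, topK = 50) {
  if (perTokenScores.length === 0) return [];
  // Start from the smallest postings map so the intersection loop is short.
  const sorted = [...perTokenScores].sort((a, b) => a.size - b.size);
  const [smallest, ...rest] = sorted;
  const combined = [];
  for (const [pid, score] of smallest) {
    let total = score;
    let present = true;
    for (const m of rest) {
      const s = m.get(pid);
      if (s === undefined) { present = false; break; } // pid missing → drop (AND)
      total += s; // sum BM25 contributions across tokens
    }
    if (present) combined.push({ pid, score: total });
  }
  combined.sort((a, b) => b.score - a.score);
  return combined.slice(0, topK);
}
```

The surviving pids then feed the keyed `pid IN (...)` display join in step 8.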

2. Feature flag
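
A minimal sketch of the flag check, assuming the flag is carried as `?fts=v1` in the page URL (per the acceptance criteria in §6); the function name is illustrative. Anything other than `fts=v1` falls back to the existing ILIKE path.

```javascript
// Returns true only when the page URL carries ?fts=v1.
function ftsV1Enabled(href) {
  const params = new URL(href).searchParams;
  return params.get("fts") === "v1";
}
```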

3. Benchmark run (browser-path)

The following per-query metrics are required in the benchmark JSON. This list is the contract between #171 (which produces the data) and #172 (which consumes it as hard-fail inputs). Adding a hard-fail in #172 without adding the metric here means the gate would evaluate against absent data.

| metric | source | feeds #172 hard-fail |
| --- | --- | --- |
| cold latency (ms) | `performance.measure('search-…')` | performance gate |
| warm-repeat-same-query latency (ms) | second invocation, same page | performance gate |
| warm-new-query-after-warm-up latency (ms) | different query, same page | performance gate |
| filter-composed cold latency (ms) | per-query × per-filter combo | performance gate |
| bytes transferred (cold + warm) | per #167 instrumentation | performance gate |
| results count | length of result rows | non-empty checks |
| top-K result PIDs (K = 50) | the ranked PID list | top-K overlap quality gates |
| top-K overlap vs hand-labeled (top-3, top-10) | computed in test harness | quality gate |
| top-K overlap vs DuckDB FTS oracle (top-3, top-10) | computed in test harness | quality gate |
| top-3 PIDs for concept-only queries (`ceramic`, `bone`, `mammal`, +1-2) | substrate run | concept-only top-3 relevance hard-fail |
| top-10 Jaccard between `pottery from Cyprus` and `pottery Cyprus` | two runs, computed | stopword near-equivalence hard-fail |
| top-K identity for `pottery pottery cyprus` vs `pottery cyprus` | two runs, computed | duplicate-term hard-fail |
| empty/short/long-token query outcome (state, fetch count, elapsed) | special test cases | edge-length-token hard-fail |
| all-stopword query (`a the of`) outcome | special test case | controlled-empty-state hard-fail |
| display-join-miss count (substrate hit, no samples_map_lite row) | per query | missing-display-join hard-fail |
| filter-composition top-K identity preservation per (query, filter) pair | hand-labeled expected filtered top-K | filter-composition hard-fail |
| tokenizer-parity check: every benchmark term tokenized identically by Python and JS | side-channel, run once per benchmark | tokenizer-parity hard-fail |
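
The latency rows can be captured with the User Timing API. A minimal sketch, assuming the `search-…` mark/measure naming from the table; the actual names in explorer.qmd may differ.

```javascript
// Wraps one search invocation in performance marks and returns both the
// result and the measured elapsed time (feeds the cold/warm latency rows).
function timedSearch(queryId, fn) {
  performance.mark(`search-${queryId}-start`);
  const result = fn();
  performance.mark(`search-${queryId}-end`);
  const measure = performance.measure(
    `search-${queryId}`,
    `search-${queryId}-start`,
    `search-${queryId}-end`
  );
  return { result, elapsedMs: measure.duration };
}
```

Cold vs warm is then a matter of when the wrapper is called: first invocation on a fresh page is cold, subsequent invocations are warm.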

4. Worst-case + concept-label coverage

  • Worst-case composition: rare token + 2 facets + source filter, cold.
  • Concept-only queries from the curated benchmark (ceramic, bone, mammal) must return non-empty results — verifies the v1 substrate dereferences vocabulary labels correctly. Failing any concept-only query is a hard fail, regardless of latency.
  • Stopword-heavy queries (pottery from Cyprus) must return non-empty results — verifies query-time stopword removal works.
  • Wildcard literals (%, _) handled by tokenizer (no ILIKE escape — substrate path doesn't use ILIKE).
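
The stopword near-equivalence check above reduces to a top-K Jaccard between two ranked PID lists. A minimal sketch (helper name illustrative):

```javascript
// Jaccard similarity of the top-K prefixes of two ranked PID lists, e.g.
// results for "pottery from Cyprus" vs "pottery Cyprus" with k = 10.
function topKJaccard(pidsA, pidsB, k = 10) {
  const a = new Set(pidsA.slice(0, k));
  const b = new Set(pidsB.slice(0, k));
  const inter = [...a].filter((p) => b.has(p)).length;
  const union = new Set([...a, ...b]).size;
  return union === 0 ? 1 : inter / union; // two empty lists count as identical
}
```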

5. DuckDB FTS local-only relevance oracle

Run a parallel relevance evaluation against a known-good BM25 system as a v1 quality oracle and escalation trigger. This anchors "does our static substrate approximate BM25 over the same document projection?" — it does not answer "is static-browser the right product boundary for good search?"
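
The overlap numbers the oracle feeds (top-3 / top-10 rows in the §3 table) can be computed as a simple overlap@K. A minimal sketch, assuming overlap is defined as the fraction of the oracle's top-K that also appears in the substrate's top-K; if the harness defines it differently, adjust accordingly.

```javascript
// Fraction of the oracle's top-K PIDs also present in the substrate's top-K.
function overlapAtK(substratePids, oraclePids, k) {
  const ours = new Set(substratePids.slice(0, k));
  const oracle = oraclePids.slice(0, k);
  const hits = oracle.filter((p) => ours.has(p)).length;
  return oracle.length === 0 ? 1 : hits / oracle.length;
}
```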

What this oracle does NOT cover (and therefore does NOT close the hosted-search question):

  • Richer analyzers (language-specific tokenizers, n-gram, stemming variations)
  • Phrase search with positional indexes
  • Typo tolerance / fuzzy matching
  • Tuning + explainability of relevance (boost knobs, explain traces)
  • API latency under composition with other filters at scale
  • v2+ field-growth ergonomics (adding 6+ entity-derived fields without rebuild pain)

These are reasons hosted search remains a permanent contingency — see §7 below + #172 NO-GO framing.

6. Acceptance criteria for the prototype

This issue ships the prototype + benchmark data. It does not decide GO/NO-GO — that's #172.

  • Substrate path implemented behind ?fts=v1
  • Composes correctly with source + facet filters
  • Worst-case composition test passes
  • All concept-only benchmark queries return non-empty results AND have their top-3 PIDs recorded
  • Stopword-heavy queries return non-empty results AND record top-10 Jaccard vs the stopword-stripped form
  • Wildcard literals handled
  • Benchmark JSON contains every metric listed in the §3 metrics-contract table (the data #172 consumes)
  • Top-3 / top-10 overlap numbers vs hand-labeled set AND vs DuckDB FTS reference posted to this issue

7. Hosted-search backend as a permanent contingency (not just NO-GO downstream)

The DuckDB FTS oracle in §5 anchors v1 quality. It does not close the hosted-search question. Hosted search (Solr / Meilisearch / Typesense / equivalent) remains a contingency for either of the following triggers:

  • (a) v1 GO/NO-GO failure in #172 — substrate misses budgets or quality thresholds.
  • (b) v2+ quality requirements that exceed what a static substrate can deliver — e.g., phrase search, typo tolerance, richer analyzer pipelines, or v2 field growth that pushes the static substrate over its byte budget.

Either trigger fires the same downstream issue: Explorer FTS Track 6: Hosted-search backend. A v1 GO does not close (b); it just means we ship the static substrate and revisit hosted search when v2 requirements demand it.

Out of scope

Refs

#165, #169, #170, #172, PR #95
