
docs: add SEARCH_INDEX_V1.md (closes #169) #175

Merged
rdhyee merged 1 commit into isamplesorg:main from rdhyee:search-index-v1-contract
May 8, 2026


rdhyee commented May 8, 2026

Summary

The v1 search-substrate contract for the Explorer FTS work. Doc-only PR; closes #169.

  • Sample-centric document projection with virtual fields (`sample.label`, `sample.description`, `sample.place_name`, `concept.label`).
  • Concept labels in v1 minimum — dereferenced from `material` / `context` / `object_type` URIs via `vocab_labels.parquet`. Coverage data made this load-bearing: `description` is only ~24% populated on MaterialSampleRecord, but facet URIs are near-universal.
  • Tokenizer: lowercase + NFKC + diacritic strip + whitespace split + length filter; no stemming; no stopwords (build-time).
  • Query-time policy is distinct from the build-time tokenizer: English stopwords are dropped from the bag-of-words AND query, so `pottery from Cyprus` doesn't fail on `from`.
  • BM25 with precomputed DF + doc_len, fixed `k1=1.2`, `b=0.75`. Field weights live in query code, not substrate data.
  • Hash-by-token partition with a 5 MB per-shard cap; high-frequency tokens sub-shard by `hash(pid)`.
  • Budgets are part of the contract: cold ≤ 2 s, warm ≤ 500 ms, filter-composed ≤ 3 s, ≤ 5 MB cold + ≤ 1 MB warm. Quality thresholds will be calibrated once the #167 baseline (Explorer FTS Track 1a: Browser perf-smoke baseline) lands.
  • Hard quality gate: no concept-only or stopword-heavy query may return an empty result, and all hard-fail invariants in #172 (Explorer FTS Track 5: GO/NO-GO decision gate) must pass.
  • `build_stats.json` artifact requirement: prevents the contract from drifting from the builder (see §6 of #170, Explorer FTS Track 3: Offline index builder + tokenizer regression set).
  • v1 ≠ destination: §1 names every v1.5 / v2 expansion vector, schema admits content-only growth, no migration.
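
To make the split between the build-time tokenizer and the query-time stopword policy concrete, here is a minimal Python sketch. It is not taken from SEARCH_INDEX_V1.md: the length bounds, the stopword list, the function names, and the fallback for all-stopword queries are assumptions chosen for illustration.

```python
import unicodedata

# Assumed values: the PR does not state the length bounds or the stopword list.
MIN_LEN, MAX_LEN = 2, 64
QUERY_STOPWORDS = frozenset({"a", "an", "and", "from", "in", "of", "the", "to"})


def tokenize(text: str) -> list[str]:
    """Build-time tokenizer: NFKC, lowercase, diacritic strip, whitespace split,
    length filter. No stemming and no stopword removal at this stage."""
    text = unicodedata.normalize("NFKC", text).lower()
    # Decompose so combining marks become separate code points, then drop them.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return [tok for tok in stripped.split() if MIN_LEN <= len(tok) <= MAX_LEN]


def query_terms(query: str) -> list[str]:
    """Query-time policy: same tokenizer, then drop stopwords before the AND."""
    terms = tokenize(query)
    kept = [t for t in terms if t not in QUERY_STOPWORDS]
    # Assumption: if the query is nothing but stopwords, keep the raw terms
    # rather than issuing an empty query.
    return kept or terms
```

Because stopwords stay in the index and are only removed from the query, the policy can change later without rebuilding the index.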

Companion to `EXPLORER_STATE.md` (PR #166): that doc pins the UI-surface contract (option C — side-panel + result-pin overlay); this doc pins the backend contract. The two are intentionally orthogonal.
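
As a rough illustration of the partition rule listed in the summary (hash-by-token, with high-frequency tokens sub-sharded by `hash(pid)`), the routing might look like the sketch below. The shard count, the hash function, the decimal reading of "5 MB", and all function names are hypothetical; only the 5 MB cap and the hash-by-token / hash-by-pid split come from the PR description.

```python
import hashlib

SHARD_CAP_BYTES = 5_000_000  # 5 MB cap from the contract (decimal MB is an assumption)
N_TOKEN_SHARDS = 256         # hypothetical shard count, not stated in the PR


def stable_hash(value: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def token_shard(token: str) -> int:
    """Primary partition: every posting for a token lands in one token shard."""
    return stable_hash(token) % N_TOKEN_SHARDS


def sub_shard_count(posting_bytes: int) -> int:
    """Number of sub-shards needed so the average sub-shard fits under the cap."""
    return max(1, -(-posting_bytes // SHARD_CAP_BYTES))  # ceiling division


def shard_key(token: str, pid: str, posting_bytes: int) -> tuple[int, int]:
    """Return (token shard, sub-shard); the sub-shard is 0 unless the token's
    posting list exceeds the cap, in which case it is chosen by hash(pid)."""
    n_sub = sub_shard_count(posting_bytes)
    sub = stable_hash(pid) % n_sub if n_sub > 1 else 0
    return token_shard(token), sub
```

Hashing by `pid` spreads a large posting list only approximately evenly, so a real builder would still need to verify each sub-shard against the cap after writing it.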

Test plan

  • Review §1 v1 / v1.5 / v2 tiering against `query-spec.qmd:213-221` Solr surface
  • Confirm field weights in §5 (label 3.0, concept 2.5, place 2.0, description 1.0) match intuition
  • Confirm partition shape in §6 (hash-by-token, 5 MB cap, sub-shard rule for high-DF tokens)
  • Confirm budget table in §7 is what we want as contract
  • Confirm quality-gate shape in §9 (top-K vs hand-labeled + DuckDB FTS oracle, hard-fails)
  • Confirm `build_stats.json` schema in §10 carries the right empirical signals
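
For reviewers checking the §5 weights, a small Python sketch of how the fixed BM25 parameters and the per-field weights could combine is given here. The IDF variant and the weighted sum over fields are assumptions, since the PR description does not say how per-field scores are combined. The field names are shorthand for the virtual fields listed in the summary.

```python
import math

K1, B = 1.2, 0.75  # fixed by the contract
# Weights from the test plan; they live in query code, not in the index.
FIELD_WEIGHTS = {"label": 3.0, "concept": 2.5, "place": 2.0, "description": 1.0}


def idf(df: int, n_docs: int) -> float:
    """Assumed IDF variant (the non-negative form used by Lucene's BM25)."""
    return math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))


def bm25_field(tf: int, df: int, doc_len: int, avg_len: float, n_docs: int) -> float:
    """BM25 score of one term in one field, from precomputed DF and doc_len."""
    norm = 1.0 - B + B * doc_len / avg_len
    return idf(df, n_docs) * tf * (K1 + 1.0) / (tf + K1 * norm)


def score(term_stats, n_docs: int) -> float:
    """Weighted sum over (field, tf, df, doc_len, avg_len) tuples for one document.
    Combining fields by a weighted sum is an assumption of this sketch."""
    return sum(
        FIELD_WEIGHTS[field] * bm25_field(tf, df, doc_len, avg_len, n_docs)
        for field, tf, df, doc_len, avg_len in term_stats
    )
```

Under this scheme a single label hit scores three times an otherwise identical description hit, which is one way to sanity-check whether the weights "match intuition".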

Closes #169. Refs #165, #166, #170, #171, #172, #174.

🤖 Generated with Claude Code

The v1 search-substrate contract — sample-centric document projection
with concept.label as a v1 minimum, BM25 ranking with precomputed DF +
doc_len, hash-by-token partitioning with a 5 MB shard cap, query-time
stopword policy distinct from build-time tokenizer, hard quality gate,
and a build_stats.json artifact requirement that prevents the contract
from drifting from the builder.

Companion to EXPLORER_STATE.md (UI-surface contract, option C).
Backend changes are intentionally orthogonal to the UI contract; this
doc pins one half, the other doc pins the other.

Refs isamplesorg#165, isamplesorg#170, isamplesorg#171, isamplesorg#172, isamplesorg#174.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rdhyee commented May 8, 2026

Review result: no blocking issues found. The SEARCH_INDEX_V1 contract has a coherent v1/v1.5/v2 split, separates build-time tokenization from query-time stopword policy, and includes the build_stats requirement needed to keep implementation honest.

@rdhyee rdhyee merged commit 7a77057 into isamplesorg:main May 8, 2026
1 check passed

Linked issue: Explorer FTS Track 2: search_index_v1 contract doc