Improve Interactive Explorer full-text search substrate #165

@rdhyee

Context

Follow-up to #84 and PR #95. The DuckDB FTS spike answered the first question: native DuckDB FTS works, including BM25 ranking, but the generated browser-delivered index is too large to ship by default.

As of 2026-05-08, the live Interactive Explorer at https://isamples.org/explorer.html does not have production full-text search in the search-engine sense. It has browser-side DuckDB-WASM substring search over Parquet.

Current behavior

  • The live page is backed by isamplesorg.github.io/explorer.qmd.
  • Search runs against https://data.isamples.org/isamples_202601_samples_map_lite.parquet (~60 MB).
  • The live search predicate covers only:
    • label
    • place_name
  • Search uses ILIKE '%term%', with multi-word queries split into terms and combined with AND.
  • Result ordering is a simple score:
    • label match = 3
    • place_name match = 2
  • Search composes with active source and facet filters.
  • Search does not currently cover description in the live explorer because samples_map_lite.parquet does not carry it.
  • Search does not cover the canonical Solr searchText equivalent described in query-spec.qmd.
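The behavior listed above can be sketched as a small SQL-fragment builder. This is an illustration only, not the code in explorer.qmd: the function names are hypothetical Python analogs of textSearchWhere and textSearchScore, the wildcard escaping is an assumption (the issue does not say whether the live page escapes % and _), and summing field weights into one score is also an assumption.

```python
# Hypothetical sketch of the live-style predicate: each term must match
# label OR place_name, and terms are combined with AND.
SEARCH_FIELDS = {"label": 3, "place_name": 2}  # field -> score from the issue


def escape_like(term: str) -> str:
    """Escape LIKE wildcards so % and _ match literally (assumed behavior)."""
    for ch in ("\\", "%", "_"):  # escape the escape character first
        term = term.replace(ch, "\\" + ch)
    return term


def like_pattern(term: str) -> str:
    """Return a quoted SQL literal of the form '%term%'."""
    body = "%" + escape_like(term) + "%"
    return "'" + body.replace("'", "''") + "'"


def text_search_where(query: str) -> str:
    """Build the WHERE fragment for a whitespace-separated query."""
    terms = query.split()
    if not terms:
        return "TRUE"
    clauses = []
    for term in terms:
        per_field = " OR ".join(
            f"{field} ILIKE {like_pattern(term)} ESCAPE '\\'"
            for field in SEARCH_FIELDS
        )
        clauses.append(f"({per_field})")
    return " AND ".join(clauses)


def text_search_score(query: str) -> str:
    """Build a score expression; adding the field weights is an assumption."""
    terms = query.split()
    if not terms:
        return "0"
    parts = []
    for field, weight in SEARCH_FIELDS.items():
        hits = " OR ".join(
            f"{field} ILIKE {like_pattern(term)} ESCAPE '\\'" for term in terms
        )
        parts.append(f"CASE WHEN {hits} THEN {weight} ELSE 0 END")
    return " + ".join(parts)
```

For example, `text_search_where("pottery cyprus")` yields two parenthesized groups joined by AND, which can then be combined with the active source and facet filters in the same WHERE clause.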

Relevant code/docs:

  • explorer.qmd: searchTerms, textSearchWhere, textSearchScore
  • explorer.qmd: search handler around doSearch()
  • query-spec.qmd: intended text MATCHES semantics
  • tools/build_fts_index.py: preserved DuckDB FTS spike artifact

Mismatch to resolve

query-spec.qmd currently says the web Explorer subset is label + description + place_name, but the live explorer searches only label + place_name.

Either:

  1. Update the docs/UI to state the actual live subset, or
  2. Change the live search implementation to include description, probably via sample_facets_v2.parquet or another search-optimized projection.

Measured behavior

These are native DuckDB measurements against the live hosted Parquet files, run on 2026-05-08. They are not browser timings, but they show the rough scan cost.

Live-style samples_map_lite search over label + place_name:

  • pottery: 50 results, ~1.8s cold
  • basalt: 50 results, ~0.5s
  • unlikely no-hit term: 0 results, ~0.2s

sample_facets_v2 search over label + description + place_name:

  • pottery: 50 results, ~3.0s
  • cyprus: 50 results, ~0.7s
  • basalt: 50 results, ~0.5s

DuckDB FTS spike findings from PR #95:

  • Full index (label + description + place_name): ~358 MB
  • Lite index (label + place_name): ~211 MB
  • ATTACH over HTTP in DuckDB-WASM works, but a 211-358 MB download is too large for an interactive default page.

Recommended path

Build a small, browser-friendly, pre-tokenized inverted-index Parquet substrate rather than shipping a DuckDB .duckdb FTS database by default.

Sketch:

  • Offline pipeline creates token rows such as:
    • token
    • pid
    • field
    • weight
    • optional term_frequency
    • optional source
    • optional partition/hash/prefix columns
  • Host it under https://data.isamples.org/ as versioned Parquet.
  • Partition by token prefix/hash so browser queries touch only relevant byte ranges.
  • For multi-term queries, fetch/intersect PID sets, score by field/term weights, then join back to samples_map_lite or sample_facets_v2 for display fields.
  • Keep DuckDB FTS as an optional "enhanced search" experiment only if users explicitly accept a large download.
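The sketch above can be illustrated in memory with plain Python. This is a minimal sketch under stated assumptions, not a proposed implementation: the tokenizer, the CRC32-based partition function, the partition count, and the description weight are all hypothetical, and a real pipeline would write the rows to partitioned Parquet rather than keep them in a list.

```python
import re
import zlib
from collections import Counter

# Assumed weights: label and place_name reuse the live scores; the
# description weight is a placeholder chosen for illustration.
FIELD_WEIGHTS = {"label": 3.0, "place_name": 2.0, "description": 1.0}
N_PARTITIONS = 16  # hypothetical partition count


def tokenize(text):
    """Lowercase and split on non-alphanumerics (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", (text or "").lower())


def partition_of(token):
    """Stable hash bucket so a query term maps to one partition."""
    return zlib.crc32(token.encode("utf-8")) % N_PARTITIONS


def build_token_rows(samples):
    """Offline step: emit one row per (token, pid, field)."""
    rows = []
    for sample in samples:
        for field, weight in FIELD_WEIGHTS.items():
            for token, tf in Counter(tokenize(sample.get(field))).items():
                rows.append({
                    "token": token, "pid": sample["pid"], "field": field,
                    "weight": weight, "term_frequency": tf,
                    "partition": partition_of(token),
                })
    return rows


def search(rows, query, limit=50):
    """Query step: per-term PID scores, intersect across terms, then rank."""
    terms = tokenize(query)
    if not terms:
        return []
    per_term = []
    for term in terms:
        part = partition_of(term)  # only this partition would be fetched
        scores = {}
        for row in rows:
            if row["partition"] == part and row["token"] == term:
                scores[row["pid"]] = scores.get(row["pid"], 0.0) + row["weight"]
        per_term.append(scores)
    pids = set(per_term[0])
    for scores in per_term[1:]:
        pids &= set(scores)  # AND semantics across terms
    ranked = sorted(pids, key=lambda p: (-sum(s[p] for s in per_term), p))
    return ranked[:limit]
```

The returned PIDs would then be joined back to samples_map_lite or sample_facets_v2 for display fields. Scoring here ignores term_frequency; whether to use it (for example, in a BM25-style score) is left open.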

Acceptance criteria

  • Decide and document current search semantics: side-panel lookup vs global filter. Coordinate with #164 (Explorer state contract: URL/DOM/widget-state inventory + search-as-global-filter decision).
  • Resolve the query-spec.qmd mismatch for the current live subset.
  • Add a search substrate design note covering data shape, partitioning, update cadence, and expected size.
  • Prototype an inverted-index Parquet file and benchmark:
    • first search latency
    • repeated search latency
    • bytes transferred
    • result quality for pottery, pottery Cyprus, basalt, and source/facet-filtered searches
  • Update the Explorer UI copy so users know which fields are searched.
  • Add Playwright or equivalent smoke coverage for:
    • multi-term AND behavior
    • literal wildcard characters such as % and _
    • source/facet-filter composition
    • no-result behavior
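For the latency benchmarks listed above, a minimal timing harness might look like the following. It is a sketch only: `benchmark_search` and its `run_search` callback are hypothetical names, the callback is assumed to return a list of results, and bytes transferred are not measured here (in the browser that would come from network instrumentation instead).

```python
import statistics
import time


def benchmark_search(run_search, query, repeats=5):
    """Time the first call separately from repeated calls for one query.

    run_search is any callable taking a query string and returning a list
    of results (an assumption for this sketch). repeats must be >= 1.
    """
    start = time.perf_counter()
    first_result = run_search(query)
    first_s = time.perf_counter() - start

    repeated = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_search(query)
        repeated.append(time.perf_counter() - start)

    return {
        "query": query,
        "result_count": len(first_result),
        "first_s": first_s,
        "repeat_median_s": statistics.median(repeated),
    }
```

Running it over pottery, pottery Cyprus, and basalt with the same callback gives comparable first-search and repeated-search numbers for each candidate substrate.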

Labels: enhancement, explorer