Context
Follow-up to #84 and PR #95. The DuckDB FTS spike answered the first question: native DuckDB FTS works, including BM25 ranking, but the generated browser-delivered index is too large to ship by default.
As of 2026-05-08, the live Interactive Explorer at https://isamples.org/explorer.html does not have production full-text search in the search-engine sense. It has browser-side DuckDB-WASM substring search over Parquet.
Current behavior
The live page is backed by isamplesorg.github.io/explorer.qmd.
Search runs against https://data.isamples.org/isamples_202601_samples_map_lite.parquet (~60 MB).
The live search predicate covers only:
label
place_name
Search uses ILIKE '%term%', with multi-word queries split into terms and combined with AND.
Result ordering is a simple score:
label match = 3
place_name match = 2
Search composes with active source and facet filters.
Search does not currently cover description in the live explorer because samples_map_lite.parquet does not carry it.
Search does not cover the canonical Solr searchText equivalent described in query-spec.qmd.
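The predicate and score described above can be sketched as a pair of SQL-fragment builders. This is a minimal illustration, not the code in explorer.qmd: the Python function names are hypothetical stand-ins, summing the weights once per matching term is an assumption about how the score combines, and the wildcard escaping assumes the engine accepts an ESCAPE clause on ILIKE.

```python
# Hypothetical sketch of the live-style substring search; not the explorer.qmd code.
FIELD_WEIGHTS = {"label": 3, "place_name": 2}  # weights taken from the list above


def split_terms(query):
    """Split a multi-word query into whitespace-separated terms."""
    return query.split()


def like_pattern(term):
    """Return a quoted '%term%' pattern with LIKE wildcards and quotes escaped."""
    escaped = term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    escaped = escaped.replace("'", "''")  # SQL single-quote escaping
    return f"'%{escaped}%'"


def text_search_where(query, fields=tuple(FIELD_WEIGHTS)):
    """Each term must match at least one field; terms are combined with AND."""
    clauses = []
    for term in split_terms(query):
        pattern = like_pattern(term)
        any_field = " OR ".join(
            f"{field} ILIKE {pattern} ESCAPE '\\'" for field in fields
        )
        clauses.append(f"({any_field})")
    return " AND ".join(clauses) if clauses else "TRUE"


def text_search_score(query):
    """Assumed scoring: add the field weight for every (term, field) match."""
    parts = []
    for term in split_terms(query):
        pattern = like_pattern(term)
        for field, weight in FIELD_WEIGHTS.items():
            parts.append(
                f"CASE WHEN {field} ILIKE {pattern} ESCAPE '\\' THEN {weight} ELSE 0 END"
            )
    return " + ".join(parts) if parts else "0"


if __name__ == "__main__":
    print("WHERE", text_search_where("pottery Cyprus"))
    print("ORDER BY", text_search_score("pottery Cyprus"), "DESC")
```

Because both fragments are plain boolean and integer expressions, they can be ANDed with whatever source and facet filters are already active.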
DuckDB FTS spike findings from PR #95:
Full index (label + description + place_name): ~358 MB
Lite index (label + place_name): ~211 MB
ATTACH over HTTP in DuckDB-WASM works, but downloading 200-358 MB is too large for an interactive default page.
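To make the size concern concrete, the arithmetic below estimates transfer time for the two index files. It is a rough sketch: the 50 Mbit/s link speed is an assumed figure, not a measurement, and it assumes the whole file has to be transferred before search is usable.

```python
# Back-of-the-envelope transfer times; the link speed is an assumption.
ASSUMED_LINK_MBIT_PER_S = 50

INDEX_SIZES_MB = {
    "lite index (label + place_name)": 211,
    "full index (label + description + place_name)": 358,
}


def transfer_seconds(size_mb, link_mbit_per_s):
    """Megabytes to megabits (x8), divided by link speed in Mbit/s."""
    return size_mb * 8 / link_mbit_per_s


for name, size_mb in INDEX_SIZES_MB.items():
    seconds = transfer_seconds(size_mb, ASSUMED_LINK_MBIT_PER_S)
    print(f"{name}: ~{seconds:.0f} s at {ASSUMED_LINK_MBIT_PER_S} Mbit/s")
```

Under that assumption the lite index takes roughly 34 seconds and the full index roughly 57 seconds, before any query runs.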
Recommended path
Build a small, browser-friendly, pre-tokenized inverted-index Parquet substrate rather than shipping a DuckDB .duckdb FTS database by default.
Sketch:
Offline pipeline creates token rows such as:
token
pid
field
weight
optional term_frequency
optional source
optional partition/hash/prefix columns
Host it under https://data.isamples.org/ as versioned Parquet.
Partition by token prefix/hash so browser queries touch only relevant byte ranges.
For multi-term queries, fetch/intersect PID sets, score by field/term weights, then join back to samples_map_lite or sample_facets_v2 for display fields.
Keep DuckDB FTS as an optional "enhanced search" experiment only if users explicitly accept a large download.
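The sketch above can be illustrated end to end with in-memory data. This is a minimal model under stated assumptions, not a proposed implementation: the tokenizer, the two-character prefix partitioning, the description weight of 1, the weight times term_frequency scoring, and the sample PIDs are all hypothetical. A real pipeline would write the rows to versioned Parquet and join the winning PIDs back to samples_map_lite or sample_facets_v2, which is not shown here.

```python
import re
from collections import Counter, defaultdict

# label/place_name weights mirror the live score; the description weight is assumed.
FIELD_WEIGHTS = {"label": 3, "place_name": 2, "description": 1}

# Hypothetical sample records standing in for the real Parquet rows.
SAMPLES = [
    {"pid": "s1", "label": "Pottery sherd", "place_name": "Cyprus",
     "description": "Red slip pottery"},
    {"pid": "s2", "label": "Basalt core", "place_name": "Hawaii",
     "description": "Olivine basalt"},
    {"pid": "s3", "label": "Pottery bowl", "place_name": "Crete",
     "description": None},
]


def tokenize(text):
    """Lowercase and split on non-alphanumerics (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_token_rows(samples):
    """Offline step: emit one row per (token, pid, field)."""
    rows = []
    for sample in samples:
        for field, weight in FIELD_WEIGHTS.items():
            counts = Counter(tokenize(sample.get(field) or ""))
            for token, tf in counts.items():
                rows.append({"token": token, "pid": sample["pid"], "field": field,
                             "weight": weight, "term_frequency": tf,
                             "prefix": token[:2]})
    return rows


def partition_by_prefix(rows):
    """Group rows by token prefix so a query reads only the partitions it needs."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["prefix"]].append(row)
    return partitions


def search(partitions, query, limit=50):
    """Intersect per-term PID sets, then rank by summed weight * term_frequency."""
    per_term = []
    for term in tokenize(query):
        scores = defaultdict(int)
        for row in partitions.get(term[:2], []):  # touch one partition per term
            if row["token"] == term:
                scores[row["pid"]] += row["weight"] * row["term_frequency"]
        per_term.append(scores)
    if not per_term:
        return []
    pids = set(per_term[0])
    for scores in per_term[1:]:
        pids &= set(scores)  # AND semantics across terms
    ranked = sorted(((sum(s[pid] for s in per_term), pid) for pid in pids),
                    key=lambda pair: (-pair[0], pair[1]))
    return [(pid, score) for score, pid in ranked[:limit]]


PARTITIONS = partition_by_prefix(build_token_rows(SAMPLES))

if __name__ == "__main__":
    print(search(PARTITIONS, "pottery Cyprus"))
    print(search(PARTITIONS, "basalt"))
```

One design consequence worth noting: this matches whole tokens, whereas the live ILIKE search matches substrings, so a query such as "pot" would stop matching "pottery" unless prefix matching is added on top.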
Relevant code/docs:
explorer.qmd: searchTerms, textSearchWhere, textSearchScore
explorer.qmd: search handler around doSearch()
query-spec.qmd: intended text MATCHES semantics
tools/build_fts_index.py: preserved DuckDB FTS spike artifact
Mismatch to resolve
query-spec.qmd currently says the web Explorer subset is label + description + place_name, but the live explorer searches only label + place_name.
Either:
[…] description, probably via sample_facets_v2.parquet or another search-optimized projection.
Measured behavior
Native DuckDB measurements against the live hosted Parquet files, run on 2026-05-08. These are not browser timings, but they show the rough scan cost.
Live-style samples_map_lite search over label + place_name:
pottery: 50 results, ~1.8s cold
basalt: 50 results, ~0.5s
sample_facets_v2 search over label + description + place_name:
pottery: 50 results, ~3.0s
cyprus: 50 results, ~0.7s
basalt: 50 results, ~0.5s
Acceptance criteria
[…] query-spec.qmd mismatch for the current live subset.
[…] pottery, pottery Cyprus, basalt, and source/facet-filtered searches
[…] % and _
Related