Sub-issue of #167 / #165. Captures concerns surfaced during self-review of #173 that were knowingly deferred rather than fixed in that PR.
Deferred items
1. Add a unit test for the isamples.search instrumentation itself
The structured log emitted by `doSearch()` claims invariants like `terms_count == searchTerms(term).length` and `elapsed_ms > 0`. Nothing currently verifies this. A future tokenizer or instrumentation refactor could silently corrupt the log without a test catching it.
Scope: a small Playwright unit test (or pytest fixture) that runs a single canonical search and asserts:
- exactly one `isamples.search` log emitted
- `terms_count` matches the JS-side `searchTerms(term).length`
- `elapsed_ms` is a positive integer
- `results_count` matches the rendered `#searchResults` count text
- `seen_urls` is a list of `{name, transfer_size, body_size}` objects
Land alongside the offline builder work in #170 since both touch tokenization correctness.
2. "Warm" semantics in the contract doc
The current perf-smoke measures "warm" as a re-run of the same query on the same page. If DuckDB-WASM caches result sets, this measures cache lookup, not query execution. The v1 contract doc (#169) should specify what "warm" actually means for the budget targets:
- `re-run-same-query warm` (current measurement) — measures end-to-end cache + render path
- `new-query-after-warm-up warm` — measures query execution after the parquet metadata is cached but the result set is fresh
- both, with separate budgets
Action: include this distinction in the SEARCH_INDEX_V1.md budget section. The benchmark in #171 should produce both numbers.
3. `searchTerms(term)` is now outside the try block
In #173 we moved `searchTerms(term)` to before the try-block so `terms.length` is available in the structured log even on early-error paths. `searchTerms` cannot throw with the current implementation, but a future tokenizer change that throws would now skip the structured log entirely. Action: wrap the tokenizer call defensively when the v1 substrate tokenizer lands in #170 (which is more likely to throw on edge inputs).
4. Bytes attribution beyond the search window
Even with the `seen_urls` list now captured per search, attribution is window-based: any fetch from `data.isamples.org` that starts during the search window is counted. A long-running fetch started by an earlier interaction could finish during the search and pollute `bytes_transfer`. Filtering to fetches whose `responseEnd` falls within the window would be tighter. Action: revisit when the v1 substrate is in place, since the substrate-fetch URLs are predictable and can be matched by name prefix instead.
Refs
#165, #167, #173
Sub-issue of #167 / #165. Captures concerns surfaced during self-review of #173 that were knowingly deferred rather than fixed in that PR.
Deferred items
1. Add a unit test for the
isamples.searchinstrumentation itselfThe structured log emitted by `doSearch()` claims invariants like `terms_count == searchTerms(term).length` and `elapsed_ms > 0`. Nothing currently verifies this. A future tokenizer or instrumentation refactor could silently corrupt the log without a test catching it.
Scope: a small Playwright unit test (or pytest fixture) that runs a single canonical search and asserts:
Land alongside the offline builder work in #170 since both touch tokenization correctness.
2. "Warm" semantics in the contract doc
The current perf-smoke measures "warm" as a re-run of the same query on the same page. If DuckDB-WASM caches result sets, this measures cache lookup, not query execution. The v1 contract doc (#169) should specify what "warm" actually means for the budget targets:
Action: include this distinction in the SEARCH_INDEX_V1.md budget section. The benchmark in #171 should produce both numbers.
3. `searchTerms(term)` is now outside the try block
In #173 we moved `searchTerms(term)` to before the try-block so `terms.length` is available in the structured log even on early-error paths. `searchTerms` cannot throw with the current implementation, but a future tokenizer change that throws would now skip the structured log entirely. Action: wrap the tokenizer call defensively when the v1 substrate tokenizer lands in #170 (which is more likely to throw on edge inputs).
4. Bytes attribution beyond the search window
Even with the `seen_urls` list now captured per search, attribution is window-based: any fetch from `data.isamples.org` that starts during the search window is counted. A long-running fetch started by an earlier interaction could finish during the search and pollute `bytes_transfer`. Filtering to fetches whose `responseEnd` falls within the window would be tighter. Action: revisit when the v1 substrate is in place, since the substrate-fetch URLs are predictable and can be matched by name prefix instead.
Refs
#165, #167, #173