Skip to content

Explorer FTS Track 1a follow-ups: deferred review concerns #174

@rdhyee

Description

@rdhyee

Sub-issue of #167 / #165. Captures concerns surfaced during self-review of #173 that were knowingly deferred rather than fixed in that PR.

Deferred items

1. Add a unit test for the isamples.search instrumentation itself

The structured log emitted by `doSearch()` claims invariants like `terms_count == searchTerms(term).length` and `elapsed_ms > 0`. Nothing currently verifies this. A future tokenizer or instrumentation refactor could silently corrupt the log without a test catching it.

Scope: a small Playwright unit test (or pytest fixture) that runs a single canonical search and asserts:

  • exactly one `isamples.search` log emitted
  • `terms_count` matches the JS-side `searchTerms(term).length`
  • `elapsed_ms` is a positive integer
  • `results_count` matches the rendered `#searchResults` count text
  • `seen_urls` is a list of `{name, transfer_size, body_size}` objects

Land alongside the offline builder work in #170 since both touch tokenization correctness.

2. "Warm" semantics in the contract doc

The current perf-smoke measures "warm" as a re-run of the same query on the same page. If DuckDB-WASM caches result sets, this measures cache lookup, not query execution. The v1 contract doc (#169) should specify what "warm" actually means for the budget targets:

  • `re-run-same-query warm` (current measurement) — measures end-to-end cache + render path
  • `new-query-after-warm-up warm` — measures query execution after the parquet metadata is cached but the result set is fresh
  • both, with separate budgets

Action: include this distinction in the SEARCH_INDEX_V1.md budget section. The benchmark in #171 should produce both numbers.

3. `searchTerms(term)` is now outside the try block

In #173 we moved `searchTerms(term)` to before the try-block so `terms.length` is available in the structured log even on early-error paths. `searchTerms` cannot throw with the current implementation, but a future tokenizer change that throws would now skip the structured log entirely. Action: wrap the tokenizer call defensively when the v1 substrate tokenizer lands in #170 (which is more likely to throw on edge inputs).

4. Bytes attribution beyond the search window

Even with the `seen_urls` list now captured per search, attribution is window-based: any fetch from `data.isamples.org` that starts during the search window is counted. A long-running fetch started by an earlier interaction could finish during the search and pollute `bytes_transfer`. Filtering to fetches whose `responseEnd` falls within the window would be tighter. Action: revisit when the v1 substrate is in place, since the substrate-fetch URLs are predictable and can be matched by name prefix instead.

Refs

#165, #167, #173

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or requestexplorerInteractive Explorer features

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions