
Explorer FTS Track 3: Offline index builder + tokenizer regression set #170

@rdhyee

Description


Updated 2026-05-08 per Codex review round 2 on #165. Concept-label join (vocab_labels.parquet) is now an explicit input + tokenization step, not an implicit consequence of #169's contract. Build-stats artifact (build_stats.json) added so coverage is empirical, not narrative.

Sub-issue of #165. Depends on #169 (search_index_v1 contract doc).

Goal

Implement the offline pipeline that builds the v1 search index from the sample-centric document projection specified in #169 / SEARCH_INDEX_V1.md. The projection joins across the property graph and emits per-virtual-field token rows.

Scope

1. Builder

  • File: tools/build_search_index.py (new; sibling to tools/build_fts_index.py from PR #95, "Improve search: multi-term AND + relevance ranking (FTS spike)", which stays as the FTS spike artifact).

  • Inputs (v1 minimum):

    • samples_map_lite.parquet → sample.label, sample.place_name[] per sample
    • sample_facets_v2.parquet → sample.description per sample, plus the URI lists for material, context, object_type
    • vocab_labels.parquet → SKOS prefLabel (en) for each URI; load-bearing: this is what makes concept.label work and what dereferences <…>/Pottery → pottery for the index
  • Document-projection step:

    1. For each sample (pid), gather the v1-minimum text fragments tagged by virtual field:
      • sample.label (1 fragment per sample)
      • sample.description (1 fragment per sample, sparse coverage)
      • sample.place_name (1+ fragments from the array)
      • concept.label (N fragments — one per material URI + one per context URI + one per object_type URI; resolve each URI through vocab_labels.parquet and emit the prefLabel; URIs without a prefLabel fall back to URI tail and a concept_label_missing_pref_label build-stat counter increments)
    2. Tokenize each fragment with the Python tokenizer (§2). Track per-(pid, field) token frequencies and doc length.
    3. Emit token rows per the #169 (search_index_v1 contract doc) §4 schema:
      { token, pid, field, tf, doc_len }
      
    4. Compute global per-token DF across all (pid, field) pairs.
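  A minimal sketch of steps 1–4, assuming a trivial whitespace tokenizer and an in-memory fragment list; `project_and_emit` and its argument shapes are illustrative, not the contract (the real builder streams from the parquet inputs and uses the §2 tokenizer):

  ```python
  from collections import Counter

  def project_and_emit(fragments, tokenize):
      """fragments: iterable of (pid, field, text) triples; returns (token_rows, df).

      Hypothetical helper; the real builder reads the parquet inputs."""
      docs = {}
      # Steps 1-2: group fragments into per-(pid, field) documents and tokenize;
      # e.g. sample.place_name may contribute several fragments to one document.
      for pid, field, text in fragments:
          docs.setdefault((pid, field), []).extend(tokenize(text))
      rows, df = [], Counter()
      for (pid, field), tokens in docs.items():
          doc_len = len(tokens)
          # Step 3: one row per distinct token in the (pid, field) document.
          for token, tf in Counter(tokens).items():
              rows.append({"token": token, "pid": pid, "field": field,
                           "tf": tf, "doc_len": doc_len})
              df[token] += 1  # Step 4: DF counted across (pid, field) pairs
      return rows, df
  ```

  The URI-tail fallback for concept.label happens upstream, when the concept.label fragments are generated from vocab_labels.parquet.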
  • Outputs:

    • Honors the per-shard byte cap from the contract (default 5 MB); automatically sub-shards high-frequency tokens.
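One way to honor the cap, sketched as greedy packing with JSON-sized rows; `shard_postings` is hypothetical, and the contract's actual shard format and naming are defined in #169:

```python
import json

SHARD_CAP_BYTES = 5 * 1024 * 1024  # contract default

def shard_postings(postings_by_token, cap=SHARD_CAP_BYTES):
    """Greedy packing sketch: postings_by_token maps token -> list of rows.
    A token whose postings alone exceed the cap is split across sub-shards."""
    shards, current, current_bytes = [], {}, 0
    for token, rows in sorted(postings_by_token.items()):
        # First split an oversized posting list into cap-sized chunks.
        chunks, chunk, chunk_bytes = [], [], 0
        for row in rows:
            size = len(json.dumps(row).encode())
            if chunk and chunk_bytes + size > cap:
                chunks.append(chunk)
                chunk, chunk_bytes = [], 0
            chunk.append(row)
            chunk_bytes += size
        chunks.append(chunk)
        # Then pack each chunk into the current shard, flushing at the cap.
        for part in chunks:
            part_bytes = sum(len(json.dumps(r).encode()) for r in part)
            if current and current_bytes + part_bytes > cap:
                shards.append(current)
                current, current_bytes = {}, 0
            current.setdefault(token, []).extend(part)
            current_bytes += part_bytes
    if current:
        shards.append(current)
    return shards
```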

2. Python tokenizer
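A minimal sketch of the tokenizer's shape only; the actual normalization rules (diacritic folding, hyphen handling, the length-filter cutoff) are exactly what the §3 regression set pins down, and `MAX_TOKEN_LEN` here is a hypothetical value:

```python
import re
import unicodedata

MAX_TOKEN_LEN = 64  # hypothetical length-filter edge, not the contract value

def tokenize(text: str) -> list[str]:
    # NFC-normalize and lowercase, then split on runs of
    # non-alphanumeric characters (hyphens, punctuation, whitespace).
    text = unicodedata.normalize("NFC", text).lower()
    tokens = re.findall(r"[^\W_]+", text)
    return [t for t in tokens if len(t) <= MAX_TOKEN_LEN]
```

Under these assumed rules, `Iron-Age` splits into two tokens and diacritics are preserved (`Çatalhöyük` → `çatalhöyük`); whether the real tokenizer folds diacritics instead is a decision the regression set must record.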

3. Tokenizer regression set

  • File: tests/search_tokenizer_regression.json.
  • ≥ 30 strings covering:
    • diacritics: Çatalhöyük, Köln, São Paulo
    • mixed case: MaterialSampleRecord, iSamples
    • hyphenated compounds: Iron-Age, co-located
    • archaeological place names with whitespace + punctuation
    • IGSN-style ids: IGSN:HRV000ABC
    • numeric: 1965, 2.5kg
    • empty / whitespace-only
    • very long strings (length-filter edge)
    • the strings pottery, ceramic, bone, mammal, marine, basalt as plain inputs (the regression set proves the tokenizer handles these correctly; URI dereferencing is proved separately in §4 builder E2E, where the test fixture maps known URIs to these labels and asserts the token rows appear)
  • Each entry: { "input": "...", "expected_tokens": ["..."] }.

4. Tests

  • tests/test_search_tokenizer.py — Python tokenizer against regression set. Scope: tokenizer only. Does NOT prove URI dereferencing.
  • A JS port of the tokenizer (small, ~15 lines) lives in assets/js/search_tokenizer.js; a Node-based parity check (tests/test_search_tokenizer_js.spec.js or similar) runs the same regression set against the JS implementation. CI runs both.
  • tests/test_search_index_builder.py — end-to-end builder fixture, scope: URI dereferencing + projection + shard structure. Build a tiny corpus (10 docs) where:
    • At least 3 docs have material URIs that map to known prefLabels in a fixture vocab_labels table (e.g., <test://Pottery>"Pottery", <test://Ceramic>"Ceramic", <test://Bone>"Bone").
    • At least 1 doc has a material URI with no prefLabel (verify URI-tail fallback + concept_label_missing_pref_label counter increments in build_stats.json).
    • Asserts that searching the resulting substrate for pottery returns exactly the pid(s) whose material is <test://Pottery> — the actual proof that URI dereferencing works end-to-end.
    • Also asserts shard structure, DF/doc_len values, and lookup of known tokens returns expected pids per virtual field.
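The load-bearing assertion can be sketched like this; the fixture values and the hand-rolled mini index are illustrative only (the real test drives tools/build_search_index.py over a fixture corpus and reads the emitted shards and build_stats.json):

```python
def test_pottery_dereference_sketch():
    vocab = {"test://Pottery": "Pottery", "test://Ceramic": "Ceramic"}
    docs = {"p1": ["test://Pottery"], "p2": ["test://Ceramic"],
            "p3": ["test://Unlabeled"]}  # no prefLabel -> URI-tail fallback
    missing = 0
    index = {}  # token -> set of pids, for the concept.label field
    for pid, uris in docs.items():
        for uri in uris:
            label = vocab.get(uri)
            if label is None:
                label = uri.rsplit("/", 1)[-1]  # URI-tail fallback
                missing += 1
            index.setdefault(label.lower(), set()).add(pid)
    assert index["pottery"] == {"p1"}    # dereference works end-to-end
    assert index["unlabeled"] == {"p3"}  # fallback token is still indexed
    assert missing == 1                  # concept_label_missing_pref_label
```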

5. CI

  • Add a workflow step that runs both regression tests on every PR. Tokenizer divergence between Python and JS is a hard fail.

6. Build-stats artifact

The build emits <output_dir>/build_stats.json, which turns coverage from a doc claim into an empirical artifact that future runs can be regressed against:

{
  "data_version": "isamples_202601",
  "built_at_utc": "2026-MM-DDTHH:MM:SSZ",
  "total_samples": 6700000,
  "fields": {
    "sample.label":       { "samples_with_field": 6680000, "total_tokens": 12345678, "avg_doc_len": 3.2 },
    "sample.description": { "samples_with_field": 1610000, "total_tokens": 89012345, "avg_doc_len": 41.5 },
    "sample.place_name":  { "samples_with_field": 2210000, "total_tokens": 4567890,  "avg_doc_len": 4.1 },
    "concept.label":      { "samples_with_field": 6650000, "total_tokens": 8901234,  "avg_doc_len": 3.0 }
  },
  "concept_label_uri_resolution": {
    "material_resolved":     0.97,
    "material_missing_pref": 0.03,
    "context_resolved":      0.95,
    "context_missing_pref":  0.05,
    "object_type_resolved":  0.99,
    "object_type_missing_pref": 0.01
  },
  "shard_count": 64,
  "shard_max_size_mb": 4.7,
  "total_bytes_uncompressed": 234567890,
  "build_seconds": 312.5,
  "top_df_tokens": [["the", 5800000], ["of", 4200000], ...]
}
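The coverage and resolution gates in the acceptance criteria that follow can be checked mechanically against this artifact. A sketch, with key names taken from the example above (`check_acceptance` is a hypothetical helper, not part of the builder):

```python
def check_acceptance(stats, threshold=0.90):
    """Return a list of failure messages; empty list means the gates pass."""
    failures = []
    total = stats["total_samples"]
    # Gate 1: concept.label coverage >= 90% of total_samples.
    cov = stats["fields"]["concept.label"]["samples_with_field"] / total
    if cov < threshold:
        failures.append(f"concept.label coverage {cov:.2%} < {threshold:.0%}")
    # Gate 2: each facet's URI->prefLabel resolution rate >= 90%.
    res = stats["concept_label_uri_resolution"]
    for facet in ("material", "context", "object_type"):
        if res[f"{facet}_resolved"] < threshold:
            failures.append(f"{facet} resolution below {threshold:.0%}")
    return failures
```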

Acceptance

  • Builder produces a v1 index from a small test corpus that round-trips against the #169 (search_index_v1 contract doc) §4 schema
  • Concept-label coverage: samples_with_field for concept.label ≥ 90% of total_samples (verifies the URI dereferencing actually works against real data, not just unit-test corpus)
  • Concept-label resolution rate: ≥ 90% of facet URIs resolve to a SKOS prefLabel (en); URIs missing prefLabel fall back to URI tail with the build-stat counter incrementing
  • Python and JS tokenizers produce identical output for every string in the regression set
  • CI fails if the two diverge
  • build_stats.json artifact emitted and committed alongside the PR
  • Build-time stats summary in PR description (lift the headlines from build_stats.json)
  • PR merged to main

Out of scope

Refs

#165, #169, #171

Metadata


    Labels

    enhancement (New feature or request), explorer (Interactive Explorer features)
