
Explorer FTS Track 3: Offline index builder + tokenizer regression set #170

@rdhyee

Description


Updated 2026-05-08 per Codex review round 2 on #165. Concept-label join (vocab_labels.parquet) is now an explicit input + tokenization step, not an implicit consequence of #169's contract. Build-stats artifact (build_stats.json) added so coverage is empirical, not narrative.

Sub-issue of #165. Depends on #169 (search_index_v1 contract doc).

Goal

Implement the offline pipeline that builds the v1 search index from the sample-centric document projection specified in #169 / SEARCH_INDEX_V1.md. The projection joins across the property graph and emits per-virtual-field token rows.

Scope

1. Builder

  • File: tools/build_search_index.py (new; sibling to tools/build_fts_index.py from PR #95, "Improve search: multi-term AND + relevance ranking (FTS spike)", which stays as the FTS spike artifact).

  • Inputs (v1 minimum):

    • samples_map_lite.parquet → sample.label, sample.place_name[] per sample
    • sample_facets_v2.parquet → sample.description per sample, plus the URI lists for material, context, object_type
    • vocab_labels.parquet → SKOS prefLabel (en) for each URI; load-bearing: this is what makes concept.label work and what dereferences <…>/Pottery → pottery for the index
  • Document-projection step:

    1. For each sample (pid), gather the v1-minimum text fragments tagged by virtual field:
      • sample.label (1 fragment per sample)
      • sample.description (1 fragment per sample, sparse coverage)
      • sample.place_name (1+ fragments from the array)
      • concept.label (N fragments — one per material URI + one per context URI + one per object_type URI; resolve each URI through vocab_labels.parquet and emit the prefLabel; URIs without a prefLabel fall back to URI tail and a concept_label_missing_pref_label build-stat counter increments)
    2. Tokenize each fragment with the Python tokenizer (§2). Track per-(pid, field) token frequencies and doc length.
    3. Emit token rows per the #169 (search_index_v1 contract doc) §4 schema:
      { token, pid, field, tf, doc_len }
      
    4. Compute global per-token DF across all (pid, field) pairs.
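  A minimal sketch of steps 1–4, assuming a trivial whitespace tokenizer and an in-memory fragment list; `project_and_emit` and its argument shapes are illustrative, not the contract (the real builder streams from the parquet inputs and uses the §2 tokenizer):

  ```python
  from collections import Counter

  def project_and_emit(fragments, tokenize):
      """fragments: iterable of (pid, field, text) triples; returns (token_rows, df).

      Hypothetical helper; the real builder reads the parquet inputs."""
      docs = {}
      # Steps 1-2: group fragments into per-(pid, field) documents and tokenize;
      # e.g. sample.place_name may contribute several fragments to one document.
      for pid, field, text in fragments:
          docs.setdefault((pid, field), []).extend(tokenize(text))
      rows, df = [], Counter()
      for (pid, field), tokens in docs.items():
          doc_len = len(tokens)
          # Step 3: one row per distinct token in the (pid, field) document.
          for token, tf in Counter(tokens).items():
              rows.append({"token": token, "pid": pid, "field": field,
                           "tf": tf, "doc_len": doc_len})
              df[token] += 1  # Step 4: DF counted across (pid, field) pairs
      return rows, df
  ```

  The URI-tail fallback for concept.label happens upstream, when the concept.label fragments are generated from vocab_labels.parquet.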
  • Outputs:

    • Honors the per-shard byte cap from the contract (default 5 MB); automatically sub-shards high-frequency tokens.
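One way to honor the cap, sketched as greedy packing with JSON-sized rows; `shard_postings` is hypothetical, and the contract's actual shard format and naming are defined in #169:

```python
import json

SHARD_CAP_BYTES = 5 * 1024 * 1024  # contract default

def shard_postings(postings_by_token, cap=SHARD_CAP_BYTES):
    """Greedy packing sketch: postings_by_token maps token -> list of rows.
    A token whose postings alone exceed the cap is split across sub-shards."""
    shards, current, current_bytes = [], {}, 0
    for token, rows in sorted(postings_by_token.items()):
        # First split an oversized posting list into cap-sized chunks.
        chunks, chunk, chunk_bytes = [], [], 0
        for row in rows:
            size = len(json.dumps(row).encode())
            if chunk and chunk_bytes + size > cap:
                chunks.append(chunk)
                chunk, chunk_bytes = [], 0
            chunk.append(row)
            chunk_bytes += size
        chunks.append(chunk)
        # Then pack each chunk into the current shard, flushing at the cap.
        for part in chunks:
            part_bytes = sum(len(json.dumps(r).encode()) for r in part)
            if current and current_bytes + part_bytes > cap:
                shards.append(current)
                current, current_bytes = {}, 0
            current.setdefault(token, []).extend(part)
            current_bytes += part_bytes
    if current:
        shards.append(current)
    return shards
```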

2. Python tokenizer
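A minimal sketch of the tokenizer's shape only; the actual normalization rules (diacritic folding, hyphen handling, the length-filter cutoff) are exactly what the §3 regression set pins down, and `MAX_TOKEN_LEN` here is a hypothetical value:

```python
import re
import unicodedata

MAX_TOKEN_LEN = 64  # hypothetical length-filter edge, not the contract value

def tokenize(text: str) -> list[str]:
    # NFC-normalize and lowercase, then split on runs of
    # non-alphanumeric characters (hyphens, punctuation, whitespace).
    text = unicodedata.normalize("NFC", text).lower()
    tokens = re.findall(r"[^\W_]+", text)
    return [t for t in tokens if len(t) <= MAX_TOKEN_LEN]
```

Under these assumed rules, `Iron-Age` splits into two tokens and diacritics are preserved (`Çatalhöyük` → `çatalhöyük`); whether the real tokenizer folds diacritics instead is a decision the regression set must record.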

3. Tokenizer regression set

  • File: tests/search_tokenizer_regression.json.
  • ≥ 30 strings covering:
    • diacritics: Çatalhöyük, Köln, São Paulo
    • mixed case: MaterialSampleRecord, iSamples
    • hyphenated compounds: Iron-Age, co-located
    • archaeological place names with whitespace + punctuation
    • IGSN-style ids: IGSN:HRV000ABC
    • numeric: 1965, 2.5kg
    • empty / whitespace-only
    • very long strings (length-filter edge)
    • the strings pottery, ceramic, bone, mammal, marine, basalt as plain inputs (the regression set proves the tokenizer handles these correctly; URI dereferencing is proved separately in §4 builder E2E, where the test fixture maps known URIs to these labels and asserts the token rows appear)
  • Each entry: { "input": "...", "expected_tokens": ["..."] }.

4. Tests

  • tests/test_search_tokenizer.py — Python tokenizer against regression set. Scope: tokenizer only. Does NOT prove URI dereferencing.
  • A JS port of the tokenizer (small, ~15 lines) lives in assets/js/search_tokenizer.js; a Node-based parity check (tests/test_search_tokenizer_js.spec.js or similar) runs the same regression set against the JS implementation. CI runs both.
  • tests/test_search_index_builder.py — end-to-end builder fixture, scope: URI dereferencing + projection + shard structure. Build a tiny corpus (10 docs) where:
    • At least 3 docs have material URIs that map to known prefLabels in a fixture vocab_labels table (e.g., <test://Pottery>"Pottery", <test://Ceramic>"Ceramic", <test://Bone>"Bone").
    • At least 1 doc has a material URI with no prefLabel (verify URI-tail fallback + concept_label_missing_pref_label counter increments in build_stats.json).
    • Asserts that searching the resulting substrate for pottery returns exactly the pid(s) whose material is <test://Pottery> — the actual proof that URI dereferencing works end-to-end.
    • Also asserts shard structure, DF/doc_len values, and lookup of known tokens returns expected pids per virtual field.
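The load-bearing assertion can be sketched like this; the fixture values and the hand-rolled mini index are illustrative only (the real test drives tools/build_search_index.py over a fixture corpus and reads the emitted shards and build_stats.json):

```python
def test_pottery_dereference_sketch():
    vocab = {"test://Pottery": "Pottery", "test://Ceramic": "Ceramic"}
    docs = {"p1": ["test://Pottery"], "p2": ["test://Ceramic"],
            "p3": ["test://Unlabeled"]}  # no prefLabel -> URI-tail fallback
    missing = 0
    index = {}  # token -> set of pids, for the concept.label field
    for pid, uris in docs.items():
        for uri in uris:
            label = vocab.get(uri)
            if label is None:
                label = uri.rsplit("/", 1)[-1]  # URI-tail fallback
                missing += 1
            index.setdefault(label.lower(), set()).add(pid)
    assert index["pottery"] == {"p1"}    # dereference works end-to-end
    assert index["unlabeled"] == {"p3"}  # fallback token is still indexed
    assert missing == 1                  # concept_label_missing_pref_label
```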

5. CI

  • Add a workflow step that runs both regression tests on every PR. Tokenizer divergence between Python and JS is a hard fail.

6. Build-stats artifact

The build emits <output_dir>/build_stats.json, which turns coverage from a doc claim into an empirical artifact that future runs can be regressed against:

{
  "data_version": "isamples_202601",
  "built_at_utc": "2026-MM-DDTHH:MM:SSZ",
  "total_samples": 6700000,
  "fields": {
    "sample.label":       { "samples_with_field": 6680000, "total_tokens": 12345678, "avg_doc_len": 3.2 },
    "sample.description": { "samples_with_field": 1610000, "total_tokens": 89012345, "avg_doc_len": 41.5 },
    "sample.place_name":  { "samples_with_field": 2210000, "total_tokens": 4567890,  "avg_doc_len": 4.1 },
    "concept.label":      { "samples_with_field": 6650000, "total_tokens": 8901234,  "avg_doc_len": 3.0 }
  },
  "concept_label_uri_resolution": {
    "material_resolved":     0.97,
    "material_missing_pref": 0.03,
    "context_resolved":      0.95,
    "context_missing_pref":  0.05,
    "object_type_resolved":  0.99,
    "object_type_missing_pref": 0.01
  },
  "shard_count": 64,
  "shard_max_size_mb": 4.7,
  "total_bytes_uncompressed": 234567890,
  "build_seconds": 312.5,
  "top_df_tokens": [["the", 5800000], ["of", 4200000], ...]
}
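The coverage and resolution gates in the acceptance criteria that follow can be checked mechanically against this artifact. A sketch, with key names taken from the example above (`check_acceptance` is a hypothetical helper, not part of the builder):

```python
def check_acceptance(stats, threshold=0.90):
    """Return a list of failure messages; empty list means the gates pass."""
    failures = []
    total = stats["total_samples"]
    # Gate 1: concept.label coverage >= 90% of total_samples.
    cov = stats["fields"]["concept.label"]["samples_with_field"] / total
    if cov < threshold:
        failures.append(f"concept.label coverage {cov:.2%} < {threshold:.0%}")
    # Gate 2: each facet's URI->prefLabel resolution rate >= 90%.
    res = stats["concept_label_uri_resolution"]
    for facet in ("material", "context", "object_type"):
        if res[f"{facet}_resolved"] < threshold:
            failures.append(f"{facet} resolution below {threshold:.0%}")
    return failures
```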

Acceptance

  • Builder produces a v1 index from a small test corpus that round-trips against the #169 (search_index_v1 contract doc) §4 schema
  • Concept-label coverage: samples_with_field for concept.label ≥ 90% of total_samples (verifies the URI dereferencing actually works against real data, not just unit-test corpus)
  • Concept-label resolution rate: ≥ 90% of facet URIs resolve to a SKOS prefLabel (en); URIs missing prefLabel fall back to URI tail with the build-stat counter incrementing
  • Python and JS tokenizers produce identical output for every string in the regression set
  • CI fails if the two diverge
  • build_stats.json artifact emitted and committed alongside the PR
  • Build-time stats summary in PR description (lift the headlines from build_stats.json)
  • PR merged to main

Out of scope

Refs

#165, #169, #171

Metadata


    Labels

    enhancement (New feature or request), explorer (Interactive Explorer features)
