Updated 2026-05-08 per Codex review round 2 on #165. Concept-label join (vocab_labels.parquet) is now an explicit input + tokenization step, not an implicit consequence of #169's contract. Build-stats artifact (build_stats.json) added so coverage is empirical, not narrative.
Sub-issue of #165. Depends on #169 (search_index_v1 contract doc).
Goal
Implement the offline pipeline that builds the v1 search index from the sample-centric document projection specified in #169 / SEARCH_INDEX_V1.md. The projection joins across the property graph and emits per-virtual-field token rows.
Scope
1. Builder
File: tools/build_search_index.py (new; sibling to tools/build_fts_index.py from PR #95, "Improve search: multi-term AND + relevance ranking (FTS spike)", which stays as the FTS spike artifact).
Inputs (v1 minimum):
samples_map_lite.parquet → sample.label, sample.place_name[] per sample
sample_facets_v2.parquet → sample.description per sample, plus the URI lists for material, context, object_type
vocab_labels.parquet → SKOS prefLabel (en) for each URI; load-bearing: this is what makes concept.label work and what dereferences <…>/Pottery → pottery for the index
Document-projection step:
For each sample (pid), gather the v1-minimum text fragments tagged by virtual field:
sample.label (1 fragment per sample)
sample.description (1 fragment per sample, sparse coverage)
sample.place_name (1+ fragments from the array)
concept.label (N fragments — one per material URI + one per context URI + one per object_type URI; resolve each URI through vocab_labels.parquet and emit the prefLabel; URIs without a prefLabel fall back to URI tail and a concept_label_missing_pref_label build-stat counter increments)
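The concept.label resolution described above can be sketched as follows. This is an illustrative sketch, not the builder implementation: `resolve_concept_label`, `pref_labels`, and `stats` are hypothetical names, and the lookup table stands in for the real vocab_labels.parquet join.

```python
def resolve_concept_label(uri, pref_labels, stats):
    """Resolve a facet URI to its SKOS prefLabel (en); fall back to the URI tail.

    pref_labels: dict standing in for vocab_labels.parquet (URI -> prefLabel).
    stats: mutable dict of build-stat counters.
    """
    label = pref_labels.get(uri)
    if label is not None:
        return label
    # Miss: increment the build-stat counter and fall back to the URI's last path segment.
    stats["concept_label_missing_pref_label"] = (
        stats.get("concept_label_missing_pref_label", 0) + 1)
    return uri.rstrip("/>").rsplit("/", 1)[-1].lstrip("<")

stats = {}
labels = {"<test://Pottery>": "Pottery"}
print(resolve_concept_label("<test://Pottery>", labels, stats))  # Pottery
print(resolve_concept_label("<test://Basalt>", labels, stats))   # Basalt (URI-tail fallback)
print(stats)  # {'concept_label_missing_pref_label': 1}
```

The counter lands in build_stats.json (§6), so fallback frequency is observable per build rather than silent.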
Tokenize each fragment with the Python tokenizer (§2). Track per-(pid, field) token frequencies and doc length.
Outputs:
<output_dir>/isamples_YYYYMM_search_index_v1/
df.parquet with global token DF
build_stats.json (see §6 below)
Honors the per-shard byte cap from the contract (default 5 MB); sub-shards high-frequency tokens automatically.
2. Python tokenizer
File: tools/search_tokenizer.py.
Pure-stdlib (unicodedata); no third-party dependencies beyond what's in pyproject.toml already.
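A minimal sketch of what such a pure-stdlib tokenizer could look like. The fold/split rules shown (camelCase splitting, NFKD diacritic folding, non-alphanumeric boundaries, a length cap) are assumptions for illustration; the regression set in §3, not this sketch, is authoritative.

```python
import re
import unicodedata

def tokenize(text, max_token_len=40):
    """Sketch: split camelCase, fold diacritics, lowercase, split on
    non-alphanumerics, drop empty and over-long tokens."""
    # Split camelCase boundaries (MaterialSampleRecord -> Material Sample Record).
    text = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", text)
    # Fold diacritics: NFKD-decompose, drop combining marks (Çatalhöyük -> Catalhoyuk).
    text = "".join(c for c in unicodedata.normalize("NFKD", text)
                   if not unicodedata.combining(c))
    # Lowercase, then split on any run that is not a letter or digit
    # (hyphens, colons, and punctuation all become token boundaries).
    tokens = re.split(r"[^a-z0-9]+", text.lower())
    return [t for t in tokens if t and len(t) <= max_token_len]

print(tokenize("Çatalhöyük Iron-Age pottery"))  # ['catalhoyuk', 'iron', 'age', 'pottery']
print(tokenize("IGSN:HRV000ABC"))               # ['igsn', 'hrv000', 'abc']
```

Note the digit-to-uppercase boundary in the camelCase rule also splits IGSN-style ids; whether that is desired is exactly the kind of decision the regression set should pin down.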
3. Tokenizer regression set
File: tests/search_tokenizer_regression.json.
≥ 30 strings covering:
diacritics: Çatalhöyük, Köln, São Paulo
mixed case: MaterialSampleRecord, iSamples
hyphenated compounds: Iron-Age, co-located
archaeological place names with whitespace + punctuation
IGSN-style ids: IGSN:HRV000ABC
numeric: 1965, 2.5kg
empty / whitespace-only
very long strings (length-filter edge)
the strings pottery, ceramic, bone, mammal, marine, basalt as plain inputs (the regression set proves the tokenizer handles these correctly; URI dereferencing is proved separately in §4 builder E2E, where the test fixture maps known URIs to these labels and asserts the token rows appear)
Each entry: { "input": "...", "expected_tokens": ["..."] }.
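A sketch of how the regression file could drive the Python test. The harness shape (`run_regression`, the demo file name, and the trivial stand-in `tokenize`) is hypothetical; the real test would import the function from tools/search_tokenizer.py.

```python
import json
from pathlib import Path

# Stand-in tokenizer; the real test imports from tools/search_tokenizer.py.
def tokenize(text):
    return [t for t in text.lower().split() if t]

def run_regression(path):
    """Return human-readable failures; an empty list means every case passed."""
    cases = json.loads(Path(path).read_text(encoding="utf-8"))
    failures = []
    for case in cases:
        got = tokenize(case["input"])
        if got != case["expected_tokens"]:
            failures.append(f'{case["input"]!r}: got {got}, want {case["expected_tokens"]}')
    return failures

fixture = [{"input": "bone mammal", "expected_tokens": ["bone", "mammal"]}]
Path("regression_demo.json").write_text(json.dumps(fixture), encoding="utf-8")
print(run_regression("regression_demo.json"))  # [] -> every case passed
```

Because the JS parity check (§4) consumes the same JSON file, a single fixture guards both implementations.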
4. Tests
tests/test_search_tokenizer.py — Python tokenizer against regression set. Scope: tokenizer only. Does NOT prove URI dereferencing.
A JS port of the tokenizer (small, ~15 lines) lives in assets/js/search_tokenizer.js; a Node-based parity check (tests/test_search_tokenizer_js.spec.js or similar) runs the same regression set against the JS implementation. CI runs both.
tests/test_search_index_builder.py — end-to-end builder fixture, scope: URI dereferencing + projection + shard structure. Build a tiny corpus (10 docs) where:
At least 3 docs have material URIs that map to known prefLabels in a fixture vocab_labels table (e.g., <test://Pottery> → "Pottery", <test://Ceramic> → "Ceramic", <test://Bone> → "Bone").
At least 1 doc has a material URI with no prefLabel (verify URI-tail fallback + concept_label_missing_pref_label counter increments in build_stats.json).
Asserts that searching the resulting substrate for pottery returns exactly the pid(s) whose material is <test://Pottery> — the actual proof that URI dereferencing works end-to-end.
Also asserts shard structure, DF/doc_len values, and lookup of known tokens returns expected pids per virtual field.
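The core E2E assertion can be sketched with a plain dict standing in for the sharded substrate. The fixture URIs and labels follow the issue text above; the in-memory index and variable names are illustrative, not the builder's actual data structures.

```python
# Fixture: vocab table and per-pid material URIs, as described in §4.
vocab_labels = {"<test://Pottery>": "Pottery",
                "<test://Ceramic>": "Ceramic",
                "<test://Bone>": "Bone"}
docs = {"pid:1": ["<test://Pottery>"],
        "pid:2": ["<test://Ceramic>"],
        "pid:3": ["<test://Pottery>", "<test://Bone>"],
        "pid:4": ["<test://Unlabeled>"]}  # exercises the URI-tail fallback

stats = {"concept_label_missing_pref_label": 0}
index = {}  # token -> set of pids, concept.label field only
for pid, uris in docs.items():
    for uri in uris:
        label = vocab_labels.get(uri)
        if label is None:
            stats["concept_label_missing_pref_label"] += 1
            label = uri.rstrip(">").rsplit("/", 1)[-1]
        index.setdefault(label.lower(), set()).add(pid)

# Searching "pottery" must return exactly the pids whose material is <test://Pottery>.
assert index["pottery"] == {"pid:1", "pid:3"}
assert stats["concept_label_missing_pref_label"] == 1
assert "unlabeled" in index  # the fallback token made it into the index
print("E2E dereferencing assertions passed")
```

The real test additionally asserts shard structure and DF/doc_len values against the emitted files.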
5. CI
Add a workflow step that runs both regression tests on every PR. Tokenizer divergence between Python and JS is a hard fail.
6. Build-stats artifact
The build emits <output_dir>/build_stats.json, turning coverage from a doc claim into an empirical artifact that future runs can be regressed against. Example shape:
{
  "data_version": "isamples_202601",
  "built_at_utc": "2026-MM-DDTHH:MM:SSZ",
  "total_samples": 6700000,
  "fields": {
    "sample.label": { "samples_with_field": 6680000, "total_tokens": 12345678, "avg_doc_len": 3.2 },
    "sample.description": { "samples_with_field": 1610000, "total_tokens": 89012345, "avg_doc_len": 41.5 },
    "sample.place_name": { "samples_with_field": 2210000, "total_tokens": 4567890, "avg_doc_len": 4.1 },
    "concept.label": { "samples_with_field": 6650000, "total_tokens": 8901234, "avg_doc_len": 3.0 }
  },
  "concept_label_uri_resolution": {
    "material_resolved": 0.97, "material_missing_pref": 0.03,
    "context_resolved": 0.95, "context_missing_pref": 0.05,
    "object_type_resolved": 0.99, "object_type_missing_pref": 0.01
  },
  "shard_count": 64,
  "shard_max_size_mb": 4.7,
  "total_bytes_uncompressed": 234567890,
  "build_seconds": 312.5,
  "top_df_tokens": [["the", 5800000], ["of", 4200000], ...]
}
Acceptance
Concept-label coverage: samples_with_field for concept.label ≥ 90% of total_samples (verifies the URI dereferencing actually works against real data, not just the unit-test corpus)
Concept-label resolution rate: ≥ 90% of facet URIs resolve to a SKOS prefLabel (en); URIs missing prefLabel fall back to URI tail with the build-stat counter incrementing
Python and JS tokenizers produce identical output for every string in the regression set
CI fails if the two diverge
build_stats.json artifact emitted and committed alongside the PR
Build-time stats summary in PR description (lift the headlines from build_stats.json)
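The threshold criteria above can be checked mechanically against the emitted artifact. A sketch, assuming the build_stats.json field names from the §6 example (`check_acceptance` and the demo dict are hypothetical):

```python
def check_acceptance(stats):
    """Return acceptance-threshold violations for a build_stats.json dict;
    an empty list means the thresholds are met."""
    problems = []
    total = stats["total_samples"]
    covered = stats["fields"]["concept.label"]["samples_with_field"]
    if covered < 0.9 * total:
        problems.append(f"concept.label coverage {covered / total:.1%} < 90%")
    res = stats["concept_label_uri_resolution"]
    for facet in ("material", "context", "object_type"):
        if res[f"{facet}_resolved"] < 0.9:
            problems.append(f"{facet} prefLabel resolution below 90%")
    return problems

demo = {"total_samples": 6700000,
        "fields": {"concept.label": {"samples_with_field": 6650000}},
        "concept_label_uri_resolution": {"material_resolved": 0.97,
                                         "context_resolved": 0.95,
                                         "object_type_resolved": 0.99}}
print(check_acceptance(demo))  # [] -> thresholds met
```

Wiring this into CI makes a coverage regression a hard PR failure rather than something caught by reading the stats summary.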
Out of scope
Deployment to data.isamples.org (separate ops task once builder is stable)
Additional virtual fields (event.*, site.*, agent.*, curation.*): additive, no schema change, separate issue when the time comes
Refs
#165, #169, #171