feat(search): Qdrant vector store for semantic search #3037
Open
flash7777 wants to merge 3 commits into
When open_taki is detected, IndexSpace uses a configurable worker pool for parallel extraction instead of sequential processing. This leverages the LLM backend's batch capacity (vLLM max-num-seqs=16).

- 8 workers by default (configurable via SEARCH_EXTRACTOR_TIKA_MAX_WORKERS)
- Only active with open_taki v2 (classic Tika stays sequential)
- Workers use direct upsert (no batch, thread-safe)
- IsTaki() exported on Tika extractor for runtime detection

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Store document embeddings from open_taki v2 in Qdrant alongside the keyword index (bleve/opensearch). Enables semantic search over all indexed documents.

- VectorStore config: SEARCH_VECTOR_ENABLED, SEARCH_VECTOR_URL, SEARCH_VECTOR_COLLECTION
- Qdrant REST client (upsert, search, auto-create collection)
- Embedding + metadata stored per document (name, title, path, summary, entities)
- Graceful: disabled by default, no impact when Qdrant unavailable
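For orientation, an upsert against Qdrant's REST API (`PUT /collections/{name}/points`) could be built along these lines. This is a sketch under assumptions: the `point` type, `newUpsertRequest`, the payload keys, and the example ID are illustrative and not taken from `qdrant/client.go`. The URL and collection name reuse the defaults mentioned in this PR.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// point mirrors the shape Qdrant expects for a single point.
// Qdrant point IDs must be unsigned integers or UUID strings.
type point struct {
	ID      string                 `json:"id"`
	Vector  []float32              `json:"vector"`
	Payload map[string]interface{} `json:"payload"`
}

type upsertBody struct {
	Points []point `json:"points"`
}

// newUpsertRequest (hypothetical helper) builds the HTTP request for
// upserting one point; it does not send it.
func newUpsertRequest(baseURL, collection string, p point) (*http.Request, error) {
	body, err := json.Marshal(upsertBody{Points: []point{p}})
	if err != nil {
		return nil, err
	}
	endpoint := fmt.Sprintf("%s/collections/%s/points?wait=true",
		baseURL, url.PathEscape(collection))
	req, err := http.NewRequest(http.MethodPut, endpoint, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	p := point{
		ID:     "5c56c793-69f3-4fbf-87e6-c4bf54c28c26", // example UUID
		Vector: []float32{0.12, -0.03, 0.88},           // toy embedding
		Payload: map[string]interface{}{
			"name": "report.pdf",
			"path": "/docs/report.pdf",
		},
	}
	req, err := newUpsertRequest("http://localhost:6333", "opencloud", p)
	if err != nil {
		panic(err)
	}
	// Sending would be http.DefaultClient.Do(req); omitted so the
	// sketch runs without a Qdrant instance.
	fmt.Println(req.Method, req.URL.String())
}
```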
…sults

Freetext queries automatically search both Bleve (keyword) and Qdrant (semantic). Results are merged and deduplicated. Structured queries (name:, tag:, mtime:, etc.) go to Bleve only.

- isFreetext() detects query type by checking for field prefixes
- GetEmbedding() on Tika extractor: sends query to open_taki for embedding
- searchVector(): Qdrant search with score threshold (0.3)
- Results merged: Qdrant hits added if not already in Bleve results
- Stat each Qdrant hit to verify access permissions
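The routing and merge logic described above might look roughly like the following. This is an illustrative sketch only: the `hit` type, `mergeHits`, and the prefix list are assumptions, and the per-hit permission check (Stat) is left out.

```go
package main

import (
	"fmt"
	"strings"
)

// fieldPrefixes is an assumed, incomplete list of structured-query fields.
var fieldPrefixes = []string{"name:", "tag:", "mtime:"}

// isFreetext reports whether no token of the query starts with a
// known field prefix.
func isFreetext(query string) bool {
	for _, tok := range strings.Fields(strings.ToLower(query)) {
		for _, p := range fieldPrefixes {
			if strings.HasPrefix(tok, p) {
				return false
			}
		}
	}
	return true
}

// hit is a hypothetical search result shared by both backends.
type hit struct {
	ID    string
	Score float64
}

// mergeHits keeps all keyword hits and appends vector hits that reach
// the score threshold and are not already present.
func mergeHits(keyword, vector []hit, threshold float64) []hit {
	seen := make(map[string]struct{}, len(keyword))
	merged := make([]hit, 0, len(keyword)+len(vector))
	for _, h := range keyword {
		seen[h.ID] = struct{}{}
		merged = append(merged, h)
	}
	for _, h := range vector {
		if h.Score < threshold {
			continue // too weak a semantic match
		}
		if _, dup := seen[h.ID]; dup {
			continue // already returned by the keyword index
		}
		seen[h.ID] = struct{}{}
		merged = append(merged, h)
	}
	return merged
}

func main() {
	keyword := []hit{{ID: "doc1", Score: 1.0}}
	vector := []hit{
		{ID: "doc1", Score: 0.9}, // duplicate of a keyword hit
		{ID: "doc2", Score: 0.5}, // new semantic hit
		{ID: "doc3", Score: 0.1}, // below the 0.3 threshold
	}
	if isFreetext("quarterly revenue forecast") {
		for _, h := range mergeHits(keyword, vector, 0.3) {
			fmt.Println(h.ID, h.Score)
		}
	}
}
```

Keeping keyword hits first preserves the existing result order; vector hits only extend the list.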
Up to standards
| Metric | Results |
|---|---|
| Complexity | 12 |
| Duplication | 2 |
Summary
Optional Qdrant integration for semantic (vector) search alongside the existing keyword index (Bleve/OpenSearch). When enabled, document embeddings from open_taki v2 are stored in Qdrant and freetext queries automatically search both backends.
Depends on PR #3036 (open_taki v2 protocol support).
How it works
- Structured queries (name:, tag:, mtime:): Bleve only, no vector overhead

Config
Or YAML:
Changes
- qdrant/client.go: Lightweight REST client (upsert, search, auto-create collection)
- config/content.go: VectorStore config (enabled, url, collection)
- config/config.go: Vector field in main config
- config/defaults/defaultconfig.go: Defaults (disabled, localhost:6333, "opencloud")
- content/tika.go: GetEmbedding() for query-time embedding via open_taki
- search/service.go: Qdrant ingest in doUpsertItem + semantic search merge in Search()

Backward compatible
Tested
Deployed on cloud.brandis.eu: 34+ documents in Qdrant, 0 upsert errors.