Merged
27 changes: 23 additions & 4 deletions README.md
@@ -17,8 +17,11 @@ A progressive RAG system built from first principles -- from raw embeddings and
### Search

1. **Embeds** the query using the same model
2. **Queries** Chroma for the top-K nearest vectors using built-in ANN (Approximate Nearest Neighbor) search
3. **Returns** results with chunk text, source filename, and distance score
2. **Runs hybrid search** -- vector search via Chroma ANN and BM25 keyword search in parallel
3. **Merges** both ranked lists using Reciprocal Rank Fusion (RRF, k=60)
4. **Reranks** the merged candidates using a cross-encoder (`cross-encoder/ms-marco-MiniLM-L-6-v2`) for precise relevance scoring
5. **Supports metadata filtering** -- an optional `filters` dict narrows the search to specific source files before retrieval, with automatic fallback to an unfiltered search when the filter returns fewer than 2 results
6. **Returns** top-5 reranked chunks with text, source filename, and chunk index

### Generation

@@ -38,6 +41,8 @@ A progressive RAG system built from first principles -- from raw embeddings and
- pymupdf (PDF parsing)
- python-docx (DOCX parsing)
- numpy (cosine similarity computation)
- rank-bm25 (BM25 keyword search)
- sentence-transformers (cross-encoder reranking)
- python-dotenv

---
@@ -57,11 +62,18 @@ rag-document-engine/
│ └── system_prompt.txt # LLM system prompt (loaded at runtime)
├── embed.py # embed_chunks and embed_query utilities
├── ingest.py # CLI entry point - parse, chunk, embed, store
├── search.py # Embed query + retrieve top-K from Chroma
├── search.py # Embed query + retrieve top-K from Chroma (with optional metadata filter)
├── hybrid_search.py # BM25 + vector search merged via RRF
├── rerank.py # Cross-encoder reranker on top of hybrid search candidates
├── generate.py # Token-budgeted answer generation via gpt-4o-mini
├── rag.py # End-to-end pipeline entry point
├── config.py # Tuneable constants (chunk size, reranker K, RRF K)
├── inspect_collection.py # Print collection stats and a sample entry
├── utils.py # chunk_text, load_document, load_documents
├── eval/
│ ├── golden_dataset.json # 20 manually written Q&A pairs for evaluation
│ ├── eval.py # Evaluation harness -- retrieval recall + LLM-as-judge scoring
│ └── results.md # Raw eval output and observations per experiment
├── chroma_db/ # Chroma persistent storage (not committed)
├── diagrams/ # Pipeline diagrams (SVG, generated via npx diagram-sync)
├── docs/ # Phase notes, PlantUML source files, and docs index
@@ -163,7 +175,7 @@ Note: distance is an inverse similarity score -- lower means more relevant.
| 2 | Vector Store | Complete |
| 3 | RAG Pipeline | Complete |
| 4 | Document Ingestion | Complete |
| 5 | Retrieval Quality | Planned |
| 5 | Retrieval Quality | Complete |
| 6 | Search and Chat Mode | Planned |
| 7 | Role-Based Document Access | Planned |

@@ -180,6 +192,9 @@ See [docs/implementation-plan.md](./docs/implementation-plan.md) for full phase
- **Vector database** -- stores embeddings with metadata and retrieves them by similarity using ANN search
- **RAG** -- Retrieval-Augmented Generation: retrieve relevant context, then generate a grounded answer
- **Document parsing** -- format-specific extraction that converts PDF, DOCX, and Markdown into plain text before chunking; all formats share the same embedding and storage flow after parsing
- **Hybrid search** -- combines vector similarity (semantic) and BM25 (keyword) rankings; catches cases where exact terms matter that embeddings miss
- **Reciprocal Rank Fusion** -- merges two ranked lists by summing 1/(k+rank) per item; chunks that rank high in both lists score highest
- **Cross-encoder reranking** -- reads query and chunk together to score direct relevance; more accurate than cosine similarity, used as a second pass on a small candidate set
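
The Reciprocal Rank Fusion definition above can be sketched in a few lines. This is a minimal illustration, not the code in `hybrid_search.py`: the function name, the use of chunk ids as list items, and the 1-based rank are assumptions made for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of chunk ids into one scored ranking.

    Each item receives 1 / (k + rank) from every list it appears in.
    Ranks are assumed to start at 1 (an assumption for this sketch).
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest combined score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical chunk ids, best match first in each list.
vector_ranking = ["c1", "c2", "c3"]
bm25_ranking = ["c2", "c4", "c1"]

fused = reciprocal_rank_fusion([vector_ranking, bm25_ranking])
for chunk_id, score in fused:
    print(f"{chunk_id}: {score:.5f}")
```

With k=60 the per-list contributions differ only slightly between neighbouring ranks, so a chunk that appears in both lists ("c1", "c2") outranks one that appears in only one list, even at a better position.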

---

@@ -220,3 +235,7 @@ The ingestion flow is split into 4 focused diagrams - read in this order:
**4. Upsert** - deduplication check, ChromaDB upsert with full payload

![Upsert](./diagrams/docs/pipeline-document-ingestion-upsert.svg)

### Phase 5 -- Retrieval Quality (hybrid search, reranking, metadata filtering)

![Retrieval Quality Pipeline](./diagrams/docs/pipeline-retrieval-quality.svg)
14 changes: 7 additions & 7 deletions docs/phase-3-rag-pipeline.md
@@ -1,6 +1,6 @@
# RDE-3 | Phase 3: RAG Pipeline

**Status:** In Progress
**Status:** Complete
**Type:** Feature
**Priority:** High
**Depends on:** RDE-2
@@ -23,29 +23,29 @@ Close the loop -- add LLM generation on top of retrieval so the system answers q

## Tasks

- [ ] Create `generate.py` with function `generate_answer(question: str, chunks: list[dict]) -> dict`
- [x] Create `generate.py` with function `generate_answer(question: str, chunks: list[dict]) -> dict`
- Builds a prompt: system instruction + retrieved chunks as context + user question
- System prompt must instruct the model to answer ONLY from the provided context
- System prompt must instruct the model to respond with exactly `"I don't know based on the provided documents."` when the answer is not in the context
- Calls `gpt-4o-mini` via the OpenAI chat completions API
- Returns `{ "answer": "...", "sources": [...] }`
- [ ] Add token budget logic to `generate.py`
- [x] Add token budget logic to `generate.py`
- Use `tiktoken` to count tokens in each chunk before building the prompt
- Set a max context token budget (2000 tokens)
- Walk chunks in order of relevance and add each until the budget is reached -- skip any chunk that would exceed it
- [ ] Create `rag.py` as the end-to-end pipeline entry point
- [x] Create `rag.py` as the end-to-end pipeline entry point
- Accept a question via `input()` or a command-line argument
- Embed the question using OpenAI
- Query Chroma for top-5 most similar chunks
- Pass the question and chunks to `generate_answer()`
- Detect the fixed "I don't know" phrase -- if matched, print `"No answer found in the documents."` instead of passing the raw model response
- Otherwise, print the answer and source citations
- [ ] Test with at least 3 questions
- [x] Test with at least 3 questions
- One where the answer is clearly in the documents
- One where the answer spans multiple chunks
- One where the answer is NOT in the documents -- verify the no-answer path prints cleanly
- [ ] Debug retrieval vs generation separately for one question -- print raw chunks before generation to confirm the right context is being retrieved
- [ ] Tune K: run the same question with K=3 and K=7, compare answers
- [x] Debug retrieval vs generation separately for one question -- print raw chunks before generation to confirm the right context is being retrieved
- [x] Tune K: run the same question with K=3 and K=7, compare answers
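
The token budget task above could look roughly like the sketch below. It is an illustration under stated assumptions, not the contents of `generate.py`: the function name and the `count_tokens` parameter are hypothetical, and the default counter splits on whitespace only so the example runs without extra dependencies. It reads "skip any chunk that would exceed it" as skip-and-continue, so a smaller chunk later in the list can still be added.

```python
def select_chunks_within_budget(chunks, max_tokens=2000, count_tokens=None):
    """Pick chunks in relevance order without exceeding a token budget.

    `chunks` is assumed to be sorted most-relevant first, each a dict
    with a "text" key. `count_tokens` is a hypothetical hook; the
    whitespace default is only a stand-in for a real tokenizer.
    """
    if count_tokens is None:
        count_tokens = lambda text: len(text.split())

    selected, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk["text"])
        if used + cost > max_tokens:
            continue  # this chunk does not fit; a later, smaller one might
        selected.append(chunk)
        used += cost
    return selected, used


# Toy chunks with 3, 5, and 2 whitespace-separated tokens.
toy_chunks = [
    {"text": "one two three"},
    {"text": "one two three four five"},
    {"text": "one two"},
]
selected, used = select_chunks_within_budget(toy_chunks, max_tokens=5)
print(len(selected), used)
```

In the real pipeline a counter built on `tiktoken` (for example, the length of an encoding's `encode(text)` result) would be passed as `count_tokens`, so the budget matches what the model actually sees.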

---

14 changes: 7 additions & 7 deletions docs/phase-4-document-ingestion.md
@@ -1,6 +1,6 @@
# RDE-4 | Phase 4: Document Ingestion

**Status:** Todo
**Status:** Complete
**Type:** Feature
**Priority:** High
**Depends on:** RDE-3
@@ -23,12 +23,12 @@ Support real document formats and a proper ingestion trigger -- not just static

## Tasks

- [ ] Create `ingest/pdf_parser.py` -- extract plain text from PDF files using `pymupdf` or `pdfplumber`
- [ ] Create `ingest/docx_parser.py` -- extract plain text from DOCX files using `python-docx`
- [ ] Create `ingest/markdown_parser.py` -- strip Markdown syntax, return clean plain text
- [ ] Create `ingest/router.py` -- detect file type by extension, call the right parser, return plain text
- [ ] Update `ingest.py` with CLI interface -- `python ingest.py <file_or_directory>` ingests any supported format
- [ ] Add deduplication to `ingest.py` -- before adding chunks for a file, delete all existing Chroma chunks with the same `source` metadata value so re-ingesting replaces rather than duplicates
- [x] Create `ingest/pdf_parser.py` -- extract plain text from PDF files using `pymupdf` or `pdfplumber`
- [x] Create `ingest/docx_parser.py` -- extract plain text from DOCX files using `python-docx`
- [x] Create `ingest/markdown_parser.py` -- strip Markdown syntax, return clean plain text
- [x] Create `ingest/router.py` -- detect file type by extension, call the right parser, return plain text
- [x] Update `ingest.py` with CLI interface -- `python ingest.py <file_or_directory>` ingests any supported format
- [x] Add deduplication to `ingest.py` -- before adding chunks for a file, delete all existing Chroma chunks with the same `source` metadata value so re-ingesting replaces rather than duplicates
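
The extension-based routing described for `ingest/router.py` can be sketched as a small dispatch table. The names below (`PARSERS`, `load_as_text`, `parse_plain_text`) are hypothetical and only a plain-text parser is shown; the PDF, DOCX, and Markdown parsers would be registered the same way.

```python
from pathlib import Path


def parse_plain_text(path):
    """Read a UTF-8 text file and return its contents."""
    return Path(path).read_text(encoding="utf-8")


# Hypothetical registry: file extension -> parser callable returning plain text.
PARSERS = {".txt": parse_plain_text}


def load_as_text(path, parsers=PARSERS):
    """Pick a parser by (case-insensitive) file extension and run it."""
    suffix = Path(path).suffix.lower()
    try:
        parser = parsers[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}") from None
    return parser(path)
```

Failing loudly on an unknown extension keeps unsupported files from being silently ingested as empty text.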

---

62 changes: 39 additions & 23 deletions docs/phase-5-retrieval-quality.md
@@ -1,6 +1,6 @@
# RDE-5 | Phase 5: Retrieval Quality

**Status:** Todo
**Status:** Complete
**Type:** Enhancement
**Priority:** High
**Depends on:** RDE-4
@@ -13,44 +13,60 @@ Make retrieval measurably better -- not just faster, but more accurate. Cosine s

---

## What Changes from Phase 4
## What Changed from Phase 4

- Build a labeled test set before changing anything -- so improvements can be measured
- Experiment with chunk size and overlap to find what works best for the document set
- Extend `search.py` with BM25 keyword scoring alongside vector similarity
- Merge vector and BM25 results using reciprocal rank fusion before passing to the generator
- Add a cross-encoder re-ranker as a second pass on the merged top-K
- Add metadata filter support to all `collection.query()` calls
- Built a labeled evaluation set before changing anything -- so improvements can be measured
- Experimented with chunk size and overlap across three settings (300/50, 150/25, 600/100)
- Extended `search.py` with an optional metadata filter passed to Chroma's `where` argument, with a fallback when filtered results are too few
- Created `hybrid_search.py` -- runs BM25 keyword search alongside vector search and merges both ranked lists using reciprocal rank fusion (RRF, k=60)
- Created `rerank.py` -- cross-encoder model (`cross-encoder/ms-marco-MiniLM-L-6-v2`) re-scores the hybrid search candidates before passing to generation
- Updated `rag.py` to use hybrid search + reranker instead of plain vector search
- Created `config.py` to centralise all tuneable constants (chunk size, reranker K, RRF K)
- Evaluation results and observations recorded in `eval/results.md`

---

## Tasks

- [ ] Create `eval/test_set.json` -- at least 10 manually written question + expected source chunk pairs
- [ ] Create `eval/evaluate.py` -- runs each query, checks if the expected chunk appears in top-K results, reports recall@K
- [ ] Run baseline evaluation against Phase 4 retrieval and record the score
- [ ] Experiment with chunk size and overlap -- re-run ingestion with at least 2 different settings, compare eval scores
- [ ] Create `retrieval/hybrid.py` -- combine vector similarity scores with BM25 scores via reciprocal rank fusion
- [ ] Create `retrieval/reranker.py` -- cross-encoder model re-scores top-K results before they are passed to generation
- [ ] Add metadata filter interface to `search.py` -- allow filtering by source file or document type before running the query
- [ ] Re-run evaluation after each change and record the delta
- [x] Create `eval/golden_dataset.json` -- 20 manually written question + expected answer pairs (15 answerable, 3 multi-chunk, 2 unanswerable)
- [x] Create `eval/eval.py` -- runs each query through the full RAG pipeline, scores retrieval recall and answer correctness via LLM-as-judge, prints a summary
- [x] Run baseline evaluation and record scores before making any changes
- [x] Experiment with chunk size and overlap -- re-ran ingestion with 3 settings, compared eval scores, settled on chunk_size=150, overlap=25
- [x] Create `hybrid_search.py` -- BM25 + vector search merged via RRF
- [x] Create `rerank.py` -- cross-encoder re-scores hybrid search candidates
- [x] Add metadata filter interface to `search.py` -- optional `filters` dict passed to Chroma `where`, fallback to unfiltered if results < 2
- [x] Re-ran evaluation after each change and recorded results in `eval/results.md`
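
The filter-with-fallback behaviour in the metadata filter task can be sketched as follows. This is an assumption-laden illustration rather than the code in `search.py`: `search_with_fallback` and `query_fn` are hypothetical names, and `query_fn` stands in for a thin wrapper around Chroma's `collection.query(..., where=...)`. The chunks and `sleep.txt` are toy data.

```python
def search_with_fallback(query_fn, query_embedding, filters=None,
                         n_results=20, min_results=2):
    """Run a filtered query; retry unfiltered if too few results come back.

    `query_fn(query_embedding, n_results, where)` is a hypothetical callable
    that returns a list of chunk dicts. Returns (hits, filter_was_applied).
    """
    if filters:
        hits = query_fn(query_embedding, n_results, filters)
        if len(hits) >= min_results:
            return hits, True
    # No filter given, or the filter was too restrictive.
    return query_fn(query_embedding, n_results, None), False


# Toy in-memory stand-in for the vector store (illustrative data only).
CHUNKS = [
    {"text": "Protein supports muscle repair.", "source": "nutrition-and-health.txt"},
    {"text": "Fibre aids digestion.", "source": "nutrition-and-health.txt"},
    {"text": "Sleep affects recovery.", "source": "sleep.txt"},
]


def fake_query(query_embedding, n_results, where):
    """Ignore the embedding and apply only the equality filter."""
    hits = [c for c in CHUNKS
            if not where or all(c.get(key) == value for key, value in where.items())]
    return hits[:n_results]


hits, filtered = search_with_fallback(
    fake_query, [0.0], filters={"source": "nutrition-and-health.txt"}
)
print(len(hits), filtered)
```

Returning the `filtered` flag alongside the hits lets the caller log or report when the fallback fired, which is useful when reading eval results.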

---

## Acceptance Criteria

- `python eval/evaluate.py` reports recall@K for the current retrieval setup against the test set
- Hybrid search recall@K is higher than pure vector search recall@K on the test set
- Re-ranker changes the order of at least some results -- verify with a debug print before and after
- Metadata filtering works: querying with `source="nutrition.txt"` returns only chunks from that file
- All changes are accompanied by eval score comparisons showing the impact
- [x] `python3 eval/eval.py` reports retrieval recall and answer correctness for the current setup
- [x] Hybrid search is wired into `rag.py` as the retrieval stage
- [x] Reranker re-scores hybrid candidates before generation
- [x] Metadata filtering works -- querying with `source="nutrition-and-health.txt"` returns only chunks from that file; falls back to unfiltered if too few results
- [x] All changes are accompanied by eval score comparisons in `eval/results.md`

---

## Stack Additions

- `rank_bm25` for keyword scoring
- `sentence-transformers` for cross-encoder re-ranking
- `rank_bm25` -- BM25 keyword scoring
- `sentence-transformers` -- cross-encoder re-ranking (`cross-encoder/ms-marco-MiniLM-L-6-v2`)

---

## Eval Summary

| Configuration | Retrieval recall | Answer correctness |
| ------------- | :--------------: | :----------------: |
| Baseline (chunk_size=300) | 18/20 (90%) | 4.3 / 5.0 |
| Small chunks (chunk_size=150) | 18/20 (90%) | 4.5 / 5.0 |
| Large chunks (chunk_size=600) | 18/20 (90%) | 4.4 / 5.0 |
| Small chunks + reranking | 18/20 (90%) | 4.3 / 5.0 |
| Small chunks + hybrid + reranking | 18/20 (90%) | 4.3 / 5.0 |

Full raw results and observations in `eval/results.md`.
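
For readers unfamiliar with the "Retrieval recall" column, the sketch below shows one common way to compute such a figure: the number of questions whose expected source file appears among the retrieved chunks. It is not taken from `eval/eval.py`; the field names (`question`, `expected_source`), the `retrieve` callable, and the toy data are assumptions for illustration.

```python
def retrieval_recall(examples, retrieve, k=5):
    """Count examples whose expected source appears in the top-k results.

    `examples` is a list of dicts with hypothetical keys "question" and
    "expected_source"; `retrieve(question)` returns ranked chunk dicts
    that each carry a "source" key. Returns (hits, total).
    """
    hits = 0
    for example in examples:
        top_sources = {chunk["source"] for chunk in retrieve(example["question"])[:k]}
        if example["expected_source"] in top_sources:
            hits += 1
    return hits, len(examples)


# Toy data: the retriever always returns a chunk from a.txt.
golden = [
    {"question": "q1", "expected_source": "a.txt"},
    {"question": "q2", "expected_source": "b.txt"},
]


def toy_retrieve(question):
    return [{"source": "a.txt", "text": "..."}]


hits, total = retrieval_recall(golden, toy_retrieve)
print(f"{hits}/{total} ({hits / total:.0%})")
```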

---

74 changes: 74 additions & 0 deletions docs/pipeline-retrieval-quality.puml
@@ -0,0 +1,74 @@
@startuml pipeline-retrieval-quality

skinparam backgroundColor #FFFFFF
skinparam defaultFontName Arial
skinparam defaultFontSize 15
skinparam sequenceArrowThickness 2
skinparam SequenceBoxBackgroundColor #F8F9FF
skinparam SequenceBoxBorderColor #AABBDD
skinparam ParticipantBackgroundColor #EEF3FB
skinparam ParticipantBorderColor #5577AA
skinparam ParticipantFontColor #222222
skinparam DatabaseBackgroundColor #FFF8E7
skinparam DatabaseBorderColor #CC9900
skinparam ActorBackgroundColor #F0FFF0
skinparam ActorBorderColor #448844

title Phase 5 -- Retrieval Quality (Hybrid Search + Reranking + Metadata Filtering)

participant "hybrid_search.py" as hybrid
participant "search.py" as search
participant "embed.py" as embed
participant "rerank.py" as reranker
database "Chroma DB" as chroma
participant "OpenAI API" as openai
participant "BM25Okapi" as bm25
participant "CrossEncoder" as crossencoder

== Stage 1 -- Vector Search (with optional metadata filter) ==

hybrid -> embed : embed_query(question)
embed -> openai : embeddings.create()\ntext-embedding-3-small
openai --> embed : query vector
embed --> hybrid : query vector

alt metadata filter provided
hybrid -> chroma : query(embeddings, n_results=20,\nwhere={"source": "file.txt"})
chroma --> hybrid : filtered top-20 chunks
hybrid -> hybrid : len(results) < 2?
alt too few results
hybrid -> chroma : retry without filter
chroma --> hybrid : unfiltered top-20 chunks
end
else no filter
hybrid -> chroma : query(embeddings, n_results=20)
chroma --> hybrid : top-20 chunks by cosine similarity
end

== Stage 2 -- BM25 Keyword Search ==

hybrid -> bm25 : BM25Okapi(tokenized corpus)
hybrid -> bm25 : get_scores(tokenized question)
bm25 --> hybrid : keyword relevance scores per chunk
hybrid -> hybrid : rank chunks by BM25 score\nkeep top-20

== Stage 3 -- RRF Merge ==

hybrid -> hybrid : for each chunk in vector results:\n score += 1 / (60 + rank)
hybrid -> hybrid : for each chunk in BM25 results:\n score += 1 / (60 + rank)
hybrid -> hybrid : sort by combined RRF score\nreturn top-20 candidates

== Stage 4 -- Cross-Encoder Reranking ==

hybrid --> reranker : top-20 candidates
reranker -> crossencoder : predict([(question, chunk_text)] x20)\ncross-encoder/ms-marco-MiniLM-L-6-v2
crossencoder --> reranker : relevance scores
reranker -> reranker : sort descending by score\nkeep top-5
reranker --> hybrid : top-5 reranked chunks

note over hybrid, reranker
Final output: 5 chunks ordered by cross-encoder relevance
ready to be passed to generate.py
end note

@enduml
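
Stage 4 of the diagram (score each question/chunk pair, sort descending, keep the top 5) can be sketched as below. This is a hedged illustration, not the contents of `rerank.py`: the `rerank` signature and the `score_pairs` hook are hypothetical, and a toy word-overlap scorer replaces the cross-encoder so the example runs without downloading a model.

```python
def rerank(question, candidates, score_pairs, top_k=5):
    """Re-score candidates against the question and keep the best top_k.

    `score_pairs` is a hypothetical hook: it takes a list of
    (question, chunk_text) pairs and returns one score per pair.
    """
    pairs = [(question, candidate["text"]) for candidate in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    # Copy each chunk and attach its score so callers can inspect the ordering.
    return [dict(chunk, rerank_score=float(score)) for chunk, score in ranked[:top_k]]


def overlap_scorer(pairs):
    """Toy stand-in for a cross-encoder: count shared lowercase words."""
    return [len(set(q.lower().split()) & set(t.lower().split())) for q, t in pairs]


# Illustrative candidates only.
candidates = [
    {"text": "sleep improves recovery"},
    {"text": "protein intake per day depends on body weight"},
    {"text": "protein sources include beans"},
]
top = rerank("how much protein per day", candidates, overlap_scorer, top_k=2)
for chunk in top:
    print(chunk["rerank_score"], chunk["text"])
```

With `sentence-transformers` installed, the `predict` method of `CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")` accepts a list of text pairs and could be passed as `score_pairs` in place of the toy scorer.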