73 changes: 50 additions & 23 deletions README.md
@@ -1,25 +1,26 @@
# RAG Document Engine

A progressive RAG system built from first principles -- from raw embeddings and cosine similarity all the way to a full retrieval-augmented generation pipeline with document ingestion, reranking, and cited answers.
A progressive RAG system built from first principles -- from raw embeddings and cosine similarity all the way to a full retrieval-augmented generation pipeline with multi-format document ingestion and cited answers.

---

## What It Does (Current State)

**Ingestion**
### Ingestion

1. **Loads** `.txt` files (PDF, DOCX, Markdown from Phase 4)
1. **Parses** `.txt`, `.pdf`, `.docx`, and `.md` files into plain text via format-specific parsers
2. **Chunks** each document into overlapping word windows
3. **Embeds** each chunk using OpenAI `text-embedding-3-small`, producing a 1536-dimensional vector
4. **Stores** vectors with metadata (`source`, `chunk_index`) in a persistent Chroma collection
4. **Deduplicates** - deletes any existing chunks for the file before storing, so re-ingestion replaces rather than duplicates
5. **Stores** vectors with metadata (`source`, `chunk_index`) in a persistent Chroma collection
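
The delete-before-store behaviour in the deduplication step can be pictured with a small in-memory stand-in for the collection. This is a minimal sketch, not the repository's code: `InMemoryStore`, `delete_source`, `add`, and `reingest` are hypothetical names, and none of them are Chroma API calls.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for a vector collection (not the Chroma API)."""

    rows: dict[str, dict] = field(default_factory=dict)

    def delete_source(self, source: str) -> int:
        """Remove every chunk whose metadata points at `source`."""
        stale = [cid for cid, row in self.rows.items() if row["source"] == source]
        for cid in stale:
            del self.rows[cid]
        return len(stale)

    def add(self, source: str, chunks: list[str]) -> None:
        """Store chunks under ids derived from the source and chunk index."""
        for index, text in enumerate(chunks):
            self.rows[f"{source}:{index}"] = {
                "source": source,
                "chunk_index": index,
                "text": text,
            }


def reingest(store: InMemoryStore, source: str, chunks: list[str]) -> None:
    """Delete old chunks for the file first, so re-ingestion replaces them."""
    store.delete_source(source)
    store.add(source, chunks)
```

Deleting first matters when a file shrinks: overwriting by id alone would leave the old trailing chunks behind, whereas delete-then-store leaves only the chunks from the latest version.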

**Search**
### Search

1. **Embeds** the query using the same model
2. **Queries** Chroma for the top-K nearest vectors using built-in ANN (Approximate Nearest Neighbor) search
3. **Returns** results with chunk text, source filename, and distance score
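
To make the retrieval step concrete, here is an exact (brute-force) top-K search over a handful of vectors. It is only an illustration: Chroma performs approximate search internally, the squared Euclidean metric is an assumption (the README does not state which metric the collection uses), and `squared_l2` and `top_k` are hypothetical helper names.

```python
import heapq


def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance; lower means closer."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query: list[float], rows: list[dict], k: int = 3) -> list[dict]:
    """Return the k rows nearest to the query, closest first (exact, not ANN)."""
    scored = (
        {
            "text": row["text"],
            "source": row["source"],
            "distance": squared_l2(query, row["embedding"]),
        }
        for row in rows
    )
    return heapq.nsmallest(k, scored, key=lambda hit: hit["distance"])
```

An exact scan like this compares the query against every stored vector, which is the cost an ANN index is designed to avoid as the collection grows.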

**Generation**
### Generation

1. **Selects** retrieved chunks within a 2000-token budget using `tiktoken`
2. **Builds** a numbered context block from the selected chunks
@@ -34,6 +35,9 @@ A progressive RAG system built from first principles -- from raw embeddings and
- OpenAI SDK (`text-embedding-3-small` for embeddings, `gpt-4o-mini` for generation)
- Chroma (persistent vector database)
- tiktoken (token counting for context budget management)
- pymupdf (PDF parsing)
- python-docx (DOCX parsing)
- numpy (cosine similarity computation)
- python-dotenv

---
@@ -42,24 +46,25 @@ A progressive RAG system built from first principles -- from raw embeddings and

```text
rag-document-engine/
├── documents/ # Sample .txt files
│ ├── ancient-rome.txt
│ ├── climate-change.txt
│ ├── music-and-the-brain.txt
│ ├── nutrition-and-health.txt
│ └── space-exploration.txt
├── documents/ # Sample documents (.txt, .pdf, .docx, .md)
├── ingest/ # Format-specific parsers (Phase 4)
│ ├── __init__.py
│ ├── router.py # Resolves parser by file extension
│ ├── pdf_parser.py # PDF extraction via pymupdf
│ ├── docx_parser.py # DOCX extraction via python-docx
│ └── markdown_parser.py # Markdown stripping to plain text
├── prompts/
│ └── system_prompt.txt # LLM system prompt (loaded at runtime)
├── embed.py # embed_chunks and embed_query utilities
├── ingest.py # Load, chunk, embed, store in Chroma
├── ingest.py # CLI entry point - parse, chunk, embed, store
├── search.py # Embed query + retrieve top-K from Chroma
├── generate.py # Token-budgeted answer generation via gpt-4o-mini
├── rag.py # End-to-end pipeline entry point
├── inspect_collection.py # Print collection stats and a sample entry
├── utils.py # chunk_text, load_document, load_documents
├── chroma_db/ # Chroma persistent storage (not committed)
├── diagrams/ # Pipeline diagrams (SVG, auto-exported from PlantUML)
├── docs/ # PlantUML source files and implementation plan
├── diagrams/ # Pipeline diagrams (SVG, generated via npx diagram-sync)
├── docs/ # Phase notes, PlantUML source files, and docs index
├── pyproject.toml
└── .env # API keys (not committed)
```
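
The layout above describes `router.py` as resolving a parser by file extension. A minimal sketch of that idea, assuming a plain dictionary registry, might look like the following. `PARSERS`, `parse_txt`, and `resolve_parser` are hypothetical names and are not taken from the repository.

```python
from pathlib import Path
from typing import Callable


def parse_txt(path: Path) -> str:
    """Plain text needs no conversion."""
    return path.read_text(encoding="utf-8")


# Hypothetical registry: PDF, DOCX, and Markdown parsers would be
# registered here in the same way, keyed by their extension.
PARSERS: dict[str, Callable[[Path], str]] = {".txt": parse_txt}


def resolve_parser(path: Path) -> Callable[[Path], str]:
    """Pick a parser from the lowercased file extension."""
    suffix = path.suffix.lower()
    try:
        return PARSERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}") from None
```

Because every parser returns plain text, the chunking, embedding, and storage code downstream never needs to know which format a file started in.
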
@@ -88,8 +93,9 @@ TOKEN_BUDGET=2000
## Usage

```bash
# Step 1 -- Ingest documents into Chroma
python3 ingest.py
# Step 1 -- Ingest a single file or an entire directory
python3 ingest.py documents/ancient-rome.pdf
python3 ingest.py documents/

# Step 2 -- Ask a question (full RAG pipeline)
python3 rag.py "what foods are good for the heart"
@@ -130,15 +136,15 @@ No answer found in the documents.
**Search only** -- `python3 search.py`

```text
Result 1 (distance: 1.2862) -- nutrition-and-health.txt [chunk 0]
Result 1 (distance: 1.2862) - nutrition-and-health.txt [chunk 0]
Nutrition is the science of how food affects the body... Unsaturated fats found in olive oil,
nuts, avocados, and fatty fish are associated with reduced risk of heart disease...

Result 2 (distance: 1.3720) -- nutrition-and-health.txt [chunk 1]
Result 2 (distance: 1.3720) - nutrition-and-health.txt [chunk 1]
The Mediterranean diet -- rich in vegetables, fruit, whole grains, fish, and olive oil -- is
consistently associated with lower rates of heart disease, diabetes, and cognitive decline...

Result 3 (distance: 1.6426) -- music-and-the-brain.txt [chunk 1]
Result 3 (distance: 1.6426) - music-and-the-brain.txt [chunk 1]
Music also affects mood and stress. Slow, quiet music activates the parasympathetic nervous
system, lowering heart rate and cortisol levels...
```
@@ -156,7 +162,7 @@ Note: distance is an inverse similarity score -- lower means more relevant.
| 1 | Semantic Foundation | Complete |
| 2 | Vector Store | Complete |
| 3 | RAG Pipeline | Complete |
| 4 | Document Ingestion | Planned |
| 4 | Document Ingestion | Complete |
| 5 | Retrieval Quality | Planned |
| 6 | Search and Chat Mode | Planned |
| 7 | Role-Based Document Access | Planned |
@@ -173,14 +179,15 @@ See [docs/implementation-plan.md](./docs/implementation-plan.md) for full phase
- **Model consistency** -- the same embedding model must be used for both documents and queries
- **Vector database** -- stores embeddings with metadata and retrieves them by similarity using ANN search
- **RAG** -- Retrieval-Augmented Generation: retrieve relevant context, then generate a grounded answer
- **Document parsing** -- format-specific extraction that converts PDF, DOCX, and Markdown into plain text before chunking; all formats share the same embedding and storage flow after parsing
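
The model-consistency point can be made concrete with cosine similarity, the measure the project starts from. The sketch below uses only the standard library, whereas the stack lists numpy for this computation. `cosine_similarity` is an illustrative helper, not the repository's function.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    if len(a) != len(b):
        # Vectors from different embedding models may not even share a length.
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

The length check catches only the most obvious mismatch. Two models with the same dimensionality still produce vectors in unrelated spaces, so their similarity scores would be meaningless even though the arithmetic succeeds.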

---

## Diagrams

Pipeline diagrams are maintained as PlantUML source files in `docs/` and auto-exported to SVG on every push to main using [diagram-sync](https://www.npmjs.com/package/diagram-sync).
Pipeline diagrams are maintained as PlantUML source files in `docs/` and exported to SVG by running `npx diagram-sync` (see [diagram-sync](https://www.npmjs.com/package/diagram-sync)).

The three diagrams below show the system growing phase by phase -- each one builds on the previous.
The diagrams below show the system growing phase by phase -- each one builds on the previous.

### Phase 1 -- Semantic Search (cosine similarity over JSON embeddings)

@@ -193,3 +200,23 @@ The three diagrams below show the system growing phase by phase -- each one buil
### Phase 3 -- RAG Pipeline (generation on top of retrieval)

![RAG Pipeline](./diagrams/docs/pipeline-rag.svg)

### Phase 4 -- Document Ingestion (multi-format parsing, deduplication)

The ingestion flow is split into 4 focused diagrams - read in this order:

**1. Entry and Routing** - CLI validation, collection setup, file vs directory routing

![Entry and Routing](./diagrams/docs/pipeline-document-ingestion-entry-routing.svg)

**2. Parsing** - router extension resolution, all 4 parsers (PDF / DOCX / MD / TXT), flatten to plain text

![Parsing](./diagrams/docs/pipeline-document-ingestion-parsing.svg)

**3. Chunking and Embedding** - sliding window chunking, OpenAI embeddings API call

![Chunking and Embedding](./diagrams/docs/pipeline-document-ingestion-chunk-embed.svg)

**4. Upsert** - deduplication check, ChromaDB upsert with full payload

![Upsert](./diagrams/docs/pipeline-document-ingestion-upsert.svg)
52 changes: 52 additions & 0 deletions docs/README.md
@@ -0,0 +1,52 @@
# Docs

This folder contains the implementation plan, phase-by-phase notes, and sequence diagrams for the RAG Document Engine.

---

## Implementation Plan

| File | Description |
| ---- | ----------- |
| `implementation-plan.md` | Full build plan across all 7 phases — goals, what gets built, stack additions, and questions to answer per phase |

---

## Phase Notes

One file per phase. Each covers the goal, design decisions, and key concepts for that phase.

| File | Phase |
| ---- | ----- |
| `phase-1-semantic-foundation.md` | Semantic search from scratch — chunking, embeddings, cosine similarity over flat JSON |
| `phase-2-vector-store.md` | Replace JSON with ChromaDB — persistent collection, metadata, `collection.query()` |
| `phase-3-rag-pipeline.md` | Close the loop — retrieval + LLM generation, grounded answers, citations, token budget |
| `phase-4-document-ingestion.md` | Multi-format ingestion — PDF, DOCX, Markdown parsers, CLI trigger, deduplication |
| `phase-5-retrieval-quality.md` | Improve retrieval — evaluation set, hybrid search (BM25 + vector), re-ranker, metadata filters |
| `phase-6-search-and-chat-mode.md` | Two interaction modes — document search and multi-turn chat with conversation history |
| `phase-7-role-based-document-access.md` | Access control — owner metadata at ingestion, per-user query filters, no bypass path |

---

## Sequence Diagrams

PlantUML sequence diagrams for each pipeline. Open with any PlantUML-compatible renderer.

### Document Ingestion Pipeline

Covers Phase 4. Split into 4 focused diagrams — read in this order to follow the full flow:

| Order | File | Covers |
| ----- | ---- | ------- |
| 1 | [pipeline-document-ingestion-entry-routing.svg](../diagrams/docs/pipeline-document-ingestion-entry-routing.svg) | CLI entry, arg validation, `get_or_create_collection`, file vs directory routing |
| 2 | [pipeline-document-ingestion-parsing.svg](../diagrams/docs/pipeline-document-ingestion-parsing.svg) | Router extension resolution, all 4 parsers (PDF / DOCX / MD / TXT), flatten to plain text |
| 3 | [pipeline-document-ingestion-chunk-embed.svg](../diagrams/docs/pipeline-document-ingestion-chunk-embed.svg) | Sliding window chunking, OpenAI embeddings API call |
| 4 | [pipeline-document-ingestion-upsert.svg](../diagrams/docs/pipeline-document-ingestion-upsert.svg) | Deduplication check, ChromaDB upsert with full payload |

### Other Pipelines

| File | Covers |
| ---- | ------- |
| [pipeline-semantic-search.svg](../diagrams/docs/pipeline-semantic-search.svg) | Phase 1 — query embedding, cosine similarity, top-K retrieval over flat JSON |
| [pipeline-vector-store.svg](../diagrams/docs/pipeline-vector-store.svg) | Phase 2 — ingest and query flow using ChromaDB |
| [pipeline-rag.svg](../diagrams/docs/pipeline-rag.svg) | Phase 3 — end-to-end RAG: retrieval, token budget, prompt construction, LLM generation, citations |
35 changes: 35 additions & 0 deletions docs/pipeline-document-ingestion-chunk-embed.puml
@@ -0,0 +1,35 @@
@startuml pipeline-document-ingestion-chunk-embed
skinparam sequenceMessageAlign center
skinparam ParticipantPadding 10

participant "ingest_file()" as ingest
participant "utils\nchunk_text()" as chunker
participant "embed\nembed_chunks()" as embedder
participant "OpenAI\nEmbeddings API" as openai

== Chunking ==

ingest -> chunker : chunk_text(text, chunk_size = 300, overlap = 50)
activate chunker
chunker -> chunker : words = text.split()
chunker -> chunker : step = chunk_size - overlap = 250
loop for i in range(0, len(words), step)
chunker -> chunker : chunk_words = words[i : i + 300]
chunker -> chunker : chunks.append(" ".join(chunk_words))
end
chunker --> ingest : chunks: list[str]
deactivate chunker

== Embedding ==

ingest -> embedder : embed_chunks(chunks)
activate embedder
embedder -> openai : embeddings.create(model = EMBEDDING_MODEL, input = chunks)
activate openai
openai --> embedder : EmbeddingResponse — response.data[i].embedding
deactivate openai
embedder -> embedder : build [{"text": chunk, "embedding": [...float]}]
embedder --> ingest : embedded: list[dict]
deactivate embedder

@enduml
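
The chunking loop in the diagram above translates almost line for line into Python. This is a sketch reconstructed from the diagram's steps, with the default sizes it shows. The argument validation is an addition for illustration and may not exist in `utils.py`.

```python
def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into overlapping word windows."""
    if not 0 <= overlap < chunk_size:
        # Added guard (assumption): a step of zero or less would never advance.
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # 250 with the defaults
    chunks = []
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i : i + chunk_size]))
    return chunks
```

With the defaults, a 600-word document produces windows starting at words 0, 250, and 500, and each window repeats the last 50 words of the previous one. One consequence of stepping by 250 regardless of what remains is that a document of exactly 300 words yields a second, 50-word chunk that only repeats the tail of the first.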
62 changes: 62 additions & 0 deletions docs/pipeline-document-ingestion-entry-routing.puml
@@ -0,0 +1,62 @@
@startuml pipeline-document-ingestion-entry-routing
skinparam sequenceMessageAlign center
skinparam ParticipantPadding 10

actor User as user
participant "ingest.py" as main
participant "ChromaDB\nPersistentClient" as chromaClient
participant "ChromaDB\nCollection" as collection

== Module Init (at import time) ==

main -> chromaClient : PersistentClient(path = "./chroma_db")
activate chromaClient
chromaClient --> main : client (persistent, on-disk)
deactivate chromaClient

== main() Entry ==

user -> main : python ingest.py <path>
activate main

opt len(sys.argv) < 2 (no argument provided)
main --> user : print "Usage: python ingest.py <file_or_directory>"
main --> user : sys.exit(1)
end

main -> chromaClient : get_or_create_collection(name = "documents")
activate chromaClient
chromaClient --> main : collection
deactivate chromaClient
activate collection

== Path Routing ==

alt target.is_file()
main -> main : ingest_file(collection, filepath)

else target.is_dir()
main -> main : filter files by suffix (.txt | .pdf | .docx | .md)
alt no supported files found in directory
main --> user : print "No supported files found in {target}"
main --> user : sys.exit(0)
else supported files exist
loop for each file in directory
main -> main : ingest_file(collection, filepath)
end
end

else path does not exist
main --> user : print "Path not found: {target}"
main --> user : sys.exit(1)
end

== Summary ==

main -> collection : count()
collection --> main : total_count
main --> user : print "Total vectors in collection: {total_count}"
deactivate collection
deactivate main

@enduml
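
The path-routing branch in the diagram above can be sketched with `pathlib`. `collect_targets` and `SUPPORTED_SUFFIXES` are hypothetical names. The non-recursive directory scan, the sorted order, and the case-insensitive suffix match are assumptions, since the diagram does not specify them.

```python
from pathlib import Path

# Extensions the diagram lists for directory filtering.
SUPPORTED_SUFFIXES = {".txt", ".pdf", ".docx", ".md"}


def collect_targets(target: Path) -> list[Path]:
    """Return the files to ingest for a file or directory argument."""
    if target.is_file():
        # A single file is passed through; the parser router decides support.
        return [target]
    if target.is_dir():
        return sorted(
            path
            for path in target.iterdir()
            if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES
        )
    raise FileNotFoundError(f"Path not found: {target}")
```

A caller would treat an empty list as the "No supported files found" case and catch `FileNotFoundError` for the "Path not found" case, mapping them to exit codes 0 and 1 as in the diagram.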