Buffden · Jun 23, 2026
diff --git a/README.md b/README.md
@@ -1,25 +1,26 @@
 # RAG Document Engine
 
-A progressive RAG system built from first principles -- from raw embeddings and cosine similarity all the way to a full retrieval-augmented generation pipeline with document ingestion, reranking, and cited answers.
+A progressive RAG system built from first principles -- from raw embeddings and cosine similarity all the way to a full retrieval-augmented generation pipeline with multi-format document ingestion and cited answers.
 
 ---
 
 ## What It Does (Current State)
 
-**Ingestion**
+### Ingestion
 
-1. **Loads** `.txt` files (PDF, DOCX, Markdown from Phase 4)
+1. **Parses** `.txt`, `.pdf`, `.docx`, and `.md` files into plain text via format-specific parsers
 2. **Chunks** each document into overlapping word windows
 3. **Embeds** each chunk using OpenAI `text-embedding-3-small`, producing a 1536-dimensional vector
-4. **Stores** vectors with metadata (`source`, `chunk_index`) in a persistent Chroma collection
+4. **Deduplicates** - deletes any existing chunks for the file before storing, so re-ingestion replaces rather than duplicates
+5. **Stores** vectors with metadata (`source`, `chunk_index`) in a persistent Chroma collection
 
-**Search**
+### Search
 
 1. **Embeds** the query using the same model
 2. **Queries** Chroma for the top-K nearest vectors using built-in ANN (Approximate Nearest Neighbor) search
 3. **Returns** results with chunk text, source filename, and distance score
 
-**Generation**
+### Generation
 
 1. **Selects** retrieved chunks within a 2000-token budget using `tiktoken`
 2. **Builds** a numbered context block from the selected chunks
@@ -34,6 +35,9 @@ A progressive RAG system built from first principles -- from raw embeddings and
 - OpenAI SDK (`text-embedding-3-small` for embeddings, `gpt-4o-mini` for generation)
 - Chroma (persistent vector database)
 - tiktoken (token counting for context budget management)
+- pymupdf (PDF parsing)
+- python-docx (DOCX parsing)
+- numpy (cosine similarity computation)
 - python-dotenv
 
 ---
@@ -42,24 +46,25 @@ A progressive RAG system built from first principles -- from raw embeddings and
 
 ```text
 rag-document-engine/
-├── documents/                  # Sample .txt files
-│   ├── ancient-rome.txt
-│   ├── climate-change.txt
-│   ├── music-and-the-brain.txt
-│   ├── nutrition-and-health.txt
-│   └── space-exploration.txt
+├── documents/                  # Sample documents (.txt, .pdf, .docx, .md)
+├── ingest/                     # Format-specific parsers (Phase 4)
+│   ├── __init__.py
+│   ├── router.py               # Resolves parser by file extension
+│   ├── pdf_parser.py           # PDF extraction via pymupdf
+│   ├── docx_parser.py          # DOCX extraction via python-docx
+│   └── markdown_parser.py      # Markdown stripping to plain text
 ├── prompts/
 │   └── system_prompt.txt       # LLM system prompt (loaded at runtime)
 ├── embed.py                    # embed_chunks and embed_query utilities
-├── ingest.py                   # Load, chunk, embed, store in Chroma
+├── ingest.py                   # CLI entry point - parse, chunk, embed, store
 ├── search.py                   # Embed query + retrieve top-K from Chroma
 ├── generate.py                 # Token-budgeted answer generation via gpt-4o-mini
 ├── rag.py                      # End-to-end pipeline entry point
 ├── inspect_collection.py       # Print collection stats and a sample entry
 ├── utils.py                    # chunk_text, load_document, load_documents
 ├── chroma_db/                  # Chroma persistent storage (not committed)
-├── diagrams/                   # Pipeline diagrams (SVG, auto-exported from PlantUML)
-├── docs/                       # PlantUML source files and implementation plan
+├── diagrams/                   # Pipeline diagrams (SVG, generated via npx diagram-sync)
+├── docs/                       # Phase notes, PlantUML source files, and docs index
 ├── pyproject.toml
 └── .env                        # API keys (not committed)
 ```
@@ -88,8 +93,9 @@ TOKEN_BUDGET=2000
 ## Usage
 
 ```bash
-# Step 1 -- Ingest documents into Chroma
-python3 ingest.py
+# Step 1 -- Ingest a single file or an entire directory
+python3 ingest.py documents/ancient-rome.pdf
+python3 ingest.py documents/
 
 # Step 2 -- Ask a question (full RAG pipeline)
 python3 rag.py "what foods are good for the heart"
@@ -130,15 +136,15 @@ No answer found in the documents.
 **Search only** -- `python3 search.py`
 
 ```text
-Result 1 (distance: 1.2862) -- nutrition-and-health.txt [chunk 0]
+Result 1 (distance: 1.2862) - nutrition-and-health.txt [chunk 0]
 Nutrition is the science of how food affects the body... Unsaturated fats found in olive oil,
 nuts, avocados, and fatty fish are associated with reduced risk of heart disease...
 
-Result 2 (distance: 1.3720) -- nutrition-and-health.txt [chunk 1]
+Result 2 (distance: 1.3720) - nutrition-and-health.txt [chunk 1]
 The Mediterranean diet -- rich in vegetables, fruit, whole grains, fish, and olive oil -- is
 consistently associated with lower rates of heart disease, diabetes, and cognitive decline...
 
-Result 3 (distance: 1.6426) -- music-and-the-brain.txt [chunk 1]
+Result 3 (distance: 1.6426) - music-and-the-brain.txt [chunk 1]
 Music also affects mood and stress. Slow, quiet music activates the parasympathetic nervous
 system, lowering heart rate and cortisol levels...
 ```
@@ -156,7 +162,7 @@ Note: distance is an inverse similarity score -- lower means more relevant.
 | 1 | Semantic Foundation | Complete |
 | 2 | Vector Store | Complete |
 | 3 | RAG Pipeline | Complete |
-| 4 | Document Ingestion | Planned |
+| 4 | Document Ingestion | Complete |
 | 5 | Retrieval Quality | Planned |
 | 6 | Search and Chat Mode | Planned |
 | 7 | Role-Based Document Access | Planned |
@@ -173,14 +179,15 @@ See [docs/implementation-plan.md](./docs/implementation-plan.md) for full phase
 - **Model consistency** -- the same embedding model must be used for both documents and queries
 - **Vector database** -- stores embeddings with metadata and retrieves them by similarity using ANN search
 - **RAG** -- Retrieval-Augmented Generation: retrieve relevant context, then generate a grounded answer
+- **Document parsing** -- format-specific extraction that converts PDF, DOCX, and Markdown into plain text before chunking; all formats share the same embedding and storage flow after parsing
 
 ---
 
 ## Diagrams
 
-Pipeline diagrams are maintained as PlantUML source files in `docs/` and auto-exported to SVG on every push to main using [diagram-sync](https://www.npmjs.com/package/diagram-sync).
+Pipeline diagrams are maintained as PlantUML source files in `docs/` and exported to SVG by running `npx diagram-sync` (see [diagram-sync](https://www.npmjs.com/package/diagram-sync) on npm).
 
-The three diagrams below show the system growing phase by phase -- each one builds on the previous.
+The diagrams below show the system growing phase by phase -- each one builds on the previous.
 
 ### Phase 1 -- Semantic Search (cosine similarity over JSON embeddings)
 
@@ -193,3 +200,23 @@ The three diagrams below show the system growing phase by phase -- each one buil
 ### Phase 3 -- RAG Pipeline (generation on top of retrieval)
 
 ![RAG Pipeline](./diagrams/docs/pipeline-rag.svg)
+
+### Phase 4 -- Document Ingestion (multi-format parsing, deduplication)
+
+The ingestion flow is split into 4 focused diagrams - read in this order:
+
+**1. Entry and Routing** - CLI validation, collection setup, file vs directory routing
+
+![Entry and Routing](./diagrams/docs/pipeline-document-ingestion-entry-routing.svg)
+
+**2. Parsing** - router extension resolution, all 4 parsers (PDF / DOCX / MD / TXT), flatten to plain text
+
+![Parsing](./diagrams/docs/pipeline-document-ingestion-parsing.svg)
+
+**3. Chunking and Embedding** - sliding window chunking, OpenAI embeddings API call
+
+![Chunking and Embedding](./diagrams/docs/pipeline-document-ingestion-chunk-embed.svg)
+
+**4. Upsert** - deduplication check, ChromaDB upsert with full payload
+
+![Upsert](./diagrams/docs/pipeline-document-ingestion-upsert.svg)
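
Review note on the README changes above: the project structure lists `ingest/router.py` as resolving a parser by file extension. The sketch below is one way that dispatch could look; it is not the code in this PR. The function names (`resolve_parser`, `parse_txt`, `parse_markdown`, `parse_pdf`, `parse_docx`) and the regex-based Markdown stripping are assumptions made for illustration. The PDF and DOCX parsers import `pymupdf` (as `fitz`) and `python-docx` lazily, so the module loads even when those packages are absent.

```python
import re
from pathlib import Path
from typing import Callable


def parse_txt(path: Path) -> str:
    """Plain text needs no conversion."""
    return path.read_text(encoding="utf-8")


def parse_markdown(path: Path) -> str:
    """Strip common Markdown syntax with a few regexes (a rough approximation)."""
    text = path.read_text(encoding="utf-8")
    text = re.sub(r"!\[([^\]]*)\]\([^)]*\)", r"\1", text)  # images -> alt text
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)   # links -> link text
    text = re.sub(r"^[ ]{0,3}#{1,6}[ \t]*", "", text, flags=re.MULTILINE)  # heading markers
    text = re.sub(r"[*`]", "", text)                       # emphasis and inline-code marks
    return text


def parse_pdf(path: Path) -> str:
    """Concatenate the text of every page (requires the pymupdf package)."""
    import fitz  # lazy import: module name provided by pymupdf

    with fitz.open(str(path)) as doc:
        return "\n".join(page.get_text() for page in doc)


def parse_docx(path: Path) -> str:
    """Join paragraph text (requires the python-docx package)."""
    import docx  # lazy import: module name provided by python-docx

    document = docx.Document(str(path))
    return "\n".join(paragraph.text for paragraph in document.paragraphs)


# Hypothetical extension table; the suffixes match the four formats the README lists.
PARSERS: dict[str, Callable[[Path], str]] = {
    ".txt": parse_txt,
    ".md": parse_markdown,
    ".pdf": parse_pdf,
    ".docx": parse_docx,
}


def resolve_parser(path: Path) -> Callable[[Path], str]:
    """Return the parser for a file, keyed on its lower-cased suffix."""
    try:
        return PARSERS[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"Unsupported file type: {path.suffix!r}") from None
```

Every parser returns a plain string, so chunking, embedding, and storage do not need to know which format a file started in. This matches the "Document parsing" concept the README adds.
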
diff --git a/docs/README.md b/docs/README.md
@@ -0,0 +1,52 @@
+# Docs
+
+This folder contains the implementation plan, phase-by-phase notes, and sequence diagrams for the RAG Document Engine.
+
+---
+
+## Implementation Plan
+
+| File | Description |
+| ---- | ----------- |
+| `implementation-plan.md` | Full build plan across all 7 phases — goals, what gets built, stack additions, and questions to answer per phase |
+
+---
+
+## Phase Notes
+
+One file per phase. Each covers the goal, design decisions, and key concepts for that phase.
+
+| File | Phase |
+| ---- | ----- |
+| `phase-1-semantic-foundation.md` | Semantic search from scratch — chunking, embeddings, cosine similarity over flat JSON |
+| `phase-2-vector-store.md` | Replace JSON with ChromaDB — persistent collection, metadata, `collection.query()` |
+| `phase-3-rag-pipeline.md` | Close the loop — retrieval + LLM generation, grounded answers, citations, token budget |
+| `phase-4-document-ingestion.md` | Multi-format ingestion — PDF, DOCX, Markdown parsers, CLI trigger, deduplication |
+| `phase-5-retrieval-quality.md` | Improve retrieval — evaluation set, hybrid search (BM25 + vector), re-ranker, metadata filters |
+| `phase-6-search-and-chat-mode.md` | Two interaction modes — document search and multi-turn chat with conversation history |
+| `phase-7-role-based-document-access.md` | Access control — owner metadata at ingestion, per-user query filters, no bypass path |
+
+---
+
+## Sequence Diagrams
+
+PlantUML sequence diagrams for each pipeline. The tables below link to the rendered SVGs in `diagrams/docs/`; the `.puml` sources in this folder open with any PlantUML-compatible renderer.
+
+### Document Ingestion Pipeline
+
+Covers Phase 4. Split into 4 focused diagrams — read in this order to follow the full flow:
+
+| Order | File | Covers |
+| ----- | ---- | ------- |
+| 1 | [pipeline-document-ingestion-entry-routing.svg](../diagrams/docs/pipeline-document-ingestion-entry-routing.svg) | CLI entry, arg validation, `get_or_create_collection`, file vs directory routing |
+| 2 | [pipeline-document-ingestion-parsing.svg](../diagrams/docs/pipeline-document-ingestion-parsing.svg) | Router extension resolution, all 4 parsers (PDF / DOCX / MD / TXT), flatten to plain text |
+| 3 | [pipeline-document-ingestion-chunk-embed.svg](../diagrams/docs/pipeline-document-ingestion-chunk-embed.svg) | Sliding window chunking, OpenAI embeddings API call |
+| 4 | [pipeline-document-ingestion-upsert.svg](../diagrams/docs/pipeline-document-ingestion-upsert.svg) | Deduplication check, ChromaDB upsert with full payload |
+
+### Other Pipelines
+
+| File | Covers |
+| ---- | ------- |
+| [pipeline-semantic-search.svg](../diagrams/docs/pipeline-semantic-search.svg) | Phase 1 — query embedding, cosine similarity, top-K retrieval over flat JSON |
+| [pipeline-vector-store.svg](../diagrams/docs/pipeline-vector-store.svg) | Phase 2 — ingest and query flow using ChromaDB |
+| [pipeline-rag.svg](../diagrams/docs/pipeline-rag.svg) | Phase 3 — end-to-end RAG: retrieval, token budget, prompt construction, LLM generation, citations |
diff --git a/docs/pipeline-document-ingestion-chunk-embed.puml b/docs/pipeline-document-ingestion-chunk-embed.puml
@@ -0,0 +1,35 @@
+@startuml pipeline-document-ingestion-chunk-embed
+skinparam sequenceMessageAlign center
+skinparam ParticipantPadding 10
+
+participant "ingest_file()" as ingest
+participant "utils\nchunk_text()" as chunker
+participant "embed\nembed_chunks()" as embedder
+participant "OpenAI\nEmbeddings API" as openai
+
+== Chunking ==
+
+ingest -> chunker : chunk_text(text, chunk_size = 300, overlap = 50)
+activate chunker
+chunker -> chunker : words = text.split()
+chunker -> chunker : step = chunk_size - overlap = 250
+loop for i in range(0, len(words), step) with step = 250
+chunker -> chunker : chunk_words = words[i : i + 300]
+chunker -> chunker : chunks.append(" ".join(chunk_words))
+end
+chunker --> ingest : chunks: list[str]
+deactivate chunker
+
+== Embedding ==
+
+ingest -> embedder : embed_chunks(chunks)
+activate embedder
+embedder -> openai : embeddings.create(model = EMBEDDING_MODEL, input = chunks)
+activate openai
+openai --> embedder : CreateEmbeddingResponse (response.data[i].embedding)
+deactivate openai
+embedder -> embedder : build [{"text": chunk, "embedding": [...float]}]
+embedder --> ingest : embedded: list[dict]
+deactivate embedder
+
+@enduml
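
Review note on the chunking diagram above: the "Chunking" section maps almost line for line onto a short Python function. The sketch below follows the steps the diagram names (`words = text.split()`, `step = chunk_size - overlap`, slicing `words[i : i + chunk_size]`). The argument check at the top is an addition for illustration and may not exist in the project's `utils.chunk_text`.

```python
def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into overlapping word windows, as drawn in the diagram."""
    # Assumption: guard against a zero or negative step, which would make
    # range() raise (step == 0) or produce no chunks (step < 0).
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    words = text.split()           # whitespace tokenisation
    step = chunk_size - overlap    # 300 - 50 = 250 with the defaults

    chunks: list[str] = []
    for i in range(0, len(words), step):
        chunk_words = words[i : i + chunk_size]
        chunks.append(" ".join(chunk_words))
    return chunks
```

With the defaults, a 600-word document yields three chunks starting at words 0, 250, and 500, holding 300, 300, and 100 words. Because the loop runs to `len(words)`, the final window can be short. When the remaining words number `overlap` or fewer, the previous chunk already contains every word in the last one.
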
diff --git a/docs/pipeline-document-ingestion-entry-routing.puml b/docs/pipeline-document-ingestion-entry-routing.puml
@@ -0,0 +1,62 @@
+@startuml pipeline-document-ingestion-entry-routing
+skinparam sequenceMessageAlign center
+skinparam ParticipantPadding 10
+
+actor User as user
+participant "ingest.py" as main
+participant "ChromaDB\nPersistentClient" as chromaClient
+participant "ChromaDB\nCollection" as collection
+
+== Module Init (at import time) ==
+
+main -> chromaClient : PersistentClient(path = "./chroma_db")
+activate chromaClient
+chromaClient --> main : client (persistent, on-disk)
+deactivate chromaClient
+
+== main() Entry ==
+
+user -> main : python ingest.py <path>
+activate main
+
+opt len(sys.argv) < 2 (no argument provided)
+main --> user : print "Usage: python ingest.py <file_or_directory>"
+main --> user : sys.exit(1)
+end
+
+main -> chromaClient : get_or_create_collection(name = "documents")
+activate chromaClient
+chromaClient --> main : collection
+deactivate chromaClient
+activate collection
+
+== Path Routing ==
+
+alt target.is_file()
+main -> main : ingest_file(collection, filepath)
+
+else target.is_dir()
+main -> main : filter files by suffix (.txt | .pdf | .docx | .md)
+alt no supported files found in directory
+main --> user : print "No supported files found in {target}"
+main --> user : sys.exit(0)
+else supported files exist
+loop for each file in directory
+main -> main : ingest_file(collection, filepath)
+end
+end
+
+else path does not exist
+main --> user : print "Path not found: {target}"
+main --> user : sys.exit(1)
+end
+
+== Summary ==
+
+main -> collection : count()
+collection --> main : total_count
+main --> user : print "Total vectors in collection: {total_count}"
+deactivate collection
+deactivate main
+
+@enduml
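
Review note on the entry and routing diagram above: the argument check and the file-versus-directory branch can be sketched without touching Chroma. This is a minimal sketch, not the PR's `ingest.py`. `collect_files`, `main(argv)`, and `SUPPORTED_SUFFIXES` are hypothetical names, the directory scan is assumed to be non-recursive, and a `print` stands in for `ingest_file(collection, filepath)`. The messages and exit codes are the ones the diagram shows.

```python
from pathlib import Path

# The four suffixes the diagram filters on.
SUPPORTED_SUFFIXES = {".txt", ".pdf", ".docx", ".md"}


def collect_files(target: Path) -> list[Path]:
    """Resolve a CLI path into the list of files to ingest."""
    if target.is_file():
        # A single file goes straight through; the parser router decides
        # later whether its extension is supported.
        return [target]
    if target.is_dir():
        # Assumption: top-level files only, in sorted order for repeatable runs.
        return sorted(
            p for p in target.iterdir()
            if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
        )
    raise FileNotFoundError(f"Path not found: {target}")


def main(argv: list[str]) -> int:
    """Return the process exit code instead of calling sys.exit directly."""
    if len(argv) < 2:
        print("Usage: python ingest.py <file_or_directory>")
        return 1

    target = Path(argv[1])
    try:
        files = collect_files(target)
    except FileNotFoundError as exc:
        print(exc)
        return 1

    if not files:
        print(f"No supported files found in {target}")
        return 0

    for filepath in files:
        # Stand-in for ingest_file(collection, filepath).
        print(f"Would ingest {filepath}")
    return 0
```

Returning the exit code from `main` keeps the routing testable. A real entry point would end with `sys.exit(main(sys.argv))` and print the collection count after the loop, as in the diagram's "Summary" section.
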