15 changes: 2 additions & 13 deletions README.md
@@ -6,32 +6,21 @@ A progressive RAG system built from first principles -- from raw embeddings and

## What It Does (Current State)

-<table>
-<tr>
-<td valign="top" width="55%">

-Ingestion
+**Ingestion**

1. **Loads** `.txt` files (PDF, DOCX, Markdown from Phase 4)
2. **Chunks** each document into overlapping word windows
3. **Embeds** each chunk using OpenAI `text-embedding-3-small`, producing a 1536-dimensional vector
4. **Stores** vectors with metadata (`source`, `chunk_index`) in a persistent Chroma collection

-Search
+**Search**

1. **Embeds** the query using the same model
2. **Queries** Chroma for the top-K nearest vectors using built-in ANN (Approximate Nearest Neighbor) search
3. **Returns** results with chunk text, source filename, and distance score

-</td>
-<td valign="top" width="45%">
-
-![Pipeline](./diagrams/docs/pipeline-vector-store.svg)
-
-</td>
-</tr>
-</table>

---

## Stack
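
The ingestion steps listed in the README hunk above (chunk into overlapping word windows, then store each chunk with `source` and `chunk_index` metadata) can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the diff names `chunk_text()` but does not show its signature, so the `chunk_size` and `overlap` parameters, the `build_records` helper, and the `"{source}-{i}"` id scheme are all assumptions.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping windows of whitespace-separated words.

    chunk_size and overlap are hypothetical parameters; the real defaults
    are not visible in this diff.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def build_records(source: str, chunks: list[str]) -> dict[str, list]:
    """Pair each chunk with an id and the metadata fields named in the README.

    The id format is an assumption made for this sketch.
    """
    return {
        "ids": [f"{source}-{i}" for i in range(len(chunks))],
        "documents": chunks,
        "metadatas": [
            {"source": source, "chunk_index": i} for i in range(len(chunks))
        ],
    }
```

With `chunk_size=4` and `overlap=2`, each window repeats the last two words of the previous one, so a sentence that straddles a chunk boundary still appears whole in at least one chunk.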
39 changes: 22 additions & 17 deletions docs/pipeline-vector-store.puml
@@ -17,38 +17,43 @@ skinparam ActorBorderColor #448844
title Chroma Vector Store Pipeline

box "Ingestion" #EEF6FF
-participant "load_documents()" as load
-participant "chunk_text()" as chunk
-participant "embed_chunks()" as embed
+collections "documents/" as docs
+participant "ingest.py" as ingest
+participant "utils.py" as utils
+participant "embed.py" as embed
end box

participant "OpenAI API" as openai
database "Chroma DB" as chroma

box "Search" #FFF0F8
actor "Query" as query
-participant "embed_query()" as embedq
-participant "collection.query()" as cquery
+participant "search.py" as search
end box

== Ingestion ==

-[-> load : .txt files
-load -> chunk : text
-chunk -> embed : chunks
-embed -> openai : embed request
+docs -> ingest : .txt files
+ingest -> utils : load_documents()
+utils --> ingest : text
+ingest -> utils : chunk_text()
+utils --> ingest : chunks
+ingest -> embed : embed_chunks()
+embed -> openai : embeddings.create()
openai --> embed : 1536-dim vectors
-embed -> chroma : upsert(ids, embeddings, metadata)
+embed --> ingest : embedded chunks
+ingest -> chroma : upsert(ids, embeddings, documents, metadatas)

== Search ==

-query -> embedq : query string
-embedq -> openai : embed request
-openai --> embedq : query vector
-embedq -> cquery : query vector
-cquery -> chroma : ANN search
-chroma --> cquery : nearest vectors
-cquery --> query : Top-K results\n(text + source + distance)
+query -> search : query string
+search -> embed : embed_query()
+embed -> openai : embeddings.create()
+openai --> embed : query vector
+embed --> search : query vector
+search -> chroma : query(query_embeddings, n_results, include)
+chroma --> search : nearest vectors
+search --> query : Top-K results\n(text + source + distance)

' end
@enduml
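
To make the "Search" half of the diagram concrete, here is a small sketch of what a top-K lookup returns: the chunk text, its source, and a distance score, sorted nearest first. It is deliberately not Chroma's implementation. It swaps the ANN index for an exact brute-force scan and assumes cosine distance, since the collection's actual distance metric is not shown in this diff. The `cosine_distance` and `top_k` names and the shape of `entries` are hypothetical.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat a zero vector as unrelated to everything
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query_vector: list[float], entries: list[dict], k: int = 3) -> list[dict]:
    """Exact nearest-neighbour scan over stored entries (stand-in for ANN search).

    Each entry is assumed to be a dict with "text", "source", and "embedding".
    """
    scored = [
        {
            "text": entry["text"],
            "source": entry["source"],
            "distance": cosine_distance(query_vector, entry["embedding"]),
        }
        for entry in entries
    ]
    return sorted(scored, key=lambda result: result["distance"])[:k]
```

A brute-force scan compares the query against every stored vector, which is fine for a toy corpus. An ANN index trades a little recall for much faster lookups on large collections.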
10 changes: 5 additions & 5 deletions inspect_collection.py
@@ -14,11 +14,11 @@ def main():
)

print("Sample entry:")
-print(f" id: {sample['ids'][0]}")
-print(f" source: {sample['metadatas'][0]['source']}")
-print(f" chunk_index: {sample['metadatas'][0]['chunk_index']}")
-print(f" text: {sample['documents'][0][:120]}...")
-print(f" embedding: [{sample['embeddings'][0][0]:.6f}, {sample['embeddings'][0][1]:.6f}, ...] ({len(sample['embeddings'][0])} dims)")
+print(f"id: {sample['ids'][0]}")
+print(f"source: {sample['metadatas'][0]['source']}")
+print(f"chunk_index: {sample['metadatas'][0]['chunk_index']}")
+print(f"text: {sample['documents'][0][:120]}...")
+print(f"embedding: [{sample['embeddings'][0][0]:.6f}, {sample['embeddings'][0][1]:.6f}, ...] ({len(sample['embeddings'][0])} dims)")


if __name__ == '__main__':
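
The `print` calls in the hunk above index into parallel lists (`ids`, `metadatas`, `documents`, `embeddings`) and read the first entry of each. As a rough sketch of that access pattern, the hypothetical `format_sample` helper below builds the same five lines from a plain dict. How `sample` is actually fetched is outside this hunk, so the dict in the usage note is made-up illustration data.

```python
def format_sample(sample: dict, preview_chars: int = 120) -> list[str]:
    """Build the five summary lines for the first entry of a result dict.

    Assumes `sample` holds parallel lists under the keys used in the diff:
    "ids", "metadatas", "documents", and "embeddings", and that the
    embedding has at least two dimensions.
    """
    metadata = sample["metadatas"][0]
    embedding = sample["embeddings"][0]
    return [
        f"id: {sample['ids'][0]}",
        f"source: {metadata['source']}",
        f"chunk_index: {metadata['chunk_index']}",
        f"text: {sample['documents'][0][:preview_chars]}...",
        f"embedding: [{embedding[0]:.6f}, {embedding[1]:.6f}, ...] "
        f"({len(embedding)} dims)",
    ]
```

For example, `format_sample({"ids": ["notes.txt-0"], "metadatas": [{"source": "notes.txt", "chunk_index": 0}], "documents": ["hello world"], "embeddings": [[0.5, -0.25, 0.125]]})` yields one string per field, which can then be printed with `print("\n".join(lines))`. Returning the lines instead of printing them directly keeps the formatting easy to test.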