# The Problem RAG Solves
LLMs are trained on data up to a certain cutoff date and know nothing about your company's internal documents: SOPs, reports, policies, or recent emails. Fine-tuning the model for every update is prohibitively expensive.
RAG solves this more elegantly: before answering, the system automatically retrieves relevant documents and includes them in the LLM's context.
## How It Works: 3 Phases
Phase 1 - Indexing (done once):
1. Cut documents into ~500-word chunks
2. Convert each chunk into an 'embedding', a numerical vector representing its semantic meaning
3. Store all embeddings in a vector database (Pinecone, ChromaDB)
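The chunking in step 1 can be sketched as a simple word-based splitter. This is a minimal illustration, not any library's API; the `chunk_words` and `overlap` parameters are assumptions (a small overlap keeps sentences cut at a boundary intact in at least one chunk):

```python
def chunk_text(text, chunk_words=500, overlap=50):
    """Split text into ~chunk_words-word pieces with a small overlap."""
    words = text.split()
    step = chunk_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(' '.join(words[start:start + chunk_words]))
    return chunks
```

Production systems often split on sentence or paragraph boundaries instead of raw word counts, but the idea is the same.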
Phase 2 - Retrieval (when a query arrives):
1. Convert user question into an embedding
2. Find k most similar embeddings (cosine similarity)
3. Retrieve the corresponding document text
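The vector database handles step 2 internally, but the similarity measure itself is simple. A minimal sketch of cosine similarity and a brute-force top-k search (the helper names are illustrative):

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|): 1.0 = same direction, 0.0 = orthogonal
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_emb, embeddings, k=3):
    # Rank stored embeddings by similarity to the query; return their indices
    ranked = sorted(range(len(embeddings)),
                    key=lambda i: cosine_similarity(query_emb, embeddings[i]),
                    reverse=True)
    return ranked[:k]
```

Real vector databases use approximate nearest-neighbor indexes (e.g. HNSW) to avoid scanning every embedding, but the score they optimize is the same.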
Phase 3 โ Generation:
Combine: [system instructions] + [relevant docs] + [user question] → send to LLM
## Simple Python Implementation
```python
from openai import OpenAI
import chromadb

client = OpenAI()
chroma = chromadb.Client()
collection = chroma.create_collection('docs')

def index_doc(text, doc_id):
    # Embed the chunk and store it alongside the raw text
    emb = client.embeddings.create(
        model='text-embedding-3-small', input=text
    ).data[0].embedding
    collection.add(embeddings=[emb], documents=[text], ids=[doc_id])

def rag_query(question, k=3):
    # Embed the question with the same model used at indexing time
    q_emb = client.embeddings.create(
        model='text-embedding-3-small', input=question
    ).data[0].embedding
    # Fetch the k most similar chunks
    results = collection.query(query_embeddings=[q_emb], n_results=k)
    context = '\n'.join(results['documents'][0])
    resp = client.chat.completions.create(
        model='gpt-4o',
        messages=[
            {'role': 'system',
             'content': f'Answer based on this context:\n{context}'},
            {'role': 'user', 'content': question},
        ])
    return resp.choices[0].message.content
```
## Keys to RAG Success
- Chunking: chunks that are too small lose context; chunks that are too large waste tokens and dilute relevance
- Metadata filtering: restrict retrieval to the right documents, e.g. by department or date
- Hybrid search: combine semantic (embedding) search with keyword search for exact terms
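One common way to implement hybrid search is reciprocal rank fusion (RRF), which merges the ranked result lists from semantic and keyword search without needing to compare their raw scores. A minimal sketch (the function name is illustrative; `k=60` is the constant commonly used in the RRF literature):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc ids into one ranking.

    Each list contributes 1 / (k + rank + 1) to a document's score,
    so documents ranked highly by multiple searches rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

For example, a document ranked second by semantic search and first by keyword search will usually beat one that only appears in a single list.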