If you’ve ever asked an LLM a question about your docs, tickets, or notes, you’ve seen the problem: the model is smart, but it doesn’t automatically know what’s inside your data.
Retrieval-Augmented Generation (RAG) fixes this by doing two things at query time:

1. It retrieves the most relevant chunks of your own data (via vector search over embeddings).
2. It hands those chunks to the LLM as context, so the model drafts its answer from your content instead of guessing.
In this post we’ll build a small, practical RAG system in Python that you can run locally. We’ll keep it intentionally simple (one file, minimal moving parts), but the architecture matches what production systems do.
This post assumes you’re already familiar with basic agentic AI implementation in Python. If not, feel free to start with our beginner’s guide to creating AI agents with Python.
Create a virtual environment (recommended), then install:
python -m venv .venv
source .venv/bin/activate
pip install chromadb sentence-transformers openai python-dotenv
Why these?
- chromadb: a lightweight vector store you can run locally
- sentence-transformers: creates embeddings without needing an external API
- openai: we’ll use it for the final “answer drafting” step (swap in any LLM you prefer)

Pro Tip 💡: If you’re new to this, feel free to go through our guide to setting up Python local environments first.
Create a folder like data/ and add a few .txt files. For example:
- data/product.txt
- data/faq.txt
- data/policies.txt

RAG works best when your source text is reasonably clean (no nav bars, no giant blobs of code unless that’s what you’re searching).
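If you don’t have any documents handy, you can seed the folder with a couple of throwaway files so the rest of the walkthrough has something to retrieve. The file contents below are purely illustrative:

import pathlib

data_dir = pathlib.Path("data")
data_dir.mkdir(exist_ok=True)

# Illustrative placeholder content, just so there is something to index
(data_dir / "policies.txt").write_text(
    "Refunds are accepted within 30 days of purchase with proof of payment.",
    encoding="utf-8",
)
(data_dir / "faq.txt").write_text(
    "Q: Do you ship internationally?\nA: Yes, to most countries within 7-10 business days.",
    encoding="utf-8",
)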
Vector search doesn’t “retrieve a whole PDF,” it retrieves chunks. Chunking matters more than most people think: chunks that are too small lose surrounding context, while chunks that are too large dilute the signal a single embedding can capture.
A solid default is ~800–1200 characters with a bit of overlap.
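To make the overlap concrete, here’s a tiny sketch of how the boundaries advance with a chunk size of 1000 and an overlap of 200 (the same scheme the script below uses); the values are illustrative:

text = "x" * 2500  # stand-in for a real document
chunk_size, overlap = 1000, 200

start = 0
while start < len(text):
    end = min(start + chunk_size, len(text))
    print(start, end)  # prints: 0 1000, then 800 1800, then 1600 2500
    if end >= len(text):
        break
    start = end - overlap  # each new chunk re-covers the last 200 characters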
Below is a complete script that:
- reads the .txt files in data/
- splits them into overlapping chunks
- embeds each chunk with sentence-transformers
- upserts everything into a local Chroma collection

import os
import glob
from dataclasses import dataclass
import chromadb
from chromadb.config import Settings
from sentence_transformers import SentenceTransformer
@dataclass
class Chunk:
    id: str
    text: str
    source: str
    start: int
    end: int
def read_text_files(folder: str = "data"):
    paths = sorted(glob.glob(os.path.join(folder, "**/*.txt"), recursive=True))
    docs = []
    for p in paths:
        with open(p, "r", encoding="utf-8") as f:
            docs.append((p, f.read()))
    return docs
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200):
    # simple, safe char-based chunking (guards against non-advancing loops)
    if chunk_size <= 0:
        raise ValueError("chunk_size must be > 0")
    if overlap < 0:
        raise ValueError("overlap must be >= 0")
    if overlap >= chunk_size:
        # prevent non-advancing loops
        overlap = max(0, chunk_size - 1)
    chunks = []
    start = 0
    text_len = len(text)
    while start < text_len:
        end = min(start + chunk_size, text_len)
        chunk = text[start:end].strip()
        if chunk:
            chunks.append((start, end, chunk))
        if end >= text_len:
            break
        next_start = end - overlap
        if next_start <= start:
            # fail-safe to avoid infinite loop
            next_start = end
        start = next_start
    return chunks
def build_index(
    persist_dir: str = "./chroma_db",
    collection_name: str = "knowledge_base",
    embedding_model_name: str = "all-MiniLM-L6-v2",
):
    # Initialize Chroma (local, persistent)
    client = chromadb.PersistentClient(path=persist_dir)
    # Create or load collection
    collection = client.get_or_create_collection(name=collection_name)
    # Embedding model
    embedder = SentenceTransformer(embedding_model_name)
    # Load docs
    docs = read_text_files("data")
    if not docs:
        raise RuntimeError("No .txt files found in ./data")
    # Build chunks
    all_chunks: list[Chunk] = []
    for path, text in docs:
        for i, (s, e, c) in enumerate(chunk_text(text)):
            chunk_id = f"{os.path.basename(path)}::{i}"
            all_chunks.append(Chunk(id=chunk_id, text=c, source=path, start=s, end=e))
    # Upsert into Chroma
    texts = [c.text for c in all_chunks]
    ids = [c.id for c in all_chunks]
    metadatas = [{"source": c.source, "start": c.start, "end": c.end} for c in all_chunks]
    embeddings = embedder.encode(texts, normalize_embeddings=True).tolist()
    # Use upsert so you can re-run the script without duplicates
    collection.upsert(ids=ids, documents=texts, metadatas=metadatas, embeddings=embeddings)
    print(f"Indexed {len(all_chunks)} chunks into collection '{collection_name}'.")


if __name__ == "__main__":
    build_index()
Filename: build_index.py
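If you want to confirm the index looks sane before wiring up retrieval, a quick check like this works (assuming the default persist_dir and collection name from build_index.py):

import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection(name="knowledge_base")
print(collection.count())                     # number of stored chunks
print(collection.peek(limit=2)["documents"])  # a couple of raw chunk texts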
Now we’ll add a search function. At runtime, we:

- embed the user’s question with the same model we used for indexing
- query Chroma for the top-k most similar chunks
- build a prompt that packs those chunks in as context
- send the prompt to an LLM to draft the answer
Add the following to the same file (or create a second script). This example uses OpenAI for generation, but you can swap it for any chat model.
import os

import chromadb
from dotenv import load_dotenv
from openai import OpenAI
from sentence_transformers import SentenceTransformer
load_dotenv()
def retrieve(collection, embedder, query: str, k: int = 5):
    q_emb = embedder.encode([query], normalize_embeddings=True).tolist()[0]
    results = collection.query(query_embeddings=[q_emb], n_results=k)
    # Chroma returns lists-of-lists (one list per query)
    docs = results.get("documents", [[]])[0]
    metas = results.get("metadatas", [[]])[0]
    retrieved = []
    for d, m in zip(docs, metas):
        retrieved.append({"text": d, "meta": m})
    return retrieved
def build_prompt(query: str, retrieved: list[dict]):
    context_blocks = []
    for i, item in enumerate(retrieved, start=1):
        src = item["meta"].get("source", "unknown")
        context_blocks.append(f"[Source {i}: {src}]\n{item['text']}")
    context = "\n\n".join(context_blocks)
    system = (
        "You are a helpful assistant. Answer the user's question using ONLY the provided context. "
        "If the context is insufficient, say what is missing and ask a clarifying question."
    )
    user = f"""Question: {query}
Context:
{context}
Instructions:
- Cite sources like (Source 1) when you use them.
- Be concise, but don't omit critical details.
"""
    return system, user
def answer_with_rag(query: str, persist_dir: str = "./chroma_db", collection_name: str = "knowledge_base"):
    # collection_name must match the one used in build_index.py
    client = chromadb.PersistentClient(path=persist_dir)
    collection = client.get_collection(name=collection_name)
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    retrieved = retrieve(collection, embedder, query=query, k=5)
    system, user = build_prompt(query, retrieved)
    # Choose your model here (gpt-4o-mini is a common low-cost default)
    oai = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
    resp = oai.chat.completions.create(
        model=os.environ.get("OPENAI_MODEL", "gpt-4o-mini"),
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        temperature=0.2,
    )
    return retrieved, resp.choices[0].message.content
if __name__ == "__main__":
    # Example usage
    question = "What is our refund policy?"
    retrieved, answer = answer_with_rag(question)
    print("\n--- Retrieved context ---")
    for i, r in enumerate(retrieved, start=1):
        print(f"\n[Source {i}] {r['meta'].get('source')}\n{r['text'][:200]}...")
    print("\n--- Answer ---")
    print(answer)
Filename: rag_demo.py
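If you’d rather query interactively than hard-code a question, you can swap the __main__ block of rag_demo.py for a small loop. This is just a sketch of that variation:

if __name__ == "__main__":
    while True:
        question = input("\nAsk a question (or type 'quit'): ").strip()
        if not question or question.lower() in {"quit", "exit"}:
            break
        retrieved, answer = answer_with_rag(question)
        print("\n--- Answer ---")
        print(answer)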
1) First build the index:
python build_index.py
2) Set your OpenAI key (if you’re using OpenAI for generation):
export OPENAI_API_KEY="..."
# optional
export OPENAI_MODEL="gpt-4o-mini"
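Because rag_demo.py calls load_dotenv(), you can also put these in a .env file next to the scripts instead of exporting them each time:

OPENAI_API_KEY=...
OPENAI_MODEL=gpt-4o-mini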
3) Run a query and inspect what it retrieved and how it answered.
python rag_demo.py