If you’ve ever asked an LLM a question about your docs, tickets, or notes, you’ve seen the problem: the model is smart, but it doesn’t automatically know what’s inside your data.
Retrieval-Augmented Generation (RAG) fixes this by doing two things at query time:
- Retrieve the most relevant snippets from your knowledge base (using vector search), then
- Generate an answer using those snippets as grounded context.
In this post we’ll build a small, practical RAG system in Python that you can run locally. We’ll keep it intentionally simple (one file, minimal moving parts), but the architecture matches what production systems do.
This post assumes you're already familiar with basic agentic AI implementation in Python. If not, feel free to start with our beginner's guide to creating AI agents with Python.
What we’re building
- A tiny “knowledge base” made from a handful of text files
- Chunking + embeddings
- A local vector database (Chroma) for similarity search
- A query function that retrieves top-K chunks and asks an LLM to answer
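By the end, the project will look roughly like this (the top-level folder name is up to you; chroma_db/ is created automatically the first time you build the index):

rag-demo/
├── data/              # your .txt knowledge base
├── build_index.py     # Step 3: chunk, embed, and index
├── rag_demo.py        # Step 4: retrieve and answer
└── chroma_db/         # Chroma's persistent store (auto-created)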
Step 0: Install dependencies
Create a virtual environment (recommended), then install:
python -m venv .venv
source .venv/bin/activate
pip install chromadb sentence-transformers openai python-dotenv
Why these?
- chromadb: a lightweight vector store you can run locally
- sentence-transformers: creates embeddings without needing an external API
- openai: used for the final "answer drafting" step (swap in any LLM you prefer)
- python-dotenv: loads your API key and model settings from a .env file
Pro Tip 💡: If you're new to this, go through our guide to setting up local Python environments first.
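Since we'll load settings with python-dotenv, you can keep your API configuration in a .env file at the project root. A minimal sketch (placeholder values; the variable names match what the query script reads later):

# .env
OPENAI_API_KEY=your-key-here
# optional; the query script falls back to gpt-4o-mini
OPENAI_MODEL=gpt-4o-mini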
Step 1: Prepare your documents
Create a folder like data/ and add a few .txt files. For example:
data/product.txt
data/faq.txt
data/policies.txt
RAG works best when your source text is reasonably clean (no nav bars, no giant blobs of code unless that’s what you’re searching).
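If you don't have real documents handy, a couple of short paragraphs per file is enough to follow along. For example, data/policies.txt might contain something like this (purely illustrative sample text):

Refund policy: Customers can request a full refund within 30 days of purchase.
After 30 days, refunds are issued as store credit only. Approved refunds are
processed within 5 business days.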
Step 2: Chunk the documents (small pieces beat one giant document)
Vector search doesn't "retrieve a whole PDF"; it retrieves chunks. Chunking matters more than most people think:
- Too big: retrieval gets fuzzy and you bring in irrelevant stuff
- Too small: you lose context and the answer becomes incomplete
A solid default is ~800–1200 characters with a bit of overlap.
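As a quick sanity check of those numbers, here's what the chunk_text helper we define in the next step produces for a ~2,500-character string (a throwaway snippet, not part of the pipeline):

from build_index import chunk_text  # defined in Step 3

text = "word " * 500  # roughly 2,500 characters of dummy text
for start, end, chunk in chunk_text(text, chunk_size=1000, overlap=200):
    print(start, end, len(chunk))
# Each chunk is ~1,000 characters, and consecutive chunks share ~200
# characters, so sentences near a boundary show up in both chunks.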
Step 3: Embed + index into Chroma
Below is a complete script that:
- loads all .txt files in data/
- splits them into overlapping chunks
- embeds each chunk using sentence-transformers
- stores vectors + text in a persistent Chroma collection
import os
import glob
from dataclasses import dataclass

import chromadb
from sentence_transformers import SentenceTransformer

@dataclass
class Chunk:
    id: str
    text: str
    source: str
    start: int
    end: int

def read_text_files(folder: str = "data"):
    paths = sorted(glob.glob(os.path.join(folder, "**/*.txt"), recursive=True))
    docs = []
    for p in paths:
        with open(p, "r", encoding="utf-8") as f:
            docs.append((p, f.read()))
    return docs

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200):
    # simple, safe char-based chunking (guards against non-advancing loops)
    if chunk_size <= 0:
        raise ValueError("chunk_size must be > 0")
    if overlap < 0:
        raise ValueError("overlap must be >= 0")
    if overlap >= chunk_size:
        # prevent non-advancing loops
        overlap = max(0, chunk_size - 1)

    chunks = []
    start = 0
    text_len = len(text)
    while start < text_len:
        end = min(start + chunk_size, text_len)
        chunk = text[start:end].strip()
        if chunk:
            chunks.append((start, end, chunk))
        if end >= text_len:
            break
        next_start = end - overlap
        if next_start <= start:
            # fail-safe to avoid infinite loop
            next_start = end
        start = next_start
    return chunks

def build_index(
    persist_dir: str = "./chroma_db",
    collection_name: str = "knowledge_base",
    embedding_model_name: str = "all-MiniLM-L6-v2",
):
    # Initialize Chroma (local persistent)
    client = chromadb.PersistentClient(path=persist_dir)

    # Create or load collection
    collection = client.get_or_create_collection(name=collection_name)

    # Embedding model
    embedder = SentenceTransformer(embedding_model_name)

    # Load docs
    docs = read_text_files("data")
    if not docs:
        raise RuntimeError("No .txt files found in ./data")

    # Build chunks
    all_chunks: list[Chunk] = []
    for path, text in docs:
        for i, (s, e, c) in enumerate(chunk_text(text)):
            chunk_id = f"{os.path.basename(path)}::{i}"
            all_chunks.append(Chunk(id=chunk_id, text=c, source=path, start=s, end=e))

    # Upsert into Chroma
    texts = [c.text for c in all_chunks]
    ids = [c.id for c in all_chunks]
    metadatas = [{"source": c.source, "start": c.start, "end": c.end} for c in all_chunks]
    embeddings = embedder.encode(texts, normalize_embeddings=True).tolist()

    # Use upsert so you can re-run the script without duplicates
    collection.upsert(ids=ids, documents=texts, metadatas=metadatas, embeddings=embeddings)
    print(f"Indexed {len(all_chunks)} chunks into collection '{collection_name}'.")


if __name__ == "__main__":
    build_index()
Filename: build_index.py
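To confirm the index was actually written, you can open the persistent store and count what's in it (a quick throwaway check, not part of the pipeline):

import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection(name="knowledge_base")
print(collection.count(), "chunks indexed")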
Step 4: Retrieve relevant chunks for a question
Now we’ll add a search function. At runtime, we:
- embed the user’s question
- vector-search top-K similar chunks
- use those chunks as context for the LLM
Put the following in a second script (rag_demo.py, shown below). This example uses OpenAI for generation, but you can swap it for any chat model.
import os

import chromadb
from dotenv import load_dotenv
from openai import OpenAI
from sentence_transformers import SentenceTransformer

load_dotenv()  # pulls OPENAI_API_KEY / OPENAI_MODEL from .env if present

def retrieve(collection, embedder, query: str, k: int = 5):
    q_emb = embedder.encode([query], normalize_embeddings=True).tolist()[0]
    results = collection.query(query_embeddings=[q_emb], n_results=k)
    # Chroma returns lists-of-lists (one list per query)
    docs = results.get("documents", [[]])[0]
    metas = results.get("metadatas", [[]])[0]
    retrieved = []
    for d, m in zip(docs, metas):
        retrieved.append({"text": d, "meta": m})
    return retrieved

def build_prompt(query: str, retrieved: list[dict]):
    context_blocks = []
    for i, item in enumerate(retrieved, start=1):
        src = item["meta"].get("source", "unknown")
        context_blocks.append(f"[Source {i}: {src}]\n{item['text']}")
    context = "\n\n".join(context_blocks)

    system = (
        "You are a helpful assistant. Answer the user's question using ONLY the provided context. "
        "If the context is insufficient, say what is missing and ask a clarifying question."
    )
    user = f"""Question: {query}

Context:
{context}

Instructions:
- Cite sources like (Source 1) when you use them.
- Be concise, but don't omit critical details.
"""
    return system, user

def answer_with_rag(query: str, persist_dir: str = "./chroma_db", collection_name: str = "knowledge_base"):
    # collection_name must match the one used in build_index.py
    client = chromadb.PersistentClient(path=persist_dir)
    collection = client.get_collection(name=collection_name)
    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    retrieved = retrieve(collection, embedder, query=query, k=5)
    system, user = build_prompt(query, retrieved)

    # Choose your model here (gpt-4o-mini is a common low-cost default)
    oai = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
    resp = oai.chat.completions.create(
        model=os.environ.get("OPENAI_MODEL", "gpt-4o-mini"),
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        temperature=0.2,
    )
    return retrieved, resp.choices[0].message.content

if __name__ == "__main__":
    # Example usage
    question = "What is our refund policy?"
    retrieved, answer = answer_with_rag(question)

    print("\n--- Retrieved context ---")
    for i, r in enumerate(retrieved, start=1):
        print(f"\n[Source {i}] {r['meta'].get('source')}\n{r['text'][:200]}...")

    print("\n--- Answer ---")
    print(answer)
Filename: rag_demo.py
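Once you've built the index in Step 5, you can also poke at it interactively with a small loop around answer_with_rag() (optional; it just reuses the function from rag_demo.py):

from rag_demo import answer_with_rag

while True:
    q = input("\nAsk a question (or 'quit'): ").strip()
    if not q or q.lower() in {"quit", "exit"}:
        break
    _, answer = answer_with_rag(q)
    print(answer)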
Step 5: Run it
1) First build the index:
python build_index.py
2) Set your OpenAI key (if you're using OpenAI for generation):
export OPENAI_API_KEY="..."
# optional
export OPENAI_MODEL="gpt-4o-mini"
3) Run a query and inspect what it retrieved and how it answered:
python rag_demo.py

Common pitfalls (and how to avoid them)
- Bad chunking: If answers feel random, tune chunk size/overlap before touching anything else.
- Top-K too low/high: Too low misses context; too high floods the model with noise. Try 3–8.
- No grounding instructions: Always tell the model to use only provided context and to admit when it can’t answer.
- Stale index: If your docs change, you need a re-index step (or incremental updates).
Where to go from here (small upgrades that matter)
- Better loaders: Parse HTML/PDFs properly instead of raw copy/paste.
- Metadata filtering: Restrict retrieval by document type, project, or date (see the sketch after this list).
- Hybrid search: Combine keyword + vector search for better precision.
- Evaluation: Keep a small set of Q&A pairs and measure retrieval quality (did we fetch the right chunks?).
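As a taste of metadata filtering, Chroma accepts a where filter directly in query(), so restricting retrieval to a single document is a small variation on retrieve(). A sketch (the where value must match the source paths stored at index time, e.g. "data/policies.txt"):

def retrieve_filtered(collection, embedder, query: str, source: str, k: int = 5):
    # Same as retrieve(), but only searches chunks whose "source" metadata
    # matches the given file path.
    q_emb = embedder.encode([query], normalize_embeddings=True).tolist()[0]
    results = collection.query(
        query_embeddings=[q_emb],
        n_results=k,
        where={"source": source},
    )
    docs = results.get("documents", [[]])[0]
    metas = results.get("metadatas", [[]])[0]
    return [{"text": d, "meta": m} for d, m in zip(docs, metas)]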