If you’ve ever asked an LLM a question about your docs, tickets, or notes, you’ve seen the problem: the model is smart, but it doesn’t automatically know what’s inside your data.
Retrieval-Augmented Generation (RAG) fixes this by doing two things at query time:
- Retrieve the most relevant snippets from your knowledge base (using vector search), then
- Generate an answer using those snippets as grounded context.
In this post we’ll build a small, practical RAG system in Python that you can run locally. We’ll keep it intentionally simple (one file, minimal moving parts), but the architecture matches what production systems do.
This post assumes you're already familiar with basic agentic AI implementation in Python. If not, feel free to start with our beginner's guide to creating AI agents with Python.
What we’re building
- A tiny “knowledge base” made from a handful of text files
- Chunking + embeddings
- A local vector database (Chroma) for similarity search
- A query function that retrieves top-K chunks and asks an LLM to answer
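By the end, the project will look roughly like this (the top-level folder name is up to you; chroma_db/ is created automatically the first time you build the index):

rag-demo/
├── data/              # your .txt knowledge base
├── build_index.py     # Step 3: chunk, embed, and index
├── rag_demo.py        # Step 4: retrieve and answer
└── chroma_db/         # Chroma's persistent store (auto-created)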
Step 0: Install dependencies
Create a virtual environment (recommended), then install:
python -m venv .venv
source .venv/bin/activate
pip install chromadb sentence-transformers openai python-dotenv
Why these?
- chromadb: a lightweight vector store you can run locally
- sentence-transformers: creates embeddings without needing an external API
- openai: used for the final "answer drafting" step (swap in any LLM you prefer)
- python-dotenv: loads your API key and model settings from a .env file
Pro Tip 💡: If you're new to this, go through our guide to setting up local Python environments first.
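Since we'll load settings with python-dotenv, you can keep your API configuration in a .env file at the project root. A minimal sketch (placeholder values; the variable names match what the query script reads later):

# .env
OPENAI_API_KEY=your-key-here
# optional; the query script falls back to gpt-4o-mini
OPENAI_MODEL=gpt-4o-mini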
Step 1: Prepare your documents
Create a folder like data/ and add a few .txt files. For example:
data/product.txt
data/faq.txt
data/policies.txt
RAG works best when your source text is reasonably clean (no nav bars, no giant blobs of code unless that’s what you’re searching).
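If you don't have real documents handy, a couple of short paragraphs per file is enough to follow along. For example, data/policies.txt might contain something like this (purely illustrative sample text):

Refund policy: Customers can request a full refund within 30 days of purchase.
After 30 days, refunds are issued as store credit only. Approved refunds are
processed within 5 business days.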
Step 2: Chunk the documents (small pieces beat one giant document)
Vector search doesn't "retrieve a whole PDF"; it retrieves chunks. Chunking matters more than most people think:
- Too big: retrieval gets fuzzy and you bring in irrelevant stuff
- Too small: you lose context and the answer becomes incomplete
A solid default is ~800–1200 characters with a bit of overlap.
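As a quick sanity check of those numbers, here's what the chunk_text helper we define in the next step produces for a ~2,500-character string (a throwaway snippet, not part of the pipeline):

from build_index import chunk_text  # defined in Step 3

text = "word " * 500  # roughly 2,500 characters of dummy text
for start, end, chunk in chunk_text(text, chunk_size=1000, overlap=200):
    print(start, end, len(chunk))
# Each chunk is ~1,000 characters, and consecutive chunks share ~200
# characters, so sentences near a boundary show up in both chunks.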
Step 3: Embed + index into Chroma
Below is a complete script that:
- loads all .txt files in data/
- splits them into overlapping chunks
- embeds each chunk using sentence-transformers
- stores vectors + text in a persistent Chroma collection
import os
import glob
from dataclasses import dataclass

import chromadb
from sentence_transformers import SentenceTransformer

@dataclass
class Chunk:
    id: str
    text: str
    source: str
    start: int
    end: int

def read_text_files(folder: str = "data"):
    paths = sorted(glob.glob(os.path.join(folder, "**/*.txt"), recursive=True))
    docs = []
    for p in paths:
        with open(p, "r", encoding="utf-8") as f:
            docs.append((p, f.read()))
    return docs

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200):
    # simple, safe char-based chunking (guards against non-advancing loops)
    if chunk_size <= 0:
        raise ValueError("chunk_size must be > 0")
    if overlap < 0:
        raise ValueError("overlap must be >= 0")
    if overlap >= chunk_size:
        # prevent non-advancing loops
        overlap = max(0, chunk_size - 1)

    chunks = []
    start = 0
    text_len = len(text)
    while start < text_len:
        end = min(start + chunk_size, text_len)
        chunk = text[start:end].strip()
        if chunk:
            chunks.append((start, end, chunk))
        if end >= text_len:
            break
        next_start = end - overlap
        if next_start <= start:
            # fail-safe to avoid infinite loop
            next_start = end
        start = next_start
    return chunks

def build_index(
    persist_dir: str = "./chroma_db",
    collection_name: str = "knowledge_base",
    embedding_model_name: str = "all-MiniLM-L6-v2",
):
    # Initialize Chroma (local persistent)
    client = chromadb.PersistentClient(path=persist_dir)

    # Create or load collection
    collection = client.get_or_create_collection(name=collection_name)

    # Embedding model
    embedder = SentenceTransformer(embedding_model_name)

    # Load docs
    docs = read_text_files("data")
    if not docs:
        raise RuntimeError("No .txt files found in ./data")

    # Build chunks
    all_chunks: list[Chunk] = []
    for path, text in docs:
        for i, (s, e, c) in enumerate(chunk_text(text)):
            chunk_id = f"{os.path.basename(path)}::{i}"
            all_chunks.append(Chunk(id=chunk_id, text=c, source=path, start=s, end=e))

    # Upsert into Chroma
    texts = [c.text for c in all_chunks]
    ids = [c.id for c in all_chunks]
    metadatas = [{"source": c.source, "start": c.start, "end": c.end} for c in all_chunks]
    embeddings = embedder.encode(texts, normalize_embeddings=True).tolist()

    # Use upsert so you can re-run the script without duplicates
    collection.upsert(ids=ids, documents=texts, metadatas=metadatas, embeddings=embeddings)
    print(f"Indexed {len(all_chunks)} chunks into collection '{collection_name}'.")


if __name__ == "__main__":
    build_index()
Filename: build_index.py
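To confirm the index was actually written, you can open the persistent store and count what's in it (a quick throwaway check, not part of the pipeline):

import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection(name="knowledge_base")
print(collection.count(), "chunks indexed")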
Step 4: Retrieve relevant chunks for a question
Now we’ll add a search function. At runtime, we:
- embed the user’s question
- vector-search top-K similar chunks
- use those chunks as context for the LLM
Put the following in a second script (rag_demo.py, shown below). This example uses OpenAI for generation, but you can swap it for any chat model.
import os

import chromadb
from dotenv import load_dotenv
from openai import OpenAI
from sentence_transformers import SentenceTransformer

load_dotenv()  # pulls OPENAI_API_KEY / OPENAI_MODEL from .env if present

def retrieve(collection, embedder, query: str, k: int = 5):
    q_emb = embedder.encode([query], normalize_embeddings=True).tolist()[0]
    results = collection.query(query_embeddings=[q_emb], n_results=k)
    # Chroma returns lists-of-lists (one list per query)
    docs = results.get("documents", [[]])[0]
    metas = results.get("metadatas", [[]])[0]
    retrieved = []
    for d, m in zip(docs, metas):
        retrieved.append({"text": d, "meta": m})
    return retrieved

def build_prompt(query: str, retrieved: list[dict]):
    context_blocks = []
    for i, item in enumerate(retrieved, start=1):
        src = item["meta"].get("source", "unknown")
        context_blocks.append(f"[Source {i}: {src}]\n{item['text']}")
    context = "\n\n".join(context_blocks)

    system = (
        "You are a helpful assistant. Answer the user's question using ONLY the provided context. "
        "If the context is insufficient, say what is missing and ask a clarifying question."
    )
    user = f"""Question: {query}

Context:
{context}

Instructions:
- Cite sources like (Source 1) when you use them.
- Be concise, but don't omit critical details.
"""
    return system, user

def answer_with_rag(query: str, persist_dir: str = "./chroma_db", collection_name: str = "knowledge_base"):
    # collection_name must match the one used in build_index.py
    client = chromadb.PersistentClient(path=persist_dir)
    collection = client.get_collection(name=collection_name)
    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    retrieved = retrieve(collection, embedder, query=query, k=5)
    system, user = build_prompt(query, retrieved)

    # Choose your model here (gpt-4o-mini is a common low-cost default)
    oai = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
    resp = oai.chat.completions.create(
        model=os.environ.get("OPENAI_MODEL", "gpt-4o-mini"),
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        temperature=0.2,
    )
    return retrieved, resp.choices[0].message.content

if __name__ == "__main__":
    # Example usage
    question = "What is our refund policy?"
    retrieved, answer = answer_with_rag(question)

    print("\n--- Retrieved context ---")
    for i, r in enumerate(retrieved, start=1):
        print(f"\n[Source {i}] {r['meta'].get('source')}\n{r['text'][:200]}...")

    print("\n--- Answer ---")
    print(answer)
Filename: rag_demo.py
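Once you've built the index in Step 5, you can also poke at it interactively with a small loop around answer_with_rag() (optional; it just reuses the function from rag_demo.py):

from rag_demo import answer_with_rag

while True:
    q = input("\nAsk a question (or 'quit'): ").strip()
    if not q or q.lower() in {"quit", "exit"}:
        break
    _, answer = answer_with_rag(q)
    print(answer)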
Step 5: Run it
1) First build the index:
python build_index.py
2) Set your OpenAI key (if you're using OpenAI for generation):
export OPENAI_API_KEY="..."
# optional
export OPENAI_MODEL="gpt-4o-mini"
3) Run a query and inspect what it retrieved and how it answered:
python rag_demo.py

Common pitfalls (and how to avoid them)
- Bad chunking: If answers feel random, tune chunk size/overlap before touching anything else.
- Top-K too low/high: Too low misses context; too high floods the model with noise. Try 3–8.
- No grounding instructions: Always tell the model to use only provided context and to admit when it can’t answer.
- Stale index: If your docs change, you need a re-index step (or incremental updates).
Where to go from here (small upgrades that matter)
- Better loaders: Parse HTML/PDFs properly instead of raw copy/paste.
- Metadata filtering: Restrict retrieval by document type, project, or date (see the sketch after this list).
- Hybrid search: Combine keyword + vector search for better precision.
- Evaluation: Keep a small set of Q&A pairs and measure retrieval quality (did we fetch the right chunks?).
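As a taste of metadata filtering, Chroma accepts a where filter directly in query(), so restricting retrieval to a single document is a small variation on retrieve(). A sketch (the where value must match the source paths stored at index time, e.g. "data/policies.txt"):

def retrieve_filtered(collection, embedder, query: str, source: str, k: int = 5):
    # Same as retrieve(), but only searches chunks whose "source" metadata
    # matches the given file path.
    q_emb = embedder.encode([query], normalize_embeddings=True).tolist()[0]
    results = collection.query(
        query_embeddings=[q_emb],
        n_results=k,
        where={"source": source},
    )
    docs = results.get("documents", [[]])[0]
    metas = results.get("metadatas", [[]])[0]
    return [{"text": d, "meta": m} for d, m in zip(docs, metas)]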