
A step-by-step guide to building a simple RAG system in Python

Updated February 5, 2026 by Rana Ahsan


If you’ve ever asked an LLM a question about your docs, tickets, or notes, you’ve seen the problem: the model is smart, but it doesn’t automatically know what’s inside your data.

Retrieval-Augmented Generation (RAG) fixes this by doing two things at query time:

  • Retrieve the most relevant snippets from your knowledge base (using vector search), then
  • Generate an answer using those snippets as grounded context.

In this post we’ll build a small, practical RAG system in Python that you can run locally. We’ll keep it intentionally simple (two short scripts, minimal moving parts), but the architecture matches what production systems do.

This assumes you are already familiar with basic agentic AI implementation in Python. If not, feel free to start with our beginner’s guide to creating AI agents with Python.

What we’re building

  • A tiny “knowledge base” made from a handful of text files
  • Chunking + embeddings
  • A local vector database (Chroma) for similarity search
  • A query function that retrieves top-K chunks and asks an LLM to answer

Step 0: Install dependencies

Create a virtual environment (recommended), then install:

python -m venv .venv
source .venv/bin/activate

pip install chromadb sentence-transformers openai python-dotenv
Bash

Why these?

  • chromadb: lightweight vector store you can run locally
  • sentence-transformers: creates embeddings without needing an external API
  • openai: we’ll use it for the final “answer drafting” step (swap for any LLM you prefer)

Pro Tip 💡: If you are new, feel free to go through our guide to setting up a Python local environment first.

Step 1: Prepare your documents

Create a folder like data/ and add a few .txt files. For example:

  • data/product.txt
  • data/faq.txt
  • data/policies.txt

RAG works best when your source text is reasonably clean (no nav bars, no giant blobs of code unless that’s what you’re searching).
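
For example, a hypothetical data/faq.txt might hold plain question-and-answer pairs like these (the refund entry is what the demo query in Step 4 happens to target):

Q: What is your refund policy?
A: Items can be returned within 30 days of delivery for a full refund.

Q: How long does shipping take?
A: Standard shipping takes 3-5 business days.
Text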

Step 2: Chunk the documents (small pieces beat one giant document)

Vector search doesn’t “retrieve a whole PDF”; it retrieves chunks. Chunking matters more than most people think:

  • Too big: retrieval gets fuzzy and you bring in irrelevant stuff
  • Too small: you lose context and the answer becomes incomplete

A solid default is ~800–1200 characters with a bit of overlap.
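
To make the overlap concrete: with the defaults used in the script below (chunk_size=1000, overlap=200), the first chunk covers characters 0–999, the second 800–1799, the third 1600–2599, and so on. Each chunk repeats the previous chunk’s last 200 characters, so a sentence that straddles a boundary still appears whole in at least one chunk.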

Step 3: Embed + index into Chroma

Below is a complete script that:

  • loads all .txt files in data/
  • splits them into overlapping chunks
  • embeds each chunk using sentence-transformers
  • stores vectors + text in a persistent Chroma collection

import os
import glob
from dataclasses import dataclass

import chromadb
from sentence_transformers import SentenceTransformer


@dataclass
class Chunk:
    id: str
    text: str
    source: str
    start: int
    end: int


def read_text_files(folder: str = "data"):
    paths = sorted(glob.glob(os.path.join(folder, "**/*.txt"), recursive=True))
    docs = []
    for p in paths:
        with open(p, "r", encoding="utf-8") as f:
            docs.append((p, f.read()))
    return docs


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200):
    # simple, safe char-based chunking (guards against non-advancing loops)
    if chunk_size <= 0:
        raise ValueError("chunk_size must be > 0")
    if overlap < 0:
        raise ValueError("overlap must be >= 0")
    if overlap >= chunk_size:
        # prevent non-advancing loops
        overlap = max(0, chunk_size - 1)

    chunks = []
    start = 0
    text_len = len(text)
    while start < text_len:
        end = min(start + chunk_size, text_len)
        chunk = text[start:end].strip()
        if chunk:
            chunks.append((start, end, chunk))
        if end >= text_len:
            break
        next_start = end - overlap
        if next_start <= start:
            # fail-safe to avoid infinite loop
            next_start = end
        start = next_start
    return chunks


def build_index(
    persist_dir: str = "./chroma_db",
    collection_name: str = "knowledge_base",
    embedding_model_name: str = "all-MiniLM-L6-v2",
):
    # Initialize Chroma (local persistent)
    client = chromadb.PersistentClient(path=persist_dir)

    # Create or load collection
    collection = client.get_or_create_collection(name=collection_name)

    # Embedding model
    embedder = SentenceTransformer(embedding_model_name)

    # Load docs
    docs = read_text_files("data")
    if not docs:
        raise RuntimeError("No .txt files found in ./data")

    # Build chunks
    all_chunks: list[Chunk] = []
    for path, text in docs:
        for i, (s, e, c) in enumerate(chunk_text(text)):
            chunk_id = f"{os.path.basename(path)}::{i}"
            all_chunks.append(Chunk(id=chunk_id, text=c, source=path, start=s, end=e))

    # Upsert into Chroma
    texts = [c.text for c in all_chunks]
    ids = [c.id for c in all_chunks]
    metadatas = [{"source": c.source, "start": c.start, "end": c.end} for c in all_chunks]

    embeddings = embedder.encode(texts, normalize_embeddings=True).tolist()

    # Use upsert so you can re-run the script without duplicates
    collection.upsert(ids=ids, documents=texts, metadatas=metadatas, embeddings=embeddings)

    print(f"Indexed {len(all_chunks)} chunks into collection '{collection_name}'.")


if __name__ == "__main__":
    build_index()
Python

Filename: build_index.py
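
Once the build finishes, you can sanity-check the index from a Python shell before wiring up retrieval (a quick check, assuming the default paths above):

import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection(name="knowledge_base")

# Number of chunks stored in the collection
print(collection.count())

# Inspect the first couple of stored chunks and their metadata
print(collection.peek(limit=2))
Python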

Step 4: Retrieve relevant chunks for a question

Now we’ll add a search function. At runtime, we:

  1. embed the user’s question
  2. vector-search top-K similar chunks
  3. use those chunks as context for the LLM

Create a second script with the following code. This example uses OpenAI for generation, but you can swap it for any chat model.

import os

import chromadb

from dotenv import load_dotenv
from openai import OpenAI
from sentence_transformers import SentenceTransformer

load_dotenv()


def retrieve(collection, embedder, query: str, k: int = 5):
    q_emb = embedder.encode([query], normalize_embeddings=True).tolist()[0]
    results = collection.query(query_embeddings=[q_emb], n_results=k)

    # Chroma returns lists-of-lists
    docs = results.get("documents", [[]])[0]
    metas = results.get("metadatas", [[]])[0]

    retrieved = []
    for d, m in zip(docs, metas):
        retrieved.append({"text": d, "meta": m})

    return retrieved


def build_prompt(query: str, retrieved: list[dict]):
    context_blocks = []
    for i, item in enumerate(retrieved, start=1):
        src = item["meta"].get("source", "unknown")
        context_blocks.append(f"[Source {i}: {src}]\n{item['text']}")

    context = "\n\n".join(context_blocks)

    system = (
        "You are a helpful assistant. Answer the user's question using ONLY the provided context. "
        "If the context is insufficient, say what is missing and ask a clarifying question."
    )

    user = f"""Question: {query}

Context:
{context}

Instructions:
- Cite sources like (Source 1) when you use them.
- Be concise, but don't omit critical details.
"""

    return system, user


def answer_with_rag(query: str, persist_dir: str = "./chroma_db", collection_name: str = "knowledge_base"):
    client = chromadb.PersistentClient(path=persist_dir)
    collection = client.get_collection(name=collection_name)

    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    retrieved = retrieve(collection, embedder, query=query, k=5)

    system, user = build_prompt(query, retrieved)

    # Choose your model here (gpt-4o-mini is a common low-cost default)
    oai = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

    resp = oai.chat.completions.create(
        model=os.environ.get("OPENAI_MODEL", "gpt-4o-mini"),
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        temperature=0.2,
    )

    return retrieved, resp.choices[0].message.content


if __name__ == "__main__":
    # Example usage
    question = "What is our refund policy?"
    retrieved, answer = answer_with_rag(question)

    print("\n--- Retrieved context ---")
    for i, r in enumerate(retrieved, start=1):
        print(f"\n[Source {i}] {r['meta'].get('source')}\n{r['text'][:200]}...")

    print("\n--- Answer ---")
    print(answer)
Python

Filename: rag_demo.py

Step 5: Run it

1) First build the index:

python build_index.py
Bash

2) Set your OpenAI key (if you’re using OpenAI for generation):

export OPENAI_API_KEY="..."
# optional
export OPENAI_MODEL="gpt-4o-mini"
Bash
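
Alternatively, since rag_demo.py calls load_dotenv(), you can put the same values in a .env file next to the scripts (placeholders shown, not real values):

OPENAI_API_KEY=...
OPENAI_MODEL=gpt-4o-mini
Bash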

3) Run a query and inspect what it retrieved and how it answered.

python rag_demo.py
Bash

Common pitfalls (and how to avoid them)

  • Bad chunking: If answers feel random, tune chunk size/overlap before touching anything else.
  • Top-K too low/high: Too low misses context, too high floods the model. Try 3–8.
  • No grounding instructions: Always tell the model to use only provided context and to admit when it can’t answer.
  • Stale index: If your docs change, you need a re-index step (or incremental updates); see the drop-and-rebuild sketch below.
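
For the stale-index case specifically: the upsert in build_index.py already refreshes chunks whose IDs still exist, but it never removes chunks whose source files were deleted. The simplest correct fix is a drop-and-rebuild; a minimal sketch, assuming the defaults from build_index.py:

import chromadb

from build_index import build_index

# Drop the old collection, then rebuild from whatever is in data/ now.
# upsert alone would leave behind chunks from deleted source files.
client = chromadb.PersistentClient(path="./chroma_db")
client.delete_collection(name="knowledge_base")

build_index()
Python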

Where to go from here (small upgrades that matter)

  • Better loaders: Parse HTML/PDFs properly instead of raw copy/paste.
  • Metadata filtering: Restrict retrieval by document type, project, or date (see the sketch after this list).
  • Hybrid search: Combine keyword + vector search for better precision.
  • Evaluation: Keep a small set of Q&A pairs and measure retrieval quality (did we fetch the right chunks?).
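
Metadata filtering in particular is nearly free here, because Step 3 already stores source, start, and end on every chunk, and Chroma’s query() accepts a where clause over that metadata. A minimal sketch, written as a variant of the retrieve() function above (retrieve_filtered is an illustrative name, not part of the scripts):

def retrieve_filtered(collection, embedder, query: str, source: str, k: int = 5):
    # Same as retrieve(), but only consider chunks whose stored
    # metadata says they came from one specific source file.
    q_emb = embedder.encode([query], normalize_embeddings=True).tolist()[0]
    results = collection.query(
        query_embeddings=[q_emb],
        n_results=k,
        where={"source": source},  # e.g. "data/faq.txt"
    )
    docs = results.get("documents", [[]])[0]
    metas = results.get("metadatas", [[]])[0]
    return [{"text": d, "meta": m} for d, m in zip(docs, metas)]
Python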

More Relevant Resources

  • LangChain RAG tutorial
  • Chroma getting started
