
Local LLM for Coding: Free AI Coding Agent With Ollama + Claude

Updated March 11, 2026 by Rana Ahsan · 17 minute read


If you’ve been curious about running a local LLM for coding but felt overwhelmed by the sheer number of models, runtimes, and configuration options, this guide is for you. Not the “tab-complete a for loop” kind of local. I’m talking full agentic coding — an AI agent that reads your repo, plans changes across multiple files, runs shell commands, executes tests, and iterates on errors. All on your machine. Zero API costs. Zero code leaving your network.

The stack we’re building: Ollama + Claude Code + an open-weight model. It takes about 15 minutes to get running.

What is a local LLM?

A local LLM for coding is a large language model running entirely on your own hardware — laptop, workstation, or home server — that provides AI-powered code generation, refactoring, and agentic multi-file editing without sending a single line of code to the cloud. It matters because it gives you complete privacy, zero recurring costs, and AI coding assistance that works even without internet.

💡Pro Tip: New to agentic coding? Consider going over the beginner's guide to AI coding assistant setup first.

Why “Local LLM” Changes Everything for Agentic Coding

Before we set anything up, let me be specific about what changes when you move agentic coding off the cloud.

Privacy becomes real, not theoretical. Agentic coding tools read everything — your entire repo structure, environment files, config files, test output, shell history. With a cloud provider, all of that context ships to someone else’s servers. After Samsung’s engineers accidentally uploaded confidential source code to ChatGPT back in 2023, this stopped being a paranoia issue and became a policy issue at many companies. Running locally means your proprietary business logic stays on your machine. Period.

Cost drops to zero. Claude Code on Anthropic’s API is powerful but expensive — heavy agentic sessions with Opus can burn through $5-15/hour easily. Locally? Your electricity bill goes up by a few dollars, and that’s for the entire month. The math is absurdly good once you have the hardware.

Availability becomes unconditional. Flights, coffee shops with garbage Wi-Fi, AWS outages that take down half the internet — none of it matters. Your AI coding agent is always on.

The quality tradeoff is shrinking fast. In 2023, local models were a joke for agentic work. In 2026, models like Qwen 3.5 and GLM-4.7-Flash genuinely handle multi-file edits, tool calling, and long-context planning. They won’t match Opus 4.6 on the hardest tasks, but for building features, fixing bugs, and scaffolding projects? They’re shockingly competent.

Setting Expectations

Here’s what nobody warns you about: a local model won’t feel exactly like Claude Opus or GPT-5 on day one. And that’s okay.

Local models are best at: scaffolding new features, generating boilerplate, writing tests, fixing bugs with clear error messages, and iterating on code with feedback loops.

They’re weaker at: massive architectural decisions across 20+ files, highly nuanced refactoring of complex legacy code, and tasks requiring enormous context windows.

The sweet spot for local agentic coding is tasks where you can describe what you want, let the agent take a first pass, review, and iterate. That covers probably 80% of daily development work.


Choose the Right Model

This is where most people get stuck. Hundreds of models on Hugging Face, new ones every week, every Reddit thread recommending something different. Let me cut through the noise.

For agentic coding, your model needs three things:

1. Long context (~64K tokens minimum)
2. Tool-calling support (so the agent can execute commands, read files, and run tests)
3. Strong instruction following (so it doesn’t go off the rails mid-task)

Not every model delivers on all three.

The Models Worth Running Right Now

Qwen 3.5 35B-A3B — My current daily driver. Released February 2026, this is a 35B parameter MoE model that only activates 3B parameters per token. That means it’s fast AND smart. It supports 256K context natively, has strong agentic capabilities, and its tool calling works reliably with Claude Code. Benchmarks back this up — the 35B-A3B model surpasses much larger predecessors like Qwen3-235B, as well as proprietary models like GPT-5 mini and Claude Sonnet 4.5 in categories including knowledge and visual reasoning. Runs comfortably on 32GB unified memory or a 24GB GPU. Apache 2.0 license, fully open for commercial use.

GLM-4.7-Flash — The best all-around model for 24GB VRAM setups. GLM-4.7-Flash dominates with a 30.1 Intelligence Index and won agentic coding challenges in recent independent testing. It handles planning, multi-step tool use, and code generation across multiple files with real consistency. If Qwen 3.5 35B doesn’t click for you, GLM-4.7-Flash is a rock-solid alternative.

Qwen3-Coder 30B-A3B — Purpose-built for coding agents. It offers 30B total parameters with only 3.3B activated, with exceptional agentic capabilities for real-world software engineering tasks and native support for 256K tokens. Trained specifically on agentic coding workflows through reinforcement learning on SWE-Bench. If your work is purely code (no general reasoning, no docs), this specialist might outperform the generalists above.

GPT-OSS 20B — OpenAI’s open-weight model. Strong reasoning and tool calling capabilities. A solid option at ~13GB, it fits on more modest hardware while still handling agentic workflows.

Match Your Hardware to a Local LLM Model

No CPU-only options here — agentic coding needs responsive inference, and CPU-only speeds are too slow for the multi-turn, tool-calling loops that Claude Code runs. You need a GPU or Apple Silicon unified memory.

Here’s the straightforward mapping:

  • Mac M1/M2/M3, 16GB unified (~14GB usable): GPT-OSS 20B (Q4), ~13GB download. Workable but tight; close other apps.
  • Mac M2/M3/M4, 32GB unified (~28GB usable): Qwen 3.5 35B-A3B (Q4), ~22GB download. Smooth daily driver; my recommended starting point.
  • Mac M2/M3/M4, 48-64GB unified (~44-58GB usable): Qwen 3.5 35B-A3B (Q8) or 122B-A10B (Q4), ~35GB / ~70GB download. Premium experience, larger context windows.
  • NVIDIA GPU, 16GB VRAM: GPT-OSS 20B (Q4), ~13GB download. Solid agentic coding for most tasks.
  • NVIDIA GPU, 24GB VRAM: GLM-4.7-Flash (Q4) or Qwen 3.5 35B-A3B (Q4), ~18-22GB download. Excellent; this is the NVIDIA sweet spot.
  • NVIDIA GPU, 48GB+ VRAM: Qwen 3.5 122B-A10B (Q4), ~70GB download. Near-cloud quality, serious agentic power.

The quick rule of thumb: Take your available memory, subtract 2-4GB for overhead, and that’s your model size budget at Q4 quantization.

What’s Q4 quantization? Quantization compresses model weights from 16-bit floats to smaller integers, dramatically reducing memory usage. Q4_K_M is the sweet spot — it cuts memory by ~75% with minimal quality loss. Below Q3, quality degrades noticeably. Ollama handles quantization automatically when you pull a model, so you don’t need to worry about the details.
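If you like, the rule of thumb above can be turned into a throwaway shell helper. This is a sketch of my own (the 2-4GB overhead figure comes from this article; `budget_gb` is a name I made up):

```shell
# Hypothetical helper: the largest Q4 model download you can budget for.
# Rule of thumb from above: available memory minus 2-4GB of overhead.
budget_gb() {
  local mem_gb=$1 overhead_gb=${2:-3}   # default to 3GB overhead
  echo $(( mem_gb - overhead_gb ))
}

budget_gb 24      # 24GB VRAM, 3GB overhead -> prints 21
budget_gb 32 4    # 32GB unified memory, conservative 4GB overhead -> prints 28
```

Compare the result against the download sizes in the table above before pulling a model.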


Setup Guide: Ollama + Claude Code

One path. No forks. Let’s get a working local AI coding agent in 15 minutes.

Step 1: Install Ollama

Ollama handles model downloads, GPU detection, quantization, and serves an API that Claude Code talks to. It’s the foundation.

# macOS (Homebrew)
brew install ollama

# Linux
curl -fsSL https://ollama.ai/install.sh | sh

# Windows — download installer from https://ollama.com
Bash

Verify it’s working:

ollama --version
ollama serve    # starts the background server (may already be running)
Bash


Important: Make sure you’re on Ollama v0.14.0 or later. In January 2026, Ollama added support for the Anthropic Messages API, enabling Claude Code to connect directly to any Ollama model. Older versions won’t work with Claude Code.
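If you want to script that version check, here is a small sketch (the v0.14.0 threshold comes from the paragraph above; `sort -V` requires GNU coreutils, and the output format of `ollama --version` may vary):

```shell
# Sketch: succeed only if the given version string is at least 0.14.0.
ollama_new_enough() {
  printf '%s\n0.14.0\n' "$1" | sort -V | head -n1 | grep -qx '0.14.0'
}

# Usage: extract x.y.z from `ollama --version` and gate on it.
v=$(ollama --version 2>/dev/null | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n1)
ollama_new_enough "$v" && echo "OK: $v" || echo "Update Ollama (need v0.14.0+)"
```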

Step 2: Pull a Coding Model

Pick a model from the recommendations above based on your hardware. Here are the pull commands:

# For 32GB+ unified memory or 24GB VRAM (recommended starting point):
ollama pull qwen3.5:35b-a3b

# Alternative for 24GB VRAM:
ollama pull glm-4.7-flash

# For coding-specific workloads:
ollama pull qwen3-coder:30b

# For 16GB setups:
ollama pull gpt-oss:20b
Bash

Quick sanity check — run it interactively to confirm it works:

ollama run qwen3.5:35b-a3b "Write a Python function that finds the longest palindromic substring"
Bash

If you get a sensible response, you’re golden. If it’s painfully slow or crashes, you need a smaller model for your hardware. Type /bye to exit the interactive session.

Step 3: Install Claude Code

Claude Code is Anthropic’s terminal-based agentic coding tool. It can read your repo, plan changes, edit files, run commands, and iterate — all from your terminal. And thanks to Ollama’s Anthropic API compatibility, it works with local models.

# macOS / Linux / WSL
curl -fsSL https://claude.ai/install.sh | bash

# Windows CMD
curl -fsSL https://claude.ai/install.cmd -o install.cmd && install.cmd && del install.cmd
Bash

Verify the installation:

claude --version
Bash

Step 4: Connect Claude Code to Ollama

This is the key step. You need to tell Claude Code to talk to your local Ollama server instead of Anthropic’s cloud API.

Option A: The one-liner (easiest)

If your Ollama is up to date, this single command handles everything:

ollama launch claude --model qwen3.5:35b-a3b
Bash

That’s it. Ollama sets the environment variables and launches Claude Code pointed at your local model. 🎉

Option B: Manual environment variables (if ollama launch isn’t available)

Add these to your ~/.bashrc, ~/.zshrc, or run them before launching Claude Code:

export ANTHROPIC_BASE_URL="http://localhost:11434"
export ANTHROPIC_AUTH_TOKEN="ollama"
export ANTHROPIC_API_KEY=""
Bash

Then launch Claude Code in your project directory:

cd /path/to/your/project
claude
Bash

Option C: Persistent config (recommended for daily use)

Add the settings to Claude Code’s config file at ~/.claude/settings.json:

{
  "env": {
    "ANTHROPIC_BASE_URL": "http://localhost:11434",
    "ANTHROPIC_AUTH_TOKEN": "ollama",
    "ANTHROPIC_API_KEY": ""
  }
}
JSON

Now every time you run claude, it’ll automatically connect to Ollama. Switch models anytime with:

claude --model qwen3.5:35b-a3b
claude --model glm-4.7-flash
claude --model qwen3-coder:30b
Bash


Step 5: Test It On a Real Task

Navigate to a project directory and give Claude Code a real job:

cd ~/projects/my-app
claude

# Inside Claude Code, try:
> Add input validation to the user registration endpoint. 
  Write tests for the validation logic. Run the tests and fix any failures.
Bash

Watch it work. Claude Code will read your codebase, plan the changes, edit files, run your test suite, and fix issues — all powered by your local model. No tokens leave your machine. No API bill at the end. 🚀

If things feel slow: The first prompt after loading a model takes longer (cold start). Subsequent prompts are much faster. You can keep models loaded longer by setting OLLAMA_KEEP_ALIVE=30m (or -1 for indefinitely).


Make Local LLM-Based Agentic Coding Feel Good

Getting the agent running is step one. Getting it to produce code you actually want to use? That’s the craft. Here’s what I’ve learned through hundreds of hours of local agentic coding.

Prompt Patterns That Work

Local models have smaller effective context windows than cloud models. Every wasted token costs you speed. These patterns make the most of what you have.

“Plan first, code second.” Instead of asking the agent to immediately start editing files, ask it to outline its approach. This catches wrong assumptions before they waste 200 lines of generation:

❌ "Add Redis caching to the user service"

✅ "I need Redis caching for the user service.
   Before writing code:
   1. List your assumptions about the existing architecture
   2. Describe the approach in 3-4 bullets
   3. Then implement it"
Markdown

Use a CLAUDE.md file. Drop a CLAUDE.md in your project root with context about your stack, conventions, and testing commands. Claude Code reads this automatically. This single file replaces a ton of repeated prompt context:

# CLAUDE.md
## Stack
- Python 3.12, FastAPI, SQLAlchemy 2.0, PostgreSQL
- Tests: pytest, run with `pytest tests/ -v`
- Linting: `ruff check .`

## Conventions
- Type hints on all function signatures
- Docstrings on public functions
- Tests go in tests/ mirroring src/ structure

## Important
- Never modify alembic migration files
- The auth middleware in src/auth/middleware.py is security-critical — ask before changing
Markdown

Constrain the scope. Local models can wander. Be explicit about boundaries:

✅ "Fix the failing test in tests/test_users.py. 
   Only modify src/users/service.py and the test file itself. 
   Don't touch any other files."
Markdown

Quality Controls (Your Safety Net)

Local models hallucinate. Smaller models hallucinate more than larger ones. These guardrails have saved me countless hours.

Always require tests. If the agent generates a function, it should also generate the test. If the test fails, it should fix the code. This feedback loop catches the majority of hallucinated APIs and wrong assumptions. Claude Code does this naturally when you ask — lean into it.

Keep your CLAUDE.md’s test command current. If Claude Code knows how to run your tests (pytest, npm test, cargo test), it’ll run them automatically after making changes and self-correct failures. This is the single most impactful quality improvement you can make.

Don’t trust, verify. Review every change before committing. I treat local LLM output the same way I treat a PR from a new team member — review it, test it, then merge it.

Watch for hallucinated imports. The most common failure mode: the model imports a function or library that doesn’t exist. Including your package.json or requirements.txt context (via CLAUDE.md or direct mention) reduces this significantly.


Troubleshooting

I’ve hit every one of these problems. Here are the fixes.

Claude Code says “connection refused”: Ollama isn’t running. Start it with ollama serve or check that the Ollama app is open. Verify with curl http://localhost:11434 — you should see “Ollama is running.”

Model is painfully slow or hangs: Your model is too large for your hardware. Run ollama ps to check memory usage. If the model is spilling to CPU (you’ll see partial GPU offload), try a smaller model. Also check that no other heavy processes are eating your VRAM (nvidia-smi on NVIDIA, Activity Monitor on Mac).

Claude Code errors with “model not found”: The model name must exactly match what Ollama has. Run ollama list and use the exact name shown. Common mistake: pulling qwen3.5 but specifying qwen3.5:35b-a3b (or vice versa).

First prompt takes forever, then it’s fine: This is the cold start — model weights loading into memory. Set OLLAMA_KEEP_ALIVE=30m to keep the model loaded for 30 minutes between requests. For all-day coding, use OLLAMA_KEEP_ALIVE=-1.

Agent makes changes to wrong files or goes off-script: Your context window might be overflowing. Reduce the scope of your prompts, use a CLAUDE.md to establish boundaries, and break large tasks into smaller steps. Also try a model with better agentic training — Qwen3-Coder 30B was specifically RL-trained for multi-step coding tasks.

Hallucinated APIs — model suggests functions that don’t exist: Include your dependency files in context. Add to your CLAUDE.md: “Only use APIs from dependencies listed in package.json / requirements.txt.” Run type-checking (tsc --noEmit, mypy, pyright) on generated code as a filter.
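One crude, testable version of that idea is to diff generated imports against your declared dependencies. This is a naive sketch of my own: it only catches top-level `import x` lines and knows nothing about the standard library or `from x import y`:

```shell
# Naive sketch: flag `import x` lines whose module isn't listed in a deps file.
# (Deliberately ignores `from x import y`, stdlib modules, and extras syntax.)
check_imports() {
  gen_file=$1 deps_file=$2
  grep -oE '^import [A-Za-z_]+' "$gen_file" | awk '{print $2}' |
  while read -r mod; do
    grep -qx "$mod" "$deps_file" || echo "suspicious import: $mod"
  done
}
```

Run it on agent-generated files before review; anything it flags deserves a second look, and a real type checker such as mypy or pyright remains the stronger filter.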

ollama launch claude returns “unknown command”: Your Ollama version is too old. Update Ollama — the launch command requires v0.14.0+. After updating, restart Ollama.


Privacy and Security Checklist

Running locally doesn’t automatically make you bulletproof. Here’s what to lock down.

Disable VS Code / editor telemetry. If you’re also using an IDE alongside Claude Code, many extensions phone home with usage data. Set "telemetry.telemetryLevel": "off" in VS Code settings.

Keep Ollama on localhost. Ollama binds to 127.0.0.1:11434 by default — that’s safe. If you change it to 0.0.0.0:11434 for remote access, anyone on your network can hit the API. Use a firewall or bind to a specific interface.
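Here is a tiny sketch of a bind-address check (the helper name is my own invention; it only inspects the `OLLAMA_HOST` environment variable, not the actual listening socket):

```shell
# Sketch: warn if OLLAMA_HOST is set to listen on all interfaces.
# (When unset, Ollama binds to 127.0.0.1:11434, which is the safe default.)
ollama_bind_check() {
  case "${OLLAMA_HOST:-127.0.0.1:11434}" in
    0.0.0.0*|*://0.0.0.0*) echo "WARNING: Ollama is exposed to your network" ;;
    *) echo "OK: bound to ${OLLAMA_HOST:-127.0.0.1:11434}" ;;
  esac
}

ollama_bind_check
```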

Disable nonessential traffic. Add CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1 to your environment variables. This prevents Claude Code from sending any analytics or telemetry.

Enable disk encryption. Conversation history and model weights live on your disk. FileVault (Mac), LUKS (Linux), or BitLocker (Windows) protect them if your machine is lost or stolen.

Check your model license. Most models covered here (Qwen 3.5, GLM-4.7, Qwen3-Coder) use Apache 2.0 or similarly permissive licenses — free for commercial use, no strings attached. DeepSeek models have a revenue threshold. Always check the model card on Hugging Face before deploying in a commercial context.


FAQ

What is the best local LLM for agentic coding in 2026?

For most developers: Qwen 3.5 35B-A3B. It balances strong agentic capabilities, long context (256K), tool calling support, and hardware efficiency (only 3B active parameters). On a 24GB NVIDIA GPU specifically, GLM-4.7-Flash is equally strong. Both run on Claude Code via Ollama without issues.

How much VRAM/RAM do I actually need?

16GB is the minimum for a usable experience. 24GB VRAM or 32GB unified memory is the sweet spot where agentic coding starts to feel genuinely productive. Below 16GB, the models that support proper tool calling and long context won’t fit.

Can I use this without a GPU at all?

Agentic coding requires responsive multi-turn inference — the agent makes many sequential calls to the model during a single task. CPU-only inference is too slow for this workflow. You need a dedicated GPU or Apple Silicon unified memory.

Are local LLMs actually free?

Yes. Ollama is free, Claude Code is free to run against non-Anthropic endpoints, and the open-weight models are free to download and use. Your only costs are hardware (which you likely already own) and electricity.

How does this compare to using Claude Code with Anthropic’s API?

Anthropic’s cloud models (Opus 4.6, Sonnet 4.6) are still more capable on the hardest tasks. But for 80% of daily coding work — building features, fixing bugs, writing tests, scaffolding projects — a good local model is more than sufficient. And it’s free, private, and always available.

Can I switch between local and cloud models?

Yes. Remove the environment variables (or use a separate terminal profile) and Claude Code goes back to Anthropic’s API. Many developers use local for routine work and cloud for complex tasks. It’s a pragmatic setup.
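One way to set that up is a pair of shell helpers in your profile. The function names here are my own invention; the variable values come from the setup section above:

```shell
# Sketch: flip Claude Code between local Ollama and Anthropic's cloud API.
use_local() {
  export ANTHROPIC_BASE_URL="http://localhost:11434"
  export ANTHROPIC_AUTH_TOKEN="ollama"
}
use_cloud() {
  unset ANTHROPIC_BASE_URL ANTHROPIC_AUTH_TOKEN
}

# use_local && claude    # routine work: free and private
# use_cloud && claude    # hard tasks on Anthropic's API
```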

First Published On: March 11, 2026 Filed Under: Productivity Tagged With: Agent, AI

About Rana Ahsan

Rana Ahsan is a seasoned software engineer and technology leader specialized in distributed systems and software architecture. With a Master’s in Software Engineering from Concordia University, his experience spans leading scalable architecture at Coursera and TopHat and contributing to open-source projects. This blog, CodeSamplez.com, showcases his passion for sharing practical insights on programming and distributed systems and for helping educate others.
Github | X | LinkedIn
