If you’ve been curious about running a local LLM for coding but felt overwhelmed by the sheer number of models, runtimes, and configuration options, this guide is for you. Not the “tab-complete a for loop” kind of local. I’m talking full agentic coding — an AI agent that reads your repo, plans changes across multiple files, runs shell commands, executes tests, and iterates on errors. All on your machine. Zero API costs. Zero code leaving your network.
The stack we’re building: Ollama + Claude Code + an open-weight model. It takes about 15 minutes to get running.
A local LLM for coding is a large language model running entirely on your own hardware — laptop, workstation, or home server — that provides AI-powered code generation, refactoring, and agentic multi-file editing without sending a single line of code to the cloud. It matters because it gives you complete privacy, zero recurring costs, and AI coding assistance that works even without internet.
💡Pro Tip: New to agentic coding? Consider going through a beginner’s guide to AI coding assistant setup first.
Why “Local LLM” Changes Everything for Agentic Coding
Before we set anything up, let me be specific about what changes when you move agentic coding off the cloud.
Privacy becomes real, not theoretical. Agentic coding tools read everything — your entire repo structure, environment files, config files, test output, shell history. With a cloud provider, all of that context ships to someone else’s servers. After Samsung’s engineers accidentally uploaded confidential source code to ChatGPT back in 2023, this stopped being a paranoia issue and became a policy issue at many companies. Running locally means your proprietary business logic stays on your machine. Period.
Cost drops to zero. Claude Code on Anthropic’s API is powerful but expensive — heavy agentic sessions with Opus can burn through $5-15/hour easily. Locally? Your electricity bill goes up by a few dollars for the entire month. The math is absurdly good once you have the hardware.
Availability becomes unconditional. Flights, coffee shops with garbage Wi-Fi, AWS outages that take down half the internet — none of it matters. Your AI coding agent is always on.
The quality tradeoff is shrinking fast. In 2023, local models were a joke for agentic work. In 2026, models like Qwen 3.5 and GLM-4.7-Flash genuinely handle multi-file edits, tool calling, and long-context planning. They won’t match Opus 4.6 on the hardest tasks, but for building features, fixing bugs, and scaffolding projects? They’re shockingly competent.
Setting Expectations
Here’s what nobody warns you about: a local model won’t feel exactly like Claude Opus or GPT-5 on day one. And that’s okay.
Local models are best at: scaffolding new features, generating boilerplate, writing tests, fixing bugs with clear error messages, and iterating on code with feedback loops.
They’re weaker at: massive architectural decisions across 20+ files, highly nuanced refactoring of complex legacy code, and tasks requiring enormous context windows.
The sweet spot for local agentic coding is tasks where you can describe what you want, let the agent take a first pass, review, and iterate. That covers probably 80% of daily development work.
Choose the Right Model
This is where most people get stuck. Hundreds of models on Hugging Face, new ones every week, every Reddit thread recommending something different. Let me cut through the noise.
For agentic coding, your model needs three things:
long context (~64K tokens minimum),
tool calling support (so the agent can execute commands, read files, run tests), and
strong instruction following (so it doesn’t go off the rails mid-task). Not every model delivers on all three.
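If you already have Ollama installed, you can sanity-check a pulled model against these requirements before committing to it. A quick diagnostic fragment — recent Ollama versions print the model’s context length and a capabilities list (tool-calling models show `tools`); the model name here is just an example:

```shell
# Inspect a pulled model's metadata: parameter count, quantization,
# context length, and capabilities (look for "tools" in the list).
ollama show qwen3.5:35b-a3b
```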
The Models Worth Running Right Now
Qwen 3.5 35B-A3B — My current daily driver. Released February 2026, this is a 35B parameter MoE model that only activates 3B parameters per token. That means it’s fast AND smart. It supports 256K context natively, has strong agentic capabilities, and its tool calling works reliably with Claude Code. Benchmarks back this up — the 35B-A3B model surpasses much larger predecessors like Qwen3-235B, as well as proprietary models like GPT-5 mini and Claude Sonnet 4.5 in categories including knowledge and visual reasoning. Runs comfortably on 32GB unified memory or a 24GB GPU. Apache 2.0 license, fully open for commercial use.
GLM-4.7-Flash — The best all-around model for 24GB VRAM setups. GLM-4.7-Flash dominates with a 30.1 Intelligence Index and won agentic coding challenges in recent independent testing. It handles planning, multi-step tool use, and code generation across multiple files with real consistency. If Qwen 3.5 35B doesn’t click for you, GLM-4.7-Flash is a rock-solid alternative.
Qwen3-Coder 30B-A3B — Purpose-built for coding agents. It offers 30B total parameters with only 3.3B activated, with exceptional agentic capabilities for real-world software engineering tasks and native support for 256K tokens. Trained specifically on agentic coding workflows through reinforcement learning on SWE-Bench. If your work is purely code (no general reasoning, no docs), this specialist might outperform the generalists above.
GPT-OSS 20B — OpenAI’s open-weight model. Strong reasoning and tool calling capabilities. A solid option at ~13GB, it fits on more modest hardware while still handling agentic workflows.
Match Your Hardware to a Local LLM Model
No CPU-only options here — agentic coding needs responsive inference, and CPU-only speeds are too slow for the multi-turn, tool-calling loops that Claude Code runs. You need a GPU or Apple Silicon unified memory.
Here’s the straightforward mapping:
| Your Hardware | Available Memory | Model to Run | Download Size | What to Expect |
|---|---|---|---|---|
| Mac M1/M2/M3 with 16GB unified | ~14GB usable | GPT-OSS 20B (Q4) | ~13GB | Workable but tight. Close other apps. |
| Mac M2/M3/M4 with 32GB unified | ~28GB usable | Qwen 3.5 35B-A3B (Q4) | ~22GB | Smooth daily driver. My recommended starting point. |
| Mac M2/M3/M4 with 48-64GB unified | ~44-58GB usable | Qwen 3.5 35B-A3B (Q8) or 122B-A10B (Q4) | ~35GB / ~70GB | Premium experience, larger context windows. |
| NVIDIA GPU with 16GB VRAM | 16GB | GPT-OSS 20B (Q4) | ~13GB | Solid agentic coding for most tasks. |
| NVIDIA GPU with 24GB VRAM | 24GB | GLM-4.7-Flash (Q4) or Qwen 3.5 35B-A3B (Q4) | ~18-22GB | Excellent. This is the NVIDIA sweet spot. |
| NVIDIA GPU with 48GB+ VRAM | 48GB+ | Qwen 3.5 122B-A10B (Q4) | ~70GB | Near-cloud quality, serious agentic power. |
The quick rule of thumb: Take your available memory, subtract 2-4GB for overhead, and that’s your model size budget at Q4 quantization.
What’s Q4 quantization? Quantization compresses model weights from 16-bit floats to smaller integers, dramatically reducing memory usage. Q4_K_M is the sweet spot — it cuts memory by ~75% with minimal quality loss. Below Q3, quality degrades noticeably. Ollama handles quantization automatically when you pull a model, so you don’t need to worry about the details.
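The rule of thumb above is simple enough to sketch as a tiny helper. This is purely illustrative arithmetic (the function name is made up, not part of Ollama):

```shell
# Rough Q4 model-size budget: available memory minus 2-4GB of overhead.
q4_budget_gb() {
  local mem_gb=$1
  local overhead_gb=${2:-3}   # default 3GB, the middle of the 2-4GB range
  echo $(( mem_gb - overhead_gb ))
}

q4_budget_gb 24     # 24GB VRAM -> roughly 21GB for the model file
q4_budget_gb 32 4   # 32GB unified memory, 4GB overhead -> 28GB
```

So a 24GB GPU comfortably fits the ~18-22GB Q4 downloads in the table, while a 16GB card should stick to the ~13GB models.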
Setup Guide: Ollama + Claude Code
One path. No forks. Let’s get a working local AI coding agent in 15 minutes.
Step 1: Install Ollama
Ollama handles model downloads, GPU detection, quantization, and serves an API that Claude Code talks to. It’s the foundation.
```bash
# macOS (Homebrew)
brew install ollama

# Linux
curl -fsSL https://ollama.ai/install.sh | sh

# Windows — download the installer from https://ollama.com
```

Verify it’s working:
```bash
ollama --version
ollama serve  # starts the background server (may already be running)
```
Important: Make sure you’re on Ollama v0.14.0 or later. In January 2026, Ollama added support for the Anthropic Messages API, enabling Claude Code to connect directly to any Ollama model. Older versions won’t work with Claude Code.
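If you want to check that requirement programmatically, here’s a small sketch. The function name is made up for illustration, and it uses `sort -V` (GNU coreutils; on stock macOS you may need to install coreutils or just eyeball the version):

```shell
# Succeeds if the given version is >= 0.14.0, the first release with the
# Anthropic-compatible API. sort -V puts the lower version first, so if
# the requirement sorts first (or ties), the installed version is new enough.
ollama_api_ready() {
  required="0.14.0"
  [ "$(printf '%s\n%s\n' "$required" "$1" | sort -V | head -n1)" = "$required" ]
}

# Usage: feed it the version string from `ollama --version`, e.g.
#   ollama_api_ready "$(ollama --version | grep -oE '[0-9]+(\.[0-9]+)+' | head -n1)"
```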
Step 2: Pull a Coding Model
Pick a model from the recommendations above based on your hardware. Here are the pull commands:
```bash
# For 32GB+ unified memory or 24GB VRAM (recommended starting point):
ollama pull qwen3.5:35b-a3b

# Alternative for 24GB VRAM:
ollama pull glm-4.7-flash

# For coding-specific workloads:
ollama pull qwen3-coder:30b

# For 16GB setups:
ollama pull gpt-oss:20b
```

Quick sanity check — run it interactively to confirm it works:

```bash
ollama run qwen3.5:35b-a3b "Write a Python function that finds the longest palindromic substring"
```

If you get a sensible response, you’re golden. If it’s painfully slow or crashes, you need a smaller model for your hardware. Type /bye to exit the interactive session.
Step 3: Install Claude Code
Claude Code is Anthropic’s terminal-based agentic coding tool. It can read your repo, plan changes, edit files, run commands, and iterate — all from your terminal. And thanks to Ollama’s Anthropic API compatibility, it works with local models.
```bash
# macOS / Linux / WSL
curl -fsSL https://claude.ai/install.sh | bash

# Windows CMD
curl -fsSL https://claude.ai/install.cmd -o install.cmd && install.cmd && del install.cmd
```

Verify the installation:

```bash
claude --version
```

Step 4: Connect Claude Code to Ollama
This is the key step. You need to tell Claude Code to talk to your local Ollama server instead of Anthropic’s cloud API.
Option A: The one-liner (easiest)
If your Ollama is up to date, this single command handles everything:
```bash
ollama launch claude --model qwen3.5:35b-a3b
```

That’s it. Ollama sets the environment variables and launches Claude Code pointed at your local model. 🎉
Option B: Manual environment variables (if ollama launch isn’t available)
Add these to your ~/.bashrc, ~/.zshrc, or run them before launching Claude Code:
```bash
export ANTHROPIC_BASE_URL="http://localhost:11434"
export ANTHROPIC_AUTH_TOKEN="ollama"
export ANTHROPIC_API_KEY=""
```

Then launch Claude Code in your project directory:
```bash
cd /path/to/your/project
claude
```

Option C: Persistent config (recommended for daily use)
Add the settings to Claude Code’s config file at ~/.claude/settings.json:
```json
{
  "env": {
    "ANTHROPIC_BASE_URL": "http://localhost:11434",
    "ANTHROPIC_AUTH_TOKEN": "ollama",
    "ANTHROPIC_API_KEY": ""
  }
}
```

Now every time you run claude, it’ll automatically connect to Ollama. Switch models anytime with:
```bash
claude --model qwen3.5:35b-a3b
claude --model glm-4.7-flash
claude --model qwen3-coder:30b
```
Step 5: Test It On a Real Task
Navigate to a project directory and give Claude Code a real job:
```bash
cd ~/projects/my-app
claude

# Inside Claude Code, try:
> Add input validation to the user registration endpoint.
  Write tests for the validation logic. Run the tests and fix any failures.
```

Watch it work. Claude Code will read your codebase, plan the changes, edit files, run your test suite, and fix issues — all powered by your local model. No tokens leave your machine. No API bill at the end. 🚀
If things feel slow: The first prompt after loading a model takes longer (cold start). Subsequent prompts are much faster. You can keep models loaded longer by setting OLLAMA_KEEP_ALIVE=30m (or -1 for indefinitely).
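A couple of ways to set that up, assuming a standard shell environment (the 30m value is just an example; pick whatever fits your sessions):

```shell
# Keep models warm for 30 minutes after each request:
export OLLAMA_KEEP_ALIVE=30m

# Or keep them loaded indefinitely for an all-day coding session:
export OLLAMA_KEEP_ALIVE=-1

# Make it permanent by appending the line to your shell profile:
echo 'export OLLAMA_KEEP_ALIVE=30m' >> ~/.zshrc
```

Restart `ollama serve` (or the Ollama app) after changing the variable so the server picks it up.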
Make Local LLM-Based Agentic Coding Feel Good
Getting the agent running is step one. Getting it to produce code you actually want to use? That’s the craft. Here’s what I’ve learned through hundreds of hours of local agentic coding.
Prompt Patterns That Work
Local models have smaller effective context windows than cloud models. Every wasted token costs you speed. These patterns make the most of what you have.
“Plan first, code second.” Instead of asking the agent to immediately start editing files, ask it to outline its approach. This catches wrong assumptions before they waste 200 lines of generation:
```text
❌ "Add Redis caching to the user service"

✅ "I need Redis caching for the user service.
    Before writing code:
    1. List your assumptions about the existing architecture
    2. Describe the approach in 3-4 bullets
    3. Then implement it"
```

Use a CLAUDE.md file. Drop a CLAUDE.md in your project root with context about your stack, conventions, and testing commands. Claude Code reads this automatically. This single file replaces a ton of repeated prompt context:
```markdown
# CLAUDE.md

## Stack
- Python 3.12, FastAPI, SQLAlchemy 2.0, PostgreSQL
- Tests: pytest, run with `pytest tests/ -v`
- Linting: `ruff check .`

## Conventions
- Type hints on all function signatures
- Docstrings on public functions
- Tests go in tests/ mirroring src/ structure

## Important
- Never modify alembic migration files
- The auth middleware in src/auth/middleware.py is security-critical — ask before changing
```

Constrain the scope. Local models can wander. Be explicit about boundaries:
```text
✅ "Fix the failing test in tests/test_users.py.
    Only modify src/users/service.py and the test file itself.
    Don't touch any other files."
```

Quality Controls (Your Safety Net)
Local models hallucinate. Smaller models hallucinate more than larger ones. These guardrails have saved me countless hours.
Always require tests. If the agent generates a function, it should also generate the test. If the test fails, it should fix the code. This feedback loop catches the majority of hallucinated APIs and wrong assumptions. Claude Code does this naturally when you ask — lean into it.
Keep your CLAUDE.md’s test command current. If Claude Code knows how to run your tests (pytest, npm test, cargo test), it’ll run them automatically after making changes and self-correct failures. This is the single most impactful quality improvement you can make.
Don’t trust, verify. Review every change before committing. I treat local LLM output the same way I treat a PR from a new team member — review it, test it, then merge it.
Watch for hallucinated imports. The most common failure mode: the model imports a function or library that doesn’t exist. Including your package.json or requirements.txt context (via CLAUDE.md or direct mention) reduces this significantly.
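As a quick filter, you can diff the imports in a generated file against your pinned dependencies. A crude grep-based sketch (the function name is made up, the import-name-to-package-name mapping is a simplification, and stdlib modules will show up as false positives):

```shell
# List top-level imports in a Python file that don't appear in requirements.txt.
# Usage: unpinned_imports generated.py requirements.txt
unpinned_imports() {
  grep -hoE '^(import|from) +[A-Za-z_][A-Za-z0-9_]*' "$1" \
    | awk '{print $2}' | sort -u \
    | while read -r mod; do
        # Flag any module not pinned (exact name at line start) in requirements
        grep -qiE "^${mod}([=<>~[:space:]]|$)" "$2" || echo "not pinned: $mod"
      done
}
```

It’s a heuristic, not a real resolver, but it surfaces the classic “model invented a library” failure in seconds.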
Troubleshooting
I’ve hit every one of these problems. Here are the fixes.
Claude Code says “connection refused”: Ollama isn’t running. Start it with ollama serve or check that the Ollama app is open. Verify with curl http://localhost:11434 — you should see “Ollama is running.”
Model is painfully slow or hangs: Your model is too large for your hardware. Run ollama ps to check memory usage. If the model is spilling to CPU (you’ll see partial GPU offload), try a smaller model. Also check that no other heavy processes are eating your VRAM (nvidia-smi on NVIDIA, Activity Monitor on Mac).
Claude Code errors with “model not found”: The model name must exactly match what Ollama has. Run ollama list and use the exact name shown. Common mistake: pulling qwen3.5 but specifying qwen3.5:35b-a3b (or vice versa).
First prompt takes forever, then it’s fine: This is the cold start — model weights loading into memory. Set OLLAMA_KEEP_ALIVE=30m to keep the model loaded for 30 minutes between requests. For all-day coding, use OLLAMA_KEEP_ALIVE=-1.
Agent makes changes to wrong files or goes off-script: Your context window might be overflowing. Reduce the scope of your prompts, use a CLAUDE.md to establish boundaries, and break large tasks into smaller steps. Also try a model with better agentic training — Qwen3-Coder 30B was specifically RL-trained for multi-step coding tasks.
Hallucinated APIs — model suggests functions that don’t exist: Include your dependency files in context. Add to your CLAUDE.md: “Only use APIs from dependencies listed in package.json / requirements.txt.” Run type-checking (tsc --noEmit, mypy, pyright) on generated code as a filter.
ollama launch claude returns “unknown command”: Your Ollama version is too old. Update Ollama — the launch command requires v0.14.0+. After updating, restart Ollama.
Privacy and Security Checklist
Running locally doesn’t automatically make you bulletproof. Here’s what to lock down.
Disable VS Code / editor telemetry. If you’re also using an IDE alongside Claude Code, many extensions phone home with usage data. Set "telemetry.telemetryLevel": "off" in VS Code settings.
Keep Ollama on localhost. Ollama binds to 127.0.0.1:11434 by default — that’s safe. If you change it to 0.0.0.0:11434 for remote access, anyone on your network can hit the API. Use a firewall or bind to a specific interface.
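To confirm what Ollama is actually bound to, a quick diagnostic fragment (assumes standard tooling; 11434 is Ollama’s default port):

```shell
# Linux: list TCP listeners on Ollama's port.
# 127.0.0.1:11434 means local-only; 0.0.0.0:11434 means the whole network.
ss -ltn | grep 11434

# macOS equivalent:
lsof -nP -iTCP:11434 -sTCP:LISTEN
```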
Disable nonessential traffic. Add CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1 to your environment variables. This prevents Claude Code from sending any analytics or telemetry.
Enable disk encryption. Conversation history and model weights live on your disk. FileVault (Mac), LUKS (Linux), or BitLocker (Windows) protect them if your machine is lost or stolen.
Check your model license. Most models covered here (Qwen 3.5, GLM-4.7, Qwen3-Coder) use Apache 2.0 or similarly permissive licenses — free for commercial use, no strings attached. DeepSeek models have a revenue threshold. Always check the model card on Hugging Face before deploying in a commercial context.
FAQ
**Which local model is best for agentic coding?**
For most developers: Qwen 3.5 35B-A3B. It balances strong agentic capabilities, long context (256K), tool calling support, and hardware efficiency (only 3B active parameters). On a 24GB NVIDIA GPU specifically, GLM-4.7-Flash is equally strong. Both run on Claude Code via Ollama without issues.

**How much memory do I need?**
16GB is the minimum for a usable experience. 24GB VRAM or 32GB unified memory is the sweet spot where agentic coding starts to feel genuinely productive. Below 16GB, the models that support proper tool calling and long context won’t fit.

**Can I run this on CPU only?**
Agentic coding requires responsive multi-turn inference — the agent makes many sequential calls to the model during a single task. CPU-only inference is too slow for this workflow. You need a dedicated GPU or Apple Silicon unified memory.

**Is this setup really free?**
Yes. Ollama is free, Claude Code is free to run against non-Anthropic endpoints, and the open-weight models are free to download and use. Your only costs are hardware (which you likely already own) and electricity.

**Is a local model as good as the cloud models?**
Anthropic’s cloud models (Opus 4.6, Sonnet 4.6) are still more capable on the hardest tasks. But for 80% of daily coding work — building features, fixing bugs, writing tests, scaffolding projects — a good local model is more than sufficient. And it’s free, private, and always available.

**Can I switch back to Anthropic’s cloud API?**
Yes. Remove the environment variables (or use a separate terminal profile) and Claude Code goes back to Anthropic’s API. Many developers use local for routine work and cloud for complex tasks. It’s a pragmatic setup.