
Ollama: Run Local LLMs Like a Pro in 2026

A hands-on guide to Ollama, the default local-LLM runner in 2026 (v0.30.10). Covers install, pulling and running models, calling them from the OpenAI SDK at localhost:11434, structured JSON outputs, tool calling, and Modelfiles, plus how to size a model to your hardware.

Marcus Rivera

Jun 25, 2026

There is a moment every developer hits with cloud AI APIs: the bill arrives, or the privacy review stalls a project, or you realize you are shipping your users' data to someone else's server. Ollama is the answer to all three. It is the closest thing the local-LLM world has to Docker — pull a model with one command, and it handles quantization, memory, and GPU acceleration for you. By mid-2026 it has become the default way to run open models on your own machine.

This guide takes you from a cold install to a working, code-callable local model, plus the features that actually matter for building things: an OpenAI-compatible API, structured JSON output, and tool calling. It assumes you are comfortable in a terminal but have never touched Ollama.

Why local, and why Ollama

Running models locally gives you three things the cloud can't: your data never leaves the machine, there are no per-token costs, and it works offline. The catch has always been setup pain — drivers, quantization formats, memory juggling. Ollama collapses all of that into a single binary.

Under the hood it wraps llama.cpp and related engines, automatically downloads models in GGUF format, and picks sensible quantization (typically 4-bit) so models fit in available memory. The piece that makes it genuinely useful for developers: it exposes an OpenAI-compatible REST API on localhost:11434, so most code written for the OpenAI SDK works against Ollama with a one-line base-URL change.

Ollama's real trick isn't running models — it's making a local model look exactly like a cloud API to your code.

Step 1: Install

As of June 2026 the current release is v0.30.10. Install is one command per platform.

macOS / Windows: download the installer from ollama.com/download and run it. The app installs the ollama CLI and runs a background service.

Linux:

curl -fsSL https://ollama.com/install.sh | sh

Verify it's working:

ollama --version

Step 2: Pull and run a model

The run command downloads a model on first use and then drops you into an interactive chat. Start small if your hardware is modest:

ollama run llama3.2

The first run pulls the weights; later runs skip the download and start as soon as the model loads into memory. Type a message, chat, and exit with /bye. To download without chatting, use pull; to see what you have, use list:

ollama pull llama3.2
ollama list
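The same inventory is available from code through the native API's /api/tags endpoint. The sketch below is one way to read it; the helper names (summarize_models, fetch_tags) are illustrative, and it assumes each entry in the response carries a name and a size in bytes.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address used throughout this guide


def summarize_models(tags_response: dict) -> list[tuple[str, float]]:
    """Turn an /api/tags response into (name, size in GB) pairs, largest first."""
    models = tags_response.get("models", [])
    rows = [(m["name"], round(m.get("size", 0) / 1e9, 1)) for m in models]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def fetch_tags(base_url: str = OLLAMA_URL) -> dict:
    """Fetch the list of locally available models from a running server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


# Usage (needs a running server):
#   for name, size_gb in summarize_models(fetch_tags()):
#       print(f"{name}: {size_gb} GB")
```

Keeping the parsing separate from the HTTP call means you can check the logic against a saved response without a server running.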

Pick a model your hardware can hold. The iron rule of local inference: the model must fit in VRAM (or in system RAM for CPU-only setups), or speeds collapse. For 7B-class models, 16 GB of RAM is a practical minimum, and 32 GB is where they become comfortable to run alongside everything else. A rough sizing guide at 4-bit:

Your hardware	Sensible model size
8 GB RAM, no GPU	3B (e.g. `llama3.2:3b`)
16 GB RAM	7B–8B
24 GB VRAM GPU	up to ~27B–32B
48 GB+ VRAM	70B-class
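The arithmetic behind that table is simple enough to sketch. Weights take roughly parameters times bits per weight divided by 8 bytes. The 20% overhead factor for context cache and runtime buffers is an assumption chosen for illustration, not a figure from Ollama, and real 4-bit files often average slightly more than 4 bits per weight.

```python
def estimate_memory_gb(
    params_billions: float,
    bits_per_weight: float = 4.0,
    overhead: float = 1.2,  # assumed headroom for KV cache and buffers
) -> float:
    """Rough memory footprint in GB for a quantized model."""
    # 1e9 parameters * (bits / 8) bytes each = that many GB of weights.
    weight_gb = params_billions * bits_per_weight / 8
    return round(weight_gb * overhead, 1)


def fits(params_billions: float, memory_gb: float) -> bool:
    """True if the estimated footprint is within the given memory budget."""
    return estimate_memory_gb(params_billions) <= memory_gb


for size in (3, 8, 32, 70):
    print(f"{size}B at 4-bit: ~{estimate_memory_gb(size)} GB")
```

Treat the result as a lower bound: long contexts and whatever else is running on the machine both eat into the budget.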

Step 3: Call it from code

This is where Ollama earns its place in a stack. Point the OpenAI SDK at your local server and nothing else changes:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the SDK, but ignored
)

resp = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain quantization in one sentence."}],
)
print(resp.choices[0].message.content)

Prefer Ollama's native API? It lives at the same host:

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Hello"}],
  "stream": false
}'
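With "stream" set to true, the native endpoint sends newline-delimited JSON chunks instead of one object. Here is a minimal standard-library sketch for reading them; the function names are illustrative, and it assumes each chunk carries a message.content fragment and a done flag on the last one.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_content(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from newline-delimited JSON chunks until done."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        yield chunk.get("message", {}).get("content", "")
        if chunk.get("done"):
            break


def stream_chat(model: str, prompt: str,
                base_url: str = "http://localhost:11434") -> Iterator[str]:
    """Send one user message and yield the reply as it is generated."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }
    req = urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        yield from iter_content(resp)  # the response object iterates line by line


# Usage (needs a running server):
#   for piece in stream_chat("llama3.2", "Hello"):
#       print(piece, end="", flush=True)
```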

Step 4: Force structured JSON output

One of the most useful additions to modern Ollama is structured outputs — you hand it a JSON schema and the model is constrained to return valid, matching JSON. This turns a chatty model into a reliable data extractor, no fragile regex parsing required.

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Tony is 25 and lives in Berlin."}],
  "stream": false,
  "format": {
    "type": "object",
    "properties": {
      "name": {"type": "string"},
      "age": {"type": "integer"},
      "city": {"type": "string"}
    },
    "required": ["name", "age", "city"]
  }
}'

The reply's message.content field holds a JSON string matching the schema, which you can pass to json.loads() directly. For anything that feeds downstream code (extraction, classification, form-filling), this is the feature you will reach for constantly.
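From Python, the extraction request above might look like the sketch below. The helper names are illustrative, and the required-key check is an extra safety net added here, not something Ollama asks for.

```python
import json
import urllib.request

PERSON_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "city": {"type": "string"},
    },
    "required": ["name", "age", "city"],
}


def parse_structured(response: dict, schema: dict) -> dict:
    """Decode message.content and confirm the schema's required keys exist."""
    data = json.loads(response["message"]["content"])
    missing = [key for key in schema.get("required", []) if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data


def extract(model: str, text: str, schema: dict,
            base_url: str = "http://localhost:11434") -> dict:
    """Ask the model to extract fields from text, constrained by schema."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": text}],
        "stream": False,
        "format": schema,
    }
    req = urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return parse_structured(json.load(resp), schema)


# Usage (needs a running server):
#   print(extract("llama3.2", "Tony is 25 and lives in Berlin.", PERSON_SCHEMA))
```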

Step 5: Tool calling

Ollama also supports tool calling: the model requests a function, your code runs it, and you send the result back. This is the backbone of agents and tool-driven RAG. You declare the available tools in the request; the model responds with a structured call when it decides one is needed. Recent versions can also stream responses that include tool calls, so your UI can show partial output instead of freezing until the full reply arrives.
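A rough sketch of the client side of that loop follows. The get_weather tool and the registry are hypothetical, and the tool declaration uses the OpenAI-style function format. The code assumes arguments may arrive either as an object or as a JSON string, so it handles both. The exact fields a tool-result message needs (for example a call id on the OpenAI-compatible endpoint) depend on which API you use, so check the docs before relying on this shape.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical tool; a real one would call a weather service."""
    return f"Sunny in {city}"


# Declaration sent in the request's "tools" list.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Maps tool names the model may request to local Python callables.
REGISTRY = {"get_weather": get_weather}


def run_tool_calls(message: dict) -> list[dict]:
    """Execute each requested tool and return tool-role messages to send back."""
    results = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        args = fn.get("arguments") or {}
        if isinstance(args, str):  # some endpoints encode arguments as JSON text
            args = json.loads(args)
        handler = REGISTRY.get(fn["name"])
        output = handler(**args) if handler else f"unknown tool: {fn['name']}"
        results.append({"role": "tool", "content": str(output)})
    return results
```

Append the returned messages to the conversation and call the chat endpoint again so the model can write its final answer using the tool output.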

Step 6: Customize with a Modelfile

To bake in a system prompt or tweak parameters, write a Modelfile — Ollama's equivalent of a Dockerfile:

FROM llama3.2
SYSTEM "You are a terse senior engineer. Answer in at most three sentences."
PARAMETER temperature 0.3

Then build and run your variant:

ollama create terse-eng -f Modelfile
ollama run terse-eng

You now have a reusable, named model with your defaults locked in.
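If you maintain several variants, generating the Modelfile text from code keeps them consistent. This is a minimal sketch: render_modelfile is a hypothetical helper, and it wraps the system prompt in triple quotes so that multi-line prompts work.

```python
def render_modelfile(base: str, system: str, **params: float) -> str:
    """Build Modelfile text from a base model, system prompt, and parameters."""
    if '"""' in system:
        raise ValueError("system prompt must not contain triple quotes")
    lines = [f"FROM {base}", f'SYSTEM """{system}"""']
    # One PARAMETER line per keyword argument, in the order given.
    lines += [f"PARAMETER {name} {value}" for name, value in params.items()]
    return "\n".join(lines) + "\n"


text = render_modelfile(
    "llama3.2",
    "You are a terse senior engineer. Answer in at most three sentences.",
    temperature=0.3,
)
print(text)
```

Write the result to a file named Modelfile and pass it to ollama create as before.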

A few things that will save you grief

Memory beats raw GPU speed. A model that fits entirely in VRAM on a modest GPU will outrun a bigger model that spills into system RAM or swap. If generation feels glacially slow, the model almost certainly does not fit in memory: drop to a smaller model or a more aggressive quantization.

Keep the server running. On Linux the install script sets up a background service; if you ever need to start it manually, ollama serve launches the server that the API and CLI both talk to.
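A quick way to check from code that the server is up is to ask it for its version. The sketch below uses the /api/version endpoint. The server_version name and the injectable opener parameter are illustrative choices that make the function easy to test without a live server.

```python
import json
import urllib.request


def server_version(base_url: str = "http://localhost:11434",
                   timeout: float = 2.0,
                   opener=urllib.request.urlopen):
    """Return the server's version string, or None if it is unreachable."""
    try:
        with opener(f"{base_url}/api/version", timeout=timeout) as resp:
            return json.load(resp).get("version")
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts;
        # ValueError covers a body that is not valid JSON.
        return None


# Usage:
#   if server_version() is None:
#       print("Ollama is not reachable; try running: ollama serve")
```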

Match the model to the job. Small models are excellent at extraction, summarization, and routing. Push them into hard multi-step reasoning and they will disappoint — that is a limit of the model, not of Ollama.

The Bottom Line

Ollama removes essentially every excuse not to run models locally. Installation is one command, the first model is one more, and the OpenAI-compatible endpoint means your existing code barely changes. The features that make it production-relevant (structured JSON outputs, tool calling, and Modelfiles) are the same primitives you would reach for against a cloud API, except now they run offline on hardware you control, with no per-token cost. Start with llama3.2, point your SDK at localhost:11434, and size up from there as your hardware allows.
