Fix: Ollama Not Working — Connection Refused, Model Not Found, GPU Not Detected
Quick Answer
How to fix Ollama errors — connection refused when the daemon isn't running, model not found, GPU not detected falling back to CPU, port 11434 already in use, VRAM exhausted, and API access from other machines.
The Error
You run ollama run llama3 and get this:
Error: could not connect to ollama app, is it running?
Or the model isn't there:
Error: model "llama3" not found, try pulling it first
Or Ollama starts but ignores your GPU entirely:
llm_load_tensors: offloading 0 layers to GPU
llm_load_tensors: offloaded 0/32 layers to GPU
Or another process is already on the port:
Error: listen tcp 127.0.0.1:11434: bind: address already in use
Each of these is a distinct failure mode. None requires reinstalling.
Why This Happens
Ollama has two parts: a background daemon (ollama serve) and the CLI. The daemon must be running before any API call or ollama run command will work. When it isn’t running, every call fails with “connection refused.”
GPU detection is a separate concern — Ollama detects your GPU at startup by scanning for CUDA (NVIDIA), ROCm (AMD), or Metal (Apple Silicon). If the right driver or toolkit isn’t installed, Ollama silently falls back to CPU, which is 5–30x slower but otherwise functional.
Fix 1: Daemon Not Running — Start Ollama First
The daemon runs independently of your terminal. If you haven’t started it, all commands fail immediately.
Verify the daemon is up:
curl http://localhost:11434/api/tags
If you get a JSON response, the daemon is running. If you get "connection refused," it's not.
macOS — Ollama runs as a menu bar app. Open it from Spotlight (Cmd+Space, search “Ollama”) or from /Applications/Ollama.app. You can also start just the server from a terminal:
ollama serve
Linux (systemd):
sudo systemctl start ollama
sudo systemctl status ollama # Verify it's active
sudo systemctl enable ollama # Start automatically on boot
View live logs:
journalctl -u ollama -f
Windows — Ollama installs as a background service. Find it in the system tray or restart it from Task Manager → Services tab. To run it manually:
ollama serve
Debug mode — if the daemon starts but something is wrong, enable verbose logging:
OLLAMA_DEBUG=1 ollama serve
This logs GPU detection, model loading decisions, and request handling in detail.
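If you're scripting against the daemon, the same health check can be done from Python with the standard library alone. A minimal sketch (the helper name is mine; it assumes only the /api/tags endpoint used above):

```python
import json
import urllib.request

def is_ollama_up(base_url="http://localhost:11434", timeout=2.0):
    """Return True if the Ollama daemon answers /api/tags with valid JSON."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            json.load(resp)  # a parseable body means the daemon is healthy
        return True
    except (OSError, ValueError):  # URLError subclasses OSError; bad JSON raises ValueError
        return False
```

A return value of False is the programmatic equivalent of the "connection refused" error, so a script can start the daemon or fail fast instead of crashing mid-request.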
Fix 2: Model Not Found — Pull Before Running
Ollama models are not bundled with the application. Each model must be downloaded separately and stored locally before it can be used.
Error: model "llama3" not found, try pulling it first
Pull the model first:
ollama pull llama3 # Download without running
ollama run llama3 # Download if missing, then run
List what's already installed:
ollama list
NAME ID SIZE MODIFIED
llama3:8b 6d4eaa4c8e7f 4.7 GB 2 hours ago
mistral:7b f974a74d6e12 4.1 GB 3 days ago
nomic-embed-text:v1.5 0a109f422b47 274 MB 1 week ago
Model names are case-sensitive and include a tag (:8b, :7b, :latest). If you pull llama3 and then try to run llama3:8b, it works — :latest and the default tag resolve to the same image. But llama3:70b is a different, much larger model.
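The :latest aliasing can be expressed as a small lookup helper. A sketch under the assumption that a bare name resolves to :latest (digest-level aliasing between tags like llama3:8b and llama3:latest is not modeled; normalize_tag and has_model are hypothetical names):

```python
def normalize_tag(name):
    """Ollama treats a bare model name as the :latest tag."""
    return name if ":" in name else f"{name}:latest"

def has_model(installed, requested):
    """Check a requested name against the model names from `ollama list`."""
    return normalize_tag(requested) in {normalize_tag(n) for n in installed}
```

Checking before calling the API avoids triggering an unexpected multi-gigabyte pull in a script.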
Pull failing due to network issues:
# With a proxy
export HTTPS_PROXY=http://proxy.example.com:8080
ollama pull llama3
# Debug the pull
OLLAMA_DEBUG=1 ollama pull llama3
# Manual connectivity check
curl -v https://registry.ollama.ai/v2/library/llama3/manifests/latest
If the registry is unreachable from your network, you can copy a pulled model from another machine. Models are stored in ~/.ollama/models on macOS and Linux, and %USERPROFILE%\.ollama\models on Windows.
Fix 3: GPU Not Detected — Falling Back to CPU
When Ollama runs entirely on CPU, generation is noticeably slow (often 1–5 tokens/second on consumer hardware vs. 30–100+ on GPU). The log output gives it away:
llm_load_tensors: offloaded 0/32 layers to GPU
Or when you check running models:
ollama ps
NAME ID SIZE PROCESSOR UNTIL
llama3:8b 6d4eaa4c8e7f 4.7GB 100% CPU 5 minutes
The PROCESSOR column tells you exactly what's being used.
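If you monitor this programmatically, the PROCESSOR column can be parsed into a number. A rough sketch assuming the column format shown above (it may differ between Ollama versions; gpu_fraction is a made-up helper):

```python
import re

def gpu_fraction(processor_column):
    """Parse the PROCESSOR column of `ollama ps` output into the fraction
    of the model resident on the GPU, e.g. "100% CPU" -> 0.0,
    "50% CPU 50% GPU" -> 0.5, "100% GPU" -> 1.0."""
    match = re.search(r"(\d+)%\s*GPU", processor_column)
    return int(match.group(1)) / 100.0 if match else 0.0
```

Anything below 1.0 means part of the model is running on the CPU, which is the usual cause of slow generation.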
NVIDIA GPUs — CUDA requirements:
Ollama requires NVIDIA driver 531+ and CUDA toolkit. The driver and CUDA toolkit are separate packages — having just the driver is not enough.
nvidia-smi # Shows driver version — must be 531+
nvcc --version # Shows CUDA toolkit version — must be 11.3+
If nvidia-smi works but nvcc is missing, install the CUDA toolkit:
# Ubuntu
sudo apt install nvidia-cuda-toolkit
After installing, restart the Ollama daemon. It detects CUDA at startup, not at runtime.
If you’re on Linux and the GPU stops working after the system wakes from suspend:
sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm
sudo systemctl restart ollama
AMD GPUs — ROCm requirements:
# Install ROCm on Ubuntu
sudo apt install rocm-hip-sdk
# Verify
rocm-smi
Ollama supports ROCm 6+ on Linux. Windows ROCm support is limited to select GPU models.
Apple Silicon — Metal:
Metal acceleration is automatic on M1/M2/M3/M4 chips. No configuration needed. If Ollama is running under Rosetta (x86 emulation instead of native ARM), Metal won’t work. Check with:
file $(which ollama)
# Should show: Mach-O 64-bit executable arm64
If it shows x86_64, reinstall Ollama from the official site using the macOS (Apple Silicon) download.
Force CPU-only mode (for testing or when GPU causes instability):
OLLAMA_NUM_GPU=0 ollama serve
Pro Tip: GPU detection is logged at startup. Run OLLAMA_DEBUG=1 ollama serve and look for lines containing CUDA, ROCm, or Metal to see exactly what Ollama found and why it accepted or rejected each GPU.
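To narrow that debug output down, a few lines of Python can filter a captured serve log for the backends mentioned above. A sketch (keyword matching is a heuristic; actual log wording varies by Ollama version):

```python
def gpu_detection_lines(serve_log):
    """Keep only the log lines that mention a GPU backend."""
    keywords = ("cuda", "rocm", "metal")
    return [line for line in serve_log.splitlines()
            if any(k in line.lower() for k in keywords)]
```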
Fix 4: Port 11434 Already in Use
Error: listen tcp 127.0.0.1:11434: bind: address already in use
Usually this means a previous Ollama instance didn't shut down cleanly. Find and kill it:
# macOS/Linux — find the process
lsof -i :11434
# Kill it
kill -9 <PID>
# Or by name
pkill -f "ollama serve"On Windows:
netstat -ano | findstr :11434
taskkill /PID <PID> /F
To run on a different port permanently, set OLLAMA_HOST:
export OLLAMA_HOST=127.0.0.1:11435
ollama serve
For a systemd service:
sudo systemctl edit ollama.service
Add under [Service]:
Environment="OLLAMA_HOST=127.0.0.1:11435"
sudo systemctl restart ollama
The port conflict fix is the same pattern as other server processes. For more on killing port conflicts across different tools, see port 3000 already in use.
Fix 5: Out of VRAM — Model Too Large
When VRAM is insufficient, Ollama doesn’t error out — it offloads layers to system RAM, which works but is much slower. If ollama ps shows 50% CPU 50% GPU, only half the model fit in VRAM.
Option 1: Use a smaller quantization:
Quantized models are compressed versions that trade a small amount of quality for significantly less VRAM. The q4_K_M variant is the standard recommendation:
ollama pull llama3:8b-q4_K_M # ~5 GB VRAM — best balance
ollama pull llama3:8b-q3_K_M # ~4 GB VRAM — more compressed
Approximate VRAM requirements for an 8B model:
q8: ~9 GB
q6_K: ~7.5 GB
q5_K_M: ~6.5 GB
q4_K_M: ~5 GB (recommended starting point)
q3_K_M: ~4 GB
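These figures follow a rough rule of thumb: weight memory is about parameter count times bits per weight divided by eight, plus overhead for the KV cache and runtime buffers. A sketch with assumed bits-per-weight values (approximations for estimation, not exact GGUF specification numbers):

```python
# Assumed effective bits per weight for common quantizations (rounded).
BITS_PER_WEIGHT = {"q8_0": 8.5, "q6_K": 6.6, "q5_K_M": 5.7, "q4_K_M": 4.7, "q3_K_M": 4.0}

def approx_vram_gb(params_billions, quant, overhead_gb=0.8):
    """Rule-of-thumb VRAM estimate: weights plus a fixed overhead for the
    KV cache and runtime buffers (real overhead grows with num_ctx)."""
    weights_gb = params_billions * BITS_PER_WEIGHT[quant] / 8
    return round(weights_gb + overhead_gb, 1)
```

Comparing the estimate against nvidia-smi's reported free memory tells you whether a given quantization will fit before you pull it.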
Option 2: Reduce context length:
The KV cache grows linearly with context length. Reducing num_ctx from a large setting (some models default to their full context window) to something smaller frees significant VRAM:
curl http://localhost:11434/api/generate \
-d '{"model": "llama3:8b", "prompt": "Hello", "options": {"num_ctx": 4096}}'
To set it globally via a Modelfile:
FROM llama3:8b
PARAMETER num_ctx 4096
ollama create llama3-compact -f Modelfile
ollama run llama3-compactOption 3: Control layer offloading:
OLLAMA_NUM_GPU sets how many transformer layers to place on the GPU. If you have limited VRAM, set it to a lower value to fit within your budget:
OLLAMA_NUM_GPU=20 ollama serve # Offload 20 layers, rest on CPU
Option 4: Limit concurrent loaded models:
By default, Ollama keeps recently used models in VRAM. If you’re switching between models, limit this to one:
OLLAMA_MAX_LOADED_MODELS=1 ollama serve
Fix 6: API Access from Other Machines
By default, Ollama only listens on 127.0.0.1 — requests from other hosts are refused. To expose the API on your network:
export OLLAMA_HOST=0.0.0.0:11434
ollama serve
Verify it's listening on all interfaces:
lsof -i :11434
# Should show: LISTEN *:11434
Test from another machine:
curl http://<your-ip>:11434/api/tags
CORS for browser-based clients:
Ollama’s default CORS policy only allows requests from localhost origins. If you’re calling the API from a browser app hosted on a different origin, set OLLAMA_ORIGINS:
export OLLAMA_ORIGINS="http://localhost:3000,https://your-app.example.com"
ollama serve
For development:
export OLLAMA_ORIGINS="*" # Allow all — do not use in production
ollama serve
For a systemd service, add both to the override file:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_ORIGINS=https://your-app.example.com"
Fix 7: Using Ollama with the OpenAI SDK
Ollama exposes an OpenAI-compatible API at /v1/. You can use the official OpenAI Python or Node.js SDK pointed at your local Ollama instance — no OpenAI account needed.
Python:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1/",
    api_key="ollama",  # Required by the client, ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3:8b",  # Must match a model from `ollama list`
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain async/await in Python."},
    ],
)
print(response.choices[0].message.content)
Streaming:
stream = client.chat.completions.create(
    model="llama3:8b",
    messages=[{"role": "user", "content": "Write a haiku."}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
Embeddings:
response = client.embeddings.create(
    model="nomic-embed-text:v1.5",
    input="The quick brown fox jumps over the lazy dog",
)
embedding = response.data[0].embedding
Node.js:
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:11434/v1/",
  apiKey: "ollama",
});

const response = await client.chat.completions.create({
  model: "llama3:8b",
  messages: [{ role: "user", content: "Hello" }],
});
console.log(response.choices[0].message.content);
Using Ollama with LangChain:
from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama3:8b", base_url="http://localhost:11434")
response = llm.invoke("What is the capital of France?")
print(response.content)
Install: pip install langchain-ollama. For LangChain agent patterns that work with Ollama, see LangChain Python not working.
Fix 8: Docker Setup with GPU Passthrough
Running Ollama in Docker requires explicitly passing GPU access to the container.
NVIDIA — install nvidia-container-toolkit first:
sudo apt install nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Verify Docker can see the GPU before trying Ollama
docker run --rm --gpus all ubuntu nvidia-smi
If this test fails, the Docker GPU setup is broken regardless of Ollama. Fix Docker's GPU access first — see Docker daemon not running for Docker service troubleshooting.
Run Ollama with GPU:
# NVIDIA
docker run -d \
--name ollama \
--gpus all \
-v ollama:/root/.ollama \
-p 11434:11434 \
ollama/ollama
# AMD ROCm
docker run -d \
--name ollama \
--device /dev/kfd \
--device /dev/dri \
-v ollama:/root/.ollama \
-p 11434:11434 \
ollama/ollama:rocm
Common Mistake: Using -p 127.0.0.1:11434:11434 instead of -p 11434:11434. The first form binds the published port to the Docker host's loopback only, so requests from other machines on your network won't reach Ollama. Containers on the same Docker network bypass published ports entirely and connect to the container's port 11434 directly, which works as long as Ollama listens on all interfaces inside the container (set OLLAMA_HOST=0.0.0.0:11434 if it doesn't).
Docker Compose:
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
volumes:
  ollama_data:
Pull and run a model inside the running container:
docker exec -it ollama ollama pull llama3:8b
docker exec -it ollama ollama run llama3:8b "Hello"
Still Not Working?
Generation Is Slow Even with GPU Detected
ollama ps shows the model is on GPU but output is 2–3 tokens/second. Three possible causes:
- KV cache is overflowing to RAM — reduce num_ctx (see Fix 5)
- Model is partially on CPU — ollama ps shows a split like 40% GPU 60% CPU, meaning the model doesn't fit. Use a smaller quantization
- Thermal throttling — GPU is overheating and reducing clock speed. Check GPU temperature with nvidia-smi -q -d TEMPERATURE
ollama serve Crashes Immediately on Linux
Check for missing libraries:
OLLAMA_DEBUG=1 ollama serve 2>&1 | head -50
Common causes: missing CUDA runtime libraries after a driver update, or an incompatible glibc version. Re-run the Ollama installer script:
curl -fsSL https://ollama.com/install.sh | sh
Model Response Cuts Off Mid-Sentence
The num_predict parameter limits output token count. The default is -1 (unlimited) but some Modelfile configurations set it lower. Check and override:
curl http://localhost:11434/api/generate \
-d '{"model": "llama3:8b", "prompt": "Write a detailed explanation", "options": {"num_predict": 2048}}'
Ollama API vs. Native API
Ollama has two API endpoints:
- /api/generate — Ollama's native API, supports raw completions and streaming
- /v1/chat/completions — OpenAI-compatible endpoint
The native API accepts options for per-request parameters. The /v1/ endpoint maps some OpenAI fields but ignores others. If a parameter isn’t working via /v1/, try the native endpoint directly.
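For example, a request body for the native endpoint nests every tuning parameter under options. A minimal sketch (native_generate_body is a hypothetical helper; the field layout matches the curl examples earlier in this article):

```python
import json

def native_generate_body(model, prompt, **options):
    """Build the JSON body for Ollama's native /api/generate endpoint.
    Per-request parameters (num_ctx, num_predict, temperature, ...) go
    inside the nested "options" object."""
    return json.dumps({"model": model, "prompt": prompt,
                       "stream": False, "options": options})
```

POST the resulting string to http://localhost:11434/api/generate with any HTTP client.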
For building apps that need to fall back between cloud LLMs and Ollama, the OpenAI SDK base_url swap works cleanly — the same code talks to either API. See OpenAI API not working for the error patterns you’ll encounter on the cloud side.
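One way to sketch that swap: pick the base URL and key at startup and hand them to the same client code. The helper name and the use of OPENAI_API_KEY as the deciding signal are my assumptions, not an official pattern:

```python
import os

def chat_client_config(prefer_local=False):
    """Choose OpenAI SDK settings: the cloud API when OPENAI_API_KEY is
    set, otherwise (or when prefer_local is True) the local Ollama endpoint."""
    key = os.environ.get("OPENAI_API_KEY")
    if prefer_local or not key:
        return {"base_url": "http://localhost:11434/v1/", "api_key": "ollama"}
    return {"base_url": "https://api.openai.com/v1/", "api_key": key}
```

Then construct the client as OpenAI(**chat_client_config()) and the rest of the code is identical for both backends.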
Windows: ollama Not Found After Install
The installer adds Ollama to %LOCALAPPDATA%\Programs\Ollama. If that path isn’t in your PATH, open a new terminal after install. PowerShell sometimes caches the old PATH — close and reopen the terminal window after installation.
Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.
Related Articles
Fix: Hugging Face Transformers Not Working — OSError, CUDA OOM, and Generation Errors
How to fix Hugging Face Transformers errors — OSError can't load tokenizer, gated repo access, CUDA out of memory with device_map auto, bitsandbytes not installed, tokenizer padding mismatch, pad_token_id warning, and LoRA adapter loading failures.
Fix: LangChain Python Not Working — ImportError, Pydantic, and Deprecated Classes
How to fix LangChain Python errors — ImportError from package split, Pydantic v2 compatibility, AgentExecutor deprecated, ConversationBufferMemory removed, LCEL output type mismatches, and tool calling failures.
Fix: TensorFlow Not Working — OOM, Shape Mismatch, GPU Not Found, and Keras Errors
How to fix TensorFlow errors — GPU not detected CUDA library missing, ResourceExhaustedError OOM, InvalidArgumentError shape mismatch, NaN loss, @tf.function AutoGraph failures, and Keras 3 breaking changes in TF 2.16+.
Fix: PyTorch Not Working — CUDA Out of Memory, Device Mismatch, and NaN Loss
How to fix PyTorch errors — CUDA out of memory, expected all tensors on same device, CUDA device-side assert triggered, torch.cuda.is_available() False, inplace gradient errors, DataLoader Windows crash, dtype mismatch, and NaN loss.