
Fix: Ollama Not Working — Connection Refused, Model Not Found, GPU Not Detected

FixDevs

Quick Answer

How to fix the most common Ollama errors: connection refused (daemon not running), model not found, GPU not detected (CPU fallback), port 11434 already in use, VRAM exhaustion, and API access from other machines.

The Error

You run ollama run llama3 and get this:

Error: could not connect to ollama app, is it running?

Or the model isn’t there:

Error: model "llama3" not found, try pulling it first

Or Ollama starts but ignores your GPU entirely:

llm_load_tensors: offloading 0 layers to GPU
llm_load_tensors: offloaded 0/32 layers to GPU

Or another process is already on the port:

Error: listen tcp 127.0.0.1:11434: bind: address already in use

Each of these is a distinct failure mode. None requires reinstalling.

Why This Happens

Ollama has two parts: a background daemon (ollama serve) and the CLI. The daemon must be running before any API call or ollama run command will work. When it isn’t running, every call fails with “connection refused.”

GPU detection is a separate concern — Ollama detects your GPU at startup by scanning for CUDA (NVIDIA), ROCm (AMD), or Metal (Apple Silicon). If the right driver or toolkit isn’t installed, Ollama silently falls back to CPU, which is 5–30x slower but otherwise functional.

Fix 1: Daemon Not Running — Start Ollama First

The daemon runs independently of your terminal. If you haven’t started it, all commands fail immediately.

Verify the daemon is up:

curl http://localhost:11434/api/tags

If you get a JSON response, the daemon is running. If you get “connection refused,” it’s not.
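If you'd rather script this health check than eyeball curl output, the same probe fits in a few lines. A minimal Python sketch, stdlib only, assuming the default local endpoint:

```python
import json
import urllib.error
import urllib.request


def ollama_is_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if the Ollama daemon answers /api/tags with valid JSON."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            json.load(resp)  # a running daemon returns a JSON model list
            return True
    except (urllib.error.URLError, json.JSONDecodeError, OSError):
        return False  # connection refused, timeout, or non-JSON response
```

`ollama_is_up()` returns False in exactly the cases where the CLI would print "connection refused."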

macOS — Ollama runs as a menu bar app. Open it from Spotlight (Cmd+Space, search “Ollama”) or from /Applications/Ollama.app. You can also start just the server from a terminal:

ollama serve

Linux (systemd):

sudo systemctl start ollama
sudo systemctl status ollama     # Verify it's active
sudo systemctl enable ollama     # Start automatically on boot

View live logs:

journalctl -u ollama -f

Windows — Ollama installs as a background service. Find it in the system tray or restart it from Task Manager → Services tab. To run it manually:

ollama serve

Debug mode — if the daemon starts but something is wrong, enable verbose logging:

OLLAMA_DEBUG=1 ollama serve

This logs GPU detection, model loading decisions, and request handling in detail.

Fix 2: Model Not Found — Pull Before Running

Ollama models are not bundled with the application. Each model must be downloaded separately and stored locally before it can be used.

Error: model "llama3" not found, try pulling it first

Pull the model first:

ollama pull llama3        # Download without running
ollama run llama3         # Download if missing, then run

List what’s already installed:

ollama list
NAME                    ID              SIZE    MODIFIED
llama3:8b               6d4eaa4c8e7f    4.7 GB  2 hours ago
mistral:7b              f974a74d6e12    4.1 GB  3 days ago
nomic-embed-text:v1.5   0a109f422b47    274 MB  1 week ago

Model names are case-sensitive and include a tag (:8b, :7b, :latest). If you pull llama3 and then try to run llama3:8b, it works — :latest and the default tag resolve to the same image. But llama3:70b is a different, much larger model.
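The tag-resolution rule above can be sketched as a small helper that normalizes a name before checking it against your installed models. A sketch, not a real API call — `installed` here is just a list of name:tag strings like the NAME column from ollama list:

```python
def normalize(name: str) -> str:
    """Append :latest when no tag is given, mirroring Ollama's default-tag rule."""
    return name if ":" in name else f"{name}:latest"


def is_installed(name: str, installed: list[str]) -> bool:
    """Check a (possibly untagged) model name against installed name:tag strings."""
    return normalize(name) in installed
```

So `is_installed("llama3", ["llama3:latest"])` is True, while `is_installed("llama3:70b", ["llama3:latest"])` is False — a different tag is a different model.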

Pull failing due to network issues:

# With a proxy
export HTTPS_PROXY=http://proxy.example.com:8080
ollama pull llama3

# Debug the pull
OLLAMA_DEBUG=1 ollama pull llama3

# Manual connectivity check
curl -v https://registry.ollama.ai/v2/library/llama3/manifests/latest

If the registry is unreachable from your network, you can copy a pulled model from another machine. Models are stored in ~/.ollama/models on macOS and Linux, and %USERPROFILE%\.ollama\models on Windows.

Fix 3: GPU Not Detected — Falling Back to CPU

When Ollama runs entirely on CPU, generation is noticeably slow (often 1–5 tokens/second on consumer hardware vs. 30–100+ on GPU). The log output gives it away:

llm_load_tensors: offloaded 0/32 layers to GPU

Or when you check running models:

ollama ps
NAME        ID              SIZE    PROCESSOR    UNTIL
llama3:8b   6d4eaa4c8e7f    4.7GB   100% CPU     5 minutes

The PROCESSOR column tells you exactly what’s being used.
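The same split can be read programmatically from the /api/ps endpoint, whose entries report a total size and a VRAM-resident size. A sketch of the arithmetic — the field names size and size_vram follow the JSON the endpoint returns, but verify against your Ollama version:

```python
def gpu_fraction(model: dict) -> float:
    """Fraction of a loaded model resident in VRAM (1.0 = fully on GPU)."""
    size = model.get("size", 0)
    return model["size_vram"] / size if size else 0.0


def describe(model: dict) -> str:
    """Render the split the way `ollama ps` does in its PROCESSOR column."""
    frac = gpu_fraction(model)
    if frac >= 1.0:
        return "100% GPU"
    if frac <= 0.0:
        return "100% CPU"
    return f"{frac:.0%} GPU / {1 - frac:.0%} CPU"
```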

NVIDIA GPUs — CUDA requirements:

Ollama requires NVIDIA driver 531+ and CUDA toolkit. The driver and CUDA toolkit are separate packages — having just the driver is not enough.

nvidia-smi        # Shows driver version — must be 531+
nvcc --version    # Shows CUDA toolkit version — must be 11.3+

If nvidia-smi works but nvcc is missing, install the CUDA toolkit:

# Ubuntu
sudo apt install nvidia-cuda-toolkit

After installing, restart the Ollama daemon. It detects CUDA at startup, not at runtime.

If you’re on Linux and the GPU stops working after the system wakes from suspend:

sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm
sudo systemctl restart ollama

AMD GPUs — ROCm requirements:

# Install ROCm on Ubuntu
sudo apt install rocm-hip-sdk

# Verify
rocm-smi

Ollama supports ROCm 6+ on Linux. Windows ROCm support is limited to select GPU models.

Apple Silicon — Metal:

Metal acceleration is automatic on M1/M2/M3/M4 chips. No configuration needed. If Ollama is running under Rosetta (x86 emulation instead of native ARM), Metal won’t work. Check with:

file $(which ollama)
# Should show: Mach-O 64-bit executable arm64

If it shows x86_64, reinstall Ollama from the official site using the macOS (Apple Silicon) download.

Force CPU-only mode (for testing or when GPU causes instability):

OLLAMA_NUM_GPU=0 ollama serve

Pro Tip: GPU detection is logged at startup. Run OLLAMA_DEBUG=1 ollama serve and look for lines containing CUDA, ROCm, or Metal to see exactly what Ollama found and why it accepted or rejected each GPU.
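The Pro Tip's log scan is easy to automate: pipe the captured OLLAMA_DEBUG output through a filter that keeps only GPU-related lines. A sketch — the sample line below is illustrative, not verbatim Ollama log output:

```python
GPU_KEYWORDS = ("cuda", "rocm", "metal")


def gpu_lines(log_text: str) -> list[str]:
    """Return log lines mentioning any GPU backend, case-insensitively."""
    return [
        line for line in log_text.splitlines()
        if any(kw in line.lower() for kw in GPU_KEYWORDS)
    ]
```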

Fix 4: Port 11434 Already in Use

Error: listen tcp 127.0.0.1:11434: bind: address already in use

Usually this means a previous Ollama instance didn’t shut down cleanly. Find and kill it:

# macOS/Linux — find the process
lsof -i :11434

# Kill it
kill -9 <PID>

# Or by name
pkill -f "ollama serve"

On Windows:

netstat -ano | findstr :11434
taskkill /PID <PID> /F

To run on a different port permanently, set OLLAMA_HOST:

export OLLAMA_HOST=127.0.0.1:11435
ollama serve
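Before settling on an alternate port, you can confirm it is actually free. A stdlib sketch of the same check lsof performs — try to bind the port and see whether the OS objects:

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Try to bind the port; success means nothing else is listening on it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:  # EADDRINUSE: something is already bound
            return False
```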

For a systemd service:

sudo systemctl edit ollama.service

Add under [Service]:

Environment="OLLAMA_HOST=127.0.0.1:11435"

Then restart the service:

sudo systemctl restart ollama

The port conflict fix is the same pattern as other server processes. For more on killing port conflicts across different tools, see port 3000 already in use.

Fix 5: Out of VRAM — Model Too Large

When VRAM is insufficient, Ollama doesn’t error out — it offloads layers to system RAM, which works but is much slower. If ollama ps shows 50% CPU 50% GPU, only half the model fit in VRAM.

Option 1: Use a smaller quantization:

Quantized models are compressed versions that trade a small amount of quality for significantly less VRAM. The q4_K_M variant is the standard recommendation:

ollama pull llama3:8b-q4_K_M   # ~5 GB VRAM — best balance
ollama pull llama3:8b-q3_K_M   # ~4 GB VRAM — more compressed

Approximate VRAM requirements for an 8B model:

  • q8: ~9 GB
  • q6_K: ~7.5 GB
  • q5_K_M: ~6.5 GB
  • q4_K_M: ~5 GB (recommended starting point)
  • q3_K_M: ~4 GB

Option 2: Reduce context length:

The KV cache grows linearly with context length. If num_ctx is set to a model's full advertised context (128k on recent models), the cache alone can eat most of your VRAM — reducing it to something smaller frees significant memory:

curl http://localhost:11434/api/generate \
  -d '{"model": "llama3:8b", "prompt": "Hello", "options": {"num_ctx": 4096}}'
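The "grows linearly" claim can be made concrete: KV cache size is roughly 2 (K and V) × layers × context × KV heads × head dim × bytes per element. A sketch with illustrative Llama-3-8B-style dimensions — 32 layers, 8 KV heads, head dim 128 are assumptions, so check your model's config:

```python
def kv_cache_bytes(n_layers: int, n_ctx: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size: K and V tensors across all layers."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem


# At 4096 context this is ~0.5 GB; at 131072 (128k) it is ~16 GB.
# Same model, 32x the cache.
```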

To set it globally via a Modelfile:

FROM llama3:8b
PARAMETER num_ctx 4096
ollama create llama3-compact -f Modelfile
ollama run llama3-compact

Option 3: Control layer offloading:

OLLAMA_NUM_GPU sets how many transformer layers to place on the GPU. If you have limited VRAM, set it to a lower value to fit within your budget:

OLLAMA_NUM_GPU=20 ollama serve   # Offload 20 layers, rest on CPU

Option 4: Limit concurrent loaded models:

By default, Ollama keeps recently used models in VRAM. If you’re switching between models, limit this to one:

OLLAMA_MAX_LOADED_MODELS=1 ollama serve

Fix 6: API Access from Other Machines

By default, Ollama only listens on 127.0.0.1 — requests from other hosts are refused. To expose the API on your network:

export OLLAMA_HOST=0.0.0.0:11434
ollama serve

Verify it’s listening on all interfaces:

lsof -i :11434
# Should show: LISTEN *:11434

Test from another machine:

curl http://<your-ip>:11434/api/tags

CORS for browser-based clients:

Ollama’s default CORS policy only allows requests from localhost origins. If you’re calling the API from a browser app hosted on a different origin, set OLLAMA_ORIGINS:

export OLLAMA_ORIGINS="http://localhost:3000,https://your-app.example.com"
ollama serve

For development:

export OLLAMA_ORIGINS="*"   # Allow all — do not use in production
ollama serve

For a systemd service, add both to the override file:

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_ORIGINS=https://your-app.example.com"

Fix 7: Using Ollama with the OpenAI SDK

Ollama exposes an OpenAI-compatible API at /v1/. You can use the official OpenAI Python or Node.js SDK pointed at your local Ollama instance — no OpenAI account needed.

Python:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1/",
    api_key="ollama",          # Required by the client, ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3:8b",         # Must match a model from `ollama list`
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain async/await in Python."},
    ],
)

print(response.choices[0].message.content)

Streaming:

stream = client.chat.completions.create(
    model="llama3:8b",
    messages=[{"role": "user", "content": "Write a haiku."}],
    stream=True,
)

for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)

Embeddings:

response = client.embeddings.create(
    model="nomic-embed-text:v1.5",
    input="The quick brown fox jumps over the lazy dog",
)
embedding = response.data[0].embedding
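Embeddings are typically compared with cosine similarity. A dependency-free sketch you can apply to the vectors the endpoint returns:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```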

Node.js:

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:11434/v1/",
  apiKey: "ollama",
});

const response = await client.chat.completions.create({
  model: "llama3:8b",
  messages: [{ role: "user", content: "Hello" }],
});

console.log(response.choices[0].message.content);

Using Ollama with LangChain:

from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama3:8b", base_url="http://localhost:11434")
response = llm.invoke("What is the capital of France?")
print(response.content)

Install: pip install langchain-ollama. For LangChain agent patterns that work with Ollama, see LangChain Python not working.

Fix 8: Docker Setup with GPU Passthrough

Running Ollama in Docker requires explicitly passing GPU access to the container.

NVIDIA — install nvidia-container-toolkit first:

sudo apt install nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify Docker can see the GPU before trying Ollama
docker run --rm --gpus all ubuntu nvidia-smi

If this test fails, the Docker GPU setup is broken regardless of Ollama. Fix Docker’s GPU access first — see Docker daemon not running for Docker service troubleshooting.

Run Ollama with GPU:

# NVIDIA
docker run -d \
  --name ollama \
  --gpus all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  ollama/ollama

# AMD ROCm
docker run -d \
  --name ollama \
  --device /dev/kfd \
  --device /dev/dri \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  ollama/ollama:rocm

Common Mistake: Using -p 127.0.0.1:11434:11434 instead of -p 11434:11434. The first form binds the published port to the Docker host's loopback, so other machines on your network can't reach the API. Note that containers on the same Docker network bypass published ports entirely — they should call the container by name (http://ollama:11434), which requires Ollama to listen on 0.0.0.0 inside the container; set OLLAMA_HOST=0.0.0.0:11434 if it doesn't.

Docker Compose:

services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

volumes:
  ollama_data:

Pull and run a model inside the running container:

docker exec -it ollama ollama pull llama3:8b
docker exec -it ollama ollama run llama3:8b "Hello"

Still Not Working?

Generation Is Slow Even with GPU Detected

ollama ps shows the model is on GPU but output is 2–3 tokens/second. Three possible causes:

  1. KV cache is overflowing to RAM — reduce num_ctx (see Fix 5)
  2. Model is partially on CPU — ollama ps shows a split like 40% GPU 60% CPU, meaning the model doesn’t fit. Use a smaller quantization
  3. Thermal throttling — GPU is overheating and reducing clock speed. Check GPU temperature with nvidia-smi -q -d TEMPERATURE

ollama serve Crashes Immediately on Linux

Check for missing libraries:

OLLAMA_DEBUG=1 ollama serve 2>&1 | head -50

Common causes: missing CUDA runtime libraries after a driver update, or an incompatible glibc version. Re-run the Ollama installer script:

curl -fsSL https://ollama.com/install.sh | sh

Model Response Cuts Off Mid-Sentence

The num_predict parameter limits output token count. The default is -1 (unlimited) but some Modelfile configurations set it lower. Check and override:

curl http://localhost:11434/api/generate \
  -d '{"model": "llama3:8b", "prompt": "Write a detailed explanation", "options": {"num_predict": 2048}}'

Ollama API vs. Native API

Ollama has two API endpoints:

  • /api/generate — Ollama’s native API, supports raw completions and streaming
  • /v1/chat/completions — OpenAI-compatible endpoint

The native API accepts options for per-request parameters. The /v1/ endpoint maps some OpenAI fields but ignores others. If a parameter isn’t working via /v1/, try the native endpoint directly.

For building apps that need to fall back between cloud LLMs and Ollama, the OpenAI SDK base_url swap works cleanly — the same code talks to either API. See OpenAI API not working for the error patterns you’ll encounter on the cloud side.
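One way to structure that fallback is to probe candidate endpoints in order and hand the winner's base_url to the SDK. A sketch of the selection logic only — the probe is injected, so it can be an HTTP check against /api/tags in production or a stub in tests:

```python
from typing import Callable, Optional


def pick_endpoint(candidates: list[str],
                  is_reachable: Callable[[str], bool]) -> Optional[str]:
    """Return the first reachable base_url, or None if every probe fails."""
    for base_url in candidates:
        if is_reachable(base_url):
            return base_url
    return None
```

Pass the result as base_url= when constructing the OpenAI client, e.g. trying the cloud endpoint first and http://localhost:11434/v1/ second.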

Windows: ollama Not Found After Install

The installer adds Ollama to %LOCALAPPDATA%\Programs\Ollama. Terminals only pick up PATH changes at launch, so close and reopen your terminal (or PowerShell) window after installation before running ollama.


FixDevs

Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.

