Fix: Ollama Not Working — Connection Refused, Model Not Found, GPU Not Detected
Quick Answer
How to fix Ollama errors — connection refused when the daemon isn't running, model not found, GPU not detected falling back to CPU, port 11434 already in use, VRAM exhausted, and API access from other machines.
The Error
You run ollama run llama3 and get this:
Error: could not connect to ollama app, is it running?
Or the model isn't there:
Error: model "llama3" not found, try pulling it first
Or Ollama starts but ignores your GPU entirely:
llm_load_tensors: offloading 0 layers to GPU
llm_load_tensors: offloaded 0/32 layers to GPU
Or another process is already on the port:
Error: listen tcp 127.0.0.1:11434: bind: address already in use
Each of these is a distinct failure mode. None requires reinstalling.
Why This Happens
Ollama has two parts: a background daemon (ollama serve) and the CLI. The daemon must be running before any API call or ollama run command will work. When it isn’t running, every call fails with “connection refused.”
GPU detection is a separate concern — Ollama detects your GPU at startup by scanning for CUDA (NVIDIA), ROCm (AMD), or Metal (Apple Silicon). If the right driver or toolkit isn’t installed, Ollama silently falls back to CPU, which is 5–30x slower but otherwise functional.
Fix 1: Daemon Not Running — Start Ollama First
The daemon runs independently of your terminal. If you haven’t started it, all commands fail immediately.
Verify the daemon is up:
curl http://localhost:11434/api/tags
If you get a JSON response, the daemon is running. If you get "connection refused," it's not.
macOS — Ollama runs as a menu bar app. Open it from Spotlight (Cmd+Space, search “Ollama”) or from /Applications/Ollama.app. You can also start just the server from a terminal:
ollama serve
Linux (systemd):
sudo systemctl start ollama
sudo systemctl status ollama # Verify it's active
sudo systemctl enable ollama # Start automatically on boot
View live logs:
journalctl -u ollama -f
Windows — Ollama installs as a background service. Find it in the system tray or restart it from Task Manager → Services tab. To run it manually:
ollama serve
Debug mode — if the daemon starts but something is wrong, enable verbose logging:
OLLAMA_DEBUG=1 ollama serve
This logs GPU detection, model loading decisions, and request handling in detail.
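If you're scripting against the daemon, the same health check can be done from Python with the standard library alone. A minimal sketch (the helper name is mine; it assumes only the /api/tags endpoint used above):

```python
import json
import urllib.request

def is_ollama_up(base_url="http://localhost:11434", timeout=2.0):
    """Return True if the Ollama daemon answers /api/tags with valid JSON."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            json.load(resp)  # a parseable body means the daemon is healthy
        return True
    except (OSError, ValueError):  # URLError subclasses OSError; bad JSON raises ValueError
        return False
```

A return value of False is the programmatic equivalent of the "connection refused" error, so a script can start the daemon or fail fast instead of crashing mid-request.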
Fix 2: Model Not Found — Pull Before Running
Ollama models are not bundled with the application. Each model must be downloaded separately and stored locally before it can be used.
Error: model "llama3" not found, try pulling it first
Pull the model first:
ollama pull llama3 # Download without running
ollama run llama3 # Download if missing, then run
List what's already installed:
ollama list
NAME ID SIZE MODIFIED
llama3:8b 6d4eaa4c8e7f 4.7 GB 2 hours ago
mistral:7b f974a74d6e12 4.1 GB 3 days ago
nomic-embed-text:v1.5 0a109f422b47 274 MB 1 week ago
Model names are case-sensitive and include a tag (:8b, :7b, :latest). If you pull llama3 and then try to run llama3:8b, it works — :latest and the default tag resolve to the same image. But llama3:70b is a different, much larger model.
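The :latest aliasing can be expressed as a small lookup helper. A sketch under the assumption that a bare name resolves to :latest (digest-level aliasing between tags like llama3:8b and llama3:latest is not modeled; normalize_tag and has_model are hypothetical names):

```python
def normalize_tag(name):
    """Ollama treats a bare model name as the :latest tag."""
    return name if ":" in name else f"{name}:latest"

def has_model(installed, requested):
    """Check a requested name against the model names from `ollama list`."""
    return normalize_tag(requested) in {normalize_tag(n) for n in installed}
```

Checking before calling the API avoids triggering an unexpected multi-gigabyte pull in a script.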
Pull failing due to network issues:
# With a proxy
export HTTPS_PROXY=http://proxy.example.com:8080
ollama pull llama3
# Debug the pull
OLLAMA_DEBUG=1 ollama pull llama3
# Manual connectivity check
curl -v https://registry.ollama.ai/v2/library/llama3/manifests/latest
If the registry is unreachable from your network, you can copy a pulled model from another machine. Models are stored in ~/.ollama/models on macOS and Linux, and %USERPROFILE%\.ollama\models on Windows.
Fix 3: GPU Not Detected — Falling Back to CPU
When Ollama runs entirely on CPU, generation is noticeably slow (often 1–5 tokens/second on consumer hardware vs. 30–100+ on GPU). The log output gives it away:
llm_load_tensors: offloaded 0/32 layers to GPU
Or when you check running models:
ollama ps
NAME ID SIZE PROCESSOR UNTIL
llama3:8b 6d4eaa4c8e7f 4.7GB 100% CPU 5 minutes
The PROCESSOR column tells you exactly what's being used.
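If you monitor this programmatically, the PROCESSOR column can be parsed into a number. A rough sketch assuming the column format shown above (it may differ between Ollama versions; gpu_fraction is a made-up helper):

```python
import re

def gpu_fraction(processor_column):
    """Parse the PROCESSOR column of `ollama ps` output into the fraction
    of the model resident on the GPU, e.g. "100% CPU" -> 0.0,
    "50% CPU 50% GPU" -> 0.5, "100% GPU" -> 1.0."""
    match = re.search(r"(\d+)%\s*GPU", processor_column)
    return int(match.group(1)) / 100.0 if match else 0.0
```

Anything below 1.0 means part of the model is running on the CPU, which is the usual cause of slow generation.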
NVIDIA GPUs — CUDA requirements:
Ollama requires NVIDIA driver 531+ and CUDA toolkit. The driver and CUDA toolkit are separate packages — having just the driver is not enough.
nvidia-smi # Shows driver version — must be 531+
nvcc --version # Shows CUDA toolkit version — must be 11.3+
If nvidia-smi works but nvcc is missing, install the CUDA toolkit:
# Ubuntu
sudo apt install nvidia-cuda-toolkit
After installing, restart the Ollama daemon. It detects CUDA at startup, not at runtime.
If you’re on Linux and the GPU stops working after the system wakes from suspend:
sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm
sudo systemctl restart ollama
AMD GPUs — ROCm requirements:
# Install ROCm on Ubuntu
sudo apt install rocm-hip-sdk
# Verify
rocm-smi
Ollama supports ROCm 6+ on Linux. Windows ROCm support is limited to select GPU models.
Apple Silicon — Metal:
Metal acceleration is automatic on M1/M2/M3/M4 chips. No configuration needed. If Ollama is running under Rosetta (x86 emulation instead of native ARM), Metal won’t work. Check with:
file $(which ollama)
# Should show: Mach-O 64-bit executable arm64
If it shows x86_64, reinstall Ollama from the official site using the macOS (Apple Silicon) download.
Force CPU-only mode (for testing or when GPU causes instability):
OLLAMA_NUM_GPU=0 ollama serve
Pro Tip: GPU detection is logged at startup. Run OLLAMA_DEBUG=1 ollama serve and look for lines containing CUDA, ROCm, or Metal to see exactly what Ollama found and why it accepted or rejected each GPU.
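To narrow that debug output down, a few lines of Python can filter a captured serve log for the backends mentioned above. A sketch (keyword matching is a heuristic; actual log wording varies by Ollama version):

```python
def gpu_detection_lines(serve_log):
    """Keep only the log lines that mention a GPU backend."""
    keywords = ("cuda", "rocm", "metal")
    return [line for line in serve_log.splitlines()
            if any(k in line.lower() for k in keywords)]
```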
Fix 4: Port 11434 Already in Use
Error: listen tcp 127.0.0.1:11434: bind: address already in use
Usually this means a previous Ollama instance didn't shut down cleanly. Find and kill it:
# macOS/Linux — find the process
lsof -i :11434
# Kill it
kill -9 <PID>
# Or by name
pkill -f "ollama serve"On Windows:
netstat -ano | findstr :11434
taskkill /PID <PID> /F
To run on a different port permanently, set OLLAMA_HOST:
export OLLAMA_HOST=127.0.0.1:11435
ollama serve
For a systemd service:
sudo systemctl edit ollama.service
Add under [Service]:
Environment="OLLAMA_HOST=127.0.0.1:11435"
sudo systemctl restart ollama
The port conflict fix is the same pattern as other server processes. For more on killing port conflicts across different tools, see port 3000 already in use.
Fix 5: Out of VRAM — Model Too Large
When VRAM is insufficient, Ollama doesn’t error out — it offloads layers to system RAM, which works but is much slower. If ollama ps shows 50% CPU 50% GPU, only half the model fit in VRAM.
Option 1: Use a smaller quantization:
Quantized models are compressed versions that trade a small amount of quality for significantly less VRAM. The q4_K_M variant is the standard recommendation:
ollama pull llama3:8b-q4_K_M # ~5 GB VRAM — best balance
ollama pull llama3:8b-q3_K_M # ~4 GB VRAM — more compressed
Approximate VRAM requirements for an 8B model:
q8: ~9 GB
q6_K: ~7.5 GB
q5_K_M: ~6.5 GB
q4_K_M: ~5 GB (recommended starting point)
q3_K_M: ~4 GB
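These figures follow a rough rule of thumb: weight memory is about parameter count times bits per weight divided by eight, plus overhead for the KV cache and runtime buffers. A sketch with assumed bits-per-weight values (approximations for estimation, not exact GGUF specification numbers):

```python
# Assumed effective bits per weight for common quantizations (rounded).
BITS_PER_WEIGHT = {"q8_0": 8.5, "q6_K": 6.6, "q5_K_M": 5.7, "q4_K_M": 4.7, "q3_K_M": 4.0}

def approx_vram_gb(params_billions, quant, overhead_gb=0.8):
    """Rule-of-thumb VRAM estimate: weights plus a fixed overhead for the
    KV cache and runtime buffers (real overhead grows with num_ctx)."""
    weights_gb = params_billions * BITS_PER_WEIGHT[quant] / 8
    return round(weights_gb + overhead_gb, 1)
```

Comparing the estimate against nvidia-smi's reported free memory tells you whether a given quantization will fit before you pull it.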
Option 2: Reduce context length:
The KV cache grows linearly with context length. Reducing num_ctx from a large setting (some models default to their full context window) to something smaller frees significant VRAM:
curl http://localhost:11434/api/generate \
-d '{"model": "llama3:8b", "prompt": "Hello", "options": {"num_ctx": 4096}}'
To set it globally via a Modelfile:
FROM llama3:8b
PARAMETER num_ctx 4096
ollama create llama3-compact -f Modelfile
ollama run llama3-compactOption 3: Control layer offloading:
OLLAMA_NUM_GPU sets how many transformer layers to place on the GPU. If you have limited VRAM, set it to a lower value to fit within your budget:
OLLAMA_NUM_GPU=20 ollama serve # Offload 20 layers, rest on CPU
Option 4: Limit concurrent loaded models:
By default, Ollama keeps recently used models in VRAM. If you’re switching between models, limit this to one:
OLLAMA_MAX_LOADED_MODELS=1 ollama serve
Fix 6: API Access from Other Machines
By default, Ollama only listens on 127.0.0.1 — requests from other hosts are refused. To expose the API on your network:
export OLLAMA_HOST=0.0.0.0:11434
ollama serve
Verify it's listening on all interfaces:
lsof -i :11434
# Should show: LISTEN *:11434
Test from another machine:
curl http://<your-ip>:11434/api/tags
CORS for browser-based clients:
Ollama’s default CORS policy only allows requests from localhost origins. If you’re calling the API from a browser app hosted on a different origin, set OLLAMA_ORIGINS:
export OLLAMA_ORIGINS="http://localhost:3000,https://your-app.example.com"
ollama serve
For development:
export OLLAMA_ORIGINS="*" # Allow all — do not use in production
ollama serve
For a systemd service, add both to the override file:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_ORIGINS=https://your-app.example.com"
Fix 7: Using Ollama with the OpenAI SDK
Ollama exposes an OpenAI-compatible API at /v1/. You can use the official OpenAI Python or Node.js SDK pointed at your local Ollama instance — no OpenAI account needed.
Python:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1/",
    api_key="ollama",  # Required by the client, ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3:8b",  # Must match a model from `ollama list`
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain async/await in Python."},
    ],
)
print(response.choices[0].message.content)
Streaming:
stream = client.chat.completions.create(
    model="llama3:8b",
    messages=[{"role": "user", "content": "Write a haiku."}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
Embeddings:
response = client.embeddings.create(
    model="nomic-embed-text:v1.5",
    input="The quick brown fox jumps over the lazy dog",
)
embedding = response.data[0].embedding
Node.js:
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:11434/v1/",
  apiKey: "ollama",
});

const response = await client.chat.completions.create({
  model: "llama3:8b",
  messages: [{ role: "user", content: "Hello" }],
});
console.log(response.choices[0].message.content);
Using Ollama with LangChain:
from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama3:8b", base_url="http://localhost:11434")
response = llm.invoke("What is the capital of France?")
print(response.content)
Install: pip install langchain-ollama. For LangChain agent patterns that work with Ollama, see LangChain Python not working.
Fix 8: Docker Setup with GPU Passthrough
Running Ollama in Docker requires explicitly passing GPU access to the container.
NVIDIA — install nvidia-container-toolkit first:
sudo apt install nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Verify Docker can see the GPU before trying Ollama
docker run --rm --gpus all ubuntu nvidia-smi
If this test fails, the Docker GPU setup is broken regardless of Ollama. Fix Docker's GPU access first — see Docker daemon not running for Docker service troubleshooting.
Run Ollama with GPU:
# NVIDIA
docker run -d \
--name ollama \
--gpus all \
-v ollama:/root/.ollama \
-p 11434:11434 \
ollama/ollama
# AMD ROCm
docker run -d \
--name ollama \
--device /dev/kfd \
--device /dev/dri \
-v ollama:/root/.ollama \
-p 11434:11434 \
ollama/ollama:rocm
Common Mistake: Using -p 127.0.0.1:11434:11434 instead of -p 11434:11434. The first form binds the published port to the Docker host's loopback only, so requests from other machines on your network won't reach Ollama. Containers on the same Docker network bypass published ports entirely and connect to the container's port 11434 directly, which works as long as Ollama listens on all interfaces inside the container (set OLLAMA_HOST=0.0.0.0:11434 if it doesn't).
Docker Compose:
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
volumes:
  ollama_data:
Pull and run a model inside the running container:
docker exec -it ollama ollama pull llama3:8b
docker exec -it ollama ollama run llama3:8b "Hello"
Still Not Working?
Generation Is Slow Even with GPU Detected
ollama ps shows the model is on GPU but output is 2–3 tokens/second. Three possible causes:
- KV cache is overflowing to RAM — reduce num_ctx (see Fix 5)
- Model is partially on CPU — ollama ps shows a split like 40% GPU 60% CPU, meaning the model doesn't fit. Use a smaller quantization
- Thermal throttling — GPU is overheating and reducing clock speed. Check GPU temperature with nvidia-smi -q -d TEMPERATURE
ollama serve Crashes Immediately on Linux
Check for missing libraries:
OLLAMA_DEBUG=1 ollama serve 2>&1 | head -50
Common causes: missing CUDA runtime libraries after a driver update, or an incompatible glibc version. Re-run the Ollama installer script:
curl -fsSL https://ollama.com/install.sh | sh
Model Response Cuts Off Mid-Sentence
The num_predict parameter limits output token count. The default is -1 (unlimited) but some Modelfile configurations set it lower. Check and override:
curl http://localhost:11434/api/generate \
-d '{"model": "llama3:8b", "prompt": "Write a detailed explanation", "options": {"num_predict": 2048}}'
Ollama API vs. Native API
Ollama has two API endpoints:
- /api/generate — Ollama's native API, supports raw completions and streaming
- /v1/chat/completions — OpenAI-compatible endpoint
The native API accepts options for per-request parameters. The /v1/ endpoint maps some OpenAI fields but ignores others. If a parameter isn’t working via /v1/, try the native endpoint directly.
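For example, a request body for the native endpoint nests every tuning parameter under options. A minimal sketch (native_generate_body is a hypothetical helper; the field layout matches the curl examples earlier in this article):

```python
import json

def native_generate_body(model, prompt, **options):
    """Build the JSON body for Ollama's native /api/generate endpoint.
    Per-request parameters (num_ctx, num_predict, temperature, ...) go
    inside the nested "options" object."""
    return json.dumps({"model": model, "prompt": prompt,
                       "stream": False, "options": options})
```

POST the resulting string to http://localhost:11434/api/generate with any HTTP client.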
For building apps that need to fall back between cloud LLMs and Ollama, the OpenAI SDK base_url swap works cleanly — the same code talks to either API. See OpenAI API not working for the error patterns you’ll encounter on the cloud side.
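One way to sketch that swap: pick the base URL and key at startup and hand them to the same client code. The helper name and the use of OPENAI_API_KEY as the deciding signal are my assumptions, not an official pattern:

```python
import os

def chat_client_config(prefer_local=False):
    """Choose OpenAI SDK settings: the cloud API when OPENAI_API_KEY is
    set, otherwise (or when prefer_local is True) the local Ollama endpoint."""
    key = os.environ.get("OPENAI_API_KEY")
    if prefer_local or not key:
        return {"base_url": "http://localhost:11434/v1/", "api_key": "ollama"}
    return {"base_url": "https://api.openai.com/v1/", "api_key": key}
```

Then construct the client as OpenAI(**chat_client_config()) and the rest of the code is identical for both backends.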
Windows: ollama Not Found After Install
The installer adds Ollama to %LOCALAPPDATA%\Programs\Ollama. If that path isn’t in your PATH, open a new terminal after install. PowerShell sometimes caches the old PATH — close and reopen the terminal window after installation.
Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.
Related Articles
Fix: Hugging Face Transformers Not Working — OSError, CUDA OOM, and Generation Errors
How to fix Hugging Face Transformers errors — OSError can't load tokenizer, gated repo access, CUDA out of memory with device_map auto, bitsandbytes not installed, tokenizer padding mismatch, pad_token_id warning, and LoRA adapter loading failures.
Fix: LangChain Python Not Working — ImportError, Pydantic, and Deprecated Classes
How to fix LangChain Python errors — ImportError from package split, Pydantic v2 compatibility, AgentExecutor deprecated, ConversationBufferMemory removed, LCEL output type mismatches, and tool calling failures.
Fix: TensorFlow Not Working — OOM, Shape Mismatch, GPU Not Found, and Keras Errors
How to fix TensorFlow errors — GPU not detected CUDA library missing, ResourceExhaustedError OOM, InvalidArgumentError shape mismatch, NaN loss, @tf.function AutoGraph failures, and Keras 3 breaking changes in TF 2.16+.
Fix: PyTorch Not Working — CUDA Out of Memory, Device Mismatch, and NaN Loss
How to fix PyTorch errors — CUDA out of memory, expected all tensors on same device, CUDA device-side assert triggered, torch.cuda.is_available() False, inplace gradient errors, DataLoader Windows crash, dtype mismatch, and NaN loss.