Fix: PyTorch Not Working — CUDA Out of Memory, Device Mismatch, and NaN Loss
Quick Answer
How to fix PyTorch errors — CUDA out of memory, expected all tensors on same device, CUDA device-side assert triggered, torch.cuda.is_available() False, inplace gradient errors, DataLoader Windows crash, dtype mismatch, and NaN loss.
The Error
You start training and GPU memory fills up:

```text
RuntimeError: CUDA out of memory. Tried to allocate 2.50 GiB
(GPU 0; 8.00 GiB total capacity; 6.73 GiB already allocated;
1.03 GiB free; 6.89 GiB reserved in total by PyTorch)
```

Or a forward pass crashes with a device error:

```text
RuntimeError: Expected all tensors to be on the same device,
but found at least two devices, cuda:0 and cpu!
```

Or CUDA gives you a cryptic error that points to the wrong line:

```text
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,
so the stacktrace below might be incorrect.
```

Or training runs fine for a few steps and then loss becomes NaN and stays there:

```text
Epoch 1, Step 50: loss = 0.3421
Epoch 1, Step 51: loss = nan
Epoch 1, Step 52: loss = nan  # Never recovers
```

Each of these is a distinct failure with a specific fix.
Why This Happens
PyTorch is strict about where tensors live (CPU vs. GPU device index), what type they are (float32 vs. float64), and how they’re modified (inplace vs. out-of-place). The GPU introduces additional failure modes: CUDA errors are reported asynchronously by default, which means the stack trace you see often points to the wrong line. GPU memory is managed by a caching allocator that doesn’t always release memory when you expect it to.
Understanding these mechanics makes the errors predictable rather than mysterious.
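A minimal sketch (guarded so it only runs when a CUDA GPU is present) makes the allocator's two counters visible: `memory_allocated` tracks live tensors, while `memory_reserved` tracks what the caching allocator is holding on to.

```python
import torch

# Sketch of the caching allocator's behavior; runs only if a GPU is present.
if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda")  # ~4 MB of float32
    print(torch.cuda.memory_allocated())  # bytes held by live tensors
    print(torch.cuda.memory_reserved())   # bytes held by the allocator
    del x
    # After deletion, "allocated" drops, but "reserved" stays: the allocator
    # caches the freed block for reuse instead of returning it to the driver.
    print(torch.cuda.memory_allocated())
    print(torch.cuda.memory_reserved())
```

This gap between the two counters is why nvidia-smi can report high memory usage even when few tensors are alive.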
Fix 1: CUDA Out of Memory
The error tells you how much was requested, how much is allocated, and how much is free. The gap between “allocated” and “reserved” is cached memory that PyTorch holds but isn’t actively using:

```text
Tried to allocate 2.50 GiB
Already allocated: 6.73 GiB
Free: 1.03 GiB       ← not enough for the request
Reserved: 6.89 GiB   ← includes cached but inactive memory
```

Option 1: Gradient accumulation (keeps the effective batch size while cutting per-step memory):
```python
from torch.amp import autocast, GradScaler

accumulation_steps = 4  # Effective batch = actual_batch × 4
scaler = GradScaler(device="cuda")

optimizer.zero_grad()
for i, (X, y) in enumerate(dataloader):
    X, y = X.to(device), y.to(device)
    with autocast(device_type="cuda", dtype=torch.float16):
        loss = criterion(model(X), y) / accumulation_steps
    scaler.scale(loss).backward()
    if (i + 1) % accumulation_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
```

Dividing the loss by accumulation_steps keeps gradients on the same scale as if you’d used the full batch.
Option 2: Mixed precision — halves memory usage for activations and most parameters by running the forward pass in float16:

```python
from torch.amp import autocast, GradScaler

scaler = GradScaler(device="cuda")
for X, y in dataloader:
    X, y = X.to(device), y.to(device)
    optimizer.zero_grad()
    with autocast(device_type="cuda", dtype=torch.float16):
        output = model(X)
        loss = criterion(output, y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```

GradScaler prevents float16 underflow during the backward pass. Don’t skip it.
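A related option, sketched below on a toy Linear model standing in for your own: on Ampere-or-newer GPUs, bfloat16 keeps float32's exponent range, so gradients don't underflow and the scaler can be dropped entirely.

```python
import torch
from torch import nn
from torch.amp import autocast

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(10, 2).to(device)          # Stand-in for your model
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

X = torch.randn(8, 10, device=device)
y = torch.randint(0, 2, (8,), device=device)

optimizer.zero_grad()
# bfloat16 autocast also works on CPU, which makes this runnable anywhere
with autocast(device_type=device.type, dtype=torch.bfloat16):
    loss = criterion(model(X), y)
loss.backward()   # No GradScaler needed with bfloat16
optimizer.step()
```

The trade-off is precision: bfloat16 has fewer mantissa bits than float16, which is usually fine for training but worth knowing.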
Option 3: Call torch.cuda.empty_cache() — but understand what it does and doesn’t do:

```python
torch.cuda.empty_cache()
```

This releases cached (reserved but inactive) memory back to the OS. It does not free memory that’s still held by live tensors. If allocated is the problem, not reserved, empty_cache() won’t help. Use torch.cuda.memory_summary() to see which category your memory is in:

```python
print(torch.cuda.memory_summary(device=0))
```

Option 4: Reduce batch size — the simplest fix if you’re not compute-bound. As a rough guide, halving the batch size frees roughly half the activation memory.
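If you'd rather find the limit empirically, a hypothetical helper like the one below (`make_batch` and `step_fn` are placeholders for your own batch construction and training step) halves the candidate batch size until one step fits:

```python
import torch

def max_batch_size(make_batch, step_fn, start=1024):
    """Halve the candidate batch size until one training step fits in memory.

    make_batch(n) should build a batch of size n; step_fn(batch) should run
    one forward + backward pass on it.
    """
    n = start
    while n >= 1:
        try:
            step_fn(make_batch(n))
            return n
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()  # Drop the failed attempt's cached blocks
            n //= 2
    raise RuntimeError("Even batch size 1 does not fit")
```

torch.cuda.OutOfMemoryError (PyTorch ≥ 1.13) lets you catch OOM specifically instead of every RuntimeError.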
Pro Tip: Wrap your validation loop in torch.no_grad(). Every tensor created in a forward pass without no_grad() saves its intermediate values for backprop, roughly doubling memory usage compared to inference:

```python
model.eval()
with torch.no_grad():
    for X, y in val_loader:
        output = model(X.to(device))
        # No gradients tracked — significantly less memory
```

Fix 2: Device Mismatch — All Tensors Must Be on the Same Device
```text
RuntimeError: Expected all tensors to be on the same device,
but found at least two devices, cuda:0 and cpu!
```

The three most common places this happens:

1. Input tensors not moved to GPU:

```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

for X, y in dataloader:
    # WRONG — X and y are on CPU, model is on cuda:0
    output = model(X)

    # CORRECT — move inputs before every forward pass
    X, y = X.to(device), y.to(device)
    output = model(X)
```

2. Loading a checkpoint without map_location:
```python
# WRONG — loads tensors to the device they were saved on
checkpoint = torch.load("model.pth")

# CORRECT — redirect to wherever you need them
checkpoint = torch.load("model.pth", map_location=device, weights_only=True)
model.load_state_dict(checkpoint["model_state_dict"])
```

weights_only=True is recommended in PyTorch 2.x to avoid loading arbitrary Python objects from checkpoint files.
3. Tensors created inside a model without inheriting the device:

```python
class MyModel(nn.Module):
    def forward(self, x):
        # WRONG — hardcoded CPU tensor
        mask = torch.ones(x.shape[0], x.shape[1])

        # CORRECT — match the device of the input tensor
        mask = torch.ones(x.shape[0], x.shape[1], device=x.device)
        return x * mask
```

Any tensor you create inside a forward() method must explicitly specify device=x.device or be created via operations on existing tensors (which inherit the device automatically).
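An alternative sketch for tensors with a fixed shape: register them as buffers in __init__, so model.to(device) moves them together with the parameters and they round-trip through state_dict():

```python
import torch
from torch import nn

class MaskedModel(nn.Module):
    def __init__(self, seq_len=16):
        super().__init__()
        # Buffers follow .to(device) / .cuda() and are saved in state_dict()
        self.register_buffer("mask", torch.ones(seq_len))

    def forward(self, x):
        return x * self.mask  # Always on the same device as the model

model = MaskedModel()         # On GPU: model.to("cuda") moves mask too
out = model(torch.randn(4, 16))
```

register_buffer is the idiomatic choice when the tensor is model state rather than something derived from the input.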
Fix 3: CUDA Device-Side Assert — Finding the Real Error
This is PyTorch’s most confusing error pattern. The actual cause is hidden because CUDA runs asynchronously:

```text
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,
so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```

Step 1: Make CUDA synchronous to get the real stack trace:

```shell
# Linux / macOS
CUDA_LAUNCH_BLOCKING=1 python train.py

# Windows PowerShell
$env:CUDA_LAUNCH_BLOCKING=1
python train.py
```

With CUDA_LAUNCH_BLOCKING=1, the error appears at the correct line.
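If you'd rather set the variable from inside the script, note that it must be in place before CUDA initializes; setting it before importing torch is the safe ordering:

```python
import os

# Must be set before the first CUDA call; before `import torch` is safest
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # Imported only after the variable is in place
```

Setting it after tensors have already touched the GPU has no effect, which is a common reason the trick "doesn't work".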
Step 2: The real error is almost always invalid class indices in CrossEntropyLoss.

CrossEntropyLoss expects labels in [0, num_classes - 1]. If any label is negative or ≥ num_classes, CUDA triggers the assert:

```python
num_classes = 5
criterion = nn.CrossEntropyLoss()

# WRONG — label 5 is out of range for num_classes=5
labels = torch.tensor([0, 2, 5], device="cuda")  # 5 >= num_classes
logits = model(X)
loss = criterion(logits, labels)  # device-side assert

# CORRECT — labels must be in [0, 4]
labels = torch.tensor([0, 2, 4], device="cuda")
loss = criterion(logits, labels)  # works
```

Add a validation check before the loss call during debugging:

```python
assert labels.min() >= 0, f"Negative label: {labels.min()}"
assert labels.max() < num_classes, f"Label {labels.max()} >= num_classes {num_classes}"
assert labels.dtype == torch.long, f"Labels must be torch.long, got {labels.dtype}"
```

Other common causes of device-side asserts: index out of bounds in torch.gather() or torch.index_select(), and NaN values in torch.log() or torch.sqrt().
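The same checks can be run once over the whole dataset before training starts, so bad labels surface as a clear Python error instead of a device-side assert mid-epoch. This sketch assumes a standard (features, labels) DataLoader:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def validate_labels(loader, num_classes):
    """Raise a readable error if any label falls outside [0, num_classes)."""
    for _, labels in loader:
        bad = (labels < 0) | (labels >= num_classes)
        if bad.any():
            raise ValueError(f"Out-of-range labels found: {labels[bad].tolist()}")

dataset = TensorDataset(torch.randn(100, 8), torch.randint(0, 5, (100,)))
validate_labels(DataLoader(dataset, batch_size=32), num_classes=5)  # Passes
```

One pass over the DataLoader is cheap compared to debugging an asynchronous CUDA assert.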
Fix 4: torch.cuda.is_available() Returns False
After installing PyTorch, this is the first check:

```python
import torch
print(torch.cuda.is_available())  # False — why?
print(torch.version.cuda)         # None — CPU-only build installed
```

If torch.version.cuda is None, you have a CPU-only PyTorch build installed. On some systems, a plain pip install torch resolves to the CPU variant.
Check your system CUDA version:

```shell
nvcc --version
# nvcc: NVIDIA (R) Cuda compiler driver, release 12.1
```

Reinstall PyTorch with a build that matches it:

```shell
# Clear old installation and cache
pip uninstall torch torchvision torchaudio -y
pip cache purge

# CUDA 12.1
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# CUDA 11.8
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Verify
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
```

The --index-url flag is required — without it, pip resolves to the CPU build from PyPI.
If nvcc is not found, the CUDA toolkit may not be installed (only the driver is). NVIDIA drivers and the CUDA toolkit are separate packages. Check GPU detection with nvidia-smi — if that works, you have the driver but may be missing the toolkit. The same driver verification steps apply when running local LLMs — see Ollama not working for GPU detection diagnostics you can run independently of PyTorch.
Common Mistake: Having a CUDA 12.x system but installing a PyTorch build for CUDA 11.x. The minor version mismatch is often acceptable (e.g., PyTorch cu121 on a CUDA 12.4 system), but a major version mismatch is not.
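A quick diagnostic sketch that prints everything relevant to the build/driver match in one place:

```python
import torch

print("torch version:   ", torch.__version__)
print("built with CUDA: ", torch.version.cuda)        # None = CPU-only wheel
print("CUDA available:  ", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:          ", torch.cuda.get_device_name(0))
    print("compute cap.:    ", torch.cuda.get_device_capability(0))
```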
Fix 5: Inplace Operation Breaks Gradient Computation
```text
RuntimeError: one of the variables needed for gradient computation has been
modified by an inplace operation: [torch.cuda.FloatTensor [128, 10]], which is
output 0 of LinearBackward0, is at version 2; expected version 0.
```

PyTorch tracks a version counter on every tensor. Inplace operations (+=, [i] = x, .fill_()) increment the version. If the version doesn’t match what the autograd graph recorded, backprop fails.
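You can watch the counter directly via the internal `_version` attribute; useful for understanding the error message, not for production code:

```python
import torch

x = torch.randn(3)
print(x._version)  # 0: freshly created
x.add_(1)          # Inplace operation
print(x._version)  # 1: incremented by the inplace modification
x = x + 1          # Out-of-place: a brand-new tensor
print(x._version)  # 0: new tensors start at version 0
```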
Replace inplace operations with their out-of-place equivalents:
```python
# WRONG — inplace on a tensor that autograd saved for backward
a = torch.randn(10, requires_grad=True)
b = torch.sigmoid(a)   # sigmoid's backward needs its output b
b += 1                 # Inplace: b moves from version 0 to version 1
b.sum().backward()     # Error: version mismatch

# CORRECT — out-of-place creates a new tensor
a = torch.randn(10, requires_grad=True)
b = torch.sigmoid(a)
b = b + 1              # New tensor; the saved output is untouched
b.sum().backward()     # Works
```

Common pattern in RNN loops — collecting hidden states:
```python
# WRONG — assigning into a pre-allocated tensor
outputs = torch.zeros(seq_len, batch_size, hidden_dim)
for t in range(seq_len):
    h = cell(inputs[t], h)
    outputs[t] = h  # Inplace write into outputs

# CORRECT — accumulate in a list, stack at the end
outputs = []
for t in range(seq_len):
    h = cell(inputs[t], h)
    outputs.append(h)
outputs = torch.stack(outputs)  # Assembles without inplace ops
```

If you need to write into a pre-allocated buffer and the tensor doesn’t need gradients (output storage, not computation), use .detach() before assignment:
```python
buffer[t] = h.detach()  # Detached — safe to write inplace
```

Fix 6: DataLoader Crashes on Windows — num_workers Error
```text
RuntimeError: An attempt has been made to start a new process before the current
process has finished its bootstrapping phase. This probably means that you are on
Windows and you forgot to use the proper idiom in the main module:

    if __name__ == '__main__':
        ...
```

Windows spawns new processes by re-importing the entire script, which causes recursive spawning. Linux forks processes and avoids this.
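You can confirm which start method your interpreter uses; "spawn" (the Windows default, and macOS since Python 3.8) re-imports the main module, while "fork" (the Linux default) does not:

```python
import multiprocessing

print(multiprocessing.get_start_method())  # "fork", "spawn", or "forkserver"
```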
Fix: wrap everything in if __name__ == '__main__':

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def train():
    dataset = TensorDataset(torch.randn(1000, 100), torch.randint(0, 10, (1000,)))
    loader = DataLoader(dataset, batch_size=32, num_workers=4, shuffle=True)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = torch.nn.Linear(100, 10).to(device)
    for X, y in loader:
        X, y = X.to(device), y.to(device)
        output = model(X)

if __name__ == "__main__":  # Required on Windows
    train()
```

Quick fix for notebooks or scripts where restructuring is inconvenient:
```python
loader = DataLoader(dataset, batch_size=32, num_workers=0)  # Disables multiprocessing
```

num_workers=0 runs data loading in the main process. It’s slower for I/O-bound datasets but avoids the Windows spawn issue entirely.
For DataLoader multiprocessing behavior differences between platforms, see Python multiprocessing not working.
Fix 7: Dtype Mismatch — Float32 vs. Float64
```text
RuntimeError: expected scalar type Float but found Double
RuntimeError: mat1 and mat2 must have the same dtype, but got Double and Float
```

PyTorch model parameters default to float32. NumPy arrays default to float64. Converting a NumPy array to a tensor without specifying the dtype preserves float64:
```python
import numpy as np
import torch

data = np.array([[1.0, 2.0, 3.0]])  # float64 by default
tensor = torch.from_numpy(data)     # still float64 (Double)

model = torch.nn.Linear(3, 1)       # float32
output = model(tensor)              # RuntimeError: float32 ≠ float64
```

Fix: convert to float32 explicitly:
```python
# Option 1 — call .float() to convert to float32
tensor = torch.from_numpy(data).float()

# Option 2 — specify dtype during tensor creation
tensor = torch.tensor(data, dtype=torch.float32)

# Option 3 — convert the numpy array first
data = data.astype(np.float32)
tensor = torch.from_numpy(data)
```

If the mismatch is inside your model (e.g., a custom layer creates float64 constants):
```python
class MyLayer(nn.Module):
    def forward(self, x):
        # WRONG — a NumPy float64 scalar produces a float64 tensor
        scale = torch.tensor(np.float64(np.pi))

        # CORRECT — match the input's dtype
        scale = torch.tensor(np.pi, dtype=x.dtype)
        return x * scale
```

Fix 8: NaN Loss
Loss becoming NaN mid-training and never recovering is almost always one of three causes: a numerical singularity in the loss function, exploding gradients, or a learning rate that’s too high.
First: check where NaN enters. Add detection to your training loop:
```python
for step, (X, y) in enumerate(dataloader):
    X, y = X.to(device), y.to(device)
    optimizer.zero_grad()
    output = model(X)
    loss = criterion(output, y)

    if torch.isnan(loss):
        print(f"NaN at step {step}")
        print(f"  output range: [{output.min():.3f}, {output.max():.3f}]")
        print(f"  output has NaN: {torch.isnan(output).any()}")
        break  # Stop before NaN propagates into weights

    loss.backward()
    optimizer.step()
```

Gradient clipping is the standard fix for exploding gradients. Apply it after loss.backward() and before optimizer.step():
```python
loss.backward()
# Clip all parameter gradients to a maximum L2 norm of 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```

clip_grad_norm_ scales the entire gradient vector down if its norm exceeds max_norm. Values between 0.5 and 5.0 are common depending on the architecture.
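To choose max_norm empirically: clip_grad_norm_ returns the total norm it measured before clipping, so you can log it for a few hundred steps and pick a threshold just above the typical value. A toy Linear model stands in for yours here:

```python
import torch

model = torch.nn.Linear(10, 2)          # Stand-in for your model
loss = model(torch.randn(4, 10)).sum()
loss.backward()

# Returns the pre-clip total norm even while clipping to max_norm
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(f"pre-clip gradient norm: {float(total_norm):.3f}")
```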
Learning rate is the other frequent cause. If NaN appears in the first few steps, try reducing lr by 10x:
```python
# Start conservative
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Rather than
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
```

With mixed precision, GradScaler handles the numerical instability of float16 automatically. If you’re using AMP without GradScaler, the combination of float16 and large gradients is a common NaN source:
```python
from torch.amp import autocast, GradScaler

scaler = GradScaler(device="cuda")
for X, y in dataloader:
    optimizer.zero_grad()
    with autocast(device_type="cuda", dtype=torch.float16):
        output = model(X.to(device))
        loss = criterion(output, y.to(device))
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # Unscale before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)
    scaler.update()
```

Call scaler.unscale_(optimizer) before clip_grad_norm_ — otherwise you’re clipping the scaled gradients, not the true ones.
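If none of the above reveals where the NaN originates, anomaly detection makes backward() fail at the exact operation that produced it. It adds a large runtime cost, so it is for debugging only; the sqrt of a negative number below is a deliberately broken example:

```python
import torch

with torch.autograd.set_detect_anomaly(True):
    x = torch.tensor([-1.0], requires_grad=True)
    y = torch.sqrt(x)          # sqrt of a negative number: NaN
    try:
        y.sum().backward()     # Raises, naming the backward op that made NaN
    except RuntimeError as e:
        print("caught:", e)
```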
Still Not Working?
Model Not Training — Loss Not Decreasing
If loss prints but never changes, check these in order:
- optimizer.zero_grad() is missing — gradients accumulate across steps and explode
- loss.backward() is called on the wrong tensor — you’re differentiating a detached or constant value
- Model is in eval mode — model.eval() disables dropout and switches batchnorm to running statistics; call model.train() at the start of each epoch
```python
model.train()  # Back to training mode
optimizer.zero_grad()
loss = criterion(model(X), y)
loss.backward()
optimizer.step()
```

Slow Training — GPU Utilization Low
If nvidia-smi shows low GPU utilization (<50%), the bottleneck is usually the CPU data pipeline:
```python
loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=4,            # Parallel data loading (Linux/macOS)
    pin_memory=True,          # Faster host-to-GPU transfer
    persistent_workers=True,  # Keep worker processes alive between epochs
    prefetch_factor=2,        # Pre-load 2 batches per worker
)
```

pin_memory=True pins CPU memory for faster CUDA transfers. Only use it when training on GPU.
RuntimeError: Expected input batch_size to match target batch_size
The batch dimension doesn’t match between your model output and your labels. A common trigger: a reshape inside the model that hardcodes the batch size (e.g., x.view(32, -1)), which silently mangles the smaller last batch of an epoch. Either reshape with x.view(x.size(0), -1) so any batch size works, or drop the short batch entirely:

```python
loader = DataLoader(dataset, batch_size=32, drop_last=True)
```

Checking PyTorch Compile Status
torch.compile() (introduced in PyTorch 2.0) can speed up training by up to 2x but adds a one-time compilation overhead on the first batch. If the compiled model crashes but the eager model works, disable compile to isolate the issue:
```python
model = MyModel().to(device)
# model = torch.compile(model)  # Comment out to debug

for X, y in dataloader:
    loss = criterion(model(X.to(device)), y.to(device))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

For Python-level concurrency errors that surface in training pipelines, see Python multiprocessing not working.
If you’re building LLM pipelines that call PyTorch models through a chain or agent, see LangChain Python not working for the integration patterns between LangChain and custom torch inference code.
Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.