
Fix: PyTorch Not Working — CUDA Out of Memory, Device Mismatch, and NaN Loss

FixDevs

Quick Answer

How to fix PyTorch errors — CUDA out of memory, expected all tensors on same device, CUDA device-side assert triggered, torch.cuda.is_available() False, inplace gradient errors, DataLoader Windows crash, dtype mismatch, and NaN loss.

The Error

You start training and GPU memory fills up:

RuntimeError: CUDA out of memory. Tried to allocate 2.50 GiB
(GPU 0; 8.00 GiB total capacity; 6.73 GiB already allocated;
1.03 GiB free; 6.89 GiB reserved in total by PyTorch)

Or a forward pass crashes with a device error:

RuntimeError: Expected all tensors to be on the same device,
but found at least two devices, cuda:0 and cpu!

Or CUDA gives you a cryptic error that points to the wrong line:

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,
so the stacktrace below might be incorrect.

Or training runs fine for a few steps and then loss becomes NaN and stays there:

Epoch 1, Step 50: loss = 0.3421
Epoch 1, Step 51: loss = nan
Epoch 1, Step 52: loss = nan  # Never recovers

Each of these is a distinct failure with a specific fix.

Why This Happens

PyTorch is strict about where tensors live (CPU vs. GPU device index), what type they are (float32 vs. float64), and how they’re modified (inplace vs. out-of-place). The GPU introduces additional failure modes: CUDA errors are reported asynchronously by default, which means the stack trace you see often points to the wrong line. GPU memory is managed by a caching allocator that doesn’t always release memory when you expect it to.

Understanding these mechanics makes the errors predictable rather than mysterious.

Fix 1: CUDA Out of Memory

The error tells you how much was requested, how much is allocated, and how much is free. The gap between “allocated” and “reserved” is cached memory that PyTorch holds but isn’t actively using:

Tried to allocate 2.50 GiB
Already allocated: 6.73 GiB
Free: 1.03 GiB               ← not enough for the request
Reserved: 6.89 GiB           ← includes cached but inactive memory
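You can read both numbers programmatically to see which side of the gap you're on — a quick check, assuming a CUDA-capable machine:

```python
import torch

if torch.cuda.is_available():
    allocated = torch.cuda.memory_allocated() / 1024**3  # held by live tensors
    reserved = torch.cuda.memory_reserved() / 1024**3    # live tensors + cache
    print(f"allocated {allocated:.2f} GiB / reserved {reserved:.2f} GiB")
    # reserved - allocated is cache that torch.cuda.empty_cache() can release
```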

Option 1: Gradient accumulation (keeps per-step memory low by running smaller batches while preserving the effective batch size):

from torch.amp import autocast, GradScaler

accumulation_steps = 4   # Effective batch = actual_batch × 4
scaler = GradScaler(device="cuda")

optimizer.zero_grad()
for i, (X, y) in enumerate(dataloader):
    X, y = X.to(device), y.to(device)

    with autocast(device_type="cuda", dtype=torch.float16):
        loss = criterion(model(X), y) / accumulation_steps

    scaler.scale(loss).backward()

    if (i + 1) % accumulation_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()

Dividing the loss by accumulation_steps keeps gradients on the same scale as if you’d used the full batch.

Option 2: Mixed precision — halves memory usage for activations and most parameters by running the forward pass in float16:

from torch.amp import autocast, GradScaler

scaler = GradScaler(device="cuda")

for X, y in dataloader:
    X, y = X.to(device), y.to(device)
    optimizer.zero_grad()

    with autocast(device_type="cuda", dtype=torch.float16):
        output = model(X)
        loss = criterion(output, y)

    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

GradScaler prevents float16 underflow during the backward pass. Don’t skip it.
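If your GPU supports bfloat16 (Ampere or newer), a simpler variant is possible: bfloat16 has the same dynamic range as float32, so underflow isn't a concern and no GradScaler is needed. A minimal self-contained sketch (the toy model and data are illustrative):

```python
import torch
from torch import nn
from torch.amp import autocast

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(10, 2).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

X = torch.randn(8, 10, device=device)
y = torch.randint(0, 2, (8,), device=device)

optimizer.zero_grad()
with autocast(device_type=device.type, dtype=torch.bfloat16):
    loss = criterion(model(X), y)

loss.backward()   # No scaler: bfloat16 can't underflow the way float16 does
optimizer.step()
```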

Option 3: Call torch.cuda.empty_cache() — but understand what it does and doesn’t do:

torch.cuda.empty_cache()

This releases cached (reserved but inactive) memory back to the GPU driver so other processes can use it. It does not free memory that’s still held by live tensors. If allocated is the problem, not reserved, empty_cache() won’t help. Use torch.cuda.memory_summary() to see which category your memory is in:

print(torch.cuda.memory_summary(device=0))

Option 4: Reduce batch size — the simplest fix when a smaller batch is acceptable for your training setup. As a rough guide, halving the batch size frees roughly half the activation memory.
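If you'd rather find the limit empirically, you can catch the OOM and halve the batch until one training step fits. This is a sketch — find_max_batch is a hypothetical helper, not a PyTorch API, and the 10-class dummy labels are an assumption to adapt to your task:

```python
import torch

def find_max_batch(model, criterion, sample_shape, start=256, device="cuda"):
    """Halve the batch size until one forward+backward pass fits in memory."""
    batch = start
    while batch >= 1:
        try:
            X = torch.randn(batch, *sample_shape, device=device)
            y = torch.randint(0, 10, (batch,), device=device)
            criterion(model(X), y).backward()
            model.zero_grad()
            return batch
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()  # Drop the cache before the smaller retry
            batch //= 2
    raise RuntimeError("Even batch size 1 does not fit")
```

Assumes the model is already on the target device; call it once at startup and build your DataLoader with the returned batch size.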

Pro Tip: Wrap your validation loop in torch.no_grad(). Every tensor created in a forward pass without no_grad() saves its intermediate activations for backprop, roughly doubling memory usage compared to inference:

model.eval()
with torch.no_grad():
    for X, y in val_loader:
        output = model(X.to(device))
        # No gradients tracked — significantly less memory

Fix 2: Device Mismatch — All Tensors Must Be on the Same Device

RuntimeError: Expected all tensors to be on the same device,
but found at least two devices, cuda:0 and cpu!

The three most common places this happens:

1. Input tensors not moved to GPU:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

for X, y in dataloader:
    # WRONG — X and y are on CPU, model is on cuda:0
    output = model(X)

    # CORRECT — move inputs before every forward pass
    X, y = X.to(device), y.to(device)
    output = model(X)

2. Loading a checkpoint without map_location:

# WRONG — loads tensors to the device they were saved on
checkpoint = torch.load("model.pth")

# CORRECT — redirect to wherever you need them
checkpoint = torch.load("model.pth", map_location=device, weights_only=True)
model.load_state_dict(checkpoint["model_state_dict"])

weights_only=True avoids unpickling arbitrary Python objects from checkpoint files; it is recommended throughout PyTorch 2.x and became the default in PyTorch 2.6.

3. Tensors created inside a model without inheriting the device:

class MyModel(nn.Module):
    def forward(self, x):
        # WRONG — hardcoded CPU tensor
        mask = torch.ones(x.shape[0], x.shape[1])

        # CORRECT — match the device of the input tensor
        mask = torch.ones(x.shape[0], x.shape[1], device=x.device)
        return x * mask

Any tensor you create inside a forward() method must explicitly specify device=x.device or be created via operations on existing tensors (which inherit the device automatically).
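When you can't spot the offending tensor by reading the code, a quick diagnostic is to collect every device used by the model's parameters and buffers — a small sketch using standard nn.Module APIs (model_devices is a hypothetical helper name):

```python
import torch
from torch import nn

def model_devices(model: nn.Module) -> set:
    """Return the set of devices holding the model's parameters and buffers."""
    devices = {p.device for p in model.parameters()}
    devices |= {b.device for b in model.buffers()}  # e.g. batchnorm running stats
    return devices

model = nn.Sequential(nn.Linear(4, 4), nn.BatchNorm1d(4))
print(model_devices(model))  # More than one entry means a partially-moved model
```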

Fix 3: CUDA Device-Side Assert — Finding the Real Error

This is PyTorch’s most confusing error pattern. The actual cause is hidden because CUDA runs asynchronously:

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,
so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Step 1: Make CUDA synchronous to get the real stack trace:

# Linux / macOS
CUDA_LAUNCH_BLOCKING=1 python train.py

# Windows PowerShell
$env:CUDA_LAUNCH_BLOCKING=1
python train.py

With CUDA_LAUNCH_BLOCKING=1, the error appears at the correct line.

Step 2: The real error is almost always invalid class indices in CrossEntropyLoss.

CrossEntropyLoss expects labels in [0, num_classes - 1]. If any label equals num_classes or is negative, CUDA triggers the assert:

num_classes = 5
criterion = nn.CrossEntropyLoss()

# WRONG — label 5 is out of range for num_classes=5
labels = torch.tensor([0, 2, 5], device="cuda")  # 5 >= num_classes
logits = model(X)
loss = criterion(logits, labels)  # device-side assert

# CORRECT — labels must be in [0, 4]
labels = torch.tensor([0, 2, 4], device="cuda")
loss = criterion(logits, labels)  # works

Add a validation check before the loss call during debugging:

assert labels.min() >= 0, f"Negative label: {labels.min()}"
assert labels.max() < num_classes, f"Label {labels.max()} >= num_classes {num_classes}"
assert labels.dtype == torch.long, f"Labels must be torch.long, got {labels.dtype}"

Other common causes of device-side asserts: index out of bounds in torch.gather() or torch.index_select(), and NaN values in torch.log() or torch.sqrt().
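For the torch.log() / torch.sqrt() case, the usual guard is clamping the input to a small positive epsilon before the call (1e-8 is a common convention, not a PyTorch default):

```python
import torch

probs = torch.tensor([0.9, 0.0, 0.1])        # the 0.0 would make log() return -inf
safe_log = torch.log(probs.clamp(min=1e-8))  # clamp keeps the result finite
print(safe_log)
```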

Fix 4: torch.cuda.is_available() Returns False

After installing PyTorch, this is the first check:

import torch
print(torch.cuda.is_available())  # False — why?
print(torch.version.cuda)         # None — CPU-only build installed

If torch.version.cuda is None, you have a CPU-only PyTorch build installed. On some platforms (historically Windows), a plain pip install torch resolves to the CPU variant.

Check your system CUDA version:

nvcc --version
# nvcc: NVIDIA (R) Cuda compiler driver, release 12.1

Reinstall PyTorch with the correct CUDA version. Find your CUDA version from nvcc --version and match it:

# Clear old installation and cache
pip uninstall torch torchvision torchaudio -y
pip cache purge

# CUDA 12.1
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# CUDA 11.8
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Verify
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"

The --index-url flag is required — without it, pip resolves to the CPU build from PyPI.

If nvcc is not found, the CUDA toolkit may not be installed (only the driver is). NVIDIA drivers and the CUDA toolkit are separate packages. Check GPU detection with nvidia-smi — if that works, you have the driver but may be missing the toolkit. The same driver verification steps apply when running local LLMs — see Ollama not working for GPU detection diagnostics you can run independently of PyTorch.

Common Mistake: Having a CUDA 12.x system but installing a PyTorch build for CUDA 11.x. The minor version mismatch is often acceptable (e.g., PyTorch cu121 on a CUDA 12.4 system), but a major version mismatch is not.

Fix 5: Inplace Operation Breaks Gradient Computation

RuntimeError: one of the variables needed for gradient computation has been
modified by an inplace operation: [torch.cuda.FloatTensor [128, 10]], which is
output 0 of LinearBackward0, is at version 2; expected version 0.

PyTorch tracks a version counter on every tensor. Inplace operations (+=, [i] = x, .fill_()) increment the version. If the version doesn’t match what the autograd graph recorded, backprop fails.
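You can watch the counter move through the internal _version attribute (undocumented and internal, but handy when debugging):

```python
import torch

x = torch.zeros(3)
print(x._version)  # 0 — fresh tensor
x.add_(1)          # an inplace op bumps the counter
print(x._version)  # 1
x[0] = 5.0         # indexed assignment is also inplace
print(x._version)  # 2
```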

Replace inplace operations with their out-of-place equivalents:

# WRONG — inplace on a tensor that requires grad
x = torch.randn(10, requires_grad=True)
x += 1      # Inplace: modifies x at version 0 → version 1
y = x * 2
y.sum().backward()  # Error: x already modified

# CORRECT — out-of-place creates a new tensor
x = torch.randn(10, requires_grad=True)
x = x + 1  # New tensor, x is untouched
y = x * 2
y.sum().backward()  # Works

Common pattern in RNN loops — collecting hidden states:

# WRONG — assigning into a pre-allocated tensor
outputs = torch.zeros(seq_len, batch_size, hidden_dim)
for t in range(seq_len):
    h = cell(inputs[t], h)
    outputs[t] = h  # Inplace write into outputs

# CORRECT — accumulate in a list, stack at the end
outputs = []
for t in range(seq_len):
    h = cell(inputs[t], h)
    outputs.append(h)
outputs = torch.stack(outputs)  # Assembles without inplace ops

If you need to write into a pre-allocated buffer and the tensor doesn’t need gradients (output storage, not computation), use .detach() before assignment:

buffer[t] = h.detach()  # Detached — safe to write inplace

Fix 6: DataLoader Crashes on Windows — num_workers Error

RuntimeError: An attempt has been made to start a new process before the current
process has finished its bootstrapping phase. This probably means that you are on
Windows and you forgot to use the proper idiom in the main module:

    if __name__ == '__main__':
        ...

Windows spawns worker processes by re-importing the entire script; without a guard, that re-import re-creates the DataLoader and spawns recursively. Linux forks processes and avoids this.

Fix: wrap everything in if __name__ == '__main__':

import torch
from torch.utils.data import DataLoader, TensorDataset

def train():
    dataset = TensorDataset(torch.randn(1000, 100), torch.randint(0, 10, (1000,)))
    loader = DataLoader(dataset, batch_size=32, num_workers=4, shuffle=True)

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = torch.nn.Linear(100, 10).to(device)

    for X, y in loader:
        X, y = X.to(device), y.to(device)
        output = model(X)

if __name__ == "__main__":    # Required on Windows
    train()

Quick fix for notebooks or scripts where restructuring is inconvenient:

loader = DataLoader(dataset, batch_size=32, num_workers=0)  # Disables multiprocessing

num_workers=0 runs data loading in the main process. It’s slower for I/O-bound datasets but avoids the Windows spawn issue entirely.
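A portable pattern is to pick num_workers from the platform at startup — the counts below are reasonable defaults, not official recommendations:

```python
import os
import platform

# Worker processes fork cheaply on Linux but must spawn (re-import) on Windows
if platform.system() == "Windows":
    num_workers = 0                            # Load data in the main process
else:
    num_workers = min(4, os.cpu_count() or 1)  # Parallel loading elsewhere

print(num_workers)  # Pass this to DataLoader(num_workers=...)
```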

For DataLoader multiprocessing behavior differences between platforms, see Python multiprocessing not working.

Fix 7: Dtype Mismatch — Float32 vs. Float64

RuntimeError: expected scalar type Float but found Double
RuntimeError: mat1 and mat2 must have the same dtype, but got Double and Float

PyTorch model parameters default to float32. NumPy arrays default to float64. Converting NumPy to a tensor without specifying the dtype preserves float64:

import numpy as np
import torch

data = np.array([[1.0, 2.0, 3.0]])   # float64 by default
tensor = torch.from_numpy(data)       # still float64 (Double)

model = torch.nn.Linear(3, 1)        # float32
output = model(tensor)               # RuntimeError: float32 ≠ float64

Fix: convert to float32 explicitly:

# Option 1 — call .float() to convert to float32
tensor = torch.from_numpy(data).float()

# Option 2 — specify dtype during tensor creation
tensor = torch.tensor(data, dtype=torch.float32)

# Option 3 — convert the numpy array first
data = data.astype(np.float32)
tensor = torch.from_numpy(data)

If the mismatch is inside your model (e.g., a custom layer creates float64 constants):

class MyLayer(nn.Module):
    def forward(self, x):
        # WRONG — np.sqrt returns a NumPy float64 scalar, and
        # torch.tensor() preserves that dtype
        scale = torch.tensor(np.sqrt(2.0))

        # CORRECT — match the input's dtype
        scale = torch.tensor(np.sqrt(2.0), dtype=x.dtype)
        return x * scale

Fix 8: NaN Loss

Loss becoming NaN mid-training and never recovering is almost always one of three causes: a numerical singularity in the loss function, exploding gradients, or a learning rate that’s too high.

First: check where NaN enters. Add detection to your training loop:

for step, (X, y) in enumerate(dataloader):
    X, y = X.to(device), y.to(device)
    optimizer.zero_grad()

    output = model(X)
    loss = criterion(output, y)

    if torch.isnan(loss):
        print(f"NaN at step {step}")
        print(f"  output range: [{output.min():.3f}, {output.max():.3f}]")
        print(f"  output has NaN: {torch.isnan(output).any()}")
        break  # Stop before NaN propagates into weights

    loss.backward()
    optimizer.step()

Gradient clipping is the standard fix for exploding gradients. Apply it after loss.backward() and before optimizer.step():

loss.backward()

# Clip all parameter gradients to a maximum L2 norm of 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

optimizer.step()

clip_grad_norm_ scales the entire gradient vector down if its norm exceeds max_norm. Values between 0.5 and 5.0 are common depending on the architecture.
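clip_grad_norm_ also returns the total norm it measured before clipping, which makes it easy to log spikes and choose a sensible max_norm for your model (the tiny model here is illustrative):

```python
import torch
from torch import nn

model = nn.Linear(10, 1)
loss = model(torch.randn(4, 10)).sum()
loss.backward()

# Returns the combined L2 norm of all gradients, measured before clipping
total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(f"grad norm before clipping: {float(total_norm):.3f}")
```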

Learning rate is the other frequent cause. If NaN appears in the first few steps, try reducing lr by 10x:

# Start conservative
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Rather than
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
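A few hundred steps of linear learning-rate warmup also guards against early blowups; a sketch using the built-in LambdaLR (warmup_steps = 500 is an illustrative choice, not a recommendation):

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

warmup_steps = 500
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lambda step: min(1.0, (step + 1) / warmup_steps),  # ramp linearly up to lr
)
# In the training loop: optimizer.step(), then scheduler.step() once per batch
```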

With mixed precision, GradScaler handles the numerical instability of float16 automatically. If you’re using AMP without GradScaler, the combination of float16 and large gradients is a common NaN source:

from torch.amp import autocast, GradScaler

scaler = GradScaler(device="cuda")

for X, y in dataloader:
    optimizer.zero_grad()

    with autocast(device_type="cuda", dtype=torch.float16):
        output = model(X.to(device))
        loss = criterion(output, y.to(device))

    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                             # Unscale before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)
    scaler.update()

Call scaler.unscale_(optimizer) before clip_grad_norm_ — otherwise you’re clipping the scaled gradients, not the true ones.

Still Not Working?

Model Not Training — Loss Not Decreasing

If loss prints but never changes, check these in order:

  1. optimizer.zero_grad() is missing — gradients accumulate across steps, so each update applies the sum of every previous gradient
  2. loss.backward() is called on the wrong tensor — you’re differentiating a detached or constant value
  3. Model is in eval mode — model.eval() disables dropout and switches batchnorm to its running statistics; call model.train() at the start of each epoch:
model.train()           # Back to training mode
optimizer.zero_grad()
loss = criterion(model(X), y)
loss.backward()
optimizer.step()

Slow Training — GPU Utilization Low

If nvidia-smi shows low GPU utilization (<50%), the bottleneck is usually the CPU data pipeline:

loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=4,           # Parallel data loading (Linux/macOS)
    pin_memory=True,         # Faster host-to-GPU transfer
    persistent_workers=True, # Keep worker processes alive between epochs
    prefetch_factor=2,       # Pre-load 2 batches per worker
)

pin_memory=True pins CPU memory for faster CUDA transfers. Only use it when training on GPU.
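pin_memory pairs with non_blocking=True on the host-to-device copy, letting the transfer overlap with GPU compute. A self-contained sketch (the dummy dataset is illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dataset = TensorDataset(torch.randn(100, 10), torch.randint(0, 2, (100,)))
loader = DataLoader(dataset, batch_size=32,
                    pin_memory=torch.cuda.is_available())

for X, y in loader:
    # non_blocking only overlaps when the source is pinned and the target is
    # CUDA; on CPU it is a harmless no-op
    X = X.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
```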

RuntimeError: Expected input batch_size to match target batch_size

The batch dimension doesn’t match between your model output and your labels. Common cause: the last batch in an epoch has fewer samples than batch_size. Either use drop_last=True in your DataLoader or make your loss function handle variable batch sizes (most built-in losses do):

loader = DataLoader(dataset, batch_size=32, drop_last=True)

Checking PyTorch Compile Status

torch.compile() (introduced in PyTorch 2.0) can speed up training by up to 2x but adds a one-time compilation overhead on the first batch. If the compiled model crashes but the eager model works, disable compile to isolate the issue:

model = MyModel().to(device)
# model = torch.compile(model)  # Comment out to debug

for X, y in dataloader:
    loss = criterion(model(X.to(device)), y.to(device))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

For Python-level concurrency errors that surface in training pipelines, see Python multiprocessing not working.

If you’re building LLM pipelines that call PyTorch models through a chain or agent, see LangChain Python not working for the integration patterns between LangChain and custom torch inference code.


FixDevs

Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.
