
Fix: Hugging Face Transformers Not Working — OSError, CUDA OOM, and Generation Errors

FixDevs ·

Quick Answer

This guide covers the most common Hugging Face Transformers failures: OSError: Can't load tokenizer, gated repo access (403), CUDA out of memory with device_map="auto", missing accelerate or bitsandbytes, tokenizer padding mismatches, the pad_token_id warning, and LoRA adapter loading errors.

The Error

You try to load a model and get this immediately:

OSError: Can't load tokenizer for 'meta-llama/Llama-2-7b-hf'.

Or the model repo exists but requires a token:

huggingface_hub.errors.GatedRepoError: 403 Client Error. Cannot access gated repo.
Make sure to have access and pass a valid token.

Or you get past loading and run out of VRAM:

RuntimeError: CUDA out of memory. Tried to allocate 14.00 GiB
(GPU 0; 8.00 GiB total capacity; 7.23 GiB already allocated)

Or generation produces a wall of warnings and then nonsense output:

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
# followed by repetitive or incoherent text

Most of these trace back to a handful of configuration steps that aren’t obvious from the README.

Why This Happens

The Transformers library separates model downloading, model loading, and model execution into distinct phases — each with its own failure modes. Downloads fail when authentication is missing or the cache is stale. Loading fails when VRAM is insufficient without memory distribution. Generation fails when tokenizers aren’t configured for the model type (GPT-style models lack a padding token by default). Understanding which phase failed narrows the fix immediately.
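As a triage aid, the phase can often be identified from the exception type alone. A minimal sketch (the helper name and phase labels are my own, not part of the Transformers API):

```python
def classify_failure(exc: BaseException) -> str:
    """Map a Transformers exception to the phase that likely failed."""
    name = type(exc).__name__
    msg = str(exc)
    if name in ("GatedRepoError", "RepositoryNotFoundError") or "401" in msg or "403" in msg:
        return "download"   # authentication / access problem -> Fix 1
    if name == "OSError" and "load" in msg:
        return "download"   # stale cache or wrong repo path -> Fix 1
    if "out of memory" in msg:
        return "load"       # insufficient VRAM -> Fix 2
    if name in ("ImportError", "ModuleNotFoundError"):
        return "load"       # missing accelerate / bitsandbytes -> Fix 3
    return "generate"       # tokenizer or generation config -> Fixes 4-7

print(classify_failure(RuntimeError("CUDA out of memory")))  # load
```

Each error quoted above routes to one of the fixes below; the function is just that decision table written down.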

Fix 1: OSError / GatedRepoError — Authentication and Access

Most model loading errors fall into two categories: the model doesn’t exist at that path, or it exists but requires authentication.

Gated models (Llama, Gemma, some Mistral variants) require you to accept terms on the Hugging Face website before the API grants access. After accepting, you need an API token:

# Set your token as an environment variable
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxx"  # Linux/macOS

# Windows PowerShell
$env:HF_TOKEN = "hf_xxxxxxxxxxxxxxxxxxxx"

Then load with the token:

from transformers import AutoTokenizer, AutoModelForCausalLM

# Option 1: token=True uses your stored credential (HF_TOKEN env var or a saved login)
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-3.2-8B",
    token=True
)

# Option 2: pass the token directly
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-3.2-8B",
    token="hf_xxxxxxxxxxxxxxxxxxxx"
)

# Option 3: log in once and reuse (saves token to ~/.cache/huggingface/token)
from huggingface_hub import login
login(token="hf_xxxxxxxxxxxxxxxxxxxx")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

For private repositories (your own fine-tuned models), the same token parameter applies. The error message says “Unauthorized” rather than “Gated” in this case.

For OSError: Can't load tokenizer on a public model — the most common cause is a cached partial download. Clear the cache for that model:

huggingface-cli delete-cache
# Or delete the specific model directory:
rm -rf ~/.cache/huggingface/hub/models--bert-base-uncased

Then re-run the load.

For offline environments, load from a local directory after downloading once:

# First: download to a local path
from huggingface_hub import snapshot_download
snapshot_download(repo_id="bert-base-uncased", local_dir="./models/bert")

# Then: load from local path (no network access needed)
tokenizer = AutoTokenizer.from_pretrained("./models/bert")
model = AutoModelForCausalLM.from_pretrained("./models/bert")

# Or set the env var to block all network calls globally
import os
os.environ["HF_HUB_OFFLINE"] = "1"

Fix 2: CUDA Out of Memory — Large Model Loading

A 7B-parameter model in float32 requires ~28 GB VRAM; in float16, ~14 GB. Most consumer GPUs have 8–16 GB. Loading without any memory management fails immediately:
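The arithmetic behind these numbers is just parameters times bytes per parameter. A quick sketch (my own helper, not a library function; real usage needs extra headroom for activations and the KV cache):

```python
BYTES_PER_PARAM = {
    "float32": 4.0,   # full precision
    "float16": 2.0,   # half precision
    "bfloat16": 2.0,
    "int8": 1.0,      # 8-bit quantization
    "nf4": 0.5,       # 4-bit quantization
}

def estimate_weights_gb(num_params: float, dtype: str) -> float:
    """Approximate VRAM needed for the model weights alone."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

print(estimate_weights_gb(7e9, "float32"))  # 28.0
print(estimate_weights_gb(7e9, "float16"))  # 14.0
print(estimate_weights_gb(7e9, "nf4"))      # 3.5
```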

# Fails on GPUs with < 14 GB VRAM
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

Fix: use device_map="auto" with torch_dtype=torch.float16.

This distributes the model across all available GPUs and spills to CPU RAM if needed. It requires the accelerate library:

pip install accelerate
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-8B",
    device_map="auto",          # Distribute across GPU(s) + CPU
    torch_dtype=torch.float16,  # Half precision — halves VRAM usage
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

After loading, check where each layer landed:

print(model.hf_device_map)
# {'model.embed_tokens': 0, 'model.layers.0': 0, ..., 'model.layers.16': 'cpu'}

For even tighter VRAM constraints — 4-bit quantization:

4-bit quantization (BitsAndBytesConfig) loads a 7B model in approximately 4–5 GB VRAM. Quality is slightly reduced, but the difference is often imperceptible in practice:

pip install bitsandbytes accelerate
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # Compute in bfloat16
    bnb_4bit_quant_type="nf4",              # Normal Float 4 — best quality
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-8B",
    device_map="auto",
    quantization_config=quantization_config,
)

8-bit quantization is a middle ground — ~8 GB VRAM for a 7B model, better quality than 4-bit:

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-8B",
    device_map="auto",
    quantization_config=quantization_config,
)

Check actual memory usage after loading:

print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")

For a broader look at PyTorch-level CUDA memory management — gradient accumulation, torch.no_grad(), and empty_cache() behavior — see "PyTorch not working".

Fix 3: ImportError — Missing accelerate or bitsandbytes

ImportError: Using `load_in_8bit=True` requires Accelerate and the latest version of bitsandbytes.
ModuleNotFoundError: No module named 'bitsandbytes'

These are missing dependency errors. Install both together:

pip install --upgrade accelerate bitsandbytes

accelerate is required for device_map to work. bitsandbytes is required for quantization (4-bit or 8-bit), and quantized loading also depends on accelerate — which is why the error above names both.
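You can fail fast before a multi-minute model load by checking importability up front. A small sketch (the helper is my own, not a Transformers utility):

```python
import importlib.util

def missing_deps(required=("accelerate", "bitsandbytes")):
    """Return the names of required packages that can't be imported."""
    return [name for name in required if importlib.util.find_spec(name) is None]

# Example: check before attempting a quantized load
# if missing_deps():
#     raise SystemExit("pip install --upgrade accelerate bitsandbytes")
print(missing_deps(("os", "json")))  # []
```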

Verify the installation worked:

import accelerate, bitsandbytes
print(accelerate.__version__)    # Should be 0.20.0+
print(bitsandbytes.__version__)  # Should be 0.41.0+

Note: bitsandbytes has limited Windows support. On Windows, CUDA quantization may require WSL2 or a Linux Docker container. The CPU-only quantization path works on Windows but is much slower.

Fix 4: Tokenizer Shape Mismatch — Padding and Truncation

ValueError: Unable to create tensor, you should probably activate truncation and/or padding
with 'padding=True' 'truncation=True' to have batched tensors with the same length.

This happens when you batch multiple sequences of different lengths without padding. The tokenizer produces lists of different sizes, which can’t be stacked into a tensor.

Always pass padding=True, truncation=True, and return_tensors="pt" when batching:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

texts = [
    "Short.",
    "This sentence is much longer and will need padding to match the short one.",
]

# WRONG — returns lists of different lengths, can't be stacked
inputs = tokenizer(texts)

# CORRECT — pads shorter sequences, truncates longer ones, returns tensors
inputs = tokenizer(
    texts,
    padding=True,        # Pad to longest sequence in this batch
    truncation=True,     # Truncate sequences longer than max_length
    max_length=128,      # Cap length for this batch (BERT's hard limit is 512)
    return_tensors="pt", # Return PyTorch tensors (not lists)
)

print(inputs["input_ids"].shape)       # (2, 128) — uniform shape
print(inputs["attention_mask"])        # 1 = real token, 0 = padding

The attention_mask is critical — always pass it to the model alongside input_ids. Without it, the model treats padding tokens as real input:

from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")

outputs = model(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],  # Tell model which tokens to ignore
)

Common Mistake: Calling tokenizer(texts) without return_tensors="pt" and then passing the result directly to the model. The tokenizer returns Python lists by default — **inputs unpacking hands those lists to a model that expects tensors:

# WRONG
inputs = tokenizer(text)
model(**inputs)  # Fails — the model expects tensors, not Python lists

# CORRECT
inputs = tokenizer(text, return_tensors="pt")
model(**inputs)  # Works
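For intuition, here is what padding=True and the attention mask produce, written out in plain Python (a simplification: real tokenizers also handle truncation, special tokens, and tensor conversion):

```python
def pad_batch(sequences, pad_id=0):
    """Pad ragged token-id lists to the longest length and build the mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # fill with pad tokens
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # mark padding as 0
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[101, 2460, 102], [101, 2023, 6251, 2003, 102]])
print(batch["input_ids"][0])       # [101, 2460, 102, 0, 0]
print(batch["attention_mask"][0])  # [1, 1, 1, 0, 0]
```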

Fix 5: Pipeline API Errors

Wrong task name:

KeyError: 'text-classify' is not supported. Supported tasks: ['text-classification', ...]

Task names are exact strings. The common ones:

from transformers import pipeline

# Text
pipe = pipeline("text-classification")    # or "sentiment-analysis"
pipe = pipeline("text-generation")
pipe = pipeline("summarization")
pipe = pipeline("translation_en_to_fr")
pipe = pipeline("fill-mask")
pipe = pipeline("question-answering")
pipe = pipeline("token-classification")   # or "ner"

# Audio / Vision
pipe = pipeline("automatic-speech-recognition")
pipe = pipeline("image-classification")

Run on GPU by passing device=0 (for cuda:0):

import torch

pipe = pipeline(
    "text-generation",
    model="gpt2",
    device=0,                        # CUDA:0
    torch_dtype=torch.float16,       # Use half precision
)

Batched inference for large datasets — significantly faster than calling one at a time:

texts = ["Input one.", "Input two.", "Input three."] * 100

# Process 32 examples at a time instead of one by one
results = pipe(texts, batch_size=32)

Getting all class scores (the return_all_scores parameter was deprecated — use top_k instead):

# OLD (deprecated, still works but warns)
results = pipe("This product is amazing!", return_all_scores=True)

# CURRENT
results = pipe("This product is amazing!", top_k=None)  # None = return all
# [{'label': 'POSITIVE', 'score': 0.998}, {'label': 'NEGATIVE', 'score': 0.002}]

Fix 6: Caching — Models Re-downloading Every Time

By default, models cache to ~/.cache/huggingface/hub on Linux/macOS and C:\Users\<user>\.cache\huggingface\hub on Windows. If you’re on a machine with a small home partition or a shared server, redirect the cache:

# Linux/macOS — add to ~/.bashrc or ~/.zshrc
export HF_HOME="/data/huggingface"

# Windows PowerShell
$env:HF_HOME = "D:\huggingface"

HF_HOME controls the entire Hugging Face cache. Set it before running any Python that imports transformers.
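The lookup order can be sketched as follows (a simplification; the real huggingface_hub logic also honors HF_HUB_CACHE and some legacy variables):

```python
import os
from pathlib import Path

def hub_cache_dir(env=None):
    """Where model snapshots land: $HF_HOME/hub if set, else ~/.cache/huggingface/hub."""
    env = os.environ if env is None else env
    home = env.get("HF_HOME")
    base = Path(home) if home else Path.home() / ".cache" / "huggingface"
    return base / "hub"

print(hub_cache_dir({"HF_HOME": "/data/huggingface"}))  # /data/huggingface/hub on Linux/macOS
```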

Check what’s cached and how much disk it uses:

huggingface-cli scan-cache
REPO ID                                          SIZE      LAST ACCESSED
meta-llama/Llama-3.1-8B                         14.9 GB   2 hours ago
bert-base-uncased                               440.0 MB   3 days ago

Delete specific models:

huggingface-cli delete-cache
# Interactive UI lets you select models to remove

Pro Tip: On shared servers or CI/CD pipelines where bandwidth is limited, download models once to a shared directory and point everyone at it:

export HF_HOME="/shared/team/huggingface"

Every team member loading from_pretrained("bert-base-uncased") will hit the shared cache instead of re-downloading.

Fix 7: Generation Issues — pad_token_id Warning and Repetitive Output

GPT-style models (GPT-2, GPT-Neo, many LLaMA variants) don’t have a padding token. When you call generate() without setting one, you get this warning:

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

The warning is harmless for single-sequence generation, but for batched generation it causes incorrect behavior — the model can’t distinguish padding tokens from real content:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Silence the warning and fix batch generation:
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = "left"  # Decoder-only models need left padding for generate()

max_new_tokens vs max_length — these do different things and mixing them causes confusing behavior:

# max_new_tokens — tokens to generate AFTER the prompt (use this)
outputs = model.generate(inputs, max_new_tokens=200)

# max_length — TOTAL sequence length including the prompt (avoid)
outputs = model.generate(inputs, max_length=200)
# If your prompt is 180 tokens, max_length=200 only generates 20 new tokens

Always use max_new_tokens unless you have a specific reason to cap total length.
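The difference in token budget is easy to compute. A purely illustrative helper (my own, not part of the generate() API):

```python
def new_token_budget(prompt_len, max_length=None, max_new_tokens=None):
    """How many new tokens generate() can produce under each setting."""
    if max_new_tokens is not None:
        return max_new_tokens                 # independent of prompt length
    return max(0, max_length - prompt_len)    # shrinks as the prompt grows

print(new_token_budget(prompt_len=180, max_length=200))      # 20
print(new_token_budget(prompt_len=180, max_new_tokens=200))  # 200
```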

For sampling to take effect, do_sample=True must be set. Without it, temperature and top_p are ignored (recent Transformers versions emit a warning when this happens):

# WRONG — temperature has no effect without do_sample=True
outputs = model.generate(inputs, temperature=0.7, max_new_tokens=100)

# CORRECT
outputs = model.generate(
    inputs,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    max_new_tokens=100,
)

For deterministic output (same input always produces the same output), use greedy decoding:

outputs = model.generate(inputs, do_sample=False, max_new_tokens=100)

Or beam search for slightly better quality at the cost of speed:

outputs = model.generate(inputs, num_beams=4, max_new_tokens=100)

Fix 8: LoRA / PEFT Adapter Loading

Fine-tuned LoRA adapters are tied to a specific base model architecture. Loading an adapter on the wrong base model produces either a silent mismatch or an explicit error:

ValueError: You are trying to load a checkpoint saved with lora_target_modules=
['q_proj', 'v_proj'] but your model has target modules ['q_proj', 'k_proj', 'v_proj'].

The correct loading pattern:

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Step 1: Load the base model (must match the adapter's base)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-8B",
    device_map="auto",
    torch_dtype=torch.float16,
)

# Step 2: Load the adapter on top of the base
model = PeftModel.from_pretrained(base_model, "your-org/your-lora-adapter")

# Step 3: Use for inference
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
inputs = tokenizer("Tell me about", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Merging LoRA weights for faster inference:

PeftModel runs adapter math on every forward pass, which is slightly slower. For production inference, merge the adapter into the base model and export:

# merge_and_unload() RETURNS a new model — assign it
merged_model = model.merge_and_unload()

# Save for future use (no PEFT dependency needed to load)
merged_model.save_pretrained("./merged-llama-lora")
tokenizer.save_pretrained("./merged-llama-lora")

# Load later as a plain model
from transformers import AutoModelForCausalLM
inference_model = AutoModelForCausalLM.from_pretrained(
    "./merged-llama-lora",
    device_map="auto",
    torch_dtype=torch.float16,
)

Common Mistake: Calling model.merge_and_unload() without assigning the return value. The method returns the merged model — it doesn’t modify the original in place:

model.merge_and_unload()  # Result discarded — model is unchanged
merged_model = model.merge_and_unload()  # Correct

Still Not Working?

trust_remote_code=True Required

Some models (Phi-3, Qwen, custom architectures) include custom Python code in the repository. Loading them without trust_remote_code=True fails:

ValueError: Loading model requires you to execute the configuration file in that repo
on your local machine. Make sure you have read the code there to avoid malicious use,
then set the option `trust_remote_code=True` to remove this error.

Only use this for models from sources you trust:

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    trust_remote_code=True,
    device_map="auto",
)

Tokenizer Class Mismatch Warning

The tokenizer class you load from this checkpoint is not the same type as the class this function
is called from. It may result in unexpected tokenization. The tokenizer class to load is
'LlamaTokenizerFast'...

This warning usually resolves itself — AutoTokenizer picks the correct class. If you want the fast tokenizer explicitly:

from transformers import LlamaTokenizerFast

tokenizer = LlamaTokenizerFast.from_pretrained("meta-llama/Llama-3.1-8B")

Running Models Locally Without Python

If the Transformers stack is too heavy for your use case — especially for inference-only deployments — see "Ollama not working" for running quantized LLMs through Ollama, which handles device management and quantization automatically.

Pinning a Specific Model Revision

If a model update breaks your application, pin to a known-good commit:

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-8B",
    revision="abc123def",  # Git commit hash or tag
    device_map="auto",
)

Find the revision hash in the model’s commit history on Hugging Face. Pinning is critical for production deployments where unexpected model changes would break outputs.

Debugging Model Loading Without Downloading

Check whether a model exists and what configuration it has before downloading the full weights:

from transformers import AutoConfig

# Fast — only downloads config.json (~few KB)
config = AutoConfig.from_pretrained("meta-llama/Llama-3.1-8B", token=True)
print(config.model_type)        # "llama"
print(config.num_hidden_layers) # 32
print(config.hidden_size)       # 4096

This lets you confirm access and inspect architecture without downloading multi-GB weights.

Using Transformers Models in LangChain Pipelines

For chaining Hugging Face models with prompt templates and memory, use langchain_huggingface:

from langchain_huggingface import HuggingFacePipeline
from transformers import pipeline

pipe = pipeline("text-generation", model="gpt2", device=0, max_new_tokens=100)
llm = HuggingFacePipeline(pipeline=pipe)

For agent patterns, LCEL chains, and memory management with these models, see "LangChain Python not working".

Verifying the Full Stack

When a combination of device_map, quantization, and PEFT isn’t working, isolate each layer:

# 1. Can you load the tokenizer alone?
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("model-id", token=True)

# 2. Can you load the config (no weights)?
from transformers import AutoConfig
cfg = AutoConfig.from_pretrained("model-id", token=True)

# 3. Can you load the model on CPU without quantization?
from transformers import AutoModelForCausalLM
m = AutoModelForCausalLM.from_pretrained("model-id", token=True)

# 4. Now add device_map
m = AutoModelForCausalLM.from_pretrained("model-id", device_map="auto", token=True)

# 5. Now add quantization
from transformers import BitsAndBytesConfig
import torch
qc = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
m = AutoModelForCausalLM.from_pretrained("model-id", device_map="auto", quantization_config=qc, token=True)

Each step narrows down which layer introduced the failure. For Python-level packaging errors that arise when installing bitsandbytes or accelerate, see "Python packaging not working".


FixDevs

Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.
