Fix: Hugging Face Transformers Not Working — OSError, CUDA OOM, and Generation Errors
Quick Answer
How to fix Hugging Face Transformers errors — OSError can't load tokenizer, gated repo access, CUDA out of memory with device_map auto, bitsandbytes not installed, tokenizer padding mismatch, pad_token_id warning, and LoRA adapter loading failures.
The Error
You try to load a model and get this immediately:
OSError: Can't load tokenizer for 'meta-llama/Llama-2-7b-hf'.

Or the model repo exists but requires a token:
huggingface_hub.errors.GatedRepoError: 403 Client Error. Cannot access gated repo.
Make sure to have access and pass a valid token.

Or you get past loading and run out of VRAM:
RuntimeError: CUDA out of memory. Tried to allocate 14.00 GiB
(GPU 0; 8.00 GiB total capacity; 7.23 GiB already allocated)

Or generation produces a wall of warnings and then nonsense output:
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
# followed by repetitive or incoherent text

Most of these trace back to a handful of configuration steps that aren’t obvious from the README.
Why This Happens
The Transformers library separates model downloading, model loading, and model execution into distinct phases — each with its own failure modes. Downloads fail when authentication is missing or the cache is stale. Loading fails when VRAM is insufficient without memory distribution. Generation fails when tokenizers aren’t configured for the model type (GPT-style models lack a padding token by default). Understanding which phase failed narrows the fix immediately.
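As an illustrative sketch (this helper is not part of transformers), the failing phase can often be guessed from the error text itself. The substrings below are heuristics drawn from the messages shown above:

```python
# Illustrative helper: map an error message to the phase that likely
# produced it. Substring matching is a heuristic, not an API guarantee.
def classify_hf_error(message: str) -> str:
    checks = [
        ("gated repo", "download: accept the model terms and pass a token"),
        ("401", "download: missing or invalid authentication token"),
        ("can't load tokenizer", "download: wrong repo id or a stale cache"),
        ("cuda out of memory", "load: insufficient VRAM, try device_map or quantization"),
        ("pad_token_id", "generation: tokenizer has no padding token configured"),
    ]
    lowered = message.lower()
    for needle, phase in checks:
        if needle in lowered:
            return phase
    return "unknown: read the full traceback"

print(classify_hf_error("RuntimeError: CUDA out of memory. Tried to allocate 14.00 GiB"))
```

Wrapping your loading code in a `try`/`except` that logs `classify_hf_error(str(e))` makes triage faster in scripts that load many models.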
Fix 1: OSError / GatedRepoError — Authentication and Access
Most model loading errors fall into two categories: the model doesn’t exist at that path, or it exists but requires authentication.
Gated models (Llama, Gemma, some Mistral variants) require you to accept terms on the Hugging Face website before the API grants access. After accepting, you need an API token:
# Set your token as an environment variable
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxx" # Linux/macOS
# Windows PowerShell
$env:HF_TOKEN = "hf_xxxxxxxxxxxxxxxxxxxx"

Then load with the token:
from transformers import AutoTokenizer, AutoModelForCausalLM
# Option 1: token=True reads from HF_TOKEN env var
tokenizer = AutoTokenizer.from_pretrained(
"meta-llama/Llama-3.2-8B",
token=True
)
# Option 2: pass the token directly
tokenizer = AutoTokenizer.from_pretrained(
"meta-llama/Llama-3.2-8B",
token="hf_xxxxxxxxxxxxxxxxxxxx"
)
# Option 3: log in once and reuse (saves token to ~/.cache/huggingface/token)
from huggingface_hub import login
login(token="hf_xxxxxxxxxxxxxxxxxxxx")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-8B")

For private repositories (your own fine-tuned models), the same token parameter applies. The error message says “Unauthorized” rather than “Gated” in this case.
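Before debugging further, it helps to confirm a token is visible to the process at all. A minimal stdlib-only preflight check (no network, illustrative helper): huggingface_hub reads HF_TOKEN, and older versions also honored HUGGING_FACE_HUB_TOKEN.

```python
import os

# Illustrative preflight: report which environment variable (if any)
# carries a Hugging Face token. Empty strings count as "not set".
def find_hf_token(env=os.environ):
    for name in ("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN"):
        if env.get(name):
            return name
    return None

source = find_hf_token()
print(f"Token found in ${source}" if source else "No token in environment — set HF_TOKEN")
```

To verify the token is actually valid against the Hub (this does hit the network), `huggingface-cli whoami` prints the account the token belongs to.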
For OSError: Can't load tokenizer on a public model — the most common cause is a cached partial download. Clear the cache for that model:
huggingface-cli delete-cache
# Or delete the specific model directory:
rm -rf ~/.cache/huggingface/hub/models--bert-base-uncased

Then re-run the load.
For offline environments, load from a local directory after downloading once:
# First: download to a local path
from huggingface_hub import snapshot_download
snapshot_download(repo_id="bert-base-uncased", local_dir="./models/bert")
# Then: load from local path (no network access needed)
tokenizer = AutoTokenizer.from_pretrained("./models/bert")
model = AutoModelForCausalLM.from_pretrained("./models/bert")
# Or set the env var to block all network calls globally
import os
os.environ["HF_HUB_OFFLINE"] = "1"

Fix 2: CUDA Out of Memory — Large Model Loading
A 7B parameter model in float32 requires ~28 GB VRAM. In float16, ~14 GB. Most consumer GPUs have 8–16 GB. Loading without any memory management fails immediately:
# Fails on GPUs with < 14 GB VRAM
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-8B")

Fix: use device_map="auto" with torch_dtype=torch.float16.
This distributes the model across all available GPUs and spills to CPU RAM if needed. It requires the accelerate library:
pip install accelerate

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.2-8B",
device_map="auto", # Distribute across GPU(s) + CPU
torch_dtype=torch.float16, # Half precision — halves VRAM usage
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-8B")

After loading, check where each layer landed:
print(model.hf_device_map)
# {'model.embed_tokens': 0, 'model.layers.0': 0, ..., 'model.layers.16': 'cpu'}

For even tighter VRAM constraints — 4-bit quantization:
4-bit quantization (BitsAndBytesConfig) loads a 7B model in approximately 4–5 GB VRAM. Quality is slightly reduced but often imperceptible for most tasks:
pip install bitsandbytes accelerate

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.bfloat16, # Compute in bfloat16
bnb_4bit_quant_type="nf4", # Normal Float 4 — best quality
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.2-8B",
device_map="auto",
quantization_config=quantization_config,
)

8-bit quantization is a middle ground — ~8 GB VRAM for a 7B model, better quality than 4-bit:
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.2-8B",
device_map="auto",
quantization_config=quantization_config,
)

Check actual memory usage after loading:
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")

For a broader look at PyTorch-level CUDA memory management — gradient accumulation, torch.no_grad(), and empty_cache() behavior — see PyTorch not working.
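The VRAM figures in this section follow from simple arithmetic: weight memory is parameter count times bytes per parameter. A quick sketch (plain Python, no transformers needed) that reproduces the numbers above:

```python
# Back-of-envelope weight memory: parameters * bytes per parameter.
# Ignores activations, KV cache, and CUDA overhead, so treat it as a floor.
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    return n_params * bytes_per_param / 1e9

for name, bytes_pp in [("float32", 4), ("float16", 2), ("8-bit", 1), ("4-bit", 0.5)]:
    print(f"7B in {name}: ~{weight_memory_gb(7e9, bytes_pp):.1f} GB")
```

The real footprints reported by get_memory_footprint() run slightly higher because of quantization metadata and non-quantized layers, which is why 4-bit lands at 4–5 GB rather than 3.5 GB.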
Fix 3: ImportError — Missing accelerate or bitsandbytes
ImportError: Using `load_in_8bit=True` requires Accelerate and the latest version of bitsandbytes.

ModuleNotFoundError: No module named 'bitsandbytes'

These are missing dependency errors. Install both together:
pip install --upgrade accelerate bitsandbytes

accelerate is required for device_map to work. bitsandbytes is required for any quantization (4-bit or 8-bit). Both must be present even if you only use one feature.
Verify the installation worked:
import accelerate, bitsandbytes
print(accelerate.__version__) # Should be 0.20.0+
print(bitsandbytes.__version__) # Should be 0.41.0+

Note: bitsandbytes has limited Windows support. On Windows, CUDA quantization may require WSL2 or a Linux Docker container. The CPU-only quantization path works on Windows but is much slower.
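If you want the check above to work even when one of the packages is absent (a plain import raises immediately), a stdlib-only sketch using importlib:

```python
import importlib.util

# Stdlib-only presence check: find_spec returns None for a missing
# top-level package instead of raising, so this runs anywhere.
def missing_deps(packages):
    return [p for p in packages if importlib.util.find_spec(p) is None]

gaps = missing_deps(["accelerate", "bitsandbytes"])
print("Missing:", gaps or "none — both importable")
```

Running this at startup and failing fast with a clear message beats letting from_pretrained() raise deep inside the quantization path.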
Fix 4: Tokenizer Shape Mismatch — Padding and Truncation
ValueError: Unable to create tensor, you should probably activate truncation and/or padding
with 'padding=True' 'truncation=True' to have batched tensors with the same length.

This happens when you batch multiple sequences of different lengths without padding. The tokenizer produces lists of different sizes, which can’t be stacked into a tensor.
Always pass padding=True, truncation=True, and return_tensors="pt" when batching:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
texts = [
"Short.",
"This sentence is much longer and will need padding to match the short one.",
]
# WRONG — returns lists of different lengths, can't be stacked
inputs = tokenizer(texts)
# CORRECT — pads shorter sequences, truncates longer ones, returns tensors
inputs = tokenizer(
texts,
padding=True, # Pad to longest sequence in this batch
truncation=True, # Truncate sequences longer than max_length
max_length=128, # Model's maximum context length
return_tensors="pt", # Return PyTorch tensors (not lists)
)
print(inputs["input_ids"].shape) # (2, N) — N = longest sequence in the batch, capped at max_length
print(inputs["attention_mask"]) # 1 = real token, 0 = padding

The attention_mask is critical — always pass it to the model alongside input_ids. Without it, the model treats padding tokens as real input:
from transformers import AutoModel
model = AutoModel.from_pretrained("bert-base-uncased")
outputs = model(
input_ids=inputs["input_ids"],
attention_mask=inputs["attention_mask"], # Tell model which tokens to ignore
)

Common Mistake: Calling tokenizer(texts) without return_tensors="pt" and then passing the result directly to the model. The tokenizer returns Python lists by default — **inputs unpacking passes them as lists, and the model rejects the shape:
# WRONG
inputs = tokenizer(text)
model(**inputs) # TypeError: expected Tensor, got list
# CORRECT
inputs = tokenizer(text, return_tensors="pt")
model(**inputs) # Works

Fix 5: Pipeline API Errors
Wrong task name:
KeyError: 'text-classify' is not supported. Supported tasks: ['text-classification', ...]

Task names are exact strings. The common ones:
from transformers import pipeline
# Text
pipe = pipeline("text-classification") # or "sentiment-analysis"
pipe = pipeline("text-generation")
pipe = pipeline("summarization")
pipe = pipeline("translation_en_to_fr")
pipe = pipeline("fill-mask")
pipe = pipeline("question-answering")
pipe = pipeline("token-classification") # or "ner"
# Audio / Vision
pipe = pipeline("automatic-speech-recognition")
pipe = pipeline("image-classification")

Run on GPU by passing device=0 (for cuda:0):
import torch
pipe = pipeline(
"text-generation",
model="gpt2",
device=0, # CUDA:0
torch_dtype=torch.float16, # Use half precision
)

Batched inference for large datasets — significantly faster than calling one at a time:
texts = ["Input one.", "Input two.", "Input three."] * 100
# Process 32 examples at a time instead of one by one
results = pipe(texts, batch_size=32)

Getting all class scores (the return_all_scores parameter was deprecated — use top_k instead):
# OLD (deprecated, still works but warns)
results = pipe("This product is amazing!", return_all_scores=True)
# CURRENT
results = pipe("This product is amazing!", top_k=None) # None = return all
# [{'label': 'POSITIVE', 'score': 0.998}, {'label': 'NEGATIVE', 'score': 0.002}]

Fix 6: Caching — Models Re-downloading Every Time
By default, models cache to ~/.cache/huggingface/hub on Linux/macOS and C:\Users\<user>\.cache\huggingface\hub on Windows. If you’re on a machine with a small home partition or a shared server, redirect the cache:
# Linux/macOS — add to ~/.bashrc or ~/.zshrc
export HF_HOME="/data/huggingface"
# Windows PowerShell
$env:HF_HOME = "D:\huggingface"

HF_HOME controls the entire Hugging Face cache. Set it before running any Python that imports transformers.
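If you can't control the shell environment (notebooks, managed runners), the variable can also be set per-process, as long as it happens before the first transformers import, since the cache location is resolved when the library initializes. The path below is an example, not a required location:

```python
import os

# Must run before `import transformers` anywhere in this process.
# setdefault respects an HF_HOME already exported by the shell.
os.environ.setdefault("HF_HOME", "/data/huggingface")  # example path

print("Cache root:", os.environ["HF_HOME"])
```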
Check what’s cached and how much disk it uses:
huggingface-cli scan-cache

REPO ID SIZE LAST ACCESSED
meta-llama/Llama-3.2-8B 14.9 GB 2 hours ago
bert-base-uncased 440.0 MB 3 days ago

Delete specific models:
huggingface-cli delete-cache
# Interactive UI lets you select models to remove

Pro Tip: On shared servers or CI/CD pipelines where bandwidth is limited, download models once to a shared directory and point everyone at it:
export HF_HOME="/shared/team/huggingface"

Every team member loading from_pretrained("bert-base-uncased") will hit the shared cache instead of re-downloading.
Fix 7: Generation Issues — pad_token_id Warning and Repetitive Output
GPT-style models (GPT-2, GPT-Neo, many LLaMA variants) don’t have a padding token. When you call generate() without setting one, you get this warning:
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

The warning is harmless for single-sequence generation, but for batched generation it causes incorrect behavior — the model can’t distinguish padding tokens from real content:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
# Silence the warning and fix batch generation:
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.eos_token_id

max_new_tokens vs max_length — these do different things and mixing them causes confusing behavior:
# max_new_tokens — tokens to generate AFTER the prompt (use this)
outputs = model.generate(inputs, max_new_tokens=200)
# max_length — TOTAL sequence length including the prompt (avoid)
outputs = model.generate(inputs, max_length=200)
# If your prompt is 180 tokens, max_length=200 only generates 20 new tokens

Always use max_new_tokens unless you have a specific reason to cap total length.
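The arithmetic behind that comment is worth making explicit, because the budget silently shrinks as prompts grow. An illustrative helper (not a transformers API):

```python
# With max_length, the number of NEW tokens is whatever is left over
# after the prompt; with max_new_tokens the budget is fixed.
def generation_budget(prompt_tokens: int, max_length: int) -> int:
    return max(0, max_length - prompt_tokens)

print(generation_budget(180, 200))  # prompt of 180: only 20 new tokens left
print(generation_budget(210, 200))  # prompt already exceeds max_length: 0 left
```

The zero-budget case is the nastiest in practice: generation returns the prompt essentially unchanged, with no error.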
For sampling to work, do_sample=True must be set. Without it, temperature and top_p are silently ignored:
# WRONG — temperature has no effect without do_sample=True
outputs = model.generate(inputs, temperature=0.7, max_new_tokens=100)
# CORRECT
outputs = model.generate(
inputs,
do_sample=True,
temperature=0.7,
top_p=0.9,
max_new_tokens=100,
)

For deterministic output (same input always produces the same output), use greedy decoding:
outputs = model.generate(inputs, do_sample=False, max_new_tokens=100)

Or beam search for slightly better quality at the cost of speed:
outputs = model.generate(inputs, num_beams=4, max_new_tokens=100)

Fix 8: LoRA / PEFT Adapter Loading
Fine-tuned LoRA adapters are tied to a specific base model architecture. Loading an adapter on the wrong base model produces either a silent mismatch or an explicit error:
ValueError: You are trying to load a checkpoint saved with lora_target_modules=
['q_proj', 'v_proj'] but your model has target modules ['q_proj', 'k_proj', 'v_proj'].

The correct loading pattern:
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Step 1: Load the base model (must match the adapter's base)
base_model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.2-8B",
device_map="auto",
torch_dtype=torch.float16,
)
# Step 2: Load the adapter on top of the base
model = PeftModel.from_pretrained(base_model, "your-org/your-lora-adapter")
# Step 3: Use for inference
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-8B")
inputs = tokenizer("Tell me about", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Merging LoRA weights for faster inference:
PeftModel runs adapter math on every forward pass, which is slightly slower. For production inference, merge the adapter into the base model and export:
# merge_and_unload() RETURNS a new model — assign it
merged_model = model.merge_and_unload()
# Save for future use (no PEFT dependency needed to load)
merged_model.save_pretrained("./merged-llama-lora")
tokenizer.save_pretrained("./merged-llama-lora")
# Load later as a plain model
from transformers import AutoModelForCausalLM
inference_model = AutoModelForCausalLM.from_pretrained(
"./merged-llama-lora",
device_map="auto",
torch_dtype=torch.float16,
)

Common Mistake: Calling model.merge_and_unload() without assigning the return value. The method returns the merged model — it doesn’t modify the original in place:
model.merge_and_unload() # Result discarded — model is unchanged
merged_model = model.merge_and_unload() # Correct

Still Not Working?
trust_remote_code=True Required
Some models (Phi-3, Qwen, custom architectures) include custom Python code in the repository. Loading them without trust_remote_code=True fails:
ValueError: Loading model requires you to execute the configuration file in that repo
on your local machine. Make sure you have read the code there to avoid malicious use,
then set the option `trust_remote_code=True` to remove this error.

Only use this for models from sources you trust:
model = AutoModelForCausalLM.from_pretrained(
"microsoft/Phi-3-mini-4k-instruct",
trust_remote_code=True,
device_map="auto",
)

Slow Tokenizer Warning
The tokenizer class you load from this checkpoint is not the same type as the class this function
is called from. It may result in unexpected tokenization. The tokenizer class to load is
'LlamaTokenizerFast'...

This warning usually resolves itself — AutoTokenizer picks the correct class. If you want the fast tokenizer explicitly:
from transformers import LlamaTokenizerFast
tokenizer = LlamaTokenizerFast.from_pretrained("meta-llama/Llama-3.2-8B")

Running Models Locally Without Python
If the Transformers stack is too heavy for your use case — especially for inference-only deployments — see Ollama not working for running quantized LLMs through Ollama, which handles device management and quantization automatically.
Pinning a Specific Model Revision
If a model update breaks your application, pin to a known-good commit:
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.2-8B",
revision="abc123def", # Git commit hash or tag
device_map="auto",
)

Find the revision hash in the model’s commit history on Hugging Face. Pinning is critical for production deployments where unexpected model changes would break outputs.
Debugging Model Loading Without Downloading
Check whether a model exists and what configuration it has before downloading the full weights:
from transformers import AutoConfig
# Fast — only downloads config.json (~few KB)
config = AutoConfig.from_pretrained("meta-llama/Llama-3.2-8B", token=True)
print(config.model_type) # "llama"
print(config.num_hidden_layers) # 32
print(config.hidden_size) # 4096

This lets you confirm access and inspect architecture without downloading multi-GB weights.
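The config values also let you estimate whether the model will fit before downloading anything. A rough rule of thumb, stated with caveats: a vanilla transformer has about 12 · layers · hidden² parameters in its blocks plus vocab · hidden in the embeddings; Llama-style models (SwiGLU MLPs, grouped-query attention) deviate, so treat this as order-of-magnitude only:

```python
# Order-of-magnitude parameter estimate from config.json values.
# Accurate for vanilla transformers; Llama-style models deviate somewhat.
def rough_param_count(num_layers, hidden_size, vocab_size):
    blocks = 12 * num_layers * hidden_size ** 2
    embeddings = vocab_size * hidden_size
    return blocks + embeddings

n = rough_param_count(num_layers=32, hidden_size=4096, vocab_size=32000)
print(f"~{n / 1e9:.1f}B parameters -> ~{n * 2 / 1e9:.0f} GB in float16")
```

Combined with the VRAM arithmetic from Fix 2, a few KB of config download tells you whether a multi-GB weight download is even worth starting.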
Using Transformers Models in LangChain Pipelines
For chaining Hugging Face models with prompt templates and memory, use langchain_huggingface:
from langchain_huggingface import HuggingFacePipeline
from transformers import pipeline
pipe = pipeline("text-generation", model="gpt2", device=0, max_new_tokens=100)
llm = HuggingFacePipeline(pipeline=pipe)

For agent patterns, LCEL chains, and memory management with these models, see LangChain Python not working.
Verifying the Full Stack
When a combination of device_map, quantization, and PEFT isn’t working, isolate each layer:
# 1. Can you load the tokenizer alone?
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("model-id", token=True)
# 2. Can you load the config (no weights)?
from transformers import AutoConfig
cfg = AutoConfig.from_pretrained("model-id", token=True)
# 3. Can you load the model on CPU without quantization?
from transformers import AutoModelForCausalLM
m = AutoModelForCausalLM.from_pretrained("model-id", token=True)
# 4. Now add device_map
m = AutoModelForCausalLM.from_pretrained("model-id", device_map="auto", token=True)
# 5. Now add quantization
from transformers import BitsAndBytesConfig
import torch
qc = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
m = AutoModelForCausalLM.from_pretrained("model-id", device_map="auto", quantization_config=qc, token=True)

Each step narrows down which layer introduced the failure. For Python-level packaging errors that arise when installing bitsandbytes or accelerate, see Python packaging not working.
Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.