Fix: Outlines Not Working — Backend Setup, Pydantic Schemas, Regex, Choice, and Slow Sampling
Quick Answer
How to fix Python Outlines errors — model backend missing, JSON schema vs Pydantic, regex pattern compilation slow, choice list timing, vLLM/Transformers/Ollama wiring, and streaming structured outputs.
The Error
You try to load a model with Outlines and it can’t find a backend:
import outlines
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
# ImportError: To use the transformers integration, please install
# transformers, sentencepiece and torch.
Or your JSON-constrained generation returns plain text:
generator = outlines.generate.json(model, MySchema)
result = generator("Extract user data: ...")
# Returns a string, not a parsed object: you forgot to pass a Pydantic model.
Or the regex generator takes 30+ seconds before sampling the first token:
gen = outlines.generate.regex(model, r"\d{4}-\d{2}-\d{2}")
out = gen("Today is")
# Long pause before any output.
Or choice returns an unexpected value not in your list:
gen = outlines.generate.choice(model, ["yes", "no"])
out = gen("Is the sky blue?")
print(out)  # "Yes." is not in the list!
Why This Happens
Outlines constrains the LLM’s token sampling so output must match a grammar (regex, JSON schema, choice list). Most issues come from:
- Backend selection. Outlines supports transformers, vllm, llamacpp, mlx, and (less native) openai. Each has its own install path and API. Choosing the wrong one or missing deps breaks at load.
- First-time grammar compilation is slow. Outlines converts the constraint into a finite-state machine over the model’s tokenizer. This is fast at sampling time, but the build can take 10-60s for complex JSON schemas. Subsequent runs cache it.
- Choice/JSON token boundaries. The constraint operates on tokens, not characters. If "Yes." is a single token but "yes" isn’t on the choice list, the model can pick the longer token before Outlines narrows the search.
- Local model selection. Constrained sampling only helps if the underlying model can also write reasonable content. A 1B model forced into a JSON schema produces valid-shaped garbage.
A second cause that catches users by surprise is the API churn between Outlines versions. Outlines launched in July 2023 with a small surface focused on regex-constrained generation. The library then went through three significant API revisions (0.1, 0.2, and the 1.0 stable release) over the next eighteen months. Tutorials written against 0.0.x recommend outlines.text.generate.regex(...); the 0.1 series moved to outlines.generate.regex(model, pattern); the 1.0 release reorganised the models.* namespace and tightened the Pydantic-v2 contract. If you copied an example from 2023 and it imports an outlines.text module, that module no longer exists. The fix is always to check which version you have installed (pip show outlines | grep Version) and consult the current docs for the exact import path that version uses.
A third cause is the tokenizer awareness gap. Outlines builds its FSM over the model’s specific tokenizer. If you swap models — say from Llama 3 to Mistral — the same regex or JSON schema generates a different FSM because the token vocabularies differ. The cached compilation from the previous model isn’t reused. Worse, some regex patterns that are tractable on one tokenizer (Llama 3’s larger vocab) explode in compilation time on another (Phi-3’s smaller, more fragmented vocab) because the FSM has to enumerate more token boundary variants. Always test compilation speed when you switch models.
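To make the tokenizer dependence concrete, here is a toy sketch in plain Python. It is not Outlines’ compiler: the two vocabularies and the names DATE_CLASSES, token_fits, and build_index are made up for illustration, and the pattern \d{4}-\d{2}-\d{2} is simplified to one character class per position. The index records, for each position in the pattern, which vocabulary tokens may come next.

```python
# Toy illustration only: a fixed-length pattern modelled as one character
# class per position. Real regex compilers build a general automaton.
DIGITS = set("0123456789")
# \d{4}-\d{2}-\d{2} as ten per-character classes.
DATE_CLASSES = [DIGITS] * 4 + [{"-"}] + [DIGITS] * 2 + [{"-"}] + [DIGITS] * 2


def token_fits(token, state, classes=DATE_CLASSES):
    """True if every character of `token` matches the pattern from `state`."""
    if not token or state + len(token) > len(classes):
        return False
    return all(ch in classes[state + i] for i, ch in enumerate(token))


def build_index(vocab, classes=DATE_CLASSES):
    """Map each state (characters consumed so far) to {token: next_state}."""
    return {
        state: {
            tok: state + len(tok)
            for tok in vocab
            if token_fits(tok, state, classes)
        }
        for state in range(len(classes))
    }


# Two hypothetical vocabularies: one with multi-character tokens, one without.
coarse_vocab = ["20", "24", "2024", "-", "01", "1", "0", "-0", "x"]
fine_vocab = ["0", "1", "2", "4", "-", "x"]

coarse_index = build_index(coarse_vocab)
fine_index = build_index(fine_vocab)

print(sorted(coarse_index[0]))  # tokens allowed at the start, coarse vocabulary
print(sorted(fine_index[0]))    # tokens allowed at the start, fine vocabulary
```

The same pattern produces two different indexes, which is why an index built for one model’s vocabulary cannot be reused for another.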
Fix 1: Pick a Backend and Install Its Deps
# Transformers (HF, GPU or CPU):
pip install outlines transformers torch sentencepiece accelerate
# vLLM (GPU, fastest for production):
pip install outlines vllm
# llama.cpp:
pip install outlines llama-cpp-python
# Apple Silicon native:
pip install outlines mlx mlx-lm
# OpenAI / Anthropic (via API, no GPU needed):
pip install outlines openai
Then load:
import outlines
# Transformers:
model = outlines.models.transformers(
    "microsoft/Phi-3-mini-4k-instruct",
    device="cuda",  # or "cpu", "mps"
)
# vLLM:
model = outlines.models.vllm("meta-llama/Meta-Llama-3-8B-Instruct")
# llama.cpp (GGUF):
model = outlines.models.llamacpp(
    "TheBloke/Llama-2-7B-Chat-GGUF",
    "llama-2-7b-chat.Q4_K_M.gguf",
)
# OpenAI:
model = outlines.models.openai("gpt-4o-mini")
Pro Tip: For production deployments, prefer vLLM. It’s significantly faster than Transformers for batched constrained generation because Outlines integrates as a logits processor without per-token Python overhead.
Note: The OpenAI backend can’t do true constrained generation (no logits access). It uses prompt engineering + post-validation under the hood. For real constraint enforcement, you need a model whose logits you control.
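Before loading a model, it can help to check whether the dependencies for your chosen backend are importable at all. The sketch below uses only the standard library. The BACKEND_MODULES mapping and the missing_modules helper are illustrative names, and the module lists are an assumption derived from the install commands above, so adjust them to your setup.

```python
from importlib.util import find_spec

# Import names for each backend's dependencies, inferred from the install
# commands above (an assumption; adjust to match your own environment).
BACKEND_MODULES = {
    "transformers": ["transformers", "torch", "sentencepiece", "accelerate"],
    "vllm": ["vllm"],
    "llamacpp": ["llama_cpp"],
    "mlx": ["mlx", "mlx_lm"],
    "openai": ["openai"],
}


def missing_modules(backend, mapping=BACKEND_MODULES):
    """Return the modules for `backend` that cannot be found for import."""
    return [name for name in mapping[backend] if find_spec(name) is None]


if __name__ == "__main__":
    for backend in BACKEND_MODULES:
        missing = missing_modules(backend)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{backend:13s} {status}")
```

An empty list only means the packages can be located; it does not prove that the versions are compatible with each other.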
Fix 2: JSON Generation With Pydantic
The cleanest way to define a JSON schema:
from pydantic import BaseModel, Field
import outlines
class User(BaseModel):
    name: str
    age: int = Field(ge=0, le=150)
    email: str | None = None
generator = outlines.generate.json(model, User)
result = generator("Extract: John Doe is 30 years old, email [email protected]")
print(result)
# User(name='John Doe', age=30, email='[email protected]')
Pydantic’s Field(ge=..., le=..., pattern=...) constraints translate to schema constraints that Outlines enforces during sampling.
For non-Pydantic schemas, pass raw JSON schema:
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer", "minimum": 0},
    },
    "required": ["name", "age"],
}
generator = outlines.generate.json(model, schema)
result = generator("...")
# result is a dict, not a Pydantic model
Common Mistake: Schemas with $ref to external definitions. Outlines doesn’t follow $ref outside the schema. Inline everything or use Pydantic’s auto-generated schema (which inlines via $defs).
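If you maintain a raw schema by hand, one way to "inline everything" is to copy the shared definitions into a local $defs section and expand the references before handing the schema over. The helper below is a minimal sketch, not part of Outlines: inline_local_refs is a hypothetical name, it only understands local "#/$defs/Name" pointers, it drops any keys that sit next to a $ref, and it does not handle recursive definitions.

```python
import copy


def inline_local_refs(schema):
    """Return a copy of `schema` with '#/$defs/...' references expanded.

    Sketch only: handles local '#/$defs/Name' pointers, leaves any other
    $ref untouched, and does not support recursive definitions.
    """
    defs = schema.get("$defs", {})

    def expand(node):
        if isinstance(node, dict):
            ref = node.get("$ref")
            if isinstance(ref, str) and ref.startswith("#/$defs/"):
                name = ref[len("#/$defs/"):]
                if name in defs:
                    return expand(copy.deepcopy(defs[name]))
            # Drop the $defs section itself; everything else is kept.
            return {k: expand(v) for k, v in node.items() if k != "$defs"}
        if isinstance(node, list):
            return [expand(item) for item in node]
        return node

    return expand(schema)


schema_with_refs = {
    "type": "object",
    "properties": {"home": {"$ref": "#/$defs/Address"}},
    "required": ["home"],
    "$defs": {
        "Address": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        }
    },
}

flat_schema = inline_local_refs(schema_with_refs)
print(flat_schema["properties"]["home"])
```

The original dictionary is left unchanged, so you can keep the readable version in source control and pass only the flattened copy to the generator.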
Fix 3: Choice With Token-Aware Options
For multi-class classification with strict outputs:
generator = outlines.generate.choice(model, ["positive", "negative", "neutral"])
sentiment = generator("Review: 'Loved the food, hated the service.'")
Outlines constrains sampling so only tokens that lead to one of the choices are allowed. The output is guaranteed to be exactly one of the strings.
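The narrowing can be pictured with a few lines of plain Python. This is a toy model of the idea, not the Outlines implementation: allowed_tokens is a hypothetical helper and the vocabulary is invented. At each step, only tokens that keep the text so far a prefix of some choice stay available.

```python
def allowed_tokens(choices, generated, vocab):
    """Tokens that keep `generated + token` a prefix of at least one choice."""
    return [
        tok
        for tok in vocab
        if tok and any(choice.startswith(generated + tok) for choice in choices)
    ]


choices = ["positive", "negative", "neutral"]
# Hypothetical vocabulary; real tokenizers have tens of thousands of entries.
vocab = ["pos", "neg", "neu", "itive", "ative", "tral", "n", "e", "Yes", "."]

print(allowed_tokens(choices, "", vocab))     # ['pos', 'neg', 'neu', 'n']
print(allowed_tokens(choices, "neg", vocab))  # ['ative']
```

Tokens such as "Yes" or "." never appear in the allowed set, which is what makes the final string one of the listed options.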
Common Mistake: Choices that are prefixes of each other:
generator = outlines.generate.choice(model, ["yes", "yes_strongly"])
# Ambiguous: after "yes", the model might continue to "_strongly" or stop.
Fix by adding terminating context or making choices unambiguous:
# No option is a prefix of another:
generator = outlines.generate.choice(model, ["yes", "strongly_yes", "no"])
If you need free-form output that starts with one of several phrases, use regex instead:
gen = outlines.generate.regex(model, r"(positive|negative|neutral)\b.*")
Fix 4: Regex Generation
For arbitrary patterns:
# Phone number:
gen = outlines.generate.regex(model, r"\d{3}-\d{3}-\d{4}")
# ISO date:
gen = outlines.generate.regex(model, r"\d{4}-\d{2}-\d{2}")
# IP address:
gen = outlines.generate.regex(model, r"(\d{1,3}\.){3}\d{1,3}")
The compilation cost for these is one-time per (model, regex) pair. Outlines caches it across calls for the same generator object, so reuse generators rather than creating fresh ones in a loop.
# Slow: rebuilds the generator on every iteration.
for prompt in prompts:
    gen = outlines.generate.regex(model, pattern)
    out = gen(prompt)
# Fast: compile once, then reuse.
gen = outlines.generate.regex(model, pattern)
for prompt in prompts:
    out = gen(prompt)
For very complex regex (lots of alternation, deep nesting), compilation can take 30s+. Simplify the pattern or use JSON/choice if applicable.
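When generators are requested from many places in a codebase, a small memoising wrapper keeps the "compile once" rule from being broken by accident. The sketch below uses functools.lru_cache from the standard library. build_generator is a stand-in for the expensive constructor (in real code it would wrap a call such as outlines.generate.regex(model, pattern)), and the build counter exists only to show that the build runs once.

```python
from functools import lru_cache

stats = {"builds": 0}  # counts how often the stand-in expensive build runs


def build_generator(pattern):
    """Stand-in for an expensive constructor such as a regex generator build."""
    stats["builds"] += 1
    return lambda prompt: f"<output for {prompt!r} constrained by {pattern}>"


@lru_cache(maxsize=32)
def get_generator(pattern):
    """Build the generator for `pattern` once and reuse it afterwards."""
    return build_generator(pattern)


for prompt in ["Today is", "Tomorrow is", "Yesterday was"]:
    out = get_generator(r"\d{4}-\d{2}-\d{2}")(prompt)

print(stats["builds"])  # 1
```

The cache key here is just the pattern string. If you serve several models, the key would also need to identify the model, for example by passing a model name as a second argument.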
Fix 5: Streaming Constrained Output
For long outputs you want to display progressively:
generator = outlines.generate.text(model)
for chunk in generator.stream("Write a story about..."):
    print(chunk, end="", flush=True)
For constrained generators that have a stream method:
gen = outlines.generate.json(model, User)
for partial in gen.stream("Extract: ..."):
    print(partial)
Streaming a JSON generator yields tokens; you reconstruct the partial JSON yourself. Not all backends support streaming uniformly: Transformers and vLLM do; llama.cpp and OpenAI depend on the version.
Note: Constrained streaming is rarely a UX improvement for short structured outputs (a User object completes in <500ms). Streaming shines for long-form text or large arrays.
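Reconstructing the JSON yourself can be as simple as accumulating the chunks and attempting a parse after each one. The following is a minimal sketch with a simulated stream: collect_json is a hypothetical helper and fake_stream stands in for whatever your generator’s stream method yields.

```python
import json


def collect_json(chunks):
    """Accumulate streamed text chunks and yield (buffer, parsed_or_None).

    `parsed` stays None until the buffer is a complete JSON document.
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        try:
            parsed = json.loads(buffer)
        except json.JSONDecodeError:
            parsed = None
        yield buffer, parsed


# Simulated token stream; with a real generator these would come from .stream().
fake_stream = ['{"name": "Jo', 'hn Doe", ', '"age": 3', '0}']
for text, obj in collect_json(fake_stream):
    print(repr(text), "->", obj)
```

Only the final iteration yields a parsed object. Until then you can display the raw buffer, or skip rendering entirely for short outputs.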
Fix 6: Multiple Generators Share One Model
Loading a 7B model takes 20-60s and 14 GB of VRAM. Don’t reload per generator:
# Load model once:
model = outlines.models.transformers("model-name")
# Reuse for multiple generators:
sentiment_gen = outlines.generate.choice(model, ["positive", "negative"])
date_gen = outlines.generate.regex(model, r"\d{4}-\d{2}-\d{2}")
user_gen = outlines.generate.json(model, User)
Each generator caches its FSM independently. Switching between them is free at runtime.
For batched throughput, prefer vLLM:
model = outlines.models.vllm("model-name", gpu_memory_utilization=0.9)
gen = outlines.generate.json(model, User)
# Batch processing: pass the whole list so vLLM can batch the prompts internally.
results = gen(prompts)
Fix 7: Sampling Parameters
Control creativity and length:
from outlines.samplers import multinomial, greedy
# Default is multinomial sampling at temp=1.0 — varied outputs.
generator = outlines.generate.text(model, sampler=multinomial(temperature=0.7))
# Greedy: deterministic, picks highest-probability token at each step.
generator = outlines.generate.text(model, sampler=greedy())
# Generate with explicit max tokens:
result = generator("...", max_tokens=512)
For reproducible runs:
import torch
torch.manual_seed(42)
generator = outlines.generate.text(model, sampler=multinomial(temperature=0.7))
Common Mistake: Using multinomial(temperature=0.0). Multinomial with zero temperature is undefined; use greedy() for “always pick the most likely token.”
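A toy sampler shows why zero temperature is a problem and what greedy does instead. This is a standard-library sketch of temperature-scaled sampling in general, not the code behind outlines.samplers. The function names and logits are illustrative, and the division-by-zero error is a property of this sketch.

```python
import math
import random


def softmax(logits, temperature):
    """Temperature-scaled softmax; temperature must be greater than zero."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_multinomial(logits, temperature, rng):
    """Draw one index from the temperature-scaled distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


def sample_greedy(logits):
    """Always return the index of the largest logit."""
    return max(range(len(logits)), key=logits.__getitem__)


logits = [2.0, 1.0, 0.5]
rng = random.Random(42)  # seeded so repeated runs draw the same indices

print(sample_greedy(logits))  # 0
print([sample_multinomial(logits, 0.7, rng) for _ in range(5)])

try:
    sample_multinomial(logits, 0.0, rng)
except ZeroDivisionError:
    print("temperature=0.0 divides by zero in this sketch; use greedy instead")
```

Lower temperatures concentrate probability on the top token, and greedy is the limiting case expressed without the division.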
Fix 8: Prompt Templates
For chat models, format the prompt with the model’s expected template:
# Manual templating:
prompt = """<|user|>
Extract user info from: John Doe, age 30
<|end|>
<|assistant|>"""
result = generator(prompt)
Or use the tokenizer’s template:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
messages = [
    {"role": "system", "content": "You extract user info as JSON."},
    {"role": "user", "content": "John Doe, age 30"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
result = generator(prompt)
For Outlines + vLLM, pass the chat template via the vLLM model:
model = outlines.models.vllm("model-name")
# vLLM applies the template automatically when you pass messages.
Pro Tip: Mismatched chat templates are the #1 reason a model that “should be smart” returns nonsense. Always check the model’s README for the exact template format.
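If you go the manual route, wrapping the template in a small function avoids hand-editing multi-line strings. The sketch below reproduces the marker layout of the manual example in this section. format_chat_prompt is a hypothetical helper, and the layout is an assumption carried over from that example, so confirm it against the model’s README before relying on it.

```python
def format_chat_prompt(messages, assistant_tag="<|assistant|>"):
    """Render messages with the <|role|> ... <|end|> markers used above.

    Assumption: mirrors the manual template in this article, not an
    official specification of any model's chat format.
    """
    rendered = [
        f"<|{message['role']}|>\n{message['content']}\n<|end|>\n"
        for message in messages
    ]
    return "".join(rendered) + assistant_tag


example_messages = [
    {"role": "user", "content": "Extract user info from: John Doe, age 30"},
]
print(format_chat_prompt(example_messages))
```

Where a tokenizer template is available, it remains the safer choice, because it comes from the model’s own configuration.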
Version History: Outlines From 0.0.x to 1.0 and the Structured-Output Landscape
Outlines is one of the original structured-generation libraries, but the field has gotten crowded. Understanding the version timeline tells you why APIs changed, and the cross-tool comparison tells you when to reach for something else.
July 2023 — Outlines 0.0.x (initial release). The original paper-backed release (“Efficient Guided Generation for Large Language Models”, dottxt-ai). API was minimal: outlines.text.generate.regex(model, prompt, pattern). JSON support existed but was clunky. Only Transformers backend.
Late 2023 — 0.1.x (grammar mode). Added context-free-grammar (Lark) support via outlines.generate.cfg(model, grammar). Choice generation arrived. Added llama.cpp backend.
Early 2024 — 0.2 (regex stabilisation). Major rework of the FSM compiler for speed; regex compilation got significantly faster for common patterns. Added vLLM integration as a logits processor (the production-grade path used in Fix 1). Pydantic v2 contract was made strict — v1 Pydantic models started failing.
Mid 2024 — 0.0.46+ structured outputs maturity. JSON schema support became the headline feature. The outlines.generate.json(model, MyPydanticModel) API stabilised. MLX backend (Apple Silicon native) was added.
Late 2024 / 2025 — 1.0 (stable). The outlines.models.* namespace was reorganised. The streaming API (.stream()) became consistent across backends. Function-calling style “generators that bind to a Pydantic model” became the dominant pattern. Older imports like outlines.text.* were removed.
If your code reads import outlines.text.generate as generate you are on a pre-0.1 release and everything will look broken against current docs. pip install -U outlines and re-read the migration notes for whatever 1.x your install landed on.
vs Instructor
Instructor takes a different approach: it sits on top of the OpenAI / Anthropic / Mistral chat APIs and uses function-calling (or JSON-mode) to get structured responses, then validates them with Pydantic and re-asks the model if validation fails.
- Outlines guarantees structural validity by token-masking; Instructor encourages it and validates after.
- Instructor works against any API model without logits access; Outlines needs logits (so only local models or vLLM/Transformers).
- Use Instructor when you ship to OpenAI/Anthropic and don’t want to host a model. Use Outlines when you self-host and need bullet-proof structure (no retries, no parsing failures).
vs Marvin
Marvin (PrefectHQ) is higher-level: @marvin.ai_fn decorators that turn a Python function into a structured LLM call. Under the hood it uses function-calling or Instructor-style validation.
- Marvin is ergonomic for application code; Outlines is closer to the metal.
- Marvin doesn’t run local models in the same first-class way; Outlines targets local inference primarily.
vs LangChain Output Parsers
LangChain’s PydanticOutputParser / OutputFixingParser are prompt-engineering wrappers: they inject schema instructions into the prompt and parse the response, sometimes calling the model again to “fix” malformed output.
- No structural guarantee. A LangChain parser can still fail on malformed output — the model can simply ignore the schema instructions.
- Works against any model — no logits access required.
- Use LangChain parsers when you’re already in LangChain land and don’t need hard guarantees. Use Outlines when malformed JSON would crash your pipeline.
vs xgrammar / jsonformer / lm-format-enforcer
These are direct competitors to Outlines for token-mask-based structured generation:
- xgrammar (CMU, late 2024): introduced state-machine performance optimisations. Often faster than Outlines on complex JSON schemas. Used by vLLM as an alternative backend.
- jsonformer — earlier (2023), only handles JSON, no regex/choice/CFG. Simpler but less general.
- lm-format-enforcer — similar capability to Outlines, often paired with vLLM. Sometimes more lenient with tokenization edge cases.
If Outlines compilation is too slow on a complex schema, try xgrammar as a drop-in for the JSON path; it solves several of the FSM-compilation pain points by design.
Pinning strategy
Pin Outlines to an exact version in production (outlines==1.0.4 not outlines>=1.0). Pin the backend too (vllm==0.6.x, transformers==4.45.x). The interaction surface between Outlines and the inference backend is where most “it worked last week” breakages originate. Re-test after every coordinated bump.
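A startup check can catch drift between your pins and what is actually installed. This is a minimal standard-library sketch: check_pins and PINS are illustrative names, and the version numbers in PINS are placeholders rather than recommendations.

```python
from importlib.metadata import PackageNotFoundError, version


def check_pins(pins):
    """Compare installed versions against exact pins.

    Returns {package: problem} for every package that is missing or whose
    installed version differs from the pin. An empty dict means all match.
    """
    problems = {}
    for package, wanted in pins.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            problems[package] = f"not installed (wanted {wanted})"
            continue
        if installed != wanted:
            problems[package] = f"installed {installed}, wanted {wanted}"
    return problems


# Example pins; the version numbers here are placeholders, not recommendations.
PINS = {"outlines": "1.0.4", "vllm": "0.6.0"}

if __name__ == "__main__":
    for package, problem in check_pins(PINS).items():
        print(f"{package}: {problem}")
```

Running this at service start, or in CI, turns a silent version bump into an explicit message.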
Still Not Working?
A few less-obvious failures:
- Generation starts but produces gibberish for the first 5 tokens. The model is warming up the cache. Discard the first run or use model.warmup() if your backend exposes it.
- json generation returns None for optional fields. That’s correct: your schema marks them optional and the model didn’t fill them. Tighten the prompt if you want all fields populated.
- vLLM Outlines plugin not picked up. vLLM needs the outlines-vllm integration. Install via pip install outlines[vllm] or check vLLM’s --guided-decoding-backend flag.
- mlx backend returns empty strings on M-series Macs. MLX support is newer and less stable. Try transformers with device="mps" as a fallback while you debug.
- max_tokens exceeded with no warning. Outlines silently truncates if the model hits the limit before completing the schema. For structured output, set max_tokens generously.
- GPU runs out of memory immediately. Constrained generation has some overhead. Reduce gpu_memory_utilization (vLLM) or batch size, or use a smaller model.
- outlines.generate.cfg for context-free grammars hangs. CFG support is experimental and the grammar must be in Lark format. Stick to regex/JSON for production.
- Pydantic v1 schemas reject extra fields. Outlines 0.x targets Pydantic v2. Update your models or pin a compatible Outlines version.
- Tokenizer mismatch with vLLM. If you load the model with one tokenizer in Outlines and a different one in vLLM (e.g. fast vs slow), the FSM you compiled doesn’t align with sampling. Always let Outlines pull the tokenizer from the same model object you give to vLLM.
- Field constraints silently dropped. Pydantic Field(min_length=...) translates to JSON schema minLength. Outlines’ FSM enforces minLength for strings but Field(pattern=...) is only enforced for the body; a string that’s too short still fails post-validation. Check the schema Outlines actually sees with User.model_json_schema().
- Choice list with multi-token strings is slow. Each choice is converted to a token sequence at compile time. A list of 50 long phrases recompiles into a wide FSM. For long classification spaces, switch to a regex pattern like r"(label_\d{1,3})" and post-validate against your label set.
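For the last item, the post-validation step can be a few lines. The sketch below assumes the regex-constrained generator returns a plain string. VALID_LABELS is a hypothetical label set and validate_label is an illustrative helper name.

```python
import re

LABEL_PATTERN = re.compile(r"label_\d{1,3}")
# Hypothetical label set; replace with your own.
VALID_LABELS = {"label_1", "label_2", "label_17"}


def validate_label(output, valid=VALID_LABELS):
    """Return the label if it matches the pattern and is in the set, else None."""
    text = output.strip()
    if LABEL_PATTERN.fullmatch(text) and text in valid:
        return text
    return None


print(validate_label("label_17"))   # label_17
print(validate_label("label_999"))  # None
```

A None result means the model produced a well-formed label that is not in your set, which you can handle by retrying or falling back to a default.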
For related LLM constraint and validation issues, see Instructor not working, DSPy not working, Pydantic validation error, and vLLM not working.
Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.