Fix: DSPy Not Working — LM Configuration, Signatures, Modules, Optimizers, and Cache Surprises

Q: How do I fix "DSPy Not Working — LM Configuration, Signatures, Modules, Optimizers, and Cache Surprises"?

How to fix DSPy errors — no LM configured, signature field types, ChainOfThought vs Predict, optimizer (MIPROv2) setup, retrieval module wiring, async usage, and cache invalidation between runs.

The Error

You write your first DSPy program and it complains about a missing LM:

import dspy

predict = dspy.Predict("question -> answer")
result = predict(question="What's the capital of France?")
# dspy.utils.callbacks.UserError: No LM is loaded. Please configure your LM using `dspy.configure(lm=...)`.

Or your signature throws an unexpected output type:

class QA(dspy.Signature):
    """Answer a question concisely."""
    question: str = dspy.InputField()
    answer: int = dspy.OutputField()  # Expecting int

qa = dspy.Predict(QA)
result = qa(question="How many planets are in our solar system?")
# pydantic.ValidationError: Input should be a valid integer, got 'eight'

Or the optimizer runs but the resulting program is no better than the baseline:

optimizer = dspy.MIPROv2(metric=my_metric, auto="light")
optimized = optimizer.compile(student=program, trainset=trainset)
# No improvement. Sometimes worse.

Or you change a prompt and DSPy keeps returning cached old results:

# Edit the signature docstring → behavior unchanged.
# Edit the LM model name → unchanged.

Why This Happens

DSPy compiles natural-language modules into prompts. Three layers cause most issues:

Global LM configuration. Every dspy.Predict, dspy.ChainOfThought, or custom Module uses the LM set by dspy.configure(lm=...). Without it, the first call raises. You can also pass lm= per-call to override.
Signatures define the I/O contract. Strings like "question -> answer" are shorthand. For type validation or multiple fields, use a class-based dspy.Signature with typed InputField / OutputField. Mismatches between declared types and the model’s output cause Pydantic validation errors at runtime.
Caching is by call args. DSPy caches LM responses by the resolved prompt and model name. Editing your Python code (signature docstring, module composition) sometimes changes the prompt, sometimes doesn’t. When it doesn’t, cached results stick.

The “optimizer didn’t help” issue is usually too few or bad-quality training examples, or a metric that doesn’t differentiate good from bad outputs.

The architectural mental model that helps the most: DSPy is a prompt compiler, not a runtime. When you write dspy.Predict(MySignature), DSPy doesn’t immediately make an LM call — it lazily constructs a prompt template and waits for inputs. When you call predict(question="..."), it resolves the template (substituting field descriptions, in-context examples, and any optimization tweaks) into a single string, then hands that to LiteLLM. Every failure mode tracks back to one of these layers: the template wasn’t ready (no LM, missing fields), the LM produced something the parser can’t decode (type mismatch, malformed JSON), or the optimizer’s compiled prompt drifted from your code (cache, stale .json snapshot).

For production deployments, the cache and the optimizer are the two layers that most often cause incidents. The cache is keyed by the resolved prompt string plus the model identifier; if you change models or update DSPy itself, your cache may invalidate without warning, and the next batch of requests stampedes against the LM provider. The optimizer’s compiled prompts are stored in a JSON artifact (program.save("...")) that you typically check into source control or ship with your model bundle — if the runtime DSPy version doesn’t match the version that produced the artifact, behavior diverges silently. Pin both the DSPy version and the model snapshot, and treat the optimized prompt JSON as a versioned artifact, not just a config file.

Fix 1: Configure an LM Globally

import dspy

lm = dspy.LM("openai/gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])
dspy.configure(lm=lm)

predict = dspy.Predict("question -> answer")
print(predict(question="What's the capital of France?").answer)
# Paris

DSPy uses LiteLLM under the hood, so any LiteLLM-supported provider works:

# Anthropic
dspy.configure(lm=dspy.LM("anthropic/claude-3-5-sonnet-20241022"))

# Ollama (local)
dspy.configure(lm=dspy.LM("ollama/llama3.1", api_base="http://localhost:11434"))

# Google Gemini
dspy.configure(lm=dspy.LM("gemini/gemini-1.5-pro"))

# With cost tracking and config:
lm = dspy.LM(
    "openai/gpt-4o",
    max_tokens=1000,
    temperature=0.0,
    cache=True,
)
dspy.configure(lm=lm)

For multi-LM workflows (cheap model for retrieval, expensive for synthesis), use dspy.context:

cheap = dspy.LM("openai/gpt-4o-mini")
expensive = dspy.LM("openai/gpt-4o")

dspy.configure(lm=cheap)

with dspy.context(lm=expensive):
    result = synthesize(...)  # Uses expensive
result = retrieve(...)  # Back to cheap

Pro Tip: Lock the LM down for production reproducibility — pin the model version, set temperature=0.0, and pin the DSPy version in your lockfile.

Fix 2: Use Class-Based Signatures for Type Safety

The string shorthand is fine for prototyping. For anything real, use a class:

class AnswerWithCitation(dspy.Signature):
    """Answer the question with at least one citation URL."""
    
    question: str = dspy.InputField(desc="A user question")
    context: list[str] = dspy.InputField(desc="Retrieved passages")
    answer: str = dspy.OutputField(desc="A concise answer")
    citation: str = dspy.OutputField(desc="A URL supporting the answer")

qa = dspy.Predict(AnswerWithCitation)
result = qa(question="...", context=[...])
print(result.answer, result.citation)

Typed fields use Pydantic for validation. If the LM returns text that can’t be coerced to the declared type, DSPy raises a clear error (and retries, depending on the LM config).

For complex output structures, use Pydantic models:

from pydantic import BaseModel

class Result(BaseModel):
    summary: str
    confidence: float
    sources: list[str]

class Summarize(dspy.Signature):
    """Summarize the document."""
    document: str = dspy.InputField()
    result: Result = dspy.OutputField()

summarize = dspy.Predict(Summarize)
out = summarize(document="...")
print(out.result.summary, out.result.confidence)

Common Mistake: Declaring int or float outputs and getting validation errors. Models sometimes write "eight" instead of 8. Either accept str and convert in your code, or strengthen the description: desc="Return the number as a digit, not spelled out".

Fix 3: Pick the Right Module

DSPy has several built-in modules:

dspy.Predict — the simplest. Just runs the signature.
dspy.ChainOfThought — adds a rationale step before answering. Often improves accuracy on reasoning tasks.
dspy.ProgramOfThought — generates and executes Python code for the answer. Best for numeric/symbolic problems.
dspy.ReAct — interleaves reasoning with tool calls.
dspy.MultiChainComparison — generates multiple chains and picks the best.

Switching is one line:

# Simple:
qa = dspy.Predict("question -> answer")

# With reasoning:
qa = dspy.ChainOfThought("question -> answer")
result = qa(question="...")
print(result.rationale)  # The thinking step
print(result.answer)

For agentic patterns, wire tools into ReAct:

def search(query: str) -> str:
    return wikipedia.search(query)

def calculate(expression: str) -> str:
    return str(eval(expression))

agent = dspy.ReAct("question -> answer", tools=[search, calculate])
result = agent(question="Population of Tokyo squared")

ReAct lets the LM call your tools by name. The tool function’s docstring/signature becomes part of the prompt.

Fix 4: Compose Modules in a Custom Program

For multi-step programs, subclass dspy.Module:

class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(AnswerWithCitation)
    
    def forward(self, question):
        passages = self.retrieve(question).passages
        prediction = self.generate_answer(question=question, context=passages)
        return prediction

rag = RAG()
result = rag(question="What is DSPy?")

The forward method is your control flow. Each sub-module (self.retrieve, self.generate_answer) becomes an optimization target.

Pro Tip: Keep forward deterministic where possible (no random calls, no side effects). Optimizers re-run forward many times with different prompts — non-determinism in your code makes the metric noisier.

Fix 5: Configure a Retriever for RAG

dspy.Retrieve needs a configured RM (retrieval model):

# ColBERTv2-hosted retriever (the canonical DSPy example):
colbert = dspy.ColBERTv2(url="http://20.102.90.50:2017/wiki17_abstracts")
dspy.configure(rm=colbert)

retrieve = dspy.Retrieve(k=5)
passages = retrieve("DSPy framework").passages

For your own retriever (BM25, Chroma, Pinecone, etc.), wrap it in a dspy.Retriever subclass:

class ChromaRM(dspy.Retrieve):
    def __init__(self, collection, k=3):
        super().__init__(k=k)
        self.collection = collection
    
    def forward(self, query, k=None):
        results = self.collection.query(
            query_texts=[query],
            n_results=k or self.k,
        )
        return dspy.Prediction(passages=results["documents"][0])

dspy.configure(rm=ChromaRM(my_chroma_collection))

Fix 6: Optimize With MIPROv2

The optimizer compiles your program by searching for better prompts using examples:

trainset = [
    dspy.Example(question="...", answer="...").with_inputs("question"),
    # ... 20-50 examples ...
]

def my_metric(example, pred, trace=None):
    return example.answer.lower() in pred.answer.lower()

optimizer = dspy.MIPROv2(metric=my_metric, auto="medium", num_threads=4)
optimized_rag = optimizer.compile(
    student=RAG(),
    trainset=trainset,
)

optimized_rag.save("optimized_rag.json")

auto="light", auto="medium", auto="heavy" are progressively more thorough (and more expensive). Start with light for fast iteration.

If the optimized program isn’t better:

Trainset too small. Aim for at least 20 high-quality examples, ideally 50+.
Metric is noisy. Test the metric manually first — give it a perfect pred and a bad pred, confirm it returns 1 and 0 respectively. Vague metrics confuse the optimizer.
Student model is the limit. If gpt-4o-mini can’t do the task even with the best prompt, no amount of optimization will save it. Try a stronger LM as the student or as a teacher.

Note: MIPROv2 calls the LM many times during optimization. Estimate cost before running on production data: a light run on 50 examples with gpt-4o might cost a few dollars.

Fix 7: Cache Control

DSPy caches LM responses. Useful in dev (no repeated cost for the same prompt), confusing when iterating:

# Disable cache for a fresh run:
dspy.configure(cache=False)

# Or disable on the LM:
lm = dspy.LM("openai/gpt-4o-mini", cache=False)
dspy.configure(lm=lm)

The cache key includes the resolved prompt and model parameters. If you edit your code but the generated prompt is identical (same signature, same inputs), DSPy reuses the cached response.

To inspect what was actually sent to the LM:

dspy.inspect_history(n=3)  # Last 3 calls

Prints the actual prompts and responses — invaluable when “DSPy returns weird stuff” turns out to be your signature producing a weirder-than-expected prompt.

Fix 8: Async Calls and Concurrency

DSPy 2.5+ supports async:

import asyncio

async def main():
    qa = dspy.Predict("question -> answer")
    result = await qa.acall(question="What's 2+2?")
    print(result.answer)

asyncio.run(main())

For concurrent calls (e.g. batch inference), use dspy.Parallel:

pred = dspy.ChainOfThought("question -> answer")

questions = [
    {"question": "Q1"},
    {"question": "Q2"},
    {"question": "Q3"},
]

results = dspy.Parallel(num_threads=8).forward(pred, questions)

For older DSPy versions without acall, wrap the sync function in asyncio.to_thread:

result = await asyncio.to_thread(qa, question="...")

Production Incident Lens: When the Prompt Pipeline Breaks

The DSPy production incident that hurts most is the silent cost blowup. Your prompt pipeline works fine in dev because the cache is warm; you deploy, traffic hits, and either the cache is cold (every request hits the LM) or the optimizer’s compiled prompts grew the per-request token count. Either way, your LLM bill spikes by 5-10x within hours, and the alert that fires first is usually your provider’s spend dashboard, not your own monitoring. By the time you notice, you’ve burned the monthly budget in an afternoon.

Defend against this with three layers. First, track per-request token usage from inside DSPy and emit it to your metrics pipeline — dspy.LM exposes call history, and you can wrap each invocation to log the prompt/completion token count alongside the route or endpoint that triggered it. Second, set a hard ceiling on the prompt length: if any module attempts a call above N tokens, fail loudly rather than passing it through. A runaway retrieval module that grows context size into the 100k-token range is the most common amplifier of LLM cost incidents. Third, run a synthetic load test against the production DSPy pipeline before a major deploy — even ten requests per endpoint catches order-of-magnitude regressions.

The other production trap is endpoint coupling. If every API route runs through the same DSPy pipeline (one Predict module shared across handlers), an upstream provider outage takes down every route simultaneously. Split critical paths from experimental ones, cache aggressively at the application layer for read-heavy endpoints, and have a fallback response ready for the case where the LM call fails or exceeds its timeout. The blast radius of an LM provider 5xx isn’t just “this request fails” — it’s every DSPy-dependent endpoint going down at once, with retry storms making the recovery worse.

Still Not Working?

A few less-obvious failures:

ValidationError only on some inputs. The LM’s output style is unstable. Add explicit desc= to the OutputField clarifying the format, or use dspy.TypedPredictor (forces stricter typing).
Trainset examples don’t seem to influence the optimized program. Make sure each example calls .with_inputs("field1", "field2") to mark which fields are inputs vs labels. Without it, the optimizer treats all fields as inputs.
Saved/loaded programs don’t behave the same. program.save("p.json") stores the optimized prompts. program.load("p.json") restores them — but the LM, RM, and Module structure must match the original at load time.
Token limit exceeded in retrieval. Long contexts fill the LM’s window. Either reduce k in Retrieve, or filter/summarize passages before passing to the next module.
OpenAI 429 during MIPROv2 run. Reduce num_threads, or add min_examples to slow down. Long optimization runs hammer the API.
dspy.inspect_history shows weird system prompts. That’s DSPy’s prompt engineering at work. To customize, write a custom Module that constructs the prompt yourself (escape hatch from the framework).
Different results between local and production. Cache. Or temperature > 0. Or the LM version drifted (Anthropic’s latest alias points at different snapshots). Pin everything explicitly.
AssertionError from dspy.Suggest/dspy.Assert. These are constraints DSPy uses for self-refinement. The constraint failed — read the assertion’s message and either weaken it or improve the upstream prompt.
Optimized prompt JSON grew in size between commits. Compiled prompts include in-context examples; if your trainset grew, so did the per-call payload — and so did per-call cost. Diff the saved JSON across deploys to catch silent prompt bloat.
A library upgrade reset every cache key. DSPy version bumps sometimes change the prompt template structure, invalidating the cache. After upgrading, prewarm the cache with representative inputs before peak traffic.
One slow LM provider blocks the whole pipeline. Without a per-call timeout, a misbehaving provider holds your worker indefinitely. Set timeout= on dspy.LM and surface the timeout as a clean 503 to the client, not as a hung request.

For related Python LLM tooling and validation issues, see LiteLLM not working, LangChain Python not working, Pydantic validation error, and OpenAI API not working.