
Fix: LiteLLM Not Working — Model Name Format, API Keys, Streaming, and Fallback Errors

FixDevs


Quick Answer

How to fix LiteLLM errors: BadRequestError ("LLM Provider NOT provided"), missing API key env vars, streaming chunk differences, fallback models not triggering, unsupported params and drop_params, and proxy server 401s.

The Error

You call litellm.completion with what looks like a valid model name and get this:

import litellm

response = litellm.completion(
    model="claude-3-5-sonnet",
    messages=[{"role": "user", "content": "Hi"}],
)
# litellm.exceptions.BadRequestError:
# LLM Provider NOT provided. Pass in the LLM provider you are trying to call.
# You passed model=claude-3-5-sonnet
# Pass model as E.g. For 'Huggingface' inference endpoints pass in
# `completion(model='huggingface/starcoder',..)`

Or you set OPENAI_API_KEY, switch the model to Anthropic, and the call fails because LiteLLM won't reuse the OpenAI key for another provider:

AuthenticationError: Anthropic API key not provided. Set ANTHROPIC_API_KEY env var.

Or fallback doesn’t trigger when the primary model fails:

response = litellm.completion(
    model="gpt-4o",
    fallbacks=["claude-3-5-sonnet-20241022"],
    messages=[...],
)
# Primary fails. No fallback attempted. Error raised.

Or streaming chunks behave differently across providers, even though LiteLLM normalizes them to the OpenAI shape:

# OpenAI: chunk.choices[0].delta.content is str
# Anthropic via LiteLLM: chunk.choices[0].delta.content is str
# Gemini: sometimes None for the first chunk
# Bedrock: same normalized shape, but chunk boundaries and empty deltas differ

Why This Happens

LiteLLM normalizes 100+ LLM providers behind one API, but the normalization isn’t perfect. Three sources of pain:

  • Provider prefix in the model name. LiteLLM uses the model string to route the request. It can infer the provider for names it recognizes, such as gpt-4o, but an undated alias like claude-3-5-sonnet matches nothing it knows. Write anthropic/claude-3-5-sonnet-20241022 so the router picks the right SDK.
  • Per-provider env vars. Each provider needs its own key: OPENAI_API_KEY, ANTHROPIC_API_KEY, GEMINI_API_KEY, AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY for Bedrock, etc. LiteLLM reads them lazily — a missing one only fails when you actually call that provider.
  • Provider-specific kwargs. OpenAI accepts response_format={"type": "json_object"} but Anthropic doesn’t. Pass it anyway and you get BadRequestError. LiteLLM offers drop_params=True to silently drop unsupported kwargs.

Fallbacks have a separate gotcha: the fallbacks parameter only triggers on errors LiteLLM classifies as retryable (rate limits, timeouts, server errors). A BadRequestError from a bad prompt is not retried — by design.

A fourth source of pain shows up only at scale: LiteLLM’s Router keeps its own in-process state for rate-limit awareness, per-model token usage, and cooldown windows. When you run multiple replicas behind a load balancer, each replica has its own view of what’s healthy. A provider that one replica has cooled down for 60 seconds is still receiving traffic from the other three replicas — and that’s how you end up paging on “OpenAI rate limit cleared 10 minutes ago, why is the proxy still erroring?” The answer is usually one replica’s cooldown timer that hasn’t aged out yet.

Production Incident Lens: When the LLM Router Goes Down

The classic LiteLLM production incident is brutal in scope. You sit a LiteLLM proxy in front of OpenAI, Anthropic, Bedrock, and a self-hosted vLLM cluster. Every product surface that calls an LLM — chat, summarization, embeddings for search, RAG, the daily report job — flows through this one proxy. When the proxy degrades, the blast radius is every LLM-dependent endpoint at once. Users see “AI is down” across totally unrelated features and your support inbox lights up in the same minute.

The failure modes you have to distinguish during an incident:

  • Upstream provider outage. OpenAI returns 5xx, LiteLLM correctly fails over to Anthropic, but Anthropic is also degraded. Fallbacks chain through and you get the last provider’s error. Watching only the LiteLLM error rate hides whether your fallbacks even tried.
  • Auth rotation. A team rotated OPENAI_API_KEY on the secrets manager but the proxy pod cached the old value at startup. Every OpenAI call returns 401, and fallbacks fire on every request, doubling cost and latency.
  • Cooldown drift. A transient rate limit on one model puts it in cooldown. The router routes elsewhere correctly, but the cooldown never clears because the health-check probe path bypasses the router.
  • Proxy version skew. You upgraded LiteLLM in one environment and not another. The model alias names in config.yaml are valid in one version and unknown in the other.

The right monitoring stack is per-provider error rate, per-model p95 latency, and fallback invocation rate. Wire LiteLLM’s success_callback and failure_callback to your metrics backend; do not rely on the proxy’s built-in dashboard alone, because if the proxy is down the dashboard goes with it. The fallback invocation rate is the single most useful gauge — under normal load it sits near zero; a spike means a provider is failing and your bill is about to climb. Alert when fallbacks exceed 5% of total calls for 5 consecutive minutes.
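The "5% for 5 consecutive minutes" rule is easy to get subtly wrong, so here is a minimal sketch of it. It assumes your callbacks already export per-minute totals and fallback counts; the FallbackAlert class and its method names are hypothetical and not part of LiteLLM.

```python
from collections import deque


class FallbackAlert:
    """Fires when the fallback rate exceeds a threshold for N consecutive minutes."""

    def __init__(self, threshold: float = 0.05, window_minutes: int = 5):
        self.threshold = threshold
        # Keep only the most recent N per-minute verdicts.
        self.recent = deque(maxlen=window_minutes)

    def record_minute(self, total_calls: int, fallback_calls: int) -> bool:
        """Record one minute of traffic; return True if the alert should fire."""
        rate = fallback_calls / total_calls if total_calls else 0.0
        self.recent.append(rate > self.threshold)
        # Fire only when the window is full and every minute breached the threshold.
        return len(self.recent) == self.recent.maxlen and all(self.recent)
```

A single quiet minute resets the streak, which keeps a brief provider blip from paging anyone.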

When the page fires at 3am, work the playbook: hit the proxy’s /health/liveliness and /health/readiness endpoints, then test each upstream from inside the proxy pod with curl (curl -H "Authorization: Bearer $OPENAI_API_KEY" https://api.openai.com/v1/models). That tells you in 30 seconds whether the problem is the proxy or the provider. If the upstream is healthy from inside the pod, restart the proxy — cached auth state and stuck cooldowns evaporate. If the upstream is unhealthy, flip the model alias in config.yaml to a known-good fallback and let the deploy pipeline reload the proxy.
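The first step of that playbook can be scripted so nobody has to remember the endpoint names at 3am. This is a rough sketch using only the standard library; probe and triage are hypothetical helper names, and it assumes both health endpoints answer without authentication in your deployment.

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 5.0) -> tuple[bool, str]:
    """Return (healthy, detail) for a single HTTP GET."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300, f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        return False, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        return False, f"unreachable: {exc}"


def triage(proxy_base: str, fetch=probe) -> str:
    """Classify the incident as proxy-side or needing an upstream check."""
    for path in ("/health/liveliness", "/health/readiness"):
        healthy, detail = fetch(proxy_base.rstrip("/") + path)
        if not healthy:
            return f"proxy problem at {path} ({detail})"
    return "proxy healthy; test each upstream from inside the pod next"
```

Call triage("http://localhost:4000") from a machine that can reach the proxy. The fetch parameter exists only so the logic can be exercised without a live proxy.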

Fix 1: Use the provider/model Format

The safest pattern: always prefix the model with the provider. It removes ambiguity and survives the next time LiteLLM adds a new model that overlaps a name from another provider:

import litellm

# OpenAI
litellm.completion(model="openai/gpt-4o", messages=[...])
litellm.completion(model="openai/gpt-4o-mini", messages=[...])

# Anthropic
litellm.completion(model="anthropic/claude-3-5-sonnet-20241022", messages=[...])
litellm.completion(model="anthropic/claude-3-5-haiku-20241022", messages=[...])

# Gemini
litellm.completion(model="gemini/gemini-1.5-pro", messages=[...])

# Bedrock
litellm.completion(model="bedrock/anthropic.claude-3-5-sonnet-20241022-v2:0", messages=[...])

# Ollama (local)
litellm.completion(model="ollama/llama3.1", api_base="http://localhost:11434", messages=[...])

# Together AI
litellm.completion(model="together_ai/meta-llama/Llama-3-70b-chat-hf", messages=[...])

Pro Tip: Look up exact provider prefixes at LiteLLM’s provider list. The pattern is {provider}/{model_id} where model_id is what the provider’s API expects — including the version date for Anthropic.
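If model names come from config files or user input, a small guard can reject unprefixed names before they ever reach LiteLLM. The sketch below is a hypothetical helper (split_model is not a LiteLLM function); it checks only the {provider}/{model_id} shape, not whether LiteLLM knows the provider.

```python
def split_model(model: str) -> tuple[str, str]:
    """Split 'provider/model_id' on the first slash only."""
    provider, sep, model_id = model.partition("/")
    if not sep or not provider or not model_id:
        raise ValueError(f"expected 'provider/model_id', got {model!r}")
    return provider, model_id
```

Splitting on the first slash matters because some model IDs contain slashes themselves, as in the Together AI example above.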

Fix 2: Set the Right Env Vars

LiteLLM doesn’t share keys between providers. Set each one explicitly:

export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
export GEMINI_API_KEY=AIza...
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_REGION_NAME=us-east-1
export GROQ_API_KEY=gsk_...

Or pass keys per call (useful in multi-tenant code):

litellm.completion(
    model="anthropic/claude-3-5-sonnet-20241022",
    api_key="sk-ant-...",
    messages=[...],
)

Verify which keys are loaded with:

import litellm
litellm.set_verbose = True  # Logs the resolved key source and provider routing

Common Mistake: Putting keys in .env but forgetting python-dotenv. LiteLLM reads os.environ, not .env files. Call load_dotenv() at app startup or use pydantic-settings.
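Because keys are read lazily, a missing one only surfaces on the first real call to that provider. One way to catch it earlier is a startup check like the sketch below. The REQUIRED_ENV table and missing_env helper are hypothetical and cover only the variables listed above; Bedrock in particular can also authenticate through IAM roles, in which case you would relax its entry.

```python
import os

# Env var names taken from the list above; extend for the providers you use.
REQUIRED_ENV = {
    "openai": ["OPENAI_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
    "gemini": ["GEMINI_API_KEY"],
    "bedrock": ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION_NAME"],
    "groq": ["GROQ_API_KEY"],
}


def missing_env(models: list[str], env=None) -> dict[str, list[str]]:
    """Map each provider used by `models` to the env vars it is missing."""
    env = os.environ if env is None else env
    missing = {}
    for model in models:
        provider = model.split("/", 1)[0]
        absent = [name for name in REQUIRED_ENV.get(provider, []) if not env.get(name)]
        if absent:
            missing[provider] = absent
    return missing
```

Run it after load_dotenv() with every model string your app can route to, and fail fast if the result is non-empty.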

Fix 3: Configure Fallbacks Correctly

fallbacks is a list of model strings tried in order when the primary fails with a retryable error. The exception types LiteLLM treats as retryable include RateLimitError, Timeout, APIConnectionError, and ServiceUnavailableError.

import litellm
from litellm import completion

response = completion(
    model="openai/gpt-4o",
    messages=[...],
    fallbacks=[
        "anthropic/claude-3-5-sonnet-20241022",
        "gemini/gemini-1.5-pro",
    ],
    num_retries=2,
)

For more control, use the Router with explicit fallback policies:

from litellm import Router

router = Router(
    model_list=[
        {"model_name": "primary", "litellm_params": {"model": "openai/gpt-4o"}},
        {"model_name": "backup", "litellm_params": {"model": "anthropic/claude-3-5-sonnet-20241022"}},
    ],
    fallbacks=[{"primary": ["backup"]}],
    # Fallback targets are model_name entries from model_list, not raw model strings.
    context_window_fallbacks=[{"primary": ["backup"]}],
)

response = router.completion(
    model="primary",
    messages=[...],
)

context_window_fallbacks is gold for long-context prompts — when the primary’s context limit is too small, LiteLLM automatically retries on a larger-context model.
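To see why a bad prompt never reaches the backup model, it helps to write the retryable versus non-retryable split by hand. The sketch below is an illustration of the idea, not LiteLLM's internal implementation. RetryableError and BadRequest are stand-in classes so the snippet runs without any provider; in real code you would catch the LiteLLM exception types named above.

```python
class RetryableError(Exception):
    """Stand-in for rate limits, timeouts, and server errors."""


class BadRequest(Exception):
    """Stand-in for a non-retryable client error."""


def complete_with_fallbacks(call, models, messages):
    """Try each model in order; move on only after a retryable failure."""
    last_error = None
    for model in models:
        try:
            return call(model=model, messages=messages)
        except RetryableError as exc:
            last_error = exc  # Remember it and try the next model.
        # BadRequest is deliberately not caught: a bad prompt fails everywhere.
    raise last_error or ValueError("no models given")
```

When every model fails with a retryable error, the caller sees the last provider's exception, which matches the "you get the last provider's error" behavior described in the incident section.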

Fix 4: Drop Provider-Incompatible Params

If you pass response_format, seed, or logprobs (OpenAI features) to Anthropic, you get a BadRequestError. Two fixes:

Drop unsupported params silently:

litellm.drop_params = True

# Or per call:
litellm.completion(
    model="anthropic/claude-3-5-sonnet-20241022",
    response_format={"type": "json_object"},  # Dropped, not sent.
    messages=[...],
    drop_params=True,
)

Or branch on the provider:

import litellm

# Explicit allowlist of provider prefixes you have verified support JSON mode.
JSON_MODE_PREFIXES = ("openai/",)

def supports_json_mode(model: str) -> bool:
    return model.startswith(JSON_MODE_PREFIXES)

kwargs = {"messages": messages}
if supports_json_mode(model):
    kwargs["response_format"] = {"type": "json_object"}

litellm.completion(model=model, **kwargs)

Note: drop_params=True is convenient but silent. If you rely on JSON mode and switch to a provider that doesn’t support it, your prompts will start returning prose. Pair drop_params=True with explicit “respond in JSON” instructions in the system prompt as a safety net.
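A middle ground between silent dropping and hand-written branches is to filter the kwargs yourself and log every drop. The helper below is a hypothetical sketch: filter_params is not a LiteLLM function, and the supported set is something you supply. LiteLLM ships a get_supported_openai_params helper that may serve as the source for that set, but check its signature in your installed version before relying on it.

```python
import logging

logger = logging.getLogger("llm_params")


def filter_params(kwargs: dict, supported: set[str]) -> dict:
    """Drop unsupported kwargs, but log each drop instead of staying silent."""
    kept = {}
    for name, value in kwargs.items():
        if name in supported:
            kept[name] = value
        else:
            logger.warning("dropping unsupported param %r", name)
    return kept
```

The warning gives you something to grep for when output suddenly switches from JSON to prose.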

Fix 5: Handle Streaming Across Providers

LiteLLM normalizes streaming chunks to the OpenAI shape, but the content of those chunks varies — some providers send chunks with delta.content=None to signal role/tool start. Filter for content:

response = litellm.completion(
    model="anthropic/claude-3-5-sonnet-20241022",
    messages=[...],
    stream=True,
)

for chunk in response:
    content = chunk.choices[0].delta.content
    if content:  # Skip None and empty
        print(content, end="", flush=True)

For async streaming:

import asyncio
from litellm import acompletion

async def main():
    response = await acompletion(
        model="anthropic/claude-3-5-sonnet-20241022",
        messages=[...],
        stream=True,
    )
    async for chunk in response:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)

asyncio.run(main())

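Usage data, when the provider reports it, arrives on the final chunk; passing stream_options={"include_usage": True} requests it explicitly. Read it from that chunk, or collect the chunks while iterating and call litellm.stream_chunk_builder(chunks) to rebuild the full response object.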
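If you need the full text rather than a live print, the same filtering can live in one helper. This is a minimal sketch that assumes the OpenAI-style chunk shape shown above; collect_text and fake_chunk are hypothetical names, and fake_chunk exists only to build stand-in data without calling a provider.

```python
from types import SimpleNamespace


def collect_text(chunks) -> str:
    """Join delta.content across chunks, skipping None, empty, and choice-less chunks."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # e.g. a usage-only final chunk
            continue
        content = chunk.choices[0].delta.content
        if content:
            parts.append(content)
    return "".join(parts)


def fake_chunk(content):
    """Build a stand-in chunk with the OpenAI-style shape (test data only)."""
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])
```

The empty-choices guard is there because a trailing usage chunk may carry no choices at all, and indexing choices[0] on it would raise.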

Fix 6: Track Cost and Token Usage

LiteLLM has built-in cost tracking for every supported provider. Read it off the response:

response = litellm.completion(model="openai/gpt-4o", messages=[...])
print("Cost:", response._hidden_params["response_cost"])
print("Tokens:", response.usage.total_tokens)

For aggregate tracking, register a success callback:

def track(kwargs, completion_response, start_time, end_time):
    print(f"{kwargs['model']}: ${completion_response._hidden_params['response_cost']:.4f}")

litellm.success_callback = [track]

If response_cost shows 0.0, LiteLLM doesn’t have pricing for that model. Add custom pricing:

litellm.register_model({
    "my-custom-model": {
        "max_tokens": 8192,
        "input_cost_per_token": 0.000001,
        "output_cost_per_token": 0.000002,
        "litellm_provider": "openai",
        "mode": "chat",
    }
})
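For per-model totals rather than a print per call, the callback can be a small object that accumulates cost. The sketch below assumes the cost is available as kwargs["response_cost"] inside the callback, which is how LiteLLM's custom callback examples read it; CostTracker itself is a hypothetical class.

```python
from collections import defaultdict


class CostTracker:
    """Callable with the success-callback signature; sums cost per model."""

    def __init__(self):
        self.totals = defaultdict(float)

    def __call__(self, kwargs, completion_response, start_time, end_time):
        # Treat a missing or None cost as 0.0 so one unpriced model
        # does not break the callback.
        cost = kwargs.get("response_cost") or 0.0
        self.totals[kwargs.get("model", "unknown")] += cost
```

Register an instance with litellm.success_callback = [tracker] and read tracker.totals whenever you flush metrics.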

Fix 7: Proxy Server (litellm --model ...) Returns 401

The LiteLLM proxy server is a separate use case — you run it as a gateway and your apps point at http://localhost:4000. If clients get 401, check the master key:

# Set the master key the proxy will require from clients
export LITELLM_MASTER_KEY="sk-1234"

# Start the proxy
litellm --config config.yaml --port 4000

In your client:

from openai import OpenAI

client = OpenAI(
    api_key="sk-1234",  # The LiteLLM master key, not OpenAI's
    base_url="http://localhost:4000",
)
client.chat.completions.create(model="gpt-4o", messages=[...])

For per-team or per-user keys, create virtual keys via the proxy’s /key/generate endpoint or the UI. Don’t share the master key with end users — it has admin privileges.

Fix 8: Logging and Debugging

When something silently misbehaves, turn on verbose mode and watch the resolved request:

import litellm

litellm.set_verbose = True  # Print SDK-level routing and request details

# Or in production, route LiteLLM's logger through the standard logging module
import logging
logging.basicConfig(level=logging.INFO)
logging.getLogger("LiteLLM").setLevel(logging.DEBUG)

You’ll see the exact provider, model ID, API base, and which env var was used. This is the fastest way to debug “why is my Anthropic call going to OpenAI.”

Still Not Working?

A few less-obvious failures:

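  • json_repair import errors. These usually mean LiteLLM's optional dependencies are outdated or missing. Run pip install -U "litellm[proxy]" or pip install "litellm[extra_proxy]" (the quotes keep shells like zsh from expanding the brackets).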
  • Anthropic prompt caching not triggering. Pass cache_control: {"type": "ephemeral"} on a message block — LiteLLM forwards it through. Cache hits require identical prefixes.
  • Tool calls with mismatched IDs across providers. Anthropic uses tool_use_id, OpenAI uses tool_call_id. LiteLLM normalizes the response, but if you hand-build the next turn’s messages you need to use the same id the model returned, not a regenerated one.
  • Embeddings work for one provider but not another. embedding() is a separate function. Use litellm.embedding(model="openai/text-embedding-3-small", input=[...]). Provider prefixes work the same way as completion.
  • litellm.completion blocks the event loop in FastAPI. You called the sync function from async code. Use litellm.acompletion and await.
  • Token counting is off by 5-10%. LiteLLM estimates tokens for providers that don’t return exact counts. For billing, treat counts as approximate and add a safety margin.
  • Proxy returns 429 even when the upstream provider isn’t rate-limited. Check the proxy’s tpm_limit / rpm_limit settings in config.yaml — they apply before the upstream call.
  • drop_params=True drops a param you actually need. Set litellm.drop_params = False globally and handle compatibility in your code — explicit beats implicit when output quality matters.
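For the tool-call ID item in the list above, the safe habit is to copy the id straight from the model's tool call into the follow-up message. A minimal sketch, assuming the OpenAI-style message format that LiteLLM normalizes to; tool_result_message is a hypothetical helper, and the tool call is shown as a plain dict, whereas a real response object exposes the id as an attribute.

```python
import json


def tool_result_message(tool_call, result) -> dict:
    """Build the follow-up 'tool' message, reusing the id the model returned."""
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],  # Reuse; never generate a fresh id.
        "content": json.dumps(result),
    }
```

Append the returned dict to messages after the assistant turn that contained the tool call, then send the next completion request.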

Replicas Cool Down Independently and Cause Phantom 429s

LiteLLM Router’s cooldown state is per-process, so under any horizontal scale you get inconsistent routing. If one pod cools down gpt-4o for 60 seconds, the other pods still send traffic there. Two workable fixes: back the router with a shared Redis instance (the Redis connection settings under router_settings in config.yaml; confirm the exact keys for your version) so cooldowns are global, or run a single proxy replica behind a smaller pool of stateless app pods. The Redis option survives pod restarts, which the in-memory option doesn’t.
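A toy simulation makes the divergence concrete. Nothing below is LiteLLM code: Replica is a hypothetical stand-in for a router process, and a shared dict plays the role Redis would play in a real deployment.

```python
class Replica:
    """Toy router replica; `cooldowns` maps model name to cooldown expiry time."""

    def __init__(self, cooldowns: dict):
        self.cooldowns = cooldowns

    def mark_rate_limited(self, model: str, now: float, seconds: float = 60.0):
        self.cooldowns[model] = now + seconds

    def is_available(self, model: str, now: float) -> bool:
        return now >= self.cooldowns.get(model, 0.0)


# Per-process state: each replica owns its own dict.
a, b = Replica({}), Replica({})
a.mark_rate_limited("gpt-4o", now=0.0)
isolated_view = (a.is_available("gpt-4o", 10.0), b.is_available("gpt-4o", 10.0))

# Shared state: both replicas see the same dict (a stand-in for Redis).
shared = {}
c, d = Replica(shared), Replica(shared)
c.mark_rate_limited("gpt-4o", now=0.0)
shared_view = (c.is_available("gpt-4o", 10.0), d.is_available("gpt-4o", 10.0))
```

With isolated state, replica b still considers gpt-4o available ten seconds into a's cooldown and keeps sending traffic; with shared state, both replicas agree the model is cooling down.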

Streaming Responses Drop Mid-Stream Under a Load Balancer

If you put nginx, ALB, or Cloudflare in front of the LiteLLM proxy and streaming chunks stop arriving after 60 seconds, the load balancer is closing connections it considers idle. A stream only counts as active while chunks are flowing, and the gaps between chunks vary by provider, so a slow generation can cross the idle timeout. Raise the load-balancer idle timeout (300s is a safe floor) and disable response buffering for the streaming endpoint. For nginx that’s proxy_buffering off; and proxy_read_timeout 300s; on the relevant location block. Without this, long generations look like silent truncation to clients.

Cost Tracking Misses Calls Made Via the Proxy’s HTTP API

The Python success_callback only fires for in-process litellm.completion calls. When apps hit the proxy via HTTP, you need the proxy’s own logging hooks (litellm_settings.success_callback in config.yaml) wired to your usage backend. Otherwise, cost attribution is half-blind: you see what your Python services spent, not what the proxy spent on their behalf. Most teams discover this on the first surprise OpenAI invoice.

For related LLM SDK and proxy issues, see OpenAI API not working, Ollama not working, LangChain Python not working, and Instructor not working.


FixDevs

Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.
