Fix: Replicate Not Working — Model Versions, Prediction Polling, Webhooks, and Cog Build
Quick Answer
How to fix Replicate API errors — model version ID required, prediction polling vs streaming, webhook signature verification, file inputs and HTTPS URLs, cold start latency, Cog deployment, and deployments vs predictions.
The Error
You call the Replicate API and get a version error:
HTTPError: 422 Client Error: Unprocessable Entity
{"detail": "version is required"}
Or prediction.output is None:
prediction = replicate.predictions.create(...)
print(prediction.output) # None
print(prediction.status) # "starting"
Or your webhook never fires:
[webhook handler] No predictions arriving...
Or cog build fails:
cog: error: building image failed: layer ... too large
Why This Happens
Replicate hosts ML models accessible via HTTP API. Most issues map to:
- Predictions are async. predictions.create returns immediately with a job in the starting state. You either poll, use webhooks, or call the convenience run() method (which polls internally).
- Model versions are required. A model URL like replicate/stable-diffusion is ambiguous; versions are commit-like IDs. Either pin a version or use the helper that picks the latest.
- File inputs need URLs or base64. Local file paths don’t work over HTTP. Either upload to your own storage and pass the URL, or base64-encode inline (size-limited).
- Cog (Replicate’s containerization tool) builds Docker images of your ML code. Big images, GPU dependencies, slow builds.
The mental model that helps: “your model is a Docker image, and a prediction is one short-lived container run.” cog push uploads the image to Replicate’s registry, version-tagged by a content hash. Each API call schedules a container, mounts the input, runs predict(), and ships the output back. Cold start is the time to pull the image and run setup(). Deployments differ in that they pin warm containers to keep cold-start latency low.
The version-ID system catches everyone the first time. Replicate models look like username/model-name, but the actual artifact is username/model-name:abcdef123... — a content-addressed identifier. Calling without a version uses “latest,” which can change. The 422 “version is required” error means the SDK couldn’t resolve a version. Predictions (shared hardware, per-second billing) and deployments (reserved hardware, uptime billing) look identical in code but bill very differently.
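To make the identifier format concrete, here is a minimal sketch that splits a reference into owner, model name, and optional version. `parse_model_ref` and `ModelRef` are hypothetical helpers written for illustration; they are not part of the replicate client.

```python
from typing import NamedTuple, Optional


class ModelRef(NamedTuple):
    owner: str
    name: str
    version: Optional[str]  # None means "whatever is latest right now"


def parse_model_ref(ref: str) -> ModelRef:
    """Split 'owner/name' or 'owner/name:version' into its parts."""
    path, _, version = ref.partition(":")
    owner, sep, name = path.partition("/")
    if not sep or not owner or not name:
        raise ValueError(f"expected 'owner/name[:version]', got {ref!r}")
    return ModelRef(owner, name, version or None)


pinned = parse_model_ref("stability-ai/stable-diffusion-3:abcdef0123456789")
floating = parse_model_ref("stability-ai/stable-diffusion-3")
print(pinned.version)    # abcdef0123456789
print(floating.version)  # None
```

A reference whose version is None is the kind that can change underneath you; logging the parsed version alongside each request makes that visible.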
Fix 1: Specify the Model Version
import replicate
# Use the model:version shorthand:
output = replicate.run(
    "stability-ai/stable-diffusion-3:abcdef0123456789",
    input={"prompt": "a cat on a roof", "width": 1024, "height": 1024},
)
print(output)
The reference stability-ai/stable-diffusion-3:abcdef0123456789 has the form username/model:version_id. The version ID is a hash of the deployed model.
To find the latest version:
model = replicate.models.get("stability-ai/stable-diffusion-3")
latest_version = model.latest_version.id
print(latest_version)
Or skip the version (uses model.latest_version):
output = replicate.run(
    "stability-ai/stable-diffusion-3",  # No version: uses latest
    input={"prompt": "..."},
)
For production, pin specific versions to avoid surprises:
STABLE_DIFFUSION_VERSION = "stability-ai/stable-diffusion-3:abcdef0123456789"
output = replicate.run(STABLE_DIFFUSION_VERSION, input={...})
For Node:
import Replicate from "replicate";
const replicate = new Replicate({ auth: process.env.REPLICATE_API_TOKEN });
const output = await replicate.run(
  "stability-ai/stable-diffusion-3:abcdef0123456789",
  { input: { prompt: "a cat" } },
);
Pro Tip: Pin versions per environment. Dev can track latest; production should pin so a model update doesn’t accidentally change your output.
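One way to pin per environment is a small lookup that refuses an unpinned production reference. This is only a sketch: `MODEL_REFS`, `model_ref_for`, `current_model_ref`, and the `APP_ENV` variable name are assumptions for illustration.

```python
import os

# Hypothetical mapping from environment name to model reference.
MODEL_REFS = {
    "development": "stability-ai/stable-diffusion-3",  # floats to latest
    "production": "stability-ai/stable-diffusion-3:abcdef0123456789",  # pinned
}


def model_ref_for(env: str) -> str:
    """Return the model reference for an environment, refusing unpinned production."""
    ref = MODEL_REFS[env]
    if env == "production" and ":" not in ref:
        raise ValueError("production must pin a version (owner/name:version)")
    return ref


def current_model_ref() -> str:
    """Pick the reference for the running environment (APP_ENV is an assumed name)."""
    return model_ref_for(os.environ.get("APP_ENV", "development"))
```

The result of `current_model_ref()` can be passed straight to `replicate.run(...)`. Failing loudly on an unpinned production entry turns a silent output change into a deploy-time error.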
Fix 2: run() vs Manual Polling
The run() method blocks until the prediction completes:
output = replicate.run(model_version, input={...})
# Returns the final output, polling internally.
For more control (e.g. show progress to users), call predictions.create and poll:
import time
prediction = replicate.predictions.create(
    version=model_version,  # The bare version ID (the part after the colon)
    input={"prompt": "..."},
)
while prediction.status not in ("succeeded", "failed", "canceled"):
    time.sleep(1)
    prediction.reload()
    print(prediction.status)  # starting → processing → succeeded
if prediction.status == "succeeded":
    print(prediction.output)
elif prediction.status == "failed":
    print(prediction.error)
For Node:
const prediction = await replicate.predictions.create({
  version: "abcdef0123456789",
  input: { prompt: "..." },
});
let status = prediction.status;
while (status === "starting" || status === "processing") {
  await new Promise((r) => setTimeout(r, 1000));
  const updated = await replicate.predictions.get(prediction.id);
  status = updated.status;
}
Common Mistake: Polling without backoff. Hammering the API every 100ms can hit rate limits. Use 1-second intervals or exponential backoff.
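The backoff advice can be sketched as a small client-agnostic helper. `wait_for_terminal` and `backoff_delays` are hypothetical names, and the 1-second start, doubling factor, and 15-second cap are assumed values rather than anything Replicate prescribes.

```python
import time
from typing import Callable, Iterator

TERMINAL = {"succeeded", "failed", "canceled"}


def backoff_delays(first: float = 1.0, factor: float = 2.0, cap: float = 15.0) -> Iterator[float]:
    """Yield delays that grow geometrically from `first` and never exceed `cap`."""
    delay = first
    while True:
        yield min(delay, cap)
        delay *= factor


def wait_for_terminal(
    get_status: Callable[[], str],
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() with growing delays until a terminal status or timeout."""
    waited = 0.0
    for delay in backoff_delays():
        status = get_status()
        if status in TERMINAL:
            return status
        if waited >= timeout:
            raise TimeoutError(f"still {status!r} after {waited:.0f}s")
        sleep(delay)
        waited += delay
    raise AssertionError("unreachable: backoff_delays() is infinite")
```

With the Python client, `get_status` could be a function that calls `prediction.reload()` and returns `prediction.status`. Injecting `sleep` keeps the helper testable without real waiting.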
Fix 3: Webhooks Instead of Polling
For predictions that take minutes, webhooks are cheaper than polling:
prediction = replicate.predictions.create(
    version=model_version,
    input={...},
    webhook="https://app.example.com/api/replicate-webhook",
    webhook_events_filter=["completed"],  # Or "start", "output", "logs", "completed"
)
The webhook fires when the prediction reaches the filtered states. completed covers every terminal state: succeeded, failed, and canceled.
Your handler:
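@app.post("/api/replicate-webhook")
async def handle_webhook(request: Request):
    body = await request.body()
    # Verify the signature (recommended) before trusting the payload:
    if not verify_signature(body, request.headers):
        return Response(status_code=401)
    payload = json.loads(body)
    if payload["status"] == "succeeded":
        await save_output(payload["id"], payload["output"])
    elif payload["status"] == "failed":
        await record_failure(payload["id"], payload["error"])
    return {"ok": True}
For signature verification, Replicate sends webhook-id, webhook-timestamp, and webhook-signature headers. The signature is a base64-encoded HMAC-SHA256 over "{id}.{timestamp}.{raw body}", keyed with the base64-decoded part of the signing secret after its whsec_ prefix. Retrieve the secret from the API’s webhooks/default/secret endpoint:
import base64
import hashlib
import hmac
import os
def verify_signature(body: bytes, headers) -> bool:
    # The secret looks like "whsec_<base64 key>"; only the part after the prefix is the key.
    secret = os.environ["REPLICATE_WEBHOOK_SECRET"]
    key = base64.b64decode(secret.split("_", 1)[1])
    prefix = f"{headers.get('webhook-id', '')}.{headers.get('webhook-timestamp', '')}."
    expected = base64.b64encode(hmac.new(key, prefix.encode() + body, hashlib.sha256).digest())
    # The header may hold several space-separated "v1,<signature>" entries.
    candidates = [s.split(",", 1)[-1] for s in headers.get("webhook-signature", "").split()]
    return any(hmac.compare_digest(expected, c.encode()) for c in candidates)
Pro Tip: Combine webhook with a polling fallback. Webhooks can fail (network blip, your app restart), so fall back to polling for predictions that have been pending too long.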
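A polling fallback needs some way to decide which predictions have waited too long for their webhook. The sketch below assumes you keep a mapping of prediction ID to creation time in your own store; `stale_prediction_ids` and the 10-minute threshold are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def stale_prediction_ids(
    pending: Dict[str, datetime],
    max_wait: timedelta = timedelta(minutes=10),
    now: Optional[datetime] = None,
) -> List[str]:
    """Return IDs whose webhook has not arrived within max_wait of creation."""
    now = now or datetime.now(timezone.utc)
    return [pid for pid, created in pending.items() if now - created > max_wait]
```

A periodic job could call this and then fetch each returned ID with `replicate.predictions.get(...)`, treating the result exactly as the webhook handler would.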
Fix 4: Streaming Outputs
For models that support streaming (LLMs, some image gen):
# Python:
for event in replicate.stream(
    "meta/meta-llama-3-70b-instruct",
    input={"prompt": "Tell me about Python"},
):
    print(event, end="", flush=True)
// Node:
for await (const event of replicate.stream("...", { input: {...} })) {
  process.stdout.write(event.toString());
}
stream() yields server-sent events as the model produces them, which is useful for chat UIs with token-by-token output.
Not all models support streaming. Check the model’s documentation under “API Examples.”
For SSE-based streaming via fetch directly:
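// Create the prediction; the JSON response carries the stream URL.
const created = await fetch("https://api.replicate.com/v1/predictions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.REPLICATE_API_TOKEN}`,
    "content-type": "application/json",
  },
  body: JSON.stringify({
    version: "...",
    input: {...},
    stream: true,
  }),
});
const prediction = await created.json();
// Connect to the SSE stream (urls.stream is present only if the model supports streaming).
const response = await fetch(prediction.urls.stream, {
  headers: { Accept: "text/event-stream" },
});
const reader = response.body.getReader();
const decoder = new TextDecoder();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  console.log(decoder.decode(value, { stream: true }));
}
Fix 5: File Inputs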
For inputs like images, audio, video:
Option A — public URL:
output = replicate.run(
    "ai-forever/kandinsky-2.2",
    input={"image": "https://example.com/cat.jpg"},
)
Replicate fetches the file from the URL, which must be HTTPS and publicly accessible.
Option B — base64 data URL:
import base64
with open("cat.jpg", "rb") as f:
    img_bytes = f.read()
data_url = f"data:image/jpeg;base64,{base64.b64encode(img_bytes).decode()}"
output = replicate.run(
    "ai-forever/kandinsky-2.2",
    input={"image": data_url},
)
Size-limited (typically 5-25 MB per input).
Option C — Replicate’s file upload helper:
output = replicate.run(
    "...",
    input={"image": open("cat.jpg", "rb")},
)
The Python client uploads the file to Replicate’s hosted storage and passes the URL.
For Node, use Buffer or stream:
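import { readFile } from "node:fs/promises";
// A Buffer is the safest choice here; a raw read stream may not be serialized as a file.
const image = await readFile("cat.jpg");
const output = await replicate.run("...", {
  input: { image },
});
The client uploads automatically.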
Common Mistake: Passing local file paths as strings. Replicate’s HTTP API has no access to your filesystem. Use one of the three patterns above.
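The choice between a URL and an inline data URL can be wrapped in one helper. This is a sketch under assumptions: `to_file_input`, `to_data_url`, and the 256 KB `MAX_INLINE_BYTES` cutoff are illustrative, so adjust the limit to whatever you actually observe.

```python
import base64
import mimetypes
from pathlib import Path

# Assumed cutoff for inlining; not an official limit.
MAX_INLINE_BYTES = 256 * 1024


def to_data_url(data: bytes, filename: str) -> str:
    """Encode raw bytes as a data URL, guessing the MIME type from the file name."""
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    return f"data:{mime};base64,{base64.b64encode(data).decode('ascii')}"


def to_file_input(source: str) -> str:
    """Return a value usable as a file input: an HTTPS URL or a data URL."""
    if source.startswith("https://"):
        return source  # Already fetchable over HTTPS.
    if source.startswith("http://"):
        raise ValueError("use an HTTPS URL")
    path = Path(source)
    data = path.read_bytes()
    if len(data) > MAX_INLINE_BYTES:
        raise ValueError(f"{path.name} is too large to inline; upload it and pass a URL")
    return to_data_url(data, path.name)
```

Calling `to_file_input("cat.jpg")` before `replicate.run(...)` means a bare local path never reaches the API as a plain string.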
Fix 6: Deployments for Lower Cold Starts
A “prediction” runs on shared infrastructure with potential cold starts. A “deployment” is a pinned model with reserved hardware — no cold starts, predictable cost.
In the Replicate Dashboard → Deployments → New deployment:
Model: my-username/my-model
Version: abcdef0123456789
Min instances: 1
Max instances: 10
Hardware: A100 (80GB)
Then call via the deployment endpoint:
deployment = replicate.deployments.get("my-username/production-deploy")
prediction = deployment.predictions.create(input={...})
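prediction.wait()
print(prediction.output)
Don’t pass a deployment name to replicate.run(). It resolves its argument as a model reference (owner/name or owner/name:version), so this call looks for a model named production-deploy rather than the deployment:
# Wrong: the name is resolved as a model, not a deployment.
output = replicate.run(
    "my-username/production-deploy",
    input={"prompt": "..."},
)
Address deployments through replicate.deployments.get(...) as above; the version is implicit in the deployment. Deployments cost the reserved hardware’s hourly rate, even when idle. For sporadic traffic, predictions are cheaper. For latency-sensitive endpoints, deployments win.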
Pro Tip: Use min_instances: 0 for cost — but expect cold-start latency on the first request after idle. For 24/7 readiness, min_instances: 1 with the smallest hardware tier.
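The predictions-versus-deployments trade-off comes down to arithmetic on busy seconds versus reserved seconds. The sketch below assumes, for simplicity, that both modes bill the same per-second hardware price; the function names and the example price are hypothetical, so substitute the real rates from your account.

```python
from typing import Tuple


def monthly_costs(
    predictions_per_month: int,
    seconds_per_prediction: float,
    price_per_second: float,
    always_on_instances: int = 1,
    hours_per_month: float = 730.0,
) -> Tuple[float, float]:
    """Return (on_demand_cost, reserved_cost) for one month."""
    on_demand = predictions_per_month * seconds_per_prediction * price_per_second
    reserved = always_on_instances * hours_per_month * 3600 * price_per_second
    return on_demand, reserved


def break_even_predictions(
    seconds_per_prediction: float,
    always_on_instances: int = 1,
    hours_per_month: float = 730.0,
) -> float:
    """Monthly prediction count at which on-demand cost equals reserved cost."""
    return always_on_instances * hours_per_month * 3600 / seconds_per_prediction


# Example with a made-up price of $0.001 per second.
on_demand, reserved = monthly_costs(10_000, 5.0, 0.001)
print(f"on-demand ${on_demand:.2f} vs reserved ${reserved:.2f}")
```

Under these assumptions a single always-on instance only pays off on cost once it is busy nearly all month, which is why the latency benefit, not price, is usually the reason to reserve.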
Fix 7: Building Custom Models With Cog
Cog packages your model as a Docker image Replicate can run.
cog.yaml:
build:
  gpu: true
  python_version: "3.11"
  python_packages:
    - "torch==2.1.0"
    - "transformers==4.40.0"
    - "diffusers==0.27.0"
  system_packages:
    - "ffmpeg"
predict: "predict.py:Predictor"
predict.py:
from cog import BasePredictor, Input, Path
class Predictor(BasePredictor):
    def setup(self):
        """Loaded once at startup. Slow loads (model weights) go here."""
        from diffusers import StableDiffusionPipeline
        self.pipe = StableDiffusionPipeline.from_pretrained("...")
        self.pipe.to("cuda")
    def predict(
        self,
        prompt: str = Input(description="Prompt for generation"),
        steps: int = Input(default=50, ge=1, le=100),
    ) -> Path:
        image = self.pipe(prompt, num_inference_steps=steps).images[0]
        output_path = Path("/tmp/output.png")
        image.save(output_path)
        return output_path
Build and test locally:
cog build
cog predict -i prompt="a cat on a roof"
Push to Replicate:
cog login
cog push r8.im/my-username/my-model
Now my-username/my-model is callable via the API.
Common Mistake: Loading weights in predict() instead of setup(). Every prediction reloads — slow. Put expensive init in setup(); it runs once when the container starts.
For big models, mount weights from a Cloudflare R2 / S3 bucket at runtime instead of baking into the image — keeps the image smaller and rebuilds faster.
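Fetching weights at runtime usually means a download in setup() that is skipped when the file is already present. The helper below is a minimal sketch; `ensure_weights` is a hypothetical name, and the injectable `fetch` parameter exists only so the logic can be exercised without a network.

```python
import urllib.request
from pathlib import Path
from typing import Callable


def ensure_weights(
    url: str,
    dest: Path,
    fetch: Callable[[str, Path], object] = urllib.request.urlretrieve,
) -> Path:
    """Download url to dest unless dest already exists; return dest."""
    if dest.exists():
        return dest  # Already downloaded earlier in this container's lifetime.
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_name(dest.name + ".part")
    fetch(url, tmp)    # Write to a temporary name first.
    tmp.replace(dest)  # Rename so a crash never leaves a partial file at dest.
    return dest
```

Inside setup(), a call such as `ensure_weights(weights_url, Path("/src/weights/model.bin"))` (both values assumed) would run before loading the pipeline, keeping the download out of predict().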
Fix 8: Rate Limits and Errors
Replicate’s API has rate limits per token:
- Free tier: limited concurrent predictions.
- Paid: higher concurrency.
Common errors and handling:
import time
import replicate
from replicate.exceptions import ModelError, ReplicateError
try:
    output = replicate.run("...", input={...})
except ModelError as e:
    # Model itself errored (e.g. invalid prompt, OOM).
    print("Model error:", e)
except ReplicateError as e:
    # API error (rate limit, auth, network).
    if e.status == 429:
        time.sleep(60)
        # Retry
    elif e.status == 401:
        # Bad token
        ...
For retries with exponential backoff:
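import time
import replicate
from replicate.exceptions import ReplicateError
for attempt in range(5):
    try:
        output = replicate.run("...", input={...})
        break
    except ReplicateError as e:
        if e.status in (429, 502, 503):
            time.sleep(2 ** attempt)  # 1, 2, 4, 8, 16 seconds
            continue
        raise
else:
    # Every attempt hit a retryable error; fail instead of leaving output undefined.
    raise RuntimeError("still failing after 5 attempts")
For production traffic, queue requests on your side (BullMQ, Sidekiq, etc.) and pull at a rate Replicate can handle.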
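If a full job queue is more than you need, a simple in-process throttle illustrates the same idea of pulling at a fixed rate. `Throttle` is a hypothetical class for illustration; the injectable clock and sleep functions are there for testing, and it is not safe to share across threads as written.

```python
import time
from typing import Callable


class Throttle:
    """Allow at most `rate` calls per second by spacing them evenly."""

    def __init__(
        self,
        rate: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.interval = 1.0 / rate
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = clock()

    def wait(self) -> None:
        """Block until the next call is allowed, then reserve the following slot."""
        now = self.clock()
        if now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.interval
```

A worker loop would call `throttle.wait()` immediately before each `replicate.run(...)`, with the rate set below whatever limit applies to your token.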
Replicate vs Modal vs HuggingFace Inference vs Banana vs RunPod vs Cog
Serverless GPU inference is a crowded category. Each platform makes different trade-offs around cold start, pricing, model packaging, and how much “DevOps” you’re expected to do.
Replicate. Models packaged as Cog containers. HTTP API, predictions and deployments, webhooks, streaming for LLMs. The “marketplace” angle — thousands of community models you can call without uploading anything — is the main differentiator. Best when you want to consume someone else’s model with a one-liner. Weakness: cold starts on shared hardware can run 10–60 seconds for large models; per-second pricing adds up if you forget to switch to deployments.
Modal. Python-first serverless. You put an @app.function() decorator on a Python function, and Modal handles the container, scaling, and GPU allocation. No Dockerfile needed for most cases. Best when you write your own inference code and want it to “just work” on cloud GPUs. Stronger for custom training/batch jobs than for serving a single model endpoint. Weakness: Python-only, not a model marketplace.
Hugging Face Inference Endpoints. Pick a model from the Hub, click “Deploy,” pay for a fixed instance size. Great UX for serving anything Transformers-compatible. Best when your model is already on the Hub and you want the simplest possible path to an HTTPS endpoint. Weakness: pricing is per-hour (not per-second), so idle endpoints cost money; less flexibility for custom inference code.
Banana. Serverless model deployment with a focus on cold-start minimization. Its custom Potassium framework wraps your code. Best for latency-sensitive but bursty traffic, since Banana’s product positioning is around fast cold starts. Weakness: smaller ecosystem, less proven at very high scale.
RunPod. GPU rental, including serverless endpoints and persistent pods. Closer to “raw GPU access” — bring your own Docker image, you pick the GPU type, you pay by the second. Best when you want full control over the runtime and the cheapest possible GPU-hour rate. Weakness: more DIY than Replicate or Modal; expect to write your own container and handler.
Cog (as a standalone tool). Cog is the open-source containerization format Replicate uses. You can cog build and run the resulting image anywhere — your own Kubernetes, EC2, or another platform. Best when you want Replicate-style ergonomics but self-hosted inference. Weakness: you’re now operating GPU infrastructure yourself.
Honorable mentions. AWS SageMaker, Google Vertex AI, Lambda Labs (GPU rental), Together AI, Fireworks (LLM-only). Cloudflare Workers AI for small models at the edge.
Practical mapping: existing models → Replicate. Custom Python inference → Modal. Hugging Face Hub models → HF Endpoints. Tightest latency on bursty traffic → Banana or Replicate Deployments. Cheapest GPU-hour → RunPod. Self-hosted → Cog. The biggest cross-platform gotcha: “model URL” means different things on each platform, but the underlying primitive (an HTTPS POST returning a prediction or job ID) is roughly the same.
Still Not Working?
A few less-obvious failures:
- No webhook events received. Replicate sends webhooks at specific lifecycle moments. Check webhook_events_filter. Also verify your webhook endpoint is HTTPS and publicly accessible (no localhost).
- Output is a URL, not the data. Image/audio/video outputs are URLs to Replicate-hosted files. Download to your storage if you need long-term retention, because Replicate’s hosted files may expire.
- File too large for image upload. ~5-10 MB limit on inputs via base64. Use a public URL for larger files.
- Prediction timed out. Default timeout is per-model. For long-running predictions, check the model’s predict_timeout in cog.yaml.
- Cog build slow. Each build pushes the full image. Use cog build --use-cuda-base-image and pin dependencies for caching.
- Webhook signatures don’t match. Replicate uses a specific signature format. Compute the HMAC over the raw request body bytes, not re-serialized JSON, and use compare_digest for timing-safe comparison. Check that the secret in your environment is the one Replicate issued.
- Streaming events out of order. SSE is in-order at the network level but client parsing may buffer. Use a proper SSE parser.
- Predictions on shared infrastructure are slow. Cold start. Move the model to a deployment with min_instances >= 1 for predictable latency.
- cog push succeeds but the new version doesn’t appear in the API. Replicate processes pushed images asynchronously. Check Dashboard → Versions; if the new version is “Building” or “Failed” there, your local cog push finished but the server-side conversion hasn’t. Failed builds often relate to CUDA version mismatches between Cog and your requirements.txt.
- Output URLs return 404 after a few hours. Replicate-hosted prediction outputs expire, and for API-created predictions the retention window is short (about an hour according to Replicate’s data-retention docs at the time of writing). If you need persistent assets, download the URL immediately after the prediction succeeds and store it in your own S3/R2 bucket. Don’t link to Replicate URLs from production UIs.
- Streaming hangs at the first token on certain models. Some LLM models on Replicate require stream=true in the input and the SSE streaming endpoint. Calling replicate.stream() against a model that only supports batch returns nothing forever. Check the model’s openapi_schema for x-stream: true on the output.
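Because hosted output URLs expire, it helps to copy each output somewhere durable as soon as a prediction succeeds. The sketch below saves to a local directory; `archive_output` and `local_name_for` are hypothetical helpers, and uploading the saved file to S3 or R2 is left to your own storage client.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def local_name_for(prediction_id: str, output_url: str) -> str:
    """Build a stable file name from the prediction ID and the URL's extension."""
    suffix = Path(urlparse(output_url).path).suffix or ".bin"
    return f"{prediction_id}{suffix}"


def archive_output(prediction_id: str, output_url: str, dest_dir: Path) -> Path:
    """Stream the output file into dest_dir and return the local path."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / local_name_for(prediction_id, output_url)
    with urllib.request.urlopen(output_url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)  # Copy in chunks; avoids loading it all in memory.
    return target
```

Naming the file after the prediction ID keeps the archive idempotent: running it twice for the same prediction overwrites rather than duplicates.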
For related ML inference and serving issues, see Modal not working, Cloudflare Workers AI not working, vLLM not working, and HuggingFace Transformers not working.
Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.