Fix: Replicate Not Working — Model Versions, Prediction Polling, Webhooks, and Cog Build

FixDevs

Quick Answer

How to fix Replicate API errors — model version ID required, prediction polling vs streaming, webhook signature verification, file inputs and HTTPS URLs, cold start latency, Cog deployment, and deployments vs predictions.

The Error

You call the Replicate API and get a version error:

HTTPError: 422 Client Error: Unprocessable Entity
{"detail": "version is required"}

Or prediction.output is None:

prediction = replicate.predictions.create(...)
print(prediction.output)  # None
print(prediction.status)  # "starting"

Or your webhook never fires:

[webhook handler] No predictions arriving...

Or cog build fails:

cog: error: building image failed: layer ... too large

Why This Happens

Replicate hosts ML models accessible via HTTP API. Most issues map to:

  • Predictions are async. predictions.create returns immediately with a job in starting state. You either poll, use webhooks, or call the convenience run() method (which polls internally).
  • Model versions are required. A model URL like replicate/stable-diffusion is ambiguous — versions are commit-like IDs. Either pin a version or use the helper that picks the latest.
  • File inputs need URLs or base64. Local file paths don’t work over HTTP. Either upload to your own storage and pass the URL, or base64-encode inline (size-limited).
  • Cog (Replicate’s containerization tool) builds Docker images of your ML code. Large images and GPU dependencies make builds slow, and oversized layers can fail the build outright.

The mental model that helps: “your model is a Docker image, and a prediction is one short-lived container run.” cog push uploads the image to Replicate’s registry, version-tagged by a content hash. Each API call schedules a container, mounts the input, runs predict(), and ships the output back. Cold start is the time to pull the image and run setup(). Deployments differ in that they pin warm containers to keep cold-start latency low.

The version-ID system catches everyone the first time. Replicate models look like username/model-name, but the actual artifact is username/model-name:abcdef123... — a content-addressed identifier. Calling without a version uses “latest,” which can change. The 422 “version is required” error means the SDK couldn’t resolve a version. Predictions (shared hardware, per-second billing) and deployments (reserved hardware, uptime billing) look identical in code but bill very differently.
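
To make the reference format concrete, here is a minimal sketch that splits a reference into owner, name, and optional version. The names parse_model_ref and ModelRef are illustrative helpers and are not part of the Replicate client; the version hash is the placeholder used throughout this article.

```python
from typing import NamedTuple, Optional


class ModelRef(NamedTuple):
    owner: str
    name: str
    version: Optional[str]  # None means the reference is unpinned


def parse_model_ref(ref: str) -> ModelRef:
    """Split 'owner/name' or 'owner/name:version' into its parts."""
    path, _, version = ref.partition(":")
    owner, sep, name = path.partition("/")
    if not sep or not owner or not name or "/" in name:
        raise ValueError(f"expected 'owner/name[:version]', got {ref!r}")
    return ModelRef(owner, name, version or None)


pinned = parse_model_ref("stability-ai/stable-diffusion-3:abcdef0123456789")
unpinned = parse_model_ref("stability-ai/stable-diffusion-3")
print(pinned.version)    # abcdef0123456789
print(unpinned.version)  # None
```

A check like this is handy at startup: if version is None in an environment where you expect a pinned model, you can fail fast instead of silently tracking whatever is latest.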

Fix 1: Specify the Model Version

import replicate

# Use the model:version shorthand:
output = replicate.run(
    "stability-ai/stable-diffusion-3:abcdef0123456789",
    input={"prompt": "a cat on a roof", "width": 1024, "height": 1024},
)
print(output)

The reference stability-ai/stable-diffusion-3:abcdef0123456789 follows the format username/model:version_id. The version ID is a hash of the deployed model.

To find the latest version:

model = replicate.models.get("stability-ai/stable-diffusion-3")
latest_version = model.latest_version.id
print(latest_version)

Or skip the version (uses model.latest_version):

output = replicate.run(
    "stability-ai/stable-diffusion-3",  # No version — uses latest
    input={"prompt": "..."},
)

For production, pin specific versions to avoid surprises:

STABLE_DIFFUSION_VERSION = "stability-ai/stable-diffusion-3:abcdef0123456789"

output = replicate.run(STABLE_DIFFUSION_VERSION, input={...})

For Node:

import Replicate from "replicate";

const replicate = new Replicate({ auth: process.env.REPLICATE_API_TOKEN });

const output = await replicate.run(
  "stability-ai/stable-diffusion-3:abcdef0123456789",
  { input: { prompt: "a cat" } },
);

Pro Tip: Pin versions per environment. Dev can track latest; production should pin so a model update doesn’t accidentally change your output.
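
One way to enforce per-environment pinning is a small lookup that refuses an unpinned reference in production. This is a sketch under assumptions: the MODEL_REFS table, the environment names, and the model_ref_for helper are hypothetical, and the version hash is the article's placeholder.

```python
# Hypothetical per-environment table. The hash is a placeholder,
# not a real Replicate version ID.
MODEL_REFS = {
    "development": "stability-ai/stable-diffusion-3",
    "production": "stability-ai/stable-diffusion-3:abcdef0123456789",
}


def model_ref_for(env: str) -> str:
    """Return the model reference for env; production must be pinned."""
    ref = MODEL_REFS[env]
    if env == "production" and ":" not in ref:
        raise RuntimeError(f"production must pin a version, got {ref!r}")
    return ref


print(model_ref_for("development"))  # stability-ai/stable-diffusion-3
print(model_ref_for("production"))   # stability-ai/stable-diffusion-3:abcdef0123456789
```

The returned string can be passed straight to replicate.run(). Keeping the table in one place makes a version bump a one-line, reviewable change.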

Fix 2: run() vs Manual Polling

The run() method blocks until the prediction completes:

output = replicate.run(model_version, input={...})
# Returns the final output, polling internally.

For more control (e.g. show progress to users), call predictions.create and poll:

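import time

import replicate

# Here model_version is the bare version ID (the part after the colon).
prediction = replicate.predictions.create(
    version=model_version,
    input={"prompt": "..."},
)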

while prediction.status not in ("succeeded", "failed", "canceled"):
    time.sleep(1)
    prediction.reload()
    print(prediction.status)  # starting → processing → succeeded

if prediction.status == "succeeded":
    print(prediction.output)
elif prediction.status == "failed":
    print(prediction.error)

For Node:

let prediction = await replicate.predictions.create({
  version: "abcdef0123456789",
  input: { prompt: "..." },
});

// Poll until the prediction leaves the in-progress states.
while (prediction.status === "starting" || prediction.status === "processing") {
  await new Promise((r) => setTimeout(r, 1000));
  prediction = await replicate.predictions.get(prediction.id);
}

// Keep the refreshed object so the final output (or error) is available.
console.log(prediction.status, prediction.output ?? prediction.error);

Common Mistake: Polling without backoff. Hammering the API every 100ms can hit rate limits. Use 1-second intervals or exponential backoff.
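
A minimal sketch of polling with exponential backoff follows. It is deliberately independent of the Replicate client: fetch_status is any callable that returns the current status string (with the Python client it could call prediction.reload() and return prediction.status). The function names, the 30-second cap, and the 600-second timeout are assumptions chosen for illustration.

```python
import time
from typing import Callable, Iterator

TERMINAL = {"succeeded", "failed", "canceled"}


def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   cap: float = 30.0) -> Iterator[float]:
    """Yield 1, 2, 4, ... seconds, never exceeding cap."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= factor


def poll_until_done(fetch_status: Callable[[], str],
                    timeout: float = 600.0,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll with growing pauses until a terminal status or the timeout."""
    deadline = time.monotonic() + timeout
    for delay in backoff_delays():
        status = fetch_status()
        if status in TERMINAL:
            return status
        if time.monotonic() + delay > deadline:
            raise TimeoutError(f"still {status!r} after {timeout}s")
        sleep(delay)
    raise AssertionError("unreachable: backoff_delays() is infinite")
```

Passing sleep as a parameter keeps the function testable without real waiting, and the cap stops the interval from growing so large that a finished prediction sits unnoticed.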

Fix 3: Webhooks Instead of Polling

For predictions that take minutes, webhooks are cheaper than polling:

prediction = replicate.predictions.create(
    version=model_version,
    input={...},
    webhook="https://app.example.com/api/replicate-webhook",
    webhook_events_filter=["completed"],  # Or "start", "output", "logs", "completed"
)

The webhook fires when the prediction reaches the filtered states. completed includes both succeeded and failed.

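Your handler (FastAPI-style):

import json

from fastapi import FastAPI, Request, Response

app = FastAPI()

@app.post("/api/replicate-webhook")
async def handle_webhook(request: Request):
    body = await request.body()  # Raw bytes; verify before parsing.

    # Verify signature (recommended):
    if not verify_signature(body, request.headers):
        return Response(status_code=401)

    payload = json.loads(body)
    if payload["status"] == "succeeded":
        await save_output(payload["id"], payload["output"])
    elif payload["status"] == "failed":
        await record_failure(payload["id"], payload["error"])

    return {"ok": True}

For signature verification, Replicate sends webhook-id, webhook-timestamp, and webhook-signature headers. The signature is a base64-encoded HMAC-SHA256 over "<webhook-id>.<webhook-timestamp>.<raw body>", keyed with your webhook signing secret (a whsec_-prefixed, base64-encoded value). The version below follows that scheme; confirm the details, including where to fetch the secret, against Replicate's current webhook documentation:

import base64
import hashlib
import hmac
import os

def verify_signature(body: bytes, headers) -> bool:
    webhook_id = headers.get("webhook-id")
    timestamp = headers.get("webhook-timestamp")
    signature_header = headers.get("webhook-signature")
    if not (webhook_id and timestamp and signature_header):
        return False

    # The key is the base64 part after the "whsec_" prefix.
    secret = os.environ["REPLICATE_WEBHOOK_SECRET"]
    key = base64.b64decode(secret.removeprefix("whsec_"))

    signed = f"{webhook_id}.{timestamp}.".encode() + body
    expected = base64.b64encode(hmac.new(key, signed, hashlib.sha256).digest()).decode()

    # The header may hold several space-separated "v1,<signature>" entries.
    candidates = [s.split(",", 1)[1] for s in signature_header.split() if "," in s]
    return any(hmac.compare_digest(expected, c) for c in candidates)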

Pro Tip: Combine webhook with a polling fallback. Webhooks can fail (network blip, your app restart) — fall back to polling for predictions that have been pending too long.
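
The fallback can be as simple as tracking when each prediction was created and sweeping for ones whose webhook is overdue. The sketch below assumes you keep a mapping of prediction ID to creation time (epoch seconds) and remove entries when the webhook arrives; stale_prediction_ids and the 300-second threshold are illustrative choices, not Replicate features.

```python
import time
from typing import Dict, List, Optional


def stale_prediction_ids(pending: Dict[str, float], max_wait: float,
                         now: Optional[float] = None) -> List[str]:
    """Return IDs whose webhook has not arrived within max_wait seconds."""
    current = time.time() if now is None else now
    return sorted(pid for pid, created in pending.items()
                  if current - created > max_wait)


# Hypothetical bookkeeping: prediction ID -> creation time.
pending = {"pred-a": 100.0, "pred-b": 400.0, "pred-c": 0.0}
print(stale_prediction_ids(pending, max_wait=300, now=500.0))
# ['pred-a', 'pred-c']
```

A periodic job can feed the returned IDs to predictions.get and process the result exactly as the webhook handler would, so both paths converge on the same code.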

Fix 4: Streaming Outputs

For models that support streaming (LLMs, some image gen):

# Python:
for event in replicate.stream(
    "meta/meta-llama-3-70b-instruct",
    input={"prompt": "Tell me about Python"},
):
    print(event, end="", flush=True)

// Node:
for await (const event of replicate.stream("...", { input: {...} })) {
  process.stdout.write(event.toString());
}

stream() yields server-sent events as the model produces them — useful for chat UIs with token-by-token output.

Not all models support streaming. Check the model’s documentation under “API Examples.”

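For SSE-based streaming via fetch directly, create the prediction first, then read the event stream from the URL returned in urls.stream:

const created = await fetch("https://api.replicate.com/v1/predictions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.REPLICATE_API_TOKEN}`,
    "content-type": "application/json",
  },
  body: JSON.stringify({
    version: "...",
    input: {...},
    stream: true,
  }),
});
const prediction = await created.json();

// The POST response is plain JSON; the tokens arrive on a separate SSE URL.
const response = await fetch(prediction.urls.stream, {
  headers: { Accept: "text/event-stream" },
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // { stream: true } keeps multi-byte characters split across chunks intact.
  console.log(decoder.decode(value, { stream: true }));
}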
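
What comes off the wire is SSE-formatted text, not bare tokens, so it needs parsing before display. Below is a minimal Python sketch of an SSE parser that works on already-decoded lines. It is simplified relative to the full SSE specification (it ignores the id and retry fields), and the event names in the sample are illustrative rather than a guaranteed list of what a given model emits.

```python
from typing import Dict, Iterable, Iterator, List


def parse_sse(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield {'event': ..., 'data': ...} for each event in a line stream."""
    name = "message"          # default event type when none is given
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:          # a blank line terminates the current event
            if data:
                yield {"event": name, "data": "\n".join(data)}
            name, data = "message", []
            continue
        if line.startswith(":"):
            continue          # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # one optional space after the colon
        if field == "event":
            name = value
        elif field == "data":
            data.append(value)  # multiple data lines join with newlines


sample = [
    "event: output\n", "id: 1\n", "data: Hello\n", "\n",
    ": keep-alive\n",
    "event: output\n", "data: wor\n", "data: ld\n", "\n",
    "event: done\n", "data: {}\n", "\n",
]
events = list(parse_sse(sample))
print(events[0])  # {'event': 'output', 'data': 'Hello'}
```

Network chunks do not respect line boundaries, so buffer incoming text and split it into complete lines before handing it to a parser like this one.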

Fix 5: File Inputs

For inputs like images, audio, video:

Option A — public URL:

output = replicate.run(
    "ai-forever/kandinsky-2.2",
    input={"image": "https://example.com/cat.jpg"},
)

Replicate fetches from the URL. Must be HTTPS and publicly accessible.

Option B — base64 data URL:

import base64

with open("cat.jpg", "rb") as f:
    img_bytes = f.read()

data_url = f"data:image/jpeg;base64,{base64.b64encode(img_bytes).decode()}"

output = replicate.run(
    "ai-forever/kandinsky-2.2",
    input={"image": data_url},
)

Size-limited (roughly 5-10 MB per input). Base64 also inflates the payload by about a third, so prefer a URL for larger files.

Option C — Replicate’s file upload helper:

output = replicate.run(
    "...",
    input={"image": open("cat.jpg", "rb")},
)

The Python client uploads the file to Replicate’s hosted storage and passes the URL.

For Node, use Buffer or stream:

import fs from "node:fs";

const output = await replicate.run("...", {
  input: { image: fs.createReadStream("cat.jpg") },
});

The client uploads automatically.

Common Mistake: Passing local file paths as strings. Replicate’s HTTP API has no access to your filesystem. Use one of the three patterns above.
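
For the base64 route, a small helper can guess the MIME type and refuse files that are too large to inline. This is a sketch: to_data_url is a hypothetical helper, and the 5 MB default is an assumed budget taken from the low end of the range mentioned in this article, not a documented Replicate limit.

```python
import base64
import mimetypes
from pathlib import Path

MAX_INLINE_BYTES = 5 * 1024 * 1024  # assumed budget, not an official limit


def to_data_url(path: str, max_bytes: int = MAX_INLINE_BYTES) -> str:
    """Encode a local file as a data URL, rejecting oversized files."""
    file_path = Path(path)
    size = file_path.stat().st_size
    if size > max_bytes:
        raise ValueError(
            f"{file_path.name} is {size} bytes; pass a URL instead of inlining"
        )
    # Fall back to a generic type when the extension is unknown.
    mime = mimetypes.guess_type(file_path.name)[0] or "application/octet-stream"
    encoded = base64.b64encode(file_path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

Usage mirrors Option B: pass to_data_url("cat.jpg") as the input value. Failing early on size gives a clearer error than a rejected request.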

Fix 6: Deployments for Lower Cold Starts

A “prediction” runs on shared infrastructure with potential cold starts. A “deployment” is a pinned model version on reserved hardware: no cold starts as long as at least one instance stays warm, and a predictable cost.

In the Replicate Dashboard → Deployments → New deployment:

Model:        my-username/my-model
Version:      abcdef0123456789
Min instances: 1
Max instances: 10
Hardware:     A100 (80GB)

Then call via the deployment endpoint:

deployment = replicate.deployments.get("my-username/production-deploy")
prediction = deployment.predictions.create(input={...})
prediction.wait()
print(prediction.output)

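Or call the deployment's HTTP endpoint directly. Note that replicate.run() resolves its first argument as a model, not a deployment, so it is not the right call here:

import os

import requests

resp = requests.post(
    "https://api.replicate.com/v1/deployments/my-username/production-deploy/predictions",
    headers={"Authorization": f"Bearer {os.environ['REPLICATE_API_TOKEN']}"},
    json={"input": {"prompt": "..."}},  # Version is implicit in the deployment.
    timeout=30,
)
resp.raise_for_status()
prediction = resp.json()  # Poll prediction["urls"]["get"] until it finishes.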

Deployments cost the reserved hardware’s hourly rate, even when idle. For sporadic traffic, predictions are cheaper. For latency-sensitive endpoints, deployments win.
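
The trade-off reduces to simple arithmetic: reserved hardware costs a fixed amount per hour, while per-second billing scales with how many predictions you run. The sketch below computes the break-even volume. All prices are hypothetical placeholders, not Replicate's published rates, and the model ignores cold-start time and autoscaling.

```python
def break_even_predictions(hourly_rate: float, per_second_rate: float,
                           seconds_per_prediction: float,
                           hours: float = 1.0) -> float:
    """Predictions per period at which both billing models cost the same."""
    if per_second_rate <= 0 or seconds_per_prediction <= 0:
        raise ValueError("rates and durations must be positive")
    reserved_cost = hourly_rate * hours                      # paid even when idle
    cost_per_prediction = per_second_rate * seconds_per_prediction
    return reserved_cost / cost_per_prediction


# Hypothetical numbers: $3.60/hour reserved vs $0.001/second shared,
# with each prediction taking 10 seconds.
volume = break_even_predictions(hourly_rate=3.60, per_second_rate=0.001,
                                seconds_per_prediction=10)
print(round(volume))  # 360
```

Under these assumed prices, sustained traffic below about 360 predictions per hour favors per-second predictions and anything above favors the deployment. Plug in the actual rates for your hardware tier before deciding.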

Pro Tip: Use min_instances: 0 for cost — but expect cold-start latency on the first request after idle. For 24/7 readiness, min_instances: 1 with the smallest hardware tier.

Fix 7: Building Custom Models With Cog

Cog packages your model as a Docker image Replicate can run.

cog.yaml:

build:
  gpu: true
  python_version: "3.11"
  python_packages:
    - "torch==2.1.0"
    - "transformers==4.40.0"
    - "diffusers==0.27.0"
  system_packages:
    - "ffmpeg"

predict: "predict.py:Predictor"

predict.py:

from cog import BasePredictor, Input, Path

class Predictor(BasePredictor):
    def setup(self):
        """Loaded once at startup. Slow loads (model weights) go here."""
        from diffusers import StableDiffusionPipeline
        self.pipe = StableDiffusionPipeline.from_pretrained("...")
        self.pipe.to("cuda")
    
    def predict(
        self,
        prompt: str = Input(description="Prompt for generation"),
        steps: int = Input(default=50, ge=1, le=100),
    ) -> Path:
        image = self.pipe(prompt, num_inference_steps=steps).images[0]
        output_path = Path("/tmp/output.png")
        image.save(output_path)
        return output_path

Build and test locally:

cog build
cog predict -i prompt="a cat on a roof"

Push to Replicate:

cog login
cog push r8.im/my-username/my-model

Now my-username/my-model is callable via the API.

Common Mistake: Loading weights in predict() instead of setup(). Every prediction reloads — slow. Put expensive init in setup(); it runs once when the container starts.
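
The effect is easy to see in a stand-in that counts loads. This sketch does not import cog; SketchPredictor and its fake weights are purely illustrative, and the driver code at the bottom imitates what the container runtime does (one setup() call, then many predict() calls).

```python
class SketchPredictor:
    """Stand-in for a Cog predictor; mirrors the setup/predict split."""

    def __init__(self) -> None:
        self.load_count = 0
        self.weights = None

    def _load_weights(self) -> dict:
        # Placeholder for an expensive load such as from_pretrained(...).
        self.load_count += 1
        return {"layer0": [0.1, 0.2]}

    def setup(self) -> None:
        # Runs once when the container starts.
        self.weights = self._load_weights()

    def predict(self, prompt: str) -> str:
        # Reuses self.weights; nothing is reloaded per request.
        return f"{prompt} ({len(self.weights)} weight group in memory)"


predictor = SketchPredictor()
predictor.setup()
outputs = [predictor.predict(p) for p in ("a cat", "a dog", "a roof")]
print(predictor.load_count)  # 1
```

If _load_weights() were called inside predict() instead, the counter would read 3 here, and in a real model each of those loads could take seconds to minutes.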

For big models, mount weights from a Cloudflare R2 / S3 bucket at runtime instead of baking into the image — keeps the image smaller and rebuilds faster.

Fix 8: Rate Limits and Errors

Replicate’s API has rate limits per token:

  • Free tier: limited concurrent predictions.
  • Paid: higher concurrency.

Common errors and handling:

import time

import replicate
from replicate.exceptions import ModelError, ReplicateError

try:
    output = replicate.run("...", input={...})
except ModelError as e:
    # Model itself errored (e.g. invalid prompt, OOM).
    print("Model error:", e)
except ReplicateError as e:
    # API error (rate limit, auth, network).
    if e.status == 429:
        time.sleep(60)
        # Retry
    elif e.status == 401:
        # Bad token
        ...

For retries with exponential backoff:

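import time

import replicate
from replicate.exceptions import ReplicateError

for attempt in range(5):
    try:
        output = replicate.run("...", input={...})
        break
    except ReplicateError as e:
        if e.status in (429, 502, 503):
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, 8s, 16s
            continue
        raise
else:
    # All five attempts hit a retryable error.
    raise RuntimeError("Replicate call still failing after 5 attempts")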

For production traffic, queue requests on your side (BullMQ, Sidekiq, etc.) and pull at a rate Replicate can handle.
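
If a full job queue is more than you need, an in-process cap on concurrent calls captures the same idea. This sketch uses a thread pool whose size is the concurrency limit; run_throttled is a hypothetical helper, run_one stands in for whatever function calls Replicate, and the limit of 4 is arbitrary. Unlike BullMQ or Sidekiq, nothing here survives a process restart.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def run_throttled(jobs: Iterable[dict],
                  run_one: Callable[[dict], object],
                  max_concurrent: int = 4) -> List[object]:
    """Run jobs with at most max_concurrent in flight; keep input order."""
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        # pool.map yields results in the order the jobs were submitted.
        return list(pool.map(run_one, jobs))


def fake_run(job: dict) -> str:
    # Stub in place of a real API call, so the sketch runs offline.
    return job["prompt"].upper()


results = run_throttled([{"prompt": "a cat"}, {"prompt": "a dog"}], fake_run,
                        max_concurrent=2)
print(results)  # ['A CAT', 'A DOG']
```

In real use, run_one would wrap replicate.run() together with the retry loop, so throttling and backoff work together.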

Replicate vs Modal vs HuggingFace Inference vs Banana vs RunPod vs Cog

Serverless GPU inference is a crowded category. Each platform makes different trade-offs around cold start, pricing, model packaging, and how much “DevOps” you’re expected to do.

Replicate. Models packaged as Cog containers. HTTP API, predictions and deployments, webhooks, streaming for LLMs. The “marketplace” angle — thousands of community models you can call without uploading anything — is the main differentiator. Best when you want to consume someone else’s model with a one-liner. Weakness: cold starts on shared hardware can run 10–60 seconds for large models; per-second pricing adds up if you forget to switch to deployments.

Modal. Python-first serverless. You decorate a Python function with @app.function(), and Modal handles the container, scaling, and GPU allocation. No Dockerfile needed for most cases. Best when you write your own inference code and want it to “just work” on cloud GPUs. Stronger for custom training/batch jobs than for serving a single model endpoint. Weakness: Python-only, not a model marketplace.

Hugging Face Inference Endpoints. Pick a model from the Hub, click “Deploy,” pay for a fixed instance size. Great UX for serving anything Transformers-compatible. Best when your model is already on the Hub and you want the simplest possible path to an HTTPS endpoint. Weakness: pricing is per-hour (not per-second), so idle endpoints cost money; less flexibility for custom inference code.

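Banana. Serverless model deployment with a focus on cold-start minimization. Its custom Potassium framework wraps your code. Best for latency-sensitive but bursty traffic, since Banana’s product positioning is around fast cold starts. Weakness: smaller ecosystem, less proven at very high scale. Banana announced in early 2024 that it was winding down its serverless GPU service, so confirm its current status before building on it.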

RunPod. GPU rental, including serverless endpoints and persistent pods. Closer to “raw GPU access” — bring your own Docker image, you pick the GPU type, you pay by the second. Best when you want full control over the runtime and the cheapest possible GPU-hour rate. Weakness: more DIY than Replicate or Modal; expect to write your own container and handler.

Cog (as a standalone tool). Cog is the open-source containerization format Replicate uses. You can cog build and run the resulting image anywhere — your own Kubernetes, EC2, or another platform. Best when you want Replicate-style ergonomics but self-hosted inference. Weakness: you’re now operating GPU infrastructure yourself.

Honorable mentions. AWS SageMaker, Google Vertex AI, Lambda Labs (GPU rental), Together AI, Fireworks (LLM-only). Cloudflare Workers AI for small models at the edge.

Practical mapping: existing models → Replicate. Custom Python inference → Modal. Hugging Face Hub models → HF Endpoints. Tightest latency on bursty traffic → Banana or Replicate Deployments. Cheapest GPU-hour → RunPod. Self-hosted → Cog. The biggest cross-platform gotcha: “model URL” means different things on each platform, but the underlying primitive (an HTTPS POST returning a prediction or job ID) is roughly the same.

Still Not Working?

A few less-obvious failures:

  • No webhook events received. Replicate sends webhooks at specific lifecycle moments. Check webhook_events_filter. Also verify your webhook endpoint is HTTPS and publicly accessible (no localhost).
  • Output is a URL, not the data. Image/audio/video outputs are URLs to Replicate-hosted files. Download to your storage if you need long-term retention — Replicate’s hosted files may expire.
  • File too large for image upload. ~5-10 MB limit on inputs via base64. Use a public URL for larger files.
  • Prediction timed out. Default timeout is per-model. For long-running predictions, check the model’s predict_timeout in cog.yaml.
  • Cog build slow. Each build pushes the full image. Use cog build --use-cuda-base-image and pin dependencies for caching.
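  • Webhook signatures don’t match. Replicate signs the webhook-id header, the webhook-timestamp header, and the raw request body, joined with dots, so verify against the unparsed bytes (re-serialized JSON will not match). Use compare_digest for timing-safe comparison, and confirm that the secret your server loads is the signing secret Replicate currently holds for your account.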
  • Streaming events out of order. SSE is in-order at the network level but client parsing may buffer. Use a proper SSE parser.
  • Predictions on shared infrastructure are slow. Cold start. Deploy to a deployment with min_instances >= 1 for predictable latency.
  • cog push succeeds but the new version doesn’t appear in the API. Replicate processes pushed images asynchronously. Check Dashboard → Versions; if the new version is “Building” or “Failed” there, your local cog push finished but the server-side conversion hasn’t. Failed builds often relate to CUDA version mismatches between Cog and your requirements.txt.
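  • Output URLs return 404 after a while. Replicate-hosted prediction outputs are temporary. Replicate’s data-retention documentation describes files from API-created predictions as being deleted automatically after a short window (about an hour at the time of writing), so do not count on a 24-hour grace period. If you need persistent assets, download the URL immediately after the prediction succeeds and store it in your own S3/R2 bucket. Don’t link to Replicate URLs from production UIs.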
  • Streaming hangs at the first token on certain models. Some LLM models on Replicate require stream=true in the input and the SSE streaming endpoint. Calling replicate.stream() against a model that only supports batch returns nothing forever. Check the model’s openapi_schema for x-stream: true on the output.
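
For the expiring-output problem in the list above, the download step can be a few lines of standard library code. This is a sketch: persist_output is a hypothetical helper, naming files after the prediction ID is just one convention, and uploading the saved file to S3 or R2 is left out.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def persist_output(url: str, dest_dir: str, prediction_id: str) -> Path:
    """Download url into dest_dir as <prediction_id><ext>; return the path."""
    # Keep the original extension so downstream tools can detect the type.
    suffix = Path(urlparse(url).path).suffix or ".bin"
    dest = Path(dest_dir) / f"{prediction_id}{suffix}"
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Stream to disk instead of holding the whole file in memory.
    with urllib.request.urlopen(url, timeout=60) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)
    return dest
```

Calling this from the webhook handler or right after a successful poll means the copy happens while the Replicate URL is still valid. Store and serve your own copy from then on.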

For related ML inference and serving issues, see Modal not working, Cloudflare Workers AI not working, vLLM not working, and HuggingFace Transformers not working.

FixDevs

Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.
