Fix: Modal Not Working — App vs Stub, Image Build, Volumes, GPU Selection, and Cold Starts

FixDevs

Quick Answer

How to fix Modal Labs errors — modal.App vs modal.Stub deprecation, image dependencies missing, Volume vs NetworkFileSystem, GPU type mismatch, .remote vs .local invocation, web endpoint URL, and cold start tuning.

The Error

You write a Modal app from an older tutorial and the Stub syntax warns or errors:

import modal

stub = modal.Stub("my-app")  # DeprecationWarning or AttributeError

@stub.function()
def hello():
    print("hi")

Or modal run fails because the image is missing a dependency:

ModuleNotFoundError: No module named 'transformers'

Or you allocate a GPU and Modal says it’s unavailable:

modal.exception.ResourceExhaustedError: A100 capacity is unavailable in the region

Or .local() runs locally instead of on Modal:

result = my_function.local(...)
# Runs in your terminal, not in the cloud.

Why This Happens

Modal is Python-native serverless: you write Python, decorate it, and Modal handles container build, GPU allocation, scaling, and queues. The model — write code in your normal repo, deploy with one CLI command, get autoscaling containers — is the appeal, but it abstracts a lot. Every Modal function runs in its own container, with its own image, possibly its own GPU, possibly with mounted Volumes. When something fails it’s rarely a single layer; it’s usually a mismatch between what you wrote locally and what Modal’s container sees.

The four most common categories:

  • Stub → App rename. Modal 0.62+ renamed Stub to App. Both still work in transition versions, but new code should use App. Older tutorials, blog posts, and example repos still show Stub, which makes copy-paste a recipe for DeprecationWarning noise (or, on newer Modal, AttributeError).
  • Images are declarative. You build an image with modal.Image.debian_slim().pip_install(...) chains. The image is built remotely; if a dep is missing, you have to add it to the image, not just import it locally. A common confusion: your local Python has transformers installed and the import works in your editor, but Modal’s container is a fresh Debian with whatever you declared in the image chain, nothing more.
  • Invocation styles. .remote(...) runs on Modal; .local(...) runs in your local Python; .map(...) runs many in parallel on Modal. Mixing them produces confusing results, particularly in test scripts where you want to verify the function logic without paying for a GPU run.
  • GPU types differ by availability. gpu="a100" may not be available; gpu="any" lets Modal pick. Pinning a specific GPU can fail at runtime when capacity runs out, and “capacity” varies hour-to-hour depending on what other Modal users are doing.

There’s also a less-discussed category: cost surprises. Modal bills per second of container time. If your function spins up a container, downloads a 30 GB model from Hugging Face, and then your client times out — you’ve paid for the container time even though no useful work happened. The fixes below cover both reliability (it works) and economics (it works without wasting money).
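To make the billing point concrete, here is a minimal sketch of the arithmetic. The per-second rate and the durations are made-up placeholders, not Modal's actual prices; substitute the numbers from your own plan and logs.

```python
def container_cost(seconds: float, rate_per_second: float) -> float:
    """Cost of one container run under per-second billing."""
    if seconds < 0 or rate_per_second < 0:
        raise ValueError("seconds and rate_per_second must be non-negative")
    return seconds * rate_per_second


# Hypothetical numbers for illustration only.
HYPOTHETICAL_GPU_RATE = 0.001   # dollars per second, NOT a real Modal price
download_seconds = 600          # time spent pulling model weights
inference_seconds = 20          # time spent doing useful work

wasted = container_cost(download_seconds, HYPOTHETICAL_GPU_RATE)
useful = container_cost(inference_seconds, HYPOTHETICAL_GPU_RATE)
print(f"download: ${wasted:.2f}, inference: ${useful:.2f}")
```

With these placeholder numbers the download costs thirty times more than the inference, which is why caching weights in a Volume matters.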

How Other Tools Handle This

Serverless GPU is suddenly a competitive market. Each platform makes different bets on the developer experience.

  • Modal. Python-native, decorator-based, code-as-config. Strengths: tightest Python integration of any platform, fastest iteration via modal serve with live reload, first-class Volumes for model caching, secrets management built in. Weaknesses: lock-in is high — your code is wallpapered with modal. decorators, hard to extract; Python-only.
  • Replicate. Models packaged as Cog containers (Dockerfile + predict.py). Strengths: huge model catalog (anyone’s published models are one API call away), per-second billing, simple replicate.run(model, input=...) API. Weaknesses: less flexible than Modal for custom logic; Cog adds a wrapper you have to learn; cold starts can be longer.
  • Banana. Similar shape to Replicate (containers, HTTP API) but with explicit “warm replicas” you provision. Strengths: low cold-start latency when you keep replicas warm. Weaknesses: smaller ecosystem; you manage replica scaling more manually than Modal’s autoscaler.
  • Beam. Python-decorator approach (very similar feel to Modal) with built-in queues and storage. Strengths: simpler pricing tier, includes serverless functions and queues in one. Weaknesses: smaller scale and ecosystem than Modal; fewer GPU types.
  • RunPod. Lower-level: rent GPUs by the hour (Pods) or use Serverless endpoints. Strengths: cheapest GPU prices, widest GPU selection (4090, 6000 Ada, H100s, MI300s). Weaknesses: more DIY — you bring your own container, manage your own startup, do your own scaling logic for Serverless.
  • Fly.io / Vast.ai / Lambda Cloud. Each has its own model. Fly.io is great for general apps with GPU-augmented endpoints, Vast.ai is a marketplace for cheaper consumer GPUs, Lambda is hourly with API.

If you want Python decorators and don’t mind lock-in, Modal or Beam. If you want a model marketplace, Replicate. If you want the cheapest raw compute, RunPod. The cold-start, image-build, and GPU-availability problems below are most acute on Modal/Beam (because images are built per-function) and least acute on Replicate (because each model has one prebuilt image).

Fix 1: Use modal.App (Not Stub)

import modal

app = modal.App("my-app")  # New name

@app.function()
def hello():
    print("hi")

@app.local_entrypoint()
def main():
    hello.remote()

For older code that still uses Stub, both names work on transition-era SDK versions. Newer releases drop Stub and raise AttributeError, so migrate when you can:

# Both equivalent — but App is the future:
app = modal.App("my-app")
stub = modal.Stub("my-app")  # Deprecated alias
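If one codebase has to run against both old and new SDK versions for a while, a small shim can pick whichever name exists. The sketch below is one possible approach, not something from the Modal docs: pick_app_class is a hypothetical helper, and the fake modules only stand in for the real modal package so the example runs without it installed.

```python
from types import SimpleNamespace


def pick_app_class(modal_module):
    """Return modal.App when it exists, else fall back to the older modal.Stub."""
    app_cls = getattr(modal_module, "App", None)
    if app_cls is None:
        # Older SDKs only expose Stub; this raises AttributeError if neither exists.
        app_cls = getattr(modal_module, "Stub")
    return app_cls


# Stand-in objects so the sketch runs without Modal installed.
class FakeApp: ...
class FakeStub: ...

new_sdk = SimpleNamespace(App=FakeApp)
old_sdk = SimpleNamespace(Stub=FakeStub)
print(pick_app_class(new_sdk).__name__)
print(pick_app_class(old_sdk).__name__)
```

In a real project you would pass the imported modal module and call the result, for example app = pick_app_class(modal)("my-app"). Treat this as a temporary bridge; once every environment is on a recent SDK, use modal.App directly.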

Run from CLI:

modal run my_app.py
# or
modal run my_app.py::main   # Specify entry point

For deployment (persistent functions accessible via API):

modal deploy my_app.py

run is one-shot; deploy makes the functions callable from anywhere with API credentials.

Pro Tip: Use modal serve my_app.py during development. It deploys with live reload — file changes trigger a re-deploy automatically.

Fix 2: Build the Image With Your Dependencies

Modal runs each function in its own container. The container’s image must include every package your function imports:

import modal

image = (
    modal.Image.debian_slim(python_version="3.12")
    .pip_install("transformers", "torch", "numpy")
    .apt_install("git")
    .env({"MY_VAR": "value"})
)

app = modal.App("my-app", image=image)

@app.function()
def use_transformers():
    from transformers import pipeline  # Now available
    pipe = pipeline("sentiment-analysis")
    return pipe("I love this!")

The image is built remotely the first time you run. Subsequent runs reuse the cached image.

For project-local code:

image = (
    modal.Image.debian_slim()
    .pip_install("transformers")
    .add_local_dir("./my_package", remote_path="/root/my_package")
)

add_local_dir syncs a local directory into the image. For Python source you want to call from a function:

import sys

@app.function(image=image)
def my_func():
    sys.path.append("/root")  # If using add_local_dir
    from my_package import thing
    return thing()

Common Mistake: Installing pip packages outside the image chain:

# Wrong — only installs in your local venv:
# pip install transformers  

# Right — installs in the Modal image:
image = modal.Image.debian_slim().pip_install("transformers")
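A quick way to confirm what the container actually has is to check importability from inside the function before doing any real work. The helper below is a hypothetical sketch that uses only the standard library; missing_modules is not a Modal API, and the module names are examples.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of module names that cannot be found in this environment."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec can raise for dotted names whose parent package is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


print(missing_modules(["json", "surely_not_installed_pkg_xyz"]))
```

Calling this at the top of a Modal function with the packages you expect (say, "transformers" and "torch") turns a late ModuleNotFoundError into an early, explicit list of what to add to the image chain.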

Fix 3: Pick the Right GPU

Modal supports several GPU types. As of 2025-2026:

@app.function(gpu="any")          # Whatever Modal has available
@app.function(gpu="T4")           # T4 (16 GB)
@app.function(gpu="L4")           # L4 (24 GB)
@app.function(gpu="A10G")         # A10G (24 GB)
@app.function(gpu="A100")         # A100 (40 GB)
@app.function(gpu="A100-80GB")    # A100 80 GB
@app.function(gpu="H100")         # H100 (80 GB)

For multi-GPU (newer SDKs also accept a string such as gpu="A100:4"):

@app.function(gpu=modal.gpu.A100(count=4))
def train():
    # 4x A100 in one container
    pass

If your preferred GPU isn’t available (ResourceExhaustedError), Modal will queue or fail. Two strategies:

# Fallback list (Modal tries in order):
@app.function(gpu=["H100", "A100-80GB", "A100"])
def train():
    pass

# Or just "any" and detect inside:
@app.function(gpu="any")
def train():
    import torch
    print(torch.cuda.get_device_name(0))

Pro Tip: Don’t pin to H100 unless you actually need its specific features (FP8, NVL chains). T4/A10G are cheaper and available faster.
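One way to build a fallback list is to start from how much GPU memory the job needs. The sketch below uses the memory sizes from the list above and ignores price and availability entirely; gpu_fallbacks is a hypothetical helper, not part of Modal.

```python
# Memory sizes copied from the list above; prices are deliberately not modeled.
GPU_MEMORY_GB = {
    "T4": 16,
    "L4": 24,
    "A10G": 24,
    "A100": 40,
    "A100-80GB": 80,
    "H100": 80,
}


def gpu_fallbacks(required_gb: int) -> list[str]:
    """GPU names with at least required_gb of memory, smallest first."""
    fits = [name for name, gb in GPU_MEMORY_GB.items() if gb >= required_gb]
    # sorted() is stable, so ties keep the order of the table above.
    return sorted(fits, key=GPU_MEMORY_GB.__getitem__)


print(gpu_fallbacks(30))  # ['A100', 'A100-80GB', 'H100']
```

The resulting list can be passed as the gpu= argument in the fallback form shown earlier. Smallest-first is only a rough proxy for cheapest-first, so reorder it if your pricing says otherwise.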

Fix 4: .remote(), .local(), .map()

@app.function()
def square(x: int) -> int:
    return x * x

@app.local_entrypoint()
def main():
    # Runs on Modal:
    result = square.remote(5)
    print(result)  # 25

    # Runs locally in your terminal:
    result = square.local(5)
    print(result)  # 25, but didn't use Modal

    # Runs many in parallel on Modal:
    results = list(square.map(range(10)))
    print(results)  # [0, 1, 4, 9, ..., 81]

  • .remote(): single call on Modal. Returns the result.
  • .local(): runs in your local Python process (for testing).
  • .map(): batch parallel. Returns a generator of results.
  • .spawn(): fire-and-forget; returns a FunctionCall handle for later polling.

For thousands of parallel calls:

results = list(square.map(range(10000), order_outputs=False))

order_outputs=False lets results return as they finish (faster). Default is True (matches input order, slower for skewed durations).
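The ordered-versus-unordered trade-off is easy to see locally with the standard library. This is only an analogy built on concurrent.futures, not how Modal implements .map(): pool.map plays the role of order_outputs=True and as_completed plays the role of order_outputs=False.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def slow_square(x: int) -> int:
    # Larger inputs finish sooner, so completion order differs from input order.
    time.sleep(0.01 * (5 - x))
    return x * x


inputs = list(range(5))
with ThreadPoolExecutor(max_workers=5) as pool:
    # Input order is preserved, like order_outputs=True.
    ordered = list(pool.map(slow_square, inputs))
    # Results arrive as they finish, like order_outputs=False.
    futures = [pool.submit(slow_square, x) for x in inputs]
    unordered = [f.result() for f in as_completed(futures)]

print(ordered)            # [0, 1, 4, 9, 16]
print(sorted(unordered))  # same values; arrival order may differ
```

If you use the unordered form and still need to know which input produced which output, return the input alongside the result from the function itself.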

Common Mistake: Calling square(5) directly (no .remote() or .local()). This calls the bare function in your local Python — same as .local() but without making it explicit.

Fix 5: Volumes for Persistent Storage

Containers are ephemeral. For persistent data, use modal.Volume:

volume = modal.Volume.from_name("my-vol", create_if_missing=True)

@app.function(volumes={"/data": volume})
def write_data():
    with open("/data/file.txt", "w") as f:
        f.write("hello")
    volume.commit()  # Persist changes

@app.function(volumes={"/data": volume})
def read_data():
    volume.reload()  # Get latest state
    with open("/data/file.txt") as f:
        return f.read()

Two important calls:

  • volume.commit() — saves changes back to the Volume after writes. Without it, writes are lost when the container exits.
  • volume.reload() — pulls fresh state before reading. Without it, you may read stale cached data.
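The commit/reload contract described above can be pictured with a toy in-memory model. ToyVolume and ToyMount are invented for this illustration and are not the Modal API; the real implementation is more involved, but the visibility rules follow the same shape.

```python
class ToyVolume:
    """In-memory stand-in for shared storage; NOT the Modal API."""

    def __init__(self):
        self.committed = {}


class ToyMount:
    """One container's view of a ToyVolume."""

    def __init__(self, volume):
        self.volume = volume
        self.files = dict(volume.committed)  # snapshot taken at mount time

    def commit(self):
        # Publish this container's files as the shared state.
        self.volume.committed = dict(self.files)

    def reload(self):
        # Replace the local snapshot with the latest shared state.
        self.files = dict(self.volume.committed)


vol = ToyVolume()
writer, reader = ToyMount(vol), ToyMount(vol)

writer.files["file.txt"] = "hello"
print("file.txt" in reader.files)   # False: nothing committed yet
writer.commit()
print("file.txt" in reader.files)   # False: reader still holds its old snapshot
reader.reload()
print(reader.files["file.txt"])     # hello
```

The middle step is the one that catches people: a commit on the writer side does nothing for a reader that never reloads.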

For caching model weights:

weights_vol = modal.Volume.from_name("model-weights", create_if_missing=True)

@app.function(
    volumes={"/cache": weights_vol},
    image=image,
)
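def inference(prompt: str):
    import os
    os.environ["HF_HOME"] = "/cache/huggingface"  # Set before importing transformers
    # First call downloads the weights (roughly 16 GB in fp16) to /cache; later calls reuse them.
    # The Llama 3 repos are gated, so the container also needs a Hugging Face token.
    from transformers import pipeline
    pipe = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B")
    weights_vol.commit()  # Persist the downloaded weights for future containers
    return pipe(prompt)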

The first inference downloads the model; subsequent inferences (even in fresh containers) reuse the Volume.

Pro Tip: Use a separate Volume for each large dataset/model. Volumes have per-Volume read/write caches, so isolating them gives the best cold-start times.

Fix 6: Web Endpoints

Expose a function as an HTTPS endpoint:

@app.function(image=image)
@modal.web_endpoint(method="POST")
def predict(payload: dict):
    result = my_model.predict(payload["input"])
    return {"result": result}

After modal deploy, the URL is in the deploy output. Or via CLI:

modal app list
modal app show my-app

For FastAPI integration (more powerful):

from fastapi import FastAPI

web_app = FastAPI()

@web_app.post("/predict")
def predict(payload: dict):
    return {"result": my_model.predict(payload["input"])}

@app.function(image=image)
@modal.asgi_app()
def fastapi_app():
    return web_app

This exposes the FastAPI app at one Modal-assigned URL. All routes work as in any FastAPI deploy.

Common Mistake: Authenticated web endpoints without proper headers. Modal’s web endpoints can be public or require an API token:

@modal.web_endpoint(method="POST", requires_proxy_auth=True)
def secure_endpoint(...): ...

requires_proxy_auth=True blocks unauthenticated callers at the Modal edge.
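On the caller side, the credentials travel as request headers. The sketch below builds such a request with the standard library without sending it. The header names Modal-Key and Modal-Secret are my understanding of Modal's proxy-auth scheme and should be confirmed against the current docs; the URL and token values are placeholders.

```python
import json
import urllib.request


def build_proxy_auth_request(url, payload, token_id, token_secret):
    """Build (but do not send) a POST request carrying proxy-auth headers."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Header names are an assumption; confirm them in the Modal docs.
            "Modal-Key": token_id,
            "Modal-Secret": token_secret,
        },
        method="POST",
    )


# Placeholder URL and credentials, for illustration only.
req = build_proxy_auth_request(
    "https://example.invalid/predict", {"input": "hi"}, "my-token-id", "my-token-secret"
)
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req, timeout=30)
```

In practice, read the token values from environment variables or a secret store rather than hard-coding them in the client.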

Fix 7: Secrets

Don’t put API keys in your code. Use Modal Secrets:

modal secret create openai-secret OPENAI_API_KEY=sk-...

Or via the dashboard.

Reference in functions:

@app.function(
    image=image,
    secrets=[modal.Secret.from_name("openai-secret")],
)
def call_openai(prompt: str):
    import os
    from openai import OpenAI
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    return client.chat.completions.create(...)

The secret’s env vars are injected at runtime; they don’t appear in your image or logs.

For multiple secrets:

secrets=[
    modal.Secret.from_name("openai-secret"),
    modal.Secret.from_name("anthropic-secret"),
    modal.Secret.from_name("aws-creds"),
]

For dynamic secrets (per-deployment overrides):

modal.Secret.from_dict({"DB_URL": os.environ["LOCAL_DB_URL"]})

from_dict is fine for dev but don’t commit hard-coded secrets that way.
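Whichever way the secret gets attached, it helps to fail fast inside the function when an expected variable is absent, instead of hitting a KeyError halfway through a request. A minimal sketch, where require_env is a hypothetical helper and the key value is a placeholder:

```python
import os


def require_env(names, env=None):
    """Return {name: value} for each name, or raise listing every missing one."""
    env = os.environ if env is None else env
    missing = [n for n in names if not env.get(n)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {n: env[n] for n in names}


# A plain dict stands in for os.environ so the sketch runs anywhere.
fake_env = {"OPENAI_API_KEY": "sk-placeholder"}
print(require_env(["OPENAI_API_KEY"], env=fake_env))
```

Inside a Modal function you would call require_env(["OPENAI_API_KEY"]) with no env argument, so it reads the variables injected by the attached secrets.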

Fix 8: Cold Starts and container_idle_timeout

Each cold start downloads the image, starts the container, and runs your function. For latency-sensitive workloads, keep containers warm:

@app.function(
    image=image,
    gpu="A10G",
    container_idle_timeout=300,  # Keep idle containers alive for 5 minutes
    allow_concurrent_inputs=10,  # One container handles up to 10 concurrent calls
    keep_warm=1,                 # Always have 1 container ready
)
def predict(payload: dict):
    ...

Three controls:

  • container_idle_timeout — how long a container sits idle before being killed.
  • allow_concurrent_inputs — concurrent requests per container. Higher means fewer cold starts but more memory pressure.
  • keep_warm: number of always-running containers. Costs money even when idle, but requests that land on a warm container skip the cold start entirely.

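For GPU functions, even keep_warm=1 is expensive. Use it for production endpoints; for batch jobs, accept cold starts. Newer SDK releases rename these options (scaledown_window, min_containers, and the @modal.concurrent decorator), so follow the deprecation warning for your version.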

Pro Tip: Test cold-start time with modal run --detach. The first call after deploy is a cold start; subsequent calls are warm.
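To put numbers on cold versus warm, time a few consecutive calls and compare the first against the rest. The sketch below is a generic timing helper; FakeEndpoint is a stand-in that merely sleeps on its first call so the example runs without Modal.

```python
import time


def time_calls(fn, n=3):
    """Call fn() n times and return each call's wall-clock duration in seconds."""
    durations = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


class FakeEndpoint:
    """Stand-in that is slow on its first call only, like a cold container."""

    def __init__(self):
        self.warm = False

    def __call__(self):
        if not self.warm:
            time.sleep(0.05)  # pretend cold start
            self.warm = True


durations = time_calls(FakeEndpoint())
print([f"{d:.3f}s" for d in durations])
```

Against a real deployment, pass a callable that invokes your function (for example a lambda wrapping predict.remote with a sample payload) from a local entrypoint. The gap between the first duration and the rest approximates the cold-start cost.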

Still Not Working?

A few less-obvious failures:

  • ImportError despite pip_install. Modal caches images by chain hash. Changing a step rebuilds that step and every step after it, while the steps before it stay cached, so check that the pip_install you edited is in the image the function actually uses. Order chain steps from least-changing to most-changing.
  • TimeoutError: function exceeded 600s. Default function timeout is 10 minutes. Bump via @app.function(timeout=3600) (1 hour). Max varies by Modal plan.
  • GPU function runs on CPU. No gpu= set. Decorator must include gpu="any" or specific type.
  • add_local_dir not picking up changes. Modal caches local syncs. Force re-sync with --detach or bump the version of your function decorator.
  • Volumes diverge across regions. Volumes are region-scoped. Functions in different regions reading the same Volume name access different physical volumes. Pin function region or use a different storage backend.
  • modal token set fails in CI. CI environments need API token via env var: MODAL_TOKEN_ID and MODAL_TOKEN_SECRET. Generate from the dashboard.
  • modal.exception.InvalidError: function does not exist. Either the function isn’t deployed yet (run modal deploy) or the function name in the lookup doesn’t match. Use Function.lookup("my-app", "my-func") with the exact app and function names (newer SDKs spell this Function.from_name).
  • Stub.cls deprecated. Modal moved from Stub.cls to App.cls along with Stub → App. Update @stub.cls decorators to @app.cls.
  • Cold start exceeds 90s on first call after deploy. Image download dominates. Make the image smaller — pin Python version, prune CUDA dev tools, separate model weights into a Volume so they’re not in the image layer. The image size is reported by modal image show.
  • Volume reads return old data even after volume.reload(). Volumes are eventually consistent across regions. If your reader is in a different region than the writer, brief staleness is expected. Pin both functions to the same region with region="us-east-1".
  • modal serve doesn’t reload on edits inside a subdirectory. The CLI watches the entrypoint file and its direct imports. For deep package imports, add explicit --watch-paths or run modal from a parent directory.
  • @modal.web_endpoint is deprecated in newer SDKs. Recent Modal versions replaced it with @modal.fastapi_endpoint and @modal.asgi_app for richer routing. Check the warning text; the migration is mechanical.
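The image-caching item in the list above is easier to reason about with a toy model of chained cache keys. This sketch assumes each step's key is a hash of the step plus the previous key, which is how layer caches commonly work; it is not Modal's actual hashing scheme, and the step strings are placeholders.

```python
import hashlib


def layer_keys(steps):
    """Cache key per step: a hash of the step text plus the previous key."""
    keys, parent = [], ""
    for step in steps:
        parent = hashlib.sha256((parent + "\n" + step).encode()).hexdigest()
        keys.append(parent)
    return keys


def steps_to_rebuild(old_steps, new_steps):
    """Steps in new_steps whose cache key does not match the old build."""
    old_keys = layer_keys(old_steps)
    return [
        step
        for i, (step, key) in enumerate(zip(new_steps, layer_keys(new_steps)))
        if i >= len(old_keys) or old_keys[i] != key
    ]


old = ["apt_install git", "pip_install torch", "pip_install transformers"]
print(steps_to_rebuild(old, old + ["pip_install numpy"]))         # only the new last step
print(steps_to_rebuild(old, ["apt_install git curl"] + old[1:]))  # every step
```

Appending a step at the end costs one rebuild, while editing the first step costs all of them. That is the reason to put slow, stable installs early and frequently edited ones late.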

For related Python deployment and ML serving issues, see vLLM not working, AWS Lambda timeout, PyTorch not working, and Replicate not working.

FixDevs

Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.
