Fix: Modal Not Working — App vs Stub, Image Build, Volumes, GPU Selection, and Cold Starts
Quick Answer
How to fix Modal Labs errors — modal.App vs modal.Stub deprecation, image dependencies missing, Volume vs NetworkFileSystem, GPU type mismatch, .remote vs .local invocation, web endpoint URL, and cold start tuning.
The Error
You write a Modal app with the old Stub syntax, and it warns or errors:

import modal

stub = modal.Stub("my-app")  # DeprecationWarning or AttributeError

@stub.function()
def hello():
    print("hi")

Or modal run fails because the image is missing a dependency:

ModuleNotFoundError: No module named 'transformers'

Or you allocate a GPU and Modal says it’s unavailable:

modal.exception.ResourceExhaustedError: A100 capacity is unavailable in the region

Or .local() runs locally instead of on Modal:

result = my_function.local(...)
# Runs in your terminal, not in the cloud.

Why This Happens
Modal is Python-native serverless: you write Python, decorate it, and Modal handles container build, GPU allocation, scaling, and queues. The model — write code in your normal repo, deploy with one CLI command, get autoscaling containers — is the appeal, but it abstracts a lot. Every Modal function runs in its own container, with its own image, possibly its own GPU, possibly with mounted Volumes. When something fails it’s rarely a single layer; it’s usually a mismatch between what you wrote locally and what Modal’s container sees.
The four most common categories:
- Stub → App rename. Modal 0.62+ renamed Stub to App. Both still work in transition versions, but new code should use App. Older tutorials, blog posts, and example repos still show Stub, which makes copy-paste a recipe for DeprecationWarning noise (or, on newer Modal, AttributeError).
- Images are declarative. You build an image with modal.Image.debian_slim().pip_install(...) chains. The image is built remotely; if a dependency is missing, you have to add it to the image, not just install it locally. A common confusion: your local Python has transformers installed and the import works in your editor, but Modal’s container is a fresh Debian with whatever you declared in the image chain, nothing more.
- Invocation styles. .remote(...) runs on Modal; .local(...) runs in your local Python; .map(...) runs many calls in parallel on Modal. Mixing them produces confusing results, particularly in test scripts where you want to verify the function logic without paying for a GPU run.
- GPU types differ by availability. gpu="a100" may not be available; gpu="any" lets Modal pick. Pinning a specific GPU can fail at runtime when capacity runs out, and “capacity” varies hour-to-hour depending on what other Modal users are doing.
There’s also a less-discussed category: cost surprises. Modal bills per second of container time. If your function spins up a container, downloads a 30 GB model from Hugging Face, and then your client times out — you’ve paid for the container time even though no useful work happened. The fixes below cover both reliability (it works) and economics (it works without wasting money).
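The per-second billing point can be made concrete with a little arithmetic. The sketch below is illustrative only: the rate and the download time are assumed numbers, not Modal's actual pricing, and container_cost is a hypothetical helper.

```python
def container_cost(seconds: float, usd_per_second: float) -> float:
    """Cost of one container running for `seconds` at a flat per-second rate."""
    return seconds * usd_per_second

# Hypothetical figures for illustration only.
RATE = 0.001            # USD per second (assumed, not a real price)
DOWNLOAD_SECONDS = 240  # time spent fetching weights before the client gave up

wasted = container_cost(DOWNLOAD_SECONDS, RATE)
print(f"Billed with no useful work: ${wasted:.2f}")
```

Plugging in the real rate for your GPU type shows quickly whether caching weights in a Volume pays for itself.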
How Other Tools Handle This
Serverless GPU is suddenly a competitive market. Each platform makes different bets on the developer experience.
- Modal. Python-native, decorator-based, code-as-config. Strengths: tightest Python integration of any platform, fastest iteration via modal serve with live reload, first-class Volumes for model caching, secrets management built in. Weaknesses: lock-in is high (your code is wallpapered with modal.* decorators that are hard to extract); Python-only.
- Replicate. Models packaged as Cog containers (Dockerfile + predict.py). Strengths: huge model catalog (anyone’s published models are one API call away), per-second billing, simple replicate.run(model, input=...) API. Weaknesses: less flexible than Modal for custom logic; Cog adds a wrapper you have to learn; cold starts can be longer.
- Banana. Similar shape to Replicate (containers, HTTP API) but with explicit “warm replicas” you provision. Strengths: low cold-start latency when you keep replicas warm. Weaknesses: smaller ecosystem; you manage replica scaling more manually than Modal’s autoscaler.
- Beam. Python-decorator approach (very similar feel to Modal) with built-in queues and storage. Strengths: simpler pricing tier, includes serverless functions and queues in one. Weaknesses: smaller scale and ecosystem than Modal; fewer GPU types.
- RunPod. Lower-level: rent GPUs by the hour (Pods) or use Serverless endpoints. Strengths: cheapest GPU prices, widest GPU selection (4090, 6000 Ada, H100s, MI300s). Weaknesses: more DIY. You bring your own container, manage your own startup, and do your own scaling logic for Serverless.
- Fly.io / Vast.ai / Lambda Cloud. Each has its own model. Fly.io is great for general apps with GPU-augmented endpoints, Vast.ai is a marketplace for cheaper consumer GPUs, and Lambda offers hourly rentals with an API.
If you want Python decorators and don’t mind lock-in, Modal or Beam. If you want a model marketplace, Replicate. If you want the cheapest raw compute, RunPod. The cold-start, image-build, and GPU-availability problems below are most acute on Modal/Beam (because images are built per-function) and least acute on Replicate (because each model has one prebuilt image).
Fix 1: Use modal.App (Not Stub)
import modal

app = modal.App("my-app")  # New name

@app.function()
def hello():
    print("hi")

@app.local_entrypoint()
def main():
    hello.remote()

Older code that still uses Stub keeps working on transition releases, where Stub is a deprecated alias. On newer releases the name is gone and raises AttributeError:

# Equivalent on transition releases only; App is the supported name:
app = modal.App("my-app")
stub = modal.Stub("my-app")  # Deprecated alias

Run from the CLI:

modal run my_app.py
# or
modal run my_app.py::main  # Specify entry point

For deployment (persistent functions accessible via API):

modal deploy my_app.py

run is one-shot; deploy makes the functions callable from anywhere with API credentials.
Pro Tip: Use modal serve my_app.py during development. It deploys with live reload — file changes trigger a re-deploy automatically.
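If one codebase has to run against both old and new SDK versions during a migration, a small shim can pick whichever name exists. This is a minimal sketch, not something Modal ships: app_factory is a hypothetical helper, and the two SimpleNamespace objects are stand-ins for an old and a new SDK so the example runs without Modal installed.

```python
from types import SimpleNamespace

def app_factory(modal_module):
    """Return the App class if present, else fall back to the legacy Stub."""
    factory = getattr(modal_module, "App", None)
    if factory is None:
        # Raises AttributeError if the module has neither name.
        factory = getattr(modal_module, "Stub")
    return factory

# Stand-ins for an old and a new SDK (assumptions for illustration).
old_sdk = SimpleNamespace(Stub=lambda name: f"Stub({name})")
new_sdk = SimpleNamespace(App=lambda name: f"App({name})")

print(app_factory(old_sdk)("my-app"))
print(app_factory(new_sdk)("my-app"))
```

In real code you would pass the imported modal module, for example app = app_factory(modal)("my-app"), and delete the shim once every environment is on a release that has App.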
Fix 2: Build the Image With Your Dependencies
Modal runs each function in its own container. The container’s image must include every package your function imports:
import modal

image = (
    modal.Image.debian_slim(python_version="3.12")
    .pip_install("transformers", "torch", "numpy")
    .apt_install("git")
    .env({"MY_VAR": "value"})
)

app = modal.App("my-app", image=image)

@app.function()
def use_transformers():
    from transformers import pipeline  # Now available
    pipe = pipeline("sentiment-analysis")
    return pipe("I love this!")

The image is built remotely the first time you run. Subsequent runs reuse the cached image.
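One way to catch a missing dependency before paying for a remote build is to compare what a function imports with what the image declares. The sketch below is a rough local check under stated assumptions: imported_modules is a hypothetical helper, the declared set is typed by hand rather than read from Modal, and import names do not always match PyPI package names (PIL versus pillow, for example), so treat the result as a hint.

```python
import ast

def imported_modules(source: str) -> set[str]:
    """Collect top-level module names from absolute imports in `source`."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])
    return names

FUNCTION_SOURCE = '''
def use_transformers():
    import numpy as np
    from transformers import pipeline
    return pipeline("sentiment-analysis")
'''

declared = {"numpy", "torch"}  # what the image chain pip_installs (assumed)
missing = imported_modules(FUNCTION_SOURCE) - declared
print(sorted(missing))
```

Standard-library modules would also show up as "missing" with this approach, so a real check would need to filter them out.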
For project-local code:
image = (
    modal.Image.debian_slim()
    .pip_install("transformers")
    .add_local_dir("./my_package", remote_path="/root/my_package")
)

add_local_dir syncs a local directory into the image. For Python source you want to call from a function:

import sys

@app.function(image=image)
def my_func():
    sys.path.append("/root")  # If using add_local_dir
    from my_package import thing
    return thing()

Common Mistake: Installing pip packages outside the image chain:

# Wrong: only installs in your local venv:
# pip install transformers
# Right: installs in the Modal image:
image = modal.Image.debian_slim().pip_install("transformers")

Fix 3: Pick the Right GPU
Modal supports several GPU types. As of 2025-2026:

# Alternatives; use one gpu= value per function:
@app.function(gpu="any")        # Anything available (cheapest)
@app.function(gpu="T4")         # T4 (16 GB)
@app.function(gpu="L4")         # L4 (24 GB)
@app.function(gpu="A10G")       # A10G (24 GB)
@app.function(gpu="A100")       # A100 (40 GB)
@app.function(gpu="A100-80GB")  # A100 80 GB
@app.function(gpu="H100")       # H100 (80 GB)

For multi-GPU:

@app.function(gpu=modal.gpu.A100(count=4))  # Newer SDKs also accept the string form gpu="A100:4"
def train():
    # 4x A100 in one container
    pass

If your preferred GPU isn’t available (ResourceExhaustedError), Modal will queue or fail. Two strategies:

# Fallback list (Modal tries in order):
@app.function(gpu=["H100", "A100-80GB", "A100"])
def train():
    pass

# Or just "any" and detect inside:
@app.function(gpu="any")
def train():
    import torch
    print(torch.cuda.get_device_name(0))

Pro Tip: Don’t pin to H100 unless you actually need its specific features (FP8, NVL chains). T4/A10G are cheaper and available faster.
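The ordered fallback list can be pictured as "take the first preferred type that currently has capacity". The sketch below is a toy model of that idea only; it is not Modal's scheduler, and first_available and the availability sets are hypothetical.

```python
from typing import Optional, Sequence

def first_available(preferred: Sequence[str], available: set[str]) -> Optional[str]:
    """Return the first GPU type in `preferred` that is in `available`, else None."""
    for gpu in preferred:
        if gpu in available:
            return gpu
    return None

preferred = ["H100", "A100-80GB", "A100"]
print(first_available(preferred, {"A100", "T4"}))  # falls through to the last choice
print(first_available(preferred, {"T4"}))          # nothing on the list has capacity
```

The second call shows why a fallback list can still fail: if none of the listed types is free, you are back to queueing or an error, which is the case gpu="any" avoids.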
Fix 4: .remote(), .local(), .map()
@app.function()
def square(x: int) -> int:
    return x * x

@app.local_entrypoint()
def main():
    # Runs on Modal:
    result = square.remote(5)
    print(result)  # 25

    # Runs locally in your terminal:
    result = square.local(5)
    print(result)  # 25, but didn't use Modal

    # Runs many in parallel on Modal:
    results = list(square.map(range(10)))
    print(results)  # [0, 1, 4, 9, ..., 81]

- .remote(): single call on Modal. Returns the result.
- .local(): runs in your local Python process (for testing).
- .map(): batch parallel. Returns a generator of results.
- .spawn(): fire-and-forget; returns a FunctionCall handle for later polling.

For thousands of parallel calls:

results = list(square.map(range(10000), order_outputs=False))

order_outputs=False lets results return as they finish (faster). Default is True (matches input order, slower for skewed durations).

Common Mistake: Calling square(5) directly (no .remote() or .local()). This calls the bare function in your local Python, the same as .local() but without making it explicit.
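The ordered-versus-unordered trade-off is easy to see locally. The sketch below uses standard-library threads as a stand-in for remote containers; it does not call Modal, and slow_square is a hypothetical function whose run time is deliberately skewed.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def slow_square(x: int) -> int:
    time.sleep(0.01 * (5 - x))  # smaller inputs take longer, so durations are skewed
    return x * x

inputs = list(range(5))
with ThreadPoolExecutor(max_workers=5) as pool:
    # Input order preserved: early slow items hold back later fast ones.
    ordered = list(pool.map(slow_square, inputs))
    # Completion order: each result is available as soon as it finishes.
    futures = [pool.submit(slow_square, x) for x in inputs]
    unordered = [f.result() for f in as_completed(futures)]

print(ordered)
print(unordered)
```

Both lists contain the same values; only the arrival order differs. If downstream code needs to know which input produced which output, return the input alongside the result before switching to unordered mode.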
Fix 5: Volumes for Persistent Storage
Containers are ephemeral. For persistent data, use modal.Volume:
volume = modal.Volume.from_name("my-vol", create_if_missing=True)

@app.function(volumes={"/data": volume})
def write_data():
    with open("/data/file.txt", "w") as f:
        f.write("hello")
    volume.commit()  # Persist changes

@app.function(volumes={"/data": volume})
def read_data():
    volume.reload()  # Get latest state
    with open("/data/file.txt") as f:
        return f.read()

Two important calls:
- volume.commit(): saves changes back to the Volume after writes. Without it, writes are lost when the container exits.
- volume.reload(): pulls fresh state before reading. Without it, you may read stale cached data.
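The commit and reload semantics described above can be modeled in a few lines. This is a toy in-memory model written for illustration, not Modal's implementation: ToyVolume is hypothetical, and it only captures the idea that each mount holds a private view until it commits or reloads.

```python
class ToyVolume:
    """In-memory stand-in: one shared committed state, one private view per mount."""
    _committed: dict = {}

    def __init__(self):
        self.view = dict(ToyVolume._committed)  # snapshot taken at mount time

    def write(self, path: str, data: str) -> None:
        self.view[path] = data                  # visible only to this mount

    def commit(self) -> None:
        ToyVolume._committed = dict(self.view)  # publish this mount's changes

    def reload(self) -> None:
        self.view = dict(ToyVolume._committed)  # pull the latest committed state

writer, reader = ToyVolume(), ToyVolume()
writer.write("/data/file.txt", "hello")
before_commit = reader.view.get("/data/file.txt")
writer.commit()
before_reload = reader.view.get("/data/file.txt")
reader.reload()
after_reload = reader.view.get("/data/file.txt")
print(before_commit, before_reload, after_reload)
```

The reader sees the file only after both steps have happened: the writer committed, and the reader reloaded.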
For caching model weights:
weights_vol = modal.Volume.from_name("model-weights", create_if_missing=True)

@app.function(
    volumes={"/cache": weights_vol},
    image=image,
)
def inference(prompt: str):
    import os
    os.environ["HF_HOME"] = "/cache/huggingface"  # Set before importing transformers
    # First call downloads the weights to /cache; subsequent calls reuse them.
    from transformers import pipeline
    # Gated model: the container also needs a Hugging Face token with access.
    pipe = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B")
    return pipe(prompt)

The first inference downloads the model; subsequent inferences (even in fresh containers) reuse the Volume.
Pro Tip: Use a separate Volume for each large dataset/model. Volumes have per-Volume read/write caches, so isolating them gives the best cold-start times.
Fix 6: Web Endpoints
Expose a function as an HTTPS endpoint:
@app.function(image=image)
@modal.web_endpoint(method="POST")
def predict(payload: dict):
    result = my_model.predict(payload["input"])
    return {"result": result}

After modal deploy, the URL is in the deploy output. Or via CLI:

modal app list
modal app show my-app

For FastAPI integration (more powerful):

from fastapi import FastAPI

web_app = FastAPI()

@web_app.post("/predict")
def predict(payload: dict):
    return {"result": my_model.predict(payload["input"])}

@app.function(image=image)
@modal.asgi_app()
def fastapi_app():
    return web_app

This exposes the FastAPI app at one Modal-assigned URL. All routes work as in any FastAPI deploy.

Common Mistake: Authenticated web endpoints without proper headers. Modal’s web endpoints can be public or require an API token:

@modal.web_endpoint(method="POST", requires_proxy_auth=True)
def secure_endpoint(...): ...

requires_proxy_auth=True blocks unauthenticated callers at the Modal edge.
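On the client side, a call to a proxy-auth endpoint has to carry the token in request headers. The sketch below builds such a request without sending it. The header names Modal-Key and Modal-Secret are an assumption to confirm against Modal's documentation for your SDK version; build_request, the URL, and the token values are placeholders.

```python
import json
import urllib.request

def build_request(url: str, token_id: str, token_secret: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying proxy-auth headers."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Modal-Key": token_id,        # header names assumed; confirm in Modal's docs
            "Modal-Secret": token_secret,
        },
        method="POST",
    )

req = build_request(
    "https://example.invalid/predict",  # placeholder, not a real endpoint
    "wk-placeholder",
    "ws-placeholder",
    {"input": "hello"},
)
print(req.get_method(), sorted(name for name, _ in req.header_items()))
```

Passing the request to urllib.request.urlopen(req) would send it. A 401 response from a protected endpoint usually means the headers are missing or the token is wrong, not that the function itself failed.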
Fix 7: Secrets
Don’t put API keys in your code. Use Modal Secrets:
modal secret create openai-secret OPENAI_API_KEY=sk-...

Or via the dashboard.

Reference in functions:

@app.function(
    image=image,
    secrets=[modal.Secret.from_name("openai-secret")],
)
def call_openai(prompt: str):
    import os
    from openai import OpenAI
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    return client.chat.completions.create(...)

The secret’s env vars are injected at runtime; they don’t appear in your image or logs.

For multiple secrets:

secrets=[
    modal.Secret.from_name("openai-secret"),
    modal.Secret.from_name("anthropic-secret"),
    modal.Secret.from_name("aws-creds"),
]

For dynamic secrets (per-deployment overrides):

modal.Secret.from_dict({"DB_URL": os.environ["LOCAL_DB_URL"]})

from_dict is fine for dev but don’t commit hard-coded secrets that way.
Fix 8: Cold Starts and container_idle_timeout
Each cold start downloads the image, starts the container, and runs your function. For latency-sensitive workloads, keep containers warm:

@app.function(
    image=image,
    gpu="A10G",
    container_idle_timeout=300,  # Keep idle containers alive for 5 minutes
    allow_concurrent_inputs=10,  # One container handles up to 10 concurrent calls
    keep_warm=1,                 # Always have 1 container ready
)
def predict(payload: dict):
    ...

Three controls:
- container_idle_timeout: how long a container sits idle before being killed.
- allow_concurrent_inputs: concurrent requests per container. Higher means fewer cold starts but more memory pressure.
- keep_warm: number of always-running containers. Costs money even when idle, but requests that land on a warm container skip the cold start entirely.

Newer SDK releases rename these settings (scaledown_window for container_idle_timeout, min_containers for keep_warm, and an @modal.concurrent decorator in place of allow_concurrent_inputs). If you see a deprecation warning, follow the names it suggests.

For GPU functions, even keep_warm=1 is expensive. Use it for production endpoints; for batch jobs, accept cold starts.

Pro Tip: Test cold-start time with modal run --detach. The first call after deploy is a cold start; subsequent calls are warm.
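To put a number on the difference between cold and warm calls, time several calls in a row and compare the first with the rest. The sketch below shows the measurement only: time_calls is a hypothetical helper, and fake_endpoint simulates a cold start with sleeps instead of calling Modal.

```python
import time

def time_calls(fn, n: int) -> list[float]:
    """Call `fn` n times and return each call's wall-clock duration in seconds."""
    durations = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations

_state = {"warm": False}

def fake_endpoint():
    """Stand-in for a remote call: slow on first use, fast afterwards."""
    if not _state["warm"]:
        time.sleep(0.2)   # simulated cold start
        _state["warm"] = True
    time.sleep(0.01)      # simulated request work

durations = time_calls(fake_endpoint, 4)
print([round(d, 3) for d in durations])
```

Against a real deployment you would pass something like lambda: predict.remote(payload) in place of fake_endpoint, and wait longer than the idle timeout between runs to force another cold start.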
Still Not Working?
A few less-obvious failures:
- ImportError despite pip_install. Modal caches images by chain hash. Changing a step rebuilds that step and everything after it, so a pip_install inserted before a run_commands forces run_commands to rerun. Order chain steps from least-changing to most-changing.
- TimeoutError: function exceeded 600s. Default function timeout is 10 minutes. Bump via @app.function(timeout=3600) (1 hour). Max varies by Modal plan.
- GPU function runs on CPU. No gpu= set. The decorator must include gpu="any" or a specific type.
- add_local_dir not picking up changes. Modal caches local syncs. Force re-sync with --detach or bump the version of your function decorator.
- Volumes diverge across regions. Volumes are region-scoped. Functions in different regions reading the same Volume name access different physical volumes. Pin function region or use a different storage backend.
- modal token set fails in CI. CI environments need the API token via env vars: MODAL_TOKEN_ID and MODAL_TOKEN_SECRET. Generate them from the dashboard.
- modal.exception.InvalidError: function does not exist. Either the function isn’t deployed yet (run modal deploy) or the function name in the lookup doesn’t match. Use Function.lookup("my-app", "my-func") with exact names.
- Class.cls deprecated. Modal moved from Stub.cls to App.cls along with Stub → App. Update.
- Cold start exceeds 90s on first call after deploy. Image download dominates. Make the image smaller: pin the Python version, prune CUDA dev tools, and separate model weights into a Volume so they’re not in the image layer. The image size is reported by modal image show.
- Volume reads return old data even after volume.reload(). Volumes are eventually consistent across regions. If your reader is in a different region than the writer, brief staleness is expected. Pin both functions to the same region with region="us-east-1".
- modal serve doesn’t reload on edits inside a subdirectory. The CLI watches the entrypoint file and its direct imports. For deep package imports, add explicit --watch-paths or run modal from a parent directory.
- @modal.web_endpoint is deprecated in newer SDKs. Recent Modal versions replaced it with @modal.fastapi_endpoint and @modal.asgi_app for richer routing. Check the warning text; the migration is mechanical.
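For the CI token item in the list above, a pre-flight check can fail the job early with a clear message instead of a confusing auth error halfway through a deploy. This is a minimal sketch: the variable names come from the list above, while missing_credentials and the placeholder values are hypothetical.

```python
import os

REQUIRED = ("MODAL_TOKEN_ID", "MODAL_TOKEN_SECRET")

def missing_credentials(env=None) -> list[str]:
    """Return the required variable names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]

# Demonstrated with explicit dicts so the result does not depend on your shell.
print(missing_credentials({"MODAL_TOKEN_ID": "ak-placeholder"}))
print(missing_credentials({"MODAL_TOKEN_ID": "ak-placeholder",
                           "MODAL_TOKEN_SECRET": "as-placeholder"}))
```

In a CI script you would call missing_credentials() with no argument and exit non-zero if the list is not empty.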
For related Python deployment and ML serving issues, see vLLM not working, AWS Lambda timeout, PyTorch not working, and Replicate not working.
Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.