Fix: Fly.io Deploy Not Working — fly.toml, Machines, Volumes, Secrets, and Internal DNS
Quick Answer
How to fix Fly.io errors — fly.toml app vs name confusion, machines API vs legacy apps, Dockerfile build failures, volume per-region, secrets staging, fly proxy for local access, and internal IPv6 routing.
The Error
You fly deploy and the build fails halfway through:
==> Building image
Error: failed to fetch an image or build from source: error building:
exit code 1
Or the deploy succeeds but the app crashes immediately:
$ fly status
# State: machine started, then exited
$ fly logs
[error] PORT environment variable not set
Or you create a volume and the machine can’t see it:
$ fly volumes create my-data --region nrt --size 10
$ fly deploy
# Container starts but /data is empty.
Or fly secrets set succeeds but the app doesn’t see the variable:
$ fly secrets set OPENAI_API_KEY=sk-...
$ fly ssh console -C "env | grep OPENAI"
OPENAI_API_KEY=
# Empty.
Why This Happens
Fly.io runs your apps in Firecracker microVMs (“Machines”) in regions worldwide. Most deploy issues map to one of:
- fly.toml is the contract. It declares the app name, primary region, builder, ports, mounts, and health checks. Bugs in fly.toml cause subtle deploy failures.
- Machines vs Apps v1. Older Fly used “apps” with Nomad scheduling. New deploys use Machines (Firecracker VMs). Some tutorials still reference Nomad-style commands. Stick to Machines (fly deploy defaults to it).
- Volumes are region-specific. A volume in nrt (Tokyo) can’t attach to a machine in iad (Virginia). One volume = one machine.
- Secrets can be staged. With --stage, fly secrets set only queues the change; it doesn’t reach running machines until the next deploy (or fly secrets deploy). Without --stage, current flyctl versions restart the app’s machines to apply the change. You can also force a restart with fly machines restart.
A second source of confusion is the Fly mental model itself. Most PaaS platforms hide the VM — you push code, the platform builds a container, runs it, and routes HTTP to it. Fly does that too, but it also exposes the underlying machines as first-class objects. You can fly machines list, stop one, clone another, attach a volume directly, and ssh in. That power means many failures happen at the machine layer (a machine is stopped, a volume is detached, the health check failed) and don’t show up in your fly deploy output. Running fly status after every deploy is the cheapest way to catch these.
A third source is the IPv6-first private network. Apps inside Fly talk to each other over IPv6 via .internal DNS. Code that hard-codes IPv4 dials between Fly apps either fails or takes the long public route. Knowing whether you’re inside Fly, outside Fly, or talking to a Fly app from a non-Fly client changes which hostnames and IP family you use.
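As a minimal sketch of the inside/outside distinction, the helper below picks a hostname based on whether the caller is running on a Fly machine. It assumes the FLY_APP_NAME variable that Fly sets at runtime is a good enough signal; the function name service_url and the example app names are hypothetical, and the .internal form only resolves for apps in the same organization.

```python
import os


def service_url(app: str, port: int, path: str = "/", env=os.environ) -> str:
    """Return a base URL for `app`, depending on where this code runs.

    Assumption: FLY_APP_NAME is present only inside a Fly machine.
    """
    inside_fly = bool(env.get("FLY_APP_NAME"))
    if inside_fly:
        # Private network: plain HTTP to the app's internal port.
        return f"http://{app}.internal:{port}{path}"
    # Outside Fly: go through the public HTTPS endpoint.
    return f"https://{app}.fly.dev{path}"
```

Passing env explicitly keeps the function easy to test locally without touching the real environment.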
Fix 1: Write a Working fly.toml
app = "my-app"
primary_region = "nrt"
[build]
# Builder is auto-detected from Dockerfile or buildpacks
# Or explicit:
# builder = "paketobuildpacks/builder:base"
# dockerfile = "Dockerfile.prod"
[env]
NODE_ENV = "production"
PORT = "8080"
[http_service]
internal_port = 8080
force_https = true
auto_stop_machines = "stop" # Stop idle machines to save money
auto_start_machines = true
min_machines_running = 0
[[http_service.checks]]
interval = "10s"
timeout = "2s"
grace_period = "5s"
method = "GET"
path = "/health"
[[vm]]
cpu_kind = "shared"
cpus = 1
memory_mb = 256
Key sections:
- app: your unique Fly app name.
- primary_region: where new machines spawn by default.
- http_service: exposes HTTP, terminates TLS, handles the force_https redirect.
- http_service.checks: health checks. If they fail, Fly marks the machine unhealthy and stops sending traffic.
- vm: sizing per machine. shared-cpu-1x with 256 MB is the cheapest tier.
Pro Tip: Generate a starter fly.toml with fly launch --no-deploy. It auto-detects your stack (Node, Python, Go, Rust) and writes a sensible default. Then edit before deploying.
Fix 2: Inspect Build Failures
When fly deploy fails during build:
fly deploy --verbose
Verbose mode prints the full Dockerfile build log. The most common failures:
- Missing files in build context. .dockerignore excludes them. Check roughly what’s getting sent (tar’s exclude patterns only approximate .dockerignore syntax): tar -czf - . --exclude-from=.dockerignore | tar tz | head.
- Network issues fetching deps. npm install or pip install times out. Add retries or a build-time cache.
- Image too large. Fly’s free tier has size limits. Use multi-stage builds to ship only the final artifact:
# Build stage
FROM node:20-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
# Runtime stage
FROM node:20-alpine
WORKDIR /app
COPY --from=build /app/dist ./dist
COPY --from=build /app/node_modules ./node_modules
COPY package*.json ./
EXPOSE 8080
CMD ["node", "dist/server.js"]
For Buildpacks or Nixpacks instead of Dockerfile:
[build]
builder = "paketobuildpacks/builder:base"
Buildpacks auto-detect your stack. Slower than a tuned Dockerfile but no Dockerfile needed.
Fix 3: Pick the Right Port
The container’s process must listen on internal_port (declared in fly.toml). Fly does not inject PORT on its own; set it in [env] to the same number (as in Fix 1) and read it at startup:
# Python example (assumes a Flask-style app object that exposes run()):
import os
port = int(os.environ.get("PORT", 8080))
app.run(host="0.0.0.0", port=port)
Critical: bind to 0.0.0.0, not 127.0.0.1. Fly’s proxy connects to your process over the VM’s network interface, not loopback, so a loopback-only server is invisible to it.
For Node/Express:
const port = process.env.PORT || 8080;
app.listen(port, "0.0.0.0", () => {
  console.log(`listening on ${port}`);
});
If health checks fail with “connection refused,” the process either isn’t listening or is bound to localhost only.
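To see the binding rule in isolation, here is a minimal standard-library sketch of a server that listens on all interfaces and answers the /health path used in the fly.toml above. The names HealthHandler and make_server are illustrative, and the fallback port 8080 is an assumption that matches the internal_port in this article.

```python
import os
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Answer the health check path; everything else is a 404.
        if self.path == "/health":
            status, body = 200, b"ok"
        else:
            status, body = 404, b"not found"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # Silence per-request logging for this sketch.


def make_server(port=None):
    """Build a server bound to 0.0.0.0 so it is reachable from outside loopback."""
    if port is None:
        port = int(os.environ.get("PORT", "8080"))
    return ThreadingHTTPServer(("0.0.0.0", port), HealthHandler)
```

Calling make_server().serve_forever() starts it; swapping "0.0.0.0" for "127.0.0.1" reproduces the “connection refused” health check failure described above.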
Common Mistake: Setting PORT=3000 in [env] of fly.toml and a Dockerfile EXPOSE 8080. Fly uses internal_port from http_service; the EXPOSE and PORT env should match it. Pick one number and use it everywhere.
Fix 4: Volumes Are Per-Region, Per-Machine
Create a volume in a specific region:
fly volumes create my-data --region nrt --size 10
Attach in fly.toml:
[[mounts]]
source = "my-data"
destination = "/data"
initial_size = "10gb"
When you deploy, Fly attaches the volume to a machine in the same region. If you scale to 2 machines, you need 2 volumes (one per machine):
fly volumes create my-data --region nrt --size 10 # Each call creates a new volume, even with the same name
fly volumes create my-data --region nrt --size 10
fly scale count 2
Common Mistake: Expecting one volume to be shared across machines. Volumes are local SSD attached to a single machine. For shared storage, use Tigris (Fly’s S3-compatible object storage) or LiteFS for distributed SQLite.
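The one-volume-per-machine rule turns into simple arithmetic when planning a scale-up. The sketch below assumes you already know how many machines you want per region and which regions your existing volumes live in; missing_volumes is a hypothetical helper and does not call flyctl.

```python
from collections import Counter


def missing_volumes(machines_per_region: dict[str, int],
                    volume_regions: list[str]) -> dict[str, int]:
    """Return how many extra volumes each region needs (one volume per machine).

    volume_regions holds one entry per existing volume, e.g. ["nrt", "nrt"].
    """
    have = Counter(volume_regions)  # Missing regions count as zero.
    return {
        region: wanted - have[region]
        for region, wanted in machines_per_region.items()
        if wanted > have[region]
    }
```

For a target of two machines in nrt and one in iad with a single existing nrt volume, the result says one more volume is needed in each region before scaling.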
For databases that need consistent storage:
[[mounts]]
source = "postgres_data"
destination = "/var/lib/postgresql/data"
Then pin the app to a single machine (no auto-scale) or use Fly Postgres (managed).
Fix 5: Secrets and Staging
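fly secrets set only queues the change when you pass --stage (or when the app has no machines yet); in that case the app doesn’t see new values until its machines are redeployed or restarted:
fly secrets set --stage OPENAI_API_KEY=sk-...
# Secret staged. To apply, redeploy or restart machines.
fly secrets deploy
# Or:
fly machines restart
# Or:
fly deploy
Set multiple at once (avoids multiple restarts):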
fly secrets set \
OPENAI_API_KEY=sk-... \
DATABASE_URL=postgres://... \
STRIPE_KEY=sk_test_...
For dev:
# Import from .env file:
fly secrets import < .env
To list (names only, values hidden):
fly secrets list
To remove:
fly secrets unset OPENAI_API_KEY
Pro Tip: Use a separate Fly app per environment (my-app-dev, my-app-prod). Secrets are per-app, so there is no risk of accidentally pushing prod secrets to dev.
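If you script secret updates, batching them into one fly secrets set call and naming the target app explicitly covers both tips above. The sketch below is illustrative only: parse_dotenv handles just simple NAME=value lines (no multi-line values or export prefixes), and both function names are hypothetical.

```python
def parse_dotenv(text: str) -> dict[str, str]:
    """Parse simple NAME=value lines, skipping blanks and # comments."""
    pairs = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        # Drop surrounding quotes, if any.
        pairs[name.strip()] = value.strip().strip('"').strip("'")
    return pairs


def secrets_set_command(pairs: dict[str, str], app: str) -> list[str]:
    """Build one `fly secrets set` argv covering every pair for one app."""
    argv = ["fly", "secrets", "set", "--app", app]
    argv += [f"{name}={value}" for name, value in pairs.items()]
    return argv
```

The returned list can be handed to subprocess.run(argv, check=True); building an argv list rather than a shell string avoids quoting problems with values such as database URLs.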
Fix 6: Local Access via fly proxy
For databases and internal services that aren’t HTTP-exposed:
# Connect to your Fly Postgres locally:
fly proxy 5432:5432 -a my-postgres-app
# Now psql can connect:
psql postgres://user:pass@localhost:5432/dbname
fly proxy opens a WireGuard tunnel from your laptop into the app’s private network and forwards the local port to the internal service. Useful for one-off psql, redis-cli, mongosh sessions.
For Redis:
fly proxy 6379:6379 -a my-redis-app
redis-cli -h localhost -p 6379
For SSH into a running machine:
fly ssh console
# Or specific machine:
fly ssh console --machine <machine-id>
Common Mistake: Trying to connect to <app>.fly.dev:5432. Fly’s external HTTPS endpoint only proxies HTTP. For TCP services, use fly proxy or attach a public IPv4/IPv6 with proper port config.
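When a script depends on a fly proxy tunnel, it helps to confirm the local port is accepting connections before launching psql or redis-cli. A minimal sketch, assuming the tunnel listens on localhost; port_is_open is a hypothetical helper name.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

For example, port_is_open("localhost", 5432) returning False usually means fly proxy is not running yet or is bound to a different local port.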
Fix 7: Internal IPv6 Networking
Fly’s internal network is IPv6-only by default. App-to-app calls use .internal DNS:
# From app A, calling app B:
import httpx
response = httpx.get("http://my-app-b.internal:8080/api/health")
<app-name>.internal resolves to the IPv6 addresses (AAAA records) of the app’s running machines on the org’s private network.
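To check what an internal name actually resolves to from inside a machine, you can ask the resolver for IPv6 results only. This is a sketch using the standard library: internal_host and resolve_ipv6 are hypothetical helpers, and the optional region prefix follows the <region>.<app>.internal form used in this section.

```python
import socket
from typing import Optional


def internal_host(app: str, region: Optional[str] = None) -> str:
    """Build an app-wide or region-scoped .internal hostname."""
    return f"{region}.{app}.internal" if region else f"{app}.internal"


def resolve_ipv6(host: str, port: int) -> list[str]:
    """Return the distinct IPv6 addresses the resolver gives for host."""
    infos = socket.getaddrinfo(
        host, port, family=socket.AF_INET6, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})
```

Run resolve_ipv6(internal_host("my-app-b"), 8080) from a Fly machine in the same org; outside Fly the lookup raises socket.gaierror because .internal names are not public.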
For region-specific routing:
# Hit a machine in a specific region:
"http://nrt.my-app-b.internal:8080"
# Resolve machines in all regions:
"http://global.my-app-b.internal:8080" # DNS returns every machine; the client picks one
Common Mistake: Trying to call my-app-b.fly.dev from inside another Fly app. That round-trips through the public edge, which is slow and wastes bandwidth. Use .internal for app-to-app.
For Postgres connections inside the org:
postgres://user:pass@<pg-app>.internal:5432/dbname
Fix 8: Scale and Auto-Stop
auto_stop_machines = "stop" stops idle machines to save money:
[http_service]
auto_stop_machines = "stop" # or "off" to never stop
auto_start_machines = true # Start on incoming traffic
min_machines_running = 0 # Number to keep always running
A stopped machine isn’t billed for CPU or RAM, but it adds a ~1-2s cold start when a request arrives. For latency-sensitive apps, set min_machines_running = 1.
Scale manually:
fly scale count 3 # 3 machines total
fly scale count 1 --region nrt # 1 machine in Tokyo
fly scale count 2 --region nrt --region iad # 2 each in nrt and iad
fly scale vm shared-cpu-2x --memory 1024 # Resize VMs
Scale by region:
fly scale count 2 --region nrt
fly scale count 1 --region iad
fly scale count 1 --region fra
Pro Tip: Use fly logs --region nrt to filter logs per region when debugging multi-region issues.
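If the per-region counts live in a config file or a dict, the scale commands above can be generated rather than typed. A small sketch, assuming one fly scale count call per region as in the example; scale_commands is a hypothetical helper that only builds argument lists and does not run flyctl.

```python
def scale_commands(plan: dict[str, int], app: str) -> list[list[str]]:
    """Turn {"region": count} into one `fly scale count` argv per region."""
    commands = []
    for region, count in sorted(plan.items()):
        if count < 0:
            raise ValueError(f"negative machine count for {region}: {count}")
        commands.append(
            ["fly", "scale", "count", str(count), "--region", region, "--app", app]
        )
    return commands
```

Each inner list can be passed to subprocess.run; sorting the regions keeps the output stable between runs, which makes the plan easier to review.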
Fly.io vs Railway vs Render vs Heroku vs Cloud Run vs Vercel: Picking the Right PaaS
Fly.io shares a category with Railway, Render, Heroku, Google Cloud Run, and Vercel, but each makes very different trade-offs. Knowing which one you’d actually want often turns a “Fly deploy not working” bug into a “this app shouldn’t be on Fly” decision.
Fly.io runs Firecracker microVMs in regions worldwide, with persistent block-storage volumes, IPv6 private networking, and per-machine control. The model is closer to a globally distributed VM platform than a typical PaaS. Fly is the right pick for low-latency multi-region apps, sticky-session workloads, apps that need persistent local SSD (LiteFS, embedded databases), or anything that benefits from “the same app running in 10 cities at once.” The cost: you write a fly.toml, manage volumes per region, and learn the machines model.
Railway is the closest direct competitor in feel — push a repo, pick a region, get a URL. Railway has nicer defaults around managed Postgres, Redis, and one-click environments per branch. It has no first-class multi-region story and no equivalent of LiteFS or per-machine volumes. Good for small teams that want a Heroku-style experience without writing Dockerfiles.
Render is the most Heroku-like of the bunch. Free TLS, autoscaling, managed Postgres/Redis, background workers, cron jobs. It runs in two US regions and a few elsewhere, and there’s no concept of “deploy to every region.” Stick with Render if your app is single-region and you want a hosted Postgres without thinking about it.
Heroku still exists, still works, still charges for what used to be free. The dyno model is the original PaaS abstraction. Pick Heroku only if you’re already on it or you specifically want its add-ons.
Google Cloud Run is serverless containers — runs only on request, scales to zero, pay per-100ms. No persistent storage, no sticky sessions, no long-running workers. Right for stateless HTTP; wrong for anything stateful.
Vercel is purpose-built for Next.js with serverless functions and an edge runtime. The integration is unmatched for Next apps. For non-Next workloads (Rails API, Go gRPC, workers), it’s awkward.
Fly’s edge: persistent state at the edge (LiteFS), TCP services (Postgres, Redis, custom protocols), and per-region pricing control. Pick Fly when those matter; pick something simpler when they don’t.
Still Not Working?
A few less-obvious failures:
- fly launch overwrites your fly.toml. Use fly launch --no-deploy and review the generated file before deploying. Or skip launch if you already have a working config.
- deploy succeeds but fly status shows “no machines.” The Dockerfile’s CMD exits immediately. Make sure your process keeps running (don’t exec a one-shot command).
- Free tier bandwidth exceeded. Fly’s free allowance covers basic apps. Heavy traffic or large image pulls eat into it. Check usage in the dashboard.
- fly deploy is slow even with cache. The image is huge (GB+). Use --build-only to inspect the image, multi-stage to slim it, or use --remote-only to skip local Docker.
- App can’t connect to managed Postgres. Use the internal hostname (<pg-app>.internal:5432), not the public hostname. Check the connection string in the output of Fly’s attach command.
- Sudden Error: machines updated; the machine cannot be ssh'd into during deploy. Deploy is in progress; wait for it to finish, then retry.
- hostsync.fly.dev lookup fails. Internal DNS is region-aware; sometimes a region has issues. Try nslookup my-app.internal from a different region or check Fly’s status page.
- LiteFS errors after deploy. LiteFS needs a leader/replica config in litefs.yml and a Consul or static lease. Without it, all nodes try to be the leader and writes fail. Pin one machine as the primary.
- fly deploy times out on the health check but the app is actually healthy. The check path returns 200 but takes longer than timeout because the app is doing one-shot initialization. Move warmup work into a background task and have the health endpoint return immediately, or increase grace_period so the first check starts later.
- Auto-stopped machines never wake up. auto_start_machines = true requires that the proxy can reach a stopped machine’s metadata. If you’ve set min_machines_running = 0 and disabled all regions except one, and that region has an outage, requests fail. Keep at least one warm machine in a fallback region if uptime matters.
- fly deploy succeeds but the app uses the old code. The build reused a cached COPY . . layer. A bare RUN echo $(date) does not help, because the instruction text never changes. Declare ARG CACHEBUST and add RUN echo "$CACHEBUST" just before the copy, then pass --build-arg CACHEBUST=$(date +%s); or rebuild with fly deploy --no-cache. Fly’s remote builder respects Docker cache the same way local builds do.
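The health check item above suggests moving warmup off the request path. One way to sketch that with the standard library is shown below: slow initialization runs in a background thread while /health answers immediately. The names ready, warm_up, WarmupHandler, and start_app are illustrative, and returning 200 during warmup is an assumption; some setups prefer a 503 until the app is ready.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

ready = threading.Event()  # Set once one-shot initialization has finished.


def warm_up(load):
    """Run slow initialization in the background, then mark the app ready."""
    load()
    ready.set()


class WarmupHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            # Answer right away instead of waiting for warmup to finish.
            status = 200
            body = b"ready" if ready.is_set() else b"starting"
        else:
            status, body = 404, b"not found"
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # Silence per-request logging for this sketch.


def start_app(port, load):
    """Kick off warmup in a daemon thread and return an unstarted server."""
    threading.Thread(target=warm_up, args=(load,), daemon=True).start()
    return ThreadingHTTPServer(("0.0.0.0", port), WarmupHandler)
```

Call start_app(8080, load_models).serve_forever(), where load_models stands in for whatever your app does at startup; the health endpoint then responds within the check timeout even while that work is still running.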
For related deployment and edge computing issues, see Cloudflare D1 not working, Docker Compose service failed to build, Heroku h10 app crashed, and Postgres connection refused.
Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.