
Fix: Fly.io Deploy Not Working — fly.toml, Machines, Volumes, Secrets, and Internal DNS

FixDevs · (Updated: )


Quick Answer

How to fix Fly.io errors — fly.toml app vs name confusion, machines API vs legacy apps, Dockerfile build failures, volume per-region, secrets staging, fly proxy for local access, and internal IPv6 routing.

The Error

You run fly deploy and the build fails halfway through:

==> Building image
Error: failed to fetch an image or build from source: error building: 
exit code 1

Or the deploy succeeds but the app crashes immediately:

$ fly status
# State: machine started, then exited
$ fly logs
[error] PORT environment variable not set

Or you create a volume and the machine can’t see it:

$ fly volumes create my_data --region nrt --size 10
$ fly deploy
# Container starts but /data is empty.

Or fly secrets set succeeds but the app doesn’t see the variable:

$ fly secrets set OPENAI_API_KEY=sk-...
$ fly ssh console -C "env | grep OPENAI"
OPENAI_API_KEY=
# Empty.

Why This Happens

Fly.io runs your apps in Firecracker microVMs (“Machines”) in regions worldwide. Most deploy issues map to one of:

  • fly.toml is the contract. It declares the app name, primary region, builder, ports, mounts, and health checks. Bugs in fly.toml cause subtle deploy failures.
  • Machines vs Apps v1. Older Fly used “apps” with Nomad scheduling. New deploys use Machines (Firecracker VMs). Some tutorials still reference Nomad-style commands. Stick to Machines (fly deploy defaults to it).
  • Volumes are region-specific. A volume in nrt (Tokyo) can’t attach to a machine in iad (Virginia), and each volume attaches to only one machine at a time.
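  • Secrets only reach a machine when it is updated. fly secrets set normally rolls the new value out to the app’s machines right away, but with --stage, or on an app that has no machines yet, the value is only stored. Staged secrets apply on the next fly deploy or fly secrets deploy.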

A second source of confusion is the Fly mental model itself. Most PaaS platforms hide the VM — you push code, the platform builds a container, runs it, and routes HTTP to it. Fly does that too, but it also exposes the underlying machines as first-class objects. You can fly machines list, stop one, clone another, attach a volume directly, and ssh in. That power means many failures happen at the machine layer (a machine is stopped, a volume is detached, the health check failed) and don’t show up in your fly deploy output. Running fly status after every deploy is the cheapest way to catch these.

A third source is the IPv6-first private network. Apps inside Fly talk to each other over IPv6 via .internal DNS. Code that hard-codes IPv4 when dialing other Fly apps either fails or takes the long route over the public internet. Whether you’re inside Fly, outside Fly, or talking to a Fly app from a non-Fly client determines which hostnames and IP family you use.
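The inside-versus-outside decision can be made in code. The sketch below assumes that the FLY_APP_NAME environment variable is present only when the process runs inside a Fly Machine; the fly_base_url helper is a name made up for this illustration.

```python
import os


def fly_base_url(app: str, port: int, environ=None) -> str:
    """Pick a base URL for calling another Fly app.

    Assumption: FLY_APP_NAME is set only when this process runs
    inside a Fly Machine.
    """
    env = os.environ if environ is None else environ
    if env.get("FLY_APP_NAME"):
        # Inside Fly: stay on the private IPv6 network.
        return f"http://{app}.internal:{port}"
    # Outside Fly: go through the public edge over HTTPS.
    return f"https://{app}.fly.dev"
```

Passing environ explicitly keeps the function easy to test; in production code you would call it with just the app name and port.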

Fix 1: Write a Working fly.toml

app = "my-app"
primary_region = "nrt"

[build]
  # Builder is auto-detected from Dockerfile or buildpacks
  # Or explicit:
  # builder = "paketobuildpacks/builder:base"
  # dockerfile = "Dockerfile.prod"

[env]
  NODE_ENV = "production"
  PORT = "8080"

[http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = "stop"  # Stop idle machines to save money
  auto_start_machines = true
  min_machines_running = 0

  [[http_service.checks]]
    interval = "10s"
    timeout = "2s"
    grace_period = "5s"
    method = "GET"
    path = "/health"

[[vm]]
  cpu_kind = "shared"
  cpus = 1
  memory_mb = 256

Key sections:

  • app — your unique Fly app name.
  • primary_region — where new machines spawn by default.
  • http_service — exposes HTTP, terminates TLS, handles force_https redirect.
  • http_service.checks — health checks. If they fail, Fly marks the machine unhealthy and stops sending traffic.
  • vm — sizing per machine. shared-cpu-1x with 256 MB is the cheapest tier.
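A health check like the one configured above needs an endpoint that answers quickly on the same port. The following is a minimal sketch using only Python’s standard library; Handler and make_server are names chosen for this illustration, and the /health path and port 8080 simply mirror the example fly.toml.

```python
import os
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Answer the health check path with 200, everything else with 404.
        if self.path == "/health":
            status, body = 200, b"ok"
        else:
            status, body = 404, b"not found"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def make_server(port=None):
    """Bind to all interfaces on PORT (default 8080)."""
    if port is None:
        port = int(os.environ.get("PORT", "8080"))
    return ThreadingHTTPServer(("0.0.0.0", port), Handler)

# To run: make_server().serve_forever()
```

A real app would serve its own routes as well; the point is only that the check path responds without doing any slow work.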

Pro Tip: Generate a starter fly.toml with fly launch --no-deploy. It auto-detects your stack (Node, Python, Go, Rust) and writes a sensible default. Then edit before deploying.

Fix 2: Inspect Build Failures

When fly deploy fails during build:

fly deploy --verbose

Verbose mode prints the full Dockerfile build log. The most common failures:

  • Missing files in build context. .dockerignore excludes them. For a rough look at what gets sent (tar’s exclude patterns only approximate .dockerignore syntax): tar -czf - --exclude-from=.dockerignore . | tar -tzf - | head.
  • Network issues fetching deps. npm install or pip install times out. Add retries or a build-time cache.
  • Image too large. Large images slow down every deploy and can run into Fly’s image size limits. Use multi-stage builds to ship only the final artifact:
# Build stage
FROM node:20-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Runtime stage
FROM node:20-alpine
WORKDIR /app
COPY --from=build /app/dist ./dist
COPY --from=build /app/node_modules ./node_modules
COPY package*.json ./
EXPOSE 8080
CMD ["node", "dist/server.js"]

For Buildpacks or Nixpacks instead of Dockerfile:

[build]
  builder = "paketobuildpacks/builder:base"

Buildpacks auto-detect your stack. Slower than a tuned Dockerfile but no Dockerfile needed.

Fix 3: Pick the Right Port

The container’s process must listen on internal_port (declared in fly.toml). Fly does not set PORT for you: define it under [env] in fly.toml (or hard-code the number) and read it in the app:

# Python example (assumes Flask):
import os
from flask import Flask

app = Flask(__name__)
port = int(os.environ.get("PORT", "8080"))
app.run(host="0.0.0.0", port=port)

Critical: bind to 0.0.0.0, not 127.0.0.1. Fly’s proxy reaches your process over the VM’s network interface, not over loopback, so a server bound only to 127.0.0.1 is invisible to it.

For Node/Express:

const port = Number(process.env.PORT) || 8080;
app.listen(port, "0.0.0.0", () => {
  console.log(`listening on ${port}`);
});

If health checks fail with “connection refused,” the process either isn’t listening or is bound to localhost only.

Common Mistake: Setting PORT=3000 in [env] of fly.toml and a Dockerfile EXPOSE 8080. Fly uses internal_port from http_service; the EXPOSE and PORT env should match it. Pick one number and use it everywhere.

Fix 4: Volumes Are Per-Region, Per-Machine

Create a volume in a specific region (volume names may contain only lowercase letters, digits, and underscores, so use my_data rather than my-data):

fly volumes create my_data --region nrt --size 10

Attach in fly.toml:

[[mounts]]
  source = "my_data"
  destination = "/data"
  initial_size = "10gb"

When you deploy, Fly attaches the volume to a machine in the same region. If you scale to 2 machines, you need 2 volumes (one per machine):

fly volumes create my_data --region nrt --size 10  # Creates a new volume each time
fly volumes create my_data --region nrt --size 10
fly scale count 2

Common Mistake: Expecting one volume to be shared across machines. Volumes are local SSD attached to a single machine. For shared storage, use Tigris (Fly’s S3-compatible object storage) or LiteFS for distributed SQLite.
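The one-volume-per-machine rule turns into simple bookkeeping when planning a scale-out. The sketch below is illustrative only: volumes_to_create and create_commands are hypothetical helpers, and the region counts are supplied by you rather than read from Fly.

```python
def volumes_to_create(desired_machines: dict[str, int],
                      existing_volumes: dict[str, int]) -> dict[str, int]:
    """Return how many extra volumes each region needs (one per machine)."""
    missing = {}
    for region, machines in desired_machines.items():
        shortfall = machines - existing_volumes.get(region, 0)
        if shortfall > 0:
            missing[region] = shortfall
    return missing


def create_commands(name: str, size_gb: int, missing: dict[str, int]) -> list[str]:
    """Build one `fly volumes create` command per missing volume."""
    cmds = []
    for region, count in sorted(missing.items()):
        cmd = f"fly volumes create {name} --region {region} --size {size_gb}"
        cmds.extend([cmd] * count)
    return cmds
```

For two machines wanted in nrt with one volume already there, volumes_to_create({"nrt": 2}, {"nrt": 1}) reports a shortfall of one volume in nrt.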

For databases that need consistent storage:

[[mounts]]
  source = "postgres_data"
  destination = "/var/lib/postgresql/data"

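Then pin the app to a single machine (no auto-scale), or use one of Fly’s Postgres offerings instead of running the database yourself. Note that the older Fly Postgres is not a managed service; Fly’s Managed Postgres is the managed option.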

Fix 5: Secrets and Staging

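By default, fly secrets set rolls the new value out to the app’s machines right away. With --stage (or when the app has no machines yet) the value is only stored, and the app doesn’t see it until the next deploy:

fly secrets set OPENAI_API_KEY=sk-...
# Updates the app's machines unless you pass --stage.

fly secrets set --stage OPENAI_API_KEY=sk-...
# Stored only. Apply staged secrets with:
fly secrets deploy
# Or:
fly deploy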
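Because a secret that has not been applied shows up as a missing or empty variable, it can help to fail loudly at startup rather than on the first API call. A minimal sketch, with missing_secrets and require_secrets as hypothetical helper names:

```python
import os


def missing_secrets(required, environ=None):
    """Return the names that are unset or blank in the environment."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name, "").strip()]


def require_secrets(required, environ=None):
    """Raise at startup if any required secret is unset or blank."""
    missing = missing_secrets(required, environ)
    if missing:
        raise RuntimeError("missing or empty secrets: " + ", ".join(missing))
```

Calling require_secrets(["OPENAI_API_KEY"]) as the first line of the app makes the problem visible in fly logs immediately after the machine boots.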

Set multiple at once (avoids multiple restarts):

fly secrets set \
  OPENAI_API_KEY=sk-... \
  DATABASE_URL=postgres://... \
  STRIPE_KEY=sk_test_...

For dev:

# Import from .env file:
fly secrets import < .env

To list (names only, values hidden):

fly secrets list

To remove:

fly secrets unset OPENAI_API_KEY

Pro Tip: Use a separate Fly app per environment (my-app-dev, my-app-prod). Secrets are per-app — no risk of accidentally pushing prod secrets to dev.

Fix 6: Local Access via fly proxy

For databases and internal services that aren’t HTTP-exposed:

# Connect to your Fly Postgres locally:
fly proxy 5432:5432 -a my-postgres-app

# Now psql can connect:
psql postgres://user:pass@localhost:5432/dbname

fly proxy opens a WireGuard tunnel from your laptop into the app’s private network and forwards a local port to the internal service. Useful for one-off psql, redis-cli, mongosh sessions.

For Redis:

fly proxy 6379:6379 -a my-redis-app
redis-cli -h localhost -p 6379

For SSH into a running machine:

fly ssh console
# Or specific machine:
fly ssh console --machine <machine-id>

Common Mistake: Trying to connect to <app>.fly.dev:5432. An [http_service] section only exposes ports 80 and 443 publicly. For TCP services, use fly proxy, or declare the port in a [[services]] section and allocate a public IPv4/IPv6 address.

Fix 7: Internal IPv6 Networking

Fly’s internal network is IPv6-only by default. App-to-app calls use .internal DNS:

# From app A, calling app B:
import httpx
response = httpx.get("http://my-app-b.internal:8080/api/health")

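<app-name>.internal resolves to the private IPv6 (6PN) addresses of the app’s running machines in every region of the org’s private network. It is plain DNS: there is no load balancing and no health checking. To reach the nearest machine, query top1.nearest.of.<app-name>.internal.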

For region-specific routing:

# Hit a machine in a specific region:
"http://nrt.my-app-b.internal:8080"

# Resolves to machines in every region (plain DNS, no load balancing):
"http://global.my-app-b.internal:8080"

Common Mistake: Trying to call my-app-b.fly.dev from inside another Fly app. That round-trips through the public edge — slow and wastes bandwidth. Use .internal for app-to-app.
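A related IPv6 pitfall appears when code resolves a .internal name itself and then builds a URL from the address: an IPv6 literal must be wrapped in square brackets or the port is parsed as part of the address. The sketch below shows one way to handle both cases; url_host and http_url are hypothetical helpers and the address in the usage note is made up.

```python
import ipaddress


def url_host(host: str) -> str:
    """Wrap IPv6 literals in brackets so they are valid inside a URL."""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return host  # a DNS name such as my-app-b.internal
    return f"[{ip}]" if ip.version == 6 else str(ip)


def http_url(host: str, port: int, path: str = "/") -> str:
    """Build an http:// URL from a host name or IP literal."""
    return f"http://{url_host(host)}:{port}{path}"
```

With an invented private address, http_url("fdaa:0:1:a7b:7d:0:1:2", 8080, "/api/health") yields http://[fdaa:0:1:a7b:7d:0:1:2]:8080/api/health, while DNS names pass through unchanged.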

For Postgres connections inside the org:

postgres://user:<password>@<pg-app>.internal:5432/dbname

Fix 8: Scale and Auto-Stop

auto_stop_machines = "stop" stops idle machines to save money:

[http_service]
  auto_stop_machines = "stop"   # or "off" to never stop
  auto_start_machines = true     # Start on incoming traffic
  min_machines_running = 0       # Number to keep always running

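A stopped machine is not billed for CPU or RAM (you still pay for its storage, including any volumes), but the first request after an idle period waits roughly 1-2s for a cold start. For latency-sensitive apps, set min_machines_running = 1.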
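Callers with tight timeouts may see the first request after an idle period fail while the machine boots. One hedge is a small retry with exponential backoff on the client side. This is a generic sketch, not Fly-specific behavior: call_with_retry is a hypothetical helper, and the attempt count and delays are arbitrary starting points.

```python
import time


def call_with_retry(fn, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fn(); on a connection error or timeout, back off and retry.

    Delays grow as base_delay, 2x, 4x, ... and the last error is re-raised.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Wrap only idempotent requests this way; retrying a non-idempotent write can apply it twice.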

Scale manually:

fly scale count 3                  # 3 machines total
fly scale count 1 --region nrt     # 1 machine in Tokyo
fly scale count 2 --region nrt,iad # Regions are a comma-separated list; review the printed plan before confirming
fly scale vm shared-cpu-2x         # Resize VMs
fly scale memory 1024              # Set memory in MB

Scale by region:

fly scale count 2 --region nrt
fly scale count 1 --region iad
fly scale count 1 --region fra

Pro Tip: Use fly logs --region nrt to filter logs per region when debugging multi-region issues.

Fly.io vs Railway vs Render vs Heroku vs Cloud Run vs Vercel: Picking the Right PaaS

Fly.io shares a category with Railway, Render, Heroku, Google Cloud Run, and Vercel, but each makes very different trade-offs. Knowing which one you’d actually want often turns a “Fly deploy not working” bug into a “this app shouldn’t be on Fly” decision.

Fly.io runs Firecracker microVMs in regions worldwide, with persistent block-storage volumes, IPv6 private networking, and per-machine control. The model is closer to a globally distributed VM platform than a typical PaaS. Fly is the right pick for low-latency multi-region apps, sticky-session workloads, apps that need persistent local SSD (LiteFS, embedded databases), or anything that benefits from “the same app running in 10 cities at once.” The cost: you write a fly.toml, manage volumes per region, and learn the machines model.

Railway is the closest direct competitor in feel — push a repo, pick a region, get a URL. Railway has nicer defaults around managed Postgres, Redis, and one-click environments per branch. It has no first-class multi-region story and no equivalent of LiteFS or per-machine volumes. Good for small teams that want a Heroku-style experience without writing Dockerfiles.

Render is the most Heroku-like of the bunch. Free TLS, autoscaling, managed Postgres/Redis, background workers, cron jobs. It runs in two US regions and a few elsewhere, and there’s no concept of “deploy to every region.” Stick with Render if your app is single-region and you want a hosted Postgres without thinking about it.

Heroku still exists, still works, still charges for what used to be free. The dyno model is the original PaaS abstraction. Pick Heroku only if you’re already on it or you specifically want its add-ons.

Google Cloud Run is serverless containers: it runs on request, scales to zero, and bills in 100 ms increments. There is no persistent local disk, session affinity is best-effort only, and request-driven instances are a poor fit for long-running workers. Right for stateless HTTP; wrong for anything stateful.

Vercel is purpose-built for Next.js with serverless functions and an edge runtime. The integration is unmatched for Next apps. For non-Next workloads (Rails API, Go gRPC, workers), it’s awkward.

Fly’s edge: persistent state at the edge (LiteFS), TCP services (Postgres, Redis, custom protocols), and per-region pricing control. Pick Fly when those matter; pick something simpler when they don’t.

Still Not Working?

A few less-obvious failures:

  • fly launch overwrites your fly.toml. Use fly launch --no-deploy and review the generated file before deploying. Or skip launch if you already have a working config.
  • deploy succeeds but fly status shows “no machines.” The Dockerfile’s CMD exits immediately. Make sure your process keeps running (don’t exec a one-shot command).
  • Free tier bandwidth exceeded. Fly’s free allowance covers basic apps. Heavy traffic or large image pulls eat into it. Check usage in the dashboard.
  • fly deploy is slow even with cache. The image is huge (GB+). Use --build-only to inspect the image, multi-stage to slim it, or use --remote-only to skip local Docker.
  • App can’t connect to managed Postgres. Use the internal hostname (<pg-app>.internal:5432), not the public hostname. Check the connection string that Fly’s attach command printed.
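  • Sudden Error: machines updated; the machine cannot be ssh'd into during deploy. A deploy is replacing the machine you targeted; wait for it to finish, then retry. If you only need diagnostics, run fly machine list and connect to a machine that is in the started state with fly ssh console --machine <machine-id>.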
  • hostsync.fly.dev lookup fails. Internal DNS is region-aware; sometimes a region has issues. Try nslookup my-app.internal from a different region or check Fly’s status page.
  • LiteFS errors after deploy. LiteFS needs a leader/replica config in litefs.yml and Consul or static lease. Without it, all nodes try to be the leader and writes fail. Pin one machine as the primary.
  • fly deploy times out on the health check but the app is actually healthy. The check path returns 200 but takes more than timeout seconds because the app is doing one-shot initialization. Move warmup work into a background task and have the health endpoint return immediately, or increase grace_period so the first check starts later.
  • Auto-stopped machines never wake up. auto_start_machines = true requires that the proxy can reach a stopped machine’s metadata — if you’ve set min_machines_running = 0 and disabled all regions except one, and that region has an outage, requests fail. Keep at least one warm machine in a fallback region if uptime matters.
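  • fly deploy succeeds but the app uses the old code. Docker invalidates a COPY . . layer whenever the copied files change, so stale code usually means the changed files are excluded by .dockerignore or the deploy ran from the wrong directory. To rule out the cache, run fly deploy --no-cache. Adding a RUN echo $(date) line does not help, because the instruction text never changes and the layer stays cached. Passing --build-arg CACHEBUST=$(date +%s) only has an effect if the Dockerfile declares ARG CACHEBUST and uses it in a step before the COPY. Fly’s remote builder respects Docker cache the same way local builds do.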
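The health-check timeout case above (slow one-shot initialization) can be handled by moving the warm-up into a background thread and letting the health handler answer immediately. The sketch below is one possible shape; the Warmup class is a name invented for this example and the work callable stands in for whatever your app does at start-up.

```python
import threading


class Warmup:
    """Run slow start-up work in the background and track readiness."""

    def __init__(self, work):
        self._work = work
        self.ready = threading.Event()

    def start(self):
        # Daemon thread so it never blocks process shutdown.
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        self._work()      # e.g. load a model or prime a cache
        self.ready.set()

    def health(self):
        # Answer immediately either way so the check never hits its timeout.
        status = "ok" if self.ready.is_set() else "warming up"
        return 200, status
```

Whether to return 200 or 503 while warming up is a design choice: 200 lets the deploy proceed, while 503 keeps traffic away until the app is ready but then needs a grace_period long enough to cover the warm-up.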

For related deployment and edge computing issues, see Cloudflare D1 not working, Docker Compose service failed to build, Heroku h10 app crashed, and Postgres connection refused.


FixDevs

Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.
