Fix: LiteFS Not Working — Consul Lease, Primary Election, Halt Locks, and Replica Reads

FixDevs

Quick Answer

How to fix LiteFS errors — primary not elected, Consul lease setup, static lease single-node mode, halt locks for cross-node writes, replica seeing stale data, mount path mismatch, and LiteFS Cloud sync.

The Error

You deploy LiteFS to Fly.io with multiple machines and writes fail:

SQLITE_READONLY_DBMOVED: attempt to write a readonly database

Or the primary never elects:

$ fly logs
[litefs] waiting for consul connection
[litefs] error: consul: dial tcp: connection refused

Or a replica reads stale data immediately after a write:

# On the primary:
cur.execute("INSERT INTO users (name) VALUES (?)", ["Alice"])
conn.commit()

# On a replica seconds later:
# (LAST_INSERT_ROWID() is per-connection, so look the row up by value instead.)
cur.execute("SELECT name FROM users WHERE name = ?", ["Alice"])
# Returns no rows.

Or the FUSE mount fails to start:

[litefs] error: cannot mount: fuse: device not found

Why This Happens

LiteFS is a FUSE filesystem that intercepts SQLite writes and replicates them across nodes. Most issues map to one of:

  • No primary election. LiteFS needs a coordination mechanism to pick one writer. Two backends: Consul (Fly’s built-in) or static lease (one-node mode for testing). Without one, no writes happen.
  • Replicas are read-only. A replica that receives a write request must forward it to the primary, or your app must route writes itself. The SQLITE_READONLY_DBMOVED error means a replica tried to commit a write directly.
  • Replication is asynchronous by default. A replica may be milliseconds (or seconds) behind. For “read your own writes,” use halt locks or pin reads to the primary.
  • FUSE requires kernel support. Fly’s machines ship with FUSE enabled, but local Docker testing usually doesn’t. LiteFS needs cap_add: SYS_ADMIN and /dev/fuse access.

There is a fundamental design tension here that explains many of the surprises. LiteFS was built to make SQLite replicate without changing your app’s database driver — your code keeps calling sqlite3.connect(...) against an ordinary file path, and the filesystem layer does the rest. That elegance comes at a cost: SQLite’s protocol is single-writer by definition, so LiteFS must enforce a single primary across the whole cluster. The Consul lease is the mechanism for that enforcement, and its TTL and renewal window dictate how fast failover can be. Default settings prioritize safety over speed, so a network partition that lasts 10 seconds can leave the cluster without a writable primary for that whole window.

The replication transport is asynchronous WAL streaming. A commit on the primary appends to the WAL, and replicas pull the new frames over HTTP. The latency between “primary committed” and “replica visible” is usually under a second on a healthy network but can balloon to tens of seconds during high write load or replica catch-up after a restart. Apps that read what they just wrote — common in REST handlers — see stale data and assume LiteFS is broken. The cure is one of three: read from the primary for read-your-own-writes paths, hold a halt lock during the commit, or accept eventual consistency and design the UI around it.
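One way to make the lag concrete is to compare replication positions before reading. The sketch below assumes LiteFS exposes each database's position in a `<db>-pos` file inside the FUSE mount, formatted as a hex TXID and a checksum separated by a slash; check the file name and format against your LiteFS version. `parse_txid`, `read_pos_file`, and `wait_for_txid` are hypothetical helper names, not part of any LiteFS library.

```python
import time
from typing import Callable


def parse_txid(pos_text: str) -> int:
    """Extract the transaction ID from a 'TXID/CHECKSUM' hex position string."""
    txid_hex, _, _checksum = pos_text.strip().partition("/")
    return int(txid_hex, 16)


def read_pos_file(path: str = "/litefs/my-app.db-pos") -> str:
    """Read the position file (assumed location: <fuse.dir>/<db>-pos)."""
    with open(path) as f:
        return f.read()


def wait_for_txid(
    target_txid: int,
    read_pos: Callable[[], str] = read_pos_file,
    timeout: float = 2.0,
    interval: float = 0.05,
) -> bool:
    """Poll the local position until it reaches target_txid or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if parse_txid(read_pos()) >= target_txid:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A handler on a replica could call `wait_for_txid` with the TXID the primary reported for the client's last write, and fall back to reading from the primary if it returns False.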

A third class of confusion comes from the Fly.io platform layer. Fly’s fly-replay response header lets an instance in a replica region hand a request back to Fly’s proxy, which replays it in the primary region. This is the cleanest write-forwarding pattern, but it only works if something in the request path (the LiteFS proxy or your own middleware) actually emits that header. Apps deployed to Fly without the replay machinery hit the read-only error on every POST that lands in a non-primary region. Most teams discover this only after their first regional traffic surge.
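If you are not using the LiteFS proxy, the replay decision can be sketched in a few lines. This is a minimal illustration, assuming the FLY_REGION and PRIMARY_REGION environment variables are set as described in this article; `replay_header` is a hypothetical helper, and wiring it into your framework's response handling is left out.

```python
import os
from typing import Mapping, Optional

WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


def replay_header(method: str, env: Mapping[str, str] = os.environ) -> Optional[dict]:
    """Return a fly-replay response header if this request should be replayed
    in the primary region, or None if it can be handled locally."""
    current = env.get("FLY_REGION")
    primary = env.get("PRIMARY_REGION")
    if not current or not primary:
        return None  # Not on Fly, or PRIMARY_REGION is not configured
    if method.upper() in WRITE_METHODS and current != primary:
        return {"fly-replay": f"region={primary}"}
    return None
```

When the function returns a header, the handler sends back an otherwise empty response carrying it, and Fly's proxy replays the original request in that region.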

Fix 1: Configure Consul Lease (Multi-Node)

Fly provides a free Consul cluster per app. Use it for LiteFS leases:

fly consul attach

This sets FLY_CONSUL_URL as a secret in your app’s environment. LiteFS picks it up through the ${FLY_CONSUL_URL} reference in litefs.yml.

In litefs.yml:

fuse:
  dir: "/litefs"
  allow-other: true

data:
  dir: "/var/lib/litefs"

proxy:
  addr: ":8080"
  target: "localhost:8081"
  db: "my-app.db"

lease:
  type: "consul"
  candidate: ${FLY_REGION == PRIMARY_REGION}
  promote: true
  advertise-url: "http://${HOSTNAME}.vm.${FLY_APP_NAME}.internal:20202"

  consul:
    url: "${FLY_CONSUL_URL}"
    key: "litefs/my-app"

Three lease fields:

  • candidate — whether this node can become primary. Restrict to PRIMARY_REGION for predictable failover.
  • promote — auto-promote if no primary exists.
  • advertise-url — how replicas reach this node. Fly’s .vm.<app>.internal DNS works.

Set PRIMARY_REGION in fly.toml:

[env]
  PRIMARY_REGION = "nrt"
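Because a missing variable tends to surface only as a vague lease error, it can help to check the environment before LiteFS starts. The following is a rough sketch, assuming the variable names used in the config above; `lease_preflight` is a hypothetical helper.

```python
import os
from typing import Mapping


def lease_preflight(env: Mapping[str, str] = os.environ) -> dict:
    """Report missing variables and whether this node would be a lease candidate."""
    required = ("FLY_CONSUL_URL", "FLY_REGION", "PRIMARY_REGION", "FLY_APP_NAME")
    missing = [name for name in required if not env.get(name)]
    # Mirrors the litefs.yml expression: candidate: ${FLY_REGION == PRIMARY_REGION}
    region = env.get("FLY_REGION")
    candidate = bool(region) and region == env.get("PRIMARY_REGION")
    return {"missing": missing, "candidate": candidate}
```

Running this in the container entrypoint and logging the result makes it obvious whether a node was ever eligible to become primary.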

Pro Tip: Pick the primary region closest to where most of your writes originate. Reads are served from every region, but every write goes through the primary, so a primary far from your users adds a round trip to each write.

Fix 2: Use Static Lease for Single-Node Dev

For local testing or single-machine deployments where you don’t need failover:

lease:
  type: "static"
  advertise-url: "http://localhost:20202"
  candidate: true
  hostname: "primary"

type: "static" skips Consul entirely. The single node is always primary.

This is the simplest setup for local Docker or a deployment scaled to one machine (fly scale count 1). For production with HA, switch to the consul lease type.

Common Mistake: Mixing static lease and multi-node deploys. Two nodes both running with type: "static" think they’re both primary and writes conflict.

Fix 3: Route Writes Through the Primary

A replica that receives a write request must forward it. LiteFS provides a built-in HTTP proxy:

proxy:
  addr: ":8080"
  target: "localhost:8081"
  db: "my-app.db"

This makes LiteFS listen on :8080 and forward to your app on :8081. The proxy automatically routes writes (PUT/POST/PATCH/DELETE) to the primary; reads stay local.

Your fly.toml should expose :8080 (LiteFS) to the public, with your app on the internal :8081:

[http_service]
  internal_port = 8080   # LiteFS proxy port
  force_https = true

# Your app binds to localhost:8081

For apps that don’t use HTTP (background workers, queues), you need a different routing mechanism:

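# In your app:
import requests

# Assumes fuse.dir is /litefs. LiteFS writes this file on replicas only.
PRIMARY_FILE = "/litefs/.primary"

def get_primary():
    """Return the primary's hostname, or None if this node is the primary."""
    try:
        with open(PRIMARY_FILE) as f:
            return f.read().strip()
    except FileNotFoundError:
        return None

def write_user(name):
    primary = get_primary()
    if primary is None:
        # Local write: this node is the primary
        conn.execute("INSERT INTO users (name) VALUES (?)", [name])
        conn.commit()
    else:
        # Forward to the primary (the route is whatever your app exposes)
        requests.post(f"http://{primary}/api/users", json={"name": name}, timeout=5)

LiteFS exposes the primary’s hostname in a .primary file inside the FUSE mount. The file exists only on replicas, so its absence means the local node is the primary (or that no primary is currently known).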

Fix 4: Halt Locks for Synchronous Writes

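For “read your own writes” semantics on a node that is not the primary, use a halt lock. It halts writes on the primary while the lock holder commits locally, and normal writes resume when the lock is explicitly released:

import requests

HALT_URL = "http://localhost:20202/api/v1/dbs/my-app.db/halt"

# Acquire halt lock (HTTP); fail fast if the request is rejected:
requests.post(HALT_URL, timeout=5).raise_for_status()

try:
    cur.execute("INSERT INTO users ...")
    conn.commit()
finally:
    # Release, even if the write failed:
    requests.delete(HALT_URL, timeout=5)

While the halt lock is held, the primary accepts no other writes, and the holder’s transaction is applied to its local copy, so the same node can read it back immediately. After release, the primary resumes normal writes and the other replicas catch up.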

For Go apps, the superfly/litefs-go library provides ergonomic helpers:

import "github.com/superfly/litefs-go"

err := litefs.WithHalt(ctx, dbPath, func() error {
    _, err := db.ExecContext(ctx, "INSERT INTO ...")
    return err
})

Note: Halt locks block writes from other connections during the hold. Don’t hold them across slow operations. Use only for the few cases where stale reads are unacceptable.
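To keep hold times visible, the acquire/release pair can be wrapped in a context manager that always releases and reports slow holds. This is a generic sketch: `held_lock` is a hypothetical helper, and the `acquire` and `release` callables stand in for whatever halt-lock calls your setup uses.

```python
import time
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def held_lock(
    acquire: Callable[[], None],
    release: Callable[[], None],
    max_hold_seconds: float = 0.5,
    on_slow: Callable[[float], None] = lambda s: print(f"halt lock held for {s:.3f}s"),
) -> Iterator[None]:
    """Acquire a lock, always release it, and report holds longer than the budget."""
    acquire()
    start = time.monotonic()
    try:
        yield
    finally:
        release()
        elapsed = time.monotonic() - start
        if elapsed > max_hold_seconds:
            on_slow(elapsed)
```

Used as `with held_lock(acquire_halt, release_halt): ...`, only the commit itself should sit inside the block; anything slow, such as outbound HTTP calls, belongs outside it.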

Fix 5: FUSE in Docker / Local Dev

For local LiteFS testing in Docker:

# docker-compose.yml
services:
  app:
    image: my-app
    cap_add:
      - SYS_ADMIN
    devices:
      - /dev/fuse
    security_opt:
      - apparmor:unconfined
    volumes:
      - litefs:/var/lib/litefs
    command: litefs mount

volumes:
  litefs:

Three Docker requirements:

  • cap_add: SYS_ADMIN — FUSE needs admin capability.
  • devices: /dev/fuse — the FUSE device.
  • apparmor:unconfined — AppArmor on Ubuntu hosts blocks FUSE by default.

On Fly.io, these are set automatically — you don’t configure them in fly.toml.

Common Mistake: Using restart: unless-stopped with litefs mount. If LiteFS crashes, Docker restarts it but the FUSE mount is stale. Use restart: on-failure with a max retry count.
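A quick check inside the container can confirm the FUSE device is actually visible before blaming LiteFS. This is a small sketch using only the standard library; `fuse_device_problems` is a hypothetical helper, and it checks the device node only, not the SYS_ADMIN capability or AppArmor.

```python
import os


def fuse_device_problems(device: str = "/dev/fuse") -> list:
    """Return human-readable problems with the FUSE device (empty if none found)."""
    problems = []
    if not os.path.exists(device):
        problems.append(f"{device} is missing: pass it with `devices:` in docker-compose.yml")
    elif not os.access(device, os.R_OK | os.W_OK):
        problems.append(f"{device} is not readable/writable by this user")
    return problems


if __name__ == "__main__":
    for problem in fuse_device_problems():
        print(problem)
```

An empty result means the device is present and accessible, so a mount failure is more likely caused by the capability or AppArmor settings.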

Fix 6: LiteFS Cloud for Hosted Replication

LiteFS Cloud (Fly’s hosted service) offers:

  • Point-in-time backups
  • Cross-region replication without managing Consul
  • A managed primary

To use:

lease:
  type: "consul"
  # consul: { url: ..., key: ... }
  advertise-url: ...

# Or with LiteFS Cloud, create a cluster and store its token as a secret:
fly litefs-cloud create
fly secrets set LITEFS_CLOUD_TOKEN=<token>

LiteFS Cloud manages backups and snapshots. Restore:

fly litefs-cloud snapshots
fly litefs-cloud restore --snapshot=<id>

Pro Tip: Even if you use Consul lease, enable LiteFS Cloud for backups. SQLite without backups is one disk failure from total loss.

Fix 7: Mount Path and App Configuration

LiteFS mounts at fuse.dir. Your app reads/writes through this path:

fuse:
  dir: "/litefs"

# In your app:
import sqlite3

conn = sqlite3.connect("/litefs/my-app.db")
# Writes go through LiteFS, replicate to other nodes.

Don’t open the underlying file directly (/var/lib/litefs/dbs/my-app.db/database). That’s LiteFS’s internal storage; writes bypass replication and corrupt the cluster.
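A startup guard can catch this misconfiguration early. The sketch below assumes the fuse.dir and data.dir values used in this article; `check_db_path` is a hypothetical helper, not part of LiteFS.

```python
import os


def check_db_path(db_path: str, fuse_dir: str = "/litefs", data_dir: str = "/var/lib/litefs") -> str:
    """Return the absolute DB path if it goes through the FUSE mount, else raise."""
    path = os.path.abspath(db_path)
    fuse = os.path.abspath(fuse_dir)
    data = os.path.abspath(data_dir)
    # Anything under data.dir is LiteFS's internal storage.
    if os.path.commonpath([path, data]) == data:
        raise ValueError(f"{path} is inside LiteFS internal storage ({data})")
    # Anything outside fuse.dir is an ordinary file that is never replicated.
    if os.path.commonpath([path, fuse]) != fuse:
        raise ValueError(f"{path} is not under the FUSE mount ({fuse})")
    return path
```

Calling `sqlite3.connect(check_db_path(configured_path))` at startup turns a silent replication bypass into an immediate, readable error.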

For dynamic DB names, use a per-tenant file naming pattern:

fuse:
  dir: "/litefs"

# In your app:
conn = sqlite3.connect(f"/litefs/tenant-{tenant_id}.db")

Each .db file gets its own replication channel. LiteFS handles them independently.

Note: LiteFS doesn’t currently support cross-DB transactions (an INSERT into users.db + audit.db in one transaction isn’t atomic across the two). For multi-tenant patterns, isolate concerns or accept eventual consistency between DBs.

Fix 8: Monitoring Replication Lag

Check primary status:

curl http://localhost:20202/api/v1/dbs/my-app.db

Returns JSON with current TXID, primary hostname, replica positions.

For Prometheus:

http:
  addr: ":20202"
  # Exposes /metrics

Key metrics:

  • litefs_db_position_replica vs litefs_db_position_primary — gap shows replication lag.
  • litefs_subscriber_count — number of replicas connected to this primary.
  • litefs_halt_lock_active — whether halt is held.

Set up alerts for litefs_db_position_primary - litefs_db_position_replica > N to catch stuck replicas.

Pro Tip: A replica that’s hours behind isn’t a slow replica — it’s a disconnected one. Check litefs_subscriber_count first; if it’s zero, the replica isn’t getting the WAL stream at all.
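The lag calculation can be sketched as a small parser over the /metrics output. It assumes the two position metric names listed above and plain Prometheus text format without timestamps; confirm both against what your LiteFS version actually exports. `metric_value` and `replication_lag` are hypothetical helper names.

```python
from typing import Optional


def metric_value(metrics_text: str, name: str) -> Optional[float]:
    """Return the first sample for `name` from Prometheus text-format output."""
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # Skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        if series.split("{", 1)[0] == name:
            return float(value)
    return None


def replication_lag(metrics_text: str) -> Optional[float]:
    """Primary position minus replica position, or None if either is missing."""
    primary = metric_value(metrics_text, "litefs_db_position_primary")
    replica = metric_value(metrics_text, "litefs_db_position_replica")
    if primary is None or replica is None:
        return None
    return primary - replica
```

A None result is worth alerting on as well, since it usually means the node is not exporting the metrics you expect.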

Version History and Tooling Context

LiteFS is younger than most of the tools in its category, so version differences matter a lot — pre-1.0 deployments behave noticeably differently from current ones:

  • LiteFS 0.3–0.4 (early 2023) were experimental. Lease handling was Consul-only with rough edges, the proxy didn’t exist yet, and write forwarding required custom app code.
  • LiteFS 0.5 (July 2023) introduced the built-in HTTP proxy (proxy.addr / proxy.target in litefs.yml). Before this, every app had to implement its own write forwarding using the /primary endpoint. Most older tutorials skip the proxy and look more complicated than they need to be.
  • LiteFS 0.5.x added the halt lock API for synchronous read-your-own-writes patterns. This is the building block for libraries like litefs-go that wrap halt acquisition in a WithHalt helper.
  • LiteFS 1.0 (March 2024) stabilized the on-disk LTX format, locked the API for the lease backend, and shipped the production-ready Fly.io integration. The fly litefs-cloud subcommand and managed snapshot service became GA at this point. If you’re starting fresh in 2026, 1.0+ is the only sensible target.
  • LiteFS Cloud ships continuous backups and managed lease coordination as a separate hosted service. It removes the need to wire Consul yourself but adds an external dependency and a per-app cost.

Compared to alternatives:

  • Turso (libSQL) takes the same SQLite-at-the-edge idea but exposes it as a managed service with HTTP-based reads: no FUSE, no Consul, but you give up the “vanilla SQLite driver” promise and have to use the libSQL client or a network-aware connector.
  • PowerSync focuses on embedded replicas in mobile and browser clients, with conflict resolution as a first-class concern; it complements LiteFS rather than competing with it.
  • rqlite is the older “Raft-replicated SQLite” project, with stronger consistency guarantees, weaker performance, and no FUSE magic.
  • Cloudflare D1 also runs SQLite at the edge but is purpose-built for Workers and lives entirely inside Cloudflare’s ecosystem.

The decision usually comes down to: pick LiteFS if you’re already on Fly and want to keep your existing SQLite driver; pick Turso if you want a managed service with multiple SDKs; pick rqlite if you need strong consistency over raw throughput.

Still Not Working?

A few less-obvious failures:

  • failed to acquire lease. Consul connectivity broken. Verify fly consul attach ran. Check FLY_CONSUL_URL is set in the app’s env.
  • Primary stays in old region after a regional outage. Consul retains the lease for ttl seconds (default 10s) even after the holder dies. Either wait or manually expire via Consul KV.
  • Replication stops after vacuum. SQLite VACUUM rewrites the entire DB. LiteFS handles this but it’s expensive and can pause replication for the duration. Schedule vacuums during low-traffic windows.
  • PRAGMA journal_mode=DELETE ignored. LiteFS requires WAL mode. Default is WAL; trying to switch to DELETE or TRUNCATE fails silently.
  • Read replica writes succeed locally but never propagate. You wrote to the underlying file instead of through the FUSE mount. Fix the path in your app config.
  • Backups via sqlite3 .backup fail. Use the LiteFS-aware backup pattern: stop writes (halt lock), copy the file, release. Or use LiteFS Cloud snapshots.
  • SQLITE_BUSY under load. SQLite’s single-writer constraint plus LiteFS’s primary forwarding adds latency. For high-write workloads, consider Postgres instead.
  • App restarts but data is gone. The volume isn’t mounted. litefs.yml’s data.dir must be on a persistent volume ([[mounts]] in fly.toml).
  • fly-replay header is ignored on a replica. Your app framework or reverse proxy strips unknown response headers before they reach Fly’s edge. Allow fly-replay through in that layer’s config; otherwise writes from non-primary regions fail with 4xx responses.
  • LiteFS Cloud snapshots restore an older timestamp than expected. Snapshots are taken on a schedule (every few minutes by default). The “latest” snapshot may not include writes from the last 60–120 seconds. Plan recovery objectives around the actual snapshot cadence shown in fly litefs-cloud snapshots.
  • Replica throws position not found after a long disconnect. WAL frames older than the retention window are pruned from the primary, so a stale replica can’t catch up incrementally. Force a full resync by stopping the replica, deleting its data.dir, and letting it bootstrap from the primary’s latest snapshot.

For related Fly.io, SQLite, and distributed-data issues, see Fly deploy not working, SQLite database is locked, Turso not working, and Postgres connection refused.

FixDevs

Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.
