Fix: LiteFS Not Working — Consul Lease, Primary Election, Halt Locks, and Replica Reads
Quick Answer
How to fix LiteFS errors — primary not elected, Consul lease setup, static lease single-node mode, halt locks for cross-node writes, replica seeing stale data, mount path mismatch, and LiteFS Cloud sync.
The Error
You deploy LiteFS to Fly.io with multiple machines and writes fail:
SQLITE_READONLY_DBMOVED: attempt to write a readonly database
Or the primary never elects:
$ fly logs
[litefs] waiting for consul connection
[litefs] error: consul: dial tcp: connection refused
Or a replica reads stale data immediately after a write:
# On the primary:
cur.execute("INSERT INTO users (name) VALUES (?)", ["Alice"])
conn.commit()
# On a replica seconds later:
cur.execute("SELECT name FROM users WHERE name = ?", ["Alice"])
# Returns no rows.
Or the FUSE mount fails to start:
[litefs] error: cannot mount: fuse: device not found
Why This Happens
LiteFS is a FUSE filesystem that intercepts SQLite writes and replicates them across nodes. Most issues map to one of:
- No primary election. LiteFS needs a coordination mechanism to pick one writer. Two backends: Consul (Fly’s built-in) or static lease (one-node mode for testing). Without one, no writes happen.
- Replicas are read-only. A replica that receives a write request must forward it to the primary, or your app must route writes itself. The SQLITE_READONLY_DBMOVED error means a replica tried to commit a write directly.
- Replication is asynchronous by default. A replica may be milliseconds (or seconds) behind. For “read your own writes,” use halt locks or pin reads to the primary.
- FUSE requires kernel support. Fly’s machines ship with FUSE enabled, but local Docker testing usually doesn’t. LiteFS needs cap_add: SYS_ADMIN and /dev/fuse access.
There is a fundamental design tension here that explains many of the surprises. LiteFS was built to make SQLite replicate without changing your app’s database driver — your code keeps calling sqlite3.connect(...) against an ordinary file path, and the filesystem layer does the rest. That elegance comes at a cost: SQLite’s protocol is single-writer by definition, so LiteFS must enforce a single primary across the whole cluster. The Consul lease is the mechanism for that enforcement, and its TTL and renewal window dictate how fast failover can be. Default settings prioritize safety over speed, so a network partition that lasts 10 seconds can leave the cluster without a writable primary for that whole window.
The replication transport is asynchronous WAL streaming. A commit on the primary appends to the WAL, and replicas pull the new frames over HTTP. The latency between “primary committed” and “replica visible” is usually under a second on a healthy network but can balloon to tens of seconds during high write load or replica catch-up after a restart. Apps that read what they just wrote — common in REST handlers — see stale data and assume LiteFS is broken. The cure is one of three: read from the primary for read-your-own-writes paths, hold a halt lock during the commit, or accept eventual consistency and design the UI around it.
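The third option, accepting eventual consistency, often ends up as a bounded wait on the replica. The following is a minimal sketch of that idea using only the standard sqlite3 module; the function name wait_until_visible and its timeout and interval defaults are illustrative assumptions, not part of LiteFS.

```python
import sqlite3
import time


def wait_until_visible(conn, query, params=(), timeout=2.0, interval=0.05):
    """Poll `query` until it returns a row or `timeout` seconds elapse.

    Returns the first row, or None if nothing showed up in time.
    """
    deadline = time.monotonic() + timeout
    while True:
        row = conn.execute(query, params).fetchone()
        if row is not None:
            return row
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

A handler on a replica could call this right after a forwarded write and fall back to reading from the primary when it returns None. Keep the timeout short: a long wait only hides a replica that is disconnected rather than slow.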
A third class of confusion comes from the Fly.io platform layer. Fly’s replay header lets you redirect a request from a replica region to the primary region, which is the cleanest write-forwarding pattern but only works if your app inspects and obeys that header. Apps deployed to Fly without the replay machinery silently get the read-only error on every POST that lands in a non-primary region. Most teams discover this only after their first regional traffic surge.
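The decision behind the replay pattern can be sketched in a few lines. This is an illustration only: it assumes the response header takes the form fly-replay: region=<code>, and the function name replay_headers and the WRITE_METHODS set are hypothetical. Check Fly's documentation for the exact header contract and status code before relying on it.

```python
WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


def replay_headers(method, current_region, primary_region):
    """Return response headers asking Fly's proxy to replay a write elsewhere.

    An empty dict means the request can be handled on this machine.
    """
    if method.upper() in WRITE_METHODS and current_region != primary_region:
        # Assumed header format; the proxy re-sends the request to that region.
        return {"fly-replay": f"region={primary_region}"}
    return {}
```

In a real app, current_region and primary_region would typically come from the FLY_REGION and PRIMARY_REGION environment variables, and the handler would return these headers without touching the database.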
Fix 1: Configure Consul Lease (Multi-Node)
Fly provides a free Consul cluster per app. Use it for LiteFS leases:
fly consul attach
This injects FLY_CONSUL_URL into your app’s env. LiteFS reads it automatically.
In litefs.yml:
fuse:
  dir: "/litefs"
  allow-other: true
data:
  dir: "/var/lib/litefs"
proxy:
  addr: ":8080"
  target: "localhost:8081"
  db: "my-app.db"
lease:
  type: "consul"
  candidate: ${FLY_REGION == PRIMARY_REGION}
  promote: true
  advertise-url: "http://${HOSTNAME}.vm.${FLY_APP_NAME}.internal:20202"
  consul:
    url: "${FLY_CONSUL_URL}"
    key: "litefs/my-app"
Three lease fields:
- candidate. Whether this node can become primary. Restrict to PRIMARY_REGION for predictable failover.
- promote. Auto-promote if no primary exists.
- advertise-url. How replicas reach this node. Fly’s .vm.<app>.internal DNS works.
Set PRIMARY_REGION in fly.toml:
[env]
  PRIMARY_REGION = "nrt"
Pro Tip: Pick the primary region closest to the users who write most. Reads happen everywhere, but every write goes through that one region, so a distant primary adds latency to each write.
Fix 2: Use Static Lease for Single-Node Dev
For local testing or single-machine deployments where you don’t need failover:
lease:
  type: "static"
  advertise-url: "http://localhost:20202"
  candidate: true
  hostname: "primary"
type: "static" skips Consul entirely. The single node is always primary.
This is the simplest setup for single-machine deployments (fly scale count 1) or local Docker. For production with high availability, switch to consul.
Common Mistake: Mixing static lease and multi-node deploys. Two nodes both running with type: "static" think they’re both primary and writes conflict.
Fix 3: Route Writes Through the Primary
A replica that receives a write request must forward it. LiteFS provides a built-in HTTP proxy:
proxy:
  addr: ":8080"
  target: "localhost:8081"
  db: "my-app.db"
This makes LiteFS listen on :8080 and forward to your app on :8081. The proxy automatically routes writes (PUT/POST/PATCH/DELETE) to the primary; reads stay local.
Your fly.toml should expose :8080 (LiteFS) to the public, with your app on the internal :8081:
[http_service]
  internal_port = 8080 # LiteFS proxy port
  force_https = true
  # Your app binds to localhost:8081
For apps that don’t use HTTP (background workers, queues), you need a different routing mechanism:
# In your app:
import requests
from litefs import get_primary  # Pseudocode: hypothetical helper that asks LiteFS for the primary

def write_user(name):
    primary = get_primary()  # "localhost" if this node is primary, else the primary's host:port
    if primary == "localhost":
        # Local write
        conn.execute("INSERT INTO users (name) VALUES (?)", [name])
        conn.commit()
    else:
        # Forward to primary
        requests.post(f"http://{primary}/api/users", json={"name": name})
LiteFS exposes a /primary endpoint over HTTP that returns the primary’s hostname.
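A background worker can also locate the primary without an HTTP call. The sketch below assumes LiteFS maintains a .primary file in the mount directory on replicas, containing the primary's hostname, and that the file is absent on the primary itself. That behavior and the helper name primary_hostname are assumptions to verify against your LiteFS version.

```python
from pathlib import Path


def primary_hostname(mount_dir="/litefs"):
    """Return the primary's hostname, or None if this node appears to be primary.

    Assumes a `.primary` marker file exists in the mount directory on replicas only.
    """
    marker = Path(mount_dir) / ".primary"
    try:
        return marker.read_text().strip() or None
    except FileNotFoundError:
        return None
```

A worker would write locally when this returns None and otherwise hand the job to the returned host through whatever internal API or queue the app already has.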
Fix 4: Halt Locks for Synchronous Writes
For “read your own writes” semantics, use a halt lock — it pauses replication on the primary while you commit, then explicitly resumes:
import requests
# Acquire halt lock (HTTP):
requests.post("http://localhost:20202/api/v1/dbs/my-app.db/halt")
try:
    cur.execute("INSERT INTO users ...")
    conn.commit()
finally:
    # Release:
    requests.delete("http://localhost:20202/api/v1/dbs/my-app.db/halt")
With halt held, the write is committed to the WAL but not yet streamed to replicas. After release, replication resumes and replicas catch up.
For Go apps, the superfly/litefs-go library provides ergonomic helpers:
import "github.com/superfly/litefs-go"
err := litefs.WithHalt(ctx, dbPath, func() error {
    _, err := db.ExecContext(ctx, "INSERT INTO ...")
    return err
})
Note: Halt locks block writes from other connections during the hold. Don’t hold them across slow operations. Use only for the few cases where stale reads are unacceptable.
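In Python, the acquire/release pair is easier to keep safe behind a context manager. The sketch below is deliberately generic: it takes the acquire and release steps as callables, so it makes no claim about the LiteFS API itself. The name held and the warn_after threshold are illustrative assumptions.

```python
import time
from contextlib import contextmanager


@contextmanager
def held(acquire, release, warn_after=0.5, warn=print):
    """Run a block between acquire() and release(), always releasing.

    Reports through `warn` when the lock was held longer than `warn_after` seconds.
    """
    acquire()
    started = time.monotonic()
    try:
        yield
    finally:
        release()
        elapsed = time.monotonic() - started
        if elapsed > warn_after:
            warn(f"halt lock held for {elapsed:.2f}s")
```

Passing the two HTTP calls from the example above as acquire and release keeps the release in one place, and the warning makes overly long holds visible in logs.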
Fix 5: FUSE in Docker / Local Dev
For local LiteFS testing in Docker:
# docker-compose.yml
services:
  app:
    image: my-app
    cap_add:
      - SYS_ADMIN
    devices:
      - /dev/fuse
    security_opt:
      - apparmor:unconfined
    volumes:
      - litefs:/var/lib/litefs
    command: litefs mount
volumes:
  litefs:
Three Docker requirements:
- cap_add: SYS_ADMIN. FUSE needs admin capability.
- devices: /dev/fuse. The FUSE device.
- apparmor:unconfined. AppArmor on Ubuntu hosts blocks FUSE by default.
On Fly.io, these are set automatically — you don’t configure them in fly.toml.
Common Mistake: Using restart: unless-stopped with litefs mount. If LiteFS crashes, Docker restarts it but the FUSE mount is stale. Use restart: on-failure with a max retry count.
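A quick preflight check inside the container can separate a missing device from a LiteFS misconfiguration before litefs mount runs. This is a minimal sketch using only the standard library; the function name fuse_problems is hypothetical, and the check covers the device node only, not capabilities or AppArmor.

```python
import os


def fuse_problems(device="/dev/fuse"):
    """Return a list of human-readable problems with the FUSE device node."""
    problems = []
    if not os.path.exists(device):
        problems.append(f"{device} is missing: pass the device into the container")
    elif not os.access(device, os.R_OK | os.W_OK):
        problems.append(f"{device} is not readable and writable by this user")
    return problems
```

An entrypoint script could print these problems and exit non-zero, which gives a clearer failure than the generic mount error.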
Fix 6: LiteFS Cloud for Hosted Replication
LiteFS Cloud (Fly’s hosted service) offers:
- Point-in-time backups
- Cross-region replication without managing Consul
- A managed primary
To use:
lease:
  type: "consul"
  # consul: { url: ..., key: ... }
  advertise-url: ...
# Or with LiteFS Cloud:
# Configure via fly secrets set LITEFS_CLOUD_TOKEN=...

fly litefs-cloud create
fly secrets set LITEFS_CLOUD_TOKEN=<token>
LiteFS Cloud manages backups and snapshots. Restore:
fly litefs-cloud snapshots
fly litefs-cloud restore --snapshot=<id>
Pro Tip: Even if you use Consul lease, enable LiteFS Cloud for backups. SQLite without backups is one disk failure from total loss.
Fix 7: Mount Path and App Configuration
LiteFS mounts at fuse.dir. Your app reads/writes through this path:
fuse:
  dir: "/litefs"

import sqlite3
conn = sqlite3.connect("/litefs/my-app.db")
# Writes go through LiteFS, replicate to other nodes.
Don’t open the underlying file directly (/var/lib/litefs/dbs/my-app.db/database). That’s LiteFS’s internal storage; writes bypass replication and corrupt the cluster.
For dynamic DB names, use a per-tenant filename pattern inside the mount directory:
fuse:
  dir: "/litefs"

conn = sqlite3.connect(f"/litefs/tenant-{tenant_id}.db")
Each .db file gets its own replication channel. LiteFS handles them independently.
Note: LiteFS doesn’t currently support cross-DB transactions (an INSERT into users.db + audit.db in one transaction isn’t atomic across the two). For multi-tenant patterns, isolate concerns or accept eventual consistency between DBs.
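When the tenant identifier comes from a request, it is worth validating it before it becomes part of a file path. The following sketch assumes tenant IDs are short alphanumeric strings; the helper name tenant_db_path and the allowed character set are illustrative choices, not a LiteFS requirement.

```python
import re
from pathlib import Path

# Assumed tenant ID shape: 1-64 letters, digits, underscores, or hyphens.
_TENANT_ID = re.compile(r"[A-Za-z0-9_-]{1,64}")


def tenant_db_path(tenant_id, mount_dir="/litefs"):
    """Build the per-tenant database path, rejecting IDs that could escape the mount."""
    if not _TENANT_ID.fullmatch(str(tenant_id)):
        raise ValueError(f"unsafe tenant id: {tenant_id!r}")
    return Path(mount_dir) / f"tenant-{tenant_id}.db"
```

The result can be passed to sqlite3.connect via str(). Rejecting slashes and dots keeps every database inside the FUSE mount, so writes cannot bypass replication by accident.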
Fix 8: Monitoring Replication Lag
Check primary status:
curl http://localhost:20202/api/v1/dbs/my-app.db
Returns JSON with current TXID, primary hostname, replica positions.
For Prometheus:
http:
  addr: ":20202"
  # Exposes /metrics
Key metrics:
- litefs_db_position_replica vs litefs_db_position_primary. The gap shows replication lag.
- litefs_subscriber_count. Number of replicas connected to this primary.
- litefs_halt_lock_active. Whether halt is held.
Set up alerts for position_primary - position_replica > N to catch stuck replicas.
Pro Tip: A replica that’s hours behind isn’t a slow replica — it’s a disconnected one. Check litefs_subscriber_count first; if it’s zero, the replica isn’t getting the WAL stream at all.
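The lag calculation itself is simple once both positions are in hand. The sketch below assumes a position is reported as a string of the form <txid>/<checksum> with hexadecimal fields; that format and the function names parse_position and txid_lag are assumptions, so adjust the parsing to whatever your LiteFS version actually returns.

```python
def parse_position(text):
    """Split a '<txid>/<checksum>' string of hex fields into (txid, checksum)."""
    txid_hex, _, checksum_hex = text.strip().partition("/")
    return int(txid_hex, 16), checksum_hex


def txid_lag(primary_pos, replica_pos):
    """Number of transactions the replica is behind the primary (never negative)."""
    primary_txid, _ = parse_position(primary_pos)
    replica_txid, _ = parse_position(replica_pos)
    return max(primary_txid - replica_txid, 0)
```

A health check could compare txid_lag against the same threshold N used in the alert and report the replica as unhealthy when it is exceeded.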
Version History and Tooling Context
LiteFS is younger than most of the tools in its category, so version differences matter a lot — pre-1.0 deployments behave noticeably differently from current ones:
- LiteFS 0.3–0.4 (early 2023) were experimental. Lease handling was Consul-only with rough edges, the proxy didn’t exist yet, and write forwarding required custom app code.
- LiteFS 0.5 (July 2023) introduced the built-in HTTP proxy (proxy.addr / proxy.target in litefs.yml). Before this, every app had to implement its own write forwarding using the /primary endpoint. Most older tutorials skip the proxy and look more complicated than they need to be.
- LiteFS 0.5.x added the halt lock API for synchronous read-your-own-writes patterns. This is the building block for libraries like litefs-go that wrap halt acquisition in a WithHalt helper.
- LiteFS 1.0 (March 2024) stabilized the on-disk LTX format, locked the API for the lease backend, and shipped the production-ready Fly.io integration. The fly litefs-cloud subcommand and managed snapshot service became GA at this point. If you’re starting fresh in 2026, 1.0+ is the only sensible target.
- LiteFS Cloud ships continuous backups and managed lease coordination as a separate hosted service. It removes the need to wire Consul yourself but adds an external dependency and a per-app cost.
Compared to alternatives: Turso (LibSQL) takes the same SQLite-at-the-edge idea but exposes it as a managed service with HTTP-based reads — no FUSE, no Consul, but you give up the “vanilla SQLite driver” promise and have to use the libSQL client or a network-aware connector. PowerSync focuses on embedded replicas in mobile and browser clients, with conflict resolution as a first-class concern; it complements LiteFS rather than competing with it. rqlite is the older “Raft-replicated SQLite” project — stronger consistency guarantees, weaker performance, and no FUSE magic. Cloudflare D1 also runs SQLite at the edge but is purpose-built for Workers and lives entirely inside Cloudflare’s ecosystem. The decision matrix usually comes down to: pick LiteFS if you’re already on Fly and want to keep your existing SQLite driver; pick Turso if you want a managed service with multiple SDKs; pick rqlite if you need strong consistency over raw throughput.
Still Not Working?
A few less-obvious failures:
- failed to acquire lease. Consul connectivity broken. Verify fly consul attach ran. Check FLY_CONSUL_URL is set in the app’s env.
- Primary stays in old region after a regional outage. Consul retains the lease for ttl seconds (default 10s) even after the holder dies. Either wait or manually expire via Consul KV.
- Replication stops after vacuum. SQLite VACUUM rewrites the entire DB. LiteFS handles this but it’s expensive and can pause replication for the duration. Schedule vacuums during low-traffic windows.
- PRAGMA journal_mode=DELETE ignored. LiteFS requires WAL mode. Default is WAL; trying to switch to DELETE or TRUNCATE fails silently.
- Read replica writes succeed locally but never propagate. You wrote to the underlying file instead of through the FUSE mount. Fix the path in your app config.
- Backups via sqlite3 .backup fail. Use the LiteFS-aware backup pattern: stop writes (halt lock), copy the file, release. Or use LiteFS Cloud snapshots.
- SQLITE_BUSY under load. SQLite’s single-writer constraint plus LiteFS’s primary forwarding adds latency. For high-write workloads, consider Postgres instead.
- App restarts but data is gone. The volume isn’t mounted. litefs.yml’s data.dir must be on a persistent volume ([[mounts]] in fly.toml).
- fly replay header is ignored on a replica. Your app framework strips unknown response headers before they reach Fly’s edge. Whitelist fly-replay in your reverse proxy or framework config; otherwise writes from non-primary regions silently 4xx.
- LiteFS Cloud snapshots restore an older timestamp than expected. Snapshots are taken on a schedule (every few minutes by default). The “latest” snapshot may not include writes from the last 60–120 seconds. Plan recovery objectives around the actual snapshot cadence shown in fly litefs-cloud snapshots.
- Replica throws position not found after a long disconnect. WAL frames older than the retention window are pruned from the primary, so a stale replica can’t catch up incrementally. Force a full resync by stopping the replica, deleting its data.dir, and letting it bootstrap from the primary’s latest snapshot.
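For the SQLITE_BUSY case, a short retry with backoff is often enough before reaching for a different database. This is a minimal sketch using the standard sqlite3 module; the helper name execute_with_retry, the attempt count, and the delays are illustrative assumptions, and matching on the error message text is a heuristic rather than an official contract.

```python
import sqlite3
import time


def execute_with_retry(conn, sql, params=(), attempts=5, base_delay=0.05):
    """Run one write statement, retrying with exponential backoff on busy/locked errors."""
    for attempt in range(attempts):
        try:
            with conn:  # commits on success, rolls back on error
                return conn.execute(sql, params)
        except sqlite3.OperationalError as exc:
            message = str(exc).lower()
            if "locked" not in message and "busy" not in message:
                raise  # not a contention error, so retrying will not help
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```

Opening the connection with sqlite3.connect(path, timeout=5.0) complements this, since the timeout makes SQLite itself wait for the lock before raising.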
For related Fly.io, SQLite, and distributed-data issues, see Fly deploy not working, SQLite database is locked, Turso not working, and Postgres connection refused.
Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.