Fix: AWS Lambda SnapStart Not Working — Version vs Alias, Restore Hooks, and Uniqueness Bugs

Q: How do I fix "AWS Lambda SnapStart Not Working — Version vs Alias, Restore Hooks, and Uniqueness Bugs"?

How to fix Lambda SnapStart errors — feature requires published version, $LATEST not supported, restore hook for stale connections, UUID collisions after snapshot, time-based state staleness, and pricing surprises.

The Error

You enable SnapStart on a Lambda function and the change doesn’t apply:

The function policy doesn't support SnapStart. Publish a new version first.

Or cold starts are still slow despite SnapStart being on, because the restore itself takes hundreds of milliseconds:

Init Duration: 0.92 ms   (good — restored from snapshot)
Restore Duration: 410 ms (still significant)
Duration: 50 ms
Billed Duration: 460 ms

Or two requests get the same “random” UUID:

[invocation A] id=550e8400-e29b-41d4-a716-446655440000
[invocation B] id=550e8400-e29b-41d4-a716-446655440000  # Same!

Or DB connections fail with “connection closed” right after restore:

[invocation] Restored from snapshot.
Error: Connection terminated unexpectedly
    at /var/task/node_modules/pg/lib/...

Why This Happens

SnapStart initializes your Lambda function once, when you publish a version, takes a Firecracker microVM snapshot of the initialized execution environment’s memory and disk state, and resumes new execution environments from that snapshot instead of running init again. Cold-start time drops from 1-10 seconds (Java) or 200-800ms (Node and Python) down to roughly 50-200ms. The effect resembles a CRIU checkpoint on Linux, although CRIU checkpoints a single process from userspace while SnapStart snapshots the whole microVM. The trade-offs are similar too: anything captured in the snapshot behaves as if time stopped at snapshot creation until you explicitly refresh it.

The hard constraints are simple to list but easy to violate. SnapStart only operates on published versions, not $LATEST, so pointing your event source at $LATEST bypasses SnapStart even after you “enabled” it in the console. Restore happens after init has already run, so the snapshot captures whatever state init produced: open TCP sockets to the database, HTTP keep-alive pools, cached secrets, in-memory random-number-generator state. The runtime restores that memory image verbatim, and stale state is your problem to detect and refresh in a restore hook. Seeded PRNG state (Java’s java.util.Random, Python’s random module, V8’s Math.random seed) and any ID generated during init are all snapshotted, so without explicit reseeding, multiple parallel restores produce correlated random sequences, including duplicate UUIDs. CSPRNGs that draw from the OS, such as java.security.SecureRandom on Lambda’s managed Java runtimes, stay unique across restores.

The deeper traps live in priming and runtime support. SnapStart requires the runtime to declare itself snapshot-compatible: Java (Corretto 11, 17, 21), Python 3.12 and 3.13, and recent .NET runtimes — Node.js support is rolling out and not universally available. Network configuration matters because the initial snapshot creation has to reach the same endpoints your handler will use; a VPC without the right egress will create a snapshot that can’t actually talk to anything when restored. KMS encryption of the snapshot adds a permission requirement on the execution role — without kms:Decrypt, the restore phase fails before your code even runs. Most “I enabled SnapStart and nothing changed” reports are some combination of these constraints, not a SnapStart bug.

Diagnostic Timeline

Walk through a real “I turned on SnapStart and cold start is still 4 seconds” failure.

Minute 0 — first suspicion: enable SnapStart. The console shows SnapStart as “Enabled,” the function has been redeployed, and yet Init Duration is still 4000ms in CloudWatch. The first reflex is to toggle SnapStart off and back on. Nothing changes — the toggle was never the problem.

Minute 3 — first evidence: check runtime support. Open the function configuration. The runtime is Java 8 (Corretto 8). SnapStart requires Java 11 or later (Corretto 11/17/21), Python 3.12 or 3.13, or .NET 8+. On unsupported runtimes the SnapStart toggle stays “Enabled” in the UI but has no effect at runtime. Upgrade the runtime and republish.

Minute 6 — next check: alias vs $LATEST. With the runtime fixed, cold start drops to 800ms, which is better but not the ~100ms expected. Run aws lambda list-event-source-mappings --function-name my-app and inspect the API Gateway integration URI to see which function ARN each trigger invokes (aws lambda get-policy --function-name my-app shows who is allowed to invoke the function, not where triggers point). They reference arn:aws:lambda:...:function:my-app (unqualified, which means $LATEST) instead of arn:aws:lambda:...:function:my-app:prod. SnapStart only applies to published versions, invoked by version number or through an alias. Repoint the trigger to the alias.

Minute 9 — discriminating evidence: priming on init. Cold start is now 200ms but the first 500ms of handler work is still slow — the JDBC driver hasn’t been class-loaded, the HTTP client hasn’t been warmed. Move that work into static initialization (static {} in Java, module scope in Python). Anything that runs at init is captured in the snapshot and restored instantly; anything that runs at handler time pays full cost.

Minute 12 — actual root cause: KMS permission on the snapshot. Restore now fails intermittently with KMSAccessDeniedException. SnapStart encrypts snapshots with an AWS-managed key by default; if you configured a customer-managed KMS key, the function’s execution role needs kms:Decrypt on that key. Grant the permission, force a re-snapshot by publishing a new version, and the function consistently restores in under 200ms.

Fix 1: Enable SnapStart on a Published Version

In the Lambda console: Function → Configuration → General configuration → Edit → SnapStart → “Published versions” → Save.

Or via CLI:

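aws lambda update-function-configuration \
  --function-name my-app \
  --snap-start ApplyOn=PublishedVersions

# Wait for the configuration update to finish before publishing:
aws lambda wait function-updated --function-name my-app

# Then publish a version (Lambda takes the snapshot as part of publishing):
aws lambda publish-version --function-name my-app
# Returns: { "Version": "5", ... }

# Point an alias at it (use create-alias the first time the alias is made):
aws lambda update-alias \
  --function-name my-app \
  --name prod \
  --function-version 5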

API Gateway integrations, function URLs, and event sources must point at the alias (my-app:prod) or at a specific published version, not the unqualified function ARN or $LATEST. Otherwise SnapStart doesn’t activate.
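To audit a list of trigger ARNs for the unqualified form, a small check like the following can help. It is a minimal sketch: the helper name targets_published_version is hypothetical, the ARN pattern is a simplified assumption about the format, and a qualified ARN only tells you the trigger can use SnapStart, not that SnapStart is enabled on that version.

```python
import re

# Simplified pattern for a Lambda function ARN with an optional qualifier.
_LAMBDA_ARN = re.compile(
    r"^arn:(?P<partition>aws[a-z-]*):lambda:(?P<region>[a-z0-9-]+)"
    r":(?P<account>\d{12}):function:(?P<name>[A-Za-z0-9_-]+)"
    r"(?::(?P<qualifier>[A-Za-z0-9_$-]+))?$"
)


def targets_published_version(arn: str) -> bool:
    """Return True if the ARN names an alias or a numbered version."""
    match = _LAMBDA_ARN.match(arn)
    if match is None:
        raise ValueError(f"not a Lambda function ARN: {arn!r}")
    qualifier = match.group("qualifier")
    # A missing qualifier and an explicit $LATEST both resolve to unpublished code.
    return qualifier is not None and qualifier != "$LATEST"
```

Calling it with arn:aws:lambda:us-east-1:123456789012:function:my-app returns False, while the same ARN with :prod or :5 appended returns True.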

# AWS SAM template:
Resources:
  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      AutoPublishAlias: prod
      SnapStart:
        ApplyOn: PublishedVersions

AutoPublishAlias: prod makes SAM publish a new version and update the prod alias on each deploy. SnapStart picks it up automatically.

Pro Tip: For non-prod environments, use SnapStart too. Cold-start differences make perf testing meaningless if dev doesn’t use SnapStart.

Fix 2: Use Restore Hooks for Stale Connections

DB connections, HTTP keep-alive sockets, file handles — all become stale after a snapshot restore. You need to refresh them in a restore hook.

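Java (with the org.crac library):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

import org.crac.Context;
import org.crac.Core;
import org.crac.Resource;

public class App implements Resource {
    private Connection dbConn;

    public App() {
        // Register this object so its hooks run around snapshot and restore.
        Core.getGlobalContext().register(this);
    }

    @Override
    public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
        // Close the socket so a dead connection is not baked into the snapshot.
        if (dbConn != null) dbConn.close();
        dbConn = null;
    }

    @Override
    public void afterRestore(Context<? extends Resource> context) throws Exception {
        dbConn = createConnection();
    }

    public Connection getDbConn() {
        return dbConn;
    }

    // Assumes DATABASE_URL holds a JDBC URL; adapt to your own configuration.
    private Connection createConnection() throws SQLException {
        return DriverManager.getConnection(System.getenv("DATABASE_URL"));
    }
}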

The org.crac package (Coordinated Restore at Checkpoint) is Java’s hook API. SnapStart calls beforeCheckpoint before snapshotting and afterRestore after restoring.

Python:

import os

import psycopg2
# snapshot_restore_py is bundled with the Lambda managed Python runtimes that
# support SnapStart; confirm the import path against the current AWS docs.
from snapshot_restore_py import register_after_restore, register_before_snapshot

connection = None

def init_connection():
    global connection
    connection = psycopg2.connect(os.environ["DATABASE_URL"])

@register_before_snapshot
def close_connection():
    # Runs just before the snapshot, so no live socket is captured.
    global connection
    if connection:
        connection.close()
        connection = None

@register_after_restore
def reopen_connection():
    # Runs in each restored environment before it serves an invocation.
    init_connection()

init_connection()  # Runs at init; the driver import and setup work stay in the snapshot

Node.js:

let dbClient;

async function init() {
  dbClient = await createPgClient();
}

// Register lifecycle hooks via the current AWS Lambda Node.js runtime API
// (the API name and import path are still evolving — check the docs).
// Conceptually:
//   - beforeSnapshot: close stale connections, dbClient = null.
//   - afterRestore: re-create dbClient by calling init().

await init();  // Initial setup, captured in snapshot.

For both runtimes, the hook API is newer than Java’s org.crac and the import paths have moved across releases — always check the AWS Lambda runtime docs for the current names.

Common Mistake: Initializing a DB connection at module load and assuming it survives the snapshot. It doesn’t — TCP sockets are dead after restore. Always re-establish in afterRestore.

Fix 3: Reseed Random Number Generators

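Java’s java.util.Random is stateful, and that state is part of the snapshot. Without reseeding, restored instances generate correlated sequences. java.security.SecureRandom draws from the OS and stays unique on Lambda’s managed runtimes, so prefer it where you can. Replace any PRNG that was created during init:

private static volatile Random rng = new Random();

@Override
public void afterRestore(org.crac.Context<? extends Resource> context) {
    // Replace the PRNG created during init with a freshly seeded one:
    rng = new Random(new SecureRandom().nextLong());
}

For UUID v4 generation:

// UUID.randomUUID() draws from SecureRandom, so calls made in the handler are
// safe. Duplicates come from IDs generated during init and cached in a field.
private static volatile String instanceId = UUID.randomUUID().toString();

@Override
public void afterRestore(org.crac.Context<? extends Resource> context) {
    // Regenerate the cached ID so each restored environment gets its own:
    instanceId = UUID.randomUUID().toString();
}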

Python’s random module is also stateful:

import random
import secrets

from snapshot_restore_py import register_after_restore

@register_after_restore
def reseed_random():
    random.seed()  # Reseeds from OS entropy (os.urandom)

secrets (CSPRNG, always reseeded from the OS) is unaffected by snapshots — prefer it over random for any value that must be unique across invocations.

Node.js’s Math.random() and crypto.randomUUID():

crypto.randomUUID() uses the OS’s CSPRNG — safe across snapshots.
Math.random() is V8 internal state — affected by snapshots, but practical impact is small for most apps.

For anything security-sensitive, use crypto.randomUUID() or crypto.getRandomValues() — never Math.random().

Pro Tip: Audit your code for any “random” that you depend on being globally unique. If it uses pre-restore RNG state, fix it.
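The duplicate-sequence problem can be reproduced without Lambda at all. The sketch below is a simulation, not SnapStart itself: it assumes that copying a generator’s state with getstate() and setstate() is a fair stand-in for restoring the same snapshot twice, and simulate_restores is a hypothetical helper name.

```python
import random
import secrets


def simulate_restores(snapshot_state, copies=2, reseed=False):
    """Draw one 64-bit value from each simulated restore of the same RNG state."""
    values = []
    for _ in range(copies):
        rng = random.Random()
        rng.setstate(snapshot_state)  # every "restore" starts from identical state
        if reseed:
            # What an after-restore hook would do: pull fresh entropy from the OS.
            rng.seed(secrets.randbits(128))
        values.append(rng.getrandbits(64))
    return values


# State captured "at init", standing in for the snapshot.
snapshot_state = random.Random(1234).getstate()

duplicated = simulate_restores(snapshot_state)             # identical values
reseeded = simulate_restores(snapshot_state, reseed=True)  # independent values
```

Without reseeding, both simulated restores return the same number. With reseeding they diverge, which is the behavior you want from the real after-restore hook.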

Fix 4: Refresh Cached Time

If you cache System.currentTimeMillis() or Date.now() at init for “when this Lambda started,” that value is the snapshot time, not the current invocation:

private static final long STARTUP_TIME = System.currentTimeMillis();
// At snapshot: 2026-01-01 00:00:00
// At every restore: still 2026-01-01 00:00:00 (snapshot time)
// Don't use this for cache TTLs, log timestamps, etc.

Fix in afterRestore:

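private static volatile long restoreTime;

@Override
public void afterRestore(org.crac.Context<? extends Resource> context) {
    restoreTime = System.currentTimeMillis();
}

Now restoreTime is when this execution environment was restored, which is the meaningful “startup” time for everything that runs in it. It is not per-invocation: a warm environment serves many invocations after a single restore.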

For cache that should expire:

private static long CACHE_VALID_UNTIL = -1;
private static Result CACHED_RESULT;

public Result get() {
    if (System.currentTimeMillis() < CACHE_VALID_UNTIL) {
        return CACHED_RESULT;
    }
    CACHED_RESULT = fetch();
    CACHE_VALID_UNTIL = System.currentTimeMillis() + 60_000;
    return CACHED_RESULT;
}

This reads “now” at each call. The cache TTL is measured from the last fetch, not from snapshot — safe.
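The same pattern carries over to Python. The sketch below is one possible shape under stated assumptions: TtlCache is a hypothetical class, and the injectable clock parameter exists only so the snapshot gap can be simulated in a test. It uses wall-clock time on purpose, so the time that passed while the snapshot sat in storage counts against the TTL.

```python
import time


class TtlCache:
    """Cache one value and re-fetch once the TTL, measured from the last fetch, expires."""

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.time):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock  # read on every call, never cached at init
        self._valid_until = float("-inf")
        self._value = None

    def get(self):
        now = self._clock()
        if now < self._valid_until:
            return self._value
        # Expired (or never filled): fetch again and restart the TTL from now.
        self._value = self._fetch()
        self._valid_until = now + self._ttl
        return self._value
```

A value fetched during init is served until its TTL passes. After a restore that happens an hour later, the first get() sees that the deadline is in the past and fetches again.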

Fix 5: Reduce Snapshot Size (for Faster Restore)

Restore Duration covers loading the snapshot into a new execution environment and running your after-restore hooks. Larger snapshots and slower hooks both mean slower restores. To reduce:

Trim init. Lazy-load packages used by < 50% of invocations.
Avoid eager JIT in Java. Class Data Sharing (CDS) helps, but heavy class loading in static blocks increases snapshot size.
Skip pre-warming caches that don’t survive restore anyway. Pre-warming a DB connection just to throw it away in beforeCheckpoint wastes init time.
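Lazy loading from the first item in the list above can be as small as a cached import helper. This is a sketch under assumptions: lazy_import and handler are hypothetical names, and the standard-library csv module stands in for a heavy dependency that only some invocations need.

```python
import functools
import importlib
import io
import json


@functools.lru_cache(maxsize=None)
def lazy_import(module_name: str):
    """Import a module on first use and reuse the cached module afterwards."""
    return importlib.import_module(module_name)


def handler(event, context):
    if event.get("format") == "csv":
        # Loaded only on the rarely used path, so it stays out of the snapshot.
        csv = lazy_import("csv")
        buffer = io.StringIO()
        csv.writer(buffer).writerow(event["row"])
        return buffer.getvalue()
    return json.dumps(event["row"])
```

The trade-off is that the first invocation that needs the lazy module pays the import cost after restore, so this only makes sense for code paths that most invocations skip.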

For Java specifically, pass JVM options via the standard JAVA_TOOL_OPTIONS Lambda env var:

JAVA_TOOL_OPTIONS=-XX:TieredStopAtLevel=1 -XX:+UseSerialGC

These keep JIT compilation light and use a simpler garbage collector — faster init, smaller heap, smaller snapshot.

Pro Tip: Profile with aws lambda invoke --log-type Tail to see Init Duration, Restore Duration, Duration. The goal: Restore Duration < 200ms. Above that, your init is too heavy.
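With --log-type Tail, the invoke response carries the log tail as a base64-encoded LogResult field. A short parser can pull the timing fields out of the REPORT line. This is a sketch: parse_report and report_from_log_result are hypothetical helpers, and the field names are assumed to match the ones shown in the log excerpt at the top of this article.

```python
import base64
import re

# Longer names come first so "Billed Restore Duration" is not split apart.
_FIELD = re.compile(
    r"\b(Billed Restore Duration|Restore Duration|Billed Duration"
    r"|Init Duration|Duration): ([\d.]+) ms"
)


def parse_report(line: str) -> dict:
    """Map each duration field in a REPORT line to its value in milliseconds."""
    return {name: float(value) for name, value in _FIELD.findall(line)}


def report_from_log_result(log_result_b64: str) -> dict:
    """Decode a base64 LogResult and parse the first REPORT line found in it."""
    text = base64.b64decode(log_result_b64).decode("utf-8")
    for line in text.splitlines():
        if line.startswith("REPORT"):
            return parse_report(line)
    return {}
```

Feeding the result into a simple threshold check (for example, flagging any Restore Duration above 200) turns the Pro Tip above into something you can run in CI after each publish.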

Fix 6: Priming Code at Init

Code paths that run at handler time (first invocation) aren’t part of the snapshot — they cold-start. Move common logic to init so it’s captured:

public class App {
    private static final Database DB;
    private static final HttpClient HTTP;
    
    static {
        // Runs once at init, captured in snapshot
        DB = new Database();
        HTTP = HttpClient.newHttpClient();
    }

    public Response handleRequest(Request req) {
        // Fast because DB and HTTP are already constructed
        return DB.query(...);
    }
}

Same pattern in Python:

# Module-level — runs at init:
db = create_db_pool()
http_client = httpx.Client()

def handler(event, context):
    # Uses the pre-created pool
    return db.query(...)

Anything done in static {} (Java) / module scope (Python/Node) is part of the snapshot — restored fast. Anything in the handler function is per-invocation — adds to Duration.

Common Mistake: Initializing the DB inside the handler. Each invocation pays the connection cost. Move to init, refresh in restore hook.

Fix 7: Local Testing With SnapStart Behavior

SnapStart can’t be reproduced locally, because the snapshot and restore steps only exist inside AWS. You can still exercise the restore path against a deployed version:

# The snapshot is created when the version is published. Confirm it is ready:
aws lambda get-function-configuration --function-name my-app:prod \
  --query 'SnapStart.OptimizationStatus'

# Invoke many times. Warm environments are reused, so only some of these
# invocations will be cold starts that report a Restore Duration:
for i in {1..10}; do
  aws lambda invoke --function-name my-app:prod \
    --cli-binary-format raw-in-base64-out \
    --payload "{\"i\":$i}" /tmp/out-$i.json
done

# Compare timings via CloudWatch Logs.

For unit testing restore hooks, mock the snapshot lifecycle:

@Test
public void afterRestore_reconnects() throws Exception {
    var app = new App();
    // Call the hooks directly; they ignore the context argument, so null is fine.
    app.beforeCheckpoint(null);
    assertNull(app.getDbConn());
    app.afterRestore(null);
    assertNotNull(app.getDbConn());
}

Test that connections are dropped at checkpoint and re-established at restore.
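The same check works for Python hooks if the hook bodies live on a plain object that the test can call directly. The sketch below makes that assumption: ConnectionHolder and FakeConnection are hypothetical names, and in a real function you would register before_snapshot and after_restore with the runtime’s hook API rather than calling them yourself.

```python
class FakeConnection:
    """Stand-in for a database connection that records whether it was closed."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class ConnectionHolder:
    """Owns one connection and knows how to drop and re-create it."""

    def __init__(self, connect):
        self._connect = connect
        self.connection = connect()  # opened at init

    def before_snapshot(self):
        if self.connection is not None:
            self.connection.close()
            self.connection = None

    def after_restore(self):
        self.connection = self._connect()


def test_restore_cycle_replaces_connection():
    holder = ConnectionHolder(FakeConnection)
    original = holder.connection
    holder.before_snapshot()
    assert original.closed and holder.connection is None
    holder.after_restore()
    assert holder.connection is not None and holder.connection is not original
```

Injecting the connect callable keeps the test free of a real database while still proving that the pre-snapshot connection is never reused after restore.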

Fix 8: Pricing and Quotas

SnapStart adds cost:

Snapshot storage. Per GB-month for the snapshot data (small for most functions).
Restore time billed. Restore Duration is part of your billed time.
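Publishing a version creates the snapshot. Publishing takes longer than it does without SnapStart, because Lambda runs init and takes the snapshot before the version becomes active.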

Monitor:

# Per-version metrics are published under the Resource dimension (function:version):
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name Duration \
  --dimensions Name=FunctionName,Value=my-app Name=Resource,Value=my-app:5 \
  --start-time 2026-05-20T00:00:00Z \
  --end-time 2026-05-20T23:59:59Z \
  --period 3600 \
  --statistics Average Maximum

Compare versions with and without SnapStart. If SnapStart adds more cost than it saves in latency, it’s not worth it for that function (rare — usually a clear win for Java).

Pro Tip: For functions invoked < 1/minute, SnapStart’s snapshot storage may cost more than you save. For high-traffic functions, it’s almost always cheaper.

Still Not Working?

A few less-obvious failures:

Restore time is huge (>2 seconds). Snapshot is too big, or an after-restore hook is slow. Heavy class loading in Java is the usual cause; the Init Duration of the pre-SnapStart version is a rough proxy for how much initialized state the snapshot has to hold.
Function doesn’t honor SnapStart after update. Each new version requires a new snapshot. Confirm aws lambda get-function-configuration shows the right SnapStart.OptimizationStatus.
Java cold-start time still bad. Verify you’re calling the alias, not $LATEST. $LATEST always cold-starts.
Python/Node SnapStart features differ from Java. Some hooks are Java-only as of writing. Check AWS docs for current support matrix per runtime.
DynamoDB / RDS connections hang. Connection pool’s TCP keep-alive doesn’t survive the snapshot. Always close + reopen in restore hooks.
Provisioned Concurrency vs SnapStart. They’re different mechanisms. SnapStart is cheaper and broader; Provisioned Concurrency is closer to “always-on” but expensive. Compare both for your workload.
Logs show snapshot age. Some snapshots can be reused across deploys (rare). If you suspect stale snapshots, force a new publish.
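Layer changes need a new version. A snapshot belongs to one published version, and a version’s code and layers are immutable, so a layer update only takes effect, with its own snapshot, when you publish a new version. Also check the SnapStart limitations in the AWS docs before attaching Amazon EFS, which has been listed as unsupported.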
KMSAccessDeniedException on restore. A customer-managed KMS key encrypts the snapshot. Grant the execution role kms:Decrypt on that key, then republish to trigger a new snapshot under the corrected permission set.
VPC config differs between snapshot time and restore time. SnapStart snapshots the network state. Moving the function to a different VPC, subnet, or security group after creating the version invalidates assumptions in the restored process — re-publish the version after VPC changes so the snapshot reflects the current topology.
Provisioned Concurrency and SnapStart on the same version. AWS has documented these as mutually exclusive on a single function version, so check the current limitations before configuring both. Pick one: SnapStart for cost-sensitive bursty workloads, Provisioned Concurrency for absolute latency floors at fixed cost.
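The OptimizationStatus check mentioned in the list above can be scripted. The sketch below assumes a boto3 Lambda client is passed in (for example boto3.client("lambda")); snapstart_status is a hypothetical helper name, and the fallback to "Off" when the SnapStart key is missing is a defensive assumption rather than documented behavior.

```python
def snapstart_status(client, function_name: str, qualifier: str) -> str:
    """Return the SnapStart OptimizationStatus for a version or alias."""
    config = client.get_function_configuration(
        FunctionName=function_name,
        Qualifier=qualifier,
    )
    # Defensive default: treat a missing SnapStart block as "Off".
    return config.get("SnapStart", {}).get("OptimizationStatus", "Off")
```

Calling snapstart_status(client, "my-app", "prod") right after a deploy and failing the pipeline on anything other than "On" catches the case where traffic is quietly served without a snapshot.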

For related AWS Lambda and serverless performance issues, see AWS Lambda cold start timeout, AWS Lambda timeout, AWS Lambda layer not working, and AWS Lambda import module error.