Fix: AWS ECS Task Failed to Start

FixDevs

Quick Answer

How to fix ECS tasks that fail to start — port binding errors, missing IAM permissions, Secrets Manager access, essential container exit codes, and health check failures.

The Error

An ECS task fails to start and the service shows no running tasks:

CannotPullContainerError: ref pull has been retried 1 time(s): failed to pull
and unpack image "123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:latest":
failed to resolve reference: unexpected status code 403 Forbidden

Or the task starts and immediately stops:

Essential container in task exited
Exit Code: 1
Reason: Essential container in task exited

Or resource constraints prevent scheduling:

RESOURCE:MEMORY

Or port binding fails:

CannotStartContainerError: bind for 0.0.0.0:8080 failed: port is already allocated

Or Secrets Manager access is denied:

ResourceInitializationError: unable to pull secrets or registry auth:
execution resource retrieval failed: unable to retrieve secret from asm:
service call has been retried 1 time(s): AccessDeniedException

Why This Happens

ECS task failures have several distinct causes, and the error messages don’t always point directly to the root problem. The challenge with ECS is that a “task failed to start” can mean anything from an IAM misconfiguration to a VPC networking issue to a bug in your application code, and the failure surface is spread across the ECS console, CloudWatch Logs, and the EC2 or Fargate runtime.

The lifecycle of an ECS task has multiple failure points. First, ECS must pull the container image from a registry (ECR, Docker Hub, or a private registry). Then it must inject environment variables and secrets from Secrets Manager or Parameter Store. Next, it starts the container and begins health checking. Any failure at any step stops the task, but the error message you see in describe-tasks may describe a downstream symptom rather than the upstream cause. For example, a ResourceInitializationError about secrets might actually be a VPC endpoint misconfiguration — the task can’t reach Secrets Manager because there’s no route to the service endpoint.
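One way to keep these stages apart while triaging is to bucket the stop reason text by the lifecycle stage it most likely belongs to. The following is a minimal shell sketch, not an official mapping: the function name and stage labels are hypothetical, and the health check pattern assumes ECS phrases that reason as "failed ... health checks". As noted above, the bucket is only a starting point, since a secrets error can still be a networking problem.

```shell
#!/bin/sh
# Map an ECS stop reason (as printed by describe-tasks) to the lifecycle
# stage that most likely failed. Stage labels are illustrative only.
classify_stop_reason() {
  case "$1" in
    *CannotPullContainerError*)             echo "image-pull" ;;
    *ResourceInitializationError*)          echo "secrets-or-networking" ;;
    *CannotStartContainerError*)            echo "container-start" ;;
    *"Essential container in task exited"*) echo "application-exit" ;;
    *RESOURCE:*)                            echo "placement" ;;
    *failed*"health checks"*)               echo "health-check" ;;
    *)                                      echo "unknown" ;;
  esac
}
```

For example, `classify_stop_reason "RESOURCE:MEMORY"` prints `placement`, which points at Fix 5 rather than at the image or the application.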

Common triggers include: ECR pull failure (the task execution role lacks ecr:GetAuthorizationToken or the image URI is wrong), application crash on startup (the container exits immediately due to a missing env var or failed database connection), Secrets Manager / Parameter Store access denied (the task execution role needs explicit permission to retrieve secrets), port already allocated (using host network mode with a previous task holding the port), insufficient memory or CPU (the task’s resource settings exceed what’s available), health check failing (the container starts but the load balancer deregisters it), and missing taskRoleArn (the execution role handles ECS infrastructure while the task role handles application-level AWS access).

How Other Platforms Handle This

ECS task startup failures have direct analogs in every container orchestration platform. Understanding how each platform surfaces errors helps you diagnose issues faster, especially if you’re migrating between platforms or running a multi-cloud setup.

ECS Fargate vs ECS EC2: even within ECS, the two launch types fail differently. On EC2, resource constraints depend on the instance type and the container instances registered to the cluster. You see RESOURCE:MEMORY when no instance has enough free memory. On Fargate, resource constraints are about invalid CPU/memory combinations (e.g., 256 CPU with 4096 memory is invalid). Fargate also requires awsvpc network mode, which means every task gets its own ENI and private IP, so a subnet that has run out of free IP addresses can block task launches in a way that bridge-mode tasks on EC2 never hit. Fargate’s ResourceInitializationError for secrets is more common because Fargate tasks in private subnets need NAT Gateways or VPC endpoints to reach AWS service endpoints, whereas EC2 tasks can often reach these services through the instance’s own networking.

Kubernetes pods use a different model. A pod failure shows as CrashLoopBackOff (container exits and kubelet restarts it repeatedly), ImagePullBackOff (can’t pull the image), or Pending (can’t schedule due to resource constraints or node affinity). The Kubernetes equivalent of ECS’s task definition is the pod spec, and the equivalent of describe-tasks is kubectl describe pod. Kubernetes separates health checking into liveness probes (should the container be restarted?) and readiness probes (should the container receive traffic?). ECS has container health checks and, for services behind a load balancer, target group health checks, but no readiness-only state: a task that fails either check is stopped and replaced. This means an ECS task that’s slow to start gets killed and replaced, while Kubernetes would mark it as not ready but keep it running if only the readiness probe fails. Exit code semantics also differ: Kubernetes reports exit codes in kubectl describe pod and supports restartPolicy: Never for jobs, while an ECS service replaces the whole task when an essential container exits.

HashiCorp Nomad tasks use a task stanza within a group stanza. Nomad’s error surface is similar to ECS but simpler: Dead with a failure reason, visible via nomad alloc status. Nomad doesn’t have IAM roles — it uses Vault for secrets injection, which has its own failure modes (Vault token expired, policy doesn’t grant access to the secret path). The log driver in Nomad is configured per task, similar to ECS, but Nomad supports docker.logging with direct access to the Docker log driver — no equivalent of ECS’s awslogs driver that routes to CloudWatch.

Google Cloud Run auto-scales containers from zero. A Cloud Run service that fails to start shows errors in Cloud Logging. Cloud Run has a strict 4-minute startup timeout (configurable up to 60 minutes for second-gen): if your container doesn’t start listening on $PORT within that window, the revision fails. Unlike ECS where you specify the port in the task definition, Cloud Run injects the port via the PORT environment variable and expects your app to bind to it. This is a common migration pitfall: PORT defaults to 8080, so an ECS task hardcoded to any other port (3000, for example) won’t receive traffic on Cloud Run unless it reads $PORT or the service’s container port is changed to match. Cloud Run also doesn’t support sidecar containers in first-gen, while ECS supports multiple containers in a single task definition with essential and dependsOn semantics.

Fix 1: Check Stopped Task Error Details

ECS keeps the stop reason for recently stopped tasks (they stay visible for about an hour after stopping), so check soon after the failure. This is the first place to look:

# List recent stopped tasks in a service
aws ecs list-tasks \
  --cluster my-cluster \
  --service-name my-service \
  --desired-status STOPPED

# Get detailed stop reason for a specific task
aws ecs describe-tasks \
  --cluster my-cluster \
  --tasks arn:aws:ecs:us-east-1:123456789012:task/my-cluster/abc123def456

# Look for:
# - stoppedReason: "Essential container in task exited"
# - containers[].exitCode: 1
# - containers[].reason: "CannotPullContainerError..."
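To avoid scanning the full describe-tasks JSON for those three fields, the call can be narrowed with a JMESPath `--query`. The wrapper below is a small sketch: `show_stop_details` is a hypothetical helper name, and it assumes the AWS CLI is installed and configured for the right account and region.

```shell
#!/bin/sh
# Print the stop code, stop reason, and per-container exit details for one task.
# Usage: show_stop_details <cluster> <task-id-or-arn>
show_stop_details() {
  aws ecs describe-tasks \
    --cluster "$1" \
    --tasks "$2" \
    --query 'tasks[].{stopCode:stopCode,stoppedReason:stoppedReason,containers:containers[].{name:name,exitCode:exitCode,reason:reason}}' \
    --output json
}
```

Called as `show_stop_details my-cluster abc123def456`, it returns only the fields listed above, which makes it easier to compare several stopped tasks side by side.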

In the ECS console:

Navigate to your cluster → Service → Tasks tab → filter by “Stopped” → click the task → expand the container to see the exit code and stop reason.

Check CloudWatch Logs for the application error:

# Get logs from the last task run
aws logs get-log-events \
  --log-group-name /ecs/my-service \
  --log-stream-name ecs/my-container/abc123def456 \
  --limit 100 \
  --start-from-head

Fix 2: Fix ECR Image Pull Failures

The ECS task execution role needs ECR permissions to pull the image:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage"
      ],
      "Resource": "*"
    }
  ]
}

Attach the managed policy (easiest):

aws iam attach-role-policy \
  --role-name ecsTaskExecutionRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy

Verify the image URI is correct:

# List images in ECR repository
aws ecr list-images \
  --repository-name my-app \
  --region us-east-1

# The task definition image should match exactly:
# 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest
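A quick format check can catch a malformed URI (for instance an account ID that is not 12 digits) before ECS ever attempts the pull. The function below is a sketch under stated assumptions: `parse_ecr_uri` is a hypothetical helper, it only handles tag references (not `@sha256:` digests), and it validates the shape of the URI, not whether the image exists.

```shell
#!/bin/sh
# Check that an image reference looks like a tagged ECR URI:
#   <12-digit-account>.dkr.ecr.<region>.amazonaws.com/<repo>[:<tag>]
# Prints "account region repo tag" on success, returns 1 otherwise.
parse_ecr_uri() {
  uri=$1
  registry=${uri%%/*}
  path=${uri#*/}
  [ "$registry" != "$uri" ] || return 1      # no "/" in the reference

  account=${registry%%.*}
  case "$account" in
    ""|*[!0-9]*) return 1 ;;                 # digits only
  esac
  [ "${#account}" -eq 12 ] || return 1       # AWS account IDs are 12 digits

  case "$registry" in
    "$account".dkr.ecr.*.amazonaws.com) ;;
    *) return 1 ;;
  esac
  region=${registry#"$account".dkr.ecr.}
  region=${region%.amazonaws.com}

  case "$path" in
    *:*) repo=${path%:*}; tag=${path##*:} ;;
    *)   repo=$path;      tag=latest ;;      # Docker's default tag
  esac
  [ -n "$repo" ] && [ -n "$tag" ] || return 1
  echo "$account $region $repo $tag"
}
```

The printed repository name and region can then be passed to the `aws ecr list-images` call above to confirm the tag is actually present.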

For private registries (not ECR):

Add registry credentials to Secrets Manager and reference them in the task definition:

{
  "containerDefinitions": [{
    "name": "my-container",
    "image": "registry.example.com/my-app:latest",
    "repositoryCredentials": {
      "credentialsParameter": "arn:aws:secretsmanager:us-east-1:123456789012:secret:registry-credentials"
    }
  }]
}

Fix 3: Fix Secrets Manager Access

When task definitions reference secrets from Secrets Manager or Parameter Store, the task execution role (not the task role) needs access:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "secretsmanager:GetSecretValue"
      ],
      "Resource": "arn:aws:secretsmanager:us-east-1:123456789012:secret:my-app/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:GetParameters",
        "ssm:GetParameter"
      ],
      "Resource": "arn:aws:ssm:us-east-1:123456789012:parameter/my-app/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "kms:Decrypt"
      ],
      "Resource": "arn:aws:kms:us-east-1:123456789012:key/your-kms-key-id"
    }
  ]
}

Task definition with secrets:

{
  "containerDefinitions": [{
    "name": "my-app",
    "image": "my-image:latest",
    "secrets": [
      {
        "name": "DATABASE_URL",
        "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:my-app/database-url"
      },
      {
        "name": "API_KEY",
        "valueFrom": "arn:aws:ssm:us-east-1:123456789012:parameter/my-app/api-key"
      }
    ]
  }]
}

Note: The secrets block in a task definition is resolved by ECS before the container starts. If ECS can’t retrieve a secret, the task fails with ResourceInitializationError — the application code never runs.
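Because a single secrets block can mix Secrets Manager and Parameter Store references, it helps to know which IAM action each valueFrom entry depends on. The sketch below maps an ARN to the action from the policy above; `required_secret_action` is a hypothetical helper, and it assumes every valueFrom is a full ARN rather than a bare parameter name.

```shell
#!/bin/sh
# Given a "valueFrom" ARN from a task definition's secrets block, print the
# IAM action the task execution role needs in order to resolve it.
required_secret_action() {
  case "$1" in
    arn:aws:secretsmanager:*)  echo "secretsmanager:GetSecretValue" ;;
    arn:aws:ssm:*:parameter/*) echo "ssm:GetParameters" ;;
    *) echo "unknown"; return 1 ;;
  esac
}
```

If the secret or parameter is encrypted with a customer managed KMS key, kms:Decrypt on that key is needed in addition, as in the policy shown earlier.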

Fix 4: Fix Application Startup Crashes (Exit Code 1)

If the container starts and immediately exits with code 1 (or any non-zero code), the application is crashing before it can serve requests:

# Check the application logs immediately after the crash
# (date -d is GNU date syntax; on macOS/BSD use: date -v-30M +%s000)
aws logs filter-log-events \
  --log-group-name /ecs/my-service \
  --filter-pattern "ERROR" \
  --start-time "$(date -d '30 minutes ago' +%s000)"
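The `date -d` form in the command above is specific to GNU date. A portable alternative is to compute the millisecond timestamp with shell arithmetic. This is a small sketch; `ms_ago` is a hypothetical helper name, and it assumes a `date` that supports `+%s` (both GNU and BSD date do).

```shell
#!/bin/sh
# Print the epoch time N minutes ago in milliseconds, the unit that
# `aws logs filter-log-events --start-time` expects.
ms_ago() {
  echo $(( ($(date +%s) - $1 * 60) * 1000 ))
}

# Usage with the command above:
#   --start-time "$(ms_ago 30)"
```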

Common startup crash causes:

Missing required environment variable:

# Application logs show:
# Error: Required environment variable DATABASE_URL is not set
# Process exited with code 1

Fix: add the missing variable to the task definition’s environment or secrets block.

Database connection failure at startup:

# Application logs show:
# FATAL: could not connect to server: Connection refused
# Process exited with code 1

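Fix: check security group rules. The task’s security group must allow outbound traffic to the database’s port, and the database’s security group must allow inbound traffic on that port from the task’s security group. Also check that the database hostname is reachable from within the VPC.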

Wrong port configuration:

# Application listens on port 3000 but task definition maps port 8080
# Health check hits port 8080 → no response → task marked unhealthy

Fix: ensure the containerPort in the task definition matches the port your application listens on. A hostPort of 0 requests a dynamic host port and applies to bridge mode only; with awsvpc mode (which includes every Fargate task), omit hostPort or set it equal to containerPort:

{
  "portMappings": [{
    "containerPort": 3000,
    "hostPort": 0,
    "protocol": "tcp"
  }]
}

Test the Docker image locally before deploying:

# Simulate the ECS environment locally
docker run --rm \
  -e DATABASE_URL=postgres://... \
  -e API_KEY=test \
  -p 3000:3000 \
  123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest

Fix 5: Fix Resource Constraints

If ECS can’t place a task because of insufficient memory or CPU, the service stays below its desired count and the placement failure is recorded in the service events:

# Check service events for placement failures
aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query 'services[0].events[:10]'

# Look for:
# "service my-service was unable to place a task because no container
#  instance met all of its requirements. The closest matching instance
#  had insufficient memory available."

For EC2 launch type — check instance resources:

# List container instances and their available resources
aws ecs list-container-instances --cluster my-cluster

aws ecs describe-container-instances \
  --cluster my-cluster \
  --container-instances $(aws ecs list-container-instances --cluster my-cluster --query 'containerInstanceArns[]' --output text) \
  --query 'containerInstances[*].{id:ec2InstanceId, cpu:remainingResources[?name==`CPU`].integerValue|[0], mem:remainingResources[?name==`MEMORY`].integerValue|[0]}'

Fix: reduce task memory/CPU or scale up the cluster. In this example, cpu 256 is 0.25 vCPU and memory 512 is 512 MB; lower these if tasks are competing for resources, or add larger instances:

{
  "cpu": "256",
  "memory": "512",
  "requiresCompatibilities": ["FARGATE"]
}

For Fargate — ensure the cpu/memory combination is valid. Fargate only supports specific combinations:

CPU               Valid memory values (MiB)
256 (.25 vCPU)    512, 1024, 2048
512 (.5 vCPU)     1024-4096 (in 1024 increments)
1024 (1 vCPU)     2048-8192 (in 1024 increments)
2048 (2 vCPU)     4096-16384 (in 1024 increments)
4096 (4 vCPU)     8192-30720 (in 1024 increments)
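The combinations listed above can be checked before registering a task definition. The function below is a sketch that encodes only those five CPU rows; `is_valid_fargate_size` is a hypothetical helper name, and any larger sizes Fargate may offer are deliberately not covered.

```shell
#!/bin/sh
# Return 0 if <cpu-units> <memory-MiB> is one of the combinations listed
# above, 1 otherwise.
is_valid_fargate_size() {
  cpu=$1
  mem=$2
  case "$cpu" in
    256)  min=512;  max=2048 ;;
    512)  min=1024; max=4096 ;;
    1024) min=2048; max=8192 ;;
    2048) min=4096; max=16384 ;;
    4096) min=8192; max=30720 ;;
    *) return 1 ;;
  esac
  case "$mem" in ""|*[!0-9]*) return 1 ;; esac   # memory must be a number
  [ "$mem" -ge "$min" ] && [ "$mem" -le "$max" ] || return 1
  if [ "$cpu" -eq 256 ]; then
    # 256 CPU units allow exactly 512, 1024, or 2048
    case "$mem" in 512|1024|2048) return 0 ;; *) return 1 ;; esac
  fi
  [ $(( mem % 1024 )) -eq 0 ]                    # other rows step by 1024
}
```

For instance, `is_valid_fargate_size 256 4096` fails, which matches the invalid combination mentioned earlier in this article.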

Fix 6: Fix Health Check Failures

A task that starts successfully but fails load balancer health checks is repeatedly stopped and replaced:

# Check target group health in the ECS console or CLI
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:...

# Look for:
# "State": "unhealthy",
# "Reason": "Target.FailedHealthChecks"

Common health check fixes:

Task definition health check. startPeriod gives the app time to start before failed checks count, and the command runs inside the container, so curl must be present in the image:

{
  "healthCheck": {
    "command": ["CMD-SHELL", "curl -f http://localhost:3000/health || exit 1"],
    "interval": 30,
    "timeout": 5,
    "retries": 3,
    "startPeriod": 60
  }
}

ALB target group health check settings:

aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:... \
  --health-check-path /health \
  --health-check-interval-seconds 30 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3 \
  --health-check-timeout-seconds 10

Real-world scenario: A Node.js application takes 45 seconds to warm up (loading models, establishing DB connections). The default health check starts after 0 seconds with a 3-failure threshold. The app fails 3 checks before it’s ready, and ECS kills it. Setting startPeriod: 60 in the task definition health check gives the app time to initialize before failures count against it.
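The startPeriod setting covers the container health check. When the failing check is the load balancer's, the comparable setting is the service's health check grace period, during which the scheduler ignores load balancer health check results for newly started tasks. A minimal sketch follows; `set_health_check_grace` is a hypothetical wrapper name and the 120-second value in the usage line is only an example.

```shell
#!/bin/sh
# Tell the service scheduler to ignore load balancer health check results
# for the first N seconds after a task starts.
# Usage: set_health_check_grace <cluster> <service> <seconds>
set_health_check_grace() {
  aws ecs update-service \
    --cluster "$1" \
    --service "$2" \
    --health-check-grace-period-seconds "$3"
}

# Example: set_health_check_grace my-cluster my-service 120
```

Pick a value somewhat longer than the slowest observed startup, since too short a grace period reproduces the kill-and-replace loop described above.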

Still Not Working?

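Enable ECS Exec to get a shell in a running container. This only helps when the task stays up long enough to connect. The task role also needs the ssmmessages permissions that ECS Exec relies on, and tasks that were already running when you enabled it must be replaced (for example by a new deployment) before you can connect: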

# Enable ECS Exec on the service
aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --enable-execute-command

# Connect to a running task
aws ecs execute-command \
  --cluster my-cluster \
  --task <task-id> \
  --container my-container \
  --interactive \
  --command "/bin/sh"

Check the task execution role trust policy — the execution role must trust ecs-tasks.amazonaws.com:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Service": "ecs-tasks.amazonaws.com"
    },
    "Action": "sts:AssumeRole"
  }]
}

Check VPC networking for Fargate tasks. Fargate tasks need a public IP, a NAT Gateway, or the VPC endpoints listed below to pull images from ECR and reach other AWS services:

# Fargate task in a private subnet needs NAT Gateway
# Or use VPC endpoints for ECR, S3, and CloudWatch Logs

# Required VPC endpoints for fully private Fargate:
# - com.amazonaws.<region>.ecr.api
# - com.amazonaws.<region>.ecr.dkr
# - com.amazonaws.<region>.s3 (Gateway endpoint)
# - com.amazonaws.<region>.logs
# - com.amazonaws.<region>.secretsmanager (if using Secrets Manager)
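To compare that list against what a VPC actually has, it can help to print the expected service names for a region. This is a sketch only: `fargate_vpc_endpoints` is a hypothetical helper, and it covers just the endpoints named above.

```shell
#!/bin/sh
# Print the VPC endpoint service names listed above for a region.
# Pass "secrets" as the second argument to include Secrets Manager.
fargate_vpc_endpoints() {
  region=$1
  for svc in ecr.api ecr.dkr s3 logs; do
    echo "com.amazonaws.$region.$svc"
  done
  if [ "${2:-}" = "secrets" ]; then
    echo "com.amazonaws.$region.secretsmanager"
  fi
}
```

The output can be checked against `aws ec2 describe-vpc-endpoints --query 'VpcEndpoints[].ServiceName'` for the VPC in question; any name that is missing there is a likely cause of a pull or secrets timeout.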

Check for essential container dependency ordering. If your task definition has multiple containers, every container marked essential: true must stay running: when any one of them stops, ECS stops the whole task. A crashing sidecar can therefore take down the entire task. Use dependsOn with condition: HEALTHY to ensure containers start in the right order:

{
  "containerDefinitions": [
    {
      "name": "app",
      "essential": true,
      "dependsOn": [{ "containerName": "envoy", "condition": "HEALTHY" }]
    },
    {
      "name": "envoy",
      "essential": true,
      "healthCheck": { "command": ["CMD-SHELL", "curl -f http://localhost:9901/ready || exit 1"] }
    }
  ]
}

Decode exit codes. Exit code 137 means the container received SIGKILL (128 + 9), most often from the OOM killer (out of memory) but also when a container ignores SIGTERM and is force-killed after the stop timeout. Exit code 139 is a segfault (SIGSEGV). Exit code 143 means the container received SIGTERM (normal shutdown). Exit code 1 is a generic application error, so check CloudWatch Logs for the actual error message. If there are no logs at all, the log driver configuration is likely wrong (missing CloudWatch Logs permissions on the execution role, or the log group doesn’t exist).
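The codes above follow the common convention that an exit code over 128 is 128 plus the signal number. A small sketch that applies this rule is shown below; `describe_exit_code` is a hypothetical helper name, and the descriptions are shorthand, not ECS output.

```shell
#!/bin/sh
# Translate a container exit code into a short description.
# Codes above 128 follow the convention 128 + signal number.
describe_exit_code() {
  code=$1
  case "$code" in
    0)   echo "clean exit" ;;
    1)   echo "generic application error" ;;
    137) echo "SIGKILL (signal 9): often the OOM killer" ;;
    139) echo "SIGSEGV (signal 11): segmentation fault" ;;
    143) echo "SIGTERM (signal 15): requested shutdown" ;;
    *)
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "terminated by signal $(( code - 128 ))"
      else
        echo "application-defined exit code $code"
      fi ;;
  esac
}
```

It pairs naturally with the exitCode field from describe-tasks when scanning several stopped tasks at once.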

Check platform version for Fargate. LATEST has not always resolved to the most recent platform version, so pin platformVersion: "1.4.0" explicitly on the service or in the run-task call (it is not a task definition field). Older platform versions have known issues with ENI attachment, EFS mounts, and ECS Exec support.

For related AWS issues, see Fix: AWS ECR Authentication Failed, Fix: AWS CloudWatch Logs Not Appearing, Fix: AWS Lambda Timeout, and Fix: AWS IAM AccessDeniedException.

FixDevs

Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.
