Fix: AWS Step Functions Not Working — ASL Syntax, Map State, Error Handling, and IAM
Quick Answer
How to fix AWS Step Functions errors — Amazon States Language syntax, Standard vs Express workflows, Distributed Map for large datasets, Retry/Catch error handling, Lambda invoke optimization, and IAM execution role permissions.
The Error
You start an execution and it fails at runtime on a missing input field:
States.Runtime: An error occurred while executing the state.
The JSONPath '$.user.id' specified for the field 'InputPath' could
not be found in the input
Or a Lambda task throws and the workflow doesn’t catch it:
ExecutionFailed: States.TaskFailed in state 'CallLambda'
Function returned an error, but no Catch was defined.
Or a Map state iterating over 100,000 items times out:
States.MapStateFailed: Map state execution exceeded the maximum
number of concurrent iterations.
Or IAM denies access to a downstream service:
States.Runtime: AccessDenied — User: arn:aws:iam::123456789012:role/StepFunctions-Role
is not authorized to perform: lambda:InvokeFunction
Why This Happens
Step Functions orchestrates AWS services through a JSON-based DSL called Amazon States Language (ASL). Every state machine is a graph of named states with explicit transitions, error handling, and input/output transformations. The benefit is that you get retry, parallelism, and durable state without building them yourself; the cost is that ASL is strict, and its error messages often refer to constructs (JSONPath, error class names, intrinsic functions) that only make sense once you know ASL internals.
The main failure categories are predictable. ASL validation is unforgiving — missing End or Next on a state fails at deploy time, and JSONPath references that don’t exist in the input throw States.Runtime at execution time with messages that look like a typo but are actually missing fields. Standard versus Express is not a knob you can flip later: Standard workflows last up to a year, are billed per state transition, and persist full execution history; Express are capped at 5 minutes, billed per execution plus duration, and don’t store history unless you wire CloudWatch Logs. Map state has two completely different runtime models — inline Map runs ~40 concurrent iterations with a 256 KB payload limit, while Distributed Map streams from S3 and can handle millions of items. And IAM is layered: the state machine’s execution role needs explicit permission for every downstream service ARN, plus permission to start child executions for Distributed Map.
The harder traps are the silent ones. A retry policy that matches States.ALL also retries States.Timeout errors that should bubble up. A Choice state without a Default causes the execution to fail when no branch matches, instead of falling through. A Lambda Task with no Retry block fails the execution on the first Lambda.TooManyRequestsException, because Step Functions retries nothing unless the state declares it, whichever integration ARN you use (arn:aws:lambda:...:function:Foo or arn:aws:states:::lambda:invoke). Most “Step Functions not working” reports are really one of these subtle defaults biting in production after passing dev tests.
Diagnostic Timeline
Walk through a real “my Map state finishes 100 iterations and then silently stops” failure.
Minute 0 — first suspicion: fix the JSON. The obvious move is to re-edit the ASL, re-validate in the visual editor, and redeploy. The JSON has always been valid — ASL validation only catches structural errors, not the semantic limits you’re hitting.
Minute 3 — first evidence: check the execution history. Open the Step Functions console, find the execution, click on the Map state. The history shows “MapIterationsSucceeded” reaching about 40 and then… nothing. No error, no failed iteration. That cap is the giveaway: inline Map silently throttles at ~40 concurrent iterations regardless of what you set in MaxConcurrency.
Minute 6 — next check: ASL escape rules. Open the failed iteration’s input. It contains a string with embedded quotes. ASL parses JSONPath inside strings, and unescaped quotes confuse the parser. The iteration didn’t fail with a useful error — it produced a malformed Payload to Lambda, which crashed silently. Escape the inner quotes or use States.JsonToString and States.StringToJson intrinsic functions to round-trip the value.
Minute 9 — discriminating evidence: Standard vs Express. Look at the workflow type. This is a Standard workflow running a Map of 5-minute Lambda calls, each costing one state transition per iteration. At 100K items, you’d burn 100K+ state transitions — billed at $0.025 per 1K. The cost is fine, but the per-state history storage is the real bottleneck: Standard execution history caps at 25K events. Your Map exceeded the history cap and the execution silently truncated.
Minute 12 — actual root cause: switch to Distributed Map. Inline Map accumulates results in memory and reports each iteration in execution history. Distributed Map runs each iteration as a child execution (Express, ideally), streams input from S3, and writes results back to S3. The parent execution sees only an aggregate. Convert the state to ProcessorConfig.Mode: DISTRIBUTED with ExecutionType: EXPRESS, point ItemReader at an S3 prefix, give the IAM role states:StartExecution on the child workflow ARN, and the Map runs all 100K items in parallel without hitting the history cap.
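The Minute 9 arithmetic can be checked with a few lines of Python. This is a rough sketch: the price and the history cap are the figures quoted above, while `events_per_iteration` is an assumed value that you would need to measure from a real execution.

```python
# Rough arithmetic behind the Minute 9 check (Standard workflow, inline Map).
PRICE_PER_1K_TRANSITIONS = 0.025  # USD, figure quoted in the text
HISTORY_EVENT_CAP = 25_000        # Standard execution history cap from the text


def transition_cost(transitions: int) -> float:
    """Cost in USD for a number of Standard state transitions."""
    return transitions / 1000 * PRICE_PER_1K_TRANSITIONS


def iterations_before_history_cap(events_per_iteration: int) -> int:
    """How many Map iterations fit under the history cap.

    events_per_iteration is an assumption, not a documented constant.
    """
    return HISTORY_EVENT_CAP // events_per_iteration


cost = transition_cost(100_000)
fits = iterations_before_history_cap(5)  # assumed: 5 history events per iteration
print(f"100K transitions cost about ${cost:.2f}")
print(f"History cap reached after about {fits} iterations")
```

Under that assumption the bill is small, but the history cap is reached long before 100K iterations, which matches the diagnosis above.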
Fix 1: Write Valid ASL
{
  "Comment": "Process an order",
  "StartAt": "ValidateOrder",
  "States": {
    "ValidateOrder": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "arn:aws:lambda:us-east-1:123456789012:function:ValidateOrder",
        "Payload.$": "$"
      },
      "ResultPath": "$.validation",
      "Next": "ProcessPayment"
    },
    "ProcessPayment": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "arn:aws:lambda:us-east-1:123456789012:function:ProcessPayment",
        "Payload": {
          "orderId.$": "$.orderId",
          "amount.$": "$.validation.Payload.amount"
        }
      },
      "ResultPath": "$.payment",
      "End": true
    }
  }
}
Three required pieces per state:
- Type: Task, Choice, Wait, Map, Parallel, Pass, Succeed, Fail.
- Next or End: what comes after. Every non-terminal state needs Next: "...".
- Resource: for Task, the ARN of the service integration.
Common JSONPath patterns:
- "Payload.$": "$" passes the entire input.
- "orderId.$": "$.orderId" pulls a specific field. The key suffix .$ is required for JSONPath references.
- ResultPath: "$.payment" sets where the state’s output is merged into the state document.
- OutputPath: "$.payment" discards everything else and emits just this path.
Pro Tip: Use the visual editor in the AWS Console to design the state machine, then export the ASL. The validator catches missing transitions and syntax errors before you deploy.
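The structural rules above can also be checked locally before deploying. The following is a minimal sketch of such a check, not the validator AWS runs; `lint_states` is a hypothetical helper and it covers only the three required pieces listed above.

```python
import json

# State types that take neither Next nor End: Choice routes through its
# rules, and Succeed and Fail are terminal.
NO_TRANSITION_TYPES = {"Choice", "Succeed", "Fail"}


def lint_states(definition: dict) -> list[str]:
    """Return a list of structural problems found in an ASL definition."""
    problems = []
    states = definition.get("States", {})
    if definition.get("StartAt") not in states:
        problems.append("StartAt does not name a state")
    for name, state in states.items():
        state_type = state.get("Type")
        if state_type is None:
            problems.append(f"{name}: missing Type")
            continue
        if state_type in NO_TRANSITION_TYPES:
            continue
        has_next = "Next" in state
        has_end = state.get("End") is True
        if has_next == has_end:
            problems.append(f"{name}: needs exactly one of Next or End")
        if state_type == "Task" and "Resource" not in state:
            problems.append(f"{name}: Task is missing Resource")
    return problems


broken = json.loads(
    '{"StartAt": "A", "States": {"A": {"Type": "Task", '
    '"Resource": "arn:aws:states:::lambda:invoke"}}}'
)
print(lint_states(broken))
```

Running a check like this in CI catches the missing-transition class of errors without a round trip to the console.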
Fix 2: Standard vs Express Workflows
Pick based on duration and history needs:
- Standard — up to 1 year, full execution history, $0.025 per 1K state transitions. For long-running orchestration, human approvals, audit trails.
- Express — up to 5 minutes, billed per execution + duration, no history (logs to CloudWatch if enabled). For high-throughput API backends, event processing.
Set the type at create time:
# SAM template:
Resources:
  MyStateMachine:
    Type: AWS::Serverless::StateMachine
    Properties:
      DefinitionUri: state-machine.asl.json
      Type: STANDARD # or EXPRESS
      Role: !GetAtt MyRole.Arn
You can’t convert between Standard and Express in place; create a new state machine.
For Express workflows that need history:
LoggingConfiguration:
  Level: ALL
  IncludeExecutionData: true
  Destinations:
    - CloudWatchLogsLogGroup:
        LogGroupArn: !GetAtt MyLogGroup.Arn
This sends Express execution logs to CloudWatch, which replaces the missing in-product history.
Common Mistake: Using Express for workflows with human approval steps. Express has a 5-minute hard limit; humans take longer than 5 minutes. Use Standard.
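The selection rule above fits in a few lines. This is a sketch of the decision only, using the 5-minute limit from the text; `pick_workflow_type` and its parameters are hypothetical names.

```python
EXPRESS_MAX_SECONDS = 5 * 60  # Express hard limit quoted in the text


def pick_workflow_type(max_duration_seconds: int, needs_console_history: bool) -> str:
    """Return the workflow type for the Type property of the state machine."""
    if max_duration_seconds > EXPRESS_MAX_SECONDS or needs_console_history:
        return "STANDARD"
    return "EXPRESS"


# A human approval can take a day, so it cannot run as Express.
print(pick_workflow_type(max_duration_seconds=24 * 3600, needs_console_history=True))
# A short event-processing step with CloudWatch logging can.
print(pick_workflow_type(max_duration_seconds=30, needs_console_history=False))
```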
Fix 3: Map State — Inline vs Distributed
Inline Map (default) handles up to ~40 concurrent iterations, with a 256 KB payload limit:
{
  "MyMap": {
    "Type": "Map",
    "ItemsPath": "$.items",
    "MaxConcurrency": 10,
    "Iterator": {
      "StartAt": "ProcessItem",
      "States": {
        "ProcessItem": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:::function:ProcessItem",
          "End": true
        }
      }
    },
    "ResultPath": "$.results",
    "End": true
  }
}
For larger datasets, use Distributed Map. It reads input from S3 and can iterate over millions of items:
{
  "MyMap": {
    "Type": "Map",
    "ItemReader": {
      "Resource": "arn:aws:states:::s3:listObjectsV2",
      "Parameters": {
        "Bucket": "my-bucket",
        "Prefix": "items/"
      }
    },
    "ItemProcessor": {
      "ProcessorConfig": {
        "Mode": "DISTRIBUTED",
        "ExecutionType": "EXPRESS"
      },
      "StartAt": "ProcessItem",
      "States": {
        "ProcessItem": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:::function:ProcessItem",
          "End": true
        }
      }
    },
    "MaxConcurrency": 1000,
    "ResultWriter": {
      "Resource": "arn:aws:states:::s3:putObject",
      "Parameters": {
        "Bucket": "my-bucket",
        "Prefix": "results/"
      }
    },
    "End": true
  }
}
Distributed Map:
- Mode: "DISTRIBUTED" in ProcessorConfig switches from inline.
- ExecutionType: "EXPRESS" is recommended for high throughput; iteration sub-workflows run as Express.
- ItemReader can be an S3 object listing (listObjectsV2) or an S3 GetObject read of a CSV, JSON, or JSONL file.
- ResultWriter persists results to S3, which avoids accumulating results in the parent execution’s state.
Pro Tip: For batch jobs over thousands of items, always use Distributed Map. Inline Map silently caps concurrency and accumulates results in memory — slow and OOM-prone.
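One way to decide between the two modes before a run is to measure the item list against the 256 KB limit mentioned above. The sketch below assumes the limit applies to the UTF-8 size of the serialized JSON; `payload_bytes` and `pick_map_mode` are hypothetical helpers, and the size measured by the service may differ slightly.

```python
import json

PAYLOAD_LIMIT_BYTES = 256 * 1024  # payload limit quoted in the text


def payload_bytes(value) -> int:
    """Approximate payload size as the UTF-8 length of the serialized JSON."""
    return len(json.dumps(value).encode("utf-8"))


def pick_map_mode(items: list) -> str:
    """Suggest INLINE when the items fit in one payload, else DISTRIBUTED."""
    if payload_bytes(items) <= PAYLOAD_LIMIT_BYTES:
        return "INLINE"
    return "DISTRIBUTED"


small = [{"id": i} for i in range(100)]
large = [{"id": i} for i in range(100_000)]
print(pick_map_mode(small), payload_bytes(small))
print(pick_map_mode(large), payload_bytes(large))
```

Size is only one of the two criteria; a list that fits in the payload but needs more than about 40 concurrent iterations is still a candidate for Distributed Map.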
Fix 4: Error Handling With Retry and Catch
For transient errors, retry:
{
  "MyTask": {
    "Type": "Task",
    "Resource": "arn:aws:lambda:::function:MyFunction",
    "Retry": [
      {
        "ErrorEquals": ["Lambda.ServiceException", "Lambda.TooManyRequestsException"],
        "IntervalSeconds": 2,
        "MaxAttempts": 3,
        "BackoffRate": 2,
        "JitterStrategy": "FULL"
      }
    ],
    "Catch": [
      {
        "ErrorEquals": ["States.TaskFailed"],
        "Next": "HandleFailure",
        "ResultPath": "$.error"
      }
    ],
    "Next": "Success"
  }
}
Retry:
- ErrorEquals: list of error types to retry. States.ALL catches everything.
- IntervalSeconds: initial backoff.
- MaxAttempts: total retries (not including initial attempt).
- BackoffRate: multiplier per attempt (2 = exponential).
- JitterStrategy: "FULL" adds random jitter (recommended for distributed retries).
Catch:
- Runs when retries are exhausted (or the error doesn’t match Retry.ErrorEquals).
- ResultPath: "$.error" stores the error info at this path for the next state to inspect.
- Next: the failure-handling state.
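To see what a Retry block means in wall-clock terms, the delay before each retry can be computed from IntervalSeconds, MaxAttempts, and BackoffRate. This is a sketch under one assumption: that FULL jitter draws each delay uniformly between 0 and the computed backoff. `retry_delays` is a hypothetical helper, not part of any AWS SDK.

```python
import random


def retry_delays(interval_seconds, max_attempts, backoff_rate,
                 full_jitter=False, rng=None):
    """Return the delay in seconds before each retry attempt."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_attempts):
        # Exponential backoff: interval * rate^attempt.
        ceiling = interval_seconds * backoff_rate ** attempt
        # Assumed FULL jitter model: uniform between 0 and the ceiling.
        delays.append(rng.uniform(0, ceiling) if full_jitter else ceiling)
    return delays


# The Retry block above: IntervalSeconds 2, MaxAttempts 3, BackoffRate 2.
print(retry_delays(2, 3, 2))                    # [2, 4, 8]
print(retry_delays(2, 3, 2, full_jitter=True))  # random values under those ceilings
```

Without jitter the task waits up to 14 seconds in total before Catch takes over, which is worth comparing against the state’s timeout.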
In the failure handler, you can read the error:
{
  "HandleFailure": {
    "Type": "Pass",
    "Parameters": {
      "errorType.$": "$.error.Error",
      "errorMessage.$": "$.error.Cause"
    },
    "Next": "Notify"
  }
}
Common Mistake: Catching States.ALL without inspecting the error. You handle exceptions you shouldn’t (like States.Timeout that should bubble up). Be specific about what you catch.
Fix 5: Choice State Syntax
{
  "RouteByStatus": {
    "Type": "Choice",
    "Choices": [
      {
        "Variable": "$.order.status",
        "StringEquals": "pending",
        "Next": "ProcessPending"
      },
      {
        "Variable": "$.order.status",
        "StringEquals": "confirmed",
        "Next": "ProcessConfirmed"
      },
      {
        "And": [
          { "Variable": "$.order.amount", "NumericGreaterThan": 100 },
          { "Variable": "$.order.region", "StringEquals": "US" }
        ],
        "Next": "HighValueUSFlow"
      }
    ],
    "Default": "ProcessUnknown"
  }
}
Operators:
- StringEquals, StringMatches, StringGreaterThan, etc.: string comparisons.
- NumericEquals, NumericGreaterThan, etc.: number comparisons.
- BooleanEquals: boolean.
- TimestampLessThan, TimestampLessThanEqualsPath, etc.: timestamp.
- IsPresent, IsString, IsNumeric, IsBoolean, IsNull, IsTimestamp: type checks.
Combinations:
- And: [...] requires all rules to match.
- Or: [...] requires any rule to match.
- Not: { ... } negates a rule.
Provide Default whenever it is possible that no choice matches; without it, the execution fails with States.NoChoiceMatched.
For dynamic comparisons (compare two paths):
{
  "Variable": "$.requested",
  "NumericGreaterThanPath": "$.available"
}
The Path suffix on the operator means “compare against another JSONPath.”
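The evaluation order of a Choice state (first matching rule wins, then Default, then failure) can be simulated locally. The sketch below covers only StringEquals, NumericGreaterThan, And, Or, and Not, and resolves only simple `$.a.b` paths; all function names are hypothetical and this is not how the service is implemented.

```python
def get_path(data: dict, path: str):
    """Resolve a simple '$.a.b' path; raises KeyError if a field is missing."""
    value = data
    for key in path.removeprefix("$.").split("."):
        value = value[key]
    return value


def rule_matches(rule: dict, data: dict) -> bool:
    """Evaluate one Choice rule against the state input."""
    if "And" in rule:
        return all(rule_matches(r, data) for r in rule["And"])
    if "Or" in rule:
        return any(rule_matches(r, data) for r in rule["Or"])
    if "Not" in rule:
        return not rule_matches(rule["Not"], data)
    value = get_path(data, rule["Variable"])
    if "StringEquals" in rule:
        return isinstance(value, str) and value == rule["StringEquals"]
    if "NumericGreaterThan" in rule:
        return isinstance(value, (int, float)) and value > rule["NumericGreaterThan"]
    raise ValueError(f"operator not covered by this sketch: {rule}")


def next_state(choice_state: dict, data: dict) -> str:
    """Return the name of the state the Choice would transition to."""
    for rule in choice_state["Choices"]:
        if rule_matches(rule, data):
            return rule["Next"]
    if "Default" in choice_state:
        return choice_state["Default"]
    raise RuntimeError("States.NoChoiceMatched")


route = {
    "Choices": [
        {"Variable": "$.order.status", "StringEquals": "pending", "Next": "ProcessPending"},
        {"And": [
            {"Variable": "$.order.amount", "NumericGreaterThan": 100},
            {"Variable": "$.order.region", "StringEquals": "US"},
        ], "Next": "HighValueUSFlow"},
    ],
    "Default": "ProcessUnknown",
}
order = {"order": {"status": "shipped", "amount": 50, "region": "US"}}
print(next_state(route, order))  # ProcessUnknown
```

Removing the Default key from `route` makes the same input raise, which is the failure mode described above.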
Fix 6: IAM Execution Role
The state machine’s execution role needs to invoke every service it touches:
StepFunctionsRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: 2012-10-17
      Statement:
        - Effect: Allow
          Principal:
            Service: states.amazonaws.com
          Action: sts:AssumeRole
    Policies:
      - PolicyName: InvokeLambdas
        PolicyDocument:
          Version: 2012-10-17
          Statement:
            - Effect: Allow
              Action: lambda:InvokeFunction
              Resource:
                - !GetAtt ValidateLambda.Arn
                - !GetAtt ProcessLambda.Arn
            - Effect: Allow
              Action:
                - dynamodb:GetItem
                - dynamodb:PutItem
              Resource: !GetAtt MyTable.Arn
            - Effect: Allow
              Action:
                - logs:CreateLogDelivery
                - logs:GetLogDelivery
                - logs:UpdateLogDelivery
                - logs:DeleteLogDelivery
                - logs:ListLogDeliveries
              Resource: "*"
For X-Ray tracing:
- Effect: Allow
  Action:
    - xray:PutTraceSegments
    - xray:PutTelemetryRecords
  Resource: "*"
For Distributed Map (needs to start child executions):
- Effect: Allow
  Action: states:StartExecution
  Resource: !Sub "arn:aws:states:${AWS::Region}:${AWS::AccountId}:stateMachine:${StateMachineName}"
- Effect: Allow
  Action:
    - states:DescribeExecution
    - states:StopExecution
  Resource: !Sub "arn:aws:states:${AWS::Region}:${AWS::AccountId}:execution:${StateMachineName}:*"
Common Mistake: Granting lambda:InvokeFunction on *. Scope to specific Lambda ARNs. A leaky role lets the state machine invoke functions it shouldn’t.
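The wildcard mistake above is easy to detect mechanically once the policy is available as a JSON document. The following sketch flags Allow statements that grant an action on `*`; `wildcard_statements` is a hypothetical helper, and it does not expand action wildcards such as `lambda:*`.

```python
def as_list(value):
    """IAM allows a string or a list for Action and Resource; normalize to a list."""
    return value if isinstance(value, list) else [value]


def wildcard_statements(policy_document: dict,
                        action: str = "lambda:InvokeFunction") -> list[dict]:
    """Return Allow statements that grant `action` on Resource "*"."""
    flagged = []
    for statement in as_list(policy_document.get("Statement", [])):
        if statement.get("Effect") != "Allow":
            continue
        actions = as_list(statement.get("Action", []))
        resources = as_list(statement.get("Resource", []))
        if action in actions and "*" in resources:
            flagged.append(statement)
    return flagged


policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "lambda:InvokeFunction", "Resource": "*"},
        {"Effect": "Allow", "Action": ["dynamodb:GetItem"], "Resource": "*"},
    ],
}
print(len(wildcard_statements(policy)))  # 1
```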
Fix 7: Lambda Integration — Optimized vs Standard
For Lambda invocation, there are two integration types:
Optimized integration (arn:aws:states:::lambda:invoke):
{
  "Type": "Task",
  "Resource": "arn:aws:states:::lambda:invoke",
  "Parameters": {
    "FunctionName": "...",
    "Payload.$": "$"
  },
  "Retry": [
    {
      "ErrorEquals": [
        "Lambda.ServiceException",
        "Lambda.AWSLambdaException",
        "Lambda.SdkClientException",
        "Lambda.TooManyRequestsException"
      ],
      "IntervalSeconds": 1,
      "MaxAttempts": 3
    }
  ]
}
The output has Lambda metadata wrapping:
{
  "ExecutedVersion": "$LATEST",
  "Payload": { ...your actual response... },
  "SdkHttpMetadata": { ... }
}
Read your data via $.Payload.
Standard ARN integration (arn:aws:lambda:...:function:MyFunction):
{
  "Type": "Task",
  "Resource": "arn:aws:lambda:us-east-1:123456789012:function:MyFunction",
  "InputPath": "$",
  "ResultPath": "$.result"
}
The output is just your Lambda’s response, with no wrapping. But error retries are less granular.
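Pro Tip: Always use optimized integration (arn:aws:states:::lambda:invoke) for production, and declare a Retry for Lambda.ServiceException and Lambda.TooManyRequestsException as in the example above. Step Functions only retries errors a state lists in Retry, so that block is what handles transient AWS issues automatically.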
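The difference in output shape matters to whatever consumes the Task result. As an illustration only, the helper below (a hypothetical `lambda_response`) returns the function’s own response for either integration style, assuming the wrapped shape shown above.

```python
def lambda_response(task_output: dict, optimized: bool) -> dict:
    """Return the Lambda function's own response from a Task state's output.

    optimized=True  -> output is wrapped; the response lives under "Payload".
    optimized=False -> output is the response itself.
    """
    return task_output["Payload"] if optimized else task_output


wrapped = {"ExecutedVersion": "$LATEST", "Payload": {"amount": 42}}
plain = {"amount": 42}
print(lambda_response(wrapped, optimized=True))   # {'amount': 42}
print(lambda_response(plain, optimized=False))    # {'amount': 42}
```

In ASL itself the same unwrapping is a JSONPath such as $.validation.Payload.amount, as in Fix 1.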
Fix 8: Wait States and Heartbeats
Wait for a duration:
{
  "WaitFiveSeconds": {
    "Type": "Wait",
    "Seconds": 5,
    "Next": "Continue"
  }
}
Wait until a specific time:
{
  "WaitUntil2027": {
    "Type": "Wait",
    "Timestamp": "2027-01-01T00:00:00Z",
    "Next": "Continue"
  }
}
Wait for a path-resolved value:
{
  "WaitUntilUserSchedule": {
    "Type": "Wait",
    "TimestampPath": "$.scheduledAt",
    "Next": "Continue"
  }
}
For human-in-the-loop with TaskToken:
{
  "WaitForApproval": {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
    "Parameters": {
      "FunctionName": "SendApprovalRequest",
      "Payload": {
        "taskToken.$": "$$.Task.Token",
        "userId.$": "$.userId"
      }
    },
    "Next": "Continue"
  }
}
The Lambda sends an approval link to the user, embedding taskToken. When the user clicks “Approve,” your backend calls SendTaskSuccess with the token, unblocking the state machine.
aws stepfunctions send-task-success \
  --task-token "$TASK_TOKEN" \
  --task-output '{"approved": true}'
This pattern supports human approvals, external system callbacks, and async work that doesn’t fit in a single Lambda timeout.
Common Mistake: Forgetting .waitForTaskToken on the Resource ARN. Without it, the task completes immediately on Lambda return, instead of waiting for the explicit SendTaskSuccess call.
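From application code, the same callback goes through the SDK’s `send_task_success` call. The sketch below shows the shape of that call with a stand-in client so it runs offline; `complete_approval` and `RecordingClient` are hypothetical names, and in a real backend the client would be `boto3.client("stepfunctions")`.

```python
import json


def complete_approval(sfn_client, task_token: str, approved: bool) -> None:
    """Unblock a waitForTaskToken state; output must be a JSON string."""
    sfn_client.send_task_success(
        taskToken=task_token,
        output=json.dumps({"approved": approved}),
    )


class RecordingClient:
    """Stand-in for a Step Functions client that records calls instead of sending them."""

    def __init__(self):
        self.calls = []

    def send_task_success(self, **kwargs):
        self.calls.append(kwargs)


client = RecordingClient()
complete_approval(client, "example-token", True)
print(client.calls)
```

Passing the client in as a parameter keeps the handler testable without AWS credentials.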
Still Not Working?
A few less-obvious failures:
- State machine fails to start. Check the execution role’s trust policy: it must include states.amazonaws.com as Principal.
- States.Runtime with cryptic JSONPath. Your input doesn’t have the field you expect. Add a Pass state at the start to capture the input, then inspect it in the execution history or CloudWatch Logs.
- Step Functions billed more than expected. Each state transition costs (Standard). Loops with many iterations add up. Use Express for high-volume sub-workflows.
- Map state OOM in inline mode. Switch to Distributed Map with S3 reader/writer.
- Retry doesn’t retry the error you saw. Match the exact error type. Errors from Lambda have specific names; States.TaskFailed is the generic wrapper.
- Pass state doesn’t change input. By default, Pass passes input through. Use Parameters to transform, or ResultPath to merge a literal Result.
- Choice state has no default and execution fails. Always provide Default. Even "Default": "FailExplicitly" (a Fail state) is better than no default.
- Time zones in Timestamp field. Always UTC. Convert in your code if your data is in local time.
- Express workflow doesn’t appear in the console history list. Express executions are not stored in the in-product history. Enable LoggingConfiguration with Level: ALL and IncludeExecutionData: true, then query CloudWatch Logs Insights instead of the console.
- States.DataLimitExceeded mid-execution. A state’s output crossed the 256 KB payload limit between states. Either trim the output with ResultSelector or OutputPath, or persist the heavy result to S3 and pass only the key forward.
- Distributed Map child executions show “Failed” with no error in the parent. The parent only sees aggregate counts. Open the Distributed Map run, click “Item processing details,” and drill into individual failed iterations; each child execution records its own error and cause.
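For the States.Runtime JSONPath item in the list above, it can help to check a sample input against the paths your states reference before starting an execution. This sketch handles only simple dotted paths such as `$.user.id` (no array indexes or filters); `missing_paths` is a hypothetical helper.

```python
def missing_paths(state_input: dict, paths: list[str]) -> list[str]:
    """Return the paths that cannot be resolved in the given input."""
    missing = []
    for path in paths:
        node = state_input
        for key in path.removeprefix("$.").split("."):
            if not isinstance(node, dict) or key not in node:
                missing.append(path)
                break
            node = node[key]
    return missing


sample = {"user": {"name": "a"}, "orderId": "o-1"}
print(missing_paths(sample, ["$.user.id", "$.user.name", "$.orderId"]))
# ['$.user.id']
```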
For related AWS orchestration and serverless issues, see AWS Lambda timeout, AWS Lambda cold start timeout, AWS IAM permission denied, and AWS SQS not working.
Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.