
Fix: AWS Step Functions Not Working — ASL Syntax, Map State, Error Handling, and IAM

FixDevs


Quick Answer

How to fix AWS Step Functions errors — Amazon States Language syntax, Standard vs Express workflows, Distributed Map for large datasets, Retry/Catch error handling, Lambda invoke optimization, and IAM execution role permissions.

The Error

You start an execution and it fails at runtime on a missing field:

States.Runtime: An error occurred while executing the state.
The JSONPath '$.user.id' specified for the field 'InputPath' could 
not be found in the input

Or a Lambda task throws and the workflow doesn’t catch it:

ExecutionFailed: States.TaskFailed in state 'CallLambda'
Function returned an error, but no Catch was defined.

Or a Map state iterating over 100,000 items times out:

States.MapStateFailed: Map state execution exceeded the maximum 
number of concurrent iterations.

Or IAM denies access to a downstream service:

States.Runtime: AccessDenied — User: arn:aws:iam::123456789012:role/StepFunctions-Role 
is not authorized to perform: lambda:InvokeFunction

Why This Happens

Step Functions orchestrates AWS services through a JSON-based DSL called Amazon States Language (ASL). Every state machine is a graph of named states with explicit transitions, error handling, and input/output transformations. The benefit is that you get retry, parallelism, and durable state for free; the cost is that ASL is strict and error messages often refer to constructs (JSONPath, error class names, intrinsic functions) that only make sense if you have ASL internals in your head.

The main failure categories are predictable. ASL validation is unforgiving — missing End or Next on a state fails at deploy time, and JSONPath references that don’t exist in the input throw States.Runtime at execution time with messages that look like a typo but are actually missing fields. Standard versus Express is not a knob you can flip later: Standard workflows last up to a year, are billed per state transition, and persist full execution history; Express are capped at 5 minutes, billed per execution plus duration, and don’t store history unless you wire CloudWatch Logs. Map state has two completely different runtime models — inline Map runs ~40 concurrent iterations with a 256 KB payload limit, while Distributed Map streams from S3 and can handle millions of items. And IAM is layered: the state machine’s execution role needs explicit permission for every downstream service ARN, plus permission to start child executions for Distributed Map.

The harder traps are the silent ones. A retry policy that matches States.ALL also retries States.Timeout errors that should bubble up. A Choice state without a Default fails the execution with States.NoChoiceMatched when no branch matches, instead of falling through. A Lambda Task gets no retry on Lambda.TooManyRequestsException or Lambda.ServiceException unless the state declares a Retry block; this holds for both the direct ARN integration (arn:aws:lambda:...:function:Foo) and the optimized integration (arn:aws:states:::lambda:invoke). Workflow Studio pre-fills a default Retry when you add a Lambda Invoke state, but hand-written ASL gets none. Most “Step Functions not working” reports are really one of these subtle defaults biting in production after passing dev tests.

Diagnostic Timeline

Walk through a real “my Map state finishes about 40 iterations and then silently stops” failure.

Minute 0 — first suspicion: fix the JSON. The obvious move is to re-edit the ASL, re-validate in the visual editor, and redeploy. The JSON has always been valid — ASL validation only catches structural errors, not the semantic limits you’re hitting.

Minute 3 — first evidence: check the execution history. Open the Step Functions console, find the execution, click on the Map state. The history shows “MapIterationsSucceeded” reaching about 40 and then… nothing. No error, no failed iteration. That cap is the giveaway: inline Map silently throttles at ~40 concurrent iterations regardless of what you set in MaxConcurrency.

Minute 6 — next check: ASL escape rules. Open the failed iteration’s input. It contains a string with embedded quotes. ASL parses JSONPath inside strings, and unescaped quotes confuse the parser. The iteration didn’t fail with a useful error — it produced a malformed Payload to Lambda, which crashed silently. Escape the inner quotes or use States.JsonToString and States.StringToJson intrinsic functions to round-trip the value.

Minute 9 — discriminating evidence: Standard vs Express. Look at the workflow type. This is a Standard workflow running a Map of 5-minute Lambda calls, and every iteration adds state transitions and history events. At 100K items, you’d burn 100K+ state transitions, billed at $0.025 per 1K. The cost is fine, but the execution history is the real bottleneck: a Standard execution is capped at 25,000 history events. Your Map exceeded that cap, and Step Functions fails an execution that reaches it rather than letting it continue.

Minute 12 — actual root cause: switch to Distributed Map. Inline Map accumulates results in memory and reports each iteration in execution history. Distributed Map runs each iteration as a child execution (Express, ideally), streams input from S3, and writes results back to S3. The parent execution sees only an aggregate. Convert the state to ProcessorConfig.Mode: DISTRIBUTED with ExecutionType: EXPRESS, point ItemReader at an S3 prefix, give the IAM role states:StartExecution on the child workflow ARN, and the Map fans the 100K items out across up to 10,000 parallel child executions without hitting the history cap.

Fix 1: Write Valid ASL

{
  "Comment": "Process an order",
  "StartAt": "ValidateOrder",
  "States": {
    "ValidateOrder": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "arn:aws:lambda:us-east-1:123456789012:function:ValidateOrder",
        "Payload.$": "$"
      },
      "ResultPath": "$.validation",
      "Next": "ProcessPayment"
    },
    "ProcessPayment": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "arn:aws:lambda:us-east-1:123456789012:function:ProcessPayment",
        "Payload": {
          "orderId.$": "$.orderId",
          "amount.$": "$.validation.Payload.amount"
        }
      },
      "ResultPath": "$.payment",
      "End": true
    }
  }
}

Three required pieces per state:

  • Type — Task, Choice, Wait, Map, Parallel, Pass, Succeed, Fail.
  • Next or End — what comes after. Every state except Choice, Succeed, and Fail needs either Next: "..." or End: true.
  • Resource — for Task, the ARN of the service integration.
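The three rules above are easy to check mechanically before deploying. The following is a minimal sketch, assuming the definition has already been loaded into a Python dict (for example with json.load); lint_states is a hypothetical helper, not part of any AWS SDK, and it covers only the checks listed here rather than the full ASL specification.

```python
# State types that branch or terminate on their own, so they carry no Next/End.
NO_TRANSITION_TYPES = {"Choice", "Succeed", "Fail"}


def lint_states(definition):
    """Return a list of human-readable problems found in an ASL definition dict."""
    problems = []
    states = definition.get("States", {})
    start = definition.get("StartAt")
    if start not in states:
        problems.append(f"StartAt points at unknown state {start!r}")
    for name, state in states.items():
        state_type = state.get("Type")
        if state_type is None:
            problems.append(f"{name}: missing Type")
            continue
        if state_type == "Task" and "Resource" not in state:
            problems.append(f"{name}: Task state has no Resource")
        if state_type in NO_TRANSITION_TYPES:
            continue
        has_next = "Next" in state
        has_end = state.get("End") is True
        if has_next == has_end:
            problems.append(f"{name}: needs exactly one of Next or End")
        elif has_next and state["Next"] not in states:
            problems.append(f"{name}: Next points at unknown state {state['Next']!r}")
    return problems


# Deliberately broken example: one state has no transition, one has no Resource
# and points at a state that does not exist.
broken = {
    "StartAt": "ValidateOrder",
    "States": {
        "ValidateOrder": {"Type": "Task", "Resource": "arn:aws:states:::lambda:invoke"},
        "ProcessPayment": {"Type": "Task", "Next": "Missing"},
    },
}
for problem in lint_states(broken):
    print(problem)
```

Running a check like this in CI catches the structural mistakes early; it does not replace the console validator, which knows the full grammar.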

Common JSONPath patterns:

  • "Payload.$": "$" — the entire input.
  • "orderId.$": "$.orderId" — pull a specific field. The key suffix .$ is required for JSONPath references.
  • ResultPath: "$.payment" — where to merge the state’s result into the state’s input.
  • OutputPath: "$.payment" — discard everything else and emit just this path.
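To make the ResultPath and OutputPath behavior concrete, here is a small sketch that mimics them on plain dicts. It assumes simple dotted paths such as $.a.b only (no array indexes or filters), and the helper names are illustrative, not an AWS API.

```python
import copy


def _split(path):
    """Turn '$.a.b' into ['a', 'b']; '$' becomes an empty list."""
    if path == "$":
        return []
    if not path.startswith("$."):
        raise ValueError(f"unsupported path: {path}")
    return path[2:].split(".")


def get_path(document, path):
    """Read a dotted path; a missing key raises KeyError, much like the
    States.Runtime error Step Functions reports for a missing field."""
    node = document
    for key in _split(path):
        node = node[key]
    return node


def apply_result_path(state_input, result, result_path):
    """Merge a task result into a copy of the state input at result_path."""
    keys = _split(result_path)
    if not keys:
        return result  # ResultPath "$" replaces the input with the result
    merged = copy.deepcopy(state_input)
    node = merged
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = result
    return merged


state_input = {"orderId": "o-1", "amount": 42}
after_task = apply_result_path(state_input, {"status": "paid"}, "$.payment")
print(after_task)                         # original input plus a "payment" key
print(get_path(after_task, "$.payment"))  # what OutputPath "$.payment" would emit
```

The point of the sketch is the ordering: ResultPath decides where the result lands inside the input, and OutputPath then selects which part of that merged document moves on to the next state.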

Pro Tip: Use the visual editor in the AWS Console to design the state machine, then export the ASL. The validator catches missing transitions and syntax errors before you deploy.

Fix 2: Standard vs Express Workflows

Pick based on duration and history needs:

  • Standard — up to 1 year, full execution history, $0.025 per 1K state transitions. For long-running orchestration, human approvals, audit trails.
  • Express — up to 5 minutes, billed per execution + duration, no history (logs to CloudWatch if enabled). For high-throughput API backends, event processing.
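A quick back-of-the-envelope estimate helps when choosing. The sketch below uses only the Standard price quoted above ($0.025 per 1K state transitions); the workload numbers are hypothetical, and Express pricing is left out because it depends on duration and memory.

```python
from decimal import Decimal

# Standard price quoted above: $0.025 per 1,000 state transitions.
PRICE_PER_TRANSITION = Decimal("0.025") / Decimal(1000)


def standard_cost(executions, transitions_per_execution):
    """Estimated Standard workflow charge in USD for a batch of executions."""
    return PRICE_PER_TRANSITION * executions * transitions_per_execution


# Hypothetical workload: 100,000 executions with 5 state transitions each.
print(standard_cost(100_000, 5))
```

Retries count as transitions too, so a workflow with aggressive Retry settings can cost more than the happy-path estimate suggests.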

Set the type at create time:

# SAM template:
Resources:
  MyStateMachine:
    Type: AWS::Serverless::StateMachine
    Properties:
      DefinitionUri: state-machine.asl.json
      Type: STANDARD   # or EXPRESS
      Role: !GetAtt MyRole.Arn

You can’t convert between Standard and Express in place — create a new state machine.

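For Express workflows that need history, enable logging (on AWS::Serverless::StateMachine the property is named Logging; on AWS::StepFunctions::StateMachine it is LoggingConfiguration):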

LoggingConfiguration:
  Level: ALL
  IncludeExecutionData: true
  Destinations:
    - CloudWatchLogsLogGroup:
        LogGroupArn: !GetAtt MyLogGroup.Arn

This sends Express execution logs to CloudWatch — replaces the missing in-product history.

Common Mistake: Using Express for workflows with human approval steps. Express has a 5-minute hard limit; humans take longer than 5 minutes. Use Standard.

Fix 3: Map State — Inline vs Distributed

Inline Map (default) handles up to ~40 concurrent iterations, with a 256 KB payload limit:

{
  "MyMap": {
    "Type": "Map",
    "ItemsPath": "$.items",
    "MaxConcurrency": 10,
    "ItemProcessor": {
      "StartAt": "ProcessItem",
      "States": {
        "ProcessItem": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ProcessItem",
          "End": true
        }
      }
    },
    "ResultPath": "$.results",
    "End": true
  }
}

For larger datasets, use Distributed Map. It reads input from S3 and can iterate over millions of items:

{
  "MyMap": {
    "Type": "Map",
    "ItemReader": {
      "Resource": "arn:aws:states:::s3:listObjectsV2",
      "Parameters": {
        "Bucket": "my-bucket",
        "Prefix": "items/"
      }
    },
    "ItemProcessor": {
      "ProcessorConfig": {
        "Mode": "DISTRIBUTED",
        "ExecutionType": "EXPRESS"
      },
      "StartAt": "ProcessItem",
      "States": {
        "ProcessItem": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ProcessItem",
          "End": true
        }
      }
    },
    "MaxConcurrency": 1000,
    "ResultWriter": {
      "Resource": "arn:aws:states:::s3:putObject",
      "Parameters": {
        "Bucket": "my-bucket",
        "Prefix": "results/"
      }
    },
    "End": true
  }
}

Distributed Map:

  • Mode: "DISTRIBUTED" in ProcessorConfig switches from inline.
  • ExecutionType: "EXPRESS" is recommended for high-throughput; iteration sub-workflows run as Express.
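  • ItemReader can be an S3 object listing (listObjectsV2), the contents of a single S3 object (JSON, JSONL, or CSV via getObject), or an S3 Inventory manifest. DynamoDB is not an ItemReader source, so scan results have to be staged in S3 or passed in as a JSON array.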
  • ResultWriter persists results to S3 — avoids the inline result accumulating into the parent state.

Pro Tip: For batch jobs over thousands of items, always use Distributed Map. Inline Map caps concurrency at about 40 and collects every iteration’s result into the parent state’s output, which is slow and can exceed the 256 KB payload limit (States.DataLimitExceeded).
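The inline-versus-distributed decision can be reduced to two checks: does the item array fit in a state payload, and is it small enough to be worth processing inline? The sketch below is one way to express that; the 1,000-item threshold is a hypothetical cut-off chosen for illustration, and the payload limit is taken as 256 × 1024 bytes.

```python
import json

PAYLOAD_LIMIT_BYTES = 256 * 1024   # 256 KB payload limit cited above
INLINE_ITEM_THRESHOLD = 1000       # hypothetical cut-off for "thousands of items"


def choose_map_mode(items):
    """Suggest 'INLINE' or 'DISTRIBUTED' for a list of JSON-serializable items."""
    payload_size = len(json.dumps(items).encode("utf-8"))
    if payload_size > PAYLOAD_LIMIT_BYTES:
        # The array cannot be passed between states at all; read it from S3.
        return "DISTRIBUTED"
    if len(items) > INLINE_ITEM_THRESHOLD:
        return "DISTRIBUTED"
    return "INLINE"


print(choose_map_mode([{"id": i} for i in range(10)]))
print(choose_map_mode([{"id": i} for i in range(5000)]))
```

Note that the check looks at the input only. Inline Map also collects every iteration result, so a small input with large per-item outputs can still hit the payload limit.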

Fix 4: Error Handling With Retry and Catch

For transient errors, retry:

{
  "MyTask": {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:MyFunction",
    "Retry": [
      {
        "ErrorEquals": ["Lambda.ServiceException", "Lambda.TooManyRequestsException"],
        "IntervalSeconds": 2,
        "MaxAttempts": 3,
        "BackoffRate": 2,
        "JitterStrategy": "FULL"
      }
    ],
    "Catch": [
      {
        "ErrorEquals": ["States.TaskFailed"],
        "Next": "HandleFailure",
        "ResultPath": "$.error"
      }
    ],
    "Next": "Success"
  }
}

Retry:

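  • ErrorEquals — list of error names to retry. States.ALL is a wildcard that matches almost everything, including States.Timeout; it does not match States.Runtime or States.DataLimitExceeded.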
  • IntervalSeconds — initial backoff.
  • MaxAttempts — total retries (not including initial attempt).
  • BackoffRate — multiplier per attempt (2 = exponential).
  • JitterStrategy: "FULL" — adds random jitter (recommended for distributed retries).
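To see what a Retry block actually costs in wall-clock time, it helps to compute the delay schedule. This is a minimal sketch of the exponential backoff arithmetic; it ignores jitter and MaxDelaySeconds, so with JitterStrategy set to FULL the values are best read as upper bounds.

```python
def retry_delays(interval_seconds, max_attempts, backoff_rate):
    """Delay in seconds before each retry: interval * rate ** (retry index)."""
    return [interval_seconds * backoff_rate ** attempt for attempt in range(max_attempts)]


# Values from the Retry block above: IntervalSeconds 2, MaxAttempts 3, BackoffRate 2.
delays = retry_delays(interval_seconds=2, max_attempts=3, backoff_rate=2)
print(delays, sum(delays))
```

For the example above the schedule is 2, 4, and 8 seconds, so the task can spend up to 14 seconds waiting between attempts before Catch takes over.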

Catch:

  • Runs when retries are exhausted (or the error doesn’t match Retry.ErrorEquals).
  • ResultPath: "$.error" — stores the error info at this path for the next state to inspect.
  • Next — the failure-handling state.

In the failure handler, you can read the error:

{
  "HandleFailure": {
    "Type": "Pass",
    "Parameters": {
      "errorType.$": "$.error.Error",
      "errorMessage.$": "$.error.Cause"
    },
    "Next": "Notify"
  }
}

Common Mistake: Catching States.ALL without inspecting the error. You handle exceptions you shouldn’t (like States.Timeout that should bubble up). Be specific about what you catch.
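The matching rules are easier to reason about as code. The following is a simplified model, not the service's implementation: it assumes catchers are tried in order, that States.ALL matches everything except States.Runtime and States.DataLimitExceeded, and that States.TaskFailed acts as a wildcard for everything except States.Timeout (its treatment of the two uncatchable errors is an assumption of this sketch).

```python
# Errors that States.ALL does not match in this simplified model.
NOT_MATCHED_BY_ALL = {"States.Runtime", "States.DataLimitExceeded"}


def matches(pattern, error):
    """Return True if one ErrorEquals entry matches the given error name."""
    if pattern == "States.ALL":
        return error not in NOT_MATCHED_BY_ALL
    if pattern == "States.TaskFailed":
        # Wildcard for task errors, except timeouts.
        return error != "States.Timeout" and error not in NOT_MATCHED_BY_ALL
    return pattern == error


def pick_catcher(catchers, error):
    """Return the Next of the first catcher whose ErrorEquals matches, else None."""
    for catcher in catchers:
        if any(matches(pattern, error) for pattern in catcher["ErrorEquals"]):
            return catcher["Next"]
    return None


catchers = [{"ErrorEquals": ["States.TaskFailed"], "Next": "HandleFailure"}]
print(pick_catcher(catchers, "Lambda.ServiceException"))
print(pick_catcher(catchers, "States.Timeout"))
```

With the Catch block from the example, a Lambda error is routed to HandleFailure while a timeout is not caught and fails the execution, which is usually the behavior you want.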

Fix 5: Choice State Syntax

{
  "RouteByStatus": {
    "Type": "Choice",
    "Choices": [
      {
        "Variable": "$.order.status",
        "StringEquals": "pending",
        "Next": "ProcessPending"
      },
      {
        "Variable": "$.order.status",
        "StringEquals": "confirmed",
        "Next": "ProcessConfirmed"
      },
      {
        "And": [
          { "Variable": "$.order.amount", "NumericGreaterThan": 100 },
          { "Variable": "$.order.region", "StringEquals": "US" }
        ],
        "Next": "HighValueUSFlow"
      }
    ],
    "Default": "ProcessUnknown"
  }
}

Operators:

  • StringEquals, StringMatches, StringGreaterThan, etc. — string comparisons.
  • NumericEquals, NumericGreaterThan, etc. — number comparisons.
  • BooleanEquals — boolean.
  • TimestampLessThan, TimestampLessThanEqualsPath, etc. — timestamp.
  • IsPresent, IsString, IsNumeric, IsBoolean, IsNull, IsTimestamp — type checks.

Combinations:

  • And: [...] — all must match.
  • Or: [...] — any.
  • Not: { ... } — negation.

Default is needed whenever the input might match none of the choices. Without it, the execution fails with States.NoChoiceMatched.

For dynamic comparisons (compare two paths):

{
  "Variable": "$.requested",
  "NumericGreaterThanPath": "$.available"
}

The Path suffix on the operator means “compare against another JSONPath.”
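The evaluation order of a Choice state (first matching rule wins, then Default, then failure) can be sketched in a few lines. This is an illustrative model that covers only the operators used in this section; the function names are hypothetical and the path handling is limited to simple dotted paths.

```python
def get(document, path):
    """Resolve a simple dotted path such as '$.order.status'."""
    node = document
    for key in path[2:].split("."):
        node = node[key]
    return node


def rule_matches(rule, data):
    """Evaluate one choice rule against the state input."""
    if "And" in rule:
        return all(rule_matches(r, data) for r in rule["And"])
    if "Or" in rule:
        return any(rule_matches(r, data) for r in rule["Or"])
    if "Not" in rule:
        return not rule_matches(rule["Not"], data)
    value = get(data, rule["Variable"])
    if "StringEquals" in rule:
        return isinstance(value, str) and value == rule["StringEquals"]
    if "NumericGreaterThan" in rule:
        return isinstance(value, (int, float)) and value > rule["NumericGreaterThan"]
    if "NumericGreaterThanPath" in rule:
        return value > get(data, rule["NumericGreaterThanPath"])
    raise ValueError(f"operator not covered by this sketch: {rule}")


def next_state(choice_state, data):
    """Return the name of the state the Choice would transition to."""
    for rule in choice_state["Choices"]:
        if rule_matches(rule, data):
            return rule["Next"]
    if "Default" in choice_state:
        return choice_state["Default"]
    raise RuntimeError("States.NoChoiceMatched")


route = {
    "Type": "Choice",
    "Choices": [
        {"Variable": "$.order.status", "StringEquals": "pending", "Next": "ProcessPending"},
        {
            "And": [
                {"Variable": "$.order.amount", "NumericGreaterThan": 100},
                {"Variable": "$.order.region", "StringEquals": "US"},
            ],
            "Next": "HighValueUSFlow",
        },
    ],
    "Default": "ProcessUnknown",
}
print(next_state(route, {"order": {"status": "shipped", "amount": 250, "region": "US"}}))
```

Because the first match wins, put the most specific rules first; a broad rule near the top hides everything below it.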

Fix 6: IAM Execution Role

The state machine’s execution role needs to invoke every service it touches:

StepFunctionsRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: 2012-10-17
      Statement:
        - Effect: Allow
          Principal:
            Service: states.amazonaws.com
          Action: sts:AssumeRole
    Policies:
      - PolicyName: InvokeLambdas
        PolicyDocument:
          Version: 2012-10-17
          Statement:
            - Effect: Allow
              Action: lambda:InvokeFunction
              Resource:
                - !GetAtt ValidateLambda.Arn
                - !GetAtt ProcessLambda.Arn
            - Effect: Allow
              Action:
                - dynamodb:GetItem
                - dynamodb:PutItem
              Resource: !GetAtt MyTable.Arn
            - Effect: Allow
              Action:
                - logs:CreateLogDelivery
                - logs:GetLogDelivery
                - logs:UpdateLogDelivery
                - logs:DeleteLogDelivery
                - logs:ListLogDeliveries
              Resource: "*"

For X-Ray tracing:

- Effect: Allow
  Action:
    - xray:PutTraceSegments
    - xray:PutTelemetryRecords
  Resource: "*"

For Distributed Map (needs to start child executions):

- Effect: Allow
  Action: states:StartExecution
  Resource: !Sub "arn:aws:states:${AWS::Region}:${AWS::AccountId}:stateMachine:${StateMachineName}"
- Effect: Allow
  Action:
    - states:DescribeExecution
    - states:StopExecution
  Resource: !Sub "arn:aws:states:${AWS::Region}:${AWS::AccountId}:execution:${StateMachineName}:*"

Common Mistake: Granting lambda:InvokeFunction on *. Scope to specific Lambda ARNs. A leaky role lets the state machine invoke functions it shouldn’t.
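Scoping the policy is easier when the list of functions comes from the definition itself. The sketch below walks an ASL States dict and collects every Lambda function it references, including those nested in Map and Parallel states. It is a hypothetical helper with a known gap: a dynamic "FunctionName.$" reference cannot be resolved statically and is skipped.

```python
def lambda_targets(states):
    """Collect Lambda function identifiers referenced by Task states."""
    found = set()
    for state in states.values():
        if state.get("Type") == "Task":
            resource = state.get("Resource", "")
            if resource.startswith("arn:aws:states:::lambda:invoke"):
                # Optimized integration: the function is named in Parameters.
                name = state.get("Parameters", {}).get("FunctionName")
                if name:
                    found.add(name)
            elif resource.startswith("arn:aws:lambda:"):
                # Direct ARN integration: the Resource is the function ARN.
                found.add(resource)
        # Map states nest a sub-workflow under ItemProcessor (or legacy Iterator).
        for key in ("ItemProcessor", "Iterator"):
            if key in state:
                found |= lambda_targets(state[key].get("States", {}))
        # Parallel states nest one sub-workflow per branch.
        for branch in state.get("Branches", []):
            found |= lambda_targets(branch.get("States", {}))
    return found


example_states = {
    "Validate": {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",
        "Parameters": {
            "FunctionName": "arn:aws:lambda:us-east-1:123456789012:function:ValidateOrder",
            "Payload.$": "$",
        },
        "Next": "FanOut",
    },
    "FanOut": {
        "Type": "Map",
        "ItemProcessor": {
            "StartAt": "ProcessItem",
            "States": {
                "ProcessItem": {
                    "Type": "Task",
                    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ProcessItem",
                    "End": True,
                }
            },
        },
        "End": True,
    },
}
targets = lambda_targets(example_states)
print(sorted(targets))
```

Comparing this set with the Resource list in the role's policy shows both missing grants (AccessDenied at runtime) and stale ones that can be removed.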

Fix 7: Lambda Integration — Optimized vs Standard

For Lambda invocation, there are two integration types:

Optimized integration (arn:aws:states:::lambda:invoke):

{
  "Type": "Task",
  "Resource": "arn:aws:states:::lambda:invoke",
  "Parameters": {
    "FunctionName": "...",
    "Payload.$": "$"
  },
  "Retry": [
    {
      "ErrorEquals": [
        "Lambda.ServiceException",
        "Lambda.AWSLambdaException",
        "Lambda.SdkClientException",
        "Lambda.TooManyRequestsException"
      ],
      "IntervalSeconds": 1,
      "MaxAttempts": 3
    }
  ]
}

The output has Lambda metadata wrapping:

{
  "ExecutedVersion": "$LATEST",
  "Payload": { ...your actual response... },
  "SdkHttpMetadata": { ... }
}

Read your data via $.Payload.

Standard ARN integration (arn:aws:lambda:...:function:MyFunction):

{
  "Type": "Task",
  "Resource": "arn:aws:lambda:us-east-1:123456789012:function:MyFunction",
  "InputPath": "$",
  "ResultPath": "$.result"
}

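The output is just your Lambda’s response, with no wrapping. Retry and Catch behave the same as in the optimized form, but the callback pattern (.waitForTaskToken) requires the optimized integration.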

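Pro Tip: Prefer the optimized integration (arn:aws:states:::lambda:invoke) for production, and keep an explicit Retry block for Lambda.ServiceException and Lambda.TooManyRequestsException on every Lambda task. Step Functions does not retry these transient errors on its own; Workflow Studio pre-fills such a block, hand-written ASL does not.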
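Migrating existing tasks by hand is tedious, so here is a rough sketch of the rewrite as a function over a Task state dict. It assumes a direct Lambda ARN in Resource, adds the Retry block shown earlier if the task has none, and leaves ResultPath untouched; to_optimized and DEFAULT_LAMBDA_RETRY are illustrative names, and the BackoffRate of 2 is an assumed value. Downstream states still need to be updated to read from $.Payload.

```python
import copy

DEFAULT_LAMBDA_RETRY = {
    "ErrorEquals": [
        "Lambda.ServiceException",
        "Lambda.AWSLambdaException",
        "Lambda.SdkClientException",
        "Lambda.TooManyRequestsException",
    ],
    "IntervalSeconds": 1,
    "MaxAttempts": 3,
    "BackoffRate": 2,  # assumed value for this sketch
}


def to_optimized(task):
    """Rewrite a Task that uses a direct Lambda ARN to the lambda:invoke integration."""
    resource = task.get("Resource", "")
    if not resource.startswith("arn:aws:lambda:"):
        return task  # already optimized, or not a Lambda task
    converted = copy.deepcopy(task)
    converted["Resource"] = "arn:aws:states:::lambda:invoke"
    if "Parameters" in task:
        # Existing Parameters were the Lambda payload; keep them as Payload.
        payload = {"Payload": copy.deepcopy(task["Parameters"])}
    else:
        payload = {"Payload.$": "$"}
    converted["Parameters"] = {"FunctionName": resource, **payload}
    converted.setdefault("Retry", [copy.deepcopy(DEFAULT_LAMBDA_RETRY)])
    return converted


legacy_task = {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:MyFunction",
    "InputPath": "$",
    "ResultPath": "$.result",
}
print(to_optimized(legacy_task))
```

After conversion the Lambda response sits under $.result.Payload instead of $.result, which is the one change that tends to break later states.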

Fix 8: Wait States and Heartbeats

Wait for a duration:

{
  "WaitFiveSeconds": {
    "Type": "Wait",
    "Seconds": 5,
    "Next": "Continue"
  }
}

Wait until a specific time:

{
  "WaitUntil2027": {
    "Type": "Wait",
    "Timestamp": "2027-01-01T00:00:00Z",
    "Next": "Continue"
  }
}

Wait for a path-resolved value:

{
  "WaitUntilUserSchedule": {
    "Type": "Wait",
    "TimestampPath": "$.scheduledAt",
    "Next": "Continue"
  }
}

For human-in-the-loop with TaskToken:

{
  "WaitForApproval": {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
    "Parameters": {
      "FunctionName": "SendApprovalRequest",
      "Payload": {
        "taskToken.$": "$$.Task.Token",
        "userId.$": "$.userId"
      }
    },
    "Next": "Continue"
  }
}

The Lambda sends an approval link to the user, embedding taskToken. When the user clicks “Approve,” your backend calls SendTaskSuccess with the token, unblocking the state machine.

aws stepfunctions send-task-success \
  --task-token "$TASK_TOKEN" \
  --task-output '{"approved": true}'

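This pattern supports human approvals, external system callbacks, and async work that doesn’t fit in a single Lambda timeout. Set TimeoutSeconds (and HeartbeatSeconds, refreshed with SendTaskHeartbeat) on the task so a lost token fails the state instead of leaving the execution waiting indefinitely.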

Common Mistake: Forgetting .waitForTaskToken on the Resource ARN. Without it, the task completes immediately on Lambda return, instead of waiting for the explicit SendTaskSuccess call.
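On the backend side, the callback is a single API call per outcome. The sketch below assumes you pass in a Step Functions client (for example boto3.client("stepfunctions")); resolve_approval and the ApprovalRejected error name are hypothetical, while send_task_success and send_task_failure are the client methods that correspond to the SendTaskSuccess and SendTaskFailure APIs.

```python
import json


def resolve_approval(sfn_client, task_token, approved, reason=""):
    """Unblock a waitForTaskToken task by reporting success or failure."""
    if approved:
        sfn_client.send_task_success(
            taskToken=task_token,
            output=json.dumps({"approved": True}),  # output must be a JSON string
        )
    else:
        sfn_client.send_task_failure(
            taskToken=task_token,
            error="ApprovalRejected",  # hypothetical error name, usable in Catch
            cause=reason or "rejected by reviewer",
        )
```

Reporting a rejection through send_task_failure rather than a success with approved set to false lets the state machine route it with a Catch on the error name instead of an extra Choice state; either design works.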

Still Not Working?

A few less-obvious failures:

  • State machine fails to start. Check the execution role’s trust policy — must include states.amazonaws.com as Principal.
  • States.Runtime with cryptic JSONPath. Your input doesn’t have the field you expect. Open the failing state in the execution history (or in CloudWatch Logs for Express workflows) and compare its recorded input with the path named in the error.
  • Step Functions billed more than expected. Each state transition costs (Standard). Loops with many iterations add up. Use Express for high-volume sub-workflows.
  • Map state hits States.DataLimitExceeded in inline mode. The collected iteration results exceeded the 256 KB payload limit. Switch to Distributed Map with S3 reader/writer.
  • Retry doesn’t retry the error you saw. Match the exact error type. Errors from Lambda have specific names; States.TaskFailed is the generic wrapper.
  • Pass state doesn’t change input. By default, Pass passes input through. Use Parameters to transform, or ResultPath to merge a literal Result.
  • Choice state has no default and execution fails. Always provide Default. Even "Default": "FailExplicitly" (a Fail state) is better than no default.
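  • Time zones in Timestamp field. Timestamps must be ISO 8601 strings with an uppercase T and either a Z suffix or an explicit numeric offset, for example 2027-01-01T00:00:00Z. Convert local times in your code before passing them in.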
  • Express workflow doesn’t appear in the console history list. Express executions are not stored in the in-product history. Enable LoggingConfiguration with Level: ALL and IncludeExecutionData: true, then query CloudWatch Logs Insights instead of the console.
  • States.DataLimitExceeded mid-execution. A state’s output crossed the 256 KB payload limit between states. Either trim the output with ResultSelector or OutputPath, or persist the heavy result to S3 and pass only the key forward.
  • Distributed Map child executions show “Failed” with no error in the parent. The parent only sees aggregate counts. Open the Distributed Map run, click “Item processing details,” and drill into individual failed iterations; each child execution records its own error and cause.

For related AWS orchestration and serverless issues, see AWS Lambda timeout, AWS Lambda cold start timeout, AWS IAM permission denied, and AWS SQS not working.

FixDevs

Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.
