
AwsCustomResource: Race condition in IAM policy updates #18237

Open
lordjabez opened this issue Jan 1, 2022 · 7 comments
Labels
@aws-cdk/aws-iam Related to AWS Identity and Access Management bug This issue is a bug. effort/medium Medium work item – several days of effort p1

Comments

@lordjabez

lordjabez commented Jan 1, 2022

What is the problem?

I have a construct that performs several AWS SDK calls using AwsCustomResource, each of which should have an IAM policy scoped as tightly as possible. Sometimes, later calls fail with a permission error, even though the IAM policy being applied is correct.

Per the documentation: "As this custom resource uses a singleton Lambda function, it's important to note that the function's role will eventually accumulate the permissions/grants from all resources."

What I discovered is that earlier custom resource calls succeed because it takes a little while (~60 seconds) for the Lambda to be created, allowing enough time for the associated IAM policy to propagate. However, subsequent custom resource calls reuse the Lambda, and if it executes too quickly after the policy is edited with the new permission, it won't have said permission and will fail.

Reproduction Steps

I can provide code on request but can't yet post it publicly (working on getting permission to do so). Reproducing it should be straightforward, though:

  1. Create an AwsCustomResource call that relies on policy A. Execute that call.
  2. Create an AwsCustomResource call that relies on policy B. Execute that call.

If the second call happens quickly enough after its policy is applied, it will fail.

What did you expect to happen?

All custom resource calls succeed.

What actually happened?

One of the custom resource calls failed with a permissions error.

CDK CLI Version

2.3.0 (build beaa5b2)

Framework Version

No response

Node.js Version

v16.7.0

OS

MacOS

Language

Typescript

Language Version

4.5.4

Other information

I have a CloudFormation event log I can share that illustrates the problem. Contact me directly and I can provide it.

@lordjabez lordjabez added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Jan 1, 2022
@github-actions github-actions bot added the @aws-cdk/aws-iam Related to AWS Identity and Access Management label Jan 1, 2022
@lordjabez
Author

I see four potential fixes (there may be others):

  1. Sleep in the custom resource runtime code long enough to be reasonably sure any associated IAM policies have propagated. Hacky, and has Lambda timeout implications, but quick and easy.

  2. Implement simple retry logic in the above code. Would be easy enough, with say a hard-coded 10s, 30s, 60s backoff, or it could be parameterized (a minimal sketch follows this list).

  3. Change AWS custom resource calls to be async, and implement the retry logic in the state machine poller. Seems a slightly more natural place for it, and easier to implement customization parameters. But would have implications for all other uses of AWS custom resource calls.

  4. Modify the CDK framework so that it doesn't reuse the same Lambda for all custom resource calls in a given stack (or at least has an option to not reuse). I tried specifying a custom functionName parameter across multiple calls hoping it would force different functions, but the first custom name was used and the others were ignored. Perhaps this could be the mechanism to use separate Lambdas? Less performant but would help with my use case.
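
For option 2, here is a minimal sketch of what that retry logic could look like in the handler's runtime code. This is illustrative only, not the actual AwsCustomResource handler: sdkCall is a hypothetical wrapper around the real AWS SDK invocation, and the backoff schedule is the hard-coded 10s/30s/60s mentioned above.

    // Illustrative sketch only; not the actual AwsCustomResource handler code.
    const BACKOFF_MS = [10_000, 30_000, 60_000]; // hard-coded 10s, 30s, 60s backoff

    async function callWithRetry<T>(sdkCall: () => Promise<T>): Promise<T> {
      for (let attempt = 0; ; attempt++) {
        try {
          return await sdkCall(); // hypothetical wrapper around the real SDK call
        } catch (err: any) {
          const looksLikeAccessDenied =
            /AccessDenied|UnauthorizedOperation/.test(err?.name ?? err?.code ?? '');
          if (attempt >= BACKOFF_MS.length || !looksLikeAccessDenied) {
            throw err;
          }
          // Wait for the next backoff step and try again.
          await new Promise((resolve) => setTimeout(resolve, BACKOFF_MS[attempt]));
        }
      }
    }

Worst case this adds roughly 100 seconds to a call, so it would have to fit within the Lambda timeout, which is the same trade-off as option 1.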

@rix0rrr
Contributor

rix0rrr commented Jan 3, 2022

If all is well, there should be a DependsOn relationship between the Custom Resource invocation and the Policy that adds the necessary permissions. It will look like this (taken from an integration test snapshot):

{
    "PublishCustomResourcePolicyDF696FCA": {
      "Type": "AWS::IAM::Policy",
      "Properties": {
        "PolicyDocument": {
          "Statement": [ { ... } ],
        },
        "Roles": [
          { "Ref": "AWS679f53fac002430cb0da5b7982bd2287ServiceRoleC1EA0FF2" }
        ]
      }
    },
    "Publish2E9BDF73": {
      "Type": "Custom::SNSPublisher",
      "Properties": {
        "ServiceToken": { ... },
        "Create": { ... },
        "Update": { ... },
      },
      "DependsOn": [
        "PublishCustomResourcePolicyDF696FCA"  // <------------ this
      ],
    }
}

This dependency will avoid the race condition.

Can you confirm that this DependsOn relationship exists in your template? If not, it would be very helpful if you could share the code you are using.
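
If it helps with checking, here is a minimal sketch of asserting that dependency in a unit test with the aws-cdk-lib assertions module. The Custom::AWS type is the AwsCustomResource default (the snapshot above uses a custom resourceType), and the logical-ID regex is an assumption based on that snapshot.

    import { Stack } from 'aws-cdk-lib';
    import { Match, Template } from 'aws-cdk-lib/assertions';

    // `stack` is assumed to be the stack that contains the AwsCustomResource under test.
    function assertPolicyDependency(stack: Stack): void {
      const template = Template.fromStack(stack);
      template.hasResource('Custom::AWS', {
        // The custom resource must not be invoked before its generated policy exists.
        DependsOn: Match.arrayWith([Match.stringLikeRegexp('CustomResourcePolicy')]),
      });
    }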

@rix0rrr rix0rrr added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Jan 3, 2022
@lordjabez
Author

Here are snippets from the code. First, from Construct A:

    const networkDataSdkCall = {
      service: 'ManagedBlockchain',
      action: 'getNetwork',
      parameters: { NetworkId: this.networkId },
      physicalResourceId: customresources.PhysicalResourceId.of('Id'),
    };

    const networkData = new customresources.AwsCustomResource(this, 'NetworkDataResource', {
      policy: customresources.AwsCustomResourcePolicy.fromSdkCalls({
        resources: [`arn:${partition}:managedblockchain:${region}::networks/${this.networkId}`],
      }),
      onCreate: networkDataSdkCall,
      onUpdate: networkDataSdkCall,
    });

And then from Construct B (the one that sometimes fails):

    const nodeDataSdkCall = {
      service: 'ManagedBlockchain',
      action: 'getNode',
      parameters: { NetworkId: networkId, MemberId: memberId, NodeId: this.nodeId },
      physicalResourceId: customresources.PhysicalResourceId.of('Id'),
    };

    const nodeData = new customresources.AwsCustomResource(this, 'NodeDataResource', {
      policy: customresources.AwsCustomResourcePolicy.fromSdkCalls({
        resources: [`arn:${partition}:managedblockchain:${region}:${account}:nodes/${this.nodeId}`],
      }),
      onCreate: nodeDataSdkCall,
      onUpdate: nodeDataSdkCall,
    });

The synthesized CloudFormation has the proper DependsOn between the update to the policy and the AwsCustomResource call in Construct B, so the policy is indeed created first. But because of IAM propagation delay, it sometimes isn't fully "ready" yet by the time the Lambda runs.

@lordjabez
Author

Here's a screenshot of the CloudFormation events that illustrate the issue:

[Screenshot: CloudFormation event log illustrating the permission failure (Screen Shot 2022-01-01 at 2.49.40 PM)]

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Jan 3, 2022
@rix0rrr
Contributor

rix0rrr commented Jan 4, 2022

Alright, thanks for the thorough report.

Seems like built-in retry logic would be the best solution. I'm just a little concerned that it might be hard to classify the right error codes across all services, since I'm pretty sure there's no standard for them. We can probably start with AccessDenied and take it from there, though.
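
As a starting point, the classification could be as simple as a set of known access-denied error codes. This is only a sketch, and the list would almost certainly need to grow per service:

    // Error codes that commonly signal a missing (or not-yet-propagated) permission.
    // Illustrative list only; services do not name these consistently.
    const ACCESS_DENIED_CODES = new Set([
      'AccessDenied',          // e.g. S3, STS
      'AccessDeniedException', // most JSON-protocol services
      'UnauthorizedOperation', // EC2
    ]);

    function isRetryableAccessDenied(err: unknown): boolean {
      const code = (err as any)?.name ?? (err as any)?.code;
      return typeof code === 'string' && ACCESS_DENIED_CODES.has(code);
    }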

@rix0rrr rix0rrr added the effort/medium Medium work item – several days of effort label Jan 10, 2022
@rix0rrr rix0rrr removed their assignment Feb 9, 2022
@peterwoodworth peterwoodworth removed the needs-triage This issue or PR still needs to be triaged. label Mar 31, 2022
@ibliskavka

I am experiencing this issue with connect:AssociateBot and lex:UpdateBotLocale.

The custom resource succeeds when I turn off rollback and run the deploy multiple times, so I am certain it's a policy propagation issue.

In the meantime I have broadened my permission policies, but a retry on AccessDenied would be perfect.

@caguado

caguado commented May 21, 2024

I ran into this race condition as a consequence of disabling installLatestAwsSdk, coming from a separate issue.

As observed in @lordjabez's analysis above, the singleton Lambda keeps running within the same CloudFormation run and IAM permissions don't appear to propagate in time. I posit that the effect only shows up because the Lambda is still warm: while it keeps running, the assumed role still uses the previous inline policy version, so even if IAM propagates the policy change fast enough, Lambda may never spin up a new execution environment that picks it up.

I have had success by forcing a grant onto the default identity policy of the underlying Lambda role, because that policy is deployed early on, together with the SingletonFunction. So, besides the retry, I would suggest exploring how to coalesce policies, or add them to the default role policy, so that they are deployed before the custom resource's Lambda is first invoked.

The grant snippet below, using SSM as example, works for me as a workaround for now:

import { Grant } from 'aws-cdk-lib/aws-iam';
import { AwsCustomResource, AwsCustomResourcePolicy } from 'aws-cdk-lib/custom-resources';

// Reuse an existing AwsCustomResource if available; otherwise grab the one
// synthesized by the other construct (both come from the surrounding code).
const resource = existingResource ?? otherConstruct.node.defaultChild as AwsCustomResource;

// Add the permission directly to the Lambda role's default (identity) policy,
// which is deployed together with the singleton function.
const grant = Grant.addToPrincipal({
    grantee: resource,
    actions: ['ssm:GetParameter'],
    resourceArns: AwsCustomResourcePolicy.ANY_RESOURCE,
});
grant.assertSuccess();
