
AwsCustomResource: Race condition in IAM policy updates #18237

Open
lordjabez opened this issue Jan 1, 2022 · 7 comments
Labels
@aws-cdk/aws-iam Related to AWS Identity and Access Management bug This issue is a bug. effort/medium Medium work item – several days of effort p1

Comments

@lordjabez

lordjabez commented Jan 1, 2022

What is the problem?

I have a construct that performs several AWS SDK calls using AwsCustomResource, each of which should have an IAM policy scoped as tightly as possible. Sometimes, later calls fail with a permission error, even though the IAM policy being applied is correct.

Per the documentation: "As this custom resource uses a singleton Lambda function, it's important to note that the function's role will eventually accumulate the permissions/grants from all resources."

What I discovered is that earlier custom resource calls succeed because it takes a little while (~60 seconds) for the Lambda to be created, allowing enough time for the associated IAM policy to propagate. However, subsequent custom resource calls reuse the Lambda, and if it executes too quickly after the policy is edited with the new permission, it won't have said permission and will fail.

Reproduction Steps

I can provide code on request but can't yet post it publicly (working on getting permission to do so). Reproducing it should be straightforward, though:

  1. Create an AwsCustomResource call that relies on policy A. Execute that call.
  2. Create an AwsCustomResource call that relies on policy B. Execute that call.

If the second call happens quickly enough after its policy is applied, it will fail.

What did you expect to happen?

All custom resource calls succeed.

What actually happened?

One of the custom resource calls failed with a permissions error.

CDK CLI Version

2.3.0 (build beaa5b2)

Framework Version

No response

Node.js Version

v16.7.0

OS

MacOS

Language

Typescript

Language Version

4.5.4

Other information

I have a CloudFormation event log I can share that illustrates the problem. Contact me directly and I can provide it.

@lordjabez lordjabez added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Jan 1, 2022
@github-actions github-actions bot added the @aws-cdk/aws-iam Related to AWS Identity and Access Management label Jan 1, 2022
@lordjabez
Author

I see four potential fixes (there may be others):

  1. Sleep in the custom resource runtime code long enough to be reasonably sure any associated IAM policies have propagated. Hacky, and has Lambda timeout implications, but quick and easy.

  2. Implement simple retry logic in the above code. Would be easy enough, with say a hard-coded 10s, 30s, 60s backoff, or it could be parameterized (a minimal sketch follows this list).

  3. Change AWS custom resource calls to be async, and implement the retry logic in the state machine poller. Seems a slightly more natural place for it, and easier to implement customization parameters. But would have implications for all other uses of AWS custom resource calls.

  4. Modify the CDK framework so that it doesn't reuse the same Lambda for all custom resource calls in a given stack (or at least has an option to not reuse). I tried specifying a custom functionName parameter across multiple calls hoping it would force different functions, but the first custom name was used and the others were ignored. Perhaps this could be the mechanism to use separate Lambdas? Less performant but would help with my use case.
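
For option 2, here is a minimal sketch of what that retry logic could look like in the handler's runtime code. This is illustrative only, not the actual AwsCustomResource handler: sdkCall is a hypothetical wrapper around the real AWS SDK invocation, and the backoff schedule is the hard-coded 10s/30s/60s mentioned above.

    // Illustrative sketch only; not the actual AwsCustomResource handler code.
    const BACKOFF_MS = [10_000, 30_000, 60_000]; // hard-coded 10s, 30s, 60s backoff

    async function callWithRetry<T>(sdkCall: () => Promise<T>): Promise<T> {
      for (let attempt = 0; ; attempt++) {
        try {
          return await sdkCall(); // hypothetical wrapper around the real SDK call
        } catch (err: any) {
          const looksLikeAccessDenied =
            /AccessDenied|UnauthorizedOperation/.test(err?.name ?? err?.code ?? '');
          if (attempt >= BACKOFF_MS.length || !looksLikeAccessDenied) {
            throw err;
          }
          // Wait for the next backoff step and try again.
          await new Promise((resolve) => setTimeout(resolve, BACKOFF_MS[attempt]));
        }
      }
    }

Worst case this adds roughly 100 seconds to a call, so it would have to fit within the Lambda timeout, which is the same trade-off as option 1.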

@rix0rrr
Contributor

rix0rrr commented Jan 3, 2022

If all is well, there should be a DependsOn relationship between the Custom Resource invocation and the Policy that adds the necessary permissions. It will look like this (taken from an integration test snapshot):

{
    "PublishCustomResourcePolicyDF696FCA": {
      "Type": "AWS::IAM::Policy",
      "Properties": {
        "PolicyDocument": {
          "Statement": [ { ... } ],
        },
        "Roles": [
          { "Ref": "AWS679f53fac002430cb0da5b7982bd2287ServiceRoleC1EA0FF2" }
        ]
      }
    },
    "Publish2E9BDF73": {
      "Type": "Custom::SNSPublisher",
      "Properties": {
        "ServiceToken": { ... },
        "Create": { ... },
        "Update": { ... },
      },
      "DependsOn": [
        "PublishCustomResourcePolicyDF696FCA"  // <------------ this
      ],
    }
}

This dependency will avoid the race condition.

Can you confirm that this DependsOn relationship exists in your template? If not, it would be very helpful if you could share the code you are using.
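
If it helps with checking, here is a minimal sketch of asserting that dependency in a unit test with the aws-cdk-lib assertions module. The Custom::AWS type is the AwsCustomResource default (the snapshot above uses a custom resourceType), and the logical-ID regex is an assumption based on that snapshot.

    import { Stack } from 'aws-cdk-lib';
    import { Match, Template } from 'aws-cdk-lib/assertions';

    // `stack` is assumed to be the stack that contains the AwsCustomResource under test.
    function assertPolicyDependency(stack: Stack): void {
      const template = Template.fromStack(stack);
      template.hasResource('Custom::AWS', {
        // The custom resource must not be invoked before its generated policy exists.
        DependsOn: Match.arrayWith([Match.stringLikeRegexp('CustomResourcePolicy')]),
      });
    }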

@rix0rrr rix0rrr added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Jan 3, 2022
@lordjabez
Author

Here are snippets from the code. First, from Construct A:

    const networkDataSdkCall = {
      service: 'ManagedBlockchain',
      action: 'getNetwork',
      parameters: { NetworkId: this.networkId },
      physicalResourceId: customresources.PhysicalResourceId.of('Id'),
    };

    const networkData = new customresources.AwsCustomResource(this, 'NetworkDataResource', {
      policy: customresources.AwsCustomResourcePolicy.fromSdkCalls({
        resources: [`arn:${partition}:managedblockchain:${region}::networks/${this.networkId}`],
      }),
      onCreate: networkDataSdkCall,
      onUpdate: networkDataSdkCall,
    });

And then from Construct B (the one that sometimes fails):

    const nodeDataSdkCall = {
      service: 'ManagedBlockchain',
      action: 'getNode',
      parameters: { NetworkId: networkId, MemberId: memberId, NodeId: this.nodeId },
      physicalResourceId: customresources.PhysicalResourceId.of('Id'),
    };

    const nodeData = new customresources.AwsCustomResource(this, 'NodeDataResource', {
      policy: customresources.AwsCustomResourcePolicy.fromSdkCalls({
        resources: [`arn:${partition}:managedblockchain:${region}:${account}:nodes/${this.nodeId}`],
      }),
      onCreate: nodeDataSdkCall,
      onUpdate: nodeDataSdkCall,
    });

The synthesized CloudFormation has the proper DependsOn between the update to the policy and the AwsCustomResource call in Construct B, so the policy is indeed created first. But because of IAM propagation delay, it sometimes isn't fully "ready" yet by the time the Lambda runs.

@lordjabez
Author

Here's a screenshot of the CloudFormation events that illustrate the issue:

[Screenshot: CloudFormation event log illustrating the permission failure (Screen Shot 2022-01-01 at 2.49.40 PM)]

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Jan 3, 2022
@rix0rrr
Contributor

rix0rrr commented Jan 4, 2022

Alright, thanks for the thorough report.

Seems like built-in retry logic would be the best solution. I'm just a little concerned that it might be hard to classify the right error codes across all services, since I'm pretty sure there's no standard for them. We can probably start with AccessDenied and take it from there, though.
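
As a starting point, the classification could be as simple as a set of known access-denied error codes. This is only a sketch, and the list would almost certainly need to grow per service:

    // Error codes that commonly signal a missing (or not-yet-propagated) permission.
    // Illustrative list only; services do not name these consistently.
    const ACCESS_DENIED_CODES = new Set([
      'AccessDenied',          // e.g. S3, STS
      'AccessDeniedException', // most JSON-protocol services
      'UnauthorizedOperation', // EC2
    ]);

    function isRetryableAccessDenied(err: unknown): boolean {
      const code = (err as any)?.name ?? (err as any)?.code;
      return typeof code === 'string' && ACCESS_DENIED_CODES.has(code);
    }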

@rix0rrr rix0rrr added the effort/medium Medium work item – several days of effort label Jan 10, 2022
@rix0rrr rix0rrr removed their assignment Feb 9, 2022
@peterwoodworth peterwoodworth removed the needs-triage This issue or PR still needs to be triaged. label Mar 31, 2022
@ibliskavka

I am experiencing this issue with connect:AssociateBot and lex:UpdateBotLocale.

The custom resource succeeds when I turn off rollback and run the deploy multiple times, so I am certain it's a policy propagation issue.

In the meantime I have broadened my permission policies, but a retry on AccessDenied would be perfect.

@caguado

caguado commented May 21, 2024

I ran into this race condition as a consequence of disabling installLatestAwsSdk, coming from a separate issue.

As observed in @lordjabez's analysis above, the singleton Lambda keeps running within the same CloudFormation run and IAM permissions don't appear to propagate in time. I posit that the effect only shows up because the Lambda is still warm: while it keeps running, the assumed role still uses the previous inline policy version, so even if IAM propagates the policy change fast enough, Lambda may never spin up a new execution environment that picks it up.

I have had success by forcing a grant onto the default identity policy of the underlying Lambda role, because that policy is deployed early on, together with the SingletonFunction. So, besides the retry, I would suggest exploring how to coalesce policies, or add them to the default role policy, so that they are deployed before the custom resource's Lambda is first invoked.

The grant snippet below, using SSM as example, works for me as a workaround for now:

import { Grant } from 'aws-cdk-lib/aws-iam';
import { AwsCustomResource, AwsCustomResourcePolicy } from 'aws-cdk-lib/custom-resources';

// Reuse an existing AwsCustomResource if available; otherwise grab the one
// synthesized by the other construct (both come from the surrounding code).
const resource = existingResource ?? otherConstruct.node.defaultChild as AwsCustomResource;

// Add the permission directly to the Lambda role's default (identity) policy,
// which is deployed together with the singleton function.
const grant = Grant.addToPrincipal({
    grantee: resource,
    actions: ['ssm:GetParameter'],
    resourceArns: AwsCustomResourcePolicy.ANY_RESOURCE,
});
grant.assertSuccess();
