(dynamodb): Failure while enabling replication on large tables #22253

Open
VarunWachaspati opened this issue Sep 27, 2022 · 8 comments
Labels
  • @aws-cdk/aws-dynamodb (Related to Amazon DynamoDB)
  • bug (This issue is a bug.)
  • ddb-legacy-table (This issue has to do with DynamoDB's legacy Table construct. Close after migration guide is out.)
  • p3

Comments

@VarunWachaspati
Contributor

VarunWachaspati commented Sep 27, 2022

Describe the bug

While enabling cross-region replication for an existing DynamoDB table containing more than 50 GB of data, the CDK deployment errors out after an hour with the following error message:

❌  TestStack (test-stack) failed: Error: The stack named test-stack failed to deploy: UPDATE_ROLLBACK_COMPLETE: CloudFormation did not receive a response from your Custom Resource. Please check your logs for requestId [0d0816eb-cf01-4244-a2af-dba1b4bff744]. If you are using the Python cfn-response module, you may need to update your Lambda function code so that CloudFormation can attach the updated version., Received response status [FAILED] from custom resource. Message returned: Attempt to change a resource which is still in use: Cannot delete table while indexes are being created, updated, or deleted. This limit is applied globally for global tables.

Logs: /aws/lambda/test-stack-OnEventHandler42BEBAE0

    at Request.extractError (/var/runtime/node_modules/aws-sdk/lib/protocol/json.js:52:27)

I first checked the OnEventHandler Lambda logs (as suggested in the error above). They contained only the following error, raised while trying to roll back the changes:

ResourceInUseException: Attempt to change a resource which is still in use: Cannot delete table while indexes are being created, updated, or deleted. This limit is applied globally for global tables.

However, the isCompleteHandlerService Lambda logs contained the following peculiar AccessDeniedException:

AccessDeniedException: User: arn:aws:sts::XXXXYYYYZZZZ:assumed-role/test-stack-IsCompleteHandlerService-188DPYRSJRTTA/test-stack --IsCompleteHandler7073F4D-RiV5qzy62AgI is not authorized to perform: dynamodb:DescribeTable on resource: arn:aws:dynamodb:us-west-2:XXXXYYYYZZZZ:table/test-table because no identity-based policy allows the dynamodb:DescribeTable action

I have verified that, when the deployment starts, the isCompleteHandlerService Lambda has an IAM role granting it the DescribeTable permission, and that its first call successfully returns the table description.
CloudWatch logs below:
[Screenshot 2022-09-27 at 10:16:55 PM]

However, the above error surfaces if the creation of the cross-region replica takes more than 1 hour.

I haven't been able to figure out why the permissions error is thrown after an hour of working fine.

NOTE - For smaller tables, enabling replication is successful.

Expected Behavior

The cross-region replica should be created successfully irrespective of the size of the data the table contains, in line with manual creation of a replica via the console.

Current Behavior

For smaller tables (containing less than 10 GB), enabling replication via CDK is successful.
But for large tables, enabling replication via cdk deploy fails with the following error:

The stack named test-stack failed to deploy: UPDATE_ROLLBACK_COMPLETE: CloudFormation did not receive a response from your Custom Resource. Please check your logs for requestId [0d0816eb-cf01-4244-a2af-dba1b4bff744]. If you are using the Python cfn-response module, you may need to update your Lambda function code so that CloudFormation can attach the updated version., Received response status [FAILED] from custom resource. Message returned: Attempt to change a resource which is still in use: Cannot delete table while indexes are being created, updated, or deleted. This limit is applied globally for global tables.

Reproduction Steps

  • Deploy a DynamoDB table in a single region using the following snippet:
const table = new dynamodb.Table(this, 'TestTable', {
  partitionKey: { name: 'id', type: dynamodb.AttributeType.STRING },
  billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
});
  • Populate it with 20 GB of data.
  • Enable cross-region replication for the table in another region using the following snippet:
const table = new dynamodb.Table(this, 'TestTable', {
  partitionKey: { name: 'id', type: dynamodb.AttributeType.STRING },
  replicationRegions: ['us-west-2', 'us-east-1'],
  replicationTimeout: Duration.hours(6),
});

NOTE - CDK 1.174.0 was used to deploy the above CDK stack.

Possible Solution

No response

Additional Information/Context

No response

CDK CLI Version

1.174.0

Framework Version

No response

Node.js Version

v14.17.1

OS

macOS

Language

TypeScript

Language Version

No response

Other information

No response

VarunWachaspati added the bug and needs-triage labels Sep 27, 2022
github-actions bot added the @aws-cdk/aws-dynamodb label Sep 27, 2022
@peterwoodworth
Contributor

peterwoodworth commented Sep 27, 2022

Can you see in your CloudTrail logs if there's more information surrounding the AccessDeniedException? Also, are you using temporary credentials to deploy?

peterwoodworth added the p2 and needs-reproduction labels and removed the needs-triage label Sep 27, 2022
@VarunWachaspati
Contributor Author

VarunWachaspati commented Sep 29, 2022

Thanks @peterwoodworth for pointing me towards CloudTrail logs.
I figured out that the AccessDeniedException is not the root cause of the failure but rather a consequence of the stack rollback.

The actual root cause seems to be the following error in the ProviderframeworkisComplete Lambda function, which prevents the Lambda from successfully calling the response URL:

{
    "errorType": "Error",
    "errorMessage":
    {
        "RequestType": "Create",
        "ServiceToken": "arn:aws:lambda:us-west-2:XXXXYYYYZZZZ:function:test-stack-2021-10--ProviderframeworkonEvent-FmemwZsY0uPP",
        "ResponseURL": "https://cloudformation-custom-resource-response-uswest2.s3-us-west-2.amazonaws.com/arn%3Aaws%3Acloudformation%3Aus-west-2%3AXXXXYYYYZZZZ%3Astack/test-stack/18b70370-15af-11ec-865b-0263b279d39f%7Ctimeseries202110TableReplicauseast18D7A00EA%7C2360a976-32ee-4bd5-bcc7-fe9452362c98?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20220929T091355Z&X-Amz-SignedHeaders=host&X-Amz-Expires=7200&X-Amz-Credential=AKIA54RCMT6SIEFTRIP7%2F20220929%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Signature=8bc584476d903c030767a96046d32e57b9b789d6248a66f5ebca6b810a2a6dbf",
        "StackId": "arn:aws:cloudformation:us-west-2:XXXXYYYYZZZZ:stack/test-stack/18b70370-15af-11ec-865b-0263b279d39f",
        "RequestId": "2360a976-32ee-4bd5-bcc7-fe9452362c98",
        "LogicalResourceId": "teststackTableReplicauseast18D7A00EA",
        "ResourceType": "Custom::DynamoDBReplica",
        "ResourceProperties":
        {
            "ServiceToken": "arn:aws:lambda:us-west-2:XXXXYYYYZZZZ:function:test-stack--ProviderframeworkonEvent-FmemwZsY0uPP",
            "TableName": "test-stack",
            "Region": "us-east-1"
        },
        "PhysicalResourceId": "test-stack-us-east-1"
    },
    "stack":
    [
        "Error: {\"RequestType\":\"Create\",\"ServiceToken\":\"arn:aws:lambda:us-west-2:XXXXYYYYZZZZ:function:test-stack-2021-10--ProviderframeworkonEvent-FmemwZsY0uPP\",\"ResponseURL\":\"https://cloudformation-custom-resource-response-uswest2.s3-us-west-2.amazonaws.com/arn%3Aaws%3Acloudformation%3Aus-west-2%3AXXXXYYYYZZZZ%3Astack/test-stack/18b70370-15af-11ec-865b-0263b279d39f%7Ctimeseries202110TableReplicauseast18D7A00EA%7C2360a976-32ee-4bd5-bcc7-fe9452362c98?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20220929T091355Z&X-Amz-SignedHeaders=host&X-Amz-Expires=7200&X-Amz-Credential=AKIA54RCMT6SIEFTRIP7%2F20220929%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Signature=8bc584476d903c030767a96046d32e57b9b789d6248a66f5ebca6b810a2a6dbf\",\"StackId\":\"arn:aws:cloudformation:us-west-2:XXXXYYYYZZZZ:stack/test-stack/18b70370-15af-11ec-865b-0263b279d39f\",\"RequestId\":\"2360a976-32ee-4bd5-bcc7-fe9452362c98\",\"LogicalResourceId\":\"teststackTableReplicauseast18D7A00EA\",\"ResourceType\":\"Custom::DynamoDBReplica\",\"ResourceProperties\":{\"ServiceToken\":\"arn:aws:lambda:us-west-2:XXXXYYYYZZZZ:function:test-stack--ProviderframeworkonEvent-FmemwZsY0uPP\",\"TableName\":\"test-stack\",\"Region\":\"us-east-1\"},\"PhysicalResourceId\":\"test-stack-us-east-1\"}",
        "    at isComplete (/var/task/framework.js:53:15)",
        "    at processTicksAndRejections (internal/process/task_queues.js:95:5)",
        "    at async Runtime.handler (/var/task/cfn-response.js:48:13)"
    ]
}

Hypothesis: perhaps CloudFormation waits up to an hour for the custom resource provider framework Lambda to respond via the signed S3 URL, and if no response arrives in that time, the following error surfaces:

❌  TestStack (test-stack) failed: Error: The stack named test-stack failed to deploy: UPDATE_ROLLBACK_COMPLETE: CloudFormation did not receive a response from your Custom Resource. If you are using the Python cfn-response module, you may need to update your Lambda function code so that CloudFormation can attach the updated version.

    at Request.extractError (/var/runtime/node_modules/aws-sdk/lib/protocol/json.js:52:27)

I went through the CloudFormation Custom Resource Documentation, but couldn't find anything to validate the above hypothesis.
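
One thing the payload above does show is the lifetime of the presigned ResponseURL: its query string carries X-Amz-Date=20220929T091355Z and X-Amz-Expires=7200. A minimal sketch, assuming the usual SigV4 presigned-URL semantics (the URL is valid for X-Amz-Expires seconds counted from X-Amz-Date), reads that window out of a URL. presignedWindow is a hypothetical helper written for this illustration, not a CDK or AWS SDK function, and the sample URL is a shortened stand-in that keeps only the two relevant parameters.

```typescript
// Hypothetical helper: not part of the CDK or the AWS SDK.
interface PresignedWindow {
  signedAt: Date;
  expiresAt: Date;
  lifetimeSeconds: number;
}

// Reads X-Amz-Date and X-Amz-Expires from a presigned URL and computes
// the instant at which the URL stops being valid.
function presignedWindow(responseUrl: string): PresignedWindow {
  const params = new URL(responseUrl).searchParams;
  const amzDate = params.get('X-Amz-Date');       // e.g. 20220929T091355Z
  const amzExpires = params.get('X-Amz-Expires'); // lifetime in seconds
  if (amzDate === null || amzExpires === null) {
    throw new Error('URL does not carry X-Amz-Date and X-Amz-Expires');
  }
  const m = /^(\d{4})(\d{2})(\d{2})T(\d{2})(\d{2})(\d{2})Z$/.exec(amzDate);
  if (m === null) {
    throw new Error('Unexpected X-Amz-Date format: ' + amzDate);
  }
  const lifetimeSeconds = Number(amzExpires);
  if (!isFinite(lifetimeSeconds) || lifetimeSeconds <= 0) {
    throw new Error('Unexpected X-Amz-Expires value: ' + amzExpires);
  }
  const signedAt = new Date(Date.UTC(
    Number(m[1]), Number(m[2]) - 1, Number(m[3]),
    Number(m[4]), Number(m[5]), Number(m[6]),
  ));
  const expiresAt = new Date(signedAt.getTime() + lifetimeSeconds * 1000);
  return { signedAt, expiresAt, lifetimeSeconds };
}

// Shortened stand-in for the ResponseURL in the log (host, path and signature dropped).
const sampleUrl =
  'https://example.invalid/response?X-Amz-Date=20220929T091355Z&X-Amz-Expires=7200';
const urlWindow = presignedWindow(sampleUrl);
console.log(urlWindow.signedAt.toISOString());  // 2022-09-29T09:13:55.000Z
console.log(urlWindow.expiresAt.toISOString()); // 2022-09-29T11:13:55.000Z
```

Under that assumption the URL in the log stays valid for two hours, so expiry of the response URL on its own would not explain a failure at the one-hour mark. That is consistent with the hypothesis of a separate wait limit on the CloudFormation side, but it does not confirm it.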

For smaller tables the ProviderframeworkisComplete Lambda throws the same error, but somehow the CDK deployment succeeds in that case.

Not sure how to further debug this issue. Any pointers would be appreciated.
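
One reading of the observation that small tables log the same error yet deploy fine is that the error is a "not finished yet, poll again" signal rather than a failure. This is an assumption, not a confirmed description of the CDK provider framework. The sketch below simulates such a retry-by-throwing loop on a fake clock. Every name (retryError, checkOnce, simulatePolling) and every number is hypothetical and chosen only for illustration.

```typescript
// Simulation only: all names are hypothetical and none of this is CDK code.

// Builds an error tagged as "not finished yet, poll again".
function retryError(payload: string): Error {
  const err = new Error(payload);
  (err as Error & { retry?: boolean }).retry = true;
  return err;
}

function isRetry(err: unknown): boolean {
  return typeof err === 'object' && err !== null &&
    (err as { retry?: unknown }).retry === true;
}

// One poll: returns normally when the replica is ready, otherwise throws a retry
// error whose message is the serialized request (so the request shows up in logs).
function checkOnce(replicaReady: boolean, request: object): void {
  if (!replicaReady) {
    throw retryError(JSON.stringify(request));
  }
}

interface PollOutcome {
  status: 'SUCCESS' | 'TIMED_OUT';
  attempts: number;
  retryErrorsLogged: number;
}

// Polls on a simulated clock (no real waiting) until the replica is ready
// or the time budget is used up.
function simulatePolling(
  readyAfterSeconds: number,
  intervalSeconds: number,
  budgetSeconds: number,
): PollOutcome {
  const request = { RequestType: 'Create', ResourceType: 'Custom::DynamoDBReplica' };
  let elapsed = 0;
  let attempts = 0;
  let retryErrorsLogged = 0;
  while (elapsed <= budgetSeconds) {
    attempts += 1;
    try {
      checkOnce(elapsed >= readyAfterSeconds, request);
      return { status: 'SUCCESS', attempts, retryErrorsLogged };
    } catch (err) {
      if (!isRetry(err)) {
        throw err; // a genuine failure, not a "try again" signal
      }
      retryErrorsLogged += 1;
      elapsed += intervalSeconds;
    }
  }
  return { status: 'TIMED_OUT', attempts, retryErrorsLogged };
}

// Illustrative numbers: a replica ready after 2 minutes versus one needing 2 hours,
// both polled every 10 seconds against a 1-hour budget.
const smallTable = simulatePolling(120, 10, 3600);
const largeTable = simulatePolling(7200, 10, 3600);
console.log(smallTable.status, smallTable.retryErrorsLogged);
console.log(largeTable.status, largeTable.retryErrorsLogged);
```

In this model both runs log retry errors. The small table succeeds because one poll eventually returns normally, while the large table fails only because the budget runs out first. If the real framework behaves this way, the logged error is a symptom of the wait, not its cause.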

@alexandervandekleutab

We also encountered this issue. Thanks for the debugging tips. The problem is still unresolved and our production stack is now stuck in ROLLBACK_COMPLETE_CLEANUP_IN_PROGRESS...

@alexandervandekleutab

We also encountered this issue for very small tables (fewer than 1000 items). The problem was not reproducible on other almost-identical environments.

@Naumel
Contributor

Naumel commented Oct 24, 2022

Naumel removed their assignment Jun 28, 2023
@rix0rrr
Contributor

rix0rrr commented Sep 21, 2023

This issue was for the existing Table construct, which used custom resources to implement table replication. We no longer recommend the use of the Table construct.

Instead, the TableV2 construct has been released in 2.95.1 (#27023) which maps to the AWS::DynamoDB::GlobalTable resource, has better support for replication and does not suffer from the issue described here.


Be aware that there are additional deployment steps involved in a migration from Table to TableV2. You need to do a RETAIN deployment, a delete deployment, then change the code to use TableV2 and then use cdk import. A link to a full guide will be posted once it is available.
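
A minimal sketch of what the TableV2 end state of such a migration could look like, assuming aws-cdk-lib 2.95.1 or later is installed. The stack name, construct ID, key schema and regions are carried over from the reproduction steps purely for illustration. The sketch covers only the target table definition, not the RETAIN deployment, the delete deployment or the cdk import step.

```typescript
import * as cdk from 'aws-cdk-lib';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';

const app = new cdk.App();

// Replicas need a concrete stack region, so the region is set explicitly.
const stack = new cdk.Stack(app, 'TestStack', {
  env: { region: 'us-west-2' },
});

// TableV2 synthesizes to AWS::DynamoDB::GlobalTable, so replicas are declared
// on the table itself and not created through a custom resource.
new dynamodb.TableV2(stack, 'TestTable', {
  partitionKey: { name: 'id', type: dynamodb.AttributeType.STRING },
  billing: dynamodb.Billing.onDemand(),
  replicas: [{ region: 'us-east-1' }],
  // Assumption for this sketch: keep the data if the construct is ever removed.
  removalPolicy: cdk.RemovalPolicy.RETAIN,
});

app.synth();
```

The stack's own region is not listed under replicas here; only the additional region is.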

Here are some other resources to get you started (using CfnGlobalTable instead of TableV2) if you want to get going on the migration:

rix0rrr added the ddb-legacy-table label Sep 21, 2023
@vagneroliveirars

This issue was for the existing Table construct, which used custom resources to implement table replication. We no longer recommend the use of the Table construct.

Instead, the TableV2 construct has been released in 2.95.1 (#27023) which maps to the AWS::DynamoDB::GlobalTable resource, has better support for replication and does not suffer from the issue described here.

Be aware that there are additional deployment steps involved in a migration from Table to TableV2. You need to do a RETAIN deployment, a delete deployment, then change the code to use TableV2 and then use cdk import. A link to a full guide will be posted once it is available.

Here are some other resources to get you started (using CfnGlobalTable instead of TableV2) if you want to get going on the migration:

@rix0rrr is there any migration guide available? (from Table to TableV2)

pahud added the p3 label and removed the p2 label Jun 11, 2024
@NAVOO-RiccardoMarostica

This comment was marked as resolved.

pahud removed the needs-reproduction label Aug 12, 2024