
aws_batch_alpha: Role permissions error after upgrade from v2.73 to v2.79 #25675

Closed
paulharbulak-maxar opened this issue May 22, 2023 · 6 comments · Fixed by #26288
Labels
@aws-cdk/aws-batch · bug · effort/small · p1

Comments

@paulharbulak-maxar

Describe the bug

After migrating ComputeEnvironment, JobDefinition, and other related Batch constructs to ManagedEc2EcsComputeEnvironment, EcsJobDefinition, etc., due to breaking changes in v2.79, Batch jobs fail with a CloudWatch Logs permissions error.

Expected Behavior

I expected any default IAM roles created by the Batch constructs to have the necessary permissions for the default logging configuration.

Current Behavior

Batch jobs fail with the following error: CannotStartContainerError: Error response from daemon: failed to initialize logging driver: failed to create Cloudwatch log stream: AccessDeniedException: User: <BatchContainerDefinitionExecutionRole ARN>.

Reproduction Steps

This is the section of code currently used to generate the Batch infrastructure that produces the error:

# Imports assumed for this snippet; `vpc`, `postgres`, and `stack_name`
# are defined elsewhere in the stack.
from aws_cdk import Duration, Size
from aws_cdk import aws_ec2 as ec2
from aws_cdk import aws_ecs as ecs
from aws_cdk import aws_events as events
from aws_cdk import aws_events_targets as events_targets
from aws_cdk import aws_stepfunctions as step
from aws_cdk import aws_stepfunctions_tasks as tasks
import aws_cdk.aws_batch_alpha as batch

batch_security_group = ec2.SecurityGroup(
    self, id="BatchSecurityGroup", vpc=vpc, allow_all_outbound=True
)
postgres.connections.allow_default_port_from(batch_security_group)
batch_environ = batch.ManagedEc2EcsComputeEnvironment(
    self,
    "BatchEnvironment",
    compute_environment_name=f"{stack_name}-env",
    security_groups=[batch_security_group],
    spot=True,
    vpc=vpc,
)

seeder_queue = batch.JobQueue(
    self,
    "BatchSeedsJobQueue",
    job_queue_name=f"{stack_name}-seeds",
    compute_environments=[
        batch.OrderedComputeEnvironment(
            compute_environment=batch_environ, order=1
        ),
    ],
    priority=1,
)

wfs_job_def = batch.EcsJobDefinition(
    self,
    "WFSBatchJobDefinition",
    job_definition_name=f"{stack_name}-meta",
    parameters={
        "tasks": "ALL",
    },
    container=batch.EcsEc2ContainerDefinition(
        self,
        "WFSBatchContainerDefinition",
        image=ecs.ContainerImage.from_asset('./metadata'),
        environment={
            "POSTGRES_HOST": postgres.db_instance_endpoint_address,
            "POSTGRES_PORT": postgres.db_instance_endpoint_port,
            "POSTGRES_DB_NAME": "postgres",
            "POSTGRES_PASS": postgres.secret.secret_value_from_json(
                "password"
            ).to_string(),
        },
        cpu=1,
        memory=Size.mebibytes(8192),
    ),
)

wfs_datasets = [
    "test a",
    "test b",
    "test c",
    "test d",
]

wfs_jobs = step.Parallel(self, "WFSParallelJobs")
for idx, ds in enumerate(wfs_datasets):
    wfs_jobs.branch(
        tasks.BatchSubmitJob(
            self,
            f"WFSJobSubmit_{idx}",
            job_name=f"WFSJobSubmit_{idx}",
            job_queue_arn=seeder_queue.job_queue_arn,
            job_definition_arn=wfs_job_def.job_definition_arn,
            container_overrides=tasks.BatchContainerOverrides(
                environment={"DATASETS": ds}
            ),
        )
    )

wfs_state_machine = step.StateMachine(
    self,
    "WFSStateMachine",
    definition=wfs_jobs,
)

wfs_target = events_targets.SfnStateMachine(
    machine=wfs_state_machine,
)

events.Rule(
    self,
    "MetaRule",
    schedule=events.Schedule.rate(Duration.days(2)),
    targets=[wfs_target],
)

Possible Solution

It appears that the default container execution role created by EcsEc2ContainerDefinition is not granted logs:CreateLogStream permission. If you manually remove the container execution role from the job definition, the job succeeds; I'm guessing it then gets the necessary permissions from the BatchEnvironmentInstanceProfileRole. However, if you set the EcsJobDefinition execution_role property to None, the construct still creates and assigns a default execution role, which fails with the same permissions error. A possible user-side workaround is sketched below.
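
Until the construct grants this automatically, one possible user-side workaround is to grant the generated execution role the missing log permissions directly. This is a minimal sketch, assuming the wfs_job_def from the reproduction steps above and Batch's default /aws/batch/job log group:

from aws_cdk import Stack, aws_iam as iam

# Sketch: give the auto-created execution role write access to the
# default Batch log group so the awslogs driver can create log streams.
stack = Stack.of(self)
wfs_job_def.container.execution_role.add_to_principal_policy(
    iam.PolicyStatement(
        actions=["logs:CreateLogStream", "logs:PutLogEvents"],
        resources=[
            f"arn:{stack.partition}:logs:{stack.region}:{stack.account}:"
            "log-group:/aws/batch/job:*"
        ],
    )
)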

Additional Information/Context

Batch jobs created with JobDefinition/JobDefinitionContainer in v2.73 do not appear to use a default execution role, whereas batch jobs created in v2.79 do.

CDK CLI Version

2.79.1

Framework Version

No response

Node.js Version

16.20

OS

Ubuntu 22.04.2

Language

Python

Language Version

3.8

Other information

No response

@paulharbulak-maxar paulharbulak-maxar added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels May 22, 2023
@github-actions github-actions bot added the @aws-cdk/aws-batch Related to AWS Batch label May 22, 2023
@peterwoodworth (Contributor)

> If you manually remove the container execution role from the job definition, the job will succeed

Do you mean with escape hatches, manually removing the role from being associated with the container?

With the current implementation, a role should always be created, so an execution role should always be specified using this construct. Do you mean that this is the issue, and that no role should be created in some circumstances?

this.executionRole = props.executionRole ?? createExecutionRole(this, 'ExecutionRole');

@peterwoodworth peterwoodworth added p1 effort/small Small work item – less than a day of effort and removed needs-triage This issue or PR still needs to be triaged. labels May 23, 2023
@paulharbulak-maxar (Author)

@peterwoodworth thanks for the response. What I did was manually create a new revision of the job definition, and in that revision remove the execution role associated with the container.

> Do you mean that this is the issue, and that no role should be created in some circumstances?

I think always specifying an execution role is fine, but I would hope that the default role wouldn't cause permissions errors with the default logging.

@wttat commented May 31, 2023

Facing this problem as well; the default execution_role should be granted permission to create CloudWatch log streams.

@comcalvi (Contributor) commented Jun 29, 2023

@wttat @paulharbulak-maxar are your containers actually calling CreateLogStream through the CLI / SDK? How is the log stream being created?

I reproduced this by not specifying logging. I think I have a fix for this, but in the meantime you can unblock yourselves by specifying logging, as sketched below.
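
For example, the container definition from the reproduction steps could be given an explicit log driver. A minimal sketch (environment variables omitted for brevity; the stream_prefix value is an arbitrary placeholder, and awslogs is one choice of driver):

from aws_cdk import Size
from aws_cdk import aws_ecs as ecs

container = batch.EcsEc2ContainerDefinition(
    self,
    "WFSBatchContainerDefinition",
    image=ecs.ContainerImage.from_asset("./metadata"),
    cpu=1,
    memory=Size.mebibytes(8192),
    # With logging set, the construct binds the log driver and grants the
    # execution role write access to the log group (see analysis below).
    logging=ecs.LogDrivers.aws_logs(stream_prefix="wfs-batch"),
)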

@tam0ri (Contributor) commented Jul 1, 2023

Reproduction

I could reproduce it on my end. If we don't specify the logging property, which means no logDriver is set explicitly in the job definition, Batch uses awslogs with the log group /aws/batch/job by default.

https://docs.aws.amazon.com/batch/latest/userguide/job_definition_parameters.html#containerProperties

> The log driver to use for the job. By default, AWS Batch enables the awslogs log driver.

> @wttat @paulharbulak-maxar are your containers actually calling CreateLogStream through the CLI / SDK? How is the log stream being created?

In this context, the CreateLogStream API is called by the Docker daemon, because the log driver is Docker functionality; the user's containers don't call the CloudWatch Logs API themselves.

Problem

Since #25466, an executionRole is always created by default. This means the underlying ECS tasks use the execution role, not the EC2 instance role, for the log driver's API calls.

However, if the logging property is not specified, grantWrite is never called for the execution role, because grantWrite is only invoked from the log driver's bind method:

if (props.logging) {
  this.logDriverConfig = props.logging.bind(this, {
    ...this as any,
    // TS!
    taskDefinition: {
      obtainExecutionRole: () => this.executionRole,
    },
  });
}

// ...and grantWrite is only reached inside the log driver's bind():
this.logGroup.grantWrite(containerDefinition.taskDefinition.obtainExecutionRole());

Solution

I think one possible solution is granting the execution role permission on the default log group when the logging property is not specified. If this approach is acceptable, I'll submit a PR later.

mergify bot closed this as completed in #26288 Jul 12, 2023
mergify bot pushed a commit that referenced this issue Jul 12, 2023

Grant `CreateLogStream` to the job definition's execution role by default. Without this permission, jobs will fail if they produce any output, unless `logging` is specified.

Closes #25675.

----

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*

bmoffatt pushed a commit to bmoffatt/aws-cdk that referenced this issue Jul 29, 2023