
Add one pager for improved container image lifecycle #10352

Merged
merged 7 commits into dotnet:main on Sep 13, 2022

Conversation

michellemcdaniel
Contributor

To double check:

Assumptions include:
- The Matrix of Truth work will enable us to identify all pipelines and branches that are using docker containers and which images they are using
- We will be able to extend the existing publishing infrastructure to also identify images that are due for removal
- All of our existing base images can be replaced with MAR-approved images
Member

I see at least two base images today that aren't approved - openSUSE and Raspbian.

Contributor Author

There is also the fact that they are going to be removing the Alpine images (which I just noticed while scrolling through the website). They are suggesting that folks running on Alpine move to Mariner.


### Rollout and Deployment

As part of this work, we will need to implement a rollout story for the new tagging feature. We do not want every published image to immediately be tagged as `latest`. In fact, we may want to implement two different tags: `latest` and `staging`. In this scheme, we would branch the `dotnet-buildtools-prereqs-docker` repo so that we have a production branch. Every image published from main would be tagged `staging`, which could then be used in testing, much like the images in our Helix Int pool. This tag would be used for identifying issues ahead of time so that when we roll out, we will be more confident that the images we are tagging as `latest` will be safe for our customers. The rollout would be performed on a weekly basis, much like our helix-service, helix-machines, and arcade-services rollouts. We roll all of the known good changes in the main branch to production and publish those images with the `latest` tag. This would allow us to reuse the same logic for both staging and latest.
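To make the promotion step concrete, here is a minimal dry-run sketch of what retagging a validated `staging` image as `latest` could look like. The registry and image names are invented for illustration and are not the real pipeline; `DRY_RUN=1` prints the commands instead of executing them, so no registry access is needed:

```shell
# Illustrative sketch only: promote a validated `staging` image to `latest`.
# Registry and image names below are hypothetical, not the real prereqs repo.
set -eu

REGISTRY="mcr.example.com/dotnet-buildtools/prereqs"   # hypothetical registry
IMAGE="ubuntu-20.04-coredeps"                          # hypothetical image
DRY_RUN=1

run() {
  # In dry-run mode, print the command instead of executing it.
  if [ "$DRY_RUN" = 1 ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Pull the image that was validated under the staging tag, retag it as
# latest, and push the new tag. The image digest is unchanged; only tags move.
run docker pull "$REGISTRY/$IMAGE:staging"
run docker tag  "$REGISTRY/$IMAGE:staging" "$REGISTRY/$IMAGE:latest"
run docker push "$REGISTRY/$IMAGE:latest"
```

Because tags are just pointers to an image digest, the weekly promotion does not rebuild anything; it republishes the already-validated content under the production tag.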
Member

I see the potential for some UX issues here. There are two different but related scenarios.

  1. I am a developer who is building a new image variant and wants to consume it in the product build. When the new image variant is built for the first time, will a latest exist right away, or will only staging exist? As a developer I want to be able to check in my code using latest after I validate my changes, so that I don't have to check in using staging and then remember to make a second change to update it to latest once available.

  2. I am a developer making a change to an existing image variant - e.g. adding a new component. Like the previous scenario, if I validate the changes, I want to be able to commit my changes that take a dependency on the new component right away. In this scenario, maybe you are envisioning that if a new component is added, then it becomes a new image variant?

Contributor Author

How I envision this is that it will act the same as when we add new queues to helix: you request it, it goes into staging when the work is done, and then it rolls out to prod the following Wednesday. Our product teams are already used to that timeframe for all other images, and while it's a bit of an assumption on my part, I think they will adjust to that being true for docker images as well.

As part of this work, we are likely going to end up requiring most, if not all, changes to docker images to come through dnceng so that we can vet the changes and make sure new dependencies have updates and the like, much like we do with helix/buildpool machines.

Member

> I think they will adjust to that being true for docker images as well.

Perhaps there is a middle ground here. At least for new images, perhaps latest can be created right away.

I am sure it will be an iterative process that will be refined over time. I hope we all agree that we should be attentive to feedback and explore options to adjust to provide the best UX we can reasonably provide.

Member

Agree. I think we don't necessarily have to talk about using the "latest" tag directly (which has a special meaning in docker as it's the tag that is used by default when a tag is not explicitly specified).

What about instead stating that we will need to design a tagging scheme that would allow us to handle containers in well-defined scenarios (such as the weekly rollout that @adiaaida has in mind or the local development story that @MichaelSimons described), define these scenarios first, and later, based on these scenarios, think about a concrete implementation via a tagging schema?

Not saying we should discard the idea of staging / latest tagging completely, but perhaps we could park it for now and focus on investigating the scenarios first?

Member

There is always the option to hotfix prod to get a new latest, but I agree with @adiaaida that this is a process that teams are already used to. It seems pretty rare that there would be a case where a new platform would immediately be ready for check-in in < 1 week.

I do like branching as a method of managing what is 'prod' vs. what is head of tree. This is a well-worn path that is consistent with our other services.


- Will the new implementation of any existing functionality cause breaking changes for existing consumers?

The major risk in this portion of the epic is finding and updating all container usages by product teams, and making sure that moving them to the latest versions of the container images doesn't break their builds/tests because of missing artifacts. Our goal is to use docker tags to label the latest known good of each container image, and replace usages of specific docker image tags with a `<os>-<version>-<other-identifying-info>-latest` tag. That way, much like with helix images, their builds and tests will be updated automatically when we deploy a new latest version. In the transition to latest images, we may find that older versions of a container may have different versions of artifacts installed on those containers, which could affect builds and tests. We will need to be prepared to help product teams identify these issues and work through them.
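As a rough illustration of that replacement, the sketch below rewrites a pinned tag into the floating `-latest` form. The pinned tag's timestamp/hash layout and the pipeline snippet are assumptions for illustration, not the actual scheme:

```shell
# Illustrative only: rewrite a pinned prereqs image tag in a pipeline file to
# the floating "-latest" form. The tag layout below is assumed, not final.
set -eu

# Sample line as it might appear in an azure-pipelines.yml (hypothetical tag):
pinned='container: mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-20.04-coredeps-20220101123456-abcdef0'

# Drop the trailing timestamp/hash qualifiers and substitute "latest",
# yielding the <os>-<version>-<other-identifying-info>-latest shape.
floating=$(printf '%s\n' "$pinned" \
  | sed -E 's/(prereqs:[a-z0-9.-]+-coredeps)-[0-9]{14}-[a-f0-9]+/\1-latest/')

echo "$floating"
# container: mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-20.04-coredeps-latest
```

The one-time migration is mechanical; the ongoing risk described above is behavioral, since the floating tag means the image contents can change underneath a consumer between rollouts.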
Member

The Matrix of Truth (MoT) code now contains some heuristics to derive the OperatingSystemId (the unified way to connect OS definitions between MoT, OSOB and RIDs) for docker environments. The OperatingSystemId definitions can be found in the helix-machines repo at os-definitions.json.

When designing the new tagging scheme, please make sure that it's possible to obtain the OperatingSystemId for our containers in some algorithmic way rather than via the currently used heuristics, which might not be 100% accurate in some cases. To be clear, I'm not proposing that the OperatingSystemId be used in the new tagging schema directly, but I'm raising this for awareness, as this is the last remaining issue in unifying the concept of operating system identification across our infrastructure.

Member

In what way is the RID used from this data? Because I see Mariner listed there with a RID of mariner.1-x64. But there are no RIDs defined for Mariner: https://github.com/dotnet/runtime/blob/main/src/libraries/Microsoft.NETCore.Platforms/src/runtime.json.

Member

I think having the MoT OperatingSystemId as a part of the docker tag wouldn't be a bad idea (`Linux-Ubuntu-20.04-AMD64--latest`, for example).
The RID isn't currently used for anything; it is there for informational purposes.
We are aware that there is no RID for Mariner; we should probably fix that. I'm not seeing anything that would tell us what to put there in dotnet/runtime#65566

Member

It doesn't sound like there will be a RID for Mariner. There's no functional need.

Member

Let me share a bit of context here. When we started designing the Matrix of Truth, it soon became apparent that we needed a unified concept for identifying a particular version of an operating system, so that we could connect the concept of OS as used by PMs (to define which versions of OSes are supported by each version of the product) with our internal infrastructure and tooling, such as OSOB. Now, even though RIDs aren't directly an OS description (but rather a concept for expressing the target platforms where an application runs), it was clear that they are closely related to our OS definitions and that we should think about how to incorporate them into our model. From discussions with @ericstj and @eerhardt, we've learnt that each version of a given OS has exactly one RID that can be thought of as a "primary RID" for that version of the OS. This RID is what is used in the os-definitions.json mentioned above. Having said that, I don't think these values are currently used anywhere besides our MoT PowerBI reports.

The point about the non-existing RID for Mariner is of course valid. @ericstj - would you know if there are plans for introducing RIDs for Mariner, and if not, what RID would correspond to the primary RID of this platform?

Member

Here's an issue for that: dotnet/runtime#65566
Folks are proposing removing the concept of the distro/version/arch specific RIDs: dotnet/designs#260

RID happens to be useful for you today in its current form. It may change, but that doesn't mean you can't continue to use the algorithm that the host had for identifying a RID. It's really just an algorithm that encodes a number of significant machine characteristics into a string. If the host changes that algorithm you can just maintain your own copy of one that works for you. Even the host's algorithm requires regular maintenance to ensure it continues to provide unique and meaningful strings per distro.
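As a loose illustration of Eric's point that a RID is "just an algorithm that encodes a number of significant machine characteristics into a string", here is a sketch that assembles a RID-like string from os-release-style fields. This parses sample data, not the live machine, and is not the runtime host's actual algorithm, just the shape of one:

```shell
# Sketch: derive a RID-like string from (sample) os-release content plus an
# architecture. Not the .NET host's real implementation -- an illustration.
set -eu

os_release='ID=ubuntu
VERSION_ID="20.04"'
arch="x64"

# Pull out the distro id and version, stripping optional quotes.
id=$(printf '%s\n' "$os_release" | sed -n 's/^ID=//p')
ver=$(printf '%s\n' "$os_release" | sed -n 's/^VERSION_ID="\{0,1\}\([^"]*\)"\{0,1\}$/\1/p')

# Encode distro, version, and architecture into one deterministic string.
rid="${id}.${ver}-${arch}"
echo "$rid"
# ubuntu.20.04-x64
```

The value of owning such an algorithm, as Eric notes, is that it can be maintained independently of whatever the host ends up doing with distro-specific RIDs.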

Member

Thanks for the clarification, Eric!

- What are your assumptions?
- The Matrix of Truth work will enable us to identify all pipelines and branches that are using docker containers and which images they are using
- We will be able to extend the existing publishing infrastructure to also identify images that are due for removal
- All of our existing base images can be replaced with MAR-approved images (we can already see where this is not true: openSUSE and Raspbian are not available as MAR-tagged images, and Alpine will be deprecated soon)
Member

Do we know what that (not having certain images available as MAR-tagged images) means for us, or is this rather a plain statement at this moment?

Contributor Author

We do not know right now what that means for us, but it will likely require us to make some changes. In particular, with Alpine, they have very clearly stated that unless there is some definitively required scenario for Alpine, they will not support it, and all Alpine images should be moved to Mariner.

Member

You cannot simply replace Alpine usage with Mariner because Alpine is musl based and Mariner is glibc based. We need to test and possibly even build on musl based distros.

Member

Exactly. We should find out if the Mariner team has some recommendations for customers that need a musl based environment for any reason.

Contributor Author

We are definitely going to need to talk to the MAR team about all of this in general. The language in their docs suggests... complications. I will try to reach out to them about our scenario(s) this week, and update the team (and possibly this PR) based on those conversations.

Contributor Author

Basically, for anything where we need images that don't exist, we will need to request that they be tagged and give business justification for it.



We will also need a rollback story so that if an image breaks a product team's build or test, we can untag that image and retag the previous `latest` image. A rollback should be as simple as reverting a previous change and publishing the images at the new commit. While this image may be identical to a previously published image, it will effectively be treated as a new version.
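The revert-and-republish flow described above could be sketched as a dry run like the following. The commit id and image name are made up, and `run` only prints the commands so nothing is executed:

```shell
# Illustrative dry-run of the rollback flow: revert the offending change in
# the prereqs repo and republish, which restores the previous known-good
# content under `latest`. Commit and image names are hypothetical.
set -eu

BAD_COMMIT="abc1234"          # hypothetical commit that broke a consumer
run() { echo "+ $*"; }        # dry-run helper: print instead of execute

run git revert --no-edit "$BAD_COMMIT"
run git push origin main
# CI then rebuilds and republishes. The resulting image content matches the
# prior latest, but it is treated as a brand new version/tag.
run docker push "mcr.example.com/dotnet-buildtools/prereqs/ubuntu-20.04-coredeps:latest"
```

Treating the rollback as a new forward-published version keeps the publishing pipeline single-path: there is no special "untag" machinery, just another publish.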
Member

Similarly to my comment above, we could just agree that we need to be able to perform a rollback when needed and later investigate the possibilities of how to do that.

Contributor Author

I have removed the implementation-type details from this section.

mmitche
mmitche previously approved these changes Aug 25, 2022
@michellemcdaniel
Contributor Author

@tkapin @mthalman @MichaelSimons Are we all in agreement on this one pager? I'd like to get it checked in and break down the rest of the work. Have I missed any comments?

MichaelSimons
MichaelSimons previously approved these changes Sep 12, 2022
Member

MichaelSimons left a comment

One pager looks good to me. I just had one comment regarding wording clarity.

@tkapin
Member

tkapin commented Sep 12, 2022

> @tkapin @mthalman @MichaelSimons Are we all in agreement on this one pager? I'd like to get it checked in and break down the rest of the work. Have I missed any comments?

LGTM, the only thing I'd like to see mentioned explicitly is that we need to ensure that the Matrix of Truth docker tag parsing keeps working when changing the tagging schema.
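One way to guard that requirement would be a small regression check like the sketch below. The parsing regex stands in for the real Matrix of Truth tag-parsing logic (whose implementation isn't shown in this thread), and the tag layouts are assumed for illustration:

```shell
# Hypothetical regression check: whatever extracts <os>-<version> from a
# docker tag must keep working when the trailing qualifier changes from a
# timestamp/hash to "-latest". The regex here is a stand-in, not MoT's code.
set -eu

parse_os() {
  # Extract the leading <os>-<version> portion of a tag.
  printf '%s\n' "$1" | sed -E 's/^([a-z]+-[0-9.]+)-.*/\1/'
}

old_tag="ubuntu-20.04-coredeps-20220101123456-abcdef0"   # assumed old layout
new_tag="ubuntu-20.04-coredeps-latest"                   # proposed new layout

parse_os "$old_tag"   # ubuntu-20.04
parse_os "$new_tag"   # ubuntu-20.04
```

Adding new tags without touching the old ones should be safe for any parser anchored on the tag prefix, but a check like this makes that assumption explicit.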

tkapin
tkapin previously approved these changes Sep 12, 2022
@michellemcdaniel
Contributor Author

@dkurepa can you confirm whether or not adding new tags (without touching old tags) will affect MoT?

mthalman
mthalman previously approved these changes Sep 12, 2022
@michellemcdaniel michellemcdaniel merged commit 155def5 into dotnet:main Sep 13, 2022