pre-install-payload: Handle official and VFIO / GPU containerd installations #256

fidencio
Member

This work is required because we're moving away from the forked version of containerd, though not for all projects, and not without requiring a minimum version of containerd.

enclave-cc will still depend on the containerd fork for v0.8.0, and we may still need to install the minimum version of containerd on clusters that, for one reason or another, run containerd v1.6.x. We therefore need to handle those two different installations gracefully.

…sion

This ensures a clear separation between using / installing the CoCo
specific fork of containerd, and using an official containerd release
(which will come later in this series).

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
@fidencio fidencio requested a review from bpradipt August 31, 2023 08:31
@fidencio fidencio force-pushed the topic/pre-install-support-official-containerd-installation branch from b30b6d9 to 54ccfca Compare August 31, 2023 08:32
We're renaming a few variables here to make sure we use COCO stuff only
for the projects we've forked, such as containerd.

This will help immensely when having to deal with both the official and
the CoCo-forked version of a specific component.

In the long run, of course, we want to get rid of all the forks we have,
but this is needed at least for this release.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
@fidencio fidencio force-pushed the topic/pre-install-support-official-containerd-installation branch 2 times, most recently from 3ad7619 to 771523f Compare August 31, 2023 08:46
@fidencio
Member Author

/test

@fidencio fidencio force-pushed the topic/pre-install-support-official-containerd-installation branch from 771523f to 816b357 Compare August 31, 2023 09:14
@fidencio
Member Author

/test

Member

@stevenhorsman stevenhorsman left a comment

LGTM. Thanks!

install/pre-install-payload/payload.sh Outdated
@BbolroC
Member

BbolroC commented Aug 31, 2023

Regarding the test failure for s390x, it is caused by the instability of the infra. We can proceed to get this PR merged. 😉

Let's change the reqs-deploy.sh script so it's clear that we're dealing
with a specific kind of containerd installation, such as the CoCo forked
one.  Later on we'll introduce an option to install an official
containerd binary, just in case folks are running on a cluster that
doesn't have the minimum prerequisites in place.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
88dda3e removed those from one part of
the config, but forgot to remove those from the pre-install /
post-uninstall parts.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
We rely on the node having containerd v1.7.0 or higher for the work
we're doing.

With this in mind, let's also ship the minimum containerd required by
us, so users can decide whether or not to use it instead of upgrading
their clusters.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
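
The minimum-version requirement from the commit message above could be checked with something like the sketch below. It is not taken from payload.sh: the function name `version_at_least` and the `node_version` variable are hypothetical, and it assumes a `sort` that supports `-V` (version sort, as in GNU coreutils).

```shell
# Hypothetical helper, not part of this PR.
# Returns 0 when version $1 is greater than or equal to version $2.
version_at_least() {
    # Strip an optional leading "v" so "v1.7.0" and "1.7.0" compare equal.
    local have="${1#v}"
    local want="${2#v}"
    # sort -V orders version strings; if the required version sorts first
    # (or both are equal), the installed version is new enough.
    [ "$(printf '%s\n%s\n' "${want}" "${have}" | sort -V | head -n1)" = "${want}" ]
}

# Example: node_version is an assumed, hand-set value for illustration.
node_version="1.6.8.2"
if ! version_at_least "${node_version}" "1.7.0"; then
    echo "containerd ${node_version} is older than the required 1.7.0"
fi
```

A plain string comparison would misorder versions such as 1.10.0 and 1.7.0, which is why the sketch leans on version sort instead.
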
Let's just make it match the one used in the Makefile.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
@fidencio fidencio force-pushed the topic/pre-install-support-official-containerd-installation branch 2 times, most recently from 975785f to 280815b Compare August 31, 2023 13:01
@fidencio
Member Author

/test

@fidencio fidencio changed the title pre-install-payload: Handle official containerd installation pre-install-payload: Handle official and VFIO / GPU containerd installations Aug 31, 2023
# If set to true, this will install the CoCo fork of the containerd,
# the one that has patches for handling GPU / VFIO, on the node
# default: false
- name: "INSTALL_VFIO_GPU_CONTAINERD"
Member

Do we really need three different options? Or just having INSTALL_COCO_CONTAINERD for custom containerd and INSTALL_OFFICIAL_CONTAINERD for officially released containerd version is good enough?

My apologies if this has been already discussed and I'm missing the full context.

Member

Hey Pradipta, the extra context is that Zvonko & Nvidia need some extra features in containerd that are in our custom fork at the moment but are not related to the image pull pieces, so I believe Fabiano has separated them out, as they are unrelated and will be resolved at different times. https://cloud-native.slack.com/archives/C039JSH0807/p1693476820263639 contains some of the info. Fabiano can correct me if I'm wrong here, but I think he's about 12 months behind on GitHub notifications.

I hope that helps!

Member Author

That's a very good point.

We have 3 different scenarios to cover, Pradipta.

  1. We use the forked containerd, which allows us to pull the image on the guest. This version is based on 1.6.x, and I really expect it to die as soon as possible, but that won't happen for the v0.8.0 release, as Enclave CC still depends on it. Without much thought, we also added the VFIO / GPU related patches to that containerd, as it was already a forked version.

  2. We do not use the forked containerd, and instead rely on nydus-snapshotter to pull the image on the guest.
    First of all, this brings in a requirement to use containerd v1.7.x, as that greatly simplifies the snapshotter setup, which can now be done per runtime handler; that was not possible with 1.6.x.

Then, within 2, we have the following scenarios:

2.1. I just want to use it, and I don't care about VFIO / GPUs: then the upstream containerd is what we want, and we can install it. I hope we will never do that, to be honest, but we're giving users the option so they can easily test CoCo.

2.2. I do need the VFIO / GPU stuff: now we have to deal with non-upstreamed patches, and that's yet another containerd fork. Could it be the same one as in 1? Maybe, but I'd prefer to make it as clear as possible why we're using a forked version of containerd. I'm afraid people would try to use VFIO / GPU and end up using the non-maintained / non-supported way of pulling the image into the guest, by mistake or for lack of understanding of what has to be changed.

That's the reason we end up with those 3 different env vars. :-/
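
A rough sketch of how the three scenarios above could map onto the three env vars. Only the variable names (`INSTALL_COCO_CONTAINERD`, `INSTALL_OFFICIAL_CONTAINERD`, `INSTALL_VFIO_GPU_CONTAINERD`) come from this PR; the function name, the printed labels, and the assumption that every variable defaults to `false` are illustrative and not copied from payload.sh.

```shell
# Hypothetical helper, not part of this PR: prints which containerd
# flavour(s) the current environment asks for, or "none".
containerd_flavours_to_install() {
    local flavours=""

    # Scenario 1: CoCo fork (1.6.x based), pulls the image on the guest.
    if [ "${INSTALL_COCO_CONTAINERD:-false}" = "true" ]; then
        flavours="${flavours} coco"
    fi

    # Scenario 2.1: upstream containerd, no VFIO / GPU needs.
    if [ "${INSTALL_OFFICIAL_CONTAINERD:-false}" = "true" ]; then
        flavours="${flavours} official"
    fi

    # Scenario 2.2: separate fork carrying the VFIO / GPU patches.
    if [ "${INSTALL_VFIO_GPU_CONTAINERD:-false}" = "true" ]; then
        flavours="${flavours} vfio-gpu"
    fi

    # Trim the leading space; print "none" when nothing was requested.
    flavours="${flavours# }"
    echo "${flavours:-none}"
}
```

Keeping one variable per flavour means a user who wants VFIO / GPU support has to opt into that fork explicitly, rather than getting it as a side effect of enabling the fork from scenario 1.
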

Member

Thanks @stevenhorsman @fidencio .. This helps ..:-)

tar xvzpf containerd-${OFFICIAL_CONTAINERD_VERSION}-linux-${ARCH}.tar.gz -C ${NODE_DESTINATION} && \
rm containerd-${OFFICIAL_CONTAINERD_VERSION}-linux-${ARCH}.tar.gz

#### Confidential Containerd forked containerd for VFIO / GPU stuff
Member

nit: Containerd -> Containers

Member Author

Fixed the typo, thanks!

Member

@bpradipt bpradipt left a comment

/lgtm
Thanks @fidencio

As we rely on the node having containerd v1.7.0 or higher for the work
we're doing, and as some of the consumers strictly require a fork
that contains a fix for dealing with VFIO / GPU related workloads, we
need to introduce this new flavour and allow them to use it for the
specific cases mentioned above.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
@fidencio fidencio force-pushed the topic/pre-install-support-official-containerd-installation branch from 280815b to 7b094ab Compare September 1, 2023 10:42
@fidencio
Member Author

fidencio commented Sep 1, 2023

/test

@fidencio
Member Author

fidencio commented Sep 1, 2023

SNP is failing with:

13:49:54 1..1
13:50:11 not ok 1 [cc][kubernetes][containerd][snp] Test SNP unencrypted container launch success
13:50:11 # (in test file snp.bats, line 90)
13:50:11 #   `return 1' failed
13:50:11 # Deleting previous test services...
13:50:11 # No resources found in default namespace.
13:50:11 # service/snp-unencrypted created
13:50:11 # deployment.apps/snp-unencrypted created
13:50:11 # pod/snp-unencrypted-68dd8f7987-f5b9z condition met
13:50:11 # -------------------------------------------------------------------------------
13:50:11 # NAME                               STATUS   ROLES           AGE    VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION                     CONTAINER-RUNTIME
13:50:11 # amd-milan-coco-ci-ubuntu2004-001   Ready    control-plane   4m9s   v1.24.0   10.216.91.128   <none>        Ubuntu 20.04.6 LTS   5.19.0-rc6-snp-host-d9bd54fea4d2   containerd://1.6.8.2
13:50:11 # -------------------------------------------------------------------------------
13:50:11 # NAME              TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE    SELECTOR
13:50:11 # kubernetes        ClusterIP   10.96.0.1      <none>        443/TCP   4m8s   <none>
13:50:11 # snp-unencrypted   ClusterIP   10.98.69.121   <none>        22/TCP    9s     app=snp-unencrypted
13:50:11 # -------------------------------------------------------------------------------
13:50:11 # NAME              READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS        IMAGES                                                       SELECTOR
13:50:11 # snp-unencrypted   1/1     1            1           9s    snp-unencrypted   ghcr.io/confidential-containers/test-container:unencrypted   app=snp-unencrypted
13:50:11 # -------------------------------------------------------------------------------
13:50:11 # NAME                               READY   STATUS    RESTARTS   AGE   IP           NODE                               NOMINATED NODE   READINESS GATES
13:50:11 # snp-unencrypted-68dd8f7987-f5b9z   1/1     Running   0          9s    10.244.0.7   amd-milan-coco-ci-ubuntu2004-001   <none>           <none>
13:50:11 # -------------------------------------------------------------------------------
13:50:11 # Name:         snp-unencrypted-68dd8f7987-f5b9z
13:50:11 # Namespace:    default
13:50:11 # Priority:     0
13:50:11 # Node:         amd-milan-coco-ci-ubuntu2004-001/10.216.91.128
13:50:11 # Start Time:   Fri, 01 Sep 2023 11:50:01 +0000
13:50:11 # Labels:       app=snp-unencrypted
13:50:11 #               pod-template-hash=68dd8f7987
13:50:11 # Annotations:  <none>
13:50:11 # Status:       Running
13:50:11 # IP:           10.244.0.7
13:50:11 # IPs:
13:50:11 #   IP:           10.244.0.7
13:50:11 # Controlled By:  ReplicaSet/snp-unencrypted-68dd8f7987
13:50:11 # Containers:
13:50:11 #   snp-unencrypted:
13:50:11 #     Container ID:   containerd://e40cdd3e716da34472d0972225f936ed837af344815eb4118df4e8fba9548f11
13:50:11 #     Image:          ghcr.io/confidential-containers/test-container:unencrypted
13:50:11 #     Image ID:       [Encrypted]
13:50:11 #     Port:           <none>
13:50:11 #     Host Port:      <none>
13:50:11 #     State:          Running
13:50:11 #       Started:      Fri, 01 Sep 2023 11:50:09 +0000
13:50:11 #     Ready:          True
13:50:11 #     Restart Count:  0
13:50:11 #     Environment:    <none>
13:50:11 #     Mounts:
13:50:11 #       /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nm6sr (ro)
13:50:11 # Conditions:
13:50:11 #   Type              Status
13:50:11 #   Initialized       True
13:50:11 #   Ready             True
13:50:11 #   ContainersReady   True
13:50:11 #   PodScheduled      True
13:50:11 # Volumes:
13:50:11 #   kube-api-access-nm6sr:
13:50:11 #     Type:                    Projected (a volume that contains injected data from multiple sources)
13:50:11 #     TokenExpirationSeconds:  3607
13:50:11 #     ConfigMapName:           kube-root-ca.crt
13:50:11 #     ConfigMapOptional:       <nil>
13:50:11 #     DownwardAPI:             true
13:50:11 # QoS Class:                   BestEffort
13:50:11 # Node-Selectors:              katacontainers.io/kata-runtime=true
13:50:11 # Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
13:50:11 #                              node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
13:50:11 # Events:
13:50:11 #   Type    Reason     Age   From               Message
13:50:11 #   ----    ------     ----  ----               -------
13:50:11 #   Normal  Scheduled  9s    default-scheduler  Successfully assigned default/snp-unencrypted-68dd8f7987-f5b9z to amd-milan-coco-ci-ubuntu2004-001
13:50:11 #   Normal  Pulling    4s    kubelet            Pulling image "ghcr.io/confidential-containers/test-container:unencrypted"
13:50:11 #   Normal  Pulled     1s    kubelet            Successfully pulled image "ghcr.io/confidential-containers/test-container:unencrypted" in 3.066909737s
13:50:11 #   Normal  Created    1s    kubelet            Created container snp-unencrypted
13:50:11 #   Normal  Started    0s    kubelet            Started container snp-unencrypted
13:50:11 # -------------------------------------------------------------------------------
13:50:11 # Pseudo-terminal will not be allocated because stdin is not a terminal.
13:50:11 # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
13:50:11 # @    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
13:50:11 # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
13:50:11 # IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
13:50:11 # Someone could be eavesdropping on you right now (man-in-the-middle attack)!
13:50:11 # It is also possible that a host key has just been changed.
13:50:11 # The fingerprint for the ED25519 key sent by the remote host is
13:50:11 # SHA256:pRNlhyP1VNLGtHSvAu34PWl+vadTqGVcG4x9fSOcQb8.
13:50:11 # Please contact your system administrator.
13:50:11 # Add correct host key in /root/.ssh/known_hosts to get rid of this message.
13:50:11 # Offending ED25519 key in /root/.ssh/known_hosts:1
13:50:11 #   remove with:
13:50:11 #   ssh-keygen -f "/root/.ssh/known_hosts" -R "10.244.0.7"
13:50:11 # Keyboard-interactive authentication is disabled to avoid man-in-the-middle attacks.
13:50:11 # root@10.244.0.7: Permission denied (publickey,password,keyboard-interactive).
13:50:11 # KATA SNP TEST - FAIL: SNP is NOT Enabled
13:50:14 INFO: Uninstall the operator
13:50:15 ccruntime.confidentialcontainers.org "ccruntime-sample" deleted
13:50:21 namespace "confidential-containers-syst

This is not related to this PR at all. /cc @ryansavino

@fidencio
Member Author

fidencio commented Sep 1, 2023

TDX failed with:

14:39:33 TASK [Install qemu-user-static] ************************************************
14:39:33 fatal: [localhost]: FAILED! => {"changed": true, "cmd": "docker run --rm --privileged multiarch/qemu-user-static:7.2.0-1 --reset -p yes", "delta": "0:00:00.426877", "end": "2023-09-01 20:39:33.828234", "msg": "non-zero return code", "rc": 125, "start": "2023-09-01 20:39:33.401357", "stderr": "docker: Error response from daemon: connection error: desc = \"transport: Error while dialing dial unix:///run/containerd/containerd.sock: timeout\": unavailable.", "stderr_lines": ["docker: Error response from daemon: connection error: desc = \"transport: Error while dialing dial unix:///run/containerd/containerd.sock: timeout\": unavailable."], "stdout": "", "stdout_lines": []}

This is unrelated, but I retriggered the CI.

@fidencio fidencio merged commit 9f4dbeb into confidential-containers:main Sep 1, 2023
10 of 11 checks passed
@ryansavino
Member

SNP is failing with:

[...]

This is not related to this PR at all. /cc @ryansavino

This one seems interesting. The only thing I could think of here is that it snagged the wrong key associated with ssh for the image. If we see this again on another PR, tag me and I'll take a look. Thanks.
