payload: Update payload to latest CI #220

Merged

Conversation

stevenhorsman
Member

  • Update the ccruntime and enclave payloads to use the latest kata-containers-ci builds
  • I've left container-engine-for-cc-payload at its current commit, as it doesn't update frequently and during 0.6 we found that the most recent commit wasn't working properly

Fixes: #219

@stevenhorsman
Member Author

/test

Member

@bpradipt bpradipt left a comment

/lgtm
Thanks @stevenhorsman
I think we should also have a post-release checklist to keep track of switching the versions to point to head/latest etc

@stevenhorsman
Member Author

> /lgtm
> Thanks @stevenhorsman
> I think we should also have a post-release checklist to keep track of switching the versions to point to head/latest etc

Yeah, we had a big discussion in #169 about whether we want to switch to latest by default after a release. IIRC, Wainer was against it and wanted an integration CI to run first before we bumped. I'm not sure where that CI got to, though.

@stevenhorsman
Member Author

stevenhorsman commented Jun 26, 2023

@wainersm - it looks like the uninstall operator test is hanging?

16:22:23 ok 14 [cc][agent][kubernetes][containerd] Test cannot pull an encrypted image inside the guest without decryption key
16:42:23 Build timed out (after 20 minutes). Marking the build as aborted.
16:42:23 Build was aborted

I have a feeling this has been seen before? Is there anything needed to debug it, or should I be able to reproduce it locally tomorrow?

@arronwy
Member

arronwy commented Jun 27, 2023

> @wainersm - it looks like the uninstall operator test is hanging?
>
> 16:22:23 ok 14 [cc][agent][kubernetes][containerd] Test cannot pull an encrypted image inside the guest without decryption key
> 16:42:23 Build timed out (after 20 minutes). Marking the build as aborted.
> 16:42:23 Build was aborted
>
> I have a feeling this has been seen before? Is there anything needed to debug it, or should I be able to reproduce it tomorrow?

Yes, the TDX CI also has this same issue. I remember seeing it before, but it was already fixed.

@stevenhorsman
Member Author

I've tried this manually and see:

# kubectl get pods -A
NAMESPACE                        NAME                                                     READY   STATUS    RESTARTS      AGE
confidential-containers-system   cc-operator-controller-manager-ccbbcfdf7-kgzwv           2/2     Running   0             18m
confidential-containers-system   cc-operator-daemon-install-rl52k                         1/1     Running   0             18m
confidential-containers-system   cc-operator-daemon-uninstall-5hk2v                       1/1     Running   0             62s
confidential-containers-system   cc-operator-post-uninstall-daemon-k7fx2                  0/1     Error     3 (40s ago)   62s
confidential-containers-system   cc-operator-pre-install-daemon-6b6z9                     1/1     Running   0             18m

Doing a describe on the uninstall daemon I get the events:

Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  96s                default-scheduler  Successfully assigned confidential-containers-system/cc-operator-post-uninstall-daemon-k7fx2 to sh-coco-operator1.fyre.ibm.com
  Normal   Pulled     94s                kubelet            Successfully pulled image "quay.io/confidential-containers/container-engine-for-cc-payload:98a790e8abdcc06c4b629b290ebaa217bf82e305" in 1.147147807s
  Normal   Pulled     91s                kubelet            Successfully pulled image "quay.io/confidential-containers/container-engine-for-cc-payload:98a790e8abdcc06c4b629b290ebaa217bf82e305" in 571.651515ms
  Normal   Pulled     75s                kubelet            Successfully pulled image "quay.io/confidential-containers/container-engine-for-cc-payload:98a790e8abdcc06c4b629b290ebaa217bf82e305" in 614.524078ms
  Normal   Pulling    48s (x4 over 96s)  kubelet            Pulling image "quay.io/confidential-containers/container-engine-for-cc-payload:98a790e8abdcc06c4b629b290ebaa217bf82e305"
  Normal   Created    47s (x4 over 94s)  kubelet            Created container cc-runtime-post-uninstall-pod
  Normal   Started    47s (x4 over 94s)  kubelet            Started container cc-runtime-post-uninstall-pod
  Normal   Pulled     47s                kubelet            Successfully pulled image "quay.io/confidential-containers/container-engine-for-cc-payload:98a790e8abdcc06c4b629b290ebaa217bf82e305" in 616.929539ms
  Warning  BackOff    6s (x8 over 89s)   kubelet            Back-off restarting failed container

Its latest logs were:

kubectl logs cc-operator-post-uninstall-daemon-k7fx2 -n confidential-containers-system
Removing containerd-for-cc artifacts from host
Removing the systemd drop-in file
Removing the systemd drop-in file's directory, if empty
Restarting containerd

So I think that is why the test is hanging. I'm not sure what the cause is at the moment, though.
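
For anyone reproducing this, a small filter over the `kubectl get pods -A` output makes the failing pod stand out. This is only a sketch: the function name `unhealthy_pods` is made up for illustration and is not part of the operator repo, and it assumes the default column layout shown in the listing above, where STATUS is the fourth column.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the operator repo).
# Reads `kubectl get pods -A` output on stdin and prints "<namespace>/<name> <status>"
# for every pod whose STATUS is neither Running nor Completed.
# Assumption: default table layout, i.e. a header row and STATUS in column 4.
unhealthy_pods() {
    awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}

# Usage against a live cluster:
#   kubectl get pods -A | unhealthy_pods
```

Each pod it reports can then be inspected with `kubectl describe pod <name> -n <namespace>` and `kubectl logs <name> -n <namespace>`, as was done here.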

@stevenhorsman
Member Author

/test

@wainersm
Member

> > @wainersm - it looks like the uninstall operator test is hanging?
> >
> > 16:22:23 ok 14 [cc][agent][kubernetes][containerd] Test cannot pull an encrypted image inside the guest without decryption key
> > 16:42:23 Build timed out (after 20 minutes). Marking the build as aborted.
> > 16:42:23 Build was aborted
> >
> > I have a feeling this has been seen before? Is there anything needed to debug it, or should I be able to reproduce it tomorrow?
>
> Yes, the TDX CI also has this same issue. I remember seeing it before, but it was already fixed.

Yep, this bug is not new. I am trying to remember and digging into my logs to find out how we solved it in the past.

Regarding the test getting stuck, I opened an RFE a while ago (#181) but haven't had a chance to implement it.
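
Independently of what #181 ends up proposing, one way to keep a hanging step from consuming the whole 20-minute build budget is to bound each wait explicitly. The sketch below is an assumption about how that could look, not the RFE's design; `wait_until` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command repeatedly until it succeeds or a deadline passes.
# Usage: wait_until <timeout_seconds> <interval_seconds> <command> [args...]
# The interval must be a positive integer; elapsed time is approximated by
# counting sleeps, so the command's own run time is not included.
wait_until() {
    local timeout_s=$1
    local interval_s=$2
    shift 2
    local elapsed=0
    until "$@"; do
        if [ "$elapsed" -ge "$timeout_s" ]; then
            echo "timed out after ${elapsed}s waiting for: $*" >&2
            return 1
        fi
        sleep "$interval_s"
        elapsed=$((elapsed + interval_s))
    done
}

# Illustrative usage: wait up to 5 minutes for the install daemon pod to go away.
#   wait_until 300 5 sh -c \
#     '! kubectl get pods -n confidential-containers-system 2>/dev/null | grep -q cc-operator-daemon-install'
```

With a bound like this the test fails with a message naming the step that stalled, instead of the job being aborted by the global build timeout.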

@wainersm
Member

> So I think that is why the test is hanging. I'm not sure what is the cause of that at the moment though

It seems to be stuck at https://github.com/confidential-containers/operator/blob/main/install/pre-install-payload/scripts/container-engine-for-cc-deploy.sh#L57, but the fact that cc-operator-post-uninstall-daemon-k7fx2 is in Error status might mean that the restart actually failed and the container script returned?

- Upload the ccruntime and enclave payloads to use the
latest kata-containers-ci builds
- I've left container-engine-for-cc-payload set as it
doesn't update frequently and during 0.6 we found that
the most recent commit wasn't working properly

Fixes: confidential-containers#219
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
@stevenhorsman stevenhorsman force-pushed the post-0.6-version-bump branch 3 times, most recently from bd77afd to 1efaf84 Compare July 5, 2023 11:09
@stevenhorsman
Member Author

/test

@fidencio
Member

fidencio commented Jul 6, 2023

/test
/test-tdx

@fidencio
Member

fidencio commented Jul 6, 2023

Now that we finally fixed the payload generation on the Kata Containers side, and the latest payload has a fix for the Operator hanging on uninstall, hopefully this one will be merged soon.

@fidencio
Member

fidencio commented Jul 6, 2023

/test

Member

@fidencio fidencio left a comment

lgtm, thanks @stevenhorsman!

@fidencio
Member

fidencio commented Jul 6, 2023

13:50:15 ok 15 [cc][operator] Test can uninstall the operator

Yep, we're good to go.

@fidencio
Member

fidencio commented Jul 6, 2023

I've retriggered the Cloud Hypervisor test, but the queue is not so short. So, please, wait till it reports back before merging it.

@fidencio fidencio merged commit 8c10735 into confidential-containers:main Jul 6, 2023
5 checks passed
Successfully merging this pull request may close these issues.

payload: Update versions to latest
5 participants