Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCPBUGS-38335: certrotation: avoid using applySecret to catch racy cert updates #1772

Merged

Conversation

vrutkovs
Copy link
Member

@vrutkovs vrutkovs commented Aug 6, 2024

applySecret/applyConfigMap does Get initially, so instead of using the version we got from informer it refreshes it and may overwrite if other changes were done in parallel.

Instead of using applySecret/applyConfigMap we should be invoking Create/Update here directly. If the informer has outdated version its update would be rejected by etcd and sync function would be restarted

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Aug 6, 2024
@openshift-ci-robot
Copy link

openshift-ci-robot commented Aug 6, 2024

@vrutkovs: This pull request references API-1802 which is a valid jira issue.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link

openshift-ci-robot commented Aug 6, 2024

@vrutkovs: This pull request references API-1802 which is a valid jira issue.

In response to this:

applySecret/applyConfigMap does Get initially, so instead of using the version we got from informer it refreshes it and may overwrite if other changes were done in parallel.

Instead of using applySecret/applyConfigMap we should be invoking Create/Update here directly. If the informer has outdated version its update would be rejected by etcd and sync function would be restarted

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

if len(caBundleConfigMap.ResourceVersion) == 0 {
actualCABundleConfigMap, err = c.Client.ConfigMaps(c.Namespace).Create(ctx, caBundleConfigMap, metav1.CreateOptions{})
} else {
actualCABundleConfigMap, err = c.Client.ConfigMaps(c.Namespace).Update(ctx, caBundleConfigMap.DeepCopy(), metav1.UpdateOptions{})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't record an event anymore, like ApplyConfigMap() does. Would that be something useful for users?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

caBundleConfigMap is already a copy, and it may have been modified already. Do we really need another DeepCopy() here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reporting an event on successful update is a good idea.

We don't need to call DeepCopy() here, please remove it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, removed

if err != nil {
return nil, err
}
if updated {
if equality.Semantic.DeepEqual(actualCABundleConfigMap.Data, caBundleConfigMap.Data) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically other parts of caBundleConfigMap may have been updated (not only .Data). Should we log the message below in those cases?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this additional check anyway ? (line 89)
We already know that the update was required. So, if the call succeeded then we can simply report success.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, we already have modified, so lets use it to emit the event

// and it works in tandem with a particular carry patch applied to the openshift kube-apiserver.
// we will remove this when we migrate all of the affected secret
// objects to their intended type: https://issues.redhat.com/browse/API-1800
UseSecretUpdateOnly bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this change here won't help with the race (?).

Since this was added here and you don't need it anymore, could we revert that PR instead? This would also get rid of the resourceapply changes, which we aren't doing here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, we should cleanup and remove unused code like ApplySecretDoNotUse.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

actualCABundleConfigMap, updated, err := resourceapply.ApplyConfigMap(ctx, c.Client, c.EventRecorder, caBundleConfigMap)
var actualCABundleConfigMap *corev1.ConfigMap
if len(caBundleConfigMap.ResourceVersion) == 0 {
actualCABundleConfigMap, err = c.Client.ConfigMaps(c.Namespace).Create(ctx, caBundleConfigMap, metav1.CreateOptions{})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we would like to report successful creation as well why not change the code to be more explicit.
For example:

  1. we would introduce a new var creationRequired bool or var creationNeeded bool variable.
  2. we would set the new variable at line 60
  3. we would rename modified to updateRequired or updateNeeded
  4. we would introduce a branch before line 89, for example:
if creationRequired {
 create
 report
 assign to `caBundleConfigMap`
 } else if updateRequired {
 update 
 report
 assign to `caBundleConfigMap`
 }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively we could move the creation to line 53.

The creation is a rare operation and immediately after creation the sync loop will be triggered and we can issue an update.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, now all functions should have exists and modified

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@vrutkovs vrutkovs force-pushed the cert-rotation-skip-get-in-apply branch 2 times, most recently from 287516d to 73337c3 Compare August 9, 2024 09:54
if !exists {
actualSigningCertKeyPairSecret, err = c.Client.Secrets(c.Namespace).Create(ctx, signingCertKeyPairSecret, metav1.CreateOptions{})
} else {
actualSigningCertKeyPairSecret, err = c.Client.Secrets(c.Namespace).Update(ctx, signingCertKeyPairSecret.DeepCopy(), metav1.UpdateOptions{})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

signingCertKeyPairSecret is either new or a deep copy of originalSigningCertKeyPairSecret, so I think you don't need another deep copy here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, removed

@vrutkovs vrutkovs force-pushed the cert-rotation-skip-get-in-apply branch from 73337c3 to 2a150ed Compare August 9, 2024 13:03
@vrutkovs vrutkovs changed the title API-1802: certrotation: avoid using applySecret to catch racy cert updates OCPBUGS-38335: certrotation: avoid using applySecret to catch racy cert updates Aug 12, 2024
@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Aug 12, 2024
@openshift-ci-robot
Copy link

@vrutkovs: This pull request references Jira Issue OCPBUGS-38335, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.18.0) matches configured target version for branch (4.18.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @wangke19

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

applySecret/applyConfigMap does Get initially, so instead of using the version we got from informer it refreshes it and may overwrite if other changes were done in parallel.

Instead of using applySecret/applyConfigMap we should be invoking Create/Update here directly. If the informer has outdated version its update would be rejected by etcd and sync function would be restarted

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from wangke19 August 12, 2024 08:43
// Deprecated: DO NOT USE, it is intended as a short term hack for a very specific use case,
// and it works in tandem with a particular carry patch applied to the openshift kube-apiserver.
// Use ApplySecret instead.
func ApplySecretDoNotUse(ctx context.Context, client coreclientv1.SecretsGetter, recorder events.Recorder, required *corev1.Secret) (*corev1.Secret, bool, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original PR introduced more changes. Not all were remove. Please open a new PR that reverts #1705.

if !exists {
actualCABundleConfigMap, err = c.Client.ConfigMaps(c.Namespace).Create(ctx, caBundleConfigMap, metav1.CreateOptions{})
} else {
actualCABundleConfigMap, err = c.Client.ConfigMaps(c.Namespace).Update(ctx, caBundleConfigMap, metav1.UpdateOptions{})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, we get an event that its about to be updated, but not the update result. Added this to the latest commit

actualTargetCertKeyPairSecret, _, err := resourceapply.ApplySecret(ctx, c.Client, c.EventRecorder, targetCertKeyPairSecret)
var actualTargetCertKeyPairSecret *corev1.Secret
var action string
if len(targetCertKeyPairSecret.ResourceVersion) == 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if !exists instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

"github.com/openshift/library-go/pkg/operator/events"
)

func emitEvent(recorder events.Recorder, objType string, action string, obj string, err error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this function could just receive a obj runtime.Object and you could figure out objType and obj using resourcehelper.GuessObjectGroupVersionKind(obj); gvk.Kind and resourcehelper.FormatResourceForCLIWithNamespace(obj) respectively.

So you could do:

emitEvent(c.EventRecorder, actualCABundleConfigMap, "Create", err)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. Also using FormatResourceForCLI instead of namespaced version as events are namespaced anyway

actualSigningCertKeyPairSecret, err = c.Client.Secrets(c.Namespace).Update(ctx, signingCertKeyPairSecret, metav1.UpdateOptions{})
action = "Update"
}
emitEvent(c.EventRecorder, "Secret", action, resourcehelper.FormatResourceForCLIWithNamespace(actualSigningCertKeyPairSecret), err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you do https://github.com/openshift/library-go/pull/1772/files#r1714300480, this would get shorter:

var actualSigningCertKeyPairSecret *corev1.Secret
if !exists {
	actualSigningCertKeyPairSecret, err = c.Client.Secrets(c.Namespace).Create(ctx, signingCertKeyPairSecret, metav1.CreateOptions{})
	emitEvent(c.EventRecorder, "Create", actualSigningCertKeyPairSecret, err)
} else {
	actualSigningCertKeyPairSecret, err = c.Client.Secrets(c.Namespace).Update(ctx, signingCertKeyPairSecret, metav1.UpdateOptions{})
	emitEvent(c.EventRecorder, "Update", actualSigningCertKeyPairSecret, err)
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to minimize copying important params I introduced "action" which is the only variable there

@vrutkovs vrutkovs force-pushed the cert-rotation-skip-get-in-apply branch from c702d4d to ce677d9 Compare August 13, 2024 06:27
Copy link
Member

@bertinatto bertinatto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I'm not familiar with these controllers, I did my best to review this, and to the best of my knowledge, these changes appear to be correct.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 13, 2024
"k8s.io/apimachinery/pkg/runtime"
)

func emitEvent(recorder events.Recorder, action string, obj runtime.Object, err error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of this new function I would prefer to reuse reportCreateEvent and reportUpdateEvent.

I think we could open a new PR in which we would:

  1. crate a new helpers pkg in library-go/pkg/operator/resource/helpers
  2. we would move operator/resource/resourceapply/event_helpers.go to the new pkg (with tests)
  3. we would export all report methods.

Then we could remove this file and use the new exported methods directly. WDYT?

@@ -59,6 +59,7 @@ func (c CABundleConfigMap) EnsureConfigMapCABundle(ctx context.Context, signingC
c.AdditionalAnnotations,
)}
modified = true
exists = false
}

needsOwnerUpdate := false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and we could modify line 70 to:

updateRequired = needsOwnerUpdate || needsMetadataUpdate

@vrutkovs vrutkovs force-pushed the cert-rotation-skip-get-in-apply branch from ce677d9 to 34d2c9d Compare August 14, 2024 06:47
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Aug 14, 2024
@@ -95,15 +96,25 @@ func (c RotatedSigningCASecret) EnsureSigningCertKeyPair(ctx context.Context) (*

LabelAsManagedSecret(signingCertKeyPairSecret, CertificateTypeSigner)

modified = true
updateRequired = true
signerUpdated = true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need this var ?

I think it could be replaced by creationRequired || updateRequired, right ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. updateRequired means "any part of the secret has changed", but signerUpdated is specific to certificate part (we need to pass it on to trigger CA bundle / target cert updates)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, I think that the following code:

	// at this point, the secret has the correct signer, so we should read that signer to be able to sign
	signingCertKeyPair, err := crypto.GetCAFromBytes(signingCertKeyPairSecret.Data["tls.crt"], signingCertKeyPairSecret.Data["tls.key"])
	if err != nil {
		return nil, signerUpdated, err
	}

	return signingCertKeyPair, signerUpdated, nil

could be replaced be:

	// at this point, the secret has the correct signer, so we should read that signer to be able to sign
	signingCertKeyPair, err := crypto.GetCAFromBytes(signingCertKeyPairSecret.Data["tls.crt"], signingCertKeyPairSecret.Data["tls.key"])
	if err != nil {
		return nil, creationRequired || updateRequired, err
	}

	return signingCertKeyPair, creationRequired || updateRequired, nil

and we could remove signerUpdated, am I right ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, updateRequired || creationRequired is not equivalent to signerUpdated. We set updateRequired = true when annotations change too.

This PR is already growing too large - can we optimize this in a followup?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, do you also want to move the refactor commit to a new PR ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets leave this version as is and experiment if signerUpdated can be removed in a separate PR.

I think we want more specific messages for signer/cabundle/target too, i.e. Updated contents and Updated metadata probably

@@ -108,8 +110,8 @@ func (c RotatedSelfSignedCertKeySecret) EnsureTargetCertKeyPair(ctx context.Cont
),
Type: corev1.SecretTypeTLS,
}
modified = true
secretDoesntExist = true
updateRequired = true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this needed here ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, right, no longer needed

@p0lyn0mial
Copy link
Contributor

@vrutkovs could you also open a proof PR in kas-o so that we get a CI signal?

@vrutkovs
Copy link
Member Author

@p0lyn0mial
Copy link
Contributor

Opened openshift/cluster-kube-apiserver-operator#1721

the pr was created one week ago and e2e tests haven't been executed. Could you address #1772 (comment) and then pull this pr into 1721 and kick off the tests?

@vrutkovs vrutkovs force-pushed the cert-rotation-skip-get-in-apply branch from 34d2c9d to 0399b35 Compare August 14, 2024 08:38
@p0lyn0mial
Copy link
Contributor

/lgtm

let's wait for openshift/cluster-kube-apiserver-operator#1721 before merging.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 14, 2024
@vrutkovs vrutkovs force-pushed the cert-rotation-skip-get-in-apply branch from 0399b35 to 3ac47e0 Compare August 14, 2024 10:52
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Aug 14, 2024
@vrutkovs vrutkovs force-pushed the cert-rotation-skip-get-in-apply branch from 3ac47e0 to a8b7daa Compare August 14, 2024 12:19
@vrutkovs
Copy link
Member Author

/test unit


if reason := c.CertCreator.NeedNewTargetCertKeyPair(targetCertKeyPairSecret, signingCertKeyPair, caBundleCerts, c.Refresh, c.RefreshOnlyWhenExpired, secretDoesntExist); len(reason) > 0 {
if reason := c.CertCreator.NeedNewTargetCertKeyPair(targetCertKeyPairSecret, signingCertKeyPair, caBundleCerts, c.Refresh, c.RefreshOnlyWhenExpired, creationRequired); len(reason) > 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does c.CertCreator.NeedNewTargetCertKeyPair persist any resource in etcd (does it use Apply* method?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we don't even pass client there. It decides on the contents of the secret whether it needs to be regenerated

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, thanks.

@@ -74,18 +75,18 @@ func (c RotatedSigningCASecret) EnsureSigningCertKeyPair(ctx context.Context) (*
),
Type: corev1.SecretTypeTLS,
}
signerExists = false
creationRequired = true
}

// apply necessary metadata (possibly via delete+recreate) if secret exists
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment is outdated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@vrutkovs vrutkovs force-pushed the cert-rotation-skip-get-in-apply branch from a8b7daa to 3a8d8c5 Compare August 14, 2024 13:12
@p0lyn0mial
Copy link
Contributor

/lgtm

let's wait for openshift/cluster-kube-apiserver-operator#1721 before merging.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 14, 2024
Copy link
Contributor

openshift-ci bot commented Aug 14, 2024

@vrutkovs: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Copy link

@soltysh soltysh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

Copy link
Contributor

openshift-ci bot commented Aug 16, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bertinatto, p0lyn0mial, soltysh, vrutkovs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 16, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit e21e788 into openshift:master Aug 16, 2024
4 checks passed
@openshift-ci-robot
Copy link

@vrutkovs: Jira Issue OCPBUGS-38335: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-38335 has been moved to the MODIFIED state.

In response to this:

applySecret/applyConfigMap does Get initially, so instead of using the version we got from informer it refreshes it and may overwrite if other changes were done in parallel.

Instead of using applySecret/applyConfigMap we should be invoking Create/Update here directly. If the informer has outdated version its update would be rejected by etcd and sync function would be restarted

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants