-
Notifications
You must be signed in to change notification settings - Fork 782
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support Vault certificates with no lease #1106
Conversation
How many certificates are you expecting? Probably your best bet is to just enable leases unless you have huge numbers of certs. |
It looks like we currently have on the order of 20k active certificates, mostly with 7 day TTLs. We use consul-template to provision certs for K8S pods, and frequently start and stop pods in response to demand, so the number of certificates climbs quickly. Enabling leases for these certificates leads to long startup times for Vault, as mentioned in the documentation, on the order of 5-10 minutes, and what seems like excessive memory usage, currently about 6 Gigabytes. In addition, Vault frequently fails to revoke expired leases due to size constraints on the CRL, and calls to the PKI Tidy endpoint take tens of hours to complete, and don't seem to affect the memory or startup profile. We attributed these symptoms to certificate leases, but if this seems unlikely, then we can probably close this PR and move the conversation to an issue on the Vault repo. |
Vault doesn't have any startup delay due to leases since about 0.8.3, the documentation is out of date at this point. If tidy is taking 10+ hours to analyze 20k entries then your storage seems suspect. It should be a few minutes. |
Also expired leases shouldn't go onto the CRL so long as the reason for expiration is ttl. Are you manually revoking these leases? |
Thanks for the suggestions. We're using a DynamoDB backend, which could be the culprit when it comes to the abysmal tidy speed. Whenever Vault starts, it prints out several hundred errors regarding CRL size, and is unresponsive for several minutes until the rate of these errors decreases. I do not believe we're revoking certificates, unless consul-template does something of this sort when it shuts down. |
I have heard that DynamoDB is very, very, very slow w.r.t. Vault. Especially when Vault starts up or a node takes over active duty, it does make a lot of I/O as it reads leases, and you may well be getting throttled. This could also be the issue with tidy taking a long time. What version of Vault? |
We're currently running v0.10.1 |
With that version there shouldn't be correlation with those messages you're seeing (which you're seeing because leases are attempting to be revoked and failing due to CRL size -- you can use revoke-force to ignore the error if you want) and time until requests can be serviced, since leases are loaded lazily. So you're probably just incredibly I/O bound. |
That makes sense. Thanks again for all your help. We were trying to avoid running Consul, which I know is the recommended backend for Vault, since it's one more system for our team to manage. We're looking to use a managed service (hence DynamoDB) in order to avoid this. Looking at the supported storage backends, Google Cloud Storage/Spanner, MySQL/PostgreSQL, and S3 would fit that requirement. In your experience, do any of these offer better IO performance than DynamoDB? |
Honestly, I wouldn't know. They're all community supported so we don't use/test them. |
I think this may be related to an issue we just had this week. Our CRL list had grown so large and was prohibiting us from issuing new certs. We had tried to tidy it and the tidy appeared to do nothing. It was actually doing something but because our TTLs were still valid, tiday was not finding them as candidates and cleaning them. Our "fix" was to disable leases but now we are stuck with the 5 minute Cycle as well. Our current action is reenabling the leases. We have patched the pki_tidy command to clear all entries that have been revoked, so that we are at least able to manage our CRL list using vault tooling (opposed to just deleting stuff from our backend :p ) . |
Hey @dacohen, I'm taking over as maintainer of consul-template here at hashi and am curious if you are still having the issues that drove this PR? A good amount of time has passed and Vault has put out several releases that might have impacted your case. In short I'm curious if this feature/PR is still desired. Thanks. |
Hey @dacohen, since you submitted your pull request we've implemented a policy of requiring a signed CLA before merging. Would you please consider signing the CLA? Thanks. |
@eikenb, as far as I know the issues addressed by this PR still exist in current versions of Vault. The underlying issue is that consul-template won't attempt to renew a secret unless it has a lease duration associated with it. However, this requires Vault to keep track of leases for every active certificate. In recent versions, Vault does a better job of handling large numbers of leases, but it would still be convenient for Vault to not have to issue a lease, while still supporting renewal. Let me know if this answers your question. Thanks, Daniel |
@dacohen, thanks for the additional information. I'll want to check with the vault devs to make sure that this would compromise security in any way before thinking about working on merging it. But I'll try to at least kick that off... @hashicorp/vault, in case you are listening... what do you think about this? |
There are no security issues and this PR seems reasonable even with Vault not having the startup time issue as there seems to be other reasons to disable leases as @dmicanzerofox seems to indicate. So I'm thinking this is good to go. |
According to the Vault documentation, when issuing a large number of certificates, it is best practice to not generate a lease for the certificate:
However, consul-template uses lease durations to determine how often to render a template. If there is no lease associated with a secret, it defaults to 5 minutes, and renews approximately every minute and a half, which is too frequently for most certificates. Thus, it is not possible to follow Vault recommendations for large number of certs when using consul-template.
This PR is an attempt to rectify this. If the returned secret is a PKI-issued cert, then we can infer what the lease duration would have been by decoding the certificate data itself, removing the requirement for Vault to track the leases of all active certificates, and generate CRL entries for them.
This is my attempt at getting around the mismatch between consul-template functionality and Vault recommendations. If there is a better way of doing this, I'd love to start a conversation.