Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add diagnostics callback for tlscommon configs and httptransport #207

Merged
merged 10 commits into from
Jun 10, 2024

Conversation

michel-laterman
Copy link
Contributor

@michel-laterman michel-laterman commented May 28, 2024

What does this PR do?

Add diagnostics callbacks for TLS configuration to make it easier to
include in diagnostics bundles. Diagnostics info will geneally verify
that the certs or cas are able to load, and display some basic info such
as cert names, expiry times, fingerprints, etc. httptransport callback
will attempt an HTTP request and return information collected from
httptrace about the request.

Why is it important?

Allows more information about TLS to be present in diagnostics to help troubleshoot any certificate issues.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have added tests that prove my fix is effective or that my feature works

@michel-laterman michel-laterman added enhancement New feature or request Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels May 28, 2024
@michel-laterman michel-laterman changed the title Add diagnostics callback for tlscommon configs Add diagnostics callback for tlscommon configs and httptransport May 28, 2024
Add diagnostics callbacks for TLS configuration to make it easier to
include in diagnostics bundles. Diagnostics info will geneally verify
that the certs or cas are able to load, and display some basic info such
as cert names, expiry times, fingerprints, etc. httptransport callback
will attempt an HTTP request and return information collected from
httptrace about the request.
@michel-laterman
Copy link
Contributor Author

Example of the hooks functions and outputs added to a (draft) fleet-server pr: elastic/fleet-server#3587

Copy link
Contributor Author

@michel-laterman michel-laterman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like feedback, specifically if we need more information to help troubleshoot issues.

transport/httpcommon/diag.go Show resolved Hide resolved
diagCertificate(logger, &c.Certificate)
diagCAs(logger, c.CAs)

return b.Bytes()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the fleet-server demo, output is:

tlscommon.ServerConfig: Start diagnostics 2024-05-29 00:18:41.569890423 +0000 UTC
tlscommon.ServerConfig: verification_mode=full
tlscommon.ServerConfig: client_auth=none
tlscommon.ServerConfig: ca_sha256=[]
tlscommon.ServerConfig: CertificateSettings: checking certificate keypair
tlscommon.ServerConfig: CertificateSettings: certificate keypair OK.
tlscommon.ServerConfig: CertificateSettings: cert 0 - Subject: CN=fleet-server-dev,O=elastic-fleet
	Issuer: CN=localhost,O=elastic-fleet
	NotBefore: 2024-05-29 00:18:15 +0000 UTC
	NotAfter: 2034-05-29 00:18:15 +0000 UTC
	Fingerprint: EvAr8erNJZJ0BoXor1jnCl5X2WCU79ISMgwMITOEJBs=
	SAN IP: []
	SAN DNS: [fleet-server-dev]
tlscommon.ServerConfig: CertificateAuthorities: certificate_authorities provided.
tlscommon.ServerConfig: CertificateAuthorities: - cert 0 Subject: CN=localhost,O=elastic-fleet
	IsCa: true
	BasicConstraintsValid: true
	NotBefore: 2024-05-29 00:18:15 +0000 UTC
	NotAfter: 2034-05-29 00:18:15 +0000 UTC
	Fingerprint: cSawoNEa2cSPV4Mwce4mRysrcBGtHBGcSeoBL+Cr48Q=

We want to make sure that certificate & ca information is provided in the diagnostics

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slight update after I added CertDiagString

tlscommon.ServerConfig: Start diagnostics 2024-05-29 22:10:07.333735518 +0000 UTC
tlscommon.ServerConfig: verification_mode=full
tlscommon.ServerConfig: client_auth=none
tlscommon.ServerConfig: ca_sha256=[]
tlscommon.ServerConfig: CertificateSettings: checking certificate keypair
tlscommon.ServerConfig: CertificateSettings: certificate keypair OK.
tlscommon.ServerConfig: CertificateSettings: cert 0 Subject=CN=fleet-server-dev,O=elastic-fleet
	Issuer=CN=localhost,O=elastic-fleet
	IsCA=false
	BasicConstrintsValid=false
	NotBefore=2024-05-29 22:08:32 +0000 UTC
	NotAfter=2034-05-29 22:08:32 +0000 UTC
	Fingerprint=w04CVUAd7Tfnd/AZ/A8YB9bCUAImGeGXBAUPnhEvGgM=
	SAN IP=[]
	SAN DNS=[fleet-server-dev]
	SAN URI=[]
tlscommon.ServerConfig: CertificateAuthorities: certificate_authorities provided.
tlscommon.ServerConfig: CertificateAuthorities: - cert 0 Subject=CN=localhost,O=elastic-fleet
	Issuer=CN=localhost,O=elastic-fleet
	IsCA=true
	BasicConstrintsValid=true
	NotBefore=2024-05-29 22:08:23 +0000 UTC
	NotAfter=2034-05-29 22:08:23 +0000 UTC
	Fingerprint=QHVJrVpzEHBJTTr6F9RW6TnYbj9ckPD2BQ5JphTzh7M=
	SAN IP=[]
	SAN DNS=[localhost]
	SAN URI=[]

@michel-laterman michel-laterman marked this pull request as ready for review May 29, 2024 18:54
@michel-laterman michel-laterman requested a review from a team as a code owner May 29, 2024 18:54
@michel-laterman michel-laterman requested review from belimawr and rdner and removed request for a team May 29, 2024 18:54
@leehinman
Copy link
Contributor

I would like feedback, specifically if we need more information to help troubleshoot issues.

This may or may not be helpful, so feel free to disregard.

I'd like to see some failure scenarios, paired with the new diagnostics showing how they give the info needed to correctly diagnose the problem. I'm thinking of things like:

  1. User configures NonTLS client but connects to TLS server
  2. User configures TLS client but connects to NonTLS server
  3. User configures TLS client, no client cert, private key but connects to TLS server that requires TLS client authentication
  4. User configures TLS client but connects to server that has expired certificate
  5. User configures TLS client but doesn't have CA that signed TLS server cert in it's trust store
  6. User configures TLS client but server requires an algorithm that isn't supported by the client
  7. User configures TLS client but connects to Proxy that doesn't have the server name as a SAN

Add some error parsing to HTTPSettings.DiagRequests to attempt to give a
human readable statement about the cause of any errors when gathering
request diagnostics.
Comment on lines +134 to +136
// isGoHTTPResp detects if the response is one that a go http.Server sends if an HTTP request is made to an HTTPS server.
// non Go servers may return a net.OpError instead.
func isGoHTTPResp(r *http.Response) bool {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This behaviour is a part of go servers, and also occurs for our current cloud deplyments

Comment on lines +147 to +149
// diagError tries to diagnose the error and return a cause/possible cause in a human readable format.
// If no matching errors are found err.Error is returned.
func diagError(err error) string {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@leehinman, I've tried to add this to explain errors based on the scenarios you provided (with test cases for most of them). we can add more in the future as well

transport/httpcommon/diag.go Show resolved Hide resolved
transport/httpcommon/diag_test.go Show resolved Hide resolved
transport/tlscommon/diag.go Outdated Show resolved Hide resolved
Co-authored-by: Tiago Queiroz <me@tiago.life>
@michel-laterman
Copy link
Contributor Author

@lucabelluccini is there anything I should add/change to help with your troubleshooting efforts?

Copy link
Contributor

@leehinman leehinman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for adding the test cases.

@lucabelluccini
Copy link

Hello, thanks for the heads up on this one and thank you for those efforts into troubleshooting TLS.
I have few questions:

  1. We will have 1 TLS diagnostic per output (e.g. the integrations output, the monitoring output, fleet server connection)? How will it work?
  2. We will have a documentation section to cover some example failures a user might encounter?

Example about (1).
Imagine a user has 1 Elastic Agent.
The Elastic Agent also runs the System integration.
The EA sends logs & metrics to Elasticsearch.

So we have:

  • connection to ES for monitoring (logs & metrics of EA)
  • connection to ES for integrations (system integration)
  • connection to Fleet Server (Fleet Server enroll/checkins)

We will get 3 TLS diags?

@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

History

@michel-laterman
Copy link
Contributor Author

@lucabelluccini

  1. yes, we'll have a request trace file per destination type; these will need to be manually added in future PRs. The implemenation issue for the elastic-agent is: Include TLS information in diagnostics bundle elastic-agent#4880.
    just note the fleet-server implementation: Add tlscommon and httpcommon diagnostics hooks fleet-server#3587 currently only tests the connection to ES; it does not check for monitoring output; but I can add that check in the pr
  2. I added diagError to try to turn an error that occurs in the connection into a human-readable cause such as %v: possible cause: HTTP schema used for HTTPS server., or %v: caused by no trusted client CA. but we can make a KB article to help track the causes of (connection) failures

@michel-laterman michel-laterman merged commit 89e2295 into elastic:main Jun 10, 2024
5 of 6 checks passed
@michel-laterman michel-laterman deleted the tlsdiag branch June 10, 2024 14:45
@lucabelluccini
Copy link

Lovely - Thank you @michel-laterman & team ❤️

michel-laterman added a commit to elastic/fleet-server that referenced this pull request Jun 24, 2024
Add custom hooks to use diag hooks added in elastic/elastic-agent-libs#207 to provide additional files that contain information about the TLS certs used by the server's API, TLS infomation used when connecting to elasticsearch, and a full trace to each specified elasticsearch host.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants