Export crashes through telemetry #285
Conversation
@@ -675,6 +675,8 @@ impl TelemetryWorkerHandle {
     message,
     level,
     stack_trace,
+    tags: String::new(),
+    is_sensitive: false,
Should this default to false? Where would we have is_sensitive: true?
I set it to false because the existing behavior is that telemetry logs are not assumed to contain sensitive data.
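For context, a hypothetical illustration (not the actual ddtelemetry types) of where the two new fields land: ordinary telemetry logs keep the defaults shown in the diff, while a crash report would fill tags and opt into is_sensitive: true.

#[derive(Debug)]
struct LogEntry {
    message: String,
    level: String,
    stack_trace: Option<String>,
    // Comma-separated key:value pairs so crash logs can be searched on.
    tags: String,
    // Marks logs that may contain customer data, such as crash reports.
    is_sensitive: bool,
}

fn regular_log(message: String, level: String) -> LogEntry {
    // Existing behavior: ordinary telemetry logs are not assumed to be sensitive.
    LogEntry { message, level, stack_trace: None, tags: String::new(), is_sensitive: false }
}

fn crash_log(message: String, stack_trace: String, tags: String) -> LogEntry {
    // Crash reports may embed customer data, so they set the flag explicitly.
    LogEntry { message, level: "ERROR".to_owned(), stack_trace: Some(stack_trace), tags, is_sensitive: true }
}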
@@ -36,6 +39,9 @@ pub fn main() -> anyhow::Result<()> {
         // TODO Experiment to see if 30 is the right number.
         crash_info.upload_to_dd(endpoint, Duration::from_secs(30))?;
     }
+    if let Some(uploader) = telemetry_uploader {
+        uploader.upload_to_telemetry(&crash_info, Duration::from_secs(30))?;
Should we make 30 seconds configurable? Or is that a standard value?
I'm not sure. There needs to be a timeout, but 30s seems too long to me since we're blocking the parent process until we return. On the other hand, the EVP intake the data goes to (after being proxied by the agent) can have pretty long outliers from time to time (requests blocking for up to 15s in rare cases).
I put 30s since that's what you used for the profile.
But I don't know if it's a good idea to let users modify it 🤔 Maybe the crashtracker should daemonize itself and do crash submission asynchronously after we receive it, and then we can keep the long timeout.
The issue I'm worried about is the pod being killed when the parent process dies, and the crashtracker info never getting out.
Hmm true, and if the crashtracker daemonizes itself while the crashing process is not PID 1 in the container, we might end up with zombie processes.
Also, if we block the process from exiting for a long time, we might prevent any orchestrator from restarting it and worsen the impact a crash might have on availability.
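If configurability is ever wanted, a minimal sketch of what it could look like while keeping 30s as the default; the env var name is hypothetical and not something this PR introduces.

use std::time::Duration;

fn upload_timeout() -> Duration {
    // Hypothetical override; falls back to the 30s currently hardcoded.
    std::env::var("DD_CRASHTRACKER_UPLOAD_TIMEOUT_MS")
        .ok()
        .and_then(|v| v.parse::<u64>().ok())
        .map(Duration::from_millis)
        .unwrap_or(Duration::from_secs(30))
}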
@@ -25,6 +26,8 @@ pub fn main() -> anyhow::Result<()> {
     std::io::stdin().lock().read_line(&mut metadata)?;
     let metadata: Metadata = serde_json::from_str(&metadata)?;
+
+    let telemetry_uploader = telemetry::TelemetryCrashUploader::new(&metadata, &config).ok();
Do we want to do something if there is an error? Or just elide the upload?
I'd say elide the upload. Crashtracking is best effort anyway
We could have a metric for failed uploads. 🐢 all the way down!
I'm a bit afraid that if crashtracking fails, metrics wouldn't work either, although I guess sending a single statsd point wouldn't hurt.
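To make the "best effort" idea concrete, a self-contained sketch of the pattern discussed here, with a stand-in type rather than the real TelemetryCrashUploader: construction failures turn into None via .ok(), and upload failures are recorded instead of propagated.

struct Uploader;

impl Uploader {
    fn new(config_ok: bool) -> Result<Self, String> {
        if config_ok { Ok(Uploader) } else { Err("bad config".to_owned()) }
    }

    fn upload(&self, payload: &str) -> Result<(), String> {
        println!("uploading: {payload}");
        Ok(())
    }
}

fn main() {
    // Construction is best effort: on error, the telemetry upload is simply elided.
    let uploader = Uploader::new(true).ok();
    if let Some(uploader) = uploader {
        if let Err(err) = uploader.upload("crash report") {
            // Crashtracking is best effort; note the failure (or bump a metric) and move on.
            eprintln!("telemetry upload failed: {err}");
        }
    }
}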
@@ -0,0 +1,175 @@
+// Unless explicitly stated otherwise all files in this repository are licensed under the Apache License Version 2.0.
Why is this machinery necessary?
Added a comment at the top of the module describing what it does
/// This module implements an abstraction over compilation with cargo with the purpose
/// of testing full binaries or dynamic libraries, instead of just Rust static libraries.
///
/// The main entrypoint is `fn build_artifacts`, which takes a list of artifacts to build,
/// either executable crates, cdylibs, or extra binaries, invokes cargo, and returns the path
/// of the built artifact.
///
/// Builds are cached between invocations so that multiple tests can use the same artifact
/// without doing expensive work twice.
///
/// It is assumed that functions in this module are invoked in the context of a cargo `#[test]`
/// item, or a `cargo run` command, to be able to locate artifacts built by cargo from the position
/// of the current binary.
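A rough usage sketch of the API described by this doc comment; the exact fields of ArtifactsBuild and the return type of build_artifacts are assumptions here (the sketch treats the result as a map from artifact to built path), so the real test snippets below are authoritative.

use bin_tests::{build_artifacts, ArtifactType, ArtifactsBuild, Profile};

#[test]
#[cfg_attr(miri, ignore)] // building and spawning real binaries cannot run under miri
fn crashtracker_bin_builds() {
    let crashtracker_bin = ArtifactsBuild {
        name: "crashtracker_bin".to_owned(),
        profile: Profile::Debug,
        // Hypothetical field: how the artifact kind (executable vs cdylib) is chosen.
        artifact_type: ArtifactType::Bin,
    };
    // Invokes cargo (or reuses a cached build) and returns where the binary ended up.
    let built = build_artifacts(&[&crashtracker_bin]).unwrap();
    assert!(built[&crashtracker_bin].exists());
}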
    };
    let crashtracker_bin = ArtifactsBuild {
        name: "crashtracker_bin".to_owned(),
        profile: Profile::Debug,
Might be interesting to do a prop-test covering both debug and release builds?
Added.
Interestingly, the null pointer dereference gets optimized out in release mode for the binary that's instrumented, and we receive a SIGTRAP instead. I guess it's because it's undefined behavior.
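For what it's worth, a small sketch (not from this PR) of why the crash vanishes in release mode and how a test binary can fault deterministically: dereferencing a null pointer is undefined behavior, so the optimizer is free to fold it into a trap instruction (the SIGTRAP observed), whereas a volatile read keeps the faulting load in the binary.

fn crash_with_segfault() -> ! {
    let p: *const u32 = std::ptr::null();
    unsafe {
        // Still formally UB, but read_volatile is not optimized away, so the load
        // survives release builds and the process reliably dies with SIGSEGV.
        let _ = std::ptr::read_volatile(p);
    }
    unreachable!("the volatile null read above must fault");
}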
use bin_tests::{build_artifacts, ArtifactType, ArtifactsBuild, Profile};

#[test]
#[cfg_attr(miri, ignore)]
no kidding!
Crashtracker additions:
* Add a telemetry uploader for crashes
* Send part of the crash info through tags, and part of it as the message
* Extract necessary telemetry tags from profiler tags

Logs payload addition:
* Add the "tags" field to telemetry logs to be able to search on fields in crashes
* Add a "sensitive" flag to telemetry logs to mark crash logs as possibly containing customer data

Testing:
* Add a way to run binary tests
* Add an integration test for the crashtracker binary
Looks pretty good. Should we add a CI step?
#[derive(Debug, serde::Serialize)]
/// This struct represents the part of the crash_info that we are sending in the
/// log `message` field as a json
struct TelemetryCrashInfoMessage<'a> {
Should the additional_stacktraces go in here as well?
I'd prefer adding it in a followup PR
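A minimal sketch of how a borrowed struct like this can be serialized into the single JSON string that ends up in the log message field; the field names are illustrative, not the exact set this PR sends.

use std::collections::HashMap;

#[derive(Debug, serde::Serialize)]
struct CrashMessageSketch<'a> {
    // Signal information captured by the crash handler (illustrative field).
    siginfo: &'a str,
    // Extra files collected alongside the crash, e.g. /proc/self/maps (illustrative field).
    files: &'a HashMap<String, Vec<String>>,
}

fn to_log_message(msg: &CrashMessageSketch) -> serde_json::Result<String> {
    // The whole struct becomes one JSON string carried in the telemetry log `message`.
    serde_json::to_string(msg)
}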
What does this PR do?
Crashtracker additions:
* Add a telemetry uploader for crashes
* Send part of the crash info through tags, and part of it as the message
* Extract necessary telemetry tags from profiler tags
Logs payload addition:
* Add the "tags" field to telemetry logs to be able to search on fields in crashes
* Add a "sensitive" flag to telemetry logs to mark crash logs as possibly containing customer data
Additional Notes
Currently, I implemented the telemetry uploader in the same crate as the crashtracker binary, because the uploader depends on the crash info code in the profiling crate and on the ddtelemetry crate.
If we implement crashtracking in situations where the profiler lib is not running, we might want to move the crash_info code to ddcommon.
For Reviewers
@DataDog/security-design-and-guidance