Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Export crashes through telemetry #285

Merged
merged 19 commits into from
Mar 1, 2024

Conversation

paullegranddc
Copy link
Contributor

What does this PR do?

Crashtracker additions:

  • Add a telemetry uploader for crashes
  • Send part of the crash info though tags, and part of it as the message.
  • Extract necessary telemetry tags from profiler tags

Logs payload addition:

  • Add the "tags" field to telemetry logs to be able to search on fields in crashes
  • Add a "sensitive" flag to telemetry logs to mark crash logs as possibly containing customer data

Additional Notes

Currently, I implemented the telemetry uploader in the same crate as the crashtracker binary, because the uploader has dependencies on the crash info code in in the profiling crate, and the ddtelemetry crate.

If we implement crashtracking in situations where the profiler lib is not running, we might want to move the crash_info code to ddcommon.

For Reviewers

  • If this PR touches code that signs or publishes builds or packages, or handles credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.

@paullegranddc paullegranddc requested review from a team as code owners December 12, 2023 17:32
@github-actions github-actions bot added profiling Relates to the profiling* modules. telemetry labels Dec 12, 2023
ddtelemetry/src/data/payloads.rs Outdated Show resolved Hide resolved
@@ -675,6 +675,8 @@ impl TelemetryWorkerHandle {
message,
level,
stack_trace,
tags: String::new(),
is_sensitive: false,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this default to false? Where would we have is_sensitive: true?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I set it to false because the existing behavior is that telemetry logs are not assumed to contain sensitive data.

profiling/src/crashtracker/crash_info.rs Outdated Show resolved Hide resolved
@@ -36,6 +39,9 @@ pub fn main() -> anyhow::Result<()> {
// TODO Experiment to see if 30 is the right number.
crash_info.upload_to_dd(endpoint, Duration::from_secs(30))?;
}
if let Some(uploader) = telemetry_uploader {
uploader.upload_to_telemetry(&crash_info, Duration::from_secs(30))?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make 30 seconds configurable? Or is that a standard value?

Copy link
Contributor Author

@paullegranddc paullegranddc Dec 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure, there needs to be a timeout, but 30s seems too much for me since we're blocking the parent process until we return, but I think the evp intake the data goes too after being proxied by the agent out can have pretty long outliers from time to time (requests blocking for up to 15s in rare cases).

I put 30s since that's what you used for the profile.

But I don't know if it's a good idea to let users modify it 🤔 Maybe the crashtracker should daemonize itself and do crash submission asynchronously after we receive it, and then we can keep the long timeout.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue which I'm worried about is the pod being killed when the parent process dies, and the crash-tracker info never getting out.

Copy link
Contributor Author

@paullegranddc paullegranddc Dec 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm true, and if the crash tracker daemonizes itself if the crashing process is not PID 1 in the container we might end up with zombie processes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, if we block the process from exiting for a long time we might prevent any orchestrator from restarting it and worsen the impact a crash might have on availability

@@ -25,6 +26,8 @@ pub fn main() -> anyhow::Result<()> {
std::io::stdin().lock().read_line(&mut metadata)?;
let metadata: Metadata = serde_json::from_str(&metadata)?;

let telemetry_uploader = telemetry::TelemetryCrashUploader::new(&metadata, &config).ok();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to do something if there is an error? Or just elide the upload?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say elide the upload. Crashtracking is best effort anyway

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could have a metric for failed uploads. 🐢 all the way down!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit afraid that if the crashtracking fails, metrics wouldn't work although I guess sending a single statsd point wouldn't hurt

@paullegranddc paullegranddc force-pushed the paullgdc/upload_crashes_to_telemetry branch 3 times, most recently from 5fbfe5b to 8135064 Compare January 8, 2024 14:18
bin_tests/src/bin/test_the_tests.rs Outdated Show resolved Hide resolved
bin_tests/src/bin/crashtracker_bin_test.rs Outdated Show resolved Hide resolved
profiling-crashtracking-receiver/src/telemetry.rs Outdated Show resolved Hide resolved
profiling-crashtracking-receiver/src/telemetry.rs Outdated Show resolved Hide resolved
@@ -0,0 +1,175 @@
// Unless explicitly stated otherwise all files in this repository are licensed under the Apache License Version 2.0.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this machinery necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment at the top of the module describing what it does

/// This module implements an abstraction over compilation with cargo with the purpose
/// of testing full binaries or dynamic libraries, instead if just rust static libraries.
///
/// The main entrypoint is fn build_artifacts which takes a list of artifacts to build,
/// either executable crates, cdylib, or extra binaries, invokes cargo and return the path
/// of the built artifact.
///
/// Builds are cached between invocations so that multiple tests can use the same artifact
/// without doing expensive work twice.
///
/// It is assumed that functions in this module are invoked in the context of a cargo #[test]
/// item, or a cargo run command to be able to locate artifacts built by cargo from the position
/// of the current binary.

bin_tests/tests/crashtracker_bin_test.rs Outdated Show resolved Hide resolved
};
let crashtracker_bin = ArtifactsBuild {
name: "crashtracker_bin".to_owned(),
profile: Profile::Debug,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be interesting to do a prop-test covering both debug and release builds?

Copy link
Contributor Author

@paullegranddc paullegranddc Jan 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.
Interestingly, the null pointer dereference gets optimized out in release mode for the binary that's instrumented, and we receive a SIGTRAP instead. I guess it's because it's an undefined behavior

use bin_tests::{build_artifacts, ArtifactType, ArtifactsBuild, Profile};

#[test]
#[cfg_attr(miri, ignore)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no kidding!

@paullegranddc paullegranddc requested review from a team as code owners January 18, 2024 02:49
Base automatically changed from dsn/crash-handler-api to main January 19, 2024 21:58
Crashtracker additions:
* Add a telemetry uploader for crashes
* Send part of the crash info though tags, and part of it as the message.
* Extract necessary telemetry tags from profiler tags

Logs payload addition:
* Add the "tags" field to telemetry logs to be able to search on fields in crashes
* Add a "sensitive" flag to telemetry logs to mark crash logs as possibly containing customer data

Testing:
* Add a way to run binary test
* Add integration test for the crashtracker binary
@paullegranddc paullegranddc force-pushed the paullgdc/upload_crashes_to_telemetry branch from b38fbb9 to 636dc99 Compare January 24, 2024 18:57
@paullegranddc paullegranddc force-pushed the paullgdc/upload_crashes_to_telemetry branch 3 times, most recently from 7eb523e to 98f545b Compare February 16, 2024 18:34
@paullegranddc paullegranddc force-pushed the paullgdc/upload_crashes_to_telemetry branch from 98f545b to 47c7f68 Compare February 16, 2024 19:01
Copy link
Contributor

@danielsn danielsn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good. Should we add a CI step?

profiling/Cargo.toml Outdated Show resolved Hide resolved
bin_tests/src/bin/crashtracker_bin_test.rs Outdated Show resolved Hide resolved
bin_tests/src/bin/crashtracker_bin_test.rs Outdated Show resolved Hide resolved
bin_tests/tests/test_the_tests.rs Outdated Show resolved Hide resolved
@paullegranddc paullegranddc force-pushed the paullgdc/upload_crashes_to_telemetry branch 2 times, most recently from e19088d to f5e2a3f Compare February 19, 2024 16:12
@DataDog DataDog deleted a comment from github-actions bot Feb 19, 2024
@paullegranddc paullegranddc force-pushed the paullgdc/upload_crashes_to_telemetry branch from f5e2a3f to 34ef5c9 Compare February 19, 2024 16:16
@DataDog DataDog deleted a comment from github-actions bot Feb 19, 2024
@paullegranddc paullegranddc force-pushed the paullgdc/upload_crashes_to_telemetry branch from 34ef5c9 to 71a39b1 Compare February 19, 2024 16:17
@paullegranddc paullegranddc force-pushed the paullgdc/upload_crashes_to_telemetry branch from 4cf8a55 to cf9c813 Compare February 19, 2024 16:45
@github-actions github-actions bot removed the profiling Relates to the profiling* modules. label Feb 19, 2024
@paullegranddc paullegranddc force-pushed the paullgdc/upload_crashes_to_telemetry branch from fe5fef9 to 24cd340 Compare February 29, 2024 14:52
@paullegranddc paullegranddc force-pushed the paullgdc/upload_crashes_to_telemetry branch from c6e9924 to 1b61275 Compare February 29, 2024 17:26
bin_tests/tests/crashtracker_bin_test.rs Outdated Show resolved Hide resolved
#[derive(Debug, serde::Serialize)]
/// This struct represents the part of the crash_info that we are sending in the
/// log `message` field as a json
struct TelemetryCrashInfoMessage<'a> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the additional_stacktraces go in here as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer adding it in a followup PR

@paullegranddc paullegranddc merged commit 222d809 into main Mar 1, 2024
16 of 20 checks passed
@bantonsson bantonsson deleted the paullgdc/upload_crashes_to_telemetry branch March 7, 2024 07:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants