Export crashes through telemetry #285
Conversation
@@ -675,6 +675,8 @@ impl TelemetryWorkerHandle {
     message,
     level,
     stack_trace,
+    tags: String::new(),
+    is_sensitive: false,
Should this default to false? Where would we have is_sensitive: true?
I set it to false because the existing behavior is that telemetry logs are not assumed to contain sensitive data.
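For context, a hypothetical illustration (not the actual ddtelemetry types) of where the two new fields land: ordinary telemetry logs keep the defaults shown in the diff, while a crash report would fill tags and opt into is_sensitive: true.

#[derive(Debug)]
struct LogEntry {
    message: String,
    level: String,
    stack_trace: Option<String>,
    // Comma-separated key:value pairs so crash logs can be searched on.
    tags: String,
    // Marks logs that may contain customer data, such as crash reports.
    is_sensitive: bool,
}

fn regular_log(message: String, level: String) -> LogEntry {
    // Existing behavior: ordinary telemetry logs are not assumed to be sensitive.
    LogEntry { message, level, stack_trace: None, tags: String::new(), is_sensitive: false }
}

fn crash_log(message: String, stack_trace: String, tags: String) -> LogEntry {
    // Crash reports may embed customer data, so they set the flag explicitly.
    LogEntry { message, level: "ERROR".to_owned(), stack_trace: Some(stack_trace), tags, is_sensitive: true }
}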
@@ -36,6 +39,9 @@ pub fn main() -> anyhow::Result<()> {
         // TODO Experiment to see if 30 is the right number.
         crash_info.upload_to_dd(endpoint, Duration::from_secs(30))?;
     }
+    if let Some(uploader) = telemetry_uploader {
+        uploader.upload_to_telemetry(&crash_info, Duration::from_secs(30))?;
Should we make 30 seconds configurable? Or is that a standard value?
I'm not sure. There needs to be a timeout, but 30s seems too long to me since we're blocking the parent process until we return. On the other hand, the EVP intake the data goes to (after being proxied by the agent) can have pretty long outliers from time to time (requests blocking for up to 15s in rare cases).
I put 30s since that's what you used for the profile.
But I don't know if it's a good idea to let users modify it 🤔 Maybe the crashtracker should daemonize itself and do crash submission asynchronously after we receive it, and then we can keep the long timeout.
The issue I'm worried about is the pod being killed when the parent process dies, and the crashtracker info never getting out.
Hmm true, and if the crashtracker daemonizes itself while the crashing process is not PID 1 in the container, we might end up with zombie processes.
Also, if we block the process from exiting for a long time, we might prevent any orchestrator from restarting it and worsen the impact a crash might have on availability.
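If configurability is ever wanted, a minimal sketch of what it could look like while keeping 30s as the default; the env var name is hypothetical and not something this PR introduces.

use std::time::Duration;

fn upload_timeout() -> Duration {
    // Hypothetical override; falls back to the 30s currently hardcoded.
    std::env::var("DD_CRASHTRACKER_UPLOAD_TIMEOUT_MS")
        .ok()
        .and_then(|v| v.parse::<u64>().ok())
        .map(Duration::from_millis)
        .unwrap_or(Duration::from_secs(30))
}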
@@ -25,6 +26,8 @@ pub fn main() -> anyhow::Result<()> {
     std::io::stdin().lock().read_line(&mut metadata)?;
     let metadata: Metadata = serde_json::from_str(&metadata)?;
+
+    let telemetry_uploader = telemetry::TelemetryCrashUploader::new(&metadata, &config).ok();
Do we want to do something if there is an error? Or just elide the upload?
I'd say elide the upload. Crashtracking is best effort anyway
We could have a metric for failed uploads. 🐢 all the way down!
I'm a bit afraid that if crashtracking fails, metrics wouldn't work either, although I guess sending a single statsd point wouldn't hurt.
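To make the "best effort" idea concrete, a self-contained sketch of the pattern discussed here, with a stand-in type rather than the real TelemetryCrashUploader: construction failures turn into None via .ok(), and upload failures are recorded instead of propagated.

struct Uploader;

impl Uploader {
    fn new(config_ok: bool) -> Result<Self, String> {
        if config_ok { Ok(Uploader) } else { Err("bad config".to_owned()) }
    }

    fn upload(&self, payload: &str) -> Result<(), String> {
        println!("uploading: {payload}");
        Ok(())
    }
}

fn main() {
    // Construction is best effort: on error, the telemetry upload is simply elided.
    let uploader = Uploader::new(true).ok();
    if let Some(uploader) = uploader {
        if let Err(err) = uploader.upload("crash report") {
            // Crashtracking is best effort; note the failure (or bump a metric) and move on.
            eprintln!("telemetry upload failed: {err}");
        }
    }
}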
@@ -0,0 +1,175 @@
+// Unless explicitly stated otherwise all files in this repository are licensed under the Apache License Version 2.0.
Why is this machinery necessary?
Added a comment at the top of the module describing what it does
/// This module implements an abstraction over compilation with cargo with the purpose
/// of testing full binaries or dynamic libraries, instead of just Rust static libraries.
///
/// The main entrypoint is `fn build_artifacts`, which takes a list of artifacts to build,
/// either executable crates, cdylibs, or extra binaries, invokes cargo, and returns the path
/// of the built artifact.
///
/// Builds are cached between invocations so that multiple tests can use the same artifact
/// without doing expensive work twice.
///
/// It is assumed that functions in this module are invoked in the context of a cargo `#[test]`
/// item, or a `cargo run` command, to be able to locate artifacts built by cargo from the position
/// of the current binary.
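A rough usage sketch of the API described by this doc comment; the exact fields of ArtifactsBuild and the return type of build_artifacts are assumptions here (the sketch treats the result as a map from artifact to built path), so the real test snippets below are authoritative.

use bin_tests::{build_artifacts, ArtifactType, ArtifactsBuild, Profile};

#[test]
#[cfg_attr(miri, ignore)] // building and spawning real binaries cannot run under miri
fn crashtracker_bin_builds() {
    let crashtracker_bin = ArtifactsBuild {
        name: "crashtracker_bin".to_owned(),
        profile: Profile::Debug,
        // Hypothetical field: how the artifact kind (executable vs cdylib) is chosen.
        artifact_type: ArtifactType::Bin,
    };
    // Invokes cargo (or reuses a cached build) and returns where the binary ended up.
    let built = build_artifacts(&[&crashtracker_bin]).unwrap();
    assert!(built[&crashtracker_bin].exists());
}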
    };
    let crashtracker_bin = ArtifactsBuild {
        name: "crashtracker_bin".to_owned(),
        profile: Profile::Debug,
Might be interesting to do a prop-test covering both debug and release builds?
Added.
Interestingly, the null pointer dereference gets optimized out in release mode for the binary that's instrumented, and we receive a SIGTRAP instead. I guess it's because it's undefined behavior.
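For what it's worth, a small sketch (not from this PR) of why the crash vanishes in release mode and how a test binary can fault deterministically: dereferencing a null pointer is undefined behavior, so the optimizer is free to fold it into a trap instruction (the SIGTRAP observed), whereas a volatile read keeps the faulting load in the binary.

fn crash_with_segfault() -> ! {
    let p: *const u32 = std::ptr::null();
    unsafe {
        // Still formally UB, but read_volatile is not optimized away, so the load
        // survives release builds and the process reliably dies with SIGSEGV.
        let _ = std::ptr::read_volatile(p);
    }
    unreachable!("the volatile null read above must fault");
}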
use bin_tests::{build_artifacts, ArtifactType, ArtifactsBuild, Profile};

#[test]
#[cfg_attr(miri, ignore)]
no kidding!
Crashtracker additions:
* Add a telemetry uploader for crashes
* Send part of the crash info through tags, and part of it as the message
* Extract necessary telemetry tags from profiler tags

Logs payload addition:
* Add the "tags" field to telemetry logs to be able to search on fields in crashes
* Add a "sensitive" flag to telemetry logs to mark crash logs as possibly containing customer data

Testing:
* Add a way to run binary tests
* Add an integration test for the crashtracker binary
Looks pretty good. Should we add a CI step?
#[derive(Debug, serde::Serialize)]
/// This struct represents the part of the crash_info that we are sending in the
/// log `message` field as a json
struct TelemetryCrashInfoMessage<'a> {
Should the additional_stacktraces go in here as well?
I'd prefer adding it in a followup PR
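A minimal sketch of how a borrowed struct like this can be serialized into the single JSON string that ends up in the log message field; the field names are illustrative, not the exact set this PR sends.

use std::collections::HashMap;

#[derive(Debug, serde::Serialize)]
struct CrashMessageSketch<'a> {
    // Signal information captured by the crash handler (illustrative field).
    siginfo: &'a str,
    // Extra files collected alongside the crash, e.g. /proc/self/maps (illustrative field).
    files: &'a HashMap<String, Vec<String>>,
}

fn to_log_message(msg: &CrashMessageSketch) -> serde_json::Result<String> {
    // The whole struct becomes one JSON string carried in the telemetry log `message`.
    serde_json::to_string(msg)
}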
What does this PR do?
Crashtracker additions:
* Add a telemetry uploader for crashes
* Send part of the crash info through tags, and part of it as the message
* Extract necessary telemetry tags from profiler tags
Logs payload addition:
* Add the "tags" field to telemetry logs to be able to search on fields in crashes
* Add a "sensitive" flag to telemetry logs to mark crash logs as possibly containing customer data
Additional Notes
Currently, I implemented the telemetry uploader in the same crate as the crashtracker binary, because the uploader depends on the crash info code in the profiling crate and on the ddtelemetry crate.
If we implement crashtracking in situations where the profiler lib is not running, we might want to move the crash_info code to ddcommon.
For Reviewers
@DataDog/security-design-and-guidance