Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TraceQL Metrics] New baseline comparison function #3695

Merged
merged 23 commits into from
Jun 21, 2024

Conversation

mdisibio
Copy link
Contributor

@mdisibio mdisibio commented May 21, 2024

What this PR does:
This adds a new metrics function compare which is used to split the stream of spans into two groups: a selection and a baseline. Then it returns time-series for all attributes found on the spans to highlight the differences between the two groups. This is kind of hard to describe so there are some example outputs below:

Function signature:
The function is used like other metrics functions, which it is placed after any search query, and converts it into a metrics query:
...any spanset pipeline... | compare({subset filters}, <topN>, <start timestamp>, <end timestamp>)

Example:
{ resource.service.name="a" && span.http.path="/myapi" } | compare({status=error})

Parameters:

  1. Required. The first parameter is a spanset filter for choosing the subset of spans. This filter is executed against the incoming spans. If it matches, then the span is considered to be part of the selection. Otherwise it is part of the baseline. Common filters are expected to be things like {status=error} (what is different about errors?) or {duration>1s} (what is different about slow spans?)
  2. Optional. The second parameter is the top N values to return per attribute. If an attribute exceeds this limit in either the selection group or baseline group, then only the top N values (based on frequency) are returned, and an error indicator for the attribute is included output (see below). Defaults to 10.
  3. Optional. Start and end timestamps in unix nanoseconds, which can be used to additionally subset the spans in time. These timestamps must both be given, or neither. These parameters are unlike any others in traceql and therefore kind of clunky. Maybe in the future we can fix this by adding the ability to check span:startTime directly in the language, so it could be part of the filter.

Output:
The outputs are flat time-series for each attribute/value found in the spans. This function has a built-in select(*) so there can be a lot. Each series has a label __meta_type which denotes which group it is in, either selection or baseline.

Example output series:

{ __meta_type="baseline", resource.cluster="prod" } 123
{ __meta_type="baseline", resource.cluster="qa" } 124
{ __meta_type="selection", resource.cluster="prod" } 456   <--- significant difference detected
{ __meta_type="selection", resource.cluster="qa" } 125
{ __meta_type="selection", resource.cluster="dev"} 126  <--- cluster=dev was found in the highlighted spans but not in the baseline

When an attribute reaches the cardinality limit there will also be present an error indicator. This example means the attribute resource.cluster had too many values.

{ __meta_error="__too_many_values__", resource.cluster=<nil> }

Remaining Work

  1. Not 100% settled on the meta labels and indication of attributes that reached max cardinality. Would appreciate feedback.
  2. This function has a built-in select(*) to select all attributes of all spans (yes all). So it is likely to exceed gRPC payloads when run as a range query. We don't have official support for instant queries, but you can emulate it by setting step equal to end-start, so effectively it is a range query that returns a single datapoint.

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@adrapereira
Copy link

Instead of requiring max cardinality what if we also make it optional and default to a sensible value?

I'm thinking it would be better to return the values Tempo got until max and an error instead of returning an error and nil when it reaches max cardinality. This way that attribute would still show some value, instead of none. Think of a graph with no data and an error label vs a graph with some data and an error/warning label, what would you prefer?

I agree that the timestamps in the function are ugly, would be more TraceQL-y to have it as span:startTime as you mention.

@mdisibio
Copy link
Contributor Author

Instead of requiring max cardinality what if we also make it optional and default to a sensible value?

Yep can do that. I think 10 is a sensible default.

I'm thinking it would be better to return the values Tempo got until max and an error instead of returning an error and nil when it reaches max cardinality. This way that attribute would still show some value, instead of none. Think of a graph with no data and an error label vs a graph with some data and an error/warning label, what would you prefer?

The main rationale was to avoid computing the exact topN values, which requires continuing to count and pass all values up to the query-frontend. There are two alternatives that are lossy but should be workable:

  1. Lossy topN - each job performs topN, and then the query-frontend performs topN again. This is lossy because low-rate but omnipresent values that might be the actual topN get overshadowed by bursty values.
  2. FirstN - also good performance, but not sure the usefulness.

@adrapereira
Copy link

My proposal was inline with your FirstN idea so either of your options would work for me, but curious about other opinions.

@mdisibio mdisibio marked this pull request as ready for review June 17, 2024 13:17
@mdisibio mdisibio requested a review from stoewer as a code owner June 17, 2024 13:17
pkg/traceql/engine_metrics_compare.go Show resolved Hide resolved
pkg/traceql/engine_metrics_compare.go Outdated Show resolved Hide resolved
pkg/traceql/engine_metrics_compare.go Show resolved Hide resolved
Copy link
Member

@mapno mapno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mdisibio mdisibio merged commit 6b2c0b1 into grafana:main Jun 21, 2024
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants