Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add boolean transaction.error flag to transaction documents #281

Closed
cauemarcondes opened this issue Jun 15, 2020 · 26 comments
Closed

Add boolean transaction.error flag to transaction documents #281

cauemarcondes opened this issue Jun 15, 2020 · 26 comments

Comments

@cauemarcondes
Copy link

cauemarcondes commented Jun 15, 2020

Needed by: 7.9

Summary
Transaction documents should have a boolean flag error which indicates whether the transaction was erroneous or not.

Background
In APM UI we want to add a chart displaying the error rate. Currently we are calculating the rate as the ratio of errors to transactions in a given time range. But since a transaction can have multiple errors and we can have errors outside of transactions we can not calculate it like that.

Solution
A solution would be having a flag on the transaction informing if that transaction is erroneous. And to calculate the Error rate we'd do: total_number_of_transactions_with_errors / total_number_of_transactions in a given time range.

@cauemarcondes cauemarcondes changed the title Adds an error flag on transaction when an error happened Add an error flag on transaction when an error happened Jun 15, 2020
@cauemarcondes cauemarcondes changed the title Add an error flag on transaction when an error happened Add Error flag on transaction when an error happened Jun 15, 2020
@felixbarny
Copy link
Member

There are other use cases that would benefit from an error flag, too. For example #271 (probably requires an error flag for spans) and https://discuss.elastic.co/t/finding-errors-related-to-transactions/233667. The error rate could also be a useful addition to the transactions overview table (see also elastic/kibana#65619 (comment)).

Field definition
The existing ECS error fields don't really apply here.

Definition of error
Another open question is how to define the semantics of the flag. Is a 500 transaction without an error document an error? Is a 404 an error? Is a 200 with an error document an error?

My opinion on that is that a caught exception should not set the error flag to true and that a 500 transaction should. For 404s, I'm torn. They can be very noisy but a sudden increase in 404s could be a valuable signal, so I'm leaning towards counting that as an error, too. The UI could allow for filtering out specific errors if the user does not want to see them but that's not top priority, I think.

In the context of the error rate chart for the error page (elastic/kibana#65619) it might be weird to count 500s as erroneous even if there's no error document associated with a transaction. I'd still argue we should define error as seen from the user's perspective: requests that were not successful. Whether or not there's additional information, such as error documents, attached to it.

@felixbarny
Copy link
Member

tagging @elastic/apm-agent-devs

@cauemarcondes
Copy link
Author

@elastic/apm-agent-devs I'd like to know if it would be possible to have the error flag on a transaction for v7.9.0?

@SergeyKleyman
Copy link
Contributor

I think it would be hard to find a solution to fit popular use cases without some custom configuration. For example 404 is probably an error for a site with server-side rendering but for an API 404 might be things as usual (for example AJAX call to check if user name is already taken on a new user registration form). Maybe it's worth providing a configuration option to select status codes to be considered an error? It will still not cover 200 + error page use case though. Also according to dry HTTP standard 404 is a client side error so marking server side transaction as the one having an error can be somewhat misleading.

To me 500 seems to be a clear error - it's the only status code that has error in its reason phrase :)

As for the missing error document - we can always generate one although I am not sure what value it will provide. Maybe users would like to keep error events longer than transactions. in particular unsampled ones.

@gregkalapos
Copy link
Contributor

gregkalapos commented Jun 23, 2020

I also think we should add an error flag - there are some good use cases for it.

I feel going for a super detailed definition that is implemented the same way in all agents is a bit too optimistic.

I think a good definition is what Felix said here:

I'd still argue we should define error as seen from the user's perspective: requests that were not successful.

Then it’s either an HTTP request or not, if it’s not HTTP, then I feel we can’t tell more - so no other hints can be given in the definition.

If it’s HTTP, then yeah, we can try focusing on the response code.

My opinion on that is that a caught exception should not set the error flag to true and that a 500 transaction should.

Agree - for all 5xx errors. I also think that the fact that this can happen without an error document is strange, but I totally agree we still should mark this as an error. 2 reasons I can give: 1. the definition above :) 2. I talked to a user on discuss, they had some global error handling logic which catches all exceptions and just sets the HTTP status code to 500, but no exception is leaving the request pipeline - and the agent has also no chance to capture the exceptions (it’s like a try-catch early in the pipeline). They wanted those to be marked as error.

For 404s, I'm torn. … so I'm leaning towards counting that as an error, too.

Here I’d also focus on all 4xx errors, not only 400, and I think the definition of 4xx is “Client errors”, and with that I also think we should count this as an error.

So, overall I'd say something like this as the definition:

The flag is set to true if the transaction was not successful. If it's HTTP then this is either a client or a server side error (HTTP 4xx and HTTP 5xx).

@sorenlouv sorenlouv changed the title Add Error flag on transaction when an error happened Add boolean transaction.error flag to transaction documents Jun 23, 2020
@axw
Copy link
Member

axw commented Jun 24, 2020

I'm with @SergeyKleyman on this: to me 5xx is clearly an error, but it's not so clear for all of 4xx. There are several 4xx status codes which are not necessarily an error, and just part of expected flows:

401 Unauthorized: sent with WWW-Authenticate to prompt users to log in
404 Not Found: used in many APIs to indicate that a resource does not exist, which may be expected
409 Conflict: used in some APIs for optimistic concurrency
429 Too Many Requests: used for rate-limiting

I'd be inclined to make it minimal; leave at these ones out of the error classification, and enable users to customise the classification through the API.

@gregkalapos
Copy link
Contributor

There are several 4xx status codes which are not necessarily an error

That is totally true and these are indeed used in very common use cases (like REST). I think the interpretation in those cases is not super clear - it could be seen as error, or it could be seen as "not an error"; (in those cases probably "not an error") but it's just not clear. Nevertheless from what I know HTTP4xx is still defined as client side error. I feel we won't have a model here that fits every application.

I'd be inclined to make it minimal; leave at these ones out of the error classification, and enable users to customise the classification through the API.

My motivation with adding all 4xx to the error list was also to keep things minimal (in terms of complexity) and make sure we have a definition which is the same across all agents - considering everything which is between 400 and 599 is an error is easy to implement and maintain - and also simple to communicate to users. Once we have a list of 4xxs that are not errors then I feel a) the list must be complete (meaning if let's say 401 is not an error, then all other 4xx that "could be interpreted as 'not an error'" should also be in the list and b) it'd be nice to have the same list in all agents (this is the easier part). I think the challenge here would be to create a "list of 4xxs that are not errors" that better fit the universe of all use cases than the "every 4xx is an error" model. If we fail on that then we'd create confusion, like "Why are my 401s errors, but 451 aren't?" (or some other combination). Stating "every 4xx is an error, if it does not fit your app, then use the API to change this" (more on it in the next point) is I think significantly simpler.

On this part:

enable users to customise the classification through the API.

👍 totally agree - if our model does not fit the user's application (whichever model we take) I think this is a good way to enable users to change the interpretation of an error. Maybe for Java this will be harder - and maybe long term also for .NET if we have more users who attach the agent without code change, but I feel this is still an acceptable fall-back.

Of course I don't wanna block this - if we have a list of 4xx that we consider as "not an error" and we feel this is better than just marking all 4xxs as error, then let's go for it. Btw. an alternative approach would be to consider all 4xxs as "not an error" (I don't feel it'd be good, just listing here for completeness).

@felixbarny
Copy link
Member

While a 404 should definitely not page someone at night, it's still an error and you'll want to monitor the overall rate of 404s. If, for example, a new release triggers 10x more 404s, that's most certainly something you'd want to be aware of.

Somewhat related is the concept of a "business exception". REST endpoint handlers may throw an exception that gets converted to a client error. For example, because the user doesn't have enough credits to perform an action. It doesn't mean the server crashed but still, the request was not successful.
If all of a sudden an unusual percentage of the requests have that error it's something the user would want to know.

Maybe it's better if we don't add an error flag but flip the semantics and add a success flag. And a 404 is not a successful request.

enable users to customise the classification through the API.

Manual classification sounds like a good idea but I'd rather see that being handled by users being able to mute a particular error group, for example as opposed to having to modify the application.

@axw
Copy link
Member

axw commented Jun 24, 2020

@gregkalapos I agree that having some of the 400s treated as errors and some not is complex. I'm just concerned that by treating all of them as binary success/failure, we'll diminish the utility of the metric.

@felixbarny

While a 404 should definitely not page someone at night, it's still an error and you'll want to monitor the overall rate of 404s.

Maybe we have different definitions of what an error is. I take it to mean something exceptional - something you do not expect to happen. With this definition, there are cases where "it's still an error" does not hold. For example, another optimistic concurrency-control type of pattern:

  • fetch a thing which may (200) or may not exist (404)
  • update the thing
  • put the thing back (with some conflict checking, a la Etag/If-Not-Modified)

If this sort of thing is common in your API, and is more prevalent than truly exceptional errors, then it will be harder to see smaller rate changes for the exceptions.

If, for example, a new release triggers 10x more 404s, that's most certainly something you'd want to be aware of.

I think this is interesting. Even if something is an expected outcome, change in rate is probably interesting. It might mean changed application behaviours (e.g. more 429s because someone broke caching).

Maybe it's better if we don't add an error flag but flip the semantics and add a success flag. And a 404 is not a successful request.

That's just inverting the definition of error, so I don't see any practical difference.

What I think might be useful is if we had error categories, along the lines of OpenTelemetry's status codes. Then you could have an error rate per category. We kind of already have this with "HTTP 4xx" and "HTTP 5xx", but I think that might be a bit too coarse grained.

@beniwohli
Copy link
Contributor

While a 404 should definitely not page someone at night, it's still an error and you'll want to monitor the overall rate of 404s. If, for example, a new release triggers 10x more 404s, that's most certainly something you'd want to be aware of.

We still will have the 3xx/4xx/5xx RPM graphs, right? I think that covers this use case pretty well already. And people can still set up custom (ML) alerts based on the response code if their use case requires close monitoring of response code trends.

@felixbarny
Copy link
Member

Good points.

What I think might be useful is if we had error categories, along the lines of OpenTelemetry's status codes.

I think we need a very simple binary state so that we can display what the error rate of a service is, for example. In some cases, it may be an oversimplification but I'd argue it's still useful in most cases.

The OT status codes are just the gRPC status codes and seem to come from the historic context of OpenCensus. We already have a result property we can set to either the gRPC status or the HTTP status.

That's just inverting the definition of error, so I don't see any practical difference.

What I was trying to get to with this is that even if a 404 my not be seen as an error, it may not be a successful request either. But as we mainly want to show error rates, not success rates it would be just delegating that conversion to the UI.

@nehaduggal
Copy link

I think we should capture all unhandled exceptions as errors and use it in the error rate calculation. Along with the flag we should have a configuration that allows users to exclude errors from this error rate(by specifying error class or code). That exclusion config would be the escape hatch that ensures the error rate and the reported errors are not bloated by "false" error signals.

@alex-fedotyev
Copy link

I agree with there this conversation is going.
Capturing 4xx requests is something important, and may help when troubleshooting some issues, especially when indicated by RUM failed requests due to those 4xx issues.

At the same time, I suggest we don't include 4xx requests when calculating both throughput and error rates.
Otherwise performance telemetry and baselines would become skewed and would not reflect real users activities.

Example, any penetration test or simple PHPMyAdmin scan would then trigger false error alarms, while those don't indicate that service is indeed failing to deliver its functionality.
In contrast, seeing throughput significantly below normal and increased rate of 404s are indication of a true problem.

@hmdhk
Copy link
Contributor

hmdhk commented Jul 3, 2020

There are even more nuances with regards to the definition of error for the RUM agent. For example for page loads or route change transactions, the page might show successfully but there might be some errors as well. It's very common for pages to have multiple errors and still show some content.

For xhr transactions we can use the status code the same as backend agents will do.

@cauemarcondes This might result in show very high (or very low depending on our definition) error rates for the rum agent, which is probably not very useful on the UI. But we can investigate other ways to annotate a rum transaction as erroneous.

@vigneshshanmugam
Copy link
Member

Adding some context from the RUM agent side, For Hard Navigations (Page load transactions) and Soft Navigations(route-change, click transactions) there is no way to know whether the page was 4xx or 5xx as the information is not available through any browser API's. The error flag for the page load transactions on RUM agent depends purely on the JavaScript exceptions that happens during the lifetime of the transaction itself. There are other complicated ways to tie the loose ends with Network Error Logging API which could work, but it's out of scope for this issue.

However for API requests to the backend, which could captured as transactions(http-request type ). We can annotate errors based on the Status code as per the outcome of the decision we make here.

@felixbarny
Copy link
Member

4xx

Capturing 4xx requests is something important, and may help when troubleshooting some issues, especially when indicated by RUM failed requests due to those 4xx issues.

As @beniwohli indicated, we already capture the rate of 4xx transactions by capturing the result of a transaction.

At the same time, I suggest we don't include 4xx requests when calculating both throughput and error rates.

@alex-fedotyev Would you say that's true for every 4xx status code or just for 404s specifically?
@axw do you think it would be reasonable to exclude all 4xx with the reasoning that it's likely not an error caused by the server but by the client (4xx is defined as client error)?


RUM

It's very common for pages to have multiple errors and still show some content.

That's actually similar for backend agents, which is why I still think

My opinion on that is that a caught exception should not set the error flag to true and that a 500 transaction should.

there is no way to know whether the page was 4xx or 5xx as the information is not available through any browser API's

I guess that means that full-page load transactions will never have the error flag set to true. For route changes, is there a way to instrument framework specifics to find out whether the transition was successful?

For xhr transactions we can use the status code the same as backend agents will do.

However for API requests to the backend, which could captured as transactions(http-request type ). We can annotate errors based on the Status code as per the outcome of the decision we make here.

👍

@vigneshshanmugam
Copy link
Member

I guess that means that full-page load transactions will never have the error flag set to true.

It depends if we want to annotate errors based on status codes or with exceptions/errors that happens during the transaction.

For route changes, is there a way to instrument framework specifics to find out whether the transition was successful?

Yes, we already check whether the soft navigation (route change) status of some frameworks (Vue, Angular) whether it is success/error. We could potentially annotate them as errors.

@hmdhk
Copy link
Contributor

hmdhk commented Jul 7, 2020

@felixbarny , just to clarify, at least for page load transactions there's no reliable way to detect if the transaction was erroneous, whereas backend agents can look at the status code. So at least for the page load we should make it clear for the users that does not indicate anything (i.e. if it's not set it doesn't mean the page loaded correctly and vice versa).

For route changes we can potentially annotate them based on the result we get from the framework but even then it doesn't guarantee that the route change showed what it was supposed to. so it could give a false sense of success, we probably need to make this clear on the docs as well.

@alex-fedotyev
Copy link

alex-fedotyev commented Jul 8, 2020

@alex-fedotyev Would you say that's true for every 4xx status code or just for 404s specifically?

@felixbarny - I'd say that 4xx is a reasonable category as they all refer to client side errors.
In my personal experience I mainly saw 404's though.

@axw
Copy link
Member

axw commented Jul 13, 2020

@axw do you think it would be reasonable to exclude all 4xx with the reasoning that it's likely not an error caused by the server but by the client (4xx is defined as client error)?

@felixbarny Yes, I think that is reasonable.

@felixbarny
Copy link
Member

@graphaelli @axw @simitt any preference in terms of naming of this property in the intake API vs the ES documents? Should this be included in ECS? Is adding this to ECS a prerequisite before we can continue?

@cauemarcondes given that we're very close to FF and that even if all agents implement this by 7.9 (which is unlikely), we still need a strategy to handle data from old agents. This could either be your current impl (error documents/transactions) or disabling the error rate if transaction.error (or whatever this field ends up being named) is not present. I'd lean towards disabling vs showing inconsistent data.

@axw
Copy link
Member

axw commented Jul 15, 2020

I don't have a very strong opinion on this, but it seems to me this lines up with event.outcome. i.e. we would set event.outcome: "failure" for HTTP transactions with HTTP status 5xx.

@felixbarny
Copy link
Member

Sounds like a perfect match!

@cauemarcondes is it an issue for the query if the type is not boolean but string?

@webmat I saw event.outcome is marked as beta. Do you have any objections regarding APM using it to denote transaction success state? Are the values failure, success, and unknown expected to change before GA? Is there an ETA for it to be GA?

@sorenlouv
Copy link
Member

Great find @axw. event.outcome looks like exactly what we need.
Tangential: I can see how users might conflate event.outcome with transaction.result (even though the latter is freeform where the former is not).

@cauemarcondes
Copy link
Author

@cauemarcondes is it an issue for the query if the type is not boolean but string?

Not at all @felixbarny

@felixbarny
Copy link
Member

Let's continue discussions here: #299

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests