-
Notifications
You must be signed in to change notification settings - Fork 111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add boolean transaction.error
flag to transaction documents
#281
Comments
There are other use cases that would benefit from an error flag, too. For example #271 (probably requires an error flag for spans) and https://discuss.elastic.co/t/finding-errors-related-to-transactions/233667. The error rate could also be a useful addition to the transactions overview table (see also elastic/kibana#65619 (comment)). Field definition Definition of error My opinion on that is that a caught exception should not set the error flag to In the context of the error rate chart for the error page (elastic/kibana#65619) it might be weird to count 500s as erroneous even if there's no error document associated with a transaction. I'd still argue we should define |
tagging @elastic/apm-agent-devs |
@elastic/apm-agent-devs I'd like to know if it would be possible to have the error flag on a transaction for |
I think it would be hard to find a solution to fit popular use cases without some custom configuration. For example 404 is probably an error for a site with server-side rendering but for an API 404 might be things as usual (for example AJAX call to check if user name is already taken on a new user registration form). Maybe it's worth providing a configuration option to select status codes to be considered an error? It will still not cover 200 + error page use case though. Also according to dry HTTP standard 404 is a client side error so marking server side transaction as the one having an error can be somewhat misleading. To me 500 seems to be a clear error - it's the only status code that has As for the missing error document - we can always generate one although I am not sure what value it will provide. Maybe users would like to keep error events longer than transactions. in particular unsampled ones. |
I also think we should add an error flag - there are some good use cases for it. I feel going for a super detailed definition that is implemented the same way in all agents is a bit too optimistic. I think a good definition is what Felix said here:
Then it’s either an HTTP request or not, if it’s not HTTP, then I feel we can’t tell more - so no other hints can be given in the definition. If it’s HTTP, then yeah, we can try focusing on the response code.
Agree - for all 5xx errors. I also think that the fact that this can happen without an error document is strange, but I totally agree we still should mark this as an error. 2 reasons I can give: 1. the definition above :) 2. I talked to a user on discuss, they had some global error handling logic which catches all exceptions and just sets the HTTP status code to 500, but no exception is leaving the request pipeline - and the agent has also no chance to capture the exceptions (it’s like a try-catch early in the pipeline). They wanted those to be marked as error.
Here I’d also focus on all 4xx errors, not only 400, and I think the definition of 4xx is “Client errors”, and with that I also think we should count this as an error. So, overall I'd say something like this as the definition:
|
transaction.error
flag to transaction documents
I'm with @SergeyKleyman on this: to me 5xx is clearly an error, but it's not so clear for all of 4xx. There are several 4xx status codes which are not necessarily an error, and just part of expected flows: 401 Unauthorized: sent with WWW-Authenticate to prompt users to log in I'd be inclined to make it minimal; leave at these ones out of the error classification, and enable users to customise the classification through the API. |
That is totally true and these are indeed used in very common use cases (like REST). I think the interpretation in those cases is not super clear - it could be seen as error, or it could be seen as "not an error"; (in those cases probably "not an error") but it's just not clear. Nevertheless from what I know HTTP4xx is still defined as client side error. I feel we won't have a model here that fits every application.
My motivation with adding all 4xx to the error list was also to keep things minimal (in terms of complexity) and make sure we have a definition which is the same across all agents - considering everything which is between 400 and 599 is an error is easy to implement and maintain - and also simple to communicate to users. Once we have a list of 4xxs that are not errors then I feel a) the list must be complete (meaning if let's say 401 is not an error, then all other 4xx that "could be interpreted as 'not an error'" should also be in the list and b) it'd be nice to have the same list in all agents (this is the easier part). I think the challenge here would be to create a "list of 4xxs that are not errors" that better fit the universe of all use cases than the "every 4xx is an error" model. If we fail on that then we'd create confusion, like "Why are my 401s errors, but 451 aren't?" (or some other combination). Stating "every 4xx is an error, if it does not fit your app, then use the API to change this" (more on it in the next point) is I think significantly simpler. On this part:
👍 totally agree - if our model does not fit the user's application (whichever model we take) I think this is a good way to enable users to change the interpretation of an error. Maybe for Java this will be harder - and maybe long term also for .NET if we have more users who attach the agent without code change, but I feel this is still an acceptable fall-back. Of course I don't wanna block this - if we have a list of 4xx that we consider as "not an error" and we feel this is better than just marking all 4xxs as error, then let's go for it. Btw. an alternative approach would be to consider all 4xxs as "not an error" (I don't feel it'd be good, just listing here for completeness). |
While a 404 should definitely not page someone at night, it's still an error and you'll want to monitor the overall rate of 404s. If, for example, a new release triggers 10x more 404s, that's most certainly something you'd want to be aware of. Somewhat related is the concept of a "business exception". REST endpoint handlers may throw an exception that gets converted to a client error. For example, because the user doesn't have enough credits to perform an action. It doesn't mean the server crashed but still, the request was not successful. Maybe it's better if we don't add an
Manual classification sounds like a good idea but I'd rather see that being handled by users being able to mute a particular error group, for example as opposed to having to modify the application. |
@gregkalapos I agree that having some of the 400s treated as errors and some not is complex. I'm just concerned that by treating all of them as binary success/failure, we'll diminish the utility of the metric.
Maybe we have different definitions of what an error is. I take it to mean something exceptional - something you do not expect to happen. With this definition, there are cases where "it's still an error" does not hold. For example, another optimistic concurrency-control type of pattern:
If this sort of thing is common in your API, and is more prevalent than truly exceptional errors, then it will be harder to see smaller rate changes for the exceptions.
I think this is interesting. Even if something is an expected outcome, change in rate is probably interesting. It might mean changed application behaviours (e.g. more 429s because someone broke caching).
That's just inverting the definition of error, so I don't see any practical difference. What I think might be useful is if we had error categories, along the lines of OpenTelemetry's status codes. Then you could have an error rate per category. We kind of already have this with "HTTP 4xx" and "HTTP 5xx", but I think that might be a bit too coarse grained. |
We still will have the 3xx/4xx/5xx RPM graphs, right? I think that covers this use case pretty well already. And people can still set up custom (ML) alerts based on the response code if their use case requires close monitoring of response code trends. |
Good points.
I think we need a very simple binary state so that we can display what the error rate of a service is, for example. In some cases, it may be an oversimplification but I'd argue it's still useful in most cases. The OT status codes are just the gRPC status codes and seem to come from the historic context of OpenCensus. We already have a
What I was trying to get to with this is that even if a 404 my not be seen as an error, it may not be a successful request either. But as we mainly want to show error rates, not success rates it would be just delegating that conversion to the UI. |
I think we should capture all unhandled exceptions as errors and use it in the error rate calculation. Along with the flag we should have a configuration that allows users to exclude errors from this error rate(by specifying error class or code). That exclusion config would be the escape hatch that ensures the error rate and the reported errors are not bloated by "false" error signals. |
I agree with there this conversation is going. At the same time, I suggest we don't include 4xx requests when calculating both throughput and error rates. Example, any penetration test or simple PHPMyAdmin scan would then trigger false error alarms, while those don't indicate that service is indeed failing to deliver its functionality. |
There are even more nuances with regards to the definition of error for the RUM agent. For example for page loads or route change transactions, the page might show successfully but there might be some errors as well. It's very common for pages to have multiple errors and still show some content. For xhr transactions we can use the status code the same as backend agents will do. @cauemarcondes This might result in show very high (or very low depending on our definition) error rates for the rum agent, which is probably not very useful on the UI. But we can investigate other ways to annotate a rum transaction as erroneous. |
Adding some context from the RUM agent side, For Hard Navigations (Page load transactions) and Soft Navigations(route-change, click transactions) there is no way to know whether the page was 4xx or 5xx as the information is not available through any browser API's. The error flag for the page load transactions on RUM agent depends purely on the JavaScript exceptions that happens during the lifetime of the transaction itself. There are other complicated ways to tie the loose ends with Network Error Logging API which could work, but it's out of scope for this issue. However for API requests to the backend, which could captured as transactions( |
4xx
As @beniwohli indicated, we already capture the rate of 4xx transactions by capturing the
@alex-fedotyev Would you say that's true for every 4xx status code or just for 404s specifically? RUM
That's actually similar for backend agents, which is why I still think
I guess that means that full-page load transactions will never have the error flag set to true. For route changes, is there a way to instrument framework specifics to find out whether the transition was successful?
👍 |
It depends if we want to annotate errors based on status codes or with exceptions/errors that happens during the transaction.
Yes, we already check whether the soft navigation (route change) status of some frameworks (Vue, Angular) whether it is success/error. We could potentially annotate them as errors. |
@felixbarny , just to clarify, at least for page load transactions there's no reliable way to detect if the transaction was erroneous, whereas backend agents can look at the status code. So at least for the page load we should make it clear for the users that does not indicate anything (i.e. if it's not set it doesn't mean the page loaded correctly and vice versa). For route changes we can potentially annotate them based on the result we get from the framework but even then it doesn't guarantee that the route change showed what it was supposed to. so it could give a false sense of success, we probably need to make this clear on the docs as well. |
@felixbarny - I'd say that 4xx is a reasonable category as they all refer to client side errors. |
@felixbarny Yes, I think that is reasonable. |
@graphaelli @axw @simitt any preference in terms of naming of this property in the intake API vs the ES documents? Should this be included in ECS? Is adding this to ECS a prerequisite before we can continue? @cauemarcondes given that we're very close to FF and that even if all agents implement this by 7.9 (which is unlikely), we still need a strategy to handle data from old agents. This could either be your current impl (error documents/transactions) or disabling the error rate if |
I don't have a very strong opinion on this, but it seems to me this lines up with event.outcome. i.e. we would set |
Sounds like a perfect match! @cauemarcondes is it an issue for the query if the type is not boolean but string? @webmat I saw |
Great find @axw. |
Not at all @felixbarny |
Let's continue discussions here: #299 |
Needed by: 7.9
Summary
Transaction documents should have a boolean flag
error
which indicates whether the transaction was erroneous or not.Background
In APM UI we want to add a chart displaying the error rate. Currently we are calculating the rate as the ratio of errors to transactions in a given time range. But since a transaction can have multiple errors and we can have errors outside of transactions we can not calculate it like that.
Solution
A solution would be having a flag on the transaction informing if that transaction is erroneous. And to calculate the Error rate we'd do:
total_number_of_transactions_with_errors / total_number_of_transactions
in a given time range.The text was updated successfully, but these errors were encountered: