
Propagate errors from exporters to receivers #7486

Closed



@jpkrohling jpkrohling commented Apr 4, 2023

When using an OTel Collector as a gateway between OTLP clients and servers, errors from the server are not always propagated correctly from the exporter back to the client. When this happens, the client gets misleading error information, making debugging harder.

For example, given the following scenario:

External OTLP HTTP client (1) -> Collector's OTLP receiver (2) -> Collector's OTLP exporter (3) -> External OTLP gRPC server (4)

When (4) returns an "InvalidArgument", indicating that the client has sent bad input data, (1) should be informed of that by receiving a "400 Bad Request" from (2).

This PR changes both the OTLP exporter and receivers so that the error flow works in that combination. Authors of other exporters are advised to:

  1. return gRPC errors as-is back to the pipeline, so that receivers can parse them accordingly
  2. use the newly created internal/errs package to wrap client HTTP errors, so that receivers can reuse the original status code (a sketch of this wrapping idea follows this list)
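
As a rough illustration of point 2, this is what such an error-wrapping type could look like. This is a minimal sketch only: the type and constructor names are hypothetical and may not match the exact API introduced in this PR's internal/errs package.

```go
// Hypothetical sketch only; the actual type in internal/errs may differ.
package errs

import "fmt"

// RequestError carries the HTTP status code returned by the downstream
// server so that a receiver can reuse it when answering its own client.
type RequestError struct {
	StatusCode int
	Err        error
}

// NewRequestError wraps err together with the original status code.
func NewRequestError(statusCode int, err error) *RequestError {
	return &RequestError{StatusCode: statusCode, Err: err}
}

func (e *RequestError) Error() string {
	return fmt.Sprintf("request failed with HTTP status %d: %v", e.StatusCode, e.Err)
}

// Unwrap lets callers inspect the original error with errors.Is/errors.As.
func (e *RequestError) Unwrap() error { return e.Err }
```

A receiver could then use errors.As to detect a *RequestError coming back through the pipeline and copy its StatusCode into its own HTTP response.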

This PR is in draft mode, currently lacking:

  1. more tests, although one unit test was changed to ensure that the receiver correctly converts gRPC status codes to HTTP status codes (a mapping sketch follows this list)
  2. the scenario where the external client is gRPC and the external server is HTTP hasn't been tested and is not expected to work yet. The translation from HTTP -> gRPC might need to be done in a similar fashion to gRPC -> HTTP. I haven't done this because I wanted to validate the direction before investing more time in it.
  3. perhaps internal/errs should be a new module?
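
For point 1, this is roughly what the gRPC-to-HTTP conversion in the receiver could look like. The function name and the exact mapping table are illustrative assumptions, not necessarily what this PR implements.

```go
// Hypothetical sketch of a gRPC-to-HTTP status mapping in the receiver.
package errmap

import (
	"net/http"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// httpStatusFromGRPCError picks an HTTP status code for the receiver's
// response based on the gRPC status error that came back from the exporter.
func httpStatusFromGRPCError(err error) int {
	s, ok := status.FromError(err)
	if !ok {
		return http.StatusInternalServerError
	}
	switch s.Code() {
	case codes.OK:
		return http.StatusOK
	case codes.InvalidArgument:
		return http.StatusBadRequest
	case codes.Unauthenticated:
		return http.StatusUnauthorized
	case codes.PermissionDenied:
		return http.StatusForbidden
	case codes.NotFound:
		return http.StatusNotFound
	case codes.ResourceExhausted:
		return http.StatusTooManyRequests
	case codes.Unavailable:
		return http.StatusServiceUnavailable
	case codes.DeadlineExceeded:
		return http.StatusGatewayTimeout
	default:
		return http.StatusInternalServerError
	}
}
```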

Signed-off-by: Juraci Paixão Kröhling juraci@kroehling.de


codecov bot commented Apr 4, 2023

Codecov Report

Patch coverage: 54.68% and project coverage change: +0.21% 🎉

Comparison is base (ce65350) 90.79% compared to head (f9567e6) 91.00%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7486      +/-   ##
==========================================
+ Coverage   90.79%   91.00%   +0.21%     
==========================================
  Files         296      300       +4     
  Lines       14790    14972     +182     
==========================================
+ Hits        13428    13625     +197     
+ Misses       1087     1071      -16     
- Partials      275      276       +1     
Impacted Files                            Coverage Δ
receiver/otlpreceiver/erroradapter.go     22.22% <22.22%>   (ø)
exporter/otlphttpexporter/otlp.go         80.48% <50.00%>   (+0.96%) ⬆️
internal/colerrs/request.go              100.00% <100.00%>  (ø)
receiver/otlpreceiver/otlphttp.go         53.54% <100.00%>  (+5.75%) ⬆️

... and 19 files with indirect coverage changes



@andrzej-stencel andrzej-stencel left a comment


This is great, thanks a lot @jpkrohling. This will help with open-telemetry/opentelemetry-collector-contrib#20511 too.

@github-actions

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Apr 22, 2023

github-actions bot commented May 7, 2023

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this May 7, 2023
@jpkrohling jpkrohling removed the Stale label May 10, 2023
@jpkrohling jpkrohling reopened this May 10, 2023
@jpkrohling

Reopening, this is not stale; it's waiting on the dependent PR.

@jpkrohling jpkrohling force-pushed the jpkrohling/propagate-errors branch from 647e196 to 0e2400a on May 15, 2023 17:07
@jpkrohling jpkrohling marked this pull request as ready for review May 15, 2023 17:07
@jpkrohling jpkrohling requested review from a team and Aneurysm9 May 15, 2023 17:07
@jpkrohling

PR updated. I think we are ready to move forward with this one, given that we don't have a final proposal for #7439.


djaglowski commented May 16, 2023

This functionality may not work if we have any of the following:

  • Asynchronous processors (e.g. batch)
  • "Asymmetrical" processors (e.g. Anything that merges, splits, reorganizes, drops, or adds ptraces/pmetrics/plogs)
  • Non-linear data flow (e.g. shared receiver replicates data stream to multiple exporters, each may fail independently)

Do we need to document or codify the requirements for this functionality to be used reliably?
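
To make the first bullet concrete, here is an illustrative stand-in (not the real batch processor code) for any asynchronous processor: it acknowledges the data before the export happens, so a downstream error can no longer reach the receiver's response.

```go
// Illustrative only; not the actual batch processor implementation.
package asyncsketch

import (
	"context"

	"go.opentelemetry.io/collector/pdata/ptrace"
)

type asyncProcessor struct {
	// queue is drained by a background goroutine that performs the export.
	queue chan ptrace.Traces
}

// ConsumeTraces returns before the data is actually exported, so a later
// failure in the exporter cannot be propagated back to the receiver.
func (p *asyncProcessor) ConsumeTraces(_ context.Context, td ptrace.Traces) error {
	p.queue <- td
	return nil
}
```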

@dmitryax

For example, given the following scenario:

External OTLP HTTP client (1) -> Collector's OTLP receiver (2) -> Collector's OTLP exporter (3) -> External OTLP gRPC server (4)

When (4) returns an "InvalidArgument", indicating that the client has sent bad input data, (1) should be informed of that by receiving a "400 Bad Request" from (2).

I have a couple of Qs about this:

  1. Just curious, what is the bad input data rejected by (4) and accepted by (2)? Aren't they both supposed to use the same OTLP server implementation?
  2. As a user looking at an error received by (1), how can I understand where it is coming from, (2) or (4)?


@MovieStoreGuy MovieStoreGuy left a comment


A few comments from me.

@jpkrohling

I just updated this PR to address the latest review comments.

Just curious, what is the bad input data rejected by (4) and accepted by (2)? Aren't they both supposed to use the same OTLP server?

For the context propagation to work, the pipeline has to be synchronous, i.e., there are no batching processors or sending queues in it. When that's the case, (4) may reject data, and that rejection will be reflected in (2)'s response to (1). To be clear: (2) will not send a response to (1) without first hearing back from (4).

As a user looking at an error received by (1), how can I understand where it is coming from, (2) or (4)?

Given the constraints I mentioned earlier, the error from (2) is the same as the one from (4), but it might happen that (2) is the one returning the error without even touching (4). In that case, the source of the error is not clear to (1). I don't think it should matter to the client where the error comes from. Operators of (2) and (4) should be able to determine where the error happened based on (2)'s and (4)'s metrics.


dmitryax commented May 22, 2023

For the context propagation to work, the pipeline has to be synchronous, i.e., there are no batching processors or sending queues in it. When that's the case, (4) may reject data, and that rejection will be reflected in (2)'s response to (1). To be clear: (2) will not send a response to (1) without first hearing back from (4).

My question was why (2) doesn't reject data with the same 400 error while (4) does reject it. Do they have different OTLP validation logic?

I don't think it should matter to the client where the error is coming from. Operators of (2) and (4) should be able to determine the place the error happened based on (2)'s and (4)'s metrics.

I disagree here. As a developer of (1), if I see an error, I need to know whether it's caused by a problem in the collector or in the backend, so that I can report it to the responsible team or investigate it myself. I think the collector can wrap the error message while keeping the same error code, to make it clear where the error is coming from.

@jpkrohling

My question was why (2) doesn't reject data with the same 400 error while (4) does reject it. Do they have different OTLP validation logic?

I think I'm not seeing what you are seeing and would appreciate more details. What I have in mind is that perhaps there's an extra processor (or, in the future, an interceptor) on (4) that will return a 400, while (2) is just acting as a proxy.

I think the collector can wrap the error message while keeping the same error code, to make it clear where the error is coming from.

That's doable. We already have the RequestError type with a constructor; we can add extra context to the message.
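
A minimal sketch of what that could look like, reusing the hypothetical RequestError from the PR description above (the helper name is illustrative, not this PR's actual API):

```go
// Continues the hypothetical internal/errs sketch from the PR description;
// RequestError and NewRequestError are defined there.
package errs

import "fmt"

// withDownstreamContext keeps the original status code but annotates the
// message, so (1) can tell the failure came from the downstream server (4)
// rather than from the collector (2).
func withDownstreamContext(statusCode int, err error) *RequestError {
	return NewRequestError(
		statusCode,
		fmt.Errorf("error forwarded from downstream OTLP server: %w", err),
	)
}
```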

@jpkrohling jpkrohling force-pushed the jpkrohling/propagate-errors branch 3 times, most recently from 6934602 to 4024e2f on May 29, 2023 13:58
@jpkrohling jpkrohling force-pushed the jpkrohling/propagate-errors branch from 4024e2f to 6c3fd19 on May 31, 2023 18:06
@jpkrohling jpkrohling force-pushed the jpkrohling/propagate-errors branch from 975a234 to e45443d on June 5, 2023 17:58
@jpkrohling jpkrohling force-pushed the jpkrohling/propagate-errors branch from bae14d0 to ba9fee0 on June 5, 2023 18:24
@jpkrohling

@evan-bradley, @codeboten, could you please review this PR again? I believe all points have been addressed, apart from the documentation one brought up by Alex. If we agree that docs/design.md is the right place, I'll document it as part of a follow-up PR.

@jpkrohling

I updated this PR to return 500s if the receiver got a permanent error from the next consumer. I believe this, along with #7868, brings the correct behavior with minimal disruption to existing pipelines.
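
As a sketch of the behavior described above, one plausible mapping could look like the following. The 503 for retryable errors is an assumption for illustration, not something stated in this PR.

```go
// One plausible mapping; not necessarily the exact code in this PR.
package errmapsketch

import (
	"net/http"

	"go.opentelemetry.io/collector/consumer/consumererror"
)

func httpStatusFromConsumerError(err error) int {
	if err == nil {
		return http.StatusOK
	}
	if consumererror.IsPermanent(err) {
		// Permanent errors: retrying will not help, so report a 500.
		return http.StatusInternalServerError
	}
	// Non-permanent (retryable) errors: the client may retry later.
	return http.StatusServiceUnavailable
}
```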


andrzej-stencel commented Jun 13, 2023

Regarding this comment from @bogdandrutu on another PR:

I think the problem was not finding out whether one pipeline permanently errored, but what to do in cases like one pipeline succeeding while the other has a retryable error. Do we return a retryable error to the client so they can retry? That will cause duplicate data on one path; what do you do then?

It's worth noting (is it obvious? or is it not correct?) that the component sending the data (e.g. the collector's receiver) is not able to retry/re-send the data to only a selection of consumers, e.g. to those that returned a non-permanent error and not to those that returned success (or a permanent error).

In light of the above, here are the scenarios that I can identify and options for each:

Scenario A: Multiple consumers, some succeeding, some failing with permanent errors and some failing with retryable errors
Scenario B: Multiple consumers, all failing with retryable errors
Scenario C: Single consumer, returning either success, or a permanent error, or a retryable error

For Scenario A, we seem to only have the following two options for the sender of the data:

  1. Re-send the data to all consumers, causing duplication in those that succeeded
  2. Not re-send the data at all, ignoring the retryable errors

In the second case, the sending component also shouldn't be passing the retryable error back to the caller.

I lean towards the second option. We'd need to provide documentation educating users about the consequences of creating pipelines with a fan-out, like using the same receiver in different pipelines of the same signal type, or using more than one exporter in a pipeline.

Scenario B: When all consumers failed with retryable errors, I believe it's important for the sender to be able to pass the retryable error back to the caller, or retry itself, depending on its configuration.

Scenario C: The sending component should be able to act according to the error, i.e., in case of a retryable error, retry (if configured to do so) or return the retryable error back to the caller. If I'm not mistaken, this is what this PR was about from the start.
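
For illustration, here is a sketch of how a fan-out consumer could aggregate per-consumer results following option 2 of Scenario A (and Scenario B when nothing succeeded). This is not the collector's actual fanoutconsumer code; names and structure are assumptions.

```go
// Illustrative aggregation only; not the collector's fanoutconsumer.
package fanoutsketch

import "go.opentelemetry.io/collector/consumer/consumererror"

// aggregate takes one result per consumer (nil means success) and decides
// what to report back to the caller.
func aggregate(results []error) error {
	var firstPermanent, firstRetryable error
	anySucceeded := false
	for _, err := range results {
		switch {
		case err == nil:
			anySucceeded = true
		case consumererror.IsPermanent(err):
			if firstPermanent == nil {
				firstPermanent = err
			}
		default:
			if firstRetryable == nil {
				firstRetryable = err
			}
		}
	}
	// Scenario A, option 2: if any consumer already accepted the data, drop
	// retryable errors so the caller does not retry and create duplicates.
	if anySucceeded {
		return firstPermanent // may be nil, i.e. overall success
	}
	// Scenario B: nobody succeeded; surface the retryable error so the
	// caller can retry (or retry locally, depending on configuration).
	if firstRetryable != nil {
		return firstRetryable
	}
	return firstPermanent
}
```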

@bogdandrutu

I lean towards the second option. We'd need to provide documentation educating users about the consequences of creating pipelines with a fan-out, like using the same receiver in different pipelines of the same signal type, or using more than one exporter in a pipeline.

For exporters that is not a real problem, since with the retry and queuing mechanisms it should not happen (or only very rarely, when one exporter is slower and has its queue full).

@bogdandrutu

@astencel-sumo @jpkrohling before trying to solve these scenarios, I would like to understand when a retryable error can happen.

Also, I am thinking that we may need a way to mark a "health/status" at the pipeline level (based on pipeline ID) that the fanout/router consumer can use. Think of something like this: when the queue is full for an exporter, we mark the entire pipeline (or pipelines) exporting to that exporter as "busy/resource exhausted"; then, when a receiver tries to push data to a fanout consumer that includes that pipeline, we reject the data immediately so it can be retried against a different collector instance without duplication.
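
A sketch of what such a pipeline-level health flag could look like; no such API exists in the collector today, and all names here are purely illustrative.

```go
// Purely illustrative; no such API exists in the collector today.
package pipelinestatus

import "sync"

// Registry tracks a "busy/resource exhausted" flag per pipeline ID.
type Registry struct {
	mu   sync.RWMutex
	busy map[string]bool
}

func NewRegistry() *Registry {
	return &Registry{busy: make(map[string]bool)}
}

// SetBusy would be called, for example, by an exporter whose sending queue
// is full (and cleared again once the queue drains).
func (r *Registry) SetBusy(pipelineID string, busy bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.busy[pipelineID] = busy
}

// IsBusy lets the fanout consumer reject data up front, so the client can
// retry against another collector instance without duplication.
func (r *Registry) IsBusy(pipelineID string) bool {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.busy[pipelineID]
}
```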

I have not thought through all scenarios, but I would like one of you to think about and document:

  • what are the possible retryable errors, and which components should/are allowed to produce them?
  • is it possible to have the status at the pipeline level and avoid, if possible, getting into Scenario A?

@github-actions

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Jul 20, 2023

github-actions bot commented Aug 4, 2023

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Aug 4, 2023