
Propagate errors from exporters to receivers #7486

Closed



@jpkrohling jpkrohling commented Apr 4, 2023

When using an OTel Collector as a gateway between OTLP clients and servers, errors from the server are not always propagated correctly from the exporter back to the client. When this happens, the client gets misleading error information, making debugging harder.

For example, given the following scenario:

External OTLP HTTP client (1) -> Collector's OTLP receiver (2) -> Collector's OTLP exporter (3) -> External OTLP gRPC server (4)

When (4) returns an "InvalidArgument", indicating that the client has sent bad input data, (1) should be informed of that by receiving a "400 Bad Request" from (2).

This PR changes both the OTLP exporter and receivers so that the error flow works in that combination. Authors of other exporters are advised to:

  1. return gRPC errors as-is back to the pipeline, so that receivers can parse them accordingly
  2. use the newly created internal/errs package to wrap client HTTP errors, so that receivers can reuse the original status code (a sketch of this wrapping idea follows this list)
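
As a rough illustration of point 2, this is what such an error-wrapping type could look like. This is a minimal sketch only: the type and constructor names are hypothetical and may not match the exact API introduced in this PR's internal/errs package.

```go
// Hypothetical sketch only; the actual type in internal/errs may differ.
package errs

import "fmt"

// RequestError carries the HTTP status code returned by the downstream
// server so that a receiver can reuse it when answering its own client.
type RequestError struct {
	StatusCode int
	Err        error
}

// NewRequestError wraps err together with the original status code.
func NewRequestError(statusCode int, err error) *RequestError {
	return &RequestError{StatusCode: statusCode, Err: err}
}

func (e *RequestError) Error() string {
	return fmt.Sprintf("request failed with HTTP status %d: %v", e.StatusCode, e.Err)
}

// Unwrap lets callers inspect the original error with errors.Is/errors.As.
func (e *RequestError) Unwrap() error { return e.Err }
```

A receiver could then use errors.As to detect a *RequestError coming back through the pipeline and copy its StatusCode into its own HTTP response.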

This PR is in draft mode, currently lacking:

  1. more tests, although one unit test was changed to ensure that the receiver correctly converts gRPC status codes to HTTP status codes (a mapping sketch follows this list)
  2. the scenario where the external client is gRPC and the external server is HTTP hasn't been tested and is not expected to work yet. The translation from HTTP -> gRPC might need to be done in a similar fashion to gRPC -> HTTP. I haven't done this because I wanted to validate the direction before investing more time in it.
  3. perhaps internal/errs should be a new module?
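
For point 1, this is roughly what the gRPC-to-HTTP conversion in the receiver could look like. The function name and the exact mapping table are illustrative assumptions, not necessarily what this PR implements.

```go
// Hypothetical sketch of a gRPC-to-HTTP status mapping in the receiver.
package errmap

import (
	"net/http"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// httpStatusFromGRPCError picks an HTTP status code for the receiver's
// response based on the gRPC status error that came back from the exporter.
func httpStatusFromGRPCError(err error) int {
	s, ok := status.FromError(err)
	if !ok {
		return http.StatusInternalServerError
	}
	switch s.Code() {
	case codes.OK:
		return http.StatusOK
	case codes.InvalidArgument:
		return http.StatusBadRequest
	case codes.Unauthenticated:
		return http.StatusUnauthorized
	case codes.PermissionDenied:
		return http.StatusForbidden
	case codes.NotFound:
		return http.StatusNotFound
	case codes.ResourceExhausted:
		return http.StatusTooManyRequests
	case codes.Unavailable:
		return http.StatusServiceUnavailable
	case codes.DeadlineExceeded:
		return http.StatusGatewayTimeout
	default:
		return http.StatusInternalServerError
	}
}
```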

Signed-off-by: Juraci Paixão Kröhling juraci@kroehling.de


codecov bot commented Apr 4, 2023

Codecov Report

Patch coverage: 54.68% and project coverage change: +0.21% 🎉

Comparison is base (ce65350) 90.79% compared to head (f9567e6) 91.00%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7486      +/-   ##
==========================================
+ Coverage   90.79%   91.00%   +0.21%     
==========================================
  Files         296      300       +4     
  Lines       14790    14972     +182     
==========================================
+ Hits        13428    13625     +197     
+ Misses       1087     1071      -16     
- Partials      275      276       +1     
Impacted Files                            Coverage Δ
receiver/otlpreceiver/erroradapter.go     22.22% <22.22%>   (ø)
exporter/otlphttpexporter/otlp.go         80.48% <50.00%>   (+0.96%) ⬆️
internal/colerrs/request.go              100.00% <100.00%>  (ø)
receiver/otlpreceiver/otlphttp.go         53.54% <100.00%>  (+5.75%) ⬆️

... and 19 files with indirect coverage changes



@andrzej-stencel andrzej-stencel left a comment


This is great, thanks a lot @jpkrohling. This will help with open-telemetry/opentelemetry-collector-contrib#20511 too.

@github-actions

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Apr 22, 2023

github-actions bot commented May 7, 2023

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this May 7, 2023
@jpkrohling jpkrohling removed the Stale label May 10, 2023
@jpkrohling jpkrohling reopened this May 10, 2023
@jpkrohling

Reopening, this is not stale; it's waiting on the dependent PR.

@jpkrohling jpkrohling force-pushed the jpkrohling/propagate-errors branch from 647e196 to 0e2400a on May 15, 2023 17:07
@jpkrohling jpkrohling marked this pull request as ready for review May 15, 2023 17:07
@jpkrohling jpkrohling requested review from a team and Aneurysm9 May 15, 2023 17:07
@jpkrohling

PR updated. I think we are ready to move forward with this one, given that we don't have a final proposal for #7439.


djaglowski commented May 16, 2023

This functionality may not work if we have any of the following:

  • Asynchronous processors (e.g. batch)
  • "Asymmetrical" processors (e.g. Anything that merges, splits, reorganizes, drops, or adds ptraces/pmetrics/plogs)
  • Non-linear data flow (e.g. shared receiver replicates data stream to multiple exporters, each may fail independently)

Do we need to document or codify the requirements for this functionality to be used reliably?
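
To make the first bullet concrete, here is an illustrative stand-in (not the real batch processor code) for any asynchronous processor: it acknowledges the data before the export happens, so a downstream error can no longer reach the receiver's response.

```go
// Illustrative only; not the actual batch processor implementation.
package asyncsketch

import (
	"context"

	"go.opentelemetry.io/collector/pdata/ptrace"
)

type asyncProcessor struct {
	// queue is drained by a background goroutine that performs the export.
	queue chan ptrace.Traces
}

// ConsumeTraces returns before the data is actually exported, so a later
// failure in the exporter cannot be propagated back to the receiver.
func (p *asyncProcessor) ConsumeTraces(_ context.Context, td ptrace.Traces) error {
	p.queue <- td
	return nil
}
```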

@dmitryax

For example, given the following scenario:

External OTLP HTTP client (1) -> Collector's OTLP receiver (2) -> Collector's OTLP exporter (3) -> External OTLP gRPC server (4)

When (4) returns an "InvalidArgument", indicating that the client has sent bad input data, (1) should be informed of that by receiving a "400 Bad Request" from (2).

I have a couple of Qs about this:

  1. Just curious, what is the bad input data rejected by (4) and accepted by (2)? Aren't they both supposed to use the same OTLP server implementation?
  2. As a user looking at an error received by (1), how can I understand where it is coming from, (2) or (4)?


@MovieStoreGuy MovieStoreGuy left a comment


A few comments from me.

@jpkrohling

I just updated this PR to address the latest review comments.

Just curious, what is the bad input data rejected by (4) and accepted by (2)? Aren't they both supposed to use the same OTLP server?

For the context propagation to work, the pipeline has to be synchronous, i.e., there are no batching processors or sending queues in it. When that's the case, (4) may reject data, and that rejection will be reflected in (2)'s response to (1). To be clear: (2) will not send a response to (1) without first hearing back from (4).

As a user looking at an error received by (1), how can I understand where it is coming from, (2) or (4)?

Given the constraints I mentioned earlier, the error from (2) is the same as the one from (4), but it might happen that (2) is the one returning the error without even touching (4). In that case, the source of the error is not clear to (1). I don't think it should matter to the client where the error comes from. Operators of (2) and (4) should be able to determine where the error happened based on (2)'s and (4)'s metrics.


dmitryax commented May 22, 2023

For the context propagation to work, the pipeline has to be synchronous, i.e., there are no batching processors or sending queues in it. When that's the case, (4) may reject data, and that rejection will be reflected in (2)'s response to (1). To be clear: (2) will not send a response to (1) without first hearing back from (4).

My question was why (2) doesn't reject data with the same 400 error while (4) does reject it. Do they have different OTLP validation logic?

I don't think it should matter to the client where the error is coming from. Operators of (2) and (4) should be able to determine the place the error happened based on (2)'s and (4)'s metrics.

I disagree here. As a developer of (1), if I see an error, I need to know whether it's caused by a problem in the collector or in the backend, so that I can report it to the responsible team or investigate it myself. I think the collector can wrap the error message while keeping the same error code, to make it clear where the error is coming from.

@jpkrohling

My question was why (2) doesn't reject data with the same 400 error while (4) does reject it. Do they have different OTLP validation logic?

I think I'm not seeing what you are seeing and would appreciate more details. What I have in mind is that perhaps there's an extra processor (or, in the future, an interceptor) on (4) that will return a 400, while (2) is just acting as a proxy.

I think the collector can wrap the error message while keeping the same error code, to make it clear where the error is coming from.

That's doable. We already have the RequestError type with a constructor; we can add extra context to the message.
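
A minimal sketch of what that could look like, reusing the hypothetical RequestError from the PR description above (the helper name is illustrative, not this PR's actual API):

```go
// Continues the hypothetical internal/errs sketch from the PR description;
// RequestError and NewRequestError are defined there.
package errs

import "fmt"

// withDownstreamContext keeps the original status code but annotates the
// message, so (1) can tell the failure came from the downstream server (4)
// rather than from the collector (2).
func withDownstreamContext(statusCode int, err error) *RequestError {
	return NewRequestError(
		statusCode,
		fmt.Errorf("error forwarded from downstream OTLP server: %w", err),
	)
}
```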

@jpkrohling jpkrohling force-pushed the jpkrohling/propagate-errors branch 3 times, most recently from 6934602 to 4024e2f on May 29, 2023 13:58
@jpkrohling jpkrohling force-pushed the jpkrohling/propagate-errors branch from 4024e2f to 6c3fd19 on May 31, 2023 18:06
@jpkrohling jpkrohling force-pushed the jpkrohling/propagate-errors branch from 975a234 to e45443d on June 5, 2023 17:58
@jpkrohling jpkrohling force-pushed the jpkrohling/propagate-errors branch from bae14d0 to ba9fee0 on June 5, 2023 18:24
@jpkrohling

@evan-bradley, @codeboten, could you please review this PR again? I believe all points have been addressed, apart from the documentation one brought up by Alex. If we agree that docs/design.md is the right place, I'll document it as part of a follow-up PR.

@jpkrohling

I updated this PR to return 500s if the receiver got a permanent error from the next consumer. I believe this, along with #7868, brings the correct behavior with minimal disruption to existing pipelines.
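
As a sketch of the behavior described above, one plausible mapping could look like the following. The 503 for retryable errors is an assumption for illustration, not something stated in this PR.

```go
// One plausible mapping; not necessarily the exact code in this PR.
package errmapsketch

import (
	"net/http"

	"go.opentelemetry.io/collector/consumer/consumererror"
)

func httpStatusFromConsumerError(err error) int {
	if err == nil {
		return http.StatusOK
	}
	if consumererror.IsPermanent(err) {
		// Permanent errors: retrying will not help, so report a 500.
		return http.StatusInternalServerError
	}
	// Non-permanent (retryable) errors: the client may retry later.
	return http.StatusServiceUnavailable
}
```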


andrzej-stencel commented Jun 13, 2023

Regarding this comment from @bogdandrutu on another PR:

I think the problem was not finding out whether one pipeline permanently errored, but what to do in cases like one pipeline succeeding while the other has a retryable error. Do we return a retryable error to the client so they can retry? That will cause duplicate data on one path; what do you do then?

It's worth noting (is it obvious? or is it not correct?) that the component sending the data (e.g. the collector's receiver) is not able to retry/re-send the data to only a selection of consumers, e.g. to those that returned a non-permanent error and not to those that returned success (or a permanent error).

In light of the above, here are the scenarios that I can identify and options for each:

Scenario A: Multiple consumers, some succeeding, some failing with permanent errors and some failing with retryable errors
Scenario B: Multiple consumers, all failing with retryable errors
Scenario C: Single consumer, returning either success, or a permanent error, or a retryable error

For Scenario A, we seem to only have the following two options for the sender of the data:

  1. Re-send the data to all consumers, causing duplication in those that succeeded
  2. Not re-send the data at all, ignoring the retryable errors

In the second case, the sending component also shouldn't be passing the retryable error back to the caller.

I lean towards the second option. We'd need to provide documentation educating users about the consequences of creating pipelines with a fan-out, like using the same receiver in different pipelines of the same signal type, or using more than one exporter in a pipeline.

Scenario B: When all consumers failed with retryable errors, I believe it's important for the sender to be able to pass the retryable error back to the caller, or retry itself, depending on its configuration.

Scenario C: The sending component should be able to act according to the error, i.e., in case of a retryable error, retry (if configured to do so) or return the retryable error back to the caller. If I'm not mistaken, this is what this PR was about from the start.
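
For illustration, here is a sketch of how a fan-out consumer could aggregate per-consumer results following option 2 of Scenario A (and Scenario B when nothing succeeded). This is not the collector's actual fanoutconsumer code; names and structure are assumptions.

```go
// Illustrative aggregation only; not the collector's fanoutconsumer.
package fanoutsketch

import "go.opentelemetry.io/collector/consumer/consumererror"

// aggregate takes one result per consumer (nil means success) and decides
// what to report back to the caller.
func aggregate(results []error) error {
	var firstPermanent, firstRetryable error
	anySucceeded := false
	for _, err := range results {
		switch {
		case err == nil:
			anySucceeded = true
		case consumererror.IsPermanent(err):
			if firstPermanent == nil {
				firstPermanent = err
			}
		default:
			if firstRetryable == nil {
				firstRetryable = err
			}
		}
	}
	// Scenario A, option 2: if any consumer already accepted the data, drop
	// retryable errors so the caller does not retry and create duplicates.
	if anySucceeded {
		return firstPermanent // may be nil, i.e. overall success
	}
	// Scenario B: nobody succeeded; surface the retryable error so the
	// caller can retry (or retry locally, depending on configuration).
	if firstRetryable != nil {
		return firstRetryable
	}
	return firstPermanent
}
```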

@bogdandrutu

I lean towards the second option. We'd need to provide documentation educating users about the consequences of creating pipelines with a fan-out, like using the same receiver in different pipelines of the same signal type, or using more than one exporter in a pipeline.

For exporters that is not a real problem, since with the retry and queuing mechanisms it should not happen (or only very rarely, when one exporter is slower and has its queue full).

@bogdandrutu

@astencel-sumo @jpkrohling before trying to solve these scenarios, I would like to understand when a retryable error can happen.

Also, I am thinking that we may need a way to mark a "health/status" at the pipeline level (based on pipeline ID) that the fanout/router consumer can use. Think of something like this: when the queue is full for an exporter, we mark the entire pipeline (or pipelines) exporting to that exporter as "busy/resource exhausted"; then, when a receiver tries to push data to a fanout consumer that includes that pipeline, we reject the data immediately so it can be retried against a different collector instance without duplication.
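
A sketch of what such a pipeline-level health flag could look like; no such API exists in the collector today, and all names here are purely illustrative.

```go
// Purely illustrative; no such API exists in the collector today.
package pipelinestatus

import "sync"

// Registry tracks a "busy/resource exhausted" flag per pipeline ID.
type Registry struct {
	mu   sync.RWMutex
	busy map[string]bool
}

func NewRegistry() *Registry {
	return &Registry{busy: make(map[string]bool)}
}

// SetBusy would be called, for example, by an exporter whose sending queue
// is full (and cleared again once the queue drains).
func (r *Registry) SetBusy(pipelineID string, busy bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.busy[pipelineID] = busy
}

// IsBusy lets the fanout consumer reject data up front, so the client can
// retry against another collector instance without duplication.
func (r *Registry) IsBusy(pipelineID string) bool {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.busy[pipelineID]
}
```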

I have not thought through all scenarios, but I would like one of you to think about and document:

  • what are the possible retryable errors, and which components should/are allowed to produce them?
  • is it possible to have the status at the pipeline level and avoid, if possible, getting into Scenario A?

@github-actions

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Jul 20, 2023

github-actions bot commented Aug 4, 2023

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Aug 4, 2023