
frequent "read tcp ... read: connection reset by peer" error after upgrading #2194

Closed
cristiangraz opened this issue Jul 2, 2018 · 4 comments

@cristiangraz

Please answer these questions before submitting your issue.

What version of gRPC are you using?

cedd913

What version of Go are you using (go version)?

1.10.1

What operating system (Linux, Windows, …) and version?

Linux Alpine

What did you do?

We use grpc for inter-service communication, and also indirectly via the Google Cloud Go API client, which I'm fairly sure uses grpc to communicate with the Google APIs.

We recently upgraded to cedd913 from 484b3eb. As soon as we upgraded, both service-to-service grpc calls to our own APIs and Google pub/sub workers using the google-cloud-go client started logging these errors fairly frequently. The workers seem to log them after a period of inactivity.

We use exponential backoffs and I notice this pattern: the first attempt fails, the second attempt 1s later fails with the same error, and the third attempt ~2s after the last attempt finally works.
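For reference, the retry wrapper is roughly the following (a simplified sketch; the actual attempt counts and timings are configured per service):

```go
package retry

import (
	"context"
	"time"
)

// callWithBackoff retries a failed call with a delay that roughly doubles
// on each attempt (1s, then ~2s), matching the pattern described above.
func callWithBackoff(ctx context.Context, attempts int, call func(context.Context) error) error {
	backoff := time.Second
	var err error
	for i := 0; i < attempts; i++ {
		if err = call(ctx); err == nil {
			return nil
		}
		select {
		case <-time.After(backoff):
			backoff *= 2
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return err
}
```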

I initially wasn't sure if the issue was with grpc or pub/sub, but as soon as we started updating more of our services we noticed the same errors popping up in more and more logs for service-to-service communication. That leads me to believe the issue might be here.

We don't use any special connection options -- we just grpc.Dial(). The only thing we have beyond the default is mutual TLS configured for grpc calls. When these errors started appearing, we did not make any other updates to grpc, settings, or how we connect -- only updating the version in our glide.lock file.
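To be concrete, the dial setup is essentially the following (certificate paths and the address are placeholders, not our real config):

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"io/ioutil"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

func dial(addr string) (*grpc.ClientConn, error) {
	// Client certificate/key pair for mutual TLS.
	cert, err := tls.LoadX509KeyPair("client.crt", "client.key")
	if err != nil {
		return nil, err
	}

	// CA that signed the server certificates.
	caPEM, err := ioutil.ReadFile("ca.crt")
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		log.Fatal("failed to parse CA certificate")
	}

	creds := credentials.NewTLS(&tls.Config{
		Certificates: []tls.Certificate{cert},
		RootCAs:      pool,
	})

	// No options beyond transport credentials.
	return grpc.Dial(addr, grpc.WithTransportCredentials(creds))
}
```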

What did you expect to see?

grpc communications with no errors.

What did you see instead?

read tcp <ip:port>-><ip:port>: read: connection reset by peer

@lyuxuan (Contributor) commented Jul 2, 2018

This issue looks related to #1362, where the problem was overly aggressive logging rather than a real connection issue. I wonder whether you saw more Dial/RPC failures, or whether you just observed an increase of this line in the log. Could you please provide us with some logs so we can better diagnose what is happening? Thanks!

@cristiangraz (Author)

@lyuxuan Can you elaborate or provide an example of what you mean by "aggressive logging and not a real connection issue"? The logs I'm seeing are written by our application when a grpc request fails.

When we encounter err != nil on a grpc request, we log the error in our application with a message, e.g.

timestamp=2018-07-02T16:09:32.012697389Z level=error message="cannot find orders" context=order:get_all error="read tcp [redacted]->[redacted]: read: connection reset by peer"

(I omitted other fields like request id, user id, etc)
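Roughly, the call site looks like this (the generated client and message types below are placeholders, not our actual service):

```go
package orders

import (
	"context"
	"log"

	pb "example.com/project/gen/orderpb" // placeholder import path for our generated client
)

func getAllOrders(ctx context.Context, client pb.OrderServiceClient) ([]*pb.Order, error) {
	resp, err := client.GetAllOrders(ctx, &pb.GetAllOrdersRequest{})
	if err != nil {
		// Any non-nil error is logged in the key=value format shown above.
		log.Printf(`level=error message="cannot find orders" context=order:get_all error=%q`, err)
		return nil, err
	}
	return resp.Orders, nil
}
```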

Recently, if I'm in our product's dashboard and hit a 500 error (which we return for various unexpected errors), I can often pull up the logs and find an error similar to the one above, linked to the request id in my browser. I often have to retry 1-2x for the request to go through.

With the pub/sub workers, when the error occurs we log it in the same fashion as above, but we also have to let the message retry using our exponential backoff (1-2x) because the job wasn't completed due to the error.

@lyuxuan (Contributor) commented Jul 3, 2018

It seems that this issue is unrelated to aggressive logging in grpc. (We used to have aggressive logging, which resulted in log spamming.)

However, the error you logged does not look like what grpc returns for an RPC call. Errors returned by grpc should follow the format "rpc error: code = %s desc = %s".

Could you please let us know how you make RPC calls and how the error is logged in your application? You can refer to this to learn more about the errors returned by grpc. Thanks!
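As a quick check, you can see whether an error actually came from grpc with something like the following (a minimal sketch using the status package):

```go
package main

import (
	"fmt"

	"google.golang.org/grpc/status"
)

// describeError distinguishes errors produced by grpc from errors produced
// elsewhere in the request path.
func describeError(err error) {
	if s, ok := status.FromError(err); ok {
		// grpc errors carry a code and description and print as
		// "rpc error: code = %s desc = %s".
		fmt.Printf("grpc error: code=%s desc=%s\n", s.Code(), s.Message())
		return
	}
	fmt.Printf("non-grpc error: %v\n", err)
}
```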

@cristiangraz (Author)

Thanks @lyuxuan. The info about the rpc error format was actually really helpful. It turns out this is not a grpc error but is happening as part of handling a grpc request (where the work is done). Appreciate all of your help; going to close this one out as it's not an issue with this library.
