
frequent "read tcp ... read: connection reset by peer" error after upgrading #2194

Closed
cristiangraz opened this issue Jul 2, 2018 · 4 comments

@cristiangraz

Please answer these questions before submitting your issue.

What version of gRPC are you using?

cedd913

What version of Go are you using (go version)?

1.10.1

What operating system (Linux, Windows, …) and version?

Linux Alpine

What did you do?

We use grpc for inter-service communication, and also indirectly via the Google Cloud Go API client, which I'm fairly sure uses grpc to communicate with the Google APIs.

We recently upgraded to cedd913 from 484b3eb. As soon as we upgraded, both service-to-service grpc calls to our own APIs and Google pub/sub workers using the google-cloud-go client started logging these errors fairly frequently. The workers seem to log them after a period of inactivity.

We use exponential backoffs and I notice this pattern: the first attempt fails, the second attempt 1s later fails with the same error, and the third attempt ~2s after the last attempt finally works.
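For reference, the retry wrapper is roughly the following (a simplified sketch; the actual attempt counts and timings are configured per service):

```go
package retry

import (
	"context"
	"time"
)

// callWithBackoff retries a failed call with a delay that roughly doubles
// on each attempt (1s, then ~2s), matching the pattern described above.
func callWithBackoff(ctx context.Context, attempts int, call func(context.Context) error) error {
	backoff := time.Second
	var err error
	for i := 0; i < attempts; i++ {
		if err = call(ctx); err == nil {
			return nil
		}
		select {
		case <-time.After(backoff):
			backoff *= 2
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return err
}
```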

I initially wasn't sure if the issue was with grpc or pub/sub, but as soon as we started updating more of our services we noticed the same errors popping up in more and more logs for service-to-service communication. That leads me to believe the issue might be here.

We don't use any special connection options -- we just grpc.Dial(). The only thing we have beyond the default is mutual TLS configured for grpc calls. When these errors started appearing, we did not make any other updates to grpc, settings, or how we connect -- only updating the version in our glide.lock file.
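To be concrete, the dial setup is essentially the following (certificate paths and the address are placeholders, not our real config):

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"io/ioutil"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

func dial(addr string) (*grpc.ClientConn, error) {
	// Client certificate/key pair for mutual TLS.
	cert, err := tls.LoadX509KeyPair("client.crt", "client.key")
	if err != nil {
		return nil, err
	}

	// CA that signed the server certificates.
	caPEM, err := ioutil.ReadFile("ca.crt")
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		log.Fatal("failed to parse CA certificate")
	}

	creds := credentials.NewTLS(&tls.Config{
		Certificates: []tls.Certificate{cert},
		RootCAs:      pool,
	})

	// No options beyond transport credentials.
	return grpc.Dial(addr, grpc.WithTransportCredentials(creds))
}
```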

What did you expect to see?

grpc communications with no errors.

What did you see instead?

read tcp <ip:port>-><ip:port>: read: connection reset by peer

@lyuxuan (Contributor) commented Jul 2, 2018

This issue looks related to #1362, where the problem was overly aggressive logging rather than a real connection issue. I wonder whether you saw more Dial/RPC failures, or whether you just observed an increase of this line in the log. Could you please provide us with some logs so we can better diagnose what is happening? Thanks!

@cristiangraz (Author)

@lyuxuan Can you elaborate or provide an example of what you mean by "aggressive logging and not a real connection issue"? The logs I'm seeing are written by our application when a grpc request fails.

When we encounter err != nil on a grpc request, we log the error in our application with a message, e.g.

timestamp=2018-07-02T16:09:32.012697389Z level=error message="cannot find orders" context=order:get_all error="read tcp [redacted]->[redacted]: read: connection reset by peer"

(I omitted other fields like request id, user id, etc)
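Roughly, the call site looks like this (the generated client and message types below are placeholders, not our actual service):

```go
package orders

import (
	"context"
	"log"

	pb "example.com/project/gen/orderpb" // placeholder import path for our generated client
)

func getAllOrders(ctx context.Context, client pb.OrderServiceClient) ([]*pb.Order, error) {
	resp, err := client.GetAllOrders(ctx, &pb.GetAllOrdersRequest{})
	if err != nil {
		// Any non-nil error is logged in the key=value format shown above.
		log.Printf(`level=error message="cannot find orders" context=order:get_all error=%q`, err)
		return nil, err
	}
	return resp.Orders, nil
}
```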

Recently, if I'm in our product's dashboard and hit a 500 error (which we return for various unexpected errors), I can often pull up the logs and find an error similar to the one above, linked to the request id in my browser. I often have to retry 1-2x for the request to go through.

With the pub/sub workers, when the error occurs we log it in the same fashion as above, but we also have to let the message retry using our exponential backoff (1-2x) because the job wasn't completed due to the error.

@lyuxuan (Contributor) commented Jul 3, 2018

It seems that this issue is unrelated to aggressive logging in grpc. (We used to have aggressive logging, which resulted in log spamming.)

However, the error you logged does not look like what grpc returns for an RPC call. Errors returned by grpc should follow the format "rpc error: code = %s desc = %s".

Could you please let us know how you make RPC calls and how the error is logged in your application? You can refer to this to learn more about the errors returned by grpc. Thanks!
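As a quick check, you can see whether an error actually came from grpc with something like the following (a minimal sketch using the status package):

```go
package main

import (
	"fmt"

	"google.golang.org/grpc/status"
)

// describeError distinguishes errors produced by grpc from errors produced
// elsewhere in the request path.
func describeError(err error) {
	if s, ok := status.FromError(err); ok {
		// grpc errors carry a code and description and print as
		// "rpc error: code = %s desc = %s".
		fmt.Printf("grpc error: code=%s desc=%s\n", s.Code(), s.Message())
		return
	}
	fmt.Printf("non-grpc error: %v\n", err)
}
```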

@cristiangraz (Author)

Thanks @lyuxuan. The info about the rpc error format was actually really helpful. It turns out this is not a grpc error but is happening as part of handling a grpc request (where the work is done). Appreciate all of your help; going to close this one out as it's not an issue with this library.
