frequent "read tcp ... read: connection reset by peer" error after upgrading #2194
This issue looks related to #1362, where the problem was overly aggressive logging rather than a real connection issue. I wonder whether you saw more Dial/RPC failures, or whether you just observed an increase of this line in the log. Could you please provide us with some logs so we can better diagnose what is happening? Thanks!
@lyuxuan Can you elaborate or provide an example of what you mean by "aggressive logging and not a real connection issue"? The logs I'm seeing are written by our application when a grpc request fails. When we encounter
(I omitted other fields like request ID, user ID, etc.) Recently, if I'm in our product's dashboard and encounter a 500 error (what we return for various unexpected errors), I can often pull up the logs and find an error similar to the one above, linked to the request ID in my browser. I often have to retry 1-2 times for the request to go through. With the pub/sub workers, when the error occurs we log it in the same fashion as above, but we also have to let the message retry using our exponential backoff (1-2 times) because the job wasn't completed due to the error.
It seems that this issue is unrelated to aggressive logging in grpc. (We used to have aggressive logging, which resulted in log spamming.) However, the error you logged does not look like what grpc returns for an RPC call. The error returned by grpc should follow the format Could you please let us know how you make RPC calls and how the error is logged in your application? You can refer to this to learn more about the errors returned by grpc. Thanks!
Thanks @lyuxuan. The info about the
Please answer these questions before submitting your issue.
What version of gRPC are you using?
cedd913
What version of Go are you using (go version)?
1.10.1
What operating system (Linux, Windows, …) and version?
Linux Alpine
What did you do?
We use grpc for inter-service communication but also via the Google Cloud Go API client which I'm pretty sure uses grpc to communicate with the Google APIs.
We recently upgraded to cedd913 from 484b3eb. As soon as we upgraded, both service-to-service grpc calls for our own API as well as Google pub/sub workers using the google-cloud-go API started logging these errors fairly frequently. The workers seem to log them after a period of inactivity.
We use exponential backoffs and I notice this pattern: the first attempt fails, the second attempt 1s later fails with the same error, and the third attempt ~2s after the last attempt finally works.
I initially wasn't sure if the issue was with grpc or pub/sub, but as soon as we started updating more of our services we noticed the same errors popping up for service to service communication in more and more logs. That leads me to believe the issue might be here.
We don't use any special connection options -- we just grpc.Dial(). The only thing we have beyond the defaults is mutual TLS configured for grpc calls. When these errors started appearing, we did not make any other changes to grpc, settings, or how we connect -- only updating the version in our glide.lock file.
What did you expect to see?
grpc communications with no errors.
What did you see instead?
read tcp <ip:port>-><ip:port>: read: connection reset by peer
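For context on the mutual TLS setup mentioned above: the client presents its own certificate and verifies the server against a private CA. A minimal sketch of building such a tls.Config with the standard library is below; the file paths and function name are hypothetical, and in practice the resulting config would be wrapped with credentials.NewTLS and passed to grpc.Dial via grpc.WithTransportCredentials.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"os"
)

// mutualTLSConfig builds a tls.Config carrying a client certificate and
// a custom CA pool -- the kind of config used for mutual TLS grpc calls.
func mutualTLSConfig(certFile, keyFile, caFile string) (*tls.Config, error) {
	// Client certificate and key, presented to the server.
	cert, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return nil, err
	}
	// CA certificate(s) used to verify the server.
	caPEM, err := os.ReadFile(caFile)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, fmt.Errorf("no CA certs parsed from %s", caFile)
	}
	return &tls.Config{
		Certificates: []tls.Certificate{cert},
		RootCAs:      pool,
	}, nil
}

func main() {
	// Hypothetical paths for illustration; these files do not exist here,
	// so this prints an error rather than a config.
	cfg, err := mutualTLSConfig("client.crt", "client.key", "ca.crt")
	fmt.Println(cfg, err)
}
```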