Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

internal: fix GO_AWAY deadlock #2391

Merged
merged 2 commits into from
Oct 19, 2018
Merged

Conversation

jeanbza
Copy link
Member

@jeanbza jeanbza commented Oct 19, 2018

A deadlock can occur when a GO_AWAY is followed by a connection closure. This
happens because onClose needlessly closes the current ac.transport: if a
GO_AWAY already occured, and the transport was already reset, then the later
closure (of the original address) sets ac.transport - which is now healthy -
to nil.

The manifestation of this problem is that picker_wrapper spins forever trying
to use a READY connection whose ac.transport is nil.

s1.Stop()

// Wait for client to close.
time.Sleep(100 * time.Millisecond)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this work instead? <-stream.Context().Done() ?

Copy link
Member Author

@jeanbza jeanbza Oct 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By stream are you suggesting, stream, err := client.FullDuplexCall(ctx)?

If so, it does not appear to ever finish.

edit: Is there something you're supposed to do to the stream other than server.Stop() to cause it to unblock?

test/end2end_test.go Outdated Show resolved Hide resolved
test/end2end_test.go Outdated Show resolved Hide resolved
test/end2end_test.go Outdated Show resolved Hide resolved
A deadlock can occur when a GO_AWAY is followed by a connection closure. This
happens because onClose needlessly closes the current ac.transport: if a
GO_AWAY already occured, and the transport was already reset, then the later
closure (of the original address) sets ac.transport - which is now healthy -
to nil.

The manifestation of this problem is that picker_wrapper spins forever trying
to use a READY connection whose ac.transport is nil.
@jeanbza
Copy link
Member Author

jeanbza commented Oct 19, 2018

PTAL

@@ -7055,7 +7056,8 @@ func TestGoAwayThenClose(t *testing.T) {

client := testpb.NewTestServiceClient(cc)

// Should go on connection 1.
// Should go on connection 1. We use a long-lived RPC because it will cause GracefulStop to send GO_AWAY, but the
// connection doesn't get closed until the server stops and the client receives.
stream, err := client.FullDuplexCall(ctx)
if err != nil {
t.Fatalf("UnaryCall(_) = _, %v; want _, nil", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*FullDuplexCall

@jeanbza jeanbza merged commit ff2aa05 into grpc:master Oct 19, 2018
dfawley added a commit to dfawley/grpc that referenced this pull request Oct 25, 2018
When security is disabled, not waiting for the HTTP/2 handshake can lead to
DoS-style behavior.  For details, see:
grpc/grpc-go#954.  This requirement will incur an
extra half-RTT latency before the first RPC can be sent under plaintext, but
this is negligible and unencrypted connections are rarer than secure ones.

Under TLS, the server will effectively send its part of the HTTP/2 handshake
along with its final TLS "server finished" message, which the client must wait
for before transmitting any data securely.  This means virtually no extra
latency is incurred by this requirement.

Go had attempted to separate "connection ready" with "connection successful"
(Issue: grpc/grpc-go#1444 PR:
grpc/grpc-go#1648).  However, this is confusing to
users and introduces an arbitrary distinction between these two events.  It has
led to several bugs in our reconnection logic (e.g.s
grpc/grpc-go#2380,
grpc/grpc-go#2391,
grpc/grpc-go#2392), due to the complexity, and it makes
custom transports (grpc/proposal#103) more difficult
for users to implement.

We are aware of some use cases (in particular,
https://github.com/soheilhy/cmux) expecting the behavior of transmitting an RPC
before the HTTP/2 handshake is completed.  Before making behavior changes to
implement this, we will reach out to our users to the best of our abilities.
@lock lock bot locked as resolved and limited conversation to collaborators Apr 23, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants