Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

connect() & timeout #20

Closed
LeeTibbert opened this issue Sep 4, 2022 · 17 comments
Closed

connect() & timeout #20

LeeTibbert opened this issue Sep 4, 2022 · 17 comments
Labels
bug Something isn't working

Comments

@LeeTibbert
Copy link
Collaborator

Is there a culturally compatible way to give a timeout value, or a re-try value
to connect()?

epoll is not Java, but Java can provide some guidance as to what
has proven helpful to large groups of people over time.

Java Socket class has connect(SocketAddress endpoint, int timeout).

There is a place in TcpSuite test("server-client ping-pong") where
it calls: _ <- ch.connect(serverAddr) which in turn calls
a more complicated underlying (4 arg?) version.

I was debugging a network related problem yesterday.
The issue was that the Suite was hanging/ceasing_to_make_progress.
To simplify things, I commented out all tests but "pong-pong".
At one point, I had thought the hang was due to the connect()
not timing out.

Turns out the problem was bad info going to connect() rather
than lack of a timeout. However, most networking programs
I know are defensive and have a time out

So the issue is, if culturally appropriate:

  1. no timeout parameter to connect (and probably the
    chain of things it calls, which could get interesting/complicated)

  2. TcpSuite probably has multiple connect() calls which, understandably, do not call this hypothetical timeout.

@armanbilge
Copy link
Owner

The issue was that the Suite was hanging/ceasing_to_make_progress.

Hanging is not good :) we should add this to our regression tests and fix it. What does the JVM do here?

@armanbilge armanbilge added the bug Something isn't working label Sep 4, 2022
@LeeTibbert
Copy link
Collaborator Author

What does the JVM do here

Interesting question.

In the specific ping-pong case here the error never triggers because
the information sent to connect() is not garbage.

In the general case:
JVM has at least two connect: one with and one without timeout.

In the connect(where, timeout), the timeout should happen. Directly and not
in an Exception path. Much easier to predict, handle, and recover from.

What happens without timeout specified is that there can be a long,
unpredictable wait, with eventual error reported.

UDP sockets are not usually connected, so I have no idea there.
I think the connect is an internal bookeeping & security thing
on the local system. There may be a name lookup and trip-over-the-wire.

The case of interest is TCP. I will skip over recent efforts to make
the handshake more efficient and talk about the classic TCP
three way handshake at connection establishment. I think there
are at least three trips over-the-wire. Each of those have an
internal timeout. I do not know off the top of my head what those
are, because I try to always code in such a way that I am not
exposed to them.

The local sends a connection message to the remote. If
it does not get a response in two timeout periods, it reports
an error back to the caller. Normal case.

If the local tcp gets an acknowledgement back, it sends
the third packet and waits, I think two timeout periods
for an ACK of the data in the third packet. I think
I am describing an early correction to the protocol.
The first variants may have required additional timeouts.

The TL;DR is that if the wire breaks after a certain
point in the connection process, general mayhem
ensues, each with a timeout.

The other issue about whatever connection error not
eventually bubbling up to the user is more immediate.
This one is a good to have general programming practice.

@armanbilge
Copy link
Owner

JVM has at least two connect: one with and one without timeout.

Sorry, where do you see this? I'm staring hard at the JavaDoc for connect(...). I do see two methods: one that takes a CompletionHandler and one that returns a Future. But nothing related to timeouts.

https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/nio/channels/AsynchronousSocketChannel.html#connect(java.net.SocketAddress,A,java.nio.channels.CompletionHandler)

@LeeTibbert
Copy link
Collaborator Author

Ah, I see "Channels" original and still useful concept is:
https://docs.oracle.com/javase/8/docs/api/java/net/Socket.html

I do not know channels well enough if they themselves even
know about the concept of timeout.

The accept server timeout is important. The importance of this one
depends on if connect reliably throws errors. I it does not, then the
importance slides up, but still well below accept.

@armanbilge
Copy link
Owner

@LeeTibbert do you have a reproduction for this issue as well? I am having trouble replicating, because left to my own devices I haven't got a clue 😀

@LeeTibbert
Copy link
Collaborator Author

The old "works fine on my machine" excuse, eh?

The one line reproducer you used earlier works on my machine.

./core/src/main/scala/epollcat/internal/ch/EpollAsyncSocketChannel.scala://          
toCString("8080"), // force a connection error, as long as nobody is running an HTTPS server.

You should see about a two minute hang, the, given your earlier PRs, the client
throws a ConnectException, which gets displayed as a traceback in the test
and cancels the accept, failing the test.

@armanbilge
Copy link
Owner

Oh, gosh, it's the same reproducer! Ok will give that a try thanks.

@armanbilge
Copy link
Owner

Huh, I think this is a case of works fine on my machine! 😛 here's the change I made to connect

diff --git a/core/src/main/scala/epollcat/internal/ch/EpollAsyncSocketChannel.scala b/core/src/main/scala/epollcat/internal/ch/EpollAsyncSocketChannel.scala
index 4397375..46eb7df 100644
--- a/core/src/main/scala/epollcat/internal/ch/EpollAsyncSocketChannel.scala
+++ b/core/src/main/scala/epollcat/internal/ch/EpollAsyncSocketChannel.scala
@@ -191,7 +191,8 @@ final class EpollAsyncSocketChannel private (fd: Int)
         .netdb
         .getaddrinfo(
           toCString(addr.getAddress().getHostAddress()),
-          toCString(addr.getPort.toString),
+          // toCString(addr.getPort.toString),
+          c"8080",
           hints,
           addrinfo
         )

Which immediately gets me this result (notice the 0s at the bottom), no hanging.

sbt:root> testsNative/testOnly *.TcpSuite
[info] Starting process '/workspace/epollcat/tests/native/target/scala-2.13/tests-test-out' on port '35211'.
[info] Starting process '/workspace/epollcat/tests/native/target/scala-2.13/tests-test-out' on port '37603'.
epollcat.TcpSuite:
==> X epollcat.TcpSuite.server-client ping-pong 0.00s java.net.ConnectException: Connection refused
    at java.lang.StackTrace$.currentStackTrace(Unknown Source)
    at java.lang.Throwable.fillInStackTrace(Unknown Source)
    at epollcat.internal.ch.EpollAsyncSocketChannel.$anonfun$connect$2(Unknown Source)
    at epollcat.internal.ch.EpollAsyncSocketChannel$$Lambda$4.run(Unknown Source)
    at epollcat.internal.ch.EpollAsyncSocketChannel.notifyEvents(Unknown Source)
    at epollcat.unsafe.EpollExecutorScheduler.poll(Unknown Source)
    at cats.effect.unsafe.PollingExecutorScheduler.loop(Unknown Source)
    at cats.effect.unsafe.PollingExecutorScheduler.$anonfun$scheduleIfNeeded$1(Unknown Source)
    at cats.effect.unsafe.PollingExecutorScheduler$$Lambda$1.run(Unknown Source)
    at scala.scalanative.runtime.ExecutionContext$.loop(Unknown Source)
    at scala.scalanative.runtime.package$.loop(Unknown Source)
    at scala.scalanative.testinterface.NativeRPC.loop(Unknown Source)
    at scala.scalanative.testinterface.TestMain$.main(Unknown Source)
    at scala.scalanative.testinterface.TestMain.main(Unknown Source)
    at <none>.main(Unknown Source)
    at <none>.__libc_start_main(Unknown Source)
    at <none>._start(Unknown Source)
[error] Failed: Total 1, Failed 1, Errors 0, Passed 0
[error] Failed tests:
[error]         epollcat.TcpSuite
[error] (testsNative / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 0 s, completed Sep 6, 2022, 11:23:39 PM

@armanbilge
Copy link
Owner

Ok, I was able to replicate on both JVM and Native.

I modified the "HTTP echo" test to attempt to connect to postman at 8080.

  test("HTTP echo".only) {
    val address = new InetSocketAddress(InetAddress.getByName("postman-echo.com"), 8080)

I also increased the timeout for the test runner (default is 30s, only works for JVM).

  override val munitTimeout = 1.hour

Which gets me this result:

sbt:root> testsJVM/testOnly *.TcpSuite
[info] compiling 1 Scala source to /workspace/epollcat/tests/jvm/target/scala-3.1.3/test-classes ...
connecting
epollcat.TcpSuite:
==> X epollcat.TcpSuite.HTTP echo  129.353s java.net.ConnectException: Connection timed out

There are your two minutes, give or take 😉

So if I am understanding correctly, I believe we are matching JVM behavior here. In which case, what to do?

If this is a nuisance for our test suite, I can investigate how to make the timeouts work properly.

@LeeTibbert
Copy link
Collaborator Author

LeeTibbert commented Sep 7, 2022

Now that is a cunning way to replicate.

Matching the JVM behavior is good.

POSIX says " unspecified timeout interval until the connection is established"

I think that is exactly what we are seeing.

In C, there is only the one `connect(args) method and
one has to set the socket non-blocking and poll/select.
A more recent and elegant solution is to get a timerfd
and then epoll for either becoming ready.

Java solves this problem by having two connect. One is the
connect(SocketAddress endpoint) which the current code emulates.
The other is connect(SocketAddress endpoint, int timeout).
The test program should be calling the equivalent of this
two arg method.

Implementing an async connect_with_timeout may be a pain/time_suck.
I have been ratting around to see if there is a cats-effect IO way.

The two arg epoll could be
Something like IO(posix.connect(args)).timeout(n) // where n is a FiniteDuration passed in.

Then the test program would call `connect(args, FiniteDuration(20, Seconds).

A Java style connect-with-timeout could be used in multiple tests within TcpSuite.scala.
It would simplify those tests and might make a down payment on its implementaton cost.

on re-reading, I realize that I mangled the argument types, but the central idea
is still clear.

@LeeTibbert
Copy link
Collaborator Author

Let me do a quick look to see if there is a C level socket option.

@armanbilge
Copy link
Owner

The test program should be calling the equivalent of this
two arg method.

The problem is the two arg method doesn't exist in the JDK AsynchronousSocketChannel API. So I don't know what we can do about that 😕

If you are satisfied that this is something we only need to fix in our test suite (and not the socket implementations) then I can make an upstream PR to munit-cats-effect to add support for this.

@LeeTibbert
Copy link
Collaborator Author

LeeTibbert commented Sep 7, 2022

This discussion sounds familiar. Let me check the JDK AsynchronousSocketChannel you mentioned.
Something is telling me that it does not have the timeout variant.

If that is true, I suggest that we push this to the backburner. Now that the server will both show tracebacks
and timeout itself, this is less critical. Especially now that we know it is 2 minutes or so and not infinity.

I have been looking in the Scala Native javalib to see how it is implemented there.
AbstractPlainSocketImpl.scala UnixPlainSocketImpl.scala. That code uses poll to
implement the timeout.

To my thinking kexec is likely to make more people happy.

As always, thank you for spending time on this. Sometimes one just has to say
"best that could be done, given constraints".

PS: no obvious socket level option. That would have been too easy.

PPS: AsynchronousSocketChannel has two connect() methods but
neither specifies a timeout.

@LeeTibbert
Copy link
Collaborator Author

Never leave a bone with a hungry dog.

On linux the connection timeout period for the IPv4 the test uses is a kernel parameter: tcp_syn_retries
URL: https://sysctl-explorer.net/net/ipv4/tcp_syn_retries/

Number of times initial SYNs for an active TCP connection attempt will be retransmitted. Should not be higher than 127. Default value is 6, which corresponds to 63seconds till the last retransmission with the current initial RTO of 1second. With this the final timeout for an active TCP connection attempt will happen after 127 seconds.

That 127 seconds looks very familiar.

So now we have an answer for people like me who complain about long timeout: decrease your kernel parameter.
That was a grim attempt at a joke. Hard to do a dead pan face in type.

We will see this again, no doubt. Now we know what to tell people.

@LeeTibbert
Copy link
Collaborator Author

If it is agreeable to you, I may submit a PR for TcpSuite with a short explanation that
connect errors may take up to slightly more than 2 minutes. So anything less than
a 2:30 or so wait is not a hang.

My backbrain suggested this to me as a way to set expectations for people running
the test in the wild, or using it as a model, and for my future self.

@armanbilge
Copy link
Owner

@LeeTibbert I've opened this PR so that we can set a proper timeout for our test suite, to prevent hangs like this:

I will publish one of my bootlegs later and then we can give it a try, see what you think.

@LeeTibbert
Copy link
Collaborator Author

Agree with closure. Will exercise the timeout when it is soup.
Bathtub code?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants