EthernetServer accept no longer connects clients after unplugging/plugging ethernet cable ~7 times #15

Closed
SpenceV1 opened this issue Jun 14, 2022 · 33 comments

Comments

@SpenceV1

EthernetServer accept function will no longer return new clients after doing the following steps.

  1. Program Teensy with "ServerWithListeners" example; modify ip to your liking
  2. Connect a device to the Teensy server with a simple client which connects and sends a message every ~1 second
  3. Disconnect ethernet cable
  4. Connect ethernet cable
  5. Repeat steps 2-4 until Teensy no longer accepts connections with server.accept()

This would affect any long-running application that may occasionally be unplugged from the network. I have poked around in some of the TCP files with a debugger, and I see that the Teensy is receiving the data I send every second, but the library does not seem to accept the connection.

@SpenceV1
Author

SpenceV1 commented Jun 14, 2022

I repeated the steps with some TCP debugging options enabled and caught this.
tcp_listen_input: could not allocate PCB

LWIP_DEBUGF(TCP_DEBUG, ("tcp_listen_input: could not allocate PCB\n"));

@ssilverman
Owner

Thanks for this report. There's a limited number of total connections. Are you certain the other connections are closed?

@ssilverman
Owner

Thinking some more (I'm not able to test just yet): What happens after you wait 2 minutes after you see this error? Technically, those connections are still active even if the Ethernet plug gets unplugged. There may be a 2-minute timeout; have you tried waiting this long after the last connection made and after you unplug the cable?

@SpenceV1
Author

SpenceV1 commented Jun 15, 2022

In the example I am using to test this issue, you call client.stop() after 5 seconds of no input from the client. In my program I call client.close() when the link state changes to false, and both programs have this issue. Is that what you mean by connection closed?

I did what you suggested and waited to see if the client would ever connect, and it did after ~32 minutes. I repeated this twice with the same results.

@ssilverman
Owner

Thanks for the extra info. Are you saying things are repaired after that ~32 minutes? After it’s “repaired”, does the same thing happen after unplugging and re-plugging the cable again a few times?

@SpenceV1
Author

Correct, after 32 minutes my client connects to the Teensy and everything seems to operate as normal. I am then able to repeat the steps in my original post, unplugging and plugging back in the ethernet cable ~7 times before the issue repeats itself.

@SpenceV1
Author

SpenceV1 commented Jun 15, 2022

Here is a more complete log after connecting the ethernet cable for the last time and failing to connect.

[Ethernet] Link ON
TCP connection request 64809 -> 80.
tcp_alloc: killing off oldest TIME-WAIT connection
tcp_alloc: killing off oldest LAST-ACK connection
tcp_alloc: killing off oldest CLOSING connection
tcp_alloc: killing oldest connection with prio lower than 64
tcp_listen_input: could not allocate PCB
tcp_slowtmr: processing active pcb
tcp_slowtmr: polling application
tcp_slowtmr: processing active pcb
tcp_slowtmr: polling application
tcp_slowtmr: processing active pcb
tcp_slowtmr: polling application
tcp_slowtmr: processing active pcb
tcp_slowtmr: polling application
tcp_slowtmr: processing active pcb
tcp_slowtmr: polling application
tcp_slowtmr: processing active pcb
tcp_slowtmr: polling application

LWIP_DEBUGF(TCP_DEBUG, ("tcp_alloc: killing off oldest TIME-WAIT connection\n"));

@SpenceV1
Author

SpenceV1 commented Jun 15, 2022

Correct me if I am wrong on any of this. I learned that "~7" times corresponds to the maximum number of TCP PCBs, which is set to 8. It looks like I am hitting the maximum and receiving an error when memp tries to make space for a new PCB but can't.
This is the error I am receiving.

memp_malloc: out of memory in pool TCP_PCB

Looks like this is where the issue is showing itself, but I assume the actual issue lies somewhere else, not sure if I can trace much further without digging deep.

pcb = (struct tcp_pcb *)memp_malloc(MEMP_TCP_PCB);

There must be an issue freeing up TCP pcb when the ethernet cable gets pulled? But something is clearly forcing them to free up after some time ~30mins? Maybe there is a way to force them to free up? Even though it looks like it tries to free up space after the line shown above and fails to do so.

@SpenceV1
Author

Log when the pcb finally gets purged and everything goes back to normal. (The 30ish minute wait)

tcp_slowtmr: max DATA retries reached
tcp_pcb_purge
tcp_pcb_purge: data left on ->unacked

corresponding to

LWIP_DEBUGF(TCP_DEBUG, ("tcp_slowtmr: max DATA retries reached\n"));

and
LWIP_DEBUGF(TCP_DEBUG, ("tcp_pcb_purge: data left on ->unacked\n"));

@ssilverman
Owner

Question: Are you using DHCP, and if so, do you see the address change ever, whenever plugging the Ethernet back in?

@ssilverman
Owner

ssilverman commented Jun 18, 2022

I'm having trouble reproducing this. Can you tell me more about the client you're using to test connections? I'm using a browser and reloading manually around every second while unplugging and plugging in the cable. (Static IP.)

Could you also tell me more about how you modified the ServerWithListeners example?

@SpenceV1
Author

SpenceV1 commented Jun 18, 2022

I have the Teensy directly connected to a computer, both with static IPs. Not sure if it matters, but I am using a USB to ethernet adapter for this.

The only thing I changed in ServerWithListeners is the following:

IPAddress staticIP{169, 254, 1, 77};//{192, 168, 1, 101};
IPAddress subnetMask{255, 255, 255, 0};
IPAddress gateway{169, 254, 1, 1};

Here is a C# client I threw together to send "hi" every second and reconnect on error. The original client I discovered this issue with was in LabVIEW, but this one has the same issue.

using System;
using System.Net.Sockets;

namespace socket
{
    class Program
    {
        static void Main(string[] args)
        {
            while (true)
            {
                TcpClient client = null;
                NetworkStream stream = null;
                bool connected = false;
                Byte[] data = System.Text.Encoding.ASCII.GetBytes("hi");
                try
                {
                    Console.WriteLine("Connecting...");
                    client = new TcpClient("169.254.1.77", 80);
                    stream = client.GetStream();
                    connected = true;
                    Console.WriteLine("Connected");
                }catch (Exception e){
                    Console.WriteLine("Failed to connect");
                }
                while (connected)
                {
                    try
                    {
                        stream.Write(data, 0, data.Length);
                        Console.WriteLine("Sent: {0}", "hi");
                        System.Threading.Thread.Sleep(1000);
                    }catch(Exception e)
                    {
                        connected = false;
                        stream.Close();
                        client.Close();
                        Console.WriteLine("Disconnected...");
                    }
                }
                System.Threading.Thread.Sleep(1000);
            }
        }
    }
}

@ssilverman
Owner

ssilverman commented Jun 18, 2022

I've not been able to reproduce the problem.

My Java code (Java 17):

/*
 * Created by shawn on 6/18/22 10:38 AM.
 */

import java.io.IOException;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.Socket;
import java.net.UnknownHostException;

public class Main {
  private static final String ADDRESS = "change-to-your-address";
  private static final int PORT = 80;

  private static final byte[] data = { 'h', 'i' };

  public static void main(String[] args) throws UnknownHostException {
    InetAddress addr = InetAddress.getByName(ADDRESS);

    while (true) {
      System.out.println("Connecting...");
      try (Socket socket = new Socket(addr, PORT)) {
        System.out.println("Connected.");
        try (OutputStream out = socket.getOutputStream()) {
          int count = 0;
          while (true) {
            out.write(data);
            out.flush();
            System.out.println((++count) + ": Wrote data");
            Thread.sleep(1000);
          }
        } catch (IOException ex) {
          System.out.println("Output: " + ex);
        }
        Thread.sleep(1000);
      } catch (IOException ex) {
        System.out.println("Socket: " + ex);
      } catch (InterruptedException ex) {
        Thread.currentThread().interrupt();
        break;
      }
    }
  }
}

ServerWithListeners example changes:

  1. Set staticIP, subnetMask, and gateway to something appropriate to my network.
  2. That's it.

For my testing procedure, I plugged and unplugged the cable multiple times with various timings:

  1. More rapidly,
  2. Until the Teensy program recognizes a timeout, and
  3. Combinations of the above.

My Java program can always reconnect. There's a case where I leave the cable unplugged and it takes about a minute for the socket to realize there's nothing connected and then it properly complains of a "broken pipe". Also, if I leave the cable unplugged while the Java program is running, wait for a timeout on the Teensy side, and then reconnect the cable, the Java program shortly realizes there's a "broken pipe" and then restarts the connection.

What versions of QNEthernet and Teensyduino are you using? Additionally, I'm running these tests on a Mac. On what hardware are you running your test program? I wonder if there's a difference in the client-side TCP/IP stacks we're using.

@SpenceV1
Author

SpenceV1 commented Jun 18, 2022

I am using Windows 10 PCs, Teensyduino 1.56 and Arduino 1.8.16.

I just ran a few tests plugging directly into the Ethernet jack on the PC instead of using the USB adapter, and the issue was still there. I tried a different PC to make sure no software was causing the issue, and the issue was still there. Then I tried connecting the Teensy to a modem/router, changing the IP/gateway, and ran the test again. This method allows me to unplug/re-plug the ethernet cable many times without issue, but I believe it does not act in the same way as a direct connection, which is my current configuration. Unfortunately, I don't have a Mac that I can use to test this; I would be interested to know if that makes the difference.

@ssilverman
Owner

Could you re-try with the latest Teensyduino 1.56? Just to be sure you have the latest of everything. Also, what version of QNEthernet do you have?

I'll re-try my tests by plugging the Ethernet into my laptop (via one of those Belkin USB-C Ethernet adapters).

@ssilverman
Owner

ssilverman commented Jun 18, 2022

I just did similar tests, but with the Teensy connected with Ethernet directly to my computer via that USB-C Ethernet adapter. The only problem I saw was the Java program not being able to connect—I saw this once. I simply restarted the Java program and things returned to normal. Does that sound like what you're seeing sometimes?

Which lwIP debugging options did you turn on when you built the project?

@SpenceV1
Author

I am now using Teensyduino 1.56 (previously 1.55) with Arduino 1.8.16 and QNEthernet 0.14.0. I ran the test again and the issue is still there. Restarting the client program does not fix the issue in my case.

At the moment I do not have any debugging options enabled. I think the ones I enabled previously were MEMP_DEBUG, TCP_DEBUG and a few others.

@ssilverman
Owner

ssilverman commented Jun 22, 2022

I wonder how often the Ethernet.loop() function is being called. It's called automatically after each call to your program's loop() (inside yield()), and internally inside many of the Ethernet/Client/Server functions. But if you have some loop somewhere inside loop() that isn't letting loop() finish, nor is calling any network functions, then that could explain the long delay.

That function is where all the network processing is done (as opposed to from ISRs).

@SpenceV1
Author

I believe I know what is happening now. I previously found out that the PCBs were not freeing up; today I checked their state and they are all stuck in fin_wait_1. After doing some searching, I found that this is known behavior. When I unplug the ethernet cable, the Teensy sends out a FIN and waits in state fin_wait_1 until it receives an ACK, but it never receives one since the link was disconnected. The only other thing that will free them up is hitting the maximum number of retransmissions, which takes a while. Here are posts I found explaining what I believe to be the problem.

http://savannah.nongnu.org/bugs/?func=detailitem&item_id=44092
https://savannah.nongnu.org/bugs/?31487
http://savannah.nongnu.org/bugs/?44092

I'm still trying to figure out what to do to circumvent this. It seems your Mac handles this differently, and the same is true if I plug into a router instead of using a direct connection/network switch.

@ssilverman
Owner

ssilverman commented Jun 28, 2022

Definitely a tricky problem. Enabling the keepalive option seems busy, but are there good TCP_MAXRTX values that work for you? (Adding this here for future readers of the thread; I'm imagining you're already exploring these options.)

I certainly wonder how the Mac and router handle these differently.
Edit: Maybe FIN replies are sent much more often?

@ssilverman
Owner

ssilverman commented Jun 28, 2022

@SpenceV1 Would it be useful to you if I added a way to call tcp_abort() from the API? This way the application layer can kill TCP connections when it deems that it's been too long after close(). It sounds like close() has the timeout problem (is this correct, as you've seen things?), but an abort() may not.

I might couple this with inactive polling callbacks (see tcp_poll()). Thinking about how I'd do this...

@SpenceV1
Author

Yes, I was exploring my options the other day. Keepalive, TCP_MAXRTX, and rebooting seem to be the simple options. I don't think keepalive is sufficient for me; it didn't seem to shorten the time it took for the PCBs to be freed. Setting TCP_MAXRTX lower definitely helps, and I think this is the best solution that doesn't go against the TCP protocol. My application has pretty quick timings, so I don't think I will ever want a packet being resent 1+ minute later anyway.
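
To get a feel for why a lower TCP_MAXRTX shortens the wait, here is a minimal sketch that adds up the retransmission timeouts under simple doubling with a cap. The function name, the 1-second starting timeout, and the 64-second cap are assumptions chosen for illustration; lwIP derives its timeouts from the measured round-trip time and its own backoff table, so real totals (such as the ~32 minutes reported above) will differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Total time (ms) spent waiting across `maxRetries` retransmissions when the
// timeout doubles after each attempt and is clamped to `capMs`.
// NOTE: illustrative model only, not lwIP's actual timer arithmetic.
std::uint64_t totalRetryTimeMs(std::uint32_t initialRtoMs,
                               int maxRetries,
                               std::uint32_t capMs) {
  assert(maxRetries >= 0);
  std::uint64_t total = 0;
  std::uint64_t rto = initialRtoMs;
  for (int i = 0; i < maxRetries; ++i) {
    // Wait for the current timeout, never longer than the cap
    total += std::min<std::uint64_t>(rto, capMs);
    // Double the timeout for the next attempt, again clamped to the cap
    rto = std::min<std::uint64_t>(rto * 2, capMs);
  }
  return total;
}
```

With these made-up inputs, `totalRetryTimeMs(1000, 12, 64000)` comes to 447 seconds (about 7.5 minutes), while `totalRetryTimeMs(1000, 4, 64000)` comes to 15 seconds. The point is only that the last few retries dominate the total, so trimming the retry count cuts the wait sharply.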

@ssilverman
Owner

ssilverman commented Jun 28, 2022

Sorry, I was probably updating my above comment after you responded. What do you think about my tcp_abort() and tcp_poll() (called periodically for idle connections) thoughts?

You could also possibly call “Abort” instead of “Close” in the link-off detection.

@SpenceV1
Author

Right, calling close() is the start of the issue. Close tries to send out FIN and never seems to receive an ACK so the pcb is stuck in fin_wait_1. Although I'm not sure why it would never get anything back once the cable is re-plugged and there are still retries.

It could help to have a function to call tcp_abort(). What happens naturally if I wait for the max retries is that tcp_slowtmr hits this line, which seems to clear out the unacked packets and free up the pcb.

} else if (pcb->nrtx >= TCP_MAXRTX) {

I'm not really sure what would be best. It sounds like tcp_poll might be appropriate, but anything we do might be breaking the flow of the TCP protocol. If I used tcp_abort(), I would probably use it when the link gets disconnected, since I know this is a problem.

@SpenceV1
Author

Since this issue probably won't come up often and most of the time would be solved by waiting, it is probably sufficient for me to lower TCP_MAXRTX. If you provide the ability for me to call tcp_abort(), I will probably use it to be sure, but my ideal flow would be to abort only the oldest connection when a new one comes in and we don't have space for another pcb, similar to how tcp_alloc() in tcp.c handles connection states such as TIME-WAIT and LAST-ACK.
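
The "abort the oldest connection when the pool is full" flow described above could look roughly like the following sketch. `EvictingPool` and its integer connection ids are hypothetical stand-ins, not part of QNEthernet or lwIP; the class only models the policy, using the 8-PCB limit mentioned earlier in the thread.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Fixed-capacity pool that drops the oldest connection to make room for a new
// one. Connections are represented by non-negative ids for illustration.
class EvictingPool {
 public:
  explicit EvictingPool(std::size_t capacity) : capacity_(capacity) {
    assert(capacity > 0);
  }

  // Adds a connection; returns the id of the evicted connection, or -1 if
  // there was still room and nothing had to be evicted.
  int add(int id) {
    assert(id >= 0);
    int evicted = -1;
    if (conns_.size() == capacity_) {
      evicted = conns_.front();  // Oldest entry; real code would abort it here
      conns_.pop_front();
    }
    conns_.push_back(id);
    return evicted;
  }

  std::size_t size() const { return conns_.size(); }

 private:
  std::size_t capacity_;
  std::deque<int> conns_;  // Front is the oldest connection
};
```

With a capacity of 8, the first eight calls to `add()` return -1, and the ninth evicts id 0 instead of failing the way the exhausted PCB pool did in the logs above. The trade-off is that a healthy but old connection could be dropped, so a real policy would likely prefer connections that are already closing.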

@ssilverman
Owner

Maybe the best option is to add a section to the README describing how to address this with either keepalive or by lowering TCP_MAXRTX. I might save “Abort” for another day unless it’s really needed. What do you think of this plan?

@SpenceV1
Author

SpenceV1 commented Jun 28, 2022

I'm going back and forth on this one. If someone connects 5 clients to one Teensy and unplugs, then re-plugs the ethernet cable one time, they will not be able to reconnect all of the clients until the original connections time out. It would be great to have someone verify that this is a Windows-specific issue. I fired up Wireshark and noticed FIN flags (some retransmissions), but I didn't see any ACKs going back to the Teensy. Windows may be killing the connection as soon as it no longer sees a cable connected; I tried looking this up and this is the only thing I could find: https://stackoverflow.com/a/438212 This would explain why adding a router between the Teensy and PC would potentially solve this issue, since the Windows PC would still see a cable connected on its end. If Windows is aborting connections on ethernet unplug, I feel it would be appropriate to do the same. I will try to see whether a network switch behaves the same way, as this is how the actual application will run, although it may still depend on which cable you disconnect.

@SpenceV1
Author

SpenceV1 commented Jul 1, 2022

I tested this using my actual setup of one Windows 10 PC connected to multiple Teensy 4.1 with an unmanaged switch. I am able to disconnect and reconnect the ethernet cable going between a Teensy and the switch many times with no problem. I believe the Windows TCP stack is most likely the issue as described previously.

@ssilverman
Owner

Thanks for diagnosing. Good to know of this issue. I’ll have an abort() implementation for you to try when I have a chance to push it into a test branch. Might be a few days.

@ssilverman
Owner

ssilverman commented Jul 14, 2022

@SpenceV1 I've added an EthernetClient::abort() function and a new section to the README. Before I push, I'd love your critique on the section contents. I'm going for accuracy and also a good flow. This is a first draft:

## On connections that hang around after cable disconnect

Ref: [EthernetServer accept no longer connects clients after unplugging/plugging ethernet cable ~7 times](https://github.com/ssilverman/QNEthernet/issues/15)

TCP uses various mechanisms to maintain connections, even when the physical
connection is unreliable. This includes such things as timeouts, retries, and
exponential backoff.

It turns out that some systems drop and forget a connection when the physical
link is disconnected. This means that the other side may still be waiting to
continue the connection.

The above link contains a discussion where a user of this library couldn't
accept any new connections until all the current connections timed out after
about a half hour. What happened was this: connections were being made, the
Ethernet cable was disconnected and reconnected, and then more connections were
being made. The Teensy side still maintained connection state for all the
connections, choosing to do what TCP does: make a best effort to maintain those
connections. Once all the available sockets had been exhausted, no more
connections could be accepted.

Those connections couldn't be cleared and sockets made available until all the
TCP retries had elapsed. The main problem was that the other side simply dropped
the connections when it detected a link disconnect. If the other system had
maintained connection state, the connections would have continued as normal when
the Ethernet cable was reconnected. That's why tests on my system couldn't
reproduce the issue. The IP stack on the Mac maintained state across cable
disconnects/reconnects. The issue reporter was using Windows, and the IP stack
there apparently drops a connection if the link disconnects. This left the
Teensy side waiting for replies and retrying, and the Windows side no longer
sending traffic.

To mitigate this problem, there are a few possible solutions, including:
1. Reduce the number of retransmission attempts by changing the `TCP_MAXRTX`
   setting in `lwipopts.h`, or
2. Abort connections upon link disconnect.

To accomplish #2, there is an `EthernetClient::abort()` function that simply
drops a TCP connection without going through the normal TCP close process. This
could be called on connections when the link has been disconnected. (See
`Ethernet.onLinkState(cb)`.)

Fun links:
* [Removing Exponential Backoff from TCP - acm sigcomm](http://www.sigcomm.org/node/2736)
* [Exponential backoff](https://en.wikipedia.org/wiki/Exponential_backoff)

@ssilverman
Owner

Here's my latest revision:

## On connections that hang around after cable disconnect

Ref: [EthernetServer accept no longer connects clients after unplugging/plugging ethernet cable ~7 times](https://github.com/ssilverman/QNEthernet/issues/15)

TCP tries its best to maintain reliable communication between two endpoints,
even when the physical link is unreliable. It uses techniques such as timeouts,
retries, and exponential backoff. For example, if a cable is disconnected and
then reconnected, there may be some packet loss during the disconnect time, so
TCP will try to resend any lost packets by retrying at successively larger
intervals.

The TCP close process uses some two-way communication to properly shut down a
connection, and therefore is also subject to physical link reliability. If the
physical link is interrupted or the other side doesn't participate in the close
process then the connection may appear to become "stuck", even when told to
close. The TCP stack won't consider the connection closed until all timeouts and
retries have elapsed.

It turns out that some systems drop and forget a connection when the physical
link is disconnected. This means that the other side may still be waiting to
continue or close the connection, timing out and retrying until all attempts
have failed. This can be as long as a half hour, or maybe more, depending on how
the stack is configured.

The above link contains a discussion where a user of this library couldn't
accept any new connections, even when all the connections had been closed, until
all the existing connections timed out after about a half hour. What happened
was this: connections were being made, the Ethernet cable was disconnected and
reconnected, and then more connections were made. When the cable was
disconnected, all connections were closed using the `close()` function. The
Teensy side still maintained connection state for all the connections, choosing
to do what TCP does: make a best effort to maintain or properly close those
connections. Once all the available sockets had been exhausted, no more
connections could be accepted.

Those connections couldn't be cleared and sockets made available until all the
TCP retries had elapsed. The main problem was that the other side simply dropped
the connections when it detected a link disconnect. If the other system had
maintained those connections, it would have continued the close processes as
normal when the Ethernet cable was reconnected. That's why tests on my system
couldn't reproduce the issue: the IP stack on the Mac maintained state across
cable disconnects/reconnects. The issue reporter was using Windows, and the IP
stack there apparently drops a connection if the link disconnects. This left the
Teensy side waiting for replies and retrying, and the Windows side no longer
sending traffic.

To mitigate this problem, there are a few possible solutions, including:
1. Reduce the number of retransmission attempts by changing the `TCP_MAXRTX`
   setting in `lwipopts.h`, or
2. Abort connections upon link disconnect.

To accomplish #2, there's an `EthernetClient::abort()` function that simply
drops a TCP connection without going through the normal TCP close process. This
could be called on connections when the link has been disconnected. (See also
`Ethernet.onLinkState(cb)` or `Ethernet.linkState()`.)

Fun links:
* [Removing Exponential Backoff from TCP - acm sigcomm](http://www.sigcomm.org/node/2736)
* [Exponential backoff](https://en.wikipedia.org/wiki/Exponential_backoff)

@ssilverman
Owner

I've pushed some new changes, including EthernetClient::abort(). Closing; please reopen if you need anything else related to this issue.

@SpenceV1
Author

SpenceV1 commented Aug 3, 2022

Thanks for the addition to the library. Your explanation is good, I hope it helps others who may run into this issue. Since I am working strictly with Windows 10 PCs and I assume they will all behave in a similar way, I added an abort on link disconnect to be safe. I appreciate your help with this issue.
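
For anyone landing here later, the abort-on-link-disconnect approach can be sketched as below. This is a minimal, off-device model: `abortAllOnLinkDown` and `FakeClient` are hypothetical names introduced for illustration, and the helper works with any client type that has an `abort()` method, which `EthernetClient` does according to the README section above.

```cpp
#include <cassert>
#include <vector>

// Aborts every tracked client and forgets it when the link is down.
// `Client` is any type with an abort() method.
template <typename Client>
void abortAllOnLinkDown(bool linkUp, std::vector<Client>& clients) {
  if (linkUp) {
    return;  // Nothing to do while the link is up
  }
  for (Client& c : clients) {
    c.abort();  // Drop the connection without the normal close handshake
  }
  clients.clear();
}

// Hypothetical stand-in used only to exercise the helper without hardware.
struct FakeClient {
  int* abortCount;
  void abort() {
    assert(abortCount != nullptr);
    ++*abortCount;
  }
};
```

On a Teensy, something like this would presumably be called from the `Ethernet.onLinkState(cb)` callback, or after polling `Ethernet.linkState()`, with the reported link state passed in. The exact callback signature is not shown in this thread, so check the library headers before wiring it up.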
