v2.1.4: [input.harbor.ssl_0:2] Error while reading from client: Invalid argument in select() #3156
Comments
Hi @brenc, …

From v2.1.3. We use butt to broadcast, but this is happening with metadata updates as well:

We haven't had any issues since reverting to 2.1.3.

I'm unable to reproduce the problem. Maybe if you set …

Still no …

I should have mentioned it earlier, but could you show the results of the … Is it …

Docker can cause the problem.

If you're building the container anyway, then why wouldn't you build it using …

We're using Docker Stack on enterprise servers. There isn't anything interesting or unusual in the stack file. It's setting env vars, telling it what image to use, opening ports, etc. It hasn't before 2.1.4, for us anyway. The stock image works perfectly fine. No reason to use anything else.
OK - I have gathered some more data. I was able to recreate this on my dev box, which is running the same (or very close) version of Docker. On dev I use Docker Compose, which is set up to mimic prod as closely as possible (prod is using Swarm). I'm using haproxy to frontend all these connections and pass them off to the LS instance - just a simple TCP proxy with minimal processing. So I left butt streaming all night long, and when I got up today it was still working fine. Then I disconnected and reconnected and was able to recreate this issue. When going through the proxy it takes about 50s before butt will disconnect. I believe the proxy server is buffering data but LS doesn't respond, so once a timer is reached the proxy disconnects the client. On dev I can bypass the proxy server, so I did that: butt gets disconnected immediately and will reconnect over and over again until it eventually reconnects. Here are the logs LS produces (with …

I see …
Maybe I'm doing something wrong, but I can't replicate this behavior. I tried to fully reproduce your setup as far as I understand it. Did I miss something?

That looks correct, yes. Like I said, it doesn't happen immediately. It has to run for a while, then something triggers it. Once triggered, a restart fixes it for a period of time.
I wonder if you can replicate the faulty behavior using the following command?

    while true; do
      ffmpeg \
        -re \
        -f lavfi \
        -i "sine=frequency=1000:duration=5" \
        -b:a 128k \
        -c:a mp3 \
        -f mp3 \
        "icecast://source:hackme@$IP:8001/radio";
    done

This script will play 5 seconds of a 1000 Hz sine tone in a loop. Replace $IP with the address of the host running Liquidsoap.
Yes, here is what ffmpeg logs:

Restarted LS and the script works as expected. It's only when this bug is triggered that it fails, of course.
I'm not sure why, but so far I haven't experienced any issues. If you have some free time, I can suggest additional methods to recreate the issue on my end. Can you create a disk image for a virtual machine that includes all the packages, containers, and scripts needed? This is not necessary at the moment, because you have an easy way to repeat the error. Alternatively, you could check if the error persists in …

Yeah, I've been thinking about how to create a minimal config to reproduce this. I think I have an idea. Will try to get something to you soon.
@brenc, could you share the values of …

In the container:

No additional logs anywhere. I have a small API over Harbor that gets queried every second by the proxy servers and other processes. I think that is involved in triggering this.
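For anyone following along, those limits can be double-checked from the host. A minimal sketch, assuming the container is named liquidsoap and Liquidsoap runs as PID 1 inside it (both assumptions; adjust to your setup):

    # show the soft/hard open-file limits the liquidsoap process actually sees
    docker exec liquidsoap sh -c "grep 'open files' /proc/1/limits"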
Docker has a bug: moby/moby#38814. I'm not sure if this is related to the Docker issue, but can you try setting limits in the dev environment and reproducing the problem?

    services:
      liquidsoap:
        ulimits:
          nofile:
            soft: 1024
            hard: 524288

Reference documentation: https://docs.docker.com/compose/compose-file/compose-file-v3/#ulimits
I can't see how this would be a Docker issue, considering 2.1.3 works but 2.1.4 does not.

I don't know either, but it looks suspicious. Also, I have this issue with Docker and limits on my local host, so I have to rebuild the packages against the latest Go version.
Hi! There was a file descriptor leak with SSL in …

@toots that would make sense. I started to upgrade to 2.2 but had to roll back. Will stay at 2.1.3 for now.

Just pushed a commit that will print the backtrace for these two errors when logging at log.level 4 or more.

Hmm. It's gonna be complicated to rebuild …

@toots all good. I'm just about done moving to 2.2.

This appears to be fixed in 2.2.x.
Spoke too soon. Rolled out to production, and within a day LS started logging this:

Also, metadata updates fail:

It's possible this is due to an error condition (and possibly resource exhaustion?). I see one of the Icecast servers was unreachable (we have three instances). This happened for a number of hours before the select errors started:

I fixed that and will see if this happens again.
OK, so this happened again during normal running conditions. I'm seeing this regularly:

And I'm seeing this if I try to connect to the stream with butt:

This doesn't seem to happen in dev/staging. The only real difference I can think of is that listeners aren't connecting/disconnecting regularly in those environments. I double-checked some resource usage, like open files; that doesn't seem to be an issue.

Currently LS has about 3K open files, which seems high considering we have just one listener at the moment. I increased logging to 4, so we'll see if that produces any interesting output.
Yeah, this looks like an fd leak. Could you check:

    ls -l /proc/<pid>/fd
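To watch the count over time, something like this works (a sketch - <pid> is the Liquidsoap process id, and the 2-second interval is arbitrary):

    # print the open fd count every 2 seconds
    watch -n 2 'ls /proc/<pid>/fd | wc -l'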
Yep, looks like that's what it is. Here is some additional logging:

Currently 1138 open files, so definitely over that soft limit.
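To confirm the limits the running process is actually subject to (again a sketch, with <pid> standing in for the Liquidsoap process id):

    # the soft and hard caps on open files for this process
    grep 'open files' /proc/<pid>/limits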
Here are the metadata update failure logs:
Looks like it's mostly socket connections:
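A one-liner to tally fds by type, for anyone reproducing this (a sketch; <pid> as above):

    # count open fds by target type (socket, pipe, regular file, ...)
    ls -l /proc/<pid>/fd | awk '{print $NF}' | cut -d: -f1 | sort | uniq -c | sort -rn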
Hi @brenc, …

When the errors are happening, it's always about 1080 FDs: fewer than 1100 but more than 1024. I can watch the number of open FDs steadily grow until this point.
Thanks! I'm gonna review the code to see if I can spot it. Is your script still:

How often does the DJ connect to …

More or less, yes. The DJ rarely ever connects. I have an API that gets 1-3+ connections/s, though. The proxy server connects every second to check that the API is up and running, and things like that. I think that's part of what's causing this. I've been meaning to whip up a simple config to see if I can recreate this. Will try that now.

Right now there are 1319 FDs. I think the number might continue to increase until I restart LS. Edit: 1300 of these are socket FDs. The number continues to increase until restart.
Thanks for your patience. I believe that this is fixed now. A cleanup call got deleted when handling redirections in our internal …

OK, so this isn't fixed. FDs continue to increase until LS stops operating. Some additional observations once open FDs get above 1000:

I'll continue to test and think of ways to easily recreate this.

Another observation: open FDs continue to rise without the additional …
@brenc, any chance you could provide an extract of …

Yep, here it is. A little over 200 open FDs right now.
Looks like this piece of code here is the culprit. If I uncomment this, FDs/open sockets increase:

Here's the transport:

Non-SSL Harbor doesn't do this.
Thanks. I'm running this:

server:

client:

To see if I can reproduce.
OK, I have finally figured out how to replicate this reliably. Using the following config:

In one window, run:

In another, run:

When you connect, you'll see the FD/socket count increment. If you do a normal request, i.e.:

After it disconnects, you'll then see the FD count decrement as expected. However, if you connect and immediately press ctrl-c to disconnect, you'll see the socket count increment but not decrement. What causes this in production is the health check in haproxy: I use a TCP frontend to proxy SSL requests to LS, and the health check runs every 2 seconds to see if the port is responding, but it doesn't consistently increment the FD count. My theory is that it doesn't always close the connection cleanly, which is what triggers this bug. If I disable the health check, the FD count stays steady. Interestingly, if I simply telnet to the port and then ctrl-c, the FD gets cleaned up properly. So it seems the connection has to go through SSL negotiation first.
@brenc, after researching the file limit problem, I have concluded that setting the soft limit to 1024 will be better. Potentially it can prevent some unexpected segfault or memory corruption related to Linux … @toots, did I come to the correct conclusion? Is it possible to move away from …
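The 1024 figure lines up with select()'s classic limitation: glibc's fd_set is a fixed-size bitmap of FD_SETSIZE = 1024 slots, so passing a descriptor numbered 1024 or higher to select() is undefined behavior and can corrupt memory. On a glibc system the constant can be confirmed with something like (header paths vary by distro):

    # locate the compiled-in fd_set size
    grep -rn 'define __FD_SETSIZE' /usr/include 2>/dev/null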
@vitoyucepi everything is switched to …

After some serious digging by @brenc and a full reproduction setup at https://github.com/brenc/lsfdbug, it would appear that this bug is now closed!
Describe the bug

After updating to 2.1.4, we started to see this when a client connects to a Harbor input:

    [input.harbor.ssl_0:3] Decoding...
    [input.harbor.ssl_0:2] Error while reading from client: Invalid argument in select()
    [input.harbor.ssl_0:2] Feeding stopped: Avutil.Error(Invalid data found when processing input).

The client is then disconnected. The only thing that fixes this is restarting LS. This doesn't happen all the time: something triggers it, and after that no client can connect.

To Reproduce

Happens with a simple Harbor input; both SSL and non-SSL are affected.

Expected behavior

Allow streaming clients to connect and stream.

Version details

Install method

Docker