
v2.1.4: [input.harbor.ssl_0:2] Error while reading from client: Invalid argument in select() #3156

Closed
brenc opened this issue Jun 17, 2023 · 48 comments

Comments

brenc commented Jun 17, 2023

Describe the bug
After updating to 2.1.4, we started to see this when a client connects to a Harbor input:

[input.harbor.ssl_0:3] Decoding...
[input.harbor.ssl_0:2] Error while reading from client: Invalid argument in select()
[input.harbor.ssl_0:2] Feeding stopped: Avutil.Error(Invalid data found when processing input).

The client is then disconnected. The only thing that fixes this is restarting Liquidsoap. It doesn't happen all the time: something seems to trigger it, and after that no client can connect.

To Reproduce
This happens with a simple Harbor input; both SSL and non-SSL are affected.

let ngr.inputs.harbor = input.harbor(
  "radio",
  buffer=6.0,
  port=8003,
  timeout=10.0,
  user=icecast_mount_radio_usr,
  password=icecast_mount_radio_psw,
)

Expected behavior
Allow streaming clients to connect and stream.

Version details

  • OS: Docker
  • Version: savonet/liquidsoap:v2.1.4

Install method
Docker

vitoyucepi (Collaborator) commented:

Hi @brenc,
From which version did you upgrade?
Was it caused by a certain client, a single user, or particular software?

brenc (Author) commented Jun 18, 2023

From v2.1.3. We use butt to broadcast but this is happening with metadata updates as well:

[/radio.mp3:3] Metadata update may have failed with error: Invalid argument in select()

We haven't had any issues since reverting to 2.1.3.

vitoyucepi (Collaborator) commented:

I'm unable to reproduce the problem.

  1. What operating system are you running Docker on?
  2. Which Docker version are you using?
  3. Could you share the container run parameters?
  4. How often does this problem happen?
  5. What encoding parameters are you using in butt?

Maybe if you set settings.log.level.set(6), it will produce more information.

brenc (Author) commented Jun 19, 2023

  1. Debian 11 x64
  2. Docker version 24.0.1, build 6802122
  3. All we're doing is copying over some files and running CMD ["/usr/bin/liquidsoap", "main.liq"]
  4. Something seems to trigger it but I don't know what. I will try increasing logging on staging and see if it logs anything else.
  5. Stock 128k mp3.

Still no select() errors since rolling back to v2.1.3.

vitoyucepi (Collaborator) commented:

Docker version 24.0.1, build 6802122

I should have mentioned it earlier, but could you show the output of the docker info command?

All we're doing is copying over some files and running CMD ["/usr/bin/liquidsoap", "main.liq"]

Is it docker run -it --rm savonet/liquidsoap:v2.1.4, or did I miss something? Maybe some network options?
You can post a docker-compose project if you're using it.

Something seems to trigger it but I don't know what.

Docker itself can cause this problem.

All we're doing is copying over some files

If you're building the container anyway, why not build it from ubuntu:22.04 plus the deb package from the release?

brenc (Author) commented Jun 20, 2023

Client:                                 
 Context:    default       
 Debug Mode: false                      
 Plugins:                               
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.10.4                   
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.17.3
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 20
  Running: 7
  Paused: 0
  Stopped: 13
 Images: 18
 Server Version: 23.0.5
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: syslog
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: active
  NodeID: jqbmejv6mo0vdqthj4wety3hr
  Is Manager: true
  ClusterID: r26xd03m4b254wlxtsiock6oy
  Managers: 7
  Nodes: 7
  Default Address Pool: 10.0.0.0/8  
  SubnetSize: 24
  Data Path Port: 4789
  Orchestration:
   Task History Retention Limit: 5
  Raft:
   Snapshot Interval: 10000
   Number of Old Snapshots to Retain: 0
   Heartbeat Tick: 1 
   Election Tick: 10 
  Dispatcher:
   Heartbeat Period: 5 seconds
  CA Configuration:
   Expiry Duration: 3 months
   Force Rotate: 0
  Autolock Managers: false
  Root Rotation In Progress: false
  Node Address: 10.10.1.57
  Manager Addresses: 
   10.10.1.19:2377
   10.10.1.54:2377
   10.10.1.55:2377
   10.10.1.56:2377
   10.10.1.57:2377
   10.10.1.58:2377
   10.10.1.59:2377
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 3dce8eb055cbb6872793272b4f20ed16117344f8
 runc version: v1.1.7-0-g860f061
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 5.10.0-22-amd64
 Operating System: Debian GNU/Linux 11 (bullseye)
 OSType: linux
 Architecture: x86_64
 CPUs: 16
 Total Memory: 62.78GiB
 Name: docker04
 ID: Z3WQ:TVQJ:OA33:LUUT:G4V4:TAAF:7MF7:6IFA:L5AU:PF2Q:Y6SW:MR2T
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Experimental: false 
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

You can post a docker-compose project if you're using it.

We're using Docker Stack on enterprise servers. There isn't anything interesting or unusual in the stack file: it sets env vars, specifies the image, opens ports, etc.

Docker can cause the problem.

It hasn't before 2.1.4, for us anyway.

If you're building the container anyway, then why wouldn't you build it using ubuntu:22.04 + deb package from release?

The stock image works perfectly fine. No reason to use anything else.

brenc (Author) commented Jun 20, 2023

OK, I have gathered some more data. I was able to recreate this on my dev box, which is running the same (or very close) version of Docker. On dev I use Docker Compose, which is set up to mimic prod as closely as possible (prod is using Swarm). I'm using haproxy to front all of these connections and pass them off to the LS instance: just a simple TCP proxy with minimal processing.

So I left butt streaming all night long, and when I got up today it was still working fine. Then I disconnected and reconnected and was able to recreate this issue. When going through the proxy it takes about 50s before butt will disconnect. I believe the proxy server is buffering data, but LS doesn't respond, so once a timeout is reached the proxy disconnects the client.

On dev I can bypass the proxy server, so I did that. butt gets disconnected immediately and retries over and over until it eventually reconnects. Here are the logs LS produces (with settings.log.level.set(6)):

[harbor:4] PUT (source) request on /radio.
[harbor:4] Adding source on mountpoint "/radio" with type "audio/mpeg".
[harbor:5] Relaying HTTP/1.1.
[decoder:4] Available decoders:
[decoder:4] FFMPEG (priority: 10)
[decoder:4] MAD (priority: 1)
[decoder:4] Selected decoder FFMPEG for mime-type audio/mpeg with expected content {audio=pcm(stereo),video=none,midi=none}
[threads:4] Created thread "harbor source feeding" (2 total).
[input.harbor_0:3] Decoding...
[input.harbor_0:2] Error while reading from client: Invalid argument in select()
[input.harbor_0:2] Feeding stopped: Avutil.Error(Invalid data found when processing input).
[threads:4] Thread "harbor source feeding" terminated (1 remaining).

I see select() errors on metadata updates as well:

[replay_metadata_11:5] Got metadata at position 0: calling handlers...
[on_metadata_20:5] Got metadata at position 0: calling handler...
[on_metadata_20:5] Got metadata at position 0: calling handlers...
[replay_metadata_11:4] Fading in (type: sin, duration: 6.s).
[on_track_22:5] Got metadata at position 0: calling handlers...
[fade.in:5] Got metadata at position 0: calling handlers...
[fade.in:5] Got metadata at position 0: calling handlers...
[add_13:5] Got metadata at position 0: calling handlers...
[max_duration_13:5] Got metadata at position 0: calling handlers...
[sequence_20:5] Got metadata at position 0: calling handlers...
[switch_2:5] Got metadata at position 0: calling handlers...
[rms_2:5] Got metadata at position 0: calling handlers...
[lufs_0:5] Got metadata at position 0: calling handlers...
[bpm_0:5] Got metadata at position 0: calling handlers...
[mksafe:5] Got metadata at position 0: calling handlers...
[insert_metadata_0:5] Got metadata at position 0: calling handlers...
[max_duration_13:5] Content kind: {audio=pcm(stereo),video=none,midi=none}, content type: {audio=pcm(stereo),video=none,midi=none}
[sequence_19:5] Content kind: {audio=pcm(stereo),video=none,midi=none}, content type: {audio=pcm(stereo),video=none,midi=none}
[sequence_20:5] Content kind: {audio=pcm(stereo),video=none,midi=none}, content type: {audio=pcm(stereo),video=none,midi=none}
[ffmpeg.encode.audio.producer_0:5] Got metadata at position 0: calling handlers...
[/radio.mp3:3] Metadata update may have failed with error: Invalid argument in select()
[/radio.mp3:3] Metadata update may have failed with error: Invalid argument in select()
[/radio.mp3:3] Metadata update may have failed with error: Invalid argument in select()
[DEBUG]: ngr.inputs.update_live_status: input is now off-air

vitoyucepi (Collaborator) commented:

Maybe I'm doing something wrong, but I can't replicate this behavior.

I tried to fully reproduce your setup as far as I understand it.

  1. Use the Debian 11 cloud image (x64).
  2. Add the Docker upstream repos using the official guide.
  3. Install packages.
    apt-get install --no-install-recommends \
      docker-ce=5:23.0.5-1~debian.11~bullseye \
      docker-ce-cli=5:24.0.1-1~debian.11~bullseye \
      docker-buildx-plugin=0.10.4-1~debian.11~bullseye \
      docker-compose-plugin=2.17.3-1~debian.11~bullseye
  4. Create liquidsoap config.
    settings.log.level.set(6)
    
    a = input.harbor(
      "radio",
      port=8001,
      user="source",
      password="hackme",
      buffer=6.0,
      timeout=10.0,
    )
    
    output.icecast(
      %mp3,
      mksafe(a),
      user="source",
      password="hackme",
      mount="/radio",
      host="icecast",
      port=8000,
    )
  5. Run containers with the following compose config.
    services:
      liquidsoap:
        image: savonet/liquidsoap:v2.1.4
        ports:
          - 8001:8001
        volumes:
          - ./liquidsoap/:/home/liquidsoap/
        command: "/home/liquidsoap/main.liq"
      icecast:
        image: registry.gitlab.com/vito-containers/icecast:2.4
        ports:
          - 8000:8000
        environment:
          ICECAST_AUTHENTICATION_SOURCE_PASSWORD: hackme
  6. Connect with butt to http://source:hackme@$IP:8001/radio.

Did I miss something?

brenc (Author) commented Jun 20, 2023

That looks correct, yes. Like I said, it doesn't happen immediately: it has to run for a while, then something triggers it. Once triggered, a restart fixes it for a period of time.

vitoyucepi (Collaborator) commented:

I wonder if you can replicate the faulty behavior using the following command?

while true; do 
  ffmpeg \
    -re \
    -f lavfi \
    -i "sine=frequency=1000:duration=5" \
    -b:a 128k \
    -c:a mp3 \
    -f mp3 \
    "icecast://source:hackme@$IP:8001/radio";
done

This script streams 5 seconds of a 1000 Hz sine wave in a loop. Replace icecast://source:hackme@$IP:8001/radio with your Harbor parameters.

brenc (Author) commented Jun 21, 2023

Yes, here is what ffmpeg logs:

ffmpeg version 4.3.6-0+deb11u1 Copyright (c) 2000-2023 the FFmpeg developers
  built with gcc 10 (Debian 10.2.1-6)
  configuration: --prefix=/usr --extra-version=0+deb11u1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-avresample --disable-filter=resample --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librsvg --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-pocketsphinx --enable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
  libavutil      56. 51.100 / 56. 51.100
  libavcodec     58. 91.100 / 58. 91.100
  libavformat    58. 45.100 / 58. 45.100
  libavdevice    58. 10.100 / 58. 10.100
  libavfilter     7. 85.100 /  7. 85.100
  libavresample   4.  0.  0 /  4.  0.  0
  libswscale      5.  7.100 /  5.  7.100
  libswresample   3.  7.100 /  3.  7.100
  libpostproc    55.  7.100 / 55.  7.100
Input #0, lavfi, from 'sine=frequency=1000:duration=5':
  Duration: N/A, start: 0.000000, bitrate: 705 kb/s
    Stream #0:0: Audio: pcm_s16le, 44100 Hz, mono, s16, 705 kb/s
Stream mapping:
  Stream #0:0 -> #0:0 (pcm_s16le (native) -> mp3 (libmp3lame))
Press [q] to stop, [?] for help
[icecast @ 0x5599a90f6d80] It seems you are streaming an unsupported format.
[icecast @ 0x5599a90f6d80] It might work, but is not officially supported in Icecast!
Output #0, mp3, to 'icecast://radio:HAX@localhost:8203/radio':
  Metadata:
    TSSE            : Lavf58.45.100
    Stream #0:0: Audio: mp3 (libmp3lame), 44100 Hz, mono, s16p, 128 kb/s
    Metadata:
      encoder         : Lavc58.91.100 libmp3lame
av_interleaved_write_frame(): Broken pipe
Error writing trailer of icecast://radio:asdf123$@localhost:8203/radio: Broken pipe
size=       0kB time=00:00:00.00 bitrate= 337.7kbits/s speed=0.0143x    
video:0kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
Conversion failed!

After restarting LS, the script works as expected. It only fails once this bug has been triggered, of course.

vitoyucepi (Collaborator) commented Jun 22, 2023

I'm not sure why, but so far I haven't experienced any issues.
After a day of connects (and disconnects), there were no errors.

If you have some free time, I can suggest additional methods to recreate the issue on my end.

I was able to recreate this on my dev box

Can you create a disk image for a virtual machine that includes all the packages, containers, and scripts needed?

This is not necessary at the moment, because you have an easy way to repeat the error.
It should be possible to produce a build that prints a stack trace for the error, but I don't know how to do that, so the only way for me to help is to find a stable way to reproduce the problem anywhere.

Alternatively, you could check if the error persists in savonet/liquidsoap:rolling-release-v2.2.x.

brenc (Author) commented Jun 22, 2023

Yeah I've been thinking about how to create a minimal config to reproduce this. I think I have an idea. Will try to get something to you soon.

vitoyucepi (Collaborator) commented:

@brenc, could you share the values of ulimit -a -S and ulimit -a -H, both inside the liquidsoap container and on the host system?
Did you check journalctl? Maybe something happened there.
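
For example, something along these lines should do it (a rough sketch; the container name is just a placeholder for whatever docker ps shows):

# inside the container (container name is a placeholder)
docker exec -it liquidsoap_container sh -c 'ulimit -a -S; echo ----; ulimit -a -H'

# on the host
ulimit -a -S
ulimit -a -H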

brenc (Author) commented Jun 23, 2023

# ulimit -a -S
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) 0
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 256745
max locked memory           (kbytes, -l) 8228984
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1024
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 256745
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited

# ulimit -a -H
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) unlimited
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 256745
max locked memory           (kbytes, -l) 8228984
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1048576
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) unlimited
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 256745
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited

In the container:

$ ulimit -a -S
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) unlimited
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 256745
max locked memory           (kbytes, -l) 64
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1048576
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) unlimited
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited

$ ulimit -a -H
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) unlimited
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 256745
max locked memory           (kbytes, -l) 64
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1048576
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) unlimited
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) unlimited
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited

No additional logs anywhere. I have a small API over Harbor that gets queried every second by the proxy servers and other processes. I think that is involved in triggering this.

vitoyucepi (Collaborator) commented:

Docker has a bug: moby/moby#38814

I'm not sure if this is related to that Docker issue, but can you try setting limits in the dev environment and reproducing the problem?

services:
  liquidsoap:
    ulimits:
      nofile:
        soft: 1024
        hard: 524288

Reference documentation: https://docs.docker.com/compose/compose-file/compose-file-v3/#ulimits
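
To confirm the override is applied, something like this should show soft 1024 / hard 524288 inside the container (assuming the compose service is named liquidsoap as in the config above):

docker compose exec liquidsoap sh -c 'ulimit -Sn; ulimit -Hn'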

brenc (Author) commented Jun 23, 2023

I can't see how this would be a Docker issue considering 2.1.3 works but 2.1.4 does not.

vitoyucepi (Collaborator) commented:

I don't know either, but it looks suspicious.

moby/moby#38814 (comment)

The soft limit of 1024 would resolve issues with programs which loop through the entire file-descriptor range, and is the largest value compatible with software which uses select(2).

Invalid argument in select().

Also, I have this issue with Docker and limits on my local host, so I have to rebuild the packages against the latest Go version.

toots (Member) commented Jun 26, 2023

Hi! There was a file descriptor leak with SSL in v2.1.4. Any chance you could try with rolling-release-v2.2.x? Please note that the syntax for using SSL has changed there, see: https://www.liquidsoap.info/doc-dev/harbor_http.html#https-support

brenc (Author) commented Jun 26, 2023

@toots that would make sense. I started to upgrade to 2.2 but had to roll back. Will stay at 2.1.3 for now.

vitoyucepi (Collaborator) commented:

Hi @toots,
I suppose you're talking about #3067. I don't think this is the case.
Could you create a build that will print traces for these events?

Error while reading from client: Invalid argument in select()
Metadata update may have failed with error: Invalid argument in select()

toots (Member) commented Jun 27, 2023

Just pushed a commit that will print the backtrace for these two errors when logging at log.level 4 or more.

vitoyucepi (Collaborator) commented:

@toots, I expected these changes in the 2.1.x branch, because @brenc can reproduce this issue only with 2.1.4 (2.1.3 is fine).

toots (Member) commented Jun 28, 2023

Hmm. It's gonna be complicated to rebuild v2.1.4 through the actions. If @brenc can compile locally via opam I'm happy to provide a patch, but at this point we're very close to the 2.2.0 release, so I'd rather focus on fixing the issue there, if it exists.

brenc (Author) commented Jun 28, 2023

@toots all good. I'm just about done moving to 2.2.

brenc (Author) commented Jun 29, 2023

This appears to be fixed in 2.2.x.

brenc closed this as completed Jun 29, 2023
brenc (Author) commented Jun 30, 2023

Spoke too soon. I rolled this out to production, and within a day LS started logging this:

[/radio_mp3:3] Connecting mount /radio.mp3 for radio@10.0.23.45...                                                      
[/radio_mp3:2] Connection failed: Unix.Unix_error(Unix.EINVAL, "select", "")
[/radio_mp3:3] Will try again in 3.00 sec

Also metadata updates fail:

[/radio_mp3.2:2] Metadata update may have failed with error: Unix.Unix_error(Unix.EINVAL, "select", "")

It's possible this is due to an error condition (and possibly resource exhaustion?). I see one of the Icecast servers was unreachable (we have three instances). This happened for a number of hours before the select errors started:

[/radio_mp3:3] Connecting mount /radio.mp3 for radio@10.0.23.45...
[/radio_mp3:2] Connection failed: could not write data to host: Unix.Unix_error(Unix.EHOSTUNREACH, "write", "")
[/radio_mp3:3] Will try again in 3.00 sec.

I fixed that and will see if this happens again.

brenc reopened this Jun 30, 2023
brenc (Author) commented Jul 1, 2023

OK so this happened again during normal running conditions. I'm seeing this regularly:

[/radio_mp3.3:2] Metadata update may have failed with error: Unix.Unix_error(Unix.EINVAL, "select", "")

And I'm seeing this if I try to connect and stream with butt:

[input.harbor.2:3] Decoding...
[input.harbor.2:2] Error while reading from client: Unix.Unix_error(Unix.EINVAL, "select", "")
[input.harbor.2:2] Feeding stopped: Avutil.Error(Invalid data found when processing input).

This doesn't seem to happen in dev/staging right now. The only real difference I can think of is that listeners aren't connecting and disconnecting regularly in those environments.

I double-checked some resource usage, like open files. That doesn't seem to be an issue.

# cat /proc/2845871/limits 
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        unlimited            unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             unlimited            unlimited            processes 
Max open files            1048576              1048576              files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       256745               256745               signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us

Currently LS has about 3K open files, which seems high considering we have just one listener at the moment.

I increased logging to 4 so we'll see if that produces any interesting output.

toots (Member) commented Jul 1, 2023

Yeah, this looks like an fd leak. Could you check:

ls -l /proc/<pid>/fd
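
If it's sockets that are leaking, a rough count like this (assuming pidof can resolve the liquidsoap process) would also help:

ls -l "/proc/$(pidof liquidsoap)/fd" | grep -c 'socket:'   # sockets only
ls "/proc/$(pidof liquidsoap)/fd" | wc -l                  # total fds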

brenc (Author) commented Jul 2, 2023

Yep looks like that's what it is. Here is some additional logging:

[harbor:4] Client logged in.                                                                                                                               
[harbor:4] PUT (source) request on /radio.
[harbor:4] Adding source on mountpoint "/radio" with type "audio/mpeg".
[decoder:4] Available decoders:                                                                                                                            
[decoder:4] ffmpeg (priority: 10)                                            
[decoder:4] mad (priority: 1)                                                                                                                              
[decoder:4] Selected decoder ffmpeg for mime-type audio/mpeg with expected content {audio=pcm(stereo)}
[threads:4] Created thread "harbor source feeding" (2 total).
[input.harbor.2:3] Decoding...                                                                                                                             
[input.harbor.2:2] Error while reading from client: Unix.Unix_error(Unix.EINVAL, "select", "")
[input.harbor.2:4] Raised by primitive operation at Unix in file "unix.ml" (inlined), line 402, characters 0-161
[input.harbor.2:4] Called from Tutils.wait_for.wait in file "src/core/tools/tutils.ml", line 345, characters 10-30
[input.harbor.2:4] Called from Harbor_input.http_input_server#feed.read.(fun).f in file "src/core/sources/harbor_input.ml", line 99, characters 24-110
[input.harbor.2:4] Called from Harbor_input.http_input_server#feed.read.(fun) in file "src/core/sources/harbor_input.ml", line 106, characters 20-24       
[input.harbor.2:4]                                                                                                                                         
[input.harbor.2:2] Feeding stopped: Avutil.Error(Invalid data found when processing input).
[threads:4] Thread "harbor source feeding" terminated (1 remaining).

Currently 1138 open files, so definitely over that soft limit.

brenc (Author) commented Jul 2, 2023

Here are the metadata update failure logs:

[/radio_mp3.3:2] Metadata update may have failed with error: Unix.Unix_error(Unix.EINVAL, "select", "")                                                             
[/radio_mp3.3:4] Raised by primitive operation at Unix in file "unix.ml" (inlined), line 402, characters 0-161                                                      
[/radio_mp3.3:4] Called from Cry.connect_sockaddr.check_timeout in file "src/cry.ml", line 141, characters 24-58                                                    
[/radio_mp3.3:4] Re-raised at Cry.connect_sockaddr in file "src/cry.ml", line 164, characters 4-38                                                                  
[/radio_mp3.3:4] Called from Http.unix_transport.object#connect in file "src/core/tools/http.ml", line 50, characters 15-64                                         
[/radio_mp3.3:4] Called from Cry.manual_update_metadata in file "src/cry.ml", line 693, characters 4-73                                                             
[/radio_mp3.3:4] Called from Icecast2.output#insert_metadata in file "src/core/outputs/icecast2.ml", line 501, characters 16-12

brenc (Author) commented Jul 7, 2023

Looks like it's mostly socket connections:

lrwx------ 1 2002 2002 64 Jul  7 15:17 32 -> 'socket:[264092829]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 33 -> 'socket:[264093918]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 34 -> 'socket:[264082258]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 35 -> 'socket:[264088139]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 36 -> 'socket:[264091087]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 37 -> 'socket:[264092711]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 38 -> 'socket:[264093066]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 39 -> 'socket:[264083326]'
l-wx------ 1 2002 2002 64 Jul  7 15:17 4 -> 'pipe:[264075097]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 40 -> 'socket:[264090274]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 41 -> 'socket:[264093431]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 42 -> 'socket:[264109195]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 43 -> 'socket:[264110129]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 44 -> 'socket:[264112264]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 45 -> 'socket:[264109240]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 46 -> 'socket:[264109356]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 47 -> 'socket:[264107606]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 48 -> 'socket:[264109399]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 49 -> 'socket:[264112017]'
lr-x------ 1 2002 2002 64 Jul  7 15:17 5 -> 'pipe:[264075098]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 50 -> 'socket:[264116483]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 51 -> 'socket:[264122632]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 52 -> 'socket:[264121766]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 53 -> 'socket:[264124590]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 54 -> 'socket:[264117934]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 55 -> 'socket:[264116974]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 56 -> 'socket:[264137810]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 57 -> 'socket:[264139037]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 58 -> 'socket:[264137046]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 59 -> 'socket:[264136209]'
l-wx------ 1 2002 2002 64 Jul  7 15:00 6 -> 'pipe:[264075098]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 60 -> 'socket:[264139385]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 61 -> 'socket:[264139360]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 62 -> 'socket:[264141345]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 63 -> 'socket:[264135470]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 64 -> 'socket:[264132157]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 65 -> 'socket:[264131416]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 66 -> 'socket:[264131446]'

vitoyucepi (Collaborator) commented:

Hi @brenc,
How many sockets are open?
ls -l "/proc/$(pidof liquidsoap)/fd" | wc -l

brenc (Author) commented Jul 7, 2023

When the errors are happening it's always around 1080 FDs: less than 1100 but more than 1024. I can watch the number of open FDs steadily grow until this point.
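
To track the growth over time, I can leave a small loop like this running on the host (a rough sketch; assumes pidof resolves the process and the pid doesn't change between samples):

pid="$(pidof liquidsoap)"
while true; do
  printf '%s %s\n' "$(date -Is)" "$(ls "/proc/$pid/fd" | wc -l)"   # timestamp + fd count
  sleep 60
done >> fd_count.log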

toots (Member) commented Jul 7, 2023

Thanks! I'm gonna review the code to see if I can spot it.

Is your script still:

settings.log.level.set(6)

a = input.harbor(
  "radio",
  port=8001,
  user="source",
  password="hackme",
  buffer=6.0,
  timeout=10.0,
)

output.icecast(
  %mp3,
  mksafe(a),
  user="source",
  password="hackme",
  mount="/radio",
  host="icecast",
  port=8000,
)

How often does the DJ connect to input.harbor?

brenc (Author) commented Jul 7, 2023

More or less, yes. The DJ rarely connects. I have an API that gets 1-3+ connections/s, though: the proxy server connects every second to check that the API is up and running, and things like that. I think that's part of what's causing this.

I've been meaning to whip up a simple config to see if I can recreate this. Will try that now.
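
Something like this should approximate that polling pattern against a test instance (just a sketch; the URL is a placeholder, point it at whatever harbor endpoint the health checks actually hit):

while true; do
  curl -ks -o /dev/null "https://localhost:8000/"   # -k because the dev cert may be self-signed
  sleep 0.3
done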

brenc (Author) commented Jul 8, 2023

Right now there are 1319 FDs. I think the number might continue to increase until I restart LS.

Edit: 1300 of these are socket FDs. The number continues to increase until restart.

toots closed this as completed in 1374696 on Jul 8, 2023
toots (Member) commented Jul 8, 2023

Thanks for your patience. I believe that this is fixed now. A cleanup call got deleted when handling redirections in our internal http handlers. If we ever have a bugfix for 2.1.x this will surely be in it.

brenc (Author) commented Jul 10, 2023

OK, so this isn't fixed. FDs continue to increase until LS stops operating. Some additional observations once open FDs get above 1000:

  • LS doesn't appear to crash but it does stop streaming audio, getting new tracks to play, etc. The station reports as being down (Icecast off-air message).
  • LS does still respond to requests on the Harbor API I built.
  • Metadata updates still fail:
[/radio_mp3.3:2] Metadata update may have failed with error: Unix.Unix_error(Unix.EINVAL, "select", "")
[/radio_mp3.3:4] Raised by primitive operation at Unix in file "unix.ml" (inlined), line 402, characters 0-161
[/radio_mp3.3:4] Called from Cry.connect_sockaddr.check_timeout in file "src/cry.ml", line 141, characters 24-58
[/radio_mp3.3:4] Re-raised at Cry.connect_sockaddr in file "src/cry.ml", line 164, characters 4-38
[/radio_mp3.3:4] Called from Http.unix_transport.object#connect in file "src/core/tools/http.ml", line 50, characters 15-64
[/radio_mp3.3:4] Called from Cry.manual_update_metadata in file "src/cry.ml", line 693, characters 4-73
[/radio_mp3.3:4] Called from Icecast2.output#insert_metadata in file "src/core/outputs/icecast2.ml", line 505, characters 16-128
  • I can still connect with butt via Harbor SSL though. This doesn't seem to change anything. It accepts input but the station is still down. Eventually LS spits out this:
    [input.harbor.2:4] Generator max length exeeded (882000 < 883152)! Dropping content..
  • On production, after I connected via butt and then disconnected it was spitting this out continually until I restarted it. There were over 40K open FDs at the time.
    [harbor:2] Failed to accept new client: SSL accept() error: error:00000000:lib(0)::reason(0)
  • As far as I know, the client wasn't trying to reconnect.
  • I'm no longer getting select errors when I connect with butt.
  • I think I can get this to happen more quickly if I set a thread.run to make a simple HTTP GET request to an endpoint that returns a 200 every 0.10 seconds.
  • When I tried to gracefully restart LS, it wouldn't shut down. I had to hard restart it. I'm using nodemon to run LS on dev. It sends a SIGTERM to gracefully restart on config changes.

I'll continue to test and think of ways to easily recreate this.

brenc (Author) commented Jul 10, 2023

Another observation: open FDs continue to rise without the additional http.get requests I was doing. It's a slow, steady rise; it takes about 12-16 hours to reach 1000+, at which point LS starts having problems.

toots (Member) commented Jul 10, 2023

@brenc any chance you could provide an extract of lsof -p <pid>? That would give more context about the leaking FDs.

brenc (Author) commented Jul 10, 2023

Yep here it is. A little over 200 open FDs right now.
lsof_output.txt

brenc (Author) commented Jul 11, 2023

Looks like this piece of code is the culprit. If I uncomment this, FDs/open sockets increase:

let ngr.inputs.harbor_ssl = input.harbor(
  "radio",
  buffer=6.0,
  port=8000,
  timeout=10.0,
  transport=ngr.transports.ssl,
  user=icecast_mount_radio_usr,
  password=icecast_mount_radio_psw,
)

Here's the transport:

let ngr.transports = ()

let ngr.transports.ssl = http.transport.ssl(
  certificate="/app/star_newgroundsradio_com.cert",
  key="/app/star_newgroundsradio_com.key",
)

Non-SSL Harbor doesn't do this.

toots (Member) commented Jul 11, 2023

Thanks. I'm running this:

Server:

settings.log.level.set(4)

transport = http.transport.ssl(
  certificate="/tmp/ssl.cert",
  key="/tmp/ssl.key"
)

a = input.harbor(
  "radio",
  transport=transport,
  port=8001,
  user="source",
  password="hackme",
  buffer=6.0,
  timeout=10.0,
)

output.dummy(fallible=true, a)

Client:

s = playlist("/tmp/playlists")

s = insert_metadata(s)

counter = ref(0)

thread.run(every=1., {begin
  s.insert_metadata([("title","counter #{counter()}")])
  counter.set(counter()+1)
end})

transport = http.transport.ssl(
  certificate="/tmp/ssl.cert",
  key="/tmp/ssl.key"
)

output.icecast(
  transport=transport,
  fallible=true,
  port=8001,
  mount="radio",
  %mp3,
  s
)

To see if I can reproduce.

brenc (Author) commented Jul 13, 2023

OK, I have finally figured out how to replicate this reliably, using the following config:

settings.log.file.set(false)
settings.log.level.set(3)
settings.log.stdout.set(true)

ssl_transport = http.transport.ssl(
  certificate="cert.pem",
  key="key.pem",
)

harbor_ssl = input.harbor(
  "radio",
  buffer=6.0,
  port=8000,
  timeout=10.0,
  transport=ssl_transport,
  user="radio",
  password="radio",
)

output.dummy(sine(440.))

In one window run:

watch -n1 'ls /proc/$(pidof liquidsoap)/fd | wc -l'

In another run:

openssl s_client -connect HOST:PORT

When you connect, you'll see the FD/socket count increment. If you do a normal request, e.g.:

HEAD / HTTP/1.1
[hit enter]

then after it disconnects you'll see the FD count decrement as expected. However, if you connect and immediately press ctrl-c to disconnect, you'll see the socket count increment but not decrement.

What causes this in production is the health check in haproxy. I use a TCP frontend to proxy SSL requests to LS. The health check runs every 2 seconds to see if the port is responding, but it doesn't consistently increment the FD count. My theory is that it doesn't always close the connection cleanly, which triggers this bug. If I disable the health check, the FD count stays steady.

Interestingly, if I simply telnet to the port and then ctrl-c, the FD gets cleaned up properly. So it seems that the connection has to go through the SSL negotiation first.
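
For a scripted version of the abrupt disconnect (a sketch, assuming the SSL harbor from the config above on localhost:8000; killing s_client hard skips the clean TLS shutdown, just like ctrl-c does):

for i in $(seq 1 20); do
  # complete the TLS handshake, then kill the client without a proper shutdown
  timeout -s KILL 0.5 \
    openssl s_client -connect localhost:8000 -quiet -ign_eof < /dev/null > /dev/null 2>&1
done
# the watch output above should now show roughly 20 more open fds if the leak is present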

vitoyucepi (Collaborator) commented:

@brenc, after researching the file limit problem, I have concluded that setting the soft limit to 1024 would be better. Potentially it can prevent unexpected segfaults or memory corruption related to the Linux select function.
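
For what it's worth, the soft limit can also be lowered on an already-running process from the host, assuming util-linux's prlimit is available (the values mirror the compose example above and are only a suggestion):

prlimit --pid "$(pidof liquidsoap)" --nofile=1024:524288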

@toots, did I come to the correct conclusion? Is it possible to move away from select?

toots (Member) commented Jul 14, 2023

@vitoyucepi everything is switched to poll whenever available now!

toots (Member) commented Jul 19, 2023

After some serious digging by @brenc and a full reproduction setup at https://github.com/brenc/lsfdbug, it would appear that this bug is now closed!
