
v2.1.4: [input.harbor.ssl_0:2] Error while reading from client: Invalid argument in select() #3156

Closed
brenc opened this issue Jun 17, 2023 · 48 comments

Comments

brenc commented Jun 17, 2023

Describe the bug
After updating to 2.1.4, we started to see this when a client connects to a Harbor input:

[input.harbor.ssl_0:3] Decoding...
[input.harbor.ssl_0:2] Error while reading from client: Invalid argument in select()
[input.harbor.ssl_0:2] Feeding stopped: Avutil.Error(Invalid data found when processing input).

The client is then disconnected. The only thing that fixes this is restarting Liquidsoap. It doesn't happen all the time: something seems to trigger it, and after that no client can connect.

To Reproduce
This happens with a simple Harbor input; both SSL and non-SSL are affected.

let ngr.inputs.harbor = input.harbor(
  "radio",
  buffer=6.0,
  port=8003,
  timeout=10.0,
  user=icecast_mount_radio_usr,
  password=icecast_mount_radio_psw,
)

Expected behavior
Allow streaming clients to connect and stream.

Version details

  • OS: Docker
  • Version: savonet/liquidsoap:v2.1.4

Install method
Docker

vitoyucepi (Collaborator) commented:

Hi @brenc,
From which version did you upgrade?
Was it caused by a certain client, a single user, or particular software?

brenc (Author) commented Jun 18, 2023

From v2.1.3. We use butt to broadcast but this is happening with metadata updates as well:

[/radio.mp3:3] Metadata update may have failed with error: Invalid argument in select()

We haven't had any issues since reverting to 2.1.3.

vitoyucepi (Collaborator) commented:

I'm unable to reproduce the problem.

  1. What operating system are you running Docker on?
  2. Which Docker version are you using?
  3. Could you share the container run parameters?
  4. How often does this problem happen?
  5. What encoding parameters are you using in butt?

Maybe if you set settings.log.level.set(6), it will produce more information.

brenc (Author) commented Jun 19, 2023

  1. Debian 11 x64
  2. Docker version 24.0.1, build 6802122
  3. All we're doing is copying over some files and running CMD ["/usr/bin/liquidsoap", "main.liq"]
  4. Something seems to trigger it but I don't know what. I will try increasing logging on staging and see if it logs anything else.
  5. Stock 128k mp3.

Still no select() errors since rolling back to v2.1.3.

vitoyucepi (Collaborator) commented:

Docker version 24.0.1, build 6802122

I should have mentioned it earlier, but could you show the output of the docker info command?

All we're doing is copying over some files and running CMD ["/usr/bin/liquidsoap", "main.liq"]

Is it docker run -it --rm savonet/liquidsoap:v2.1.4, or did I miss something? Maybe some network options?
You can post a docker-compose project if you're using it.

Something seems to trigger it but I don't know what.

Docker itself can cause this problem.

All we're doing is copying over some files

If you're building the container anyway, why not build it from ubuntu:22.04 plus the deb package from the release?

brenc (Author) commented Jun 20, 2023

Client:                                 
 Context:    default       
 Debug Mode: false                      
 Plugins:                               
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.10.4                   
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.17.3
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 20
  Running: 7
  Paused: 0
  Stopped: 13
 Images: 18
 Server Version: 23.0.5
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: syslog
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: active
  NodeID: jqbmejv6mo0vdqthj4wety3hr
  Is Manager: true
  ClusterID: r26xd03m4b254wlxtsiock6oy
  Managers: 7
  Nodes: 7
  Default Address Pool: 10.0.0.0/8  
  SubnetSize: 24
  Data Path Port: 4789
  Orchestration:
   Task History Retention Limit: 5
  Raft:
   Snapshot Interval: 10000
   Number of Old Snapshots to Retain: 0
   Heartbeat Tick: 1 
   Election Tick: 10 
  Dispatcher:
   Heartbeat Period: 5 seconds
  CA Configuration:
   Expiry Duration: 3 months
   Force Rotate: 0
  Autolock Managers: false
  Root Rotation In Progress: false
  Node Address: 10.10.1.57
  Manager Addresses: 
   10.10.1.19:2377
   10.10.1.54:2377
   10.10.1.55:2377
   10.10.1.56:2377
   10.10.1.57:2377
   10.10.1.58:2377
   10.10.1.59:2377
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 3dce8eb055cbb6872793272b4f20ed16117344f8
 runc version: v1.1.7-0-g860f061
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 5.10.0-22-amd64
 Operating System: Debian GNU/Linux 11 (bullseye)
 OSType: linux
 Architecture: x86_64
 CPUs: 16
 Total Memory: 62.78GiB
 Name: docker04
 ID: Z3WQ:TVQJ:OA33:LUUT:G4V4:TAAF:7MF7:6IFA:L5AU:PF2Q:Y6SW:MR2T
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Experimental: false 
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

You can post a docker-compose project if you're using it.

We're using Docker Stack on enterprise servers. There isn't anything interesting or unusual in the stack file: it sets env vars, specifies the image, opens ports, etc.

Docker can cause the problem.

It hasn't before 2.1.4, for us anyway.

If you're building the container anyway, then why wouldn't you build it using ubuntu:22.04 + deb package from release?

The stock image works perfectly fine. No reason to use anything else.

brenc (Author) commented Jun 20, 2023

OK, I have gathered some more data. I was able to recreate this on my dev box, which is running the same (or very close) version of Docker. On dev I use Docker Compose, which is set up to mimic prod as closely as possible (prod is using Swarm). I'm using haproxy to front all of these connections and pass them off to the LS instance: just a simple TCP proxy with minimal processing.

So I left butt streaming all night long, and when I got up today it was still working fine. Then I disconnected and reconnected and was able to recreate this issue. When going through the proxy it takes about 50s before butt will disconnect. I believe the proxy server is buffering data, but LS doesn't respond, so once a timeout is reached the proxy disconnects the client.

On dev I can bypass the proxy server, so I did that. butt gets disconnected immediately and retries over and over until it eventually reconnects. Here are the logs LS produces (with settings.log.level.set(6)):

[harbor:4] PUT (source) request on /radio.
[harbor:4] Adding source on mountpoint "/radio" with type "audio/mpeg".
[harbor:5] Relaying HTTP/1.1.
[decoder:4] Available decoders:
[decoder:4] FFMPEG (priority: 10)
[decoder:4] MAD (priority: 1)
[decoder:4] Selected decoder FFMPEG for mime-type audio/mpeg with expected content {audio=pcm(stereo),video=none,midi=none}
[threads:4] Created thread "harbor source feeding" (2 total).
[input.harbor_0:3] Decoding...
[input.harbor_0:2] Error while reading from client: Invalid argument in select()
[input.harbor_0:2] Feeding stopped: Avutil.Error(Invalid data found when processing input).
[threads:4] Thread "harbor source feeding" terminated (1 remaining).

I see select() errors on metadata updates as well:

[replay_metadata_11:5] Got metadata at position 0: calling handlers...
[on_metadata_20:5] Got metadata at position 0: calling handler...
[on_metadata_20:5] Got metadata at position 0: calling handlers...
[replay_metadata_11:4] Fading in (type: sin, duration: 6.s).
[on_track_22:5] Got metadata at position 0: calling handlers...
[fade.in:5] Got metadata at position 0: calling handlers...
[fade.in:5] Got metadata at position 0: calling handlers...
[add_13:5] Got metadata at position 0: calling handlers...
[max_duration_13:5] Got metadata at position 0: calling handlers...
[sequence_20:5] Got metadata at position 0: calling handlers...
[switch_2:5] Got metadata at position 0: calling handlers...
[rms_2:5] Got metadata at position 0: calling handlers...
[lufs_0:5] Got metadata at position 0: calling handlers...
[bpm_0:5] Got metadata at position 0: calling handlers...
[mksafe:5] Got metadata at position 0: calling handlers...
[insert_metadata_0:5] Got metadata at position 0: calling handlers...
[max_duration_13:5] Content kind: {audio=pcm(stereo),video=none,midi=none}, content type: {audio=pcm(stereo),video=none,midi=none}
[sequence_19:5] Content kind: {audio=pcm(stereo),video=none,midi=none}, content type: {audio=pcm(stereo),video=none,midi=none}
[sequence_20:5] Content kind: {audio=pcm(stereo),video=none,midi=none}, content type: {audio=pcm(stereo),video=none,midi=none}
[ffmpeg.encode.audio.producer_0:5] Got metadata at position 0: calling handlers...
[/radio.mp3:3] Metadata update may have failed with error: Invalid argument in select()
[/radio.mp3:3] Metadata update may have failed with error: Invalid argument in select()
[/radio.mp3:3] Metadata update may have failed with error: Invalid argument in select()
[DEBUG]: ngr.inputs.update_live_status: input is now off-air

vitoyucepi (Collaborator) commented:

Maybe I'm doing something wrong, but I can't replicate this behavior.

I tried to fully reproduce your setup as far as I understand it.

  1. Use the Debian 11 cloud image (x64).
  2. Add the Docker upstream repos using the official guide.
  3. Install packages.
    apt-get install --no-install-recommends \
      docker-ce=5:23.0.5-1~debian.11~bullseye \
      docker-ce-cli=5:24.0.1-1~debian.11~bullseye \
      docker-buildx-plugin=0.10.4-1~debian.11~bullseye \
      docker-compose-plugin=2.17.3-1~debian.11~bullseye
  4. Create liquidsoap config.
    settings.log.level.set(6)
    
    a = input.harbor(
      "radio",
      port=8001,
      user="source",
      password="hackme",
      buffer=6.0,
      timeout=10.0,
    )
    
    output.icecast(
      %mp3,
      mksafe(a),
      user="source",
      password="hackme",
      mount="/radio",
      host="icecast",
      port=8000,
    )
  5. Run containers with the following compose config.
    services:
      liquidsoap:
        image: savonet/liquidsoap:v2.1.4
        ports:
          - 8001:8001
        volumes:
          - ./liquidsoap/:/home/liquidsoap/
        command: "/home/liquidsoap/main.liq"
      icecast:
        image: registry.gitlab.com/vito-containers/icecast:2.4
        ports:
          - 8000:8000
        environment:
          ICECAST_AUTHENTICATION_SOURCE_PASSWORD: hackme
  6. Connect with butt to http://source:hackme@$IP:8001/radio.

Did I miss something?

brenc (Author) commented Jun 20, 2023

That looks correct, yes. Like I said, it doesn't happen immediately: it has to run for a while, then something triggers it. Once triggered, a restart fixes it for a period of time.

vitoyucepi (Collaborator) commented:

I wonder if you can replicate the faulty behavior using the following command?

while true; do 
  ffmpeg \
    -re \
    -f lavfi \
    -i "sine=frequency=1000:duration=5" \
    -b:a 128k \
    -c:a mp3 \
    -f mp3 \
    "icecast://source:hackme@$IP:8001/radio";
done

This script streams 5 seconds of a 1000 Hz sine wave in a loop. Replace icecast://source:hackme@$IP:8001/radio with your Harbor parameters.

brenc (Author) commented Jun 21, 2023

Yes, here is what ffmpeg logs:

ffmpeg version 4.3.6-0+deb11u1 Copyright (c) 2000-2023 the FFmpeg developers
  built with gcc 10 (Debian 10.2.1-6)
  configuration: --prefix=/usr --extra-version=0+deb11u1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-avresample --disable-filter=resample --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librsvg --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-pocketsphinx --enable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
  libavutil      56. 51.100 / 56. 51.100
  libavcodec     58. 91.100 / 58. 91.100
  libavformat    58. 45.100 / 58. 45.100
  libavdevice    58. 10.100 / 58. 10.100
  libavfilter     7. 85.100 /  7. 85.100
  libavresample   4.  0.  0 /  4.  0.  0
  libswscale      5.  7.100 /  5.  7.100
  libswresample   3.  7.100 /  3.  7.100
  libpostproc    55.  7.100 / 55.  7.100
Input #0, lavfi, from 'sine=frequency=1000:duration=5':
  Duration: N/A, start: 0.000000, bitrate: 705 kb/s
    Stream #0:0: Audio: pcm_s16le, 44100 Hz, mono, s16, 705 kb/s
Stream mapping:
  Stream #0:0 -> #0:0 (pcm_s16le (native) -> mp3 (libmp3lame))
Press [q] to stop, [?] for help
[icecast @ 0x5599a90f6d80] It seems you are streaming an unsupported format.
[icecast @ 0x5599a90f6d80] It might work, but is not officially supported in Icecast!
Output #0, mp3, to 'icecast://radio:HAX@localhost:8203/radio':
  Metadata:
    TSSE            : Lavf58.45.100
    Stream #0:0: Audio: mp3 (libmp3lame), 44100 Hz, mono, s16p, 128 kb/s
    Metadata:
      encoder         : Lavc58.91.100 libmp3lame
av_interleaved_write_frame(): Broken pipe
Error writing trailer of icecast://radio:asdf123$@localhost:8203/radio: Broken pipe
size=       0kB time=00:00:00.00 bitrate= 337.7kbits/s speed=0.0143x    
video:0kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
Conversion failed!

After restarting LS, the script works as expected. It only fails once this bug has been triggered, of course.

vitoyucepi (Collaborator) commented Jun 22, 2023

I'm not sure why, but so far I haven't experienced any issues.
After a day of connects (and disconnects), there were no errors.

If you have some free time, I can suggest additional methods to recreate the issue on my end.

I was able to recreate this on my dev box

Can you create a disk image for a virtual machine that includes all the packages, containers, and scripts needed?

This is not necessary at the moment, because you have an easy way to repeat the error.
It should be possible to produce a build that prints a stack trace for the error, but I don't know how to do that, so the only way for me to help is to find a stable way to reproduce the problem anywhere.

Alternatively, you could check if the error persists in savonet/liquidsoap:rolling-release-v2.2.x.

brenc (Author) commented Jun 22, 2023

Yeah I've been thinking about how to create a minimal config to reproduce this. I think I have an idea. Will try to get something to you soon.

vitoyucepi (Collaborator) commented:

@brenc, could you share the values of ulimit -a -S and ulimit -a -H, both inside the liquidsoap container and on the host system?
Did you check journalctl? Maybe something happened there.
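
For example, something along these lines should do it (a rough sketch; the container name is just a placeholder for whatever docker ps shows):

# inside the container (container name is a placeholder)
docker exec -it liquidsoap_container sh -c 'ulimit -a -S; echo ----; ulimit -a -H'

# on the host
ulimit -a -S
ulimit -a -H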

brenc (Author) commented Jun 23, 2023

# ulimit -a -S
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) 0
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 256745
max locked memory           (kbytes, -l) 8228984
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1024
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 256745
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited

# ulimit -a -H
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) unlimited
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 256745
max locked memory           (kbytes, -l) 8228984
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1048576
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) unlimited
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 256745
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited

In the container:

$ ulimit -a -S
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) unlimited
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 256745
max locked memory           (kbytes, -l) 64
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1048576
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) unlimited
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited

$ ulimit -a -H
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) unlimited
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 256745
max locked memory           (kbytes, -l) 64
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1048576
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) unlimited
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) unlimited
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited

No additional logs anywhere. I have a small API over Harbor that gets queried every second by the proxy servers and other processes. I think that is involved in triggering this.

vitoyucepi (Collaborator) commented:

Docker has a bug: moby/moby#38814

I'm not sure if this is related to that Docker issue, but can you try setting limits in the dev environment and reproducing the problem?

services:
  liquidsoap:
    ulimits:
      nofile:
        soft: 1024
        hard: 524288

Reference documentation: https://docs.docker.com/compose/compose-file/compose-file-v3/#ulimits
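
To confirm the override is applied, something like this should show soft 1024 / hard 524288 inside the container (assuming the compose service is named liquidsoap as in the config above):

docker compose exec liquidsoap sh -c 'ulimit -Sn; ulimit -Hn'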

brenc (Author) commented Jun 23, 2023

I can't see how this would be a Docker issue considering 2.1.3 works but 2.1.4 does not.

vitoyucepi (Collaborator) commented:

I don't know either, but it looks suspicious.

moby/moby#38814 (comment)

The soft limit of 1024 would resolve issues with programs which loop through the entire file-descriptor range, and is the largest value compatible with software which uses select(2).

Invalid argument in select().

Also, I have this issue with Docker and limits on my local host, so I have to rebuild the packages against the latest Go version.

toots (Member) commented Jun 26, 2023

Hi! There was a file descriptor leak with SSL in v2.1.4. Any chance you could try with rolling-release-v2.2.x? Please note that the syntax for using SSL has changed there, see: https://www.liquidsoap.info/doc-dev/harbor_http.html#https-support

brenc (Author) commented Jun 26, 2023

@toots that would make sense. I started to upgrade to 2.2 but had to roll back. Will stay at 2.1.3 for now.

vitoyucepi (Collaborator) commented:

Hi @toots,
I suppose you're talking about #3067. I don't think this is the case.
Could you create a build that will print traces for these events?

Error while reading from client: Invalid argument in select()
Metadata update may have failed with error: Invalid argument in select()

toots (Member) commented Jun 27, 2023

Just pushed a commit that will print the backtrace for these two errors when logging at log.level 4 or more.

vitoyucepi (Collaborator) commented:

@toots, I expected these changes in the 2.1.x branch, because @brenc can reproduce this issue only with 2.1.4 (2.1.3 is fine).

toots (Member) commented Jun 28, 2023

Hmm. It's gonna be complicated to rebuild v2.1.4 through the actions. If @brenc can compile locally via opam I'm happy to provide a patch, but at this point we're very close to the 2.2.0 release, so I'd rather focus on fixing the issue there, if it exists.

brenc (Author) commented Jun 28, 2023

@toots all good. I'm just about done moving to 2.2.

brenc (Author) commented Jun 29, 2023

This appears to be fixed in 2.2.x.

brenc closed this as completed Jun 29, 2023
brenc (Author) commented Jun 30, 2023

Spoke too soon. I rolled this out to production, and within a day LS started logging this:

[/radio_mp3:3] Connecting mount /radio.mp3 for radio@10.0.23.45...                                                      
[/radio_mp3:2] Connection failed: Unix.Unix_error(Unix.EINVAL, "select", "")
[/radio_mp3:3] Will try again in 3.00 sec

Also metadata updates fail:

[/radio_mp3.2:2] Metadata update may have failed with error: Unix.Unix_error(Unix.EINVAL, "select", "")

It's possible this is due to an error condition (and possibly resource exhaustion?). I see one of the Icecast servers was unreachable (we have three instances). This happened for a number of hours before the select errors started:

[/radio_mp3:3] Connecting mount /radio.mp3 for radio@10.0.23.45...
[/radio_mp3:2] Connection failed: could not write data to host: Unix.Unix_error(Unix.EHOSTUNREACH, "write", "")
[/radio_mp3:3] Will try again in 3.00 sec.

I fixed that and will see if this happens again.

brenc reopened this Jun 30, 2023
brenc (Author) commented Jul 1, 2023

OK so this happened again during normal running conditions. I'm seeing this regularly:

[/radio_mp3.3:2] Metadata update may have failed with error: Unix.Unix_error(Unix.EINVAL, "select", "")

And I'm seeing this if I try to connect and stream with butt:

[input.harbor.2:3] Decoding...
[input.harbor.2:2] Error while reading from client: Unix.Unix_error(Unix.EINVAL, "select", "")
[input.harbor.2:2] Feeding stopped: Avutil.Error(Invalid data found when processing input).

This doesn't seem to happen in dev/staging right now. The only real difference I can think of is that listeners aren't connecting and disconnecting regularly in those environments.

I double-checked some resource usage, like open files. That doesn't seem to be an issue.

# cat /proc/2845871/limits 
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        unlimited            unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             unlimited            unlimited            processes 
Max open files            1048576              1048576              files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       256745               256745               signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us

Currently LS has about 3K open files, which seems high considering we have just one listener at the moment.

I increased logging to 4 so we'll see if that produces any interesting output.

toots (Member) commented Jul 1, 2023

Yeah, this looks like an fd leak. Could you check:

ls -l /proc/<pid>/fd
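
If it's sockets that are leaking, a rough count like this (assuming pidof can resolve the liquidsoap process) would also help:

ls -l "/proc/$(pidof liquidsoap)/fd" | grep -c 'socket:'   # sockets only
ls "/proc/$(pidof liquidsoap)/fd" | wc -l                  # total fds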

brenc (Author) commented Jul 2, 2023

Yep looks like that's what it is. Here is some additional logging:

[harbor:4] Client logged in.                                                                                                                               
[harbor:4] PUT (source) request on /radio.
[harbor:4] Adding source on mountpoint "/radio" with type "audio/mpeg".
[decoder:4] Available decoders:                                                                                                                            
[decoder:4] ffmpeg (priority: 10)                                            
[decoder:4] mad (priority: 1)                                                                                                                              
[decoder:4] Selected decoder ffmpeg for mime-type audio/mpeg with expected content {audio=pcm(stereo)}
[threads:4] Created thread "harbor source feeding" (2 total).
[input.harbor.2:3] Decoding...                                                                                                                             
[input.harbor.2:2] Error while reading from client: Unix.Unix_error(Unix.EINVAL, "select", "")
[input.harbor.2:4] Raised by primitive operation at Unix in file "unix.ml" (inlined), line 402, characters 0-161
[input.harbor.2:4] Called from Tutils.wait_for.wait in file "src/core/tools/tutils.ml", line 345, characters 10-30
[input.harbor.2:4] Called from Harbor_input.http_input_server#feed.read.(fun).f in file "src/core/sources/harbor_input.ml", line 99, characters 24-110
[input.harbor.2:4] Called from Harbor_input.http_input_server#feed.read.(fun) in file "src/core/sources/harbor_input.ml", line 106, characters 20-24       
[input.harbor.2:4]                                                                                                                                         
[input.harbor.2:2] Feeding stopped: Avutil.Error(Invalid data found when processing input).
[threads:4] Thread "harbor source feeding" terminated (1 remaining).

Currently 1138 open files, so definitely over that soft limit.

brenc (Author) commented Jul 2, 2023

Here are the metadata update failure logs:

[/radio_mp3.3:2] Metadata update may have failed with error: Unix.Unix_error(Unix.EINVAL, "select", "")                                                             
[/radio_mp3.3:4] Raised by primitive operation at Unix in file "unix.ml" (inlined), line 402, characters 0-161                                                      
[/radio_mp3.3:4] Called from Cry.connect_sockaddr.check_timeout in file "src/cry.ml", line 141, characters 24-58                                                    
[/radio_mp3.3:4] Re-raised at Cry.connect_sockaddr in file "src/cry.ml", line 164, characters 4-38                                                                  
[/radio_mp3.3:4] Called from Http.unix_transport.object#connect in file "src/core/tools/http.ml", line 50, characters 15-64                                         
[/radio_mp3.3:4] Called from Cry.manual_update_metadata in file "src/cry.ml", line 693, characters 4-73                                                             
[/radio_mp3.3:4] Called from Icecast2.output#insert_metadata in file "src/core/outputs/icecast2.ml", line 501, characters 16-12

brenc (Author) commented Jul 7, 2023

Looks like it's mostly socket connections:

lrwx------ 1 2002 2002 64 Jul  7 15:17 32 -> 'socket:[264092829]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 33 -> 'socket:[264093918]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 34 -> 'socket:[264082258]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 35 -> 'socket:[264088139]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 36 -> 'socket:[264091087]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 37 -> 'socket:[264092711]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 38 -> 'socket:[264093066]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 39 -> 'socket:[264083326]'
l-wx------ 1 2002 2002 64 Jul  7 15:17 4 -> 'pipe:[264075097]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 40 -> 'socket:[264090274]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 41 -> 'socket:[264093431]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 42 -> 'socket:[264109195]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 43 -> 'socket:[264110129]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 44 -> 'socket:[264112264]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 45 -> 'socket:[264109240]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 46 -> 'socket:[264109356]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 47 -> 'socket:[264107606]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 48 -> 'socket:[264109399]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 49 -> 'socket:[264112017]'
lr-x------ 1 2002 2002 64 Jul  7 15:17 5 -> 'pipe:[264075098]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 50 -> 'socket:[264116483]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 51 -> 'socket:[264122632]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 52 -> 'socket:[264121766]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 53 -> 'socket:[264124590]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 54 -> 'socket:[264117934]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 55 -> 'socket:[264116974]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 56 -> 'socket:[264137810]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 57 -> 'socket:[264139037]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 58 -> 'socket:[264137046]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 59 -> 'socket:[264136209]'
l-wx------ 1 2002 2002 64 Jul  7 15:00 6 -> 'pipe:[264075098]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 60 -> 'socket:[264139385]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 61 -> 'socket:[264139360]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 62 -> 'socket:[264141345]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 63 -> 'socket:[264135470]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 64 -> 'socket:[264132157]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 65 -> 'socket:[264131416]'
lrwx------ 1 2002 2002 64 Jul  7 15:17 66 -> 'socket:[264131446]'

vitoyucepi (Collaborator) commented:

Hi @brenc,
How many sockets are open?
ls -l "/proc/$(pidof liquidsoap)/fd" | wc -l

brenc (Author) commented Jul 7, 2023

When the errors are happening it's always around 1080 FDs: less than 1100 but more than 1024. I can watch the number of open FDs steadily grow until this point.
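
To track the growth over time, I can leave a small loop like this running on the host (a rough sketch; assumes pidof resolves the process and the pid doesn't change between samples):

pid="$(pidof liquidsoap)"
while true; do
  printf '%s %s\n' "$(date -Is)" "$(ls "/proc/$pid/fd" | wc -l)"   # timestamp + fd count
  sleep 60
done >> fd_count.log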

toots (Member) commented Jul 7, 2023

Thanks! I'm gonna review the code to see if I can spot it.

Is your script still:

settings.log.level.set(6)

a = input.harbor(
  "radio",
  port=8001,
  user="source",
  password="hackme",
  buffer=6.0,
  timeout=10.0,
)

output.icecast(
  %mp3,
  mksafe(a),
  user="source",
  password="hackme",
  mount="/radio",
  host="icecast",
  port=8000,
)

How often does the DJ connect to input.harbor?

brenc (Author) commented Jul 7, 2023

More or less, yes. The DJ rarely connects. I have an API that gets 1-3+ connections/s, though: the proxy server connects every second to check that the API is up and running, and things like that. I think that's part of what's causing this.

I've been meaning to whip up a simple config to see if I can recreate this. Will try that now.
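
Something like this should approximate that polling pattern against a test instance (just a sketch; the URL is a placeholder, point it at whatever harbor endpoint the health checks actually hit):

while true; do
  curl -ks -o /dev/null "https://localhost:8000/"   # -k because the dev cert may be self-signed
  sleep 0.3
done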

brenc (Author) commented Jul 8, 2023

Right now there are 1319 FDs. I think the number might continue to increase until I restart LS.

Edit: 1300 of these are socket FDs. The number continues to increase until restart.

toots closed this as completed in 1374696 on Jul 8, 2023
toots (Member) commented Jul 8, 2023

Thanks for your patience. I believe that this is fixed now. A cleanup call got deleted when handling redirections in our internal http handlers. If we ever have a bugfix for 2.1.x this will surely be in it.

brenc (Author) commented Jul 10, 2023

OK, so this isn't fixed. FDs continue to increase until LS stops operating. Some additional observations once open FDs get above 1000:

  • LS doesn't appear to crash but it does stop streaming audio, getting new tracks to play, etc. The station reports as being down (Icecast off-air message).
  • LS does still respond to requests on the Harbor API I built.
  • Metadata updates still fail:
[/radio_mp3.3:2] Metadata update may have failed with error: Unix.Unix_error(Unix.EINVAL, "select", "")
[/radio_mp3.3:4] Raised by primitive operation at Unix in file "unix.ml" (inlined), line 402, characters 0-161
[/radio_mp3.3:4] Called from Cry.connect_sockaddr.check_timeout in file "src/cry.ml", line 141, characters 24-58
[/radio_mp3.3:4] Re-raised at Cry.connect_sockaddr in file "src/cry.ml", line 164, characters 4-38
[/radio_mp3.3:4] Called from Http.unix_transport.object#connect in file "src/core/tools/http.ml", line 50, characters 15-64
[/radio_mp3.3:4] Called from Cry.manual_update_metadata in file "src/cry.ml", line 693, characters 4-73
[/radio_mp3.3:4] Called from Icecast2.output#insert_metadata in file "src/core/outputs/icecast2.ml", line 505, characters 16-128
  • I can still connect with butt via Harbor SSL though. This doesn't seem to change anything. It accepts input but the station is still down. Eventually LS spits out this:
    [input.harbor.2:4] Generator max length exeeded (882000 < 883152)! Dropping content..
  • On production, after I connected via butt and then disconnected it was spitting this out continually until I restarted it. There were over 40K open FDs at the time.
    [harbor:2] Failed to accept new client: SSL accept() error: error:00000000:lib(0)::reason(0)
  • As far as I know, the client wasn't trying to reconnect.
  • I'm no longer getting select errors when I connect with butt.
  • I think I can get this to happen more quickly if I set a thread.run to make a simple HTTP GET request to an endpoint that returns a 200 every 0.10 seconds.
  • When I tried to gracefully restart LS, it wouldn't shut down. I had to hard restart it. I'm using nodemon to run LS on dev. It sends a SIGTERM to gracefully restart on config changes.

I'll continue to test and think of ways to easily recreate this.

brenc (Author) commented Jul 10, 2023

Another observation: open FDs continue to rise without the additional http.get requests I was doing. It's a slow, steady rise; it takes about 12-16 hours to reach 1000+, at which point LS starts having problems.

toots (Member) commented Jul 10, 2023

@brenc any chance you could provide an extract of lsof -p <pid>? That would give more context about the leaking FDs.

brenc (Author) commented Jul 10, 2023

Yep here it is. A little over 200 open FDs right now.
lsof_output.txt

brenc (Author) commented Jul 11, 2023

Looks like this piece of code is the culprit. If I uncomment this, FDs/open sockets increase:

let ngr.inputs.harbor_ssl = input.harbor(
  "radio",
  buffer=6.0,
  port=8000,
  timeout=10.0,
  transport=ngr.transports.ssl,
  user=icecast_mount_radio_usr,
  password=icecast_mount_radio_psw,
)

Here's the transport:

let ngr.transports = ()

let ngr.transports.ssl = http.transport.ssl(
  certificate="/app/star_newgroundsradio_com.cert",
  key="/app/star_newgroundsradio_com.key",
)

Non-SSL Harbor doesn't do this.

toots (Member) commented Jul 11, 2023

Thanks. I'm running this:

Server:

settings.log.level.set(4)

transport = http.transport.ssl(
  certificate="/tmp/ssl.cert",
  key="/tmp/ssl.key"
)

a = input.harbor(
  "radio",
  transport=transport,
  port=8001,
  user="source",
  password="hackme",
  buffer=6.0,
  timeout=10.0,
)

output.dummy(fallible=true, a)

Client:

s = playlist("/tmp/playlists")

s = insert_metadata(s)

counter = ref(0)

thread.run(every=1., {begin
  s.insert_metadata([("title","counter #{counter()}")])
  counter.set(counter()+1)
end})

transport = http.transport.ssl(
  certificate="/tmp/ssl.cert",
  key="/tmp/ssl.key"
)

output.icecast(
  transport=transport,
  fallible=true,
  port=8001,
  mount="radio",
  %mp3,
  s
)

To see if I can reproduce.

brenc (Author) commented Jul 13, 2023

OK, I have finally figured out how to replicate this reliably, using the following config:

settings.log.file.set(false)
settings.log.level.set(3)
settings.log.stdout.set(true)

ssl_transport = http.transport.ssl(
  certificate="cert.pem",
  key="key.pem",
)

harbor_ssl = input.harbor(
  "radio",
  buffer=6.0,
  port=8000,
  timeout=10.0,
  transport=ssl_transport,
  user="radio",
  password="radio",
)

output.dummy(sine(440.))

In one window run:

watch -n1 'ls /proc/$(pidof liquidsoap)/fd | wc -l'

In another run:

openssl s_client -connect HOST:PORT

When you connect, you'll see the FD/socket count increment. If you do a normal request, e.g.:

HEAD / HTTP/1.1
[hit enter]

then after it disconnects you'll see the FD count decrement as expected. However, if you connect and immediately press ctrl-c to disconnect, you'll see the socket count increment but not decrement.

What causes this in production is the health check in haproxy. I use a TCP frontend to proxy SSL requests to LS. The health check runs every 2 seconds to see if the port is responding, but it doesn't consistently increment the FD count. My theory is that it doesn't always close the connection cleanly, which triggers this bug. If I disable the health check, the FD count stays steady.

Interestingly, if I simply telnet to the port and then ctrl-c, the FD gets cleaned up properly. So it seems that the connection has to go through the SSL negotiation first.
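
For a scripted version of the abrupt disconnect (a sketch, assuming the SSL harbor from the config above on localhost:8000; killing s_client hard skips the clean TLS shutdown, just like ctrl-c does):

for i in $(seq 1 20); do
  # complete the TLS handshake, then kill the client without a proper shutdown
  timeout -s KILL 0.5 \
    openssl s_client -connect localhost:8000 -quiet -ign_eof < /dev/null > /dev/null 2>&1
done
# the watch output above should now show roughly 20 more open fds if the leak is present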

vitoyucepi (Collaborator) commented:

@brenc, after researching the file limit problem, I have concluded that setting the soft limit to 1024 would be better. Potentially it can prevent unexpected segfaults or memory corruption related to the Linux select function.
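
For what it's worth, the soft limit can also be lowered on an already-running process from the host, assuming util-linux's prlimit is available (the values mirror the compose example above and are only a suggestion):

prlimit --pid "$(pidof liquidsoap)" --nofile=1024:524288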

@toots, did I come to the correct conclusion? Is it possible to move away from select?

toots (Member) commented Jul 14, 2023

@vitoyucepi everything is switched to poll whenever available now!

toots (Member) commented Jul 19, 2023

After some serious digging by @brenc and a full reproduction setup at https://github.com/brenc/lsfdbug, it would appear that this bug is now closed!
