At about 70 client connections the client/server do not show/list all clients anymore #547
I did a quick test where I created 100 artificial clients with the following code in socket.cpp:
But I cannot reproduce the issue. I could see all 100 faders in the client, and the server table was also complete. Maybe the problem was caused by the CPU being at 100%, which can cause weird effects. |
We were doing undocumented tests last night on a server I've compiled to allow 255 clients and had similar issues: on one of my clients I could see 102 clients, while the other was stuck at 53. I was watching htop and no CPU (4 total) was over 50%. I plan to do this test properly, with properly recorded results, over the next few days. Audio at 102 clients was badly broken but intelligible. |
I just completed the following test and observation. First, I started with 30 connections on the server. The client could see those 30 connections. Then I added another 20 clients slowly, one every 5 seconds, so as to not overrun the server with fast connection requests. Then someone else came on (a client with a different name). However, with 50 connections, I could not see that person on my Jamulus client (Windows 10), even though I could see him on the master list as viewed with my connection status panel. I then reduced the number of connections (by stopping each) until I got down to about 25 connections. Now I could see all of them without having to scroll sideways, but I still could not see the one client whom I was talking with. I then stopped and restarted the client, and then I did see him. This suggests to me that there may be some data stuck in the client that had to be reset by a restart of the client. Throughout all of this, the sound from that one client was very good, despite the fact that I could not see his fader on my panel, and there were only occasional instances when any of the four dedicated CPUs hit over fifty percent. |
@softins Maybe the issue is related to #255. The client list for the audio mixer board gets quite big in case of 100 clients (similar to the server list). Is it possible that you support us with a wireshark analysis of that situation? Maybe if a test is running with more than 50 clients, you could log onto that session and check the protocol messages sent to your client? |
Sure, I'd be happy to help, although not available today (Sat). |
Just a couple of general comments before I disappear for the day:
|
The protocol transmits messages one by one, in the order the messages are scheduled. If one message does not make it through to the client, because of fragmentation for example, the protocol mechanism will be stuck trying to retransmit that one message and does not process any further messages. That would explain the described behaviour, I think (or rather, I guess, so it would be good to get proof of that by checking the network traffic with Wireshark). |
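The head-of-line blocking described above can be sketched as follows. This is a hypothetical Python model, not the actual C++ Jamulus code: queued protocol messages go out strictly in order, and the sender keeps retransmitting the current message until it is acknowledged, so one undeliverable (e.g. fragmented and dropped) message blocks every message behind it.

```python
from collections import deque

class StopAndWaitSender:
    def __init__(self):
        self.queue = deque()      # messages waiting to be sent
        self.in_flight = None     # (seq, payload) awaiting an ack
        self.next_seq = 0

    def enqueue(self, payload):
        self.queue.append(payload)

    def next_transmission(self):
        # Keep retransmitting the unacked message; only pull a new one
        # from the queue once the previous one has been acknowledged.
        if self.in_flight is None and self.queue:
            self.in_flight = (self.next_seq, self.queue.popleft())
            self.next_seq += 1
        return self.in_flight

    def on_ack(self, seq):
        if self.in_flight and self.in_flight[0] == seq:
            self.in_flight = None  # unblocked; the queue can advance

sender = StopAndWaitSender()
for msg in ["clients-list", "levels", "chat"]:
    sender.enqueue(msg)

assert sender.next_transmission() == (0, "clients-list")
assert sender.next_transmission() == (0, "clients-list")  # no ack: stuck
sender.on_ack(0)
assert sender.next_transmission() == (1, "levels")  # now it advances
```

If the ack for "clients-list" never arrives, every later message (levels, chat, and so on) stays queued forever, which matches the symptom of the client list freezing while audio continues.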
Okay, thank you for the suggestion, Simon. Sometime next week when I have time, I will set everything up again, but install Wireshark on both the server running the clients and the Jamulus server itself, and somehow capture the screens and record them as a video. I have a question: can I attach a video, or at least a link to a video, to this ticket? Or better yet, can I set up a Jitsi or Zoom session and show the stuff live? |
Thanks for offering the test. I guess it would be easier if you talk to softins when you start your test so that he can connect to your server and capture the wireshark output on his side. So you should do the test when you and softins have time for it. |
Hi Mark, there's no Simon on this thread. I'm Tony (also on Facebook). Happy to liaise with you in the week. I'm on UK time, I think you are EDT? If you haven't seen it, check out my repo https://github.com/softins/jamulus-wireshark |
I'm embarrassed that I made this mistake. I keep thinking that Corrados is Simon. Thank you for straightening me out. If you are in UK, then the best time for me would be my morning, which is your late afternoon. So, if I try to do this around nine in the morning Pacific U.S. Time, that would be six in the evening for you, is that an okay time? |
I was looking at the original error description again. Here's what I found:
I checked in the code of the client that the total count (in this case 81) is only shown if a client list protocol message for the audio mixer board was correctly and completely received. So maybe the issue is not related to #255. But I think it still does make sense to check the protocol anyway under this stress situation to check that everything works as expected. |
Except for the week straddling Oct/Nov (we end DST a week earlier than the US), we are 8 hours ahead of Pacific time, so your 9am is UK 5pm. That would be fine for me. I normally go to eat about 6pm or so. I think any day this week works for me. You can find me on Discord as Tony M or on Facebook Messenger as Tony Mountifield |
I completed a test where I had 60 real connections (not using Volker's modification to the socket.cpp file). Those 60 connections were made from another server in the cloud. Then I made a connection from my client, which was under a different name and could be identified on the mixer panel. I confirm that, with the latest download for Windows 10, it could be seen at the far right end of the mixer with the bottom slider all the way to the right. Unfortunately, I cannot create over 60 clients from an Ubuntu server because the JACK software starts to break down, which is an unrelated issue here. |
@storeilly Today I did some multithreading tests on your "jam mt 26" server and could reproduce the issue. And I think I now know what the issue is. If you quickly start a massive number of clients, we get huge protocol traffic to the clients. For each new client that connects to the server, the complete client list is updated and the mixer levels are also immediately updated, so there is a massive number of protocol messages in the queue. According to the Jamulus protocol MAIN FRAME design, there is a "1 byte cnt" which identifies the messages in the queue, so we can have 256 different messages. If that massive number of messages is transmitted in a very short time, the order of the received network packets can change, and a packet which should be processed 256 protocol messages later may arrive early. Since the counter wraps around, the protocol mechanism thinks it has received the correct message and acknowledges it, but it was the incorrect one (caused by the wrap-around). If that happens, the protocol system gets stuck and no message is delivered anymore. One solution would be to use 2 bytes instead of just one for the counter, but that would break compatibility with old Jamulus versions (client and server), which is not good. @softins Do you think that with your wireshark tools you could prove that my assumption is true? |
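The wrap-around hazard described above can be shown in a few lines. This is an illustrative sketch under the stated assumption of a 1-byte counter, not the Jamulus wire format itself: an 8-bit sequence number cannot distinguish message n from message n+256, so a badly reordered packet from "one lap later" matches the expected counter value and gets wrongly accepted and acknowledged.

```python
SEQ_MOD = 256  # the "1 byte cnt" in the Jamulus MAIN FRAME design

def receiver_accepts(expected_msg_no, packet_msg_no):
    # The receiver only sees the counter modulo 256, so it cannot tell
    # which "lap" a packet belongs to.
    return packet_msg_no % SEQ_MOD == expected_msg_no % SEQ_MOD

# Message 5 and message 261 (= 5 + 256) look identical on the wire:
assert receiver_accepts(5, 5)
assert receiver_accepts(5, 5 + 256)   # stale/reordered packet accepted!
assert not receiver_accepts(5, 6)     # ordinary mismatches are rejected

# A 2-byte counter would disambiguate the two messages (at the cost of
# breaking compatibility with old clients/servers):
assert (5 + 256) % 65536 != 5 % 65536
```

This is why the problem only shows up under a burst of 256+ protocol messages in flight within a short window, i.e. exactly the mass-connect scenario.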
I see the logs, you were connecting at about 40 per second at one stage. Well done!! I don't have the ability to do that! I was working with @brynalf and we were leaving 5 seconds between each new connection, not sure if this helps? |
I've just installed Tshark on that machine, so if you want to 'hit' it again, just give me a little notice to start the capture, and we can send it to @softins or yourself for analysis. |
Just seen this. Happy to help. I usually capture using I have a script that sets up
You may need to change the interface name from |
@corrados It's certainly possible that the 8-bit sequence number rolled over. Happy to verify that if I can reproduce it or be sent a capture file. If that is what is happening, then maybe when sending to a client, it could pause before using a sequence number that is still unacked from the previous time, and wait until the ack comes in? This situation would seldom occur in practice, so on the rare occasion it does, a few ms pause would be tolerable, I would think. I haven't looked at the code, but that is how I would initially approach it. |
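The mitigation suggested above can be sketched as follows. This is a hypothetical model of the idea, not merged Jamulus code: before reusing a sequence number that is still unacknowledged from its previous lap, the sender pauses, so the same 8-bit counter value is never ambiguous on the wire.

```python
SEQ_MOD = 256

class SafeSender:
    def __init__(self):
        self.next_seq = 0
        self.unacked = set()  # counter values still in flight

    def try_send(self, payload):
        seq = self.next_seq % SEQ_MOD
        if seq in self.unacked:
            return None  # pause: this counter value is still unacked
        self.unacked.add(seq)
        self.next_seq += 1
        return (seq, payload)

    def on_ack(self, seq):
        self.unacked.discard(seq)

s = SafeSender()
assert s.try_send("msg-0") == (0, "msg-0")

# Fast-forward one full lap without msg-0 ever being acked:
s.next_seq = 256
assert s.try_send("msg-256") is None  # sender pauses instead of reusing 0
s.on_ack(0)
assert s.try_send("msg-256") == (0, "msg-256")  # safe to reuse now
```

As noted above, the pause would only trigger in the rare case of 256 messages in flight, so a few milliseconds of delay there would be an acceptable trade-off against breaking protocol compatibility.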
The Jamulus dissector for Wireshark is a single |
Thanks for all your support. By looking at the code I have found a possible bug in the protocol mechanism. Hopefully I'll get some time this evening to investigate this further. I'll keep you informed of my progress. |
Folks: Here is what I was finally able to do to overcome the restriction of running 60 clients on jackd. I created a second user on the Ubuntu server that I am using to send the clients to newark-music.allyn.com, and had that new user also send 60 clients over to newark-music, as it had its own instance of jackd using a separate aloop audio device. I configured the script to wait 2 seconds between each client invocation so as to not overrun the server. After about 70 to 80 clients were sent to the server, I noticed that the listing from the master server started to have a big gap of empty lines in the listing for newark-music, before the listing for the next Jamulus server. The total count listed at the top (next to the server listing itself) did say 100, but apparently not all 100 were listed; those after about 70 or so had a blank line. I re-ran the scenario with no delay between client invocations and the effect was the same. So to me, it does not seem to be a speed-overrun issue. I also noticed that I could not do a systemctl restart of the Jamulus server; I had to do a full reboot of the machine. After I rebooted the machine, the master server's listing for the newark server remained stuck for a full three or four minutes, until it finally reset to 0 at about the time that the newark server finished its reboot. That seems to indicate to me that the master server does not correct the listing for a while after my server was rebooted. I am wondering if the issue is with the master server being overloaded. I checked the logs on the newark server and saw no error indications. All of these tests were made with no music sent on any of the clients. I hope this all helps. Mark Allyn |
Unfortunately it turned out not to be a bug.
I also do not think that this is the case anymore after I have checked some things today.
That is interesting. I just ran a set of tests this evening and found the opposite. When I start the clients without a delay, there is a threshold of 58 clients before the server gets confused and all sorts of strange things happen. If I put a delay of about a second after the creation of each test client, I can start more than 58 and do not see the issue. I'll investigate further... |
Just out of curiosity, I slowed down the script so that it issued a client connection once every 20 seconds. This got interesting. The missing connections in the listing from the master server got fewer: it made it to about 80 connections (instead of upper 60s/low 70s). But there was still a gap in the listing, and I could not connect my own client from my PC after we had 90 connections (the server has a capacity of 100). At 45 connections, I initiated my own connection from my PC and was able to hear myself. However, after about 75 connections, my return sound was very warbly and distorted. I checked htop and found no CPUs hitting over 80 percent, and I could see all four CPUs engaged. This is a dedicated CPU instance on Linode/Newark. I checked network and disk utilization on the Linode dashboard, and the network never hit more than about 7 MB outbound; disk and memory were only nominal. I am wondering if we have both a performance issue (the ability to handle fast multiple connections) and a functional one; slowing connection requests to one per 20 seconds reduced, but did not eliminate, the issues. This entire session would have resulted in too big a file if I had run tcpdump. I hope this all helps; if there is anything more I can try, please let me know. Mark |
How do I get and compile in your change? Just do a new git sync or do I need a tag or CONFIG? |
If you are in Git master, a |
The network only hit 80MB at about 21:30 last night (the resolution drops on AWS as time progresses). The network capacity is 7TB so I doubt that is the issue. I'll build that commit shortly. Thanks @corrados |
There was nothing of interest in that packet file. Just a short-lived connection from a client in Malaysia at around 17:43 UTC yesterday. |
@maallyn Now that's interesting, and I'd like to observe that. I now have a tcpdump on the backend of Jamulus Explorer (capturing specific IP addresses including newark-music), and a corresponding one on newark-music (capturing only traffic with my server and with client.allyn.com). If you can rerun your big test when convenient, I'll look at the traces. Please ping me on discord before you do, so I can make sure I am watching Explorer |
OK, I have looked at the packet traces from both newark-music and jamulus.softins while you did the test of one new client every 20 seconds. As we saw, jamulus.softins stopped displaying the Version/OS and client list for newark-music once the number of clients reached 62. This is partly due to the design of the jamulus explorer backend: it sends out all its pings to the servers in the server list; when it gets a ping back it sends a version/OS request and a client list request. Because some servers will not respond, it needs to wait until it has received no packets for a certain length of time. This idle timeout is currently 1.5 seconds, which is usually plenty, after which it sends the accumulated data back to the jamulus explorer front end. Increasing the idle timeout makes it take longer for the front end to display when switching genres. But when the number of clients in the jamulus server reaches a threshold, the delay in responding starts to increase disproportionately, making the replies too late to be caught by the jamulus explorer client. This is what I observed in this test:
I looked at the other traffic at the time, and while there is a lot of traffic taken with sending level lists and client lists to the connected clients, there are still a lot of gaps, indicating that it is not due to network saturation. It is interesting to see how the delay increases so much. |
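The idle-timeout behaviour of the Explorer backend described above can be modelled in a few lines. This is a simplified sketch of the assumed behaviour, not the actual backend code: replies are accumulated until no packet has arrived for the idle timeout, so any server whose reply is delayed past that point simply never appears in the result.

```python
IDLE_TIMEOUT = 1.5  # seconds, the value mentioned above

def collect(replies):
    """replies: list of (arrival_time, server) sorted by arrival time."""
    collected, last = [], 0.0
    for t, server in replies:
        if t - last > IDLE_TIMEOUT:
            break  # idle timeout fired; any later replies are lost
        collected.append(server)
        last = t
    return collected

# A server whose client-list reply arrives 3 s after the previous
# packet never shows up, even though it did eventually answer:
assert collect([(0.1, "a"), (0.4, "b"), (3.5, "newark-music")]) == ["a", "b"]
```

This matches the observation that once the Jamulus server's response delay grows disproportionately past the threshold, its Version/OS and client list silently vanish from the Explorer display.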
@maallyn On client.allyn.com, I have also made an updated version of your junker script as junker2 to give the clients individual names:
#!/bin/bash
for i in {1..46}
do
sleep 20
NAME=`echo -n Test $i | base64`
INIFILE=".jamulus$i.ini"
echo "<client><name_base64>$NAME</name_base64></client>" >$INIFILE
/home/maallyn/jamulus/Jamulus -i $INIFILE -j -n --connect 172.104.29.25 >/dev/null 2>&1 &
done |
Thank you, Tony! This saves me from having to dig into the code!
Mark |
If replying to a github message by email, you need to avoid quoting the message being replied to! I discovered this myself the other day. |
I just did a test using the feature_protosplit branch. Here is what I did on newark-music.allyn.com to do the build for both client and server:
# Do any local changes here
git apply /home/maallyn/max-client.patch
# move to client
cd /home/maallyn
# do compile on client
ssh maallyn@client.allyn.com
rm -f client_compile.sh
cat << 'EOF' > client_compile.sh
cd jamulus
EOF
chmod +x client_compile.sh
# do compile here
cd jamulus
qmake $server_qmake Jamulus.pro
The first set of 46 clients launched okay. However, after about client 28 of the 2nd set, the server stopped reporting and the output on jamulus.softins.co.uk collapsed. When I did a killall, the server restored proper operation without me having to reboot or restart. |
Do you run the test clients and the server on the same PC? |
To Volker: I have two machines in the cloud in the same data center. One runs the server. The other runs the clients which are run via vncserver. Both are in the Linode Newark data center. If you need to have access and look around, I can install your ssh key in them. Tony already has access. |
That is interesting. In my test today I could run about 70 clients on storeilly's server and it still worked with good audio quality. The question is why your server handles so many fewer clients... |
Note that there are limitations with Jamulus Explorer:
This exponential delay in the Jamulus server responding is a separate issue that will need investigation at some point. |
Tony: The one major exception would be for choirs and perhaps a large orchestra. I have been trying to sell this to my choirs, one of which has 50 voices and the other 130. However, I am beginning to feel that I should not be pushing my choir members to use a desktop client that looks like a sound mixer and may be intimidating to members of my church / chorus who are techno-phobic and just want something simple and plug-and-play to participate in, like Zoom. I have already got strong pushback from other members of my Unitarian Fellowship's audio-visual and tech committee, which I am now beginning to agree with. If there were a very simple client, without the faders and the VU meters, would the traffic handled by the server be less, and would the issue you are seeing with 60 to 70+ connections go away? |
It may be related. I did some tests yesterday between separate client and server machines on my LAN, while monitoring the interface data rates using SNMP. I found the bandwidth usage on a server increased linearly with the number of clients, which makes sense, since it sends and receives one stream to/from each client. I only went up to 21 clients, so wasn't pushing the limits - I would need more client machines than just my Raspberry Pi to really exercise the server. However, thinking about it, the demands placed on the system by the mixing will increase by the square of the number of clients: with N clients connected, and each client having their own separate mix generated, the server will be producing N mixes, each from N streams. So as the number of clients increases, there will come a point where it quickly degrades and can't keep up. That will depend on the power of the hardware. As I understand it (I'm still trying to fully understand the code structure), there is a high priority thread that handles the audio mixing, and a lower priority thread that is responsible for handling the protocol. I don't yet know how that is affected by the multi-threading enhancements. I'm not sure that a simpler client would reduce the traffic enough to be significant, when compared to the bandwidth consumed by the actual audio, and the N² factor on the mixing. However, I can certainly see the benefit of such a client for user friendliness, where the users do not need/want to have their own fine control of a large number of participants. |
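The scaling argument above can be put into a back-of-envelope sketch. The numbers here are illustrative, not measured Jamulus costs: with N clients the server builds N personal mixes, each summing N incoming streams, so mixing work grows as N squared while bandwidth grows only linearly.

```python
def mix_operations(n_clients, frame_samples=128):
    # One mix per connected client, each mix combining every client's
    # stream, for every audio frame: N * N * samples.
    return n_clients * n_clients * frame_samples

def streams(n_clients):
    # Bandwidth: one stream in and one stream out per client.
    return 2 * n_clients

# Doubling the clients quadruples the mixing work...
assert mix_operations(40) == 4 * mix_operations(20)
# ...but only doubles the number of audio streams on the wire:
assert streams(40) == 2 * streams(20)
```

This is consistent with the earlier observations: network and CPU graphs can look unalarming at moderate client counts, yet the system degrades quickly once N pushes the quadratic mixing term past what the hardware can sustain in each frame period.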
Maybe for a choir kind of usage, we need some kind of architecture where there is a generic mix produced, that can be controlled by one person, but that is then sent identically to most of the clients. That would reduce both the mixing load and the technical usability burden on those users. There could still also be clients that have their own custom mix. But these would be ideas for Jamulus V4. |
Tony: |
Even large choirs will need to mix their individual signal with the "supplied" mix, so does that still need n * n mixes? I've been under pressure and haven't had time to contribute to this for the last two weeks, but I am still very much engaged. I don't think we've figured out yet why the system creaks without the CPUs or network peaking! I don't think the choirs can wait for V4. Mine can't anyhow; we will make something work. Please don't drop this! |
I was thinking about a setup where the mix would be done by the choir director. He or she would have the 'normal' Jamulus client with all of the faders. Then, each choir member would have a simple client with just a master volume control. In fact it could be 'pre-provisioned' such that it would 'know' which server to connect to, which would be even better for those choir members who are afraid of technology. There are some in my two choirs who are almost afraid just to turn the computer on. |
Have you tried fast server hardware with Jamulus multithreading enabled and all musicians using Mono mode? If the musicians do not all connect at the same time (which is usually not the case), the protocol traffic should be low enough for a session even if a lot of clients are connected. |
Jamulus uses different modes like "Small network buffers" and "Mono/Stereo". To keep such a proposed implementation simple, would it be an option that in your new mode only large buffers (128) and Mono are allowed? That would make the implementation easier, so that the one master client only has to generate and store a single block of OPUS-coded audio data. |
I'd just like to chime in that I'm similarly trying to get this going for high school bands during COVID restrictions, and I also would benefit from being able to have a "director-controlled" mix, with all clients receiving the same mix. That potentially eliminates the n^2 problem as well as simplifying the user interface for those that don't need a mix, and possibly reducing the amount of info each client needs to get about the other channels, so smaller messages. I'm hoping to be able to contribute some help, but I'm just ramping up on the code so I don't have a ton to offer yet. Just to brainstorm a possible approach, possibly there could be two server UDP ports. One is a "director" port and functions as today. The other is a "member" port and which is included in the mix but simply gets back a replica copy of one of the "director" mixes, verbatim. No custom mix if you're connected on the member port. If more than one person has joined via the director port, possibly a member could select which of those mixes they use. Number of directors could be limited to the old max, like 15 or 20 or whatever, while members could go much higher. Another option would be to just have one person be the director, and maybe let that person "pass the baton" (if they choose) to a different person, who then becomes the one-and-only director. Everyone else just gets a replica mix. Or you could password-protect the director port. But these options diverge further from the existing mode. In my case at least, locking to 128-byte buffers would most likely be acceptable. I favor Mono-in/Stereo-out I think, if I understand its function correctly (so that members can be given separation across left-to-right by the director,) but the specific mode would not be a dealbreaker in any case. |
I ran some tests with 40 clients which worked fine, but adding even a few more caused issues with the last clients having no name information and poor audio. It was a pretty predictable line at 40, which led me to suspect it was adding a 3rd thread that caused issues. Also doing math on the ConnClientsList message, that's right around where it would reach MTU size for the packet. One plausible hypothesis on why you're seeing failure start at different numbers of clients - @maallyn you are setting client names, right? @corrados are you? That message is variable length so having longer client names would get that message to MTU faster. At least in my case I don't think fragments will survive the journey. |
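The MTU arithmetic behind the observation above can be sketched roughly. The field sizes here are assumed for illustration, not taken from the actual Jamulus wire format: a single clients-list message carries a fixed chunk per client plus a variable-length name, so longer names push the message past a typical 1500-byte Ethernet MTU at fewer clients.

```python
MTU = 1500
HEADER = 11  # assumed per-message frame overhead, purely illustrative

def list_message_size(n_clients, name_len, fixed_per_client=16):
    # Each client entry: some fixed fields plus a variable-length name.
    return HEADER + n_clients * (fixed_per_client + name_len)

# With short names, 40 clients still fit in one unfragmented packet;
# with longer names the same 40 clients already exceed the MTU, which
# would shift the failure threshold as suggested above:
assert list_message_size(40, name_len=8) < MTU
assert list_message_size(40, name_len=24) > MTU
```

If UDP fragments do not survive the path, the oversized message is never delivered, the stop-and-wait mechanism keeps retrying it, and every later protocol message is blocked, matching the "no name information" symptom.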
I do not want to make big changes to Jamulus to support this. Here is my specification which will be quite easy to implement, just adding a new command line argument to the server like --singlemix:
There is a vecvecbyCodedData buffer which is used for both encoding and decoding. I'll introduce a separate buffer so that I can re-use the output buffer of the first client for all other clients. So instead of calling MixEncodeTransmitData for all the other clients, they simply get vecChannels[iCurChanID].PrepAndSendPacket ( &Socket, vecvecbyCodedData[iDirectorID], iCeltNumCodedBytes );. I just did a quick hack: if I modify CreateChannelList so that no client is added, the audio mixer panel is just empty. This would be the case for the slave clients. But then they do not see how many clients are currently connected, which is not a big issue. If "--singlemix" is given, "-F" and "-T" are deactivated and a warning is shown that these cannot be combined. In the OnNetTranspPropsReceived function we can check that the client uses 128 samples and, if not, refuse the connection. |
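The buffer-reuse idea above can be modelled as a small sketch. This is a hypothetical Python model of the --singlemix data flow, not the C++ implementation: the director's mix is encoded exactly once per frame, and the identical coded block is sent to every connected client instead of running a per-client mix-and-encode.

```python
def serve_frame_singlemix(channels, director_id, encode, send):
    # Encode only the director's mix (the analogue of reusing
    # vecvecbyCodedData[iDirectorID] described above)...
    coded = encode(channels[director_id])
    # ...then every connected client receives the same coded data,
    # turning N encode passes into one.
    for chan_id in range(len(channels)):
        send(chan_id, coded)

sent = []
serve_frame_singlemix(
    channels=["mix0", "mix1", "mix2"],
    director_id=0,
    encode=lambda mix: f"opus({mix})",   # stand-in for the OPUS encoder
    send=lambda cid, data: sent.append((cid, data)),
)
assert sent == [(0, "opus(mix0)"), (1, "opus(mix0)"), (2, "opus(mix0)")]
```

Requiring a single buffer size (128 samples, Mono) is what makes this reuse safe: every client expects an identically framed OPUS block, so one encoded buffer fits all.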
I just created a new branch for this: https://github.com/corrados/jamulus/tree/feature_singlemixserver |
I just created a new Issue for that: #599 |
@kraney You can start testing now. The current version of the code does not implement any checks for correct client settings. So you have to make sure:
All clients see the full mixer panel but only the first connected client actually controls the mix. If the other clients move the faders, nothing will happen in the audio mix. Have fun :-). Feedback welcome. |
I'll close this Issue now since the original issue caused by UDP packet drops should be solved by the "split protocol messages" fix which is already implemented. The discussion about the singlemix server should be continued in the new Issue I created. |
Just to let you know, now that this is closed, I went ahead and destroyed my test server client.allyn.com |
See the report in this post: #455 (comment)