docker teamd restart failed to create all port channels #1154

Closed
stcheng opened this issue Nov 15, 2017 · 4 comments

@stcheng
Contributor

stcheng commented Nov 15, 2017

The current master:

root@str-s6000-on-2:/home/admin# show version 
SONiC Software Version: SONiC.HEAD.347-cea87e9
Distribution: Debian 8.9                      
Kernel: 3.16.0-4-amd64                        
Build commit: cea87e9                         
Build date: Wed Nov 15 01:12:28 UTC 2017      
Built by: sonicbld@jenkins-slave-phx-1        

After running systemctl restart teamd, not all teamd daemons start.

Running the teamd command manually produces the following error:

root@str-s6000-on-2:/etc/teamd# teamd -f /etc/teamd/PortChannel8.conf -d
This program is not intended to be run as root.
Daemon process failed.
Failed: File exists

----------------------------------------------------------------------------------------------------

Update:

The device I am using:

root@str-s6000-on-2:/home/admin# show platform summ
Platform: x86_64-dell_s6000_s1220-r0
HwSKU: Force10-S6000
ASIC: broadcom

What I see:
The port channels do not start successfully: sometimes only some of them are created, and sometimes none are. Inside the teamd docker container, I see something like the following:

root@str-s6000-on-2:/# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.7  0.1  47892 16172 ?        Ss+  01:42   0:00 /usr/bin/python /usr/bin/supervisord
root        77  0.0  0.0 258684  3232 ?        Sl   01:42   0:00 /usr/sbin/rsyslogd -n
root        82  0.0  0.0  20052  2868 ?        S    01:42   0:00 bash /usr/bin/teamd.sh /etc/teamd/PortChannel0.conf
root        83  0.0  0.0  20052  2888 ?        S    01:42   0:00 bash /usr/bin/teamd.sh /etc/teamd/PortChannel8.conf
root        85  0.0  0.0  20052  2840 ?        S    01:42   0:00 bash /usr/bin/teamd.sh /etc/teamd/PortChannel16.conf
root        87  0.0  0.0  31364  3144 ?        S    01:42   0:00 teamd -f /etc/teamd/PortChannel0.conf
root        88  0.0  0.0  31364  3160 ?        S    01:42   0:00 teamd -f /etc/teamd/PortChannel8.conf
root        91  0.0  0.0  31364  3196 ?        S    01:42   0:00 teamd -f /etc/teamd/PortChannel16.conf
root        99  0.1  0.0 108404  4264 ?        Sl   01:42   0:00 /usr/bin/teamsyncd
root       221  0.2  0.0  20256  3332 ?        Ss   01:43   0:00 bash
root       249  0.0  0.0  17504  2056 ?        R+   01:44   0:00 ps aux

The output shows that not all port channels have started. The ones that did start are also in a bad state:

root@str-s6000-on-2:/# teamdctl PortChannel0 state
teamdctl_connect failed (Connection timed out)
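The comparison between the expected port channels and the teamd daemons visible in `ps aux` could be automated roughly as follows. This is only a sketch: the function names are made up for illustration, the regular expression assumes the command line looks exactly like the `teamd -f /etc/teamd/PortChannelN.conf` entries above, and the expected list has to be supplied by the caller.

```python
import re

# Matches the daemon lines from `ps aux`, e.g.
#   teamd -f /etc/teamd/PortChannel0.conf
# but not the "bash /usr/bin/teamd.sh ..." wrapper lines.
# Assumption: teamd is invoked without a path prefix, as in the output above.
TEAMD_CMD = re.compile(r"(?<!\S)teamd -f /etc/teamd/(PortChannel\d+)\.conf")


def running_port_channels(ps_output):
    """Return the set of port channels that have a teamd daemon in the ps output."""
    return {match.group(1) for match in TEAMD_CMD.finditer(ps_output)}


def missing_port_channels(expected, ps_output):
    """Return the expected port channels with no teamd daemon, in numeric order."""
    missing = set(expected) - running_port_channels(ps_output)
    # Hypothetical helper logic: assumes every name has the form PortChannel<number>.
    return sorted(missing, key=lambda name: int(name[len("PortChannel"):]))
```

Calling `missing_port_channels(expected, ps_output)` with the text captured from `ps aux` inside the container returns the names that still need to be investigated.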

By looking at the syslog, I notice:

INFO supervisord: teamd-PortChannel48 teamd_init() failed.
INFO supervisord: teamd-PortChannel48 Failed: Cannot allocate memory
INFO supervisord: teamd-PortChannel32 teamd_init() failed.
INFO supervisord: teamd-PortChannel32 Failed: Cannot allocate memory
INFO supervisord: teamd-PortChannel56 teamd_init() failed.
INFO supervisord: teamd-PortChannel56 Failed: Cannot allocate memory

These messages appear for every port channel that has not started.
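To collect the affected port channels from a longer syslog, a small parser along these lines might help. It is a sketch that assumes the supervisord messages always have the `teamd-PortChannelN Failed: <reason>` shape shown above; the function name is hypothetical.

```python
import re

# Assumption: failures are logged as
#   "... supervisord: teamd-PortChannel48 Failed: Cannot allocate memory"
FAILED = re.compile(r"supervisord: teamd-(PortChannel\d+) Failed: (.+)$")


def failed_port_channels(log_lines):
    """Map each failed port channel name to the last failure reason logged for it."""
    failures = {}
    for line in log_lines:
        match = FAILED.search(line.rstrip())
        if match:
            failures[match.group(1)] = match.group(2)
    return failures
```

Passing an open syslog file object (or any iterable of lines) yields a dictionary keyed by port channel name. Lines such as `teamd_init() failed.` are skipped because they carry no reason.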

How to reproduce this issue
The current topology is t1-lag, with 8 LAGs in total, each having two member ports.
Rebooting the system repeatedly reproduces the problem fairly easily; checking the output of the teamshow command after each reboot shows whether it has occurred.

@jleveque
Contributor

Unable to reproduce on another device.

@stcheng
Contributor Author

stcheng commented Nov 16, 2017

During reboot, the syslog also shows:

Nov 15 23:56:43.351021 str-s6000-on-2 INFO supervisord: teamd-PortChannel24 teamd_init() failed.
Nov 15 23:56:43.352567 str-s6000-on-2 INFO supervisord: teamd-PortChannel24 Failed: Cannot allocate memory
Nov 15 23:56:43.466376 str-s6000-on-2 INFO teamd.sh[1279]: 2017-11-15 23:56:43,404 INFO exited: teamd-PortChannel24 (exit status 1; not expected)

Even the teamd processes that started successfully cannot be reached:

root@str-s6000-on-2:/# teamdctl PortChannel0 state  
teamdctl_connect failed (Connection timed out)      

@jleveque
Contributor

This appears to be related to my recent change, which causes supervisor to start all teamd processes simultaneously. With enough processes starting at once, we can potentially run out of available memory. On my test device I start only four teamd processes, which is why I cannot reproduce it.

@jleveque
Contributor

Closing, as the offending change was reverted in #1156.

Either teamd will need to be modified to allow starting multiple processes at once without consuming all available memory, in which case we can re-commit the change, or we will need to find a workaround to start the processes sequentially, which is not easy to do with supervisor.
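For illustration only, the sequential-start idea could look something like the sketch below, run outside of supervisor. It is not the fix adopted in this repository. `start_sequentially`, `is_ready`, and the `start` callback are hypothetical names, and the readiness check assumes that `teamdctl <name> state` exits with status 0 once the daemon is reachable and with a non-zero status otherwise.

```python
import subprocess
import time


def is_ready(name, run=subprocess.run):
    """Return True once `teamdctl <name> state` succeeds (assumed to mean exit status 0)."""
    result = run(
        ["teamdctl", name, "state"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def start_sequentially(names, start, ready=is_ready, timeout=30.0, poll=0.5,
                       sleep=time.sleep, clock=time.monotonic):
    """Start one teamd at a time, waiting for each to respond before the next.

    `start` is a caller-supplied callable that launches the daemon for one
    port channel; how it does so is outside the scope of this sketch.
    """
    started = []
    for name in names:
        start(name)
        deadline = clock() + timeout
        while not ready(name):
            if clock() >= deadline:
                raise TimeoutError(f"{name} did not become ready within {timeout}s")
            sleep(poll)
        started.append(name)
    return started
```

A caller might pass something like `lambda name: subprocess.Popen(["teamd", "-f", f"/etc/teamd/{name}.conf"])` as `start`. The trade-off is a slower startup in exchange for never having more than one teamd initializing at a time.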

liat-grozovik pushed a commit that referenced this issue Jan 5, 2023
Update sonic-sairedis submodule pointer to include the following:

402eb14 [ppi]: Enable bulk API. (#1171)
86bb828 Switch to using stock gcovr 5.2 (#1174)
1c9ca78 Manage LANES mapping on VOQ system (#1127)
5887d31 Fix for [EVPN] When MAC moves from remote end point to local, ASIC DB fields are not updated properly for the mac #11503Update NotificationProcessor.cpp (#1118)
559bd5b [ci][asan] add DVS tests run with ASAN (#1139)
4ab46b5 Initialize attr variables in Legacy.switch_get and LegacyFdbEntry.fdb_entry_get (#1169)
4e24c77 The meta_sai_validate_fdb_entry() validates the input FDB entry for the (#1154)

Signed-off-by: Nazarii Hnydyn <nazariig@nvidia.com>