Bypass the systemd service restart limit and do immediately restart when change to local mode #15432

Merged
merged 12 commits into from
Jul 14, 2023

Conversation

lixiaoyuner
Contributor

@lixiaoyuner lixiaoyuner commented Jun 12, 2023

Why I did it

  • During an upgrade via k8s, the feature's systemd service restarts as well. Every feature systemd service has a start limit, and the limit is small: only three starts. If a fallback happens during an upgrade, the start count is already 2, so one more restart takes the systemd service down. We need to bypass this limit. The restart function is called for local -> kube, kube -> kube, and kube -> local transitions, and every call must restart the service successfully, so we run reset-failed each time we restart.
  • When going back to local mode, we restart the systemd service immediately instead of waiting for the default restart interval, which reduces container downtime.

Work item tracking
  • Microsoft ADO (number only):
    24172368

How I did it

  • Before every restart during an upgrade, reset the feature's restart count. Resetting the count to 0 bypasses the restart limit.
  • When going back to local mode, restart the systemd service immediately.
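
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the function names `build_restart_commands` and `restart_feature`, the `runner` parameter, and the assumption that the unit is named `<feature>.service` are all hypothetical. Only `systemctl reset-failed` and `systemctl restart` are real systemd commands.

```python
import subprocess
from typing import Callable, List, Sequence


def build_restart_commands(feature: str) -> List[List[str]]:
    """Return the systemctl commands for a restart that bypasses the start limit.

    `feature` is the feature name (hypothetical example: "snmp"); the unit
    name is assumed to be "<feature>.service".
    """
    unit = f"{feature}.service"
    return [
        # Clear the unit's failed state and its start rate limit counter.
        ["systemctl", "reset-failed", unit],
        # Restart right away instead of waiting for the unit's auto-restart.
        ["systemctl", "restart", unit],
    ]


def restart_feature(
    feature: str,
    runner: Callable[[Sequence[str]], object] = subprocess.check_call,
) -> None:
    """Run the commands in order; `runner` can be swapped out for testing."""
    for cmd in build_restart_commands(feature):
        runner(cmd)
```

Passing a fake `runner` (for example `calls.append` on a list) lets the command sequence be checked without touching systemd.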

How to verify it

The feature's systemd service can always be restarted successfully during an upgrade via k8s.

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211

Tested branch (Please provide the tested image version)

  • 20220531.28

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@qiluo-msft
Collaborator

qiluo-msft commented Jun 22, 2023

Your code changes cover more than what the PR title and PR description say. Could you update them?


In reply to: 1602113471

@lixiaoyuner lixiaoyuner changed the title Bypass the systemd service restart limit Bypass the systemd service restart limit and do immediately restart when change to local mode and only do image clean up when do tag latest Jun 22, 2023
@lixiaoyuner
Contributor Author

Your code changes cover more than what the PR title and PR description say. Could you update them?

Thanks for your comments. I have updated them; could you please go ahead and review?

@losha228 losha228 self-requested a review June 27, 2023 11:26
@lguohan
Collaborator

lguohan commented Jun 30, 2023

For the first PR, I think it should be a separate PR. I spent quite some time figuring out which lines of code map to this reset-failed. Even for this feature, we need to explain this statement: "It's easy to meet this limit when upgrade and fallback happen at the same time." Why? I couldn't figure it out.

@lixiaoyuner
Contributor Author

lixiaoyuner commented Jul 3, 2023

For the first PR, I think it should be a separate PR. I spent quite some time figuring out which lines of code map to this reset-failed. Even for this feature, we need to explain this statement: "It's easy to meet this limit when upgrade and fallback happen at the same time." Why? I couldn't figure it out.

Maybe the word "easy" caused the confusion. It is actually not that easy; it is an extreme case. Let me first explain how the systemd service restarts when k8s upgrades the container. Take v1(kube) --> v2(kube) as an example. k8s stops the v1 container first. At that moment the systemd service is running "docker wait v1-id". Once the v1 container stops, "docker wait v1-id" returns an error code and the systemd service exits with an error code. Because of the restart policy, the systemd service restarts and the failed count increases by 1. But the failed count limit is only 3 within 20 minutes, which means that if we do three upgrades or fallbacks within 20 minutes, the systemd service will never come up again. As a possible example, if a fallback happens during an upgrade, the failed count is already 2, and one more restart takes the systemd service down. So we need to reset the failed count before we restart the systemd service.

@lguohan
Collaborator

lguohan commented Jul 5, 2023

k8s stops the v1 container first. At that moment the systemd service is running "docker wait v1-id". Once the v1 container stops, "docker wait v1-id" returns an error code and the systemd service exits with an error code.

In this case it is a planned stop, so why does docker wait return an error code? Can we make a planned stop exit with error code 0?

@lixiaoyuner
Contributor Author

lixiaoyuner commented Jul 6, 2023

In this case it is a planned stop, so why does docker wait return an error code? Can we make a planned stop exit with error code 0?

The reason is that k8s kills the container, so the docker wait result is not zero. One option is to check whether the wait id is the feature name. If it is the feature name, it is a local container; if not, it is a kube container. For a kube container, after docker wait returns, we could ignore its return code and return 0 directly. This may be a solution. I can try implementing it to verify that it is feasible.

Latest reply
I did a quick test: after changing the exit code to 0, it still doesn't work. I thought systemd would use the failed exit count, but it actually uses the service start count and does not care whether the previous exit failed or succeeded. Once the service has started three times within 20 minutes, it fails to restart again. So the only way to bypass the limit is the "systemctl reset-failed" command.
I have pasted our service's configuration and the official systemd documentation for the limit below.

reference:
Our systemd service configuration:
StartLimitIntervalSec=1200
StartLimitBurst=3
systemd service explanation:
Configure unit start rate limiting. Units which are started more than burst times within an interval time span are not permitted to start any more.
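
To make the counting behaviour concrete, here is a toy model of the limit described above. It is only a simplified sliding-window sketch, not systemd's implementation, and systemd's real accounting may differ in detail. The class `StartLimiter` and its method names are hypothetical; the defaults mirror the `StartLimitIntervalSec=1200` and `StartLimitBurst=3` values quoted above. Note that every start is counted, whatever the previous exit code was, and that only a reset clears the counter.

```python
from collections import deque
from typing import Deque


class StartLimiter:
    """Toy model of StartLimitIntervalSec / StartLimitBurst (hypothetical)."""

    def __init__(self, interval_sec: float = 1200, burst: int = 3) -> None:
        self.interval_sec = interval_sec
        self.burst = burst
        self._starts: Deque[float] = deque()  # timestamps of counted starts

    def try_start(self, now: float) -> bool:
        """Return True if a start at time `now` (seconds) is permitted."""
        # Forget starts that have fallen out of the interval.
        while self._starts and now - self._starts[0] >= self.interval_sec:
            self._starts.popleft()
        if len(self._starts) >= self.burst:
            return False  # limit hit: the unit is not started any more
        self._starts.append(now)
        return True

    def reset_failed(self) -> None:
        """Model the effect of `systemctl reset-failed` on the start counter."""
        self._starts.clear()


limiter = StartLimiter()
# Four start attempts within a few minutes: the fourth is refused.
attempts = [limiter.try_start(t) for t in (0, 60, 120, 180)]
# After a reset, the next start is permitted again.
limiter.reset_failed()
after_reset = limiter.try_start(240)
```

In this model, three upgrades or fallbacks within 20 minutes use up the budget, which is why the reset has to happen before each restart.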

@lixiaoyuner lixiaoyuner changed the title Bypass the systemd service restart limit and do immediately restart when change to local mode and only do image clean up when do tag latest Bypass the systemd service restart limit and do immediately restart when change to local mode Jul 7, 2023
@lguohan lguohan merged commit df13380 into sonic-net:master Jul 14, 2023
16 checks passed
lixiaoyuner added a commit to lixiaoyuner/sonic-buildimage that referenced this pull request Jul 14, 2023
…start when change to local mode (sonic-net#15432)

yxieca pushed a commit that referenced this pull request Jul 14, 2023
…start when change to local mode (#15432) (#15839)

mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request Jul 17, 2023
…start when change to local mode (sonic-net#15432)

@mssonicbld
Collaborator

Cherry-pick PR to 202305: #15868

mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request Jul 17, 2023
…start when change to local mode (sonic-net#15432)

@mssonicbld
Collaborator

Cherry-pick PR to 202211: #15869

mssonicbld added a commit that referenced this pull request Jul 19, 2023
mssonicbld pushed a commit that referenced this pull request Jul 20, 2023
…start when change to local mode (#15432)

sonic-otn pushed a commit to sonic-otn/sonic-buildimage that referenced this pull request Sep 20, 2023
…start when change to local mode (sonic-net#15432)
