Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[memory_checker] Change log severity to WARNING in a case when the docker service is not running #16018

Conversation

vadymhlushko-mlnx
Copy link
Contributor

@vadymhlushko-mlnx vadymhlushko-mlnx commented Aug 2, 2023

Why I did it

To fix the logic introduced by [memory_checker] Do not check memory usage of containers which are not created #11129.
There could be a scenario before the reboot, where

  1. The docker service has stopped
  2. In a very short period of time, the monit service performs the root@sonic:/home/admin# monit status container_memory_telemetry

In such scenario, the memory_checker script will throw an error to the syslog:

ERR memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))'

But, actually, this scenario is a correct behavior, because when the docker service is stopped, the Unix socket is destroyed and that is why we could see the FileNotFoundError(2, 'No such file or directory' exception in the syslog.

Work item tracking
  • Microsoft ADO (number only):

How I did it

Change the log severity to the warning and changed the return value.

How to verify it

It is really hard to catch the exact moment described in the Why I did it section.
In order to check the logic:

  1. Change the Unix socket path to non-existing in /usr/bin/memory_checker file on the switch.
  2. Execute the root@sonic:/home/admin# monit restart container_memory_telemetry
  3. Check the syslog for such messages:
WARNING memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborte
d.', FileNotFoundError(2, 'No such file or directory'))'

INFO memory_checker: [memory_checker] Exits without checking memory usage since container 'telemetry' is not running!

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211
  • 202305

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@vadymhlushko-mlnx
Copy link
Contributor Author

@yozhao101 could you please take a look at this PR?

@vadymhlushko-mlnx
Copy link
Contributor Author

@yozhao101 kind reminder

@liat-grozovik
Copy link
Collaborator

@qiluo-msft @yozhao101 could you please help to review? this is a fix following a PR merged/reviewed by you so are the optimal reviewer for this one

@@ -140,10 +140,10 @@ def get_running_container_names():
running_container_list = docker_client.containers.list(filters={"status": "running"})
running_container_names = [ container.name for container in running_container_list ]
except (docker.errors.APIError, docker.errors.DockerException) as err:
syslog.syslog(syslog.LOG_ERR,
Copy link
Collaborator

@qiluo-msft qiluo-msft Aug 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LOG_ERR

Is it possible to check first docker service is running, if Failed to retrieve the running container list, treat it still as ERR? #Closed

Copy link
Contributor Author

@vadymhlushko-mlnx vadymhlushko-mlnx Aug 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is the idea of get_running_container_names(), if the docker service is not running it just gives us a warning message to the syslog and returns the empty list of docker containers.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify:

  1. check docker service is running, if not running, no ERR, no WARNING. Low frequent INFO may be good.
  2. if docker service is running, retrieve the running container list. If failed to retrieve, treat it still as ERR.

Copy link
Contributor Author

@vadymhlushko-mlnx vadymhlushko-mlnx Aug 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have the check if the docker service is running, but in some cases (which I described in the PR description) is not enough.
This check is right before the get_running_container_names() function call (161 line number).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation! Let me update my proposal:

  1. when you catch an exception, if you are sure docker service is not running, log INFO in low frequency
  2. if not a docker service issue, something really bad happens, let's still use ERR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation is aligned according to your last comment, please take a look

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@qiluo-msft please take a look

@vadymhlushko-mlnx vadymhlushko-mlnx force-pushed the master-memory-checker-docker-log-severity branch from bdbe756 to 8e70bf0 Compare August 28, 2023 13:34
… service is not running.

Signed-off-by: vadymhlushko-mlnx <vadymh@nvidia.com>
@vadymhlushko-mlnx vadymhlushko-mlnx force-pushed the master-memory-checker-docker-log-severity branch from 8e70bf0 to e0df3f4 Compare August 30, 2023 15:36
@qiluo-msft qiluo-msft merged commit 43340cd into sonic-net:master Aug 31, 2023
17 checks passed
mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request Sep 3, 2023
… service is not running. (sonic-net#16018)

#### Why I did it
To fix the logic introduced by [[memory_checker] Do not check memory usage of containers which are not created sonic-net#11129](sonic-net#11129).
There could be a scenario before the reboot, where
1. The `docker service` has stopped
2. In a very short period of time, the monit service performs the `root@sonic:/home/admin# monit status container_memory_telemetry`

In such scenario, the `memory_checker` script will throw an error to the syslog:
```
ERR memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))'
```
But, actually, this scenario is a correct behavior, because when the docker service is stopped, the Unix socket is destroyed and that is why we could see the `FileNotFoundError(2, 'No such file or directory'` exception in the syslog.

#### How I did it
Change the log severity to the warning and changed the return value.

#### How to verify it
It is really hard to catch the exact moment described in the `Why I did it` section.
In order to check the logic:
1. Change the Unix socket path to non-existing in [/usr/bin/memory_checker](https://github.com/sonic-net/sonic-buildimage/blob/47742dfc2c0d1fa27198d69c9183ddc044e11b22/files/image_config/monit/memory_checker#L139) file on the switch.
2. Execute the `root@sonic:/home/admin# monit restart container_memory_telemetry`
3. Check the syslog for such messages:
```
WARNING memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborte
d.', FileNotFoundError(2, 'No such file or directory'))'

INFO memory_checker: [memory_checker] Exits without checking memory usage since container 'telemetry' is not running!
```
@mssonicbld
Copy link
Collaborator

Cherry-pick PR to 202305: #16405

mssonicbld pushed a commit that referenced this pull request Sep 3, 2023
… service is not running. (#16018)

#### Why I did it
To fix the logic introduced by [[memory_checker] Do not check memory usage of containers which are not created #11129](#11129).
There could be a scenario before the reboot, where
1. The `docker service` has stopped
2. In a very short period of time, the monit service performs the `root@sonic:/home/admin# monit status container_memory_telemetry`

In such scenario, the `memory_checker` script will throw an error to the syslog:
```
ERR memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))'
```
But, actually, this scenario is a correct behavior, because when the docker service is stopped, the Unix socket is destroyed and that is why we could see the `FileNotFoundError(2, 'No such file or directory'` exception in the syslog.

#### How I did it
Change the log severity to the warning and changed the return value.

#### How to verify it
It is really hard to catch the exact moment described in the `Why I did it` section.
In order to check the logic:
1. Change the Unix socket path to non-existing in [/usr/bin/memory_checker](https://github.com/sonic-net/sonic-buildimage/blob/47742dfc2c0d1fa27198d69c9183ddc044e11b22/files/image_config/monit/memory_checker#L139) file on the switch.
2. Execute the `root@sonic:/home/admin# monit restart container_memory_telemetry`
3. Check the syslog for such messages:
```
WARNING memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborte
d.', FileNotFoundError(2, 'No such file or directory'))'

INFO memory_checker: [memory_checker] Exits without checking memory usage since container 'telemetry' is not running!
```
sonic-otn pushed a commit to sonic-otn/sonic-buildimage that referenced this pull request Sep 20, 2023
… service is not running. (sonic-net#16018)

#### Why I did it
To fix the logic introduced by [[memory_checker] Do not check memory usage of containers which are not created sonic-net#11129](sonic-net#11129).
There could be a scenario before the reboot, where
1. The `docker service` has stopped
2. In a very short period of time, the monit service performs the `root@sonic:/home/admin# monit status container_memory_telemetry`

In such scenario, the `memory_checker` script will throw an error to the syslog:
```
ERR memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))'
```
But, actually, this scenario is a correct behavior, because when the docker service is stopped, the Unix socket is destroyed and that is why we could see the `FileNotFoundError(2, 'No such file or directory'` exception in the syslog.

#### How I did it
Change the log severity to the warning and changed the return value.

#### How to verify it
It is really hard to catch the exact moment described in the `Why I did it` section.
In order to check the logic:
1. Change the Unix socket path to non-existing in [/usr/bin/memory_checker](https://github.com/sonic-net/sonic-buildimage/blob/47742dfc2c0d1fa27198d69c9183ddc044e11b22/files/image_config/monit/memory_checker#L139) file on the switch.
2. Execute the `root@sonic:/home/admin# monit restart container_memory_telemetry`
3. Check the syslog for such messages:
```
WARNING memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborte
d.', FileNotFoundError(2, 'No such file or directory'))'

INFO memory_checker: [memory_checker] Exits without checking memory usage since container 'telemetry' is not running!
```
mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request Oct 10, 2023
… service is not running. (sonic-net#16018)

#### Why I did it
To fix the logic introduced by [[memory_checker] Do not check memory usage of containers which are not created sonic-net#11129](sonic-net#11129).
There could be a scenario before the reboot, where
1. The `docker service` has stopped
2. In a very short period of time, the monit service performs the `root@sonic:/home/admin# monit status container_memory_telemetry`

In such scenario, the `memory_checker` script will throw an error to the syslog:
```
ERR memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))'
```
But, actually, this scenario is a correct behavior, because when the docker service is stopped, the Unix socket is destroyed and that is why we could see the `FileNotFoundError(2, 'No such file or directory'` exception in the syslog.

#### How I did it
Change the log severity to the warning and changed the return value.

#### How to verify it
It is really hard to catch the exact moment described in the `Why I did it` section.
In order to check the logic:
1. Change the Unix socket path to non-existing in [/usr/bin/memory_checker](https://github.com/sonic-net/sonic-buildimage/blob/47742dfc2c0d1fa27198d69c9183ddc044e11b22/files/image_config/monit/memory_checker#L139) file on the switch.
2. Execute the `root@sonic:/home/admin# monit restart container_memory_telemetry`
3. Check the syslog for such messages:
```
WARNING memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborte
d.', FileNotFoundError(2, 'No such file or directory'))'

INFO memory_checker: [memory_checker] Exits without checking memory usage since container 'telemetry' is not running!
```
@mssonicbld
Copy link
Collaborator

Cherry-pick PR to 202211: #16821

@mssonicbld
Copy link
Collaborator

@vadymhlushko-mlnx cherry pick PR didn't pass PR checker. Please check!!!
#16821

7 similar comments
@mssonicbld
Copy link
Collaborator

@vadymhlushko-mlnx cherry pick PR didn't pass PR checker. Please check!!!
#16821

@mssonicbld
Copy link
Collaborator

@vadymhlushko-mlnx cherry pick PR didn't pass PR checker. Please check!!!
#16821

@mssonicbld
Copy link
Collaborator

@vadymhlushko-mlnx cherry pick PR didn't pass PR checker. Please check!!!
#16821

@mssonicbld
Copy link
Collaborator

@vadymhlushko-mlnx cherry pick PR didn't pass PR checker. Please check!!!
#16821

@mssonicbld
Copy link
Collaborator

@vadymhlushko-mlnx cherry pick PR didn't pass PR checker. Please check!!!
#16821

@mssonicbld
Copy link
Collaborator

@vadymhlushko-mlnx cherry pick PR didn't pass PR checker. Please check!!!
#16821

@mssonicbld
Copy link
Collaborator

@vadymhlushko-mlnx cherry pick PR didn't pass PR checker. Please check!!!
#16821

mssonicbld pushed a commit that referenced this pull request Oct 20, 2023
… service is not running. (#16018)

#### Why I did it
To fix the logic introduced by [[memory_checker] Do not check memory usage of containers which are not created #11129](#11129).
There could be a scenario before the reboot, where
1. The `docker service` has stopped
2. In a very short period of time, the monit service performs the `root@sonic:/home/admin# monit status container_memory_telemetry`

In such scenario, the `memory_checker` script will throw an error to the syslog:
```
ERR memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))'
```
But, actually, this scenario is a correct behavior, because when the docker service is stopped, the Unix socket is destroyed and that is why we could see the `FileNotFoundError(2, 'No such file or directory'` exception in the syslog.

#### How I did it
Change the log severity to the warning and changed the return value.

#### How to verify it
It is really hard to catch the exact moment described in the `Why I did it` section.
In order to check the logic:
1. Change the Unix socket path to non-existing in [/usr/bin/memory_checker](https://github.com/sonic-net/sonic-buildimage/blob/47742dfc2c0d1fa27198d69c9183ddc044e11b22/files/image_config/monit/memory_checker#L139) file on the switch.
2. Execute the `root@sonic:/home/admin# monit restart container_memory_telemetry`
3. Check the syslog for such messages:
```
WARNING memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborte
d.', FileNotFoundError(2, 'No such file or directory'))'

INFO memory_checker: [memory_checker] Exits without checking memory usage since container 'telemetry' is not running!
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants