Handle ASIC/SDK health event #1533

Merged: 9 commits, Feb 5, 2024

@stephenxs
Collaborator

stephenxs commented Dec 1, 2023

ASIC/SDK health event

This document introduces a way for syncd to notify orchagent of an ASIC/SDK health event before asking orchagent to shut down.

For most Ethernet switches, the switch ASIC is the core component of the system. It is important to identify when a switch ASIC is in a failure state and to report that event to the NOS.

Currently, such failures are detected by the SDK/FW on most platforms. When a vendor SAI detects an ASIC/SDK internal error, it notifies orchagent to shut down using the switch_shutdown_request notification. Usually, the vendor SAI prints a log message before calling the shutdown API.

Orchagent can also abort itself if a SAI API call fails, usually because of bad arguments, and the failure cannot be recovered. From a customer's point of view, this can be distinguished from an ASIC/SDK health event only by analyzing the log messages.

The current implementation has the following limitations:

  • It is difficult for a customer to understand what occurred in SAI and below, or to distinguish an SDK/FW internal error from a failed SAI API call. Even if a customer can analyze the issue using the log messages, doing so is not intuitive.
  • There is no way to report an ASIC/FW/SDK event that is not serious enough to require a shutdown.
  • There is no way for a telemetry agent to collect such information.

In this design, we will introduce a new way to address the limitations.
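
As an illustration of the limitations above, here is a minimal Python sketch of an event record that carries enough context to be told apart from a failed SAI API call without reading logs, and that can be reported even when no shutdown is needed. Every name in it (`Severity`, `HealthEvent`, `handle_event`, the field names) is hypothetical and chosen for illustration; none is taken from SAI, syncd, or orchagent.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    # Hypothetical levels for illustration only.
    NOTICE = 1
    WARNING = 2
    FATAL = 3


@dataclass(frozen=True)
class HealthEvent:
    severity: Severity
    category: str       # hypothetical label, e.g. "firmware"
    description: str
    timestamp: float    # seconds since the epoch


def handle_event(event, store):
    """Record the event and report whether a shutdown should follow."""
    store.append(event)  # every event is kept so a telemetry agent could read it
    return event.severity is Severity.FATAL


events = []
minor = HealthEvent(Severity.WARNING, "firmware", "recoverable fault", 1701388800.0)
fatal = HealthEvent(Severity.FATAL, "firmware", "unrecoverable fault", 1701388801.0)
print(handle_event(minor, events))  # False: recorded, no shutdown requested
print(handle_event(fatal, events))  # True: recorded, shutdown would follow
```

The point of the sketch is only that recording an event and deciding to shut down are separate steps, so a less serious event can still be surfaced.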

Implementation PRs

Module              PR title
sonic-buildimage    Support ASIC/SDK health event
sonic-swss-common   Support ASIC/SDK health event
sonic-sairedis      Support ASIC/SDK health event
sonic-swss          Support ASIC/SDK health event
sonic-utilities     Support ASIC/SDK health event

Signed-off-by: Stephen Sun stephens@nvidia.com

stephenxs and others added 4 commits November 17, 2023 18:44
Signed-off-by: Stephen Sun <stephens@nvidia.com>

@zhangyanzhao
Collaborator

If anyone wants to be a reviewer of this feature, please leave your comments here. Thanks.

@stephenxs
Collaborator Author

stephenxs commented Dec 12, 2023

Some comments from the HLD review meeting

  1. A way to limit the number of ASIC/SDK health events
  2. A policy for SONiC to generate a dump in case SAI doesn't collect one
  3. Notify SAI of the folder in which to store the collected dump
  4. Consider whether to leverage the event/alarm system
  5. A way to disable the events for platforms that expose the capability bits but don't support collecting dumps on receiving an event.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
@stephenxs
Collaborator Author

Some comments from the HLD review meeting

  1. A way to limit the number of ASIC/SDK health events

Provided a way to limit the number of events in the database. It does not limit the rate at which the vendor SAI generates events, which is the vendor SAI's responsibility.

  2. A policy for SONiC to generate a dump in case SAI doesn't collect one
  3. Notify SAI of the folder in which to store the collected dump

Won't address these for now.

  4. Consider whether to leverage the event/alarm system

We already did, but we still need to keep the ASIC_SDK_HEALTH_EVENT table in STATE_DB.

  5. A way to disable the events for platforms that expose the capability bits but don't support collecting dumps on receiving an event.

The vendor SAI should expose the capability only when it is fully supported.
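
To illustrate what limiting the number of events in the database could look like, here is a minimal Python sketch of a table that holds at most a fixed number of entries and drops the oldest when full. It is an in-memory stand-in, not the STATE_DB implementation. The class name `BoundedEventTable`, the `max_events` parameter, the keys, and the oldest-first eviction policy are all assumptions; the discussion above does not say how the limit is enforced.

```python
from collections import OrderedDict


class BoundedEventTable:
    """Hypothetical table that keeps at most max_events entries."""

    def __init__(self, max_events):
        if max_events <= 0:
            raise ValueError("max_events must be positive")
        self.max_events = max_events
        self._entries = OrderedDict()  # insertion order doubles as age order

    def add(self, key, fields):
        # Re-adding an existing key moves it to the newest position.
        self._entries.pop(key, None)
        self._entries[key] = fields
        while len(self._entries) > self.max_events:
            self._entries.popitem(last=False)  # evict the oldest entry

    def keys(self):
        return list(self._entries)


table = BoundedEventTable(max_events=2)
for key in ("event-1", "event-2", "event-3"):
    table.add(key, {"description": "placeholder"})
print(table.keys())  # ['event-2', 'event-3']
```

As in the reply above, a cap like this bounds only what is stored; it does nothing about how fast events arrive.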

Signed-off-by: Stephen Sun <stephens@nvidia.com>
@zhangyanzhao
Collaborator

Can you please help to add the code PRs by referring to #806 ? @stephenxs

@prsunny
Contributor

prsunny commented Jan 18, 2024

@venkatmahalingam / @prvattem, please sign off.

@stephenxs
Collaborator Author

Can you please help to add the code PRs by referring to #806 ? @stephenxs

Done.

@liat-grozovik
Collaborator

@venkatmahalingam / @prvattem, a kind reminder: if there are no further comments, I will go ahead and merge the HLD.

@prsunny prsunny merged commit 35bb194 into sonic-net:master Feb 5, 2024
1 check passed
@stephenxs stephenxs deleted the handle-SDK-health-event branch February 5, 2024 23:55
a114j0y pushed a commit to a114j0y/SONiC that referenced this pull request Mar 12, 2024