Support ASIC/SDK health event #3020

Merged
merged 4 commits into from
Apr 29, 2024

Conversation

stephenxs
Collaborator

What I did

Support ASIC/SDK health event

  1. Initialization
    • Fetch capabilities and expose to STATE_DB
    • Register the event handler and categories for each severity when supported
  2. Handle suppression of ASIC/SDK health event categories
  3. Handle ASIC/SDK health event reported by SAI redis in the callback context
    • Decode it
    • Log message
    • Send event
  4. Eliminate old events of each severity according to users' configuration
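
Step 2 above can be sketched as follows. This is a minimal illustration, not the code in this PR: the `HealthCategory` enum, the `kCategoryByName` map, and `categoriesToRegister` are hypothetical stand-ins, and only the names `software` and `cpu_hw` appear in the traces in this conversation (the other two names are assumed). The idea is to start from every category, erase the suppressed ones, and register whatever remains. Skipping unknown names silently is also an assumption.

```cpp
#include <set>
#include <sstream>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the SAI health event categories.
enum class HealthCategory { Software, Firmware, CpuHw, AsicHw };

// Hypothetical name-to-category map. "software" and "cpu_hw" appear in the
// traces in this PR; "firmware" and "asic_hw" are assumed names.
static const std::unordered_map<std::string, HealthCategory> kCategoryByName = {
    {"software", HealthCategory::Software},
    {"firmware", HealthCategory::Firmware},
    {"cpu_hw", HealthCategory::CpuHw},
    {"asic_hw", HealthCategory::AsicHw},
};

// Returns the categories that should still be registered for one severity
// after removing every category named in the comma-separated suppress list.
std::set<HealthCategory> categoriesToRegister(const std::string &suppressedList)
{
    std::set<HealthCategory> interested = {
        HealthCategory::Software, HealthCategory::Firmware,
        HealthCategory::CpuHw, HealthCategory::AsicHw};

    std::istringstream stream(suppressedList);
    std::string name;
    while (std::getline(stream, name, ','))
    {
        auto it = kCategoryByName.find(name);
        if (it == kCategoryByName.end())
        {
            continue; // Assumption: an unknown category name is ignored.
        }
        interested.erase(it->second);
    }
    return interested;
}
```

With the list `software,cpu_hw,invalid_category`, this sketch leaves the firmware and ASIC hardware categories registered.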

Signed-off-by: Stephen Sun <stephens@nvidia.com>

Why I did it

How I verified it

Unit test.

Details if related

@stephenxs stephenxs changed the title from "ASIC/SDK health event" to "Support ASIC/SDK health event" Jan 23, 2024
@prsunny
Collaborator

prsunny commented Feb 26, 2024

@prabhataravind to review once the PR is ready

@stephenxs stephenxs marked this pull request as ready for review March 14, 2024 09:43
@stephenxs stephenxs requested a review from prsunny as a code owner March 14, 2024 09:43
@stephenxs stephenxs force-pushed the asic-sdk-health-event branch 3 times, most recently from 4133a0d to e1bfded March 31, 2024 04:43
@stephenxs
Collaborator Author

/azpw run

@mssonicbld
Collaborator

/AzurePipelines run


Azure Pipelines successfully started running 1 pipeline(s).

@stephenxs
Collaborator Author

Many covered lines were reported as not covered. Retrying for now.

@stephenxs
Collaborator Author

/azpw run

@mssonicbld
Collaborator

/AzurePipelines run


Azure Pipelines successfully started running 1 pipeline(s).

@prsunny
Collaborator

prsunny commented Apr 15, 2024

@kperumalbfn for viz

orchagent/eliminate_events.lua (resolved)
orchagent/switchorch.cpp (resolved)
orchagent/notifications.cpp (resolved)
@stephenxs
Collaborator Author

Looks like the coverage report is not accurate. Retriggered.

@stephenxs
Collaborator Author

/azpw run

@mssonicbld
Collaborator

/AzurePipelines run


Azure Pipelines successfully started running 1 pipeline(s).

@stephenxs
Collaborator Author

Looks like there is an issue in the coverage report. Many covered lines were reported as uncovered.

(gdb) bt
#0  SwitchOrch::doCfgSuppressAsicSdkHealthEventTableTask (this=0x5555561431c0, consumer=...) at ../../orchagent/switchorch.cpp:948
#1  0x0000555555a0c691 in SwitchOrch::doTask (this=0x5555561431c0, consumer=...) at ../../orchagent/switchorch.cpp:1008
#2  0x000055555586d712 in Orch::doTask (this=0x5555561431c0) at ../../orchagent/orch.cpp:541
#3  0x000055555583d508 in switchorch_test::SwitchOrchTest_SwitchOrchTestSuppressCategories_Test::TestBody (this=<optimized out>) at switchorch_ut.cpp:158
#4  0x0000555555ceb1a7 in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ()
#5  0x0000555555ce143e in testing::Test::Run() ()
#6  0x0000555555ce1595 in testing::TestInfo::Run() ()
#7  0x0000555555ce1a29 in testing::TestSuite::Run() ()
#8  0x0000555555ce2072 in testing::internal::UnitTestImpl::RunAllTests() ()
#9  0x0000555555ceb717 in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) ()
#10 0x0000555555ce1658 in testing::UnitTest::Run() ()
#11 0x00005555556d7050 in main ()

orchagent/switchorch.cpp (outdated, resolved)
orchagent/switchorch.cpp (outdated, resolved)
orchagent/notifications.cpp (resolved)
@stephenxs
Collaborator Author

The build failures were caused by a unit test failure that I didn't see locally. It may be related to the bookworm docker. Will fix it.


linux-foundation-easycla bot commented Apr 19, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@stephenxs
Collaborator Author

The build failures were caused by a unit test failure that I didn't see locally. It may be related to the bookworm docker. Will fix it.

Fixed. It was caused by a failure to load the Lua script in the slave docker.

@stephenxs
Collaborator Author

Hi @prsunny
Many covered lines were identified as uncovered. Who can help check this?
Thanks.

orchagent/switchorch.cpp | 34.2% | 148-150,161-162,205-206,228,231,878-879,883,885,887-888,895-896,909,913,915,917-918,920,922-924,926,928,930-932,938,940,942-944,947,949-951,953-954,956,958,960-963,966,968-969,973,975,978,980,983,985,989,992,994,1014,1016,1073,1080-1086,1088,1090,1092,1094-1095,1097-1098,1101-1104,1106,1108,1110-1111,1113,1117,1120,1122,1124,1128-1129,1132-1134,1136,1138,1140,1142,1144,1259,1261-1262,1264
Thread 1 "tests" hit Breakpoint 2, SwitchOrch::doCfgSuppressAsicSdkHealthEventTableTask (this=0x55555a8ad3e0, consumer=...) at ../../orchagent/switchorch.cpp:915
915         SWSS_LOG_ENTER();
(gdb) n
917         auto &map = consumer.m_toSync;
(gdb) 
918         auto it = map.begin();
(gdb) 
920         while (it != map.end())
(gdb) 
922             auto keyOpFieldsValues = it->second;
(gdb) 
923             auto key = kfvKey(keyOpFieldsValues);
(gdb) 
924             auto op = kfvOp(keyOpFieldsValues);
(gdb) 
926             SWSS_LOG_INFO("KEY: %s, OP: %s", key.c_str(), op.c_str());
(gdb) 
928             if (key.empty())
(gdb) 
938                 saiSeverity = switch_asic_sdk_health_event_severity_to_switch_attribute_map.at(key);
(gdb) 
947             if (op == SET_COMMAND)
(gdb) 
949                 bool categoriesConfigured = false;
(gdb) 
950                 bool continueMainLoop = false;
(gdb) 
951                 for (const auto &cit : kfvFieldsValues(keyOpFieldsValues))
(gdb) 
953                     auto fieldName = fvField(cit);
(gdb) 
954                     auto fieldValue = fvValue(cit);
(gdb) 
956                     SWSS_LOG_INFO("FIELD: %s, VALUE: %s", fieldName.c_str(), fieldValue.c_str());
(gdb) 
958                     if (m_supportedAsicSdkHealthEventAttributes.find(saiSeverity) == m_supportedAsicSdkHealthEventAttributes.end())
(gdb) 
966                     if (fieldName == "categories")
(gdb) 
968                         registerAsicSdkHealthEventCategories(saiSeverity, key, fieldValue);
(gdb) 

Thread 1 "tests" hit Breakpoint 1, SwitchOrch::registerAsicSdkHealthEventCategories (this=0x55555a8ad3e0, saiSeverity=SAI_SWITCH_ATTR_REG_WARNING_SWITCH_ASIC_SDK_HEALTH_CATEGORY, severityString="warning", 
    suppressed_category_list="software,cpu_hw,invalid_category", isInitializing=false) at ../../orchagent/switchorch.cpp:878
878             auto &&categories = tokenize(suppressed_category_list, ',');
(gdb) 
879             for (auto category : categories)
(gdb) 
883                     interested_categories_set.erase(switch_asic_sdk_health_event_category_map.at(category));
(gdb) 
879             for (auto category : categories)
(gdb) 
883                     interested_categories_set.erase(switch_asic_sdk_health_event_category_map.at(category));
(gdb) 
879             for (auto category : categories)
(gdb) 
883                     interested_categories_set.erase(switch_asic_sdk_health_event_category_map.at(category));
(gdb) 
Thread 1 "tests" hit Breakpoint 3, SwitchOrch::onSwitchAsicSdkHealthEvent (this=0x55555a8ad3e0, switch_id=141733920768, severity=SAI_SWITCH_ASIC_SDK_HEALTH_SEVERITY_FATAL, timestamp=..., 
    category=SAI_SWITCH_ASIC_SDK_HEALTH_CATEGORY_FW, data=..., description=...) at ../../orchagent/switchorch.cpp:1080
1080        std::vector<swss::FieldValueTuple> values;
(gdb) n
1081        const string &severity_str = switch_asic_sdk_health_event_severity_reverse_map.at(severity);
(gdb) 
1082        const string &category_str = switch_asic_sdk_health_event_category_reverse_map.at(category);
(gdb) 
1083        string description_str;
(gdb) 
1084        const std::time_t &t = (std::time_t)timestamp.tv_sec;
(gdb) 
1085        stringstream time_ss;
(gdb) 
1086        time_ss << std::put_time(std::localtime(&t), "%Y-%m-%d %H:%M:%S");
(gdb) 
1088        switch (data.data_type)
(gdb) 
1092            vector<uint8_t> description_with_terminator(description.list, description.list + description.count);
(gdb) 

1094            description_with_terminator.push_back(0);
(gdb) 
1095            description_str = string(reinterpret_cast<char*>(description_with_terminator.data()));
(gdb) 
525           basic_string(const _CharT* __s, const _Alloc& __a = _Alloc())
(gdb) 
1104                                      description_str.end()))
(gdb) 
1103                                      }),
(gdb) 
1104                                      description_str.end()))
(gdb) 
1097            if (description_str.end() !=
(gdb) 

1092            vector<uint8_t> description_with_terminator(description.list, description.list + description.count);
(gdb) 
1117            { "sai_timestamp", time_ss.str() },
(gdb) 
1120            { "description", description_str }};
(gdb) 
1122        if (0 == gMyAsicName.size())
(gdb) 
1128            SWSS_LOG_NOTICE("[%s] ASIC/SDK health event occurred at %s, asic %s, category %s: %s", severity_str.c_str(), time_ss.str().c_str(), gMyAsicName.c_str(), category_str.c_str(), description_str.c_str());
(gdb) 
1129            params["asic_name"] = gMyAsicName;
(gdb) 
525           basic_string(const _CharT* __s, const _Alloc& __a = _Alloc())
(gdb) 
1132        values.emplace_back("severity", severity_str);
(gdb) 
1133        values.emplace_back("category", category_str);
(gdb) 
1134        values.emplace_back("description", description_str);
(gdb) 
1136        m_asicSdkHealthEventTable->set(time_ss.str(),values);
(gdb) 
525           basic_string(const _CharT* __s, const _Alloc& __a = _Alloc())
(gdb) 
1136        m_asicSdkHealthEventTable->set(time_ss.str(),values);
(gdb) 
525           basic_string(const _CharT* __s, const _Alloc& __a = _Alloc())
(gdb) 
1138        event_publish(g_events_handle, "asic-sdk-health-event", &params);
(gdb) 
525           basic_string(const _CharT* __s, const _Alloc& __a = _Alloc())
(gdb) 
1140        if (severity == SAI_SWITCH_ASIC_SDK_HEALTH_SEVERITY_FATAL)
(gdb) 
1142            m_fatalEventCount++;
(gdb) 
1120            { "description", description_str }};
(gdb) 
1085        stringstream time_ss;
(gdb) 
1083        string description_str;
(gdb) 
1080        std::vector<swss::FieldValueTuple> values;
(gdb) 
1144    }
(gdb) 
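
The description handling visible around lines 1092-1104 of the trace above can be sketched as follows. This is a minimal illustration, not the code in this PR: `decodeDescription` is a hypothetical helper, and the predicate (dropping non-printable bytes via the erase-remove idiom) is an assumption, since the trace does not show the lambda body. It copies the raw bytes, appends a terminator so that the C-string constructor stays within the buffer, and then filters the resulting string.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: turns a counted byte list into a printable string.
std::string decodeDescription(const std::uint8_t *bytes, std::size_t count)
{
    // Copy the payload and append a terminator, because the payload itself
    // is not guaranteed to be NUL-terminated.
    std::vector<std::uint8_t> buffer(bytes, bytes + count);
    buffer.push_back(0);

    // The C-string constructor stops at the first NUL byte in the buffer.
    std::string text(reinterpret_cast<const char *>(buffer.data()));

    // Erase-remove idiom. Assumption: non-printable characters are dropped.
    text.erase(std::remove_if(text.begin(), text.end(),
                              [](unsigned char c) { return std::isprint(c) == 0; }),
               text.end());
    return text;
}
```

Anything after an embedded NUL byte is discarded by the string constructor, so the filter only sees the leading portion of the payload.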

@prsunny
Collaborator

prsunny commented Apr 22, 2024

/azp run


Azure Pipelines successfully started running 1 pipeline(s).

@prsunny
Collaborator

prsunny commented Apr 22, 2024

Hi @prsunny Many covered lines were identified as uncovered. Who can help check this? Thanks.


I see coverage works for other PRs. Let's check the latest result.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
Signed-off-by: Stephen Sun <stephens@nvidia.com>
Signed-off-by: stephens <stephens@nvidia.com>
Signed-off-by: Stephen Sun <stephens@nvidia.com>
Contributor

@prabhataravind prabhataravind left a comment

lgtm

@stephenxs
Collaborator Author

/apzw run

@prsunny
Collaborator

prsunny commented Apr 29, 2024

/azp run


Azure Pipelines successfully started running 1 pipeline(s).

@prsunny prsunny merged commit 054ed34 into sonic-net:master Apr 29, 2024
17 checks passed
@stephenxs stephenxs deleted the asic-sdk-health-event branch April 29, 2024 22:22