-
Notifications
You must be signed in to change notification settings - Fork 79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ipmitool to be used with AER only on Ampere Altra platforms #56
base: master
Are you sure you want to change the base?
Conversation
When monitoring AER errors, verify if we are on an Altra platform to use ipmitool when dealing with this type of errors. Needs both --enable-aer and --enable-amp-ns-decode Fixes: 738bafa ("Add error handling for Ampere-specific errors.") Signed-off-by: William Roche <william.roche@oracle.com>
The new functionalities added for Ampere platforms with commit 738bafa ("Add error handling for Ampere-specific errors.") introduced 2 problems:
The first problem is the most important to solve. To do so, we introduce an initialization function in the AER code to check if ipmitool should be used (present and on the appropriate platform), this function task is to set a boolean flag indicating if ipmitool needs to be used. |
Take Jason Tian's feedback into account to provide more details about the conditions allowing the use of ipmitool on Altra or Altra Max platforms. Signed-off-by: William Roche <william.roche@oracle.com>
Feedback provided by Jason about ras-aer-handler.c changes:
The new commit "Enrich ras_report_aer_ipmi_init() comments." I just added is a suggestion to address that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code can identify different platform from Ampere, I am ok to this change.
This pull request is still pending, and I'm available to give any detail if needed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a couple of things that require changes. Basically:
- dependencies from command line tools should be set via
./configure
; - don't rely on lscpu/dmidecode/grep - location of such tools are system-dependent and the additional complexity under
configure.ac
would just make harder to compile it. Instead, just open the needed file(s) from /sys/class/dmi/id/ and/or read the contents of /proc/cpuinfo - seeking for the corresponding CPU model, board name, etc.
980e51e
to
4e1fffb
Compare
Fix the following compile errors that occurs when building against musl: ras-events.c: In function 'read_ras_event_all_cpus': ras-events.c:366:16: error: 'PATH_MAX' undeclared (first use in this function) 366 | char pipe_raw[PATH_MAX]; | ^~~~~~~~ ras-events.c: In function 'handle_ras_events_cpu': ras-events.c:564:16: error: 'PATH_MAX' undeclared (first use in this function) 564 | char pipe_raw[PATH_MAX]; | Signed-off-by: Stijn Tintel <stijn@linux-ipv6.be> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Removes trailing spaces at the end of a line from file location and fixes --layout option to parse dimm nodes to get the size of each dimm from ras-mc-ctl. Issue is reported mchehab#43 Where '> ras-mc-ctl --layout' reports all 0s With this change the layout option prints the correct dimm sizes > sudo ras-mc-ctl --layout +-----------------------------------------------+ | mc0 | | csrow0 | csrow1 | csrow2 | csrow3 | ----------+-----------------------------------------------+ ... channel7: | 16384 MB | 0 MB | 0 MB | 0 MB | channel6: | 16384 MB | 0 MB | 0 MB | 0 MB | ... ----------+-----------------------------------------------+ Signed-off-by: Muralidhara M K <muralimk@amd.com> Signed-off-by: Naveen Krishna Chatradhi <nchatrad@amd.com> Cc: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org> Link: https://lkml.kernel.org/r/20210810183855.129076-1-nchatrad@amd.com/
Signed-off-by: Justin Vreeland <vreeland.justin@gmail.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Signed-off-by: Steven Johnson <strntydog@gmail.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
syslog is only used when the daemon runs in backround mode this service is configured to run in foreground mode Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
The data type of sprintf called in the function uuid_le() is mismatch. Arm64 compiler force it to unsigned char by default, and can work normally. But if someone compile it with the option -fsigned-char, the function can't work correctly. Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
It will record event even the option -r is not provided for hip08. It is not right, and fix it. Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
It is not right to use '%d' to print uint8_t and uint16_t, although there is no function issue. Change to use '%hhu' and '%hu' separately. Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Add some modules supported by hisi common error section. Besides, HHA is the module for some old platform, and it takes the same place of MATA, so remove it. Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Cleanup files that are generated at build time from the *.in input files. Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Sort of sync this file from Fedora's upstream, addressing some bugs sysfsdir install bugs. After such change, the main difference would be that, in Fedora, it uses different config settings, depending at the architecture: -%configure --enable-all --with-sysconfdefdir=%{_sysconfdir}/sysconfig +%ifarch %{arm} aarch64 +%configure --enable-sqlite3 --enable-aer --enable-mce --enable-extlog --enable-devlink --enable-diskerror --enable-abrt-report --enable-non-standard --enable-arm --enable-hisi-ns-decode --with-sysconfdefdir=%{_sysconfdir}/sysconfig +%else +%configure --enable-sqlite3 --enable-aer --enable-mce --enable-extlog --enable-devlink --enable-diskerror --enable-abrt-report --with-sysconfdefdir=%{_sysconfdir}/sysconfig +%endif Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Since Linux 5.18-rc1 a new block tracepoint called block_rq_error is available for tracing disk error events dedicatedly. Currently rasdaemon is using block_rq_complete which also traces successful cases. It incurs excessive tracing logs and somehow overhead since the event is triggered quite often. Use the new tracepoint for disk error reporting, and the new trace point has the same format as block_rq_complete. Signed-off-by: Yang Shi <shy828301@gmail.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
The version used is glibc specific therefore make it so and provide a fallback for non-glibc systems Signed-off-by: Khem Raj <raj.khem@gmail.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Found with covscan. Signed-off-by: Aristeu Rozanski <arozansk@redhat.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Found with covscan. Signed-off-by: Aristeu Rozanski <arozansk@redhat.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
…rectly We could just have an empty string but keeping the format could prevent issues if someone is actually parsing this. Found with covscan. v2: fixed the timestamp as pointed by Robert Elliott Signed-off-by: Aristeu Rozanski <arozansk@redhat.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
When the corrected errors exceed the set limit in cycle, try to offline the related cpu core. Signed-off-by: Shengwei Luo <luoshengwei@huawei.com> Signed-off-by: Junchong Pan <panjunchong@hisilicon.com> Signed-off-by: Lei Feng <fenglei47@h-partners.com> Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com> Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
When the recoverable errors in cpu core occurred, try to offline the related cpu core. Signed-off-by: Shengwei Luo <luoshengwei@huawei.com> Signed-off-by: Junchong Pan <panjunchong@hisilicon.com> Signed-off-by: Lei Feng <fenglei47@h-partners.com> Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
autoreconf is producing a compile file. Ignore it on git status. Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Should be auto-filling the release information and upload a source distro package tarball. Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Use my own upload release asset logic, as it is known to work already on ZBar. Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Rasdaemon used for a long time an early version of this library, with the code embedded directly into its code. The rationale is that the library was not officially released on that time, but this has long changed. So, instead, just use the library directly. Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
With the function rename due to the usage of libtraceevent library, adjust some indentations. No functional changes. Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Now that rasdaemon is using the libtraceevent library, we can get rid of our own fork. Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
This is needed to build newest version of rasdaemon. Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Use autoupdate 2.71, in order to get rid of obsoleted macros. Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Ensure that all modules are enabled on "make distcheck". Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Nowadays, we're only using github in practice for development. Let it clearer at the documentation. Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
The error events are not received in the rasdaemon since kernel 6.1-rc6. This issue is firstly detected and reported, when testing the CXL error events in the rasdaemon. Debugging showed, poll() on trace_pipe_raw in the ras-events.c do not return and this issue is seen after the commit 42fb0a1e84ff525ebe560e2baf9451ab69127e2b ("tracing/ring-buffer: Have polling block on watermark"). This issue is also verified using a test application for poll() and select() on per_cpu trace_pipe_raw. There is also a bug reported on this issue, https://lore.kernel.org/all/31eb3b12-3350-90a4-a0d9-d1494db7cf74@oracle.com/ This issue occurs for the per_cpu case, which calls the ring_buffer_poll_wait(), in kernel/trace/ring_buffer.c, with the buffer_percent > 0 and then wait until the percentage of pages are available. The default value set for the buffer_percent is 50 in the kernel/trace/trace.c. However poll() does not return even met the percentage of pages condition. As a fix, rasdaemon set buffer_percent as 0 through the /sys/kernel/debug/tracing/instances/rasdaemon/buffer_percent, then the task will wake up as soon as data is added to any of the specific cpu buffer and poll() on per_cpu/cpuX/trace_pipe_raw does not block indefinitely. Dependency on the kernel fix commit 3e46d910d8acf94e5360126593b68bf4fee4c4a1("tracing: Fix poll() and select() do not work on per_cpu trace_pipe and trace_pipe_raw") Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Mock now makes mandatory to add the install dir, otherwise it refuses to build. So, add it. Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
As we're not not bunding libtraceevent inside RASdaemon, packaging it now requires it. Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
That allows git??b to better parse it. Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
It is missing an entry about new labels. Also, version is at the wrong place. Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
make dist-bzip2 requires configure to work, which, in turn, depends on having some tools installed. Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
configure scripts need to be runnable with a POSIX-compliant /bin/sh. On many (but not all!) systems, /bin/sh is provided by Bash, so errors like this aren't spotted. Notably Debian defaults to /bin/sh provided by dash which doesn't tolerate such bashisms as '=='. This retains compatibility with bash. Fixes configure warnings/errors like: ``` checking for libtraceevent... yes ./configure: 13430: test: x: unexpected operator ./configure: 13439: test: x: unexpected operator ``` Signed-off-by: Sam James <sam@gentoo.org> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Fix for regression in ras_mc_create_table() if some cpus are offline at the system start Issue: Regression in the ras_mc_create_table() if some of the cpus are offline at the system start when run the rasdaemon. This issue is reproducible in ras_mc_create_table() with decode and record non-standard events and reproducible sometimes with ras_mc_create_table() for the standard events. Also in the multi thread way, there is memory leak in ras_mc_event_opendb() as struct sqlite3_priv *priv and sqlite3 *db allocated/initialized per thread, but stored in the common struct ras_events ras in pthread data, which is shared across the threads. Reason: when the system starts with some of the cpus offline and then run the rasdaemon, read_ras_event_all_cpus() exit with error and switch to the multi thread way. However read() in read_ras_event() return error in threads for each of the offline CPUs and does clean up including calling ras_mc_event_closedb(). Since the 'struct ras_events ras' passed in the pthread_data to each of the threads is common, struct sqlite3_priv *priv and sqlite3 *db allocated/ initialized per thread and stored in the common 'struct ras_events ras', are getting overwritten in each ras_mc_event_opendb()(which called from pthread per cpu), result memory leak. Also when ras_mc_event_closedb() is called in the above error case from the threads corresponding to the offline cpus, close the sqlite3 *db and free sqlite3_priv *priv stored in the common 'struct ras_events ras', result regression when accessing priv->db in the ras_mc_create_table() from another context later. Solution: In ras_mc_event_opendb(), allocate struct sqlite3_priv *priv, init sqlite3 *db and create tables common for the threads with shared 'struct ras_events ras' based on a reference count and free them in the same way. Also protect critical code ras_mc_event_opendb() and ras_mc_event_closedb() using mutex in the multi thread case from any regression caused by the thread pre-emption. Reported-by: Lei Feng <fenglei47@h-partners.com> Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Move definition for BIT() and BIT_ULL() to the common file ras-record.h Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Add support to log and record the CXL poison events. The corresponding Kernel patches here: https://lore.kernel.org/linux-cxl/64457d30bae07_2028294ac@dwillia2-xfh.jf.intel.com.notmuch/ Presently for logging only, could be extended for the policy based recovery action for the frequent poison events depending on the above kernel patches. Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Add support to log and record the CXL AER uncorrectable errors. The corresponding Kernel patches are here: https://lore.kernel.org/linux-cxl/166974401763.1608150.5424589924034481387.stgit@djiang5-desk3.ch.intel.com/T/#t https://lore.kernel.org/lkml/63eeb2a8c9e3f_32d612941f@dwillia2-xfh.jf.intel.com.notmuch/T/ It was found that the header log data to be converted to the big-endian format to correctly store in the SQLite DB likely because the SQLite database seems uses the big-endian storage. Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com># Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Add support to log and record the CXL AER correctable errors. The corresponding Kernel patches are here: https://lore.kernel.org/linux-cxl/166974401763.1608150.5424589924034481387.stgit@djiang5-desk3.ch.intel.com/T/#t https://lore.kernel.org/linux-cxl/63e5ed38d77d9_138fbc2947a@iweiny-mobl.notmuch/T/#t Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
When monitoring AER errors, verify if we are on an Altra platform to use ipmitool when dealing with this type of errors. Needs both --enable-aer and --enable-amp-ns-decode Fixes: 738bafa ("Add error handling for Ampere-specific errors.") Signed-off-by: William Roche <william.roche@oracle.com>
Sorry that it took me so long to come back with an updated version of this fix. Please let me know how you would prefer to proceed with this request ? |
When monitoring AER errors, verify if we are on an Altra platform
to use ipmitool when dealing with this type of errors.
Needs both --enable-aer and --enable-amp-ns-decode
Fixes: 738bafa ("Add error handling for Ampere-specific errors.")
Signed-off-by: William Roche william.roche@oracle.com