View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0016993 | CentOS-7 | kernel | public | 2020-01-31 21:55 | 2020-12-12 09:27 |
Reporter | Alexander Krupp | Assigned To | |||
Priority | normal | Severity | minor | Reproducibility | sometimes |
Status | resolved | Resolution | fixed | ||
Platform | x86_64 | OS | Centos | OS Version | 7 |
Product Version | 7.7-1908 | ||||
Fixed in Version | 7.9.2009 | ||||
Summary | 0016993: mce caused by an intel cpu erratum is printed (wall) on all terminals | ||||
Description | The Intel erratum in my case is HSD131 or HSW131. It is very annoying to see this printed several times a day on all shell terminals, and I do not want to miss a _real_ message because I have become used to ignoring this one. --- pr_emerg --- Jan 26 06:12:58 ruebenhost kernel: mce: [Hardware Error]: Machine check events logged Jan 26 06:12:58 ruebenhost kernel: mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 0: 90000040000f0005 Jan 26 06:12:58 ruebenhost kernel: mce: [Hardware Error]: TSC 0 Jan 26 06:12:58 ruebenhost kernel: mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1580015578 SOCKET 0 APIC 6 microcode 27 --- pr_emerg --- --- Hardware --- processor : 7 vendor_id : GenuineIntel cpu family : 6 model : 60 model name : Intel(R) Xeon(R) CPU E3-1276 v3 @ 3.60GHz stepping : 3 microcode : 0x27 --- Hardware --- | ||||
Steps To Reproduce | 1) Install CentOS-7 with QEMU/KVM on the hardware described above 2) Run one or more 32-bit virtual machines 3) Wait for the MCE to show up | ||||
Additional Information | FreeBSD has a fix for this class of errata: https://lists.freebsd.org/pipermail/svn-src-head/2015-April/070884.html https://svnweb.freebsd.org/base?view=revision&revision=281751 /* + * Skip spurious corrected parity errors generated by Intel Haswell- + * and Broadwell-based CPUs (see HSD131, HSM142, HSW131 and BDM48 + * erratum respectively), unless reporting is enabled. + * Note that these errors also have been observed with the D0-stepping + * of Haswell, while at least initially the CPU specification updates + * suggested only the C0-stepping to be affected. Similarly, Celeron + * 2955U with a CPU ID of 0x45 apparently are also concerned with the + * same problem, with HSM142 only referring to 0x3c and 0x46. */ | ||||
Tags | 3.10.0-1062.9.1.el7.x86_64 | ||||
abrt_hash | |||||
URL | https://svnweb.freebsd.org/base?view=revision&revision=281751 | ||||
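The log line `PROCESSOR 0:306c3` and the `cpu family : 6`, `model : 60`, `stepping : 3` values in the description refer to the same CPU. The following is a minimal sketch of how the CPUID signature splits into those fields, assuming the standard Intel encoding of CPUID leaf 1 EAX. The names `cpu_sig`, `decode_cpuid_signature` and `model_has_quirk` are hypothetical and do not come from the kernel; the model list is the one checked in the patch attached to this report.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper type for illustration; not a kernel structure. */
struct cpu_sig {
    unsigned family;
    unsigned model;
    unsigned stepping;
};

/* Split a CPUID leaf 1 EAX signature into family, model and stepping. */
static struct cpu_sig decode_cpuid_signature(uint32_t eax)
{
    struct cpu_sig s;
    unsigned base_family = (eax >> 8) & 0xf;
    unsigned base_model  = (eax >> 4) & 0xf;
    unsigned ext_model   = (eax >> 16) & 0xf;
    unsigned ext_family  = (eax >> 20) & 0xff;

    s.stepping = eax & 0xf;
    /* The extended family is only added when the base family is 0xf. */
    s.family = (base_family == 0xf) ? base_family + ext_family : base_family;
    /* The extended model forms the high nibble for families 6 and 0xf. */
    s.model = (base_family == 0x6 || base_family == 0xf)
                  ? (ext_model << 4) | base_model
                  : base_model;
    return s;
}

/* Models checked by the patch in this report: 0x3c, 0x3d, 0x45, 0x46. */
static bool model_has_quirk(const struct cpu_sig *s)
{
    if (s->family != 6)
        return false;
    return s->model == 0x3c || s->model == 0x3d ||
           s->model == 0x45 || s->model == 0x46;
}
```

For the reported signature 0x306c3 this gives family 6, model 0x3c (decimal 60) and stepping 3, which agrees with the /proc/cpuinfo excerpt.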
Please find attached a patch which filters this pr_emerg message in the __print_mce function in mce.c. The patch is for CentOS kernel 3.10.0-1062.9.1.el7. I would appreciate it if something like this could make its way into the mainline kernel. FreeBSD has had such a filter for years, so why not Linux? I did not find any discussion of HSD131 or HSW131 in Linux developer forums when searching for this.

Intel_HSD131_etc.patch (2,216 bytes)
--- a/arch/x86/kernel/cpu/mcheck/mce.c.orig	2020-01-31 14:10:36.421882426 +0100
+++ b/arch/x86/kernel/cpu/mcheck/mce.c	2020-01-31 14:13:16.584315314 +0100
@@ -116,6 +116,9 @@
 
 static void (*quirk_no_way_out)(int bank, struct mce *m, struct pt_regs *regs);
 
+/* filter false positives from panic pr_emerg */
+static int (*quirk_noprint)(struct mce *m);
+
 /*
  * CPU/chipset specific EDAC code can register a notifier call here to print
  * MCE errors in a human-readable form.
@@ -281,6 +284,9 @@
 
 static void __print_mce(struct mce *m)
 {
+	if (quirk_noprint && quirk_noprint(m))
+		return;
+
 	pr_emerg(HW_ERR "CPU %d: Machine Check%s: %Lx Bank %d: %016Lx\n",
 		 m->extcpu,
 		 (m->mcgstatus & MCG_STATUS_MCIP ? " Exception" : ""),
@@ -321,8 +327,8 @@
 
 static void print_mce(struct mce *m)
 {
-	__print_mce(m);
-	pr_emerg_ratelimited(HW_ERR "Run the above through 'mcelog --ascii'\n");
+	__print_mce(m);
+	pr_emerg_ratelimited(HW_ERR "Run the above through 'mcelog --ascii'\n");
 }
 
 #define PANIC_TIMEOUT 5 /* 5 seconds */
@@ -1570,6 +1576,22 @@
 	m->cs = regs->cs;
 }
 
+/* detect if MCE should be filtered from printk */
+static int quirk_haswell_hsd131_noprint(struct mce *m)
+{
+	if ( // Intel Haswell notice HSD131
+		m->cpuvendor == X86_VENDOR_INTEL
+		&& m->cpuid == 0x306c3
+		&& m->bank == 0
+		&& (m->status & 0xa0000000ffffffff) == 0x80000000000f0005
+	)
+	{
+		return 1;
+	} else {
+		return 0;
+	}
+}
+
 /* Add per CPU specific workarounds here */
 static int __mcheck_cpu_apply_quirks(struct cpuinfo_x86 *c)
 {
@@ -1673,6 +1696,16 @@
 
 		if (c->x86 == 6 && c->x86_model == 45)
 			quirk_no_way_out = quirk_sandybridge_ifu;
+
+		/* similar to https://svnweb.freebsd.org/changeset/base/281751 */
+		if (c->x86 == 6 && ( c->x86_model == 0x3c /* HSD131, HSM142, HSW131 */
+			|| c->x86_model == 0x3d /* BDM48 */
+			|| c->x86_model == 0x45
+			|| c->x86_model == 0x46 )) /* HSM142 */
+		{
+			pr_info("Detected Haswell CPU. MCE quirk HSD131, HSM142, HSW131, BDM48, or HSM142 enabled.\n");
+			quirk_noprint = quirk_haswell_hsd131_noprint;
+		}
 	}
 	if (cfg->monarch_timeout < 0)
 		cfg->monarch_timeout = 0;
|
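The core of the patch is the comparison `(m->status & 0xa0000000ffffffff) == 0x80000000000f0005`. As a rough userspace sketch of what that mask selects, the code below rebuilds the same constants from named bits, assuming the usual IA32_MCi_STATUS layout (bit 63 VAL, bit 61 UC, bits 52:38 corrected error count, low 32 bits error codes). The struct `mce_sample` and both function names are hypothetical stand-ins, and the vendor check from the patch is left out for brevity.

```c
#include <stdbool.h>
#include <stdint.h>

#define STATUS_VAL (1ULL << 63)   /* register contents are valid */
#define STATUS_UC  (1ULL << 61)   /* error was NOT corrected by hardware */

/* Hypothetical stand-in for the few struct mce fields the patch reads. */
struct mce_sample {
    uint32_t cpuid;    /* CPUID signature, e.g. 0x306c3 */
    unsigned bank;     /* machine check bank number */
    uint64_t status;   /* raw IA32_MCi_STATUS value */
};

/* Same test as the patch: valid, corrected, error codes 0x000f0005, bank 0. */
static bool is_hsd131_spurious(const struct mce_sample *m)
{
    const uint64_t mask = STATUS_VAL | STATUS_UC | 0xffffffffULL;
    const uint64_t want = STATUS_VAL | 0x000f0005ULL;

    return m->cpuid == 0x306c3 && m->bank == 0 &&
           (m->status & mask) == want;
}

/* Corrected error count, bits 52:38 of the status word. */
static unsigned corrected_error_count(uint64_t status)
{
    return (unsigned)((status >> 38) & 0x7fff);
}
```

With the values from the log in the description (CPUID 0x306c3, bank 0, status 0x90000040000f0005), the predicate matches and the corrected error count is 1. The same status with the UC bit set would not match, so an uncorrected error with the same error code would still be printed.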
Additional information about the erratum is here:
https://trick77.com/qemu-on-haswell-causes-spurious-mce-events/ ("QEMU on Haswell causes spurious MCE events", posted on November 4, 2014 by Jan)
https://www.linuxquestions.org/questions/linux-hardware-18/hardware-error-this-is-%2Anot%2A-a-software-problem-4175535727/page2.html |
|
You need to report this on bugzilla.redhat.com. CentOS is a rebuild of RHEL and aims for bug-for-bug compatibility with it. To get this fixed in CentOS, you first have to get it fixed in RHEL. | |
Reported as https://bugzilla.redhat.com/show_bug.cgi?id=1797205 | |
Fix seems to be scheduled for 7.9 | |
Fix is in 7.9. The most recent entry in the RHEL bug tracker reads:
--- errata-xmlrpc 2020-09-29 21:07:02 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: kernel security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2020:4060 |
|
Date Modified | Username | Field | Change |
---|---|---|---|
2020-01-31 21:55 | Alexander Krupp | New Issue | |
2020-01-31 21:55 | Alexander Krupp | Tag Attached: 3.10.0-1062.9.1.el7.x86_64 | |
2020-01-31 22:04 | Alexander Krupp | File Added: Intel_HSD131_etc.patch | |
2020-01-31 22:04 | Alexander Krupp | Note Added: 0036170 | |
2020-01-31 22:16 | Alexander Krupp | Note Added: 0036171 | |
2020-01-31 23:29 | TrevorH | Note Added: 0036172 | |
2020-02-01 12:10 | Alexander Krupp | Note Added: 0036174 | |
2020-09-10 08:47 | Alexander Krupp | Note Added: 0037680 | |
2020-12-11 01:34 | Alexander Krupp | Note Added: 0038071 | |
2020-12-12 09:27 | toracat | Status | new => resolved |
2020-12-12 09:27 | toracat | Resolution | open => fixed |
2020-12-12 09:27 | toracat | Fixed in Version | => 7.9.2009 |