View Issue Details

ID: 0016993
Project: CentOS-7
Category: kernel
View Status: public
Last Update: 2020-09-10 08:47
Reporter: Alexander Krupp
Priority: normal
Severity: minor
Reproducibility: sometimes
Status: new
Resolution: open
Platform: x86_64
OS: Centos
OS Version: 7
Product Version: 7.7-1908
Target Version:
Fixed in Version:
Summary: 0016993: mce caused by an intel cpu erratum is printed (wall) on all terminals
Description: The Intel erratum in my case is HSD131 or HSW131. It is very annoying to see this printed several times a day on all shell terminals, and I do not want to miss a _real_ message because I have got used to ignoring this one.

--- pr_emerg ---
Jan 26 06:12:58 ruebenhost kernel: mce: [Hardware Error]: Machine check events logged
Jan 26 06:12:58 ruebenhost kernel: mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 0: 90000040000f0005
Jan 26 06:12:58 ruebenhost kernel: mce: [Hardware Error]: TSC 0
Jan 26 06:12:58 ruebenhost kernel: mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1580015578 SOCKET 0 APIC 6 microcode 27
--- pr_emerg ---


--- Hardware ---
processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 60
model name : Intel(R) Xeon(R) CPU E3-1276 v3 @ 3.60GHz
stepping : 3
microcode : 0x27
--- Hardware ---

Steps To Reproduce:
1) Install CentOS-7 with QEMU/KVM on the above hardware
2) Run one or more 32-bit virtual machines
3) Wait for the MCE to show up
Additional Information: FreeBSD has a fix for this class of errata:

https://lists.freebsd.org/pipermail/svn-src-head/2015-April/070884.html
https://svnweb.freebsd.org/base?view=revision&revision=281751

/*
+ * Skip spurious corrected parity errors generated by Intel Haswell-
+ * and Broadwell-based CPUs (see HSD131, HSM142, HSW131 and BDM48
+ * erratum respectively), unless reporting is enabled.
+ * Note that these errors also have been observed with the D0-stepping
+ * of Haswell, while at least initially the CPU specification updates
+ * suggested only the C0-stepping to be affected. Similarly, Celeron
+ * 2955U with a CPU ID of 0x45 apparently are also concerned with the
+ * same problem, with HSM142 only referring to 0x3c and 0x46.
      */
Tags: 3.10.0-1062.9.1.el7.x86_64
abrt_hash:
URL: https://svnweb.freebsd.org/base?view=revision&revision=281751

Activities

Alexander Krupp

2020-01-31 22:04

reporter   ~0036170

Please find attached a patch that filters this pr_emerg message out in the __print_mce function in mce.c.
The patch is for CentOS kernel 3.10.0-1062.9.1.el7.

I would appreciate it if something like this could make its way into the mainline kernel. FreeBSD seems to have had it for years, so why not Linux? I did not find any discussion of HSD131 or HSW131 in Linux developer channels when searching for this.

Intel_HSD131_etc.patch (2,216 bytes)
--- a/arch/x86/kernel/cpu/mcheck/mce.c.orig	2020-01-31 14:10:36.421882426 +0100
+++ b/arch/x86/kernel/cpu/mcheck/mce.c	2020-01-31 14:13:16.584315314 +0100
@@ -116,6 +116,9 @@
 
 static void (*quirk_no_way_out)(int bank, struct mce *m, struct pt_regs *regs);
 
+/* filter false positives from panic pr_emerg */
+static int (*quirk_noprint)(struct mce *m);
+
 /*
  * CPU/chipset specific EDAC code can register a notifier call here to print
  * MCE errors in a human-readable form.
@@ -281,6 +284,9 @@
 
 static void __print_mce(struct mce *m)
 {
+        if (quirk_noprint && quirk_noprint(m)) 
+	         return;
+	  
 	pr_emerg(HW_ERR "CPU %d: Machine Check%s: %Lx Bank %d: %016Lx\n",
 		 m->extcpu,
 		 (m->mcgstatus & MCG_STATUS_MCIP ? " Exception" : ""),
@@ -321,8 +327,8 @@
 
 static void print_mce(struct mce *m)
 {
-	__print_mce(m);
-	pr_emerg_ratelimited(HW_ERR "Run the above through 'mcelog --ascii'\n");
+        __print_mce(m);
+       	pr_emerg_ratelimited(HW_ERR "Run the above through 'mcelog --ascii'\n");
 }
 
 #define PANIC_TIMEOUT 5 /* 5 seconds */
@@ -1570,6 +1576,22 @@
 	m->cs = regs->cs;
 }
 
+/* detect if mce should be filtered from printk */
+static int quirk_haswell_hsd131_noprint(struct mce *m)
+{
+	if ( // Intel Haswell notice HSD131
+ 	   m->cpuvendor == X86_VENDOR_INTEL
+	   && m->cpuid == 0x306c3 
+	   && m->bank == 0
+	   && (m->status & 0xa0000000ffffffff) == 0x80000000000f0005
+	 ) 
+	{ 
+	  return 1;
+	} else {
+	  return 0;
+	}
+}
+
 /* Add per CPU specific workarounds here */
 static int __mcheck_cpu_apply_quirks(struct cpuinfo_x86 *c)
 {
@@ -1673,6 +1696,16 @@
 
 		if (c->x86 == 6 && c->x86_model == 45)
 			quirk_no_way_out = quirk_sandybridge_ifu;
+
+		/* similar to https://svnweb.freebsd.org/changeset/base/281751 */
+		if (c->x86 == 6 && ( c->x86_model == 0x3c /* HSD131, HSM142, HSW131 */
+		   	      	   || c->x86_model == 0x3d /* BDM48 */
+				   || c->x86_model == 0x45 
+				   || c->x86_model == 0x46 ) /* HSM142 */ 
+		{
+		        pr_info("Detected Haswell CPU. MCE quirk HSD131, HSM142, HSW131, BDM48, or HSM142 enabled.\n");
+		        quirk_noprint = quirk_haswell_hsd131_noprint;
+		}
 	}
 	if (cfg->monarch_timeout < 0)
 		cfg->monarch_timeout = 0;
Alexander Krupp

2020-01-31 22:16

reporter   ~0036171

Additional information about the erratum is here:

https://trick77.com/qemu-on-haswell-causes-spurious-mce-events/
QEMU on Haswell causes spurious MCE events
Posted on November 4, 2014 by Jan

https://www.linuxquestions.org/questions/linux-hardware-18/hardware-error-this-is-%2Anot%2A-a-software-problem-4175535727/page2.html
TrevorH

2020-01-31 23:29

manager   ~0036172

You need to report this on bugzilla.redhat.com. CentOS is a rebuild of RHEL and aims for bug-for-bug compatibility with it. To get this fixed in CentOS, you first have to get it fixed in RHEL.
Alexander Krupp

2020-02-01 12:10

reporter   ~0036174

Reported as https://bugzilla.redhat.com/show_bug.cgi?id=1797205
Alexander Krupp

2020-09-10 08:47

reporter   ~0037680

The fix seems to be scheduled for 7.9.

Issue History

Date Modified     Username         Field                                     Change
2020-01-31 21:55  Alexander Krupp  New Issue
2020-01-31 21:55  Alexander Krupp  Tag Attached: 3.10.0-1062.9.1.el7.x86_64
2020-01-31 22:04  Alexander Krupp  File Added: Intel_HSD131_etc.patch
2020-01-31 22:04  Alexander Krupp  Note Added: 0036170
2020-01-31 22:16  Alexander Krupp  Note Added: 0036171
2020-01-31 23:29  TrevorH          Note Added: 0036172
2020-02-01 12:10  Alexander Krupp  Note Added: 0036174
2020-09-10 08:47  Alexander Krupp  Note Added: 0037680