View Issue Details

ID: 0007474
Project: CentOS-7
Category: kernel
View Status: public
Last Update: 2015-08-19 07:09
Reporter: thomasf
Priority: immediate
Severity: crash
Reproducibility: random
Status: resolved
Resolution: fixed
Platform: x86_64
OS: CentOS 7
OS Version: 7.0
Product Version: 7.0-1406
Target Version:
Fixed in Version:
Summary: 0007474: kernel BUG at mm/memory.c:3765!, System crash about every 7 hours
Description: System reboots about every 7 hours.
Primary error message: kernel BUG at mm/memory.c:3765!

Hardware:
Supermicro X9DRE-TF+/X9DR7-TF+/X9DRE-TF+/X9DR7-TF+, BIOS 3.0a 12/04/2013
2 x CPU E5-2695v2 XEON
24 x MEMORY 16GB KVR16R11D4/16 ECC CL11
------------------------------------------
24 Core and 384 GB Memory

The vmcore file is 4.9 GB in size, so it is provided as a download: http://www.swissbyte.com/vmcore_dump.gz
Steps To Reproduce: Random.
Just generate a high CPU load (around 40%) and a lot of I/O.
Additional Information:
[32595.802244] kernel BUG at mm/memory.c:3765!
[32595.802272] invalid opcode: 0000 [#1] SMP
[32595.802298] Modules linked in: xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables sg iTCO_wdt iTCO_vendor_support dm_mod coretemp kvm_intel kvm ipmi_devintf crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr ses enclosure sb_edac edac_core i2c_i801 ixgbe lpc_ich mfd_core ptp pps_core mdio ipmi_si ipmi_msghandler wmi mei_me mperf ioatdma mei shpchp dca xfs libcrc32c sd_mod crc_t10dif crct10dif_common mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper ttm ahci drm aacraid libahci libata i2c_core megaraid_sas
[32595.802703] CPU: 26 PID: 5246 Comm: jane1slave002 Not tainted 3.10.0-123.4.4.el7.x86_64 #1
[32595.802744] Hardware name: Supermicro X9DRE-TF+/X9DR7-TF+/X9DRE-TF+/X9DR7-TF+, BIOS 3.0a 12/04/2013
[32595.802788] task: ffff883f20f34440 ti: ffff883f20df4000 task.ti: ffff883f20df4000
[32595.802826] RIP: 0010:[<ffffffff8116cbb8>] [<ffffffff8116cbb8>] handle_mm_fault+0xc78/0xd90
[32595.802877] RSP: 0000:ffff883f20df5da0 EFLAGS: 00010246
[32595.802904] RAX: 0000000000000100 RBX: 00000007b3270970 RCX: ffff883f20df5fd8
[32595.802939] RDX: ffff883f20f34440 RSI: 0000000000000000 RDI: 80000032194001e6
[32595.802975] RBP: ffff883f20df5e20 R08: 0000000000000000 R09: 0000000000000d82
[32595.803010] R10: 0000000000000000 R11: 0000000000000002 R12: ffff88180b3adcc8
[32595.803047] R13: ffff885f647e96c8 R14: 0000000000000029 R15: ffff885f6c6f5dc0
[32595.803083] FS: 00007f9e6b097700(0000) GS:ffff882fbfbc0000(0000) knlGS:0000000000000000
[32595.803123] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[32595.803153] CR2: 00000007b3270970 CR3: 0000005f71c27000 CR4: 00000000001407e0
[32595.803188] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[32595.803229] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[32595.803268] Stack:
[32595.803281] fff00000ffffffff 0000000000000016 0000000000000000 0000000000017988
[32595.803332] 0000000000000000 ffff885f6c6f5e38 0000000000000000 8000002c6ff33067
[32595.803384] ffffea00bdcf3db0 ffff885f6c6f5e38 00000000fd29f2ed 0000000000000006
[32595.803433] Call Trace:
[32595.803453] [<ffffffff815ed646>] __do_page_fault+0x156/0x540
[32595.803485] [<ffffffff815eda4a>] do_page_fault+0x1a/0x70
[32595.803514] [<ffffffff815e9c88>] page_fault+0x28/0x30
[32595.803540] Code: e8 ae c1 ff ff 85 c0 0f 85 06 f7 ff ff 49 8b 3c 24 e9 4e f5 ff ff 4c 89 f7 e8 15 b8 03 00 4c 89 f7 e8 2d 20 fe ff e9 07 fb ff ff <0f> 0b 4c 89 e7 4c 89 5d b8 e8 9a ad ff ff 48 89 de 49 89 c7 4c
[32595.803760] RIP [<ffffffff8116cbb8>] handle_mm_fault+0xc78/0xd90
[32595.803794] RSP <ffff883f20df5da0>

3.10.0-123.4.4.el7.x86_64 #1 SMP Fri Jul 25 05:07:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Tags: No tags attached.
abrt_hash:
URL:

Activities

wolfy

2014-08-07 18:19

developer   ~0020614

Given that you use ECC memory, I presume that it's not a hardware issue. Therefore, please file a bug at bugzilla.redhat.com (RHEL 7, kernel) and cross-link it (via the External Bug Tracker field) to CentOS bug ID #7474. Once RH fixes this problem and releases a fixed kernel, it will propagate to CentOS as well.
thomasf

2014-08-07 23:56

reporter   ~0020618

You are right: we are using ECC, and we have already replaced the mainboard and memory, yet the error still exists.
I will now try to open a bug report with Red Hat.
thomasf

2014-08-08 00:09

reporter   ~0020619

Bug 1127947 on bugzilla.redhat.com
wolfy

2014-08-08 06:09

developer   ~0020620

bugzilla.redhat.com marks all kernel bugs as private by default, so we'll rely on you to keep us posted should any news appear.

Meanwhile, I suggest you try one of the kernel packages (either kernel-lt or kernel-ml) provided by ELRepo. Maybe you'll get lucky...
thomasf

2014-08-08 07:54

reporter   ~0020626

Here are the last lines of vmstat output before the kernel hit the bug:
procs -------------memory-------------- ---swap-- -----io---- --system--- --------cpu-------- -----timestamp-----
  r b swpd     free     inact    active   si   so    bi    bo    in    cs  us  sy  id  wa  st                CEST
  5 1    0 14387076 276179424  98342560    0    0     1    92  1391  1535   1   0  98   0   0 2014-08-07 20:01:52
  2 1    0 14131832 276184384  98583440    0    0     4   729  1837  4225   1   0  98   0   0 2014-08-07 20:02:12
  1 1    0 14251772 276189504  98467104    0    0    28  2911  2281  7996   2   0  98   0   0 2014-08-07 20:02:32
  0 0    0 14665908 276178944  98067344    0    0    11  1531  2721  6019   2   0  97   0   0 2014-08-07 20:02:52
  0 0    0 14466244 276147040  98299968    0    0     0  1010  1219  1773   0   0  99   0   0 2014-08-07 20:03:12
  1 0    0 14570524 276134336  98208576    0    0     0  2750  1414  1880   0   0  99   0   0 2014-08-07 20:03:32
  4 0    0 12302956 276266944 100366312    0    0    69  8262  3124  1924   3   1  96   0   0 2014-08-07 20:03:52
  1 1    0 14413524 276246528  98281104    0    0     1 23140 34235  2993   2   1  97   0   0 2014-08-07 20:04:12
  0 0    0 13817992 276171136  98924048    0    0     0  5172  4178  4326   0   0  99   0   0 2014-08-07 20:04:32
  2 0    0 12048324 276213408 100584656    0    0    23 18357  9639  9909   3   1  96   0   0 2014-08-07 20:04:52
 10 0    0 10548260 276394112 101867312    0    0     7 33960 29356 19471   6   3  89   1   0 2014-08-07 20:05:12

On this system, when free memory drops below 10 GB, the kernel VM starts some cleanup processes, and one of them hits the bug.

Just before the kernel crashes, we see a large increase in system CPU usage (>60%).
wolfy

2014-08-08 08:18

developer   ~0020627

If you have not done so already, please report this info to the Red Hat Bugzilla as well.
thomasf

2014-08-10 21:53

reporter   ~0020653

Primitive Workaround:

When free memory approaches 3%, run the command:

"sync; echo 3 > /proc/sys/vm/drop_caches"

I run this command automatically every hour, and the system doesn't reboot anymore.
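
A minimal sketch of how this workaround could be automated, assuming the 3% figure is measured as MemFree against MemTotal in /proc/meminfo (the comment does not say how free memory was measured). The function names `free_percent` and `drop_caches_if_low` are hypothetical, and writing to /proc/sys/vm/drop_caches requires root:

```shell
#!/bin/sh
# Hypothetical helper: print MemFree as a whole-number percentage of
# MemTotal, read from a /proc/meminfo-style file (default: /proc/meminfo).
free_percent() {
    awk '/^MemTotal:/ { total = $2 }
         /^MemFree:/  { free = $2 }
         END {
             if (total <= 0) exit 1
             printf "%d\n", free * 100 / total
         }' "${1:-/proc/meminfo}"
}

# Hypothetical helper: run the workaround only when free memory is at or
# below the threshold in percent (default 3, the figure quoted above).
# Assumption: MemFree/MemTotal is the right measure. Needs root.
drop_caches_if_low() {
    pct=$(free_percent "${2:-/proc/meminfo}") || return 1
    if [ "$pct" -le "${1:-3}" ]; then
        sync
        echo 3 > /proc/sys/vm/drop_caches
    fi
}
```

Called as `drop_caches_if_low 3` from a root-owned periodic job, this only drops the caches when free memory is actually low. It is a stopgap like the manual command, not a fix for the underlying kernel bug.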
thomasf

2014-08-18 09:11

reporter   ~0020696

There is no update on the Red Hat site.
The ticket still has the status NEW.
thomasf

2014-08-29 15:11

reporter   ~0020784

Update from the Red Hat site:
*** This bug has been marked as a duplicate of bug 1119439 ***
Bug 1119439 has had the status POST since 2014-07-14.
thomasf

2014-09-05 20:40

reporter   ~0020837

Fixed In Version: kernel-3.10.0-152.el7
slayerduck

2014-10-01 10:41

reporter   ~0021030

I'm experiencing the exact same problem: high-memory ECC servers, with reboots happening across all my CentOS 7 production servers at intervals of 8 to 48 hours. How can I implement the real fix? I'll be running the workaround in the meantime.

[203314.752738] ------------[ cut here ]------------
[203314.752759] kernel BUG at mm/memory.c:3765!
[203314.752772] invalid opcode: 0000 [#1] SMP
[203314.752800] Modules linked in: nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables xt_connlimit nf_nat_ftp iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_REDIRECT nf_nat xt_conntrack iptable_mangle nf_conntrack_ftp nf_conntrack ipt_REJECT xt_LOG xt_limit iptable_filter ip_tables xt_multiport sg binfmt_misc iTCO_wdt iTCO_vendor_support raid456 async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd serio_raw pcspkr sb_edac edac_core i2c_i801 lpc_ich mfd_core igb ioatdma ptp mei_me pps_core mei shpchp dca wmi ipmi_si ipmi_msghandler mperf xfs libcrc32c raid1 sd_mod sr_mod cdrom crc_t10dif crct10dif_common usb_storage syscopyarea
[203314.753319] sysfillrect sysimgblt i2c_algo_bit drm_kms_helper ttm isci drm libsas ahci libahci scsi_transport_sas libata i2c_core
[203314.753413] CPU: 13 PID: 250204 Comm: php Not tainted 3.10.0-123.8.1.el7.x86_64 #1
[203314.753434] Hardware name: Supermicro X9DRW/X9DRW, BIOS 3.0a 08/08/2013
[203314.753456] task: ffff88084ef8f1c0 ti: ffff880dd991e000 task.ti: ffff880dd991e000
[203314.753477] RIP: 0010:[<ffffffff8116ca48>] [<ffffffff8116ca48>] handle_mm_fault+0xc78/0xd90
[203314.754403] RSP: 0018:ffff880dd991fda0 EFLAGS: 00010246
[203314.754867] RAX: 0000000000000100 RBX: 00007fb9601d2946 RCX: 0000000000000000
[203314.755781] RDX: ffff88084ef8f1c0 RSI: 0000000000000000 RDI: 8000000ae36001e6
[203314.756693] RBP: ffff880dd991fe20 R08: ffff880dd991e000 R09: 0000000000000000
[203314.757608] R10: 0000000000000000 R11: 0000000000000246 R12: ffff880d0a2f5800
[203314.758524] R13: ffff8800370a85e8 R14: 0000000000000029 R15: ffff881051ec4b00
[203314.759438] FS: 00007fb9f87f8700(0000) GS:ffff88085fce0000(0000) knlGS:0000000000000000
[203314.760351] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[203314.760814] CR2: 00000000020da430 CR3: 0000000a31e46000 CR4: 00000000001407e0
[203314.761726] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[203314.762639] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[203314.763552] Stack:
[203314.764005] 0000000000000008 ffff88062a28cfa0 ffffffff810bfd38 ffff881051ec4b00
[203314.765024] ffffc900168c0388 0000000000000000 00000000c11d74ba 00007fba1bb25d80
[203314.765952] 00000000ffffffff 0000000000000001 00000000c11d74ba 0000000000000006
[203314.766883] Call Trace:
[203314.767345] [<ffffffff810bfd38>] ? get_futex_key+0x1c8/0x2a0
[203314.767812] [<ffffffff815edac6>] __do_page_fault+0x156/0x540
[203314.768277] [<ffffffff810c2b62>] ? do_futex+0x122/0x5b0
[203314.768742] [<ffffffff8109fd0c>] ? update_curr+0xcc/0x150
[203314.769208] [<ffffffff8109b806>] ? __dequeue_entity+0x26/0x40
[203314.769678] [<ffffffff81011619>] ? __switch_to+0x179/0x490
[203314.770149] [<ffffffff815edeca>] do_page_fault+0x1a/0x70
[203314.770613] [<ffffffff815ea108>] page_fault+0x28/0x30
[203314.771077] Code: e8 ae c1 ff ff 85 c0 0f 85 06 f7 ff ff 49 8b 3c 24 e9 4e f5 ff ff 4c 89 f7 e8 15 b8 03 00 4c 89 f7 e8 fd 1f fe ff e9 07 fb ff ff <0f> 0b 4c 89 e7 4c 89 5d b8 e8 9a ad ff ff 48 89 de 49 89 c7 4c
[203314.772765] RIP [<ffffffff8116ca48>] handle_mm_fault+0xc78/0xd90
[203314.773236] RSP <ffff880dd991fda0>
slayerduck

2014-10-07 15:56

reporter   ~0021075

The workaround "sync; echo 3 > /proc/sys/vm/drop_caches" is no longer effective; I had a panic after 4 days of uptime. I tried to install the ELRepo kernel-ml, but it fails to boot the OS. I'm out of options here. What should I do?
michele franceschini

2014-10-19 18:07

reporter   ~0021188

Hi, I'm having a kernel panic on an HP 6735s with an AMD Turion processor. I successfully installed the ISO image from FTP, then upgraded to kernel 3.10.0-123.8.1.el7.x86_64, and since then I get a kernel panic 9 times out of 10.
The last 2 lines of the message are now as follows:
41.219124 Kernel panic - not syncing: Fatal exception
41.220041 drm_kms_helper: panic occurred, switching back to text console
I can install CentOS 6.5 without any problem, but el7 is absolutely unstable and unreliable.
Please help me.
sde

2014-10-27 18:11

reporter   ~0021413

I have the same issue. Where can I get kernel-3.10.0-152.el7? Please help.
toracat

2014-10-27 18:17

manager   ~0021414

Last edited: 2014-10-27 18:38

I suspect kernel-3.10.0-152.el7 will be in el7.7.

@thomasf, do you see any patch(es) in BZ#1119439?

[EDIT] err, I meant el7.1 of course.

thomasf

2014-10-27 18:31

reporter   ~0021415

No, just the information that the bug is fixed in kernel-3.10.0-152.el7.
The Red Hat status is ON_QA.

Old Question:
Workaround "sync; echo 3 > /proc/sys/vm/drop_caches" is no longer effective, i had a panic happen after 4 days of uptime.

Answer:
If you have a heavy-I/O system, you need to run the command more often than just every hour.
thomasf

2014-10-27 18:36

reporter   ~0021417

Because I ran into another kernel bug and can't endanger my production system, I have switched to Ubuntu 14.04 LTS, where this bug does not exist.
sde

2014-10-27 18:55

reporter   ~0021419

Thank you thomasf. I have added the workaround you have mentioned.
JohnnyHughes

2014-10-27 19:13

administrator   ~0021420

You might try this upstream testing kernel and see if it fixes that problem:

http://people.centos.org/hughesjr/kernel/7/

I found this upstream at people.redhat.com for a different problem, but it contains the 3.10.0-152.el7 tree, so it should contain the fix for this problem.

Please note that this is a testing kernel and not necessarily production grade.
JohnnyHughes

2014-10-27 19:15

administrator   ~0021421

Please note that we try our best to help fix bugs in CentOS, but if you want real time SLA support, then a RHEL subscription is the way to go for that type of support.
toracat

2014-10-27 19:54

manager   ~0021422

A quick look at the changelog indicates that the bug was fixed by this patch:

https://lkml.org/lkml/2014/4/25/506

It is in kernel >= 3.14. So, wolfy's suggestion to try ELRepo's kernel-ml (comment 20620) would have worked. kernel-ml is now at 3.17.
sde

2014-10-27 19:55

reporter   ~0021423

Thanks a lot Johnny! I have installed the kernel and restarted the memory intensive job. Thanks again.
toracat

2014-10-28 21:26

manager   ~0021442

kernel-3.10.0-123.9.2.el7 is out. CentOS' kernel-plus now has the patch referenced in comment 21422.
slayerduck

2014-10-29 02:22

reporter   ~0021444

So installing either Johnny's kernel or the centosplus kernel fixes it? I assume centosplus is the better bet for stability, then?
JohnnyHughes

2014-10-29 02:28

administrator   ~0021445

I would think that the plus kernel would be better if it works.

That kernel should be on mirror.centos.org within 30 minutes .. kernel-plus-3.10.0-123.9.2.el7
sde

2014-10-31 20:21

reporter   ~0021494

So far I haven't had any more crashes after installing Johnny's kernel-3.10.0-167.el7.bz1043379.17.x86_64 and running a memory-intensive job for about 4 days. I also don't see any other problems with this test kernel.

Has anyone tested kernel-3.10.0-123.9.2.el7 to see if it has corrected the "kernel BUG at mm/memory.c:3765!"?
michele franceschini

2014-11-01 13:02

reporter   ~0021499

Hi, it still panics. The message I get now is as follows:
CPU: 0 PID: 7 Comm: migration/0 Tainted: PFDO 3.10.0-123.9.2.el7.x86_64 #1
Hardware name: Hewlett Packard Hp Compaq 6735/30E4 BIOS 68GPP Ver. F.0E 9/14/2009
If you need any log file, please let me know where I can find it and I'll send it to you as well.
Sincerely,
Michele
toracat

2014-11-01 13:56

manager   ~0021500

Please note that kernel-3.10.0-123.9.2.el7 does not have the fix. You need to try either the one Johnny made available or kernel-plus from the centosplus repository.
slayerduck

2014-11-07 02:30

reporter   ~0021579

I tried to migrate to the centosplus kernel, but I can't get it to boot. After installing the centosplus kernel, it hangs after basic systems and throws an error saying it can't reach /dev/root by the UUID given. Going back to the original kernel works fine, and it uses the same UUID for root. Should I create a new ticket for that?
sde

2014-11-07 16:03

reporter   ~0021596

First, thanks to toracat for the clarification. Since 11/3/2014 I have been using kernel-plus-3.10.0-123.9.2.el7 without any issue. The only thing I noticed is that "kernel.perf_event max sample rate" came down to 7000 from 50000 compared to Johnny's kernel! Also, memory now remains in the "free" category rather than the "cached" category until it is required.
sde

2014-11-07 16:09

reporter   ~0021597

So my question is: does kernel-3.10.0-123.9.3.el7 contain the fix?
toracat

2014-11-07 17:46

manager   ~0021602

@sde

No. There was only one bug fix when going from 3.10.0-123.9.2.el7 to 3.10.0-123.9.3.el7, and it is not this one.
sde

2014-11-07 18:03

reporter   ~0021603

Thanks, toracat. Then I guess at some point I will need to update to kernel-plus-3.10.0-123.9.3.el7.
slayerduck

2015-01-01 16:56

reporter   ~0022064

Any ETA for this update to be in the normal kernel? It's been almost 4 months now.
Evolution

2015-01-02 00:00

administrator   ~0022065

It will be patched as soon as upstream (RH) decides to fix it. We have no control over that. As Johnny said above, if you want an ETA or SLA, that's what RHEL is for.

In the meantime, kernel-plus or kernel-ml will have to do.
toracat

2015-01-29 01:40

manager   ~0022253

The patch is now in the distro kernel 3.10.0-123.20.1.el7 and therefore no longer needs to be added to the plus kernel.
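
For readers checking their own machines, here is a rough sketch that tests whether a kernel release string is at least 3.10.0-123.20.1.el7. The function name `has_memory_fix` is hypothetical, and the parsing assumes stock el7 `uname -r` strings of the form 3.10.0-<release>.el7... only. It cannot tell whether a kernel-plus or test build with a lower release number carried the patch separately, as discussed earlier in this thread.

```shell
#!/bin/sh
# Hypothetical helper: exit status 0 if the given el7 kernel release is at
# least 3.10.0-123.20.1.el7, 1 if it is older, 2 if it cannot be parsed.
# The body runs in a subshell, so "set --" and the variables stay private.
has_memory_fix() (
    case "$1" in
        3.10.0-*.el7*) ;;
        *) exit 2 ;;          # not a 3.10.0 el7 kernel: no verdict
    esac
    rel=${1#3.10.0-}          # e.g. "123.20.1.el7.x86_64"
    rel=${rel%%.el7*}         # e.g. "123.20.1"
    case "$rel" in
        ''|*[!0-9.]*) exit 2 ;;
    esac
    IFS=.
    set -- $rel               # split the release into numeric fields
    a=${1:-0} b=${2:-0} c=${3:-0}
    # Compare field by field against 123.20.1
    [ "$a" -gt 123 ] && exit 0
    [ "$a" -lt 123 ] && exit 1
    [ "$b" -gt 20 ] && exit 0
    [ "$b" -lt 20 ] && exit 1
    [ "$c" -ge 1 ]
)
```

A typical call would be `has_memory_fix "$(uname -r)" && echo "distro kernel includes the fix"`. A status of 2 means the string did not look like a 3.10.0 el7 kernel, so the sketch gives no verdict for kernels such as ELRepo's kernel-ml.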

Issue History

Date Modified Username Field Change
2014-08-07 18:00 thomasf New Issue
2014-08-07 18:19 wolfy Note Added: 0020614
2014-08-07 23:56 thomasf Note Added: 0020618
2014-08-08 00:09 thomasf Note Added: 0020619
2014-08-08 06:09 wolfy Note Added: 0020620
2014-08-08 07:54 thomasf Note Added: 0020626
2014-08-08 08:18 wolfy Note Added: 0020627
2014-08-10 21:53 thomasf Note Added: 0020653
2014-08-18 09:11 thomasf Note Added: 0020696
2014-08-29 15:11 thomasf Note Added: 0020784
2014-09-05 20:40 thomasf Note Added: 0020837
2014-10-01 10:41 slayerduck Note Added: 0021030
2014-10-07 15:56 slayerduck Note Added: 0021075
2014-10-19 18:07 michele franceschini Note Added: 0021188
2014-10-27 18:11 sde Note Added: 0021413
2014-10-27 18:17 toracat Note Added: 0021414
2014-10-27 18:31 thomasf Note Added: 0021415
2014-10-27 18:36 thomasf Note Added: 0021417
2014-10-27 18:38 toracat Note Edited: 0021414 View Revisions
2014-10-27 18:55 sde Note Added: 0021419
2014-10-27 19:13 JohnnyHughes Note Added: 0021420
2014-10-27 19:15 JohnnyHughes Note Added: 0021421
2014-10-27 19:54 toracat Note Added: 0021422
2014-10-27 19:55 sde Note Added: 0021423
2014-10-28 21:26 toracat Note Added: 0021442
2014-10-29 02:22 slayerduck Note Added: 0021444
2014-10-29 02:28 JohnnyHughes Note Added: 0021445
2014-10-31 20:21 sde Note Added: 0021494
2014-11-01 13:02 michele franceschini Note Added: 0021499
2014-11-01 13:56 toracat Note Added: 0021500
2014-11-07 02:30 slayerduck Note Added: 0021579
2014-11-07 16:03 sde Note Added: 0021596
2014-11-07 16:09 sde Note Added: 0021597
2014-11-07 17:46 toracat Note Added: 0021602
2014-11-07 18:03 sde Note Added: 0021603
2015-01-01 16:56 slayerduck Note Added: 0022064
2015-01-02 00:00 Evolution Note Added: 0022065
2015-01-29 01:40 toracat Note Added: 0022253
2015-01-29 01:43 toracat Status new => resolved
2015-01-29 01:43 toracat Resolution open => fixed