View Issue Details

ID: 0018571
Project: CentOS-7
Category: centos-release
View Status: public
Last Update: 2023-03-08 02:51
Reporter: djhwyh
Assigned To:
Priority: normal
Severity: crash
Reproducibility: sometimes
Status: new
Resolution: open
Platform: x86
OS: CentOS Linux release
OS Version: 7.9.2009
Product Version: 7.9.2009
Summary: 0018571: The operating system always crashes after a few days
Description:
1. 127.0.0.1-2023-03-03-01\:29\:08/vmcore-dmesg.txt
[6209143.271660] double fault: 0000 [#1] SMP
[6209143.271685] Modules linked in: iptable_mangle sunrpc iTCO_wdt iTCO_vendor_support sb_edac coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel ipmi_si lrw gf128mul glue_helper ipmi_devintf ablk_helper cryptd ipmi_msghandler wdat_wdt pcspkr joydev mei_me lpc_ich mei i2c_i801 ses enclosure scsi_transport_sas sg acpi_power_meter ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic ahci ixgbe libahci crct10dif_pclmul crct10dif_common crc32c_intel libata megaraid_sas mdio igb ptp pps_core i2c_algo_bit dca dm_mirror dm_region_hash dm_log dm_mod
[6209143.272411] CPU: 34 PID: 28403 Comm: bash Kdump: loaded Not tainted 3.10.0-1160.76.1.el7.x86_64 #1
[6209143.272791] Hardware name: Huawei RH2288 V3/BC11HGSB0, BIOS 5.03 07/25/2018
[6209143.272984] task: ffff9d6b7ceb1080 ti: ffff9d67286cc000 task.ti: ffff9d67286cc000
[6209143.273354] RIP: 0010:[<00000000b5199e33>] [<00000000b5199e33>] 0xb5199e33
[6209143.273556] RSP: 0018:00007ffd48b3bd78 EFLAGS: 00010046
[6209143.273749] RAX: 0000000000000004 RBX: 0000000000bc6ba8 RCX: 00007faf4c13a465
[6209143.274119] RDX: 00007ffd48b3bd80 RSI: 00007ffd48b3bd80 RDI: 0000000000bc6ba8
[6209143.274490] RBP: 0000000000bc6ba8 R08: 0000000000000002 R09: 0000000000000002
[6209143.274861] R10: 0000000000000010 R11: 0000000000000246 R12: 0000000000bbd988
[6209143.275230] R13: 0000000000bc1248 R14: 0000000000000024 R15: 0000000000000000
[6209143.275602] FS: 00007faf4ca59740(0000) GS:ffff9d6b7f380000(0000) knlGS:0000000000000000
[6209143.275972] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[6209143.276169] CR2: 00000000b5199e33 CR3: 0000001472676000 CR4: 00000000003607e0
[6209143.276537] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[6209143.276907] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[6209143.277275] Call Trace:
[6209143.277463] Code: Bad RIP value.
[6209143.277660] RIP [<00000000b5199e33>] 0xb5199e33
[6209143.277855] RSP <00007ffd48b3bd78>
2. 127.0.0.1-2022-12-21-04\:26\:01/vmcore-dmesg.txt
[683524.397035] double fault: 0000 [#1] SMP
[683524.397061] Modules linked in: sunrpc iTCO_wdt iTCO_vendor_support sb_edac coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel ipmi_si lrw gf128mul glue_helper ipmi_devintf ablk_helper cryptd ipmi_msghandler wdat_wdt pcspkr joydev ses enclosure scsi_transport_sas lpc_ich i2c_i801 sg mei_me mei acpi_power_meter ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic ahci libahci ixgbe crct10dif_pclmul libata crct10dif_common crc32c_intel megaraid_sas mdio igb ptp pps_core i2c_algo_bit dca dm_mirror dm_region_hash dm_log dm_mod
[683524.397398] CPU: 34 PID: 7501 Comm: grep Kdump: loaded Not tainted 3.10.0-1160.76.1.el7.x86_64 #1
[683524.397418] Hardware name: Huawei RH2288 V3/BC11HGSB0, BIOS 5.03 07/25/2018
[683524.397437] task: ffff9cfe28e1e300 ti: ffff9cffdc58c000 task.ti: ffff9cffdc58c000
[683524.397813] RIP: 0010:[<0000000095999e33>] [<0000000095999e33>] 0x95999e33
[683524.398019] RSP: 0018:00007ffddaae9ed8 EFLAGS: 00010016
[683524.398212] RAX: 0000000000000003 RBX: 00007f8dccdb31c0 RCX: 00007f8dccadb1f0
[683524.398588] RDX: 00007f8dccdae838 RSI: 0000000000000001 RDI: 0000000000000002
[683524.398957] RBP: 00007f8dccdaf380 R08: 00007f8dccdb49f0 R09: 00007f8dcd22c740
[683524.399324] R10: 00007ffddaae99a0 R11: 0000000000000216 R12: 0000000000000000
[683524.399695] R13: 00007f8dccdb3e80 R14: 0000000000000000 R15: 0000000000000002
[683524.400066] FS: 00007f8dcd22c740(0000) GS:ffff9d1abf380000(0000) knlGS:0000000000000000
[683524.400439] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[683524.400628] CR2: 0000000095999e33 CR3: 000000200fc02000 CR4: 00000000003607e0
[683524.400996] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[683524.401370] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[683524.401739] Call Trace:
[683524.401924] Code: Bad RIP value.
[683524.402123] RIP [<0000000095999e33>] 0x95999e33
[683524.402320] RSP <00007ffddaae9ed8>
3. 127.0.0.1-2022-12-08-07\:00\:54/vmcore-dmesg.txt
[4344241.382981] double fault: 0000 [#1] SMP
[4344241.383007] Modules linked in: tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag iptable_mangle sunrpc iTCO_wdt iTCO_vendor_support sb_edac coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel ipmi_si lrw gf128mul glue_helper ipmi_devintf ablk_helper cryptd ipmi_msghandler wdat_wdt pcspkr joydev ses enclosure scsi_transport_sas lpc_ich i2c_i801 sg mei_me mei acpi_power_meter ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic ahci libahci ixgbe crct10dif_pclmul crct10dif_common crc32c_intel libata megaraid_sas mdio igb ptp pps_core i2c_algo_bit dca dm_mirror dm_region_hash dm_log dm_mod
[4344241.383377] CPU: 34 PID: 3993 Comm: tunasync Kdump: loaded Not tainted 3.10.0-1160.76.1.el7.x86_64 #1
[4344241.383746] Hardware name: Huawei RH2288 V3/BC11HGSB0, BIOS 5.03 07/25/2018
[4344241.383940] task: ffff943fb886c200 ti: ffff943fa4ce8000 task.ti: ffff943fa4ce8000
[4344241.384309] RIP: 0010:[<0000000096f99e30>] [<0000000096f99e30>] 0x96f99e30
[4344241.384506] RSP: 0018:00007f3c777fdd28 EFLAGS: 00010002
[4344241.384698] RAX: 0000000000000018 RBX: 0000000000000004 RCX: 000000000046cc07
[4344241.385074] RDX: 00000000019ddf70 RSI: 0000000000000004 RDI: 0000000000000011
[4344241.385446] RBP: 00007f3c777fdd68 R08: 0000000000000001 R09: 0000000000424a81
[4344241.385804] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000054
[4344241.386159] R13: 00000000019a7440 R14: 0000000000000000 R15: 0000000000000000
[4344241.386522] FS: 00007f3c777fe700(0000) GS:ffff944fbf380000(0000) knlGS:0000000000000000
[4344241.386889] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[4344241.387082] CR2: 0000000096f99e30 CR3: 000000203d6a2000 CR4: 00000000003607e0
[4344241.387442] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[4344241.387798] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[4344241.388160] Call Trace:
[4344241.388341] Code: Bad RIP value.
[4344241.388521] RIP [<0000000096f99e30>] 0x96f99e30
[4344241.388686] RSP <00007f3c777fdd28>
Steps To Reproduce: Cannot be reproduced on demand; the crashes occur at random.
Tags: No tags attached.
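
As a reading aid, the three excerpts in the description can be compared mechanically. The following is a minimal Python sketch, not part of the original report. It assumes the `CPU:` header lines are copied verbatim from the dumps above, and the names `HEADERS`, `HEADER_RE` and `parse_header` are made up for illustration. It extracts the uptime at crash time, the CPU, the PID and the command name from each header line.

```python
import re

# Header lines copied verbatim from the three vmcore-dmesg.txt excerpts above.
HEADERS = [
    "[6209143.272411] CPU: 34 PID: 28403 Comm: bash Kdump: loaded Not tainted 3.10.0-1160.76.1.el7.x86_64 #1",
    "[683524.397398] CPU: 34 PID: 7501 Comm: grep Kdump: loaded Not tainted 3.10.0-1160.76.1.el7.x86_64 #1",
    "[4344241.383377] CPU: 34 PID: 3993 Comm: tunasync Kdump: loaded Not tainted 3.10.0-1160.76.1.el7.x86_64 #1",
]

# Hypothetical pattern for this one line shape: "[uptime] CPU: n PID: n Comm: name".
HEADER_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\] CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) Comm: (?P<comm>\S+)"
)


def parse_header(line):
    """Return uptime (in days), CPU, PID and command name from one header line."""
    match = HEADER_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognised header line: {line!r}")
    return {
        "uptime_days": float(match["uptime"]) / 86400,  # dmesg stamps are seconds since boot
        "cpu": int(match["cpu"]),
        "pid": int(match["pid"]),
        "comm": match["comm"],
    }


crashes = [parse_header(line) for line in HEADERS]
for crash in crashes:
    print(f"uptime {crash['uptime_days']:6.1f} d  CPU {crash['cpu']}  PID {crash['pid']}  {crash['comm']}")

# Distinct CPUs seen across the excerpts.
print("CPUs:", sorted({crash["cpu"] for crash in crashes}))  # prints "CPUs: [34]"
```

On the three excerpts quoted here, every double fault is reported on CPU 34, in three unrelated processes (bash, grep and tunasync). Only three of the eight recorded crashes are quoted in this report, so this says nothing about the other five.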

Activities

ManuelWolfshant

2023-03-06 21:17

manager   ~0039073

Last edited: 2023-03-06 21:21

Given that there are literally millions of computers that run CentOS 7 without issues, the chances that you have found a bug that triggers a system crash and affects only you are extremely low.
Please run a comprehensive test of your hardware, because it is much more probable that you have a hardware issue such as overheating, a defective storage drive (which in turn leads to bad code or data being loaded by the OS) or a bad memory chip.
Incidentally, you are also using a version of the kernel which is slightly out of date. I doubt that this is your culprit, but a yum update should not hurt. You should also pay a visit to https://support.huawei.com/enterprise/en/server/rh2288-v3-pid-9901877/software/ because you do not seem to be running the latest BIOS, and there may be newer versions of interest to you.
djhwyh

2023-03-08 01:42

reporter   ~0039074

We reported this server to the manufacturer for repair several times in 2022 and also replaced a memory module. It still crashed and restarted frequently afterwards, but the manufacturer later checked it and found no hardware problem. I have also upgraded the system kernel several times. Yesterday I upgraded the kernel to the latest version, 3.10.0-1160.83.1.el7.x86_64, and we will continue to observe.
The previous crash dates are listed below:
 ls /var/crash/
127.0.0.1-2022-09-23-19:08:30 127.0.0.1-2022-09-24-12:58:48 127.0.0.1-2022-10-19-00:09:39 127.0.0.1-2022-12-21-04:26:01
127.0.0.1-2022-09-24-07:10:10 127.0.0.1-2022-10-13-15:13:39 127.0.0.1-2022-12-08-07:00:54 127.0.0.1-2023-03-03-01:29:08
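
To make the crash history easier to read, the directory names can be turned into timestamps and the gaps between consecutive crashes computed. This is a minimal Python sketch under stated assumptions, not something from the original thread: it assumes each directory is named `<host>-<YYYY-MM-DD-HH:MM:SS>` as in the listing above and that the host part contains no hyphen. `CRASH_DIRS` and `crash_time` are illustrative names.

```python
from datetime import datetime

# Directory names copied from the `ls /var/crash/` output above.
CRASH_DIRS = """
127.0.0.1-2022-09-23-19:08:30 127.0.0.1-2022-09-24-12:58:48 127.0.0.1-2022-10-19-00:09:39 127.0.0.1-2022-12-21-04:26:01
127.0.0.1-2022-09-24-07:10:10 127.0.0.1-2022-10-13-15:13:39 127.0.0.1-2022-12-08-07:00:54 127.0.0.1-2023-03-03-01:29:08
""".split()


def crash_time(dirname):
    """Parse the timestamp out of a '<host>-<YYYY-MM-DD-HH:MM:SS>' directory name."""
    # Assumption: the host part has no hyphen, so the first '-' ends it.
    _host, stamp = dirname.split("-", 1)
    return datetime.strptime(stamp, "%Y-%m-%d-%H:%M:%S")


# Crash times in chronological order, and the gap between each consecutive pair.
times = sorted(crash_time(name) for name in CRASH_DIRS)
gaps = [later - earlier for earlier, later in zip(times, times[1:])]

for when, gap in zip(times[1:], gaps):
    days = gap.total_seconds() / 86400
    print(f"{when:%Y-%m-%d %H:%M:%S}  {days:6.2f} days after the previous crash")
```

For this listing the gaps range from under six hours to roughly 72 days, so the interval between crashes is far from regular.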
ManuelWolfshant

2023-03-08 02:01

manager   ~0039075

Last edited: 2023-03-08 02:02

Many years ago I had a period when I was writing a great many Fedora DVDs every month, shipping them to people who could not afford to download them. Unfortunately, roughly half of the disks I wrote turned out to be defective. Nothing else seemed to be wrong with that computer, so I suspected the disks and obviously the DVD writer. After replacing the disk brand, DVD unit, cables and IDE port, I did a full and thorough hardware check of the whole system. It turned out that a memory module was defective, but memtest+ would report it as perfectly fine as long as the computer was configured to run in dual channel mode. The moment I switched to single channel, memtest+ failed. I replaced the faulty module and lived happily ever after.
As another story, the very system I am typing from used to blow up NVMe SSDs (they just became unreachable, as if nothing was connected to the NVMe port of the motherboard) after a random period of usage (between a week and 3 months). Kingston were nice and replaced them under warranty, but after the 4th replacement they suggested attempting a BIOS update. Guess what? I did that, and the system has worked perfectly fine since Jan 2021.
What I mean with the above stories is that I still bet on a hardware issue, even if your supplier was not able to identify one. There are tons of things that can go wrong: power supply, bad contact between cooler and processor, memory... even a defective processor that acts up only in very specific conditions (yep, I've seen that too; I have an AMD Duron that seems perfectly fine unless you run a very, very specific test). The chances that you are the only one in the whole world affected by a constant crash of CentOS 7 are close to none.
djhwyh

2023-03-08 02:51

reporter   ~0039076

OK, now that I have your specific reply, I will continue checking the hardware. Whatever the result, I will consider replacing all the memory first. If the fault becomes more frequent, I will even consider replacing the whole server.

Issue History

Date Modified Username Field Change
2023-03-06 08:20 djhwyh New Issue
2023-03-06 21:17 ManuelWolfshant Note Added: 0039073
2023-03-06 21:21 ManuelWolfshant Note Edited: 0039073
2023-03-08 01:42 djhwyh Note Added: 0039074
2023-03-08 02:01 ManuelWolfshant Note Added: 0039075
2023-03-08 02:02 ManuelWolfshant Note Edited: 0039075
2023-03-08 02:51 djhwyh Note Added: 0039076