View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0005229 | CentOS-6 | lvm2 | public | 2011-11-05 19:08 | 2011-11-08 14:48 |
Reporter | gpr_support_user | ||||
Priority | normal | Severity | block | Reproducibility | always |
Status | new | Resolution | open | ||
Platform | OS | CentOS | OS Version | 6.0 | |
Product Version | 6.0 | ||||
Target Version | Fixed in Version | ||||
Summary | 0005229: clvmd hang | ||||
Description | The system is configured as a two node cluster with clvmd and cmirrord. Each node is setup with two disks. One disk is for the boot and root partitions. The other disk, the one of concern, is setup as shared storage via aoe. Each node makes the two partions on this disk available to the other node via aoe. The intent is to get clvm to mirror these partions over aoe. The disks each have a 60GB data and 1GB mirror log partition. Setting up the arrangement goes smoothly. The mirror runs. But, when failure is simulated by removing the ethernet cable, the problem: running any clvm/lvm2 commands causes a hang on the command line. It is worth mentioning that it is only the clvm subsystem that hangs not the entire cluster or os. | ||||
Steps To Reproduce | [root@test1 aoe]# cat /etc/init.d/vbld #!/bin/bash modprobe aoe echo eth1 > /dev/etherd/interfaces vblade 1 1 eth1 /dev/sda1 & vblade 1 2 eth1 /dev/sda2 & [root@test2 ~]# cat /etc/init.d/vbld #!/bin/bash modprobe aoe echo eth1 > /dev/etherd/interfaces vblade 2 1 eth1 /dev/sdb1 & vblade 2 2 eth1 /dev/sdb2 & [root@test1 ~]# aoe-stat e2.1 64.436GB eth1 1024 up e2.2 1.085GB eth1 1024 up [root@test2 ~]# aoe-stat e1.1 64.436GB eth1 1024 up e1.2 1.085GB eth1 1024 up [root@test1 ~]# pvscan PV /dev/md2 VG main lvm2 [74.41 GiB / 0 free] PV /dev/etherd/e2.1 VG vg_shared lvm2 [60.01 GiB / 8.00 MiB free] PV /dev/sda1 VG vg_shared lvm2 [60.01 GiB / 60.01 GiB free] PV /dev/etherd/e2.2 VG vg_shared lvm2 [1.01 GiB / 1.01 GiB free] PV /dev/sda2 lvm2 [1.01 GiB] Total: 5 [196.44 GiB] / in use: 4 [195.43 GiB] / in no VG: 1 [1.01 GiB] [root@test2 ~]# pvscan PV /dev/md2 VG main lvm2 [74.41 GiB / 32.00 MiB free] PV /dev/sdb1 VG vg_shared lvm2 [60.01 GiB / 8.00 MiB free] PV /dev/etherd/e1.1 VG vg_shared lvm2 [60.01 GiB / 60.01 GiB free] PV /dev/sdb2 VG vg_shared lvm2 [1.01 GiB / 1.01 GiB free] PV /dev/etherd/e1.2 lvm2 [1.01 GiB] Total: 5 [196.44 GiB] / in use: 4 [195.43 GiB] / in no VG: 1 [1.01 GiB] [root@test1 ~]# lvdisplay <snip> --- Logical volume --- LV Name /dev/vg_shared/mirrorlv VG Name vg_shared LV UUID BzkHu3-WdPK-wZiF-cDSk-7Bq0-AhA3-Z0DXjW LV Write Access read/write LV Status available # open 0 LV Size 60.00 GiB Current LE 15360 Segments 1 Allocation inherit Read ahead sectors auto - currently set to 4096 Block device 253:2 [root@test2 ~]# lvdisplay <snip> --- Logical volume --- LV Name /dev/vg_shared/mirrorlv VG Name vg_shared LV UUID BzkHu3-WdPK-wZiF-cDSk-7Bq0-AhA3-Z0DXjW LV Write Access read/write LV Status available # open 0 LV Size 60.00 GiB Current LE 15360 Segments 1 Allocation inherit Read ahead sectors auto - currently set to 256 Block device 253:2 To show that this lv is cluster enabled: [root@test1 ~]# vgs VG #PV #LV #SN Attr VSize VFree main 1 2 0 wz--n- 74.41g 0 vg_shared 3 1 0 wz--nc 121.02g 61.02g [root@test2 ~]# vgs VG #PV #LV #SN Attr VSize VFree main 1 2 0 wz--n- 74.41g 32.00m vg_shared 3 1 0 wz--nc 121.02g 61.02g both nodes are in the cluster [root@test1 ~]# cman_tool nodes Node Sts Inc Joined Name 1 M 3548 2011-11-05 18:23:16 test1 2 M 3552 2011-11-05 18:23:16 test2 full process lists attached/uploaded to this issue as a separate txt file. Now when I remove the ethernet cable: dmesg show: [root@test1 ~]# dmesg <snip> eth1: link down dlm: closing connection to node 2 [root@test1 ~]# dmesg <snip> eth1: link down dlm: closing connection to node 1 And as expected: [root@test1 ~]# cman_tool nodes Node Sts Inc Joined Name 1 M 3548 2011-11-05 18:23:16 test1 2 X 3552 test2 Now running pvscan and hang: [root@test1 ~]# pvscan -vv -dd Setting global/locking_type to 3 Setting global/wait_for_locks to 1 Cluster locking selected. Locking VG P_#global PW (VG) (0x4) Also: [root@test1 ~]# dmesg <snip> INFO: task clvmd:4441 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. clvmd D f6235e48 0 4441 1 0x00000080 f5869030 00000082 00000002 f6235e48 c2084b34 00000000 f5869030 f592c030 c0408237 f6266880 0000012d c0ac0060 b7a41b51 0000012d c0ac0060 f58692d8 c0ac0060 c0abbb34 c0ac0060 f58692d8 c2084b34 c043fd3b 000f3602 f5869030 Call Trace: [<c0408237>] ? __switch_to+0xd7/0x1a0 [<c043fd3b>] ? find_busiest_group+0x1cb/0x970 [<c080d1f5>] ? rwsem_down_failed_common+0x75/0x1a0 [<c080d36a>] ? rwsem_down_read_failed+0x1a/0x24 [<c080d3a7>] ? call_rwsem_down_read_failed+0x7/0xc [<c080ca94>] ? down_read+0x14/0x20 [<f88b6168>] ? dlm_user_request+0x28/0x1e0 [dlm] [<c0510f44>] ? cache_alloc_refill+0x184/0x540 [<c040a129>] ? reschedule_interrupt+0x31/0x38 [<c05100d8>] ? ksm_scan_thread+0x4f8/0xa30 [<f88c239b>] ? device_write+0x43b/0x610 [dlm] [<c080b38f>] ? schedule+0x42f/0xb00 [<c0592cec>] ? security_file_permission+0xc/0x10 [<f88c1f60>] ? device_write+0x0/0x610 [dlm] [<c051de50>] ? vfs_write+0xa0/0x190 [<c04a993c>] ? audit_syscall_entry+0x21c/0x240 [<c051e8d1>] ? sys_write+0x41/0x70 [<c04099fb>] ? sysenter_do_call+0x12/0x28 From here it is impossible to recover lvm/clvm aside from rebooting the system. | ||||
Additional Information | [root@test1 ~]# cat /etc/redhat-release CentOS Linux release 6.0 (Final) [root@test1 ~]# uname -a Linux test1.localdomain 2.6.32-71.el6.i686 #1 SMP Fri Nov 12 04:17:17 GMT 2010 i686 i686 i386 GNU/Linux [root@test1 ~]# cat /etc/redhat-release CentOS Linux release 6.0 (Final) [root@test1 ~]# uname -a Linux test1.localdomain 2.6.32-71.el6.i686 #1 SMP Fri Nov 12 04:17:17 GMT 2010 i686 i686 i386 GNU/Linux [root@test1 ~]# rpm -q cman gfs2-utils lvm2-cluster cmirror rgmanager cman-3.0.12-23.el6.i686 gfs2-utils-3.0.12-23.el6.i686 lvm2-cluster-2.02.72-8.el6.i686 cmirror-2.02.72-8.el6.i686 rgmanager-3.0.12-10.el6.i686 [root@test2 ~]# rpm -q cman gfs2-utils lvm2-cluster cmirror rgmanager cman-3.0.12-23.el6.i686 gfs2-utils-3.0.12-23.el6.i686 lvm2-cluster-2.02.72-8.el6.i686 cmirror-2.02.72-8.el6.i686 rgmanager-3.0.12-10.el6.i686 [root@test1 ~]# aoe-version aoetools: 32 installed aoe driver: 77 running aoe driver: 77 [root@test2 ~]# aoe-version aoetools: 32 installed aoe driver: 77 running aoe driver: 77 vblade version: vblade-20 | ||||
Tags | aoe, clvmd, hang, lvdisplay, lvm2, pvscan, vgdisplay | ||||
process-list.txt (18,850 bytes)
[root@test1 ~]# ps auxww USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 1 0.1 0.1 2828 1344 ? Ss 18:22 0:01 /sbin/init root 2 0.0 0.0 0 0 ? S 18:22 0:00 [kthreadd] root 3 0.0 0.0 0 0 ? S 18:22 0:00 [migration/0] root 4 0.0 0.0 0 0 ? S 18:22 0:00 [ksoftirqd/0] root 5 0.0 0.0 0 0 ? S 18:22 0:00 [watchdog/0] root 6 0.0 0.0 0 0 ? S 18:22 0:00 [migration/1] root 7 0.0 0.0 0 0 ? S 18:22 0:00 [ksoftirqd/1] root 8 0.0 0.0 0 0 ? S 18:22 0:00 [watchdog/1] root 9 0.0 0.0 0 0 ? S 18:22 0:00 [events/0] root 10 0.0 0.0 0 0 ? S 18:22 0:00 [events/1] root 11 0.0 0.0 0 0 ? S 18:22 0:00 [cpuset] root 12 0.0 0.0 0 0 ? S 18:22 0:00 [khelper] root 13 0.0 0.0 0 0 ? S 18:22 0:00 [netns] root 14 0.0 0.0 0 0 ? S 18:22 0:00 [async/mgr] root 15 0.0 0.0 0 0 ? S 18:22 0:00 [pm] root 16 0.0 0.0 0 0 ? S 18:22 0:00 [sync_supers] root 17 0.0 0.0 0 0 ? S 18:22 0:00 [bdi-default] root 18 0.0 0.0 0 0 ? S 18:22 0:00 [kintegrityd/0] root 19 0.0 0.0 0 0 ? S 18:22 0:00 [kintegrityd/1] root 20 0.0 0.0 0 0 ? S 18:22 0:00 [kblockd/0] root 21 0.0 0.0 0 0 ? S 18:22 0:00 [kblockd/1] root 22 0.0 0.0 0 0 ? S 18:22 0:00 [kacpid] root 23 0.0 0.0 0 0 ? S 18:22 0:00 [kacpi_notify] root 24 0.0 0.0 0 0 ? S 18:22 0:00 [kacpi_hotplug] root 25 0.0 0.0 0 0 ? S 18:22 0:00 [ata/0] root 26 0.0 0.0 0 0 ? S 18:22 0:00 [ata/1] root 27 0.0 0.0 0 0 ? S 18:22 0:00 [ata_aux] root 28 0.0 0.0 0 0 ? S 18:22 0:00 [ksuspend_usbd] root 29 0.0 0.0 0 0 ? S 18:22 0:00 [khubd] root 30 0.0 0.0 0 0 ? S 18:22 0:00 [kseriod] root 33 0.0 0.0 0 0 ? S 18:22 0:00 [khungtaskd] root 34 0.0 0.0 0 0 ? S 18:22 0:00 [kswapd0] root 35 0.0 0.0 0 0 ? SN 18:22 0:00 [ksmd] root 36 0.0 0.0 0 0 ? S 18:22 0:00 [aio/0] root 37 0.0 0.0 0 0 ? S 18:22 0:00 [aio/1] root 38 0.0 0.0 0 0 ? S 18:22 0:00 [crypto/0] root 39 0.0 0.0 0 0 ? S 18:22 0:00 [crypto/1] root 44 0.0 0.0 0 0 ? S 18:22 0:00 [pciehpd] root 46 0.0 0.0 0 0 ? S 18:22 0:00 [kpsmoused] root 47 0.0 0.0 0 0 ? S 18:22 0:00 [usbhid_resumer] root 77 0.0 0.0 0 0 ? S 18:22 0:00 [kstriped] root 222 0.0 0.0 0 0 ? S 18:22 0:00 [scsi_eh_0] root 224 0.0 0.0 0 0 ? S 18:22 0:00 [scsi_eh_1] root 230 0.0 0.0 0 0 ? S 18:22 0:00 [scsi_eh_2] root 231 0.0 0.0 0 0 ? S 18:22 0:00 [scsi_eh_3] root 403 0.0 0.0 0 0 ? S 18:22 0:00 [md1_raid1] root 410 0.0 0.0 0 0 ? S 18:22 0:00 [md2_raid1] root 445 0.0 0.0 0 0 ? S 18:22 0:00 [kdmflush] root 452 0.0 0.0 0 0 ? S 18:22 0:00 [kdmflush] root 468 0.0 0.0 0 0 ? S 18:22 0:00 [jbd2/dm-0-8] root 469 0.0 0.0 0 0 ? S 18:22 0:00 [ext4-dio-unwrit] root 470 0.0 0.0 0 0 ? S 18:22 0:00 [ext4-dio-unwrit] root 547 0.0 0.0 2680 864 ? S<s 18:22 0:00 /sbin/udevd -d root 731 0.0 0.0 0 0 ? S 18:22 0:00 [flush-253:0] root 919 0.0 0.0 0 0 ? S 18:23 0:00 [jbd2/md1-8] root 920 0.0 0.0 0 0 ? S 18:23 0:00 [ext4-dio-unwrit] root 921 0.0 0.0 0 0 ? S 18:23 0:00 [ext4-dio-unwrit] root 1014 0.0 0.0 0 0 ? S 18:23 0:00 [kauditd] root 1356 0.0 0.0 12872 800 ? S<sl 18:23 0:00 auditd root 1372 0.0 0.1 36528 1416 ? Sl 18:23 0:00 /sbin/rsyslogd -c 4 root 1397 0.0 0.0 0 0 ? S 18:23 0:00 [kondemand/0] root 1398 0.0 0.0 0 0 ? S 18:23 0:00 [kondemand/1] rpc 1423 0.0 0.0 2512 776 ? Ss 18:23 0:00 rpcbind rpcuser 1441 0.0 0.1 2580 1052 ? Ss 18:23 0:00 rpc.statd root 1453 0.0 0.0 2576 444 ? Ss 18:23 0:00 mdadm --monitor --scan -f --pid-file=/var/run/mdadm/mdadm.pid root 1480 0.0 0.0 0 0 ? S 18:23 0:00 [rpciod/0] root 1481 0.0 0.0 0 0 ? S 18:23 0:00 [rpciod/1] root 1488 0.0 0.0 2772 500 ? Ss 18:23 0:00 rpc.idmapd root 1518 0.0 0.0 2676 856 ? S< 18:23 0:00 /sbin/udevd -d root 1519 0.0 0.0 0 0 ? S< 18:23 0:00 [aoe_tx] root 1520 0.0 0.0 0 0 ? S< 18:23 0:00 [aoe_ktio] root 1521 0.0 0.0 1960 484 ? S 18:23 0:00 vblade 1 1 eth1 /dev/sda1 root 1522 0.0 0.0 1960 480 ? S 18:23 0:00 vblade 1 2 eth1 /dev/sda2 root 1563 1.0 7.0 71412 68356 ? SLsl 18:23 0:08 corosync -f root 1620 0.0 0.1 36612 1692 ? Ssl 18:23 0:00 fenced root 1635 0.0 0.1 52876 1712 ? Ssl 18:23 0:00 dlm_controld root 1663 0.0 0.1 33408 1664 ? Ssl 18:23 0:00 gfs_controld root 1767 0.0 0.0 7164 752 ? Ss 18:23 0:00 cmirrord dbus 1777 0.0 0.0 2932 860 ? Ss 18:23 0:00 dbus-daemon --system root 1789 0.0 3.7 38496 36640 ? SLsl 18:23 0:00 clvmd -T30 root 1791 0.0 0.0 0 0 ? S 18:23 0:00 [dlm_astd] root 1792 0.0 0.0 0 0 ? S 18:23 0:00 [dlm_scand] root 1793 0.0 0.0 0 0 ? S 18:23 0:00 [dlm_recv/0] root 1794 0.0 0.0 0 0 ? S 18:23 0:00 [dlm_recv/1] root 1795 0.0 0.0 0 0 ? S 18:23 0:00 [dlm_send] root 1796 0.0 0.0 0 0 ? S 18:23 0:00 [dlm_recoverd] root 1809 0.0 0.0 0 0 ? S 18:23 0:00 [kdmflush] root 1836 0.0 0.0 1972 596 ? Ss 18:23 0:00 /usr/sbin/acpid 68 1845 0.0 0.3 6168 3660 ? Ss 18:23 0:00 hald root 1846 0.0 0.1 3736 1164 ? S 18:23 0:00 hald-runner root 1874 0.0 0.1 3812 996 ? S 18:23 0:00 hald-addon-input: Listening on /dev/input/event4 /dev/input/event1 /dev/input/event0 /dev/input/event2 68 1894 0.0 0.1 3456 980 ? S 18:23 0:00 hald-addon-acpi: listening on acpid socket /var/run/acpid.socket root 1908 0.0 0.1 29924 1552 ? Ssl 18:23 0:00 automount --pid-file /var/run/autofs.pid root 1926 0.0 0.1 8236 1124 ? Ss 18:23 0:00 /usr/sbin/sshd root 2002 0.0 0.2 10164 2096 ? Ss 18:23 0:00 /usr/libexec/postfix/master postfix 2010 0.0 0.2 10240 2124 ? S 18:23 0:00 pickup -l -t fifo -u postfix 2012 0.0 0.2 10308 2232 ? S 18:23 0:00 qmgr -l -t fifo -u root 2017 0.0 0.4 21096 4568 ? Ss 18:23 0:00 /usr/sbin/abrtd root 2025 0.0 0.1 5840 1180 ? Ss 18:23 0:00 crond root 2036 0.0 0.0 2824 328 ? Ss 18:23 0:00 /usr/sbin/atd root 2058 0.0 0.1 11160 1696 ? S<sl 18:23 0:00 modclusterd root 2070 0.0 0.0 1960 484 tty1 Ss+ 18:23 0:00 /sbin/mingetty /dev/tty1 root 2072 0.0 0.0 1960 452 tty2 Ss+ 18:23 0:00 /sbin/mingetty /dev/tty2 root 2075 0.0 0.0 1960 456 tty3 Ss+ 18:23 0:00 /sbin/mingetty /dev/tty3 root 2078 0.0 0.0 2676 840 ? S< 18:23 0:00 /sbin/udevd -d root 2080 0.0 0.0 1960 452 tty4 Ss+ 18:23 0:00 /sbin/mingetty /dev/tty4 root 2082 0.0 0.0 1960 456 tty5 Ss+ 18:23 0:00 /sbin/mingetty /dev/tty5 root 2087 0.0 0.0 1960 452 tty6 Ss+ 18:23 0:00 /sbin/mingetty /dev/tty6 root 2355 0.0 0.3 12012 3556 ? Ss 18:26 0:00 sshd: root@pts/0 root 2366 0.0 0.1 5192 1648 pts/0 Ss 18:26 0:00 -bash root 3329 1.0 0.1 4764 976 pts/0 R+ 18:35 0:00 ps auxww [root@test2 ~]# ps axuww USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 1 0.1 0.1 2828 1344 ? Ss 18:22 0:01 /sbin/init root 2 0.0 0.0 0 0 ? S 18:22 0:00 [kthreadd] root 3 0.0 0.0 0 0 ? S 18:22 0:00 [migration/0] root 4 0.0 0.0 0 0 ? S 18:22 0:00 [ksoftirqd/0] root 5 0.0 0.0 0 0 ? S 18:22 0:00 [watchdog/0] root 6 0.0 0.0 0 0 ? S 18:22 0:00 [migration/1] root 7 0.0 0.0 0 0 ? S 18:22 0:00 [ksoftirqd/1] root 8 0.0 0.0 0 0 ? S 18:22 0:00 [watchdog/1] root 9 0.0 0.0 0 0 ? S 18:22 0:00 [events/0] root 10 0.0 0.0 0 0 ? S 18:22 0:00 [events/1] root 11 0.0 0.0 0 0 ? S 18:22 0:00 [cpuset] root 12 0.0 0.0 0 0 ? S 18:22 0:00 [khelper] root 13 0.0 0.0 0 0 ? S 18:22 0:00 [netns] root 14 0.0 0.0 0 0 ? S 18:22 0:00 [async/mgr] root 15 0.0 0.0 0 0 ? S 18:22 0:00 [pm] root 16 0.0 0.0 0 0 ? S 18:22 0:00 [sync_supers] root 17 0.0 0.0 0 0 ? S 18:22 0:00 [bdi-default] root 18 0.0 0.0 0 0 ? S 18:22 0:00 [kintegrityd/0] root 19 0.0 0.0 0 0 ? S 18:22 0:00 [kintegrityd/1] root 20 0.0 0.0 0 0 ? S 18:22 0:00 [kblockd/0] root 21 0.0 0.0 0 0 ? S 18:22 0:00 [kblockd/1] root 22 0.0 0.0 0 0 ? S 18:22 0:00 [kacpid] root 23 0.0 0.0 0 0 ? S 18:22 0:00 [kacpi_notify] root 24 0.0 0.0 0 0 ? S 18:22 0:00 [kacpi_hotplug] root 25 0.0 0.0 0 0 ? S 18:22 0:00 [ata/0] root 26 0.0 0.0 0 0 ? S 18:22 0:00 [ata/1] root 27 0.0 0.0 0 0 ? S 18:22 0:00 [ata_aux] root 28 0.0 0.0 0 0 ? S 18:22 0:00 [ksuspend_usbd] root 29 0.0 0.0 0 0 ? S 18:22 0:00 [khubd] root 30 0.0 0.0 0 0 ? S 18:22 0:00 [kseriod] root 33 0.0 0.0 0 0 ? S 18:22 0:00 [khungtaskd] root 34 0.0 0.0 0 0 ? S 18:22 0:00 [kswapd0] root 35 0.0 0.0 0 0 ? SN 18:22 0:00 [ksmd] root 36 0.0 0.0 0 0 ? S 18:22 0:00 [aio/0] root 37 0.0 0.0 0 0 ? S 18:22 0:00 [aio/1] root 38 0.0 0.0 0 0 ? S 18:22 0:00 [crypto/0] root 39 0.0 0.0 0 0 ? S 18:22 0:00 [crypto/1] root 45 0.0 0.0 0 0 ? S 18:22 0:00 [kpsmoused] root 46 0.0 0.0 0 0 ? S 18:22 0:00 [usbhid_resumer] root 76 0.0 0.0 0 0 ? S 18:22 0:00 [kstriped] root 133 0.0 0.0 0 0 ? S 18:22 0:00 [i915] root 134 0.0 0.0 0 0 ? S< 18:22 0:00 [kslowd000] root 135 0.0 0.0 0 0 ? S< 18:22 0:00 [kslowd001] root 239 0.0 0.0 0 0 ? S 18:22 0:00 [scsi_eh_0] root 240 0.0 0.0 0 0 ? S 18:22 0:00 [scsi_eh_1] root 367 0.0 0.0 0 0 ? S 18:22 0:00 [md1_raid1] root 374 0.0 0.0 0 0 ? S 18:22 0:00 [md2_raid1] root 409 0.0 0.0 0 0 ? S 18:22 0:00 [kdmflush] root 416 0.0 0.0 0 0 ? S 18:22 0:00 [kdmflush] root 430 0.0 0.0 0 0 ? S 18:22 0:00 [kjournald] root 441 0.0 0.0 0 0 ? S 18:22 0:00 [flush-253:0] root 510 0.0 0.0 2680 868 ? S<s 18:22 0:00 /sbin/udevd -d root 678 0.0 0.0 0 0 ? S 18:22 0:00 [hd-audio0] root 865 0.0 0.0 0 0 ? S 18:22 0:00 [kjournald] root 960 0.0 0.0 0 0 ? S 18:22 0:00 [kauditd] root 1290 0.0 0.0 13764 804 ? S<sl 18:22 0:00 auditd root 1306 0.0 0.1 36528 1412 ? Sl 18:22 0:00 /sbin/rsyslogd -c 4 root 1331 0.0 0.0 0 0 ? S 18:22 0:00 [kondemand/0] root 1332 0.0 0.0 0 0 ? S 18:22 0:00 [kondemand/1] rpc 1357 0.0 0.0 2512 776 ? Ss 18:22 0:00 rpcbind rpcuser 1375 0.0 0.1 2580 1060 ? Ss 18:22 0:00 rpc.statd root 1387 0.0 0.0 2576 444 ? Ss 18:22 0:00 mdadm --monitor --scan -f --pid-file=/var/run/mdadm/mdadm.pid root 1413 0.0 0.0 0 0 ? S 18:22 0:00 [rpciod/0] root 1414 0.0 0.0 0 0 ? S 18:22 0:00 [rpciod/1] root 1421 0.0 0.0 2744 484 ? Ss 18:22 0:00 rpc.idmapd root 1450 0.0 0.0 2676 892 ? S< 18:22 0:00 /sbin/udevd -d root 1451 0.0 0.0 0 0 ? S< 18:22 0:00 [aoe_tx] root 1452 0.0 0.0 0 0 ? S< 18:22 0:00 [aoe_ktio] root 1453 0.0 0.0 1960 480 ? S 18:22 0:00 vblade 2 1 eth1 /dev/sdb1 root 1454 0.0 0.0 1960 484 ? S 18:22 0:00 vblade 2 2 eth1 /dev/sdb2 root 1493 1.0 7.1 75716 72660 ? SLsl 18:22 0:08 corosync -f root 1548 0.0 0.1 36612 1596 ? Ssl 18:22 0:00 fenced root 1562 0.0 0.1 56980 1736 ? Ssl 18:22 0:00 dlm_controld root 1586 0.0 0.1 33408 1572 ? Ssl 18:22 0:00 gfs_controld root 1676 0.0 0.0 2676 860 ? S< 18:23 0:00 /sbin/udevd -d root 1702 0.0 0.0 7164 752 ? Ss 18:23 0:00 cmirrord dbus 1712 0.0 0.0 2932 992 ? Ss 18:23 0:00 dbus-daemon --system root 1724 0.0 3.5 38496 36636 ? SLsl 18:23 0:00 clvmd -T30 root 1726 0.0 0.0 0 0 ? S 18:23 0:00 [dlm_astd] root 1727 0.0 0.0 0 0 ? S 18:23 0:00 [dlm_scand] root 1728 0.0 0.0 0 0 ? S 18:23 0:00 [dlm_recv/0] root 1729 0.0 0.0 0 0 ? S 18:23 0:00 [dlm_recv/1] root 1730 0.0 0.0 0 0 ? S 18:23 0:00 [dlm_send] root 1731 0.0 0.0 0 0 ? S 18:23 0:00 [dlm_recoverd] root 1743 0.0 0.0 0 0 ? S 18:23 0:00 [kdmflush] 68 1770 0.0 0.3 6052 3536 ? Ss 18:23 0:00 hald root 1771 0.0 0.1 3736 1052 ? S 18:23 0:00 hald-runner root 1799 0.0 0.0 3812 988 ? S 18:23 0:00 hald-addon-input: Listening on /dev/input/event0 /dev/input/event1 68 1814 0.0 0.0 3452 992 ? S 18:23 0:00 /usr/libexec/hald-addon-acpi root 1833 0.0 0.1 29924 1552 ? Ssl 18:23 0:00 automount --pid-file /var/run/autofs.pid root 1851 0.0 0.1 8236 1120 ? Ss 18:23 0:00 /usr/sbin/sshd root 1927 0.0 0.2 10164 2096 ? Ss 18:23 0:00 /usr/libexec/postfix/master postfix 1936 0.0 0.2 10240 2124 ? S 18:23 0:00 pickup -l -t fifo -u postfix 1937 0.0 0.2 10308 2232 ? S 18:23 0:00 qmgr -l -t fifo -u root 1941 0.0 0.4 21100 4572 ? Ss 18:23 0:00 /usr/sbin/abrtd root 1950 0.0 0.1 5832 1176 ? Ss 18:23 0:00 crond root 1961 0.0 0.0 2824 440 ? Ss 18:23 0:00 /usr/sbin/atd root 1983 0.0 0.2 11164 2060 ? S<sl 18:23 0:00 modclusterd root 2005 0.0 0.5 5128 5124 ? S<Ls 18:23 0:00 rgmanager root 2007 0.0 0.2 39932 2252 ? S<l 18:23 0:00 rgmanager root 2009 0.0 0.0 0 0 ? S 18:23 0:00 [dlm_recoverd] root 2041 0.0 0.0 4148 532 ? Ss 18:23 0:00 /usr/sbin/oddjobd -p /var/run/oddjobd.pid -t 300 root 2058 0.0 0.0 6744 728 ? Ss 18:23 0:00 /usr/sbin/saslauthd -m /var/run/saslauthd -a pam root 2059 0.0 0.0 6744 384 ? S 18:23 0:00 /usr/sbin/saslauthd -m /var/run/saslauthd -a pam root 2060 0.0 0.0 6744 384 ? S 18:23 0:00 /usr/sbin/saslauthd -m /var/run/saslauthd -a pam root 2061 0.0 0.0 6744 384 ? S 18:23 0:00 /usr/sbin/saslauthd -m /var/run/saslauthd -a pam root 2062 0.0 0.0 6744 384 ? S 18:23 0:00 /usr/sbin/saslauthd -m /var/run/saslauthd -a pam ricci 2069 0.0 0.1 9424 1096 ? S<s 18:23 0:00 ricci -u ricci root 2078 0.0 0.0 1960 448 tty1 Ss+ 18:23 0:00 /sbin/mingetty /dev/tty1 root 2080 0.0 0.0 1960 452 tty2 Ss+ 18:23 0:00 /sbin/mingetty /dev/tty2 root 2082 0.0 0.0 1960 448 tty3 Ss+ 18:23 0:00 /sbin/mingetty /dev/tty3 root 2084 0.0 0.0 1960 456 tty4 Ss+ 18:23 0:00 /sbin/mingetty /dev/tty4 root 2086 0.0 0.0 1960 452 tty5 Ss+ 18:23 0:00 /sbin/mingetty /dev/tty5 root 2088 0.0 0.0 1960 452 tty6 Ss+ 18:23 0:00 /sbin/mingetty /dev/tty6 root 2863 0.0 0.3 11120 3360 ? Ss 18:26 0:00 sshd: root@pts/0 root 2877 0.0 0.1 5208 1664 pts/0 Ss 18:26 0:00 -bash root 4062 1.0 0.0 4764 972 pts/0 R+ 18:36 0:00 ps axuww |
|
locking type set correctly in both clusters: [root@test2 ~]# grep locking_type /etc/lvm/lvm.conf locking_type = 3 [root@test1 ~]# grep locking_type /etc/lvm/lvm.conf locking_type = 3 |
|
checked against : https://bugzilla.redhat.com/show_bug.cgi?id=531637 According to this the mirror should still be able to run on both nodes as long as the mirror log is on the node with lowest node id. The mirror log is on node with lowest node id. So this is not applicable. [root@test1 ~]# lvcreate -L 60G -m1 vg_shared -n mirrorlv /dev/etherd/e2.1 /dev/sda1 /dev/sda2 [root@test1 ~]# grep node /etc/cluster/cluster.conf <cman two_node="1" expected_votes="1"/> <clusternodes> <clusternode name="test1" nodeid="1"> </clusternode> <clusternode name="test2" nodeid="2"> </clusternode> </clusternodes> |
|
Able to reproduce this problem with no "High Availability" or "Resilient Storage" software running. This suggest the problem has nothing to do with cluster software. 1) stop rgmanager, gfs2, clvmd, cman and cmorrord to stop cluster subsystem. 2) unplug eth1(the link over which aoe and the cluster talks) to simulate cluster link failure. 3) run pvscan and hang [root@test2 ~]# pvscan -vv -dd Setting global/locking_type to 3 Setting global/wait_for_locks to 1 Cluster locking selected. connect() failed on local socket: No such file or directory Internal cluster locking initialisation failed. Setting global/fallback_to_local_locking to 1 locking/fallback_to_local_locking not found in config: defaulting to 1 WARNING: Falling back to local file-based locking. Volume Groups with the clustered attribute will be inaccessible. Setting global/locking_dir to /var/lock/lvm Locking /var/lock/lvm/P_global WB Wiping cache of LVM-capable devices Wiping internal VG cache Walking through all physical volumes /dev/ram0: size is 32768 sectors /dev/ram0: size is 32768 sectors /dev/ram0: No label detected /dev/loop0: size is 0 sectors /dev/sda: size is 156299375 sectors /dev/etherd/e1.2: size is 2120580 sectors <blocking> while pvscan is blocking dmesg gives the following: [root@test2 ~]# dmesg <snip> eth1: link down end_request: I/O error, dev etherd/e1.2, sector 0 Buffer I/O error on device etherd/e1.2, logical block 0 end_request: I/O error, dev etherd/e1.2, sector 4 Buffer I/O error on device etherd/e1.2, logical block 1 end_request: I/O error, dev etherd/e1.2, sector 8 Buffer I/O error on device etherd/e1.2, logical block 2 end_request: I/O error, dev etherd/e1.2, sector 12 Buffer I/O error on device etherd/e1.2, logical block 3 end_request: I/O error, dev etherd/e1.2, sector 16 Buffer I/O error on device etherd/e1.2, logical block 4 end_request: I/O error, dev etherd/e1.2, sector 20 Buffer I/O error on device etherd/e1.2, logical block 5 end_request: I/O error, dev etherd/e1.2, sector 24 Buffer I/O error on device etherd/e1.2, logical block 6 end_request: I/O error, dev etherd/e1.2, sector 28 Buffer I/O error on device etherd/e1.2, logical block 7 |
|
tail /var/log/messages: Nov 8 14:30:21 test2 lvm[3832]: connect() failed on local socket: No such file or directory Nov 8 14:30:21 test2 lvm[3832]: Internal cluster locking initialisation failed. |
|
more of the same as above in the logs after running pvscan again twice while eth1 link down. It suggests that blocking is occurring due to the device initially appearing as being available to the kernel because the device isn't immediately telling the kernel that it is unavailable. The kernel and lvm appear eventually to timeout. [root@test2 ~]# tail /var/log/messages Nov 8 14:33:33 test2 kernel: Buffer I/O error on device etherd/e1.2, logical block 3 Nov 8 14:33:33 test2 kernel: end_request: I/O error, dev etherd/e1.2, sector 16 Nov 8 14:33:33 test2 kernel: Buffer I/O error on device etherd/e1.2, logical block 4 Nov 8 14:33:33 test2 kernel: end_request: I/O error, dev etherd/e1.2, sector 20 Nov 8 14:33:33 test2 kernel: Buffer I/O error on device etherd/e1.2, logical block 5 Nov 8 14:33:33 test2 kernel: end_request: I/O error, dev etherd/e1.2, sector 24 Nov 8 14:33:33 test2 kernel: Buffer I/O error on device etherd/e1.2, logical block 6 Nov 8 14:33:33 test2 kernel: end_request: I/O error, dev etherd/e1.2, sector 28 Nov 8 14:33:33 test2 kernel: Buffer I/O error on device etherd/e1.2, logical block 7 Nov 8 14:33:33 test2 kernel: aoe: device 1.2 is not up [root@test2 ~]# tail /var/log/messages Nov 8 14:38:29 test2 kernel: end_request: I/O error, dev etherd/e1.1, sector 2 Nov 8 14:38:29 test2 kernel: end_request: I/O error, dev etherd/e1.1, sector 3 Nov 8 14:38:29 test2 kernel: end_request: I/O error, dev etherd/e1.1, sector 4 Nov 8 14:38:29 test2 kernel: end_request: I/O error, dev etherd/e1.1, sector 5 Nov 8 14:38:29 test2 kernel: end_request: I/O error, dev etherd/e1.1, sector 6 Nov 8 14:38:29 test2 kernel: end_request: I/O error, dev etherd/e1.1, sector 7 Nov 8 14:38:29 test2 lvm[3907]: /dev/etherd/e1.1: read failed after 0 of 512 at 0: Input/output error Nov 8 14:38:29 test2 lvm[3907]: /dev/etherd/e1.1: open failed: No such device Nov 8 14:38:29 test2 lvm[3907]: Couldn't find device with uuid vCR1eu-F6tq-AyjB-2HUG-ZgOd-8vul-20IQHY. Nov 8 14:38:29 test2 lvm[3907]: Couldn't find device with uuid U2MW58-qe0Q-65YK-UfJK-HXYE-jh5G-qdci02. |
|
Date Modified | Username | Field | Change |
---|---|---|---|
2011-11-05 19:08 | gpr_support_user | New Issue | |
2011-11-05 19:08 | gpr_support_user | File Added: process-list.txt | |
2011-11-05 19:12 | gpr_support_user | Tag Attached: aoe | |
2011-11-05 19:12 | gpr_support_user | Tag Attached: clvmd | |
2011-11-05 19:12 | gpr_support_user | Tag Attached: hang | |
2011-11-05 19:12 | gpr_support_user | Tag Attached: pvscan | |
2011-11-05 19:45 | gpr_support_user | Note Added: 0013724 | |
2011-11-05 20:28 | gpr_support_user | Note Added: 0013725 | |
2011-11-05 20:29 | gpr_support_user | Tag Attached: lvm2 | |
2011-11-05 20:29 | gpr_support_user | Tag Attached: lvdisplay | |
2011-11-05 20:29 | gpr_support_user | Tag Attached: vgdisplay | |
2011-11-08 14:31 | gpr_support_user | Note Added: 0013737 | |
2011-11-08 14:33 | gpr_support_user | Note Added: 0013738 | |
2011-11-08 14:48 | gpr_support_user | Note Added: 0013739 |