View Issue Details

IDProjectCategoryView StatusLast Update
0005229CentOS-6lvm2public2011-11-08 14:48
Reportergpr_support_user 
PrioritynormalSeverityblockReproducibilityalways
Status newResolutionopen 
PlatformOSCentOSOS Version6.0
Product Version6.0 
Target VersionFixed in Version 
Summary0005229: clvmd hang
DescriptionThe system is configured as a two node cluster with clvmd and cmirrord. Each node is setup with two disks. One disk is for the boot and root partitions. The other disk, the one of concern, is setup as shared storage via aoe. Each node makes the two partions on this disk available to the other node via aoe. The intent is to get clvm to mirror these partions over aoe. The disks each have a 60GB data and 1GB mirror log partition.

Setting up the arrangement goes smoothly. The mirror runs. But, when failure is simulated by removing the ethernet cable, the problem:

running any clvm/lvm2 commands causes a hang on the command line.

It is worth mentioning that it is only the clvm subsystem that hangs not the entire cluster or os.
Steps To Reproduce[root@test1 aoe]# cat /etc/init.d/vbld
#!/bin/bash

modprobe aoe
echo eth1 > /dev/etherd/interfaces
vblade 1 1 eth1 /dev/sda1 &
vblade 1 2 eth1 /dev/sda2 &

[root@test2 ~]# cat /etc/init.d/vbld
#!/bin/bash

modprobe aoe
echo eth1 > /dev/etherd/interfaces
vblade 2 1 eth1 /dev/sdb1 &
vblade 2 2 eth1 /dev/sdb2 &


[root@test1 ~]# aoe-stat
      e2.1 64.436GB eth1 1024 up
      e2.2 1.085GB eth1 1024 up

[root@test2 ~]# aoe-stat
      e1.1 64.436GB eth1 1024 up
      e1.2 1.085GB eth1 1024 up


[root@test1 ~]# pvscan
  PV /dev/md2 VG main lvm2 [74.41 GiB / 0 free]
  PV /dev/etherd/e2.1 VG vg_shared lvm2 [60.01 GiB / 8.00 MiB free]
  PV /dev/sda1 VG vg_shared lvm2 [60.01 GiB / 60.01 GiB free]
  PV /dev/etherd/e2.2 VG vg_shared lvm2 [1.01 GiB / 1.01 GiB free]
  PV /dev/sda2 lvm2 [1.01 GiB]
  Total: 5 [196.44 GiB] / in use: 4 [195.43 GiB] / in no VG: 1 [1.01 GiB]

[root@test2 ~]# pvscan
  PV /dev/md2 VG main lvm2 [74.41 GiB / 32.00 MiB free]
  PV /dev/sdb1 VG vg_shared lvm2 [60.01 GiB / 8.00 MiB free]
  PV /dev/etherd/e1.1 VG vg_shared lvm2 [60.01 GiB / 60.01 GiB free]
  PV /dev/sdb2 VG vg_shared lvm2 [1.01 GiB / 1.01 GiB free]
  PV /dev/etherd/e1.2 lvm2 [1.01 GiB]
  Total: 5 [196.44 GiB] / in use: 4 [195.43 GiB] / in no VG: 1 [1.01 GiB]


[root@test1 ~]# lvdisplay
<snip>

  --- Logical volume ---
  LV Name /dev/vg_shared/mirrorlv
  VG Name vg_shared
  LV UUID BzkHu3-WdPK-wZiF-cDSk-7Bq0-AhA3-Z0DXjW
  LV Write Access read/write
  LV Status available
  # open 0
  LV Size 60.00 GiB
  Current LE 15360
  Segments 1
  Allocation inherit
  Read ahead sectors auto
  - currently set to 4096
  Block device 253:2

[root@test2 ~]# lvdisplay
<snip>
  --- Logical volume ---
  LV Name /dev/vg_shared/mirrorlv
  VG Name vg_shared
  LV UUID BzkHu3-WdPK-wZiF-cDSk-7Bq0-AhA3-Z0DXjW
  LV Write Access read/write
  LV Status available
  # open 0
  LV Size 60.00 GiB
  Current LE 15360
  Segments 1
  Allocation inherit
  Read ahead sectors auto
  - currently set to 256
  Block device 253:2


To show that this lv is cluster enabled:

[root@test1 ~]# vgs
  VG #PV #LV #SN Attr VSize VFree
  main 1 2 0 wz--n- 74.41g 0
  vg_shared 3 1 0 wz--nc 121.02g 61.02g

[root@test2 ~]# vgs
  VG #PV #LV #SN Attr VSize VFree
  main 1 2 0 wz--n- 74.41g 32.00m
  vg_shared 3 1 0 wz--nc 121.02g 61.02g

both nodes are in the cluster

[root@test1 ~]# cman_tool nodes
Node Sts Inc Joined Name
   1 M 3548 2011-11-05 18:23:16 test1
   2 M 3552 2011-11-05 18:23:16 test2

full process lists attached/uploaded to this issue as a separate txt file.

Now when I remove the ethernet cable:

dmesg show:

[root@test1 ~]# dmesg
<snip>
eth1: link down
dlm: closing connection to node 2

[root@test1 ~]# dmesg
<snip>
eth1: link down
dlm: closing connection to node 1

And as expected:

[root@test1 ~]# cman_tool nodes
Node Sts Inc Joined Name
   1 M 3548 2011-11-05 18:23:16 test1
   2 X 3552 test2



Now running pvscan and hang:

[root@test1 ~]# pvscan -vv -dd
      Setting global/locking_type to 3
      Setting global/wait_for_locks to 1
      Cluster locking selected.
      Locking VG P_#global PW (VG) (0x4)

Also:
[root@test1 ~]# dmesg
<snip>
INFO: task clvmd:4441 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
clvmd D f6235e48 0 4441 1 0x00000080
 f5869030 00000082 00000002 f6235e48 c2084b34 00000000 f5869030 f592c030
 c0408237 f6266880 0000012d c0ac0060 b7a41b51 0000012d c0ac0060 f58692d8
 c0ac0060 c0abbb34 c0ac0060 f58692d8 c2084b34 c043fd3b 000f3602 f5869030
Call Trace:
 [<c0408237>] ? __switch_to+0xd7/0x1a0
 [<c043fd3b>] ? find_busiest_group+0x1cb/0x970
 [<c080d1f5>] ? rwsem_down_failed_common+0x75/0x1a0
 [<c080d36a>] ? rwsem_down_read_failed+0x1a/0x24
 [<c080d3a7>] ? call_rwsem_down_read_failed+0x7/0xc
 [<c080ca94>] ? down_read+0x14/0x20
 [<f88b6168>] ? dlm_user_request+0x28/0x1e0 [dlm]
 [<c0510f44>] ? cache_alloc_refill+0x184/0x540
 [<c040a129>] ? reschedule_interrupt+0x31/0x38
 [<c05100d8>] ? ksm_scan_thread+0x4f8/0xa30
 [<f88c239b>] ? device_write+0x43b/0x610 [dlm]
 [<c080b38f>] ? schedule+0x42f/0xb00
 [<c0592cec>] ? security_file_permission+0xc/0x10
 [<f88c1f60>] ? device_write+0x0/0x610 [dlm]
 [<c051de50>] ? vfs_write+0xa0/0x190
 [<c04a993c>] ? audit_syscall_entry+0x21c/0x240
 [<c051e8d1>] ? sys_write+0x41/0x70
 [<c04099fb>] ? sysenter_do_call+0x12/0x28

From here it is impossible to recover lvm/clvm aside from rebooting the system.
Additional Information[root@test1 ~]# cat /etc/redhat-release
CentOS Linux release 6.0 (Final)
[root@test1 ~]# uname -a
Linux test1.localdomain 2.6.32-71.el6.i686 #1 SMP Fri Nov 12 04:17:17 GMT 2010 i686 i686 i386 GNU/Linux

[root@test1 ~]# cat /etc/redhat-release
CentOS Linux release 6.0 (Final)
[root@test1 ~]# uname -a
Linux test1.localdomain 2.6.32-71.el6.i686 #1 SMP Fri Nov 12 04:17:17 GMT 2010 i686 i686 i386 GNU/Linux

[root@test1 ~]# rpm -q cman gfs2-utils lvm2-cluster cmirror rgmanager
cman-3.0.12-23.el6.i686
gfs2-utils-3.0.12-23.el6.i686
lvm2-cluster-2.02.72-8.el6.i686
cmirror-2.02.72-8.el6.i686
rgmanager-3.0.12-10.el6.i686


[root@test2 ~]# rpm -q cman gfs2-utils lvm2-cluster cmirror rgmanager
cman-3.0.12-23.el6.i686
gfs2-utils-3.0.12-23.el6.i686
lvm2-cluster-2.02.72-8.el6.i686
cmirror-2.02.72-8.el6.i686
rgmanager-3.0.12-10.el6.i686

[root@test1 ~]# aoe-version
              aoetools: 32
  installed aoe driver: 77
    running aoe driver: 77


[root@test2 ~]# aoe-version
              aoetools: 32
  installed aoe driver: 77
    running aoe driver: 77

vblade version:

vblade-20
Tagsaoe, clvmd, hang, lvdisplay, lvm2, pvscan, vgdisplay

Activities

gpr_support_user

gpr_support_user

2011-11-05 19:08

reporter  

process-list.txt (18,850 bytes)


[root@test1 ~]# ps auxww
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.1  0.1   2828  1344 ?        Ss   18:22   0:01 /sbin/init
root         2  0.0  0.0      0     0 ?        S    18:22   0:00 [kthreadd]
root         3  0.0  0.0      0     0 ?        S    18:22   0:00 [migration/0]
root         4  0.0  0.0      0     0 ?        S    18:22   0:00 [ksoftirqd/0]
root         5  0.0  0.0      0     0 ?        S    18:22   0:00 [watchdog/0]
root         6  0.0  0.0      0     0 ?        S    18:22   0:00 [migration/1]
root         7  0.0  0.0      0     0 ?        S    18:22   0:00 [ksoftirqd/1]
root         8  0.0  0.0      0     0 ?        S    18:22   0:00 [watchdog/1]
root         9  0.0  0.0      0     0 ?        S    18:22   0:00 [events/0]
root        10  0.0  0.0      0     0 ?        S    18:22   0:00 [events/1]
root        11  0.0  0.0      0     0 ?        S    18:22   0:00 [cpuset]
root        12  0.0  0.0      0     0 ?        S    18:22   0:00 [khelper]
root        13  0.0  0.0      0     0 ?        S    18:22   0:00 [netns]
root        14  0.0  0.0      0     0 ?        S    18:22   0:00 [async/mgr]
root        15  0.0  0.0      0     0 ?        S    18:22   0:00 [pm]
root        16  0.0  0.0      0     0 ?        S    18:22   0:00 [sync_supers]
root        17  0.0  0.0      0     0 ?        S    18:22   0:00 [bdi-default]
root        18  0.0  0.0      0     0 ?        S    18:22   0:00 [kintegrityd/0]
root        19  0.0  0.0      0     0 ?        S    18:22   0:00 [kintegrityd/1]
root        20  0.0  0.0      0     0 ?        S    18:22   0:00 [kblockd/0]
root        21  0.0  0.0      0     0 ?        S    18:22   0:00 [kblockd/1]
root        22  0.0  0.0      0     0 ?        S    18:22   0:00 [kacpid]
root        23  0.0  0.0      0     0 ?        S    18:22   0:00 [kacpi_notify]
root        24  0.0  0.0      0     0 ?        S    18:22   0:00 [kacpi_hotplug]
root        25  0.0  0.0      0     0 ?        S    18:22   0:00 [ata/0]
root        26  0.0  0.0      0     0 ?        S    18:22   0:00 [ata/1]
root        27  0.0  0.0      0     0 ?        S    18:22   0:00 [ata_aux]
root        28  0.0  0.0      0     0 ?        S    18:22   0:00 [ksuspend_usbd]
root        29  0.0  0.0      0     0 ?        S    18:22   0:00 [khubd]
root        30  0.0  0.0      0     0 ?        S    18:22   0:00 [kseriod]
root        33  0.0  0.0      0     0 ?        S    18:22   0:00 [khungtaskd]
root        34  0.0  0.0      0     0 ?        S    18:22   0:00 [kswapd0]
root        35  0.0  0.0      0     0 ?        SN   18:22   0:00 [ksmd]
root        36  0.0  0.0      0     0 ?        S    18:22   0:00 [aio/0]
root        37  0.0  0.0      0     0 ?        S    18:22   0:00 [aio/1]
root        38  0.0  0.0      0     0 ?        S    18:22   0:00 [crypto/0]
root        39  0.0  0.0      0     0 ?        S    18:22   0:00 [crypto/1]
root        44  0.0  0.0      0     0 ?        S    18:22   0:00 [pciehpd]
root        46  0.0  0.0      0     0 ?        S    18:22   0:00 [kpsmoused]
root        47  0.0  0.0      0     0 ?        S    18:22   0:00 [usbhid_resumer]
root        77  0.0  0.0      0     0 ?        S    18:22   0:00 [kstriped]
root       222  0.0  0.0      0     0 ?        S    18:22   0:00 [scsi_eh_0]
root       224  0.0  0.0      0     0 ?        S    18:22   0:00 [scsi_eh_1]
root       230  0.0  0.0      0     0 ?        S    18:22   0:00 [scsi_eh_2]
root       231  0.0  0.0      0     0 ?        S    18:22   0:00 [scsi_eh_3]
root       403  0.0  0.0      0     0 ?        S    18:22   0:00 [md1_raid1]
root       410  0.0  0.0      0     0 ?        S    18:22   0:00 [md2_raid1]
root       445  0.0  0.0      0     0 ?        S    18:22   0:00 [kdmflush]
root       452  0.0  0.0      0     0 ?        S    18:22   0:00 [kdmflush]
root       468  0.0  0.0      0     0 ?        S    18:22   0:00 [jbd2/dm-0-8]
root       469  0.0  0.0      0     0 ?        S    18:22   0:00 [ext4-dio-unwrit]
root       470  0.0  0.0      0     0 ?        S    18:22   0:00 [ext4-dio-unwrit]
root       547  0.0  0.0   2680   864 ?        S<s  18:22   0:00 /sbin/udevd -d
root       731  0.0  0.0      0     0 ?        S    18:22   0:00 [flush-253:0]
root       919  0.0  0.0      0     0 ?        S    18:23   0:00 [jbd2/md1-8]
root       920  0.0  0.0      0     0 ?        S    18:23   0:00 [ext4-dio-unwrit]
root       921  0.0  0.0      0     0 ?        S    18:23   0:00 [ext4-dio-unwrit]
root      1014  0.0  0.0      0     0 ?        S    18:23   0:00 [kauditd]
root      1356  0.0  0.0  12872   800 ?        S<sl 18:23   0:00 auditd
root      1372  0.0  0.1  36528  1416 ?        Sl   18:23   0:00 /sbin/rsyslogd -c 4
root      1397  0.0  0.0      0     0 ?        S    18:23   0:00 [kondemand/0]
root      1398  0.0  0.0      0     0 ?        S    18:23   0:00 [kondemand/1]
rpc       1423  0.0  0.0   2512   776 ?        Ss   18:23   0:00 rpcbind
rpcuser   1441  0.0  0.1   2580  1052 ?        Ss   18:23   0:00 rpc.statd
root      1453  0.0  0.0   2576   444 ?        Ss   18:23   0:00 mdadm --monitor --scan -f --pid-file=/var/run/mdadm/mdadm.pid
root      1480  0.0  0.0      0     0 ?        S    18:23   0:00 [rpciod/0]
root      1481  0.0  0.0      0     0 ?        S    18:23   0:00 [rpciod/1]
root      1488  0.0  0.0   2772   500 ?        Ss   18:23   0:00 rpc.idmapd
root      1518  0.0  0.0   2676   856 ?        S<   18:23   0:00 /sbin/udevd -d
root      1519  0.0  0.0      0     0 ?        S<   18:23   0:00 [aoe_tx]
root      1520  0.0  0.0      0     0 ?        S<   18:23   0:00 [aoe_ktio]
root      1521  0.0  0.0   1960   484 ?        S    18:23   0:00 vblade 1 1 eth1 /dev/sda1
root      1522  0.0  0.0   1960   480 ?        S    18:23   0:00 vblade 1 2 eth1 /dev/sda2
root      1563  1.0  7.0  71412 68356 ?        SLsl 18:23   0:08 corosync -f
root      1620  0.0  0.1  36612  1692 ?        Ssl  18:23   0:00 fenced
root      1635  0.0  0.1  52876  1712 ?        Ssl  18:23   0:00 dlm_controld
root      1663  0.0  0.1  33408  1664 ?        Ssl  18:23   0:00 gfs_controld
root      1767  0.0  0.0   7164   752 ?        Ss   18:23   0:00 cmirrord
dbus      1777  0.0  0.0   2932   860 ?        Ss   18:23   0:00 dbus-daemon --system
root      1789  0.0  3.7  38496 36640 ?        SLsl 18:23   0:00 clvmd -T30
root      1791  0.0  0.0      0     0 ?        S    18:23   0:00 [dlm_astd]
root      1792  0.0  0.0      0     0 ?        S    18:23   0:00 [dlm_scand]
root      1793  0.0  0.0      0     0 ?        S    18:23   0:00 [dlm_recv/0]
root      1794  0.0  0.0      0     0 ?        S    18:23   0:00 [dlm_recv/1]
root      1795  0.0  0.0      0     0 ?        S    18:23   0:00 [dlm_send]
root      1796  0.0  0.0      0     0 ?        S    18:23   0:00 [dlm_recoverd]
root      1809  0.0  0.0      0     0 ?        S    18:23   0:00 [kdmflush]
root      1836  0.0  0.0   1972   596 ?        Ss   18:23   0:00 /usr/sbin/acpid
68        1845  0.0  0.3   6168  3660 ?        Ss   18:23   0:00 hald
root      1846  0.0  0.1   3736  1164 ?        S    18:23   0:00 hald-runner
root      1874  0.0  0.1   3812   996 ?        S    18:23   0:00 hald-addon-input: Listening on /dev/input/event4 /dev/input/event1 /dev/input/event0 /dev/input/event2
68        1894  0.0  0.1   3456   980 ?        S    18:23   0:00 hald-addon-acpi: listening on acpid socket /var/run/acpid.socket
root      1908  0.0  0.1  29924  1552 ?        Ssl  18:23   0:00 automount --pid-file /var/run/autofs.pid
root      1926  0.0  0.1   8236  1124 ?        Ss   18:23   0:00 /usr/sbin/sshd
root      2002  0.0  0.2  10164  2096 ?        Ss   18:23   0:00 /usr/libexec/postfix/master
postfix   2010  0.0  0.2  10240  2124 ?        S    18:23   0:00 pickup -l -t fifo -u
postfix   2012  0.0  0.2  10308  2232 ?        S    18:23   0:00 qmgr -l -t fifo -u
root      2017  0.0  0.4  21096  4568 ?        Ss   18:23   0:00 /usr/sbin/abrtd
root      2025  0.0  0.1   5840  1180 ?        Ss   18:23   0:00 crond
root      2036  0.0  0.0   2824   328 ?        Ss   18:23   0:00 /usr/sbin/atd
root      2058  0.0  0.1  11160  1696 ?        S<sl 18:23   0:00 modclusterd
root      2070  0.0  0.0   1960   484 tty1     Ss+  18:23   0:00 /sbin/mingetty /dev/tty1
root      2072  0.0  0.0   1960   452 tty2     Ss+  18:23   0:00 /sbin/mingetty /dev/tty2
root      2075  0.0  0.0   1960   456 tty3     Ss+  18:23   0:00 /sbin/mingetty /dev/tty3
root      2078  0.0  0.0   2676   840 ?        S<   18:23   0:00 /sbin/udevd -d
root      2080  0.0  0.0   1960   452 tty4     Ss+  18:23   0:00 /sbin/mingetty /dev/tty4
root      2082  0.0  0.0   1960   456 tty5     Ss+  18:23   0:00 /sbin/mingetty /dev/tty5
root      2087  0.0  0.0   1960   452 tty6     Ss+  18:23   0:00 /sbin/mingetty /dev/tty6
root      2355  0.0  0.3  12012  3556 ?        Ss   18:26   0:00 sshd: root@pts/0
root      2366  0.0  0.1   5192  1648 pts/0    Ss   18:26   0:00 -bash
root      3329  1.0  0.1   4764   976 pts/0    R+   18:35   0:00 ps auxww

[root@test2 ~]# ps axuww
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.1  0.1   2828  1344 ?        Ss   18:22   0:01 /sbin/init
root         2  0.0  0.0      0     0 ?        S    18:22   0:00 [kthreadd]
root         3  0.0  0.0      0     0 ?        S    18:22   0:00 [migration/0]
root         4  0.0  0.0      0     0 ?        S    18:22   0:00 [ksoftirqd/0]
root         5  0.0  0.0      0     0 ?        S    18:22   0:00 [watchdog/0]
root         6  0.0  0.0      0     0 ?        S    18:22   0:00 [migration/1]
root         7  0.0  0.0      0     0 ?        S    18:22   0:00 [ksoftirqd/1]
root         8  0.0  0.0      0     0 ?        S    18:22   0:00 [watchdog/1]
root         9  0.0  0.0      0     0 ?        S    18:22   0:00 [events/0]
root        10  0.0  0.0      0     0 ?        S    18:22   0:00 [events/1]
root        11  0.0  0.0      0     0 ?        S    18:22   0:00 [cpuset]
root        12  0.0  0.0      0     0 ?        S    18:22   0:00 [khelper]
root        13  0.0  0.0      0     0 ?        S    18:22   0:00 [netns]
root        14  0.0  0.0      0     0 ?        S    18:22   0:00 [async/mgr]
root        15  0.0  0.0      0     0 ?        S    18:22   0:00 [pm]
root        16  0.0  0.0      0     0 ?        S    18:22   0:00 [sync_supers]
root        17  0.0  0.0      0     0 ?        S    18:22   0:00 [bdi-default]
root        18  0.0  0.0      0     0 ?        S    18:22   0:00 [kintegrityd/0]
root        19  0.0  0.0      0     0 ?        S    18:22   0:00 [kintegrityd/1]
root        20  0.0  0.0      0     0 ?        S    18:22   0:00 [kblockd/0]
root        21  0.0  0.0      0     0 ?        S    18:22   0:00 [kblockd/1]
root        22  0.0  0.0      0     0 ?        S    18:22   0:00 [kacpid]
root        23  0.0  0.0      0     0 ?        S    18:22   0:00 [kacpi_notify]
root        24  0.0  0.0      0     0 ?        S    18:22   0:00 [kacpi_hotplug]
root        25  0.0  0.0      0     0 ?        S    18:22   0:00 [ata/0]
root        26  0.0  0.0      0     0 ?        S    18:22   0:00 [ata/1]
root        27  0.0  0.0      0     0 ?        S    18:22   0:00 [ata_aux]
root        28  0.0  0.0      0     0 ?        S    18:22   0:00 [ksuspend_usbd]
root        29  0.0  0.0      0     0 ?        S    18:22   0:00 [khubd]
root        30  0.0  0.0      0     0 ?        S    18:22   0:00 [kseriod]
root        33  0.0  0.0      0     0 ?        S    18:22   0:00 [khungtaskd]
root        34  0.0  0.0      0     0 ?        S    18:22   0:00 [kswapd0]
root        35  0.0  0.0      0     0 ?        SN   18:22   0:00 [ksmd]
root        36  0.0  0.0      0     0 ?        S    18:22   0:00 [aio/0]
root        37  0.0  0.0      0     0 ?        S    18:22   0:00 [aio/1]
root        38  0.0  0.0      0     0 ?        S    18:22   0:00 [crypto/0]
root        39  0.0  0.0      0     0 ?        S    18:22   0:00 [crypto/1]
root        45  0.0  0.0      0     0 ?        S    18:22   0:00 [kpsmoused]
root        46  0.0  0.0      0     0 ?        S    18:22   0:00 [usbhid_resumer]
root        76  0.0  0.0      0     0 ?        S    18:22   0:00 [kstriped]
root       133  0.0  0.0      0     0 ?        S    18:22   0:00 [i915]
root       134  0.0  0.0      0     0 ?        S<   18:22   0:00 [kslowd000]
root       135  0.0  0.0      0     0 ?        S<   18:22   0:00 [kslowd001]
root       239  0.0  0.0      0     0 ?        S    18:22   0:00 [scsi_eh_0]
root       240  0.0  0.0      0     0 ?        S    18:22   0:00 [scsi_eh_1]
root       367  0.0  0.0      0     0 ?        S    18:22   0:00 [md1_raid1]
root       374  0.0  0.0      0     0 ?        S    18:22   0:00 [md2_raid1]
root       409  0.0  0.0      0     0 ?        S    18:22   0:00 [kdmflush]
root       416  0.0  0.0      0     0 ?        S    18:22   0:00 [kdmflush]
root       430  0.0  0.0      0     0 ?        S    18:22   0:00 [kjournald]
root       441  0.0  0.0      0     0 ?        S    18:22   0:00 [flush-253:0]
root       510  0.0  0.0   2680   868 ?        S<s  18:22   0:00 /sbin/udevd -d
root       678  0.0  0.0      0     0 ?        S    18:22   0:00 [hd-audio0]
root       865  0.0  0.0      0     0 ?        S    18:22   0:00 [kjournald]
root       960  0.0  0.0      0     0 ?        S    18:22   0:00 [kauditd]
root      1290  0.0  0.0  13764   804 ?        S<sl 18:22   0:00 auditd
root      1306  0.0  0.1  36528  1412 ?        Sl   18:22   0:00 /sbin/rsyslogd -c 4
root      1331  0.0  0.0      0     0 ?        S    18:22   0:00 [kondemand/0]
root      1332  0.0  0.0      0     0 ?        S    18:22   0:00 [kondemand/1]
rpc       1357  0.0  0.0   2512   776 ?        Ss   18:22   0:00 rpcbind
rpcuser   1375  0.0  0.1   2580  1060 ?        Ss   18:22   0:00 rpc.statd
root      1387  0.0  0.0   2576   444 ?        Ss   18:22   0:00 mdadm --monitor --scan -f --pid-file=/var/run/mdadm/mdadm.pid
root      1413  0.0  0.0      0     0 ?        S    18:22   0:00 [rpciod/0]
root      1414  0.0  0.0      0     0 ?        S    18:22   0:00 [rpciod/1]
root      1421  0.0  0.0   2744   484 ?        Ss   18:22   0:00 rpc.idmapd
root      1450  0.0  0.0   2676   892 ?        S<   18:22   0:00 /sbin/udevd -d
root      1451  0.0  0.0      0     0 ?        S<   18:22   0:00 [aoe_tx]
root      1452  0.0  0.0      0     0 ?        S<   18:22   0:00 [aoe_ktio]
root      1453  0.0  0.0   1960   480 ?        S    18:22   0:00 vblade 2 1 eth1 /dev/sdb1
root      1454  0.0  0.0   1960   484 ?        S    18:22   0:00 vblade 2 2 eth1 /dev/sdb2
root      1493  1.0  7.1  75716 72660 ?        SLsl 18:22   0:08 corosync -f
root      1548  0.0  0.1  36612  1596 ?        Ssl  18:22   0:00 fenced
root      1562  0.0  0.1  56980  1736 ?        Ssl  18:22   0:00 dlm_controld
root      1586  0.0  0.1  33408  1572 ?        Ssl  18:22   0:00 gfs_controld
root      1676  0.0  0.0   2676   860 ?        S<   18:23   0:00 /sbin/udevd -d
root      1702  0.0  0.0   7164   752 ?        Ss   18:23   0:00 cmirrord
dbus      1712  0.0  0.0   2932   992 ?        Ss   18:23   0:00 dbus-daemon --system
root      1724  0.0  3.5  38496 36636 ?        SLsl 18:23   0:00 clvmd -T30
root      1726  0.0  0.0      0     0 ?        S    18:23   0:00 [dlm_astd]
root      1727  0.0  0.0      0     0 ?        S    18:23   0:00 [dlm_scand]
root      1728  0.0  0.0      0     0 ?        S    18:23   0:00 [dlm_recv/0]
root      1729  0.0  0.0      0     0 ?        S    18:23   0:00 [dlm_recv/1]
root      1730  0.0  0.0      0     0 ?        S    18:23   0:00 [dlm_send]
root      1731  0.0  0.0      0     0 ?        S    18:23   0:00 [dlm_recoverd]
root      1743  0.0  0.0      0     0 ?        S    18:23   0:00 [kdmflush]
68        1770  0.0  0.3   6052  3536 ?        Ss   18:23   0:00 hald
root      1771  0.0  0.1   3736  1052 ?        S    18:23   0:00 hald-runner
root      1799  0.0  0.0   3812   988 ?        S    18:23   0:00 hald-addon-input: Listening on /dev/input/event0 /dev/input/event1
68        1814  0.0  0.0   3452   992 ?        S    18:23   0:00 /usr/libexec/hald-addon-acpi
root      1833  0.0  0.1  29924  1552 ?        Ssl  18:23   0:00 automount --pid-file /var/run/autofs.pid
root      1851  0.0  0.1   8236  1120 ?        Ss   18:23   0:00 /usr/sbin/sshd
root      1927  0.0  0.2  10164  2096 ?        Ss   18:23   0:00 /usr/libexec/postfix/master
postfix   1936  0.0  0.2  10240  2124 ?        S    18:23   0:00 pickup -l -t fifo -u
postfix   1937  0.0  0.2  10308  2232 ?        S    18:23   0:00 qmgr -l -t fifo -u
root      1941  0.0  0.4  21100  4572 ?        Ss   18:23   0:00 /usr/sbin/abrtd
root      1950  0.0  0.1   5832  1176 ?        Ss   18:23   0:00 crond
root      1961  0.0  0.0   2824   440 ?        Ss   18:23   0:00 /usr/sbin/atd
root      1983  0.0  0.2  11164  2060 ?        S<sl 18:23   0:00 modclusterd
root      2005  0.0  0.5   5128  5124 ?        S<Ls 18:23   0:00 rgmanager
root      2007  0.0  0.2  39932  2252 ?        S<l  18:23   0:00 rgmanager
root      2009  0.0  0.0      0     0 ?        S    18:23   0:00 [dlm_recoverd]
root      2041  0.0  0.0   4148   532 ?        Ss   18:23   0:00 /usr/sbin/oddjobd -p /var/run/oddjobd.pid -t 300
root      2058  0.0  0.0   6744   728 ?        Ss   18:23   0:00 /usr/sbin/saslauthd -m /var/run/saslauthd -a pam
root      2059  0.0  0.0   6744   384 ?        S    18:23   0:00 /usr/sbin/saslauthd -m /var/run/saslauthd -a pam
root      2060  0.0  0.0   6744   384 ?        S    18:23   0:00 /usr/sbin/saslauthd -m /var/run/saslauthd -a pam
root      2061  0.0  0.0   6744   384 ?        S    18:23   0:00 /usr/sbin/saslauthd -m /var/run/saslauthd -a pam
root      2062  0.0  0.0   6744   384 ?        S    18:23   0:00 /usr/sbin/saslauthd -m /var/run/saslauthd -a pam
ricci     2069  0.0  0.1   9424  1096 ?        S<s  18:23   0:00 ricci -u ricci
root      2078  0.0  0.0   1960   448 tty1     Ss+  18:23   0:00 /sbin/mingetty /dev/tty1
root      2080  0.0  0.0   1960   452 tty2     Ss+  18:23   0:00 /sbin/mingetty /dev/tty2
root      2082  0.0  0.0   1960   448 tty3     Ss+  18:23   0:00 /sbin/mingetty /dev/tty3
root      2084  0.0  0.0   1960   456 tty4     Ss+  18:23   0:00 /sbin/mingetty /dev/tty4
root      2086  0.0  0.0   1960   452 tty5     Ss+  18:23   0:00 /sbin/mingetty /dev/tty5
root      2088  0.0  0.0   1960   452 tty6     Ss+  18:23   0:00 /sbin/mingetty /dev/tty6
root      2863  0.0  0.3  11120  3360 ?        Ss   18:26   0:00 sshd: root@pts/0
root      2877  0.0  0.1   5208  1664 pts/0    Ss   18:26   0:00 -bash
root      4062  1.0  0.0   4764   972 pts/0    R+   18:36   0:00 ps axuww

process-list.txt (18,850 bytes)
gpr_support_user

gpr_support_user

2011-11-05 19:45

reporter   ~0013724

locking type set correctly in both clusters:

[root@test2 ~]# grep locking_type /etc/lvm/lvm.conf
    locking_type = 3

[root@test1 ~]# grep locking_type /etc/lvm/lvm.conf
    locking_type = 3
gpr_support_user

gpr_support_user

2011-11-05 20:28

reporter   ~0013725

checked against :
https://bugzilla.redhat.com/show_bug.cgi?id=531637

According to this the mirror should still be able to run on both nodes as long as the mirror log is on the node with lowest node id. The mirror log is on node with lowest node id. So this is not applicable.

[root@test1 ~]# lvcreate -L 60G -m1 vg_shared -n mirrorlv /dev/etherd/e2.1 /dev/sda1 /dev/sda2

[root@test1 ~]# grep node /etc/cluster/cluster.conf
   <cman two_node="1" expected_votes="1"/>
   <clusternodes>
     <clusternode name="test1" nodeid="1">
     </clusternode>
     <clusternode name="test2" nodeid="2">
     </clusternode>
   </clusternodes>
gpr_support_user

gpr_support_user

2011-11-08 14:31

reporter   ~0013737

Able to reproduce this problem with no "High Availability" or "Resilient Storage" software running. This suggest the problem has nothing to do with cluster software.

1) stop rgmanager, gfs2, clvmd, cman and cmorrord to stop cluster subsystem.
2) unplug eth1(the link over which aoe and the cluster talks) to simulate cluster link failure.
3) run pvscan and hang

[root@test2 ~]# pvscan -vv -dd
      Setting global/locking_type to 3
      Setting global/wait_for_locks to 1
      Cluster locking selected.
  connect() failed on local socket: No such file or directory
  Internal cluster locking initialisation failed.
      Setting global/fallback_to_local_locking to 1
      locking/fallback_to_local_locking not found in config: defaulting to 1
  WARNING: Falling back to local file-based locking.
  Volume Groups with the clustered attribute will be inaccessible.
      Setting global/locking_dir to /var/lock/lvm
      Locking /var/lock/lvm/P_global WB
    Wiping cache of LVM-capable devices
    Wiping internal VG cache
    Walking through all physical volumes
      /dev/ram0: size is 32768 sectors
      /dev/ram0: size is 32768 sectors
      /dev/ram0: No label detected
      /dev/loop0: size is 0 sectors
      /dev/sda: size is 156299375 sectors
      /dev/etherd/e1.2: size is 2120580 sectors
<blocking>

while pvscan is blocking dmesg gives the following:

[root@test2 ~]# dmesg
<snip>
eth1: link down
end_request: I/O error, dev etherd/e1.2, sector 0
Buffer I/O error on device etherd/e1.2, logical block 0
end_request: I/O error, dev etherd/e1.2, sector 4
Buffer I/O error on device etherd/e1.2, logical block 1
end_request: I/O error, dev etherd/e1.2, sector 8
Buffer I/O error on device etherd/e1.2, logical block 2
end_request: I/O error, dev etherd/e1.2, sector 12
Buffer I/O error on device etherd/e1.2, logical block 3
end_request: I/O error, dev etherd/e1.2, sector 16
Buffer I/O error on device etherd/e1.2, logical block 4
end_request: I/O error, dev etherd/e1.2, sector 20
Buffer I/O error on device etherd/e1.2, logical block 5
end_request: I/O error, dev etherd/e1.2, sector 24
Buffer I/O error on device etherd/e1.2, logical block 6
end_request: I/O error, dev etherd/e1.2, sector 28
Buffer I/O error on device etherd/e1.2, logical block 7
gpr_support_user

gpr_support_user

2011-11-08 14:33

reporter   ~0013738

tail /var/log/messages:

Nov 8 14:30:21 test2 lvm[3832]: connect() failed on local socket: No such file or directory
Nov 8 14:30:21 test2 lvm[3832]: Internal cluster locking initialisation failed.
gpr_support_user

gpr_support_user

2011-11-08 14:48

reporter   ~0013739

more of the same as above in the logs after running pvscan again twice while eth1 link down. It suggests that blocking is occurring due to the device initially appearing as being available to the kernel because the device isn't immediately telling the kernel that it is unavailable. The kernel and lvm appear eventually to timeout.

[root@test2 ~]# tail /var/log/messages
Nov 8 14:33:33 test2 kernel: Buffer I/O error on device etherd/e1.2, logical block 3
Nov 8 14:33:33 test2 kernel: end_request: I/O error, dev etherd/e1.2, sector 16
Nov 8 14:33:33 test2 kernel: Buffer I/O error on device etherd/e1.2, logical block 4
Nov 8 14:33:33 test2 kernel: end_request: I/O error, dev etherd/e1.2, sector 20
Nov 8 14:33:33 test2 kernel: Buffer I/O error on device etherd/e1.2, logical block 5
Nov 8 14:33:33 test2 kernel: end_request: I/O error, dev etherd/e1.2, sector 24
Nov 8 14:33:33 test2 kernel: Buffer I/O error on device etherd/e1.2, logical block 6
Nov 8 14:33:33 test2 kernel: end_request: I/O error, dev etherd/e1.2, sector 28
Nov 8 14:33:33 test2 kernel: Buffer I/O error on device etherd/e1.2, logical block 7
Nov 8 14:33:33 test2 kernel: aoe: device 1.2 is not up


[root@test2 ~]# tail /var/log/messages
Nov 8 14:38:29 test2 kernel: end_request: I/O error, dev etherd/e1.1, sector 2
Nov 8 14:38:29 test2 kernel: end_request: I/O error, dev etherd/e1.1, sector 3
Nov 8 14:38:29 test2 kernel: end_request: I/O error, dev etherd/e1.1, sector 4
Nov 8 14:38:29 test2 kernel: end_request: I/O error, dev etherd/e1.1, sector 5
Nov 8 14:38:29 test2 kernel: end_request: I/O error, dev etherd/e1.1, sector 6
Nov 8 14:38:29 test2 kernel: end_request: I/O error, dev etherd/e1.1, sector 7
Nov 8 14:38:29 test2 lvm[3907]: /dev/etherd/e1.1: read failed after 0 of 512 at 0: Input/output error
Nov 8 14:38:29 test2 lvm[3907]: /dev/etherd/e1.1: open failed: No such device
Nov 8 14:38:29 test2 lvm[3907]: Couldn't find device with uuid vCR1eu-F6tq-AyjB-2HUG-ZgOd-8vul-20IQHY.
Nov 8 14:38:29 test2 lvm[3907]: Couldn't find device with uuid U2MW58-qe0Q-65YK-UfJK-HXYE-jh5G-qdci02.

Issue History

Date Modified Username Field Change
2011-11-05 19:08 gpr_support_user New Issue
2011-11-05 19:08 gpr_support_user File Added: process-list.txt
2011-11-05 19:12 gpr_support_user Tag Attached: aoe
2011-11-05 19:12 gpr_support_user Tag Attached: clvmd
2011-11-05 19:12 gpr_support_user Tag Attached: hang
2011-11-05 19:12 gpr_support_user Tag Attached: pvscan
2011-11-05 19:45 gpr_support_user Note Added: 0013724
2011-11-05 20:28 gpr_support_user Note Added: 0013725
2011-11-05 20:29 gpr_support_user Tag Attached: lvm2
2011-11-05 20:29 gpr_support_user Tag Attached: lvdisplay
2011-11-05 20:29 gpr_support_user Tag Attached: vgdisplay
2011-11-08 14:31 gpr_support_user Note Added: 0013737
2011-11-08 14:33 gpr_support_user Note Added: 0013738
2011-11-08 14:48 gpr_support_user Note Added: 0013739