Ross Vandegrift
2008-08-25 23:29:41 UTC
Hi everyone,
I've run into a strange problem on our RH cluster installation. We
have a cluster that uses iSCSI shared storage for GFS2. It's been
running for months with no problems.
Today, the app on one node died. I logged in, assumed things were
fenced, and tried to go about my business of restarting it. After
some fiddling, I got the box back in the cluster fine.
It just happened again, and I've dug in a bit more. I was wrong - the
failed node has not been fenced. The last thing in dmesg on the
failing node is:
GFS2: fsid=: Trying to join cluster "lock_dlm", "sensors:rrd_gfs"
GFS2: fsid=sensors:rrd_gfs.1: Joined cluster. Now mounting FS...
GFS2: fsid=sensors:rrd_gfs.1: jid=1, already locked for use
GFS2: fsid=sensors:rrd_gfs.1: jid=1: Looking at journal...
GFS2: fsid=sensors:rrd_gfs.1: jid=1: Done
Any reads or writes to the mounted filesystem hang, as if the DLM can't
get locks. Connectivity to the storage is good: no interfaces show
dropped packets or errors. cman_tool reports the node as healthy:
[***@sensor01 ~]# cman_tool status
Version: 6.0.1
Config Version: 14
Cluster Name: sensors
Cluster Id: 14059
Cluster Member: Yes
Cluster Generation: 368
Membership state: Cluster-Member
Nodes: 2
Expected votes: 3
Total votes: 2
Quorum: 2
Active subsystems: 7
Flags:
Ports Bound: 0 11
Node name: sensor01.dc3
Node ID: 1
Multicast addresses: 239.192.54.34
The missing vote is a third node that is not yet live, but it's been
in that state for weeks now with no problems.
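To spell out the vote arithmetic in the status output above, here is a minimal sketch. It assumes quorum is a simple majority of the expected votes (expected // 2 + 1), which matches the "Quorum: 2" reported for "Expected votes: 3". The helper names parse_cman_status and is_quorate are made up for illustration and are not part of any cluster tool.

```python
def parse_cman_status(text):
    """Parse 'Key: value' lines from cman_tool status output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_quorate(fields):
    """True when the votes currently present meet the quorum threshold."""
    return int(fields["Total votes"]) >= int(fields["Quorum"])


# Values copied from the cman_tool status output in this post.
STATUS = """\
Nodes: 2
Expected votes: 3
Total votes: 2
Quorum: 2
"""

fields = parse_cman_status(STATUS)
expected = int(fields["Expected votes"])
# Assumption: quorum is a simple majority of the expected votes.
majority = expected // 2 + 1
print(majority, is_quorate(fields))
```

With these numbers the two live nodes supply exactly the two votes needed, so the missing third node should not by itself cost the cluster quorum.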
[***@sensor01 ~]# cman_tool nodes -f
Node Sts Inc Joined Name
1 M 360 2008-08-25 16:24:29 sensor01.dc3
Last fenced: 2008-08-25 16:04:25 by leaf8b-2.dc3
2 M 364 2008-08-25 16:24:29 sensor02.dc3
3 X 364 sensor03.dc3
Node has not been fenced since it went down
The fencing above is when I rebooted the node - because processes were
hung on GFS I/O, I had to hard reset the box, which caused the other
nodes to fence it.
Cluster LVM operations seem to work fine - I can query all LVM objects
without a problem. But as soon as I try a filesystem operation, boom,
I hang.
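One way to see which processes are stuck is to list everything in uninterruptible sleep. The sketch below assumes the hung GFS I/O shows up as state "D" in /proc/<pid>/stat, as blocked I/O normally does on Linux. The function names are hypothetical, and the output only says that a process is blocked, not why.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    pid_part, _, rest = line.partition(" (")
    # comm may itself contain spaces or parentheses, so split on the last ')'.
    comm_end = rest.rfind(")")
    comm = rest[:comm_end]
    state = rest[comm_end + 1:].split()[0]
    return int(pid_part), comm, state


def blocked_processes(proc_root="/proc"):
    """List (pid, comm) for every process in uninterruptible sleep (D)."""
    if not os.path.isdir(proc_root):
        return []
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            path = os.path.join(proc_root, entry, "stat")
            with open(path, errors="replace") as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except (OSError, ValueError, IndexError):
            # The process exited while we were reading, or the line was odd.
            continue
        if state == "D":
            blocked.append((pid, comm))
    return blocked


if __name__ == "__main__":
    for pid, comm in blocked_processes():
        print(pid, comm)
```

Run on the failing node while a filesystem operation is hung, this would be expected to show that operation among the blocked processes.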
Any hints on where I can start looking?
--
Ross Vandegrift
***@kallisti.us
"The good Christian should beware of mathematicians, and all those who
make empty prophecies. The danger already exists that the mathematicians
have made a covenant with the devil to darken the spirit and to confine
man in the bonds of Hell."
--St. Augustine, De Genesi ad Litteram, Book II, xviii, 37