Discussion:
GFS2 becomes non-responsive, no fencing
Ross Vandegrift
2008-08-25 23:29:41 UTC
Permalink
Hi everyone,

Have run into a strange problem on our RH cluster installation. We
have a cluster that uses iSCSI shared storage for GFS2. It's been
running for months with no problems.

Today, the app on one node died. I logged in, assumed things were
fenced, and tried to go about my business of restarting it. After
some fiddling, I got the box back in the cluster fine.

It just happened again, and I've dug in a bit more. I was wrong - the
failed node has not been fenced. The last thing in dmesg on the
failing node is:

GFS2: fsid=: Trying to join cluster "lock_dlm", "sensors:rrd_gfs"
GFS2: fsid=sensors:rrd_gfs.1: Joined cluster. Now mounting FS...
GFS2: fsid=sensors:rrd_gfs.1: jid=1, already locked for use
GFS2: fsid=sensors:rrd_gfs.1: jid=1: Looking at journal...
GFS2: fsid=sensors:rrd_gfs.1: jid=1: Done

Any read or write to the mounted filesystem hangs as if the DLM can't
get locks. Connectivity to the storage is good: no interfaces show
dropped packets or errors. cman_tool reports the node as healthy:

[***@sensor01 ~]# cman_tool status
Version: 6.0.1
Config Version: 14
Cluster Name: sensors
Cluster Id: 14059
Cluster Member: Yes
Cluster Generation: 368
Membership state: Cluster-Member
Nodes: 2
Expected votes: 3
Total votes: 2
Quorum: 2
Active subsystems: 7
Flags:
Ports Bound: 0 11
Node name: sensor01.dc3
Node ID: 1
Multicast addresses: 239.192.54.34

The missing vote is a third node that is not yet live, but it's been
in that state for weeks now with no problems (with expected votes set
to 3, quorum works out to 2, so the two live nodes keep us quorate).

[***@sensor01 ~]# cman_tool nodes -f
Node Sts Inc Joined Name
1 M 360 2008-08-25 16:24:29 sensor01.dc3
Last fenced: 2008-08-25 16:04:25 by leaf8b-2.dc3
2 M 364 2008-08-25 16:24:29 sensor02.dc3
3 X 364 sensor03.dc3
Node has not been fenced since it went down

The fence shown above is from when I rebooted the node - because
processes were hung on GFS2 I/O, I had to hard reset the box, which
caused the other nodes to fence it.

Cluster LVM operations seem to work fine - I can query all LVM objects
without a problem. But as soon as I try a filesystem operation, boom,
it hangs.
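
Concretely, the difference looks something like this (vgs/lvs stand in
for "query all LVM objects", and the mount point is just a placeholder):

# cluster LVM queries all return promptly
vgs
lvs
# but anything that touches the GFS2 mount just sits in D state
ls /path/to/gfs2/mount      # placeholder path - never returns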

Any hints on where I can start looking?
--
Ross Vandegrift
***@kallisti.us

"The good Christian should beware of mathematicians, and all those who
make empty prophecies. The danger already exists that the mathematicians
have made a covenant with the devil to darken the spirit and to confine
man in the bonds of Hell."
--St. Augustine, De Genesi ad Litteram, Book II, xviii, 37
Ross Vandegrift
2008-08-26 17:50:04 UTC
Permalink
Post by Ross Vandegrift
Today, the app on one node died. I logged in, assumed things were
fenced, and tried to go about my business of restarting it. After
some fiddling, I got the box back in the cluster fine.
It just happened again, and I've dug in a bit more. I was wrong - the
failed node has not been fenced. The last thing in dmesg on the
Some more information gleaned today: I left the node running last
night without fixing the GFS2 access. Today we noticed that filesystem
access has been restored for new processes - it's slow (sometimes
taking minutes to return an ls of 10 items), but it eventually
responds. The application threads that are sleeping in D state still
haven't received their data from reads issued yesterday afternoon.

A cursory examination of the DLM-related keys in /sys reveals that the
working and broken nodes are configured the same. There's no major
disparity in memory use, except the obvious fact that the broken node
shows very little disk I/O.
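
For what it's worth, this is roughly how I compared them - a rough
sketch that assumes both nodes are reachable over ssh and that the DLM
configfs tree lives under /sys/kernel/config/dlm (the exact path may
differ by kernel):

# crude diff of the DLM configfs keys between the two nodes
for f in $(ssh sensor01.dc3 'find /sys/kernel/config/dlm -type f' 2>/dev/null); do
    a=$(ssh sensor01.dc3 "cat $f" 2>/dev/null)
    b=$(ssh sensor02.dc3 "cat $f" 2>/dev/null)
    [ "$a" != "$b" ] && echo "$f: sensor01='$a' sensor02='$b'"
done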

I'm pretty much at a loss - any ideas would be very welcome.
Steven Whitehouse
2008-08-27 08:30:52 UTC
Permalink
Hi,
Post by Ross Vandegrift
Post by Ross Vandegrift
Today, the app on one node died. I logged in, assumed things were
fenced, and tried to go about my business of restarting it. After
some fiddling, I got the box back in the cluster fine.
It just happened again, and I've dug in a bit more. I was wrong - the
failed node has not been fenced. The last thing in dmesg on the
Some more information gleaned today: I left the node running last
night without fixing the GFS2 access. Today we noticed that filesystem
access has been restored for new processes - it's slow (sometimes
taking minutes to return an ls of 10 items), but it eventually
responds. The application threads that are sleeping in D state still
haven't received their data from reads issued yesterday afternoon.
A cursory examination of the DLM-related keys in /sys reveals that the
working and broken nodes are configured the same. There's no major
disparity in memory use, except the obvious fact that the broken node
shows very little disk I/O.
I'm pretty much at a loss - any ideas would be very welcome.
There are a few things to check. Firstly compare /proc/slabinfo on a
slow node with that on a node running at normal speed. That will tell
you if there is a problem with memory leaking or not being reclaimed
properly.
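
Something along these lines is usually enough to spot a cache that has
grown out of proportion (the hostnames are just your node names from
the earlier output - run the awk locally on each node if that's easier):

# pull the cache name, active objects and total objects columns from
# each node, then diff; look for caches differing by orders of magnitude
ssh sensor01.dc3 "awk '{print \$1, \$2, \$3}' /proc/slabinfo" | sort > sensor01.slab
ssh sensor02.dc3 "awk '{print \$1, \$2, \$3}' /proc/slabinfo" | sort > sensor02.slab
diff -u sensor01.slab sensor02.slab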

If the node seems stuck, then try echo t >/proc/sysrq-trigger and look
at the backtraces of any processes which have called into gfs2 to see
where they are waiting. Also, a dump of the glocks on all nodes (you'll
need to have debugfs mounted) should then allow you to work out whether
something on the stuck node is waiting for something on one of the
other nodes. Sometimes it's useful to look at the DLM locks as well.
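
For reference, collecting that looks something like this (the fsid
directory name is taken from your dmesg output; the exact debugfs
layout can vary between kernel versions):

# task backtraces go to the kernel log
echo t > /proc/sysrq-trigger
dmesg > /tmp/sysrq-t.txt

# mount debugfs if it isn't already, then save the glock dump;
# the directory name matches the fsid, e.g. sensors:rrd_gfs
mount -t debugfs none /sys/kernel/debug
cat /sys/kernel/debug/gfs2/sensors:rrd_gfs/glocks > /tmp/glocks.txt

# DLM lock state, where the kernel exposes it via debugfs
ls /sys/kernel/debug/dlm/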

If you feel that you'd rather not go through all the details yourself,
then please put the info into a bugzilla entry and we'll take a look,

Steve.
Ross Vandegrift
2008-08-27 15:13:45 UTC
Permalink
Post by Steven Whitehouse
There are a few things to check. Firstly compare /proc/slabinfo on a
slow node with that on a node running at normal speed. That will tell
you if there is a problem with memory leaking or not being reclaimed
properly.
Nothing interesting at a quick glance over slabinfo. The active node
has consumed much more memory than the inactive one, which would seem
normal since the busted node hasn't done any work in hours.

The only large caches are the expected ones - icache, dcache,
buffer_head. The active node has more objects than the busted one.
Post by Steven Whitehouse
If the node seems stuck, then try echo t >/proc/sysrq-trigger and look
at the backtraces of any processes which have called into gfs2 to see
where they are waiting. Also, a dump of the glocks on all nodes (you'll
need to have debugfs mounted) should then allow you to work out whether
something on the stuck node is waiting for something on one of the
other nodes. Sometimes it's useful to look at the DLM locks as well.
Okay - processes that are hung due to I/O on the GFS2 filesystem all
have a similar call stack:

=======================
ls D C981824A 2928 11112 11000 (NOTLB)
e3065e48 00200086 f8cc4b9d c981824a 00008773 f5126a80 00000008 f5876550
c20e7550 c990b3d9 00008773 000f318f 00000001 f587665c c20049e0 00000044
f8c4b10b f5327ac0 f8c4b83e ffffffff 00000000 00000000 e3065e74 00000000
Call Trace:
[<f8cc4b9d>] put_rsb+0x27/0x36 [dlm]
[<f8c4b10b>] gdlm_ast+0x0/0x2 [lock_dlm]
[<f8c4b83e>] gdlm_bast+0x0/0x76 [lock_dlm]
[<f8d21c99>] just_schedule+0x5/0x8 [gfs2]
[<c0604d68>] __wait_on_bit+0x33/0x58
[<f8d21c94>] just_schedule+0x0/0x8 [gfs2]
[<f8d21c94>] just_schedule+0x0/0x8 [gfs2]
[<c0604def>] out_of_line_wait_on_bit+0x62/0x6a
[<c0436076>] wake_bit_function+0x0/0x3c
[<f8d21c90>] wait_on_holder+0x27/0x2b [gfs2]
[<f8d22e32>] glock_wait_internal+0xdb/0x1ec [gfs2]
[<f8d230b1>] gfs2_glock_nq+0x16e/0x18e [gfs2]
[<f8d24177>] gfs2_glock_nq_atime+0x164/0x2de [gfs2]
[<f8d2b7dd>] gfs2_readdir+0x47/0x8b [gfs2]
[<c047f754>] filldir64+0x0/0xc5
[<f8d2416f>] gfs2_glock_nq_atime+0x15c/0x2de [gfs2]
[<c047f935>] vfs_readdir+0x63/0x8d
[<c047f754>] filldir64+0x0/0xc5
[<c047f9c2>] sys_getdents64+0x63/0xa5
[<c0404eff>] syscall_call+0x7/0xb
=======================
python D 2A0CCE0D 1676 10551 10175 (NOTLB)
f468fd7c 00000082 00000096 2a0cce0d 00000828 00000001 00000009 f5023550
c20e7550 2a0ce9ae 00000828 00001ba1 00000001 f502365c c20049e0 f40af2c4
f8cc4b9d 00000000 f40af2c0 ffffffff 00000000 00000000 f468fda8 00000000
Call Trace:
[<f8cc4b9d>] put_rsb+0x27/0x36 [dlm]
[<f8d21c99>] just_schedule+0x5/0x8 [gfs2]
[<c0604d68>] __wait_on_bit+0x33/0x58
[<f8d21c94>] just_schedule+0x0/0x8 [gfs2]
[<f8d21c94>] just_schedule+0x0/0x8 [gfs2]
[<c0604def>] out_of_line_wait_on_bit+0x62/0x6a
[<c0436076>] wake_bit_function+0x0/0x3c
[<f8d21c90>] wait_on_holder+0x27/0x2b [gfs2]
[<f8d22e32>] glock_wait_internal+0xdb/0x1ec [gfs2]
[<f8d230b1>] gfs2_glock_nq+0x16e/0x18e [gfs2]
[<f8d2e911>] gfs2_permission+0x69/0xb4 [gfs2]
[<f8d2e90a>] gfs2_permission+0x62/0xb4 [gfs2]
[<f8d2e8a8>] gfs2_permission+0x0/0xb4 [gfs2]
[<c047b557>] permission+0x78/0xb5
[<c047c9c0>] __link_path_walk+0x141/0xd33
[<f8d23322>] gfs2_glock_dq+0x9e/0xb2 [gfs2]
[<c048d67a>] __mark_inode_dirty+0x13d/0x14f
[<c047d5fb>] link_path_walk+0x49/0xbd
[<c044ae04>] audit_syscall_entry+0x11c/0x14e
[<c047d9c8>] do_path_lookup+0x20e/0x25e
[<c047ded5>] sys_mkdirat+0x36/0xb6
[<c044ae04>] audit_syscall_entry+0x11c/0x14e
[<c047df64>] sys_mkdir+0xf/0x13
[<c0404eff>] syscall_call+0x7/0xb
=======================


I've got dumps of the glocks from debugfs, but I'm not really familiar
enough with GFS2 to understand what I'm reading. I tried to file a bug
in RH Bugzilla, but I'm getting 503 errors. I've posted the glock
dumps here:

http://kallisti.us/~ross/working-glocks
http://kallisti.us/~ross/broken-glocks

Can you point me in the direction of a document that explains what
the various things in the output mean?