Discussion:
dlm and IO speed problem <er, might wanna get a coffee first ; )>
christopher barry
2008-04-08 02:36:46 UTC
Hi everyone,

I have a couple of questions about tuning the dlm and gfs that
hopefully someone can help me with.

my setup:
6 rh4.5 nodes, gfs1 v6.1, behind redundant LVS directors. I know it's
not new stuff, but corporate standards dictated the rev of rhat.

The cluster is a developer build cluster, where developers log in, are
balanced across nodes, and edit and compile code. They can access it via
vnc, XDMCP, ssh and telnet, and nodes external to the cluster can mount
the gfs home via nfs, balanced through the director. Their homes are on
the gfs, and accessible on all nodes.

I'm noticing huge differences in compile times - or any home file access
really - when doing stuff in the same home directory on the gfs on
different nodes. For instance, the same compile on one node is ~12
minutes - on another it's 18 minutes or more (not running concurrently).
I'm also seeing weird random pauses in writes: saving a file in vi,
which would normally take less than a second, may take up to 10 seconds.

* From reading, I see that the first node to access a directory will be
the lock master for that directory. How long is that node the master? If
the user is no longer 'on' that node, is it still the master? If
continued accesses are remote, will the master state migrate to the node
that is primarily accessing it? I've set LVS persistence for ssh and
telnet for 5 minutes, to allow multiple xterms fired up in a script to
land on the same node, but new ones later will land on a different node
- by design really. Do I need to make this persistence way longer to
keep people only on the first node they hit? That kind of horks my load
balancing design if so. How can I see which node is master for which
directories? Is there a table I can read somehow?
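
For reference, that persistence is set with ipvsadm; a rough sketch, with a
made-up VIP and real-server addresses:

  ipvsadm -A -t 192.168.0.10:22 -s rr -p 300   # 5-minute persistence for ssh
  ipvsadm -a -t 192.168.0.10:22 -r 10.0.0.1:22 -g
  ipvsadm -a -t 192.168.0.10:22 -r 10.0.0.2:22 -g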

* I've bumped the wake times for gfs_scand and gfs_inoded to 30 secs, I
mount with noatime,noquota,nodiratime, and David Teigland recommended I set
dlm_dropcount to '0' today on irc, which I did. I see an improvement in
speed on the node that appears to be master for, say, 'find' command runs
- on the second and subsequent runs of the command if I restart them
immediately - but on the other nodes the speed is awful, worse than nfs
would be. On the first run of a find, or if I wait >10 seconds to start
another run after the last run completes, the time to run is unbelievably
slower than the same command on a standalone box with ext3, e.g. <9 secs
on the standalone compared to 46 secs on the cluster - on a different node
it can take over 2 minutes! Yet an immediate re-run on the cluster, on what
I think must be the master, is sub-second. How can I speed up the first
access time, and how can I keep the speed similar to that of immediate
subsequent runs? I've got a ton of memory - I just do not know which knobs
to turn.
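
For the record, here is roughly what that tuning amounts to as commands (a
sketch only - the device and mount point are made up, and the lock_dlm
drop_count path is from memory, so treat it as an assumption):

  mount -t gfs -o noatime,nodiratime,noquota /dev/vg_cluster/lv_home /mnt/gfs
  gfs_tool settune /mnt/gfs scand_secs 30      # gfs_scand wake time
  gfs_tool settune /mnt/gfs inoded_secs 30     # gfs_inoded wake time
  echo 0 > /proc/cluster/lock_dlm/drop_count   # per the irc suggestion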

Am I expecting too much from gfs? Did I oversell it when I literally
fought to use it rather than nfs off the NetApp filer, insisting that
the performance of gfs smoked nfs? Or, more likely, do I just not
understand how to optimize it fully for my application?


Regards and Thanks,
-C
Bevan Broun
2008-04-08 04:31:31 UTC
Hi All

I have a strange set of requirements:

A two-node cluster:
services running on the cluster nodes are not shared (i.e. not clustered).
The cluster is only there for two GFS file systems on a SAN.
The same storage system hosts non-GFS LUNs for individual use by the
cluster members.
The nodes run two applications; the critical app does NOT use the GFS. The
non-critical app uses the GFS.
The critical application uses storage from the SAN for ext3 file systems.

The requirement is that a failure of the cluster should not interrupt the
critical application.
This means the failed node cannot be power-cycled. Also, the failed node must
continue to have access to its non-GFS LUNs on the storage.

The storage consists of two HP EVAs. Each EVA has two controllers. There are
two Brocade FC switches.

Fencing is required for GFS.

The only solution I can think of is:
GFS LUNs presented down one HBA only, while ext3 LUNs are presented down
both.
Use SAN fencing to block the fenced host's access to the GFS LUNs by blocking
access to the controller that is handling those LUNs.

Repairing the cluster will be a manual operation that may involve a reboot.

Does this look workable?

Thanks
Wendy Cheng
2008-04-08 09:13:52 UTC
On Mon, Apr 7, 2008 at 9:36 PM, christopher barry <
Post by christopher barry
Hi everyone,
I have a couple of questions about the tuning the dlm and gfs that
hopefully someone can help me with.
There is a lot to say about this configuration - it is not a simple tuning
issue.
Post by christopher barry
6 rh4.5 nodes, gfs1 v6.1, behind redundant LVS directors. I know it's
not new stuff, but corporate standards dictated the rev of rhat.
Putting a load balancer in front of a cluster filesystem is tricky to get
right (to say the least). This is particularly true for GFS and LVS,
mostly because LVS is a general-purpose load balancer that is difficult to
tune to work with the existing GFS locking overhead.


Post by christopher barry
The cluster is a developer build cluster, where developers login, and
are balanced across nodes and edit and compile code. They can access via
vnc, XDMCP, ssh and telnet, and nodes external to the cluster can mount
the gfs home via nfs, balanced through the director. Their homes are on
the gfs, and accessible on all nodes.
Direct login into the GFS nodes (via vnc, ssh, telnet, etc.) is ok, but nfs
client access in this setup will have locking issues. It is *not* only a
performance issue. It is *also* a functional issue - that is, before the
2.6.19 Linux kernel, NLM locking (used by NFS clients) doesn't get propagated
into clustered NFS servers. You'll have file corruption if different NFS
clients do file locking and expect the locks to be honored across different
clustered NFS servers. In general, people need to think *very* carefully
before putting a load balancer in front of a group of Linux NFS servers using
any before-2.6.19 kernel. It is not going to work if there are multiple
clients that invoke either POSIX locks and/or flocks on files that are
expected to be accessed across different Linux NFS servers on top of *any*
cluster filesystem (not only GFS).
Post by christopher barry
I'm noticing huge differences in compile times - or any home file access
really - when doing stuff in the same home directory on the gfs on
different nodes. For instance, the same compile on one node is ~12
minutes - on another it's 18 minutes or more (not running concurrently).
I'm also seeing weird random pauses in writes, like saving a file in vi,
what would normally take less than a second, may take up to 10 seconds.
* From reading, I see that the first node to access a directory will be
the lock master for that directory. How long is that node the master? If
the user is no longer 'on' that node, is it still the master? If
continued accesses are remote, will the master state migrate to the node
that is primarily accessing it?
Cluster locking is expensive. As a result, GFS caches its glocks, and there
is a one-to-one correspondence between GFS glocks and DLM locks. Even if a
user is no longer "on" that node, the lock stays on that node unless:

1. some other node requests exclusive access to this lock (a file write); or
2. the node has memory pressure that kicks off the Linux virtual memory
manager to reclaim idle filesystem structures (inodes, dentries, etc.); or
3. abnormal events occur, such as a crash, umount, etc.

Check out:
http://open-sharedroot.org/Members/marc/blog/blog-on-gfs/glock-trimming-patch/?searchterm=gfs
for details.


Post by christopher barry
I've set LVS persistence for ssh and
telnet for 5 minutes, to allow multiple xterms fired up in a script to
land on the same node, but new ones later will land on a different node
- by design really. Do I need to make this persistence way longer to
keep people only on the first node they hit? That kind of horks my load
balancing design if so. How can I see which node is master for which
directories? Is there a table I can read somehow?
You did the right thing here (by making the connections persistent). There
is a gfs glock dump command that can print out all the lock info (name,
owner, etc.), but I really don't want to recommend it - automating this
process is not trivial, and there is no practical way to do it by hand, i.e.
manually.
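
(For the curious: the dump in question is roughly what 'gfs_tool lockdump'
prints. A sketch, assuming a mount at /mnt/gfs - the output is huge and its
exact format is an assumption here:

  gfs_tool lockdump /mnt/gfs > /tmp/glocks.txt   # dump this node's cached glocks
  grep -c '^Glock' /tmp/glocks.txt               # rough count of glocks on this node

Mapping those entries back to directories and master nodes by hand is the
painful part.)
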
Post by christopher barry
* I've bumped the wake times for gfs_scand and gfs_inoded to 30 secs, I
mount noatime,noquota,nodiratime, and David Teigland recommended I set
dlm_dropcount to '0' today on irc, which I did, and I see an improvement
in speed on the node that appears to be master for say 'find' command
runs on the second and subsequent runs of the command if I restart them
immediately, but on the other nodes the speed is awful - worse than nfs
would be. On the first run of a find, or If I wait >10 seconds to start
another run after the last run completes, the time to run is
unbelievably slower than the same command on a standalone box with ext3.
e.g. <9 secs on the standalone, compared to 46 secs on the cluster - on
a different node it can take over 2 minutes! Yet an immediate re-run on
the cluster, on what I think must be the master is sub-second. How can I
speed up the first access time, and how can I keep the speed up similar
to immediate subsequent runs. I've got a ton of memory - I just do not
know which knobs to turn.
The more memory you have, the more gfs locks (and their associated gfs file
structures) will be cached on the node. This, in turn, will make both dlm and
gfs lock queries take longer. The glock_purge tunable (on RHEL 4.6, not on
RHEL 4.5) should be able to help, but its effects will be limited if you
ping-pong the locks quickly between different GFS nodes. Try playing around
with this tunable (start with 20%) to see how it goes (but please reset
gfs_scand and gfs_inoded back to their defaults while you are experimenting
with glock_purge).

So, assuming this is a build-compile cluster, implying a large number of
small files coming and going, the tricks I can think of are:

1. glock_purge ~ 20%
2. glock_inode shorter than default (not longer)
3. persistent LVS sessions if at all possible
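
As a rough sketch of applying these on a RHEL 4.6 node (the mount point is
made up; the values are only starting points):

  gfs_tool settune /mnt/gfs glock_purge 20   # start with 20%
  gfs_tool settune /mnt/gfs scand_secs 5     # back to the default while experimenting
  gfs_tool settune /mnt/gfs inoded_secs 15   # back to the default while experimenting
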
Post by christopher barry
Am I expecting too much from gfs? Did I oversell it when I literally
fought to use it rather than nfs off the NetApp filer, insisting that
the performance of gfs smoked nfs? Or, more likely, do I just not
understand how to optimize it fully for my application?
GFS1 is very good on large sequential IO (such as vedio-on-demand) but works
poorly in the environment you try to setup. However, I'm in an awkward
position to do further comments I'll stop here.

-- Wendy
Kadlecsik Jozsef
2008-04-08 21:09:26 UTC
Post by Wendy Cheng
The more memory you have, the more gfs locks (and their associated gfs file
structures) will be cached in the node. It, in turns, will make both dlm and
gfs lock queries take longer. The glock_purge (on RHEL 4.6, not on RHEL 4.5)
should be able to help but its effects will be limited if you ping-pong the
locks quickly between different GFS nodes. Try to play around with this
tunable (start with 20%) to see how it goes (but please reset gfs_scand and
gfs_inoded back to their defaults while you are experimenting glock_purge).
So assume this is a build-compile cluster, implying large amount of small
1. glock_purge ~ 20%
2. glock_inode shorter than default (not longer)
3. persistent LVS session if all possible
What is glock_inode? Does it, or something equivalent, exist in
cluster-2.01.00?

Isn't GFS_GL_HASH_SIZE too small for a large number of glocks? If it is too
small, the result is not only long linked lists; collisions in the same
bucket will also block otherwise parallel operations. Wouldn't it help to
increase it from 8k to 65k?

Best regards,
Jozsef
--
E-mail : ***@mail.kfki.hu, ***@blackhole.kfki.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
H-1525 Budapest 114, POB. 49, Hungary
Wendy Cheng
2008-04-09 15:06:08 UTC
Post by Kadlecsik Jozsef
What is glock_inode? Does it exist or something equivalent in
cluster-2.01.00?
Sorry, typo. What I meant was "inoded_secs" (the gfs inode daemon wake-up
time). This is the daemon that reclaims deleted inodes. Don't set it too
small, though.
Post by Kadlecsik Jozsef
Isn't GFS_GL_HASH_SIZE too small for large amount of glocks? Being too
small it results not only long linked lists but clashing at the same
bucket will block otherwise parallel operations. Wouldn't it help
increasing it from 8k to 65k?
Worth a try.

However, the issues involved here are about more than lock searching time.
They also have to do with cache flushing. GFS currently accumulates too much
dirty cache. When it starts to flush, it pauses the system for too long.
Glock trimming helps - since a cache flush is part of the glock releasing
operation.

-- Wendy
Wendy Cheng
2008-04-09 17:54:27 UTC
Post by Wendy Cheng
Post by Kadlecsik Jozsef
What is glock_inode? Does it exist or something equivalent in
cluster-2.01.00?
Sorry, typo. What I mean is "inoded_secs" (gfs inode daemon wake-up
time). This is the daemon that reclaims deleted inodes. Don't set it
too small though.
I have been responding to this email off the top of my head, based on folks'
descriptions. Please be aware that these are just rough thoughts and the
responses may not fit general cases. The above is mostly for the original
problem description, where:

1. The system is designated for build-compile - my take is that there are
many temporary and deleted files.
2. The gfs_inoded tunable was changed (to 30, instead of the default of 15).
Post by Wendy Cheng
Post by Kadlecsik Jozsef
Isn't GFS_GL_HASH_SIZE too small for large amount of glocks? Being
too small it results not only long linked lists but clashing at the
same bucket will block otherwise parallel operations. Wouldn't it
help increasing it from 8k to 65k?
Worth a try.
Now I remember .... we did experiment with different hash sizes when
this latency issue was first reported two years ago. It didn't make much
difference. The cache flushing, on the other hand, was more significant.

-- Wendy
Post by Wendy Cheng
However, the issues involved here are more than lock searching time.
It also has to do with cache flushing. GFS currently accumulates too
much dirty caches. When it starts to flush, it will pause the system
for too long. Glock trimming helps - since cache flush is part of
glock releasing operation.
Kadlecsik Jozsef
2008-04-09 19:42:33 UTC
Post by Wendy Cheng
Have been responding to this email from top of the head, based on folks'
descriptions. Please be aware that they are just rough thoughts and the
responses may not fit in general cases. The above is mostly for the original
1. The system is designated for build-compile - my take is that there are many
temporary and deleted files.
2. The gfs_inode tunable was changed (to 30, instead of default, 15).
I'll take it into account when experimenting with the different settings.
Post by Wendy Cheng
Post by Wendy Cheng
Post by Kadlecsik Jozsef
Isn't GFS_GL_HASH_SIZE too small for large amount of glocks? Being too
small it results not only long linked lists but clashing at the same
bucket will block otherwise parallel operations. Wouldn't it help
increasing it from 8k to 65k?
Worth a try.
Now I remember .... we did experiment with different hash sizes when this
latency issue was first reported two years ago. It didn't make much
difference. The cache flushing, on the other hand, was more significant.
What led me to suspect clashing in the hash (or some other lock-creation
issue) was a simple test I made on our five-node cluster: on one node I
ran

find /gfs -type f -exec cat {} > /dev/null \;

and on another one I just started an editor, naming a non-existent file.
It took multiple seconds for the editor to "open" the file. What else but
creating the lock could delay the process for so long?
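
(Schematically, with a made-up path, the test on the second node amounted to:

  time sh -c 'echo x > /gfs/home/someuser/new-file'   # multiple seconds on GFS

while the find was running on the first node.)
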
Post by Wendy Cheng
Post by Wendy Cheng
However, the issues involved here are more than lock searching time. It also
has to do with cache flushing. GFS currently accumulates too much dirty
caches. When it starts to flush, it will pause the system for too long.
Glock trimming helps - since cache flush is part of glock releasing
operation.
But 'flushing when releasing a glock' looks like a side effect. I mean, isn't
there a more direct way to control the flushing?

I can easily be totally wrong, but on the one hand it's good to keep as
many locks cached as possible, because lock creation is expensive. On the
other hand, trimming locks triggers flushing, which helps to keep the
systems running more smoothly. So a tunable to control flushing directly
would be better than just trimming the locks, wouldn't it? But not knowing
the deep internals of GFS, my reasoning can of course be bogus.

Best regards,
Jozsef
--
E-mail : ***@mail.kfki.hu, ***@blackhole.kfki.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
H-1525 Budapest 114, POB. 49, Hungary
Wendy Cheng
2008-04-09 20:41:37 UTC
Post by Kadlecsik Jozsef
Post by Wendy Cheng
Have been responding to this email from top of the head, based on folks'
descriptions. Please be aware that they are just rough thoughts and the
responses may not fit in general cases. The above is mostly for the original
1. The system is designated for build-compile - my take is that there are many
temporary and deleted files.
2. The gfs_inode tunable was changed (to 30, instead of default, 15).
I'll take it into account when experimenting with the different settings.
Post by Wendy Cheng
Post by Wendy Cheng
Post by Kadlecsik Jozsef
Isn't GFS_GL_HASH_SIZE too small for large amount of glocks? Being too
small it results not only long linked lists but clashing at the same
bucket will block otherwise parallel operations. Wouldn't it help
increasing it from 8k to 65k?
Worth a try.
Now I remember .... we did experiment with different hash sizes when this
latency issue was first reported two years ago. It didn't make much
difference. The cache flushing, on the other hand, was more significant.
What led me to suspect clashing in the hash (or some other lock-creating
issue) was the simple test I made on our five node cluster: on one node I
ran
find /gfs -type f -exec cat {} > /dev/null \;
and on another one just started an editor, naming a non-existent file.
It took multiple seconds while the editor "opened" the file. What else
than creating the lock could delay the process so long?
Not knowing how "find" is implemented, I would guess this is caused by
directory locks. Creating a file needs a directory lock. Your exclusive
write lock (file create) can't be granted until the "find" releases the
directory lock. It doesn't look like a lock query performance issue to me.
Post by Kadlecsik Jozsef
Post by Wendy Cheng
Post by Wendy Cheng
However, the issues involved here are more than lock searching time. It also
has to do with cache flushing. GFS currently accumulates too much dirty
caches. When it starts to flush, it will pause the system for too long.
Glock trimming helps - since cache flush is part of glock releasing
operation.
But 'flushing when releasing glock' looks as a side effect. I mean, isn't
there a more direct way to control the flushing?
I can easily be totally wrong, but on the one hand, it's good to keep as
many locks cached as possible, because lock creation is expensive. But on
the other hand, trimming locks triggers flushing, which helps to keep the
systems running more smoothly. So a tunable to control flushing directly
would be better than just trimming the locks, isn't it.
To make a long story short, I did submit a direct cache flush patch first,
before this final version of the lock trimming patch. Unfortunately, it
was *rejected*.

-- Wendy
Kadlecsik Jozsef
2008-04-10 13:00:40 UTC
Post by Wendy Cheng
Post by Kadlecsik Jozsef
What led me to suspect clashing in the hash (or some other lock-creating
issue) was the simple test I made on our five node cluster: on one node I
ran
find /gfs -type f -exec cat {} > /dev/null \;
and on another one just started an editor, naming a non-existent file.
It took multiple seconds while the editor "opened" the file. What else than
creating the lock could delay the process so long?
Not knowing how "find" is implemented, I would guess this is caused by
directory locks. Creating a file needs a directory lock. Your exclusive write
lock (file create) can't be granted until the "find" releases the directory
lock. It doesn't look like a lock query performance issue to me.
As /gfs is a large directory structure with hundreds of user home
directories, somehow I don't think I would have picked the same directory
that was just being processed by "find".

But this is a good clue to what might bite us most! Our GFS cluster is an
almost mail-only cluster for users with Maildir. When the users experience
temporary hangups for several seconds (even when writing a new mail), it
might be due to the MUA concurrently scanning for new mail on one node
while the MTA delivers to the Maildir on another node.

What is really strange (and disturbing) is that such "hangups" can take
10-20 seconds, which is just too much for the users.
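
For what it's worth, Maildir delivery is rename-heavy by design. A schematic
sketch of the sequence (paths and the filename format are illustrative only):

  # MTA delivery: write into tmp/, then rename into new/ (atomic within one fs)
  msg="$(date +%s).$$.$(hostname)"
  cat > ~/Maildir/tmp/"$msg"                       # message body from stdin
  mv ~/Maildir/tmp/"$msg" ~/Maildir/new/"$msg"
  # MUA marking the message as read: another rename, from new/ into cur/
  mv ~/Maildir/new/"$msg" ~/Maildir/cur/"${msg}:2,S"

So every delivery and every flag change does at least one rename in a
directory that the MUA on another node may be scanning at the same time.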

In order to look at the possible tuning options and their side effects, I
list what I have learned so far:

- Increasing glock_purge (percent, default 0) helps gfs_scand itself trim
back the unused glocks. Otherwise glocks accumulate and gfs_scand eats
more and more time scanning the larger and larger table of glocks.
- gfs_scand wakes up every scand_secs (default 5s) to scan the glocks,
looking for work to do. By increasing scand_secs one can lessen the load
produced by gfs_scand, but it'll hurt because flushing data can be
delayed.
- Decreasing demote_secs (seconds, default 300) helps to flush cached data
more often by moving write locks into less restricted states. Flushing
often helps to avoid burstiness *and* to avoid prolonging other nodes'
waits for lock access. The question is, what are the side effects of
small demote_secs values? (There is probably not much point in choosing
a demote_secs value smaller than scand_secs.)

Currently we are running with 'glock_purge = 20' and 'demote_secs = 30'.
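
(In case it helps others, applying and checking these on each node is just,
roughly:

  gfs_tool settune /gfs glock_purge 20
  gfs_tool settune /gfs demote_secs 30
  gfs_tool gettune /gfs | egrep 'glock_purge|demote_secs|scand_secs'
)
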
Post by Wendy Cheng
Post by Kadlecsik Jozsef
But 'flushing when releasing glock' looks as a side effect. I mean, isn't
there a more direct way to control the flushing?
To make long story short, I did submit a direct cache flush patch first,
instead of this final version of lock trimming patch. Unfortunately, it was
*rejected*.
I see. Another question, just out of curiosity: why don't you use kernel
timers for every glock instead of gfs_scand? The hash bucket id of the
glock would have to be added to struct gfs_glock, but the timer function
could be almost identical to scan_glock. As far as I can see, the only
drawback would be that it'd be equivalent to 'glock_purge = 100' and it'd
be tricky to emulate glock_purge != 100 settings.

Best regards,
Jozsef
--
E-mail : ***@mail.kfki.hu, ***@blackhole.kfki.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
H-1525 Budapest 114, POB. 49, Hungary
Kadlecsik Jozsef
2008-04-11 11:05:08 UTC
Post by Kadlecsik Jozsef
But this is a good clue to what might bite us most! Our GFS cluster is an
almost mail-only cluster for users with Maildir. When the users experience
temporary hangups for several seconds (even when writing a new mail), it
might be due to the concurrent scanning for a new mail on one node by the
MUA and the delivery to the Maildir in another node by the MTA.
What is really strange (and distrurbing) that such "hangups" can take
10-20 seconds which is just too much for the users.
Yesterday we started to monitor the number of locks/held locks on two of
the machines. The results from the first day can be found at
http://www.kfki.hu/~kadlec/gfs/.
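
(The monitoring itself is nothing fancy - roughly a per-node cron job along
these lines, sampling 'gfs_tool counters', whose output includes the lock
totals; the exact field names are from memory:

  ( date '+%F %T'; gfs_tool counters /gfs ) >> /var/log/gfs-lock-counts.log
)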

It looks as if Maildir is definitely the wrong choice for GFS and we should
consider converting to mailbox format: at least, I cannot explain the
spikes any other way.
Post by Kadlecsik Jozsef
In order to look at the possible tuning options and the side effects, I
- Increasing glock_purge (percent, default 0) helps to trim back the
unused glocks by gfs_scand itself. Otherwise glocks can accumulate and
gfs_scand eats more and more time at scanning the larger and
larger table of glocks.
- gfs_scand wakes up every scand_secs (default 5s) to scan the glocks,
looking for work to do. By increasing scand_secs one can lessen the load
produced by gfs_scand, but it'll hurt because flushing data can be
delayed.
- Decreasing demote_secs (seconds, default 300) helps to flush cached data
more often by moving write locks into less restricted states. Flushing
often helps to avoid burstiness *and* to prolong another nodes'
lock access. Question is, what are the side effects of small
demote_secs values? (Probably there is no much point to choose
smaller demote_secs value than scand_secs.)
Currently we are running with 'glock_purge = 20' and 'demote_secs = 30'.
Best regards,
Jozsef
--
E-mail : ***@mail.kfki.hu, ***@blackhole.kfki.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
H-1525 Budapest 114, POB. 49, Hungary
Wendy Cheng
2008-04-12 04:16:52 UTC
Post by Kadlecsik Jozsef
Post by Kadlecsik Jozsef
But this is a good clue to what might bite us most! Our GFS cluster is an
almost mail-only cluster for users with Maildir. When the users experience
temporary hangups for several seconds (even when writing a new mail), it
might be due to the concurrent scanning for a new mail on one node by the
MUA and the delivery to the Maildir in another node by the MTA.
I personally don't know much about mail servers. But if anyone can
explain more about what these two processes (?) do - say, how that
"MTA" delivers its mail (by the "rename" system call?) and/or how mails
are moved from which node to where - we may have a better chance of
figuring this puzzle out.

Note that the "rename" system call is normally very expensive. A minimum of
4 exclusive locks are required (two directory locks, one file lock for the
unlink, one file lock for the link), plus a resource group lock if block
allocation is required. There are numerous chances for deadlocks if this is
not handled carefully. The issue is further worsened by the way GFS1 does
its lock ordering - it obtains multiple locks based on lock name order.
Most of the lock names are taken from inode numbers, so their sequence is
always quite random. As soon as lock contention occurs, lock requests will
be serialized to avoid deadlocks. So this may be a cause of these spikes,
where "rename"(s) are struggling to get the lock order straight. But I
don't know for sure unless someone explains how the email server does its
thing. BTW, GFS2 has relaxed this lock-order issue, so it should work
better.

I'm taking a trip (away from the internet), but I'm interested in how this
story ends... Maybe by the time I get back on my laptop, someone will have
figured this out. But please do share the story :) ...

-- Wendy
Post by Kadlecsik Jozsef
Post by Kadlecsik Jozsef
What is really strange (and distrurbing) that such "hangups" can take
10-20 seconds which is just too much for the users.
Yesterday we started to monitor the number of locks/held locks on two of
the machines. The results from the first day can be found at
http://www.kfki.hu/~kadlec/gfs/.
It looks as Maildir is definitely a wrong choice for GFS and we should
consider to convert to mailbox format: at least I cannot explain the
spikes in another way.
Post by Kadlecsik Jozsef
In order to look at the possible tuning options and the side effects, I
- Increasing glock_purge (percent, default 0) helps to trim back the
unused glocks by gfs_scand itself. Otherwise glocks can accumulate and
gfs_scand eats more and more time at scanning the larger and
larger table of glocks.
- gfs_scand wakes up every scand_secs (default 5s) to scan the glocks,
looking for work to do. By increasing scand_secs one can lessen the load
produced by gfs_scand, but it'll hurt because flushing data can be
delayed.
- Decreasing demote_secs (seconds, default 300) helps to flush cached data
more often by moving write locks into less restricted states. Flushing
often helps to avoid burstiness *and* to prolong another nodes'
lock access. Question is, what are the side effects of small
demote_secs values? (Probably there is no much point to choose
smaller demote_secs value than scand_secs.)
Currently we are running with 'glock_purge = 20' and 'demote_secs = 30'.
Best regards,
Jozsef
--
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
H-1525 Budapest 114, POB. 49, Hungary
g***@bobich.net
2008-04-08 10:05:25 UTC
Post by christopher barry
6 rh4.5 nodes, gfs1 v6.1, behind redundant LVS directors. I know it's
not new stuff, but corporate standards dictated the rev of rhat.
[...]
Post by christopher barry
I'm noticing huge differences in compile times - or any home file access
really - when doing stuff in the same home directory on the gfs on
different nodes. For instance, the same compile on one node is ~12
minutes - on another it's 18 minutes or more (not running concurrently).
I'm also seeing weird random pauses in writes, like saving a file in vi,
what would normally take less than a second, may take up to 10 seconds.
* From reading, I see that the first node to access a directory will be
the lock master for that directory. How long is that node the master? If
the user is no longer 'on' that node, is it still the master? If
continued accesses are remote, will the master state migrate to the node
that is primarily accessing it? I've set LVS persistence for ssh and
telnet for 5 minutes, to allow multiple xterms fired up in a script to
land on the same node, but new ones later will land on a different node
- by design really. Do I need to make this persistence way longer to
keep people only on the first node they hit? That kind of horks my load
balancing design if so. How can I see which node is master for which
directories? Is there a table I can read somehow?
* I've bumped the wake times for gfs_scand and gfs_inoded to 30 secs, I
mount noatime,noquota,nodiratime, and David Teigland recommended I set
dlm_dropcount to '0' today on irc, which I did, and I see an improvement
in speed on the node that appears to be master for say 'find' command
runs on the second and subsequent runs of the command if I restart them
immediately, but on the other nodes the speed is awful - worse than nfs
would be. On the first run of a find, or If I wait >10 seconds to start
another run after the last run completes, the time to run is
unbelievably slower than the same command on a standalone box with ext3.
e.g. <9 secs on the standalone, compared to 46 secs on the cluster - on
a different node it can take over 2 minutes! Yet an immediate re-run on
the cluster, on what I think must be the master is sub-second. How can I
speed up the first access time, and how can I keep the speed up similar
to immediate subsequent runs. I've got a ton of memory - I just do not
know which knobs to turn.
It sounds like bumping up lock trimming might help, but I don't think
the feature's accessibility through /sys has been back-ported to RHEL4, so
if you're stuck with RHEL4, you may have to rebuild the latest versions of
the tools and kernel modules from RHEL5, or you're out of luck.
Post by christopher barry
Am I expecting too much from gfs? Did I oversell it when I literally
fought to use it rather than nfs off the NetApp filer, insisting that
the performance of gfs smoked nfs? Or, more likely, do I just not
understand how to optimize it fully for my application?
Probably a combination of all of the above. The main advantage of GFS
isn't speed, it's the fact that it is a proper POSIX file system, unlike
NFS or CIFS (e.g. file locking actually works on GFS). It also tends to
stay consistent if a node fails, due to journalling.

Having said that, I've not seen speed differences as big as what you're
describing, but I'm using RHEL5. I also have bandwidth charts for my
DRBD/cluster interface, and the bandwidth usage on a lightly loaded system
is not really significant unless lots of writes start happening. With
mostly reads (which can all be served from the local DRBD mirror), the
background "noise" traffic of combined DRBD and RHCS is > 200Kb/s
(25KB/s). Since the ping times are < 0.1ms, in theory this should make
locks take < 1ms to resolve/migrate. Of course, if your find goes over
50,000 files, then a 50-second delay to migrate all the locks may well be
in a reasonable ball-park. You may find that things have moved on quite a
bit since RHEL4...

Gordan
Wendy Cheng
2008-04-08 14:37:58 UTC
Post by g***@bobich.net
Post by christopher barry
6 rh4.5 nodes, gfs1 v6.1, behind redundant LVS directors. I know it's
not new stuff, but corporate standards dictated the rev of rhat.
[...]
Post by christopher barry
I'm noticing huge differences in compile times - or any home file access
really - when doing stuff in the same home directory on the gfs on
different nodes. For instance, the same compile on one node is ~12
minutes - on another it's 18 minutes or more (not running concurrently).
I'm also seeing weird random pauses in writes, like saving a file in vi,
what would normally take less than a second, may take up to 10 seconds.
* From reading, I see that the first node to access a directory will be
the lock master for that directory. How long is that node the master? If
the user is no longer 'on' that node, is it still the master? If
continued accesses are remote, will the master state migrate to the node
that is primarily accessing it? I've set LVS persistence for ssh and
telnet for 5 minutes, to allow multiple xterms fired up in a script to
land on the same node, but new ones later will land on a different node
- by design really. Do I need to make this persistence way longer to
keep people only on the first node they hit? That kind of horks my load
balancing design if so. How can I see which node is master for which
directories? Is there a table I can read somehow?
* I've bumped the wake times for gfs_scand and gfs_inoded to 30 secs, I
mount noatime,noquota,nodiratime, and David Teigland recommended I set
dlm_dropcount to '0' today on irc, which I did, and I see an improvement
in speed on the node that appears to be master for say 'find' command
runs on the second and subsequent runs of the command if I restart them
immediately, but on the other nodes the speed is awful - worse than nfs
would be. On the first run of a find, or If I wait >10 seconds to start
another run after the last run completes, the time to run is
unbelievably slower than the same command on a standalone box with ext3.
e.g. <9 secs on the standalone, compared to 46 secs on the cluster - on
a different node it can take over 2 minutes! Yet an immediate re-run on
the cluster, on what I think must be the master is sub-second. How can I
speed up the first access time, and how can I keep the speed up similar
to immediate subsequent runs. I've got a ton of memory - I just do not
know which knobs to turn.
It sounds like bumping up lock trimming might help, but I don't think
the feature accessibility through /sys has been back-ported to RHEL4,
so if you're stuck with RHEL4, you may have to rebuild the latest
versions of the tools and kernel modules from RHEL5, or you're out of
luck.
The glock trimming patch was mostly written and tuned on top of RHEL 4. It
doesn't use the /sys interface. The original patch was field-tested on
several customer production sites. Upon the CVS check-in for RHEL 4.5, it
was revised to use a less aggressive approach, which turned out to be not
as effective as the original. So the original patch was re-checked into
RHEL 4.6.

I wrote the patch.

-- Wendy
christopher barry
2008-04-10 20:18:55 UTC
Post by christopher barry
6 rh4.5 nodes, gfs1 v6.1, behind redundant LVS directors. I know it's
not new stuff, but corporate standards dictated the rev of rhat.
[...]
Post by christopher barry
I'm noticing huge differences in compile times - or any home file access
really - when doing stuff in the same home directory on the gfs on
different nodes. For instance, the same compile on one node is ~12
minutes - on another it's 18 minutes or more (not running concurrently).
I'm also seeing weird random pauses in writes, like saving a file in vi,
what would normally take less than a second, may take up to 10 seconds.
Anyway, I thought I would re-connect with you all and let you know how this
worked out. We ended up scrapping gfs. Not because it's not a great fs,
but because I was using it in a way that was playing to its weak
points. I had a lot of time and energy invested in it, and it was hard
to let it go. It turns out that connecting to the NetApp filer via nfs is
faster for this workload. I couldn't believe it either, as my bonnie and
dd type tests showed gfs to be faster. But for the use case of large
sets of very small files, and lots of stats going on, gfs simply cannot
compete with NetApp's nfs implementation. GFS is an excellent fs, and it
has its place in the landscape - but for a development build system,
the NetApp is simply phenomenal.


Thanks all for your assistance in the many months I have sought and
received advice and help here.

Regards,
Christopher Barry
Wendy Cheng
2008-04-11 15:28:37 UTC
Post by christopher barry
Post by christopher barry
6 rh4.5 nodes, gfs1 v6.1, behind redundant LVS directors. I know it's
not new stuff, but corporate standards dictated the rev of rhat.
[...]
Post by christopher barry
I'm noticing huge differences in compile times - or any home file access
really - when doing stuff in the same home directory on the gfs on
different nodes. For instance, the same compile on one node is ~12
minutes - on another it's 18 minutes or more (not running concurrently).
I'm also seeing weird random pauses in writes, like saving a file in vi,
what would normally take less than a second, may take up to 10 seconds.
Anyway, thought I would re-connect to you all and let you know how this
worked out. We ended up scrapping gfs. Not because it's not a great fs,
but because I was using it in a way that was playing to it's weak
points. I had a lot of time and energy invested in it, and it was hard
to let it go. Turns out that connecting to the NetApp filer via nfs is
faster for this workload. I couldn't believe it either, as my bonnie and
dd type tests showed gfs to be faster. But for the use case of large
sets of very small files, and lots of stats going on, gfs simply cannot
compete with NetApp's nfs implementation. GFS is an excellent fs, and it
has it's place in the landscape - but for a development build system,
the NetApp is simply phenomenal.
Assuming you run both configurations (nfs-wafl vs. gfs-san) on the very
same netapp box (?) ...

Both configurations have their pros and cons. The wafl-nfs runs in
native mode, which certainly has its advantages - you've made a good
choice - but the latter (gfs-on-netapp SAN) can work well in other
situations. The biggest problem with your original configuration is the
load balancer. Round-robin (and its variants) scheduling will not
work well if you have a write-intensive workload that needs to fight for
locks between multiple GFS nodes. IIRC, there are gfs customers running
build-compile development environments. They normally assign groups of
users to different GFS nodes, say user ids starting with a-e on node 1,
f-j on node 2, etc.

One piece of encouraging news from this email is that gfs-on-netapp-san runs
well on bonnie. GFS1 has been struggling with bonnie (a large number of
smaller files within one single node) for a very long time. One of the
reasons is that its block allocation tends to get spread across the disk
whenever there are resource group contentions. It is very difficult for the
Linux IO scheduler to merge these blocks within one single server. When the
workload becomes IO-bound, the locks are subsequently stalled and
everything starts to snowball after that. The NetApp SAN has one more layer
of block-allocation indirection within its firmware, and its write speed
is "phenomenal" (I'm borrowing your words ;) ), mostly due to the
NVRAM, where it can aggressively cache write data - this helps GFS
relieve its small-file issue quite well.

-- Wendy
christopher barry
2008-04-11 15:47:16 UTC
Post by Wendy Cheng
Post by christopher barry
Post by christopher barry
6 rh4.5 nodes, gfs1 v6.1, behind redundant LVS directors. I know it's
not new stuff, but corporate standards dictated the rev of rhat.
[...]
Post by christopher barry
I'm noticing huge differences in compile times - or any home file access
really - when doing stuff in the same home directory on the gfs on
different nodes. For instance, the same compile on one node is ~12
minutes - on another it's 18 minutes or more (not running concurrently).
I'm also seeing weird random pauses in writes, like saving a file in vi,
what would normally take less than a second, may take up to 10 seconds.
Anyway, thought I would re-connect to you all and let you know how this
worked out. We ended up scrapping gfs. Not because it's not a great fs,
but because I was using it in a way that was playing to it's weak
points. I had a lot of time and energy invested in it, and it was hard
to let it go. Turns out that connecting to the NetApp filer via nfs is
faster for this workload. I couldn't believe it either, as my bonnie and
dd type tests showed gfs to be faster. But for the use case of large
sets of very small files, and lots of stats going on, gfs simply cannot
compete with NetApp's nfs implementation. GFS is an excellent fs, and it
has it's place in the landscape - but for a development build system,
the NetApp is simply phenomenal.
Assuming you run both configurations (nfs-wafl vs. gfs-san) on the very
same netapp box (?) ...
yes.
Post by Wendy Cheng
Both configurations have their pros and cons. The wafl-nfs runs on
native mode that certainly has its advantages - you've made a good
choice but the latter (gfs-on-netapp san) can work well in other
situations. The biggest problem with your original configuration is the
load-balancer. The round-robin (and its variants) scheduling will not
work well if you have a write intensive workload that needs to fight for
locks between multiple GFS nodes. IIRC, there are gfs customers running
on build-compile development environment. They normally assign groups of
users on different GFS nodes, say user id starting with a-e on node 1,
f-j on node2, etc.
Exactly. I was about to implement the sh (source hash) scheduler in LVS,
which I believe would have accomplished the same thing, only
automatically, and in a statistically balanced way. Actually, I still
might. I've had some developers test out the nfs solution, and for some of
them gfs is still better. I know that if users are pinned to a node - but
can still fail over in the event of a node failure - this would yield the
best possible performance.
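
For the record, that would just be a scheduler change on the director (same
made-up addresses as in the earlier sketch; -s sh is LVS's source-hashing
scheduler, which pins each client IP to one real server):

  ipvsadm -A -t 192.168.0.10:22 -s sh
  ipvsadm -a -t 192.168.0.10:22 -r 10.0.0.1:22 -g
  ipvsadm -a -t 192.168.0.10:22 -r 10.0.0.2:22 -g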

The main reason the IT group wants to use nfs is for all of the other
benefits, such as file-level snapshots, better backup performance, etc.
Now that they see a chink in the gfs performance armor, mainly because I
implemented the wrong load-balancing algorithm, they're circling for the
kill. I'm interested in how well the nfs will scale with users vs. the
gfs-san approach.
Post by Wendy Cheng
One encouraging news from this email is gfs-netapp-san runs well on
bonnie. GFS1 has been struggling with bonnie (large amount of smaller
files within one single node) for a very long time. One of the reasons
is its block allocation tends to get spread across the disk whenever
there are resource group contentions. It is very difficult for linux IO
scheduler to merge these blocks within one single server. When the
workload becomes IO-bound, the locks are subsequently stalled and
everything start to snow-ball after that. Netapp SAN has one more layer
of block allocation indirection within its firmware and its write speed
is "phenomenal" (I'm borrowing your words ;) ), mostly to do with the
NVRAM where it can aggressively cache write data - this helps GFS to
relieve its small file issue quite well.
Thanks for all of your input Wendy.

-C
Post by Wendy Cheng
-- Wendy