Discussion:
[Linux-cluster] GFS2 DLM problem on NVMes
Jay Sung (Baegjae)
2017-11-20 04:23:35 UTC
Hello, List.

We are developing storage systems using 10 NVMes (current test set).
Using MD RAID10 + CLVM/GFS2 over four hosts achieves 22 GB/s (Max. on Reads).
However, a GFS2 DLM problem occurred. The problem is that each host frequently reports "dlm: gfs2: send_repeat_remove" kernel messages, and I/O throughput becomes unstable and low.
I found a GFS2 commit message about the "send_repeat_remove" function.
(https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git/commit/?id=96006ea6d4eea73466e90ef353bf34e507724e77)

Information about the test environment.
Four hosts share 10 NVMes, and each host deploys CLVM/GFS2 on top of the cluster MD RAID1 + MD RAID0.
GFS2 has 2,000 directories, each with 1,900 media files (3 MB on average).
Each host runs 20 threads of NGINX, and each thread randomly reads media files on demand.
The Linux kernel version is 4.11.8.

Can you offer suggestions or directions to solve these problems?
Thank you in advance :)

Best regards,
/Jay Sung

Jay Sung (Baegjae), Manager | Software Defined Storage Lab | SK Telecom Co., LTD.
***@sk.com | mobile: +82-10-2087-5637
Steven Whitehouse
2017-11-20 10:40:54 UTC
Hi,
Post by Jay Sung (Baegjae)
Hello, List.
We are developing storage systems using 10 NVMes (current test set).
Using MD RAID10 + CLVM/GFS2 over four hosts achieves 22 GB/s (Max. on Reads).
However, a GFS2 DLM problem occurred. The problem is that each host
frequently reports “dlm: gfs2: send_repeat_remove” kernel messages,
and I/O throughput becomes unstable and low.
I found a GFS2 commit message about “send_repeat_remove” function.
(https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git/commit/?id=96006ea6d4eea73466e90ef353bf34e507724e77)
Information about the test environment.
Four hosts share 10 NVMes, and each host deploys CLVM/GFS2 on top of
the cluster MD RAID1 + MD RAID0.
GFS2 has 2,000 directories, each with 1,900 media files (3 MB on average).
Each host runs 20 threads of NGINX, and each thread randomly reads media files on demand.
The Linux kernel version is 4.11.8.
Can you offer suggestions or directions to solve these problems?
Thank you in advance :)
Best regards,
/Jay Sung
I'm copying in our DLM experts. It would be good to open a bug at Red
Hat's bugzilla to track this issue (and a customer case too, if you are
a customer). It looks like something that will need some investigation
to get to the bottom of what is going on. I suspect that a tcpdump of
the DLM traffic when the issue occurs would be the first thing to try,
so that we can try and match the message to the protocol dump. That may
not be easy since I suspect that there is a large quantity of DLM
traffic in your set up, and that will make finding the specific messages
more tricky.
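
As a rough illustration of the capture suggested above, the sketch below only builds a tcpdump command line. It assumes DLM is using its default TCP port, 21064; the interface name and output path are placeholders, and the function name `dlm_capture_cmd` is hypothetical. The ring-buffer options are there because the volume of DLM traffic may be large.

```shell
#!/bin/sh
# Sketch: build a tcpdump command line for capturing DLM traffic.
# Assumption: DLM runs over TCP on its default port 21064. Override
# DLM_PORT if the cluster is configured differently.
DLM_PORT="${DLM_PORT:-21064}"

# dlm_capture_cmd IFACE OUTFILE
# Prints the command instead of running it, so it can be reviewed first.
dlm_capture_cmd() {
    iface="$1"
    out="$2"
    # -s 0   : use tcpdump's default snapshot length (whole packets)
    # -C 100 : start a new output file after roughly 100 MB
    # -W 20  : keep at most 20 files, overwriting the oldest (ring buffer)
    printf 'tcpdump -i %s -s 0 -C 100 -W 20 -w %s tcp port %s\n' \
        "$iface" "$out" "$DLM_PORT"
}
```

For example, `sh -c "$(dlm_capture_cmd eth0 /var/tmp/dlm.pcap)"` run as root would start the capture on a (placeholder) interface `eth0`. The timestamps of the send_repeat_remove kernel messages can then be used to pick the relevant capture file.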

Just out of interest, what kind of network is this running over? How
much bandwidth is DLM taking up?

Steve.
David Teigland
2017-11-20 19:09:32 UTC
Post by Jay Sung (Baegjae)
We are developing storage systems using 10 NVMes (current test set).
Using MD RAID10 + CLVM/GFS2 over four hosts achieves 22 GB/s (Max. on Reads).
Does MD RAID10 work correctly under GFS2? Does the RAID10 make use of the
recent md-cluster enhancements (which also use the dlm)?
Post by Jay Sung (Baegjae)
However, a GFS2 DLM problem occurred. The problem is that each host
frequently reports dlm: gfs2: send_repeat_remove kernel messages,
and I/O throughput becomes unstable and low.
send_repeat_remove is a mysterious corner case, related to the resource
directory becoming out of sync with the actual resource master. There's
an inherent race in this area of the dlm which is hard to solve because
the same record (mapping of resource name to master nodeid) needs to be
changed consistently on two nodes. Perhaps in the future the dlm could be
enhanced with some algorithm to do that better. For now, it just repeats
the change (logging the message you see). If the repeated operation is
working, then things won't be permanently stuck.

The most likely cause, it seems to me, is that the speed of storage
relative to the speed of the network is triggering pathological timing
issues in the dlm. Try adjusting the "toss_secs" tunable, which controls
how long a node will hold on to an unused resource before giving up
mastery of it (the master change is what leads to the inconsistency
mentioned above.)

echo 1000 > /sys/kernel/config/dlm/cluster/toss_secs

The default is 10, I'd try 100/1000/10000. A number too large could have
negative consequences of not freeing enough dlm resources that will never
be used again, e.g. if you are deleting a lot of files. Set this number
before mounting gfs for it to take effect.
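
The ordering above (write the tunable, then mount) can be sketched as a small helper. This is only an illustration: the function name `set_toss_secs` is hypothetical, and the `DLM_CLUSTER_DIR` override exists only so the helper can be tried against a scratch directory. The configfs path is the one quoted above.

```shell
#!/bin/sh
# Sketch: set the dlm toss_secs tunable and verify it before mounting GFS2.
# DLM_CLUSTER_DIR defaults to the configfs path from the message above.
DLM_CLUSTER_DIR="${DLM_CLUSTER_DIR:-/sys/kernel/config/dlm/cluster}"

# set_toss_secs SECONDS
# Returns non-zero if the value is invalid, the tunable file is missing,
# or the value read back does not match what was written.
set_toss_secs() {
    secs="$1"
    # Accept only a non-empty string of digits.
    case "$secs" in
        ''|*[!0-9]*)
            echo "toss_secs must be a positive integer" >&2
            return 1
            ;;
    esac
    if [ "$secs" -le 0 ]; then
        echo "toss_secs must be greater than zero" >&2
        return 1
    fi
    file="$DLM_CLUSTER_DIR/toss_secs"
    if [ ! -e "$file" ]; then
        echo "missing $file (is the dlm cluster directory set up?)" >&2
        return 1
    fi
    echo "$secs" > "$file" || return 1
    # Read the value back to confirm the write took effect.
    [ "$(cat "$file")" = "$secs" ]
}
```

A typical use would be `set_toss_secs 1000 && mount -t gfs2 /dev/vg/lv /mnt/gfs2`, where the device and mount point are placeholders, so that the mount only happens once the value is in place.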

In the past, I think that send_repeat_remove has tended to appear when
there's a huge volume of dlm messages, triggered by excessive caching done
by gfs when there's a large amount of system memory. The huge volume of
dlm messages results in the messages appearing in unusual sequences,
reversing the usual cause-effect.

Dave
--
Linux-cluster mailing list
Linux-***@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
Eric H. Chang
2017-11-22 04:32:13 UTC
Hi Dave and Steven,
Thank you for the assistance.

We made some progress here and would like to share it with you.

#1.
We've set 'vm.vfs_cache_pressure' to zero and ran tests. As a result, we no longer saw the problem, and we observed the slab grow slowly and saturate at 25 GB during the overnight test. We will keep testing with this setting, but it would be appreciated if you could advise of any risks in sticking with this configuration.
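
A minimal sketch of how the slab growth described above could be tracked during such a test. The function name `slab_kb` is hypothetical; the field names are the standard ones in /proc/meminfo. On the question of risk, the kernel's own sysctl documentation notes that with vfs_cache_pressure set to 0 the kernel never reclaims dentries and inodes under memory pressure, which can lead to out-of-memory conditions, so watching these numbers over a long run seems prudent.

```shell
#!/bin/sh
# Sketch: report slab usage while testing with
#   sysctl -w vm.vfs_cache_pressure=0
# (the setting itself is not applied here).

# slab_kb [MEMINFO_FILE]
# Prints the Slab, SReclaimable and SUnreclaim lines in a compact form.
# The optional argument lets the function be tried on a saved copy of
# /proc/meminfo.
slab_kb() {
    meminfo="${1:-/proc/meminfo}"
    awk '$1 == "Slab:" || $1 == "SReclaimable:" || $1 == "SUnreclaim:" {
             print $1, $2, $3
         }' "$meminfo"
}
```

Running `slab_kb` periodically, for instance from `watch -n 60`, gives a simple record of whether the slab has really stopped growing.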

#2.
We've tested different 'toss_secs' values as advised. When we set it to 1000, we saw the 'send_repeat_remove' log after 1000 seconds. We can test other 'toss_secs' values, but we think the same problem could occur when the slab is freed after the configured number of seconds.
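
To compare runs with different 'toss_secs' values, the number of send_repeat_remove messages can be counted from the kernel log. The sketch below is one simple way to do that; the function name `count_repeat_remove` is hypothetical, and it matches only on the message text quoted in this thread.

```shell
#!/bin/sh
# Sketch: count send_repeat_remove messages in kernel log text.

# count_repeat_remove
# Reads log text on stdin and prints the number of matching lines.
# grep -c exits non-zero when the count is 0, so the status is ignored.
count_repeat_remove() {
    grep -c 'send_repeat_remove' || true
}
```

For example, `dmesg | count_repeat_remove` on each host before and after a test run shows how many messages that run produced.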

Do our results make sense to you?

Best Regards,
Eric Chang(Hong-seok), Manager | Software Defined Storage Lab | SK Telecom Co., LTD.
***@sk.com | mobile: +82-10-4996-3690 | skype: ehschang




