Discussion:
Removing a node from a running cluster
Pena, Francisco Javier
2007-01-04 14:42:05 UTC
Hello,

I am finding a strange cman behavior when removing a node from a running cluster. The starting point is:

- 3 nodes running RHEL 4 U4, GFS 6.1 (1 vote per node)
- Quorum disk (4 votes)

I stop all cluster services on node 3, then modify the cluster.conf file to remove the node (and adjust the quorum disk votes to 3), and then run "ccs_tool update" and "cman_tool version -r <new_version>". The cluster services keep running; however, it looks like cman is not completely in sync with ccsd:

# ccs_tool lsnode

Cluster name: TestCluster, config_version: 9

Nodename Votes Nodeid Iface Fencetype
gfsnode1 1 1 iLO_NODE1
gfsnode2 1 2 iLO_NODE2


# cman_tool nodes

Node Votes Exp Sts Name
0 4 0 M /dev/emcpowera1
1 1 3 M gfsnode1
2 1 3 M gfsnode2
3 1 3 X gfsnode3

# cman_tool status

Protocol version: 5.0.1
Config version: 9
Cluster name: TestCluster
Cluster ID: 62260
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 2
Expected_votes: 3
Total_votes: 6
Quorum: 4
Active subsystems: 9
Node name: gfsnode1
Node ID: 1
Node addresses: A.B.C.D

CMAN still thinks the third node is part of the cluster and has simply stopped working. In addition to that, it is not updating the number of votes for the quorum disk. If I completely restart the cluster services on all nodes, I get the right information:

- Correct votes for the quorum disk
- Third node disappears
- The Expected_votes value is now 2

I know from a previous post that two-node clusters are a special case, even with a quorum disk, but I am pretty sure the same problem will happen with higher node counts (I just do not have enough hardware to test it).

So, is this considered a bug, or is it expected that the information from removed nodes remains until the whole cluster is restarted?

Thanks in advance,

Javier Peña
Patrick Caulfield
2007-01-04 15:00:43 UTC
Post by Pena, Francisco Javier
Hello,
- 3 nodes running RHEL 4 U4, GFS 6.1 (1 vote per node)
- Quorum disk (4 votes)
# ccs_tool lsnode
Cluster name: TestCluster, config_version: 9
Nodename Votes Nodeid Iface Fencetype
gfsnode1 1 1 iLO_NODE1
gfsnode2 1 2 iLO_NODE2
# cman_tool nodes
Node Votes Exp Sts Name
0 4 0 M /dev/emcpowera1
1 1 3 M gfsnode1
2 1 3 M gfsnode2
3 1 3 X gfsnode3
# cman_tool status
Protocol version: 5.0.1
Config version: 9
Cluster name: TestCluster
Cluster ID: 62260
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 2
Expected_votes: 3
Total_votes: 6
Quorum: 4
Active subsystems: 9
Node name: gfsnode1
Node ID: 1
Node addresses: A.B.C.D
- Correct votes for the quorum disk
- Third node disappears
- The Expected_votes value is now 2
I can't comment on the behaviour of the quorum disk, but cman is behaving as expected. A node is NEVER removed from the internal
lists of cman while any node of the cluster is still active. It is completely harmless in that state; the node simply remains
permanently dead, and expected votes is adjusted accordingly.
--
patrick
Jim Parsons
2007-01-04 15:55:14 UTC
Post by Patrick Caulfield
Post by Pena, Francisco Javier
Hello,
- 3 nodes running RHEL 4 U4, GFS 6.1 (1 vote per node)
- Quorum disk (4 votes)
# ccs_tool lsnode
Cluster name: TestCluster, config_version: 9
Nodename Votes Nodeid Iface Fencetype
gfsnode1 1 1 iLO_NODE1
gfsnode2 1 2 iLO_NODE2
# cman_tool nodes
Node Votes Exp Sts Name
0 4 0 M /dev/emcpowera1
1 1 3 M gfsnode1
2 1 3 M gfsnode2
3 1 3 X gfsnode3
# cman_tool status
Protocol version: 5.0.1
Config version: 9
Cluster name: TestCluster
Cluster ID: 62260
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 2
Expected_votes: 3
Total_votes: 6
Quorum: 4
Active subsystems: 9
Node name: gfsnode1
Node ID: 1
Node addresses: A.B.C.D
- Correct votes for the quorum disk
- Third node disappears
- The Expected_votes value is now 2
I can't comment on the behaviour of the quorum disk, but cman is behaving as expected. A node is NEVER removed from the internal
lists of cman while any node of the cluster is still active. It is completely harmless in that state, the node simply remains
permanently dead and expected votes is adjusted accordingly.
Patrick - isn't it also necessary to set a cman attribute for
two-node='1' in the conf file? In order for cman to see this attribute,
the entire cluster would need to be restarted.

Regards,

-Jim
Patrick Caulfield
2007-01-04 15:34:16 UTC
Post by Jim Parsons
Post by Patrick Caulfield
Post by Pena, Francisco Javier
Hello,
I am finding a strange cman behavior when removing a node from a
- 3 nodes running RHEL 4 U4, GFS 6.1 (1 vote per node)
- Quorum disk (4 votes)
I stop all cluster services on node 3, then modify the cluster.conf
file to remove the node (and adjust the quorum disk votes to 3), and
then "ccs_tool update" and "cman_tool version -r <new_version>". The
cluster services keep running, however it looks like cman is not
# ccs_tool lsnode
Cluster name: TestCluster, config_version: 9
Nodename Votes Nodeid Iface Fencetype
gfsnode1 1 1 iLO_NODE1
gfsnode2 1 2 iLO_NODE2
# cman_tool nodes
Node Votes Exp Sts Name
0 4 0 M /dev/emcpowera1
1 1 3 M gfsnode1
2 1 3 M gfsnode2
3 1 3 X gfsnode3
# cman_tool status
Protocol version: 5.0.1
Config version: 9
Cluster name: TestCluster
Cluster ID: 62260
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 2
Expected_votes: 3
Total_votes: 6
Quorum: 4
Active subsystems: 9
Node name: gfsnode1
Node ID: 1
Node addresses: A.B.C.D
CMAN still thinks the third node is part of the cluster, but has just
stopped working. In addition to that, it is not updating the number
of votes for the quorum disk. If I completely restart the cluster
- Correct votes for the quorum disk
- Third node disappears
- The Expected_votes value is now 2
I can't comment on the behaviour of the quorum disk, but cman is
behaving as expected. A node is NEVER removed from the internal
lists of cman while any node of the cluster is still active. It is
completely harmless in that state, the node simply remains
permanently dead and expected votes is adjusted accordingly.
Patrick - isn't it also necessary to set a cman attribute for
two-node='1' in the conf file? In order for cman to see this attribute,
the entire cluster would need to be restarted.
No, not if they're using a quorum disk.

That flag is only needed for a two-node cluster where the quorum is set to one and the surviving node is determined by a fencing race.
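For reference, that flag lives on the cman element in cluster.conf, roughly like this (the values below are just an example):

<cman two_node="1" expected_votes="1"/>

With a quorum disk you would leave two_node off and give the quorumd element its votes instead, something like:

<quorumd interval="1" tko="10" votes="3" device="/dev/emcpowera1"/>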
--
patrick
Jim Parsons
2007-01-04 16:23:13 UTC
Post by Patrick Caulfield
Post by Jim Parsons
Patrick - isn't it also necessary to set a cman attribute for
two-node='1' in the conf file? In order for cman to see this attribute,
the entire cluster would need to be restarted.
No, not if they're using a quorum disk.
That flag is only needed for a two-node cluster where the quorum is set to one and the surviving node is determined by a fencing race.
Oh my - what are the implications of having that attr set when using a
quorum disk? Nothing, I hope...

-J
Patrick Caulfield
2007-01-04 17:05:04 UTC
Post by Jim Parsons
Post by Patrick Caulfield
Post by Jim Parsons
Patrick - isn't it also necessary to set a cman attribute for
two-node='1' in the conf file? In order for cman to see this attribute,
the entire cluster would need to be restarted.
No, not if they're using a quorum disk.
That flag is only needed for a two-node cluster where the quorum is
set to one and the surviving node is determined by a fencing race.
Oh my - what are the implications of having that attr set when using a
quorum disk? Nothing, I hope...
Well, basically that flag allows a cluster to continue with a single vote. So it could be quite dangerous I suppose if the cluster
splits and one node has the quorum disk and one doesn't.

I'd need to check specific configurations but I wouldn't really recommend it...
--
patrick
Graeme Crawford
2007-01-08 18:20:36 UTC
Next time, run "cman_tool leave" it has a few pre-req's so check the man page.
Then a "cman_tool expected vote_num" should sort out your quorum/votes.

graeme.
Post by Pena, Francisco Javier
Hello,
- 3 nodes running RHEL 4 U4, GFS 6.1 (1 vote per node)
- Quorum disk (4 votes)
# ccs_tool lsnode
Cluster name: TestCluster, config_version: 9
Nodename Votes Nodeid Iface Fencetype
gfsnode1 1 1 iLO_NODE1
gfsnode2 1 2 iLO_NODE2
# cman_tool nodes
Node Votes Exp Sts Name
0 4 0 M /dev/emcpowera1
1 1 3 M gfsnode1
2 1 3 M gfsnode2
3 1 3 X gfsnode3
# cman_tool status
Protocol version: 5.0.1
Config version: 9
Cluster name: TestCluster
Cluster ID: 62260
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 2
Expected_votes: 3
Total_votes: 6
Quorum: 4
Active subsystems: 9
Node name: gfsnode1
Node ID: 1
Node addresses: A.B.C.D
- Correct votes for the quorum disk
- Third node dissappears
- The Expected_votes value is now 2
I know from a previous post that two node clusters are a special case, even with quorum disk, but I am pretty sure the same problem will happen with higher node counts (I just do not have enough hardware to test it).
So, is this considered as a bug or is it expected that the information from removed nodes is still there until the whole cluster is restarted?
Thanks in advance,
Javier Peña
i***@logicore.net
2007-01-08 18:31:01 UTC
Post by Graeme Crawford
Next time, run "cman_tool leave" it has a few pre-req's so check the man page.
I have these problems also. Trying to shut down the cluster, I get;

cman_tool leave;

cman_tool: Can't leave cluster while there are 6 active subsystems


Mike
i***@logicore.net
2007-01-08 18:52:59 UTC
Fixed my shutdown problems, so for anyone else having issues, here's how it works.

Man for cman_tool says;

//
leave;

Tells CMAN to leave the cluster. You cannot do this if there are subsystems
(eg DLM, GFS) active.
You should dismount all GFS filesystems, shutdown CLVM, fenced and anything
else using the cluster manager before using cman_tool leave. Look at
cman_tool status|services to see how many (and which) services are running.
\\

Answers all :).

Mike
James Parsons
2007-01-08 19:04:45 UTC
Post by i***@logicore.net
Fixed my shut down problems so anyone else having issues, here's how it works.
Man for cman_tool says;
//
leave;
Tells CMAN to leave the cluster. You cannot do this if there are subsystems
(eg DLM, GFS) active.
You should dismount all GFS filesystems, shutdown CLVM, fenced and anything
else using the cluster manager before using cman_tool leave. Look at
cman_tool status|services to see how many (and which) services are running.
\\
Answers all :).
WARNING: Shameless Promotion --

Conga does all of these things for you in a browser window...there is a
dropdown menu on the node page that offers the user the option to have a
node leave or join a cluster, completely delete a node, reboot a node,
or use the fence subsystem to fence a node. With one mouse click and a
confirmation dialog, all necessary services are checked and shut down for
you and the node is removed/deleted/etc.

When you add a new node, you enter the ipaddr/hostname for the new node,
and then all necessary packages are yummed and installed, all necessary
services started, and a new configuration file reflecting the new node
is propagated.

What if you add a node to a two-node cluster that does not use a quorum
disk, you ask? Conga removes the two_node=1 attr from the <cman> tag and
reminds you that the cluster needs to be restarted...and provides a link
to the appropriate cluster page where one mouse click and a confirmation
dialog will restart the whole cluster.

-J
i***@logicore.net
2007-01-08 19:35:53 UTC
Ok, confusion again... why does this work on one node but not another? They
are identical nodes in every way.

# more stop_gfs

/etc/init.d/httpd stop
umount /var/www
vgchange -aln
/etc/init.d/clvmd stop
fence_tool leave
/etc/init.d/fenced stop
cman_tool leave
killall ccsd

On some nodes, I'm still getting;

cman_tool: Can't leave cluster while there are 1 active subsystems

Mike
Post by i***@logicore.net
Fixed my shut down problems so anyone else having issues, here's how
it works.
Man for cman_tool says;
//
leave;
Tells CMAN to leave the cluster. You cannot do this if there are subsystems
(eg DLM, GFS) active.
You should dismount all GFS filesystems, shutdown CLVM, fenced and anything
else using the cluster manager before using cman_tool leave. Look at
cman_tool status|services to see how many (and which) services are
running.
\\
Answers all :).
Mike
Jaap Dijkshoorn
2007-01-09 08:42:40 UTC
MIke,
Post by i***@logicore.net
Ok, confusion again... why does this work on one node but not
another. They
are identical nodes in every way.
# more stop_gfs
/etc/init.d/httpd stop
umount /var/www
vgchange -aln
/etc/init.d/clvmd stop
fence_tool leave
/etc/init.d/fenced stop
cman_tool leave
killall ccsd
On some nodes, I'm still getting;
cman_tool: Can't leave cluster while there are 1 active subsystems
Mike
You should check with "cman_tool services" (mentioned below) which services are
still running/updating/joining, etc. It sometimes happens that a service
can't be shut down nicely. You can try to kill the daemons by hand with a
soft/hard kill.
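For example, something like this (using rgmanager purely as an illustration of a stuck daemon):

cman_tool services          # see which subsystems are still registered
killall -TERM rgmanager     # soft kill first
sleep 5
killall -9 rgmanager        # hard kill only if it is still hanging around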


Met vriendelijke groet, Kind Regards,

Jaap P. Dijkshoorn
Group Leader Cluster Computing
Systems Programmer
mailto:***@sara.nl http://home.sara.nl/~jaapd

SARA Computing & Networking Services
Kruislaan 415 1098 SJ Amsterdam
Tel: +31-(0)20-5923000
Fax: +31-(0)20-6683167
http://www.sara.nl
Post by i***@logicore.net
Post by i***@logicore.net
Fixed my shut down problems so anyone else having issues, here's how
it works.
Man for cman_tool says;
//
leave;
Tells CMAN to leave the cluster. You cannot do this if there are subsystems
(eg DLM, GFS) active.
You should dismount all GFS filesystems, shutdown CLVM, fenced and anything
else using the cluster manager before using cman_tool leave. Look at
cman_tool status|services to see how many (and which) services are
running.
\\
Answers all :).
Mike
i***@logicore.net
2007-01-09 15:45:43 UTC
Hi there,

This is my current shutdown script, which works on some servers but not on others.

/etc/init.d/httpd stop
umount /var/www
vgchange -aln
/etc/init.d/clvmd stop
fence_tool leave
/etc/init.d/fenced stop
cman_tool leave
killall ccsd

I run it...

Deactivating VG VolGroup01: [ OK ]
Deactivating VG VolGroup02: [ OK ]
Deactivating VG VolGroup03: [ OK ]
Deactivating VG VolGroup04: [ OK ]
Stopping clvm: [ OK ]
Stopping fence domain: [ OK ]
cman_tool: Can't leave cluster while there are 4 active subsystems

# cman_tool services
Service Name GID LID State Code
User: "usrm::manager" 13 6 run -
[2 3 4 6 5 7 8]

What's usrm::manager? I can't seem to find anything on the Red Hat site, and
online searches lead to endless 'stuff'. I'm guessing whatever this is, it's
the problem?

Mike
Post by Jaap Dijkshoorn
You should check with cman_tool services(said below), which services are
still running/updating/joining etc. It sometimes happen that a service
cant be shutdown nicely. You can try to kill daemons by hand with a
soft/hard kill.
Met vriendelijke groet, Kind Regards,
Jaap P. Dijkshoorn
Group Leader Cluster Computing
Systems Programmer
SARA Computing & Networking Services
Kruislaan 415 1098 SJ Amsterdam
Tel: +31-(0)20-5923000
Fax: +31-(0)20-6683167
http://www.sara.nl
Christopher Hawkins
2007-01-09 15:55:51 UTC
Are you unmounting the GFS filesystem first? That should be the first thing
in your script...
-----Original Message-----
Sent: Tuesday, January 09, 2007 10:46 AM
To: linux clustering
Subject: RE: [Linux-cluster] Can't leave cluster
Hi there,
This is my current shutdown script which works on some
servers, not on others.
/etc/init.d/httpd stop
umount /var/www
vgchange -aln
/etc/init.d/clvmd stop
fence_tool leave
/etc/init.d/fenced stop
cman_tool leave
killall ccsd
I run it...
Deactivating VG VolGroup01: [ OK ]
Deactivating VG VolGroup02: [ OK ]
Deactivating VG VolGroup03: [ OK ]
Deactivating VG VolGroup04: [ OK ]
Stopping clvm: [ OK ]
Stopping fence domain: [ OK ]
cman_tool: Can't leave cluster while there are 4 active subsystems
# cman_tool services
Service Name GID LID State Code
User: "usrm::manager" 13 6 run -
[2 3 4 6 5 7 8]
What's usrm::manager? I can't seem to find anything on the
redhat site and online searches lead to endless 'stuff'. I'm
guessing what ever this is, it's the problem?
Mike
Post by Jaap Dijkshoorn
You should check with "cman_tool services" (mentioned below) which services
are still running/updating/joining, etc. It sometimes happens that a
service can't be shut down nicely. You can try to kill the daemons by hand
with a soft/hard kill.
Met vriendelijke groet, Kind Regards,
Jaap P. Dijkshoorn
Group Leader Cluster Computing
Systems Programmer
SARA Computing & Networking Services
Kruislaan 415 1098 SJ Amsterdam
Tel: +31-(0)20-5923000
Fax: +31-(0)20-6683167
http://www.sara.nl
i***@logicore.net
2007-01-09 16:03:33 UTC
Yup, it's the second item in my script.
Post by Christopher Hawkins
Are you unmounting the GFS filesystem first? That should be the first thing
in your script...
-----Original Message-----
Sent: Tuesday, January 09, 2007 10:46 AM
To: linux clustering
Subject: RE: [Linux-cluster] Can't leave cluster
Hi there,
This is my current shutdown script which works on some
servers, not on others.
/etc/init.d/httpd stop
umount /var/www
vgchange -aln
/etc/init.d/clvmd stop
fence_tool leave
/etc/init.d/fenced stop
cman_tool leave
killall ccsd
I run it...
Deactivating VG VolGroup01: [ OK ]
Deactivating VG VolGroup02: [ OK ]
Deactivating VG VolGroup03: [ OK ]
Deactivating VG VolGroup04: [ OK ]
Stopping clvm: [ OK ]
Stopping fence domain: [ OK ]
cman_tool: Can't leave cluster while there are 4 active subsystems
# cman_tool services
Service Name GID LID State Code
User: "usrm::manager" 13 6 run -
[2 3 4 6 5 7 8]
What's usrm::manager? I can't seem to find anything on the
redhat site and online searches lead to endless 'stuff'. I'm
guessing what ever this is, it's the problem?
Mike
Post by Jaap Dijkshoorn
You should check with "cman_tool services" (mentioned below) which services
are still running/updating/joining, etc. It sometimes happens that a
service can't be shut down nicely. You can try to kill the daemons by hand
with a soft/hard kill.
Met vriendelijke groet, Kind Regards,
Jaap P. Dijkshoorn
Group Leader Cluster Computing
Systems Programmer
SARA Computing & Networking Services
Kruislaan 415 1098 SJ Amsterdam
Tel: +31-(0)20-5923000
Fax: +31-(0)20-6683167
http://www.sara.nl
Christopher Hawkins
2007-01-09 16:13:44 UTC
-----Original Message-----
Sent: Tuesday, January 09, 2007 11:04 AM
To: linux-cluster
Subject: RE: [Linux-cluster] Can't leave cluster
Yup, it's the second item in my script.
Wow, a serious blonde moment. I have had the same issue from time to time
(with starting as well as stopping) if the scripts go too fast. I don't
recall which component was being sensitive, but you might try adding a "sleep 5"
here and there, or running the commands manually with a good pause
between them, and see if that changes anything.
Robert Peterson
2007-01-09 16:07:54 UTC
Post by i***@logicore.net
# cman_tool services
Service Name GID LID State Code
User: "usrm::manager" 13 6 run -
[2 3 4 6 5 7 8]
What's usrm::manager? I can't seem to find anything on the redhat site and
online searches lead to endless 'stuff'. I'm guessing whatever this is, it's
the problem?
Mike
Hi Mike,

That's for rgmanager I think. Perhaps your script should also do:
service rgmanager stop

Regards,

Bob Peterson
Red Hat Cluster Suite
i***@logicore.net
2007-01-09 21:12:25 UTC
Thanks Bob,

Can't recall if I replied to this, but I have one other question.
Post by i***@logicore.net
What's usrm::manager? I can't seem to find anything on the redhat site and
online searches lead to endless 'stuff'. I'm guessing what ever this is,
it's the problem?
service rgmanager stop
That was indeed what it was. Here is my final shutdown script;

service httpd stop
umount /var/www
vgchange -aln
service clvmd stop
fence_tool leave
service fenced stop
service rgmanager stop
cman_tool leave
killall ccsd

Two questions;

1: I probably don't need the last line in there, correct?

2: Can I create a new service so that I can run this script to shut things
down cleanly when I want to reboot the node? If so, what is the process?

Mike
Robert Peterson
2007-01-12 17:27:25 UTC
Post by i***@logicore.net
That was indeed what it was. Here is my final shutdown script;
service httpd stop
umount /var/www
vgchange -aln
service clvmd stop
fence_tool leave
service fenced stop
service rgmanager stop
cman_tool leave
killall ccsd
Two questions;
1: I probably don't need the last line in there correct?
2: Can I create a new service so that I can run this script to shut things
down cleanly when I want to reboot the node? If so, what is the process?
Mike
Hi Mike,

1. I recommend "service ccsd stop" rather than killall ccsd.
2. In theory, this script should not be necessary on a RHEL, Fedora Core
or CentOS box if you have your service scripts set up and chkconfig'ed on.
When you do /sbin/reboot, the service scripts are supposed to run in the
correct order and take care of all this for you. Shutdown should take you
to runlevel 6, which should run the shutdown scripts in /etc/rc.d/rc6.d in
the Kxx order. The httpd script should stop that service, then the gfs
script should take care of unmounting the gfs file systems at "stop".
The clvmd script should take care of deactivating the vgs. And the Kxx
numbers should be set properly at install time to ensure the proper order.
If there's a problem shutting down with the normal scripts, perhaps we
need to file a bug and get the scripts changed.

Regards,

Bob Peterson
Red Hat Cluster Suite
i***@logicore.net
2007-01-12 21:19:15 UTC
Post by Robert Peterson
1. I recommend "service ccsd stop" rather than killall ccsd.
This is actually my latest. While at times it does not *seem* to work, it does
take the node out of the cluster cleanly. I say "seem" because it tells me that
the node is still in the cluster, yet it's not.

# more stop_gfs
service httpd stop
umount /var/www
vgchange -aln
service clvmd stop
fence_tool leave
service fenced stop
cman_tool leave
service rgmanager stop
sleep 5
service cman stop
Post by Robert Peterson
2. In theory, this script should not be necessary on a RHEL, Fedora Core
or centos box if you have your service scripts set up and chkconfig'ed on.
When you do /sbin/reboot, the service scripts are supposed to run in the
correct order and take care of all this for you.
Never had, don't know why. I always figured it was because of the way I have to
start my nodes. I wanted to add my shutdown script into the shutdown run
levels so that it's automatic, but I am not sure how to add that in.
Post by Robert Peterson
Shutdown should take you to runlevel 6, which should run the shutdown scripts
in /etc/rc.d/rc6.d in the Kxx order.
Do you mean I should just copy my shutdown script into that directory?
Post by Robert Peterson
If there's a problem shutting down with the normal scripts, perhaps
we need to file a bug and get the scripts changed.
Well, here is my startup script for each node, maybe the answer lies in how I
start them?

depmod -a
modprobe dm-mod
modprobe gfs
modprobe lock_dlm

service rgmanager start
ccsd
cman_tool join -w
fence_tool join -w
clvmd
vgchange -aly
mount -t gfs /dev/VolGroup04/web /var/www/

cp -f /var/www/system/httpd.conf /etc/httpd/conf/.
cp -f /var/www/system/php.ini /etc/.
/etc/init.d/httpd start

Mike
Robert Peterson
2007-01-12 21:53:58 UTC
Hi Mike,
Post by i***@logicore.net
Post by Robert Peterson
1. I recommend "service ccsd stop" rather than killall ccsd.
This is actually my latest. While it does not *seem* to work at times, it does
take the node out of the cluster cleanly. I say seem because it tells me that
the node is still in the cluster yet it's not.
# more stop_gfs
service httpd stop
umount /var/www
vgchange -aln
service clvmd stop
fence_tool leave
service fenced stop
cman_tool leave
service rgmanager stop
sleep 5
service cman stop
Shouldn't there be a "service ccsd stop" at the end?
Post by i***@logicore.net
Never had, don't know why. Always figured it was because of the way I have to
start my nodes. I wanted to add my shutdown script into the shutdown run
levels so that it's automatic but am not sure how to add that in.
Well, normally the scripts are all in /etc/init.d/ and are the same for
startup and shutdown. In the runlevel directories, /etc/rc.d/rc3.d
(runlevel 3), /etc/rc.d/rc5.d (runlevel 5) and /etc/rc.d/rc6.d (shutdown),
there are symlinks to the scripts. If they start with Sxx they run at
startup, and if they start with Kxx they're run at shutdown at that
runlevel.

These symlinks for the runlevels are created by the chkconfig tool.
So if I do the command "chkconfig ccsd on" it creates the symlinks for me
at the appropriate runlevels.

Ordinarily, the scripts have comments at the top that the "chkconfig"
tool uses to figure out how to name these symlinks. So if you look at
the top of /etc/init.d/ccsd, you'll see something like this:

# chkconfig: 345 20 80

The 345 means it starts up at runlevels 3, 4 and 5. The "20" means it
symlinks S20ccsd in /etc/rc.d/rc3.d/ to /etc/init.d/ccsd. The 80 means
K80 at runlevel 6. The scripts are run in numerical order, so "S20" will
be run before any of the S21 scripts, etc. And K80 will be run after K79.
Post by i***@logicore.net
Do you mean I should just copy my shutdown script into that directory?
Not exactly. What I meant is that your script should not be necessary
because the shutdown init scripts should run automatically and take care
of everything for you. If you really want to use your script, you can add
the appropriate comments to your script, copy it to /etc/init.d, and do
"chkconfig <script name> on" to create the symlinks.
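A minimal skeleton for such a wrapper might look something like this (the chkconfig numbers and the description are placeholders; pick priorities that fit around the cluster scripts):

#!/bin/bash
#
# stop_gfs   Example wrapper around the manual cluster stop steps
#
# chkconfig: 345 99 01
# description: placeholder wrapper; adjust start/stop priorities as needed
#
case "$1" in
  start)
    # nothing special to do at startup in this sketch
    ;;
  stop)
    # the manual stop steps (umount GFS, stop clvmd, fenced, cman, ...) go here
    ;;
  *)
    echo "Usage: $0 {start|stop}"
    ;;
esac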
Post by i***@logicore.net
Post by Robert Peterson
If there's a problem shutting down with the normal scripts, perhaps
we need to file a bug and get the scripts changed.
Well, here is my startup script for each node, maybe the answer lies in how I
start them?
depmod -a
modprobe dm-mod
modprobe gfs
modprobe lock_dlm
service rgmanager start
ccsd
cman_tool join -w
fence_tool join -w
clvmd
vgchange -aly
mount -t gfs /dev/VolGroup04/web /var/www/
cp -f /var/www/system/httpd.conf /etc/httpd/conf/.
cp -f /var/www/system/php.ini /etc/.
/etc/init.d/httpd start
Mike
Okay, so maybe you just need to do:
chkconfig cman on; chkconfig ccsd on; chkconfig gfs on; chkconfig clvmd on;
chkconfig rgmanager on; chkconfig fenced on; chkconfig httpd on; and so forth,
so they're started up at boot time, and taken down in the correct order
at shutdown time.

Regards,

Bob Peterson
Red Hat Cluster Suite
i***@logicore.net
2007-01-13 23:16:25 UTC
Ok, I've been trying it the way you've suggested. I auto-start the services on
the node, then run a script to join the node to the cluster and start a service, in this case;

cman_tool -t 120 join -w
fence_tool -t 120 join -w
vgchange -aly
mount -t gfs /dev/VolGroup04/web /var/www/
cp -f /var/www/system/httpd.conf /etc/httpd/conf/.
cp -f /var/www/system/php.ini /etc/.
/etc/init.d/httpd start

This works just fine. Now, when I try to remove the node from the cluster, I
still get;

cman_tool: Can't leave cluster while there are 3 active subsystems

cman_tool services shows that rgmanager is still running. I stop that, same
problem, node is still in the cluster. What next?

My stop script is;

/etc/init.d/httpd stop
umount /var/www
vgchange -aln
fence_tool leave
cman_tool leave remove -w

Mike
Robert Peterson
2007-01-15 01:51:00 UTC
Post by i***@logicore.net
Ok, I've been trying it the way you've suggested. I auto start the services on
the node, then run a script to join the node and start a service in this case;
cman_tool -t 120 join -w
fence_tool -t 120 join -w
vgchange -aly
mount -t gfs /dev/VolGroup04/web /var/www/
cp -f /var/www/system/httpd.conf /etc/httpd/conf/.
cp -f /var/www/system/php.ini /etc/.
/etc/init.d/httpd start
This works just fine. Now, when I try to remove the node from the cluster, I
still get;
cman_tool: Can't leave cluster while there are 3 active subsystems
cman_tool services shows that rgmanager is still running. I stop that, same
problem, node is still in the cluster. What next?
My stop script is;
/etc/init.d/httpd stop
umount /var/www
vgchange -aln
fence_tool leave
cman_tool leave remove -w
Mike
Hi Mike,

In theory, the script to join the node should not be necessary because the
cman init script should do the cman_tool join, the fenced init script should
do the fence_tool join, the clvmd script should do the vgchange -aly,
and the httpd init script should take care of httpd.

When the node is shut down, the rgmanager script should stop it.
And if all the start and stop scripts are set to run in their appropriate
run levels, there shouldn't be any resources left in cman_tool services
to keep the shutdown from occurring normally. Perhaps you can do:

chkconfig --list | grep "cman\|rgmanager\|fenced\|ccsd\|clvmd\|gfs\|httpd"

and make sure the cluster services are all listed as "on" for 3, 4, and 5.
I believe these things shouldn't require any extra scripts to start or stop,
and if they are required, maybe we (or I) need to change the init scripts.
If average users are having problems with the scripts, let's get them fixed.

Regards,

Bob Peterson
Red Hat Cluster Suite
i***@logicore.net
2007-01-15 06:15:41 UTC
Post by Robert Peterson
In theory, the script to join the node should not be necessary because the
cman init script should do the cman_tool join, the fenced init script should
do the fence_tool join, the clvmd script should do the vgchange -aly,
and the httpd init script should take care of httpd.
Got it, and I've changed everything as it should be now. Those were left over
from my ongoing learning about GFS clustering.
Post by Robert Peterson
I believe these things shouldn't require any extra scripts to start or stop,
and if they are required, maybe we (or I) need to change the init scripts.
If average users are having problems with the scripts, let's get them fixed.
Seems to be working fine now. The only error I got was when I first restarted
each node after changing the run levels. I found;

Jan 14 23:59:04 compdev fenced[10735]: fencing node "cweb92.domain.com"
Jan 14 23:59:04 compdev fenced[10735]: agent "fence_brocade" reports: parse
error: unknown option "nodeid=92"
Jan 14 23:59:04 compdev fenced[10735]: fence "cweb92.domain.com" failed

Just wondering... should that node ID be the same as the node name?

Mike
Robert Peterson
2007-01-15 15:29:03 UTC
Post by i***@logicore.net
Seems to be working fine now. The only error I got was when I first restarted
each node after changing the run levels, I found;
Jan 14 23:59:04 compdev fenced[10735]: fencing node "cweb92.domain.com"
Jan 14 23:59:04 compdev fenced[10735]: agent "fence_brocade" reports: parse
error: unknown option "nodeid=92"
Jan 14 23:59:04 compdev fenced[10735]: fence "cweb92.domain.com" failed
Just wondering... should that node ID be the same as the node name?
Mike
Hi Mike,

Sounds like a small problem with your cluster.conf file.
If you post it here or email it to me, I can probably tell you what's wrong.
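For reference, a fence_brocade setup in cluster.conf usually looks roughly like this (the device name, address, credentials and port numbers below are placeholders):

<clusternode name="cweb92.domain.com" votes="1">
  <fence>
    <method name="1">
      <device name="brocade1" port="5"/>
    </method>
  </fence>
</clusternode>
...
<fencedevices>
  <fencedevice agent="fence_brocade" name="brocade1" ipaddr="A.B.C.D" login="admin" passwd="password"/>
</fencedevices>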

Regards,

Bob Peterson
Red Hat Cluster Suite
Mike Papper
2007-01-16 01:24:45 UTC
Hi, I am considering using GFS + Linux Cluster so that multiple clients
can "share" the same filesystem. An alternative I am considering would
let me avoid using GFS, which I believe would greatly reduce the
complexity of our system.

I am hoping some users of GFS have encountered these issues before and
can offer feedback; any is appreciated.

The alternative I am considering is to have a single filesystem
available to many clients using a SAN (iSCSI in this case). However only
one client would mount the filesystem (Reiser, XFS etc.) as read/write
while the others would mount it read-only. For my application, all files
are written once then only ever read or deleted.

Is it the case that when a new file is added (by the writer machine),
the clients that have it mounted read-only would see and be able to
read this new file?

Does this apply to symbolic-link files as well?

Does anyone have experience with such a configuration?

Mike
Nate Carlson
2007-01-16 05:26:29 UTC
Post by Mike Papper
The alternative I am considering is to have a single filesystem
available to many clients using a SAN (iSCSI in this case). However only
one client would mount the filesystem (Reiser, XFS etc.) as read/write
while the others would mount it read-only. For my application, all files
are written once then only ever read or deleted.
From everything I've read, this will *not* work. Read the list archives;
there has been a lot of discussion of this before.

------------------------------------------------------------------------
| nate carlson | ***@natecarlson.com | http://www.natecarlson.com |
| depriving some poor village of its idiot since 1981 |
------------------------------------------------------------------------
Mike Papper
2007-01-16 01:24:48 UTC
Hi,

I would like to use GFS to enable multiple clients to access one large
filesystem supported via an iSCSI SAN. The files are written once and
then only read or deleted. In some ways GFS may be overkill for this
application (because I do not need to support appending/writing to a
file once its created) but it enables multiple clients access to a
single filesystem.

I know that GFS and the Linux Cluster are available on Red Hat Enterprise
Linux as well as CentOS and Fedora. I believe the cost of RHEL is very large
($1000 per client for RHEL plus another $2200 per client for the cluster
software) and I am seeking an alternative...

I would appreciate feedback concerning these items:

1) Is the CentOS or Fedora Core 6 version of Cluster "production ready"?
2) Does anyone have any experience they can share using these other
OSes to install and configure GFS?
3) If I use CentOS and add the Linux Cluster (I am talking about the
link on their site to download GFS et al.) what is involved (assuming
that I can start with the latest Cent OS) in terms of installation to
make it work?
4) Similar to above but with Fedora Core 6 - what extra work do I need
to do to install Linux Cluster + GFS (I'm referring to things like
recompiling the kernel, putting in a kernel patch, installing RPMs, etc.).
5) Is it advisable to put millions of files in a single directory? I
know that GFS has published limits of how many files per directory etc.
(although I can't recall the exact numbers right now) but is it OK to go
up to these limits without a performance penalty?
5a) Has anyone had experience with a large number of files or
directories per directory that was still under the published GFS limits,
yet ran into performance issues?

Any ideas on a good, clean way to get Linux Cluster + GFS running on our
system is appreciated.

Mike
Nate Carlson
2007-01-16 04:24:47 UTC
[Answering questions that I know the answer for.]
Post by Mike Papper
1) is the CentOS or Fedora Core 6 version of Cluster "production ready"
The CentOS implementation is built from the same sources as the RHEL4
code. The CentOS team just rebuilds it. I have no experience with the FC6
version.
Post by Mike Papper
2) Does anyone have an experience that they can share using these other
OS to install and configure GFS?
CentOS works just fine.
Post by Mike Papper
3) If I use CentOS and add the Linux Cluster (I am talking about the
link on their site to download GFS et al.) what is involved (assuming
that I can start with the latest Cent OS) in terms of installation to
make it work?
The same as RHEL4.
Post by Mike Papper
Any ideas on a good, clean way to get Linux Cluster + GFS running on our
system is appreciated.
CentOS is functionally identical to RHEL4 - it's just rebuilt from the
source RPMs that Red Hat provides, with some additional minor patches and
tweaks. However, if you run into problems, you don't have official Red Hat
support behind you to help get it fixed. Since the GFS subsystem is rather
complex, it can be very nice to have support when it breaks. :)

------------------------------------------------------------------------
| nate carlson | ***@natecarlson.com | http://www.natecarlson.com |
| depriving some poor village of its idiot since 1981 |
------------------------------------------------------------------------
Lon Hohberger
2007-01-16 21:27:14 UTC
Post by Mike Papper
I know that GFS and the Linux Cluser are available on red Hat Enterprise
as well as CentOS and Fedora. I believe the cost of RH is very large
($1000 per client for RHEL plus another $2200 per client for the cluster
software) and I am seeking an alternative...
1) is the CentOS or Fedora Core 6 version of Cluster "production ready"
CentOS 4 probably is fine; FC6 packages are maybe not quite as stable
(yet) as the release on CentOS 4 or RHEL 4. Of course, use the latest
packages in any case, and definitely report bugs you find.

(Obligatory note: if you need someone to call if it breaks, you still
might want to consider RHEL + RHGFS.)
Post by Mike Papper
2) Does anyone have an experience that they can share using these other
OS to install and configure GFS?
I think it's part of the FC6 install.
Post by Mike Papper
3) If I use CentOS and add the Linux Cluster (I am talking about the
link on their site to download GFS et al.) what is involved (assuming
that I can start with the latest Cent OS) in terms of installation to
make it work?
There should be no tricks; installation should "just work" in all of the
cases you mentioned.
Post by Mike Papper
4) Similar to above but with Fedora Core 6 - what extra work do I need
to do to install Linux Cluster + GFS (I', referring to things like
recompiling the kernel, putting in a kernel patch, installing RPMs etc.).
WRT FC6... Configuration should be similar to either of the previous
versions. The FAQ should have lots of relevant information, as well.

-- Lon

i***@logicore.net
2007-01-09 21:39:50 UTC
Same problem again on another node.

stop-cluster-script

service httpd stop
umount /var/www
vgchange -aln
service clvmd stop
fence_tool leave
service fenced stop
service rgmanager stop
cman_tool leave
killall ccsd

I run it and;

]# ./stop_gfs
Stopping httpd: [ OK ]
Found duplicate PV y6nVM03KVVWs0v68yQVmiGruP5hOSv1z: using /dev/sdd not
/dev/sda
Found duplicate PV wv0qVlspVX11RBlVI5IKyXLAVoH0eiZ3: using /dev/sde not
/dev/sdb
Found duplicate PV t9Fwnx7n6vrPpCZ8d3XKyO6V6cIvqeWR: using /dev/sdf not
/dev/sdc
0 logical volume(s) in volume group "VolGroup01" now active
0 logical volume(s) in volume group "VolGroup04" now active
0 logical volume(s) in volume group "VolGroup03" now active
0 logical volume(s) in volume group "VolGroup02" now active
Found duplicate PV y6nVM03KVVWs0v68yQVmiGruP5hOSv1z: using /dev/sdd not
/dev/sda
Found duplicate PV wv0qVlspVX11RBlVI5IKyXLAVoH0eiZ3: using /dev/sde not
/dev/sdb
Found duplicate PV t9Fwnx7n6vrPpCZ8d3XKyO6V6cIvqeWR: using /dev/sdf not
/dev/sdc
Found duplicate PV y6nVM03KVVWs0v68yQVmiGruP5hOSv1z: using /dev/sdd not
/dev/sda
Found duplicate PV wv0qVlspVX11RBlVI5IKyXLAVoH0eiZ3: using /dev/sde not
/dev/sdb
Found duplicate PV t9Fwnx7n6vrPpCZ8d3XKyO6V6cIvqeWR: using /dev/sdf not
/dev/sdc
Deactivating VG VolGroup01: [ OK ]
Deactivating VG VolGroup02: [ OK ]
Deactivating VG VolGroup03: [ OK ]
Deactivating VG VolGroup04: [ OK ]
Stopping clvm: [ OK ]
Stopping fence domain: [ OK ]
Cluster Service Manager is stopped.
cman_tool: Can't leave cluster while there are 2 active subsystems

# cman_tool services
Service Name GID LID State Code
Ramon van Alteren
2007-01-09 12:32:07 UTC
Post by i***@logicore.net
Ok, confusion again... why does this work on one node but not another. They
are identical nodes in every way.
# more stop_gfs
/etc/init.d/httpd stop
umount /var/www
vgchange -aln
/etc/init.d/clvmd stop
fence_tool leave
/etc/init.d/fenced stop
cman_tool leave
killall ccsd
On some nodes, I'm still getting;
cman_tool: Can't leave cluster while there are 1 active subsystems
Check with cman_tool services

One that bit me before is *thinking* I had unmounted a GFS filesystem when
the umount had actually failed, because I was still running NFS, which
exported one of the GFS filesystems.
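A quick sanity check before assuming the umount worked, for example:

mount | grep gfs        # is the filesystem actually gone?
fuser -vm /var/www      # what is still holding it?
exportfs -v             # is it still exported over NFS?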

Ramon
Bowie Bailey
2007-01-09 17:17:04 UTC
Post by Christopher Hawkins
Wow, a serious blonde moment. I have had the same issue from time to
time (with starting as well as stopping) if the scripts go too fast.
I don't recall which component was being sensitive, but you might try
adding a sleep 5 here and there or running the commands manually, but
with a good pause between them, and see if that changes anything.
I had the same issue with CMAN failing to stop. I found that adding a
"sleep 5" before the call to cman_tool in the init script fixed it.
--
Bowie