Josh Moore
http://home.attmil.ne.jp/a/jm/
Building a Beowulf Cluster
Introduction-
I had just finished the Linux from Scratch Project about a week ago and
thought I should find something else to try to do. I looked around the
internet and thought I would try to build a Beowulf cluster. During
construction of it, I noticed a lack of guides and decided to write one
for others interested in building their own cluster. This is not the
only way to build a cluster, but it is the way I built mine and will be
a guide on how to build a cluster similar to the one I built.
Before we get started into any technical details, you must first know
what a Beowulf cluster is. A Beowulf cluster is nothing more then a
group of two or more computers that are networked together and allow
programs to be split into parts and run between the computers to speed
up tasks. The first Beowulf cluster was built by two NASA engineers in
1994. It was constructed out of 16 Intel DX4 processors running at 100
MHz each. All the computers in a cluster are called nodes. A lot of
groups who need huge processing power (universities, labs, engineering
groups, NASA) often use clusters instead of super computers. Why spend
tons of money on an eight processor computer when it would be cheaper
to build a cluster of eight computers that would have the same or
greater performance results? As with SMP (multiple processors), not
just any program can be ran to take advantage of a cluster. It must
first be programmed and compiled to use message passing (clustering
software). This is just a watered down introduction of what a cluster
is, but it should be enough to get a basic idea for the rest of the
document.
For my first cluster, I built a few computers from parts laying around
the office. They consist of three computers ranging from Pentium 133
MHz to Pentium 200 MHz. All computers had between 64 and 128 megs of
ram. All are keyboard and monitor less but allow SSH log ins for
administration. The cluster also sits behind my firewall on the same
network as my other computers. I used Red Hat 9.0 as the operating
system on all nodes. Any setup will work but this is what seemed to
work best for me since it required no money and was purely a project to
use for future reference while building other clusters.
Nodes-
Most clusters have a master node which controls everything and a group
of slave nodes. The master node is supposed to be the fastest computer
and the computer with the most disk space. Once the cluster is
completed, users will only need to log into the master node to run jobs
and perform other tasks.
When and Where-
Once the nodes are setup and working the way you want, you can begin to
start the actual setup of clustering. The first step is to edit the
host file on each node and make such all the nodes can talk to each
other by name. It would be a good idea to have the nodes assigned a
static IP or have a DHCP server issue them IP addresses based on MAC
address. I used master, slave1, slave2, etc... to keep things simple.
For those in LA county, you might need to go with a different naming
scheme. It is also important for all nodes to have the same time set.
You should already have all clocks synchronized since it is a network
environment and helps with intrusion detection and other tasks. It is
best to have an internal computer running a network time protocol
server and synchronize all the computers on the network with it. If
not, synchronizing all the clocks from an external NTP server would
work as long as it is the same server.
Who-
All nodes also must have a similar user with identical home directories
and settings. I just added a user named cluster to keep things
simple. I also exported the /home directory on the master node and
mounted it to all the slaves using NFS. This ensured that all nodes
would have the same user configuration files and work space.
SSH versus RSH-
To be able to run jobs on each other, all nodes most be able to log
into and run commands on the other nodes. This can be done one of two
ways. You can setup SSH to require no password by using a key file.
This will add some security as it is encrypted, but will also slow the
cluster down a little bit since it has to be encrypted and decrypted.
Since all I am doing is running programs I don't really care if the
packets are encrypted, so I went with the second option. The second
option is using RSH. It can be a little trickier to setup but works
fine once it is up and working. If you want to use RSH but are really
paranoid, you could always add a second NIC to the master node and hook
it up to your main network and firewall that connection. It would also
improve performance a little, as the cluster is on its own network and
doesn't have to worry about other traffic.
Since most people probably don't know how to use RSH I will explain how
to set it up. The following must be done on ever cluster. First edit
the /etc/hosts.equiv file and add the host name of all the computers in
the cluster. Edit /etc/xinetd.d/rsh and /etc/xinetd.d/rlogin files and
change disable=yes to disable=no. This line can be at the top or
bottom of the file and I have seen it at both places on RH9. Also edit
/etc/pam.d/rlogin and move 'auth sufficient pam_rhosts_auth.so' from
the middle to the top of the file. Restart xinetd. Test it out by
issuing simple commands such as 'rsh host_name uptime'.
Message Passing-
Now it is time to install message passing software. There are several
choices, some better then others, but I went with MPICH. The
Mathematics and Computer Science Division at Argonne National
Laboratory at www-unix.mcs.anl.gov describes MPICH as "a freely
available, portable implementation of MPI, the Standard for message
passing libraries." Once MPICH was downloaded, I extracted it into the
cluster home directory and built at according to the install guide. It
is always a good idea to follow the install guide. Installation was
not too difficult. ./configure, make, su -c make install.
After MPICH is installed to master node, I exported it's directory via
NFS and mounted it to all the slave nodes. This will help make future
upgrades easier and causes no measurable decrease in performance.
There is on last step in configuring MPICH. The machines.LINUX file
under the share directory of MPICH must be edited. By default
localhost.localdomain is listed five times. Delete all occurrences of
it and list the host names of all the nodes in the cluster. Also edit
your .bash_profile and add the mpich/bin directory to your path. Now
it is time to test everything out.
Testing-
It is important to only perform clustering as the cluster user since
running as root is always bad and anything you do wrong could mess up
several computers because of NFS and RSH. There is a test program
called tstmachines located in the sbin directory of MPICH. Running it
should produce no output if everything goes ok. You can run it with
the -v (verbose) option and should receive something similar to the
following.
Trying true on master ...
Trying true on slave1 ...
Trying true on slave2 ...
Trying ls on master ...
Trying ls on slave1 ...
Trying ls on slave2 ...
Trying user program on master ...
Trying user program on slave1 ...
Trying user program on slave2 ...
Once all the work and troubleshooting is done, you are ready to
actually run programs. This can be done with the 'mpirun -np #
-nolocal program name' Replace # with the number of processors you
want to be used. The -nolocal option tells MPICH to run the program
across the nodes instead of being ran on the same node # times. MPICH
comes with a sample program called cpi which will calculate PI and give
you a rough benchmark of the system. It is best to run it on each node
then on all nodes and compare the results. If all goes well, you can
get into some serious clustering.
My own programs-
You might be thinking, great now I can take a few computers to a LAN
party and run Quake faster then anyone and be the frag master. It
would be nice if you could, but life is not that simple. I mentioned
earlier that clustering is a lot like multiple CPUs except cheaper.
Not just any program be be ran across a beowulf cluster. It must be
written in parallel and compiled with a parallel compiler. There are
several companies that offer parallel compilers but MPICH comes with a
parallel compiler that works well. The program split up parts of the
program that can be ran on multiple nodes. Parallel programming is
beyond the scope of this document but there are several good books on
the subject and if you have fairly good knowledge of C or FORTRAN, it
shouldn't be too hard to pick up. Below is code from a simple
hello_world.c program that is written in parallel.
#include <stdio.h>
#include "mpi.h"
int main(int argc, char **argv)
{
int myrank, nprocs, namelen;
char processor_name[MPI_MAX_PROCESSOR_NAME];
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Get_processor_name(processor_name, &namelen);
printf("Hello World! I'm rank %d of %d on %s\n", myrank,
nprocs, processor_name);
MPI_Finalize();
return 0;
}
The above prints hello world and tells which node it is from and the
nodes rank when compared to the others. The best way to learn is
probably to look at the source code of other programs and try to learn
from others. There are several programs available that are already
coded in parallel. They include: image and video renders, password
crackers, astronomy and weather simulators, and programs that look for
oil.
What's next?
This was just a basic guide on how to get a cluster working. A cluster
in the real world would benefit from: administration tools so you only
have to type a command on one node and have the command issued on all
nodes which makes upgrades and maintenance easier, monitoring software
to track CPU and memory usage in order to make sure everything is
running smoothly, and job scheduling. Job scheduling makes sure that
all jobs and users share CPU power and makes the cluster more
efficient. This can be very useful when used with monitoring software
to quickly diagnose and fix any decrease in performance. Perhaps
sometime later I will write a part two that explains what to do with
the cluster after it is actually built and give more details on
optional tools that are available and how to use them.