Realization of the Beowulf Cluster:
We started off by loading the master with Red Hat Linux 6.2, kernel 2.2.14-5.0.
Both Bala and I are active Linux proponents, so doing this with Linux
was especially enjoyable! Bala kept telling me that this wouldn't be the final
installation and that we would have to do more installations, but lazy-bones
that I am, I was determined to do as few installations as possible, preferably
one! So I started off with a custom-class installation of the "official"
Red Hat Linux 6.2 standard edition and partitioned the SCSI hard disk
with the following scheme:
| Mount point | Size |
| / | 2.0 GB |
| /usr | 2.2 GB |
| /opt | 1.0 GB |
| /home | 3.0 GB |
| swap | 0.4 GB (400 MB!) |
Actually, I had no idea how big the codes to be run were supposed to be, so at first I had not made the /home partition as large as it appears above; I later managed to swap a couple of partitions when Bala told me about the goof-up! I got away with it (a close shave, I must say), and what I present here is the final partitioning scheme as it stands now.
By the way, I was quite pleasantly surprised to find that the standard Red Hat Linux package contains a lot of tools just for the Beowulf cluster community: the PVM daemon and LAM-MPI (a parallel computing library), for instance! After the installation, which involved downloading the necessary Intel i810 graphics drivers from http://support.intel.com, our next job was to recompile the kernel for the master so that it was PIII-optimised and also supported RARP. After testing the new kernel and the X configuration thoroughly on the master, we moved on to compiling the kernel for the nodes.
We had decided on the most elegant way to boot the diskless workstations: over the network. This meant considerable involvement (read: back-breaking work) with network protocols like BOOTP, RARP, DHCP, NFS, etc., because of one sad fact: the Linux community has become so multi-threaded that everyone seems to be evolving his own standard, which is not compatible with anyone else's! Anyway, we set about configuring the master as a BOOTP server. This worked fine with the built-in bootp server that is started from /etc/inetd.conf, but we wanted full control over the booting process and also wanted to monitor the bootpd exchange messages, so we ran it from the shell prompt in standalone, fully verbose mode ("bootpd -d 4"). This has the added advantage that the bootp server can be shut off after all the clients have booted, conserving the master's system resources. This step was smooth, and the test client correctly got its assigned IP number etc. once Bala had set up the /etc/bootptab file correctly. (I always mess up the simplest things!) We chose bootpd over DHCP because we had no use for dynamic allocation of IP numbers; we wanted the clients to obtain only static IP numbers during bootup.
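To give an idea of what a static /etc/bootptab entry looks like, here is a minimal sketch in Python that builds one line per node. It is not the file we actually used: the host name, MAC address, IP number and boot file name are made-up placeholders, and only the tag names (ht, ha, ip, sm, hd, bf) are the usual bootptab tags.

```python
# Sketch: build /etc/bootptab lines that give each diskless node a fixed IP.
# Every host name, MAC address, IP number and file name here is a placeholder.

def bootptab_entry(host, mac, ip, boot_file,
                   netmask="255.255.255.0", boot_dir="/tftpboot"):
    """Return one bootptab line for a diskless node.

    Tags: ht = hardware type, ha = hardware address, ip = client IP,
    sm = subnet mask, hd = boot file directory, bf = boot file.
    """
    # bootptab uses ':' as its field separator, so the MAC address
    # is written as twelve bare hex digits.
    ha = mac.replace(":", "").upper()
    if len(ha) != 12:
        raise ValueError("expected a 6-byte MAC address, got %r" % mac)
    int(ha, 16)  # raises ValueError if the digits are not hexadecimal
    fields = [host, "ht=ethernet", "ha=" + ha, "ip=" + ip,
              "sm=" + netmask, "hd=" + boot_dir, "bf=" + boot_file]
    return ":".join(fields) + ":"


if __name__ == "__main__":
    # Hypothetical node; substitute the real MAC address of each client.
    print(bootptab_entry("node2", "00:a0:c9:12:34:56", "192.168.1.2",
                         boot_file="vmlinuz.node"))
```

Since every client appears in the file with a fixed "ip" tag, nothing is allocated dynamically, which is exactly why bootpd was enough for us.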
Then came the major challenge: making the client request the kernel image via TFTP (Trivial File Transfer Protocol). Nothing we did could make the PXE boot ROM ask for the kernel image and boot it. All the available TFTP daemons seemed to fail to make the proper transfer. There were two error messages: "TFTP: timed out" or "TFTP: Too many packages". There seemed to be some mismatch between the PXE protocol (which is somewhat "non-standard" according to some knowledgeable folks) and the standard TFTP transfer. In a state of sheer desperation, it occurred to me that for the nodes to boot, the OS image probably had to fit within the first 640 KB of RAM, the legendary limitation of DOS! And these 2.2.x kernels were just too big to fit. So I rigged up a new lean, mean kernel of approximately 420 KB and tried again, and voila: the transfer and booting were successful... Moral: with newer versions of the Linux kernel and the corresponding increase in kernel size, one has to be careful to keep the kernel within the 640 KB limit (so that it fits in the "low memory area") if one does not use an initial ramdisk or some other trick. The TFTP servers also probably haven't been (quite logically, perhaps!) designed to handle kernel images of that magnitude (~640 KB). I don't know the exact cut-off size, but the largest kernel I have successfully transferred is a little less than 600 KB, and it was transferred with the tftp-hpa server, which supports the "tsize" option. (More on this later.)
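A quick size check before copying a freshly built kernel into /tftpboot would have saved us a lot of grief. The following is only a sketch of that check in Python: the 640 KB figure is the limit guessed at above, not a measured cut-off, and the image path in the demo is a hypothetical example.

```python
import os

# The 640 KB "low memory" figure discussed in the text, in bytes.
# This is the author's working assumption, not an exact cut-off.
LOW_MEMORY_LIMIT = 640 * 1024


def fits_low_memory(image_size, limit=LOW_MEMORY_LIMIT):
    """Return True if an image of image_size bytes is below the limit."""
    return image_size < limit


def check_kernel_image(path, limit=LOW_MEMORY_LIMIT):
    """Return (size_in_bytes, fits) for the kernel image at path."""
    size = os.path.getsize(path)
    return size, fits_low_memory(size, limit)


if __name__ == "__main__":
    image = "/tftpboot/vmlinuz.node"   # hypothetical path
    if os.path.exists(image):
        size, ok = check_kernel_image(image)
        print("%s: %d bytes, %s" % (image, size,
                                    "fits" if ok else "too big"))
```

With this check, the lean 420 KB kernel passes, while anything at or above 640 KB is flagged before it ever reaches the TFTP server.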
The Intel EtherExpress Pro 100 card fortunately has a boot ROM with PXE (Preboot Execution Environment) burnt in, so we did not have to go to the trouble of burning Etherboot/Netboot onto it. However, PXE being a new technology, there did not seem to be much support for it in Linux, and we had to search the net for a long time for drivers. Actually, the BOOTP part was okay, thanks to Bala, but the next and major hurdle in our way was to find Linux drivers that supported the PXE protocol and the TFTP transfer that follows it. After a lot (and I mean "a lot") of searching on the net and trying out all kinds of things, we finally hit the jackpot with the PXELINUX loader that comes with the Syslinux package. However, PXELINUX has its own requirements: it needs the precious "tftp-hpa" server, which supports the "tsize" option. (Beware, the tftp server has very little documentation, so I can't tell you much about "tsize" except that it's something that the tftp-hpa server supports and PXELINUX can't live without!) We had also compiled the clients' kernel with built-in support for root file system over NFS, RARP and the Intel EtherExpress Pro (100 Mbps) NIC.
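One PXELINUX detail that is easy to get wrong: as far as I understand its documentation, it looks in the "pxelinux.cfg" directory under the TFTP root for a configuration file named after the client's IP number in upper-case hexadecimal, and falls back to a file called "default". The Python sketch below computes that file name and writes out a minimal configuration. The IP number and kernel file name are placeholders, and any kernel command line has to be supplied by you; I have not tried to guess ours here.

```python
# Sketch: per-node PXELINUX configuration naming.
# The IP number and the kernel file name used below are placeholders.

def pxelinux_config_name(ip):
    """Return the hex file name PXELINUX derives from a dotted-quad IP."""
    parts = ip.split(".")
    if len(parts) != 4:
        raise ValueError("expected a dotted-quad IP, got %r" % ip)
    octets = [int(p) for p in parts]
    if any(o < 0 or o > 255 for o in octets):
        raise ValueError("octet out of range in %r" % ip)
    # Two upper-case hex digits per octet, e.g. 192 -> C0.
    return "".join("%02X" % o for o in octets)


def pxelinux_config(kernel, append=""):
    """Return the text of a minimal single-entry PXELINUX config file."""
    lines = ["DEFAULT linux", "LABEL linux", "  KERNEL " + kernel]
    if append:
        # Kernel command line, passed through unchanged.
        lines.append("  APPEND " + append)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    name = pxelinux_config_name("192.168.1.2")
    print("pxelinux.cfg/" + name)
    print(pxelinux_config("vmlinuz.node"))
```

A single "default" file is enough when all nodes boot the same kernel; per-IP files only matter if the nodes need different settings.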
Troubles did not seem to have an end: even after a smooth transfer of the
kernel by TFTP, the kernel refused to boot. The client either frantically
"seek"-ed (I know that the past tense of seek is sought, in case you are
wondering) the floppy disk or the speaker beeped madly! I had surely compiled
support for root file system over NFS and for RARP into the
nodes' kernel and had used rdev to make the root-device bytes of the image point
to NFS, but to no avail: the kernel simply refused to boot. At this
point I received an inspiration in the form of Bala's statement: "Look
boss, none of the kernels you are compiling are booting, okay!" This made
me wonder whether the configuration/compilation process itself was faulty.
Till then I had been configuring/compiling with "make xconfig", and with this
"inspiration" from Bala, I decided to use "make menuconfig" instead. And
hey, I couldn't believe it: the kernel thus compiled booted fine and attempted
to mount its root file system over NFS. Both of us heaved a sigh of relief,
and from then on it was smooth sailing for us. Moral: in our experience, "make
xconfig" somehow produced a broken kernel when we compiled in monolithic (non-modular)
support for "root-file-system over NFS" and probably also "RARP". "make
menuconfig" did it fine.
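In hindsight, comparing the .config files written by the two configuration front ends would have been a faster way to spot the problem than booting kernel after kernel. Here is a small Python sketch of such a comparison. The option names in REQUIRED_BUILTINS are my assumption of what the relevant 2.2-series options are called; check them against your own .config before trusting the result.

```python
# Sketch: check that the options a diskless node needs are built in ("=y"),
# and list the differences between two kernel .config files.
# The option names below are assumptions for 2.2-series kernels.
REQUIRED_BUILTINS = ("CONFIG_ROOT_NFS", "CONFIG_IP_PNP",
                     "CONFIG_IP_PNP_RARP", "CONFIG_EEPRO100")

NOT_SET = " is not set"


def parse_kernel_config(text):
    """Map option name -> value ('y', 'm', 'n' or a literal) from .config text."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# ") and line.endswith(NOT_SET):
            # Disabled options appear as "# CONFIG_FOO is not set".
            options[line[2:-len(NOT_SET)]] = "n"
        elif line and not line.startswith("#") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def missing_builtins(text, required=REQUIRED_BUILTINS):
    """Return the required options that are not compiled in as 'y'."""
    options = parse_kernel_config(text)
    return [name for name in required if options.get(name) != "y"]


def config_differences(text_a, text_b):
    """Return {option: (value_in_a, value_in_b)} for options that differ."""
    a, b = parse_kernel_config(text_a), parse_kernel_config(text_b)
    return {name: (a.get(name), b.get(name))
            for name in sorted(set(a) | set(b))
            if a.get(name) != b.get(name)}
```

Feeding it the .config saved after "make xconfig" and the one saved after "make menuconfig" would show at a glance whether the two really asked for the same kernel. An option built as a module ("=m") counts as missing here, because a diskless node cannot load modules before it has mounted its root.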
After this process of kernel compilation and successful transmission over
the network, some care had to be taken in granting the required NFS
permissions. By the way, NFS export permission needs to be given
for the /tftpboot directory of the master, so that the clients can successfully
boot over the network. (This seemed funny to me!) The root file systems
for all the nodes were created in the directory "nfsroot" in the "/" of
the master, as "/nfsroot/node2", "/nfsroot/node3", etc. These directories
were identical for all practical purposes except for the mandatory changes
in the "/etc" of each node, such as "/etc/HOSTNAME". We also exported the
following file systems from the master to the clients: /usr, /home
and /opt. All the rest of the root directories were created manually (for
node2) and then copied for the other nodes. All the directories (such
as /bin, /sbin and so on) were simply copied from the master using "cp"
(as root, of course), with two very important exceptions: the /dev and /tmp
directories of the nodes. /dev, being a special directory that defines devices,
should be cloned using "tar", and the /tmp directory has
to be created manually by root for every node; most importantly, "777"
permission has to be given to /tmp on all the nodes. This was important
for the parallel programming interface (LAM-MPI) on the nodes to function properly.
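The export and /tmp steps above can be sketched in Python as follows. This is an illustration, not our actual setup script: the node names are examples, and the export options ("rw,no_root_squash") are an assumption on my part, since the text above does not record which options we used.

```python
import os
import stat
import tempfile


def exports_lines(nodes, shared=("/usr", "/home", "/opt", "/tftpboot"),
                  nfsroot="/nfsroot", options="rw,no_root_squash"):
    """Return /etc/exports lines for the shared trees and per-node roots.

    The options string is an assumption; tighten it to suit your site.
    """
    lines = []
    for path in shared:
        # Shared trees are exported to every node.
        clients = " ".join("%s(%s)" % (node, options) for node in nodes)
        lines.append("%s %s" % (path, clients))
    for node in nodes:
        # Each node's root file system is exported to that node only.
        lines.append("%s/%s %s(%s)" % (nfsroot, node, node, options))
    return lines


def make_node_tmp(node_root, mode=0o777):
    """Create <node_root>/tmp and give it the mode named in the text.

    The text uses 777; the conventional mode for /tmp is 1777 (sticky bit).
    """
    tmp = os.path.join(node_root, "tmp")
    os.makedirs(tmp, exist_ok=True)
    os.chmod(tmp, mode)   # explicit chmod, so the umask cannot narrow it
    return tmp


if __name__ == "__main__":
    for line in exports_lines(["node2", "node3"]):
        print(line)
    # Demonstrate on a scratch directory instead of a real /nfsroot.
    scratch = tempfile.mkdtemp()
    tmp = make_node_tmp(scratch)
    print(tmp, oct(stat.S_IMODE(os.stat(tmp).st_mode)))
```

The explicit chmod matters because a plain mkdir is subject to root's umask, which would typically leave /tmp unwritable for ordinary users.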