kamadhenu@Jawaharlal Nehru Centre for Advanced Scientific Research
Additional Documentation:
In this additional documentation page, I will log all the small tidbits of
configuration, tweaking, software installation etc. that I have done. These
may not have been important to the working of the cluster as a whole, but
they perhaps make life a lot easier when you have to live with the cluster.
-
The cluster comprises a private network (subnet) with IP addresses ranging
from 10.0.0.1 (for the master) to 10.0.0.8. The master, with two NICs, has
two IP addresses: one private to the cluster and the other connected to the
LAN. A routing problem with "bootp" packets had to be taken care of by adding
the following line at the end of the /etc/rc.d/rc.local file on the master:
route add -host 255.255.255.255 eth0
(eth0 is the network interface of the master to the cluster subnet, while
eth1 connects to the LAN.)
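If the route line is appended by hand, running the step a second time would add it twice. A minimal sketch of an append that is safe to repeat, assuming a POSIX shell. The helper name `add_route_line` is hypothetical, and it takes the target file as an argument so it can be tried on a scratch copy before touching the real rc.local.

```shell
#!/bin/sh
# Append the bootp broadcast route to an rc.local-style file, once only.
# add_route_line is a hypothetical helper; pass the target file as $1.
ROUTE_LINE='route add -host 255.255.255.255 eth0'

add_route_line() {
    rc_file=$1
    # -x: match the whole line, -F: treat the pattern as a fixed string
    if ! grep -qxF -- "$ROUTE_LINE" "$rc_file" 2>/dev/null; then
        printf '%s\n' "$ROUTE_LINE" >> "$rc_file"
    fi
}

# Example (as root on the master): add_route_line /etc/rc.d/rc.local
```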
-
Installed and configured rsh services for seamless (read: password-less)
login to any node from any other node (including the master).
-
Created a file system on the secondary IDE hard disk with
"mke2fs -b 4096 /dev/hda1". This 12 GB disk is meant for secondary storage
of codes and results. Since the codes are all going to be large files, a
block size of 4 KB was used to obtain faster raw reads, less fragmentation
etc. Could not enable multiword DMA and 32-bit disk access modes because the
IBM disk did not seem to support the parameters passed by hdparm. (kamadhenu
hung when I tried, the only time it has done so!)
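The reasoning behind the larger block size can be illustrated with a little arithmetic: a large file occupies a quarter as many 4 KB blocks as 1 KB blocks, so there are fewer blocks to allocate and track. The sketch below is only an illustration; the helper name `blocks_needed` and the 100 MB file size are hypothetical and not taken from the cluster.

```shell
#!/bin/sh
# blocks_needed is a hypothetical helper: number of file-system blocks
# a file of <size_bytes> occupies at a given <block_size> (rounded up).
blocks_needed() {
    size_bytes=$1
    block_size=$2
    echo $(( (size_bytes + block_size - 1) / block_size ))
}

# A 100 MB result file (illustrative size only):
file_size=$((100 * 1024 * 1024))
echo "1 KB blocks: $(blocks_needed "$file_size" 1024)"   # prints 102400
echo "4 KB blocks: $(blocks_needed "$file_size" 4096)"   # prints 25600
```

The trade-off is that each small file wastes more space in its last, partly filled block, which matters little on a disk reserved for large files.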
-
Installed prsh, a parallel shell specially meant for Beowulf clusters. It
enables sending the same command in parallel to all the nodes of the cluster,
without logging into each of the nodes individually. After installation, it
requires that every user of the cluster has the "PRSH_HOSTS" environment
variable set through their shell initialization file, so that they can type
the command
bash$ prsh -- <command>
to send the <command> in parallel to all nodes. However, it is advisable
that users set their "PRSH_HOSTS" variable only to the slave nodes, so
that they won't bring disaster upon themselves by mistakenly typing
bash$ prsh -- reboot
Alternatively, one can specify the names of all the nodes to which the
command should be sent on the command line itself, as:
bash$ prsh node3 node6 node8 -- tkill
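The advice about leaving the master out of "PRSH_HOSTS" could be put into a shell initialization file along these lines. This is a sketch under assumptions: the node names node1 to node8 (with node1 as the master) are hypothetical, the helper `slave_nodes` is invented for illustration, and a space-separated host list is assumed for "PRSH_HOSTS"; check the prsh documentation for the exact format it expects.

```shell
# Fragment for ~/.bashrc (or an equivalent initialization file).
# slave_nodes is a hypothetical helper: prints node<first>..node<last>
# separated by single spaces.
slave_nodes() {
    first=$1
    last=$2
    list=""
    i=$first
    while [ "$i" -le "$last" ]; do
        list="${list:+$list }node$i"
        i=$((i + 1))
    done
    printf '%s\n' "$list"
}

# Assumption: node1 is the master, so the list starts from node2.
PRSH_HOSTS=$(slave_nodes 2 8)
export PRSH_HOSTS
```

With this in place, a mistyped "prsh -- reboot" would restart only the slave nodes and leave the master running.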
-
Installed bWatch, a Tcl/Tk cluster monitoring tool that shows the CPU load,
free memory etc. of the nodes in a neat GUI, without being resource hungry.
-
Installed LAM-MPI 6.3.2, downloaded from the University of Notre Dame, and
ran the LAM-TEST suite, which is available from the same site.
-
Installed "Smile Queueing Management System", cluster management software
from Thailand that enables queueing of jobs in a cluster when there are
multiple users involved.
-
Installed
"Smile Cluster Management Suite", a cluster management suite again from
Thailand, that is meant for system administration of a cluster. However,
this software was found to be extremely resource hungry on the master,
and was not really used.
-
Applied the TCP patch to the 2.2.14-5.0 kernel that adjusts the TCP timeout
values and acknowledgement packet requests for a private cluster, as is the
case here. Considerable improvements in bandwidth and latency were obtained,
as is evident in the performance results page.
-
Installed Code-Crusader, the Integrated Development Environment that is
fantastically configurable and hence allows one to do a professional job of
code development in scalar as well as parallel environments. I would suggest
that people doing scalar code development also give it a try, because it's
really good. (You would want to throw away the Turbo C compiler and stuff!)
It also supposedly has tight integration with Code-Medic, a graphical
debugging environment. By the way, the people who developed Code-Crusader
are not paying me to write all this! (They believe in Open Source anyway, so
they can't be so cheap, right?)
-
MPICH 1.2.1 was installed.
-
Compiled and installed the Linux 2.2.16-3
kernel with Josip Loncaric's TCP patch.
Page last updated: 27th Jan, 2001.