Testing
Early testing and troubleshooting
Once ZeptoOS is configured and installed, it is time to test it. Here are a few trivial tests to verify that the environment is working:
/bin/sleep
If you are using Cobalt, submit using either of the commands below:
cqsub -k <profile-name> -t <time> -n 1 /bin/sleep 3600
qsub --kernel <profile-name> -t <time> -n 1 /bin/sleep 3600
If you are using mpirun directly, submit as follows:
mpirun -verbose 1 -partition <partition-name> -np 1 -timeout <time> \
    -cwd $PWD -exe /bin/sleep 3600
This test, if successful, will verify that the ZeptoOS compute and I/O node environments are booting correctly. We deliberately chose a system binary such as /bin/sleep instead of something from a network filesystem so that even if the network filesystem does not come up for some reason, the test can still succeed.
If everything works out fine, messages such as the following will be found in the error stream (jobid.error file if using Cobalt):
FE_MPI (Info) : initialize() - using jobname '' provided by scheduler interface
FE_MPI (Info) : Invoking mpirun backend
FE_MPI (Info) : connectToServer() - Handshake successful
BRIDGE (Info) : rm_set_serial() - The machine serial number (alias) is BGP
FE_MPI (Info) : Preparing partition
BE_MPI (Info) : Examining specified partition
BE_MPI (Info) : Checking partition ANL-R00-M1-N12-64 initial state ...
BE_MPI (Info) : Partition ANL-R00-M1-N12-64 initial state = FREE ('F')
BE_MPI (Info) : Checking partition owner...
BE_MPI (Info) : Setting new owner
BE_MPI (Info) : Initiating boot of the partition
BE_MPI (Info) : Waiting for partition ANL-R00-M1-N12-64 to boot...
BE_MPI (Info) : Partition is ready
BE_MPI (Info) : Done preparing partition
FE_MPI (Info) : Adding job
BE_MPI (Info) : Adding job to database...
FE_MPI (Info) : Job added with the following id: 98461
FE_MPI (Info) : Starting job 98461
FE_MPI (Info) : Waiting for job to terminate
BE_MPI (Info) : IO - Threads initialized
BE_MPI (Info) : I/O input runner thread terminated
(we stripped the timestamp prefixes to make the lines shorter)
If these messages are immediately followed by error messages, then there is a problem. One common instance would be:
BE_MPI (Info) : I/O output runner thread terminated
BE_MPI (Info) : Job 98463 switched to state ERROR ('E')
BE_MPI (ERROR): Job execution failed
[...]
BE_MPI (ERROR): The error message in the job record is as follows:
BE_MPI (ERROR): "Load failed on 172.16.3.11: Program segment is not 1MB aligned"
This error indicates that the job was submitted to a default software environment, not to ZeptoOS (at the very least, the default I/O node ramdisk was used). You need to go back to the Kernel Profile section to fix the problem.
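The check described above can be automated. The following is a minimal sketch, not part of the ZeptoOS distribution: the function name scan_error_stream is hypothetical, and the only failure signature it knows is the "not 1MB aligned" message quoted above. It assumes that error lines carry the "(ERROR)" tag, as in the example output.

```python
# Marker copied from the failure example above; other failures will use
# different wording, so treat this as one known signature, not a full list.
ALIGNMENT_ERROR = "Program segment is not 1MB aligned"


def scan_error_stream(text):
    """Return (error_lines, wrong_environment) for an mpirun error stream.

    error_lines: every line tagged "(ERROR)".
    wrong_environment: True if the 1MB-alignment load failure is present,
    which the text above attributes to booting the default environment.
    """
    error_lines = [line.strip() for line in text.splitlines()
                   if "(ERROR)" in line]
    wrong_environment = any(ALIGNMENT_ERROR in line for line in error_lines)
    return error_lines, wrong_environment


# Example input, taken from the failure output shown above.
SAMPLE = """\
BE_MPI (Info) : I/O output runner thread terminated
BE_MPI (Info) : Job 98463 switched to state ERROR ('E')
BE_MPI (ERROR): Job execution failed
BE_MPI (ERROR): The error message in the job record is as follows:
BE_MPI (ERROR): "Load failed on 172.16.3.11: Program segment is not 1MB aligned"
"""

if __name__ == "__main__":
    errors, wrong_env = scan_error_stream(SAMPLE)
    for line in errors:
        print(line)
    if wrong_env:
        print("Job appears to have booted the default environment; "
              "check the kernel profile.")
```

In practice the text passed in would be the contents of the job's error stream, for example the jobid.error file when using Cobalt. Substring matching is used so that the timestamp prefixes present in real output do not matter.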
Log files
Booting problems on I/O nodes and compute nodes can be diagnosed using the system log files, located in /bgsys/logs/BGP/.
Every I/O node has its own log file, with a name such as R*-M*-N*-J*.log. This name generally corresponds to the name of the partition where the job was running. Above, our job ran on ANL-R00-M1-N12-64; the corresponding I/O node log file on Argonne machines is R00-M1-N12-J00.log. This is what a log file from a successful ZeptoOS boot looks like:
Linux version 2.6.16.46-297 (geeko@buildhost) (gcc version 4.1.2 (BGP)) #1 SMP Wed Apr 22 15:04:42 CDT 2009
Kernel command line: console=bgcons root=/dev/ram0 lpj=8500000
init started: BusyBox v1.4.2 (2008-04-10 05:20:01 UTC) multi-call binary
Starting RPC portmap daemon..done
eth0: Link status [RX+,TX+]
mount server reported tcp not available, falling back to udp
mount: RPC: Remote system error - No route to host
Zepto ION startup-00
eth0 Link encap:Ethernet HWaddr 00:14:5E:7D:0C:57
     inet addr:172.16.3.15 Bcast:172.31.255.255 Mask:255.240.0.0
     UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
     RX packets:880 errors:0 dropped:0 overruns:0 frame:0
     TX packets:1009 errors:0 dropped:0 overruns:0 carrier:0
     collisions:0 txqueuelen:1000
     RX bytes:3878545 (3.6 Mb) TX bytes:151458 (147.9 Kb)
     Interrupt:32
Zepto ION startup-00 done
done
Starting syslog servicesDec 31 18:00:36 ion-15 syslogd 1.4.1: restart.
done
Starting network time protocol daemon (NTPD) using 172.17.3.1May 1 12:57:11 ion-15 ntpdate[642]: step time server 172.17.3.1 offset 1241200617.470271 sec
May 1 12:57:11 ion-15 ntpd[653]: ntpd [email protected] Sat Oct 4 00:01:53 UTC 2008 (1)
May 1 12:57:11 ion-15 ntpd[653]: precision = 1.000 usec
May 1 12:57:11 ion-15 ntpd[653]: Listening on interface wildcard, 0.0.0.0#123
May 1 12:57:11 ion-15 ntpd[653]: Listening on interface eth0, 172.16.3.15#123
May 1 12:57:11 ion-15 ntpd[653]: Listening on interface lo, 127.0.0.1#123
May 1 12:57:11 ion-15 ntpd[653]: kernel time sync status 0040
done
Enabling ssh
Mounting site filesystems done
Loading PVFS2 kernel module done
Sleeping 0 seconds before starting PVFS done
Starting PVFS2 client done
Sleeping 10 seconds before mounting PVFS done
Mounting PVFS2 filesystems done
Starting SSH daemonMay 1 12:57:21 ion-15 sshd[833]: Server listening on 0.0.0.0 port 22.
done
Zepto ION startup-12
Zepto ION startup-12 done
Starting GPFS
May 1 12:57:26 ion-15 syslogd 1.4.1: restart.
/etc/init.d/rc3.d/S40gpfs: GPFS is ready on I/O node ion-15 : 172.16.3.15 : R00-M1-N12-J00
ln: creating symbolic link `/home/acherryl/acherryl' to `/gpfs/home/acherryl': File exists
ln: creating symbolic link `/home/bgpadmin/bgpadmin' to `/gpfs/home/bgpadmin': File exists
ln: creating symbolic link `/home/davidr/davidr' to `/gpfs/home/davidr': File exists
ln: creating symbolic link `/home/scullinl/scullinl' to `/gpfs/home/scullinl': File exists
Starting ZOID... done
Zepto ION startup-99
Zepto ION startup-99 done
May 1 17:57:59 ion-15 init: Starting pid 2823, console /dev/console: '/bin/sh'
BusyBox v1.4.2 (2008-10-04 00:02:35 UTC) Built-in shell (ash)
Enter 'help' for a list of built-in commands.
/bin/sh: can't access tty; job control turned off
~ #
(again, we stripped the prefixes to make the lines shorter)
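To locate the log file for a given job, the naming convention described above (partition ANL-R00-M1-N12-64, log file R00-M1-N12-J00.log) can be turned into a small helper. This is only a sketch under stated assumptions: the function name io_node_log_path is hypothetical, the partition name is assumed to embed an R*-M*-N* location, and the I/O node is assumed to be J00 as in the example. Partitions with several I/O nodes, or sites with a different partition naming scheme, are not handled.

```python
import re

# Log directory as given in the text above.
LOG_DIR = "/bgsys/logs/BGP"

# Assumption: the partition name contains a rack/midplane/node-card
# location of the form R<digits>-M<digits>-N<digits>.
LOCATION = re.compile(r"R\d+-M\d+-N\d+")


def io_node_log_path(partition, io_node="J00"):
    """Guess the I/O node log file path for a partition name.

    io_node defaults to "J00", matching the single example in the text;
    pass a different value for other I/O nodes.
    """
    match = LOCATION.search(partition)
    if match is None:
        raise ValueError("no R*-M*-N* location found in %r" % partition)
    return "%s/%s-%s.log" % (LOG_DIR, match.group(0), io_node)


if __name__ == "__main__":
    print(io_node_log_path("ANL-R00-M1-N12-64"))
```

Because the text only says that the log name "generally" corresponds to the partition name, the result should be treated as a first guess and confirmed by listing the log directory.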
Messages such as Zepto ION startup or Starting ZOID clearly indicate that a ZeptoOS I/O node ramdisk is being used. If the node was instead mistakenly booted with the default ramdisk, the log will contain messages such as:
Starting CIO services [ciod:initialized] done
(ciod is never started when using Zepto Compute Node Linux)
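The marker messages above lend themselves to a simple automated check. The sketch below is illustrative only: the function name classify_ramdisk and the returned labels are hypothetical, and the marker strings are just the ones quoted in this section, so a log from a different release may need a different list.

```python
# Marker strings quoted in the text above.
ZEPTO_MARKERS = ("Zepto ION startup", "Starting ZOID")
DEFAULT_MARKERS = ("Starting CIO services",)


def classify_ramdisk(log_text):
    """Classify an I/O node boot log as "zeptoos", "default", or "unknown".

    "unknown" is returned when no marker is found, or when markers of both
    kinds appear (for example, if the text covers more than one boot).
    """
    has_zepto = any(marker in log_text for marker in ZEPTO_MARKERS)
    has_default = any(marker in log_text for marker in DEFAULT_MARKERS)
    if has_zepto and not has_default:
        return "zeptoos"
    if has_default and not has_zepto:
        return "default"
    return "unknown"


if __name__ == "__main__":
    print(classify_ramdisk("Zepto ION startup-00\nStarting ZOID... done"))
    print(classify_ramdisk("Starting CIO services [ciod:initialized] done"))
```

If the log file accumulates output from several boots (an assumption worth checking on your system), pass in only the portion belonging to the most recent boot.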
In addition to verifying the ramdisk, the I/O node log file can also be used to verify that the correct I/O node kernel was booted, by checking the kernel build timestamp in the first line of the boot log. As of this writing, the default kernel on the Argonne machines has a timestamp of Wed Oct 29 18:51:19 UTC 2008; as can be seen above, the ZeptoOS kernel was built more recently.
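The timestamp comparison can be sketched as follows. The function name kernel_build_time is hypothetical, and two simplifications are assumed: the time zone abbreviation (CDT, UTC) is discarded, so the comparison is only reliable when the two builds are more than a day apart, and the day and month names are parsed with the English (C) locale.

```python
import re
from datetime import datetime

# Matches a date such as "Wed Apr 22 15:04:42 CDT 2009" anywhere in a line.
BUILD_DATE = re.compile(
    r"([A-Z][a-z]{2} [A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}) (\w+) (\d{4})"
)


def kernel_build_time(line):
    """Extract the build time from a kernel banner line or a bare date.

    The time zone field is ignored (see the assumptions above), so the
    result is a naive datetime.
    """
    match = BUILD_DATE.search(line)
    if match is None:
        raise ValueError("no build timestamp found in %r" % line)
    stamp, _timezone, year = match.groups()
    return datetime.strptime(stamp + " " + year, "%a %b %d %H:%M:%S %Y")


# First line of the example boot log above, and the default kernel
# timestamp quoted in the text.
BANNER = ("Linux version 2.6.16.46-297 (geeko@buildhost) "
          "(gcc version 4.1.2 (BGP)) #1 SMP Wed Apr 22 15:04:42 CDT 2009")
DEFAULT_KERNEL = "Wed Oct 29 18:51:19 UTC 2008"

if __name__ == "__main__":
    if kernel_build_time(BANNER) > kernel_build_time(DEFAULT_KERNEL):
        print("Kernel is newer than the default kernel.")
    else:
        print("Kernel is not newer than the default; check the profile.")
```

The default timestamp is site-specific and will change when the system software is updated, so it should be replaced with the current value for your machine.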
Interactive login
Using Testcode
Once the files have been installed and the profiles defined by creating symbolic links to the images in the top-level directory, it is time to submit a test job. Use the test programs provided with the distribution. From the top-level directory:
$ cd comm/testcodes
Compiling
The test programs can be compiled on the login node using:
$ /path/to/install/bin/zmpicc -o mpi-test-linux mpi-test.c
$ /path/to/install/bin/zmpixlc_r -openmp -o omp-test-linux omp-test.c
Submitting Job
(For ANL users) The job needs to be submitted using the Cobalt resource manager. For the mpi-test and omp-test programs, respectively, use the following commands:
cqsub -n <number-of-processes> -t <time> -k <profile-name> mpi-test-linux
cqsub -n <number-of-processes> -t <time> -k <profile-name> -e OMP_NUM_THREADS=<num> omp-test-linux