Testing

From ZeptoOS
Revision as of 13:19, 1 May 2009 by Iskra (talk | contribs)



Early testing and troubleshooting

Once ZeptoOS is configured and installed, it is time to test it. Here are a few trivial tests to verify that the environment is working:

/bin/sleep

If you are using Cobalt, submit using either of the commands below:

cqsub -k <profile-name> -t <time> -n 1 /bin/sleep 3600
qsub --kernel <profile-name> -t <time> -n 1 /bin/sleep 3600

If you are using mpirun directly, submit as follows:

mpirun -verbose 1 -partition <partition-name> -np 1 -timeout <time> \
-cwd $PWD -exe /bin/sleep 3600

This test, if successful, will verify that the ZeptoOS compute and I/O node environments are booting correctly. We deliberately chose a system binary such as /bin/sleep instead of something from a network filesystem so that even if the network filesystem does not come up for some reason, the test can still succeed.
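The two launch paths above can be wrapped in a small dry-run helper. This is only a sketch: the function name sleep_test_cmd and its argument order are hypothetical, and it prints the command line quoted on this page rather than running it, so it can be tried on any machine before submitting for real.

```shell
# Hypothetical helper: print (do not run) the /bin/sleep test command.
# Usage: sleep_test_cmd cobalt <profile-name> <time>
#        sleep_test_cmd mpirun <partition-name> <time>
sleep_test_cmd() {
    launcher=$1
    target=$2
    limit=$3
    case $launcher in
        cobalt)
            # Cobalt: the second argument is the kernel profile name.
            echo "cqsub -k $target -t $limit -n 1 /bin/sleep 3600"
            ;;
        mpirun)
            # Direct mpirun: the second argument is the partition name.
            echo "mpirun -verbose 1 -partition $target -np 1 -timeout $limit -cwd $PWD -exe /bin/sleep 3600"
            ;;
        *)
            echo "unknown launcher: $launcher" >&2
            return 1
            ;;
    esac
}
```

Once the printed line looks right, it can be pasted into the shell or passed to eval.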

If the test succeeds, messages such as the following will appear in the error stream (the jobid.error file if using Cobalt):

FE_MPI (Info) : initialize() - using jobname '' provided by scheduler interface
FE_MPI (Info) : Invoking mpirun backend
FE_MPI (Info) : connectToServer() - Handshake successful
BRIDGE (Info) : rm_set_serial() - The machine serial number (alias) is BGP
FE_MPI (Info) : Preparing partition
BE_MPI (Info) : Examining specified partition
BE_MPI (Info) : Checking partition ANL-R00-M1-N12-64 initial state ...
BE_MPI (Info) : Partition ANL-R00-M1-N12-64 initial state = FREE ('F')
BE_MPI (Info) : Checking partition owner...
BE_MPI (Info) : Setting new owner
BE_MPI (Info) : Initiating boot of the partition
BE_MPI (Info) : Waiting for partition ANL-R00-M1-N12-64 to boot...
BE_MPI (Info) : Partition is ready
BE_MPI (Info) : Done preparing partition
FE_MPI (Info) : Adding job
BE_MPI (Info) : Adding job to database...
FE_MPI (Info) : Job added with the following id: 98461
FE_MPI (Info) : Starting job 98461
FE_MPI (Info) : Waiting for job to terminate
BE_MPI (Info) : IO - Threads initialized
BE_MPI (Info) : I/O input runner thread terminated

(Timestamps have been stripped to keep the lines short.)

If these messages are immediately followed by error messages, then there is a problem. One common instance is:

BE_MPI (Info) : I/O output runner thread terminated
BE_MPI (Info) : Job 98463 switched to state ERROR ('E')
BE_MPI (ERROR): Job execution failed
[...]
BE_MPI (ERROR): The error message in the job record is as follows:
BE_MPI (ERROR):   "Load failed on 172.16.3.11: Program segment is not 1MB aligned"

This error indicates that the job was submitted to a default software environment, not to ZeptoOS (at the very least, the default I/O node ramdisk was used).
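Checking the error stream by hand gets tedious over repeated runs, so the check can be scripted. The sketch below is an assumption-laden illustration, not part of the distribution: the function name classify_job_log and its four result labels are hypothetical, and it matches only the message fragments quoted on this page.

```shell
# Hypothetical helper: classify a job's error-stream file
# (jobid.error under Cobalt) using the messages quoted on this page.
classify_job_log() {
    log=$1
    if grep -q 'Program segment is not 1MB aligned' "$log"; then
        echo "default-environment"   # job was not booted into ZeptoOS
    elif grep -q '(ERROR)' "$log"; then
        echo "error"                 # some other failure was reported
    elif grep -q 'Partition is ready' "$log" && grep -q 'Starting job' "$log"; then
        echo "ok"                    # partition booted and job started
    else
        echo "incomplete"            # expected messages not (yet) present
    fi
}
```

For example, `classify_job_log 98461.error` would be run against the error file of the job in question (the file name here is illustrative).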

Using Testcode

Once the files have been installed and the profiles defined by creating symbolic links to the images in the top-level directory, it is time to submit a test job. Use the test programs provided with the distribution. From the top-level directory:

$ cd comm/testcodes

Compiling

The test programs can be compiled on the login node using:

$ /path/to/install/bin/zmpicc -o mpi-test-linux mpi-test.c
$ /path/to/install/bin/zmpixlc_r -openmp -o omp-test-linux omp-test.c
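If the install prefix is mistyped, the compile step fails with a generic "command not found" message. A minimal sketch of a wrapper that checks for both compiler wrappers first is shown here; the function name compile_testcodes is hypothetical, and its argument stands for the /path/to/install placeholder used above.

```shell
# Hypothetical helper: compile both test programs with the compiler
# wrappers found under the given install prefix.
# Usage (from comm/testcodes): compile_testcodes /path/to/install
compile_testcodes() {
    prefix=$1
    # Fail early, naming the missing wrapper, before compiling anything.
    for cc in zmpicc zmpixlc_r; do
        if [ ! -x "$prefix/bin/$cc" ]; then
            echo "missing compiler wrapper: $prefix/bin/$cc" >&2
            return 1
        fi
    done
    # Same command lines as listed on this page.
    "$prefix/bin/zmpicc" -o mpi-test-linux mpi-test.c || return 1
    "$prefix/bin/zmpixlc_r" -openmp -o omp-test-linux omp-test.c || return 1
}
```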

Submitting Job

(For ANL users) The jobs need to be submitted using the Cobalt resource manager. Use the following commands for the mpi-test and omp-test programs, respectively:

cqsub -n <number-of-processes> -t <time> -k <profile-name> mpi-test-linux
cqsub -n <number-of-processes> -t <time> -k <profile-name> -e OMP_NUM_THREADS=<num> omp-test-linux
