[[Installation]] | [[ZeptoOS_Documentation|Top]] | [[MPICH, DCMF, and SPI]]
----

Once ZeptoOS is configured and installed, it is time to test it. Here are a few trivial tests to verify that the environment is working:
==The /bin/sleep job==

If using Cobalt, submit using either of the commands below:

<pre>
$ cqsub -k <profile-name> -t <time> -n 1 /bin/sleep 3600
$ qsub --kernel <profile-name> -t <time> -n 1 /bin/sleep 3600
</pre>

If using <tt>mpirun</tt> directly, submit as follows:

<pre>
$ mpirun -verbose 1 -partition <partition-name> -np 1 -timeout <time> \
      -cwd $PWD -exe /bin/sleep 3600
</pre>

This test, if successful, will verify that the ZeptoOS compute and I/O node environments are booting correctly. We deliberately chose a system binary such as <tt>/bin/sleep</tt> instead of something from a network file system to reduce the number of dependencies.
If everything works out fine, messages such as the following will be found in the error stream (''jobid''.error file if using Cobalt):

<pre>
FE_MPI (Info) : initialize() - using jobname '' provided by scheduler interface
FE_MPI (Info) : Invoking mpirun backend
FE_MPI (Info) : connectToServer() - Handshake successful
BRIDGE (Info) : rm_set_serial() - The machine serial number (alias) is BGP
FE_MPI (Info) : Preparing partition
BE_MPI (Info) : Examining specified partition
BE_MPI (Info) : Checking partition ANL-R00-M1-N12-64 initial state ...
BE_MPI (Info) : Partition ANL-R00-M1-N12-64 initial state = FREE ('F')
BE_MPI (Info) : Checking partition owner...
BE_MPI (Info) : Setting new owner
BE_MPI (Info) : Initiating boot of the partition
BE_MPI (Info) : Waiting for partition ANL-R00-M1-N12-64 to boot...
BE_MPI (Info) : Partition is ready
BE_MPI (Info) : Done preparing partition
FE_MPI (Info) : Adding job
BE_MPI (Info) : Adding job to database...
FE_MPI (Info) : Job added with the following id: 98461
FE_MPI (Info) : Starting job 98461
FE_MPI (Info) : Waiting for job to terminate
BE_MPI (Info) : IO - Threads initialized
BE_MPI (Info) : I/O input runner thread terminated
</pre>

(we stripped the timestamp prefixes to make the lines shorter)

If these messages are immediately followed by error messages, then there is a problem. One common instance would be:

<pre>
BE_MPI (Info) : I/O output runner thread terminated
BE_MPI (Info) : Job 98463 switched to state ERROR ('E')
BE_MPI (ERROR): Job execution failed
[...]
BE_MPI (ERROR): The error message in the job record is as follows:
BE_MPI (ERROR): "Load failed on 172.16.3.11: Program segment is not 1MB aligned"
</pre>
This error indicates that the job was submitted to the default software environment with the light-weight kernel, not to ZeptoOS (at the very least, the default I/O node ramdisk was used). Go back to the [[Installation#Setting up a kernel profile|Installation]] section to fix the problem. Information from the system log files (see below) can be useful to diagnose the problem.

==Log files==

===I/O node===

Every I/O node has its own log file located in <tt>/bgsys/logs/BGP/</tt>, with a name such as <tt>R*-M*-N*-J*.log</tt>. This name will generally correspond to the name of the partition where the job was running. Above, our job ran on <tt>ANL-R00-M1-N12-64</tt> (we could see that in the error stream; Cobalt users can also use <tt>[c]qstat</tt>); the corresponding I/O node log file on Argonne machines will be <tt>R00-M1-N12-J00.log</tt>. This is what a log file from a successful ZeptoOS boot looks like:
<pre>
Linux version 2.6.16.46-297 (geeko@buildhost) (gcc version 4.1.2 (BGP)) #1 SMP Wed Apr 22 15:04:42 CDT 2009
Kernel command line: console=bgcons root=/dev/ram0 lpj=8500000
init started: BusyBox v1.4.2 (2008-04-10 05:20:01 UTC) multi-call binary
Starting RPC portmap daemon..done
eth0: Link status [RX+,TX+]
mount server reported tcp not available, falling back to udp
mount: RPC: Remote system error - No route to host
Zepto ION startup-00
eth0      Link encap:Ethernet  HWaddr 00:14:5E:7D:0C:57
          inet addr:172.16.3.15  Bcast:172.31.255.255  Mask:255.240.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
          RX packets:880 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1009 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:3878545 (3.6 Mb)  TX bytes:151458 (147.9 Kb)
          Interrupt:32
Zepto ION startup-00 done
done
Starting syslog servicesDec 31 18:00:36 ion-15 syslogd 1.4.1: restart.
done
Starting network time protocol daemon (NTPD) using 172.17.3.1
May 1 12:57:11 ion-15 ntpdate[642]: step time server 172.17.3.1 offset 1241200617.470271 sec
May 1 12:57:11 ion-15 ntpd[653]: ntpd [email protected] Sat Oct 4 00:01:53 UTC 2008 (1)
May 1 12:57:11 ion-15 ntpd[653]: precision = 1.000 usec
May 1 12:57:11 ion-15 ntpd[653]: Listening on interface wildcard, 0.0.0.0#123
May 1 12:57:11 ion-15 ntpd[653]: Listening on interface eth0, 172.16.3.15#123
May 1 12:57:11 ion-15 ntpd[653]: Listening on interface lo, 127.0.0.1#123
May 1 12:57:11 ion-15 ntpd[653]: kernel time sync status 0040
done
Enabling ssh
Mounting site filesystems
done
Loading PVFS2 kernel module done
Sleeping 0 seconds before starting PVFS done
Starting PVFS2 client done
Sleeping 10 seconds before mounting PVFS
done
Mounting PVFS2 filesystems done
Starting SSH daemonMay 1 12:57:21 ion-15 sshd[833]: Server listening on 0.0.0.0 port 22.
done
Zepto ION startup-12
Zepto ION startup-12 done
Starting GPFS
May 1 12:57:26 ion-15 syslogd 1.4.1: restart.
/etc/init.d/rc3.d/S40gpfs: GPFS is ready on I/O node ion-15 : 172.16.3.15 : R00-M1-N12-J00
ln: creating symbolic link `/home/acherryl/acherryl' to `/gpfs/home/acherryl': File exists
ln: creating symbolic link `/home/bgpadmin/bgpadmin' to `/gpfs/home/bgpadmin': File exists
ln: creating symbolic link `/home/davidr/davidr' to `/gpfs/home/davidr': File exists
ln: creating symbolic link `/home/scullinl/scullinl' to `/gpfs/home/scullinl': File exists
Starting ZOID... done
Zepto ION startup-99
Zepto ION startup-99 done
May 1 17:57:59 ion-15 init: Starting pid 2823, console /dev/console: '/bin/sh'

BusyBox v1.4.2 (2008-10-04 00:02:35 UTC) Built-in shell (ash)
Enter 'help' for a list of built-in commands.

/bin/sh: can't access tty; job control turned off
~ #
</pre>
(again, we stripped the prefixes to make the lines shorter)

Messages such as <tt>Zepto ION startup</tt> or <tt>Starting ZOID</tt> clearly indicate that a ZeptoOS I/O node ramdisk is being used. If instead one mistakenly boots with the default ramdisk, this can be recognized by messages such as:

<pre>
Starting CIO services
[ciod:initialized] done
</pre>

(<tt>ciod</tt> is ''never'' started when using the ZeptoOS compute node Linux)

In addition to the ramdisk, the I/O node log file can also be used to verify that the correct I/O node kernel was booted, by checking the kernel build timestamp in the first line of the boot log. As of this writing, the default kernel on the Argonne machines has a timestamp of <tt>Wed Oct 29 18:51:19 UTC 2008</tt>; as can be seen above, the ZeptoOS kernel was built more recently.
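
For instance, this check can be performed from any node that can see <tt>/bgsys/logs/BGP/</tt> (a minimal sketch, assuming the Argonne partition-to-log-file mapping from the example above):

<pre>
$ head -1 /bgsys/logs/BGP/R00-M1-N12-J00.log
Linux version 2.6.16.46-297 (geeko@buildhost) (gcc version 4.1.2 (BGP)) #1 SMP Wed Apr 22 15:04:42 CDT 2009
</pre>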
===Compute node===

All the compute nodes on the machine share the same MMCS log file, located in <tt>/bgsys/logs/BGP/</tt>. The name of the log file is not fixed (it contains a timestamp), but <tt><service_node>-bgdb0-mmcs_db_server-current.log</tt> always links to the current file. Because the file is shared with other jobs, we recommend grepping it for the user name, the partition name, or both.
A correct boot log when booting ZeptoOS will look something like this:

<pre>
iskra:ANL-R00-M1-N12-64 {20}.0: Common Node Services V1R3M0 (efix:0)
iskra:ANL-R00-M1-N12-64 {20}.0: Licensed Machine Code - Property of IBM.
iskra:ANL-R00-M1-N12-64 {20}.0: Blue Gene/P Licensed Machine Code.
iskra:ANL-R00-M1-N12-64 {20}.0: Copyright IBM Corp., 2006, 2007 All Rights Reserved.
iskra:ANL-R00-M1-N12-64 {20}.0: Z: Zepto Linux Kernel relocating CNS... dst=80280000 src=fff40000 size=262144
iskra:ANL-R00-M1-N12-64 {20}.0: Z: CNS is successfully relocated to 00280000 in physical memory
iskra:ANL-R00-M1-N12-64 {20}.0: Linux version 2.6.19.2-g66cbca2d (kazutomo@login1) (gcc version 4.1.2 (BGP)) #12 SMP Tue Apr 21 12:58:11 CDT 2009
iskra:ANL-R00-M1-N12-64 {20}.0: Zone PFN ranges:
iskra:ANL-R00-M1-N12-64 {20}.0:   DMA             0 ->    28672
iskra:ANL-R00-M1-N12-64 {20}.0:   Normal      28672 ->    28672
iskra:ANL-R00-M1-N12-64 {20}.0: early_node_map[1] active PFN ranges
iskra:ANL-R00-M1-N12-64 {20}.1:     0:        0 ->    28672
iskra:ANL-R00-M1-N12-64 {20}.1: Built 1 zonelists.  Total pages: 28658
iskra:ANL-R00-M1-N12-64 {20}.1: Kernel command line: console=bgcons root=/dev/ram0 lpj=8500000
iskra:ANL-R00-M1-N12-64 {20}.1: PID hash table entries: 4096 (order: 12, 16384 bytes)
iskra:ANL-R00-M1-N12-64 {20}.0: Dentry cache hash table entries: 262144 (order: 4, 1048576 bytes)
iskra:ANL-R00-M1-N12-64 {20}.0: Inode-cache hash table entries: 131072 (order: 3, 524288 bytes)
iskra:ANL-R00-M1-N12-64 {20}.0: Memory: 1826560k available (1408k kernel code, 832k data, 192k init, 0k highmem)
iskra:ANL-R00-M1-N12-64 {20}.0: Calibrating delay loop (skipped)... 1700.00 BogoMIPS preset
iskra:ANL-R00-M1-N12-64 {20}.0: Mount-cache hash table entries: 8192
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 1 done callin...
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 1 done setup...
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 1 done timebase take...
iskra:ANL-R00-M1-N12-64 {20}.0: Processor 1 found.
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 2 done callin...
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 2 done setup...
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 2 done timebase take...
iskra:ANL-R00-M1-N12-64 {20}.0: Processor 2 found.
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 3 done callin...
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 3 done setup...
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 3 done timebase take...
iskra:ANL-R00-M1-N12-64 {20}.0: Processor 3 found.
iskra:ANL-R00-M1-N12-64 {20}.0: Brought up 4 CPUs
iskra:ANL-R00-M1-N12-64 {20}.0: migration_cost=0
iskra:ANL-R00-M1-N12-64 {20}.0: checking if image is initramfs... it is
iskra:ANL-R00-M1-N12-64 {20}.0: Freeing initrd memory: 2575k freed
iskra:ANL-R00-M1-N12-64 {20}.0: NET: Registered protocol family 16
iskra:ANL-R00-M1-N12-64 {20}.0: NET: Registered protocol family 2
iskra:ANL-R00-M1-N12-64 {20}.0: IP route cache hash table entries: 16384 (order: 0, 65536 bytes)
iskra:ANL-R00-M1-N12-64 {20}.0: TCP established hash table entries: 65536 (order: 3, 524288 bytes)
iskra:ANL-R00-M1-N12-64 {20}.0: TCP bind hash table entries: 32768 (order: 2, 262144 bytes)
iskra:ANL-R00-M1-N12-64 {20}.0: TCP: Hash tables configured (established 65536 bind 32768)
iskra:ANL-R00-M1-N12-64 {20}.0: TCP reno registered
iskra:ANL-R00-M1-N12-64 {20}.0: fuse init (API version 7.7)
iskra:ANL-R00-M1-N12-64 {20}.0: io scheduler noop registered (default)
iskra:ANL-R00-M1-N12-64 {20}.0: RAMDISK driver initialized: 16 RAM disks of 32768K size 1024 blocksize
iskra:ANL-R00-M1-N12-64 {20}.0: tun: Universal TUN/TAP device driver, 1.6
iskra:ANL-R00-M1-N12-64 {20}.0: tun: (C) 1999-2004 Max Krasnyansky <[email protected]>
iskra:ANL-R00-M1-N12-64 {20}.0: TCP cubic registered
iskra:ANL-R00-M1-N12-64 {20}.0: NET: Registered protocol family 1
iskra:ANL-R00-M1-N12-64 {20}.0: NET: Registered protocol family 17
iskra:ANL-R00-M1-N12-64 {20}.0: NET: Registered protocol family 15
iskra:ANL-R00-M1-N12-64 {20}.0: Freeing unused kernel memory: 192k init
iskra:ANL-R00-M1-N12-64 {20}.0: init started: BusyBox(for ZeptoOS Compute Node) v1.12.1 (2009-04-21 16:08:55 CDT)
</pre>
+ | |||
+ | This is very easy to tell from a boot log of the default light-weight kernel, which will consist of the first four lines ''only''. | ||
+ | |||
+ | The MMCS log file contains other useful information besides the boot log of the compute nodes. Before the kernel starts booting, the following messages related to the newly submitted job can be found there: | ||
+ | |||
<pre>
DBBlockCmd DatabaseBlockCommandThread started: block ANL-R00-M1-N12-64, user iskra, action 1
DBBlockCmd setusername iskra
iskra db_allocate ANL-R00-M1-N12-64
iskra DBConsoleController::setAllocating() ANL-R00-M1-N12-64
iskra block state C
iskra DBConsoleController::addBlock(ANL-R00-M1-N12-64)
iskra:ANL-R00-M1-N12-64 BlockController::connect()
iskra:ANL-R00-M1-N12-64 connecting to mcServer at 127.0.0.1:1206
Connected to MCServer as iskra@sn1. Client version 3. Server version 3 on fd 101
iskra:ANL-R00-M1-N12-64 connected to mcServer
iskra:ANL-R00-M1-N12-64 mcServer target set ANL-R00-M1-N12-64 created
iskra:ANL-R00-M1-N12-64 mcServer target set ANL-R00-M1-N12-64 opened
iskra:ANL-R00-M1-N12-64 {0} I/O log file: /bgsys/logs/BGP/R00-M1-N12-J00.log
iskra:ANL-R00-M1-N12-64 MailboxListener starting
iskra:ANL-R00-M1-N12-64 DBConsoleController::doneAllocating() ANL-R00-M1-N12-64
iskra:ANL-R00-M1-N12-64 BlockController::boot_block \
    uloader=/bgsys/argonne-utils/partitions/ANL-R00-M1-N12-64/uloader \
    cnload=/bgsys/argonne-utils/partitions/ANL-R00-M1-N12-64/CNS,/bgsys/argonne-utils/partitions/ANL-R00-M1-N12-64/CNK \
    ioload=/bgsys/argonne-utils/partitions/ANL-R00-M1-N12-64/CNS,/bgsys/argonne-utils/partitions/ANL-R00-M1-N12-64/INK,/bgsys/argonne-utils/partitions/ANL-R00-M1-N12-64/ramdisk
iskra:ANL-R00-M1-N12-64 boot_block cookie: 587867023 compute_nodes: 64 io_nodes: 1
</pre>
+ | |||
+ | Of particular relevance is the pathname to the I/O node log file(s) (if it cannot be easily guessed from the partition name) and the pathnames to the kernels and ramdisks used to boot the partition. | ||
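
For example, the images used for a particular boot can be pulled out of the log with a simple grep (a sketch; the log file name uses the same <tt>sn1</tt> placeholder as before):

<pre>
$ grep boot_block /bgsys/logs/BGP/sn1-bgdb0-mmcs_db_server-current.log
</pre>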
+ | |||
+ | After the kernel boot log, the log file will also contain information about subsequent phases of starting a job: | ||
+ | |||
+ | <pre> | ||
+ | iskra:ANL-R00-M1-N12-64 I/O node initialized: R00-M1-N12-J00 | ||
+ | iskra:ANL-R00-M1-N12-64 DBBlockController::waitBoot(ANL-R00-M1-N12-64) block initialization successful | ||
+ | iskra DatabaseBlockCommandThread stopped | ||
+ | DBJobCmd DatabaseJobCommandThread started: job 98461, user iskra, action 1 | ||
+ | DBJobCmd setusername iskra | ||
+ | iskra Starting Job 98461 | ||
+ | New thread 4398305505840, for jobid 98461 | ||
+ | selectBlock(): ANL-R00-M1-N12-64 iskra(1) connected state: I owner: iskra | ||
+ | ANL-R00-M1-N12-64 Jobid is 98461, homedir is /gpfs/home/iskra | ||
+ | ANL-R00-M1-N12-64 persist: 1 | ||
+ | ANL-R00-M1-N12-64 connecting to mpirun... | ||
+ | ANL-R00-M1-N12-64 setting mpirun stream, fd=386 | ||
+ | ANL-R00-M1-N12-64 contacting control node 0 at 172.16.3.15:7000 | ||
+ | ANL-R00-M1-N12-64 connected to control node 0 at 172.16.3.15:7000 | ||
+ | ANL-R00-M1-N12-64 Job::load() /bin/sleep | ||
+ | ANL-R00-M1-N12-64 Job loaded: 98461 | ||
+ | ANL-R00-M1-N12-64 About to start /bin/sleep | ||
+ | ANL-R00-M1-N12-64 Job 98461 set to RUNNING | ||
+ | iskra:ANL-R00-M1-N12-64 {20}.0: floating point used in kernel (task=8080cfe0, pc=80017064) | ||
+ | </pre> | ||
+ | |||
+ | ==Interactive login== | ||
+ | |||
+ | We are assuming at this point that launching <tt>/bin/sleep</tt> has been successful and that the "job" is running. We can now start an interactive session on our BG/P resources. Probably the most complicated part of this operation is finding the IP address of the I/O node(s). The allocation of I/O nodes to partitions is fixed, so on a small machine one could simply make a list. This information is also available in the log files discussed above. | ||
+ | |||
+ | The IP address is printed near the top of the I/O node boot log, as part of the interface configuration of the Ethernet device: | ||
+ | |||
+ | <pre> | ||
+ | eth0 Link encap:Ethernet HWaddr 00:14:5E:7D:0C:57 | ||
+ | inet addr:172.16.3.15 Bcast:172.31.255.255 Mask:255.240.0.0 | ||
+ | UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1 | ||
+ | RX packets:880 errors:0 dropped:0 overruns:0 frame:0 | ||
+ | TX packets:1009 errors:0 dropped:0 overruns:0 carrier:0 | ||
+ | collisions:0 txqueuelen:1000 | ||
+ | RX bytes:3878545 (3.6 Mb) TX bytes:151458 (147.9 Kb) | ||
+ | Interrupt:32 | ||
+ | </pre> | ||
+ | |||
+ | In this case, the address is <tt>172.16.3.15</tt> (the <tt>inet addr</tt> value). | ||
+ | |||
+ | The IP address is also available from the MMCS log file: | ||
+ | |||
+ | <pre> | ||
+ | ANL-R00-M1-N12-64 contacting control node 0 at 172.16.3.15:7000 | ||
+ | </pre> | ||
+ | |||
+ | With larger partitions that include multiple I/O nodes, querying the MMCS logfile is probably better, as it will list all the addresses. | ||
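
For example, all the control node addresses for a partition can be extracted in one go (a sketch; same placeholder log file name as before):

<pre>
$ grep 'contacting control node' /bgsys/logs/BGP/sn1-bgdb0-mmcs_db_server-current.log
ANL-R00-M1-N12-64 contacting control node 0 at 172.16.3.15:7000
</pre>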
+ | |||
+ | Once the IP address is known, one can simply use the SSH: | ||
+ | |||
+ | <pre> | ||
+ | [email protected]:~> ssh 172.16.3.15 | ||
+ | |||
+ | |||
+ | BusyBox v1.4.2 (2008-10-04 00:02:35 UTC) Built-in shell (ash) | ||
+ | Enter 'help' for a list of built-in commands. | ||
+ | |||
+ | /gpfs/home/iskra $ hostname | ||
+ | ion-15 | ||
+ | /gpfs/home/iskra $ | ||
+ | </pre> | ||
+ | |||
+ | If everything is configured correctly, SSH will only let in root and the partition owner; no other unprivileged user will be allowed on the node. However, this might require site-specific customizations to work properly. To enable access for the partition owner, an administrator might need to make adjustments to [[ZOID#The /bin.rd/update_passwd_file.sh file|update_passwd_file.sh]]. To enable password-less login for the partition owners without requiring them to set up personal SSH key pairs, we recommend to add the names of the front end nodes to the <tt>shosts.equiv</tt> file, found in <tt>ramdisk/ION/ramdisk-add/etc/ssh.zepto/</tt> (it is empty by default; remember to use the names from the network that interconnects front end and I/O nodes, which might be different from hostnames, e.g., at Argonne we need to add the <tt>-data</tt> suffix to the hostnames). Until this has all been set up, one might prefer to log on as root (<tt>ssh -l root</tt>), passing the password provided when [[Configuration#Building|building]] the ZeptoOS environment. | ||
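
For instance, from the top-level directory of the ZeptoOS source tree (a sketch; <tt>login1-data</tt> is a hypothetical front end node name following the Argonne <tt>-data</tt> convention mentioned above):

<pre>
$ echo login1-data >> ramdisk/ION/ramdisk-add/etc/ssh.zepto/shosts.equiv
$ cat ramdisk/ION/ramdisk-add/etc/ssh.zepto/shosts.equiv
login1-data
</pre>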
+ | |||
+ | Also, even when the partition owner is correctly set up, there will be a time window while booting the I/O node when the SSH daemon is already running, but a job has not yet been started; during that window, the partition owner cannot log on. If that happens, wait a few seconds and try again. | ||
+ | |||
+ | Here is part of the <tt>ps</tt> output from an I/O node: | ||
+ | |||
+ | <pre> | ||
+ | /gpfs/home/iskra $ ps -ef | ||
+ | UID PID PPID C STIME TTY TIME CMD | ||
+ | [...] | ||
+ | 65534 98 1 0 16:09 ? 00:00:00 /sbin/portmap | ||
+ | root 108 19 0 16:09 ? 00:00:00 [rpciod/0] | ||
+ | root 109 19 0 16:09 ? 00:00:00 [rpciod/1] | ||
+ | root 110 19 0 16:09 ? 00:00:00 [rpciod/2] | ||
+ | root 111 19 0 16:09 ? 00:00:00 [rpciod/3] | ||
+ | root 570 1 0 16:09 ? 00:00:00 /sbin/syslogd | ||
+ | root 577 1 0 16:09 ? 00:00:00 /sbin/klogd -c 1 -x -x | ||
+ | ntp 653 1 0 16:09 ? 00:00:00 /usr/sbin/ntpd -p /var/run/ntpd. | ||
+ | root 688 1 0 16:09 ? 00:00:00 [lockd] | ||
+ | root 775 1 0 16:09 ? 00:00:00 /bgsys/iosoft/pvfs2/sbin/pvfs2-c | ||
+ | root 776 775 0 16:09 ? 00:00:00 pvfs2-client-core --child -a 5 - | ||
+ | root 833 1 0 16:10 ? 00:00:00 /usr/sbin/sshd -o PidFile=/var/r | ||
+ | root 1016 1 0 16:10 ? 00:00:00 /bin/ksh /usr/lpp/mmfs/bin/runmm | ||
+ | root 1079 1 0 16:10 ? 00:00:00 [nfsWatchKproc] | ||
+ | root 1080 1 0 16:10 ? 00:00:00 [gpfsSwapdKproc] | ||
+ | root 1146 1016 0 16:10 ? 00:00:01 /usr/lpp/mmfs/bin//mmfsd | ||
+ | root 1153 1 0 16:10 ? 00:00:00 [mmkproc] | ||
+ | root 1152 1 0 16:10 ? 00:00:00 [mmkproc] | ||
+ | root 1154 1 0 16:10 ? 00:00:00 [mmkproc] | ||
+ | iskra 2810 1 98 16:10 ? 00:04:09 /bin.rd/zoid -a 8 -m unix_impl.s | ||
+ | root 2823 1 0 16:10 ? 00:00:00 /bin/sh | ||
+ | root 3328 833 0 16:10 ? 00:00:00 sshd: iskra [priv] | ||
+ | iskra 3332 3328 0 16:10 ? 00:00:00 sshd: iskra@ttyp0 | ||
+ | iskra 3333 3332 0 16:10 ttyp0 00:00:00 -sh | ||
+ | iskra 3346 3333 0 16:14 ttyp0 00:00:00 ps -ef | ||
+ | /gpfs/home/iskra $ | ||
+ | </pre> | ||
+ | |||
+ | The I/O nodes run a small Linux setup with the root file system in the ramdisk. Custom processes can be started, just like on any ordinary Linux node. In the example above, it is mostly a few system daemons and the remote file system clients (GPFS, PVFS). Please verify at this stage that the remote file systems have been mounted correctly. | ||
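
For example, from the I/O node shell (a sketch only; the exact mount points and file system types are site-specific, so the output is omitted here):

<pre>
/gpfs/home/iskra $ mount | grep -i -e gpfs -e pvfs
[...]
/gpfs/home/iskra $ df -h
[...]
</pre>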
One custom process running on the node is [[ZOID]], the I/O forwarding and job control daemon, which enables communication with the compute nodes. One of the facilities offered by ZOID is IP forwarding between the I/O nodes and the compute nodes, implemented using the virtual network tunneling device available in Linux:

<pre>
/gpfs/home/iskra $ ifconfig tun0
tun0      Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
          inet addr:192.168.1.254  P-t-P:192.168.1.254  Mask:255.255.255.255
          UP POINTOPOINT RUNNING NOARP MULTICAST  MTU:65535  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:500
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
/gpfs/home/iskra $
</pre>
+ | |||
+ | At least on Argonne machines, with a 64:1 ratio of compute nodes to I/O nodes, compute nodes have addresses <tt>192.168.1.1</tt> to <tt>192.168.1.64</tt> (the last octet of the address is the [[FAQ#Pset rank|pset rank]]). Somewhat confusingly, the first compute node (compute node <tt>0</tt>) has IP address <tt>192.168.1.64</tt>, so if one submits a one-node job as we did, that is the IP address that needs to be used to log on that sole running compute node. On a machine with a 16:1 ratio of compute nodes to I/O nodes, the first compute node has IP address <tt>192.168.1.16</tt>. If you are beginning to see a pattern here, then be advised that with a 64:1 ratio, the IP address of the second compute node is... <tt>192.168.1.59</tt>. Do not blame us for this chaos – blame IBM :-). | ||
+ | |||
+ | The compute nodes are running a <tt>telnet</tt> daemon, and no password is required to log on them: | ||
+ | |||
+ | <pre> | ||
+ | /gpfs/home/iskra $ telnet 192.168.1.64 | ||
+ | |||
+ | Entering character mode | ||
+ | Escape character is '^]'. | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | BusyBox(for ZeptoOS Compute Node) v1.12.1 (2009-04-21 16:08:55 CDT) built-in shell (ash) | ||
+ | Enter 'help' for a list of built-in commands. | ||
+ | |||
+ | ~ # | ||
+ | </pre> | ||
+ | |||
+ | The IP address of the I/O node on this virtual network is <tt>192.168.1.254</tt>. The network is local to each I/O node, so for larger partitions with more than one I/O node, there will be multiple distinct virtual networks that cannot communicate with each other, and the IP addresses will duplicate. | ||
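
For example, from the compute node shell opened above, the I/O node should respond over this virtual network (a sketch, assuming the <tt>ping</tt> applet is enabled in the compute node BusyBox; output omitted):

<pre>
~ # ping -c 1 192.168.1.254
[...]
</pre>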
+ | |||
+ | Here is part of the <tt>ps</tt> output from a compute node: | ||
+ | |||
+ | <pre> | ||
+ | ~ # ps -ef | ||
+ | PID USER VSZ STAT COMMAND | ||
+ | [...] | ||
+ | 34 root 5440 S /bin/sh /etc/init.d/rc.sysinit | ||
+ | 44 root 5504 S /sbin/telnetd -l /bin/sh | ||
+ | 47 root 6528 S /sbin/inetd | ||
+ | 48 root 46400 R N /sbin/control | ||
+ | 62 root 7872 S /bin/zoid-fuse -o allow_other -s /fuse | ||
+ | 116 root 5248 S /bin/sleep 3600 | ||
+ | 118 root 5504 S /bin/sh | ||
+ | </pre> | ||
+ | |||
+ | Compute nodes have an even more stripped-down environment than the I/O nodes. There are no user accounts – everything runs as root, including the application processes. This is not a security concern, because the only practical way for a compute node to communicate with the outside world is through the I/O node, and I/O nodes ''do'' enforce user-level access control. | ||
+ | |||
+ | There are two custom processes running on each compute node: | ||
+ | |||
+ | '''control''' is a job management daemon responsible for tasks such as the launching of application processes, for the forwarding of stdin/out/err data, and for the management of the virtual network tunneling device from the compute node side. Do not interfere with this process in any way; this would likely make the node inaccessible. | ||
+ | |||
+ | '''zoid-fuse''' is a FUSE ([http://fuse.sourceforge.net/ Filesystem in Userspace]) client responsible for making the filesystems from the I/O nodes available to ordinary POSIX-compliant processes running on the compute nodes. The whole filesystem namespace from the I/O nodes is made available on the compute nodes under <tt>/fuse/</tt>, and symbolic links such as <tt>/home -> /fuse/home</tt> are set up to keep the front end and I/O node pathnames valid on the compute nodes. Please verify that this is correctly set up. We do not foresee a need to change this setup, but should that prove necessary, the responsbile <tt>fuse-start</tt> and <tt>fuse-stop</tt> scripts can be found under <tt>ramdisk/CN/tree/bin/</tt>. | ||
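
A quick check from the compute node shell (a sketch; <tt>/home</tt> should show up as a symbolic link pointing to <tt>/fuse/home</tt>, and <tt>/fuse/</tt> should list the file system roots visible on the I/O node):

<pre>
~ # ls -ld /home
~ # ls /fuse/
</pre>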
+ | |||
+ | ==Shell script job== | ||
+ | |||
+ | Assuming that the above steps have been successful, one can now test running a simple job from a network filesystem, such as one's home directory. | ||
+ | |||
+ | Here is a sample shell script to try: | ||
+ | |||
+ | <pre> | ||
+ | #!/bin/sh | ||
+ | |||
+ | . /proc/personality.sh | ||
+ | |||
+ | while true; do | ||
+ | echo "Node $BG_RANK_IN_PSET running (stdout)" | ||
+ | echo "Node $BG_RANK_IN_PSET running (stderr)" 1>&2 | ||
+ | sleep 10 | ||
+ | done | ||
+ | </pre> | ||
+ | |||
+ | (please see the [[FAQ#Pset rank|FAQ]] for the explanation of <tt>/proc/personality.sh</tt> and <tt>BG_RANK_IN_PSET</tt>) | ||
+ | |||
+ | Create the script file on a network filesystem that is available on the I/O nodes, set the executable bit (<tt>chmod 755</tt>) and submit it. Verify that the script starts correctly and that at least the standard error output is visible immediately. The script prints a line of output from each node every ten seconds. It does so both to the standard output and to the standard error, because, depending on software configuration, the standard output stream could be buffered on the service node. If that is the case, kill the job and verify that the standard output data did appear. | ||
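
For example, assuming the script was saved as <tt>test.sh</tt> (a hypothetical name) in the current directory on a network filesystem, and submitting with Cobalt:

<pre>
$ chmod 755 test.sh
$ cqsub -k <profile-name> -t <time> -n 4 $PWD/test.sh
</pre>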
+ | |||
+ | ==MPI and OpenMP jobs== | ||
+ | |||
+ | The final tests involve parallel programming jobs, respectively MPI and OpenMP. Use the test programs provided with the distribution. From the top level directory: | ||
<pre> | <pre> | ||
$ cd comm/testcodes | $ cd comm/testcodes | ||
</pre> | </pre> | ||
===Compiling===

The programs can be compiled on a login node using:

<pre>
$ /path/to/install/bin/zmpicc -o mpi-test-linux mpi-test.c
$ /path/to/install/bin/zmpixlc_r -qsmp=omp -o omp-test-linux omp-test.c
</pre>
+ | |||
+ | ===Submitting=== | ||
+ | |||
+ | Submit the MPI test like any other job; use one of the below commands: | ||
+ | |||
+ | <pre> | ||
+ | $ cqsub -k <profile-name> -t <time> -n <number-of-processes> $PWD/mpi-test-linux | ||
+ | $ qsub --kernel <profile-name> -t <time> -n <number-of-processes> $PWD/mpi-test-linux | ||
+ | $ mpirun -verbose 1 -partition <partition-name> -np <number-of-processes> -timeout <time> \ | ||
+ | -cwd $PWD -exe $PWD/omp-test-linux | ||
</pre> | </pre> | ||
For the OpenMP test, we pass the number of OpenMP threads to use in the <tt>OMP_NUM_THREADS</tt> environment variable:

<pre>
$ cqsub -k <profile-name> -t <time> -n 1 -e OMP_NUM_THREADS=<num> $PWD/omp-test-linux
$ qsub --kernel <profile-name> -t <time> -n 1 --env OMP_NUM_THREADS=<num> $PWD/omp-test-linux
$ mpirun -verbose 1 -partition <partition-name> -np 1 -timeout <time> \
      -cwd $PWD -env OMP_NUM_THREADS=<num> -exe $PWD/omp-test-linux
</pre>
+ | |||
+ | The MPI test benchmarks the performance of various MPI operations. The OpenMP test is just a parallel "Hello world". | ||
+ | |||
+ | '''Note:''' see the [[FAQ#Why large MPI processes do not work|FAQ]] if submitting larger MPI processes does not work properly. | ||
----
[[Installation]] | [[ZeptoOS_Documentation|Top]] | [[MPICH, DCMF, and SPI]]