Difference between revisions of "ZOID"
Revision as of 17:03, 22 April 2009
Introduction
ZOID is an I/O forwarding component of the ZeptoOS project. Any communication between the compute nodes and I/O nodes (job management, file I/O, sockets) is facilitated by ZOID.
ZOID infrastructure consists of:
- a multithreaded zoid daemon on the I/O nodes, which performs I/O forwarding for the compute nodes and communicates with the service node to perform job management,
- a control daemon on the compute nodes, which is responsible for job management tasks such as launching application processes, for forwarding stdin/out/err data, and for forwarding IP packets,
- a zoid-fuse daemon on the compute nodes, which performs file I/O forwarding for POSIX-compliant applications.
User interface
User script
Right before a job starts running, and right after the last process of a job has terminated, the ZOID daemon attempts to invoke a user script on the I/O nodes. By default, the daemon invokes $HOME/zoid-user-script.sh (an administrator can change this pathname). A single parameter is passed to the script: 1 at job startup and 0 at termination.
Information about the job will be passed to the script in the following environment variables:
- ZOID_JOB_EXEC
- name of the job executable,
- ZOID_JOB_ARGS
- job arguments, separated by colons (:),
- ZOID_JOB_ENV
- job environment variables, separated by colons (:),
- ZOID_JOB_ID
- BG/P control system job id (Note: this is generally different from the Cobalt job ID; see FAQ for the latter),
- ZOID_JOB_GLOBAL_SIZE
- number of processes in the job (size of MPI_COMM_WORLD),
- ZOID_JOB_LOCAL_SIZE
- number of job processes handled by this I/O node,
- ZOID_JOB_MODE
- 0 for SMP, 1 for VN, and 2 for DUAL,
- SHELL, PATH, USER, and HOME
- will also be set...
Note: the user script is invoked synchronously by the daemon, i.e., the job will not start running until the script terminates. If you need processes to run on I/O nodes while the job is running, start them in the background (&).
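As an illustration of how these variables might be used, here is a minimal sketch of a user script that logs a one-line job summary at startup and termination. The log file location (ZOID_LOG) and the helper function names are hypothetical and not part of ZOID; only the ZOID_JOB_* variables and the 1/0 parameter come from the description above. Note that splitting on : cannot distinguish a separator from a colon inside an argument.

```shell
#!/bin/sh
# Sketch of a user script that logs job metadata at startup and termination.
# Assumption: ZOID_LOG is a hypothetical variable naming the log file; it is
# not something ZOID itself defines.
ZOID_LOG=${ZOID_LOG:-$HOME/zoid-job.log}

# Print each colon-separated field of $1 on its own line.
split_colon_list() {
    printf '%s\n' "$1" | tr ':' '\n'
}

# Translate the numeric ZOID_JOB_MODE into its name.
mode_name() {
    case "$1" in
        0) echo SMP ;;
        1) echo VN ;;
        2) echo DUAL ;;
        *) echo unknown ;;
    esac
}

# Emit a one-line summary; $1 is the script parameter (1 = startup, 0 = termination).
job_summary() {
    if [ "$1" = "1" ]; then phase=start; else phase=end; fi
    printf '%s job=%s exec=%s mode=%s procs=%s/%s\n' \
        "$phase" "${ZOID_JOB_ID:-?}" "${ZOID_JOB_EXEC:-?}" \
        "$(mode_name "${ZOID_JOB_MODE:-}")" \
        "${ZOID_JOB_LOCAL_SIZE:-?}" "${ZOID_JOB_GLOBAL_SIZE:-?}"
}

# Only act when the daemon passes the startup/termination parameter.
if [ "$#" -gt 0 ]; then
    job_summary "$1" >>"$ZOID_LOG"
    if [ "$1" = "1" ] && [ -n "${ZOID_JOB_ARGS:-}" ]; then
        # List the job arguments once, at startup.
        split_colon_list "$ZOID_JOB_ARGS" | sed 's/^/  arg: /' >>"$ZOID_LOG"
    fi
fi
```

Everything here runs in the foreground and returns quickly, so it does not delay the job noticeably.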
File broadcast
A /bin.rd/f2cn command is available on the I/O nodes for efficient, hardware-assisted broadcasting of files to all the compute nodes handled by the given I/O node.
The command takes two arguments:
- absolute pathname to the input file on the I/O node,
- absolute pathname to the output file on the compute nodes.
The input file does not need to be physically on the I/O node; it can be on a network filesystem mounted on the node. The file will be created in the ramdisk of each compute node.
The throughput is in practice limited by how fast the input file can be read; we have seen results in excess of 300 MB/s for files residing in the I/O node ramdisk.
Note: all the compute nodes in the pset must be up and running. Do not use this command on incomplete partitions (e.g., a one-process job on a 64-node partition); you will likely hang the ZOID daemon if you try.
Note2: this feature can safely be used from within a user script, so one can, e.g., pre-stage large binaries, like this:
User script ($HOME/zoid-user-script.sh):
    #!/bin/sh

    if [ "$1" -eq "1" ]; then
        /bin.rd/f2cn $HOME/large_binary /tmp/large_binary
    fi
    exit 0
Job script (submitted using Cobalt or mpirun):
    #!/bin/sh

    chmod 755 /tmp/large_binary
    /tmp/large_binary
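A slightly more defensive variant of the pre-staging user script could check the two requirements stated above (both pathnames absolute, input file readable) before invoking the broadcast. This is only a sketch: the stage_file and is_absolute helpers and the F2CN override variable are hypothetical, and nothing is assumed here about f2cn's own exit status or options.

```shell
#!/bin/sh
# Assumption: F2CN is a hypothetical override (handy for dry runs);
# on an I/O node the command is /bin.rd/f2cn.
F2CN=${F2CN:-/bin.rd/f2cn}

# Succeed only if $1 is an absolute pathname.
is_absolute() {
    case "$1" in
        /*) return 0 ;;
        *)  return 1 ;;
    esac
}

# Validate both pathnames, then broadcast $1 to $2 on the compute nodes.
stage_file() {
    src=$1
    dst=$2
    if ! is_absolute "$src" || ! is_absolute "$dst"; then
        echo "stage_file: both pathnames must be absolute" >&2
        return 2
    fi
    if [ ! -r "$src" ]; then
        echo "stage_file: cannot read $src" >&2
        return 1
    fi
    "$F2CN" "$src" "$dst"
}

# Pre-stage only at job startup (parameter 1).
if [ "${1:-}" = "1" ]; then
    stage_file "$HOME/large_binary" /tmp/large_binary
fi
```

Setting F2CN=echo prints the command line that would be run, which allows checking the script on a login node before using it on a full partition.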
Performance counters
A /bin.rd/statquery command is available on the I/O nodes for obtaining the performance counters of the I/O daemon.
The command takes a single optional argument:
- the interval between successive queries, in seconds.
If the argument is not provided, the command will terminate after the first query.
Here is sample output from a single query:

    Timestamp: 1240439085.688831
    Total messages sent: 5767
    Total bytes sent: 7619170
    Total messages received: 5717
    Total bytes received: 72575
    IP fwd messages sent: 196
    IP fwd bytes sent: 5889
    IP fwd messages received: 84
    IP fwd bytes received: 6453
    Stream messages sent: 65
    Stream bytes sent: 520
    Stream messages received: 65
    Stream bytes received: 1416
    Broadcast messages sent: 1
    Broadcast bytes sent: 2437906
    Internal messages sent: 193
    Internal bytes sent: 39524
    Internal messages received: 256
    Internal bytes received: 1792
    Plugin 5 messages sent: 0
    Plugin 5 bytes sent: 0
    Plugin 5 messages received: 0
    Plugin 5 bytes received: 0
    Plugin 2 messages sent: 5312
    Plugin 2 bytes sent: 5135331
    Plugin 2 messages received: 5312
    Plugin 2 bytes received: 62914
The meaning of the fields is as follows:
- Timestamp
- number of seconds and microseconds from the epoch, as returned by gettimeofday(2),
- IP fwd
- IP packet forwarding between compute nodes and I/O nodes
- Stream
- stdin/out/err streams,
- Broadcast
- file broadcasts
- Internal
- job control messages, etc.
- Plugin 5
- internal mapping plugin, used by MPI
- Plugin 2
- unix plugin (POSIX file I/O)
The counters are 64-bit integers, so they will take a while to overflow :-).
Example user script ($HOME/zoid-user-script.sh) that samples the statistics every 60 seconds and writes them to a unique file:
    #!/bin/sh

    if [ "$1" -eq "1" ]; then
        /bin.rd/statquery 60 >$HOME/zoid_stats.$ZOID_JOB_ID.`hostname` &
    fi
    exit 0
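A file collected this way can be post-processed to turn the cumulative counters into rates. The following is a rough sketch, assuming that each query writes one "name: value" pair per line and begins with a Timestamp line, as in the sample output; the counter_rate function name is hypothetical. awk stores numbers as double-precision floats, so values are exact only up to 2^53, which is ample for typical runs even though the counters themselves are 64-bit.

```shell
#!/bin/sh
# Print the average per-second rate of one counter between successive samples.
# $1 is the counter name exactly as printed, e.g. "Total bytes sent".
# Input (statquery output, one or more samples) is read from stdin.
counter_rate() {
    awk -v name="$1" '
        BEGIN { key = name ":" }
        # Remember the timestamp that opens each sample.
        index($0, "Timestamp:") == 1 { t = $2; next }
        # Match the counter only at the start of a line, so that
        # "Total bytes sent" does not also match "IP fwd bytes sent".
        index($0, key) == 1 {
            v = substr($0, length(key) + 1) + 0
            if (seen && t > pt) printf "%.1f\n", (v - pv) / (t - pt)
            pv = v; pt = t; seen = 1
        }
    '
}
```

For example, counter_rate "Plugin 2 bytes sent" <$HOME/zoid_stats.JOBID.HOSTNAME (with the actual file name substituted) prints one line per sampling interval.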
Administrator interface
The zoid I/O daemon accepts a number of command-line options that can be used to change its behavior. They can be adjusted by editing the ramdisk/ION/ramdisk-add/etc/sysconfig/zoid file:
- ZOID_BUFFER_SIZE (-b)
- 4096:4195328
- ZOID_ACK_THRESHOLD (-a)
- 8
- ZOID_MODULES (-m)
- "unix_impl.so:unix_preload.so:mapping_impl.so:mapping_preload.so"
- ZOID_ENABLE_NAT (-n)
- off
- ZOID_USER_SCRIPT (-u)
- "/bin.rd/zoid-user-script.sh"
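For orientation, the settings above might appear in the sysconfig file roughly as follows. This is a sketch that assumes the file uses ordinary shell variable assignments, as sysconfig files conventionally do; the exact syntax expected by the ZeptoOS startup scripts should be checked against the file shipped in the ramdisk. The values are simply those listed above.

```shell
# Assumed shell-style syntax for etc/sysconfig/zoid; values copied from the list above.
ZOID_BUFFER_SIZE=4096:4195328
ZOID_ACK_THRESHOLD=8
ZOID_MODULES="unix_impl.so:unix_preload.so:mapping_impl.so:mapping_preload.so"
ZOID_ENABLE_NAT=off
ZOID_USER_SCRIPT="/bin.rd/zoid-user-script.sh"
```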
Programmer interface
Building a new plugin
Replacement libc?