Difference between revisions of "ZOID"
Revision as of 14:54, 23 April 2009
Introduction
ZOID is an I/O forwarding component of the ZeptoOS project. Any communication between the compute nodes and I/O nodes (job management, file I/O, sockets) is facilitated by ZOID.
The ZOID infrastructure consists of:
- a multithreaded zoid daemon on the I/O nodes, which performs I/O forwarding for the compute nodes and also communicates with the service node to perform job management,
- a control daemon on the compute nodes, which is responsible for job management tasks such as launching application processes, for the forwarding of stdin/out/err data, and for the forwarding of IP packets,
- a zoid-fuse daemon on the compute nodes, which performs file I/O forwarding for POSIX-compliant applications.
User interface
User script
Right before a job starts running, and right after the last process of a job has terminated, the ZOID daemon attempts to invoke a user script on the I/O nodes. By default, the daemon invokes $HOME/zoid-user-script.sh (this pathname can be changed by an administrator). A single parameter is passed to the script: 1 at job startup and 0 at job termination.
Information about the job will be passed to the script in the following environment variables:
- ZOID_JOB_EXEC
- name of the job executable,
- ZOID_JOB_ARGS
- job arguments, separated by a :
- ZOID_JOB_ENV
- job environment variables, separated by a :
- ZOID_JOB_ID
- BG/P control system job id (Note: this is generally different from the Cobalt job ID; see FAQ for the latter),
- ZOID_JOB_GLOBAL_SIZE
- number of processes in the job (size of MPI_COMM_WORLD),
- ZOID_JOB_LOCAL_SIZE
- number of job processes handled by this I/O node,
- ZOID_JOB_MODE
- 0 for SMP, 1 for VN, and 2 for DUAL,
- SHELL, PATH, USER, and HOME
- will also be set...
Note: the user script is invoked synchronously by the daemon, i.e., the job will not start running until the script terminates. If you need processes to run on I/O nodes while the job is running, start them in the background (&).
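A helper program started from the user script can read these variables directly. The following is a minimal sketch in C; the function names zoid_mode_name and zoid_count_fields are hypothetical and not part of ZOID. It assumes that ZOID_JOB_MODE holds the plain decimal strings listed above and that a literal colon never appears inside an individual argument (the text does not say how such a colon would be escaped).

```c
#include <string.h>

/*
 * Map a ZOID_JOB_MODE value to the mode names listed above.
 * Typical use: zoid_mode_name(getenv("ZOID_JOB_MODE")).
 */
const char *zoid_mode_name(const char *mode)
{
    if (mode == NULL)
        return "unset";
    if (strcmp(mode, "0") == 0)
        return "SMP";
    if (strcmp(mode, "1") == 0)
        return "VN";
    if (strcmp(mode, "2") == 0)
        return "DUAL";
    return "unknown";
}

/*
 * Count the ':'-separated fields in a ZOID_JOB_ARGS or ZOID_JOB_ENV value.
 * Assumption: ':' is never escaped inside a field.
 */
size_t zoid_count_fields(const char *value)
{
    size_t n;

    if (value == NULL || *value == '\0')
        return 0;
    n = 1;
    for (; *value != '\0'; value++)
        if (*value == ':')
            n++;
    return n;
}
```

Returning "unset" or "unknown" instead of failing keeps the helper usable when it is run by hand outside of a job.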
File broadcast
A /bin.rd/f2cn command is available on the I/O nodes for a very efficient (hardware-assisted) broadcasting of files to all the compute nodes handled by the given I/O node.
The command takes two arguments:
- absolute pathname to the input file on the I/O node,
- absolute pathname to the output file on the compute nodes.
The input file does not need to be physically on the I/O node; it can be on a network filesystem mounted on the node. The file will be created in the ramdisk of each compute node.
The throughput is in practice limited by how fast the input file can be read; we have seen results in excess of 300 MB/s for files residing in the I/O node ramdisk.
Note: all the compute nodes in the pset must be up and running. Do not use this command on incomplete partitions (e.g., a one-process job on a 64-node partition); you will likely hang the ZOID daemon if you try.
Note2: this feature can safely be used from within a user script, so one can, e.g., pre-stage large binaries, like this:
User script ($HOME/zoid-user-script.sh):
    #!/bin/sh
    if [ "$1" -eq "1" ]; then
        /bin.rd/f2cn $HOME/large_binary /tmp/large_binary
    fi
    exit 0
Job script (submitted using Cobalt or mpirun):
    #!/bin/sh
    chmod 755 /tmp/large_binary
    /tmp/large_binary
Performance counters
A /bin.rd/statquery command is available on the I/O nodes for obtaining the performance counters of the I/O daemon.
The command takes a single optional argument:
- the interval between successive queries, in seconds.
If the argument is not provided, the command will terminate after the first query.
Here is the sample output generated:
    Timestamp: 1240439085.688831
    Total messages sent: 5767
    Total bytes sent: 7619170
    Total messages received: 5717
    Total bytes received: 72575
    IP fwd messages sent: 196
    IP fwd bytes sent: 5889
    IP fwd messages received: 84
    IP fwd bytes received: 6453
    Stream messages sent: 65
    Stream bytes sent: 520
    Stream messages received: 65
    Stream bytes received: 1416
    Broadcast messages sent: 1
    Broadcast bytes sent: 2437906
    Internal messages sent: 193
    Internal bytes sent: 39524
    Internal messages received: 256
    Internal bytes received: 1792
    Plugin 5 messages sent: 0
    Plugin 5 bytes sent: 0
    Plugin 5 messages received: 0
    Plugin 5 bytes received: 0
    Plugin 2 messages sent: 5312
    Plugin 2 bytes sent: 5135331
    Plugin 2 messages received: 5312
    Plugin 2 bytes received: 62914
The meaning of the fields is as follows:
- Timestamp
- number of seconds and microseconds from the epoch, as returned by gettimeofday(2),
- IP fwd
- IP packet forwarding between compute nodes and I/O nodes
- Stream
- stdin/out/err streams,
- Broadcast
- file broadcasts
- Internal
- job control messages, etc.
- Plugin 5
- internal mapping plugin, used by MPI
- Plugin 2
- unix plugin (POSIX file I/O)
The counters are 64-bit integers, so they will take a while to overflow :-).
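For post-processing the sampled statistics, a counter can be extracted from a line of statquery output roughly as follows. This is a sketch only: parse_counter is a hypothetical helper, and it assumes that each counter is printed on its own line in the form "label: value". A 64-bit type is used to match the counter width stated above.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Look for "<label>: <number>" at the start of `line`.
 * Returns 1 and stores the counter in *value on a match, 0 otherwise.
 */
int parse_counter(const char *line, const char *label, uint64_t *value)
{
    size_t len = strlen(label);

    /* The label must match exactly and be followed by a colon. */
    if (strncmp(line, label, len) != 0 || line[len] != ':')
        return 0;
    /* sscanf skips the blank after the colon before reading the number. */
    return sscanf(line + len + 1, "%" SCNu64, value) == 1;
}
```

Subtracting the values obtained from two successive samples and dividing by the sampling interval gives an average rate for that counter.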
Example user script ($HOME/zoid-user-script.sh) that samples the statistics every 60 seconds and writes them to a unique file:
    #!/bin/sh
    if [ "$1" -eq "1" ]; then
        /bin.rd/statquery 60 >$HOME/zoid_stats.$ZOID_JOB_ID.`hostname` &
    fi
    exit 0
Administrator interface
The zoid I/O daemon accepts a number of command-line options that can be used to change its behavior. They can be adjusted by editing the ramdisk/ION/ramdisk-add/etc/sysconfig/zoid file and rebuilding the I/O node ramdisk:
- ZOID_BUFFER_SIZE (-b)
- Specifies the size of the buffers used for messages. Because a separate buffer is needed for a request and a reply, and typically no more than one of these needs to be large, to save memory ZOID supports buffers of two sizes: a small one (4 KB by default) and a large one (4 MB+1 KB by default – the 1 KB is there to accommodate the headers). Use colon (:) to separate the two sizes when customizing this value. If desired, support for two separate buffer sizes can be disabled by providing only one value to this option.
- ZOID_ACK_THRESHOLD (-a)
- Specifies a size threshold for the rendezvous protocol for messages coming from the compute nodes, in units of tree network packets (240 bytes each). An eager protocol is used for messages below the threshold. Messages above the threshold use flow control in the form of a rendezvous protocol with message acknowledgements; basically, the daemon will only receive one large message at a time, which improves predictability and overall throughput. The daemon default for this option is to not use acknowledgements, but the config file defaults to a value of 8, which is the size of the hardware FIFO buffer of the tree network device. Set this option to 0 (or comment it out altogether) to disable message acknowledgements.
- ZOID_MODULES (-m)
- Specifies a :-separated list of ZOID plugins to load. This defaults to "unix_impl.so:unix_preload.so:mapping_impl.so:mapping_preload.so" in the config file; do not remove any of these or basic system services will stop working. The unix plugin provides POSIX file I/O support, while mapping is used by our MPI implementation to map between MPI ranks and Blue Gene X/Y/Z/T coordinates. Custom plugins can be created and added here; see Programmer interface for details.
- ZOID_ENABLE_NAT (-n)
- Enables network address translation (NAT) for IP packets coming from the compute nodes, allowing compute nodes to communicate with the outside world. This support is disabled by default because we have found that it has a detrimental effect on the overall performance of the TCP/IP stack on the I/O nodes, slowing down network filesystems. This feature can also be enabled on a per-job basis by setting the ZOID_ENABLE_NAT environment variable when submitting a job (see the FAQ).
- ZOID_USER_SCRIPT (-u)
- Specifies the pathname to the user script; it defaults to "/bin.rd/zoid-user-script.sh". This script can be found in ramdisk/ION/ramdisk-add/bin/zoid-user-script.sh; it sets a few environment variables and then invokes the user's custom $HOME/zoid-user-script.sh. Hence, to adjust this behavior, you can either change this option or edit the script in the ramdisk.
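The numbers behind ZOID_ACK_THRESHOLD and ZOID_BUFFER_SIZE can be worked out with a few lines of C. The sketch below is illustrative only: the function names are hypothetical, MB and KB are assumed to mean 2^20 and 2^10 bytes, and the handling of a message whose size is exactly equal to the threshold is an assumption (the text only covers "below" and "above").

```c
#include <stddef.h>

/* Tree network packet size in bytes, as stated in the text. */
#define TREE_PACKET_BYTES 240u

/* Convert a ZOID_ACK_THRESHOLD value (in packets) to bytes. */
size_t ack_threshold_bytes(unsigned packets)
{
    return (size_t)packets * TREE_PACKET_BYTES;
}

/*
 * Return 1 if a message of the given size would use the rendezvous
 * protocol, 0 if it would be sent eagerly. A threshold of 0 disables
 * acknowledgements. Assumption: a message exactly at the threshold
 * is treated as eager.
 */
int uses_rendezvous(size_t message_bytes, unsigned packets)
{
    if (packets == 0)
        return 0;
    return message_bytes > ack_threshold_bytes(packets);
}

/* Default large buffer: 4 MB of payload plus 1 KB for the headers. */
size_t default_large_buffer_bytes(void)
{
    return (size_t)4 * 1024 * 1024 + 1024;
}
```

With the config file default of 8 packets, the threshold works out to 8 x 240 = 1920 bytes, and the default large buffer is 4195328 bytes under the stated assumption about units.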
Programmer interface
ZOID is a flexible, extensible, high-performance function call forwarding (RPC) infrastructure. Built-in features and the standard plug-ins provide familiar POSIX file I/O and BSD socket interfaces, but, because of the number of software layers involved, they introduce a significant overhead. For applications requiring maximum bandwidth between the compute and I/O nodes, ZOID provides an option of a customized function call forwarding with minimal overheads. This section provides an overview of how to create such a custom plug-in.
Overview
All that ZOID provides is a function call forwarding support, and a limited one at that. Any logic (caching, prefetching, etc.) needs to be custom-built on top of it.
Follow existing plug-ins, found in packages/zoid/src/, as examples. The unix plug-in is generally the most up to date, but other plug-ins such as mapping, zoidfs, and barrier should also be fine.
A plug-in consists of automatically generated client-side and server-side stubs (which perform the marshalling and demarshalling of function call parameters and results, the forwarding of the function call, etc.), and of a hand-written server-side implementation which provides the implementation code for the forwarded function calls. One might also decide to provide hand-written client-side wrappers to hide some details of the ZOID API (such as the error handling) or to adhere to a particular existing API, as is the case with the unix plug-in (the wrappers used by the FUSE client are available in packages/zoid/src/unix/stubs/; another version is in the GNU libc sources, in packages/glibc/src/zoid/sysdeps/unix/sysv/linux/powerpc/powerpc32/).
The scanner.pl script, found in packages/zoid/src/, creates the automatically-generated client and server stubs based on a hand-written input header file described below. Again, please follow the examples from the existing plug-ins, such as unix or mapping. The Makefile in those plug-ins is written in a generic fashion and should only require a change to the PREFIX line to be usable with another plug-in. Use that Makefile to invoke the scanner.pl script and to compile the generated source files.
Input header file
The input header file must be a valid C header file with additional hints in the comments. The file is read by the scanner.pl script.
The parser in the script is rather limited and does not handle many C constructs. It is thus essential that the header file be as simple as possible. In particular, function prototypes should be specified at the end of the file, not intermixed with any other specifications such as data type definitions.
Ordinary comments are best placed on separate lines.
Note: the parser is case-sensitive.
Start line
Any complex declarations that the scanner cannot parse should be placed at the top of the file, because the parser ignores everything until it encounters the following magic start line:
/* START-ZOID-SCANNER ID=<n> INIT=<s1> FINI=<s2> PROC=<s3> */
- ID=<n>
- Each plug-in needs a unique, 16-bit identifier, passed in <n>. The following identifiers are already in use: 0 (internal), 1 (zoidfs plug-in), 2 (unix), 3 (lofar), 4 (test), 5 (mapping), and 10 (ftb).
- INIT=<s1>
- <s1> provides a name of an initialization function from the server-side implementation code which will be invoked before a job starts running. The arguments passed are described later in Server-side API. If a plug-in does not need this feature, please specify INIT=NULL.
- FINI=<s2>
- <s2> provides a name of a termination function from the server-side implementation code which will be invoked after all job processes have exited. No arguments are passed to this function. If a plug-in does not need this feature, please specify FINI=NULL.
- PROC=<s3>
- <s3> provides the name of a callback function from the server-side implementation code which will be invoked at the startup and termination of every application process. The arguments passed are described later in Server-side API. If a plug-in does not need this feature, please specify PROC=NULL.
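When picking an ID for a new plug-in, the constraints above (a 16-bit value that is not already taken) can be checked mechanically. The helper below is a hypothetical sketch, not part of ZOID; it assumes the identifier is unsigned and hard-codes the list of identifiers given above, which may be out of date for other ZOID versions.

```c
#include <stddef.h>

/* Identifiers the text lists as already in use. */
static const unsigned long used_ids[] = { 0, 1, 2, 3, 4, 5, 10 };

/* Return 1 if `id` fits in 16 bits and is not in the list above. */
int plugin_id_is_free(unsigned long id)
{
    size_t i;

    if (id > 0xFFFFul)
        return 0;
    for (i = 0; i < sizeof used_ids / sizeof used_ids[0]; i++)
        if (used_ids[i] == id)
            return 0;
    return 1;
}
```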
Argument hints
Hints are generally needed to correctly encode and decode function arguments. They are placed after each argument, before a separating comma (or a closing bracket), and are embedded inside dedicated C comments. Multiple hints per argument are usually provided; these are separated by a colon (:). The following hints are currently defined:
- in, out, inout
- Specifies whether the argument is an input argument, an output argument, or both. in is the default.
- obj, str, ptr, arr, arr2d
- Specifies the type of the argument, respectively a plain object (say, an int, or a structure passed by value), a '\0'-terminated character string, a pointer to an object, an array of objects, or a two-dimensional array (type**, not type[][]). obj is the default.
- size
- Required for array arguments (arr and arr2d). Indicates the index of another argument in the same function, which is used to pass the array size. Both absolute indices (1 to the number of arguments) and relative ones (+1 for the next argument, -1 for the previous argument, etc.) are accepted.
For arr arguments, the size argument must be of a numerical type, or a pointer to such a type. For arr2d arguments, the size argument must itself be an array (an arr argument) of numerical elements, specifying the sizes along the less significant dimension of the array (the size of the more significant dimension is the size of the arr array itself).
Please note that the size is expressed in units of the base array type (thus, sizeof(int) for an array of ints), not in bytes. If you want it to be in bytes, make the array argument have type char* or void* (the latter relies on a GCC extension).
- nullok
- An option for arguments passed by pointer (basically, all but obj). If provided, it indicates that the argument is allowed to be NULL. This is not the default because supporting NULL pointers results in an additional computational and protocol overhead. If a NULL pointer is passed to an argument without the nullok flag, the client will crash.
- zerocopy
- An option for array arguments. Enables a more efficient marshalling/demarshalling protocol for the array, which does not use extra memory copies. Can be used for no more than one in argument and no more than one out argument. Zerocopy performance discusses performance considerations of using this option.
- userbuf
- An option for zerocopy; only supported for arr arguments. Enables a special form of zero-copy support, discussed in Zerocopy with custom output buffer and Zerocopy with custom input buffer.
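As an illustration of the start line and the hints described above, a small input header might look like the following sketch. The plug-in ID 42, the function names, and the exact size=+1 / size=-1 spelling of the size hint are assumptions made for this example (the text only states that hints are colon-separated and that relative indices are accepted); consult the header of an existing plug-in such as unix for the authoritative syntax. The function bodies at the end belong in the server-side implementation and are included here only so that the sketch compiles on its own.

```c
#include <stddef.h>

/* Declarations the scanner cannot parse would go above the start line. */

/* START-ZOID-SCANNER ID=42 INIT=NULL FINI=NULL PROC=NULL */

/* Input array whose element count is passed in the next argument. */
long example_sum(const int *values /* in:arr:size=+1 */,
                 size_t count /* in:obj */);

/* Output array whose element count is passed in the previous argument. */
void example_fill(int value /* in:obj */,
                  size_t count /* in:obj */,
                  int *out /* out:arr:size=-1 */);

/* --- server-side implementation (normally in a separate file) --- */

long example_sum(const int *values, size_t count)
{
    long total = 0;
    size_t i;

    /* count is in elements (units of sizeof(int)), not in bytes. */
    for (i = 0; i < count; i++)
        total += values[i];
    return total;
}

void example_fill(int value, size_t count, int *out)
{
    size_t i;

    for (i = 0; i < count; i++)
        out[i] = value;
}
```

Ordinary comments are kept on separate lines and the prototypes come last, in line with the parser restrictions mentioned earlier.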
Limitations
Generated files
For every function prototype found, the scanner generates two output files: one for a client calling the function and one for the server, where the function is in fact executed. Code in the generated files performs marshalling and demarshalling of function arguments and results.
Server-side API
The hand-written server-side implementation code needs to define all the functions listed in the header file. It should be compiled as a shared library; use the implementation/ subdirectory of the unix plug-in as an example. Please note that since ZOID is multi-threaded, multiple functions can be invoked at the same time, so you must ensure that your implementation is multi-thread-safe.
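One common way to meet the thread-safety requirement is to guard any state shared between calls with a mutex. The following is a generic sketch using POSIX threads; example_account and its shared counter are hypothetical and do not come from any existing plug-in.

```c
#include <pthread.h>
#include <stddef.h>

/* State shared by all daemon threads that may run this function. */
static pthread_mutex_t stats_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long long total_bytes;

/*
 * Hypothetical forwarded function: record `count` bytes and return the
 * running total. The mutex makes the read-modify-write step atomic with
 * respect to other threads calling this function concurrently.
 */
unsigned long long example_account(size_t count)
{
    unsigned long long snapshot;

    pthread_mutex_lock(&stats_lock);
    total_bytes += count;
    snapshot = total_bytes;
    pthread_mutex_unlock(&stats_lock);
    return snapshot;
}
```

Functions that only touch their own arguments and local variables need no such locking.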
Server-side stubs and the server-side implementation need to be passed as modules when invoking the ZOID I/O daemon, as described earlier.
Start-line functions
Implementation functions
Client-side API
In order to use the new interface, a compute node application will need to be linked with the client-side stubs and with a common support library libzoid_cn.a (a prebuilt version of the latter is in packages/zoid/prebuilt; sources are in packages/zoid/src/cnl/client).
Zerocopy alignment
Initialization
Error conditions
Additional considerations
Forwarding errno
Returning variable amounts of data in arrays
Hitting the maximum message size limit
Zerocopy performance
Because of the additional protocol overheads it introduces, the zerocopy option should be used only for potentially large memory buffers.