Difference between revisions of "ZOID"
Line 1: | Line 1: | ||
− | [[ | + | [[Ramdisk]] | [[ZeptoOS_Documentation|Top]] | [[(K)TAU]] |
---- | ---- | ||
==Introduction== | ==Introduction== | ||
− | ZOID is an I/O forwarding component of the ZeptoOS project. Any communication between the compute nodes and I/O nodes (job management, file I/O, sockets) is | + | ZOID is an I/O forwarding component of the ZeptoOS project. Any communication between the compute nodes and the I/O nodes (job management, file I/O, sockets) is handled by ZOID. |
ZOID infrastructure consists of: | ZOID infrastructure consists of: | ||
− | * | + | * A multithreaded <tt>zoid</tt> daemon on the I/O nodes which performs I/O forwarding for the compute nodes and which also communicates with the service node to perform job management, |
− | * <tt>control</tt> daemon on the compute nodes which is responsible for job management tasks such as launching application processes, for the forwarding of <tt>stdin</tt>/<tt>out</tt>/<tt>err</tt> data, and for forwarding of IP packets, | + | * <tt>control</tt> daemon on the compute nodes which is responsible for job management tasks such as the launching of application processes, for the forwarding of <tt>stdin</tt>/<tt>out</tt>/<tt>err</tt> data, and for the forwarding of IP packets, |
* <tt>zoid-fuse</tt> daemon on the compute nodes which performs file I/O forwarding for POSIX-compliant applications. | * <tt>zoid-fuse</tt> daemon on the compute nodes which performs file I/O forwarding for POSIX-compliant applications. | ||
==User interface== | ==User interface== | ||
+ | |||
+ | ZOID is meant to be transparent to users, but there are a few optional mechanisms available to interact with it. | ||
===User script=== | ===User script=== | ||
Line 21: | Line 23: | ||
: name of the job executable, | : name of the job executable, | ||
; <tt>ZOID_JOB_ARGS</tt> | ; <tt>ZOID_JOB_ARGS</tt> | ||
− | : job arguments, separated by | + | : job arguments, separated by colons (<tt>:</tt>) |
; <tt>ZOID_JOB_ENV</tt> | ; <tt>ZOID_JOB_ENV</tt> | ||
− | : job environment variables, separated by | + | : job environment variables, separated by colons (<tt>:</tt>) |
; <tt>ZOID_JOB_ID</tt> | ; <tt>ZOID_JOB_ID</tt> | ||
− | : BG/P control system job id ('''Note:''' this is generally different from the Cobalt job ID; see [[FAQ# | + | : BG/P control system job id ('''Note:''' this is generally different from the Cobalt job ID; see [[FAQ#How to obtain a Cobalt job ID|FAQ]] for the latter), |
; <tt>ZOID_JOB_GLOBAL_SIZE</tt> | ; <tt>ZOID_JOB_GLOBAL_SIZE</tt> | ||
− | : number of processes in the job (size of <tt>MPI_COMM_WORLD</tt>), | + | : the number of processes in the job (the size of <tt>MPI_COMM_WORLD</tt>), |
; <tt>ZOID_JOB_LOCAL_SIZE</tt> | ; <tt>ZOID_JOB_LOCAL_SIZE</tt> | ||
− | : number of job processes handled by this I/O node, | + | : the number of job processes handled by this I/O node, |
; <tt>ZOID_JOB_MODE</tt> | ; <tt>ZOID_JOB_MODE</tt> | ||
: <tt>0</tt> for SMP, <tt>1</tt> for VN, and <tt>2</tt> for DUAL, | : <tt>0</tt> for SMP, <tt>1</tt> for VN, and <tt>2</tt> for DUAL, | ||
Line 35: | Line 37: | ||
: will also be set... | : will also be set... | ||
− | ''' | + | '''Notes:''' |
+ | * The user script is invoked ''synchronously'' by the daemon, i.e., the job will not start running until the script terminates. If one needs some processes to run on the I/O nodes while the job is running, they should be started in the background (&). | ||
+ | * For this feature to work, [[#The /bin.rd/update_passwd_file.sh file|update_passwd_file.sh]] must be working correctly. | ||
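The colon-separated format of <tt>ZOID_JOB_ARGS</tt> and <tt>ZOID_JOB_ENV</tt> can be illustrated with a small C helper, for instance for a program started from the user script. This is only a sketch: <tt>split_colon_list</tt> is a hypothetical name, not part of ZOID, and the text does not say how a colon inside an argument would be represented, so no escaping is attempted.

```c
#include <stddef.h>
#include <string.h>

/* Split a colon-separated list (the format described for ZOID_JOB_ARGS
   and ZOID_JOB_ENV) into at most max fields.
   The input is modified in place: each separating ':' becomes '\0'.
   Empty fields are preserved. Returns the number of fields stored. */
static size_t split_colon_list(char *list, char **out, size_t max)
{
    size_t n = 0;
    char *p = list;

    if (list == NULL || *list == '\0')
        return 0;

    while (n < max) {
        out[n++] = p;
        p = strchr(p, ':');
        if (p == NULL)
            break;          /* last field reached */
        *p++ = '\0';        /* terminate this field, move to the next */
    }
    return n;
}
```

Because the helper writes into its input, a value obtained with <tt>getenv</tt> should be copied (e.g., with <tt>strdup</tt>) before it is split.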
===File broadcast=== | ===File broadcast=== | ||
Line 48: | Line 52: | ||
The throughput is in practice limited by how fast the input file can be read; we have seen results in excess of 300 MB/s for files residing in the I/O node ramdisk. | The throughput is in practice limited by how fast the input file can be read; we have seen results in excess of 300 MB/s for files residing in the I/O node ramdisk. | ||
− | '''Note:''' all the compute nodes in the pset must be up and running. Do not use this command on ''incomplete'' partitions (e.g., a one-process job on a 64-node partition); | + | '''Note:''' all the compute nodes in the pset must be up and running. Do not use this command on ''incomplete'' partitions (e.g., a one-process job on a 64-node partition); this will likely hang the ZOID daemon. |
'''Note2:''' this feature can safely be used from within a [[#User script|user script]], so one can, e.g., pre-stage large binaries, like this: | '''Note2:''' this feature can safely be used from within a [[#User script|user script]], so one can, e.g., pre-stage large binaries, like this: | ||
Line 79: | Line 83: | ||
If the argument is not provided, the command will terminate after the first query. | If the argument is not provided, the command will terminate after the first query. | ||
− | Here is | + | Here is a sample of the generated output: |
<pre> | <pre> | ||
Line 115: | Line 119: | ||
: number of seconds and microseconds from the epoch, as returned by <tt>gettimeofday(2)</tt>, | : number of seconds and microseconds from the epoch, as returned by <tt>gettimeofday(2)</tt>, | ||
; IP fwd | ; IP fwd | ||
− | : IP packet forwarding between compute nodes and I/O nodes | + | : IP packet forwarding between compute nodes and I/O nodes, |
; Stream | ; Stream | ||
: <tt>stdin</tt>/<tt>out</tt>/<tt>err</tt> streams, | : <tt>stdin</tt>/<tt>out</tt>/<tt>err</tt> streams, | ||
; Broadcast | ; Broadcast | ||
− | : [[#File broadcast|file broadcasts]] | + | : [[#File broadcast|file broadcasts]], |
; Internal | ; Internal | ||
: job control messages, etc. | : job control messages, etc. | ||
; Plugin 5 | ; Plugin 5 | ||
− | : internal <tt>mapping</tt> | + | : internal <tt>mapping</tt> plug-in, used by MPI, |
; Plugin 2 | ; Plugin 2 | ||
− | : <tt>unix</tt> plugin (POSIX file I/O) | + | : <tt>unix</tt> plug-in (POSIX file I/O). |
The counters are 64-bit integers, so they will take a while to overflow :-). | The counters are 64-bit integers, so they will take a while to overflow :-). | ||
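Since every query carries a <tt>gettimeofday(2)</tt>-style timestamp, two successive samples of the same counter can be turned into an average rate. The following is a minimal sketch; <tt>struct zoid_sample</tt> and <tt>sample_rate</tt> are hypothetical names, and the unit of the result is whatever unit the counter itself uses.

```c
#include <stdint.h>
#include <sys/time.h>

/* One sample: a timestamp in gettimeofday(2) format plus one counter value.
   (Hypothetical structure, for illustration only.) */
struct zoid_sample {
    struct timeval when;
    uint64_t counter;
};

/* Average rate, in counter units per second, between samples a and b.
   Returns 0.0 if no time has elapsed between the two samples. */
static double sample_rate(const struct zoid_sample *a,
                          const struct zoid_sample *b)
{
    double dt = (double)(b->when.tv_sec - a->when.tv_sec)
              + (double)(b->when.tv_usec - a->when.tv_usec) / 1e6;

    if (dt <= 0.0)
        return 0.0;

    /* Unsigned subtraction: the difference of two 64-bit counters. */
    return (double)(b->counter - a->counter) / dt;
}
```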
Line 140: | Line 144: | ||
==Administrator interface== | ==Administrator interface== | ||
+ | |||
+ | ===Configuration file=== | ||
The <tt>zoid</tt> I/O daemon accepts a number of command-line options that can be used to change its behavior. They can be adjusted by editing the <tt>ramdisk/ION/ramdisk-add/etc/sysconfig/zoid</tt> file and rebuilding the I/O node ramdisk: | The <tt>zoid</tt> I/O daemon accepts a number of command-line options that can be used to change its behavior. They can be adjusted by editing the <tt>ramdisk/ION/ramdisk-add/etc/sysconfig/zoid</tt> file and rebuilding the I/O node ramdisk: | ||
− | ; ZOID_BUFFER_SIZE (-b) | + | ; <span id="opt_buffer_size">ZOID_BUFFER_SIZE (-b)</span> |
− | : Specifies the size of the buffers used for messages. Because a separate buffer is needed for a request and a reply, and typically no more than one of these needs to be large, to save memory ZOID supports buffers of two sizes: a small one (4 KB by default) and a large one (4 MB+1 KB by default – the 1 KB is there to accommodate the headers). Use colon (:) to separate the two sizes when customizing this value. If desired, support for | + | : Specifies the size of the buffers used for messages. Because a separate buffer is needed for a request and a reply, and typically no more than one of these needs to be large, to save memory ZOID supports buffers of two sizes: a small one (4 KB by default) and a large one (4 MB+1 KB by default – the 1 KB is there to accommodate the headers). Use a colon (<tt>:</tt>) to separate the two sizes when customizing this value. If desired, support for the second buffer size can be disabled by providing only one value to this option. |
; ZOID_ACK_THRESHOLD (-a) | ; ZOID_ACK_THRESHOLD (-a) | ||
− | : Specifies a size threshold for the rendezvous protocol for messages coming from the compute nodes, in units of tree network packets (240 bytes each). An eager protocol is used for messages below the threshold. Messages above the threshold use flow control in the form of a rendezvous protocol with message acknowledgements; basically, the daemon will only receive one large message at a time, which improves predictability and overall throughput. The daemon default for this option is not to use acknowledgements, but the config file defaults to a value of <tt>8</tt>, which is the size of the hardware FIFO buffer of the tree network device. Set this option to 0 (or comment it out altogether) to disable message acknowledgements. | + | : Specifies a size threshold for the rendezvous protocol for messages coming from the compute nodes, in units of tree network packets (240 bytes of data each). An eager protocol is used for messages below the threshold. Messages above the threshold use flow control in the form of a rendezvous protocol with message acknowledgements; basically, the daemon will only receive one large message at a time, which improves predictability and overall throughput. The daemon default for this option is not to use acknowledgements, but the config file defaults to a value of <tt>8</tt>, which is the size of the hardware FIFO buffer of the tree network device. Set this option to 0 (or comment it out altogether) to disable message acknowledgements. |
− | ; ZOID_MODULES (-m) | + | ; <span id="opt_modules">ZOID_MODULES (-m)</span> |
− | : Specifies a <tt>:</tt>-separated list of ZOID | + | : Specifies a <tt>:</tt>-separated list of ZOID plug-ins to load. This defaults to <tt>"unix_impl.so:unix_preload.so:mapping_impl.so:mapping_preload.so"</tt> in the config file; do not remove any of these or basic system services will stop working. The <tt>unix</tt> plug-in provides POSIX file I/O support, while <tt>mapping</tt> is used by our MPI implementation to map between MPI ranks and Blue Gene X/Y/Z/T coordinates. Custom plug-ins can be created and added here; see [[#Programmer interface|Programmer interface]] for details. |
− | ; ZOID_ENABLE_NAT (-n) | + | ; <span id="opt_enable_nat">ZOID_ENABLE_NAT (-n)</span> |
− | : Enables network address translation (NAT) for IP packets coming from the compute nodes, allowing compute nodes to communicate with the outside world. This support is disabled by default because | + | : Enables network address translation (NAT) for IP packets coming from the compute nodes, allowing compute nodes to communicate with the outside world. This support is disabled by default because it was found to have a detrimental effect on the overall performance of the TCP/IP stack on the I/O nodes, slowing down network filesystems. This feature can also be enabled on a per-job basis by setting the <tt>ZOID_ENABLE_NAT</tt> environment variable when submitting a job (see the [[FAQ#How to open a socket from a CN to the outside world|FAQ]]). |
; <span id="opt_user_script">ZOID_USER_SCRIPT (-u)</span> | ; <span id="opt_user_script">ZOID_USER_SCRIPT (-u)</span> | ||
− | : Specifies the pathname to the [[#User script|user script]]; it defaults to <tt>"/bin.rd/zoid-user-script.sh"</tt>. This script can be found in <tt>ramdisk/ION/ramdisk-add/bin/zoid-user-script.sh</tt>; it sets a few environment variables and then invokes the user's custom <tt>$HOME/zoid-user-script.sh</tt>. Hence, | + | : Specifies the pathname to the [[#User script|user script]]; it defaults to <tt>"/bin.rd/zoid-user-script.sh"</tt>. This script can be found in <tt>ramdisk/ION/ramdisk-add/bin/zoid-user-script.sh</tt>; it sets a few environment variables and then invokes the user's custom <tt>$HOME/zoid-user-script.sh</tt>. Hence, to adjust the behavior of this feature, either change this option or the script in the ramdisk.<br/>'''Note:''' to be able to invoke a script from the user's home directory, [[#The /bin.rd/update_passwd_file.sh file|update_passwd_file.sh]] must be working correctly. |
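The numbers quoted for <tt>ZOID_BUFFER_SIZE</tt> and <tt>ZOID_ACK_THRESHOLD</tt> can be worked through in a few lines of C. This is an illustrative sketch, not code from the daemon: the function and constant names are made up, KB and MB are assumed to be binary units, a message is assumed to occupy the smallest number of 240-byte packets that holds it, and a message of exactly the threshold size is assumed to be sent eagerly (the text only specifies "below" and "above").

```c
#include <stddef.h>

/* Defaults quoted in the text, assuming binary units. */
enum {
    DEFAULT_SMALL_BUFFER = 4 * 1024,                /* 4 KB          */
    DEFAULT_LARGE_BUFFER = 4 * 1024 * 1024 + 1024   /* 4 MB + 1 KB   */
};

#define TREE_PACKET_DATA_BYTES 240u  /* data bytes per tree network packet */

/* Number of tree packets needed for a message (assumed: rounded up). */
static size_t packets_needed(size_t msg_bytes)
{
    return (msg_bytes + TREE_PACKET_DATA_BYTES - 1) / TREE_PACKET_DATA_BYTES;
}

/* Returns 1 if the message would use the rendezvous protocol,
   0 if it would be sent eagerly. A threshold of 0 disables
   acknowledgements, so everything is eager. */
static int uses_rendezvous(size_t msg_bytes, unsigned threshold_packets)
{
    if (threshold_packets == 0)
        return 0;
    return packets_needed(msg_bytes) > threshold_packets;
}
```

With the config file default of <tt>8</tt>, the threshold corresponds to 8 × 240 = 1920 bytes of data under these assumptions.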
+ | |||
+ | ===The /bin.rd/update_passwd_file.sh file=== | ||
+ | |||
+ | Allowing the partition owner to log into the I/O node using SSH is one of the features of the ZeptoOS software stack. Only the administrator and the partition owner are given login access; this is controlled by the <tt>/bin.rd/update_passwd_file.sh</tt> script, which is invoked by the daemon while the partition is being initialized. The script can be found in <tt>ramdisk/ION/ramdisk-add/bin/update_passwd_file.sh</tt>. | ||
+ | |||
+ | The script makes a number of assumptions that could be site-specific, so it might require an adjustment. The daemon invokes the script passing a numerical UNIX user ID of the partition owner as the only argument. The script then scans the <tt>/bgsys/iofs/etc/passwd</tt> for an entry with the same user ID (on Argonne machines, this file contains all valid account names). If a matching entry is found, it is appended to the <tt>/etc/passwd</tt> file in the I/O node ramdisk, thus enabling login access to the node for that user. | ||
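The matching step described above is performed by a shell script; the same logic can be sketched in C for a single line in the standard <tt>passwd(5)</tt> format (<tt>name:passwd:uid:gid:...</tt>). The function name is hypothetical, and the sketch makes no claim about how the actual script is written.

```c
#include <stdlib.h>
#include <string.h>

/* Return 1 if a passwd(5)-format line carries the given numeric user ID
   in its third field, 0 otherwise (including malformed lines). */
static int passwd_line_has_uid(const char *line, unsigned long uid)
{
    const char *p = line;
    char *end;
    unsigned long found;
    int field;

    /* Skip the name and password fields. */
    for (field = 0; field < 2; field++) {
        p = strchr(p, ':');
        if (p == NULL)
            return 0;
        p++;
    }

    /* The third field must be a number followed by another ':'. */
    found = strtoul(p, &end, 10);
    if (end == p || *end != ':')
        return 0;

    return found == uid;
}
```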
+ | |||
+ | If allowing ordinary users access to the I/O nodes is undesirable, one can simply put <tt>exit 0</tt> at the top of the script to disable it. | ||
+ | |||
+ | ===The /bin.rd/nat file=== | ||
+ | |||
+ | If NAT has been [[#opt_enable_nat|requested]], the daemon invokes the <tt>/bin.rd/nat</tt> script to enable it. This script can be found in <tt>ramdisk/ION/ramdisk-add/bin/nat</tt>. Generally, it should not require any modifications. | ||
==Programmer interface== | ==Programmer interface== | ||
− | + | ZOID is a flexible, extensible, high-performance function call forwarding (RPC) infrastructure. Built-in features and the standard plug-ins provide familiar POSIX file I/O and BSD socket interfaces, but, because of the number of software layers involved, they introduce a significant overhead. For applications requiring maximum bandwidth between the compute and I/O nodes, ZOID provides an option of a customized function call forwarding with minimal overheads. This section provides an overview of how to create such custom plug-ins. | |
+ | |||
+ | ===Overview=== | ||
+ | |||
+ | All that ZOID provides is function call forwarding support, and limited support at that. Any logic (caching, prefetching, etc.) needs to be custom-built on top of it. | ||
+ | |||
+ | Follow existing plug-ins, found in <tt>packages/zoid/src/</tt>, as examples. The <tt>unix</tt> plug-in is generally the most up to date, but other plug-ins such as <tt>mapping</tt>, <tt>zoidfs</tt>, <tt>barrier</tt>, and <tt>test</tt> should also be fine. | ||
+ | |||
+ | A plug-in consists of automatically generated client-side and server-side stubs (which perform the marshalling and demarshalling of function call parameters and results, the forwarding of the function call, etc.), and of a hand-written server-side implementation which provides the implementation code for the forwarded function calls. One might also decide to provide hand-written client-side wrappers to hide some details of the ZOID API (such as the error handling) or to adhere to a particular existing API, as is the case with the <tt>unix</tt> plug-in (the wrappers used by the FUSE client are available in <tt>packages/zoid/src/unix/stubs/</tt>; another version is in the GNU libc sources, in <tt>packages/glibc/src/zoid/sysdeps/unix/sysv/linux/powerpc/powerpc32/</tt>). | ||
+ | |||
+ | The <tt>scanner.pl</tt> script, found in <tt>packages/zoid/src/</tt>, creates the automatically-generated client and server stubs based on a hand-written input header file described below. Again, please follow the examples from the existing plug-ins, such as <tt>unix</tt> or <tt>mapping</tt>. The <tt>Makefile</tt> in those plug-ins is written in a generic fashion and should only require a change to the <tt>PREFIX</tt> line to be usable with another plug-in. Use that <tt>Makefile</tt> to invoke the <tt>scanner.pl</tt> script and to compile the generated source files. | ||
+ | |||
+ | ===Input header file=== | ||
+ | |||
+ | The input header file must be a valid C header file with additional hints in the comments. The file is read by the <tt>scanner.pl</tt> script. | ||
+ | |||
+ | The parser in the script is rather limited and does not handle many C constructs. It is thus essential that the header file be as simple as possible. In particular, function prototypes should be specified at the end of the file, not intermixed with any other specifications such as data type definitions. | ||
+ | |||
+ | Ordinary comments are best placed on separate lines. | ||
+ | |||
+ | '''Note:''' the parser is case ''sensitive''. | ||
+ | |||
+ | ====Start line==== | ||
+ | |||
+ | Any complex declarations that the scanner cannot parse should be placed at the top of the file, because the parser ignores everything until it encounters the following magic start line: | ||
+ | |||
+ | <pre> | ||
+ | /* START-ZOID-SCANNER ID=<n> INIT=<s1> FINI=<s2> PROC=<s3> */ | ||
+ | </pre> | ||
+ | |||
+ | ; ID=<n> | ||
+ | : Each plug-in needs a unique, 16-bit identifier, passed in <tt><n></tt>. The following identifiers are already in use: <tt>0</tt> (internal), <tt>1</tt> (<tt>zoidfs</tt> plug-in), <tt>2</tt> (<tt>unix</tt>), <tt>3</tt> (<tt>lofar</tt>), <tt>4</tt> (<tt>test</tt>), <tt>5</tt> (<tt>mapping</tt>), and <tt>10</tt> (<tt>ftb</tt>). | ||
+ | ; INIT=<s1> | ||
+ | : <tt><s1></tt> provides the name of an initialization function which will be invoked before a job starts running; see [[#Start-line functions|Start-line functions]] for more information. If a plug-in does not need this feature, please specify <tt>INIT=NULL</tt>. | ||
+ | ; FINI=<s2> | ||
+ | : <tt><s2></tt> provides the name of a termination function which will be invoked after all job processes have exited; see [[#Start-line functions|Start-line functions]] for more information. If a plug-in does not need this feature, please specify <tt>FINI=NULL</tt>. | ||
+ | ; PROC=<s3> | ||
+ | : <tt><s3></tt> provides the name of a callback function which will be invoked on the startup and termination of every application and ZOID-enabled process; see [[#Start-line functions|Start-line functions]] for more information. If a plug-in does not need this feature, please specify <tt>PROC=NULL</tt>. | ||
+ | |||
+ | ====Argument hints==== | ||
+ | |||
+ | Hints are generally needed by the scanner to correctly encode and decode function arguments. They need to be placed after each argument, before the separating comma (or the closing parenthesis), and should be embedded inside dedicated C comments. Multiple hints per argument are usually provided; these are separated by a colon (<tt>:</tt>). The following hints are currently defined: | ||
+ | |||
+ | ; in, out, inout | ||
+ | : Specifies whether the argument is an input argument, an output argument, or both. <tt>in</tt> is the default. | ||
+ | ; obj, str, ptr, arr, arr2d | ||
+ | : Specifies the type of the argument, respectively a plain object (say, an <tt>int</tt>, or a structure passed by value), a <tt>'\0'</tt>-terminated character string, a pointer to a plain object, an array of objects, or a two-dimensional array (<tt>type**</tt>, not <tt>type[][]</tt>). <tt>obj</tt> is the default. | ||
+ | ; size | ||
+ | : Required for array arguments (<tt>arr</tt> and <tt>arr2d</tt>). Indicates the index of another argument in the same function, which is used to pass the array size. Absolute numbers are accepted (<tt>1</tt> to ''number of arguments'') or relative ones (<tt>+1</tt> for the next argument, <tt>-1</tt> for the previous argument, etc.).<br/> For <tt>arr</tt> arguments, the size argument must be of a numerical type, or a pointer to such a type. For <tt>arr2d</tt> arguments, the size argument must itself be an array (an <tt>arr</tt> argument) of numerical elements, specifying the sizes along the less significant dimension of the array (the size of the more significant dimension is the size of the <tt>arr</tt> array itself).<br/> Please note that the unit of size for the numerical types is the size of the base array type (thus, <tt>sizeof(int)</tt> for an array of <tt>int</tt>s), not the byte. To have sizes counted in bytes, declare the array argument as <tt>char*</tt> or <tt>void*</tt> (the latter relies on a GCC extension). | ||
+ | ; nullok | ||
+ | : An option for arguments passed by pointer (basically, all but <tt>obj</tt>). If provided, it indicates that the argument is allowed to be <tt>NULL</tt>. This is not the default because supporting <tt>NULL</tt> pointers results in an additional computational and protocol overhead. '''Note:''' if a <tt>NULL</tt> pointer is passed to an argument that lacks the <tt>nullok</tt> flag, the client ''will'' crash. | ||
+ | ; zerocopy | ||
+ | : An option for array arguments. Enables a more efficient marshalling/demarshalling protocol for the array, which does not use extra memory copies. Can be used for no more than one <tt>in</tt> argument and no more than one <tt>out</tt> argument. [[#Zerocopy performance|Zerocopy performance]] discusses performance considerations when using this option. | ||
+ | ; userbuf | ||
+ | : An option for <tt>zerocopy</tt>; only supported for <tt>arr</tt> arguments. Enables a special form of zero-copy support, discussed in [[#Zerocopy with a custom output buffer|Zerocopy with a custom output buffer]] and [[#Zerocopy with a custom input buffer|Zerocopy with a custom input buffer]]. | ||
+ | |||
+ | Here is an example function prototype with the hints: | ||
+ | |||
+ | <pre> | ||
+ | int zoidfs_readlink(const zoidfs_handle_t * handle /* in:ptr */, | ||
+ | char * buffer /* out:arr:size=+1 */, | ||
+ | size_t buffer_length /* in:obj */); | ||
+ | </pre> | ||
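Putting the start line and the argument hints together, a complete input header for a small plug-in might look like the sketch below. Everything in it is hypothetical: the <tt>demo</tt> name, the identifier <tt>42</tt> (chosen only because it is not in the list of identifiers already in use), and both prototypes. It has not been run through <tt>scanner.pl</tt>.

```c
/* demo.h: hypothetical input header for scanner.pl.
   Declarations the scanner does not need to parse go above the start line. */

#include <stddef.h>

/* A plain structure without pointer fields, so it can be forwarded. */
typedef struct { int id; } demo_handle_t;

/* START-ZOID-SCANNER ID=42 INIT=NULL FINI=NULL PROC=NULL */

/* Sends count ints; the size hint refers to the next argument,
   and the unit of count is sizeof(int), not the byte. */
int demo_write(const demo_handle_t *handle /* in:ptr */,
               const int *data /* in:arr:size=+1 */,
               size_t count /* in:obj */);

/* Fills buf; count is passed by pointer in both directions. */
int demo_read(const demo_handle_t *handle /* in:ptr */,
              char *buf /* out:arr:size=+1 */,
              size_t *count /* inout:ptr */);
```

The prototypes are kept at the end of the file and the ordinary comments on separate lines, following the recommendations for the limited parser.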
+ | |||
+ | ====Limitations==== | ||
+ | |||
+ | As indicated earlier, the scanner is limited, so keep the prototypes simple. | ||
− | + | The return type of a forwarded function must be scalar or <tt>void</tt>. |
+ | |||
+ | Structures with pointer fields inside of them cannot be forwarded. | ||
+ | |||
+ | ====Generated files==== | ||
+ | |||
+ | For every function prototype found, the scanner generates two output files: one for a client calling the function and one for the server, where the function is in fact executed. Code in the generated files performs marshalling and demarshalling of function arguments and results. | ||
+ | |||
+ | Two more files per plug-in are generated: ''header''<tt>_defs.h</tt> and ''header''<tt>_dispatch.c</tt>. | ||
+ | |||
+ | None of the generated files should be modified. | ||
+ | |||
+ | ===Server-side API=== | ||
+ | |||
+ | Server-side stubs and the server-side implementation need to be passed as modules when invoking the ZOID I/O daemon, as described [[#opt_modules|earlier]]. | ||
+ | |||
+ | The hand-written server-side implementation code should include the <tt>zoid_api.h</tt> header file (available from <tt>packages/zoid/prebuilt/</tt>) and the plug-in input header file. | ||
+ | |||
+ | All the functions listed in the header file need to be defined in the server-side implementation code. The code needs to be compiled as a shared library; use the <tt>implementation/</tt> subdirectory of the <tt>unix</tt> plug-in as an example. Please note that since ZOID is multi-threaded, multiple functions can be invoked at the same time, so one must ensure that the implementation is multi-thread-safe. | ||
+ | |||
+ | ====Start-line functions==== | ||
+ | |||
+ | The following [[#Start line|start-line]] functions can be defined: | ||
+ | |||
+ | <pre> | ||
+ | void INIT(int pset_mpi_proc_count, int argc, int envc, const char* argenv); | ||
+ | </pre> | ||
+ | |||
+ | The INIT function is invoked during initialization, right before a job starts running. Arguments: | ||
+ | |||
+ | ; pset_mpi_proc_count | ||
+ | : The number of job processes that will be handled by this I/O node. Note that I/O nodes also handle additional ZOID-enabled processes, such as the FUSE clients, which are not included in this number. | ||
+ | ; argc | ||
+ | : The number of command-line arguments plus one. | ||
+ | ; envc | ||
+ | : The number of environment variables. | ||
+ | ; argenv | ||
+ | : An array of <tt>'\0'</tt>-terminated strings, one after another. The first string is the name of the job executable, followed by <tt>argc-1</tt> command-line arguments, followed by <tt>envc</tt> environment variables. | ||
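The packed layout of <tt>argenv</tt> (strings stored back to back, each terminated by <tt>'\0'</tt>) can be walked as sketched below. The helper names are hypothetical and not part of the ZOID API; they only illustrate the layout described above.

```c
#include <stddef.h>
#include <string.h>

/* Return the index-th string of a packed sequence of total
   '\0'-terminated strings, or NULL if index is out of range. */
static const char *argenv_get(const char *argenv, int total, int index)
{
    const char *p = argenv;
    int i;

    if (index < 0 || index >= total)
        return NULL;

    for (i = 0; i < index; i++)
        p += strlen(p) + 1;   /* skip one string and its terminator */

    return p;
}

/* Return the k-th environment variable (0-based). The environment
   starts after the argc strings (executable name plus arguments). */
static const char *argenv_env(const char *argenv, int argc, int envc, int k)
{
    if (k < 0 || k >= envc)
        return NULL;
    return argenv_get(argenv, argc + envc, argc + k);
}
```

An <tt>INIT</tt> function could call <tt>argenv_get(argenv, argc + envc, 0)</tt> to obtain the name of the job executable.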
+ | |||
+ | <pre> | ||
+ | void FINI(void); | ||
+ | </pre> | ||
+ | |||
+ | The FINI function is invoked after the last process of the job has terminated. | ||
+ | |||
+ | <pre> | ||
+ | void PROC(int added, int pset_pid); | ||
+ | </pre> | ||
+ | |||
+ | The PROC function is invoked on the startup and termination of every application and ZOID-enabled process on the compute node. Arguments: | ||
+ | |||
+ | ; added | ||
+ | : <tt>1</tt> if the process was started, <tt>0</tt> if it was terminated. | ||
+ | ; pset_pid | ||
+ | : A process identifier (as returned by [[#Implementation functions|<tt>__zoid_calling_process_id</tt>]]). | ||
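As an illustration, a plug-in could use its <tt>PROC</tt> callback to keep track of how many compute node processes are currently attached. The sketch below uses C11 atomics as a precaution, since ZOID is multi-threaded; whether <tt>PROC</tt> callbacks can actually run concurrently is not stated here. The names <tt>demo_proc</tt> and <tt>demo_live_count</tt> are hypothetical.

```c
#include <stdatomic.h>

/* Number of compute node processes currently known to this plug-in.
   Static storage, so it starts at zero. */
static atomic_int demo_live_processes;

/* Hypothetical PROC callback, to be named in the start line as
   PROC=demo_proc. */
void demo_proc(int added, int pset_pid)
{
    /* Identifiers can be reused after a process terminates, so any
       per-process state keyed by pset_pid would be reset here. */
    (void)pset_pid;

    if (added)
        atomic_fetch_add(&demo_live_processes, 1);
    else
        atomic_fetch_sub(&demo_live_processes, 1);
}

/* Current number of attached processes. */
int demo_live_count(void)
{
    return atomic_load(&demo_live_processes);
}
```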
+ | |||
+ | ====Implementation functions==== | ||
+ | |||
+ | The hand-written server-side implementation functions can themselves call back a few ZOID functions, available by including the <tt>zoid_api.h</tt> header file: | ||
+ | |||
+ | <pre> | ||
+ | int __zoid_calling_process_id(void); | ||
+ | </pre> | ||
+ | |||
+ | This function returns a unique identifier of the compute node process that invoked the function. The identifier is ''not'' an MPI rank, because some processes, such as the FUSE clients, are not part of the application and hence do not have a rank. The identifiers are only unique within one I/O node, and they can be reused if a process starts after another one has terminated. | ||
+ | |||
+ | <pre> | ||
+ | void __zoid_register_userbuf(void* userbuf, | ||
+ | void (*callback)(void* userbuf, void* priv), | ||
+ | void* priv); | ||
+ | </pre> | ||
+ | |||
+ | This function will be discussed in [[#Zerocopy with a custom output buffer|Zerocopy with a custom output buffer]]. | ||
+ | |||
+ | <pre> | ||
+ | int __zoid_send_output(int pid, int fd, const char* buffer, int len); | ||
+ | </pre> | ||
+ | |||
+ | This function writes an arbitrary string to the job's standard output or error. Arguments: | ||
+ | |||
+ | ; pid | ||
+ | : Process identifier as returned by <tt>__zoid_calling_process_id</tt>. The process in question ''must'' have an MPI rank, meaning that it must be either an application process or a process launched from an application process. | ||
+ | ; fd | ||
+ | : <tt>1</tt> for standard output, <tt>2</tt> for standard error. | ||
+ | ; buffer, len | ||
+ | : The string and its length. <tt>'\0'</tt> should not be included in <tt>len</tt> and <tt>buffer</tt> does not need to be <tt>'\0'</tt>-terminated. | ||
+ | |||
+ | The function returns 0 if successful, and -1 if not (such as when the process identified by <tt>pid</tt> does not have an MPI rank). | ||
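A <tt>printf</tt>-style convenience wrapper on top of this function could look like the following sketch. To keep it compilable outside ZOID, the sender is passed in as a function pointer with the same signature as <tt>__zoid_send_output</tt>; <tt>job_printf</tt>, <tt>send_output_fn</tt>, and the recording stand-in <tt>fake_send</tt> are all hypothetical names introduced for illustration.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Same signature as __zoid_send_output. */
typedef int (*send_output_fn)(int pid, int fd, const char *buffer, int len);

/* Format a message and pass it to the sender. Messages longer than the
   local buffer are truncated. Returns the sender's result, or -1 if
   formatting failed. */
static int job_printf(send_output_fn send, int pid, int fd,
                      const char *fmt, ...)
{
    char line[256];
    va_list ap;
    int len;

    va_start(ap, fmt);
    len = vsnprintf(line, sizeof line, fmt, ap);
    va_end(ap);

    if (len < 0)
        return -1;
    if ((size_t)len >= sizeof line)
        len = (int)sizeof line - 1;   /* output was truncated */

    /* len does not include the terminating '\0'. */
    return send(pid, fd, line, len);
}

/* Stand-in sender for trying the sketch outside ZOID: it records the
   last message instead of forwarding it. */
static char last_output[256];
static int last_fd;

static int fake_send(int pid, int fd, const char *buffer, int len)
{
    (void)pid;
    if (len < 0 || (size_t)len >= sizeof last_output)
        return -1;
    memcpy(last_output, buffer, (size_t)len);
    last_output[len] = '\0';
    last_fd = fd;
    return 0;
}
```

Inside a server-side implementation function, the call would then read <tt>job_printf(__zoid_send_output, __zoid_calling_process_id(), 2, "...")</tt>.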
+ | |||
+ | ===Client-side API=== | ||
+ | |||
+ | A compute node application needs to be linked with the client-side stubs and with a common support library <tt>libzoid_cn.a</tt> (a prebuilt version of the latter is in <tt>packages/zoid/prebuilt</tt>; sources are in <tt>packages/zoid/src/cnl/client</tt>). Several functions are available to applications by including the <tt>zoid_api.h</tt> header file: | ||
+ | |||
+ | ====Initialization==== | ||
+ | |||
+ | <pre> | ||
+ | int __zoid_init(void); | ||
+ | </pre> | ||
+ | |||
+ | This function ''must'' be invoked before any ZOID or ZOID-forwarded functions can be invoked. It returns <tt>0</tt> if successful, <tt>1</tt> otherwise. There is no corresponding termination function. | ||
+ | |||
+ | <pre> | ||
+ | int __zoid_job_size(void); | ||
+ | int __zoid_my_rank(void); | ||
+ | </pre> | ||
+ | |||
+ | These functions return, respectively, the number of processes in the job (the size of <tt>MPI_COMM_WORLD</tt>), and the MPI rank of the current process. Either will return <tt>-1</tt> if the current process does not have an MPI rank, i.e., if it is not an application process and was not launched from an application process (say, if it was launched from an interactive shell). | ||
+ | |||
+ | ====Error conditions==== | ||
+ | |||
+ | <pre> | ||
+ | int __zoid_error(void); | ||
+ | </pre> | ||
+ | |||
+ | This function should be invoked on the client side after ''every'' forwarded function call returns, to determine if any errors occurred within the forwarding layer. A return value of <tt>0</tt> indicates success; otherwise, one of the following error values will be returned: | ||
+ | |||
+ | ; ENOSYS | ||
+ | : Invalid command sent from the client. Typically indicates that the corresponding I/O-node-side [[#opt_modules|modules]] have not been loaded. | ||
+ | ; ENOMEM | ||
+ | : Out of memory condition. | ||
+ | ; E2BIG | ||
+ | : Message exceeded the internal size limit. | ||
+ | |||
+ | <pre> | ||
+ | int __zoid_excessive_size(void); | ||
+ | </pre> | ||
+ | |||
+ | If <tt>__zoid_error</tt> returned <tt>E2BIG</tt>, calling this function reports by how many bytes the input or output exceeded the limit. | ||
+ | |||
+ | <span id="size_restrictions">ZOID</span> [[#opt_buffer_size|has a limit]] on the message size, around 4 MB by default. The limit is enforced on both input and output. The limit only applies to buffers "owned" by ZOID on the daemon side; it does not apply to custom [[#Zerocopy with a custom input buffer|input]] or [[#Zerocopy with a custom output buffer|output]] buffers. | ||
+ | |||
+ | If the limit is hit, the operation needs to be split into smaller ones. Information returned by <tt>__zoid_excessive_size</tt> makes it easy to adjust the buffer and resubmit. | ||
+ | |||
+ | '''Note:''' While the input-side (argument) overflow is flagged immediately on the client side, and is thus fairly cheap to hit, the output-side (result) overflow is flagged on the I/O node, after the request has been sent there (but before the implementation function is invoked). It is thus advised to cache at least the size limit for the output side for the next invocation, to avoid a future communication overhead. The size limit is function-specific, since it depends on sizes of other arguments and results. | ||
+ | |||
+ | Here is an example of how the client-side convenience wrapper for a call such as POSIX <tt>read</tt> could be implemented: | ||
+ | |||
+ | <pre> | ||
+ | ssize_t read(int fd, void *buf, size_t nbytes) | ||
+ | { | ||
+ | static ssize_t max_read_nbytes = -1; | ||
+ | ssize_t bytes_read; | ||
+ | |||
+ | bytes_read = 0; | ||
+ | do | ||
+ | { | ||
+ | ssize_t toread, justread; | ||
+ | int error; | ||
+ | |||
+ | toread = nbytes - bytes_read; | ||
+ | |||
+ | if (max_read_nbytes != -1 && toread > max_read_nbytes) | ||
+ | toread = max_read_nbytes; | ||
+ | |||
+ | /* unix_read is the forwarded function call. */ | ||
+ | justread = unix_read(fd, buf + bytes_read, toread); | ||
+ | |||
+ | if ((error = __zoid_error())) | ||
+ | { | ||
+ | if (error != E2BIG) | ||
+ | { | ||
+ | /* For a generic ZOID error, just bail out. */ | ||
+ | errno = error; | ||
+ | return -1; | ||
+ | } | ||
+ | |||
+ | /* We tried to send a too large read request. Adjust. */ | ||
+ | max_read_nbytes = toread - __zoid_excessive_size(); | ||
+ | } | ||
+ | else | ||
+ | { | ||
+ | if (justread < 0) | ||
+ | { | ||
+ | /* For a generic read() error, just bail out. | ||
+ | In case of an I/O error, unix_read returns -errno. */ | ||
+ | errno = -justread; | ||
+ | return -1; | ||
+ | } | ||
+ | |||
+ | bytes_read += justread; | ||
+ | |||
+ | if (justread != toread) | ||
+ | /* unix_read as such succeeded, but it read fewer bytes than | ||
+ | expected. We terminate prematurely then. */ | ||
+ | break; | ||
+ | } | ||
+ | } while (bytes_read < nbytes); | ||
+ | |||
+ | return bytes_read; | ||
+ | } | ||
+ | </pre> | ||
+ | |||
+ | ===Additional considerations=== | ||
+ | |||
+ | ====Forwarding <tt>errno</tt>==== | ||
+ | |||
+ | If one needs to pass a variable such as <tt>errno</tt> from the I/O node to the client, the most straightforward way is to add an extra integer <tt>out</tt> pointer argument to all functions and pass it that way. Another option is to do it the same way the UNIX kernel does: pass it as a negative return value from the functions. The <tt>unix</tt> plug-in does it that way, so, e.g., the implementation of <tt>close</tt> on the I/O node looks something like this: | ||
+ | |||
+ | <pre> | ||
+ | if (close(server_fd) == -1) | ||
+ | return -errno; | ||
+ | else | ||
+ | return 0; | ||
+ | </pre> | ||
+ | |||
+ | Then, on the client side, we have a convenience wrapper: | ||
+ | |||
+ | <pre> | ||
+ | int close(int fd) | ||
+ | { | ||
+ | return unix_decode_result(unix_close(fd)); | ||
+ | } | ||
+ | </pre> | ||
+ | |||
+ | <tt>unix_decode_result</tt> is a preprocessor macro that handles both ZOID errors and errors returned by the plug-in. It uses a number of GCC extensions to make it as transparent as possible: | ||
+ | |||
+ | <pre> | ||
+ | #define unix_decode_result(result) \ | ||
+ | ({ \ | ||
+ | typeof (result) _result = (result); \ | ||
+ | int _n; \ | ||
+ | if ((_n = __zoid_error()) != 0) \ | ||
+ | { \ | ||
+ | errno = _n; \ | ||
+ | _result = -1; \ | ||
+ | } \ | ||
+ | else if (_result < 0) \ | ||
+ | { \ | ||
+ | errno = -_result; \ | ||
+ | _result = -1; \ | ||
+ | } \ | ||
+ | _result; \ | ||
+ | }) | ||
+ | </pre> | ||
+ | |||
+ | ====Returning variable amounts of data in arrays==== | ||
+ | |||
+ | Just like with UNIX system calls, ZOID does not allocate memory for the results. Instead, callers must provide pre-allocated arrays, along with their sizes. UNIX would then typically return the size of the used part as a return value from a system call. Unfortunately, ZOID cannot make use of that – it will use the same array size argument to determine how much data to send back, so even if only a small part of the provided buffer is actually filled in, the whole buffer will be sent back, which is inefficient. This can be prevented by passing the array size as an <tt>inout</tt> pointer to a numerical type. A server-side implementation of a function such as <tt>read</tt> then looks like this: | ||
+ | |||
+ | <pre> | ||
+ | ssize_t unix_read(int fd /* in:obj */, | ||
+ | void *buf /* out:arr:size=+1 */, | ||
+ | size_t *count /* inout:ptr */) | ||
+ | { | ||
+ | ssize_t ret; | ||
+ | |||
+ | ... | ||
+ | |||
+ | if ((ret = read(fd, buf, *count)) == -1) | ||
+ | { | ||
+ | *count = 0; | ||
+ | return -errno; | ||
+ | } | ||
+ | else | ||
+ | { | ||
+ | *count = ret; | ||
+ | return ret; | ||
+ | } | ||
+ | } | ||
+ | </pre> | ||
+ | |||
+ | Obviously, the client side needs to be modified as well, to pass the size argument by address. | ||
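A minimal sketch of such a client-side wrapper follows. It assumes the three-argument <tt>unix_read</tt> prototype shown above; the stub is replaced here by a mock that always performs a short read, and <tt>client_read</tt> is a hypothetical name. The <tt>__zoid_error()</tt> check is left out for brevity.

```c
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/* Mock of the generated client stub, for illustration only. The real stub
   forwards the call to the I/O node; this one "reads" at most 3 bytes. */
static ssize_t unix_read(int fd, void *buf, size_t *count)
{
    (void)fd;
    if (*count > 3)
        *count = 3;
    memset(buf, 'x', *count);
    return (ssize_t)*count;
}

/* Hypothetical wrapper: the size is now passed by address, so the stub can
   report (and send back) only the part of the buffer actually filled in. */
static ssize_t client_read(int fd, void *buf, size_t nbytes)
{
    size_t count = nbytes;
    ssize_t ret = unix_read(fd, buf, &count);

    if (ret < 0)
    {
        /* The plug-in returned -errno. */
        errno = (int)-ret;
        return -1;
    }
    return ret;
}
```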
+ | |||
+ | '''Note:''' this feature has certain implementation limitations. It can misbehave in the presence of multiple output arrays (or a single output <tt>arr2d</tt>, which internally behaves a lot like multiple separate <tt>arr</tt>s). Essentially, for efficiency reasons, the placement of arrays in the result buffer is determined before an implementation function is invoked. If this feature is used to change the size of one array, and that array is followed in the output buffer by another array, a "hole" will be created in the buffer, causing problems. However, in the most common case of a single output array the feature is completely reliable. | ||
+ | |||
+ | ====Zerocopy performance==== | ||
+ | |||
+ | Implementation-wise, ZOID is always zero-copy on the server side, meaning that data that implementation functions put in the <tt>out</tt> arrays is sent to the compute nodes without any extra memory copies. | ||
+ | |||
+ | The client side is zero-copy only for arrays that use the <tt>zerocopy</tt> flag in the header file. Because of the additional protocol overhead that <tt>zerocopy</tt> introduces, it should be used only for potentially large memory buffers, such as the buffers of file I/O <tt>read</tt> or <tt>write</tt> calls. | ||
+ | |||
+ | '''Note:''' for maximum performance, the arrays passed as <tt>zerocopy</tt> arguments on the compute nodes must be aligned in memory to a 16-byte boundary; otherwise, an interim buffer will be used. The memory allocation routines such as <tt>malloc</tt> have been modified to align memory to that boundary. If there is a danger that the user code might pass a large unaligned buffer, and the semantics will not be affected, it makes sense to write code that detects insufficient alignment and splits the operation in two: a small unaligned one (say, up to 240 bytes – the data payload of a single packet on the tree network), followed by a larger, properly aligned one. | ||
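The alignment check described in the note could look roughly like this. The helper name <tt>unaligned_prefix</tt> and the <tt>ZOID_ALIGNMENT</tt> macro are hypothetical; the sketch uses the smallest possible prefix (under 16 bytes) rather than a full 240-byte packet.

```c
#include <stddef.h>
#include <stdint.h>

#define ZOID_ALIGNMENT 16   /* hypothetical name; 16 bytes per the note above */

/* Hypothetical helper: returns how many leading bytes of buf must be
   transferred separately so that the remainder starts on a 16-byte
   boundary. The result is 0 for an aligned buffer and never exceeds len. */
static size_t unaligned_prefix(const void *buf, size_t len)
{
    size_t misalign = (size_t)((uintptr_t)buf % ZOID_ALIGNMENT);
    size_t prefix = misalign ? ZOID_ALIGNMENT - misalign : 0;

    return prefix < len ? prefix : len;
}
```

A wrapper would then issue one forwarded call for the first <tt>unaligned_prefix(buf, len)</tt> bytes and a second one for the aligned remainder, provided that splitting the operation does not change its semantics.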
+ | |||
+ | ====Zerocopy with a custom output buffer==== | ||
+ | |||
+ | Normally, memory for output arrays to be filled in by server-side implementation functions is allocated by the ZOID daemon. This might be inconvenient when the data to be filled arrives asynchronously, possibly before the implementation function is even invoked; in such situations, an interim memory buffer must be used, forcing an extra memory copy. | ||
+ | |||
+ | This extra copy can be avoided for <tt>zerocopy</tt> output <tt>arr</tt> types if the <tt>userbuf</tt> flag has been used. No space will then be preallocated by the daemon for the array (the server-side stub will pass a <tt>NULL</tt> pointer); instead, the implementation function must provide the daemon with its own buffer. It can do so by calling: | ||
+ | |||
+ | <pre> | ||
+ | void __zoid_register_userbuf(void* userbuf, | ||
+ | void (*callback)(void* userbuf, void* priv), | ||
+ | void* priv); | ||
+ | </pre> | ||
+ | |||
+ | Arguments: | ||
+ | |||
+ | ; userbuf | ||
+ | : The address of the buffer. | ||
+ | ; callback | ||
+ | : A callback function that is invoked by the daemon when the buffer has been sent to the client and is thus no longer needed. <tt>userbuf</tt> is passed as the first argument to the callback. It is safe for the callback to invoke <tt>__zoid_calling_process_id</tt>, if desired. | ||
+ | ; priv | ||
+ | : Private data passed as the second argument to the <tt>callback</tt>. It is not interpreted by ZOID in any way. | ||
+ | |||
+ | The size of the provided buffer is determined as for any other array argument: the maximum value is provided by the client via the <tt>size</tt> argument. The server-side implementation part may choose to return less than the maximum amount, as explained [[#Returning variable amounts of data in arrays|earlier]]. | ||
+ | |||
+ | As in other zero-copy cases, for maximum performance, the buffer provided must be aligned in memory to a 16-byte boundary; otherwise, an interim buffer will be used. The memory allocation routines such as <tt>malloc</tt> have been modified to align memory to that boundary, but we recommend explicitly calling <tt>posix_memalign</tt>. | ||
+ | |||
+ | '''Note:''' because the buffer provided is ''not'' allocated by ZOID, message size restrictions discussed [[#size_restrictions|earlier]] do not apply to it. Please do not abuse this capability. There is a good reason for the message size limit: it is there so that the maximum amount of memory required by the ZOID daemon stays limited. Too many overly large user-allocated buffers might result in an out-of-memory condition on the I/O node. | ||
+ | |||
+ | We provide a simple example below. It is a little artificial in the sense that the buffer is allocated within the implementation function; as we indicated, this feature is most likely to be useful with buffers allocated outside of the implementation functions: | ||
+ | |||
+ | <pre> | ||
+ | static void buffer_cb(void* userbuf, void* priv) | ||
+ | { | ||
+ | free(userbuf); | ||
+ | } | ||
+ | |||
+ | ssize_t unix_read(int fd /* in:obj */, | ||
+ | void *buf /* out:arr:size=+1:zerocopy:userbuf */, | ||
+ | size_t *count /* inout:ptr */) | ||
+ | { | ||
+ | ssize_t ret; | ||
+ | |||
+ | ... | ||
+ | |||
+ | if (posix_memalign(&buf, 16, *count)) | ||
+ | { | ||
+ | *count = 0; | ||
+ | return -ENOMEM; | ||
+ | } | ||
+ | |||
+ | __zoid_register_userbuf(buf, &buffer_cb, NULL); | ||
+ | |||
+ | if ((ret = read(fd, buf, *count)) == -1) | ||
+ | { | ||
+ | *count = 0; | ||
+ | return -errno; | ||
+ | } | ||
+ | else | ||
+ | { | ||
+ | *count = ret; | ||
+ | return ret; | ||
+ | } | ||
+ | } | ||
+ | </pre> | ||
+ | |||
+ | ====Zerocopy with a custom input buffer==== | ||
+ | |||
+ | The <tt>userbuf</tt> flag discussed above can also be used for ''input'' <tt>zerocopy</tt> <tt>arr</tt> arguments. This could be useful to avoid extra memory copies if the data in the array will be needed after the implementation function has returned. | ||
+ | |||
+ | If the flag is used, the daemon will not allocate the memory for the array; instead, in the middle of receiving the request from the client, it will call an allocation routine from the server-side implementation code. The name of the allocation routine is the name of the function that uses the input <tt>userbuf</tt> argument, with the <tt>_allocate_cb</tt> suffix attached to it. Its prototype needs to be as follows: | ||
+ | |||
+ | <pre> | ||
+ | void* <name>_allocate_cb(int len); | ||
+ | </pre> | ||
+ | |||
+ | The single argument passed by the daemon is the length of the array in bytes. The routine must return a pointer to a buffer of that size or <tt>NULL</tt> if that is not possible (in which case, the function will fail and <tt>__zoid_error</tt> on the client side will return <tt>ENOMEM</tt>). | ||
+ | |||
+ | There is a restriction on the type of the array: its base type must have a size of one byte, so the array should be of type <tt>char*</tt>, <tt>unsigned char*</tt>, <tt>void*</tt> (a GCC extension), etc. | ||
+ | |||
+ | The allocation routine is invoked in the same context as ordinary implementation functions. It may block if it so desires; this will block the compute node client that invoked the routine, but all other clients can keep communicating with the server, thanks to its multi-threaded architecture. | ||
+ | |||
+ | Once the allocation routine has returned and a complete request has been received by the daemon, the implementation function is invoked as usual, with a correct address of the input <tt>userbuf</tt> array. It is the responsibility of the plug-in implementer to release the memory occupied by that array when it is no longer needed. | ||
+ | |||
+ | As with other user-level callbacks, the allocation routine may call <tt>__zoid_calling_process_id</tt> to learn which client process sent the request. Also, as in other zero-copy cases, for maximum performance, the buffer provided must be aligned in memory to a 16-byte boundary; otherwise, an interim buffer will be used. The memory allocation routines such as <tt>malloc</tt> have been modified to align memory to that boundary, but we recommend explicitly calling <tt>posix_memalign</tt>. Finally, as with output <tt>userbuf</tt>, message size restrictions discussed [[#size_restrictions|earlier]] do not apply to the user-allocated buffers. Please do not abuse this capability. There is a good reason for the message size limit: it is there so that the maximum amount of memory required by the ZOID daemon stays limited. Too many overly large user-allocated buffers might result in an out-of-memory condition on the I/O node. | ||
+ | |||
+ | Under rare circumstances, input <tt>userbuf</tt> could result in memory leaks. For this to take place, the job would have to be interrupted after the allocation routine has been run, but before the implementation function is called. This could only cause problems if I/O nodes are not rebooted between jobs. Those concerned about this scenario can eliminate the leak by adding necessary memory release code to the [[#Start-line functions|FINI]] function. | ||
+ | |||
+ | A simple example: | ||
+ | |||
+ | <pre> | ||
+ | void* unix_write_allocate_cb(int len) | ||
+ | { | ||
+ | void* ptr; | ||
+ | |||
+ | if (posix_memalign(&ptr, 16, len)) | ||
+ | return NULL; | ||
+ | |||
+ | return ptr; | ||
+ | } | ||
+ | |||
+ | ssize_t unix_write(int fd /* in:obj */, | ||
+ | const void *buf /* in:arr:size=+1:zerocopy:userbuf */, | ||
+ | size_t count /* in:obj */) | ||
+ | { | ||
+ | ssize_t ret; | ||
+ | |||
+ | ... | ||
+ | |||
+ | if ((ret = write(fd, buf, count)) == -1) | ||
+ | ret = -errno; | ||
+ | |||
+ | free((void*)buf); | ||
+ | |||
+ | return ret; | ||
+ | } | ||
+ | </pre> | ||
Latest revision as of 20:04, 10 May 2009
Introduction
ZOID is an I/O forwarding component of the ZeptoOS project. Any communication between the compute nodes and the I/O nodes (job management, file I/O, sockets) is handled by ZOID.
ZOID infrastructure consists of:
- A multithreaded zoid daemon on the I/O nodes which performs I/O forwarding for the compute nodes and which also communicates with the service node to perform job management,
- control daemon on the compute nodes which is responsible for job management tasks such as the launching of application processes, for the forwarding of stdin/out/err data, and for the forwarding of IP packets,
- zoid-fuse daemon on the compute nodes which performs file I/O forwarding for POSIX-compliant applications.
User interface
ZOID is meant to be transparent to users, but there are a few optional mechanisms available to interact with it.
User script
Right before a job starts running, and right after the last process of a job has terminated, the ZOID daemon attempts to invoke a user script on the I/O nodes. By default, the daemon invokes $HOME/zoid-user-script.sh (this pathname can be changed by an administrator). A single parameter is passed to the script: 1 at job startup and 0 at termination.
Information about the job will be passed to the script in the following environment variables:
- ZOID_JOB_EXEC
- name of the job executable,
- ZOID_JOB_ARGS
- job arguments, separated by colons (:)
- ZOID_JOB_ENV
- job environment variables, separated by colons (:)
- ZOID_JOB_ID
- BG/P control system job id (Note: this is generally different from the Cobalt job ID; see FAQ for the latter),
- ZOID_JOB_GLOBAL_SIZE
- the number of processes in the job (the size of MPI_COMM_WORLD),
- ZOID_JOB_LOCAL_SIZE
- the number of job processes handled by this I/O node,
- ZOID_JOB_MODE
- 0 for SMP, 1 for VN, and 2 for DUAL,
- SHELL, PATH, USER, and HOME
- will also be set...
Notes:
- The user script is invoked synchronously by the daemon, i.e., the job will not start running until the script terminates. If one needs some processes to run on the I/O nodes while the job is running, they should be started in the background (&).
- For this feature to work, update_passwd_file.sh must be working correctly.
File broadcast
A /bin.rd/f2cn command is available on the I/O nodes for a very efficient (hardware-assisted) broadcasting of files to all the compute nodes handled by the given I/O node.
The command takes two arguments:
- absolute pathname to the input file on the I/O node,
- absolute pathname to the output file on the compute nodes.
The input file does not need to be physically on the I/O node; it can be on a network filesystem mounted on the node. The file will be created in the ramdisk of each compute node.
The throughput is in practice limited by how fast the input file can be read; we have seen results in excess of 300 MB/s for files residing in the I/O node ramdisk.
Note: all the compute nodes in the pset must be up and running. Do not use this command on incomplete partitions (e.g., a one-process job on a 64-node partition); this will likely hang the ZOID daemon.
Note2: this feature can safely be used from within a user script, so one can, e.g., pre-stage large binaries, like this:
User script ($HOME/zoid-user-script.sh):
#!/bin/sh
if [ "$1" -eq "1" ]; then
    /bin.rd/f2cn $HOME/large_binary /tmp/large_binary
fi
exit 0
Job script (submitted using Cobalt or mpirun):
#!/bin/sh
chmod 755 /tmp/large_binary
/tmp/large_binary
Performance counters
A /bin.rd/statquery command is available on the I/O nodes for obtaining the performance counters of the I/O daemon.
The command takes a single optional argument:
- the interval between successive queries, in seconds.
If the argument is not provided, the command will terminate after the first query.
Here is a sample output generated:
Timestamp: 1240439085.688831
Total messages sent: 5767
Total bytes sent: 7619170
Total messages received: 5717
Total bytes received: 72575
IP fwd messages sent: 196
IP fwd bytes sent: 5889
IP fwd messages received: 84
IP fwd bytes received: 6453
Stream messages sent: 65
Stream bytes sent: 520
Stream messages received: 65
Stream bytes received: 1416
Broadcast messages sent: 1
Broadcast bytes sent: 2437906
Internal messages sent: 193
Internal bytes sent: 39524
Internal messages received: 256
Internal bytes received: 1792
Plugin 5 messages sent: 0
Plugin 5 bytes sent: 0
Plugin 5 messages received: 0
Plugin 5 bytes received: 0
Plugin 2 messages sent: 5312
Plugin 2 bytes sent: 5135331
Plugin 2 messages received: 5312
Plugin 2 bytes received: 62914
The meaning of the fields is as follows:
- Timestamp
- number of seconds and microseconds from the epoch, as returned by gettimeofday(2),
- IP fwd
- IP packet forwarding between compute nodes and I/O nodes,
- Stream
- stdin/out/err streams,
- Broadcast
- file broadcasts,
- Internal
- job control messages, etc.
- Plugin 5
- internal mapping plug-in, used by MPI,
- Plugin 2
- unix plugin (POSIX file I/O).
The counters are 64-bit integers, so they will take a while to overflow :-).
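Because the counters are cumulative, a rate has to be derived from two successive samples. A small sketch of that arithmetic, assuming the byte counters and timestamps have already been parsed from the statquery output (the helper name average_rate is hypothetical):

```c
#include <stdint.h>

/* Hypothetical helper: average rate in bytes per second between two samples
   of the same counter. Timestamps are in seconds, as printed by statquery. */
static double average_rate(uint64_t bytes_before, uint64_t bytes_after,
                           double time_before, double time_after)
{
    double elapsed = time_after - time_before;

    if (elapsed <= 0.0 || bytes_after < bytes_before)
        return 0.0;     /* samples out of order; no meaningful rate */
    return (double)(bytes_after - bytes_before) / elapsed;
}
```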
Example user script ($HOME/zoid-user-script.sh) that samples the statistics every 60 seconds and writes them to a unique file:
#!/bin/sh
if [ "$1" -eq "1" ]; then
    /bin.rd/statquery 60 >$HOME/zoid_stats.$ZOID_JOB_ID.`hostname` &
fi
exit 0
Administrator interface
Configuration file
The zoid I/O daemon accepts a number of command-line options that can be used to change its behavior. They can be adjusted by editing the ramdisk/ION/ramdisk-add/etc/sysconfig/zoid file and rebuilding the I/O node ramdisk:
- ZOID_BUFFER_SIZE (-b)
- Specifies the size of the buffers used for messages. Because a separate buffer is needed for a request and a reply, and typically no more than one of these needs to be large, to save memory ZOID supports buffers of two sizes: a small one (4 KB by default) and a large one (4 MB+1 KB by default – the 1 KB is there to accommodate the headers). Use a colon (:) to separate the two sizes when customizing this value. If desired, support for the second buffer size can be disabled by providing only one value to this option.
- ZOID_ACK_THRESHOLD (-a)
- Specifies a size threshold for the rendezvous protocol for messages coming from the compute nodes, in units of tree network packets (240 bytes of data each). An eager protocol is used for messages below the threshold. Messages above the threshold use flow control in the form of a rendezvous protocol with message acknowledgements; basically, the daemon will only receive one large message at a time, which improves predictability and overall throughput. By default, the daemon does not use acknowledgements, but the config file sets a value of 8, which is the size of the hardware FIFO buffer of the tree network device. Set this option to 0 (or comment it out altogether) to disable message acknowledgements.
- ZOID_MODULES (-m)
- Specifies a :-separated list of ZOID plug-ins to load. This defaults to "unix_impl.so:unix_preload.so:mapping_impl.so:mapping_preload.so" in the config file; do not remove any of these or basic system services will stop working. The unix plug-in provides POSIX file I/O support, while mapping is used by our MPI implementation to map between MPI ranks and Blue Gene X/Y/Z/T coordinates. Custom plug-ins can be created and added here; see Programmer interface for details.
- ZOID_ENABLE_NAT (-n)
- Enables network address translation (NAT) for IP packets coming from the compute nodes, allowing compute nodes to communicate with the outside world. This support is disabled by default because it was found to have a detrimental effect on the overall performance of the TCP/IP stack on the I/O nodes, slowing down network filesystems. This feature can also be enabled on a per-job basis by setting the ZOID_ENABLE_NAT environment variable when submitting a job (see the FAQ).
- ZOID_USER_SCRIPT (-u)
- Specifies the pathname to the user script; it defaults to "/bin.rd/zoid-user-script.sh". This script can be found in ramdisk/ION/ramdisk-add/bin/zoid-user-script.sh; it sets a few environment variables and then invokes the user's custom $HOME/zoid-user-script.sh. Hence, to adjust the behavior of this feature, change either this option or the script in the ramdisk.
Note: to be able to invoke a script from the user's home directory, update_passwd_file.sh must be working correctly.
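As a quick check of the units used by ZOID_ACK_THRESHOLD above: the config-file value of 8 packets corresponds to 8 × 240 = 1920 bytes of payload. The conversion can be written as follows; the macro and function names are illustrative only and do not come from the ZOID sources.

```c
#include <stddef.h>

#define TREE_PACKET_PAYLOAD 240  /* bytes of data per tree network packet */

/* Hypothetical helper: converts a ZOID_ACK_THRESHOLD value, given in
   packets, to the equivalent number of payload bytes. */
static size_t ack_threshold_bytes(size_t packets)
{
    return packets * TREE_PACKET_PAYLOAD;
}
```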
The /bin.rd/update_passwd_file.sh file
Allowing the partition owner to log into the I/O node using SSH is one of the features of the ZeptoOS software stack. Only the administrator and the partition owner are given login access; this is controlled by the /bin.rd/update_passwd_file.sh script, which is invoked by the daemon while the partition is being initialized. The script can be found in ramdisk/ION/ramdisk-add/bin/update_passwd_file.sh.
The script makes a number of assumptions that could be site-specific, so it might require an adjustment. The daemon invokes the script, passing the numerical UNIX user ID of the partition owner as the only argument. The script then scans /bgsys/iofs/etc/passwd for an entry with the same user ID (on Argonne machines, this file contains all valid account names). If a matching entry is found, it is appended to the /etc/passwd file in the I/O node ramdisk, thus enabling login access to the node for that user.
If allowing ordinary users access to the I/O nodes is undesirable, one can simply put exit 0 at the top of the script to disable it.
The /bin.rd/nat file
If NAT has been requested, the daemon invokes the /bin.rd/nat script to enable it. This script can be found in ramdisk/ION/ramdisk-add/bin/nat. Generally, it should not require any modifications.
Programmer interface
ZOID is a flexible, extensible, high-performance function call forwarding (RPC) infrastructure. Built-in features and the standard plug-ins provide familiar POSIX file I/O and BSD socket interfaces, but, because of the number of software layers involved, they introduce a significant overhead. For applications requiring maximum bandwidth between the compute and I/O nodes, ZOID provides an option of a customized function call forwarding with minimal overheads. This section provides an overview of how to create such custom plug-ins.
Overview
All that ZOID provides is a function call forwarding support, and a limited one at that. Any logic (caching, prefetching, etc.) needs to be custom-built on top of it.
Follow existing plug-ins, found in packages/zoid/src/, as examples. The unix plug-in is generally the most up to date, but other plug-ins such as mapping, zoidfs, barrier, and test should also be fine.
A plug-in consists of automatically generated client-side and server-side stubs (which perform the marshalling and demarshalling of function call parameters and results, the forwarding of the function call, etc.), and of a hand-written server-side implementation which provides the implementation code for the forwarded function calls. One might also decide to provide hand-written client-side wrappers to hide some details of the ZOID API (such as the error handling) or to adhere to a particular existing API, as is the case with the unix plug-in (the wrappers used by the FUSE client are available in packages/zoid/src/unix/stubs/; another version is in the GNU libc sources, in packages/glibc/src/zoid/sysdeps/unix/sysv/linux/powerpc/powerpc32/).
The scanner.pl script, found in packages/zoid/src/, creates the automatically-generated client and server stubs based on a hand-written input header file described below. Again, please follow the examples from the existing plug-ins, such as unix or mapping. The Makefile in those plug-ins is written in a generic fashion and should only require a change to the PREFIX line to be usable with another plug-in. Use that Makefile to invoke the scanner.pl script and to compile the generated source files.
Input header file
The input header file must be a valid C header file with additional hints in the comments. The file is read by the scanner.pl script.
The parser in the script is rather limited and does not handle many C constructs. It is thus essential that the header file be as simple as possible. In particular, function prototypes should be specified at the end of the file, not intermixed with any other specifications such as data type definitions.
Ordinary comments are best placed on separate lines.
Note: the parser is case sensitive.
Start line
Any complex declarations that the scanner cannot parse should be placed at the top of the file, because the parser ignores everything until it encounters the following magic start line:
/* START-ZOID-SCANNER ID=<n> INIT=<s1> FINI=<s2> PROC=<s3> */
- ID=<n>
- Each plug-in needs a unique, 16-bit identifier, passed in <n>. The following identifiers are already in use: 0 (internal), 1 (zoidfs plug-in), 2 (unix), 3 (lofar), 4 (test), 5 (mapping), and 10 (ftb).
- INIT=<s1>
- <s1> provides the name of an initialization function which will be invoked before a job starts running; see Start-line functions for more information. If a plug-in does not need this feature, please specify INIT=NULL.
- FINI=<s2>
- <s2> provides the name of a termination function which will be invoked after all job processes have exited; see Start-line functions for more information. If a plug-in does not need this feature, please specify FINI=NULL.
- PROC=<s3>
- <s3> provides the name of a callback function which will be invoked on a startup and termination of every application and ZOID-enabled process; see Start-line functions for more information. If a plug-in does not need this feature, please specify PROC=NULL.
Argument hints
Hints are generally needed by the scanner to correctly encode and decode function arguments. They need to be placed after each argument, before a separating comma (or a closing bracket), and should be embedded inside dedicated C comments. Multiple hints per argument are usually provided; these are separated by a colon (:). The following hints are currently defined:
- in, out, inout
- Specifies whether the argument is an input argument, an output argument, or both. in is the default.
- obj, str, ptr, arr, arr2d
- Specifies the type of the argument, respectively a plain object (say, an int, or a structure passed by value), a '\0'-terminated character string, a pointer to a plain object, an array of objects, or a two-dimensional array (type**, not type[][]). obj is the default.
- size
- Required for array arguments (arr and arr2d). Indicates the index of another argument in the same function, which is used to pass the array size. Absolute numbers are accepted (1 to number of arguments) or relative ones (+1 for the next argument, -1 for the previous argument, etc).
For arr arguments, the size argument must be of a numerical type, or a pointer to such a type. For arr2d arguments, the size argument must itself be an array (an arr argument) of numerical elements, specifying the sizes along the less significant dimension of the array (the size of the more significant dimension is the size of the arr array itself).
Please note that the size is expressed in units of the base array type (thus, sizeof(int) for an array of ints), not in bytes. To express it in bytes, give the array argument the type char* or void* (a GCC extension).
- nullok
- An option for arguments passed by pointer (basically, all but obj). If provided, it indicates that the argument is allowed to be NULL. This is not the default because supporting NULL pointers results in an additional computational and protocol overhead. Note: if a NULL pointer is passed to an argument that lacks the nullok flag, the client will crash.
- zerocopy
- An option for array arguments. Enables a more efficient marshalling/demarshalling protocol for the array, which does not use extra memory copies. Can be used for no more than one in argument and no more than one out argument. See Zerocopy performance for performance considerations when using this option.
- userbuf
- An option for zerocopy; only supported for arr arguments. Enables a special form of zero-copy support, discussed in Zerocopy with a custom output buffer and Zerocopy with a custom input buffer.
Here is an example function prototype with the hints:
int zoidfs_readlink(const zoidfs_handle_t * handle /* in:ptr */,
                    char * buffer /* out:arr:size=+1 */,
                    size_t buffer_length /* in:obj */);
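To show how the start line and the hints fit together, here is a sketch of a complete input header for a made-up plug-in. Everything named example_* is hypothetical, as is the identifier 42 (chosen only because it is not in the list of identifiers already in use); the hint syntax itself follows the descriptions above.

```c
#include <stddef.h>
#include <sys/types.h>

/* Declarations the scanner cannot parse go above the start line, where
   they are ignored. example_handle_t is a hypothetical type. */
typedef struct { unsigned char data[16]; } example_handle_t;

/* START-ZOID-SCANNER ID=42 INIT=NULL FINI=NULL PROC=NULL */

/* Reads up to *count bytes; the implementation updates *count to the
   amount actually read so that only that much is sent back. */
ssize_t example_read(const example_handle_t *handle /* in:ptr */,
                     void *buf /* out:arr:size=+1:zerocopy */,
                     size_t *count /* inout:ptr */);

/* Writes count bytes from buf. */
ssize_t example_write(const example_handle_t *handle /* in:ptr */,
                      const void *buf /* in:arr:size=+1:zerocopy */,
                      size_t count /* in:obj */);
```

The prototypes are kept at the end of the file and ordinary comments sit on their own lines, as the parser notes above recommend.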
Limitations
As indicated earlier, the scanner is limited, so keep the prototypes simple.
The return type of a forwarded function must be scalar or void.
Structures with pointer fields inside of them cannot be forwarded.
Generated files
For every function prototype found, the scanner generates two output files: one for a client calling the function and one for the server, where the function is in fact executed. Code in the generated files performs marshalling and demarshalling of function arguments and results.
Two more files per plug-in are generated: header_defs.h and header_dispatch.c.
None of the generated files should be modified.
Server-side API
Server-side stubs and the server-side implementation need to be passed as modules when invoking the ZOID I/O daemon, as described earlier.
The hand-written server-side implementation code should include the zoid_api.h header file (available from packages/zoid/prebuilt/) and the plug-in input header file.
All the functions listed in the header file need to be defined in the server-side implementation code. The code needs to be compiled as a shared library; use the implementation/ subdirectory of the unix plug-in as an example. Please note that since ZOID is multi-threaded, multiple functions can be invoked at the same time, so one must ensure that the implementation is multi-thread-safe.
Start-line functions
The following start-line functions can be defined:
void INIT(int pset_mpi_proc_count, int argc, int envc, const char* argenv);
The INIT function is invoked during initialization, right before a job starts running. Arguments:
- pset_mpi_proc_count
- The number of job processes that will be handled by this I/O node. Note that I/O nodes also handle additional ZOID-enabled processes, such as the FUSE clients, which are not included in this number.
- argc
- The number of command-line arguments plus one.
- envc
- The number of environment variables.
- argenv
- An array of '\0'-terminated strings, one after another. The first string is the name of the job executable, followed by argc-1 command-line arguments, followed by envc environment variables.
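The packed layout of argenv can be walked with strlen. The following sketch assumes only the layout described above; argenv_get and example_init are hypothetical names, and a real plug-in would use whatever function name its INIT= start-line entry specifies.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns the index-th string packed in argenv, where
   strings are stored back to back, each terminated by '\0'. The caller must
   keep index below argc + envc. */
static const char *argenv_get(const char *argenv, int index)
{
    while (index-- > 0)
        argenv += strlen(argenv) + 1;
    return argenv;
}

/* Example INIT function that logs the executable, arguments, and
   environment of the job to the daemon's standard error. */
void example_init(int pset_mpi_proc_count, int argc, int envc,
                  const char *argenv)
{
    int i;

    fprintf(stderr, "%d processes, executable %s\n", pset_mpi_proc_count,
            argenv_get(argenv, 0));
    for (i = 1; i < argc; i++)
        fprintf(stderr, "arg %d: %s\n", i, argenv_get(argenv, i));
    for (i = 0; i < envc; i++)
        fprintf(stderr, "env: %s\n", argenv_get(argenv, argc + i));
}
```

Each lookup rescans from the start, which is fine for short argument lists; a plug-in that needs the strings repeatedly could build an index array once instead.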
void FINI(void);
The FINI function is invoked after the last process of the job has terminated.
void PROC(int added, int pset_pid);
The PROC function is invoked on the startup and termination of every application and ZOID-enabled process on the compute node. Arguments:
- added
- 1 if the process was started, 0 if it was terminated.
- pset_pid
- A process identifier (as returned by __zoid_calling_process_id).
Implementation functions
The hand-written server-side implementation functions can themselves call back a few ZOID functions, available by including the zoid_api.h header file:
int __zoid_calling_process_id(void);
This function returns a unique identifier of the compute node process that invoked the function. The identifier is not an MPI rank, because some processes, such as the FUSE clients, are not part of the application and hence do not have a rank. The identifiers are only unique within one I/O node, and they can be reused if a process starts after another one has terminated.
void __zoid_register_userbuf(void* userbuf, void (*callback)(void* userbuf, void* priv), void* priv);
This function will be discussed in Zerocopy with a custom output buffer.
int __zoid_send_output(int pid, int fd, const char* buffer, int len);
This function writes an arbitrary string to the job's standard output or error. Arguments:
- pid
- Process identifier as returned by __zoid_calling_process_id. The process in question must have an MPI rank, meaning that it must be either an application process or a process launched from an application process.
- fd
- 1 for standard output, 2 for standard error.
- buffer, len
- The string and its length. '\0' should not be included in len and buffer does not need to be '\0'-terminated.
The function returns 0 if successful, and -1 if not (such as when the process identified by pid does not have an MPI rank).
Client-side API
A compute node application needs to be linked with the client-side stubs and with a common support library libzoid_cn.a (a prebuilt version of the latter is in packages/zoid/prebuilt; sources are in packages/zoid/src/cnl/client). Several functions are available to applications by including the zoid_api.h header file:
Initialization
int __zoid_init(void);
This function must be invoked before any ZOID or ZOID-forwarded functions can be invoked. It returns 0 if successful, 1 otherwise. There is no corresponding termination function.
int __zoid_job_size(void);
int __zoid_my_rank(void);
These functions return, respectively, the number of processes in the job (the size of MPI_COMM_WORLD), and the MPI rank of the current process. Either will return -1 if the current process does not have an MPI rank, i.e., if it is not an application process and was not launched from an application process (say, if it was launched from an interactive shell).
Error conditions
int __zoid_error(void);
This function should be invoked on the client side after every forwarded function call returns, to determine whether any errors occurred within the forwarding layer. A return value of 0 indicates success; otherwise, one of the following error values is returned:
- ENOSYS
- Invalid command sent from the client. Typically indicates that the corresponding I/O-node-side modules have not been loaded.
- ENOMEM
- Out of memory condition.
- E2BIG
- Message exceeded the internal size limit.
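For diagnostics, the three forwarding-layer error values can be mapped to readable messages. This is only an illustrative sketch: zoid_error_text is a hypothetical helper, not a ZOID function, and the message strings paraphrase the list above.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical helper (not part of ZOID): describe a value returned by
 * __zoid_error(). Only the values documented above are recognised.
 */
const char *zoid_error_text(int err)
{
    assert(err >= 0);   /* 0 or a positive errno value is expected */
    switch (err) {
    case 0:
        return "success";
    case ENOSYS:
        return "invalid command (I/O-node-side module probably not loaded)";
    case ENOMEM:
        return "out of memory";
    case E2BIG:
        return "message exceeded the internal size limit";
    default:
        return "unexpected error value";
    }
}
```

A client would typically call it as zoid_error_text(__zoid_error()) right after a forwarded call returns.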
int __zoid_excessive_size(void);
If __zoid_error returned E2BIG, this function reports by how many bytes the input or output exceeded the limit.
ZOID has a limit on the message size, around 4 MB by default. The limit is enforced on both input and output. The limit only applies to buffers "owned" by ZOID on the daemon side; it does not apply to custom input or output buffers.
If the limit is hit, the operation needs to be split into smaller ones. Information returned by __zoid_excessive_size makes it easy to adjust the buffer and resubmit.
Note: While the input-side (argument) overflow is flagged immediately on the client side, and is thus fairly cheap to hit, the output-side (result) overflow is flagged on the I/O node, after the request has been sent there (but before the implementation function is invoked). It is thus advised to cache at least the size limit for the output side for the next invocation, to avoid a future communication overhead. The size limit is function-specific, since it depends on sizes of other arguments and results.
Here is an example of how the client-side convenience wrapper for a call such as POSIX read could be implemented:
ssize_t read(int fd, void *buf, size_t nbytes)
{
    static ssize_t max_read_nbytes = -1;
    ssize_t bytes_read;

    bytes_read = 0;
    do
    {
        ssize_t toread, justread;
        int error;

        toread = nbytes - bytes_read;
        if (max_read_nbytes != -1 && toread > max_read_nbytes)
            toread = max_read_nbytes;

        /* unix_read is the forwarded function call. */
        justread = unix_read(fd, buf + bytes_read, toread);
        if ((error = __zoid_error()))
        {
            if (error != E2BIG)
            {
                /* For a generic ZOID error, just bail out. */
                errno = error;
                return -1;
            }
            /* We tried to send a too large read request. Adjust. */
            max_read_nbytes = toread - __zoid_excessive_size();
        }
        else
        {
            if (justread < 0)
            {
                /* For a generic read() error, just bail out.
                   In case of an I/O error, unix_read returns -errno. */
                errno = -justread;
                return -1;
            }
            bytes_read += justread;
            if (justread != toread)
                /* unix_read as such succeeded, but it read fewer bytes
                   than expected. We terminate prematurely then. */
                break;
        }
    } while (bytes_read < nbytes);

    return bytes_read;
}
Additional considerations
Forwarding errno
If one needs to pass a variable such as errno from the I/O node to the client, the most straightforward way is to add an extra integer out pointer argument to all functions and pass it that way. Another option is to do it the same way the UNIX kernel does: pass it as a negative return value from the functions. The unix plug-in does it that way, so, e.g., the implementation of close on the I/O node looks something like this:
if (close(server_fd) == -1)
    return -errno;
else
    return 0;
Then, on the client side, we have a convenience wrapper:
int close(int fd)
{
    return unix_decode_result(unix_close(fd));
}
unix_decode_result is a preprocessor macro that handles both ZOID errors and errors returned by the plug-in. It uses a number of GCC extensions to make it as transparent as possible:
#define unix_decode_result(result) \
({ \
    typeof (result) _result = (result); \
    int _n; \
    if ((_n = __zoid_error()) != 0) \
    { \
        errno = _n; \
        _result = -1; \
    } \
    else if (_result < 0) \
    { \
        errno = -_result; \
        _result = -1; \
    } \
    _result; \
})
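Where GCC extensions are not available, the same decoding logic can be written as an ordinary function. The sketch below is a hypothetical alternative, not what the unix plug-in ships: decode_result is an invented name, the result is widened to long, and the value of __zoid_error() is passed in explicitly.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical portable variant of the decoding logic (not part of ZOID).
 *   result    value returned by the forwarded call (negative means -errno)
 *   zoid_err  value returned by __zoid_error() after that call
 * Returns `result` on success, or -1 with errno set on failure.
 *
 * Note: C does not define the order in which function arguments are
 * evaluated, so the forwarded call must be stored in a local variable
 * first, and __zoid_error() queried afterwards:
 *     long r = unix_close(fd);
 *     return (int)decode_result(r, __zoid_error());
 */
long decode_result(long result, int zoid_err)
{
    assert(zoid_err >= 0);
    if (zoid_err != 0) {
        /* The forwarding layer itself failed. */
        errno = zoid_err;
        return -1;
    }
    if (result < 0) {
        /* The plug-in reported an error as a negative errno value. */
        errno = (int)-result;
        return -1;
    }
    return result;
}
```

The macro form avoids the widening to long and keeps the call site to a single expression, which is presumably why the plug-in uses it.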
Returning variable amounts of data in arrays
Just like with UNIX system calls, ZOID does not allocate memory for the results. Instead, callers must provide pre-allocated arrays, along with their sizes. UNIX would then typically return the size of the used part as a return value from a system call. Unfortunately, ZOID cannot make use of that – it will use the same array size argument to determine how much data to send back, so even if only a small part of the provided buffer is actually filled in, the whole buffer will be sent back, which is inefficient. This can be prevented by passing the array size as an inout pointer to a numerical type. A server-side implementation of a function such as read then looks like this:
ssize_t unix_read(int fd /* in:obj */,
                  void *buf /* out:arr:size=+1 */,
                  size_t *count /* inout:ptr */)
{
    ssize_t ret;
    ...
    if ((ret = read(fd, buf, *count)) == -1)
    {
        *count = 0;
        return -errno;
    }
    else
    {
        *count = ret;
        return ret;
    }
}
Obviously, the client side needs to be modified as well, to pass the size argument by address.
Note: this feature has certain implementation limitations. It can misbehave in the presence of multiple output arrays (or a single output arr2d, which internally behaves a lot like multiple separate arrs). Essentially, for efficiency reasons, the placement of arrays in the result buffer is determined before an implementation function is invoked. If this feature is used to change the size of one array, and that array is followed in the output buffer by another array, a "hole" will be created in the buffer, causing problems. However, in the most common case of a single output array the feature is completely reliable.
Zerocopy performance
Implementation-wise, ZOID is always zero-copy on the server side, meaning that data that implementation functions put in the out arrays is sent to the compute nodes without any extra memory copies.
The client side is zero-copy only for arrays that use the zerocopy flag in the header file. Because of the additional protocol overhead that zerocopy introduces, it should be used only for potentially large memory buffers, such as the buffers of file I/O read or write calls.
Note: for maximum performance, the arrays passed as zerocopy arguments on the compute nodes must be aligned in memory to a 16-byte boundary; otherwise, an interim buffer will be used. The memory allocation routines such as malloc have been modified to align memory to that boundary. If there is a danger that the user code might pass a large unaligned buffer, and the semantics will not be affected, it makes sense to write code that detects insufficient alignment and splits the operation in two: a small unaligned one (say, up to 240 bytes – the data payload of a single packet on the tree network), followed by a larger, properly aligned one.
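The alignment check and the split point can be computed from the buffer address alone. The following is a minimal sketch under stated assumptions: split_for_alignment and ZC_ALIGN are hypothetical names, and the head is chosen as the smallest prefix that reaches the next 16-byte boundary (at most 15 bytes, so it fits within a single 240-byte packet payload).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ZC_ALIGN 16   /* alignment needed for the zero-copy path */

/*
 * Hypothetical helper (not part of ZOID): split a transfer of `len` bytes
 * starting at `buf` into an unaligned head and an aligned tail.
 *   *head  bytes to transfer first, up to the next 16-byte boundary
 *   *tail  remaining bytes, which start at a 16-byte-aligned address
 * Only the address is inspected; the buffer is never dereferenced.
 */
void split_for_alignment(const void *buf, size_t len,
                         size_t *head, size_t *tail)
{
    size_t misalign    = (size_t)((uintptr_t)buf % ZC_ALIGN);
    size_t to_boundary = misalign ? ZC_ALIGN - misalign : 0;

    assert(head != NULL && tail != NULL);
    *head = to_boundary < len ? to_boundary : len;
    *tail = len - *head;
}
```

A wrapper would then issue one forwarded call for the head (if non-zero) and a second one for the tail. As the note above says, this is only appropriate when issuing two operations instead of one does not change the semantics of the call.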
Zerocopy with a custom output buffer
Normally, memory for output arrays to be filled in by server-side implementation functions is allocated by the ZOID daemon. This might be inconvenient when the data to be filled arrives asynchronously, possibly before the implementation function is even invoked; in such situations, an interim memory buffer must be used, forcing an extra memory copy.
This extra copy can be avoided for zerocopy output arr types if the userbuf flag has been used. No space will then be preallocated by the daemon for the array (the server-side stub will pass a NULL pointer); instead, the implementation function must provide the daemon with its own buffer. It can do it by calling:
void __zoid_register_userbuf(void* userbuf, void (*callback)(void* userbuf, void* priv), void* priv);
Arguments:
- userbuf
- The address of the buffer.
- callback
- A callback function that is invoked by the daemon when the buffer has been sent to the client and is thus no longer needed. userbuf is passed as the first argument to the callback. It is safe for the callback to invoke __zoid_calling_process_id, if desired.
- priv
- Private data passed as the second argument to the callback. It is not interpreted by ZOID in any way.
The size of the provided buffer is determined like for any other array argument: the maximum value is provided by the client via the size argument. The server-side implementation part may choose to return less than the maximum amount, as explained earlier.
As in other zero-copy cases, for maximum performance, the buffer provided must be aligned in memory to a 16-byte boundary; otherwise, an interim buffer will be used. The memory allocation routines such as malloc have been modified to align memory to that boundary, but we recommend explicitly calling posix_memalign.
Note: because the buffer provided is not allocated by ZOID, the message size restrictions discussed earlier do not apply to it. Please do not abuse this capability. There is a good reason for the message size limit: it keeps the maximum amount of memory required by the ZOID daemon bounded. Too many overly large user-allocated buffers might result in an out-of-memory condition on the I/O node.
We provide a simple example below. It is a little artificial in the sense that the buffer is allocated within the implementation function; as we indicated, this feature is most likely to be useful with buffers allocated outside of the implementation functions:
static void buffer_cb(void* userbuf, void* priv)
{
    free(userbuf);
}

ssize_t unix_read(int fd /* in:obj */,
                  void *buf /* out:arr:size=+1:zerocopy:userbuf */,
                  size_t *count /* inout:ptr */)
{
    ssize_t ret;
    ...
    if (posix_memalign(&buf, 16, *count))
    {
        *count = 0;
        return -ENOMEM;
    }
    __zoid_register_userbuf(buf, &buffer_cb, NULL);

    if ((ret = read(fd, buf, *count)) == -1)
    {
        *count = 0;
        return -errno;
    }
    else
    {
        *count = ret;
        return ret;
    }
}
Zerocopy with a custom input buffer
The userbuf flag discussed above can also be used for input zerocopy arr arguments. This could be useful to avoid extra memory copies if the data in the array will be needed after the implementation function has returned.
If the flag is used, the daemon will not allocate the memory for the array; instead, in the middle of receiving the request from the client, it will call an allocation routine from the server-side implementation code. The name of the allocation routine is the name of the function that uses the input userbuf argument, with _allocate_cb suffix attached to it. Its prototype needs to be as follows:
void* <name>_allocate_cb(int len);
The single argument passed by the daemon is the length of the array in bytes. The routine must return a pointer to a buffer of that size or NULL if that is not possible (in which case, the function will fail and __zoid_error on the client side will return ENOMEM).
There is a restriction on the type of the array: its base type must have a size of one byte, so the array should be of type char*, unsigned char*, void* (a GCC extension), etc.
The allocation routine is invoked in the same context as ordinary implementation functions. It may block if it so desires; this will block the compute node client that invoked the routine, but all other clients can keep communicating with the server, thanks to its multi-threaded architecture.
Once the allocation routine has returned and a complete request has been received by the daemon, the implementation function is invoked as usual, with a correct address of the input userbuf array. It is the responsibility of the plug-in implementer to release the memory occupied by that array when it is no longer needed.
As with other user-level callbacks, the allocation routine may call __zoid_calling_process_id to learn which client process sent the request.
As in other zero-copy cases, for maximum performance, the buffer provided must be aligned in memory to a 16-byte boundary; otherwise, an interim buffer will be used. The memory allocation routines such as malloc have been modified to align memory to that boundary, but we recommend explicitly calling posix_memalign.
Finally, as with output userbuf, the message size restrictions discussed earlier do not apply to the user-allocated buffers. Please do not abuse this capability. There is a good reason for the message size limit: it keeps the maximum amount of memory required by the ZOID daemon bounded. Too many overly large user-allocated buffers might result in an out-of-memory condition on the I/O node.
Under rare circumstances, input userbuf could result in memory leaks. For this to take place, the job would have to be interrupted after the allocation routine has been run, but before the implementation function is called. This could only cause problems if I/O nodes are not rebooted between jobs. Those concerned about this scenario can eliminate the leak by adding necessary memory release code to the FINI function.
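One way to add such release code is to record every buffer handed out by the allocation routine and to free, at FINI time, whatever was never claimed by an implementation function. The sketch below is an assumption-laden illustration: the pending_* names and the MAX_PENDING bound are hypothetical, and locking is omitted, although a real plug-in would likely need it because the daemon is multithreaded.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_PENDING 64   /* assumed bound on buffers awaiting their call */

static void *pending[MAX_PENDING];

/* Call from the <name>_allocate_cb routine after allocating `ptr`.
   Returns 0 on success, -1 if the table is full. */
int pending_add(void *ptr)
{
    assert(ptr != NULL);
    for (int i = 0; i < MAX_PENDING; i++) {
        if (pending[i] == NULL) {
            pending[i] = ptr;
            return 0;
        }
    }
    return -1;
}

/* Call at the start of the implementation function, which now owns the
   buffer. Returns 0 if the buffer was recorded, -1 otherwise. */
int pending_remove(void *ptr)
{
    for (int i = 0; i < MAX_PENDING; i++) {
        if (pending[i] == ptr) {
            pending[i] = NULL;
            return 0;
        }
    }
    return -1;
}

/* Call from FINI: free buffers that were allocated but never reached an
   implementation function. Returns the number of buffers freed. */
int pending_release_all(void)
{
    int freed = 0;

    for (int i = 0; i < MAX_PENDING; i++) {
        if (pending[i] != NULL) {
            free(pending[i]);
            pending[i] = NULL;
            freed++;
        }
    }
    return freed;
}
```

If pending_add reports a full table, the allocation routine could free the buffer and return NULL, which the client would then see as ENOMEM.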
A simple example:
void* unix_write_allocate_cb(int len)
{
    void* ptr;

    if (posix_memalign(&ptr, 16, len))
        return NULL;
    return ptr;
}

ssize_t unix_write(int fd /* in:obj */,
                   const void *buf /* in:arr:size=+1:zerocopy:userbuf */,
                   size_t count /* in:obj */)
{
    ssize_t ret;
    ...
    if ((ret = write(fd, buf, count)) == -1)
        ret = -errno;
    free((void*)buf);
    return ret;
}