ZeptoOS - User contributions [en]

Configuration

2022-02-25T19:38:47Z

Iskra: /* Downloading */

[[Introduction]] | [[ZeptoOS_Documentation|Top]] | [[Installation]]
----

== Downloading ==

* Log on to one of the front-end nodes of the Blue Gene (a login node or a service node).

* Download the ZeptoOS tarball from the ZeptoOS [https://www.mcs.anl.gov/research/projects/zeptoos/downloads downloads page].

* Extract the sources from the package:
<pre>
$ tar xjf ZeptoOS-<version>.tar.bz2
</pre>

== Configuring ==

Change to the top-level <tt>ZeptoOS-<version></tt> directory:

<pre>
$ cd ZeptoOS-<version>
</pre>

A <tt>configure</tt> script is provided to set the pathnames to various system directories:

<pre>
$ ./configure
</pre>

If invoked without any arguments, it will use the defaults, which should be appropriate if ZeptoOS is configured on a system with a supported BG/P driver version. The pathnames can be changed with the help of a textual user interface by invoking the script as follows:

<pre>
$ ./configure --edit
</pre>

This will display the following menu:

[[Image:Configure1.png|border|Main menu]]

Please select the top item (<tt>BG/P DIST_DIR</tt>). The screen will change to:

[[Image:Configure2.png|border|DIST_DIR menu]]

The following options are available:

; DRV_DIR
: The directory with the BG/P driver tree. The default (<tt>/bgsys/drivers/ppcfloor/</tt>) is a link pointing to the currently active driver.
; BGP_CROSS
: A prefix to the pathnames of the GNU cross-compilers used to build the compute node and I/O node software.
; BGCNS_H_PATH and BGCNS_H
: The location of a file needed to rebuild the kernel (these options are temporary and will be removed in the next version).
; OS_DIR
: The directory with the supplementary I/O node software used when booting the I/O nodes. It needs to be set to match the BG/P driver version being used.
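
Since <tt>OS_DIR</tt> has to match the driver in use, it can help to see where the <tt>DRV_DIR</tt> link currently points before editing the settings. The following is a minimal sketch, not part of ZeptoOS: the function name <tt>active_driver</tt> is made up for illustration, and it assumes a <tt>readlink</tt> that supports <tt>-f</tt> (as in GNU coreutils).

<pre>
#!/bin/sh
# Print the directory that the driver link resolves to.
# The default argument is the DRV_DIR default quoted above;
# active_driver is an illustrative name, not a ZeptoOS command.
active_driver() {
    link=${1:-/bgsys/drivers/ppcfloor}
    if [ ! -e "$link" ]; then
        echo "no such path: $link" >&2
        return 1
    fi
    # Follow the symbolic link(s) and print the real directory.
    readlink -f "$link"
}
</pre>

Calling <tt>active_driver</tt> without arguments checks the default link; a different path can be passed as the first argument.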

The second top-level menu (<tt>Debugging</tt>) has only one option:

; ADD_DEBUG_TOOLS
: Check this option to include <tt>gdb</tt> and <tt>strace</tt> in the compute node ramdisk. They are not included by default because of their size.

The third top-level menu (<tt>Kernel Profiling</tt>) is discussed in the [[(K)TAU#Configure ZeptoOS to point to KTAU patch and path|(K)TAU section]].

Select <tt>Exit</tt> (multiple times if needed) and confirm if you want to save any changes made.

== Building ==

To start using the pre-built binaries simply type:

<pre>
$ make
</pre>

On the first invocation, this will ask for a root password to use on I/O nodes:

<pre>
Create root password for I/O Node
Leave the password field empty if you want to disable root login
New password:
</pre>

'''Security note: root-level access to I/O nodes should only be given to trusted individuals. A root user can access and modify files of all users in the system.'''

Once the password has been entered and confirmed, <tt>make</tt> will use pre-built kernel images, and will build the ramdisks from pre-built tools and utilities. The following generated files will be placed in the top-level directory:

; BGP-CN-zImage-with-initrd.elf
: ZeptoOS compute node Linux with embedded compute node ramdisk.
; BGP-ION-zImage.elf
: ZeptoOS I/O node kernel.
; BGP-ION-ramdisk-for-CNL.elf
: ZeptoOS I/O node ramdisk for use with the ZeptoOS compute node Linux.
; BGP-ION-ramdisk-for-CNK.elf
: ZeptoOS I/O node ramdisk for use with the IBM CNK (optional).
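
A quick way to confirm that <tt>make</tt> produced the images is to look for the files listed above. The sketch below does only that. The function name <tt>check_images</tt> is illustrative, and it treats the CNK ramdisk as optional, which is an assumption based on the "(optional)" remark in the list.

<pre>
#!/bin/sh
# Report which of the generated images are missing from a directory.
# File names come from the list above; check_images is an illustrative name.
check_images() {
    dir=${1:-.}
    missing=0
    for f in BGP-CN-zImage-with-initrd.elf \
             BGP-ION-zImage.elf \
             BGP-ION-ramdisk-for-CNL.elf; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    # BGP-ION-ramdisk-for-CNK.elf is not checked (assumed optional).
    return $missing
}
</pre>

Run from the top-level directory as <tt>check_images .</tt>; the return status is non-zero if any of the three files is absent.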

It is possible to rebuild individual ZeptoOS components using one of the following <tt>make</tt> targets (the list is also available by typing <tt>make help</tt> or <tt>make menu</tt>):

; bgp-cn-linux
: Rebuilds the compute node ramdisk and embeds it into a compute node kernel image.
; bgp-ion-ramdisk-cnl
: Rebuilds the I/O node ramdisk for the ZeptoOS compute node Linux.
; bgp-ion-ramdisk-cnk
: Rebuilds the I/O node ramdisk for the IBM CNK.
; bgp-ion-linux-build
: Rebuilds the I/O node kernel.
; bgp-cn-linux-build
: Rebuilds the compute node kernel and ramdisk and embeds the ramdisk into the kernel.
; bgp-all-pkg-rebuild
: Rebuilds all packages from sources.
; bgp-libs-build
: Rebuilds SPI, DCMF and MPICH from sources.
(the following <tt>make</tt> targets are mostly for internal use)
; bgp-ion-linux
: Copies a recently rebuilt I/O node kernel if one is available; otherwise, uses a prebuilt binary (will not rebuild the kernel).
; bgp-all-pkg-smart
: Copies recently rebuilt packages if available; otherwise, uses prebuilt binaries (used when preparing to rebuild ramdisks).

----
[[Introduction]] | [[ZeptoOS_Documentation|Top]] | [[Installation]]

MediaWiki:Sidebar

2018-02-21T19:19:32Z

Iskra:

*
** mainpage|Top
** http://www.mcs.anl.gov/research/projects/zeptoos/|ZeptoOS Home
* SEARCH
* TOOLBOX
* LANGUAGES

ZeptoOS:About

2018-02-20T21:26:44Z

Iskra:

[http://www.mcs.anl.gov/research/projects/zeptoos/ ZeptoOS] is a research project studying operating systems for petascale architectures with 10,000 to 1 million CPUs. Operating system and run-time software is strained by ultra-scale machines, and a variety of fascinating research topics are revealed at such amazing scale. Architectures such as IBM's BlueGene and Cray's XT3 are on the path toward petaflops and beyond, and make perfect testbeds for computer science explorations.

The ZeptoOS project is a collaboration between Argonne National Laboratory and the University of Oregon.

FAQ

2010-05-06T22:49:12Z

Iskra: /* How to obtain a Cobalt job ID */

[[ZeptoOS_Documentation|Top]]
----

==How to obtain a CN node number==

This depends on what number one is interested in.

===Pset rank===

A pset rank is a number identifying a compute node within each ''pset'' (an I/O node and the compute nodes that communicate with it). Note that on partitions larger than one pset, the pset ranks will not be unique. Also, pset ranks do ''not'' start from <tt>0</tt>; they start from <tt>1</tt> for some mysterious reason (do not blame us – blame IBM :-).

Pset rank is used as the last octet in the IP address on the tree network connecting the compute nodes and the I/O nodes (''x'' in <tt>192.168.1.</tt>''x'').

The pset rank is available on the compute nodes from <tt>/proc/personality.sh</tt>, in the <tt>BG_RANK_IN_PSET</tt> variable:

<pre>
#!/bin/sh

. /proc/personality.sh

echo "My pset rank is $BG_RANK_IN_PSET"
</pre>
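
Because the pset rank is the last octet of the tree-network address, a shell script can derive a compute node's address directly from it. A minimal sketch follows; the function name <tt>tree_ip</tt> is illustrative, and the <tt>192.168.1.</tt> prefix is the one given above.

<pre>
#!/bin/sh
# Build the tree-network IP address of a compute node from its pset rank.
# tree_ip is an illustrative name, not part of ZeptoOS.
tree_ip() {
    rank=$1
    case $rank in
        ''|*[!0-9]*)
            echo "not a number: $rank" >&2
            return 1 ;;
    esac
    echo "192.168.1.$rank"
}
</pre>

On a compute node this could be used as <tt>. /proc/personality.sh; tree_ip "$BG_RANK_IN_PSET"</tt>.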

From a C program it will be easier to use the binary personality available from <tt>/proc/personality</tt>. The definition of the structure can be found in <tt>/bgsys/drivers/ppcfloor/arch/include/common/bgp_personality.h</tt>. The pset rank is in <tt>Network_Config.RankInPSet</tt>:

<pre>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <common/bgp_personality.h>

int main(void)
{
_BGP_Personality_t personality;
int fd;

if ((fd = open("/proc/personality", O_RDONLY)) == -1)
{
perror("open");
return 1;
}
if (read(fd, &personality, sizeof(personality)) != sizeof(personality))
{
perror("read");
close(fd);
return 1;
}
close(fd);

printf("My pset rank is %d\n", personality.Network_Config.RankInPSet);

return 0;
}
</pre>

(compile the above with <tt>-I/bgsys/drivers/ppcfloor/arch/include</tt>)

===Torus rank===

A torus rank is a number identifying a compute node within a whole partition. In a way, it is much "nicer" than a pset rank since it is unique within a job and it also starts from <tt>0</tt>.

The torus rank is easy to obtain from a C program: it is the <tt>Network_Config.Rank</tt> field of the personality structure.

Unfortunately, the torus rank is not available in <tt>/proc/personality.sh</tt>, but a shell script can easily calculate it from other fields:

<pre>
TORUS_RANK=`echo $BG_PSETORG | awk "{print \\$1 + \\$2 * $BG_XSIZE + \
\\$3 * $BG_XSIZE * $BG_YSIZE}"`
</pre>
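
The same formula can be written with shell arithmetic instead of <tt>awk</tt>, which makes the calculation easier to follow. The sketch below is an equivalent restatement, not a replacement recommended by the authors; the function name <tt>torus_rank</tt> is illustrative. For example, coordinates (1, 2, 3) with an X size of 8 and a Y size of 8 give 1 + 2*8 + 3*8*8 = 209.

<pre>
#!/bin/sh
# Compute x + y * XSIZE + z * XSIZE * YSIZE.
# $1 $2 $3: the three coordinates; $4: X size; $5: Y size.
# torus_rank is an illustrative name.
torus_rank() {
    echo $(( $1 + $2 * $4 + $3 * $4 * $5 ))
}
</pre>

After sourcing <tt>/proc/personality.sh</tt>, it could be called as <tt>torus_rank $BG_PSETORG "$BG_XSIZE" "$BG_YSIZE"</tt>, leaving <tt>$BG_PSETORG</tt> unquoted so that its three values are passed as separate arguments.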

===MPI rank===

An MPI rank should not be confused with the torus rank, even though by default the two are the same. MPI rank is a property of a process, ''not'' node. If one submits a job in the <tt>VN</tt> or <tt>DUAL</tt> mode, there will be multiple MPI tasks per node, obviously each with a different MPI rank. Also, using the <tt>BG_MAPPING</tt> environment variable changes the mapping between the torus coordinates and MPI ranks.

While obtaining the MPI rank from an MPI application is trivial, how does one obtain it from a shell script?

One way would be to invoke a simple C program:

<pre>
#include <stdio.h>
#include "zoid_api.h"

int main(void)
{
if (__zoid_init())
return 1;
printf("%d\n", __zoid_my_rank());
return 0;
}
</pre>

(compile with <tt>-I</tt>''path_to_ZeptoOS_source''<tt>/packages/zoid/prebuilt -L</tt>''path_to_ZeptoOS_source''<tt>/packages/zoid/prebuilt -lzoid_cn</tt>)

A slight disadvantage of this approach is that <tt>__zoid_init</tt> registers the process with the ZOID daemon on the I/O node, which is an overhead we do not need. Another solution, without using any binaries, is as follows:

<pre>
MPI_RANK=`echo $CONTROL_INIT | awk -F, '{print $4}'`
</pre>

This has a disadvantage of using internal ZOID variables which are not guaranteed to be supported in future releases.
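
Given that caveat, a script may want to fail loudly if the variable is missing or does not have the expected layout. The sketch below wraps the one-liner above with such checks. The function name <tt>mpi_rank</tt> is illustrative, and the only thing assumed about <tt>CONTROL_INIT</tt> is what the one-liner already relies on: a comma-separated value whose fourth field is the rank.

<pre>
#!/bin/sh
# Extract the fourth comma-separated field of the value passed in $1.
# mpi_rank is an illustrative name; pass "$CONTROL_INIT" as the argument.
mpi_rank() {
    if [ -z "$1" ]; then
        echo "CONTROL_INIT is empty or unset" >&2
        return 1
    fi
    rank=$(echo "$1" | awk -F, '{print $4}')
    if [ -z "$rank" ]; then
        echo "no fourth field in: $1" >&2
        return 1
    fi
    echo "$rank"
}
</pre>

Usage: <tt>MPI_RANK=`mpi_rank "$CONTROL_INIT"` || exit 1</tt>.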

==How to open a socket from a CN to the outside world==

ZOID provides IP packet forwarding between the compute nodes and the I/O nodes. However, because the compute nodes use non-routable IP addresses (<tt>192.168.1.</tt>''x''), they cannot communicate directly with the outside world.

The most transparent solution to this problem is to perform network address translation (NAT) on the I/O nodes using the Linux kernel netfilter infrastructure. We used to enable this by default, but experiments have shown it to have a detrimental effect on the overall performance of the TCP/IP stack on the I/O nodes, slowing down access to the network filesystems.

To enable the translation, pass the <tt>ZOID_ENABLE_NAT</tt> environment variable when submitting a job. An administrator can also enable this option permanently in the [[ZOID#opt_enable_nat|config file]].

==How to obtain a Cobalt job ID==

Cobalt passes the job id to the application processes launched on the compute nodes using the <tt>COBALT_JOBID</tt> environment variable.

This variable is also accessible from the [[ZOID#User script|user script]] running on the I/O nodes, using the <tt>ZOID_JOB_ENV</tt> variable:

<pre>
COBALT_JOBID=`echo $ZOID_JOB_ENV | sed 's/^.*COBALT_JOBID=\([^:]*\).*$/\1/'`
</pre>
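
The same lookup can be done without a regular expression by splitting the colon-separated list and matching on the variable name. The sketch below is one possible way to do it; the function name <tt>job_env_get</tt> is illustrative. Like the one-liner above, it cannot recover values that themselves contain a colon.

<pre>
#!/bin/sh
# Look up NAME in a colon-separated list of NAME=value pairs.
# $1: variable name; $2: the list (e.g. "$ZOID_JOB_ENV").
# The body runs in a subshell, so the IFS and "set -f" changes stay local.
# job_env_get is an illustrative name.
job_env_get() (
    name=$1
    set -f          # no pathname expansion while splitting
    IFS=:
    for pair in $2; do
        case $pair in
            "$name"=*)
                echo "${pair#*=}"
                exit 0 ;;
        esac
    done
    exit 1          # name not found
)
</pre>

Usage in a user script: <tt>COBALT_JOBID=`job_env_get COBALT_JOBID "$ZOID_JOB_ENV"`</tt>. The return status is non-zero if the variable is not present.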

==Why large MPI processes do not work==

A common reason might be that they do not have enough memory to run. MPI processes run within the big memory region, which by default is limited to just 256 MB so as not to deplete the ordinary Linux paged memory pool too much (main memory is allocated to the big memory region at boot time and it cannot be reclaimed by the kernel, even if it were unused).

See the [[Kernel#Kernel (command line) parameters|Kernel]] section to learn how to increase the limit; the parameter to use is <tt>flatmemsizeMB</tt>. We suggest creating multiple profiles with different big memory sizes to accommodate different uses of ZeptoOS.

==Why SSH keeps asking for a password==

As we envisioned it, partition owners should be able to log on to the I/O nodes belonging to their jobs without being asked for a password. The following considerations apply:
# The account information for the partition owner must be added to the <tt>/etc/passwd</tt> file on the I/O nodes when launching a job; this is discussed [[ZOID#The /bin.rd/update_passwd_file.sh file|here]].
# For password-less logins, <tt>shosts.equiv</tt> must be configured before (re)building the I/O node ramdisk, as discussed [[Testing#Interactive login|here]]. Alternatively, users could set up SSH key pairs in their home directories (password-less, or using <tt>ssh-agent</tt> to cache the password).
# SSH might temporarily prevent a partition owner from logging in if an attempt is made before the job starts running, as discussed [[Testing#Interactive login|here]]. Root can always log in, by providing the password set when building the I/O node ramdisk for the first time.
# Finally, keep in mind that a particular site might have disabled this feature on purpose.

----
[[ZeptoOS_Documentation|Top]]

Limitations

2009-05-11T22:37:07Z

Iskra: /* Known Bugs / Current Limitations */

[[ZeptoOS_Documentation|Top]]
----

==Known Bugs / Current Limitations==

===No VN/DUAL mode in MPI===

Blue Gene/P supports three job modes:

* SMP (one application process per node)
* DUAL (two application processes per node)
* VN (four application processes per node)

In Cobalt, the job mode can be specified using <tt>cqsub -m</tt> or <tt>qsub --mode</tt>.

ZeptoOS will launch the appropriate number of application processes per node as determined by the mode; however, MPI jobs currently only work in the SMP mode. We plan to fix this problem in the near future.

===No Universal Performance Counter (UPC)===

UPC is not available in this release. Thus, PAPI will not work since it depends on UPC.
We are currently trying to enable the UPC support in our Linux environment.

===MPI-IO support===

Due to the limitations of FUSE (the compute-node infrastructure we use for I/O forwarding of POSIX calls), if using the standard glibc, pathnames passed to MPI-IO routines need to be prefixed with <tt>bglockless:</tt> or <tt>bgl:</tt> (the latter will not work with PVFS; the former should work with all filesystems).

This should not be necessary when using the version of glibc [[Other Packages#ZOID glibc|modified for ZOID]]. That version should also give a better performance, so please give it a try if the performance with the standard glibc is unsatisfactory.
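
When the standard glibc is used, a job script that builds pathnames for an MPI-IO application can add the prefix in one place. The sketch below is a small helper for that; the function name <tt>mpiio_path</tt> is illustrative. It chooses <tt>bglockless:</tt> because, as noted above, that prefix should work with all filesystems, and it leaves already prefixed names untouched.

<pre>
#!/bin/sh
# Prefix a pathname with "bglockless:" unless it already carries
# one of the two prefixes mentioned above.
# mpiio_path is an illustrative name.
mpiio_path() {
    case $1 in
        bglockless:*|bgl:*) echo "$1" ;;
        *)                  echo "bglockless:$1" ;;
    esac
}
</pre>

For example, <tt>mpiio_path /some/dir/out.dat</tt> prints <tt>bglockless:/some/dir/out.dat</tt> (the pathname here is only a placeholder).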

Also, within the DOE FastOS [http://www.iofsl.org/ I/O forwarding project] we are working on a new, high performance I/O forwarding infrastructure for parallel applications and as this work matures, we will integrate it into ZeptoOS.

===Some MPI jobs hang when they are killed===

We have been seeing this a lot with <tt>cn-ipfwd</tt>, the [[Other Packages#IP over torus|IP-over-torus]] program. This program runs "forever", so it eventually needs to be killed. When that happens, it will frequently hang one or more compute nodes, preventing the partition from shutting down cleanly.

However, the service node will force a shutdown after a timeout of five minutes, so in practice this is not a significant problem. Also, we have not seen this problem with ordinary MPI applications (unlike most MPI applications, <tt>cn-ipfwd</tt> is multithreaded and communicates a lot with the kernel).

===mpirun -nofree does not work===

<tt>mpirun -nofree</tt> (submitting multiple jobs without rebooting the nodes) does not work in the current release. Currently, partitions must be rebooted. We intend to fix it in the next version.

==Features Coming Soon==

===Multiple MPI jobs one after another===

Since ZeptoOS supports submitting a shell script as a compute node "application", it is possible to run multiple "real" applications from within one job:

<pre>
#!/bin/sh

for i in 1 2 3 4 5 6 7 8 9 10; do
/path/to/real/application
done
</pre>

This does work for sequential applications, but not for those that are linked with MPI; with MPI, an application can only be run once. However, we have experimental code that lifts this limitation and we plan to include it in the next release.

----
[[ZeptoOS_Documentation|Top]]

Changes

2009-05-11T17:53:29Z

Iskra:

[[ZeptoOS_Documentation|Top]]
----

'''ZeptoOS-BGP 2.0''' released May 11, 2009. 
Most important changes:

* first public release with support for Blue Gene/P
* ZeptoOS Compute Node Linux with:
** Big Memory
** Native communication support
** IP forwarding
* ZOID enabled by default

'''ZeptoOS-BG 1.5''' released June 28, 2007. 
Most important changes:

* tested on V1R3M2 driver, should work with V1R3M0 and V1R3M1 as well
* support for ZOID
* ION kernel with experimental support for compute class processes and static TLB
* PVFS2 updated to version 2.6.3

'''ZeptoOS-BG 1.4''' released January 31, 2006. 
Most important changes:

* tested on V1R2M1 driver and may work with V1R2M0
* 2.6 ION kernel (based on 2.6.5)
* pvfs2 binary and rc script update
* boot msg clean up
* ZeptoInstall.sh fixes
* misc. such as fixing /tmp perms

'''ZeptoOS-BG 1.2''' released November 11, 2005. 
Most important changes:

* Integrated support for KTAU, a kernel profiling/tracing tool.
* Support for custom /bgl/dist-type directory trees. Eliminates a need to put ZeptoOS-specific stuff inside a directory shared with standard IBM profile.
* An installation script is now available, to ease the installation process.
* Added zswitcher, a command to switch kernel/ramdisk of a partition.
* Added zdiff-elfrd, a command to compare two ramdisks.
* Improved zinfo: more robust, more secure, easier to set up.
* CIOD read/write buffer size can now be calculated automatically, taking into account available memory, compute to I/O node ratio, etc.

'''ZeptoOS-BG 1.1''' released September 6, 2005.

----
[[ZeptoOS_Documentation|Top]]

ZOID

2009-05-11T02:04:07Z

Iskra:

[[Ramdisk]] | [[ZeptoOS_Documentation|Top]] | [[(K)TAU]]
----

==Introduction==

ZOID is an I/O forwarding component of the ZeptoOS project. Any communication between the compute nodes and the I/O nodes (job management, file I/O, sockets) is handled by ZOID.

ZOID infrastructure consists of:
* A multithreaded <tt>zoid</tt> daemon on the I/O nodes which performs I/O forwarding for the compute nodes and which also communicates with the service node to perform job management,
* A <tt>control</tt> daemon on the compute nodes which is responsible for job management tasks such as the launching of application processes, for the forwarding of <tt>stdin</tt>/<tt>out</tt>/<tt>err</tt> data, and for the forwarding of IP packets,
* A <tt>zoid-fuse</tt> daemon on the compute nodes which performs file I/O forwarding for POSIX-compliant applications.

==User interface==

ZOID is meant to be transparent to users, but there are a few optional mechanisms available to interact with it.

===User script===

Right before a job starts running, and right after the last process of a job has terminated, the ZOID daemon attempts to invoke a ''user script'' on the I/O nodes. By default, the daemon invokes <tt>$HOME/zoid-user-script.sh</tt> (this pathname can be [[#opt_user_script|changed]] by an administrator). A single parameter is passed to the script: <tt>1</tt> at the job startup, and <tt>0</tt> at the termination.

Information about the job will be passed to the script in the following environment variables:
; <tt>ZOID_JOB_EXEC</tt>
: name of the job executable,
; <tt>ZOID_JOB_ARGS</tt>
: job arguments, separated by colons (<tt>:</tt>)
; <tt>ZOID_JOB_ENV</tt>
: job environment variables, separated by colons (<tt>:</tt>)
; <tt>ZOID_JOB_ID</tt>
: BG/P control system job id ('''Note:''' this is generally different from the Cobalt job ID; see [[FAQ#How to obtain a Cobalt job ID|FAQ]] for the latter),
; <tt>ZOID_JOB_GLOBAL_SIZE</tt>
: the number of processes in the job (the size of <tt>MPI_COMM_WORLD</tt>),
; <tt>ZOID_JOB_LOCAL_SIZE</tt>
: the number of job processes handled by this I/O node,
; <tt>ZOID_JOB_MODE</tt>
: <tt>0</tt> for SMP, <tt>1</tt> for VN, and <tt>2</tt> for DUAL,
; <tt>SHELL</tt>, <tt>PATH</tt>, <tt>USER</tt>, and <tt>HOME</tt>
: will also be set...

'''Notes:'''
* The user script is invoked ''synchronously'' by the daemon, i.e., the job will not start running until the script terminates. If one needs some processes to run on the I/O nodes while the job is running, they should be started in the background (&).
* For this feature to work, [[#The /bin.rd/update_passwd_file.sh file|update_passwd_file.sh]] must be working correctly.
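
Since <tt>ZOID_JOB_MODE</tt> is passed as a number, a user script that wants to log or branch on the mode needs to decode it. The sketch below maps the values listed above to mode names, and to the number of application processes per node that Blue Gene/P defines for each mode (one for SMP, two for DUAL, four for VN). Both function names are illustrative.

<pre>
#!/bin/sh
# Translate the numeric ZOID_JOB_MODE into a mode name.
# zoid_mode_name and zoid_procs_per_node are illustrative names.
zoid_mode_name() {
    case $1 in
        0) echo SMP ;;
        1) echo VN ;;
        2) echo DUAL ;;
        *) echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Number of application processes per node for a given mode value.
zoid_procs_per_node() {
    case $1 in
        0) echo 1 ;;   # SMP
        1) echo 4 ;;   # VN
        2) echo 2 ;;   # DUAL
        *) return 1 ;;
    esac
}
</pre>

In a user script, <tt>zoid_mode_name "$ZOID_JOB_MODE"</tt> would print the name of the mode of the job that is starting or terminating.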

===File broadcast===
A <tt>/bin.rd/f2cn</tt> command is available on the I/O nodes for a very efficient (hardware-assisted) broadcasting of files to all the compute nodes handled by the given I/O node.

The command takes two arguments:
* absolute pathname to the input file on the I/O node,
* absolute pathname to the output file on the compute nodes.

The input file does not need to be physically on the I/O node; it can be on a network filesystem mounted on the node. The file will be created in the ramdisk of each compute node.

The throughput is in practice limited by how fast the input file can be read; we have seen results in excess of 300 MB/s for files residing in the I/O node ramdisk.

'''Note:''' all the compute nodes in the pset must be up and running. Do not use this command on ''incomplete'' partitions (e.g., a one-process job on a 64-node partition); this will likely hang the ZOID daemon.

'''Note2:''' this feature can safely be used from within a [[#User script|user script]], so one can, e.g., pre-stage large binaries, like this:

User script (<tt>$HOME/zoid-user-script.sh</tt>):
<pre>
#!/bin/sh

if [ "$1" -eq "1" ]; then
    /bin.rd/f2cn $HOME/large_binary /tmp/large_binary
fi
exit 0
</pre>

Job script (submitted using Cobalt or mpirun):
<pre>
#!/bin/sh

chmod 755 /tmp/large_binary
/tmp/large_binary
</pre>

===Performance counters===

A <tt>/bin.rd/statquery</tt> command is available on the I/O nodes for obtaining the performance counters of the I/O daemon.

The command takes a single optional argument:
* the interval between successive queries, in seconds.

If the argument is not provided, the command will terminate after the first query.

Here is sample output from the command:

<pre>
Timestamp: 1240439085.688831
Total messages sent: 5767
Total bytes sent: 7619170
Total messages received: 5717
Total bytes received: 72575
IP fwd messages sent: 196
IP fwd bytes sent: 5889
IP fwd messages received: 84
IP fwd bytes received: 6453
Stream messages sent: 65
Stream bytes sent: 520
Stream messages received: 65
Stream bytes received: 1416
Broadcast messages sent: 1
Broadcast bytes sent: 2437906
Internal messages sent: 193
Internal bytes sent: 39524
Internal messages received: 256
Internal bytes received: 1792
Plugin 5 messages sent: 0
Plugin 5 bytes sent: 0
Plugin 5 messages received: 0
Plugin 5 bytes received: 0
Plugin 2 messages sent: 5312
Plugin 2 bytes sent: 5135331
Plugin 2 messages received: 5312
Plugin 2 bytes received: 62914
</pre>

The meaning of the fields is as follows:
; Timestamp
: number of seconds and microseconds from the epoch, as returned by <tt>gettimeofday(2)</tt>,
; IP fwd
: IP packet forwarding between compute nodes and I/O nodes,
; Stream
: <tt>stdin</tt>/<tt>out</tt>/<tt>err</tt> streams,
; Broadcast
: [[#File broadcast|file broadcasts]],
; Internal
: job control messages, etc.
; Plugin 5
: internal <tt>mapping</tt> plug-in, used by MPI,
; Plugin 2
: <tt>unix</tt> plugin (POSIX file I/O).

The counters are 64-bit integers, so they will take a while to overflow :-).
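
For tools that post-process this output, the sketch below shows one way to extract a single counter from a line of the form shown above. It is only an illustration: <tt>parse_counter_line</tt> is a hypothetical helper, not part of ZeptoOS, and it assumes that every line has the <tt>name: value</tt> layout of the sample output.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: parse one "<name>: <integer>" line of statquery
   output. Returns 0 and stores the counter in *value on success, or -1
   if the line does not start with "<name>:" followed by a number. */
int parse_counter_line(const char *line, const char *name, uint64_t *value)
{
    size_t name_len;
    const char *p;

    assert(line != NULL && name != NULL && value != NULL);

    name_len = strlen(name);
    if (strncmp(line, name, name_len) != 0 || line[name_len] != ':')
        return -1;

    /* Skip the blanks between the colon and the number. */
    p = line + name_len + 1;
    while (*p == ' ')
        p++;
    if (*p < '0' || *p > '9')
        return -1;

    /* The counters are 64-bit, so use a 64-bit conversion. */
    *value = strtoull(p, NULL, 10);
    return 0;
}
```

Applied to the <tt>Timestamp</tt> line, this helper returns only the whole seconds, because parsing stops at the decimal point; the microseconds would need separate handling.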

Example user script (<tt>$HOME/zoid-user-script.sh</tt>) that samples the statistics every 60 seconds and writes them to a unique file:
<pre>
#!/bin/sh

if [ "$1" -eq "1" ]; then
    /bin.rd/statquery 60 >$HOME/zoid_stats.$ZOID_JOB_ID.`hostname` &
fi
exit 0
</pre>

==Administrator interface==

===Configuration file===

The <tt>zoid</tt> I/O daemon accepts a number of command-line options that can be used to change its behavior. They can be adjusted by editing the <tt>ramdisk/ION/ramdisk-add/etc/sysconfig/zoid</tt> file and rebuilding the I/O node ramdisk:

; ZOID_BUFFER_SIZE (-b)
: Specifies the size of the buffers used for messages. Because a separate buffer is needed for a request and a reply, and typically no more than one of these needs to be large, ZOID saves memory by supporting buffers of two sizes: a small one (4 KB by default) and a large one (4 MB + 1 KB by default; the extra 1 KB accommodates the headers). Use a colon (<tt>:</tt>) to separate the two sizes when customizing this value. Support for the second buffer size can be disabled by providing only one value to this option.
; ZOID_ACK_THRESHOLD (-a)
: Specifies a size threshold for the rendezvous protocol for messages coming from the compute nodes, in units of tree network packets (240 bytes of data each). An eager protocol is used for messages below the threshold. Messages above the threshold use flow control in the form of a rendezvous protocol with message acknowledgements; in effect, the daemon receives only one large message at a time, which improves predictability and overall throughput. The daemon's built-in default is not to use acknowledgements, but the config file sets a value of <tt>8</tt>, which is the size of the hardware FIFO buffer of the tree network device. Set this option to <tt>0</tt> (or comment it out altogether) to disable message acknowledgements.
; ZOID_MODULES (-m)
: Specifies a <tt>:</tt>-separated list of ZOID plug-ins to load. This defaults to <tt>"unix_impl.so:unix_preload.so:mapping_impl.so:mapping_preload.so"</tt> in the config file; do not remove any of these or basic system services will stop working. The <tt>unix</tt> plug-in provides POSIX file I/O support, while <tt>mapping</tt> is used by our MPI implementation to map between MPI ranks and Blue Gene X/Y/Z/T coordinates. Custom plug-ins can be created and added here; see [[#Programmer interface|Programmer interface]] for details.
; ZOID_ENABLE_NAT (-n)
: Enables network address translation (NAT) for IP packets coming from the compute nodes, allowing compute nodes to communicate with the outside world. This support is disabled by default because it was found to have a detrimental effect on the overall performance of the TCP/IP stack on the I/O nodes, slowing down network filesystems. This feature can also be enabled on a per-job basis by setting the <tt>ZOID_ENABLE_NAT</tt> environment variable when submitting a job (see the [[FAQ#How to open a socket from a CN to the outside world|FAQ]]).
; ZOID_USER_SCRIPT (-u)
: Specifies the pathname to the [[#User script|user script]]; it defaults to <tt>"/bin.rd/zoid-user-script.sh"</tt>. This script can be found in <tt>ramdisk/ION/ramdisk-add/bin/zoid-user-script.sh</tt>; it sets a few environment variables and then invokes the user's custom <tt>$HOME/zoid-user-script.sh</tt>. Hence, to adjust the behavior of this feature, change either this option or the script in the ramdisk. '''Note:''' to be able to invoke a script from the user's home directory, [[#The /bin.rd/update_passwd_file.sh file|update_passwd_file.sh]] must be working correctly.
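
As a worked example of the units used by <tt>ZOID_ACK_THRESHOLD</tt> above, the sketch below converts between bytes of data and tree network packets. The function names are hypothetical. The text does not say whether ZOID's own headers count towards the threshold or how a message exactly at the threshold is treated, so the sketch does only the unit conversion.

```c
#include <assert.h>
#include <stddef.h>

/* Data payload of one tree network packet, as stated in the text. */
#define TREE_PACKET_PAYLOAD 240

/* Number of tree packets needed to carry `bytes` of data (rounded up). */
size_t tree_packets_for(size_t bytes)
{
    return (bytes + TREE_PACKET_PAYLOAD - 1) / TREE_PACKET_PAYLOAD;
}

/* Amount of data, in bytes, that fits in a threshold given in packets. */
size_t threshold_payload_bytes(int threshold_packets)
{
    assert(threshold_packets >= 0);   /* 0 disables acknowledgements */
    return (size_t)threshold_packets * TREE_PACKET_PAYLOAD;
}
```

With the config file default of <tt>8</tt>, <tt>threshold_payload_bytes(8)</tt> gives 1920 bytes of data.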

===The /bin.rd/update_passwd_file.sh file===

Allowing the partition owner to log into the I/O node using SSH is one of the features of the ZeptoOS software stack. Only the administrator and the partition owner are given login access; this is controlled by the <tt>/bin.rd/update_passwd_file.sh</tt> script, which is invoked by the daemon while the partition is being initialized. The script can be found in <tt>ramdisk/ION/ramdisk-add/bin/update_passwd_file.sh</tt>.

The script makes a number of assumptions that could be site-specific, so it might require adjustment. The daemon invokes the script, passing the numerical UNIX user ID of the partition owner as the only argument. The script then scans the <tt>/bgsys/iofs/etc/passwd</tt> file for an entry with the same user ID (on Argonne machines, this file contains all valid account names). If a matching entry is found, it is appended to the <tt>/etc/passwd</tt> file in the I/O node ramdisk, thus enabling login access to the node for that user.

If allowing ordinary users access to the I/O nodes is undesirable, one can simply put <tt>exit 0</tt> at the top of the script to disable it.

===The /bin.rd/nat file===

If NAT has been [[#opt_enable_nat|requested]], the daemon invokes the <tt>/bin.rd/nat</tt> script to enable it. This script can be found in <tt>ramdisk/ION/ramdisk-add/bin/nat</tt>. Generally, it should not require any modifications.

==Programmer interface==

ZOID is a flexible, extensible, high-performance function call forwarding (RPC) infrastructure. Built-in features and the standard plug-ins provide familiar POSIX file I/O and BSD socket interfaces, but, because of the number of software layers involved, they introduce a significant overhead. For applications requiring maximum bandwidth between the compute and I/O nodes, ZOID provides an option of a customized function call forwarding with minimal overheads. This section provides an overview of how to create such custom plug-ins.

===Overview===

All that ZOID provides is function call forwarding support, and a limited one at that. Any logic (caching, prefetching, etc.) needs to be custom-built on top of it.

Follow existing plug-ins, found in <tt>packages/zoid/src/</tt>, as examples. The <tt>unix</tt> plug-in is generally the most up to date, but other plug-ins such as <tt>mapping</tt>, <tt>zoidfs</tt>, <tt>barrier</tt>, and <tt>test</tt> should also be fine.

A plug-in consists of automatically generated client-side and server-side stubs (which perform the marshalling and demarshalling of function call parameters and results, the forwarding of the function call, etc.), and of a hand-written server-side implementation which provides the implementation code for the forwarded function calls. One might also decide to provide hand-written client-side wrappers to hide some details of the ZOID API (such as the error handling) or to adhere to a particular existing API, as is the case with the <tt>unix</tt> plug-in (the wrappers used by the FUSE client are available in <tt>packages/zoid/src/unix/stubs/</tt>; another version is in the GNU libc sources, in <tt>packages/glibc/src/zoid/sysdeps/unix/sysv/linux/powerpc/powerpc32/</tt>).

The <tt>scanner.pl</tt> script, found in <tt>packages/zoid/src/</tt>, creates the automatically-generated client and server stubs based on a hand-written input header file described below. Again, please follow the examples from the existing plug-ins, such as <tt>unix</tt> or <tt>mapping</tt>. The <tt>Makefile</tt> in those plug-ins is written in a generic fashion and should only require a change to the <tt>PREFIX</tt> line to be usable with another plug-in. Use that <tt>Makefile</tt> to invoke the <tt>scanner.pl</tt> script and to compile the generated source files.

===Input header file===

The input header file must be a valid C header file with additional hints in the comments. The file is read by the <tt>scanner.pl</tt> script.

The parser in the script is rather limited and does not handle many C constructs. It is thus essential that the header file be as simple as possible. In particular, function prototypes should be specified at the end of the file, not intermixed with any other specifications such as data type definitions.

Ordinary comments are best placed on separate lines.

'''Note:''' the parser is case ''sensitive''.

====Start line====

Any complex declarations that the scanner cannot parse should be placed at the top of the file, because the parser ignores everything until it encounters the following magic start line:

<pre>
/* START-ZOID-SCANNER ID=<n> INIT=<s1> FINI=<s2> PROC=<s3> */
</pre>

; ID=<n>
: Each plug-in needs a unique, 16-bit identifier, passed in <tt><n></tt>. The following identifiers are already in use: <tt>0</tt> (internal), <tt>1</tt> (<tt>zoidfs</tt> plug-in), <tt>2</tt> (<tt>unix</tt>), <tt>3</tt> (<tt>lofar</tt>), <tt>4</tt> (<tt>test</tt>), <tt>5</tt> (<tt>mapping</tt>), and <tt>10</tt> (<tt>ftb</tt>).
; INIT=<s1>
: <tt><s1></tt> provides the name of an initialization function which will be invoked before a job starts running; see [[#Start-line functions|Start-line functions]] for more information. If a plug-in does not need this feature, please specify <tt>INIT=NULL</tt>.
; FINI=<s2>
: <tt><s2></tt> provides the name of a termination function which will be invoked after all job processes have exited; see [[#Start-line functions|Start-line functions]] for more information. If a plug-in does not need this feature, please specify <tt>FINI=NULL</tt>.
; PROC=<s3>
: <tt><s3></tt> provides the name of a callback function which will be invoked on the startup and termination of every application and ZOID-enabled process; see [[#Start-line functions|Start-line functions]] for more information. If a plug-in does not need this feature, please specify <tt>PROC=NULL</tt>.

====Argument hints====

Hints are generally needed by the scanner to correctly encode and decode function arguments. They need to be placed after each argument, before a separating comma (or a closing bracket), and should be embedded inside dedicated C comments. Multiple hints per argument are usually provided; these are separated by a colon (<tt>:</tt>). The following hints are currently defined:

; in, out, inout
: Specifies whether the argument is an input argument, an output argument, or both. <tt>in</tt> is the default.
; obj, str, ptr, arr, arr2d
: Specifies the type of the argument, respectively a plain object (say, an <tt>int</tt>, or a structure passed by value), a <tt>'\0'</tt>-terminated character string, a pointer to a plain object, an array of objects, or a two-dimensional array (<tt>type**</tt>, not <tt>type[][]</tt>). <tt>obj</tt> is the default.
; size
: Required for array arguments (<tt>arr</tt> and <tt>arr2d</tt>). Indicates the index of another argument in the same function, which is used to pass the array size. Absolute numbers are accepted (<tt>1</tt> to ''number of arguments''), as are relative ones (<tt>+1</tt> for the next argument, <tt>-1</tt> for the previous argument, etc.). For <tt>arr</tt> arguments, the size argument must be of a numerical type, or a pointer to such a type. For <tt>arr2d</tt> arguments, the size argument must itself be an array (an <tt>arr</tt> argument) of numerical elements, specifying the sizes along the less significant dimension of the array (the size of the more significant dimension is the size of the <tt>arr</tt> array itself). Please note that sizes are counted in elements of the base array type (thus, in units of <tt>sizeof(int)</tt> for an array of <tt>int</tt>s), not in bytes. To count in bytes, declare the array argument as <tt>char*</tt> or <tt>void*</tt> (the latter relies on a GCC extension).
; nullok
: An option for arguments passed by pointer (basically, all but <tt>obj</tt>). If provided, it indicates that the argument is allowed to be <tt>NULL</tt>. This is not the default because supporting <tt>NULL</tt> pointers results in additional computational and protocol overhead. '''Note:''' if a <tt>NULL</tt> pointer is passed to an argument that lacks the <tt>nullok</tt> flag, the client ''will'' crash.
; zerocopy
: An option for array arguments. Enables a more efficient marshalling/demarshalling protocol for the array, which does not use extra memory copies. Can be used for no more than one <tt>in</tt> argument and no more than one <tt>out</tt> argument. [[#Zerocopy performance|Zerocopy performance]] discusses performance considerations when using this option.
; userbuf
: An option for <tt>zerocopy</tt>; only supported for <tt>arr</tt> arguments. Enables a special form of zero-copy support, discussed in [[#Zerocopy with a custom output buffer|Zerocopy with a custom output buffer]] and [[#Zerocopy with a custom input buffer|Zerocopy with a custom input buffer]].

Here is an example function prototype with the hints:

<pre>
int zoidfs_readlink(const zoidfs_handle_t * handle /* in:ptr */,
                    char * buffer /* out:arr:size=+1 */,
                    size_t buffer_length /* in:obj */);
</pre>

====Limitations====

As indicated earlier, the scanner is limited, so keep the prototypes simple.

The return type of a forwarded function must be scalar or <tt>void</tt>.

Structures with pointer fields inside of them cannot be forwarded.

====Generated files====

For every function prototype found, the scanner generates two output files: one for a client calling the function and one for the server, where the function is in fact executed. Code in the generated files performs marshalling and demarshalling of function arguments and results.

Two more files per plug-in are generated: ''header''<tt>_defs.h</tt> and ''header''<tt>_dispatch.c</tt>.

None of the generated files should be modified.

===Server-side API===

Server-side stubs and the server-side implementation need to be passed as modules when invoking the ZOID I/O daemon, as described [[#opt_modules|earlier]].

The hand-written server-side implementation code should include the <tt>zoid_api.h</tt> header file (available from <tt>packages/zoid/prebuilt/</tt>) and the plug-in input header file.

All the functions listed in the header file need to be defined in the server-side implementation code. The code needs to be compiled as a shared library; use the <tt>implementation/</tt> subdirectory of the <tt>unix</tt> plug-in as an example. Please note that since ZOID is multi-threaded, multiple functions can be invoked at the same time, so one must ensure that the implementation is multi-thread-safe.

====Start-line functions====

The following [[#Start line|start-line]] functions can be defined:

<pre>
void INIT(int pset_mpi_proc_count, int argc, int envc, const char* argenv);
</pre>

The INIT function is invoked during initialization, right before a job starts running. Arguments:

; pset_mpi_proc_count
: The number of job processes that will be handled by this I/O node. Note that I/O nodes also handle additional ZOID-enabled processes, such as the FUSE clients, which are not included in this number.
; argc
: The number of command-line arguments plus one.
; envc
: The number of environment variables.
; argenv
: An array of <tt>'\0'</tt>-terminated strings, one after another. The first string is the name of the job executable, followed by <tt>argc-1</tt> command-line arguments, followed by <tt>envc</tt> environment variables.
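
The packed layout of <tt>argenv</tt> described above can be walked with a few lines of C. The sketch below is only an illustration: <tt>packed_string_at</tt> and <tt>packed_getenv</tt> are hypothetical helper names, and the lookup assumes that environment entries use the usual <tt>NAME=value</tt> form, which the text above does not state explicitly.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the index-th string in a buffer of
   consecutive '\0'-terminated strings, or NULL if index is out of range.
   count is the total number of strings (argc + envc for argenv). */
const char *packed_string_at(const char *argenv, int count, int index)
{
    const char *p = argenv;
    int i;

    assert(argenv != NULL);
    if (index < 0 || index >= count)
        return NULL;
    for (i = 0; i < index; i++)
        p += strlen(p) + 1;        /* skip the string and its terminator */
    return p;
}

/* Hypothetical helper: look up an environment variable. The environment
   strings start at index argc, after the executable name and its
   argc-1 arguments. Assumes entries of the form NAME=value. */
const char *packed_getenv(const char *argenv, int argc, int envc,
                          const char *name)
{
    size_t name_len = strlen(name);
    const char *p = packed_string_at(argenv, argc + envc, argc);
    int i;

    for (i = 0; i < envc && p != NULL; i++)
    {
        if (strncmp(p, name, name_len) == 0 && p[name_len] == '=')
            return p + name_len + 1;
        p += strlen(p) + 1;
    }
    return NULL;
}
```

An INIT function could, for example, call <tt>packed_getenv(argenv, argc, envc, "HOME")</tt> to inspect the job's environment.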

<pre>
void FINI(void);
</pre>

The FINI function is invoked after the last process of the job has terminated.

<pre>
void PROC(int added, int pset_pid);
</pre>

The PROC function is invoked on the startup and termination of every application and ZOID-enabled process on the compute node. Arguments:

; added
: <tt>1</tt> if the process was started, <tt>0</tt> if it was terminated.
; pset_pid
: A process identifier (as returned by [[#Implementation functions|<tt>__zoid_calling_process_id</tt>]]).

====Implementation functions====

The hand-written server-side implementation functions can themselves call back a few ZOID functions, available by including the <tt>zoid_api.h</tt> header file:

<pre>
int __zoid_calling_process_id(void);
</pre>

This function returns a unique identifier of the compute node process that invoked the function. The identifier is ''not'' an MPI rank, because some processes, such as the FUSE clients, are not part of the application and hence do not have a rank. The identifiers are only unique within one I/O node, and they can be reused if a process starts after another one has terminated.

<pre>
void __zoid_register_userbuf(void* userbuf,
                             void (*callback)(void* userbuf, void* priv),
                             void* priv);
</pre>

This function will be discussed in [[#Zerocopy with a custom output buffer|Zerocopy with a custom output buffer]].

<pre>
int __zoid_send_output(int pid, int fd, const char* buffer, int len);
</pre>

This function writes an arbitrary string to the job's standard output or error. Arguments:

; pid
: Process identifier as returned by <tt>__zoid_calling_process_id</tt>. The process in question ''must'' have an MPI rank, meaning that it must be either an application process or a process launched from an application process.
; fd
: <tt>1</tt> for standard output, <tt>2</tt> for standard error.
; buffer, len
: The string and its length. <tt>'\0'</tt> should not be included in <tt>len</tt> and <tt>buffer</tt> does not need to be <tt>'\0'</tt>-terminated.

The function returns 0 if successful, and -1 if not (such as when the process identified by <tt>pid</tt> does not have an MPI rank).
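
A plug-in will often want formatted diagnostics rather than raw strings. The sketch below shows one possible wrapper. It takes the sender as a function pointer with the same signature as <tt>__zoid_send_output</tt>, so that it can be compiled and tried outside the daemon; <tt>plugin_log</tt>, <tt>send_output_fn</tt>, and <tt>fake_send_output</tt> are hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Same signature as __zoid_send_output; passed in as a pointer so that
   the sketch does not depend on zoid_api.h. */
typedef int (*send_output_fn)(int pid, int fd, const char *buffer, int len);

/* Hypothetical wrapper: format a message and send it to the standard
   error of the job that process `pid` belongs to. */
int plugin_log(send_output_fn send, int pid, const char *fmt, ...)
{
    char msg[256];
    va_list ap;
    int len;

    assert(send != NULL && fmt != NULL);

    va_start(ap, fmt);
    len = vsnprintf(msg, sizeof(msg), fmt, ap);
    va_end(ap);

    if (len < 0)
        return -1;
    if (len >= (int)sizeof(msg))
        len = (int)sizeof(msg) - 1;    /* the message was truncated */

    /* fd 2 is standard error; len does not include the '\0'. */
    return send(pid, 2, msg, len);
}

/* Hypothetical stand-in for the real function, used only to exercise
   the wrapper outside the daemon. It records what it was asked to send. */
static int last_fd, last_len;

int fake_send_output(int pid, int fd, const char *buffer, int len)
{
    (void)pid;
    (void)buffer;
    last_fd = fd;
    last_len = len;
    return 0;
}
```

Inside a real implementation function, the call would presumably look like <tt>plugin_log(__zoid_send_output, __zoid_calling_process_id(), "...")</tt>.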

===Client-side API===

A compute node application needs to be linked with the client-side stubs and with a common support library <tt>libzoid_cn.a</tt> (a prebuilt version of the latter is in <tt>packages/zoid/prebuilt</tt>; sources are in <tt>packages/zoid/src/cnl/client</tt>). Several functions are available to applications by including the <tt>zoid_api.h</tt> header file:

====Initialization====

<pre>
int __zoid_init(void);
</pre>

This function ''must'' be invoked before any ZOID or ZOID-forwarded functions can be invoked. It returns <tt>0</tt> if successful, <tt>1</tt> otherwise. There is no corresponding termination function.

<pre>
int __zoid_job_size(void);
int __zoid_my_rank(void);
</pre>

These functions return, respectively, the number of processes in the job (the size of <tt>MPI_COMM_WORLD</tt>), and the MPI rank of the current process. Either will return <tt>-1</tt> if the current process does not have an MPI rank, i.e., if it is not an application process and was not launched from an application process (say, if it was launched from an interactive shell).

====Error conditions====

<pre>
int __zoid_error(void);
</pre>

This function should be invoked on the client side after ''every'' forwarded function call returns, to determine if any errors occurred within the forwarding layer. A return value of <tt>0</tt> indicates success; otherwise, one of the following error values will be returned:

; ENOSYS
: Invalid command sent from the client. Typically indicates that the corresponding I/O-node-side [[#opt_modules|modules]] have not been loaded.
; ENOMEM
: Out of memory condition.
; E2BIG
: Message exceeded the internal size limit.
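
When reporting these conditions to the user, it can help to map the codes listed above to readable text. The sketch below is one way to do that; <tt>zoid_error_text</tt> is a hypothetical helper, and it takes the value returned by <tt>__zoid_error</tt> as a plain integer so that it can be compiled on its own.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical helper: describe a value returned by __zoid_error().
   Only the three codes documented above are named explicitly. */
const char *zoid_error_text(int err)
{
    assert(err >= 0);   /* errno-style codes are positive; 0 means success */

    switch (err)
    {
    case 0:
        return "success";
    case ENOSYS:
        return "invalid command (are the I/O node modules loaded?)";
    case ENOMEM:
        return "out of memory";
    case E2BIG:
        return "message exceeded the internal size limit";
    default:
        return "unexpected forwarding error";
    }
}
```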

<pre>
int __zoid_excessive_size(void);
</pre>

If <tt>__zoid_error</tt> returned <tt>E2BIG</tt>, calling this function will indicate by how many bytes the input or output was too large.

ZOID [[#opt_buffer_size|has a limit]] on the message size, around 4 MB by default. The limit is enforced on both input and output. The limit only applies to buffers "owned" by ZOID on the daemon side; it does not apply to custom [[#Zerocopy with a custom input buffer|input]] or [[#Zerocopy with a custom output buffer|output]] buffers.

If the limit is hit, the operation needs to be split into smaller ones. Information returned by <tt>__zoid_excessive_size</tt> makes it easy to adjust the buffer and resubmit.

'''Note:''' While the input-side (argument) overflow is flagged immediately on the client side, and is thus fairly cheap to hit, the output-side (result) overflow is flagged on the I/O node, after the request has been sent there (but before the implementation function is invoked). It is thus advised to cache at least the size limit for the output side for the next invocation, to avoid a future communication overhead. The size limit is function-specific, since it depends on sizes of other arguments and results.

Here is an example of how the client-side convenience wrapper for a call such as POSIX <tt>read</tt> could be implemented:

<pre>
ssize_t read(int fd, void *buf, size_t nbytes)
{
    /* Cached size limit; -1 until the limit has been hit once. */
    static ssize_t max_read_nbytes = -1;
    ssize_t bytes_read;

    bytes_read = 0;
    do
    {
        ssize_t toread, justread;
        int error;

        toread = nbytes - bytes_read;

        if (max_read_nbytes != -1 && toread > max_read_nbytes)
            toread = max_read_nbytes;

        /* unix_read is the forwarded function call. The cast avoids
           arithmetic on void*, which is a GCC extension. */
        justread = unix_read(fd, (char *)buf + bytes_read, toread);

        if ((error = __zoid_error()))
        {
            if (error != E2BIG)
            {
                /* For a generic ZOID error, just bail out. */
                errno = error;
                return -1;
            }

            /* The read request was too large. Shrink it and retry. */
            max_read_nbytes = toread - __zoid_excessive_size();
        }
        else
        {
            if (justread < 0)
            {
                /* For a generic read() error, just bail out.
                   In case of an I/O error, unix_read returns -errno. */
                errno = -justread;
                return -1;
            }

            bytes_read += justread;

            if (justread != toread)
                /* unix_read as such succeeded, but it read fewer bytes than
                   expected. We terminate prematurely then. */
                break;
        }
    } while ((size_t)bytes_read < nbytes);

    return bytes_read;
}
</pre>

===Additional considerations===

====Forwarding <tt>errno</tt>====

If one needs to pass a variable such as <tt>errno</tt> from the I/O node to the client, the most straightforward way is to add an extra integer <tt>out</tt> pointer argument to all functions and pass it that way. Another option is to do it the same way the UNIX kernel does: pass it as a negative return value from the functions. The <tt>unix</tt> plug-in does it that way, so, e.g., the implementation of <tt>close</tt> on the I/O node looks something like this:

<pre>
if (close(server_fd) == -1)
    return -errno;
else
    return 0;
</pre>

Then, on the client side, we have a convenience wrapper:

<pre>
int close(int fd)
{
    return unix_decode_result(unix_close(fd));
}
</pre>

<tt>unix_decode_result</tt> is a preprocessor macro that handles both ZOID errors and errors returned by the plug-in. It uses a number of GCC extensions to make it as transparent as possible:

<pre>
#define unix_decode_result(result) \
({ \
    typeof (result) _result = (result); \
    int _n; \
    if ((_n = __zoid_error()) != 0) \
    { \
        errno = _n; \
        _result = -1; \
    } \
    else if (_result < 0) \
    { \
        errno = -_result; \
        _result = -1; \
    } \
    _result; \
})
</pre>

====Returning variable amounts of data in arrays====

Just like with UNIX system calls, ZOID does not allocate memory for the results. Instead, callers must provide pre-allocated arrays, along with their sizes. UNIX would then typically return the size of the used part as a return value from a system call. Unfortunately, ZOID cannot make use of that – it will use the same array size argument to determine how much data to send back, so even if only a small part of the provided buffer is actually filled in, the whole buffer will be sent back, which is inefficient. This can be prevented by passing the array size as an <tt>inout</tt> pointer to a numerical type. A server-side implementation of a function such as <tt>read</tt> then looks like this:

<pre>
ssize_t unix_read(int fd /* in:obj */,
                  void *buf /* out:arr:size=+1 */,
                  size_t *count /* inout:ptr */)
{
    ssize_t ret;

    ...

    if ((ret = read(fd, buf, *count)) == -1)
    {
        *count = 0;
        return -errno;
    }
    else
    {
        *count = ret;
        return ret;
    }
}
</pre>

Obviously, the client side needs to be modified as well, to pass the size argument by address.

'''Note:''' this feature has certain implementation limitations. It can misbehave in the presence of multiple output arrays (or a single output <tt>arr2d</tt>, which internally behaves a lot like multiple separate <tt>arr</tt>s). Essentially, for efficiency reasons, the placement of arrays in the result buffer is determined before an implementation function is invoked. If this feature is used to change the size of one array, and that array is followed in the output buffer by another array, a "hole" will be created in the buffer, causing problems. However, in the most common case of a single output array the feature is completely reliable.

====Zerocopy performance====

Implementation-wise, ZOID is always zero-copy on the server side, meaning that data that implementation functions put in the <tt>out</tt> arrays is sent to the compute nodes without any extra memory copies.

The client side is zero-copy only for arrays that use the <tt>zerocopy</tt> flag in the header file. Because of the additional protocol overhead that <tt>zerocopy</tt> introduces, it should be used only for potentially large memory buffers, such as the buffers of file I/O <tt>read</tt> or <tt>write</tt> calls.

'''Note:''' for maximum performance, the arrays passed as <tt>zerocopy</tt> arguments on the compute nodes must be aligned in memory to a 16-byte boundary; otherwise, an interim buffer will be used. The memory allocation routines such as <tt>malloc</tt> have been modified to align memory to that boundary. If there is a danger that the user code might pass a large unaligned buffer, and the semantics will not be affected, it makes sense to write code that detects insufficient alignment and splits the operation in two: a small unaligned one (say, up to 240 bytes, the data payload of a single packet on the tree network), followed by a larger, properly aligned one.
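
The split suggested in the note above comes down to computing how many leading bytes precede the next 16-byte boundary. The sketch below shows that computation only; <tt>unaligned_head_len</tt> is a hypothetical helper name, and issuing the two forwarded calls is left to the wrapper that uses it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ZEROCOPY_ALIGN 16   /* alignment required for zero-copy transfers */

/* Hypothetical helper: number of leading bytes of buf that should be
   transferred separately so that the remainder starts on a 16-byte
   boundary. Returns 0 for an aligned buffer and never more than len. */
size_t unaligned_head_len(const void *buf, size_t len)
{
    size_t misalign, head;

    assert(buf != NULL);

    misalign = (size_t)((uintptr_t)buf % ZEROCOPY_ALIGN);
    head = misalign ? ZEROCOPY_ALIGN - misalign : 0;

    return head < len ? head : len;
}
```

A wrapper would first transfer <tt>unaligned_head_len(buf, len)</tt> bytes (at most 15, well within a single tree network packet) and then the aligned remainder starting at <tt>(char *)buf + head</tt>.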

====Zerocopy with a custom output buffer====

Normally, memory for output arrays to be filled in by server-side implementation functions is allocated by the ZOID daemon. This might be inconvenient when the data to be filled arrives asynchronously, possibly before the implementation function is even invoked; in such situations, an interim memory buffer must be used, forcing an extra memory copy.

This extra copy can be avoided for <tt>zerocopy</tt> output <tt>arr</tt> types if the <tt>userbuf</tt> flag has been used. No space will then be preallocated by the daemon for the array (the server-side stub will pass a <tt>NULL</tt> pointer); instead, the implementation function must provide the daemon with its own buffer. It can do so by calling:

<pre>
void __zoid_register_userbuf(void* userbuf,
                             void (*callback)(void* userbuf, void* priv),
                             void* priv);
</pre>

Arguments:

; userbuf
: The address of the buffer.
; callback
: A callback function that is invoked by the daemon when the buffer has been sent to the client and is thus no longer needed. <tt>userbuf</tt> is passed as the first argument to the callback. It is safe for the callback to invoke <tt>__zoid_calling_process_id</tt>, if desired.
; priv
: Private data passed as the second argument to the <tt>callback</tt>. It is not interpreted by ZOID in any way.

The size of the provided buffer is determined as for any other array argument: the maximum value is provided by the client via the <tt>size</tt> argument. The server-side implementation part may choose to return less than the maximum amount, as explained [[#Returning variable amounts of data in arrays|earlier]].

As in other zero-copy cases, for maximum performance, the buffer provided must be aligned in memory to a 16-byte boundary; otherwise, an interim buffer will be used. The memory allocation routines such as <tt>malloc</tt> have been modified to align memory to that boundary, but we recommend explicitly calling <tt>posix_memalign</tt>.

'''Note:''' because the buffer provided is ''not'' allocated by ZOID, message size restrictions discussed [[#size_restrictions|earlier]] do not apply to it. Please do not abuse this capability. There is a good reason for the message size limit: it is there so that the maximum amount of memory required by the ZOID daemon stays limited. Too many overly large user-allocated buffers might result in an out-of-memory condition on the I/O node.

We provide a simple example below. It is a little artificial in the sense that the buffer is allocated within the implementation function; as we indicated, this feature is most likely to be useful with buffers allocated outside of the implementation functions:

<pre>
static void buffer_cb(void* userbuf, void* priv)
{
    free(userbuf);
}

ssize_t unix_read(int fd /* in:obj */,
                  void *buf /* out:arr:size=+1:zerocopy:userbuf */,
                  size_t *count /* inout:ptr */)
{
    ssize_t ret;

    ...

    if (posix_memalign(&buf, 16, *count))
    {
        *count = 0;
        return -ENOMEM;
    }

    __zoid_register_userbuf(buf, &buffer_cb, NULL);

    if ((ret = read(fd, buf, *count)) == -1)
    {
        *count = 0;
        return -errno;
    }
    else
    {
        *count = ret;
        return ret;
    }
}
</pre>

====Zerocopy with a custom input buffer====

The <tt>userbuf</tt> flag discussed above can also be used for ''input'' <tt>zerocopy</tt> <tt>arr</tt> arguments. This could be useful to avoid extra memory copies if the data in the array will be needed after the implementation function has returned.

If the flag is used, the daemon will not allocate the memory for the array; instead, in the middle of receiving the request from the client, it will call an allocation routine from the server-side implementation code. The name of the allocation routine is the name of the function that uses the input <tt>userbuf</tt> argument, with the <tt>_allocate_cb</tt> suffix appended. Its prototype needs to be as follows:

<pre>
void* <name>_allocate_cb(int len);
</pre>

The single argument passed by the daemon is the length of the array in bytes. The routine must return a pointer to a buffer of that size or <tt>NULL</tt> if that is not possible (in which case, the function will fail and <tt>__zoid_error</tt> on the client side will return <tt>ENOMEM</tt>).

There is a restriction on the type of the array: its base type must have a size of one byte, so the array should be of type <tt>char*</tt>, <tt>unsigned char*</tt>, <tt>void*</tt> (a GCC extension), etc.

The allocation routine is invoked in the same context as ordinary implementation functions. It may block if it so desires; this will block the compute node client that invoked the routine, but all other clients can keep communicating with the server, thanks to its multi-threaded architecture.

Once the allocation routine has returned and a complete request has been received by the daemon, the implementation function is invoked as usual, with a correct address of the input <tt>userbuf</tt> array. It is the responsibility of the plug-in implementer to release the memory occupied by that array when it is no longer needed.

As with other user-level callbacks, the allocation routine may call <tt>__zoid_calling_process_id</tt> to learn which client process sent the request. Also, as in other zero-copy cases, for maximum performance, the buffer provided must be aligned in memory to a 16-byte boundary; otherwise, an interim buffer will be used. The memory allocation routines such as <tt>malloc</tt> have been modified to align memory to that boundary, but we recommend explicitly calling <tt>posix_memalign</tt>. Finally, as with output <tt>userbuf</tt>, message size restrictions discussed [[#size_restrictions|earlier]] do not apply to the user-allocated buffers. Please do not abuse this capability. There is a good reason for the message size limit: it keeps the maximum amount of memory required by the ZOID daemon bounded. Too many user-allocated buffers, or buffers that are too large, might result in an out-of-memory condition on the I/O node.

Under rare circumstances, input <tt>userbuf</tt> could result in memory leaks. For this to take place, the job would have to be interrupted after the allocation routine has been run, but before the implementation function is called. This could only cause problems if I/O nodes are not rebooted between jobs. Those concerned about this scenario can eliminate the leak by adding necessary memory release code to the [[#Start-line functions|FINI]] function.

A simple example:

<pre>
void* unix_write_allocate_cb(int len)
{
    void* ptr;

    if (posix_memalign(&ptr, 16, len))
        return NULL;

    return ptr;
}

ssize_t unix_write(int fd /* in:obj */,
                   const void *buf /* in:arr:size=+1:zerocopy:userbuf */,
                   size_t count /* in:obj */)
{
    ssize_t ret;

    ...

    if ((ret = write(fd, buf, count)) == -1)
        ret = -errno;

    free((void*)buf);

    return ret;
}
</pre>

----
[[Ramdisk]] | [[ZeptoOS_Documentation|Top]] | [[(K)TAU]]

ZOID

2009-05-08T23:04:29Z

Iskra:

[[Ramdisk]] | [[ZeptoOS_Documentation|Top]] | [[(K)TAU]]
----

==Introduction==

ZOID is an I/O forwarding component of the ZeptoOS project. Any communication between the compute nodes and the I/O nodes (job management, file I/O, sockets) is handled by ZOID.

ZOID infrastructure consists of:
* A multithreaded <tt>zoid</tt> daemon on the I/O nodes, which performs I/O forwarding for the compute nodes and also communicates with the service node to perform job management,
* A <tt>control</tt> daemon on the compute nodes, which is responsible for job management tasks such as launching application processes, and for forwarding <tt>stdin</tt>/<tt>out</tt>/<tt>err</tt> data and IP packets,
* A <tt>zoid-fuse</tt> daemon on the compute nodes, which performs file I/O forwarding for POSIX-compliant applications.

==User interface==

ZOID is meant to be transparent to users, but there are a few optional mechanisms available to interact with it.

===User script===

Right before a job starts running, and right after the last process of a job has terminated, the ZOID daemon attempts to invoke a ''user script'' on the I/O nodes. By default, the daemon invokes <tt>$HOME/zoid-user-script.sh</tt> (this pathname can be [[#opt_user_script|changed]] by an administrator). A single parameter is passed to the script: <tt>1</tt> at job startup, and <tt>0</tt> at job termination.

Information about the job will be passed to the script in the following environment variables:
; <tt>ZOID_JOB_EXEC</tt>
: name of the job executable,
; <tt>ZOID_JOB_ARGS</tt>
: job arguments, separated by colons (<tt>:</tt>)
; <tt>ZOID_JOB_ENV</tt>
: job environment variables, separated by colons (<tt>:</tt>)
; <tt>ZOID_JOB_ID</tt>
: BG/P control system job id ('''Note:''' this is generally different from the Cobalt job ID; see [[FAQ#How to obtain a Cobalt job ID|FAQ]] for the latter),
; <tt>ZOID_JOB_GLOBAL_SIZE</tt>
: the number of processes in the job (the size of <tt>MPI_COMM_WORLD</tt>),
; <tt>ZOID_JOB_LOCAL_SIZE</tt>
: the number of job processes handled by this I/O node,
; <tt>ZOID_JOB_MODE</tt>
: <tt>0</tt> for SMP, <tt>1</tt> for VN, and <tt>2</tt> for DUAL,
; <tt>SHELL</tt>, <tt>PATH</tt>, <tt>USER</tt>, and <tt>HOME</tt>
: will also be set...

'''Notes:'''
* The user script is invoked ''synchronously'' by the daemon, i.e., the job will not start running until the script terminates. If one needs some processes to run on the I/O nodes while the job is running, they should be started in the background (&).
* For this feature to work, [[#The /bin.rd/update_passwd_file.sh file|update_passwd_file.sh]] must be working correctly.

===File broadcast===
A <tt>/bin.rd/f2cn</tt> command is available on the I/O nodes for a very efficient (hardware-assisted) broadcasting of files to all the compute nodes handled by the given I/O node.

The command takes two arguments:
* absolute pathname to the input file on the I/O node,
* absolute pathname to the output file on the compute nodes.

The input file does not need to be physically on the I/O node; it can be on a network filesystem mounted on the node. The file will be created in the ramdisk of each compute node.

The throughput is in practice limited by how fast the input file can be read; we have seen results in excess of 300 MB/s for files residing in the I/O node ramdisk.

'''Note:''' all the compute nodes in the pset must be up and running. Do not use this command on ''incomplete'' partitions (e.g., a one-process job on a 64-node partition); this will likely hang the ZOID daemon.

'''Note2:''' this feature can safely be used from within a [[#User script|user script]], so one can, e.g., pre-stage large binaries, like this:

User script (<tt>$HOME/zoid-user-script.sh</tt>):
<pre>
#!/bin/sh

if [ "$1" -eq "1" ]; then
    /bin.rd/f2cn $HOME/large_binary /tmp/large_binary
fi
exit 0
</pre>

Job script (submitted using Cobalt or mpirun):
<pre>
#!/bin/sh

chmod 755 /tmp/large_binary
/tmp/large_binary
</pre>

===Performance counters===

A <tt>/bin.rd/statquery</tt> command is available on the I/O nodes for obtaining the performance counters of the I/O daemon.

The command takes a single optional argument:
* the interval between successive queries, in seconds.

If the argument is not provided, the command will terminate after the first query.

Here is a sample output generated:

<pre>
Timestamp: 1240439085.688831
Total messages sent: 5767
Total bytes sent: 7619170
Total messages received: 5717
Total bytes received: 72575
IP fwd messages sent: 196
IP fwd bytes sent: 5889
IP fwd messages received: 84
IP fwd bytes received: 6453
Stream messages sent: 65
Stream bytes sent: 520
Stream messages received: 65
Stream bytes received: 1416
Broadcast messages sent: 1
Broadcast bytes sent: 2437906
Internal messages sent: 193
Internal bytes sent: 39524
Internal messages received: 256
Internal bytes received: 1792
Plugin 5 messages sent: 0
Plugin 5 bytes sent: 0
Plugin 5 messages received: 0
Plugin 5 bytes received: 0
Plugin 2 messages sent: 5312
Plugin 2 bytes sent: 5135331
Plugin 2 messages received: 5312
Plugin 2 bytes received: 62914
</pre>

The meaning of the fields is as follows:
; Timestamp
: number of seconds and microseconds from the epoch, as returned by <tt>gettimeofday(2)</tt>,
; IP fwd
: IP packet forwarding between compute nodes and I/O nodes,
; Stream
: <tt>stdin</tt>/<tt>out</tt>/<tt>err</tt> streams,
; Broadcast
: [[#File broadcast|file broadcasts]],
; Internal
: job control messages, etc.
; Plugin 5
: internal <tt>mapping</tt> plug-in, used by MPI,
; Plugin 2
: <tt>unix</tt> plugin (POSIX file I/O).

The counters are 64-bit integers, so they will take a while to overflow :-).

Example user script (<tt>$HOME/zoid-user-script.sh</tt>) that samples the statistics every 60 seconds and writes them to a unique file:
<pre>
#!/bin/sh

if [ "$1" -eq "1" ]; then
    /bin.rd/statquery 60 >$HOME/zoid_stats.$ZOID_JOB_ID.`hostname` &
fi
exit 0
</pre>
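
The samples collected this way can be post-processed to obtain average transfer rates. The following is a minimal sketch in C, assuming that the output keeps the <tt>label: value</tt> layout of the sample shown earlier; <tt>statquery_field</tt> and <tt>throughput_bytes_per_sec</tt> are hypothetical helper names and are not part of ZeptoOS.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of ZeptoOS): looks up a counter by its
   label in a block of statquery output. Assumes each line has the form
   "<label>: <value>", as in the sample output.
   Returns 0 and stores the value in *value on success, -1 otherwise. */
int statquery_field(const char *text, const char *label,
                    unsigned long long *value)
{
    size_t len = strlen(label);
    const char *line = text;

    while (line != NULL && *line != '\0')
    {
        /* Match the label only at the start of a line, so that a
           fragment of one counter name cannot match another. */
        if (strncmp(line, label, len) == 0 && line[len] == ':')
        {
            *value = strtoull(line + len + 1, NULL, 10);
            return 0;
        }
        line = strchr(line, '\n');
        if (line != NULL)
            line++;
    }
    return -1;
}

/* Average rate between two samples taken interval_sec seconds apart. */
double throughput_bytes_per_sec(unsigned long long earlier,
                                unsigned long long later,
                                double interval_sec)
{
    return (double)(later - earlier) / interval_sec;
}
```

Calling <tt>statquery_field</tt> with the label <tt>Total bytes sent</tt> on two consecutive samples and passing the results to <tt>throughput_bytes_per_sec</tt> together with the sampling interval gives the average outgoing rate over that interval.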

==Administrator interface==

===Configuration file===

The <tt>zoid</tt> I/O daemon accepts a number of command-line options that can be used to change its behavior. They can be adjusted by editing the <tt>ramdisk/ION/ramdisk-add/etc/sysconfig/zoid</tt> file and rebuilding the I/O node ramdisk:

; ZOID_BUFFER_SIZE (-b)
: Specifies the size of the buffers used for messages. Because a separate buffer is needed for a request and a reply, and typically no more than one of these needs to be large, to save memory ZOID supports buffers of two sizes: a small one (4 KB by default) and a large one (4 MB+1 KB by default – the 1 KB is there to accommodate the headers). Use a colon (<tt>:</tt>) to separate the two sizes when customizing this value. If desired, support for the second buffer size can be disabled by providing only one value to this option.
; ZOID_ACK_THRESHOLD (-a)
: Specifies a size threshold for the rendezvous protocol for messages coming from the compute nodes, in units of tree network packets (240 bytes of data each). An eager protocol is used for messages below the threshold. Messages above the threshold use flow control in the form of a rendezvous protocol with message acknowledgements; basically, the daemon will only receive one large message at a time, which improves predictability and overall throughput. The daemon default for this option is to not use acknowledgements, but the config file defaults to a value of <tt>8</tt>, which is the size of the hardware FIFO buffer of the tree network device. Set this option to <tt>0</tt> (or comment it out altogether) to disable message acknowledgements.
; ZOID_MODULES (-m)
: Specifies a <tt>:</tt>-separated list of ZOID plug-ins to load. This defaults to <tt>"unix_impl.so:unix_preload.so:mapping_impl.so:mapping_preload.so"</tt> in the config file; do not remove any of these or basic system services will stop working. The <tt>unix</tt> plug-in provides POSIX file I/O support, while <tt>mapping</tt> is used by our MPI implementation to map between MPI ranks and Blue Gene X/Y/Z/T coordinates. Custom plug-ins can be created and added here; see [[#Programmer interface|Programmer interface]] for details.
; ZOID_ENABLE_NAT (-n)
: Enables network address translation (NAT) for IP packets coming from the compute nodes, allowing compute nodes to communicate with the outside world. This support is disabled by default because it was found to have a detrimental effect on the overall performance of the TCP/IP stack on the I/O nodes, slowing down network filesystems. This feature can also be enabled on a per-job basis by setting the <tt>ZOID_ENABLE_NAT</tt> environment variable when submitting a job (see the [[FAQ#How to open a socket from a CN to the outside world|FAQ]]).
; ZOID_USER_SCRIPT (-u)
: Specifies the pathname to the [[#User script|user script]]; it defaults to <tt>"/bin.rd/zoid-user-script.sh"</tt>. This script can be found in <tt>ramdisk/ION/ramdisk-add/bin/zoid-user-script.sh</tt>; it sets a few environment variables and then invokes the user's custom <tt>$HOME/zoid-user-script.sh</tt>. Hence, to adjust the behavior of this feature, either change this option or the script in the ramdisk. '''Note:''' to be able to invoke a script from the user's home directory, [[#The /bin.rd/update_passwd_file.sh file|update_passwd_file.sh]] must be working correctly.
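
To make the units used by <tt>ZOID_ACK_THRESHOLD</tt> and <tt>ZOID_BUFFER_SIZE</tt> concrete, here is a small arithmetic sketch in C. The helper names are hypothetical, 1 KB and 1 MB are assumed to be binary units (1024 and 1024×1024 bytes), and whether message headers count toward the threshold is not specified here, so the first figure should be read as packet payload only.

```c
#include <stddef.h>

/* Data payload of one tree network packet, as stated above. */
#define TREE_PACKET_PAYLOAD 240

/* Hypothetical helper: converts a ZOID_ACK_THRESHOLD value (given in
   packets) to the corresponding amount of packet payload in bytes. */
size_t ack_threshold_bytes(size_t packets)
{
    return packets * TREE_PACKET_PAYLOAD;
}

/* Default large buffer: 4 MB of data plus 1 KB for the headers
   (assuming binary units). */
size_t default_large_buffer_bytes(void)
{
    return 4u * 1024u * 1024u + 1024u;
}
```

With the config file default of <tt>8</tt>, the threshold corresponds to 8 × 240 = 1920 bytes of packet payload, and the default large buffer works out to 4195328 bytes.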

===The /bin.rd/update_passwd_file.sh file===

Allowing the partition owner to log into the I/O node using SSH is one of the features of the ZeptoOS software stack. Only the administrator and the partition owner are given login access; this is controlled by the <tt>/bin.rd/update_passwd_file.sh</tt> script, which is invoked by the daemon while the partition is being initialized. The script can be found in <tt>ramdisk/ION/ramdisk-add/bin/update_passwd_file.sh</tt>.

The script makes a number of assumptions that could be site-specific, so it might require an adjustment. The daemon invokes the script passing a numerical UNIX user ID of the partition owner as the only argument. The script then scans the <tt>/bgsys/iofs/etc/passwd</tt> for an entry with the same user ID (on Argonne machines, this file contains all valid account names). If a matching entry is found, it is appended to the <tt>/etc/passwd</tt> file in the I/O node ramdisk, thus enabling login access to the node for that user.

If allowing ordinary users access to the I/O nodes is undesirable, one can simply put <tt>exit 0</tt> at the top of the script to disable it.

===The /bin.rd/nat file===

If NAT has been [[#opt_enable_nat|requested]], the daemon invokes the <tt>/bin.rd/nat</tt> script to enable it. This script can be found in <tt>ramdisk/ION/ramdisk-add/bin/nat</tt>. Generally, it should not require any modifications.

==Programmer interface==

ZOID is a flexible, extensible, high-performance function call forwarding (RPC) infrastructure. Built-in features and the standard plug-ins provide familiar POSIX file I/O and BSD socket interfaces, but, because of the number of software layers involved, they introduce a significant overhead. For applications requiring maximum bandwidth between the compute and I/O nodes, ZOID provides an option of a customized function call forwarding with minimal overheads. This section provides an overview of how to create such custom plug-ins.

===Overview===

All that ZOID provides is function call forwarding, and a limited form of it at that. Any logic (caching, prefetching, etc.) needs to be custom-built on top of it.

Follow existing plug-ins, found in <tt>packages/zoid/src/</tt>, as examples. The <tt>unix</tt> plug-in is generally the most up to date, but other plug-ins such as <tt>mapping</tt>, <tt>zoidfs</tt>, <tt>barrier</tt>, and <tt>test</tt> should also be fine.

A plug-in consists of automatically generated client-side and server-side stubs (which perform the marshalling and demarshalling of function call parameters and results, the forwarding of the function call, etc.), and of a hand-written server-side implementation which provides the implementation code for the forwarded function calls. One might also decide to provide hand-written client-side wrappers to hide some details of the ZOID API (such as the error handling) or to adhere to a particular existing API, as is the case with the <tt>unix</tt> plug-in (the wrappers used by the FUSE client are available in <tt>packages/zoid/src/unix/stubs/</tt>; another version is in the GNU libc sources, in <tt>packages/glibc/src/zoid/sysdeps/unix/sysv/linux/powerpc/powerpc32/</tt>).

The <tt>scanner.pl</tt> script, found in <tt>packages/zoid/src/</tt>, creates the automatically-generated client and server stubs based on a hand-written input header file described below. Again, please follow the examples from the existing plug-ins, such as <tt>unix</tt> or <tt>mapping</tt>. The <tt>Makefile</tt> in those plug-ins is written in a generic fashion and should only require a change to the <tt>PREFIX</tt> line to be usable with another plug-in. Use that <tt>Makefile</tt> to invoke the <tt>scanner.pl</tt> script and to compile the generated source files.

===Input header file===

The input header file must be a valid C header file with additional hints in the comments. The file is read by the <tt>scanner.pl</tt> script.

The parser in the script is rather limited and does not handle many C constructs. It is thus essential that the header file be as simple as possible. In particular, function prototypes should be specified at the end of the file, not intermixed with any other specifications such as data type definitions.

Ordinary comments are best placed on separate lines.

'''Note:''' the parser is case ''sensitive''.

====Start line====

Any complex declarations that the scanner cannot parse should be placed at the top of the file, because the parser ignores everything until it encounters the following magic start line:

<pre>
/* START-ZOID-SCANNER ID=<n> INIT=<s1> FINI=<s2> PROC=<s3> */
</pre>

; ID=<n>
: Each plug-in needs a unique, 16-bit identifier, passed in <tt><n></tt>. The following identifiers are already in use: <tt>0</tt> (internal), <tt>1</tt> (<tt>zoidfs</tt> plug-in), <tt>2</tt> (<tt>unix</tt>), <tt>3</tt> (<tt>lofar</tt>), <tt>4</tt> (<tt>test</tt>), <tt>5</tt> (<tt>mapping</tt>), and <tt>10</tt> (<tt>ftb</tt>).
; INIT=<s1>
: <tt><s1></tt> provides the name of an initialization function which will be invoked before a job starts running; see [[#Start-line functions|Start-line functions]] for more information. If a plug-in does not need this feature, please specify <tt>INIT=NULL</tt>.
; FINI=<s2>
: <tt><s2></tt> provides the name of a termination function which will be invoked after all job processes have exited; see [[#Start-line functions|Start-line functions]] for more information. If a plug-in does not need this feature, please specify <tt>FINI=NULL</tt>.
; PROC=<s3>
: <tt><s3></tt> provides the name of a callback function which will be invoked on a startup and termination of every application and ZOID-enabled process; see [[#Start-line functions|Start-line functions]] for more information. If a plug-in does not need this feature, please specify <tt>PROC=NULL</tt>.

====Argument hints====

Hints are generally needed by the scanner to correctly encode and decode function arguments. They need to be placed after each argument, before a separating comma (or a closing bracket), and should be embedded inside dedicated C comments. Multiple hints per argument are usually provided; these are separated by a colon (<tt>:</tt>). The following hints are currently defined:

; in, out, inout
: Specifies whether the argument is an input argument, an output argument, or both. <tt>in</tt> is the default.
; obj, str, ptr, arr, arr2d
: Specifies the type of the argument, respectively a plain object (say, an <tt>int</tt>, or a structure passed by value), a <tt>'\0'</tt>-terminated character string, a pointer to a plain object, an array of objects, or a two-dimensional array (<tt>type**</tt>, not <tt>type[][]</tt>). <tt>obj</tt> is the default.
; size
: Required for array arguments (<tt>arr</tt> and <tt>arr2d</tt>). Indicates the index of another argument in the same function, which is used to pass the array size. Absolute numbers are accepted (<tt>1</tt> to ''number of arguments'') or relative ones (<tt>+1</tt> for the next argument, <tt>-1</tt> for the previous argument, etc). For <tt>arr</tt> arguments, the size argument must be of a numerical type, or a pointer to such a type. For <tt>arr2d</tt> arguments, the size argument must itself be an array (an <tt>arr</tt> argument) of numerical elements, specifying the sizes along the less significant dimension of the array (the size of the more significant dimension is the size of the <tt>arr</tt> array itself). Please note that the unit of size for the numerical types is the size of the base array type (thus, <tt>sizeof(int)</tt> for an array of <tt>int</tt>s), not byte (if one would like it to be byte, just make the array argument have type <tt>char*</tt> or <tt>void*</tt> (a GCC extension)).
; nullok
: An option for arguments passed by pointer (basically, all but <tt>obj</tt>). If provided, it indicates that the argument is allowed to be <tt>NULL</tt>. This is not the default because supporting <tt>NULL</tt> pointers results in an additional computational and protocol overhead. '''Note:''' if a <tt>NULL</tt> pointer is passed to an argument that lacks the <tt>nullok</tt> flag, the client ''will'' crash.
; zerocopy
: An option for array arguments. Enables a more efficient marshalling/demarshalling protocol for the array, which does not use extra memory copies. Can be used for no more than one <tt>in</tt> argument and no more than one <tt>out</tt> argument. [[#Zerocopy performance|Zerocopy performance]] discusses performance considerations when using this option.
; userbuf
: An option for <tt>zerocopy</tt>; only supported for <tt>arr</tt> arguments. Enables a special form of zero-copy support, discussed in [[#Zerocopy with a custom output buffer|Zerocopy with a custom output buffer]] and [[#Zerocopy with a custom input buffer|Zerocopy with a custom input buffer]].

Here is an example function prototype with the hints:

<pre>
int zoidfs_readlink(const zoidfs_handle_t * handle /* in:ptr */,
                    char * buffer /* out:arr:size=+1 */,
                    size_t buffer_length /* in:obj */);
</pre>
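
Putting the start line and the hints together, a complete input header could look like the sketch below. Everything in it is hypothetical: the <tt>demo</tt> plug-in, its functions, and the identifier <tt>42</tt> (chosen only because it is not among the identifiers listed as in use). It has not been run through <tt>scanner.pl</tt>, so treat it as an illustration of the layout rather than a verified input.

```c
/* demo.h: input header for a hypothetical "demo" plug-in. */

/* Declarations the scanner does not need to parse go above the
   start line. */
#include <stddef.h>

typedef struct
{
    int id;
    int flags;
} demo_handle_t;

/* START-ZOID-SCANNER ID=42 INIT=NULL FINI=NULL PROC=NULL */

/* The array size is taken from the next argument (relative index). */
int demo_put(const demo_handle_t *handle /* in:ptr */,
             const char *data /* in:arr:size=+1:zerocopy */,
             size_t len /* in:obj */);

/* The array size is taken from argument 3 (absolute index) and is
   counted in ints, not bytes. The name is allowed to be NULL. */
int demo_get(const char *name /* in:str:nullok */,
             int *values /* out:arr:size=3 */,
             size_t count /* in:obj */);
```

The structure contains no pointer fields, the prototypes come last, and ordinary comments sit on their own lines, in keeping with the restrictions described in this section.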

====Limitations====

As indicated earlier, the scanner is limited, so keep the prototypes simple.

The return type of a forwarded function must be scalar or <tt>void</tt>.

Structures with pointer fields inside of them cannot be forwarded.

====Generated files====

For every function prototype found, the scanner generates two output files: one for a client calling the function and one for the server, where the function is in fact executed. Code in the generated files performs marshalling and demarshalling of function arguments and results.

Two more files per plug-in are generated: ''header''<tt>_defs.h</tt> and ''header''<tt>_dispatch.c</tt>.

None of the generated files should be modified.

===Server-side API===

Server-side stubs and the server-side implementation need to be passed as modules when invoking the ZOID I/O daemon, as described [[#opt_modules|earlier]].

The hand-written server-side implementation code should include the <tt>zoid_api.h</tt> header file (available from <tt>packages/zoid/prebuilt/</tt>) and the plug-in input header file.

All the functions listed in the header file need to be defined in the server-side implementation code. The code needs to be compiled as a shared library; use the <tt>implementation/</tt> subdirectory of the <tt>unix</tt> plug-in as an example. Please note that since ZOID is multi-threaded, multiple functions can be invoked at the same time, so one must ensure that the implementation is multi-thread-safe.

====Start-line functions====

The following [[#Start line|start-line]] functions can be defined:

<pre>
void INIT(int pset_mpi_proc_count, int argc, int envc, const char* argenv);
</pre>

The INIT function is invoked during initialization, right before a job starts running. Arguments:

; pset_mpi_proc_count
: The number of job processes that will be handled by this I/O node. Note that I/O nodes also handle additional ZOID-enabled processes, such as the FUSE clients, which are not included in this number.
; argc
: The number of command-line arguments plus one.
; envc
: The number of environment variables.
; argenv
: An array of <tt>'\0'</tt>-terminated strings, one after another. The first string is the name of the job executable, followed by <tt>argc-1</tt> command-line arguments, followed by <tt>envc</tt> environment variables.
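
Because <tt>argenv</tt> is a single block of consecutive strings rather than an array of pointers, an INIT function has to walk it string by string. The following is a minimal sketch of that walk; <tt>argenv_string</tt> and <tt>argenv_getenv</tt> are hypothetical helpers, and the <tt>NAME=value</tt> form of the environment strings is an assumption based on the usual UNIX convention.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the index-th string of a packed argenv
   block, or NULL if index is out of range. Strings 0..argc-1 are the
   executable name and its arguments; the next envc strings are the
   environment variables. */
const char *argenv_string(const char *argenv, int argc, int envc, int index)
{
    int i;

    if (index < 0 || index >= argc + envc)
        return NULL;

    /* Each string ends with '\0'; skip over index of them. */
    for (i = 0; i < index; i++)
        argenv += strlen(argenv) + 1;

    return argenv;
}

/* Hypothetical helper: looks up an environment variable by name,
   assuming entries of the form NAME=value. Returns the value, or NULL
   if the variable is not present. */
const char *argenv_getenv(const char *argenv, int argc, int envc,
                          const char *name)
{
    size_t len = strlen(name);
    int i;

    for (i = argc; i < argc + envc; i++)
    {
        const char *s = argenv_string(argenv, argc, envc, i);

        if (strncmp(s, name, len) == 0 && s[len] == '=')
            return s + len + 1;
    }
    return NULL;
}
```

An INIT function could call <tt>argenv_string(argenv, argc, envc, 0)</tt> to obtain the name of the job executable, or <tt>argenv_getenv</tt> to check for a plug-in-specific environment variable set at job submission.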

<pre>
void FINI(void);
</pre>

The FINI function is invoked after the last process of the job has terminated.

<pre>
void PROC(int added, int pset_pid);
</pre>

The PROC function is invoked on the startup and termination of every application and ZOID-enabled process on the compute node. Arguments:

; added
: <tt>1</tt> if the process was started, <tt>0</tt> if it was terminated.
; pset_pid
: A process identifier (as returned by [[#Implementation functions|<tt>__zoid_calling_process_id</tt>]]).
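
As an illustration of how the three functions fit together, the sketch below keeps track of how many processes are currently attached to the I/O node. The <tt>demo_</tt> names are hypothetical and would be listed in the start line as <tt>INIT=demo_init FINI=demo_fini PROC=demo_proc</tt>. The mutex is a defensive assumption: the daemon is multi-threaded, and it is not stated here whether these callbacks can run concurrently.

```c
#include <pthread.h>

/* Hypothetical plug-in state shared by the start-line functions. */
static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static int demo_expected_procs; /* job processes announced by INIT */
static int demo_live_procs;     /* processes currently reported by PROC */

void demo_init(int pset_mpi_proc_count, int argc, int envc,
               const char *argenv)
{
    (void)argc;
    (void)envc;
    (void)argenv;

    pthread_mutex_lock(&demo_lock);
    demo_expected_procs = pset_mpi_proc_count;
    pthread_mutex_unlock(&demo_lock);
}

void demo_proc(int added, int pset_pid)
{
    (void)pset_pid;

    pthread_mutex_lock(&demo_lock);
    if (added)
        demo_live_procs++;
    else
        demo_live_procs--;
    pthread_mutex_unlock(&demo_lock);
}

void demo_fini(void)
{
    pthread_mutex_lock(&demo_lock);
    demo_expected_procs = 0;
    demo_live_procs = 0;
    pthread_mutex_unlock(&demo_lock);
}
```

Note that <tt>demo_live_procs</tt> may exceed <tt>demo_expected_procs</tt>, because PROC is also invoked for ZOID-enabled processes such as the FUSE clients, which are not included in <tt>pset_mpi_proc_count</tt>.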

====Implementation functions====

The hand-written server-side implementation functions can themselves call back a few ZOID functions, available by including the <tt>zoid_api.h</tt> header file:

<pre>
int __zoid_calling_process_id(void);
</pre>

This function returns a unique identifier of the compute node process that invoked the function. The identifier is ''not'' an MPI rank, because some processes, such as the FUSE clients, are not part of the application and hence do not have a rank. The identifiers are only unique within one I/O node, and they can be reused if a process starts after another one has terminated.

<pre>
void __zoid_register_userbuf(void* userbuf,
                             void (*callback)(void* userbuf, void* priv),
                             void* priv);
</pre>

This function will be discussed in [[#Zerocopy with a custom output buffer|Zerocopy with a custom output buffer]].

<pre>
int __zoid_send_output(int pid, int fd, const char* buffer, int len);
</pre>

This function writes an arbitrary string to the job's standard output or error. Arguments:

; pid
: Process identifier as returned by <tt>__zoid_calling_process_id</tt>. The process in question ''must'' have an MPI rank, meaning that it must be either an application process or a process launched from an application process.
; fd
: <tt>1</tt> for standard output, <tt>2</tt> for standard error.
; buffer, len
: The string and its length. <tt>'\0'</tt> should not be included in <tt>len</tt> and <tt>buffer</tt> does not need to be <tt>'\0'</tt>-terminated.

The function returns 0 if successful, and -1 if not (such as when the process identified by <tt>pid</tt> does not have an MPI rank).

===Client-side API===

A compute node application needs to be linked with the client-side stubs and with a common support library <tt>libzoid_cn.a</tt> (a prebuilt version of the latter is in <tt>packages/zoid/prebuilt</tt>; sources are in <tt>packages/zoid/src/cnl/client</tt>). Several functions are available to applications by including the <tt>zoid_api.h</tt> header file:

====Initialization====

<pre>
int __zoid_init(void);
</pre>

This function ''must'' be invoked before any ZOID or ZOID-forwarded functions can be invoked. It returns <tt>0</tt> if successful, <tt>1</tt> otherwise. There is no corresponding termination function.

<pre>
int __zoid_job_size(void);
int __zoid_my_rank(void);
</pre>

These functions return, respectively, the number of processes in the job (the size of <tt>MPI_COMM_WORLD</tt>), and the MPI rank of the current process. Either will return <tt>-1</tt> if the current process does not have an MPI rank, i.e., if it is not an application process and was not launched from an application process (say, if it was launched from an interactive shell).

====Error conditions====

<pre>
int __zoid_error(void);
</pre>

This function should be invoked on the client side after ''every'' forwarded function call returns, to determine if any errors occurred within the forwarding layer. A return value of <tt>0</tt> indicates success; otherwise, one of the following error values will be returned:

; ENOSYS
: Invalid command sent from the client. Typically indicates that the corresponding I/O-node-side [[#opt_modules|modules]] have not been loaded.
; ENOMEM
: Out of memory condition.
; E2BIG
: Message exceeded the internal size limit.

<pre>
int __zoid_excessive_size(void);
</pre>

If <tt>__zoid_error</tt> returned <tt>E2BIG</tt>, this function reports by how many bytes the input or output exceeded the limit.

ZOID [[#opt_buffer_size|has a limit]] on the message size, around 4 MB by default. The limit is enforced on both input and output. The limit only applies to buffers "owned" by ZOID on the daemon side; it does not apply to custom [[#Zerocopy with a custom input buffer|input]] or [[#Zerocopy with a custom output buffer|output]] buffers.

If the limit is hit, the operation needs to be split into smaller ones. Information returned by <tt>__zoid_excessive_size</tt> makes it easy to adjust the buffer and resubmit.

'''Note:''' While the input-side (argument) overflow is flagged immediately on the client side, and is thus fairly cheap to hit, the output-side (result) overflow is flagged on the I/O node, after the request has been sent there (but before the implementation function is invoked). It is thus advised to cache at least the size limit for the output side for the next invocation, to avoid a future communication overhead. The size limit is function-specific, since it depends on sizes of other arguments and results.

Here is an example of how the client-side convenience wrapper for a call such as POSIX <tt>read</tt> could be implemented:

<pre>
ssize_t read(int fd, void *buf, size_t nbytes)
{
    static ssize_t max_read_nbytes = -1;
    ssize_t bytes_read;

    bytes_read = 0;
    do
    {
        ssize_t toread, justread;
        int error;

        toread = nbytes - bytes_read;

        if (max_read_nbytes != -1 && toread > max_read_nbytes)
            toread = max_read_nbytes;

        /* unix_read is the forwarded function call. The cast avoids
           relying on void* arithmetic, which is a GCC extension. */
        justread = unix_read(fd, (char *)buf + bytes_read, toread);

        if ((error = __zoid_error()))
        {
            if (error != E2BIG)
            {
                /* For a generic ZOID error, just bail out. */
                errno = error;
                return -1;
            }

            /* We tried to send a too large read request. Adjust. */
            max_read_nbytes = toread - __zoid_excessive_size();
        }
        else
        {
            if (justread < 0)
            {
                /* For a generic read() error, just bail out.
                   In case of an I/O error, unix_read returns -errno. */
                errno = -justread;
                return -1;
            }

            bytes_read += justread;

            if (justread != toread)
                /* unix_read as such succeeded, but it read fewer bytes than
                   expected. We terminate prematurely then. */
                break;
        }
    } while ((size_t)bytes_read < nbytes);

    return bytes_read;
}
</pre>

===Additional considerations===

====Forwarding <tt>errno</tt>====

If one needs to pass a variable such as <tt>errno</tt> from the I/O node to the client, the most straightforward way is to add an extra integer <tt>out</tt> pointer argument to all functions and pass it that way. Another option is to do it the same way the UNIX kernel does: pass it as a negative return value from the functions. The <tt>unix</tt> plug-in does it that way, so, e.g., the implementation of <tt>close</tt> on the I/O node looks something like this:

<pre>
if (close(server_fd) == -1)
    return -errno;
else
    return 0;
</pre>

Then, on the client side, we have a convenience wrapper:

<pre>
int close(int fd)
{
    return unix_decode_result(unix_close(fd));
}
</pre>

<tt>unix_decode_result</tt> is a preprocessor macro that handles both ZOID errors and errors returned by the plug-in. It uses a number of GCC extensions to make it as transparent as possible:

<pre>
#define unix_decode_result(result)              \
({                                              \
    typeof (result) _result = (result);         \
    int _n;                                     \
    if ((_n = __zoid_error()) != 0)             \
    {                                           \
        errno = _n;                             \
        _result = -1;                           \
    }                                           \
    else if (_result < 0)                       \
    {                                           \
        errno = -_result;                       \
        _result = -1;                           \
    }                                           \
    _result;                                    \
})
</pre>

====Returning variable amounts of data in arrays====

Just like with UNIX system calls, ZOID does not allocate memory for the results. Instead, callers must provide pre-allocated arrays, along with their sizes. UNIX would then typically return the size of the used part as a return value from a system call. Unfortunately, ZOID cannot make use of that – it will use the same array size argument to determine how much data to send back, so even if only a small part of the provided buffer is actually filled in, the whole buffer will be sent back, which is inefficient. This can be prevented by passing the array size as an <tt>inout</tt> pointer to a numerical type. A server-side implementation of a function such as <tt>read</tt> then looks like this:

<pre>
ssize_t unix_read(int fd /* in:obj */,
                  void *buf /* out:arr:size=+1 */,
                  size_t *count /* inout:ptr */)
{
    ssize_t ret;

    ...

    if ((ret = read(fd, buf, *count)) == -1)
    {
        *count = 0;
        return -errno;
    }
    else
    {
        *count = ret;
        return ret;
    }
}
</pre>

Obviously, the client side needs to be modified as well, to pass the size argument by address.

'''Note:''' this feature has certain implementation limitations. It can misbehave in the presence of multiple output arrays (or a single output <tt>arr2d</tt>, which internally behaves a lot like multiple separate <tt>arr</tt>s). Essentially, for efficiency reasons, the placement of arrays in the result buffer is determined before an implementation function is invoked. If this feature is used to change the size of one array, and that array is followed in the output buffer by another array, a "hole" will be created in the buffer, causing problems. However, in the most common case of a single output array the feature is completely reliable.

====Zerocopy performance====

Implementation-wise, ZOID is always zero-copy on the server side, meaning that data that implementation functions put in the <tt>out</tt> arrays is sent to the compute nodes without any extra memory copies.

The client side is only zero-copy for arrays that use the <tt>zerocopy</tt> flag in the header file. Because of the additional protocol overheads that <tt>zerocopy</tt> introduces, it should be used only for potentially large memory buffers, such as the buffers of file I/O <tt>read</tt> or <tt>write</tt> calls.

'''Note:''' for maximum performance, the arrays passed as <tt>zerocopy</tt> arguments on the compute nodes must be aligned in memory to a 16-byte boundary; otherwise, an interim buffer will be used. The memory allocation routines such as <tt>malloc</tt> have been modified to align memory to that boundary. If there is a danger that the user code might pass a large unaligned buffer, and the semantics will not be affected, it makes sense to write code that detects insufficient alignment and splits the operation in two: a small unaligned one (say, up to 240 bytes – the data payload of a single packet on the tree network), followed by a larger, properly aligned one.
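The splitting described in the note could look roughly as follows. This is a minimal sketch under our own assumptions: <tt>unaligned_head</tt> and <tt>split_write</tt> are hypothetical names, the head is limited to the bytes needed to reach the next 16-byte boundary, and plain <tt>write</tt> stands in for the forwarded call.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define ZC_ALIGN 16 /* alignment needed for the zero-copy fast path */

/* Number of leading bytes to transfer separately so that the remainder
 * of the buffer starts on a ZC_ALIGN boundary. */
static size_t unaligned_head(const void *buf, size_t count)
{
    size_t misalign = (uintptr_t)buf % ZC_ALIGN;
    size_t head = misalign ? ZC_ALIGN - misalign : 0;

    return head < count ? head : count;
}

/* Hypothetical wrapper: a small unaligned write, then the aligned rest. */
static ssize_t split_write(int fd, const void *buf, size_t count)
{
    size_t head = unaligned_head(buf, count);
    ssize_t done = 0, ret;

    if (head > 0)
    {
        if ((ret = write(fd, buf, head)) < 0)
            return ret;
        done = ret;
        if ((size_t)ret < head)
            return done; /* short write: let the caller retry */
    }
    if (count > head)
    {
        ret = write(fd, (const char *)buf + head, count - head);
        if (ret < 0)
            return done > 0 ? done : ret;
        done += ret;
    }
    return done;
}
```

Splitting a write like this is only safe when the two-step operation does not change the semantics, for example on a regular file written by a single process.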

====Zerocopy with a custom output buffer====

Normally, memory for output arrays to be filled in by server-side implementation functions is allocated by the ZOID daemon. This might be inconvenient when the data to be filled arrives asynchronously, possibly before the implementation function is even invoked; in such situations, an interim memory buffer must be used, forcing an extra memory copy.

This can be avoided for zero-copy output <tt>arr</tt> types if the <tt>userbuf</tt> flag has been used. No space will then be preallocated by the daemon for the array (the server-side stub will pass a <tt>NULL</tt> pointer); instead, the implementation function must provide the daemon with its own buffer. It does so by calling:

<pre>
void __zoid_register_userbuf(void* userbuf,
void (*callback)(void* userbuf, void* priv),
void* priv);
</pre>

Arguments:

; userbuf
: The address of the buffer.
; callback
: A callback function that is invoked by the daemon when the buffer has been sent to the client and is thus no longer needed. <tt>userbuf</tt> is passed as the first argument to the callback. It is safe for the callback to invoke <tt>__zoid_calling_process_id</tt>.
; priv
: Private data passed as the second argument to the <tt>callback</tt>. It is not interpreted by ZOID in any way.

The size of the provided buffer is determined like for any other array argument: the maximum value is provided by the client via the <tt>size</tt> argument. The server-side implementation part may choose to return less than the maximum amount, as explained [[#Returning variable amounts of data in arrays|earlier]].

As in other zero-copy cases, for maximum performance, the buffer provided must be aligned in memory to a 16-byte boundary; otherwise, an interim buffer will be used. The memory allocation routines such as <tt>malloc</tt> have been modified to align memory to that boundary, but we recommend explicitly calling <tt>posix_memalign</tt>.

'''Note:''' because the buffer provided is ''not'' allocated by ZOID, message size restrictions discussed [[#size_restrictions|earlier]] do not apply to it. Please do not abuse this capability. There is a good reason for the message size limit: it is there so that the maximum amount of memory required by the ZOID daemon stays limited. Too many overly large user-allocated buffers might result in an out-of-memory condition on the I/O node.

We provide a simple example below. It is a little artificial in the sense that the buffer is allocated within the implementation function; as we indicated, this feature is most useful with buffers allocated outside of the implementation functions:

<pre>
static void buffer_cb(void* userbuf, void* priv)
{
free(userbuf);
}

ssize_t unix_read(int fd /* in:obj */,
void *buf /* out:arr:size=+1:zerocopy:userbuf */,
size_t *count /* inout:ptr */)
{
ssize_t ret;

...

if (posix_memalign(&buf, 16, *count))
{
*count = 0;
return -ENOMEM;
}

__zoid_register_userbuf(buf, &buffer_cb, NULL);

if ((ret = read(fd, buf, *count)) == -1)
{
*count = 0;
return -errno;
}
else
{
*count = ret;
return ret;
}
}
</pre>

====Zerocopy with a custom input buffer====

The <tt>userbuf</tt> flag discussed above can also be used for ''input'' zero-copy <tt>arr</tt> arguments. This could be useful to avoid extra memory copies if the data in the array will be needed after the implementation function has returned.

If the flag is used, the daemon will not allocate the memory for the array; instead, in the middle of receiving the request from the client, it will call an allocation routine from the server-side implementation code. The name of the allocation routine is the name of the function that uses the input <tt>userbuf</tt> argument, with <tt>_allocate_cb</tt> suffix attached to it. Its prototype needs to be as follows:

<pre>
void* <name>_allocate_cb(int len);
</pre>

The single argument passed by the daemon is the length of the array in bytes. The routine must return a pointer to a buffer of that size or <tt>NULL</tt> if that is not possible (in which case, the function will fail and <tt>__zoid_error</tt> on the client side will return <tt>ENOMEM</tt>).

There is a restriction on the type of the array: its base type must have a size of one byte, so the array should be of type <tt>char*</tt>, <tt>unsigned char*</tt>, <tt>void*</tt> (a GCC extension), etc.

The allocation routine is invoked in the same context as ordinary implementation functions. It may block if it so desires; this will block the compute node client that invoked the routine, but all other clients can keep communicating with the server, thanks to its multi-threaded architecture.

Once the allocation routine has returned and a complete request has been received by the daemon, the implementation function is invoked as usual, with a correct address of the input <tt>userbuf</tt> array. It is the responsibility of the plug-in implementer to release the memory occupied by that array when it is no longer needed.

As with other user-level callbacks, the allocation routine may call <tt>__zoid_calling_process_id</tt> to learn which client process sent the request. Also, as in other zero-copy cases, for maximum performance, the buffer provided must be aligned in memory to a 16-byte boundary; otherwise, an interim buffer will be used. The memory allocation routines such as <tt>malloc</tt> have been modified to align memory to that boundary, but we recommend explicitly calling <tt>posix_memalign</tt>. Finally, as with output <tt>userbuf</tt>, message size restrictions discussed [[#size_restrictions|earlier]] do not apply to the user-allocated buffers. Please do not abuse this capability. There is a good reason for the message size limit: it is there so that the maximum amount of memory required by the ZOID daemon stays limited. Too many overly large user-allocated buffers might result in an out-of-memory condition on the I/O node.

Under rare circumstances, input <tt>userbuf</tt> could result in memory leaks. For this to take place, the job would have to be interrupted after the allocation routine has been run, but before the implementation function is called. This could only cause problems if I/O nodes are not rebooted between jobs. Those concerned about this scenario can eliminate the leak by adding necessary memory release code to the [[#Start-line functions|FINI]] function.
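One possible way to make such a cleanup feasible is to keep track of the buffers that have been handed out but not yet consumed. The sketch below is an assumption-laden illustration, not part of ZOID: all the names (<tt>tracked_alloc</tt>, <tt>tracked_free</tt>, <tt>tracked_cleanup</tt>) and the fixed-size table are hypothetical, and a real plug-in would need to protect the table with a mutex because the daemon is multi-threaded.

```c
#include <stdlib.h>

#define MAX_PENDING 64 /* arbitrary table size chosen for this sketch */

/* Buffers returned by the allocation routine and not yet released. */
static void *pending[MAX_PENDING];

/* To be called from the allocation routine: allocate an aligned buffer
 * and remember it. Returns NULL on failure. */
static void *tracked_alloc(int len)
{
    void *ptr;
    int i;

    for (i = 0; i < MAX_PENDING; i++)
        if (pending[i] == NULL)
            break;
    if (i == MAX_PENDING || posix_memalign(&ptr, 16, len))
        return NULL;
    pending[i] = ptr;
    return ptr;
}

/* To be called from the implementation function instead of free(). */
static void tracked_free(void *ptr)
{
    int i;

    for (i = 0; i < MAX_PENDING; i++)
        if (pending[i] == ptr)
        {
            pending[i] = NULL;
            break;
        }
    free(ptr);
}

/* To be called from the FINI function: release whatever is left over.
 * Returns the number of buffers that had leaked. */
static int tracked_cleanup(void)
{
    int i, n = 0;

    for (i = 0; i < MAX_PENDING; i++)
        if (pending[i] != NULL)
        {
            free(pending[i]);
            pending[i] = NULL;
            n++;
        }
    return n;
}
```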

A simple example:

<pre>
void* unix_write_allocate_cb(int len)
{
void* ptr;

if (posix_memalign(&ptr, 16, len))
return NULL;

return ptr;
}

ssize_t unix_write(int fd /* in:obj */,
const void *buf /* in:arr:size=+1:zerocopy:userbuf */,
size_t count /* in:obj */)
{
ssize_t ret;

...

if ((ret = write(fd, buf, count)) == -1)
ret = -errno;

free((void*)buf);

return ret;
}
</pre>

----
[[Ramdisk]] | [[ZeptoOS_Documentation|Top]] | [[(K)TAU]]

Licensing

2009-05-08T22:33:20Z

Iskra:

[[ZeptoOS_Documentation|Top]]
----

In general, new, original software developed under the ZeptoOS project is distributed under the GNU GPL / LGPL licenses, as appropriate. Those licenses can be found in the top-level directory of the ZeptoOS tarball. If you are not familiar with the GPL or LGPL licensing, please see the files above and also visit http://www.gnu.org/.

However, ZeptoOS is a collection of many existing software components, packaged and distributed under a number of licenses, such as:

; GPL / LGPL
: Linux kernel, GNU libc, FUSE, PVFS, BusyBox, iptables, NBD, psmisc
; CPL (Common Public License)
: DCMF, SPI
; BSD
: MPICH, TAU (note that TAU is distributed separately from the ZeptoOS tarball), netkit-rsh

The important thing is that all these licenses are free/open source, [http://www.opensource.org/licenses OSI-approved licenses].

----
[[ZeptoOS_Documentation|Top]]

Limitations

2009-05-08T22:31:22Z

Iskra:

[[ZeptoOS_Documentation|Top]]
----

==Known Bugs / Current Limitations==

===No VN/DUAL mode in MPI===

Blue Gene/P supports three job modes:

* SMP (one application process per node)
* DUAL (two application processes per node)
* VN (four application processes per node)

In Cobalt, the job mode can be specified using <tt>cqsub -m</tt> or <tt>qsub --mode</tt>.

ZeptoOS will launch the appropriate number of application processes per node as determined by the mode; however, MPI jobs currently only work in the SMP mode. We plan to fix this problem in the near future.

===No Universal Performance Counter (UPC)===

UPC is not available in this release. Thus, PAPI will not work since it depends on UPC.
We are currently trying to enable the UPC support in our Linux environment.

===MPI-IO support===

Due to the limitations of FUSE (the compute-node infrastructure we use for I/O forwarding of POSIX calls), if using the standard glibc, pathnames passed to MPI-IO routines need to be prefixed with <tt>bglockless:</tt> or <tt>bgl:</tt> (the latter will not work with PVFS; the former should work with all filesystems).

This should not be necessary when using the version of glibc [[Other Packages#ZOID glibc|modified for ZOID]]. That version should also give a better performance, so please give it a try if the performance with the standard glibc is unsatisfactory.

Also, within the DOE FastOS [http://www.iofsl.org/ I/O forwarding project] we are working on a new, high performance I/O forwarding infrastructure for parallel applications and as this work matures, we will integrate it into ZeptoOS.

===Some MPI jobs hang when they are killed===

We have been seeing this a lot with <tt>cn-ipfwd</tt>, the [[Other Packages#IP over torus|IP-over-torus]] program. This program runs "forever", so it eventually needs to be killed. When that happens, it will frequently hang one or more compute nodes, preventing the partition from shutting down cleanly.

However, the service node will force a shutdown after a timeout of five minutes, so in practice this is not a significant problem. Also, we have not seen this problem with ordinary MPI applications (unlike most MPI applications, <tt>cn-ipfwd</tt> is multithreaded and communicates a lot with the kernel).

==Features Coming Soon==

===Multiple MPI jobs one after another===

Since ZeptoOS supports submitting a shell script as a compute node "application", it is possible to run multiple "real" applications from within one job:

<pre>
#!/bin/sh

for i in 1 2 3 4 5 6 7 8 9 10; do
/path/to/real/application
done
</pre>

This does work for sequential applications, but not for those that are linked with MPI; with MPI, an application can only be run once. However, we have experimental code that lifts this limitation and we plan to include it in the next release.

----
[[ZeptoOS_Documentation|Top]]

FAQ

2009-05-08T22:26:26Z

Iskra:

[[ZeptoOS_Documentation|Top]]
----

==How to obtain a CN node number==

This depends on what number one is interested in.

===Pset rank===

A pset rank is a number identifying a compute node within each ''pset'' (an I/O node and the compute nodes that communicate with it). Note that on partitions larger than one pset, the pset ranks will not be unique. Also, pset ranks do ''not'' start from <tt>0</tt>; they start from <tt>1</tt> for some mysterious reason (do not blame us – blame IBM :-).

Pset rank is used as the last octet in the IP address on the tree network connecting the compute nodes and the I/O nodes (''x'' in <tt>192.168.1.</tt>''x'').

The pset rank is available on the compute nodes from <tt>/proc/personality.sh</tt>, in the <tt>BG_RANK_IN_PSET</tt> variable:

<pre>
#!/bin/sh

. /proc/personality.sh

echo "My pset rank is $BG_RANK_IN_PSET"
</pre>

From a C program it will be easier to use the binary personality available from <tt>/proc/personality</tt>. The definition of the structure can be found in <tt>/bgsys/drivers/ppcfloor/arch/include/common/bgp_personality.h</tt>. The pset rank is in <tt>Network_Config.RankInPSet</tt>:

<pre>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <common/bgp_personality.h>

int main(void)
{
_BGP_Personality_t personality;
int fd;

if ((fd = open("/proc/personality", O_RDONLY)) == -1)
{
perror("open");
return 1;
}
if (read(fd, &personality, sizeof(personality)) != sizeof(personality))
{
perror("read");
close(fd);
return 1;
}
close(fd);

printf("My pset rank is %d\n", personality.Network_Config.RankInPSet);

return 0;
}
</pre>

(compile the above with <tt>-I/bgsys/drivers/ppcfloor/arch/include</tt>)

===Torus rank===

A torus rank is a number identifying a compute node within a whole partition. In a way, it is much "nicer" than a pset rank since it is unique within a job and it also starts from <tt>0</tt>.

The torus rank is easy to obtain from a C program: it is the <tt>Network_Config.Rank</tt> field of the personality structure.

Unfortunately, the torus rank is not available in <tt>/proc/personality.sh</tt>, but a shell script can easily calculate it from other fields:

<pre>
TORUS_RANK=`echo $BG_PSETORG | awk "{print \\$1 + \\$2 * $BG_XSIZE + \
\\$3 * $BG_XSIZE * $BG_YSIZE}"`
</pre>
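For reference, the same arithmetic can be written in C. This is just the formula from the shell snippet above with X varying fastest; the function name is our own, and a C program would normally read <tt>Network_Config.Rank</tt> directly instead.

```c
/* Rank computed from torus coordinates, with X varying fastest:
 * rank = x + y * xsize + z * xsize * ysize. */
static int torus_rank(int x, int y, int z, int xsize, int ysize)
{
    return x + y * xsize + z * xsize * ysize;
}
```

For example, on an 8 x 8 x ''n'' torus, the coordinates (1, 2, 3) give 1 + 2 * 8 + 3 * 64 = 209.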

===MPI rank===

An MPI rank should not be confused with the torus rank, even though by default the two are the same. MPI rank is a property of a process, ''not'' node. If one submits a job in the <tt>VN</tt> or <tt>DUAL</tt> mode, there will be multiple MPI tasks per node, obviously each with a different MPI rank. Also, using the <tt>BG_MAPPING</tt> environment variable changes the mapping between the torus coordinates and MPI ranks.

While obtaining the MPI rank from an MPI application is trivial, how can one obtain it from a shell script?

One way would be to invoke a simple C program:

<pre>
#include <stdio.h>
#include "zoid_api.h"

int main(void)
{
if (__zoid_init())
return 1;
printf("%d\n", __zoid_my_rank());
return 0;
}
</pre>

(compile with <tt>-I</tt>''path_to_ZeptoOS_source''<tt>/packages/zoid/prebuilt -L</tt>''path_to_ZeptoOS_source''<tt>/packages/zoid/prebuilt -lzoid_cn</tt>)

A slight disadvantage of this approach is that <tt>__zoid_init</tt> registers the process with the ZOID daemon on the I/O node, which is an overhead we do not need. Another solution, without using any binaries, is as follows:

<pre>
MPI_RANK=`echo $CONTROL_INIT | awk -F, '{print $4}'`
</pre>

This has the disadvantage of using internal ZOID variables, which are not guaranteed to be supported in future releases.
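The same extraction can be done from a C program without calling <tt>__zoid_init</tt>. The sketch below relies on the same unsupported internal variable, and assumes (as the <tt>awk</tt> command does) that the rank is the fourth comma-separated field of <tt>CONTROL_INIT</tt>; the function names are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Return the n-th (1-based) comma-separated field of s as an integer,
 * or -1 if the field is missing or does not start with a number. */
static int csv_field_int(const char *s, int n)
{
    char *end;
    long val;

    while (s && --n > 0)
    {
        s = strchr(s, ',');
        if (s)
            s++; /* step past the comma */
    }
    if (!s)
        return -1;
    val = strtol(s, &end, 10);
    if (end == s)
        return -1;
    return (int)val;
}

/* Hypothetical helper: MPI rank taken from the internal CONTROL_INIT
 * environment variable, or -1 if it cannot be determined. */
static int mpi_rank_from_env(void)
{
    return csv_field_int(getenv("CONTROL_INIT"), 4);
}
```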

==How to open a socket from a CN to the outside world==

ZOID provides IP packet forwarding between the compute nodes and the I/O nodes. However, because the compute nodes use non-routable IP addresses (<tt>192.168.1.</tt>''x''), they cannot communicate directly with the outside world.

The most transparent solution to this problem is to perform network address translation (NAT) on the I/O nodes using the Linux kernel netfilter infrastructure. We used to enable this by default, but experiments have shown it to have a detrimental effect on the overall performance of the TCP/IP stack on the I/O nodes, slowing down access to the network filesystems.

To enable the translation, pass the <tt>ZOID_NAT_ENABLE</tt> environment variable when submitting a job. An administrator can also enable this option permanently in the [[ZOID#opt_enable_nat|config file]].

==How to obtain a Cobalt job ID==

Cobalt passes the job id to the application processes launched on the compute nodes using the <tt>COBALT_JOBID</tt> environment variable.

This variable is also accessible from the [[ZOID#User script|user script]] running on the I/O nodes, using the <tt>ZOID_JOB_ENV</tt> variable:

<pre>
COBALT_JOBID=`echo $ZOID_JOB_ENV | sed 's/^.*COBALT_JOBID=\([^:]*\).*$/\1/'`
</pre>

==Why large MPI processes do not work==

A common reason might be that they do not have enough memory to run. MPI processes run within the big memory region, which by default is limited to just 256 MB so as not to deplete the ordinary Linux paged memory pool too much (main memory is allocated to the big memory region at boot time and it cannot be reclaimed by the kernel, even if it were unused).

See the [[Kernel#Kernel (command line) parameters|Kernel]] section to learn how to increase the limit; the parameter to use is <tt>flatmemsizeMB</tt>. We suggest creating multiple profiles with different big memory sizes to accommodate different uses of ZeptoOS.

==Why SSH keeps asking for a password==

As we envisioned it, partition owners should be able to log on the I/O nodes belonging to their jobs without being asked for a password. The following considerations apply:
# The account information for the partition owner must be added to the <tt>/etc/passwd</tt> file on the I/O nodes when launching a job; this is discussed [[ZOID#The /bin.rd/update_passwd_file.sh file|here]].
# For password-less logins, <tt>shosts.equiv</tt> must be configured before (re)building the I/O node ramdisk, as discussed [[Testing#Interactive login|here]]. Alternatively, users could set up SSH key pairs in their home directories (password-less, or using <tt>ssh-agent</tt> to cache the password).
# SSH might temporarily prevent a partition owner from logging in if an attempt is made before the job starts running, as discussed [[Testing#Interactive login|here]]. Root can always log in, by providing the password set when building the I/O node ramdisk for the first time.
# Finally, keep in mind that a particular site might have disabled this feature on purpose.

----
[[ZeptoOS_Documentation|Top]]

Other Packages

2009-05-08T22:17:15Z

Iskra:

[[(K)TAU]] | [[ZeptoOS_Documentation|Top]]
----

==PVFS==

[http://www.pvfs.org/ PVFS] stands for Parallel Virtual File System, an open source parallel file system designed to scale to petabytes of storage and to provide access rates of hundreds of GB/s. On the Argonne BG/P systems, PVFS servers are running and a PVFS start-up script is installed in the BG/P site-specific directory (<tt>/bgp/iofs/</tt>), so that a PVFS volume is mounted at ION boot time.

We included PVFS version 2.8.1 source code and its prebuilt client binaries in the ZeptoOS release for the sites that are interested in PVFS. We also included a very simple example PVFS start-up script that can be added to the ION ramdisk. If you have PVFS servers running in your system, you can follow the steps below to add the necessary PVFS client components to the ramdisk:

<pre>
$ cd packages/pvfs2/prebuilt
$ sh add-pvfs2-client-ION-ramdisk.sh tcp://192.168.1.1:3334/pvfs2-fs
</pre>

Please replace <tt>tcp://192.168.1.1:3334/pvfs2-fs</tt> with the actual server info.

Details on building and running PVFS servers are outside of the scope of this document, but the following example might give a basic idea of how to build and run PVFS:

<pre>
[Build]
$ cd pvfs-2.8.1
$ ./configure [options....]
$ make

[Create a server config file]
$ ./src/apps/admin/pvfs2-genconfig fs.conf

[Start the server]
$ ./src/server/pvfs2-server -f fs.conf -a ALIAS
$ ./src/server/pvfs2-server fs.conf -a ALIAS
</pre>

'''Note:'''
* replace <tt>ALIAS</tt> with your real alias in <tt>fs.conf</tt>
* the first <tt>pvfs2-server</tt> invocation just initializes a PVFS volume
* the second invocation actually starts the server

==IP over torus==

This is currently a preview feature. It implements IP packet forwarding on top of MPI, over the torus network. Torus is a point-to-point network that interconnects all the compute nodes in a partition. Every compute node gets a unique IP address, of the form:

<pre>
10.128.0.0 | <rank>
</pre>

where <tt><rank></tt> is the [[FAQ#MPI rank|MPI rank]]. Thus, for a 64-node partition, the IP addresses will range between <tt>10.128.0.0</tt> and <tt>10.128.0.63</tt>, and for a 1024-node partition, they will range between <tt>10.128.0.0</tt> and <tt>10.128.3.255</tt>.
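The address computation can be sketched in C as follows. The function names are ours; only the <tt>10.128.0.0 | rank</tt> rule comes from the description above.

```c
#include <stdint.h>
#include <stdio.h>

#define TORUS_IP_BASE 0x0A800000u /* 10.128.0.0 */

/* IP address (host byte order) of the node with the given MPI rank. */
static uint32_t torus_ip(uint32_t rank)
{
    return TORUS_IP_BASE | rank;
}

/* Format the address in dotted-quad notation; out should hold at least
 * 16 bytes. */
static void torus_ip_str(uint32_t rank, char *out, size_t outlen)
{
    uint32_t a = torus_ip(rank);

    snprintf(out, outlen, "%u.%u.%u.%u",
             (unsigned)(a >> 24), (unsigned)((a >> 16) & 0xff),
             (unsigned)((a >> 8) & 0xff), (unsigned)(a & 0xff));
}
```

Rank 63 yields <tt>10.128.0.63</tt> and rank 1023 yields <tt>10.128.3.255</tt>, matching the ranges given above.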

To try this feature out, submit as a compute job the <tt>cn-ipfwd.sh</tt> script, which should have been installed in <tt>/path/to/install/cnbin/</tt>. The script can act as a standalone job or as a wrapper. If invoked without any arguments, it initializes the IP forwarding and then goes to sleep; if any arguments have been passed, they are interpreted as the name of the binary (along with its command line arguments) to invoke once the IP forwarding is initialized, e.g. (an example with Cobalt):

<pre>
$ cqsub -k <profile-name> -t <time> -n 64 /path/to/install/cnbin/cn-ipfwd.sh \
<name of another binary> <arguments to that binary>
</pre>

The script can be copied to another location and adjusted to one's needs.

Once the job is running, log into a compute node, and run <tt>ifconfig</tt>; there should be a new virtual network device <tt>tun1</tt> (in addition to the usual <tt>tun0</tt>, used for IP forwarding between compute nodes and I/O nodes):

<pre>
~ # ifconfig tun1
tun1 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
inet addr:10.128.0.0 P-t-P:10.128.0.0 Mask:255.255.255.255
UP POINTOPOINT RUNNING NOARP MULTICAST MTU:65535 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:500
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
~ # ping 10.128.0.1
PING 10.128.0.1 (10.128.0.1): 56 data bytes
64 bytes from 10.128.0.1: seq=0 ttl=64 time=0.321 ms
64 bytes from 10.128.0.1: seq=1 ttl=64 time=0.191 ms
64 bytes from 10.128.0.1: seq=2 ttl=64 time=0.203 ms
64 bytes from 10.128.0.1: seq=3 ttl=64 time=0.194 ms
64 bytes from 10.128.0.1: seq=4 ttl=64 time=0.207 ms
--- 10.128.0.1 ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 0.191/0.223/0.321 ms
~ # rsh 10.128.0.1 'grep BG_RANK_IN_PSET /proc/personality.sh'
BG_RANK_IN_PSET=59
~ #
</pre>

This feature can be used to implement an arbitrary IP-based network protocol between the compute nodes. We have even experimented with running a TCP/IP-based MPICH on top of it (which, while obviously not as fast as the native Blue Gene one, has the advantage of being able to, e.g., run multiple MPI jobs at a time on a single partition).

One major disadvantage of this feature is that the current implementation is computationally intensive; it permanently occupies one core on each node.

==ZOID glibc==

This is another preview feature. It provides a modified version of GNU libc for the compute nodes, which features much better file I/O throughput rates to the I/O nodes and remote file systems than the default one. It does so by communicating with the ZOID daemon directly, instead of going through the compute node Linux kernel and the FUSE client (which, while convenient, is slow).

The modified glibc is meant for compiled application processes, not for shell scripts and such. It is currently only available in a static (<tt>.a</tt>) version. It is installed with the rest of ZeptoOS, in <tt>/path/to/install/lib/zoid/</tt>. To link with it, simply add <tt>-L/path/to/install/lib/zoid</tt> to the final linking stage. Use the following command to verify that the modified version of glibc has been used for linking:

<pre>
$ nm <binary> | grep __zoid_init
</pre>

(no output will be generated if the standard glibc was used)

When submitting a job linked with this glibc, please set the environment variable <tt>ZOID_DIRS</tt> to a list of <tt>:</tt>-separated pathname prefixes. Only files opened using pathnames beginning with those prefixes will be directly forwarded to the I/O node; other files will be handled via the compute node kernel and possibly FUSE, which is much slower.
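To make the prefix rule concrete, here is a small sketch of how such a check could be written. It is not the actual glibc code; <tt>path_is_forwarded</tt> is a hypothetical name, and we assume a plain string-prefix comparison (so the prefix <tt>/home/user</tt> would also match <tt>/home/user2</tt>).

```c
#include <string.h>

/* Return 1 if path begins with one of the ':'-separated prefixes in
 * dirs (the value of ZOID_DIRS), 0 otherwise. */
static int path_is_forwarded(const char *dirs, const char *path)
{
    while (dirs && *dirs)
    {
        const char *end = strchr(dirs, ':');
        size_t len = end ? (size_t)(end - dirs) : strlen(dirs);

        /* Skip empty entries; compare only the prefix length. */
        if (len > 0 && strncmp(path, dirs, len) == 0)
            return 1;
        dirs = end ? end + 1 : NULL;
    }
    return 0;
}
```

With <tt>ZOID_DIRS=/home/user:/scratch</tt>, a file opened as <tt>/scratch/out.dat</tt> would match, while <tt>/tmp/out.dat</tt> would not.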

Here is a simple benchmark:

<pre>
#include <stdio.h>
#include <stdlib.h>

#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

#define BUFSIZE (1024 * 1024 * 1024)

int main(int argc, char* argv[])
{
char* buffer;
int fd;
struct timeval start, stop;
double time;

if (argc != 2)
{
fprintf(stderr, "Usage: %s <pathname>\n", argv[0]);
return 1;
}

if (!(buffer = malloc(BUFSIZE)))
{
perror("malloc");
return 1;
}
if ((fd = open(argv[1], O_CREAT | O_WRONLY, 0666)) == -1)
{
perror("open");
return 1;
}
gettimeofday(&start, NULL);
if (write(fd, buffer, BUFSIZE) != BUFSIZE)
{
perror("write");
return 1;
}
gettimeofday(&stop, NULL);
close(fd);
free(buffer);

time = stop.tv_sec - start.tv_sec + (stop.tv_usec - start.tv_usec) * 1e-6;
printf("Writing %d B took %g s, %g B/s\n", BUFSIZE, time, BUFSIZE / time);

return 0;
}
</pre>

It writes 1 GB of data to a file passed on the command line. With Cobalt, we submit it as follows:

<pre>
$ cqsub -k <profile-name> -t 10 -n 1 -e ZOID_DIRS=$HOME $PWD/speed_zoid $HOME/speed_zoid-out
</pre>

With our home directories on a GPFS filesystem, we get the following performance:

<pre>
Writing 1073741824 B took 4.58026 s, 2.34428e+08 B/s
</pre>

On the other hand, if we link it with the standard glibc, or if we forget to set <tt>ZOID_DIRS</tt>, the performance we observe is as follows:

<pre>
Writing 1073741824 B took 10.4905 s, 1.02354e+08 B/s
</pre>

The modified glibc is not used by default because it is not yet complete. However, if one does not try to outsmart it (in particular, we recommend always passing absolute pathnames), it should work reliably.

----
[[(K)TAU]] | [[ZeptoOS_Documentation|Top]]

Ramdisk

2009-05-08T21:50:14Z

Iskra:

[[Kernel]] | [[ZeptoOS_Documentation|Top]] | [[ZOID]]
----

==Introduction==

Both the CN and the ION Linux kernels require a ramdisk to boot. Ramdisk images contain minimal Linux utilities, init scripts, configuration files, kernel modules, etc, which are required by the OS boot process.

The ION ramdisk is an ELF file that contains a cpio archive of system files. Two ION ramdisk images are currently generated:

; BGP-ION-ramdisk-for-CNL.elf
: Default ION ramdisk for ZeptoOS compute node Linux.
; BGP-ION-ramdisk-for-CNK.elf
: Use this one if you need to run IBM CNK on the compute nodes (uses IBM CIOD instead of ZOID)

Our ION ramdisks are similar to the default ION ramdisk from IBM, but we add some extra files to support ZeptoOS features. The extra files are located in <tt>ramdisk/ION/ramdisk-add/</tt>. The <tt>build-ramdisk</tt> script from the IBM BGP driver is used to create the ION ramdisks.

The CN ramdisk is also a gzip'ed cpio archive of system files, but it is embedded into the CN kernel image (<tt>BGP-CN-zImage-with-initrd.elf</tt>). The CN ramdisk is created by a custom ramdisk build script (<tt>ramdisk/CN/create-bgp-cn-linux-ramdisk.pl</tt>). Both <tt>build-ramdisk</tt> and <tt>create-bgp-cn-linux-ramdisk.pl</tt> are wrappers around the Linux kernel's <tt>gen_init_cpio</tt> command.

==Creating ramdisk images==

The ramdisk images are always (re-)created from prebuilt objects if one types <tt>make</tt> at the top level directory (without any make target).

If one wants to create an ION ramdisk individually (without rebuilding other images), type:

<pre>
$ make bgp-ion-ramdisk-cnl
</pre>

If one wants to create a CN ramdisk (technically, create a CN kernel image with new ramdisk contents), type:

<pre>
$ make bgp-cn-linux
</pre>

'''Note:''' the newly built CN ramdisk can be found in <tt>ramdisk/CN/bgp-cn-ramdisk.cpio.gz</tt>, but it is not usable until it is embedded into the kernel image.

For other ramdisk-related make targets, please refer to [[Configuration#Building|Configuration]].

==Modifying ramdisk contents==

You can customize the ramdisk contents for your own purposes, e.g., debugging, running custom system software on BGP, etc.

===CN ramdisk===

The CN ramdisk can be customized by editing the CN ramdisk build script, which is <tt>ramdisk/CN/create-bgp-cn-linux-ramdisk.pl</tt>. The build script allows you to set the permission bits, create device files, etc.

Most of the contents of the CN ramdisk are kept in <tt>ramdisk/CN/tree/</tt>, but this is not a hard rule. Source files can reside anywhere as long as they are accessible from the script. It may be possible to use binaries and libraries from the login nodes, as long as they are 32-bit PPC files (use the <tt>file</tt> command to verify) and all their dependencies are also copied.

Here is a practical example. Suppose that you need the <tt>od</tt> command in the CN ramdisk. You could build the command from source code, but if you want to do something quick, you can try using the login node's version:

<pre>
$ file /usr/bin/od
/usr/bin/od: ELF 32-bit MSB executable, PowerPC or cisco 4500, version 1 (SYSV),
for GNU/Linux 2.6.4, dynamically linked (uses shared libs), for GNU/Linux 2.6.4, stripped
$ ldd /usr/bin/od
linux-vdso32.so.1 => (0x00100000)
libc.so.6 => /lib/ppc970/libc.so.6 (0x0fe8b000)
/lib/ld.so.1 (0xf7fe1000)
</pre>

It is a 32-bit PPC executable and the current CN ramdisk has all the necessary shared libraries, so it can be used. Now add the command to the Perl array named <tt>@cmdlists</tt> in the <tt>ramdisk/CN/create-bgp-cn-linux-ramdisk.pl</tt> script and invoke <tt>make</tt> to recreate the CN ramdisk:

<pre>
$ vi ramdisk/CN/create-bgp-cn-linux-ramdisk.pl
# add the following line to @cmdlists
"file /bin/od /usr/bin/od 0755 0 0",
$ make bgp-cn-linux
</pre>

Now the CN ramdisk has <tt>/bin/od</tt> with file permissions <tt>0755</tt>, uid <tt>0</tt>, and gid <tt>0</tt>.

The added line is a command for the <tt>gen_init_cpio</tt> tool. One can also create directories, device files, symbolic links, named pipes, socket files, etc:

<pre>
file <name> <location> <mode> <uid> <gid>
dir <name> <mode> <uid> <gid>
nod <name> <mode> <uid> <gid> <dev_type> <maj> <min>
slink <name> <target> <mode> <uid> <gid>
pipe <name> <mode> <uid> <gid>
sock <name> <mode> <uid> <gid>

<name> name of the file/dir/nod/etc in the archive
<location> location of the file in the current filesystem
<target> link target
<mode> mode/permissions of the file
<uid> user id (0=root)
<gid> group id (0=root)
<dev_type> device type (b=block, c=character)
<maj> major number of nod
<min> minor number of nod
</pre>

The order of the commands in @cmdlists ''matters''. They are executed from top to bottom, so one cannot add a file to a directory that has not yet been created.
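
The ordering rule can be checked mechanically before rebuilding. The sketch below is a hypothetical helper, not part of ZeptoOS: it reads plain <tt>gen_init_cpio</tt>-style lines on standard input (without the Perl quoting used in <tt>@cmdlists</tt>), assumes that the root directory <tt>/</tt> always exists, and reports every entry whose parent directory has not been declared by an earlier <tt>dir</tt> line:

<pre>
# Hypothetical checker: prints a message and returns 1 if any entry is
# listed before the "dir" line that creates its parent directory.
check_cpio_order() {
    seen=" / "                      # directories declared so far
    status=0
    while read -r kind name rest; do
        case $kind in
            file|dir|nod|slink|pipe|sock) ;;
            *) continue ;;          # ignore blank lines and comments
        esac
        parent=${name%/*}
        [ -n "$parent" ] || parent=/
        case $seen in
            *" $parent "*) ;;
            *) echo "parent not yet created: $name"; status=1 ;;
        esac
        if [ "$kind" = dir ]; then
            seen="$seen$name "
        fi
    done
    return $status
}

# This list is in a valid order, so nothing is printed:
check_cpio_order <<'EOF'
dir /opt 0755 0 0
file /opt/od /usr/bin/od 0755 0 0
EOF
</pre>

Swapping the two lines in the example would make the function report <tt>/opt/od</tt>.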

====CN Linux startup script====

The first thing that the Linux kernel does after it boots is to execute the <tt>init</tt> program. The <tt>init</tt> program is usually in <tt>/sbin/</tt>, and in the CN ramdisk case it is part of BusyBox. <tt>init</tt> reads in a config file from <tt>/etc/inittab</tt>, which in our case instructs it to execute the <tt>/etc/init.d/rc.sysinit</tt> startup script.

Our startup script is very minimalistic; its two most important actions are to start the telnet daemon to allow users to log in from the I/O nodes and then to start the ZOID <tt>control</tt> process which takes care of IP forwarding and job control.

In case you need to start some processes at the CN boot time, you can add their invocations to <tt>ramdisk/CN/tree/etc/init.d/rc.sysinit</tt>, ''before'' <tt>/sbin/control</tt> is invoked.

===ION ramdisk===

Unlike the CN ramdisk, the range of customization is limited on the ION ramdisk. There is no control over file permission bits, one cannot create device nodes, etc. Currently we build the ION ramdisk using IBM's <tt>build-ramdisk</tt> script by specifying an add-on tree which contains our extra files.

Essentially, customization is limited to:
* adding new files,
* overwriting default ramdisk files by adding custom files with the same names.

Once files have been added under <tt>ramdisk/ION/ramdisk-add/</tt>, they will be automatically added to the ramdisk on the next rebuild. Here is an example of how to add a file to the ION ramdisk:

<pre>
$ vi ramdisk/ION/ramdisk-add/etc/yourfile
$ make bgp-ion-ramdisk-cnl
</pre>

If you need to do more than add files, you might need to edit the <tt>build-ramdisk</tt> script itself. The script is located in <tt>/bgsys/drivers/ppcfloor/</tt>. Copy the script to a working directory, edit it, and change the script path in <tt>ramdisk/ION/Makefile</tt>.

====ION startup script====

There is no <tt>rc.sysinit</tt> in <tt>ramdisk/ION/ramdisk-add/</tt>, because <tt>rc.sysinit</tt> is provided in the IBM ramdisk tree (i.e., <tt>/bgsys/drivers/ppcfloor/ramdisk/etc/init.d/rc.sysinit</tt> is the default one). If needed, copy the default one to the ZeptoOS <tt>ramdisk/etc/init.d/rc.sysinit</tt> and modify it to change the startup behaviour, but this is in general not recommended.

In most cases, what one is looking for is to start a process at the ION boot time. For such purpose, one can add a custom ION RC script to <tt>ramdisk-add/etc/init.d/rc3.d/</tt>.

RC scripts have the following naming convention:

* S##xxxx : boot-time scripts
* K##xxxx : shut-down scripts

The script names start with <tt>S</tt> or <tt>K</tt>; those starting with <tt>S</tt> are the boot-time scripts and those starting with <tt>K</tt> are the shut-down scripts. The two-digit number following the <tt>S</tt> or <tt>K</tt> is used to determine the execution order; scripts with lower numbers are executed earlier. The number is followed by the script name. On execution, "start" is passed as the first argument to boot-time scripts, and "stop" to shut-down scripts, so the same script can be used for both purposes. Here is a template of an RC script:

<pre>
#!/bin/sh
. /etc/rc.status

rc_reset
case "$1" in
start)
# fill here #
;;
stop)
# fill here #
;;
restart)
# fill here #
;;
status)
# fill here #
;;
*)
echo "Usage: $0 {start|stop|restart|status}"
exit 1
;;
esac
rc_exit
</pre>
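
To see how the naming convention translates into an execution order, the following dry-run sketch lists the scripts in a directory in the order implied by their names, together with the argument each would receive. The function name <tt>rc_run_order</tt> is hypothetical, and the way the actual boot scripts iterate over <tt>rc3.d</tt> is an assumption here; the sketch relies only on the shell expanding file name patterns in sorted order:

<pre>
# Hypothetical dry run: print "script argument" for every script in
# directory $1 whose name starts with $2 (S or K) and two digits.
rc_run_order() {
    dir=$1
    prefix=$2
    action=$3
    for script in "$dir"/"$prefix"[0-9][0-9]*; do
        [ -e "$script" ] || continue   # the pattern matched nothing
        echo "${script##*/} $action"
    done
}

# Usage: rc_run_order ramdisk/ION/ramdisk-add/etc/init.d/rc3.d S start
</pre>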

The ZeptoOS ION ramdisk contains the following RC scripts by default (some of these are ZeptoOS-specific, others come from the IBM ramdisk tree):

'''boot''' scripts:
<pre>
S00zepto
S01bootsysctl
S02syslog
S05ntp
S11sshd
S12zepto
S40gpfs
S43ibmcmp
S46essl
S50ciod
S51zoid
S99zepto
</pre>

'''shutdown''' scripts:
<pre>
K05ntp
K10sshd
K15ciod
K20gpfs
K30syslog
K50bgsys.64
</pre>

===Ramdisk size limitation===

In regular Linux environments, ramdisk size is limited by free memory size at the time when ramdisk is loaded into memory. However, on BGP, closed-source partition booting software cannot handle images of arbitrary sizes. We do not have an exact number on the boot image size limitation, but we have seen with the current software stack that images of 100 MB or larger might fail to boot. After adding large files to the ramdisk, please check the size of the generated image files, specifically <tt>BGP-ION-ramdisk-for-CNL.elf</tt> and <tt>BGP-CN-zImage-with-initrd.elf</tt>.
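
The size check can be scripted. The sketch below is a hypothetical helper, not part of ZeptoOS; it uses the approximate 100 MB figure from the paragraph above as its default limit, which is an observation rather than a documented threshold:

<pre>
# Approximate limit taken from the text above; the exact value is unknown.
LIMIT_BYTES=$((100 * 1024 * 1024))

# Hypothetical helper: warn (and return 1) if image $1 is at least as
# large as the limit in bytes given as $2 (default: LIMIT_BYTES).
check_image_size() {
    image=$1
    limit=${2:-$LIMIT_BYTES}
    size=$(wc -c < "$image") || return 2
    size=$((size + 0))              # normalize any padding from wc
    if [ "$size" -ge "$limit" ]; then
        echo "WARNING: $image is $size bytes (limit about $limit)"
        return 1
    fi
    echo "OK: $image is $size bytes"
}

# Usage: check_image_size BGP-ION-ramdisk-for-CNL.elf
#        check_image_size BGP-CN-zImage-with-initrd.elf
</pre>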

==Extracting files from an existing ramdisk image==

To extract files from an existing ramdisk image, do the following (ION ramdisk only):

<pre>
$ ./packages/tools/z-extract-cpio-from-ramdisk.sh <existing_ramdisk_image> ramdisk.cpio
$ mkdir treeroot && cd treeroot
$ cpio -idv < ../ramdisk.cpio
</pre>

----
[[Kernel]] | [[ZeptoOS_Documentation|Top]] | [[ZOID]]

Ramdisk

2009-05-08T21:28:40Z

Iskra:

[[Kernel]] | [[ZeptoOS_Documentation|Top]] | [[ZOID]]
----

==Introduction==

Both the CN and the ION Linux kernels require a ramdisk to boot. Ramdisk images contain minimal Linux utilities, init scripts, configuration files, kernel modules, etc, which are required by the OS boot process.

The ION ramdisk is an ELF file that contains a cpio archive of system files. Two ION ramdisk images are currently generated:

; BGP-ION-ramdisk-for-CNL.elf
: Default ION ramdisk for ZeptoOS compute node Linux.
; BGP-ION-ramdisk-for-CNK.elf
: Use this one if you need to run IBM CNK on the compute nodes (uses IBM CIOD instead of ZOID)

Our ION ramdisks are similar to the default ION ramdisk from IBM, but we add some extra files to support ZeptoOS features. The extra files are located in <tt>ramdisk/ION/ramdisk-add/</tt>. The <tt>build-ramdisk</tt> script from the IBM BGP driver is used to create the ION ramdisks.

The CN ramdisk is also a gzip'ed cpio archive of system files, but it is embedded into the CN kernel image (<tt>BGP-CN-zImage-with-initrd.elf</tt>). The CN ramdisk is created by a custom ramdisk build script (<tt>ramdisk/CN/create-bgp-cn-linux-ramdisk.pl</tt>). Both <tt>build-ramdisk</tt> and <tt>create-bgp-cn-linux-ramdisk.pl</tt> are wrappers around the Linux kernel's <tt>gen_init_cpio</tt> command.

==Creating ramdisk images==

The ramdisk images are always (re-)created from prebuilt objects if one types <tt>make</tt> at the top level directory (without any make target).

If one wants to create an ION ramdisk individually (without rebuilding other images), type:

<pre>
$ make bgp-ion-ramdisk-cnl
</pre>

If one wants to create a CN ramdisk (technically, create a CN kernel image with new ramdisk contents), type:

<pre>
$ make bgp-cn-linux
</pre>

'''Note:''' the newly built CN ramdisk can be found in <tt>ramdisk/CN/bgp-cn-ramdisk.cpio.gz</tt>, but it is not usable until it is embedded into the kernel image.

For other ramdisk-related make targets, please refer to [[Configuration#Building|Configuration]].

==Modifying ramdisk contents==

You can customize the ramdisk contents for your own purposes, e.g., debugging, running custom system software on BGP, etc.

===CN ramdisk===

The CN ramdisk can be customized by editing the CN ramdisk build script, which is <tt>ramdisk/CN/create-bgp-cn-linux-ramdisk.pl</tt>. The build script allows you to set the permission bits, create device files, etc.

Most of the contents of the CN ramdisk are kept in <tt>ramdisk/CN/tree/</tt>, but this is not a hard rule. Source files can reside anywhere as long as they are accessible from the script. It may be possible to use binaries and libraries from the login nodes, as long as they are 32-bit PPC files (use the <tt>file</tt> command to verify) and all their dependencies are also copied.

Here is a practical example. Suppose that you need the <tt>od</tt> command in the CN ramdisk. You could build the command from source code, but if you want to do something quick, you can try using the login node's version:

<pre>
$ file /usr/bin/od
/usr/bin/od: ELF 32-bit MSB executable, PowerPC or cisco 4500, version 1 (SYSV),
for GNU/Linux 2.6.4, dynamically linked (uses shared libs), for GNU/Linux 2.6.4, stripped
$ ldd /usr/bin/od
linux-vdso32.so.1 => (0x00100000)
libc.so.6 => /lib/ppc970/libc.so.6 (0x0fe8b000)
/lib/ld.so.1 (0xf7fe1000)
</pre>

It is a 32-bit PPC executable and the current CN ramdisk has all the necessary shared libraries, so it can be used. Now add the command to the Perl array named <tt>@cmdlists</tt> in the <tt>ramdisk/CN/create-bgp-cn-linux-ramdisk.pl</tt> script and type <tt>make</tt> to recreate the CN ramdisk:

<pre>
$ vi ramdisk/CN/create-bgp-cn-linux-ramdisk.pl
# add the following line to @cmdlists
"file /bin/od /usr/bin/od 0755 0 0",
$ make bgp-cn-linux
</pre>

Now the CN ramdisk has <tt>/bin/od</tt> with file permissions <tt>0755</tt>, uid=0, and gid=0.

The added line is a command for the <tt>gen_init_cpio</tt> tool. One can also create directories, device files, symbolic links, named pipes, socket files, etc:

<pre>
file <name> <location> <mode> <uid> <gid>
dir <name> <mode> <uid> <gid>
nod <name> <mode> <uid> <gid> <dev_type> <maj> <min>
slink <name> <target> <mode> <uid> <gid>
pipe <name> <mode> <uid> <gid>
sock <name> <mode> <uid> <gid>

<name> name of the file/dir/nod/etc in the archive
<location> location of the file in the current filesystem
<target> link target
<mode> mode/permissions of the file
<uid> user id (0=root)
<gid> group id (0=root)
<dev_type> device type (b=block, c=character)
<maj> major number of nod
<min> minor number of nod
</pre>

The order of the commands in @cmdlists ''matters''. They are executed from top to bottom, so one cannot add a file to a directory that has not yet been created.

====CN Linux startup script====

The first thing that the Linux kernel does after it boots is to execute the <tt>init</tt> program. The <tt>init</tt> program is usually in <tt>/sbin/</tt>, and in the CN ramdisk case it is part of BusyBox. <tt>init</tt> reads in a config file from <tt>/etc/inittab</tt>, which in our case instructs it to execute the <tt>/etc/init.d/rc.sysinit</tt> startup script.

Our startup script is very minimalistic; its two most important actions are to start the telnet daemon to allow users to log in from the I/O nodes and then to start the ZOID <tt>control</tt> process which takes care of IP forwarding and job control.

In case you need to start some processes at the CN boot time, you can add their invocations to <tt>ramdisk/CN/tree/etc/init.d/rc.sysinit</tt>, ''before'' <tt>/sbin/control</tt> is invoked.

===ION ramdisk===

Unlike the CN ramdisk, the range of customization is limited on the ION ramdisk. There is no control over file permission bits, one cannot create device nodes, etc. Currently we build the ION ramdisk using IBM's <tt>build-ramdisk</tt> script by specifying an add-on tree which contains our extra files.

Essentially, customization is limited to:
* adding new files,
* overwriting default ramdisk files by adding custom files with the same names.

Once files have been added under <tt>ramdisk/ION/ramdisk-add/</tt>, they will be automatically added to the ramdisk on the next rebuild. Here is an example of how to add a file to the ION ramdisk:

<pre>
$ vi ramdisk/ION/ramdisk-add/etc/yourfile
$ make bgp-ion-ramdisk-cnl
</pre>

If you need to do more than add files, you might need to edit the <tt>build-ramdisk</tt> script itself. The script is located in <tt>/bgsys/drivers/ppcfloor/</tt>. Copy the script to a working directory, edit it, and change the script path in <tt>ramdisk/ION/Makefile</tt>.

====ION startup script====

There is no <tt>rc.sysinit</tt> in <tt>ramdisk/ION/ramdisk-add/</tt>, because <tt>rc.sysinit</tt> is provided in the IBM ramdisk tree (i.e., <tt>/bgsys/drivers/ppcfloor/ramdisk/etc/init.d/rc.sysinit</tt> is the default one). If needed, one can copy the default one to the ZeptoOS <tt>ramdisk/etc/init.d/rc.sysinit</tt> and modify it to change the startup behaviour, but this is in general not recommended.

In most cases, what one is looking for is to start a process at the ION boot time. For such purpose, one can add a custom ION RC script to <tt>ramdisk-add/etc/init.d/rc3.d/</tt>.

RC scripts have the following naming convention:

* S##xxxx : boot-time scripts
* K##xxxx : shut-down scripts

They start with <tt>S</tt> or <tt>K</tt>; those starting with <tt>S</tt> are the boot-time scripts and those starting with <tt>K</tt> are the shut-down scripts. The two-digit number following <tt>S</tt> or <tt>K</tt> is used to determine the execution order; scripts with lower numbers are executed earlier. The number is followed by the script name. On execution, "start" is passed as the first argument to boot-time scripts, and "stop" to shut-down scripts. Here is a template of an RC script:

<pre>
#!/bin/sh
. /etc/rc.status

rc_reset
case "$1" in
start)
# fill here #
;;
stop)
# fill here #
;;
restart)
# fill here #
;;
status)
# fill here #
;;
*)
echo "Usage: $0 {start|stop|restart|status}"
exit 1
;;
esac
rc_exit
</pre>

The ZeptoOS ION ramdisk contains the following RC scripts by default (some of these are ZeptoOS-specific, others come from the IBM ramdisk tree):

'''boot''' scripts:
<pre>
S00zepto
S01bootsysctl
S02syslog
S05ntp
S11sshd
S12zepto
S40gpfs
S43ibmcmp
S46essl
S50ciod
S51zoid
S99zepto
</pre>

'''shutdown''' scripts:
<pre>
K05ntp
K10sshd
K15ciod
K20gpfs
K30syslog
K50bgsys.64
</pre>

===Ramdisk size limitation===

In regular Linux environments, ramdisk size is limited by free memory size at the time when ramdisk is loaded into memory. However, on BGP, closed-source system software cannot handle images of arbitrary sizes. We do not have an exact number on the boot image size limitation, but we have seen with the current software stack that images of 100 MB or larger might fail to boot. If one adds large files to the ramdisk, please check the size of the generated image files, specifically <tt>BGP-ION-ramdisk-for-CNL.elf</tt> and <tt>BGP-CN-zImage-with-initrd.elf</tt>.

==Extracting files from an existing ramdisk image==

To extract files from an existing ramdisk image, do the following (ION ramdisk only):

<pre>
$ ./packages/tools/z-extract-cpio-from-ramdisk.sh <existing_ramdisk_image> ramdisk.cpio
$ mkdir treeroot && cd treeroot
$ cpio -idv < ../ramdisk.cpio
</pre>

----
[[Kernel]] | [[ZeptoOS_Documentation|Top]] | [[ZOID]]

MPICH, DCMF, and SPI

2009-05-08T20:44:14Z

Iskra:

[[Testing]] | [[ZeptoOS_Documentation|Top]] | [[Kernel]]
----

==Introduction==

To support high performance computing (HPC) applications, specifically MPI applications, we have ported IBM's CNK communication software stack to the ZeptoOS compute node Linux environment. MPICH used in this ZeptoOS release is mpich2-1.0.7 with IBM patches. It is reasonably stable, and the performance of MPI applications on the ZeptoOS compute node Linux is comparable to that on CNK. While there are some limitations at the moment, there are benefits as well.

Benefits:
* No limitation on the number of threads
** 4 or more OpenMP threads per node
** Additional threads as I/O or background tasks
* It is Linux!
** Debugging tools such as gdb, strace, etc
** Various file systems, such as ramfs

Current limitations:
* Only the SMP mode is supported
* Shared libraries are not provided at the moment
* No binary compatibility between CNK and ZeptoOS CN Linux MPI binaries

We will support a VN-equivalent mode (multiple MPI tasks per node) and provide shared libraries in a future release.

As in the IBM CNK environment, the Deep Computing Messaging Framework (DCMF) and the System Programming Interface (SPI) are available. It is possible to write DCMF or SPI code directly if necessary. DCMF is a communication library that provides non-blocking operations. Please refer to the [http://dcmf.anl-external.org/wiki/index.php/Main_Page DCMF wiki] for details. We are using DCMF version 1.0.0 in the current ZeptoOS release, which is older than the DCMF in the current driver release (V1R3M0). SPI is the lowest-level user-space API for the torus DMA, collective network, BGP-specific lock mechanisms, and other compute node specific features. There is no public document on SPI available at the moment, but almost all header files and source code are available. Internally, MPICH depends on DCMF, which in turn depends on SPI. We will say more about it [[#Software stack layout|later]].

===ZCB and Big memory===

MPI applications running under the ZeptoOS compute node Linux environment (technically, applications that require the DMA operation or a maximum memory bandwidth) need to be configured as Zepto Compute Binaries (ZCB). This is done using the <tt>zelftool</tt>, which is invoked behind the scenes when linking a binary using the ZeptoOS MPI compiler wrapper scripts (<tt>zmpicc</tt>, etc).

The ZeptoOS compute node kernel treats ZCB executables differently from ordinary processes. It creates a special memory mapping region called big memory, which is covered by large pages with semi-static TLB entries, and it loads all application sections into the big memory region. The big memory region incurs virtually no TLB misses and it also enables DMA operations.

Some system calls will not work correctly if used from a ZCB process, in particular <tt>fork</tt> (but creating threads ''does'' work). Also, because big memory is a separate memory region set up at kernel boot time, its size is fixed. It is set to 256 MB by default, which could be too small for larger MPI processes; it can be [[FAQ#Why large MPI processes do not work|increased]] before booting a partition, at the expense of the ordinary Linux paged memory.

==Compiling HPC applications==

While the same compiler can be used as for applications running under the IBM CNK, the ZeptoOS compute node environment requires linking with ZeptoOS-specific communication libraries (applications linked with the CNK MPI will not work on ZeptoOS).

===Compiler wrapper scripts===

We provide compiler wrapper scripts which automatically link with appropriate libraries from the ZeptoOS installation directory. We provide the same set of wrapper scripts that IBM provides, with an extra <tt>z</tt> prefix:

; zmpicc, zmpicxx, zmpif77, zmpif90
: Wrapper scripts that invoke BGP-enhanced GNU compilers

; zmpixlc, zmpixlcxx, zmpixlf2003, zmpixlf77, zmpixlf90, zmpixlf95
: Wrapper scripts that invoke IBM XL compilers

; zmpixlc_r, zmpixlcxx_r, zmpixlf2003_r, zmpixlf77_r, zmpixlf90_r, zmpixlf95_r
: Wrapper scripts that invoke IBM XL compilers (thread safe compilation for OpenMP)

To get insight into the internals of these scripts, invoke them with the <tt>-show</tt> option.

====A compilation example====

There is nothing special about compiling a program for ZeptoOS. Here is a real-world example of how to build the well-known [http://climate.lanl.gov/Models/POP/ Parallel Ocean Program (POP)].

<pre>
$ wget http://climate.lanl.gov/Models/POP/POP_2.0.1.tar.Z
$ tar xvfz POP_2.0.1.tar.Z && cd pop
$ ./setup_run_dir ztest && cd ztest
$ edit ibm_mpi.gnu # see the patch below
$ export ARCHDIR=ibm_mpi
$ make # takes a while
$ edit pop_in # test data set
- nprocs_clinic = 4
- nprocs_tropic = 4
+ nprocs_clinic = 64
+ nprocs_tropic = 64
$ cqsub -n 64 -t 10 -k <zepto_profile> ./pop

--------------------
--- orig/ibm_mpi.gnu 2009-04-15 15:01:58.666457601 -0500
+++ ztest/ibm_mpi.gnu 2009-04-15 14:17:58.099132435 -0500
@@ -6,17 +6,18 @@
# will someday be a file which is a cookbook in Q&A style: "How do I do X?"
# is followed by something like "Go to file Y and add Z to line NNN."
#
-FC = mpxlf90_r
-LD = mpxlf90_r
-CC = mpcc_r
-Cp = /usr/bin/cp
-Cpp = /usr/ccs/lib/cpp -P
+ZPATH=<zepto_dir>
+FC = $(ZPATH)/zmpixlf90
+LD = $(ZPATH)/zmpixlf90
+CC = $(ZPATH)/zmpixlc
+Cp = /bin/cp
+Cpp = /usr/bin/cpp -P
AWK = /usr/bin/awk
-ABI = -q64
+#ABI = -q64
COMMDIR = mpi

-NETCDFINC = -I/usr/local/include
-NETCDFLIB = -L/usr/local/lib
+NETCDFINC = -I/soft/apps/netcdf-4.0/include/
+NETCDFLIB = -L/soft/apps/netcdf-4.0/lib

# Enable MPI library for parallel code, yes/no.

@@ -58,7 +59,8 @@
#
#----------------------------------------------------------------------------

-FBASE = $(ABI) -qarch=auto -qnosave -bmaxdata:0x80000000 $(NETCDFINC) -I$(ObjDepDir)
+#FBASE = $(ABI) -qarch=auto -qnosave -bmaxdata:0x80000000 $(NETCDFINC) -I$(ObjDepDir)
+FBASE = $(ABI) -qarch=auto -qnosave $(NETCDFINC) -I$(ObjDepDir)

ifeq ($(TRAP_FPE),yes)
FBASE := $(FBASE) -qflttrap=overflow:zerodivide:enable -qspillsize=32704

</pre>

===Compiling without the wrapper scripts===

If one wishes to invoke the compiler directly, please make sure that the Makefile or build environment points to ZeptoOS header files and libraries correctly. An example would be:

<pre>
$ /bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc-bgp-linux-gcc \
-o mpi-test-linux -Wall -O3 -I<zepto_dir>/include mpi-test.c \
-L<zepto_dir>/lib -lmpich.zcl -ldcmfcoll.zcl -ldcmf.zcl -lSPI.zcl -lzcl \
-lzoid_cn -lrt -lpthread -lm
$ <zepto_dir>/bin/zelftool -e mpi-test-linux
</pre>

'''Notes:'''
* Replace <tt><zepto_dir></tt> with the ZeptoOS install path.
* Do not forget to call the <tt>zelftool</tt> utility, which makes the executable a Zepto Compute Binary.

==Building MPICH, DCMF, and SPI libraries==

We provide all the necessary source code to build MPICH, DCMF, and SPI. To build these libraries, just type:

<pre>
$ make -C comm rebuild-target
</pre>

It may take half an hour to an hour to complete the build process, depending on what file system is being used (e.g., GPFS is a lot slower than a local file system).

The <tt>rebuild-target</tt> target does not know anything about the existing installation directory; it only copies the built libraries and header files to the <tt>comm/tmp</tt> directory. To install the newly built libraries, do the following:

<pre>
$ make -C comm update-prebuilt
$ python install.py <zepto_dir>
</pre>

The <tt>update-prebuilt</tt> target copies the files from the <tt>comm/tmp</tt> directory to the <tt>comm/prebuilt</tt> directory, which is where the <tt>install.py</tt> script looks for the files to copy to <tt><zepto_dir></tt>.

==Software stack layout==

[[Image:Zepto-Comm-Stack.png|right]]

The figure on the right depicts the layout of the communication software stack in the ZeptoOS compute node environment. This is very similar to IBM's CNK stack, with the exception of an extra ZEPTO SPI layer, and the use of Linux instead of CNK.

Since MPICH is a well-known software package we will not discuss it here, but we will briefly describe the DCMF and SPI components:

* DCMF
** stands for Deep Computing Messaging Framework,
** developed by IBM originally for Blue Gene architecture,
** hardware initialization, query functions,
** supports BGP Torus DMA, collective network,
** provides a timer,
** supports non-blocking collective operations,
** BGP MPICH uses DCMF internally (IBM provides a glue layer).
* SPI
** stands for System Programming Interface,
** developed by IBM; BGP-specific code,
** kernel interfaces – DMA control, lockbox, etc,
** DMA-related definitions
*** can be used in both user space and kernel space,
** RAS, BGP personality, mapping-related functions.

BGP SPI was designed specifically for IBM CNK, so it is not compatible with Linux. ZEPTO SPI is a thin software layer that absorbs the differences between the CNK and Linux or drops the requests that Linux cannot handle.

==Source code==

The source code and header files of DCMF and SPI can be found in the <tt>comm</tt> directory. The source code of MPICH is in <tt>DCMF/lib/mpich2/mpich2-1.0.7.tar.gz</tt>, which is unpacked at build time.

The DCMF source code is located in <tt>DCMF/sys/</tt>, with the core code in <tt>DCMF/sys/messaging/</tt>. Component Collective Messaging Interface (CCMI) is part of DCMF and its source code is in <tt>DCMF/sys/collectives/</tt>. Test codes can be found in <tt>DCMF/sys/collectives/tests/</tt> for CCMI and <tt>DCMF/sys/messaging/tests/</tt> for the core. Those test codes can be a good example of DCMF/CCMI programming.

SPI headers are in <tt>arch-runtime/arch/</tt> and SPI source code is in <tt>arch-runtime/runtime/</tt> (like the other paths in this section, relative to the <tt>comm</tt> directory). The source code of the ZEPTO SPI layer is in <tt>arch-runtime/zcl_spi/</tt>, while the header files are in <tt>arch-runtime/arch/include/zepto/</tt>.

Here is an overview of the directory tree:

<pre>
comm
|-- DCMF
|   |-- lib
|   |   |-- dev
|   |   `-- mpich2
|   |       `-- make
|   |-- sys
|   |   |-- collectives
|   |   |   |-- adaptor
|   |   |   |-- kernel
|   |   |   |-- tests
|   |   |   `-- tools
|   |   |-- include
|   |   `-- messaging
|   |       |-- devices
|   |       |-- messager
|   |       |-- protocols
|   |       |-- queueing
|   |       |-- sysdep
|   |       `-- tests
|-- arch-runtime
|   |-- arch
|   |   `-- include
|   |       |-- bpcore
|   |       |-- cnk
|   |       |-- common
|   |       |-- spi
|   |       `-- zepto
|   |-- runtime
|   |-- testcodes
|   `-- zcl_spi
`-- testcodes
</pre>

===Debug output===

ZeptoOS versions of SPI and DCMF have a built-in debug output. The output is disabled by default, and can be enabled by setting the environment variable <tt>ZEPTO_TRACE</tt> when submitting a job. The integer value of the variable indicates the debug level (a higher number results in more debug output).

An example:
<pre>
$ cqsub -k <zepto_profile> -n 64 -t 10 ... -e ZEPTO_TRACE=2 ./a.out
</pre>

----
[[Testing]] | [[ZeptoOS_Documentation|Top]] | [[Kernel]]

Kernel

2009-05-08T19:58:07Z

Iskra: /* Log files, etc */

[[MPICH, DCMF, and SPI]] | [[ZeptoOS_Documentation|Top]] | [[Ramdisk]]
----

==Introduction==

We currently provide two Linux kernels:

* 2.6.19-based kernel: ZeptoOS CN kernel
** IBM V1R3 patch and ZeptoOS patch applied
** 64 KB page size and big memory region available
** Device drivers for compute node devices such as DMA, lockbox, etc
** Allows running MPICH/DCMF code through Zepto Compute Binary (ZCB)
** Can be used as enhanced ION kernel

* 2.6.16-based kernel: ZeptoOS ION kernel
** IBM V1R3 patch applied
** Only minor changes compared to the IBM ION kernel.

We focus our development efforts on the 2.6.19-based kernel. It is meant primarily for the compute nodes, but can also be used on the I/O nodes. The problem is that GPFS does not work with this kernel, so we also provide the 2.6.16-based kernel which GPFS does work with.

==Kernel directory structure==

The <tt>kernel</tt> directory consists of three main subdirectories: <tt>prebuilt</tt>, <tt>config</tt>, and <tt>tarball</tt>.

<pre>
kernel
|-- prebuilt
|   |-- 2.6.16
|   |   `-- ION
|   `-- 2.6.19
|       |-- CN
|       `-- objs
|-- tarball
`-- config
</pre>

The <tt>prebuilt</tt> directory contains prebuilt kernel images and modules. While a complete prebuilt ION kernel ELF file is provided, for the CN kernel we provide intermediate object files instead. This is because we embed the CN ramdisk in the CN kernel image when building ZeptoOS, and this process requires the object files.

The <tt>tarball</tt> directory contains kernel tarballs separately for the ION and the CN Linux kernel. Technically, those tarballs are snapshots of the ZeptoOS kernel git repository. The directory might contain a <tt>.patch</tt> file that contains the differences between the last snapshot and the current git HEAD since we wanted to avoid creating a snapshot from git for small modifications. Associated git log files can also be found in this directory. A <tt>.SNAPSHOT_HEAD</tt> file indicates the git revision at the time when a snapshot was created, so this information is used to create a patch file.

As an example, here is a list of files for the CN kernel:

<pre>
linux-2.6.19.2-BGP-V1R3.git.log
linux-2.6.19.2-BGP-V1R3.patch
linux-2.6.19.2-BGP-V1R3.SNAPSHOT_HEAD
linux-2.6.19.2-BGP-V1R3.tar.bz2
</pre>

The <tt>config</tt> directory contains Linux kernel configs. In case of the 2.6.19 kernel, we provide separate config files for the compute node and the I/O node.

==Building a kernel==

The <tt>Makefile</tt> in the <tt>kernel</tt> directory has many targets. Running <tt>make</tt> without a target prints a help message.

To build (or rebuild) a kernel from a source tarball, use the <tt>bgp-ion-linux-build</tt> or <tt>bgp-cn-linux-build</tt> target. By default, the target extracts the ION or CN kernel tarball into a directory named <tt>work</tt>, applies the patch (if any), and starts the kernel build. Once the kernel has been built successfully, the kernel images (in both the ZeptoOS top-level directory and the <tt>tmp</tt> directory) are replaced with the newly built ones. The ION kernel source code is extracted into <tt>work/linux-2.6.16.46-297-BGP-V1R3</tt> and the CN kernel source into <tt>work/linux-2.6.19.2-BGP-V1R3</tt>.

Here is an example of building and rebuilding the CN kernel:

<pre>
$ cd kernel
$ make bgp-cn-linux-build
....
$ ls -al ../BGP-CN-zImage-with-initrd.elf
$ vi work/linux-2.6.19.2-BGP-V1R3/kernel/sched.c
$ make bgp-cn-linux-build
....
$ ls -al ../BGP-CN-zImage-with-initrd.elf
</pre>

===Building a kernel from the ZeptoOS kernel git repository===

As mentioned earlier, kernel tarballs are used as the source by default. If instead one passes <tt>GIT=1</tt> to <tt>make</tt>, one can build directly from the ZeptoOS kernel git tree. This is very useful for kernel development since it makes it easier to keep track of local modifications.

<pre>
$ cd kernel
$ make GIT=1 bgp-cn-linux-build
....
$ vi repo/linux-2.6.19.2-BGP-V1R3/kernel/sched.c
$ make GIT=1 bgp-cn-linux-build
....
</pre>

This will create <tt>repo/linux-2.6.19.2-BGP-V1R3</tt>, which is a git repository that is cloned from http://git.anl-external.org/bg-linux.repos/linux-2.6.19-BGP-V1R3.git/. Our http repo is read-only, so you cannot push your modifications to it. Instead, please post any patches to the [mailto:zeptoos@lists.mcs.anl.gov ZeptoOS developers mailing list].

See also the [http://bg-linux.anl-external.org/wiki/index.php/Main_Page BG-Linux page] for the details on our kernel git repository.

===Kernel config===

When one invokes <tt>make</tt> with a kernel build target for the first time, the associated kernel config file is copied to <tt>.config</tt> in the kernel build directory. <tt>config/bgp-cn-2.6.19.2-dot-config</tt> is applied to the CN Linux kernel build tree, and <tt>config/bgp-ion-2.6.16.46-dot-config</tt> is applied to the ION Linux kernel build tree.

Here are the locations of the kernel config files:
* Regular build
** work/build-2.6.19.2-BGP-V1R3/.config
** work/build-2.6.16.46-297-BGP-V1R3/.config
* GIT build
** repo/build-2.6.19.2-BGP-V1R3/.config
** repo/build-2.6.16.46-297-BGP-V1R3/.config
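
The four locations follow a simple pattern: the top-level directory depends on whether <tt>GIT=1</tt> was used, and the build directory depends on the kernel. A minimal Python sketch of that mapping (the function and dictionary names are illustrative and not part of ZeptoOS):

```python
import posixpath

# Build directory names as listed above; "cn" is the 2.6.19 compute node
# kernel and "ion" is the 2.6.16 I/O node kernel.
BUILD_DIRS = {
    "cn": "build-2.6.19.2-BGP-V1R3",
    "ion": "build-2.6.16.46-297-BGP-V1R3",
}


def kernel_config_path(kind, git=False):
    """Return the .config path, relative to the kernel directory."""
    top = "repo" if git else "work"  # GIT=1 builds live under repo/
    return posixpath.join(top, BUILD_DIRS[kind], ".config")


if __name__ == "__main__":
    for kind in ("cn", "ion"):
        for git in (False, True):
            print(kernel_config_path(kind, git))
```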

Please note that the kernel config file is copied only once; it is not copied again until one does a <tt>distclean</tt> or removes the files manually.
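
The copy-once behavior can be pictured as follows. This is only a Python sketch of the logic described above, not the actual <tt>Makefile</tt> rule, and <tt>install_default_config</tt> is a hypothetical name:

```python
import os
import shutil


def install_default_config(default_config, build_dir):
    """Copy default_config to build_dir/.config unless one already exists.

    Returns True if the file was copied, False if an existing .config
    (possibly with local changes) was left untouched.
    """
    target = os.path.join(build_dir, ".config")
    if os.path.exists(target):
        return False  # keep the existing config, as on later builds
    os.makedirs(build_dir, exist_ok=True)
    shutil.copyfile(default_config, target)
    return True
```

A consequence of this scheme is that changes made to a file in the <tt>config</tt> directory do not reach an existing build directory until its <tt>.config</tt> is removed.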

The <tt>bgp-cn-linux-menuconfig</tt> and <tt>bgp-ion-linux-menuconfig</tt> <tt>make</tt> targets invoke text-based Linux kernel configuration menus:

<pre>
$ make bgp-ion-linux-menuconfig
$ make bgp-cn-linux-menuconfig
</pre>

For GIT build:
<pre>
$ make GIT=1 bgp-ion-linux-menuconfig
$ make GIT=1 bgp-cn-linux-menuconfig
</pre>

These menu targets never update the default kernel config files from the <tt>config</tt> directory. If one wants to apply a new config permanently, please copy it to the <tt>config</tt> directory by hand:
<pre>
$ cp work/build-2.6.19.2-BGP-V1R3/.config config/bgp-cn-2.6.19.2-dot-config
</pre>

==Kernel (command line) parameters==

In common server/desktop Linux environments, kernel parameters can be passed via a bootloader such as GRUB. However, the Blue Gene/P boot mechanism does not provide such a capability, so we have modified the CN Linux kernel (2.6.19) to use a kernel parameter string embedded in the kernel ELF image file itself.

One can (re)set the kernel parameters in a kernel ELF file using a command line tool <tt>zkparam.py</tt>, located in the <tt>bin</tt> subdirectory of the ZeptoOS installation directory. Here is the synopsis of the tool:

<pre>
zkparam.py <kernel_image> [options]
</pre>

If options are omitted, the tool shows the current kernel parameters:

<pre>
$ ./kernel/zkparam.py BGP-CN-zImage-with-initrd.elf zepto_console_output=2
$ ./kernel/zkparam.py BGP-CN-zImage-with-initrd.elf
Current Kernel Parameters:
zepto_console_output=2
</pre>
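
The example output suggests that the parameters are stored as <tt>name=value</tt> entries. Assuming that multiple entries are separated by whitespace, as on a regular Linux kernel command line (the exact format handled by <tt>zkparam.py</tt> is not described here), such a string could be processed in Python along these lines. Both function names are hypothetical:

```python
def parse_kernel_params(param_string):
    """Split a whitespace-separated 'name=value' string into a dict."""
    params = {}
    for token in param_string.split():
        key, sep, value = token.partition("=")
        # A bare flag without '=' is stored with a value of None.
        params[key] = value if sep else None
    return params


def format_kernel_params(params):
    """Inverse of parse_kernel_params: join the entries into one string."""
    return " ".join(k if v is None else f"{k}={v}" for k, v in params.items())


if __name__ == "__main__":
    params = parse_kernel_params("zepto_console_output=2 flatmemsizeMB=512")
    params["zepto_debug"] = "0"  # add or override a single parameter
    print(format_kernel_params(params))
```

Values are kept as strings so that a round trip through both functions does not alter them.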

===ZeptoOS-specific kernel parameters===

* '''zepto_debug'''=<integer>
** Specifies the ZeptoOS kernel debug level.
** The higher the number, the more messages are generated.
** <tt>0</tt> turns off all debug messages.
** default=1
* '''flatmemsizeMB'''=<integer>
** Specifies the size of big memory in MB.
** Currently the granularity of memory size is 256 MB.
** default=256 min=256 max=1792
* '''zepto_console_output'''=<integer>
** Specifies the console output behavior.
** 0 disables console output from all compute nodes.
** 1 enables console output from the first compute node ([[FAQ#Torus rank|torus rank]] 0).
** 2 enables console output from all compute nodes.
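
Before embedding parameters in a kernel image, it can be useful to check them against the limits listed above. The following Python sketch does that. The function name is hypothetical, and rejecting a <tt>flatmemsizeMB</tt> value that is not a multiple of 256 is an assumption based on the stated granularity; the kernel itself may treat such values differently:

```python
FLATMEM_MIN_MB = 256
FLATMEM_MAX_MB = 1792
FLATMEM_STEP_MB = 256  # documented granularity


def check_param(name, value):
    """Return None if the value looks acceptable, else an error message."""
    try:
        number = int(value)
    except (TypeError, ValueError):
        return f"{name}: {value!r} is not an integer"
    if name == "flatmemsizeMB":
        if not FLATMEM_MIN_MB <= number <= FLATMEM_MAX_MB:
            return f"{name}: {number} is outside {FLATMEM_MIN_MB}..{FLATMEM_MAX_MB}"
        # Assumption: values off the 256 MB grid are treated as errors.
        if number % FLATMEM_STEP_MB != 0:
            return f"{name}: {number} is not a multiple of {FLATMEM_STEP_MB}"
    elif name == "zepto_console_output":
        if number not in (0, 1, 2):
            return f"{name}: {number} is not one of 0, 1, 2"
    # zepto_debug only needs to be an integer, which was checked above.
    return None
```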

==Log files, etc==


The compute node and I/O node logfiles have been discussed extensively in [[Testing#Log files|Testing]].

In addition to regular console logs, the kernels can also generate RAS messages, which will not appear in the log files; they are instead stored in a database on the service node. At Argonne we have a custom command line tool named <tt>bg-listevents</tt> that shows a record of RAS events (type <tt>bg-listevents -h</tt> for command line arguments).

----
[[MPICH, DCMF, and SPI]] | [[ZeptoOS_Documentation|Top]] | [[Ramdisk]]

Kernel

2009-05-08T19:57:52Z

Iskra:

[[MPICH, DCMF, and SPI]] | [[ZeptoOS_Documentation|Top]] | [[Ramdisk]]
----

==Introduction==

We currently provide two Linux kernels:

* 2.6.19-based kernel: ZeptoOS CN kernel
** IBM V1R3 patch and ZeptoOS patch applied
** 64 KB page size and big memory region available
** Device drivers for compute node devices such as DMA, lockbox, etc.
** Allows running MPICH/DCMF code through the Zepto Compute Binary (ZCB)
** Can be used as an enhanced ION kernel

* 2.6.16-based kernel: ZeptoOS ION kernel
** IBM V1R3 patch applied
** Only minor changes compared to the IBM ION kernel.

We focus our development efforts on the 2.6.19-based kernel. It is meant primarily for the compute nodes, but can also be used on the I/O nodes. The problem is that GPFS does not work with this kernel, so we also provide the 2.6.16-based kernel which GPFS does work with.

==Kernel directory structure==

The <tt>kernel</tt> directory consists of three main subdirectories: <tt>prebuilt</tt>, <tt>config</tt>, and <tt>tarball</tt>.

<pre>
kernel
|-- prebuilt
|   |-- 2.6.16
|   |   `-- ION
|   `-- 2.6.19
|       |-- CN
|       `-- objs
|-- tarball
`-- config
</pre>

The <tt>prebuilt</tt> directory contains prebuilt kernel images and modules. While a complete prebuilt ION kernel ELF file is provided, for the CN kernel we provide intermediate object files instead. This is because we embed the CN ramdisk in the CN kernel image when building ZeptoOS, and this process requires the object files.

The <tt>tarball</tt> directory contains kernel tarballs separately for the ION and the CN Linux kernel. Technically, those tarballs are snapshots of the ZeptoOS kernel git repository. The directory might contain a <tt>.patch</tt> file that contains the differences between the last snapshot and the current git HEAD since we wanted to avoid creating a snapshot from git for small modifications. Associated git log files can also be found in this directory. A <tt>.SNAPSHOT_HEAD</tt> file indicates the git revision at the time when a snapshot was created, so this information is used to create a patch file.

As an example, here is a list of files for the CN kernel:

<pre>
linux-2.6.19.2-BGP-V1R3.git.log
linux-2.6.19.2-BGP-V1R3.patch
linux-2.6.19.2-BGP-V1R3.SNAPSHOT_HEAD
linux-2.6.19.2-BGP-V1R3.tar.bz2
</pre>

The <tt>config</tt> directory contains Linux kernel configs. In case of the 2.6.19 kernel, we provide separate config files for the compute node and the I/O node.

==Building a kernel==

The <tt>Makefile</tt> in the <tt>kernel</tt> directory has many targets. Running <tt>make</tt> without a target prints a help message.

To build (or rebuild) a kernel from a source tarball, use the <tt>bgp-ion-linux-build</tt> or <tt>bgp-cn-linux-build</tt> target. By default, the target extracts the ION or CN kernel tarball into a directory named <tt>work</tt>, applies the patch (if any), and starts the kernel build. Once the kernel has been built successfully, the kernel images (in both the ZeptoOS top-level directory and the <tt>tmp</tt> directory) are replaced with the newly built ones. The ION kernel source code is extracted into <tt>work/linux-2.6.16.46-297-BGP-V1R3</tt> and the CN kernel source into <tt>work/linux-2.6.19.2-BGP-V1R3</tt>.

Here is an example of building and rebuilding the CN kernel:

<pre>
$ cd kernel
$ make bgp-cn-linux-build
....
$ ls -al ../BGP-CN-zImage-with-initrd.elf
$ vi work/linux-2.6.19.2-BGP-V1R3/kernel/sched.c
$ make bgp-cn-linux-build
....
$ ls -al ../BGP-CN-zImage-with-initrd.elf
</pre>

===Building a kernel from the ZeptoOS kernel git repository===

As mentioned earlier, kernel tarballs are used as the source by default. If instead one passes <tt>GIT=1</tt> to <tt>make</tt>, one can build directly from the ZeptoOS kernel git tree. This is very useful for kernel development since it makes it easier to keep track of local modifications.

<pre>
$ cd kernel
$ make GIT=1 bgp-cn-linux-build
....
$ vi repo/linux-2.6.19.2-BGP-V1R3/kernel/sched.c
$ make GIT=1 bgp-cn-linux-build
....
</pre>

This will create <tt>repo/linux-2.6.19.2-BGP-V1R3</tt>, which is a git repository that is cloned from http://git.anl-external.org/bg-linux.repos/linux-2.6.19-BGP-V1R3.git/. Our http repo is read-only, so you cannot push your modifications to it. Instead, please post any patches to the [mailto:zeptoos@lists.mcs.anl.gov ZeptoOS developers mailing list].

See also the [http://bg-linux.anl-external.org/wiki/index.php/Main_Page BG-Linux page] for the details on our kernel git repository.

===Kernel config===

When one invokes <tt>make</tt> with a kernel build target for the first time, the associated kernel config file is copied to <tt>.config</tt> in the kernel build directory. <tt>config/bgp-cn-2.6.19.2-dot-config</tt> is applied to the CN Linux kernel build tree, and <tt>config/bgp-ion-2.6.16.46-dot-config</tt> is applied to the ION Linux kernel build tree.

Here are the locations of the kernel config files:
* Regular build
** work/build-2.6.19.2-BGP-V1R3/.config
** work/build-2.6.16.46-297-BGP-V1R3/.config
* GIT build
** repo/build-2.6.19.2-BGP-V1R3/.config
** repo/build-2.6.16.46-297-BGP-V1R3/.config

Please note that the kernel config file is copied only once; it is not copied again until one does a <tt>distclean</tt> or removes the files manually.

The <tt>bgp-cn-linux-menuconfig</tt> and <tt>bgp-ion-linux-menuconfig</tt> <tt>make</tt> targets invoke text-based Linux kernel configuration menus:

<pre>
$ make bgp-ion-linux-menuconfig
$ make bgp-cn-linux-menuconfig
</pre>

For GIT build:
<pre>
$ make GIT=1 bgp-ion-linux-menuconfig
$ make GIT=1 bgp-cn-linux-menuconfig
</pre>

These menu targets never update the default kernel config files from the <tt>config</tt> directory. If one wants to apply a new config permanently, please copy it to the <tt>config</tt> directory by hand:
<pre>
$ cp work/build-2.6.19.2-BGP-V1R3/.config config/bgp-cn-2.6.19.2-dot-config
</pre>

==Kernel (command line) parameters==

In common server/desktop Linux environments, kernel parameters can be passed via a bootloader such as GRUB. However, the Blue Gene/P boot mechanism does not provide such a capability, so we have modified the CN Linux kernel (2.6.19) to use a kernel parameter string embedded in the kernel ELF image file itself.

One can (re)set the kernel parameters in a kernel ELF file using a command line tool <tt>zkparam.py</tt>, located in the <tt>bin</tt> subdirectory of the ZeptoOS installation directory. Here is the synopsis of the tool:

<pre>
zkparam.py <kernel_image> [options]
</pre>

If options are omitted, the tool shows the current kernel parameters:

<pre>
$ ./kernel/zkparam.py BGP-CN-zImage-with-initrd.elf zepto_console_output=2
$ ./kernel/zkparam.py BGP-CN-zImage-with-initrd.elf
Current Kernel Parameters:
zepto_console_output=2
</pre>

===ZeptoOS-specific kernel parameters===

* '''zepto_debug'''=<integer>
** Specifies the ZeptoOS kernel debug level.
** The higher the number, the more messages are generated.
** <tt>0</tt> turns off all debug messages.
** default=1
* '''flatmemsizeMB'''=<integer>
** Specifies the size of big memory in MB.
** Currently the granularity of memory size is 256 MB.
** default=256 min=256 max=1792
* '''zepto_console_output'''=<integer>
** Specifies the console output behavior.
** 0 disables console output from all compute nodes.
** 1 enables console output from the first compute node ([[FAQ#Torus rank|torus rank]] 0).
** 2 enables console output from all compute nodes.

==Log files, etc==


The compute node and I/O node logfiles have been discussed extensively in [[Testing#Log files|Testing]].

In addition to regular console logs, the kernels can also generate RAS messages, which will not appear in the log files; they are instead stored in a database on the service node. At Argonne we have a command line tool named <tt>bg-listevents</tt> that shows a record of RAS events (type <tt>bg-listevents -h</tt> for command line arguments).

----
[[MPICH, DCMF, and SPI]] | [[ZeptoOS_Documentation|Top]] | [[Ramdisk]]

Kernel

2009-05-08T19:44:31Z

Iskra:

[[MPICH, DCMF, and SPI]] | [[ZeptoOS_Documentation|Top]] | [[Ramdisk]]
----

==Introduction==

We currently provide two Linux kernels:

* 2.6.19-based kernel: ZeptoOS CN kernel
** IBM V1R3 patch and ZeptoOS patch applied
** 64 KB page size and big memory region available
** Device drivers for compute node devices such as DMA, lockbox, etc.
** Allows running MPICH/DCMF code through the Zepto Compute Binary (ZCB)
** Can be used as an enhanced ION kernel

* 2.6.16-based kernel: ZeptoOS ION kernel
** IBM V1R3 patch applied
** Only minor changes compared to the IBM ION kernel.

We focus our development efforts on the 2.6.19-based kernel. It is meant primarily for the compute nodes, but can also be used on the I/O nodes. The problem is that GPFS does not work with this kernel, so we also provide the 2.6.16-based kernel which GPFS does work with.

==Kernel directory structure==

The <tt>kernel</tt> directory consists of three main subdirectories: <tt>prebuilt</tt>, <tt>config</tt>, and <tt>tarball</tt>.

<pre>
kernel
|-- prebuilt
|   |-- 2.6.16
|   |   `-- ION
|   `-- 2.6.19
|       |-- CN
|       `-- objs
|-- tarball
`-- config
</pre>

The <tt>prebuilt</tt> directory contains prebuilt kernel images and modules. While a complete prebuilt ION kernel ELF file is provided, for the CN kernel we provide intermediate object files instead. This is because we embed the CN ramdisk in the CN kernel image when building ZeptoOS, and this process requires the object files.

The <tt>tarball</tt> directory contains kernel tarballs separately for the ION and the CN Linux kernel. Technically, those tarballs are snapshots of the ZeptoOS kernel git repository. The directory might contain a <tt>.patch</tt> file that contains the differences between the last snapshot and the current git HEAD since we wanted to avoid creating a snapshot from git for small modifications. Associated git log files can also be found in this directory. A <tt>.SNAPSHOT_HEAD</tt> file indicates the git revision at the time when a snapshot was created, so this information is used to create a patch file.

As an example, here is a list of files for the CN kernel:

<pre>
linux-2.6.19.2-BGP-V1R3.git.log
linux-2.6.19.2-BGP-V1R3.patch
linux-2.6.19.2-BGP-V1R3.SNAPSHOT_HEAD
linux-2.6.19.2-BGP-V1R3.tar.bz2
</pre>

The <tt>config</tt> directory contains Linux kernel configs. In case of the 2.6.19 kernel, we provide separate config files for the compute node and the I/O node.

==Building a kernel==

The <tt>Makefile</tt> in the <tt>kernel</tt> directory has many targets. Running <tt>make</tt> without a target prints a help message.

To build (or rebuild) a kernel from a source tarball, use the <tt>bgp-ion-linux-build</tt> or <tt>bgp-cn-linux-build</tt> target. By default, the target extracts the ION or CN kernel tarball into a directory named <tt>work</tt>, applies the patch (if any), and starts the kernel build. Once the kernel has been built successfully, the kernel images (in both the ZeptoOS top-level directory and the <tt>tmp</tt> directory) are replaced with the newly built ones. The ION kernel source code is extracted into <tt>work/linux-2.6.16.46-297-BGP-V1R3</tt> and the CN kernel source into <tt>work/linux-2.6.19.2-BGP-V1R3</tt>.

Here is an example of building and rebuilding the CN kernel:

<pre>
$ cd kernel
$ make bgp-cn-linux-build
....
$ ls -al ../BGP-CN-zImage-with-initrd.elf
$ vi work/linux-2.6.19.2-BGP-V1R3/kernel/sched.c
$ make bgp-cn-linux-build
....
$ ls -al ../BGP-CN-zImage-with-initrd.elf
</pre>

===Building a kernel from the ZeptoOS kernel git repository===

As mentioned earlier, kernel tarballs are used as the source by default. If instead one passes <tt>GIT=1</tt> to <tt>make</tt>, one can build directly from the ZeptoOS kernel git tree. This is very useful for kernel development since it makes it easier to keep track of local modifications.

<pre>
$ cd kernel
$ make GIT=1 bgp-cn-linux-build
....
$ vi repo/linux-2.6.19.2-BGP-V1R3/kernel/sched.c
$ make GIT=1 bgp-cn-linux-build
....
</pre>

This will create <tt>repo/linux-2.6.19.2-BGP-V1R3</tt>, which is a git repository that is cloned from http://git.anl-external.org/bg-linux.repos/linux-2.6.19-BGP-V1R3.git/. Our http repo is read-only, so you cannot push your modifications to it. Instead, please post any patches to the [mailto:zeptoos@lists.mcs.anl.gov ZeptoOS developers mailing list].

See also the [http://bg-linux.anl-external.org/wiki/index.php/Main_Page BG-Linux page] for the details on our kernel git repository.

===Kernel config===

When one invokes <tt>make</tt> with a kernel build target for the first time, the associated kernel config file is copied to <tt>.config</tt> in the kernel build directory. <tt>config/bgp-cn-2.6.19.2-dot-config</tt> is applied to the CN Linux kernel build tree, and <tt>config/bgp-ion-2.6.16.46-dot-config</tt> is applied to the ION Linux kernel build tree.

Here are the locations of the kernel config files:
* Regular build
** work/build-2.6.19.2-BGP-V1R3/.config
** work/build-2.6.16.46-297-BGP-V1R3/.config
* GIT build
** repo/build-2.6.19.2-BGP-V1R3/.config
** repo/build-2.6.16.46-297-BGP-V1R3/.config

Please note that the kernel config file is copied only once; it is not copied again until one does a <tt>distclean</tt> or removes the files manually.

The <tt>bgp-cn-linux-menuconfig</tt> and <tt>bgp-ion-linux-menuconfig</tt> <tt>make</tt> targets invoke text-based Linux kernel configuration menus:

<pre>
$ make bgp-ion-linux-menuconfig
$ make bgp-cn-linux-menuconfig
</pre>

For GIT build:
<pre>
$ make GIT=1 bgp-ion-linux-menuconfig
$ make GIT=1 bgp-cn-linux-menuconfig
</pre>

These menu targets never update the default kernel config files from the <tt>config</tt> directory. If one wants to apply a new config permanently, please copy it to the <tt>config</tt> directory by hand:
<pre>
$ cp work/build-2.6.19.2-BGP-V1R3/.config config/bgp-cn-2.6.19.2-dot-config
</pre>

==Kernel (command line) parameters==

In common server/desktop Linux environments, kernel parameters can be passed via a bootloader such as GRUB. However, the Blue Gene/P boot mechanism does not provide such a capability, so we have modified the CN Linux kernel (2.6.19) to use a kernel parameter string embedded in the kernel ELF image file itself.

One can (re)set the kernel parameters in a kernel ELF file using a command line tool <tt>zkparam.py</tt>, located in the <tt>bin</tt> subdirectory of the ZeptoOS installation directory. Here is the synopsis of the tool:

<pre>
zkparam.py <kernel_image> [options]
</pre>

If options are omitted, the tool shows the current kernel parameters.

<pre>
$ ./kernel/zkparam.py BGP-CN-zImage-with-initrd.elf zepto_console_output=2
$ ./kernel/zkparam.py BGP-CN-zImage-with-initrd.elf
Current Kernel Parameters:
zepto_console_output=2
</pre>

===ZeptoOS-specific kernel parameters===

* '''zepto_debug'''=<integer>
** Specifies the ZeptoOS kernel debug level.
** The higher the number, the more messages are generated.
** <tt>0</tt> turns off all debug messages.
** default=1
* '''flatmemsizeMB'''=<integer>
** Specifies the size of big memory in MB.
** Currently the granularity of memory size is 256 MB.
** default=256 min=256 max=1792
* '''zepto_console_output'''=<integer>
** Specifies the console output behavior.
** 0 disables console output from all compute nodes.
** 1 enables console output from the first compute node ([[FAQ#Torus rank|torus rank]] 0).
** 2 enables console output from all compute nodes.

==Log files, etc==


The compute node and I/O node logfiles have been discussed extensively in [[Testing#Log files|Testing]].

In addition to regular console logs, the kernels can also generate RAS messages, which will not appear in the log files. A command line tool named <tt>bg-listevents</tt> shows a record of RAS events; type <tt>bg-listevents -h</tt> for its command line arguments.

----
[[MPICH, DCMF, and SPI]] | [[ZeptoOS_Documentation|Top]] | [[Ramdisk]]

Kernel

2009-05-08T17:58:37Z

Iskra:

[[MPICH, DCMF, and SPI]] | [[ZeptoOS_Documentation|Top]] | [[Ramdisk]]
----

==Introduction==

We currently provide two Linux kernels:

* 2.6.19-based kernel: ZeptoOS CN kernel
** IBM V1R3 patch and ZeptoOS patch applied
** 64 KB page size and big memory region available
** Device drivers for compute node devices such as DMA, lockbox, etc.
** Allows running MPICH/DCMF code through the Zepto Compute Binary (ZCB)
** Can be used as an enhanced ION kernel

* 2.6.16-based kernel: ZeptoOS ION kernel
** IBM V1R3 patch applied
** Only minor changes compared to the IBM ION kernel.

We focus our development efforts on the 2.6.19-based kernel. It is meant primarily for the compute nodes, but can also be used on the I/O nodes. The problem is that GPFS does not work with this kernel, so we also provide the 2.6.16-based kernel which GPFS does work with.

==Kernel directory structure==

The <tt>kernel</tt> directory consists of three main subdirectories: <tt>prebuilt</tt>, <tt>config</tt>, and <tt>tarball</tt>.

<pre>
kernel
|-- prebuilt
|   |-- 2.6.16
|   |   `-- ION
|   `-- 2.6.19
|       |-- CN
|       `-- objs
|-- tarball
`-- config
</pre>

The <tt>prebuilt</tt> directory contains prebuilt kernel images and modules. While a complete prebuilt ION kernel ELF file is provided, for the CN kernel we provide intermediate object files instead. This is because we embed the CN ramdisk in the CN kernel image when building ZeptoOS, and this process requires the object files.

The <tt>tarball</tt> directory contains kernel tarballs separately for the ION and the CN Linux kernel. Technically, those tarballs are snapshots of the ZeptoOS kernel git repository. The directory might contain a <tt>.patch</tt> file that contains the differences between the last snapshot and the current git HEAD since we wanted to avoid creating a snapshot from git for small modifications. Associated git log files can also be found in this directory. A <tt>.SNAPSHOT_HEAD</tt> file indicates the git revision at the time when a snapshot was created, so this information is used to create a patch file.

As an example, here is a list of files for the CN kernel:

<pre>
linux-2.6.19.2-BGP-V1R3.git.log
linux-2.6.19.2-BGP-V1R3.patch
linux-2.6.19.2-BGP-V1R3.SNAPSHOT_HEAD
linux-2.6.19.2-BGP-V1R3.tar.bz2
</pre>

The <tt>config</tt> directory contains Linux kernel configs. In case of the 2.6.19 kernel, we provide separate config files for the compute node and the I/O node.

==Building a kernel==

The <tt>Makefile</tt> in the <tt>kernel</tt> directory has many targets. Running <tt>make</tt> without a target prints a help message.

To build (or rebuild) a kernel from a source tarball, use the <tt>bgp-ion-linux-build</tt> or <tt>bgp-cn-linux-build</tt> target. By default, the target extracts the ION or CN kernel tarball into a directory named <tt>work</tt>, applies the patch (if any), and starts the kernel build. Once the kernel has been built successfully, the kernel images (in both the ZeptoOS top-level directory and the <tt>tmp</tt> directory) are replaced with the newly built ones. The ION kernel source code is extracted into <tt>work/linux-2.6.16.46-297-BGP-V1R3</tt> and the CN kernel source into <tt>work/linux-2.6.19.2-BGP-V1R3</tt>.

Here is an example of building and rebuilding the CN kernel:

<pre>
$ cd kernel
$ make bgp-cn-linux-build
....
$ ls -al ../BGP-CN-zImage-with-initrd.elf
$ vi work/linux-2.6.19.2-BGP-V1R3/kernel/sched.c
$ make bgp-cn-linux-build
....
$ ls -al ../BGP-CN-zImage-with-initrd.elf
</pre>

===Building a kernel from the ZeptoOS kernel git repository===

As mentioned earlier, kernel tarballs are used as the source by default. If instead one passes <tt>GIT=1</tt> to <tt>make</tt>, one can build directly from the ZeptoOS kernel git tree. This is very useful for kernel development since it makes it easier to keep track of local modifications.

<pre>
$ cd kernel
$ make GIT=1 bgp-cn-linux-build
....
$ vi repo/linux-2.6.19.2-BGP-V1R3/kernel/sched.c
$ make GIT=1 bgp-cn-linux-build
....
</pre>

This will create <tt>repo/linux-2.6.19.2-BGP-V1R3</tt>, which is a git repository that is cloned from http://git.anl-external.org/bg-linux.repos/linux-2.6.19-BGP-V1R3.git/. Our http repo is read-only, so you cannot push your modifications to it. Instead, please post any patches to the [mailto:zeptoos@lists.mcs.anl.gov ZeptoOS developers mailing list].

See also the [http://bg-linux.anl-external.org/wiki/index.php/Main_Page BG-Linux page] for the details on our kernel git repository.

===Kernel config===

When one invokes <tt>make</tt> with a kernel build target for the first time, the associated kernel config file is copied to <tt>.config</tt> in the kernel build directory. <tt>config/bgp-cn-2.6.19.2-dot-config</tt> is applied to the CN Linux kernel build tree, and <tt>config/bgp-ion-2.6.16.46-dot-config</tt> is applied to the ION Linux kernel build tree.

Here are the locations of the kernel config files:
* Regular build
** work/build-2.6.19.2-BGP-V1R3/.config
** work/build-2.6.16.46-297-BGP-V1R3/.config
* GIT build
** repo/build-2.6.19.2-BGP-V1R3/.config
** repo/build-2.6.16.46-297-BGP-V1R3/.config

Please note that the kernel config file is copied only once; it is not copied again until you do a <tt>distclean</tt> or remove the files manually.

The <tt>bgp-cn-linux-menuconfig</tt> and <tt>bgp-ion-linux-menuconfig</tt> <tt>make</tt> targets invoke text-based Linux kernel configuration menus:

<pre>
$ make bgp-ion-linux-menuconfig
$ make bgp-cn-linux-menuconfig
</pre>

For GIT build:
<pre>
$ make GIT=1 bgp-ion-linux-menuconfig
$ make GIT=1 bgp-cn-linux-menuconfig
</pre>

These menu targets never update the default kernel config files from the <tt>config</tt> directory. If you want to apply a new config permanently, please copy it to the <tt>config</tt> directory by hand:
<pre>
$ cp work/build-2.6.19.2-BGP-V1R3/.config config/bgp-cn-2.6.19.2-dot-config
</pre>

==Kernel (command line) parameters==

In common server/desktop Linux environments, kernel parameters can be passed via a bootloader such as GRUB. However, the Blue Gene/P boot mechanism does not provide such a capability, so we have modified the CN Linux kernel (2.6.19) to use a kernel parameter string embedded in the kernel ELF image file itself.

One can (re)set the kernel parameters in a kernel ELF file using a command line tool <tt>zkparam.py</tt>, located in the <tt>bin</tt> subdirectory of the ZeptoOS installation directory. Here is the synopsis of the tool:

<pre>
zkparam.py <kernel_image> [options]
</pre>

If options are omitted, the tool shows the current kernel parameters.

<pre>
$ ./kernel/zkparam.py BGP-CN-zImage-with-initrd.elf zepto_console_output=2
$ ./kernel/zkparam.py BGP-CN-zImage-with-initrd.elf
Current Kernel Parameters:
zepto_console_output=2
</pre>

===ZeptoOS-specific kernel parameters===

* '''zepto_debug'''=<integer>
** Specifies the ZeptoOS kernel debug level.
** The higher the number, the more messages are generated.
** <tt>0</tt> turns off all debug messages.
** default=1
* '''flatmemsizeMB'''=<integer>
** Specifies the size of big memory in MB.
** Currently the granularity of memory size is 256 MB.
** default=256 min=256 max=1792
* '''zepto_console_output'''=<integer>
** Specifies the console output behavior.
** 0 disables console output from all compute nodes.
** 1 enables console output from the first compute node ([[FAQ#Torus rank|torus rank]] 0).
** 2 enables console output from all compute nodes.

==Log files, etc==


The compute node and I/O node logfiles have been discussed extensively in [[Testing#Log files|Testing]].

In addition to regular console logs, the kernels can also generate RAS messages, which will not appear in the log files. A command line tool named <tt>bg-listevents</tt> shows a record of RAS events; type <tt>bg-listevents -h</tt> for its command line arguments.

----
[[MPICH, DCMF, and SPI]] | [[ZeptoOS_Documentation|Top]] | [[Ramdisk]]

Kernel

2009-05-08T17:00:50Z

Iskra:

[[MPICH, DCMF, and SPI]] | [[ZeptoOS_Documentation|Top]] | [[Ramdisk]]
----

==Introduction==

We currently provide two Linux kernels:

* 2.6.19-based kernel: ZeptoOS CN kernel
** IBM V1R3 patch and ZeptoOS patch applied
** 64 KB page size and big memory region available
** Device drivers for compute node devices such as DMA, lockbox, etc.
** Allows running MPICH/DCMF code through the Zepto Compute Binary (ZCB)
** Can be used as an enhanced ION kernel

* 2.6.16-based kernel: ZeptoOS ION kernel
** IBM V1R3 patch applied
** Only minor changes compared to the IBM ION kernel.

We focus our development efforts on the 2.6.19-based kernel. It is meant primarily for the compute nodes, but can also be used on the I/O nodes. The problem is that GPFS does not work with this kernel, so we also provide the 2.6.16-based kernel which GPFS does work with.

==Kernel directory structure==

The <tt>kernel</tt> directory consists of three main subdirectories: <tt>prebuilt</tt>, <tt>config</tt>, and <tt>tarball</tt>.

<pre>
kernel
|-- prebuilt
|   |-- 2.6.16
|   |   `-- ION
|   `-- 2.6.19
|       |-- CN
|       `-- objs
|-- tarball
`-- config
</pre>

The <tt>prebuilt</tt> directory contains prebuilt kernel images and modules. While a complete prebuilt ION kernel ELF file is provided, for the CN kernel we provide intermediate object files instead. This is because we embed the CN ramdisk in the CN kernel image when building ZeptoOS, and this process requires the object files.

The <tt>tarball</tt> directory contains separate kernel tarballs for the ION and the CN Linux kernels. Technically, those tarballs are snapshots of the ZeptoOS kernel git repository. The directory might also contain a <tt>.patch</tt> file with the differences between the last snapshot and the current git HEAD, since we wanted to avoid creating a new snapshot from git for small modifications. The associated git log file can also be found in this directory. A <tt>.SNAPSHOT_HEAD</tt> file records the git revision at the time the snapshot was created; this information is used to create the patch file.

Here is a list of files for the CN kernel:

<pre>
linux-2.6.19.2-BGP-V1R3.git.log
linux-2.6.19.2-BGP-V1R3.patch
linux-2.6.19.2-BGP-V1R3.SNAPSHOT_HEAD
linux-2.6.19.2-BGP-V1R3.tar.bz2
</pre>

The <tt>config</tt> directory contains Linux kernel configs. In the case of the 2.6.19 kernel, we provide separate config files for the compute node and the I/O node.

==Building a kernel==

The <tt>Makefile</tt> in the <tt>kernel</tt> directory has many options. Just type <tt>make</tt> and it will print out a help message.

If one needs to build (or rebuild) a kernel from the source tarball, use the <tt>bgp-ion-linux-build</tt> or <tt>bgp-cn-linux-build</tt> target. By default, it extracts the ION or CN kernel tarball into a directory named <tt>work</tt>, applies a patch if there is one, and starts the kernel build. Once the kernel has been built successfully, the kernel images (in both the ZeptoOS top-level directory and the <tt>tmp</tt> directory) will be replaced with the newly built images. The ION kernel source code is extracted into <tt>work/linux-2.6.16.46-297-BGP-V1R3</tt> and the CN kernel source into <tt>work/linux-2.6.19.2-BGP-V1R3</tt>.

Here is an example of building and rebuilding the CN kernel:

<pre>
$ cd kernel
$ make bgp-cn-linux-build
....
$ ls -al ../BGP-CN-zImage-with-initrd.elf
$ vi work/linux-2.6.19.2-BGP-V1R3/kernel/sched.c
$ make bgp-cn-linux-build
....
$ ls -al ../BGP-CN-zImage-with-initrd.elf
</pre>

===Building a kernel from the ZeptoOS kernel git repository===

As mentioned earlier, the kernel tarball is used as the source by default. If instead one passes <tt>GIT=1</tt> to <tt>make</tt>, one can build directly from the ZeptoOS kernel git tree. This is very useful for kernel development since it makes it easier to keep track of local modifications.

<pre>
$ cd kernel
$ make GIT=1 bgp-cn-linux-build
....
$ vi repo/linux-2.6.19.2-BGP-V1R3/kernel/sched.c
$ make GIT=1 bgp-cn-linux-build
....
</pre>

This will create <tt>repo/linux-2.6.19.2-BGP-V1R3</tt>, which is a git repository that is cloned from http://git.anl-external.org/bg-linux.repos/linux-2.6.19-BGP-V1R3.git/. Our http repo is read-only, so you cannot push your modifications to it. Instead, please post any patches to the [mailto:zeptoos@lists.mcs.anl.gov ZeptoOS mailing list].

See also the [http://bg-linux.anl-external.org/wiki/index.php/Main_Page BG-Linux page] for the details on our kernel git repository.

===Kernel config===

When one invokes <tt>make</tt> with a kernel build target for the first time, the associated kernel config file is copied to <tt>.config</tt> in the kernel build directory. <tt>config/bgp-cn-2.6.19.2-dot-config</tt> is applied to the CN Linux kernel build tree, and <tt>config/bgp-ion-2.6.16.46-dot-config</tt> is applied to the ION Linux kernel build tree.

Here is the location of the kernel config file:
* Regular build
** work/build-2.6.19.2-BGP-V1R3/.config
** work/build-2.6.16.46-297-BGP-V1R3/.config
* GIT build
** repo/build-2.6.19.2-BGP-V1R3/.config
** repo/build-2.6.16.46-297-BGP-V1R3/.config

Please note that the kernel config file is copied only once; it is not copied again until you do a <tt>distclean</tt> or remove the files manually.

The <tt>bgp-cn-linux-menuconfig</tt> and <tt>bgp-ion-linux-menuconfig</tt> <tt>make</tt> targets invoke text-based Linux kernel configuration menus:

<pre>
$ make bgp-ion-linux-menuconfig
$ make bgp-cn-linux-menuconfig
</pre>

For GIT build:
<pre>
$ make GIT=1 bgp-ion-linux-menuconfig
$ make GIT=1 bgp-cn-linux-menuconfig
</pre>

These menu targets never update the default kernel config files from the <tt>config</tt> directory. If you want to apply a new config permanently, please copy it to the <tt>config</tt> directory by hand:
<pre>
$ cp work/build-2.6.19.2-BGP-V1R3/.config config/bgp-cn-2.6.19.2-dot-config
</pre>
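The locations listed above follow a regular pattern, so a small helper can compute them. The following is only a sketch: the function name <tt>kernel_config_path</tt> and its <tt>cn</tt>/<tt>ion</tt> and <tt>work</tt>/<tt>repo</tt> arguments are our own hypothetical naming, not part of the ZeptoOS <tt>Makefile</tt>.

```shell
# Print the path of the .config used by a kernel build, relative to the
# kernel directory. The paths are the ones listed in this section.
#   $1: "cn" or "ion" (hypothetical selector for the kernel flavor)
#   $2: "work" for a regular build, "repo" for a GIT=1 build
kernel_config_path() {
    case "$1" in
        cn)  version="2.6.19.2-BGP-V1R3" ;;
        ion) version="2.6.16.46-297-BGP-V1R3" ;;
        *)   echo "unknown node type: $1" >&2; return 1 ;;
    esac
    case "$2" in
        work|repo) ;;
        *) echo "unknown build tree: $2" >&2; return 1 ;;
    esac
    echo "$2/build-$version/.config"
}
```

With this helper, making a CN config from a GIT build permanent could look like <tt>cp "$(kernel_config_path cn repo)" config/bgp-cn-2.6.19.2-dot-config</tt>.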

==Kernel (command line) parameters==

In common server/desktop Linux environments, kernel parameters can be passed via a bootloader such as grub. However, the Blue Gene/P boot mechanism does not provide such a capability, so we have modified the CN Linux kernel (2.6.19) to use a kernel parameter string embedded in the kernel ELF image file itself.

One can (re)set the kernel parameters in a kernel ELF file using the command line tool <tt>zkparam.py</tt>, located in the <tt>bin</tt> subdirectory of the ZeptoOS installation directory. Here is the synopsis of the tool:

<pre>
zkparam.py <kernel_image> [options]
</pre>

If options are omitted, the tool shows the current kernel parameters.

<pre>
$ ./kernel/zkparam.py BGP-CN-zImage-with-initrd.elf zepto_console_output=2
$ ./kernel/zkparam.py BGP-CN-zImage-with-initrd.elf
Current Kernel Parameters:
zepto_console_output=2
</pre>

===ZeptoOS-specific kernel parameters===

* '''zepto_debug'''=<integer>
** Specifies the ZeptoOS kernel debug level.
** The higher the number, the more messages are generated.
** <tt>0</tt> turns off all debug messages.
** default=1
* '''flatmemsizeMB'''=<integer>
** Specifies the size of big memory in MB.
** Currently the granularity of memory size is 256 MB.
** default=256 min=256 max=1792
* '''zepto_console_output'''=<integer>
** Specifies the console output behavior.
** 0 disables console output from all compute nodes.
** 1 enables console output from the first compute node ([[FAQ#Torus rank|torus rank]] 0).
** 2 enables console output from all compute nodes.
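
Before embedding a <tt>flatmemsizeMB</tt> value in a kernel image, it can be worth checking it against the limits listed above. The sketch below assumes that values which are not a multiple of 256 should be rejected; the documentation only states the granularity, so the kernel may in fact round such values instead. The function name <tt>valid_flatmemsize</tt> is hypothetical.

```shell
# Succeed if $1 looks like an acceptable flatmemsizeMB value according to
# the limits listed above: min=256, max=1792, granularity 256 MB.
# Assumption: a value that is not a multiple of 256 is treated as invalid.
valid_flatmemsize() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    [ "$1" -ge 256 ] && [ "$1" -le 1792 ] && [ $(( $1 % 256 )) -eq 0 ]
}
```

For example, <tt>valid_flatmemsize 512 && ./kernel/zkparam.py BGP-CN-zImage-with-initrd.elf flatmemsizeMB=512</tt> would only touch the image if the value passes the check.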

==Log files, etc==


The compute node and I/O node log files have been discussed extensively in [[Testing#Log files|Testing]].

In addition to regular console logs, the kernels can also generate RAS messages, which will not appear in the log files. A command line tool named <tt>bg-listevents</tt> shows you a record of RAS events. Type <tt>bg-listevents -h</tt> for command line arguments.

----
[[MPICH, DCMF, and SPI]] | [[ZeptoOS_Documentation|Top]] | [[Ramdisk]]

Testing

2009-05-08T16:56:52Z

Iskra:

[[Installation]] | [[ZeptoOS_Documentation|Top]] | [[MPICH, DCMF, and SPI]]
----

Once ZeptoOS is configured and installed, it is time to test it. Here are a few trivial tests to verify that the environment is working:

==The /bin/sleep job==

If using Cobalt, submit using either of the commands below:

<pre>
$ cqsub -k <profile-name> -t <time> -n 1 /bin/sleep 3600
$ qsub --kernel <profile-name> -t <time> -n 1 /bin/sleep 3600
</pre>

If using <tt>mpirun</tt> directly, submit as follows:

<pre>
$ mpirun -verbose 1 -partition <partition-name> -np 1 -timeout <time> \
-cwd $PWD -exe /bin/sleep 3600
</pre>

This test, if successful, will verify that the ZeptoOS compute and I/O node environments are booting correctly. We deliberately chose a system binary such as <tt>/bin/sleep</tt> instead of something from a network file system to reduce the number of dependencies.

If everything works out fine, messages such as the following will be found in the error stream (''jobid''.error file if using Cobalt):

<pre>
FE_MPI (Info) : initialize() - using jobname '' provided by scheduler interface
FE_MPI (Info) : Invoking mpirun backend
FE_MPI (Info) : connectToServer() - Handshake successful
BRIDGE (Info) : rm_set_serial() - The machine serial number (alias) is BGP
FE_MPI (Info) : Preparing partition
BE_MPI (Info) : Examining specified partition
BE_MPI (Info) : Checking partition ANL-R00-M1-N12-64 initial state ...
BE_MPI (Info) : Partition ANL-R00-M1-N12-64 initial state = FREE ('F')
BE_MPI (Info) : Checking partition owner...
BE_MPI (Info) : Setting new owner
BE_MPI (Info) : Initiating boot of the partition
BE_MPI (Info) : Waiting for partition ANL-R00-M1-N12-64 to boot...
BE_MPI (Info) : Partition is ready
BE_MPI (Info) : Done preparing partition
FE_MPI (Info) : Adding job
BE_MPI (Info) : Adding job to database...
FE_MPI (Info) : Job added with the following id: 98461
FE_MPI (Info) : Starting job 98461
FE_MPI (Info) : Waiting for job to terminate
BE_MPI (Info) : IO - Threads initialized
BE_MPI (Info) : I/O input runner thread terminated
</pre>

(we stripped the timestamp prefixes to make the lines shorter)

If these messages are immediately followed by error messages, then there is a problem. One common instance would be:

<pre>
BE_MPI (Info) : I/O output runner thread terminated
BE_MPI (Info) : Job 98463 switched to state ERROR ('E')
BE_MPI (ERROR): Job execution failed
[...]
BE_MPI (ERROR): The error message in the job record is as follows:
BE_MPI (ERROR): "Load failed on 172.16.3.11: Program segment is not 1MB aligned"
</pre>

This error indicates that the job was submitted to the default software environment with the light-weight kernel, not to ZeptoOS (at the very least, the default I/O node ramdisk was used). Go back to the [[Installation#Setting up a kernel profile|Installation]] section to fix the problem. Information from the system log files (see below) can be useful to diagnose the problem.
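
This particular failure is easy to check for automatically. The following sketch reads a job's error stream on standard input and looks for the message quoted above; the function name <tt>booted_default_environment</tt> is our own, and the check is only as reliable as the exact wording of that message.

```shell
# Succeed if the job error stream on stdin contains the load failure that
# indicates the job was not booted into the ZeptoOS environment.
booted_default_environment() {
    grep -q 'Program segment is not 1MB aligned'
}
```

With Cobalt, one could run <tt>booted_default_environment < ''jobid''.error && echo "check the kernel profile"</tt>.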

==Log files==

===I/O node===

Every I/O node has its own log file located in <tt>/bgsys/logs/BGP/</tt>, with a name such as <tt>R*-M*-N*-J*.log</tt>. This name will generally correspond to the name of the partition where the job was running. Above, our job ran on <tt>ANL-R00-M1-N12-64</tt> (we could see that in the error stream; Cobalt users can also use <tt>[c]qstat</tt>); a corresponding I/O node log file on Argonne machines will be <tt>R00-M1-N12-J00.log</tt>. This is what a log file from a successful ZeptoOS boot looks like:

<pre>Linux version 2.6.16.46-297 (geeko@buildhost) (gcc version 4.1.2 (BGP)) #1 SMP Wed Apr 22 15:04:42 CDT 2009
Kernel command line: console=bgcons root=/dev/ram0 lpj=8500000
init started: BusyBox v1.4.2 (2008-04-10 05:20:01 UTC) multi-call binary
Starting RPC portmap daemon..done
eth0: Link status [RX+,TX+]
mount server reported tcp not available, falling back to udp
mount: RPC: Remote system error - No route to host
Zepto ION startup-00
eth0 Link encap:Ethernet HWaddr 00:14:5E:7D:0C:57
inet addr:172.16.3.15 Bcast:172.31.255.255 Mask:255.240.0.0
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:880 errors:0 dropped:0 overruns:0 frame:0
TX packets:1009 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:3878545 (3.6 Mb) TX bytes:151458 (147.9 Kb)
Interrupt:32
Zepto ION startup-00 done
done
Starting syslog servicesDec 31 18:00:36 ion-15 syslogd 1.4.1: restart.
done
Starting network time protocol daemon (NTPD) using 172.17.3.1
May 1 12:57:11 ion-15 ntpdate[642]: step time server 172.17.3.1 offset 1241200617.470271 sec
May 1 12:57:11 ion-15 ntpd[653]: ntpd 4.2.0a@1.1196-r Sat Oct 4 00:01:53 UTC 2008 (1)
May 1 12:57:11 ion-15 ntpd[653]: precision = 1.000 usec
May 1 12:57:11 ion-15 ntpd[653]: Listening on interface wildcard, 0.0.0.0#123
May 1 12:57:11 ion-15 ntpd[653]: Listening on interface eth0, 172.16.3.15#123
May 1 12:57:11 ion-15 ntpd[653]: Listening on interface lo, 127.0.0.1#123
May 1 12:57:11 ion-15 ntpd[653]: kernel time sync status 0040
done
Enabling ssh
Mounting site filesystems
done
Loading PVFS2 kernel module done
Sleeping 0 seconds before starting PVFS done
Starting PVFS2 client done
Sleeping 10 seconds before mounting PVFS
done
Mounting PVFS2 filesystems done
Starting SSH daemonMay 1 12:57:21 ion-15 sshd[833]: Server listening on 0.0.0.0 port 22.
done
Zepto ION startup-12
Zepto ION startup-12 done
Starting GPFS
May 1 12:57:26 ion-15 syslogd 1.4.1: restart.
/etc/init.d/rc3.d/S40gpfs: GPFS is ready on I/O node ion-15 : 172.16.3.15 : R00-M1-N12-J00
ln: creating symbolic link `/home/acherryl/acherryl' to `/gpfs/home/acherryl': File exists
ln: creating symbolic link `/home/bgpadmin/bgpadmin' to `/gpfs/home/bgpadmin': File exists
ln: creating symbolic link `/home/davidr/davidr' to `/gpfs/home/davidr': File exists
ln: creating symbolic link `/home/scullinl/scullinl' to `/gpfs/home/scullinl': File exists
Starting ZOID...
done
Zepto ION startup-99
Zepto ION startup-99 done
May 1 17:57:59 ion-15 init: Starting pid 2823, console /dev/console: '/bin/sh'
BusyBox v1.4.2 (2008-10-04 00:02:35 UTC) Built-in shell (ash)
Enter 'help' for a list of built-in commands.
/bin/sh: can't access tty; job control turned off
~ #
</pre>

(again, we stripped the prefixes to make the lines shorter)

Messages such as <tt>Zepto ION startup</tt> or <tt>Starting ZOID</tt> clearly indicate that a ZeptoOS I/O node ramdisk is being used. If instead one mistakenly boots with the default ramdisk, this could be recognized by messages such as:

<pre>
Starting CIO services
[ciod:initialized] done
</pre>

(<tt>ciod</tt> is ''never'' started when using the ZeptoOS compute node Linux)
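
The distinction described above can be scripted. This is a minimal sketch that classifies an I/O node log read from standard input using only the marker messages mentioned in this section; the function name <tt>ion_ramdisk_kind</tt> and its three output words are hypothetical.

```shell
# Read an I/O node log on stdin and print which ramdisk it appears to
# come from: "zeptoos", "default", or "unknown".
ion_ramdisk_kind() {
    log=$(cat)
    if printf '%s\n' "$log" | grep -q -e 'Zepto ION startup' -e 'Starting ZOID'; then
        echo zeptoos
    elif printf '%s\n' "$log" | grep -q 'Starting CIO services'; then
        echo default
    else
        echo unknown
    fi
}
```

Because log files accumulate output from successive boots, it is probably best to feed the function only the tail of the file, e.g. <tt>tail -n 100 /bgsys/logs/BGP/R00-M1-N12-J00.log | ion_ramdisk_kind</tt> (the line count here is an arbitrary choice).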

In addition to verifying the ramdisk, the correct I/O node kernel can also be verified using the I/O node logfile by checking the kernel build timestamp in the first line of the boot log. As of this writing the default kernel on the Argonne machines has a timestamp of <tt>Wed Oct 29 18:51:19 UTC 2008</tt>; as can be seen above, the ZeptoOS kernel was built more recently.
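
To compare timestamps without reading the log by eye, the build timestamp can be cut out of the <tt>Linux version</tt> line. The sketch below assumes the line has the same layout as in the sample log above (the timestamp follows the <tt>SMP</tt> keyword); the function name <tt>kernel_build_timestamp</tt> is our own.

```shell
# Print the kernel build timestamp from the first "Linux version" line
# found on stdin, i.e. everything after the "SMP" keyword.
# Prints nothing if no such line (or no " SMP " marker) is present.
kernel_build_timestamp() {
    sed -n '/Linux version/{s/.* SMP //p;q;}'
}
```

The output can then be compared against the timestamp of the default kernel quoted above.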

===Compute node===

All the compute nodes on the machine share the same MMCS log file, located in <tt>/bgsys/logs/BGP/</tt>. The name of the log file is not fixed (it contains a timestamp), but <tt><service_node>-bgdb0-mmcs_db_server-current.log</tt> always links to the current file. Because the file is shared with other jobs, we recommend grepping it for the user name, the partition name, or both.
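
A minimal sketch of such a filter follows. It keeps every line that mentions either the user name or the partition name, since, judging by the samples on this page, some relevant lines carry only one of the two. The function name <tt>mmcs_grep</tt> is hypothetical.

```shell
# Filter an MMCS log (stdin), keeping the lines that mention the given
# user name or the given partition name as fixed strings.
#   $1: user name, $2: partition name
mmcs_grep() {
    user=$1
    partition=$2
    grep -F -e "$user" -e "$partition"
}
```

For example: <tt>mmcs_grep iskra ANL-R00-M1-N12-64 < /bgsys/logs/BGP/<service_node>-bgdb0-mmcs_db_server-current.log</tt>.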

A correct boot log when booting ZeptoOS will look something like this:

<pre>
iskra:ANL-R00-M1-N12-64 {20}.0: Common Node Services V1R3M0 (efix:0)
iskra:ANL-R00-M1-N12-64 {20}.0: Licensed Machine Code - Property of IBM.
iskra:ANL-R00-M1-N12-64 {20}.0: Blue Gene/P Licensed Machine Code.
iskra:ANL-R00-M1-N12-64 {20}.0: Copyright IBM Corp., 2006, 2007 All Rights Reserved.
iskra:ANL-R00-M1-N12-64 {20}.0: Z: Zepto Linux Kernel relocating CNS... dst=80280000 src=fff40000 size=262144
iskra:ANL-R00-M1-N12-64 {20}.0: Z: CNS is successfully relocated to 00280000 in physical memory
iskra:ANL-R00-M1-N12-64 {20}.0: Linux version 2.6.19.2-g66cbca2d (kazutomo@login1) (gcc version 4.1.2 (BGP)) #12 SMP Tue Apr 21 12:58:11 CDT 2009
iskra:ANL-R00-M1-N12-64 {20}.0: Zone PFN ranges:
iskra:ANL-R00-M1-N12-64 {20}.0: DMA 0 -> 28672
iskra:ANL-R00-M1-N12-64 {20}.0: Normal 28672 -> 28672
iskra:ANL-R00-M1-N12-64 {20}.0: early_node_map[1] active PFN ranges
iskra:ANL-R00-M1-N12-64 {20}.1: 0: 0 -> 28672
iskra:ANL-R00-M1-N12-64 {20}.1: Built 1 zonelists. Total pages: 28658
iskra:ANL-R00-M1-N12-64 {20}.1: Kernel command line: console=bgcons root=/dev/ram0 lpj=8500000
iskra:ANL-R00-M1-N12-64 {20}.1: PID hash table entries: 4096 (order: 12, 16384 bytes)
iskra:ANL-R00-M1-N12-64 {20}.0: Dentry cache hash table entries: 262144 (order: 4, 1048576 bytes)
iskra:ANL-R00-M1-N12-64 {20}.0: Inode-cache hash table entries: 131072 (order: 3, 524288 bytes)
iskra:ANL-R00-M1-N12-64 {20}.0: Memory: 1826560k available (1408k kernel code, 832k data, 192k init, 0k highmem)
iskra:ANL-R00-M1-N12-64 {20}.0: Calibrating delay loop (skipped)... 1700.00 BogoMIPS preset
iskra:ANL-R00-M1-N12-64 {20}.0: Mount-cache hash table entries: 8192
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 1 done callin...
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 1 done setup...
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 1 done timebase take...
iskra:ANL-R00-M1-N12-64 {20}.0: Processor 1 found.
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 2 done callin...
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 2 done setup...
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 2 done timebase take...
iskra:ANL-R00-M1-N12-64 {20}.0: Processor 2 found.
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 3 done callin...
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 3 done setup...
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 3 done timebase take...
iskra:ANL-R00-M1-N12-64 {20}.0: Processor 3 found.
iskra:ANL-R00-M1-N12-64 {20}.0: Brought up 4 CPUs
iskra:ANL-R00-M1-N12-64 {20}.0: migration_cost=0
iskra:ANL-R00-M1-N12-64 {20}.0: checking if image is initramfs... it is
iskra:ANL-R00-M1-N12-64 {20}.0: Freeing initrd memory: 2575k freed
iskra:ANL-R00-M1-N12-64 {20}.0: NET: Registered protocol family 16
iskra:ANL-R00-M1-N12-64 {20}.0: NET: Registered protocol family 2
iskra:ANL-R00-M1-N12-64 {20}.0: IP route cache hash table entries: 16384 (order: 0, 65536 bytes)
iskra:ANL-R00-M1-N12-64 {20}.0: TCP established hash table entries: 65536 (order: 3, 524288 bytes)
iskra:ANL-R00-M1-N12-64 {20}.0: TCP bind hash table entries: 32768 (order: 2, 262144 bytes)
iskra:ANL-R00-M1-N12-64 {20}.0: TCP: Hash tables configured (established 65536 bind 32768)
iskra:ANL-R00-M1-N12-64 {20}.0: TCP reno registered
iskra:ANL-R00-M1-N12-64 {20}.0: fuse init (API version 7.7)
iskra:ANL-R00-M1-N12-64 {20}.0: io scheduler noop registered (default)
iskra:ANL-R00-M1-N12-64 {20}.0: RAMDISK driver initialized: 16 RAM disks of 32768K size 1024 blocksize
iskra:ANL-R00-M1-N12-64 {20}.0: tun: Universal TUN/TAP device driver, 1.6
iskra:ANL-R00-M1-N12-64 {20}.0: tun: (C) 1999-2004 Max Krasnyansky <maxk@qualcomm.com>
iskra:ANL-R00-M1-N12-64 {20}.0: TCP cubic registered
iskra:ANL-R00-M1-N12-64 {20}.0: NET: Registered protocol family 1
iskra:ANL-R00-M1-N12-64 {20}.0: NET: Registered protocol family 17
iskra:ANL-R00-M1-N12-64 {20}.0: NET: Registered protocol family 15
iskra:ANL-R00-M1-N12-64 {20}.0: Freeing unused kernel memory: 192k init
iskra:ANL-R00-M1-N12-64 {20}.0: init started: BusyBox(for ZeptoOS Compute Node) v1.12.1 (2009-04-21 16:08:55 CDT)
</pre>

This is very easy to distinguish from a boot log of the default light-weight kernel, which will consist of the first four lines ''only''.

The MMCS log file contains other useful information besides the boot log of the compute nodes. Before the kernel starts booting, the following messages related to the newly submitted job can be found there:

<pre>
DBBlockCmd DatabaseBlockCommandThread started: block ANL-R00-M1-N12-64, user iskra, action 1
DBBlockCmd setusername iskra
iskra db_allocate ANL-R00-M1-N12-64
iskra DBConsoleController::setAllocating() ANL-R00-M1-N12-64
iskra block state C
iskra DBConsoleController::addBlock(ANL-R00-M1-N12-64)
iskra:ANL-R00-M1-N12-64 BlockController::connect()
iskra:ANL-R00-M1-N12-64 connecting to mcServer at 127.0.0.1:1206
Connected to MCServer as iskra@sn1. Client version 3. Server version 3 on fd 101
iskra:ANL-R00-M1-N12-64 connected to mcServer
iskra:ANL-R00-M1-N12-64 mcServer target set ANL-R00-M1-N12-64 created
iskra:ANL-R00-M1-N12-64 mcServer target set ANL-R00-M1-N12-64 opened
iskra:ANL-R00-M1-N12-64 {0} I/O log file: /bgsys/logs/BGP/R00-M1-N12-J00.log
iskra:ANL-R00-M1-N12-64 MailboxListener starting
iskra:ANL-R00-M1-N12-64 DBConsoleController::doneAllocating() ANL-R00-M1-N12-64
iskra:ANL-R00-M1-N12-64 BlockController::boot_block \
uloader=/bgsys/argonne-utils/partitions/ANL-R00-M1-N12-64/uloader \
cnload=/bgsys/argonne-utils/partitions/ANL-R00-M1-N12-64/CNS,/bgsys/argonne-utils/partitions/ANL-R00-M1-N12-64/CNK \
ioload=/bgsys/argonne-utils/partitions/ANL-R00-M1-N12-64/CNS,/bgsys/argonne-utils/partitions/ANL-R00-M1-N12-64/INK,/bgsys/argonne-utils/partitions/ANL-R00-M1-N12-64/ramdisk
iskra:ANL-R00-M1-N12-64 boot_block cookie: 587867023 compute_nodes: 64 io_nodes: 1
</pre>

Of particular relevance is the pathname to the I/O node log file(s) (if it cannot be easily guessed from the partition name) and the pathnames to the kernels and ramdisks used to boot the partition.
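
The I/O node log file pathname can be pulled out of the MMCS log mechanically. This sketch relies on the <tt>I/O log file:</tt> message shown in the sample above; the function name <tt>io_log_files</tt> is our own.

```shell
# Print the I/O node log file pathnames announced in an MMCS log (stdin),
# one per line, taken from the "I/O log file:" messages.
io_log_files() {
    sed -n 's/.*I\/O log file: *//p'
}
```

It is meant to be combined with a filter on the partition name, so that only the log files of one's own partition are listed.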

After the kernel boot log, the log file will also contain information about subsequent phases of starting a job:

<pre>
iskra:ANL-R00-M1-N12-64 I/O node initialized: R00-M1-N12-J00
iskra:ANL-R00-M1-N12-64 DBBlockController::waitBoot(ANL-R00-M1-N12-64) block initialization successful
iskra DatabaseBlockCommandThread stopped
DBJobCmd DatabaseJobCommandThread started: job 98461, user iskra, action 1
DBJobCmd setusername iskra
iskra Starting Job 98461
New thread 4398305505840, for jobid 98461
selectBlock(): ANL-R00-M1-N12-64 iskra(1) connected state: I owner: iskra
ANL-R00-M1-N12-64 Jobid is 98461, homedir is /gpfs/home/iskra
ANL-R00-M1-N12-64 persist: 1
ANL-R00-M1-N12-64 connecting to mpirun...
ANL-R00-M1-N12-64 setting mpirun stream, fd=386
ANL-R00-M1-N12-64 contacting control node 0 at 172.16.3.15:7000
ANL-R00-M1-N12-64 connected to control node 0 at 172.16.3.15:7000
ANL-R00-M1-N12-64 Job::load() /bin/sleep
ANL-R00-M1-N12-64 Job loaded: 98461
ANL-R00-M1-N12-64 About to start /bin/sleep
ANL-R00-M1-N12-64 Job 98461 set to RUNNING
iskra:ANL-R00-M1-N12-64 {20}.0: floating point used in kernel (task=8080cfe0, pc=80017064)
</pre>

==Interactive login==

We are assuming at this point that launching <tt>/bin/sleep</tt> has been successful and that the "job" is running. We can now start an interactive session on our BG/P resources. Probably the most complicated part of this operation is finding the IP address of the I/O node(s). The allocation of I/O nodes to partitions is fixed, so on a small machine one could simply make a list. This information is also available in the log files discussed above.

The IP address is printed near the top of the I/O node boot log, as part of the interface configuration of the Ethernet device:

<pre>
eth0 Link encap:Ethernet HWaddr 00:14:5E:7D:0C:57
inet addr:172.16.3.15 Bcast:172.31.255.255 Mask:255.240.0.0
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:880 errors:0 dropped:0 overruns:0 frame:0
TX packets:1009 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:3878545 (3.6 Mb) TX bytes:151458 (147.9 Kb)
Interrupt:32
</pre>

In this case, the address is <tt>172.16.3.15</tt> (the <tt>inet addr</tt> value).

The IP address is also available from the MMCS log file:

<pre>
ANL-R00-M1-N12-64 contacting control node 0 at 172.16.3.15:7000
</pre>

With larger partitions that include multiple I/O nodes, querying the MMCS logfile is probably better, as it will list all the addresses.
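
A sketch of such a query is shown below. It assumes that every I/O node of the partition shows up in a <tt>contacting control node</tt> line formatted like the one above; the function name <tt>io_node_addresses</tt> is hypothetical.

```shell
# Print the unique I/O node IP addresses mentioned in the
# "contacting control node N at IP:port" lines of an MMCS log (stdin).
io_node_addresses() {
    sed -n 's/.*contacting control node [0-9]* at \([0-9.]*\):[0-9]*.*/\1/p' | sort -u
}
```

The input should first be restricted to the lines of the partition of interest, since the MMCS log is shared by all jobs on the machine.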

Once the IP address is known, one can simply use SSH:

<pre>
iskra@login1.surveyor:~> ssh 172.16.3.15

BusyBox v1.4.2 (2008-10-04 00:02:35 UTC) Built-in shell (ash)
Enter 'help' for a list of built-in commands.

/gpfs/home/iskra $ hostname
ion-15
/gpfs/home/iskra $
</pre>

If everything is configured correctly, SSH will only let in root and the partition owner; no other unprivileged user will be allowed on the node. However, this might require site-specific customizations to work properly. To enable access for the partition owner, an administrator might need to make adjustments to [[ZOID#The /bin.rd/update_passwd_file.sh file|update_passwd_file.sh]]. To enable password-less login for the partition owners without requiring them to set up personal SSH key pairs, we recommend adding the names of the front end nodes to the <tt>shosts.equiv</tt> file, found in <tt>ramdisk/ION/ramdisk-add/etc/ssh.zepto/</tt> (it is empty by default; remember to use the names from the network that interconnects front end and I/O nodes, which might be different from hostnames, e.g., at Argonne we need to add the <tt>-data</tt> suffix to the hostnames). Until this has all been set up, one might prefer to log on as root (<tt>ssh -l root</tt>), passing the password provided when [[Configuration#Building|building]] the ZeptoOS environment.

Also, even when the partition owner is correctly set up, there will be a time window while booting the I/O node when the SSH daemon is already running, but a job has not yet been started; during that window, the partition owner cannot log on. If that happens, wait a few seconds and try again.

Here is part of the <tt>ps</tt> output from an I/O node:

<pre>
/gpfs/home/iskra $ ps -ef
UID PID PPID C STIME TTY TIME CMD
[...]
65534 98 1 0 16:09 ? 00:00:00 /sbin/portmap
root 108 19 0 16:09 ? 00:00:00 [rpciod/0]
root 109 19 0 16:09 ? 00:00:00 [rpciod/1]
root 110 19 0 16:09 ? 00:00:00 [rpciod/2]
root 111 19 0 16:09 ? 00:00:00 [rpciod/3]
root 570 1 0 16:09 ? 00:00:00 /sbin/syslogd
root 577 1 0 16:09 ? 00:00:00 /sbin/klogd -c 1 -x -x
ntp 653 1 0 16:09 ? 00:00:00 /usr/sbin/ntpd -p /var/run/ntpd.
root 688 1 0 16:09 ? 00:00:00 [lockd]
root 775 1 0 16:09 ? 00:00:00 /bgsys/iosoft/pvfs2/sbin/pvfs2-c
root 776 775 0 16:09 ? 00:00:00 pvfs2-client-core --child -a 5 -
root 833 1 0 16:10 ? 00:00:00 /usr/sbin/sshd -o PidFile=/var/r
root 1016 1 0 16:10 ? 00:00:00 /bin/ksh /usr/lpp/mmfs/bin/runmm
root 1079 1 0 16:10 ? 00:00:00 [nfsWatchKproc]
root 1080 1 0 16:10 ? 00:00:00 [gpfsSwapdKproc]
root 1146 1016 0 16:10 ? 00:00:01 /usr/lpp/mmfs/bin//mmfsd
root 1153 1 0 16:10 ? 00:00:00 [mmkproc]
root 1152 1 0 16:10 ? 00:00:00 [mmkproc]
root 1154 1 0 16:10 ? 00:00:00 [mmkproc]
iskra 2810 1 98 16:10 ? 00:04:09 /bin.rd/zoid -a 8 -m unix_impl.s
root 2823 1 0 16:10 ? 00:00:00 /bin/sh
root 3328 833 0 16:10 ? 00:00:00 sshd: iskra [priv]
iskra 3332 3328 0 16:10 ? 00:00:00 sshd: iskra@ttyp0
iskra 3333 3332 0 16:10 ttyp0 00:00:00 -sh
iskra 3346 3333 0 16:14 ttyp0 00:00:00 ps -ef
/gpfs/home/iskra $
</pre>

The I/O nodes run a small Linux setup with the root file system in the ramdisk. Custom processes can be started, just like on any ordinary Linux node. In the example above, it is mostly a few system daemons and the remote file system clients (GPFS, PVFS). Please verify at this stage that the remote file systems have been mounted correctly.

One custom process running on the node is [[ZOID]], the I/O forwarding and job control daemon, which enables the communication with the compute nodes. One of the facilities offered by ZOID is IP forwarding between the I/O nodes and the compute nodes, implemented using the virtual network tunneling device available in Linux:

<pre>
/gpfs/home/iskra $ ifconfig tun0
tun0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
inet addr:192.168.1.254 P-t-P:192.168.1.254 Mask:255.255.255.255
UP POINTOPOINT RUNNING NOARP MULTICAST MTU:65535 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:500
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
/gpfs/home/iskra $
</pre>

At least on Argonne machines, with a 64:1 ratio of compute nodes to I/O nodes, compute nodes have addresses <tt>192.168.1.1</tt> to <tt>192.168.1.64</tt> (the last octet of the address is the [[FAQ#Pset rank|pset rank]]). Somewhat confusingly, the first compute node (compute node <tt>0</tt>) has IP address <tt>192.168.1.64</tt>, so if one submits a one-node job as we did, that is the IP address that needs to be used to log on to that sole running compute node. On a machine with a 16:1 ratio of compute nodes to I/O nodes, the first compute node has IP address <tt>192.168.1.16</tt>. If you are beginning to see a pattern here, then be advised that with a 64:1 ratio, the IP address of the second compute node is... <tt>192.168.1.59</tt>. Do not blame us for this chaos – blame IBM :-).

The compute nodes are running a <tt>telnet</tt> daemon, and no password is required to log on to them:

<pre>
/gpfs/home/iskra $ telnet 192.168.1.64

Entering character mode
Escape character is '^]'.

BusyBox(for ZeptoOS Compute Node) v1.12.1 (2009-04-21 16:08:55 CDT) built-in shell (ash)
Enter 'help' for a list of built-in commands.

~ #
</pre>

The IP address of the I/O node on this virtual network is <tt>192.168.1.254</tt>. The network is local to each I/O node, so for larger partitions with more than one I/O node, there will be multiple distinct virtual networks that cannot communicate with each other, and the IP addresses will duplicate.

Here is part of the <tt>ps</tt> output from a compute node:

<pre>
~ # ps -ef
PID USER VSZ STAT COMMAND
[...]
34 root 5440 S /bin/sh /etc/init.d/rc.sysinit
44 root 5504 S /sbin/telnetd -l /bin/sh
47 root 6528 S /sbin/inetd
48 root 46400 R N /sbin/control
62 root 7872 S /bin/zoid-fuse -o allow_other -s /fuse
116 root 5248 S /bin/sleep 3600
118 root 5504 S /bin/sh
</pre>

Compute nodes have an even more stripped-down environment than the I/O nodes. There are no user accounts – everything runs as root, including the application processes. This is not a security concern, because the only practical way for a compute node to communicate with the outside world is through the I/O node, and I/O nodes ''do'' enforce user-level access control.

There are two custom processes running on each compute node:

'''control''' is a job management daemon responsible for tasks such as the launching of application processes, for the forwarding of stdin/out/err data, and for the management of the virtual network tunneling device from the compute node side. Do not interfere with this process in any way; this would likely make the node inaccessible.

'''zoid-fuse''' is a FUSE ([http://fuse.sourceforge.net/ Filesystem in Userspace]) client responsible for making the filesystems from the I/O nodes available to ordinary POSIX-compliant processes running on the compute nodes. The whole filesystem namespace from the I/O nodes is made available on the compute nodes under <tt>/fuse/</tt>, and symbolic links such as <tt>/home -> /fuse/home</tt> are set up to keep the front end and I/O node pathnames valid on the compute nodes. Please verify that this is correctly set up. We do not foresee a need to change this setup, but should that prove necessary, the responsible <tt>fuse-start</tt> and <tt>fuse-stop</tt> scripts can be found under <tt>ramdisk/CN/tree/bin/</tt>.

==Shell script job==

Assuming that the above steps have been successful, one can now test running a simple job from a network filesystem, such as one's home directory.

Here is a sample shell script to try:

<pre>
#!/bin/sh

. /proc/personality.sh

while true; do
echo "Node $BG_RANK_IN_PSET running (stdout)"
echo "Node $BG_RANK_IN_PSET running (stderr)" 1>&2
sleep 10
done
</pre>

(please see the [[FAQ#Pset rank|FAQ]] for the explanation of <tt>/proc/personality.sh</tt> and <tt>BG_RANK_IN_PSET</tt>)

Create the script file on a network filesystem that is available on the I/O nodes, set the executable bit (<tt>chmod 755</tt>) and submit it. Verify that the script starts correctly and that at least the standard error output is visible immediately. The script prints a line of output from each node every ten seconds. It does so both to the standard output and to the standard error, because, depending on software configuration, the standard output stream could be buffered on the service node. If that is the case, kill the job and verify that the standard output data did appear.

==MPI and OpenMP jobs==

The final tests involve parallel programming jobs, respectively MPI and OpenMP. Use the test programs provided with the distribution. From the top level directory:
<pre>
$ cd comm/testcodes
</pre>

===Compiling===

The programs can be compiled on a login node using:

<pre>
$ /path/to/install/bin/zmpicc -o mpi-test-linux mpi-test.c
$ /path/to/install/bin/zmpixlc_r -qsmp=omp -o omp-test-linux omp-test.c
</pre>

===Submitting===

Submit the MPI test like any other job; use one of the below commands:

<pre>
$ cqsub -k <profile-name> -t <time> -n <number-of-processes> $PWD/mpi-test-linux
$ qsub --kernel <profile-name> -t <time> -n <number-of-processes> $PWD/mpi-test-linux
$ mpirun -verbose 1 -partition <partition-name> -np <number-of-processes> -timeout <time> \
-cwd $PWD -exe $PWD/mpi-test-linux
</pre>

For the OpenMP test, we pass the number of OpenMP threads to use in the <tt>OMP_NUM_THREADS</tt> environment variable:

<pre>
$ cqsub -k <profile-name> -t <time> -n 1 -e OMP_NUM_THREADS=<num> $PWD/omp-test-linux
$ qsub --kernel <profile-name> -t <time> -n 1 --env OMP_NUM_THREADS=<num> $PWD/omp-test-linux
$ mpirun -verbose 1 -partition <partition-name> -np 1 -timeout <time> \
-cwd $PWD -env OMP_NUM_THREADS=<num> -exe $PWD/omp-test-linux
</pre>

The MPI test benchmarks the performance of various MPI operations. The OpenMP test is just a parallel "Hello world".

'''Note:''' see the [[FAQ#Why large MPI processes do not work|FAQ]] if submitting larger MPI processes does not work properly.

----
[[Installation]] | [[ZeptoOS_Documentation|Top]] | [[MPICH, DCMF, and SPI]]

Testing

2009-05-08T15:33:59Z

Iskra:

[[Installation]] | [[ZeptoOS_Documentation|Top]] | [[MPICH, DCMF, and SPI]]
----

Once ZeptoOS is configured and installed, it is time to test it. Here are a few trivial tests to verify that the environment is working:

==The /bin/sleep job==

If using Cobalt, submit using either of the commands below:

<pre>
$ cqsub -k <profile-name> -t <time> -n 1 /bin/sleep 3600
$ qsub --kernel <profile-name> -t <time> -n 1 /bin/sleep 3600
</pre>

If using <tt>mpirun</tt> directly, submit as follows:

<pre>
$ mpirun -verbose 1 -partition <partition-name> -np 1 -timeout <time> \
-cwd $PWD -exe /bin/sleep 3600
</pre>

This test, if successful, will verify that the ZeptoOS compute and I/O node environments are booting correctly. We deliberately chose a system binary such as <tt>/bin/sleep</tt> instead of something from a network filesystem so that even if the network filesystem does not come up for some reason, the test can still succeed.

If everything works out fine, messages such as the following will be found in the error stream (''jobid''.error file if using Cobalt):

<pre>
FE_MPI (Info) : initialize() - using jobname '' provided by scheduler interface
FE_MPI (Info) : Invoking mpirun backend
FE_MPI (Info) : connectToServer() - Handshake successful
BRIDGE (Info) : rm_set_serial() - The machine serial number (alias) is BGP
FE_MPI (Info) : Preparing partition
BE_MPI (Info) : Examining specified partition
BE_MPI (Info) : Checking partition ANL-R00-M1-N12-64 initial state ...
BE_MPI (Info) : Partition ANL-R00-M1-N12-64 initial state = FREE ('F')
BE_MPI (Info) : Checking partition owner...
BE_MPI (Info) : Setting new owner
BE_MPI (Info) : Initiating boot of the partition
BE_MPI (Info) : Waiting for partition ANL-R00-M1-N12-64 to boot...
BE_MPI (Info) : Partition is ready
BE_MPI (Info) : Done preparing partition
FE_MPI (Info) : Adding job
BE_MPI (Info) : Adding job to database...
FE_MPI (Info) : Job added with the following id: 98461
FE_MPI (Info) : Starting job 98461
FE_MPI (Info) : Waiting for job to terminate
BE_MPI (Info) : IO - Threads initialized
BE_MPI (Info) : I/O input runner thread terminated
</pre>

(we stripped the timestamp prefixes to make the lines shorter)

If these messages are immediately followed by further error messages, then there is a problem. One common instance would be:

<pre>
BE_MPI (Info) : I/O output runner thread terminated
BE_MPI (Info) : Job 98463 switched to state ERROR ('E')
BE_MPI (ERROR): Job execution failed
[...]
BE_MPI (ERROR): The error message in the job record is as follows:
BE_MPI (ERROR): "Load failed on 172.16.3.11: Program segment is not 1MB aligned"
</pre>

This error indicates that the job was submitted to the default software environment, not to ZeptoOS (at the very least, the default I/O node ramdisk was used). You need to go back to the [[Installation#Setting up a kernel profile|Installation]] section to fix the problem. Information from the system log files can be useful to diagnose the problem.
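With Cobalt, a quick scan of the error stream is enough to catch this particular failure. The following is a minimal sketch only: the function name <tt>check_job_error</tt> is our own invention for illustration, and the only pattern it looks for is the message quoted above.

```shell
#!/bin/sh
# check_job_error FILE   (hypothetical helper, not part of ZeptoOS)
# Scan a job's error stream (e.g. the <jobid>.error file written by Cobalt)
# for the load failure quoted above. Returns 1 if the failure is present.
check_job_error() {
    if grep -q 'Program segment is not 1MB aligned' "$1"; then
        echo "default environment booted: check the kernel profile"
        return 1
    fi
    echo "no alignment error found"
}
```

A call such as <tt>check_job_error 98463.error</tt> would report the problem for the failed job shown above, assuming the error file is named after the job id.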

==Log files==

===I/O node===

Every I/O node has its own log file located in <tt>/bgsys/logs/BGP/</tt>, with a name such as <tt>R*-M*-N*-J*.log</tt>. This name will generally correspond to the name of the partition where the job was running. Above, our job ran on <tt>ANL-R00-M1-N12-64</tt> (we could see that in the error stream; Cobalt users can also use <tt>[c]qstat</tt>); a corresponding I/O node log file on Argonne machines will be <tt>R00-M1-N12-J00.log</tt>. This is what a log file from a successful ZeptoOS boot looks like:

<pre>Linux version 2.6.16.46-297 (geeko@buildhost) (gcc version 4.1.2 (BGP)) #1 SMP Wed Apr 22 15:04:42 CDT 2009
Kernel command line: console=bgcons root=/dev/ram0 lpj=8500000
init started: BusyBox v1.4.2 (2008-04-10 05:20:01 UTC) multi-call binary
Starting RPC portmap daemon..done
eth0: Link status [RX+,TX+]
mount server reported tcp not available, falling back to udp
mount: RPC: Remote system error - No route to host
Zepto ION startup-00
eth0 Link encap:Ethernet HWaddr 00:14:5E:7D:0C:57
inet addr:172.16.3.15 Bcast:172.31.255.255 Mask:255.240.0.0
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:880 errors:0 dropped:0 overruns:0 frame:0
TX packets:1009 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:3878545 (3.6 Mb) TX bytes:151458 (147.9 Kb)
Interrupt:32
Zepto ION startup-00 done
done
Starting syslog servicesDec 31 18:00:36 ion-15 syslogd 1.4.1: restart.
done
Starting network time protocol daemon (NTPD) using 172.17.3.1
May 1 12:57:11 ion-15 ntpdate[642]: step time server 172.17.3.1 offset 1241200617.470271 sec
May 1 12:57:11 ion-15 ntpd[653]: ntpd 4.2.0a@1.1196-r Sat Oct 4 00:01:53 UTC 2008 (1)
May 1 12:57:11 ion-15 ntpd[653]: precision = 1.000 usec
May 1 12:57:11 ion-15 ntpd[653]: Listening on interface wildcard, 0.0.0.0#123
May 1 12:57:11 ion-15 ntpd[653]: Listening on interface eth0, 172.16.3.15#123
May 1 12:57:11 ion-15 ntpd[653]: Listening on interface lo, 127.0.0.1#123
May 1 12:57:11 ion-15 ntpd[653]: kernel time sync status 0040
done
Enabling ssh
Mounting site filesystems
done
Loading PVFS2 kernel module done
Sleeping 0 seconds before starting PVFS done
Starting PVFS2 client done
Sleeping 10 seconds before mounting PVFS
done
Mounting PVFS2 filesystems done
Starting SSH daemonMay 1 12:57:21 ion-15 sshd[833]: Server listening on 0.0.0.0 port 22.
done
Zepto ION startup-12
Zepto ION startup-12 done
Starting GPFS
May 1 12:57:26 ion-15 syslogd 1.4.1: restart.
/etc/init.d/rc3.d/S40gpfs: GPFS is ready on I/O node ion-15 : 172.16.3.15 : R00-M1-N12-J00
ln: creating symbolic link `/home/acherryl/acherryl' to `/gpfs/home/acherryl': File exists
ln: creating symbolic link `/home/bgpadmin/bgpadmin' to `/gpfs/home/bgpadmin': File exists
ln: creating symbolic link `/home/davidr/davidr' to `/gpfs/home/davidr': File exists
ln: creating symbolic link `/home/scullinl/scullinl' to `/gpfs/home/scullinl': File exists
Starting ZOID...
done
Zepto ION startup-99
Zepto ION startup-99 done
May 1 17:57:59 ion-15 init: Starting pid 2823, console /dev/console: '/bin/sh'
BusyBox v1.4.2 (2008-10-04 00:02:35 UTC) Built-in shell (ash)
Enter 'help' for a list of built-in commands.
/bin/sh: can't access tty; job control turned off
~ #
</pre>

(again, we stripped the prefixes to make the lines shorter)

Messages such as <tt>Zepto ION startup</tt> or <tt>Starting ZOID</tt> clearly indicate that a ZeptoOS I/O node ramdisk is being used. If one instead mistakenly booted with the default ramdisk, this could be recognized by messages such as:

<pre>
Starting CIO services
[ciod:initialized] done
</pre>

(<tt>ciod</tt> is ''never'' started when using Zepto Compute Node Linux)
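The marker messages above lend themselves to a simple automated check. Below is a rough sketch; the helper name <tt>ion_ramdisk_type</tt> is hypothetical, and it relies only on the three strings quoted in this section.

```shell
#!/bin/sh
# ion_ramdisk_type LOGFILE   (hypothetical helper, not part of ZeptoOS)
# Classify an I/O node boot log by the marker messages discussed above.
# Prints "zeptoos", "default", or "unknown".
ion_ramdisk_type() {
    if grep -q -e 'Zepto ION startup' -e 'Starting ZOID' "$1"; then
        echo zeptoos
    elif grep -q 'Starting CIO services' "$1"; then
        echo default
    else
        echo unknown
    fi
}
```

If the log file holds output from more than one boot, the result reflects the whole file, so it may be preferable to run the check on only the most recent portion of the log.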

In addition to verifying the ramdisk, the correct I/O node kernel can also be verified using the I/O node logfile by checking the kernel build timestamp in the first line of the boot log. As of this writing the default kernel on the Argonne machines has a timestamp of <tt>Wed Oct 29 18:51:19 UTC 2008</tt>; as can be seen above, the ZeptoOS kernel was built more recently.
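To read the build timestamp without scrolling through the whole log, the kernel banner line can be extracted directly. This is a small sketch under the assumption that the banner always contains the text <tt>Linux version</tt>, as in the log above; the function name is made up for illustration.

```shell
#!/bin/sh
# ion_kernel_banner LOGFILE   (hypothetical helper)
# Print the most recent "Linux version ..." line found in an I/O node log.
# The build timestamp is at the end of that line.
ion_kernel_banner() {
    grep 'Linux version' "$1" | tail -n 1
}
```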

===Compute node===

All the compute nodes on the machine share the same MMCS log file, located in <tt>/bgsys/logs/BGP/</tt>. The name of the log file is not fixed (it contains a timestamp), but <tt>sn1-bgdb0-mmcs_db_server-current.log</tt> always links to the current file. Because the file is shared with other jobs, we recommend grepping it for the user name, the partition name, or both.
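As an illustration, such a filter might be wrapped in a small shell function. This is a sketch rather than a supplied tool: <tt>mmcs_filter</tt> is a hypothetical name, and the default log path is the Argonne one mentioned above.

```shell
#!/bin/sh
# mmcs_filter USER PARTITION [LOGFILE]   (hypothetical helper)
# Print only the MMCS log lines that mention both USER and PARTITION.
# LOGFILE defaults to the "current" link described above.
mmcs_filter() {
    log=${3:-/bgsys/logs/BGP/sn1-bgdb0-mmcs_db_server-current.log}
    grep -F -- "$1" "$log" | grep -F -- "$2"
}

# Example: mmcs_filter iskra ANL-R00-M1-N12-64
```

Some log lines carry only the partition name and not the user name, so filtering on both will hide them; dropping the first <tt>grep</tt> gives the more inclusive view.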

A correct boot log when booting ZeptoOS will look something like this:

<pre>
iskra:ANL-R00-M1-N12-64 {20}.0: Common Node Services V1R3M0 (efix:0)
iskra:ANL-R00-M1-N12-64 {20}.0: Licensed Machine Code - Property of IBM.
iskra:ANL-R00-M1-N12-64 {20}.0: Blue Gene/P Licensed Machine Code.
iskra:ANL-R00-M1-N12-64 {20}.0: Copyright IBM Corp., 2006, 2007 All Rights Reserved.
iskra:ANL-R00-M1-N12-64 {20}.0: Z: Zepto Linux Kernel relocating CNS... dst=80280000 src=fff40000 size=262144
iskra:ANL-R00-M1-N12-64 {20}.0: Z: CNS is successfully relocated to 00280000 in physical memory
iskra:ANL-R00-M1-N12-64 {20}.0: Linux version 2.6.19.2-g66cbca2d (kazutomo@login1) (gcc version 4.1.2 (BGP)) #12 SMP Tue Apr 21 12:58:11 CDT 2009
iskra:ANL-R00-M1-N12-64 {20}.0: Zone PFN ranges:
iskra:ANL-R00-M1-N12-64 {20}.0: DMA 0 -> 28672
iskra:ANL-R00-M1-N12-64 {20}.0: Normal 28672 -> 28672
iskra:ANL-R00-M1-N12-64 {20}.0: early_node_map[1] active PFN ranges
iskra:ANL-R00-M1-N12-64 {20}.1: 0: 0 -> 28672
iskra:ANL-R00-M1-N12-64 {20}.1: Built 1 zonelists. Total pages: 28658
iskra:ANL-R00-M1-N12-64 {20}.1: Kernel command line: console=bgcons root=/dev/ram0 lpj=8500000
iskra:ANL-R00-M1-N12-64 {20}.1: PID hash table entries: 4096 (order: 12, 16384 bytes)
iskra:ANL-R00-M1-N12-64 {20}.0: Dentry cache hash table entries: 262144 (order: 4, 1048576 bytes)
iskra:ANL-R00-M1-N12-64 {20}.0: Inode-cache hash table entries: 131072 (order: 3, 524288 bytes)
iskra:ANL-R00-M1-N12-64 {20}.0: Memory: 1826560k available (1408k kernel code, 832k data, 192k init, 0k highmem)
iskra:ANL-R00-M1-N12-64 {20}.0: Calibrating delay loop (skipped)... 1700.00 BogoMIPS preset
iskra:ANL-R00-M1-N12-64 {20}.0: Mount-cache hash table entries: 8192
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 1 done callin...
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 1 done setup...
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 1 done timebase take...
iskra:ANL-R00-M1-N12-64 {20}.0: Processor 1 found.
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 2 done callin...
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 2 done setup...
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 2 done timebase take...
iskra:ANL-R00-M1-N12-64 {20}.0: Processor 2 found.
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 3 done callin...
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 3 done setup...
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 3 done timebase take...
iskra:ANL-R00-M1-N12-64 {20}.0: Processor 3 found.
iskra:ANL-R00-M1-N12-64 {20}.0: Brought up 4 CPUs
iskra:ANL-R00-M1-N12-64 {20}.0: migration_cost=0
iskra:ANL-R00-M1-N12-64 {20}.0: checking if image is initramfs... it is
iskra:ANL-R00-M1-N12-64 {20}.0: Freeing initrd memory: 2575k freed
iskra:ANL-R00-M1-N12-64 {20}.0: NET: Registered protocol family 16
iskra:ANL-R00-M1-N12-64 {20}.0: NET: Registered protocol family 2
iskra:ANL-R00-M1-N12-64 {20}.0: IP route cache hash table entries: 16384 (order: 0, 65536 bytes)
iskra:ANL-R00-M1-N12-64 {20}.0: TCP established hash table entries: 65536 (order: 3, 524288 bytes)
iskra:ANL-R00-M1-N12-64 {20}.0: TCP bind hash table entries: 32768 (order: 2, 262144 bytes)
iskra:ANL-R00-M1-N12-64 {20}.0: TCP: Hash tables configured (established 65536 bind 32768)
iskra:ANL-R00-M1-N12-64 {20}.0: TCP reno registered
iskra:ANL-R00-M1-N12-64 {20}.0: fuse init (API version 7.7)
iskra:ANL-R00-M1-N12-64 {20}.0: io scheduler noop registered (default)
iskra:ANL-R00-M1-N12-64 {20}.0: RAMDISK driver initialized: 16 RAM disks of 32768K size 1024 blocksize
iskra:ANL-R00-M1-N12-64 {20}.0: tun: Universal TUN/TAP device driver, 1.6
iskra:ANL-R00-M1-N12-64 {20}.0: tun: (C) 1999-2004 Max Krasnyansky <maxk@qualcomm.com>
iskra:ANL-R00-M1-N12-64 {20}.0: TCP cubic registered
iskra:ANL-R00-M1-N12-64 {20}.0: NET: Registered protocol family 1
iskra:ANL-R00-M1-N12-64 {20}.0: NET: Registered protocol family 17
iskra:ANL-R00-M1-N12-64 {20}.0: NET: Registered protocol family 15
iskra:ANL-R00-M1-N12-64 {20}.0: Freeing unused kernel memory: 192k init
iskra:ANL-R00-M1-N12-64 {20}.0: init started: BusyBox(for ZeptoOS Compute Node) v1.12.1 (2009-04-21 16:08:55 CDT)
</pre>

This is easy to distinguish from a boot log of the default light-weight kernel, which will consist of the first four lines ''only''.

The MMCS log file contains other useful information besides the boot log of the compute nodes. Before the kernel starts booting, the following messages related to the newly submitted job can be found there:

<pre>
DBBlockCmd DatabaseBlockCommandThread started: block ANL-R00-M1-N12-64, user iskra, action 1
DBBlockCmd setusername iskra
iskra db_allocate ANL-R00-M1-N12-64
iskra DBConsoleController::setAllocating() ANL-R00-M1-N12-64
iskra block state C
iskra DBConsoleController::addBlock(ANL-R00-M1-N12-64)
iskra:ANL-R00-M1-N12-64 BlockController::connect()
iskra:ANL-R00-M1-N12-64 connecting to mcServer at 127.0.0.1:1206
Connected to MCServer as iskra@sn1. Client version 3. Server version 3 on fd 101
iskra:ANL-R00-M1-N12-64 connected to mcServer
iskra:ANL-R00-M1-N12-64 mcServer target set ANL-R00-M1-N12-64 created
iskra:ANL-R00-M1-N12-64 mcServer target set ANL-R00-M1-N12-64 opened
iskra:ANL-R00-M1-N12-64 {0} I/O log file: /bgsys/logs/BGP/R00-M1-N12-J00.log
iskra:ANL-R00-M1-N12-64 MailboxListener starting
iskra:ANL-R00-M1-N12-64 DBConsoleController::doneAllocating() ANL-R00-M1-N12-64
iskra:ANL-R00-M1-N12-64 BlockController::boot_block \
uloader=/bgsys/argonne-utils/partitions/ANL-R00-M1-N12-64/uloader \
cnload=/bgsys/argonne-utils/partitions/ANL-R00-M1-N12-64/CNS,/bgsys/argonne-utils/partitions/ANL-R00-M1-N12-64/CNK \
ioload=/bgsys/argonne-utils/partitions/ANL-R00-M1-N12-64/CNS,/bgsys/argonne-utils/partitions/ANL-R00-M1-N12-64/INK,/bgsys/argonne-utils/partitions/ANL-R00-M1-N12-64/ramdisk
iskra:ANL-R00-M1-N12-64 boot_block cookie: 587867023 compute_nodes: 64 io_nodes: 1
</pre>

Of particular relevance is the pathname to the I/O node log file(s) (if it cannot be easily guessed from the partition name) and the pathnames to the kernels and ramdisks used to boot the partition.
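These pathnames can be pulled out of the MMCS log with standard text tools. The sketch below assumes that the <tt>boot_block</tt> record separates its <tt>uloader=</tt>, <tt>cnload=</tt>, and <tt>ioload=</tt> fields with spaces, as in the excerpt above; the function name <tt>mmcs_boot_info</tt> is hypothetical.

```shell
#!/bin/sh
# mmcs_boot_info PARTITION [LOGFILE]   (hypothetical helper)
# Print the I/O node log file path(s) recorded for PARTITION, followed by
# the uloader=, cnload= and ioload= fields of its boot_block records.
mmcs_boot_info() {
    log=${2:-/bgsys/logs/BGP/sn1-bgdb0-mmcs_db_server-current.log}
    # Path(s) of the I/O node log file(s)
    grep -F -- "$1" "$log" | sed -n 's/.*I\/O log file: *//p'
    # Boot images: split the matching lines on spaces, keep the key=value fields
    grep -F -- "$1" "$log" | tr ' ' '\n' | grep -E '^(uloader|cnload|ioload)='
}
```

For a ZeptoOS boot, the <tt>cnload</tt> and <tt>ioload</tt> entries should resolve to the ZeptoOS images rather than to the default ones.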

After the kernel boot log, the log file will also contain information about subsequent phases of starting a job:

<pre>
iskra:ANL-R00-M1-N12-64 I/O node initialized: R00-M1-N12-J00
iskra:ANL-R00-M1-N12-64 DBBlockController::waitBoot(ANL-R00-M1-N12-64) block initialization successful
iskra DatabaseBlockCommandThread stopped
DBJobCmd DatabaseJobCommandThread started: job 98461, user iskra, action 1
DBJobCmd setusername iskra
iskra Starting Job 98461
New thread 4398305505840, for jobid 98461
selectBlock(): ANL-R00-M1-N12-64 iskra(1) connected state: I owner: iskra
ANL-R00-M1-N12-64 Jobid is 98461, homedir is /gpfs/home/iskra
ANL-R00-M1-N12-64 persist: 1
ANL-R00-M1-N12-64 connecting to mpirun...
ANL-R00-M1-N12-64 setting mpirun stream, fd=386
ANL-R00-M1-N12-64 contacting control node 0 at 172.16.3.15:7000
ANL-R00-M1-N12-64 connected to control node 0 at 172.16.3.15:7000
ANL-R00-M1-N12-64 Job::load() /bin/sleep
ANL-R00-M1-N12-64 Job loaded: 98461
ANL-R00-M1-N12-64 About to start /bin/sleep
ANL-R00-M1-N12-64 Job 98461 set to RUNNING
iskra:ANL-R00-M1-N12-64 {20}.0: floating point used in kernel (task=8080cfe0, pc=80017064)
</pre>

==Interactive login==

We are assuming at this point that launching <tt>/bin/sleep</tt> has been successful and that the "job" is running. We can now start an interactive session on our BG/P resources. Probably the most complicated part of this operation is finding the IP address of the I/O node(s). The allocation of I/O nodes to partitions is fixed, so on a small machine one could simply make a list. This information is also available in the log files discussed above.

The IP address is printed near the top of the I/O node boot log, as part of the interface configuration of the Ethernet device:

<pre>
eth0 Link encap:Ethernet HWaddr 00:14:5E:7D:0C:57
inet addr:172.16.3.15 Bcast:172.31.255.255 Mask:255.240.0.0
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:880 errors:0 dropped:0 overruns:0 frame:0
TX packets:1009 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:3878545 (3.6 Mb) TX bytes:151458 (147.9 Kb)
Interrupt:32
</pre>

In this case, the address is <tt>172.16.3.15</tt> (the <tt>inet addr</tt> value).

The IP address is also available from the MMCS log file:

<pre>
ANL-R00-M1-N12-64 contacting control node 0 at 172.16.3.15:7000
</pre>

With larger partitions that include multiple I/O nodes, querying the MMCS logfile is probably better, as it will list all the addresses.
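A sketch of such a query follows. It assumes that the message format is exactly as in the line quoted above (<tt>contacting control node N at address:port</tt>); the function name <tt>ion_addresses</tt> is our own.

```shell
#!/bin/sh
# ion_addresses PARTITION [LOGFILE]   (hypothetical helper)
# List the distinct I/O node IP addresses that the MMCS log records
# for PARTITION, one per line.
ion_addresses() {
    log=${2:-/bgsys/logs/BGP/sn1-bgdb0-mmcs_db_server-current.log}
    grep -F -- "$1" "$log" |
        sed -n 's/.*contacting control node [0-9]* at \([0-9.]*\):[0-9]*.*/\1/p' |
        sort -u
}
```

Because the log also covers earlier jobs on the same partition, duplicates are removed with <tt>sort -u</tt>.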

Once the IP address is known, one can simply use SSH:

<pre>
iskra@login1.surveyor:~> ssh 172.16.3.15

BusyBox v1.4.2 (2008-10-04 00:02:35 UTC) Built-in shell (ash)
Enter 'help' for a list of built-in commands.

/gpfs/home/iskra $ hostname
ion-15
/gpfs/home/iskra $
</pre>

If everything is configured correctly, SSH will only let in root and the partition owner; no other unprivileged user will be allowed on the node. However, this might require site-specific customization to work properly. To enable access for the partition owner, one might need to make adjustments to [[ZOID#The /bin.rd/update_passwd_file.sh file|update_passwd_file.sh]]. To enable password-less login for the partition owners without requiring them to set up personal SSH keypairs, we recommend adding the names of the front end nodes to the <tt>shosts.equiv</tt> file, found in <tt>ramdisk/ION/ramdisk-add/etc/ssh.zepto/</tt> (it is empty by default; remember to use the names from the network that interconnects front end and I/O nodes, which might be different from hostnames, e.g., at Argonne we need to add the <tt>-data</tt> suffix to the hostnames). Until this has all been set up, one might prefer to log on as root (<tt>ssh -l root</tt>), passing the password provided while [[Configuration#Building|building]] the ZeptoOS environment.

Also, even when the partition owner is correctly set up, there will be a time window while booting the I/O node when the SSH daemon is already running, but a job has not yet been started; during that window, the partition owner cannot log on. If that happens, wait a few seconds and try again.

Here's part of the <tt>ps</tt> output from the I/O node:

<pre>
/gpfs/home/iskra $ ps -ef
UID PID PPID C STIME TTY TIME CMD
[...]
65534 98 1 0 16:09 ? 00:00:00 /sbin/portmap
root 108 19 0 16:09 ? 00:00:00 [rpciod/0]
root 109 19 0 16:09 ? 00:00:00 [rpciod/1]
root 110 19 0 16:09 ? 00:00:00 [rpciod/2]
root 111 19 0 16:09 ? 00:00:00 [rpciod/3]
root 570 1 0 16:09 ? 00:00:00 /sbin/syslogd
root 577 1 0 16:09 ? 00:00:00 /sbin/klogd -c 1 -x -x
ntp 653 1 0 16:09 ? 00:00:00 /usr/sbin/ntpd -p /var/run/ntpd.
root 688 1 0 16:09 ? 00:00:00 [lockd]
root 775 1 0 16:09 ? 00:00:00 /bgsys/iosoft/pvfs2/sbin/pvfs2-c
root 776 775 0 16:09 ? 00:00:00 pvfs2-client-core --child -a 5 -
root 833 1 0 16:10 ? 00:00:00 /usr/sbin/sshd -o PidFile=/var/r
root 1016 1 0 16:10 ? 00:00:00 /bin/ksh /usr/lpp/mmfs/bin/runmm
root 1079 1 0 16:10 ? 00:00:00 [nfsWatchKproc]
root 1080 1 0 16:10 ? 00:00:00 [gpfsSwapdKproc]
root 1146 1016 0 16:10 ? 00:00:01 /usr/lpp/mmfs/bin//mmfsd
root 1153 1 0 16:10 ? 00:00:00 [mmkproc]
root 1152 1 0 16:10 ? 00:00:00 [mmkproc]
root 1154 1 0 16:10 ? 00:00:00 [mmkproc]
iskra 2810 1 98 16:10 ? 00:04:09 /bin.rd/zoid -a 8 -m unix_impl.s
root 2823 1 0 16:10 ? 00:00:00 /bin/sh
root 3328 833 0 16:10 ? 00:00:00 sshd: iskra [priv]
iskra 3332 3328 0 16:10 ? 00:00:00 sshd: iskra@ttyp0
iskra 3333 3332 0 16:10 ttyp0 00:00:00 -sh
iskra 3346 3333 0 16:14 ttyp0 00:00:00 ps -ef
/gpfs/home/iskra $
</pre>

The I/O nodes run a small Linux setup with the root filesystem in the ramdisk. Custom processes can be started, just like on any ordinary Linux node. In the example above, it is mostly a few system daemons and the remote filesystem clients (GPFS, PVFS). Please verify at this stage that the remote filesystems have been mounted correctly.
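One way to perform this check is to look at the kernel's mount table. The sketch below is an assumption-laden example: it presumes that <tt>/proc/mounts</tt> is available on the I/O node, that <tt>grep</tt> is part of its BusyBox build, and that the filesystem type names are <tt>gpfs</tt> and <tt>pvfs2</tt>; adjust these to the site configuration.

```shell
#!/bin/sh
# fs_mounted TYPE [MOUNTS_FILE]   (hypothetical helper)
# Succeed if a filesystem of the given type appears in the mount table.
# Each line of /proc/mounts reads "device mountpoint type options ...",
# so the type is matched as a space-delimited field.
fs_mounted() {
    grep -q " $1 " "${2:-/proc/mounts}"
}

# Example (the type names are assumptions):
#   for t in gpfs pvfs2; do fs_mounted "$t" || echo "$t is not mounted"; done
```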

One custom process running on the node is [[ZOID]], the I/O forwarding and job control daemon, which enables the communication with the compute nodes. One of the facilities offered by ZOID is IP forwarding between the I/O node and the compute nodes, implemented using the virtual network tunneling device available in Linux:

<pre>
/gpfs/home/iskra $ ifconfig tun0
tun0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
inet addr:192.168.1.254 P-t-P:192.168.1.254 Mask:255.255.255.255
UP POINTOPOINT RUNNING NOARP MULTICAST MTU:65535 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:500
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
/gpfs/home/iskra $
</pre>

At least on Argonne machines, with a 64:1 ratio of compute nodes to I/O nodes, compute nodes have addresses <tt>192.168.1.1</tt> to <tt>192.168.1.64</tt> (the last octet of the address is the [[FAQ#Pset rank|pset rank]]). Somewhat confusingly, the first compute node (compute node <tt>0</tt>) has IP address <tt>192.168.1.64</tt>, so if one submits a one-node job as we did, that is the IP address that needs to be used to log on to that sole running compute node. The IP address of the second compute node is... <tt>192.168.1.59</tt>. On a machine with a 16:1 ratio of compute nodes to I/O nodes, the first compute node has IP address <tt>192.168.1.16</tt>. Do not blame us for this chaos – blame IBM :-).

The compute nodes are running a <tt>telnet</tt> daemon, and no password is required to log on to them:

<pre>
/gpfs/home/iskra $ telnet 192.168.1.64

Entering character mode
Escape character is '^]'.

BusyBox(for ZeptoOS Compute Node) v1.12.1 (2009-04-21 16:08:55 CDT) built-in shell (ash)
Enter 'help' for a list of built-in commands.

~ #
</pre>

The IP address of the I/O node on this virtual network is <tt>192.168.1.254</tt>. The network is local to each I/O node, so for larger jobs, there will be multiple distinct virtual networks that cannot communicate with each other, and the IP addresses will duplicate.

Here's part of the <tt>ps</tt> output from the compute node:

<pre>
~ # ps -ef
PID USER VSZ STAT COMMAND
[...]
34 root 5440 S /bin/sh /etc/init.d/rc.sysinit
44 root 5504 S /sbin/telnetd -l /bin/sh
47 root 6528 S /sbin/inetd
48 root 46400 R N /sbin/control
62 root 7872 S /bin/zoid-fuse -o allow_other -s /fuse
116 root 5248 S /bin/sleep 3600
118 root 5504 S /bin/sh
</pre>

Compute nodes have an even more stripped-down environment than the I/O nodes. There are no user accounts – everything runs as root, including the application processes. This is not a security concern, because the only practical way for a compute node to communicate with the outside world is through the I/O node, and I/O nodes ''do'' enforce user-level access control.

There are two custom processes running on each compute node:

'''control''' is a job management daemon responsible for tasks such as the launching of application processes, for the forwarding of stdin/out/err data, and for the management of the virtual network tunneling device from the compute node side. Do not interfere with this process in any way; this would likely make the node inaccessible.

'''zoid-fuse''' is a FUSE ([http://fuse.sourceforge.net/ Filesystem in Userspace]) client responsible for making the filesystems from the I/O nodes available to ordinary POSIX-compliant processes running on the compute nodes. The whole filesystem namespace from the I/O nodes is made available on the compute nodes under <tt>/fuse/</tt>, and symbolic links such as <tt>/home -> /fuse/home</tt> are set up to keep the login and I/O node pathnames valid on the compute nodes. Please verify that this is correctly set up. We do not foresee a need to change this setup, but should that prove necessary, the responsible <tt>fuse-start</tt> and <tt>fuse-stop</tt> scripts can be found under <tt>ramdisk/CN/tree/bin</tt>.
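The symbolic link part of this setup can be checked from the compute node shell. The following is only a sketch: the function name is hypothetical, and it assumes that the <tt>readlink</tt> applet is present in the compute node BusyBox build, which we have not confirmed.

```shell
#!/bin/sh
# links_into_fuse PATH   (hypothetical helper)
# Succeed if PATH is a symbolic link whose target lies under /fuse/.
links_into_fuse() {
    [ -L "$1" ] || return 1
    case $(readlink "$1") in
        /fuse/*) return 0 ;;
        *)       return 1 ;;
    esac
}

# Example: links_into_fuse /home || echo "/home does not point into /fuse"
```

Listing a directory known to exist on the I/O node (for example, one's home directory) remains the more direct test that the FUSE client itself is working.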

==Shell script job==

Assuming that the above steps have been successful, one can now test running a simple job from a network filesystem, such as one's home directory.

Here is a sample shell script to try:

<pre>
#!/bin/sh

. /proc/personality.sh

while true; do
echo "Node $BG_RANK_IN_PSET running (stdout)"
echo "Node $BG_RANK_IN_PSET running (stderr)" 1>&2
sleep 10
done
</pre>

(please see the [[FAQ#Pset rank|FAQ]] for the explanation of <tt>/proc/personality.sh</tt> and <tt>BG_RANK_IN_PSET</tt>)

Please create the script file on the network filesystem, set the executable bit (<tt>chmod 755</tt>) and submit it. Verify that the script starts correctly and that at least the standard error output is visible immediately. The script prints a line of output from each node every ten seconds. It does so both to the standard output and to the standard error, because, depending on software configuration, the standard output stream could be buffered. If that is the case, kill the job and verify that the standard output data did appear.
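For jobs spanning more than a few nodes, it can be tedious to confirm by eye that every node has reported. A small sketch of a helper that summarizes the output is shown here; the name <tt>nodes_seen</tt> is hypothetical, and the pattern matches only the message format of the sample script above.

```shell
#!/bin/sh
# nodes_seen FILE   (hypothetical helper)
# List, in numeric order, the distinct pset ranks that have printed at
# least one "Node <rank> running ..." line to FILE.
nodes_seen() {
    sed -n 's/.*Node \([0-9][0-9]*\) running.*/\1/p' "$1" | sort -n -u
}

# Example: nodes_seen <jobid>.error | wc -l
```

Because pset ranks repeat for every I/O node, the count obtained this way is bounded by the number of compute nodes per I/O node rather than by the size of the job.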

==MPI and OpenMP jobs==

The final tests involve two parallel programming jobs, one using MPI and one using OpenMP. Use the test programs provided with the distribution. From the top level directory:
<pre>
$ cd comm/testcodes
</pre>

===Compiling===

The programs can be compiled on a login node using:

<pre>
$ /path/to/install/bin/zmpicc -o mpi-test-linux mpi-test.c
$ /path/to/install/bin/zmpixlc_r -qsmp=omp -o omp-test-linux omp-test.c
</pre>

===Submitting===

Submit the MPI test like any other job; use one of the commands below:

<pre>
$ cqsub -k <profile-name> -t <time> -n <number-of-processes> $PWD/mpi-test-linux
$ qsub --kernel <profile-name> -t <time> -n <number-of-processes> $PWD/mpi-test-linux
$ mpirun -verbose 1 -partition <partition-name> -np <number-of-processes> -timeout <time> \
-cwd $PWD -exe $PWD/mpi-test-linux
</pre>

For the OpenMP test, we pass the number of OpenMP threads to use in the <tt>OMP_NUM_THREADS</tt> variable:

<pre>
$ cqsub -k <profile-name> -t <time> -n 1 -e OMP_NUM_THREADS=<num> $PWD/omp-test-linux
$ qsub --kernel <profile-name> -t <time> -n 1 --env OMP_NUM_THREADS=<num> $PWD/omp-test-linux
$ mpirun -verbose 1 -partition <partition-name> -np 1 -timeout <time> \
-cwd $PWD -env OMP_NUM_THREADS=<num> -exe $PWD/omp-test-linux
</pre>

The MPI test benchmarks the performance of various MPI operations. The OpenMP test is just a parallel "Hello world".

'''Note:''' see the [[FAQ#Why large MPI processes do not work|FAQ]] if submitting larger MPI processes does not work properly.

----
[[Installation]] | [[ZeptoOS_Documentation|Top]] | [[MPICH, DCMF, and SPI]]

Installation

2009-05-08T15:32:53Z

Iskra:

[[Configuration]] | [[ZeptoOS_Documentation|Top]] | [[Testing]]
----

==Installing the support files==

Installing ZeptoOS consists of two parts. In the first part we install support files (header files, libraries, scripts, etc) used primarily when building compute node binaries for ZeptoOS. To perform this step, change to the top-level directory and type:
<pre>
$ python install.py /path/to/install
</pre>

This will install the following files:
<pre>
Install Home: /path/to/install
Creating install directories ...
Making /path/to/install/bin
Making /path/to/install/cnbin
Making /path/to/install/include
Making /path/to/install/lib
Making /path/to/install/lib/zoid

Installing MPICH2 scripts ...
/path/to/install/bin/zmpicc
/path/to/install/bin/zmpicxx
/path/to/install/bin/zmpif77
/path/to/install/bin/zmpif90
/path/to/install/bin/zmpixlc
/path/to/install/bin/zmpixlc_r
/path/to/install/bin/zmpixlcxx
/path/to/install/bin/zmpixlcxx_r
/path/to/install/bin/zmpixlf2003
/path/to/install/bin/zmpixlf2003_r
/path/to/install/bin/zmpixlf77
/path/to/install/bin/zmpixlf77_r
/path/to/install/bin/zmpixlf90
/path/to/install/bin/zmpixlf90_r
/path/to/install/bin/zmpixlf95
/path/to/install/bin/zmpixlf95_r

Installing Compute Node binaries ...
/path/to/install/cnbin/cn-ipfwd.sh
/path/to/install/cnbin/cn-ipfwd

Installing misc files ...
/path/to/install/bin/zelftool
/path/to/install/bin/zkparam.py

Installing MPICH2 headers ...
/path/to/install/include/mpe_thread.h
/path/to/install/include/mpi.h
/path/to/install/include/mpi.mod
/path/to/install/include/mpi_base.mod
/path/to/install/include/mpi_constants.mod
/path/to/install/include/mpi_sizeofs.mod
/path/to/install/include/mpicxx.h
/path/to/install/include/mpif.h
/path/to/install/include/mpio.h
/path/to/install/include/mpiof.h
/path/to/install/include/mpix.h

Installing MPICH2 libraries ...
/path/to/install/lib/libcxxmpich.zcl.a
/path/to/install/lib/libfmpich.zcl.a
/path/to/install/lib/libfmpich_.zcl.a
/path/to/install/lib/libmpich.zcl.a
/path/to/install/lib/libmpich.zclf90.a

Installing SPI libraries ...
/path/to/install/lib/libSPI.zcl.a
/path/to/install/lib/libzcl.a

Installing ZOID files ...
/path/to/install/lib/libzoid_cn.a
/path/to/install/lib/zoid/libc.a
/path/to/install/lib/zoid/libpthread.a

Installing DCMF files ...
/path/to/install/include/dcmf.h
/path/to/install/include/dcmf_collectives.h
/path/to/install/include/dcmf_coremath.h
/path/to/install/include/dcmf_globalcollectives.h
/path/to/install/include/dcmf_multisend.h
/path/to/install/include/dcmf_optimath.h
/path/to/install/lib/libdcmf.zcl.a
/path/to/install/lib/libdcmfcoll.zcl.a
</pre>

==Setting up a kernel profile==

The second part of the ZeptoOS installation is the process of setting up a ZeptoOS kernel profile.

Blue Gene systems are partitionable, meaning that the hardware can be split into multiple, largely independent, sub-units. More importantly, the system supports using different software stacks (what we call ''kernel profiles'') on different partitions at the same time. For our immediate purposes, this means that one can safely experiment with ZeptoOS on one partition while other users are running production jobs using the default light-weight kernel on the rest of the machine.

A kernel profile consists of the following elements:

; loader
: A proprietary low-level bootstrap code, loaded onto all compute and I/O nodes.
; CN image list
: A series of images loaded onto each compute node.
; ION image list
: A series of images loaded onto each I/O node.

All the images are loaded onto the machine from the service node via the service (JTAG) network. The loader is loaded first, followed by the CN- and ION-specific images loaded in order.

The CN image list defaults to the CNS followed by the CNK. The ION image list defaults to the CNS, followed by the Linux kernel, followed by the Linux ramdisk. CNS stands for Common Node Services – it is a proprietary "firmware" which negotiates between the hardware and the kernel.

To enable ZeptoOS, we need to boot the ZeptoOS CN kernel and ramdisk and the Zepto ION kernel and ramdisk in a partition that we want to use. The loader and the CNS are closed-source components, so ZeptoOS uses the same images for them as the default kernel profile.

In the remainder of this section we discuss how to assign and boot ZeptoOS-specific images.

===System using Cobalt===

If the BGP system in question has the [http://trac.mcs.anl.gov/projects/cobalt/ Cobalt] scheduler installed and its kernel profile feature has been configured properly, then using ZeptoOS should be easy. All that is necessary is to make a subdirectory in the kernel profile directory and to create a couple of symbolic links that point to ZeptoOS images.

On Argonne BGP machines, <tt>/bgsys/argonne-utils/profiles/</tt> is the top-level kernel profile directory; individual profiles are stored in its subdirectories. Assuming that the ZeptoOS images have already been created (see [[Configuration]]), and that one has write permissions to the kernel profile directory, here are the steps to create a new kernel profile:

<pre>
$ cd <kernel_profile_dir>
$ mkdir <your_profile> && cd <your_profile>
$ ln -s <zepto_dir>/BGP-CN-zImage-with-initrd.elf CNK
$ ln -s <zepto_dir>/BGP-ION-zImage.elf INK
$ ln -s <zepto_dir>/BGP-ION-ramdisk-for-CNL.elf ramdisk
$ ln -s ../default/CNS
$ ln -s ../default/uloader
</pre>

'''Note:''' ensure that the ZeptoOS images are world-readable from the service node, otherwise jobs will fail to start. If needed, copy the images to the kernel profile rather than link to them.
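A profile directory can be sanity-checked before the first job is submitted. The sketch below is an illustration only; <tt>check_profile</tt> is a hypothetical name, and the check covers just the five entries created in the steps above and their world-read permission bit.

```shell
#!/bin/sh
# check_profile DIR   (hypothetical helper)
# Report entries of a kernel profile directory that are missing, are
# dangling symbolic links, or point to files without the world-read bit.
# Returns 0 if all five entries look usable, 1 otherwise.
check_profile() {
    status=0
    for name in uloader CNS CNK INK ramdisk; do
        f=$1/$name
        if [ ! -e "$f" ]; then
            echo "$f: missing or dangling link"
            status=1
        # find -L follows the link; -perm -004 requires the "other" read bit
        elif [ -z "$(find -L "$f" -prune -perm -004 2>/dev/null)" ]; then
            echo "$f: not world-readable"
            status=1
        fi
    done
    return $status
}
```

Note that the directories leading to the images must also be accessible from the service node; this sketch does not check them.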

For Argonne users, we provide a convenience script <tt>mkprofile-ANL.sh</tt> with some extra features. The following command line is equivalent to the above steps:

<pre>
$ cd <zepto_dir> && ./mkprofile-ANL.sh --profile=<your_profile>
</pre>

Invoking the script with the <tt>-h</tt> option prints an overview of command line options. In particular, use <tt>-c</tt> if you prefer to copy images instead of making the links.

<pre>
$ ./mkprofile-ANL.sh -h
Usage: ./mkprofile-ANL.sh [OPTIONS]

Options:
-h : Show this message
-c : Copy images instead of making symbolic link
-f : Overwrite existing profile
--profile=name : Specify profile name
--cn=fn : Compute Node Kernel Image
--ion=fn : Specify I/O Node Kernel Image
--rd=fn : Specify I/O Node Ramdisk Image
--ls : show files in profile
--dryrun
</pre>

Once the ZeptoOS kernel profile is set up, it can be booted by specifying the profile name when submitting a job using either <tt>cqsub</tt> or <tt>qsub</tt>:

<pre>
$ cqsub -k <profile_name> ...
$ qsub --kernel <profile_name> ...
</pre>

Testing and troubleshooting ZeptoOS is discussed in detail in the [[Testing|next section]].

===Manual installation using the MMCS console===

If the system does not use Cobalt, we will need to resort to the Blue Gene Midplane Management Control System (MMCS). MMCS is a low-level control mechanism; administrator-level access permissions are required to use it.

Before we begin, please select a partition to experiment on. The partition should be unused; if some reservation system is in place, please reserve that partition (but do not boot it yet).

====Assigning ZeptoOS images to the partition====

Log on to the service node and start the MMCS console:

<pre>
$ ssh <service_node>
sn $ mmcs_db_console
connecting to mmcs_server
connected to mmcs_server
connected to DB2
mmcs$
</pre>

We begin by obtaining the current boot configuration of the partition, so that at the end we can revert it to the initial condition:

<pre>
mmcs$ getblockinfo <partition_name>
OK
boot info for block <partition_name>:
mloader: /bgsys/drivers/ppcfloor/boot/uloader
cnloadImg: /bgsys/drivers/ppcfloor/boot/cns,/bgsys/drivers/ppcfloor/boot/cnk
ioloadImg: /bgsys/drivers/ppcfloor/boot/cns,/bgsys/drivers/ppcfloor/boot/linux,/bgsys/drivers/ppcfloor/boot/ramdisk
status: F
mmcs$
</pre>
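
Since the restore step at the end needs exactly these values, it can be convenient to save the <tt>getblockinfo</tt> output to a file and derive the matching <tt>setblockinfo</tt> command from it. The following is a minimal sketch, assuming the output has exactly the format shown above (the <tt>mloader</tt>, <tt>cnloadImg</tt> and <tt>ioloadImg</tt> fields); the function name <tt>restore_command</tt> and the saved-file workflow are hypothetical and not MMCS features. The function only prints the command, on a single line, to be entered at the <tt>mmcs$</tt> prompt.

```shell
# Hypothetical helper: read saved getblockinfo output on stdin and print
# the setblockinfo command that restores those boot images.
# Usage: restore_command <partition_name> < saved_getblockinfo.txt
restore_command() {
    awk -v part="$1" '
        $1 == "mloader:"   { mloader = $2 }
        $1 == "cnloadImg:" { cn = $2 }
        $1 == "ioloadImg:" { io = $2 }
        END {
            # Fail rather than print an incomplete command.
            if (mloader == "" || cn == "" || io == "") exit 1
            print "setblockinfo", part, mloader, cn, io
        }'
}
```

If any of the three fields is missing from the input, the function prints nothing and returns a non-zero status.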

Now assign ZeptoOS images to the partition:

<pre>
mmcs$ setblockinfo <partition_name> /bgsys/drivers/ppcfloor/boot/uloader \
/bgsys/drivers/ppcfloor/boot/cns,<zepto_dir>/BGP-CN-zImage-with-initrd.elf \
/bgsys/drivers/ppcfloor/boot/cns,<zepto_dir>/BGP-ION-zImage.elf,<zepto_dir>/BGP-ION-ramdisk-for-CNL.elf
mmcs$ quit
</pre>

====Booting ZeptoOS====

Once the partition has been correctly configured, the ZeptoOS images will be loaded when you run a job on that partition (via <tt>mpirun</tt>, for example):

<pre>
$ mpirun -verbose 1 -partition <partition_name> -np 64 -timeout 600 -cwd $PWD -exe ./a.out
</pre>

Testing and troubleshooting ZeptoOS is discussed in detail in the [[Testing|next section]].

====Restore the original configuration====

'''Do not forget this step!'''

Once the experiments are over, restore the partition to its original configuration:

<pre>
$ ssh <service_node>
sn $ mmcs_db_console
connecting to mmcs_server
connected to mmcs_server
connected to DB2
mmcs$ setblockinfo <partition_name> /bgsys/drivers/ppcfloor/boot/uloader \
/bgsys/drivers/ppcfloor/boot/cns,/bgsys/drivers/ppcfloor/boot/cnk \
/bgsys/drivers/ppcfloor/boot/cns,/bgsys/drivers/ppcfloor/boot/linux,/bgsys/drivers/ppcfloor/boot/ramdisk
mmcs$ quit
</pre>

----
[[Configuration]] | [[ZeptoOS_Documentation|Top]] | [[Testing]]

Configuration

2009-05-08T15:21:21Z

Iskra:

[[Introduction]] | [[ZeptoOS_Documentation|Top]] | [[Installation]]
----

== Downloading ==

* Log on to one of the front end nodes of the Blue Gene (a login node or a service node).

* Download the ZeptoOS tarball from the ZeptoOS [http://press.mcs.anl.gov/zeptoos/download download page].

* Extract the sources from the package:
<pre>
$ tar xjf ZeptoOS-<version>.tar.bz2
</pre>

== Configuring ==

Change to the top-level <tt>ZeptoOS-<version></tt> directory:

<pre>
$ cd ZeptoOS-<version>
</pre>

A <tt>configure</tt> script is provided to set the pathnames to various system directories:

<pre>
$ ./configure
</pre>

If invoked without any arguments, it will use the defaults, which should be appropriate if ZeptoOS is configured on a system with a supported BG/P driver version. The pathnames can be changed with the help of a textual user interface by invoking the script as follows:

<pre>
$ ./configure --edit
</pre>

This will display the following menu:

[[Image:Configure1.png|border|Main menu]]

Please select the top item (<tt>BG/P DIST_DIR</tt>). The screen will change to:

[[Image:Configure2.png|border|DIST_DIR menu]]

The following options are available:

; DRV_DIR
: The directory with the BG/P driver tree. The default (<tt>/bgsys/drivers/ppcfloor/</tt>) is a link pointing to the currently active driver.
; BGP_CROSS
: A prefix to the pathnames of the GNU cross-compilers used to build the compute node and I/O node software.
; BGCNS_H_PATH and BGCNS_H
: The location of a file needed to rebuild the kernel (these options are temporary and will be removed in the next version).
; OS_DIR
: The directory with the supplementary I/O node software used when booting the I/O nodes. It needs to be set to match the BG/P driver version being used.

The second top-level menu (<tt>Debugging</tt>) has only one option:

; ADD_DEBUG_TOOLS
: Check this option to include <tt>gdb</tt> and <tt>strace</tt> in the compute node ramdisk. They are not included by default because of their size.

The third top-level menu (<tt>Kernel Profiling</tt>) is discussed in the [[(K)TAU#Configure ZeptoOS to point to KTAU patch and path|(K)TAU section]].

Select <tt>Exit</tt> (multiple times if needed) and confirm if you want to save any changes made.

== Building ==

To start using the pre-built binaries simply type:

<pre>
$ make
</pre>

On the first invocation, this will ask for a root password to use on I/O nodes:

<pre>
Create root password for I/O Node
Leave the password field empty if you want to disable root login
New password:
</pre>

'''Security note: root-level access to I/O nodes should only be given to trusted individuals. A root user can access and modify files of all users in the system.'''

Once the password has been entered and confirmed, <tt>make</tt> will use pre-built kernel images, and will build the ramdisks from pre-built tools and utilities. The following generated files will be placed in the top-level directory:

; BGP-CN-zImage-with-initrd.elf
: ZeptoOS compute node Linux with embedded compute node ramdisk.
; BGP-ION-zImage.elf
: ZeptoOS I/O node kernel.
; BGP-ION-ramdisk-for-CNL.elf
: ZeptoOS I/O node ramdisk for use with the ZeptoOS compute node Linux.
; BGP-ION-ramdisk-for-CNK.elf
: ZeptoOS I/O node ramdisk for use with the IBM CNK (optional).
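
A quick way to confirm that the build produced the images needed later for a ZeptoOS kernel profile is to look for them by name. The following is a minimal sketch; the function name <tt>missing_images</tt> is hypothetical, and <tt>BGP-ION-ramdisk-for-CNK.elf</tt> is left out because it is listed above as optional.

```shell
# Hypothetical helper: print the names of the ZeptoOS images that are
# missing from the build directory $1; return non-zero if any is missing.
missing_images() {
    status=0
    for img in BGP-CN-zImage-with-initrd.elf \
               BGP-ION-zImage.elf \
               BGP-ION-ramdisk-for-CNL.elf; do
        if [ ! -f "$1/$img" ]; then
            echo "$img"
            status=1
        fi
    done
    return $status
}
```

Running <tt>missing_images .</tt> in the top-level directory prints nothing when all three images are present.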

It is possible to rebuild individual ZeptoOS components using one of the following <tt>make</tt> targets (the list is also available by typing <tt>make help</tt> or <tt>make menu</tt>):

; bgp-cn-linux
: Rebuilds the compute node ramdisk and embeds it into a compute node kernel image.
; bgp-ion-ramdisk-cnl
: Rebuilds the I/O node ramdisk for the ZeptoOS compute node Linux.
; bgp-ion-ramdisk-cnk
: Rebuilds the I/O node ramdisk for the IBM CNK.
; bgp-ion-linux-build
: Rebuilds the I/O node kernel.
; bgp-cn-linux-build
: Rebuilds the compute node kernel and ramdisk and embeds the ramdisk into the kernel.
; bgp-all-pkg-rebuild
: Rebuilds all packages from sources.
(the following <tt>make</tt> targets are mostly for internal use)
; bgp-ion-linux
: Copies a recently rebuilt I/O node kernel if one is available; otherwise, uses a prebuilt binary (will not rebuild the kernel).
; bgp-all-pkg-smart
: Copies recently rebuilt packages if available; otherwise, uses prebuilt binaries (used when preparing to rebuild ramdisks).

----
[[Introduction]] | [[ZeptoOS_Documentation|Top]] | [[Installation]]

Changes

2009-05-08T15:03:29Z

Iskra:

[[ZeptoOS_Documentation|Top]]
----

'''ZeptoOS-BGP 2.0''' released May XX, 2009. 
Most important changes:

* first public release with support for Blue Gene/P
* ZeptoOS Compute Node Kernel with:
** Big Memory
** Native communication support
** IP forwarding
* ZOID enabled by default

'''ZeptoOS-BG 1.5''' released June 28, 2007. 
Most important changes:

* tested on V1R3M2 driver, should work with V1R3M0 and V1R3M1 as well
* support for ZOID
* ION kernel with experimental support for compute class processes and static TLB
* PVFS2 updated to version 2.6.3

'''ZeptoOS-BG 1.4''' released January 31, 2006. 
Most important changes:

* tested on V1R2M1 driver and may work with V1R2M0
* 2.6 ION kernel (based on 2.6.5)
* pvfs2 binary and rc script update
* boot msg clean up
* ZeptoInstall.sh fixes
* misc. such as fixing /tmp perms

'''ZeptoOS-BG 1.2''' released November 11, 2005. 
Most important changes:

* Integrated support for KTAU, a kernel profiling/tracing tool.
* Support for custom /bgl/dist-type directory trees. Eliminates a need to put ZeptoOS-specific stuff inside a directory shared with standard IBM profile.
* An installation script is now available, to ease the installation process.
* Added zswitcher, a command to switch kernel/ramdisk of a partition.
* Added zdiff-elfrd, a command to compare two ramdisks.
* Improved zinfo: more robust, more secure, easier to set up.
* CIOD read/write buffer size can now be calculated automatically, taking into account available memory, compute to I/O node ratio, etc.

'''ZeptoOS-BG 1.1''' released September 6, 2005.

----
[[ZeptoOS_Documentation|Top]]

Requirements

2009-05-08T14:58:29Z

Iskra:

[[ZeptoOS_Documentation|Top]]
----

* Blue Gene/P system with the V1R3 driver installed
* Blue Gene/P PowerPC Front End Node
* The ability to create a kernel profile
** If the system has the Cobalt scheduler installed and its kernel profile feature is available, you need write permissions to the profile directory.
** If no Cobalt kernel profile feature is available, you need access permission to the service node and the MMCS console.

----
[[ZeptoOS_Documentation|Top]]

Feature List

2009-05-08T14:57:27Z

Iskra:

[[ZeptoOS_Documentation|Top]]
----

; ZeptoOS Compute Node Linux
: Optimized kernel based on Linux 2.6.19 runs on the compute nodes of Blue Gene/P, with a slim ramdisk less than 1/3 of the size of the default I/O-node one.

; ZeptoOS-enhanced I/O node Linux
: Our I/O node kernel and ramdisk are based on the standard V1R3M0 release, but feature a number of ZeptoOS-specific improvements.

; Run arbitrary processes on the compute nodes
: Shell scripts, Java VM, unrestricted pthreads, etc.

; Big Memory support
: Zepto Compute Binaries run in Big Memory, a flat memory area covered by huge, semi-static TLB entries, resulting in a run-time performance comparable to that of a light-weight kernel.

; Native communication support
: The DMA driver, as well as standard BG/P communication libraries such as MPICH2 and the lower-level DCMF and SPI have been ported to the ZeptoOS Compute Node Linux, providing the applications with communication performance matching that of the light-weight kernel. Currently only the SMP mode is supported.

; ZOID
: ZeptoOS I/O Daemon performs job control and enables communication (remote file I/O, etc) between the compute nodes and the I/O nodes. ZOID can be easily extended through plug-ins to perform the forwarding of custom APIs.

; Log into your I/O nodes
: The SSH daemon is enabled by default and is configured to allow the partition owner to log on to the I/O nodes while the job is running.

; Log into your compute nodes
: Once logged on to an I/O node, a user can open an interactive session to a compute node using <tt>telnet</tt> in order to, e.g., attach <tt>gdb</tt> to an application process running there. This is made possible by IP packet forwarding over the tree network, performed by ZOID.

; PVFS support
: Pre-built binaries of PVFS client code version 2.8.1, including the I/O node kernel module, are included in the ZeptoOS tarball for your convenience.

; ALL source code
: The ZeptoOS tarball includes ''all'' the source code: kernels, ramdisks, communication libraries, I/O forwarding, support libraries, tools and utilities, etc, along with all the support scripts and Makefiles needed to build binaries out of them. Pre-built binaries are also included for your convenience.

; Safe to use alongside IBM CNK
: Thanks to the way that Blue Gene machines are designed, ZeptoOS can easily and safely be used alongside the IBM CNK. Depending on the software configuration, the kernel to use can be specified on a per-partition, or even per-job, basis.

----
[[ZeptoOS_Documentation|Top]]

Installation

2009-05-07T23:02:35Z

Iskra: /* Installing the support files */

[[Configuration]] | [[ZeptoOS_Documentation|Top]] | [[Testing]]
----

==Installing the support files==

Installing ZeptoOS consists of two parts. In the first part we install support files (header files, libraries, scripts, etc.) used primarily when building compute node binaries for ZeptoOS. To perform this step, change to the top-level directory and type:
<pre>
$ python install.py /path/to/install
</pre>

This will install the following files:
<pre>
Install Home: /path/to/install
Creating install directories ...
Making /path/to/install/bin
Making /path/to/install/cnbin
Making /path/to/install/include
Making /path/to/install/lib
Making /path/to/install/lib/zoid

Installing MPICH2 scripts ...
/path/to/install/bin/zmpicc
/path/to/install/bin/zmpicxx
/path/to/install/bin/zmpif77
/path/to/install/bin/zmpif90
/path/to/install/bin/zmpixlc
/path/to/install/bin/zmpixlc_r
/path/to/install/bin/zmpixlcxx
/path/to/install/bin/zmpixlcxx_r
/path/to/install/bin/zmpixlf2003
/path/to/install/bin/zmpixlf2003_r
/path/to/install/bin/zmpixlf77
/path/to/install/bin/zmpixlf77_r
/path/to/install/bin/zmpixlf90
/path/to/install/bin/zmpixlf90_r
/path/to/install/bin/zmpixlf95
/path/to/install/bin/zmpixlf95_r

Installing Compute Node binaries ...
/path/to/install/cnbin/cn-ipfwd.sh
/path/to/install/cnbin/cn-ipfwd

Installing misc files ...
/path/to/install/bin/zelftool
/path/to/install/bin/zkparam.py

Installing MPICH2 headers ...
/path/to/install/include/mpe_thread.h
/path/to/install/include/mpi.h
/path/to/install/include/mpi.mod
/path/to/install/include/mpi_base.mod
/path/to/install/include/mpi_constants.mod
/path/to/install/include/mpi_sizeofs.mod
/path/to/install/include/mpicxx.h
/path/to/install/include/mpif.h
/path/to/install/include/mpio.h
/path/to/install/include/mpiof.h
/path/to/install/include/mpix.h

Installing MPICH2 libraries ...
/path/to/install/lib/libcxxmpich.zcl.a
/path/to/install/lib/libfmpich.zcl.a
/path/to/install/lib/libfmpich_.zcl.a
/path/to/install/lib/libmpich.zcl.a
/path/to/install/lib/libmpich.zclf90.a

Installing SPI libraries ...
/path/to/install/lib/libSPI.zcl.a
/path/to/install/lib/libzcl.a

Installing ZOID files ...
/path/to/install/lib/libzoid_cn.a
/path/to/install/lib/zoid/libc.a
/path/to/install/lib/zoid/libpthread.a

Installing DCMF files ...
/path/to/install/include/dcmf.h
/path/to/install/include/dcmf_collectives.h
/path/to/install/include/dcmf_coremath.h
/path/to/install/include/dcmf_globalcollectives.h
/path/to/install/include/dcmf_multisend.h
/path/to/install/include/dcmf_optimath.h
/path/to/install/lib/libdcmf.zcl.a
/path/to/install/lib/libdcmfcoll.zcl.a
</pre>

==Setting up a kernel profile==

The second part of the ZeptoOS installation is the process of setting up a ZeptoOS kernel profile.

Blue Gene systems are partitionable, meaning that the hardware can be split into multiple, largely independent, sub-units. More importantly, the system supports using different software stacks (what we call ''kernel profiles'') on different partitions at the same time. This means that one can safely experiment with ZeptoOS on one partition while other users are running production jobs using the default light-weight kernel on the rest of the machine.

A kernel profile consists of the following elements:

; loader
: A proprietary low-level bootstrap code, loaded onto all compute and I/O nodes.
; CN image list
: A series of images loaded onto each compute node.
; ION image list
: A series of images loaded onto each I/O node.

All the images are loaded onto the machine from the service node via the service (JTAG) network. The loader is loaded first, followed by the CN- and ION-specific images loaded in order.

The CN image list defaults to CNS followed by the CNK. The ION image list defaults to CNS, followed by the Linux kernel, followed by the Linux ramdisk. CNS stands for Common Node Services – it is a proprietary "firmware" which negotiates between the hardware and the kernel.

To enable ZeptoOS, we need to boot the Zepto CN kernel and ramdisk and the Zepto ION kernel and ramdisk in a partition that we want to use. The loader and the CNS are closed-source components, so ZeptoOS uses the same images for them as the default kernel profile.

In the remainder of this section we discuss how to assign and boot ZeptoOS-specific images.

===System using Cobalt===

If the BGP system in question has the [http://trac.mcs.anl.gov/projects/cobalt/ Cobalt] scheduler installed and its kernel profile feature has been configured properly, then using ZeptoOS there should be easy. All that is necessary is to make a subdirectory in the kernel profile directory and to create a couple of symbolic links that point to ZeptoOS images.

On Argonne BGP machines, <tt>/bgsys/argonne-utils/profiles/</tt> is the top-level kernel profile directory; individual profiles are stored in its subdirectories. Assuming that the ZeptoOS images have already been created (see [[Configuration]]), and that one has write permissions to the kernel profile directory, here are the steps to create a new kernel profile:

<pre>
$ cd <kernel_profile_dir>
$ mkdir <your_profile> && cd <your_profile>
$ ln -s <zepto_dir>/BGP-CN-zImage-with-initrd.elf CNK
$ ln -s <zepto_dir>/BGP-ION-zImage.elf INK
$ ln -s <zepto_dir>/BGP-ION-ramdisk-for-CNL.elf ramdisk
$ ln -s ../default/CNS
$ ln -s ../default/uloader
</pre>

'''Note:''' ensure that the ZeptoOS images are world-readable, otherwise jobs will fail to start. If needed, copy the images to the kernel profile rather than link to them.

For Argonne users, we provide a convenience script <tt>mkprofile-ANL.sh</tt> with some extra features. The following command line is equivalent to the above steps:

<pre>
$ cd <zepto_dir> && ./mkprofile-ANL.sh --profile=<your_profile>
</pre>

Invoking it with the <tt>-h</tt> option prints an overview of command line options. In particular, use <tt>-c</tt> if you prefer to copy images instead of making the links.

<pre>
$ ./mkprofile-ANL.sh -h
Usage: ./mkprofile-ANL.sh [OPTIONS]

Options:
-h : Show this message
-c : Copy images instead of making symbolic link
-f : Overwrite existing profile
--profile=name : Specify profile name
--cn=fn : Compute Node Kernel Image
--ion=fn : Specify I/O Node Kernel Image
--rd=fn : Specify I/O Node Ramdisk Image
--ls : show files in profile
--dryrun
</pre>

Once the ZeptoOS kernel profile is set up, it can be booted by specifying the profile name when submitting a job using either <tt>cqsub</tt> or <tt>qsub</tt>:

<pre>
$ cqsub -k <profile_name> ...
$ qsub --kernel <profile_name> ...
</pre>

Testing and troubleshooting ZeptoOS is discussed in detail in the [[Testing|next section]].

===Manual installation using the MMCS console===

If the system does not use Cobalt, we will need to resort to the Blue Gene Midplane Management Control System (MMCS). MMCS is a low-level control mechanism; administrator-level access permissions are required to use it.

Before we begin, please select a partition that we can experiment on. The partition should be unused; if some reservation system is in place, please reserve that partition (but do not boot it yet).

====Assigning ZeptoOS images to the partition====

Log on to the service node and start the MMCS console:

<pre>
$ ssh <service_node>
sn $ mmcs_db_console
connecting to mmcs_server
connected to mmcs_server
connected to DB2
mmcs$
</pre>

We begin by obtaining the current configuration of a partition, so that at the end we can revert it to the initial condition:

<pre>
mmcs$ getblockinfo <partition_name>
OK
boot info for block <partition_name>:
mloader: /bgsys/drivers/ppcfloor/boot/uloader
cnloadImg: /bgsys/drivers/ppcfloor/boot/cns,/bgsys/drivers/ppcfloor/boot/cnk
ioloadImg: /bgsys/drivers/ppcfloor/boot/cns,/bgsys/drivers/ppcfloor/boot/linux,/bgsys/drivers/ppcfloor/boot/ramdisk
status: F
mmcs$
</pre>

Now assign ZeptoOS images to the partition:

<pre>
mmcs$ setblockinfo <partition_name> /bgsys/drivers/ppcfloor/boot/uloader \
/bgsys/drivers/ppcfloor/boot/cns,<zepto_dir>/BGP-CN-zImage-with-initrd.elf \
/bgsys/drivers/ppcfloor/boot/cns,<zepto_dir>/BGP-ION-zImage.elf,<zepto_dir>/BGP-ION-ramdisk-for-CNL.elf
mmcs$ quit
</pre>

====Booting ZeptoOS====

Once you have correctly configured a partition, the ZeptoOS images will be loaded when you run a job on that partition (via <tt>mpirun</tt>, for example):

<pre>
$ mpirun -verbose 1 -partition <partition_name> -np 64 -timeout 600 -cwd $PWD -exe ./a.out
</pre>

Testing and troubleshooting ZeptoOS is discussed in detail in the [[Testing|next section]].

====Restore the original configuration====

'''Do not forget this step!'''

After you are finished using ZeptoOS, you need to restore the partition to its original configuration:

<pre>
$ ssh <service_node>
sn $ mmcs_db_console
connecting to mmcs_server
connected to mmcs_server
connected to DB2
mmcs$ setblockinfo <partition_name> /bgsys/drivers/ppcfloor/boot/uloader \
/bgsys/drivers/ppcfloor/boot/cns,/bgsys/drivers/ppcfloor/boot/cnk \
/bgsys/drivers/ppcfloor/boot/cns,/bgsys/drivers/ppcfloor/boot/linux,/bgsys/drivers/ppcfloor/boot/ramdisk
mmcs$ quit
</pre>

----
[[Configuration]] | [[ZeptoOS_Documentation|Top]] | [[Testing]]

FAQ

2009-05-07T21:21:52Z

Iskra:

[[ZeptoOS_Documentation|Top]]
----

==How to obtain a CN node number==

This depends on what number one is interested in.

===Pset rank===

A pset rank is a number identifying a compute node within each ''pset'' (an I/O node and the compute nodes that communicate with it). Note that on partitions larger than one pset, the pset ranks will not be unique. Also, pset ranks do ''not'' start from <tt>0</tt>; they start from <tt>1</tt> for some mysterious reason (do not blame us – blame IBM :-).

Pset rank is used as the last octet in the IP address on the tree network connecting the compute nodes and the I/O nodes (''x'' in <tt>192.168.1.</tt>''x'').

The pset rank is available on the compute nodes from <tt>/proc/personality.sh</tt>, in the <tt>BG_RANK_IN_PSET</tt> variable:

<pre>
#!/bin/sh

. /proc/personality.sh

echo "My pset rank is $BG_RANK_IN_PSET"
</pre>
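
Because the pset rank doubles as the last octet of the tree network address, a script can derive the address of a compute node from its rank. The following is a minimal sketch; the function name <tt>cn_tree_ip</tt> is hypothetical, and the only assumption is the <tt>192.168.1.</tt>''x'' scheme described above.

```shell
# Hypothetical helper: print the tree-network IP address of the compute
# node with pset rank $1, using the 192.168.1.x scheme described above.
cn_tree_ip() {
    # Reject empty or non-numeric arguments.
    case $1 in
        ''|*[!0-9]*) echo "invalid pset rank: $1" >&2; return 1 ;;
    esac
    # Pset ranks start from 1, and the value must fit in a single octet.
    if [ "$1" -lt 1 ] || [ "$1" -gt 255 ]; then
        echo "pset rank out of range: $1" >&2
        return 1
    fi
    echo "192.168.1.$1"
}
```

On a compute node, after sourcing <tt>/proc/personality.sh</tt>, the node's own address would be <tt>cn_tree_ip "$BG_RANK_IN_PSET"</tt>.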

From a C program it will be easier to use the binary personality available from <tt>/proc/personality</tt>. The definition of the structure can be found in <tt>/bgsys/drivers/ppcfloor/arch/include/common/bgp_personality.h</tt>. The pset rank is in <tt>Network_Config.RankInPSet</tt>:

<pre>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <common/bgp_personality.h>

int main(void)
{
_BGP_Personality_t personality;
int fd;

if ((fd = open("/proc/personality", O_RDONLY)) == -1)
{
perror("open");
return 1;
}
if (read(fd, &personality, sizeof(personality)) != sizeof(personality))
{
perror("read");
close(fd);
return 1;
}
close(fd);

printf("My pset rank is %d\n", personality.Network_Config.RankInPSet);

return 0;
}
</pre>

(compile the above with <tt>-I/bgsys/drivers/ppcfloor/arch/include</tt>)

===Torus rank===

A torus rank is a number identifying a compute node within a whole partition. In a way, it is much "nicer" than a pset rank since it is unique within a job and it also starts from <tt>0</tt>.

The torus rank is easy to obtain from a C program: it is the <tt>Network_Config.Rank</tt> field of the personality structure.

Unfortunately, the torus rank is not available in <tt>/proc/personality.sh</tt>, but a shell script can easily calculate it from other fields:

<pre>
TORUS_RANK=`echo $BG_PSETORG | awk "{print \\$1 + \\$2 * $BG_XSIZE + \
\\$3 * $BG_XSIZE * $BG_YSIZE}"`
</pre>
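
The same formula can be written with shell arithmetic, which avoids the nested quoting. The following is a minimal sketch; the function name <tt>torus_rank</tt> is hypothetical, and it assumes, as the <tt>awk</tt> command above does, that <tt>BG_PSETORG</tt> holds three whitespace-separated coordinates.

```shell
# Hypothetical helper implementing the formula used above:
#   rank = x + y * XSIZE + z * XSIZE * YSIZE
# Usage: torus_rank <x> <y> <z> <xsize> <ysize>
torus_rank() {
    echo $(( $1 + $2 * $4 + $3 * $4 * $5 ))
}
```

It would be called as <tt>torus_rank $BG_PSETORG $BG_XSIZE $BG_YSIZE</tt>, leaving <tt>$BG_PSETORG</tt> unquoted so that it splits into three arguments. As a worked example, coordinates (1, 2, 3) with an X size of 8 and a Y size of 8 give 1 + 2 * 8 + 3 * 8 * 8 = 209.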

===MPI rank===

MPI rank should not be confused with a torus rank, even though by default the two are the same. MPI rank is a property of a process, ''not'' of a node. If one submits a job in the <tt>VN</tt> or <tt>DUAL</tt> mode, there will be multiple MPI tasks per node, each with a different MPI rank. Also, using the <tt>BG_MAPPING</tt> environment variable changes the mapping between the torus coordinates and MPI ranks.

While obtaining the MPI rank from an MPI application is trivial, how can one obtain it from a shell script?

One way would be to invoke a simple C program:

<pre>
#include <stdio.h>
#include "zoid_api.h"

int main(void)
{
if (__zoid_init())
return 1;
printf("%d\n", __zoid_my_rank());
return 0;
}
</pre>

(compile with <tt>-I</tt>''path_to_ZeptoOS''<tt>/packages/zoid/prebuilt -L</tt>''path_to_ZeptoOS''<tt>/packages/zoid/prebuilt -lzoid_cn</tt>)

A slight disadvantage of this approach is that <tt>__zoid_init</tt> registers the process with the ZOID daemon on the I/O node, which is an overhead we do not need. Another solution, without using any binaries, is as follows:

<pre>
MPI_RANK=`echo $CONTROL_INIT | awk -F, '{print $4}'`
</pre>

This has a disadvantage of using internal ZOID variables which are not guaranteed to be supported in future releases.

==How to open a socket from a CN to the outside world==

ZOID provides IP packet forwarding between the compute nodes and the I/O nodes. However, because the compute nodes use non-routable IP addresses (<tt>192.168.1.</tt>''x''), they cannot communicate directly with the outside world.

The most transparent solution to this problem is to perform network address translation (NAT) on the I/O nodes using the Linux kernel netfilter infrastructure. We used to enable this by default, but experiments have shown it to have a detrimental effect on the overall performance of the TCP/IP stack on the I/O nodes, slowing down access to the network filesystems.

To enable the translation, pass the <tt>ZOID_NAT_ENABLE</tt> environment variable when submitting a job. An administrator can also enable this option permanently in the [[ZOID#opt_enable_nat|config file]].

==How to obtain a Cobalt job ID==

Cobalt passes the job id to the application processes launched on the compute nodes using the <tt>COBALT_JOBID</tt> environment variable.

This variable is also accessible from the [[ZOID#User script|user script]] running on the I/O nodes, using the <tt>ZOID_JOB_ENV</tt> variable:

<pre>
COBALT_JOBID=`echo $ZOID_JOB_ENV | sed 's/^.*COBALT_JOBID=\([^:]*\).*$/\1/'`
</pre>
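
The <tt>sed</tt> expression above relies on the entries in <tt>ZOID_JOB_ENV</tt> being separated by colons. Under that assumption, which is inferred from the expression rather than taken from the ZOID documentation, a generic lookup can be sketched as follows; the function name <tt>job_env_get</tt> is hypothetical.

```shell
# Hypothetical helper: print the value of variable $1 from a
# colon-separated list of NAME=value entries passed as $2.
job_env_get() {
    printf '%s\n' "$2" | tr ':' '\n' | while IFS= read -r entry; do
        case $entry in
            # Match the whole name, so FOO does not match XFOO.
            "$1"=*) printf '%s\n' "${entry#*=}"; break ;;
        esac
    done
}
```

In a user script this would be used as <tt>COBALT_JOBID=`job_env_get COBALT_JOBID "$ZOID_JOB_ENV"`</tt>. Unlike the <tt>sed</tt> version, it prints nothing when the variable is absent.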

==Why large MPI processes do not work==

A common reason might be that they do not have enough memory to run. MPI processes run within the big memory region, which by default is limited to just 256 MB so as not to deplete the ordinary Linux paged memory pool too much (main memory is allocated to the big memory region at boot time and cannot be reclaimed by the kernel, even if it is unused).

See the [[Kernel#Kernel (command line) parameters|Kernel]] section to learn how to increase the limit; the parameter to use is <tt>flatmemsizeMB</tt>. We suggest creating multiple profiles with different big memory sizes to accommodate different uses of ZeptoOS.

==Why SSH keeps asking for a password==

As we envisioned it, partition owners should be able to log on to the I/O nodes belonging to their jobs without being asked for a password. The following considerations apply:
# The account information for the partition owner must be added to the <tt>/etc/passwd</tt> file on the I/O nodes when launching a job; this is discussed [[ZOID#The /bin.rd/update_passwd_file.sh file|here]].
# For password-less logins, <tt>shosts.equiv</tt> must be configured before (re)building the I/O node ramdisk, as discussed [[Testing#Interactive login|here]]. Alternatively, users could set up SSH key pairs in their home directories (password-less, or using <tt>ssh-agent</tt> to cache the password).
# SSH might temporarily prevent a partition owner from logging in if an attempt is made before the job starts running, as discussed [[Testing#Interactive login|here]]. Root can always log in, by providing the password set when building the I/O node ramdisk for the first time.
# Finally, keep in mind that a particular site might have disabled this feature on purpose.

----
[[ZeptoOS_Documentation|Top]]

FAQ

2009-05-07T20:51:30Z

Iskra: /* Why large MPI processes do not work */

[[ZeptoOS_Documentation|Top]]
----

==How to obtain a CN node number==

This depends on what number one is interested in.

===Pset rank===

A pset rank is a number identifying a compute node within each ''pset'' (an I/O node and the compute nodes that communicate with it). Note that on partitions larger than one pset, the pset ranks will not be unique. Also, pset ranks do ''not'' start from <tt>0</tt>; they start from <tt>1</tt> for some mysterious reason (do not blame us – blame IBM :-).

Pset rank is used as the last octet in the IP address on the tree network connecting the compute nodes and the I/O nodes (''x'' in <tt>192.168.1.</tt>''x'').

The pset rank is available on the compute nodes from <tt>/proc/personality.sh</tt>, in the <tt>BG_RANK_IN_PSET</tt> variable:

<pre>
#!/bin/sh

. /proc/personality.sh

echo "My pset rank is $BG_RANK_IN_PSET"
</pre>

From a C program it will be easier to use the binary personality available from <tt>/proc/personality</tt>. The definition of the structure can be found in <tt>/bgsys/drivers/ppcfloor/arch/include/common/bgp_personality.h</tt>. The pset rank is in <tt>Network_Config.RankInPSet</tt>:

<pre>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <common/bgp_personality.h>

int main(void)
{
_BGP_Personality_t personality;
int fd;

if ((fd = open("/proc/personality", O_RDONLY)) == -1)
{
perror("open");
return 1;
}
if (read(fd, &personality, sizeof(personality)) != sizeof(personality))
{
perror("read");
close(fd);
return 1;
}
close(fd);

printf("My pset rank is %d\n", personality.Network_Config.RankInPSet);

return 0;
}
</pre>

(compile the above with <tt>-I/bgsys/drivers/ppcfloor/arch/include</tt>)

===Torus rank===

A torus rank is a number identifying a compute node within a whole partition. In a way, it is much "nicer" than a pset rank since it is unique within a job and it also starts from <tt>0</tt>.

The torus rank is easy to obtain from a C program: it is the <tt>Network_Config.Rank</tt> field of the personality structure.

Unfortunately, the torus rank is not available in <tt>/proc/personality.sh</tt>, but a shell script can easily calculate it from other fields:

<pre>
TORUS_RANK=`echo $BG_PSETORG | awk "{print \\$1 + \\$2 * $BG_XSIZE + \
\\$3 * $BG_XSIZE * $BG_YSIZE}"`
</pre>

===MPI rank===

MPI rank should not be confused with a torus rank, even though by default the two are the same. MPI rank is a property of a process, ''not'' of a node. If one submits a job in the <tt>VN</tt> or <tt>DUAL</tt> mode, there will be multiple MPI tasks per node, each with a different MPI rank. Also, using the <tt>BG_MAPPING</tt> environment variable changes the mapping between the torus coordinates and MPI ranks.

While obtaining the MPI rank from an MPI application is trivial, how can one obtain it from a shell script?

One way would be to invoke a simple C program:

<pre>
#include <stdio.h>
#include "zoid_api.h"

int main(void)
{
if (__zoid_init())
return 1;
printf("%d\n", __zoid_my_rank());
return 0;
}
</pre>

(compile with <tt>-I</tt>''path_to_ZeptoOS''<tt>/packages/zoid/prebuilt -L</tt>''path_to_ZeptoOS''<tt>/packages/zoid/prebuilt -lzoid_cn</tt>)

A slight disadvantage of this approach is that <tt>__zoid_init</tt> registers the process with the ZOID daemon on the I/O node, which is an overhead we do not need. Another solution, without using any binaries, is as follows:

<pre>
MPI_RANK=`echo $CONTROL_INIT | awk -F, '{print $4}'`
</pre>

This has the disadvantage of relying on internal ZOID variables, which are not guaranteed to be supported in future releases.

==How to open a socket from a CN to the outside world==

ZOID provides IP packet forwarding between the compute nodes and the I/O nodes. However, because the compute nodes use non-routable IP addresses (<tt>192.168.1.</tt>''x''), they cannot communicate directly with the outside world.

The most transparent solution to this problem is to perform network address translation (NAT) on the I/O nodes using the Linux kernel netfilter infrastructure. We used to enable this by default, but experiments have shown it to have a detrimental effect on the overall performance of the TCP/IP stack on the I/O nodes, slowing down access to the network filesystems.

To enable the translation, pass the <tt>ZOID_NAT_ENABLE</tt> environment variable when submitting a job. An administrator can also enable this option permanently in the [[ZOID#opt_enable_nat|config file]].

==How to obtain a Cobalt job ID==

Cobalt passes the job id to the application processes launched on the compute nodes using the <tt>COBALT_JOBID</tt> environment variable.

This variable is also accessible from the [[ZOID#User script|user script]] running on the I/O nodes, using the <tt>ZOID_JOB_ENV</tt> variable:

<pre>
COBALT_JOBID=`echo $ZOID_JOB_ENV | sed 's/^.*COBALT_JOBID=\([^:]*\).*$/\1/'`
</pre>

==Why large MPI processes do not work==

A common reason might be that they do not have enough memory to run. MPI processes run within the big memory region, which by default is limited to just 256 MB so as not to deplete the ordinary Linux paged memory pool too much (main memory is allocated to the big memory region at boot time and it cannot be reclaimed by the kernel, even if it were unused).

See the [[Kernel#Kernel (command line) parameters|Kernel]] section to learn how to increase the limit; the parameter to use is <tt>flatmemsizeMB</tt>. We suggest creating multiple profiles with different big memory sizes to accommodate different uses of ZeptoOS.

----
[[ZeptoOS_Documentation|Top]]

FAQ

2009-05-07T20:48:33Z

Iskra: /* Why large MPI processes do not work */

[[ZeptoOS_Documentation|Top]]
----

==How to obtain a CN node number==

This depends on what number one is interested in.

===Pset rank===

A pset rank is a number identifying a compute node within each ''pset'' (an I/O node and the compute nodes that communicate with it). Note that on partitions larger than one pset, the pset ranks will not be unique. Also, pset ranks do ''not'' start from <tt>0</tt>; they start from <tt>1</tt> for some mysterious reason (do not blame us – blame IBM :-).

Pset rank is used as the last octet in the IP address on the tree network connecting the compute nodes and the I/O nodes (''x'' in <tt>192.168.1.</tt>''x'').

The pset rank is available on the compute nodes from <tt>/proc/personality.sh</tt>, in the <tt>BG_RANK_IN_PSET</tt> variable:

<pre>
#!/bin/sh

. /proc/personality.sh

echo "My pset rank is $BG_RANK_IN_PSET"
</pre>
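
Because the pset rank doubles as the last octet of the tree network address, a script can derive a node's address from it. The following is a minimal sketch: <tt>pset_rank_to_ip</tt> is a name made up for this example, and the range check (1 to 254) is an assumption about what a valid last octet looks like, not something ZeptoOS is known to enforce.

```shell
# pset_rank_to_ip RANK
# Prints the tree network address 192.168.1.RANK for a pset rank.
# Hypothetical helper, not part of ZeptoOS.
pset_rank_to_ip() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    # Assumed sanity range for the last octet; pset ranks start from 1.
    if [ "$1" -lt 1 ] || [ "$1" -gt 254 ]; then
        return 1
    fi
    printf '192.168.1.%s\n' "$1"
}
```

On a compute node this could be used as <tt>. /proc/personality.sh; pset_rank_to_ip "$BG_RANK_IN_PSET"</tt>.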

From a C program it will be easier to use the binary personality available from <tt>/proc/personality</tt>. The definition of the structure can be found in <tt>/bgsys/drivers/ppcfloor/arch/include/common/bgp_personality.h</tt>. The pset rank is in <tt>Network_Config.RankInPSet</tt>:

<pre>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <common/bgp_personality.h>

int main(void)
{
_BGP_Personality_t personality;
int fd;

if ((fd = open("/proc/personality", O_RDONLY)) == -1)
{
perror("open");
return 1;
}
if (read(fd, &personality, sizeof(personality)) != sizeof(personality))
{
perror("read");
close(fd);
return 1;
}
close(fd);

printf("My pset rank is %d\n", personality.Network_Config.RankInPSet);

return 0;
}
</pre>

(compile the above with <tt>-I/bgsys/drivers/ppcfloor/arch/include</tt>)

===Torus rank===

A torus rank is a number identifying a compute node within a whole partition. In a way, it is much "nicer" than a pset rank since it is unique within a job and it also starts from <tt>0</tt>.

The torus rank is easy to obtain from a C program: it is the <tt>Network_Config.Rank</tt> field of the personality structure.

Unfortunately, the torus rank is not available in <tt>/proc/personality.sh</tt>, but a shell script can easily calculate it from other fields:

<pre>
TORUS_RANK=`echo $BG_PSETORG | awk "{print \\$1 + \\$2 * $BG_XSIZE + \
\\$3 * $BG_XSIZE * $BG_YSIZE}"`
</pre>
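
The same calculation can also be written with POSIX shell arithmetic instead of <tt>awk</tt>. This is only a sketch: <tt>torus_rank</tt> is a hypothetical helper name, and it assumes, as the <tt>awk</tt> version above does, that the three coordinates are available as separate whitespace-delimited values.

```shell
# torus_rank X Y Z XSIZE YSIZE
# Prints X + Y * XSIZE + Z * XSIZE * YSIZE, the formula used above.
# Hypothetical helper, not part of ZeptoOS.
torus_rank() {
    [ "$#" -eq 5 ] || return 1    # all five values are required
    echo $(( $1 + $2 * $4 + $3 * $4 * $5 ))
}
```

After sourcing <tt>/proc/personality.sh</tt>, a call such as <tt>torus_rank $BG_PSETORG "$BG_XSIZE" "$BG_YSIZE"</tt> would pass the values through; <tt>$BG_PSETORG</tt> is deliberately left unquoted there so that it splits into three arguments.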

===MPI rank===

MPI rank should not be confused with torus rank, even though by default the two are the same. MPI rank is a property of a process, ''not'' of a node. If one submits a job in the <tt>VN</tt> or <tt>DUAL</tt> mode, there will be multiple MPI tasks per node, each with a different MPI rank. Also, using the <tt>BG_MAPPING</tt> environment variable changes the mapping between the torus coordinates and MPI ranks.

While obtaining the MPI rank from an MPI application is trivial, how does one obtain it from a shell script?

One way would be to invoke a simple C program:

<pre>
#include <stdio.h>
#include "zoid_api.h"

int main(void)
{
if (__zoid_init())
return 1;
printf("%d\n", __zoid_my_rank());
return 0;
}
</pre>

(compile with <tt>-I</tt>''path_to_ZeptoOS''<tt>/packages/zoid/prebuilt -L</tt>''path_to_ZeptoOS''<tt>/packages/zoid/prebuilt -lzoid_cn</tt>)

A slight disadvantage of this approach is that <tt>__zoid_init</tt> registers the process with the ZOID daemon on the I/O node, which is an overhead we do not need. Another solution, without using any binaries, is as follows:

<pre>
MPI_RANK=`echo $CONTROL_INIT | awk -F, '{print $4}'`
</pre>

This has the disadvantage of relying on internal ZOID variables, which are not guaranteed to be supported in future releases.
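
Since the variable is internal, a script may want to check that the extracted field really looks like a rank before using it. The sketch below does that; <tt>mpi_rank_from_control_init</tt> is a hypothetical name, and the only thing assumed about <tt>CONTROL_INIT</tt> is what the one-liner above already assumes, namely that the rank is the fourth comma-separated field.

```shell
# mpi_rank_from_control_init STRING
# Prints the fourth comma-separated field of STRING if it is a
# non-negative integer; returns 1 otherwise.
# Hypothetical helper, not part of ZOID.
mpi_rank_from_control_init() {
    # -s suppresses input that contains no comma at all
    rank=$(printf '%s\n' "$1" | cut -s -d, -f4)
    case $rank in
        ''|*[!0-9]*) return 1 ;;   # missing or non-numeric field
    esac
    printf '%s\n' "$rank"
}
```

A caller could then write <tt>MPI_RANK=`mpi_rank_from_control_init "$CONTROL_INIT"` || exit 1</tt> and fail early if a future release changes the layout of the variable.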

==How to open a socket from a CN to the outside world==

ZOID provides IP packet forwarding between the compute nodes and the I/O nodes. However, because the compute nodes use non-routable IP addresses (<tt>192.168.1.</tt>''x''), they cannot communicate directly with the outside world.

The most transparent solution to this problem is to perform network address translation (NAT) on the I/O nodes using the Linux kernel netfilter infrastructure. We used to enable this by default, but experiments have shown it to have a detrimental effect on the overall performance of the TCP/IP stack on the I/O nodes, slowing down access to the network filesystems.

To enable the translation, pass the <tt>ZOID_NAT_ENABLE</tt> environment variable when submitting a job. An administrator can also enable this option permanently in the [[ZOID#opt_enable_nat|config file]].

==How to obtain a Cobalt job ID==

Cobalt passes the job id to the application processes launched on the compute nodes using the <tt>COBALT_JOBID</tt> environment variable.

This variable is also accessible from the [[ZOID#User script|user script]] running on the I/O nodes, using the <tt>ZOID_JOB_ENV</tt> variable:

<pre>
COBALT_JOBID=`echo $ZOID_JOB_ENV | sed 's/^.*COBALT_JOBID=\([^:]*\).*$/\1/'`
</pre>
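
A slightly more general variant can look up any variable by name. This is a sketch under the assumption, suggested by the <tt>[^:]</tt> pattern above, that <tt>ZOID_JOB_ENV</tt> is a colon-separated list of <tt>NAME=value</tt> entries whose values contain no colons; <tt>job_env_get</tt> is a hypothetical helper name.

```shell
# job_env_get NAME ENVSTRING
# Prints the value of NAME from a colon-separated list of NAME=value
# entries; prints nothing if NAME is absent.
# Hypothetical helper, not part of ZOID.
job_env_get() {
    printf '%s\n' "$2" |
        tr ':' '\n' |              # one NAME=value entry per line
        sed -n "s/^$1=//p" |       # keep only the matching entry's value
        head -n 1
}
```

In a user script this could be called as <tt>job_env_get COBALT_JOBID "$ZOID_JOB_ENV"</tt>. Unlike a plain <tt>sed</tt> substitution, which prints its input unchanged when nothing matches, this prints an empty string if the variable is missing.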

==Why large MPI processes do not work==

A common reason might be that they do not have enough memory to run. MPI processes run within the Big Memory region, which by default is limited to just 256 MB so as not to deplete the ordinary Linux paged memory pool too much (main memory is allocated to the Big Memory region at boot time and it cannot be used for paged memory, even if it were unused).

See the [[Kernel#Kernel (command line) parameters|Kernel]] section to learn how to increase the limit; the parameter to use is <tt>flatmemsizeMB</tt>.

----
[[ZeptoOS_Documentation|Top]]

MPICH, DCMF, and SPI

2009-05-07T20:45:09Z

Iskra: /* ZCB and Big memory */

[[Testing]] | [[ZeptoOS_Documentation|Top]] | [[Kernel]]
----

==Introduction==

To support high performance computing (HPC) applications, specifically MPI applications, we have ported IBM CNK's communication software stack to the ZeptoOS compute node Linux environment. The MPICH version used in this ZeptoOS release is mpich2-1.0.7 with IBM patches. It is reasonably stable, and the performance of MPI applications on the ZeptoOS compute node Linux is comparable to that on CNK. While there are some limitations at the moment, there are benefits as well.

Benefits:
* No limitation on the number of threads
** 4 or more OpenMP threads per node
** Additional threads for I/O or background tasks
* It is Linux!
** Debugging tools such as gdb, strace, etc
** Various file systems, such as ramfs

Current limitations:
* Only the SMP mode is supported
* Shared libraries are not yet provided
* No binary compatibility between CNK and ZeptoOS CN Linux

We will support a VN-equivalent mode (multiple MPI tasks per node) and provide shared libraries in a future release.

As in the IBM CNK environment, the Deep Computing Messaging Framework (DCMF) and the System Programming Interface (SPI) are available, so it is possible to write DCMF or SPI code directly if necessary. DCMF is a communication library that provides non-blocking operations. Please refer to the [http://dcmf.anl-external.org/wiki/index.php/Main_Page DCMF wiki] for details. We are using DCMF version 1.0.0 in the current ZeptoOS release, which is older than the DCMF in the current driver release (V1R3M0). SPI is the lowest-level user space API for the torus DMA, the collective network, BGP-specific lock mechanisms, and other compute node specific features. There is no public documentation on SPI available at the moment, but almost all header files and source code are available. Internally, MPICH depends on DCMF, which in turn depends on SPI. We will say more about it [[#Software stack layout|later]].

===ZCB and Big memory===

MPI applications running under the ZeptoOS compute node environment (technically, any applications that require DMA operations or maximum memory bandwidth) need to be configured as Zepto Compute Binaries (ZCB). This is done using <tt>zelftool</tt>, which is invoked behind the scenes when linking a binary with the ZeptoOS MPI compiler wrapper scripts (<tt>zmpicc</tt>, etc.).

The ZeptoOS compute node kernel treats ZCB executables differently from ordinary processes. It creates a special memory mapping region called big memory, which is covered by large pages with semi-static TLB entries, and it loads all application sections into the big memory region. The big memory region has virtually no TLB misses, and it also enables DMA operations.

Some system calls will not work correctly if used from a ZCB process, in particular <tt>fork</tt> (but creating threads does work). Also, because big memory is a separate region set up at kernel boot time, its size is fixed. It is set to 256 MB by default, which could be too small for larger MPI processes; it can be [[FAQ#Why large MPI processes do not work|increased]] before booting a partition, at the expense of the ordinary Linux paged memory.

==Compiling HPC applications==

While the same compilers can be used as for IBM CNK, the ZeptoOS compute node environment requires linking with ZeptoOS-specific communication libraries (applications linked with the CNK MPI will not work on ZeptoOS).

===Compiler wrapper scripts===

We provide compiler wrapper scripts which automatically link with appropriate libraries from the ZeptoOS installation directory. We provide the same set of wrapper scripts that IBM provides, with an extra <tt>z</tt> prefix:

; zmpicc, zmpicxx, zmpif77, zmpif90
: Wrapper scripts that invoke BGP-enhanced GNU compilers

; zmpixlc, zmpixlcxx, zmpixlf2003, zmpixlf77, zmpixlf90, zmpixlf95
: Wrapper scripts that invoke IBM XL compilers

; zmpixlc_r, zmpixlcxx_r, zmpixlf2003_r, zmpixlf77_r, zmpixlf90_r, zmpixlf95_r
: Wrapper scripts that invoke IBM XL compilers (thread safe compilation for OpenMP)

To get insight into the internals of these scripts, invoke them with the <tt>-show</tt> option.

====A compilation example====

Understanding a program's build system might take some time, but compiling a program for ZeptoOS requires nothing special.

Here is a real-world example of how to build the well-known [http://climate.lanl.gov/Models/POP/ Parallel Ocean Program (POP)].

<pre>
$ wget http://climate.lanl.gov/Models/POP/POP_2.0.1.tar.Z
$ tar xvfz POP_2.0.1.tar.Z && cd pop
$ ./setup_run_dir ztest && cd ztest
$ edit ibm_mpi.gnu # see the patch below
$ export ARCHDIR=ibm_mpi
$ make # takes a while
$ edit pop_in # test data set
- nprocs_clinic = 4
- nprocs_tropic = 4
+ nprocs_clinic = 64
+ nprocs_tropic = 64
$ cqsub -n 64 -t 10 -k <zepto_profile> ./pop

--------------------
--- orig/ibm_mpi.gnu 2009-04-15 15:01:58.666457601 -0500
+++ ztest/ibm_mpi.gnu 2009-04-15 14:17:58.099132435 -0500
@@ -6,17 +6,18 @@
# will someday be a file which is a cookbook in Q&A style: "How do I do X?"
# is followed by something like "Go to file Y and add Z to line NNN."
#
-FC = mpxlf90_r
-LD = mpxlf90_r
-CC = mpcc_r
-Cp = /usr/bin/cp
-Cpp = /usr/ccs/lib/cpp -P
+ZPATH=<zepto_dir>
+FC = $(ZPATH)/zmpixlf90
+LD = $(ZPATH)/zmpixlf90
+CC = $(ZPATH)/zmpixlc
+Cp = /bin/cp
+Cpp = /usr/bin/cpp -P
AWK = /usr/bin/awk
-ABI = -q64
+#ABI = -q64
COMMDIR = mpi

-NETCDFINC = -I/usr/local/include
-NETCDFLIB = -L/usr/local/lib
+NETCDFINC = -I/soft/apps/netcdf-4.0/include/
+NETCDFLIB = -L/soft/apps/netcdf-4.0/lib

# Enable MPI library for parallel code, yes/no.

@@ -58,7 +59,8 @@
#
#----------------------------------------------------------------------------

-FBASE = $(ABI) -qarch=auto -qnosave -bmaxdata:0x80000000 $(NETCDFINC) -I$(ObjDepDir)
+#FBASE = $(ABI) -qarch=auto -qnosave -bmaxdata:0x80000000 $(NETCDFINC) -I$(ObjDepDir)
+FBASE = $(ABI) -qarch=auto -qnosave $(NETCDFINC) -I$(ObjDepDir)

ifeq ($(TRAP_FPE),yes)
FBASE := $(FBASE) -qflttrap=overflow:zerodivide:enable -qspillsize=32704

</pre>

===Compiling without the wrapper scripts===

When invoking the compiler directly, make sure that the Makefile or build environment points to the ZeptoOS header files and libraries. An example would be:

<pre>
$ /bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc-bgp-linux-gcc \
-o mpi-test-linux -Wall -O3 -I<zepto_dir>/include mpi-test.c \
-L<zepto_dir>/lib -lmpich.zcl -ldcmfcoll.zcl -ldcmf.zcl -lSPI.zcl -lzcl \
-lzoid_cn -lrt -lpthread -lm
$ <zepto_dir>/bin/zelftool -e mpi-test-linux
</pre>

'''Notes:'''
* Replace <tt><zepto_dir></tt> with your actual ZeptoOS install path.
* Do not forget to call the <tt>zelftool</tt> utility, which makes the executable a Zepto Compute Binary.

==Building MPICH, DCMF, and SPI libraries==

We provide all the necessary source code to build MPICH, DCMF, and SPI. To build these libraries, just type:

<pre>
$ make -C comm rebuild-target
</pre>

It may take half an hour to an hour to complete the build process, depending on what file system you are using (e.g., GPFS is a lot slower than a local file system).

The <tt>rebuild-target</tt> target does not know anything about the existing installation directory; it only copies the built libraries and header files to the <tt>comm/tmp</tt> directory. To install the newly built libraries, do the following:

<pre>
$ make -C comm update-prebuilt
$ python install.py <zepto_dir>
</pre>

The <tt>update-prebuilt</tt> target copies the files from the <tt>comm/tmp</tt> directory to the <tt>comm/prebuilt</tt> directory, which is where the <tt>install.py</tt> script looks for the files to copy to <tt><zepto_dir></tt>.

==Software stack layout==

[[Image:Zepto-Comm-Stack.png|right]]

The figure on the right depicts the layout of the communication software stack in the ZeptoOS compute node environment. This is very similar to the IBM CNK's stack, with the exception of an extra ZEPTO SPI layer, and the use of Linux instead of CNK.

Since MPICH is a well-known software package we will not discuss it here, but we will briefly describe the DCMF and SPI components:

* DCMF
** Stands for Deep Computing Messaging Framework
** Originally developed by IBM for the Blue Gene architecture
** Hardware initialization, query functions
** Supports BGP Torus DMA, collective network
** Provides timer
** Supports non-blocking collective operations
** BGP MPICH uses DCMF internally (IBM provides a glue layer)
* SPI
** Stands for System Programming Interface
** Developed by IBM. BGP-specific code.
** Kernel interfaces – DMA control, lockbox, etc
** DMA related definitions
*** can be used in both user space and kernel space
** RAS, BGP personality, mapping related functions

BGP SPI was designed specifically for IBM CNK, so it is not compatible with Linux. ZEPTO SPI is a thin software layer that absorbs the differences between the CNK and Linux or drops the requests that Linux cannot handle.

==Source code==

The source code and header files of DCMF and SPI can be found in the <tt>comm</tt> directory. The source code of MPICH is in <tt>DCMF/lib/mpich2/mpich2-1.0.7.tar.gz</tt>, which will be extracted at build time.

The DCMF source code is located in <tt>DCMF/sys/</tt>. DCMF core source code is in <tt>DCMF/sys/messaging/</tt>. Component Collective Messaging Interface (CCMI) is part of DCMF and its source code is in <tt>DCMF/sys/collectives/</tt>. Test codes can be found in <tt>DCMF/sys/collectives/tests/</tt> for CCMI and in <tt>DCMF/sys/messaging/tests/</tt> for the DCMF core. These test codes are good examples of DCMF/CCMI programming.

SPI headers are in <tt>arch-runtime/arch/</tt> and SPI source code is in <tt>arch-runtime/runtime/</tt> (like the other paths here, relative to the <tt>comm</tt> directory). The source code of the ZEPTO SPI layer is in <tt>arch-runtime/zcl_spi/</tt>, while the header files are in <tt>arch-runtime/arch/include/zepto/</tt>.

Here is an overview of the directory tree:

<pre>
comm
|-- DCMF
| |-- lib
| | |-- dev
| | `-- mpich2
| | `-- make
| |-- sys
| | |-- collectives
| | | |-- adaptor
| | | |-- kernel
| | | |-- tests
| | | `-- tools
| | |-- include
| | `-- messaging
| | |-- devices
| | |-- messager
| | |-- protocols
| | |-- queueing
| | |-- sysdep
| | `-- tests
|-- arch-runtime
| |-- arch
| | `-- include
| | |-- bpcore
| | |-- cnk
| | |-- common
| | |-- spi
| | `-- zepto
| |-- runtime
| |-- testcodes
| `-- zcl_spi
`-- testcodes
</pre>

===Debug output===

ZeptoOS versions of SPI and DCMF have built-in debug output. The output is disabled by default and can be enabled by setting the environment variable <tt>ZEPTO_TRACE</tt> when submitting a job. The integer value of the variable indicates the debug level (a higher number results in more debug output).

An example:
<pre>
$ cqsub -k <zepto_profile> -n 64 -t 10 ... -e ZEPTO_TRACE=2 ./a.out
</pre>

----
[[Testing]] | [[ZeptoOS_Documentation|Top]] | [[Kernel]]

Testing

2009-05-07T20:43:51Z

Iskra: /* Submitting */

[[Installation]] | [[ZeptoOS_Documentation|Top]] | [[MPICH, DCMF, and SPI]]
----

Once ZeptoOS is configured and installed, it is time to test it. Here are a few trivial tests to verify that the environment is working:

==The /bin/sleep job==

If you are using Cobalt, submit using either of the commands below:

<pre>
$ cqsub -k <profile-name> -t <time> -n 1 /bin/sleep 3600
$ qsub --kernel <profile-name> -t <time> -n 1 /bin/sleep 3600
</pre>

If you are using <tt>mpirun</tt> directly, submit as follows:

<pre>
$ mpirun -verbose 1 -partition <partition-name> -np 1 -timeout <time> \
-cwd $PWD -exe /bin/sleep 3600
</pre>

This test, if successful, will verify that the ZeptoOS compute and I/O node environments are booting correctly. We deliberately chose a system binary such as <tt>/bin/sleep</tt> instead of something from a network filesystem so that even if the network filesystem does not come up for some reason, the test can still succeed.

If everything works out fine, messages such as the following will be found in the error stream (''jobid''.error file if using Cobalt):

<pre>
FE_MPI (Info) : initialize() - using jobname '' provided by scheduler interface
FE_MPI (Info) : Invoking mpirun backend
FE_MPI (Info) : connectToServer() - Handshake successful
BRIDGE (Info) : rm_set_serial() - The machine serial number (alias) is BGP
FE_MPI (Info) : Preparing partition
BE_MPI (Info) : Examining specified partition
BE_MPI (Info) : Checking partition ANL-R00-M1-N12-64 initial state ...
BE_MPI (Info) : Partition ANL-R00-M1-N12-64 initial state = FREE ('F')
BE_MPI (Info) : Checking partition owner...
BE_MPI (Info) : Setting new owner
BE_MPI (Info) : Initiating boot of the partition
BE_MPI (Info) : Waiting for partition ANL-R00-M1-N12-64 to boot...
BE_MPI (Info) : Partition is ready
BE_MPI (Info) : Done preparing partition
FE_MPI (Info) : Adding job
BE_MPI (Info) : Adding job to database...
FE_MPI (Info) : Job added with the following id: 98461
FE_MPI (Info) : Starting job 98461
FE_MPI (Info) : Waiting for job to terminate
BE_MPI (Info) : IO - Threads initialized
BE_MPI (Info) : I/O input runner thread terminated
</pre>

(we stripped the timestamp prefixes to make the lines shorter)

If these messages are immediately followed by error messages, then there is a problem. One common instance would be:

<pre>
BE_MPI (Info) : I/O output runner thread terminated
BE_MPI (Info) : Job 98463 switched to state ERROR ('E')
BE_MPI (ERROR): Job execution failed
[...]
BE_MPI (ERROR): The error message in the job record is as follows:
BE_MPI (ERROR): "Load failed on 172.16.3.11: Program segment is not 1MB aligned"
</pre>

This error indicates that the job was submitted to the default software environment, not to ZeptoOS (at the very least, the default I/O node ramdisk was used). You need to go back to the [[Installation#Setting up a kernel profile|Installation]] section to fix the problem. Information from the system log files can be useful to diagnose the problem.
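
A script that submits test jobs could check the error stream for this particular failure automatically. The following is a minimal sketch; <tt>booted_default_environment</tt> is a hypothetical helper name, and it matches only the exact message text quoted above.

```shell
# booted_default_environment < jobid.error
# Succeeds (exit status 0) if the error stream read from stdin contains
# the "not 1MB aligned" load failure quoted above.
# Hypothetical helper, not part of ZeptoOS or Cobalt.
booted_default_environment() {
    grep -q 'Program segment is not 1MB aligned'
}
```

With Cobalt, the input would be the ''jobid''<tt>.error</tt> file mentioned earlier, for example <tt>booted_default_environment < 98463.error && echo "job did not boot ZeptoOS"</tt>.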

==Log files==

===I/O node===

Every I/O node has its own log file located in <tt>/bgsys/logs/BGP/</tt>, with a name such as <tt>R*-M*-N*-J*.log</tt>. This name will generally correspond to the name of the partition where the job was running. Above, our job ran on <tt>ANL-R00-M1-N12-64</tt> (we could see that in the error stream; Cobalt users can also use <tt>[c]qstat</tt>); a corresponding I/O node log file on Argonne machines will be <tt>R00-M1-N12-J00.log</tt>. This is what a log file from a successful ZeptoOS boot looks like:

<pre>Linux version 2.6.16.46-297 (geeko@buildhost) (gcc version 4.1.2 (BGP)) #1 SMP Wed Apr 22 15:04:42 CDT 2009
Kernel command line: console=bgcons root=/dev/ram0 lpj=8500000
init started: BusyBox v1.4.2 (2008-04-10 05:20:01 UTC) multi-call binary
Starting RPC portmap daemon..done
eth0: Link status [RX+,TX+]
mount server reported tcp not available, falling back to udp
mount: RPC: Remote system error - No route to host
Zepto ION startup-00
eth0 Link encap:Ethernet HWaddr 00:14:5E:7D:0C:57
inet addr:172.16.3.15 Bcast:172.31.255.255 Mask:255.240.0.0
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:880 errors:0 dropped:0 overruns:0 frame:0
TX packets:1009 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:3878545 (3.6 Mb) TX bytes:151458 (147.9 Kb)
Interrupt:32
Zepto ION startup-00 done
done
Starting syslog servicesDec 31 18:00:36 ion-15 syslogd 1.4.1: restart.
done
Starting network time protocol daemon (NTPD) using 172.17.3.1
May 1 12:57:11 ion-15 ntpdate[642]: step time server 172.17.3.1 offset 1241200617.470271 sec
May 1 12:57:11 ion-15 ntpd[653]: ntpd 4.2.0a@1.1196-r Sat Oct 4 00:01:53 UTC 2008 (1)
May 1 12:57:11 ion-15 ntpd[653]: precision = 1.000 usec
May 1 12:57:11 ion-15 ntpd[653]: Listening on interface wildcard, 0.0.0.0#123
May 1 12:57:11 ion-15 ntpd[653]: Listening on interface eth0, 172.16.3.15#123
May 1 12:57:11 ion-15 ntpd[653]: Listening on interface lo, 127.0.0.1#123
May 1 12:57:11 ion-15 ntpd[653]: kernel time sync status 0040
done
Enabling ssh
Mounting site filesystems
done
Loading PVFS2 kernel module done
Sleeping 0 seconds before starting PVFS done
Starting PVFS2 client done
Sleeping 10 seconds before mounting PVFS
done
Mounting PVFS2 filesystems done
Starting SSH daemonMay 1 12:57:21 ion-15 sshd[833]: Server listening on 0.0.0.0 port 22.
done
Zepto ION startup-12
Zepto ION startup-12 done
Starting GPFS
May 1 12:57:26 ion-15 syslogd 1.4.1: restart.
/etc/init.d/rc3.d/S40gpfs: GPFS is ready on I/O node ion-15 : 172.16.3.15 : R00-M1-N12-J00
ln: creating symbolic link `/home/acherryl/acherryl' to `/gpfs/home/acherryl': File exists
ln: creating symbolic link `/home/bgpadmin/bgpadmin' to `/gpfs/home/bgpadmin': File exists
ln: creating symbolic link `/home/davidr/davidr' to `/gpfs/home/davidr': File exists
ln: creating symbolic link `/home/scullinl/scullinl' to `/gpfs/home/scullinl': File exists
Starting ZOID...
done
Zepto ION startup-99
Zepto ION startup-99 done
May 1 17:57:59 ion-15 init: Starting pid 2823, console /dev/console: '/bin/sh'
BusyBox v1.4.2 (2008-10-04 00:02:35 UTC) Built-in shell (ash)
Enter 'help' for a list of built-in commands.
/bin/sh: can't access tty; job control turned off
~ #
</pre>

(again, we stripped the prefixes to make the lines shorter)

Messages such as <tt>Zepto ION startup</tt> or <tt>Starting ZOID</tt> clearly indicate that a ZeptoOS I/O node ramdisk is being used. If one instead mistakenly booted with the default ramdisk, this could be recognized by messages such as:

<pre>
Starting CIO services
[ciod:initialized] done
</pre>

(<tt>ciod</tt> is ''never'' started when using Zepto Compute Node Linux)
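
This check is easy to script. The sketch below classifies a boot log by the marker messages named above; <tt>ion_ramdisk_kind</tt> is a hypothetical name, and the function assumes that its input holds the log of a single boot (if the log file accumulates several boots, only the most recent one should be passed in).

```shell
# ion_ramdisk_kind < ionode-boot.log
# Prints "zeptoos", "default", or "unknown" depending on which marker
# messages appear in the boot log read from stdin.
# Hypothetical helper, not part of ZeptoOS.
ion_ramdisk_kind() {
    log=$(cat)
    case $log in
        *'Zepto ION startup'*|*'Starting ZOID'*) echo zeptoos ;;
        *'Starting CIO services'*)               echo default ;;
        *)                                       echo unknown ;;
    esac
}
```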

In addition to verifying the ramdisk, the correct I/O node kernel can also be verified using the I/O node log file by checking the kernel build timestamp in the first line of the boot log. As of this writing the default kernel on the Argonne machines has a timestamp of <tt>Wed Oct 29 18:51:19 UTC 2008</tt>; as can be seen above, the ZeptoOS kernel was built more recently.
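
The timestamp can be pulled out of the log with a short filter. This is a sketch only: <tt>kernel_build_stamp</tt> is a hypothetical name, and the pattern assumes a kernel banner of the form shown in the sample log (<tt>Linux version ... #</tt>''n''<tt> SMP </tt>''timestamp'').

```shell
# kernel_build_stamp < ionode-boot.log
# Prints the build timestamp from the first "Linux version" banner line
# found on stdin (the text following "#<n> SMP ").
# Hypothetical helper, not part of ZeptoOS.
kernel_build_stamp() {
    grep 'Linux version' |
        head -n 1 |
        sed -n 's/^.*#[0-9][0-9]* SMP \(.*\)$/\1/p'
}
```

The printed string can then be compared against the timestamp of the site's default kernel.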

===Compute node===

All the compute nodes on the machine share the same MMCS log file, located in <tt>/bgsys/logs/BGP/</tt>. The name of the log file is not fixed (it contains a timestamp), but <tt>sn1-bgdb0-mmcs_db_server-current.log</tt> always links to the current file. Because the file is shared with other jobs, we recommend grepping it for the user name, the partition name, or both.

A correct boot log when booting ZeptoOS will look something like this:
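
A small filter for this might look as follows. It is a sketch under the assumption that the lines of interest carry a ''user''<tt>:</tt>''partition'' prefix, as the compute node boot messages do; lines that mention only the user or only the partition are not matched. <tt>mmcs_lines_for</tt> is a hypothetical helper name.

```shell
# mmcs_lines_for USER PARTITION < mmcs.log
# Prints the lines from stdin that contain the literal string
# "USER:PARTITION". -F makes grep treat it as a fixed string.
# Hypothetical helper, not part of ZeptoOS.
mmcs_lines_for() {
    grep -F -- "$1:$2"
}
```

For example: <tt>mmcs_lines_for iskra ANL-R00-M1-N12-64 < /bgsys/logs/BGP/sn1-bgdb0-mmcs_db_server-current.log</tt>.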

<pre>
iskra:ANL-R00-M1-N12-64 {20}.0: Common Node Services V1R3M0 (efix:0)
iskra:ANL-R00-M1-N12-64 {20}.0: Licensed Machine Code - Property of IBM.
iskra:ANL-R00-M1-N12-64 {20}.0: Blue Gene/P Licensed Machine Code.
iskra:ANL-R00-M1-N12-64 {20}.0: Copyright IBM Corp., 2006, 2007 All Rights Reserved.
iskra:ANL-R00-M1-N12-64 {20}.0: Z: Zepto Linux Kernel relocating CNS... dst=80280000 src=fff40000 size=262144
iskra:ANL-R00-M1-N12-64 {20}.0: Z: CNS is successfully relocated to 00280000 in physical memory
iskra:ANL-R00-M1-N12-64 {20}.0: Linux version 2.6.19.2-g66cbca2d (kazutomo@login1) (gcc version 4.1.2 (BGP)) #12 SMP Tue Apr 21 12:58:11 CDT 2009
iskra:ANL-R00-M1-N12-64 {20}.0: Zone PFN ranges:
iskra:ANL-R00-M1-N12-64 {20}.0: DMA 0 -> 28672
iskra:ANL-R00-M1-N12-64 {20}.0: Normal 28672 -> 28672
iskra:ANL-R00-M1-N12-64 {20}.0: early_node_map[1] active PFN ranges
iskra:ANL-R00-M1-N12-64 {20}.1: 0: 0 -> 28672
iskra:ANL-R00-M1-N12-64 {20}.1: Built 1 zonelists. Total pages: 28658
iskra:ANL-R00-M1-N12-64 {20}.1: Kernel command line: console=bgcons root=/dev/ram0 lpj=8500000
iskra:ANL-R00-M1-N12-64 {20}.1: PID hash table entries: 4096 (order: 12, 16384 bytes)
iskra:ANL-R00-M1-N12-64 {20}.0: Dentry cache hash table entries: 262144 (order: 4, 1048576 bytes)
iskra:ANL-R00-M1-N12-64 {20}.0: Inode-cache hash table entries: 131072 (order: 3, 524288 bytes)
iskra:ANL-R00-M1-N12-64 {20}.0: Memory: 1826560k available (1408k kernel code, 832k data, 192k init, 0k highmem)
iskra:ANL-R00-M1-N12-64 {20}.0: Calibrating delay loop (skipped)... 1700.00 BogoMIPS preset
iskra:ANL-R00-M1-N12-64 {20}.0: Mount-cache hash table entries: 8192
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 1 done callin...
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 1 done setup...
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 1 done timebase take...
iskra:ANL-R00-M1-N12-64 {20}.0: Processor 1 found.
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 2 done callin...
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 2 done setup...
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 2 done timebase take...
iskra:ANL-R00-M1-N12-64 {20}.0: Processor 2 found.
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 3 done callin...
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 3 done setup...
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 3 done timebase take...
iskra:ANL-R00-M1-N12-64 {20}.0: Processor 3 found.
iskra:ANL-R00-M1-N12-64 {20}.0: Brought up 4 CPUs
iskra:ANL-R00-M1-N12-64 {20}.0: migration_cost=0
iskra:ANL-R00-M1-N12-64 {20}.0: checking if image is initramfs... it is
iskra:ANL-R00-M1-N12-64 {20}.0: Freeing initrd memory: 2575k freed
iskra:ANL-R00-M1-N12-64 {20}.0: NET: Registered protocol family 16
iskra:ANL-R00-M1-N12-64 {20}.0: NET: Registered protocol family 2
iskra:ANL-R00-M1-N12-64 {20}.0: IP route cache hash table entries: 16384 (order: 0, 65536 bytes)
iskra:ANL-R00-M1-N12-64 {20}.0: TCP established hash table entries: 65536 (order: 3, 524288 bytes)
iskra:ANL-R00-M1-N12-64 {20}.0: TCP bind hash table entries: 32768 (order: 2, 262144 bytes)
iskra:ANL-R00-M1-N12-64 {20}.0: TCP: Hash tables configured (established 65536 bind 32768)
iskra:ANL-R00-M1-N12-64 {20}.0: TCP reno registered
iskra:ANL-R00-M1-N12-64 {20}.0: fuse init (API version 7.7)
iskra:ANL-R00-M1-N12-64 {20}.0: io scheduler noop registered (default)
iskra:ANL-R00-M1-N12-64 {20}.0: RAMDISK driver initialized: 16 RAM disks of 32768K size 1024 blocksize
iskra:ANL-R00-M1-N12-64 {20}.0: tun: Universal TUN/TAP device driver, 1.6
iskra:ANL-R00-M1-N12-64 {20}.0: tun: (C) 1999-2004 Max Krasnyansky <maxk@qualcomm.com>
iskra:ANL-R00-M1-N12-64 {20}.0: TCP cubic registered
iskra:ANL-R00-M1-N12-64 {20}.0: NET: Registered protocol family 1
iskra:ANL-R00-M1-N12-64 {20}.0: NET: Registered protocol family 17
iskra:ANL-R00-M1-N12-64 {20}.0: NET: Registered protocol family 15
iskra:ANL-R00-M1-N12-64 {20}.0: Freeing unused kernel memory: 192k init
iskra:ANL-R00-M1-N12-64 {20}.0: init started: BusyBox(for ZeptoOS Compute Node) v1.12.1 (2009-04-21 16:08:55 CDT)
</pre>

This is very easy to distinguish from a boot log of the default light-weight kernel, which will consist of the first four lines ''only''.

The MMCS log file contains other useful information besides the boot log of the compute nodes. Before the kernel starts booting, the following messages related to the newly submitted job can be found there:

<pre>
DBBlockCmd DatabaseBlockCommandThread started: block ANL-R00-M1-N12-64, user iskra, action 1
DBBlockCmd setusername iskra
iskra db_allocate ANL-R00-M1-N12-64
iskra DBConsoleController::setAllocating() ANL-R00-M1-N12-64
iskra block state C
iskra DBConsoleController::addBlock(ANL-R00-M1-N12-64)
iskra:ANL-R00-M1-N12-64 BlockController::connect()
iskra:ANL-R00-M1-N12-64 connecting to mcServer at 127.0.0.1:1206
Connected to MCServer as iskra@sn1. Client version 3. Server version 3 on fd 101
iskra:ANL-R00-M1-N12-64 connected to mcServer
iskra:ANL-R00-M1-N12-64 mcServer target set ANL-R00-M1-N12-64 created
iskra:ANL-R00-M1-N12-64 mcServer target set ANL-R00-M1-N12-64 opened
iskra:ANL-R00-M1-N12-64 {0} I/O log file: /bgsys/logs/BGP/R00-M1-N12-J00.log
iskra:ANL-R00-M1-N12-64 MailboxListener starting
iskra:ANL-R00-M1-N12-64 DBConsoleController::doneAllocating() ANL-R00-M1-N12-64
iskra:ANL-R00-M1-N12-64 BlockController::boot_block \
uloader=/bgsys/argonne-utils/partitions/ANL-R00-M1-N12-64/uloader \
cnload=/bgsys/argonne-utils/partitions/ANL-R00-M1-N12-64/CNS,/bgsys/argonne-utils/partitions/ANL-R00-M1-N12-64/CNK \
ioload=/bgsys/argonne-utils/partitions/ANL-R00-M1-N12-64/CNS,/bgsys/argonne-utils/partitions/ANL-R00-M1-N12-64/INK,/bgsys/argonne-utils/partitions/ANL-R00-M1-N12-64/ramdisk
iskra:ANL-R00-M1-N12-64 boot_block cookie: 587867023 compute_nodes: 64 io_nodes: 1
</pre>

Of particular relevance is the pathname to the I/O node log file(s) (if it cannot be easily guessed from the partition name) and the pathnames to the kernels and ramdisks used to boot the partition.
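
The I/O node log file names can be extracted from the MMCS log rather than guessed. The following sketch relies on the <tt>I/O log file:</tt> message shown above and assumes that the pathname contains no spaces; <tt>ion_log_paths</tt> is a hypothetical helper name.

```shell
# ion_log_paths PARTITION < mmcs.log
# Prints the unique I/O node log file pathnames that the MMCS log on
# stdin reports for PARTITION.
# Hypothetical helper, not part of ZeptoOS.
ion_log_paths() {
    grep -F -- "$1" |
        sed -n 's|^.*I/O log file: *\([^ ]*\).*$|\1|p' |
        sort -u
}
```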

After the kernel boot log, the log file will also contain information about subsequent phases of starting a job:

<pre>
iskra:ANL-R00-M1-N12-64 I/O node initialized: R00-M1-N12-J00
iskra:ANL-R00-M1-N12-64 DBBlockController::waitBoot(ANL-R00-M1-N12-64) block initialization successful
iskra DatabaseBlockCommandThread stopped
DBJobCmd DatabaseJobCommandThread started: job 98461, user iskra, action 1
DBJobCmd setusername iskra
iskra Starting Job 98461
New thread 4398305505840, for jobid 98461
selectBlock(): ANL-R00-M1-N12-64 iskra(1) connected state: I owner: iskra
ANL-R00-M1-N12-64 Jobid is 98461, homedir is /gpfs/home/iskra
ANL-R00-M1-N12-64 persist: 1
ANL-R00-M1-N12-64 connecting to mpirun...
ANL-R00-M1-N12-64 setting mpirun stream, fd=386
ANL-R00-M1-N12-64 contacting control node 0 at 172.16.3.15:7000
ANL-R00-M1-N12-64 connected to control node 0 at 172.16.3.15:7000
ANL-R00-M1-N12-64 Job::load() /bin/sleep
ANL-R00-M1-N12-64 Job loaded: 98461
ANL-R00-M1-N12-64 About to start /bin/sleep
ANL-R00-M1-N12-64 Job 98461 set to RUNNING
iskra:ANL-R00-M1-N12-64 {20}.0: floating point used in kernel (task=8080cfe0, pc=80017064)
</pre>

==Interactive login==

We are assuming at this point that launching <tt>/bin/sleep</tt> has been successful and that the "job" is running. We can now start an interactive session on our BG/P resources. Probably the most complicated part of this operation is finding the IP address of the I/O node(s). The allocation of I/O nodes to partitions is fixed, so on a small machine one could simply make a list. This information is also available in the log files discussed above.

The IP address is printed near the top of the I/O node boot log, as part of the interface configuration of the Ethernet device:

<pre>
eth0 Link encap:Ethernet HWaddr 00:14:5E:7D:0C:57
inet addr:172.16.3.15 Bcast:172.31.255.255 Mask:255.240.0.0
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:880 errors:0 dropped:0 overruns:0 frame:0
TX packets:1009 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:3878545 (3.6 Mb) TX bytes:151458 (147.9 Kb)
Interrupt:32
</pre>

In this case, the address is <tt>172.16.3.15</tt> (the <tt>inet addr</tt> value).
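The same value can be pulled out of the log with a short filter. This is only a sketch: it assumes the <tt>inet addr:</tt> format shown above, and the function name <tt>inet_addr</tt> is ours:

```shell
# Print every IPv4 address that follows an "inet addr:" field in the
# ifconfig-style text read from stdin.
inet_addr() {
    sed -n 's/^.*inet addr:\([0-9.]*\).*$/\1/p'
}
```

For example, <tt>inet_addr < /bgsys/logs/BGP/R00-M1-N12-J00.log</tt>. If the log contains the configuration of more than one interface, more than one address will be printed.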

The IP address is also available from the MMCS log file:

<pre>
ANL-R00-M1-N12-64 contacting control node 0 at 172.16.3.15:7000
</pre>

With larger partitions that include multiple I/O nodes, querying the MMCS log file is probably better, as it will list all the addresses.
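A possible way to collect all the addresses at once, assuming the <tt>contacting control node</tt> lines look like the one above (the function name <tt>io_node_addrs</tt> is ours):

```shell
# List the distinct I/O node addresses from the
# "contacting control node <n> at <ip>:<port>" lines of an MMCS log
# read from stdin.
io_node_addrs() {
    sed -n 's/^.*contacting control node [0-9]* at \([0-9.]*\):[0-9]*.*$/\1/p' | sort -u
}
```

The log may contain entries from other jobs and partitions, so it is advisable to filter it by partition name first (e.g., with <tt>grep</tt>).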

Once the IP address is known, one can simply use SSH:

<pre>
iskra@login1.surveyor:~> ssh 172.16.3.15

BusyBox v1.4.2 (2008-10-04 00:02:35 UTC) Built-in shell (ash)
Enter 'help' for a list of built-in commands.

/gpfs/home/iskra $ hostname
ion-15
/gpfs/home/iskra $
</pre>

If everything is configured correctly, SSH will only let in root and the partition owner; no other unprivileged user will be allowed on the node. However, this might require site-specific customization to work properly. To enable access for the partition owner, one might need to make adjustments to [[ZOID#The /bin.rd/update_passwd_file.sh file|update_passwd_file.sh]]. To enable password-less login for the partition owners without requiring them to set up personal SSH keypairs, we recommend adding the names of the front end nodes to the <tt>shosts.equiv</tt> file, found in <tt>ramdisk/ION/ramdisk-add/etc/ssh.zepto/</tt>. The file is empty by default. Remember to use the names from the network that interconnects the front end and I/O nodes, which might differ from the hostnames; at Argonne, for example, we need to add the <tt>-data</tt> suffix to the hostnames. Until this has all been set up, one might prefer to log on as root (<tt>ssh -l root</tt>), using the password provided while [[Configuration#Building|building]] the ZeptoOS environment.

Also, even when the partition owner is correctly set up, there will be a time window while booting the I/O node when the SSH daemon is already running, but a job has not yet been started; during that window, the partition owner cannot log on. If that happens, wait a few seconds and try again.

Here's part of the <tt>ps</tt> output from the I/O node:

<pre>
/gpfs/home/iskra $ ps -ef
UID PID PPID C STIME TTY TIME CMD
[...]
65534 98 1 0 16:09 ? 00:00:00 /sbin/portmap
root 108 19 0 16:09 ? 00:00:00 [rpciod/0]
root 109 19 0 16:09 ? 00:00:00 [rpciod/1]
root 110 19 0 16:09 ? 00:00:00 [rpciod/2]
root 111 19 0 16:09 ? 00:00:00 [rpciod/3]
root 570 1 0 16:09 ? 00:00:00 /sbin/syslogd
root 577 1 0 16:09 ? 00:00:00 /sbin/klogd -c 1 -x -x
ntp 653 1 0 16:09 ? 00:00:00 /usr/sbin/ntpd -p /var/run/ntpd.
root 688 1 0 16:09 ? 00:00:00 [lockd]
root 775 1 0 16:09 ? 00:00:00 /bgsys/iosoft/pvfs2/sbin/pvfs2-c
root 776 775 0 16:09 ? 00:00:00 pvfs2-client-core --child -a 5 -
root 833 1 0 16:10 ? 00:00:00 /usr/sbin/sshd -o PidFile=/var/r
root 1016 1 0 16:10 ? 00:00:00 /bin/ksh /usr/lpp/mmfs/bin/runmm
root 1079 1 0 16:10 ? 00:00:00 [nfsWatchKproc]
root 1080 1 0 16:10 ? 00:00:00 [gpfsSwapdKproc]
root 1146 1016 0 16:10 ? 00:00:01 /usr/lpp/mmfs/bin//mmfsd
root 1153 1 0 16:10 ? 00:00:00 [mmkproc]
root 1152 1 0 16:10 ? 00:00:00 [mmkproc]
root 1154 1 0 16:10 ? 00:00:00 [mmkproc]
iskra 2810 1 98 16:10 ? 00:04:09 /bin.rd/zoid -a 8 -m unix_impl.s
root 2823 1 0 16:10 ? 00:00:00 /bin/sh
root 3328 833 0 16:10 ? 00:00:00 sshd: iskra [priv]
iskra 3332 3328 0 16:10 ? 00:00:00 sshd: iskra@ttyp0
iskra 3333 3332 0 16:10 ttyp0 00:00:00 -sh
iskra 3346 3333 0 16:14 ttyp0 00:00:00 ps -ef
/gpfs/home/iskra $
</pre>

The I/O nodes run a small Linux setup with the root filesystem in the ramdisk. Custom processes can be started, just like on any ordinary Linux node. In the example above, the processes are mostly a few system daemons and the remote filesystem clients (GPFS, PVFS). Please verify at this stage that the remote filesystems have been mounted correctly.

One custom process running on the node is [[ZOID]], the I/O forwarding and job control daemon, which enables the communication with the compute nodes. One of the facilities offered by ZOID is IP forwarding between the I/O node and the compute nodes, implemented using the virtual network tunneling device available in Linux:

<pre>
/gpfs/home/iskra $ ifconfig tun0
tun0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
inet addr:192.168.1.254 P-t-P:192.168.1.254 Mask:255.255.255.255
UP POINTOPOINT RUNNING NOARP MULTICAST MTU:65535 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:500
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
/gpfs/home/iskra $
</pre>

At least on Argonne machines, with a 64:1 ratio of compute nodes to I/O nodes, compute nodes have addresses <tt>192.168.1.1</tt> to <tt>192.168.1.64</tt> (the last octet of the address is the [[FAQ#Pset rank|pset rank]]). Somewhat confusingly, the first compute node (compute node <tt>0</tt>) has IP address <tt>192.168.1.64</tt>, so if one submits a one-node job as we did, that is the IP address that needs to be used to log on to that sole running compute node. The IP address of the second compute node is... <tt>192.168.1.59</tt>. On a machine with a 16:1 ratio of compute nodes to I/O nodes, the first compute node has IP address <tt>192.168.1.16</tt>. Do not blame us for this chaos – blame IBM :-).
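The mapping from pset rank to address can be written down as a small helper. This is a sketch only: the function name <tt>cn_ip</tt> is ours, and the upper bound of 253 is our assumption (it merely keeps clear of <tt>192.168.1.254</tt>, which belongs to the I/O node):

```shell
# Map a pset rank to the compute node address on the ZOID virtual network.
cn_ip() {
    rank=$1
    case $rank in
        ''|*[!0-9]*)
            echo "cn_ip: pset rank must be a positive integer" >&2
            return 1 ;;
    esac
    # Pset ranks start from 1; 253 is an assumed upper bound.
    if [ "$rank" -lt 1 ] || [ "$rank" -gt 253 ]; then
        echo "cn_ip: pset rank out of range" >&2
        return 1
    fi
    echo "192.168.1.$rank"
}
```

On a compute node, the node's own address can then be obtained with <tt>. /proc/personality.sh; cn_ip "$BG_RANK_IN_PSET"</tt>.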

The compute nodes are running a <tt>telnet</tt> daemon, and no password is required to log on to them:

<pre>
/gpfs/home/iskra $ telnet 192.168.1.64

Entering character mode
Escape character is '^]'.

BusyBox(for ZeptoOS Compute Node) v1.12.1 (2009-04-21 16:08:55 CDT) built-in shell (ash)
Enter 'help' for a list of built-in commands.

~ #
</pre>

The IP address of the I/O node on this virtual network is <tt>192.168.1.254</tt>. The network is local to each I/O node, so for larger jobs, there will be multiple distinct virtual networks that cannot communicate with each other, and the IP addresses will duplicate.

Here's part of the <tt>ps</tt> output from the compute node:

<pre>
~ # ps -ef
PID USER VSZ STAT COMMAND
[...]
34 root 5440 S /bin/sh /etc/init.d/rc.sysinit
44 root 5504 S /sbin/telnetd -l /bin/sh
47 root 6528 S /sbin/inetd
48 root 46400 R N /sbin/control
62 root 7872 S /bin/zoid-fuse -o allow_other -s /fuse
116 root 5248 S /bin/sleep 3600
118 root 5504 S /bin/sh
</pre>

Compute nodes have an even more stripped-down environment than the I/O nodes. There are no user accounts – everything runs as root, including the application processes. This is not a security concern, because the only practical way for a compute node to communicate with the outside world is through the I/O node, and I/O nodes ''do'' enforce user-level access control.

There are two custom processes running on each compute node:

'''control''' is a job management daemon responsible for tasks such as the launching of application processes, for the forwarding of stdin/out/err data, and for the management of the virtual network tunneling device from the compute node side. Do not interfere with this process in any way; this would likely make the node inaccessible.

'''zoid-fuse''' is a FUSE ([http://fuse.sourceforge.net/ Filesystem in Userspace]) client responsible for making the filesystems from the I/O nodes available to ordinary POSIX-compliant processes running on the compute nodes. The whole filesystem namespace from the I/O nodes is made available on the compute nodes under <tt>/fuse/</tt>, and symbolic links such as <tt>/home -> /fuse/home</tt> are set up to keep the login and I/O node pathnames valid on the compute nodes. Please verify that this is correctly set up. We do not foresee a need to change this setup, but should that prove necessary, the responsible <tt>fuse-start</tt> and <tt>fuse-stop</tt> scripts can be found under <tt>ramdisk/CN/tree/bin</tt>.
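The symbolic links can be checked with a few lines of shell. The sketch below is not part of ZeptoOS; the function names are ours, and it assumes that <tt>readlink</tt> is available in the compute node's BusyBox build:

```shell
# Succeed if $1 is a symbolic link whose target is exactly $2.
check_link() {
    [ -L "$1" ] && [ "$(readlink "$1")" = "$2" ]
}

# Check a list of link=target pairs and warn about every link that is
# missing or points elsewhere. Returns non-zero if any check failed.
check_fuse_links() {
    status=0
    for pair in "$@"; do
        link=${pair%%=*}
        target=${pair#*=}
        if ! check_link "$link" "$target"; then
            echo "WARNING: $link is not a symlink to $target" >&2
            status=1
        fi
    done
    return $status
}
```

For the link mentioned above, one would run <tt>check_fuse_links /home=/fuse/home</tt> on a compute node; which other links exist depends on the site configuration.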

==Shell script job==

Assuming that the above steps have been successful, one can now test running a simple job from a network filesystem, such as one's home directory.

Here is a sample shell script to try:

<pre>
#!/bin/sh

. /proc/personality.sh

while true; do
    echo "Node $BG_RANK_IN_PSET running (stdout)"
    echo "Node $BG_RANK_IN_PSET running (stderr)" 1>&2
    sleep 10
done
</pre>

(please see the [[FAQ#Pset rank|FAQ]] for the explanation of <tt>/proc/personality.sh</tt> and <tt>BG_RANK_IN_PSET</tt>)

Please create the script file on the network filesystem, set the executable bit (<tt>chmod 755</tt>) and submit it. Verify that the script starts correctly and that at least the standard error output is visible immediately. The script prints a line of output from each node every ten seconds. It does so both to the standard output and to the standard error, because, depending on software configuration, the standard output stream could be buffered. If that is the case, kill the job and verify that the standard output data did appear.
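When the job spans many nodes, counting the nodes that reported is easier than reading the output by eye. A minimal sketch, assuming the output lines have exactly the form printed by the script above (the function name <tt>nodes_reporting</tt> is ours):

```shell
# Count the distinct pset ranks that reported on the given stream.
# $1 is "stdout" or "stderr"; the captured job output is read from stdin.
nodes_reporting() {
    sed -n "s/^Node \([0-9][0-9]*\) running ($1)\$/\1/p" | sort -un | wc -l | tr -d ' '
}
```

For example, <tt>nodes_reporting stderr < ''job-error-file''</tt>. Since pset ranks are only unique within one pset, the result is a lower bound on partitions with more than one I/O node.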

==MPI and OpenMP jobs==

The final tests involve parallel jobs: one using MPI and one using OpenMP. Use the test programs provided with the distribution. From the top level directory:
<pre>
$ cd comm/testcodes
</pre>

===Compiling===

The programs can be compiled on a login node using:

<pre>
$ /path/to/install/bin/zmpicc -o mpi-test-linux mpi-test.c
$ /path/to/install/bin/zmpixlc_r -qsmp=omp -o omp-test-linux omp-test.c
</pre>

===Submitting===

Submit the MPI test like any other job; use one of the following:

<pre>
$ cqsub -k <profile-name> -t <time> -n <number-of-processes> $PWD/mpi-test-linux
$ qsub --kernel <profile-name> -t <time> -n <number-of-processes> $PWD/mpi-test-linux
$ mpirun -verbose 1 -partition <partition-name> -np <number-of-processes> -timeout <time> \
-cwd $PWD -exe $PWD/mpi-test-linux
</pre>

For the OpenMP test, we pass the number of OpenMP threads to use in the <tt>OMP_NUM_THREADS</tt> variable:

<pre>
$ cqsub -k <profile-name> -t <time> -n 1 -e OMP_NUM_THREADS=<num> $PWD/omp-test-linux
$ qsub --kernel <profile-name> -t <time> -n 1 --env OMP_NUM_THREADS=<num> $PWD/omp-test-linux
$ mpirun -verbose 1 -partition <partition-name> -np 1 -timeout <time> \
-cwd $PWD -env OMP_NUM_THREADS=<num> -exe $PWD/omp-test-linux
</pre>

The MPI test benchmarks the performance of various MPI operations. The OpenMP test is just a parallel "Hello world".

'''Note:''' see the [[FAQ#Why large MPI processes do not work|FAQ]] if submitting larger MPI processes does not work properly.

----
[[Installation]] | [[ZeptoOS_Documentation|Top]] | [[MPICH, DCMF, and SPI]]

FAQ

2009-05-07T20:43:36Z

Iskra: /* Why large MPI programs do not work */

[[ZeptoOS_Documentation|Top]]
----

==How to obtain a CN node number==

This depends on what number one is interested in.

===Pset rank===

A pset rank is a number identifying a compute node within each ''pset'' (an I/O node and the compute nodes that communicate with it). Note that on partitions larger than one pset, the pset ranks will not be unique. Also, pset ranks do ''not'' start from <tt>0</tt>; they start from <tt>1</tt> for some mysterious reason (do not blame us – blame IBM :-).

Pset rank is used as the last octet in the IP address on the tree network connecting the compute nodes and the I/O nodes (''x'' in <tt>192.168.1.</tt>''x'').

The pset rank is available on the compute nodes from <tt>/proc/personality.sh</tt>, in the <tt>BG_RANK_IN_PSET</tt> variable:

<pre>
#!/bin/sh

. /proc/personality.sh

echo "My pset rank is $BG_RANK_IN_PSET"
</pre>

From a C program it will be easier to use the binary personality available from <tt>/proc/personality</tt>. The definition of the structure can be found in <tt>/bgsys/drivers/ppcfloor/arch/include/common/bgp_personality.h</tt>. The pset rank is in <tt>Network_Config.RankInPSet</tt>:

<pre>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <common/bgp_personality.h>

int main(void)
{
    _BGP_Personality_t personality;
    int fd;

    if ((fd = open("/proc/personality", O_RDONLY)) == -1)
    {
        perror("open");
        return 1;
    }
    if (read(fd, &personality, sizeof(personality)) != sizeof(personality))
    {
        perror("read");
        close(fd);
        return 1;
    }
    close(fd);

    printf("My pset rank is %d\n", personality.Network_Config.RankInPSet);

    return 0;
}
</pre>

(compile the above with <tt>-I/bgsys/drivers/ppcfloor/arch/include</tt>)

===Torus rank===

A torus rank is a number identifying a compute node within a whole partition. In a way, it is much "nicer" than a pset rank since it is unique within a job and it also starts from <tt>0</tt>.

The torus rank is easy to obtain from a C program: it is the <tt>Network_Config.Rank</tt> field of the personality structure.

Unfortunately, the torus rank is not available in <tt>/proc/personality.sh</tt>, but a shell script can easily calculate it from other fields:

<pre>
TORUS_RANK=`echo $BG_PSETORG | awk "{print \\$1 + \\$2 * $BG_XSIZE + \
\\$3 * $BG_XSIZE * $BG_YSIZE}"`
</pre>
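The same formula, rank = ''x'' + ''y'' · XSIZE + ''z'' · XSIZE · YSIZE, can also be evaluated with plain shell arithmetic. This is a sketch for illustration; the function name <tt>torus_rank</tt> is ours, and, like the <tt>awk</tt> one-liner, it assumes that <tt>BG_PSETORG</tt> holds the three coordinates separated by spaces:

```shell
# Compute the torus rank from the node coordinates and partition size.
# Arguments: x y z xsize ysize
torus_rank() {
    echo $(( $1 + $2 * $4 + $3 * $4 * $5 ))
}
```

On a compute node this would be invoked as <tt>torus_rank $BG_PSETORG "$BG_XSIZE" "$BG_YSIZE"</tt> (with <tt>$BG_PSETORG</tt> deliberately unquoted so that it splits into three arguments). As a worked example, the node at coordinates (1, 2, 3) in a partition with XSIZE = YSIZE = 8 has rank 1 + 2·8 + 3·64 = 209.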

===MPI rank===

MPI rank should not be confused with a torus rank, even though by default the two are the same. MPI rank is a property of a process, ''not'' node. If one submits a job in the <tt>VN</tt> or <tt>DUAL</tt> mode, there will be multiple MPI tasks per node, obviously each with a different MPI rank. Also, using the <tt>BG_MAPPING</tt> environment variable changes the mapping between the torus coordinates and MPI ranks.

While obtaining the MPI rank from an MPI application is trivial, how can one obtain it from a shell script?

One way would be to invoke a simple C program:

<pre>
#include <stdio.h>
#include "zoid_api.h"

int main(void)
{
    if (__zoid_init())
        return 1;
    printf("%d\n", __zoid_my_rank());
    return 0;
}
</pre>

(compile with <tt>-I</tt>''path_to_ZeptoOS''<tt>/packages/zoid/prebuilt -L</tt>''path_to_ZeptoOS''<tt>/packages/zoid/prebuilt -lzoid_cn</tt>)

A slight disadvantage of this approach is that <tt>__zoid_init</tt> registers the process with the ZOID daemon on the I/O node, which is an overhead we do not need. Another solution, without using any binaries, is as follows:

<pre>
MPI_RANK=`echo $CONTROL_INIT | awk -F, '{print $4}'`
</pre>

This has a disadvantage of using internal ZOID variables which are not guaranteed to be supported in future releases.
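Because the variable is internal, a script may want to check that the extracted field really looks like a rank before using it. The following is a defensive sketch; the function name is ours, and nothing is assumed about the format of <tt>CONTROL_INIT</tt> beyond the fourth comma-separated field being the rank, as in the one-liner above:

```shell
# Print the fourth comma-separated field of $1, but only if it is a
# non-negative integer; otherwise print nothing and return non-zero.
rank_from_control_init() {
    # -s suppresses input without any comma, so a bare value is rejected.
    rank=$(printf '%s\n' "$1" | cut -s -d, -f4)
    case $rank in
        ''|*[!0-9]*) return 1 ;;
    esac
    echo "$rank"
}
```

Typical use would be <tt>MPI_RANK=`rank_from_control_init "$CONTROL_INIT"` || echo "cannot determine MPI rank" 1>&2</tt>.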

==How to open a socket from a CN to the outside world==

ZOID provides IP packet forwarding between the compute nodes and the I/O nodes. However, because the compute nodes use non-routable IP addresses (<tt>192.168.1.</tt>''x''), they cannot communicate directly with the outside world.

The most transparent solution to this problem is to perform network address translation (NAT) on the I/O nodes using the Linux kernel netfilter infrastructure. We used to enable this by default, but experiments have shown it to have a detrimental effect on the overall performance of the TCP/IP stack on the I/O nodes, slowing down access to the network filesystems.

To enable the translation, pass the <tt>ZOID_NAT_ENABLE</tt> environment variable when submitting a job. An administrator can also enable this option permanently in the [[ZOID#opt_enable_nat|config file]].

==How to obtain a Cobalt job ID==

Cobalt passes the job id to the application processes launched on the compute nodes using the <tt>COBALT_JOBID</tt> environment variable.

This variable is also accessible from the [[ZOID#User script|user script]] running on the I/O nodes, using the <tt>ZOID_JOB_ENV</tt> variable:

<pre>
COBALT_JOBID=`echo $ZOID_JOB_ENV | sed 's/^.*COBALT_JOBID=\([^:]*\).*$/\1/'`
</pre>
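A more general variant looks up any variable by name. It is a sketch only: the function name <tt>job_env_value</tt> is ours, and it assumes that <tt>ZOID_JOB_ENV</tt> is a colon-separated list of <tt>NAME=value</tt> pairs, so it will not work for values that themselves contain a colon:

```shell
# Print the value of variable $1 from the colon-separated NAME=value
# list in $2; print nothing if the variable is not present.
job_env_value() {
    printf '%s\n' "$2" | tr ':' '\n' | sed -n "s/^$1=//p" | head -n 1
}
```

In the user script this would be called as <tt>COBALT_JOBID=`job_env_value COBALT_JOBID "$ZOID_JOB_ENV"`</tt>.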

==Why large MPI processes do not work==

A common reason might be that they do not have enough memory to run. MPI processes run within the Big Memory region, which is limited to 256 MB by default. See the [[Kernel#Kernel (command line) parameters|Kernel]] section to learn how to change that; the parameter to use is <tt>flatmemsizeMB</tt>.

----
[[ZeptoOS_Documentation|Top]]

MPICH, DCMF, and SPI

2009-05-07T20:40:52Z

Iskra: /* ZCB and Big memory */

[[Testing]] | [[ZeptoOS_Documentation|Top]] | [[Kernel]]
----

==Introduction==

To support high performance computing (HPC) applications, specifically MPI applications, we have ported IBM CNK's communication software stack to the ZeptoOS compute node Linux environment. MPICH used in this ZeptoOS release is mpich2-1.0.7 with IBM patches. It is reasonably stable, and the performance of MPI applications on the ZeptoOS compute node Linux is comparable to that on CNK. While there are some limitations at the moment, there are benefits as well.

Benefits:
* No limitation on the number of threads
** 4 or more OpenMP threads per node
** Additional threads as I/O or background tasks
* It is Linux!
** Debugging tools such as gdb, strace, etc
** Various file systems, such as ramfs

Current limitations:
* Only the SMP mode is supported
* Shared libraries are not yet provided
* No binary compatibility between CNK and ZeptoOS CN Linux

We will support a VN-equivalent mode (multiple MPI tasks per node) and provide shared libraries in a future release.

As in the IBM CNK environment, the Deep Computing Messaging Framework (DCMF) and the System Programming Interface (SPI) are available. It is possible to write DCMF or SPI code directly if necessary. DCMF is a communication library that provides non-blocking operations. Please refer to the [http://dcmf.anl-external.org/wiki/index.php/Main_Page DCMF wiki] for details. We are using DCMF version 1.0.0 in the current ZeptoOS release, which is older than the DCMF in the current driver release (V1R3M0). SPI is the lowest-level user space API for the torus DMA, collective network, BGP-specific lock mechanisms, and other compute node specific features. There is no public document on SPI available at the moment, but almost all header files and source code are available. Internally, MPICH depends on DCMF, which in turn depends on SPI. We will say more about it [[#Software stack layout|later]].

===ZCB and Big memory===

MPI applications running under the ZeptoOS compute node environment (technically, applications that require DMA operations or maximum memory bandwidth) need to be configured as Zepto Compute Binaries (ZCB). This is done using the <tt>zelftool</tt>, invoked behind the scenes when linking a binary using the ZeptoOS MPI compiler wrapper scripts (<tt>zmpicc</tt>, etc).

The ZeptoOS compute node kernel treats ZCB executables differently from ordinary processes. It creates a special memory mapping region called big memory, which is covered by large pages with semi-static TLB entries, and it loads all application sections into the big memory region. The big memory region has virtually no TLB misses and it also enables DMA operations.

Some system calls will not work correctly if used from a ZCB process, in particular <tt>fork</tt> (but creating threads does work). Also, because big memory is a separate memory region set up at kernel boot time, its size is fixed. It is set to 256 MB by default, which could be too small for larger MPI tasks; it can be [[FAQ#Why large MPI programs do not work|increased]] before booting a partition, at the expense of the ordinary Linux paged memory.

==Compiling HPC applications==

While the same compilers can be used as for the IBM CNK, the ZeptoOS compute node environment requires linking with ZeptoOS-specific communication libraries (applications linked with the CNK MPI will not work on ZeptoOS).

===Compiler wrapper scripts===

We provide compiler wrapper scripts which automatically link with appropriate libraries from the ZeptoOS installation directory. We provide the same set of wrapper scripts that IBM provides, with an extra <tt>z</tt> prefix:

; zmpicc, zmpicxx, zmpif77, zmpif90
: Wrapper scripts that invoke BGP-enhanced GNU compilers

; zmpixlc, zmpixlcxx, zmpixlf2003, zmpixlf77, zmpixlf90, zmpixlf95
: Wrapper scripts that invoke IBM XL compilers

; zmpixlc_r, zmpixlcxx_r, zmpixlf2003_r, zmpixlf77_r, zmpixlf90_r, zmpixlf95_r
: Wrapper scripts that invoke IBM XL compilers (thread safe compilation for OpenMP)

To get insight into the internals of these scripts, invoke them with the <tt>-show</tt> option.
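A quick way to confirm that an installation provides all of the wrapper scripts listed above is sketched below. The variable and function names are ours; the wrapper names are the ones from the list:

```shell
# Wrapper script names, as listed above.
ZEPTO_WRAPPERS="zmpicc zmpicxx zmpif77 zmpif90
zmpixlc zmpixlcxx zmpixlf2003 zmpixlf77 zmpixlf90 zmpixlf95
zmpixlc_r zmpixlcxx_r zmpixlf2003_r zmpixlf77_r zmpixlf90_r zmpixlf95_r"

# Print the name of every wrapper that is not an executable file in
# directory $1; print nothing if all of them are present.
missing_wrappers() {
    for w in $ZEPTO_WRAPPERS; do
        [ -x "$1/$w" ] || echo "$w"
    done
}
```

For example, <tt>missing_wrappers /path/to/install/bin</tt> should print nothing on a complete installation.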

====A compilation example====

Understanding the build system of a program might take some time, but nothing special is needed to compile a program for ZeptoOS.

Here is a real-world example of how to build a well-known [http://climate.lanl.gov/Models/POP/ Parallel Ocean Program (POP)].

<pre>
$ wget http://climate.lanl.gov/Models/POP/POP_2.0.1.tar.Z
$ tar xvfz POP_2.0.1.tar.Z && cd pop
$ ./setup_run_dir ztest && cd ztest
$ edit ibm_mpi.gnu # see the patch below
$ export ARCHDIR=ibm_mpi
$ make # takes a while
$ edit pop_in # test data set
- nprocs_clinic = 4
- nprocs_tropic = 4
+ nprocs_clinic = 64
+ nprocs_tropic = 64
$ cqsub -n 64 -t 10 -k <zepto_profile> ./pop

--------------------
--- orig/ibm_mpi.gnu 2009-04-15 15:01:58.666457601 -0500
+++ ztest/ibm_mpi.gnu 2009-04-15 14:17:58.099132435 -0500
@@ -6,17 +6,18 @@
# will someday be a file which is a cookbook in Q&A style: "How do I do X?"
# is followed by something like "Go to file Y and add Z to line NNN."
#
-FC = mpxlf90_r
-LD = mpxlf90_r
-CC = mpcc_r
-Cp = /usr/bin/cp
-Cpp = /usr/ccs/lib/cpp -P
+ZPATH=<zepto_dir>
+FC = $(ZPATH)/zmpixlf90
+LD = $(ZPATH)/zmpixlf90
+CC = $(ZPATH)/zmpixlc
+Cp = /bin/cp
+Cpp = /usr/bin/cpp -P
AWK = /usr/bin/awk
-ABI = -q64
+#ABI = -q64
COMMDIR = mpi

-NETCDFINC = -I/usr/local/include
-NETCDFLIB = -L/usr/local/lib
+NETCDFINC = -I/soft/apps/netcdf-4.0/include/
+NETCDFLIB = -L/soft/apps/netcdf-4.0/lib

# Enable MPI library for parallel code, yes/no.

@@ -58,7 +59,8 @@
#
#----------------------------------------------------------------------------

-FBASE = $(ABI) -qarch=auto -qnosave -bmaxdata:0x80000000 $(NETCDFINC) -I$(ObjDepDir)
+#FBASE = $(ABI) -qarch=auto -qnosave -bmaxdata:0x80000000 $(NETCDFINC) -I$(ObjDepDir)
+FBASE = $(ABI) -qarch=auto -qnosave $(NETCDFINC) -I$(ObjDepDir)

ifeq ($(TRAP_FPE),yes)
FBASE := $(FBASE) -qflttrap=overflow:zerodivide:enable -qspillsize=32704

</pre>

===Compiling without the wrapper scripts===

If one wishes to invoke the compiler directly, please make sure that the Makefile or build environment points to ZeptoOS header files and libraries correctly. An example would be:

<pre>
$ /bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc-bgp-linux-gcc \
-o mpi-test-linux -Wall -O3 -I<zepto_dir>/include mpi-test.c \
-L<zepto_dir>/lib -lmpich.zcl -ldcmfcoll.zcl -ldcmf.zcl -lSPI.zcl -lzcl \
-lzoid_cn -lrt -lpthread -lm
$ <zepto_dir>/bin/zelftool -e mpi-test-linux
</pre>

'''Notes:'''
* Replace <tt><zepto_dir></tt> with your actual ZeptoOS install path.
* Do not forget to call the <tt>zelftool</tt> utility, which makes the executable a Zepto Compute Binary.

==Building MPICH, DCMF, and SPI libraries==

We provide all the necessary source code to build MPICH, DCMF, and SPI. To build these libraries, just type:

<pre>
$ make -C comm rebuild-target
</pre>

It may take half an hour to an hour to complete the build process, depending on what file system you are using (e.g., GPFS is a lot slower than a local file system).

The <tt>rebuild-target</tt> target does not know anything about the existing installation directory; it only copies the built libraries and header files to the <tt>comm/tmp</tt> directory. To install the newly built libraries, do the following:

<pre>
$ make -C comm update-prebuilt
$ python install.py <zepto_dir>
</pre>

The <tt>update-prebuilt</tt> target copies the files from the <tt>comm/tmp</tt> directory to the <tt>comm/prebuilt</tt> directory, which is where the <tt>install.py</tt> script looks for the files to copy to <tt><zepto_dir></tt>.

==Software stack layout==

[[Image:Zepto-Comm-Stack.png|right]]

The figure on the right depicts the layout of the communication software stack in the ZeptoOS compute node environment. This is very similar to the IBM CNK's stack, with the exception of an extra ZEPTO SPI layer, and the use of Linux instead of CNK.

Since MPICH is a well-known software package we will not discuss it here, but we will briefly describe the DCMF and SPI components:

* DCMF
** Stands for Deep Computing Messaging Framework
** Developed by IBM originally for Blue Gene architecture
** Hardware initialization, query functions
** Supports BGP Torus DMA, collective network
** Provides timer
** Supports non-blocking collective operations
** BGP MPICH uses DCMF internally (IBM provides a glue layer)
* SPI
** Stands for System Programming Interface
** Developed by IBM. BGP-specific code.
** Kernel interfaces – DMA control, lockbox, etc
** DMA related definitions
*** can be used in both user space and kernel space
** RAS, BGP personality, mapping related functions

BGP SPI was designed specifically for IBM CNK, so it is not compatible with Linux. ZEPTO SPI is a thin software layer that absorbs the differences between the CNK and Linux or drops the requests that Linux cannot handle.

==Source code==

The source code and header files of DCMF and SPI can be found in the <tt>comm</tt> directory. The source code of MPICH is in <tt>DCMF/lib/mpich2/mpich2-1.0.7.tar.gz</tt>, which will be extracted at build time.

The DCMF source code is located in <tt>DCMF/sys/</tt>. DCMF core source code is in <tt>DCMF/sys/messaging/</tt>. Component Collective Messaging Interface (CCMI) is part of DCMF and its source code is in <tt>DCMF/sys/collectives/</tt>. Test codes can be found in <tt>DCMF/sys/collectives/tests/</tt> for CCMI and in <tt>DCMF/sys/messaging/tests/</tt> for the DCMF core. Those test codes are good examples of DCMF/CCMI programming.

SPI headers are in <tt>arch-runtime/arch/</tt> and SPI source code is in <tt>arch-runtime/runtime/</tt>. The source code of the ZEPTO SPI layer is in <tt>arch-runtime/zcl_spi/</tt>, while the header files are in <tt>arch-runtime/arch/include/zepto/</tt>.

Here is an overview of the directory tree:

<pre>
comm
|-- DCMF
|   |-- lib
|   |   |-- dev
|   |   `-- mpich2
|   |       `-- make
|   |-- sys
|   |   |-- collectives
|   |   |   |-- adaptor
|   |   |   |-- kernel
|   |   |   |-- tests
|   |   |   `-- tools
|   |   |-- include
|   |   `-- messaging
|   |       |-- devices
|   |       |-- messager
|   |       |-- protocols
|   |       |-- queueing
|   |       |-- sysdep
|   |       `-- tests
|-- arch-runtime
|   |-- arch
|   |   `-- include
|   |       |-- bpcore
|   |       |-- cnk
|   |       |-- common
|   |       |-- spi
|   |       `-- zepto
|   |-- runtime
|   |-- testcodes
|   `-- zcl_spi
`-- testcodes
</pre>

===Debug output===

The ZeptoOS versions of SPI and DCMF have built-in debug output. The output is disabled by default, and can be enabled by setting the environment variable <tt>ZEPTO_TRACE</tt> when submitting a job. The integer value of the variable indicates the debug level (a higher number results in more debug output).

An example:
<pre>
$ cqsub -k <zepto_profile> -n 64 -t 10 ... -e ZEPTO_TRACE=2 ./a.out
</pre>

----
[[Testing]] | [[ZeptoOS_Documentation|Top]] | [[Kernel]]

FAQ

2009-05-07T20:32:15Z

Iskra:

[[ZeptoOS_Documentation|Top]]
----

==How to obtain a CN node number==

This depends on what number one is interested in.

===Pset rank===

A pset rank is a number identifying a compute node within each ''pset'' (an I/O node and the compute nodes that communicate with it). Note that on partitions larger than one pset, the pset ranks will not be unique. Also, pset ranks do ''not'' start from <tt>0</tt>; they start from <tt>1</tt> for some mysterious reason (do not blame us – blame IBM :-).

Pset rank is used as the last octet in the IP address on the tree network connecting the compute nodes and the I/O nodes (''x'' in <tt>192.168.1.</tt>''x'').

The pset rank is available on the compute nodes from <tt>/proc/personality.sh</tt>, in the <tt>BG_RANK_IN_PSET</tt> variable:

<pre>
#!/bin/sh

. /proc/personality.sh

echo "My pset rank is $BG_RANK_IN_PSET"
</pre>

From a C program it will be easier to use the binary personality available from <tt>/proc/personality</tt>. The definition of the structure can be found in <tt>/bgsys/drivers/ppcfloor/arch/include/common/bgp_personality.h</tt>. The pset rank is in <tt>Network_Config.RankInPSet</tt>:

<pre>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <common/bgp_personality.h>

int main(void)
{
    _BGP_Personality_t personality;
    int fd;

    if ((fd = open("/proc/personality", O_RDONLY)) == -1)
    {
        perror("open");
        return 1;
    }
    if (read(fd, &personality, sizeof(personality)) != sizeof(personality))
    {
        perror("read");
        close(fd);
        return 1;
    }
    close(fd);

    printf("My pset rank is %d\n", personality.Network_Config.RankInPSet);

    return 0;
}
</pre>

(compile the above with <tt>-I/bgsys/drivers/ppcfloor/arch/include</tt>)

===Torus rank===

A torus rank is a number identifying a compute node within a whole partition. In a way, it is much "nicer" than a pset rank since it is unique within a job and it also starts from <tt>0</tt>.

The torus rank is easy to obtain from a C program: it is the <tt>Network_Config.Rank</tt> field of the personality structure.

Unfortunately, the torus rank is not available in <tt>/proc/personality.sh</tt>, but a shell script can easily calculate it from other fields:

<pre>
TORUS_RANK=`echo $BG_PSETORG | awk "{print \\$1 + \\$2 * $BG_XSIZE + \
\\$3 * $BG_XSIZE * $BG_YSIZE}"`
</pre>

===MPI rank===

MPI rank should not be confused with a torus rank, even though by default the two are the same. MPI rank is a property of a process, ''not'' node. If one submits a job in the <tt>VN</tt> or <tt>DUAL</tt> mode, there will be multiple MPI tasks per node, obviously each with a different MPI rank. Also, using the <tt>BG_MAPPING</tt> environment variable changes the mapping between the torus coordinates and MPI ranks.

While obtaining the MPI rank from an MPI application is trivial, how can one obtain it from a shell script?

One way would be to invoke a simple C program:

<pre>
#include <stdio.h>
#include "zoid_api.h"

int main(void)
{
    if (__zoid_init())
        return 1;
    printf("%d\n", __zoid_my_rank());
    return 0;
}
</pre>

(compile with <tt>-I</tt>''path_to_ZeptoOS''<tt>/packages/zoid/prebuilt -L</tt>''path_to_ZeptoOS''<tt>/packages/zoid/prebuilt -lzoid_cn</tt>)

A slight disadvantage of this approach is that <tt>__zoid_init</tt> registers the process with the ZOID daemon on the I/O node, which is an overhead we do not need. Another solution, without using any binaries, is as follows:

<pre>
MPI_RANK=`echo $CONTROL_INIT | awk -F, '{print $4}'`
</pre>

This has a disadvantage of using internal ZOID variables which are not guaranteed to be supported in future releases.

==How to open a socket from a CN to the outside world==

ZOID provides IP packet forwarding between the compute nodes and the I/O nodes. However, because the compute nodes use non-routable IP addresses (<tt>192.168.1.</tt>''x''), they cannot communicate directly with the outside world.

The most transparent solution to this problem is to perform network address translation (NAT) on the I/O nodes using the Linux kernel netfilter infrastructure. We used to enable this by default, but experiments have shown it to have a detrimental effect on the overall performance of the TCP/IP stack on the I/O nodes, slowing down access to the network filesystems.

To enable the translation, pass the <tt>ZOID_NAT_ENABLE</tt> environment variable when submitting a job. An administrator can also enable this option permanently in the [[ZOID#opt_enable_nat|config file]].

==How to obtain a Cobalt job ID==

Cobalt passes the job id to the application processes launched on the compute nodes using the <tt>COBALT_JOBID</tt> environment variable.

This variable is also accessible from the [[ZOID#User script|user script]] running on the I/O nodes, using the <tt>ZOID_JOB_ENV</tt> variable:

<pre>
COBALT_JOBID=`echo $ZOID_JOB_ENV | sed 's/^.*COBALT_JOBID=\([^:]*\).*$/\1/'`
</pre>
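The same lookup can be sketched in C for helper programs started from the user script. The function below is a hypothetical helper, not a ZOID API; it assumes only that <tt>ZOID_JOB_ENV</tt> is a colon-separated list of <tt>NAME=VALUE</tt> entries, as described in the ZOID section:

```c
#include <stddef.h>
#include <string.h>

/* Looks up `name` in a colon-separated list of NAME=VALUE entries.
   On success, copies the value into `out` (truncated to fit) and
   returns 0; returns -1 if the name is not present.
   Hypothetical helper; not part of ZOID. */
int job_env_lookup(const char *env, const char *name,
                   char *out, size_t outlen)
{
    size_t namelen = strlen(name);
    const char *p = env;

    if (outlen == 0)
        return -1;

    while (*p != '\0') {
        const char *end = strchr(p, ':');
        size_t entrylen = end ? (size_t)(end - p) : strlen(p);

        /* An entry matches if it starts with "name=". */
        if (entrylen > namelen && strncmp(p, name, namelen) == 0 &&
            p[namelen] == '=') {
            size_t vallen = entrylen - namelen - 1;

            if (vallen >= outlen)
                vallen = outlen - 1;
            memcpy(out, p + namelen + 1, vallen);
            out[vallen] = '\0';
            return 0;
        }
        if (end == NULL)
            break;
        p = end + 1;
    }
    return -1;
}
```

Like the <tt>sed</tt> expression, this cannot recover values that themselves contain a colon, since the colon is also the entry separator.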

==Why large MPI programs do not work==

A common reason might be that they do not have enough memory to run. MPI programs run within the Big Memory region, which is limited to 256 MB by default. See the [[Kernel#Kernel (command line) parameters|Kernel]] section to learn how to change that; the parameter to use is <tt>flatmemsizeMB</tt>.

----
[[ZeptoOS_Documentation|Top]]

ZOID

2009-05-07T20:25:33Z

Iskra: /* Configuration file */

[[Ramdisk]] | [[ZeptoOS_Documentation|Top]] | [[(K)TAU]]
----

==Introduction==

ZOID is an I/O forwarding component of the ZeptoOS project. Any communication between the compute nodes and the I/O nodes (job management, file I/O, sockets) is handled by ZOID.

The ZOID infrastructure consists of:
* A multithreaded <tt>zoid</tt> daemon on the I/O nodes, which performs I/O forwarding for the compute nodes and also communicates with the service node to perform job management,
* A <tt>control</tt> daemon on the compute nodes, which is responsible for job management tasks such as the launching of application processes, for the forwarding of <tt>stdin</tt>/<tt>out</tt>/<tt>err</tt> data, and for the forwarding of IP packets,
* A <tt>zoid-fuse</tt> daemon on the compute nodes, which performs file I/O forwarding for POSIX-compliant applications.

==User interface==

ZOID is meant to be transparent to users, but there are a few optional mechanisms available to interact with it.

===User script===

Right before a job starts running, and right after the last process of a job has terminated, the ZOID daemon attempts to invoke a ''user script'' on the I/O nodes. By default, the daemon invokes <tt>$HOME/zoid-user-script.sh</tt> (this pathname can be [[#opt_user_script|changed]] by an administrator). A single parameter is passed to the script: <tt>1</tt> at job startup, and <tt>0</tt> at termination.

Information about the job will be passed to the script in the following environment variables:
; <tt>ZOID_JOB_EXEC</tt>
: name of the job executable,
; <tt>ZOID_JOB_ARGS</tt>
: job arguments, separated by colons (<tt>:</tt>)
; <tt>ZOID_JOB_ENV</tt>
: job environment variables, separated by colons (<tt>:</tt>)
; <tt>ZOID_JOB_ID</tt>
: BG/P control system job id ('''Note:''' this is generally different from the Cobalt job ID; see [[FAQ#How to obtain a Cobalt job ID|FAQ]] for the latter),
; <tt>ZOID_JOB_GLOBAL_SIZE</tt>
: the number of processes in the job (the size of <tt>MPI_COMM_WORLD</tt>),
; <tt>ZOID_JOB_LOCAL_SIZE</tt>
: the number of job processes handled by this I/O node,
; <tt>ZOID_JOB_MODE</tt>
: <tt>0</tt> for SMP, <tt>1</tt> for VN, and <tt>2</tt> for DUAL,
; <tt>SHELL</tt>, <tt>PATH</tt>, <tt>USER</tt>, and <tt>HOME</tt>
: are also set.

'''Note:''' the user script is invoked ''synchronously'' by the daemon, i.e., the job will not start running until the script terminates. If one needs some processes to run on the I/O nodes while the job is running, they should be started in the background (&).

'''Note 2:''' for this feature to work, [[#The /bin.rd/update_passwd_file.sh file|update_passwd_file.sh]] must be working correctly.
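A helper program started from the user script sees the same environment variables. As a small illustration, the numeric <tt>ZOID_JOB_MODE</tt> value listed above could be decoded as follows; the function name is hypothetical and the mapping is taken directly from the list of variables:

```c
#include <string.h>

/* Maps the value of ZOID_JOB_MODE to a mode name, following the
   0/1/2 encoding listed above; returns NULL for anything else.
   Hypothetical helper; not part of ZOID. */
const char *zoid_job_mode_name(const char *value)
{
    if (value == NULL)
        return NULL;
    if (strcmp(value, "0") == 0)
        return "SMP";
    if (strcmp(value, "1") == 0)
        return "VN";
    if (strcmp(value, "2") == 0)
        return "DUAL";
    return NULL;
}
```

A typical call would be <tt>zoid_job_mode_name(getenv("ZOID_JOB_MODE"))</tt>; the <tt>NULL</tt> check covers the case where the program is run outside of a user script.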

===File broadcast===
A <tt>/bin.rd/f2cn</tt> command is available on the I/O nodes for a very efficient (hardware-assisted) broadcasting of files to all the compute nodes handled by the given I/O node.

The command takes two arguments:
* absolute pathname to the input file on the I/O node,
* absolute pathname to the output file on the compute nodes.

The input file does not need to be physically on the I/O node; it can be on a network filesystem mounted on the node. The file will be created in the ramdisk of each compute node.

The throughput is in practice limited by how fast the input file can be read; we have seen results in excess of 300 MB/s for files residing in the I/O node ramdisk.

'''Note:''' all the compute nodes in the pset must be up and running. Do not use this command on ''incomplete'' partitions (e.g., a one-process job on a 64-node partition); this will likely hang the ZOID daemon.

'''Note 2:''' this feature can safely be used from within a [[#User script|user script]], so one can, e.g., pre-stage large binaries, like this:

User script (<tt>$HOME/zoid-user-script.sh</tt>):
<pre>
#!/bin/sh

if [ "$1" -eq "1" ]; then
    /bin.rd/f2cn $HOME/large_binary /tmp/large_binary
fi
exit 0
</pre>

Job script (submitted using Cobalt or mpirun):
<pre>
#!/bin/sh

chmod 755 /tmp/large_binary
/tmp/large_binary
</pre>

===Performance counters===

A <tt>/bin.rd/statquery</tt> command is available on the I/O nodes for obtaining the performance counters of the I/O daemon.

The command takes a single optional argument:
* the interval between successive queries, in seconds.

If the argument is not provided, the command will terminate after the first query.

Here is a sample of the generated output:

<pre>
Timestamp: 1240439085.688831
Total messages sent: 5767
Total bytes sent: 7619170
Total messages received: 5717
Total bytes received: 72575
IP fwd messages sent: 196
IP fwd bytes sent: 5889
IP fwd messages received: 84
IP fwd bytes received: 6453
Stream messages sent: 65
Stream bytes sent: 520
Stream messages received: 65
Stream bytes received: 1416
Broadcast messages sent: 1
Broadcast bytes sent: 2437906
Internal messages sent: 193
Internal bytes sent: 39524
Internal messages received: 256
Internal bytes received: 1792
Plugin 5 messages sent: 0
Plugin 5 bytes sent: 0
Plugin 5 messages received: 0
Plugin 5 bytes received: 0
Plugin 2 messages sent: 5312
Plugin 2 bytes sent: 5135331
Plugin 2 messages received: 5312
Plugin 2 bytes received: 62914
</pre>

The meaning of the fields is as follows:
; Timestamp
: number of seconds and microseconds from the epoch, as returned by <tt>gettimeofday(2)</tt>,
; IP fwd
: IP packet forwarding between compute nodes and I/O nodes,
; Stream
: <tt>stdin</tt>/<tt>out</tt>/<tt>err</tt> streams,
; Broadcast
: [[#File broadcast|file broadcasts]],
; Internal
: job control messages, etc.
; Plugin 5
: internal <tt>mapping</tt> plug-in, used by MPI,
; Plugin 2
: <tt>unix</tt> plugin (POSIX file I/O).

The counters are 64-bit integers, so they will take a while to overflow :-).
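For post-processing the collected statistics, each counter line can be split into a name and a 64-bit value. The parser below is a sketch that assumes only the <tt>Name: value</tt> layout visible in the sample output; <tt>parse_counter_line</tt> is a hypothetical helper, and lines whose value is not a plain integer (such as the <tt>Timestamp</tt> line) are rejected rather than interpreted:

```c
#include <stdlib.h>
#include <string.h>

/* Splits a "Name: value" line into its name and an unsigned 64-bit
   counter. Returns 0 on success, -1 if the line has no colon, the
   name does not fit, or the value is not a plain integer.
   Hypothetical helper; not part of ZOID. */
int parse_counter_line(const char *line, char *name, size_t namelen,
                       unsigned long long *value)
{
    const char *colon = strchr(line, ':');
    unsigned long long v;
    char *end;
    size_t n;

    if (colon == NULL)
        return -1;
    n = (size_t)(colon - line);
    if (n >= namelen)
        return -1;              /* name does not fit in the buffer */

    v = strtoull(colon + 1, &end, 10);
    if (end == colon + 1 || (*end != '\0' && *end != '\n'))
        return -1;              /* not an integer, e.g. a timestamp */

    memcpy(name, line, n);
    name[n] = '\0';
    *value = v;
    return 0;
}
```

Using <tt>unsigned long long</tt> keeps the full range of the 64-bit counters.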

Example user script (<tt>$HOME/zoid-user-script.sh</tt>) that samples the statistics every 60 seconds and writes them to a unique file:
<pre>
#!/bin/sh

if [ "$1" -eq "1" ]; then
    /bin.rd/statquery 60 >$HOME/zoid_stats.$ZOID_JOB_ID.`hostname` &
fi
exit 0
</pre>

==Administrator interface==

===Configuration file===

The <tt>zoid</tt> I/O daemon accepts a number of command-line options that can be used to change its behavior. They can be adjusted by editing the <tt>ramdisk/ION/ramdisk-add/etc/sysconfig/zoid</tt> file and rebuilding the I/O node ramdisk:

; ZOID_BUFFER_SIZE (-b)
: Specifies the size of the buffers used for messages. Because a separate buffer is needed for a request and a reply, and typically no more than one of these needs to be large, to save memory ZOID supports buffers of two sizes: a small one (4 KB by default) and a large one (4 MB+1 KB by default – the 1 KB is there to accommodate the headers). Use a colon (<tt>:</tt>) to separate the two sizes when customizing this value. If desired, support for the second buffer size can be disabled by providing only one value to this option.
; ZOID_ACK_THRESHOLD (-a)
: Specifies a size threshold for the rendezvous protocol for messages coming from the compute nodes, in units of tree network packets (240 bytes of data each). An eager protocol is used for messages below the threshold. Messages above the threshold use flow control in the form of a rendezvous protocol with message acknowledgements; basically, the daemon will only receive one large message at a time, which improves predictability and overall throughput. The daemon default for this option is to not use the acknowledgements, but the config file defaults to a value of <tt>8</tt>, which is the size of the hardware FIFO buffer of the tree network device. Set this option to <tt>0</tt> (or comment it out altogether) to disable message acknowledgements.
; ZOID_MODULES (-m)
: Specifies a <tt>:</tt>-separated list of ZOID plug-ins to load. This defaults to <tt>"unix_impl.so:unix_preload.so:mapping_impl.so:mapping_preload.so"</tt> in the config file; do not remove any of these or basic system services will stop working. The <tt>unix</tt> plug-in provides POSIX file I/O support, while <tt>mapping</tt> is used by our MPI implementation to map between MPI ranks and Blue Gene X/Y/Z/T coordinates. Custom plug-ins can be created and added here; see [[#Programmer interface|Programmer interface]] for details.
; ZOID_ENABLE_NAT (-n)
: Enables network address translation (NAT) for IP packets coming from the compute nodes, allowing compute nodes to communicate with the outside world. This support is disabled by default because it was found to have a detrimental effect on the overall performance of the TCP/IP stack on the I/O nodes, slowing down network filesystems. This feature can also be enabled on a per-job basis by setting the <tt>ZOID_ENABLE_NAT</tt> environment variable when submitting a job (see the [[FAQ#How to open a socket from a CN to the outside world|FAQ]]).
; ZOID_USER_SCRIPT (-u)
: Specifies the pathname to the [[#User script|user script]]; it defaults to <tt>"/bin.rd/zoid-user-script.sh"</tt>. This script can be found in <tt>ramdisk/ION/ramdisk-add/bin/zoid-user-script.sh</tt>; it sets a few environment variables and then invokes the user's custom <tt>$HOME/zoid-user-script.sh</tt>. Hence, to adjust the behavior of this feature, either change this option or the script in the ramdisk. '''Note:''' to be able to invoke a script from the user's home directory, [[#The /bin.rd/update_passwd_file.sh file|update_passwd_file.sh]] must be working correctly.
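
To make the numbers above concrete, the sketch below parses a <tt>small:large</tt> buffer size pair and converts an acknowledgement threshold from packets to bytes. It is an illustration only: the helper names are hypothetical, and it assumes the sizes are written as plain byte counts, which this page does not state.

```c
#include <stdlib.h>

#define TREE_PACKET_PAYLOAD 240  /* bytes of data per tree network packet */

/* Parses "small:large" or a single "size". Assumes plain byte counts
   (an assumption; the accepted syntax is not documented here).
   *large_size is set to 0 when only one size is given.
   Returns 0 on success, -1 on a malformed value. */
int parse_buffer_sizes(const char *opt, unsigned long *small_size,
                       unsigned long *large_size)
{
    unsigned long s, l = 0;
    char *end;

    s = strtoul(opt, &end, 10);
    if (end == opt)
        return -1;
    if (*end == ':') {
        const char *second = end + 1;

        l = strtoul(second, &end, 10);
        if (end == second)
            return -1;
    }
    if (*end != '\0')
        return -1;

    *small_size = s;
    *large_size = l;
    return 0;
}

/* Converts an acknowledgement threshold in packets to bytes of payload. */
unsigned long ack_threshold_bytes(unsigned long packets)
{
    return packets * TREE_PACKET_PAYLOAD;
}
```

With the defaults quoted above, the large buffer is 4 MB + 1 KB = 4195328 bytes, and a threshold of <tt>8</tt> packets corresponds to 1920 bytes of payload.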

===The /bin.rd/update_passwd_file.sh file===

Allowing the partition owner to log into the I/O node using SSH is one of the features of the ZeptoOS software stack. Only the administrator and the partition owner are given login access; this is controlled by the <tt>/bin.rd/update_passwd_file.sh</tt> script, which is invoked by the daemon while the partition is being initialized. The script can be found in <tt>ramdisk/ION/ramdisk-add/bin/update_passwd_file.sh</tt>.

The script makes a number of assumptions that could be site-specific, so it might require adjustment. The daemon invokes the script, passing the numerical UNIX user ID of the partition owner as the only argument. The script then scans <tt>/bgsys/iofs/etc/passwd</tt> for an entry with the same user ID (on Argonne machines, this file contains all valid account names). If a matching entry is found, it is appended to the <tt>/etc/passwd</tt> file in the I/O node ramdisk, thus enabling login access to the node for that user.

If allowing ordinary users access to the I/O nodes is undesirable, one can simply put <tt>exit 0</tt> at the top of the script to disable it.

===The /bin.rd/nat file===

If NAT has been [[#opt_enable_nat|requested]], the daemon invokes the <tt>/bin.rd/nat</tt> script to enable it. This script can be found in <tt>ramdisk/ION/ramdisk-add/bin/nat</tt>. Generally, it should not require any modifications.

==Programmer interface==

ZOID is a flexible, extensible, high-performance function call forwarding (RPC) infrastructure. Built-in features and the standard plug-ins provide familiar POSIX file I/O and BSD socket interfaces, but, because of the number of software layers involved, they introduce a significant overhead. For applications requiring maximum bandwidth between the compute and I/O nodes, ZOID provides an option of a customized function call forwarding with minimal overheads. This section provides an overview of how to create such custom plug-ins.

===Overview===

All that ZOID provides is a function call forwarding support, and a limited one at that. Any logic (caching, prefetching, etc.) needs to be custom-built on top of it.

Follow existing plug-ins, found in <tt>packages/zoid/src/</tt>, as examples. The <tt>unix</tt> plug-in is generally the most up to date, but other plug-ins such as <tt>mapping</tt>, <tt>zoidfs</tt>, <tt>barrier</tt>, and <tt>test</tt> should also be fine.

A plug-in consists of automatically generated client-side and server-side stubs (which perform the marshalling and demarshalling of function call parameters and results, the forwarding of the function call, etc.), and of a hand-written server-side implementation which provides the implementation code for the forwarded function calls. One might also decide to provide hand-written client-side wrappers to hide some details of the ZOID API (such as the error handling) or to adhere to a particular existing API, as is the case with the <tt>unix</tt> plug-in (the wrappers used by the FUSE client are available in <tt>packages/zoid/src/unix/stubs/</tt>; another version is in the GNU libc sources, in <tt>packages/glibc/src/zoid/sysdeps/unix/sysv/linux/powerpc/powerpc32/</tt>).

The <tt>scanner.pl</tt> script, found in <tt>packages/zoid/src/</tt>, creates the automatically-generated client and server stubs based on a hand-written input header file described below. Again, please follow the examples from the existing plug-ins, such as <tt>unix</tt> or <tt>mapping</tt>. The <tt>Makefile</tt> in those plug-ins is written in a generic fashion and should only require a change to the <tt>PREFIX</tt> line to be usable with another plug-in. Use that <tt>Makefile</tt> to invoke the <tt>scanner.pl</tt> script and to compile the generated source files.

===Input header file===

The input header file must be a valid C header file with additional hints in the comments. The file is read by the <tt>scanner.pl</tt> script.

The parser in the script is rather limited and does not handle many C constructs. It is thus essential that the header file be as simple as possible. In particular, function prototypes should be specified at the end of the file, not intermixed with any other specifications such as data type definitions.

Ordinary comments are best placed on separate lines.

'''Note:''' the parser is case ''sensitive''.

====Start line====

Any complex declarations that the scanner cannot parse should be placed at the top of the file, because the parser ignores everything until it encounters the following magic start line:

<pre>
/* START-ZOID-SCANNER ID=<n> INIT=<s1> FINI=<s2> PROC=<s3> */
</pre>

; ID=<n>
: Each plug-in needs a unique, 16-bit identifier, passed in <tt><n></tt>. The following identifiers are already in use: <tt>0</tt> (internal), <tt>1</tt> (<tt>zoidfs</tt> plug-in), <tt>2</tt> (<tt>unix</tt>), <tt>3</tt> (<tt>lofar</tt>), <tt>4</tt> (<tt>test</tt>), <tt>5</tt> (<tt>mapping</tt>), and <tt>10</tt> (<tt>ftb</tt>).
; INIT=<s1>
: <tt><s1></tt> provides the name of an initialization function which will be invoked before a job starts running; see [[#Start-line functions|Start-line functions]] for more information. If a plug-in does not need this feature, please specify <tt>INIT=NULL</tt>.
; FINI=<s2>
: <tt><s2></tt> provides the name of a termination function which will be invoked after all job processes have exited; see [[#Start-line functions|Start-line functions]] for more information. If a plug-in does not need this feature, please specify <tt>FINI=NULL</tt>.
; PROC=<s3>
: <tt><s3></tt> provides the name of a callback function which will be invoked on a startup and termination of every application and ZOID-enabled process; see [[#Start-line functions|Start-line functions]] for more information. If a plug-in does not need this feature, please specify <tt>PROC=NULL</tt>.

====Argument hints====

Hints are generally needed by the scanner to correctly encode and decode function arguments. They need to be placed after each argument, before a separating comma (or a closing bracket), and should be embedded inside dedicated C comments. Multiple hints per argument are usually provided; these are separated by a colon (<tt>:</tt>). The following hints are currently defined:

; in, out, inout
: Specifies whether the argument is an input argument, an output argument, or both. <tt>in</tt> is the default.
; obj, str, ptr, arr, arr2d
: Specifies the type of the argument, respectively a plain object (say, an <tt>int</tt>, or a structure passed by value), a <tt>'\0'</tt>-terminated character string, a pointer to a plain object, an array of objects, or a two-dimensional array (<tt>type**</tt>, not <tt>type[][]</tt>). <tt>obj</tt> is the default.
; size
: Required for array arguments (<tt>arr</tt> and <tt>arr2d</tt>). Indicates the index of another argument in the same function, which is used to pass the array size. Absolute numbers are accepted (<tt>1</tt> to ''number of arguments'') or relative ones (<tt>+1</tt> for the next argument, <tt>-1</tt> for the previous argument, etc.). For <tt>arr</tt> arguments, the size argument must be of a numerical type, or a pointer to such a type. For <tt>arr2d</tt> arguments, the size argument must itself be an array (an <tt>arr</tt> argument) of numerical elements, specifying the sizes along the less significant dimension of the array (the size of the more significant dimension is the size of the <tt>arr</tt> array itself). Please note that the unit of size is the size of the base array type (thus, <tt>sizeof(int)</tt> for an array of <tt>int</tt>s), not a byte. To express the size in bytes, declare the array argument as <tt>char*</tt> or <tt>void*</tt> (the latter relies on a GCC extension).
; nullok
: An option for arguments passed by pointer (basically, all but <tt>obj</tt>). If provided, it indicates that the argument is allowed to be <tt>NULL</tt>. This is not the default because supporting <tt>NULL</tt> pointers results in an additional computational and protocol overhead. '''Note:''' if a <tt>NULL</tt> pointer is passed to an argument that lacks the <tt>nullok</tt> flag, the client ''will'' crash.
; zerocopy
: An option for array arguments. Enables a more efficient marshalling/demarshalling protocol for the array, which does not use extra memory copies. Can be used for no more than one <tt>in</tt> argument and no more than one <tt>out</tt> argument. [[#Zerocopy performance|Zerocopy performance]] discusses performance considerations when using this option.
; userbuf
: An option for <tt>zerocopy</tt>; only supported for <tt>arr</tt> arguments. Enables a special form of zero-copy support, discussed in [[#Zerocopy with a custom output buffer|Zerocopy with a custom output buffer]] and [[#Zerocopy with a custom input buffer|Zerocopy with a custom input buffer]].

Here is an example function prototype with the hints:

<pre>
int zoidfs_readlink(const zoidfs_handle_t * handle /* in:ptr */,
                    char * buffer /* out:arr:size=+1 */,
                    size_t buffer_length /* in:obj */);
</pre>

====Limitations====

As indicated earlier, the scanner is limited, so keep the prototypes simple.

The return type of a forwarded function must be scalar or <tt>void</tt>.

Structures with pointer fields inside of them cannot be forwarded.
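
Putting the start line, the hints, and the limitations together, a complete input header for a hypothetical <tt>example</tt> plug-in might look like the sketch below. Everything specific in it is an assumption made for illustration: the plug-in name, the identifier <tt>100</tt> (chosen only because it is not in the list of identifiers already in use), and the three functions. It has not been run through <tt>scanner.pl</tt>.

```c
/* example.h: input header for a hypothetical "example" plug-in. */

/* Anything the scanner should not see goes before the start line. */
#include <stddef.h>

/* START-ZOID-SCANNER ID=100 INIT=NULL FINI=NULL PROC=NULL */

/* Prototypes go last, one hint comment after each argument. */

/* Fills buffer with up to buffer_length bytes of the record for key. */
int example_get(int key /* in:obj */,
                char * buffer /* out:arr:size=+1 */,
                size_t buffer_length /* in:obj */);

/* Stores buffer_length bytes for key; the array uses zero-copy. */
int example_put(int key /* in:obj */,
                const char * buffer /* in:arr:size=+1:zerocopy */,
                size_t buffer_length /* in:obj */);

/* Attaches an optional label; NULL is allowed because of nullok. */
int example_label(const char * label /* in:str:nullok */);
```

The return types are all scalar, no structure with pointer fields is forwarded, and ordinary comments sit on their own lines, in keeping with the restrictions described above.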

====Generated files====

For every function prototype found, the scanner generates two output files: one for a client calling the function and one for the server, where the function is in fact executed. Code in the generated files performs marshalling and demarshalling of function arguments and results.

Two more files per plug-in are generated: ''header''<tt>_defs.h</tt> and ''header''<tt>_dispatch.c</tt>.

None of the generated files should be modified.

===Server-side API===

Server-side stubs and the server-side implementation need to be passed as modules when invoking the ZOID I/O daemon, as described [[#opt_modules|earlier]].

The hand-written server-side implementation code should include the <tt>zoid_api.h</tt> header file (available from <tt>packages/zoid/prebuilt/</tt>) and the plug-in input header file.

All the functions listed in the header file need to be defined in the server-side implementation code. The code needs to be compiled as a shared library; use the <tt>implementation/</tt> subdirectory of the <tt>unix</tt> plug-in as an example. Please note that since ZOID is multi-threaded, multiple functions can be invoked at the same time, so one must ensure that the implementation is multi-thread-safe.

====Start-line functions====

The following [[#Start line|start-line]] functions can be defined:

<pre>
void INIT(int pset_mpi_proc_count, int argc, int envc, const char* argenv);
</pre>

The INIT function is invoked during initialization, right before a job starts running. Arguments:

; pset_mpi_proc_count
: The number of job processes that will be handled by this I/O node. Note that I/O nodes also handle additional ZOID-enabled processes, such as the FUSE clients, which are not included in this number.
; argc
: The number of command-line arguments plus one.
; envc
: The number of environment variables.
; argenv
: An array of <tt>'\0'</tt>-terminated strings, one after another. The first string is the name of the job executable, followed by <tt>argc-1</tt> command-line arguments, followed by <tt>envc</tt> environment variables.
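
Because <tt>argenv</tt> is a packed sequence of strings rather than an array of pointers, it has to be walked string by string. The helper below is a minimal sketch of that walk; its name is hypothetical and it relies only on the layout described above:

```c
#include <stddef.h>
#include <string.h>

/* Returns the index-th string (0-based) of a packed block holding
   `count` consecutive '\0'-terminated strings, or NULL if the index
   is out of range. Hypothetical helper; not part of ZOID. */
const char *argenv_string(const char *argenv, int count, int index)
{
    const char *p = argenv;
    int i;

    if (index < 0 || index >= count)
        return NULL;
    for (i = 0; i < index; i++)
        p += strlen(p) + 1;     /* skip the string and its terminator */
    return p;
}
```

Inside an <tt>INIT</tt> function, <tt>count</tt> would be <tt>argc + envc</tt>: index <tt>0</tt> is the executable name, indices <tt>1</tt> to <tt>argc-1</tt> are the command-line arguments, and indices <tt>argc</tt> to <tt>argc+envc-1</tt> are the environment variables.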

<pre>
void FINI(void);
</pre>

The FINI function is invoked after the last process of the job has terminated.

<pre>
void PROC(int added, int pset_pid);
</pre>

The PROC function is invoked on the startup and termination of every application and ZOID-enabled process on the compute node. Arguments:

; added
: <tt>1</tt> if the process was started, <tt>0</tt> if it was terminated.
; pset_pid
: A process identifier (as returned by [[#Implementation functions|<tt>__zoid_calling_process_id</tt>]]).
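
As an illustration of what a <tt>PROC</tt> callback might do, the sketch below keeps a table of the process identifiers that are currently in use. The names <tt>example_proc</tt> and <tt>example_is_active</tt> and the table size are hypothetical; in particular, the upper bound on identifiers is an assumption, not a documented ZOID limit. C11 atomics are used because the daemon is multithreaded; whether the callback itself can run concurrently is not stated here.

```c
#include <stdatomic.h>

#define EXAMPLE_MAX_PIDS 1024   /* assumed bound; not a ZOID constant */

/* One flag per process identifier; static storage starts zeroed. */
static atomic_int example_active[EXAMPLE_MAX_PIDS];

/* PROC callback: records the startup or termination of a process. */
void example_proc(int added, int pset_pid)
{
    if (pset_pid < 0 || pset_pid >= EXAMPLE_MAX_PIDS)
        return;                 /* outside the assumed range: ignore */
    atomic_store(&example_active[pset_pid], added ? 1 : 0);
}

/* Returns 1 if the identifier belongs to a running process. */
int example_is_active(int pset_pid)
{
    if (pset_pid < 0 || pset_pid >= EXAMPLE_MAX_PIDS)
        return 0;
    return atomic_load(&example_active[pset_pid]);
}
```

Such a callback would be named on the start line as <tt>PROC=example_proc</tt>. Since identifiers can be reused, any per-process state indexed this way should be reset when a process terminates.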

====Implementation functions====

The hand-written server-side implementation functions can themselves call back a few ZOID functions, available by including the <tt>zoid_api.h</tt> header file:

<pre>
int __zoid_calling_process_id(void);
</pre>

This function returns a unique identifier of the compute node process that invoked the function. The identifier is ''not'' an MPI rank, because some processes, such as the FUSE clients, are not part of the application and hence do not have a rank. The identifiers are only unique within one I/O node, and they can be reused if a process starts after another one has terminated.

<pre>
void __zoid_register_userbuf(void* userbuf,
                             void (*callback)(void* userbuf, void* priv),
                             void* priv);
</pre>

This function will be discussed in [[#Zerocopy with a custom output buffer|Zerocopy with a custom output buffer]].

<pre>
int __zoid_send_output(int pid, int fd, const char* buffer, int len);
</pre>

This function writes an arbitrary string to the job's standard output or error. Arguments:

; pid
: Process identifier as returned by <tt>__zoid_calling_process_id</tt>. The process in question ''must'' have an MPI rank, meaning that it must be either an application process or a process launched from an application process.
; fd
: <tt>1</tt> for standard output, <tt>2</tt> for standard error.
; buffer, len
: The string and its length. <tt>'\0'</tt> should not be included in <tt>len</tt> and <tt>buffer</tt> does not need to be <tt>'\0'</tt>-terminated.

The function returns 0 if successful, and -1 if not (such as when the process identified by <tt>pid</tt> does not have an MPI rank).

===Client-side API===

A compute node application needs to be linked with the client-side stubs and with a common support library <tt>libzoid_cn.a</tt> (a prebuilt version of the latter is in <tt>packages/zoid/prebuilt</tt>; sources are in <tt>packages/zoid/src/cnl/client</tt>). Several functions are available to applications by including the <tt>zoid_api.h</tt> header file:

====Initialization====

<pre>
int __zoid_init(void);
</pre>

This function ''must'' be invoked before any ZOID or ZOID-forwarded functions can be invoked. It returns <tt>0</tt> if successful, <tt>1</tt> otherwise. There is no corresponding termination function.

<pre>
int __zoid_job_size(void);
int __zoid_my_rank(void);
</pre>

These functions return, respectively, the number of processes in the job (the size of <tt>MPI_COMM_WORLD</tt>), and the MPI rank of the current process. Either will return <tt>-1</tt> if the current process does not have an MPI rank, i.e., if it is not an application process and was not launched from an application process (say, if it was launched from an interactive shell).

====Error conditions====

<pre>
int __zoid_error(void);
</pre>

This function should be invoked on the client side after ''every'' forwarded function call returns, to determine if any errors occurred within the forwarding layer. A return value of <tt>0</tt> indicates success; otherwise, one of the following error values will be returned:

; ENOSYS
: Invalid command sent from the client. Typically indicates that the corresponding I/O-node-side [[#opt_modules|modules]] have not been loaded.
; ENOMEM
: Out of memory condition.
; E2BIG
: Message exceeded the internal size limit.

<pre>
int __zoid_excessive_size(void);
</pre>

If <tt>__zoid_error</tt> returned <tt>E2BIG</tt>, calling this function will provide an indication of by how many bytes the input or output was too large.

ZOID [[#opt_buffer_size|has a limit]] on the message size, around 4 MB by default. The limit is enforced on both input and output. The limit only applies to buffers "owned" by ZOID on the daemon side; it does not apply to custom [[#Zerocopy with a custom input buffer|input]] or [[#Zerocopy with a custom output buffer|output]] buffers.

If the limit is hit, the operation needs to be split into smaller ones. Information returned by <tt>__zoid_excessive_size</tt> makes it easy to adjust the buffer and resubmit.

'''Note:''' While the input-side (argument) overflow is flagged immediately on the client side, and is thus fairly cheap to hit, the output-side (result) overflow is flagged on the I/O node, after the request has been sent there (but before the implementation function is invoked). It is thus advised to cache at least the size limit for the output side for the next invocation, to avoid a future communication overhead. The size limit is function-specific, since it depends on sizes of other arguments and results.

Here is an example of how the client-side convenience wrapper for a call such as POSIX <tt>read</tt> could be implemented:

<pre>
ssize_t read(int fd, void *buf, size_t nbytes)
{
    static ssize_t max_read_nbytes = -1;
    ssize_t bytes_read;

    bytes_read = 0;
    do
    {
        ssize_t toread, justread;
        int error;

        toread = nbytes - bytes_read;

        if (max_read_nbytes != -1 && toread > max_read_nbytes)
            toread = max_read_nbytes;

        /* unix_read is the forwarded function call. */
        justread = unix_read(fd, buf + bytes_read, toread);

        if ((error = __zoid_error()))
        {
            if (error != E2BIG)
            {
                /* For a generic ZOID error, just bail out. */
                errno = error;
                return -1;
            }

            /* We tried to send a too large read request. Adjust. */
            max_read_nbytes = toread - __zoid_excessive_size();
        }
        else
        {
            if (justread < 0)
            {
                /* For a generic read() error, just bail out.
                   In case of an I/O error, unix_read returns -errno. */
                errno = -justread;
                return -1;
            }

            bytes_read += justread;

            if (justread != toread)
                /* unix_read as such succeeded, but it read fewer bytes than
                   expected. We terminate prematurely then. */
                break;
        }
    } while (bytes_read < nbytes);

    return bytes_read;
}
</pre>

===Additional considerations===

====Forwarding <tt>errno</tt>====

If one needs to pass a variable such as <tt>errno</tt> from the I/O node to the client, the most straightforward way is to add an extra integer <tt>out</tt> pointer argument to all functions and pass it that way. Another option is to do it the same way the UNIX kernel does: pass it as a negative return value from the functions. The <tt>unix</tt> plug-in does it that way, so, e.g., the implementation of <tt>close</tt> on the I/O node looks something like this:

<pre>
if (close(server_fd) == -1)
    return -errno;
else
    return 0;
</pre>

Then, on the client side, we have a convenience wrapper:

<pre>
int close(int fd)
{
    return unix_decode_result(unix_close(fd));
}
</pre>

<tt>unix_decode_result</tt> is a preprocessor macro that handles both ZOID errors and errors returned by the plug-in. It uses a number of GCC extensions to make it as transparent as possible:

<pre>
#define unix_decode_result(result)          \
({                                          \
    typeof (result) _result = (result);     \
    int _n;                                 \
    if ((_n = __zoid_error()) != 0)         \
    {                                       \
        errno = _n;                         \
        _result = -1;                       \
    }                                       \
    else if (_result < 0)                   \
    {                                       \
        errno = -_result;                   \
        _result = -1;                       \
    }                                       \
    _result;                                \
})
</pre>

====Returning variable amounts of data in arrays====

Just like with UNIX system calls, ZOID does not allocate memory for the results. Instead, callers must provide pre-allocated arrays, along with their sizes. UNIX would then typically return the size of the used part as a return value from a system call. Unfortunately, ZOID cannot make use of that – it will use the same array size argument to determine how much data to send back, so even if only a small part of the provided buffer is actually filled in, the whole buffer will be sent back, which is inefficient. This can be prevented by passing the array size as an <tt>inout</tt> pointer to a numerical type. A server-side implementation of a function such as <tt>read</tt> then looks like this:

<pre>
ssize_t unix_read(int fd /* in:obj */,
                  void *buf /* out:arr:size=+1 */,
                  size_t *count /* inout:ptr */)
{
    ssize_t ret;

    ...

    if ((ret = read(fd, buf, *count)) == -1)
    {
        *count = 0;
        return -errno;
    }
    else
    {
        *count = ret;
        return ret;
    }
}
</pre>

Obviously, the client side needs to be modified as well, to pass the size argument by address.

'''Note:''' this feature has certain implementation limitations. It can misbehave in the presence of multiple output arrays (or a single output <tt>arr2d</tt>, which internally behaves a lot like multiple separate <tt>arr</tt>s). Essentially, for efficiency reasons, the placement of arrays in the result buffer is determined before an implementation function is invoked. If this feature is used to change the size of one array, and that array is followed in the output buffer by another array, a "hole" will be created in the buffer, causing problems. However, in the most common case of a single output array the feature is completely reliable.

====Zerocopy performance====

Implementation-wise, ZOID is always zero-copy on the server side, meaning that data that implementation functions put in the <tt>out</tt> arrays is sent to the compute nodes without any extra memory copies.

The client side is only zero-copy for arrays that use the <tt>zerocopy</tt> flag in the header file. Because of the additional protocol overheads that <tt>zerocopy</tt> introduces, it should be used only for potentially large memory buffers, such as the buffers of file I/O <tt>read</tt> or <tt>write</tt> calls.

'''Note:''' for maximum performance, the arrays passed as <tt>zerocopy</tt> arguments on the compute nodes must be aligned in memory to a 16-byte boundary; otherwise, an interim buffer will be used. The memory allocation routines such as <tt>malloc</tt> have been modified to align memory to that boundary. If there is a danger that the user code might pass a large unaligned buffer, and the semantics will not be affected, it makes sense to write code that detects insufficient alignment and splits the operation in two: a small unaligned one (say, up to 240 bytes – the data payload of a single packet on the tree network), followed by a larger, properly aligned one.
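
The split described in the note can be computed along the following lines. This is only a sketch: the helper name <tt>unaligned_prefix_len</tt> and the <tt>ZEROCOPY_ALIGNMENT</tt> constant are hypothetical, and the approach assumes that the forwarded operation may legally be issued as two shorter calls.

```c
#include <stddef.h>
#include <stdint.h>

#define ZEROCOPY_ALIGNMENT 16u  /* boundary required for zero-copy transfers */

/*
 * Returns the number of leading bytes of buf that must be handled separately
 * so that the remainder starts on a 16-byte boundary.  The result is 0 for an
 * aligned buffer and never exceeds len.
 */
size_t unaligned_prefix_len(const void *buf, size_t len)
{
    size_t misalign = (size_t)((uintptr_t)buf % ZEROCOPY_ALIGNMENT);
    size_t prefix = misalign ? ZEROCOPY_ALIGNMENT - misalign : 0;

    return prefix < len ? prefix : len;
}
```

A wrapper would forward the first <tt>unaligned_prefix_len(buf, len)</tt> bytes in one call and, if anything is left, the aligned remainder in a second call.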

====Zerocopy with a custom output buffer====

Normally, memory for output arrays to be filled in by server-side implementation functions is allocated by the ZOID daemon. This might be inconvenient when the data to be filled arrives asynchronously, possibly before the implementation function is even invoked; in such situations, an interim memory buffer must be used, forcing an extra memory copy.

This can be avoided for zero-copy output <tt>arr</tt> types if the <tt>userbuf</tt> flag has been used. No space will then be preallocated by the daemon for the array (the server-side stub will pass a <tt>NULL</tt> pointer); instead, the implementation function must provide the daemon with its own buffer. It can do it by calling:

<pre>
void __zoid_register_userbuf(void* userbuf,
                             void (*callback)(void* userbuf, void* priv),
                             void* priv);
</pre>

Arguments:

; userbuf
: The address of the buffer.
; callback
: A callback function that is invoked by the daemon when the buffer has been sent to the client and is thus no longer needed. <tt>userbuf</tt> is passed as the first argument to the callback. It is safe for the callback to invoke <tt>__zoid_calling_process_id</tt>.
; priv
: Private data passed as the second argument to the <tt>callback</tt>. It is not interpreted by ZOID in any way.
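
As an illustration of the <tt>priv</tt> argument, the sketch below uses it to hand a statistics structure to the callback. The <tt>buffer_stats</tt> structure and the <tt>counting_buffer_cb</tt> name are hypothetical. An atomic counter is used on the assumption that callbacks may run concurrently in the multi-threaded daemon.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical bookkeeping shared by all buffers of a plug-in. */
struct buffer_stats
{
    atomic_size_t released;     /* number of buffers given back so far */
};

/* Matches the callback signature expected by __zoid_register_userbuf. */
static void counting_buffer_cb(void *userbuf, void *priv)
{
    struct buffer_stats *stats = priv;

    free(userbuf);                          /* the buffer has been sent */
    atomic_fetch_add(&stats->released, 1);
}
```

An implementation function would then register its buffer with <tt>__zoid_register_userbuf(buf, &counting_buffer_cb, &stats)</tt>, where <tt>stats</tt> outlives the call.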

The size of the provided buffer is determined like for any other array argument: the maximum value is provided by the client via the <tt>size</tt> argument. The server-side implementation part may choose to return less than the maximum amount, as explained [[#Returning variable amounts of data in arrays|earlier]].

As in other zero-copy cases, for maximum performance, the buffer provided must be aligned in memory to a 16-byte boundary; otherwise, an interim buffer will be used. The memory allocation routines such as <tt>malloc</tt> have been modified to align memory to that boundary, but we recommend explicitly calling <tt>posix_memalign</tt>.

'''Note:''' because the buffer provided is ''not'' allocated by ZOID, the message size restrictions discussed [[#size_restrictions|earlier]] do not apply to it. Please do not abuse this capability. There is a good reason for the message size limit: it is there so that the maximum amount of memory required by the ZOID daemon stays limited. Too many overly large user-allocated buffers might result in an out-of-memory condition on the I/O node.

We provide a simple example below. It is a little artificial in the sense that the buffer is allocated within the implementation function; as we indicated, this feature is most useful with buffers allocated outside of the implementation functions:

<pre>
static void buffer_cb(void* userbuf, void* priv)
{
    free(userbuf);
}

ssize_t unix_read(int fd /* in:obj */,
                  void *buf /* out:arr:size=+1:zerocopy:userbuf */,
                  size_t *count /* inout:ptr */)
{
    ssize_t ret;

    ...

    if (posix_memalign(&buf, 16, *count))
    {
        *count = 0;
        return -ENOMEM;
    }

    __zoid_register_userbuf(buf, &buffer_cb, NULL);

    if ((ret = read(fd, buf, *count)) == -1)
    {
        *count = 0;
        return -errno;
    }
    else
    {
        *count = ret;
        return ret;
    }
}
</pre>

====Zerocopy with a custom input buffer====

The <tt>userbuf</tt> flag discussed above can also be used for ''input'' zero-copy <tt>arr</tt> arguments. This could be useful to avoid extra memory copies if the data in the array will be needed after the implementation function has returned.

If the flag is used, the daemon will not allocate the memory for the array; instead, in the middle of receiving the request from the client, it will call an allocation routine from the server-side implementation code. The name of the allocation routine is the name of the function that uses the input <tt>userbuf</tt> argument, with <tt>_allocate_cb</tt> suffix attached to it. Its prototype needs to be as follows:

<pre>
void* <name>_allocate_cb(int len);
</pre>

The single argument passed by the daemon is the length of the array in bytes. The routine must return a pointer to a buffer of that size or <tt>NULL</tt> if that is not possible (in which case, the function will fail and <tt>__zoid_error</tt> on the client side will return <tt>ENOMEM</tt>).

There is a restriction on the type of the array: its base type must have a size of one byte, so the array should be of type <tt>char*</tt>, <tt>unsigned char*</tt>, <tt>void*</tt> (a GCC extension), etc.

The allocation routine is invoked in the same context as ordinary implementation functions. It may block if it so desires; this will block the compute node client that invoked the routine, but all other clients can keep communicating with the server, thanks to its multi-threaded architecture.

Once the allocation routine has returned and a complete request has been received by the daemon, the implementation function is invoked as usual, with a correct address of the input <tt>userbuf</tt> array. It is the responsibility of the plug-in implementer to release the memory occupied by that array when it is no longer needed.

As with other user-level callbacks, the allocation routine may call <tt>__zoid_calling_process_id</tt> to learn which client process sent the request. Also, as in other zero-copy cases, for maximum performance, the buffer provided must be aligned in memory to a 16-byte boundary; otherwise, an interim buffer will be used. The memory allocation routines such as <tt>malloc</tt> have been modified to align memory to that boundary, but we recommend explicitly calling <tt>posix_memalign</tt>. Finally, as with output <tt>userbuf</tt>, the message size restrictions discussed [[#size_restrictions|earlier]] do not apply to the user-allocated buffers. Please do not abuse this capability. There is a good reason for the message size limit: it is there so that the maximum amount of memory required by the ZOID daemon stays limited. Too many overly large user-allocated buffers might result in an out-of-memory condition on the I/O node.

Under rare circumstances, input <tt>userbuf</tt> could result in memory leaks. For this to take place, the job would have to be interrupted after the allocation routine has been run, but before the implementation function is called. This could only cause problems if I/O nodes are not rebooted between jobs. Those concerned about this scenario can eliminate the leak by adding necessary memory release code to the [[#Start-line functions|FINI]] function.
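
One possible way to do that is to record every buffer handed out by the allocation routine and to release whatever is still recorded when the job ends. The sketch below assumes a hypothetical <tt>demo</tt> plug-in; all names and the fixed table size are illustrative. Locking is omitted for brevity, so a real multi-threaded plug-in would have to protect the table with a mutex, and would preferably allocate with <tt>posix_memalign</tt> as recommended above.

```c
#include <stddef.h>
#include <stdlib.h>

#define MAX_PENDING 64

/* Buffers returned by the allocation routine and not yet released. */
static void *pending[MAX_PENDING];

static int pending_add(void *ptr)
{
    for (size_t i = 0; i < MAX_PENDING; i++)
        if (pending[i] == NULL)
        {
            pending[i] = ptr;
            return 1;
        }
    return 0;                           /* table is full */
}

/* Allocation routine for a hypothetical demo_write function. */
void *demo_write_allocate_cb(int len)
{
    void *ptr;

    if (len < 0 || (ptr = malloc((size_t)len)) == NULL)
        return NULL;
    if (!pending_add(ptr))
    {
        free(ptr);
        return NULL;
    }
    return ptr;
}

/* Called by the implementation function once it is done with the buffer. */
void demo_release(const void *buf)
{
    for (size_t i = 0; i < MAX_PENDING; i++)
        if (pending[i] == buf)
            pending[i] = NULL;
    free((void *)buf);
}

/* Frees buffers whose implementation function never ran; returns the count. */
int demo_release_leftovers(void)
{
    int released = 0;

    for (size_t i = 0; i < MAX_PENDING; i++)
        if (pending[i] != NULL)
        {
            free(pending[i]);
            pending[i] = NULL;
            released++;
        }
    return released;
}

/* Candidate FINI function for the start line. */
void demo_fini(void)
{
    (void)demo_release_leftovers();
}
```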

A simple example:

<pre>
void* unix_write_allocate_cb(int len)
{
    void* ptr;

    if (posix_memalign(&ptr, 16, len))
        return NULL;

    return ptr;
}

ssize_t unix_write(int fd /* in:obj */,
                   const void *buf /* in:arr:size=+1:zerocopy:userbuf */,
                   size_t count /* in:obj */)
{
    ssize_t ret;

    ...

    if ((ret = write(fd, buf, count)) == -1)
        ret = -errno;

    free((void*)buf);

    return ret;
}
</pre>

----
[[Ramdisk]] | [[ZeptoOS_Documentation|Top]] | [[(K)TAU]]

ZOID

2009-05-07T20:24:45Z

Iskra:

[[Ramdisk]] | [[ZeptoOS_Documentation|Top]] | [[(K)TAU]]
----

==Introduction==

ZOID is an I/O forwarding component of the ZeptoOS project. Any communication between the compute nodes and the I/O nodes (job management, file I/O, sockets) is handled by ZOID.

ZOID infrastructure consists of:
* A multithreaded <tt>zoid</tt> daemon on the I/O nodes which performs I/O forwarding for the compute nodes and which also communicates with the service node to perform job management,
* A <tt>control</tt> daemon on the compute nodes which is responsible for job management tasks such as the launching of application processes, for the forwarding of <tt>stdin</tt>/<tt>out</tt>/<tt>err</tt> data, and for the forwarding of IP packets,
* A <tt>zoid-fuse</tt> daemon on the compute nodes which performs file I/O forwarding for POSIX-compliant applications.

==User interface==

ZOID is meant to be transparent to users, but there are a few optional mechanisms available to interact with it.

===User script===

Right before a job starts running, and right after the last process of a job has terminated, the ZOID daemon attempts to invoke a ''user script'' on the I/O nodes. By default, the daemon invokes <tt>$HOME/zoid-user-script.sh</tt> (this pathname can be [[#opt_user_script|changed]] by an administrator). A single parameter is passed to the script: <tt>1</tt> at job startup, and <tt>0</tt> at termination.

Information about the job will be passed to the script in the following environment variables:
; <tt>ZOID_JOB_EXEC</tt>
: name of the job executable,
; <tt>ZOID_JOB_ARGS</tt>
: job arguments, separated by colons (<tt>:</tt>)
; <tt>ZOID_JOB_ENV</tt>
: job environment variables, separated by colons (<tt>:</tt>)
; <tt>ZOID_JOB_ID</tt>
: BG/P control system job id ('''Note:''' this is generally different from the Cobalt job ID; see [[FAQ#How to obtain a Cobalt job ID|FAQ]] for the latter),
; <tt>ZOID_JOB_GLOBAL_SIZE</tt>
: the number of processes in the job (the size of <tt>MPI_COMM_WORLD</tt>),
; <tt>ZOID_JOB_LOCAL_SIZE</tt>
: the number of job processes handled by this I/O node,
; <tt>ZOID_JOB_MODE</tt>
: <tt>0</tt> for SMP, <tt>1</tt> for VN, and <tt>2</tt> for DUAL,
; <tt>SHELL</tt>, <tt>PATH</tt>, <tt>USER</tt>, and <tt>HOME</tt>
: will also be set...

'''Note:''' the user script is invoked ''synchronously'' by the daemon, i.e., the job will not start running until the script terminates. If one needs some processes to run on the I/O nodes while the job is running, they should be started in the background (&).

'''Note 2:''' for this feature to work, [[#The /bin.rd/update_passwd_file.sh file|update_passwd_file.sh]] must be working correctly.

===File broadcast===
A <tt>/bin.rd/f2cn</tt> command is available on the I/O nodes for a very efficient (hardware-assisted) broadcasting of files to all the compute nodes handled by the given I/O node.

The command takes two arguments:
* absolute pathname to the input file on the I/O node,
* absolute pathname to the output file on the compute nodes.

The input file does not need to be physically on the I/O node; it can be on a network filesystem mounted on the node. The file will be created in the ramdisk of each compute node.

The throughput is in practice limited by how fast the input file can be read; we have seen results in excess of 300 MB/s for files residing in the I/O node ramdisk.

'''Note:''' all the compute nodes in the pset must be up and running. Do not use this command on ''incomplete'' partitions (e.g., a one-process job on a 64-node partition); this will likely hang the ZOID daemon.

'''Note 2:''' this feature can safely be used from within a [[#User script|user script]], so one can, e.g., pre-stage large binaries, like this:

User script (<tt>$HOME/zoid-user-script.sh</tt>):
<pre>
#!/bin/sh

if [ "$1" -eq "1" ]; then
    /bin.rd/f2cn $HOME/large_binary /tmp/large_binary
fi
exit 0
</pre>

Job script (submitted using Cobalt or mpirun):
<pre>
#!/bin/sh

chmod 755 /tmp/large_binary
/tmp/large_binary
</pre>

===Performance counters===

A <tt>/bin.rd/statquery</tt> command is available on the I/O nodes for obtaining the performance counters of the I/O daemon.

The command takes a single optional argument:
* the interval between successive queries, in seconds.

If the argument is not provided, the command will terminate after the first query.

Here is a sample output generated:

<pre>
Timestamp: 1240439085.688831
Total messages sent: 5767
Total bytes sent: 7619170
Total messages received: 5717
Total bytes received: 72575
IP fwd messages sent: 196
IP fwd bytes sent: 5889
IP fwd messages received: 84
IP fwd bytes received: 6453
Stream messages sent: 65
Stream bytes sent: 520
Stream messages received: 65
Stream bytes received: 1416
Broadcast messages sent: 1
Broadcast bytes sent: 2437906
Internal messages sent: 193
Internal bytes sent: 39524
Internal messages received: 256
Internal bytes received: 1792
Plugin 5 messages sent: 0
Plugin 5 bytes sent: 0
Plugin 5 messages received: 0
Plugin 5 bytes received: 0
Plugin 2 messages sent: 5312
Plugin 2 bytes sent: 5135331
Plugin 2 messages received: 5312
Plugin 2 bytes received: 62914
</pre>

The meaning of the fields is as follows:
; Timestamp
: number of seconds and microseconds from the epoch, as returned by <tt>gettimeofday(2)</tt>,
; IP fwd
: IP packet forwarding between compute nodes and I/O nodes,
; Stream
: <tt>stdin</tt>/<tt>out</tt>/<tt>err</tt> streams,
; Broadcast
: [[#File broadcast|file broadcasts]],
; Internal
: job control messages, etc.
; Plugin 5
: internal <tt>mapping</tt> plug-in, used by MPI,
; Plugin 2
: <tt>unix</tt> plugin (POSIX file I/O).

The counters are 64-bit integers, so they will take a while to overflow :-).
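
For post-processing the output in a program, a line can be split into its name and value as sketched below. The function name <tt>parse_counter_line</tt> is hypothetical, and the code assumes that every counter line has the <tt>name: value</tt> form shown in the sample above. The <tt>Timestamp</tt> line carries a fractional part and is deliberately rejected, so it needs separate handling.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Splits "name: value" into its parts.  Returns 0 on success and -1 if the
 * line does not end in an unsigned integer or if the name does not fit.
 */
int parse_counter_line(const char *line, char *name, size_t name_size,
                       uint64_t *value)
{
    const char *colon = strrchr(line, ':');
    const char *p;
    char *end;
    unsigned long long v;
    size_t name_len;

    if (colon == NULL)
        return -1;
    name_len = (size_t)(colon - line);
    if (name_len == 0 || name_len >= name_size)
        return -1;

    p = colon + 1;
    while (*p == ' ' || *p == '\t')
        p++;
    if (*p < '0' || *p > '9')
        return -1;

    errno = 0;
    v = strtoull(p, &end, 10);          /* counters are 64-bit integers */
    if (errno == ERANGE)
        return -1;
    while (*end == ' ' || *end == '\t' || *end == '\r' || *end == '\n')
        end++;
    if (*end != '\0')
        return -1;                      /* trailing garbage, e.g. ".688831" */

    memcpy(name, line, name_len);
    name[name_len] = '\0';
    *value = (uint64_t)v;
    return 0;
}
```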

Example user script (<tt>$HOME/zoid-user-script.sh</tt>) that samples the statistics every 60 seconds and writes them to a unique file:
<pre>
#!/bin/sh

if [ "$1" -eq "1" ]; then
    /bin.rd/statquery 60 >$HOME/zoid_stats.$ZOID_JOB_ID.`hostname` &
fi
exit 0
</pre>

==Administrator interface==

===Configuration file===

The <tt>zoid</tt> I/O daemon accepts a number of command-line options that can be used to change its behavior. They can be adjusted by editing the <tt>ramdisk/ION/ramdisk-add/etc/sysconfig/zoid</tt> file and rebuilding the I/O node ramdisk:

; ZOID_BUFFER_SIZE (-b)
: Specifies the size of the buffers used for messages. Because a separate buffer is needed for a request and a reply, and typically no more than one of these needs to be large, to save memory ZOID supports buffers of two sizes: a small one (4 KB by default) and a large one (4 MB+1 KB by default – the 1 KB is there to accommodate the headers). Use a colon (<tt>:</tt>) to separate the two sizes when customizing this value. If desired, support for the second buffer size can be disabled by providing only one value to this option.
; ZOID_ACK_THRESHOLD (-a)
: Specifies a size threshold for the rendezvous protocol for messages coming from the compute nodes, in units of tree network packets (240 bytes of data each). An eager protocol is used for messages below the threshold. Messages above the threshold use flow control in the form of a rendezvous protocol with message acknowledgements; basically, the daemon will only receive one large message at a time, which improves predictability and overall throughput. The daemon default for this option is to not use the acknowledgements, but the config file defaults to a value of <tt>8</tt>, which is the size of the hardware FIFO buffer of the tree network device. Set this option to 0 (or comment it out altogether) to disable message acknowledgements.
; ZOID_MODULES (-m)
: Specifies a <tt>:</tt>-separated list of ZOID plug-ins to load. This defaults to <tt>"unix_impl.so:unix_preload.so:mapping_impl.so:mapping_preload.so"</tt> in the config file; do not remove any of these or basic system services will stop working. The <tt>unix</tt> plug-in provides POSIX file I/O support, while <tt>mapping</tt> is used by our MPI implementation to map between MPI ranks and Blue Gene X/Y/Z/T coordinates. Custom plug-ins can be created and added here; see [[#Programmer interface|Programmer interface]] for details.
; ZOID_ENABLE_NAT (-n)
: Enables network address translation (NAT) for IP packets coming from the compute nodes, allowing compute nodes to communicate with the outside world. This support is disabled by default because it was found to have a detrimental effect on the overall performance of the TCP/IP stack on the I/O nodes, slowing down network filesystems. This feature can also be enabled on a per-job basis by setting the <tt>ZOID_ENABLE_NAT</tt> environment variable when submitting a job (see the [[FAQ#How to open a socket from a CN to the outside world|FAQ]]).
; ZOID_USER_SCRIPT (-u)
: Specifies the pathname to the [[#User script|user script]]; it defaults to <tt>"/bin.rd/zoid-user-script.sh"</tt>. This script can be found in <tt>ramdisk/ION/ramdisk-add/bin/zoid-user-script.sh</tt>; it sets a few environment variables and then invokes user's custom <tt>$HOME/zoid-user-script.sh</tt>. Hence, to adjust the behavior of this feature, either change this option or the script in the ramdisk. '''Note:''' for this feature to work, [[#The /bin.rd/update_passwd_file.sh file|update_passwd_file.sh]] must be working correctly.
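
To put the <tt>ZOID_BUFFER_SIZE</tt> and <tt>ZOID_ACK_THRESHOLD</tt> defaults into bytes, the arithmetic can be written out as below. The macro and function names are hypothetical. Whether protocol headers count toward the acknowledgement threshold is not stated here, so the helpers deal with payload bytes only.

```c
#include <stddef.h>

#define TREE_PACKET_PAYLOAD 240u    /* bytes of data per tree network packet */

/* Default buffer sizes quoted above, in bytes. */
#define SMALL_BUFFER_BYTES (4u * 1024u)
#define LARGE_BUFFER_BYTES (4u * 1024u * 1024u + 1024u)

/* Payload bytes that fit into the given number of packets. */
size_t packets_to_bytes(size_t packets)
{
    return packets * TREE_PACKET_PAYLOAD;
}

/* Number of packets needed to carry the given payload (rounded up). */
size_t bytes_to_packets(size_t bytes)
{
    return (bytes + TREE_PACKET_PAYLOAD - 1) / TREE_PACKET_PAYLOAD;
}
```

With the config file default of <tt>8</tt>, the threshold corresponds to 8 × 240 = 1920 bytes of payload, and the default large buffer is 4195328 bytes.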

===The /bin.rd/update_passwd_file.sh file===

Allowing the partition owner to log into the I/O node using SSH is one of the features of the ZeptoOS software stack. Only the administrator and the partition owner are given login access; this is controlled by the <tt>/bin.rd/update_passwd_file.sh</tt> script, which is invoked by the daemon while the partition is being initialized. The script can be found in <tt>ramdisk/ION/ramdisk-add/bin/update_passwd_file.sh</tt>.

The script makes a number of assumptions that could be site-specific, so it might require an adjustment. The daemon invokes the script, passing the numerical UNIX user ID of the partition owner as the only argument. The script then scans the <tt>/bgsys/iofs/etc/passwd</tt> file for an entry with the same user ID (on Argonne machines, this file contains all valid account names). If a matching entry is found, it is appended to the <tt>/etc/passwd</tt> file in the I/O node ramdisk, thus enabling login access to the node for that user.

If allowing ordinary users access to the I/O nodes is undesirable, one can simply put <tt>exit 0</tt> at the top of the script to disable it.

===The /bin.rd/nat file===

If NAT has been [[#opt_enable_nat|requested]], the daemon invokes the <tt>/bin.rd/nat</tt> script to enable it. This script can be found in <tt>ramdisk/ION/ramdisk-add/bin/nat</tt>. Generally, it should not require any modifications.

==Programmer interface==

ZOID is a flexible, extensible, high-performance function call forwarding (RPC) infrastructure. Built-in features and the standard plug-ins provide familiar POSIX file I/O and BSD socket interfaces, but, because of the number of software layers involved, they introduce a significant overhead. For applications requiring maximum bandwidth between the compute and I/O nodes, ZOID provides an option of a customized function call forwarding with minimal overheads. This section provides an overview of how to create such custom plug-ins.

===Overview===

All that ZOID provides is a function call forwarding support, and a limited one at that. Any logic (caching, prefetching, etc.) needs to be custom-built on top of it.

Follow existing plug-ins, found in <tt>packages/zoid/src/</tt>, as examples. The <tt>unix</tt> plug-in is generally the most up to date, but other plug-ins such as <tt>mapping</tt>, <tt>zoidfs</tt>, <tt>barrier</tt>, and <tt>test</tt> should also be fine.

A plug-in consists of automatically generated client-side and server-side stubs (which perform the marshalling and demarshalling of function call parameters and results, the forwarding of the function call, etc.), and of a hand-written server-side implementation which provides the implementation code for the forwarded function calls. One might also decide to provide hand-written client-side wrappers to hide some details of the ZOID API (such as the error handling) or to adhere to a particular existing API, as is the case with the <tt>unix</tt> plug-in (the wrappers used by the FUSE client are available in <tt>packages/zoid/src/unix/stubs/</tt>; another version is in the GNU libc sources, in <tt>packages/glibc/src/zoid/sysdeps/unix/sysv/linux/powerpc/powerpc32/</tt>).

The <tt>scanner.pl</tt> script, found in <tt>packages/zoid/src/</tt>, creates the automatically-generated client and server stubs based on a hand-written input header file described below. Again, please follow the examples from the existing plug-ins, such as <tt>unix</tt> or <tt>mapping</tt>. The <tt>Makefile</tt> in those plug-ins is written in a generic fashion and should only require a change to the <tt>PREFIX</tt> line to be usable with another plug-in. Use that <tt>Makefile</tt> to invoke the <tt>scanner.pl</tt> script and to compile the generated source files.

===Input header file===

The input header file must be a valid C header file with additional hints in the comments. The file is read by the <tt>scanner.pl</tt> script.

The parser in the script is rather limited and does not handle many C constructs. It is thus essential that the header file be as simple as possible. In particular, function prototypes should be specified at the end of the file, not intermixed with any other specifications such as data type definitions.

Ordinary comments are best placed on separate lines.

'''Note:''' the parser is case ''sensitive''.

====Start line====

Any complex declarations that the scanner cannot parse should be placed at the top of the file, because the parser ignores everything until it encounters the following magic start line:

<pre>
/* START-ZOID-SCANNER ID=<n> INIT=<s1> FINI=<s2> PROC=<s3> */
</pre>

; ID=<n>
: Each plug-in needs a unique, 16-bit identifier, passed in <tt><n></tt>. The following identifiers are already in use: <tt>0</tt> (internal), <tt>1</tt> (<tt>zoidfs</tt> plug-in), <tt>2</tt> (<tt>unix</tt>), <tt>3</tt> (<tt>lofar</tt>), <tt>4</tt> (<tt>test</tt>), <tt>5</tt> (<tt>mapping</tt>), and <tt>10</tt> (<tt>ftb</tt>).
; INIT=<s1>
: <tt><s1></tt> provides the name of an initialization function which will be invoked before a job starts running; see [[#Start-line functions|Start-line functions]] for more information. If a plug-in does not need this feature, please specify <tt>INIT=NULL</tt>.
; FINI=<s2>
: <tt><s2></tt> provides the name of a termination function which will be invoked after all job processes have exited; see [[#Start-line functions|Start-line functions]] for more information. If a plug-in does not need this feature, please specify <tt>FINI=NULL</tt>.
; PROC=<s3>
: <tt><s3></tt> provides the name of a callback function which will be invoked on the startup and termination of every application and ZOID-enabled process; see [[#Start-line functions|Start-line functions]] for more information. If a plug-in does not need this feature, please specify <tt>PROC=NULL</tt>.

====Argument hints====

Hints are generally needed by the scanner to correctly encode and decode function arguments. They need to be placed after each argument, before a separating comma (or a closing bracket), and should be embedded inside dedicated C comments. Multiple hints per argument are usually provided; these are separated by a colon (<tt>:</tt>). The following hints are currently defined:

; in, out, inout
: Specifies whether the argument is an input argument, an output argument, or both. <tt>in</tt> is the default.
; obj, str, ptr, arr, arr2d
: Specifies the type of the argument, respectively a plain object (say, an <tt>int</tt>, or a structure passed by value), a <tt>'\0'</tt>-terminated character string, a pointer to a plain object, an array of objects, or a two-dimensional array (<tt>type**</tt>, not <tt>type[][]</tt>). <tt>obj</tt> is the default.
; size
: Required for array arguments (<tt>arr</tt> and <tt>arr2d</tt>). Indicates the index of another argument in the same function, which is used to pass the array size. Absolute numbers are accepted (<tt>1</tt> to ''number of arguments'') or relative ones (<tt>+1</tt> for the next argument, <tt>-1</tt> for the previous argument, etc). For <tt>arr</tt> arguments, the size argument must be of a numerical type, or a pointer to such a type. For <tt>arr2d</tt> arguments, the size argument must itself be an array (an <tt>arr</tt> argument) of numerical elements, specifying the sizes along the less significant dimension of the array (the size of the more significant dimension is the size of the <tt>arr</tt> array itself). Please note that sizes are expressed in units of the base array type (thus, <tt>sizeof(int)</tt> for an array of <tt>int</tt>s), not in bytes; to have sizes counted in bytes, declare the array argument as <tt>char*</tt> or <tt>void*</tt> (a GCC extension).
; nullok
: An option for arguments passed by pointer (basically, all but <tt>obj</tt>). If provided, it indicates that the argument is allowed to be <tt>NULL</tt>. This is not the default because supporting <tt>NULL</tt> pointers results in an additional computational and protocol overhead. '''Note:''' if a <tt>NULL</tt> pointer is passed to an argument that lacks the <tt>nullok</tt> flag, the client ''will'' crash.
; zerocopy
: An option for array arguments. Enables a more efficient marshalling/demarshalling protocol for the array, which does not use extra memory copies. Can be used for no more than one <tt>in</tt> argument and no more than one <tt>out</tt> argument. [[#Zerocopy performance|Zerocopy performance]] discusses performance considerations when using this option.
; userbuf
: An option for <tt>zerocopy</tt>; only supported for <tt>arr</tt> arguments. Enables a special form of zero-copy support, discussed in [[#Zerocopy with a custom output buffer|Zerocopy with a custom output buffer]] and [[#Zerocopy with a custom input buffer|Zerocopy with a custom input buffer]].

Here is an example function prototype with the hints:

<pre>
int zoidfs_readlink(const zoidfs_handle_t * handle /* in:ptr */,
char * buffer /* out:arr:size=+1 */,
size_t buffer_length /* in:obj */);
</pre>
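
Putting the start line and the hints together, a complete input header for a hypothetical <tt>demo</tt> plug-in might look as follows. The plug-in, its identifier (<tt>100</tt>, chosen only because it is not in the list of identifiers in use), the type, and both functions are made up for illustration, and the file has not been run through <tt>scanner.pl</tt>.

```c
/* demo.h: input header of a hypothetical "demo" plug-in. */
#include <stddef.h>
#include <stdint.h>

/* Declarations that the scanner should skip go above the start line. */
typedef struct
{
    uint64_t id;
} demo_handle_t;

/* START-ZOID-SCANNER ID=100 INIT=NULL FINI=NULL PROC=NULL */

/* Fills in at most *count bytes; count is inout so that less can be sent back. */
int demo_fetch(const demo_handle_t *handle /* in:ptr */,
               char *data /* out:arr:size=+1:zerocopy */,
               size_t *count /* inout:ptr */);

/* rows is a char** with row_count rows; row_lens gives the length of each row. */
int demo_store(const char **rows /* in:arr2d:size=+1 */,
               const size_t *row_lens /* in:arr:size=+1 */,
               size_t row_count /* in:obj */);
```

Prototypes are kept at the end of the file and ordinary comments sit on their own lines, in line with the parser restrictions described above.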

====Limitations====

As indicated earlier, the scanner is limited, so keep the prototypes simple.

Return type of a forwarded function must be scalar or <tt>void</tt>.

Structures with pointer fields inside of them cannot be forwarded.

====Generated files====

For every function prototype found, the scanner generates two output files: one for a client calling the function and one for the server, where the function is in fact executed. Code in the generated files performs marshalling and demarshalling of function arguments and results.

Two more files per plug-in are generated: ''header''<tt>_defs.h</tt> and ''header''<tt>_dispatch.c</tt>.

None of the generated files should be modified.

===Server-side API===

Server-side stubs and the server-side implementation need to be passed as modules when invoking the ZOID I/O daemon, as described [[#opt_modules|earlier]].

The hand-written server-side implementation code should include the <tt>zoid_api.h</tt> header file (available from <tt>packages/zoid/prebuilt/</tt>) and the plug-in input header file.

All the functions listed in the header file need to be defined in the server-side implementation code. The code needs to be compiled as a shared library; use the <tt>implementation/</tt> subdirectory of the <tt>unix</tt> plug-in as an example. Please note that since ZOID is multi-threaded, multiple functions can be invoked at the same time, so one must ensure that the implementation is multi-thread-safe.

====Start-line functions====

The following [[#Start line|start-line]] functions can be defined:

<pre>
void INIT(int pset_mpi_proc_count, int argc, int envc, const char* argenv);
</pre>

The INIT function is invoked during initialization, right before a job starts running. Arguments:

; pset_mpi_proc_count
: The number of job processes that will be handled by this I/O node. Note that I/O nodes also handle additional ZOID-enabled processes, such as the FUSE clients, which are not included in this number.
; argc
: The number of command-line arguments plus one.
; envc
: The number of environment variables.
; argenv
: An array of <tt>'\0'</tt>-terminated strings, one after another. The first string is the name of the job executable, followed by <tt>argc-1</tt> command-line arguments, followed by <tt>envc</tt> environment variables.
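
Because <tt>argenv</tt> is a single block of back-to-back strings rather than an array of pointers, it has to be walked with <tt>strlen</tt>. A minimal sketch, with the hypothetical helper name <tt>argenv_entry</tt>:

```c
#include <stddef.h>
#include <string.h>

/*
 * Returns string number index (0-based) from the argenv block, or NULL if
 * index is out of range.  Index 0 is the executable name, indices
 * 1 .. argc-1 are the command-line arguments, and indices
 * argc .. argc+envc-1 are the environment variables.
 */
const char *argenv_entry(const char *argenv, int argc, int envc, int index)
{
    if (index < 0 || index >= argc + envc)
        return NULL;

    while (index-- > 0)
        argenv += strlen(argenv) + 1;   /* skip the string and its '\0' */

    return argenv;
}
```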

<pre>
void FINI(void);
</pre>

The FINI function is invoked after the last process of the job has terminated.

<pre>
void PROC(int added, int pset_pid);
</pre>

The PROC function is invoked on the startup and termination of every application and ZOID-enabled process on the compute node. Arguments:

; added
: <tt>1</tt> if the process was started, <tt>0</tt> if it was terminated.
; pset_pid
: A process identifier (as returned by [[#Implementation functions|<tt>__zoid_calling_process_id</tt>]]).
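
As a small illustration, a PROC function could keep track of how many client processes are currently alive. The names <tt>demo_proc</tt> and <tt>demo_live_processes</tt> are hypothetical. The counter is atomic on the assumption that the callback may be invoked from several daemon threads.

```c
#include <stdatomic.h>

/* Number of compute node processes currently known to this plug-in. */
static atomic_int live_processes;

/* Candidate PROC function: named as PROC=demo_proc in the start line. */
void demo_proc(int added, int pset_pid)
{
    (void)pset_pid;             /* a real plug-in could key per-process state on it */
    atomic_fetch_add(&live_processes, added ? 1 : -1);
}

int demo_live_processes(void)
{
    return atomic_load(&live_processes);
}
```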

====Implementation functions====

The hand-written server-side implementation functions can themselves call back a few ZOID functions, available by including the <tt>zoid_api.h</tt> header file:

<pre>
int __zoid_calling_process_id(void);
</pre>

This function returns a unique identifier of the compute node process that invoked the function. The identifier is ''not'' an MPI rank, because some processes, such as the FUSE clients, are not part of the application and hence do not have a rank. The identifiers are only unique within one I/O node, and they can be reused if a process starts after another one has terminated.

<pre>
void __zoid_register_userbuf(void* userbuf,
                             void (*callback)(void* userbuf, void* priv),
                             void* priv);
</pre>

This function will be discussed in [[#Zerocopy with a custom output buffer|Zerocopy with a custom output buffer]].

<pre>
int __zoid_send_output(int pid, int fd, const char* buffer, int len);
</pre>

This function writes an arbitrary string to the job's standard output or error. Arguments:

; pid
: Process identifier as returned by <tt>__zoid_calling_process_id</tt>. The process in question ''must'' have an MPI rank, meaning that it must be either an application process or a process launched from an application process.
; fd
: <tt>1</tt> for standard output, <tt>2</tt> for standard error.
; buffer, len
: The string and its length. <tt>'\0'</tt> should not be included in <tt>len</tt> and <tt>buffer</tt> does not need to be <tt>'\0'</tt>-terminated.

The function returns 0 if successful, and -1 if not (such as when the process identified by <tt>pid</tt> does not have an MPI rank).
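
A <tt>printf</tt>-style convenience helper on top of this function could look roughly like the sketch below. The names <tt>send_formatted</tt>, <tt>send_output_fn</tt>, and <tt>capture_output</tt> are hypothetical. The send function is passed in as a pointer only so that the snippet can be exercised outside of the daemon; inside a plug-in, <tt>__zoid_send_output</tt> would be passed.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Same signature as __zoid_send_output. */
typedef int (*send_output_fn)(int pid, int fd, const char *buffer, int len);

/* Formats a message of up to 255 characters and sends it without the '\0'. */
int send_formatted(send_output_fn send, int pid, int fd, const char *fmt, ...)
{
    char text[256];
    va_list ap;
    int len;

    va_start(ap, fmt);
    len = vsnprintf(text, sizeof text, fmt, ap);
    va_end(ap);

    if (len < 0)
        return -1;
    if ((size_t)len >= sizeof text)
        len = (int)sizeof text - 1;     /* the message was truncated */

    return send(pid, fd, text, len);
}

/* Stand-in for __zoid_send_output, for illustration only. */
static char captured[256];
static int captured_len;

static int capture_output(int pid, int fd, const char *buffer, int len)
{
    (void)pid;
    if ((fd != 1 && fd != 2) || len < 0 || (size_t)len >= sizeof captured)
        return -1;
    memcpy(captured, buffer, (size_t)len);
    captured[len] = '\0';
    captured_len = len;
    return 0;
}
```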

===Client-side API===

A compute node application needs to be linked with the client-side stubs and with a common support library <tt>libzoid_cn.a</tt> (a prebuilt version of the latter is in <tt>packages/zoid/prebuilt</tt>; sources are in <tt>packages/zoid/src/cnl/client</tt>). Several functions are available to applications by including the <tt>zoid_api.h</tt> header file:

====Initialization====

<pre>
int __zoid_init(void);
</pre>

This function ''must'' be invoked before any ZOID or ZOID-forwarded functions can be invoked. It returns <tt>0</tt> if successful, <tt>1</tt> otherwise. There is no corresponding termination function.

<pre>
int __zoid_job_size(void);
int __zoid_my_rank(void);
</pre>

These functions return, respectively, the number of processes in the job (the size of <tt>MPI_COMM_WORLD</tt>), and the MPI rank of the current process. Either will return <tt>-1</tt> if the current process does not have an MPI rank, i.e., if it is not an application process and was not launched from an application process (say, if it was launched from an interactive shell).

====Error conditions====

<pre>
int __zoid_error(void);
</pre>

This function should be invoked on the client side after ''every'' forwarded function call returns, to determine if any errors occurred within the forwarding layer. A return value of <tt>0</tt> indicates success; otherwise, one of the following error values will be returned:

; ENOSYS
: Invalid command sent from the client. Typically indicates that the corresponding I/O-node-side [[#opt_modules|modules]] have not been loaded.
; ENOMEM
: Out of memory condition.
; E2BIG
: Message exceeded the internal size limit.

<pre>
int __zoid_excessive_size(void);
</pre>

If <tt>__zoid_error</tt> returned <tt>E2BIG</tt>, calling this function will provide an indication of by how many bytes the input or output was too large.

ZOID [[#opt_buffer_size|has a limit]] on the message size, around 4 MB by default. The limit is enforced on both input and output. The limit only applies to buffers "owned" by ZOID on the daemon side; it does not apply to custom [[#Zerocopy with a custom input buffer|input]] or [[#Zerocopy with a custom output buffer|output]] buffers.

If the limit is hit, the operation needs to be split into smaller ones. Information returned by <tt>__zoid_excessive_size</tt> makes it easy to adjust the buffer and resubmit.

'''Note:''' While the input-side (argument) overflow is flagged immediately on the client side, and is thus fairly cheap to hit, the output-side (result) overflow is flagged on the I/O node, after the request has been sent there (but before the implementation function is invoked). It is thus advised to cache at least the size limit for the output side for the next invocation, to avoid a future communication overhead. The size limit is function-specific, since it depends on sizes of other arguments and results.

Here is an example of how the client-side convenience wrapper for a call such as POSIX <tt>read</tt> could be implemented:

<pre>
ssize_t read(int fd, void *buf, size_t nbytes)
{
    static ssize_t max_read_nbytes = -1;
    ssize_t bytes_read;

    bytes_read = 0;
    do
    {
        ssize_t toread, justread;
        int error;

        toread = nbytes - bytes_read;

        if (max_read_nbytes != -1 && toread > max_read_nbytes)
            toread = max_read_nbytes;

        /* unix_read is the forwarded function call. The cast avoids
           relying on GCC's void-pointer arithmetic extension. */
        justread = unix_read(fd, (char *)buf + bytes_read, toread);

        if ((error = __zoid_error()))
        {
            if (error != E2BIG)
            {
                /* For a generic ZOID error, just bail out. */
                errno = error;
                return -1;
            }

            /* We tried to send a too large read request. Adjust. */
            max_read_nbytes = toread - __zoid_excessive_size();
        }
        else
        {
            if (justread < 0)
            {
                /* For a generic read() error, just bail out.
                   In case of an I/O error, unix_read returns -errno. */
                errno = -justread;
                return -1;
            }

            bytes_read += justread;

            if (justread != toread)
                /* unix_read as such succeeded, but it read fewer bytes than
                   expected. We terminate prematurely then. */
                break;
        }
    } while ((size_t)bytes_read < nbytes);

    return bytes_read;
}
</pre>

===Additional considerations===

====Forwarding <tt>errno</tt>====

If one needs to pass a variable such as <tt>errno</tt> from the I/O node to the client, the most straightforward way is to add an extra integer <tt>out</tt> pointer argument to all functions and pass it that way. Another option is to do it the same way the UNIX kernel does: pass it as a negative return value from the functions. The <tt>unix</tt> plug-in does it that way, so, e.g., the implementation of <tt>close</tt> on the I/O node looks something like this:

<pre>
if (close(server_fd) == -1)
    return -errno;
else
    return 0;
</pre>

Then, on the client side, we have a convenience wrapper:

<pre>
int close(int fd)
{
    return unix_decode_result(unix_close(fd));
}
</pre>

<tt>unix_decode_result</tt> is a preprocessor macro that handles both ZOID errors and errors returned by the plug-in. It uses a number of GCC extensions to make it as transparent as possible:

<pre>
#define unix_decode_result(result) \
({ \
    typeof (result) _result = (result); \
    int _n; \
    if ((_n = __zoid_error()) != 0) \
    { \
        errno = _n; \
        _result = -1; \
    } \
    else if (_result < 0) \
    { \
        errno = -_result; \
        _result = -1; \
    } \
    _result; \
})
</pre>

====Returning variable amounts of data in arrays====

Just like with UNIX system calls, ZOID does not allocate memory for the results. Instead, callers must provide pre-allocated arrays, along with their sizes. UNIX would then typically return the size of the used part as a return value from a system call. Unfortunately, ZOID cannot make use of that – it will use the same array size argument to determine how much data to send back, so even if only a small part of the provided buffer is actually filled in, the whole buffer will be sent back, which is inefficient. This can be prevented by passing the array size as an <tt>inout</tt> pointer to a numerical type. A server-side implementation of a function such as <tt>read</tt> then looks like this:

<pre>
ssize_t unix_read(int fd /* in:obj */,
                  void *buf /* out:arr:size=+1 */,
                  size_t *count /* inout:ptr */)
{
    ssize_t ret;

    ...

    if ((ret = read(fd, buf, *count)) == -1)
    {
        *count = 0;
        return -errno;
    }
    else
    {
        *count = ret;
        return ret;
    }
}
</pre>

Obviously, the client side needs to be modified as well, to pass the size argument by address.
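
A minimal sketch of what such a client-side wrapper could look like. The wrapper name <tt>sample_read</tt> is hypothetical, and the two stub functions only stand in for <tt>zoid_api.h</tt> and the generated client-side stub so that the fragment compiles on its own; they mimic a server that has at most five bytes available.

```c
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/* Stand-ins so that the sketch compiles without ZOID. A real client would
   get these from zoid_api.h and from the generated client-side stubs. */
static int __zoid_error(void) { return 0; }

static ssize_t unix_read(int fd, void *buf, size_t *count)
{
    if (fd < 0)
    {
        *count = 0;
        return -EBADF;
    }
    if (*count > 5)
        *count = 5;
    memcpy(buf, "hello", *count);
    return (ssize_t)*count;
}

/* Hypothetical wrapper: the size is now passed by address, so that the
   server can shrink it to the amount of data actually returned. */
static ssize_t sample_read(int fd, void *buf, size_t nbytes)
{
    size_t count = nbytes;
    ssize_t ret = unix_read(fd, buf, &count);
    int error = __zoid_error();

    if (error)
    {
        /* Error within the forwarding layer. */
        errno = error;
        return -1;
    }
    if (ret < 0)
    {
        /* Error reported by the plug-in as -errno. */
        errno = -ret;
        return -1;
    }
    return ret;
}
```

The local variable <tt>count</tt> keeps the caller's <tt>nbytes</tt> untouched, so the wrapper can retain the familiar POSIX signature.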

'''Note:''' this feature has certain implementation limitations. It can misbehave in the presence of multiple output arrays (or a single output <tt>arr2d</tt>, which internally behaves a lot like multiple separate <tt>arr</tt>s). Essentially, for efficiency reasons, the placement of arrays in the result buffer is determined before an implementation function is invoked. If this feature is used to change the size of one array, and that array is followed in the output buffer by another array, a "hole" will be created in the buffer, causing problems. However, in the most common case of a single output array the feature is completely reliable.

====Zerocopy performance====

Implementation-wise, ZOID is always zero-copy on the server side, meaning that data that implementation functions put in the <tt>out</tt> arrays is sent to the compute nodes without any extra memory copies.

The client side is only zero-copy for arrays that use the <tt>zerocopy</tt> flag in the header file. Because of the additional protocol overhead that <tt>zerocopy</tt> introduces, it should be used only for potentially large memory buffers, such as the buffers of file I/O <tt>read</tt> or <tt>write</tt> calls.

'''Note:''' for maximum performance, the arrays passed as <tt>zerocopy</tt> arguments on the compute nodes must be aligned in memory to a 16-byte boundary; otherwise, an interim buffer will be used. The memory allocation routines such as <tt>malloc</tt> have been modified to align memory to that boundary. If there is a danger that the user code might pass a large unaligned buffer, and the semantics will not be affected, it makes sense to write code that detects insufficient alignment and splits the operation in two: a small unaligned one (say, up to 240 bytes – the data payload of a single packet on the tree network), followed by a larger, properly aligned one.
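
The alignment check described in the note can be sketched as follows. The helper name and the constant are hypothetical; the sketch uses the shortest possible unaligned part (at most 15 bytes), which is one choice within the "up to 240 bytes" suggested above.

```c
#include <stddef.h>
#include <stdint.h>

#define SAMPLE_ZC_ALIGN 16

/* Returns the number of leading bytes that should go into a separate,
   small operation so that the rest of the buffer starts on a 16-byte
   boundary. Returns 0 if the buffer is already aligned. */
static size_t sample_unaligned_prefix(const void *buf, size_t len)
{
    size_t misalign = (size_t)((uintptr_t)buf % SAMPLE_ZC_ALIGN);
    size_t prefix = misalign ? SAMPLE_ZC_ALIGN - misalign : 0;

    /* Never report more than the buffer actually holds. */
    return prefix < len ? prefix : len;
}
```

A wrapper would forward the first <tt>prefix</tt> bytes in one call and the remaining <tt>len - prefix</tt> bytes, starting at <tt>(char *)buf + prefix</tt>, in a second one. This is only safe where splitting the operation does not change its semantics.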

====Zerocopy with a custom output buffer====

Normally, memory for output arrays to be filled in by server-side implementation functions is allocated by the ZOID daemon. This might be inconvenient when the data to be filled arrives asynchronously, possibly before the implementation function is even invoked; in such situations, an interim memory buffer must be used, forcing an extra memory copy.

This can be avoided for zero-copy output <tt>arr</tt> types if the <tt>userbuf</tt> flag has been used. No space will then be preallocated by the daemon for the array (the server-side stub will pass a <tt>NULL</tt> pointer); instead, the implementation function must provide the daemon with its own buffer. It can do it by calling:

<pre>
void __zoid_register_userbuf(void* userbuf,
                             void (*callback)(void* userbuf, void* priv),
                             void* priv);
</pre>

Arguments:

; userbuf
: The address of the buffer.
; callback
: A callback function that is invoked by the daemon when the buffer has been sent to the client and is thus no longer needed. <tt>userbuf</tt> is passed as the first argument to the callback. It is safe for the callback to invoke <tt>__zoid_calling_process_id</tt>.
; priv
: Private data passed as the second argument to the <tt>callback</tt>. It is not interpreted by ZOID in any way.

The size of the provided buffer is determined like for any other array argument: the maximum value is provided by the client via the <tt>size</tt> argument. The server-side implementation part may choose to return less than the maximum amount, as explained [[#Returning variable amounts of data in arrays|earlier]].

As in other zero-copy cases, for maximum performance, the buffer provided must be aligned in memory to a 16-byte boundary; otherwise, an interim buffer will be used. The memory allocation routines such as <tt>malloc</tt> have been modified to align memory to that boundary, but we recommend explicitly calling <tt>posix_memalign</tt>.

'''Note:''' because the buffer provided is ''not'' allocated by ZOID, message size restrictions discussed [[#size_restrictions|earlier]] do not apply to it. Please do not abuse this capability. There is a good reason for the message size limit: it is there so that the maximum amount of memory required by the ZOID daemon stays limited. Too many overly large user-allocated buffers might result in an out-of-memory condition on the I/O node.

We provide a simple example below. It is a little artificial in the sense that the buffer is allocated within the implementation function; as we indicated, this feature is most useful with buffers allocated outside of the implementation functions:

<pre>
static void buffer_cb(void* userbuf, void* priv)
{
    free(userbuf);
}

ssize_t unix_read(int fd /* in:obj */,
                  void *buf /* out:arr:size=+1:zerocopy:userbuf */,
                  size_t *count /* inout:ptr */)
{
    ssize_t ret;

    ...

    if (posix_memalign(&buf, 16, *count))
    {
        *count = 0;
        return -ENOMEM;
    }

    __zoid_register_userbuf(buf, &buffer_cb, NULL);

    if ((ret = read(fd, buf, *count)) == -1)
    {
        *count = 0;
        return -errno;
    }
    else
    {
        *count = ret;
        return ret;
    }
}
</pre>

====Zerocopy with a custom input buffer====

The <tt>userbuf</tt> flag discussed above can also be used for ''input'' zero-copy <tt>arr</tt> arguments. This could be useful to avoid extra memory copies if the data in the array will be needed after the implementation function has returned.

If the flag is used, the daemon will not allocate the memory for the array; instead, in the middle of receiving the request from the client, it will call an allocation routine from the server-side implementation code. The name of the allocation routine is the name of the function that uses the input <tt>userbuf</tt> argument, with <tt>_allocate_cb</tt> suffix attached to it. Its prototype needs to be as follows:

<pre>
void* <name>_allocate_cb(int len);
</pre>

The single argument passed by the daemon is the length of the array in bytes. The routine must return a pointer to a buffer of that size or <tt>NULL</tt> if that is not possible (in which case, the function will fail and <tt>__zoid_error</tt> on the client side will return <tt>ENOMEM</tt>).

There is a restriction on the type of the array: its base type must have a size of one byte, so the array should be of type <tt>char*</tt>, <tt>unsigned char*</tt>, <tt>void*</tt> (a GCC extension), etc.

The allocation routine is invoked in the same context as ordinary implementation functions. It may block if it so desires; this will block the compute node client that invoked the routine, but all other clients can keep communicating with the server, thanks to its multi-threaded architecture.

Once the allocation routine has returned and a complete request has been received by the daemon, the implementation function is invoked as usual, with a correct address of the input <tt>userbuf</tt> array. It is the responsibility of the plug-in implementer to release the memory occupied by that array when it is no longer needed.

As with other user-level callbacks, the allocation routine may call <tt>__zoid_calling_process_id</tt> to learn which client process sent the request. Also, as in other zero-copy cases, for maximum performance, the buffer provided must be aligned in memory to a 16-byte boundary; otherwise, an interim buffer will be used. The memory allocation routines such as <tt>malloc</tt> have been modified to align memory to that boundary, but we recommend explicitly calling <tt>posix_memalign</tt>. Finally, as with output <tt>userbuf</tt>, message size restrictions discussed [[#size_restrictions|earlier]] do not apply to the user-allocated buffers. Please do not abuse this capability. There is a good reason for the message size limit: it is there so that the maximum amount of memory required by the ZOID daemon stays limited. Too many overly large user-allocated buffers might result in an out-of-memory condition on the I/O node.

Under rare circumstances, input <tt>userbuf</tt> could result in memory leaks. For this to take place, the job would have to be interrupted after the allocation routine has been run, but before the implementation function is called. This could only cause problems if I/O nodes are not rebooted between jobs. Those concerned about this scenario can eliminate the leak by adding necessary memory release code to the [[#Start-line functions|FINI]] function.
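
One possible way to implement such release code is to have the allocation routine remember every buffer it hands out, and to have the FINI function free whatever is left. The sketch below is an assumption about how a plug-in might do this, not code from ZOID itself; all <tt>sample_</tt> names are hypothetical. A mutex protects the list because implementation functions can run concurrently.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* One record per buffer handed to the daemon but not yet released. */
struct pending
{
    void *buf;
    struct pending *next;
};

static struct pending *pending_head;
static pthread_mutex_t pending_lock = PTHREAD_MUTEX_INITIALIZER;

/* Allocation routine: remembers every buffer it hands out. */
void *sample_write_allocate_cb(int len)
{
    struct pending *p;

    assert(len >= 0);
    if ((p = malloc(sizeof *p)) == NULL)
        return NULL;
    if (posix_memalign(&p->buf, 16, (size_t)len))
    {
        free(p);
        return NULL;
    }

    pthread_mutex_lock(&pending_lock);
    p->next = pending_head;
    pending_head = p;
    pthread_mutex_unlock(&pending_lock);

    return p->buf;
}

/* Called by the implementation function instead of a plain free(). */
void sample_release(const void *buf)
{
    struct pending **pp, *p;

    pthread_mutex_lock(&pending_lock);
    for (pp = &pending_head; (p = *pp) != NULL; pp = &p->next)
        if (p->buf == buf)
        {
            *pp = p->next;
            break;
        }
    pthread_mutex_unlock(&pending_lock);

    if (p != NULL)
    {
        free(p->buf);
        free(p);
    }
}

/* FINI function: frees whatever was allocated but never consumed. */
void sample_fini(void)
{
    struct pending *p, *next;

    pthread_mutex_lock(&pending_lock);
    p = pending_head;
    pending_head = NULL;
    pthread_mutex_unlock(&pending_lock);

    for (; p != NULL; p = next)
    {
        next = p->next;
        free(p->buf);
        free(p);
    }
}
```

The linear search in <tt>sample_release</tt> is adequate as long as only a few buffers are outstanding at a time; a plug-in with many concurrent requests might prefer a different data structure.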

A simple example:

<pre>
void* unix_write_allocate_cb(int len)
{
    void* ptr;

    if (posix_memalign(&ptr, 16, len))
        return NULL;

    return ptr;
}

ssize_t unix_write(int fd /* in:obj */,
                   const void *buf /* in:arr:size=+1:zerocopy:userbuf */,
                   size_t count /* in:obj */)
{
    ssize_t ret;

    ...

    if ((ret = write(fd, buf, count)) == -1)
        ret = -errno;

    free((void*)buf);

    return ret;
}
</pre>

----
[[Ramdisk]] | [[ZeptoOS_Documentation|Top]] | [[(K)TAU]]

ZOID

2009-05-07T20:22:41Z

Iskra: /* User script */

[[Ramdisk]] | [[ZeptoOS_Documentation|Top]] | [[(K)TAU]]
----

==Introduction==

ZOID is an I/O forwarding component of the ZeptoOS project. Any communication between the compute nodes and the I/O nodes (job management, file I/O, sockets) is handled by ZOID.

ZOID infrastructure consists of:
* A multithreaded <tt>zoid</tt> daemon on the I/O nodes which performs I/O forwarding for the compute nodes and which also communicates with the service node to perform job management,
* A <tt>control</tt> daemon on the compute nodes which is responsible for job management tasks such as the launching of application processes, for the forwarding of <tt>stdin</tt>/<tt>out</tt>/<tt>err</tt> data, and for the forwarding of IP packets,
* A <tt>zoid-fuse</tt> daemon on the compute nodes which performs file I/O forwarding for POSIX-compliant applications.

==User interface==

ZOID is meant to be transparent to users, but there are a few optional mechanisms available to interact with it.

===User script===

Right before a job starts running, and right after the last process of a job has terminated, the ZOID daemon attempts to invoke a ''user script'' on the I/O nodes. By default, the daemon invokes <tt>$HOME/zoid-user-script.sh</tt> (this pathname can be [[#opt_user_script|changed]] by an administrator). A single parameter is passed to the script: <tt>1</tt> at job startup, and <tt>0</tt> at termination.

Information about the job will be passed to the script in the following environment variables:
; <tt>ZOID_JOB_EXEC</tt>
: name of the job executable,
; <tt>ZOID_JOB_ARGS</tt>
: job arguments, separated by colons (<tt>:</tt>)
; <tt>ZOID_JOB_ENV</tt>
: job environment variables, separated by colons (<tt>:</tt>)
; <tt>ZOID_JOB_ID</tt>
: BG/P control system job id ('''Note:''' this is generally different from the Cobalt job ID; see [[FAQ#How to obtain a Cobalt job ID|FAQ]] for the latter),
; <tt>ZOID_JOB_GLOBAL_SIZE</tt>
: the number of processes in the job (the size of <tt>MPI_COMM_WORLD</tt>),
; <tt>ZOID_JOB_LOCAL_SIZE</tt>
: the number of job processes handled by this I/O node,
; <tt>ZOID_JOB_MODE</tt>
: <tt>0</tt> for SMP, <tt>1</tt> for VN, and <tt>2</tt> for DUAL,
; <tt>SHELL</tt>, <tt>PATH</tt>, <tt>USER</tt>, and <tt>HOME</tt>
: are also set.

'''Note:''' the user script is invoked ''synchronously'' by the daemon, i.e., the job will not start running until the script terminates. If one needs some processes to run on the I/O nodes while the job is running, they should be started in the background (&).

'''Note 2:''' for this feature to work, [[#The /bin.rd/update_passwd_file.sh file|update_passwd_file.sh]] must be working correctly.

===File broadcast===
A <tt>/bin.rd/f2cn</tt> command is available on the I/O nodes for a very efficient (hardware-assisted) broadcasting of files to all the compute nodes handled by the given I/O node.

The command takes two arguments:
* absolute pathname to the input file on the I/O node,
* absolute pathname to the output file on the compute nodes.

The input file does not need to be physically on the I/O node; it can be on a network filesystem mounted on the node. The file will be created in the ramdisk of each compute node.

The throughput is in practice limited by how fast the input file can be read; we have seen results in excess of 300 MB/s for files residing in the I/O node ramdisk.

'''Note:''' all the compute nodes in the pset must be up and running. Do not use this command on ''incomplete'' partitions (e.g., a one-process job on a 64-node partition); this will likely hang the ZOID daemon.

'''Note 2:''' this feature can safely be used from within a [[#User script|user script]], so one can, e.g., pre-stage large binaries, like this:

User script (<tt>$HOME/zoid-user-script.sh</tt>):
<pre>
#!/bin/sh

if [ "$1" -eq "1" ]; then
    /bin.rd/f2cn $HOME/large_binary /tmp/large_binary
fi
exit 0
</pre>

Job script (submitted using Cobalt or mpirun):
<pre>
#!/bin/sh

chmod 755 /tmp/large_binary
/tmp/large_binary
</pre>

===Performance counters===

A <tt>/bin.rd/statquery</tt> command is available on the I/O nodes for obtaining the performance counters of the I/O daemon.

The command takes a single optional argument:
* the interval between successive queries, in seconds.

If the argument is not provided, the command will terminate after the first query.

Here is a sample of the output:

<pre>
Timestamp: 1240439085.688831
Total messages sent: 5767
Total bytes sent: 7619170
Total messages received: 5717
Total bytes received: 72575
IP fwd messages sent: 196
IP fwd bytes sent: 5889
IP fwd messages received: 84
IP fwd bytes received: 6453
Stream messages sent: 65
Stream bytes sent: 520
Stream messages received: 65
Stream bytes received: 1416
Broadcast messages sent: 1
Broadcast bytes sent: 2437906
Internal messages sent: 193
Internal bytes sent: 39524
Internal messages received: 256
Internal bytes received: 1792
Plugin 5 messages sent: 0
Plugin 5 bytes sent: 0
Plugin 5 messages received: 0
Plugin 5 bytes received: 0
Plugin 2 messages sent: 5312
Plugin 2 bytes sent: 5135331
Plugin 2 messages received: 5312
Plugin 2 bytes received: 62914
</pre>

The meaning of the fields is as follows:
; Timestamp
: number of seconds and microseconds from the epoch, as returned by <tt>gettimeofday(2)</tt>,
; IP fwd
: IP packet forwarding between compute nodes and I/O nodes,
; Stream
: <tt>stdin</tt>/<tt>out</tt>/<tt>err</tt> streams,
; Broadcast
: [[#File broadcast|file broadcasts]],
; Internal
: job control messages, etc.
; Plugin 5
: internal <tt>mapping</tt> plug-in, used by MPI,
; Plugin 2
: <tt>unix</tt> plugin (POSIX file I/O).

The counters are 64-bit integers, so they will take a while to overflow :-).

Example user script (<tt>$HOME/zoid-user-script.sh</tt>) that samples the statistics every 60 seconds and writes them to a unique file:
<pre>
#!/bin/sh

if [ "$1" -eq "1" ]; then
    /bin.rd/statquery 60 >$HOME/zoid_stats.$ZOID_JOB_ID.`hostname` &
fi
exit 0
</pre>

==Administrator interface==

===Configuration file===

The <tt>zoid</tt> I/O daemon accepts a number of command-line options that can be used to change its behavior. They can be adjusted by editing the <tt>ramdisk/ION/ramdisk-add/etc/sysconfig/zoid</tt> file and rebuilding the I/O node ramdisk:

; ZOID_BUFFER_SIZE (-b)
: Specifies the size of the buffers used for messages. Because a separate buffer is needed for a request and a reply, and typically no more than one of these needs to be large, to save memory ZOID supports buffers of two sizes: a small one (4 KB by default) and a large one (4 MB+1 KB by default – the 1 KB is there to accommodate the headers). Use a colon (<tt>:</tt>) to separate the two sizes when customizing this value. If desired, support for the second buffer size can be disabled by providing only one value to this option.
; ZOID_ACK_THRESHOLD (-a)
: Specifies a size threshold for the rendezvous protocol for messages coming from the compute nodes, in units of tree network packets (240 bytes of data each). An eager protocol is used for messages below the threshold. Messages above the threshold use flow control in the form of a rendezvous protocol with message acknowledgements; basically, the daemon will only receive one large message at a time, which improves predictability and overall throughput. The daemon default for this option is to not use the acknowledgements, but the config file defaults to a value of <tt>8</tt>, which is the size of the hardware FIFO buffer of the tree network device. Set this option to 0 (or comment it out altogether) to disable message acknowledgements.
; ZOID_MODULES (-m)
: Specifies a <tt>:</tt>-separated list of ZOID plug-ins to load. This defaults to <tt>"unix_impl.so:unix_preload.so:mapping_impl.so:mapping_preload.so"</tt> in the config file; do not remove any of these or basic system services will stop working. The <tt>unix</tt> plug-in provides POSIX file I/O support, while <tt>mapping</tt> is used by our MPI implementation to map between MPI ranks and Blue Gene X/Y/Z/T coordinates. Custom plug-ins can be created and added here; see [[#Programmer interface|Programmer interface]] for details.
; ZOID_ENABLE_NAT (-n)
: Enables network address translation (NAT) for IP packets coming from the compute nodes, allowing compute nodes to communicate with the outside world. This support is disabled by default because it was found to have a detrimental effect on the overall performance of the TCP/IP stack on the I/O nodes, slowing down network filesystems. This feature can also be enabled on a per-job basis by setting the <tt>ZOID_ENABLE_NAT</tt> environment variable when submitting a job (see the [[FAQ#How to open a socket from a CN to the outside world|FAQ]]).
; ZOID_USER_SCRIPT (-u)
: Specifies the pathname to the [[#User script|user script]]; it defaults to <tt>"/bin.rd/zoid-user-script.sh"</tt>. This script can be found in <tt>ramdisk/ION/ramdisk-add/bin/zoid-user-script.sh</tt>; it sets a few environment variables and then invokes the user's custom <tt>$HOME/zoid-user-script.sh</tt>. Hence, to adjust the behavior of this feature, either change this option or the script in the ramdisk.
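
Because <tt>ZOID_ACK_THRESHOLD</tt> is expressed in tree network packets rather than bytes, a quick conversion can help when choosing a value. The sketch below only does the arithmetic implied by the 240-byte packet payload; the helper names are hypothetical, and protocol headers are ignored. With the config file default of <tt>8</tt>, the threshold corresponds to 8 × 240 = 1920 bytes of data.

```c
#include <stddef.h>

/* Bytes of data carried by one tree network packet. */
#define SAMPLE_TREE_PACKET_PAYLOAD 240u

/* Number of tree network packets needed to carry len bytes of data. */
static size_t sample_packets_for(size_t len)
{
    return (len + SAMPLE_TREE_PACKET_PAYLOAD - 1) / SAMPLE_TREE_PACKET_PAYLOAD;
}

/* Amount of data, in bytes, that corresponds to a threshold given in
   packets. */
static size_t sample_threshold_bytes(size_t packets)
{
    return packets * SAMPLE_TREE_PACKET_PAYLOAD;
}
```

For instance, a 4 KB message needs 18 packets, so it is well above the default threshold.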

===The /bin.rd/update_passwd_file.sh file===

Allowing the partition owner to log into the I/O node using SSH is one of the features of the ZeptoOS software stack. Only the administrator and the partition owner are given login access; this is controlled by the <tt>/bin.rd/update_passwd_file.sh</tt> script, which is invoked by the daemon while the partition is being initialized. The script can be found in <tt>ramdisk/ION/ramdisk-add/bin/update_passwd_file.sh</tt>.

The script makes a number of assumptions that could be site-specific, so it might require adjustment. The daemon invokes the script, passing the numerical UNIX user ID of the partition owner as the only argument. The script then scans <tt>/bgsys/iofs/etc/passwd</tt> for an entry with the same user ID (on Argonne machines, this file contains all valid account names). If a matching entry is found, it is appended to the <tt>/etc/passwd</tt> file in the I/O node ramdisk, thus enabling login access to the node for that user.

If allowing ordinary users access to the I/O nodes is undesirable, one can simply put <tt>exit 0</tt> at the top of the script to disable it.

===The /bin.rd/nat file===

If NAT has been [[#opt_enable_nat|requested]], the daemon invokes the <tt>/bin.rd/nat</tt> script to enable it. This script can be found in <tt>ramdisk/ION/ramdisk-add/bin/nat</tt>. Generally, it should not require any modifications.

==Programmer interface==

ZOID is a flexible, extensible, high-performance function call forwarding (RPC) infrastructure. Built-in features and the standard plug-ins provide familiar POSIX file I/O and BSD socket interfaces, but, because of the number of software layers involved, they introduce a significant overhead. For applications requiring maximum bandwidth between the compute and I/O nodes, ZOID provides an option of a customized function call forwarding with minimal overheads. This section provides an overview of how to create such custom plug-ins.

===Overview===

All that ZOID provides is a function call forwarding support, and a limited one at that. Any logic (caching, prefetching, etc.) needs to be custom-built on top of it.

Follow existing plug-ins, found in <tt>packages/zoid/src/</tt>, as examples. The <tt>unix</tt> plug-in is generally the most up to date, but other plug-ins such as <tt>mapping</tt>, <tt>zoidfs</tt>, <tt>barrier</tt>, and <tt>test</tt> should also be fine.

A plug-in consists of automatically generated client-side and server-side stubs (which perform the marshalling and demarshalling of function call parameters and results, the forwarding of the function call, etc.), and of a hand-written server-side implementation which provides the implementation code for the forwarded function calls. One might also decide to provide hand-written client-side wrappers to hide some details of the ZOID API (such as the error handling) or to adhere to a particular existing API, as is the case with the <tt>unix</tt> plug-in (the wrappers used by the FUSE client are available in <tt>packages/zoid/src/unix/stubs/</tt>; another version is in the GNU libc sources, in <tt>packages/glibc/src/zoid/sysdeps/unix/sysv/linux/powerpc/powerpc32/</tt>).

The <tt>scanner.pl</tt> script, found in <tt>packages/zoid/src/</tt>, creates the automatically-generated client and server stubs based on a hand-written input header file described below. Again, please follow the examples from the existing plug-ins, such as <tt>unix</tt> or <tt>mapping</tt>. The <tt>Makefile</tt> in those plug-ins is written in a generic fashion and should only require a change to the <tt>PREFIX</tt> line to be usable with another plug-in. Use that <tt>Makefile</tt> to invoke the <tt>scanner.pl</tt> script and to compile the generated source files.

===Input header file===

The input header file must be a valid C header file with additional hints in the comments. The file is read by the <tt>scanner.pl</tt> script.

The parser in the script is rather limited and does not handle many C constructs. It is thus essential that the header file be as simple as possible. In particular, function prototypes should be specified at the end of the file, not intermixed with any other specifications such as data type definitions.

Ordinary comments are best placed on separate lines.

'''Note:''' the parser is case ''sensitive''.

====Start line====

Any complex declarations that the scanner cannot parse should be placed at the top of the file, because the parser ignores everything until it encounters the following magic start line:

<pre>
/* START-ZOID-SCANNER ID=<n> INIT=<s1> FINI=<s2> PROC=<s3> */
</pre>

; ID=<n>
: Each plug-in needs a unique, 16-bit identifier, passed in <tt><n></tt>. The following identifiers are already in use: <tt>0</tt> (internal), <tt>1</tt> (<tt>zoidfs</tt> plug-in), <tt>2</tt> (<tt>unix</tt>), <tt>3</tt> (<tt>lofar</tt>), <tt>4</tt> (<tt>test</tt>), <tt>5</tt> (<tt>mapping</tt>), and <tt>10</tt> (<tt>ftb</tt>).
; INIT=<s1>
: <tt><s1></tt> provides the name of an initialization function which will be invoked before a job starts running; see [[#Start-line functions|Start-line functions]] for more information. If a plug-in does not need this feature, please specify <tt>INIT=NULL</tt>.
; FINI=<s2>
: <tt><s2></tt> provides the name of a termination function which will be invoked after all job processes have exited; see [[#Start-line functions|Start-line functions]] for more information. If a plug-in does not need this feature, please specify <tt>FINI=NULL</tt>.
; PROC=<s3>
: <tt><s3></tt> provides the name of a callback function which will be invoked on a startup and termination of every application and ZOID-enabled process; see [[#Start-line functions|Start-line functions]] for more information. If a plug-in does not need this feature, please specify <tt>PROC=NULL</tt>.

====Argument hints====

Hints are generally needed by the scanner to correctly encode and decode function arguments. They need to be placed after each argument, before a separating comma (or a closing bracket), and should be embedded inside dedicated C comments. Multiple hints per argument are usually provided; these are separated by a colon (<tt>:</tt>). The following hints are currently defined:

; in, out, inout
: Specifies whether the argument is an input argument, an output argument, or both. <tt>in</tt> is the default.
; obj, str, ptr, arr, arr2d
: Specifies the type of the argument, respectively a plain object (say, an <tt>int</tt>, or a structure passed by value), a <tt>'\0'</tt>-terminated character string, a pointer to a plain object, an array of objects, or a two-dimensional array (<tt>type**</tt>, not <tt>type[][]</tt>). <tt>obj</tt> is the default.
; size
: Required for array arguments (<tt>arr</tt> and <tt>arr2d</tt>). Indicates the index of another argument in the same function, which is used to pass the array size. Absolute numbers are accepted (<tt>1</tt> to ''number of arguments'') or relative ones (<tt>+1</tt> for the next argument, <tt>-1</tt> for the previous argument, etc). For <tt>arr</tt> arguments, the size argument must be of a numerical type, or a pointer to such a type. For <tt>arr2d</tt> arguments, the size argument must itself be an array (an <tt>arr</tt> argument) of numerical elements, specifying the sizes along the less significant dimension of the array (the size of the more significant dimension is the size of the <tt>arr</tt> array itself). Please note that the unit of size for the numerical types is the size of the base array type (thus, <tt>sizeof(int)</tt> for an array of <tt>int</tt>s), not byte (if one would like it to be byte, just make the array argument have type <tt>char*</tt> or <tt>void*</tt> (a GCC extension)).
; nullok
: An option for arguments passed by pointer (basically, all but <tt>obj</tt>). If provided, it indicates that the argument is allowed to be <tt>NULL</tt>. This is not the default because supporting <tt>NULL</tt> pointers results in an additional computational and protocol overhead. '''Note:''' if a <tt>NULL</tt> pointer is passed to an argument that lacks the <tt>nullok</tt> flag, the client ''will'' crash.
; zerocopy
: An option for array arguments. Enables a more efficient marshalling/demarshalling protocol for the array, which does not use extra memory copies. Can be used for no more than one <tt>in</tt> argument and no more than one <tt>out</tt> argument. [[#Zerocopy performance|Zerocopy performance]] discusses performance considerations when using this option.
; userbuf
: An option for <tt>zerocopy</tt>; only supported for <tt>arr</tt> arguments. Enables a special form of zero-copy support, discussed in [[#Zerocopy with a custom output buffer|Zerocopy with a custom output buffer]] and [[#Zerocopy with a custom input buffer|Zerocopy with a custom input buffer]].

Here is an example function prototype with the hints:

<pre>
int zoidfs_readlink(const zoidfs_handle_t * handle /* in:ptr */,
                    char * buffer /* out:arr:size=+1 */,
                    size_t buffer_length /* in:obj */);
</pre>
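
To show how the start line and the hints fit together, here is a sketch of a complete input header for a hypothetical <tt>sample</tt> plug-in. Everything in it is an assumption made for illustration: the identifier <tt>42</tt> is arbitrary (it merely avoids the identifiers listed as already in use), and the type and function names are invented. Existing plug-ins such as <tt>unix</tt> remain the authoritative examples.

```c
/* sample.h: input header for a hypothetical "sample" plug-in. */
#include <stddef.h>

/* Declarations that the scanner does not need to parse go above the
   start line, since everything before it is ignored. */
typedef struct
{
    int id;
    int flags;
} sample_handle_t;

/* START-ZOID-SCANNER ID=42 INIT=sample_init FINI=sample_fini PROC=NULL */

/* Stores len bytes under the given key. */
int sample_put(const sample_handle_t *handle /* in:ptr */,
               const char *key /* in:str */,
               const void *data /* in:arr:size=+1 */,
               size_t len /* in:obj */);

/* Fetches up to *len bytes; the server sets *len to the amount used. */
int sample_get(const sample_handle_t *handle /* in:ptr */,
               const char *key /* in:str */,
               void *data /* out:arr:size=+1 */,
               size_t *len /* inout:ptr */);
```

The prototypes are kept at the end of the file and ordinary comments sit on their own lines, in keeping with the parser restrictions described earlier.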

====Limitations====

As indicated earlier, the scanner is limited, so keep the prototypes simple.

The return type of a forwarded function must be scalar or <tt>void</tt>.

Structures with pointer fields inside of them cannot be forwarded.

====Generated files====

For every function prototype found, the scanner generates two output files: one for a client calling the function and one for the server, where the function is in fact executed. Code in the generated files performs marshalling and demarshalling of function arguments and results.

Two more files per plug-in are generated: ''header''<tt>_defs.h</tt> and ''header''<tt>_dispatch.c</tt>.

None of the generated files should be modified.

===Server-side API===

Server-side stubs and the server-side implementation need to be passed as modules when invoking the ZOID I/O daemon, as described [[#opt_modules|earlier]].

The hand-written server-side implementation code should include the <tt>zoid_api.h</tt> header file (available from <tt>packages/zoid/prebuilt/</tt>) and the plug-in input header file.

All the functions listed in the header file need to be defined in the server-side implementation code. The code needs to be compiled as a shared library; use the <tt>implementation/</tt> subdirectory of the <tt>unix</tt> plug-in as an example. Please note that since ZOID is multi-threaded, multiple functions can be invoked at the same time, so one must ensure that the implementation is multi-thread-safe.

====Start-line functions====

The following [[#Start line|start-line]] functions can be defined:

<pre>
void INIT(int pset_mpi_proc_count, int argc, int envc, const char* argenv);
</pre>

The INIT function is invoked during initialization, right before a job starts running. Arguments:

; pset_mpi_proc_count
: The number of job processes that will be handled by this I/O node. Note that I/O nodes also handle additional ZOID-enabled processes, such as the FUSE clients, which are not included in this number.
; argc
: The number of command-line arguments plus one.
; envc
: The number of environment variables.
; argenv
: An array of <tt>'\0'</tt>-terminated strings, one after another. The first string is the name of the job executable, followed by <tt>argc-1</tt> command-line arguments, followed by <tt>envc</tt> environment variables.
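
Because <tt>argenv</tt> is a packed sequence of strings rather than an array of pointers, an <tt>INIT</tt> implementation has to walk it string by string. The following is a minimal sketch of how that could be done. The helper names <tt>argenv_nth</tt> and <tt>argenv_getenv</tt> are hypothetical and not part of ZOID, and the <tt>NAME=value</tt> form of the environment strings is an assumption:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the index-th string (0-based) in a packed sequence of
   '\0'-terminated strings, or NULL if index is outside [0, total). */
const char *argenv_nth(const char *argenv, int total, int index)
{
    assert(argenv != NULL);

    if (index < 0 || index >= total)
        return NULL;

    while (index-- > 0)
        argenv += strlen(argenv) + 1;   /* skip the string and its terminator */

    return argenv;
}

/* Hypothetical helper: looks up an environment variable by name.
   Environment strings are assumed to have the form NAME=value. */
const char *argenv_getenv(const char *argenv, int argc, int envc,
                          const char *name)
{
    size_t name_len = strlen(name);
    int i;

    /* The first argc strings are the executable name and its arguments,
       so the environment starts at index argc. */
    for (i = argc; i < argc + envc; i++)
    {
        const char *entry = argenv_nth(argenv, argc + envc, i);

        if (strncmp(entry, name, name_len) == 0 && entry[name_len] == '=')
            return entry + name_len + 1;
    }

    return NULL;
}
```

An <tt>INIT</tt> function could then call, for instance, <tt>argenv_nth(argenv, argc + envc, 0)</tt> to obtain the name of the job executable. The lookup rescans the sequence for every entry, which is acceptable for a one-time initialization step.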

<pre>
void FINI(void);
</pre>

The FINI function is invoked after the last process of the job has terminated.

<pre>
void PROC(int added, int pset_pid);
</pre>

The PROC function is invoked on the startup and termination of every application and ZOID-enabled process on the compute node. Arguments:

; added
: <tt>1</tt> if the process was started, <tt>0</tt> if it was terminated.
; pset_pid
: A process identifier (as returned by [[#Implementation functions|<tt>__zoid_calling_process_id</tt>]]).
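
As an illustration of the <tt>added</tt> flag, a <tt>PROC</tt> implementation might keep a count of the processes currently attached to the I/O node. This is only a sketch: the names <tt>example_proc</tt> and <tt>example_live_processes</tt> are hypothetical, and an atomic counter is used on the assumption that the daemon may invoke the function from several threads:

```c
#include <assert.h>
#include <stdatomic.h>

/* Number of compute node processes currently attached to this I/O node. */
static atomic_int live_processes;

/* Hypothetical function with the same prototype as PROC. */
void example_proc(int added, int pset_pid)
{
    (void)pset_pid;     /* the identifier is not needed for plain counting */
    assert(added == 0 || added == 1);

    if (added)
        atomic_fetch_add(&live_processes, 1);
    else
        atomic_fetch_sub(&live_processes, 1);
}

/* Returns the current number of attached processes. */
int example_live_processes(void)
{
    return atomic_load(&live_processes);
}
```

Per-process state would need a table keyed by <tt>pset_pid</tt> instead; since identifiers can be reused, such a table should drop an entry when the function is called with <tt>added</tt> equal to <tt>0</tt>.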

====Implementation functions====

The hand-written server-side implementation functions can themselves call back a few ZOID functions, available by including the <tt>zoid_api.h</tt> header file:

<pre>
int __zoid_calling_process_id(void);
</pre>

This function returns a unique identifier of the compute node process that invoked the function. The identifier is ''not'' an MPI rank, because some processes, such as the FUSE clients, are not part of the application and hence do not have a rank. The identifiers are only unique within one I/O node, and they can be reused if a process starts after another one has terminated.

<pre>
void __zoid_register_userbuf(void* userbuf,
                             void (*callback)(void* userbuf, void* priv),
                             void* priv);
</pre>

This function will be discussed in [[#Zerocopy with a custom output buffer|Zerocopy with a custom output buffer]].

<pre>
int __zoid_send_output(int pid, int fd, const char* buffer, int len);
</pre>

This function writes an arbitrary string to the job's standard output or error. Arguments:

; pid
: Process identifier as returned by <tt>__zoid_calling_process_id</tt>. The process in question ''must'' have an MPI rank, meaning that it must be either an application process or a process launched from an application process.
; fd
: <tt>1</tt> for standard output, <tt>2</tt> for standard error.
; buffer, len
: The string and its length. <tt>'\0'</tt> should not be included in <tt>len</tt> and <tt>buffer</tt> does not need to be <tt>'\0'</tt>-terminated.

The function returns 0 if successful, and -1 if not (such as when the process identified by <tt>pid</tt> does not have an MPI rank).

===Client-side API===

A compute node application needs to be linked with the client-side stubs and with a common support library <tt>libzoid_cn.a</tt> (a prebuilt version of the latter is in <tt>packages/zoid/prebuilt</tt>; sources are in <tt>packages/zoid/src/cnl/client</tt>). Several functions are available to applications by including the <tt>zoid_api.h</tt> header file:

====Initialization====

<pre>
int __zoid_init(void);
</pre>

This function ''must'' be invoked before any ZOID or ZOID-forwarded functions can be invoked. It returns <tt>0</tt> if successful, <tt>1</tt> otherwise. There is no corresponding termination function.

<pre>
int __zoid_job_size(void);
int __zoid_my_rank(void);
</pre>

These functions return, respectively, the number of processes in the job (the size of <tt>MPI_COMM_WORLD</tt>), and the MPI rank of the current process. Either will return <tt>-1</tt> if the current process does not have an MPI rank, i.e., if it is not an application process and was not launched from an application process (say, if it was launched from an interactive shell).

====Error conditions====

<pre>
int __zoid_error(void);
</pre>

This function should be invoked on the client side after ''every'' forwarded function call returns, to determine if any errors occurred within the forwarding layer. A return value of <tt>0</tt> indicates success; otherwise, one of the following error values will be returned:

; ENOSYS
: Invalid command sent from the client. Typically indicates that the corresponding I/O-node-side [[#opt_modules|modules]] have not been loaded.
; ENOMEM
: Out of memory condition.
; E2BIG
: Message exceeded the internal size limit.

<pre>
int __zoid_excessive_size(void);
</pre>

If <tt>__zoid_error</tt> returned <tt>E2BIG</tt>, calling this function reports by how many bytes the input or output exceeded the size limit.

ZOID [[#opt_buffer_size|has a limit]] on the message size, around 4 MB by default. The limit is enforced on both input and output. The limit only applies to buffers "owned" by ZOID on the daemon side; it does not apply to custom [[#Zerocopy with a custom input buffer|input]] or [[#Zerocopy with a custom output buffer|output]] buffers.

If the limit is hit, the operation needs to be split into smaller ones. Information returned by <tt>__zoid_excessive_size</tt> makes it easy to adjust the buffer and resubmit.

'''Note:''' While the input-side (argument) overflow is flagged immediately on the client side, and is thus fairly cheap to hit, the output-side (result) overflow is flagged on the I/O node, after the request has been sent there (but before the implementation function is invoked). It is thus advised to cache at least the size limit for the output side for the next invocation, to avoid a future communication overhead. The size limit is function-specific, since it depends on sizes of other arguments and results.

Here is an example of how the client-side convenience wrapper for a call such as POSIX <tt>read</tt> could be implemented:

<pre>
ssize_t read(int fd, void *buf, size_t nbytes)
{
    /* Cached size limit for this function; -1 means not yet known. */
    static ssize_t max_read_nbytes = -1;
    ssize_t bytes_read;

    bytes_read = 0;
    do
    {
        ssize_t toread, justread;
        int error;

        toread = nbytes - bytes_read;

        if (max_read_nbytes != -1 && toread > max_read_nbytes)
            toread = max_read_nbytes;

        /* unix_read is the forwarded function call. */
        justread = unix_read(fd, (char *)buf + bytes_read, toread);

        if ((error = __zoid_error()))
        {
            if (error != E2BIG)
            {
                /* For a generic ZOID error, just bail out. */
                errno = error;
                return -1;
            }

            /* The read request was too large. Shrink it and retry. */
            max_read_nbytes = toread - __zoid_excessive_size();
        }
        else
        {
            if (justread < 0)
            {
                /* For a generic read() error, just bail out.
                   In case of an I/O error, unix_read returns -errno. */
                errno = -justread;
                return -1;
            }

            bytes_read += justread;

            if (justread != toread)
                /* unix_read as such succeeded, but it read fewer bytes than
                   expected. We terminate prematurely then. */
                break;
        }
    } while ((size_t)bytes_read < nbytes);

    return bytes_read;
}
</pre>

===Additional considerations===

====Forwarding <tt>errno</tt>====

If one needs to pass a variable such as <tt>errno</tt> from the I/O node to the client, the most straightforward way is to add an extra integer <tt>out</tt> pointer argument to all functions and pass it that way. Another option is to do it the same way the UNIX kernel does: pass it as a negative return value from the functions. The <tt>unix</tt> plug-in does it that way, so, e.g., the implementation of <tt>close</tt> on the I/O node looks something like this:

<pre>
if (close(server_fd) == -1)
    return -errno;
else
    return 0;
</pre>

Then, on the client side, we have a convenience wrapper:

<pre>
int close(int fd)
{
    return unix_decode_result(unix_close(fd));
}
</pre>

<tt>unix_decode_result</tt> is a preprocessor macro that handles both ZOID errors and errors returned by the plug-in. It uses a number of GCC extensions to make it as transparent as possible:

<pre>
#define unix_decode_result(result) \
({ \
    typeof (result) _result = (result); \
    int _n; \
    if ((_n = __zoid_error()) != 0) \
    { \
        errno = _n; \
        _result = -1; \
    } \
    else if (_result < 0) \
    { \
        errno = -_result; \
        _result = -1; \
    } \
    _result; \
})
</pre>

====Returning variable amounts of data in arrays====

Just like with UNIX system calls, ZOID does not allocate memory for the results. Instead, callers must provide pre-allocated arrays, along with their sizes. UNIX would then typically return the size of the used part as a return value from a system call. Unfortunately, ZOID cannot make use of that – it will use the same array size argument to determine how much data to send back, so even if only a small part of the provided buffer is actually filled in, the whole buffer will be sent back, which is inefficient. This can be prevented by passing the array size as an <tt>inout</tt> pointer to a numerical type. A server-side implementation of a function such as <tt>read</tt> then looks like this:

<pre>
ssize_t unix_read(int fd /* in:obj */,
                  void *buf /* out:arr:size=+1 */,
                  size_t *count /* inout:ptr */)
{
    ssize_t ret;

    ...

    if ((ret = read(fd, buf, *count)) == -1)
    {
        *count = 0;
        return -errno;
    }
    else
    {
        *count = ret;
        return ret;
    }
}
</pre>

Obviously, the client side needs to be modified as well, to pass the size argument by address.
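
A client-side wrapper for the <tt>inout</tt> variant might look roughly like the sketch below. To keep the sketch compilable on its own, <tt>fake_unix_read</tt> and <tt>fake_zoid_error</tt> are stand-ins invented for this example; a real client would call the generated <tt>unix_read</tt> stub and <tt>__zoid_error</tt> instead. The name <tt>example_read</tt> is likewise hypothetical:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Stand-ins so that the sketch compiles on its own. In a real client these
   would be the generated unix_read stub and __zoid_error from zoid_api.h. */
static const char fake_file[] = "hello";

static ssize_t fake_unix_read(int fd, void *buf, size_t *count)
{
    size_t n = sizeof(fake_file) - 1;

    (void)fd;
    if (n > *count)
        n = *count;
    memcpy(buf, fake_file, n);
    *count = n;             /* report how much of buf is actually used */
    return (ssize_t)n;
}

static int fake_zoid_error(void)
{
    return 0;
}

/* Client-side convenience wrapper with the usual read() signature. */
ssize_t example_read(int fd, void *buf, size_t nbytes)
{
    size_t count = nbytes;  /* in: capacity of buf; out: bytes sent back */
    ssize_t ret;
    int error;

    assert(buf != NULL);

    /* The size is now passed by address rather than by value. */
    ret = fake_unix_read(fd, buf, &count);

    if ((error = fake_zoid_error()) != 0)
    {
        errno = error;
        return -1;
    }
    if (ret < 0)
    {
        errno = (int)-ret;  /* the plug-in returns -errno on I/O errors */
        return -1;
    }
    return ret;
}
```

The local <tt>count</tt> variable keeps the caller's <tt>nbytes</tt> untouched, so the wrapper retains the standard <tt>read</tt> signature. Handling of <tt>E2BIG</tt> is omitted here for brevity.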

'''Note:''' this feature has certain implementation limitations. It can misbehave in the presence of multiple output arrays (or a single output <tt>arr2d</tt>, which internally behaves a lot like multiple separate <tt>arr</tt>s). Essentially, for efficiency reasons, the placement of arrays in the result buffer is determined before an implementation function is invoked. If this feature is used to change the size of one array, and that array is followed in the output buffer by another array, a "hole" will be created in the buffer, causing problems. However, in the most common case of a single output array the feature is completely reliable.

====Zerocopy performance====

Implementation-wise, ZOID is always zero-copy on the server side, meaning that data that implementation functions put in the <tt>out</tt> arrays is sent to the compute nodes without any extra memory copies.

The client side is only zero-copy for arrays that use the <tt>zerocopy</tt> flag in the header file. Because of the additional protocol overheads that <tt>zerocopy</tt> introduces, it should be used only for potentially large memory buffers, such as the buffers of file I/O <tt>read</tt> or <tt>write</tt> calls.

'''Note:''' for maximum performance, the arrays passed as <tt>zerocopy</tt> arguments on the compute nodes must be aligned in memory to a 16-byte boundary; otherwise, an interim buffer will be used. The memory allocation routines such as <tt>malloc</tt> have been modified to align memory to that boundary. If there is a danger that the user code might pass a large unaligned buffer, and the semantics will not be affected, it makes sense to write code that detects insufficient alignment and splits the operation in two: a small unaligned one (say, up to 240 bytes – the data payload of a single packet on the tree network), followed by a larger, properly aligned one.
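
The split described in the note can be computed with simple pointer arithmetic. The sketch below uses hypothetical names (<tt>unaligned_head</tt>, <tt>split_for_zerocopy</tt>, <tt>struct buffer_split</tt>) and cuts off only as many leading bytes as are needed to reach the next 16-byte boundary, which is one possible choice among several:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ZEROCOPY_ALIGNMENT 16   /* boundary required for the zero-copy path */

/* Number of leading bytes to transfer separately so that the rest of the
   buffer starts on a 16-byte boundary. Never exceeds len. */
size_t unaligned_head(const void *buf, size_t len)
{
    uintptr_t addr = (uintptr_t)buf;
    size_t head = (size_t)((ZEROCOPY_ALIGNMENT - addr % ZEROCOPY_ALIGNMENT)
                           % ZEROCOPY_ALIGNMENT);

    return head < len ? head : len;
}

struct buffer_split
{
    const char *head;   /* unaligned part, head_len bytes (may be empty) */
    size_t head_len;
    const char *tail;   /* aligned part, tail_len bytes (may be empty) */
    size_t tail_len;
};

/* Splits [buf, buf + len) into an unaligned head and an aligned tail. */
struct buffer_split split_for_zerocopy(const void *buf, size_t len)
{
    struct buffer_split s;

    s.head = buf;
    s.head_len = unaligned_head(buf, len);
    s.tail = s.head + s.head_len;
    s.tail_len = len - s.head_len;

    /* Postcondition: a non-empty tail always starts on the boundary. */
    assert(s.tail_len == 0 || (uintptr_t)s.tail % ZEROCOPY_ALIGNMENT == 0);
    return s;
}
```

A wrapper would then issue one forwarded call for the head, if it is not empty, and a second one for the tail. This is only valid when splitting does not change the semantics of the operation, as the note points out.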

====Zerocopy with a custom output buffer====

Normally, memory for output arrays to be filled in by server-side implementation functions is allocated by the ZOID daemon. This might be inconvenient when the data to be filled arrives asynchronously, possibly before the implementation function is even invoked; in such situations, an interim memory buffer must be used, forcing an extra memory copy.

This can be avoided for zero-copy output <tt>arr</tt> types if the <tt>userbuf</tt> flag has been used. No space will then be preallocated by the daemon for the array (the server-side stub will pass a <tt>NULL</tt> pointer); instead, the implementation function must provide the daemon with its own buffer. It can do it by calling:

<pre>
void __zoid_register_userbuf(void* userbuf,
                             void (*callback)(void* userbuf, void* priv),
                             void* priv);
</pre>

Arguments:

; userbuf
: The address of the buffer.
; callback
: A callback function that is invoked by the daemon when the buffer has been sent to the client and is thus no longer needed. <tt>userbuf</tt> is passed as the first argument to the callback. It is safe for the callback to invoke <tt>__zoid_calling_process_id</tt>.
; priv
: Private data passed as the second argument to the <tt>callback</tt>. It is not interpreted by ZOID in any way.

The size of the provided buffer is determined as for any other array argument: the maximum value is provided by the client via the <tt>size</tt> argument. The server-side implementation may choose to return less than the maximum amount, as explained [[#Returning variable amounts of data in arrays|earlier]].

As in other zero-copy cases, for maximum performance, the buffer provided must be aligned in memory to a 16-byte boundary; otherwise, an interim buffer will be used. The memory allocation routines such as <tt>malloc</tt> have been modified to align memory to that boundary, but we recommend explicitly calling <tt>posix_memalign</tt>.

'''Note:''' because the buffer provided is ''not'' allocated by ZOID, message size restrictions discussed [[#size_restrictions|earlier]] do not apply to it. Please do not abuse this capability. There is a good reason for the message size limit: it is there so that the maximum amount of memory required by the ZOID daemon stays limited. Too many overly large user-allocated buffers might result in an out-of-memory condition on the I/O node.

We provide a simple example below. It is a little artificial in the sense that the buffer is allocated within the implementation function; as we indicated, this feature is most useful with buffers allocated outside of the implementation functions:

<pre>
static void buffer_cb(void* userbuf, void* priv)
{
    free(userbuf);
}

ssize_t unix_read(int fd /* in:obj */,
                  void *buf /* out:arr:size=+1:zerocopy:userbuf */,
                  size_t *count /* inout:ptr */)
{
    ssize_t ret;

    ...

    if (posix_memalign(&buf, 16, *count))
    {
        *count = 0;
        return -ENOMEM;
    }

    __zoid_register_userbuf(buf, &buffer_cb, NULL);

    if ((ret = read(fd, buf, *count)) == -1)
    {
        *count = 0;
        return -errno;
    }
    else
    {
        *count = ret;
        return ret;
    }
}
</pre>

====Zerocopy with a custom input buffer====

The <tt>userbuf</tt> flag discussed above can also be used for ''input'' zero-copy <tt>arr</tt> arguments. This could be useful to avoid extra memory copies if the data in the array will be needed after the implementation function has returned.

If the flag is used, the daemon will not allocate the memory for the array; instead, in the middle of receiving the request from the client, it will call an allocation routine from the server-side implementation code. The name of the allocation routine is the name of the function that uses the input <tt>userbuf</tt> argument, with <tt>_allocate_cb</tt> suffix attached to it. Its prototype needs to be as follows:

<pre>
void* <name>_allocate_cb(int len);
</pre>

The single argument passed by the daemon is the length of the array in bytes. The routine must return a pointer to a buffer of that size or <tt>NULL</tt> if that is not possible (in which case, the function will fail and <tt>__zoid_error</tt> on the client side will return <tt>ENOMEM</tt>).

There is a restriction on the type of the array: its base type must have a size of one byte, so the array should be of type <tt>char*</tt>, <tt>unsigned char*</tt>, <tt>void*</tt> (a GCC extension), etc.

The allocation routine is invoked in the same context as ordinary implementation functions. It may block if it so desires; this will block the compute node client that invoked the routine, but all other clients can keep communicating with the server, thanks to its multi-threaded architecture.

Once the allocation routine has returned and a complete request has been received by the daemon, the implementation function is invoked as usual, with a correct address of the input <tt>userbuf</tt> array. It is the responsibility of the plug-in implementer to release the memory occupied by that array when it is no longer needed.

As with other user-level callbacks, the allocation routine may call <tt>__zoid_calling_process_id</tt> to learn which client process sent the request. Also, as in other zero-copy cases, for maximum performance, the buffer provided must be aligned in memory to a 16-byte boundary; otherwise, an interim buffer will be used. The memory allocation routines such as <tt>malloc</tt> have been modified to align memory to that boundary, but we recommend explicitly calling <tt>posix_memalign</tt>. Finally, as with output <tt>userbuf</tt>, message size restrictions discussed [[#size_restrictions|earlier]] do not apply to the user-allocated buffers. Please do not abuse this capability. There is a good reason for the message size limit: it is there so that the maximum amount of memory required by the ZOID daemon stays limited. Too many overly large user-allocated buffers might result in an out-of-memory condition on the I/O node.

Under rare circumstances, input <tt>userbuf</tt> could result in memory leaks. For this to take place, the job would have to be interrupted after the allocation routine has been run, but before the implementation function is called. This could only cause problems if I/O nodes are not rebooted between jobs. Those concerned about this scenario can eliminate the leak by adding necessary memory release code to the [[#Start-line functions|FINI]] function.

A simple example:

<pre>
void* unix_write_allocate_cb(int len)
{
    void* ptr;

    if (posix_memalign(&ptr, 16, len))
        return NULL;

    return ptr;
}

ssize_t unix_write(int fd /* in:obj */,
                   const void *buf /* in:arr:size=+1:zerocopy:userbuf */,
                   size_t count /* in:obj */)
{
    ssize_t ret;

    ...

    if ((ret = write(fd, buf, count)) == -1)
        ret = -errno;

    free((void*)buf);

    return ret;
}
</pre>
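
For those who want to guard against the rare leak scenario described above, one possible approach is to record every buffer handed out by the allocation routine and to release whatever is still outstanding from <tt>FINI</tt>. The sketch below is an assumption about how this could be organized, not the method used by the <tt>unix</tt> plug-in; all <tt>example_</tt> names are hypothetical, and a mutex protects the list because the daemon is multi-threaded:

```c
#define _POSIX_C_SOURCE 200112L     /* for posix_memalign */
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* A buffer handed out by the allocation routine and not yet released. */
struct pending_buf
{
    void *ptr;
    struct pending_buf *next;
};

static struct pending_buf *pending_head;
static int pending_count;
static pthread_mutex_t pending_lock = PTHREAD_MUTEX_INITIALIZER;

/* Allocation routine: remembers every buffer it hands out. */
void *example_write_allocate_cb(int len)
{
    struct pending_buf *node;
    void *ptr;

    assert(len >= 0);
    if ((node = malloc(sizeof(*node))) == NULL)
        return NULL;
    if (posix_memalign(&ptr, 16, (size_t)len))
    {
        free(node);
        return NULL;
    }
    node->ptr = ptr;
    pthread_mutex_lock(&pending_lock);
    node->next = pending_head;
    pending_head = node;
    pending_count++;
    pthread_mutex_unlock(&pending_lock);
    return ptr;
}

/* Called by the implementation function once it is done with the buffer. */
void example_release(void *ptr)
{
    struct pending_buf **link, *node = NULL;

    pthread_mutex_lock(&pending_lock);
    for (link = &pending_head; *link != NULL; link = &(*link)->next)
        if ((*link)->ptr == ptr)
        {
            node = *link;
            *link = node->next;     /* unlink the record */
            pending_count--;
            break;
        }
    pthread_mutex_unlock(&pending_lock);
    if (node != NULL)
    {
        free(node->ptr);
        free(node);
    }
}

/* FINI-style cleanup: frees buffers that were allocated but never used. */
void example_fini(void)
{
    pthread_mutex_lock(&pending_lock);
    while (pending_head != NULL)
    {
        struct pending_buf *node = pending_head;

        pending_head = node->next;
        free(node->ptr);
        free(node);
        pending_count--;
    }
    pthread_mutex_unlock(&pending_lock);
}

/* Number of buffers currently outstanding. */
int example_pending_count(void)
{
    int n;

    pthread_mutex_lock(&pending_lock);
    n = pending_count;
    pthread_mutex_unlock(&pending_lock);
    return n;
}
```

With this arrangement the implementation function would call <tt>example_release</tt> where the example above calls <tt>free</tt>. The bookkeeping adds a lock acquisition per request, so it is only worthwhile on systems where I/O nodes are not rebooted between jobs.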

----
[[Ramdisk]] | [[ZeptoOS_Documentation|Top]] | [[(K)TAU]]

Testing

2009-05-07T20:12:54Z

Iskra: /* Interactive login */

[[Installation]] | [[ZeptoOS_Documentation|Top]] | [[MPICH, DCMF, and SPI]]
----

Once ZeptoOS is configured and installed, it is time to test it. Here are a few trivial tests to verify that the environment is working:

==The /bin/sleep job==

If you are using Cobalt, submit using either of the commands below:

<pre>
$ cqsub -k <profile-name> -t <time> -n 1 /bin/sleep 3600
$ qsub --kernel <profile-name> -t <time> -n 1 /bin/sleep 3600
</pre>

If you are using <tt>mpirun</tt> directly, submit as follows:

<pre>
$ mpirun -verbose 1 -partition <partition-name> -np 1 -timeout <time> \
-cwd $PWD -exe /bin/sleep 3600
</pre>

This test, if successful, will verify that the ZeptoOS compute and I/O node environments are booting correctly. We deliberately chose a system binary such as <tt>/bin/sleep</tt> instead of something from a network filesystem so that even if the network filesystem does not come up for some reason, the test can still succeed.

If everything works out fine, messages such as the following will be found in the error stream (''jobid''.error file if using Cobalt):

<pre>
FE_MPI (Info) : initialize() - using jobname '' provided by scheduler interface
FE_MPI (Info) : Invoking mpirun backend
FE_MPI (Info) : connectToServer() - Handshake successful
BRIDGE (Info) : rm_set_serial() - The machine serial number (alias) is BGP
FE_MPI (Info) : Preparing partition
BE_MPI (Info) : Examining specified partition
BE_MPI (Info) : Checking partition ANL-R00-M1-N12-64 initial state ...
BE_MPI (Info) : Partition ANL-R00-M1-N12-64 initial state = FREE ('F')
BE_MPI (Info) : Checking partition owner...
BE_MPI (Info) : Setting new owner
BE_MPI (Info) : Initiating boot of the partition
BE_MPI (Info) : Waiting for partition ANL-R00-M1-N12-64 to boot...
BE_MPI (Info) : Partition is ready
BE_MPI (Info) : Done preparing partition
FE_MPI (Info) : Adding job
BE_MPI (Info) : Adding job to database...
FE_MPI (Info) : Job added with the following id: 98461
FE_MPI (Info) : Starting job 98461
FE_MPI (Info) : Waiting for job to terminate
BE_MPI (Info) : IO - Threads initialized
BE_MPI (Info) : I/O input runner thread terminated
</pre>

(we stripped the timestamp prefixes to make the lines shorter)

If these messages are immediately followed by error messages, then there is a problem. One common instance would be:

<pre>
BE_MPI (Info) : I/O output runner thread terminated
BE_MPI (Info) : Job 98463 switched to state ERROR ('E')
BE_MPI (ERROR): Job execution failed
[...]
BE_MPI (ERROR): The error message in the job record is as follows:
BE_MPI (ERROR): "Load failed on 172.16.3.11: Program segment is not 1MB aligned"
</pre>

This error indicates that the job was submitted to the default software environment, not to ZeptoOS (at the very least, the default I/O node ramdisk was used). You need to go back to the [[Installation#Setting up a kernel profile|Installation]] section to fix the problem. Information from the system log files can be useful to diagnose the problem.

==Log files==

===I/O node===

Every I/O node has its own log file located in <tt>/bgsys/logs/BGP/</tt>, with a name such as <tt>R*-M*-N*-J*.log</tt>. This name will generally correspond to the name of the partition where the job was running. Above, our job ran on <tt>ANL-R00-M1-N12-64</tt> (we could see that in the error stream; Cobalt users can also use <tt>[c]qstat</tt>); a corresponding I/O node log file on Argonne machines will be <tt>R00-M1-N12-J00.log</tt>. This is what a log file from a successful ZeptoOS boot looks like:

<pre>Linux version 2.6.16.46-297 (geeko@buildhost) (gcc version 4.1.2 (BGP)) #1 SMP Wed Apr 22 15:04:42 CDT 2009
Kernel command line: console=bgcons root=/dev/ram0 lpj=8500000
init started: BusyBox v1.4.2 (2008-04-10 05:20:01 UTC) multi-call binary
Starting RPC portmap daemon..done
eth0: Link status [RX+,TX+]
mount server reported tcp not available, falling back to udp
mount: RPC: Remote system error - No route to host
Zepto ION startup-00
eth0 Link encap:Ethernet HWaddr 00:14:5E:7D:0C:57
inet addr:172.16.3.15 Bcast:172.31.255.255 Mask:255.240.0.0
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:880 errors:0 dropped:0 overruns:0 frame:0
TX packets:1009 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:3878545 (3.6 Mb) TX bytes:151458 (147.9 Kb)
Interrupt:32
Zepto ION startup-00 done
done
Starting syslog servicesDec 31 18:00:36 ion-15 syslogd 1.4.1: restart.
done
Starting network time protocol daemon (NTPD) using 172.17.3.1
May 1 12:57:11 ion-15 ntpdate[642]: step time server 172.17.3.1 offset 1241200617.470271 sec
May 1 12:57:11 ion-15 ntpd[653]: ntpd 4.2.0a@1.1196-r Sat Oct 4 00:01:53 UTC 2008 (1)
May 1 12:57:11 ion-15 ntpd[653]: precision = 1.000 usec
May 1 12:57:11 ion-15 ntpd[653]: Listening on interface wildcard, 0.0.0.0#123
May 1 12:57:11 ion-15 ntpd[653]: Listening on interface eth0, 172.16.3.15#123
May 1 12:57:11 ion-15 ntpd[653]: Listening on interface lo, 127.0.0.1#123
May 1 12:57:11 ion-15 ntpd[653]: kernel time sync status 0040
done
Enabling ssh
Mounting site filesystems
done
Loading PVFS2 kernel module done
Sleeping 0 seconds before starting PVFS done
Starting PVFS2 client done
Sleeping 10 seconds before mounting PVFS
done
Mounting PVFS2 filesystems done
Starting SSH daemonMay 1 12:57:21 ion-15 sshd[833]: Server listening on 0.0.0.0 port 22.
done
Zepto ION startup-12
Zepto ION startup-12 done
Starting GPFS
May 1 12:57:26 ion-15 syslogd 1.4.1: restart.
/etc/init.d/rc3.d/S40gpfs: GPFS is ready on I/O node ion-15 : 172.16.3.15 : R00-M1-N12-J00
ln: creating symbolic link `/home/acherryl/acherryl' to `/gpfs/home/acherryl': File exists
ln: creating symbolic link `/home/bgpadmin/bgpadmin' to `/gpfs/home/bgpadmin': File exists
ln: creating symbolic link `/home/davidr/davidr' to `/gpfs/home/davidr': File exists
ln: creating symbolic link `/home/scullinl/scullinl' to `/gpfs/home/scullinl': File exists
Starting ZOID...
done
Zepto ION startup-99
Zepto ION startup-99 done
May 1 17:57:59 ion-15 init: Starting pid 2823, console /dev/console: '/bin/sh'
BusyBox v1.4.2 (2008-10-04 00:02:35 UTC) Built-in shell (ash)
Enter 'help' for a list of built-in commands.
/bin/sh: can't access tty; job control turned off
~ #
</pre>

(again, we stripped the prefixes to make the lines shorter)

Messages such as <tt>Zepto ION startup</tt> or <tt>Starting ZOID</tt> clearly indicate that a ZeptoOS I/O node ramdisk is being used. If one instead mistakenly booted with the default ramdisk, this could be recognized by messages such as:

<pre>
Starting CIO services
[ciod:initialized] done
</pre>

(<tt>ciod</tt> is ''never'' started when using Zepto Compute Node Linux)

In addition to verifying the ramdisk, the correct I/O node kernel can also be verified using the I/O node logfile by checking the kernel build timestamp in the first line of the boot log. As of this writing the default kernel on the Argonne machines has a timestamp of <tt>Wed Oct 29 18:51:19 UTC 2008</tt>; as can be seen above, the ZeptoOS kernel was built more recently.

===Compute node===

All the compute nodes on the machine share the same MMCS log file, located in <tt>/bgsys/logs/BGP/</tt>. The name of the log file is not fixed (it contains a timestamp), but <tt>sn1-bgdb0-mmcs_db_server-current.log</tt> always links to the current file. Because the file is shared with other jobs, we recommend grepping it for the user name, the partition name, or both.

A correct boot log when booting ZeptoOS will look something like this:

<pre>
iskra:ANL-R00-M1-N12-64 {20}.0: Common Node Services V1R3M0 (efix:0)
iskra:ANL-R00-M1-N12-64 {20}.0: Licensed Machine Code - Property of IBM.
iskra:ANL-R00-M1-N12-64 {20}.0: Blue Gene/P Licensed Machine Code.
iskra:ANL-R00-M1-N12-64 {20}.0: Copyright IBM Corp., 2006, 2007 All Rights Reserved.
iskra:ANL-R00-M1-N12-64 {20}.0: Z: Zepto Linux Kernel relocating CNS... dst=80280000 src=fff40000 size=262144
iskra:ANL-R00-M1-N12-64 {20}.0: Z: CNS is successfully relocated to 00280000 in physical memory
iskra:ANL-R00-M1-N12-64 {20}.0: Linux version 2.6.19.2-g66cbca2d (kazutomo@login1) (gcc version 4.1.2 (BGP)) #12 SMP Tue Apr 21 12:58:11 CDT 2009
iskra:ANL-R00-M1-N12-64 {20}.0: Zone PFN ranges:
iskra:ANL-R00-M1-N12-64 {20}.0: DMA 0 -> 28672
iskra:ANL-R00-M1-N12-64 {20}.0: Normal 28672 -> 28672
iskra:ANL-R00-M1-N12-64 {20}.0: early_node_map[1] active PFN ranges
iskra:ANL-R00-M1-N12-64 {20}.1: 0: 0 -> 28672
iskra:ANL-R00-M1-N12-64 {20}.1: Built 1 zonelists. Total pages: 28658
iskra:ANL-R00-M1-N12-64 {20}.1: Kernel command line: console=bgcons root=/dev/ram0 lpj=8500000
iskra:ANL-R00-M1-N12-64 {20}.1: PID hash table entries: 4096 (order: 12, 16384 bytes)
iskra:ANL-R00-M1-N12-64 {20}.0: Dentry cache hash table entries: 262144 (order: 4, 1048576 bytes)
iskra:ANL-R00-M1-N12-64 {20}.0: Inode-cache hash table entries: 131072 (order: 3, 524288 bytes)
iskra:ANL-R00-M1-N12-64 {20}.0: Memory: 1826560k available (1408k kernel code, 832k data, 192k init, 0k highmem)
iskra:ANL-R00-M1-N12-64 {20}.0: Calibrating delay loop (skipped)... 1700.00 BogoMIPS preset
iskra:ANL-R00-M1-N12-64 {20}.0: Mount-cache hash table entries: 8192
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 1 done callin...
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 1 done setup...
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 1 done timebase take...
iskra:ANL-R00-M1-N12-64 {20}.0: Processor 1 found.
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 2 done callin...
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 2 done setup...
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 2 done timebase take...
iskra:ANL-R00-M1-N12-64 {20}.0: Processor 2 found.
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 3 done callin...
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 3 done setup...
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 3 done timebase take...
iskra:ANL-R00-M1-N12-64 {20}.0: Processor 3 found.
iskra:ANL-R00-M1-N12-64 {20}.0: Brought up 4 CPUs
iskra:ANL-R00-M1-N12-64 {20}.0: migration_cost=0
iskra:ANL-R00-M1-N12-64 {20}.0: checking if image is initramfs... it is
iskra:ANL-R00-M1-N12-64 {20}.0: Freeing initrd memory: 2575k freed
iskra:ANL-R00-M1-N12-64 {20}.0: NET: Registered protocol family 16
iskra:ANL-R00-M1-N12-64 {20}.0: NET: Registered protocol family 2
iskra:ANL-R00-M1-N12-64 {20}.0: IP route cache hash table entries: 16384 (order: 0, 65536 bytes)
iskra:ANL-R00-M1-N12-64 {20}.0: TCP established hash table entries: 65536 (order: 3, 524288 bytes)
iskra:ANL-R00-M1-N12-64 {20}.0: TCP bind hash table entries: 32768 (order: 2, 262144 bytes)
iskra:ANL-R00-M1-N12-64 {20}.0: TCP: Hash tables configured (established 65536 bind 32768)
iskra:ANL-R00-M1-N12-64 {20}.0: TCP reno registered
iskra:ANL-R00-M1-N12-64 {20}.0: fuse init (API version 7.7)
iskra:ANL-R00-M1-N12-64 {20}.0: io scheduler noop registered (default)
iskra:ANL-R00-M1-N12-64 {20}.0: RAMDISK driver initialized: 16 RAM disks of 32768K size 1024 blocksize
iskra:ANL-R00-M1-N12-64 {20}.0: tun: Universal TUN/TAP device driver, 1.6
iskra:ANL-R00-M1-N12-64 {20}.0: tun: (C) 1999-2004 Max Krasnyansky <maxk@qualcomm.com>
iskra:ANL-R00-M1-N12-64 {20}.0: TCP cubic registered
iskra:ANL-R00-M1-N12-64 {20}.0: NET: Registered protocol family 1
iskra:ANL-R00-M1-N12-64 {20}.0: NET: Registered protocol family 17
iskra:ANL-R00-M1-N12-64 {20}.0: NET: Registered protocol family 15
iskra:ANL-R00-M1-N12-64 {20}.0: Freeing unused kernel memory: 192k init
iskra:ANL-R00-M1-N12-64 {20}.0: init started: BusyBox(for ZeptoOS Compute Node) v1.12.1 (2009-04-21 16:08:55 CDT)
</pre>

This is easy to distinguish from a boot log of the default light-weight kernel, which will consist of the first four lines ''only''.

The MMCS log file contains other useful information besides the boot log of the compute nodes. Before the kernel starts booting, the following messages related to the newly submitted job can be found there:

<pre>
DBBlockCmd DatabaseBlockCommandThread started: block ANL-R00-M1-N12-64, user iskra, action 1
DBBlockCmd setusername iskra
iskra db_allocate ANL-R00-M1-N12-64
iskra DBConsoleController::setAllocating() ANL-R00-M1-N12-64
iskra block state C
iskra DBConsoleController::addBlock(ANL-R00-M1-N12-64)
iskra:ANL-R00-M1-N12-64 BlockController::connect()
iskra:ANL-R00-M1-N12-64 connecting to mcServer at 127.0.0.1:1206
Connected to MCServer as iskra@sn1. Client version 3. Server version 3 on fd 101
iskra:ANL-R00-M1-N12-64 connected to mcServer
iskra:ANL-R00-M1-N12-64 mcServer target set ANL-R00-M1-N12-64 created
iskra:ANL-R00-M1-N12-64 mcServer target set ANL-R00-M1-N12-64 opened
iskra:ANL-R00-M1-N12-64 {0} I/O log file: /bgsys/logs/BGP/R00-M1-N12-J00.log
iskra:ANL-R00-M1-N12-64 MailboxListener starting
iskra:ANL-R00-M1-N12-64 DBConsoleController::doneAllocating() ANL-R00-M1-N12-64
iskra:ANL-R00-M1-N12-64 BlockController::boot_block \
uloader=/bgsys/argonne-utils/partitions/ANL-R00-M1-N12-64/uloader \
cnload=/bgsys/argonne-utils/partitions/ANL-R00-M1-N12-64/CNS,/bgsys/argonne-utils/partitions/ANL-R00-M1-N12-64/CNK \
ioload=/bgsys/argonne-utils/partitions/ANL-R00-M1-N12-64/CNS,/bgsys/argonne-utils/partitions/ANL-R00-M1-N12-64/INK,/bgsys/argonne-utils/partitions/ANL-R00-M1-N12-64/ramdisk
iskra:ANL-R00-M1-N12-64 boot_block cookie: 587867023 compute_nodes: 64 io_nodes: 1
</pre>

Of particular relevance is the pathname to the I/O node log file(s) (if it cannot be easily guessed from the partition name) and the pathnames to the kernels and ramdisks used to boot the partition.

After the kernel boot log, the log file will also contain information about subsequent phases of starting a job:

<pre>
iskra:ANL-R00-M1-N12-64 I/O node initialized: R00-M1-N12-J00
iskra:ANL-R00-M1-N12-64 DBBlockController::waitBoot(ANL-R00-M1-N12-64) block initialization successful
iskra DatabaseBlockCommandThread stopped
DBJobCmd DatabaseJobCommandThread started: job 98461, user iskra, action 1
DBJobCmd setusername iskra
iskra Starting Job 98461
New thread 4398305505840, for jobid 98461
selectBlock(): ANL-R00-M1-N12-64 iskra(1) connected state: I owner: iskra
ANL-R00-M1-N12-64 Jobid is 98461, homedir is /gpfs/home/iskra
ANL-R00-M1-N12-64 persist: 1
ANL-R00-M1-N12-64 connecting to mpirun...
ANL-R00-M1-N12-64 setting mpirun stream, fd=386
ANL-R00-M1-N12-64 contacting control node 0 at 172.16.3.15:7000
ANL-R00-M1-N12-64 connected to control node 0 at 172.16.3.15:7000
ANL-R00-M1-N12-64 Job::load() /bin/sleep
ANL-R00-M1-N12-64 Job loaded: 98461
ANL-R00-M1-N12-64 About to start /bin/sleep
ANL-R00-M1-N12-64 Job 98461 set to RUNNING
iskra:ANL-R00-M1-N12-64 {20}.0: floating point used in kernel (task=8080cfe0, pc=80017064)
</pre>

==Interactive login==

We are assuming at this point that launching <tt>/bin/sleep</tt> has been successful and that the "job" is running. We can now start an interactive session on our BG/P resources. Probably the most complicated part of this operation is finding the IP address of the I/O node(s). The allocation of I/O nodes to partitions is fixed, so on a small machine one could simply make a list. This information is also available in the log files discussed above.

The IP address is printed near the top of the I/O node boot log, as part of the interface configuration of the Ethernet device:

<pre>
eth0 Link encap:Ethernet HWaddr 00:14:5E:7D:0C:57
inet addr:172.16.3.15 Bcast:172.31.255.255 Mask:255.240.0.0
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:880 errors:0 dropped:0 overruns:0 frame:0
TX packets:1009 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:3878545 (3.6 Mb) TX bytes:151458 (147.9 Kb)
Interrupt:32
</pre>

In this case, the address is <tt>172.16.3.15</tt> (the <tt>inet addr</tt> value).

The IP address is also available from the MMCS log file:

<pre>
ANL-R00-M1-N12-64 contacting control node 0 at 172.16.3.15:7000
</pre>

With larger partitions that include multiple I/O nodes, querying the MMCS logfile is probably better, as it will list all the addresses.
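As a minimal sketch of such a query, the shell function below prints the unique I/O node addresses that the MMCS log reports for one partition. The function name <tt>ion_addresses</tt> is ours, not part of ZeptoOS, and the pattern assumes the <tt>contacting control node</tt> message format shown above.

```shell
# List the unique I/O node IP addresses reported for a partition.
# Reads MMCS log text on stdin; $1 is the partition name.
# Assumes lines of the form:
#   <partition> contacting control node <n> at <ip>:<port>
ion_addresses() {
    partition=$1
    grep -F "$partition contacting control node" |
        sed -n 's/.* at \([0-9.][0-9.]*\):[0-9][0-9]*.*/\1/p' |
        sort -u
}
```

It could be invoked as <tt>ion_addresses ANL-R00-M1-N12-64 < /bgsys/logs/BGP/sn1-bgdb0-mmcs_db_server-current.log</tt>.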

Once the IP address is known, one can simply use SSH:

<pre>
iskra@login1.surveyor:~> ssh 172.16.3.15

BusyBox v1.4.2 (2008-10-04 00:02:35 UTC) Built-in shell (ash)
Enter 'help' for a list of built-in commands.

/gpfs/home/iskra $ hostname
ion-15
/gpfs/home/iskra $
</pre>

If everything is configured correctly, SSH will only let in root and the partition owner; no other unprivileged user will be allowed on the node. However, this might require site-specific customization to work properly. To enable access for the partition owner, one might need to make adjustments to [[ZOID#The /bin.rd/update_passwd_file.sh file|update_passwd_file.sh]]. To enable password-less login for the partition owners without requiring them to set up personal SSH keypairs, we recommend adding the names of the front end nodes to the <tt>shosts.equiv</tt> file, found in <tt>ramdisk/ION/ramdisk-add/etc/ssh.zepto/</tt> (it is empty by default; remember to use the names from the network that interconnects front end and I/O nodes, which might be different from hostnames, e.g., at Argonne we need to add the <tt>-data</tt> suffix to the hostnames). Until this has all been set up, one might prefer to log on as root (<tt>ssh -l root</tt>), passing the password provided while [[Configuration#Building|building]] the ZeptoOS environment.

Also, even when the partition owner is correctly set up, there will be a time window while booting the I/O node when the SSH daemon is already running, but a job has not yet been started; during that window, the partition owner cannot log on. If that happens, wait a few seconds and try again.
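Waiting and retrying can be automated. The following is a sketch of a generic retry helper; the name <tt>retry</tt> and its argument order are our own choice, and nothing here is provided by ZeptoOS.

```shell
# Run a command until it succeeds.
# $1: maximum number of attempts, $2: seconds to sleep between attempts,
# remaining arguments: the command to run.
# Returns 0 on the first success, otherwise the status of the last attempt.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while :; do
        "$@" && return 0
        status=$?
        [ "$n" -ge "$attempts" ] && return "$status"
        n=$((n + 1))
        sleep "$delay"
    done
}
```

For example, <tt>retry 5 3 ssh 172.16.3.15 true</tt> would probe the I/O node up to five times, three seconds apart, and return as soon as a login succeeds. This assumes that non-interactive authentication has already been set up as described above.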

Here's part of the <tt>ps</tt> output from the I/O node:

<pre>
/gpfs/home/iskra $ ps -ef
UID PID PPID C STIME TTY TIME CMD
[...]
65534 98 1 0 16:09 ? 00:00:00 /sbin/portmap
root 108 19 0 16:09 ? 00:00:00 [rpciod/0]
root 109 19 0 16:09 ? 00:00:00 [rpciod/1]
root 110 19 0 16:09 ? 00:00:00 [rpciod/2]
root 111 19 0 16:09 ? 00:00:00 [rpciod/3]
root 570 1 0 16:09 ? 00:00:00 /sbin/syslogd
root 577 1 0 16:09 ? 00:00:00 /sbin/klogd -c 1 -x -x
ntp 653 1 0 16:09 ? 00:00:00 /usr/sbin/ntpd -p /var/run/ntpd.
root 688 1 0 16:09 ? 00:00:00 [lockd]
root 775 1 0 16:09 ? 00:00:00 /bgsys/iosoft/pvfs2/sbin/pvfs2-c
root 776 775 0 16:09 ? 00:00:00 pvfs2-client-core --child -a 5 -
root 833 1 0 16:10 ? 00:00:00 /usr/sbin/sshd -o PidFile=/var/r
root 1016 1 0 16:10 ? 00:00:00 /bin/ksh /usr/lpp/mmfs/bin/runmm
root 1079 1 0 16:10 ? 00:00:00 [nfsWatchKproc]
root 1080 1 0 16:10 ? 00:00:00 [gpfsSwapdKproc]
root 1146 1016 0 16:10 ? 00:00:01 /usr/lpp/mmfs/bin//mmfsd
root 1153 1 0 16:10 ? 00:00:00 [mmkproc]
root 1152 1 0 16:10 ? 00:00:00 [mmkproc]
root 1154 1 0 16:10 ? 00:00:00 [mmkproc]
iskra 2810 1 98 16:10 ? 00:04:09 /bin.rd/zoid -a 8 -m unix_impl.s
root 2823 1 0 16:10 ? 00:00:00 /bin/sh
root 3328 833 0 16:10 ? 00:00:00 sshd: iskra [priv]
iskra 3332 3328 0 16:10 ? 00:00:00 sshd: iskra@ttyp0
iskra 3333 3332 0 16:10 ttyp0 00:00:00 -sh
iskra 3346 3333 0 16:14 ttyp0 00:00:00 ps -ef
/gpfs/home/iskra $
</pre>

The I/O nodes run a small Linux setup with the root filesystem in the ramdisk. Custom processes can be started, just like on any ordinary Linux node. In the example above, it is mostly a few system daemons and the remote filesystem clients (GPFS, PVFS). Please verify at this stage that the remote filesystems have been mounted correctly.

One custom process running on the node is [[ZOID]], the I/O forwarding and job control daemon, which enables the communication with the compute nodes. One of the facilities offered by ZOID is IP forwarding between the I/O node and the compute nodes, implemented using the virtual network tunneling device available in Linux:

<pre>
/gpfs/home/iskra $ ifconfig tun0
tun0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
inet addr:192.168.1.254 P-t-P:192.168.1.254 Mask:255.255.255.255
UP POINTOPOINT RUNNING NOARP MULTICAST MTU:65535 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:500
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
/gpfs/home/iskra $
</pre>

At least on Argonne machines, with a 64:1 ratio of compute nodes to I/O nodes, compute nodes have addresses <tt>192.168.1.1</tt> to <tt>192.168.1.64</tt> (the last octet of the address is the [[FAQ#Pset rank|pset rank]]). Somewhat confusingly, the first compute node (compute node <tt>0</tt>) has IP address <tt>192.168.1.64</tt>, so if one submits a one-node job as we did, that is the IP address that needs to be used to log on to that sole running compute node. The IP address of the second compute node is... <tt>192.168.1.59</tt>. On a machine with a 16:1 ratio of compute nodes to I/O nodes, the first compute node has IP address <tt>192.168.1.16</tt>. Do not blame us for this chaos – blame IBM :-).
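The rank-to-address rule can be written down as a small helper. This is only a sketch that assumes the <tt>192.168.1.x</tt> numbering described above; the function name <tt>cn_address</tt> and its optional pset-size argument are hypothetical. It maps a pset rank to an address and does not attempt to compute the pset rank of a given compute node.

```shell
# Print the virtual-network address of the compute node with pset rank $1.
# $2 is the number of compute nodes per I/O node (default 64, as at Argonne).
cn_address() {
    rank=$1
    size=${2:-64}
    case $rank in
        '' | *[!0-9]*)
            echo "cn_address: pset rank must be a positive integer" >&2
            return 1
            ;;
    esac
    if [ "$rank" -lt 1 ] || [ "$rank" -gt "$size" ]; then
        echo "cn_address: pset rank $rank is outside 1..$size" >&2
        return 1
    fi
    echo "192.168.1.$rank"
}
```

With the defaults, <tt>cn_address 64</tt> prints <tt>192.168.1.64</tt>, the address of the first compute node in the example above.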

The compute nodes are running a <tt>telnet</tt> daemon, and no password is required to log on to them:

<pre>
/gpfs/home/iskra $ telnet 192.168.1.64

Entering character mode
Escape character is '^]'.

BusyBox(for ZeptoOS Compute Node) v1.12.1 (2009-04-21 16:08:55 CDT) built-in shell (ash)
Enter 'help' for a list of built-in commands.

~ #
</pre>

The IP address of the I/O node on this virtual network is <tt>192.168.1.254</tt>. The network is local to each I/O node, so for larger jobs, there will be multiple distinct virtual networks that cannot communicate with each other, and the IP addresses will duplicate.

Here's part of the <tt>ps</tt> output from the compute node:

<pre>
~ # ps -ef
PID USER VSZ STAT COMMAND
[...]
34 root 5440 S /bin/sh /etc/init.d/rc.sysinit
44 root 5504 S /sbin/telnetd -l /bin/sh
47 root 6528 S /sbin/inetd
48 root 46400 R N /sbin/control
62 root 7872 S /bin/zoid-fuse -o allow_other -s /fuse
116 root 5248 S /bin/sleep 3600
118 root 5504 S /bin/sh
</pre>

Compute nodes have an even more stripped-down environment than the I/O nodes. There are no user accounts – everything runs as root, including the application processes. This is not a security concern, because the only practical way for a compute node to communicate with the outside world is through the I/O node, and I/O nodes ''do'' enforce user-level access control.

There are two custom processes running on each compute node:

'''control''' is a job management daemon responsible for tasks such as the launching of application processes, for the forwarding of stdin/out/err data, and for the management of the virtual network tunneling device from the compute node side. Do not interfere with this process in any way; this would likely make the node inaccessible.

'''zoid-fuse''' is a FUSE ([http://fuse.sourceforge.net/ Filesystem in Userspace]) client responsible for making the filesystems from the I/O nodes available to ordinary POSIX-compliant processes running on the compute nodes. The whole filesystem namespace from the I/O nodes is made available on the compute nodes under <tt>/fuse/</tt>, and symbolic links such as <tt>/home -> /fuse/home</tt> are set up to keep the login and I/O node pathnames valid on the compute nodes. Please verify that this is correctly set up. We do not foresee a need to change this setup, but should that prove necessary, the responsible <tt>fuse-start</tt> and <tt>fuse-stop</tt> scripts can be found under <tt>ramdisk/CN/tree/bin</tt>.
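One way to check the symbolic links is sketched below. The helper name <tt>check_fuse_link</tt> is hypothetical, and the sketch assumes that a <tt>readlink</tt> command is available in the shell where it is run, which we have not verified for the compute node ramdisk.

```shell
# Succeed if $1 is a symbolic link whose target lies under the FUSE
# mount point $2 (default: /fuse). Prints "link -> target" on success.
check_fuse_link() {
    link=$1
    root=${2:-/fuse}
    [ -L "$link" ] || { echo "$link is not a symbolic link" >&2; return 1; }
    target=$(readlink "$link")
    case $target in
        "$root"/*)
            echo "$link -> $target"
            ;;
        *)
            echo "$link points to $target, not under $root" >&2
            return 1
            ;;
    esac
}
```

On a compute node, <tt>check_fuse_link /home</tt> would be expected to print <tt>/home -> /fuse/home</tt> if the setup matches the description above.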

==Shell script job==

Assuming that the above steps have been successful, one can now test running a simple job from a network filesystem, such as one's home directory.

Here is a sample shell script to try:

<pre>
#!/bin/sh

. /proc/personality.sh

while true; do
echo "Node $BG_RANK_IN_PSET running (stdout)"
echo "Node $BG_RANK_IN_PSET running (stderr)" 1>&2
sleep 10
done
</pre>

(please see the [[FAQ#Pset rank|FAQ]] for the explanation of <tt>/proc/personality.sh</tt> and <tt>BG_RANK_IN_PSET</tt>)

Please create the script file on the network filesystem, set the executable bit (<tt>chmod 755</tt>) and submit it. Verify that the script starts correctly and that at least the standard error output is visible immediately. The script prints a line of output from each node every ten seconds. It does so both to the standard output and to the standard error, because, depending on software configuration, the standard output stream could be buffered. If that is the case, kill the job and verify that the standard output data did appear.

==MPI and OpenMP jobs==

The final tests involve parallel programming jobs, respectively MPI and OpenMP. Use the test programs provided with the distribution. From the top level directory:
<pre>
$ cd comm/testcodes
</pre>

===Compiling===

The programs can be compiled on a login node using:

<pre>
$ /path/to/install/bin/zmpicc -o mpi-test-linux mpi-test.c
$ /path/to/install/bin/zmpixlc_r -qsmp=omp -o omp-test-linux omp-test.c
</pre>

===Submitting===

Submit the MPI test like any other job; use one of the below:

<pre>
$ cqsub -k <profile-name> -t <time> -n <number-of-processes> $PWD/mpi-test-linux
$ qsub --kernel <profile-name> -t <time> -n <number-of-processes> $PWD/mpi-test-linux
$ mpirun -verbose 1 -partition <partition-name> -np <number-of-processes> -timeout <time> \
-cwd $PWD -exe $PWD/mpi-test-linux
</pre>

For the OpenMP test, we pass the number of OpenMP threads to use in the <tt>OMP_NUM_THREADS</tt> variable:

<pre>
$ cqsub -k <profile-name> -t <time> -n 1 -e OMP_NUM_THREADS=<num> $PWD/omp-test-linux
$ qsub --kernel <profile-name> -t <time> -n 1 --env OMP_NUM_THREADS=<num> $PWD/omp-test-linux
$ mpirun -verbose 1 -partition <partition-name> -np 1 -timeout <time> \
-cwd $PWD -env OMP_NUM_THREADS=<num> -exe $PWD/omp-test-linux
</pre>

The MPI test benchmarks the performance of various MPI operations. The OpenMP test is just a parallel "Hello world".

----
[[Installation]] | [[ZeptoOS_Documentation|Top]] | [[MPICH, DCMF, and SPI]]

Testing

2009-05-07T20:01:25Z

Iskra: /* Interactive login */

[[Installation]] | [[ZeptoOS_Documentation|Top]] | [[MPICH, DCMF, and SPI]]
----

Once ZeptoOS is configured and installed, it is time to test it. Here are a few trivial tests to verify that the environment is working:

==The /bin/sleep job==

If you are using Cobalt, submit using either of the commands below:

<pre>
$ cqsub -k <profile-name> -t <time> -n 1 /bin/sleep 3600
$ qsub --kernel <profile-name> -t <time> -n 1 /bin/sleep 3600
</pre>

If you are using <tt>mpirun</tt> directly, submit as follows:

<pre>
$ mpirun -verbose 1 -partition <partition-name> -np 1 -timeout <time> \
-cwd $PWD -exe /bin/sleep 3600
</pre>

This test, if successful, will verify that the ZeptoOS compute and I/O node environments are booting correctly. We deliberately chose a system binary such as <tt>/bin/sleep</tt> instead of something from a network filesystem so that even if the network filesystem does not come up for some reason, the test can still succeed.

If everything works out fine, messages such as the following will be found in the error stream (''jobid''.error file if using Cobalt):

<pre>
FE_MPI (Info) : initialize() - using jobname '' provided by scheduler interface
FE_MPI (Info) : Invoking mpirun backend
FE_MPI (Info) : connectToServer() - Handshake successful
BRIDGE (Info) : rm_set_serial() - The machine serial number (alias) is BGP
FE_MPI (Info) : Preparing partition
BE_MPI (Info) : Examining specified partition
BE_MPI (Info) : Checking partition ANL-R00-M1-N12-64 initial state ...
BE_MPI (Info) : Partition ANL-R00-M1-N12-64 initial state = FREE ('F')
BE_MPI (Info) : Checking partition owner...
BE_MPI (Info) : Setting new owner
BE_MPI (Info) : Initiating boot of the partition
BE_MPI (Info) : Waiting for partition ANL-R00-M1-N12-64 to boot...
BE_MPI (Info) : Partition is ready
BE_MPI (Info) : Done preparing partition
FE_MPI (Info) : Adding job
BE_MPI (Info) : Adding job to database...
FE_MPI (Info) : Job added with the following id: 98461
FE_MPI (Info) : Starting job 98461
FE_MPI (Info) : Waiting for job to terminate
BE_MPI (Info) : IO - Threads initialized
BE_MPI (Info) : I/O input runner thread terminated
</pre>

(we stripped the timestamp prefixes to make the lines shorter)

If these messages are immediately followed by other, error messages, then there is a problem. One common instance would be:

<pre>
BE_MPI (Info) : I/O output runner thread terminated
BE_MPI (Info) : Job 98463 switched to state ERROR ('E')
BE_MPI (ERROR): Job execution failed
[...]
BE_MPI (ERROR): The error message in the job record is as follows:
BE_MPI (ERROR): "Load failed on 172.16.3.11: Program segment is not 1MB aligned"
</pre>

This error indicates that the job was submitted to the default software environment, not to ZeptoOS (at the very least, the default I/O node ramdisk was used). You need to go back to the [[Installation#Setting up a kernel profile|Installation]] section to fix the problem. Information from the system log files can be useful to diagnose the problem.

==Log files==

===I/O node===

Every I/O node has its own log file located in <tt>/bgsys/logs/BGP/</tt>, with a name such as <tt>R*-M*-N*-J*.log</tt>. This name will generally correspond to the name of the partition where the job was running. Above, our job ran on <tt>ANL-R00-M1-N12-64</tt> (we could see that in the error stream; Cobalt users can also use <tt>[c]qstat</tt>); a corresponding I/O node log file on Argonne machines will be <tt>R00-M1-N12-J00.log</tt>. This is what a log file from a successful ZeptoOS boot looks like:

<pre>Linux version 2.6.16.46-297 (geeko@buildhost) (gcc version 4.1.2 (BGP)) #1 SMP Wed Apr 22 15:04:42 CDT 2009
Kernel command line: console=bgcons root=/dev/ram0 lpj=8500000
init started: BusyBox v1.4.2 (2008-04-10 05:20:01 UTC) multi-call binary
Starting RPC portmap daemon..done
eth0: Link status [RX+,TX+]
mount server reported tcp not available, falling back to udp
mount: RPC: Remote system error - No route to host
Zepto ION startup-00
eth0 Link encap:Ethernet HWaddr 00:14:5E:7D:0C:57
inet addr:172.16.3.15 Bcast:172.31.255.255 Mask:255.240.0.0
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:880 errors:0 dropped:0 overruns:0 frame:0
TX packets:1009 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:3878545 (3.6 Mb) TX bytes:151458 (147.9 Kb)
Interrupt:32
Zepto ION startup-00 done
done
Starting syslog servicesDec 31 18:00:36 ion-15 syslogd 1.4.1: restart.
done
Starting network time protocol daemon (NTPD) using 172.17.3.1
May 1 12:57:11 ion-15 ntpdate[642]: step time server 172.17.3.1 offset 1241200617.470271 sec
May 1 12:57:11 ion-15 ntpd[653]: ntpd 4.2.0a@1.1196-r Sat Oct 4 00:01:53 UTC 2008 (1)
May 1 12:57:11 ion-15 ntpd[653]: precision = 1.000 usec
May 1 12:57:11 ion-15 ntpd[653]: Listening on interface wildcard, 0.0.0.0#123
May 1 12:57:11 ion-15 ntpd[653]: Listening on interface eth0, 172.16.3.15#123
May 1 12:57:11 ion-15 ntpd[653]: Listening on interface lo, 127.0.0.1#123
May 1 12:57:11 ion-15 ntpd[653]: kernel time sync status 0040
done
Enabling ssh
Mounting site filesystems
done
Loading PVFS2 kernel module done
Sleeping 0 seconds before starting PVFS done
Starting PVFS2 client done
Sleeping 10 seconds before mounting PVFS
done
Mounting PVFS2 filesystems done
Starting SSH daemonMay 1 12:57:21 ion-15 sshd[833]: Server listening on 0.0.0.0 port 22.
done
Zepto ION startup-12
Zepto ION startup-12 done
Starting GPFS
May 1 12:57:26 ion-15 syslogd 1.4.1: restart.
/etc/init.d/rc3.d/S40gpfs: GPFS is ready on I/O node ion-15 : 172.16.3.15 : R00-M1-N12-J00
ln: creating symbolic link `/home/acherryl/acherryl' to `/gpfs/home/acherryl': File exists
ln: creating symbolic link `/home/bgpadmin/bgpadmin' to `/gpfs/home/bgpadmin': File exists
ln: creating symbolic link `/home/davidr/davidr' to `/gpfs/home/davidr': File exists
ln: creating symbolic link `/home/scullinl/scullinl' to `/gpfs/home/scullinl': File exists
Starting ZOID...
done
Zepto ION startup-99
Zepto ION startup-99 done
May 1 17:57:59 ion-15 init: Starting pid 2823, console /dev/console: '/bin/sh'
BusyBox v1.4.2 (2008-10-04 00:02:35 UTC) Built-in shell (ash)
Enter 'help' for a list of built-in commands.
/bin/sh: can't access tty; job control turned off
~ #
</pre>

(again, we stripped the prefixes to make the lines shorter)

Messages such as <tt>Zepto ION startup</tt> or <tt>Starting ZOID</tt> clearly indicate that a ZeptoOS I/O node ramdisk is being used. If one instead mistakenly booted with the default ramdisk, this could be recognized by messages such as:

<pre>
Starting CIO services
[ciod:initialized] done
</pre>

(<tt>ciod</tt> is ''never'' started when using Zepto Compute Node Linux)
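This check can be scripted. The sketch below classifies an I/O node log file by the messages quoted above; the function name <tt>ion_ramdisk_kind</tt> and its output words are our own. If a log file holds several boots, the result reflects any of them, so it may be preferable to inspect only the most recent part of the file.

```shell
# Classify an I/O node boot log ($1 is the log file) by its startup messages.
# Prints "zeptoos", "default", or "unknown"; returns non-zero for "unknown".
ion_ramdisk_kind() {
    log=$1
    if grep -q -e 'Zepto ION startup' -e 'Starting ZOID' "$log"; then
        echo zeptoos
    elif grep -q -e 'Starting CIO services' -e 'ciod:initialized' "$log"; then
        echo default
    else
        echo unknown
        return 1
    fi
}
```

For the example partition, this could be run as <tt>ion_ramdisk_kind /bgsys/logs/BGP/R00-M1-N12-J00.log</tt>.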

In addition to verifying the ramdisk, the correct I/O node kernel can also be verified using the I/O node logfile by checking the kernel build timestamp in the first line of the boot log. As of this writing the default kernel on the Argonne machines has a timestamp of <tt>Wed Oct 29 18:51:19 UTC 2008</tt>; as can be seen above, the ZeptoOS kernel was built more recently.

===Compute node===

All the compute nodes on the machine share the same MMCS log file, located in <tt>/bgsys/logs/BGP/</tt>. The name of the log file is not fixed (it contains a timestamp), but <tt>sn1-bgdb0-mmcs_db_server-current.log</tt> always links to the current file. Because the file is shared with other jobs, we recommend grepping it for user name, partition name, or both.
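A minimal sketch of such a filter follows. The function name <tt>mmcs_lines</tt> is hypothetical; it keeps only the lines that mention both the user name and the partition name, using fixed-string matching, so a user name that is a substring of another one would also match.

```shell
# Print the MMCS log lines that mention both a user and a partition.
# Reads the log on stdin; $1 is the user name, $2 is the partition name.
mmcs_lines() {
    grep -F -- "$1" | grep -F -- "$2"
}
```

A typical invocation would be <tt>mmcs_lines iskra ANL-R00-M1-N12-64 < /bgsys/logs/BGP/sn1-bgdb0-mmcs_db_server-current.log</tt>.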

A correct boot log when booting ZeptoOS will look something like this:

<pre>
iskra:ANL-R00-M1-N12-64 {20}.0: Common Node Services V1R3M0 (efix:0)
iskra:ANL-R00-M1-N12-64 {20}.0: Licensed Machine Code - Property of IBM.
iskra:ANL-R00-M1-N12-64 {20}.0: Blue Gene/P Licensed Machine Code.
iskra:ANL-R00-M1-N12-64 {20}.0: Copyright IBM Corp., 2006, 2007 All Rights Reserved.
iskra:ANL-R00-M1-N12-64 {20}.0: Z: Zepto Linux Kernel relocating CNS... dst=80280000 src=fff40000 size=262144
iskra:ANL-R00-M1-N12-64 {20}.0: Z: CNS is successfully relocated to 00280000 in physical memory
iskra:ANL-R00-M1-N12-64 {20}.0: Linux version 2.6.19.2-g66cbca2d (kazutomo@login1) (gcc version 4.1.2 (BGP)) #12 SMP Tue Apr 21 12:58:11 CDT 2009
iskra:ANL-R00-M1-N12-64 {20}.0: Zone PFN ranges:
iskra:ANL-R00-M1-N12-64 {20}.0: DMA 0 -> 28672
iskra:ANL-R00-M1-N12-64 {20}.0: Normal 28672 -> 28672
iskra:ANL-R00-M1-N12-64 {20}.0: early_node_map[1] active PFN ranges
iskra:ANL-R00-M1-N12-64 {20}.1: 0: 0 -> 28672
iskra:ANL-R00-M1-N12-64 {20}.1: Built 1 zonelists. Total pages: 28658
iskra:ANL-R00-M1-N12-64 {20}.1: Kernel command line: console=bgcons root=/dev/ram0 lpj=8500000
iskra:ANL-R00-M1-N12-64 {20}.1: PID hash table entries: 4096 (order: 12, 16384 bytes)
iskra:ANL-R00-M1-N12-64 {20}.0: Dentry cache hash table entries: 262144 (order: 4, 1048576 bytes)
iskra:ANL-R00-M1-N12-64 {20}.0: Inode-cache hash table entries: 131072 (order: 3, 524288 bytes)
iskra:ANL-R00-M1-N12-64 {20}.0: Memory: 1826560k available (1408k kernel code, 832k data, 192k init, 0k highmem)
iskra:ANL-R00-M1-N12-64 {20}.0: Calibrating delay loop (skipped)... 1700.00 BogoMIPS preset
iskra:ANL-R00-M1-N12-64 {20}.0: Mount-cache hash table entries: 8192
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 1 done callin...
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 1 done setup...
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 1 done timebase take...
iskra:ANL-R00-M1-N12-64 {20}.0: Processor 1 found.
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 2 done callin...
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 2 done setup...
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 2 done timebase take...
iskra:ANL-R00-M1-N12-64 {20}.0: Processor 2 found.
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 3 done callin...
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 3 done setup...
iskra:ANL-R00-M1-N12-64 {20}.0: CPU 3 done timebase take...
iskra:ANL-R00-M1-N12-64 {20}.0: Processor 3 found.
iskra:ANL-R00-M1-N12-64 {20}.0: Brought up 4 CPUs
iskra:ANL-R00-M1-N12-64 {20}.0: migration_cost=0
iskra:ANL-R00-M1-N12-64 {20}.0: checking if image is initramfs... it is
iskra:ANL-R00-M1-N12-64 {20}.0: Freeing initrd memory: 2575k freed
iskra:ANL-R00-M1-N12-64 {20}.0: NET: Registered protocol family 16
iskra:ANL-R00-M1-N12-64 {20}.0: NET: Registered protocol family 2
iskra:ANL-R00-M1-N12-64 {20}.0: IP route cache hash table entries: 16384 (order: 0, 65536 bytes)
iskra:ANL-R00-M1-N12-64 {20}.0: TCP established hash table entries: 65536 (order: 3, 524288 bytes)
iskra:ANL-R00-M1-N12-64 {20}.0: TCP bind hash table entries: 32768 (order: 2, 262144 bytes)
iskra:ANL-R00-M1-N12-64 {20}.0: TCP: Hash tables configured (established 65536 bind 32768)
iskra:ANL-R00-M1-N12-64 {20}.0: TCP reno registered
iskra:ANL-R00-M1-N12-64 {20}.0: fuse init (API version 7.7)
iskra:ANL-R00-M1-N12-64 {20}.0: io scheduler noop registered (default)
iskra:ANL-R00-M1-N12-64 {20}.0: RAMDISK driver initialized: 16 RAM disks of 32768K size 1024 blocksize
iskra:ANL-R00-M1-N12-64 {20}.0: tun: Universal TUN/TAP device driver, 1.6
iskra:ANL-R00-M1-N12-64 {20}.0: tun: (C) 1999-2004 Max Krasnyansky <maxk@qualcomm.com>
iskra:ANL-R00-M1-N12-64 {20}.0: TCP cubic registered
iskra:ANL-R00-M1-N12-64 {20}.0: NET: Registered protocol family 1
iskra:ANL-R00-M1-N12-64 {20}.0: NET: Registered protocol family 17
iskra:ANL-R00-M1-N12-64 {20}.0: NET: Registered protocol family 15
iskra:ANL-R00-M1-N12-64 {20}.0: Freeing unused kernel memory: 192k init
iskra:ANL-R00-M1-N12-64 {20}.0: init started: BusyBox(for ZeptoOS Compute Node) v1.12.1 (2009-04-21 16:08:55 CDT)
</pre>

This is easy to distinguish from a boot log of the default light-weight kernel, which will consist of the first four lines ''only''.

The MMCS log file contains other useful information besides the boot log of the compute nodes. Before the kernel starts booting, the following messages related to the newly submitted job can be found there:

<pre>
DBBlockCmd DatabaseBlockCommandThread started: block ANL-R00-M1-N12-64, user iskra, action 1
DBBlockCmd setusername iskra
iskra db_allocate ANL-R00-M1-N12-64
iskra DBConsoleController::setAllocating() ANL-R00-M1-N12-64
iskra block state C
iskra DBConsoleController::addBlock(ANL-R00-M1-N12-64)
iskra:ANL-R00-M1-N12-64 BlockController::connect()
iskra:ANL-R00-M1-N12-64 connecting to mcServer at 127.0.0.1:1206
Connected to MCServer as iskra@sn1. Client version 3. Server version 3 on fd 101
iskra:ANL-R00-M1-N12-64 connected to mcServer
iskra:ANL-R00-M1-N12-64 mcServer target set ANL-R00-M1-N12-64 created
iskra:ANL-R00-M1-N12-64 mcServer target set ANL-R00-M1-N12-64 opened
iskra:ANL-R00-M1-N12-64 {0} I/O log file: /bgsys/logs/BGP/R00-M1-N12-J00.log
iskra:ANL-R00-M1-N12-64 MailboxListener starting
iskra:ANL-R00-M1-N12-64 DBConsoleController::doneAllocating() ANL-R00-M1-N12-64
iskra:ANL-R00-M1-N12-64 BlockController::boot_block \
uloader=/bgsys/argonne-utils/partitions/ANL-R00-M1-N12-64/uloader \
cnload=/bgsys/argonne-utils/partitions/ANL-R00-M1-N12-64/CNS,/bgsys/argonne-utils/partitions/ANL-R00-M1-N12-64/CNK \
ioload=/bgsys/argonne-utils/partitions/ANL-R00-M1-N12-64/CNS,/bgsys/argonne-utils/partitions/ANL-R00-M1-N12-64/INK,/bgsys/argonne-utils/partitions/ANL-R00-M1-N12-64/ramdisk
iskra:ANL-R00-M1-N12-64 boot_block cookie: 587867023 compute_nodes: 64 io_nodes: 1
</pre>

Of particular relevance is the pathname to the I/O node log file(s) (if it cannot be easily guessed from the partition name) and the pathnames to the kernels and ramdisks used to boot the partition.

After the kernel boot log, the log file will also contain information about subsequent phases of starting a job:

<pre>
iskra:ANL-R00-M1-N12-64 I/O node initialized: R00-M1-N12-J00
iskra:ANL-R00-M1-N12-64 DBBlockController::waitBoot(ANL-R00-M1-N12-64) block initialization successful
iskra DatabaseBlockCommandThread stopped
DBJobCmd DatabaseJobCommandThread started: job 98461, user iskra, action 1
DBJobCmd setusername iskra
iskra Starting Job 98461
New thread 4398305505840, for jobid 98461
selectBlock(): ANL-R00-M1-N12-64 iskra(1) connected state: I owner: iskra
ANL-R00-M1-N12-64 Jobid is 98461, homedir is /gpfs/home/iskra
ANL-R00-M1-N12-64 persist: 1
ANL-R00-M1-N12-64 connecting to mpirun...
ANL-R00-M1-N12-64 setting mpirun stream, fd=386
ANL-R00-M1-N12-64 contacting control node 0 at 172.16.3.15:7000
ANL-R00-M1-N12-64 connected to control node 0 at 172.16.3.15:7000
ANL-R00-M1-N12-64 Job::load() /bin/sleep
ANL-R00-M1-N12-64 Job loaded: 98461
ANL-R00-M1-N12-64 About to start /bin/sleep
ANL-R00-M1-N12-64 Job 98461 set to RUNNING
iskra:ANL-R00-M1-N12-64 {20}.0: floating point used in kernel (task=8080cfe0, pc=80017064)
</pre>

==Interactive login==

We are assuming at this point that launching <tt>/bin/sleep</tt> has been successful and that the "job" is running. We can now start an interactive session on our BG/P resources. Probably the most complicated part of this operation is finding the IP address of the I/O node(s). The allocation of I/O nodes to partitions is fixed, so on a small machine one could simply make a list. This information is also available in the log files discussed above.

The IP address is printed near the top of the I/O node boot log, as part of the interface configuration of the Ethernet device:

<pre>
eth0 Link encap:Ethernet HWaddr 00:14:5E:7D:0C:57
inet addr:172.16.3.15 Bcast:172.31.255.255 Mask:255.240.0.0
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:880 errors:0 dropped:0 overruns:0 frame:0
TX packets:1009 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:3878545 (3.6 Mb) TX bytes:151458 (147.9 Kb)
Interrupt:32
</pre>

In this case, the address is <tt>172.16.3.15</tt> (the <tt>inet addr</tt> value).

The IP address is also available from the MMCS log file:

<pre>
ANL-R00-M1-N12-64 contacting control node 0 at 172.16.3.15:7000
</pre>

With larger partitions that include multiple I/O nodes, querying the MMCS logfile is probably better, as it will list all the addresses.

Once the IP address is known, one can simply use SSH:

<pre>
iskra@login1.surveyor:~> ssh 172.16.3.15

BusyBox v1.4.2 (2008-10-04 00:02:35 UTC) Built-in shell (ash)
Enter 'help' for a list of built-in commands.

/gpfs/home/iskra $ hostname
ion-15
/gpfs/home/iskra $
</pre>

If everything is configured correctly, SSH will only let in root and the partition owner; no other unprivileged user will be allowed on the node. However, this might require site-specific customization to work properly. To enable access for the partition owner, one might need to make adjustments to [[ZOID#The /bin.rd/update_passwd_file.sh file|update_passwd_file.sh]]. To enable password-less login for the partition owners without requiring them to set up personal SSH keypairs, we recommend adding the names of the front end nodes to the <tt>shosts.equiv</tt> file, found in <tt>ramdisk/ION/ramdisk-add/etc/ssh.zepto/</tt> (it is empty by default; remember to use the names from the network that interconnects front end and I/O nodes, which might be different from hostnames, e.g., at Argonne we need to add the <tt>-data</tt> suffix to the hostnames). Until this has all been set up, one might prefer to log on as root (<tt>ssh -l root</tt>), passing the password provided while [[Configuration#Building|building]] the ZeptoOS environment.

Also, even when the partition owner is correctly set up, there will be a time window while booting the I/O node when the SSH daemon is already running, but a job has not yet been started; during that window, the partition owner cannot log on. If that happens, wait a few seconds and try again.

Here's part of the <tt>ps</tt> output from the I/O node:

<pre>
/gpfs/home/iskra $ ps -ef
UID PID PPID C STIME TTY TIME CMD
[...]
65534 98 1 0 16:09 ? 00:00:00 /sbin/portmap
root 108 19 0 16:09 ? 00:00:00 [rpciod/0]
root 109 19 0 16:09 ? 00:00:00 [rpciod/1]
root 110 19 0 16:09 ? 00:00:00 [rpciod/2]
root 111 19 0 16:09 ? 00:00:00 [rpciod/3]
root 570 1 0 16:09 ? 00:00:00 /sbin/syslogd
root 577 1 0 16:09 ? 00:00:00 /sbin/klogd -c 1 -x -x
ntp 653 1 0 16:09 ? 00:00:00 /usr/sbin/ntpd -p /var/run/ntpd.
root 688 1 0 16:09 ? 00:00:00 [lockd]
root 775 1 0 16:09 ? 00:00:00 /bgsys/iosoft/pvfs2/sbin/pvfs2-c
root 776 775 0 16:09 ? 00:00:00 pvfs2-client-core --child -a 5 -
root 833 1 0 16:10 ? 00:00:00 /usr/sbin/sshd -o PidFile=/var/r
root 1016 1 0 16:10 ? 00:00:00 /bin/ksh /usr/lpp/mmfs/bin/runmm
root 1079 1 0 16:10 ? 00:00:00 [nfsWatchKproc]
root 1080 1 0 16:10 ? 00:00:00 [gpfsSwapdKproc]
root 1146 1016 0 16:10 ? 00:00:01 /usr/lpp/mmfs/bin//mmfsd
root 1153 1 0 16:10 ? 00:00:00 [mmkproc]
root 1152 1 0 16:10 ? 00:00:00 [mmkproc]
root 1154 1 0 16:10 ? 00:00:00 [mmkproc]
iskra 2810 1 98 16:10 ? 00:04:09 /bin.rd/zoid -a 8 -m unix_impl.s
root 2823 1 0 16:10 ? 00:00:00 /bin/sh
root 3328 833 0 16:10 ? 00:00:00 sshd: iskra [priv]
iskra 3332 3328 0 16:10 ? 00:00:00 sshd: iskra@ttyp0
iskra 3333 3332 0 16:10 ttyp0 00:00:00 -sh
iskra 3346 3333 0 16:14 ttyp0 00:00:00 ps -ef
/gpfs/home/iskra $
</pre>

The I/O nodes run a small Linux setup with the root filesystem in the ramdisk. Custom processes can be started, just like on any ordinary Linux node. In the example above, it is mostly a few system daemons and the remote filesystem clients (GPFS, PVFS). Please verify at this stage that the remote filesystems have been mounted correctly.

One custom process running on the node is [[ZOID]], the I/O forwarding and job control daemon, which enables the communication with the compute nodes. One of the facilities offered by ZOID is IP forwarding between the I/O node and the compute nodes, implemented using the virtual network tunneling device available in Linux:

<pre>
/gpfs/home/iskra $ ifconfig tun0
tun0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
inet addr:192.168.1.254 P-t-P:192.168.1.254 Mask:255.255.255.255
UP POINTOPOINT RUNNING NOARP MULTICAST MTU:65535 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:500
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
/gpfs/home/iskra $
</pre>

At least on Argonne machines, with a 64:1 ratio of compute nodes to I/O nodes, compute nodes have addresses <tt>192.168.1.1</tt> to <tt>192.168.1.64</tt> (the last octet of the address is the [[FAQ#Pset rank|pset rank]]). Somewhat confusingly, the first compute node (compute node <tt>0</tt>) has IP address <tt>192.168.1.64</tt>, so if one submits a one-node job as we did, that is the IP address that needs to be used to log on to that sole running compute node. The IP address of the second compute node is... <tt>192.168.1.59</tt> (do not blame us – blame IBM :-).
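
The address scheme above can be captured in a small helper. This is only a sketch: the function name <tt>cn_ip_from_pset_rank</tt> is hypothetical, and it assumes the <tt>192.168.1.&lt;pset rank&gt;</tt> layout with ranks 1 to 64 described above, which may differ on machines with another compute-to-I/O node ratio.

```shell
# cn_ip_from_pset_rank: print the virtual-network IP address of a compute
# node, assuming the 192.168.1.<pset rank> scheme with ranks 1..64.
# (Hypothetical helper, not part of ZeptoOS.)
cn_ip_from_pset_rank() {
    rank=$1
    case "$rank" in
        ''|*[!0-9]*) echo "pset rank must be a number" >&2; return 1 ;;
    esac
    if [ "$rank" -lt 1 ] || [ "$rank" -gt 64 ]; then
        echo "pset rank out of range (1-64): $rank" >&2
        return 1
    fi
    echo "192.168.1.$rank"
}
```

For example, <tt>telnet "$(cn_ip_from_pset_rank 64)"</tt> would connect to the node with pset rank 64.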

The compute nodes are running a <tt>telnet</tt> daemon, and no password is required to log on to them:

<pre>
/gpfs/home/iskra $ telnet 192.168.1.64

Entering character mode
Escape character is '^]'.

BusyBox(for ZeptoOS Compute Node) v1.12.1 (2009-04-21 16:08:55 CDT) built-in shell (ash)
Enter 'help' for a list of built-in commands.

~ #
</pre>

The IP address of the I/O node on this virtual network is <tt>192.168.1.254</tt>. The network is local to each I/O node, so for larger jobs, there will be multiple distinct virtual networks that cannot communicate with each other, and the IP addresses will duplicate.

Here's part of the <tt>ps</tt> output from the compute node:

<pre>
~ # ps -ef
PID USER VSZ STAT COMMAND
[...]
34 root 5440 S /bin/sh /etc/init.d/rc.sysinit
44 root 5504 S /sbin/telnetd -l /bin/sh
47 root 6528 S /sbin/inetd
48 root 46400 R N /sbin/control
62 root 7872 S /bin/zoid-fuse -o allow_other -s /fuse
116 root 5248 S /bin/sleep 3600
118 root 5504 S /bin/sh
</pre>

Compute nodes have an even more stripped-down environment than the I/O nodes. There are no user accounts – everything runs as root, including the application processes. This is not a security concern, because the only practical way for a compute node to communicate with the outside world is through the I/O node, and I/O nodes ''do'' enforce user-level access control.

There are two custom processes running on each compute node:

'''control''' is a job management daemon responsible for tasks such as the launching of application processes, for the forwarding of stdin/out/err data, and for the management of the virtual network tunneling device from the compute node side. Do not interfere with this process in any way; this would likely make the node inaccessible.

'''zoid-fuse''' is a FUSE ([http://fuse.sourceforge.net/ Filesystem in Userspace]) client responsible for making the filesystems from the I/O nodes available to ordinary POSIX-compliant processes running on the compute nodes. The whole filesystem namespace from the I/O nodes is made available on the compute nodes under <tt>/fuse/</tt>, and symbolic links such as <tt>/home -> /fuse/home</tt> are set up to keep the login and I/O node pathnames valid on the compute nodes. Please verify that this is correctly set up. We do not foresee a need to change this setup, but should that prove necessary, the responsible <tt>fuse-start</tt> and <tt>fuse-stop</tt> scripts can be found under <tt>ramdisk/CN/tree/bin</tt>.

==Shell script job==

Assuming that the above steps have been successful, one can now test running a simple job from a network filesystem, such as one's home directory.

Here is a sample shell script to try:

<pre>
#!/bin/sh

. /proc/personality.sh

while true; do
echo "Node $BG_RANK_IN_PSET running (stdout)"
echo "Node $BG_RANK_IN_PSET running (stderr)" 1>&2
sleep 10
done
</pre>

(please see the [[FAQ#Pset rank|FAQ]] for the explanation of <tt>/proc/personality.sh</tt> and <tt>BG_RANK_IN_PSET</tt>)

Please create the script file on the network filesystem, set the executable bit (<tt>chmod 755</tt>) and submit it. Verify that the script starts correctly and that at least the standard error output is visible immediately. The script prints a line of output from each node every ten seconds. It does so both to the standard output and to the standard error, because, depending on software configuration, the standard output stream could be buffered. If that is the case, kill the job and verify that the standard output data did appear.
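
To check that every node has reported, one can count the distinct ranks seen in the captured job output. The helper below is a sketch under assumptions: the name <tt>count_reporting_nodes</tt> is hypothetical, and the job's stdout and stderr are assumed to have been saved to files whose names depend on the job scheduler in use.

```shell
# count_reporting_nodes: print how many distinct nodes wrote a
# "Node <rank> running (<stream>)" line to the given file.
#   $1: file holding the captured job output (name depends on the scheduler)
#   $2: "stdout" or "stderr"
# (Hypothetical helper, not part of ZeptoOS.)
count_reporting_nodes() {
    awk -v s="($2)" '
        $1 == "Node" && $3 == "running" && $4 == s { seen[$2] = 1 }
        END { n = 0; for (r in seen) n++; print n }
    ' "$1"
}
```

For a job on 64 nodes behind a single I/O node, the count for <tt>stderr</tt> should reach 64 shortly after the job starts.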

==MPI and OpenMP jobs==

The final tests involve two kinds of parallel jobs: MPI and OpenMP. Use the test programs provided with the distribution. From the top-level directory:
<pre>
$ cd comm/testcodes
</pre>

===Compiling===

The programs can be compiled on a login node using:

<pre>
$ /path/to/install/bin/zmpicc -o mpi-test-linux mpi-test.c
$ /path/to/install/bin/zmpixlc_r -qsmp=omp -o omp-test-linux omp-test.c
</pre>

===Submitting===

Submit the MPI test like any other job, using one of the following commands:

<pre>
$ cqsub -k <profile-name> -t <time> -n <number-of-processes> $PWD/mpi-test-linux
$ qsub --kernel <profile-name> -t <time> -n <number-of-processes> $PWD/mpi-test-linux
$ mpirun -verbose 1 -partition <partition-name> -np <number-of-processes> -timeout <time> \
-cwd $PWD -exe $PWD/mpi-test-linux
</pre>

For the OpenMP test, we pass the number of OpenMP threads to use in the <tt>OMP_NUM_THREADS</tt> variable:

<pre>
$ cqsub -k <profile-name> -t <time> -n 1 -e OMP_NUM_THREADS=<num> $PWD/omp-test-linux
$ qsub --kernel <profile-name> -t <time> -n 1 --env OMP_NUM_THREADS=<num> $PWD/omp-test-linux
$ mpirun -verbose 1 -partition <partition-name> -np 1 -timeout <time> \
-cwd $PWD -env OMP_NUM_THREADS=<num> -exe $PWD/omp-test-linux
</pre>

The MPI test benchmarks the performance of various MPI operations. The OpenMP test is just a parallel "Hello world".

----
[[Installation]] | [[ZeptoOS_Documentation|Top]] | [[MPICH, DCMF, and SPI]]

Kernel

2009-05-07T18:24:57Z

Iskra: /* Kernel (command line) parameters */

[[MPICH, DCMF, and SPI]] | [[ZeptoOS_Documentation|Top]] | [[Ramdisk]]
----

==Introduction==

We currently provide two Linux kernels, because GPFS support on the I/O nodes (ION) requires a different kernel version.

* 2.6.19-based kernel: ZeptoOS CN kernel
** IBM V1R3 patch and ZeptoOS patch applied
** 64 KB pagesize and big memory region available
** Device drivers for compute node devices such as DMA, lockbox, etc
** Allows MPICH/DCMF code to run through Zepto Compute Binary (ZCB)
** Can be used as an enhanced ION kernel

* 2.6.16-based kernel: ZeptoOS ION kernel
** IBM V1R3 patch applied
** Only minor changes compared to the IBM ION kernel.

We focus our development efforts on the 2.6.19-based kernel. It is meant primarily for the compute nodes, but can also be used on the I/O nodes. The problem is that GPFS does not work with this kernel, so we also provide the 2.6.16-based kernel which works with GPFS.

==Kernel directory structure==

The <tt>kernel</tt> directory consists of three main subdirectories: <tt>prebuilt</tt>, <tt>config</tt>, and <tt>tarball</tt>.

<pre>
kernel
|-- prebuilt
| |-- 2.6.16
| | `-- ION
| `-- 2.6.19
| |-- CN
| `-- objs
|-- tarball
`-- config
</pre>

The <tt>prebuilt</tt> directory contains prebuilt kernel images and modules. While a complete prebuilt ION kernel ELF file is provided, for the CN kernel we provide intermediate object files instead. This is because we embed the CN ramdisk in the CN kernel image when building ZeptoOS, and this process requires the object files.

The <tt>tarball</tt> directory contains separate kernel tarballs for the ION and the CN Linux kernels. Technically, those tarballs are snapshots of the ZeptoOS kernel git repository. The directory might also contain a <tt>.patch</tt> file with the differences between the last snapshot and the current git HEAD, because we wanted to avoid creating a new snapshot from git for small modifications. The associated git log file can also be found in this directory. A <tt>.SNAPSHOT_HEAD</tt> file records the git revision at the time the snapshot was created; this information is used to create the patch file.

Here is a list of files for the CN kernel:

<pre>
linux-2.6.19.2-BGP-V1R3.git.log
linux-2.6.19.2-BGP-V1R3.patch
linux-2.6.19.2-BGP-V1R3.SNAPSHOT_HEAD
linux-2.6.19.2-BGP-V1R3.tar.bz2
</pre>

The <tt>config</tt> directory contains Linux kernel configs. In the case of the 2.6.19 kernel, we provide separate config files for the compute node and the I/O node.

==Building a kernel==

The <tt>Makefile</tt> in the <tt>kernel</tt> directory has many options. Just type <tt>make</tt> and it will print out a help message.

To build (or rebuild) a kernel from the source tarball, use the <tt>bgp-ion-linux-build</tt> or <tt>bgp-cn-linux-build</tt> target. By default, it extracts the ION or CN kernel tarball into a directory named <tt>work</tt>, applies the patch (if there is one), and starts the kernel build. Once the kernel has been built successfully, the kernel images (in both the ZeptoOS top-level directory and the <tt>tmp</tt> directory) are replaced with the newly built images. The ION kernel source code is extracted into <tt>work/linux-2.6.16.46-297-BGP-V1R3</tt> and the CN kernel source into <tt>work/linux-2.6.19.2-BGP-V1R3</tt>.

Here is an example of building and rebuilding the CN kernel:

<pre>
$ cd kernel
$ make bgp-cn-linux-build
....
$ ls -al ../BGP-CN-zImage-with-initrd.elf
$ vi work/linux-2.6.19.2-BGP-V1R3/kernel/sched.c
$ make bgp-cn-linux-build
....
$ ls -al ../BGP-CN-zImage-with-initrd.elf
</pre>

===Building a kernel from the ZeptoOS kernel git repository===

As mentioned earlier, the kernel tarball is used as the source by default. If instead one passes <tt>GIT=1</tt> to <tt>make</tt>, one can build directly from the ZeptoOS kernel git tree. This is very useful for kernel development since it makes it easier to keep track of local modifications.

<pre>
$ cd kernel
$ make GIT=1 bgp-cn-linux-build
....
$ vi repo/linux-2.6.19.2-BGP-V1R3/kernel/sched.c
$ make GIT=1 bgp-cn-linux-build
....
</pre>

This will create <tt>repo/linux-2.6.19.2-BGP-V1R3</tt>, which is a git repository that is cloned from http://git.anl-external.org/bg-linux.repos/linux-2.6.19-BGP-V1R3.git/. Our http repo is read-only, so you cannot push your modifications to it. Instead, please post any patches to the [mailto:zeptoos@lists.mcs.anl.gov ZeptoOS mailing list].

See also the [http://bg-linux.anl-external.org/wiki/index.php/Main_Page BG-Linux page] for the details on our kernel git repository.

===Kernel config===

When one invokes <tt>make</tt> with a kernel build target for the first time, the associated kernel config file is copied to <tt>.config</tt> in the kernel build directory. <tt>config/bgp-cn-2.6.19.2-dot-config</tt> is applied to the CN Linux kernel build tree, and <tt>config/bgp-ion-2.6.16.46-dot-config</tt> is applied to the ION Linux kernel build tree.

Here is the location of the kernel config file:
* Regular build
** work/build-2.6.19.2-BGP-V1R3/.config
** work/build-2.6.16.46-297-BGP-V1R3/.config
* GIT build
** repo/build-2.6.19.2-BGP-V1R3/.config
** repo/build-2.6.16.46-297-BGP-V1R3/.config

Please note that the kernel config file is copied only once; it is not copied again until you do a <tt>distclean</tt> or remove the files manually.
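
The four locations listed above follow a regular pattern, which the following sketch makes explicit. The function name <tt>kernel_config_path</tt> is hypothetical; the directory names are taken from the list above, and paths are relative to the <tt>kernel</tt> directory.

```shell
# kernel_config_path: print the .config location for a kernel build,
# relative to the kernel directory.
#   $1: "cn" or "ion"
#   $2: "git" for a GIT=1 build; omit for a regular (tarball) build
# (Hypothetical helper, not part of ZeptoOS.)
kernel_config_path() {
    case "$1" in
        cn)  version=2.6.19.2-BGP-V1R3 ;;
        ion) version=2.6.16.46-297-BGP-V1R3 ;;
        *)   echo "usage: kernel_config_path cn|ion [git]" >&2; return 1 ;;
    esac
    if [ "${2:-}" = "git" ]; then top=repo; else top=work; fi
    echo "$top/build-$version/.config"
}
```

Removing the file printed by this helper is one way to force the default config to be copied again on the next build.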

The <tt>bgp-cn-linux-menuconfig</tt> and <tt>bgp-ion-linux-menuconfig</tt> <tt>make</tt> targets invoke text-based Linux kernel configuration menus:

<pre>
$ make bgp-ion-linux-menuconfig
$ make bgp-cn-linux-menuconfig
</pre>

For GIT build:
<pre>
$ make GIT=1 bgp-ion-linux-menuconfig
$ make GIT=1 bgp-cn-linux-menuconfig
</pre>

These menu targets never update the default kernel config files from the <tt>config</tt> directory. If you want to apply a new config permanently, please copy it to the <tt>config</tt> directory by hand:
<pre>
$ cp work/build-2.6.19.2-BGP-V1R3/.config config/bgp-cn-2.6.19.2-dot-config
</pre>

==Kernel (command line) parameters==

In common server/desktop Linux environments, kernel parameters can be passed via a bootloader such as grub. However, the Blue Gene/P boot mechanism does not provide such a capability, so we have modified the CN Linux kernel (2.6.19) to use a kernel parameter string embedded in the kernel ELF image file itself.

One can (re)set the kernel parameters in a kernel ELF file using a command line tool <tt>zkparam.py</tt>, located in the <tt>bin</tt> subdirectory of the ZeptoOS installation directory. Here is the synopsis of the tool:

<pre>
zkparam.py <kernel_image> [options]
</pre>

If options are omitted, the tool shows the current kernel parameters.

<pre>
$ ./kernel/zkparam.py BGP-CN-zImage-with-initrd.elf zepto_console_output=2
$ ./kernel/zkparam.py BGP-CN-zImage-with-initrd.elf
Current Kernel Parameters:
zepto_console_output=2
</pre>

===ZeptoOS-specific kernel parameters===

* '''zepto_debug'''=<integer>
** Specifies the ZeptoOS kernel debug level.
** The higher the number, the more messages are generated.
** <tt>0</tt> turns off all debug messages.
** default=1
* '''flatmemsizeMB'''=<integer>
** Specifies the size of big memory in MB.
** Currently the granularity of memory size is 256 MB.
** default=256 min=256 max=1792
* '''zepto_console_output'''=<integer>
** Specifies the console output behavior.
** 0 disables console output from all compute nodes.
** 1 enables console output from the first compute node ([[FAQ#Torus rank|torus rank]] 0).
** 2 enables console output from all compute nodes.
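
Before embedding a <tt>flatmemsizeMB</tt> value in a kernel image, it can be checked against the limits listed above. This is a sketch under an assumption: the "granularity" of 256 MB is interpreted here as "the value must be a multiple of 256"; the kernel itself may round rather than reject other values. The function name <tt>valid_flatmemsize</tt> is hypothetical.

```shell
# valid_flatmemsize: succeed if the value is a whole number of MB within
# the documented range (256..1792) and a multiple of 256.
# (Hypothetical helper; the multiple-of-256 rule is an interpretation of
# the documented 256 MB granularity.)
valid_flatmemsize() {
    size=$1
    case "$size" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$size" -ge 256 ] && [ "$size" -le 1792 ] && [ $((size % 256)) -eq 0 ]
}
```

For example, <tt>valid_flatmemsize 1024 && echo ok</tt> prints <tt>ok</tt>, while 300 and 2048 are rejected.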

==Log files, etc==


The compute node and I/O node log files have been discussed extensively in [[Testing#Log files|Testing]].

In addition to regular console logs, the kernels can also generate RAS messages, which will not appear in the log files. A command line tool named <tt>bg-listevents</tt> shows a record of RAS events. Type <tt>bg-listevents -h</tt> for command line arguments.

----
[[MPICH, DCMF, and SPI]] | [[ZeptoOS_Documentation|Top]] | [[Ramdisk]]

Configuration

2009-05-07T17:53:35Z

Iskra: /* Configuring */

[[Introduction]] | [[ZeptoOS_Documentation|Top]] | [[Installation]]
----

== Downloading ==

* Log on to one of the frontend nodes of the Blue Gene (a login node or a service node).

* Download the ZeptoOS tarball from the ZeptoOS [http://press.mcs.anl.gov/zeptoos/download download page].

* Extract the sources from the package:
<pre>
$ tar xjf ZeptoOS-*.tar.bz2
</pre>

== Configuring ==

Change to the top-level <tt>ZeptoOS-<version></tt> directory:

<pre>
$ cd ZeptoOS-<version>
</pre>

A <tt>configure</tt> script is provided to set the pathnames to various system directories.

<pre>
$ ./configure
</pre>

If invoked without any arguments, it will use the defaults, which should be appropriate if ZeptoOS is configured on a system with a supported BG/P driver version. The pathnames can be changed with the help of a user interface by invoking the script as follows:

<pre>
$ ./configure --edit
</pre>

This will display the following menu:

[[Image:Configure1.png|border|Main menu]]

Please select the top item (<tt>BG/P DIST_DIR</tt>). The screen will change to:

[[Image:Configure2.png|border|DIST_DIR menu]]

The following options are available:

; DRV_DIR
: The directory with the BG/P driver tree. The default (<tt>/bgsys/drivers/ppcfloor/</tt>) is a link pointing to the currently active driver.
; BGP_CROSS
: A prefix to the pathnames of the GNU cross-compilers used to build the compute node and I/O node software.
; BGCNS_H_PATH and BGCNS_H
: The location of a file needed to rebuild the kernel (these options are temporary and will be removed in the next version).
; OS_DIR
: The directory with the supplementary I/O node software used when booting the I/O nodes. It needs to be set to match the BG/P driver version being used.

The second top-level menu (<tt>Debugging</tt>) has only one option:

; ADD_DEBUG_TOOLS
: Check this option to include <tt>gdb</tt> and <tt>strace</tt> in the compute node ramdisk. They are not included by default because of their size.

The third top-level menu (<tt>Kernel Profiling</tt>) is discussed in the [[(K)TAU#Configure ZeptoOS to point to KTAU patch and path|(K)TAU section]].

Select <tt>Exit</tt> (multiple times if needed) and confirm if you want to save any changes made.

== Building ==

To start using the pre-built binaries simply type:

<pre>
$ make
</pre>

On the first invocation, this will ask for a root password to use on I/O nodes:

<pre>
Create root password for I/O Node
Leave the password field empty if you want to disable root login
New password:
</pre>

'''Security note: root-level access to I/O nodes should only be given to trusted individuals. A root user can access and modify files of all users in the system.'''

Once a password has been entered and confirmed, <tt>make</tt> will use pre-built kernel images, and will build the ramdisks from pre-built tools and utilities. The following generated files will be placed in the top-level directory:

; BGP-CN-zImage-with-initrd.elf
: A merged ZeptoOS compute node Linux and compute node ramdisk file.
; BGP-ION-ramdisk-for-CNL.elf
: ZeptoOS I/O node ramdisk for use with the ZeptoOS compute node Linux.
; BGP-ION-ramdisk-for-CNK.elf
: ZeptoOS I/O node ramdisk for use with the IBM CNK (optional).
; BGP-ION-zImage.elf
: ZeptoOS I/O node kernel.
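
After <tt>make</tt> finishes, a quick check can confirm that the generated images are present. The sketch below uses a hypothetical function name, <tt>missing_images</tt>, and checks only the three files not marked optional in the list above.

```shell
# missing_images: print the name of each required image that is absent
# from the given directory (default: current directory); succeed only if
# none are missing. BGP-ION-ramdisk-for-CNK.elf is optional and is not
# checked. (Hypothetical helper, not part of ZeptoOS.)
missing_images() {
    dir=${1:-.}
    rc=0
    for image in BGP-CN-zImage-with-initrd.elf \
                 BGP-ION-ramdisk-for-CNL.elf \
                 BGP-ION-zImage.elf; do
        if [ ! -f "$dir/$image" ]; then
            echo "$image"
            rc=1
        fi
    done
    return $rc
}
```

Run from the top-level ZeptoOS directory, <tt>missing_images</tt> prints nothing when all three files exist.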

It is possible to rebuild individual ZeptoOS components using one of the following <tt>make</tt> targets (the list is also available by typing <tt>make help</tt> or <tt>make menu</tt>):

; bgp-ion-ramdisk-cnk
: Rebuilds the I/O node ramdisk for the IBM CNK.
; bgp-ion-ramdisk-cnl
: Rebuilds the I/O node ramdisk for the ZeptoOS compute node Linux.
; bgp-cn-linux
: Rebuilds the compute node ramdisk and embeds it into a compute node kernel image.
; bgp-ion-linux-build
: Rebuilds the I/O node kernel.
; bgp-cn-linux-build
: Rebuilds the compute node kernel and ramdisk and merges them.
; bgp-all-pkg-rebuild
: Rebuilds all packages from sources.
(the following <tt>make</tt> targets are mostly for internal use)
; bgp-ion-linux
: Copies a recently rebuilt I/O node kernel if one is available; otherwise, uses a prebuilt binary (will not rebuild the kernel).
; bgp-all-pkg-smart
: Copies recently rebuilt packages if available; otherwise, uses prebuilt binaries (used when preparing to rebuild ramdisks).

----
[[Introduction]] | [[ZeptoOS_Documentation|Top]] | [[Installation]]

Ramdisk

2009-05-07T17:50:33Z

Iskra:

[[Kernel]] | [[ZeptoOS_Documentation|Top]] | [[ZOID]]
----

==Introduction==

Both the CN and ION Linux kernels require a ramdisk to boot. Ramdisk images contain minimal Linux utilities, init scripts, configuration files, kernel modules, etc, which are required by the OS boot process.

ION ramdisk is an ELF file that contains a cpio archive of system files. Two ION ramdisk images are currently generated:

; BGP-ION-ramdisk-for-CNL.elf
: Default ION ramdisk for ZeptoOS.
; BGP-ION-ramdisk-for-CNK.elf
: Use this one if you need to run IBM CNK on the compute nodes (uses IBM CIOD instead of ZOID)

Our ION ramdisks are similar to the default ION ramdisk from IBM, but we add some extra files to support ZeptoOS features. The extra files are located in <tt>ramdisk/ION/ramdisk-add/</tt>. The <tt>build-ramdisk</tt> script from IBM BGP driver is used to create the ION ramdisks.

The CN ramdisk is also a gzip'ed cpio archive of system files, but CN ramdisk is embedded into the CN kernel image (<tt>BGP-CN-zImage-with-initrd.elf</tt>). The CN ramdisk is created by a custom ramdisk build script (<tt>ramdisk/CN/create-bgp-cn-linux-ramdisk.pl</tt>). Both <tt>build-ramdisk</tt> and <tt>create-bgp-cn-linux-ramdisk.pl</tt> are wrappers of the Linux kernel's <tt>gen_init_cpio</tt> command.

==Creating ramdisk images==

The ramdisk images are always (re-)created from prebuilt objects if one types <tt>make</tt> at the top-level directory (without any make target).

If one wants to create an ION ramdisk individually (without rebuilding other images), type:

<pre>
$ make bgp-ion-ramdisk-cnl
</pre>

If one wants to create a CN ramdisk (technically, create a CN kernel image with new ramdisk contents), type:

<pre>
$ make bgp-cn-linux
</pre>

'''Note:''' the newly built CN ramdisk can be found in <tt>ramdisk/CN/bgp-cn-ramdisk.cpio.gz</tt>, but it is not useable until it is embedded into the kernel image.

For other ramdisk-related make targets, please refer to [[Configuration#Building|Configuration]].

==Modifying ramdisk contents==

You can customize ramdisk contents for your purpose, e.g., debugging, running your custom system software on BGP, etc.

===CN ramdisk===

The CN ramdisk can be customized by editing the CN ramdisk build script, which is <tt>ramdisk/CN/create-bgp-cn-linux-ramdisk.pl</tt>. The build script allows one to set the permission bits, create device files, etc.

Most of the contents of the CN ramdisk are kept in <tt>ramdisk/CN/tree/</tt>, but this is not a hard rule. Source files can reside anywhere as long as they are accessible from the script. It may be possible to use binaries and libraries from the login nodes, as long as they are 32-bit PPC files (use the <tt>file</tt> command to verify) and all their dependencies are also copied.

Here is a practical example. Suppose that you need the <tt>od</tt> command in CN ramdisk. You could build the command from source code, but if you want to do something quick, you can try using the login node's version:

<pre>
$ file /usr/bin/od
/usr/bin/od: ELF 32-bit MSB executable, PowerPC or cisco 4500, version 1 (SYSV),
for GNU/Linux 2.6.4, dynamically linked (uses shared libs), for GNU/Linux 2.6.4, stripped
$ ldd /usr/bin/od
linux-vdso32.so.1 => (0x00100000)
libc.so.6 => /lib/ppc970/libc.so.6 (0x0fe8b000)
/lib/ld.so.1 (0xf7fe1000)
</pre>

It is a 32-bit PPC executable and the current CN ramdisk has all the necessary shared libraries, so it can be used. Now add the command to a Perl array named <tt>@cmdlists</tt> in the <tt>ramdisk/CN/create-bgp-cn-linux-ramdisk.pl</tt> script and type <tt>make</tt> to recreate the CN ramdisk:

<pre>
$ vi ramdisk/CN/create-bgp-cn-linux-ramdisk.pl
# add the following line to @cmdlists
"file /bin/od /usr/bin/od 0755 0 0",
$ make bgp-cn-linux
</pre>

Now the CN ramdisk has <tt>/bin/od</tt> with file permissions <tt>0755</tt>, uid=0, and gid=0.

The added line is a command for the <tt>gen_init_cpio</tt> tool. One can also create directories, device files, symbolic links, pipe files, socket files, etc:

<pre>
file <name> <location> <mode> <uid> <gid>
dir <name> <mode> <uid> <gid>
nod <name> <mode> <uid> <gid> <dev_type> <maj> <min>
slink <name> <target> <mode> <uid> <gid>
pipe <name> <mode> <uid> <gid>
sock <name> <mode> <uid> <gid>

<name> name of the file/dir/nod/etc in the archive
<location> location of the file in the current filesystem
<target> link target
<mode> mode/permissions of the file
<uid> user id (0=root)
<gid> group id (0=root)
<dev_type> device type (b=block, c=character)
<maj> major number of nod
<min> minor number of nod
</pre>

The order of the commands in @cmdlists ''matters''. They are executed from top to bottom, so one cannot add a file to a directory that has not yet been created.
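
The ordering rule can be checked mechanically before rebuilding the ramdisk. The following is a minimal sketch: <tt>check_cpio_order</tt> is a hypothetical helper that expects the raw command lines on standard input, one per line and without the Perl quoting and commas used in <tt>@cmdlists</tt>. It only knows about directories declared by <tt>dir</tt> lines in its own input.

```shell
# check_cpio_order: read gen_init_cpio-style command lines on stdin and
# report every entry whose parent directory has not been declared by an
# earlier "dir" line. Entries directly under "/" are always accepted.
# Exits non-zero if any problem was found.
# (Hypothetical helper, not part of ZeptoOS.)
check_cpio_order() {
    awk '
        BEGIN { bad = 0 }
        $1 == "dir" || $1 == "file" || $1 == "nod" ||
        $1 == "slink" || $1 == "pipe" || $1 == "sock" {
            name = $2
            parent = name
            sub("/[^/]*$", "", parent)   # strip the last path component
            if (parent != "" && !(parent in seen)) {
                print "missing parent for " name
                bad = 1
            }
            if ($1 == "dir") seen[name] = 1
        }
        END { exit bad }
    '
}
```

A <tt>file /opt/x ...</tt> line that appears before <tt>dir /opt ...</tt> is reported, while the same two lines in the opposite order pass silently.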

====CN Linux startup script====

The first thing that the Linux kernel does after it boots is to execute the <tt>init</tt> program. The <tt>init</tt> program is usually in <tt>/sbin/</tt>, and in the CN ramdisk case it is part of the busybox. <tt>init</tt> reads in a config file from <tt>/etc/inittab</tt>, which in our case instructs it to execute the <tt>/etc/init.d/rc.sysinit</tt> startup script.

Our startup script is very minimalistic; its two most important actions are to start the telnet daemon to allow users to log in from the I/O nodes and then to start the ZOID <tt>control</tt> process, which takes care of IP forwarding and job control.

In case you need to start some processes at the CN boot time, you can add their invocations to <tt>ramdisk/CN/tree/etc/init.d/rc.sysinit</tt>, ''before'' <tt>/sbin/control</tt> is invoked.

===ION ramdisk===

Unlike the CN ramdisk, the range of customization is limited on the ION ramdisk. There is no control over file permission bits, one cannot create device nodes, etc. Currently we build the ION ramdisk using IBM's <tt>build-ramdisk</tt> script by specifying an add-on tree which contains our extra files.

Essentially, customization is limited to:
* adding new files,
* overwriting default ramdisk files by adding custom files with the same names.

Once files have been added under <tt>ramdisk/ION/ramdisk-add/</tt>, they will be automatically added to the ramdisk on the next rebuild. Here is an example of how to add a file to the ION ramdisk:

<pre>
$ vi ramdisk/ION/ramdisk-add/etc/yourfile
$ make bgp-ion-ramdisk-cnl
</pre>

If you need to do more than add files, you might need to edit the <tt>build-ramdisk</tt> script itself. The script is located in <tt>/bgsys/drivers/ppcfloor/</tt>. Copy the script to a working directory, edit it and change the script path in <tt>ramdisk/ION/Makefile</tt>.

====ION startup script====

There is no <tt>rc.sysinit</tt> in <tt>ramdisk/ION/ramdisk-add/</tt>, because <tt>rc.sysinit</tt> is provided in the IBM ramdisk tree (i.e., <tt>/bgsys/drivers/ppcfloor/ramdisk/etc/init.d/rc.sysinit</tt> is the default one). If needed, one can copy the default one to the ZeptoOS <tt>ramdisk/etc/init.d/rc.sysinit</tt> and modify it to change the startup behaviour, but this is in general not recommended.

In most cases, what one is looking for is to start a process at the ION boot time. For such purpose, one can add a custom ION RC script to <tt>ramdisk-add/etc/init.d/rc3.d/</tt>.

RC scripts have the following naming convention:

* S##xxxx : boot-time scripts
* K##xxxx : shut-down scripts

They start with <tt>S</tt> or <tt>K</tt>; those starting with <tt>S</tt> are the boot-time scripts and those starting with <tt>K</tt> are the shut-down scripts. The two-digit number following <tt>S</tt> or <tt>K</tt> is used to determine the execution order; scripts with lower numbers are executed earlier. The number is followed by the script name. On execution, "start" is passed as the first argument to boot-time scripts, and "stop" to shut-down scripts. Here is a template of an RC script:

<pre>
#!/bin/sh
. /etc/rc.status

rc_reset
case "$1" in
start)
# fill here #
;;
stop)
# fill here #
;;
restart)
# fill here #
;;
status)
# fill here #
;;
*)
echo "Usage: $0 {start|stop|restart|status}"
exit 1
;;
esac
rc_exit
</pre>
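
The naming convention can be illustrated with a small parser that splits an RC script name into its three parts. This is a sketch only; <tt>rc_script_info</tt> is a hypothetical helper and not part of the ramdisk.

```shell
# rc_script_info: split an RC script name of the form S##name or K##name
# and print "<kind> <order> <name>", where kind is "boot" or "shutdown".
# (Hypothetical helper, not part of ZeptoOS.)
rc_script_info() {
    name=$1
    case "$name" in
        S[0-9][0-9]?*) kind=boot ;;
        K[0-9][0-9]?*) kind=shutdown ;;
        *) echo "not an RC script name: $name" >&2; return 1 ;;
    esac
    rest=${name#?}               # drop the leading S or K
    script=${rest#??}            # drop the two-digit order number
    order=${rest%"$script"}      # what remains is the order number
    echo "$kind $order $script"
}
```

Because the order number is a fixed-width two-digit field, a plain lexical sort of the file names (as produced by <tt>ls</tt>) already lists the scripts in execution order.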

The ZeptoOS ION ramdisk contains the following RC scripts by default (some of these are ZeptoOS-specific, others come from the IBM ramdisk tree):

'''boot''' scripts:
<pre>
S00zepto
S01bootsysctl
S02syslog
S05ntp
S11sshd
S12zepto
S40gpfs
S43ibmcmp
S46essl
S50ciod
S51zoid
S99zepto
</pre>

'''shutdown''' scripts:
<pre>
K05ntp
K10sshd
K15ciod
K20gpfs
K30syslog
K50bgsys.64
</pre>

===Ramdisk size limitation===

In regular Linux environments, ramdisk size is limited by free memory size at the time when ramdisk is loaded into memory. However, on BGP, closed-source system software cannot handle images of arbitrary sizes. We do not have an exact number on the boot image size limitation, but we have seen with the current software stack that images of 100 MB or larger might fail to boot. If one adds large files to the ramdisk, please check the size of the generated image files, specifically <tt>BGP-ION-ramdisk-for-CNL.elf</tt> and <tt>BGP-CN-zImage-with-initrd.elf</tt>.
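
A size check along these lines can be run after each rebuild. It is a sketch under stated assumptions: <tt>image_size_ok</tt> is a hypothetical helper, the 100 MB default reflects only the empirical observation above rather than a documented limit, and 1 MB is taken to be 1024*1024 bytes.

```shell
# image_size_ok: succeed if the file exists and is smaller than the limit.
#   $1: image file
#   $2: limit in MB (default 100, an empirical value, not an exact limit)
# Returns 2 if the file does not exist.
# (Hypothetical helper, not part of ZeptoOS.)
image_size_ok() {
    file=$1
    limit_mb=${2:-100}
    [ -f "$file" ] || return 2
    bytes=$(wc -c < "$file" | tr -d ' ')
    [ "$bytes" -lt $((limit_mb * 1024 * 1024)) ]
}
```

For example, <tt>image_size_ok BGP-CN-zImage-with-initrd.elf || echo "image may be too large to boot"</tt>.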

==Extracting files from an existing ramdisk image==

To extract files from an existing ramdisk image, do the following (ION ramdisk only):

<pre>
$ ./packages/tools/z-extract-cpio-from-ramdisk.sh <existing_ramdisk_image> ramdisk.cpio
$ mkdir treeroot && cd treeroot
$ cpio -idv < ../ramdisk.cpio
</pre>

----
[[Kernel]] | [[ZeptoOS_Documentation|Top]] | [[ZOID]]

Ramdisk

2009-05-06T22:07:02Z

Iskra:

[[Kernel]] | [[ZeptoOS_Documentation|Top]] | [[ZOID]]
----

==Introduction==

Both the CN and ION Linux kernels require a ramdisk to boot. Ramdisk images contain minimal Linux utilities, init scripts, configuration files, kernel modules, etc, which are required by the OS boot process.

ION ramdisk is an ELF file that contains a cpio archive of system files. Two ION ramdisk images are currently generated:

; BGP-ION-ramdisk-for-CNL.elf
: Default ION ramdisk for ZeptoOS.
; BGP-ION-ramdisk-for-CNK.elf
: Use this one if you need to run IBM CNK on the compute nodes (uses IBM CIOD instead of ZOID)

Our ION ramdisks are similar to the default ION ramdisk from IBM, but we add some extra files to support ZeptoOS features. The extra files are located in <tt>ramdisk/ION/ramdisk-add/</tt>. The <tt>build-ramdisk</tt> script from IBM BGP driver is used to create the ION ramdisks.

The CN ramdisk is also a gzip'ed cpio archive of system files, but CN ramdisk is embedded into the CN kernel image (<tt>BGP-CN-zImage-with-initrd.elf</tt>). The CN ramdisk is created by a custom ramdisk build script (<tt>ramdisk/CN/create-bgp-cn-linux-ramdisk.pl</tt>). Both <tt>build-ramdisk</tt> and <tt>create-bgp-cn-linux-ramdisk.pl</tt> are wrappers of the Linux kernel's <tt>gen_init_cpio</tt> command.

==Creating ramdisk images==

The ramdisk images are always (re-)created from prebuilt objects if one types <tt>make</tt> at the top-level directory (without any make target).

If one wants to create an ION ramdisk individually (without rebuilding other images), type:

<pre>
$ make bgp-ion-ramdisk-cnl
</pre>

If one wants to create a CN ramdisk (technically, create a CN kernel image with new ramdisk contents), type:

<pre>
$ make bgp-cn-linux
</pre>

'''Note:''' the newly built CN ramdisk can be found in <tt>ramdisk/CN/bgp-cn-ramdisk.cpio.gz</tt>, but it is not useable until it is embedded into the kernel image.

For other ramdisk-related make targets, please refer to [[Configuration#Building|Configuration]].

==Modifying ramdisk contents==

You can customize ramdisk contents for your purpose, e.g., debugging, running your custom system software on BGP, etc.

===CN ramdisk===

The CN ramdisk can be customized by editing the CN ramdisk build script, which is <tt>ramdisk/CN/create-bgp-cn-linux-ramdisk.pl</tt>. The build script allows one to set the permission bits, create device files, etc.

Most of the contents of the CN ramdisk are kept in <tt>ramdisk/CN/tree/</tt>, but this is not a hard rule. Source files can reside anywhere as long as they are accessible from the script. It may be possible to use binaries and libraries from the login nodes, as long as they are 32-bit PPC files (use the <tt>file</tt> command to verify) and all their dependencies are also copied.

Here is a practical example. Suppose that you need the <tt>od</tt> command in CN ramdisk. You could build the command from source code, but if you want to do something quick, you can try using the login node's version:

<pre>
$ file /usr/bin/od
/usr/bin/od: ELF 32-bit MSB executable, PowerPC or cisco 4500, version 1 (SYSV),
for GNU/Linux 2.6.4, dynamically linked (uses shared libs), for GNU/Linux 2.6.4, stripped
$ ldd /usr/bin/od
linux-vdso32.so.1 => (0x00100000)
libc.so.6 => /lib/ppc970/libc.so.6 (0x0fe8b000)
/lib/ld.so.1 (0xf7fe1000)
</pre>

It is a 32-bit PPC executable and the current CN ramdisk has all the necessary shared libraries, so it can be used. Now add the command to a Perl array named <tt>@cmdlists</tt> in the <tt>ramdisk/CN/create-bgp-cn-linux-ramdisk.pl</tt> script and type <tt>make</tt> to recreate the CN ramdisk:

<pre>
$ vi ramdisk/CN/create-bgp-cn-linux-ramdisk.pl
# add the following line to @cmdlists
"file /bin/od /usr/bin/od 0755 0 0",
$ make bgp-cn-linux
</pre>

Now the CN ramdisk has <tt>/bin/od</tt> with file permissions <tt>0755</tt>, uid=0, and gid=0.

The added line is a command for the <tt>gen_init_cpio</tt> tool. One can also create directories, device files, symbolic links, pipe files, socket files, etc:

<pre>
file <name> <location> <mode> <uid> <gid>
dir <name> <mode> <uid> <gid>
nod <name> <mode> <uid> <gid> <dev_type> <maj> <min>
slink <name> <target> <mode> <uid> <gid>
pipe <name> <mode> <uid> <gid>
sock <name> <mode> <uid> <gid>

<name> name of the file/dir/nod/etc in the archive
<location> location of the file in the current filesystem
<target> link target
<mode> mode/permissions of the file
<uid> user id (0=root)
<gid> group id (0=root)
<dev_type> device type (b=block, c=character)
<maj> major number of nod
<min> minor number of nod
</pre>

The order of the commands in <tt>@cmdlists</tt> ''matters''. They are executed from top to bottom, so one cannot add a file to a directory that has not yet been created.
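
The ordering rule can be checked mechanically. The following is a minimal sketch, assuming the commands are available as plain <tt>gen_init_cpio</tt>-style lines (one per line, without the Perl quoting); the function name <tt>check_cpio_order</tt> is hypothetical and not part of ZeptoOS. It reports every entry whose parent directory has not been declared by an earlier <tt>dir</tt> line.

```shell
#!/bin/sh
# Hypothetical checker (not part of ZeptoOS): reads gen_init_cpio-style lines
# on stdin and prints every entry whose parent directory has not been
# declared by an earlier "dir" line. Exit status is 1 if any were found.
check_cpio_order() {
    awk '
        BEGIN { seen["/"] = 1; bad = 0 }        # the root directory always exists
        /^[ \t]*(#|$)/ { next }                 # skip comments and blank lines
        {
            name = $2
            parent = name
            sub(/\/[^\/]*$/, "", parent)        # strip the last path component
            if (parent == "") parent = "/"
            if (!(parent in seen)) { print "missing parent: " name; bad = 1 }
            if ($1 == "dir") seen[name] = 1
        }
        END { exit bad }
    '
}

# Possible use (cmds.txt is a hypothetical file holding the command lines):
#   check_cpio_order < cmds.txt
```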

====CN Linux startup script====

The first thing that the Linux kernel does after it boots is to execute the <tt>init</tt> program. The <tt>init</tt> program is usually in <tt>/sbin/</tt>, and in the CN ramdisk case it is part of BusyBox. <tt>init</tt> reads in a config file from <tt>/etc/inittab</tt>, which in our case instructs it to execute the <tt>/etc/init.d/rc.sysinit</tt> startup script.

Our startup script is very minimalistic; its two most important actions are to start the telnet daemon to allow users to log in from the I/O nodes and then to start the ZOID <tt>control</tt> process, which takes care of IP forwarding and job control.

If you need to start some process at CN boot time, you can add its invocation to <tt>ramdisk/CN/tree/etc/init.d/rc.sysinit</tt>, ''before'' <tt>/sbin/control</tt> is invoked.

===ION ramdisk===

Unlike with the CN ramdisk, the range of customization is limited on the ION ramdisk. There is no control over file permission bits, one cannot create device nodes, etc. Currently we build the ION ramdisk using IBM's <tt>build-ramdisk</tt> script by specifying an add-on tree which contains our extra files.

Basically, you can do two things:
* add files,
* overwrite default ramdisk files by adding custom files with the same names.

Once files have been added under <tt>ramdisk/ION/ramdisk-add/</tt>, they will be automatically added to the ramdisk on the next rebuild. Here is an example of how to add a file to the ION ramdisk:

<pre>
$ vi ramdisk/ION/ramdisk-add/etc/yourfile
$ make bgp-ion-ramdisk-cnl
</pre>

If you need to do more than add files, you might need to edit the <tt>build-ramdisk</tt> script itself. The script is located in <tt>/bgsys/drivers/ppcfloor/</tt>. Copy the script to your working directory, edit it, and change the script path in <tt>ramdisk/ION/Makefile</tt>.

====ION startup script====

There is no <tt>rc.sysinit</tt> in <tt>ramdisk/ION/ramdisk-add/</tt> because <tt>rc.sysinit</tt> is provided by the IBM ramdisk tree; that is, <tt>/bgsys/drivers/ppcfloor/ramdisk/etc/init.d/rc.sysinit</tt> is the default one. You can copy the default script to <tt>ramdisk/etc/init.d/rc.sysinit</tt> (local) and modify it to change the startup behaviour, but this is not recommended.

In most cases, all you need is to start your software at ION boot time. For that purpose, you can add your own ION rc script to <tt>ramdisk-add/etc/init.d/rc3.d</tt>.

Rc scripts follow a naming convention:

* <tt>S##xxxx</tt>: boot time scripts
* <tt>K##xxxx</tt>: shutdown scripts

The name starts with <tt>S</tt> or <tt>K</tt>, followed by a two-digit number that determines the execution order (a script with a smaller number is executed before a script with a larger number), followed by the script name. The init scripts pass <tt>start</tt> as the first argument to boot time scripts and <tt>stop</tt> to shutdown scripts. By checking this argument, a single rc script can serve as both a boot time and a shutdown script. Here is a template of an rc script:

<pre>
#!/bin/sh
. /etc/rc.status

rc_reset
case "$1" in
start)
# fill here #
;;
stop)
# fill here #
;;
restart)
# fill here #
;;
status)
# fill here #
;;
*)
echo "Usage: $0 {start|stop|restart|status}"
exit 1
;;
esac
rc_exit
</pre>
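
The naming convention can be illustrated with a small parser. This is a minimal sketch only; the function name <tt>parse_rc_name</tt> is hypothetical and not part of ZeptoOS or the IBM ramdisk. Given a script name, it prints the argument the script would receive, its two-digit order, and its base name.

```shell
#!/bin/sh
# Hypothetical helper (not part of ZeptoOS): splits an rc script name of the
# form S##name or K##name into the argument it receives, its order, and its name.
parse_rc_name() {
    case "$1" in
        S[0-9][0-9]?*) action=start ;;
        K[0-9][0-9]?*) action=stop ;;
        *) echo "not an rc script name: $1" >&2; return 1 ;;
    esac
    order=${1#?}              # drop the leading S or K
    name=${order#??}          # drop the two digits, leaving the script name
    order=${order%"$name"}    # keep only the two digits
    echo "$action $order $name"
}

# Usage:
#   parse_rc_name S11sshd   # prints: start 11 sshd
#   parse_rc_name K05ntp    # prints: stop 05 ntp
```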

The default Zepto ION ramdisk contains the following rc scripts:

'''boot''' scripts
<pre>
S00zepto
S01bootsysctl
S02syslog
S05ntp
S11sshd
S12zepto
S40gpfs
S43ibmcmp
S46essl
S50ciod
S51zoid
S99zepto
</pre>

'''shutdown''' scripts
<pre>
K05ntp
K10sshd
K15ciod
K20gpfs
K30syslog
K50bgsys.64
</pre>

===Ramdisk size limitation===

In a regular Linux environment, the ramdisk size is basically limited by the amount of free memory at the time the ramdisk is loaded. On BG/P, however, the (non-open-source) system software cannot handle large images. We do not know the exact limit on the boot image size, but a ramdisk of 100 MB or more might fail to boot in the current environment. If you add large files to a ramdisk, please check the size of the resulting ramdisk files, specifically <tt>BGP-ION-ramdisk-for-CNL.elf</tt> and <tt>BGP-CN-zImage-with-initrd.elf</tt>.
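
The size check can be automated after each rebuild. The following is a minimal sketch; the function name <tt>check_image_size</tt> is hypothetical and not part of ZeptoOS, and the 100 MB default is only the rough figure mentioned above, not an exact limit.

```shell
#!/bin/sh
# Hypothetical helper (not part of ZeptoOS): warns when an image file reaches
# a size limit given in bytes. The default of 100 MB is an approximation.
check_image_size() {
    file=$1
    limit=${2:-104857600}                 # 100 * 1024 * 1024 bytes
    [ -f "$file" ] || { echo "no such file: $file" >&2; return 2; }
    size=$(($(wc -c < "$file")))          # arithmetic strips any padding from wc
    if [ "$size" -ge "$limit" ]; then
        echo "WARNING: $file is $size bytes (limit $limit)"
        return 1
    fi
    echo "OK: $file is $size bytes"
}

# Possible use in the top-level directory after a build:
#   check_image_size BGP-ION-ramdisk-for-CNL.elf
#   check_image_size BGP-CN-zImage-with-initrd.elf
```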

==Extracting files from an existing ramdisk image==

To extract files from an existing ramdisk image, do the following (ION ramdisk only):

<pre>
$ ./packages/tools/z-extract-cpio-from-ramdisk.sh <existing_ramdisk_image> ramdisk.cpio
$ mkdir treeroot && cd treeroot
$ cpio -idv < ../ramdisk.cpio
</pre>

----
[[Kernel]] | [[ZeptoOS_Documentation|Top]] | [[ZOID]]

Configuration

2009-05-06T21:32:07Z

Iskra: /* Building */

[[Introduction]] | [[ZeptoOS_Documentation|Top]] | [[Installation]]
----

== Downloading ==

* Log on to one of the frontend nodes of the Blue Gene (a login node or a service node).

* Download the ZeptoOS tarball from the ZeptoOS [http://press.mcs.anl.gov/zeptoos/download download page].

* Extract the sources from the package:
<pre>
$ tar xjf ZeptoOS-*.tar.bz2
</pre>

== Configuring ==

Change to the top-level <tt>BGP</tt> directory:

<pre>
$ cd BGP/
</pre>

A <tt>configure</tt> script is provided to set the pathnames to various system directories.

<pre>
$ ./configure
</pre>

If invoked without any arguments, it will use the defaults, which should be appropriate if ZeptoOS is configured on a system with a supported BG/P driver version. The pathnames can be changed with the help of a user interface by invoking the script as follows:

<pre>
$ ./configure --edit
</pre>

This will display the following menu:

[[Image:Configure1.png|border|Main menu]]

Please select the top item (<tt>BG/P DIST_DIR</tt>). The screen will change to:

[[Image:Configure2.png|border|DIST_DIR menu]]

The following options are available:

; DRV_DIR
: The directory with the BG/P driver tree. The default (<tt>/bgsys/drivers/ppcfloor/</tt>) is a link pointing to the currently active driver.
; BGP_CROSS
: A prefix to the pathnames of the GNU cross-compilers used to build the compute node and I/O node software.
; BGCNS_H_PATH and BGCNS_H
: The location of a file needed to rebuild the kernel (these options are temporary and will be removed in the next version).
; OS_DIR
: The directory with the supplementary I/O node software used when booting the I/O nodes. It needs to be set to match the BG/P driver version being used.

The second top-level menu (<tt>Debugging</tt>) has only one option:

; ADD_DEBUG_TOOLS
: Check this option to include <tt>gdb</tt> and <tt>strace</tt> in the compute node ramdisk. They are not included by default because of their size.

The third top-level menu (<tt>Kernel Profiling</tt>) is discussed in the [[(K)TAU#Configure ZeptoOS to point to KTAU patch and path|(K)TAU section]].

Select <tt>Exit</tt> (multiple times if needed) and confirm if you want to save any changes made.

== Building ==

To start using the pre-built binaries simply type:

<pre>
$ make
</pre>

On the first invocation, this will ask for a root password to use on I/O nodes:

<pre>
Create root password for I/O Node
Leave the password field empty if you want to disable root login
New password:
</pre>

'''Security note: root-level access to I/O nodes should only be given to trusted individuals. A root user can access and modify files of all users in the system.'''

Once a password has been entered and confirmed, <tt>make</tt> will use pre-built kernel images, and will build the ramdisks from pre-built tools and utilities. The following generated files will be placed in the top-level directory:

; BGP-CN-zImage-with-initrd.elf
: A merged ZeptoOS compute node Linux and compute node ramdisk file.
; BGP-ION-ramdisk-for-CNL.elf
: ZeptoOS I/O node ramdisk for use with the ZeptoOS compute node Linux.
; BGP-ION-ramdisk-for-CNK.elf
: ZeptoOS I/O node ramdisk for use with the IBM CNK (optional).
; BGP-ION-zImage.elf
: ZeptoOS I/O node kernel.
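
Whether a build produced these files can be checked with a short script. This is a minimal sketch; the function name <tt>check_build_outputs</tt> is hypothetical and not part of the ZeptoOS build system. It treats the CNK ramdisk as optional, as described above.

```shell
#!/bin/sh
# Hypothetical helper (not part of ZeptoOS): reports which of the generated
# images are present in a build directory (default: the current directory).
check_build_outputs() {
    dir=${1:-.}
    missing=0
    for f in BGP-CN-zImage-with-initrd.elf \
             BGP-ION-ramdisk-for-CNL.elf \
             BGP-ION-zImage.elf; do
        if [ -f "$dir/$f" ]; then
            echo "found:   $f"
        else
            echo "missing: $f"
            missing=1
        fi
    done
    # The CNK ramdisk is optional, so its absence is reported but is not an error.
    [ -f "$dir/BGP-ION-ramdisk-for-CNK.elf" ] || echo "absent (optional): BGP-ION-ramdisk-for-CNK.elf"
    return "$missing"
}

# Possible use from the top-level directory:
#   check_build_outputs .
```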

It is possible to rebuild individual ZeptoOS components using one of the following <tt>make</tt> targets (the list is also available by typing <tt>make help</tt> or <tt>make menu</tt>):

; bgp-ion-ramdisk-cnk
: Rebuilds the I/O node ramdisk for the IBM CNK.
; bgp-ion-ramdisk-cnl
: Rebuilds the I/O node ramdisk for the ZeptoOS compute node Linux.
; bgp-cn-linux
: Rebuilds the compute node ramdisk and embeds it into a compute node kernel image.
; bgp-ion-linux-build
: Rebuilds the I/O node kernel.
; bgp-cn-linux-build
: Rebuilds the compute node kernel and ramdisk and merges them.
; bgp-all-pkg-rebuild
: Rebuilds all packages from sources.
(the following <tt>make</tt> targets are mostly for internal use)
; bgp-ion-linux
: Copies a recently rebuilt I/O node kernel if one is available; otherwise, uses a prebuilt binary (will not rebuild the kernel).
; bgp-all-pkg-smart
: Copies recently rebuilt packages if available; otherwise, uses prebuilt binaries (used when preparing to rebuild ramdisks).

----
[[Introduction]] | [[ZeptoOS_Documentation|Top]] | [[Installation]]

Configuration

2009-05-06T21:15:11Z

Iskra: /* Building */

[[Introduction]] | [[ZeptoOS_Documentation|Top]] | [[Installation]]
----

== Downloading ==

* Log on to one of the frontend nodes of the Blue Gene (a login node or a service node).

* Download the ZeptoOS tarball from the ZeptoOS [http://press.mcs.anl.gov/zeptoos/download download page].

* Extract the sources from the package:
<pre>
$ tar xjf ZeptoOS-*.tar.bz2
</pre>

== Configuring ==

Change to the top-level <tt>BGP</tt> directory:

<pre>
$ cd BGP/
</pre>

A <tt>configure</tt> script is provided to set the pathnames to various system directories.

<pre>
$ ./configure
</pre>

If invoked without any arguments, it will use the defaults, which should be appropriate if ZeptoOS is configured on a system with a supported BG/P driver version. The pathnames can be changed with the help of a user interface by invoking the script as follows:

<pre>
$ ./configure --edit
</pre>

This will display the following menu:

[[Image:Configure1.png|border|Main menu]]

Please select the top item (<tt>BG/P DIST_DIR</tt>). The screen will change to:

[[Image:Configure2.png|border|DIST_DIR menu]]

The following options are available:

; DRV_DIR
: The directory with the BG/P driver tree. The default (<tt>/bgsys/drivers/ppcfloor/</tt>) is a link pointing to the currently active driver.
; BGP_CROSS
: A prefix to the pathnames of the GNU cross-compilers used to build the compute node and I/O node software.
; BGCNS_H_PATH and BGCNS_H
: The location of a file needed to rebuild the kernel (these options are temporary and will be removed in the next version).
; OS_DIR
: The directory with the supplementary I/O node software used when booting the I/O nodes. It needs to be set to match the BG/P driver version being used.

The second top-level menu (<tt>Debugging</tt>) has only one option:

; ADD_DEBUG_TOOLS
: Check this option to include <tt>gdb</tt> and <tt>strace</tt> in the compute node ramdisk. They are not included by default because of their size.

The third top-level menu (<tt>Kernel Profiling</tt>) is discussed in the [[(K)TAU#Configure ZeptoOS to point to KTAU patch and path|(K)TAU section]].

Select <tt>Exit</tt> (multiple times if needed) and confirm if you want to save any changes made.

== Building ==

To start using the pre-built binaries simply type:

<pre>
$ make
</pre>

On the first invocation, this will ask for a root password to use on I/O nodes:

<pre>
Create root password for I/O Node
Leave the password field empty if you want to disable root login
New password:
</pre>

'''Security note: root-level access to I/O nodes should only be given to trusted individuals. A root user can access and modify files of all users in the system.'''

Once a password has been entered and confirmed, <tt>make</tt> will use pre-built kernel images, and will build the ramdisks from pre-built tools and utilities. The following generated files will be placed in the top-level directory:

; BGP-CN-zImage-with-initrd.elf
: A merged ZeptoOS compute node Linux and compute node ramdisk file.
; BGP-ION-ramdisk-for-CNL.elf
: ZeptoOS I/O node ramdisk for use with the ZeptoOS compute node Linux.
; BGP-ION-ramdisk-for-CNK.elf
: ZeptoOS I/O node ramdisk for use with the IBM CNK (optional).
; BGP-ION-zImage.elf
: ZeptoOS I/O node kernel.

It is possible to rebuild individual ZeptoOS components using one of the following <tt>make</tt> targets (the list is also available by typing <tt>make help</tt> or <tt>make menu</tt>):

; bgp-ion-ramdisk-cnk
: Rebuilds the I/O node ramdisk for the IBM CNK.
; bgp-ion-ramdisk-cnl
: Rebuilds the I/O node ramdisk for the ZeptoOS compute node Linux.
; bgp-ion-linux-build
: Rebuilds the I/O node kernel.
; bgp-cn-linux-build
: Rebuilds the compute node kernel and ramdisk and merges them.
; bgp-all-pkg-rebuild
: Rebuilds all packages from sources.
(the following <tt>make</tt> targets are mostly for internal use)
; bgp-ion-linux
: Copies a recently rebuilt I/O node kernel if one is available; otherwise, uses a prebuilt binary (will not rebuild the kernel).
; bgp-cn-linux
: Copies a recently rebuilt compute node kernel if one is available; otherwise, uses a prebuilt binary (will not rebuild the kernel), rebuilds the compute node ramdisk, and merges the kernel and ramdisk.
; bgp-all-pkg-smart
: Copies recently rebuilt packages if available; otherwise, uses prebuilt binaries (used when preparing to rebuild ramdisks).

----
[[Introduction]] | [[ZeptoOS_Documentation|Top]] | [[Installation]]