Other Packages

From ZeptoOS
Revision as of 17:17, 8 May 2009 by Iskra (talk | contribs)



PVFS stands for Parallel Virtual File System, an open source parallel file system designed to scale to petabytes of storage and to provide access rates of hundreds of GB/s. On the Argonne BGP systems, PVFS servers are running and a PVFS start-up script is installed in the BGP site-specific directory (/bgp/iofs/), so that a PVFS volume is mounted at ION boot time.

We included the PVFS version 2.8.1 source code and prebuilt client binaries in the ZeptoOS release for sites that are interested in PVFS. We also included a very simple example PVFS start-up script that can be added to the ION ramdisk. If you have PVFS servers running on your system, you can follow the steps below to add the necessary PVFS client components to the ramdisk:

$ cd packages/pvfs2/prebuilt
$ sh add-pvfs2-client-ION-ramdisk.sh tcp://

Please replace tcp:// with the actual server info.

Details on building and running PVFS servers are outside of the scope of this document, but the following example might give a basic idea of how to build and run PVFS:

$ cd pvfs-2.8.1
$ ./configure [options....]
$ make

[Create a server config file]
$ ./src/apps/admin/pvfs2-genconfig fs.conf

[Start the server]
$ ./src/server/pvfs2-server -f fs.conf -a ALIAS
$ ./src/server/pvfs2-server    fs.conf -a ALIAS


  • replace ALIAS with your real alias in fs.conf
  • the first pvfs2-server invocation just initializes a PVFS volume
  • the second invocation actually starts the server

IP over torus

This is currently a preview feature. It implements IP packet forwarding on top of MPI, over the torus network. Torus is a point-to-point network that interconnects all the compute nodes in a partition. Every compute node gets a unique IP address, of the form: | <rank>

where <rank> is the MPI rank. Thus, for a 64-node partition, the IP addresses will range between and, and for a 1024-node partition, they will range between and
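The mapping from rank to address can be sketched in C as follows. This is only an illustration: it assumes that the | in the address form above denotes a bitwise OR of the rank into a base address, and both the helper name rank_to_ip and the base address passed to it are hypothetical (substitute the base address used on your system).

```c
#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: OR an MPI rank into a dotted-quad base address.
 * Returns 0 on success, -1 if base is not a valid IPv4 address or the
 * output buffer is too small. */
int rank_to_ip(const char *base, uint32_t rank, char *out, size_t outlen)
{
    struct in_addr addr;

    if (inet_pton(AF_INET, base, &addr) != 1)
        return -1;
    /* Work in host byte order so the rank lands in the low-order bits. */
    addr.s_addr = htonl(ntohl(addr.s_addr) | rank);
    if (inet_ntop(AF_INET, &addr, out, (socklen_t)outlen) == NULL)
        return -1;
    return 0;
}
```

For example, with a placeholder base of 10.0.0.0 (chosen for illustration only, not the actual ZeptoOS base address), rank 63 maps to 10.0.0.63 and rank 1023 maps to 10.0.3.255.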

To try this feature out, submit as a compute job the cn-ipfwd.sh script, which should have been installed in /path/to/install/cnbin/. The script can act as a standalone job or as a wrapper. If invoked without any arguments, it initializes the IP forwarding and then goes to sleep; if any arguments have been passed, they are interpreted as the name of the binary (along with its command line arguments) to invoke once the IP forwarding is initialized, e.g. (an example with Cobalt):

$ cqsub -k <profile-name> -t <time> -n 64 /path/to/install/cnbin/cn-ipfwd.sh \
<name of another binary> <arguments to that binary>

The script can be copied to another location and adjusted to one's needs.

Once the job is running, log into a compute node, and run ifconfig; there should be a new virtual network device tun1 (in addition to the usual tun0, used for IP forwarding between compute nodes and I/O nodes):

~ # ifconfig tun1
tun1      Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00  
          inet addr:  P-t-P:  Mask:
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:500 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
~ # ping
PING ( 56 data bytes
64 bytes from seq=0 ttl=64 time=0.321 ms
64 bytes from seq=1 ttl=64 time=0.191 ms
64 bytes from seq=2 ttl=64 time=0.203 ms
64 bytes from seq=3 ttl=64 time=0.194 ms
64 bytes from seq=4 ttl=64 time=0.207 ms
--- ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 0.191/0.223/0.321 ms
~ # rsh 'grep BG_RANK_IN_PSET /proc/personality.sh'
~ # 

This feature can be used to implement an arbitrary IP-based network protocol between the compute nodes. We have even experimented with running a TCP/IP-based MPICH on top of it (which, while obviously not as fast as the native Blue Gene one, has the advantage of being able to, e.g., run multiple MPI jobs at a time on a single partition).

One major disadvantage of this feature is that the current implementation is computationally intensive; it permanently occupies one core on each node.
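Because the compute nodes see tun1 as an ordinary network device, standard socket code should work over it unchanged. As a minimal sketch of an IP-based exchange between compute nodes, the following C function sends a single UDP datagram to a peer. The function name send_datagram, the port number, and the choice of UDP are our assumptions for illustration; the peer address would be the tun1 address of another compute node.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: send one UDP datagram to peer_ip:port.
 * Returns the number of bytes sent, or -1 on error. */
ssize_t send_datagram(const char *peer_ip, uint16_t port,
                      const void *msg, size_t len)
{
    struct sockaddr_in peer;
    ssize_t sent;
    int fd;

    memset(&peer, 0, sizeof peer);
    peer.sin_family = AF_INET;
    peer.sin_port = htons(port);
    if (inet_pton(AF_INET, peer_ip, &peer.sin_addr) != 1)
        return -1;

    if ((fd = socket(AF_INET, SOCK_DGRAM, 0)) == -1)
        return -1;
    sent = sendto(fd, msg, len, 0, (struct sockaddr *)&peer, sizeof peer);
    close(fd);
    return sent;
}
```

The receiving node would bind a SOCK_DGRAM socket to the same port and call recv(). Nothing in this code is specific to the torus; the kernel routes the packets through tun1 based on the destination address.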

ZOID glibc

This is another preview feature. It provides a modified version of GNU libc for the compute nodes, which features much better file I/O throughput rates to the I/O nodes and remote file systems than the default one. It does so by communicating with the ZOID daemon directly, instead of going through the compute node Linux kernel and the FUSE client (which, while convenient, is slow).

The modified glibc is meant for compiled application processes, not for shell scripts and such. It is currently only available in a static (.a) version. It is installed with the rest of ZeptoOS, in /path/to/install/lib/zoid/. To link with it, simply add -L/path/to/install/lib/zoid to the final linking stage. Use the following command to verify that the modified version of glibc has been used for linking:

$ nm <binary> | grep __zoid_init

(no output will be generated if the standard glibc was used)

When submitting a job linked with this glibc, please set the environment variable ZOID_DIRS to a list of :-separated pathname prefixes. Only files opened using pathnames beginning with those prefixes will be directly forwarded to the I/O node; other files will be handled via the compute node kernel and possibly FUSE, which is much slower.
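The prefix rule can be illustrated with a short C function. This is a sketch of the behavior described above, not code taken from the modified glibc; the function name path_matches_prefixes is hypothetical, and it assumes a plain string-prefix comparison (whether the real implementation respects path component boundaries is not stated here).

```c
#include <stddef.h>
#include <string.h>

/* Illustrative only: returns 1 if path begins with one of the
 * colon-separated prefixes in dirs (e.g. the value of ZOID_DIRS),
 * 0 otherwise. */
int path_matches_prefixes(const char *path, const char *dirs)
{
    const char *p = dirs;

    if (path == NULL || dirs == NULL)
        return 0;
    while (*p != '\0') {
        size_t len = strcspn(p, ":");   /* length of the current prefix */

        if (len > 0 && strncmp(path, p, len) == 0)
            return 1;
        p += len;
        if (*p == ':')
            p++;                        /* skip the separator */
    }
    return 0;
}
```

Under this rule, with ZOID_DIRS set to /home:/gpfs, the pathname /gpfs/data matches, while /tmp/x and the relative pathname out do not and would take the slower path through the compute node kernel.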

Here is a simple benchmark:

#include <stdio.h>
#include <stdlib.h>

#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

/* 1 GB */
#define BUFSIZE (1024 * 1024 * 1024)

int main(int argc, char* argv[])
{
    char* buffer;
    int fd;
    struct timeval start, stop;
    double time;

    if (argc != 2)
    {
        fprintf(stderr, "Usage: %s <pathname>\n", argv[0]);
        return 1;
    }

    if (!(buffer = malloc(BUFSIZE)))
        return 1;
    if ((fd = open(argv[1], O_CREAT | O_WRONLY, 0666)) == -1)
        return 1;

    /* Time a single large write() call. */
    gettimeofday(&start, NULL);
    if (write(fd, buffer, BUFSIZE) != BUFSIZE)
        return 1;
    gettimeofday(&stop, NULL);

    time = stop.tv_sec - start.tv_sec + (stop.tv_usec - start.tv_usec) * 1e-6;
    printf("Writing %d B took %g s, %g B/s\n", BUFSIZE, time, BUFSIZE / time);

    return 0;
}

It writes 1 GB of data to a file passed on the command line. With Cobalt, we submit it as follows:

$ cqsub -k <profile-name> -t 10 -n 1 -e ZOID_DIRS=$HOME $PWD/speed_zoid $HOME/speed_zoid-out

With our home directories on a GPFS filesystem, we get the following performance:

Writing 1073741824 B took 4.58026 s, 2.34428e+08 B/s

On the other hand, if we link it with the standard glibc, or if we forget to set ZOID_DIRS, the performance we observe is as follows:

Writing 1073741824 B took 10.4905 s, 1.02354e+08 B/s
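To compare the two runs, the throughput and speedup can be recomputed from the reported byte count and times. The two helper functions below are hypothetical and only restate the arithmetic that the benchmark itself performs:

```c
/* Bytes per second for a transfer of `bytes` bytes in `seconds` s. */
double throughput(double bytes, double seconds)
{
    return bytes / seconds;
}

/* How many times faster the first run is than the second. */
double speedup(double fast_seconds, double slow_seconds)
{
    return slow_seconds / fast_seconds;
}
```

With the times reported above, speedup(4.58026, 10.4905) comes to roughly 2.3, i.e., the ZOID glibc run wrote the same 1 GB a little over twice as fast.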

The modified glibc is not used by default because it is not yet complete. However, if one does not try to outsmart it (in particular, we recommend always passing absolute pathnames), it should work reliably.
