Other Packages

From ZeptoOS



PVFS

We include the PVFS version 2.8.1 source code and its prebuilt client binaries in the current ZeptoOS release. We also provide a very simple pvfs2 start-up script as an example. If you have a pvfs2 server running on your system, you can follow these steps to add the start-up script to the ION ramdisk:

$ cd packages/pvfs2/prebuilt
$ sh add-pvfs2-client-ION-ramdisk.sh  tcp://192.168.1.1:3334/pvfs2-fs  

Please replace tcp://192.168.1.1:3334/pvfs2-fs with your actual server info.

Details on building and running the pvfs2 server are beyond the scope of this document, but the following example should give you an idea of how to build and start it.

$ cd pvfs-2.8.1
$ ./configure           # may need some configure options
$ make
$ ./src/apps/admin/pvfs2-genconfig fs.conf
( you will be asked some basic information )
$ ./src/server/pvfs2-server -f fs.conf  -a  ALIAS   # create the storage space; replace ALIAS with your real alias
$ ./src/server/pvfs2-server fs.conf  -a  ALIAS      # start the server


IP over torus

This is currently a preview feature. It implements IP packet forwarding on top of MPI, over the torus network. The torus is a point-to-point network that interconnects all the compute nodes in a partition. Every compute node gets a unique IP address of the form:

10.128.0.0 | <rank>

where <rank> is the MPI rank. Thus, for a 64-node partition, the IP addresses will range between 10.128.0.0 and 10.128.0.63, and for a 1024-node partition, they will range between 10.128.0.0 and 10.128.3.255.

To try this feature out, submit the cn-ipfwd.sh script, which should have been installed in /path/to/install/cnbin/, as a compute job. The script can act as a standalone job or as a wrapper: if invoked without any arguments, it initializes the IP forwarding and then goes to sleep; if any arguments are passed, they are interpreted as the name of a binary (along with its command-line arguments) to invoke once the IP forwarding has been initialized, e.g. (an example with Cobalt):

$ cqsub -k <profile-name> -t <time> -n 64 /path/to/install/cnbin/cn-ipfwd.sh \
<name of another binary> <arguments to that binary>

The script can be copied to another location and adjusted to one's needs.

Once the job is running, log into a compute node, and run ifconfig; there should be a new virtual network device tun1 (in addition to the usual tun0, used for IP forwarding between compute nodes and I/O nodes):

~ # ifconfig tun1
tun1      Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00  
          inet addr:10.128.0.0  P-t-P:10.128.0.0  Mask:255.255.255.255
          UP POINTOPOINT RUNNING NOARP MULTICAST  MTU:65535  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:500 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
~ # ping 10.128.0.1
PING 10.128.0.1 (10.128.0.1): 56 data bytes
64 bytes from 10.128.0.1: seq=0 ttl=64 time=0.321 ms
64 bytes from 10.128.0.1: seq=1 ttl=64 time=0.191 ms
64 bytes from 10.128.0.1: seq=2 ttl=64 time=0.203 ms
64 bytes from 10.128.0.1: seq=3 ttl=64 time=0.194 ms
64 bytes from 10.128.0.1: seq=4 ttl=64 time=0.207 ms
--- 10.128.0.1 ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 0.191/0.223/0.321 ms
~ # rsh 10.128.0.1 'grep BG_RANK_IN_PSET /proc/personality.sh'
BG_RANK_IN_PSET=59
~ # 

This feature can be used to implement an arbitrary IP-based network protocol between the compute nodes. We have even experimented with running a TCP/IP-based MPICH on top of it (which, while obviously not as fast as the native Blue Gene MPI, has the advantage of being able to, e.g., run multiple MPI jobs at a time on a single partition).

One major disadvantage of this feature is that the current implementation is computationally intensive; it permanently occupies one core on each node.

ZOID glibc

This is another preview feature. It provides a modified version of GNU libc for the compute nodes, which features much better file I/O throughput rates to the I/O nodes and remote file systems than the default one. It does so by communicating with the ZOID daemon directly, instead of going through the Linux kernel and the FUSE client (which, while convenient, is slow).

The modified glibc is meant for compiled application processes, not for shell scripts and the like. It is currently only available as a static library (.a). It is installed with the rest of ZeptoOS, in /path/to/install/lib/zoid/. To link with it, simply add -L/path/to/install/lib/zoid to the final linking stage. Use the following command to verify that the modified version of glibc was used for linking:

$ nm <binary> | grep __zoid_init

(no output will be generated if the standard glibc was used)

When submitting a job linked with this glibc, please set the environment variable ZOID_DIRS to a list of :-separated pathname prefixes. Only files opened using pathnames beginning with those prefixes will be directly forwarded to the I/O node; other files will be handled via the compute node kernel and possibly FUSE, which is much slower.

Here is a simple benchmark:

#include <stdio.h>
#include <stdlib.h>

#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

#define BUFSIZE (1024 * 1024 * 1024)

int main(int argc, char* argv[])
{
    char* buffer;
    int fd;
    struct timeval start, stop;
    double time;

    if (argc != 2)
    {
	fprintf(stderr, "Usage: %s <pathname>\n", argv[0]);
	return 1;
    }

    if (!(buffer = malloc(BUFSIZE)))
    {
	perror("malloc");
	return 1;
    }
    if ((fd = open(argv[1], O_CREAT | O_WRONLY, 0666)) == -1)
    {
	perror("open");
	return 1;
    }
    gettimeofday(&start, NULL);
    if (write(fd, buffer, BUFSIZE) != BUFSIZE)
    {
	perror("write");
	return 1;
    }
    gettimeofday(&stop, NULL);
    close(fd);
    free(buffer);

    time = stop.tv_sec - start.tv_sec + (stop.tv_usec - start.tv_usec) * 1e-6;
    printf("Writing %d B took %g s, %g B/s\n", BUFSIZE, time, BUFSIZE / time);

    return 0;
}

It writes 1 GB of data to a file passed on the command line. With Cobalt, we submit it as follows:

$ cqsub -k <profile-name> -t 10 -n 1 -e ZOID_DIRS=$HOME $PWD/speed_zoid $HOME/speed_zoid-out

With our home directories on a GPFS filesystem, we get the following performance:

Writing 1073741824 B took 4.58026 s, 2.34428e+08 B/s

On the other hand, if we link it with the standard glibc, or if we forget to set ZOID_DIRS, the performance we observe is as follows:

Writing 1073741824 B took 10.4905 s, 1.02354e+08 B/s

The modified glibc is not used by default because it is not yet complete. However, as long as one does not try to outsmart it (in particular, we recommend always passing absolute pathnames), it should work reliably.

