Other Packages
PVFS
IP over torus
This is currently a preview feature. It implements IP packet forwarding on top of MPI, over the torus network. The torus is a point-to-point network that interconnects all the compute nodes in a partition. Every compute node gets a unique IP address of the form:
10.128.0.0 | <rank>
where <rank> is the MPI rank of the node and | denotes a bitwise OR with the base address. Thus, for a 64-node partition, the IP addresses will range between 10.128.0.0 and 10.128.0.63, and for a 1024-node partition, they will range between 10.128.0.0 and 10.128.3.255.
To try this feature out, submit the cn-ipfwd.sh script as a compute job; it should have been installed in /path/to/install/bin. The script initializes IP forwarding and then goes to sleep; feel free to adjust it to your needs. Log into a compute node and run ifconfig; there should be a new virtual tun1 network interface (in addition to the usual tun0, which is used for IP forwarding between compute nodes and I/O nodes):
~ # ifconfig tun1
tun1      Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
          inet addr:10.128.0.0  P-t-P:10.128.0.0  Mask:255.255.255.255
          UP POINTOPOINT RUNNING NOARP MULTICAST  MTU:65535  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:500
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

~ # ping 10.128.0.1
PING 10.128.0.1 (10.128.0.1): 56 data bytes
64 bytes from 10.128.0.1: seq=0 ttl=64 time=0.321 ms
64 bytes from 10.128.0.1: seq=1 ttl=64 time=0.191 ms
64 bytes from 10.128.0.1: seq=2 ttl=64 time=0.203 ms
64 bytes from 10.128.0.1: seq=3 ttl=64 time=0.194 ms
64 bytes from 10.128.0.1: seq=4 ttl=64 time=0.207 ms

--- 10.128.0.1 ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 0.191/0.223/0.321 ms
~ # rsh 10.128.0.1 'grep BG_RANK_IN_PSET /proc/personality.sh'
BG_RANK_IN_PSET=59
~ #
This feature can be used to implement an arbitrary IP-based network protocol between the compute nodes. We have even experimented with running a TCP/IP-based MPICH on top of it (which, while obviously not as fast as the native Blue Gene one, has the advantage of being able to, e.g., run multiple MPI jobs at a time on a single partition).
One major disadvantage of this feature is that the current implementation is computationally intensive; it permanently occupies one core on each node.
ZOID glibc
This is another preview feature. It provides a modified version of GNU libc for the compute nodes, which achieves much higher file I/O throughput to the I/O nodes and remote file systems than the default one. It does so by communicating with the ZOID daemon directly, instead of going through the Linux kernel and the FUSE client (which, while convenient, is slow).
The modified glibc is meant for compiled application processes, not for shell scripts and the like. It is currently only available as a static library (.a). It is installed with the rest of ZeptoOS, in /path/to/install/lib/zoid/. To link with it, simply add -L/path/to/install/lib/zoid to the final linking stage. Use the following command to verify that the modified version of glibc has been used for linking:
$ nm <binary> | grep __zoid_init
(no output will be generated if the standard glibc was used)
When submitting a job linked with this glibc, please set the environment variable ZOID_DIRS to a colon-separated list of pathname prefixes. Only files opened using pathnames beginning with one of those prefixes will be forwarded directly to the I/O node; other files will be handled via the compute node kernel and possibly FUSE, which is much slower.
Here is a simple benchmark:
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

/* 1 GB, matching the sample output below; the value still fits in an int. */
#define BUFSIZE (1024 * 1024 * 1024)

int main(int argc, char* argv[])
{
    char* buffer;
    int fd;
    struct timeval start, stop;
    double elapsed;

    if (argc != 2)
    {
        fprintf(stderr, "Usage: %s <pathname>\n", argv[0]);
        return 1;
    }

    /* The buffer contents are irrelevant; only the transfer is timed. */
    if (!(buffer = malloc(BUFSIZE)))
    {
        perror("malloc");
        return 1;
    }

    if ((fd = open(argv[1], O_CREAT | O_WRONLY, 0666)) == -1)
    {
        perror("open");
        return 1;
    }

    /* Time a single write() call of the whole buffer. */
    gettimeofday(&start, NULL);
    if (write(fd, buffer, BUFSIZE) != BUFSIZE)
    {
        perror("write");
        return 1;
    }
    gettimeofday(&stop, NULL);

    close(fd);
    free(buffer);

    /* Elapsed wall-clock time in seconds. */
    elapsed = stop.tv_sec - start.tv_sec +
              (stop.tv_usec - start.tv_usec) * 1e-6;
    printf("Writing %d B took %g s, %g B/s\n", BUFSIZE, elapsed,
           BUFSIZE / elapsed);

    return 0;
}
It writes 1 GB of data to a file passed on the command line. With Cobalt, we submit it as follows:
cqsub -k <profile-name> -t 10 -n 1 -e ZOID_DIRS=$HOME $PWD/speed_zoid $HOME/speed_zoid-out
With our home directories on a GPFS filesystem, we get the following performance:
Writing 1073741824 B took 4.58026 s, 2.34428e+08 B/s
On the other hand, if we link it with the default glibc, or if we forget to set ZOID_DIRS, the performance we observe is as follows:
Writing 1073741824 B took 10.4905 s, 1.02354e+08 B/s