Other Packages
Revision as of 16:10, 8 May 2009
PVFS
PVFS stands for Parallel Virtual File System, an open-source parallel file system designed to scale to petabytes of storage and to provide access rates of hundreds of GB/s. On Argonne BGP systems, PVFS servers are running and a PVFS start-up script is installed in the BGP site-specific directory (/bgp/iofs/), so a PVFS volume is mounted at ION boot time.
We include the PVFS version 2.8.1 source code and its prebuilt client binaries in the ZeptoOS release for sites interested in PVFS. We also include a very simple example PVFS start-up script that can be added to the ION ramdisk. If you have PVFS servers running on your system, you can follow the steps below to add the necessary PVFS client components to the ramdisk:
$ cd packages/pvfs2/prebuilt
$ sh add-pvfs2-client-ION-ramdisk.sh tcp://192.168.1.1:3334/pvfs2-fs
Please replace tcp://192.168.1.1:3334/pvfs2-fs with the actual server info.
Details on building and running the PVFS servers are outside the scope of this document, but the following example might give a basic idea of how to build and run the pvfs2 server.
[Build]
$ cd pvfs-2.8.1
$ ./configure [options....]
$ make

[Create a server config file]
$ ./src/apps/admin/pvfs2-genconfig fs.conf

[Start the server]
$ ./src/server/pvfs2-server -f fs.conf -a ALIAS
$ ./src/server/pvfs2-server fs.conf -a ALIAS
Note:
- replace ALIAS with your real alias in fs.conf
- the first pvfs2-server invocation just initializes a PVFS volume
- the second invocation actually starts the server
IP over torus
This is currently a preview feature. It implements IP packet forwarding on top of MPI, over the torus network. Torus is a point-to-point network that interconnects all the compute nodes in a partition. Every compute node gets a unique IP address, of the form:
10.128.0.0 | <rank>
where <rank> is the MPI rank. Thus, for a 64-node partition, the IP addresses will range between 10.128.0.0 and 10.128.0.63, and for a 1024-node partition, they will range between 10.128.0.0 and 10.128.3.255.
To try this feature out, submit as a compute job the cn-ipfwd.sh script, which should have been installed in /path/to/install/cnbin/. The script can act as a standalone job or as a wrapper. If invoked without any arguments, it initializes the IP forwarding and then goes to sleep; if any arguments have been passed, they are interpreted as the name of the binary (along with its command line arguments) to invoke once the IP forwarding is initialized, e.g. (an example with Cobalt):
$ cqsub -k <profile-name> -t <time> -n 64 /path/to/install/cnbin/cn-ipfwd.sh \
      <name of another binary> <arguments to that binary>
The script can be copied to another location and adjusted to one's needs.
Once the job is running, log into a compute node, and run ifconfig; there should be a new virtual network device tun1 (in addition to the usual tun0, used for IP forwarding between compute nodes and I/O nodes):
~ # ifconfig tun1
tun1      Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
          inet addr:10.128.0.0  P-t-P:10.128.0.0  Mask:255.255.255.255
          UP POINTOPOINT RUNNING NOARP MULTICAST  MTU:65535  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:500
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

~ # ping 10.128.0.1
PING 10.128.0.1 (10.128.0.1): 56 data bytes
64 bytes from 10.128.0.1: seq=0 ttl=64 time=0.321 ms
64 bytes from 10.128.0.1: seq=1 ttl=64 time=0.191 ms
64 bytes from 10.128.0.1: seq=2 ttl=64 time=0.203 ms
64 bytes from 10.128.0.1: seq=3 ttl=64 time=0.194 ms
64 bytes from 10.128.0.1: seq=4 ttl=64 time=0.207 ms

--- 10.128.0.1 ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 0.191/0.223/0.321 ms

~ # rsh 10.128.0.1 'grep BG_RANK_IN_PSET /proc/personality.sh'
BG_RANK_IN_PSET=59
~ #
This feature can be used to implement an arbitrary IP-based network protocol between the compute nodes. We have even experimented with running a TCP/IP-based MPICH on top of it (which, while obviously not as fast as the native Blue Gene one, has the advantage of being able to, e.g., run multiple MPI jobs at a time on a single partition).
One major disadvantage of this feature is that the current implementation is computationally intensive; it permanently occupies one core on each node.
ZOID glibc
This is another preview feature. It provides a modified version of GNU libc for the compute nodes, which features much better file I/O throughput rates to the I/O nodes and remote file systems than the default one. It does so by communicating with the ZOID daemon directly, instead of going through the Linux kernel and the FUSE client (which, while convenient, is slow).
The modified glibc is meant for compiled application processes, not for shell scripts and such. It is currently only available in a static (.a) version. It is installed with the rest of the ZeptoOS, in /path/to/install/lib/zoid/. To link with it, simply add -L/path/to/install/lib/zoid to the final linking stage. Use the following command to verify that the modified version of glibc has been used for linking:
$ nm <binary> | grep __zoid_init
(no output will be generated if the standard glibc was used)
When submitting a job linked with this glibc, please set the environment variable ZOID_DIRS to a colon-separated list of pathname prefixes. Only files opened using pathnames beginning with one of those prefixes will be forwarded directly to the I/O node; other files will be handled via the compute node kernel and possibly FUSE, which is much slower.
Here is a simple benchmark:
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

#define BUFSIZE (1024 * 1024 * 1024)

int main(int argc, char* argv[])
{
    char* buffer;
    int fd;
    struct timeval start, stop;
    double time;

    if (argc != 2)
    {
        fprintf(stderr, "Usage: %s <pathname>\n", argv[0]);
        return 1;
    }
    if (!(buffer = malloc(BUFSIZE)))
    {
        perror("malloc");
        return 1;
    }
    if ((fd = open(argv[1], O_CREAT | O_WRONLY, 0666)) == -1)
    {
        perror("open");
        return 1;
    }
    gettimeofday(&start, NULL);
    if (write(fd, buffer, BUFSIZE) != BUFSIZE)
    {
        perror("write");
        return 1;
    }
    gettimeofday(&stop, NULL);
    close(fd);
    free(buffer);
    time = stop.tv_sec - start.tv_sec + (stop.tv_usec - start.tv_usec) * 1e-6;
    printf("Writing %d B took %g s, %g B/s\n", BUFSIZE, time, BUFSIZE / time);
    return 0;
}
It writes 1 GB of data to a file passed on the command line. With Cobalt, we submit it as follows:
$ cqsub -k <profile-name> -t 10 -n 1 -e ZOID_DIRS=$HOME $PWD/speed_zoid $HOME/speed_zoid-out
With our home directories on a GPFS filesystem, we get the following performance:
Writing 1073741824 B took 4.58026 s, 2.34428e+08 B/s
On the other hand, if we link it with the standard glibc, or if we forget to set ZOID_DIRS, the performance we observe is as follows:
Writing 1073741824 B took 10.4905 s, 1.02354e+08 B/s
The modified glibc is not used by default, because it is not yet complete. However, if one does not try to outsmart it (in particular, we recommend always passing absolute pathnames), it should work reliably.