How to obtain a CN node number
This depends on what number one is interested in.
Pset rank
A pset rank is a number identifying a compute node within each pset (an I/O node and the compute nodes that communicate with it). Note that on partitions larger than one pset, the pset ranks will not be unique. Also, pset ranks do not start from 0; they start from 1 for some mysterious reason (do not blame us – blame IBM :-).
Pset rank is used as the last octet in the IP address on the tree network connecting the compute nodes and the I/O nodes (x in 192.168.1.x).
The pset rank is available on the compute nodes from /proc/personality.sh, in the BG_RANK_IN_PSET variable:
#!/bin/sh

. /proc/personality.sh

echo "My pset rank is $BG_RANK_IN_PSET"
From a C program, it is easier to use the binary personality available from /proc/personality. The definition of the structure can be found in /bgsys/drivers/ppcfloor/arch/include/common/bgp_personality.h. The pset rank is in Network_Config.RankInPSet:
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

#include <common/bgp_personality.h>

int main(void)
{
    _BGP_Personality_t personality;
    int fd;

    if ((fd = open("/proc/personality", O_RDONLY)) == -1) {
        perror("open");
        return 1;
    }

    if (read(fd, &personality, sizeof(personality)) != sizeof(personality)) {
        perror("read");
        close(fd);
        return 1;
    }

    close(fd);

    printf("My pset rank is %d\n", personality.Network_Config.RankInPSet);

    return 0;
}
(compile the above with -I/bgsys/drivers/ppcfloor/arch/include)
Torus rank
A torus rank is a number identifying a compute node within a whole partition. In a way, it is much "nicer" than a pset rank since it is unique within a job and it also starts from 0.
The torus rank is easy to obtain from a C program: it is the Network_Config.Rank field of the personality structure.
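For example, a program analogous to the pset rank one above will print it (a minimal sketch; only the printed field differs, and it is compiled with the same -I/bgsys/drivers/ppcfloor/arch/include flag):

/* Read the binary personality and print the torus rank
   (Network_Config.Rank) of this compute node. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

#include <common/bgp_personality.h>

int main(void)
{
    _BGP_Personality_t personality;
    int fd;

    if ((fd = open("/proc/personality", O_RDONLY)) == -1) {
        perror("open");
        return 1;
    }
    if (read(fd, &personality, sizeof(personality)) != sizeof(personality)) {
        perror("read");
        close(fd);
        return 1;
    }
    close(fd);

    printf("My torus rank is %d\n", personality.Network_Config.Rank);

    return 0;
}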
Unfortunately, the torus rank is not available in /proc/personality.sh, but a shell script can easily calculate it from other fields:
TORUS_RANK=`echo $BG_PSETORG | awk "{print \\$1 + \\$2 * $BG_XSIZE + \
                                            \\$3 * $BG_XSIZE * $BG_YSIZE}"`
MPI rank
MPI rank should not be confused with the torus rank, even though by default the two are the same. MPI rank is a property of a process, not of a node. If one submits a job in VN or DUAL mode, there will be multiple MPI tasks per node, each with a different MPI rank. Also, using the BG_MAPPING environment variable changes the mapping between torus coordinates and MPI ranks.
Obtaining the MPI rank from an MPI application is trivial, but how does one obtain it from a shell script?
One way would be to invoke a simple C program:
#include <stdio.h>

#include "zoid_api.h"

int main(void)
{
    if (__zoid_init())
        return 1;

    printf("%d\n", __zoid_my_rank());

    return 0;
}
(compile with -Ipath_to_ZeptoOS/packages/zoid/prebuilt -Lpath_to_ZeptoOS/packages/zoid/prebuilt -lzoid_cn)
A slight disadvantage of this approach is that __zoid_init registers the process with the ZOID daemon on the I/O node, which is an overhead we do not need. Another solution, without using any binaries, is as follows:
MPI_RANK=`echo $CONTROL_INIT | awk -F, '{print $4}'`
This has the disadvantage of relying on internal ZOID variables, which are not guaranteed to be supported in future releases.
How to open a socket from a CN to the outside world
ZOID provides IP packet forwarding between the compute nodes and the I/O nodes. However, because the compute nodes use non-routable IP addresses (192.168.1.x), they cannot communicate directly with the outside world.
The most transparent solution to this problem is to perform network address translation (NAT) on the I/O nodes using the Linux kernel netfilter infrastructure. We used to enable this by default, but experiments have shown it to have a detrimental effect on the overall performance of the TCP/IP stack on the I/O nodes, slowing down access to the network filesystems.
To enable the translation, pass the ZOID_NAT_ENABLE environment variable when submitting a job. An administrator can also enable this option permanently in the configuration file.
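Once the translation is active, an application on a compute node can reach external hosts with ordinary POSIX sockets. The sketch below is only illustrative; the host name and port are hypothetical placeholders, not anything provided by ZeptoOS:

/* Illustrative sketch: open an outbound TCP connection from a compute node.
   Assumes NAT has been enabled as described above; "login.example.gov" and
   port "5000" are hypothetical placeholders. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/types.h>
#include <sys/socket.h>

int main(void)
{
    struct addrinfo hints, *res;
    int fd;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;
    hints.ai_socktype = SOCK_STREAM;

    if (getaddrinfo("login.example.gov", "5000", &hints, &res) != 0) {
        fprintf(stderr, "getaddrinfo failed\n");
        return 1;
    }

    if ((fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol)) == -1) {
        perror("socket");
        return 1;
    }

    if (connect(fd, res->ai_addr, res->ai_addrlen) == -1) {
        perror("connect");
        return 1;
    }

    /* The connection now goes out through the I/O node, which rewrites
       the non-routable compute node address. */
    write(fd, "hello\n", 6);

    close(fd);
    freeaddrinfo(res);

    return 0;
}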
How to obtain a Cobalt job ID
Cobalt passes the job id to the application processes launched on the compute nodes using the COBALT_JOBID environment variable.
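From an application process this is an ordinary environment lookup; a minimal sketch in C:

/* Minimal sketch: print the Cobalt job ID passed to a compute node process. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *jobid = getenv("COBALT_JOBID");

    printf("Cobalt job ID: %s\n", jobid ? jobid : "(not set)");

    return 0;
}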
This variable is also accessible from the user script running on the I/O nodes, using the ZOID_JOB_ENV variable:
COBALT_JOBID=`echo $ZOID_JOB_ENV | sed 's/^.*COBALT_JOBID=\([^:]*\).*$/\1/'`
Why large MPI programs do not work
A common reason might be that they do not have enough memory to run. MPI programs run within the Big Memory region, which is limited to 256 MB by default. See the Kernel section to learn how to change that; the parameter to use is flatmemsizeMB.