Difference between revisions of "FAQ"

From ZeptoOS

Revision as of 18:17, 28 April 2009

How to obtain a compute node (CN) number

This depends on which number is needed: the pset rank, the torus rank, or the MPI rank.

Pset rank

A pset rank is a number identifying a compute node within each pset (an I/O node and the compute nodes that communicate with it). Note that on partitions larger than one pset, the pset ranks will not be unique. Also, pset ranks do not start from 0; they start from 1 for some mysterious reason (do not blame us – blame IBM :-).

Pset rank is used as the last octet in the IP address on the tree network connecting the compute nodes and the I/O nodes (x in 192.168.1.x).

The pset rank is available on the compute nodes from /proc/personality.sh, in the BG_RANK_IN_PSET variable:

#!/bin/sh

. /proc/personality.sh

echo "My pset rank is $BG_RANK_IN_PSET"
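Since the pset rank is the last octet of the tree network address, a script can derive the local address from it. A minimal sketch, assuming the 192.168.1. prefix quoted above; the function name pset_rank_to_ip is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: build the tree network address from a pset rank.
# The 192.168.1. prefix is the one quoted in the text above.
pset_rank_to_ip() {
    echo "192.168.1.$1"
}

# On a compute node, /proc/personality.sh provides BG_RANK_IN_PSET.
# Elsewhere the file does not exist and this part is skipped.
if [ -r /proc/personality.sh ]; then
    . /proc/personality.sh
    echo "My tree network address is $(pset_rank_to_ip "$BG_RANK_IN_PSET")"
fi
```

Keeping the conversion in a function makes it possible to try the script on a machine that has no /proc/personality.sh.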

From a C program it will be easier to use the binary personality available from /proc/personality. The definition of the structure can be found in /bgsys/drivers/ppcfloor/arch/include/common/bgp_personality.h. The pset rank is in Network_Config.RankInPSet:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <common/bgp_personality.h>

int main(void)
{
    _BGP_Personality_t personality;
    int fd;

    if ((fd = open("/proc/personality", O_RDONLY)) == -1)
    {
        perror("open");
        return 1;
    }
    if (read(fd, &personality, sizeof(personality)) != sizeof(personality))
    {
        perror("read");
        close(fd);
        return 1;
    }
    close(fd);

    printf("My pset rank is %d\n", personality.Network_Config.RankInPSet);

    return 0;
}

(compile the above with -I/bgsys/drivers/ppcfloor/arch/include)

Torus rank

A torus rank is a number identifying a compute node within a whole partition. In a way, it is much "nicer" than a pset rank since it is unique within a job and it also starts from 0.

The torus rank is easy to obtain from a C program: it is the Network_Config.Rank field of the personality structure.

Unfortunately, the torus rank is not available in /proc/personality.sh, but a shell script can easily calculate it from other fields:

TORUS_RANK=`echo $BG_PSETORG | awk "{print \\$1 + \\$2 * $BG_XSIZE + \
            \\$3 * $BG_XSIZE * $BG_YSIZE}"`
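The awk expression above linearizes three coordinates as x + y * XSIZE + z * XSIZE * YSIZE. The same arithmetic can be written with plain POSIX shell arithmetic, which may be easier to read and test. This is only a sketch: torus_rank is an illustrative name, and the coordinates and sizes in the example call are made up.

```shell
#!/bin/sh
# Hypothetical helper: linearize coordinates (x, y, z) into a rank,
# given the partition dimensions xsize and ysize.
torus_rank() {
    # usage: torus_rank x y z xsize ysize
    echo $(( $1 + $2 * $4 + $3 * $4 * $5 ))
}

# Made-up example: 1 + 2*8 + 3*8*8
torus_rank 1 2 3 8 8    # prints 209
```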

MPI rank

MPI rank should not be confused with a torus rank, even though by default the two are the same. MPI rank is a property of a process, not of a node. If a job is submitted in VN or DUAL mode, there are multiple MPI tasks per node, each with a different MPI rank. Also, setting the BG_MAPPING environment variable changes the mapping between the torus coordinates and MPI ranks.

Obtaining the MPI rank from an MPI application is trivial, but how can one obtain it from a shell script?

One way would be to invoke a simple C program:

#include <stdio.h>
#include "zoid_api.h"

int main(void)
{
    if (__zoid_init())
        return 1;
    printf("%d\n", __zoid_my_rank());
    return 0;
}

(compile with -Ipath_to_ZeptoOS/packages/zoid/prebuilt -Lpath_to_ZeptoOS/packages/zoid/prebuilt -lzoid_cn)

A slight disadvantage of this approach is that __zoid_init registers the process with the ZOID daemon on the I/O node, which is an overhead we do not need. Another solution, without using any binaries, is as follows:

MPI_RANK=`echo $CONTROL_INIT | awk -F, '{print $4}'`

This has the disadvantage of relying on internal ZOID variables, which are not guaranteed to be supported in future releases.
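Because the variable is internal, a script may want to check the extracted value before trusting it. A hedged sketch: the function name fourth_field_as_rank is hypothetical, and the sample string is made up, since the real layout of CONTROL_INIT is not documented here beyond the rank being the fourth comma-separated field.

```shell
#!/bin/sh
# Hypothetical helper: print the fourth comma-separated field of its argument,
# but only if that field is a non-negative integer.
fourth_field_as_rank() {
    field=$(echo "$1" | awk -F, '{print $4}')
    case "$field" in
        ''|*[!0-9]*) return 1 ;;   # empty or not a number: refuse to guess
        *) echo "$field" ;;
    esac
}

# Made-up sample value for illustration only.
fourth_field_as_rank "a,b,c,7"
```

In a job script this would be called as fourth_field_as_rank "$CONTROL_INIT", with a fallback for the case where it returns a non-zero status.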

How to open a socket from a CN to the outside world

ZOID provides IP packet forwarding between the compute nodes and the I/O nodes. However, because the compute nodes use non-routable IP addresses (192.168.1.x), they cannot communicate directly with the outside world.

The most transparent solution to this problem is to perform network address translation (NAT) on the I/O nodes using the Linux kernel netfilter infrastructure. We used to enable this by default, but experiments have shown it to have a detrimental effect on the overall performance of the TCP/IP stack on the I/O nodes, slowing down access to the network filesystems.

To enable the translation, pass the ZOID_NAT_ENABLE environment variable when submitting a job. An administrator can also enable this option permanently in the config file.

How to obtain a Cobalt job ID

Cobalt passes the job ID to the application processes launched on the compute nodes using the COBALT_JOBID environment variable.

This variable is also accessible from the user script running on the I/O nodes, using the ZOID_JOB_ENV variable:

COBALT_JOBID=`echo $ZOID_JOB_ENV | sed 's/^.*COBALT_JOBID=\([^:]*\).*$/\1/'`
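A more general lookup can be sketched as a small function. It assumes that ZOID_JOB_ENV holds a colon-separated list of NAME=value pairs, which the [^:] in the pattern above suggests but which is not confirmed here. The function name job_env_get and the sample string are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: look up NAME in a colon-separated NAME=value list.
# Prints the value of the first match, or nothing if NAME is absent.
job_env_get() {
    # usage: job_env_get NAME "LIST"
    echo "$2" | tr ':' '\n' | sed -n "s/^$1=//p" | head -n 1
}

# Made-up sample value for illustration only.
job_env_get COBALT_JOBID "HOME=/home/user:COBALT_JOBID=12345:FOO=bar"
```

Splitting the list into lines first means a missing variable yields an empty result instead of the whole input string.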
