From ZeptoOS


How to obtain a CN node number

This depends on what number one is interested in.

Pset rank

A pset rank is a number identifying a compute node within each pset (an I/O node and the compute nodes that communicate with it). Note that on partitions larger than one pset, the pset ranks will not be unique. Also, pset ranks do not start from 0; they start from 1 for some mysterious reason (do not blame us – blame IBM :-).

Pset rank is used as the last octet in the IP address on the tree network connecting the compute nodes and the I/O nodes (x in 192.168.1.x).

The pset rank is available on the compute nodes from /proc/personality.sh, in the BG_RANK_IN_PSET variable:


. /proc/personality.sh

echo "My pset rank is $BG_RANK_IN_PSET"
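Since the pset rank doubles as the last octet of the tree network address, a script can derive that address directly. The following is a minimal sketch: tree_ip is a hypothetical helper name, the upper bound of 254 is an assumption made only so the value fits in one octet, and a sample rank is substituted when BG_RANK_IN_PSET is not set so that the snippet also runs off the machine.

```shell
# tree_ip is a hypothetical helper, not part of ZeptoOS.
# It prints the tree network address for a given pset rank.
tree_ip() {
    rank=$1
    # pset ranks start from 1; the 254 limit is an assumption (one octet)
    if [ "$rank" -ge 1 ] 2>/dev/null && [ "$rank" -le 254 ]; then
        echo "192.168.1.$rank"
    else
        echo "invalid pset rank: $rank" >&2
        return 1
    fi
}

# On a compute node, run ". /proc/personality.sh" first.
# The sample value 17 is used only when the variable is unset.
BG_RANK_IN_PSET=${BG_RANK_IN_PSET:-17}
echo "My tree network address is $(tree_ip "$BG_RANK_IN_PSET")"
```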

From a C program it is easier to use the binary personality available from /proc/personality. The definition of the structure can be found in /bgsys/drivers/ppcfloor/arch/include/common/bgp_personality.h. The pset rank is in Network_Config.RankInPSet:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <common/bgp_personality.h>

int main(void)
{
    _BGP_Personality_t personality;
    int fd;

    /* Read the binary personality exported by the kernel. */
    if ((fd = open("/proc/personality", O_RDONLY)) == -1)
        return 1;
    if (read(fd, &personality, sizeof(personality)) != sizeof(personality)) {
        close(fd);
        return 1;
    }
    close(fd);

    printf("My pset rank is %d\n", personality.Network_Config.RankInPSet);

    return 0;
}

(compile the above with -I/bgsys/drivers/ppcfloor/arch/include)

Torus rank

A torus rank is a number identifying a compute node within a whole partition. In a way, it is much "nicer" than a pset rank since it is unique within a job and it also starts from 0.

The torus rank is easy to obtain from a C program: it is the Network_Config.Rank field of the personality structure.

Unfortunately, the torus rank is not available in /proc/personality.sh, but a shell script can easily calculate it from other fields:

TORUS_RANK=`echo $BG_PSETORG | awk "{print \\$1 + \\$2 * $BG_XSIZE + \
            \\$3 * $BG_XSIZE * $BG_YSIZE}"`
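To make the arithmetic concrete, here is a small sketch of the same formula, rank = x + y * XSIZE + z * XSIZE * YSIZE, written with shell arithmetic instead of awk. The function name torus_rank and the sample values are illustrative assumptions; on a compute node the real values would come from /proc/personality.sh.

```shell
# torus_rank is a hypothetical helper, not part of ZeptoOS.
# $1 $2 $3: X, Y, Z coordinates; $4 $5: X and Y sizes of the partition
torus_rank() {
    echo $(( $1 + $2 * $4 + $3 * $4 * $5 ))
}

# Sample values standing in for the ones set by /proc/personality.sh
SAMPLE_PSETORG="1 2 3"
SAMPLE_XSIZE=8
SAMPLE_YSIZE=8

# SAMPLE_PSETORG is left unquoted on purpose so that word splitting
# turns it into three separate arguments.
TORUS_RANK=$(torus_rank $SAMPLE_PSETORG "$SAMPLE_XSIZE" "$SAMPLE_YSIZE")
echo "$TORUS_RANK"   # 1 + 2*8 + 3*8*8 = 209
```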

MPI rank

An MPI rank should not be confused with the torus rank, even though by default the two are the same. MPI rank is a property of a process, not of a node. If one submits a job in the VN or DUAL mode, there will be multiple MPI tasks per node, each with a different MPI rank. Also, using the BG_MAPPING environment variable changes the mapping between the torus coordinates and MPI ranks.

While obtaining the MPI rank from an MPI application is trivial, how can one obtain it from a shell script?

One way would be to invoke a simple C program:

#include <stdio.h>
#include "zoid_api.h"

int main(void)
{
    if (__zoid_init())
        return 1;
    printf("%d\n", __zoid_my_rank());
    return 0;
}

(compile with -Ipath_to_ZeptoOS_source/packages/zoid/prebuilt -Lpath_to_ZeptoOS_source/packages/zoid/prebuilt -lzoid_cn)

A slight disadvantage of this approach is that __zoid_init registers the process with the ZOID daemon on the I/O node, which is an overhead we do not need. Another solution, without using any binaries, is as follows:

MPI_RANK=`echo $CONTROL_INIT | awk -F, '{print $4}'`

This has the disadvantage of relying on internal ZOID variables, which are not guaranteed to be supported in future releases.
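Because the variable is internal, a script may want to check that the extracted value actually looks like a rank before using it. The sketch below wraps the awk one-liner in such a check. The helper name mpi_rank_or_fail is hypothetical, and the sample string is only a placeholder: the real contents of CONTROL_INIT are not documented here, beyond the rank being the fourth comma-separated field.

```shell
# mpi_rank_or_fail is a hypothetical helper, not part of ZeptoOS.
# $1: the value of CONTROL_INIT
mpi_rank_or_fail() {
    rank=$(echo "$1" | awk -F, '{print $4}')
    case $rank in
        ''|*[!0-9]*)
            # empty or non-numeric: the format is not what we expected
            echo "cannot determine MPI rank from CONTROL_INIT" >&2
            return 1 ;;
    esac
    echo "$rank"
}

# Placeholder value; only the position of the fourth field matters here.
SAMPLE_CONTROL_INIT="a,b,c,5"
MPI_RANK=$(mpi_rank_or_fail "$SAMPLE_CONTROL_INIT") && echo "MPI rank: $MPI_RANK"
```

In a real script one would pass "$CONTROL_INIT" instead of the placeholder and fall back to another method if the helper fails.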

How to open a socket from a CN to the outside world

ZOID provides IP packet forwarding between the compute nodes and the I/O nodes. However, because the compute nodes use non-routable IP addresses (192.168.1.x), they cannot communicate directly with the outside world.

The most transparent solution to this problem is to perform network address translation (NAT) on the I/O nodes using the Linux kernel netfilter infrastructure. We used to enable this by default, but experiments have shown it to have a detrimental effect on the overall performance of the TCP/IP stack on the I/O nodes, slowing down access to the network filesystems.

To enable the translation, pass the ZOID_ENABLE_NAT environment variable when submitting a job. An administrator can also enable this option permanently in the config file.

How to obtain a Cobalt job ID

Cobalt passes the job id to the application processes launched on the compute nodes using the COBALT_JOBID environment variable.

This variable is also accessible from the user script running on the I/O nodes, using the ZOID_JOB_ENV variable:

COBALT_JOBID=`echo $ZOID_JOB_ENV | sed 's/^.*COBALT_JOBID=\([^:]*\).*$/\1/'`
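One caveat with a plain sed substitution is that, when the pattern does not match, sed prints its input unchanged, so a missing COBALT_JOBID would yield the whole environment string. A slightly more defensive sketch uses sed -n with the p flag, which prints nothing in that case. The helper name job_env_get is hypothetical, and the sample string assumes colon-separated NAME=value pairs, which is inferred from the [^:]* in the expression above rather than from a format specification.

```shell
# job_env_get is a hypothetical helper, not part of ZeptoOS.
# $1: variable name, $2: a ZOID_JOB_ENV-style string
job_env_get() {
    # -n together with the p flag prints only when the substitution matched
    echo "$2" | sed -n "s/^.*$1=\([^:]*\).*\$/\1/p"
}

# Sample string; the colon-separated layout is an assumption.
SAMPLE_JOB_ENV="FOO=bar:COBALT_JOBID=12345:BAZ=qux"

JOBID=$(job_env_get COBALT_JOBID "$SAMPLE_JOB_ENV")
if [ -n "$JOBID" ]; then
    echo "Cobalt job id: $JOBID"
else
    echo "COBALT_JOBID not found" >&2
fi
```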

Why large MPI processes do not work

A common reason is that they do not have enough memory to run. MPI processes run within the big memory region, which by default is limited to just 256 MB so as not to deplete the ordinary Linux paged memory pool too much (main memory is allocated to the big memory region at boot time and cannot be reclaimed by the kernel, even if it is unused).

See the Kernel section to learn how to increase the limit; the parameter to use is flatmemsizeMB. We suggest creating multiple profiles with different big memory sizes to accommodate different uses of ZeptoOS.

Why SSH keeps asking for a password

As we envisioned it, partition owners should be able to log in to the I/O nodes belonging to their jobs without being asked for a password. The following considerations apply:

  1. The account information for the partition owner must be added to the /etc/passwd file on the I/O nodes when launching a job; this is discussed here.
  2. For password-less logins, shosts.equiv must be configured before (re)building the I/O node ramdisk, as discussed here. Alternatively, users could set up SSH key pairs in their home directories (without a passphrase, or using ssh-agent to cache the passphrase).
  3. SSH might temporarily prevent a partition owner from logging in if an attempt is made before the job starts running, as discussed here. Root can always log in, by providing the password set when building the I/O node ramdisk for the first time.
  4. Finally, keep in mind that a particular site might have disabled this feature on purpose.