Establishing Socket Connections
NOTE: this document is very rough, incomplete, and was created by analyzing existing code rather than writing code to follow this. As such, take any statements in here with a big grain of salt.
Sock conn protocol is a related document, although it was created independently from this document.
Socket Connections
Connections in MPICH2 are established as necessary, providing better scalability and reducing startup time. In addition, this approach reduces the consumption of Unix file descriptors; there are a limited number of these (as few as 1024 in some systems), and while this may seem like a lot, clusters with more nodes than this are becoming common.
The connection process is best described with a state diagram and the events that cause transitions between states. When the MPI dynamic process features are included, a connection may be made, closed, and reopened, possibly many times during a computation.
Basic events that change state
- Receive connection request
- Receive close request
- Receive EOF (unexpected close)
Basic States
- Unconnected
- Connected
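The two basic states and three basic events listed above can be sketched as a small transition function. This is only an illustration: the enum and function names are hypothetical (they do not appear in the MPICH2 source), and the choice to leave the state unchanged for event/state pairs the note does not cover is an assumption.

```c
#include <assert.h>

/* Hypothetical names: the note lists only the states and the events. */
typedef enum { ST_UNCONNECTED, ST_CONNECTED } basic_state_t;
typedef enum { EV_CONNECT_REQUEST, EV_CLOSE_REQUEST, EV_EOF } basic_event_t;

/*
 * Return the state that follows `s` when event `e` arrives.
 * Pairs the note does not describe (for example, a connection request
 * received while already connected) leave the state unchanged here;
 * that is an assumption, not documented behavior.
 */
static basic_state_t basic_next_state(basic_state_t s, basic_event_t e)
{
    switch (s) {
    case ST_UNCONNECTED:
        if (e == EV_CONNECT_REQUEST)
            return ST_CONNECTED;
        break;
    case ST_CONNECTED:
        /* Both an orderly close and an unexpected EOF drop the connection. */
        if (e == EV_CLOSE_REQUEST || e == EV_EOF)
            return ST_UNCONNECTED;
        break;
    }
    return s;
}
```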
ToDo: expand the states to include all transitions, including failure and connection requests received when the connection is not in the unconnected state (e.g., while closing and reopening).
Also ToDo: describe related information (e.g., PMI process group info).
Likely additional states include
- wait for connect info
- connect handshake
- wait for close handshake
A common problem is two processes simultaneously opening connections to each other. The socket code assumes that sockets are bidirectional, so each pair of connected processes needs only one socket, not one socket for each member of the pair.
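The accept-side walkthrough later in this note resolves such a race by comparing process groups and ranks. A minimal sketch of that comparison follows. The function name is hypothetical, and ordering process groups by strcmp on their id strings is an assumption; the note only writes "mypg < pg".

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Decide whether the accepting process keeps an incoming connection.
 * vc_has_conn: true if this side has already started its own connection
 *              for the same VC (the head-to-head case).
 * Assumption: process groups are ordered by strcmp on their id strings.
 */
static bool keep_incoming_conn(bool vc_has_conn,
                               const char *my_pg_id, int my_rank,
                               const char *peer_pg_id, int peer_rank)
{
    if (!vc_has_conn)
        return true;                 /* not head to head */

    int cmp = strcmp(my_pg_id, peer_pg_id);
    if (cmp < 0)
        return true;                 /* mypg < pg */
    if (cmp == 0 && my_rank < peer_rank)
        return true;                 /* same pg, lower rank accepts */
    return false;                    /* refuse: respond with ack = false */
}
```

With ranks 0 and 1 in the same process group, rank 0 keeps the connection that rank 1 opened and rank 1 refuses the one that rank 0 opened, so exactly one socket survives.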
ToDo: refactor the states and state machine into a clear set of VC connection states and connection states.
There are three related objects used during a connection event. They are the connection itself (a structure specific to the communication method, sockets in the case of this note), the virtual connection, and the process group to which the virtual connection belongs. Note that the reference counts on the VC and the process group are independent of whether there is a connection; the reference counts on the VC indicate how many communicators refer to that VC; the reference count on the process group indicates how many VCs are part of some communicator. Thus, none of the connection operations (whether open or close) change the reference count on either a VC or a process group.
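The independence of the reference counts from the connection can be illustrated with a small sketch. All type and function names here are hypothetical and the structures are reduced to the fields the paragraph mentions; the real MPICH2 structures carry much more.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced versions of the three objects. */
typedef struct { int ref_count; } pg_t;   /* VCs that are part of some communicator */
typedef struct { int fd; } conn_t;        /* method-specific connection (a socket here) */
typedef struct {
    int     ref_count;                    /* communicators that refer to this VC */
    pg_t   *pg;                           /* process group the VC belongs to */
    conn_t *conn;                         /* NULL while no connection exists */
} vc_t;

/* Opening a connection attaches it to the VC; no reference count changes. */
static void vc_attach_conn(vc_t *vc, conn_t *conn)
{
    vc->conn = conn;
}

/* Closing a connection detaches it; again no reference count changes. */
static void vc_detach_conn(vc_t *vc)
{
    vc->conn = NULL;
}
```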
The following describes the process of establishing a connection, based on analyzing the code (note that this should have been documented first rather than created in an ad hoc fashion).
Connect side
Note that the VC must already exist (all VCs are created with the associated process group).
An attempt to send a message detects that the VC is in state
VC_STATE_UNCONNECTED (e.g., in ch3_istartmsg). Then:
save message in a request added to the send queue (SendQ_enqueue(vc))
VC_post_connect(vc)
set vc->state to VC_STATE_CONNECTING
get connection string from PG, get connection info from string
Connection_alloc(&conn)
allocates space for Connection_t and pgid
we need the pgid to identify the VC (??)
(note: instead of pgid, connections should simply
point at the PG object)
(note: this step should initialize the conn fields)
Sock_post_connect_ifaddr(conn)
creates the socket, sets sock opts, allocates internal sock
structure, adds the socket to the poll list
execute system connect(sock)
set sock state to CONNECTED_RW, CONNECTING, or (on error)
DISCONNECTED
returns the socket object (MPIDU_Sock *)
conn->state = CONN_STATE_CONNECTING
init conn fields (note: move into alloc)
(at this point, wait for the connect to become ready, which will cause a
SOCK_OP_CONNECT event. Thus:)
in ch3_progress, in
Handle_sock_event
if (event->op == MPIDU_SOCK_OP_CONNECT)
# Note that when we get a connection request, we don't yet know
# what VC this connection is for. We get that information
# by being sent the pg_id for the process group and the
# rank of the VC within that process group.
Sockconn_handle_connect_event()
(note that this routine checks for event error; not the right place)
if (conn->state == CONN_STATE_CONNECTING)
conn->state = CONN_STATE_OPEN_CSEND
initialize a packet contained within the conn structure to
PKT_SC_OPEN_REQ, send length of pg_id and pg_rank
(note: for hetero, these need to be in fixed byteorder, length)
connection_post_send_pkt_and_pgid
also sends the pgid itself
(note: perhaps this and related routines should be a
general ch3 routine)
(note: should move formation of all data to send into a
single routine)
Sock_post_writev() for this (pkt + pg_id)
return to handle_sock_event
if (event->op == MPIDU_SOCK_OP_WRITE)
if (!conn->send_active)
(assumes finishing connection write)
Sockconn_handle_connwrite()
if (state == CONN_STATE_OPEN_CSEND)
conn->state = CONN_STATE_OPEN_CRECV
connection_post_recv_pkt
return to handle_sock_event
if (event->op == MPIDU_SOCK_OP_READ)
if (pkt_type == PKT_SC_OPEN_RESP)
if (pkt->ack is true)
conn->state = CONN_STATE_CONNECTED
vc->state = VC_STATE_CONNECTED
connection_post_recv_pkt(conn)
connection_post_sendq_req(conn)
If we had enqueued a send of a request, start it
else
# Close connection because this was closed on the
# other end (?), probably because it lost the
# head-to-head connection race.
conn->state = CONN_STATE_CLOSING
Sock_post_close(conn->sock)
conn->vc = NULL
# note: vc state itself is unchanged (discarding this
# connection, not the associated vc)
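The connect-side sequence above amounts to a short chain of connection states driven by socket events. The sketch below condenses it into one transition function. The state names follow the note, but the enum types, the shortened event names, and the function itself are hypothetical, and error handling is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Connection states named in the connect-side walkthrough. */
typedef enum {
    CONN_STATE_CONNECTING,
    CONN_STATE_OPEN_CSEND,
    CONN_STATE_OPEN_CRECV,
    CONN_STATE_CONNECTED,
    CONN_STATE_CLOSING
} conn_state_t;

/* Socket events that drive the handshake (names shortened for the sketch). */
typedef enum { SOCK_OP_CONNECT, SOCK_OP_WRITE, SOCK_OP_READ } sock_op_t;

/*
 * Return the connection state after event `op`.
 * `ack` is the ack field of the PKT_SC_OPEN_RESP packet; it is only
 * consulted for a read that arrives in CONN_STATE_OPEN_CRECV.
 */
static conn_state_t connect_side_next(conn_state_t s, sock_op_t op, bool ack)
{
    switch (s) {
    case CONN_STATE_CONNECTING:      /* connect() completed: send open request */
        if (op == SOCK_OP_CONNECT)
            return CONN_STATE_OPEN_CSEND;
        break;
    case CONN_STATE_OPEN_CSEND:      /* request written: wait for the response */
        if (op == SOCK_OP_WRITE)
            return CONN_STATE_OPEN_CRECV;
        break;
    case CONN_STATE_OPEN_CRECV:      /* response read: accepted or refused */
        if (op == SOCK_OP_READ)
            return ack ? CONN_STATE_CONNECTED : CONN_STATE_CLOSING;
        break;
    default:
        break;
    }
    return s;                        /* other combinations are not modeled */
}
```

On the refused path only the connection moves to CONN_STATE_CLOSING; as the note says, the VC state itself is left unchanged.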
Accept side
As part of initialization, an accept is issued on a listener socket.
CH3I_Progress_init sets up the listener socket (note: this code should be
moved into the ch3u_connect_sock file and global variables
made local (static) there). This posts a socket state for OP_ACCEPT. A
connect call made to the listener on this process starts this
sequence:
Progress_handle_sock_event
if (event->op == MPIDU_SOCK_OP_ACCEPT)
Sockconn_handle_accept_event()
Allocate connection (same routine as in connect)
MPIDU_Sock_accept( conn )
executes accept, set sock opts
(note: there should be a common routine for setting the
sock opts on connect and accept)
initialize sock and associated poll structures
initialize conn fields
conn->state = OPEN_LRECV_PKT
connection_post_recv_pkt
Sock_post_read
adds this buffer to pending reads on this FD
return from handle
(note: no code checks for the conn state to be OPEN_LRECV_PKT.)
Progress_handle_sock_event
if (event->op == MPIDU_SOCK_OP_READ)
if (conn->state not recognized) (note: bad code style)
Sockconn_handle_conn_event( conn ) (conn comes from user_ptr in
event)
if (conn->pkt is PKT_SC_OPEN_REQ)
(added check that conn state == OPEN_LRECV_PKT)
conn->state = OPEN_LRECV_DATA
(read the process group id)
Sock_post_read(pg_id,pkt->pg_id_len)
(a non-blocking read)
return to handle_sock_event
Progress_handle_sock_event
if (event->op == MPIDU_SOCK_OP_READ)
if (conn->state == OPEN_LRECV_DATA)
Sockconn_handle_connopen_event(conn)
The conn->pg_id field is now set.
(find the corresponding process group. We are
guaranteed to find the pg)
MPIDI_PG_Find(conn->pg_id,&pg).
the connection pkt still contains the pg_rank for this
connection
Find the corresponding virtual connection (note that on
an accept operation, we don't know until this point the
vc for this connection request)
MPIDI_PG_Get_vc(pg,pg_rank,&vc);
(at this point, we need to check for head-to-head connections,
since we may already be attempting to form this VC, having
originated a connection from this side).
if (vc->conn == NULL || (mypg < pg) ||
(pg == mypg && myrank < pg_rank of conn) )
not head to head OR winner of head-to-head.
Continue with connection
VC state is now initialized to VC_STATE_CONNECTING
vc->conn is set to this connection, and the associated
sock is also set
conn->vc = vc
In all cases, return an ack:
conn->state = OPEN_LSEND (note, even when refusing connection)
conn->pkt = MPIDI_CH3I_PKT_SC_OPEN_RESP
pkt.ack = true if accepting, false otherwise
Sock_post_write(pkt)
if (event->op == MPIDU_SOCK_OP_WRITE)
if (conn->state == OPEN_LSEND)
finished sending response packet.
if (conn.pkt->ack is true)
(note: this should use the same code as the connect branch)
conn->state = CONN_STATE_CONNECTED
connection_post_recv_pkt
connection_post_sendq_req
vc->state = VC_STATE_CONNECTED
else
conn->state = CONN_STATE_CLOSING
Sock_post_close(conn->sock)
This primarily enqueues a SOCK_OP_CLOSE event
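The accept-side sequence above can likewise be condensed into a small state machine. In this sketch the state names follow the note, while the struct, the shortened event names, and the function are hypothetical. The `keep` argument stands in for the outcome of the head-to-head check, and the first read is assumed to have delivered a PKT_SC_OPEN_REQ packet.

```c
#include <assert.h>
#include <stdbool.h>

/* Connection states named in the accept-side walkthrough. */
typedef enum {
    CONN_STATE_OPEN_LRECV_PKT,
    CONN_STATE_OPEN_LRECV_DATA,
    CONN_STATE_OPEN_LSEND,
    CONN_STATE_CONNECTED,
    CONN_STATE_CLOSING
} conn_state_t;

typedef enum { SOCK_OP_WRITE, SOCK_OP_READ } sock_op_t;

/* Hypothetical reduced connection: the state plus the ack to be sent. */
typedef struct {
    conn_state_t state;
    bool         ack;
} accept_conn_t;

/*
 * Advance the accepted connection by one socket event.
 * `keep` is the result of the head-to-head check and is only consulted
 * when the process group id has just been read.
 */
static void accept_side_step(accept_conn_t *c, sock_op_t op, bool keep)
{
    switch (c->state) {
    case CONN_STATE_OPEN_LRECV_PKT:   /* open request read: now read the pg id */
        if (op == SOCK_OP_READ)
            c->state = CONN_STATE_OPEN_LRECV_DATA;
        break;
    case CONN_STATE_OPEN_LRECV_DATA:  /* pg id read: a response is always sent */
        if (op == SOCK_OP_READ) {
            c->ack = keep;
            c->state = CONN_STATE_OPEN_LSEND;
        }
        break;
    case CONN_STATE_OPEN_LSEND:       /* response written: connect or close */
        if (op == SOCK_OP_WRITE)
            c->state = c->ack ? CONN_STATE_CONNECTED : CONN_STATE_CLOSING;
        break;
    default:
        break;
    }
}
```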
On a close sock event: (to do; this isn't ready since it is clear that additional events occur before we'd get to this point)
case SOCK_OP_CLOSE:
Sockconn_handle_close_event( conn )
conn->vc->ch.state = STATE_UNCONNECTED
Handle_connection(vc,EVENT_TERMINATED)
(EVENT_TERMINATED is the only event type,
we should integrate this with the other connection
state change events)
switch (vc->state)
VC_STATE_CLOSE_ACKED: (this must be set because the
default generates an error)
Note also that mpid_finalize.c contains some vc close code (most of this, including all code that performs state changes, should not be in this file).
Unresolved questions. There are at least three sets of states:
conn->state - defined in ch3/channels/*/include/mpidi_ch3_impl.h
vc->state - defined in ch3/include/mpidpre.h
vc->ch.state - defined in ch3/channels/*/include/mpidi_ch3_pre.h
Why is there a separate vc state and vc channel state? Are the vc->ch.state values really different from vc->state, and how do we ensure that changes to vc->state and vc->ch.state are made consistently?
OLD
Here is a guess at the state transitions in the current implementation.
On accept side:
CONN_STATE_OPEN_LRECV_PKT
VC_STATE_CONNECTING
CONN_STATE_OPEN_LSEND
Enqueuing accept connection
CONN_STATE_CONNECTED
VC_STATE_CONNECTED
Dequeueing accept connection
VC_STATE_LOCAL_CLOSE
VC_STATE_CLOSE_ACKED
VC_STATE_CLOSE_ACKED (yes, state was set twice) (? perhaps separate vc?)
CONN_STATE_CLOSING
VC_STATE_UNCONNECTED
VC_STATE_INACTIVE
CONN_STATE_CLOSED
( this appears to be the connection used to establish the original
intercomm; the connection is closed for some reason)
( then a ch3_istartmsg causes the following )
posting connect and enqueuing request
VC_STATE_CONNECTING
CONN_STATE_CONNECTING
CONN_STATE_OPEN_CSEND
CONN_STATE_OPEN_CRECV
CONN_STATE_OPEN_LRECV_DATA
