
Exascale Launch


Theory of Operation

The launch begins at the point of allocation. A broadcast is sent to all daemons involved in the allocation via the overlay network. Daemons on nodes involved in the allocation fork/exec a session daemon, with one designated as the shepherd for the job and the others serving in the lamb role. The shepherd is responsible for coordinating the application, including executing any error management strategies if problems arise and reporting the application's resource usage at the end of the job. Each session daemon is given a static endpoint that other daemons in the session can use to connect into the overlay network topology.
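
As a rough illustration of that role split, the sketch below shows a node reacting to the launch broadcast by forking a session daemon as either shepherd or lamb. The message layout, helper names, and the "orcm-session" executable are assumptions for illustration, not ORCM's actual interfaces.

```c
/* Sketch only: illustrates the shepherd/lamb split at launch time.
 * The message layout, helper names, and the "orcm-session" executable
 * are hypothetical, not ORCM's actual interfaces. */
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

typedef enum { ROLE_SHEPHERD, ROLE_LAMB } session_role_t;

typedef struct {
    char jobid[32];
    char shepherd_node[64];   /* node designated to coordinate the job      */
    char static_endpoint[64]; /* endpoint this session daemon will expose
                                 (assigned per-daemon in the real system)   */
} launch_msg_t;

/* Fork/exec one session daemon for this job on the local node. */
static pid_t spawn_session_daemon(const launch_msg_t *msg, const char *hostname)
{
    session_role_t role =
        (0 == strcmp(hostname, msg->shepherd_node)) ? ROLE_SHEPHERD : ROLE_LAMB;

    pid_t pid = fork();
    if (0 == pid) {
        /* Child: become the session daemon. The shepherd additionally
         * coordinates the app, runs any error management strategy, and
         * reports resource usage at the end of the job. */
        execlp("orcm-session", "orcm-session",
               "--jobid", msg->jobid,
               "--role", (ROLE_SHEPHERD == role) ? "shepherd" : "lamb",
               "--endpoint", msg->static_endpoint,
               (char *)NULL);
        perror("execlp");
        _exit(1);
    }
    return pid;
}

int main(void)
{
    launch_msg_t msg = { "job-1234", "node001", "tcp://node001:7000" };
    spawn_session_daemon(&msg, "node001");   /* this node is the shepherd */
    return 0;
}
```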

No communication is allowed between the session and root daemons. Instead, the root daemon creates a shared memory segment and "drops" the application info into it, passing the address and ownership to each session daemon upon launch. This includes the allocation of endpoint resources for all procs on the local node, and for all procs on remote nodes. Typically this is homogeneous by design, but it can be heterogeneous if required. If endpoints cannot be allocated (or were not allocated for ORCM's use), then the PMIx wireup system will be used instead.
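
A minimal sketch of the shared memory "drop" follows, assuming POSIX shm_open/mmap and a hypothetical segment name and struct layout; ORCM's actual formats and hand-off details differ.

```c
/* Minimal sketch of the root daemon "dropping" application info into a
 * shared memory segment. The segment name, struct layout, and hand-off
 * are assumptions for illustration; ORCM's actual formats differ.
 * (Compile with -lrt on older glibc.) */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

typedef struct {
    int  num_procs;        /* total procs in the job                      */
    int  procs_per_node;   /* homogeneous by design, per the text above   */
    char endpoints[256];   /* pre-allocated endpoint table (opaque here)  */
} app_info_t;

int main(void)
{
    const char *seg = "/orcm_app_info";   /* hypothetical segment name */

    /* Root daemon side: create and populate the segment. */
    int fd = shm_open(seg, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, sizeof(app_info_t)) < 0) { perror("ftruncate"); return 1; }

    app_info_t *info = mmap(NULL, sizeof(*info),
                            PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (MAP_FAILED == info) { perror("mmap"); return 1; }

    info->num_procs = 128;
    info->procs_per_node = 16;
    strncpy(info->endpoints, "endpoint table would go here",
            sizeof(info->endpoints) - 1);

    /* The segment name (and thus ownership) is then handed to each session
     * daemon at launch, e.g. on its command line or in its environment;
     * the session daemon shm_open()s the same name and mmap()s it. */
    printf("dropped application info into %s\n", seg);

    munmap(info, sizeof(*info));
    close(fd);
    return 0;
}
```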

Each session daemon parses the application information to determine the number of processes in the job and any other required details. The daemon then begins mapping the job to determine which procs are local to it. As each process is mapped, local procs are transferred to the ODLS for immediate launch while the mapper continues to process the application, allowing launch to proceed in parallel with mapping.
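
A sketch of that interleaved map-and-launch loop is below, assuming a simple round-robin placement across nodes; the odls_launch() stub is a hypothetical stand-in for the hand-off to the ODLS.

```c
/* Sketch: interleave mapping with launch so local procs start as soon as
 * they are placed, instead of waiting for the whole map. The placement
 * policy (round-robin across nodes) and the odls_launch() stub are
 * illustrative assumptions. */
#include <stdio.h>

#define NUM_NODES 4
#define NUM_PROCS 16
#define MY_NODE   2   /* index of the node this session daemon runs on */

/* Stand-in for handing a local proc to the ODLS for immediate fork/exec. */
static void odls_launch(int rank, int local_rank)
{
    printf("launching rank %d (local rank %d) now\n", rank, local_rank);
}

int main(void)
{
    int local_rank = 0;

    for (int rank = 0; rank < NUM_PROCS; rank++) {
        int node = rank % NUM_NODES;          /* place this rank          */
        if (MY_NODE == node) {
            odls_launch(rank, local_rank++);  /* launch it immediately    */
        }
        /* Remote placements are still recorded (e.g. in the shared memory
         * database described below), but nothing is launched for them.  */
    }
    return 0;
}
```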

As the mapper computes proc locations, it populates a shared memory database on the node with all known information about each process. This includes the location of the process, any pre-assigned endpoints, its binding locality, and its node and local rank. This requires homogeneous nodes, which is typical for large-scale systems.
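
The per-process records in that database might look something like the sketch below; only the categories of data (location, pre-assigned endpoint, binding locality, node and local rank) come from the description above, while the field names, types, and sizes are assumptions.

```c
/* Sketch of a per-process record in the node-local shared memory database.
 * Only the categories of data come from the text; names, types, and sizes
 * are assumptions. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t jobid;
    uint32_t rank;          /* process rank within the job                */
    uint32_t node_index;    /* which node the process lives on            */
    uint16_t node_rank;     /* rank of the process on its node            */
    uint16_t local_rank;    /* rank among local procs of the same job     */
    uint64_t binding_mask;  /* binding locality, e.g. a core bitmask      */
    char     endpoint[64];  /* pre-assigned fabric endpoint, if any       */
} proc_info_t;

int main(void)
{
    /* The daemon mmap()s one array of these records, indexed by rank, and
     * every local process attaches read-only: one copy per node rather
     * than one per process. Homogeneous nodes let the same fixed-size
     * record describe procs on every node, local or remote. */
    printf("record size: %zu bytes per proc\n", sizeof(proc_info_t));
    return 0;
}
```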

There are a few key elements to the revised wireup plan, some of which are implemented in ORCM and some in OMPI:

  • more scalable barrier algorithms in the RTE. We took a first step towards this with the PMIx modifications, and Mellanox is working on the next phase. I have a final follow-on phase planned that will further reduce the time required for a cross-cluster barrier. Once that is complete, we expect the barrier to execute in very short time intervals (we will provide numbers as we measure them).

  • RM management of connection information. Much of the data we modex around is actually "static" - e.g., the NIC-level GIDs and LIDs we send for InfiniBand. These do change upon restart of the respective fabric manager, but that is a rare and detectable occurrence. Now that the BTLs are in the OPAL layer, and thus accessible from outside of MPI, ORCM's daemons query the BTLs for this non-job-level information and include it in their inventory report. Each daemon therefore has access to all of that info at time-zero, and there is no longer a need to include it in the modex. The table is provided to each process by the daemon on a shared memory basis to minimize the memory footprint; a lookup sketch follows this list.

  • RM assignment of rendezvous endpoints. Most fabrics either utilize connectionless endpoints or have a rendezvous protocol whereby two procs can exchange a handshake for dynamic assignment of endpoints. In either case, ORCM's daemons will query the fabric at startup to identify assignable endpoints, and then assign them on a node-rank basis to processes as they are launched. Procs can then use the provided table to look up the endpoint info for any peer in the system and connect to it as desired. The table is again provided by the daemon on a shared memory basis; a sketch of the node-rank assignment follows this list.

  • conversion of BTLs to only call modex_recv on the first message instead of at startup. This is required not only to improve startup times, but also to control the memory footprint as we get to ever larger scale. In practice, most MPI apps involve each rank talking to only a small subset of its peers, so having each rank "grab" all endpoint info for every other process in the job at startup wastes both time and memory, as most of that data will never be used. Some of the BTLs have already been modified for this mode of operation, and the PMIx change takes advantage of it when only those BTLs are used. This becomes less of an issue for fabrics where the RM can fully manage endpoints, but it is a significant enhancement for situations where we cannot do so. A sketch of this lazy lookup follows this list.

  • distributed mapping. Right now, mpirun computes the entire process map and sends it to all the daemons, which then extract their list of local processes to start. This results in a fairly large launch message. The next update, coming in a couple of weeks, will shift that burden to the daemons by having mpirun send only the app_context info. Each daemon will independently compute the map, spawning its own local procs as they are computed. This will result in a much smaller launch message and help further parallelize the procedure; a sketch of the idea follows this list.

  • use of faster transports. ORCM's daemons will utilize the fabric during job startup to move messages and barriers around, and then fall back to the OOB transport for any communication during job execution in order to minimize interference with the application.
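
For the "RM management of connection information" item, the idea is roughly this: the daemons gather the static NIC-level data once at inventory time and publish it as a read-only table, so a process can find a peer's fabric address without any modex. The record layout and lookup below are hypothetical, and a static array stands in for the shared memory segment.

```c
/* Sketch: node-level inventory of static fabric data (e.g. InfiniBand
 * GIDs/LIDs) published once by the daemons and shared read-only with all
 * local procs. The record layout and lookup are illustrative assumptions. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    char     nodename[64];
    uint16_t lid;       /* InfiniBand LID captured at inventory time       */
    uint8_t  gid[16];   /* InfiniBand GID; static until the fabric manager
                           restarts, which is rare and detectable          */
} fabric_inventory_t;

/* In practice this table lives in a daemon-provided shared memory segment;
 * a small static array stands in for it here. */
static const fabric_inventory_t inventory[] = {
    { "node001", 0x0001, { 0 } },
    { "node002", 0x0002, { 0 } },
};

static const fabric_inventory_t *lookup_node(const char *nodename)
{
    for (size_t i = 0; i < sizeof(inventory) / sizeof(inventory[0]); i++) {
        if (0 == strcmp(inventory[i].nodename, nodename)) {
            return &inventory[i];
        }
    }
    return NULL;   /* unknown node: fall back to a dynamic exchange */
}

int main(void)
{
    const fabric_inventory_t *peer = lookup_node("node002");
    if (NULL != peer) {
        printf("peer LID 0x%04x known at time-zero, no modex needed\n",
               peer->lid);
    }
    return 0;
}
```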
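
For the "RM assignment of rendezvous endpoints" item, the essence is a deterministic mapping from (node, node rank) to one of the endpoints the daemon discovered at startup, so any proc can obtain a peer's endpoint from the shared table without a handshake. The assignment rule and names below are assumptions.

```c
/* Sketch: assign fabric endpoints on a node-rank basis and let any peer
 * look them up. The endpoint representation and the modulo assignment
 * rule are illustrative assumptions. */
#include <stdint.h>
#include <stdio.h>

#define ENDPOINTS_PER_NODE 16   /* assignable endpoints found at startup */

typedef struct {
    uint32_t node_index;
    uint16_t node_rank;
    uint32_t endpoint_id;   /* index into the fabric's endpoint pool */
} endpoint_entry_t;

/* Daemon side: give the proc with node rank N the Nth assignable endpoint. */
static endpoint_entry_t assign_endpoint(uint32_t node, uint16_t node_rank)
{
    endpoint_entry_t e = { node, node_rank, node_rank % ENDPOINTS_PER_NODE };
    return e;
}

/* Proc side: find the endpoint of any peer from (node, node rank). In the
 * real system this reads the shared memory table published by the daemon;
 * here we simply recompute the deterministic assignment. */
static uint32_t peer_endpoint(uint32_t node, uint16_t node_rank)
{
    return assign_endpoint(node, node_rank).endpoint_id;
}

int main(void)
{
    printf("peer on node 3, node rank 5 -> endpoint %u\n",
           peer_endpoint(3, 5));
    return 0;
}
```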
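
For the "modex_recv on first message" item, the pattern is simply lazy resolution with a per-peer cache: nothing is fetched for peers a rank never talks to. The resolve_peer() helper below stands in for a modex_recv-style query and is hypothetical.

```c
/* Sketch: resolve a peer's endpoint info on the first message to that peer
 * rather than for every peer at startup, caching the result. */
#include <stdio.h>

#define NUM_PROCS 1024

typedef struct {
    int resolved;      /* 0 until we have talked to this peer */
    int endpoint_id;   /* filled in lazily on first use       */
} peer_t;

static peer_t peers[NUM_PROCS];   /* most entries are never populated */

/* Stand-in for the modex_recv-style query against the daemon's tables. */
static int resolve_peer(int rank)
{
    return 1000 + rank;   /* pretend lookup result */
}

static void send_to(int rank, const char *msg)
{
    if (!peers[rank].resolved) {
        /* First message to this peer: fetch its endpoint info now. Peers
         * we never talk to cost neither startup time nor memory.        */
        peers[rank].endpoint_id = resolve_peer(rank);
        peers[rank].resolved = 1;
    }
    printf("send '%s' to rank %d via endpoint %d\n",
           msg, rank, peers[rank].endpoint_id);
}

int main(void)
{
    send_to(7, "hello");   /* triggers the lazy lookup */
    send_to(7, "again");   /* uses the cached endpoint */
    return 0;
}
```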
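
For the "distributed mapping" item, the key property is that the map is a pure function of the app_context plus the already-known node list, so every daemon derives the identical map from a small launch message. A toy illustration follows, with heavily simplified structures and a placeholder placement rule.

```c
/* Sketch: mpirun ships only the app_context; every daemon runs the same
 * deterministic mapper and launches just its own procs. The app_context
 * fields and the placement rule are simplified placeholders. */
#include <stdio.h>

#define NUM_NODES 4

typedef struct {
    const char *executable;
    int         num_procs;   /* the only sizing info in the launch message */
} app_context_t;

/* The same function runs on every daemon: which node hosts a given rank. */
static int node_of_rank(const app_context_t *app, int rank)
{
    (void)app;                 /* a real mapper would consult the policy  */
    return rank % NUM_NODES;   /* placeholder round-robin placement       */
}

int main(void)
{
    app_context_t app = { "a.out", 16 };

    /* Pretend each loop iteration is a different daemon: all of them reach
     * the same placement even though only the tiny app_context was sent. */
    for (int my_node = 0; my_node < NUM_NODES; my_node++) {
        printf("node %d launches ranks:", my_node);
        for (int rank = 0; rank < app.num_procs; rank++) {
            if (my_node == node_of_rank(&app, rank)) {
                printf(" %d", rank);
            }
        }
        printf("\n");
    }
    return 0;
}
```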
