Enabling Scalable SAMR Implementations


Parallel implementations of SAMR methods lead to interesting challenges in dynamic resource allocation, data distribution, load balancing, and runtime management. Balancing these conflicting requirements is necessary for achieving large-scale scalability of SAMR applications, but identifying appropriate trade-offs is non-trivial. This is especially true for SAMR applications with highly dynamic, localized features that require high SAMR efficiency, as they lead to deep hierarchies and small regions of refinement with unfavorable computation-to-communication ratios. The primary objective of this research is to address the scalability requirements of such applications on large numbers of processors (up to 1024 processors). We present a SAMR runtime engine consisting of two components: (1) a hierarchical partitioning framework, a stack of partitioners that can address the space-time heterogeneity and dynamism of the SAMR grid hierarchy. The partitioners build on a locality-preserving space-filling-curve-based representation of the SAMR grid hierarchy and enhance it based on localized requirements to minimize synchronization costs within a level (level-based partitioning), balance load (bin-packing-based partitioning), or reduce partitioning costs (greedy partitioner); and (2) a SAMR communication substrate that is sensitive to the implementation of the MPI non-blocking communication primitives and is optimized to reduce communication overheads.

The SAMR runtime engine, as well as its individual components, is experimentally evaluated on the IBM SP2 supercomputer using a 3-D Richtmyer-Meshkov (RM3D) compressible turbulence kernel. RM3D is characterized by highly localized solution features, resulting in small patches and deep application grid hierarchies (i.e., a small region is very highly refined in space and time). As a result, the application has increasing computational workloads and greater communication/synchronization requirements at higher refinement levels, with unfavorable computation-to-communication ratios. These characteristics significantly limit RM3D scalability on large numbers of processors. The SAMR runtime engine addresses these runtime challenges in synchronization, load balance, locality, and communication to realize scalable RM3D implementations.

SAMR Hierarchical Partitioning Framework

The hierarchical partitioning framework consists of a stack of partitioners that can manage the space-time heterogeneity and dynamism of the SAMR grid hierarchy. These dynamic partitioning algorithms are based on a core Composite Grid Distribution Strategy (CGDS) belonging to the GrACE (Grid Adaptive Computational Engine) SAMR infrastructure. CGDS uses space-filling curves (SFCs) and partitions the entire SAMR domain into subdomains such that each subdomain keeps all its refinement levels together as a single composite grid unit. Thus, all inter-level communications are local to a subdomain and the inter-level communication time is greatly reduced. The resulting composite grid unit list (GUL) for the overall domain must now be partitioned and balanced across processors. Once the grid hierarchy index space is mapped using the SFC+CGDS scheme, higher-level partitioning techniques can be applied in a hierarchical manner using the hierarchical partitioning algorithm (HPA). In HPA, processor groups are defined based on the dynamic grid hierarchy structure and correspond to regions of the overall computational domain. The top processor group partitions the global GUL obtained initially and assigns portions to each processor subgroup in a hierarchical manner. In this way, HPA further localizes communication to subgroups, reduces global communication and synchronization costs, and enables concurrent communication. Within each processor subgroup, higher-level partitioning strategies are then applied based on the local requirements of the SAMR grid hierarchy subdomain. The objective could be to minimize synchronization costs within a level using the level-based partitioning algorithm (LPA), efficiently balance load using the bin-packing based partitioning algorithm (BPA), reduce partitioning costs using the greedy partitioning algorithm (GPA), or combinations of the above.
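The SFC-based mapping underlying CGDS can be illustrated with a small sketch. The Python fragment below (hypothetical names, not the actual GrACE code) computes Morton/Z-order keys, one common space-filling curve, and orders coarse-grid subdomains along the curve so that spatially nearby subdomains, each of which would carry its full column of refinement levels as one composite unit, end up adjacent in the resulting grid unit list:

```python
def morton_key(x, y, z, bits=8):
    """Interleave the bits of (x, y, z) to form a Z-order (Morton) index.
    Nearby indices correspond to spatially nearby cells, which is the
    locality property the CGDS mapping relies on."""
    key = 0
    for b in range(bits):
        key |= ((x >> b) & 1) << (3 * b)
        key |= ((y >> b) & 1) << (3 * b + 1)
        key |= ((z >> b) & 1) << (3 * b + 2)
    return key

# Order coarse-grid subdomain origins along the space-filling curve; in the
# real framework each entry would be a composite grid unit spanning all levels.
subdomains = [(3, 1, 0), (0, 0, 0), (1, 2, 3), (2, 2, 2)]
gul = sorted(subdomains, key=lambda c: morton_key(*c))
```

Because the curve preserves locality, contiguous segments of this ordered list map to compact regions of the domain, which is what lets the higher-level partitioners cut the list into per-processor portions without scattering a subdomain's refinement levels.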

Greedy Partitioning Algorithm

GrACE uses a default greedy partitioning algorithm to partition the global GUL and produce a local GUL for each processor. The GPA scheme performs a rapid partitioning of the SAMR grid hierarchy as it scans the global GUL only once while attempting to distribute the load equally among all processors. GPA helps in reducing partitioning costs and works quite well for a relatively homogeneous computational domain with few levels of relatively uniform refinement, and small-to-medium scale application runs. However, due to the greedy nature of the algorithm, GPA tends to result in heavy loading of later processors.
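As a rough illustration, the single-scan behavior of GPA can be sketched as follows (hypothetical names; grid units are reduced to abstract workload numbers, not the actual GrACE data structures):

```python
def greedy_partition(gul, nprocs):
    """Single scan of the composite grid unit list: assign units to
    processors in SFC order until each reaches its share of the load."""
    total = sum(work for _, work in gul)
    target = total / nprocs
    parts = [[] for _ in range(nprocs)]
    p, acc = 0, 0.0
    for unit, work in gul:
        parts[p].append(unit)
        acc += work
        # close off this processor once it has roughly its fair share;
        # any surplus spills onto the later processors
        if acc >= target and p < nprocs - 1:
            p, acc = p + 1, 0.0
    return parts
```

Because each processor is closed off as soon as it reaches its share, any surplus accumulates on the last processors, which is the source of the heavy loading of later processors noted above.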

Level-based Partitioning Algorithm

The computational workload for a certain patch of the SAMR application is tightly coupled to the refinement level at which the patch exists. The computational workload at a finer level is considerably greater than that at coarser levels. The level-based partitioning algorithm (LPA) attempts to simultaneously balance load and minimize synchronization cost. LPA essentially preprocesses the global application computational units represented by a global GUL, disassembles them based on their refinement levels, and feeds the resulting homogeneous units at each refinement level to GPA (or any other partitioning/load-balancing scheme). The GPA scheme then partitions this list to balance the workload. Due to the preprocessing, the load on each refinement level is also balanced. LPA benefits from the SFC-based technique by maintaining parent-child relationships throughout the composite grid and localizing inter-level communications, while simultaneously balancing the load at each refinement level, which reduces the synchronization cost.
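A minimal sketch of the LPA preprocessing step, assuming each grid unit carries its refinement level and workload (hypothetical names, with a simple one-pass greedy scan standing in for GPA):

```python
from collections import defaultdict

def greedy(units, nprocs):
    """One-pass greedy scan (the GPA step used inside LPA)."""
    target = sum(work for _, work in units) / nprocs
    parts, p, acc = [[] for _ in range(nprocs)], 0, 0.0
    for name, work in units:
        parts[p].append(name)
        acc += work
        if acc >= target and p < nprocs - 1:
            p, acc = p + 1, 0.0
    return parts

def level_partition(gul, nprocs):
    """LPA sketch: disassemble composite units by refinement level, balance
    each level's now-homogeneous list separately, then recombine the
    per-level assignments for each processor."""
    by_level = defaultdict(list)
    for name, level, work in gul:
        by_level[level].append((name, work))
    parts = [[] for _ in range(nprocs)]
    for level in sorted(by_level):
        for p, chunk in enumerate(greedy(by_level[level], nprocs)):
            parts[p].extend(chunk)
    return parts
```

Since each level is balanced independently, no processor waits disproportionately at the per-level synchronization points, which is how the preprocessing reduces synchronization cost.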

Bin-packing based Partitioning Algorithm

The bin-packing based partitioning algorithm (BPA) attempts to improve the load balance during the SAMR partitioning phase, where the computational workload associated with a GUL at different refinement levels of the SAMR hierarchy is distributed among available processors. The distribution is performed under constraints such as the minimum patch size and the aspect ratio. BPA distributes the workload among processors as long as the processor work threshold is not exceeded. Patches with a workload larger than the threshold limit are split into smaller patches. Unallocated work is first distributed using a ``best-fit'' approach. If this fails, the ``most-free'' approach is adopted until all the work is allocated. BPA allows the user to set a ``tolerance'' value that determines the acceptable workload imbalance for the SAMR application. In the case of BPA, the load imbalance, if any, is low since it is bounded by the tolerance threshold. Due to the underlying bin-packing algorithm, the BPA technique provides overall better load balance for the grid hierarchy partitions among processors as compared to the GPA scheme. However, a large number of patches may be created as a result of multiple patch divisions. Also, the load distribution strategy in BPA can result in multiple scans of the grid unit list, which marginally increases the partitioning overheads.
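The BPA distribution can be sketched as follows. This is a simplified model (hypothetical names; patches are bare workload numbers, splitting just halves them, and the real partitioner also honours aspect-ratio constraints when dividing patches):

```python
def binpack_partition(works, nprocs, tolerance=0.1, min_size=1):
    """Bin-packing sketch: patches heavier than the per-processor threshold
    are split; pieces go best-fit first, then most-free as a fallback."""
    target = sum(works) / nprocs
    threshold = target * (1 + tolerance)   # tolerance bounds the imbalance
    # split patches that exceed the threshold, respecting a minimum size
    queue, pieces = list(works), []
    while queue:
        w = queue.pop()
        if w > threshold and w / 2 >= min_size:
            queue += [w / 2, w / 2]
        else:
            pieces.append(w)
    loads = [0.0] * nprocs
    for w in sorted(pieces, reverse=True):
        # best fit: the fullest processor that still stays under threshold
        candidates = [p for p in range(nprocs) if loads[p] + w <= threshold]
        if candidates:
            p = max(candidates, key=lambda q: loads[q])
        else:
            # most free: fall back to the least-loaded processor
            p = min(range(nprocs), key=lambda q: loads[q])
        loads[p] += w
    return loads
```

The repeated candidate scans in the placement loop mirror the multiple passes over the grid unit list mentioned above, and the splitting step shows why BPA can create many small patches.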

Optimizing SAMR Communication Substrate

Due to irregular load distributions and communication requirements across different levels of the grid hierarchy, parallel SAMR implementations make extensive use of MPI non-blocking primitives to reduce synchronization overheads. Analysis of the MPI implementation on the IBM SP2 demonstrates that the time spent waiting for processes to synchronize, rather than network latency, is the major source of communication overhead. This problem is particularly significant in applications where communications are not completely synchronized and there is some load imbalance, which is typical of SAMR applications. Increasing the MPI eager limit on the IBM SP2 is not a scalable solution, since increasing the value of this environment variable increases the MPI memory usage per processor, thus reducing the amount of overall memory available to the SAMR application. A more scalable strategy is to address this at the application level by appropriately positioning the MPI_Isend, MPI_Irecv, MPI_Wait on the send side (Ws), and MPI_Wait on the receive side (Wr) calls. The basic optimization strategy consists of delaying Ws until after Wr, as shown in the figures below.
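The effect of the call ordering can be illustrated with a toy two-process model. Under a pure rendezvous protocol (no eager sends), a send-side wait can only complete once the peer has posted its matching receive; the sketch below (hypothetical, not the actual substrate code) steps two processes through their call schedules and shows that the naive ordering stalls while the delayed-waits ordering completes:

```python
def simulate(schedule_a, schedule_b, max_steps=20):
    """Step two processes through their MPI call schedules under a pure
    rendezvous model: Ws needs the peer's Irecv posted, Wr needs the
    peer's Isend posted; Isend/Irecv merely post and continue."""
    posted = [set(), set()]          # calls each process has issued so far
    scheds, pc = [schedule_a, schedule_b], [0, 0]
    for _ in range(max_steps):
        if pc[0] == len(schedule_a) and pc[1] == len(schedule_b):
            return "completed"
        progressed = False
        for p in (0, 1):
            if pc[p] == len(scheds[p]):
                continue
            op, peer = scheds[p][pc[p]], 1 - p
            if op == "Ws" and "Irecv" not in posted[peer]:
                continue             # send wait stalls until peer's Irecv
            if op == "Wr" and "Isend" not in posted[peer]:
                continue             # receive wait stalls until peer's Isend
            posted[p].add(op)
            pc[p] += 1
            progressed = True
        if not progressed:
            return "stalled"         # no process can advance
    return "completed" if pc == [len(schedule_a), len(schedule_b)] else "stalled"

NAIVE   = ["Isend", "Ws", "Irecv", "Wr"]   # wait on send before posting recv
DELAYED = ["Isend", "Irecv", "Wr", "Ws"]   # delay Ws until after Wr
```

In practice the naive ordering does not actually deadlock, because messages below the eager limit complete without a matching receive; but as the model suggests, it ties each process's progress to the peer's synchronization point, which is precisely the overhead the delayed-waits optimization removes.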

Evaluation: LPA and BPA techniques

The RM3D evaluation for LPA and BPA is performed on 64 processors of Blue Horizon using the GrACE infrastructure. RM3D uses a base grid of 128*32*32 and 3 levels of factor 2 space-time refinements, with regridding performed every 8 time-steps at each level. The RM3D application executes for 100 iterations, and the total execution time, synchronization (Sync) time, recompose time, average maximum load imbalance, and the number of boxes are measured for each of the following three configurations: (1) the default GrACE partitioner (GPA), (2) the LPA scheme using GrACE, and (3) the combined LPA+BPA technique using GrACE. The figure below shows the effect of the LPA and BPA partitioning optimizations on RM3D application performance. Note that the values plotted in the figure are normalized against the corresponding maximum value. The results demonstrate that the LPA scheme helps to reduce application synchronization time while the BPA technique provides better load balance. The combined approach reduces overall execution time and results in improved performance.

Evaluation: ``Delayed waits'' optimization

The evaluation for the ``delayed waits'' optimization consists of measuring the message passing and application execution times for the RM3D application with and without the optimization in the SAMR communication substrate. Except for the MPI non-blocking optimization, all other application-specific and refinement-specific parameters are kept constant. RM3D uses a base grid of size 256*64*64 and 3 levels of factor 2 space-time refinements, with regridding performed every 8 time-steps at each level, and the application executes for 100 iterations. The figure below compares the execution times and communication times for the two configurations on 64, 128, and 256 processors. The ``delayed waits'' optimization helps to reduce overall execution time, primarily due to the decrease in message passing time. In this evaluation, the reduction in communication time is 44.37% on average and results in improved application performance.

Evaluation: Overall RM3D Scalability

The scalability of the RM3D application execution time, computation time, and synchronization time on 256, 512, and 1024 processors is illustrated in the figures below. Clearly, the LPA partitioning, bin-packing based load-balancing, and MPI non-blocking communication optimization techniques enable scalable RM3D runs with multiple refinement levels on large numbers of processors. Though the RM3D computation time scales quite well, the scalability of the synchronization time is limited by the highly communication-dominated application behavior and unfavorable computation-to-communication ratios. Also, note that a ``unigrid'' implementation of the RM3D application for the same experimental settings would use a domain of size 1024*256*256 with approximately 67 million computational grid points and would require extremely large amounts of memory. Thus, unigrid implementations are not feasible for large-scale execution of the RM3D application.



To evaluate the benefits of using SAMR, another experiment compares RM3D application performance on 512 processors of Blue Horizon while varying the coarse base grid size and the number of refinement levels. All other application-specific and refinement-specific parameters are kept constant. Note that for every step at the coarse level (level 0), two steps are taken at the first refinement level (level 1), four steps at level 2, and so on. The evaluation is performed for two configurations with the same resolution (8000 steps) at the finest level of the SAMR grid hierarchy. The RM3D domain size at the finest level is 1024*256*256, and the evaluation uses the LPA+BPA technique with a minimum block size of 16. The first configuration (Run 1) uses a coarse grid of size 128*32*32 with 4 levels of factor 2 space-time refinements and executes 1000 coarse-level iterations, which correspond to 8000 steps at the finest level. The second configuration (Run 2) uses a 64*16*16 base grid and 5 refinement levels, and runs for 500 coarse-level iterations to achieve the same resolution at the finest level. As shown in the table above, Run 2 provides better performance (reduced execution and compute times) than Run 1. The improvement in RM3D application performance for the 5-level hierarchy is due to the efficiency of the basic Berger-Oliger SAMR algorithm.
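The equivalence of the two configurations at the finest level follows directly from the factor 2 space-time refinement: each additional level doubles both the spatial resolution and the stepping rate, so the finest-level step count is the number of coarse iterations times 2^(levels-1). A quick check (illustrative helper, not part of the experimental setup):

```python
def finest_steps(coarse_iters, levels, refine=2):
    """Steps taken at the finest level of a factor-`refine` space-time
    hierarchy: each extra level multiplies the stepping rate by `refine`."""
    return coarse_iters * refine ** (levels - 1)

# Run 1: 128*32*32 base, 4 levels, 1000 coarse iterations -> 8000 fine steps
# Run 2: 64*16*16 base, 5 levels, 500 coarse iterations -> 8000 fine steps
# Both also reach the same 1024*256*256 finest-level domain:
#   128 * 2**(4-1) == 64 * 2**(5-1) == 1024
```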