Simulations of complex physical phenomena, modeled by systems of partial differential equations (PDEs), play an important role in science and engineering. Dynamic structured adaptive mesh refinement (SAMR) methods have emerged as attractive formulations of these simulations on structured grids. SAMR techniques have been used to solve complex systems of PDEs that exhibit localized features in various application domains including computational fluid dynamics, numerical relativity, astrophysics, combustion simulation, subsurface modeling and oil reservoir simulation.
Structured grids usually employ regular data structures that are easier to partition and lead to regular access and communication patterns. Hence, structured formulations of parallel scientific simulations can result in relatively simpler, more efficient, and more scalable implementations. Parallelization of these applications typically consists of partitioning the structured grid into uniform blocks and allowing processors to compute on these blocks in parallel. Several existing structured grid infrastructures, such as GrACE, SAMRAI, Chombo, and Paramesh, address SAMR partitioning challenges and support parallel adaptive implementations. However, these frameworks typically assume that the computational effort at each grid point is the same and that the workload at any level of the structured grid is uniformly distributed.
There exists a class of scientific applications involving reactive flows, such as the simulation of hydrocarbon flames with detailed chemistry, where the physical models include transport/structured processes with uniform computational loads as well as reactive/pointwise processes with varying workloads. In these simulations, the solution of pointwise processes, at each iteration, requires a different number of sub-cycles to complete the computation at each grid point within a single global timestep. As a result, the computational load varies at different grid points and is known only locally and at runtime. Therefore, traditional parallelization approaches are not suitable for these simulations, and their parallel/distributed implementations present significant partitioning and runtime challenges. To address these challenges, we have developed the Dispatch dynamic structured partitioning strategy, which has been applied to parallel uniform and adaptive formulations of scientific applications with computational heterogeneity.
Simulations with Computational Heterogeneity
Parallel structured implementations of PDE-based simulations typically assume a uniform workload per grid point, and use numerical schemes that require all points to march forward together in time, in lock-step. However, in problems involving reactive flows, physical point processes are coupled to other similar point processes through a second process. Reactive processes operate at different timescales than convective and diffusive processes, and hence can be considered approximately decoupled. This approximation is exploited in operator-split integration methods.
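The operator-split idea can be sketched in a few lines: within one global timestep, the structured transport operator is applied to the whole field, and the approximately decoupled reactive operator is then integrated independently at each grid point. The function names, the first-order splitting order, and the callback signatures below are illustrative assumptions, not the exact scheme used by the R-D kernel.

```python
import numpy as np

def step_operator_split(phi, dt, diffuse, react_point):
    """One global timestep using first-order operator splitting.

    `diffuse` advances the structured transport process on the whole
    field (uniform cost per point); `react_point` integrates the
    approximately decoupled reaction at a single grid point (cost may
    vary from point to point). Both callbacks are assumed interfaces.
    """
    phi = diffuse(phi, dt)                 # structured, uniform-cost sweep
    out = np.empty_like(phi)
    for idx in np.ndindex(phi.shape):      # pointwise, heterogeneous cost
        out[idx] = react_point(phi[idx], dt)
    return out
```

Because the reaction step touches each point independently, it is exactly the part of the computation whose per-point cost varies and must be measured for load balancing.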
Since the reactive processes are approximately decoupled in space and there are no spatial terms in the reaction operator, a conceptually simple solution exists: the grid points can be distributed arbitrarily across all processors as long as the loads are equalized. However, such a solution ignores spatial couplings, and hence neither preserves the connections between a point and its neighbors nor accounts for communication costs, which can be prohibitive. This motivates the requirement to compute the reactive processes in situ, and to achieve load balance through a prudent domain decomposition that incorporates the uneven nature of the load distribution.
A model problem approximating the ignition of a methane-air mixture is used as an illustrative example to evaluate our Dispatch partitioning strategy, and is referred to as the reactive-diffusion (R-D) application or kernel. The first figure is a 2-D snapshot of the R-D kernel that illustrates application dynamics after 100 timesteps during the ignition of a CH4-air mixture in a non-uniform temperature field with 3 "hot-spots". The application exhibits high dynamism, space-time heterogeneity, and varying computational workloads, and is representative of the class of simulations targeted by this research. The second figure shows a sample distribution of the heterogeneous computational workloads associated with pointwise processes for the R-D kernel on a 128x128 structured grid. The reactive processes near the flame fronts have high computational requirements that correspond to large workload values at the 3 hot-spots, while the diffusive processes have uniform loads with a value of 1. The R-D kernel solves an equation of the standard reactive-diffusion form, with a diffusion operator of uniform cost per grid point and a pointwise reaction source term whose stiff integration requires a varying number of sub-cycles.
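The source of the heterogeneity can be made concrete with a small sketch of sub-cycled pointwise integration: stiffer points (e.g., near a flame front) need more, smaller sub-steps to cover one global timestep, so the sub-cycle count is a natural per-point workload weight. The explicit-Euler sub-stepping rule and the stability-limit heuristic below are illustrative stand-ins for a real stiff ODE integrator, not the kernel's actual chemistry solver.

```python
def react_with_subcycles(value, dt_global, rate, dt_min=1e-3):
    """Integrate a pointwise reaction over one global timestep.

    A larger local `rate` (stiffer chemistry) forces smaller sub-steps,
    so points near flame fronts accumulate many more sub-cycles than
    quiescent points. Returns the new value and the sub-cycle count,
    which can serve as that point's computational weight.
    """
    t, subcycles = 0.0, 0
    while t < dt_global:
        # stability-limited sub-step: stiffer rate -> smaller step
        dt = min(dt_global - t, max(dt_min, 0.5 / max(rate, 1e-12)))
        value += dt * rate * value        # explicit Euler sub-step
        t += dt
        subcycles += 1
    return value, subcycles
```

A point with rate 100 takes roughly a hundred times more sub-cycles than one with rate 1, which is why the workload is known only locally and only at runtime.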
Dispatch is a dynamic structured partitioning strategy for scientific applications with pointwise varying workloads. Dispatch has been integrated with the GrACE computational framework and enables parallel uniform and adaptive simulations. Dispatch augments the structured grid formulations by combining an inverse space-filling curve based partitioner (ISP) with in-situ weighted load balancing using global thresholds. The flowchart below illustrates the parallel execution of the R-D application and the Dispatch scheme.
Dispatch maintains the computational weights associated with pointwise processes in a distributed manner. The Dispatch partitioner is invoked at periodic intervals during application execution and uses in-situ global load balancing to determine workload transfers and data remapping among processors. The Dispatch strategy consists of the following four steps: (1) it maps the computational weights onto the current grid structure (represented by distributed grid functions) that may be organized uniformly on a single level or as a multi-level SAMR hierarchy with dynamic refinement regions; (2) it generates intermediate workloads using interpolation for new refinements in adaptive meshes that correspond to pointwise processes; (3) it computes the local workloads (in parallel) and partitioning thresholds to determine processor allocations proportional to the computational weights; and (4) it performs workload and data redistribution that preserves application locality. In contrast, the Homogeneous strategy in the GrACE framework assumes that all grid points have the same workload requirements, and hence does not consider computational heterogeneity during load balancing.
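Steps 3 and 4 above can be sketched as a weighted cut of the grid points along a space-filling curve: points are visited in SFC order (which keeps spatial neighbors close together), and the ordered list is cut into contiguous segments whose weights match a global per-processor threshold. Morton (Z-order) indexing is used here as a simple stand-in for Dispatch's inverse space-filling curve ordering, and the single-threshold cutting rule is a simplified sketch of the in-situ global load balancing, not the production algorithm.

```python
import numpy as np

def morton_order(n):
    """Indices of an n x n grid in Morton (Z-order) SFC traversal order."""
    def key(ij):
        i, j = ij
        k = 0
        for b in range(n.bit_length()):
            k |= ((i >> b) & 1) << (2 * b + 1)   # interleave row bits
            k |= ((j >> b) & 1) << (2 * b)       # with column bits
        return k
    return sorted(((i, j) for i in range(n) for j in range(n)), key=key)

def partition_weighted(weights, nprocs):
    """Cut the SFC-ordered point list into nprocs contiguous segments
    whose accumulated weights approximate total_weight / nprocs, so a
    processor owning heavily loaded (hot-spot) points gets fewer of them."""
    n = weights.shape[0]
    order = morton_order(n)
    threshold = float(weights.sum()) / nprocs
    owner, acc, p = {}, 0.0, 0
    for ij in order:
        if acc >= threshold and p < nprocs - 1:
            p, acc = p + 1, 0.0              # start the next segment
        owner[ij] = p
        acc += float(weights[ij])
    return owner
```

Because each processor receives a contiguous run of the curve, the decomposition preserves spatial locality while still equalizing the heterogeneous weights.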
The experimental evaluation of the Dispatch strategy is performed using unigrid and SAMR implementations of the 2-D reactive-diffusion (R-D) kernel. The evaluation is performed on the IBM SP4 "DataStar" at the San Diego Supercomputer Center (SDSC). DataStar is SDSC's largest IBM terascale machine, especially suited for data-intensive computations, with 272 (8-way) P655+ compute nodes and 16-32 GB of memory per node. The experiments compare the performance of the Dispatch scheme and the default GrACE partitioner (Homogeneous) by measuring overall application execution time, load imbalance, synchronization time, and redistribution overheads. The unigrid evaluation is summarized below.
The unigrid (1-level) evaluation of the R-D application is performed using the Homogeneous and Dispatch strategies on 8-128 processors of DataStar with a 256x256 2-D base grid resolution. The application executes for 200 iterations with all other application-specific parameters kept unchanged. The first figure plots the total execution time Texec for the Homogeneous and Dispatch schemes. The Dispatch scheme improves overall application execution time by 11.23% on 16 processors and up to 46.34% on 64 processors. To further analyze load distribution and application runtime behavior, the second figure plots the average (across all processors) compute (mu_comp) and synchronization (mu_sync) times for the 8-128 processor runs. Because Dispatch accounts for the weights of the pointwise processes while performing load balancing, it achieves consistently smaller standard deviations of compute and synchronization times than the Homogeneous scheme.
Using the Dispatch strategy does lead to increased partitioning overheads that include the costs to extract the existing pointwise weights and interpolate new ones, compute the weights of local grid blocks for each processor, determine the global workload, and perform an appropriate domain decomposition based on the heterogeneous loads. However, the cumulative partitioning overheads are of O(10) seconds, which is an order of magnitude smaller than the application execution time. Hence, the partitioning overheads for Dispatch are considered negligible compared to the performance improvement in the uniform grid case. The Dispatch strategy also shows overall performance improvements in the case of SAMR applications, modeled as 2-level and 3-level grid hierarchies.
Future work aims at scaling up the R-D application (a full-scale production code model) so that there is enough computation to amortize the synchronization costs for large application domains executing on larger numbers of processors. This will enable us to analyze the scalability of Dispatch and address performance bottlenecks by conducting experiments for various application and processor configurations. One major concern as we scale up this research problem is that global load balancing may prove too expensive on larger numbers of processors. For the local scheduling phase of the Dispatch scheme, we plan to develop smarter algorithms for constructing workload transfer schedules. Rather than using the current greedy strategy, processors can decide which portions of the patches they own to keep and which to ship to neighbors. These decentralized load balancing and synchronization schemes for Dispatch will use local workload information within processor neighborhoods to adapt partitioning behavior during application execution. This approach will enable partitioning and redistribution to be performed without complete knowledge of the global state of the application.
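A common baseline for such neighborhood-local schemes is first-order diffusive load balancing, where each processor ships a fraction of its load surplus to lighter neighbors using only locally known loads. The sketch below illustrates this generic idea under assumed data structures (a load map and a neighbor map); it is a baseline illustration of decentralized balancing, not Dispatch's planned algorithm.

```python
def diffusive_rebalance(loads, neighbors, alpha=0.5, iters=10):
    """Neighborhood-local (diffusive) load balancing sketch.

    Each iteration, every processor compares its load only with its
    neighbors and transfers a fraction `alpha` of each positive
    difference, split across its neighborhood. No processor ever sees
    the global load state; total load is conserved.
    """
    loads = dict(loads)
    for _ in range(iters):
        flows = {p: 0.0 for p in loads}
        for p, nbrs in neighbors.items():
            for q in nbrs:
                diff = loads[p] - loads[q]
                if diff > 0:               # only overloaded side ships work
                    t = alpha * diff / max(len(nbrs), 1)
                    flows[p] -= t
                    flows[q] += t
        for p in loads:
            loads[p] += flows[p]
    return loads
```

Schemes of this kind converge toward a uniform load on connected processor neighborhoods, at the cost of needing several local exchange rounds instead of one global step.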