Application-sensitive Partitioning: Experimental Study

The choice of the ``best'' partitioning technique and associated partitioning parameters (e.g. granularity) not only depends the nature of the application, but also on its runtime state. This is motivated by the observation that for parallel/distributed SAMR, no single partitioning scheme performs the best for all types of applications and systems. Even for a single application, the most suitable partitioning technique depends on input parameters and the application runtime state. As a result, it becomes necessary to actively manage these dynamic applications at runtime. This includes using application and system runtime state to select and configure the partitioning and load-balancing strategy to be used, so as to maximize performance. The goal of the adaptive application-sensitive meta-partitioner is to reduce overall runtimes of parallel SAMR applications by dynamically selecting and configuring partitioners at run-time to match the current state of the application. The structure of the adaptive grid hierarchy is used to characterize its current state and define its current partitioning requirements. These requirements are then used by the adaptive partitioner to appropriately select and configure the partitioning technique that best satisfies them.

The objective of this feasibility study is to manually demonstrate the benefits of adaptive application-sensitive partitioning based on a manual analysis of the SAMR application. The adaptation policies are manually formulated using a SAMR simulator and these policies are used to adaptively select and invoke the most suitable partitioner from among the available SFC, G-MISP+SP, and pBD-ISP domain-based partitioners. The SAMR application used in this evaluation is a basic, rudimentary version of the RM3D compressible turbulence kernel (RM3Db) solving the Richtmyer-Meshkov instability in 3-D. The Richtmyer-Meshkov instability is a fingering instability which occurs at a material interface accelerated by a shock wave. This instability plays an important role in studies of supernova and inertial confinement fusion.

Characterizing Application State

Application characterization consists of two steps. First, the adaptive behavior of the application is captured in an adaptation trace generated using a single processor run. The adaptation trace contains snap-shots of the SAMR grid hierarchy at each regrid step. This trace is then analyzed using the octant approach (in this case, manually) and the adaptive partitioning strategy is defined. The application is executed for 800 coarse level time-steps and the trace consists of over 200 snap-shots. A selection of these snap-shots is shown in the figure below and illustrates the dynamics of the RM3Db application. RM3Db starts out with a scattered adaptation, lower activity dynamics, and more computation, placing it initially in octant IV. Either G-MISP+SP or SFC is the most appropriate partitioner for this octant. As the application evolves, its adaptation patterns cause it to move between octants and may require different partitioning schemes. At regrid step 106, the RM3Db application has high communication requirements, high dynamics, and a scattered adaptation, placing it in octant VI. pBD-ISP is the partitioner of choice in this octant. At the end of the simulation, the application adaptations are localized, with increased computation and lower activity, placing the application in octant III and G-MISP+SP is the appropriate partitioner.

Formulating Partitioner Adaptation Policy using SAMR Simulator

Once the RM3Db application trace containing snap-shots of the SAMR grid hierarchy is obtained, a SAMR simulator is used to analyze the behavior of a partitioner for the application on 64 processors. In the discussion below, we only present the simulator analysis for the SFC and pBD-ISP partitioners. As the G-MISP+SP scheme has similar characteristics as the SFC partitioner, its analysis is not presented. The SAMR simulator uses the trace to determine the performance of the partitioners at each regrid step in terms of the primary components of the quality metric (communication, data movement, load imbalance). Note that the other two metric, partitioning overhead and partitioning time, are intrinsic characteristics of a partitioner. The partitioning time is directly dependent on the three primary components that determine the runtime performance of the partitioner. The three figures below plot the maximum values of the total communication, data movement, and load imbalance components of the quality metric respectively for SFC and pBD-ISP partitioners on 64 processors. Analyzing these graphs, five operational zones can be identified.

  1. Zone I (0 to ~25): This is the application start-up zone and is typically associated with low communication and data migration. In this zone, we observe that the communication and data movement costs for both partitioners are similar. The deciding factor then becomes the load imbalance. Though SFC starts with high load imbalance, this is quickly corrected to achieve improved load balance. In this zone, the G-MISP+SP scheme starts with and maintains low imbalance. In contrast, the pBD-ISP scheme produces relatively high load imbalance in this zone. Hence, G-MISP+SP or SFC is the preferred partitioner for this zone.
  2. Zone II (~25 to ~135): In this zone, there are substantial communication and data movement costs as the application evolves. The pBD-ISP scheme aims at reducing these costs while maintaining moderate-to-low load imbalance. Though the SFC and G-MISP+SP partitioners yield better load balance than pBD-ISP, they produce larger data movement and communication overheads that adversely affect performance. Resolving these conflicting objectives, the pBD-ISP scheme is expected to perform better in this zone.
  3. Zone III (~135 to ~150): This zone is characterized by high activity dynamics and greater communication requirements. The SFC and G-MISP+SP schemes outperform pBD-ISP in, both, the load balance and communication metric components, while producing comparable data movement. Clearly, G-MISP+SP or SFC is the appropriate partitioner for this zone.
  4. Zone IV (~150 to ~175): In this zone, pBD-ISP gives lower data movement and communication overheads and produces an acceptable load imbalance as compared to the G-MISP+SP or SFC scheme, making it the partitioner of choice for the zone.
  5. Zone V (~175 to ~200): This is the relatively ``quiet'' final stage of the RM3Db application. The data migration and communication requirements are low and constant. Although all schemes perform equally well with respect to the quality metric, the G-MISP+SP or SFC scheme is slightly preferred as compared to pBD-ISP due to the better load balance produced.

An application-sensitive partitioner adaptation policy (denoted by ``adaptive'' in this example) for RM3Db can be formulated using the simulator analysis for application-sensitive partitioning as shown in the table below. The regrid step is used as the switching point where a new partitioner is selected, configured, and invoked.

Experimental Evaluation of the Adaptive Partitioner

The adaptive partitioning policy, manually formulated above, is experimentally evaluated on the NPACI IBM SP2 ``Blue Horizon'' at the San Diego Supercomputing Center, using the RM3Db application with a base grid of size 128*32*32, 3 levels of factor 2 space-time refinements, and regridding performed every 4 time-steps. The experiments consisted of measuring application execution times for different processor configurations using the meta-partitioning strategies. The partitioner as well as the partitioning parameters were switched on-the-fly while the application was executing using the regrid step as an indicator. Run-times from single partitioner runs were also measured. Run-times for experiments on 32 and 64 processors are shown in the following graphs. The basic RM3Db version serves as a ``feasibility study'' example and is significantly different from the original RM3D kernel.

 

The above results demonstrate that an adaptive application-sensitive partitioning policy that dynamically switches partitioners (and associated partitioning parameters) at runtime improves application performance and results in reduced execution times. For 64 processors, the improvement is 27.2% over the slowest partitioner. In the adaptive partitioner case, G-MISP+SP improves the load balance when the application is computationally dominated, while pBD-ISP reduces communication and data-migration overheads. This ``feasibility study'' example demonstrates the potential benefits of adaptive application-sensitive partitioning and motivates the development of the ARMaDA autonomic partitioning framework.