ARMaDA is an autonomic partitioning framework for SAMR applications that dynamically selects and configures partitioning algorithms at runtime to optimize overall application performance. The ARMaDA framework has three components: application state monitoring and characterization component, partitioner selection component and policy engine, and runtime adaptation component. The partitioners used includes a selection from software tools such as GrACE and Vampire. Note that ARMaDA does not require any a priori knowledge about the behavior of the application or its adaptation patterns.
Automated Application State Monitoring and Characterization
The current state of a SAMR application and its partitioning requirements are
determined using the structure of the SAMR grid hierarchy, which is expressed as
sets of bounding boxes assigned to the different processors at various levels of
refinement. These boxes are used to inexpensively compute the
computation/communication requirements, application dynamics, and the nature of
the adaptation using simple geometric operations as follows.
Computation/Communication Requirements: The computation-to-communication ratio (``CCratio'') determines whether the current application state is computationally-intensive or communication-dominated. The ratio of the total volume to total surface area of the bounding boxes determines CCratio for the current state of the application.
Application Dynamics: The application dynamics (``Dynamics'') estimate the rate of change of the application refinement patterns and are determined by comparing the current application state with its previous state. To compute the Dynamics metric, the overlap between the current set of bounding boxes and the previous set of boxes at each level is computed.
Nature of Adaptation: The nature of the adaptation (``Adapt'') captures the adaptation pattern, i.e., whether the refinements are scattered or localized. A simple, approximate algorithm determines the nature of adaptation based on the number and geometric arrangement of refinement regions within the overall domain under consideration.
The three metric described above can now be used to effectively estimate the state of the SAMR application at runtime. To avoid the possibility of thrashing due to very frequent application state changes, the ARMaDA framework maintains a history of the application state by storing the structure of the grid hierarchy for two preceding regrid steps. Three normalized ratios are computed corresponding to the three metric, viz., computation/communication ratio ``Cratio'', application dynamics ratio ``Dratio'', and adaptation ratio ``Aratio''.
Policy-based Partitioner Selection
The three normalized ratios (Cratio, Dratio, and Aratio) computed by the state characterization component are combined to map the application state to a specific octant. First, user and system defined thresholds are used to assign a bit to each metric. Application-dependent weights fine-tune the sensitivity of the metric to closely match the needs of the dynamic application. The three bits are then used to map the application to the application state octant. The appropriate partitioner is then selected from the set of partitioners that are mapped to the octant and satisfy the partitioning requirements associated with that octant.
Let Mr denote any one of the three metric that are computed from the state of the grid hierarchy at regrid step r. The normalized metric ratio is then computed as follows. Let Mratio and Mbit represent the normalized ratio and the octant-bit value for a particular metric M. Let lowMwt and highMwt denote the application metric weights for the low and high application state thresholds (represented by LOW_THRESH and HIGH_THRESH) respectively. The metric ratio to octant bit mapping is given by
Three octant bits,``Cbit'', ``Dbit'', and ``Abit'' corresponding to the three metric ratios are computed in this way. A low Cbit (Cbit = 0) represents more communication while a high Cbit (Cbit = 1) indicates greater computation. A low Dbit (Dbit = 0) indicates greater activity whereas a high Dbit (Dbit = 1) represents lesser application dynamics. A low Abit (Abit = 0) denotes more localized adaptation while a high Abit (Abit = 1) indicates that the nature of adaptation is scattered. These bits are then used to identify the ARMaDA characterized state, which is then mapped to an application state octant and associated partitioners using the policies as shown in the table below.
Adaptive Application-sensitive Partitioning
The ARMaDA adaptation component configures the selected partitioner with appropriate partitioning parameters (such as partitioning granularity) and invokes it to partition and load balance the SAMR grid hierarchy. A common interface is defined for the available partitioners allowing them to be uniformly invoked. The granularity is based on the requirements of the octant, though it may be overidden by a user defined value. A key concern in adaptive partitioning is ensuring that the overheads of characterizing the application state and switching partitioners (if necessary) at runtime do not offset the benefits of the adaptation. ARMaDA uses efficient and inexpensive mechanisms based on the union and interaction of the structured geometry of the grid components to characterize application state and compute the required metric ratios. Furthermore, ARMaDA provides optimizations and control parameters that can be used to customize the sensitivity, quality, and overheads of application monitoring and partitioner adaptation.
Evaluation of ARMaDA Framework
The experimental evaluation of the ARMaDA framework is performed using various SAMR application kernels. The experiments consist of measuring overall application execution times for different partitioner configurations using the ARMaDA framework. The partitioners used in the evaluation include SFC, G-MISP+SP, pBD-ISP, and adaptive combinations of these partitioners determined at runtime based on application state. With the exception of the partitioning strategy and associated partitioning granularity, all application-specific and refinement-specific parameters are kept constant. In the adaptive partitioning experiments using ARMaDA, an initial partitioner is selected by the user along with other partitioning parameters and thresholds. The initial partitioner is used for the initial distribution of the SAMR grid hierarchy, while the partitioners used in subsequent distributions are autonomically determined by ARMaDA.
TportAMR-2D on ``Discover'': ``Discover'' is a 16-node Linux Beowulf cluster at Rutgers University. The Transport AMR 2D (TportAMR-2D) application is a benchmark kernel solving the transport equation and is primarily communication-dominated with high adaptation overheads. This evaluation is performed on 8 processors of Discover using an application base grid of size 64*64 and 3 levels of factor 2 space-time refinements. Regridding is performed every 4 time-steps at each level and the application executes for 40 coarse level time steps. The application execution times are listed in the table below. The TportAMR-2D application used in this experiment is computationally inexpensive but requires considerable communication and data movement. Furthermore, the experiment uses a small domain and few iterations. Since the pBD-ISP partitioner is particularly suited to strongly communication-dominated application states and reduces communication and data migration costs, it is expected to perform better than the other partitioners. The results show that the ARMaDA framework with pBD-ISP as the initial partitioner performs the best and its performance is similar to the stand alone pBD-ISP case. In this evaluation, the overheads associated with the ARMaDA framework range between 40 and 100 milliseconds for an entire run of approximately 350 seconds. Though thrashing is detected in the case of G-MISP+SP with switching between G-MISP+SP and pBD-ISP partitioners, its effects are not significant due to the ARMaDA optimizations described earlier.
VectorWave-2D on ``Frea'': ``Frea'' is a 64-node Linux Beowulf cluster at Rutgers University. The VectorWave-2D application is a coupled set of partial differential equations and forms a part of the Cactus 2-D numerical relativity toolkit, solving Einstein's and gravitational equations. This evaluation is performed on 32 processors of Frea using an application base grid of size 128*128 and 3 levels of factor 2 space-time refinements. Regridding is performed every 4 time-steps at each level and the application runs for 60 coarse level time steps. The application execution times for the individual partitioners and for ARMaDA with SFC start are presented in the table above. The VectorWave-2D application is primarily computation-dominated, requiring good load balance and reduced communication and data migration costs. SFC and pBD-ISP partitioners optimize communication and data migration metric while G-MISP+SP gives good load balance. In this evaluation, the ARMaDA partitioner with SFC start improves performance by 26.19% over the slowest partitioner. The ARMaDA framework overhead for state sensing and adaptation in this case is 0.0616% of the total execution time, which is negligible.
RM2D & RM3D on ``Blue Horizon'': The NPACI IBM SP2 ``Blue Horizon'' is a Teraflop-scale Power3 based clustered SMP system at the San Diego Supercomputing Center. RM2D and RM3D are 2-D and 3-D versions respectively of a compressible turbulence kernel solving the Richtmyer-Meshkov instability. This evaluation is performed on 64 processors of Blue Horizon. The RM2D evaluation uses a base grid of size 128*32 and 60 coarse level time steps, while RM3D uses a base grid of size 128*32*32 and 100 coarse level time steps. Both evaluations use 3 levels of factor 2 space-time refinements and regridding is performed every 4 time-steps at each level. The application execution times for the individual partitioners and ARMaDA for RM2D and RM3D are listed in the tables below. Both, RM2D and RM3D start out as computation-dominated with low activity dynamics and a more scattered adaptation, placing them in octant IV. As the applications evolve, their communication requirements become more significant. Moreover, the applications exhibit different dynamics and adaptations at different stages of their execution. The pBD-ISP scheme improves communication and data migration metric and is seen to perform better than the SFC and G-MISP+SP partitioners. The ARMaDA adaptive partitioner starts out with either the G-MISP+SP or SFC partitioner as these partitioners are better suited for the initial stages of the application. The overheads associated with the ARMaDA framework are low. For the RM2D and RM3D evaluations, the overheads are 0.415 seconds and 1.85 seconds respectively, and are negligible as compared to the overall application execution times. These experimental results demonstrate that autonomic partitioning using the ARMaDA framework reduces execution time and improves application performance. In the case of RM2D application, improvements due to ARMaDA are 4.66%, 11.32%, and 27.88% over pBD-ISP, G-MISP+SP, and SFC partitioners respectively. In the case of RM3D application, these improvements are 22.17% over the pBD-ISP partitioner and 38.28% over the SFC scheme.