AMR Performance Simulator                                                      

The AMR simulator is a tool for evaluating partitioning techniques without having to run the actual application on a parallel computer. The application dumps a "trace" which is attacked by a partitioner. The trace contains the state of the grid hierarchy (without any partitioning) at each stage of the adaptive application. The simulator reads the output from the partitioner and evaluates the quality of the partitioning, as comprised by the 5-component metric defining partitioning quality.

The following three steps describe the concepts associated with the AMR simulator.

The input to the AMR simulator is essentially a list of bounding boxes obtained as a result of the partitioning scheme at different levels of refinement of the grid hierarchy. The input file also provides the processors involved at each time-step and the level of refinement during the execution of the partitioning routine. The maximum number of levels is also provided. The simulator hence reads in the time-step, processor number, level and bounding boxes. It then creates a list of bounding boxes for each level per processor per time-step. These lists are used to measure the intra-level and inter-level communication, the load imbalance created, the data movement involved, and the associated overheads.

Intra-level Communication

At each time step, the intra-level communication is the amount of communication present between processors at the same level. This is measured as follows.
From the input list, each bounding box (bb) is grown by a specified stencil size in all directions such that it maintains an overlapping zone at its boundaries with its neighboring processing elements (bb in this case) for synchronization of data during the updating process. So, the bb can be logically divided into appropriate number of interaction_bbs (depending on the dimension). Each of these interaction_bbs is intersected with each of the remaining bounding boxes on the list. If there is an intersection between the two, there is indeed intra-level communication and it is measured by the size of the intersection_bb of the two bbs. The intra-level communication at a particular level is the sum of intra-level communication of all bounding boxes in the list.
Amount of intra-level communication =    .. (1)
Now, intra-level communication of bounding box i is the sum of intra-level communication it has with all other bounding boxes. We can update eqn. (1) as
Amount of intra-level communication = 
The amount of intra-level communication of bounding box i with bounding box j can be found by calculating half the size of intersection of two bounding boxes.

The calculation above needs to be adjusted for the stencil used by multiplying with a stencil factor, the value of which is shown in the figure below.

Finally, amount of intra-level communication = 

Inter-level Communication

This is measured for each of the time-steps, between the processors that contain bbs at different levels of refinement. When a refined grid is created, its boundaries are usually interior to some coarser grid values. After updating the function values on a fine grid, the underlying coarse grid values are updated through restriction. The measurement process is as follows for each of the lists of bbs belonging to all processors at each timestep.

Each bb for a bblist is compared to each bb of a bblist that is either one level greater (child). If there is an interaction between the 2 bb, then inter-level communication exists and it is measured by the size of the intersection bb.

Data Movement

At the end of each time-step, the partitioning scheme allocates different bb to different processors. Hence bb that belonged to one processor may or may not belong to it at the next step. Therefore, data movement is characterized by the migration of any such bbs from one processor to another at different timesteps. It is measured as follows.

Compare the list of bbs at different time-steps. If there is an intersection between the 2 bbs, then there indeed is data movement and it is measured by the size of the intersection bb. The redistribution from time-step t1 to time-step t2 is given as:
Let the load on each processor be load1, load2, ..., loadn at time-step t1
Let the loads on each processor be load1a, load2a, ..., loadna at time-step t2
Data movement for processor1 = dm1 = load1 Ç load1a ...... (intersection)
Data movement for processor2 = dm2 = load2 Ç load2a
Data movement for processor1n = dmn = loadn Ç loadna
Total data movement = dm1 + dm2 + .... + dmn

Load Imbalance & Overheads

The load on each processor is computed as the workload area belonging to that processor at a certain level of refinement. Considering the refinement in space and time (with refinement factor of 2), and the size of the bounding box as the workload parameter, the work for a processor is given by
        Work = bb.size() *  (2 ^ level)

Sum of the loads on all the bounding boxes of all the processors is given by Total_load + = Work
        Ideal load on each processor = Total_load /nproc

The load imbalance on each processor at a timestep can then be calculated as
        Load Imbalance = ( ( Actual load Ideal load ) / Ideal load ) * 100 %

Partitioning overheads in terms of number of boxes can be computed as the number of boxes in the bounding box list associated with each processor at a given time-step.