Portable Profiling and Performance Analysis using 


Objective: Profiling an application lets the user identify its bottlenecks and obtain profiling information for individual functions, including templated functions. A profile shows which modules are the most time-intensive. The objective here is therefore to collect and analyze performance data in order to isolate functions in GrACE that can be scheduled more efficiently, and so improve the overall execution time.

TAU: TAU (Tuning and Analysis Utilities) is one environment for this kind of performance analysis and visualization. It targets parallel C++ and HPF and uses Tcl/Tk for its graphics. It is currently designed to instrument parallel, multi-threaded C and C++ code. TAU collects performance data while the program runs and then provides post-mortem analysis and display of the performance information.

TAU can show the exclusive and inclusive time spent in each function. For templated entities, it shows the breakdown of time spent in each instantiation. Other data include how many times each function was called, how many profiled functions each function invoked, and the mean inclusive time per call. TAU shows the mean time spent in a function over all nodes, contexts and threads. It can also show the exclusive and inclusive time spent in each individual invocation of a function, as well as the aggregated sum over all invocations.
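The relationship between inclusive time, exclusive time, call counts and child-call counts can be illustrated with a small sketch. This is not TAU code: `ToyProfiler`, `RoutineStats` and `demo_profile` are hypothetical names, and the timestamps are passed in by hand (in arbitrary units) so that the arithmetic is reproducible. Recursive routines are not handled specially here.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Per-routine totals, in arbitrary time units.
struct RoutineStats {
    long calls = 0;        // completed invocations of this routine
    long subroutines = 0;  // profiled routines this routine invoked
    double inclusive = 0;  // time including profiled callees
    double exclusive = 0;  // time excluding profiled callees
};

// Hypothetical toy profiler; the caller supplies every timestamp.
class ToyProfiler {
public:
    void enter(const std::string& name, double now) {
        // The routine currently on top of the stack is making a child call.
        if (!stack_.empty()) stats_[stack_.back().name].subroutines++;
        stack_.push_back({name, now, 0.0});
    }

    void leave(double now) {
        assert(!stack_.empty());
        Frame f = stack_.back();
        stack_.pop_back();
        double inclusive = now - f.start;
        RoutineStats& s = stats_[f.name];
        s.calls++;
        s.inclusive += inclusive;
        s.exclusive += inclusive - f.child_time;  // subtract time spent in callees
        // Charge this call's inclusive time to the parent's child time.
        if (!stack_.empty()) stack_.back().child_time += inclusive;
    }

    const RoutineStats& stats(const std::string& name) const {
        return stats_.at(name);
    }

private:
    struct Frame { std::string name; double start; double child_time; };
    std::vector<Frame> stack_;
    std::map<std::string, RoutineStats> stats_;
};

// "main" runs from t=0 to t=10 and calls "solve" twice (2..5 and 6..8).
ToyProfiler demo_profile() {
    ToyProfiler p;
    p.enter("main", 0);
    p.enter("solve", 2); p.leave(5);
    p.enter("solve", 6); p.leave(8);
    p.leave(10);
    return p;
}
```

In this example "main" has an inclusive time of 10 units but an exclusive time of only 5, because 5 units were spent inside the two calls to "solve". The mean inclusive time per call of "solve" is its inclusive total divided by its call count, here 5 / 2 = 2.5 units.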
Instead of time, TAU can use hardware performance counters and show, for each function, the number of instructions issued, cycles, loads, stores, floating-point operations, primary and secondary data cache misses, TLB misses, and so on.
It can also calculate statistics such as the standard deviation of the exclusive time (or counts) spent in each templated function.
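As a rough illustration of the statistics mentioned above, the following sketch computes the mean and standard deviation of one routine's exclusive time (or counter value) sampled on several threads. The names `Summary` and `summarize` are hypothetical, and the sketch assumes the population form of the standard deviation; the source does not say which form TAU reports.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Summary {
    double mean;
    double stddev;
};

// Mean and population standard deviation of per-thread samples,
// e.g. the exclusive time of one routine on each thread.
Summary summarize(const std::vector<double>& samples) {
    assert(!samples.empty());
    double sum = 0.0;
    for (double x : samples) sum += x;
    const double mean = sum / samples.size();

    double squared_deviation = 0.0;
    for (double x : samples) squared_deviation += (x - mean) * (x - mean);

    return {mean, std::sqrt(squared_deviation / samples.size())};
}
```

A large standard deviation relative to the mean suggests that the routine's cost is unevenly distributed across threads, which is often a sign of load imbalance.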
Instead of profiling whole functions, the user can profile at a finer granularity using timers: all of the above quantities can be collected for multiple user-defined timers that bracket statements in the code rather than functions.
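The idea of a user-defined timer that brackets a group of statements can be sketched as follows. This is a minimal stand-in built on `std::chrono`, not TAU's timer interface; `UserTimers`, `triangular` and the timer name "accumulate loop" are all hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>

// Hypothetical named timers that can be started and stopped around
// arbitrary statements, accumulating elapsed time and a call count.
class UserTimers {
public:
    void start(const std::string& name) {
        running_[name] = Clock::now();
    }

    void stop(const std::string& name) {
        auto it = running_.find(name);
        assert(it != running_.end());  // stop() must follow a matching start()
        Entry& e = totals_[name];
        e.seconds += std::chrono::duration<double>(Clock::now() - it->second).count();
        e.calls++;
        running_.erase(it);
    }

    double seconds(const std::string& name) const { return totals_.at(name).seconds; }
    long calls(const std::string& name) const { return totals_.at(name).calls; }

private:
    using Clock = std::chrono::steady_clock;
    struct Entry { double seconds = 0.0; long calls = 0; };
    std::map<std::string, Clock::time_point> running_;
    std::map<std::string, Entry> totals_;
};

// Only the loop is timed, not the whole function.
long triangular(int n, UserTimers& timers) {
    long total = 0;
    timers.start("accumulate loop");
    for (int i = 1; i <= n; ++i) total += i;
    timers.stop("accumulate loop");
    return total;
}
```

After several calls to `triangular`, `timers.calls("accumulate loop")` gives the number of times the bracketed statements ran and `timers.seconds("accumulate loop")` gives their accumulated wall-clock time.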

Instrumenting the code using PDT: To profile a function, macros must be added to the source code to identify routine transitions. This can be done automatically with the TAU C++ instrumentor, tau_instrumentor, or by instrumenting the code at run time using the Dyninst API. We used PDT (Program Database Toolkit), provided by the University of Oregon, to instrument the GrACE source code. PDT inserts the macros into the source code during the build, and the object files are then created from the instrumented source files. The accompanying diagram illustrates the architecture.
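To give a feel for what source-level instrumentation looks like, the sketch below places a macro at the top of each routine; the macro expands to a scope guard that marks routine entry in its constructor and routine exit in its destructor. The macro `PROFILE_ROUTINE`, the `RoutineGuard` class and the `Runtime` record are hypothetical and are not the macros that TAU or PDT actually insert.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for a profiling runtime: counts routine entries
// and tracks the current and the deepest call depth.
struct Runtime {
    std::map<std::string, long> entries;
    int depth = 0;
    int max_depth = 0;
};

inline Runtime& runtime() {
    static Runtime r;
    return r;
}

// Scope guard: the constructor marks routine entry, the destructor marks exit.
class RoutineGuard {
public:
    explicit RoutineGuard(const char* name) {
        Runtime& r = runtime();
        r.entries[name]++;
        if (++r.depth > r.max_depth) r.max_depth = r.depth;
    }
    ~RoutineGuard() { --runtime().depth; }
    RoutineGuard(const RoutineGuard&) = delete;
    RoutineGuard& operator=(const RoutineGuard&) = delete;
};

// What an instrumentor would insert as the first statement of each routine.
#define PROFILE_ROUTINE(name) RoutineGuard profile_guard_(name)

// Instrumented source: only the PROFILE_ROUTINE lines were added.
int square(int x) {
    PROFILE_ROUTINE("square");
    return x * x;
}

int sum_of_squares(int n) {
    PROFILE_ROUTINE("sum_of_squares");
    int total = 0;
    for (int i = 1; i <= n; ++i) total += square(i);
    return total;
}
```

Because the guard's destructor runs on every exit path, including early returns and exceptions, a single inserted line per routine is enough to record both transitions. That is what makes this style of instrumentation easy to automate.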

Visualizing traces using VAMPIR: Profiling typically shows the distribution of execution time across routines. It can show the code locations associated with specific bottlenecks, but it does not show how performance varies over time. Tracing the execution of a parallel program shows when and where an event occurred, in terms of the process that executed it and the location in the source code. In addition to PROFILE files, TAU also generates TRACE files for each node, context and thread. These TRACE files can then be converted to .pv format and viewed using VAMPIR (Visualization and Analysis of MPI Resources).

The trace view shows exactly when each message is sent from one node to another, along with other parallelism statistics.
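The difference between a profile and a trace can be made concrete with a small sketch: a trace keeps one timestamped record per event, so a send on one node can be paired with the matching receive on another. The record layout and the names `TraceEvent`, `Message` and `match_messages` are assumptions made for illustration; they do not reflect the TAU TRACE or .pv file formats.

```cpp
#include <cassert>
#include <deque>
#include <map>
#include <tuple>
#include <vector>

enum class EventKind { Send, Recv };

// One hypothetical trace record.
struct TraceEvent {
    double time;     // timestamp of the event
    int node;        // node that recorded the event
    EventKind kind;
    int peer;        // destination for Send, source for Recv
    int tag;         // message tag used to pair a send with its receive
};

struct Message {
    int from;
    int to;
    int tag;
    double sent;
    double received;
};

// Pairs each receive with the earliest unmatched send that has the same
// (source, destination, tag). Events must be merged in time order.
std::vector<Message> match_messages(const std::vector<TraceEvent>& events) {
    std::map<std::tuple<int, int, int>, std::deque<double>> pending;
    std::vector<Message> messages;
    double last_time = events.empty() ? 0.0 : events.front().time;

    for (const TraceEvent& e : events) {
        assert(e.time >= last_time);  // input is assumed to be time-ordered
        last_time = e.time;

        if (e.kind == EventKind::Send) {
            pending[std::make_tuple(e.node, e.peer, e.tag)].push_back(e.time);
        } else {
            auto it = pending.find(std::make_tuple(e.peer, e.node, e.tag));
            if (it == pending.end() || it->second.empty()) continue;  // unmatched receive
            messages.push_back({e.peer, e.node, e.tag, it->second.front(), e.time});
            it->second.pop_front();
        }
    }
    return messages;
}
```

Each matched `Message` corresponds to one message line in a timeline display, and `received - sent` gives the time the message spent in flight. A profile cannot provide this, since it only accumulates totals per routine.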

References:

TAU http://www.acl.lanl.gov/tau/

VAMPIR http://www.pallas.com/e/products/vampir/index.htm

PDT http://www.cs.uoregon.edu/research/paracomp/pdtoolkit/