The move to massively parallel hardware, such as multicore/manycore and accelerator platforms, is significantly impacting software programmers because existing programs must be properly parallelized before they can take advantage of these platforms. In fact, existing programs may actually run slower on new hardware, because per-core clock rates are being reduced for better power efficiency.
Most existing mainstream programming systems (models, languages, tools, libraries, runtimes) do not address parallelism and related issues. The stream programming model, an instance of data-parallel programming that offers good portability, simplicity, and scalability, has addressed these challenges for general-purpose GPU platforms and has gained significant acceptance. However, the current programming systems that support it have several limitations: (1) they require a drastic change to existing programming and software engineering practices (e.g., a completely new language, or an unverified programming paradigm), (2) they rely on very low-level and potentially error-prone mechanisms that significantly decrease programmer productivity, or (3) they are limited to particular hardware platforms.
We are working on a programming system for GPU accelerators that incorporates stream programming support into current mainstream object-oriented programming environments. We use a conventional OO programming language, e.g. C++, to express stream data and the associated computation kernels, and use Aspect-Oriented Programming principles to manage parallel execution parameters, such as parallelization granularity and memory access optimization, as aspects. A source-to-source compiler combines the core OO program with these aspects to generate parallelized programs executable on GPU accelerators. This approach has a small impact on existing program structure, is non-intrusive with respect to the computational part of a program, is compatible with existing engineering practice, and allows incremental adoption. As a result, it can help with the scalable transition of current programs into the coming manycore era.
To demonstrate the viability of the proposed system, we are conducting a few case studies, including n-body simulation, stock option pricing, and the Adaptive Mesh Refinement method for solving partial differential equations. We compared the program structure and performance of the original sequential version with those of an OOSP (object-oriented stream programming) version.
The OOSP version maintained the same source code structure for the core OO program as the serial version, and achieved about 80% of the runtime performance of a hand-coded CUDA version.
The same OOSP program has also been successfully translated into OpenCL for execution on a multicore CPU system.
We are also developing an intuitive model based on Bulk Synchronous Parallel (BSP) to characterize the runtime performance of these parallel programs.
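For reference, the standard BSP cost model, on which such a performance model would presumably build, charges each superstep for its computation, communication, and synchronization:

```latex
T \;=\; \sum_{i=1}^{S} \left( w_i + g \, h_i + l \right)
```

where $S$ is the number of supersteps, $w_i$ is the maximum local computation in superstep $i$, $h_i$ is the maximum number of words any processor sends or receives, $g$ is the per-word communication cost, and $l$ is the barrier synchronization latency. For a GPU program, the supersteps would correspond roughly to kernel launches separated by device-wide synchronization; this mapping is our illustrative assumption, not a claim about the model under development.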