The Applied Software Systems Laboratory

Room 637, CoRE building, Busch Campus. 94 Brett Road, Piscataway, NJ 08854

Cloud and Autonomic Computing Center at Rutgers | Rutgers, The State University of New Jersey | Department of Electrical and Computer Engineering

GPU and Multicore Architectures

Mingliang Wang, Manish Parashar
Department of Electrical and Computer Engineering, Rutgers University




Problem

The move to massively parallel hardware, such as multicore/manycore and accelerator platforms, significantly affects software developers, because existing programs must be properly parallelized before they can take advantage of these platforms. In fact, unmodified programs may actually run slower, because single-core clock rates are being reduced to improve power efficiency.

Most mainstream programming systems (models, languages, tools, libraries, runtimes) do not address parallelism and its related issues. The stream programming model, an instance of data-parallel programming that offers good portability, simplicity, and scalability, has addressed these challenges for general-purpose GPU platforms and has gained significant acceptance. However, the programming systems that currently support it have several limitations: they either (1) require a drastic change to existing programming and software engineering practices (e.g., a completely new language or an unverified programming paradigm), (2) rely on very low-level and potentially error-prone mechanisms that significantly decrease programmer productivity, or (3) are limited to particular hardware platforms.

Approach

We are working on a programming system for GPU accelerators that incorporates stream programming support into current mainstream object-oriented (OO) programming environments. We use a conventional OO programming language, e.g., C++, to express stream data and the associated computation kernels, and we use Aspect-Oriented Programming principles to manage parallel execution parameters, such as parallelization granularity and memory access optimization, as aspects. A source-to-source compiler combines the core OO program with these aspects to generate parallelized programs that execute on GPU accelerators. This approach has a small impact on existing program structure, is non-intrusive with respect to the computational part of a program, is compatible with existing engineering practice, and allows incremental adoption. As a result, it can help existing programs transition scalably into the coming manycore era.
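The separation described above can be illustrated with a minimal C++ sketch. It is not the project's actual API: the names `Stream`, `Saxpy`, `ExecutionConfig`, and `apply_kernel` are hypothetical, the execution parameters are modeled as a plain struct rather than as real aspects, and the loop runs on the CPU as a stand-in for the code a source-to-source compiler might generate for a GPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stream container: a thin wrapper around contiguous data.
template <typename T>
class Stream {
public:
    explicit Stream(std::vector<T> data) : data_(std::move(data)) {}
    std::size_t size() const { return data_.size(); }
    T& operator[](std::size_t i) { return data_[i]; }
    const T& operator[](std::size_t i) const { return data_[i]; }
private:
    std::vector<T> data_;
};

// Core computation: a kernel applied to one pair of elements.
// It contains no parallelization details.
struct Saxpy {
    float a;
    float operator()(float x, float y) const { return a * x + y; }
};

// Execution parameters kept apart from the kernel. This struct is an
// assumption standing in for what the text expresses as aspects.
struct ExecutionConfig {
    std::size_t block_size;  // parallelization granularity (elements per block)
};

// Stand-in for generated code: walks the streams block by block on the CPU.
// On a GPU, each block would map to a group of threads instead.
template <typename Kernel>
Stream<float> apply_kernel(const Kernel& k, const Stream<float>& x,
                           const Stream<float>& y, const ExecutionConfig& cfg) {
    assert(x.size() == y.size() && cfg.block_size > 0);
    Stream<float> out{std::vector<float>(x.size())};
    for (std::size_t begin = 0; begin < x.size(); begin += cfg.block_size) {
        const std::size_t end = std::min(begin + cfg.block_size, x.size());
        for (std::size_t i = begin; i < end; ++i) {
            out[i] = k(x[i], y[i]);
        }
    }
    return out;
}
```

In this sketch, changing `ExecutionConfig::block_size` changes how the work is partitioned but never touches `Saxpy`, which mirrors the claim that the computational part of the program stays untouched.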

Current Results

To demonstrate the viability of the proposed system, we are conducting several case studies, including n-body simulation, stock option pricing, and the Adaptive Mesh Refinement method for solving partial differential equations. For each, we compared the program structure and performance of the original sequential version with those of an object-oriented stream programming (OOSP) version.

The OOSP version kept the same source code structure for the core OO program as the serial version, and achieved about 80% of the runtime efficiency of the hand-coded CUDA version.

The same OOSP program has also been successfully translated into OpenCL for execution on a multicore CPU system.

We are also developing an intuitive model based on Bulk Synchronous Parallel (BSP) to characterize the runtime performance of these parallel programs.
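For background, the classic BSP cost model charges each superstep the maximum local computation time over all processors, plus h*g for communication, plus l for the barrier. The sketch below computes that textbook cost only; it is not the model under development here, and the type and function names (`Superstep`, `BspMachine`, `superstep_cost`, `program_cost`) are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One superstep in the classic BSP model.
struct Superstep {
    std::vector<double> work;  // local computation time per processor
    double h;                  // max words sent or received by any processor
};

// Machine parameters of the classic BSP model.
struct BspMachine {
    double g;  // communication cost per word
    double l;  // barrier synchronization cost
};

// Cost of one superstep: max_i(work_i) + h * g + l.
double superstep_cost(const Superstep& s, const BspMachine& m) {
    assert(!s.work.empty());
    const double w = *std::max_element(s.work.begin(), s.work.end());
    return w + s.h * m.g + m.l;
}

// Cost of a whole program: the sum of its superstep costs.
double program_cost(const std::vector<Superstep>& steps, const BspMachine& m) {
    double total = 0.0;
    for (const Superstep& s : steps) {
        total += superstep_cost(s, m);
    }
    return total;
}
```

A GPU-oriented variant would need to decide what plays the role of a processor and of a superstep (for example, a thread block and a kernel launch); that mapping is an open assumption, not something stated on this page.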

©2010 TASSL