The University of Florida, the University of Arizona, and Rutgers, the State University of New Jersey, have established a national research center for autonomic computing (CAC).

This center is funded by the Industry/University Cooperative Research Center program of the National Science Foundation, CAC members from industry and government, and university matching funds.

Autonomic Computing Engines on MS-HPCS

Principal researchers: Andres Quiroz, Shivangi Chaudhari, Manish Parashar (Rutgers)

Current collaborators: Brian Hammond (Microsoft)

Status: Ongoing

Consolidated and virtualized cluster-based computing centers have become dominant computing platforms in industry and research for enabling complex and compute-intensive applications. However, as scales, operating costs, and energy requirements increase, maximizing the efficiency, cost-effectiveness, and utilization of these systems becomes paramount. Furthermore, the complexity, dynamism, and often time-critical nature of application workloads make on-demand scalability, integration of geographically distributed resources, and incorporation of utility computing services critical. Finally, the heterogeneity and dynamics of the system, application, and computing environment require context-aware dynamic scheduling and runtime management.

This project envisions an autonomic computing engine capable of: (1) Supporting dynamic, utility-driven, on-demand scale-out of resources and applications, where organizations incorporate computational resources based on perceived utility. These include resources within the enterprise and across virtual organizations, as well as resources from emerging utility computing clouds. (2) Enabling complex and highly dynamic application workflows consisting of heterogeneous and coupled tasks/jobs, through programming and runtime support for a range of computing patterns (e.g., master-slave, pipelined, data-parallel, asynchronous, system-level acceleration). (3) Integrating runtime management (including scheduling and dynamic adaptation) across the different dimensions of application metrics and execution context. Context awareness includes system awareness, to manage heterogeneous resource costs, capabilities, availabilities, and loads; application awareness, to manage heterogeneous and dynamic application resources, data, and interaction/coordination requirements; and ambient awareness, to manage the dynamics of the execution context, such as heat/temperature and power.

This project builds on the Comet computing substrate, which provides a foundation and core capabilities for the envisioned autonomic computing engine. Comet supports different programming abstractions for parallel computing, including master/worker, data-parallel, and asynchronous iterations, in a dynamic and widely distributed environment. It provides the abstraction of virtual semantic shared spaces, which forms the basis for flexible scheduling, associative coordination, and content-based asynchronous and decoupled interactions. Comet builds on a self-organizing and fault-tolerant dynamic overlay of computing resources. It is currently deployed on a range of platforms, including local clusters, campus Grids, and wide-area computing platforms (e.g., PlanetLab), and supports several computational applications from science, engineering, and finance.
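The shared-space coordination described above can be illustrated with a small sketch. This is not Comet's API: the class `TupleSpace`, its methods `out` and `take`, and the dictionary-based tuples are hypothetical names chosen for illustration, and the space here lives in a single process rather than on a distributed overlay. It only shows the general idea of master/worker coordination through associative (content-based) matching, where producers and consumers never address each other directly.

```python
import threading
from typing import Any, Dict, List, Optional


class TupleSpace:
    """Toy in-process shared space (hypothetical, not Comet's API)."""

    def __init__(self) -> None:
        self._tuples: List[Dict[str, Any]] = []
        self._cond = threading.Condition()

    def out(self, tup: Dict[str, Any]) -> None:
        """Insert a tuple into the space and wake any waiting takers."""
        with self._cond:
            self._tuples.append(dict(tup))
            self._cond.notify_all()

    def _find(self, template: Dict[str, Any]) -> Optional[int]:
        # A tuple matches when it has every template key and each value is
        # equal to the template value; None in the template is a wildcard.
        for i, t in enumerate(self._tuples):
            if all(k in t and (v is None or t[k] == v)
                   for k, v in template.items()):
                return i
        return None

    def take(self, template: Dict[str, Any],
             timeout: Optional[float] = None) -> Optional[Dict[str, Any]]:
        """Remove and return a matching tuple, or None if the wait times out."""
        with self._cond:
            found = self._cond.wait_for(
                lambda: self._find(template) is not None, timeout)
            if not found:
                return None
            return self._tuples.pop(self._find(template))


def worker(space: TupleSpace) -> None:
    """Take tasks by content, not by sender, until a stop marker arrives."""
    while True:
        task = space.take({"kind": "task"})
        if task["n"] is None:  # stop marker inserted by the master
            return
        space.out({"kind": "result", "n": task["n"], "value": task["n"] ** 2})


def run_master(numbers: List[int], n_workers: int = 3) -> Dict[int, int]:
    """Insert one task per number, collect the results, then stop the workers."""
    space = TupleSpace()
    threads = [threading.Thread(target=worker, args=(space,))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for n in numbers:
        space.out({"kind": "task", "n": n})
    results: Dict[int, int] = {}
    for _ in numbers:
        r = space.take({"kind": "result"})
        results[r["n"]] = r["value"]
    for _ in threads:
        space.out({"kind": "task", "n": None})
    for t in threads:
        t.join()
    return results
```

Because workers select tasks by template rather than by recipient, workers can be added or removed without changing the master, which is the decoupling property the shared-space abstraction is meant to provide.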

Specifically, this project has resulted in the deployment of the Comet infrastructure on the Microsoft Windows High Performance Compute Server (HPCS) cluster, and its use to support online financial analytics such as Value-at-Risk. HPCS is designed to simplify the deployment and management of high-performance clusters and to reduce total cost of ownership. Its job scheduler can handle both batch and service-oriented jobs, and supports customized advanced policies for mixed environments. It provides API support for exploiting multi-cores, RDMA using Network Direct, and MS-MPI. There is also support for provisioning, configuring, monitoring, and managing cluster nodes and user access.
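For readers unfamiliar with Value-at-Risk, the following is a minimal sketch of one common way to estimate it, Monte Carlo simulation. The page does not say which method the project uses, so everything here is an assumption made for illustration: the normally distributed returns, the empirical-quantile convention, and the function names `simulate_losses`, `value_at_risk`, and `portfolio_var`. The simulation is split into independent batches to show why the computation suits a master/worker pattern: each batch could be handed to a different worker.

```python
import random
from typing import List


def simulate_losses(n_paths: int, mean: float, stdev: float,
                    value: float, seed: int) -> List[float]:
    """Simulate one batch of portfolio losses (positive number = loss).

    Assumes one-period returns are normally distributed, which is a
    simplification made only for this sketch.
    """
    rng = random.Random(seed)
    return [-value * rng.gauss(mean, stdev) for _ in range(n_paths)]


def value_at_risk(losses: List[float], confidence: float) -> float:
    """Empirical VaR: the loss at the given quantile of the sorted sample."""
    if not losses:
        raise ValueError("losses must not be empty")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    ordered = sorted(losses)
    index = min(len(ordered) - 1, int(confidence * len(ordered)))
    return ordered[index]


def portfolio_var(n_batches: int, paths_per_batch: int, mean: float,
                  stdev: float, value: float, confidence: float) -> float:
    """Run independent batches (one per hypothetical worker) and merge them."""
    losses: List[float] = []
    for batch in range(n_batches):
        # Each batch uses its own seed, so batches do not depend on each other.
        losses.extend(
            simulate_losses(paths_per_batch, mean, stdev, value, seed=batch))
    return value_at_risk(losses, confidence)
```

A call such as `portfolio_var(8, 10_000, 0.0005, 0.02, 1_000_000.0, 0.99)` estimates the one-period loss that is exceeded in roughly 1% of simulated scenarios for a hypothetical portfolio worth 1,000,000.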

A key aspect of this effort is using HPCS to enable application-level autonomics in Comet. This includes using the advanced networking support provided by HPCS, such as RDMA (Winsock Direct) and offloading (TCP Chimney), to enable low-latency communication and latency-hiding techniques. Another aspect is to provide a consolidated and virtualized cluster-based Grid as a parallel computing platform, with task deployment, scheduling, and load balancing performed using encapsulated (secure) virtualized images (Windows Server 2008 Hyper-V). The effort is driven by a family of scientific and business applications.

Reference:

  • Z. Li and M. Parashar, Computational Infrastructure for Grid-based Asynchronous Parallel Applications, Proceedings of the 16th International Symposium on High-Performance Distributed Computing (HPDC), Monterey, CA, USA, pp. 229, June 2007.