
The University of Florida, the University of Arizona, and Rutgers, the State University of New Jersey, have established a national research center for autonomic computing (CAC).

This center is funded by the Industry/University Cooperative Research Center program of the National Science Foundation, CAC members from industry and government, and university matching funds.

Autonomic Cloud Bursts on Amazon EC2

Principal researchers: Hyunjoo Kim, Vamsi Kodamasimham, Manish Parashar (Rutgers)

Status: Ongoing

Cluster-based data centers have become dominant computing platforms in industry and research for enabling complex and compute-intensive applications. However, as scales, operating costs, and energy requirements increase, maximizing the efficiency, cost-effectiveness, and utilization of these systems becomes paramount. Furthermore, the complexity, dynamism, and often time-critical nature of application workloads make on-demand scalability, integration of geographically distributed resources, and incorporation of utility computing services critical. Finally, the heterogeneity and dynamics of the system, application, and computing environment require context-aware dynamic scheduling and runtime management.

An autonomic cloud burst is the dynamic deployment of a software application that normally runs on an organization's internal compute resources to a public cloud in order to absorb a spike in demand. Provisioning data center resources to handle sudden and extreme spikes in demand is a critical requirement. It can be met by combining private data center resources with remote on-demand cloud resources such as Amazon EC2, which provides resizable computing capacity in the cloud.
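The burst decision described above can be illustrated with a minimal sketch. It assumes a deliberately simple capacity model in which every node handles a fixed number of tasks; the function name and its parameters are hypothetical and are not part of Comet or the EC2 API.

```python
import math


def cloud_nodes_needed(pending_tasks: int, local_nodes: int,
                       tasks_per_node: int) -> int:
    """Return how many extra cloud nodes the current backlog calls for.

    Assumption (not from the project): every node, local or cloud,
    can absorb the same fixed number of tasks.
    """
    if tasks_per_node <= 0:
        raise ValueError("tasks_per_node must be positive")
    # Work the private data center can absorb on its own.
    local_capacity = local_nodes * tasks_per_node
    # Anything beyond that is the spike to be sent to the public cloud.
    overflow = max(0, pending_tasks - local_capacity)
    return math.ceil(overflow / tasks_per_node)
```

With no overflow the function returns 0, so the application stays entirely on internal resources until demand exceeds local capacity.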

This project envisions a computational engine that can enable autonomic cloud bursts capable of: (1) Supporting dynamic, utility-driven, on-demand scale-out of resources and applications, where organizations incorporate computational resources based on perceived utility. These include resources within the enterprise and across virtual organizations, as well as from emerging utility computing clouds. (2) Enabling complex and highly dynamic application workflows consisting of heterogeneous and coupled tasks/jobs through programming and runtime support for a range of computing patterns (e.g., master-slave, pipelined, data-parallel, asynchronous, system-level acceleration). (3) Integrating runtime management (including scheduling and dynamic adaptation) of the different dimensions of application metrics and execution context. Context awareness includes system awareness, to manage heterogeneous resource costs, capabilities, availabilities, and loads; application awareness, to manage heterogeneous and dynamic application resources, data, and interaction/coordination requirements; and ambient awareness, to manage the dynamics of the execution context.

The Comet service model has three kinds of clouds. The first is a highly robust and secure cloud, whose nodes can act as masters. In most applications the data is critical and should stay in the secure space, so only masters in this cloud can handle the application's complete data set.

The second is a cloud that is secure but not robust. Nodes in this cloud can act as workers and also host part of the Comet shared coordination space. Together, the robust/secure masters and the secure workers construct a global virtualized Comet space. A master generates tasks, which are small units of work for parallelization, and inserts them into the Comet shared coordination space. Each task is mapped to a node on the overlay using its keyword and is stored in the storage space of that node. Both robust/secure masters and secure workers therefore include the Comet shared space in their architecture substrate. The master provides a management agent for scheduling and monitoring tasks. It also has a computing agent, because it can contribute computing capability itself. A secure worker takes tasks from the space one at a time, so it too has a computing agent in its architecture. The workers consume the tasks and return the results to the master over a direct connection.

The third cloud is for unsecured workers. Unsecured workers cannot access the Comet shared space directly and cannot offer their storage to hold tasks, but they do contribute computing capability, so they have only a computing agent in their architecture. An unsecured worker requests a task from one of the masters in the robust/secure network. The master then accesses the Comet shared space, gets a task, and forwards it to the unsecured worker. When the worker finishes the task, it sends the result back to the master.
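The task flow among the three roles can be sketched as follows. This is an in-process illustration only, assuming a plain FIFO queue in place of the distributed, keyword-mapped Comet space; all class and function names are hypothetical, and `compute` is a placeholder for real application work.

```python
from collections import deque


class SharedSpace:
    """Stand-in for the Comet shared coordination space (a plain FIFO here)."""

    def __init__(self):
        self._tasks = deque()

    def insert(self, task):
        self._tasks.append(task)

    def take(self):
        # Remove and return one task, or None when the space is empty.
        return self._tasks.popleft() if self._tasks else None


class Master:
    """Robust/secure node: generates tasks, proxies for unsecured workers."""

    def __init__(self, space):
        self.space = space
        self.results = {}

    def generate(self, payloads):
        for task_id, payload in enumerate(payloads):
            self.space.insert((task_id, payload))

    def fetch_for_unsecured(self):
        # Unsecured workers never touch the space; the master reads for them.
        return self.space.take()

    def receive(self, task_id, result):
        self.results[task_id] = result


def compute(payload):
    return payload * payload  # placeholder for the real computation


def secure_worker_step(space, master):
    task = space.take()  # direct access to the shared space
    if task is None:
        return False
    task_id, payload = task
    master.receive(task_id, compute(payload))
    return True


def unsecured_worker_step(master):
    task = master.fetch_for_unsecured()  # access only through a master
    if task is None:
        return False
    task_id, payload = task
    master.receive(task_id, compute(payload))
    return True


def run(payloads):
    space = SharedSpace()
    master = Master(space)
    master.generate(payloads)
    progressed = True
    while progressed:
        # Alternate one secure and one unsecured worker until no task is left.
        took_secure = secure_worker_step(space, master)
        took_unsecured = unsecured_worker_step(master)
        progressed = took_secure or took_unsecured
    return master.results
```

The only structural difference between the two worker functions is where the task comes from, which mirrors the isolation of unsecured nodes from the shared space.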

As part of this project we have developed the Comet computing substrate, which provides a foundation and core capabilities for the envisioned autonomic computing engine. Comet supports different programming abstractions for parallel computing, including master/worker, bag-of-tasks (BOT), workflow, data parallel, and asynchronous iterations, in a dynamic and widely distributed environment. It provides the abstraction of virtual semantic shared spaces that forms the basis for flexible scheduling, associative coordination, and content-based asynchronous and decoupled interactions. Comet builds on a self-organizing and fault-tolerant dynamic overlay of computing resources. It is currently deployed on a range of platforms, including local clusters, campus Grids, wide-area computing platforms (e.g., PlanetLab), Microsoft HPCS, and Amazon EC2, and supports several computational applications from science, engineering, and finance. Here we focus on deploying a Value-at-Risk application on Amazon EC2 and show how autonomic cloud bursts are implemented and deployed on EC2.
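Associative, content-based coordination of the kind described above is often illustrated with a Linda-style tuple space, where a consumer retrieves work by describing its content rather than naming its producer. The sketch below is a generic single-process illustration under that assumption; it does not reproduce Comet's actual interface, and the names `TupleSpace`, `out`, `rd`, and `take` are hypothetical.

```python
def matches(template, tup):
    """True if every template field appears in the tuple with an equal value.

    A template value of None acts as a wildcard for that field.
    """
    return all(
        key in tup and (want is None or tup[key] == want)
        for key, want in template.items()
    )


class TupleSpace:
    """Minimal associative store: tuples are looked up by content."""

    def __init__(self):
        self._tuples = []

    def out(self, tup):
        # Insert a copy so later changes by the producer do not leak in.
        self._tuples.append(dict(tup))

    def rd(self, template):
        # Non-destructive read of the first matching tuple.
        for tup in self._tuples:
            if matches(template, tup):
                return dict(tup)
        return None

    def take(self, template):
        # Destructive read: the matching tuple is removed from the space.
        for index, tup in enumerate(self._tuples):
            if matches(template, tup):
                return self._tuples.pop(index)
        return None
```

Because producers and consumers interact only through the space, neither needs to know the other's identity or be active at the same time, which is what makes the interaction decoupled and asynchronous.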

Ongoing efforts are focused on dynamically growing and shrinking clouds according to the workload and to optimization policies such as time (get the result as soon as possible, given a sufficient budget) or budget (get the result while keeping costs within a budget). We are also working on robustness and security issues, e.g., isolating unsecured nodes in a cloud from sensitive data and application logic, so that autonomic cloud bursts can combine secure, reliable nodes with unsecured cloud nodes. Future work includes utility-based and cooperative workflow scheduling models, scheduling and monitoring of tasks, simultaneous failure management, and support for high-level application models such as Hadoop/MapReduce.
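The difference between the time and budget policies can be made concrete with a small sketch. It assumes a simplified cost model in which each cloud node is billed for exactly one hour at a flat price, with amounts kept in integer cents to avoid floating-point rounding; the function and its parameters are hypothetical and do not describe the project's actual scheduler.

```python
def choose_node_count(policy, pending_tasks, max_nodes,
                      price_cents_per_node_hour, budget_cents=None):
    """Pick how many cloud nodes to run under a 'time' or 'budget' policy.

    Assumption (not from the project): each node is billed for one hour.
    """
    # Running more nodes than there are tasks gains nothing.
    useful = min(pending_tasks, max_nodes)
    if policy == "time":
        # Finish as soon as possible: use every node that can help.
        return useful
    if policy == "budget":
        if budget_cents is None:
            raise ValueError("budget policy requires budget_cents")
        # Cap the node count at what one billed hour of budget allows.
        affordable = budget_cents // price_cents_per_node_hour
        return min(useful, affordable)
    raise ValueError(f"unknown policy: {policy!r}")
```

Under the time policy the only limits are the amount of work and the node ceiling, while under the budget policy cost becomes a third limit, so the same workload may be given fewer nodes and take longer.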

Reference

Z. Li and M. Parashar, "A Computational Infrastructure for Grid-based Asynchronous Parallel Applications," Proceedings of the 16th International Symposium on High-Performance Distributed Computing (HPDC), Monterey, CA, USA, pp. 229, June 2007.