The University of Florida, the University of Arizona, and Rutgers, the State University of New Jersey, have established a national research center for autonomic computing (CAC).

This center is funded by the Industry/University Cooperative Research Center program of the National Science Foundation, CAC members from industry and government, and university matching funds.

Robust Clustering Analysis for Self-Monitoring Distributed Systems

Principal researchers: Manish Parashar and Andres Quiroz (Rutgers University). Current collaborators: Naveen Sharma and Nathan Gnanasambandam (Xerox).

The control and timely management of large-scale distributed systems, such as device networks, data centers, and compute clusters, are tasks that are rapidly exceeding human ability, given the complexity and dynamics of these systems and the large amounts of data involved. Thus, the automated and online management of these systems is essential to ensure their continued performance and robust operation. Fortunately, the in-network resources that these systems already have available can be harnessed to perform the self-monitoring and data analysis tasks that are crucial for effective management.

A self-monitoring system is able to observe and analyze system state and behavior, to discover anomalies or violations, and to notify autonomic or human administrators in a timely manner so that appropriate management actions can be applied. Furthermore, implementing the analysis technique in a decentralized and in-network fashion (using network resources and minimal extraneous information) helps keep the analysis computationally tractable and response times acceptable. However, because self-monitoring mechanisms are subject to the same failures that occur in the network they help to manage, their robustness is essential to overall system reliability and must be addressed at several levels of the proposed solution.

Working toward the goals outlined above, the main contribution of this work is the formulation and validation of a robust decentralized data analysis mechanism [1] that applies density-based clustering techniques [2-4] to identify anomalies and clusters of arbitrary size and shape in monitoring data. The data to be clustered consists of periodic behavior and operational status update events from system components, defined in terms of known attributes. The event attributes are used to construct a multidimensional coordinate space, in which the distance between events measures their similarity. Components that behave in a similar fashion can then be identified by the clusters that their status events form in this space, while devices with abnormal behavior will produce isolated events. The clustering algorithm requires minimal computation at processing nodes, which makes it suitable for online execution.
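As a rough illustration of the idea described above, the following sketch maps status events to points in a normalized attribute space and runs a small, centralized version of the density-based algorithm of [3] (DBSCAN). It is not the decentralized mechanism of [1]. The attribute names, their value ranges, and the eps and min_pts parameters are assumptions chosen only for this example.

```python
from math import dist

# Hypothetical event attributes with assumed (min, max) ranges for normalization.
ATTRIBUTES = {"cpu_load": (0.0, 1.0), "queue_len": (0, 100)}

def to_point(event):
    """Map an event (dict of attribute values) to a point in the unit hypercube."""
    return tuple((event[name] - lo) / (hi - lo)
                 for name, (lo, hi) in ATTRIBUTES.items())

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for an anomaly."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins cluster, not expanded
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # core point: keep expanding
                queue.extend(reachable)
    return labels

events = [
    {"cpu_load": 0.20, "queue_len": 30}, {"cpu_load": 0.22, "queue_len": 31},
    {"cpu_load": 0.21, "queue_len": 29}, {"cpu_load": 0.80, "queue_len": 75},
    {"cpu_load": 0.82, "queue_len": 77}, {"cpu_load": 0.81, "queue_len": 76},
    {"cpu_load": 0.50, "queue_len": 5},   # isolated event from a misbehaving device
]
labels = dbscan([to_point(e) for e in events], eps=0.05, min_pts=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

In this toy data set the first six events form two groups of similarly behaving components, and the last event is labeled -1 because no other event falls within eps of it.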

The robustness of the decentralized mechanisms is addressed at three levels. First, at the overlay level, we assume that the connectivity of the network is maintained despite node failures through self-healing mechanisms. Next, at the data messaging level, we use replication to prevent the loss of the events required for the clustering analysis. To minimize the overhead incurred by replication, data is selectively replicated at nodes based on their probability of failure, which is calculated from a maintained failure history using an appropriate failure model. Finally, at the analysis level, available information can make replication even more selective: because the primary focus of the clustering analysis is anomaly detection, only the points most likely to be anomalies need to be replicated, and these can be predicted from the clusters and anomalies previously observed in the system.
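The selective replication decision can be sketched as follows. This is a minimal illustration under stated assumptions, not the mechanism of [1]: the text does not name the failure model, so an exponential (constant failure rate) model is assumed here, and the "likely anomaly" test is simplified to a distance check against the centers of previously observed clusters. All function names, parameters, and thresholds are hypothetical.

```python
from math import dist, exp

def failure_probability(failure_history, observed_span, horizon):
    """Probability that a node fails within `horizon` time units.

    Assumes an exponential failure model whose rate is estimated as the
    number of recorded failures divided by the observation time.
    """
    rate = len(failure_history) / observed_span
    return 1.0 - exp(-rate * horizon)

def likely_anomaly(event, cluster_centers, radius):
    """True if the event lies farther than `radius` from every known cluster center."""
    return all(dist(event, center) > radius for center in cluster_centers)

def should_replicate(event, failure_history, observed_span, horizon,
                     cluster_centers, radius, p_threshold):
    """Replicate only likely anomalies held by nodes that are likely to fail."""
    p_fail = failure_probability(failure_history, observed_span, horizon)
    return p_fail >= p_threshold and likely_anomaly(event, cluster_centers, radius)

# Hypothetical inputs: failure timestamps (hours) over a 100-hour observation
# window, and the centers of two clusters seen in earlier analysis rounds.
centers = [(0.21, 0.30), (0.81, 0.76)]
flaky_node = [10, 40, 90]
stable_node = []

print(should_replicate((0.50, 0.05), flaky_node, 100, 10, centers, 0.1, 0.2))   # True
print(should_replicate((0.21, 0.30), flaky_node, 100, 10, centers, 0.1, 0.2))   # False
print(should_replicate((0.50, 0.05), stable_node, 100, 10, centers, 0.1, 0.2))  # False
```

Requiring both conditions keeps replication overhead low: an event that fits an existing cluster, or one held by a node with no recorded failures, is not copied.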

This work is part of an ongoing effort to create tools for integrated data analysis for the autonomic management of performance, security/trust, and reliability at the system level. Current and future efforts include improving the cluster descriptions produced by the algorithm, so that system behavior can be profiled effectively, and developing predictive models of distributed system state. We plan to combine these mechanisms with tools for defining system policies and for conditioning their application on these profiles and state predictions, in support of autonomic resource management and provisioning, usage control and monitoring, and trust management and authentication.

References:

  1. "Robust Clustering Analysis for the Management of Self-Monitoring Distributed Systems," A. Quiroz, N. Gnanasambandam, M. Parashar, and N. Sharma. To appear in Journal of Cluster Computing, Springer 2008, DOI: 10.1007/s10586-008-0068-5.
  2. "Algorithms for Clustering Data," A.K. Jain and R.C. Dubes. Prentice Hall, 1988.
  3. "A density-based algorithm for discovering clusters in large spatial databases with noise," M. Ester, H.P. Kriegel, J. Sander, X. Xu. In: Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96), 1996.
  4. "Automatic subspace clustering of high dimensional data for data mining applications," R. Agrawal, J. Gehrke, D. Gunopulos, P. Raghavan. In: Proceedings of 1998 ACM-SIGMOD Int. Conf. Management of Data, pp. 94-105. Seattle, Washington, 1998.