FailRank: Towards a unified GRID failure monitoring and ranking system
Date
2008Author




ISBN
978-0-387-78447-2Source
Making Grids Work - Proceedings of the CoreGRID Workshop on Programming Models Grid and P2P System Architecture Grid Systems, Tools and Environments2007 Joint CoreGRID Workshop on Programming Models Grid and P2P System Architecture Grid Systems, Tools and Environments
Pages
247-259Google Scholar check
Keyword(s):
Metadata
Show full item recordAbstract
The objective of Grid computing is to make processing power as accessible and easy to use as electricity and water. The last decade has seen an unprecedented growth in Grid infrastructures which nowadays enables large-scale deployment of applications in the scientific computation domain. One of the main challenges in realizing the full potential of Grids is making these systems dependable. In this paper we present FailRank, a novel framework for integrating and ranking information sources that characterize failures in a grid system. After the failing sites have been ranked, these can be eliminated from the job scheduling resource pool yielding in that way a more predictable and dependable infrastructure. We also present the tools we developed towards evaluating the FailRank framework. In particular, we present the FailBase Repository which is a 38GB corpus of state information that characterizes the EGEE Grid for one month in 2007. Such a corpus paves the way for the community to systematically uncover new, previously unknown patterns and rules between the multitudes of parameters that can contribute to failures in a Grid environment. Additionally, we present an experimental evaluation study of the FailRank system over 30 days which shows that our framework identifies failures in 91% of the cases.