Identifying failures in grids through monitoring and ranking
Date
2008
Author




ISBN
978-0-7695-3192-2
Source
Proceedings of the 7th IEEE International Symposium on Networking Computing and Applications, NCA 2008
Pages
291-298
Abstract
In this paper we present FailRank, a novel framework for integrating and ranking information sources that characterize failures in a grid system. Once the failing sites have been ranked, they can be removed from the job-scheduling resource pool, yielding a more predictable, dependable, and adaptive infrastructure. We also present the tools we developed to evaluate the FailRank framework. In particular, we present the FailBase Repository, a 38 GB corpus of state information that characterizes the EGEE Grid for one month in 2007. Such a corpus paves the way for the community to systematically uncover new, previously unknown patterns and rules among the multitude of parameters that can contribute to failures in a Grid environment. Additionally, we present an experimental evaluation of the FailRank system over 30 days, which shows that our framework identifies failures in 93% of the cases. We believe that our work constitutes another important step towards realizing adaptive Grid computing systems. © 2008 IEEE.
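The rank-then-exclude idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the scoring function, so the weighted sum below is an assumption made for illustration only, as are the metric names (failed_jobs, queue_delay), the weights, the site names, and both function names. None of these are taken from FailRank itself.

```python
from typing import Dict, List

Metrics = Dict[str, Dict[str, float]]  # site name -> {metric name -> normalized value}


def rank_sites(metrics: Metrics, weights: Dict[str, float]) -> List[str]:
    """Order sites from most to least failure-prone.

    Assumption: a site's score is a weighted sum of its metrics.
    Ties are broken alphabetically so the ordering is deterministic.
    """
    def score(site: str) -> float:
        return sum(weights.get(name, 0.0) * value
                   for name, value in metrics[site].items())

    return sorted(metrics, key=lambda site: (-score(site), site))


def schedulable_sites(metrics: Metrics, weights: Dict[str, float], k: int) -> List[str]:
    """Drop the k highest-ranked (most failure-prone) sites from the pool."""
    excluded = set(rank_sites(metrics, weights)[:k])
    return [site for site in metrics if site not in excluded]


# Hypothetical data: three sites, two normalized metrics each.
metrics = {
    "site-a": {"failed_jobs": 0.1, "queue_delay": 0.2},
    "site-b": {"failed_jobs": 0.9, "queue_delay": 0.5},
    "site-c": {"failed_jobs": 0.4, "queue_delay": 0.1},
}
weights = {"failed_jobs": 0.7, "queue_delay": 0.3}

ranking = rank_sites(metrics, weights)
pool = schedulable_sites(metrics, weights, k=1)
```

With these hypothetical inputs, site-b scores highest and is the one excluded when k is 1, leaving site-a and site-c available to the scheduler.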