Modeling the implications of DRAM failures and protection techniques on datacenter TCO
Date
2015
Author
Nikolaou, Panagiota
Sazeides, Yiannakis
Ndreu, L.
Kleanthous, Marios M.
ISBN
978-1-4503-4034-2
Publisher
IEEE Computer Society
Source
Proceedings of the Annual International Symposium on Microarchitecture, MICRO
48th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2015
Volume
05-09-December-2015
Pages
572-584
Abstract
Total Cost of Ownership (TCO) is a key optimization metric for the design of a datacenter. This paper proposes, for the first time, a framework for modeling the implications of DRAM failures and DRAM error protection techniques on the TCO of a datacenter. The framework captures the effects and interactions of several key parameters, including the choice of DRAM protection technique (e.g., single- vs. dual-channel Chipkill), device width (x4 or x8), memory size, power, FITs for various failure modes, and the performance, power, and temperature overheads of a protection technique for a given service and for mixes of collocated services. The usefulness of the proposed framework is demonstrated through several case studies that identify the best DRAM protection technique, in terms of TCO, in each case. Interestingly, our analysis reveals that none of the three DRAM protection techniques considered is always superior to the others; rather, each technique is the best choice in some cases. This underlines the importance of, and the need for, the proposed framework when making optimal memory protection design decisions for a datacenter. As part of this work, we analyze and report the performance and power of single-channel and dual-channel Chipkill on real hardware when running a web search benchmark alone and collocated with benchmarks of varying memory intensity. This analysis reveals that the choice of memory protection can have serious performance and TCO ramifications, depending on the memory characteristics of the collocated services. Further analysis reveals that, for the datacenter and services assumed in this study, when using Chipkill protection it can be beneficial for TCO to use DRAM with 100x the failure rate of a baseline DRAM, as long as the cost per DIMM is at least a dollar less than the baseline. © 2015 ACM.
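
As a rough illustration of the kind of trade-off the abstract describes, the sketch below converts a FIT rate (failures per 10^9 device-hours) into an expected number of DIMM replacements and weighs that replacement cost against a per-DIMM purchase discount. It is not the paper's framework. The function names and every numeric input are hypothetical placeholders, and the toy model ignores the performance, power, and temperature overheads that the framework accounts for.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours


def expected_failures(fit, devices, years):
    """Expected failure count for `devices` units at a constant FIT rate.

    FIT is defined as failures per 1e9 device-hours.
    """
    return fit * devices * HOURS_PER_YEAR * years / 1e9


def dimm_cost(dimms, price, fit_unmasked, years, replace_cost):
    """Purchase cost plus expected replacement cost over the lifetime.

    Toy assumption: only failures that the protection technique cannot
    mask (fit_unmasked) lead to a DIMM replacement.
    """
    failures = expected_failures(fit_unmasked, dimms, years)
    return dimms * price + failures * replace_cost


# All inputs below are hypothetical and chosen only for illustration.
baseline = dimm_cost(dimms=100_000, price=100.0, fit_unmasked=1.0,
                     years=3, replace_cost=150.0)
cheaper = dimm_cost(dimms=100_000, price=99.0, fit_unmasked=100.0,
                    years=3, replace_cost=150.0)

print(f"baseline DIMMs:     {baseline:,.2f}")
print(f"cheaper, 100x FIT:  {cheaper:,.2f}")
```

With these placeholder inputs, the purchase discount outweighs the added replacement cost. Whether that holds for a real deployment depends on the actual FIT rates, protection technique, and service mix, which is what the framework is meant to evaluate.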