Cache reliability for large numbers of permanent faults
Date: 2010-12
Author: Ladas, Nikolas
Publisher: University of Cyprus, Faculty of Pure and Applied Sciences
Place of publication: Cyprus
Abstract
Process variability in future technology nodes is expected to severely limit the benefits of dynamic voltage scaling. To keep power at bay, low-voltage operation has been proposed. Because of the cubic relation between voltage, frequency, and power, operating at low voltage yields significant power and energy savings. However, in this mode of operation some devices fail. SRAM cells, which are used to build caches, are the most sensitive to low-voltage operation because they are built with minimal geometry to save area. As a result, large numbers of faults appear in caches operating at low voltage. Traditional reliability techniques, such as sparing and ECC, cannot handle such large numbers of faults.
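To see why, consider a back-of-the-envelope check (a sketch only; the per-bit failure probability and cache size below are illustrative assumptions, not figures from this work). A SECDED code corrects a single faulty bit per protected word, so a word becomes unrepairable as soon as two of its bits fail, and at low-voltage fault rates this happens to far more words than a handful of spares can cover:

    # Illustrative assumptions (not from this work): independent bit failures
    p_bit = 1.2e-3          # assumed per-bit failure probability at low voltage
    word_bits = 72          # 64 data bits + 8 SECDED check bits
    words = (2 * 1024 * 1024 * 8) // 64   # words in an assumed 2 MB cache

    # SECDED fails when a word holds two or more faulty bits
    p_zero = (1 - p_bit) ** word_bits
    p_one = word_bits * p_bit * (1 - p_bit) ** (word_bits - 1)
    p_uncorrectable = 1 - p_zero - p_one

    print(f"P(word uncorrectable) = {p_uncorrectable:.1e}")                 # ~3.5e-3
    print(f"expected uncorrectable words = {p_uncorrectable * words:.0f}")  # ~900

With these assumed numbers, hundreds of words in a single cache are uncorrectable, well beyond what row or column sparing is typically provisioned for.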
Because of this, novel reliability mechanisms have been proposed that can protect caches in high-fault-rate scenarios. However, most of these techniques are costly in area or overly complex to implement.
In this work we present a new approach for dealing with high fault rates in caches. We propose taking a simple, well-known reliability technique, block disabling, and combining it with performance-enhancing mechanisms such as prefetching and victim caching, and with careful selection of cache parameters such as block size and associativity. This approach is easy to implement because it uses technology that already exists in modern processors, and it requires little area overhead.
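The mechanism itself is simple enough to sketch (a toy model with invented names such as FaultyCache; the simulator used in the thesis is far more detailed). Defective blocks are marked with a disable bit and skipped during allocation, which shrinks the effective associativity of their set; a small fully associative victim cache then absorbs the extra evictions from those crippled sets:

    from collections import OrderedDict, deque

    class FaultyCache:
        """Toy set-associative cache with per-way disable bits and a victim buffer."""

        def __init__(self, num_sets, ways, faulty_ways, victim_size=4):
            self.ways = ways
            self.faulty = faulty_ways                 # set of (set_index, way) pairs
            self.sets = [OrderedDict() for _ in range(num_sets)]  # tags in LRU order
            self.victim = deque(maxlen=victim_size)   # small fully associative buffer

        def usable_ways(self, set_index):
            # Disable bits simply remove faulty ways from the allocation pool.
            return self.ways - sum(1 for w in range(self.ways)
                                   if (set_index, w) in self.faulty)

        def access(self, set_index, tag):
            s = self.sets[set_index]
            if tag in s:                      # hit in the main cache
                s.move_to_end(tag)
                return True
            if tag in self.victim:            # hit in the victim cache: swap back in
                self.victim.remove(tag)
                self._insert(set_index, tag)
                return True
            self._insert(set_index, tag)      # miss: fill from the next level
            return False

        def _insert(self, set_index, tag):
            s = self.sets[set_index]
            capacity = self.usable_ways(set_index)
            if capacity == 0:                 # fully disabled set: victim buffer only
                self.victim.append(tag)
                return
            if len(s) >= capacity:            # set full: evict LRU into the victim
                evicted, _ = s.popitem(last=False)
                self.victim.append(evicted)
            s[tag] = True

For example, a set with two of its four ways disabled behaves as a 2-way set, and blocks squeezed out of it get a second chance in the victim buffer. A real implementation works on physical addresses and fault maps, but the sketch captures the combination studied here: disable bits plus existing performance mechanisms.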
To select the optimal cache configuration for block disabling we use a combination of probability analysis and accurate performance simulation. Using probability analysis, we show that a smaller block size is preferable because more cache capacity remains usable (72% with 32 B blocks versus 54% with 64 B blocks). We also show that a smaller block size combined with higher associativity reduces the probability of clustered faults within the same set. Using simulations, we show that the capacity retained by the smaller block size translates into better performance. Furthermore, we show that prefetching and victim caching help recover performance lost to faults. The victim cache is especially useful for reducing the performance non-determinism caused by the random placement of faults in the cache.
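The capacity argument can be approximated with a simple model (a sketch assuming independent per-bit failures with an illustrative probability p_bit; the fault model behind the exact 72%/54% figures is not reproduced here). A block is usable only if every one of its 8B bits works, so the usable fraction for B-byte blocks is (1 - p_bit)^(8B), and the probability that all ways of a set are faulty falls sharply as associativity grows:

    p_bit = 1.25e-3   # assumed per-bit failure probability (illustrative)

    for block_bytes, ways in ((64, 8), (32, 16)):         # same 512 B of data per set
        p_block_bad = 1 - (1 - p_bit) ** (8 * block_bytes)   # block holds >= 1 fault
        p_dead_set = p_block_bad ** ways                     # every way in the set faulty
        print(f"{block_bytes} B blocks, {ways}-way: "
              f"{1 - p_block_bad:.0%} capacity usable, P(dead set) = {p_dead_set:.1e}")

With this assumed p_bit the model yields roughly 73% and 53% usable capacity, close to the 72% and 54% reported above, and the dead-set probability drops by over six orders of magnitude for the 32 B, 16-way organization.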
Our best-performing block-disabling configuration outperforms word disabling/bit-fix, a recently proposed mechanism for low-voltage operation, by 7%. Furthermore, our low-cost block-disabling configuration performs similarly to word disabling/bit-fix while requiring less area overhead and being simpler to implement.