Fault detection and recovery in a data-driven real-time multiprocessor
Date
1994
ISBN
0-8186-5602-6
Publisher
IEEE
Source
Proceedings of the International Conference on Parallel Processing; Proceedings of the 8th International Parallel Processing Symposium
Pages
769-774
Abstract
This paper introduces the mechanisms required to perform fault detection and recovery in the DART multiprocessor architecture. The DART multiprocessor uses prioritized data-driven scheduling to ensure that multiple hard and soft deadlines are met. A data-driven checkpointing scheme has been developed that ensures these deadlines are met even when processors fail. The basic approach is to monitor the behavior of each computational thread by means of hardware timers. The results of a thread are released only if the thread completes before its timeout period expires. Otherwise, the partial computation on the processor is discarded and the thread is rescheduled on a different processor. A strategy for statically predicting system performance in the event of multiple processor failures is presented and evaluated. Simulation results illustrate the fault detection and recovery response times for single processor failures on DART multiprocessor architectures with 2, 3, 8, 16, and 32 processing elements.
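The timeout-and-reschedule idea described in the abstract can be illustrated with a minimal sketch. This is not the DART implementation: the names `Processor`, `Thread`, and `run_with_recovery` are hypothetical, time is counted in abstract units rather than measured by hardware timers, and a failed processor is modeled simply as one that never completes a thread.

```python
from dataclasses import dataclass


@dataclass
class Processor:
    pid: int
    failed: bool = False  # assumption: a failed processor never finishes a thread


@dataclass
class Thread:
    name: str
    work: int     # time units the thread needs on a healthy processor
    timeout: int  # timer limit after which the thread is declared failed


def run_with_recovery(thread, processors):
    """Try each processor in turn; release a result only if the thread
    finishes before its timeout. Returns (result, pid, elapsed time)."""
    elapsed = 0
    for proc in processors:
        if proc.failed or thread.work > thread.timeout:
            # Timer expires: discard the partial computation and
            # reschedule the thread on the next processor.
            elapsed += thread.timeout
            continue
        # Thread completed within its timeout: release the result.
        elapsed += thread.work
        return f"{thread.name}:done", proc.pid, elapsed
    raise RuntimeError(f"no processor completed {thread.name} in time")


if __name__ == "__main__":
    procs = [Processor(0, failed=True), Processor(1)]
    print(run_with_recovery(Thread("t0", work=3, timeout=5), procs))
```

In this toy model the recovery response time for a single failure is one timeout period plus the re-execution time, which suggests why the choice of timeout matters for meeting deadlines; the paper's own simulation results should be consulted for actual figures.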