Fault detection and recovery in a data-driven real-time multiprocessor
Author: Farquhar, William G.
Publisher: IEEE
Source: Proceedings of the International Conference on Parallel Processing
Proceedings of the 8th International Parallel Processing Symposium
This paper introduces the mechanisms required to perform fault detection and recovery in the DART multiprocessor architecture. The DART multiprocessor uses prioritized data-driven scheduling to ensure that multiple hard and soft deadlines are met. A data-driven checkpointing scheme has been developed that ensures that these deadlines are met even when processors fail. The basic approach is to monitor the behavior of each computational thread by means of hardware timers. The results of a thread are released only if the thread completes before its timeout period expires. Otherwise, the partial computation on the processor is discarded and the thread is rescheduled on a different processor. A strategy to statically predict the system performance in the event of multiple processor failures is presented and evaluated. Simulation results are provided to illustrate the fault detection and recovery response times for single processor failures on DART multiprocessor architectures with 2, 3, 8, 16 and 32 processing elements.
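
The timeout-and-reschedule rule described in the abstract can be illustrated with a small simulation. This is only a minimal sketch under stated assumptions, not the DART implementation: the names `Processor`, `Thread`, `run_with_recovery` and `worst_case_finish` are hypothetical, time is counted in abstract units, a failed processor is assumed never to complete a thread, and each failed attempt is assumed to cost exactly one full timeout period before the thread is moved to the next processor. The helper `worst_case_finish` is a naive bound that follows from those assumptions; it is not the static prediction strategy evaluated in the paper.

```python
from dataclasses import dataclass


@dataclass
class Processor:
    pid: int
    failed: bool = False  # assumption: a failed processor never finishes a thread


@dataclass
class Thread:
    name: str
    work: int     # time units needed on a healthy processor
    timeout: int  # timer budget allowed for each attempt


def run_with_recovery(thread, processors, start=0):
    """Try the thread on each processor in turn.

    The result is released only if the attempt finishes within the timeout.
    Returns (pid, finish_time) on success, or None if every attempt timed out.
    """
    clock = start
    for proc in processors:
        if not proc.failed and thread.work <= thread.timeout:
            # Completed before the timer expired: release the result.
            return proc.pid, clock + thread.work
        # Timer expired: discard the partial computation, charge one full
        # timeout period, and reschedule on a different processor.
        clock += thread.timeout
    return None


def worst_case_finish(thread, failures, start=0):
    """Naive bound: each failed attempt costs one timeout before the retry."""
    return start + failures * thread.timeout + thread.work


if __name__ == "__main__":
    procs = [Processor(0, failed=True), Processor(1)]
    t = Thread("filter", work=3, timeout=5)
    print(run_with_recovery(t, procs))       # attempt on P0 times out, P1 succeeds
    print(worst_case_finish(t, failures=1))  # bound for a single failure
```

In this sketch the recovery response time for a single failure is one timeout period plus the re-execution time, which is why the choice of per-thread timeout directly affects whether a deadline can still be met after a fault.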