O/S-enabled on-line software-based self-test and recovery for resilient shared-memory multicore systems

Skitsas, Michael A.

dc.contributor.advisor	Michael, Maria K.	en
dc.contributor.advisor	Nicopoulos, Chrysostomos	en
dc.contributor.author	Skitsas, Michael A.	en
dc.coverage.spatial	Cyprus	en
dc.creator	Skitsas, Michael A.	en
dc.date.accessioned	2018-03-13T06:45:12Z
dc.date.available	2018-03-13T06:45:12Z
dc.date.issued	2017-12
dc.date.submitted	2017-12-29
dc.identifier.uri	https://gnosis.library.ucy.ac.cy/handle/7/39724	en
dc.description	Includes bibliography (p. 101-107).	en
dc.description	Number of sources in the bibliography: 81	en
dc.description	Thesis (Ph. D.) -- University of Cyprus, Faculty of Engineering, Department of Electrical and Computer Engineering, 2017.	en
dc.description	The University of Cyprus Library holds the printed form of the thesis.	en
dc.description.abstract	Καθώς η τεχνολογία εξελίσσεται, τα μεγέθη αποτύπωσης ολοκληρωμένων κυκλωμάτων συρρικνώνονται, τα τρανζίστορ γίνονται λιγότερο αξιόπιστα. Ως αποτέλεσμα αυτής της εξέλιξης, τα μελλοντικά συστήματα αναμένεται να είναι πιο ευάλωτα σε φαινόμενα φθοράς (με την πάροδο του χρόνου και την χρήση). Το ζήτημα της φθοράς με την πάροδο του χρόνου και της βαθμιαίας υποβάθμισης καθιστά αναγκαία την χρήση μηχανισμών που επιτρέπουν την προστασία του συστήματος από ανεπιθύμητες συμπεριφορές διευκολύνοντας έτσι στην ανίχνευση, τον μετριασμό ή / και την αποκατάσταση ορθής λειτουργίας από σφάλματα καθ' όλη τη διάρκεια ζωής του συστήματος. Πρόσφατα, στην βιβλιογραφία έχουν προταθεί αρκετές τεχνικές για έλεγχο των συστημάτων που να επιτρέπουν τη δυναμική ανίχνευση μόνιμων σφαλμάτων. Η ανίχνευση σφαλμάτων από τα ίδια τα συστήματα με την χρήση λογισμικού είναι μια διαδεδομένη τεχνική στον τομέα ελέγχου ψηφιακών κυκλωμάτων και μικροεπεξεργαστών. Η λειτουργία αυτή βασίζεται στην εκμετάλλευση των υφιστάμενων διαθέσιμων πόρων που υπάρχουν στο σύστημα. Πέρα από την ανίχνευση σφαλμάτων, τα σύγχρονα συστήματα πρέπει να ενισχυθούν με μηχανισμούς που είναι σε θέση να επιδιορθώσουν και να ανακτήσουν την ορθή λειτουργία του συστήματος στην παρουσία σφάλματος, προκειμένου να παραμείνει λειτουργικό παρά την ύπαρξη μόνιμων βλαβών. Σκοπός αυτής της διατριβής είναι η ανάπτυξη τεχνικών για: (i) ανίχνευση σφαλμάτων, (ii) μεθοδολογίες προγραμματισμού για την αύξηση της διαθεσιμότητας του συστήματος κατά τη διάρκεια του ελέγχου του συστήματος και (iii) ενίσχυση του συστήματος με δυνατότητες αποκατάστασης. Το πρώτο μέρος αυτής της εργασίας εισάγει ένα νέο παράδειγμα ανίχνευσης σφαλμάτων που ελέγχει το σύστημα για σφάλματα στη διακριτότητα των επί μέρους συστημάτων (υπολογιστικών μονάδων) ενός επεξεργαστή σε συστήματα πολλαπλών πυρήνων λαμβάνοντας υπόψιν το ιστορικό της λειτουργίας τους. 
Συγκεκριμένα, αναπτύχθηκε το πλαίσιο DaemonGuard που επιτρέπει την παρατήρηση σε πραγματικό χρόνο των επί μέρους υπολογιστικών συστημάτων ενός επεξεργαστή εκτελώντας μια διαδικασία για έλεγχο (σε τοπικό επίπεδο) χωρίς να γίνεται ολικός έλεγχος του επεξεργαστή για σφάλματα σε επίπεδο υλικού. Αυτή η τεχνική στοχεύει στη μείωση του χρόνου εκτέλεσης ελέγχου αποφεύγοντας τον συχνό έλεγχο των μονάδων των οποίων η χρήση ήταν σε χαμηλό επίπεδο. Το δεύτερο μέρος διερευνά τη σχέση μεταξύ του χρόνου κατά τον οποίο το σύστημα βρίσκεται υπό έλεγχο και του συνολικού χρόνου που χρειάζεται να ελεγχθούν όλοι οι πυρήνες του συστήματος. Για αυτό το σημείο στην έρευνα μας, αναπτύσσουμε ένα πλαίσιο εξερεύνησης ικανό να προσδιορίσει την καλύτερη πολιτική προγραμματισμού για να αυξήσει τη διαθεσιμότητα του συστήματος. Επιπλέον, προτείνουμε, αξιολογούμε και ενσωματώνουμε μια νέα μεθοδολογία που στοχεύει στην περαιτέρω βελτίωση των τεχνικών καθώς το σύστημα μεγαλώνει. Για το τελευταίο μέρος της παρούσας διατριβής, προτείνουμε τεχνικές που βελτιώνουν το προτεινόμενο πλαίσιο και μπορούν να υποστηρίξουν δυνατότητες αποκατάστασης της σωστής λειτουργίας παρά την εμφάνιση σφαλμάτων. Συγκεκριμένα, προτείνουμε έναν αποδοτικό μηχανισμό ανάκτησης και επαναφοράς, ο οποίος, μετά την ανίχνευση σφαλμάτων, μπορεί να επαναφέρει το σύστημα στην πιο πρόσφατη έγκυρη κατάσταση ορθής λειτουργίας και να επαναλάβει την εκτέλεση, υποθέτοντας την απενεργοποίηση του ελαττωματικού πυρήνα, οδηγώντας έτσι σε ένα υποβαθμισμένο μεν, αλλά λειτουργικό σύστημα. Όλες οι προτεινόμενες τεχνικές αξιολογούνται μέσω μιας σειράς πειραμάτων με τη χρήση προσομοίωσης.	el
dc.description.abstract	As technology scales deep into the sub-micron regime, transistors become less reliable. Future systems are widely predicted to suffer from considerable aging and wear-out effects. Aging and gradual degradation necessitate mechanisms that protect against undesired system behavior by facilitating detection of, mitigation of, and/or recovery from faults throughout the lifetime of the system. Recently, several on-line testing techniques that enable dynamic detection of permanent faults have been proposed in the literature. Software-Based Self-Testing (SBST) is an emerging paradigm in the testing domain that exploits existing resources already resident in the system. Beyond fault detection, modern systems must be enhanced with mechanisms that can self-repair and return the system to a fault-free state, so that it remains functional despite the presence of permanent faults. The objectives of this work are to develop techniques for: (i) on-line fault detection, (ii) test scheduling that increases system availability during testing, and (iii) enhancing the system with recovery capabilities. The first part of this thesis introduces a new SBST paradigm that performs testing at the granularity of individual microprocessor core components in multi-/many-core systems, based on their utilization. In particular, we develop DaemonGuard, a framework that observes individual sub-core modules in real time and performs on-demand, selective testing of the modules that have been stressed. This technique aims to reduce testing time by avoiding the over-testing of under-utilized units. The second part investigates the relation between system test latency and test time overhead under several scheduling policies. 
For this part, we develop an exploration framework that identifies the best scheduling policy for increasing system availability under a given test latency constraint. Additionally, a new methodology that aims to reduce the extra testing overhead incurred as the system scales up (i.e., as the number of on-chip cores increases) is integrated into and evaluated under the developed exploration framework. For the last part of this thesis, we propose to enhance our framework with fault recovery capabilities. In particular, we propose an efficient checkpointing and rollback recovery mechanism which, upon fault detection, restores the system to the most recent valid state and resumes normal operation, assuming that the faulty core is disabled, thereby leading to a healthy (but degraded) system. All the proposed techniques are evaluated through a series of experiments using a full-system, execution-driven simulation framework running a commodity operating system and real multi-threaded workloads.	en
dc.format.extent	xxii, 107 p. : col. ill., diagrs., tables ; 31 cm.	en
dc.language.iso	eng	en
dc.publisher	Πανεπιστήμιο Κύπρου, Πολυτεχνική Σχολή / University of Cyprus, Faculty of Engineering
dc.rights	info:eu-repo/semantics/openAccess	en
dc.rights	Open Access	en
dc.subject.lcsh	Computer engineering	en
dc.subject.lcsh	Multiprocessors	en
dc.subject.lcsh	Software architecture	en
dc.subject.lcsh	Computer architecture	en
dc.subject.lcsh	Reliability (Engineering)	en
dc.subject.lcsh	Fault location (Engineering) -- Data processing	en
dc.subject.lcsh	Computer software -- Testing	en
dc.subject.lcsh	Systems availability	en
dc.title	O/S-enabled on-line software-based self-test and recovery for resilient shared-memory multicore systems	en
dc.title.alternative	Τεχνικές για ανίχνευση και αποκατάσταση σφαλμάτων σε συστήματα πολλαπλών πυρήνων κοινόχρηστης μνήμης σε επίπεδο λειτουργικού συστήματος	el
dc.type	info:eu-repo/semantics/doctoralThesis	en
dc.contributor.committeemember	Θεοχαρίδης, Ιωάννης	el
dc.contributor.committeemember	Έλληνας, Γεώργιος	el
dc.contributor.committeemember	Νεοφύτου, Στέλιος	el
dc.contributor.committeemember	Ψαράκης, Μιχάλης	el
dc.contributor.committeemember	Theocharides, Ioannis	en
dc.contributor.committeemember	Ellinas, Georgios	en
dc.contributor.committeemember	Neophytou, Stelios	en
dc.contributor.committeemember	Psarakis, Mihalis	en
dc.contributor.department	Τμήμα Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών / Department of Electrical and Computer Engineering
dc.subject.uncontrolledterm	ΕΛΕΓΧΟΣ ΨΗΦΙΑΚΩΝ ΣΥΣΤΗΜΑΤΩΝ	el
dc.subject.uncontrolledterm	ΑΞΙΟΠΙΣΤΙΑ ΣΥΣΤΗΜΑΤΩΝ	el
dc.subject.uncontrolledterm	ΠΟΛΥΠΥΡΗΝΑ ΣΥΣΤΗΜΑΤΑ	el
dc.subject.uncontrolledterm	ΑΝΙΧΝΕΥΣΗ ΣΦΑΛΜΑΤΩΝ	el
dc.subject.uncontrolledterm	ΔΙΑΘΕΣΙΜΟΤΗΤΑ ΣΥΣΤΗΜΑΤΟΣ	el
dc.subject.uncontrolledterm	SELF-TESTING	en
dc.subject.uncontrolledterm	MULTI-/MANY-CORE SYSTEMS	en
dc.subject.uncontrolledterm	RELIABILITY	en
dc.subject.uncontrolledterm	ON-LINE FAULT DETECTION	en
dc.subject.uncontrolledterm	SOFTWARE-BASED SELF-TESTING	en
dc.subject.uncontrolledterm	SYSTEM AVAILABILITY	en
dc.identifier.lc	QA76.76.T48S55 2017	en
dc.author.faculty	Πολυτεχνική Σχολή / Faculty of Engineering
dc.author.department	Τμήμα Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών / Department of Electrical and Computer Engineering
dc.type.uhtype	Doctoral Thesis	en
dc.rights.embargodate	2017-12-29
dc.contributor.orcid	Nicopoulos, Chrysostomos [0000-0001-6389-6068]
dc.contributor.orcid	Skitsas, Michael A. [0000-0003-4715-3162]
dc.gnosis.orcid	0000-0001-6389-6068
dc.gnosis.orcid	0000-0003-4715-3162

Files in this item

Name: Michael A. Skitsas PhD.pdf
Size: 1.954Mb
Format: PDF
Description: Διδακτορική Διατριβή

Name: Σκίτσας Μιχαήλ Α. - ΗΜΜΥ - 2017.pdf
Size: 363.6Kb
Format: PDF
Description: Έντυπο έγκρισης ηλεκτρονικής ...

This item appears in the following Collection(s)

Τμήμα Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών / Department of Electrical and Computer Engineering [81]
