Experiences with software-based soft-error mitigation using AN codes

Publication: Contribution to journal, Article, Research, Peer-reviewed

Authors

  • Martin Hoffmann
  • Peter Ulbrich
  • Christian Dietrich
  • Horst Schirmeier
  • Daniel Lohmann
  • Wolfgang Schröder-Preikschat

External organisations

  • Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU Erlangen-Nürnberg)
  • Technische Universität Dortmund

Details

Original language: English
Pages (from - to): 87-113
Number of pages: 27
Journal: Software quality journal
Volume: 24
Issue number: 1
Publication status: Published - 22 Nov 2014
Externally published: Yes

Abstract

Arithmetic error coding schemes are a well-known and effective technique for soft-error mitigation. Although the underlying coding theory is generally a complex area of mathematics, its practical implementation is comparatively simple in general. However, compliance with the theory can be lost easily while moving toward an actual implementation, which finally jeopardizes the aspired fault-tolerance characteristics and effectiveness. In this paper, we present our experiences and lessons learned from implementing arithmetic error coding schemes (AN codes) in the context of our Combined Redundancy fault-tolerance approach. We focus on the challenges and pitfalls in the transition from maths to machine code for a binary computer from a systems perspective. Our results show that practical misconceptions (such as the use of prime numbers) and architecture-dependent implementation glitches occur at every stage of this transition. We identify typical pitfalls and describe practical measures to find and resolve them. This allowed us to eliminate all remaining silent data corruptions in the Combined Redundancy framework, which we validated by an extensive fault-injection campaign covering the entire fault space of 1-bit and 2-bit errors.
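
As an illustration of the basic idea behind AN codes mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. An AN code multiplies each value by a constant A, so every valid code word is a multiple of A and a corrupted word is detected when the remainder modulo A is non-zero. The constant `A = 58659` is an arbitrary odd value chosen for illustration only, and the helper names `encode`, `decode`, `is_valid` and `flip_bit` are hypothetical.

```python
# Minimal AN-code sketch. A is an arbitrary illustrative constant (assumption),
# not a value taken from this record.
A = 58659


def encode(value: int) -> int:
    """Encode a plain value as a multiple of A."""
    return value * A


def is_valid(code_word: int) -> bool:
    """A code word is valid only if it is an exact multiple of A."""
    return code_word % A == 0


def decode(code_word: int) -> int:
    """Check the code word and return the plain value."""
    if not is_valid(code_word):
        raise ValueError("corrupted code word detected")
    return code_word // A


def flip_bit(word: int, bit: int) -> int:
    """Simulate a soft error by flipping a single bit (hypothetical helper)."""
    return word ^ (1 << bit)


if __name__ == "__main__":
    word = encode(42)
    print(decode(word))  # prints 42

    # Addition works directly on encoded operands.
    print(encode(3) + encode(4) == encode(7))  # prints True

    # Exhaustively flip each of the low 32 bits and count detected errors.
    detected = sum(not is_valid(flip_bit(word, b)) for b in range(32))
    print(detected)  # prints 32
```

Every single-bit flip changes the word by a power of two, which is never a multiple of an odd A greater than 1, so all 32 flips are detected in this sketch. Detection of multi-bit errors depends on the choice of A and on the word width, and this sketch makes no claim about it.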

Cite this

Experiences with software-based soft-error mitigation using AN codes. / Hoffmann, Martin; Ulbrich, Peter; Dietrich, Christian et al.
In: Software quality journal, vol. 24, no. 1, 22.11.2014, pp. 87-113.

Hoffmann, M, Ulbrich, P, Dietrich, C, Schirmeier, H, Lohmann, D & Schröder-Preikschat, W 2014, 'Experiences with software-based soft-error mitigation using AN codes', Software quality journal, vol. 24, no. 1, pp. 87-113. https://doi.org/10.1007/s11219-014-9260-4
Hoffmann, M., Ulbrich, P., Dietrich, C., Schirmeier, H., Lohmann, D., & Schröder-Preikschat, W. (2014). Experiences with software-based soft-error mitigation using AN codes. Software quality journal, 24(1), 87-113. https://doi.org/10.1007/s11219-014-9260-4
Hoffmann M, Ulbrich P, Dietrich C, Schirmeier H, Lohmann D, Schröder-Preikschat W. Experiences with software-based soft-error mitigation using AN codes. Software quality journal. 2014 Nov 22;24(1):87-113. doi: 10.1007/s11219-014-9260-4
Hoffmann, Martin; Ulbrich, Peter; Dietrich, Christian et al. / Experiences with software-based soft-error mitigation using AN codes. In: Software quality journal. 2014; vol. 24, no. 1. pp. 87-113.
@article{98ed2d4835624f21bbb566725eb7543b,
title = "Experiences with software-based soft-error mitigation using AN codes",
abstract = "Arithmetic error coding schemes are a well-known and effective technique for soft-error mitigation. Although the underlying coding theory is generally a complex area of mathematics, its practical implementation is comparatively simple in general. However, compliance with the theory can be lost easily while moving toward an actual implementation, which finally jeopardizes the aspired fault-tolerance characteristics and effectiveness. In this paper, we present our experiences and lessons learned from implementing arithmetic error coding schemes (AN codes) in the context of our Combined Redundancy fault-tolerance approach. We focus on the challenges and pitfalls in the transition from maths to machine code for a binary computer from a systems perspective. Our results show that practical misconceptions (such as the use of prime numbers) and architecture-dependent implementation glitches occur at every stage of this transition. We identify typical pitfalls and describe practical measures to find and resolve them. This allowed us to eliminate all remaining silent data corruptions in the Combined Redundancy framework, which we validated by an extensive fault-injection campaign covering the entire fault space of 1-bit and 2-bit errors.",
keywords = "Arithmetic code, Dependability, Fault injection",
author = "Martin Hoffmann and Peter Ulbrich and Christian Dietrich and Horst Schirmeier and Daniel Lohmann and Wolfgang Schr{\"o}der-Preikschat",
note = "Funding information: This work was partly supported by the Bavarian Ministry of State for Economics, Traffic, and Technology under the (EU EFRE funds) Grant No. 0704/883 25 and the German Research Foundation (DFG) priority program SPP 1500 under grant no. LO 1719/1-2 and SP 968/5-2. Implementation and further experimental results: http://www4.cs.fau.de/Research/CoRed",
year = "2014",
month = nov,
day = "22",
doi = "10.1007/s11219-014-9260-4",
language = "English",
volume = "24",
pages = "87--113",
number = "1",

}

TY - JOUR

T1 - Experiences with software-based soft-error mitigation using AN codes

AU - Hoffmann, Martin

AU - Ulbrich, Peter

AU - Dietrich, Christian

AU - Schirmeier, Horst

AU - Lohmann, Daniel

AU - Schröder-Preikschat, Wolfgang

N1 - Funding information: This work was partly supported by the Bavarian Ministry of State for Economics, Traffic, and Technology under the (EU EFRE funds) Grant No. 0704/883 25 and the German Research Foundation (DFG) priority program SPP 1500 under grant no. LO 1719/1-2 and SP 968/5-2. Implementation and further experimental results: http://www4.cs.fau.de/Research/CoRed

PY - 2014/11/22

Y1 - 2014/11/22

AB - Arithmetic error coding schemes are a well-known and effective technique for soft-error mitigation. Although the underlying coding theory is generally a complex area of mathematics, its practical implementation is comparatively simple in general. However, compliance with the theory can be lost easily while moving toward an actual implementation, which finally jeopardizes the aspired fault-tolerance characteristics and effectiveness. In this paper, we present our experiences and lessons learned from implementing arithmetic error coding schemes (AN codes) in the context of our Combined Redundancy fault-tolerance approach. We focus on the challenges and pitfalls in the transition from maths to machine code for a binary computer from a systems perspective. Our results show that practical misconceptions (such as the use of prime numbers) and architecture-dependent implementation glitches occur at every stage of this transition. We identify typical pitfalls and describe practical measures to find and resolve them. This allowed us to eliminate all remaining silent data corruptions in the Combined Redundancy framework, which we validated by an extensive fault-injection campaign covering the entire fault space of 1-bit and 2-bit errors.

KW - Arithmetic code

KW - Dependability

KW - Fault injection

UR - http://www.scopus.com/inward/record.url?scp=84956699828&partnerID=8YFLogxK

U2 - 10.1007/s11219-014-9260-4

DO - 10.1007/s11219-014-9260-4

M3 - Article

AN - SCOPUS:84956699828

VL - 24

SP - 87

EP - 113

JO - Software quality journal

JF - Software quality journal

SN - 0963-9314

IS - 1

ER -