A1 Refereed original research article in a scientific journal
Private reliability environments for efficient fault-tolerance in CGRAs
Authors: Jafri SMAH, Piestrak SJ, Hemani A, Paul K, Plosila J, Tenhunen H
Publisher: Springer New York LLC
Publication year: 2014
Journal: Design Automation for Embedded Systems
Journal name in source: DESIGN AUTOMATION FOR EMBEDDED SYSTEMS
Journal acronym: DES AUTOM EMBED SYST
Volume: 18
Issue: 3-4
First page: 295
Last page: 327
Number of pages: 33
ISSN: 0929-5585
eISSN: 1572-8080
DOI: https://doi.org/10.1007/s10617-014-9129-6
In the era of platforms hosting multiple applications with variable reliability needs, worst-case platform-wide fault-tolerance decisions are neither optimal nor desirable. As a solution to this problem, designs commonly employ adaptive fault-tolerance strategies that provide each application with the reliability level it actually needs. However, in the CGRA domain, existing schemes either only allow switching between different levels of modular redundancy (duplication, triplication, etc.) or protect only a particular region of a device (e.g. configuration memory, computation, or data memory). To complement these strategies, we propose private fault-tolerance environments which, in addition to modular redundancy, also provide low-cost sub-modular (e.g. residue mod 3) redundancy capable of handling both permanent and temporary faults in configuration memory, computation, communication, and data memory. We also present adaptive configuration scrubbing techniques which prevent fault accumulation in the configuration memory. Simulation results using a few selected algorithms (FFT, matrix multiplication, and FIR filter) show that the proposed approach is capable of providing flexible protection with energy overhead ranging from 3.125 % to 107 % for different reliability levels. Synthesis results have confirmed that the proposed architecture reduces the area overhead for self-checking (58 %) and fault-tolerant (7.1 %) versions, compared to state-of-the-art adaptive reliability techniques.
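The residue mod 3 redundancy mentioned in the abstract can be illustrated with a small software sketch. This is not the paper's hardware design; it only shows the general arithmetic-code idea for an adder, assuming non-negative integer operands. The names `residue3`, `checked_add`, and `fault_mask` are hypothetical and introduced for illustration only.

```python
def residue3(x: int) -> int:
    """Return the mod-3 residue used as a low-cost check symbol."""
    return x % 3


def checked_add(a: int, b: int, fault_mask: int = 0) -> tuple[int, bool]:
    """Add two non-negative integers and verify the sum with a mod-3 check.

    fault_mask is a hypothetical fault-injection hook: its set bits are
    flipped in the main datapath result to emulate a fault.
    """
    # Main datapath: the full-width sum, optionally corrupted by the fault.
    result = (a + b) ^ fault_mask

    # Checker datapath: operates only on the 2-bit residues of the operands.
    predicted = (residue3(a) + residue3(b)) % 3

    # The result is accepted only if its residue matches the prediction.
    ok = residue3(result) == predicted
    return result, ok


if __name__ == "__main__":
    print(checked_add(25, 17))                     # fault-free addition
    print(checked_add(25, 17, fault_mask=1 << 3))  # one flipped result bit
```

A single flipped bit changes the result by plus or minus a power of two, and no power of two is divisible by 3, so the check flags every single-bit error in the sum. Errors whose arithmetic value is a multiple of 3 would go undetected, which is the usual trade-off of such a code against full duplication.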