Research Article

Adaptable Fault Tolerance Configurations for Multiprocessor Systems

by Samia A. Ali
International Journal of Applied Information Systems
Foundation of Computer Science (FCS), NY, USA
Volume 3 - Number 2
Year of Publication: 2012
Authors: Samia A. Ali
http:/ijais12-450448

Samia A. Ali. Adaptable Fault Tolerance Configurations for Multiprocessor Systems. International Journal of Applied Information Systems. 3, 2 (July 2012), 1-8. DOI=http:/ijais12-450448

@article{ http:/ijais12-450448,
author = { Samia A. Ali },
title = { Adaptable Fault Tolerance Configurations for Multiprocessor Systems },
journal = { International Journal of Applied Information Systems },
issue_date = { July 2012 },
volume = { 3 },
number = { 2 },
month = { July },
year = { 2012 },
issn = { 2249-0868 },
pages = { 1-8 },
numpages = {9},
url = { https://www.ijais.org/archives/volume3/number2/201-0448/ },
doi = { http:/ijais12-450448 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2023-07-05T10:45:21.190547+05:30
%A Samia A. Ali
%T Adaptable Fault Tolerance Configurations for Multiprocessor Systems
%J International Journal of Applied Information Systems
%@ 2249-0868
%V 3
%N 2
%P 1-8
%D 2012
%I Foundation of Computer Science (FCS), NY, USA
Abstract

The growing complexity of multiprocessor systems increases the probability of faults occurring in these systems. Consequently, there is a great need for fault-tolerant processing in multiprocessor systems. Fault tolerance generally requires some form of hardware and/or time redundancy. Two fault-tolerant configurations are proposed that handle single and double faults, both transient and permanent, in any processor of a multiprocessor system. Faults are tolerated in three consecutive steps: fault detection, fault diagnosis, and system recovery. The overhead of the first configuration is 100% hardware redundancy for fault detection, while that of the second is 100% time redundancy; in both, fault diagnosis and system recovery cost an extra 100% time, incurred only by the processes running on the faulty processors. The advantages of the proposed configurations are their ease of application and their low overhead relative to a system without any fault tolerance. An enhancement to both configurations checks the system state so that faults are detected and recovered from as soon as they affect the system. Simulations are performed to illustrate the usefulness of the proposed configurations.
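The three steps named in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the configurations detect or diagnose faults, so the mechanism below is an assumption: detection by duplicated execution and comparison (in the spirit of the hardware-redundant configuration), diagnosis by one extra run on a spare, and recovery by continuing with the result that two runs agree on. All names (`run_with_fault_tolerance`, `primary`, `shadow`, `spare`, `healthy`, `faulty`) are hypothetical and do not come from the paper.

```python
from typing import Callable, Optional, Tuple

Task = Callable[[int], int]


def run_with_fault_tolerance(
    primary: Task, shadow: Task, spare: Task, x: int
) -> Tuple[int, Optional[str]]:
    """Return (result, name of the unit diagnosed as faulty, or None).

    Assumed scheme, not the paper's actual design:
    duplicate-and-compare, then a third run to break the tie.
    """
    # Step 1: fault detection. Run the process on two units and compare.
    a = primary(x)
    b = shadow(x)
    if a == b:
        return a, None  # no fault detected, no extra cost

    # Step 2: fault diagnosis. Only the disagreeing process pays for
    # an extra run, here on a spare unit.
    c = spare(x)

    # Step 3: system recovery. Continue with the value two runs agree on.
    if c == a:
        return a, "shadow"
    if c == b:
        return b, "primary"
    raise RuntimeError("no two results agree; the fault cannot be diagnosed")


def healthy(x: int) -> int:
    """A unit that computes the intended function."""
    return x * x


def faulty(x: int) -> int:
    """A unit with an injected fault (illustrative only)."""
    return x * x + 1


if __name__ == "__main__":
    print(run_with_fault_tolerance(healthy, healthy, healthy, 3))
    print(run_with_fault_tolerance(faulty, healthy, healthy, 7))
```

In this sketch the fault-free path costs two executions, and the third execution happens only when the first two disagree, which mirrors the abstract's statement that diagnosis and recovery overhead applies only to processes on faulty processors. A time-redundant variant would run the same process twice on one processor instead of on two units.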

Index Terms

Computer Science
Information Sciences

Keywords

Hardware Redundancy
Time Redundancy
Transient Fault
Permanent Fault
Cold Standby Spare