A Large-Scale Framework for Automatic Addition of Fault-Tolerance

The advent of multi-core processors creates a need for tools that facilitate the design of parallel programs and decrease development costs. Design automation is one approach with the potential to do both. More importantly, design automation generates a program that is correct by construction, thereby eliminating the need for a separate proof of correctness. However, the exponential complexity of automatically designing parallel programs is a major obstacle to the development of such tools. One way to decrease the computational cost of automated design is to exploit the processing power of parallel/distributed platforms.

The focus of this work is the development of a distributed framework for the automated design of fault-tolerant parallel programs from their fault-intolerant versions. Specifically, we propose a divide-and-conquer approach that takes an existing fault-intolerant program and partitions its instructions into a set of subsets. Subsequently, each subset of instructions is automatically analyzed and revised in isolation on a separate machine in such a way that the entire program becomes fault-tolerant against a specific type of fault. Based on this approach, we have so far implemented a distributed fault-tolerance synthesizer (in C++) that utilizes the processing power of parallel/distributed machines to add fault tolerance to parallel/distributed programs. Our experiments with the current implementation of our framework show promising results.
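The partition-and-revise idea can be illustrated with a minimal sketch. This is not the synthesizer itself: it assumes a toy model in which a program state is an integer, an instruction is a single state transition, and "revising" a subset means dropping transitions that enter a given set of unsafe states. The names Transition, partition, revise_subset, and add_tolerance are hypothetical, and the subsets are processed in a plain loop where the framework would use separate machines.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical toy model: a state is an int, and an instruction is a single
// transition from one state to another.
struct Transition {
    int from;
    int to;
};

// Divide: split the instructions into `parts` round-robin subsets.
std::vector<std::vector<Transition>> partition(
    const std::vector<Transition>& program, std::size_t parts) {
    assert(parts > 0);
    std::vector<std::vector<Transition>> subsets(parts);
    for (std::size_t i = 0; i < program.size(); ++i) {
        subsets[i % parts].push_back(program[i]);
    }
    return subsets;
}

// Revise one subset in isolation: keep only transitions that do not enter a
// state in `unsafe` (assumed here to be computed beforehand).
std::vector<Transition> revise_subset(const std::vector<Transition>& subset,
                                      const std::set<int>& unsafe) {
    std::vector<Transition> kept;
    for (const Transition& t : subset) {
        if (unsafe.count(t.to) == 0) {
            kept.push_back(t);
        }
    }
    return kept;
}

// Conquer: revise every subset independently, then merge the results.
std::vector<Transition> add_tolerance(const std::vector<Transition>& program,
                                      const std::set<int>& unsafe,
                                      std::size_t parts) {
    std::vector<Transition> revised;
    for (const auto& subset : partition(program, parts)) {
        // Each call reads only its own subset and the shared unsafe set, so
        // in this toy model it could run on a separate machine.
        std::vector<Transition> kept = revise_subset(subset, unsafe);
        revised.insert(revised.end(), kept.begin(), kept.end());
    }
    return revised;
}
```

In this simplified setting the subsets are fully independent. In general, revising one subset may affect what is safe for another, so a real framework would likely need some coordination between machines, which the sketch leaves out.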

 
