A Large-Scale Framework for Automatic Addition of Fault-Tolerance
The advent of multi-core
processors creates a need for tools that facilitate the design
of parallel programs and decrease development costs. Design automation
is one approach with the potential to reduce these costs.
More importantly, design automation generates a program that
is correct by construction, eliminating the need for a separate proof
of correctness. However, the exponential complexity of automatically designing
parallel programs is a major obstacle to the development of
such tools. One approach for decreasing the computational cost of
automated design is to exploit the processing power of
parallel/distributed platforms for design automation.
The focus of this work is on the development of a distributed framework
for automated design of fault-tolerant parallel programs from their
fault-intolerant version. Specifically, we propose a divide-and-conquer
approach that takes an existing fault-intolerant program and partitions
its instructions into a set of subsets.
Subsequently, each subset of instructions is automatically
analyzed and revised in isolation on a separate machine in such a way
that the entire program becomes fault-tolerant against a specific class
of faults. Based on this approach, we have so far implemented a
distributed fault-tolerance synthesizer (in
C++) that uses the processing power of parallel/distributed machines
to add fault tolerance to parallel/distributed programs. Our experiments
with the current implementation of our framework show promising results.
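
The partition-then-revise structure described above can be sketched in a few lines of C++. This is only an illustrative skeleton, not the synthesizer itself: the `Instruction` type, the function names `partition`, `revise_subset`, and `synthesize`, and the contiguous chunking strategy are all assumptions made for the example. In particular, `revise_subset` is a placeholder that merely tags each instruction, and local asynchronous tasks stand in for separate machines.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <string>
#include <utility>
#include <vector>

// Hypothetical representation: an instruction is just a text label here.
using Instruction = std::string;

// Split the instruction list into at most `parts` contiguous subsets.
// (Assumed strategy; the abstract does not say how subsets are chosen.)
std::vector<std::vector<Instruction>> partition(
    const std::vector<Instruction>& program, std::size_t parts) {
    std::vector<std::vector<Instruction>> subsets;
    if (parts == 0 || program.empty()) return subsets;
    const std::size_t chunk = (program.size() + parts - 1) / parts;  // ceiling division
    for (std::size_t i = 0; i < program.size(); i += chunk) {
        const std::size_t end = std::min(i + chunk, program.size());
        subsets.emplace_back(program.begin() + static_cast<std::ptrdiff_t>(i),
                             program.begin() + static_cast<std::ptrdiff_t>(end));
    }
    return subsets;
}

// Placeholder for the per-subset analysis and revision step.
// A real synthesizer would reason about faults here; this only tags each instruction.
std::vector<Instruction> revise_subset(std::vector<Instruction> subset) {
    for (auto& instruction : subset) instruction += " [revised]";
    return subset;
}

// Revise every subset in isolation, then merge the results in the original order.
// Each asynchronous task stands in for work dispatched to a separate machine.
std::vector<Instruction> synthesize(const std::vector<Instruction>& program,
                                    std::size_t parts) {
    std::vector<std::future<std::vector<Instruction>>> jobs;
    for (auto& subset : partition(program, parts)) {
        jobs.push_back(std::async(std::launch::async, revise_subset, std::move(subset)));
    }
    std::vector<Instruction> result;
    for (auto& job : jobs) {
        auto revised = job.get();  // blocks until this subset has been processed
        result.insert(result.end(), revised.begin(), revised.end());
    }
    return result;
}
```

Calling `synthesize(program, n)` splits `program` into at most `n` subsets, processes them concurrently, and concatenates the revised subsets in their original order. Because each subset is handled without reference to the others, the same skeleton could in principle hand each subset to a remote worker instead of a local task.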