Faults are commonplace and inevitable in today's applications. Hence,
automated techniques are necessary to locate faults by analyzing failed
executions. For applications that are critical and for which down time is
highly detrimental, techniques for surviving software failures and letting
the execution continue are desired. These problems are more challenging
for programs that are multithreaded and long-running. In this
talk, I will present techniques for fault location and avoidance in
programs that may be multithreaded and long-running.
For locating faults in programs, dynamic slices have been shown to be very
effective in reducing the effort of debugging. While prior work has
primarily focused on single-threaded programs, in this talk I will show how
dynamic slicing can be used for fault location in multithreaded programs.
In particular, I will show that dynamic slices can track down faults due to
data races in multithreaded programs by incorporating additional data
dependences that arise in the presence of multiple threads. I will discuss
a compact trace representation that can be used to efficiently store
the dependence traces needed to construct the dynamic slice. I
will then describe a framework for tracing that achieves its scalability
via checkpointing- and logging-based collection of traces. The collected traces
contain only the dynamic information relevant to the fault and are hence
highly compact. We have successfully employed this framework in tracing
long-running multithreaded applications.
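
As a rough illustration of the idea, the sketch below computes a backward
dynamic slice over a small recorded trace. It is not the representation or
algorithm presented in the talk: the Event record, the trace layout, and the
function names are all hypothetical, and only read-after-write dependences
are modeled. Because the last writer of each shared variable is tracked
across all threads, a read in one thread can depend on a write in another.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Event:
    """One executed statement instance in a global trace (hypothetical layout)."""
    idx: int                      # position in the observed global order
    thread: str                   # thread that executed the statement
    stmt: str                     # source text, for display only
    reads: Tuple[str, ...] = ()   # variables read
    writes: Tuple[str, ...] = ()  # variables written
    ctrl: Optional[int] = None    # idx of the controlling branch event, if any


def build_dependences(trace):
    """Map each event to the events it dynamically depends on."""
    last_write = {}  # variable -> idx of most recent write, in ANY thread
    deps = {}
    for ev in trace:
        d = {last_write[v] for v in ev.reads if v in last_write}
        if ev.ctrl is not None:
            d.add(ev.ctrl)
        deps[ev.idx] = d
        for v in ev.writes:
            last_write[v] = ev.idx
    return deps


def dynamic_slice(trace, criterion):
    """Backward closure of the dependences starting at event `criterion`."""
    deps = build_dependences(trace)
    seen, work = set(), [criterion]
    while work:
        i = work.pop()
        if i not in seen:
            seen.add(i)
            work.extend(deps[i])
    return sorted(seen)


# A lost-update race: both threads read `balance` before either writes it back.
trace = [
    Event(0, "T1", "balance = 100", writes=("balance",)),
    Event(1, "T1", "tmp1 = balance", reads=("balance",), writes=("tmp1",)),
    Event(2, "T2", "tmp2 = balance", reads=("balance",), writes=("tmp2",)),
    Event(3, "T1", "balance = tmp1 + 50", reads=("tmp1",), writes=("balance",)),
    Event(4, "T2", "balance = tmp2 - 30", reads=("tmp2",), writes=("balance",)),
    Event(5, "T2", "hits = 1", writes=("hits",)),
    Event(6, "T1", "assert balance == 120", reads=("balance",)),
]

failing_slice = dynamic_slice(trace, 6)
for i in failing_slice:
    print(trace[i].thread, trace[i].stmt)
```

In this toy trace, the slice of the failing assertion in T1 reaches back into
the statements of T2 that produced the value it read, while the unrelated
write to `hits` is left out.
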
For fault avoidance, I will present a technique to recover applications
from a class of faults that are caused by the execution environment. The
technique survives a fault by rolling back the execution to an appropriate
program execution point and re-executing the code region under a modified
environment. The technique can also prevent the fault from occurring again
in the future by learning from the first occurrence. This technique has
been used successfully in a variety of applications to avoid faults
caused by thread scheduling, heap overflows, and malformed user
requests.
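
The rollback and re-execution loop described above can be sketched roughly as
follows. This is only a toy model under stated assumptions, not the system
presented in the talk: the checkpoint is a deep copy of an in-memory dict, the
"environment" is a dict with a hypothetical `padding` setting, the fault is a
simulated buffer overflow, and all names (EnvironmentFault, handle_request,
run_with_recovery) are invented for illustration.

```python
import copy


class EnvironmentFault(Exception):
    """Simulated fault whose cause lies in the execution environment."""


def handle_request(state, env, request):
    """Toy code region: fails if the request exceeds the buffer capacity."""
    state["log"].append("start " + request)   # side effect before the fault
    capacity = 8 + env["padding"]             # hypothetical allocator setting
    if len(request) > capacity:
        raise EnvironmentFault("buffer overflow")
    state["stored"].append(request)
    return len(request)


def run_with_recovery(region, state, env, candidate_patches, learned, *args):
    """Run `region`; on a fault, roll back and retry under a modified environment."""
    env = {**env, **learned}                  # apply patches learned earlier
    checkpoint = copy.deepcopy(state)         # checkpoint before the region
    try:
        return region(state, env, *args)
    except EnvironmentFault:
        pass
    for patch in candidate_patches:
        state.clear()
        state.update(copy.deepcopy(checkpoint))   # roll back to the checkpoint
        try:
            result = region(state, {**env, **patch}, *args)
        except EnvironmentFault:
            continue
        learned.update(patch)                 # remember what avoided the fault
        return result
    state.clear()
    state.update(copy.deepcopy(checkpoint))
    raise EnvironmentFault("no environment patch avoided the fault")


state = {"log": [], "stored": []}
env = {"padding": 0}
learned = {}
patches = [{"padding": 4}, {"padding": 16}]

# First call faults with padding 0, is rolled back, and succeeds with padding 4.
first = run_with_recovery(handle_request, state, env, patches, learned,
                          "0123456789ab")
# Second call starts from the learned environment and does not fault at all.
second = run_with_recovery(handle_request, state, env, patches, learned,
                           "abcdefghij")
```

The rollback discards the log entry written by the failed first attempt, and
the learned patch is applied up front on later calls so that the same fault
does not recur.
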