Colloquium Speaker

Speaker: Sriraman Tallam, Computer Science Department
Topic: Fault Location and Avoidance in Long-Running Multithreaded Applications
Date: Monday, October 1, 2007
Time: 4:00 PM
Place: Gould-Simpson, Room 701
Light refreshments will be served in the 7th floor lobby of Gould-Simpson at 3:45 PM.

*Please note: Special Day, Time and Location*


Abstract

Faults are commonplace and inevitable in today's applications, so automated techniques are needed to locate faults by analyzing failed executions. For critical applications in which downtime is highly detrimental, techniques that survive software failures and let execution continue are also desirable. Both problems are more challenging for programs that are multithreaded and long-running. In this talk, I will present techniques for fault location and avoidance in such programs.

For locating faults, dynamic slices have been shown to be very effective in reducing debugging effort. While prior work has focused primarily on single-threaded programs, in this talk I will show how dynamic slicing can be used for fault location in multithreaded programs. I will show that dynamic slices can track down faults due to data races by incorporating the additional data dependences that arise in the presence of multiple threads. I will discuss a compact trace representation that efficiently stores the dependence traces needed to construct a dynamic slice. I will then describe a tracing framework that achieves scalability through checkpointing- and logging-based trace collection. The collected traces contain only the dynamic information relevant to the fault and are hence highly compact. We have successfully employed this framework in tracing long-running multithreaded applications.

For fault avoidance, I will present a technique for recovering applications from a class of faults caused by the execution environment. The technique survives a fault by rolling back execution to an appropriate program point and re-executing the code region under a modified environment. It can also prevent the fault from recurring by learning from the first occurrence. This technique has been used successfully in a variety of applications to avoid faults caused by thread scheduling, heap overflow, and malformed user requests.

 

