Survivability Issues in Cactus


Matti A. Hiltunen, Richard D. Schlichting, and Carlos A. Ugarte

Department of Computer Science
The University of Arizona
Tucson, AZ 85721, USA
email: hiltunen/rick/cau@cs.arizona.edu


Abstract

Survivability, the ability of a system to tolerate intentional attacks or accidental failures or errors, is becoming increasingly important with the extended use of computer systems in society. Techniques such as cryptographic methods, intrusion detection, and traditional fault-tolerance are being developed to improve the survivability of such systems. One of the special challenges of survivability is the need to work with existing systems - legacy or off-the-shelf - that were not designed with survivability in mind. The Cactus project offers some potential solutions to survivability in the form of highly customizable middleware.


1. Introduction

A survivable system is one that is able to "complete its mission in a timely manner, even if significant portions are incapacitated by attack or accident" [Bar96]. Systems connected to unbounded networks, such as the Internet, are faced with additional challenges [EFL+97]. These systems have to prevent unauthorized use, maintain confidentiality, and provide adequate service to proper users. While survivability builds on research in security, reliability, fault-tolerance, safety and availability, it is more than just a combination of these - it is also about the interaction of these different properties [VMG97].

This paper outlines potential contributions to survivability by the Cactus project at the University of Arizona.

2. Cactus Approach

The goal of the Cactus project [SH97] is to provide an approach to constructing highly customizable middleware services for networked systems. Cactus provides an integrated approach that addresses a wide range of functional properties as well as quality of service (QoS) aspects such as reliability, timeliness, performance, and security. The fine-grain customization provided by Cactus allows application-specific control over these service aspects and their potential tradeoffs such as the one between service performance and security.

The Cactus approach to constructing highly customizable middleware services is based on implementing abstract properties and functional components of a service as separate modules that interact using an event-driven execution model. The basic building block of this model is a micro-protocol, a software module that implements a well-defined property of the desired service. A micro-protocol is, in turn, structured as a collection of event handlers, which are procedure-like segments of code that are executed when a specified event occurs. Events are used to signify state changes of interest, e.g., "message arrival from the network". When such an event occurs, all event handlers bound to that event are executed. Events can be raised explicitly by micro-protocols or implicitly by the runtime system. Execution of handlers is atomic with respect to concurrency, i.e., each handler is executed to completion without interruption. The binding of handlers to events can be changed at runtime.
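
The event-driven model described above can be illustrated with a minimal single-threaded sketch. It is not the Cactus API: the class names Framework, ArrivalLog, and Acknowledger, the methods bind, unbind, and raise_event, and the event names are all hypothetical and chosen only for illustration. Because the sketch is single-threaded, each handler trivially runs to completion; a real runtime would have to enforce that property under concurrency.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List

Handler = Callable[[Any], None]

# Hypothetical event names; the paper only gives "message arrival from
# the network" as an example of an event.
MSG_FROM_NET = "msg_from_net"
SEND_ACK = "send_ack"


class Framework:
    """Hypothetical stand-in for the runtime that binds handlers to events."""

    def __init__(self) -> None:
        self._bindings: DefaultDict[str, List[Handler]] = defaultdict(list)

    def bind(self, event: str, handler: Handler) -> None:
        self._bindings[event].append(handler)

    def unbind(self, event: str, handler: Handler) -> None:
        self._bindings[event].remove(handler)

    def raise_event(self, event: str, arg: Any = None) -> None:
        # Iterate over a copy so a handler may change bindings while the
        # event is being dispatched. Each handler runs to completion
        # before the next one starts.
        for handler in list(self._bindings[event]):
            handler(arg)


class ArrivalLog:
    """Example micro-protocol: records every message that arrives."""

    def __init__(self, fw: Framework) -> None:
        self.seen: List[Any] = []
        fw.bind(MSG_FROM_NET, self.on_message)

    def on_message(self, msg: Any) -> None:
        self.seen.append(msg)


class Acknowledger:
    """Example micro-protocol: raises a second event for each arrival."""

    def __init__(self, fw: Framework) -> None:
        self._fw = fw
        fw.bind(MSG_FROM_NET, self.on_message)

    def on_message(self, msg: Any) -> None:
        self._fw.raise_event(SEND_ACK, msg)
```

In this sketch, configuring a service amounts to choosing which micro-protocol objects to instantiate against the same Framework instance; the micro-protocols never call each other directly and interact only through events.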

Event handler binding and invocation are implemented by a standard runtime system or framework that is linked with the micro-protocols to form a composite protocol. The framework also supports shared data (e.g., messages) that can be accessed by the micro-protocols configured into the framework. Once created, a composite protocol can be composed with other protocols in a traditional hierarchical manner, using a system such as the x-kernel [HP91], to construct a middleware service.

3. Cactus and Survivability

Although the Cactus approach was not designed with survivability as a major goal, its integrated framework for QoS makes it easy to add properties and QoS attributes related to survivability.

3.1. Survivability Methods in Cactus

Fault-tolerance and security, two methods for increasing survivability [KMT97], are naturally part of the QoS attributes provided by Cactus services. Fault-tolerance is provided by micro-protocols that implement object or process replication (active and passive) and reliable communication through retransmissions. Security is provided by micro-protocols that implement different authentication and encryption methods. A similar approach could be used to implement intrusion detection [Jou97, OM97, BS97]. The configurability aspect of the Cactus approach would make it easy to add any number of intrusion detection (as well as intrusion handling) micro-protocols, and any combination of them could be used together.

Another aspect of Cactus that may have a great impact on survivability is the diversity provided by the high level of customizability. If users customize the underlying services to their exact requirements and to the characteristics of the execution environment, the different service instances may become different enough that a method of breaking one instance of the service might not work on others [CP97].

Furthermore, an important aspect we are exploring in the Cactus project is adaptability. By adaptability, we mean a system's ability to react to changes - both in the execution environment (e.g., failures, intrusions) and in user requirements (e.g., a user's request to increase the security level of the system) - dynamically at runtime. The Cactus model offers a method of adaptation in event-handler rebinding: different modes of operation can be implemented in different event handlers, and a mode switch only involves unbinding the old event handlers and binding the new ones. We are also exploring methods of dynamic code modification, where new event handlers can be inserted into a running composite protocol. To coordinate the adaptation of distributed programs, we have developed a three-phase model for adaptations consisting of change detection, agreement, and action [HS96]. The agreement phase, where the different nodes of the distributed computation agree on whether the adaptation is necessary and what the adaptation should be, is the most complicated. To simplify the implementation of this phase, we have developed a library of agreement micro-protocols.
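
Adaptation by event-handler rebinding can be sketched on a single node as follows. This covers only the local action phase; change detection and distributed agreement are omitted. The names Dispatcher, make_sender, and switch_mode, as well as the "plain" and "protected" modes, are hypothetical and do not come from the Cactus implementation.

```python
from typing import Callable, Dict, List, Tuple

Handler = Callable[[str], None]


class Dispatcher:
    """Hypothetical minimal event dispatcher with rebindable handlers."""

    def __init__(self) -> None:
        self._bindings: Dict[str, List[Handler]] = {}

    def bind(self, event: str, handler: Handler) -> None:
        self._bindings.setdefault(event, []).append(handler)

    def unbind(self, event: str, handler: Handler) -> None:
        self._bindings[event].remove(handler)

    def raise_event(self, event: str, arg: str) -> None:
        # Copy the list so rebinding during dispatch is safe.
        for handler in list(self._bindings.get(event, [])):
            handler(arg)


def make_sender(mode: str, wire: List[Tuple[str, str]]) -> Handler:
    """Build a handler for one mode of operation.

    The handler only records (mode, message) on `wire`; a real handler
    would transmit the message using that mode's mechanisms.
    """
    def send(msg: str) -> None:
        wire.append((mode, msg))
    return send


def switch_mode(d: Dispatcher, event: str, old: Handler, new: Handler) -> None:
    """Mode switch: unbind the old handler, then bind the new one."""
    d.unbind(event, old)
    d.bind(event, new)
```

A mode switch is then a single call such as switch_mode(d, "send", plain, protected), after which messages raised on the "send" event are handled in the new mode while the rest of the configuration is left untouched.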

Finally, as mentioned above, it is often necessary to increase the survivability of existing legacy or off-the-shelf applications. This can often be accomplished either by replacing some of the underlying communication and operating system services with survivable ones or by transparently inserting new middleware services between the application and the underlying services. One of the Cactus prototypes runs on the MK operating system and CORDS communication subsystem from the OpenGroup Research Institute. This platform allows all or part of a protocol stack to be inserted into the operating system kernel, and could thus replace existing communication services provided by the operating system. We are exploring the transparent insertion of new middleware services in the context of CORBA. The goal is to enhance the QoS for existing CORBA clients and servers without modifying either their code or the underlying CORBA ORBs. We are working on accomplishing this goal on the Orbix CORBA ORB by inserting composite protocols into smart proxies (at the clients) and filters (at the servers).

3.2. Survivability in Cactus Services

We are designing and implementing a number of highly-customizable middleware services using the Cactus approach. A number of these services could be used to improve system survivability. The following are some examples of services and their survivability aspects.

GroupRPC service. GroupRPC provides remote procedure call service to a group of servers. Using a group of servers rather than a single server naturally improves survivability in the sense that the failure of one server does not prevent a remote procedure call from completing successfully. Furthermore, the service allows the client to specify a collation function that is used to combine the responses from different servers into one. Such a collation function can be used to implement majority voting and even sanity checks on the responses, which allows the masking of different types of failures and even intrusions into some of the servers. Another special feature of GroupRPC is that it allows the user to specify the failure model that the service is to tolerate. The failure models currently supported include crash, omission, timing, and Byzantine (arbitrary). The failure model chosen affects not only the behavior at the client but also the interaction between servers. In particular, if the service is configured to tolerate Byzantine failures, the servers run a Byzantine agreement protocol to agree on the set of requests to be processed. Thus, a faulty or compromised server cannot cause the server group to become inconsistent.
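
As an illustration of a collation function that implements majority voting, consider the following sketch. The function name majority_collate, its signature, and the optional group_size parameter are assumptions made for this example and are not taken from the GroupRPC interface.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_collate(
    responses: Sequence[Hashable],
    group_size: Optional[int] = None,
) -> Optional[Hashable]:
    """Return the value reported by a strict majority of servers, else None.

    `group_size` is the number of servers in the group; if omitted, the
    vote is taken over the responses actually received.
    """
    if not responses:
        return None
    n = group_size if group_size is not None else len(responses)
    threshold = n // 2 + 1  # smallest count that is a strict majority of n
    value, count = Counter(responses).most_common(1)[0]
    return value if count >= threshold else None
```

Voting against the full group size rather than the number of replies received is the more conservative choice, since missing replies then count against the majority. A sanity check on the responses could be added by filtering the list before the vote.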

Security service. The security service is a composite protocol that provides enhanced security features such as authentication, integrity, and privacy for messages traversing that layer. The service provides a choice of micro-protocols for each of the aspects of secure communication. For example, we can provide a choice of methods for authentication, encryption, and detection of replay attacks. All aspects of security are optional. For example, a user might require integrity (e.g., through an encrypted message signature) but not privacy. Also, the service allows not only a choice of cryptographic protocol from a list of options but also the arbitrary combination of these protocols. This makes it more difficult for an intruder to break the encryption.
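
The idea of optional, freely combinable security properties can be sketched as a pipeline of layers, each contributing one property. The sketch below is only an assumption about how such a composition might look: the function names are hypothetical, HMAC-SHA256 is our choice for the integrity layer (the paper does not name specific algorithms), and the replay layer uses a simple sequence number. Privacy is left out because the Python standard library provides no encryption primitive.

```python
import hashlib
import hmac
from typing import Callable, List, Set, Tuple

Transform = Callable[[bytes], bytes]
Layer = Tuple[Transform, Transform]  # (applied on send, applied on receive)

TAG_LEN = hashlib.sha256().digest_size  # 32 bytes for SHA-256
SEQ_LEN = 8                             # bytes used for the sequence number


def make_integrity(key: bytes) -> Layer:
    """Integrity layer: append and verify an HMAC-SHA256 tag."""
    def protect(payload: bytes) -> bytes:
        return payload + hmac.new(key, payload, hashlib.sha256).digest()

    def verify(wire: bytes) -> bytes:
        if len(wire) < TAG_LEN:
            raise ValueError("message too short for integrity tag")
        payload, tag = wire[:-TAG_LEN], wire[-TAG_LEN:]
        expected = hmac.new(key, payload, hashlib.sha256).digest()
        if not hmac.compare_digest(tag, expected):
            raise ValueError("integrity check failed")
        return payload

    return protect, verify


def make_replay_detection() -> Layer:
    """Replay layer: prefix a sequence number and reject repeats."""
    next_seq = 0
    seen: Set[int] = set()

    def protect(payload: bytes) -> bytes:
        nonlocal next_seq
        header = next_seq.to_bytes(SEQ_LEN, "big")
        next_seq += 1
        return header + payload

    def verify(wire: bytes) -> bytes:
        if len(wire) < SEQ_LEN:
            raise ValueError("message too short for sequence number")
        seq = int.from_bytes(wire[:SEQ_LEN], "big")
        if seq in seen:
            raise ValueError("replayed message")
        seen.add(seq)
        return wire[SEQ_LEN:]

    return protect, verify


def compose(layers: List[Layer]) -> Layer:
    """Apply layers in order on send and in reverse order on receive."""
    def send(payload: bytes) -> bytes:
        for protect, _ in layers:
            payload = protect(payload)
        return payload

    def receive(wire: bytes) -> bytes:
        for _, verify in reversed(layers):
            wire = verify(wire)
        return wire

    return send, receive
```

For example, compose([make_replay_detection(), make_integrity(key)]) yields a configuration with integrity and replay detection but no privacy; listing the replay layer first means the integrity tag also covers the sequence number. Dropping either layer from the list gives a weaker but cheaper configuration.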

4. Summary

This paper has outlined potential contributions of the Cactus approach to research on survivability. The flexible composition model of Cactus supports construction of services with various fault-tolerance, security, and intrusion detection features as well as adaptivity to deal with failures, errors, and intrusions at runtime.

Acknowledgments

This work has been supported in part by the National Science Foundation under grant CCR-9633336 and the Defense Advanced Research Projects Agency under grants F30602-96-1-0342 and N66001-97-C-8518.
 

References

[Bar96]
M. Barbacci. Survivability in the age of vulnerable systems. IEEE Computer, 29(11):8, Nov 1996.
[BS97]
W. Bevier and M. Smith. Position paper. In Proceedings of the 1997 Information Survivability Workshop, Feb 1997.
[CP97]
C. Cowan and C. Pu. Immunix: Survivability through specialization. In Proceedings of the 1997 Information Survivability Workshop, Feb 1997.
[EFL+97]
R. Ellison, D. Fisher, R. Linger, H. Lipson, T. Longstaff, and N. Mead. Survivable Network Systems: An Emerging Discipline. Technical Report CMU/SEI-97-TR-013, Software Engineering Institute, Carnegie Mellon University, Nov 1997.
[HP91]
N. Hutchinson and L. Peterson. The x-kernel: An architecture for implementing network protocols. IEEE Transactions on Software Engineering, 17(1):64-76, Jan 1991.
[HS96]
M. Hiltunen and R. Schlichting. Adaptive distributed and fault-tolerant systems. Computer Systems Science and Engineering, 11(5):125-133, Sep 1996.
[Jou97]
Y.F. Jou. Scalable intrusion detection for the emerging network infrastructure. In Proceedings of the 1997 Information Survivability Workshop, Feb 1997.
[KMT97]
P. Krupp, J. Maurer, and B. Thuraisingham. Survivability issues for evolvable real-time command and control systems. In Proceedings of the 1997 Information Survivability Workshop, Feb 1997.
[OM97]
D. Oppenheimer and M. Martonosi. Performance signatures: A mechanism for intrusion detection. In Proceedings of the 1997 Information Survivability Workshop, Feb 1997.
[SH97]
R. Schlichting and M. Hiltunen. The Cactus Project. http://www.cs.arizona.edu/cactus, 1997.
[VMG97]
J. Voas, G. McGraw, and A. Ghosh. Reducing uncertainty about survivability. In Proceedings of the 1997 Information Survivability Workshop, Feb 1997.