By Douglas Low
The Java language was designed to be compiled into a platform-independent bytecode format. Much of the information contained in the source code remains in the bytecode, so decompilation is easy. We examine how code obfuscation can help protect Java bytecodes.
Traditionally it has been difficult to reverse engineer [13] applications because they are large, monolithic and distributed as "stripped" object code. Stripping object code of its symbol table removes information like variable names and obscures references to library routines. For example, a call to the C language library routine printf in the source code might appear in the stripped object code as a procedure call to the memory address 35720.
Since the advent of Java [7], the threat of reverse engineering has been taken seriously. The language was designed to be compiled into a platform-independent bytecode format. Because the bytecodes are portable, there is little control over their distribution. Also, much of the information contained in the source code remains in the bytecode, facilitating decompilation [17, 19]. The threat of reverse engineering is thus intensified.
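A quick way to see how much source-level information survives compilation is to ask a loaded class for its own method names. The sketch below is only an illustration: the class `PayrollCalculator` and its method are hypothetical names invented here, and reflection is used simply because it reads the same names that a decompiler would find in the class file.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical example: all class and method names are invented for illustration.
class RetainedNamesDemo {

    // A class whose names reveal what the code is for.
    static class PayrollCalculator {
        double computeOvertimePay(double hourlyRate, double overtimeHours) {
            return hourlyRate * 1.5 * overtimeHours;
        }
    }

    // Returns the sorted names of the methods declared by the given class,
    // skipping any compiler-generated (synthetic) methods.
    static List<String> declaredMethodNames(Class<?> type) {
        List<String> names = new ArrayList<>();
        for (Method method : type.getDeclaredMethods()) {
            if (!method.isSynthetic()) {
                names.add(method.getName());
            }
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) {
        // The method name chosen by the programmer is still present at run time.
        System.out.println(PayrollCalculator.class.getSimpleName()
                + " declares " + declaredMethodNames(PayrollCalculator.class));
    }
}
```

The names printed here come from the compiled class, not from the source file, which is why an unobfuscated class file tells a reader so much.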
One possible way to prevent reverse engineering of source code is to not allow physical access to the program. Instead, users communicate with the program via an interface with a limited number of services. This is the client-server model [14]. Unfortunately, this imposes performance penalties due to limitations on network bandwidth and latency. A partial solution is to keep the parts of the program that need to be hidden on the server and have the user's machine run the rest locally.
Encryption of code is another possibility. However, unless the entire encryption/decryption process takes place in hardware, it is possible for the user to intercept and decrypt the code [8, 18]. Unfortunately, specialised hardware tends to limit the portability of programs.
Transmitting programs in a form less vulnerable to decompilation might seem like a good idea. Native object codes can be supplied instead of Java bytecodes. The task of decompilation is made more difficult, although not impossible [4]. However, native object codes are not subject to bytecode verification, which gives Java a measure of protection against malicious programs such as viruses [3]. Digital signatures [15], which verify that the native code is actually from a trusted source and has not been tampered with, can help to alleviate this problem. The downside is that different versions of the program are required for different architectures, which increases the software maintenance effort. With Java, only one version is required, since it is executed by a virtual machine.
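The signature check mentioned above can be sketched with the standard `java.security` classes. This is a minimal illustration, not a description of any particular distribution scheme: the class name `SignatureDemo`, its helper methods, the choice of RSA with SHA-256 and the byte array standing in for native code are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;

// Hypothetical example: names and algorithm choices are assumptions.
class SignatureDemo {

    // Generates a fresh RSA key pair for the demonstration.
    static KeyPair newKeyPair() {
        try {
            KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
            generator.initialize(2048);
            return generator.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // The publisher signs the code bytes with the private key.
    static byte[] sign(byte[] code, PrivateKey privateKey) {
        try {
            Signature signer = Signature.getInstance("SHA256withRSA");
            signer.initSign(privateKey);
            signer.update(code);
            return signer.sign();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // The recipient checks the code bytes against the signature and public key.
    static boolean verify(byte[] code, byte[] signature, PublicKey publicKey) {
        try {
            Signature verifier = Signature.getInstance("SHA256withRSA");
            verifier.initVerify(publicKey);
            verifier.update(code);
            return verifier.verify(signature);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        KeyPair pair = newKeyPair();
        byte[] code = "stand-in for native code".getBytes(StandardCharsets.UTF_8);
        byte[] signature = sign(code, pair.getPrivate());

        byte[] tampered = code.clone();
        tampered[0] ^= 1; // flip one bit to simulate tampering

        System.out.println(verify(code, signature, pair.getPublic()));     // prints true
        System.out.println(verify(tampered, signature, pair.getPublic())); // prints false
    }
}
```

A signature of this kind says who published the code and that it has not changed since; it does not check what the code does, which is the job the bytecode verifier performs for Java.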
If we cannot make reverse engineering impossible, we can at least make the task costly in terms of time and effort. Code obfuscations transform a program so that it is more difficult to understand, yet is functionally identical to the original [5, 6]. The program must still produce the same results, although it may execute more slowly or have additional side effects because of the added code. There is a trade-off between the security provided by code obfuscation and the execution time-space penalty imposed on the transformed program.
The current work on Java obfuscation has been in the form of freeware, shareware and commercial programs, rather than academic publications. Some of the code obfuscations discussed below are based on traditional compiler optimisations. Examples include array and loop reordering, and procedure inlining [2].
We classify code obfuscations according to what kind of information they target and how they affect their target.
Some obfuscations alter information that is unnecessary to the execution of the program, such as identifier names and comments. There are many utilities (such as [10, 16]) which will change the identifiers in a program to less meaningful ones. Identifier scrambling is a common obfuscation that has been applied to other languages. The C Shroud system [9], a source code obfuscator available for the C language, is an example of such a tool.
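The effect of identifier scrambling can be sketched as follows. The example is hypothetical: the `Account` class and its scrambled counterpart `A` are invented here and are not the output of any of the tools cited above.

```java
// Hypothetical example: class and method names are invented for illustration.
class IdentifierScramblingDemo {

    // Before: the names document what the code does.
    static class Account {
        private long balanceInCents;

        void deposit(long amountInCents) {
            balanceInCents += amountInCents;
        }

        long getBalanceInCents() {
            return balanceInCents;
        }
    }

    // After: the same logic with every identifier replaced by a meaningless one.
    // Java allows a field and a method to share a name, which adds to the confusion.
    static class A {
        private long a;

        void a(long b) {
            a += b;
        }

        long b() {
            return a;
        }
    }

    public static void main(String[] args) {
        Account account = new Account();
        account.deposit(250);

        A x = new A();
        x.a(250);

        // Both versions compute the same result.
        System.out.println(account.getBalanceInCents() == x.b()); // prints true
    }
}
```

Nothing about the program's behaviour changes; only the hints that a human reader relies on are removed.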
Data obfuscations affect the data structures used by a program.
Data storage obfuscations affect how data is stored in memory. For example, a local variable can be converted into a global one. Data encoding obfuscations affect how the stored data is interpreted. For example, an integer variable i can be replaced by 8 * i + 3, so that the original value is recovered as (i - 3) / 8. We can see the effect below:

Before:

    int i = 1;
    while (i < 1000) {
        ... A[i] ...;
        i++;
    }

After:

    int i = 11;
    while (i < 8003) {
        ... A[(i - 3) / 8] ...;
        i += 8;
    }
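The encoding of i as 8 * i + 3 can be checked with a small runnable sketch. The class and method names are hypothetical, and summing the array elements is just a stand-in for whatever the elided loop body does with A[i].

```java
// Hypothetical example: names are invented; the loop body is a stand-in.
class DataEncodingDemo {

    // Sums a[1] .. a[999] with the loop counter stored directly.
    static long sumPlain(int[] a) {
        long sum = 0;
        int i = 1;
        while (i < 1000) {
            sum += a[i];
            i++;
        }
        return sum;
    }

    // The same loop with the counter stored in encoded form, 8 * i + 3.
    static long sumEncoded(int[] a) {
        long sum = 0;
        int i = 11;                 // 8 * 1 + 3
        while (i < 8003) {          // 8 * 1000 + 3
            sum += a[(i - 3) / 8];  // decode before use
            i += 8;                 // encoded form of i++
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] a = new int[1000];
        for (int k = 0; k < a.length; k++) {
            a[k] = k * k;
        }
        System.out.println(sumPlain(a) == sumEncoded(a)); // prints true
    }
}
```

Every constant and every update to the counter has to be rewritten consistently, which is what makes the transformed loop harder to read and slightly more expensive to run.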
The idea behind control obfuscations is to disguise the real control flow in a program.
Control aggregation obfuscations change the way in which program statements are grouped together. For example, procedures can be inlined: a procedure call is replaced with the statements from the called procedure itself.
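Inlining can be sketched on a small scale as below. The names are hypothetical, and in practice an obfuscator would apply the transformation to bytecode rather than to source.

```java
// Hypothetical example: names are invented for illustration.
class InliningDemo {

    static int square(int x) {
        return x * x;
    }

    // Before: the calls to square() reveal the structure of the computation.
    static int sumOfSquares(int a, int b) {
        return square(a) + square(b);
    }

    // After: the body of square() replaces each call, so the abstraction
    // chosen by the programmer no longer appears in the code.
    static int sumOfSquaresInlined(int a, int b) {
        return a * a + b * b;
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(3, 4) == sumOfSquaresInlined(3, 4)); // prints true
    }
}
```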
Control ordering obfuscations alter the order in which statements are executed. For example, loops can be made to iterate backwards instead of forwards.
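A reversed loop might look like the sketch below. The names are hypothetical, and the transformation is only safe when the iterations do not depend on one another, as is the case for the integer sum used here.

```java
// Hypothetical example: names are invented for illustration.
class LoopReversalDemo {

    // Before: the loop visits the elements from first to last.
    static int sumForward(int[] values) {
        int sum = 0;
        for (int i = 0; i < values.length; i++) {
            sum += values[i];
        }
        return sum;
    }

    // After: the loop visits the elements from last to first.
    // Integer addition is order-independent, so the result is unchanged.
    static int sumBackward(int[] values) {
        int sum = 0;
        for (int i = values.length - 1; i >= 0; i--) {
            sum += values[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] values = {5, -2, 9, 14};
        System.out.println(sumForward(values) == sumBackward(values)); // prints true
    }
}
```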
Control computation obfuscations affect the control flow in a program. These can be divided up further. One example is extending a loop condition with an additional test that never changes its outcome:

Before:

    int i = 1;
    while (i < 1000) {
        ...
        i++;
    }

After:

    int i = 1;
    while ((i < 1000) && (i % 1000 != 0)) {
        ...
        i++;
    }
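A redundant loop test of this kind can be sketched as a runnable comparison. The names are hypothetical. The extra test used here, that the product of two consecutive integers is even, is always true, so it must be combined with the original condition in a way that never changes when the loop exits. Tests whose outcome is fixed but not obvious are often called opaque predicates in the obfuscation literature.

```java
// Hypothetical example: names are invented for illustration.
class RedundantConditionDemo {

    // Before: counts the iterations of the plain loop.
    static int iterationsPlain() {
        int count = 0;
        int i = 1;
        while (i < 1000) {
            count++;
            i++;
        }
        return count;
    }

    // After: the condition carries an extra test that is always true,
    // because i * (i + 1) is the product of two consecutive integers.
    static int iterationsObfuscated() {
        int count = 0;
        int i = 1;
        while ((i < 1000) && ((i * (i + 1)) % 2 == 0)) {
            count++;
            i++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(iterationsPlain() == iterationsObfuscated()); // prints true
    }
}
```

The extra test costs a multiplication and a remainder on every iteration, which is a small instance of the time penalty that obfuscation imposes.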
Some obfuscations attempt to stop decompilers from operating by exploiting their weaknesses. HoseMocha [11] is a utility that appends extra instructions after a return instruction. The execution of the program is unaffected, but the obfuscation causes the Java decompiler Mocha [17] to crash.
The task of making reverse engineering difficult is not easy. Client-server models of protection, while providing the best security, suffer from limitations on the network. Encryption requires the use of specialised hardware, which limits the portability of programs. Using native object codes makes reverse engineering harder but increases the software support effort. Also, digital signatures are required to prevent tampering. Code obfuscation, while not providing absolute security, is portable, does not require specialised hardware and is transparent to the Java bytecode verifier. However, it does impose an execution time-space penalty on the program being protected.
Code obfuscation is a fruitful area for further research. There are many issues and implications remaining that need to be resolved, both theoretical and practical [5].
My Master's thesis research into Java obfuscation has been performed jointly with my supervisors, Dr. Christian Collberg and Professor Clark Thomborson.
Reverse engineering
The process by which information about a program is obtained. This includes obtaining source code from compiled programs.
Native object codes
Binary files designed to execute directly on a specific computer platform.
Douglas Low is currently a graduate student at the University of Auckland, New Zealand, working on the application of code obfuscation to Java. His research interests include compiler implementation, computational combinatorics and constraint satisfaction (AI). Electronic versions of papers that he has co-authored can be found at http://www.cs.auckland.ac.nz/~collberg/Research/Students/DouglasLow/index.html.