CSc 352: Some Hints on Performance Tuning
This document gives some hints on improving program performance.
A great deal of work has been done in this
area (many people, from astronomers to game developers, care deeply
about performance), and we can barely scratch the surface of the topic
here. These should be seen as high-level hints about the sorts of things
to think about when approaching the problem of performance tuning, rather
than as an exhaustive list of techniques.
1. Compiler Flags
The particular compiler being used, and the level of optimization chosen,
affect the quality and speed of the code generated by the compiler.
One cheap way to improve performance, therefore, is to turn on compiler
optimizations.
For C compilers, this typically involves using the -O flag; there
may be a modifier (e.g., -O2 or -O3) indicating the amount of
optimization effort invested (and the kinds of optimizations performed) --
the higher the number, the greater the optimization effort.
There may be additional flags that turn on or off specific optimizations,
or specify a particular processor to optimize for (different implementations
of an architecture may sometimes have slightly different instructions, and
specifying a particular processor can allow a compiler to generate
specialized code for that processor). For example, on a SPARC machine with
UltraSparc processors (such as the host lectura) we might use
gcc -O2 -mcpu=ultrasparc ...
It's important to keep in mind that the optimization effort invested
by a compiler doesn't necessarily correlate directly with the
execution speed of the resulting executable.
Interactions between different compiler optimizations are often complex and
can have hard-to-predict effects on things like the way hardware registers
are used, or the impact on cache utilization. As a result, there may
occasionally be situations where one can get better performance for a
particular program by optimizing at a lower level, or by turning off some
specific optimization. Thus, while as a general heuristic it makes sense to
compile at the highest optimization level possible, for a particular program
one may have to experiment with different optimization levels, or with
turning specific optimizations on or off, to get the best performance.
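For example, a simple (if crude) way to see which optimization level works
best for a particular program is to build the program at each level and time
the resulting executables (here foo.c stands in for the program being tuned):
gcc -O1 -o foo1 foo.c
gcc -O2 -o foo2 foo.c
gcc -O3 -o foo3 foo.c
time ./foo1
time ./foo2
time ./foo3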
2. High-Level Transformations
Usually, the biggest improvements in speed come from algorithmic
improvements that change the nature of the computation in significant ways.
Such improvements are, in almost all cases, far beyond the capabilities of
typical compilers; instead, they have to be implemented by programmers with
a good understanding of what the program is supposed to do (its semantics)
and how a particular piece of code is doing it (its pragmatics, in
particular its execution profile, identifying hot spots and performance
bottlenecks).
Here are some examples of the sorts of transformations that can improve
execution speed. (This is not meant to be an exhaustive list. Not all of these
transformations may be meaningful or useful for a particular program.)
-
Caching.
This refers to computing some data once, ahead of time, and then looking
it up instead of recomputing it repeatedly.
The speed improvement comes from avoiding recomputation; the tradeoff is
the additional space needed to store the precomputed values. (A sketch in C
appears after this list.)
-
Buffering.
Instead of invoking some action separately on a large number of small
objects, we may choose to ``buffer'' them into larger collections, and
invoke the action on the larger data that result.
The speed improvement comes from reducing the number of times the action is
invoked; the tradeoff is that more space is needed for the buffered
objects. (A sketch appears after this list.)
-
Filtering.
When examining large quantities of data to select those items that satisfy
some specific property, it may sometimes be possible to devise a
``filter'' satisfying the following properties:
- It is computationally [relatively] inexpensive.
- The filter never discards any item that has the property of interest;
but it may pass some items that don't have the property we're looking for.
In other words, the filter allows us to avoid examining some of the data,
while guaranteeing (the second property above) that none of the data that
satisfies the property of interest is discarded. The speed savings then
come from being able to avoid examining irrelevant data.
Examples of filtering include hashing, as well as the bit-mask technique
discussed in class for the anagrams problem. Another example is the
use of binary search, which uses a simple test to localize the search and
thereby avoid searching through irrelevant data. (A sketch of the bit-mask
filter appears after this list.)
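As a sketch of caching: the code below (the function and table names are
ours, for illustration) computes the number of 1-bits in every byte value
once, ahead of time; thereafter each query is a table lookup rather than a
recomputation.
#include <stdio.h>

static int bit_count_table[256];   /* the cached (precomputed) values */

/* count the 1-bits in n directly */
static int bit_count(int n)
{
    int count = 0;
    while (n != 0) {
        count += n & 1;
        n >>= 1;
    }
    return count;
}

/* compute the answer for every byte value once, ahead of time */
static void init_bit_counts(void)
{
    int i;
    for (i = 0; i < 256; i++)
        bit_count_table[i] = bit_count(i);
}

int main(void)
{
    init_bit_counts();
    /* each query is now a lookup, not a recomputation */
    printf("1-bits in 202: %d\n", bit_count_table[202]);
    return 0;
}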
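As a sketch of buffering, assuming output is being written with the write()
system call (the function names are ours; this is essentially what the stdio
library does internally): rather than issuing one system call per character,
we collect characters into a buffer and write the whole buffer out when it
fills.
#include <unistd.h>

#define BUF_SIZE 4096

static char buf[BUF_SIZE];
static int buf_len = 0;

/* write out whatever has accumulated: one system call for many characters */
static void flush_buf(void)
{
    if (buf_len > 0) {
        write(1, buf, buf_len);
        buf_len = 0;
    }
}

/* add one character to the buffer, flushing first if the buffer is full */
static void buffered_putchar(char c)
{
    if (buf_len == BUF_SIZE)
        flush_buf();
    buf[buf_len++] = c;
}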
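As a sketch of the bit-mask filter mentioned above (the function names are
ours): each word is summarized by a 26-bit mask recording which letters occur
in it. Anagrams always have identical masks, so the cheap mask comparison
never discards a genuine anagram pair; words whose letters occur different
numbers of times can also have identical masks, however, so pairs that pass
the filter must still be checked by a more expensive test.
#include <ctype.h>

/* a mask with bit i set iff the i'th letter of the alphabet occurs in s */
static unsigned letter_mask(const char *s)
{
    unsigned mask = 0;
    for ( ; *s != '\0'; s++)
        if (isalpha((unsigned char)*s))
            mask |= 1u << (tolower((unsigned char)*s) - 'a');
    return mask;
}

/* filter: if the masks differ, s and t cannot be anagrams; if the masks
   agree, they may still not be anagrams (the mask ignores how many times
   each letter occurs), so a full comparison is still needed */
static int maybe_anagrams(const char *s, const char *t)
{
    return letter_mask(s) == letter_mask(t);
}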
3. Low-Level Transformations
This refers to code transformations designed to improve the behavior of
the program on the actual hardware. Unlike high-level transformations,
which change the overall nature or structure of a computation, these do not
fundamentally alter the computation; instead, they remove overheads here
and there in the code.
Compiler optimizations, referred to above, are usually of this variety.
However, there are often situations where a compiler is unable to carry out
a particular optimization, even though it is applicable in principle, simply
because the compiler cannot deduce enough about the behavior of the program.
In such situations, a programmer may be able
to improve performance by transforming the code manually. (How can we tell
whether a compiler is carrying out a particular optimization or not?
One simple approach is to simply examine the assembly code emitted by
the compiler: for this we use the -S compiler option, e.g., invoking
gcc -O2 -S foo.c puts the generated assembly code in the file
foo.s.)
An important point to keep in mind when carrying out low-level optimizations
by hand is that something that speeds up a piece of code on a particular
processor/compiler may, depending on exactly what the programmer is doing,
adversely affect performance on a different processor/compiler.
Example low-level optimizations include the following (again, not an
exhaustive list):
-
Function Inlining.
This refers to replacing a function call by (an appropriate instance of)
the body of the called function.
The speed improvement comes from not having to make a function call.
The tradeoff is an increase in code size; too much inlining can be bad, in
that the code size increase can become large enough to adversely affect
performance. (A sketch appears at the end of this list.)
-
Eliminating Unnecessary Computation.
It stands to reason that getting rid of unnecessary computation should
improve performance. Examples of this technique include:
-
Invariant Code Motion out of Loops.
If something is being computed inside a loop, and the value computed is the
same regardless of how many times the loop iterates, then it may be better
to compute it once outside the loop, save the value, and use the saved value
inside the loop (a concrete example appears at the end of this list). E.g.:
transform
while (...) {
...
... some computation ... /* same value computed on all iterations */
...
}
to
Saved_Val = ... some computation ...
while (...) {
...
... use Saved_Val ...
...
}
-
Common Subexpression Elimination.
Invariant Code Motion is applicable only if the value of the expression is
the same for all iterations of the loop. It may happen that this is not the
case, but the same value is nevertheless computed more than once. In that
case we can save the value the first time it is computed, and use the saved
value subsequently (a concrete example appears at the end of this list).
E.g.:
while (...) {
...
... computation A ...
... computation B ...
... computation A ...
}
to
while (...) {
...
Saved_Val = computation A
... use Saved_Val ...
... computation B ...
... use Saved_Val ...
...
}
-
Special-Casing Common Values.
There are some situations where certain values (of expressions, arguments,
etc.) are very common. While a compiler has no way of knowing this,
a programmer who is aware of it may generate special-purpose code that
is tuned for the common case. Thus, we may take code of the form
... some general-purpose computation ...
and generate code that looks like
if (value == CommonValue)
... optimized special-case code ...
else
... some general-purpose computation ...
Notice that this requires some care: we've introduced an extra runtime test,
which isn't free, so the common-case value had better be common enough,
and/or the speed difference between the special-cased code and the
general-purpose code large enough, to offset the cost of this test (and,
possibly, any effects due to the resulting increase in code size). (A sketch
appears following this list.)
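As a sketch of function inlining (the functions are ours, for illustration),
the second version below replaces the call to min by an instance of its body:
/* the smaller of two integers */
static int min(int a, int b)
{
    return a < b ? a : b;
}

/* original: one function call per array element */
void vec_min(int c[], int a[], int b[], int n)
{
    int i;
    for (i = 0; i < n; i++)
        c[i] = min(a[i], b[i]);
}

/* hand-inlined: the body of min replaces the call, so the per-element
   call overhead is gone (at the cost of somewhat larger code) */
void vec_min_inlined(int c[], int a[], int b[], int n)
{
    int i;
    for (i = 0; i < n; i++)
        c[i] = a[i] < b[i] ? a[i] : b[i];
}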
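As a concrete instance of invariant code motion (the function names are
ours): in the first version below, strlen(s) is recomputed on every
iteration even though its value never changes; the second version computes
it once, outside the loop.
#include <string.h>

/* before: strlen(s) is recomputed on every iteration */
int count_blanks(const char *s)
{
    size_t i;
    int count = 0;
    for (i = 0; i < strlen(s); i++)
        if (s[i] == ' ')
            count++;
    return count;
}

/* after: the invariant computation is done once and its value saved */
int count_blanks_v2(const char *s)
{
    size_t i, len = strlen(s);
    int count = 0;
    for (i = 0; i < len; i++)
        if (s[i] == ' ')
            count++;
    return count;
}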
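As a concrete instance of common subexpression elimination, assuming a
two-dimensional array stored row by row in a one-dimensional block (the
names are ours):
/* before: the subexpression i*n + j is computed three times */
void accumulate(double a[], const double b[], int n, int i, int j)
{
    a[i*n + j] = a[i*n + j] + b[i*n + j];
}

/* after: the common subexpression is computed once and the saved value
   used subsequently */
void accumulate_v2(double a[], const double b[], int n, int i, int j)
{
    int k = i*n + j;
    a[k] = a[k] + b[k];
}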
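As a sketch of special-casing, assuming the programmer knows that the
exponent 2 accounts for the overwhelming majority of calls (the function
names are ours):
#include <math.h>

/* general-purpose: raise x to the power n */
double power(double x, double n)
{
    return pow(x, n);
}

/* special-cased: a multiplication is much cheaper than the general
   library routine; note that the extra test is paid on every call,
   so the special case had better be common enough to offset it */
double power_v2(double x, double n)
{
    if (n == 2.0)
        return x * x;      /* tuned common case */
    else
        return pow(x, n);  /* general-purpose fallback */
}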
4. Space-Time Tradeoffs
The transformations discussed above, and many others like them, very often
involve trading space for time, i.e., reducing the execution time of the
program by increasing the amount of memory it uses. This is a tradeoff
that should be given careful consideration: because of the architecture of
modern processors, in particular their multiple levels of cache memory,
excessive memory usage can itself degrade performance.