CSc 352: Some Hints on Performance Tuning
This document gives some hints on improving program performance.
A great deal of work has been done in this
area (many people, from astronomers to game developers, care deeply
about performance), and we can barely scratch the surface of the topic
here. These should be seen as high-level hints about the sorts of things
to think about when approaching the problem of performance tuning, rather
than as an exhaustive list of techniques.
1. Compiler Flags
The particular compiler being used, and the level of optimization chosen,
affect the quality and speed of the code generated by the compiler.
One cheap way to improve performance, therefore, is to turn on compiler
optimizations.
For C compilers, this typically involves using the -O flag; there
may be a modifier (e.g., -O2 or -O3) indicating the amount of
optimization effort invested (and the kinds of optimizations performed) --
the higher the number, the greater the optimization effort.
There may be additional flags that turn on or off specific optimizations,
or specify a particular processor to optimize for (different implementations
of an architecture may sometimes have slightly different instructions, and
specifying a particular processor can allow a compiler to generate
specialized code for that processor). For example, on a SPARC machine with
UltraSparc processors (such as the host lectura) we might use
gcc -O2 -mcpu=ultrasparc ...
It's important to keep in mind that the optimization effort invested
by a compiler doesn't necessarily correlate directly with the
execution speed of the resulting executable.
Interactions between different compiler optimizations are often complex and
can have hard-to-predict effects on things like the way hardware registers
are used, or the impact on cache utilization. As a result, there may
occasionally be situations where one can get better performance for a
particular program by optimizing at a lower level, or by turning off some
specific optimization. Thus, while as a general heuristic it makes sense to
compile at the highest optimization level possible, for a particular program
one may have to experiment with different optimization levels, or with
turning specific optimizations on or off, to get the best performance.
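For example, a simple (if crude) way to see which optimization level works
best for a particular program is to build the program at each level and time
the resulting executables (here foo.c stands in for the program being tuned):
gcc -O1 -o foo1 foo.c
gcc -O2 -o foo2 foo.c
gcc -O3 -o foo3 foo.c
time ./foo1
time ./foo2
time ./foo3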
2. High-Level Transformations
Usually, the biggest improvements in speed come from algorithmic
improvements that change the nature of the computation in significant ways.
Such improvements are, in almost all cases, far beyond the capabilities of
typical compilers; instead, they have to be implemented by programmers with
a good understanding of what the program is supposed to do (its semantics)
and how a particular piece of code is doing it (its pragmatics, in
particular its execution profile, identifying hot spots and performance
bottlenecks).
Here are some examples of the sorts of transformations that can improve
execution speed. (This is not meant to be an exhaustive list. Not all of these
transformations may be meaningful or useful for a particular program.)
-
Caching.
This refers to computing some data once, ahead of time, and then looking
it up instead of recomputing it repeatedly.
The speed improvement comes from avoiding recomputation; the tradeoff is
the additional space needed to store the precomputed values. (A sketch in C
appears after this list.)
-
Buffering.
Instead of invoking some action separately on a large number of small
objects, we may choose to ``buffer'' them into larger collections, and
invoke the action on the larger data that result.
The speed improvement comes from reducing the number of times the action is
invoked; the tradeoff is that more space is needed for the buffered
objects. (A sketch appears after this list.)
-
Filtering.
When examining large quantities of data to select those items that satisfy
some specific property, it may sometimes be possible to devise a
``filter'' satisfying the following properties:
- It is computationally [relatively] inexpensive.
- The filter never discards any item that has the property of interest;
but it may pass some items that don't have the property we're looking for.
In other words, the filter allows us to avoid examining some of the data,
while guaranteeing (the second property above) that none of the data that
satisfies the property of interest is discarded. The speed savings then
come from being able to avoid examining irrelevant data.
Examples of filtering include hashing, as well as the bit-mask technique
discussed in class for the anagrams problem. Another example is the
use of binary search, which uses a simple test to localize the search and
thereby avoid searching through irrelevant data. (A sketch of the bit-mask
filter appears after this list.)
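As a sketch of caching: the code below (the function and table names are
ours, for illustration) computes the number of 1-bits in every byte value
once, ahead of time; thereafter each query is a table lookup rather than a
recomputation.
#include <stdio.h>

static int bit_count_table[256];   /* the cached (precomputed) values */

/* count the 1-bits in n directly */
static int bit_count(int n)
{
    int count = 0;
    while (n != 0) {
        count += n & 1;
        n >>= 1;
    }
    return count;
}

/* compute the answer for every byte value once, ahead of time */
static void init_bit_counts(void)
{
    int i;
    for (i = 0; i < 256; i++)
        bit_count_table[i] = bit_count(i);
}

int main(void)
{
    init_bit_counts();
    /* each query is now a lookup, not a recomputation */
    printf("1-bits in 202: %d\n", bit_count_table[202]);
    return 0;
}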
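As a sketch of buffering, assuming output is being written with the write()
system call (the function names are ours; this is essentially what the stdio
library does internally): rather than issuing one system call per character,
we collect characters into a buffer and write the whole buffer out when it
fills.
#include <unistd.h>

#define BUF_SIZE 4096

static char buf[BUF_SIZE];
static int buf_len = 0;

/* write out whatever has accumulated: one system call for many characters */
static void flush_buf(void)
{
    if (buf_len > 0) {
        write(1, buf, buf_len);
        buf_len = 0;
    }
}

/* add one character to the buffer, flushing first if the buffer is full */
static void buffered_putchar(char c)
{
    if (buf_len == BUF_SIZE)
        flush_buf();
    buf[buf_len++] = c;
}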
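As a sketch of the bit-mask filter mentioned above (the function names are
ours): each word is summarized by a 26-bit mask recording which letters occur
in it. Anagrams always have identical masks, so the cheap mask comparison
never discards a genuine anagram pair; words whose letters occur different
numbers of times can also have identical masks, however, so pairs that pass
the filter must still be checked by a more expensive test.
#include <ctype.h>

/* a mask with bit i set iff the i'th letter of the alphabet occurs in s */
static unsigned letter_mask(const char *s)
{
    unsigned mask = 0;
    for ( ; *s != '\0'; s++)
        if (isalpha((unsigned char)*s))
            mask |= 1u << (tolower((unsigned char)*s) - 'a');
    return mask;
}

/* filter: if the masks differ, s and t cannot be anagrams; if the masks
   agree, they may still not be anagrams (the mask ignores how many times
   each letter occurs), so a full comparison is still needed */
static int maybe_anagrams(const char *s, const char *t)
{
    return letter_mask(s) == letter_mask(t);
}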
3. Low-Level Transformations
This refers to code transformations designed to improve the behavior of
the program on the actual hardware. Unlike high-level transformations,
which change the overall nature or structure of a computation, these do not
fundamentally alter the computation; instead, they remove overheads here
and there in the code.
Compiler optimizations, referred to above, are usually of this variety.
However, there are often situations where a compiler is unable to carry out
a particular optimization, even though it is applicable in principle, simply
because the compiler cannot deduce enough about the behavior of the program.
In such situations, a programmer may be able
to improve performance by transforming the code manually. (How can we tell
whether a compiler is carrying out a particular optimization or not?
One simple approach is to simply examine the assembly code emitted by
the compiler: for this we use the -S compiler option, e.g., invoking
gcc -O2 -S foo.c puts the generated assembly code in the file
foo.s.)
An important point to keep in mind when carrying out low-level optimizations
by hand is that something that speeds up a piece of code on a particular
processor/compiler may, depending on exactly what the programmer is doing,
adversely affect performance on a different processor/compiler.
Example low-level optimizations include the following (again, not an
exhaustive list):
-
Function Inlining.
This refers to replacing a function call by (an appropriate instance of)
the body of the called function.
The speed improvement comes from not having to make a function call.
The tradeoff is an increase in code size; too much inlining can be bad, in
that the code size increase can become large enough to adversely affect
performance. (A sketch appears at the end of this list.)
-
Eliminating Unnecessary Computation.
It stands to reason that getting rid of unnecessary computation should
improve performance. Examples of this technique include:
-
Invariant Code Motion out of Loops.
If something is being computed inside a loop, and the value computed is the
same regardless of how many times the loop iterates, then it may be better
to compute it once outside the loop, save the value, and use the saved value
inside the loop (a concrete example appears at the end of this list). E.g.:
transform
while (...) {
...
... some computation ... /* same value computed on all iterations */
...
}
to
Saved_Val = ... some computation ...
while (...) {
...
... use Saved_Val ...
...
}
-
Common Subexpression Elimination.
Invariant Code Motion is applicable only if the value of the expression is
the same for all iterations of the loop. It may happen that this is not the
case, but the same value is nevertheless computed more than once. In that
case we can save the value the first time it is computed, and use the saved
value subsequently (a concrete example appears at the end of this list).
E.g.:
while (...) {
...
... computation A ...
... computation B ...
... computation A ...
}
to
while (...) {
...
Saved_Val = computation A
... use Saved_Val ...
... computation B ...
... use Saved_Val ...
...
}
-
Special-Casing Common Values.
There are some situations where certain values (of expressions, arguments,
etc.) are very common. While a compiler has no way of knowing this,
a programmer who is aware of it may generate special-purpose code that
is tuned for the common case. Thus, we may take code of the form
... some general-purpose computation ...
and generate code that looks like
if (value == CommonValue)
... optimized special-case code ...
else
... some general-purpose computation ...
Notice that this requires some care: we've introduced an extra runtime test,
which isn't free, so the common-case value had better be common enough,
and/or the speed difference between the special-cased code and the
general-purpose code large enough, to offset the cost of this test (and,
possibly, any effects due to the resulting increase in code size). (A sketch
appears following this list.)
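As a sketch of function inlining (the functions are ours, for illustration),
the second version below replaces the call to min by an instance of its body:
/* the smaller of two integers */
static int min(int a, int b)
{
    return a < b ? a : b;
}

/* original: one function call per array element */
void vec_min(int c[], int a[], int b[], int n)
{
    int i;
    for (i = 0; i < n; i++)
        c[i] = min(a[i], b[i]);
}

/* hand-inlined: the body of min replaces the call, so the per-element
   call overhead is gone (at the cost of somewhat larger code) */
void vec_min_inlined(int c[], int a[], int b[], int n)
{
    int i;
    for (i = 0; i < n; i++)
        c[i] = a[i] < b[i] ? a[i] : b[i];
}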
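As a concrete instance of invariant code motion (the function names are
ours): in the first version below, strlen(s) is recomputed on every
iteration even though its value never changes; the second version computes
it once, outside the loop.
#include <string.h>

/* before: strlen(s) is recomputed on every iteration */
int count_blanks(const char *s)
{
    size_t i;
    int count = 0;
    for (i = 0; i < strlen(s); i++)
        if (s[i] == ' ')
            count++;
    return count;
}

/* after: the invariant computation is done once and its value saved */
int count_blanks_v2(const char *s)
{
    size_t i, len = strlen(s);
    int count = 0;
    for (i = 0; i < len; i++)
        if (s[i] == ' ')
            count++;
    return count;
}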
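As a concrete instance of common subexpression elimination, assuming a
two-dimensional array stored row by row in a one-dimensional block (the
names are ours):
/* before: the subexpression i*n + j is computed three times */
void accumulate(double a[], const double b[], int n, int i, int j)
{
    a[i*n + j] = a[i*n + j] + b[i*n + j];
}

/* after: the common subexpression is computed once and the saved value
   used subsequently */
void accumulate_v2(double a[], const double b[], int n, int i, int j)
{
    int k = i*n + j;
    a[k] = a[k] + b[k];
}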
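As a sketch of special-casing, assuming the programmer knows that the
exponent 2 accounts for the overwhelming majority of calls (the function
names are ours):
#include <math.h>

/* general-purpose: raise x to the power n */
double power(double x, double n)
{
    return pow(x, n);
}

/* special-cased: a multiplication is much cheaper than the general
   library routine; note that the extra test is paid on every call,
   so the special case had better be common enough to offset it */
double power_v2(double x, double n)
{
    if (n == 2.0)
        return x * x;      /* tuned common case */
    else
        return pow(x, n);  /* general-purpose fallback */
}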
4. Space-Time Tradeoffs
The transformations discussed above, and many others like them, very often
involve trading space for time, i.e., reducing the execution time of the
program by increasing the amount of memory it uses. This is a tradeoff
that should be given careful consideration: because of the architecture of
modern processors, in particular their multiple levels of cache memory,
excessive memory usage can itself degrade performance.