

# Introduction

correst seguerr.com

Copyright © 2011 Christian Collberg



### Code Generation Issues I

- The purpose of the code generation phase of the compiler is to transform the intermediate code produced by the front end into some other code that can be executed.
- Often the the code generator will produce assembly code or object code which (after assembly and linking) can be directly executed by the hardware.
- Alternatively, the code generator can generate C-code and use the native C-compiler as the "real" back-end.
- Or, the code generator can generate code for a "virtual machine", and use an interpreter to execute the code.
- We expect the code generator to produce code that is as efficient as possible.

101 101 121 121 2 000



- The input to the code generator can be any one of the intermediate representations we've discussed: Trees, Tuples, Graphs,...
- The work of the code generator consists of several (interdependent) tasks:

Instruction

- selection: Which instructions should be generated?
- scheduling: In which order should they be generated?

### Register

- allocation: Which variables should be kept in registers?
- assignment: In which registers should they be stored?
- spilling: Which registers should be spilled when?

### Machine Architectures I



# Architectures

Kinds of Addressing Modes: Kinds of Register Classes (cont): Register Ind. with Displacement: d(R) The contents of the Integer+Float+Address Separate integer, floating point, and memory address R+d, where R is a register and d a address register sets (MC68k). (small) constant. (All architectures.) Kinds of Addressing Modes: The Cost of an instruction: Immediate: #X The value of the constant X. (All architectures.) The Cost of an instruction is the number of machine cycles it Register Direct: R The contents of register R. (All architectures.) takes to execute it. Register Indirect: (R) The contents of the memory address in On RISCs, most instructions take 1 cycle to execute. Loads, register R. (All.) stores, branches, multiplies, and divides may take longer. Register Indirect with increment: (R+) The contents of the On CISCs, the number of cycles required to execute an memory address in register R. R is incremented by the instruction Instr Op1, Op2 is size of the instruction (i.e. if MOVE.W (R+), Addr cost(Instr)+cost(Op1)+cost(Op2). cost(Opi) is the moves two bytes, then R would be incremented by number of cycles required to compute the addressing mode 2). (VAX, MC68k.) Op;. 101 (0) (2) (2) (2) (2) (2) 1 1 M 1 1 2 1 1 2 1 2 1 3 1 9 1 9 1 9

## Code Generation Example I

 A straight-forward code generator considers one tuple at a time, without looking at other tuples. The code generator is simple, but the generated code is sub-optimal.

### int A[5], i, x;

| The Tuple            | Code             |
|----------------------|------------------|
| (1) i := 1           | (9) T5 := i      |
| (2) TO := i          | (10) T6 := A[T5] |
| (3) IF TO<6 GDTO (5) | (11) T7 := T4+T6 |
| (4) GOTO (17)        | (12) x := T7     |
| (5) T1 := i          | (13) T8 := i     |
| (6) T2 := A[T1]      | (14) T9 := T8+1  |
| (7) T3 := x          | (15) i := T9     |
| (8) T4 := T2*T3      | (16) GOTO (2)    |

# A Simple Example

| Code Ge   | meration Example II (A)                                                  |
|-----------|--------------------------------------------------------------------------|
| <br>li    | Unoptimized MIPS Code:                                                   |
| sw        |                                                                          |
| L2:<br>lw |                                                                          |
| sl<br>bn  |                                                                          |
| j         | L3 # GOTO L3                                                             |
| L5:<br>lw | (5) T1 := i<br>\$2,i # \$2 := CONT(i)                                    |
|           | $ \begin{array}{c ccccccccccccccccccccccccccccccccccc$                   |
| SW        |                                                                          |
| lw        | (14) T9 := T8 + 1                                                        |
|           | du \$2,\$3,1 # \$2 := \$3 + 1<br>ve \$3,\$2 # \$3 := \$2<br>(15) i := T9 |
| SW        |                                                                          |
| j<br>L3:  | (16) GOTO (2)<br>L2 # GOTO L2                                            |

|      | (6) T2 := A[T1]                 |
|------|---------------------------------|
| move | \$3,\$2 # \$3 := \$2            |
| sll  | \$2,\$3,2 # \$2 := \$3 * 4      |
| la   | \$3,A # \$3 := ADDR(A)          |
| addu | \$2,\$2,\$3 # \$2 := \$2 + \$3  |
| lw   | \$2,0(\$2) # \$2 := CONT(A[i])  |
|      | (7) T3 := x                     |
| lw   | \$3,x # \$3 := CONT(x);         |
|      | (8) T4 := T2 * T3               |
| mult | \$3,\$2 # \$lo := \$3 * \$2     |
| mflo | \$4 # \$4 := \$1o               |
|      | (9) T5 := i                     |
| lw   | \$2,i # \$2 := CONT(i)          |
|      | (10) T6 := A[T5]                |
| move | \$3,\$2 # \$3 := \$2            |
| sll  | \$2,\$3,2 # \$2 := \$3 * 4      |
| la   | \$3,A # \$3 := ADDR(A)          |
| addu | \$2,\$2,\$3 # \$2 := \$2 + \$3  |
| lw   | \$3,0(\$2) # \$2 := CONT(A[i]), |

## Code Generation Example III (A)

 The generated code becomes a lot faster if we perform Common Sub-Expression Elimination (CSE) and keep the index variable i in a register (\$6) over the entire loop:

| li  | (1) i := 1<br>\$6,0x1 # \$6 := 1 |
|-----|----------------------------------|
| L2: | (2) TO := i                      |
|     | (3) IF i < 6 GOTO (5)            |
| slt | \$3,\$6,6 # \$3 := i < 6         |
| bne | \$3,\$0,L5                       |
|     | (4) GOTO (17)                    |
| j   | L3 # GOTO L3                     |
|     |                                  |
| L5: | (5) T1 := i                      |

101 (B) (2) (2) (2) 2 (0)

• A[T1] is computed once, and the result is kept in register \$5 until it's needed the next time.

|      | (6) T2 := A[T1]                |
|------|--------------------------------|
| move | \$3,\$6 # \$3 := \$6           |
| sll  | \$2,\$3,2 # \$2 := \$3 * 4     |
| la   | \$3,A # \$3 := ADDR(A)         |
| addu | \$2,\$2,\$3 # \$2 := \$2 + \$3 |
| lw   | \$5,0(\$2) # \$5 := CONT(A[i]) |
|      | (7) T3 := x                    |
| lw   | \$3,x # \$3 := CONT(x);        |
|      | (8) T4 := T2 * T3              |
| mult | \$3,\$5 # \$10 := \$3 * \$5    |
| mflo | \$4 # \$4 := \$lo              |
|      | (9) T5 := i                    |
|      | (10) T6 := A[T5]               |
|      |                                |

- Code Generation Example IV (A)
  - Since x and ADDR(A) seem to be used a lot in the loop, we keep them in registers (\$7 and \$8, respectively) as well.
  - We also reverse the comparison, which allows us to remove one jump.
  - . The move instruction is unnecessary, so we remove it also.

$$\begin{array}{c} (1) \ i := 1 \\ 1i \ & \$ 6, 0x1 \ & \# \$ 6 := 1 \\ 1w \ & \$ 7, x \ & \# \$ 7 := CONT(x); \\ 1a \ & \$ 8, A \ & \# \$ 8 := ADDR(A) \\ L2: \ & \hline (2) \ TO := i \\ \hline & \hline (3) \ IF \ i < 6 \ GOTO \ (5) \\ \hline & \hline (4) \ GOTO \ (17) \\ 8sge \ & \$ 3, \$ 6, 6 \ & \# \$ 3 := i >= 6 \\ bne \ & \$ 3, \$ 0, L3 \ & \# IF \$ 3 \neq 0 \ GOTO \ L3 \end{array}$$

 After the loop we need to store the value of \$6 which has been used to hold the loop index variable i.

| addu<br>sw | $\begin{array}{cccccccccccccccccccccccccccccccccccc$                            |
|------------|---------------------------------------------------------------------------------|
| addu       | (13) T8 := i<br>(14) T9 := T8 + 1<br>(15) i := T9<br>\$6,\$6,1 # \$6 := \$6 + 1 |
| j<br>L3:sw | (16) GOTO (2)<br>L2 # GOTO L2<br>\$6,i # i := \$6                               |

|      | (9) T5 := i                    |
|------|--------------------------------|
|      | (10) T6 := A[T5]               |
|      | (11) T7 := T4 + T6             |
|      | (12) x := T7                   |
| addu | \$7,\$4,\$5 # \$7 := \$4 + \$5 |

| (13) | T8 | := | i  |   |   |
|------|----|----|----|---|---|
| (14) | T9 | := | Τ8 | + | 1 |



- The unoptimized code (produced by gcc -S -g) was 28 instructions long. Our optimized code is 16 instructions. Improvement: 42%.
- More importantly, in the original code there were 26 instructions inside the loop, and 2 outside. Since the loop runs 5 times, we will execute 3 + 5 \* 25 = 128 instructions.
- In the optimized case, we have 11 instructions in the loop and 5 outside. We will execute only 5 + 5 \* 11 = 60 instructions. Improvement: 53%.

# Instruction Seleection

### Instruction Selection I

- Instruction selection is usually pretty simple on RISC architectures – there is often just one possible sequence of instructions to perform a particular kind of computation.
- CISC's like the VAX, on the other hand, leave the compiler with more choices: ADD2 1, R1 ADD3 R1, 1, R1 INC R1 all add 1 to register R1.

V \* 2 – Unoptimized Sparc Code

| set  | V, %oO     | <pre># %o0 := ADDR(V);</pre> |
|------|------------|------------------------------|
| ld   | [%00], %00 | # %00 := CONT(V);            |
| set  | 2, %01     | # %o1 := 2;                  |
| call | .mul, 2    | # %o0 := %o0 * %o1;          |
| nop  |            | # Empty delay slot           |

### Instruction Selection II

V \* 2 - Better Instr. Selection

• The Sparc has a library function .mul and a hardware multiply instruction smul:

set V, %oO

- ld [%00], %00
- smul %00, 1, %00 # %00 := %00 \* %01;

V + 2 - Even Better Instr. Selection

- The Sparc also has hardware shift instructions (sll, srl).
- Integer multiplication by 2<sup>i</sup> can be implemented as a shift i steps to the left.

set V, %o0

ld [%00],%00

sll %00, 1, %00 # %00 := %00 \* 2;

### Instruction Scheduling I

### Instruction Scheduling II (A)

V \* 2 – Unoptimized Sparc Code

- - V + 2 Better Instr. Scheduling
  - Instruction scheduling is important for architectures with several functional units, pipelines, delay slots. I.e. most modern architectures.
  - The Sparc (and other RISCs) have branch delay slots. These are instructions (textually immediately following the branch) that are "executed for free" during the branch.

- The Sparc's integer and floating point units can execute in parallel. Integer and floating point instructions should therefore be reordered so that operations are interleaved.
- Consider this example program:

int a, b, c; double x, y, z;
{
 a = b - c; c = a + b; b = a + c;
 y = x \* x; z = x + y; x = y / z;}

 How will the generated code be different if the compiler takes advantage of parallel execution, or not?

Instruction Scheduling II (B)

| int a<br>{ |       | ; double x, y, z<br>- c; c = a + b; |       | + c;                  |
|------------|-------|-------------------------------------|-------|-----------------------|
|            | y = x | * x; z = x + y;                     | x = y | / z;}                 |
|            |       | cc -02                              |       | cc -03                |
|            | set   | Ъ,%оЗ                               | fmuld | %f30,%f30,%f28        |
|            | sub   | %00,%01,%01                         | set   | c,%o1                 |
|            | set   | a,%o0                               | ld    | [%o1],%o2             |
|            | add   | %04,%05,%04                         | faddd | %f30,%f28,%f30        |
|            | add   | %00,%02,%00                         | set   | b,%o0                 |
|            | set   | x, %o0                              | ld    | [%00],%04             |
|            | fmuld | %f0,%f2,%f0                         | set   | z,%g1                 |
|            | sethi | %hi(z),%o2                          | sub   | %04,%02,%02           |
|            | faddd | %f6,%f8,%f6                         | fdivd | %f28,%f30,%f2         |
|            | fdivd | %f12,%f14,%f12                      | add   | %04,%02,%04           |
|            |       |                                     | add   | %o2,%o4,%o5           |
|            |       |                                     |       | 1000 ST 150 ST 150 ST |

# Register Allocation/Assignment/Spilling

Register Allocation Issues

| Register Assignment (A)                                                                                                        | Register Assignment (B)                                                                                                                                                                   |  |  |
|--------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|--|
| (日) (日) (日) (日) (日)                                                                                                            | (日) (四) (注) (注)                                                                                                                                                                           |  |  |
| Ocmmon sub-expressions are stored in regs.                                                                                     |                                                                                                                                                                                           |  |  |
| Loads and Stores are expensive<br>long as possible.                                                                            | <ul> <li>MC68k has address, integer, and floating point, etc.</li> </ul>                                                                                                                  |  |  |
| Procedure arguments are passed in regs.                                                                                        | <ul> <li>groups of registers that can only hold one type of data:</li> <li>MIPS &amp; Sparc have floating point and integer registers;</li> </ul>                                         |  |  |
| Intermediate results are stored in regs.                                                                                       | <ul> <li>Some architectures have several different register classes,</li> </ul>                                                                                                           |  |  |
| Instructions take operands in regs.                                                                                            | hold each of these variables.                                                                                                                                                             |  |  |
| Register Uses:                                                                                                                 | <ul> <li>Secondly, we have to decide which physical registers should</li> </ul>                                                                                                           |  |  |
| Registers have short access time.                                                                                              | Register Assignment:                                                                                                                                                                      |  |  |
| Hence, a one-word instruction can reference 3 registers but a<br>two-word instruction is necessary to reference a memory word. | Register Allocation:<br>• First we have to decide which variables should reside in<br>registers at which point in the program.<br>• Variables that are used frequently should be favored. |  |  |
| We only need 4–7 bits to access a register, but 32–64 bits to<br>access a memory word.                                         |                                                                                                                                                                                           |  |  |
| Why do we need registers?                                                                                                      |                                                                                                                                                                                           |  |  |

- · Some architectures pass procedure arguments in registers. If a value is used twice, first in a computation and then in a procedure call, we should allocate the value to the appropriate procedure argument register.
- Sparc passes it's first 6 arguments in registers %00,%01,%02,%03,%04,%05.
- · See the next slide for an example.

| <pre>main () {     int a,b;     a = b + 15;</pre> | $/* \leftarrow b$ is used here $/*$                                             |
|---------------------------------------------------|---------------------------------------------------------------------------------|
| P(b):                                             | $/* \Leftarrow$ and here. */                                                    |
| <pre>}</pre>                                      | <pre># %00 := CONT(b);<br/># %01 := %00 + 15<br/># a := %01;<br/># P(%00)</pre> |

> -0.0 C

### Register Spilling I

- We may have 8 | 16 | 32 regs available.
- When we run out of registers (during code generation) we need to pick a register to spill. I.e. in order to free the register for it's new use, it's current value first has to be stored in memory.
- Which register should be spilt? Least resently used, Least frequently used, Most distant use, ... (take your pick).

### Example:

- Assume a machine with registers R1--R3.
- R1 holds variable a; R2 holds b, R3 holds c, and R4 holds d. Generate code for:

```
x = a + b; # \Leftarrow Which reg for x?
```

```
y = x + c;
```

• Which register should be spilt to free a register to hold x?

> 2940

## Register Spilling Example

```
FOR i := 1 TO 100000 DO
   A[5,i] := b;
   FOR j := 1 TO 100000 DO
      A[j,i] := <Complicated Expression>;
   END
END
        1st Attempt (4 Regs available):
Allocation/Assignment: i in R_1, j in R_2, ADDR(A) in R_3,
           ADDR(A[5,]) in R_4.
    Spilling: Spill R4 in the inner loop to get enough registers to
           evaluate the complicated expression.
           _____ 2nd Attempt (4 Regs available): ______
Allocation/Assignment: i in R_1, j in R_2, ADDR(A) in R_3.
    Spilling: No spills. But ADDR(A[5,i]) must be loaded every
            time in the outer loop.
                                       101 (0) (2) (2) (2) 2 000
```

- 2 Registers Available k and ADDR(Å) in registers. (Prefer variables in inner loops).
   4 Registers Available k , ADDR(Å), j, and i in registers. (Prefer index variables).
- 5 Registers Available k, ADDR(A), j, i, and b in registers. (Prefer most frequently used variables).

# Basic Blocks and Flow Graphs

## Basic Blocks and Flow Graphs I

- We divide the intermediate code of each procedure into basic blocks. A basic block is a piece of straight line code, i.e. there are no jumps in or out of the middle of a block.
- The basic blocks within one procedure are organized as a flow graph.
- A flowgraph has
  - basic blocks B<sub>1</sub> · · · B<sub>n</sub> as nodes,
  - $\bullet\,$  a directed edge  $B_1 \to B_2$  if control can flow from  $B_1$  to  $B_2.$
- Code generation can be performed on a small or large piece of the flow graph at a time (small=easy, large=hard):

Local Within one basic block. Global Within one procedure. Inter-procedural Within one program.



X := 20; WHILE X < 10 DO X := X-1; A[X] := 10; IF X = 4 THEN X := X - 2; ENDIF; ENDDO; Y := X + 5;





Constructing Basic Blocks

## Basic Blocks I

## Basic Blocks II

- . How do we identify the basic blocks and build the flow graph?
- Assume that the input to the code generator is a list of tuples. How do we find the beginning and end of each basic block?

\_\_\_ Algorithm: \_\_\_\_\_

- First determine a set of leaders, the first tuple of basic blocks:
  - The first tuple is a leader.
  - Tuple L is a leader if there is a tuple if ...goto L or goto L.
  - Tuple L is a leader if it immediately follows a tuple if ...goto L or goto L.
- A basic block consists of a leader and all the following tuples until the next leader.

(D) (B) (E) (E) (E) (E) (O)

Block  $B_1$ : [(1) P:=0; (2) I:=1] Block  $B_2$ : [(3) P:=P4I; (4) IF P<=60 GOTO  $B_4$ ] Block  $B_3$ : [(5) P:=0; (6) I:=5] Block  $B_4$ : [(7) T1:=I42; (8) I:=T1+1; (9) IF I<=20 GOTO  $B_2$ ] Block  $B_4$ : [(10) K:=P43]

 $\begin{array}{c} P := 0 \\ I := 1 \\ P := P + I \\ I P > < 60 \text{ GDTD } B_{4} \\ \hline P := 0 \\ I := 6 \\ R_{4} \\ \hline T_{1} := 1 + 2 \\ I := T + 1 + 1 \\ I := T - 2 \text{ or } D_{2} \\ \hline R_{4} \\ \hline T_{1} := 1 + 2 \\ I := 7 + 2 \\ \hline R_{4} \\ \hline R_{4} \\ \hline R_{4} \\ \hline R_{5} \\ \hline$ 

```
P := 0; I := 1;
REPEAT
P := P + I;
IF P > 60 THEN P := 0; I := 5 ENDIF;
I := I * 2 + 1;
UNTLL I > 20;
K := P * 3
```

### \_\_\_\_ Tuples:

```
(1)
     P := 0

    Leader (Rule 1.a)

(2)
     T := 1
    P := P + I
                 ⇐ Leader (Rule 1.b)
(3)
    IF P <= 60 GOTO (7)
(4)
(5)
     P := 0
                 ⇐ Leader (Rule 1.c)
(6)
    I := 5
    T1 := I * 2
                 (7)
```

# Summary

## Summary I

- Read the Tiger book: Instruction selection pp. 205–216 Taming conditional branches pp. 185–188
- Or, read the Dragon book: Introduction 513–521
   Basic Blocks 528–530
   Flow Graphs 532–534

- Instruction selection picks which instruction to use, instruction scheduling picks the ordering of instructions.
- Register allocation picks which variables to keep in registers, register assignment picks the actual register in which a particular variable should be stored.
- We prefer to keep index variables and variables used in inner loops in registers.
- When we run out of registers, we have to pick a register to spill, i.e. to store back into memory. We avoid inserting spill code in inner loops.

101 (B) (2) (2) (2) 2 000

## Summary II

- Code generation checklist:
  - Is the code correct?
  - Are values kept in registers for as long as possible?
  - Is the cheapest register always chosen for spilling?
  - Are values in inner loops allocated to registers?
- A basic block is a straight-line piece of code, with no jumps in or out except at the beginning and end.
- Local code generation considers one basic block at a time, global one procedure, and inter-procedural one program.

## Homework

000 CE 120 CE 120 CE 1000

### Homework I

- Translate the program below into quadruples.
- Identify beginnings and ends of basic blocks.
- Build the control flow graph.

PROGRAM P;

```
VAR X : INTEGER; Y : REAL;
BEGIN
X := 1; Y := 5.5;
WHILE X < 10 D0
Y := Y + FLOAT(X);
X := X + 1;
IF Y > 10 THEN
Y := Y * 2.2;
ENDDG;
ENDD0;
END.
```

```
int A[5],x,i,n;
                         (1) i := 1
for (i=1; i<=n; i++) {</pre>
                         (2) IF i>n GOTO (14)
  if (i<n) {
                         (3) IF i>=n GOTO (6)
    x = A[i];
                         (4) x := A[i]
  } else {
                         (5) GOTO (11)
    while (x>4) {
                         (6) IF x<=4 GOTO (11)
      x = x * 2 + A[i];
                         (7) T1 := x*2
                         (8) T2 := A[i]
    };
                         (9) x := T1+T2
  };
  x = x+5;
                         (10) GOTO (6)
                         (11) x := x+5
                         (12) i := i+1
                         (13) GOTO (2)
```

101 (B) (C) (C) (C) (C) (C)

100 S (S) (S) (S) (S)