#### CSc 553

Principles of Compilation

7: Code Generation I

Department of Computer Science University of Arizona

collberg@gmail.com

Copyright © 2011 Christian Collberg



# Introduction



#### Code Generation Issues I

- The purpose of the code generation phase of the compiler is to transform the intermediate code produced by the front end into some other code that can be executed.
- Often the the code generator will produce assembly code or object code which (after assembly and linking) can be directly executed by the hardware.
- Alternatively, the code generator can generate C-code and use the native C-compiler as the "real" back-end.
- Or, the code generator can generate code for a "virtual machine", and use an interpreter to execute the code.
- We expect the code generator to produce code that is as efficient as possible.

#### Code Generation Issues II



#### Code Generation Issues III

- The input to the code generator can be any one of the intermediate representations we've discussed: Trees, Tuples, Graphs,...
- The work of the code generator consists of several (interdependent) tasks:

#### Instruction

- **selection:** *Which* instructions should be generated?
- scheduling: In which order should they be generated?

#### Register

- **allocation:** Which variables should be kept in registers?
- assignment: In which registers should they be stored?
- **spilling:** Which registers should be spilled when?



# **Architectures**

#### Machine Architectures I

```
Kinds of Instructions:
3-Register: add R1, R2, R3
             [R1 := R2 + R3] (MIPS, VAX, ...).
Register-Address: add R, Addr
             [R := R + Addr] (VAX, \times 86, MC68k)
2-Register: add R1, R2
             [R1 := R1 + R2] (VAX, \times 86, MC68k)
2-Address: add Addr1, Addr2
             [Addr1 := Addr1 + Addr2] (VAX)
3-Address: add Addr1, Addr2, Addr3
             [Addr1 := Addr2 + Addr3] (VAX)
                  _ Kinds of Register Classes: ___
    General One set of register that can hold any type of data
```

Integer+Float Separate integer and floating point register sets

(VAX, Alpha).

(Sparc, MIPS).

#### Machine Architectures II

```
Kinds of Register Classes (cont): ___
Integer+Float+Address Separate integer, floating point, and
             address register sets (MC68k).
                   Kinds of Addressing Modes: ___
Immediate: #X
                  The value of the constant X. (All architectures.)
Register Direct: R The contents of register R. (All architectures.)
Register Indirect: (R) The contents of the memory address in
             register R. (All.)
Register Indirect with increment: (R+)
                                         The contents of the
             memory address in register R. R is incremented by the
             size of the instruction (i.e. if MOVE.W (R+), Addr
             moves two bytes, then R would be incremented by
             2). (VAX, MC68k.)
```

#### Machine Architectures III

| Kinds of Addressing Modes:         |                                                   |  |                     |  |  |
|------------------------------------|---------------------------------------------------|--|---------------------|--|--|
| Register Ind.                      |                                                   |  | The contents of the |  |  |
|                                    | memory address R+d, where R is a register and d a |  |                     |  |  |
|                                    | (small) constant. (All architectures.)            |  |                     |  |  |
| The <b>Cost</b> of an instruction: |                                                   |  |                     |  |  |

- The Cost of an instruction is the number of machine cycles it takes to execute it.
- On RISCs, most instructions take 1 cycle to execute. Loads, stores, branches, multiplies, and divides may take longer.
- On CISCs, the number of cycles required to execute an instruction  $[Instr Op_1, Op_2]$  is  $cost(Instr)+cost(Op_1)+cost(Op_2)$ .  $cost(Op_i)$  is the number of cycles required to compute the addressing mode  $Op_i$ .

# A Simple Example

### Code Generation Example I

 A straight-forward code generator considers one tuple at a time, without looking at other tuples. The code generator is simple, but the generated code is sub-optimal.

```
int A[5], i, x;
main(){for(i=1;i<=5;i++) x=x*A[i]+A[i];}
```

| The Tuple Code       |                  |  |  |  |
|----------------------|------------------|--|--|--|
| (1) i := 1           | (9) T5 := i      |  |  |  |
| (2) TO := i          | (10) T6 := A[T5] |  |  |  |
| (3) IF TO<6 GOTO (5) | (11) T7 := T4+T6 |  |  |  |
| (4) GOTO (17)        | (12) x := T7     |  |  |  |
| (5) T1 := i          | (13) T8 := i     |  |  |  |
| (6) $T2 := A[T1]$    | (14) T9 := T8+1  |  |  |  |
| (7) T3 := x          | (15) i := T9     |  |  |  |
| (8) T4 := T2*T3      | (16) GOTO (2)    |  |  |  |

# Code Generation Example II (A)

```
Unoptimized MIPS Code: ____
          (1) i := 1
   li
         $2,0x1
                    # $2 := 1
                      # i := $2
         $2,i
   SW
          (2) T0 := i
L2:
   lw
         $2,i
                      # $2 := i
          (3) IF i < 6 GOTO (5)
         $3,$2,6 # $3 := i < 6
   slt
         $3,$0,L5 # IF $3≠0 GOTO L5
   bne
          (4) GOTO (17)
         L3
                      # GOTO L3
          (5) T1 := i
L5:
   lw
         $2,i
                        $2 := CONT(i)
```

```
(6) T2 := A[T1]
     $3,$2
               # $3 := $2
move
     $2,$3,2
              # $2 := $3 * 4
sll
              # $3 := ADDR(A)
la
     $3,A
     $2,$2,$3 # $2 := $2 + $3
addu
lw
     $2,0($2) # $2 := CONT(A[i])
     (7) T3 := x
     $3,x
               # $3 := CONT(x);
lw
     (8) T4 := T2 * T3
     $3,$2
               # $1o := $3 * $2
mult
mflo
     $4
               # $4 := $1o
     (9) T5 := i
               # $2 := CONT(i)
lw
     $2,i
     (10) T6 := A[T5]
     $3,$2
               # $3 := $2
move
sll
     $2,$3,2
               # $2 := $3 * 4
               # $3 := ADDR(A)
la
     $3,A
     $2,$2,$3 # $2 := $2 + $3
addu
     lw
```

4□ ト ← □ ト ← 巨 ト ■ り へ ○

# Code Generation Example III (A)

1.5:

 The generated code becomes a lot faster if we perform Common Sub-Expression Elimination (CSE) and keep the index variable i in a register (\$6) over the entire loop:

1i 
$$\$6,0x1$$
 #  $\$6:=1$ 

L2: (2) T0:=i
(3) IF i < 6 GOTO (5)

slt  $\$3,\$6,6$  #  $\$3:=$ i < 6
bne  $\$3,\$0,L5$  # IF  $\$3\neq0$  GOTO L5
(4) GOTO (17)

j L3 # GOTO L3



• A[T1] is computed once, and the result is kept in register \$5 until it's needed the next time.

• After the loop we need to store the value of \$6 which has been used to hold the loop index variable i.

# Code Generation Example IV (A)

- Since x and ADDR(A) seem to be used a lot in the loop, we keep them in registers (\$7 and \$8, respectively) as well.
- We also reverse the comparison, which allows us to remove one jump.
- The move instruction is unnecessary, so we remove it also.

```
(1) i := 1
               # $6 := 1
       $6,0x1
  li
       7,x # 7 := CONT(x);
  lw
       \$8,A # \$8 := ADDR(A)
  la
       (2) TO := i
L2:
        (3) IF i < 6 GOTO (5)
        (4) GOTO (17)
       $3,$6,6 # $3 := i >= 6
  sge
                 # IF $3\neq 0 GOTO L3
  bne
       $3,$0,L3
```

```
L5:
          (5) T1 := i
          (6) T2 := A[T1]
         $2,$6,2
                     # $2 := $3 * 4
   sll
         $2,$2,$8
                     # $2 := $2 + $8
   addu
         $5,0($2) # $5 := CONT(A[i])
   lw
          (7) T3 := x
         (8) T4 := T2 * T3
         $7,$5
  mult
                     # $1o := $7 * $5
  mflo
                     # $4 := $1o
         $4
          (9) T5 := i
          (10) T6 := A[T5]
          (11)
                 := T4 + T6
         (12) x := T7
                   # $7 := $4 + $5
         $7,$4,$5
   addu
          (13)
              T8 := i
                 := T8 + 1
          (14)
              Т9
```

↓□▶ ↓□▶ ↓□▶ ↓□▶ □ ♥Q♠

```
addu $6,$6,1  # $6 := $6 + 1

(16) GOTO (2)

j L2  # GOTO L2

L3:sw $6,i  # i := $6

sw $7,x  # x := $7
```

- The unoptimized code (produced by gcc -S -g) was 28 instructions long. Our optimized code is 16 instructions.
   Improvement: 42%.
- More importantly, in the original code there were 26 instructions inside the loop, and 2 outside. Since the loop runs 5 times, we will execute 3 + 5 \* 25 = 128 instructions.
- In the optimized case, we have 11 instructions in the loop and 5 outside. We will execute only 5 + 5 \* 11 = 60 instructions. Improvement: 53%.

# Instruction Seleection

#### Instruction Selection I

- Instruction selection is usually pretty simple on RISC architectures – there is often just one possible sequence of instructions to perform a particular kind of computation.
- CISC's like the VAX, on the other hand, leave the compiler with more choices: ADD2 1, R1 ADD3 R1, 1, R1
   INC R1 all add 1 to register R1.

```
_ V ∗ 2 − Unoptimized Sparc Code _____
```

```
set V, %00  # %00 := ADDR(V);
ld [%00], %00  # %00 := CONT(V);
set 2, %01  # %01 := 2;
call .mul, 2  # %00 := %00 * %01;
nop  # Empty delay slot
```

#### Instruction Selection II

\_\_\_ V \* 2 – Better Instr. Selection \_\_\_\_\_

 The Sparc has a library function .mul and a hardware multiply instruction smul:

```
set V, %o0
ld [%o0], %o0
smul %o0, 1, %o0 # %o0 := %o0 * %o1;
```

V \* 2 – Even Better Instr. Selection

- The Sparc also has hardware shift instructions (sll, srl).
- Integer multiplication by 2<sup>i</sup> can be implemented as a shift i steps to the left.

```
set V, %o0
ld [%o0], %o0
sll %o0, 1, %o0 # %o0 := %o0 * 2;
```

## Instruction Scheduling I

- Instruction scheduling is important for architectures with several functional units, pipelines, delay slots. I.e. most modern architectures.
- The Sparc (and other RISCs) have **branch delay slots**. These are instructions (textually immediately following the branch) that are "executed for free" during the branch.

# Instruction Scheduling II (A)

- The Sparc's integer and floating point units can execute in parallel. Integer and floating point instructions should therefore be reordered so that operations are interleaved.
- Consider this example program:

```
int a, b, c; double x, y, z;

{    a = b - c; c = a + b; b = a + c;

    y = x * x; z = x + y; x = y / z;}
```

 How will the generated code be different if the compiler takes advantage of parallel execution, or not?

# Instruction Scheduling II (B)

```
int a, b, c; double x, y, z;
     a = b - c; c = a + b; b = a + c;
     y = x * x; z = x + y; x = y / z;
           cc -02
                                cc - 03
     set b,%o3
                          fmuld %f30,%f30,%f28
     sub %00,%01,%01
                          set c.%o1
     set a.%00
                          ld [%o1],%o2
     add %04,%05,%04
                          faddd %f30,%f28,%f30
     add %00,%02,%00
                          set b,%00
     set x, %00
                          ld [%00].%04
                          set z, \%g1
     fmuld %f0,%f2,%f0
     sethi %hi(z),%o2
                          sub %o4,%o2,%o2
     faddd %f6,%f8,%f6
                          fdivd %f28, %f30, %f2
                          add %o4,%o2,%o4
     fdivd %f12,%f14,%f12
                              %o2,%o4,%o5
                          add
```

# Register Allocation/Assignment/Spilling

## Register Allocation Issues

- \_\_\_\_\_ Why do we need registers? \_\_\_\_\_
- We only need 4–7 bits to access a register, but 32–64 bits to access a memory word.
- We Hence, a one-word instruction can reference 3 registers but a two-word instruction is necessary to reference a memory word.
- Segisters have short access time.

Register Uses: \_\_\_\_\_

- Instructions take operands in regs.
- ② Intermediate results are stored in regs.
- Opening Procedure arguments are passed in regs.
- 4 Loads and Stores are expensive ⇒ keep variables in regs for as long as possible.
- **5** Common sub-expressions are stored in regs.



# Register Allocation/Assignment

| Register | Allocation: |  |
|----------|-------------|--|
|          |             |  |

- First we have to decide which variables should reside in registers at which point in the program.
- Variables that are used frequently should be favored.

| R | legister | Assignment: |
|---|----------|-------------|
|   |          |             |

- Secondly, we have to decide which physical registers should hold each of these variables.
- Some architectures have several different register classes, groups of registers that can only hold one type of data:
  - MIPS & Sparc have floating point and integer registers;
  - MC68k has address, integer, and floating point, etc.

# Register Assignment (A)

- Some architectures pass procedure arguments in registers. If a value is used twice, first in a computation and then in a procedure call, we should allocate the value to the appropriate procedure argument register.
- Sparc passes it's first 6 arguments in registers %00,%01,%02,%03,%04,%05.
- See the next slide for an example.

# Register Assignment (B)

```
main () {
   int a,b;
   a = b + 15; /* \Leftarrow b is used here /*
         /* \Leftarrow \text{ and here. } */
  P(b);
ld [\%fp-8],\%o0 # \%o0 := CONT(b);
add \%00,15,\%01 # \%01 := \%00 + 15
st \%01, [\%fp-4] # a := \%01;
                     # P(%o0)
call P,1
```

# Register Spilling I

- We may have 8 | 16 | 32 regs available.
- When we run out of registers (during code generation) we need to pick a register to spill. I.e. in order to free the register for it's new use, it's current value first has to be stored in memory.
- Which register should be spilt? Least resently used, Least frequently used, Most distant use, ... (take your pick).

```
_____ Example: _____
```

- Assume a machine with registers R1--R3.
- R1 holds variable a; R2 holds b, R3 holds c, and R4 holds d.
   Generate code for:

```
x = a + b; # \Leftarrow Which reg for x?

y = x + c;
```

• Which register should be spilt to free a register to hold x?



# Register Allocation Example

```
FOR i := 1 TO n DO
   B[5,i] := b * b * b;
   FOR j := 1 TO n DO
      FOR k := 1 TO n DO
         A[i,j] := A[i,k] * A[k,j];
      END
   END
END
2 Registers Available • k and ADDR(A) in registers. (Prefer
                variables in inner loops).
4 Registers Available • k, ADDR(A), j, and i in registers.
                (Prefer index variables).
5 Registers Available • k, ADDR(A), j, i, and b in registers.
                (Prefer most frequently used variables).
                                        ロトオ部トオミトオミト ミ からぐ
```

# Register Spilling Example

```
FOR i := 1 TO 100000 DO
   A[5,i] := b:
   FOR j := 1 TO 100000 DO
       A[j,i] := <Complicated Expression>;
   END
END
              ___ 1st Attempt (4 Regs available): ____
Allocation/Assignment: i in R_1, j in R_2, ADDR(A) in R_3,
            ADDR(A[5,_]) in R_{4}.
    Spilling: Spill R_4 in the inner loop to get enough registers to
            evaluate the complicated expression.
               _ 2nd Attempt (4 Regs available): _____
Allocation/Assignment: i in R_1, j in R_2, ADDR(A) in R_3.
    Spilling: No spills. But ADDR(A[5,i]) must be loaded every
            time in the outer loop.
                                          4D > 4A > 4B > 4B > B 900
```

# Basic Blocks and Flow Graphs

### Basic Blocks and Flow Graphs I

- We divide the intermediate code of each procedure into basic blocks. A basic block is a piece of straight line code, i.e. there are no jumps in or out of the middle of a block.
- The basic blocks within one procedure are organized as a flow graph.
- A flowgraph has
  - basic blocks  $B_1 \cdots B_n$  as nodes,
  - a directed edge  $B_1 \to B_2$  if control can flow from  $B_1$  to  $B_2$ .
- Code generation can be performed on a small or large piece of the flow graph at a time (small=easy, large=hard):

Local Within one basic block.

Global Within one procedure.

Inter-procedural Within one program.





\_\_\_\_\_ Source Code: \_\_\_\_\_

X := 20; WHILE X < 10 DO X := X-1; A[X] := 10;

IF X = 4 THEN X := X - 2; ENDIF;

**ENDDO**; Y := X + 5;

(2) if X>=10 goto (8) (6) X := X-2

(3) X := X-1 (7) goto (2)

(4) A[X] := 10 (8) Y := X+5

Flow Graph: \_\_



## Constructing Basic Blocks

#### Basic Blocks I

- How do we identify the basic blocks and build the flow graph?
- Assume that the input to the code generator is a list of tuples.
   How do we find the beginning and end of each basic block?

| ٩l٤ | gor | ith | nm | 1: |  |
|-----|-----|-----|----|----|--|
|-----|-----|-----|----|----|--|

- First determine a set of leaders, the first tuple of basic blocks:
  - The first tuple is a leader.
  - ② Tuple L is a leader if there is a tuple if ...goto L or goto L.
  - Tuple L is a leader if it immediately follows a tuple if ...goto L or goto L.
- ② A basic block consists of a leader and all the following tuples until the next leader.

#### Basic Blocks II

```
P := 0; I := 1;
REPEAT
   P := P + I;
   IF P > 60 THEN P := 0; I := 5 ENDIF;
   I := I * 2 + 1;
UNTIL I > 20:
K := P * 3
                     ___ Tuples: _____
(1) P := 0
                 \Leftarrow Leader (Rule 1.a)
(2) I := 1
(3) P := P + I \leftarrow Leader (Rule 1.b)
(4) IF P \le 60 \text{ GOTO} (7)
(5) P := 0 \leftarrow Leader (Rule 1.c)
(6) I := 5
   T1 := I * 2 \leftarrow Leader (Rule(1,b))
(7)
```

```
Block B_1: [(1) P:=0; (2) I:=1]

Block B_2: [(3) P:=P+I;

(4) IF P<=60 GOTO B_4]

Block B_3: [(5) P:=0; (6) I:=5]

Block B_4: [(7) T1:=I*2; (8) I:=T1+1;

(9) IF I<=20 GOTO B_2]

Block B_5: [(10) K:=P*3]
```



# Summary

#### Readings and References

- Read the Tiger book:
   Instruction selection pp. 205–216
   Taming conditional branches pp. 185–188
- Or, read the Dragon book:

```
Introduction 513–521
Basic Blocks 528–530
Flow Graphs 532–534
```

## Summary I

- Instruction selection picks which instruction to use, instruction scheduling picks the ordering of instructions.
- Register allocation picks which variables to keep in registers, register assignment picks the actual register in which a particular variable should be stored.
- We prefer to keep index variables and variables used in inner loops in registers.
- When we run out of registers, we have to pick a register to spill, i.e. to store back into memory. We avoid inserting spill code in inner loops.

### Summary II

- Code generation checklist:
  - Is the code correct?
  - Are values kept in registers for as long as possible?
  - Is the cheapest register always chosen for spilling?
  - 4 Are values in inner loops allocated to registers?
- A basic block is a straight-line piece of code, with no jumps in or out except at the beginning and end.
- Local code generation considers one basic block at a time, global one procedure, and inter-procedural one program.

## Homework

#### Homework I

- Translate the program below into quadruples.
- Identify beginnings and ends of basic blocks.
- Build the control flow graph.

```
PROGRAM P;
VAR X : INTEGER; Y : REAL;
BEGIN
   X := 1; Y := 5.5;
   WHILE X < 10 DO
      Y := Y + FLOAT(X);
      X := X + 1;
      IF Y > 10 THEN
         Y := Y * 2.2;
      ENDIF;
   ENDDO;
FND.
```

#### 07.330 Exam Question — Draw the CFG!

```
while (x>4) { (6) IF x<=4 GOTO (11)
    x = x*2+A[i]; (7) T1 := x*2
                   (8) T2 := A[i]
                    (9) x := T1+T2
                    (10) GOTO (6)
(11) x := x+5
(12) i := i+1
(13) GOTO (2)
 x = x+5;
```