CSc 553

Principles of Compilation

25: Instruction Scheduling II

Department of Computer Science
University of Arizona

## Introduction

### The UltraSparc-IIi I

- The integer unit has 2 ALUs for arithmetic, shift, and logical operations. Each has a 9-stage pipeline.
- The Load/Store unit can issue one load or store per cycle.
- The Floating point unit has five separate functional units.
   Two FP instructions can be issued per cycle. Most FP instructions have a throughput of 1 cycle, and a latency of 3 cycles.
- There's also a graphics unit that can issue 2 instructions per cycle.
- In total, up to 4 instructions can be issued per cycle.

#### The UltraSparc-IIi II



1011/01/12/12/12/12/19/09/0

## Superscalar Dispatch

### The UltraSparc-IIi III

 Instructions are grouped, fetched, and issued as a block of max k instructions.

| Grp | Cycle | Instruction |                             | Unit             |
|-----|-------|-------------|-----------------------------|------------------|
| 1   | 1     | IntAdd      |                             | ALU <sub>0</sub> |
|     | 1     | IntAdd      |                             | ALU <sub>1</sub> |
|     | 1     | LoadOp      |                             | LSU              |
|     | 1     | FPOp        |                             | FPU <sub>0</sub> |
| 2   | 2     | IntAdd      |                             | ALU <sub>0</sub> |
|     | 2     | IntAdd      |                             | $ALU_1$          |
|     | 2     | LoadOp      |                             | LSU              |
| 3   | 3     | LoadOp      | $r_1 \leftarrow \cdots$     | LSU              |
| 4   | 5     | IntAdd      | $r_2 \leftarrow \cdots r_1$ | ALU <sub>0</sub> |
|     | 5     | IntMul      |                             | MUL              |
| 5   | 41    | IntAdd      |                             | ALU <sub>1</sub> |

| Group | In             | Unit        |                  |
|-------|----------------|-------------|------------------|
| Grp1  | add %o1,55,%o2 |             | ECU <sub>0</sub> |
|       | shl            | %g2,2,%g2   | ECU <sub>1</sub> |
|       | ld [%fp+8],%12 |             | LSU              |
|       | fadd           | %f2,%f2,%f4 | FP-Add           |
| Grp2  | sub            | %11,5,%11   | ECU <sub>0</sub> |
|       | add            | %11,20,%o2  | ECU <sub>1</sub> |
|       | ld             | [%fp+8],%12 | LSU              |
|       |                |             |                  |

- The first 3 slots of a group can hold most types of instructions, except there can be only one ECU<sub>0</sub> and one ECU<sub>1</sub> instruction per group.
- The fourth slot can only hold a branch or an FP instruction.

D > 1/8 > 1/2 > 1/2 > 1/2 > 1/4 PO PO

101 (#) (E) (E) E 900

#### The UltraSparc-IIi IV

### The UltraSparc-IIi IEU I



101101011011

- Max 2 integer instructions can be issued per cycle. They're dispatched only if they're in the first 3 instruction slots of the group.
- There are two integer pipelines, IEU<sub>0</sub> and IEU<sub>1</sub>.
- Some instructions can only go to one pipeline. ADD, AND, ANDN, OR, ORN, SUB, XOR, XNOR, SETHI can go to either.
- IEU<sub>0</sub> has special hardware for shift instructions. Two shift instructions can't be grouped together.

### The UltraSparc-IIi IEU III

- IEU<sub>1</sub> has special hardware for instructions that set condition codes: ADDcc, ANDcc, ANDNcc, ORcc, ORNcc, SUBcc, XORcc, XNORcc. CALL, JUMPL, FCMP also use the IEU<sub>1</sub>.
- Two instructions that use the IEU<sub>1</sub> can't be grouped together.
   For example, only one instruction that sets condition codes can be issued per cycle.
- Some instructions execute for several cycles: MULScc inserts 1 bubble after it's dispatched. SDIV inserts 36 bubbles, UDIV insterts 37 bubbles. DIVX insterts 68 bubbles.

- Some instructions must complete before another instruction can be dispatched: Depending on the value of the multiplicand, SMUL inserts max 18 bubbles, UMUL 19 bubbles, MULX 34 bubbles.
- Some instructions are single group, they're always issued by themselves: LDD, STD, ADDC, SUBC, , MOVcc, FMOVcc, MOVr, SAVE, RESTORE, UMUL, SMUL, MULX, UDIV, SDIV, UDIVX, SDIVX

# The UltraSparc-IIi IEU IV

#### The UltraSparc-IIi CTI I

 IEU instructions that write to the same register can't be grouped together:

| Group | Iı  | Unit        |                  |
|-------|-----|-------------|------------------|
| Grp1  | add | %o1,55,%i6  | ECU <sub>0</sub> |
| Grp2  | ldx | [%16+0],%i6 | LSU              |

 If IEU instruction (a) reads a register that instruction (b) writes, they can't be grouped together:

| Group | Iı             | Unit        |                  |
|-------|----------------|-------------|------------------|
| Grp1  | add %o1,55,%i6 |             | ECU <sub>0</sub> |
| Grp2  | ldx            | [%i6+0],%o3 | LSU              |

 In other words, there's a one cycle delay between an instruction that computes a value and an instruction that uses that value.

- At most one control transfer instruction (CTI) can be dispatched per group: CALL, BPcc, Bicc, FB(P)cc, BPr, JMPL.
- BPcc are the branch on integer condition codes with prediction instructions: BPA, BPG, BPGE, · · · .
- If the branch is predicted taken, the branch instruction and the instruction at the branch target can be in the same group:

| Group | Instruction | Unit                |
|-------|-------------|---------------------|
| Grp1  | setcc       | ECU <sub>1</sub>    |
|       | BPcc        | CTI                 |
|       | FADD        | FPU (delay slot)    |
|       | FMUL        | FPU (branch target) |

 If the branch is predicted not taken, the branch instruction and the following instruction can be in the same group:

| Group | Instruction | Unit             |
|-------|-------------|------------------|
| Grp1  | setcc       | ECU <sub>1</sub> |
|       | BPcc        | CTI              |
|       | FADD        | FPU (delay slot) |
|       | FMUL        | FPU (sequential) |

- Load/store instructions can only be dispatched if they're in the first three instruction slots of a group.
- There can be one load/store dispatched per group.
- An instruction that references the result of a load cannot be in the load-group or the next group:

| Gro | up  | Instruction |          | Unit |
|-----|-----|-------------|----------|------|
| Grp | 1 l | DDF         | [r1],f6  | LSU  |
| Grp | 2   |             |          |      |
| Grp | 3 1 | MULD        | f4,f6,f8 | FPU  |

In other words, there's a two cycle load-delay.

### The UltraSparc-IIi FPU I

- FP instructions fall in two classes, A and M. An A and an M instruction can be in the same group, but not two A or two M instructions
- The A class: FxTOy, FABS, FADD, FAND, FCMP, FMOV, FNEG, FSUB.
- The M class: FCMP, FDIST, FDIV, FMUL, FSQRT.
- FPU instructions that write to the same register can't be grouped together:

| Group | Ins   | Unit       |     |
|-------|-------|------------|-----|
| Grp1  | FADDD | f2,f2,f6   | FPU |
| Grp2  | LDF   | [%16+0],f6 | LSU |

### The UltraSparc-IIi FPU I

 A FP store can be in the same group as the instruction that computes the value:

| Group | Ins   | truction   | Unit |
|-------|-------|------------|------|
| Grp1  | FADDD | f2,f2,f6   | FPU  |
|       | STD   | f6 [%16+0] | LSII |

 Most FP instructions have a latency of 3 cycles. I.e., the result generated by instruction (a) cannot be referenced by instruction (b) until 3 cycles later:

| Group | Inst  | Unit     |     |
|-------|-------|----------|-----|
| Grp1  | FADDD | f2,f4,f6 | FPU |
| Grp2  |       |          |     |
| Grp3  |       |          |     |
| Grp4  | FMULD | f4,f6,f8 | FPU |

FDIVD and FSQRTD have 22-cycle latencies.

### Instruction Scheduling

The purpose of instruction scheduling is

- $\ensuremath{\mbox{\textbf{0}}}$  to avoid pipeline stalls due to a datum not being available when needed, and
- to keep all functional units busy, i.e. making sure that every functional unit has at least one instruction ready to execute, and
- 3 to fill branch delay slots.
- We'll consider an algorithm, List Scheduling, that produces a toplogical sort of the dependence graph, while minimizing the execution time of the basic block.

# Dependence Graph

### Building the Dependence Graph I

- ullet There's an edge  $a \to b$  between instructions a and b of the dependence graph if
- 1 a writes to a register or location that b uses:
  - a) r<sub>1</sub> ← · · ·
- a uses a register or location that b writes to:
  - (a)  $\cdots \leftarrow [r_1 + 16]$
  - (b) [r<sub>1</sub> + 16] ← · · ·
- 3 a and b write to the same register or location:
  - (a) r<sub>1</sub> ← · · ·
  - (b) r<sub>1</sub> ← · · ·
- we don't know if a can be moved after b:
  - (a)  $[r_1+16] \leftarrow \cdots$ 
    - $(r_2 + 32]$

## Building the Dependence Graph ${\sf II}$

- The edge a → b is labeled with an integer latency,
   the delay required between the initiation times of a
  - and b, minus the execution time required by a before any other instruction can begin executing.
- If b can begin executing in the cycle after a began executing, the latency is 0:



 If two cycles have to elapse between starting a and starting b, the latency is 1:





### Dependence Graph Example

- Assume that a load has a latency of one cycle and takes two cycles to complete.
  - (1) load r2, [r1+4] (2) load r3, [r1]
  - (3) add r4, r2, r3 (4) sub r5, r2, 1
    - 3 4

#### Note:

- When building the graph we must take implicit resources like condition codes into account:
  - There's an edge a → b if a sets a condition code and b brances on it.

# List Scheduling Algorithm

The state of the s

### List Scheduling Algorithm I

- Consider the dependence graph below. Let ExecTime[6]=2, and ExecTime=1 for all other instructions.
- Start by labeling the nodes with the maximum possible delay from the node to the end of the block.

$$\texttt{delay}[n] = \left\{ \begin{array}{ll} \texttt{ExecTime}[n] \text{if } n & \text{is a leaf} \\ \texttt{max} & \\ \texttt{succs} \ m \ \text{of} \ n \end{array} \right. \\ \left( \texttt{latency}(n,m) + \texttt{delay}[m] \right) \ \text{otherw}$$



### Example I

 Consider the dependence graph below. Let ExecTime[6]=2, and ExecTime=1 for all other instructions.



### List Scheduling Algorithm II

- Next, traverse the graph from the root to the leaves.
- Select nodes to schedule.
- Keep track of the current time, CurTime.
- ETime[n] is the earliest time node n should be scheduled to avoid a stall.
- Candidates is the set of candidates (nodes which can be scheduled).
- MCands is the set of candidates with the maximum delay time to the end of the block.
- $\bullet$  ECands is the set of candidates whose earliest start times are  $\leq \mathtt{CurTime}.$

40×40×42×42× 2 900

#### List Scheduling Algorithm IV

As usual, there are many possible heuristics to choose from:

PROCEDURE Heuristics (M : SET OF Nodes) : Node

- Pick the n from MCands with minimum ETime[n].
- Pick the n with maximum total delay to the leaves.
- Pick the n that adds the most new candidates
- Pick the n that adds the most new candidates.
- Pick the n that originally came first in the basic block.

#### REPEAT

Candidates := nodes in DAG with indegree=0;
MCands := Candidates with max Delay;

ECands := Candidates whose ETime  $\leq$  CurTime; IF there's just one  $m \in$  MCands THEN n := m

ELSIF there's just one  $m \in \text{ECands THEN } n := m$  ELSIF there's more than one  $m \in \text{MCands THEN}$ 

n := Heuristics(MCands)

ELSIF there's more than one  $m \in EC$ ands THEN n := Heuristics(ECands) ENDIF;

Schedule n;

CurTime := CurTime + ExecTime[n];
FOR every successor i of n DO

ETime[i] := max(ETime[n],CurTime+Latency(n,i))





=2+1=3

=4+0=4







CurTime ETime[5]=6

100 C 150 (5) (5) (0)

D=1

40 40 40 42 42 6 2 990



### List Scheduling Algorithm V

- How do we deal with superscalar architectures?
- We can easily modify the heuristic to handle p > 1 pipelines:

PROCEDURE Heuristics (M : SET OF Nodes) : Node

- Pick a node n from M such that
  - lacksquare instruction n can execute on pipeline  $P_i$ , and
  - pipeline P<sub>i</sub> hasn't had an instruction scheduled for it recently.

#### THE THE PERSON AND TH

CurTime

#### Filling Delay Slots II

Typically, a basic block ends with a branch:

Filling Delay Slots I

- (1) Instr<sub>1</sub>
- (n-1) Branch
- (n) Delay Slot
- We pick an instruction (i) from the basic block to fill the delay slot, such that
  - (i) is a leaf of the dependence graph (otherwise, some other instruction b depends on it and would have to be executed after the branch), and
  - the branch must not depend on (i) (e.g. (i) can't set the condition codes that are branched on), and
  - (i) is not a branch itself.

- If that doesn't work, we can try to move an instruction from the target basic block into the delay slot:
  - Instr<sub>1</sub>
    - ...
  - (n-1) Branch L
  - (n) Delay Slot
- L:  $Instr_k \Leftarrow move to delay slot!$  We prefer a single-cycle instruction (like an integer add) over
- a multi-cycle (like a delayed load).
- If we can't find a suitable instruction, insert a nop.

### Filling Delay Slots III

### Readings and References

Calls are similar:

- There's usually an instruction (i) that moves an argument into one of the argument-passing registers. We prefer to put that instruction in the delay slot.
- If not, we can try to move an instruction from the next basic block (the one following the call) into the slot:

Failing that, we insert a nop.

- Steven Muchnick, Advanced Compiler Design and Implementation, Section 9.2, pp. 269–274 and Chapter 17, pp. 531–547.
- Sun Technical manuals for UltraSparc-IIi (available from www.sun.com):
  - Chapter 1: UltraSparc-IIi Basics
  - ② Chapter 2: Processor Pipeline
  - Ochapter 21: Code Generation Guidelines
  - Chapter 22: Grouping Rules and Stalls.

