Designing RISC-V CPU from scratch – Part 3: Dealing with Pipeline Hazards

I hope everyone has gone through the previous part of the RISC-V CPU Development blog series, where we talked about the Specifications & Architecture of Pequeno. If not, please go through it before moving ahead.

In this blog, we will be discussing Pipeline Hazards. We successfully designed the pipelined architecture of the CPU last time, but we didn’t consider the evil twin that comes along with pipelines. What could be the implications of pipeline hazards on the architecture? What architectural modifications may be required to mitigate them? Let us go ahead and demystify them!

Pipeline Hazards – Implications on Performance

Hazards in the instruction pipeline of a CPU are dependencies which disrupt the normal execution of the pipeline. When a hazard occurs, an instruction cannot execute in its designated clock cycle, as it may result in incorrect computation results or incorrect flow of control. As a result, the pipeline may be forced to stall until the instruction can be executed successfully.

Pipeline stall in RISC-V CPUs

Pipeline Stall due to Data Dependency

In the above example, the CPU performs in-order execution of instructions in the compiler-generated order. Let’s assume instruction i_2 has some dependency on i_1: say, a register has to be read by i_2, but this register is also being modified by the previous instruction i_1. Hence, i_2 has to wait until i_1 writes back its result to the register file; otherwise, stale data would be read from the register file at Decode and used by the Execute stage. To avoid this data incoherency, i_2 is forced to stall for three clock cycles. The bubbles inserted in the pipeline represent the stall or wait states. Only when i_1 completes is i_2 decoded. Finally, i_2 finishes execution in the 10th clock cycle instead of the 7th. A latency of three clock cycles was introduced due to the stall caused by the data dependency. How does this latency affect CPU performance?

We ideally expect our CPU to work at full throughput, i.e., CPI = 1. But when the pipeline stalls, the throughput/performance of the CPU drops because CPI increases. For a non-ideal CPU:

\text{Clock Cycles Per Instruction (CPI)} = 1 + \text{Stall cycles per instruction}
\implies \text{CPI}_\text{non-ideal} = \text{CPI}_\text{ideal} + \text{Stall cycles per instruction}
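
For instance (a made-up ratio, just to put a number on it): if one in every five instructions suffers the three-cycle stall shown above, the average CPI becomes

\text{CPI} = 1 + \frac{3}{5} = 1.6

i.e., the CPU delivers only about 62\% of its ideal throughput.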

Pipeline Hazards – Types & Mitigation

There are different ways in which a hazard can occur in the pipeline. Pipeline hazards can be classified into three types:

  • Structural Hazards
  • Control Hazards
  • Data Hazards

Structural Hazards

Structural Hazards occur due to hardware resource conflicts, i.e., when two stages of the pipeline want to access the same resource. For example, two instructions require access to the memory in the same clock cycle.

Structural Hazards in RISC-V Pipeline

Structural Hazard due to Memory Access Conflict

In the above example, the CPU has a single memory for both instructions and data. The Fetch stage accesses the memory every clock cycle to fetch the next instruction. Hence, the instructions at the Fetch and Memory Access stages may conflict if an earlier instruction at the Memory Access stage also needs memory access. This forces the CPU to add stall cycles: the Fetch stage has to wait until the resource (memory) is relinquished by the instruction in the Memory Access stage.

Mitigating Structural Hazards

Some ways to mitigate structural hazards are:

  • Stall the pipeline until the resource is available.
  • Duplicate resources so that there will be no conflict at all.
  • Pipeline the resource so that the two instructions occupy different pipeline stages of the resource.

Let’s analyze the different scenarios which may cause structural hazards in the pipeline of Pequeno and how to resolve them. We have no intention of using stalling as an option to mitigate structural hazards!

Scenario | Solution
1) Memory access conflict between Fetch and Memory Access stages | ✅ Separate instruction and data memory access paths. Use instruction and data caches with a single physical memory, or physically separate instruction and data memories.
2) Register File access conflict between Decode and WriteBack stages | ✅ Separate ports for read and write. The Register File is designed with two read ports and one write port. The WB stage uses the write port; the ID stage uses the read ports. Both may access the Register File in the same clock cycle without conflict.
3) ALU conflict between other stages and the Execute stage | ✅ An exclusive ALU in the Execute stage. If any other stage needs to perform operations like address computation (e.g., IF needs to increment the PC every cycle), it uses its own local ALU.

Table: Mitigating Mechanisms for Structural Hazards

In Pequeno’s architecture, we implement the above three solutions to mitigate all kinds of structural hazards.
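
As a quick illustration of the second solution above, here is a minimal behavioral sketch in Python (purely illustrative; the class name RegFile is ours, and the actual design will be RTL in a later part) of a register file with two read ports and one write port, so that Decode and WriteBack can access it in the same cycle without conflict:

class RegFile:
    """Behavioral sketch of a 32 x 32-bit register file with 2 read ports and 1 write port."""

    def __init__(self):
        self.regs = [0] * 32                  # x0..x31, x0 hardwired to zero

    def read(self, rs1, rs2):
        # Two independent read ports, used by the Decode (ID) stage in one cycle.
        return self.regs[rs1], self.regs[rs2]

    def write(self, rd, data):
        # Single write port, used by the WriteBack (WB) stage. Writes to x0 are ignored.
        if rd != 0:
            self.regs[rd] = data & 0xFFFFFFFF

# Same-cycle access without conflict: WB writes x1 while ID reads x2 and x3.
rf = RegFile()
rf.write(1, 0xCAFE)
print(rf.read(2, 3))   # (0, 0)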

Control Hazards

Control Hazards are caused by Jump/Branch instructions, which are the flow-control instructions in the ISA of the CPU. When control reaches a Jump/Branch instruction, the CPU has to decide whether to take the branch or not. One of the following actions should be taken:

  • Fetch the next instruction at PC+4 (branch not taken) OR
  • Fetch the instruction at the branch target address (branch taken).

Whether the decision was right or wrong is known only at the Execute stage, where the result of the branch instruction is computed. Depending on whether the branch is taken or not taken, the branch address (the address to which the CPU should branch out) is resolved. If the decision made earlier was wrong, all the instructions fetched and decoded into the pipeline until that clock cycle should be discarded, because those instructions should not be executed at all! This is done by flushing the pipeline and fetching the instruction at the branch address in the next clock cycle. Flushing invalidates those instructions and converts them into NOPs or bubbles. This incurs a significant number of clock cycles as a penalty, called the branch penalty. Control hazards thus have the worst effect on CPU performance.

mvi x1, 0x100      # i1
mvi x2, 0x200      # i2
mvi x3, 0x000      # i3
bne x1, x2, EXIT   # i4: EXIT branch to be taken because x1 != x2
mvi x3, 0x001      # i5: This instruction shouldn't be executed!
...
EXIT:
...                # i10: This instruction should be executed after i4
RISC-V CPU Control Hazards

Control Hazard and Flushing

In the above example, i_{10} completes execution in the 10th clock cycle, but it should have completed in the 7th clock cycle. A penalty of three clock cycles was incurred because the wrong path (starting at i_5) was fetched. The pipeline had to be flushed when this was identified by the Execute stage in the 4th clock cycle. How will this impact CPU performance?

If a program running on the above CPU has 30\% branch instructions, and every branch incurs the worst-case penalty of three clock cycles, the CPI becomes:

\text{CPI} = \text{CPI}_\text{ideal} + (0.30 \times 3) = 1 + 0.9 = 1.9 \approx 2

The CPU performance is cut down by roughly 50\%!

Mitigating Control Hazards

To mitigate control hazards, we can employ some strategies in the architecture…

  • Simply stall the pipeline if the instruction is identified as a branch. This decoding logic can be implemented in the Fetch stage itself. Once the branch instruction is executed and the branch address is resolved, the next instruction can be fetched and the pipeline can resume.
  • Add dedicated branch logic, like Branch Prediction, in the Fetch stage.

The essence of branch prediction is this: we employ some kind of prediction logic in the Fetch stage that guesses whether the branch will be taken or not. In the next clock cycle, the guessed instruction is fetched, either from PC+4 (branch predicted not taken) or from the branch target address (branch predicted taken). Now there are two possibilities:

  • If the prediction is found to be correct at the Execute stage, do nothing; the pipeline can continue processing.
  • If the prediction is found to be wrong, flush the pipeline and fetch the correct instruction from the branch address resolved by the Execute stage. This incurs the branch penalty.

As you can see, branch prediction can still incur a branch penalty if it makes an incorrect prediction. The design goal should be to reduce the probability of incorrect predictions. CPU performance heavily depends on how “good” the prediction algorithm is. Complex techniques like Dynamic Branch Prediction keep a history of branch outcomes to predict correctly around 80-90\% of the time.
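
To give a flavor of what dynamic prediction looks like, below is a minimal sketch in Python (purely illustrative, and not Pequeno’s actual scheme) of a classic 2-bit saturating-counter predictor indexed by the lower PC bits:

class TwoBitPredictor:
    """Sketch of a dynamic branch predictor: a table of 2-bit saturating counters.
    Counter states 0/1 predict not-taken; states 2/3 predict taken."""

    def __init__(self, entries=64):
        self.entries = entries
        self.table = [1] * entries            # initialize to weakly not-taken

    def _index(self, pc):
        return (pc >> 2) % self.entries       # index with lower PC bits (word-aligned)

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2   # True = predict taken

    def update(self, pc, taken):
        # Called when the Execute stage resolves the actual branch outcome.
        i = self._index(pc)
        self.table[i] = min(self.table[i] + 1, 3) if taken else max(self.table[i] - 1, 0)

# A loop branch trains toward "taken" after one observed taken outcome.
bp = TwoBitPredictor()
for _ in range(4):
    print(bp.predict(0x80000010))             # False, True, True, True
    bp.update(0x80000010, taken=True)

Each branch trains its own counter, so a branch that is usually taken settles into the “taken” states and is mispredicted only occasionally.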

To mitigate control hazards in Pequeno, we will implement a simple branch prediction logic. More details will be revealed in the upcoming blog where we design the Fetch Unit.

Scenario | Solution
1) Control Hazards due to Jump/Branch instructions | ✅ Add a simple branch prediction logic in the Fetch stage.

Table: Mitigating Mechanisms for Control Hazards

Data Hazards

Data Hazards occur when an instruction’s execution has a data dependency on the result of a previous instruction that is still being processed in the pipeline. Let’s visit the three types of data hazards with examples to understand the concept better.

WAW (Write-After-Write)

Suppose an instruction i_1 writes its result to a register x. The next instruction i_2 also writes its result to the same register. Any subsequent instruction in program order should read the result of i_2 at x; otherwise, data integrity is lost. This type of data dependency is called output dependency, and it may lead to WAW data hazards.

add x1, x2, x3  # i1; x1 = x2 + x3
add x1, x4, x5  # i2; x1 = x4 + x5
.
.
add x5, x1, x2  # This instruction should be reading the result of i2 @x1

WAR (Write-After-Read)

Suppose an instruction i_1 reads a register x. The next instruction i_2 writes its result to the same register. Here, i_1 should read the older value of x, not the result of i_2. If i_2 writes its result to x before it is read by i_1, it results in a data hazard. This type of data dependency is called anti-dependency, and it may lead to WAR data hazards.

add x1, x2, x3  # i1; x1 = x2 + x3 -- should be reading the older value @x2
add x2, x4, x5  # i2; x2 = x4 + x5 -- new value written @x2

RAW (Read-After-Write)

Suppose an instruction i_1 writes its result to a register x. The next instruction i_2 reads the same register. Here, i_2 should read the value written by i_1 to the register x, not the older value. This type of data dependency is called true dependency, and it may lead to RAW data hazards.

add x1, x2, x3  # i1; x1 = x2 + x3
add x5, x1, x4  # i2; x5 = x1 + x4 -- should be reading the value written by i1 @x1

This is the most common and predominant type of data hazard in pipelined CPUs.

Mitigating Data Hazards

To mitigate data hazards in in-order CPUs, we can employ a few techniques:

  • Stall the pipeline on detecting a data dependency (refer to the first figure). The Decode stage can wait until the execution of the previous instruction is completed.
  • Compiler Re-scheduling: The compiler re-arranges the code to avoid data hazards by scheduling the dependent instruction for later. The intention is to avoid stalling without affecting the integrity of the program’s control flow, but this may not always be possible. The compiler can also insert NOP instructions between two instructions that have a data dependency, but this incurs stalls and is hence a performance hit.
add x1, x2, x3  # i1 being decoded
# Compiler inserts NOPs or other non-dependent instructions here.
# 3 NOPs are needed because there are 3 stages after Decode in the pipeline...
NOP             # NOP being decoded; i1 in Execute stage
NOP             # NOP being decoded; i1 in Memory Access stage
NOP             # NOP being decoded; i1 in WriteBack stage
add x5, x1, x4  # i2 being decoded; by this time i1 has completed writeback
                # and hence reads the correct value @x1
  • Data/Operand Forwarding: This is the prominent architectural solution in in-order CPUs to mitigate RAW data hazards. Let’s analyze the CPU pipeline to understand the idea behind this technique.

Data/Operand Forwarding

Assume two adjacent instructions i_1 and i_2 have a RAW data dependency between them, as both access a register x. The CPU should stall instruction i_2 until i_1 writes back its result to the register x. If the CPU has no stalling mechanism, the older value of x is read by i_2 at the Decode stage in the 3rd clock cycle. In the 4th clock cycle, i_2 gets executed with the wrong value of x!

RAW Data Hazard in RISC-V CPUs

RAW Data Hazard in action!

If you look closely at the pipeline, the result of i_1 is already available in the 3rd clock cycle. It is not written back to the register file yet, of course, but the result is available at the output of the Execute stage. So, if we are somehow able to detect the data dependency and then “forward” this data to the Execute stage input, the next instruction can use the forwarded data instead of the data from the Decode stage. Thus, the data hazard is mitigated! The idea looks like this:

Data or Operand Forwarding in RISC-V CPUs

Data Forwarding

This is called Data/Operand Forwarding or Data/Operand Bypassing. We forward the data forward in time, so that trailing dependent instructions in the pipeline can access this bypassed data for execution at the Execute stage.

The idea can be extended across different stages. In a 5-stage pipeline executing instructions in the order i_1, i_2, ..., i_n, a data dependency can exist between:

  • i_1 and i_2 – Need to bypass between Execute and Decode stage outputs.
  • i_1 and i_3 – Need to bypass between Memory Access and Decode stage outputs.
  • i_1 and i_4 – Need to bypass between WriteBack and Decode stage outputs.

The architectural solution to mitigate RAW data hazards originating at any stage of the pipeline would look like this:

Data or Operand Forwarding in RISC-V CPUs 5-stage

Data Forwarding to mitigate RAW Hazards in 5-stage Pipeline
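
In hardware, the figure above boils down to comparators and multiplexers at the Execute stage inputs: compare the source register of the instruction entering Execute against the destination registers of the instructions in the later stages, and pick the youngest matching result. A rough behavioral sketch in Python (purely illustrative; the signal names are our own, and the real logic will be RTL in a later part):

def forward_operand(rs, rf_value,
                    ex_rd, ex_result,
                    mem_rd, mem_result,
                    wb_rd, wb_result):
    """Pick the operand value for source register rs at the Execute stage input.

    Priority goes to the youngest in-flight producer:
    Execute output > Memory Access output > WriteBack output > register file.
    A real design would also qualify each match with that stage's register-write
    enable; x0 never matches because it is hardwired to zero.
    """
    if rs != 0 and rs == ex_rd:
        return ex_result     # i_1 vs i_2 case: bypass from Execute stage output
    if rs != 0 and rs == mem_rd:
        return mem_result    # i_1 vs i_3 case: bypass from Memory Access stage output
    if rs != 0 and rs == wb_rd:
        return wb_result     # i_1 vs i_4 case: bypass from WriteBack stage output
    return rf_value          # no dependency: use the value read at Decode

# add x5, x1, x4 right behind add x1, x2, x3: rs = x1 matches ex_rd = x1.
print(hex(forward_operand(1, 0, 1, 0xAB, 0, 0, 0, 0)))   # 0xab, forwarded from Execute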

Pipeline Interlock

Consider the following scenario:

lw x1, 0xABC(x2)  # i1: Load data from memory at addr = [x2 + 0xABC] --> x1
add x3, x1, x1    # i2: x3 = x1 + x1 ; x1 should contain the value loaded from memory

There is a data dependency between the two adjacent instructions i_1 and i_2, where the first instruction is a Load. This is a special case of data hazard. Here, we cannot execute i_2 until the data is loaded into x_1. So, the question is: can we still mitigate this data hazard with data forwarding? The load data will be available only at the Memory Access stage of i_1, and it has to be forwarded to the Decode stage output of i_2 to prevent the hazard. This requirement would look like:

Pipeline Interlock in RISC-V CPUs

Pipeline Interlock

Say the load data is available in the 4th cycle at the Memory Access stage; you need to “forward” this data to the 3rd cycle, to the Decode stage output of i_2 (why the 3rd cycle? Because in the 4th cycle, i_2 would have already completed execution in the Execute stage!). Essentially, you are trying to forward present data to the past, which is impossible unless your CPU can time travel! This is not data forwarding but “data backwarding” 😀 ...

Data Forwarding can be done only in the forward direction in time.

This type of data hazard is called a Pipeline Interlock. The only way to get around it is to stall the pipeline by one clock cycle, inserting a bubble when this data dependency is detected.

Pipeline Interlock in RISC-V CPUs solved by stall

A NOP, aka bubble, is inserted between i_1 and i_2. This delays i_2 by one cycle; data forwarding can now forward the load data from the Memory Access stage to the Decode stage output.
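
The detection half of this is simple combinational logic: if the instruction in Execute is a load and its destination register matches a source register of the instruction in Decode, stall Fetch/Decode for one cycle and inject a bubble into Execute. A hedged sketch in Python (illustrative only, with hypothetical signal names):

def load_use_hazard(ex_is_load, ex_rd, id_rs1, id_rs2):
    """Return True when a one-cycle stall (bubble) must be inserted.

    The load in the Execute stage gets its data only at the end of the
    Memory Access stage, so forwarding cannot feed the dependent
    instruction sitting in Decode during this cycle.
    """
    return ex_is_load and ex_rd != 0 and ex_rd in (id_rs1, id_rs2)

# lw x1, ... in Execute and add x3, x1, x1 in Decode --> stall for one cycle.
print(load_use_hazard(True, 1, 1, 1))    # True : insert bubble
print(load_use_hazard(True, 1, 2, 3))    # False: no dependency, no stall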

So far, we have seen only how to mitigate RAW data hazards. So, what about WAW and WAR hazards? Well, RISC-V architectures are inherently resistant to WAW and WAR hazards in in-order pipeline implementations!

  • All register writebacks happen in the order of issue. Data written back is always overwritten by the later instruction that writes to the same register. So, a WAW hazard never happens!
  • Writeback is the last stage of the pipeline. By the time a later instruction writes back, the earlier instruction has already read the older value and completed execution successfully. So, a WAR hazard never happens!

To mitigate RAW data hazards in Pequeno, we will implement data forwarding with pipeline interlock detection hardware. More details will be revealed in the upcoming blog where we design the Data Forwarding logic.

Scenario | Solution
1) WAW/WAR Data Hazards | ✅ Inherently mitigated by the in-order pipeline architecture.
2) RAW Data Hazards | ✅ Data forwarding + pipeline interlock detection hardware.

Table: Mitigating Mechanisms for Data Hazards

Modified Architecture of Pequeno

We have understood and analyzed the various pipeline hazards that could break instruction execution in the existing CPU architecture. We also devised solutions and mechanisms to mitigate them. Let’s incorporate the necessary micro-architecture and draft the final architecture of the Pequeno RISC-V CPU, devoid of all kinds of pipeline hazards!

Pequeno RISC-V CPU Architecture

Pequeno – Modified CPU Architecture to mitigate all Pipeline Hazards

What’s next

In the upcoming parts, we will dive into the RTL design of each pipeline stage/functional unit. We will discuss the various micro-architecture decisions and challenges in the design phase. These are the building blocks which we will eventually integrate to complete our CPU architecture.

Visit the complete blog series

Pequeno RISC-V CPU Logo
This post is part of the RISC-V CPU Development blog series.

<< Previous part |~~~~ J U M P ~~ T O ~~~~| Next part >>

Support

Leave a comment or visit support for any queries/feedback regarding the content of this blog.
If you liked Chipmunk, don’t forget to follow!


4 COMMENTS

Dexter

Informative and descriptive. Loved the nuance of pipeline interlock. It’s something overlooked while implementing data forwarding only to get caught in verification later.

    chipmunk

    Thanks for the comment. And yes, you are right.

