Designing RISC-V CPU from scratch – Part 4: Fetch Unit

I hope all of you have gone through the previous part of the RISC-V CPU Development blog series, where we talked about Dealing with Pipeline Hazards of Pequeno. If not, please go through it before moving ahead.

From this part onwards, we are ready to dive into micro-architecture and RTL design! In this blog, we will architect and design the Fetch Unit (FU) of Pequeno.

Fetch Unit – Defining Interfaces

The Fetch Unit is Stage-1 of the CPU pipeline and interfaces with the instruction memory. FU fetches instructions from the instruction memory and sends them to the Decode Unit (DU). As discussed in the modified architecture of Pequeno in Part-3, FU also accommodates branch prediction logic and flush support.

Let’s define the interfaces for Fetch Unit.

Instruction Access Interface	To access instruction memory/cache
DU Interface	To send the fetched instruction, control/data to Decode Unit
Flush Interface	To flush FU externally

Table: Fetch Unit – Interfaces

Fetch Unit – Interfaces

Instruction Access Interface

The core functionality of FU in the CPU is instruction access, and the Instruction Access I/F is used for that purpose. Instructions are stored in instruction memory (RAM) during execution. Modern CPUs fetch instructions from cache memory rather than directly from the instruction memory. The Instruction Cache (in computer architecture terms, this is called a Primary Cache or L1 Cache) is located closer to the CPU. It speeds up instruction access by caching frequently accessed instructions and pre-fetching a larger chunk of instructions in their vicinity. Thus, there is no need to continuously access the slower main memory (RAM), and most instructions are served quickly, directly from the cache.

Caches are complex designs in computer architecture. Read more about caches here.

The CPU doesn’t interface directly with the instruction cache/memory. A cache/memory controller sits in between and controls the memory accesses.

Fetch Unit – Instruction Fetch

It would be a good idea to define a standard interface so that any standard instruction memory/cache (IMEM) can be plugged into our CPU with minimal or no glue logic. Let’s define two interfaces for instruction access. Request I/F handles requests from FU to instruction memory. Response I/F handles the responses from instruction memory to FU. We will define simple valid-ready based Request and Response I/Fs for FU, as these are easy to translate to bus protocols like APB or AXI, if required.

Fetch Unit – Request & Response I/F

Instruction access requires the address of the instruction in memory. The address requested via Request I/F is simply the PC generated by FU. Rather than ready, we will use stall signal terminology at FU interfaces; stall behaves as the inverted version of ready. Cache controllers usually have a stall signal to stall requests from the processor. This signal is represented by cpu_stall.

The response from memory is the fetched instruction, received via Response I/F. Along with the fetched instruction, the response should also include the corresponding PC. The PC serves as the ID that identifies which request the response belongs to. In other words, it indicates the address of the fetched instruction. This is vital information that will be required by the next stages of the CPU pipeline (How? We will see it soon!). Therefore, the fetched instruction and its PC constitute the response packet to FU. The CPU may also need to stall responses from instruction memory at times when the internal pipeline is stalled. This signal is represented by mem_stall.

At this point, let’s define instruction packet in our CPU pipeline = {instruction, PC}
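To make the packet and the handshake concrete, here is a minimal behavioral sketch in Python (not RTL). The names InstrPacket, transfer and imem_response are hypothetical, and the single-cycle memory is an assumption made only for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InstrPacket:
    """Instruction packet as defined above: {instruction, PC}."""
    instr: int  # 32-bit instruction word
    pc: int     # address the instruction was fetched from

def transfer(valid: bool, stall: bool) -> bool:
    """A packet crosses a valid/stall interface only when the sender has
    valid data and the receiver is not stalling (stall == not ready)."""
    return valid and not stall

def imem_response(memory: dict, pc: int) -> InstrPacket:
    """Hypothetical single-cycle IMEM: returns the word at pc and echoes
    the request PC back as the ID of the response."""
    return InstrPacket(instr=memory[pc], pc=pc)

# Two RV32I words: addi x0,x0,0 (NOP) at 0x0 and addi x1,x0,1 at 0x4.
memory = {0x0: 0x00000013, 0x4: 0x00100093}
pkt = imem_response(memory, 0x4)
```

Because the response carries its own PC, later stages never have to guess which request a given instruction belongs to.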

PC Generation Logic

At the heart of FU is the PC generation logic, which drives the Request I/F. Since we are designing a 32-bit CPU with 32-bit (4-byte) instructions in byte-addressed memory, the PC is generated in increments of four. Once out of reset, this logic generates a new PC every clock cycle. The on-reset value of the PC can be hard-coded. This is the address of the very first instruction in memory, i.e., where the CPU starts fetching and executing after coming out of reset. PC generation is free-running logic, stalled only by cpu_stall.

The free-running PC can be overridden by the Flush I/F and by the internal branch prediction logic. The PC generation algorithm is implemented as:

Fetch Unit – PC Generation Logic
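The behavior described in this section can also be sketched as a small next-PC function in Python. This is only a behavioral model under stated assumptions: the priority order (reset, then external flush, then branch prediction, then stall) is my assumption, PC_INIT is a placeholder reset address, and predicted_pc is a hypothetical name for the target supplied by the branch prediction logic.

```python
PC_INIT = 0x0000_0000   # assumed reset address; the real value is a design choice
MASK32 = 0xFFFF_FFFF    # keep the PC within 32 bits

def next_pc(pc, *, reset=False, branch_flush=False, branch_pc=0,
            branch_taken=False, predicted_pc=0, cpu_stall=False):
    """Return the PC to be registered at the next clock edge."""
    if reset:
        return PC_INIT                  # hard-coded on-reset value
    if branch_flush:
        return branch_pc & MASK32       # external flush via Flush I/F
    if branch_taken:
        return predicted_pc & MASK32    # internal branch prediction
    if cpu_stall:
        return pc                       # IMEM cannot accept a request: hold
    return (pc + 4) & MASK32            # free-running: next sequential instruction
```

For example, next_pc(0x8) gives 0xC, while next_pc(0x8, cpu_stall=True) holds the PC at 0x8.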

Instruction Buffers

There are two back-to-back instruction buffers inside FU. Buffer-1 has direct access to Response I/F and buffers the instruction fetched from instruction memory. Buffer-2 buffers the instruction from Buffer-1 and then sends it to DU via DU I/F. These two buffers form the internal instruction pipeline in FU.

Fetch Unit – Instruction Buffers
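As a rough illustration, the two buffers behave like a two-deep shift register of instruction packets. The Python sketch below is a behavioral model only; the class name FetchBuffers, the use of None to represent a bubble, and the single stall input are assumptions of this sketch.

```python
from typing import Optional, Tuple

Packet = Tuple[int, int]  # (instruction, pc)

class FetchBuffers:
    """Buffer-1 -> Buffer-2 instruction pipeline inside FU."""

    def __init__(self) -> None:
        self.buf1: Optional[Packet] = None  # None models an empty slot (bubble)
        self.buf2: Optional[Packet] = None

    def clock(self, response: Optional[Packet], stall: bool = False) -> Optional[Packet]:
        """Advance one clock edge and return what DU sees afterwards."""
        if not stall:
            self.buf2 = self.buf1   # Buffer-2 takes the packet from Buffer-1
            self.buf1 = response    # Buffer-1 takes the packet from Response I/F
        return self.buf2            # when stalled, both buffers hold their contents
```

A packet presented on Response I/F therefore reaches DU two clock edges later, provided no stall occurs in between.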

Branch Prediction Logic

As discussed in the previous blog, we have to add branch prediction logic in FU to mitigate control hazards. We will implement a simple, static branch prediction algorithm. The major aspects of the algorithm are:

Unconditional jumps are always taken.
If it is a Branch instruction and a backward jump, take the branch. Because the chances are:
- This instruction could be part of the loop exit check of some do-while loop. The prediction is more likely to be correct if we take the branch in this case.
If it is a Branch instruction and a forward jump, do not take the branch. Because the chances are:
- This instruction could be part of the loop entry check of some for loop or while loop. The prediction is more likely to be correct if we do not take the branch and continue with the next instruction.
- This instruction could be part of some if-else statement. In this case, we always assume that the if condition is true and continue with the next instruction. This bet theoretically has a 50% probability of being correct.

Fetch Unit – Branch Prediction Logic
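The static rules above can be sketched in Python against the RV32I encodings. This is an illustrative model, not Pequeno’s RTL: the function names are hypothetical, and I assume that the predictor computes the target as PC + immediate and treats only JAL as a resolvable unconditional jump (JALR needs a register value, so it is left out of this sketch).

```python
OPCODE_JAL = 0x6F     # RV32I J-type: unconditional jump
OPCODE_BRANCH = 0x63  # RV32I B-type: conditional branches

def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign

def b_imm(instr: int) -> int:
    """B-type offset: imm[12|10:5] in bits 31:25, imm[4:1|11] in bits 11:7."""
    imm = ((((instr >> 31) & 0x1) << 12) | (((instr >> 7) & 0x1) << 11)
           | (((instr >> 25) & 0x3F) << 5) | (((instr >> 8) & 0xF) << 1))
    return sign_extend(imm, 13)

def j_imm(instr: int) -> int:
    """J-type offset: imm[20|10:1|11|19:12] in bits 31:12."""
    imm = ((((instr >> 31) & 0x1) << 20) | (((instr >> 12) & 0xFF) << 12)
           | (((instr >> 20) & 0x1) << 11) | (((instr >> 21) & 0x3FF) << 1))
    return sign_extend(imm, 21)

def predict(instr: int, pc: int):
    """Return (branch_taken, next_pc) using the static rules."""
    opcode = instr & 0x7F
    if opcode == OPCODE_JAL:
        return True, (pc + j_imm(instr)) & 0xFFFFFFFF   # always taken
    if opcode == OPCODE_BRANCH:
        offset = b_imm(instr)
        if offset < 0:                                  # backward: predict taken
            return True, (pc + offset) & 0xFFFFFFFF
    return False, (pc + 4) & 0xFFFFFFFF                 # forward or not a branch
```

Note that a backward branch can be detected from the sign bit of the immediate alone (instruction bit 31), which keeps the decision cheap in hardware.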

You may want to check pseudo-assembly code for: if-else, for loop, while loop, do-while loop. I used ChatGPT to generate pseudo-assembly code and reach the conclusions for branch prediction!

Branch Prediction Logic monitors and analyzes the instruction packet in Buffer-1 and generates the branch prediction signal: branch_taken. This signal is then registered and piped forward in synchronization with the instruction packet, and is sent to DU via DU I/F.

DU Interface

This is the primary interface between Fetch Unit and Decode Unit to send the payload. The payload includes the fetched instruction and branch prediction information.

DU Interface to send payload

Since this is the interface between two pipeline stages of the CPU, a valid-ready I/F is implemented. The following signals constitute the DU I/F:

instruction packet	{instruction, PC} to DU
branch_taken	Branch prediction signal to DU
bubble	Inverted version of valid to DU
stall	Inverted version of ready from DU

Table: Decode Unit Instruction I/F

Refer to Part-2 to refresh the discussion about the valid-ready I/F designed between the pipeline stages of Pequeno!

Pipeline Stall and Flush in Pequeno

In previous blogs, we discussed the concept and importance of stall and flush in the CPU pipeline. We also discussed various scenarios in the Pequeno architecture where a stall or flush is required. Therefore, appropriate stall and flush logic has to be incorporated in every pipeline stage of the CPU. It is important to identify the conditions under which a stall or flush needs to be generated in a stage, and also which part of the logic in that stage needs to be stalled or flushed.

Some initial thoughts before implementing stall and flush logic:

A pipeline stage may be stalled externally or by internally generated conditions.
A pipeline stage may be flushed externally or by internally generated conditions.
There is no centralized stall or flush generation logic in Pequeno. Every stage may have its own stall and flush generation logic.
A stage can be stalled only by the next stage in the pipeline. The stall from any stage eventually propagates up the pipeline and stalls all the stages upstream of it.

Stall in pipelines is analogous to ripple effect seen in traffic

A stage can be flushed by any of the stages in the downstream pipeline. This is called a pipeline flush, because the whole pipeline in the upstream needs to be flushed simultaneously. In Pequeno, branch miss in Execution Unit (EXU) is the only scenario where a pipeline flush is required.

Stall and Flush Network in Pequeno

Refer to Part-2 to revisit stall and flush behavior in the CPU pipeline.

Stall logic contains the logic to generate local and external stall. Flush logic contains the logic to generate local and pipeline flush.

Local stall is generated internally and used locally to stall the operation of the stage. External stall is generated internally and sent externally to the next stage in the upstream pipeline. Local and external stalls are generated based on internal conditions and external stall from the next stage in the downstream pipeline.

Local flush is the flush which is generated internally and used locally to flush the stage. External flush or Pipeline flush is generated internally and sent externally to the upstream pipeline. This flushes all stages in the upstream simultaneously. Local and external flushes are generated based on internal conditions.

Local and External Stall/Flush in Pipeline Stages
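The stall and flush relationships described above can be summarized in a toy Python model. It is deliberately simplified: stages are listed from upstream to downstream, the stall is modeled as a purely combinational ripple (real hardware may register it), and the function names are hypothetical.

```python
def stall_network(local_cond):
    """local_cond[i] is stage i's internally generated stall condition,
    ordered upstream -> downstream (e.g. [FU, DU, EXU]).
    Returns the stall each stage applies locally and sends upstream."""
    stall = [False] * len(local_cond)
    from_downstream = False                 # the last stage has nobody below it
    for i in reversed(range(len(local_cond))):
        # A stage stalls on its own condition or on the stall from the next stage.
        stall[i] = local_cond[i] or from_downstream
        from_downstream = stall[i]          # forwarded to the stage upstream
    return stall

def pipeline_flush(source_stage, num_stages):
    """A pipeline flush raised by source_stage flushes every stage
    upstream of it simultaneously."""
    return [i < source_stage for i in range(num_stages)]
```

With stages [FU, DU, EXU], a stall raised in EXU gives [True, True, True], while a stall raised in DU gives [True, True, False]: the stall ripples only towards the upstream side.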

Stall Logic in FU

Only DU can externally stall the operation of FU. When DU asserts stall, FU’s internal instruction pipeline (Buffer-1 –> Buffer-2) should be stalled immediately. FU should also assert mem_stall to IMEM, as it cannot accept any more packets from IMEM. Depending on the pipeline/buffering depth in the IMEM, PC Generation Logic may also eventually get stalled by cpu_stall from IMEM, once IMEM can accept no more requests. There are no internal conditions in FU that generate a local stall.
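To see how a DU stall eventually reaches PC Generation Logic, consider this toy cycle-by-cycle Python model. The IMEM here is hypothetical: a simple queue that can hold imem_depth outstanding requests and asserts cpu_stall when it is full. Real cache controllers will differ.

```python
from collections import deque

def simulate(du_stall_per_cycle, imem_depth=2):
    """Return a list of (pc, cpu_stall) pairs, one per clock cycle."""
    imem = deque()   # requests in flight inside the hypothetical IMEM
    pc = 0
    trace = []
    for du_stall in du_stall_per_cycle:
        mem_stall = du_stall                 # FU forwards DU's stall to IMEM
        cpu_stall = len(imem) >= imem_depth  # IMEM full: back-pressure on PC logic
        if not mem_stall and imem:
            imem.popleft()                   # FU accepts one response
        if not cpu_stall:
            imem.append(pc)                  # request accepted by IMEM
            pc += 4                          # free-running PC advances
        trace.append((pc, cpu_stall))
    return trace
```

With a depth of 2 and DU stalling for four cycles, the PC still advances twice before cpu_stall freezes it: the trace is [(4, False), (8, False), (8, True), (8, True)].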

Flush Logic in FU

Only EXU can externally flush FU. EXU initiates branch_flush in the CPU instruction pipeline, along with the address of the next instruction to be fetched after flushing the pipeline (branch_pc). FU provides the Flush I/F to accept this external flush.

Buffer-1, Buffer-2 and PC Generation Logic in FU are flushed by branch_flush. The signal branch_taken from Branch Prediction Logic also acts like a local flush to Buffer-1 and PC Generation Logic. If the branch is taken:

The next instruction should be fetched from the predicted branch target. Therefore, PC Generation Logic is flushed and the next PC is set to branch_pc.
The next instruction arriving at Buffer-1 should be flushed and invalidated, i.e., a NOP/bubble is inserted.

Figure: Buffer-1 and Buffer-2 functionality

Wondering why Buffer-2 is not flushed by branch_taken? Because the branch instruction in Buffer-1 (which is responsible for generating the flush) should be buffered at Buffer-2 in the next clock cycle and allowed to move forward in the pipeline for execution. This instruction shouldn’t be flushed away!

The instruction memory pipeline should also be flushed appropriately. The IMEM flush signal, mem_flush, is generated from branch_flush and branch_taken.
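Putting the flush rules of this section together, here is a compact Python sketch of which blocks each flush source clears. The signal names branch_flush, branch_taken and mem_flush come from the text; the function name and the per-block output names (pc_flush, buf1_flush, buf2_flush) are hypothetical, and I assume mem_flush is a simple OR of its two sources.

```python
def fu_flush_logic(branch_flush: bool, branch_taken: bool) -> dict:
    """Map the external flush (from EXU) and the local flush (from
    Branch Prediction Logic) onto the blocks inside FU."""
    local_or_external = branch_flush or branch_taken
    return {
        "pc_flush": local_or_external,    # PC Generation Logic loads a new PC
        "buf1_flush": local_or_external,  # Buffer-1 gets a NOP/bubble
        "buf2_flush": branch_flush,       # Buffer-2 is flushed only externally,
                                          # so a predicted branch can move on to DU
        "mem_flush": local_or_external,   # IMEM pipeline is flushed in both cases
    }
```

For a predicted-taken branch, fu_flush_logic(False, True) flushes everything except Buffer-2, which matches the reasoning given above for letting the branch instruction itself proceed.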

Fetch Unit – Architecture

Let’s integrate all the micro-architectures we designed so far to complete the architecture of Fetch Unit.

Fetch Unit – Architecture

That’s all folks! We have successfully designed the Fetch Unit of Pequeno 🙂

GitHub Repo of Pequeno

The GitHub repo for Pequeno is live now! I will be adding source code, the test suite, scripts and docs to the repo as the blog progresses. Follow me on GitHub and add the repo to your favorites!

Find the repo here: pequeno_riscv

What’s next

We have so far completed: Fetch Unit (FU). In the upcoming part, we will be designing Decode Unit (DU) of Pequeno.

Visit the complete blog series

This post is part of RISC-V CPU Development blog series

Support

Leave a comment or visit support for any queries/feedback regarding the content of this blog.