Designing RISC-V CPU from scratch – Part 2: Specifications & Architecture

Recap!

I hope all of you have gone through the previous part of the RISC-V CPU Development blog series, where we talked about the RV32I ISA. If not, please go through it before moving ahead.

In this blog, we define the full specs and architecture of Pequeno. Last time, it was simply defined to be a 32-bit CPU. Let us add more detail to get a picture of the architecture to be designed.

We will design a simple single-core CPU that executes one instruction at a time, sequentially in the order of fetching, but still in a pipelined manner. We will not support the RISC-V privileged specs, as we don’t intend to make our core OS capable or interrupt capable as of now.

Pequeno RISC-V CPU – Specifications

32-bit CPU, single-issue, single core.
Classic five-stage RISC pipeline. Strictly in-order pipeline.
Compliant with RV32I User-Level ISA v2.2. Supports all 37 base instructions.
Separate bus interfaces for Instruction & Data memory access. (Why? To be discussed in future…)
Intended for bare-metal applications; not OS capable or interrupt capable. (Limitations rather!)

As said in the previous blog, we will support the RV32I ISA. So, the CPU supports only integer arithmetic.

All registers in the CPU are 32-bit. Address and data buses are also 32-bit. The CPU assumes the classic little-endian, byte-addressable memory space. Each address corresponds to a byte in the address space of the CPU. $\text{0x00 - } \text{byte[7:0], }\text{0x01 - } \text{byte[15:8] ...}$

A 32-bit word can be accessed at 32-bit aligned addresses, i.e., addresses which are multiples of four: $\text{0x00 - } \text{word0, }\text{0x04 - } \text{word1 ...}$
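The byte and word layout described above can be sketched in a few lines of Python. This is only an illustration: the helper names `store_word` and `load_word` and the 8-byte `memory` are hypothetical, and rejecting misaligned accesses is an assumption of this sketch rather than a stated behaviour of Pequeno.

```python
import struct

def store_word(mem, addr, value):
    """Write a 32-bit word at a 4-byte-aligned address in little-endian order."""
    if addr % 4 != 0:
        raise ValueError("word access must be 32-bit aligned in this sketch")
    mem[addr:addr + 4] = struct.pack("<I", value & 0xFFFFFFFF)

def load_word(mem, addr):
    """Read a 32-bit word from a 4-byte-aligned address."""
    if addr % 4 != 0:
        raise ValueError("word access must be 32-bit aligned in this sketch")
    return struct.unpack("<I", mem[addr:addr + 4])[0]

memory = bytearray(8)                 # hypothetical memory holding word0 and word1
store_word(memory, 0x00, 0x11223344)  # word0 at address 0x00
store_word(memory, 0x04, 0xAABBCCDD)  # word1 at address 0x04

# The least significant byte of a word sits at the lowest address.
print([hex(b) for b in memory[0:4]])  # ['0x44', '0x33', '0x22', '0x11']
print(hex(load_word(memory, 0x04)))   # 0xaabbccdd
```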

Pequeno – Address Space

Pequeno is a single-issue CPU, i.e., only one instruction is fetched at a time from memory and issued to be decoded and executed. A single-issue pipelined processor can reach at most IPC = 1 (equivalently, a best-case CPI = 1), i.e., the ultimate goal is to execute at the rate of one instruction per clock cycle. This is theoretically the maximum performance achievable.

The classic five-stage RISC pipeline is the fundamental architecture for understanding any other RISC architecture. This makes it the ideal and simplest choice for our CPU. The architecture of Pequeno is built around this five-stage pipeline. Let’s take a deeper dive into the underlying concepts.

For simplicity, we will not be supporting timers, interrupts, and exceptions in the CPU pipeline. Hence, CSRs and privilege levels need not be implemented either. The RISC-V Privileged ISA is therefore not part of the current implementation of Pequeno.

Why not a non-pipelined CPU?

The simplest approach to designing a CPU is the non-pipelined way. Let’s see a couple of design approaches for a non-pipelined RISC CPU and understand their drawbacks.

Let’s assume the classic sequence of steps followed by a CPU for instruction execution: Fetch, Decode, Execute, Memory Access, and Writeback.

The first design approach is to design the CPU like an FSM with four or five states, which performs every operation sequentially. For example:

CPU like an FSM

But this architecture takes a bad hit on instruction execution rate, as it takes multiple clock cycles to complete the execution of a single instruction. Say, a register write would take 3 cycles. For a load/store instruction, memory latency comes into the picture as well. This is a bad and primitive approach to designing a CPU. Let’s dump this for good!

The second approach: instructions are fetched from instruction memory, decoded, and executed by fully combinatorial logic. The result from the ALU is then written back to the register file. This whole process up to writeback may be done in a single clock cycle. Such a CPU is called a single-cycle CPU. If the instruction requires data memory access, read/write latency should be taken into account. If the read/write latency is one clock cycle, store instructions may still finish execution in one clock cycle like all other instructions, but load instructions may take one extra clock cycle, as the loaded data has to be written back to the register file. The PC generation logic has to take care of the implications of this latency. If the data memory read interface is combinatorial (asynchronous read), the CPU becomes truly single-cycle for all instructions.

Single Cycle RISC-V CPU

The main drawback of this architecture is obviously the long critical path through the combinatorial logic from fetch to memory/register file write, which constrains the timing performance. However, this design approach is simple and suitable for CPUs in low-end microcontrollers, where low clock speed, low power consumption, and small area are desirable.

Pipelining CPU

To achieve higher clock speeds and performance, we can split the sequential processing of instructions by the CPU into sub-processes. Each sub-process is then assigned to an independent processing unit. These processing units are cascaded sequentially to form a Pipeline. All units work in parallel and operate upon different parts of the instruction execution. Multiple instructions can be processed in parallel in this way. This technique to implement instruction-level parallelism is called Instruction Pipelining. This execution pipeline constitutes the core of a pipelined CPU.

Pipelining breaks the critical path by segregating the combo logic and adding registers in between

The classic five-stage RISC pipeline has five processing units, also known as Pipeline Stages. The stages are: Fetch (IF), Decode (ID), Execute (EX), Memory Access (MEM), WriteBack (WB). The working of the pipeline can be visualized as:
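As a rough text rendering of this timing, the following Python sketch prints which stage each instruction occupies in every clock cycle. It assumes an ideal pipeline that never stalls; the function name `pipeline_diagram` is made up for this illustration.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(num_instructions):
    """Return one row per instruction; each row lists the stage occupied per cycle."""
    total_cycles = len(STAGES) + num_instructions - 1
    rows = []
    for i in range(num_instructions):
        row = ["."] * total_cycles          # "." means the instruction is not in the pipeline
        for s, name in enumerate(STAGES):
            row[i + s] = name               # instruction i enters IF in cycle i+1
        rows.append(row)
    return rows

rows = pipeline_diagram(4)
print("cycle:   " + " ".join(f"{c:>3}" for c in range(1, len(rows[0]) + 1)))
for i, row in enumerate(rows, start=1):
    print(f"instr-{i}: " + " ".join(f"{cell:>3}" for cell in row))
```

Each row is shifted by one cycle relative to the previous one, so in any given cycle (column) every stage is working on a different instruction.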

In each clock cycle, a different part of an instruction is processed, and each stage processes a different instruction. If you observe closely, instruction-1 finishes execution only at the 5th cycle. This latency is called Pipeline Latency, $\Delta$. With one cycle per stage, this latency equals the number of pipeline stages. After this latency, instruction-2 finishes execution at the 6th cycle, instruction-3 at the 7th cycle, and so on. We can theoretically compute the throughput (Instructions Per Cycle, IPC) as:

$N$ instructions take $(\Delta + N-1)$ cycles to execute.

$\therefore \text{IPC} = \frac{N}{(\Delta + N-1)}$

Theoretical max. IPC achievable is when $N \rightarrow \infty$

$\lim_{N\to \infty} \frac{N}{(\Delta + N-1)} = 1 \text{ instruction per cycle}$

Thus, an ideal pipelined CPU approaches an execution rate of one instruction per clock cycle as the instruction count grows. This is the max. possible IPC in a single-issue processor.
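To get a feel for how quickly the IPC approaches 1, the formula can be evaluated for a few instruction counts. This is a small sketch assuming a stall-free pipeline with $\Delta = 5$; the function name `ideal_ipc` is just for illustration.

```python
def ideal_ipc(n, latency=5):
    """IPC of a stall-free pipeline: N instructions take (latency + N - 1) cycles."""
    return n / (latency + n - 1)

for n in (1, 5, 100, 10_000):
    print(f"N = {n:>6}: IPC = {ideal_ipc(n):.4f}")
```

For a single instruction the IPC is only 0.2, since the whole pipeline latency is paid for one result; the penalty fades as more instructions stream through.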

By dividing the critical path across multiple pipeline stages, the CPU can now run at a much higher clock speed as well. Mathematically, this boosts the throughput of a pipelined CPU over an equivalent non-pipelined CPU by a factor $S = \frac{N \cdot \Delta}{(\Delta + N-1)} \rightarrow \Delta \text{ as } N \rightarrow \infty$.

This is called Pipeline Speed-up. In simpler words, a pipelined CPU with $\Delta$ stages can ideally run at $\Delta$ times the clock speed of the non-pipelined one.
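A quick numeric check of the speed-up, as a sketch under simplifying assumptions: every stage has the same delay (a hypothetical 2 ns here), the non-pipelined CPU spends the sum of all stage delays on each instruction, and the overhead of the pipeline registers is ignored.

```python
def execution_time_ns(n, stages, t_stage_ns, pipelined):
    """Total time to run n instructions, with or without pipelining."""
    if pipelined:
        return (stages + n - 1) * t_stage_ns   # fill latency, then one result per cycle
    return n * stages * t_stage_ns             # every instruction pays all stage delays

def speedup(n, stages=5, t_stage_ns=2):
    """Ratio of non-pipelined to pipelined execution time."""
    return (execution_time_ns(n, stages, t_stage_ns, pipelined=False)
            / execution_time_ns(n, stages, t_stage_ns, pipelined=True))

for n in (1, 10, 1000, 1_000_000):
    print(f"N = {n:>7}: speed-up = {speedup(n):.3f}")
```

With a single instruction there is no gain at all, while for long instruction streams the ratio tends towards the number of stages. In a real design the stages are rarely perfectly balanced, so the achievable clock speed gain is lower than this ideal figure.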

Pipelining normally increases area/power consumption but the performance gain is worth it.

The mathematical computations assume that the pipeline never stalls, i.e., the data keeps moving forward from one stage to another every clock cycle. But in an actual CPU, the pipeline can stall due to multiple reasons, primarily due to Structural / Control / Data Dependency.

An example: register X cannot be read by the $N^\text{th}$ instruction because X has not yet been written back by the $(N-1)^\text{th}$ instruction, which modified the value of X. This is an example of a Data Hazard in pipelines.
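The dependency in this example can be spotted by comparing register numbers of back-to-back instructions. The sketch below is only an illustration of that check, not Pequeno's hazard logic; the `Instr` record and the function name are hypothetical. The one RISC-V fact it relies on is that x0 is hard-wired to zero, so a write to x0 never creates a dependency.

```python
from typing import NamedTuple, Optional

class Instr(NamedTuple):
    text: str                   # assembly text, for display only
    rd: Optional[int]           # destination register number, None if nothing is written
    rs1: Optional[int] = None   # first source register number
    rs2: Optional[int] = None   # second source register number

def reads_result_of(prev, curr):
    """True if curr reads the register that prev writes (read-after-write)."""
    if prev.rd is None or prev.rd == 0:   # x0 is hard-wired to zero in RISC-V
        return False
    return prev.rd in (curr.rs1, curr.rs2)

i1 = Instr("add x5, x1, x2", rd=5, rs1=1, rs2=2)
i2 = Instr("sub x6, x5, x3", rd=6, rs1=5, rs2=3)   # needs x5 produced by i1
i3 = Instr("and x7, x1, x2", rd=7, rs1=1, rs2=2)   # independent of i1

print(reads_result_of(i1, i2))  # True
print(reads_result_of(i1, i3))  # False
```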

Pipeline Hazards are out of scope at this point of time. We will discuss them in upcoming parts of the blog series.

Pequeno RISC-V CPU – Architecture

Pequeno incorporates the classic five-stage RISC pipeline in its architecture. We will implement a strictly in-order pipeline. In In-order Processors, instructions are fetched, decoded, executed, and completed/committed in compiler-generated order. If one instruction stalls, the whole pipeline stalls.

In Out-of-order Processors, instructions are fetched and decoded in compiler-generated order, but execution can happen in different order. If one instruction stalls, it need not stall the subsequent instructions unless there is a dependency. Independent instructions may be allowed to pass forward. The execution may still be completed/committed in-order (that’s what happens in most CPUs today). This opens doors for implementing various architectural techniques to significantly improve the throughput and performance by cutting down clock cycles wasted on stalls and minimizing the insertion of bubbles (What are “bubbles”!? Read on…).

Out-of-order Processors are quite complex due to dynamic scheduling of instructions, but they are now the de facto pipeline architecture in today’s high-performance CPUs.

Pequeno – CPU Architecture

The five pipeline stages are designed as independent units: Fetch Unit (FU), Decode Unit (DU), Execution Unit (EXU), Memory Access Unit (MACCU), and WriteBack Unit (WBU).

Fetch Unit (FU): Stage-1 of the pipeline, which interfaces with instruction memory. FU fetches instructions from the instruction memory and sends them to the Decode Unit. FU may contain instruction buffers and initial branch logic.

Decode Unit (DU): Stage-2 of the pipeline, which decodes instructions from FU. DU also initiates read access on the Register File. The packets from DU and the Register File are retimed to be in sync and sent together to the Execution Unit.

Execution Unit (EXU): Stage-3 of pipeline which validates and executes all decoded instructions from DU. Invalid/unsupported instructions are not allowed to move further in the pipeline. They become bubbles. ALU takes care of all integer arithmetic and logical instructions. Branch Unit takes care of jump/branch instructions. Load-Store Unit takes care of load/store instructions which require memory access.

Memory Access Unit (MACCU): Stage-4 of the pipeline, which interfaces with data memory. MACCU initiates all memory accesses as directed by EXU. Data memory is the address space, which may comprise data RAM, memory-mapped IO peripherals, bridges, interconnects etc.

WriteBack Unit (WBU): Stage-5 or the final stage of pipeline. Instructions finish execution here. WBU is responsible for writing back results from EXU/MACCU (load-data) to Register File.

Interface between Pipeline Stages in the CPU

Between pipeline stages, valid-ready handshaking is implemented. This is not so obvious at first look. Each stage registers and sends a packet to the next stage. The packet may carry instruction/control/data information to be used by the next stage or by subsequent stages. The packet is qualified by a valid signal. An invalid packet is called a Bubble in the pipeline (read about Pipeline Stalls and Bubbles here). A Bubble is nothing but a “hole” in the pipeline, which just moves forward through the pipeline doing nothing in effect. This is analogous to a NOP instruction. But don’t think they are of no use! We will see one of their uses when we discuss Pipeline Hazards in upcoming parts. The following table defines a Bubble in Pequeno’s instruction pipeline.

Instruction in the packet	packet valid	Bubble in the pipeline?
NOP	HIGH/LOW	YES
XXX	LOW	YES

Table: Defining Bubble in the Instruction Pipeline
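The table boils down to a one-line predicate. The sketch below is illustrative: the function name `is_bubble` is hypothetical, and the NOP is assumed to be the canonical RV32I encoding `addi x0, x0, 0` (0x00000013).

```python
NOP = 0x00000013  # addi x0, x0, 0: the canonical RV32I NOP encoding

def is_bubble(instruction, valid):
    """A packet is a bubble if it is invalid, or if it carries a NOP."""
    return (not valid) or instruction == NOP

ADDI_X1_X0_10 = 0x00A00093  # addi x1, x0, 10: an arbitrary real instruction

print(is_bubble(NOP, valid=True))             # True:  NOP, valid HIGH
print(is_bubble(NOP, valid=False))            # True:  NOP, valid LOW
print(is_bubble(ADDI_X1_X0_10, valid=False))  # True:  any instruction, valid LOW
print(is_bubble(ADDI_X1_X0_10, valid=True))   # False: a real instruction in flight
```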

Each stage can also stall the previous stage by asserting the stall signal. Once stalled, a stage holds its packet until stall is de-asserted. This signal is the same as an inverted ready signal. In in-order processors, a stall generated at any stage acts like a global stall, as it eventually stalls the whole pipeline.

Handshaking between Pipeline Stages
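The stall behaviour can be mimicked with a small cycle-level model in Python. This is a sketch under assumptions that are not taken from Pequeno's RTL: `regs[i]` is the packet registered at the output of stage `i`, `None` stands for a bubble (valid low), a stage that is `busy` stalls its predecessor and sends a bubble downstream, and the stall propagates backwards in the same cycle.

```python
def step(regs, incoming, busy):
    """Advance the pipeline by one clock cycle.

    regs[i]  : packet held in the output register of stage i (None = bubble)
    incoming : packet offered to stage 0 in this cycle
    busy[i]  : True if stage i cannot accept a new packet in this cycle
    Returns (new_regs, accepted); accepted tells whether `incoming` was taken.
    """
    n = len(regs)
    # stall[i] is the stall signal that stage i sends to stage i-1 (inverted ready).
    # stall[n] models the sink after the last stage, which never stalls here.
    stall = [False] * (n + 1)
    for i in reversed(range(n)):
        stall[i] = busy[i] or stall[i + 1]

    new_regs = []
    for i in range(n):
        if stall[i + 1]:
            new_regs.append(regs[i])   # next stage is not ready: hold the packet
        elif busy[i]:
            new_regs.append(None)      # nothing new to send: a bubble goes downstream
        else:
            new_regs.append(incoming if i == 0 else regs[i - 1])
    return new_regs, not stall[0]

regs = ["I3", "I2", "I1"]                       # I1 is the oldest instruction
print(step(regs, "I4", [False, False, False]))  # (['I4', 'I3', 'I2'], True)
print(step(regs, "I4", [False, True, False]))   # (['I3', None, 'I2'], False)
```

In the second call, the middle stage is busy: the stage before it holds I3, the new packet I4 is not accepted, and a bubble is inserted behind I2, which keeps moving forward.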

The flush signal is used to flush the pipeline. Flushing invalidates all packets registered by the previous stages in one go, because they have been identified as no longer useful.

Pipeline Flush

An example is when the pipeline has fetched and decoded instructions from the wrong branch after a jump/branch instruction, and this is identified to be wrong only at the execution stage. Now the pipeline should be flushed, and instructions have to be fetched from the correct branch!
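A minimal sketch of this flush in Python, assuming the pipeline is modelled as a mapping from stage name to the instruction currently in that stage, with `None` standing for a bubble. The function name `flush_before` and the instruction labels are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def flush_before(pipeline, resolving_stage):
    """Invalidate the packets in all stages earlier than resolving_stage."""
    idx = STAGES.index(resolving_stage)
    return {stage: (None if STAGES.index(stage) < idx else packet)
            for stage, packet in pipeline.items()}

# A branch sits in EX; the two younger instructions came from the wrong path.
pipeline = {"IF": "wrong-2", "ID": "wrong-1", "EX": "branch",
            "MEM": "older-1", "WB": "older-2"}

print(flush_before(pipeline, "EX"))
# {'IF': None, 'ID': None, 'EX': 'branch', 'MEM': 'older-1', 'WB': 'older-2'}
```

The branch itself and the older instructions ahead of it are untouched; only the younger, wrongly fetched packets turn into bubbles.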

What next?

While pipelining increases the performance significantly, it adds much complexity to the CPU architecture. Pipelining a CPU always comes with its evil twin brother, Pipeline Hazards! At this point of time, let us assume we know NOTHING about Pipeline Hazards. We designed the architecture without considering hazards.

Moving on in the blog series, we will discuss hazards, find flaws in the existing architecture, and add the necessary micro-architecture to mitigate the hazards.

Visit the complete blog series

This post is part of RISC-V CPU Development blog series
