Pipelined CPU Homework Assignment 4 Solution

Pipelined CPU (74%)

In this section, we are going to implement a pipelined CPU.

The provided instruction memory is as follows:

Signal    I/O     Width  Functionality
i_clk     Input   1      Clock signal
i_rst_n   Input   1      Active-low asynchronous reset
i_valid   Input   1      Indicates that the PC address from the CPU is ready
i_addr    Input   64     64-bit address from the CPU
o_valid   Output  1      High when the instruction is ready
o_inst    Output  32     32-bit instruction to the CPU

And the provided data memory is as follows:

Signal      I/O     Width  Functionality
i_clk       Input   1      Clock signal
i_rst_n     Input   1      Active-low asynchronous reset
i_data      Input   64     64-bit data to be stored
i_w_addr    Input   64     64-bit target address for writes
i_r_addr    Input   64     64-bit target address for reads
i_MemRead   Input   1      One-cycle signal that sets the current mode to reading
i_MemWrite  Input   1      One-cycle signal that sets the current mode to writing
o_valid     Output  1      One-cycle signal indicating that data is ready (used for ld)
o_data      Output  64     64-bit data from data memory (used for ld)

The test environment is as follows:

We will only test the instructions highlighted in the red boxes in the figures below.

One more instruction must also be implemented:

i_inst                                Function  Description
32'b11111111111111111111111111111111  Stop      Stop and set o_finish to 1
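
As a quick sanity check on the Stop encoding, the following Python sketch models only the comparison against the all-ones word. The names STOP_INST and is_stop are hypothetical and are not part of the provided files; how o_finish is actually driven in cpu.v is up to your design.

```python
# Stop encoding from the table: 32 ones, which is 0xFFFFFFFF in hex.
STOP_INST = 0xFFFFFFFF


def is_stop(inst: int) -> bool:
    """Return True when the low 32 bits of inst match the Stop encoding."""
    return (inst & 0xFFFFFFFF) == STOP_INST


# The binary literal from the table and the hex constant are the same value.
print(int("1" * 32, 2) == STOP_INST)
print(is_stop(0xFFFFFFFF), is_stop(0x00000013))
```

In hardware this corresponds to a single 32-bit equality check on the fetched instruction.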

All environment settings are the same as in HW3, except that the rules for accessing data_memory.v and instruction_memory.v and the module interfaces have changed this time. See supplementary.pdf for more information.

You may want to refer to the diagram of the pipelined CPU in the textbook.

To make sure that pipelining is actually implemented in your design, we will use the open-source synthesis tool Yosys to check the timing of the critical path in your design. We will also use the FreePDK 45 nm standard cell library provided here.

You can either build Yosys yourself or use the provided Docker image:

docker pull ntuca2020/hw4 # size ~ 1.28G

docker run --name=test -it ntuca2020/hw4

cd /root

ls

Folder structure for this homework:

HW4/
|-- testcases/
|   |-- generate.s
|   `-- generate.cpp
|-- codes/
|   |-- cpu.v
|   |-- data_memory.v        // provided data memory
|   `-- instruction_memory.v // provided instruction memory
|-- testbench.v
|-- Makefile
|-- cpu.ys                   // synthesis command
`-- stdcells.lib             // FreePDK 45 nm standard cell library

List all the modules you use in the cpu.ys file, then run:

make # Compile

make test # Test all test cases

make time # Show the timing and area used in your design

Information about your design is shown when running make time:

ABC: WireLoad = "none" Gates = 13123 ( 14.8 %) Cap = 3.2 ff ( 1.9 %)

Area = 17519.56 ( 87.9 %) Delay = 1091.13 ps ( 5.1 %)
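
Because the grading treats frequency as the inverse of delay, the Delay value in this output can be converted to a clock frequency. The sketch below assumes the maximum frequency is simply 1/delay with no extra timing margin; frequency_mhz is a hypothetical helper name, not part of the provided tooling.

```python
def frequency_mhz(delay_ps: float) -> float:
    """Convert a critical-path delay in picoseconds to a frequency in MHz.

    1 / (delay_ps * 1e-12 s) gives Hz; dividing by 1e6 gives MHz,
    which simplifies to 1e6 / delay_ps.
    """
    return 1e6 / delay_ps


delay_ps = 1091.13  # Delay value from the sample ABC output
print(f"{frequency_mhz(delay_ps):.1f} MHz")
```

For the sample output, this works out to roughly 916 MHz.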

You can optimize the CPU for the 3 workloads (code address range, data address range, etc.), but the optimization should not affect the other test cases.

Grading:

• Correctness check (10%)

– 10 test cases, each worth 1% for the correctness check

• Required area and frequency (inverse of delay) (32%)

– Area < 25,000 µm², and frequency > 10 MHz (5%)
– Area < 25,000 µm², and frequency > 100 MHz (5%)
– Area < 25,000 µm², and frequency > 200 MHz (5%)
– Area < 25,000 µm², and frequency > 500 MHz (5%)
– Area < 25,000 µm², and frequency > 800 MHz (4%)
– Area < 25,000 µm², and frequency > 1000 MHz (3%)
– Area < 25,000 µm², and frequency > 1200 MHz (3%)
– Area < 25,000 µm², and frequency > 1500 MHz (2%)

• Required time (cycle count * clock period) to finish the workloads from the last 3 test cases. (32%)

– Workload1 < 100,000 ns (5%)

– Workload2 < 150,000 ns (5%)

– Workload3 < 200,000 ns (5%)

– Workload1 < 10,000 ns (5%)

– Workload2 < 15,000 ns (4%)

– Workload3 < 20,000 ns (3%)

– Workload1 < 5,000 ns, and Workload2 < 20,000 ns, and Workload3 < 15,000 ns (3%)

– Workload1 < 3,500 ns, and Workload2 < 9,000 ns, and Workload3 < 10,000 ns (2%)
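
To estimate where a design lands in the lists above, the following Python sketch tallies the frequency tiers and computes a workload finish time as cycle count times clock period. It assumes the tiers are cumulative (their points sum to 32%) and that the clock period equals the synthesized delay. The function names and the cycle count of 5000 are hypothetical, chosen only for illustration.

```python
AREA_LIMIT = 25_000.0  # same unit as the Area value reported by ABC

# (frequency threshold in MHz, points in %) copied from the grading list
FREQUENCY_TIERS = [(10, 5), (100, 5), (200, 5), (500, 5),
                   (800, 4), (1000, 3), (1200, 3), (1500, 2)]


def frequency_points(area: float, delay_ps: float) -> int:
    """Sum the points of every tier the design clears (assumed cumulative)."""
    if area >= AREA_LIMIT:
        return 0  # every tier also requires the area bound
    freq_mhz = 1e6 / delay_ps  # frequency is the inverse of delay
    return sum(points for threshold, points in FREQUENCY_TIERS
               if freq_mhz > threshold)


def workload_time_ns(cycles: int, clock_period_ps: float) -> float:
    """Finish time in ns: cycle count times clock period."""
    return cycles * clock_period_ps / 1000.0


# Sample ABC numbers from this handout; the cycle count is made up.
print(frequency_points(17519.56, 1091.13))
print(workload_time_ns(5000, 1091.13))
```

With the sample ABC numbers, the design clears every tier up to 800 MHz, which adds up to 24 points under the cumulative assumption.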

Report (12%)

You can describe your pipeline design and how you implemented it, and answer the following questions.

• What is the latency of each module in your design (e.g. ALU, register file)? (2%)

• Which path is the critical path of your CPU, and how can you decrease its latency? (2%)

• How do you resolve data hazards? (2%)

• How do you resolve control hazards? (2%)

• Describe the attributes of the 3 different workloads. Which one can be improved tremendously by a branch predictor? (2%)

• Is it always beneficial to insert multiple pipeline stages into a design? How does it affect the latency? (2%)
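
One way to start reasoning about the last question is a simple timing model in which the combinational logic is split evenly across the stages and each stage adds a fixed register overhead. The Python sketch below is only such a model: the 5000 ps logic delay and 100 ps overhead are made-up values, pipeline_timing is a hypothetical helper, and real designs rarely split evenly.

```python
def pipeline_timing(logic_delay_ps: float, stages: int,
                    reg_overhead_ps: float) -> tuple[float, float]:
    """Return (clock period, single-instruction latency) in ps.

    Assumes the logic divides evenly across stages and each stage pays
    the same register overhead.
    """
    period = logic_delay_ps / stages + reg_overhead_ps
    latency = stages * period
    return period, latency


for stages in (1, 5, 10):
    period, latency = pipeline_timing(5000.0, stages, 100.0)
    print(f"{stages:2d} stages: period {period:.0f} ps, latency {latency:.0f} ps")
```

In this model the clock period shrinks as stages are added, while the latency of a single instruction grows by one register overhead per stage. Hazards and stalls are not modeled here.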

Submission

• Zip your files and upload them to CEIBA in the following format:

Rxxxxxxxx/ <-- zip this folder
|-- cpu.ys          // specify the *.v files used, not including testbench and memory
|-- cpu.f           // specify the *.v files used, including testbench and memory
|-- codes/          // put all your *.v files here, including cpu_syn.v
|-- handwritten.pdf // handwritten part
`-- report.pdf      // report on programming part

• Late submission within one week: (Total score)*0.8

• Late submission within two weeks: (Total score)*0.6

• Late submission after more than two weeks: (Total score)*0

• If you have any questions, please email 110fall.ca@gmail.com.

• TA hours for this homework: Thu 14:00-17:00

Deadline

• Deadline: 2021/12/14 23:59 (from 2021/11/09 12:00)

Supplementary

• HW4 introduction video (2020)

Practice (0%)

• You can do this as practice; there is no need to hand in this part in your handwritten file.
