Project 2: Solution


Out-of-order execution in a superscalar pipelined processor with support for precise exceptions and interrupts

Rules

The rules for project 2 are the same as project 1:

All students (CS 4290/6290, ECE 4100/6100) must work alone

Sharing of code between students is viewed as cheating and will receive appropriate action in accordance with University policy

It is acceptable for you to compare your results with other students to help debug your program. It is however not acceptable to collaborate on the simulator design or the final experiments

You should do all your work in C or C++. Your code should conform to the C99 or C++11 standard and use only the standard libraries.

If you choose to use Java, it is your responsibility to port the given framework to Java. It is your responsibility to create a shell script that provides the same command line interface as the original framework. Any bugs introduced in either of these are your responsibility.

The project may be updated if errors are discovered. It is your responsibility to check the website often and download new versions of this project description as and when they become available

A Makefile with the frontend will be given to you; you will only need to fill in the empty functions and any additional subroutines you will be using. You will also need to fill in the statistics structure that will be used to output the results.

Discussion on Piazza is highly encouraged, but refrain from posting algorithm details.

Project Description

In this project, you will complete the following:

Construct a simulator for an out-of-order superscalar processor that dispatches F instructions per cycle and uses the Physical Registers/Register Alias Tables approach.

Use a re-order buffer (ROB) to support a consistent machine state, that is, precise exceptions and interrupts. You may use your own ROB design for this.

Use your simulator to determine the appropriate number of functional units, fetch rate and result buses for each benchmark for the default number of ROB entries and PREGs

Use your simulator to determine the appropriate number of ROB entries and PREGs for the default number of functional units.

Directory Description

The procsim_cpp.tar.gz package contains:

Makefile: to compile your code

Procsim_driver.cpp: contains the main() method that runs the simulator. Do not edit this file.

Procsim.hpp: defines structures and declares methods. You may edit this file to declare or define structures and methods.

Procsim.cpp: All your methods are written here

Traces: contains the traces to pass to the simulator (more details in the later section)

Note: procsim_c.tar.gz contains equivalent files

Assumptions:

For simplicity, you do not have to model issue width, retire width, number of result buses and PRF ports. Assume these do not stall your processor.

Understanding the command line parameters

Your project should include a Makefile that builds a binary named procsim in your project's root directory.

The program should run from this root directory as:

./procsim -f F -j J -k K -l L -r ROB -p PREG <trace_file

The command line parameters are as follows:

F – Dispatch rate (instructions per cycle)

J – Number of k0 function units

K – Number of k1 function units

L – Number of k2 function units

ROB - Number of ROB entries

PREG - Number of PREGs

trace_file – Path name to the trace file

Understanding the Input Trace Format

The input traces will be given in the form:

<address> <function unit type> <dest reg #> <src1 reg #> <src2 reg #>

<address> <function unit type> <dest reg #> <src1 reg #> <src2 reg #>

…

where

<address> is the address of the instruction (in hex)

<function unit type> is either "0", "1" or "2"

<dest reg #>, <src1 reg #> and <src2 reg #> are integers in the range [0..127]

Note:

If any reg # is -1, then there is no register for that part of the instruction (e.g., a branch instruction has -1 for its <dest reg #>).

For example:

ab120024 0 1 2 3

ab120028 1 4 1 3

ab12002c 2 -1 4 7

means:

"operation type 0" R1, R2, R3

"operation type 1" R4, R1, R3

"operation type 2" -, R4, R7 : Note: no destination register!

Note:

Instructions of type -1 are executed in the type 1 function units.
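As an illustration of the format above, here is a minimal sketch of parsing one trace line with the C++ standard library. The `TraceInst` struct, its field names, and the helper functions are hypothetical and are not part of the provided framework, which may already read the trace for you.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical record for one trace line; field names are illustrative only.
struct TraceInst {
    uint32_t address;  // instruction address (written in hex in the trace)
    int      fu_type;  // function unit type as written in the trace
    int      dest;     // destination register, -1 if none
    int      src1;     // first source register, -1 if none
    int      src2;     // second source register, -1 if none
};

// Parses a line such as "ab12002c 2 -1 4 7".
// Returns false if any of the five fields cannot be read.
bool parse_trace_line(const std::string& line, TraceInst& out) {
    std::istringstream in(line);
    in >> std::hex >> out.address >> std::dec
       >> out.fu_type >> out.dest >> out.src1 >> out.src2;
    return !in.fail();
}

// Maps the trace's function unit type to the unit class that executes it:
// type -1 instructions run in the type 1 function units.
int fu_class(const TraceInst& inst) {
    return inst.fu_type == -1 ? 1 : inst.fu_type;
}
```

Keeping the raw `fu_type` and deriving the unit class separately preserves what the trace said, which can help when comparing against the validated output.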

Pipeline Structure:

For this project assume the pipeline has 5 stages. Each of these stages is described below:

Stage Name | Number of Cycles per Instruction
--- | ---
Dispatch | Variable, depending on resource conflicts
Schedule | Variable, depending on data dependencies
Execute | 1
State Update | Variable, depending on data dependencies

Understanding each stage:

Dispatch:

The dispatcher attempts to dispatch up to F instructions from the trace into empty slots in the scheduling queue/reservation station, in program (trace) order each cycle. When there are no more slots in the scheduling queue/reservation station, it stalls.

If there are empty slots, the source and destination register numbers are checked, and the physical register file (PRF) is accessed along with the Register Alias Table (RAT) and reorder buffer (ROB) (see lecture notes for details).

Assume the default size of the PRF is 8*F registers (PRegs).

The lowest numbered free Preg is the one chosen by Dispatch to assign to an instruction’s destination register. This is recorded in the RAT.

Assume by default the ROB has the same number of entries as the scheduling queue (see below) entries. Each ROB entry must store the Areg number and the previous Preg for that Areg.

When there are no available physical registers in the PRF to remap an architectural register to, the dispatch unit stalls and cannot dispatch any new instructions.

The dispatch unit also needs to stall and cannot fetch any new instructions if the ROB is full or the scheduling queue is full.

Note:

There are 32 registers in the architectural register file (ARF)

The most important job of this stage is to access the four hardware structures, namely the scheduling queue/reservation station, ROB, PRF and RAT, and update them as required.
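The stall conditions and the lowest-numbered-free-PReg rule described above could be sketched as follows. All names here (`RenameState`, `can_dispatch`, `rename_dest`) are hypothetical. The sketch also assumes that an instruction with no destination register (dest of -1) does not need a free PReg; check that assumption against the validated output.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical rename state; names are illustrative, not from the framework.
struct RenameState {
    std::vector<int>  rat;        // rat[areg] = PReg currently mapped to that AReg
    std::vector<bool> preg_free;  // preg_free[p] is true when PReg p is unallocated
};

// Returns the lowest-numbered free PReg, or -1 if none is free.
int lowest_free_preg(const RenameState& s) {
    for (std::size_t p = 0; p < s.preg_free.size(); ++p)
        if (s.preg_free[p]) return static_cast<int>(p);
    return -1;
}

// Dispatch may proceed only if the scheduling queue and the ROB both have room
// and, when the instruction writes a register, a free PReg exists.
// Assumption: an instruction with dest_areg == -1 needs no PReg.
bool can_dispatch(const RenameState& s, int dest_areg,
                  std::size_t sq_used, std::size_t sq_size,
                  std::size_t rob_used, std::size_t rob_size) {
    if (sq_used >= sq_size || rob_used >= rob_size) return false;
    return dest_areg == -1 || lowest_free_preg(s) != -1;
}

// Allocates the lowest free PReg for dest_areg, records it in the RAT, and
// returns the previous mapping so the ROB entry can free it at retirement.
// Precondition: can_dispatch() returned true and dest_areg != -1.
int rename_dest(RenameState& s, int dest_areg) {
    int new_preg  = lowest_free_preg(s);
    int prev_preg = s.rat[dest_areg];
    s.preg_free[new_preg] = false;
    s.rat[dest_areg] = new_preg;
    return prev_preg;
}
```

The value returned by `rename_dest` is what the ROB entry would store as the previous PReg for that AReg.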

The Scheduling Stage

The size of the scheduling queue (or reservation station) is 2*(number of k0 function units + number of k1 function units + number of k2 function units)

If there are multiple independent instructions ready to fire during the same cycle in the scheduling queue, service them in program order, and based on the availability of functional units.

A fired instruction remains in the reservation station until it completes
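The default structure sizes stated so far all follow from the command-line parameters. A small sketch, with a hypothetical `Sizes` struct and `default_sizes` helper:

```cpp
// Hypothetical container for the derived structure sizes.
struct Sizes {
    int sched_queue;  // scheduling queue / reservation station entries
    int rob;          // reorder buffer entries (default)
    int prf;          // physical registers (default)
};

// Computes the sizes from the dispatch rate F and the function unit counts.
Sizes default_sizes(int f, int k0, int k1, int k2) {
    Sizes s;
    s.sched_queue = 2 * (k0 + k1 + k2);  // 2 * total number of function units
    s.rob = s.sched_queue;               // default: same as the scheduling queue
    s.prf = 8 * f;                       // default: 8 * F PRegs
    return s;
}
```

For example, `default_sizes(4, 1, 2, 3)` gives a 12-entry scheduling queue, a 12-entry ROB and 32 PRegs. The -r and -p options override the ROB and PRF defaults.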

Function Unit Type | Number of Units | Latency
--- | --- | ---
0 | Parameter: k0 | 1
1 | Parameter: k1 | 1
2 | Parameter: k2 | 1

The number of function units is a parameter of the simulation and should be adjustable along the range of 1 to 3 units of each type.

Reminder: Instructions of type -1 are executed in the type 1 function units.
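One way to service ready instructions in program order, limited by function unit availability, is sketched below. The `SqEntry` struct and `select_to_fire` function are hypothetical, and the sketch assumes the queue is stored oldest first.

```cpp
#include <vector>

// Hypothetical scheduling-queue entry; field names are illustrative only.
struct SqEntry {
    unsigned long tag;  // program-order sequence number (lower means older)
    int  fu_type;       // 0, 1, 2, or -1 as given in the trace
    bool src_ready;     // true when both source operands are available
    bool fired;         // true once the instruction has been sent to a unit
};

// Marks ready entries to fire, oldest first, while units of the needed type
// remain. `queue` is assumed to be in program order and `free_units[t]` holds
// the number of idle type-t units this cycle. Returns the tags that fired.
std::vector<unsigned long> select_to_fire(std::vector<SqEntry>& queue,
                                          int free_units[3]) {
    std::vector<unsigned long> fired;
    for (SqEntry& e : queue) {
        if (e.fired || !e.src_ready) continue;
        int t = (e.fu_type == -1) ? 1 : e.fu_type;  // type -1 uses type 1 units
        if (free_units[t] > 0) {
            --free_units[t];
            e.fired = true;  // entry stays in the queue until it completes
            fired.push_back(e.tag);
        }
    }
    return fired;
}
```

A younger instruction can still fire ahead of an older one when the older one is not ready or its unit type is exhausted; program order only breaks ties among instructions competing for the same units.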

Execute:

The function units are in this stage. Their outputs use the result buses to access the PRF and update the ROB.

The State Update Unit

This stage performs in-order retirement from the reorder buffer. It checks whether the oldest entry in the ROB is ready; if it is, it retires the instruction and frees the previous PReg. It writes the result to the AReg recorded in the ROB entry. This unit can retire as many instructions as possible, stopping when the head of the ROB is an instruction that has not yet completed.
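In-order retirement as described above might look like the following sketch. The `RobEntry` struct and `retire` function are hypothetical; the `completed` flag is an assumed addition to the two fields the text requires each ROB entry to store.

```cpp
#include <deque>
#include <vector>

// Hypothetical ROB entry: the AReg and previous PReg required by the text,
// plus an assumed completion flag.
struct RobEntry {
    int  areg;       // destination AReg, or -1 if the instruction has none
    int  prev_preg;  // PReg previously mapped to areg, freed at retirement
    bool completed;  // set when the instruction finishes execution
};

// Retires completed instructions from the head of the ROB, in order, and
// stops at the first entry that has not completed. Returns the number retired.
int retire(std::deque<RobEntry>& rob, std::vector<bool>& preg_free) {
    int retired = 0;
    while (!rob.empty() && rob.front().completed) {
        const RobEntry& head = rob.front();
        if (head.areg != -1 && head.prev_preg != -1)
            preg_free[head.prev_preg] = true;  // old mapping is no longer needed
        rob.pop_front();
        ++retired;
    }
    return retired;
}
```

A completed instruction that sits behind an incomplete one stays in the ROB, which is what keeps the architectural state precise.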

Clock Propagation and actual hardware:

Note that the actual hardware has the following structure:

Dispatch

PIPELINE REGISTER

Scheduling

PIPELINE REGISTER

Execute

PIPELINE REGISTER

State update

Instruction movement only happens when the latches are clocked, which occurs at the rising edge of each clock cycle. You must simulate the same behavior of the pipeline latches, even if you do not model the actual latches. For example, if an instruction is ready to move from scheduling to execute, the motion only takes effect at the beginning of the next clock cycle.

Each stage of the pipeline can be divided into “cycle portions”. Assume the following ordering of cycle portions (you do not need to explicitly model this, but please make sure your simulator follows this ordering of events):

Cycle Portion | Action
--- | ---
1 | Retire the oldest completed instruction(s) from the ROB
2 | Function units write to the PRF and ROB for completing instructions
3 | Any ready/independent instruction in the scheduling queue is marked to fire (depending on availability of function units)
4 | The dispatch unit accesses the ARF, ROB, RAT and PRF and sends out entries to the reservation station. When the hardware structures are full, it stalls.
5 | Instructions are fetched from the trace

Note: Not all events are dependent on each other, and thus it is possible to have a different order of events and still achieve correct output. However, following this order, you should be guaranteed correctness.
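The ordering of cycle portions can be mirrored directly in the per-cycle loop. The skeleton below is only a sketch: the `Simulator` struct and its method names are hypothetical, and the stage bodies just record their names so the call order is visible.

```cpp
#include <string>
#include <vector>

// Hypothetical simulator skeleton; real stage bodies would replace the logging.
struct Simulator {
    std::vector<std::string> log;  // records the order in which stages ran

    void state_update() { log.push_back("retire"); }     // cycle portion 1
    void execute()      { log.push_back("writeback"); }  // cycle portion 2
    void schedule()     { log.push_back("fire"); }       // cycle portion 3
    void dispatch()     { log.push_back("dispatch"); }   // cycle portion 4
    void fetch()        { log.push_back("fetch"); }      // cycle portion 5

    // Simulates one clock cycle by running the stages from the back of the
    // pipeline to the front, matching the cycle portion ordering.
    void run_cycle() {
        state_update();
        execute();
        schedule();
        dispatch();
        fetch();
    }
};
```

Running the later stages first means each stage consumes what the stage before it produced in the previous cycle, which approximates the pipeline latches without modelling them explicitly.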

Output

For each trace, the output contains 2 files:

An output file, which contains :

The processor settings

A record of when each instruction was in each stage

The processor statistics: IPC (retired instructions per cycle), average number of PRegs busy per cycle, and average number of dispatch stall cycles per cycle.

Correctness of your output is required for validation.

Your simulator should output results to the terminal (stdout) and it should match the validated output on Canvas.

A log file, which contains the cycle-by-cycle behavior of the machine. This file is not required for validation. This is simply there to help debug your code.
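The three statistics listed above are all per-cycle averages, so they can be derived from running counters at the end of the simulation. A sketch, using hypothetical `Counters` and `Stats` structs rather than the statistics structure supplied with the framework:

```cpp
// Hypothetical running counters, updated once per simulated cycle.
struct Counters {
    unsigned long cycles;                 // total cycles simulated
    unsigned long retired;                // total instructions retired
    unsigned long busy_preg_sum;          // sum over cycles of PRegs in use
    unsigned long dispatch_stall_cycles;  // cycles in which dispatch stalled
};

// Hypothetical summary of the final statistics.
struct Stats {
    double ipc;                  // retired instructions per cycle
    double avg_pregs_busy;       // average PRegs busy per cycle
    double avg_dispatch_stalls;  // dispatch stall cycles per cycle
};

// Divides each counter by the cycle count; returns zeros for an empty run.
Stats compute_stats(const Counters& c) {
    Stats s = {0.0, 0.0, 0.0};
    if (c.cycles == 0) return s;
    const double n = static_cast<double>(c.cycles);
    s.ipc                 = c.retired / n;
    s.avg_pregs_busy      = c.busy_preg_sum / n;
    s.avg_dispatch_stalls = c.dispatch_stall_cycles / n;
    return s;
}
```

The exact definitions used by the validated output (for example, at which point in the cycle busy PRegs are counted) are not stated here, so treat the counter semantics as assumptions to verify.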

Experiments

After your simulator is validated, for each trace:

1. Find the minimum value of F, k0, k1 and k2 to achieve a high value of IPC.

Suggested approach: Set F, k0, k1 and k2 to a very high number. Record the IPC. This is your target IPC. Find the smallest values of F, k0, k1 and k2 that achieve at least 98% of that IPC.

2. Find the minimum value of the number of ROB entries and registers in the PRF to achieve a high value of IPC.

Suggested approach: Set ROB and PREG to a very high number. Record the IPC. This is your target IPC. Find the smallest values of ROB and PREG that achieve at least 98% of that IPC.

Use all statistics to explain the solution you arrived at.
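The 98% criterion in the suggested approach can be applied mechanically once the IPC has been recorded for each candidate setting. A sketch with a hypothetical helper, assuming the results are collected as (parameter value, IPC) pairs sorted by ascending parameter value:

```cpp
#include <utility>
#include <vector>

// Returns the smallest parameter value whose IPC is at least 98% of the
// target IPC, or -1 if no candidate reaches it. `results` must be sorted by
// ascending parameter value.
int smallest_meeting_target(const std::vector<std::pair<int, double> >& results,
                            double target_ipc) {
    const double threshold = 0.98 * target_ipc;
    for (const std::pair<int, double>& r : results)
        if (r.second >= threshold) return r.first;
    return -1;
}
```

This handles one parameter at a time. Because F, k0, k1 and k2 interact, sweeping them jointly or re-checking each one after changing another may be needed.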

Grading

0% you hand in nothing or hand in something late

+50% you hand in code that shows a reasonable attempt and passes some of our validation tests

+30% your code passes all validation tests

+15% your experiments are completed

+5% your explanation of the results is exemplary and of research quality