$29
Out of order execution in a superscalar pipelined processor with support for
precise exceptions and interrupts
Rules
The rules for project 2 are the same as project 1:
All students (CS 4290/6290, ECE 4100/6100) must work alone
Sharing of code between students is viewed as cheating and will receive appropriate action in accordance with University policy
It is acceptable for you to compare your results with other students to help debug your program. It is however not acceptable to collaborate on the simulator design or the final experiments
You should do all your work in the C or C++ programming language, and should be written according to the C99 or C++11 standards, using only the standard libraries.
If you choose to use Java, it is your responsibility to port the given framework to Java. It is your responsibility to create a shell script that provides the same command line interface as the original framework. Any bugs introduced in either of these are your responsibility.
The project may be updated if errors are discovered. It is your responsibility to check the website often and download new versions of this project description as and when they become available
7. .A Makefile with the frontend will be given to you; you will only need to fill in the empty functions and any additional subroutines you will be using. You will also need to fill in the statistics structure that will be used to output the results.
8. Discussion on Piazza is highly encouraged but refrain from posting algorithm details
Project Description
In this project, you will complete the following:
Construct a simulator for an out-of-order superscalar processor that dispatches F instructions per cycle and uses the Physical Registers/Register Alias Tables approach.
Use a re-order buffer (ROB) to support the notion of consistent state or precise exceptions and interrupts. To support this use your idea of re-order buffer (ROB)
Use your simulator to determine the appropriate number of functional units, fetch rate and result buses for each benchmark for the default number of ROB entries and PREGs
Use your simulator to determine the appropriate number of ROB entries and PREGs for the default number of functional units.
Directory Description
The procsim_cpp.tar.gz package contains:
Makefile: to compile your code
Procsim_driver.cpp:contains the main() method to run the simulator :Do not edit this file
Procsim.hpp: Used for defining structures and method declarations : you may edit this file to declare or define structures and methods
Procsim.cpp: All your methods are written here
Traces: contains the traces to pass to the simulator (more details in the later section)
Note:procsim_c.tar.gz contains equivalent files
Assumptions:
For simplicity, you do not have to model issue width, retire width, number of result buses and PRF ports. Assume these do not stall your processor.
Understanding the command line parameters
Your project should include a Makefile, which builds binary in your project's root directory named procsim.
The program should run from this root directory as:
./procsim –f F –j J –k K –l L -r ROB -p PREG <trace_file
The command line parameters are as follows:
F – Dispatch rate (instructions per cycle)
J – Number of k0 function units
K –Number of k1 function units
L – Number of k2 function units
ROB - Number of ROB entries
PREG - Number of PREGs
trace_file – Path name to the trace file
Understanding the Input Trace Format
The input traces will be given in the form:
<address<function unit type <dest reg # <src1 reg # <src2 reg# <address <function unit type <dest reg # <src1 reg # <src2 reg# …
where
<address is the address of the instruction (in hex)
<function unit type is either "0", "1" or "2"
<dest reg #
<src1 reg#
<src2 reg # are integers in the range [0..127]
Note:
If any reg # is -1, then there is no register for that part of the instruction (e.g., a branch instruction has -1 for its <dest reg #) For example:
ab120024 0 1 2 3
ab120028 1 4 1 3
ab12002c 2 -1 4 7
means:
"operation type 0" R1, R2, R3
"operation type 1" R4, R1, R3
"operation type 2" -, R4, R7 : Note: no destination register!
Note:
Instructions of type -1 are executed in the type 1 function units.
Pipeline Structure:
For this project assume the pipeline has 5 stages. Each of these stages is described below:
Stage Name
Number of Cycles per instruction
Dispatch
Variable, depending upon resource conflicts
Schedule
Variable, depending upon data dependencies
Execute
1
Status Update
Variable, depends on data dependencies
Understanding each stage:
Dispatch:
The dispatcher attempts to dispatch up to F instructions from the trace into empty slots in the scheduling queue/reservation station, in program (trace) order each cycle. When there are no more slots in the scheduling queue/reservation station, it stalls.
If there are empty slots, the sources and destination register numbers are checked and physical register file (PRF) is accessed along with the Register Alias Table (RAT) and Reorder buffer (ROB) (see lecture notes for the details).
Assume default size of PRF to be 8*Fregisters (Pregs).
The lowest numbered free Preg is the one chosen by Dispatch to assign to an instruction’s destination register. This is recorded in the RAT.
Assume by default the ROB has the same number of entries as the scheduling queue (see below) entries. Each ROB entry must store the Areg number and the previous Preg for that Areg.
When there are no available physical registers in the PRF to remap an architectural register to, the dispatch unit stalls and cannot dispatch any new instructions.
The dispatch unit also needs to stall and cannot fetch any new instructions if the ROB is full or the scheduling queue is full.
Note:
There are 32 registers in the architectural register file (ARF)
The most important job of this stage is to access the three different hardware structures namely scheduling queue/reservation station, ROB, PRF, RAT and update them as required.
The Scheduling Stage
The size of the scheduling queue (or reservation station) is 2*(number of k0 function units + number of k1 function units + number of k2 function units)
If there are multiple independent instructions ready to fire during the same cycle in the scheduling queue, service them in program order, and based on the availability of functional units.
A fired instruction remains in the reservation station until it completes
Function Unit Type
Number of Units
Latency
0
Parameter: k0
1
1
Parameter: k1
1
2
Parameter: k2
1
The number of function units is a parameter of the simulation and should be adjustable along the range of 1 to 3 units of each type.
Reminder: Instructions of type -1 are executed in the type 1 function units.
Execute:
The function units are present in this stage and the outputs from the function units use the result buses to access the PRF, update the ROB.
The State Update Unit
This stage performs in-order retirement from the reorder buffer. It checks if the oldest entry in the ROB is ready, if its ready, retire the instruction and free up previous PReg. It writes the result to the Areg shown in the ROB. This unit can retire as many instructions as possible until the head of the ROB is an instruction that has not yet completed.
Clock Propagation and actual hardware:
Note that the actual hardware has the following structure:
Dispatch
PIPELINE REGISTER
Scheduling
PIPELINE REGISTER
Execute
PIPELINE REGISTER
State update
Instruction movement only happens when the latches are clocked, which occurs at the rising edge of each clock cycle. You must simulate the same behavior of the pipeline latches, even if you do not model the actual latches. For example, if an instruction is ready to move from scheduling to execute, the motion only takes effect at the beginning of the next clock cycle.
Each stage of the pipeline can be divided into “cycle portions”. Assume the following ordering of cycle portions (you do not need to explicitly model this, but please make sure your simulator follows this ordering of events):
Cycle Portion
Action
1
Retire the oldest completed instruction(s) from
the ROB
2
Function Units write to the PRF and ROB for
completing instructions
3
Any ready/independent instruction in the
scheduling queue is marked to fire (depending
upon availability of functional units)
4
The dispatch unit accesses the ARF, ROB,
RAT, PRF and sends out entries to the
reservation station. When the hardware
structures get full, it stalls.
5
Instructions fetched from the trace
Note: Not all events are dependent on each other, and thus it is possible to have a different order of events and still achieve correct output. However, following this order, you should be guaranteed correctness.
Output
For each trace, the output contains 2 files:
An output file, which contains :
The processor settings
A record of when each instruction was in each stage
The processor statistics: IPC of retired instructions per cycle, average number of PRegs busy per cycle, average number of dispatch stall cycles per cycle.
Correctness of your output is required for validation.
Your simulator should output results to the terminal (stdout) and it should match the validated output on Canvas.
A log file, which contains the cycle-by-cycle behavior of the machine. This file is not required for validation. This is simply there to help debug your code.
Experiments
After your simulator is validated, for each trace:
1. Find the minimum value of F, k0, k1 and k2 to achieve a high value of IPC.
Suggested approach: Set F, k0, k1 and k2 to a very high number. Record the IPC. This is your target IPC. Find the smallest values of F, k0, k1 and k2 that achieve at least 98% of that IPC.
Find the minimum value of the number of ROB entries and registers in the PRF to achieve a high value of IPC.
Suggested approach: Set ROB and PREG to a very high number. Record the IPC. This is your target IPC. Find the smallest values of ROB and PREG that achieve at least 98% of that IPC.
Use all statistics to explain the solution you arrived at.
Grading
0% you hand in nothing or hand in something late
+50% you hand in code that shows a reasonable attempt and passes some of our validation tests
+30% your code passes all validation tests
+15% your experiments are completed
+5% your explanation of the results is exemplary and of research quality