$29
Overview
RISC-V processors should support acore ISA for integer operations, including RV I, RV E, RV I, or RV I. Additional functionality could be adopted to augment the capability of target RISC-V processors. RISC-V has a series ofstandard extensions to provide additional support beyond the core ISA, such as oating point and bit manipulation [RISC-V ISA List ]. On the other hand, there is also a series ofnon-standard extensions , which might be specialized for certain purposes and might con ict with other extensions. If you are interested in the related contents, please refer to the document [Extending RISC-V ].
The RISC-V Vector extension is a promising extension for the AI computing as it enables the parallel processing of mathematic operations on a RISC-V processor. The V extension involves adding a vector computation engine on the RISC-V processor, compared with the serial computing on a typical processor.
In this assignment, you will be asked to convert the given C code segments into the corresponding assembly versions, based on the skills you learned from the previous programming assignment. You will use the RISC-V V extension to implement your programs. More importantly, you will be asked to collect the performance data for your written code, and you need to use the performance data to derive the execution time of your program on the RISC-V processor based on a basic performance model, which is provided in the following section.
1. Performance Modeling
Based on your knowledge learned from Chapter 1.6 Performance of our course textbook (ISBN: 0128203315; ISBN- : 9780128203316), you would derive the CPU execution time of a given program with clock cycles per instruction (CPI), instruction counts, and clock cycle time.
The performance model used in this assignment uses cycle per instruction (CPI) to summarize the delivered performance of a RISC-V instruction executed on the target RISC-V processor, including the e ect of the CPU pipeline and the memory subsystem.
Given the above modeling concept, you will need to collect the performance data of your program to derive the CPU execution time. Speci cally, you need to record the instruction counts of di erent types of RISC-V instructions.
The instructions are categorized into four types and their instruction counts should be recorded (accumulated) in the
arith_cnt lw_cnt sw_cnt others_cnt
four counters: , , , and , respectively. The table below lists the
instructions and their categories.
arith_CPI lw_CPI sw_CPI
The CPIs for the four types of instructions are de ned in the given header les, , , , and
others_CPI
as constants. You should not alter the constant values. The related information for the CPIs is available on the table, too.
baseline_cycle_count
The derived performance data should be stored in some variables, such as for the total cycle
improved_cpu_time
count calculated in the rst execise, and for the CPU time computed in the second exercise.
cycle_time
NOTE: The represents the clock cycle time for the target RISC-V processor. It is a constant data and its
content should not be altered.
Variables/Constants de ned in the header les used in this assignment.
Var./Cons. Name
De nition
x[ ]
Input array 1
with size of 16 in q
_constant.h
with size of 8 in q
_constant.h
1 / 7
Var./Cons. Name
De nition
h[ ]
Input array 2 with size of 16 (q
_constant.h)
y[ ]
Output array with size of 16 (q
_constant.h)
tmp1[ ]
Temp array 1 to help you
nish algorithm in question 2
tmp2[ ]
Temp array 2 to help you
nish algorithm in question 2
used to count
add{i}, sub{i}, and{i}, or{i}, xor{i}, shift,
arith_cnt
vadd.vv, vadd.vx, vadd.vi vsub.vv, vsub.vx, vand.vv, vand.vx, vand.vi,
vor.vv, vor.vx, vor.vi, vxor.vv, vxor.vx, vxor.vi, mul, vmul.vv, vmul.vx
instruction
lw_cnt
used to count
lw, lh, lb, li, lbu, lhu, vle8.v, vle16.v, vle32.v, vle64.v
instruction
sw_cnt
used to count
sw, sh, sb, vse8.v, vse16.v, vse32.v, vse64.v
instruction
others_cnt
used to count rest of instruction
arith_CPI
arith_cnt
CPI of instructions listed in
lw_CPI
lw_cnt
CPI of instructions listed in
sw_CPI
sw_cnt
CPI of instructions listed in
others_CPI
CPI of rest of instructions
baseline_cycle_count
baseline()
Clock cycle in
asm volatile code you need to calculate
improved_cycle_count
improved()
Clock cycle in
asm volatile code you need to calculate
cycle_time
The given clock cycle time of the target RISC-V processor running at 2.6 GHz
baseline_cpu_time
baseline()
The CPU time in
you need to calculate
improved_cpu_time
improved()
The CPU time in
you need to calculate
target
Target number in question 2
flag
Set to 1 if the sum of two elements equals to target number in question 2
2. What Should You Do in this Assignment?
There are three exercises in this assignment. You will receive the code with the following structure.
CO_StudentID_HW2.zip/
└── CO_StudentID_HW2/
|── answer.h
├── lab2_q1_baseline.c
├── lab2_q1.c
├── lab2_q1.h
├── lab2_q1_improved.c
├── lab2_q1_input.txt
├── lab2_q2_ans.c
├── lab2_q2.c
├── lab2_q2.h
├── lab2_q2_input.txt
├── makefile
└── test
2 / 7
The le you shold modify is below:
lab _q _baseline.c
lab _q _improved.c
answer.h
lab _q _ans.c
1. Baseline Code: Write Assembly Code and Report the Performance Statistics (20%)
lab2_q1.c
You need to write the assembly code for the designated operations (speci ed in
as shown below) with the RV
I
ISA, and to add the four counters into the code to count the number of executed instructions. Please report the performance
statistics, including the values of the four counters, total cycle count, and CPU time, of your revised program. Detailed
descriptions are listed as follows.
baseline()
As shown in the
function, you are responsible for writing the assembly for the array additions
for (...) y[i] = h[i] + x[i];
(
).
lab2_q1_baseline.c
NOTE: You should put your assembly code within the
le, as indicated in
baseline()
asm volatile( #include "lab2_q1_baseline.c" : [h] "+r"(p_h), ...);
within the
lab2_q1.c
function of
.
lab2_q1_constant.h
There is a header le (
) specifying the constants/variables used by this assignment. It de
nes
that the size of the arrays is set to 16. NOTE: Please do not modify the code in the header
le.
In addition, you need to insert the assembly code to count the number of executed instructions, according to the types of the
instructions, and to store the accumulated counts in the respective counters. You will need to provide (i.e., print) the contents
arith_cnt lw_cnt sw_cnt others_cnt
of the four counters ( , , , and ), as de ned in the above table.
lw_cnt
You may, for example, use the following instruction to increment the content in the counter.
addi %[lw_cnt], %[lw_cnt], \n\t
You also need to compute the total cycle count and the CPU execution time for the program.
baseline_cycle_count
You need to calculate based on thecounter values and thegiven CPIs (the constants de ned
lab2_q1_constant.h
in ). The above table de nes the variables for keeping the CPI values. Please add the related
baseline_cycle_count
code (formula) to set up the value for .
baseline_cpu_time
You need to compute using thetotal cycle count and thecycle time (the constant of the cycle
lab2_q1_constant.h
time for a . GHz processor de ned in ). Please add the related code (formula) to compute the
baseline_cpu_time
value for .
Your obtained scores of this exercise is determined by the correctness of your reported performance data.
1. The four counter values, 2.5% for each counter. (10% in total)
arith_cnt
(2.5%)
lw_cnt
(2.5%)
sw_cnt
(2.5%)
others_cnt
(2.5%)
baseline_cycle_count
2. The total cycle count ( ). (5%)
baseline_cpu_time
3. The CPU time ( ). (5%)
"addi t2, zero, 16\n\t" // t2 = arr_size
"addi %[arith_cnt], %[arith_cnt], 1\n\t" // add arith counter
2. Vectorized Code: Write Assembly Code and Report the Performance Statistics (40%)
3 / 7
You need to re-write the assembly code (which you build in the previous exercise) with the RISC-V V extension or any other extension that can be supported by the installed toolchain. Similar to the previous exercise, you need to add the counters to count the executed instructions, and need to report the performance statistics, including the values of the four counters, total cycle count, and CPU time, of your vectorized program. Detailed descriptions are listed below.
As shown in the
improved()
function, you are responsible for writing the assembly for the array additions
for (...) y[i] = h[i] + x[i];
(
) using the RISC-V V extension [RISC-V V Extension ]. The vectorized version would
improve the execution e ciency via doing the computations in parallel.
lab2_q1_improved.c
NOTE: You should put your vectorized code within the
le, as indicated in
asm volatile( #include "lab2_q1_improved.c" :[x] "+r"(p_x), [y] "+r"(p_y), ...);
improved()
lab2_q1.c
within the
function of
.
lab2_q1_constant.h
baseline()
improved()
The header
le (
) is shared by both the
and
functions, and
should not be modi ed in any way.
arith_cnt
lw_cnt
sw_cnt
In addition, similar to the rst exercise, you need to insert the code of the four counters (
,
,
,
others_cnt
and
) to count the number of executed instructions based on their instruction types. You will need to provide
(i.e., print) the contents of the four counters.
You also need to compute the total cycle count and the CPU execution time for the improved program.
improved_cycle_count
You need to calculate based on thecounter values and thegiven CPIs (the constants de ned
lab2_q1_constant.h
in ). The above table de nes the variables for keeping the CPI values. Please add the related
improved_cycle_count
code (formula) to set up the value for .
improved_cpu_time
You need to compute using thetotal cycle count and thecycle time (the constant of the cycle
lab2_q1_constant.h
time for a . GHz processor de ned in ). Please add the related code (formula) to compute the
improved_cpu_time
value for .
The speed up of the performance enhanced by the vectorized version against the baseline code should be presented to show the e ciency delivered by the vectorized code.
printf()
The speed up achieved by the vector code should be calculated and printed out by the function, invoked at
improved()
the end of the function.
improved_cpu_time baseline_cpu_time
The speed up can be calculated with the and the .
Your obtained scores of this exercise is determined by the correctness of your reported performance data.
1. The four counter values, % for each counter. (4% in total)
arith_cnt
(1%)
lw_cnt
(1%)
sw_cnt
(1%)
others_cnt
(1%)
improved_cycle_count
2. The total cycle count ( ). (3%)
improved_cpu_time
3. The CPU time ( ). (3%)
4. The calculated speed up (achieved by the vector version over the baseline code). (30%)
baseline_cpu_time
Note that you can get the score only if your <= 250,000 (us) and your cpu time is right.
If the speedup number <= 1, you get (0 pt).
If 1 < the speedup number <= 2, you get (10 pt).
If 2 < the speedup number <= 4, you get (20 pt).
If 4 < the speedup number, you get (30 pt).
4 / 7
3. Nested For-Loop Code (60%)
target lab2_q2.c
You will be provided with an array of integers and a number. The original C code (in ) shows the logic of
p_x[i] p_x[j] target
the nested loop code, where the summation of the array integers ( and ) is used to compare with .
target flag
The loop execution will break when the summation is equal to the value of thetarget . The value is used to indicate
theanswer of the given test data to see if there is a summation matching the . Besides, you are asked to compute the
CPU execution time of your written code. Detailed descriptions are listed below.
• lab2_q2.c
...
/* Original C code */
for (int i = 0; i < arr_size; i++){ if (flag == 1)
break;
for (int j = 0; j < arr_size; j++){ if ((p_x[i] + p_x[j]) == target){
flag = 1; break;
}
}
}
...
flag
The value indicates the answer (result) of the computations.
flag flag
If == 0, it means there is no matching found in the array elements. If == 1, it means a match is found.
You can use the same array element more than once to obtain the summation ofarray elements .
Example:
arr[8] = {1, 2, 3, 4, 5, 6, 7, 8};
target = 2;
flag Yes
should be one in this case since 1 + 1 = 2. The answer should be .
The array size is xed at eight.
lab2_q2_constant.h
The constants and variables you need is available in the header le .
arith_cnt lw_cnt sw_cnt
You should use these constants/variables in the header le wisely (e.g., , , ,
others_cnt cycle_time
, and ) to compute the CPU time.
Please do not modify the code in the header le.
You are asked to develop ane cient assembly code. You can provide such a code by improving the searching algorithm or by using the RISC-V V extension.
Thee ciency is de ned by the estimated CPU time
tmp1 tmp2
Hint: You can use and arrays wisely.
You can achieve O(n) computation complexity if you make good use of vector add and comparison instructions from RISC-V V extension for the implementation.
Your obtained scores of this exercise is determined by the correctness of your reported performance data.
1. The Yes/No answer. (20%)
2. The CPU time for your program. (40%)
If 720,000 (us) < CPU time, you gets (0%).
If 360,000 (us) < CPU time <= 720,000 (us), you gets (10%).
5 / 7
If 180,000 (us) < CPU time <= 360,000 (us), you gets (20%).
If 90000 (us) < CPU time <= 180,000 (us), you gets (30%).
If CPU time <= 90000 (us), you gets (40%).
3. Test Your assignment
We use make le to judge your program. You can use the judge program to get the testing score by typing judge in your terminal. Get your score
• make score
/opt/riscv/riscv64-unknown-elf/bin/pk
If your path of proxy kernel is not
, change it in PK_PATH in make le.
Test your question 1 in lab
• make lab2_q1
• make run_lab2_q1 TEST_DATA=1
TEST_DATA can change the test data which you want to test (e.g., TEST_DATA= ~ TEST_DATA= ).
Test your question 2 in lab
• make lab2_q2
• make run_lab2_q2 TEST_DATA=1
TEST_DATA can change the test data which you want to test (e.g., TEST_DATA= ~ TEST_DATA= ).
6 / 7
4. Submission of Your Assignment
CO_StudentID_HW2
We assume your developed code is inside the folder: . Please follow the instructions below to submit your
programming assignment.
zip
1. Compress your source code into a le.
2. Submit your homework with NCKU Moodle.
3. The zipped le and its internal directory organization of your developed code should be similar to the example below.
StudentID
NOTE: Replace all with your student ID number).
CO_StudentID_HW2.zip
└── CO_StudentID_HW2/
├── answer.h
├── lab2_q1_baseline.c
├── lab2_q1.c
├── lab2_q1.h
├── lab2_q1_improved.c
├── lab2_q1_input.txt
├── lab2_q2_ans.c
├── lab2_q2.c
├── lab2_q2.h
├── lab2_q2_input.txt
├── makefile
└── test
!!! Incorrect format (either the le structure or le name) will lose 10 points. !!!
5. References
RISC-V ISA List
Extending RISC-V
RISC-V V Extension