Computer Organization HW 2: Performance Modeling for the RISC-V Processor Solution

Starting from:

~~$35~~

$29

Home

Overview

RISC-V processors should support acore ISA for integer operations, including RV I, RV E, RV I, or RV I. Additional functionality could be adopted to augment the capability of target RISC-V processors. RISC-V has a series ofstandard extensions to provide additional support beyond the core ISA, such as oating point and bit manipulation [RISC-V ISA List ]. On the other hand, there is also a series ofnon-standard extensions , which might be specialized for certain purposes and might con ict with other extensions. If you are interested in the related contents, please refer to the document [Extending RISC-V ].

The RISC-V Vector extension is a promising extension for the AI computing as it enables the parallel processing of mathematic operations on a RISC-V processor. The V extension involves adding a vector computation engine on the RISC-V processor, compared with the serial computing on a typical processor.

In this assignment, you will be asked to convert the given C code segments into the corresponding assembly versions, based on the skills you learned from the previous programming assignment. You will use the RISC-V V extension to implement your programs. More importantly, you will be asked to collect the performance data for your written code, and you need to use the performance data to derive the execution time of your program on the RISC-V processor based on a basic performance model, which is provided in the following section.

1. Performance Modeling

Based on your knowledge learned from Chapter 1.6 Performance of our course textbook (ISBN: 0128203315; ISBN- : 9780128203316), you would derive the CPU execution time of a given program with clock cycles per instruction (CPI), instruction counts, and clock cycle time.

The performance model used in this assignment uses cycle per instruction (CPI) to summarize the delivered performance of a RISC-V instruction executed on the target RISC-V processor, including the e ect of the CPU pipeline and the memory subsystem.

Given the above modeling concept, you will need to collect the performance data of your program to derive the CPU execution time. Speci cally, you need to record the instruction counts of di erent types of RISC-V instructions.

The instructions are categorized into four types and their instruction counts should be recorded (accumulated) in the

arith_cnt lw_cnt sw_cnt others_cnt

four counters: , , , and , respectively. The table below lists the

instructions and their categories.

arith_CPI lw_CPI sw_CPI

The CPIs for the four types of instructions are de ned in the given header les, , , , and

others_CPI

as constants. You should not alter the constant values. The related information for the CPIs is available on the table, too.

baseline_cycle_count

The derived performance data should be stored in some variables, such as for the total cycle

improved_cpu_time

count calculated in the rst execise, and for the CPU time computed in the second exercise.

cycle_time

NOTE: The represents the clock cycle time for the target RISC-V processor. It is a constant data and its

content should not be altered.

Variables/Constants de ned in the header les used in this assignment.

Var./Cons. Name

De nition

x[ ]

Input array 1

with size of 16 in q

_constant.h

with size of 8 in q

_constant.h

1 / 7

Var./Cons. Name

De nition

h[ ]

Input array 2 with size of 16 (q

_constant.h)

y[ ]

Output array with size of 16 (q

_constant.h)

tmp1[ ]

Temp array 1 to help you

nish algorithm in question 2

tmp2[ ]

Temp array 2 to help you

nish algorithm in question 2

used to count

add{i}, sub{i}, and{i}, or{i}, xor{i}, shift,

arith_cnt

vadd.vv, vadd.vx, vadd.vi vsub.vv, vsub.vx, vand.vv, vand.vx, vand.vi,

vor.vv, vor.vx, vor.vi, vxor.vv, vxor.vx, vxor.vi, mul, vmul.vv, vmul.vx

instruction

lw_cnt

used to count

lw, lh, lb, li, lbu, lhu, vle8.v, vle16.v, vle32.v, vle64.v

instruction

sw_cnt

used to count

sw, sh, sb, vse8.v, vse16.v, vse32.v, vse64.v

instruction

others_cnt

used to count rest of instruction

arith_CPI

arith_cnt

CPI of instructions listed in

lw_CPI

lw_cnt

CPI of instructions listed in

sw_CPI

sw_cnt

CPI of instructions listed in

others_CPI

CPI of rest of instructions

baseline_cycle_count

baseline()

Clock cycle in

asm volatile code you need to calculate

improved_cycle_count

improved()

Clock cycle in

asm volatile code you need to calculate

cycle_time

The given clock cycle time of the target RISC-V processor running at 2.6 GHz

baseline_cpu_time

baseline()

The CPU time in

you need to calculate

improved_cpu_time

improved()

The CPU time in

you need to calculate

target

Target number in question 2

flag

Set to 1 if the sum of two elements equals to target number in question 2

2. What Should You Do in this Assignment?

There are three exercises in this assignment. You will receive the code with the following structure.

CO_StudentID_HW2.zip/

└── CO_StudentID_HW2/

|── answer.h

├── lab2_q1_baseline.c

├── lab2_q1.c

├── lab2_q1.h

├── lab2_q1_improved.c

├── lab2_q1_input.txt

├── lab2_q2_ans.c

├── lab2_q2.c

├── lab2_q2.h

├── lab2_q2_input.txt

├── makefile

└── test

2 / 7

The le you shold modify is below:

lab _q _baseline.c

lab _q _improved.c

answer.h

lab _q _ans.c

1. Baseline Code: Write Assembly Code and Report the Performance Statistics (20%)

lab2_q1.c

You need to write the assembly code for the designated operations (speci ed in

as shown below) with the RV

ISA, and to add the four counters into the code to count the number of executed instructions. Please report the performance

statistics, including the values of the four counters, total cycle count, and CPU time, of your revised program. Detailed

descriptions are listed as follows.

baseline()

As shown in the

function, you are responsible for writing the assembly for the array additions

for (...) y[i] = h[i] + x[i];

(

lab2_q1_baseline.c

NOTE: You should put your assembly code within the

le, as indicated in

baseline()

asm volatile( #include "lab2_q1_baseline.c" : [h] "+r"(p_h), ...);

within the

lab2_q1.c

function of

lab2_q1_constant.h

There is a header le (

) specifying the constants/variables used by this assignment. It de

nes

that the size of the arrays is set to 16. NOTE: Please do not modify the code in the header

le.

In addition, you need to insert the assembly code to count the number of executed instructions, according to the types of the

instructions, and to store the accumulated counts in the respective counters. You will need to provide (i.e., print) the contents

arith_cnt lw_cnt sw_cnt others_cnt

of the four counters ( , , , and ), as de ned in the above table.

lw_cnt

You may, for example, use the following instruction to increment the content in the counter.

addi %[lw_cnt], %[lw_cnt], \n\t

You also need to compute the total cycle count and the CPU execution time for the program.

baseline_cycle_count

You need to calculate based on thecounter values and thegiven CPIs (the constants de ned

lab2_q1_constant.h

in ). The above table de nes the variables for keeping the CPI values. Please add the related

baseline_cycle_count

code (formula) to set up the value for .

baseline_cpu_time

You need to compute using thetotal cycle count and thecycle time (the constant of the cycle

lab2_q1_constant.h

time for a . GHz processor de ned in ). Please add the related code (formula) to compute the

baseline_cpu_time

value for .

Your obtained scores of this exercise is determined by the correctness of your reported performance data.

1. The four counter values, 2.5% for each counter. (10% in total)

arith_cnt

(2.5%)

lw_cnt

(2.5%)

sw_cnt

(2.5%)

others_cnt

(2.5%)

baseline_cycle_count

2. The total cycle count ( ). (5%)

baseline_cpu_time

3. The CPU time ( ). (5%)

"addi t2, zero, 16\n\t" // t2 = arr_size

"addi %[arith_cnt], %[arith_cnt], 1\n\t" // add arith counter

2. Vectorized Code: Write Assembly Code and Report the Performance Statistics (40%)

3 / 7

You need to re-write the assembly code (which you build in the previous exercise) with the RISC-V V extension or any other extension that can be supported by the installed toolchain. Similar to the previous exercise, you need to add the counters to count the executed instructions, and need to report the performance statistics, including the values of the four counters, total cycle count, and CPU time, of your vectorized program. Detailed descriptions are listed below.

As shown in the

improved()

function, you are responsible for writing the assembly for the array additions

for (...) y[i] = h[i] + x[i];

(

) using the RISC-V V extension [RISC-V V Extension ]. The vectorized version would

improve the execution e ciency via doing the computations in parallel.

lab2_q1_improved.c

NOTE: You should put your vectorized code within the

le, as indicated in

asm volatile( #include "lab2_q1_improved.c" :[x] "+r"(p_x), [y] "+r"(p_y), ...);

improved()

lab2_q1.c

within the

function of

lab2_q1_constant.h

baseline()

improved()

The header

le (

) is shared by both the

and

functions, and

should not be modi ed in any way.

arith_cnt

lw_cnt

sw_cnt

In addition, similar to the rst exercise, you need to insert the code of the four counters (

others_cnt

and

) to count the number of executed instructions based on their instruction types. You will need to provide

(i.e., print) the contents of the four counters.

You also need to compute the total cycle count and the CPU execution time for the improved program.

improved_cycle_count

You need to calculate based on thecounter values and thegiven CPIs (the constants de ned

lab2_q1_constant.h

in ). The above table de nes the variables for keeping the CPI values. Please add the related

improved_cycle_count

code (formula) to set up the value for .

improved_cpu_time

You need to compute using thetotal cycle count and thecycle time (the constant of the cycle

lab2_q1_constant.h

time for a . GHz processor de ned in ). Please add the related code (formula) to compute the

improved_cpu_time

value for .

The speed up of the performance enhanced by the vectorized version against the baseline code should be presented to show the e ciency delivered by the vectorized code.

printf()

The speed up achieved by the vector code should be calculated and printed out by the function, invoked at

improved()

the end of the function.

improved_cpu_time baseline_cpu_time

The speed up can be calculated with the and the .

Your obtained scores of this exercise is determined by the correctness of your reported performance data.

1. The four counter values, % for each counter. (4% in total)

arith_cnt

(1%)

lw_cnt

(1%)

sw_cnt

(1%)

others_cnt

(1%)

improved_cycle_count

2. The total cycle count ( ). (3%)

improved_cpu_time

3. The CPU time ( ). (3%)

4. The calculated speed up (achieved by the vector version over the baseline code). (30%)

baseline_cpu_time

Note that you can get the score only if your <= 250,000 (us) and your cpu time is right.

If the speedup number <= 1, you get (0 pt).

If 1 < the speedup number <= 2, you get (10 pt).

If 2 < the speedup number <= 4, you get (20 pt).

If 4 < the speedup number, you get (30 pt).

4 / 7

3. Nested For-Loop Code (60%)

target lab2_q2.c

You will be provided with an array of integers and a number. The original C code (in ) shows the logic of

p_x[i] p_x[j] target

the nested loop code, where the summation of the array integers ( and ) is used to compare with .

target flag

The loop execution will break when the summation is equal to the value of thetarget . The value is used to indicate

theanswer of the given test data to see if there is a summation matching the . Besides, you are asked to compute the

CPU execution time of your written code. Detailed descriptions are listed below.

• lab2_q2.c

...

/* Original C code */

for (int i = 0; i < arr_size; i++){ if (flag == 1)

break;

for (int j = 0; j < arr_size; j++){ if ((p_x[i] + p_x[j]) == target){

flag = 1; break;

}

...

flag

The value indicates the answer (result) of the computations.

flag flag

If == 0, it means there is no matching found in the array elements. If == 1, it means a match is found.

You can use the same array element more than once to obtain the summation ofarray elements .

Example:

arr[8] = {1, 2, 3, 4, 5, 6, 7, 8};

target = 2;

flag Yes

should be one in this case since 1 + 1 = 2. The answer should be .

The array size is xed at eight.

lab2_q2_constant.h

The constants and variables you need is available in the header le .

arith_cnt lw_cnt sw_cnt

You should use these constants/variables in the header le wisely (e.g., , , ,

others_cnt cycle_time

, and ) to compute the CPU time.

Please do not modify the code in the header le.

You are asked to develop ane cient assembly code. You can provide such a code by improving the searching algorithm or by using the RISC-V V extension.

Thee ciency is de ned by the estimated CPU time

tmp1 tmp2

Hint: You can use and arrays wisely.

You can achieve O(n) computation complexity if you make good use of vector add and comparison instructions from RISC-V V extension for the implementation.

Your obtained scores of this exercise is determined by the correctness of your reported performance data.

1. The Yes/No answer. (20%)

2. The CPU time for your program. (40%)

If 720,000 (us) < CPU time, you gets (0%).

If 360,000 (us) < CPU time <= 720,000 (us), you gets (10%).

5 / 7

If 180,000 (us) < CPU time <= 360,000 (us), you gets (20%).

If 90000 (us) < CPU time <= 180,000 (us), you gets (30%).

If CPU time <= 90000 (us), you gets (40%).

3. Test Your assignment

We use make le to judge your program. You can use the judge program to get the testing score by typing judge in your terminal. Get your score

• make score

/opt/riscv/riscv64-unknown-elf/bin/pk

If your path of proxy kernel is not

, change it in PK_PATH in make le.

Test your question 1 in lab

• make lab2_q1

• make run_lab2_q1 TEST_DATA=1

TEST_DATA can change the test data which you want to test (e.g., TEST_DATA= ~ TEST_DATA= ).

Test your question 2 in lab

• make lab2_q2

• make run_lab2_q2 TEST_DATA=1

TEST_DATA can change the test data which you want to test (e.g., TEST_DATA= ~ TEST_DATA= ).

6 / 7

4. Submission of Your Assignment

CO_StudentID_HW2

We assume your developed code is inside the folder: . Please follow the instructions below to submit your

programming assignment.

zip

1. Compress your source code into a le.

2. Submit your homework with NCKU Moodle.

3. The zipped le and its internal directory organization of your developed code should be similar to the example below.

StudentID

NOTE: Replace all with your student ID number).

CO_StudentID_HW2.zip

└── CO_StudentID_HW2/

├── answer.h

├── lab2_q1_baseline.c

├── lab2_q1.c

├── lab2_q1.h

├── lab2_q1_improved.c

├── lab2_q1_input.txt

├── lab2_q2_ans.c

├── lab2_q2.c

├── lab2_q2.h

├── lab2_q2_input.txt

├── makefile

└── test

!!! Incorrect format (either the le structure or le name) will lose 10 points. !!!

5. References

RISC-V ISA List

Extending RISC-V

RISC-V V Extension