CA-HW-2
HW_2
This is the solution write-up for HW_2. I have adjusted the format and combined each problem with its solution so the document renders well in Markdown; you can also view and edit the original through the Overleaf source file.
My source-code solution and commit history are available in the GitHub repo.
Problem 1: Implement a simple RISC Core (2+5+8=15 Points)
Use the following ISA and hardware architecture to compute $\mathbf{A} \cdot \mathbf{B} + \mathbf{C}$, where
$\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ are $8\times 8$ matrices.
Each element is a signed 8-bit integer.
Problem (a)
Write the complete assembly code for the computation. (Hint: an 8-entry vector register file is not sufficient for an 8x8 matrix.)
Problem (b)
Propose a superscalar strategy (maximum 2 instructions per fetch) and calculate how many cycles are needed. Compare the utilization ratio with and without the superscalar strategy.
Solution
Problem (a)
This problem is essentially a matrix-blocking problem, which is very common in the data-mapping stage of neural-network accelerators: the on-chip memory is not sufficient, so we have to divide the activation into many tiles and process each tile in order. Depending on how the matrices are split, I provide two methods to solve problem (a).
Solution (1)
First, we split the three $8\times8$ matrices A, B, and C into 48 vectors of 4 elements each, because this matches the DCM vector and vector-register-file bit width (32b). Matrix A is divided horizontally and matrix B vertically, matching the order of computation, and we distinguish the two halves of each row or column by H and L, which also makes the assembly code easy to annotate. All input vectors are stored in the DCM vector memory. The $8\times8$ matrix computation can then be displayed as follows (the layout of matrices C and S depends on how matrix B is split):
$$
\left[
\begin{matrix}
A1H & A1L \\
A2H & A2L \\
A3H & A3L \\
A4H & A4L \\
A5H & A5L \\
A6H & A6L \\
A7H & A7L \\
A8H & A8L
\end{matrix}
\right] *
\left[
\begin{matrix}
B1H & B2H & B3H & B4H & B5H & B6H & B7H & B8H\\
B1L & B2L & B3L & B4L & B5L & B6L & B7L & B8L
\end{matrix}
\right] \\ +
\left[
\begin{matrix}
C1H & C1L \\
C2H & C2L \\
C3H & C3L \\
C4H & C4L \\
C5H & C5L \\
C6H & C6L \\
C7H & C7L \\
C8H & C8L
\end{matrix}
\right] =
\left[
\begin{matrix}
S1H & S1L \\
S2H & S2L \\
S3H & S3L \\
S4H & S4L \\
S5H & S5L \\
S6H & S6L \\
S7H & S7L \\
S8H & S8L
\end{matrix}
\right]
$$
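To make the H/L tiling concrete, here is a small Python sketch (my own illustration, not part of the submitted solution; it ignores the 8-bit quantization of the hardware). It splits A, B, and C exactly as above and checks that two 4-element MACs per output element reproduce the full $A \cdot B + C$:

```python
# Hypothetical illustration of the H/L tiling in method (1).
# The matrix names (A, B, C, S) follow the write-up; everything else is made up.
import random

N, HALF = 8, 4
random.seed(0)
rand8 = lambda: random.randint(-128, 127)  # signed 8-bit elements
A = [[rand8() for _ in range(N)] for _ in range(N)]
B = [[rand8() for _ in range(N)] for _ in range(N)]
C = [[rand8() for _ in range(N)] for _ in range(N)]

# Split A horizontally: AiH = first 4 elements of row i, AiL = last 4.
AH = [row[:HALF] for row in A]
AL = [row[HALF:] for row in A]
# Split B vertically: BjH = top half of column j, BjL = bottom half.
BH = [[B[k][j] for k in range(HALF)] for j in range(N)]
BL = [[B[k][j] for k in range(HALF, N)] for j in range(N)]

dot = lambda u, v: sum(x * y for x, y in zip(u, v))

# S[i][j] = AiH . BjH + AiL . BjL + C[i][j]  (two MACs per output element)
S = [[dot(AH[i], BH[j]) + dot(AL[i], BL[j]) + C[i][j] for j in range(N)]
     for i in range(N)]

# Reference: plain A*B + C without tiling.
ref = [[dot(A[i], [B[k][j] for k in range(N)]) + C[i][j] for j in range(N)]
       for i in range(N)]
assert S == ref
```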
In the instructions, vrd and vrs are only 3 bits wide, which is not sufficient to index 48 vectors in the vector register file. However, vectors of the same type are stored at contiguous memory addresses, so we store the base addresses of the input vectors (activation A, weight B, and bias C) in the scalar register file and refer to them simply as RF1, RF2, and RF3. Likewise, the sum vectors must be written back to the DCM vector memory, and their base address is stored in the scalar register file as RF4.
In this way, we can use the Vload and Vstore instructions and index vectors through the stored base addresses (RF1, RF2) plus an immediate. However, because we use weight stationary to get over the memory wall, the 8-entry vector register file is not sufficient: we cannot keep all the weights resident in the 8 vector registers, so I further split the weight matrix B into 4 parts like this:
$$
\left[
\begin{matrix}
B1H & B2H & B3H & B4H
\end{matrix}
\right]\
$$
$$
\left[
\begin{matrix}
B5H & B6H & B7H & B8H
\end{matrix}
\right]
$$
$$
\left[
\begin{matrix}
B1L & B2L & B3L & B4L
\end{matrix}
\right]\
$$
$$
\left[
\begin{matrix}
B5L & B6L & B7L & B8L
\end{matrix}
\right]
$$
Besides, we compute matrix S in two passes, one for each partition of matrix A.
In short, my assembly code looks like this:
```
// Step1: load the weight
```
It is worth adding that the assembly code contains 4 big loops and 8 small loops; in practice, we could replace them with callq, jz, and loop to simplify the code.
Besides, we can calculate that the number of DCM fetches is 112:
- Matrix A: 8 * 4 loads
- Matrix B: 4 * 4 loads
- Matrix C: 8 * 2 loads
- Matrix S: 8 * 2 * 2 stores and 8 * 2 loads
This blocking scheme also scales naturally as the matrix size grows.
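As a quick sanity check (my own sketch, not part of the submitted solution), the fetch tally above can be reproduced in a few lines of Python:

```python
# Sanity check of the DCM-fetch tally for method (1), numbers from the text.
loads_A = 8 * 4        # 16 A vectors, each loaded twice (one pass per partition)
loads_B = 4 * 4        # 16 weight vectors, each loaded once
loads_C = 8 * 2        # 16 bias vectors
stores_S = 8 * 2 * 2   # each partial-sum vector stored twice
reloads_S = 8 * 2      # and reloaded once for the second MAC pass
total = loads_A + loads_B + loads_C + stores_S + reloads_S
assert total == 112
```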
Solution (2)
As described in the method above, each element of matrix A must be loaded from the DCM vector memory twice. In addition, each partial-sum vector must be stored twice and reloaded once to complete the two MACs. So I change the way the matrices are split; with this method, each sum vector only needs to be stored once.
The 48 input vectors are divided as follows:
$$
\left[
\begin{matrix}
A1H & A1L \\
A2H & A2L \\
A3H & A3L \\
A4H & A4L \\
A5H & A5L \\
A6H & A6L \\
A7H & A7L \\
A8H & A8L
\end{matrix}
\right] *
\left[
\begin{matrix}
B1H & B2H & B3H & B4H & B5H & B6H & B7H & B8H\\
B1L & B2L & B3L & B4L & B5L & B6L & B7L & B8L
\end{matrix}
\right]
+
$$
$$
\left[
\begin{matrix}
C1H & C2H & C3H & C4H\\
C5H & C6H & C7H & C8H\\
C1L & C2L & C3L & C4L\\
C5L & C6L & C7L & C8L
\end{matrix}
\right] =
\left[
\begin{matrix}
S1H & S2H & S3H & S4H\\
S5H & S6H & S7H & S8H\\
S1L & S2L & S3L & S4L\\
S5L & S6L & S7L & S8L
\end{matrix}
\right]
$$
I still keep the weights stationary, so matrix B is again divided into four parts. This time, however, I split the weight matrix B into 4 blocks that span multiple rows, like this:
$$
\left[
\begin{matrix}
B1H & B2H\\
B1L & B2L
\end{matrix}
\right]
\left[
\begin{matrix}
B3H & B4H \\
B3L & B4L
\end{matrix}
\right]
\left[
\begin{matrix}
B5H & B6H \\
B5L & B6L
\end{matrix}
\right]
\left[
\begin{matrix}
B7H & B8H \\
B7L & B8L
\end{matrix}
\right]
$$
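The dataflow of method (2) can also be sketched in Python (my own illustration, ignoring the 8-bit quantization): one 2-column weight block stays resident while all rows of A stream past it, so each output vector is produced and stored exactly once, at the cost of reloading A once per weight block:

```python
# Hypothetical sketch of method (2): keep one 2-column weight block resident,
# stream every row of A past it, and store each output exactly once.
import random

N, HALF = 8, 4
random.seed(1)
rand8 = lambda: random.randint(-128, 127)
A = [[rand8() for _ in range(N)] for _ in range(N)]
B = [[rand8() for _ in range(N)] for _ in range(N)]
C = [[rand8() for _ in range(N)] for _ in range(N)]
dot = lambda u, v: sum(x * y for x, y in zip(u, v))

S = [[0] * N for _ in range(N)]
a_vector_loads = 0
for blk in range(4):                  # four weight blocks: cols 0-1, 2-3, 4-5, 6-7
    cols = (2 * blk, 2 * blk + 1)     # the 2 full columns now resident (BjH + BjL)
    for i in range(N):                # stream all rows of A past this block
        AiH, AiL = A[i][:HALF], A[i][HALF:]
        a_vector_loads += 2           # AiH and AiL are reloaded for each block
        for j in cols:
            BjH = [B[k][j] for k in range(HALF)]
            BjL = [B[k][j] for k in range(HALF, N)]
            # full dot product completes here, so S[i][j] is stored once
            S[i][j] = dot(AiH, BjH) + dot(AiL, BjL) + C[i][j]

ref = [[dot(A[i], [B[k][j] for k in range(N)]) + C[i][j] for j in range(N)]
       for i in range(N)]
assert S == ref and a_vector_loads == 16 * 4
```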
So the loop structure is slightly different from method (1).
In short, my assembly code looks like this:
```
Vload VRF6, RF1, $1 //load A1L
```
Although each sum vector is stored only once, in this method matrix A must be loaded four times to complete the MACs. So the number of DCM fetches is again 112:
- Matrix A: 16 * 4 loads
- Matrix B: 4 * 4 loads
- Matrix C: 4 * 4 loads
- Matrix S: 4 * 4 stores
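The same sanity check as before (my own sketch) confirms that method (2) trades extra A loads for single stores of S, for the same total:

```python
# Sanity check of the method-(2) fetch tally: A is reloaded once per weight
# block, but each output vector is stored exactly once.
loads_A = 16 * 4   # 16 A vectors, reloaded for each of the 4 weight blocks
loads_B = 4 * 4    # 16 weight vectors, each loaded once
loads_C = 4 * 4    # 16 bias vectors
stores_S = 4 * 4   # 16 output vectors, stored once each
total = loads_A + loads_B + loads_C + stores_S
assert total == 112
```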
Of course, there are many other schemes, such as activation-stationary or output-stationary dataflows. Whatever the case, choose a reasonable split to balance the memory wall against latency, depending on your own situation.
Problem (b)
Based on method (1), we propose a superscalar strategy. As described above, the method contains 4 big loops and 8 small loops, so I only analyze the instruction-level parallelism of the non-repeated instructions.
We assume that all pipeline stages take the same time.
As shown in Figure 1, the custom pipeline has 4 big loops, each containing 8 small loops.
Each small loop takes 11 cycles, so we can calculate the total cycle count of the whole assembly program:
$$
((8 \times (11 - 2) + 2 + 4) - 2) \times 4 + 2 = 306\ \text{clks}
$$
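The arithmetic can be reproduced in a few lines of Python (my own sketch; the constants come from the pipeline diagram in Figure 1):

```python
# Reproducing the baseline cycle count: 8 small loops of 11 cycles with a
# 2-cycle overlap between iterations, plus fixed per-big-loop overhead.
small_cycle = 11                                   # cycles per small-loop iteration
overlap = 2                                        # pipeline overlap between iterations
big_cycle = 8 * (small_cycle - overlap) + 2 + 4    # one big loop, with overhead
total = (big_cycle - 2) * 4 + 2                    # 4 big loops, overlapped by 2
assert total == 306
```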
Before applying the superscalar strategy, we assume that the DCM vector memory is a true dual-port RAM; alternatively, we could design two separate DCMs, one for weights and one for activations. In other words, we can load or store two vectors at different addresses in the same cycle.
Based on Lecture 4, we use a combined Load+MAC instruction: we propose a VLoMac instruction that completes both operations in 5 cycles, but it must issue together with a plain load instruction.
```
Vload Vrd(3b), rs1(5b), imm(5b)
```
Therefore, most Vload and VLoMac instructions can run in parallel. However, Step 2/Step 4 (initial bias / reload psum) and the MAC inside the small loop might cause a data hazard, so those two instructions have to execute separately.
Step 1 (load weight) can be combined with Step 3 (computing). But because VLoMac only has a 1-bit immediate, we add more base-address scalar registers: RF1 and RF3 still serve matrices A and C, RF2 serves the partial sums, and matrix B is indexed through RF4 to RF11. The scalar RF is 32 entries deep, so this is sufficient.
The improved assembly code is shown below:
```
// Step2: Initial Bias
```
As shown in Figure 2, with the superscalar strategy we still have 4 big loops of 8 small loops each.
Each small loop now takes 9 cycles, excluding VStore; the pipeline must stall 3 cycles for VStore, which cannot be dual-issued. However, as indicated by the yellow rectangle, we can start the next loop's instructions during the stall, and I confirmed that by rotating the VRF used for the psum we can avoid the data hazard. Combining the above, we can calculate the total cycle count of the whole assembly program:
$$
(8 \times (9 - 4) + 8) \times 4 - 1 + 6 = 197\ \text{clks}
$$
$$
\text{Before superscalar: } 5 \times 8 \times 4 / 306 = 52.29\%
$$
$$
\text{After superscalar: } 5 \times 8 \times 4 / 197 = 81.22\%
$$
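Both the superscalar cycle count and the two utilization ratios can be checked numerically (my own sketch; the useful work is 5-cycle MACs over 8 small loops and 4 big loops):

```python
# Reproducing the superscalar cycle count and the utilization ratios.
total_super = (8 * (9 - 4) + 8) * 4 - 1 + 6   # cycles with dual issue
useful = 5 * 8 * 4                            # 5-cycle MAC * 8 small * 4 big loops
util_before = 100 * useful / 306
util_after = 100 * useful / total_super
assert total_super == 197
assert round(util_before, 2) == 52.29
assert round(util_after, 2) == 81.22
```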
In this analysis, we assumed that all stages take the same time. In reality, the ALU MAC takes much longer than the other stages, so the superscalar strategy would improve the pipeline latency even more significantly.
In this part, we implement a MAC module based on the circuit shown in Problem 1. The source code is written in SystemVerilog in mac.sv. The interface of the MAC module is shown below:
```
module mac(
```
The complete source code is submitted in the attachment, along with the testbench tb_mac.sv. In the testbench, we exercise the matrix MAC:
$$
\left[
\begin{matrix}
-1 & 2 & -3 & 4 \\
\end{matrix}
\right] *
\left[
\begin{matrix}
-1 & -1 & 0 & 2 \\
2 & 2 & 3 & 1 \\
3 & -3 & 1 & 4 \\
4 & 4 & 2 & 5
\end{matrix}
\right] +
\left[
\begin{matrix}
1 & 2 & 3 & 4 \\
\end{matrix}
\right]
=
\left[
\begin{matrix}
13 & 32 & 14 & 12 \\
\end{matrix}
\right]
$$
As shown in Figure 3, we issue 5 VMAC instructions to complete the matrix computation. The final output is the concatenation of the 4 quantized results in the MAC's scratchpad: $0c\ 0e\ 20\ 0d$ is the hexadecimal form of the results $12\ 14\ 32\ 13$. We have reached the expected functional goal. The source code is in mac.sv and tb_mac.sv in the appendix, and I also attached a screenshot of the simulation waveform.
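The expected testbench result can be verified independently with a short Python sketch (my own check, ignoring quantization):

```python
# Checking the testbench example: a 1x4 row times a 4x4 matrix plus a 1x4
# bias, using the values from the write-up.
row = [-1, 2, -3, 4]
B = [[-1, -1, 0, 2],
     [2, 2, 3, 1],
     [3, -3, 1, 4],
     [4, 4, 2, 5]]
bias = [1, 2, 3, 4]

out = [sum(row[k] * B[k][j] for k in range(4)) + bias[j] for j in range(4)]
assert out == [13, 32, 14, 12]

# The scratchpad holds the four results reversed, as hex bytes: 0c 0e 20 0d.
assert [format(x, '02x') for x in reversed(out)] == ['0c', '0e', '20', '0d']
```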
**Believe me, Support me, Pay me!**