chomp arc final
You compile some C source code containing a function called loop_quiz(). The gcc command you use is:
  gcc -Og -Wall -Werror -std=c99 -S -fno-inline -fno-if-conversion loop_quiz.c -o loop_quiz.s
In the resulting assembly code, loop_quiz() is implemented as:
  loop_quiz:
    movl $0, %edx
    movl $0, %eax
  .LOOP:
    addl %eax, %edx
    addl $1, %eax
    cmpl %edi, %eax
    jne .LOOP
    movl %edx, %eax
    ret
The value of the first argument passed to loop_quiz(), passed via the %edi register, is 19. What value does loop_quiz() return in %eax?
171
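One way to check this answer is to translate the assembly back into C. The sketch below is a hypothetical reconstruction (the original loop_quiz.c is not shown, and the name loop_quiz_model is made up here). It mirrors the do-while shape of the jne loop, so it assumes the argument is at least 1.

```c
#include <assert.h>

/* Hypothetical C model of the assembly: %edx is sum, %eax is i, %edi is n. */
int loop_quiz_model(int n)
{
    assert(n >= 1);      /* the jne loop always runs its body at least once */
    int sum = 0;         /* movl $0, %edx */
    int i = 0;           /* movl $0, %eax */
    do {
        sum += i;        /* addl %eax, %edx */
        i += 1;          /* addl $1, %eax */
    } while (i != n);    /* cmpl %edi, %eax ; jne .LOOP */
    return sum;          /* movl %edx, %eax ; ret */
}
```

With n = 19 the body adds 0 + 1 + ... + 18 = 18*19/2 = 171.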
Here is the sumArrayRows function we have been using in class as an example. This time, both the dimensions and the data type of the array have changed:
  #define M 2
  #define N 8
  short sumArrayRows (short a[M][N]) {
    short sum = 0;
    for ( int i=0; i<M; i++ )
      for ( int j=0; j<N; j++ )
        sum += a[i][j];
    return sum;
  }
  int main(int argc, char* argv[]){
    // prepare random number generator
    time_t t;
    srand((unsigned) time(&t));
    short a[M][N];
    // INITIALIZE ARRAY WITH RANDOM NUMBERS
    for ( int i=0; i<M; i++ )
      for ( int j=0; j<N; j++ )
        a[i][j] = rand() % 256;
    int sumArrayRowsResult = sumArrayRows(a);
    printf("sumArrayRows took %lf seconds.\n", cpu_time_used);
    exit(EXIT_SUCCESS);
  }
What is the total size of the array "a" stored on the program stack?
a. 64 bytes
b. 16 bytes
c. 128 bytes
d. 32 bytes
32 bytes
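The arithmetic is M * N * sizeof(short) = 2 * 8 * 2 = 32. A minimal sketch that lets the compiler do the same calculation, assuming a 2-byte short as on x86-64 (the helper name array_bytes is made up for illustration):

```c
#include <stddef.h>

#define M 2
#define N 8

/* Size in bytes of a local array declared as short a[M][N]. */
size_t array_bytes(void)
{
    short a[M][N];
    return sizeof a;     /* M * N * sizeof(short) */
}
```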
You now simulate the memory access trace using the csim-ref cache simulator, with these parameters given to the simulator: ./csim-ref -s 1 -E 4 -b 3 -v -t trace_matTrans_4x4.txt The parameters mean there is one set index bit, and 3 block offset bits. Each set has 4 associative ways. What kind of cache is this? a. 16-byte direct mapped cache with 2 sets b. 4-way fully associative 32-byte cache c. 4-way set-associative 64-byte cache d. 4-way set-associative 8-byte cache
4-way set-associative 64-byte cache
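The capacity follows from the three parameters: 2^s sets, E lines per set, and 2^b bytes per block. A small sketch of that formula (the function name is hypothetical, and the meaning of -s, -E, -b is taken from the question text):

```c
/* Total data capacity in bytes of a cache with 2^s sets, E lines per set,
   and 2^b bytes per block. */
unsigned long cache_capacity_bytes(unsigned s, unsigned E, unsigned b)
{
    unsigned long sets = 1UL << s;
    unsigned long block_bytes = 1UL << b;
    return sets * E * block_bytes;
}
```

For s = 1, E = 4, b = 3 this is 2 * 4 * 8 = 64 bytes, and more than one set with more than one way per set makes it set-associative.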
According to this source, https://en.wikichip.org/wiki/amd/microarchitectures/zen%2B, the L2 cache of the AMD Zen+ microarchitecture (CPU) from 2018 has the following cache design parameters:
  L2 Cache:
    512 KiB, 8-way set associative
    1,024 sets
    64 B line size
    Write-back policy
    12 cycles latency
Knowing that the Zen+ CPU is a 64-bit architecture, meaning that the number of bits in each memory address is m=64, how many bits are in the tag of each L2 cache line?
48
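The tag is whatever is left of the address after the set index and block offset are removed: t = m - s - b, with s = log2(1024) = 10 and b = log2(64) = 6. A sketch of that calculation (both helper names are made up here, and the set and line counts are assumed to be powers of two):

```c
/* Number of bits needed to index `count` items; assumes count is a power of two. */
static unsigned log2_exact(unsigned long count)
{
    unsigned bits = 0;
    while (count > 1) {
        count >>= 1;
        bits++;
    }
    return bits;
}

/* Tag width t = m - s - b for an m-bit address. */
unsigned tag_bits(unsigned m, unsigned long num_sets, unsigned long line_bytes)
{
    return m - log2_exact(num_sets) - log2_exact(line_bytes);
}
```

Under the question's assumption of m = 64, this gives 64 - 10 - 6 = 48.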
Here is a simple C program to perform matrix transpose on a 4x4 square matrix:
  size_t n = 4;
  int A [ n*n ];
  int B [ n*n ];
  for ( size_t i=0; i<n; i++ ) {
    for ( size_t j=0; j<n; j++ ) {
      B[ j*n + i ] = A[ i*n + j ];
    }
  }
You can assume as the CPU executes this code, the variables n, i, and j are all in registers. The only memory accesses are accesses to matrix A and matrix B. How much memory access will there be?
a. 16 writes to matrix A, 16 reads from matrix B
b. 64 bytes loaded from matrix A, 64 bytes stored to matrix B
c. 32 reads from matrix A, 32 writes to matrix B
d. 32 bytes loaded from matrix A, 32 bytes stored to matrix B
64 bytes loaded from matrix A, 64 bytes stored to matrix B
Now, let's also give names for the int elements in matrix B:
  matrix B:
    q r s t
    u v w x
    y z α β
    γ δ ε ζ
After the seventh memory access (L 403249c,4), the content of the cache is as follows:
  Set index 0:
    Way 0 --> a, b
    Way 1 --> q, r
    Way 2 --> u, v
    Way 3 --> y, z
  Set index 1:
    Way 0 --> c, d
    Way 1 -->
    Way 2 -->
    Way 3 -->
The next and eighth memory access (S 40304b0,4) is caused by storing to the γ element of matrix B. γ is not in the cache, and it will map to set index 0 in the cache. But set index 0 is full, so one line in set index 0 has to be evicted. Consider two options for the cache replacement policy: First-In First-Out (FIFO), and Least Recently Used (LRU). Under each policy, which line gets evicted to accommodate γ, δ?
a. FIFO: (u, v) LRU: (u, v)
b. FIFO: (a, b) LRU: (q, r)
c. FIFO: (q, r) LRU: (q, r)
d. FIFO: (a, b) LRU: (a, b)
FIFO: (a, b) LRU: (q, r)
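The two policies differ only in which timestamp they compare. The sketch below is illustrative bookkeeping, not csim-ref's actual implementation: each line records the access number that filled it and the access number that last touched it, and the struct and function names are made up here.

```c
#define WAYS 4

/* One cache line's bookkeeping. */
struct line_info {
    int filled_at;   /* access number that brought the line in */
    int last_used;   /* access number of the most recent fill or hit */
};

/* FIFO evicts the line that was filled earliest. */
int fifo_victim(const struct line_info set[WAYS])
{
    int victim = 0;
    for (int w = 1; w < WAYS; w++)
        if (set[w].filled_at < set[victim].filled_at)
            victim = w;
    return victim;
}

/* LRU evicts the line whose most recent use is oldest. */
int lru_victim(const struct line_info set[WAYS])
{
    int victim = 0;
    for (int w = 1; w < WAYS; w++)
        if (set[w].last_used < set[victim].last_used)
            victim = w;
    return victim;
}
```

From the first seven accesses, set 0 holds (a, b) filled at access 1 and hit again at access 3, (q, r) filled at access 2, (u, v) filled at access 4, and (y, z) filled at access 6. FIFO therefore picks way 0, (a, b), while LRU picks way 1, (q, r), because the hit on b refreshed (a, b).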
Now, here is the sumArrayCols function we have been using as a demonstration in class. It has similarly been modified to handle different array dimensions and data type:
  #define M 2
  #define N 8
  short sumArrayCols (short a[M][N]) {
    short sum = 0;
    for ( int j=0; j<N; j++ )
      for ( int i=0; i<M; i++ )
        sum += a[i][j];
    return sum;
  }
Similar to the sumArrayRows function, the relevant lines of the memory trace are 16 load records. The FIRST record is the same as sumArrayRows:
  L 1fff000110,2
What is the SECOND record for sumArrayCols?
a. L 1fff000120,4
b. L 1fff000112,2
c. L 1fff000118,2
d. L 1fff000120,2
L 1fff000120,2
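The second access in column order is a[1][0], which sits one full row (N = 8 shorts, so 16 bytes) past a[0][0]. A sketch of the row-major address calculation, assuming 2-byte elements as the ",2" in the trace indicates (elem_addr is a hypothetical helper):

```c
#include <stdint.h>

#define N 8            /* columns per row, as in the question */
#define ELEM_BYTES 2   /* size of a short, per the trace records */

/* Address of a[i][j] for an array stored row-major starting at base. */
uint64_t elem_addr(uint64_t base, int i, int j)
{
    return base + (uint64_t)(i * N + j) * ELEM_BYTES;
}
```

With base 0x1fff000110, a[1][0] is at 0x1fff000110 + 16 = 0x1fff000120.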
You compile your program with gcc, making sure to provide the -O0 compiler flag to disable any compiler optimizations:
  gcc -Wall -Werror -m64 -O0 -o matTrans matTrans.c
You run your matrix transpose program with the following command:
  valgrind --tool=lackey --trace-mem=yes ./matTrans tests/matrix_a_4x4.txt 2> trace_matTrans_4x4.txt
The relevant records in the memory trace are 32 lines in total. Here are just the first five lines:
  L 04032490,4
  S 04030480,4
  L 04032494,4
  S 04030490,4
  L 04032498,4
What is the next line of the record?
a. M 040304a0,4
b. S 04030500,4
c. S 040304a0,4
d. L 0403249c,4
S 040304a0,4
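Each loop iteration loads A[i*n + j] and then stores B[j*n + i], so the k-th record can be computed directly. The sketch below assumes 4-byte ints and takes the base addresses from the first load and first store in the trace (naive_access is a made-up helper name).

```c
#include <stdint.h>

/* Address touched by the k-th access (k = 0, 1, 2, ...) of the naive
   transpose; *is_store is set to 1 for a store to B, 0 for a load from A. */
uint64_t naive_access(uint64_t a_base, uint64_t b_base, unsigned n,
                      unsigned k, int *is_store)
{
    unsigned i = (k / 2) / n;   /* outer loop index */
    unsigned j = (k / 2) % n;   /* inner loop index */
    *is_store = (int)(k % 2);   /* each iteration loads, then stores */
    if (*is_store)
        return b_base + (uint64_t)(j * n + i) * 4;   /* B[j*n + i] */
    return a_base + (uint64_t)(i * n + j) * 4;       /* A[i*n + j] */
}
```

The sixth record is k = 5: i = 0, j = 2, a store to B[8] at 0x4030480 + 32 = 0x40304a0.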
Based on the exploration so far, it is clear that our first version of the matrix transpose program has poor spatial locality in accessing matrix B. An improved version of matrix transpose uses cache blocking to improve locality:
  size_t n = 4;
  int A [ n*n ];
  int B [ n*n ];
  size_t block = 2;
  for ( size_t i=0; i<n; i+=block ) {
    for ( size_t j=0; j<n; j+=block ) {
      for ( size_t i1=i; i1<i+block; i1++ ) {
        for ( size_t j1=j; j1<j+block; j1++ ) {
          B[ j1*n + i1 ] = A[ i1*n + j1 ];
        }
      }
    }
  }
You can assume as the CPU executes this code, the variables n, block, i, j, i1, and j1 are all in registers. The only memory accesses are accesses to matrix A and matrix B. You compile your new program with gcc, making sure to provide the -O0 compiler flag to disable any compiler optimizations:
  gcc -Wall -Werror -m64 -O0 -o blockedMatTrans blockedMatTrans.c
You run your new matrix transpose program with the following command:
  valgrind --tool=lackey --trace-mem=yes ./blockedMatTrans tests/matrix_a_4x4.txt 2> trace_blockedMatTrans_4x4.txt
The relevant records in the memory trace are again 32 lines in total. Here are just the first seven lines:
  L 4032490,4
  S 4030480,4
  L 4032494,4
  S 4030490,4
  L 40324a0,4
  S 4030484,4
  L 40324a4,4
What is the next line of the record?
a. L 4030494,4
b. S 40304a0,4
c. S 4030488,4
d. S 4030494,4
S 4030494,4
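The expected trace can be reproduced by running the same four nested loops and recording addresses instead of moving data. This is a sketch under stated assumptions: 4-byte ints, base addresses read off the first load and store of the trace, and a struct and function name invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct access {
    char op;         /* 'L' for a load from A, 'S' for a store to B */
    uint64_t addr;
};

/* Record the accesses of the blocked transpose in program order.
   out must have room for 2*n*n entries; returns the number written. */
size_t blocked_trace(uint64_t a_base, uint64_t b_base,
                     size_t n, size_t block, struct access *out)
{
    assert(block > 0 && n % block == 0);   /* same assumption as the quiz code */
    size_t k = 0;
    for (size_t i = 0; i < n; i += block)
        for (size_t j = 0; j < n; j += block)
            for (size_t i1 = i; i1 < i + block; i1++)
                for (size_t j1 = j; j1 < j + block; j1++) {
                    out[k].op = 'L';
                    out[k].addr = a_base + (i1 * n + j1) * 4;   /* A[i1*n + j1] */
                    k++;
                    out[k].op = 'S';
                    out[k].addr = b_base + (j1 * n + i1) * 4;   /* B[j1*n + i1] */
                    k++;
                }
    return k;
}
```

The eighth record comes from i1 = 1, j1 = 1 in the first block: a store to B[5] at 0x4030480 + 20 = 0x4030494.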
You compile some C source code containing a function called setnae_quiz(). The gcc command you use is:
  gcc -Og -Wall -Werror -std=c99 -S -fno-inline setnae_quiz.c -o setnae_quiz.s
In the resulting assembly code, setnae_quiz() is implemented as:
  setnae_quiz:
    cmpq %rsi, %rdi
    setb %al
    ret
Which of the following is the C source code for setnae_quiz()?
a. bool setnae_quiz ( unsigned long x, unsigned long y ) { return x!<y; }
b. bool setnae_quiz ( unsigned long x, unsigned long y ) { return x>=y; }
c. bool setnae_quiz ( unsigned long x, unsigned long y ) { return x<y; }
d. bool setnae_quiz ( signed long x, signed long y ) { return x<y; }
bool setnae_quiz ( unsigned long x, unsigned long y ) { return x<y; }
You compile some C source code containing a function called setnz_quiz(). The gcc command you use is:
  gcc -Og -Wall -Werror -std=c99 -S -fno-inline setnz_quiz.c -o setnz_quiz.s
In the resulting assembly code, setnz_quiz() is implemented as:
  setnz_quiz:
    cmpb %sil, %dil
    setnz %al
    ret
Which of the following is the C source code for setnz_quiz()?
a. bool setnz_quiz ( char x, char y ) { return x==y; }
b. bool setnz_quiz ( char x, char y ) { return x!=0 || y!=0; }
c. bool setnz_quiz ( int x, int y ) { return x!=y; }
d. bool setnz_quiz ( char x, char y ) { return x!=y; }
bool setnz_quiz ( char x, char y ) { return x!=y; }
For better visualization, we now refer to the int elements of matrix A as follows:
  matrix A:
    a b c d
    e f g h
    i j k l
    m n o p
Based on the cache design parameters, into which cache set will each element be placed?
a. cache index 0: a, b, c, d, i, j, k, l
   cache index 1: e, f, g, h, m, n, o, p
b. cache index 0: a, b, e, f, i, j, m, n
   cache index 1: c, d, g, h, k, l, o, p
c. cache index 0: a, b, c, d
   cache index 1: e, f, g, h
   cache index 2: i, j, k, l
   cache index 3: m, n, o, p
d. cache index 0: a, b, i, j
   cache index 1: c, d, k, l
   cache index 2: e, f, m, n
   cache index 3: g, h, o, p
cache index 0: a, b, e, f, i, j, m, n cache index 1: c, d, g, h, k, l, o, p
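The set index is the s bits sitting just above the b block-offset bits. A sketch of the extraction, using s = 1 and b = 3 from the csim-ref command and assuming matrix A starts at 0x4032490 as the first load in the trace suggests (set_index is a hypothetical name):

```c
#include <stdint.h>

/* Set index for a cache with s index bits and b block-offset bits. */
unsigned set_index(uint64_t addr, unsigned s, unsigned b)
{
    return (unsigned)((addr >> b) & ((1ULL << s) - 1));
}
```

Element a at 0x4032490 and element b at 0x4032494 share a block in set 0, c at 0x4032498 and d at 0x403249c land in set 1, and the pattern repeats every 16 bytes, so each row of A splits its first two ints into set 0 and its last two into set 1.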
We simulate the memory trace for sumArrayRows using the 16-byte, four-way, fully-associative cache using the following command:
  ./csim-ref -v -s 0 -E 4 -b 2 -l 0 -t ./sumArray/trace_sumArrayRows
csim-ref gives the following step-by-step simulation result:
  L 1fff000110,2 miss
  L 1fff000112,2 hit
  L 1fff000114,2 miss
  L 1fff000116,2 hit
  L 1fff000118,2 miss
  L 1fff00011a,2 hit
  L 1fff00011c,2 miss
  L 1fff00011e,2 hit
  L 1fff000120,2 miss eviction
  L 1fff000122,2 hit
  L 1fff000124,2 miss eviction
  L 1fff000126,2 hit
  L 1fff000128,2 miss eviction
  L 1fff00012a,2 hit
  L 1fff00012c,2 miss eviction
  L 1fff00012e,2 hit
The very first load is a cache miss because the cache is initially "cold," or in other words empty. This type of cache miss is called a ______________ miss. The summary statistics that csim-ref reports are: hits: ______, misses: ______, evictions: _____. Since each cache block is 4 bytes, each cache block can hold _____ adjacent short ints. This provides hardware support for ________ locality in memory accesses.
compulsory, 8, 8, 4, 2, spatial
Now, we consider a 16-byte, four-way, fully-associative cache. Since the capacity of the cache is 16 bytes, the array "a" in our example _________________ fit inside the cache. We can deduce that the block size for this cache is ______ bytes per block, so the block offset size is b=2 bits. For a memory trace record such as:
  L 1fff000116,2
the 2-bit block offset is __________. The tag bits are all the rest of the bits not part of the block offset.
does not, 4, 0b10
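The block offset is simply the low b bits of the address. A short sketch (block_offset is a made-up helper name):

```c
#include <stdint.h>

/* Low b bits of the address: the byte position inside its cache block. */
uint64_t block_offset(uint64_t addr, unsigned b)
{
    return addr & ((1ULL << b) - 1);
}
```

The last hex digit of 0x1fff000116 is 6, which is 0110 in binary, so the low two bits are 10.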
Using the improved code that takes advantage of cache blocking, you simulate the cache accesses on the same cache as before. Here are the first eight memory access records, as reported by csim-ref with the verbose printout:
  L 4032490,4 miss
  S 4030480,4 miss
  L 4032494,4 hit
  S 4030490,4 miss
  L 40324a0,4 miss
  S 4030484,4 ????
  L 40324a4,4 ????
  S 4030494,4 ????
Whether the sixth, seventh, and eighth accesses hit or miss in the cache has been blanked out. What is the correct outcome?
a. miss, miss, miss
b. hit, hit, hit
c. hit, miss, hit
d. miss, hit, miss
hit, hit, hit
You compile some C source code containing a function called matrix_access(). The gcc command you use is:
  gcc -Og -Wall -Werror -std=c99 -S -fno-inline matrix_access.c -o matrix_access.s
In the resulting assembly code, matrix_access() is implemented as:
  matrix_access:
    movq (%rdi,%rsi,8), %rax
    movl (%rax,%rdx,4), %eax
    ret
Which of the following is the C source code for matrix_access()?
a. int matrix_access (int array[][], long row, long col) { return array[row][col]; }
b. int matrix_access (int** array, long row, long col) { return array[col][row]; }
c. long matrix_access (long** array, long row, long col) { return array[row][col]; }
d. int matrix_access (int** array, long row, long col) { return array[row][col]; }
int matrix_access (int** array, long row, long col) { return array[row][col]; }
You compile some C source code containing a function called movzwl_quiz(). The gcc command you use is:
  gcc -Og -Wall -Werror -std=c99 -S -fno-inline movzwl_quiz.c -o movzwl_quiz.s
In the resulting assembly code, movzwl_quiz() is implemented as:
  movzwl_quiz:
    movzwl %di, %eax
    ret
Which of the following is the C source code for movzwl_quiz()?
a. int movzwl_quiz ( unsigned char input ) { return input; }
b. int movzwl_quiz ( signed short input ) { return input; }
c. long movzwl_quiz ( signed char input ) { return input; }
d. int movzwl_quiz ( unsigned short input ) { return input; }
int movzwl_quiz ( unsigned short input ) { return input; }
You compile some C source code containing a familiar function called quizSwap(). The gcc command you use is:
  gcc -Og -Wall -Werror -std=c99 -S -fno-inline quizSwap.c -o quizSwap.s
In the resulting assembly code, quizSwap() is implemented as:
  quizSwap:
    movl (%rsi), %eax
    movl %edi, (%rsi)
    ret
Which of the following is the C source code for quizSwap()?
a. int* quizSwap ( int a, int* b ) { int temp = *b; *b = a; a = temp; return b; }
b. int quizSwap ( int a, int* b ) { int temp = *b; *b = a; a = temp; return a; }
c. int quizSwap ( int a, int* b ) { int temp = *b; *b = a; a = temp; return *b; }
d. int quizSwap ( int* a, int b ) { int temp = b; b = *a; *a = temp; return *a; }
int quizSwap ( int a, int* b ) { int temp = *b; *b = a; a = temp; return a; }
You compile some C source code containing a function called leaq_quiz(). The gcc command you use is:
  gcc -Og -Wall -Werror -std=c99 -S -fno-inline leaq_quiz.c -o leaq_quiz.s
In the resulting assembly code, leaq_quiz() is implemented as:
  leaq_quiz:
    leaq 24(%rdi,%rsi,8), %rax
    ret
Which of the following is the C source code for leaq_quiz()?
a. long * leaq_quiz ( long * ptr, long index ) { return &ptr[index+3]; }
b. long leaq_quiz ( long * ptr, long index ) { return ptr[index+3]; }
c. long * leaq_quiz ( long * ptr, long index ) { return &ptr[8*index+3]; }
d. long * leaq_quiz ( long * ptr, long index ) { return &ptr[index+24]; }
long * leaq_quiz ( long * ptr, long index ) { return &ptr[index+3]; }
You compile some C source code containing a function called movsbq_quiz(). The gcc command you use is:
  gcc -Og -Wall -Werror -std=c99 -S -fno-inline movsbq_quiz.c -o movsbq_quiz.s
In the resulting assembly code, movsbq_quiz() is implemented as:
  movsbq_quiz:
    movsbq %dil, %rax
    ret
Which of the following is the C source code for movsbq_quiz()?
a. unsigned long int movsbq_quiz ( unsigned char input ) { return input; }
b. long int movsbq_quiz ( signed char input ) { return input; }
c. long int movsbq_quiz ( unsigned char input ) { return input; }
d. signed int movsbq_quiz ( signed char input ) { return input; }
long int movsbq_quiz ( signed char input ) { return input; }
Here are the first eight memory access records, this time as reported by csim-ref with the verbose printout:
  L 4032490,4 miss
  S 4030480,4 miss
  L 4032494,4 hit
  S 4030490,4 miss
  L 4032498,4 miss
  S 40304a0,4 ????
  L 403249c,4 ????
  S 40304b0,4 ????
Whether the sixth, seventh, and eighth accesses hit or miss in the cache has been blanked out. What is the correct outcome?
a. hit, hit, hit
b. miss, miss, miss
c. miss, hit, hit
d. miss, hit, miss
miss, hit, miss
Now, you use csim-ref to simulate the memory trace from sumArrayCols running with the 16-byte, four-way, fully-associative cache using the following command:
  ./csim-ref -v -s 0 -E 4 -b 2 -l 0 -t ./sumArray/trace_sumArrayCols
csim-ref will report the following summary statistics: hits:8 misses:8 evictions:4. The questions below are about how it reached that conclusion. The first four lines of the simulation trace are the following:
  L 1fff000110,2 ___________
  L 1fff000120,2 ___________
  L 1fff000112,2 ___________
  L 1fff000122,2 ___________
miss, miss, hit, hit
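The summary statistics can be reproduced with a tiny simulator. The sketch below is not csim-ref itself: it models a single set of four lines with 2^b-byte blocks and assumes LRU replacement, which the question does not state. All names are invented for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define WAYS 4

struct stats {
    int hits;
    int misses;
    int evictions;
};

/* Simulate a fully associative cache (one set, WAYS lines, 2^b-byte blocks)
   with LRU replacement over n addresses. */
struct stats simulate_lru(const uint64_t *addrs, size_t n, unsigned b)
{
    uint64_t tags[WAYS] = {0};
    size_t last_used[WAYS] = {0};
    int valid[WAYS] = {0};
    struct stats st = {0, 0, 0};

    for (size_t t = 0; t < n; t++) {
        uint64_t tag = addrs[t] >> b;   /* no set index bits, so the rest is tag */
        int way = -1;
        for (int w = 0; w < WAYS; w++)
            if (valid[w] && tags[w] == tag)
                way = w;
        if (way >= 0) {
            st.hits++;
        } else {
            st.misses++;
            /* Prefer an empty line; otherwise take the least recently used one. */
            for (int w = 0; w < WAYS && way < 0; w++)
                if (!valid[w])
                    way = w;
            if (way < 0) {
                way = 0;
                for (int w = 1; w < WAYS; w++)
                    if (last_used[w] < last_used[way])
                        way = w;
                st.evictions++;
            }
            valid[way] = 1;
            tags[way] = tag;
        }
        last_used[way] = t;
    }
    return st;
}
```

In column order the first two loads, a[0][0] and a[1][0], miss into two different blocks, and the next two, a[0][1] and a[1][1], hit because each 4-byte block already holds the neighbouring short. The pattern repeats for every pair of columns, and the four blocks loaded after the cache fills each force an eviction.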
  pushq %rbx
  popq %rbx
Which of the following sets of assembly instructions has the same behavior as the above? (You can actually find a segment of assembly code, substitute away all the push and pop instructions, recompile the binary, and get the same working program.)
a. movq %rbx, %rsp
   addq $-8, %rsp
   movq %rsp, %rbx
   addq $8, %rsp
b. movq %rbx, -8(%rsp)
   addq $-8, %rsp
   movq (%rsp), %rbx
   addq $8, %rsp
c. movq %rbx, -4(%rsp)
   addq $-4, %rsp
   movq (%rsp), %rbx
   addq $4, %rsp
d. movq %rbx, 8(%rsp)
   addq $8, %rsp
   movq (%rsp), %rbx
   addq $-8, %rsp
movq %rbx, -8(%rsp) addq $-8, %rsp movq (%rsp), %rbx addq $8, %rsp
You test sumArrayRows using valgrind to extract a trace (recording) of the memory loads and stores. The command to do the trace is:
  valgrind --tool=lackey --basic-counts=no --trace-mem=yes --log-fd=1 ./sumArray > trace
The relevant records for sumArrayRows are the following 16 loads:
  L 1fff000110,2
  L 1fff000112,2
  L 1fff000114,2
  L 1fff000116,2
  L 1fff000118,2
  L 1fff00011a,2
  L 1fff00011c,2
  L 1fff00011e,2
  L 1fff000120,2
  L 1fff000122,2
  L 1fff000124,2
  L 1fff000126,2
  L 1fff000128,2
  L 1fff00012a,2
  L 1fff00012c,2
  L 1fff00012e,2
What is a possible assembly instruction corresponding to the first load record?
a. movw %cx, (%rax,%rsi,2)
b. movw (%rax,%rsi,2), %cx
c. movq (%rax,%rsi,2), 0x1fff000110
d. movq 0x1fff000110, %rcx
movw (%rax,%rsi,2), %cx
Assuming the same L2 cache design parameters as the previous question, suppose the CPU core performed this memory access: L 040334a0,4 indicating a load from memory to CPU of the contents at the memory address 0x040334a0. If this memory block is found in the L2 cache, what is the tag value? a. tag = 0x40334 b. tag = 0x40 c. tag = 0x4033 d. tag = 0x403
tag = 0x403
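With the Zen+ L2 parameters quoted earlier in this set (1,024 sets and 64 B lines, so s = 10 and b = 6), the tag is the address shifted right by 16 bits. A sketch of that step (cache_tag is a hypothetical name):

```c
#include <stdint.h>

/* Tag of an address for a cache with s set-index bits and b block-offset bits. */
uint64_t cache_tag(uint64_t addr, unsigned s, unsigned b)
{
    return addr >> (s + b);
}
```

Shifting by 16 bits drops exactly four hex digits, so 0x040334a0 becomes 0x0403.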
You compile some C source code containing a function called displacement(). The gcc command you use is:
  gcc -Og -Wall -Werror -std=c99 -S -fno-inline addressing_modes_quiz.c -o addressing_modes_quiz.s
In the resulting assembly code, displacement() is implemented as:
  displacement:
    movw $-32768, 6(%rdi)
    ret
Which of the following is the C source code for displacement()?
a. void displacement ( int * ptr ) { ptr[6] = 0x8000; }
b. void displacement ( signed char * ptr ) { ptr[6] = 0x80; }
c. void displacement ( short * ptr ) { ptr[3] = 0x8000; }
d. void displacement ( signed int * ptr ) { ptr[3] = -32768; }
void displacement ( short * ptr ) { ptr[3] = 0x8000; }
You compile some C source code containing a function called index_and_displacement(). The gcc command you use is:
  gcc -Og -Wall -Werror -std=c99 -S -fno-inline addressing_modes_quiz.c -o addressing_modes_quiz.s
In the resulting assembly code, index_and_displacement() is implemented as:
  index_and_displacement:
    movl $-1, 8(%rdi,%rsi,4)
    ret
Which of the following is the C source code for index_and_displacement()?
a. void index_and_displacement ( long * ptr, long index ) { ptr[index+1] = -1; }
b. void index_and_displacement ( int * ptr, int index ) { ptr[4*index+8] = 0xFFFFFFFF; }
c. void index_and_displacement ( int * ptr, long index ) { ptr[index+2] = 0xFFFFFFFF; }
d. void index_and_displacement ( long * ptr, long index ) { ptr[index+8] = 0xFFFFFFFFFFFFFFFF; }
void index_and_displacement ( int * ptr, long index ) { ptr[index+2] = 0xFFFFFFFF; }
