High Performance Computing Final

[OpenMP] What is the syntax for an atomic directive in OpenMP? What form must the line following an atomic directive take?

#pragma omp atomic
The statement that follows must have the form x <op>= <expression>; where <op> is one of the following operators: + * - / & ^ | << >>
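For instance, a minimal runnable sketch (the thread count and the per-thread values are illustrative, not from the original card):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int total = 0;
    #pragma omp parallel num_threads(4)
    {
        int my_val = omp_get_thread_num() + 1;   /* each thread's private contribution */
        #pragma omp atomic
        total += my_val;   /* the statement after atomic has the form x <op>= <expression>; */
    }
    printf("total = %d\n", total);   /* 1 + 2 + 3 + 4 = 10 */
    return 0;
}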

[OpenMP] What is the syntax for a critical section in OpenMP?

#pragma omp critical

[OpenMP] What is the syntax for a parallel for loop in OpenMP?

#pragma omp parallel for num_threads(thread_count)
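A minimal runnable sketch (thread_count, the loop bound, and the loop body are illustrative):

#include <stdio.h>

int main(void) {
    const int n = 1000;
    int thread_count = 4;   /* illustrative value */
    double sum = 0.0;
    #pragma omp parallel for num_threads(thread_count) reduction(+: sum)
    for (int i = 0; i < n; i++)      /* iterations are divided among thread_count threads */
        sum += 1.0 / (i + 1);
    printf("sum = %f\n", sum);
    return 0;
}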

[CUDA] What are the rules of Host pointers?

- May be passed to and from device code
- May not be dereferenced in device code

[CUDA] What are the rules of Device pointers?

- May be passed to and from host code
- May not be dereferenced in host code

[CUDA] What are the 3 basic device memory management routines?

- cudaMalloc(void **dev_ptr, size_t size);
- cudaMemcpy(void *dst, const void *src, size_t size, cudaMemcpyKind direction); where direction is cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost
- cudaFree(void *dev_ptr);
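A sketch of the usual host-side pattern tying these three routines together (the array size and names such as h_a and d_a are illustrative):

#include <cuda_runtime.h>

int main(void) {
    const int N = 256;
    size_t size = N * sizeof(float);
    float h_a[256];                        /* host buffer */
    for (int i = 0; i < N; i++) h_a[i] = (float)i;

    float *d_a = NULL;                     /* device pointer: never dereferenced on the host */
    cudaMalloc((void **)&d_a, size);
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);   /* host -> device */
    /* ... kernel launches that use d_a would go here ... */
    cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);   /* device -> host */
    cudaFree(d_a);
    return 0;
}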

[MPI] Why use nonblocking send and recv?

- Blocking sends and receives may limit performance
- A nonblocking call lets the process perform other computation or I/O while the message is in progress and complete the message later, which may save time

[CUDA] Describe the steps of tiling.

1) Identify a tile of global memory contents that are accessed by multiple threads
2) Load the tile from global memory into on-chip (shared) memory
3) Use barrier synchronization to make sure that all threads are ready to start the phase
4) Have the threads access their data from the on-chip memory
5) Use barrier synchronization to make sure that all threads have completed the current phase
6) Move on to the next tile
(See the kernel sketch below.)
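A sketch of these phases for square matrix multiplication (the kernel name, TILE_WIDTH, and the assumption that width is a multiple of TILE_WIDTH are illustrative choices):

#define TILE_WIDTH 16

__global__ void tiledMatMul(const float *A, const float *B, float *C, int width) {
    __shared__ float tileA[TILE_WIDTH][TILE_WIDTH];   /* on-chip tiles */
    __shared__ float tileB[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float sum = 0.0f;

    for (int ph = 0; ph < width / TILE_WIDTH; ph++) {       /* one iteration per tile phase */
        /* 1-2) each thread loads one element of the current tile into shared memory */
        tileA[threadIdx.y][threadIdx.x] = A[row * width + ph * TILE_WIDTH + threadIdx.x];
        tileB[threadIdx.y][threadIdx.x] = B[(ph * TILE_WIDTH + threadIdx.y) * width + col];
        __syncthreads();                                     /* 3) wait until the tile is loaded */

        /* 4) compute using the on-chip copies */
        for (int k = 0; k < TILE_WIDTH; k++)
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        __syncthreads();                                     /* 5) finish the phase before reloading */
    }
    C[row * width + col] = sum;                              /* 6) done after the last tile */
}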

What are the steps of Foster's Methodology?

1. Partitioning - divide the computation to be performed and the data operated on by the computation into small tasks.
2. Communication - determine what communication needs to be carried out among the tasks identified in the previous step.
3. Agglomeration (aggregation) - combine the tasks and communications identified in the first step into larger tasks.
4. Mapping - assign the composite tasks identified in the previous step to processes/threads, so that communication is minimized and each process/thread gets roughly the same amount of work.

Each stream multiprocessor has how many stream processors?

8

[MPI] What is a communicator?

A collection of processes that can send messages to each other.

[MPI] What is MPI_COMM_WORLD?

A communicator containing the set of all processes created when the program was started

What is the general definition of scalable?

A program is scalable if it can handle ever increasing problem sizes.

What is the definition of strongly scalable?

A program is strongly scalable if we can keep the efficiency fixed while increasing p, without increasing the problem size.

What is the definition of weakly scalable?

A program is weakly scalable if we can keep the efficiency fixed while increasing p, provided we also increase the problem size at the same rate.

[CUDA] What is the host?

CPU and its memory

What does the control unit do?

Determines which instructions to execute

[CUDA] Which pointer should you pass into a device function?

Device pointer

[CUDA] Where is shared memory stored (in hardware)? What is the scope and lifetime? What is the relative speed of shared memory?

Shared memory is stored on-chip in each stream multiprocessor (SM). Its scope is the thread block, and its lifetime is that of the block. It is much faster than global memory.

What is the formula for the efficiency of a parallel program?

E = S / p, where S is the speedup and p is the number of processes/threads.

Explain pipelining

Functional units are arranged in stages so that the input of one functional unit is the output of the previous functional unit.

[CUDA] What is the device?

GPU and its memory

What does MPI_Allgather do? What is the header?

It concatenates the contents of each process' send_buf_p into each process' recv_buf_p, ordered by the rank of the process from which they were received.
int MPI_Allgather(void * send_buf_p, int send_count, MPI_Datatype send_type, void * recv_buf_p, int recv_count, MPI_Datatype recv_type, MPI_Comm comm);

What does MPI_Barrier do? What is the header?

It ensures that no process will return from calling it until every process in the communicator has started calling it. int MPI_Barrier(MPI_Comm comm);

[CUDA] Compare and contrast latency-oriented design and throughput oriented design.

Latency-oriented: powerful ALUs, large caches, sophisticated control.
Throughput-oriented: energy-efficient ALUs, small caches, simple control.

What does MPI_Allreduce do?

Like MPI_Reduce, it performs a reduction operation on all processes in the communicator. However, it also distributes this result to all processes. So, its header is the same as MPI_Reduce except without the dest_process.

What do multiple issue processors do?

MI processors replicate functional units and try to simultaneously execute different instructions in a program

What does MPI_Bcast do? What is the header of MPI_Bcast?

MPI_Bcast sends data from one process to all other processes in the communicator.
int MPI_Bcast(void * data, int count, MPI_Datatype datatype, int src, MPI_Comm communicator);
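A minimal sketch (the source rank 0 and the value 42 are illustrative):

#include <mpi.h>
#include <stdio.h>

int main(void) {
    MPI_Init(NULL, NULL);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 0;
    if (rank == 0) value = 42;                 /* only the source has the data initially */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("rank %d has value %d\n", rank, value);   /* every rank now prints 42 */

    MPI_Finalize();
    return 0;
}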

[MPI] How do you find the rank of a process?

MPI_Comm_rank(MPI_Comm comm, int * rank_p)

[MPI] How do you find the number of processes?

MPI_Comm_size(MPI_Comm comm, int * comm_sz_p)

What are the headers for MPI_Isend and MPI_Irecv? What does MPI_Wait do and take?

int MPI_Isend(void * msg_buf, int buf_size, MPI_Datatype buf_type, int dest, int tag, MPI_Comm comm, MPI_Request * handle_p);
int MPI_Irecv(void * msg_buf, int buf_size, MPI_Datatype buf_type, int source, int tag, MPI_Comm comm, MPI_Request * handle_p);
MPI_Wait blocks until the operation associated with the request handle completes.
- In the case of a send operation, the buffer may then be assigned new values
- In the case of a receive operation, the buffer may now be referenced
MPI_Wait(MPI_Request * handle_p, MPI_Status * status_p);
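A sketch of the nonblocking pattern (assumes the program is run with at least two processes; the message value is illustrative):

#include <mpi.h>
#include <stdio.h>

int main(void) {
    MPI_Init(NULL, NULL);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int msg = 0;
    MPI_Request req;
    if (rank == 0) {
        msg = 99;
        MPI_Isend(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        /* ... other computation or I/O can overlap with the send here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* msg may be reused only after this */
    } else if (rank == 1) {
        MPI_Irecv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
        /* ... other work ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* msg may be referenced only after this */
        printf("rank 1 received %d\n", msg);
    }
    MPI_Finalize();
    return 0;
}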

What does MPI_Reduce do? What is the header of MPI_Reduce? Which are the output variables?

MPI_Reduce performs a tree-structured reduction using an operator of choice, such as MPI_SUM. The result is received only by the destination process.
int MPI_Reduce(void * input_data_pointer, void * output_data_pointer /* out */, int count, MPI_Datatype datatype, MPI_Op operator, int dest_process, MPI_Comm comm);
The only output variable is output_data_pointer, which is significant only on dest_process.
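A minimal sketch (the per-process value and the destination rank 0 are illustrative):

#include <mpi.h>
#include <stdio.h>

int main(void) {
    MPI_Init(NULL, NULL);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local = rank + 1;    /* each process contributes rank + 1 */
    int global = 0;          /* output variable, meaningful only on the destination process */
    MPI_Reduce(&local, &global, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum = %d\n", global);   /* 1 + 2 + ... + size */

    MPI_Finalize();
    return 0;
}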

What does MPI_Scan do? What is the header?

MPI_Scan computes the partial reduction of data on a collection of processes.
int MPI_Scan(void * send_buf, void * recv_buf, int count, MPI_Datatype type, MPI_Op operator, MPI_Comm comm);

Draw diagrams of shared-memory and distributed-memory parallel systems.

Okay

What is the formula for the speedup of a parallel program?

S = Tserial / Tparallel
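For example (illustrative numbers): if Tserial = 80 seconds and Tparallel = 10 seconds on p = 16 cores, then S = 80 / 10 = 8, and the efficiency is E = S / p = 8 / 16 = 0.5.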

What is Flynn's Taxonomy?

A classification of computer architectures by their instruction and data streams:
SISD - single instruction stream, single data stream (the classic von Neumann machine)
SIMD - single instruction stream, multiple data streams
MIMD - multiple instruction streams, multiple data streams
MISD - multiple instruction streams, single data stream

SPMD

Single Program, Multiple Data

[OpenMP] List the scheduling types for OpenMP, and explain their effects. What effect does the chunksize have on each type?

Static - the iterations are assigned to the threads before the loop is executed. Each thread gets chunksize iterations at a time, handed out in a round-robin fashion.
Dynamic - the iterations are assigned to the threads while the loop is executing. Each thread is assigned a chunk of chunksize iterations and requests a new chunk when it finishes.
Guided - the iterations are assigned to the threads while the loop is executing. The first thread is assigned about (number of iterations) / (number of threads) iterations; each thread thereafter, when requesting a chunk, is assigned about (number of unassigned iterations) / (number of threads) iterations, with chunk sizes shrinking until they reach chunksize. Only the last chunk may be smaller than chunksize.
Runtime - the schedule is determined at run time by the environment variable OMP_SCHEDULE.
(A sketch follows below.)
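A minimal sketch of the clause in use (the schedule type, chunksize, and loop are illustrative); with schedule(runtime) the same loop would instead take its schedule from the environment variable OMP_SCHEDULE, e.g. OMP_SCHEDULE="guided,4":

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel for num_threads(4) schedule(dynamic, 2)
    for (int i = 0; i < 16; i++)   /* threads grab chunks of 2 iterations as they finish */
        printf("thread %d got iteration %d\n", omp_get_thread_num(), i);
    return 0;
}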

What is the program counter?

Stores address of the next instruction to be executed

What does a page table do?

Stores translations of virtual addresses into physical addresses

What is bandwidth?

The rate at which the destination receives data once it has started to receive the first byte

What is latency?

The time that elapses between the source's beginning to transmit the data and the destination's starting to receive the first byte.

[CUDA] What is a warp? How many threads are in a warp?

They are scheduling units within a stream multiprocessor. Threads in a warp execute in SIMD fashion. There are 32 threads per warp, though future GPUs may use a different number of threads per warp.

[CUDA ] What is __global__ ?

This CUDA C keyword indicates that a function:
- is executed on the device
- is called from the host

[CUDA] Where is constant memory stored (in hardware)? What is the scope and lifetime? What is the relative speed of constant memory?

Constant memory resides in device (global) memory but is cached on chip, so it acts like a cache of global memory and is much faster than uncached global memory accesses. Its scope is all threads in the grid, and its lifetime is the application.

[CUDA] Why bother using threads when we can use blocks in CUDA?

Threads within a block can communicate and synchronize. However, blocks cannot communicate with each other.

What is Amdahl's Law?

Unless virtually all of a serial program is parallelized, the possible speedup is going to be very limited — regardless of the number of cores available.

[CUDA] What are a, b, and c in the following constructor: dim3 DimGrid(a, b, c)?

a is the number of blocks in the x dimension, b is the number of blocks in the y dimension, and c is the number of blocks in the z dimension

[CUDA] What are a, b, and c in the following constructor: dim3 DimBlock(a, b, c)?

a is the number of threads in the x dimension, b is the number of threads in the y dimension, and c is the number of threads in the z dimension
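A sketch putting DimGrid and DimBlock together in a kernel launch (the kernel, the 64 x 64 problem size, and the 16 x 16 block shape are illustrative):

#include <cuda_runtime.h>

__global__ void fill(float *data, int width) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < width && col < width)
        data[row * width + col] = 1.0f;
}

int main(void) {
    const int width = 64;
    float *d_data;
    cudaMalloc((void **)&d_data, width * width * sizeof(float));
    dim3 DimBlock(16, 16, 1);                  /* threads per block in x, y, z */
    dim3 DimGrid(width / 16, width / 16, 1);   /* blocks in the grid in x, y, z */
    fill<<<DimGrid, DimBlock>>>(d_data, width);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}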

Each stream processor can do _______ in a single clock cycle.

a multiplication and an addition

The 8 stream processors (SP) execute __________________ on ____________

a single instruction sequence, different data

Simultaneous hardware multithreading (SMT)

a variation of fine-grained multithreading that allows multiple threads to use multiple functional units

[CUDA] How to access block index in device code?

blockIdx.x

What is the translation lookaside buffer?

cache for the page table

[OpenMP] How to declare that you must explicitly state the scope of variables as shared or private? And how do you then declare the variables?

default(none) shared(<list variables>) private(<list variables>)
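A minimal sketch (the variable names and thread count are illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int n = 8;         /* shared: read by every thread */
    int my_rank;       /* private: each thread gets its own copy */
    #pragma omp parallel num_threads(4) default(none) shared(n) private(my_rank)
    {
        my_rank = omp_get_thread_num();
        printf("thread %d sees n = %d\n", my_rank, n);
    }
    return 0;
}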

Grid maps to _________
Blocks map to the _______
Threads map to ________

device, stream multiprocessors (SMs), stream processors (SPs)

What does the ALU do?

executes instructions

What is the point of virtual memory?

It allows main memory to function as a cache for secondary storage.

[CUDA] How would we index an array that is spread among n/THREADS_PER_BLOCK blocks with THREADS_PER_BLOCK threads each?

index = threadIdx.x + blockIdx.x * blockDim.x
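A sketch of a kernel that uses this index (the kernel name and the device pointers d_a, d_b, d_c are illustrative):

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;   /* global index across all blocks */
    if (index < n)                                       /* guard in case n is not an exact multiple */
        c[index] = a[index] + b[index];
}

/* launched, e.g., as: vecAdd<<<n / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c, n); */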

[CUDA] Assume you have performed a kernel launch with a 2-dimensional grid and 2-dimensional blocks. Each thread is supposed to operate on a single element in a two-dimensional array Arr with known width and height. How do you get a single element?

int Col = blockIdx.x * blockDim.x + threadIdx.x;
int Row = blockIdx.y * blockDim.y + threadIdx.y;
if ((Row < height) && (Col < width)) {
    element = Arr[Row * width + Col];
}

What is the header of MPI_Recv? Which are out variables?

int MPI_Recv(void *msg_buf_p /*out */, int buf_size, MPI_Datatype buf_type, int source, int tag, MPI_Comm communicator, MPI_Status* status_p /*out */);

What is the header of MPI_Send?

int MPI_Send(void * msg_buf_p, int msg_size, MPI_Datatype msg_type, int dest, int tag, MPI_Comm communicator);
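A sketch of a matched send/receive pair (assumes at least two processes; the value 7 and tag 0 are illustrative):

#include <mpi.h>
#include <stdio.h>

int main(void) {
    MPI_Init(NULL, NULL);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int msg;
    if (rank == 0) {
        msg = 7;
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);                    /* dest = 1, tag = 0 */
    } else if (rank == 1) {
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); /* source = 0, tag = 0 */
        printf("rank 1 received %d\n", msg);
    }
    MPI_Finalize();
    return 0;
}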

What does MPI_Scatter do? What is the header?

It can be used to distribute the components of an array across multiple processes.
int MPI_Scatter(void * send_buf, int send_count, MPI_Datatype send_type, void * recv_buf_p /* out */, int recv_count, MPI_Datatype recv_type, int src_proc, MPI_Comm comm);

What does MPI_Gather do? What is the header?

It collects all of the components of an array onto the root process, ordered by the rank of the process from which they were received.
int MPI_Gather(void * send_buf, int send_count, MPI_Datatype send_type, void * recv_buf /* out */, int recv_count, MPI_Datatype recv_type, int dest_proc, MPI_Comm comm);
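A sketch combining MPI_Scatter and MPI_Gather (one element per process; the names and values are illustrative):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    MPI_Init(NULL, NULL);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *full = NULL;
    if (rank == 0) {                        /* only the root holds the whole array */
        full = malloc(size * sizeof(int));
        for (int i = 0; i < size; i++) full[i] = i * 10;
    }

    int piece;
    MPI_Scatter(full, 1, MPI_INT, &piece, 1, MPI_INT, 0, MPI_COMM_WORLD);
    piece += 1;                             /* each process works on its own piece */
    MPI_Gather(&piece, 1, MPI_INT, full, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int i = 0; i < size; i++) printf("%d ", full[i]);
        printf("\n");
        free(full);
    }
    MPI_Finalize();
    return 0;
}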

Coarse-grained hardware multithreading

only switches threads that are stalled waiting for a time-consuming operation to complete

[OpenMP] What is the format for the reduction clause in openMP?

reduction(<operator>: <variable list>)
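A minimal sketch (the + operator, the loop, and the thread count are illustrative):

#include <stdio.h>

int main(void) {
    int sum = 0;
    #pragma omp parallel for num_threads(4) reduction(+: sum)
    for (int i = 1; i <= 100; i++)   /* each thread keeps a private sum, combined at the end */
        sum += i;
    printf("sum = %d\n", sum);       /* 5050 */
    return 0;
}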

[OpenMP] What is the format of the schedule clause in openMP?

schedule(type, chunksize)

A dynamic multiple issue processor...

schedules functional units at runtime. It is also called 'superscalar'

Fine-grained hardware multithreading

the processor switches between threads after each instruction, skipping threads that are stalled

[CUDA] How to access thread index in device code?

threadIdx.x

[CUDA] What is the general idea of tiling? How is it advantageous?

• Divide the global memory content into tiles
• Focus the computation of the threads on one or a small number of tiles at each point in time
• It is advantageous because it minimizes global memory accesses by replacing them with much faster shared-memory (on-chip) accesses

