Design of Parallel Algorithms - Test 1

What is a FIFO replacement policy?

A FIFO (First In, First Out) policy replaces the oldest cache line, based on the order in which lines were brought into the cache.
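As an illustration, FIFO eviction on a tiny fully associative cache can be sketched in a few lines (the 3-line capacity and access trace are made-up example values):

```python
from collections import deque

def fifo_access(addresses, capacity):
    """Simulate a tiny fully associative cache with FIFO replacement.
    Returns the lines resident at the end and the miss count."""
    cache = deque()          # front = oldest line brought into the cache
    misses = 0
    for addr in addresses:
        if addr not in cache:
            misses += 1
            if len(cache) == capacity:
                cache.popleft()   # evict the oldest line, regardless of use
            cache.append(addr)
        # note: a hit does NOT change FIFO order (unlike LRU)
    return list(cache), misses

resident, misses = fifo_access(['A', 'B', 'C', 'A', 'D'], capacity=3)
# 'D' evicts 'A' even though 'A' was just reused — FIFO ignores recency
```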

What is pipelining? Why do modern processors use this feature?

- A technique where multiple instructions are processed concurrently, each in a different stage
- Modern processors use this feature to increase throughput, allowing for faster execution of instructions and improved performance

What components are in a von Neumann architecture?

- CPU
- Memory
- Input/output devices

What is out-of-order instruction handling? Why do modern processors use this feature?

- Execution of instructions in a non-sequential order to improve efficiency
- Modern processors use this feature to optimize resource utilization and minimize idle time

What steps can we divide an addition instruction into to implement pipelining?

- Instruction fetch
- Operand fetch
- Execution and result storage

Which of the following features of a modern architecture are not features of a von Neumann architecture?

- Out-of-order instruction handling
- Instruction-level parallelism

What is a bank conflict and what does it do to memory systems?

A bank conflict occurs when multiple memory accesses target the same memory bank simultaneously, causing a delay in one or more accesses and reducing memory system performance.

What is a communicator?

A communicator is an MPI object that groups processes for communication and provides a context for message passing.

What is von Neumann architecture?

A computer architecture model, which is based on a program concept, where both data and instructions are stored in the same memory.

Define Core, Socket, and Node

A core is an individual processing unit within a CPU, a socket refers to the physical connector on the motherboard that houses a CPU, and a node is a single computing entity within a distributed or parallel computing system.

What is a critical section? When is a critical section used in software? Is it used in shared address space or distributed address space architectures?

A critical section is a portion of code that accesses shared resources and must be executed by only one thread at a time; it is used in shared address space architectures to protect shared data from concurrent modifications.

What is a direct-to-memory architecture?

A design where data is transferred directly between I/O devices and memory, bypassing the CPU; this reduces the CPU's workload and increases overall system performance.

What is a k-way set associative cache? What does k represent?

A k-way set associative cache is a compromise between direct mapped and fully associative caches, with k representing the number of cache lines in the set where a memory block can be placed.
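A small sketch of how a block address maps to a set (the cache sizes are made-up example values); note that k = 1 recovers a direct mapped cache and k = number of lines recovers a fully associative one:

```python
def cache_set(block_addr, num_lines, k):
    """Set index a memory block maps to in a k-way set associative cache.
    num_lines total cache lines are grouped into num_lines // k sets,
    each holding k ways; the block can go in any of the k ways of its set."""
    num_sets = num_lines // k
    return block_addr % num_sets

# k = 1: direct mapped (one candidate line); k = num_lines: fully associative
```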

What effect does stride have on cache line efficiency?

A large stride (accessing memory locations far apart) can reduce cache line efficiency, as it may not utilize the entire cache line before eviction.
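This effect can be approximated numerically; a sketch assuming 8 elements per cache line (e.g., 8-byte doubles in a 64-byte line — illustrative values):

```python
import math

def line_utilization(stride, line_elems=8):
    """Approximate fraction of each fetched cache line that is actually
    used when reading every stride-th element of a large array.
    line_elems = elements per cache line (assumed 8, e.g. doubles in 64 B)."""
    if stride >= line_elems:
        return 1 / line_elems            # only one element used per line fetched
    return math.ceil(line_elems / stride) / line_elems

# stride 1 uses 100% of each line; stride 8 (in doubles) wastes 7/8 of it
```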

What is a prefetch data stream?

A prefetch data stream is a technique where data is fetched from memory before it is needed, reducing latency by having the data available when the CPU requires it.

What is a processor rank? How does one find the rank of a process within a communicator?

A processor rank is a unique identifier for a process within a communicator. The rank can be found using the MPI_Comm_rank function.

What property would a program have that would cause performance degradation due to TLB misses?

A program with a large working set or non-contiguous memory accesses could cause performance degradation due to increased TLB misses, as it increases the likelihood that needed address translations are evicted from the TLB.

What is a stored program?

A sequence of instructions and data stored in the computer's memory; keeping the program in memory enables the computer to execute it directly, without external control, to perform specific tasks.

What is a stall compared to a cache miss?

A stall occurs when the CPU waits for data to be fetched from memory, while a cache miss is an unsuccessful attempt to find data in the cache, leading to access of slower memory.

What is instruction level parallelism (ILP)?

A technique that allows multiple instructions to be executed simultaneously, exploiting parallelism within a single instruction stream to improve performance

What is branch prediction?

A technique used to guess the outcome of a branching instruction, allowing a processor to speculatively execute subsequent instructions and improve overall performance

What is speculative execution?

A technique where a processor predicts the outcome of branches or operations and executes instructions ahead of time. If the prediction is correct, performance is improved; otherwise, the speculatively executed instructions are discarded.

What is prefetching?

A technique where the processor fetches data or instructions from memory before they are needed, reducing the time spent waiting on memory and improving performance.

What is a thread context? What does it consist of?

A thread context consists of the state information needed to execute a thread, including registers, program counter, and stack pointer.

How big is a typical TLB?

A typical TLB can store around 64 to 4096 entries, depending on the specific architecture and implementation.

What is affinity in thread systems?

Affinity in thread systems refers to the preference or tendency for a thread to be executed on a specific processor or core, improving cache locality and performance.

What is Amdahl's Law? How is it used to estimate parallel execution time?

Amdahl's Law estimates parallel speedup from the fraction f of a program that can be parallelized and the number of processors p: speedup = 1 / ((1 − f) + f/p). It highlights how the sequential portion limits overall parallel performance.
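A quick sketch of the law (f and p values in the example are made up):

```python
def amdahl_speedup(f, p):
    """Amdahl's Law: speedup on p processors when a fraction f of the
    work is perfectly parallelizable and (1 - f) remains serial."""
    return 1.0 / ((1.0 - f) + f / p)

# even with infinitely many processors, speedup is capped at 1 / (1 - f):
# e.g. f = 0.9 can never exceed 10x, no matter how large p gets
```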

What does it mean when an MPI program is loosely synchronous?

An MPI program is loosely synchronous if processes execute independently but periodically synchronize during communication or other shared operations.

What is arithmetic intensity and how do we compute it?

Arithmetic intensity is the ratio of computational operations to memory accesses in an algorithm. It is computed by dividing the total number of operations by the total amount of data transferred.

How does one distribute an array across processors in MPI? What is the local numbering? What is the global numbering?

Arrays can be distributed across processors using data decomposition techniques like block, cyclic, or block-cyclic distributions. Local numbering refers to the index of an element within a process, while global numbering refers to the index of the element in the entire distributed array.
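The local-to-global mapping for a block distribution can be sketched serially (this particular remainder-spreading convention is one common choice, not the only one):

```python
def block_distribution(n, p, rank):
    """Global index range [lo, hi) of a length-n array owned by process
    `rank` under a block distribution over p processes, with the remainder
    spread over the first n % p ranks."""
    base, rem = divmod(n, p)
    lo = rank * base + min(rank, rem)
    size = base + (1 if rank < rem else 0)
    return lo, lo + size     # global index = lo + local index

# e.g. n = 10 over p = 3: rank 0 owns 0..3, rank 1 owns 4..6, rank 2 owns 7..9
```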

What does it mean if message passing operations are blocking?

Blocking message passing operations cause the calling process to wait until the operation is complete, potentially causing synchronization between processes.

What does it mean if message passing operations are buffered?

Buffered message passing operations temporarily store messages in a buffer before they are transmitted or received, allowing for asynchronous communication.

What is cache coherence?

Cache coherence is the consistency of shared resource data in a multi-core system, ensuring that all cores see the most recent version of the data to avoid data inconsistency and race conditions.

What is the difference between Cache memory and Dynamic Memory?

Cache memory is high-speed, small-capacity memory directly accessible by the CPU, while dynamic memory (DRAM) is larger-capacity memory with slower access times used for main system memory.

What are cache aware and cache oblivious algorithms?

Cache-aware algorithms are designed with explicit knowledge of cache sizes and properties to optimize performance, while cache-oblivious algorithms do not rely on specific cache information but still achieve good cache performance through general principles

What additional type of cache miss occurs in multi-core systems?

Coherence misses occur in multi-core systems when one core modifies data that another core has cached, causing inconsistencies.

What is a DAG and how is it used in describing explicit parallelism?

Directed acyclic graph (DAG) is a graph with directed edges and no cycles, used to describe explicit parallelism by representing dependencies among tasks, enabling parallel execution of independent tasks.

What is false sharing?

False sharing occurs when two or more cores in a multi-core system access different data elements within the same cache line, causing unnecessary cache invalidations and performance degradation.

What is a hardware prefetch and how is it different from prefetch intrinsics?

Hardware prefetch is an automatic, CPU-managed prefetching mechanism, while prefetch intrinsics are explicit software instructions inserted by the programmer to initiate prefetching.

What is the difference between hyperthreading and multi-threading?

Hyperthreading is a hardware-based technique that enables a single processor to execute multiple threads simultaneously, while multithreading is a more general term for running multiple threads concurrently, either through hardware or software.

How many calls does it take to receive a message using a non-blocking protocol? What MPI communication functions implement non-blocking point-to-point communication?

It takes two calls to receive a message using a non-blocking protocol: MPI_Irecv to initiate the receive and MPI_Wait or MPI_Test to wait for or check completion. MPI_Irecv and MPI_Isend implement non-blocking point-to-point communication.

What is Little's Law? What does this law tell us about the effect of latency on computing systems?

Little's Law states that the average number of items in a queuing system is equal to the arrival rate multiplied by the average waiting time. It shows that higher latency in a computing system increases the number of items waiting in queues, reducing overall system throughput.
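Applied to memory systems, Little's Law (L = λW) says the concurrency needed to sustain a given bandwidth grows with latency. A sketch with made-up but plausible numbers:

```python
def required_concurrency(bandwidth_bytes_per_ns, latency_ns, bytes_per_request=64):
    """Little's Law applied to memory: number of outstanding requests
    (e.g. 64-byte cache lines, an assumed size) needed in flight to
    sustain a given bandwidth at a given latency."""
    in_flight_bytes = bandwidth_bytes_per_ns * latency_ns   # L = lambda * W
    return in_flight_bytes / bytes_per_request

# e.g. 25.6 GB/s (25.6 bytes/ns) at 100 ns latency needs 40 lines in flight;
# double the latency and the required concurrency doubles too
```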

What is logically and physically distributed memory?

Logically distributed memory refers to memory that is logically separated but physically located within the same system, while physically distributed memory is physically separated across multiple systems.

What is loop tiling and what is it used for?

Loop tiling is a code optimization technique that reorganizes nested loops into smaller, fixed-size blocks to exploit cache locality and improve cache utilization, resulting in better overall performance.

What is the difference between MPI_Reduce and MPI_Allreduce?

MPI_Allreduce performs the same reduction operation as MPI_Reduce but distributes the result to all processes in the communicator instead of just the root process.

What does MPI_Alltoall do?

MPI_Alltoall performs an all-to-all data exchange, where each process sends distinct data to and receives data from every other process in the communicator.

What function does MPI_Barrier perform?

MPI_Barrier provides synchronization among processes in a communicator, ensuring all processes reach the barrier before any continue execution.

What function does MPI_Bcast perform?

MPI_Bcast broadcasts a message from one process (the root) to all other processes in a communicator.

What is MPI_COMM_WORLD?

MPI_COMM_WORLD is a predefined communicator that includes all processes running in an MPI program.

What does MPI_Comm_split do?

MPI_Comm_split creates new communicators by dividing an existing communicator into disjoint groups based on specified criteria.

What is the purpose of MPI_Finalize?

MPI_Finalize terminates the MPI environment, cleaning up resources and ensuring proper program exit.

What is the purpose of MPI_Init?

MPI_Init initializes the MPI environment, allowing an MPI program to run across multiple processes.

What function does MPI_Reduce perform?

MPI_Reduce combines data from all processes in a communicator using a specified operation (e.g., summation) and stores the result at the root process.

What does MPI_Scan do?

MPI_Scan performs a parallel prefix operation, computing a partial reduction (e.g., cumulative sum) of data across processes in the communicator.
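The result of an inclusive sum scan can be shown serially: after the call, rank i holds the reduction of the contributions of ranks 0..i. A sketch with made-up per-rank values:

```python
from itertools import accumulate

# one contributed value per "rank" (illustrative numbers)
values = [3, 1, 4, 1, 5]

# what an inclusive sum scan would leave on each rank:
# rank i receives values[0] + values[1] + ... + values[i]
inclusive_scan = list(accumulate(values))   # [3, 4, 8, 9, 14]
```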

What does MPI_Scatter do? How about MPI_Gather?

MPI_Scatter distributes equal-sized chunks of an array from the root process to all processes in the communicator. MPI_Gather collects data from all processes in the communicator and assembles it into an array at the root process.

What do MPI functions return?

MPI functions return an error code, indicating success or the type of error encountered.

Why are memory banks used in memory systems?

Memory banks are used to increase parallelism in memory systems, allowing multiple memory accesses to occur simultaneously, thus improving memory bandwidth and performance.

What is NUMA?

Non-Uniform Memory Access (NUMA) is a memory architecture in multi-processor systems where memory access times vary depending on the memory location relative to the processor, as opposed to uniform access times in UMA systems.

What is Parallel Speedup and Efficiency? How are they defined in terms of serial and parallel running time?

Parallel Speedup is the ratio of serial running time to parallel running time, while Efficiency is the speedup divided by the number of processors; both metrics evaluate the performance improvement of parallel execution.
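These two definitions are direct to compute (timings in the example are made up):

```python
def speedup(t_serial, t_parallel):
    """Parallel speedup: serial running time over parallel running time."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, p):
    """Parallel efficiency: speedup divided by the processor count p.
    1.0 means ideal scaling; lower values indicate parallel overhead."""
    return speedup(t_serial, t_parallel) / p

# e.g. 100 s serially, 20 s on 8 processors: 5x speedup, 62.5% efficiency
```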

How are ranks numbered within a communicator?

Ranks are numbered from 0 to N-1, where N is the total number of processes in the communicator.

Be able to describe the special case of SPMD and how it relates to Flynn's taxonomy.

SPMD (Single Program, Multiple Data) is a special case where multiple processors execute the same program on different data, fitting into the MIMD category of Flynn's taxonomy.

What is spatial locality?

Spatial locality refers to the tendency of programs to access memory locations that are close to recently accessed locations, allowing for more efficient caching.

What formula do we use to estimate the execution time for pipelined instructions?

T = (n + k − 1) × t, where:
- n = number of instructions
- k = number of pipeline stages
- t = time per stage
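A quick check of the formula against the unpipelined case (example numbers are made up):

```python
def pipelined_time(n, k, t):
    """Time for n instructions on a k-stage pipeline with stage time t:
    k*t to fill the pipeline for the first result, then one result per t."""
    return (n + k - 1) * t

def unpipelined_time(n, k, t):
    """Same work with no overlap: every instruction takes all k stages."""
    return n * k * t

# 1000 instructions, 5 stages, 1 ns per stage: 1004 ns vs 5000 ns unpipelined
```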

What is temporal locality?

Temporal locality is the tendency of programs to access the same memory locations repeatedly over a short period, allowing for more efficient caching.

What is the TLB? What function does it have in the memory hierarchy?

The Translation Lookaside Buffer (TLB) is a small, fast cache that stores mappings between virtual and physical memory addresses, reducing the time taken for address translation during memory accesses.

What is the fetch, execute, store cycle?

The basic operational process of a computer, where the CPU fetches an instruction from memory, decodes and executes it, then stores the result back in memory or a register

What role does the color argument to MPI_Comm_split perform?

The color argument determines the group to which a process belongs in the newly created communicators.

What is the diameter and bisection width of common networks such as rings, 2-D and 3-D arrays, hypercubes, and fat-trees?

For p nodes: a ring has diameter ⌊p/2⌋ and bisection width 2; a 2-D mesh has diameter 2(√p − 1) and bisection width √p; a 3-D mesh has diameter 3(p^(1/3) − 1) and bisection width p^(2/3); a hypercube has diameter log₂ p and bisection width p/2; a fat tree has diameter on the order of 2 log p and, with full-bandwidth links toward the root, bisection width p/2. Rings have large diameters, while hypercubes and fat trees combine small diameters with high bisection widths for better scalability.
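Two of the simplest cases can be computed directly; a sketch using the standard formulas for rings and hypercubes:

```python
def ring_metrics(p):
    """(diameter, bisection width) of a p-node ring: the farthest node is
    halfway around, and cutting the ring in half severs exactly 2 links."""
    return p // 2, 2

def hypercube_metrics(p):
    """(diameter, bisection width) of a hypercube with p = 2^d nodes:
    diameter is the dimension d, and p/2 links cross any balanced cut."""
    d = p.bit_length() - 1      # dimension, assuming p is a power of two
    return d, p // 2

# an 8-node ring: diameter 4, bisection 2; a 16-node hypercube: diameter 4, bisection 8
```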

What is the difference between static and dynamic scheduling?

The difference between static and dynamic scheduling is that static scheduling assigns tasks to processors at compile time, while dynamic scheduling assigns tasks during runtime, allowing for greater adaptability to changing workloads and system conditions.

What is the first-touch phenomenon?

The first-touch phenomenon refers to the memory allocation policy in NUMA systems where memory is allocated on the same node where a thread first accesses the data, potentially affecting data locality and performance.

What is n_(1/2)?

n_(1/2) is the number of operations (e.g., the instruction count or vector length) a pipeline must process to reach half of its asymptotic peak rate; it characterizes how quickly the pipeline amortizes its startup cost.

What role does the key argument to MPI_Comm_split perform?

The key argument determines the relative ordering of processes within each new communicator.

What is the motivation for multi-core architectures?

The motivation for multi-core architectures is to increase performance and energy efficiency by allowing parallel execution of multiple tasks on separate processor cores within a single chip.

What is control flow?

The order in which instructions are executed within a program; it may involve branching, looping, or conditional statements.

What is the roofline model? What defines the roofline?

The roofline model is a performance analysis tool that visually represents the relationship between arithmetic intensity and achievable performance. The roofline is defined by the system's peak performance and memory bandwidth limitations.
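A minimal sketch of the roofline bound, with made-up peak and bandwidth numbers:

```python
def attainable_gflops(ai, peak_gflops, bw_gb_s):
    """Roofline model: attainable performance is capped by either the
    compute peak or the memory roof (bandwidth x arithmetic intensity)."""
    return min(peak_gflops, bw_gb_s * ai)

# the ridge point is at peak/bandwidth flops-per-byte: kernels below it
# are memory bound, kernels above it are compute bound
low_ai  = attainable_gflops(0.5, peak_gflops=100, bw_gb_s=50)   # memory bound
high_ai = attainable_gflops(10,  peak_gflops=100, bw_gb_s=50)   # compute bound
```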

What three pieces of information are used to pass sending data to a collective communication operation?

The sending data are described by three pieces of information: the send buffer address, the count of elements, and the MPI datatype.

What is Thread Parallelism? What is the difference between a thread and a process?

Thread Parallelism is the parallel execution of multiple threads within a single process; a thread is a lightweight unit of execution, while a process is a heavier-weight, independent execution environment with its own memory space.

How can we use the roofline model to optimize code performance?

We can use the roofline model to identify performance bottlenecks, understand the impact of architectural constraints, and guide optimization efforts by targeting areas where improvements in arithmetic intensity or memory access can increase performance.

What is a cache line? What property of program memory accesses do cache lines exploit?

a block of memory that is stored in cache. Cache lines exploit spatial locality, as they contain contiguous memory locations

What is a symmetric multiprocessor?

a computer system with multiple identical processors sharing the same memory and I/O resources, offering uniform access and control over resources.

What is the fork-join mechanism?

a parallel programming pattern where a task is split into subtasks, which are executed concurrently and then synchronized at a join point before continuing.

What is a fully associative cache?

allows any memory block to be stored in any cache location, providing flexibility in data placement.

What is a direct mapped cache?

assigns each memory block to exactly one cache location, determined by a specific mapping function
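The usual mapping function is block number modulo the number of lines; a sketch assuming 64-byte lines (an illustrative size), which also shows how two addresses can conflict on the same line:

```python
def direct_mapped_line(addr, num_lines, line_bytes=64):
    """Cache line index for an address in a direct mapped cache:
    (block number) mod (number of lines)."""
    return (addr // line_bytes) % num_lines

# addresses exactly num_lines * line_bytes apart map to the same line,
# so alternating between them evicts each other (a conflict miss pattern)
```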

What is Flynn's taxonomy?

classifies computer architectures as SISD (Single Instruction, Single Data), SIMD (Single Instruction, Multiple Data), MISD (Multiple Instruction, Single Data), or MIMD (Multiple Instruction, Multiple Data), with examples like traditional sequential processors (SISD), GPUs (SIMD), fault-tolerant systems (MISD), and multi-core CPUs (MIMD)

What are the three types of cache misses in single processor systems?

compulsory (first access), capacity (cache is full), and conflict (multiple memory addresses map to the same cache location) misses.

What two properties of typical program memory access patterns are exploited by cache systems?

exploit spatial (accessing nearby memory locations) and temporal (accessing the same memory locations repeatedly) locality

Task Parallelism

focuses on distributing separate, independent tasks across multiple processing units to execute concurrently, improving overall performance.

What role do registers play in the memory hierarchy?

high-speed storage elements within the CPU that store and provide fast access to data and instructions during processing

What are atomic operations?

indivisible and uninterruptible operations that ensure consistency and synchronization in concurrent systems.

Data Parallelism

involves applying the same operation to multiple data elements simultaneously, taking advantage of the independence of data to execute operations in parallel.

What are cache tags?

metadata used to uniquely identify the memory address of data stored in a cache line.

How does the memory hierarchy address the von Neumann bottleneck?

mitigates the von Neumann bottleneck by organizing different memory types based on access speed and capacity, providing faster access to frequently used data and reducing the performance impact of slower memory.

Functional Parallelism

refers to the parallel execution of different functions or tasks, exploiting the independence of tasks to improve overall performance.

What is an LRU replacement policy?

replaces the least recently accessed cache line when making room for new data.
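A small LRU simulation (capacity and access trace are made-up example values); on this trace FIFO would evict the just-reused 'A', while LRU keeps it:

```python
from collections import OrderedDict

def lru_access(addresses, capacity):
    """Tiny cache with LRU replacement: a hit moves the line to
    most-recently-used; a miss evicts the least recently used line
    when the cache is full. Returns resident lines and miss count."""
    cache = OrderedDict()                  # insertion order = recency order
    misses = 0
    for addr in addresses:
        if addr in cache:
            cache.move_to_end(addr)        # refresh recency on a hit
        else:
            misses += 1
            if len(cache) == capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[addr] = None
    return list(cache), misses

resident, misses = lru_access(['A', 'B', 'C', 'A', 'D'], capacity=3)
# 'D' evicts 'B' (least recently used), not the recently reused 'A'
```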

What role do cache systems play in the memory hierarchy?

store frequently accessed data and instructions, reducing latency by providing faster access to this information than main memory

What is bisection width?

the minimum number of edges that must be removed to divide a network into two equal parts, while diameter is the longest shortest path between any pair of nodes in the network, both used to describe network properties.

What is the von Neumann bottleneck?

the performance limitation in computer systems due to the separation of memory and processing units, causing a delay in data transfer between them.

What is cache mapping?

the process of determining the relationship between memory addresses and cache locations.

What is granularity in parallelism?

the size or duration of tasks executed in parallel, with fine-grained tasks being small and short-lived, and coarse-grained tasks being large and long-lived.

What is latency and bandwidth in memory systems?

the time it takes to access data from memory, while bandwidth refers to the amount of data that can be transferred per unit of time

