Computer architecture Ch. 7: Parallel processors

Multiprocessor:

A computer system with at least two processors. It stands in contrast to a uniprocessor, which has one processor and is increasingly hard to find today.

Reduction:

A function that processes a data structure and returns a single value.

Direct memory access (DMA):

A mechanism that provides a device controller with the ability to transfer data directly to or from the memory without involving the processor.

Multicore microprocessor:

A microprocessor containing multiple processors ("cores") in a single integrated circuit. Virtually all microprocessors today in desktops and servers are multicore.

Uniform memory access (UMA):

A multiprocessor in which latency to any word in main memory is about the same no matter which processor requests the access.

MIMD or multiple instruction streams, multiple data streams:

A multiprocessor.

Crossbar network:

A network that allows any node to communicate with any other node in one pass through the network.

Fully connected network:

A network that connects processor-memory nodes by supplying a dedicated communication link between every node.

Multistage network:

A network that supplies a small switch at each node.

Shared memory multiprocessor (SMP):

A parallel processor with a single physical address space.

Process:

A process includes one or more threads, the address space, and the operating system state. Hence, a process switch usually invokes the operating system, whereas a thread switch does not.

Device driver:

A program that controls an I/O device that is attached to the computer.

7.4 Hardware multithreading

A related concept to MIMD, especially from the programmer's perspective, is hardware multithreading. While MIMD relies on multiple processes or threads to try to keep many processors busy, hardware multithreading allows multiple threads to share the functional units of a single processor in an overlapping fashion to try to utilize the hardware resources efficiently.

Receive message routine:

A routine used by a processor in machines with private memories to accept a message from another processor.

Send message routine:

A routine used by a processor in machines with private memories to pass a message to another processor.

Cluster:

A set of computers connected over a local area network that function as a single large multiprocessor.

Parallel processing program:

A single program that runs on multiple processors simultaneously.

Lock:

A synchronization device that allows access to data to only one processor at a time.
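
A minimal sketch of a lock in use (POSIX threads, which this set mentions under PARSEC; the function and variable names are illustrative): four threads increment a shared counter, and the mutex ensures only one of them updates it at a time.

```c
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                                  /* shared data */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* the lock */

/* Each thread must hold the lock while updating the shared counter. */
static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* only one thread past this point */
        counter++;
        pthread_mutex_unlock(&lock);  /* let another thread in */
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);  /* 400000: no updates were lost */
    return 0;
}
```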

Thread:

A thread includes the program counter, the register state, and the stack. It is a lightweight process; whereas threads commonly share a single address space, processes don't.

Nonuniform memory access (NUMA):

A type of single address space multiprocessor in which some memory accesses are much faster than others depending on which processor asks for which word.

SISD or single instruction stream, single data stream:

A uniprocessor.

Fine-grained multithreading:

A version of hardware multithreading that implies switching between threads after every instruction. The primary disadvantage of fine-grained multithreading is that it slows down the execution of the individual threads, since a thread that is ready to execute without stalls will be delayed by instructions from other threads.

Coarse-grained multithreading:

A version of hardware multithreading that implies switching between threads only after significant events, such as a last-level cache miss. Coarse-grained multithreading suffers, however, from a major drawback: because the pipeline must be emptied or frozen when a stall occurs, there is a start-up overhead that limits its ability to overcome throughput losses, especially from shorter stalls. It is much more useful for reducing the penalty of high-cost stalls, where pipeline refill time is negligible compared to the stall time.

Simultaneous multithreading (SMT):

A version of multithreading that lowers the cost of multithreading by utilizing the resources of a multiple-issue, dynamically scheduled microarchitecture.

While they share some common goals with servers, WSCs (warehouse scale computers) have three major distinctions:

- Ample, easy parallelism
- Operational costs count
- Scale and the opportunities/problems associated with scale

OpenMP:

An API for shared memory multiprocessing in C, C++, or Fortran that runs on UNIX and Microsoft platforms. It includes compiler directives, a library, and runtime directives.
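
A minimal sketch of OpenMP in use (assuming a compiler flag such as GCC's -fopenmp; the function name is illustrative): the compiler directive splits the loop across threads, and the reduction clause combines per-thread partial sums into the single value that a reduction returns.

```c
/* Sum an array in parallel. The pragma is a compiler directive;
   reduction(+:sum) gives each thread a private copy of sum and
   combines the copies into one value when the loop ends. */
double parallel_sum(const double *a, int n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```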

Memory-mapped I/O:

An I/O scheme in which portions of the address space are assigned to I/O devices, and reads and writes to those addresses are interpreted as commands to the I/O device.
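
A sketch under assumed hardware (the UART device, its addresses, and its register layout are entirely hypothetical): with memory-mapped I/O, an ordinary store to an address assigned to the device is interpreted by the hardware as a command.

```c
#include <stdint.h>

/* Hypothetical UART whose registers are mapped into the address space
   at 0x10000000; volatile marks each access as a real device access. */
#define UART_DATA   (*(volatile uint8_t *)0x10000000u)
#define UART_STATUS (*(volatile uint8_t *)0x10000004u)

void uart_send(char c) {
    UART_DATA = (uint8_t)c;  /* an ordinary store; the device controller
                                interprets a write here as "transmit c" */
}
```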

Interrupt-driven I/O:

An I/O scheme that employs interrupts to indicate to the processor that an I/O device needs attention.

NVIDIA systems are the representative GPU architecture for this book.

CUDA (Compute Unified Device Architecture) is NVIDIA's programming environment, which enables the programmer to write C programs that execute on GPUs.
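
A minimal sketch of such a C program (standard CUDA constructs; the kernel and function names are illustrative, and the arrays are assumed to already reside in GPU memory): each GPU thread computes one element of a vector sum.

```c
#include <cuda_runtime.h>

/* CUDA kernel: each GPU thread computes one element of c = a + b. */
__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread index */
    if (i < n)
        c[i] = a[i] + b[i];
}

/* Host side: launch enough 256-thread blocks to cover n elements.
   d_a, d_b, d_c are assumed to be pointers into GPU memory. */
void vec_add_on_gpu(const float *d_a, const float *d_b, float *d_c, int n) {
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vec_add<<<blocks, threads>>>(d_a, d_b, d_c, n);
    cudaDeviceSynchronize();  /* wait for the GPU to finish */
}
```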

Clusters:

Collections of computers connected via I/O over standard network switches to form a message-passing multiprocessor. Given the separate memories, each node of a cluster runs a distinct copy of the operating system, which also gives clusters higher dependability.

Message passing:

Communicating between multiple processors by explicitly sending and receiving information.
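
A minimal sketch using MPI as a representative message-passing library (MPI is not named in this set; the calls shown are standard MPI routines): every processor runs the same program and branches on its rank, with rank 0 explicitly sending a value that rank 1 explicitly receives.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* which processor am I? */

    if (rank == 0) {
        value = 42;
        /* send message routine: pass data to processor 1 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* receive message routine: accept data from processor 0 */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```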

7.6 Introduction to graphics processing units


Hardware multithreading:

Increasing utilization of a processor by switching to another thread when one thread is stalled.

Network bandwidth:

Informally, the peak transfer rate of a network; can refer to the speed of a single link or the collective transfer rate of all links in the network.

Software as a service (SaaS) :

Rather than selling software that is installed and run on customers' own computers, software is run at a remote site and made available over the Internet typically via a Web interface to customers. SaaS customers are charged based on use versus on ownership.

Weak scaling:

Speed-up achieved on a multiprocessor while increasing the size of the problem proportionally to the increase in the number of processors. Bigger problems often need more data, which is an argument for weak scaling.

Strong scaling:

Speed-up achieved on a multiprocessor without increasing the size of the problem.

As mentioned above, a shared memory multiprocessor (SMP) is one that offers the programmer a single physical address space across all processors—which is nearly always the case for multicore chips—although a more accurate term would have been shared-address multiprocessor.

The alternative is to have a separate address space per processor, which requires that sharing be explicit; we'll describe this option in COD Section 6.7 (Clusters, warehouse scale computers, and other message-passing multiprocessors).

Bisection bandwidth:

The bandwidth between two equal parts of a multiprocessor. This measure is for a worst-case split of the multiprocessor.

SPMD or single program, multiple data streams:

The conventional MIMD programming model, where a single program runs across all processors. Programmers normally write one program that runs on every processor of a MIMD computer, relying on conditional statements when different processors should execute distinct sections of code; this is the normal way to program a MIMD computer.

Synchronization:

The process of coordinating the behavior of two or more processes, which may be running on different processors.

Polling:

The process of periodically checking the status of an I/O device to determine the need to service the device.
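
A sketch reusing the hypothetical memory-mapped UART from the memory-mapped I/O entry above: the loop repeatedly checks the status register and services the device once it reports data is available.

```c
#include <stdint.h>

/* Same hypothetical register map as the memory-mapped I/O sketch. */
#define UART_DATA   (*(volatile uint8_t *)0x10000000u)
#define UART_STATUS (*(volatile uint8_t *)0x10000004u)
#define RX_READY    0x01u

char uart_poll_recv(void) {
    while ((UART_STATUS & RX_READY) == 0)
        ;                    /* busy-wait: periodically check device status */
    return (char)UART_DATA;  /* device ready: service it by reading the data */
}
```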

Arithmetic intensity:

The ratio of floating-point operations in a program to the number of data bytes accessed by a program from main memory.
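
A worked example (the DAXPY loop is illustrative, not from this set): counting floating-point operations against bytes moved from main memory gives the loop's arithmetic intensity.

```c
/* DAXPY: y[i] = a*x[i] + y[i].
   Per iteration: 2 floating-point ops (a multiply and an add) and 24 bytes
   of memory traffic (load x[i], load y[i], store y[i]; 8-byte doubles),
   so arithmetic intensity = 2 / 24 ≈ 0.083 FLOPs per byte. */
void daxpy(int n, double a, const double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```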

SIMD or single instruction stream, multiple data streams:

The same instruction is applied to many data streams, as in a vector processor. SIMD computers operate on vectors of data. For example, a single SIMD instruction might add 64 numbers by sending 64 data streams to 64 ALUs to form 64 sums within a single clock cycle.
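
A small sketch using x86 AVX intrinsics as the SIMD vehicle (intrinsics are not mentioned in this set; this assumes an AVX-capable processor): each _mm256_add_pd below is a single instruction applied to four data elements at once.

```c
#include <immintrin.h>

/* Add two arrays with AVX: each _mm256_add_pd is one SIMD instruction
   that forms 4 double-precision sums at once. Assumes n is a multiple
   of 4 to keep the sketch short. */
void simd_add(const double *a, const double *b, double *c, int n) {
    for (int i = 0; i < n; i += 4) {
        __m256d va = _mm256_loadu_pd(&a[i]);  /* load 4 doubles */
        __m256d vb = _mm256_loadu_pd(&b[i]);
        _mm256_storeu_pd(&c[i], _mm256_add_pd(va, vb));  /* 4 sums in one op */
    }
}
```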

Task-level parallelism or process-level parallelism:

Utilizing multiple processors by running independent programs simultaneously.

7.10 Multiprocessor benchmarks and performance models

Linpack is a collection of linear algebra routines, and the routines for performing Gaussian elimination constitute what is known as the Linpack benchmark. The DGEMM routine in the example in COD Section 3.5 (Floating Point) represents a small fraction of the source code of the Linpack benchmark, but it accounts for most of the execution time for the benchmark. It allows weak scaling, letting the user pick any size problem. Moreover, it allows the user to rewrite Linpack in almost any form and in any language, as long as it computes the proper result and performs the same number of floating-point operations for a given problem size. Twice a year, the 500 computers with the fastest Linpack performance are published at www.top500.org. The first on this list is considered by the press to be the world's fastest computer.

SPECrate is a throughput metric based on the SPEC CPU benchmarks, such as SPEC CPU 2006 (see COD Chapter 1 (Computer Abstractions and Technology)). Rather than report performance of the individual programs, SPECrate runs many copies of the program simultaneously. Thus, it measures task-level parallelism, as there is no communication between the tasks. You can run as many copies of the programs as you want, so this is again a form of weak scaling.

SPLASH and SPLASH 2 (Stanford Parallel Applications for Shared Memory) were efforts by researchers at Stanford University in the 1990s to put together a parallel benchmark suite similar in goals to the SPEC CPU benchmark suite. It includes both kernels and applications, including many from the high-performance computing community. This benchmark requires strong scaling, although it comes with two data sets.

The NAS (NASA Advanced Supercomputing) parallel benchmarks were another attempt from the 1990s to benchmark multiprocessors. Taken from computational fluid dynamics, they consist of five kernels. They allow weak scaling by defining a few data sets. Like Linpack, these benchmarks can be rewritten, but the rules require that the programming language be C or Fortran.

The more recent PARSEC (Princeton Application Repository for Shared Memory Computers) benchmark suite consists of multithreaded programs that use Pthreads (POSIX threads) and OpenMP (Open MultiProcessing; see COD Section 6.5 (Multicore and other shared memory multiprocessors)).

The Yahoo! Cloud Serving Benchmark (YCSB) compares the performance of cloud data services. It offers a framework that makes it easy for a client to benchmark new data services, using Cassandra and HBase as representative examples. [Cooper, 2010]

Vector lane:

One or more vector functional units and a portion of the vector register file. Inspired by lanes on highways that increase traffic speed, multiple lanes execute vector operations simultaneously.

Data-level parallelism:

Parallelism achieved by performing the same operation on independent data.

The virtues of SIMD are that all the parallel execution units are synchronized, and they all respond to a single instruction that emanates from a single program counter (PC). From a programmer's perspective, this is close to the already familiar SISD. Although every unit will be executing the same instruction, each execution unit has its own address registers, and so each unit can have different data addresses. Thus, in terms of COD Figure 6.1 (Hardware/software categorization ...), a sequential application might be compiled to run on serial hardware organized as a SISD or in parallel hardware that was organized as a SIMD. The original motivation behind SIMD was to amortize the cost of the control unit over dozens of execution units. Another advantage is the reduced instruction bandwidth and space—SIMD needs only one copy of the code that is being simultaneously executed, while message-passing MIMDs may need a copy in every processor and shared memory MIMD will need multiple instruction caches.

Here are some of the key characteristics of how GPUs differ from CPUs: GPUs are accelerators that supplement a CPU, so they do not need to be able to perform all the tasks of a CPU. This role allows them to dedicate all their resources to graphics. It's fine for GPUs to perform some tasks poorly or not at all, given that in a system with both a CPU and a GPU, the CPU can do them if needed. GPU problem sizes are typically hundreds of megabytes to gigabytes, not hundreds of gigabytes to terabytes.

These differences led to different styles of architecture: Perhaps the biggest difference is that GPUs do not rely on multilevel caches to overcome the long latency to memory, as do CPUs. Instead, GPUs rely on hardware multithreading (COD Section 6.4 (Hardware multithreading)) to hide the latency to memory. That is, between the time of a memory request and the time that data arrive, the GPU executes hundreds or thousands of threads that are independent of that request. The GPU memory is thus oriented toward bandwidth rather than latency. There are even special graphics DRAM chips for GPUs that are wider and have higher bandwidth than DRAM chips for CPUs. In addition, GPUs have traditionally had smaller main memories than conventional microprocessors. Finally, keep in mind that for general-purpose computation, you must include the time to transfer the data between CPU memory and GPU memory, since the GPU is a coprocessor. Given the reliance on many threads to deliver good memory bandwidth, GPUs can accommodate many parallel processors (MIMD) as well as many threads. Hence, each GPU processor is more highly multithreaded than a typical CPU, plus they have more processors.

Categorization of parallel hardware

Thus, a conventional uniprocessor has a single instruction stream and single data stream, and a conventional multiprocessor has multiple instruction streams and multiple data streams.

