PP Test 2


MPI Gather

MPI_Gather( snd_array, snd_count, snd_type, rcv_array, rcv_count, rcv_type, dst, MPI_COMM_WORLD ); -As with Scatter, every node (including the root) executes the same call and sends its data, but here the root (dst) node is the receiver instead of the sender: it collects each node's send array into one large receive array. A sketch of the pattern is below.
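A sketch of the pattern, assuming this code sits inside main( ) after MPI_Init( ) has been called (PER_RANK and the buffer names are made up for illustration):

#define PER_RANK 4                                  // elements contributed by each rank (illustrative)

float partial[ PER_RANK ];                          // each rank fills in its own piece
float *whole = NULL;
int me, numCpus;
MPI_Comm_rank( MPI_COMM_WORLD, &me );
MPI_Comm_size( MPI_COMM_WORLD, &numCpus );

if( me == 0 )
        whole = new float [ PER_RANK * numCpus ];   // only the root needs room for everything

// every rank executes the same call; the root (dst = 0) receives PER_RANK floats from each rank:
MPI_Gather( partial, PER_RANK, MPI_FLOAT, whole, PER_RANK, MPI_FLOAT, 0, MPI_COMM_WORLD );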

Programheader

#include <stdio.h>
#include <math.h>
#include <string.h>
#include <stdlib.h>
#include <omp.h>        // used for timing, NOT for parallelism
#include "cl.h"         // this one is the OpenCL header

Combining SIMD with Parallel Threading in OpenMP 4.0

#pragma omp parallel for simd
for( int i = 0; i < ArraySize; i++ )
{
        c[ i ] = a[ i ] * b[ i ];
}
This uses multiple threads and has each thread use its SIMD units, to speed the loop up.

kernel void ArrayMult

( global const float *dA, global const float *dB, global float *dC ) -dA is the GPU (global device memory) address of the array dA, i.e. the device version of the host array A; likewise dB and dC. Each work-item computes: dC[gid] = dA[gid] * dB[gid];

OpenGL object

-An OpenGL Object is pretty much the same as a C++ object: it encapsulates a group of data items and allows you to treat them as a unified whole. -You can then create any number of Buffer Object instances, each with its own characteristics encapsulated within it. When you want to make that combination current, you just point the ArrayBuffer element of the Context at that entire struct ("bind" it). When you bind an object, all of its information comes with it.

OpenCL Events

-An event is an object that communicates the status of OpenCL commands. -When a command reaches the front of the conveyor belt (the command queue), OpenCL executes it. -If the next command can be done at the same time, OpenCL does it too. That can be a problem: if you write the A and B buffers over to the GPU device and the kernel starts executing before the buffer writing is complete, the buffers will still be empty -- events are how you enforce the ordering.

OpenCL Reduction: mask

-The mask basically says which threads (work-items) are still active, i.e. still doing the adding. The mask is ANDed with the thread number; if the result comes out non-zero, that thread is idle for this step.

OpenCL Reduction

-OpenCL does not have a built-in reduction, so we build one: A * B → prods, then Σ prods → C. -After the array multiplication, we want each work-group to sum the products within that work-group, then return them to the host in an array for final summing. -To do this, we do not put the products into a large global device array, but into a prods[ ] array that is local to each work-group. -Each work-group returns its element of C to the host, and the host adds those final numbers. We need to know where we are within the work-group (the work-item number) and which work-group we are in: prods[ work-item number within the work-group ] = dA[gid] * dB[gid]. A work-group consisting of numItems work-items can then be reduced to a sum in log2(numItems) steps; in this example, numItems = 8. A sketch of such a kernel is below.
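As a concrete illustration, here is a sketch of what such a reduction kernel might look like, assembled from the pieces described on this and the mask/offset/barrier cards (the kernel name, argument order, and variable names are illustrative, not necessarily the exact code used in class):

kernel void ArrayMultReduce( global const float *dA, global const float *dB, local float *prods, global float *dC )
{
        int gid      = get_global_id( 0 );      // where am I in the entire index space?
        int numItems = get_local_size( 0 );     // how many work-items are in this work-group?
        int tnum     = get_local_id( 0 );       // which work-item am I within this work-group?
        int wgNum    = get_group_id( 0 );       // which work-group is this?

        prods[ tnum ] = dA[ gid ] * dB[ gid ];  // all work-items produce their product in parallel

        // reduce the work-group's products to one sum in log2(numItems) steps:
        for( int offset = 1; offset < numItems; offset *= 2 )
        {
                int mask = 2*offset - 1;
                barrier( CLK_LOCAL_MEM_FENCE );          // wait until everyone's prods[ ] entry is ready
                if( ( tnum & mask ) == 0 )               // non-zero result means this thread is idle this step
                        prods[ tnum ] += prods[ tnum + offset ];
        }

        barrier( CLK_LOCAL_MEM_FENCE );
        if( tnum == 0 )
                dC[ wgNum ] = prods[ 0 ];        // work-item 0 returns this work-group's partial sum
}

The host then reads back one partial sum per work-group in dC[ ] and adds those few numbers itself.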

Why have GPUs Been Outpacing CPUs in Performance?

-Due to the nature of graphics computations, GPU chips are customized to handle streaming data; if you are streaming numbers, GPUs are great. -Another reason is that GPU chips do not need the significant amount of cache space that occupies much of the real estate on general-purpose CPU chips; that die real estate can be re-targeted to hold more cores and thus produce more processing power. (CPU chips let you jump around a lot in your code and data; GPUs do not, so they need less cache.) -Another reason is that general CPU chips contain on-chip logic to do branch prediction. This, too, takes up chip die space. -Another reason is that general CPU chips contain on-chip logic to process instructions out-of-order if the CPU is blocked and is waiting on something (e.g., a memory fetch). This, too, takes up chip die space.

Vector registers (XEON PHI)

-The vector registers are 512 bits wide = 16 floats. They can perform Fused Multiply-Add (FMA). Theoretical performance is almost 1 TFLOPS.

The mechanical equivalent of CUDA cores

-Think of a grapefruit-loading machine: the yellow balls at the bottom are grapefruit. The robot picks grapefruit up off the conveyor belt in 2 rows of 3 and puts them in the crate. -The robot makes the decisions; the 6 grippers carry out the decisions but don't have their own intelligence -- they all execute the same thing. -This is why GPU cores ARE NOT like CPU cores. -The program is the yellow robot; the grippers handle the different pieces of data. -The yellow robot is like a compute unit and the grippers are like processing elements: processing elements have no independent intelligence, they all do the same thing.

OpenCL

-Open Computing Language. -first.cpp -> calls the OpenCL API; it compiles and runs like a normal C++ program and links with the OpenCL library. -first.cl -> the GPU kernel: a C-ish program that expresses the data parallelism.

OpenCL(.cl) supports built in SIMD constructs

-OpenCL code can be vector-oriented, meaning that it can perform a single instruction on multiple data values at the same time (SIMD). -Vector data types are charn, intn, floatn, where n = 2, 4, 8, or 16, e.g. f = (float4)( 1.f, 2.f, 3.f, 4.f ); -Just because the language supports it doesn't mean the hardware does -> using these types does not automatically produce SIMD code.

OpenGL Compute Shaders: How they compare with using OpenCL.

-OpenCL requires installing a separate driver. -Compute shaders use the GLSL language; if you already write shaders for the graphics pipeline you are an OpenGL programmer, so you don't need to learn OpenCL -- just use OpenGL. -Shaders use the same OpenGL context, so there is no need to acquire and release the context. -Calls to compute shaders are more lightweight than calls to OpenCL kernels (better performance). -OpenCL has a great deal of setup. -Use compute shaders for light-weight data parallelism; if the work is heavy ("big, bad computational science"), use OpenCL.

OpenCL / OpenGL Interoperability

-Used for simulations and games. The C++ program writes initial values into the (x,y,z) vertex buffer on the GPU -> OpenCL acquires the buffer -> each OpenCL kernel reads an (x,y,z) value from the buffer -> each OpenCL kernel updates its (x,y,z) value -> each OpenCL kernel writes its (x,y,z) value back to the buffer -> OpenCL releases the buffer -> OpenGL draws using the (x,y,z) values in the buffer on the GPU, and the loop repeats.

Write the Data from the Host Buffers to the Device Buffers

-Writing the data over to the device: take the array in hA and write it to the dA array on the device, for this many bytes. -"Enqueue" means the request is kept in a queue -> the command buffer. Requests are put in the command queue/buffer, so the program on the CPU can generate them as fast as it wants and OpenCL will get to them.

status = clEnqueueWriteBuffer( cmdQueue, dA, CL_FALSE, 0, dataSize, hA, 0, NULL, NULL );
// dA       = the device array we are writing to
// CL_FALSE = don't block until done (check on it later)
// 0        = offset
// dataSize = how many bytes we are transferring
// hA       = the host array we are taking the data from
// 0, NULL, NULL = number of events to wait for, the event wait list, the event object

OpenGL Compute Shaders: What are they

-They look like OpenCL, and were designed to. OpenGL was out first, then OpenCL came along, and using them together for OpenCL/OpenGL interoperability worked great. -With OpenCL, though, you had to load a new driver & extensions, so why not make it part of OpenGL? Compute shaders are a sub-piece of OpenGL that looks like a simple OpenCL. -They have access to everything OpenGL has created. -Same work-group and work-item idea as OpenCL.

GPU architecture

-GPUs used to have sections of the chip that were task-specific: vertex shaders (only vertices), pixel shaders (only pixels). -All processors are the same now: a core can be a vertex processor one minute and a pixel processor the next. Why do GPUs have so many cores? Because the definition of a "core" has changed.

8. Compile and Link the Kernel Code

// create the kernel program on the device:
// (we pass in the source code string and tell OpenCL this is the program;
//  it will be compiled & linked)
char *strings[ 1 ];             // an array of strings
strings[ 0 ] = clProgramText;
cl_program program = clCreateProgramWithSource( context, 1, (const char **)strings, NULL, &status );
delete [ ] clProgramText;

// build (compile & link, resolving the function calls) the kernel program on the device,
// giving it the device we are doing this for and the (empty) compiler options:
char *options = { "" };
status = clBuildProgram( program, 1, &device, options, NULL, NULL );
if( status != CL_SUCCESS )
{
        // retrieve and print the error messages:
        size_t size;
        clGetProgramBuildInfo( program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &size );
        cl_char *log = new cl_char[ size ];
        clGetProgramBuildInfo( program, device, CL_PROGRAM_BUILD_LOG, size, log, NULL );
        fprintf( stderr, "clBuildProgram failed:\n%s\n", log );
        delete [ ] log;
}

10. Setup the Arguments to the Kernel Object

// The kernel has been created; its arguments are now set up through the API, one call per argument:
// the kernel object, which argument number it is, how big the argument is ( sizeof(cl_mem), the pointer ),
// and the address of the argument ( &dA ):
status = clSetKernelArg( kernel, 0, sizeof(cl_mem), &dA );
// (arguments 1 and 2 are set the same way, with &dB and &dC)

// these correspond, in order, to the kernel's parameter list:
kernel void ArrayMult( global const float *dA, global const float *dB, global float *dC )

11. Enqueue the Kernel Object for Execution

// The command queue eventually munches its way to the kernel; we want the write-buffers to happen
// before the kernel executes. You don't have to concern yourself with when the device becomes
// available -- OpenCL will grab the work as fast as it can.

// how much work there is, and how it maps onto the compute units and processing elements:
size_t globalWorkSize[ 3 ] = { NUM_ELEMENTS, 1, 1 };
size_t localWorkSize[ 3 ]  = { LOCAL_SIZE,   1, 1 };

// wait for everything already enqueued to go quiet, so the timer is accurate:
status = clEnqueueBarrier( cmdQueue );
double time0 = omp_get_wtime( );

// "ND" means the index space can be 1D, 2D, or 3D:
status = clEnqueueNDRangeKernel( cmdQueue, kernel, 1, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL );
// arguments: the queue, the kernel, the number of dimensions (1), the global work offset (NULL),
// the global and local work sizes, and the event count / wait list / event object (0, NULL, NULL)

// make sure everything has happened before stopping the timer:
status = clEnqueueBarrier( cmdQueue );
double time1 = omp_get_wtime( );

OpenCl events: waiting for one or more events

// the two enqueues that will throw these events declare them:
cl_event waitKernelA, waitKernelB;

// a wait list of the dependencies you are waiting for:
cl_event dependencies[ 2 ];
dependencies[ 0 ] = waitKernelA;
dependencies[ 1 ] = waitKernelB;

// The first event argument is the number of events to wait for. An enqueue call can take the wait
// list as its event arguments, e.g. ( ..., 2, dependencies, NULL ), so that it does not start until
// everything it depends on has finished on the OpenCL device side. For a single event you would
// pass 1 and the address of it: ( ..., 1, &waitKernelA, NULL ).

// This blocks the host until the specified events are thrown, so use it carefully!
status = clWaitForEvents( 2, dependencies );

// This is just like a barrier, but it can throw an event to be waited for:
cl_event waitMarker;
status = clEnqueueMarker( cmdQueue, &waitMarker );

13. Clean Everything Up

// good practice, although this also happens automatically when the program exits:
clReleaseKernel( kernel );
clReleaseProgram( program );
clReleaseCommandQueue( cmdQueue );
clReleaseMemObject( dA );
clReleaseMemObject( dB );
clReleaseMemObject( dC );
delete [ ] hA;
delete [ ] hB;
delete [ ] hC;

9. Create the Kernel Object

// The kernel object is the OpenCL program (i.e. ArrayMult), already compiled -- it now just needs
// to be set up. This is essentially embedded programming, because the kernel lives on a separate
// computer (the device) from the one we are making the function call from.
// We are essentially setting up a command line:
cl_kernel kernel = clCreateKernel( program, "ArrayMult", &status );
// program     = the program we compiled in step 8
// "ArrayMult" = the kernel function to call

12. Read the Results Buffer Back from the Device to the Host

// A barrier is needed before this enqueue; otherwise the queue just moves on to the next command
// on the conveyor belt and could read the buffer back before the kernel finishes.
status = clEnqueueReadBuffer( cmdQueue, dC, CL_TRUE, 0, dataSize, hC, 0, NULL, NULL );
// dC       = the device buffer we are reading from
// CL_TRUE  = block until done (this blocks the EnqueueReadBuffer itself, not the earlier EnqueueKernel)
// 0        = offset
// dataSize = how many bytes we are bringing back
// hC       = the array on the host side we are reading into
// 0, NULL, NULL = number of events, event wait list, event object

steps in creating and running an OpenCL program

1. Program header
2. Allocate the host memory buffers
3. Create an OpenCL context
4. Create an OpenCL command queue
5. Allocate the device memory buffers
6. Write the data from the host buffers to the device buffers
7. Read the kernel code from a file
8. Compile and link the kernel code
9. Create the kernel object
10. Set up the arguments to the kernel object
11. Enqueue the kernel object for execution
12. Read the results buffer back from the device to the host
13. Clean everything up

How Can You Gain Access to GPU Power?

1. Write a graphics display program (≥ 1985) -- these are very parallel. 2. Write an application that looks like a graphics display program, but uses the fragment shader to do some computation (≥ 2002) -- you put your own code on the graphics card; fragment shaders exist to do graphics (pixel effects), but things like Conway's Game of Life (pixel-comparison arithmetic) fit too. 3. Write in OpenCL (or CUDA), which looks like C++ (≥ 2006) -- compilers know how to compile it down to GPU code. CUDA = NVIDIA's Compute Unified Device Architecture. You write in a C-ish language & it is compiled into GPU code.

Why use an array of strings to hold the OpenCL program, instead of just a single string?

1. You can use the same OpenCL source and insert the appropriate "#defines" at the beginning 2. You can insert a common header file (≈ a .h file) 3. You can simulate a "#include" to re-use common pieces of code

In terms of 32-bit floating point numbers, a cache line's size is:

16 floats

Using "default(none)" in an OpenMP #pragma is:

A good idea, but not required

OpenCL:get_local_id( )

Among the work-items in the local work-group, which one am I?

Vector processing

Another term for Single Instruction Multiple Data (SIMD); it can be done on modern CPUs.

OpenGL compute shader basic idea

Application Invokes the Compute Shader to Modify the OpenGL Buffer Data (run graphics) ->Application Invokes OpenGL Rendering which Reads the Buffer Data

OpenCL code needs to be compiled in the Driver

The application program hands the OpenCL code, which lives in a separate file, to the OpenCL driver; the driver (analogous to the GLSL shader driver) does the compile and link, and the result runs on the GPU.

OPENCL

CUDA is the NVIDIA-only version; almost everything said here about OpenCL is true for CUDA as well, but OpenCL runs on everything. • There is a JavaScript implementation of OpenCL, called WebCL -- not true of CUDA • There is a JavaScript implementation of OpenGL, called WebGL -- not true of CUDA • WebCL can share data with, and interoperate with, WebGL • OpenCL consists of two parts: a C/C++-callable API and a C-ish programming language. • The programming language can run on NVIDIA GPUs, AMD GPUs, Intel CPUs, Intel GPUs, mobile devices, and (supposedly) FPGAs (Field-Programmable Gate Arrays). But OpenCL is at its best on compute devices with large amounts of data parallelism, which usually means GPUs. • You break your computational problem up into lots and lots of small pieces; each piece gets farmed out to threads on the GPU. • Each thread wakes up and is able to ask questions about where it lives in the entire collection of (thousands of) threads; from that, it can tell what data it is supposed to be working on. • OpenCL can share data with, and interoperate with, OpenGL

Prefetching

Cache fetch time can become a large fraction of the total run time. Prefetching is used to place a cache line in cache before it is to be used, thus hiding the latency of fetching from off-chip memory. There are two key issues here: 1. Issuing the prefetch at the right time 2. Issuing the prefetch at the right distance. A small sketch is below.
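As a sketch only, here is what an explicit software prefetch can look like using the GCC/Clang __builtin_prefetch( ) intrinsic; the function, the array, and the prefetch distance of 64 floats (4 cache lines ahead) are made-up values for illustration:

#define PFDIST 64                       // prefetch distance, in floats (illustrative)

float Sum( const float *a, int n )
{
        float sum = 0.;
        for( int i = 0; i < n; i++ )
        {
                if( i + PFDIST < n )
                        __builtin_prefetch( &a[ i + PFDIST ], 0, 3 );   // read access, keep in cache
                sum += a[ i ];
        }
        return sum;
}

Tuning PFDIST is the "right distance" issue; where in the loop the prefetch is issued is the "right time" issue.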

GPU platform (graphic units)

Can have one or more Devices.

Xeon Phi peak performance

Clock frequency × #cores × #vector lanes × 2 (FMA ops) / 2 (cycles to decode) = 1.091 GHz × 56 × 16 × 2 / 2 ≈ 0.98 TFLOPS

OpenCL Assembly Language: distance(), length(),normalize()

The generated code is amazingly involved -- a couple of pages of assembly before it even computes the sqrt. The normal (non-fast) functions are for real accuracy: they scale the value on the way in, compute, and rescale it on the way back out.

Create an OpenCL Command Queue

The context found in step 3 is now used here. The command queue is where you are going to send things; each device in the context can have its own command queue.
cl_command_queue cmdQueue = clCreateCommandQueue( context, device, 0, &status );
// 0       = properties
// &status = returned status (needs to be checked)

Create an OpenCL context

The context is the internal data structure framework in which your OpenCL code runs.
cl_context context = clCreateContext( NULL, 1, &device, NULL, NULL, &status );
// (A) NULL    = properties
// (B) 1       = how many devices
// (C) &device = the device to run on
// (D) NULL    = callback to say when it's done
// (E) NULL    = user data to pass in (not doing that now; do it later)
// (F) &status = returned status (can be checked)

AMD set a record for clock speed by:

Cooling the CPU with liquid nitrogen

How does MPI let the Sender perform an MPI_Send( ) even if the Receivers are not ready to MPI_Recv( )?

The data is put into an MPI transmission buffer. MPI_Send( ) only blocks until the transfer is far enough along that the array can be destroyed or re-used.

GPUs

Data-parallel programming Little user control Regular data structures Regular flow control DATA PARALLEL: thousands of things to execute! A small data set size is a few thousand. Each thread executes the same program, but operates on a different piece of data; built-in functions determine the thread # and which part of memory to get data from. -When threads block there needs to be more work to switch to: each compute unit is capable of swapping out an entire bank of threads and bringing in another 128 to continue processing (128 threads brought in, then kicked out). -Particle systems are a good example: one thread per particle. The data is left in GPU memory; OpenCL figures out where it is & OpenGL draws it.

A chunksize of 2:

Deals two for-loop passes to each thread and then goes around to each thread again, etc

When using OpenMP Tasks to apply parallelism to traversing a binary tree, how uniform (i.e., evenly spread out) among the threads the distribution of tasks is :

Depends on the compiler

Coarse-gained parallelism is:

Dividing the problem into a small number of large pieces

The difference between static and dynamic scheduling of a for-loop is:

Dynamic scheduling divides only some of the for-loop passes among the threads at first

A Private variable differs from a Shared variable in that, when using a Private variable:

Each thread has its own copy of it

Message passing interface follows SPMD

Every node is running the exact same program: Single Program Multiple Data -- all 64 computers (for example) are running the same program. -A communicator is a collection of CPUs that are capable of sending messages to each other. -Each process needs to find out which rank it is with MPI_Comm_rank( ); rank 0 is the master. A minimal skeleton is below.
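A minimal SPMD skeleton, assuming the standard MPI C API (the printed messages are just illustrative):

#include <stdio.h>
#include <mpi.h>

int main( int argc, char *argv[] )
{
        MPI_Init( &argc, &argv );

        int numCpus, me;
        MPI_Comm_size( MPI_COMM_WORLD, &numCpus );   // how many CPUs are in the communicator?
        MPI_Comm_rank( MPI_COMM_WORLD, &me );        // which one am I?

        if( me == 0 )
                printf( "I am the master of %d CPUs\n", numCpus );
        else
                printf( "I am worker rank %d\n", me );

        MPI_Finalize( );
        return 0;
}

Every rank runs this exact same program; the only thing that distinguishes them is the rank returned by MPI_Comm_rank( ).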

Moore's Law (as Gordon Moore actually phrased it) is no longer happening today

False

Our class's "Inverse Amdahl's Law" that you used in Projects #0 and #1, computes:

Fp, given S and n

CPUs

General-purpose programming Multi-core, under user control Irregular data structures: traversing a tree, following a linked list -- jumping around Irregular flow control

GPU

Graphics Processing Unit. GPU performance has increased dramatically in comparison to CPU performance.

How does openCL work?

You have a .cl file that holds the OpenCL code. Every vendor that supports OpenCL provides a driver; NVIDIA's driver, for example, takes the code and compiles it to run on NVIDIA devices. Because the .cl file is written in a C-ish language, each vendor is able to produce a compiler that compiles it to run on its own device.

WORKGROUPS / WORK ITEMS (software) vs. Compute Units (hardware)

The data is divided into work-groups; each work-group is further divided into work-items.

In multithreading, the threads all share:

Heap, Execution instructions, and Global variables

Allocate the host memory buffers

Host = the CPU; the GPU (device) is where we are going to accelerate the floating-point operations. The host arrays should be allocated on the heap or as globals, NOT as locals:
float *hA = new float [ NUM_ELEMENTS ];
OR a global array:
float hA[ NUM_ELEMENTS ];
// don't declare this as a local, or a large array will overflow the stack

OpenCL Reduction: offset

How much do you add to your current element index (e.g. to 0 or 2) to get to the next thing to add in. Together with the mask, it determines which threads are active and what each one should add. As the reduction proceeds: offset = 1, 2, 4; mask = 1, 3, 7 (mask = 2*offset - 1).

OpenCL/OpenGL

-If the data needs to be shared between OpenGL and OpenCL, OpenGL will write it. -If it's something only OpenCL needs, it is done with device memory. -Once the data is in GPU memory (the vertex buffer) it stays there; the data doesn't travel over the bus again. -The OpenGL rendering context holds the shared state/context.

Bottom line of CPU and GPU

Impossible to directly compare CPU and GPU

No stack

In GPUs

In a chip

Instruction Decode -> Scalar Unit -> Scalar Registers -> 32 KB L1 data cache -> 512 KB L2 cache per core -> ring bus. There is also an instruction cache, and the vector registers.

Intel SSE

Intel architecture supports vectorization. The most well-known form is called Streaming SIMD Extensions, or SSE. It allows four floating-point operations to happen simultaneously. -How normal scalar floating-point multiplication works: one number is in register 0 while the other is in register 1; take the contents of each, multiply them together, and put the result back into r0: mulss r1, r0. -SSE does this 3 more times at once using the extended (x) registers, which each hold 4 floating-point numbers, but the operation is the same: mulps xmm1 (src), xmm0 (dest) -> all four pairs multiplied in parallel. An intrinsics sketch of the same idea is below.
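The same idea can be expressed with SSE compiler intrinsics rather than raw assembly. This is only an illustrative sketch (it assumes SSE support and, for simplicity, that len is a multiple of 4); it is not the assembly SimdMul( ) handed out for the project:

#include <xmmintrin.h>          // SSE intrinsics

void SimdMulSSE( const float *a, const float *b, float *c, int len )
{
        for( int i = 0; i < len; i += 4 )
        {
                __m128 va = _mm_loadu_ps( &a[ i ] );    // load 4 floats from a
                __m128 vb = _mm_loadu_ps( &b[ i ] );    // load 4 floats from b
                __m128 vc = _mm_mul_ps( va, vb );       // one mulps: all 4 pairs multiplied at once
                _mm_storeu_ps( &c[ i ], vc );           // store 4 floats to c
        }
}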

Fused Multiply-Add

It allows the operation d = a*b + c; to be performed in the same amount of time as d = a*b; -- while the multiply is still working its way through the hardware, the add is also taking place (it is one instruction now), so it is done in half the time of a normal multiply-then-add (tmp = b*c; d = a + tmp;). Note that a normal a + (b*c) is not guaranteed to equal an FMA a + (b*c), because FMA rounds only once. A tiny illustration is below.
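A tiny illustration using the standard C fmaf( ) call, which compilers can map onto an FMA instruction when the target CPU has one (whether it actually fuses depends on the compiler and its flags):

#include <math.h>

float FusedMulAdd( float a, float b, float c )
{
        return fmaf( a, b, c );         // one instruction, one rounding: a*b + c
}

float SeparateMulAdd( float a, float b, float c )
{
        return a*b + c;                 // multiply, round, then add, round again
}

The two results can differ in the last bit(s) because the FMA version rounds only once; that is the "Normal ≠ FMA" point above.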

A good way to make a piece of code not Thread Safe (such as rand or strtok) is to:

Keep internal state

Hyperthreading is:

Keeping one or more extra thread states within a core

OpenCL memory

From the kernel's point of view: there is global memory on the graphics card, and constant memory sits alongside it. Within each work-group there are work-items. -Work-items have private memory: if you declare a variable inside your OpenCL kernel code, it is a private variable. -Each work-item also has access to its work-group's local memory.

The cache that is closest to the Arithmetic Logic Unit (ALU) is named:

L1

The cache that is smallest and fastest is:

L1

OpenCL is a

Like a large Pac-Man munching commands off a conveyor belt. You enqueue things onto the belt -- write buffer dA, write buffer dB, execute kernel, read buffer dC -- and the device eats them in order. That is what it means to enqueue.

Allocate the Device Memory Buffers

Let's assume the device is the GPU; running data-parallel on the GPU gives incredible performance. The "d" prefix means the array is on the device (the GPU in this case) -- the device is where OpenCL is going to do its data-parallel work.
cl_mem dA = clCreateBuffer( context, CL_MEM_READ_ONLY, dataSize, NULL, &status );
// CL_MEM_READ_ONLY = how the OpenCL device (GPU) will access it; use CL_MEM_READ_WRITE if it needs to both read and write
// dataSize = how big the array is (number of elements * sizeof(float))
// NULL     = no already-allocated buffer data is being supplied
// &status  = check later

The stack is used for storing:

Local variables and function return addresses

FMA

Lumping the add and the multiply together into one instruction; the compiler took advantage of it in the generated assembly.

MPI Barriers

MPI_Barrier( MPI_COMM_WORLD ); -All CPUs must execute a call to MPI_Barrier( ) before any of them can move past it. Reminder: barriers are based on count, not location.

MPI Receiving Data in a Destination CPU from a Source CPU

MPI_Recv( array, maxCanReceive, type, src, tag, MPI_COMM_WORLD, &status ); -status is a structure -tag should be what we are looking for Rules: • The receiver blocks waiting for data that matches what it declares to be looking for • One message from a specific src to a specific dst cannot overtake a previous message from the same src to the same dst • There are no guarantees on the order from different src's • The order from different src's could be implied in the tag • status is type MPI_Status - the "&status" can be replaced with MPI_STATUS_IGNORE

MPI reduction

MPI_Reduce( partialResult, globalResult, count, type, operator, dst, MPI_COMM_WORLD ); with dst = the root. -MPI does the reduction for you. -Everybody executes the same call; MPI just needs to know which rank (dst) gets the global result, and it applies the given operator (e.g. MPI_SUM) to the partial results held by the nodes on the network.

MPI: Sending Data from a Source CPU to Several Destination CPUs

MPI_Send( array, numToSend, type, dst, tag, MPI_COMM_WORLD ); -You specify the destination for this one; there is no src argument, so the send function's argument list differs from the receive function's. -tag: an integer to differentiate this transmission from any other transmission. Rules: • One message from a specific src to a specific dst cannot overtake a previous message from the same src to the same dst -- the order sent is the order received. • MPI_Send( ) blocks until the transfer is far enough along that array can be destroyed or re-used; once MPI_Send( ) returns, the array is yours again. • There are no guarantees on order from different src's. A send/receive sketch is below.
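A minimal send/receive sketch, assuming this sits inside main( ) after MPI_Init( ), with rank 1 sending 100 floats to rank 0 (the counts, ranks, and tag value are illustrative):

#define TAG_DATA  1          // an integer tag to differentiate this transmission (illustrative)
#define NUMFLOATS 100

float array[ NUMFLOATS ];
int me;
MPI_Comm_rank( MPI_COMM_WORLD, &me );

if( me == 1 )
{
        // ... fill array[ ] ...
        MPI_Send( array, NUMFLOATS, MPI_FLOAT, 0, TAG_DATA, MPI_COMM_WORLD );
        // blocks only until array[ ] is safe to re-use
}
else if( me == 0 )
{
        MPI_Status status;
        MPI_Recv( array, NUMFLOATS, MPI_FLOAT, 1, TAG_DATA, MPI_COMM_WORLD, &status );
        // blocks until a matching message from rank 1 with tag TAG_DATA arrives
}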

MPI Derived Types

MPI_Datatype point_t; MPI_Type_create_struct( count, blocklengths, displacements, types, &datatype ); e.g. MPI_Type_create_struct( 4, blocklengths, displacements, types, &point_t ); You can now use point_t everywhere you could have used MPI_INT, MPI_FLOAT, etc., including in scatter/gather -- you can create your own type. Idea: in addition to the built-in types MPI_INT, MPI_FLOAT, etc., allow the creation of new MPI types so that you can transmit an "array of structures". Reason: there is significant overhead with each transmission, so it is better to send one entire array of structures than to send several arrays separately. A sketch is below.
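A sketch of building such a type for a structure holding an int and three floats (the field layout and names are illustrative, not necessarily the exact structure used in class):

#include <mpi.h>
#include <stddef.h>                     // offsetof

struct point_t
{
        int   id;
        float x, y, z;
};

MPI_Datatype MakePointType( )           // call this after MPI_Init( )
{
        int          blocklengths[ 2 ]  = { 1, 3 };     // one int, then three floats
        MPI_Aint     displacements[ 2 ] = { offsetof(struct point_t, id), offsetof(struct point_t, x) };
        MPI_Datatype types[ 2 ]         = { MPI_INT, MPI_FLOAT };

        MPI_Datatype pointType;
        MPI_Type_create_struct( 2, blocklengths, displacements, types, &pointType );
        MPI_Type_commit( &pointType );  // must commit it before using it
        return pointType;
}

// pointType can now be used anywhere MPI_INT, MPI_FLOAT, etc. could be, e.g. to send an
// array of structures in one transmission:
//     MPI_Send( points, numPoints, pointType, dst, tag, MPI_COMM_WORLD );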

The theoretical maximum speedup that you can ever achieve, no matter how many cores you add, is:

MaxS = 1/Fs

Gustafson's Observation on Amdahl's Law says:

More cores often results in more data, which results in a larger parallel fraction

A way to prevent harm from race conditions is:

Mutual Exclusion Locks

Core scoreboard

NVIDIA: 1024 "cores". Intel: 8 cores. How can this be? (Because a CUDA core is not the same thing as a CPU core.)

OpenCL Reduction: Barriers

You have no control over when OpenCL makes the work-items run, so you need a barrier to wait for completion -- otherwise you don't know whether the product you are about to add in is ready yet:
int mask = 2*offset - 1;
barrier( CLK_LOCAL_MEM_FENCE );      // wait for all work-items in the work-group to get here
prods[ tnum ] += prods[ tnum + offset ];

1D 2D 3D OpenCL special hardware?

No. Done for convenience.

OpenGL Compute Shaders: how they fit into a graphics application.

Compute shaders are a sub-part of OpenGL, so they drop into a graphics application easily; a particle system is the classic example. Think of them as a lighter-weight version of OpenCL+OpenGL interoperability ("baby graphics"), which is why game companies like them.

Another way to do SIMD muliplication

OpenMP:
void SimdMul( float *a, float *b, float *c, int len )
{
        #pragma omp simd
        for( int i = 0; i < len; i++ )
                c[ i ] = a[ i ] * b[ i ];
}
Presumably this will be turned into SIMD code. It will give the right answer, but it may not be done in the fastest way.

Message passing interface

Parallelism on multiple CPUs. Basic idea: two (or more) different computers connected by a network are allowed to pass messages; the messages can be commands or data. -There is an MPI server (the software) installed on each of the systems. Each computer is given an ID number (its rank).

Work Items

A work-item is part of a work-group (the work-group is the parent, the work-item the child). Each work-item has its own private memory, exactly like an OpenMP private variable: each one has its own copy of the value.

What is Vectorization/SIMD and Why do We Care?

Performance! -> issuing a single instruction that causes multiple pairs of numbers to be operated on at once. It is an outgrowth of Moore's Law: smaller transistors let chip makers put in more cores, more cache, and vector units. Many hardware architectures today, both CPU and GPU, allow you to perform arithmetic operations on multiple array elements simultaneously (thus the label "Single Instruction Multiple Data"). We care about this because many problems, especially scientific and engineering ones, can be cast this way. Examples include convolution, Fourier transform, power spectrum, autocorrelation, etc.

OpenGL binding

Binding is plugging a buffer into the Context's state: bind the vertex buffer information, then draw.
glBindBuffer( GL_ARRAY_BUFFER, buf );
It's very fast to re-bind a different vertex buffer; it amounts to just changing a pointer. -The OpenGL term "binding" refers to "attaching" or "docking" (a metaphor which I find to be more visually pleasing) an OpenGL object to the Context. You can then assign characteristics, and they will "flow" through the Context into the object. For example, this writes data into whatever array buffer is currently docked -- it always uses the current one:
glBufferData( GL_ARRAY_BUFFER, numBytes, data, usage );
-To use a different vertex buffer, just re-bind it.

Speedup is defined as (P=performance, T=execution time, n = number of cores):

Pn/P1

SIMD

Prefetching is super important with SIMD. Performance can even decrease, because the data is being gone through so quickly that it violates temporal coherence (values are not used multiple times): use a value once, bang, reload the cache line.

Declaring a variable inside an OpenMP for-loop automatically makes it:

Private

Each Compute Unit is organized as a grid of

Processing Elements (the floating-point and integer arithmetic units).

Intel Xeon Phi

The rabbit machine has a Xeon host system (8 hardware cores) plus the Xeon Phi card: 57 cores (1 for the OS, 56 for you); the 56 cores we can get at have 4 hyperthreads/core = 224 hyperthreads ready to use. -The Phi emphasizes its vector unit and vector registers. -The vector registers are 512 bits wide, which holds 16 floats; 16 floats * 4 bytes/float = 64 bytes = exactly a cache line. -There is no out-of-order instruction processing (ILP): when it blocks, it blocks. -THE COMPILER IS STILL YOUNG!!!! (for vector/SIMD).

Amdahl's law says:

S = 1/( (Fp/n) + Fs )

General term in OpenCL

SM = compute unit CUDA CORE = Processing Element.

As you saw in the graphs for Project #0 (Simple OpenMP) and Project #1 (volume integration):

Small dataset sizes can give you unreliable performance measurements

Speedup Efficiency is defined as:

Sn/n

The two coherences that the cache really likes you to use in order to deliver maximum performance are:

Spatial and Temporal

A thread's state consists of:

Stack pointer, Program counter, Registers

MPI_COMM_WORLD

Stands for the communicator containing all of the CPUs you asked for

MPI scatter

Take a data array, break it into ~equal portions, and send one portion to each CPU: MPI_Scatter( snd_array, snd_count, snd_type, rcv_array, rcv_count, rcv_type, src, MPI_COMM_WORLD ); -This is another example of every node executing the same call. -Both the sender and the receivers need to execute MPI_Scatter; there is no separate receive function. -The root node can be both a sender and a receiver (it keeps a portion for itself). A sketch is below.
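A scatter sketch, assuming this sits inside main( ) after MPI_Init( ) and that the total array size divides evenly by the number of ranks (the names and sizes are illustrative):

#define TOTAL  1000000

int numCpus, me;
MPI_Comm_size( MPI_COMM_WORLD, &numCpus );
MPI_Comm_rank( MPI_COMM_WORLD, &me );

int perCpu = TOTAL / numCpus;                   // assume it divides evenly
float *whole   = NULL;
float *portion = new float [ perCpu ];

if( me == 0 )
{
        whole = new float [ TOTAL ];
        // ... rank 0 fills whole[ ] ...
}

// every rank (sender and receivers alike) executes the exact same call:
MPI_Scatter( whole, perCpu, MPI_FLOAT, portion, perCpu, MPI_FLOAT, 0, MPI_COMM_WORLD );

// ... each rank now works on its own portion[ ] of perCpu elements ...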

The difference between using OpenMP Tasks vs. using OpenMP Sections is that:

Tasks are dynamically allocated, sections are static

MESI is:

The 4 states of a cache line on a core

CON of OpenCL

The GPU does not have a stack, and so the OpenCL C-ish programming language cannot do recursion and cannot make function calls; it also can't use general pointers, because GPUs can't handle them.

This all sounds great! What is the catch?

The catch is that compilers haven't caught up to producing efficient SIMD code. So, while there are great ways to express the desire for SIMD in code, you won't get the full potential speedup ... yet. So, for the CPU SIMD project, we are going to investigate the potential speedup using assembly language. Don't worry - you don't need to write it. You will be given two assembly functions: 1. SimdMul: C[ 0:len ] = A[ 0:len ] * B[ 0:len ] 2. SimdMulSum: return( Σ A[ 0:len ] * B[ 0:len ] ). We haven't seen the full benefit of SIMD from compilers yet, since good SIMD code generation is still relatively new; the functions are given in assembly language because that is how to get at the full SIMD power until compilers catch up. The assembly loads/multiplies/stores 4 floats at a time, advancing the pointers by 16 bytes each pass: tight code, which compilers are not yet producing.

Prefetch right time and right distance

The right time: if the prefetch is issued too late, then the memory values won't be back when the program wants to use them, and the processor has to wait anyway. If the prefetch is issued too early, then there is a chance that the prefetched values could be evicted from cache by another need before they can be used. The right distance: the "prefetch distance" is how far ahead the prefetched memory is from the memory we are using right now. Too far, and the values sit in cache too long and possibly get evicted; too near, and the program is ready for the values before they have arrived.

The word "deterministic" means:

The same inputs will always produce the same outputs

GPU vs CPU

They do different things

OpenMP Reductions are faster than Atomic or Critical sections because:

They sum into a separate variable per thread and then perform power-of-two addition

Barrier in the command queue

This does not complete until all commands enqueued before it have completed. status = clEnqueueBarrier( cmdQueue );

The Bubble Sort example needed to use the Even-Odd (or Red-Black) pattern:

To prevent two threads from trying to write the same variable at the same time

Moore's Law (as Gordon Moore actually phrased it) says:

Transistor density doubles every 2 years

OpenCL

Essentially the multi-vendor version of the CUDA idea -- the same programming model, but supported by many vendors.

MPI Broadcasting

Used when you want each CPU to work on part of, say, a billion-element array and they all need the same data first. -count is the number of elements. -type is what the data is, e.g. MPI_CHAR. -src is the rank of the sending CPU, so everyone who gets the broadcast knows who sent it. -Everyone executes the same MPI broadcast call: only the root node reads the value in, and the root then sends it; if you are not the root, the same call receives it. -MPI uses the same Bcast function for both directions: if you are the source, MPI knows to make you the sender; if you are not the source, MPI knows to make you the receiver. A sketch is below.
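A broadcast sketch, assuming this sits inside main( ) after MPI_Init( ) (the array size and the choice of rank 0 as the root are illustrative):

#define COUNT 1000

int me;
float params[ COUNT ];
MPI_Comm_rank( MPI_COMM_WORLD, &me );

if( me == 0 )
{
        // only the root reads/produces the values:
        // ... fill params[ ] ...
}

// everyone executes the exact same call; rank 0 sends, everyone else receives:
MPI_Bcast( params, COUNT, MPI_FLOAT, 0, MPI_COMM_WORLD );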

The observation that clock speed doubles every 2 years:

Was the case for a while, but does not hold anymore

OpenGL vertex buffers

Vertex buffers are where the vertex data memory is held; OpenGL is moving to vertex buffer objects. Vertices only have to be listed once and are then referenced by index, so per-vertex work only has to be done once. In OpenGL, objects are pointed to by an unsigned integer handle. You can assign a value for this handle yourself (not recommended), or have OpenGL generate one for you that is guaranteed to be unique. For example: GLuint buf; glGenBuffers( 1, &buf ); A typical create/bind/fill sequence is sketched below.
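Putting this card and the "OpenGL binding" card together, a typical create/bind/fill sequence might look like this (the GL_STATIC_DRAW usage hint and the vertexData/numVertices names are assumptions for illustration):

GLuint buf;
glGenBuffers( 1, &buf );                    // have OpenGL generate a unique handle
glBindBuffer( GL_ARRAY_BUFFER, buf );       // dock it to the Context's array-buffer slot
glBufferData( GL_ARRAY_BUFFER, 3*sizeof(float)*numVertices, vertexData, GL_STATIC_DRAW );
                                            // fills whatever buffer is currently bound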

SIMD in Intel Chips: SSE

Width bits: 128 Width (FP words): 4 -- 4 pairs of floating-point numbers multiplied at a time YR: 1998 -Pretty much still on every CPU

SIMD in Intel Chips: AVX

Width bits: 256 Width (FP words): 8 pairs multiplied at a time YR: 2011 -Not as common on chips

SIMD in Intel Chips: MMX

Width bits: 64 Width (FP words): 2 - Can multiply 2 pairs of floating point numbers at the same time YR: 1996 MM = Multimedia -Used to speed up the processing of music

SIMD in Intel Chips: AVX-512

Xeon Phi (rabbit) Width bits: 512 Width (FP words): 16 pairs of floating point numbers at a time. An entire cache line! YR: 2013 -Can go through 16 pairs at a time. Load cache line. Bang its multiplied. This is where prefetch becomes important. Need to reload another cache line faster than we had before.

particles.cpp, particles.cl

Yikes

A "race condition" is one where:

You get a different result depending on which thread gets to a piece of code first

OpenCL Reduction: local memory array,

// the dC results array is one float per work-group:
size_t cSize = numWorkGroups * sizeof(float);
float *hC = new float [ numWorkGroups ];
status = clSetKernelArg( kernel, 3, sizeof(cl_mem), &dC );

// to give the kernel its local (work-group) memory, done on the host side:
status = clSetKernelArg( kernel, 2, LOCAL_SIZE * sizeof(float), NULL );
// passing NULL says "this argument is local memory"; the size allocates 1 float per work-item,
// placed in work-group memory. In the kernel it appears as: local float *prods

kernel

a function in the .cl file, marked with the kernel qualifier, that can be fired off from your C program

OpenCL has float4

can be seen as SIMD.

.cl file

You pass in pointers to get at global memory. int gid = get_global_id( 0 ); -> gid is the global id; the 0 means dimension .x (1 = .y, 2 = .z, since the GPU index space can be 1D, 2D, or 3D -- here we only need .x). The global id is "where am I in the entire big array": dA[gid] gets this work-item's value from the big array at that index, and the product goes back into dC: dC[gid] = dA[gid] * dB[gid];

Work-Group

A certain set of data to be worked on. Each work-group has its own local memory, which it can use or not use. -Each work-group has a certain number of work-items to work on. -There is also global memory that all work-groups can access -- usually the big GPU memory, where you put the 50-million-element array. There can also be constant memory; it depends on the system you are on.

OpenCL acquiring and releasing a buffer

An ordinary (non-shared) device buffer is made with clCreateBuffer( Context, CL_MEM_READ_WRITE, 4*sizeof(float)*NUM_PARTICLES, NULL, &status ); To share an OpenGL vertex buffer instead, the OpenCL device memory pointer is created from the OpenGL vertex buffer (clCreateFromGLBuffer), and OpenCL must acquire that buffer before its kernels use it and release it afterward so OpenGL can draw from it.

7. Read the Kernel Code from a File into a Character Array

const char *CL_FILE_NAME = { "arraymult.cl" };          // name of the kernel program file

FILE *fp = fopen( CL_FILE_NAME, "r" );
if( fp == NULL )
{
        fprintf( stderr, "Cannot open OpenCL source file '%s'\n", CL_FILE_NAME );
        return 1;
}

// read the characters from the opencl kernel program:
fseek( fp, 0, SEEK_END );
size_t fileSize = ftell( fp );          // ftell = how many bytes into the file you are
fseek( fp, 0, SEEK_SET );               // go back to the beginning

// we will be handing the compiler one large text string:
char *clProgramText = new char[ fileSize+1 ];
size_t n = fread( clProgramText, 1, fileSize, fp );
clProgramText[ fileSize ] = '\0';       // add the null so it is a proper C string
fclose( fp );

Offloading

You can run on the CPU/Xeon host and send ("offload") pieces of the work over to run on the Xeon Phi.

clCreateFromGLBuffer

dPobj = clCreateFromGLBuffer( Context, CL_MEM_READ_WRITE, hPobj, &status ); PrintCLError( status, "clCreateFromGLBuffer (1)" );

sphere

float4: x,y,z,radius

workgroup #

#work-groups = globalIndexSpaceSize (global space) / workGroupSize (local space), per dimension. 2D example: global space x = 20, y = 12; local space x = 4, y = 3. Work-groups in x = 20/4 = 5, in y = 12/3 = 4, so 5 × 4 = 20 work-groups total (equivalently (20*12)/(4*3) = 20).

GPUs can come in multi-devices

i.e. a card/ graphics card

OpenCL:get_global_id( )

int gid = get_global_id( 0 ); -- the argument is the dimension we are fetching: 0 = X; in 2D, Y = 1; in 3D, Z = 2. Given a big long array, gid tells you where you are in that array (what element number you are at). It can be called up to 3 times (for 3D) to find out where you are, so you can get the data you need.

GPU device

is organized as a grid of SMs = Compute Units (independent processors on the device).

NVIDIA and CUDA cores

CUDA cores have no control logic -- they are pure compute units (the surrounding SM has the control logic). -Each is just an FP unit and an INT unit: no processor, no control logic, nothing to execute on its own; it can't load a program counter or use a stack. -The Streaming Multiprocessor (SM) executes an instruction, and (e.g.) 192 data values can be computing what that instruction asks to have happen at one time. -GPUs are made to stream data.

cl_uint

openCL unsigned int

Scatter Gather MPI

The pattern of breaking a big problem up into pieces, sending them to different CPUs, computing on the pieces, and getting the results back is so common that it has its own term, Scatter/Gather, and its own MPI function calls.

clEnqueueNDRangeKernel

put kernel on the conveyor belt

OpenCL Assembly Language: fast_distance(), fast_normalize(), fast_length()

These have to do with the sqrt: length( ) returns the length of a vector, and normalize( ) divides a vector by its length. The fast_ versions are really quick -- call the function and you're done -- because they use reduced precision (roughly 16-bit instead of 32-bit arithmetic); that's fine when you only need to be accurate to the closest pixel. The non-fast versions (previous card) are for real accuracy.

OpenCl events: throwing events

status = clEnqueueNDRangeKernel( cmdQueue, kernel, 1, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL );
The last three arguments are the event arguments: the # of events to wait for and the list of them (0, NULL = not waiting for any events), and the final NULL = the event to throw when you're done. To throw one, all that needs to be done is to declare it -- cl_event waitKernelC; -- and pass its address (&waitKernelC) as the last argument; whenever the kernel is done executing it throws the event.

SIMD multiplication

// array-colon notation lets you express SIMD: "start at 0 and go for len elements (< len)"
void SimdMul( float *a, float *b, float *c, int len )
{
        c[0:len] = a[0:len] * b[0:len];
}

// multiply an array by a scalar:
c[0:len] = a[0:len] * b;

a[ 0 : ArraySize ] -> "the set of elements in the array a starting at index 0 and going for ArraySize elements".

OpenCL:get_local_size( )

Within each work-group, how many work-items (pieces of data) do you have? The argument 0, 1, or 2 picks the dimension.

Requirements for a For-Loop to be Vectorized

• If there are nested loops, the one to vectorize must be the inner one (the one you are in most often). • There can be no jumps or branches -- if-statements are essentially gotos. "Masked assignments" (an if-statement-controlled assignment) are OK, e.g., if( A[ i ] > 0. ) B[ i ] = 1.; -- an if that only controls an assignment is fine, because it can still be done bang-bang-bang. • The total number of iterations must be known at runtime when the loop starts. • There cannot be any backward loop dependencies, like A[ i ] = A[ i-1 ] + 1.; (the for-loop passes are spread out, so there is no guarantee that A[i-1] has been produced before A[i] is computed). • It helps if the elements have contiguous memory addresses. Legal and illegal loops are sketched below.
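A small sketch contrasting loops that do and do not meet these requirements (illustrative only; whether a particular compiler actually vectorizes the legal ones depends on its flags):

void Examples( float *a, float *b, float *c, int ArraySize )
{
        // vectorizable: inner loop, no branches, trip count known at entry, contiguous, no dependencies
        for( int i = 0; i < ArraySize; i++ )
                c[ i ] = a[ i ] * b[ i ];

        // masked assignment -- still OK to vectorize:
        for( int i = 0; i < ArraySize; i++ )
                if( a[ i ] > 0. )
                        b[ i ] = 1.;

        // NOT vectorizable: backward loop dependency -- a[ i ] needs the a[ i-1 ] just computed
        for( int i = 1; i < ArraySize; i++ )
                a[ i ] = a[ i-1 ] + 1.;
}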

OpenCL rules

• Threads (work-items) can share memory with the other threads in the same work-group -- work-items can access local memory without reloading • Threads can synchronize with other threads in the same work-group (work-groups themselves are independent of each other) • Global and Constant memory is accessible by all threads in all work-groups • Global and Constant memory is often cached inside a work-group (if a cache is available) • Each thread has registers and private memory • Each work-group has a maximum number of registers it can use, divided equally among all its threads. If you write code that uses too many registers in one work-item, the compiler will complain: registers-per-work-item × work-items-per-work-group = the number of registers the work-group needs.

