Chapter 3 Pacheco


If we fix n and increase comm_sz, the run-times usually decrease

In fact, for large values of n, doubling the number of processes roughly halves the overall run-time.

What language is mpicc a wrapper script for?

Typically, mpicc is a script that's a wrapper for the C compiler.

Many systems also support program startup with mpiexec:

$ mpiexec -n <number of processes> ./mpi_hello

1. To run the program with one process, we'd type
   $ mpiexec -n 1 ./mpi_hello
   and the program's output would be:
   Greetings from process 0 of 1!
2. To run the program with four processes, we'd type
   $ mpiexec -n 4 ./mpi_hello
   and the output would be:
   Greetings from process 0 of 4!
   Greetings from process 1 of 4!
   Greetings from process 2 of 4!
   Greetings from process 3 of 4!

MPI program that prints greetings from the processes

(The program listing appears only as an image in the original notes; a sketch follows.)
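A minimal sketch modeled on the book's mpi_hello.c (names such as MAX_STRING, greeting, comm_sz, and my_rank follow the book's conventions):

#include <stdio.h>
#include <string.h>   /* for strlen */
#include <mpi.h>

const int MAX_STRING = 100;

int main(void) {
   char greeting[MAX_STRING];   /* buffer for the message */
   int  comm_sz;                /* number of processes    */
   int  my_rank;                /* my process rank        */

   MPI_Init(NULL, NULL);
   MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
   MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

   if (my_rank != 0) {
      /* Create a message and send it to process 0 */
      sprintf(greeting, "Greetings from process %d of %d!", my_rank, comm_sz);
      MPI_Send(greeting, strlen(greeting)+1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
   } else {
      /* Process 0 prints its own message, then the others' in rank order */
      printf("Greetings from process %d of %d!\n", my_rank, comm_sz);
      for (int q = 1; q < comm_sz; q++) {
         MPI_Recv(greeting, MAX_STRING, MPI_CHAR, q, 0, MPI_COMM_WORLD,
                  MPI_STATUS_IGNORE);
         printf("%s\n", greeting);
      }
   }

   MPI_Finalize();
   return 0;
}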

Issues with send and receive

- MPI_Send may behave differently with regard to buffer size, cutoffs, and blocking - MPI_Recv always blocks until a matching message is received

MPI Identifiers have to start with:

- MPI_

Define Local variables

- Local variables are variables whose contents are significant only on the process that's using them. - Some examples from the trapezoidal rule program are local_a, local_b, and local_n.

Explain the MPI_COMM_WORLD communicator

- A collection of processes that can send messages to each other - MPI_Init defines a communicator that consists of all the processes created when the program is started: - Called MPI_COMM_WORLD

What does a distributed-memory system consist of?

- A distributed-memory system consists of a collection of core-memory pairs connected by a network, and the memory associated with a core is directly accessible only to that core - Each core-memory pair (CPU, its gates and transistors, and its RAM) connects to the others through a bus or interconnect

What does a shared-memory system consist of?

- A shared-memory system consists of a collection of cores connected to a globally accessible memory, in which each core can access any memory location - The cores connect to the memory through a bus or interconnect; because every memory access must cross it, the shared bus can become a bottleneck and slow memory access down.

Do all MPI implementations allow all the processes in MPI_COMM_WORLD full access to "stdout" and "stderr"?

- Although the MPI standard doesn't specify which processes have access to which I/O devices, virtually all MPI implementations allow all the processes in MPI_COMM_WORLD full access to stdout and stderr, so most MPI implementations allow all processes to execute printf and fprintf(stderr, ...).

Suppose that each process calls MPI_Reduce with operator MPI_SUM, and destination process 0.

- At first glance, it might seem that after the two calls to MPI_Reduce, the value of b will be three and the value of d will be six. - However, the names of the memory locations are irrelevant to the matching of the calls to MPI_Reduce. - The order of the calls determines the matching: in the book's example process 1 makes its two calls in the opposite order from processes 0 and 2, so the value stored in b will be 1 + 2 + 1 = 4, and the value stored in d will be 2 + 1 + 2 = 5.

Communication functions that involve all the processes in a communicator are called:

- Collective Communications - All the processes in the communicator must call the same collective function. - For example, a program that attempts to match a call to MPI_Reduce on one process with a call to MPI_Recv on another process is erroneous, and, in all likelihood, the program will hang or crash.

Explain our Trap Program's issues:

- Each process with rank greater than 0 is "telling process 0 what to do" and then quitting. - That is, each process with rank greater than 0 is, in effect, saying "add this number into the total." - Process 0 is doing nearly all the work in computing the global sum, while the other processes are doing almost nothing

There are still a few remaining issues.

- First, as we've described it, our parallel program will report comm_sz times, one for each process. We would like to have it report a single time. - Ideally, all of the processes would start execution of the matrix-vector multiplication at the same time, and then we would report the time that elapsed when the last process finished. In other words, the parallel execution time would be the time it took the "slowest" process to finish.
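A minimal sketch of the usual idiom (assuming my_rank and comm, e.g. MPI_COMM_WORLD, are already set up): synchronize with a barrier, time the code with MPI_Wtime, then reduce with MPI_MAX so process 0 reports the time of the slowest process.

double local_start, local_finish, local_elapsed, elapsed;

MPI_Barrier(comm);                 /* try to start all processes at the same time */
local_start = MPI_Wtime();
/* ... code being timed, e.g. the matrix-vector multiplication ... */
local_finish = MPI_Wtime();
local_elapsed = local_finish - local_start;

/* The parallel run-time is the time taken by the slowest process */
MPI_Reduce(&local_elapsed, &elapsed, 1, MPI_DOUBLE, MPI_MAX, 0, comm);
if (my_rank == 0)
   printf("Elapsed time = %e seconds\n", elapsed);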

In serial programs, an in/out argument is one whose value is both used and changed by the function.

- For MPI_Bcast, however, the data_p argument is an input argument on the process with rank source_proc and an output argument on the other processes.

We might try something similar with the vectors: process 0 could read them in and broadcast them to the other processes.

- However, this could be very wasteful. If there are 10 processes and the vectors have 10,000 components, then each process will need to allocate storage for vectors with 10,000 components, when it is only operating on sub-vectors with 1000 components. If, for example, we use a block distribution, it would be better if process 0 sent only components 1000 to 1999 to process 1, components 2000 to 2999 to process 2, and so on. Using this approach, processes 1 to 9 would only need to allocate storage for the components they're actually using.

The times for comm_sz = 1 are the run-times of the serial program running on a single core of the distributed- memory system

- If we fix comm_sz, and increase n, the order of the matrix, the run-times increase. - For relatively small numbers of processes, doubling n results in roughly a four-fold increase in the run-time. However, for large numbers of processes, this formula breaks down.

If we can improve the performance of the global sum in our trapezoidal rule program by replacing a loop of receives on process 0 with a tree-structured communication, we ought to be able to do something similar with the distribution of the input data:

- If we simply "reverse" the communications in the tree-structured global sum, we obtain the tree-structured communication, and we can use this structure to distribute the input data. - A collective communication in which data belonging to a single process is sent to all of the processes in the communicator is called a broadcast

Describe Elapsed serial time

- In this case, you don't need to link in the MPI libraries. - The GET_TIME macro (from the book's timer.h) returns wall clock time, in seconds (with microsecond resolution), elapsed from some point in the past.
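A sketch of such a macro; the book supplies one in timer.h, and this gettimeofday-based form is an assumption about its exact details.

#include <sys/time.h>

/* Assigns the current wall clock time, in seconds, to the double variable `now`.
   No MPI calls are involved, so a serial program using it needn't link MPI. */
#define GET_TIME(now) {                        \
   struct timeval t;                           \
   gettimeofday(&t, NULL);                     \
   now = t.tv_sec + t.tv_usec/1000000.0;       \
}

/* Usage sketch:
      double start, finish;
      GET_TIME(start);
      ...serial code being timed...
      GET_TIME(finish);
      printf("Elapsed time = %e seconds\n", finish - start);   */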

Our parallel matrix-vector multiplication program doesn't come close to obtaining linear speedup for small n and large p. Does this mean that it's not a good program?

- Many computer scientists answer this question by looking at the "scalability" of the program. - Recall that very roughly speaking, a program is scalable if the problem size can be increased at a rate so that the efficiency doesn't decrease as the number of processes increase.

Important purposes of Communicators?

- One of the most important purposes of communicators is to specify communication universes; recall that a communicator is a collection of processes that can send messages to each other. - Conversely, a message sent by a process using one communicator cannot be received by a process that's using a different communicator. - Since MPI provides functions for creating new communicators, this feature can be used in complex programs to insure that messages aren't "accidentally received" in the wrong place.

What is scalable?

- Programs that can maintain a constant efficiency without increasing the problem size are sometimes said to be strongly scalable. - Programs that can maintain a constant efficiency if the problem size increases at the same rate as the number of processes are sometimes said to be weakly scalable. - For example, if n = 1000 and p = 2, the efficiency of program B is 0.80. If we then double p to 4 and leave the problem size at n = 1000, the efficiency drops to 0.40; but if we also double the problem size to n = 2000, the efficiency remains constant at 0.80. Program A is thus more scalable than B, but both satisfy our definition of scalability.

Both MPI_Wtime and GET_TIME return wall clock time

- Recall that a timer like the C clock function returns CPU time: the time spent in user code, library functions, and operating system code - It doesn't include idle time, which can be a significant part of parallel run time. For example, a call to MPI_Recv may spend a significant amount of time waiting for the arrival of a message. - Wall clock time, on the other hand, gives total elapsed time, so it includes idle time.

How to aggregate the computation of the areas of the trapezoids into groups?

- Split the interval [a, b] up into comm_sz subintervals. - If comm_sz evenly divides n, the number of trapezoids, we can simply apply the trapezoidal rule with n/comm_sz trapezoids to each of the comm_sz subintervals. - To finish, we can have one of the processes, say process 0, add the estimates.

Explain how the tag argument works?

- Suppose process 1 is sending floats to process 0. Some of the floats should be printed, while others should be used in a computation. - Then the first four arguments to MPI_Send provide no information regarding which floats should be printed and which should be used in a computation. - So process 1 can use, say, a tag of 0 for the messages that should be printed and a tag of 1 for the messages that should be used in a computation.

Explain the MPI_Finalize method

- Tells the MPI system that we're done using MPI, and that any resources allocated for MPI can be freed. - The syntax is quite simple: int MPI_Finalize(void); - In general, no MPI functions should be called after the call to MPI_Finalize.

If we pause for a moment and think about our trapezoidal rule program, we can find several things that we might be able to improve on:

- The "global sum" after each process has computed its part of the integral: If we hire eight workers to, say, build a house, we might feel that we weren't getting our money's worth if seven of the workers told the first what to do, and then the seven collected their pay and went home. But this is very similar to what we're doing in our global sum

Collective Communications vs Point-to-Point communications?

- The arguments passed by each process to an MPI collective communication must be "compatible." - For example, if one process passes in 0 as the dest_process and another passes in 1, then the outcome of a call to MPI_Reduce is erroneous, and, once again, the program is likely to hang or crash. - The output_data_p argument is only used on dest_process. - However, all of the processes still need to pass in an actual argument corresponding to output_data_p, even if it's just NULL. - Point-to-point communications are matched on the basis of tags and communicators. - Collective communications don't use tags; they're matched solely on the basis of the communicator and the order in which they're called.

Explain the pointer variable: msg_buf_p

- The first argument, msg_buf_p, is a pointer to the block of memory containing the contents of the message. - In our program, this is just the string containing the message, greeting. (Remember that in C an array, such as a string, is a pointer.)

The method: MPI_Send -

- The first three arguments, msg_buf_p, msg_size, and msg_type, determine the contents of the message - The remaining arguments, dest, tag, and communicator, determine the destination of the message.
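For reference, its prototype (parameter names follow the book's conventions):

int MPI_Send(
   void*        msg_buf_p     /* in: pointer to the message contents         */,
   int          msg_size      /* in: number of items to send                 */,
   MPI_Datatype msg_type      /* in: type of each item, e.g. MPI_CHAR        */,
   int          dest          /* in: rank of the destination process         */,
   int          tag           /* in: nonnegative int distinguishing messages */,
   MPI_Comm     communicator  /* in: e.g. MPI_COMM_WORLD                     */);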

Explain the MPI_Reduce function?

- The key to the generalization is the fifth argument, operator. - It has type MPI_Op, which is a predefined MPI type like MPI_Datatype and MPI_Comm. There are a number of predefined values in this type

How can we aggregate the tasks and map them to the cores?

- The more trapezoids we use, the more accurate our estimate will be. That is, we should use many trapezoids, and we will use many more trapezoids than cores. - Thus, we need to aggregate the computation of the areas of the trapezoids into groups.

How do we get from invoking mpiexec to one or more lines of greetings?

- The mpiexec command tells the system to start <number of processes> instances of our mpi_hello program - It may also tell the system which core should run each instance of the program. - After the processes are running, the MPI implementation takes care of making sure that the processes can communicate with each other.

The operator we want is..

- The operator we want is MPI_SUM. - Using this value for the operator argument, we can replace the code in Lines 18 through 28 of Program 3.2 with the single function call.
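That single call, as in the book's trapezoidal rule program (local_int holds each process's local estimate and total_int receives the sum on process 0):

MPI_Reduce(&local_int, &total_int, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);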

What happens if multiple processes are attempting to write to 'stdout'

- The order in which the processes' output appears will be unpredictable. - Indeed, it can even happen that the output of one process will be interrupted by the output of another process.

As we increase the problem size, the run-times increase, and this is true regardless of the number of processes.

- The rate of increase can be fairly constant (e.g., the one-process times) or it can vary wildly (e.g., the 16-process times). As we increase the number of processes, the run-times typically decrease for a while - However, at some point, the run-times can actually start to get worse. - The closest we came to this behavior was going from 8 to 16 processes when the matrix had order 1024.

When we run it with six processes, the order of the output lines is unpredictable:

- The reason this happens is that the MPI processes are "competing" for access to the shared output device, stdout, and it's impossible to predict the order in which the processes' output will be queued up. - Such a competition results in Nondeterminism. That is, the actual output will vary from one run to the next.

How could we implement vector addition using MPI?

- The work consists of adding the individual components of the vectors, so we might specify that the tasks are just the additions of corresponding components

Why do MPI implementations include implementations of global sums?

- This places the burden of optimization on the developer of the MPI implementation, rather than the application developer. - The assumption here is that the developer of the MPI implementation should know enough about both the hardware and the system software so that she can make better decisions about implementation details.

Is this binary tree structure communication ideal?

- This solution may not seem ideal, since half the processes (1, 3, 5, and 7) are doing the same amount of work that they did in the original scheme. - However, if you think about it, the original scheme required comm_sz − 1 = seven receives and seven adds by process 0, while the new scheme only requires three, and all the other processes do no more than two receives and adds - We've thus reduced the overall time by more than 50%.

Explain the MPI_Recv method

- Thus, the first three arguments specify the memory available for receiving the message: msg_buf_p points to the block of memory, buf_size determines the number of objects that can be stored in the block, and buf_type indicates the type of the objects - The next three arguments identify the message. The source argument specifies the process from which the message should be received. - The tag argument should match the tag argument of the message being sent, and the communicator argument must match the communicator used by the sending process.
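For reference, its prototype (parameter names follow the book's conventions):

int MPI_Recv(
   void*        msg_buf_p     /* out: block of memory that receives the message */,
   int          buf_size      /* in:  number of objects the buffer can hold     */,
   MPI_Datatype buf_type      /* in:  type of the objects                       */,
   int          source        /* in:  rank of the sending process               */,
   int          tag           /* in:  must match the sender's tag               */,
   MPI_Comm     communicator  /* in:  must match the sender's communicator      */,
   MPI_Status*  status_p      /* out: pass MPI_STATUS_IGNORE if not needed      */);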

In MPI, a derived datatype can be used...

- To represent any collection of data items in memory by storing both the types of the items and their relative locations in memory. - The idea is that if a function that sends data knows this information about a collection of data items, it can collect the items from memory before they are sent - Similarly, a function that receives data can distribute the items into their correct destinations in memory when they're received. - As an example, in our trapezoidal rule program we needed to call MPI_Bcast three times: once for the left endpoint a, once for the right endpoint b, and once for the number of trapezoids n. - As an alternative, we could build a single derived datatype that consists of two doubles and one int. If we do this, we'll only need one call to MPI_Bcast.

What is message-passing?

- The programming model used by distributed-memory systems: processes communicate by explicitly sending and receiving messages.

Define Global Variables

- Variables whose contents are significant to all the processes are sometimes called global variables. - Some examples from the trapezoidal rule are a, b, and n.

What is the Trapezoid Rule used for

- We can use the trapezoidal rule to approximate the area between the graph of a function, y = f (x), two vertical lines, and the x-axis - The basic idea is to divide the interval on the x-axis into n equal subintervals. - Then we approximate the area lying between the graph and each subinterval by a trapezoid whose base is the subinterval, whose vertical sides are the vertical lines through the endpoints of the subinterval, and whose fourth side is the secant line joining the points where the vertical lines cross the graph

If we don't want output from different processes to appear in a random order, we have to:

- We must modify our program accordingly. - For example, we can have each process other than 0 send its output to process 0, and process 0 can print the output in process rank order. This is exactly what we did in the "greetings" program.

Explain MPI (Message-Passing Interface)

- We need MPI so that each process can identify itself (find its rank) and communicate with the other processes. - It defines a library of functions that can be called from C, C++, and Fortran programs. We'll learn about some of MPI's different send and receive functions. - We'll also learn about some "global" communication functions that can involve more than two processes. These functions are called collective communications

In order to write MPI programs that can use scanf:

- We need to branch on process rank, with process 0 reading in the data and then sending it to the other processes. - For example, we might write the Get_input function for our parallel trapezoidal rule program. - In this function, process 0 simply reads in the values for a, b, and n and sends all three values to each process. - This function uses the same basic communication structure as the "greetings" program, except that now process 0 is sending to each process, while the other processes are receiving.
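A sketch of such a Get_input function, along the lines of the book's send/receive version (an MPI_Bcast-based version appears later):

/* Process 0 reads a, b, and n and sends them to every other process. */
void Get_input(int my_rank, int comm_sz, double* a_p, double* b_p, int* n_p) {
   if (my_rank == 0) {
      printf("Enter a, b, and n\n");
      scanf("%lf %lf %d", a_p, b_p, n_p);
      for (int dest = 1; dest < comm_sz; dest++) {
         MPI_Send(a_p, 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
         MPI_Send(b_p, 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
         MPI_Send(n_p, 1, MPI_INT,    dest, 0, MPI_COMM_WORLD);
      }
   } else {   /* every other process receives the three values */
      MPI_Recv(a_p, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      MPI_Recv(b_p, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      MPI_Recv(n_p, 1, MPI_INT,    0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
   }
}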

Explain the role of the variables comm_sz and my_rank

- We'll often use the variable comm_sz for the number of processes in MPI_COMM_WORLD, and the variable my_rank for the process rank.

In the matrix-vector multiplication, we're not interested in the time it takes to type in the matrix or print out the product.

- We're only interested in the time it takes to do the actual multiplication, so we need to modify our source code by adding in calls to a function that will tell us the amount of time that elapses from the beginning to the end of the actual matrix-vector multiplication. - MPI provides a function, MPI_Wtime, that returns the number of seconds that have elapsed since some time in the past

How many characters of storage are allocated for greeting?

- We've allocated storage for 100 characters in greeting. - Of course, the size of the message sent should be less than or equal to the amount of storage in the buffer, in our case the 100-character array greeting.

What is Single-Program Multiple-Data (SPMD)

- We compiled a single program - Process 0 does something different: It's receiving a series of messages and printing them, while each of the other processes is creating and sending a message. - The if-else construct in Lines 16 through 28 makes our program SPMD

Explain the role of msg_size and msg_type:

- msg_size and msg_type determine the amount of data to be sent. - In our program, the msg_size argument is the number of characters in the message plus one character for the '\0' character that terminates C strings - The msg_type argument is MPI_CHAR. - These two arguments together tell the system that the message contains strlen(greeting)+1 chars

Key values of sorting

- n keys and p = comm_sz processes - n/p keys assigned to each process - No restrictions on which keys are assigned to which processes

Explain what a "wrapper script" is?

-A wrapper script is a script whose main purpose is to run some program. - In this case, the program is the C compiler. - However, the wrapper simplifies the running of the compiler by telling it where to find the necessary header files and which libraries to link with the object file.

You can receive messages without:

1. Knowing the amount of data in the message 2. Knowing the sender (pass MPI_ANY_SOURCE as the source argument) 3. Knowing the tag (pass MPI_ANY_TAG)
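A hedged sketch (the buffer name and size are assumptions) of receiving without knowing the sender, the tag, or the amount of data, then querying the status afterward:

double buf[100];        /* assumed receive buffer */
MPI_Status status;
int count;

/* Accept a message from any sender carrying any tag */
MPI_Recv(buf, 100, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
         MPI_COMM_WORLD, &status);

/* Find out, after the fact, who sent it, which tag it carried,
   and how many items actually arrived. */
int sender = status.MPI_SOURCE;
int tag    = status.MPI_TAG;
MPI_Get_count(&status, MPI_DOUBLE, &count);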

Pseudo-code if comm_sz evenly divides n

Pseudo-code (a and b define the interval, n is the total number of trapezoids, and each process applies the trapezoidal rule with n/comm_sz trapezoids to its own subinterval):

1.  Get a, b, n;
2.  h = (b - a)/n;
3.  local_n = n/comm_sz;
4.  local_a = a + my_rank * local_n * h;
5.  local_b = local_a + local_n * h;
6.  local_integral = Trap(local_a, local_b, local_n, h);
7.  if (my_rank != 0) {
8.     Send local_integral to process 0;
9.  } else {
10.    total_integral = local_integral;
11.    for (proc = 1; proc < comm_sz; proc++) {
12.       Receive local_integral from proc;
13.       total_integral += local_integral;
14.    }
15. }
16. if (my_rank == 0) { print the result; }
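A runnable sketch of this, modeled on the book's first trapezoidal-rule MPI program; the integrand f is an assumption (the book leaves its definition to the user), as are the hard-coded values of a, b, and n.

#include <stdio.h>
#include <mpi.h>

double f(double x) { return x*x; }   /* assumed integrand */

/* Serial trapezoidal rule on [left_endpt, right_endpt] with trap_count trapezoids */
double Trap(double left_endpt, double right_endpt, int trap_count, double base_len) {
   double estimate = (f(left_endpt) + f(right_endpt)) / 2.0;
   for (int i = 1; i <= trap_count - 1; i++)
      estimate += f(left_endpt + i*base_len);
   return estimate * base_len;
}

int main(void) {
   int my_rank, comm_sz, n = 1024, local_n;
   double a = 0.0, b = 3.0, h, local_a, local_b;
   double local_int, total_int;

   MPI_Init(NULL, NULL);
   MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
   MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);

   h = (b - a)/n;           /* h is the same for all processes            */
   local_n = n/comm_sz;     /* so is the number of trapezoids per process */

   local_a = a + my_rank*local_n*h;
   local_b = local_a + local_n*h;
   local_int = Trap(local_a, local_b, local_n, h);

   if (my_rank != 0) {
      MPI_Send(&local_int, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
   } else {
      total_int = local_int;
      for (int source = 1; source < comm_sz; source++) {
         MPI_Recv(&local_int, 1, MPI_DOUBLE, source, 0, MPI_COMM_WORLD,
                  MPI_STATUS_IGNORE);
         total_int += local_int;
      }
      printf("With n = %d trapezoids, our estimate of the integral from %f to %f = %.15e\n",
             n, a, b, total_int);
   }

   MPI_Finalize();
   return 0;
}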

For the trapezoidal rule, we might identify two types of tasks:

1. One type is finding the area of a single trapezoid 2. Computing the sum of these areas - Then the communication channels will join each of the tasks of the first type to the single task of the second type

Parallelizing the Trapezoidal Rule:

1. Partition the problem solution into tasks. 2. Identify the communication channels between the tasks 3. Aggregate the tasks into composite tasks. 4. Map the composite tasks to cores.

Explain how two processes can communicate by calling functions:

1. one process calls a send function 2. the other calls a receive function

What is the area of one Trapezoid?

Area of one trapezoid = (h/2) [ f(x_i) + f(x_{i+1}) ].

Explain Lines 1-3

Lines 1-3 include the header files the program needs: stdio.h, string.h (for strlen), and mpi.h.

Which process will handle receiving? (Trap Method)

All processes except 0.

Explain the MPI_Type_commit Method

Allows the MPI implementation to optimize its internal representation of the datatype for use in communication functions

A third alternative is a ...

Block-cyclic partition - Instead of using a cyclic distribution of individual components, we use a cyclic distribution of blocks of components, so a block-cyclic distribution isn't fully specified until we decide how large the blocks are.

Perhaps the best known serial sorting algorithm is

Bubble Sort

Define, MPI_Type_create_struct

Builds a derived datatype that consists of individual elements that have different basic types.

An alternative to a block partition is a ...

Cyclic partition - In a cyclic partition, we assign the components in a round-robin fashion. For example, when n = 12 and comm_sz = 3: process 0 gets component 0, process 1 gets component 1, process 2 gets component 2, process 0 gets component 3, and so on.

What are the fundamental issues involved in writing message-passing programs?

Data partitioning and I/O in distributed-memory systems

Hello World Program:

Each Core will print "Hello World" (Single Instruction).

In Lines 17 and 18:

Each process, other than process 0, creates a message it will send to process 0. (The function sprintf is very similar to printf, except that instead of writing to stdout, it writes to a string.)

Why is the butterfly method of communication more preferred than the tree?

Faster: with eight processes it needs only three levels of communication. It is more efficient, and all the processes end up with the global sum without having to "reverse" the tree to distribute the result back out to all the processes.

It might be tempting to call MPI_Reduce using the same buffer for both input and output

For example, if we wanted to form the global sum of x on each process and store the result in x on process 0, we might try calling: MPI_Reduce(&x, &x, 1, MPI_DOUBLE, MPI_SUM, 0, comm); However, this call is illegal in MPI, so its result will be unpredictable: it might produce an incorrect result, it might cause the program to crash, it might even produce a correct result.

In virtually all distributed-memory systems, communication can be much more expensive than local computation

For example, sending a double from one node to another will take far longer than adding two doubles stored in the local memory of a node

Unlike the MPI_Send and MPI_Recv pair, the global-sum function may involve more than two processes.

However, in our trapezoidal rule program it will involve all the processes in MPI_COMM_WORLD

The ideal value for S(n,p) is p.

If S(n,p) = p, then our parallel program with comm_sz = p processes is running p times faster than the serial program. In practice, this speedup, sometimes called linear speedup, is rarely achieved.

What is the role of the global sum?

In fact, global sum is just a special case of an entire class of collective communications. For example, it might happen that instead of finding the sum of a collection of comm_sz numbers distributed among the processes, we want to find the maximum or the minimum, or the product, or any one of many other possibilities

How many processes (cores) does our trapezoidal rule program require?

In our trapezoidal rule program it will involve all the processes in MPI_COMM_WORLD

What is the partitioning phase?

In the partitioning phase, we usually try to identify as many tasks as possible

Tree-structured communication: "binary tree structure"

In this diagram, initially students or processes 1, 3, 5, and 7 send their values to processes 0, 2, 4, and 6, respectively. - Then processes 0, 2, 4, and 6 add the received values to their original values, and the process is repeated twice: - Processes 2 and 6 send their new values to processes 0 and 4, respectively; processes 0 and 4 add the received values into their new values. - Process 4 sends its newest value to process 0, and process 0 adds the received value to its newest value.

What are the problems if all of the processes NEED the result of a global sum in order to complete some larger computation?

In this situation, we encounter some of the same problems we encountered with our original global sum. For example, if we use a tree to compute a global sum, we might "reverse" the branches to distribute the global sum (broadcast). We might have the processes exchange partial results instead of using one-way communications. Such a communication pattern is sometimes called a "butterfly".

The MPI collective communication function MPI_Barrier has what purpose?

Insures that no process will return from calling it until every process in the communicator has started calling it.

Another purpose of MPI_Init:

Is to define a communicator that consists of all of the processes started by the user when she started the program. This communicator is called MPI_COMM_WORLD. The function calls in Lines 13 and 14 are getting information about MPI_COMM_WORLD; their syntax is int MPI_Comm_size(...) and int MPI_Comm_rank(...).

Explain the fifth argument tag (nonnegative int)

It can be used to distinguish messages that are otherwise identical.

Now suppose we want to test our vector addition function, what should we do?

It would be convenient to be able to read the dimension of the vectors and then read in the vectors x and y.

Once we've decided how to partition the vectors...

It's easy to write a parallel vector addition function: each process simply adds its assigned components. Furthermore, regardless of the partition, each process will have local_n components of the vector, and, in order to save on storage, we can just store these on each process as an array of local_n elements
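A sketch of such a function, along the lines of the book's Parallel_vector_sum:

/* Each process adds its own local_n components; no communication is needed. */
void Parallel_vector_sum(double local_x[], double local_y[],
                         double local_z[], int local_n) {
   for (int local_i = 0; local_i < local_n; local_i++)
      local_z[local_i] = local_x[local_i] + local_y[local_i];
}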

Why is it illegal to call MPI_Reduce using the same buffer for I/O?

It's illegal because it involves aliasing of an output argument. - Two arguments are aliased if they refer to the same block of memory, and MPI prohibits aliasing of arguments if one of them is an output or input/output argument. - This is because the MPI Forum wanted to make the Fortran and C versions of MPI as similar as possible, and Fortran prohibits aliasing.

Explain the role of the MPI_COMM_WORLD communicator on Line: 13, 14

Line 13: int MPI_Comm_size - Line 14: int MPI_Comm_rank - The function calls in Lines 13 and 14 are getting information about MPI_COMM_WORLD. - For both functions, the first argument is a communicator and has the special type defined by MPI for communicators, MPI_Comm. - MPI_Comm_size returns in its second argument the number of processes in the communicator. MPI_Comm_rank returns in its second argument the calling process' rank in the communicator.
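Their prototypes (second-argument names follow the book's conventions):

int MPI_Comm_size(
   MPI_Comm comm        /* in  */,
   int*     comm_sz_p   /* out: number of processes in comm     */);

int MPI_Comm_rank(
   MPI_Comm comm        /* in  */,
   int*     my_rank_p   /* out: calling process's rank in comm  */);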

The implementation of message-passing that we'll be using is:

MPI (Message-Passing Interface)

Explain Line 12-14

MPI Commands: 1. MPI_Init tells the MPI system to do all of the necessary setup. For example, it might allocate storage for message buffers, and it might decide which process gets which rank. As a rule of thumb, no other MPI functions should be called before the program calls MPI_Init 2. MPI_Comm_size, returns in its second argument the number of processes in the communicator. 3. MPI_Comm_rank, returns in its second argument the calling process' rank in the communicator.

MPI_REDUCE VS MPI_ALLREDUCE

MPI_Reduce = combines values from all processes into a single value and returns the result to a single (destination) process. MPI_Allreduce = combines values from all processes and distributes the result back to all processes; every process in the group receives the result.

Thus, we might try writing a function that reads in an entire vector that is on process 0 but only sends the needed components to each of the other processes

MPI_Scatter, can be used in a function that reads in an entire vector on process 0 but only sends the needed components to each of the other processes.
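A sketch of how MPI_Scatter is used for this, along the lines of the book's Read_vector (block distribution; the full vector a is allocated only on process 0, and <stdlib.h> is needed for malloc/free):

/* Process 0 reads the full n-component vector and scatters local_n
   components to each process in the communicator. */
void Read_vector(double local_a[], int local_n, int n, char vec_name[],
                 int my_rank, MPI_Comm comm) {
   double* a = NULL;

   if (my_rank == 0) {
      a = malloc(n * sizeof(double));
      printf("Enter the vector %s\n", vec_name);
      for (int i = 0; i < n; i++)
         scanf("%lf", &a[i]);
      MPI_Scatter(a, local_n, MPI_DOUBLE, local_a, local_n, MPI_DOUBLE, 0, comm);
      free(a);
   } else {
      /* On the non-root processes the send buffer argument is ignored */
      MPI_Scatter(a, local_n, MPI_DOUBLE, local_a, local_n, MPI_DOUBLE, 0, comm);
   }
}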

What command do most systems use for compilation?

Many systems use a command called "mpicc" for compilation:
$ mpicc -g -Wall -o mpi_hello mpi_hello.c
*Recall that the dollar sign ($) is the shell prompt, so it shouldn't be typed in. Also recall that, for the sake of explicitness, we assume that we're using the Gnu C compiler, gcc.
- -g = produce debugging information
- -Wall = turn on all warnings
- -o mpi_hello = name the executable file mpi_hello
- mpi_hello.c = the source file

Explain Line 5

Line 5 defines the constant MAX_STRING (100), the maximum length of the string stored in the char array greeting.

MPI_Reduce can operate..

On arrays instead of scalars

Which process handles I/O (Input and Output) (Trap Method)

Only Process 0 handles I/O

Most MPI implementations only allow process ...?

Only Process 0 in MPI_COMM_WORLD has access to stdin - If multiple processes have access to stdin, which process should get which parts of the input data? - Should process 0 get the first line? Process 1 the second? Or should process 0 get the first character?

If we have a total of n keys and p = comm_sz processes

Our algorithm will start and finish with n/p keys assigned to each process. (As usual, we'll assume n is evenly divisible by p.) At the start, there are no restrictions on which keys are assigned to which processes

Of course, our test program will be useless unless we can see the result of our vector addition, so we need to write a function for printing out a distributed vector

Our function can collect all of the components of the vector onto process 0, and then process 0 can print all of the components. The communication in this function can be carried out by MPI_Gather.
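A sketch, along the lines of the book's Print_vector, showing how MPI_Gather collects the distributed blocks onto process 0 (<stdlib.h> is needed for malloc/free):

void Print_vector(double local_b[], int local_n, int n, char title[],
                  int my_rank, MPI_Comm comm) {
   double* b = NULL;

   if (my_rank == 0) {
      b = malloc(n * sizeof(double));
      /* Gather every process's block onto process 0 */
      MPI_Gather(local_b, local_n, MPI_DOUBLE, b, local_n, MPI_DOUBLE, 0, comm);
      printf("%s\n", title);
      for (int i = 0; i < n; i++)
         printf("%f ", b[i]);
      printf("\n");
      free(b);
   } else {
      /* The receive buffer argument is ignored on non-root processes */
      MPI_Gather(local_b, local_n, MPI_DOUBLE, b, local_n, MPI_DOUBLE, 0, comm);
   }
}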

How do we distinguish collective communications from functions such as MPI_Send and MPI_Recv?

Point-to-Point communications

What is a program running on one core-memory pair is usually called?

Process

# of Cores =

Number of processes p = number of cores; the processes are ranked 0, 1, ..., p - 1.

In Lines 24-25

Receive the message sent by process q, for q = 1, 2, ..., comm_sz - 1.

Define the MPI_Get_address

Returns the address of the memory location referenced by location_p. The special type MPI_Aint is an integer type that is big enough to store an address on the system.

In Lines 19-20

Send the message to process 0. Process 0, on the other hand, simply prints its message using printf, and then uses a for loop to receive and print the messages sent by processes 1, 2, ..., comm_sz - 1.

So if x has a block distribution, how can we arrange that each process has access to all the components of x before we execute the following loop?

Using the collective communications we're already familiar with, we could execute a call to MPI_Gather followed by a call to MPI_Bcast. This would, in all likelihood, involve two tree-structured communications, and we may be able to do better by using a butterfly. So, once again, MPI provides a single function, MPI_Allgather: it concatenates the contents of each process's send_buf_p and stores the result in each process's recv_buf_p.
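A sketch of the matrix-vector multiplication using MPI_Allgather so that every process has all of x before the loop (modeled on the book's Mat_vect_mult; <stdlib.h> is needed for malloc/free):

void Mat_vect_mult(double local_A[], double local_x[], double local_y[],
                   int local_m, int n, int local_n, MPI_Comm comm) {
   double* x = malloc(n * sizeof(double));

   /* Gather every process's block of x and give the full vector to everyone */
   MPI_Allgather(local_x, local_n, MPI_DOUBLE, x, local_n, MPI_DOUBLE, comm);

   for (int local_i = 0; local_i < local_m; local_i++) {
      local_y[local_i] = 0.0;
      for (int j = 0; j < n; j++)
         local_y[local_i] += local_A[local_i*n + j] * x[j];
   }
   free(x);
}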

Explain the fourth argument dest (nonnegative int)

Specifies the rank of the process that should receive the message

The most widely used measure of the relation between the serial and the parallel run-times is the: .....

Speedup: it's just the ratio of the serial run-time to the parallel run-time, S(n, p) = T_serial(n) / T_parallel(n, p).

In a distributed-memory system:

Each core's CPU is paired with its own private memory, and these CPU-memory pairs communicate through an interconnect.

In a shared-memory system:

The CPUs (cores) connect through an interconnect to a single, globally shared memory.

Explain the MPI_Status:

The MPI type MPI_Status is a struct with at least the three members MPI_SOURCE, MPI_TAG, and MPI_ERROR.

What is the the Trap function?

The Trap function is just an implementation of the serial trapezoidal rule

What do we mean by a parallel sorting algorithm in a distributed-memory environment? What would its "input" be and what would its "output" be?

The answers depend on where the keys are stored. We can start or finish with the keys distributed among the processes or assigned to a single process

We can use MPI_Type_create_struct to build a derived datatype that consists of individual elements that have different basic types:

The argument count is the number of elements in the datatype, so for our example it should be three. Each of the array arguments should have count elements. The first array, array_of_blocklengths, allows for the possibility that the individual data items might be arrays or subarrays. The third argument to MPI_Type_create_struct, array_of_displacements, specifies the displacements, in bytes, from the start of the message. To find these values, we can use the function MPI_Get_address.
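A sketch of building the {two doubles, one int} type for the trapezoidal rule input, along the lines of the book's Build_mpi_type (variable and parameter names follow the book's conventions):

void Build_mpi_type(double* a_p, double* b_p, int* n_p,
                    MPI_Datatype* input_mpi_t_p) {
   int          array_of_blocklengths[3]  = {1, 1, 1};
   MPI_Datatype array_of_types[3] = {MPI_DOUBLE, MPI_DOUBLE, MPI_INT};
   MPI_Aint     a_addr, b_addr, n_addr;
   MPI_Aint     array_of_displacements[3] = {0};

   MPI_Get_address(a_p, &a_addr);
   MPI_Get_address(b_p, &b_addr);
   MPI_Get_address(n_p, &n_addr);
   array_of_displacements[1] = b_addr - a_addr;   /* displacements in bytes, */
   array_of_displacements[2] = n_addr - a_addr;   /* relative to a           */

   MPI_Type_create_struct(3, array_of_blocklengths, array_of_displacements,
                          array_of_types, input_mpi_t_p);
   MPI_Type_commit(input_mpi_t_p);   /* let MPI optimize its internal representation */
}

/* A single MPI_Bcast(a_p, 1, *input_mpi_t_p, 0, comm) then sends a, b, and n
   together; call MPI_Type_free(input_mpi_t_p) when the type is no longer needed. */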

Fortunately, MPI provides a variant of MPI_Reduce that will store the result on all the processes in the communicator:

The argument list is identical to that for MPI_Reduce except that there is no dest_process, since all the processes should get the result.
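The variant is MPI_Allreduce; its prototype (parameter names in the book's style):

int MPI_Allreduce(
   void*        input_data_p    /* in  */,
   void*        output_data_p   /* out */,
   int          count           /* in  */,
   MPI_Datatype datatype        /* in  */,
   MPI_Op       operator        /* in  */,
   MPI_Comm     comm            /* in  */);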

MPI provides three basic approaches to consolidating data that might other- wise require multiple messages:

The count argument to the various communication functions, derived datatypes, and MPI Pack/Unpack. We've already seen the count argument—it can be used to group contiguous array elements into a single message

When the algorithm terminates:

The keys assigned to each process should be sorted in (say) increasing order If 0 ≤ q < r < p, then each key assigned to process q should be less than or equal to every key assigned to process r.

Explain the MPI_Bcast method

The process with rank 'source_proc' sends the contents of the memory referenced by 'data_p' to all the processes in the communicator 'comm'.
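Its prototype, plus a typical use in the trapezoidal rule's Get_input after process 0 has read a, b, and n (names follow the book's conventions):

int MPI_Bcast(
   void*        data_p       /* in/out: in on source_proc, out on the others */,
   int          count        /* in */,
   MPI_Datatype datatype     /* in */,
   int          source_proc  /* in */,
   MPI_Comm     comm         /* in */);

/* Every process makes the same three calls: */
MPI_Bcast(&a, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(&b, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(&n, 1, MPI_INT,    0, MPI_COMM_WORLD);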

If the number of components is n and we have comm_sz cores or processes, let's assume that comm_sz evenly divides n and define local_n = n/comm_sz.

Then we can simply assign blocks of local_n consecutive components to each process. For example, when n = 12 and comm_sz = 3, process 0 gets components 0-3, process 1 gets components 4-7, and process 2 gets components 8-11. This is often called a block partition of the vector.

What are the arguments, argc p and argv p?

They are pointers to the arguments to main, argc and argv. However, when our program doesn't use these arguments, we can just pass NULL for both.

What does the line 3 (mpi.h header file) mean?

This contains prototypes of MPI functions, macro definitions, type definitions, and so on; it contains all the definitions and declarations needed for compiling an MPI program.

It's not necessary for the calls to MPI_Init and MPI_Finalize to be in main.

True

Explain Lines 8-10

Lines 8-10 declare the variables the program will set: the char array greeting, and the ints comm_sz and my_rank.

Vector vs Scalar Processor

Vector processor: a processor whose instructions operate on arrays of data, i.e., it processes an entire array of data at the same time. Scalar processor: a processor whose instructions operate on individual data items, one at a time.

In our trapezoidal rule program,

We just print the result, so it's perfectly natural for only one process to get the result of the global sum

Explain the variable "comm sz"

We'll often use the variable "comm_sz" for the number of processes in MPI_COMM_WORLD.

Explain MPI_Type_free

When we're finished with our new type, this frees any additional storage used.

On process 0, a, b, and n will be sent with the one call

While on the other processes, the values will be received with the call.

What is the world of parallel multiple instruction, multiple data, or MIMD, computers is, for the most part, divided into?

distributed-memory and shared-memory

Height h of each trapezoid:

h = (b - a)/n

Explain the MPI_Init method

- int MPI_Init(int* argc_p /* in/out */, char*** argv_p /* in/out */); - The arguments, argc_p and argv_p, are pointers to the arguments to main, argc and argv. - However, when our program doesn't use these arguments, we can just pass NULL for both - Like most MPI functions, MPI_Init returns an int error code, and in most cases we'll ignore these error codes

Explain the status_p argument

If the receiving process doesn't need to examine the message's source, tag, or size, the status won't be used by the calling function, and, as in our "greetings" program, the special MPI constant MPI_STATUS_IGNORE can be passed.

We already know how to read in the dimension of the vectors:

process 0 can prompt the user, read in the value, and broadcast the value to the other processes

Explain the variable "my_rank"

the variable "my_rank" is used for the process rank.

Explain why the size of the string greeting is not the same as the size of the message specified by the arguments msg_size and msg_type

When we run the program with four processes, each message is the string "Greetings from process x of 4!", which is 30 characters plus the terminating '\0', so only strlen(greeting)+1 = 31 characters are sent even though the greeting buffer can hold 100.

