High Performance Computing and Distributed Systems


What are the 2 parts of a protocol

- A specification of the sequence of messages to be exchanged - A specification of the format of the data in the messages

Factors affecting parallel performance

- Fraction of the work which is serial - Load Balancing: processes waiting at synchronisation points whilst others complete work - Hardware resource contention: saturating memory bandwidth - Communication overheads

How do you design an architectural model

- Simplify and abstract the function of individual components of a DS - Consider the placement of the components across a network of computers - Define useful patterns for the distribution of data and workloads - Identify the interrelationships between the components: their functional roles and the patterns of communication between them.

Give examples of 3 distributed systems

- The internet - An intranet - Mobile and Ubiquitous Computing

What is thread affinity

It binds specific OpenMP threads to specific hardware resources (CPU cores)

What is a routing overlay

It is a distributed algorithm for a middleware layer responsible for locating nodes and objects and routing requests from any client to a host that holds the required object.

What is an MPI communicator

It is a group of processes in MPI

What is a large shared memory machine

It is a machine where all cores share memory, using a proprietary interconnect to provide cache-coherent shared memory. This architecture is highly NUMA, so the performance of memory accesses varies widely. It is more expensive than a distributed memory cluster. Because all cores share the same address space, OpenMP can be used across the whole machine rather than requiring MPI.

What is Moore's Law, and how is this law holding up today

The number of transistors in a chip will double approximately every 24 months. From around 2005 this trend started to level off and the law was no longer met.

What is strong scaling

We keep the problem size fixed and increase the number of processors

How can you experience data races in parallel for loops

Where the value you're updating depends on another value in the array, which may or may not have been updated yet by another thread.

What are the consequences of Amdahl's law

The maximum speedup is 1/s = 1/(1-p), where s is the serial fraction and p the parallel fraction. So the serial fraction of our program limits the maximum possible speedup.

What is arithmetic intensity

The number of floating point operations per byte of data transferred from memory.
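
As an illustrative worked example (not in the original card): the update y[i] = a*x[i] + y[i] performs 2 floating point operations while moving 24 bytes (loading x[i] and y[i] and storing y[i], 8 bytes each for doubles), giving an arithmetic intensity of 2/24 ≈ 0.08 flops/byte.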

What is the serial computation time for a weak scaling program

It grows linearly with N: a single processor must carry out the parallel fraction of the calculation N times, once for each portion of the scaled problem.

How do you improve the accuracy of an approximation made with time stepping

decrease the size of the time step

What file do you need to include to use OpenMP in C

#include <omp.h>
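
A minimal sketch of a complete program using this header (the printed message is illustrative); compile with an OpenMP flag such as gcc -fopenmp:

#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel
    {
        /* each thread reports its own id and the team size */
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}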

In C how do you start an OpenMP directive

#pragma omp

What is the DS Challenge of Scalability

A DS is scalable if the cost of adding a user is a constant amount in terms of the resources that must be added. The system must also work efficiently with an increasing number of users at many different scales. The algorithms used to access the data should be decentralised and avoid performance bottlenecks. Data should be structured hierarchically (in a tree-like structure rather than linearly) to get the best access time.

What's the difference between named and unnamed critical regions

A thread waits at the start of a critical region identified by a given name until no other thread in the program is executing a critical region with that same name. If the region has no name then no other critical regions without a name can be entered at the same time, so use names where possible to allow multiple critical regions to be used

What is the impact of accelerators on the Top 500 list

A total of 138 systems use accelerators, up from 110 six months earlier.

What is a protocol

A well-known set of rules and formats to be used for communication between processes in order to perform a task

What is an architectural model

Address the placement of system components and relationships between them Defines the ways in which the system components interact and are mapped onto the underlying network of computers.

What is BLAS

Basic linear algebra subprograms Defines standardised interface for basic vector and matrix operations

Why would you rarely use a synchronous DS

Because it is hard to set time limits for process execution, message delivery, or clock drift.

Compare the performance impact of critical vs. atomic

Both have a negative performance impact. Atomic is usually significantly more efficient than critical, so use it where possible.
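
A minimal sketch contrasting the two (the variable and helper function are illustrative placeholders):

int hits = 0;
#pragma omp parallel for
for (int i = 0; i < n; i++) {
    if (test(i)) {
        /* atomic protects a single memory update - usually cheap */
        #pragma omp atomic
        hits++;
    }
}

/* critical protects an arbitrary block of code - more versatile but slower */
#pragma omp critical
{
    update_shared_structure();
}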

How can P2P systems ensure availability

By storing multiple replicas of objects

How do we represent sparse matrices

Compressed sparse row format
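
A sketch of compressed sparse row (CSR) storage in C for a small illustrative matrix:

/* Matrix:  [ 5 0 0 ]
            [ 0 8 3 ]
            [ 0 0 6 ]
   CSR keeps only the non-zeros, their column indices, and row start offsets */
double values[]  = { 5.0, 8.0, 3.0, 6.0 }; /* the non-zero entries   */
int    col_idx[] = { 0,   1,   2,   2   }; /* column of each entry   */
int    row_ptr[] = { 0, 1, 3, 4 };         /* row i occupies values[row_ptr[i] .. row_ptr[i+1]-1] */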

Why do we use GPUs as accelerators for HPC

Good floating point performance and high memory bandwidth. Some operations need to be handled by the CPU, but computationally expensive work can be offloaded to the GPU.

What is a GPU

Graphics Processing Unit. Specialised hardware for carrying out operations related to image generation.

How does CUDA implement Data parallelism

The data set is a stream of elements, a kernel is applied to each element, and kernels execute on the GPU.

What is the DS Challenge of Openness

Distributed systems should be extensible. It should be easy for new resource-sharing services to be added and made available to a variety of clients. Open distributed systems are based on the provision of a universal communication mechanism and published interfaces for access to shared resources.

What is domain decomposition

Distributing work between MPI processes by breaking down the domain of the problem.

How does process placement affect performance

Efficient use of memory is important. One MPI process per socket with multiple OpenMP threads tends to work effectively, since the OpenMP threads access memory on the same memory controller.

Explain GPU memory architecture

High bandwidth memory and memory controllers (Nvidia Pascal has 16GB of memory running at 720GB/s). Small L2 cache (which is the highest cache level).

How can you improve the load balance of MPI programs

Interleaving loop iterations should give better load balance as work in the computationally expensive regions is more evenly distributed

What is cache memory, and why do we need it

It is a small amount of fast memory which holds data fetched from and written to main memory. Cache helps hide memory latency provided we make efficient use of the data we fetch from main memory. We need to make efficient use of cache to get good performance on modern processors.

What is MPI_Bcast

It sends data from 1 process to all others in a specified communicator

What does high performance mean for interconnects

Low latency: short delay sending small messages High bandwidth: high data transfer rate for large messages.

What is the default MPI communicator

MPI_COMM_WORLD

What is the speed-up of a weak scaling program

Speed-up scales linearly with N, making it easy to make use of a large number of processors.

What is a memory bound application

an application limited by the speed at which data can be accessed from main memory

How do you compile an MPI program

Use the mpicc command to compile for C. Use the mpirun command to run the program, with the -np flag specifying the number of processes.

What are handles in java serialised form

they are references to an object within the serialised form

How do you debug MPI programs

when you compile, use the flags -check_mpi and -g so that you can see issues in the output

How do you parallelise a for loop using OpenMP

#pragma omp for - this directive needs to be within a parallel region. Alternatively you can combine the two directives into #pragma omp parallel for, which starts a parallel region and parallelises the loop (see the sketch below).
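
A minimal sketch of both forms (the array names are illustrative):

#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* combined form: starts the region and parallelises the loop */
#pragma omp parallel for
for (int i = 0; i < n; i++)
    a[i] = b[i] + c[i];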

In C how do you start an OpenMP parallel section

#pragma omp parallel
{
    ...
}

Can you easily program for a GPU

GPU programming is hard and may require an application to be restructured to use a GPU efficiently.

How is diversity increasing in HPC Systems

GPUs are in 5 of the current top 10 systems. ARM Architecture is starting to have an impact.

Why do we have to carry out parallel programming

Because parallelisation is typically too complex for the compiler to handle automatically and needs to be programmed and specified

Why is Fortran still used for HPC

Because there are many large legacy HPC programs and libraries still in use which would take enormous effort to rewrite

What is the parallel computation time with an ideal parallel computer for a weak scaling program

It stays constant as the problem grows: we only need to carry out the parallel region once, provided we have the right number of processes for each point in the domain of the problem.

What are ports

A local port is a message destination within a computer. The combination of IP address and port number uniquely identifies the specific process to which the data should be delivered. The port number is a positive integer, and some ports are reserved for common/well-known services.

What is the peer to peer architectural model

All of the processes play similar roles, interacting cooperatively as peers without distinction between clients and servers. All participating processes run the same program and offer the same interfaces to each other. Provides better scalability than the client-server architecture.

What is the DS challenge of heterogeneity. How do we solve this

All parts of the system can be different: different networks, hardware, operating systems, programming languages. We use the Internet Protocol (IP) to mask different networks. Middleware is a software layer that can deal with the other differences, for example CORBA (Common Object Request Broker Architecture) and Java RMI (Remote Method Invocation).

How do you write programs to use cache effectively

Always access arrays with stride 1, so you're accessing adjacent data. For example, with a 2-D array, iterating through the first item of every row is far less efficient than iterating through each row fully and then moving to the next (see the sketch below). In the original example the second program was about 7 seconds faster.
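
A sketch of the two loop orders (N and x are illustrative; C stores each row contiguously, so the second loop is the stride-1 version):

double x[N][N];

/* stride-N traversal: each access touches a different cache line */
for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
        x[i][j] += 1.0;

/* stride-1 traversal: adjacent accesses share cache lines */
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        x[i][j] += 1.0;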

How should you use finite differences to maximise memory performance

Always calculate them in one direction, as calculating them in the other direction may require access to multiple cache lines

What is exascale computing. What are the current constraints in realising this

An HPC system capable of 10^18 flops. We can't just scale up existing systems as there are power and cost constraints. We also need applications that can make efficient use of the capability, so software innovation needs to happen as well as hardware.

What is an external data representation

An agreed standard for representation of data structure and primitive values

What is the Gordon Bell Prize, and what projects are increasingly winning it

An award for outstanding achievement in HPC; AI and deep learning projects are increasingly winning it.

What is the DS Challenge of failure handling

Any process, computer, or network may fail independently of the others; some components fail while others continue to function. Each component needs to be aware of the possible ways in which the components it depends on may fail, and be designed to deal with each of those failures appropriately.

What are some key BLAS implementations, why would you use them

Application performance is significantly improved with an optimised BLAS. Optimised implementations include OpenBLAS and Intel MKL.

Where is BLAS used

As low-level building blocks for linear algebra and other libraries like Python's NumPy.

When should you use critical sections vs. atomic sections

Atomic locks a specific memory location, not a section of code, so it is more efficient. But critical can lock a large section of code, so it is more versatile.

What are accelerators and what are their relationships with CPU

CPUs are designed to deliver acceptable performance for a very wide range of applications. They need to trade off functionality, performance, energy efficiency, and cost. Accelerators provide increased performance for specific workloads, making different design trade-offs.

What is the most popular GPU programming model

CUDA

What is false cache sharing

Caches on different processors need to agree on data in memory. When a thread writes to memory it invalidates the corresponding cache line in other caches, so another thread using data in that same line needs to reload it, even if the two threads are touching different variables. This reduces performance due to unnecessary reloading of data.

What is a Client-Server basic model

Client processes interact with individual server processes in separate host computers in order to access the shared resources that they manage. Servers may in turn be clients and use the services of other servers.

What are all the types of architectural models

Client-server basic model. Services provided by multiple cooperative servers. Proxy servers and caches. Peer-to-peer.

What is CDR

Common Data Representation. It can represent all of the data types that can be used as arguments and return values in remote invocations in CORBA, allowing clients and servers written in different languages to communicate.

What is MPI collective communication

Communication within a specified communicator. All processes in the communicator must reach the collective communication call, otherwise there will be a deadlock.

What does CUDA stand for

Compute Unified Device Architecture

Explain the basic structure of a traditional HPC system

Compute nodes are made up of a number of cores, which all share RAM in a single memory address space (and share files). A number of compute nodes fit into a single "rack unit" and connect to a single network switch; these compute nodes can communicate faster with each other than with compute nodes on other switches. Multiple rack units are then connected together.

How do we make efficient use of a cache

Data is transferred into the cache in fixed blocks called cache lines. When we access adjacent memory locations, the next piece of data will likely be in the same cache line, so will already be in the cache even if we didn't explicitly access it, and we don't need to request it from main memory, which is very slow. Reading from memory locations sequentially is usually efficient and will result in more cache hits (where the data is found in the cache) and fewer cache misses (where data is not found and has to be requested from main memory).

What is a remote object interface

Every remote object has a remote interface that specifies which of its methods can be invoked remotely. Objects in other processes can invoke only the methods in the remote interface, whilst local objects can invoke all methods.

What are the main HPC languages

Fortran and C/C++

What are the ARM processors being used in HPC and why are they being used

Fujitsu's A64FX Arm-based processor, with 32GB of high bandwidth memory and very good energy efficiency.

What is branch divergence in GPUs, why is it bad

GPUs are suited to data-parallel workloads: each thread carries out the same operations on different data items. Branch divergence within a warp is when two threads take different branches in the execution. This can degrade performance because threads in a warp must execute in lock step, so they may no longer finish computing at the same time. GPUs handle this by masking: if threads need to take different branches, the GPU turns off the threads that aren't taking the current branch and computes the others, then returns to the branch point and computes the remaining threads by disabling the already-computed ones. This way each thread carries out the correct calculation for its branch (masking the calculations on specific threads) whilst the warp still only executes a single instruction at a time.

What is MPI_Wtime

Gives the time in seconds since an arbitrary point in the past

What is High performance Conjugate Gradients

HPCG is representative of sparse linear algebra workloads - lower arithmetic intensity and a much lower performance than HPL

When can we not calculate parallel speed-up

If it isn't feasible to run a serial version of the code

What are loop scheduling and load balancing

If loop iterations do the same amount of work then threads should finish at the same time; this is known as load balancing. We achieve this through loop scheduling to distribute work between threads.

When might you consider not purchasing the most efficient interconnect

If the workload is not dominated by communication, as the interconnect is a significant part of the cost of an HPC system

What are MPI wildcards

If we're using a manager-worker approach we don't know which process will next request work, so the manager needs to accept requests from any worker process. We can use wildcards such as MPI_ANY_SOURCE to receive from an unspecified source (see the sketch below).
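
A sketch of a manager accepting work requests from any worker (the message contents and tag are illustrative):

int request;
MPI_Status status;
/* receive from whichever worker asks first */
MPI_Recv(&request, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
         MPI_COMM_WORLD, &status);
/* status.MPI_SOURCE identifies the worker, so we can send work back to it */
int worker = status.MPI_SOURCE;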

How does the resolution of a calculation affect the stability

If you increase the resolution you may need to increase the number of time steps required to reach the required simulated time, to ensure stability is maintained.

How does the communication overhead change as the domain size decreases. What is the communication overhead proportional to

In 2-D and 3-D it is proportional to 1/N, where N is the size of one axis. In strong scaling N decreases as we break down the domain into smaller pieces. If N is smaller then communication makes up a larger portion of the execution time, so communication overheads limit strong scaling.

What is the single construct

In a parallel region, work which is not parallelised is repeated by all threads. The single directive ensures only one thread executes a block of code inside a parallel region, but it can be any thread.

What is the architectural model: proxy servers and caches

Increases availability and performance of a service by reducing the load on the network and servers. A cache stores recently used data objects closer to the client. Caches may be located with each client or in a proxy server shared between clients. When an object is needed by a client process, the caching service first checks the cache and supplies the object if an up-to-date copy is available.

What is the ExCalibur project

It aims to redesign high priority simulation codes and algorithms to fully harness the power of future supercomputers, keeping UK research and development at the forefront of high-performance simulation science.

What is MPI_Scatter

It distributes data from one process to all others in a communicator. However each process receives a subset of the data

What is MPI_Allgather

It gathers data from all processes in a communicator to all processes, each process contributes a subset of the data received and then all processes receive the result

What is MPI_Gather

It gathers data from all processes in a communicator to one process, each process contributes a subset of the data received

What is the most common OS for HPC

It has changed from being majority Unix to nearly 100% Linux

What is a distributed System

It is a system in which components located at networked computers communicate and coordinate their actions only by passing messages

What is the Courant-Friedrichs-Lewy Condition

It is an example of a stability condition. In hydrodynamics schemes this condition ensures that the scheme stays stable: by limiting the timestep we ensure stability.

What is the concept of masking failures and how is it done

It is possible to construct reliable services from components that exhibit failures. Knowledge of the failure characteristics of a component can enable services to be designed to mask the failure of the components on which they depend, e.g. checksums are used to mask corrupted messages by rejecting them. Rather than relying on the server never failing, we simply reject its corrupted input (masking the failure).

What is unmarshalling

It is the process of disassembling an external data representation to produce an equivalent collection of data items at the destination - a data structure

What is marshalling

It is the process of taking a collection of data items and assembling them into a form suitable for transmission. Translating structured data items and primitive values into an external data representation

What is a data race

It is where multiple threads try to update a variable at the same time, and some updates to the variable can be lost

What is a multicast operation

It sends a single message from 1 process to each of the members of a group of processes. The sender cannot see the membership of the group, it is transparent

What are the 3 levels of cache

L1, L2, L3, with increasing sizes but also increasing latency.

How do you do error handling in MPI for C

MPI functions return an error status which can be checked against the value of MPI_SUCCESS However errors are normally fatal unless you choose to handle them within the function.

How do you determine the rank of the MPI process

MPI_Comm_rank

How do you determine the number of MPI processes

MPI_Comm_size
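
A minimal sketch combining the two calls:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* total process count */
    printf("Process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}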

What are the non blocking MPI send/recv commands

MPI_Isend and MPI_Irecv. You need to use MPI_Wait or MPI_Waitall at the point where the program needs to have finished communicating, to prevent the data in the send buffer from being changed before it has been safely received (see the sketch below).
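
A sketch of the pattern (buffers, counts, and neighbour ranks are illustrative):

MPI_Request reqs[2];
MPI_Isend(sendbuf, n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[0]);
MPI_Irecv(recvbuf, n, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[1]);

/* ... computation that touches neither buffer can overlap here ... */

/* both must complete before reusing sendbuf or reading recvbuf */
MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);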

What are the 2 basic MPI reduction operators

MPI_Reduce: carry out the reduction and return the result to a specified process. MPI_Allreduce: carry out the reduction and return the result to all processes.

How do you do MPI point-to-point communication

MPI_Send() and MPI_Recv()

Compare memory speed to CPU speed, how has this changed over time

Memory is very slow, and memory speed hasn't been improving as fast as CPU speed; back in 1980 memory was just as fast as the CPU.

What is the primary focus of modern distributed systems and why

More focus on sharing other resources such as data and functionality, and on collaborating. This is because hardware is less expensive, so sharing hardware is less important.

Explain the idea of layers in protocols

Network software is made up of a hierarchy of layers. Each layer presents an interface to the layer above, and each layer is represented by a module in a networked computer. Each module appears to communicate directly with a module at the same level on another computer on the network; in reality data is passed up and down the protocol stack of layers.

Can you use MPI/OpenMP together, give an example

On a cluster you can use OpenMP within a node and MPI between nodes/sockets. For example, you can use MPI to decompose a domain and then parallelise loop iterations with OpenMP.

What are the two main technologies for parallel programming

OpenMP: directive-based parallelism for shared memory, used within a compute node to make use of all processor cores. MPI: Message Passing Interface for shared or distributed memory systems, using function calls; this can communicate between compute nodes.

Give an example of reduction in OpenMP

Using a reduction clause to sum values across threads, e.g. reduction(+:sum) on a parallel for loop (see the sketch below). Otherwise the sum variable would need to be updated within an atomic region.
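
A minimal sketch of the reduction clause (the array and size are illustrative):

double sum = 0.0;
/* each thread accumulates a private copy of sum; OpenMP combines them at the end */
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < n; i++)
    sum += a[i];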

Why do we have a queue for HPC systems

Parallel programs proceed at the speed of the slowest individual thread/process. We want exclusive access to resources so that processes in our parallel application are not competing for resources, including memory. The queuing system allocates exclusive use of resources to your job.

Give some traditional applications of HPC

Physical sciences, engineering, and applied mathematics: solving equations that cannot be solved analytically by finding approximate numerical solutions, e.g. for PDEs.

What is weak scaling

Problem size per processor remains fixed: we increase the problem size with the number of processors we have access to.

Why are most DS's in practice asynchronous, what is the advantage of this

Processes must share processors and communication channels, making them asynchronous. Many design problems can be solved in an asynchronous system, e.g. events can be ordered even without a global clock.

What tasks do GPUs carry out and what properties do they have because of this

Projecting 3D objects onto 2D images requires many matrix/vector operations. This is a highly parallel task: GPUs contain a large number of floating point units and support a large number of processing threads. They have a higher memory bandwidth than a CPU (different memory).

What is the difference between Rpeak and Rmax in Linpack

Rmax is the maximum performance achieved; Rpeak is the theoretical peak performance.

What is the baseline physical model

A representation of the underlying hardware elements, abstracting away details of computer and network technologies. This is the baseline physical model of a DS, essentially describing what it is.

What is the DS challenge of security

Resources are accessible only to authorised users and used in ways that are intended

What are the categories of Flynn's Taxonomy and give an example for each

SISD - single instruction, single data, e.g. a serial non-parallel computer. SIMD - single instruction, multiple data, e.g. GPUs. MISD - multiple instruction, single data; few practical examples. MIMD - multiple instruction, multiple data, e.g. most HPC systems.

What is the architectural model - Services provided by multiple cooperative servers. When might these be used and what are the benefits

Services may be implemented as several cooperative server processes in separate host computers interacting to provide a service to client processes Objects required to provide service may be replicated or partitioned between servers Replication is used to increase performance and improve fault tolerance Cluster based web servers are used for highly scalable web services such as search engines or web stores

What is the master construct

Similar to the single construct but ensures the master thread executes the block of code. With a single region all threads need to wait for the block to be finished; master doesn't have this issue, so other threads don't need to wait for the master construct to finish before continuing the rest of their work within the parallel region.

What are the levels of parallelism within an HPC cluster

A single core can use vector instructions (SIMD). Within a compute node there is shared memory between cores supporting multiple instruction and data streams (MIMD). Between nodes there is distributed memory, also supporting multiple instruction and data streams (MIMD).

What do you need to specify when calculating spatial derivatives and time derivatives

Spatial derivatives - boundary conditions; spatial resolution determines the accuracy, and the time step can affect stability. Time derivatives - initial conditions; time-step size determines accuracy and stability.

What is one way to improve the memory performance of an application

Spread the threads over more sockets using thread affinity variables. Providing the bandwidth from more memory controllers

How does the performance overhead of the loop scheduling options compare

Static is the lowest, then dynamic and finally guided.

What is the default loop scheduler

Static, if the schedule is not defined

Why are sparse matrices bad

Storing many blank elements is inefficient, and they have a low arithmetic intensity.

Explain the processor architecture of a GPU

The GPU is made up of streaming multiprocessors (SMs) (up to 60 in Nvidia Pascal), of which up to 4 may be disabled due to manufacturing defects. The SMs are grouped into graphics processing clusters (of 10 in Nvidia Pascal). Each graphics processing cluster includes all elements of the rendering pipeline - effectively an independent GPU.

How is work distributed on GPUs

The GigaThread Engine schedules threads and handles context switching between them. The thread engine schedules blocks of up to 1024 threads These blocks are then subdivided into warps of up to 32 threads which execute on a single SM

What is the Open Systems Interconnection (OSI) model

The OSI model is designed to encourage the development of protocol standards for open systems. It sets out the structure of network layers.

What protocol is the internet based on and what does this allow

The TCP/IP protocols which are open protocols which allow messages to be sent over a heterogeneous network of networks

What is the WLCG

The Worldwide LHC (large hadron collider) Computing Grid. The mission is to provide global computing resources to store and analyse the data from CERN. It is made up of 170 computing centres in 42 countries, linking up national and international grid structures

What are sockets

The combination of an internet address and a port number is known as a socket, which uniquely identifies a process within the entire internet. A socket is a software abstraction which provides an endpoint of a two-way communication link between two processes. Messages sent to a particular address and port number can only be received by a process whose socket is associated with that address and port number. A socket can send and receive messages. Any process may use multiple ports to receive messages, but a process cannot share ports with other processes on the same computer (outside of multicast ports). Any number of processes may send messages to the same port.

What are fundamental models

The formal description of properties that are common in all architectural models Address time synchronisation, message delays, failures, security issues The overall goal is to ensure that the DS is correct, reliable, secure, adaptable, and cost effective

Give a basic definition of HPC

The practice of aggregating computing power in a way that delivers much higher performance than one could get out of a typical desktop computer or workstation, in order to solve large problems in science, engineering, or business.

Explain the architecture and components of RMI

The proxy (stub) makes the RMI transparent to the client by behaving like a local object. When object A wants to invoke a method on object B, it simply invokes the method on the stub of B locally. The remote reference module translates between local and remote object references and creates remote object references for new objects. The communication modules implement the request-reply protocol. The server has a skeleton and dispatcher for each remote object.

What is the java rmi registry

The registry runs on the server and contains a list of all the remote objects which the server can export. Clients connect to the registry and provide the name of the required remote object. The registry returns a reference which the client can use to invoke methods on the remote object.

What is stability when solving PDE's computationally

The scheme is stable if the solution is guaranteed to remain finite. If the errors grow without bound then the scheme is not stable and the program will generate floating point exceptions

What are the skeleton and dispatcher

The skeleton implements the methods in the remote interface. The dispatcher receives a request message from the communication module and uses the methodID to select the appropriate method in the skeleton.

What are thread private variables

They are private to a single thread but can be accessed in any parallel region, so they're essentially global variables that are private to a thread.

What are reduction variables in OpenMP

They bring together results from multiple threads

Other than floating point performance and memory performance why might you use GPUs for HPC

They offer very good energy efficiency

How do MPI programs communicate

They pass messages by making calls to functions/subroutines.

What are critical sections

They provide a way to protect an arbitrary section of code from access by multiple threads. Only one thread may be in a critical section at a time.

What are request-reply protocols

They represent a pattern on top of message passing and support the two-way exchange of messages encountered in client-server architectures. They also provide support for Remote Method Invocation (RMI).

What is the truncation error

This is the error introduced when we approximate a derivative by a finite difference (from truncating the underlying Taylor series).

How do we calculate the rate of change of x and y for a spatial derivative using a stencil

This is known as the finite difference. Many different formulae for dv/dx and dv/dy can be used (see the sketch below; the original card illustrated these with stencil diagrams).
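
A sketch of one common choice, the central difference (v, dx, and dy are assumed to be the gridded variable and the grid spacings):

/* rate of change at grid point (i,j) from its neighbouring points */
dvdx = (v[i+1][j] - v[i-1][j]) / (2.0 * dx);
dvdy = (v[i][j+1] - v[i][j-1]) / (2.0 * dy);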

What is Dennard Scaling, how is it holding up today

This says that power per unit area (power density) remains constant as transistors shrink, meaning that as transistors become smaller they also become faster and more energy efficient, since the energy required to power them was proportional to their size. However with very small features we encounter leakage current, and there is also a threshold voltage required for the transistor. So Dennard scaling has broken down: making transistors smaller no longer makes them proportionally faster and more energy efficient.

What are the security threats to a DS

Threats to a server/client process; threats to communication channels; denial of service - the enemy interferes with the activities of authorised users by making excessive and pointless invocations on services or message transmissions, resulting in overloading of physical resources.

What is the DS challenge of transparency

To hide from the user and the application programmer the separation/distribution of components, so that the system is perceived as a whole rather than a collection of independent components. Access transparency: access to local or remote resources is identical. Location transparency: access without knowledge of location. Failure transparency: tasks can be completed despite failures.

What is the DS challenge of concurrency

To process multi-client requests concurrently and correctly To provide and manage concurrent access to shared resources in a safe way (fair scheduling and deadlock avoidance)

Why do we need distributed systems

To share resources: - hardware resources e.g. printers - data resources e.g. files, databases - resources with specific functionality e.g. search engines - to develop a collaborative environment e.g. SharePoint, GitHub - to aggregate computing power

What are the 2 historic applications of networking

Transferring files and data between hosts Allowing one host to run programs on another host

How should you accurately test the performance of a program

Use the same problem size, but run for fewer iterations. If you reduce the problem size you may change the performance in a non-uniform manner, as the program may suddenly be able to use the cache more effectively.

What is the section construct

Used when there are two or more sections of code which can be executed concurrently.

How can we measures the performance of an MPI program and understand the load imbalance

Using a profiling tool such as Intel Trace Analyser and Collector (ITAC)

How can you carry out GPU programming using OpenMP

Using offloading via the target construct. The teams distribute directive distributes loop iterations over threads in a team

Give an example of using an MPI reduction operator

Using the MPI_MIN operator in the MPI_Allreduce function you can calculate the minimum value across all processes (see the sketch below). This can be used to agree on the timestep when solving PDEs, as all processes need to use the same timestep, which should be the smallest timestep required by any process to maintain stability.
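
A sketch of agreeing the global timestep this way (compute_stable_timestep is an illustrative placeholder):

double local_dt = compute_stable_timestep(); /* this process's stability limit */
double global_dt;
/* every process receives the minimum over all local timesteps */
MPI_Allreduce(&local_dt, &global_dt, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);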

How should you manage the number of OpenMP threads

Using the OMP_NUM_THREADS environment variable. Calls to omp_set_num_threads in the program override this, but that is not good practice: your program should be flexible to any number of threads being available to it. If old programs had hard-coded the number of threads, we wouldn't be able to easily run them now with the far larger number of threads available to us.

How can you set the loop schedule without recompiling

Using the OMP_SCHEDULE environment variable, if the schedule is set to runtime, this will dictate the loop schedule

What security is the server responsible for

Verifying the identity of the principal (user or process) behind each operation (authentication), and checking whether they have sufficient access rights to perform the requested operation, rejecting those who do not (authorisation).

How are GPUs connected

Via PCI Express, an industry standard connection which works in any machine with enough power, cooling, and physical space. Nvidia's NVLink provides better connectivity than PCI-e.

What are the advantages of MPI

We can use distributed memory systems

How can we use the C pre-processor to compile a program without OpenMP directives

We can use the #ifdef directive to carry out conditional compilation (see the sketch below).
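
A minimal sketch - OpenMP compilers define the _OPENMP macro, so a serial build simply skips these lines:

#ifdef _OPENMP
#include <omp.h>
#endif

int nthreads = 1; /* sensible serial default */
#ifdef _OPENMP
nthreads = omp_get_max_threads(); /* only compiled when OpenMP is enabled */
#endif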

How do we avoid data races

We ensure that updates to a variable are only carried out by one thread at a time, known as atomic updates, implemented with the atomic directive (see the sketch below).
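
A minimal sketch of an atomic update:

/* only one thread at a time may perform this update, so no increments are lost */
#pragma omp atomic
counter += 1;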

How can we use finite difference stencils to calculate second order derivatives

We need 3 points for each finite difference stencil rather than just 2.

How do we calculate spatial derivatives numerically

We store the values of the variables at discrete points in space, e.g. density, velocity, temperature. We assume a regular grid of points in space, which can be implemented as an array. We calculate values for spatial derivatives using the differences between grid points.

What is Gustafson's Law

When given a more powerful processor the problem generally expands to make use of the increased facilities

When should you use guided scheduling

When iterations are poorly balanced, so one iteration may involve far more work than another. Very effective if there is poor load balancing towards the end of the computation.

When would you not want to start a parallel region and a parallel for loop at the same time

When you are going to carry out multiple for loops that all need to use the same threads. Instead you can put multiple #pragma omp for loops within a region to reduce overheads

What is Amdahl's Law

Speedup = 1/(s + p/N), where the serial fraction s and the parallel fraction p are the portions of the program with serial/parallel functionality and N is the number of processors.

Do MPI processes have their own memory address space

Yes, unlike OpenMP where the address space is always shared.

What is High Performance Linpack

You are permitted to change the size of the matrices so that floating point performance is close to the peak value. HPL performance (Rmax) is usually a significant fraction of the theoretical maximum performance (Rpeak).

With domain decomposition how would you program finite difference stencils

You can communicate information about other regions on demand, or store a halo (a copy of rows/columns held by other MPI processes). At the end of each iteration each process can exchange the new values (halos) it calculated with any other processes that need them (see the sketch below).
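
A sketch of a halo exchange with up/down neighbours using MPI_Sendrecv (the buffers and neighbour ranks are illustrative; neighbours set to MPI_PROC_NULL make the calls no-ops at the domain edge):

MPI_Sendrecv(first_row,  n, MPI_DOUBLE, up,   0,  /* send our first owned row up  */
             halo_below, n, MPI_DOUBLE, down, 0,  /* receive halo from below      */
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Sendrecv(last_row,   n, MPI_DOUBLE, down, 1,  /* send our last owned row down */
             halo_above, n, MPI_DOUBLE, up,   1,  /* receive halo from above      */
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);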

What are some benefits of openMP

You can compile without it to recover serial functionality You can incrementally parallelise existing code

What can you see from the output of ITAC

You can see what functions are taking up the most time and where the program is waiting the longest

What is the idea of time-stepping for approximating the solution to a differential equation with a time derivative

Starting from a specified initial condition, you advance the solution in small time steps, using the current values to approximate the solution at the next step.

What are boundary conditions

You need to specify what happens to a stencil at the edge of the domain where there is no neighbouring value. This choice can have a large impact on the outcome, and should be determined based on the physical situation being simulated.

What is the OpenMP Fork-Join Model

You start with a single master thread, which forks worker threads in a parallel region. Threads join at the end of the parallel region and execution becomes serial again. This only works when processors share memory, and there is an overhead to starting/ending parallel regions.

What is a compute bound application

application limited by the rate at which the processors can carry out arithmetic operations

How do we store data in running programs

as data structures

How is information in messages stored

as sequences of bytes

What is a big use for linear algebra workloads

benchmarking

How do we reduce false cache sharing

by using a large chunk size in OpenMP loop scheduling so there is less overlap in the data being accessed by processors

What do you need to run CUDA programs

compiler support, such as the nvcc compiler from Nvidia for C

How do you use copyin for thread private variables

copyin causes these variables to be initialised from their value in the master thread, rather than being undefined on entry to the first parallel region (see the sketch below)
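
A sketch of the directive pair (the variable is an illustrative file-scope counter):

int work_count = 0;                    /* file-scope: one private copy per thread */
#pragma omp threadprivate(work_count)

void run(void) {
    work_count = 5; /* set in the master thread */
    /* every thread starts the region with the master's value, 5 */
    #pragma omp parallel copyin(work_count)
    {
        work_count++;
    }
}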

How do you carry out variable scoping within an OpenMP parallel region

default(none) ensures that default scoping rules aren't applied to any variables, so every variable must be scoped explicitly. Multiple variables can be specified for each type, e.g. private(i,j) (see the sketch below).
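
A sketch of fully explicit scoping (a and n are assumed to be declared earlier):

double total = 0.0;
int i, j;
#pragma omp parallel for default(none) private(i, j) shared(a, n) reduction(+:total)
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        total += a[i][j];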

In what environments can you use MPI

distributed memory and shared memory environments

How has supercomputer performance been increasing over time

exponentially

How does the registry map remote objects

from textual URL-style names to references to remote objects on the server

What is an rmi security manager

it checks all operations and ensures they are permitted

What is a kernel in GPU programming

it is a computational function applied to each element

What is a remote object reference

it is an identifier for a remote object that is valid throughout a distributed system. Must be unique in space and time.

What is the spatial resolution when calculating the finite difference

it is the size of the cells on the grid; the smaller the cells, the greater the accuracy

What are the disadvantages of MPI

more design work is required as we have to distribute the work ourselves. It is less suited to incremental parallelism - it should be designed from the start. You may need to rethink the algorithm. The best serial algorithm may not be the best parallel algorithm.

What is NUMA

non-uniform memory access: accessing some memory locations is slower than others, for example accessing the memory attached to another processor will have slower performance

What are remote objects

objects that can receive remote invocations

What is a spatial gradient

rate of change of some property in space

How can we achieve security in a DS

securing the processes and channels used in their interactions, and protecting the objects that they encapsulate against unauthorised access

How are MPI_Send and MPI_Recv blocking communications

send blocks until the message is received by the destination process or until it is safe to change the data in the send buffer. Recv will block until it receives a message. If the send/receive operations are not correctly matched there will be a deadlock

How have we continued to improve performance after the breakdown of Moore's law and Dennard scaling

the size of many components has continued to decrease, so we fit more components on a chip: we now get more cores on a processor rather than better performance per core. Improvements in single-core performance have been relatively modest, so improving performance falls largely to the programmer, who must use all processor cores effectively

What is the Top 500 and how is it calculated

the top 500 supercomputers, ranked according to the Linpack benchmark, which measures floating point operations per second with a dense linear algebra workload

Give some properties of traditional HPC systems

tightly coupled (all parts need to function together and talk to each other), low latency interconnect, parallel file systems, designed to run high performance, highly scalable custom applications.

How do you find an IP address in java

use the InetAddress class; you need to import java.net.* and handle UnknownHostException

How can you time an OpenMP program

use the time command before running the executable, e.g. time ./cprogram. This outputs multiple time measurements; the real time is how long the program took to run, while sys and user relate to the actual CPU time. You can also use omp_get_wtime() to create timestamps at different points in the program, then compare a later value against the initial one; this can be used to time specific sections (see the sketch below)
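
A minimal sketch of the second approach (requires omp.h and stdio.h; the timed section is a placeholder):

double t0 = omp_get_wtime();
/* ... section of interest ... */
double t1 = omp_get_wtime();
printf("section took %f seconds\n", t1 - t0); /* elapsed wall-clock time */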

How do you access the registry within java

using the Naming class

