High Performance Computing and Distributed Systems
What are the 2 parts of a protocol
- A spec of the sequence of messages to be exchanged - A spec of the format of the data in the messages
Factors affecting parallel performance
- Fraction of the work which is serial - Load Balancing: processes waiting at synchronisation points whilst others complete work - Hardware resource contention: saturating memory bandwidth - Communication overheads
How do you design an architectural model
- Simplify and abstract the function of individual components of a DS - Consider the placement of the components across a network of computers - Define useful patterns for the distribution of data and workloads - Identify the interrelationships between the components - their functional roles and the patterns of communication between them.
Give examples of 3 distributed systems
- The internet - An intranet - Mobile and Ubiquitous Computing
What is thread affinity
It binds specific OpenMP threads to specific hardware resources (CPU cores)
What is a routing overlay
It is a distributed algorithm for a middleware layer responsible for locating nodes and objects and for routing requests from any client to a host that holds the required object.
What is an MPI communicator
It is a group of processes in MPI
What is a large shared memory machine
It is a machine where all cores share memory, using a proprietary interconnect to provide cache-coherent shared memory. This architecture is highly NUMA, so memory access performance varies widely. It is more expensive than a distributed memory cluster, but because all cores share the same address space OpenMP can be used across the whole machine, rather than requiring MPI as a distributed memory cluster does.
What is Moore's Law, and how is this law holding up today
The number of transistors in a chip doubles approximately every 24 months. From around 2005 this started to level off and is no longer met
What is strong scaling
We keep the problem size fixed and increase the number of processors
How can you experience data races in parallel for loops
Where the value you're updating depends on another value in the array, which may or may not have been updated yet by another thread
What are the consequences of Amdahl's law
The maximum speedup is 1/s = 1/(1-p), so the serial fraction of our program limits the maximum possible speedup
What is arithmetic intensity
The number of floating point operations per byte of data transferred from memory
What is the serial computation time for a weak scaling program
T_serial = s + N·p, where N is the number of times we need to carry out the parallel fraction of our calculation (the problem size grows with the number of processors)
How do you improve the accuracy of an approximation made with time stepping
decrease the size of the time step
What file do you need to include to use OpenMP in C
#include <omp.h>
In C how do you start an OpenMP directive
#pragma omp
What is the DS Challenge of Scalability
A DS is scalable if the cost of adding a user is a constant amount in terms of the resources that must be added. The system must also work efficiently with an increasing number of users at many different scales. The algorithms used to access the data should be decentralised to avoid performance bottlenecks, and data should be structured hierarchically (in a tree-like structure rather than linearly) to get the best access times
What's the difference between named and unnamed critical regions
A thread waits at the start of a critical region identified by a given name until no other thread in the program is executing a critical region with that same name. All unnamed critical regions share a single lock, so no two unnamed regions can be entered at the same time; use names where possible so that unrelated critical regions can execute concurrently, as in the sketch below.
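A minimal sketch (variable and region names invented for illustration):

    #include <omp.h>

    int a = 0, b = 0;

    void update(int x)
    {
        /* regions with different names can run concurrently;
           two threads can never be inside regions sharing a name */
        #pragma omp critical(update_a)
        a += x;

        #pragma omp critical(update_b)
        b += x;
    }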
What is the impact of accelerators on the Top 500 list
A total of 138 systems use accelerators, up from 110 six months earlier
What is a protocol
A well-known set of rules and formats to be used for communication between processes in order to perform a task
What is an architectural model
Addresses the placement of system components and the relationships between them. Defines the ways in which the system components interact and how they are mapped onto the underlying network of computers.
What is BLAS
Basic Linear Algebra Subprograms. Defines a standardised interface for basic vector and matrix operations
Why would you rarely use a synchronous DS
Because it is hard to set time limits for process execution, message delivery or clock drift
Compare the performance impact of critical vs. atomic
Both have a negative performance impact. Atomic is usually significantly more efficient than critical, so use it where possible.
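A hedged sketch of the cheaper option (in practice a reduction clause would be better still for a sum like this):

    #include <omp.h>

    double sum_atomic(const double *a, int n)
    {
        double total = 0.0;
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            /* atomic protects a single memory update and often maps to a
               hardware instruction, so it is cheaper than a critical region */
            #pragma omp atomic
            total += a[i];
        }
        return total;
    }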
How can P2P systems ensure availability
By storing multiple replicas of objects
How do we represent sparse matrices
Compressed sparse row format
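A sketch of CSR in C (the struct layout is illustrative, not from the cards):

    /* store only the non-zero values, their column indices, and where
       each row starts */
    typedef struct {
        int     nrows;
        int    *row_ptr;  /* length nrows+1: row i occupies [row_ptr[i], row_ptr[i+1]) */
        int    *col_idx;  /* column index of each stored value */
        double *val;      /* the non-zero values themselves */
    } csr_matrix;

    /* y = A*x, the core kernel in sparse workloads such as HPCG */
    void csr_spmv(const csr_matrix *A, const double *x, double *y)
    {
        for (int i = 0; i < A->nrows; i++) {
            double sum = 0.0;
            for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++)
                sum += A->val[k] * x[A->col_idx[k]];
            y[i] = sum;
        }
    }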
Why do we use GPUs as accelerators for HPC
Good floating point performance and high memory bandwidth. Some operations need to be handled by the CPU, but computationally expensive work can be offloaded to the GPU
What is a GPU
Graphics Processing Unit. Specialised hardware for carrying out operations related to image generation.
How does CUDA implement Data parallelism
The data set is a stream of elements; a kernel is applied to each element, and the kernels execute on the GPU
What is the DS Challenge of Openness
Distributed Systems should be extensible. It should be easy for new resource-sharing services to be added and made available to a variety of clients. Open distributed systems are based on the provision of a universal communication mechanism and published interfaces for access to shared resources
What is domain decomposition
Distributing work between MPI processes by breaking down the domain of the problem.
How does process placement affect performance
Efficient use of memory is important. One MPI process per socket with multiple OpenMP threads tends to work effectively, as the OpenMP threads access memory on the same memory controller
Explain GPU memory architecture
High bandwidth memory and memory controllers (Nvidia Pascal has 16GB of memory running at 720GB/s). Small L2 cache (which is the highest cache level)
How can you improve the load balance of MPI programs
Interleaving loop iterations should give better load balance as work in the computationally expensive regions is more evenly distributed
What is cache memory, and why do we need it
It is a small amount of fast memory which holds data fetched from and written to main memory. Cache helps hide memory latency provided we make efficient use of the data we fetch from main memory. We need to make efficient use of cache to get good performance on modern processors
What is MPI_Bcast
It sends data from 1 process to all others in a specified communicator
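A minimal sketch:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, n = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) n = 1024;   /* only the root holds the value initially */

        /* every process makes the same call; afterwards all hold root's n */
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }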
What does high performance mean for interconnects
Low latency: short delay sending small messages High bandwidth: high data transfer rate for large messages.
What is the default MPI communicator
MPI_COMM_WORLD
What is the speed-up of a weak scaling program
Speed-up scales linearly with N, making it easy to make use of a large number of processors
What is a memory bound application
an application limited by the speed at which data can be accessed from main memory
How do you compile an MPI program
Use the mpicc command to compile for C, e.g. mpicc prog.c -o prog, then use mpirun to run the program, with the -np flag specifying the number of processes, e.g. mpirun -np 4 ./prog
What are handles in java serialised form
they are references to an object within the serialised form
How do you debug MPI programs
When you compile, use the flags -check_mpi and -g so that you can see issues in the output
How do you parallelise a for loop using OpenMP
#pragma omp for - this directive needs to be within a parallel region. Alternatively you can combine the two directives into #pragma omp parallel for, which starts a parallel region and parallelises the loop, as in the sketch below.
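A minimal sketch of the combined form:

    #include <omp.h>

    void scale(double *a, int n, double s)
    {
        /* starts a parallel region and divides the loop
           iterations between the threads */
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            a[i] *= s;
    }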
In C how do you start an OpenMP parallel section
#pragma omp parallel {}
Can you easily program for a GPU
GPU programming is hard and may require an application to be restructured to use a GPU efficiently.
How is diversity increasing in HPC Systems
GPUs are in 5 of the current top 10 systems. ARM Architecture is starting to have an impact.
Why do we have to carry out parallel programming
Because parallelisation is typically too complex for the compiler to handle automatically and needs to be programmed and specified
Why is Fortran still used for HPC
Because there are many large legacy HPC programs and libraries still in use which would take enormous effort to rewrite
What is the parallel computation time with an ideal parallel computer for a weak scaling program
We only need to carry out the parallel region once (if we have the right number of processes for each point in the problem domain), so the parallel time is s + p
What are ports
A local port is a message destination within a computer. The combination of IP address and port number uniquely identifies the specific process to which the data should be delivered. The port number is a positive integer, and some ports are reserved for common/well-known services
What is the peer to peer architectural model
All of the processes play similar roles, interacting cooperatively as peers without distinction between clients and servers. All participating processes run the same program and offer the same interfaces to each other. Provides better scalability than the client-server architecture
What is the DS challenge of heterogeneity. How do we solve this
All parts of the system can be different: different networks, hardware, OS, programming languages. We use the Internet Protocol (IP) to mask different networks. Middleware is a software layer that can deal with the other differences, for example CORBA (Common Object Request Broker Architecture) and Java RMI (Remote Method Invocation)
How do you write programs to use cache effectively
Always access arrays with stride 1, so you're accessing adjacent data. For example, with a 2-D array, iterating through the first item of every row before moving to the second is far less efficient than iterating through each row fully and then moving to the next, as sketched below; in the lecture example the stride-1 version was about 7 seconds faster
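A sketch of the two access patterns (the array size is invented for illustration):

    #define N 4096
    static double a[N][N];

    void good_stride(void)
    {
        /* C stores rows contiguously: the inner loop walks adjacent
           memory (stride 1), so each cache line fetched is fully used */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] *= 2.0;
    }

    void bad_stride(void)
    {
        /* swapped loops jump N doubles per access, so nearly every
           access brings in a new cache line that is mostly unused */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                a[i][j] *= 2.0;
    }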
How should you use finite differences to maximise memory performance
Always calculate them in one direction, as calculating them in the other direction may require access to multiple cache lines
What is exascale computing. What are the current constraints in realising this
An HPC system capable of 10^18 flops. We can't just scale up current systems as there are power and cost constraints. We also need applications that could make efficient use of the capability, so software innovation needs to happen as well as hardware innovation
What is an external data representation
An agreed standard for the representation of data structures and primitive values
What is the Gordon Bell Prize, what projects are increasingly winning it
An award for outstanding achievement in HPC; AI and deep learning projects are increasingly winning it
What is the DS Challenge of failure handling
Any process, computer, or network may fail independently of the others. Some components fail while others continue to function. Each component needs to be aware of the possible ways in which the components it depends on may fail, and be designed to deal with each of those failures appropriately
What are some key BLAS implementations, why would you use them
Application performance is significantly improved with an optimised BLAS. Optimised versions include OpenBLAS and Intel MKL
Where is BLAS used
As low-level building blocks for linear algebra and other libraries, like Python's NumPy
When should you use critical sections vs. atomic sections
Atomic locks a specific memory location, not a section of code, so it is more efficient. But critical can lock a large section of code, so it is more versatile
What are accelerators and what are their relationships with CPU
CPUs are designed to deliver acceptable performance for a very wide range of applications; they need to trade off functionality, performance, energy efficiency and cost. Accelerators provide increased performance for specific workloads by making different design trade-offs
What is the most popular GPU programming model
CUDA
What is false cache sharing
Caches on different processors need to agree on the data in memory. When a thread writes to memory it invalidates the corresponding cache line, so another thread using data in the same line must reload it, even if it never touched the same variable. This reduces performance due to unnecessary reloading of data
What is a Client-Server basic model
Client processes interact with individual server processes in separate host computers in order to access the shared resources that they manage. Servers may in turn be clients and use the services of other servers
What are all the types of architectural models
- Client-server basic model - Services provided by multiple cooperative servers - Proxy servers and caches - Peer-to-peer
What is CDR
Common Data Representation. Can represent all of the data types that can be used as arguments and return values in remote invocations in CORBA. Allows clients and servers written in different languages to communicate
What is MPI collective communication
Communication involving all processes in a specified communicator. All processes in the communicator must reach the collective call, otherwise there will be a deadlock
What does CUDA stand for
Compute Unified Device Architecture
Explain the basic structure of a traditional HPC system
Compute nodes are made up of a number of cores which share RAM within a single address space; separate nodes have their own address spaces but share files. A number of compute nodes fit into a single "rack unit" and connect to a single network switch; these nodes can communicate faster with each other than with compute nodes on other switches. Multiple rack units are then connected together
How do we make efficient use of a cache
Data is transferred into the cache in fixed blocks called cache lines. When we access adjacent memory locations the next piece of data will likely be in the same cache line, so it will already be in the cache even if we didn't explicitly access it, and we don't need to request it from main memory, which is very slow. Reading from memory locations sequentially is usually efficient and will result in more cache hits (where the data is found in the cache) and fewer cache misses (where data is not found and has to be requested from main memory)
What is a remote object interface
Every remote object has a remote interface that specifies which of its methods can be invoked remotely. Objects in other processes can invoke only the methods in the remote interface, whilst local objects can invoke all methods.
What are the main HPC languages
Fortran and C/C++
What are the ARM processors being used in HPC and why are they being used
Fujitsu's A64FX Arm-based processor, which has 32GB of high bandwidth memory and is very energy efficient
What is branch divergence in GPUs, why is it bad
GPUs are suited to data-parallel workloads: each thread carries out the same operations on different data items. Branch divergence within a warp is when two threads take different branches in the execution. This can degrade performance because threads in a warp execute in lock step, so they may no longer finish computing at the same time. GPUs handle this by masking: if threads need to take different branches, the GPU turns off the threads that aren't taking the current branch and calculates the others, then returns to the branch point and calculates the remaining threads by disabling the already-calculated ones. This way each thread carries out the correct calculation for its branch (by masking the calculations on specific threads) whilst the warp still issues only a single instruction at a time.
What is MPI_Wtime
Gives the time in seconds since an arbitrary point in the past
What is High performance Conjugate Gradients
HPCG is representative of sparse linear algebra workloads - lower arithmetic intensity and a much lower performance than HPL
When can we not calculate parallel speed-up
If it isn't feasible to run a serial version of the code
What are loop scheduling and load balancing
If loop iterations do the same amount of work then threads should finish at the same time; this is known as load balancing. We achieve this through loop scheduling, which distributes work between threads
When might you consider not purchasing the most efficient interconnect
If the workload is not dominated by communication, as the interconnect is a significant part of the cost of an HPC system
What are MPI wildcards
If we're using a manager-worker approach we don't know which process will next request work, so the manager needs to accept requests from any worker process. We can use wildcards such as MPI_ANY_SOURCE to receive from an unspecified source, as in the sketch below.
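A hedged sketch of the manager side (the message contents are invented):

    #include <mpi.h>

    /* accept a request from whichever worker asks first,
       then reply to exactly that worker */
    void serve_one_request(int next_task)
    {
        int request;
        MPI_Status status;
        MPI_Recv(&request, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
        /* status.MPI_SOURCE records who actually sent the message */
        MPI_Send(&next_task, 1, MPI_INT, status.MPI_SOURCE, 0, MPI_COMM_WORLD);
    }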
How does the resolution of a calculation affect the stability
If you increase the resolution you may need to reduce the time step (increasing the number of time steps required to reach the required simulated time) to ensure stability is maintained
How does the communication overhead change as the domain size decreases. What is the communication overhead proportional to
In 2-D and 3-D it is proportional to 1/N, where N is the size of one axis. In strong scaling N decreases as we break the domain into smaller pieces; if N is smaller then communication makes up a larger portion of the execution time. So communication overheads limit strong scaling
What is the single construct
In a parallel region, work which is not parallelised is repeated by all threads. The single directive ensures only one thread executes a block of code inside a parallel region, but it can be any thread
What is the architectural model: proxy servers and caches
Increases availability and performance of a service by reducing the load on the network and servers. A cache stores recently used data objects closer to the client. Caches may be co-located with each client or placed in a proxy server shared between clients. When an object is needed by a client process, the caching service first checks the cache and supplies the object if an up-to-date copy is available
What is the ExCalibur project
It aims to redesign high priority simulation codes and algorithms to fully harness the power of future supercomputers, keeping UK research and development at the forefront of high-performance simulation science.
What is MPI_Scatter
It distributes data from one process to all processes in a communicator, with each process receiving a different subset of the data
What is MPI_Allgather
It gathers data from all processes in a communicator to all processes: each process contributes a subset of the data, and all processes receive the combined result
What is MPI_Gather
It gathers data from all processes in a communicator to one process, each process contributes a subset of the data received
What is the most common OS for HPC
It has changed from being majority Unix to nearly 100% Linux
What is a distributed System
It is a system in which components located at networked computers communicate and coordinate their actions only by passing messages
What is the Courant-Friedrichs-Lewy Condition
It is an example of a stability condition. In hydrodynamics schemes this condition ensures that the scheme stays stable: by limiting the timestep we ensure stability
What is the concept of masking failures and how is it done
It is possible to construct reliable services from components that exhibit failures. Knowledge of the failure characteristics of a component can enable services to be designed to mask the failure of the components on which they depend, e.g. checksums are used to detect corrupted messages and reject them: rather than relying on messages never being corrupted, we simply reject the corrupted ones (masking the failure)
What is unmarshalling
It is the process of disassembling an external data representation to produce an equivalent collection of data items at the destination - a data structure
What is marshalling
It is the process of taking a collection of data items and assembling them into a form suitable for transmission. Translating structured data items and primitive values into an external data representation
What is a data race
It is where multiple threads try to update a variable at the same time, and some updates to the variable can be lost
What is a multicast operation
It sends a single message from one process to each of the members of a group of processes. The membership of the group is transparent to the sender
What are the 3 levels of cache
L1, L2, L3, with increasing sizes but increasing latency
How do you do error handling in MPI for C
MPI functions return an error status which can be checked against the value of MPI_SUCCESS. However, errors are normally fatal unless you install a different error handler.
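A minimal sketch of checking a return code:

    #include <mpi.h>
    #include <stdio.h>

    void checked_rank(int *rank)
    {
        /* every MPI call returns a code comparable to MPI_SUCCESS
           (by default errors abort before we ever see the code) */
        int err = MPI_Comm_rank(MPI_COMM_WORLD, rank);
        if (err != MPI_SUCCESS) {
            fprintf(stderr, "MPI_Comm_rank failed: %d\n", err);
            MPI_Abort(MPI_COMM_WORLD, err);
        }
    }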
How do you determine the rank of the MPI process
MPI_Comm_rank
How do you determine the number of MPI processes
MPI_Comm_size
What are the non blocking MPI send/recv commands
MPI_Isend and MPI_Irecv. You need to use MPI_Wait or MPI_Waitall at the point when the program needs the communication to have finished, to prevent data in the send buffer from being changed before it has been received.
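A hedged sketch of an exchange with a single neighbour (buffer handling invented for illustration):

    #include <mpi.h>

    void exchange(double *sendbuf, double *recvbuf, int n, int neighbour)
    {
        MPI_Request reqs[2];
        MPI_Isend(sendbuf, n, MPI_DOUBLE, neighbour, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(recvbuf, n, MPI_DOUBLE, neighbour, 0, MPI_COMM_WORLD, &reqs[1]);

        /* ...work that touches neither buffer can overlap the transfer... */

        /* neither buffer may be reused until the requests complete */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }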
What are the 2 basic MPI reduction operators
MPI_Reduce: carry out the reduction and return the result to the specified process. MPI_Allreduce: carry out the reduction and return the result to all processes
How do you do MPI point-to-point communication
MPI_Send() and MPI_Recv()
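A minimal sketch:

    #include <mpi.h>

    void send_example(int rank)
    {
        double x = 3.14;
        /* the send on rank 0 must be matched by the receive on rank 1,
           otherwise both processes block forever */
        if (rank == 0)
            MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }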
Compare memory speed to CPU speed, how has this changed over time
Memory is very slow, and memory speed hasn't been improving as fast as CPU speed; back in 1980 memory was about as fast as the CPU
What is the primary focus of modern distributed systems and why
More focus on sharing other resources such as data and functionality, and on collaborating. This is because hardware is less expensive, so sharing hardware is less important.
Explain the idea of layers in protocols
Network software is made up of a hierarchy of layers. Each layer presents an interface to the layer above, and each layer is represented by a module in a networked computer. Each module appears to communicate directly with a module at the same level on a computer elsewhere on the network; in reality data is passed up and down the protocol stack of layers
Can you use MPI/OpenMP together, give an example
On a cluster you can use OpenMP within a node and MPI between nodes/sockets. For example, you can use MPI to decompose a domain, then parallelise loop iterations with OpenMP
What are the two main technologies for parallel programming
OpenMP: directive-based parallelism for shared memory, used within a compute node to make use of all processor cores. MPI: Message Passing Interface for shared or distributed memory systems using function calls; this can communicate between compute nodes
Give an example of reduction in OpenMP
Summing an array with reduction(+:sum), as sketched below; without the reduction clause the update to the sum variable would need to be within an atomic region
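A minimal sketch of the kind of example the card refers to:

    #include <omp.h>

    double sum_array(const double *a, int n)
    {
        double sum = 0.0;
        /* each thread accumulates into a private copy of sum;
           the copies are combined at the end of the loop */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }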
Why do we have a queue for HPC systems
Parallel programs proceed at the speed of the slowest individual thread/process. We want exclusive access to resources so that the processes in our parallel application are not competing for them; this includes memory resources. The queuing system allocates resources for the exclusive use of your job
Give some traditional applications of HPC
Physical sciences, engineering and applied mathematics, solving equations that cannot be solved analytically Finding approximate numerical solutions, e.g. for PDEs
What is weak scaling
Problem size per processor remains fixed: we increase the total problem size as we gain access to more processors
Why are most DS's in practice asynchronous, what is the advantage of this
Processes must share processors and communication channels, making them asynchronous. Many design problems can be solved even in an asynchronous system, e.g. events can be ordered even without a global clock
What tasks do GPUs carry out and what properties do they have because of this
Projecting 3D objects onto 2D images involves many matrix/vector operations. This is a highly parallel task, so GPUs contain a large number of floating point units and support a large number of processing threads. They have a higher memory bandwidth than a CPU (different memory technology)
What is the difference between Rpeak and Rmax in Linpack
Rmax is the maximum performance achieved; Rpeak is the theoretical peak performance
What is the baseline physical model
A representation of the underlying hardware elements, abstracting away the details of computer and network technologies. This is the baseline physical model of a DS, essentially describing what it is
What is the DS challenge of security
Resources are accessible only to authorised users and used in ways that are intended
What are the categories of Flynn's Taxonomy and give an example for each
SISD - Single instruction, single data, e.g. a serial non-parallel computer SIMD - Single instruction, multiple data, e.g. GPUs MISD - Multiple instruction, single data: few practical examples MIMD - Multiple instruction, multiple data, e.g. most HPC systems
What is the architectural model - Services provided by multiple cooperative servers. When might these be used and what are the benefits
Services may be implemented as several cooperative server processes in separate host computers, interacting to provide a service to client processes. The objects required to provide the service may be replicated or partitioned between servers. Replication is used to increase performance and improve fault tolerance. Cluster-based web servers are used for highly scalable web services such as search engines or web stores
What is the master construct
Similar to the single construct, but ensures the master thread executes the block of code. With a single region all threads wait for the block to be finished; master doesn't have this implicit barrier, so other threads don't need to wait for the master construct to finish before continuing the rest of their work within the parallel region
What are the levels of parallelism within an HPC cluster
A single core can use vector instructions (SIMD). Within a compute node there is shared memory between cores supporting multiple instruction/data streams (MIMD). Between nodes there is distributed memory, also supporting multiple instruction/data streams (MIMD)
What do you need to specify when calculated spatial derivatives and time derivatives
Spatial derivatives - boundary conditions; the spatial resolution determines the accuracy, and the time step can affect stability. Time derivatives - initial conditions; the time-step size determines accuracy and stability
What is one way to improve the memory performance of an application
Spread the threads over more sockets using thread affinity variables. Providing the bandwidth from more memory controllers
How does the performance overhead of the loop scheduling options compare
Static is the lowest, then dynamic and finally guided.
What is the default loop scheduler
Static, if the schedule is not defined
Why are sparse matrices bad
Storing many blank elements is inefficient, and they have a low arithmetic intensity
Explain the processor architecture of a GPU
The GPU is made up of streaming multiprocessors (SMs) - up to 60 in Nvidia Pascal, of which up to 4 may be disabled due to manufacturing defects. These are grouped into graphics processing clusters (of 10 SMs each in Nvidia Pascal). Each graphics processing cluster includes all elements of the rendering pipeline - effectively an independent GPU
How is work distributed on GPUs
The GigaThread Engine schedules threads and handles context switching between them. The thread engine schedules blocks of up to 1024 threads; these blocks are then subdivided into warps of 32 threads which execute on a single SM
What is the Open Systems Interconnection model
The OSI model is designed to encourage the development of protocol standards for open systems. It sets out the structure of network layers
What protocols is the internet based on and what does this allow
The TCP/IP protocols, which are open protocols that allow messages to be sent over a heterogeneous network of networks
What is the WLCG
The Worldwide LHC (large hadron collider) Computing Grid. The mission is to provide global computing resources to store and analyse the data from CERN. It is made up of 170 computing centres in 42 countries, linking up national and international grid structures
What are sockets
The combination of an internet address and a port number is known as a socket, which uniquely identifies a process within the entire internet. A socket is a software abstraction which provides an endpoint for a two-way communication link between two processes. Messages sent to a particular address and port number can only be received by a process whose socket is associated with that address and port number. A socket can both send and receive messages. Any process may use multiple ports to receive messages, but a process cannot share ports with other processes on the same computer (outside of multicast ports). Any number of processes may send messages to the same port
What are fundamental models
The formal description of properties that are common to all architectural models. They address time synchronisation, message delays, failures and security issues. The overall goal is to ensure that the DS is correct, reliable, secure, adaptable, and cost effective
Give a basic definition of HPC
The practice of aggregating computing power in a way that delivers much higher performance than one could get out of a typical desktop computer or workstation, in order to solve large problems in science, engineering, or business
Explain the architecture and components of RMI
The proxy (stub) makes the RMI transparent to the client by behaving like a local object. When object A wants to invoke a method on object B it simply invokes the method on the local stub for B. The remote reference module translates between local and remote object references and creates remote object references for new objects. The communication modules implement the request-reply protocol. The server has a skeleton and dispatcher for each remote object
What is the java rmi registry
The registry runs on the server and contains a list of all the remote objects which the server exports. Clients connect to the registry and provide the name of the required remote object; the registry returns a remote reference which the client can use to invoke methods on that object
What is stability when solving PDE's computationally
The scheme is stable if the solution is guaranteed to remain finite. If the errors grow without bound then the scheme is not stable and the program will generate floating point exceptions
What are the skeleton and dispatcher
The skeleton implements the methods in the remote interface. The dispatcher receives a request message from the communication module and uses the methodID to select the appropriate method in the skeleton
What are thread private variables
They are private to a single thread but can be accessed in any parallel region, so they're essentially global variables that are private to a thread
What are reduction variables in OpenMP
They bring together results from multiple threads
Other than floating point performance and memory performance why might you use GPUs for HPC
They offer very good energy efficiency
How do MPI programs communicate
They pass messages by making calls to functions/subroutines
What are critical sections
They provide a way to protect an arbitrary section of code from access by multiple threads. Only one thread may be in a critical section at a time
What are request-reply protocols
They represent a pattern on top of message passing, supporting the two-way exchange of messages encountered in client-server architectures. They also provide support for Remote Method Invocation (RMI)
What is the truncation error
It is the error introduced when we approximate a derivative with a finite difference, from truncating the underlying Taylor series
How do we calculate the rate of change of x and y for a spatial derivative using a stencil
This is known as the finite difference. Many different formulae for dv/dx and dv/dy can be used; a standard example is given below.
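For instance, assuming a uniform grid spacing dx and dy (a standard choice, not necessarily the one in the card's original pictures), the central differences are dv/dx at (i,j) ≈ (v[i+1][j] - v[i-1][j]) / (2·dx) and dv/dy at (i,j) ≈ (v[i][j+1] - v[i][j-1]) / (2·dy).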
What is Dennard Scaling, how is it holding up today
This says that power per unit area (power density) remains constant as transistors shrink, meaning that as transistors become smaller they also become faster and more energy efficient, since the energy required to power them was proportional to their size. However, with very small features we encounter leakage current, and there is a threshold voltage required for the transistor. So Dennard scaling has broken down, and shrinking transistors no longer makes them proportionally faster and more efficient
What are the security threats to a DS
Threats to a server/client process Threats to communication channels Denial of service - the enemy interferes with the activities of authorised users by making excessive and pointless invocations on services or message transmissions, resulting in overloading of physical resources
What is the DS challenge of transparency
To hide from the user and the application programmer the separation/distribution of components, so that the system is perceived as a whole rather than a collection of independent components. Access transparency: access to local or remote resources is identical. Location transparency: access without knowledge of location. Failure transparency: tasks can be completed despite failures
What is the DS challenge of concurrency
To process multi-client requests concurrently and correctly To provide and manage concurrent access to shared resources in a safe way (fair scheduling and deadlock avoidance)
Why do we need distributed systems
To share resources: - hardware resources e.g. printers - data resources e.g. files, databases - resources with specific functionality e.g. search engines - to develop a collaborative environment e.g. SharePoint, GitHub - to aggregate computing power
What are the 2 historic applications of networking
Transferring files and data between hosts Allowing one host to run programs on another host
How should you accurately test the performance of a program
Use the same problem size, but run for fewer iterations. If you reduce the problem size you may change the performance in a non-uniform manner, as the program may suddenly be able to use the cache more effectively
What is the section construct
Used when there are two or more sections of code which can be executed concurrently
How can we measures the performance of an MPI program and understand the load imbalance
Using a profiling tool such as Intel Trace Analyser and Collector (ITAC)
How can you carry out GPU programming using OpenMP
Using offloading via the target construct. The teams distribute directive distributes loop iterations over the threads in a team, as in the sketch below.
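A hedged sketch of target offloading (the map clauses assume the arrays start in host memory):

    #include <omp.h>

    void saxpy(float *x, float *y, int n, float a)
    {
        /* offload the loop to the device; map clauses copy x in
           and copy y both ways */
        #pragma omp target teams distribute parallel for \
                map(to: x[0:n]) map(tofrom: y[0:n])
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }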
Give an example of using an MPI reduction operator
Using the MPI_MIN operator in the MPI_Allreduce function you can calculate the minimum value across all processes. This can be used to agree on the timestep when solving PDEs: all processes need to use the same timestep, which should be the smallest timestep required by any process to maintain stability
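A minimal sketch (how each process computes its own limit dt_local is problem specific):

    #include <mpi.h>

    /* every process proposes the largest timestep it can tolerate;
       afterwards all hold the same, smallest, global dt */
    double agree_timestep(double dt_local)
    {
        double dt;
        MPI_Allreduce(&dt_local, &dt, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);
        return dt;
    }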
How should you manage the number of OpenMP threads
Using the OMP_NUM_THREADS environment variable. Calls to omp_set_num_threads in the program override this, but that is not good practice: your program should be flexible to however many threads are available to it. If old programs had hard-coded the number of threads, we wouldn't be able to easily run them now with the far larger numbers of threads available to us.
How can you set the loop schedule without recompiling
Using the OMP_SCHEDULE environment variable: if the schedule is set to runtime, this variable dictates the loop schedule
What security is the server responsible for
Verifying the identity of the principal (user or process) behind each operation (authentication), and checking that they have sufficient access rights to perform the requested operation, rejecting those who do not (authorisation)
How do GPUs connect
Via PCI Express, an industry standard connection which works with any machine with enough power, cooling and physical space. Nvidia's NVLink provides better connectivity than PCI-e
What are the advantages of MPI
We can use distributed memory systems
How can we use the C pre-processor to compile a program without OpenMP directives
We can use the #ifdef directive to carry out conditional compilation
How do we avoid data races
We ensure that updates to a variable are only carried out by one thread at a time, known as atomic updates; a sketch of an implementation is below.
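A minimal sketch (the counting task is invented for illustration):

    #include <omp.h>

    int count_positive(const double *a, int n)
    {
        int count = 0;
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            if (a[i] > 0.0) {
                /* atomic makes the read-modify-write indivisible,
                   so no increments are lost */
                #pragma omp atomic
                count++;
            }
        }
        return count;
    }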
How can we use finite difference stencils to calculate second order derivatives
We need 3 points for each finite difference stencil rather than just 2; for example, the standard central stencil is d2v/dx2 ≈ (v[i+1] - 2·v[i] + v[i-1]) / dx^2
How do we calculate spatial derivatives numerically
We store the values of the variables (e.g. density, velocity, temperature) at discrete points in space. We assume a regular grid of points, which can be implemented as an array, and calculate values for the spatial derivatives using the differences between grid points
What is Gustafson's Law
When given a more powerful processor, the problem generally expands to make use of the increased facilities. With serial fraction s and parallel fraction p, the scaled speedup on N processors is s + N·p
When should you use guided scheduling
When iterations are poorly balanced between each other, so one iteration may involve far more work than another. It is very effective if there is poor load balancing towards the end of the computation
When would you not want to start a parallel region and a parallel for loop at the same time
When you are going to carry out multiple for loops that all need to use the same threads. Instead you can put multiple #pragma omp for loops within one parallel region to reduce overheads, as in the sketch below.
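A minimal sketch (the loop bodies are invented for illustration):

    #include <omp.h>

    void two_loops(double *a, double *b, int n)
    {
        /* one parallel region reused for both loops: threads are
           created once instead of twice */
        #pragma omp parallel
        {
            #pragma omp for
            for (int i = 0; i < n; i++)
                a[i] = 2.0 * i;

            /* the implicit barrier above ensures a[] is complete here */
            #pragma omp for
            for (int i = 0; i < n; i++)
                b[i] = a[i] + 1.0;
        }
    }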
What is Amdahl's Law
The speedup on N processors is S(N) = 1/(s + p/N), where the serial fraction s and parallel fraction p (s + p = 1) are the portions of the program that are serial/parallel
Do MPI processes have their own memory address space
Yes, unlike OpenMP where the address space is always shared.
What is High Performance Linpack
You are permitted to change the size of the matrices so that floating point performance is close to the peak value. HPL performance (Rmax) is usually a significant fraction of the theoretical maximum performance (Rpeak)
With domain decomposition how would you program finite difference stencils
You can communicate information about other regions on demand, or store a halo (a copy of the rows/columns held by other MPI processes). At the end of each iteration each process exchanges the new values it calculated (the halos) with the processes that need them.
What are some benefits of OpenMP
You can compile without it to recover serial functionality You can incrementally parallelise existing code
What can you see from the output of ITAC
You can see what functions are taking up the most time and where the program is waiting the longest
What is the idea of time-stepping for approximating the solution to a differential equation with a time derivative
Starting from a specified initial condition, we repeatedly advance the solution by a small time step, e.g. v(t + dt) ≈ v(t) + dt·dv/dt
What are boundary conditions
You need to specify what happens to a stencil at the edge of the domain, where there is no neighbouring value. This choice can have a large impact on the outcome, and should be determined based on the physical situation being simulated
What is the OpenMP Fork-Join Model
You start with a single master thread, which forks worker threads in a parallel region. Threads join at the end of the parallel region and execution becomes serial again. This only works when processors share memory, and there is an overhead to starting/ending parallel regions
What is a compute bound application
application limited by the rate at which the processors can carry out arithmetic operations
How do we store data in running programs
as data structures
How is information in messages stored
as sequences of bytes
What is a big use for linear algebra workloads
benchmarking
How do we reduce false cache sharing
by using a large chunk size in OpenMP loop scheduling so there is less overlap in the data being accessed by processors
What do you need to run CUDA programs
compiler support, such as the nvcc compiler from Nvidia for C
How do you use copyin for thread private variables
copyin causes these variables to be initialised from their value in the master thread, rather than being undefined on entry to the first parallel region
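A minimal sketch:

    #include <omp.h>

    int counter = 0;                    /* one global variable... */
    #pragma omp threadprivate(counter)  /* ...becomes one copy per thread */

    void start(void)
    {
        counter = 42;  /* set in the master thread */

        /* copyin initialises every thread's copy from the master's value */
        #pragma omp parallel copyin(counter)
        {
            counter += omp_get_thread_num();  /* each thread updates its own copy */
        }
    }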
How do you carry out variable scoping within an OpenMP parallel region
default(none) ensures that default scoping rules aren't applied to any variables, so every variable must be scoped explicitly. Multiple variables can be specified for each type, e.g. private(i,j). A sketch is given below.
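A minimal sketch (the loop body is invented for illustration):

    #include <omp.h>

    void fill(double *a, int n)
    {
        int i, j;
        /* with default(none) every variable must be scoped explicitly
           (the loop index i is automatically private); forgetting one
           is a compile error rather than a silent data race */
        #pragma omp parallel for default(none) private(j) shared(a, n)
        for (i = 0; i < n; i++)
            for (j = 0; j < 3; j++)
                a[i] += j;
    }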
In what environments can you use MPI
distributed memory and shared memory environments
How has supercomputer performance been increasing over time
exponentially
How does the registry map remote objects
from textual URL-style names to references to remote objects on the server
What is an rmi security manager
it checks all operations and ensures they are permitted
What is a kernel in GPU programming
it is a computational function applied to each element
What is a remote object reference
it is an identifier for a remote object that is valid throughout a distributed system. Must be unique in space and time.
What is the spatial resolution when calculating the finite difference
it is the size of the cells on the grid, the smaller the cells the greater the accuracy
What are the disadvantages of MPI
more design work is required as we have to distribute the work ourselves. It is less suited to incremental parallelism - it should be designed in from the start. You may need to rethink the algorithm: the best serial algorithm may not be the best parallel algorithm.
What is NUMA
non-uniform memory access: accessing some memory locations is slower than others, for example accessing memory attached to another processor
What are remote objects
objects that can receive remote invocations
What is a spatial gradient
rate of change of some property in space
How can we achieve security in a DS
securing the processes and channels used in their interactions Protecting the objects that they encapsulate against unauthorised access
How are MPI_Send and MPI_Recv blocking communications
Send blocks until the message is received by the destination process, or until it is safe to change the data in the send buffer. Recv blocks until it receives a message. If the send/receive operations are not correctly matched there will be a deadlock
How have we continued to improve performance after the breakdown of Moore's law and Dennard scaling
The size of many components has continued to decrease, so we fit more components on a chip: we now get more cores on a processor rather than better performance per core. Improvements in single-core performance have been relatively modest, so improving performance falls largely to the programmer, who must use all processor cores effectively
What is the Top 500 and how is it calculated
The top 500 supercomputers, ranked according to the Linpack benchmark, which measures floating point operations per second on a dense linear algebra workload
Give some properties of traditional HPC systems
tightly coupled (all parts need to function together and talk to each other), low latency interconnect, parallel file systems, designed to run high performance, highly scalable custom applications.
How do you find an IP address in java
use InetAddress (e.g. InetAddress.getByName); you need to import java.net.* and handle UnknownHostException
How can you time an OpenMP program
use the time command before running the executable, e.g. time ./cprogram. This outputs multiple time measurements; the real time is how long the program took to run, while sys and user relate to the actual CPU time. You can also use omp_get_wtime() to create timestamps at different points in the program, then compare a later value to the initial time; this can be used to time specific sections, as in the sketch below.
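A minimal sketch of the omp_get_wtime approach:

    #include <omp.h>
    #include <stdio.h>

    void timed(void (*work)(void))
    {
        double t0 = omp_get_wtime();   /* timestamp before the section */
        work();
        double t1 = omp_get_wtime();   /* timestamp after it */
        printf("section took %f seconds\n", t1 - t0);
    }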
How do you access the registry within java
using the Naming class