OpenMP

With "#pragma omp parallel" you declare the beginning of a part that should be parallelized.

To compile such source code, use the appropriate option for your compiler:

 -openmp
Intel compiler
 gcc -fopenmp -o openmp omp.c
GCC compiler (since version 4.2)
 #pragma omp parallel
 {
 }
tells OpenMP that the enclosed block should be executed in parallel
 #pragma omp parallel
 {
 #pragma omp for
for(...)
 }

or in short form:

 #pragma omp parallel for
   for(...)
The iterations of the for loop are divided among the parallel threads. For example, if the loop has 100,000 iterations and we have 5 threads, the first thread takes iterations 1 to 20,000, the second 20,001 to 40,000, and so on.
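
A minimal runnable sketch of such a loop (the array and the work done per iteration are only illustrative); compile with gcc -fopenmp:

 #include <stdio.h>
 #include <omp.h>

 int main(void)
 {
     int i;
     static double a[100000];

     /* the 100,000 iterations are divided among the available threads */
     #pragma omp parallel for
     for (i = 0; i < 100000; i++)
         a[i] = i * 2.0;

     printf("a[99999] = %f\n", a[99999]);
     return 0;
 }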

Variables

fall into two groups:

  1. Private (each thread gets its own copy of the variable and no other thread can write to it)
  2. Shared (all threads have simultaneous access to these variables)
 #pragma omp parallel for private(i,j,workingprogress) shared(sum,overallprogress)
 for(...)
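
As a runnable sketch (the variable names are only illustrative): each thread works on its own copy of the private variable, while the shared variable is updated by all threads and therefore needs protection (see the next section).

 #include <stdio.h>
 #include <omp.h>

 int main(void)
 {
     int i;
     double tmp = 0.0;   /* private: every thread works on its own copy  */
     double sum = 0.0;   /* shared: all threads update the same variable */

     #pragma omp parallel for private(tmp) shared(sum)
     for (i = 0; i < 1000; i++) {
         tmp = i * 0.5;
         #pragma omp atomic   /* the shared variable must be updated atomically */
         sum += tmp;
     }

     printf("sum = %f\n", sum);
     return 0;
 }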

critical sections

to prevent race conditions.

 #pragma omp atomic
   critVar += 1;
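
The atomic directive protects a single simple update. For a larger block of statements a critical section can be used instead; a small runnable sketch showing both:

 #include <stdio.h>
 #include <omp.h>

 int main(void)
 {
     int critVar = 0;

     #pragma omp parallel
     {
         #pragma omp atomic      /* protects one simple update */
         critVar += 1;

         #pragma omp critical    /* protects an arbitrary block of statements */
         {
             printf("thread %d saw critVar = %d\n",
                    omp_get_thread_num(), critVar);
         }
     }
     return 0;
 }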

reduction

to combine the data that the individual threads have processed.

 #pragma omp parallel for reduction(operator:variable)
At the end of the for loop, OpenMP combines the threads' copies of the variable using the given operator. Each copy is initialized to the operator's neutral value (+: 0, *: 1).
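
A small runnable sketch (summing 1..100, so the result should be 5050):

 #include <stdio.h>
 #include <omp.h>

 int main(void)
 {
     int i;
     double sum = 0.0;   /* each thread gets a private copy initialized to 0 */

     #pragma omp parallel for reduction(+:sum)
     for (i = 1; i <= 100; i++)
         sum += i;

     printf("sum = %f\n", sum);   /* 5050.000000 */
     return 0;
 }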

chunks

to divide the loop iterations into chunks that are handed out to the threads: schedule(dynamic, CHUNK)
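
A sketch with a dynamic schedule (the CHUNK size and the loop body are only illustrative): a thread grabs the next CHUNK iterations as soon as it finishes its current chunk.

 #include <stdio.h>
 #include <omp.h>

 #define CHUNK 10

 int main(void)
 {
     int i;
     static double a[1000];

     #pragma omp parallel for schedule(dynamic, CHUNK)
     for (i = 0; i < 1000; i++)
         a[i] = i * i;

     printf("a[999] = %f\n", a[999]);
     return 0;
 }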

single, master

 #pragma omp single
only one (arbitrary) thread runs through the block
 #pragma omp master
only the master thread (thread 0) runs through the block
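
A small runnable sketch showing both directives inside one parallel region:

 #include <stdio.h>
 #include <omp.h>

 int main(void)
 {
     #pragma omp parallel
     {
         #pragma omp single
         printf("printed by exactly one (arbitrary) thread\n");

         #pragma omp master
         printf("printed only by thread 0\n");
     }
     return 0;
 }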

barriers

to set a point in the code that all parallel threads must reach before any of them continues.

 #pragma omp barrier
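
A small runnable sketch: no thread prints the second line before all threads have printed the first one.

 #include <stdio.h>
 #include <omp.h>

 int main(void)
 {
     #pragma omp parallel
     {
         printf("thread %d: before the barrier\n", omp_get_thread_num());

         #pragma omp barrier

         printf("thread %d: after the barrier\n", omp_get_thread_num());
     }
     return 0;
 }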

MPI (Message Passing Interface)

Intel offers products for MPI development, including an MPI library.

Debuggers for parallel applications include TotalView, DDT and IDB.

To analyze MPI events and communication, use the Intel Trace Analyzer and Collector.

Starting an executable with TCP/IP communication over Ethernet:

 mpiexec -n 2 -env I_MPI_DEVICE sock a.out

Starting the same application over InfiniBand (a high-performance network):

 mpiexec -n 2 -env I_MPI_DEVICE rdma a.out

The shm device is for shared memory. Systems with multiple cores use the ssm device.

Compiling MPI Programs:

 mpicc -o prg prg.c

Running MPI Programs

 mpirun -np 1 prg
With the -np flag you indicate the number of processes to create.

You can run this command on a single machine, but when I tried 200 processes, MPI_Init crashed.

To run on a cluster, use the machinefile parameter:

 mpirun -n $(number of processors) -machinefile $(file with host names) program
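
A minimal program to test the compile and run commands above (just a sketch that prints each process's rank):

 #include <stdio.h>
 #include <mpi.h>

 int main(int argc, char *argv[])
 {
     int rank, size;

     MPI_Init(&argc, &argv);
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
     MPI_Comm_size(MPI_COMM_WORLD, &size);

     printf("process %d of %d\n", rank, size);

     MPI_Finalize();
     return 0;
 }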

 int MPI_Bcast (
    void *buffer,              address of first broadcast element
    int count,                 number of elements
    MPI_Datatype datatype,     type of elements to broadcast
    int root,                  ID of process doing broadcast
    MPI_Comm comm              communicator, i.e. the group of receiving processes
    )

 int MPI_Send(
    void         *message,     starting address of the data to be transmitted
    int           count,       number of data items
    MPI_Datatype  datatype,    type of the data
    int           dest,        rank or id of the receiver process
    int           tag,         label of the message for different purposes
    MPI_Comm      comm         communicator in which this message is being sent
  )

 int MPI_Recv(
    void         *message,     starting address of the buffer where the received data is to be stored
    int           count,       maximum number of data items the process can receive,
                               bounded by the size of the buffer
    MPI_Datatype  datatype,    type of the data
    int           source,      rank or id of the sender process
    int           tag,         desired label of the message
    MPI_Comm      comm         communicator in which this message is being passed
    MPI_Status   *status       pointer to an MPI_Status data structure
  )

Function MPI_Recv blocks until the message has been received or until an error condition causes the function to return. After it returns, the status record contains information about the just-completed call.

To receive a message from any sender, use MPI_ANY_SOURCE as the source argument of MPI_Recv; to accept any message tag, use MPI_ANY_TAG.
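
A small sketch of a send/receive pair using MPI_ANY_SOURCE and MPI_ANY_TAG (run with at least 2 processes, e.g. mpirun -np 2):

 #include <stdio.h>
 #include <mpi.h>

 int main(int argc, char *argv[])
 {
     int rank, value;
     MPI_Status status;

     MPI_Init(&argc, &argv);
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);

     if (rank == 0) {
         value = 42;
         MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* to rank 1, tag 0 */
     } else if (rank == 1) {
         MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                  MPI_COMM_WORLD, &status);
         printf("rank 1 received %d from rank %d (tag %d)\n",
                value, status.MPI_SOURCE, status.MPI_TAG);
     }

     MPI_Finalize();
     return 0;
 }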

Function MPI_Scatterv

enables a single root process to distribute a contiguous group of elements to all of the processes in a communicator, including itself. It is a collective communication function, so all of the processes in a communicator participate in its execution. The function requires that each process has previously initialized two arrays: one that indicates the number of elements the root process should send to each process, and one that indicates the displacement of each block of elements in the array being scattered. In this case we want to scatter the blocks in process order: process 0 gets the first block, process 1 gets the second block, and so on.
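
Its prototype, in the same style as the functions above (the parameter comments are rough descriptions, not the standard's wording):

 int MPI_Scatterv (
    void         *sendbuf,     starting address of the array being scattered (significant only at root)
    int          *sendcounts,  number of elements to send to each process
    int          *displs,      displacement in sendbuf of each process' block
    MPI_Datatype  sendtype,    type of the elements being sent
    void         *recvbuf,     starting address of the receive buffer
    int           recvcount,   number of elements this process receives
    MPI_Datatype  recvtype,    type of the elements being received
    int           root,        ID of the process doing the scatter
    MPI_Comm      comm         communicator
    )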

Function MPI_Gatherv

allows a single MPI process to gather together data elements stored on all processes in a communicator. If every process is contributing the same number of data elements, the simpler function MPI_Gather is appropriate.
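
Its prototype, in the same style (the parameter comments are rough descriptions):

 int MPI_Gatherv (
    void         *sendbuf,     starting address of the data this process contributes
    int           sendcount,   number of elements this process sends
    MPI_Datatype  sendtype,    type of the elements being sent
    void         *recvbuf,     starting address of the receive buffer (significant only at root)
    int          *recvcounts,  number of elements to receive from each process
    int          *displs,      displacement in recvbuf of each process' block
    MPI_Datatype  recvtype,    type of the elements being received
    int           root,        ID of the process doing the gather
    MPI_Comm      comm         communicator
    )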

Function MPI_Alltoallv

allows every MPI process to gather data items from all the processes in the communicator. The simpler function MPI_Alltoall should be used when all of the groups of data items being transferred from one process to another have the same number of elements.
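
Its prototype, in the same style (the parameter comments are rough descriptions):

 int MPI_Alltoallv (
    void         *sendbuf,     starting address of the data to send
    int          *sendcounts,  number of elements to send to each process
    int          *sdispls,     displacement in sendbuf of the block for each process
    MPI_Datatype  sendtype,    type of the elements being sent
    void         *recvbuf,     starting address of the receive buffer
    int          *recvcounts,  number of elements to receive from each process
    int          *rdispls,     displacement in recvbuf of the block from each process
    MPI_Datatype  recvtype,    type of the elements being received
    MPI_Comm      comm         communicator
    )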

Example Matrix-vector Multiplication

Phases of the parallel matrix-vector multiplication algorithm based on a checkerboard block decomposition of the matrix elements: First, vector b is distributed among the tasks. Second, each task performs a matrix-vector multiplication on its block of matrix A and its portion of vector b. Third, each row of tasks performs a sum-reduction of the result vectors, creating vector c.

Redistributing vector b

The algorithm is simpler when the process grid is square. Processes in the first column send their blocks of b to the processes in the first row. Then each process in the first row broadcasts its block of b to the other processes in its column.

When the process grid is not square, first the processes in the first column gather vector b onto the process at grid position (0,0). Next, process (0,0) scatters b to the processes in the first row. Finally, each process in the first row broadcasts its block of b to the other processes in its column.

Creating a Communicator

The default communicator is MPI_COMM_WORLD which is the set of all processes executing the MPI program.

function MPI_Dims_create()

to create a virtual mesh of processes that is as close to square as possible, which results in an algorithm with maximum scalability. Given the total number of nodes desired for a Cartesian grid and the number of grid dimensions, the function returns an array of integers specifying the number of nodes in each dimension of the grid, so that the sizes of the dimensions are as balanced as possible.
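
Its prototype, in the same style (the parameter comments are rough descriptions):

 int MPI_Dims_create (
    int           nnodes,      total number of nodes in the grid
    int           ndims,       number of grid dimensions
    int          *dims         in/out array with the size of each dimension;
                               entries set to 0 are filled in by the function
    )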

function MPI_Cart_create

creates a communicator with a Cartesian topology. The output parameter cart_comm returns the newly created Cartesian communicator.
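
Its prototype, in the same style (the parameter comments are rough descriptions):

 int MPI_Cart_create (
    MPI_Comm      comm_old,    existing communicator, e.g. MPI_COMM_WORLD
    int           ndims,       number of grid dimensions
    int          *dims,        size of each dimension
    int          *periods,     per dimension: 1 for wraparound connections, 0 otherwise
    int           reorder,     1 if process ranks may be reordered
    MPI_Comm     *cart_comm    address where the new Cartesian communicator is returned
    )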

function MPI_Cart_rank

In order to send a matrix row to the first process in the appropriate row of the process grid, process 0 needs to know that process's rank. This function, when passed the coordinates of a process in the grid, returns its rank.
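
Its prototype, in the same style (the parameter comments are rough descriptions):

 int MPI_Cart_rank (
    MPI_Comm      comm,        Cartesian communicator
    int          *coords,      coordinates of a process in the grid
    int          *rank         address where the rank of that process is returned
    )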

function MPI_Cart_coords

returns the coordinates of a process in the grid, which are of course different from its process ID (rank).
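
Its prototype, in the same style (the parameter comments are rough descriptions):

 int MPI_Cart_coords (
    MPI_Comm      comm,        Cartesian communicator
    int           rank,        rank of the process
    int           maxdims,     length of the coords array
    int          *coords       address of the array where the coordinates are returned
    )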

function MPI_Comm_split

In order to scatter an input row among only the processes in a single row of the process grid, we must divide the Cartesian communicator into separate communicators for every row in the process grid. The collective function MPI_Comm_split partitions the processes in an existing communicator into one or more subgroups and constructs a communicator for each of these new subgroups.
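
Its prototype, in the same style (the parameter comments are rough descriptions); to build row communicators, the row coordinate can be used as the color:

 int MPI_Comm_split (
    MPI_Comm      comm,        existing communicator
    int           color,       processes passing the same color end up in the same new communicator
    int           key,         controls the rank ordering inside the new communicator
    MPI_Comm     *newcomm      address where the new communicator is returned
    )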
