With "#pragma omp parallel" you declare the beginning of a part that should be parallelized.
To compile such source code, use the appropriate option for your compiler: -openmp (Intel compiler) or -fopenmp (gcc), for example:
gcc -fopenmp -o openmp omp.c
#pragma omp parallel
{
    /* code in this block is executed by every thread */
}
#pragma omp parallel
{
    #pragma omp for
    for(...)
}
or in short form:
#pragma omp parallel for
for(...)
Variables used in a parallel region are defined in two groups, private and shared:
#pragma omp parallel for private(i,j,workingprogress) shared(sum,overallprogress)
for(...)
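A minimal sketch of such a loop; the array and variable names are only illustrative, they are not taken from the notes above:
#include <stdio.h>

#define N 8

int main(void)
{
    double a[N];   /* shared: all threads write disjoint elements */
    double tmp;    /* private: each thread needs its own scratch value */
    int i;         /* the loop counter is private as well */

    #pragma omp parallel for private(i, tmp) shared(a)
    for (i = 0; i < N; i++) {
        tmp = i * 0.5;
        a[i] = tmp * tmp;
    }

    for (i = 0; i < N; i++)
        printf("a[%d] = %f\n", i, a[i]);
    return 0;
}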
to prevent race conditions.
#pragma omp atomic
critVar += 1;
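A short sketch of how the atomic directive is used; the loop bound is made up for illustration:
#include <stdio.h>

int main(void)
{
    int critVar = 0;   /* shared counter updated by all threads */
    int i;

    #pragma omp parallel for
    for (i = 0; i < 100000; i++) {
        #pragma omp atomic   /* without this, the update would be a race condition */
        critVar += 1;
    }

    printf("critVar = %d\n", critVar);   /* prints 100000 */
    return 0;
}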
to bring together the data the threads have processed.
#pragma omp parallel for reduction(operator:variable)
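A sketch with + as the operator and sum as the variable; the computed series is just an example:
#include <stdio.h>

int main(void)
{
    double sum = 0.0;
    int i;

    /* each thread accumulates into a private copy of sum; the copies
       are combined with + when the loop finishes */
    #pragma omp parallel for reduction(+:sum)
    for (i = 1; i <= 100; i++)
        sum += 1.0 / i;

    printf("sum = %f\n", sum);
    return 0;
}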
to divide the loop into distinct parts, use the schedule clause: schedule(dynamic,CHUNK)
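A sketch of dynamic scheduling; the CHUNK value and the per-iteration work are assumptions, not taken from the notes:
#include <stdio.h>
#include <math.h>

#define CHUNK 10   /* assumed chunk size; the notes do not fix a value */
#define N 1000

int main(void)
{
    double result[N];
    int i;

    /* idle threads grab the next CHUNK iterations, which balances the
       load when iterations take unequal amounts of time */
    #pragma omp parallel for schedule(dynamic, CHUNK)
    for (i = 0; i < N; i++)
        result[i] = sin((double)i);

    printf("result[N-1] = %f\n", result[N - 1]);
    return 0;
}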
#pragma omp single
#pragma omp master
to set a point in the code that all parallel threads must reach before any of them continues.
#pragma omp barrier
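A sketch that combines single, master and barrier in one parallel region; the printed messages are illustrative:
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        int id = omp_get_thread_num();

        #pragma omp single
        printf("printed by exactly one (arbitrary) thread\n");
        /* single has an implicit barrier at its end */

        #pragma omp master
        printf("printed only by the master thread (id 0)\n");

        /* master has no implicit barrier, so synchronize explicitly */
        #pragma omp barrier

        printf("thread %d continues after the barrier\n", id);
    }
    return 0;
}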
Intel offers products for MPI development, including an MPI library.
Debuggers for parallel applications include TotalView, DDT, and IDB.
To analyse MPI events and communication, use the Intel Trace Analyzer and Collector.
Starting an executable with TCP/IP communication over Ethernet:
mpiexec -n 2 -env I_MPI_DEVICE sock a.out
Starting the same app on InfiniBand (a high-performance network):
mpiexec -n 2 -env I_MPI_DEVICE rdma a.out
The shm device is for shared memory. Systems with multiple cores use the ssm device.
Compiling MPI Programs:
mpicc -o prg prg.c
Running MPI Programs:
mpirun -np 1 prg
You can run this command on a single machine, but when I tried 200 processes, MPI_Init crashed.
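A minimal sketch of a program that could be compiled with mpicc and started with mpirun as above; it is just the usual init/rank/size/finalize skeleton, nothing specific to these notes:
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                  /* must come before any other MPI call */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* id of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* number of processes given with -np */

    printf("process %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}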
To run on a cluster, use the machinefile parameter:
mpirun -n $(number of processors) -machinefile $(file with machine names) program

int MPI_Bcast(
    void *buffer,           /* address of first broadcast element */
    int count,              /* number of elements */
    MPI_Datatype datatype,  /* type of elements to broadcast */
    int root,               /* ID of the process doing the broadcast */
    MPI_Comm comm           /* communicator, i.e. the group of receivers */
)

int MPI_Send(
    void *message,          /* starting address of the data to be transmitted */
    int count,              /* number of data items */
    MPI_Datatype datatype,  /* type of the data */
    int dest,               /* rank (id) of the receiver process */
    int tag,                /* label of the message for different purposes */
    MPI_Comm comm           /* communicator in which this message is being sent */
)

int MPI_Recv(
    void *message,          /* starting address of the buffer where the received data is stored */
    int count,              /* number of data items the process can receive at a time, bounded by the size of the buffer */
    MPI_Datatype datatype,  /* type of the data */
    int source,             /* rank (id) of the sender process */
    int tag,                /* desired label of the message */
    MPI_Comm comm,          /* communicator in which this message is being passed */
    MPI_Status *status      /* pointer to an MPI_Status data structure */
)
Function MPI_Recv blocks until the message has been received or until an error condition causes the function to return. After it returns, the status record contains information about the just-completed call, e.g. the fields MPI_SOURCE, MPI_TAG, and MPI_ERROR.
To receive a message from any sender, use MPI_ANY_SOURCE as the source argument of MPI_Recv; to accept any message tag, use MPI_ANY_TAG.
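A sketch combining MPI_Bcast, MPI_Send and MPI_Recv; it assumes at least two processes (e.g. mpirun -np 2), and the tag 99 is arbitrary:
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, value = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* the root broadcasts one int to every process, including itself */
    if (rank == 0)
        value = 42;
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* point-to-point: process 1 sends a result back to process 0 */
    if (rank == 1) {
        int result = value * 2;
        MPI_Send(&result, 1, MPI_INT, 0, 99, MPI_COMM_WORLD);
    } else if (rank == 0) {
        int result;
        /* MPI_ANY_SOURCE / MPI_ANY_TAG would also be accepted here */
        MPI_Recv(&result, 1, MPI_INT, 1, 99, MPI_COMM_WORLD, &status);
        printf("received %d from process %d\n", result, status.MPI_SOURCE);
    }

    MPI_Finalize();
    return 0;
}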
MPI_Scatterv enables a single root process to distribute a contiguous group of elements to all of the processes in a communicator, including itself. It is a collective communication function, and all of the processes in a communicator participate in its execution. The function requires that each process has previously initialized two arrays: one that indicates the number of elements the root process should send to each of the other processes, and one that indicates the displacement of each block of elements in the array being scattered. In this case we want to scatter the blocks in process order: process 0 gets the first block, process 1 gets the second block, and so on.
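A sketch of MPI_Scatterv with the two arrays described above; the block sizes (process p gets p+1 elements) are only an illustration:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* every process initializes the two arrays: how many elements each
       process gets and where that block starts in the scattered array */
    int *sendcounts = malloc(size * sizeof(int));
    int *displs     = malloc(size * sizeof(int));
    int total = 0;
    for (int p = 0; p < size; p++) {
        sendcounts[p] = p + 1;     /* illustrative: process p gets p+1 elements */
        displs[p]     = total;     /* blocks laid out in process order */
        total        += sendcounts[p];
    }

    double *sendbuf = NULL;
    if (rank == 0) {               /* only the root needs the full array */
        sendbuf = malloc(total * sizeof(double));
        for (int i = 0; i < total; i++)
            sendbuf[i] = i;
    }

    double *recvbuf = malloc(sendcounts[rank] * sizeof(double));
    MPI_Scatterv(sendbuf, sendcounts, displs, MPI_DOUBLE,
                 recvbuf, sendcounts[rank], MPI_DOUBLE,
                 0, MPI_COMM_WORLD);

    printf("process %d received %d elements\n", rank, sendcounts[rank]);

    MPI_Finalize();
    return 0;
}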
MPI_Gatherv allows a single MPI process to gather together data elements stored on all processes in a communicator. If every process contributes the same number of data elements, the simpler function MPI_Gather is appropriate.
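A sketch of MPI_Gatherv going the other way; again the per-process counts (rank + 1 elements) are made up:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* each process contributes a different number of items */
    int mycount = rank + 1;
    double *mydata = malloc(mycount * sizeof(double));
    for (int i = 0; i < mycount; i++)
        mydata[i] = rank;

    /* the root must know how much each process sends and where to store it */
    int *recvcounts = malloc(size * sizeof(int));
    int *displs     = malloc(size * sizeof(int));
    int total = 0;
    for (int p = 0; p < size; p++) {
        recvcounts[p] = p + 1;
        displs[p]     = total;
        total        += recvcounts[p];
    }

    double *recvbuf = (rank == 0) ? malloc(total * sizeof(double)) : NULL;

    MPI_Gatherv(mydata, mycount, MPI_DOUBLE,
                recvbuf, recvcounts, displs, MPI_DOUBLE,
                0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("gathered %d elements on process 0\n", total);

    MPI_Finalize();
    return 0;
}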
MPI_Alltoallv allows every MPI process to gather data items from all the processes in the communicator. The simpler function MPI_Alltoall should be used when all of the groups of data items being transferred from one process to another have the same number of elements.
Phases of the parallel matrix-vector multiplication algorithm based on a checkerboard block decomposition of the matrix elements: first, vector b is distributed among the tasks. Second, each task performs matrix-vector multiplication on its block of matrix A and its portion of vector b. Third, each row of tasks performs a sum-reduction of the result vectors, creating vector c.
The algorithm is simpler when the process grid is square: processes in the first column send their blocks of b to the processes in the first row, then each process in the first row broadcasts its block of b to the other processes in its column.
When the process grid is not square, first the processes in the first column gather vector b onto the process at grid position (0,0). Next, process (0,0) scatters b to the processes in the first row. Finally, each process in the first row broadcasts its block of b to the other processes in its column.
The default communicator is MPI_COMM_WORLD which is the set of all processes executing the MPI program.
MPI_Dims_create is used to create a virtual mesh of processes that is as close to square as possible, which results in an algorithm with maximum scalability. For example, if you pass the total number of nodes desired for a Cartesian grid and the number of grid dimensions, the function returns an array of integers specifying the number of nodes in each dimension of the grid, so that the sizes of the dimensions are as balanced as possible.
MPI_Cart_create creates a communicator with a Cartesian topology. The output parameter cart_comm returns the address of the newly created Cartesian communicator.
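A sketch that uses MPI_Dims_create and MPI_Cart_create together; the 2-D, non-periodic grid and the reorder flag are assumptions for the example:
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int size, rank;
    int dims[2] = {0, 0};       /* 0 means: let MPI choose this dimension */
    int periods[2] = {0, 0};    /* non-periodic (no wraparound) grid */
    MPI_Comm cart_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* pick grid dimensions that are as close to square as possible */
    MPI_Dims_create(size, 2, dims);

    /* build a communicator with a 2-D Cartesian topology;
       the last argument (reorder = 1) lets MPI renumber the processes */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart_comm);

    MPI_Comm_rank(cart_comm, &rank);
    if (rank == 0)
        printf("process grid: %d x %d\n", dims[0], dims[1]);

    MPI_Finalize();
    return 0;
}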
In order to send a matrix row to the first process in the appropriate row of the process grid, process 0 needs to know that process's rank. MPI_Cart_rank, when passed the coordinates of a process in the grid, returns its rank.
MPI_Cart_coords returns the coordinates of a process in the grid, which are of course different from the process ranks.
In order to scatter an input row among only the processes in a single row of the process grid, we must divide the Cartesian communicator into separate communicators for every row in the process grid. Collective function MPI_Comm_split partitions the processes in an existing communicator into one or more subgroups and constructs a communicator for each of these new subgroups.
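A sketch that continues from the Cartesian communicator above: it looks up ranks and coordinates with MPI_Cart_rank and MPI_Cart_coords, then splits the grid into one communicator per row with MPI_Comm_split, using the row coordinate as the color:
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int size, grid_rank;
    int dims[2] = {0, 0}, periods[2] = {0, 0}, coords[2];
    MPI_Comm cart_comm, row_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Dims_create(size, 2, dims);
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart_comm);
    MPI_Comm_rank(cart_comm, &grid_rank);

    /* rank -> coordinates, and coordinates -> rank */
    MPI_Cart_coords(cart_comm, grid_rank, 2, coords);
    int first_in_row_coords[2] = { coords[0], 0 };
    int first_in_row_rank;
    MPI_Cart_rank(cart_comm, first_in_row_coords, &first_in_row_rank);

    /* one communicator per grid row: processes with the same color
       (their row coordinate) end up in the same new communicator */
    MPI_Comm_split(cart_comm, coords[0], coords[1], &row_comm);

    int row_rank;
    MPI_Comm_rank(row_comm, &row_rank);
    printf("grid rank %d = (%d,%d), first in row = rank %d, row rank = %d\n",
           grid_rank, coords[0], coords[1], first_in_row_rank, row_rank);

    MPI_Finalize();
    return 0;
}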