Examples of parallel computing using open-mp and MPI

 

Dilip Angom

Scientist

Physical Research Laboratory,

Navarangpura, Ahmedabad.

 

 

Plan of the lecture

  1. Parallelized laser pulse propagation code.
  2. Coupled partial differential equations.

    Data dependent loops.

    Large sized arrays.

    (Tarak Nath Dey)

  3. Parallel execution of atmospheric radiative transfer code.
  4. A big code however short execution time.

    Several executions required.

    Numerous subroutines.

    (Harish Gadhavi)

  5. Parallelizing relativistic atomic structure code.

Several modules, each having unique computational requirements.

Large number of coupled itegro-differential equations.

Large sized matrices.




Pulse propagation in an atomic medium

Propagation of a laser pulse through a homogenous atomic medium can be modeled as a one-dimensional problem, where atoms are placed periodically. The ingredients of the problem are:

Model atomic system ( system)

 

 

Geometry of the system

The study is to investigate the possibility of storing the pulse in the medium and reproducing without distortion.

 

Equations of the system

Density matrix equations describe the evolution of the atomic states.

 

The diagonal density matrices are population densities and the offdiagonal are the coherence between atomic states.

Maxwell equations describe the evolution of the laser field within the atomic medium

Together these two sets of coupled differential equations define the matter and field evolution.

 

Numerical solution

Initial conditions of the problem are:

All the atoms are in ground state .

Input pulse at Z = 0 is known for all time steps.

Using these conditions the equations are integrated. The scheme of the integration could be along time-axis for each space points or vice-versa. The schematic representation of the integration sequence is given below.

 

 

Typical number of time steps and spatial grid points are

~ 200,000. The integration involve data dependent loops, which are difficult to parallelize. An approach of parallelization is to block the calculation and synchronize the processors based on data availability: data driven blocking.

Method of integration is chosen according to the required accuracy.

Schematic of the CPU and memory management

Each of the CPUs enters the calculation when the required data is available ( data driven synchronization). Consider a four CPU system.

Parallelized algorithm

The parallelized algorithm has

  1. Parameters, variables and constants of the calculation.

  2. Engage the processors in steps ( similar to filling up a pipeline).

  3. Assign blocks of calculations to each processor.

Parallelize the inner loop of the last part of the calculation.

Fragment of the code

The parallelized fragment of the code is given below.

Radiation budget of Earth's atmosphere

Balance between the incoming and outgoing long-wave terrestrial radiation is very important to understand climate change. This require radiative transfer calculation of the Earth’s atomsphere.

The quantities of interest are:


Present study is to investigate the effect of aerosol in Antarctic like condition. The strategy is to simulate radiative flux at the top of the atmosphere and at surface level for specified amount of aerosol. Santa Barbara Discrete Ordinate Atmospheric Radiative Transfer (SBDART) software is used to calculate Radiative fluxes.

Once the parameters of the calculation are decided, the code is used as a black box.

Why parallelize SBDART

The execution time of one SBDART run

0.28 – 2.8 9 sec with 5 nm resolution.

2.80 – 100 90 sec with 20 nm resolution.

Calculation of the radiative forcing due to aerosol particles on a single day at a resolution of 15 minutes would require ~ 5 hours of computation.

Understanding the role of aerosols require several runs, and there are several parameters associated with the atmosphere and aerosols to vary. Two observations are to be made

Execution time of each run is short.

A large number of runs are required.

It is therefore desirable to parallelize the code and generate data for a wide range of parameters to gain a better understanding in terms of complementing the observed data.

Important information concerning SBDART are:

Complexity of the code and short execution time implies that the shortest route to parallelization is parallelizing the repeated execution. This is achieved using a control routine, which calls the SBDART as function or subroutine.

Parallelized algorithm

Representative algorithm of parallelizing the computational sequence is

Treating SBDART as subroutine/function of the program require

Careful initialization of the static arrays.

A detailed analysis of the constants/variables passed between the subroutines/functions.

Input and output through files require detailed book keeping to avoid simultaneous access by multiple processors.

It is achievable for a short and easily understandable code. But not when it is a code designed, developed and written by someone else.

A possible solution

Treat executable file of SBDART as a system call within the unix working environment. Then each SBDART call is independent of each other completely.

This solution worked very well.

 

Fragment of the code

The parallelized control function to execute SBDART is given below.

Relativistic atomic structure calculation

Dirac-Coulomb Hamiltonian is suitable for many atomic structure and properties calculations. In atomic units ()

Atomic state functions (ASF) or atomic wave-functions are calculated in the following steps:

1. Approximate the electron-electron coulomb interaction as a central field potential

and calculate the single electron wave-functions .

2. Calculate a set of Slater determinants or configuration state functions ( CSF) as representations of many-electron wave functions.

3. Calculate the ASFs or atomic wave functions treating the residual electron-electron coulomb interaction

Atomic wave functions satisfy the Schrodinger equation

where , and are the parity, total angular momentum and magnetic quantum numbers respectively, and is a quantum number to define each atomic wave function uniquely.

Computational requirements

Each step of the calculation has unique computational requirements and multi-configuration approximation of the central field potential has even higher requirements.

Single electron wave function calculation

Self-consistent calculation and require a large number of angular factors to generate the potentials. It is therefore operation intensive. Angular factors are calculated once and stored, which require large physical memory.

Slater determinant calculation

Large sized arrays to manage occupancy of shells.

Atomic wave function calculation

Configuration interaction (CI) is the simplest calculation to include electron-electron correlation effects, which is diagonalization of the residual Coulomb interaction matrix. It is operation and memory intesive.

Diagonalization of large matrices is a numerical calculation common to many problems in physics. Parallelization of matrix diagonalization is also equivalent to parallelizing CI calculation. Flow chart of CI calculation is

Grouping data for communication

Data can be grouped based on the context. For example, the variables that define a determinant or configuration state function are

Parity.

Total angular momentum.

Energy.

Sequence of angular momentum coupling.

Shell occupation.

A data type struct can be defined having these as elements. However MPI does not support data type struct.

MPI support several derived data type to group data.

MPI_Type_struct

MPI_Type_contiguous

MPI_Type_vector

MPI_Type_indexed

Among these MPI_Type_struct is suitable to group chunks of data which has same context and are frequently used.

Information required to build derived data type

Number of elements.

Data type of the member/elements.

Relative storage locations: displacement from the beginning of message.

Beginning of the message.

A MPI derived data type is then a sequence of pairs

,

where each is a basic MPI data type and each is a displacement in bytes. The sequence of the data type

is the type signature. Basic rule to check consistency of data type transmitted is to match the type signature.

Hamiltonian matrix storage

Calculation of Hamiltonian matrix elements is distributed across the processors in blocks of rows/columns. Only the upper/lower triangular matrix elements are required as

.

Diagrammatic representation of the block distribution


In this form of block distribution of matrix element generation, the CPU calculates


matrix elements. Since , . That is the first CPU calculates maximum number of matrix elements and the last CPU calculates the least. Load balancing is achieved by an appropriate choice of

Sparse matrix storage

The Hamiltonian matrix encountered in atomic structure calculations are in general sparse, which is a consequence of the selection rules. A schematic representation is

Information required to store each row/column of the Hamiltonian matrix

Quantum numbers of the ket/bra.

Number of non-zero elements.

Location of non-zero elements.

Value of the non-zero elements.

Only the non-zero Hamiltonian matrix elements are stored using three arrays, schematically



The sparse nature of the matrix implies that any form of block distribution provides near equal loads to all the processors.

Building MPI_Type_struct

 

Communicating MPI_Type_struct