Examples of parallel computing using open-mp and MPI
Dilip Angom
Scientist
Physical Research Laboratory,
Navarangpura, Ahmedabad.
Plan of the lecture
Coupled partial differential equations.
Data dependent loops.
Large sized arrays.
(
Tarak Nath Dey)A big code however short execution time.
Several executions required.
Numerous subroutines.
(
Harish Gadhavi)Several modules, each having unique computational requirements.
Large number of coupled itegro-differential equations.
Large sized matrices.
Pulse propagation in an atomic medium
Propagation of a laser pulse through a homogenous atomic medium can be modeled as a one-dimensional problem, where atoms are placed periodically. The ingredients of the problem are:
Model atomic system (
system)
Geometry of the system
The study is to investigate the possibility of
storing the pulse in the medium and reproducing without distortion.
Equations of the system
Density matrix equations describe the evolution of the atomic states.
The diagonal density matrices are
population densities and the offdiagonal are the coherence between atomic states.Maxwell equations describe the evolution of the laser field within the atomic medium
Together these two sets of coupled differential equations define the
matter and field evolution.
Numerical solution
Initial conditions of the problem are:
All the atoms are in ground state
.
Input pulse at Z = 0 is known for all time steps.
Using these conditions the equations are integrated. The scheme of the integration could be along time-axis for each space points or vice-versa. The schematic representation of the integration sequence is given below.
Typical number of time steps and spatial grid points are
~ 200,000
. The integration involve data dependent loops, which are difficult to parallelize. An approach of parallelization is to block the calculation and synchronize the processors based on data availability: data driven blocking.Method of integration is chosen according to the required accuracy.
Schematic of the CPU and memory management
Each of the CPUs enters the calculation when the required data is available ( data driven
synchronization). Consider a four CPU system.Parallelized algorithm
The parallelized algorithm has
Parallelize the
inner loop of the last part of the calculation.Fragment of the code
The parallelized fragment of the code is given below.
Radiation budget of Earth's atmosphere
Balance between the incoming and outgoing long-wave terrestrial radiation is very important to understand
climate change. This require radiative transfer calculation of the Earth’s atomsphere.The quantities of interest are:
Once the parameters of the calculation are
decided, the code is used as a black box.Why parallelize SBDART
The execution time of one
SBDART run0.28 – 2.8
2.80 – 100
90 sec with 20 nm resolution.
Calculation of the radiative forcing due to aerosol particles on a single day at a resolution of
15 minutes would require ~ 5 hours of computation.Understanding the role of aerosols require several runs, and there are several parameters associated with the atmosphere and aerosols to vary. Two observations are to be made
Execution time of each run is short.
A large number of runs are required.
It is therefore desirable to parallelize the code and generate data for a wide range of parameters to gain a better understanding in terms of complementing the observed data.
Important information concerning SBDART are:
Complexity
of the code and short execution time implies that the shortest route to parallelization is parallelizing the repeated execution. This is achieved using a control routine, which calls the SBDART as function or subroutine.Parallelized algorithm
Representative algorithm of parallelizing the computational sequence is
Treating SBDART as subroutine/function of the program require
Careful initialization of the static arrays.
A detailed analysis of the constants/variables passed between the subroutines/functions.
Input and output through files require detailed book keeping to avoid simultaneous access by multiple processors.
It is achievable for a short and easily understandable code. But not when it is a code designed, developed and written by
someone else.A possible solution
Treat executable file of
SBDART as a system call within the unix working environment. Then each SBDART call is independent of each other completely.This solution worked very well.
Fragment of the code
The parallelized control function to execute
SBDART is given below.Relativistic atomic structure calculation
Dirac-Coulomb Hamiltonian
is suitable for many atomic structure and properties calculations. In atomic units
)
Atomic state functions (ASF) or atomic wave-functions are calculated in the following steps:
1.
Approximate the electron-electron coulomb interaction as a central field potentialand calculate the single electron wave-functions 
2.
Calculate a set of Slater determinants or configuration state functions ( CSF) as representations of many-electron wave functions.3.
Calculate the ASFs or atomic wave functions
treating the residual electron-electron coulomb interactionAtomic wave functions
satisfy the Schrodinger equation
where
,
and
are the parity, total angular momentum and magnetic quantum numbers respectively, and
is a quantum number to define each atomic wave function uniquely.
Computational requirements
Each step of the calculation has unique computational requirements and
multi-configuration approximation of the central field potential has even higher requirements.Single electron wave function calculation
Self-consistent
calculation and require a large number of angular factors to generate the potentials. It is therefore operation intensive. Angular factors are calculated once and stored, which require large physical memory.Slater determinant calculation
Large sized arrays to manage
occupancy of shells.Atomic wave function calculation
Configuration interaction
(CI) is the simplest calculation to include electron-electron correlation effects, which is diagonalization of the residual Coulomb interaction matrix. It is operation and memory intesive.Diagonalization of large matrices is a numerical calculation
common to many problems in physics. Parallelization of matrix diagonalization is also equivalent to parallelizing CI calculation. Flow chart of CI calculation isGrouping data for communication
Data can be
grouped based on the context. For example, the variables that define a determinant or configuration state function areParity.
Total angular momentum.
Energy.
Sequence of angular momentum coupling.
Shell occupation.
A data type
struct can be defined having these as elements. However MPI does not support data type struct.MPI support several derived data type to group data.
MPI_Type_struct
MPI_Type_contiguous
MPI_Type_vector
MPI_Type_indexed
Among these
MPI_Type_struct is suitable to group chunks of data which has same context and are frequently used.Information required to build derived data type
Number of elements.
Data type of the member/elements.
Relative storage locations:
Beginning of the message.
A MPI derived data type is then a sequence of pairs
,
where each
is a basic MPI data type and each
is a displacement in bytes. The sequence of the data type

is the
type signature. Basic rule to check consistency of data type transmitted is to match the type signature.Hamiltonian matrix storage
Calculation of Hamiltonian matrix elements is distributed across the processors in blocks of rows/columns. Only the
upper/lower triangular matrix elements are required as
.
In this form of block distribution of matrix element generation, the
CPU calculates

,
. That is the first CPU calculates maximum number of matrix elements and the last CPU calculates the least. Load balancing is achieved by an appropriate choice of
Sparse matrix storage
The Hamiltonian matrix encountered in atomic structure calculations are in general
sparse, which is a consequence of the selection rules. A schematic representation isInformation required to store each row/column of the Hamiltonian matrix
Quantum numbers of the ket/bra.
Number of non-zero elements.
Location of non-zero elements.
Value of the non-zero elements.
Only the non-zero Hamiltonian matrix elements are stored using three arrays, schematicallyThe sparse nature of the matrix implies that any form of block distribution provides
near equal loads to all the processors. Building MPI_Type_struct
Communicating MPI_Type_struct