================================
Parallel computing 
================================

What is parallel computing?
===========================

**Parallel computing** is the simultaneous use of multiple compute
resources to solve a computational problem. Provided the computational
problem can be broken apart into discrete pieces of work that can be
solved simultaneously, the goal is to solve the problem in less time
with multiple compute resources than with a single compute
resource. Luckily, the real world (and hence the mathematical models
describing real-world systems) is massively parallel: many complex,
interrelated events happen at the same time, yet within a temporal
sequence.


Today's compute resources are typically: (1) a single computer
with multiple processors/cores, e.g. an Apple MacBook with a dual-core or
quad-core processor; or (2) a supercomputer consisting of a large number of such computers (or
nodes) connected by a communication network, forming a cluster. Each
node consists of multiple cores that sit together on a socket. For
instance, the `Stampede2`__ supercomputer is a cluster with 4,200 nodes, each containing 68
cores, making a total of 285,600 cores.

Generally speaking, each core is an independent unit that executes a
stream of instructions. The separate cores can work on unrelated tasks
or (by data parallelism) collaborate on a common task.


Limits of parallel computing
==================================

**Amdahl's Law** gives the potential program speedup (:math:`S`) in
terms of the fraction :math:`P \in [0,1]` of the code that can be
parallelized: 

.. math::
   :nowrap:

      $$S = \frac{T_s}{T_p} \le \frac{1}{1-P},$$

where :math:`T_s` and :math:`T_p` denote the times required on a
sequential and a parallel machine, respectively.

If no fraction of the code can be parallelized (:math:`P=0`), then
there is no speedup (:math:`S = 1`). If all of the code can be parallelized (:math:`P = 1`), the speedup is in theory
infinite (assuming we have many processors/cores). Typically only part of a computation can be
parallelized. For instance, if :math:`50\%` of the code can be
parallelized (:math:`P=0.5`), the maximum speedup is 2: the code will
run at most twice as fast no matter how many processors/cores we use,
because the sequential :math:`50\%` takes half of the original run
time, and that time is still required even if the parallel part is
reduced to zero.
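
Concretely, substituting :math:`P = 0.5` into Amdahl's law gives

.. math::
   :nowrap:

      $$S \le \frac{1}{1-P} = \frac{1}{1-0.5} = 2.$$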
      
Now let us assume that our parallel machine has :math:`N_p`
processors. Then

.. math::
   :nowrap:

      $$T_p \ge (1-P) \, T_s + P \, \frac{T_s}{N_p},$$ 

with equality corresponding to the best case. In the limit of many
processors, the second term on the right-hand side vanishes, and we obtain

.. math::
   :nowrap:

      $$\lim_{N_p \rightarrow \infty} T_p = \frac{T_s}{S},$$ 

gaining a factor of :math:`S` with many processors.

In practice the speedup is less than (sometimes much less than) the
number of processors :math:`N_p`, due to the overhead of starting
processes/threads, communication costs, memory limitations, the
algorithm's limited scalability, etc.
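
As an illustration (a minimal sketch, not tied to any particular machine),
the following C program tabulates the ideal Amdahl speedup
:math:`S(N_p) = 1/\big((1-P) + P/N_p\big)` for a few processor counts; the
parallel fraction ``P`` below is an arbitrary illustrative value.

.. code-block:: c

   #include <stdio.h>

   /* Ideal Amdahl speedup for parallel fraction P on Np processors. */
   double amdahl_speedup(double P, int Np)
   {
       return 1.0 / ((1.0 - P) + P / Np);
   }

   int main(void)
   {
       double P    = 0.5;                   /* illustrative parallel fraction */
       int    Np[] = {1, 2, 4, 16, 1024};   /* processor counts to tabulate   */

       for (int i = 0; i < 5; i++)
           printf("Np = %4d  ->  S = %.3f\n", Np[i], amdahl_speedup(P, Np[i]));

       return 0;   /* S approaches 1/(1-P) = 2 as Np grows */
   }

For :math:`P = 0.5` the printed speedup approaches, but never exceeds, the
theoretical maximum :math:`S = 2` as :math:`N_p` grows.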


Scalability
===============

**Scalability** refers to the ability of a parallel system (software
and/or hardware) to demonstrate a proportional increase in parallel
speedup with the addition of more resources. 


There are two types of scaling:

* **Strong scaling**: How does the algorithm perform as :math:`N_p`
  increases for a fixed problem size :math:`N`? The goal is to run the
  same problem size faster. Perfect scaling means the problem is solved
  in :math:`1/N_p` of the serial time.

* **Weak scaling**: How does the algorithm perform when :math:`N`
  increases with :math:`N_p`? The goal is to run a larger problem in
  the same amount of time. For example, if we double :math:`N_p`, can
  we solve a problem twice as large in the same time?
  

What does it mean to double the size of a problem?

**Example.** When solving an :math:`n \times n` linear system with
Gaussian elimination, we require :math:`{\mathcal O}(n^3)`
FLOPS. Doubling :math:`n` would not double the size of the problem, as
it would require 8 times as many operations. This
problem is indeed twice as large if we increase :math:`n` by a factor
of :math:`2^{1/3} \approx 1.26`. In fact with :math:`n_{new} = 2^{1/3}
n`, we would require :math:`{\mathcal O}(n_{new}^3) = 2 \, {\mathcal O}(n^3)` FLOPS.
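
As a rough sketch of how one might size a weak-scaling study for an
:math:`{\mathcal O}(n^3)` algorithm (the helper name below is hypothetical,
not from any library), we can grow :math:`n` by a factor of
:math:`N_p^{1/3}` so that the total work grows in proportion to :math:`N_p`:

.. code-block:: c

   #include <math.h>
   #include <stdio.h>

   /* For an O(n^3) algorithm, pick n so that total work scales with Np:
      n_new^3 = Np * n_base^3   =>   n_new = n_base * Np^(1/3).           */
   int weak_scaling_size(int n_base, int Np)
   {
       return (int) lround(n_base * cbrt((double) Np));
   }

   int main(void)
   {
       int n_base = 1000;                   /* illustrative base problem size */
       for (int Np = 1; Np <= 8; Np *= 2)
           printf("Np = %d  ->  n = %d\n", Np, weak_scaling_size(n_base, Np));
       return 0;                            /* compile with -lm */
   }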

**Exercise.** What if we solve the system in the previous example by
an iterative method, such as the multigrid method, that requires
:math:`{\mathcal O}(n \log n)` FLOPS?

**Remark.** Developing better algorithms is as important as better
hardware.


Parallel computer memory architectures
=======================================================
      

**Shared Memory** describes a computer architecture where all
processors have direct access to common physical memory as global
address space. Multiple processors can operate independently but share
the same memory resources. Shared memory machines have been classified as UMA and NUMA.

* Uniform Memory Access (UMA): Identical processors have equal access
  and access times to memory. All processors/cores have a consistent view of the data: If one processor updates a location in
  shared memory, all the other processors know about the update (this
  is referred to as cache coherency). For instance, if data value x
  changes on a given core, there must be no risk of other cores using
  outdated values of x.

* Non-Uniform Memory Access (NUMA): It is often made by linking a few
  UMA machines. Not all processors have equal access time to all
  memories. Under NUMA, a processor can access its own local memory faster than
non-local memory (memory access across the link is slower). With NUMA,
  maintaining cache coherency across shared memory has a significant
  overhead.
  

In general, the global address space in shared memory machines provides a
user-friendly programming perspective to memory. However, shared
memory machines suffer from the lack of scalability between memory and
processors: adding more processors can geometrically increase traffic
on the shared memory-processor path, and for cache coherent systems,
geometrically increase traffic associated with cache/memory
management.

  

**Distributed Memory** describes a computer architecture where
processors have their own local memory. Each processor operates
independently. Memory addresses in one processor do not map to another processor, so there is no concept of
global address space across all processors. Distributed memory systems require a communication network to connect
inter-processor memory. Since the changes that a processor makes to its local memory have no effect on
the memory of other processors, the concept of cache coherency does
not apply. When a processor needs access to data in another processor, it is
usually the task of the programmer to explicitly define how and when
data is communicated (or exchanged) between
processors. Synchronization between tasks (coordination of parallel tasks) is likewise the programmer's
responsibility. Synchronization is often implemented by establishing a synchronization point within an application where a
task may not proceed further until one or more other tasks reach the same point. Synchronization usually involves waiting
by at least one task, and can therefore cause a parallel application's
wall clock execution time to increase.
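
For instance (a minimal sketch, assuming an MPI installation), a barrier is a
synchronization point at which every task waits until all tasks have arrived:

.. code-block:: c

   #include <mpi.h>
   #include <stdio.h>

   int main(int argc, char **argv)
   {
       int rank;

       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);

       /* ... each task does its own local work here ... */

       /* no task proceeds past this point until all tasks reach it */
       MPI_Barrier(MPI_COMM_WORLD);

       printf("rank %d passed the barrier\n", rank);
       MPI_Finalize();
       return 0;
   }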

A main advantage of distributed memory systems is that memory is scalable with the number of processors. Increase the number
of processors and the size of memory increases proportionately. Moreover, each processor can rapidly access its own memory without interference
and without the overhead incurred with trying to maintain global cache
coherency. However, one should beware of communication costs and
non-uniform memory access times: data residing on a remote node takes
longer to access than node local data.









**Hybrid Memory**: The largest and fastest computers in the world
today (such as Stampede2) employ both shared and distributed memory
architectures. In such a paradigm, increased scalability is an
important advantage, while increased programmer complexity is an
important disadvantage.


Parallel programming
=================================

We will study two parallel programming models:

**1. Threads model on shared memory machines**:

* The main program performs some serial work, and then creates a
  number of tasks (threads) that can be scheduled and run by the
  operating system concurrently.

*  Each thread has local (or private) data, but also shares the entire resources
   of the program (shared data). This saves the overhead associated with replicating a
   program's resources for each thread. Each thread
   also benefits from a global memory view because it shares the
   memory space of the program.
   
* Threads communicate with each other through global memory (updating
  address locations). This requires synchronization constructs to
  ensure that no two threads update the same global address at the
  same time.
  

* We will use `OpenMP`__ to implement this model.
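
A minimal sketch of this model in C (compiled with OpenMP support, e.g.
``gcc -fopenmp``): the loop iterations are divided among threads, the loop
index is private to each thread, and the ``reduction`` clause provides the
synchronization needed for the shared sum.

.. code-block:: c

   #include <omp.h>
   #include <stdio.h>

   #define N 1000000

   int main(void)
   {
       static double a[N], b[N];    /* shared data, visible to all threads */
       double sum = 0.0;

       /* each thread works on its own chunk of iterations i */
       #pragma omp parallel for reduction(+:sum)
       for (int i = 0; i < N; i++) {
           b[i] = (double) i;
           a[i] = 2.0 * b[i];
           sum += a[i];             /* reduction avoids a race on sum */
       }

       printf("max threads = %d, sum = %e\n", omp_get_max_threads(), sum);
       return 0;
   }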




**2. Message passing model on distributed memory machines**: 


* We create a set of tasks to be run on processors that use their own local memory during
  computation. 
  
* Tasks exchange data through communications by sending and receiving messages.

* Data transfer usually requires cooperative operations to be
  performed by each process. For example, a send operation must have a
  matching receive operation.
  
* We will use `MPI`__ to implement this model.
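
A minimal sketch of this model in C (assuming an MPI installation): rank 0
sends a number to rank 1 with a matching send/receive pair. Run with at
least two processes, e.g. ``mpirun -np 2 ./a.out``.

.. code-block:: c

   #include <mpi.h>
   #include <stdio.h>

   int main(int argc, char **argv)
   {
       int    rank;
       double x = 0.0;

       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);

       if (rank == 0) {
           x = 3.14;
           /* the send must be matched by a receive on rank 1 */
           MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
       } else if (rank == 1) {
           MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
           printf("rank 1 received x = %g\n", x);
       }

       MPI_Finalize();
       return 0;
   }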


  
Designing parallel programs
=================================================

Partitioning
~~~~~~~~~~~~~~~~~~~~~

One of the first steps in designing a parallel program is to break the
problem into discrete "chunks" of work that can be distributed to
multiple tasks. This is known as decomposition or partitioning. There are two basic ways to partition computational work among parallel tasks: 

* Domain Decomposition: In this type of partitioning, the data
  associated with a problem is decomposed. Each parallel task then
  works on a portion of the data. This is a common scenario in
  scientific computing: a data set (such as a vector or a matrix) is
  spread over many processors, each working on its part of the
  data. This may or may not require communication; a sketch of such a
  partitioning is given after this list.
  
* Functional Decomposition: In this approach, the focus is on the
  computation that is to be performed rather than on the data
  manipulated by the computation. The problem is decomposed according
  to the work that must be done. Each task then performs a portion of
  the overall work. Functional decomposition is well suited to
  problems that can be split into different tasks. Functional
  parallelism is often obtained by considering independent
  subprograms, e.g. when we need to solve the same set of ODEs with various
  choices of parameters, or in convergence
  studies where we need to solve the same ODE problem with different step
  sizes. 
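
As a sketch of domain decomposition (the helper name below is illustrative,
not from any library), each of :math:`N_p` tasks can compute the contiguous
block of a length-:math:`N` vector that it owns:

.. code-block:: c

   #include <stdio.h>

   /* Split indices 0..N-1 into Np nearly equal contiguous blocks;
      task `rank` owns [*start, *end).  The leftover N % Np indices
      go to the first N % Np tasks.                                  */
   void block_range(int N, int Np, int rank, int *start, int *end)
   {
       int base = N / Np, rem = N % Np;
       *start = rank * base + (rank < rem ? rank : rem);
       *end   = *start + base + (rank < rem ? 1 : 0);
   }

   int main(void)
   {
       int N = 10, Np = 4;                  /* illustrative sizes */
       for (int rank = 0; rank < Np; rank++) {
           int s, e;
           block_range(N, Np, rank, &s, &e);
           printf("task %d owns indices [%d, %d)\n", rank, s, e);
       }
       return 0;
   }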
  





Communications
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The need for communications between tasks depends on the problem at hand:

* We do not need communications if the problem can be decomposed and
  executed in parallel with almost no need for tasks to share data. These types of
  problems are often called embarrassingly parallel: little or no
  communications are required. For instance, :math:`a(i) = 2*b(i)`
  would not need communication between different processors
  (corresponding to different :math:`i` values). As another example, in Monte Carlo
  simulations we solve many deterministic problems (corresponding to
  many realizations of the random parameters) independently. Each
  deterministic problem can be solved on a processor (or multiple
  processors) independently of other problems.
  

* We do, however, need communications in most parallel applications that
  require tasks to share data with each other. For example, if we
  solve a 2D heat diffusion problem with a central finite difference
  scheme, each task needs to know the temperatures calculated by the
  tasks that hold neighboring data, as in the halo-exchange sketch
  below.
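
This sketch assumes a 1D decomposition of the grid into horizontal strips
(one strip per task, with ghost rows for neighbor data) and uses
``MPI_Sendrecv`` for the exchange; the grid sizes are illustrative.

.. code-block:: c

   #include <mpi.h>

   #define NX 100          /* illustrative local grid width              */
   #define NY 50           /* illustrative number of local interior rows */

   int main(int argc, char **argv)
   {
       /* local strip, with ghost rows 0 and NY+1 holding neighbor data */
       double u[NY + 2][NX];
       int rank, size;

       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       MPI_Comm_size(MPI_COMM_WORLD, &size);

       int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
       int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

       for (int j = 0; j < NY + 2; j++)      /* initialize (placeholder) */
           for (int i = 0; i < NX; i++)
               u[j][i] = 0.0;

       /* halo exchange: swap boundary rows with the two neighbors */
       MPI_Sendrecv(u[1],      NX, MPI_DOUBLE, up,   0,
                    u[NY + 1], NX, MPI_DOUBLE, down, 0,
                    MPI_COMM_WORLD, MPI_STATUS_IGNORE);
       MPI_Sendrecv(u[NY],     NX, MPI_DOUBLE, down, 1,
                    u[0],      NX, MPI_DOUBLE, up,   1,
                    MPI_COMM_WORLD, MPI_STATUS_IGNORE);

       /* ... update interior rows 1..NY with the finite difference stencil ... */

       MPI_Finalize();
       return 0;
   }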

  
  




__ http://math.unm.edu/~motamed/Teaching/Fall20/HPSC/stampede2.html
__ http://math.unm.edu/~motamed/Teaching/Fall20/HPSC/openmp.html
__ http://math.unm.edu/~motamed/Teaching/Fall20/HPSC/mpi.html