Wednesday 4 June 2014

Parallel Computation: Best Practices While Testing The Speedup

This article is intended for programmers (especially CFD developers or engineers) who want to test the scale-up or performance of their applications on modern processors using either MPI (Message Passing Interface) or OpenMP (Open Multiprocessing).
           Modern computer hardware is loaded with technologies such as Turbo Boost (Intel) or Turbo Core (AMD), vector registers, and features that allow more threads to run than there are physical execution units, all intended to accelerate code and increase the parallel execution of tasks. However, the effect of some of these technologies on performance varies with load, memory requirements and other parameters; it is neither uniform across processors nor always what a developer wants, since it can give a misleading picture of speedup.
   Parallelization of code aims at faster execution of a program using multiple processes or threads on a chosen number of processing units, i.e. processors. This article covers the basic concepts and the information every programmer or tester should know before testing an application for speedup and efficiency, as well as the best practices for measuring them. It does not intend to teach or guide ways of doing parallelization. Before going further, it is good to brush up a few terms used in the context of parallel computation.

Basic Terminology

Speedup

Speedup is the ratio of the time taken on a single processor to the time taken on multiple processors:

Speedup S(n) = T(1) / T(n)

where T(1) is the execution time on one processor and T(n) the execution time on n processors.

The higher the value, the better the speedup. In CFD (Computational Fluid Dynamics) solvers using MPI, an application's speedup curve usually flattens as the number of processors increases.



Linear Speedup or Ideal Speedup:
Linear speedup is achieved when doubling the number of processors halves the execution time, i.e. S(n) = n. It is an ideal case and not always achieved.




Super Speedup:

A special case where the speedup achieved is more than the number of processors or processing units on which the case is run. One of the reasons is the cache effect in modern processors: with more processors, more of the working set fits in cache.

Efficiency

Efficiency measures processor/resource utilization. It is the ratio of the time taken on a single processor to the total processor time consumed by the parallel run (i.e. the parallel execution time multiplied by the number of processors):

Efficiency E(n) = T(1) / (n x T(n)) = S(n) / n

Its value varies from 0 to 1; the higher the efficiency, the better the utilization of resources.



A program running on a single processor, or a program with linear speedup, has an efficiency of one. Low efficiency usually points to a complex program that is difficult to parallelize, or to communication and synchronization overheads.

Amdahl’s law
Amdahl’s law lets one determine the maximum speedup to expect from parallelizing a program or algorithm. Every parallel program consists of two sections: a parallel section, whose work can be distributed among processors, and a sequential section, which cannot be split and must be executed serially. The achievable speedup is therefore limited by the fraction of time spent in the sequential section:

S(n) = 1 / ((1 - f) + f / n)

where f is the fraction of the code that can be parallelized and n is the number of processors used. So if the parallelizable section accounts for 80% of the single-processor execution time (f = 0.8), the maximum speedup one can expect, even with arbitrarily many processors, is 1 / (1 - 0.8) = 5.

What should a programmer know

Once the program has been parallelized, the next task is to benchmark the parallelized code, i.e. to check its speedup (scale-up) and efficiency. Using Amdahl's law, one can estimate the maximum speedup to expect. However, such analysis assumes that every processor performs identically, which is not the case with modern processors. With the advent of technologies that maximize resource usage, the operating frequency and the cache behaviour depend on the processes being run and their memory usage. The point is that when using n processors it is natural to expect the same performance from each of them, but in practice that is not the case. Below we briefly discuss Hyper-Threading and Turbo Boost/Turbo Core technology before going on to the guidelines.
  

Hyper-Threading
Hyper-Threading is an Intel-patented technology that makes one physical core appear as two logical cores: physically there is one core, but to the user it looks like two. It is based on efficient use of processor resources that lets two threads execute concurrently, enabling better resource usage and more throughput. However, for memory-intensive calculations Hyper-Threading can also affect performance negatively: one physical core is shared by two processes, which puts load on shared resources such as cache and slows execution compared to using two physical cores. So while calculating speedup, Hyper-Threading should be switched off; otherwise you could see lower speedup, and in some cases even higher execution time as the processor count increases.

Suggestion: Switch off Hyper-Threading. It can be disabled from the BIOS; search the web for instructions for your hardware.




Turbo Boost Technology
Turbo Boost (Intel) or Turbo Core (AMD) technology enables cores to operate at a frequency higher than the base operating frequency, hence increasing the speed of the processor. Frequency is effectively the speed of the processor: the higher the frequency, the faster the processor, but also the greater the heat generation. It is this heat generation that limits the frequency at which processors can operate.

Background: Multicore processors are designed to run at a base operating frequency that allows safe operation, i.e. below the TDP (thermal design power: the maximum amount of generated heat that the cooling system can dissipate). However, some cores may sit idle because of a light workload, generating less heat than the TDP allows. In such a scenario, the busy cores can usefully operate above the base frequency, increasing the speed of execution.

Effect: A run on a small number of processors can be disproportionately fast compared to a run on a larger number of processors (where all cores are busy and none can boost), giving confusing speedup data.

Suggestion: Switch off Turbo Boost (Intel) or Turbo Core (AMD) technology. It can be disabled from the BIOS; search the web for instructions, or to find out whether your hardware has this feature.

One last thing
What happens when a user launches parallelized code on a multi-core processor? Does a thread or process always run on one specific processor?

The answer is no, unless you have specified the affinity, i.e. asked the OS (operating system) to use a particular processor or processing unit for the process. In general the scheduler, depending on load, moves processes among the different processors. Even when parallelized code runs on a multiprocessor machine, the scheduler by default decides which processor each process uses. In simple language, the processor with id 1 may be running the process with rank 1 at one moment and the process with rank 2 at another.
  

Benchmarking the performance of parallelized code


Hardware Suggestion
1.  Disable Hyper-Threading.
2.  Disable Turbo Boost technology.
3.  All machines used for testing the performance of the application should have the same configuration. If testing is done on a distributed system, ensure high-speed network connectivity such as InfiniBand.

Software Suggestion
1.  Make sure the executable to be tested is not built in debug mode.
2.  Use a profiler, or measure wall-clock time, to record timings. Timing with a stopwatch will not always give the right results, since manual timing also captures startup, I/O and human reaction time.
3.  Make sure no processes are running other than normal OS processes.
4.  Also check that the computer is not running any scheduled task at the time of testing. On Linux, cron runs scheduled tasks at specified times or regular intervals. It is good not to run such tasks during testing, or to spare one core if possible for system processes and other scheduled jobs.

Case Selection and Running Suggestions (written with CFD solvers in mind)
1. Run the case for a sufficient number of iterations so that there is a conspicuous difference in timing when it is run on a large number of processors, especially on distributed-memory systems with MPI.
2. A CFD solver has a one-time computational cost for reading the case and other pre-processing; this time can be excluded from the actual analysis.
3. Do not write or save any data from the application while it is being tested for speedup or performance.
4. The memory requirement of the test case should be large enough to minimize the effect of the caches (L1, L2 and L3) on performance. A detailed analysis of the effect of cache on speed can be found in the linked article.


Conclusion
The above set of guidelines will enable programmers to judge the performance of their applications better. Note that these guidelines are for testing performance and for helping programmers judge their parallelization of code; they are based on my experience with parallel computing, especially for CFD software.
A few of the guidelines also apply to production runs: switching off Hyper-Threading is advisable when running a CFD solver on memory-intensive calculations, whereas Turbo Boost is usually suggested to be left ON in production.


(If you have any suggestions to add to or improve the article, please comment on it or mail me at pawan24ghildiyal@gmail.com.)



Sunday 1 June 2014

MPI vs OpenMP: A Short Introduction Plus Comparison (Parallel Computing)


Here I will talk briefly about OpenMP and MPI (OpenMPI, MPICH, HP-MPI) for parallel programming or parallel computing. (One can easily confuse OpenMP with OpenMPI. OpenMPI is a particular implementation of MPI, whereas OpenMP is a shared-memory standard that comes with the compiler.) This is intended for users who are new to parallel programming or parallel computation and are thinking of using OpenMP or MPI for their applications or learning; it will introduce the differences as well as the advantages of both. I will not go into the development history of the two and the changes they have gone through, but will focus on the differences in their present form, i.e. OpenMP 4 and MPI 3.

A Brief about MPI  & OpenMP
1. MPI stands for Message Passing Interface. It is available as an API (application programming interface), i.e. in library form, for C, C++ and FORTRAN.
·     Different MPI implementations are available, e.g. OpenMPI, MPICH, HP-MPI, Intel MPI. Many, like OpenMPI and MPICH, are freely available, while others, like Intel MPI, come with a license, i.e. you need to pay for them.
·      One can use any of the above to parallelize programs. The MPI standard ensures that all these implementations, provided by different vendors or groups, follow the same specification, so the functions and subroutines in the different MPI implementations have the same functionality and arguments.
·      The difference lies in the implementation, which can make one MPI more efficient than another. Many commercial CFD packages give the user the option to select between different MPI implementations; HP-MPI and Intel MPI are generally considered the more efficient performers.
·       When MPI was developed it was aimed at distributed-memory systems, but the focus is now on both distributed- and shared-memory systems. This does not mean one could not run MPI programs on shared-memory systems earlier; it is just that one could not take advantage of the shared memory, which is now possible with MPI 3.


2.  OpenMP stands for Open Multiprocessing. OpenMP is essentially an add-on in the compiler; it is available in gcc (the GNU compiler), the Intel compiler and other compilers.
·      OpenMP targets shared-memory systems, i.e. systems where the processors share the main memory.
·      OpenMP is based on a thread approach. It launches a single process, which in turn can create n threads as desired. It follows the "fork and join" model: depending on the particular task, it launches the number of threads directed by the user.

   Fork and join model of OpenMP:
different stages of the program run different numbers of threads.

·      Programming in OpenMP is relatively easy and involves adding pragma directives; the user needs to specify the number of threads to use. (Note that launching more threads than there are processing units available can actually slow down the whole program.)


What is Shared Memory and Distributed Memory
1) Shared memory is memory that all processors can see in its entirety. A simple example is your desktop computer or laptop, where all processing units can see all the memory of the system.
Shared memory: processors 1, 2, 3 and 4 can see the whole memory.



2) A distributed-memory system is one where each processor can see only a limited part of the memory, e.g. two desktop computers connected over a network: each can see only its own memory, not the other's.



Distributed-memory system: each CPU can see only its own limited memory.


What is  Process and Thread

1) Process: an executing instance of a program. It has a distinct address space, and it differs from other executing instances of the program in that it has its own separate resources.


2) A thread is a subset of a process. A process can have as many threads as desired. Every thread of a process shares all of the process's resources, i.e. the data and address space of the process that created it. A thread has to be part of some process; it cannot exist independently.




MPI vs OpenMP: point-by-point comparison

1. MPI: Available from different vendors and can be compiled on the desired platform with the desired compiler. One can use any MPI implementation, e.g. MPICH, OpenMPI or others.
   OpenMP: Tied to the compiler, so gcc and the Intel compiler each come with their own implementation. The user is at liberty to change the compiler, but not the OpenMP implementation it provides.

2. MPI: Supports C, C++ and FORTRAN.
   OpenMP: Supports C, C++ and FORTRAN.

3. MPI: OpenMPI, one implementation of MPI, provides provisional support for Java.
   OpenMP: A few projects try to replicate OpenMP for Java.

4. MPI: Targets both distributed- and shared-memory systems.
   OpenMP: Targets only shared-memory systems.

5. MPI: Based on both process- and thread-based approaches. (Earlier it was mainly process-based parallelism, but with MPI 2 and 3 thread-based parallelism is there too; a process can contain more than one thread and call MPI subroutines as desired.)
   OpenMP: Only thread-based parallelism.

6. MPI: The overhead of creating processes is paid once.
   OpenMP: Depending on the implementation, threads can be created and joined for each particular task, which adds overhead.

7. MPI: There are overheads associated with transferring a message from one process to another.
   OpenMP: No such overheads, as threads can share variables.

8. MPI: A process has private variables only, no shared variables.
   OpenMP: Threads have both private and shared variables.

9. MPI: No data races if each process runs a single thread.
   OpenMP: Data races are inherent in the OpenMP model.

10. MPI: Compiling an MPI program requires adding the header file #include "mpi.h" and compiling with the wrapper, e.g. on Linux:
      mpic++ mpi.cxx -o mpiExe
    (The user needs to set the PATH and LD_LIBRARY_PATH environment variables to the MPI installation folder and binaries, e.g. those of OpenMPI.)
    OpenMP: Add omp.h, then compile directly with -fopenmp in a Linux environment:
      g++ -fopenmp openmp.cxx -o openmpExe

11. MPI: To run an MPI program, make sure the bin and library folders from the MPI installation are included in PATH and LD_LIBRARY_PATH, then supply the number of processes on the command line; in the example below it is four:
      mpirun -np 4 mpiExe
    OpenMP: The user can launch the executable openmpExe in the normal way:
      ./openmpExe




 Sample MPI program
  
 #include <iostream>
#include <mpi.h>

/* A simple hello-world program: each process prints its rank. */
using namespace std;

int main(int argc, char** argv)
{
    int myid, numprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    /* output my rank */
    cout << "Hello from " << myid << endl;

    MPI_Finalize();
    return 0;
}

 Command to run the executable named a.out on Linux:  mpirun -np 4 a.out

Output (the order of lines may vary between runs):

   Hello from 1
   Hello from 0
   Hello from 2
   Hello from 3


Sample OpenMP Program 


#include <iostream>
#include <omp.h>
using namespace std;

/* Sample OpenMP program: stage 1 runs 4 threads, stage 2 runs 2 threads. */
int main()
{
#pragma omp parallel num_threads(4) // create 4 threads; the region inside is executed by all threads
{
  #pragma omp critical // allow one thread at a time to execute the statement below
  cout << " Thread Id  in OpenMP stage 1=  " << omp_get_thread_num() << endl;
}  // here all threads merge back into a single thread

cout << "I am alone" << endl;

#pragma omp parallel num_threads(2) // create two threads
{
  cout << " Thread Id  in OpenMP stage 2=  " << omp_get_thread_num() << endl;
}

}
 Command to run the executable named a.out on Linux:  ./a.out
Output (the order of lines may vary between runs):

       Thread Id  in OpenMP stage 1= 2
       Thread Id  in OpenMP stage 1= 0
       Thread Id  in OpenMP stage 1= 3
       Thread Id  in OpenMP stage 1= 1
       I am alone
       Thread Id  in OpenMP stage 2= 1
       Thread Id  in OpenMP stage 2= 0


Summary
MPI and OpenMP each have their own advantages and limitations. OpenMP is relatively easy to implement and involves only a few pragma directives to achieve the desired tasks; it can also be used in recursive functions, e.g. for traversing a binary tree. However, it suffers from memory limitations for memory-intensive calculations.
MPI usually serves well those problems that involve large memory. With MPI 3, the shared-memory advantage can be exploited within MPI too. One can also combine OpenMP with MPI: OpenMP for the shared memory within a node, and MPI across the distributed nodes.


Author
Pawan Ghildiyal
