Working on Stampede

Intro

Lab session 26/9-2014

In this lab session we will:

  1. Learn how to log in to the Stampede supercomputer.
  2. Learn how to compile OpenMP programs.
  3. Specify the number of threads for OpenMP programs.
  4. Set up batch scripts for parallel jobs.
  5. Run a simple OpenMP program and analyze its timing output.

Start by pointing your browser to the XSEDE page about Stampede. For now, pay close attention to the sections System Access, Application Development (compiling), SLURM Batch Environment and LAUNCHING MPI APPLICATIONS WITH IBRUN.

Later on, the Computing Environment, Code Tuning and File Systems sections are worth reading as well.

Logging in and cloning our repository

From your terminal window, run ssh username@login.xsede.org, where username is your XSEDE username. Once logged in there, run gsissh -p 2222 stampede.tacc.xsede.org. After you have logged in to Stampede, the shell will display some useful information about the system. For example:

--> Stampede has three parallel file systems: $HOME (permanent,
    quota'd, backed-up) $WORK (permanent, quota'd, not backed-up) and
    $SCRATCH (high-speed purged storage). The "cdw" and "cds" aliases
    are provided as a convenience to change to your $WORK and $SCRATCH
    directories, respectively.

This message tells you that there are three different file systems associated with your account. Your $HOME directory is small and should mainly be used to store source code and for compiling. The $WORK allocation is larger and should be used for most of your computations going forward (in this lab session we don’t really write anything to file, so we can stay in $HOME). The $SCRATCH file system is very large but is periodically purged, so it should only be used for very big computations.

Once you are logged in, clone our repo:

git clone https://username@bitbucket.org/appelo/math471.git

and cd into math471/codes/Stampde. In that directory there should be a Makefile containing something like:

FC = ifort
LD = ifort
LDFLAGS = -openmp
F90FLAGS = -openmp
EX = ./matrixmul.x
OBJECTS = matrixmul.o

As you can see, we will use the Intel Fortran compiler, ifort. The flag -openmp gives you access to the OpenMP module and also instructs the compiler to take the OpenMP directives (the lines starting with !$OMP) into account. Without the flag, those lines are treated as ordinary comments.
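To make the role of these variables concrete, the following shell sketch spells out the two commands that a Makefile like this would typically run. The Makefile's rules are not shown in this handout, so the exact command lines are an assumption; only the variable values come from the listing above.

```shell
# Variable values copied from the Makefile listing above.
FC=ifort
LD=ifort
LDFLAGS=-openmp
F90FLAGS=-openmp
EX=./matrixmul.x
OBJECTS=matrixmul.o

# Assumed rule bodies: compile the source to an object file, then link.
compile_cmd="$FC $F90FLAGS -c matrixmul.f90"
link_cmd="$LD $LDFLAGS -o $EX $OBJECTS"

if command -v "$FC" >/dev/null 2>&1 && [ -f matrixmul.f90 ]; then
  # Compiler and source are present: build by hand.
  $compile_cmd && $link_cmd
else
  # Otherwise only print what would be run.
  echo "$compile_cmd"
  echo "$link_cmd"
fi
```

Note that -openmp appears in both steps: the compile step needs it to honor the directives, and the link step needs it to pull in the OpenMP runtime library.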

The program matrixmul.f90 is a very simple example of how OpenMP can be used for parallel computing.

program matrixmul
  use omp_lib
  implicit none
  integer, parameter :: nmax = 800
  integer :: i,j,k,l,myid
  real(kind = 8), dimension(nmax,nmax) :: A,B,C
  real(kind = 8) :: d,time1,time2,time1a,time2a
  A = 1.d0
  B = 2.d0

The statement use omp_lib gives you access to all the subroutines and functions in OpenMP. We will go through many of them in class, but if you want to read up beforehand you can look at the tutorials / documentation here and here.

This code computes and times the matrix product \(C = AB\) using do loops. The program repeats the computation with 1 to 8 threads, and the number of threads is set by the call to omp_set_num_threads().

do l = 1,8
   !$ call omp_set_num_threads(l)
   call cpu_time(time1)
   time1a = omp_get_wtime()

We use two timers, cpu_time() and omp_get_wtime(): the first measures the CPU time and the second measures the wall-clock time. Our approach to computing \(C = AB\) is straightforward: we simply use the OpenMP directive !$OMP PARALLEL DO, which tells the compiler to execute the next do loop in parallel. By default, the variables inside the do loop are assumed to be shared, but we can make them private to each thread with the PRIVATE() clause. The loop counter j is in fact private by default, but to make that explicit we put it in the PRIVATE() clause as well.

What could happen if i,k,d were shared?

     !$OMP PARALLEL DO PRIVATE(i,j,k,d)
     do j = 1,nmax
        do i = 1,nmax
           d = 0.d0
           do k = 1,nmax
              d =  d + A(i,k)*B(k,j)
           end do
           C(i,j) = d
        end do
     end do
     call cpu_time(time2)
     time2a = omp_get_wtime()
     write(*,*) "With ", l, " threads this takes: ",time2-time1 ,&
     " of cpu_time but only ",time2a-time1a, " wall clock time."
  end do
end program matrixmul

Start by compiling the program with make, and check that the executable matrixmul.x has been produced. Next, read the instructions on job submission in the Stampede user guide and find out what the lines in the SLURM script ompbatch_8.job mean:

#!/bin/bash
#SBATCH -A TG-DMS140044     # account name
#SBATCH -J mmul_omp_t8      # job name
#SBATCH -o mmul8_out.%j     # output file
#SBATCH -e mmul8_err.%j     # error file
#SBATCH -N 1                # total nodes requested
#SBATCH -n 1                # total MPI tasks requested
#SBATCH -p serial           # queue name
#SBATCH -t 00:02:00         # total time requested <hh:mm:ss>

The %j in the output and error filenames gets replaced by your job-id so that the files don’t get overwritten.
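The listing above shows only the #SBATCH header. SLURM reads those lines as options, while bash treats them as comments, so a job script also needs ordinary shell commands after the header. The body below is a sketch of what such commands might look like; it is an assumption and is not taken from ompbatch_8.job itself.

```shell
# Hypothetical job-script body (not copied from ompbatch_8.job).

# OMP_NUM_THREADS sets the default thread count for OpenMP programs.
# matrixmul.x overrides it in its l loop via omp_set_num_threads(),
# so here it only matters for programs without such a call.
export OMP_NUM_THREADS=8

if [ -x ./matrixmul.x ]; then
  ./matrixmul.x      # standard output is collected in mmul8_out.<jobid>
else
  echo "matrixmul.x not found; run make first" >&2
fi
```

Setting the thread count through the environment is convenient when the same executable should be run with different numbers of threads without recompiling.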

Before submitting your job, find out from the documentation how many cores each node on Stampede has. If the number differs from 8, change the upper bound of the l loop in matrixmul.f90 and recompile.
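As a cross-check on the documentation, you can also ask a machine how many CPUs it sees. This is only a rough sketch: it reports logical CPUs, which can exceed the number of physical cores when hyper-threading is enabled, and it describes the machine it runs on. Run on a login node, it tells you about the login node, not necessarily about the compute nodes.

```shell
# Number of logical CPUs currently online on this machine.
ncores=$(getconf _NPROCESSORS_ONLN)
echo "this machine reports $ncores logical CPUs"
```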

Now submit your job. To check its status, use squeue -u username. If you realize that you did something wrong, scancel jobid is the command to use (the jobid is listed in the output of squeue -u username).
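The submit, check and cancel cycle described above can be sketched as follows. SLURM's sbatch confirms a submission with a line of the form "Submitted batch job <jobid>"; the helper name jobid_from_sbatch is made up for this example, and the sketch assumes that your login name on the cluster is available in $USER.

```shell
# Pick the job ID out of sbatch's output, ignoring any other lines.
jobid_from_sbatch() {
  awk '/Submitted batch job/ { print $NF }'
}

if command -v sbatch >/dev/null 2>&1; then
  jobid=$(sbatch ompbatch_8.job | jobid_from_sbatch)
  echo "submitted job $jobid"
  squeue -u "$USER"      # list your queued and running jobs
  # scancel "$jobid"     # uncomment to cancel the job
else
  echo "sbatch is not available on this machine"
fi
```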

Inspect the timing data (in the mmul8_out.jobid file) and compute the speedup, that is, the wall-clock time with one thread divided by the wall-clock time with p threads. Then change the program so that the parallel do loop is either the i or the k loop, and compute the speedup again.
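One way to tabulate the speedup is to let awk pick the numbers out of the output file. The sketch below assumes the wording of the write statement in matrixmul.f90: it takes the number after "With" as the thread count p and the number after "only" as the wall-clock time T(p), then prints p, T(p), the speedup T(1)/T(p) and the efficiency T(1)/(p T(p)). It reads word by word, so it still works if the compiler wraps the output line. The function name speedup_table and the sample numbers are made up for illustration; they are not measurements.

```shell
# Columns printed: threads, wall time, speedup, efficiency.
speedup_table() {
  awk '{
    for (i = 1; i <= NF; i++) {
      if      ($i == "With") want = "p"       # next word: thread count
      else if ($i == "only") want = "wall"    # next word: wall-clock time
      else if (want == "p")  { p = $i + 0; want = "" }
      else if (want == "wall") {
        wall = $i + 0; want = ""
        if (p == 1) t1 = wall                 # one-thread run is the reference
        if (t1 > 0 && wall > 0)
          printf "%d %.3f %.2f %.2f\n", p, wall, t1 / wall, t1 / (wall * p)
      }
    }
  }'
}

# Made-up sample lines in the program's output format.
printf '%s\n' \
  ' With 1 threads this takes: 4.0 of cpu_time but only 4.0 wall clock time.' \
  ' With 2 threads this takes: 4.2 of cpu_time but only 2.0 wall clock time.' \
  | speedup_table
```

On real data you would run speedup_table < mmul8_out.jobid, with jobid replaced by your job's ID. An efficiency close to 1 means that the extra threads are being used well.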