Look for Bottlenecks with Open|SpeedShop

By Jim Galarowicz

Open|SpeedShop is an open source, multiplatform Linux performance tool for analyzing applications running on both single-node and large-scale IA64, IA32, EM64T, AMD64, IBM PowerPC, Cray, and IBM Blue Gene platforms. Open|SpeedShop operates on existing application binaries, so there is no need to recompile the application being analyzed. It gathers several types of performance information, relates that information back to the application source code, and then displays it to the user.

Open|SpeedShop is in use at a number of laboratories, universities, and corporations worldwide, helping software developers and users speed up applications and reduce time to solution. Open|SpeedShop supports performance analysis of sequential, MPI, OpenMP, and threaded applications. It has been tested on several of the most common Linux distributions with the most commonly used MPI implementations, including SGI MPT, mpich2 variants, mvapich, mvapich2, and openmpi.

In this article, I will describe how to use Open|SpeedShop through step-by-step examples illustrating how to find a number of different performance bottlenecks. Additionally, I will describe the tool’s most common usage model (workflow) and provide several performance data viewing options.

Open|SpeedShop uses both statistical sampling and traditional tracing techniques to record performance information. The central concept in its workflow is an experiment. An experiment defines what type of performance information is being measured and the program being analyzed. Users select their experiment at the beginning of any performance analysis run depending on what kind of performance bottleneck they would like to investigate.

Sampling techniques periodically interrupt execution, record the current location, and then report the statistical distribution across all recorded locations. The main types of data gathered this way are program counter information (pcsamp), call path information (usertime), and hardware counter information (hwc, hwctime, hwcsamp). Experiment names are given in parentheses; for example, “pcsamp” is the name of the Program Counter Sampling experiment. Tracing techniques gather and store individual application events, such as function invocations, MPI messages, and I/O calls; events are typically time stamped and provide detailed per-event information. Tracing is used to gather Input/Output information (io, iot), MPI function-specific information (mpi, mpit, mpiotf), and Floating Point Exception information (fpe). Table 1 describes the specific performance issues that each Open|SpeedShop experiment is designed to reveal.

Table 1: Summary of Experiments

| Experiment | Clues | Data Collected and Derived |
|---|---|---|
| pcsamp | High user CPU time. Gives good low-overhead overview of performance. | Actual CPU time at the source line, machine instruction, and function levels by sampling the program counter at 100 samples per second. |
| usertime | Slow program, nothing else known. May not be CPU-bound. | Inclusive and exclusive CPU time for each function by sampling the callstack at 35 samples per second. Identifies paths through the program that are taking the most time. |
| hwc | High user CPU time. | Counts at the source line, machine instruction, and function levels of various hardware events, including: clock cycles, graduated instructions, primary instruction cache misses, secondary instruction cache misses, primary data cache misses, secondary data cache misses, translation lookaside buffer (TLB) misses, and graduated floating-point instructions. A single hardware counter is read when a predefined count threshold is reached (overflows). |
| hwcsamp | High user CPU time. | Similar to the hwc experiment, except that periodic sampling is used instead of the overflow mechanism. Up to six (6) hardware counter events are read when a sample is taken. |
| hwctime | High user CPU time. | Similar to the hwc experiment, except that callstack sampling is used and call paths are available along with the event counts. |
| io | I/O-bound. | Traces and times the following I/O system calls: read, readv, write, writev, open, close, dup, pipe, creat. The time reported is wall clock time. Call path information also is available. |
| iot | I/O-bound. | Traces and times the following I/O system calls: read, readv, write, writev, open, close, dup, pipe, creat. The time reported is wall clock time. Output is a line of trace per I/O function call. Call path information also is available. |
| mpi | MPI performance is poor. | Times calls to various MPI routines. The time reported is wall clock time. Call path information also is available. |
| mpit | MPI performance is poor. | Traces and times MPI function calls. Output is optionally a line of trace per MPI function call. All calls are accounted for by wrapping (i.e., no sampling). The time reported is wall clock time. Call path information also is available. |
| mpiotf | MPI performance is poor, and OTF files are preferred. | Traces and times MPI function calls and generates Open Trace Format (OTF) files using VampirTrace as the underlying gathering tool. |
| fpe | High system time. Presence of floating-point operations. | All floating-point exceptions, with the exception type and the callstack at the time of the exception. |
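
To make the difference between exclusive and inclusive time concrete, here is a minimal Python sketch of how call-stack samples could be turned into per-function times. It is an illustration only, not Open|SpeedShop code: the function `attribute_samples`, the sample data, and the function names `main`, `solve`, `residual`, and `setup` are all hypothetical. Each sample is assumed to represent one sampling period of execution.

```python
from collections import Counter

def attribute_samples(stacks, rate_hz):
    """Turn call-stack samples into exclusive/inclusive seconds per function.

    Each sample is a list of function names, outermost caller first.
    Every sample is assumed to stand for 1/rate_hz seconds of execution.
    """
    exclusive = Counter()
    inclusive = Counter()
    for stack in stacks:
        # Exclusive time is charged only to the function that was executing.
        exclusive[stack[-1]] += 1
        # Inclusive time is charged once to every function on the call path.
        for name in set(stack):
            inclusive[name] += 1

    def to_seconds(counts):
        return {name: n / rate_hz for name, n in counts.items()}

    return to_seconds(exclusive), to_seconds(inclusive)

# Hypothetical samples, taken at the usertime rate of 35 samples per second.
samples = (
    [["main", "solve", "residual"]] * 21
    + [["main", "solve"]] * 7
    + [["main", "setup"]] * 7
)
exclusive, inclusive = attribute_samples(samples, rate_hz=35)
```

In this sketch, `main` never appears in `exclusive` because it was never the innermost frame, yet it has the largest inclusive time because it is on every call path. Using only the innermost frame of each sample corresponds to the program counter view of pcsamp; using the whole stack corresponds to the call path view of usertime.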

After collecting performance information, Open|SpeedShop displays it in detailed reports that allow the user to relate the performance information back to the application source code easily. This information is accessible through a comprehensive graphical user interface (GUI), from a command line interface (CLI), and from within Python scripts. Additionally, the toolset includes a series of analysis techniques, including outlier detection, load balance analysis, and cross-experiment comparisons. Together, these techniques greatly aid the analysis and understanding of parallel application performance.

Open|SpeedShop Program Counter Sampling Example

An Open|SpeedShop user must first set up a run-time environment. This is usually done by loading a module, Dotkit, or SoftEnv file that will set environment variables, including PATH and LD_LIBRARY_PATH, so that Open|SpeedShop tools and libraries can be accessed. A typical run-time environment initialization would include these items:

export OPENSS_PREFIX=/opt/OSS-201
export OPENSS_MPI_IMPLEMENTATION=openmpi
export OPENSS_PLUGIN_PATH=$OPENSS_PREFIX/lib64/openspeedshop
export OPENSS_RAWDATA_DIR=/opt/shared
export LD_LIBRARY_PATH=$OPENSS_PREFIX/lib64:$LD_LIBRARY_PATH
export PATH=$OPENSS_PREFIX/bin:$PATH

The Open|SpeedShop website describes the usage and meaning of these environment variables in detail (BuildAndInstallGuide).
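
Before launching an experiment, it can be useful to confirm that the environment is set up as expected. The following Python sketch checks for the variables used in the initialization example; the function name `check_openss_environment` and the choice of which variables to check are assumptions made for illustration, not part of Open|SpeedShop.

```python
import os

# Variables taken from the initialization example in this article.
EXPECTED_VARS = ("OPENSS_PREFIX", "OPENSS_PLUGIN_PATH", "OPENSS_RAWDATA_DIR")

def check_openss_environment(env=None):
    """Return a list of problems found; an empty list means none were detected."""
    env = os.environ if env is None else env
    problems = [f"{name} is not set" for name in EXPECTED_VARS if not env.get(name)]
    prefix = env.get("OPENSS_PREFIX")
    if prefix:
        # The example prepends $OPENSS_PREFIX/bin to PATH (Linux-style paths).
        bin_dir = prefix.rstrip("/") + "/bin"
        if bin_dir not in env.get("PATH", "").split(":"):
            problems.append(f"{bin_dir} is not on PATH")
    return problems
```

Calling `check_openss_environment()` with no argument inspects the current process environment; passing a dictionary makes the check easy to try out without changing any real settings.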

The workflow model for running Open|SpeedShop on a desktop or cluster system entails running a command that gathers the data and creates an Open|SpeedShop database file containing the performance information and application symbol information. Either the Open|SpeedShop GUI or the interactive CLI tool can then be used to view the data contained in the database file. Each of the above-mentioned experiments has a corresponding convenience command: for example, osspcsamp for the pcsamp experiment, ossusertime for the usertime experiment, and so on. For the examples in this article, I use the application smg2000, a Semicoarsening Multigrid Solver based on the hypre library and taken from the ASCI Purple benchmark suite.

To run a program counter sampling experiment on the smg2000 application on 256 processors, launched with either mpirun or SLURM's srun, first load the Open|SpeedShop module and the module for your MPI implementation (mvapich is shown here; substitute your own):

module load openspeedshop-2.0.1
module load mvapich-1.1

If you normally run your application like this,

mpirun -np 256 smg2000 -n 65 65 65

 or this,

srun -ppbatch -N 32 -n 256 ./smg2000 -n 90 90 90

then to run it under Open|SpeedShop, prefix that command with the convenience command and enclose the original command in quotes:

osspcsamp "mpirun -np 256 smg2000 -n 65 65 65"
osspcsamp "srun -ppbatch -N 32 -n 256 ./smg2000 -n 90 90 90"

When executing the above commands, one sees output from Open|SpeedShop and from the application, followed by the default performance analysis report showing the functions in the application that took the most time. Additionally, an Open|SpeedShop database file is created. This SQLite database file contains the performance information for smg2000 and the debug symbol table information, including source line number information. Because the file carries its own symbol information, it can be moved to any other platform or laptop that has Open|SpeedShop installed and viewed there, if desired.
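
When experiments are launched from a script, the same wrapping can be expressed as an argument list. The Python sketch below assumes only what is described above: the convenience command is named "oss" followed by the experiment name, and the application command is passed as a single quoted argument. The helper names `convenience_command` and `run_experiment` are hypothetical.

```python
import subprocess

def convenience_command(experiment, app_command):
    """Build the argument list for an Open|SpeedShop convenience command.

    The whole application command is passed as ONE argument, which is what
    the double quotes achieve on the shell command line.
    """
    return ["oss" + experiment, app_command]

def run_experiment(experiment, app_command):
    # Needs the convenience commands on PATH; returns the exit status.
    return subprocess.run(convenience_command(experiment, app_command)).returncode

argv = convenience_command(
    "pcsamp", "srun -ppbatch -N 32 -n 256 ./smg2000 -n 90 90 90"
)
```

Passing a list to `subprocess.run` avoids a second layer of shell quoting, so the application command reaches the convenience command exactly as written.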

Here is example output from a pcsamp experiment run on hyperion at Lawrence Livermore National Laboratory (LLNL) using SLURM:

osspcsamp "srun -ppbatch -N 32 -n 256 ./smg2000 -n 90 90 90"

[openss]: pcsamp experiment using the pcsamp experiment default sampling rate: "100".
[openss]: Using OPENSS_PREFIX installed in /home/jeg/chaos_4_x86_64_ib/opt/OSS-mrnet
[openss]: Setting up offline raw data directory in /home/jeg/chaos_4_x86_64_ib/shared/offline-oss
[openss]: Running offline pcsamp experiment using the command:
"srun -ppbatch -N 32 -n 256 /home/jeg/chaos_4_x86_64_ib/opt/OSS-mrnet/bin/ossrun -c pcsamp ./smg2000 -n 90 90 90"

Running with these driver parameters:
 (nx, ny, nz)    = (90, 90, 90)
 (Px, Py, Pz)    = (256, 1, 1)
 (bx, by, bz)    = (1, 1, 1)
 (cx, cy, cz)    = (1.000000, 1.000000, 1.000000)
 (n_pre, n_post) = (1, 1)
 dim             = 3
 solver ID       = 0
=============================================
Struct Interface:
=============================================
Struct Interface:
 wall clock time = 0.431376 seconds
 cpu clock time  = 0.440000 seconds
=============================================
Setup phase times:
=============================================
SMG Setup:
 wall clock time = 5.291889 seconds
 cpu clock time  = 5.300000 seconds
=============================================
Solve phase times:
=============================================
SMG Solve:
 wall clock time = 46.156027 seconds
 cpu clock time  = 46.160000 seconds

Iterations = 7
Final Relative Residual Norm = 3.535135e-07

[openss]: Converting raw data from /home/jeg/chaos_4_x86_64_ib/shared/offline-oss into temp file X.0.openss

Processing raw data for smg2000
Processing processes and threads ...
Processing performance data ...
Processing functions and statements ...

[openss]: Restoring and displaying default view for:
 /home/jeg/chaos_4_x86_64_ib/smg2000/test/smg2000-pcsamp.openss
[openss]: The restored experiment identifier is:  -x 1

 Exclusive    % of CPU  Function (defining location)
CPU time in       Time
 seconds.
5735.470000  47.243309  hypre_SMGResidual (smg2000: smg_residual.c,152)
2874.310000  23.675813  hypre_CyclicReduction (smg2000: cyclic_reduction.c,757)
1293.670000  10.656015  smpi_net_lookup (libmpich.so.1.0: mpid_smpi.c,1370)
 329.590000   2.714847  hypre_SemiInterp (smg2000: semi_interp.c,126)
 276.170000   2.274824  hypre_SemiRestrict (smg2000: semi_restrict.c,125)
 125.830000   1.036467  pthread_spin_lock (libpthread-2.5.so)
 124.440000   1.025018  hypre_SMGAxpy (smg2000: smg_axpy.c,27)
 87.770000    0.722965  __GI_memcpy (libc-2.5.so)
 79.820000    0.657481  hypre_StructAxpy (smg2000: struct_axpy.c,25)
 79.370000    0.653774  hypre_SMGSetStructVectorConstantValues (smg2000: smg.c,379)
 63.160000    0.520252  __munmap (libc-2.5.so)
 58.160000    0.479066  MPIR_UnPack_Hvector (libmpich.so.1.0: dmpipk.c,95)
 58.100000    0.478572  hypre_StructVectorSetConstantValues (smg2000: struct_vector.c,537)
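
Because the report above is plain text, it is straightforward to post-process. The following Python sketch parses rows in the layout shown (seconds, percentage, function name, and defining location in parentheses) and sums the share of the two hottest functions. The regular expression and the function `parse_report` are illustrative assumptions based solely on this sample output; other views or versions may format their reports differently.

```python
import re

# seconds, percent, function name, "(defining location)"
ROW = re.compile(r"^\s*(\d+\.\d+)\s+(\d+\.\d+)\s+(\S+)\s+\((.+)\)\s*$")

def parse_report(text):
    """Return (seconds, percent, function, location) tuples from a report."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:  # header lines do not match and are skipped
            seconds, percent, function, location = match.groups()
            rows.append((float(seconds), float(percent), function, location))
    return rows

# A few lines copied from the sample output in this article.
REPORT = """\
 Exclusive    % of CPU  Function (defining location)
CPU time in       Time
 seconds.
5735.470000  47.243309  hypre_SMGResidual (smg2000: smg_residual.c,152)
2874.310000  23.675813  hypre_CyclicReduction (smg2000: cyclic_reduction.c,757)
1293.670000  10.656015  smpi_net_lookup (libmpich.so.1.0: mpid_smpi.c,1370)
 125.830000   1.036467  pthread_spin_lock (libpthread-2.5.so)
"""

rows = parse_report(REPORT)
top_two_share = sum(percent for _, percent, _, _ in rows[:2])
```

For this run, the two hottest functions, hypre_SMGResidual and hypre_CyclicReduction, together account for roughly 71 percent of the sampled CPU time, which makes them the natural starting point for further investigation.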

The default report, created when the osspcsamp command was executed, displays the functions in the smg2000 application that took the most time. A user can further examine the performance information with the CLI or GUI by opening the Open|SpeedShop database file created during the experiment. With the command

openss -f smg2000-pcsamp.openss

the GUI is raised and displays the program counter sampling experiment default view as shown in Figure 1. Note that the naming convention for Open|SpeedShop database files uses the .openss suffix.

Figure 1: Default GUI program counter sampling view.

By choosing Statements as the View/Display Choice on the right side of the GUI Stats Panel window and then clicking on the D icon, which represents the default view selection, one can see which statements in smg2000 took the most time (Figure 2). Double-clicking on a line of performance information in the Stats Panel raises the Source Panel and focuses it on the line in the application source code that corresponds to the performance information. With this feature, one can quickly see where in the application source the performance issue shows up.

Figure 2: Statement view with Source Panel.

In the load balance view (Figure 3), one can look across all ranks included in the application execution at statement-level granularity. To generate this view, select Statements as the View/Display Choice and then click on the LB load balance icon. The information displayed is the minimum, maximum, and average exclusive time recorded for each statement in the program across all ranks, threads, or processes. In this case, a user sees rank information because this is an MPI application run. Additionally, this view displays the rank numbers of the minimum and maximum values to help focus on any possible outliers (a rank, thread, or process whose performance differs from that of the majority of the other ranks, threads, or processes). Use this view to determine whether the run is imbalanced. If the minimum and maximum values for key functions, statements, or libraries vary by a significant amount, then the application run is likely not well balanced.

Figure 3: Program counter sampling load balance view.
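
The quantities shown in the load balance view can be sketched in a few lines of Python. This is an illustration of the idea only, not how Open|SpeedShop computes the view: the functions `load_balance` and `looks_imbalanced`, the 25 percent tolerance, and the per-rank times are all made up for the example.

```python
def load_balance(times_by_rank):
    """Summarize one statement's exclusive time across ranks.

    times_by_rank maps rank number -> exclusive seconds for that statement.
    """
    min_rank = min(times_by_rank, key=times_by_rank.get)
    max_rank = max(times_by_rank, key=times_by_rank.get)
    average = sum(times_by_rank.values()) / len(times_by_rank)
    return {
        "min": times_by_rank[min_rank], "min_rank": min_rank,
        "max": times_by_rank[max_rank], "max_rank": max_rank,
        "average": average,
    }

def looks_imbalanced(summary, tolerance=0.25):
    # Hypothetical rule of thumb: flag the statement when the spread
    # (max - min) exceeds the given fraction of the average.
    return (summary["max"] - summary["min"]) > tolerance * summary["average"]

# Made-up per-rank times (seconds) for a single statement.
times = {0: 22.0, 1: 23.0, 2: 21.0, 3: 34.0}
summary = load_balance(times)
imbalanced = looks_imbalanced(summary)
```

Here rank 3 holds the maximum and rank 2 the minimum, and the spread between them is large relative to the average, so this statement would be flagged. What counts as "significant" depends on the application, which is why the tolerance is left as a parameter.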

If imbalance is suspected, the comparative analysis CA icon can be selected to run a cluster analysis algorithm on the performance information. In general, the cluster analysis algorithm groups processes, threads, or ranks into sets of like-performing entities, thus exposing the ranks, threads, or processes that are not performing the way the other groups are. Each group is displayed as a column in the comparative analysis view (Figure 4). From this view, one can see which ranks are in the outlier group(s) and then examine their performance information, individually or as a group, with the other Open|SpeedShop views.

Figure 4: Program counter sampling comparative analysis view.
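
The article does not specify which cluster analysis algorithm Open|SpeedShop uses, so the following Python sketch substitutes a deliberately simple stand-in: sort the ranks by time and start a new group wherever the gap to the previous rank exceeds a threshold. The function `group_ranks`, the `max_gap` parameter, and the data are hypothetical; the sketch assumes at least one rank is present.

```python
def group_ranks(times_by_rank, max_gap):
    """Group ranks whose times lie within max_gap of their sorted neighbor."""
    ordered = sorted(times_by_rank.items(), key=lambda item: item[1])
    groups = [[ordered[0][0]]]
    for (_, prev_time), (rank, time) in zip(ordered, ordered[1:]):
        if time - prev_time > max_gap:
            groups.append([])  # large jump in time: start a new group
        groups[-1].append(rank)
    return groups

# Made-up per-rank exclusive times (seconds) for one function.
times = {0: 22.0, 1: 23.0, 2: 21.0, 3: 34.0}
groups = group_ranks(times, max_gap=5.0)

# Treat any group holding fewer than half of the ranks as an outlier group.
outliers = [group for group in groups if len(group) < len(times) / 2]
```

With this data, ranks 0, 1, and 2 fall into one group and rank 3 ends up alone, mirroring the way the comparative analysis view separates like-performing ranks from outliers into columns.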