Look for Bottlenecks with Open|SpeedShop

Open|SpeedShop is an open source multiplatform Linux performance tool targeted at performance analysis of applications running on both single nodes and large-scale IA64, IA32, EM64T, AMD64, IBM Power PC, Cray, and IBM Blue Gene platforms. Open|SpeedShop operates on existing application binaries, so there is no need to recompile the application being analyzed. Open|SpeedShop gathers several types of performance information, relates that information back to the application source code, and then displays the associated performance information to the user. Open|SpeedShop is in use at a number of laboratories, universities, and corporations worldwide, helping software developers and users speed up applications and reduce time to solution.

Open|SpeedShop supports performance analysis of sequential, MPI, OpenMP, and threaded applications and has been tested on several of the most common Linux distributions with the most commonly used MPI implementations, including SGI MPT, MPICH2 variants, MVAPICH, MVAPICH2, and Open MPI.

In this article, I will describe how to use Open|SpeedShop through step-by-step examples illustrating how to find a number of different performance bottlenecks. Additionally, I will describe the tool's most common usage model (workflow) and provide several performance data viewing options.

Open|SpeedShop uses both statistical sampling and traditional tracing techniques to record performance information. The central concept in its workflow is an experiment. An experiment defines what type of performance information is being measured and the program being analyzed. Users select their experiment at the beginning of any performance analysis run, depending on what kind of performance bottleneck they would like to investigate.
The main types of data gathered via sampling techniques are program counter information (pcsamp), call path information (usertime), and hardware counter information (hwc, hwctime, hwcsamp). Experiment names appear in parentheses; for example, pcsamp is the name of the Program Counter Sampling experiment. Sampling works by periodically interrupting execution, recording the current location, and then reporting the statistical distribution across all recorded locations.

Tracing techniques are used to gather Input/Output information (io, iot), MPI function-specific information (mpi, mpit, mpiotf), and Floating Point Exception information (fpe). Tracing involves gathering and storing individual application events, such as function invocations, MPI messages, and I/O calls. Events are typically time stamped and provide detailed per-event information.

Table 1 describes the specific performance issues that each Open|SpeedShop experiment is designed to reveal.

Table 1: Summary of Experiments
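As a rough illustration of the sampling idea described above (interrupt periodically, record the location, report the distribution), the following Python sketch tallies a list of sampled locations into a per-function time estimate. It is not Open|SpeedShop code: the function name `summarize_samples`, the sample data, and the assumption that each sample stands for 1/rate seconds (with a rate of 100 samples per second) are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def summarize_samples(samples, rate_hz=100):
    """Turn sampled locations into (name, estimated seconds, percent) rows.

    samples: iterable of function names, one per sampling interrupt.
    rate_hz: assumed sampling rate in samples per second (hypothetical).
    """
    counts = Counter(samples)
    total = sum(counts.values())
    rows = []
    # most_common() orders functions from most to fewest samples.
    for name, hits in counts.most_common():
        seconds = hits / rate_hz          # each sample stands for 1/rate seconds
        percent = 100.0 * hits / total    # share of all recorded samples
        rows.append((name, seconds, percent))
    return rows


# Hypothetical samples: the names are borrowed from smg2000 for readability,
# the counts are invented.
samples = (["hypre_SMGResidual"] * 6
           + ["hypre_CyclicReduction"] * 3
           + ["hypre_SemiInterp"])

for name, seconds, percent in summarize_samples(samples):
    print(f"{seconds:10.2f} {percent:10.2f}  {name}")
```

Because the result is a statistical estimate, a function that runs only briefly may receive few or no samples; longer runs yield more samples and therefore a steadier distribution.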
After collecting performance information, Open|SpeedShop displays it in detailed reports that allow the user to relate the performance information back to the application source code easily. This information is accessible through a comprehensive graphical user interface (GUI), from a command line interface (CLI), and from within Python scripts. Additionally, the toolset includes a series of analysis techniques, including outlier detection, load balance analysis, and cross-experiment comparisons. Open|SpeedShop thus provides a comprehensive set of techniques that greatly aid the analysis and understanding of parallel application performance.

Open|SpeedShop Program Counter Sampling Example

An Open|SpeedShop user must first set up a run-time environment. This is usually done by loading a module, Dotkit, or SoftEnv file that sets environment variables, including PATH and LD_LIBRARY_PATH, so that Open|SpeedShop tools and libraries can be accessed. A typical run-time environment initialization would include these items:

    export OPENSS_PREFIX=/opt/OSS-201
    export OPENSS_MPI_IMPLEMENTATION=openmpi
    export OPENSS_PLUGIN_PATH=$OPENSS_PREFIX/lib64/openspeedshop
    export OPENSS_RAWDATA_DIR=/opt/shared
    export LD_LIBRARY_PATH=$OPENSS_PREFIX/lib64:$LD_LIBRARY_PATH
    export PATH=$OPENSS_PREFIX/bin:$PATH

The Open|SpeedShop website describes the usage and meaning of these environment variables in detail (BuildAndInstallGuide).

The workflow model for running Open|SpeedShop on a desktop or cluster system entails a command to gather the data and create an Open|SpeedShop database file containing the performance information and application symbol information. The Open|SpeedShop GUI or the interactive CLI tool then enables viewing the data contained in the database file. Each of the above-mentioned experiments has a corresponding convenience command: for example, osspcsamp for the pcsamp experiment, ossusertime for the usertime experiment, and so on.
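The naming pattern for convenience commands can be sketched in a few lines of Python. This is only an illustration, not part of Open|SpeedShop: the helper `convenience_argv` is hypothetical, and it assumes that every experiment named in this article follows the same "oss" plus experiment name pattern shown for osspcsamp and ossusertime. The application's usual launch command is passed as one single argument, which is what the quotes achieve in the command-line examples later in the article.

```python
import shlex

# Experiment names as listed in this article.
EXPERIMENTS = {
    "pcsamp", "usertime", "hwc", "hwctime", "hwcsamp",
    "io", "iot", "mpi", "mpit", "mpiotf", "fpe",
}


def convenience_argv(experiment, launch_command):
    """Build the argument list for a convenience command (hypothetical helper).

    The launch command stays a single argument, exactly as if it had been
    quoted on the shell command line.
    """
    if experiment not in EXPERIMENTS:
        raise ValueError(f"unknown experiment: {experiment}")
    return ["oss" + experiment, launch_command]


argv = convenience_argv("pcsamp", "mpirun -np 256 smg2000 -n 65 65 65")
# shlex.join (Python 3.8+) renders the list as a shell command line.
print(shlex.join(argv))
```

The printed command uses single quotes where the article's examples use double quotes; for a launch command without shell variables, the shell treats both forms the same way. A list built like this could be handed to `subprocess.run` on a system where Open|SpeedShop is installed.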
I use the application smg2000, a Semicoarsening Multigrid Solver based on the hypre library and taken from the ASCI Purple benchmark suite, for the examples in this article. To run a program counter sampling experiment on the smg2000 application on 256 processors using openmpi or SLURM, you would first load the Open|SpeedShop module and the module for your MPI implementation (mvapich-1.1 in this case):

    module load openspeedshop-2.0.1
    module load mvapich-1.1

If you normally run your application like this,

    mpirun -np 256 smg2000 -n 65 65 65

or this,

    srun -ppbatch -N 32 -n 256 ./smg2000 -n 90 90 90

then to run with Open|SpeedShop, add the convenience command and put quotes around the command normally used to execute the application outside of Open|SpeedShop:

    osspcsamp "mpirun -np 256 smg2000 -n 65 65 65"
    osspcsamp "srun -ppbatch -N 32 -n 256 ./smg2000 -n 90 90 90"

When executing the above commands, one sees output from Open|SpeedShop and from the application, followed by the default performance analysis report showing the functions in the application that took the most time. Additionally, an Open|SpeedShop database file is created. This SQLite database file contains the performance information for smg2000 and the debug symbol table information, including source line number information. Because the file carries both, it can be moved to any other platform or laptop that has Open|SpeedShop installed for viewing, if desired.

Here is the example output from a pcsamp experiment run on hyperion at Lawrence Livermore National Laboratory (LLNL) using SLURM:

    osspcsamp "srun -ppbatch -N 32 -n 256 ./smg2000 -n 90 90 90"
    [openss]: pcsamp experiment using the pcsamp experiment default sampling rate: "100".
    [openss]: Using OPENSS_PREFIX installed in /home/jeg/chaos_4_x86_64_ib/opt/OSS-mrnet
    [openss]: Setting up offline raw data directory in /home/jeg/chaos_4_x86_64_ib/shared/offline-oss
    [openss]: Running offline pcsamp experiment using the command:
    "srun -ppbatch -N 32 -n 256 /home/jeg/chaos_4_x86_64_ib/opt/OSS-mrnet/bin/ossrun -c pcsamp ./smg2000 -n 90 90 90"
    Running with these driver parameters:
      (nx, ny, nz) = (90, 90, 90)
      (Px, Py, Pz) = (256, 1, 1)
      (bx, by, bz) = (1, 1, 1)
      (cx, cy, cz) = (1.000000, 1.000000, 1.000000)
      (n_pre, n_post) = (1, 1)
      dim = 3
      solver ID = 0
    =============================================
    Struct Interface:
    =============================================
    Struct Interface:
      wall clock time = 0.431376 seconds
      cpu clock time = 0.440000 seconds
    =============================================
    Setup phase times:
    =============================================
    SMG Setup:
      wall clock time = 5.291889 seconds
      cpu clock time = 5.300000 seconds
    =============================================
    Solve phase times:
    =============================================
    SMG Solve:
      wall clock time = 46.156027 seconds
      cpu clock time = 46.160000 seconds
    Iterations = 7
    Final Relative Residual Norm = 3.535135e-07
    [openss]: Converting raw data from /home/jeg/chaos_4_x86_64_ib/shared/offline-oss into temp file X.0.openss
    Processing raw data for smg2000
    Processing processes and threads ...
    Processing performance data ...
    Processing functions and statements ...
    [openss]: Restoring and displaying default view for:
    /home/jeg/chaos_4_x86_64_ib/smg2000/test/smg2000-pcsamp.openss
    [openss]: The restored experiment identifier is: -x 1

      Exclusive   % of CPU  Function (defining location)
    CPU time in       Time
       seconds.
    5735.470000  47.243309  hypre_SMGResidual (smg2000: smg_residual.c,152)
    2874.310000  23.675813  hypre_CyclicReduction (smg2000: cyclic_reduction.c,757)
    1293.670000  10.656015  smpi_net_lookup (libmpich.so.1.0: mpid_smpi.c,1370)
     329.590000   2.714847  hypre_SemiInterp (smg2000: semi_interp.c,126)
     276.170000   2.274824  hypre_SemiRestrict (smg2000: semi_restrict.c,125)
     125.830000   1.036467  pthread_spin_lock (libpthread-2.5.so)
     124.440000   1.025018  hypre_SMGAxpy (smg2000: smg_axpy.c,27)
      87.770000   0.722965  __GI_memcpy (libc-2.5.so)
      79.820000   0.657481  hypre_StructAxpy (smg2000: struct_axpy.c,25)
      79.370000   0.653774  hypre_SMGSetStructVectorConstantValues (smg2000: smg.c,379)
      63.160000   0.520252  __munmap (libc-2.5.so)
      58.160000   0.479066  MPIR_UnPack_Hvector (libmpich.so.1.0: dmpipk.c,95)
      58.100000   0.478572  hypre_StructVectorSetConstantValues (smg2000: struct_vector.c,537)

The default report, created when the osspcsamp command was executed, displays the functions in the smg2000 application that took the most time. A user can further examine the performance information with the CLI or GUI by opening the Open|SpeedShop database file created during the experiment. The command

    openss -f smg2000-pcsamp.openss

raises the GUI and displays the program counter sampling experiment default view, as shown in Figure 1. Note that the naming convention for Open|SpeedShop database files uses the .openss suffix.

By choosing Statements as the View/Display Choice on the right side of the GUI Stats Panel window and then clicking on the D icon, which represents the default view selection, one can view which statements in smg2000 took the most time (Figure 2). Double-clicking on a line of performance information in the Stats Panel raises the Source Panel and focuses it on the line in the application source code that corresponds to the performance information. With this feature, one can quickly see where in the application source the performance issue shows up.
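For readers who want to post-process the text of the default report shown earlier, the following Python sketch parses a few of its rows and derives two quick figures: the cumulative share of the top functions and the total CPU time implied by one row's seconds and percentage. It is a minimal sketch that works only on report text; it does not use Open|SpeedShop's own Python interface, and the regular expression `ROW` and the helper `parse_report` are hypothetical names that assume the three-column layout shown in this article.

```python
import re

# One report row: seconds, percent, function name, optional "(location)".
ROW = re.compile(
    r"^\s*(?P<seconds>\d+\.\d+)\s+(?P<percent>\d+\.\d+)\s+"
    r"(?P<function>\S+)\s*(?:\((?P<location>[^)]*)\))?\s*$"
)


def parse_report(text):
    """Return one dict per data row; header and other lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            rows.append({
                "seconds": float(match["seconds"]),
                "percent": float(match["percent"]),
                "function": match["function"],
                "location": match["location"],
            })
    return rows


# Rows copied from the pcsamp report in this article.
report = """\
  Exclusive   % of CPU  Function (defining location)
5735.470000  47.243309  hypre_SMGResidual (smg2000: smg_residual.c,152)
2874.310000  23.675813  hypre_CyclicReduction (smg2000: cyclic_reduction.c,757)
1293.670000  10.656015  smpi_net_lookup (libmpich.so.1.0: mpid_smpi.c,1370)
 125.830000   1.036467  pthread_spin_lock (libpthread-2.5.so)
"""

rows = parse_report(report)
top3_percent = sum(row["percent"] for row in rows[:3])
# seconds / (percent / 100) gives the total CPU time behind the percentages.
implied_total = rows[0]["seconds"] / (rows[0]["percent"] / 100.0)

print(f"top 3 functions: {top3_percent:.2f}% of sampled CPU time")
print(f"implied total CPU time: {implied_total:.0f} seconds")
```

In this run the first three functions account for roughly 82 percent of the sampled CPU time, which suggests where tuning effort is most likely to pay off.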
In the load balance view (Figure 3), one can look across all ranks included in the application execution at statement-level granularity. To generate this view, select Statements as the View/Display Choice and then click on the LB load balance icon. The information displayed is the minimum, maximum, and average exclusive time recorded for each statement in the program across all ranks, threads, or processes. In this case, a user sees rank information because this is an MPI application run. Additionally, this view displays the rank numbers of the minimum and maximum values to help focus on any possible outliers (a rank, thread, or process that is performing outside of the majority of the other ranks, threads, or processes).

Use this view to determine whether there is an imbalance. If the minimum and maximum values for key functions, statements, or libraries vary by a significant amount, then the application run is likely not well balanced. If imbalance is suspected, the comparative analysis CA icon can be selected to run a cluster analysis algorithm on the performance information. In general, the cluster analysis algorithm groups processes, threads, or ranks into sets of like-performing entities, thus exposing the ranks, threads, or processes that are not performing the way the other groups are. Each group is displayed as a column in the comparative analysis view (Figure 4). From this view, one can see which ranks are in the outlier group(s) and then examine their performance information individually or as a group with the other Open|SpeedShop views.
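The figures shown in the load balance view can be reproduced by hand for a single statement. The Python sketch below computes the minimum, maximum, and average exclusive time across ranks, along with the ranks that hold the extremes. It is a minimal illustration, not Open|SpeedShop's implementation: the helper `load_balance` and the per-rank timings are hypothetical, and it does not attempt the cluster analysis behind the comparative analysis view.

```python
from statistics import mean


def load_balance(times_by_rank):
    """Summarize one statement's exclusive time across ranks.

    times_by_rank: non-empty dict mapping rank number -> exclusive seconds.
    """
    min_rank = min(times_by_rank, key=times_by_rank.get)
    max_rank = max(times_by_rank, key=times_by_rank.get)
    return {
        "min": times_by_rank[min_rank],
        "min_rank": min_rank,
        "max": times_by_rank[max_rank],
        "max_rank": max_rank,
        "avg": mean(list(times_by_rank.values())),
    }


# Hypothetical exclusive times (seconds) for one statement on four ranks.
times = {0: 22.1, 1: 22.4, 2: 22.3, 3: 31.0}

summary = load_balance(times)
print(f"min {summary['min']:.2f} s on rank {summary['min_rank']}")
print(f"max {summary['max']:.2f} s on rank {summary['max_rank']}")
print(f"avg {summary['avg']:.2f} s")
```

In this made-up data set, rank 3 spends noticeably longer in the statement than the other ranks, which is the kind of gap between minimum and maximum that hints at an imbalance worth examining more closely.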
