HPC Storage – I/O Profiling

HPC Storage is arguably one of the most pressing issues in HPC. Selecting various HPC Storage solutions is a problem that requires some research, study, and planning to be effective – particularly cost-effective. Getting this process started usually means understanding how your applications perform I/O. This article presents some techniques for examining I/O patterns in HPC applications.

HPC Storage – Getting Started with I/O Profiling

By Jeff Layton

It's fairly obvious to say that storage is one of the largest issues, if not the largest issue, in HPC. All aspects of HPC storage are becoming critical to the overall success or productivity of systems: high performance, high reliability, access protocols, scalability, ease of management, price, power, and so on. These aspects, and perhaps more importantly, combinations of these aspects, are key drivers in HPC systems and HPC performance.

With so many options and so many key aspects to HPC storage, a logical question one can ask is: Where should one start? Will a NAS (Network Attached Storage) solution work for my system? Do I need a high-performance parallel file system? Do I use a SAN (Storage Area Network) as a back end for my storage, or can I use less expensive DAS (Direct Attached Storage)? Should I use InfiniBand for the compute node storage traffic or will GigE or 10GigE be sufficient? Do I use 15,000 rpm drives or should I use 7,200 rpm drives? Do I need SSDs (Solid State Disks)? Which file system should I use? Which I/O scheduler should I use within Linux? Should I tune my network interfaces? How can I take snapshots of the storage and do I really need snapshots? How can I tell if my storage is performing well enough? How do I manage my storage? How do I monitor my storage? Do I need a backup or do I really only need a copy of the data? How can I monitor the state of my storage? Do I need quotas and how do I enforce them? How can I scale my storage in terms of performance and capacity? Do I need a single namespace? How can I do a filesystem check, how long will it take, and do I need one? Do I need cold spare drives or storage chassis? What RAID level do I need (file or object)? How many hot spares are appropriate? SATA versus SAS? And on, and on. When designing or selecting HPC storage, you have many questions to consider, but you might have noticed one item that I left out of the laundry list of questions. If you noticed I didn't discuss applications, you are correct.

Designing HPC storage, just as with designing the compute nodes and network layout, starts with the applications. Although designing hardware and software for storage is very important, without considering the I/O needs of your application mix, the entire process is just an exercise with no real value. You are designing a solution without understanding the problem. So the first thing you really should consider is the I/O pattern and I/O needs of your applications.

Some people might think that understanding application I/O patterns is a futile effort because, with thousands of applications, they think it's impossible to understand the I/O pattern of them all. However, it is possible to stack-rank the applications beginning with the few that use the most compute time or seemingly use a great deal of I/O. Then, you can begin to examine their I/O patterns.

Tools to Help Determine I/O Usage

You have several options for determining the I/O patterns of your applications. One obvious way is to run the application and monitor storage while it is running. Measuring I/O usage on your entire cluster might not always be the easiest thing to do, but some basic tools can help you. For example, tools such as sar, iotop, iostat, nfsiostat, and collectl can be used to help you measure your I/O usage. (Note: This is by no means an exhaustive list of possible tools.)

iotop
Using iotop you can get quite a few I/O stats for a particular system. It runs on a single system, such as a storage server or a compute node, and measures the I/O on that system for all processes, but on a per-process basis, allowing you to watch particular processes (such as an HPC application). One way to use this tool is to run it on all of the compute nodes that are running a particular application, perhaps as part of a job script. When the MPI job runs, you will get an output file for each node, but before collecting them together, be sure you use a unique name for each node. Once you have gathered all of the output files, you should sort the data to get only the information for the MPI application. Finally, you can take the data for each MPI process and create a time history of the I/O usage of the application and perhaps find which process is creating the most I/O.
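The per-node collection and filtering step described above could be sketched in Python roughly as follows. This is only a sketch under stated assumptions: it assumes iotop was run in batch mode with something like `iotop -b -o -t -k -qqq`, and that each output line follows the layout noted in the comment (check the output of your iotop version before relying on the pattern). The helper names `node_output_name` and `app_time_history` are hypothetical.

```python
import re
import socket
from collections import defaultdict

# Assumed line layout for `iotop -b -o -t -k -qqq` (verify against your version):
# HH:MM:SS  TID  PRIO  USER  <read> K/s  <write> K/s  <swapin> %  <io> %  COMMAND
IOTOP_LINE = re.compile(
    r"^\s*(?P<time>\d{2}:\d{2}:\d{2})\s+(?P<tid>\d+)\s+(?P<prio>\S+)\s+"
    r"(?P<user>\S+)\s+(?P<read>[\d.]+)\s+K/s\s+(?P<write>[\d.]+)\s+K/s\s+"
    r"[\d.]+\s+%\s+[\d.]+\s+%\s+(?P<command>.*)$"
)


def node_output_name(prefix="iotop"):
    """Build a per-node file name so outputs from different nodes do not collide."""
    return f"{prefix}.{socket.gethostname()}.out"


def app_time_history(lines, app_name):
    """Return {time: [read_KB_per_s, write_KB_per_s]} summed over matching processes."""
    history = defaultdict(lambda: [0.0, 0.0])
    for line in lines:
        match = IOTOP_LINE.match(line)
        # Skip headers, unparsable lines, and processes that are not the application.
        if match is None or app_name not in match.group("command"):
            continue
        entry = history[match.group("time")]
        entry[0] += float(match.group("read"))
        entry[1] += float(match.group("write"))
    return dict(history)
```

Running `app_time_history` over each node's file gives one time history per node; keeping the per-process lines separate instead of summing them would show which process generates the most I/O.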

Depending on the storage solution you are using, you might be able to use iotop to measure I/O usage on the data servers. For example, if you are using NFS on a single server, you could run iotop on that server and measure I/O usage for the nfsd processes.

However, using iotop to measure I/O patterns really only gives you an overall picture without a great deal of detail. For example, it is probably impossible to determine whether the application is doing I/O to a central storage server or to the local node. It is impossible to use iotop to determine how the application is doing the I/O. Moreover, you really only get an idea of the throughput and not the IOPS that the application is generating.

iostat
Iostat allows you to collect quite a few I/O statistics, as well, and even allows you to specify a particular device on the node, but it does not separate out I/O usage on a per-process basis (i.e., you get an aggregate view of all I/O usage on a particular compute node). However, iostat gives you a much larger array of statistics than iotop. In general, you get two types of reports with iostat. (You can use options to get one report or the other, but the default is both types of reports.) The first report has CPU usage, and the second report shows device utilization. The CPU report contains the following information:

  • %user: The percentage of CPU utilization that occurred while executing at the user level (this is application usage).
  • %nice: The percentage of CPU utilization that occurred while executing at the user level with nice priority.
  • %system: The percentage of CPU utilization that occurred while executing at the system level (kernel).
  • %iowait: The percentage of time the CPU or CPUs were idle, during which the system had an outstanding disk I/O request.
  • %steal: The percentage of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor.
  • %idle: The percentage of time the CPU or CPUs were idle and the system did not have an outstanding disk I/O request.

The values are computed as system-wide averages for all processors when your system has more than one core (which is pretty much everything today).

The second report prints out all kinds of details about device utilization (the device can be a physical device or a partition). If you don't specify a device on the command line, then iostat will print out values for all devices (alternatively, you can use ALL as the device). Typically the report output includes the following:

  • Device: Device name
  • rrqm/s: The number of read requests merged per second that were queued to the device
  • wrqm/s: The number of write requests merged per second that were queued to the device
  • r/s: The number of read requests issued to the device per second.
  • w/s: The number of write requests issued to the device per second.
  • rMB/s: The number of megabytes read from the device per second.
  • wMB/s: The number of megabytes written to the device per second.
  • avgrq-sz: The average size (in sectors) of the requests issued to the device.
  • avgqu-sz: The average queue length of the requests issued to the device.
  • await: The average time (milliseconds) for I/O requests issued to the device to be served. This includes the time spent by the requests in a queue and the time spent servicing them.
  • svctm: The average service time (milliseconds) for I/O requests issued to the device.
  • %util: Percentage of CPU time during which I/O requests were issued to the device (bandwidth utilization for the device). Device saturation occurs when this value is close to 100%.

These fields are specific to the set of options used, as well as to the device.
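Because the columns depend on the options and the sysstat version, one way to post-process a device report is to read the column names from the header line rather than hard-coding positions. The following Python sketch does that and derives an aggregate IOPS figure from r/s and w/s. It assumes the report text was captured from an extended report (for example, `iostat -x -m`); the function names are hypothetical.

```python
def parse_iostat_device_report(text):
    """Map device name -> {column name: value} for one iostat device report."""
    rows = {}
    header = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            header = None  # a blank line ends the current report
        elif fields[0].rstrip(":") == "Device":
            header = fields[1:]  # column names, e.g. r/s, w/s, rMB/s, ...
        elif header is not None and len(fields) == len(header) + 1:
            rows[fields[0]] = dict(zip(header, map(float, fields[1:])))
    return rows


def iops(stats):
    """Read plus write requests issued to the device per second."""
    return stats["r/s"] + stats["w/s"]


def throughput_mb_s(stats):
    """Combined read and write throughput in MB/s (needs the rMB/s and wMB/s columns)."""
    return stats["rMB/s"] + stats["wMB/s"]
```

Keep in mind that the first report iostat prints covers the time since boot, so it is usually the later interval reports that are worth feeding into a function like this.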

Relative to iotop, iostat gives more detail on the actual I/O, but it does it in an aggregate manner (i.e., not a per-process basis). Perhaps more importantly, iostat reports on a device basis, not a mountpoint basis. Because the HPC application is likely writing to a central storage server mounted on the compute node, it is not very likely that the application is writing to a device on each node (however, you can capture local I/O using iostat), so the most likely use of iostat is on the storage server or servers.

nfsiostat
A tool called nfsiostat, which is very similar to iostat, is part of the sysstat collection, but it allows you to monitor NFS-mounted filesystems, so this tool can be used for MPI applications that are using NFS as a central filesystem. It produces the same information as iostat (see above). In the case of HPC applications, you would run nfsiostat on each compute node and collect the information and process it in much the way you did with iotop.

sar
sar is one of the most common tools for gathering information about system performance. Many admins commonly use sar to gather information about CPU utilization, I/O, and network information on systems. I won't go into detail about using sar because there are so many articles around the web about using it, but it can be used to examine the I/O pattern of HPC applications at a higher level.

Like iotop and nfsiostat, you would run sar on each compute node and gather the statistical I/O information, or you could just let it gather all information and sort out the I/O information. Then, you could gather all of that information together and create a time history of overall application I/O usage.

collectl
collectl is sort of an uber-tool for collecting performance stats on a system. It is more HPC oriented than sar, iotop, iostat, or nfsiostat because it has hooks to monitor NFS and Lustre. But it too requires that you run it on all nodes to get an understanding of what I/O load is imposed throughout the cluster.

Like sar, you run collectl on each compute node and gather the I/O information for that node. However, it does allow you to gather information about processes and threads, so you can capture a bit more information than sar. Then, you have to gather all of the information for each MPI process or compute node and create a time history of the I/O pattern of the application.

A similar tool called collectd is run as a daemon to collect performance statistics much like collectl.

These tools can help you understand what is happening on individual systems, but you have to gather the information on each compute node or for each MPI process and create your time history or statistical analysis of the I/O usage pattern. Moreover, they don't do a good job, if at all, of watching IOPS usage on systems, and IOPS can be a very important requirement for HPC systems.
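Whichever tool produces the per-node numbers, the final step of building a cluster-wide time history looks much the same. The Python sketch below is one possible way to do it, assuming you have already reduced each node's output to (timestamp in seconds, MB/s) samples and that the node clocks are reasonably well synchronized. The function name and the data layout are illustrative assumptions, not part of any of the tools above.

```python
from collections import defaultdict


def cluster_time_history(per_node_samples, bucket_seconds=1.0):
    """Combine per-node samples into a cluster-wide time history.

    per_node_samples: {node_name: [(timestamp_s, mb_per_s), ...]}
    Returns a sorted list of (bucket_start_s, total_mb_per_s).
    """
    totals = defaultdict(float)
    for samples in per_node_samples.values():
        # Average each node's samples within a time bucket first, so a node
        # that reports more often does not count more than once per bucket.
        per_bucket = defaultdict(list)
        for timestamp, rate in samples:
            per_bucket[int(timestamp // bucket_seconds)].append(rate)
        for bucket, rates in per_bucket.items():
            totals[bucket] += sum(rates) / len(rates)
    return [(bucket * bucket_seconds, totals[bucket]) for bucket in sorted(totals)]
```

Averaging within a bucket before summing across nodes is a design choice for rate-style data such as MB/s; for counters (bytes or operations) you would sum within the bucket instead.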

blktrace

Other tools can help you understand more detailed I/O usage, but at a block level, allowing you to capture more information, such as IOPS. For example, blktrace can watch what is happening on specific block devices, so you can use this tool on storage servers to watch I/O usage. For example, if you are using a Linux NFS server, you could watch the block devices underlying the NFS file system, or if you are using Lustre, you could use blktrace to monitor block devices on the OSS nodes.

Blktrace can be a very useful tool because it also allows you to compute IOPS (I think it's only "Total IOPS"). Also, a tool called seekwatcher can be used to plot results from blktrace. An example on the website illustrates this.

Obviously, no one tool can give you all the information you need across a range of nodes. Perhaps a combination of iotop, iostat, nfsiostat, and collectl coupled with blktrace can give you a better picture of what your HPC storage is doing as a whole. Coordinating this data to generate a good picture of I/O patterns is not easy and will likely involve some coding. But if you assume that you can create this picture of I/O usage, you have to correlate it with the job history from the job scheduler to help determine which applications are using more I/O than others.
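As a rough illustration of the correlation step, the following Python sketch attributes I/O samples to jobs by matching node and time window. It assumes you have exported the job history from your scheduler into simple records with an id, start time, end time, and node set, and that nodes are allocated exclusively to one job at a time. All names and the record layout here are hypothetical and do not correspond to any particular scheduler's format.

```python
def io_per_job(samples, jobs):
    """Attribute I/O volume to jobs and rank the jobs by total I/O.

    samples: [(timestamp_s, node_name, megabytes), ...]
    jobs:    [{"id": ..., "start": ..., "end": ..., "nodes": set_of_node_names}, ...]
    Returns [(job_id, total_megabytes), ...], largest first.
    """
    totals = {job["id"]: 0.0 for job in jobs}
    for timestamp, node, megabytes in samples:
        for job in jobs:
            # A sample belongs to a job if it was taken on one of the job's
            # nodes while the job was running.
            if node in job["nodes"] and job["start"] <= timestamp < job["end"]:
                totals[job["id"]] += megabytes
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Samples that fall outside every job window are simply dropped, which is one way to separate application I/O from background system activity.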

However, these tools only tell you what is happening in aggregate, focusing primarily on throughput, although blktrace can give you some IOPS information. They can't really tell you what the application is doing in more detail, such as the order of reads and writes, the amount of data in each read or write function, and information on lseeks or other I/O functions. In essence, what is missing is the ability to look at I/O patterns from the application level. In the next section, I'll present a technique and application that can be used to help you understand the I/O pattern of your application.

Determining I/O Patterns

One of the keys to selecting HPC storage is to understand the I/O pattern of your application. This isn't an easy task to accomplish overall, and several attempts have been made over the years to help you understand I/O patterns. One method I have been using is to use strace (a system call tracer) to profile the I/O pattern of an application.

Because virtually all I/O from an application ultimately goes through system calls, you can use strace to capture the I/O profile of an application. For example, using the command

strace -T -ttt -o strace.out [application]

on an [application] doing I/O, might output a line like this:

1269925389.847294 write(17, " 37989 2136595 2136589 213"..., 3850) = 3850 <0.000004>

This single line has several interesting pieces. First, the amount of data written appears after the = sign (in this case, 3,850 bytes). Second, the amount of time used to complete the operation is on the far right in the angle brackets (< >) (in this case, 0.000004 seconds). Third, the data sent to the function is listed inside the quotes and is abbreviated, so you don’t see all of the data. This can be useful if you are examining an strace of an application that has sensitive data. The fourth piece of useful information is the first argument to the write() function, which contains the file descriptor to the specific file (in this case, it is fd 17). If you track the file associated with open() functions (and close() functions), you can tell which file the I/O function is operating upon.
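The pieces described above can be pulled out of each line with a regular expression. The Python sketch below is a minimal illustration and not the parser used by strace_analyzer. It assumes the output was produced with `-T -ttt` as in the command above (no PID prefix on each line), and the names `Syscall` and `parse_strace_line` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# timestamp  name(args) = result [error text] <elapsed>
STRACE_LINE = re.compile(
    r"^(?P<ts>\d+\.\d+)\s+(?P<call>\w+)\((?P<args>.*)\)\s+=\s+"
    r"(?P<ret>0x[0-9a-f]+|-?\d+|\?)(?P<note>.*?)\s*"
    r"(?:<(?P<elapsed>\d+\.\d+)>)?\s*$"
)


class Syscall(NamedTuple):
    timestamp: float           # seconds since the epoch (from -ttt)
    name: str                  # e.g. "write"
    fd: Optional[int]          # first argument if it is an integer; only
                               # meaningful for calls that take a file descriptor
    result: Optional[int]      # value after "="; bytes moved for read/write
    elapsed: Optional[float]   # seconds spent in the call (from -T)


def parse_strace_line(line):
    """Return a Syscall for a completed system call line, or None otherwise."""
    match = STRACE_LINE.match(line)
    if match is None:
        return None  # signals, exit messages, unfinished calls, and so on
    first_arg = match.group("args").split(",", 1)[0].strip()
    fd = int(first_arg) if first_arg.isdigit() else None
    ret = match.group("ret")
    if ret == "?":
        result = None
    elif ret.startswith("0x"):
        result = int(ret, 16)
    else:
        result = int(ret)
    elapsed = match.group("elapsed")
    return Syscall(float(match.group("ts")), match.group("call"), fd, result,
                   float(elapsed) if elapsed else None)
```

To map a file descriptor back to a file name, you would additionally record the path and return value of each open() call and forget the mapping when the matching close() appears.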

From this information, you can start to gather all kinds of statistics from the application. For example, you can count the number of read() or write() operations and how much data is in each I/O operation. This data can then be converted into throughput data in MB/s (or more). You can also count the number of I/O functions in a period of time to estimate the IOPS needed by the application. Additionally, you can do all of this on a per-file basis to find which files have more I/O operations than others. Then, you can perform a statistical analysis of this information, including histograms of the I/O functions. This information can be used as a good starting point for understanding the I/O pattern of your application.
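A small Python sketch of these statistics follows. It is an illustration only, not the strace_analyzer implementation. It assumes the read() and write() calls have already been reduced to (timestamp, call name, file descriptor, bytes, elapsed seconds) tuples; the function names are hypothetical. Throughput here is bytes divided by the time spent inside the calls, and IOPS is estimated as operations divided by the wall-clock span of the trace.

```python
from collections import Counter, defaultdict


def size_bucket(nbytes):
    """Round a request size up to the next power of two (0 stays 0)."""
    return 0 if nbytes <= 0 else 1 << (nbytes - 1).bit_length()


def summarize(records):
    """records: iterable of (timestamp_s, call, fd, nbytes, elapsed_s)."""
    ops, bytes_by_call, bytes_by_fd = Counter(), Counter(), Counter()
    time_by_call = defaultdict(float)
    size_hist = Counter()
    first = last = None
    for timestamp, call, fd, nbytes, elapsed in records:
        if nbytes < 0:
            continue  # failed call, no data moved
        ops[call] += 1
        bytes_by_call[call] += nbytes
        bytes_by_fd[fd] += nbytes
        time_by_call[call] += elapsed
        size_hist[size_bucket(nbytes)] += 1
        end = timestamp + elapsed
        first = timestamp if first is None else min(first, timestamp)
        last = end if last is None else max(last, end)
    span = (last - first) if ops else 0.0
    return {
        "ops": dict(ops),
        "bytes": dict(bytes_by_call),
        "mb_per_s": {call: bytes_by_call[call] / 1e6 / time_by_call[call]
                     for call in ops if time_by_call[call] > 0},
        "iops": sum(ops.values()) / span if span > 0 else None,
        "bytes_by_fd": dict(bytes_by_fd),
        "size_histogram": dict(size_hist),
    }
```

The power-of-two buckets are a design choice that keeps the histogram compact; fixed-width bins would work just as well if your request sizes fall in a narrow range.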

However, if you run strace against your application, you are liable to end up with thousands, if not hundreds of thousands, of lines of output. To make things a bit easier, I have developed a simple program in Python that goes through the strace output and does this statistical analysis for you. The application, with the ingenious name "strace_analyzer," scans the entire strace output and creates an HTML report of the I/O pattern of the application, including plots (using matplotlib).

To give you an idea of what the HTML output from strace_analyzer looks like, you can look at this snippet of the first part of the major output (without plots).

This is only the top portion of the report, with the rest of the report containing plots – sometimes quite a few of them. The analysis is done for a single process (without threading at this time), and it gives you all sorts of useful statistical information about the application.

One subtle thing to notice about using strace to analyze I/O patterns is that you are getting the strace output from a single application. Because I’m interested in HPC Storage here, many of the applications will be MPI applications, for which you get one strace output per MPI process. Getting strace output for each MPI process isn't difficult and doesn't require that the application be changed. You can get a unique strace output file for each MPI process, but again, you will get a great deal of output from each strace output file. To help with this, strace_analyzer creates an output file (a “pickle” in Python-speak), and you can take the collection of these files for an entire MPI application and perform the same statistical analysis across the MPI processes. The tool, called "MPI Strace Analyzer", also produces an HTML report across the MPI processes. If you look in the Appendix to this article, you will see a full report from an example application (eight-core LS-Dyna run that uses a simple central NFS storage system).
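One way to get a unique strace output file per MPI process without touching the application is to launch each process through a small wrapper. The Python sketch below is one such wrapper, shown under stated assumptions: it names the output file after the hostname and the wrapper's process ID (so it does not depend on any particular MPI library's rank variables), and the script name `strace_wrap.py` in the usage note is hypothetical.

```python
import os
import socket
import sys


def strace_command(app_argv, out_dir="."):
    """Build an strace command line with an output file unique to this process."""
    tag = f"{socket.gethostname()}.{os.getpid()}"
    out_file = os.path.join(out_dir, f"strace.{tag}.out")
    # Same options as in the single-process example: -T adds the time spent
    # in each call, -ttt adds a timestamp in seconds with microseconds.
    return ["strace", "-T", "-ttt", "-o", out_file, *app_argv]


if __name__ == "__main__" and len(sys.argv) > 1:
    command = strace_command(sys.argv[1:])
    # Replace the wrapper with strace, which in turn starts the application.
    os.execvp(command[0], command)
```

With this saved as strace_wrap.py, a launch along the lines of `mpirun -np 8 python strace_wrap.py ./application [arguments]` would leave one strace output file per MPI process in the working directory.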