HPC Storage is arguably one of the most pressing issues in HPC. Selecting among HPC storage solutions requires some research, study, and planning to be effective – particularly cost-effective. Getting this process started usually means understanding how your applications perform I/O. This article presents some techniques for examining I/O patterns in HPC applications.

HPC Storage – Getting Started with I/O Profiling

It's fairly obvious to say that storage is one of the largest issues, if not the largest issue, in HPC. All aspects of HPC storage are becoming critical to the overall success or productivity of systems: high performance, high reliability, access protocols, scalability, ease of management, price, power, and so on. These aspects, and perhaps more importantly, combinations of these aspects, are key drivers in HPC systems and HPC performance.

With so many options and so many key aspects to HPC storage, a logical question one can ask is: Where should one start? Will a NAS (Network Attached Storage) solution work for my system? Do I need a high-performance parallel file system? Do I use a SAN (Storage Area Network) as a back end for my storage, or can I use less expensive DAS (Direct Attached Storage)? Should I use InfiniBand for the compute node storage traffic or will GigE or 10GigE be sufficient? Do I use 15,000 rpm drives or should I use 7,200 rpm drives? Do I need SSDs (Solid State Disks)? Which file system should I use? Which I/O scheduler should I use within Linux? Should I tune my network interfaces? How can I take snapshots of the storage and do I really need snapshots? How can I tell if my storage is performing well enough? How do I manage my storage? How do I monitor my storage? Do I need a backup or do I really only need a copy of the data? How can I monitor the state of my storage? Do I need quotas and how do I enforce them? How can I scale my storage in terms of performance and capacity? Do I need a single namespace?
How can I do a filesystem check, how long will it take, and do I need one? Do I need cold spare drives or storage chassis? What RAID level do I need (file or object)? How many hot spares are appropriate? SATA versus SAS? And on, and on.

When designing or selecting HPC storage, you have many questions to consider, but you might have noticed one item that I left out of the laundry list of questions. If you noticed I didn't discuss applications, you are correct. Designing HPC storage, just as with designing the compute nodes and network layout, starts with the applications. Although designing hardware and software for storage is very important, without considering the I/O needs of your application mix, the entire process is just an exercise with no real value. You are designing a solution without understanding the problem. So the first thing you really should consider is the I/O pattern and I/O needs of your applications.

Some people might think that understanding application I/O patterns is a futile effort because, with thousands of applications, they think it's impossible to understand the I/O pattern of them all. However, it is possible to stack-rank the applications, beginning with the few that use the most compute time or seemingly use a great deal of I/O. Then, you can begin to examine their I/O patterns.

Tools to Help Determine I/O Usage

You have several options for determining the I/O patterns of your applications. One obvious way is to run the application and monitor storage while it is running. Measuring I/O usage on your entire cluster might not always be the easiest thing to do, but some basic tools can help you. For example, tools such as sar, iotop, iostat, nfsiostat, and collectl can be used to help you measure your I/O usage. (Note: This is by no means an exhaustive list of possible tools.)
iotop
Depending on the storage solution you are using, you might be able to use iotop to measure I/O usage on the data servers. For example, if you are using NFS on a single server, you could run iotop on that server and measure I/O usage for the nfsd processes. However, using iotop to measure I/O patterns really only gives you an overall picture without a great deal of detail. For example, it is probably impossible to determine whether the application is doing I/O to a central storage server or to the local node. It is impossible to use iotop to determine how the application is doing the I/O. Moreover, you really only get an idea of the throughput and not the IOPS that the application is generating.
iostat
The values are computed as system-wide averages for all processors when your system has more than one core (which is pretty much everything today). The second report prints out all kinds of details about device utilization (the device can be a physical device or a partition). If you don't specify a device on the command line, then iostat will print out values for all devices (alternatively, you can use ALL as the device). Typically the report output includes the following:
These fields are specific to the set of options used, as well as to the device. Relative to iotop, iostat gives more detail on the actual I/O, but it does so in an aggregate manner (i.e., not on a per-process basis). Perhaps more importantly, iostat reports on a device basis, not a mountpoint basis. Because the HPC application is likely writing to a central storage server mounted on the compute node, it is not very likely that the application is writing to a device on each node (however, you can capture local I/O using iostat), so the most likely use of iostat is on the storage server or servers.
nfsiostat
sar
Like iotop and nfsiostat, you would run sar on each compute node and gather the statistical I/O information, or you could just let it gather all information and sort out the I/O information. Then, you could gather all of that information together and create a time history of overall application I/O usage.
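
As a rough illustration of the "gather it all together" step, the sketch below merges per-node samples into one cluster-wide time history. It assumes the per-node tool output has already been parsed into (timestamp, node, read_bytes, write_bytes) tuples; that tuple layout, the function name time_history, and the bucket_seconds parameter are all hypothetical and are not part of sar or any other tool named here.

```python
from collections import defaultdict


def time_history(samples, bucket_seconds=10):
    """Sum per-node I/O samples into fixed-width time buckets.

    samples: iterable of (timestamp, node, read_bytes, write_bytes) tuples.
    This layout is an assumption made for the example; parsing real sar or
    collectl output into it is left out on purpose.
    Returns a list of (bucket_start, read_bytes, write_bytes), oldest first.
    """
    buckets = defaultdict(lambda: [0, 0])
    for ts, _node, read_bytes, write_bytes in samples:
        # Align each sample to the start of the bucket it falls in.
        start = int(ts // bucket_seconds) * bucket_seconds
        buckets[start][0] += read_bytes
        buckets[start][1] += write_bytes
    return [(start, rd, wr) for start, (rd, wr) in sorted(buckets.items())]


if __name__ == "__main__":
    demo = [
        (100.0, "node01", 10, 20),
        (104.0, "node02", 5, 5),
        (112.0, "node01", 1, 2),
    ]
    for row in time_history(demo):
        print(row)
```

Bucketing by wall-clock time only makes sense if the node clocks are reasonably synchronized, which is one more thing to check before trusting a merged history.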
collectl
Like sar, you run collectl on each compute node and gather the I/O information for that node. However, it does allow you to gather information about processes and threads, so you can capture a bit more information than sar. Then, you have to gather all of the information for each MPI process or compute node and create a time history of the I/O pattern of the application. A similar tool called collectd is run as a daemon to collect performance statistics much like collectl.

These tools can help you understand what is happening on individual systems, but you have to gather the information on each compute node or for each MPI process and create your time history or statistical analysis of the I/O usage pattern. Moreover, they don't do a good job, if at all, of watching IOPS usage on systems, and IOPS can be a very important requirement for HPC systems.

blktrace
Other tools can help you understand more detailed I/O usage, but at a block level, allowing you to capture more information, such as IOPS. For example, blktrace can watch what is happening on specific block devices, so you can use this tool on storage servers to watch I/O usage. For example, if you are using a Linux NFS server, you could watch the block devices underlying the NFS file system, or if you are using Lustre, you could use blktrace to monitor block devices on the OSS nodes. Blktrace can be a very useful tool because it also allows you to compute IOPS (I think it's only "Total IOPS"). Also, a tool called seekwatcher can be used to plot results from blktrace. An example on the website illustrates this.

Obviously, no one tool can give you all the information you need across a range of nodes. Perhaps a combination of iotop, iostat, nfsiostat, and collectl coupled with blktrace can give you a better picture of what your HPC storage is doing as a whole. Coordinating this data to generate a good picture of I/O patterns is not easy and will likely involve some coding.
But if you assume that you can create this picture of I/O usage, you have to correlate it with the job history from the job scheduler to help determine which applications are using more I/O than others. However, these tools only tell you what is happening in aggregate, focusing primarily on throughput, although blktrace can give you some IOPS information. They can't really tell you what the application is doing in more detail, such as the order of reads and writes, the amount of data in each read or write function, and information on lseeks or other I/O functions. In essence, what is missing is the ability to look at I/O patterns from the application level. In the next section, I'll present a technique and application that can be used to help you understand the I/O pattern of your application.

Determining I/O Patterns

One of the keys to selecting HPC storage is to understand the I/O pattern of your application. This isn't an easy task to accomplish overall, and several attempts have been made over the years to help you understand I/O patterns. One method I have been using is to use strace (system trace) to profile the I/O pattern of an application. Because virtually all I/O from an application goes through system calls (usually by way of system libraries), you can use strace to capture the I/O profile of an application. For example, using the command

strace -T -ttt -o strace.out [application]

on an [application] doing I/O might output a line like this:

1269925389.847294 write(17, " 37989 2136595 2136589 213"..., 3850) = 3850 <0.000004>

This single line has several interesting pieces. First, the amount of data written appears after the = sign (in this case, 3,850 bytes). Second, the amount of time used to complete the operation is on the far right in the angle brackets (< >) (in this case, 0.000004 seconds). Third, the data sent to the function is listed inside the quotes and is abbreviated, so you don’t see all of the data.
This can be useful if you are examining an strace of an application that has sensitive data. The fourth piece of useful information is the first argument to the write() function, which contains the file descriptor to the specific file (in this case, it is fd 17). If you track the file associated with open() functions (and close() functions), you can tell which file the I/O function is operating upon.

From this information, you can start to gather all kinds of statistics from the application. For example, you can count the number of read() or write() operations and how much data is in each I/O operation. This data can then be converted into throughput data in MB/s (or more). You can also count the number of I/O functions in a period of time to estimate the IOPS needed by the application. Additionally, you can do all of this on a per-file basis to find which files have more I/O operations than others. Then, you can perform a statistical analysis of this information, including histograms of the I/O functions. This information can be used as a good starting point for understanding the I/O pattern of your application.

However, if you run strace against your application, you are liable to end up with thousands, if not hundreds of thousands, of lines of output. To make things a bit easier, I have developed a simple program in Python that goes through the strace output and does this statistical analysis for you. The application, with the ingenious name "strace_analyzer," scans the entire strace output and creates an HTML report of the I/O pattern of the application, including plots (using matplotlib).

To give you an idea of what the HTML output from strace_analyzer looks like, you can look at this snippet of the first part of the major output (without plots). This is only the top portion of the report, with the rest of the report containing plots – sometimes quite a few of them.
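
To make the counting described above concrete, here is a minimal sketch of how read() and write() lines from strace -T -ttt output could be parsed and totaled. It is not the strace_analyzer code; the names LINE_RE, parse_line, and summarize are invented for this example, and the regular expression only covers plain read() and write() calls in the single-process output format shown earlier.

```python
import re
from collections import defaultdict

# Matches read()/write() lines in the `strace -T -ttt` format, e.g.
# 1269925389.847294 write(17, "..."..., 3850) = 3850 <0.000004>
LINE_RE = re.compile(
    r'^(?P<ts>\d+\.\d+)\s+(?P<call>read|write)\((?P<fd>\d+),.*\)'
    r'\s+=\s+(?P<ret>-?\d+).*<(?P<dur>\d+\.\d+)>\s*$'
)


def parse_line(line):
    """Return a dict for a read()/write() line, or None for anything else."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return {
        "ts": float(m.group("ts")),            # wall-clock time of the call
        "call": m.group("call"),
        "fd": int(m.group("fd")),
        "bytes": max(int(m.group("ret")), 0),  # failed calls (-1) move no data
        "dur": float(m.group("dur")),          # seconds spent in the call
    }


def summarize(lines):
    """Count calls and bytes, then estimate IOPS and throughput."""
    calls = {"read": {"count": 0, "bytes": 0},
             "write": {"count": 0, "bytes": 0}}
    bytes_per_fd = defaultdict(int)
    first_ts = last_ts = None
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue
        calls[rec["call"]]["count"] += 1
        calls[rec["call"]]["bytes"] += rec["bytes"]
        bytes_per_fd[rec["fd"]] += rec["bytes"]
        if first_ts is None:
            first_ts = rec["ts"]
        last_ts = rec["ts"]
    n_ops = calls["read"]["count"] + calls["write"]["count"]
    total_bytes = calls["read"]["bytes"] + calls["write"]["bytes"]
    elapsed = (last_ts - first_ts) if first_ts is not None else 0.0
    return {
        "calls": calls,
        "bytes_per_fd": dict(bytes_per_fd),
        "elapsed": elapsed,
        # Rates are rough estimates over the span between first and last call.
        "iops": n_ops / elapsed if elapsed > 0 else None,
        "mb_per_s": total_bytes / elapsed / 1e6 if elapsed > 0 else None,
    }


if __name__ == "__main__":
    with open("strace.out") as fh:
        print(summarize(fh))
```

The sketch groups bytes by file descriptor number only. Mapping descriptors back to file names would also require tracking the open() and close() lines, because a descriptor number can be reused after it is closed.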
The strace_analyzer analysis is done for a single process (without threading at this time), and it gives you all sorts of useful statistical information about the application.

One subtle thing to notice about using strace to analyze I/O patterns is that you are getting the strace output from a single application. Because I’m interested in HPC Storage here, many of the applications will be MPI applications, for which you get one strace output per MPI process. Getting strace output for each MPI process isn't difficult and doesn't require that the application be changed. You can get a unique strace output file for each MPI process, but again, you will get a great deal of output from each strace output file. To help with this, strace_analyzer creates an output file (a “pickle” in Python-speak), and you can take the collection of these files for an entire MPI application and perform the same statistical analysis across the MPI processes. The tool, called "MPI Strace Analyzer", also produces an HTML report across the MPI processes. If you look in the Appendix to this article, you will see a full report from an example application (eight-core LS-Dyna run that uses a simple central NFS storage system).
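
The idea of combining one pickle per MPI process can be sketched as follows. The article does not describe the layout of the files strace_analyzer writes, so this example assumes each pickle holds a flat dictionary of per-process totals (for example a "write_bytes" key); that layout and the function names load_summaries and across_processes are hypothetical and are not the MPI Strace Analyzer implementation.

```python
import pickle
import statistics


def load_summaries(paths):
    """Load one pickled summary dict per MPI process.

    Only unpickle files you created yourself; pickle can run arbitrary
    code when loading untrusted data.
    """
    summaries = []
    for path in paths:
        with open(path, "rb") as fh:
            summaries.append(pickle.load(fh))
    return summaries


def across_processes(summaries, key):
    """Combine a single counter (a hypothetical key) across all processes."""
    values = [s[key] for s in summaries]
    return {
        "total": sum(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
    }


if __name__ == "__main__":
    import sys
    # Usage (file names are placeholders): python combine.py rank0.pkl rank1.pkl
    per_rank = load_summaries(sys.argv[1:])
    print(across_processes(per_rank, "write_bytes"))
```

Comparing the minimum and maximum across processes is a quick way to spot an application in which one process does most of the I/O.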
