HPC Storage – Getting Started with I/O Profiling

What Do I Do with This Information?

Now that you have all of this statistical information about your applications, what do you do with it? I'm glad you asked. The answer is that you can do quite a bit. The first thing I always look for is how many processes in the MPI application actually do I/O, primarily write(). You might be surprised to learn that many applications have only a single MPI process, typically the rank-0 process, doing all of the I/O for the entire application. A second class of applications has a fixed number of MPI processes performing I/O, where this number is less than the total number of MPI processes. Finally, a third class of applications has all, or virtually all, processes writing to the same file at the same time (MPI-IO). If you don't know which of these three classes your applications fall into, you can use mpi_strace_analyzer to help determine that.

Just knowing whether your application has a single process doing I/O or whether it uses MPI-IO is a huge step toward making an informed decision about HPC storage. The simple reason is that running MPI-IO codes on NFS storage is generally not recommended, although it is possible. Instead, the general rule of thumb is to run MPI-IO codes on parallel distributed storage that has a tuned MPI-IO implementation. (Please note that these are general rules of thumb, and it is possible to run MPI-IO codes on NFS storage and non-MPI-IO codes on parallel distributed storage.)
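The three classes described above can be sketched as a small classifier over per-process write() counts. This is only an illustration, not how mpi_strace_analyzer works internally: the function name, the labels, and the `min_calls` and `nearly_all` thresholds are assumptions made for this example, and the input is assumed to be a mapping from each process's strace file name to its number of write() calls.

```python
def classify_io_pattern(write_calls_per_rank, min_calls=1, nearly_all=0.9):
    """Classify an MPI job by how many of its processes issue write() calls.

    write_calls_per_rank: dict mapping a process label (for example the
        strace file name) to the number of write() calls it made.
    min_calls:   hypothetical threshold below which a process is not
        counted as a writer (raise it to ignore incidental stdout writes).
    nearly_all:  hypothetical fraction of processes at or above which the
        job is treated as "all, or virtually all, processes write".
    """
    total = len(write_calls_per_rank)
    if total == 0:
        raise ValueError("no processes given")

    writers = [p for p, n in write_calls_per_rank.items() if n >= min_calls]

    if not writers:
        return "no-writers"
    if len(writers) == 1:
        return "single-writer"      # typically rank 0 does all the I/O
    if len(writers) / total >= nearly_all:
        return "all-ranks"          # candidate for MPI-IO style storage
    return "subset"                 # a fixed group of I/O processes


# Example with made-up counts for an eight-process run.
counts = {f"file_{i}": 0 for i in range(8)}
counts["file_0"] = 12000
print(classify_io_pattern(counts))
```

Note that counting writers says nothing about whether the processes share one file, so an "all-ranks" result still needs a look at the open() calls before concluding that the code uses MPI-IO.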

Other useful information obtained from examining the application I/O pattern, such as

  • throughput requirements (read and write),
  • IOPS requirements (write IOPS, read IOPS, total IOPS),
  • sizes of read() and write() function calls (i.e., the distribution of data sizes),
  • time spent doing I/O versus total run time, and
  • lseek information,

can be used to determine not only what kind of HPC storage you need, but also how much performance you need from the storage. Is your application dominated by throughput performance or IOPS? What is the peak IOPS obtained from the strace output? How much time is spent doing I/O versus total run time?
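As a rough illustration of how these quantities fall out of strace output, the following sketch summarizes one process's trace. It assumes the trace was collected with `strace -ttt -T`, so each line starts with an epoch timestamp and ends with the time spent in the call in angle brackets. The function name, the returned keys, and the set of calls counted as I/O are choices made for this example, not the output format of strace_analyzer.

```python
import re
from collections import Counter

# Matches lines such as:
#   1276178550.123456 write(3, "..."..., 1024) = 1024 <0.000200>
# (timestamp from -ttt, time spent in the call in <...> from -T).
LINE_RE = re.compile(
    r"^(?P<ts>\d+\.\d+)\s+(?P<call>\w+)\(.*\)\s+=\s+(?P<ret>-?\d+).*?"
    r"<(?P<dur>\d+\.\d+)>\s*$"
)

# Assumption: the system calls counted toward "time spent doing I/O".
IO_CALLS = {"read", "write", "lseek", "open", "openat", "close", "fstat", "stat"}


def summarize_strace(lines):
    """Return basic I/O statistics for the strace lines of one process."""
    first_ts = last_end = None
    io_time = 0.0
    bytes_moved = {"read": 0, "write": 0}
    calls = {"read": 0, "write": 0}
    per_second = {"read": Counter(), "write": Counter()}

    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip signals, exit lines, and unfinished/resumed calls
        ts, dur = float(m["ts"]), float(m["dur"])
        call, ret = m["call"], int(m["ret"])
        if first_ts is None:
            first_ts = ts
        last_end = ts + dur
        if call in IO_CALLS:
            io_time += dur
        if call in bytes_moved and ret >= 0:
            calls[call] += 1
            bytes_moved[call] += ret          # return value = bytes moved
            per_second[call][int(ts)] += 1    # one-second buckets for IOPS

    if first_ts is None:
        raise ValueError("no parsable strace lines")

    elapsed = last_end - first_ts
    writes = calls["write"]
    return {
        "elapsed_s": elapsed,
        "io_time_s": io_time,
        "io_fraction": io_time / elapsed if elapsed > 0 else 0.0,
        "bytes_read": bytes_moved["read"],
        "bytes_written": bytes_moved["write"],
        "avg_write_bytes": bytes_moved["write"] / writes if writes else 0.0,
        "peak_read_iops": max(per_second["read"].values(), default=0),
        "peak_write_iops": max(per_second["write"].values(), default=0),
    }
```

Peak IOPS here is simply the largest number of calls that started within any one-second window, which is one reasonable definition among several. The time reported by `-T` also includes tracing overhead, so the I/O fraction is best read as an estimate.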

If you look at the specific example in the Appendix, you could make the following observations from the HTML report.

  • The MPI process associated with file_18597 spends the largest amount of time doing I/O. However, it is only 1.15% of the total run time. At best, I could only ever improve the wall clock time by 1.15% by adding more I/O capability to the system. This can also be seen in Appendix Figure 1.
  • If you examine Appendix Table 2, which counts the number of times a specific I/O function is called, you can see that the MPI process associated with file_18597 has the largest number of lseek(), write(), open(), fstat(), and stat() function calls of all of the processes. However, the MPI process associated with file_18590 also does quite a bit of I/O, as can be seen in Appendix Figure 2, which plots the major I/O functions (read(), write(), lseek(), open(), and close()).
  • To help resolve whether file_18590 or file_18597 is the more dominant I/O process, examine Appendix Figure 8. This figure plots the total amount of data written by each function with the average (± standard deviation) plotted as well. From this figure, it is easy to see that file_18597 wrote the majority of the data for this application (1.75GB out of about 3GB, or almost 60 percent).
  • Appendix Table 4 tabulates how much data is passed per write() function call. You'll see that most of the data is passed in chunks of 8KB or less, with the vast majority in 1KB or smaller chunks. This indicates that the application is doing a very large number of small writes (which could influence storage design).
  • If you look at the write() summary right after Appendix Figure 7, you can also see that the whole application only wrote a little more than 3GB of data, with an average data size of a little more than 10KB per write.
  • The same thing can be done for read() functions. Appendix Table 7 tabulates the data size passed to the read() function. You can see that the majority of data is read in the 1 to 10MB range. It turns out that a number of these read() function calls are the result of loading shared objects (.so) files, which are read and loaded into memory.
  • If you look at the last section that covers IOPS, you can see that the peak Write IOPS is about 2,444, which is fairly large considering that a typical 7,200 rpm SAS or SATA drive can only handle about 100 IOPS. The Read IOPS number is low, and the Total IOPS number is high as well (2,444). You would think this application needs a fair amount of IOPS performance because of the very small writes being performed, and you might presume that you need about 25 such 7,200 rpm drives (2,444 IOPS at roughly 100 IOPS per drive) to meet the IOPS requirement of the application. However, don't forget that the best wall clock improvement possible for this application is 1.15 percent, so spending that much on drives will improve the overall performance of the application by only a small amount. It might be better to use that money to buy another compute node, if the application scales well, to improve performance.
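The arithmetic behind the last observation can be written out as a short sketch. The two helper functions are hypothetical names introduced for this example. The first applies Amdahl's law to the fraction of run time spent in I/O, and the second divides the peak IOPS by a per-drive figure, using the rough 100 IOPS per 7,200 rpm drive quoted above.

```python
import math


def best_case_speedup(io_fraction, io_speedup=math.inf):
    """Overall speedup (Amdahl's law) when only the I/O portion gets faster.

    io_fraction: share of wall clock time spent in I/O (0.0115 for 1.15%).
    io_speedup:  how much faster the I/O becomes; infinity models I/O
                 that takes no time at all, which gives the upper bound.
    """
    if not 0.0 <= io_fraction < 1.0:
        raise ValueError("io_fraction must be in [0, 1)")
    if io_speedup <= 0:
        raise ValueError("io_speedup must be positive")
    return 1.0 / ((1.0 - io_fraction) + io_fraction / io_speedup)


def drives_needed(peak_iops, iops_per_drive):
    """Number of drives needed to sustain peak_iops, rounded up."""
    if iops_per_drive <= 0:
        raise ValueError("iops_per_drive must be positive")
    return math.ceil(peak_iops / iops_per_drive)


# Values taken from the example discussed above.
print(drives_needed(2444, 100))              # drives for the peak write IOPS
print(round(best_case_speedup(0.0115), 4))   # bound with I/O time reduced to zero
```

With the numbers from the example, 25 drives would be needed to cover the peak, yet even I/O that took no time at all would speed up the run by a factor of only about 1.0116.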

Summary

HPC storage is definitely a difficult problem for the industry right now. Designing systems to meet our storage needs has become a headache that is difficult to cure. However, you can make this headache easier to manage if you understand the I/O patterns of your applications.

In this article, I talked about different ways to measure the performance of your current HPC storage system, your applications, or both, but this requires a great deal of coordination to capture all of the relevant information and then piece it together. Some of the tools, such as iotop, iostat, nfsiostat, collectl, collectd, and blktrace, can be used to understand what is happening with your storage and your applications. However, they don't really give you the details of what is going on from the perspective of the application. These tools are all focused on what is happening on a particular server (compute node). For HPC, you would have to gather this information for all nodes and then coordinate it to understand what is happening at an application (MPI) level.

Using strace can give you more information from the perspective of the application, although it also requires you to gather all of this information on each node in the job run and coordinate it. To help with this process, two applications – strace_analyzer and mpi_strace_analyzer – have been written to help sort through the mounds of strace data and produce some useful statistical information.

The tools were applied to an LS-Dyna run over eight cores that used an NFS filesystem (NFS over GigE). Portions of the strace analysis of a single process were presented, and the entire MPI strace analysis was presented in an Appendix, showing the sort of information produced by the analysis tools to help you better understand the I/O pattern of your application.

I hope this article has presented some ideas about how to analyze your I/O needs from the perspective of an application. After all, making your applications run more efficiently, and hopefully faster, is the whole point of HPC.