Top Three HPC Roadblocks

The HPC market has come a long way in the last decade, but some issues still keep it from reaching its full potential.

By Doug Eadline

If you are a practitioner of HPC, you might ask, “Only three things?” Of course, there are more, but the problems I want to talk about here are what I judge to be some of the top issues facing the HPC market and community. Over the last 15 years, the HPC market has seen great growth because of low-cost commodity hardware and open software. These dramatic changes have expanded HPC to the point that a true HPC system can be had for several thousand dollars. This growth has not been without challenges, most of which are easily solved. Having been involved in the HPC market for close to 25 years, my frustration stems from the lack of progress with the following issues.

Solve The Last Mile Problem

The cable/Internet industry coined the phrase "the last mile" to indicate the issue of making the final cable/phone/Internet connection to customers. The problem is real because, although each customer wants the same thing, in almost all cases it becomes a "custom" job at some point. From the classic console television to the Windows 98 computer to the latest flat screen or tablet, a complete solution takes some know-how that the customer often does not have.

Companies have addressed this problem by creating clear demarcation points where their responsibility ends and the customer’s begins. It can be a cable modem or a phone box, but one side is the user’s responsibility and the other is that of the service provider. In general, installers try to assist customers with their specific situations, which can vary from simple to complex. Setting up wireless is one good example. The situation has created a need for companies like the Best Buy Geek Squad, which offer integration services. Results vary, but at least people have someone to call other than the cousin who figured out how to get his Xbox working on the Internet.

HPC has a huge "last mile" problem because of several factors. Underestimating costs is a clear issue in HPC. The idea that buying a cluster is as simple as specifying how many cores, DIMMs, HDDs, HCAs, switches, and so on you can fit in your budget is a misconception, even if that exercise is valid and necessary. When it comes to software, most of which is open source, the assumption is that it is "free" and should not affect the cost of the system. The administrative expense is usually handled by "existing" resources, which can be anything from a graduate student to a well-trained and certified Windows admin. In almost all cases, the biggest perceived expense is the hardware.

Top-tier vendors will offer various levels of software and integration support, but when customers find they might have to cut a third of their hardware to pay for top-tier support, they tend to look elsewhere for a solution. The lower-tier vendors, who are working on thin margins, prefer to deliver and support only the hardware. Most vendors do not have a software support staff to help customers beyond the initial install. Depending on the size of the institution and its charter, organizations might have staff that addresses these needs. In general, the national labs and large university computing sites have excellent support structures. Many of these organizations also contribute to the impressive collection of open HPC software. The last mile problem is largely present in the smaller organizations, which have smaller budgets and fewer personnel.

As an example, consider software upgrades. In a typical case, clusters are purchased fully functioning with some type of Red Hat-derived distribution augmented by cluster tools and libraries. Over time, the administrator makes changes and tweaks various items. Eventually, the cluster becomes heavily used, the hardware warranty expires, there is no software support, and users begin to request updated software. The upgrade requires a cascade of updates, and the administrator is often forced to rebuild a custom cluster software infrastructure. The effort could take several weeks of installation and testing, which does not sit well with the end users. Depending on the skill level of the administrator, upgrades might not even work or could cause further headaches and delays.

Similar to the cable/Internet last mile problem, there is an economic opportunity in the HPC market. One provider, Bright Computing, has developed a turnkey cluster software solution that many administrators are finding useful. Some open cluster management (or provisioning) solutions, such as the Warewulf Project, also make cluster management much easier. A plethora of other issues face administrators and users as well, and each issue has an associated cost. These issues include storage, expansion, training, workflow policies, hardware failures, and local integration, just to name a few.

The situation described above has not really changed much in the last 10 years. The focus on the latest and greatest hardware often dwarfs the attention paid to many of these issues. The net result is slow market growth, failed installations, bad experiences, administrative turmoil, and unexpected costs.

In particular, smaller organizations are more vulnerable to last mile issues, and users need to understand that a successful HPC program has costs beyond the hardware. Another need is for more education and training of both users and system administrators. This effort needs to come from the entire industry because the last mile involves many vendors. Like the cable/Internet industry, once the last mile problem is addressed, the HPC market can expand in many ways.

Refocus Performance Goals

The Top500 is a great historical resource, as well as a way for the "top" computer vendors and users to measure their progress. The problem with the Top500 is that a majority of users do not have access to or require that level of computing; yet, they use it as a measure of overall HPC progress. Numerous surveys have gauged how many cores (in the past, CPUs) are needed to run HPC-specific applications. In a recent informal poll, 55% of the HPC respondents required 32 or fewer cores (presumably because of scaling issues). The term often used for users in this area is the missing middle. One needs to ask, “With a middle market of low-hanging fruit, for which fewer than 100 cores can have a huge economic effect, why does the HPC market focus on benchmark results that require tens of thousands of cores?”
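The scaling issues mentioned above can be illustrated with Amdahl's law, a standard rule of thumb that is not part of the poll itself. The following is only a minimal sketch: the function name amdahl_speedup is hypothetical, and the 95% parallel fraction is an assumed value chosen for illustration, not a measurement of any real application.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Return the ideal speedup predicted by Amdahl's law.

    parallel_fraction is the share of run time that can be spread
    across cores; the remainder is assumed to stay strictly serial.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Assumed value for illustration only: 95% of the work parallelizes.
    p = 0.95
    for n in (1, 8, 32, 128, 1024):
        s = amdahl_speedup(p, n)
        # Efficiency is the speedup divided by the number of cores used.
        print(f"{n:5d} cores: speedup {s:6.2f}, efficiency {s / n:6.1%}")
```

Under this assumed 95% figure, the speedup can never exceed 20 no matter how many cores are added, and 32 cores already deliver a speedup of roughly 12.5. An application with this profile gains little from a Top500-class machine, which is one plausible reading of why so many respondents need only a few dozen cores.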

Perhaps the belief that "press release" clusters will garner more business pushes the market toward the Top500. The other, more pragmatic reason is that delivering a solution to the missing middle is more difficult than deploying racks of servers to achieve a spot on the Top500 list. In essence, the missing middle represents all that has been forgotten or passed over in the HPC industry. These issues include low-cost turnkey systems, real last-mile support, application porting, programming tools, and training – none of which are strong points in the current HPC ecosystem, but all of which contribute to successful production systems.

The absence of a missing middle infrastructure has stymied the growth of this sector. Addressing this audience (see The Council On Competitiveness) can have a huge effect on both the HPC market and the entire economy. Again, the solution lives across the industry but has less to do with peak FLOPS and more to do with effective FLOPS. Reducing the "barriers to effective FLOPS" benefits all HPC vendors, but no single vendor has taken on this role in the industry, nor should one. Just like the last mile problem, the problem of the missing middle spans the entire market. It should be mentioned that HPC Cloud might help with some of the issues mentioned above; however, the need to address the fundamentals remains unchanged.