Except for the folks at Cray, most people are unaware of the unique requirements that set supercomputing infrastructure apart from cloud computing infrastructure. In its simplest form, the difference is between latency and capacity. For business intelligence applications such as optimization and logistics, many servers are required to solve a single problem, and low-latency communication between the servers is instrumental for performance. The intuition behind this is easy to understand: a modern microprocessor executes 4-5 instructions per 250ps (one clock cycle at 4GHz), and thus 10GbE packet latencies of 5-50usec are roughly equivalent to 100k to 1M processor instructions. If a processor depends on results computed by another processor, it has to idle until the data is available. Cumulatively, across a couple hundred servers, this can lead to sustained performance that is only 1-5% of peak.
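
The latency estimate above is simple enough to check with a few lines of Python. This is a back-of-the-envelope sketch, not a performance model: the function names are made up for illustration, and the 1usec of useful work per message in the last line is a hypothetical workload, not a measurement.

```python
CYCLE_PS = 250  # one clock cycle in picoseconds (4GHz), as quoted in the text


def instructions_per_latency(latency_us, ipc):
    """Instructions a core could issue while one packet is in flight."""
    cycles = (latency_us * 1_000_000) // CYCLE_PS  # usec -> ps -> clock cycles
    return cycles * ipc


def compute_fraction(compute_us, wait_us):
    """Share of wall-clock time spent computing when each step waits on one message."""
    return compute_us / (compute_us + wait_us)


for latency_us in (5, 50):
    low = instructions_per_latency(latency_us, 4)
    high = instructions_per_latency(latency_us, 5)
    print(f"{latency_us} usec of latency ~ {low:,} to {high:,} instructions")

# Hypothetical workload: 1 usec of useful work between 50 usec waits.
print(f"compute fraction: {compute_fraction(1, 50):.1%}")
```

With these inputs, 5usec corresponds to 80,000-100,000 instructions and 50usec to 800,000-1,000,000, and a process that waits 50usec for every 1usec of work spends about 2% of its time computing, which is the regime the 1-5% figure describes.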

Supercomputing applications are defined by these types of tightly coupled concurrent processes, which puts more emphasis on the performance of the interconnect, in particular its latency. Running a traditional supercomputing application on an infrastructure designed for elastic applications, such as AWS or Azure, typically yields slowdowns by a factor of 50 to 100. Measured in terms of cost, such an application would cost 50-100 times more to execute on a typical public cloud computing infrastructure.

Most supercomputing applications are associated with very valuable economic activities of the business. As mentioned earlier, production optimization and logistics applications save companies like ExxonMobil and FedEx billions of dollars per year. Those applications are tightly integrated in the business operation and strategic decision making of these organizations and pay for themselves many times over. For the SMB market, these supercomputing applications offer great opportunity for revenue growth and margin improvement as well. However, their economic value is limited by the size of the revenue stream they optimize: a 10% improvement on a $10B revenue stream yields a $1B net benefit, but on a $10M revenue stream the benefit is just $1M, not enough to compensate for the risk and cost that deploying a supercomputer would require.

Enter On-Demand Supercomputing.

In 2011, we were asked to design, construct, and deploy an On-Demand Supercomputing service for a Chinese cloud vendor. The idea was to build an interconnected set of supercomputer centers in China and offer a multi-tenant on-demand service for high-value, high-touch applications, such as logistics, digital content creation, and engineering design and optimization. The pilot program consisted of one supercomputer center in Beijing and one in Shanghai. The basic building block was a quad-rack, redundant QDR IB fat-tree architecture with blade chassis at the leaves. The architecture was inspired by the observation that for the SMB market, the granularity of deployment would fall in the range of 16 to 32 processors, which can be serviced by a single chassis, keeping all communication traffic local to the chassis. The topology is shown in the following figure:

Redundant QDR IB Network Topology for On-Demand Supercomputing
The chassis structure is easy to spot as the clusters of 20 servers at the leaves of the tree. The redundancy of the IB network is also clearly visible in the pairs of connections between all the layers in the tree. The quad configuration is a symmetric setup of two rack pairs, each pair holding one side of the redundant IB network, storage, and compute nodes. Half the quad can therefore fall away, and the system would still have full connectivity between storage and compute. To lower the cost of the system, storage was designed around IB-based storage servers that plug into the same infrastructure as the compute nodes. QDR throughput is balanced with PCIe gen2, and thus we were able to deliver ephemeral blades that get their personality from the storage servers and then dynamically connect via iSCSI services to whatever storage volumes they require. This is less expensive than designing a separate NAS storage subsystem, and it gives the infrastructure the flexibility to build high-performance storage solutions. It was this system that set a new world record as the first trillion-triple semantic database system, leveraging a Lustre file system consisting of 8 storage servers (trillion-triple-semantic-database-record).
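
The redundancy claim can be illustrated with a toy graph. The sketch below is not the actual topology: it assumes one spine switch, one leaf switch, one chassis, and one storage server per side, with every leaf linked to both spines, and all node names are hypothetical. It only shows the kind of check that "half the quad can fall away" implies.

```python
from collections import deque


def build_quad():
    """Adjacency map for a toy two-sided quad (all node names are hypothetical)."""
    links = [
        ("leaf_A", "spine_A"), ("leaf_A", "spine_B"),  # each leaf is linked to both spines
        ("leaf_B", "spine_A"), ("leaf_B", "spine_B"),
        ("chassis_A", "leaf_A"), ("storage_A", "leaf_A"),
        ("chassis_B", "leaf_B"), ("storage_B", "leaf_B"),
    ]
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def reachable(graph, start, failed):
    """Breadth-first search from start that never enters a failed node."""
    if start in failed:
        return set()
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in failed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def storage_reachable(graph, failed):
    """True if every surviving chassis can reach every surviving storage server."""
    alive = [n for n in graph if n not in failed]
    chassis = [n for n in alive if n.startswith("chassis")]
    storage = {n for n in alive if n.startswith("storage")}
    return all(storage <= reachable(graph, c, failed) for c in chassis)


quad = build_quad()
side_a = {n for n in quad if n.endswith("_A")}
print("side A down:", storage_reachable(quad, side_a))
print("one spine down:", storage_reachable(quad, {"spine_A"}))
print("both spines down:", storage_reachable(quad, {"spine_A", "spine_B"}))
```

In this toy model, losing all of side A or a single spine leaves every surviving chassis connected to every surviving storage server, while losing both spines cuts each side off from the other side's storage.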
The provisioning of on-demand supercomputing infrastructure is bare metal, mostly to avoid the I/O latency degradation that virtualization injects. Given the symmetry between storage and compute and the performance offered by QDR IB, a network boot mechanism can be used to put any personality on the blades without any impact on performance. The blades have local disk for scratch space, but run their OS and data volumes off the storage servers, thus avoiding the problem of disaster recovery (DR) of state on the blades.
The QDR IB infrastructure was based on Voltaire switches and Mellanox HCAs. Intel helped us tune the infrastructure, using their cluster libraries for the processors we were using, and Mellanox was instrumental in getting the IB switches in shape. Over a three-week period, we went from 60% efficiency to about 94% efficiency. The full quad has a peak performance of 19.2TFlops, and after tuning the infrastructure we were able to consistently deliver 18TFlops of sustained performance.
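
The efficiency figures follow directly from the peak and sustained numbers. A small sketch, assuming that "efficiency" here means sustained performance divided by peak performance (the function names are illustrative):

```python
PEAK_TFLOPS = 19.2       # peak performance of the full quad, from the text
SUSTAINED_TFLOPS = 18.0  # sustained performance after tuning, from the text


def efficiency(sustained, peak):
    """Fraction of peak performance that is actually delivered."""
    return sustained / peak


def sustained_at(efficiency_fraction, peak):
    """Sustained performance implied by a given efficiency."""
    return efficiency_fraction * peak


print(f"tuned efficiency: {efficiency(SUSTAINED_TFLOPS, PEAK_TFLOPS):.1%}")
print(f"sustained at 60% efficiency: {sustained_at(0.60, PEAK_TFLOPS):.2f} TFlops")
```

Under that assumption, 18TFlops out of 19.2TFlops is just under 94%, and the 60% starting point corresponds to roughly 11.5TFlops, so the three weeks of tuning recovered about 6.5TFlops of sustained performance.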
The total cost of the core system was on the order of $3.6M. The On-Demand Supercomputing service offers a full dual-socket server with 64GB of memory for about $5/hr, providing a cost-effective service for SMBs interested in leveraging high performance computing. For example, a digital content creation firm in Beijing leveraged about 100 servers as burst capacity for post-production. Their cost to leverage a state-of-the-art supercomputer was less than $20k per month. Similarly, a material science application was developed by a chemical manufacturer to study epitaxial growth. This allowed the manufacturer to optimize the process parameters for a thin-film process, work that would not have been cost-effective on a cloud infrastructure designed for elastic web applications.
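
To put the pricing in perspective, the arithmetic can be sketched as follows. The inputs are the figures quoted above; the function names are illustrative, and the break-even number deliberately ignores power, staffing, facilities, and utilization, so it is a rough lower bound rather than a business model.

```python
HOURLY_RATE_USD = 5               # per dual-socket server, from the text
BURST_SERVERS = 100               # servers used by the content creation firm
MONTHLY_SPEND_USD = 20_000        # upper bound on their monthly bill
CORE_SYSTEM_COST_USD = 3_600_000  # approximate cost of the core system


def burst_hours(monthly_spend, servers, hourly_rate):
    """Hours per month that `servers` can run concurrently within the spend."""
    return monthly_spend / (servers * hourly_rate)


def break_even_server_hours(system_cost, hourly_rate):
    """Billed server-hours needed to recover the hardware cost alone."""
    return system_cost / hourly_rate


hours = burst_hours(MONTHLY_SPEND_USD, BURST_SERVERS, HOURLY_RATE_USD)
print(f"100-server burst hours per month: {hours:.0f}")
print(f"server-hours to recover hardware cost: "
      f"{break_even_server_hours(CORE_SYSTEM_COST_USD, HOURLY_RATE_USD):,.0f}")
```

At $5/hr, a $20k monthly bill buys at most 40 hours of a 100-server burst, and the hardware cost alone corresponds to 720,000 billed server-hours.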
The takeaway of this project echoes the findings in the missing-middle reports for digital manufacturing (Digital Manufacturing Report). There is tremendous opportunity for SMBs to improve business operations by leveraging the same techniques as their enterprise brethren. But the cost of commercial software for HPC is not consistent with the value it provides for SMBs. Furthermore, the IT and operational skills required both to set up and to manage a supercomputing infrastructure are beyond the capabilities of most SMBs. On-demand HPC services, as we have demonstrated with the supers in Beijing and Shanghai, can overcome many of these issues. Most importantly, they enable a new level of innovation by domain experts, such as professors and independent consultants, who do have the skills necessary to leverage supercomputing techniques, but up to now have not had access to public supercomputing capability and services.