
Success Story



Appro helps LLNL successfully implement a 620TF supercomputing cluster at three National Labs
The Challenge
The workhorse of modern-day High Performance Computing (HPC) is the Linux cluster. Almost all areas of science and engineering depend on the power and performance offered by Linux clusters. Today's top computational problems require performance in the TFLOP/s (10^12 floating-point operations per second) range. In the very near future, the delivery of PFLOP/s (10^15) cluster systems will be a reality as well. One of the leading-edge practitioners of Linux HPC is Lawrence Livermore National Laboratory (LLNL), located in Livermore, California (http://www.llnl.gov).

One of LLNL's primary missions for the National Nuclear Security Administration (NNSA) is ensuring the safety, security and reliability of the nation's nuclear deterrent through a program called stockpile stewardship. A cornerstone of this effort is NNSA's Advanced Simulation and Computing Program (ASC), which provides the integrating simulation and modeling capabilities and technologies needed to combine new and old experimental data, past nuclear test data, and past design and engineering experience into a powerful tool for assessment and certification of nuclear weapons and their components.
In addition, LLNL computing supports research in many other areas, including molecular dynamics, turbulence, petascale atomistic simulations with quantum accuracy, simulation of protein membranes, nanotechnology, ultrahigh-resolution global climate models, fundamental material research, and laser-plasma interactions for the National Ignition Facility (NIF).
As an example, the NIF mission is a grand challenge problem that requires large amounts of compute resources, both for design and, once the facility is complete, for experimental data analysis. Briefly, when completed, the NIF will focus 192 giant lasers, housed in a ten-story building the size of three football fields, onto a tiny BB-sized target (see Figure One). The NIF will deliver at least 60 times more energy to the target than any previous laser system. Experiments conducted at the NIF will make significant contributions to national and global security, could lead to practical fusion energy, and will help the nation maintain its leadership in basic science and technology. To ensure the project's success, LLNL computing is challenged to provide usable cycles for the NIF team as well as other leading scientific projects.
Currently LLNL provides 495 TFLOP/s of computing power, spread over eighteen x86 Linux clusters, to its user base. Of these clusters, 480 TFLOP/s are derived from the eleven systems supplied by Appro International. These clusters are available for both classified and unclassified work depending on the project. In addition, clusters are broken into two types: Capability Clusters, or those clusters that are designed to handle unusually large computing jobs, and Capacity Clusters, or those that are designed to handle a large number of different computing jobs at the same time.
Fielding and maintaining these clusters is no small feat. As a result, LLNL has a strong incentive to reduce the Total Cost of Ownership (TCO) of these systems. The TCO challenge is not unfamiliar to other government labs, notably Los Alamos National Lab (LANL) and Sandia National Labs (SNL). Each has similar issues and working together on a common hardware and software infrastructure would be advantageous, but it would also present a management challenge.
In order to reduce TCO by over 50%, the Tri-Labs decided to purchase identical Scalable Units (SUs) and build multiple clusters of various sizes over a two-year cycle for all three labs. An all-in-one procurement was certainly a novel, though obvious, approach. In theory, providing all three labs with the same hardware and software environment would allow applications to be moved easily from one cluster or site to another, and would offer the advantage of a common software development environment. In addition, running the same software across all three sites reduces manpower and cost. In the past, each lab created its own Linux-based distribution, often with operating system patches to make better use of the underlying hardware. The Tri-Lab procurement allowed the creation of TOSS (Tri-Lab Operating System Stack), which is similar to the CHAOS environment developed by LLNL. The Tri-Labs are currently working on the TOSS project; when it is complete, all three labs will for the first time be able to share a common hardware environment and cluster operating system infrastructure. Most of the TOSS components come from Red Hat Enterprise Linux 5, which is available under a DOE site license.
In an attempt to leverage this common hardware platform, LLNL, LANL, and SNL joined together in the first multi-lab Linux cluster procurement, aptly called the Tri-Labs Capacity Cluster 2007 or TLCC07 procurement. Part of the TLCC07 approach is a common SU hardware platform specification, with the intention of creating a common software platform as well. The challenge that all three labs faced was finding a vendor that could handle this huge procurement and delivery model. Indeed, successfully fielding many Linux clusters requires a vendor that can work closely with a variety of component suppliers, has HPC expertise and a track record, and has demonstrated sound project management and good communication skills with customers. With the Tri-Labs procurement, the number of customers increased to three. Each customer is geographically separate from the others, and each had unique facilities requirements.
Once the multiple Linux clusters were deployed, an additional challenge was presented to the LLNL computing center. The computational needs of the various lab programs had to be addressed with stable production clusters, which in turn required the vendor of choice to provide a high level of performance and system stability.
In the HPC world, headlines often cite the achievements of capability clusters that scale up and push the limits of computing. In the Tri-Lab procurement, the challenge was not so much one of scaling up, but rather of scaling out. According to Mark Seager of LLNL, 'Even though these are capacity clusters, we feel we are pushing the HPC state of the art because we bought 3,744 compute nodes to be spread across three Government Labs. That was a huge logistical challenge.'
The Solution
LLNL, working with two other NNSA national laboratories, Sandia and Los Alamos, developed the concept of the Scalable Unit (SU) cluster building block in order to build multiple commodity Linux clusters of different sizes from the same SU. With previous Linux cluster acquisitions, each cluster was designed, acquired and integrated individually. In a sense, this method treated every cluster as a unique creation with little carryover from previous clusters. With the SU concept, large and small clusters can be constructed with identical hardware and software. In addition, by building multiple clusters from the same SU, LLNL gained experience along the way and successive deployments went more smoothly. Finally, with a common hardware and software environment spread across multiple clusters at multiple locations, the cost to application developers of porting to and supporting these clusters was also significantly reduced. These are the factors that led to the 50% reduction in the total cost of ownership.
The SU specification that was developed by the Tri-Labs for TLCC07 is based on 144 quad-socket, quad-core nodes, twelve 24-port InfiniBand 4X DDR switches, and a scalable Ethernet management infrastructure. The nodes were assigned to the critical cluster functions: 136 compute nodes (where user applications run), 6 gateway nodes (where the parallel file system shared by multiple clusters plugs into the SU), 1 login node (where users log in, compile, debug and run their applications) and 1 management node (where system administrators configure and monitor the nodes, and from which the other 143 diskless nodes boot). Additionally, one second-level InfiniBand switch was specified for every 2 SUs in a multi-SU cluster. It is then possible to create clusters based on the number of SUs needed. For example:
1 SU = 144 nodes / 2,304 processor cores (144-port switch used instead of 12 x 24-port switches)
2 SU = 288 nodes / 4,608 processor cores (288-port switch used instead of 2 x 12 x 24-port switches)
4 SU = 576 nodes / 9,216 processor cores (plus two 288-port switches)
6 SU = 864 nodes / 13,824 processor cores (plus three 288-port switches)
8 SU = 1,152 nodes / 18,432 processor cores (plus four 288-port switches)
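The node and core counts above follow directly from the SU definition (144 nodes per SU, four sockets with four cores each per node). A minimal Python sketch of that arithmetic is shown here; the function and constant names are illustrative and not part of the TLCC07 specification.

```python
# Sketch of the SU sizing arithmetic. Names are illustrative only.
NODES_PER_SU = 144           # from the SU specification in the text
SOCKETS_PER_NODE = 4         # quad-socket nodes
CORES_PER_SOCKET = 4         # quad-core processors
CORES_PER_NODE = SOCKETS_PER_NODE * CORES_PER_SOCKET


def cluster_size(su_count):
    """Return (nodes, cores) for a cluster built from su_count SUs."""
    nodes = su_count * NODES_PER_SU
    return nodes, nodes * CORES_PER_NODE


if __name__ == "__main__":
    for su in (1, 2, 4, 6, 8):
        nodes, cores = cluster_size(su)
        print(f"{su} SU = {nodes:,} nodes / {cores:,} processor cores")
```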
Part of the solution was a careful specification that placed clear boundaries between the various aspects of the cluster solution (e.g. storage area networking and parallel file systems) and required the cluster provider to deliver only a large quantity of 144-node SUs and second-level switches.
The Tri-Lab procurement consisted of 21 scalable units with an aggregate performance of 438 TFLOP/s, to be divided into eight clusters, plus two options. Lawrence Livermore would create the Juno cluster (8 SUs), the Hype cluster (1 SU), and the Eos cluster (2 SUs). Los Alamos would create Lobo (2 SUs) and Hurricane (2 SUs). Sandia would create Unity (2 SUs), Whitney (2 SUs) and Glory (2 SUs). As part of the Tri-Labs procurement, LLNL had the option to purchase two additional clusters, Hera (6 SUs) and Nyx (2 SUs).
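As a quick consistency check on the allocation described above, the following Python sketch tallies the SUs per lab. The dictionary layout and function name are illustrative choices; only the cluster names and SU counts come from the text.

```python
# Tally of the SU allocation described in the text.
# Data layout and helper names are illustrative only.
BASE_CLUSTERS = {
    "LLNL": {"Juno": 8, "Hype": 1, "Eos": 2},
    "LANL": {"Lobo": 2, "Hurricane": 2},
    "SNL": {"Unity": 2, "Whitney": 2, "Glory": 2},
}
OPTION_CLUSTERS = {"LLNL": {"Hera": 6, "Nyx": 2}}


def total_sus(clusters_by_lab):
    """Sum the SU counts over every cluster at every lab."""
    return sum(sum(c.values()) for c in clusters_by_lab.values())


def cluster_count(clusters_by_lab):
    """Count the clusters over every lab."""
    return sum(len(c) for c in clusters_by_lab.values())


if __name__ == "__main__":
    print("base SUs:", total_sus(BASE_CLUSTERS))
    print("base clusters:", cluster_count(BASE_CLUSTERS))
    print("option SUs:", total_sus(OPTION_CLUSTERS))
```

The base allocation sums to the 21 SUs and eight clusters stated in the text, and the two LLNL options add a further 8 SUs.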
In the fall of 2007, Appro International was selected under the TLCC procurement as the solution provider to deliver the SUs to the three labs. Appro was chosen based on how well they addressed the requirements specified in the Request for Proposal, their proven HPC track record, system cost and project management skills. In addition, Appro was noted to work well with component suppliers and solve problems as they arise.
In terms of hardware, the Appro 1U-1143H, a quad-socket server based on Quad-Core AMD Opteron processors, was specified instead of bladed packaging. While blade-based servers add some convenience and redundancy, the most economical choice was still rack-mount 1U servers. In terms of processors, the Tri-Labs SU employed the Quad-Core AMD Opteron (Barcelona), with nodes connected by DDR (4x) InfiniBand. Each SU was designed to achieve a high compute density and therefore used quad-socket motherboards with four Quad-Core AMD Opteron Socket F processors running at 2.2 GHz (Model 8354), for a total of 16 cores per node. Each node was equipped with sixteen 2 GB DDR2-667 DIMMs for 32 GB of memory and a 4x DDR InfiniBand Connect-X Host Channel Adapter (HCA) card.
In order to achieve a balanced computational system, each node was required to achieve an optimum memory Bytes/FLOP ratio for both the memory interface and interconnect. The Opteron nodes were able to achieve a 20GB/s/node memory bandwidth which fit nicely within the specifications.
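To make the Bytes/FLOP balance concrete, here is a rough Python sketch of the memory-side ratio for one node. It assumes four double-precision floating-point operations per core per clock cycle, a figure that is not stated in the text; the core count, clock rate and 20 GB/s bandwidth are taken from the text. All names are illustrative.

```python
# Rough per-node memory balance estimate. FLOPS_PER_CYCLE is an
# assumption (not given in the text); the other values are quoted there.
CORES_PER_NODE = 16          # four quad-core sockets
CLOCK_GHZ = 2.2              # Opteron Model 8354
FLOPS_PER_CYCLE = 4          # ASSUMPTION: double-precision FLOPs per core per cycle
MEMORY_BW_GB_PER_S = 20.0    # achieved memory bandwidth per node


def peak_gflops_per_node():
    """Theoretical peak GFLOP/s for one node under the stated assumption."""
    return CORES_PER_NODE * CLOCK_GHZ * FLOPS_PER_CYCLE


def memory_bytes_per_flop():
    """Memory bandwidth divided by peak floating-point rate."""
    return MEMORY_BW_GB_PER_S / peak_gflops_per_node()


if __name__ == "__main__":
    print(f"peak per node: {peak_gflops_per_node():.1f} GFLOP/s")
    print(f"memory balance: {memory_bytes_per_flop():.3f} bytes/FLOP")
```

Under that assumption a node peaks at about 140.8 GFLOP/s, giving roughly 0.14 bytes of memory bandwidth per floating-point operation. The target ratio in the Tri-Labs specification is not given in the text, so this sketch does not check against it.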
The Results
Measuring success in HPC at LLNL is not so much a matter of up-time and peak performance, but rather of how much faster the science and engineering get done. Immediately after installation of the first SUs, the NIF team requested time to refine the optics needed for the NIF lasers. There was a pressing purchasing decision that needed to be completed as soon as possible. The NIF team was given 1.5 months of time on one of the LLNL TLCC clusters to complete the calculations and make the right decisions. The NIF is slated to fire all 192 lasers in 2010. This historic event will be the culmination of many people-years of work backed by many compute-years of CPU time made possible by the Tri-Lab cluster acquisition. According to Mark Seager, 'The users are very excited about the TLCC Linux clusters. Appro and AMD have been great partners to work with. They have exceeded our expectations in terms of working with us to resolve problems as they came up and to get these systems fielded quickly.'
In terms of delivered LINPACK performance numbers, the Eos, Hera, and Juno clusters achieved the following:
Cluster   SUs   TFLOP/s
Eos       2     40.6
Hera      6     121.7
Juno      8     162.2
Note that the TFLOP/s scale, as expected, by a factor of four from Eos to Juno and a factor of three from Eos to Hera. The other clusters have not posted their benchmarks. However, since they are constructed from identical hardware and software, the performance should be almost identical. If there is a discrepancy, individual SUs can be tested rather than a cluster wide search for a problem.
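The scaling factors can be checked with a short Python sketch. It takes Hera as 6 SUs, as in the procurement breakdown earlier, and assumes four double-precision floating-point operations per core per cycle, which the text does not state. Under that assumption, the theoretical peak of every node in each cluster rounds to the figures listed above. Names are illustrative.

```python
# Peak-performance scaling by SU count. FLOPS_PER_CYCLE is an
# assumption; SU counts, node counts and clock rate are from the text.
NODES_PER_SU = 144
CORES_PER_NODE = 16
CLOCK_GHZ = 2.2
FLOPS_PER_CYCLE = 4          # ASSUMPTION: not stated in the text

CLUSTER_SUS = {"Eos": 2, "Hera": 6, "Juno": 8}


def peak_tflops(su_count):
    """Theoretical peak TFLOP/s for a cluster of su_count SUs."""
    cores = su_count * NODES_PER_SU * CORES_PER_NODE
    return cores * CLOCK_GHZ * FLOPS_PER_CYCLE / 1000.0


if __name__ == "__main__":
    base = peak_tflops(CLUSTER_SUS["Eos"])
    for name, sus in CLUSTER_SUS.items():
        tf = peak_tflops(sus)
        print(f"{name}: {sus} SUs, {tf:.1f} TFLOP/s, {tf / base:.1f}x Eos")
```

Because every SU is identical, the ratio between two clusters reduces to the ratio of their SU counts: 8/2 = 4 for Juno and 6/2 = 3 for Hera.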
There are other, more subtle benefits of the common hardware platform. In the past, hardware incompatibilities in new clusters would sometimes cause program crashes. These crashes are not easily traced, as they may occur after days or weeks of multi-node program execution. Searching for these bugs was tedious and often specific to a given cluster. With a shared hardware platform, solving these types of problems becomes easier. Instead of one lab searching for problems, all three labs can now combine their efforts and experience to ferret out issues. Indeed, to track down a deep problem, ancillary evidence from as many software applications as possible becomes important (e.g. 'Did you have a problem running this kind of code?'). As with any detective work, the more evidence there is, the easier it is to solve the mystery. By spreading the clusters across three labs, the amount of evidence and experience grows quickly. Problems are solved more quickly and easily, thus reducing the system integration component of TCO.
Mark Seager states, 'It is amazing how much leverage we are getting out of the standard configuration at all three Labs. Because we work on the same problems when we are trying to field these clusters, each lab brings its own unique approach to the problem and we tend to solve these problems more quickly.'
The Tri-Lab procurement also had several 'up front' advantages. In the past, each cluster had to be purchased through a lengthy procurement process. A proposal had to be created and evaluated for each cluster. With the Tri-Lab procurement, these costs were reduced by at least five sixths: a single procurement replaced six, one per year over the two-year cycle for each of the three labs.
From a performance standpoint, the new system is not only a success, but also represents an improvement over the previous clusters in use at Lawrence Livermore, which had only dual-core processors. With twice the number of cores and twice the memory capability, says Seager, “we’re seeing a performance boost of anywhere from 1.3 to 1.8 times compared to the previous system. The users are very excited about getting this kind of capability.” One group of users from the LLNL National Ignition Facility (NIF) focuses on both high energy-density physics research and new kinds of energy sources utilizing photon science. NIF requested three months of dedicated time on the clusters, but was assigned half that because of the pressing demand for computing resources from other LLNL programs. “Still, they were able to do the research they needed on both the ignition research and optics,” says Seager.
Standing behind this success is Appro International. Its ability to deliver high-quality hardware and work closely with component vendors proved to be a success factor in this project. Appro maintained good lines of communication with all vendors and customers. Appro also adapted well to other aspects of the procurement. Some modifications to the standard SU design were required because not all sites had the same power and cooling capabilities. The solution was to reduce the density of systems in the SU while still maintaining the SU concept.
The Summary
Overall, the Tri-Labs project has been a huge success, and meeting its challenge has proved beneficial to everyone involved. Mark Seager summarizes his thoughts on the entire project: “Can we put in all the information we need to put in, and when we do, what kind of science comes out in the results? Ultimately, that’s the most important attribute of the Tri-Labs challenge, and by that metric, the new clusters have been a big success.”
The Tri-Lab SU concept reduced costs across the board. Reductions were noted in procurement costs and time frames as well as in ongoing maintenance costs. A key component of the success was the project management and problem solving that Appro International brought to the table. Based on the experience of the Tri-Lab project, the future success of grand challenge computing depends on teamwork, planning, and amortizing the costs across multiple government labs. As LLNL has shown, Linux HPC clusters are igniting our nation's future.
About Lawrence Livermore National Laboratory
Established in 1952, Lawrence Livermore National Laboratory (LLNL) is a premier applied science laboratory, part of the National Nuclear Security Administration within the Department of Energy. LLNL is responsible for ensuring the nation’s nuclear weapons remain safe, secure, and reliable through application of advances in science and engineering. With its unique capabilities, the Laboratory meets other pressing national security needs, including countering the proliferation of weapons of mass destruction and strengthening homeland security against the terrorist use of such weapons. The Laboratory is an international leader in many areas of science and technology and undertakes significant research programs in energy, environment, bioscience, biotechnology, and basic science and advanced technology. Because so many of its projects require huge volumes of data analysis and the hosting of highly complex proprietary applications, the need for more powerful simulation compute resources has been an integral requirement since the 1980s.
About Appro
Appro accelerates technical applications and business results, unlocking the value of IT for the high-performance and enterprise computing markets through a differentiated, performance-balanced architecture, open standards, and engineering expertise. Appro is a leading developer of innovative, high-performance, density-managed servers, cluster solutions, storage subsystems, and high-end workstations for the high-performance and enterprise computing markets.
Appro’s headquarters is in Milpitas, CA with an R&D and manufacturing partner in Asia and a sales and service office in Houston, Texas.
Learn More
To learn more about Appro Supercomputer Solutions, please contact your Sales Representative, visit www.appro.com or email us at clustersolutions@appro.com.
For more information about Lawrence Livermore National Laboratory visit www.llnl.gov. Information on LLNL Clusters can be found at https://computing.llnl.gov/linux/clusters.html.
Additional information on the National Ignition Facility (NIF) can be found at https://lasers.llnl.gov/.
To learn more about AMD’s latest ground-breaking technologies, visit www.amd.com.



Appro is focusing its product design to address the HPC cluster
market and key customer requirements including system management,
high availability and price/performance for HPC applications.
Appro has shown the ability to win highly sought-after, large-scale
HPC deals positioning the company to benefit from strong market
growth that IDC projects through 2012.
Earl Joseph,
IDC Program Vice President, Technical Computing




Customer Quote

“Appro not only offered us a cost effective solution but
they also improved our required technical specification through
better reliability, greater fault tolerance and redundancy as
well as more flexibility with regards to system scalability.”
Bob Bell,
Technical Director, ING Renault F1 Team
