IDG Accelerate: Technology Driving Business Performance. Sponsored by AMD - Smarter Choice.

November 11, 2008

Case study: Lawrence Livermore's extensive computing needs result in a unique collaborative program

Three national labs collaborate on a deployment to save money and push the boundaries of science.

By Howard Baldwin

If you’re in business, your computational problems probably range from issues such as how much it’s going to cost to build a product to what’s the fastest, least expensive way to get the product from Juarez to Wichita. Not to minimize these issues, but Mark Seager’s computational challenges are in a whole different realm.

Seager leads the development team for high-performance computing platforms at Lawrence Livermore National Laboratory. One of his projects involves atomic simulation of shockless compression—which sounds complicated, and undoubtedly is. In layman’s terms, it’s a study of how copper or aluminum reacts to an impact when it’s in the form of metallic foam. This has implications in aerospace and manufacturing, especially as it relates to cooling of compact electronics, cryogenic tanks and lightweight optics filters.

Welcome to a typical day at Lawrence Livermore National Laboratory (LLNL), Livermore, Calif., one of the nation’s key applied science laboratories within the Department of Energy’s National Nuclear Security Administration (NNSA). The federal government funded 10 projects for LLNL last year, on topics as varied and perplexing as molecular dynamics (simulating protein membranes), seismic analysis (simulating underground explosions for oil exploration) and atmospheric chemistry (constructing a climate model for the entire planet).

Not surprisingly, LLNL doesn’t order servers from the same distributors other businesses do. In fact, its computing resources far outstrip what most businesses will ever deal with. To tackle some of the challenges associated with deploying such extensive resources, Seager embarked last year on a unique program to expand computing power while decreasing its cost.

Triple Play

LLNL, working with two other NNSA national laboratories (Sandia and Los Alamos), developed the concept of the scalable unit (SU), a cluster building block from which multiple commodity Linux clusters of different sizes can be built. The total number of SUs purchased under this contract was approximately 31 (4,344 compute nodes on Opteron™ processor-based servers). A scalable unit consists of 144 quad-socket, quad-core nodes and twelve 24-port InfiniBand 4X DDR switches. The scalable units are then combined to create clusters of 144 (1x), 288 (2x), 576 (4x) or 1,152 (8x) nodes.
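To make the building-block arithmetic concrete, here is a minimal Python sketch that works out node, socket, core and switch-port counts for each cluster size from the per-SU figures quoted above. The constant and function names are illustrative, not anything the labs or Appro use, and the switch-port figure is a raw total that says nothing about how the InfiniBand fabric is actually wired.

```python
# Per-SU figures taken from the article; all names here are hypothetical.
NODES_PER_SU = 144
SOCKETS_PER_NODE = 4      # quad-socket nodes
CORES_PER_SOCKET = 4      # quad-core processors
SWITCHES_PER_SU = 12
PORTS_PER_SWITCH = 24     # 24-port InfiniBand 4X DDR switches


def cluster_size(num_sus):
    """Return raw hardware counts for a cluster built from num_sus scalable units."""
    nodes = num_sus * NODES_PER_SU
    sockets = nodes * SOCKETS_PER_NODE
    return {
        "scalable_units": num_sus,
        "nodes": nodes,
        "sockets": sockets,
        "cores": sockets * CORES_PER_SOCKET,
        # Raw port total only; this does not model the fabric topology.
        "switch_ports": num_sus * SWITCHES_PER_SU * PORTS_PER_SWITCH,
    }


# The four cluster sizes named in the article: 1x, 2x, 4x and 8x.
for multiple in (1, 2, 4, 8):
    size = cluster_size(multiple)
    print(f"{multiple}x: {size['nodes']} nodes, {size['cores']} cores")
```

Under these figures a single SU carries 2,304 cores, and the largest (8x) configuration carries 18,432.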

Taking advantage of so much scalability brought multiple advantages to the combined procurement effort. “We were trying to reduce the total cost of ownership for our capacity clusters,” says Seager. “Providing all three labs with the same hardware environment in their clusters and the same software environment would enable us to support applications among the three sites more easily. It also meant we could reduce the manpower and costs associated with installing and supporting the machines.”

Not only did the shared acquisition reduce costs, it also made troubleshooting easier. “It’s amazing how much leverage we’re getting out of the standard configuration,” says Seager. “Having all three labs using the systems means that the engineers find and solve issues in days rather than weeks.”

In addition to scheduling weekly phone conversations, the Tri-Laboratories have deployed a wiki to track issues and share resolutions. “Using the wiki to disseminate information avoids a lot of duplication of effort,” says Seager. “When a problem that is new to one system administrator comes up, that person can search the wiki for similar problems and quickly learn what methods to use to get to the root cause and apply fixes.” Seager also notes that because the wiki can be edited by any of the participants, information is quickly refined as they gain more insight.

To help with the centralized procurement and deployment of these massive systems, Seager turned to AMD Platinum Partner Appro Systems, a Milpitas, Calif.-based developer of high-performance workstations, servers, clusters and supercomputers. Appro brought both technical and project management skills to the project, Seager says.

The labs’ projects have an overwhelming need for a high number of bytes per FLOPS (floating-point operations per second). “We looked at interconnect bandwidth and realized that the right balance point was a machine that enabled us to have quad sockets and quad-core processors. This would maximize the number of FLOPS in a node, so we could minimize the number of nodes we had to purchase,” says Seager, noting that the bandwidth for the Opteron processors was a whopping 20 GB per node. “That’s quite substantial and represented the best of breed of what was available during our procurement cycle.”

At the same time, Seager notes, “Appro brings very strong technical skills in understanding the AMD ecology, plus good project management skills, which are very necessary in working with numerous vendors. Dealing with 30 scalable units and three sites was a project management challenge.”

Managing Complexity

Those strong technical skills were necessary when it came to building and deploying such a complex system. Determining the success of such a project incorporates multiple metrics. For instance, one metric the engineers track is how much difficulty they have in integrating the clusters for specific projects, says Seager. How easily does the code scale when any of the 1,663 users start running their projects? How reliable are the clusters over the course of an intensive computational project?

“We can dedicate half a cluster to one or two projects over a weekend or to a single project five days in a row,” says Seager. “In that instance, we’re looking for stability. Does it crash? If so, how often? Does it have enough bandwidth?”

From a performance standpoint, the new system is not only a success but also represents an improvement over the previous clusters in use at Lawrence Livermore, which had only dual-core processors. With twice the number of cores and twice the memory capability, says Seager, “We’re seeing a performance boost of anywhere from 1.3 to 1.8 times the performance of the previous system. The users are very excited about getting this kind of capability.”

One group of users from the LLNL National Ignition Facility (NIF) focuses on both high-energy/density physics research and new kinds of energy sources, utilizing photon science. NIF requested three months of dedicated time on the clusters but was assigned only half that, because of the pressing demand for computing resources from other LLNL programs. “Still, it was able to do the research it needed on both the ignition research and optics,” says Seager.

For Seager, the ultimate yardstick (a wholly inappropriate term when you’re talking about astrophysics) is the system’s ability to solve real-world problems. “Can we put in all the information we need to put in, and when we do, what kind of scientific discovery comes out in the results? Ultimately that’s the most important attribute, and by that metric, the new clusters have been a big success.”
