LLNL's El Capitan Surpasses Frontier With 1.74 ExaFLOPS Performance

SC24 Lawrence Livermore National Lab's (LLNL) El Capitan system has ended Frontier's 2.5-year reign as the number one ranked supercomputer on the Top500, setting a new high water mark for high-performance computing (HPC).

At 1.74 exaFLOPS of double-precision performance, as measured by the venerable Linpack (HPL) benchmark, the HPE Cray-built system boasts nearly 30 percent higher performance than the previous champion, which has climbed from 1.1 exaFLOPS at its debut to 1.35 exaFLOPS today.

Like many DoE supers before it, El Capitan is tasked with one of the agency's most secretive workloads: ensuring that America's nuclear stockpile actually works.

"As NNSA's first exascale computer, it represents a pivotal next step in our commitment to ensuring the safety, security, and reliability of our nation's nuclear stockpile without the need to resume underground nuclear testing," the National Nuclear Security Administration's Corey Hinderstein explained during a press conference ahead of the November Top500 results.

This compute will enable the tri-labs – Livermore, Los Alamos, and Sandia – to "simulate multi-physics processes in 3D with unparalleled detail and speed," she added.

And when El Capitan isn't simulating the criticality of new and existing nuclear warheads, the machine, along with its sibling system Tuolumne, will assist with research into other areas including biology, weather forecasting, earthquake monitoring, disaster simulation, and even some AI.

Powering El Cap are 44,544 of AMD's Instinct MI300A accelerated processing units (APUs). Announced at AMD's Advancing AI event a little under a year ago, the HPC-centric part is unique in that each chip co-packages 24 of AMD's Zen 4 processor cores with six of its CDNA 3 compute dies, good for 61.3 teraFLOPS of vector and 122.6 teraFLOPS of matrix FP64 performance.

That compute is fed by 128 GB of coherent HBM3 memory capable of 5.3 TBps of bandwidth, which is shared by both the CPU and GPU dies – no DDR5 to be found here. For El Cap, HPE deployed 11,136 nodes, each with four MI300As, and each APU is backed by its own 200 Gbps Slingshot-11 interconnect. In total, LLNL's biggest iron boasts 5.4 petabytes of HBM3 memory.
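For the curious, a quick back-of-the-envelope check, sketched here in Python purely for illustration, shows how those headline figures follow from the per-APU numbers above, and why the memory total works out to 5.4 petabytes when counted in binary units:

apus = 44_544            # MI300A count quoted above
fp64_vector_tf = 61.3    # teraFLOPS of FP64 vector per APU
hbm_per_apu_gib = 128    # GiB of HBM3 per APU
apus_per_node = 4

nodes = apus // apus_per_node               # 11,136 nodes
peak_ef = apus * fp64_vector_tf / 1e6       # teraFLOPS -> exaFLOPS
hbm_pib = apus * hbm_per_apu_gib / 1024**2  # GiB -> PiB

print(f"nodes: {nodes:,}")                          # 11,136
print(f"peak vector FP64: {peak_ef:.2f} exaFLOPS")  # ~2.73, most of the 2.79 EF Rpeak,
                                                    # with the Zen 4 cores likely making up the rest
print(f"total HBM3: {hbm_pib:.1f} PB (binary)")     # ~5.4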

According to Bronis R de Supinski, CTO of Livermore Computing at LLNL, MI300A's memory coherency between the CPU and GPU "significantly simplifies programming and optimization."

In addition to being the most powerful publicly known supercomputer, El Capitan is relatively efficient, achieving 58.89 gigaFLOPS per watt. That won't put El Cap at the top of the Green500, which is still dominated by smaller Nvidia-powered systems, but the most efficient HPC clusters tend to be fairly small for a reason, so claiming 18th place on the Green500 at this scale is quite respectable.
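That efficiency figure also gives a rough sense of how much power the machine pulled during the run; a quick estimate, not an official measurement:

rmax_gflops = 1.742e9        # 1.742 exaFLOPS expressed in gigaFLOPS
eff_gflops_per_watt = 58.89  # the Green500 efficiency quoted above

power_mw = rmax_gflops / eff_gflops_per_watt / 1e6
print(f"implied power draw: ~{power_mw:.0f} MW")  # roughly 30 MW during the HPL run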

Untapped potential

El Capitan's arrival is perhaps bittersweet for Argonne National Lab's Aurora supercomputer – another HPE system – which, despite 2 exaFLOPS of peak theoretical performance on tap, never claimed the top spot on the biannual ranking and, it seems, never will.

Unlike Frontier, which picked up more than 100 petaFLOPS since last fall, Aurora's performance figures remain unchanged since spring, when it became the second US system to break the exaFLOP barrier.

El Capitan, on the other hand, may yet be leaving some performance on the table. At 1.74 exaFLOPS, it has only achieved about 62 percent of its 2.79 exaFLOPS of peak theoretical performance. For a first run of HPL, efficiency like this isn't unusual. When Frontier made its debut in 2022, it only managed 65 percent of peak. A year and a half later, Oak Ridge had pushed that to 70 percent of peak. Curiously, in this latest run, Frontier's performance is up but its efficiency is back down closer to 65 percent of peak, suggesting the machine may be running into bottlenecks somewhere in the system.

For El Cap, hitting 2 exaFLOPS of real-world performance would seem like the natural target. Alas, doing so would require achieving 72 percent of peak – or more hardware and likely more power.
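To put those efficiency figures side by side, here's the same Rmax-over-Rpeak arithmetic for the runs discussed above; Frontier's peak values are taken from its published Top500 entries and are approximate:

runs = {
    "El Capitan (Nov 2024)":  (1.742, 2.79),  # (Rmax, Rpeak) in exaFLOPS
    "Frontier (debut, 2022)": (1.102, 1.69),
    "Frontier (Nov 2024)":    (1.353, 2.06),
}
for name, (rmax, rpeak) in runs.items():
    print(f"{name}: {rmax / rpeak:.0%} of peak")

# And the bar El Capitan would need to clear to hit 2 exaFLOPS on its current peak:
print(f"2 EF target: {2 / 2.79:.0%} of peak")  # ~72%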

We're told that El Capitan, which we'll remind you is tasked with ensuring the viability of the US national nuclear arsenal, not running Linpack, might get one more run.

"We're not going to spend endless amounts of cycles trying to optimize Linpack performance," de Supinski explained. "That's not what we bought the system for. I anticipate that we will likely run Linpack at full scale one more time, probably around the time that we move the system to our classified network. We expect that we can probably get higher performance at that point."

El Cap isn't the only MI300A-based system to make the list, just the largest. At roughly a tenth the size, El Cap's smaller sibling, Tuolumne, managed 208 petaFLOPS out of a peak theoretical 288 petaFLOPS. That suggests the architecture underpinning El Capitan can reach the kind of efficiency needed to break the two-exaFLOP barrier, and that it's likely El Capitan's sheer scale working against it.
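Tuolumne, in fact, already clears the bar El Capitan would need. A rough comparison:

tuolumne_eff = 208 / 288   # Rmax / Rpeak from the figures above
el_cap_target = 2 / 2.79   # what El Capitan would need for 2 exaFLOPS
print(f"Tuolumne:      {tuolumne_eff:.1%}")   # ~72.2%
print(f"El Cap target: {el_cap_target:.1%}")  # ~71.7%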

A European shakeup

Frontier isn't the only system being dethroned this round. After 2.5 years as Europe's most powerful supercomputer, LUMI has also fallen from fifth to eighth place.

Taking its place in fifth is the all-new HPC6 system located at Eni S.p.A's datacenter in Ferrera Erbognone, Italy. Based on the same underlying platform as Frontier, with its mix of third-gen Epycs, MI250X accelerators, and Slingshot-11 interconnects, the machine managed 477 petaFLOPS in the HPL benchmark, putting it just ahead of Japan's all-Arm-CPU-based Fugaku system, which sits in sixth with 442 petaFLOPS.

It's not just Italy's HPC6 either. Switzerland's Nvidia Grace Hopper Superchip-based Alps system has climbed to seventh, overtaking LUMI as well. The shakeup is thanks in no small part to upgrades that pushed Alps' real-world performance from 270 petaFLOPS this spring to 434.9 petaFLOPS today.

Nvidia's GH200 Superchips have proven to be remarkably efficient. EuroHPC's Jedi system remains the greenest of the bunch at 72.73 gigaFLOPS per watt, and offers a glimpse of how the future Jupiter system could perform when it comes online.

Finally, in ninth place, just behind LUMI, is Italy's Leonardo supercomputer, which was built by Eviden (formerly Atos) and is capable of churning out 241 petaFLOPS of FP64 grunt.

While Europe claims four of the ten most powerful systems on the Top500, the continent has yet to cross the exascale barrier – at least not publicly. Jupiter is slated to be the first serious contender for the title. But, as we've seen with Aurora, peak performance doesn't always translate to real-world benchmarks, especially not in HPL.

No Colossus, yet

While last fall's results bore a hyperscale surprise with Microsoft's Eagle system sliding into third ahead of Riken's aging Fugaku cluster, this November's ranking offered no such revelations – none that broke into the top ten anyway.

Eagle remains a potent submission, now claiming the number four spot, but it appears to have been a bit of a one-off.

With so many AI datacenters coming online toting tens of thousands of H100s, we would have expected to see more submissions like it. We suppose scaling a cluster to 10,000 GPUs, let alone the 30,000-plus accelerators needed to compete with the US national labs' best, may not have been worth the time and effort – not when those machines could be generating revenue on training and inference runs.

Some had hoped to see Elon Musk's xAI submit an HPL run from its newly minted Colossus supercomputer and its cluster of 100,000 Nvidia H100 GPUs. It certainly wouldn't be out of character for Musk.

If fully networked, the machine would have peak FP64 matrix performance of 6.7 exaFLOPS. Unfortunately for hopefuls, the massive AI training cluster was a no-show. Here's hoping for next spring.
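That 6.7 exaFLOPS figure is easy enough to reproduce, assuming Nvidia's quoted FP64 Tensor Core throughput of roughly 67 teraFLOPS per H100 SXM part:

gpus = 100_000
fp64_matrix_tflops = 67  # per H100 SXM, FP64 Tensor Core, per Nvidia's spec sheet
print(f"peak FP64 matrix: {gpus * fp64_matrix_tflops / 1e6:.1f} exaFLOPS")  # ~6.7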

China recedes into the shadows

One rather important element to all of this is that the Top500 is a ranking of publicly known supercomputers, and there are plenty of behemoths out there that stay off the list precisely because their operators never submit results.

In China, we're aware of at least two such machines: the Sunway Oceanlite and Tianhe-3 systems are both said to have exceeded the exaFLOP barrier in the Linpack benchmark. Unfortunately, no formal submission has ever been made for either.

And so, while the Middle Kingdom retains a large stake in the Top500, the number of submissions has dwindled. In November's ranking, China didn't introduce any new machines, and the number of Chinese systems on the list dropped from 80 to 63. In fact, at the current pace, Germany, which now fields 41 supers, could well overtake China.

US trade policy has undoubtedly contributed to this phenomenon. Over the past few years, Uncle Sam has grown increasingly hostile toward Chinese efforts to advance AI in particular. Many of China's national supercomputing centers have landed on the US Entity List, barring them from buying Intel, AMD, or Nvidia parts.

As a result, many Chinese supers are based on homegrown architectures. However, even those parts are unlikely to be fabbed domestically. The US has previously taken advantage of this fact to disrupt Chinese GPU vendors like Biren and Moore Threads.

And so it ultimately becomes a game of shadows. The Top500 may provide an opportunity to measure progress in scientific computing, but for China it isn't a necessary step on the way to its ultimate goal. ®
