HPE Goes Cray For Nvidia's Blackwell GPUs, Crams 224 Into A Single Cabinet
If you thought Nvidia's 120 kW NVL72 racks were compute dense with 72 Blackwell accelerators, they have nothing on HPE Cray's latest EX systems, which will pack more than three times as many GPUs into a single cabinet.
Announced ahead of next week's Supercomputing conference in Atlanta, Cray's EX154n platform will support up to 224 Nvidia Blackwell GPUs and 8,064 Grace CPU cores per cabinet. That works out to just over 10 petaFLOPS at FP64 for HPC applications or over 4.4 exaFLOPS of sparse FP4 for AI and machine learning workloads, where precision usually isn't as big a deal.
Specifically, each EX154n accelerator blade will feature a pair of 2.7 kW Grace Blackwell Superchips (GB200), each of which is equipped with two Blackwell GPUs and a single 72-core Arm CPU. Those two Superchips will be interconnected by Nvidia's NVL4 reference configuration.
At a rack level, the compute alone will consume upwards of 300 kW, so it goes without saying that, just like past EX systems, HPE's Blackwell blades will be liquid cooled.
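For readers who want to check the cabinet math, here is a rough back-of-the-envelope sketch in Python. It uses only figures quoted above; the blade count per cabinet is our own inference from those numbers, not something HPE has stated, and the variable names are purely illustrative.

```python
# Figures quoted in the article.
GPUS_PER_CABINET = 224
GPUS_PER_SUPERCHIP = 2       # each GB200 Superchip carries two Blackwell GPUs
CORES_PER_SUPERCHIP = 72     # and one 72-core Grace CPU
SUPERCHIPS_PER_BLADE = 2
SUPERCHIP_KW = 2.7
NVL72_GPUS = 72

# Derived values. The blade count is an inference, not an HPE-quoted figure.
superchips = GPUS_PER_CABINET // GPUS_PER_SUPERCHIP
blades = superchips // SUPERCHIPS_PER_BLADE
grace_cores = superchips * CORES_PER_SUPERCHIP
compute_kw = superchips * SUPERCHIP_KW
density_vs_nvl72 = GPUS_PER_CABINET / NVL72_GPUS

print(f"Superchips per cabinet: {superchips}")
print(f"Implied blades per cabinet: {blades}")
print(f"Grace cores per cabinet: {grace_cores}")
print(f"Superchip power per cabinet: {compute_kw:.1f} kW")
print(f"GPU density vs NVL72: {density_vs_nvl72:.2f}x")
```

The core count lands on the quoted 8,064, and the Superchip power alone comes in slightly above 300 kW, which squares with the "upwards of 300 kW" figure.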
In fact, these systems are completely fanless right down to the all-new Slingshot 400 family of Ethernet NICs, cables, and switches. As the name suggests, Slingshot 400 represents a welcome upgrade over its predecessor, pushing bandwidth from 200 to 400 Gbps, bringing it in line with current-gen Ethernet and InfiniBand networking.
HPE's prior-gen Slingshot 200 interconnects have become a mainstay of large-scale supercomputing platforms and are at the heart of the Frontier, Aurora, and Lumi machines to name just a handful.
Unfortunately, anyone looking to get their hands on Cray's super-dense Blackwell systems and speedy Slingshot 400 networking will have to wait a while. Neither is expected to ship until late in 2025.
If conventional CPU-based HPC is more your thing, Cray's fifth-gen Epyc-based EX4252 Gen 2 compute blades are due out next spring and will pack up to eight 192-core Turin-C processors for a total of 98,304 cores per cabinet.
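The same sort of sanity check works for the CPU blades. The sketch below, again in Python, backs out how many EX4252 Gen 2 blades the quoted core count implies; the resulting blade count is our inference and has not been confirmed by HPE.

```python
# Figures quoted in the article.
SOCKETS_PER_BLADE = 8
CORES_PER_SOCKET = 192
CORES_PER_CABINET = 98_304

cores_per_blade = SOCKETS_PER_BLADE * CORES_PER_SOCKET

# divmod shows whether the quoted total divides evenly into whole blades.
implied_blades, leftover_cores = divmod(CORES_PER_CABINET, cores_per_blade)

print(f"Cores per blade: {cores_per_blade}")
print(f"Implied blades per cabinet: {implied_blades} (leftover cores: {leftover_cores})")
```

A leftover of zero means the quoted total corresponds to a whole number of fully populated blades.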
Cray is also upgrading its E2000 storage systems, which it claims will more than double the I/O performance of prior generations thanks to faster PCIe 5.0-based NVMe storage. HPE expects to start shipping these storage arrays in early 2025.
While HPE's Cray EX Platforms promise greater density than a typical server or rack, they aren't exactly the kind of systems that can be deployed in your average datacenter. So HPE is also rolling out a pair of new air-cooled ProLiant Compute servers, which make use of its enterprise-focused iLO lights-out management system.
These systems will be fairly familiar to anyone who's ever seen an Nvidia HGX platform, with both the XD680 and XD685 servers boasting support for eight accelerators of your choice.
Surprisingly, we aren't limited to just Nvidia and AMD GPUs as you might expect. The XD680 actually comes standard with eight Intel Gaudi3 accelerators totaling 1 TB of HBM2e. As we reported in spring, Gaudi3 is quite competitive with the current crop of accelerators. Each is capable of churning out 1.8 petaFLOPS of dense BF16 performance, giving it an edge in compute-bound workloads over the H100, H200, and AMD's MI300X.
Stepping up to HPE's XD685, you have the choice of either eight Nvidia H200s with a combined 1.1 TB of HBM3e or the upcoming Blackwell GPUs – presumably B200 – which should boost memory capacity to 1.5 TB. The former is due out in early 2025, while timing for the Blackwell-based systems remains rather vague.
If Nvidia isn't your style, or you need more memory, HPE is also rolling out a version of the system with AMD's newly launched MI325X. That system, announced alongside the accelerator in October, will boast up to 2 TB of HBM3e memory on board and is set to ship in the first quarter of 2025. ®
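For the curious, the memory totals quoted for the eight-way ProLiant boxes can be reproduced with a few lines of Python. Note that the per-accelerator capacities below are our assumption, taken from the vendors' published specs rather than from HPE's announcement, and the B200 figure in particular may differ in shipping configurations.

```python
ACCELERATORS_PER_SERVER = 8

# Per-accelerator HBM capacity in GB. These are assumptions based on vendor
# specifications, not figures supplied by HPE.
hbm_gb = {
    "Intel Gaudi3": 128,
    "Nvidia H200": 141,
    "Nvidia B200": 192,
    "AMD MI325X": 256,
}

# Total HBM per eight-accelerator server, in decimal terabytes.
totals_tb = {
    name: ACCELERATORS_PER_SERVER * gb / 1000 for name, gb in hbm_gb.items()
}

for name, tb in totals_tb.items():
    print(f"8x {name}: {tb:.1f} TB")
```

Rounded to one decimal place, these line up with the roughly 1 TB, 1.1 TB, 1.5 TB, and 2 TB figures quoted for the respective configurations.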