Nvidia Won The AI Training Race, But Inference Is Still Anyone's Game

Comment With the exception of custom cloud silicon, like Google's TPUs or Amazon's Trainium ASICs, the vast majority of AI training clusters being built today are powered by Nvidia GPUs. But while Nvidia may have won the AI training battle, the inference fight is far from decided.

Up to this point, the focus has been on building better, more capable, and trustworthy models. Most inference workloads, meanwhile, have taken the form of proofs of concept and low-hanging fruit like AI chatbots and image generators. Because of this, most AI compute has been optimized for training rather than inference.

But as these models get better, as applications become more complex, and as AI infiltrates deeper into our daily lives, this ratio is poised to change dramatically over the next couple of years. In light of this change, many of the chip companies that missed the boat on AI training are now salivating at the opportunity to challenge Nvidia's market dominance.

Finding a niche

Compared to training, which pretty much universally requires gobs of compute, often spanning entire data halls and consuming megawatts of power for days or weeks at a time, inference is a far more diverse workload.

When it comes to inference, performance is predominantly determined by three core factors:

  • Memory capacity dictates what models you can run.
  • Memory bandwidth influences how quickly each token of the response is generated.
  • Compute affects how quickly the model can chew through a prompt and how many requests it can serve at a time.

But which of these you prioritize depends heavily on your model's architecture, parameter count, hosting location, and target audience.
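To see how these factors trade off, a back-of-envelope roofline helps: during token generation, a memory-bandwidth-bound model must stream every active parameter from memory for each new token, so bandwidth divided by model size bounds single-stream speed. A rough sketch, with an illustrative H100-class bandwidth figure:

```python
# Back-of-envelope single-stream decode estimate for a
# memory-bandwidth-bound LLM. Ignores compute, KV-cache traffic,
# and interconnect overheads; the bandwidth figure is illustrative.

def decode_tokens_per_sec(params_billions: float,
                          bytes_per_param: float,
                          bandwidth_tb_s: float) -> float:
    """Rough upper bound on tokens/sec for a single request."""
    bytes_per_token = params_billions * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / bytes_per_token

# A 70B-parameter model at 8-bit weights on ~3.35 TB/s of HBM:
print(round(decode_tokens_per_sec(70, 1, 3.35)))  # ≈ 48 tokens/sec
```

The same arithmetic explains why SRAM-heavy designs post such large speed-ups: raising the bandwidth term raises the ceiling directly.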

For example, a small latency-sensitive model might be better suited to a low-power NPU or even a CPU, while a multi-trillion-parameter LLM is going to need datacenter-class hardware with terabytes of incredibly fast memory.

The latter example is exactly what AMD appears to have targeted with its MI300-series GPUs, which boast between 192 GB and 256 GB of speedy HBM. That extra capacity means AMD can cram larger frontier models into a single server than Nvidia can, which might explain why companies like Meta and Microsoft were so keen to adopt them.

On the other end of the spectrum, companies like Cerebras, SambaNova, and Groq – not to be confused with xAI's Grok series of models – have prioritized speed, leaning on their SRAM-heavy chip architectures and tricks like speculative decoding to run models five, ten, or even 20 times faster than the best GPU-based inference-as-a-service vendors have managed to achieve thus far.

With the rise of chain-of-thought reasoning models, which might need to generate thousands of words – or more specifically, tokens – to answer a question, lightning-fast inference goes from being a neat gimmick to something legitimately useful.

So it's no surprise that startups like d-Matrix and others are looking to get in on the "fast inference" game as well. The company expects its Corsair accelerators, due out in Q2, will be able to run models like Llama 70B at latencies as low as 2 ms per token, which, by our estimate, works out to 500 tokens a second. The company has set its sights on even larger models for its next-gen Raptor series of chips, which we're told will use vertically stacked DRAM to boost memory capacity and bandwidth.
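That latency-to-throughput conversion is simple arithmetic worth making explicit: a per-token decode latency inverts directly into a single-stream generation rate.

```python
# Converting a per-token decode latency into a single-stream
# generation rate: 2 ms/token inverts to 500 tokens/second.

def tokens_per_sec(latency_ms_per_token: float) -> float:
    return 1000.0 / latency_ms_per_token

print(tokens_per_sec(2.0))  # 500.0
```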

At the lower end of the spectrum, we've seen a growing number of vendors like Hailo AI, EnCharge, and Axelera developing low-power, high-performance chips for the edge and PC markets.

Speaking of the PC market, more established chipmakers like AMD, Intel, Qualcomm, and Apple are racing to integrate ever more powerful NPUs into their SoCs to support AI-augmented workflows.

Finally, we can't ignore the cloud and hyperscaler providers, many of which will continue buying Nvidia hardware while simultaneously hedging their bets on in-house silicon.

Don't count Nvidia out yet

While Nvidia certainly is facing more competition than it ever has, it's still the biggest name in AI infrastructure. With its latest generation of GPUs, it's clearly preparing for the transition to large-scale inference deployments.

In particular, Nvidia's GB200 NVL72, unveiled last year, expanded its NVLink compute domain to 72 GPUs, totaling more than 1.4 exaFLOPS and 13.5 TB of memory.

Prior to this, Nvidia's most powerful systems topped out at just eight GPUs per node and between 640 GB and 1.1 TB of VRAM. This meant that large, frontier-class models like GPT-4 had to be distributed across multiple systems, not just to fit all the parameters in memory but to achieve reasonable throughput.

If Nvidia's projections are to be believed, the NVL72's high-speed interconnect fabric will enable it to deliver a 30x improvement in throughput for 1.8-trillion-parameter mixture-of-experts models – like GPT-4 – compared to an eight-node, 64-GPU cluster of H100s.
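The capacity arithmetic behind the bigger NVLink domain is easy to check. Assuming 8-bit weights (a simplification that ignores KV cache and activations), a 1.8-trillion-parameter model's weights alone overflow an eight-GPU node's memory but fit comfortably in a single NVL72 domain:

```python
# Does a model's weight footprint fit in one coherent memory pool?
# Memory figures are from the article (13.5 TB for NVL72, ~1.1 TB
# for an eight-GPU node); 1 byte/param (FP8) is an assumption.

def weights_tb(params_trillions: float, bytes_per_param: float) -> float:
    # 1e12 params at N bytes each = that many TB (1 TB = 1e12 bytes)
    return params_trillions * bytes_per_param

def fits(params_trillions: float, bytes_per_param: float,
         pool_tb: float) -> bool:
    return weights_tb(params_trillions, bytes_per_param) <= pool_tb

# 1.8T parameters at FP8 -> 1.8 TB of weights alone:
print(fits(1.8, 1, 1.1))    # False: exceeds an 8-GPU node's 1.1 TB
print(fits(1.8, 1, 13.5))   # True: fits in an NVL72 domain
```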

More importantly, these are general-purpose GPUs, which means they're not limited to just training or inference. They can be used to train new models and later re-tasked to run them – something that isn't necessarily true of every silicon upstart vying for a piece of Jensen's turf.

With GTC set to kick off next week, Nvidia is expected to detail its next-gen Blackwell-Ultra platform, which, if it's anything like its H200 generation of GPUs, should be tuned specifically with inference in mind.

Given the launch of Nvidia's Blackwell-based RTX cards earlier this year, we also wouldn't be surprised to see an L40 successor or even some refreshed workstation-class cards.

In the end, inference is a tokens-per-dollar game

Whatever hardware AI service providers end up packing their bit barns with, the economics of inference ultimately boil down to tokens per dollar.
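As a sketch of what that metric looks like in practice, a provider amortizes hardware and power into an hourly cost and divides by sustained aggregate throughput. The numbers below are hypothetical placeholders, not vendor pricing:

```python
# "Tokens per dollar" as simple unit economics: hourly cost of an
# accelerator divided by the tokens it serves per hour, expressed
# as cost per million tokens. All figures are hypothetical.

def cost_per_million_tokens(hourly_cost_usd: float,
                            tokens_per_sec: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_cost_usd / tokens_per_hour * 1e6

# A $4/hour accelerator sustaining 2,000 aggregate tokens/sec:
print(round(cost_per_million_tokens(4.0, 2000), 3))  # ≈ $0.556 per 1M tokens
```

The takeaway is that a pricier chip can still win this game if its throughput grows faster than its cost – which is precisely the pitch every inference upstart is making.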

We're not saying developers won't be willing to pay extra for access to the latest models or higher throughput, especially if it helps their app or service stand out.

But from a developer standpoint, these services amount to little more than an API spigot to which they connect their app, allowing tokens to flow on demand.
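In practice, that spigot is usually just an HTTP endpoint speaking the OpenAI chat-completions schema. A minimal sketch of the request body a developer sends – the model name here is a hypothetical placeholder:

```python
import json

# What "OpenAI-compatible" looks like from the client side: the same
# request body regardless of which silicon serves it. The model name
# is a hypothetical placeholder.

def chat_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

body = chat_request("llama-70b", "Explain speculative decoding in one line.")
print(json.dumps(body, indent=2))
```

Swap the base URL and the same client code talks to a GPU farm, a wafer-scale engine, or a rack of custom ASICs.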

The fact they're using Nvidia's Blackwell parts or some bespoke accelerator you've never heard of is completely abstracted behind what usually ends up being an OpenAI-compatible API endpoint. ®
