China Approves Nvidia’s Mellanox Deal, Could Help Fight AMD and Intel in Exascale

  • Post Author:
  • Post Category:Blog

Nvidia Turing Silicon

(Image credit: Nvidia)

Nvidia announced today that its $6.9 billion acquisition of Mellanox had cleared the final hurdle from Chinese antitrust authorities, paving the way for the graphics card company to take control of the networking giant by April 27. The move could help Nvidia bolster its presence in the HPC and supercomputer segments as the company finds itself without any new notable exascale contracts despite its overwhelming market presence in GPU-driven AI and compute workloads in the data center. Meanwhile, Intel and AMD have both been tapped for the Department of Energy’s (DoE) next generation of world-leading supercomputers. 

Mellanox is known for its commanding presence in Ethernet and InfiniBand products for the data center, so while the acquisition obviously diversifies Nvidia’s portfolio, it also creates synergies that are larger than they appear on the surface.

Tying accelerators, like GPUs, together into one seamless architecture that spans across a massive number of supercomputer nodes is the winning formula for reaching exascale-class levels of performance within an acceptable power threshold and physical footprint, but AMD and Intel seem to be winning the battle of these interconnected designs. 

AMD and Intel, by virtue of having both CPU and GPU production in-house, can tie the CPU and GPU together in much more sophisticated ways than Nvidia due to their purpose-built designs. AMD has won two of the three exascale contracts with the DoE, but the Frontier system is the first in line for deployment in 2021. This 1.5-exaflop system will be powered by AMD’s EPYC Rome CPUs and Radeon Instinct GPUs, but we recently learned that the CPUs and GPUs will support memory coherency via the Infinity Fabric.

Memory coherency between the CPUs and GPUs reduces data movement to improve performance, reduce latency, and boost performance per watt. It also eases programming requirements. Again, this is a capability that Nvidia currently doesn’t have because it doesn’t design processors, though it has adopted the CXL specification that should include that feature in future revisions. We won’t see that specification until PCIe 5.0 comes to market, though.  

But there’s another interesting tidbit. Cray, an HPE Company, designs the chassis and Slingshot networking fabric for all of the exascale supercomputers, and the company has largely taken the lead in the HPC market due to its cutting-edge system and networking design. Cray revealed some interesting information, seen in the slide below, during a recent presentation at the Rice Oil and Gas HPC Summit. 

(Image credit: Rice Oil and Gas HPC Summit)

The block in the lower left hand corner represents Nvidia’s coming supercomputer design, which uses Cray’s Shasta blades and Slingshot networking for the Perlmutter supercomputer. Here we can see the Nvidia Volta Next GPUs that power this pre-exascale system that will come online in spring 2021, meaning this is Nvidia’s latest known design. Notice that the GPUs reside on a fabric tied to the EPYC Rome CPUs, but the “To Slingshot” markings up top denote that the network traffic travels into the CPUs and then through to the GPUs. Remember, Slingshot is an Ethernet-based fabric that is the key to Cray’s success in developing highly-scalable supercomputers. That data from the network, which flows in and out at amazing rates in supercomputers, has to pass through the CPU before landing in the Nvidia GPUs. 

In contrast, the bottom right corner represents AMD’s next architecture that powers the Frontier supercomputer. Here we can see that the Slingshot networking fabric never touches the EPYC Milan CPU, instead flowing directly into the GPUs. That means those accelerators will experience higher throughput and lower latency to the network, all without tying up the fabric that provides memory coherency with the EPYC processor. 

There are networking technologies, like RDMA, that can cut through a CPU to offer direct memory access without involving the compute cores, but that data still has to flow over a secondary fabric in Nvidia’s design. There’s no substitute for the capabilities enabled by having a high-bandwidth network connection directly into the GPU, arguably making the AMD implementation superior, at least conceptually. It’s probably no coincidence that AMD’s design scales to 1.5 exaflops of performance, while the Nvidia design in Perlmutter hits 100 petaflops.

As such, Nvidia’s acquisition of Mellanox, a networking leader with an IP warchest, could give Nvidia access to technologies to create tighter integrations such as this. Granted, Nvidia has its own exceptional switching technology with its NVSwitch that powers what the company calls the “World’s Largest GPU,” but that interface is primarily designed for GPU-to-GPU transfers. It should go without saying, but Mellanox’s years of experience in networking and topologies will also help with future iterations of NVSwitch, too. 

(Image credit: AMD)

But AMD and Intel aren’t sitting still. AMD’s next-next-gen EPYC Genoa CPUs and Radeon Instinct CPUs will power what will be the world’s fastest supercomputer, the two-exaflop El Capitan, and it comes with an even more sophisticated third generation of the Infinity Architecture. Again, this relies upon an amazing amount of integration between the CPU and GPU. The illustrations include eight GPUs tied into a mesh/taurus-type topology that’s connected to the CPUs. Imagine the possibilities if this architecture also has direct networking connections to each of the eight GPUs. 

Image 1 of 2

(Image credit: Intel)

Image 2 of 2

(Image credit: Intel)

Intel is also on the same path. Intel’s as-yet unproven Ponte Vecchio GPUs come with a new interconnected node design that leans on its new XE Memory Fabric, complemented by a ‘Rambo Cache,’ to tie together CPUs and GPUs into a single coherent memory space. This design will power the one-exaflop Aurora supercomputer.

Meanwhile, Nvidia might suffer in the supercomputer realm because it doesn’t produce both CPUs and GPUs and, therefore, cannot enable similar levels of integration. Is this type of architecture, and the underlying unified programming models, required to hit exascale-class performance within acceptable power envelopes? That’s an open question, but both AMD and Intel have won exceedingly important contracts for the U.S. DOE’s exascale-class supercomputers (the broader server ecosystem often adopts the winning HPC techniques), but Nvidia hasn’t made any announcements about such wins despite their dominating position for GPU-accelerated compute in the HPC and data center space.

As such, we can expect Nvidia to expand into more holistic rack-level architectures to complement its prowess in GPU compute. Mellanox has plenty of promising new products, including the chiplet-based Spectrum-3 line in the works, and Nvidia’s recent acquisition of SwiftStack (for an undisclosed sum) also opens the possibility for complementary storage software that enables large-scale AI systems. This object-based storage company has plenty of experience with vast network-attached data repositories used for AI workloads, which ties in nicely with Nvidia’s burgeoning portfolio.  

The industry is rife with speculation that all that’s left for Nvidia to do is to roll out its own CPUs. ARM, anyone? 

Nvidia plans to take over Mellanox on April 27th.