
AMD's silicon surge

17 April 2024


TensorWave says AMD outperforms Nvidia and costs less

TensorWave, a burgeoning bit barn operator, has recently installed systems powered by AMD's Instinct MI300X, which it intends to lease for significantly less than it costs to access Nvidia accelerators.

TensorWave co-founder Jeff Tatarchuk is confident in AMD's latest accelerators. Notably, they are readily available, which is more than can be said for Nvidia gear, and TensorWave has procured a large number of them.

TensorWave aspires to have 20,000 MI300X accelerators operational across two sites by the end of 2024 and plans to introduce additional liquid-cooled systems the following year.

AMD's most recent AI chip outpaces Nvidia's much sought-after H100. "In terms of sheer specifications, the MI300X outshines the H100," Tatarchuk remarked.

Unveiled at AMD's Advancing AI event in December, the MI300X represents the company's most sophisticated accelerator. The 750W chip employs advanced packaging techniques to amalgamate 12 chiplets (20 when including the HBM3 modules) into a single GPU that is purportedly 32 per cent swifter than Nvidia's H100.

Beyond its superior floating-point capabilities, the chip also features an expansive 192GB of HBM3 memory, capable of delivering 5.3TB/s of bandwidth, compared to the H100's 80GB and 3.35TB/s.

As demonstrated by Nvidia's H200, a variant of the H100 boosted with HBM3e, memory bandwidth significantly influences AI performance, particularly when running inference on large language models.
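
As a rough illustration of why that matters, the back-of-envelope sketch below estimates the per-GPU decode ceiling for a bandwidth-bound model using the published bandwidth figures. The 70 billion-parameter FP16 model is an assumption for illustration rather than anything TensorWave has quoted, and real throughput also depends on batching, KV-cache traffic and kernel efficiency.

    # Rough decode ceiling for a memory-bandwidth-bound LLM: each generated
    # token streams the full set of weights from HBM at least once.
    # Bandwidth figures are published specs; the 70B FP16 model is assumed.
    def peak_tokens_per_sec(bandwidth_tb_s, params_billion, bytes_per_param=2):
        model_bytes = params_billion * 1e9 * bytes_per_param   # FP16 weights
        return (bandwidth_tb_s * 1e12) / model_bytes

    for name, bw in [("MI300X", 5.3), ("H100", 3.35)]:
        print(f"{name}: ~{peak_tokens_per_sec(bw, 70):.0f} tokens/s ceiling for a 70B FP16 model")

On those numbers the MI300X has roughly 60 per cent more bandwidth headroom per GPU, and its 192GB capacity means a 140GB set of FP16 weights fits on a single card, where an 80GB H100 would need at least two.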

Like Nvidia's HGX and Intel's OAM designs, the standard configuration for AMD's latest GPU calls for eight accelerators per node.

This is precisely the setup that TensorWave's team is diligently assembling.

"We have hundreds being installed now, with thousands more scheduled in the upcoming months," Tatarchuk disclosed.

In a social media post, TensorWave showcased what seemed to be three 8U Supermicro AS-8125GS-TNMR2 systems. This prompted questions about whether TensorWave's racks were constrained by power or thermal limitations, as it's not uncommon for these systems to consume over 10kW when fully operational.

TensorWave has not yet completed the installations and aims for four nodes per rack, with a total capacity of approximately 40kW. These systems will utilise rear door heat exchangers (RDHx) for cooling—a technology that datacentre operators have increasingly sought after to accommodate denser GPU clusters.
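
The arithmetic behind that rack budget is simple enough to spell out; the figures below come straight from the numbers above, with roughly 10kW per fully loaded node treated as a working approximation.

    # Rack power budget: four eight-GPU MI300X nodes per rack at roughly
    # 10kW per node under full load (approximate figure cited above).
    nodes_per_rack = 4
    kw_per_node = 10
    print(f"~{nodes_per_rack * kw_per_node} kW per rack, hence rear door heat exchangers")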

TensorWave COO Piotr Tomasik acknowledged the challenges in capacity, particularly concerning RDHx. "We've encountered numerous capacity issues, even with ancillary datacentre equipment. However, we've been successful so far and remain optimistic about our deployment capabilities," he stated.

Looking ahead, TensorWave is setting its sights on direct-to-chip cooling, which can be challenging to implement in datacentres not initially designed for GPU housing. "We're eager to implement direct-to-chip cooling later this year, as it promises greater ease and efficiency with high-density setups," Tomasik expressed.

Another hurdle is the market's confidence in AMD's performance. While there is enthusiasm for AMD as an alternative to Nvidia, customers remain uncertain about achieving comparable performance.

"There's a degree of scepticism regarding whether AMD's performance will match what we're accustomed to with Nvidia," Tatarchuk acknowledged.

TensorWave will launch its MI300X nodes using RDMA over Converged Ethernet (RoCE) to expedite system deployment. These bare metal systems will be available for fixed lease periods, starting at $1/hr/GPU.
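
At that entry price, a simple illustrative calculation for a full eight-GPU node runs as follows; lease lengths and any volume discounts are not specified, so the monthly figure is only an example.

    # Illustrative cost of an eight-GPU MI300X node at the quoted $1/hr/GPU floor.
    gpus_per_node = 8
    price_per_gpu_hour = 1.00        # USD, TensorWave's advertised entry price
    hours_per_month = 30 * 24        # 30-day month assumed for illustration
    print(f"~${gpus_per_node * price_per_gpu_hour * hours_per_month:,.0f} per node per month")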

In the long term, TensorWave plans to introduce a cloud-like orchestration layer for resource provisioning. The company also aims to integrate GigaIO's PCIe 5.0-based FabreX technology, which can interconnect up to 5,750 GPUs within a single domain, encompassing over a petabyte of high-bandwidth memory.
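
That petabyte figure squares with the MI300X's published memory capacity, as the quick sanity check below shows (decimal units assumed).

    # Sanity check: 5,750 MI300X GPUs at 192GB of HBM3 apiece.
    gpus = 5_750
    hbm_gb_per_gpu = 192
    print(f"{gpus * hbm_gb_per_gpu / 1_000_000:.2f} PB of aggregate high-bandwidth memory")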

These so-called TensorNODEs are derived from GigaIO's SuperNODE architecture, which was demonstrated last year and connects up to 32 AMD MI210 GPUs. This design allows a single CPU head node to manage a far larger number of accelerators than the typical eight found in GPU nodes.

This strategy diverges from Nvidia's preferred NVLink design, which interlinks multiple Superchips into a large GPU. Although NVLink boasts faster bandwidth, reaching 1.8TB/s, it is limited to configurations of up to 576 GPUs.
