Kurkure is the lead engineer for VMware's performance engineering team and has spent the past five years working on ways to virtualise machine-learning workloads running on accelerators.
According to El Reg, his team squeezed "near or better than bare-metal performance" out of Bidirectional Encoder Representations from Transformers (BERT) and Mask R-CNN — two popular machine-learning workloads — running on virtualised GPUs (vGPUs) connected using Nvidia's NVLink interconnect.
While that might seem like an overdose of acronyms, NVLink enables compute and memory resources to be shared across up to four GPUs over a high-bandwidth mesh fabric operating at 6.25GB/s per lane, compared with PCIe 4.0's 2.5GB/s per lane.
Kurkure's team used it to pool 160GB of GPU memory from the Dell PowerEdge system's four 40GB Nvidia A100 SXM GPUs.
"As the machine learning models get bigger and bigger, they don't fit into the graphics memory of a single chip, so you need to use multiple GPUs," he explained.
He said that with NVLink supported in vSphere, customers have the flexibility to combine multiple GPUs on the same host, so they can support bigger models without a significant communication overhead.
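The pooling arithmetic above can be sketched in a few lines. This is illustrative only, using the figures quoted in the article (four 40GB A100 SXM cards, and the per-lane bandwidths of 6.25GB/s for NVLink versus 2.5GB/s for PCIe 4.0); the 100GB model size is a made-up example, not a workload from Kurkure's tests.

```python
GPU_MEMORY_GB = 40           # per A100 SXM card in the Dell PowerEdge box
NUM_GPUS = 4                 # NVLink mesh of up to four GPUs
NVLINK_GBPS_PER_LANE = 6.25  # per-lane figures quoted in the article
PCIE4_GBPS_PER_LANE = 2.5

def fits_in_pool(model_gb: float, gpus: int = NUM_GPUS) -> bool:
    """True if a model of model_gb fits in the pooled GPU memory."""
    return model_gb <= GPU_MEMORY_GB * gpus

# A hypothetical 100GB model: too big for one card's 40GB,
# but comfortable in the 160GB NVLink-pooled memory.
print(fits_in_pool(100, gpus=1))                   # single GPU: False
print(fits_in_pool(100))                           # pooled: True
print(NVLINK_GBPS_PER_LANE / PCIE4_GBPS_PER_LANE)  # per-lane advantage
```

This is the gist of why NVLink matters here: once a model's parameters exceed a single card's memory, the only options are splitting it across GPUs or spilling to host memory, and the interconnect's bandwidth determines how cheap that split is.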
Kurkure expects most HPC workloads will be virtualised to sort out those pesky performance bottlenecks that leave systems underused.
Kurkure's team was able to achieve performance comparable to bare metal while using just a fraction of the dual-socket system's CPU resources.
"We were only using 16 logical cores out of 128 available," he said. "You could use those CPU resources for other jobs without affecting your machine-learning-intensive graphics modules. This is going to improve your utilization, and bring down the cost of your data centre."