Articles by tag "GPUDirect RDMA"

NVIDIA ConnectX-7 MCX755106AS-HEAT Deployment & Compatibility Guide
Selene Gong

When you are orchestrating a multi-node GPU cluster for large language model (LLM) training and notice sudden training epoch stalls or microburst packet drops during All-Reduce collective communication phases, the bottleneck is rarely the GPU itself; it is almost always the network interface ...
Is NVIDIA MCX755106AS-HEAT ConnectX-7 SmartNIC Worth It for AI Servers?
Selene Gong

When you are executing a multi-node LLM training run across a cluster of H100 or A100 GPU servers and start noticing sudden, unexplained training epoch stalls, the culprit is rarely the compute silicon. Instead, it is almost always a networking bottleneck: packet drops under heavy RoCEv2 (RDMA ...