Articles by tag "GPUDirect RDMA"

2 Items

  1. NVIDIA ConnectX-7 MCX755106AS-HEAT Deployment & Compatibility Guide
     When you are orchestrating a multi-node GPU cluster for Large Language Model (LLM) training and notice sudden training epoch stalls or microburst packet drops during All-Reduce collective communication phases, the bottleneck is rarely the GPU itself; it is almost always the network interface ...
  2. Is NVIDIA MCX755106AS-HEAT ConnectX-7 SmartNIC Worth It for AI Servers?
     When you are executing a multi-node LLM training run across a cluster of H100 or A100 GPU servers and start noticing sudden, unexplained training epoch stalls, the culprit is rarely the compute silicon. Instead, it is almost always a networking bottleneck: packet drops under heavy RoCEv2 (RDMA ...
