Articles by tag "NVIDIA ConnectX-7"

NVIDIA ConnectX-7 MCX755106AS-HEAT Deployment & Compatibility Guide
Selene Gong

When you are orchestrating a multi-node GPU cluster for Large Language Model (LLM) training and notice sudden training epoch stalls or microburst packet drops during All-Reduce collective communication phases, the bottleneck is rarely the GPU itself—it is almost always the network interface ...
Is NVIDIA MCX755106AS-HEAT ConnectX-7 SmartNIC Worth It for AI Servers?
Selene Gong

When you are executing a multi-node LLM training run across a cluster of H100 or A100 GPU servers and start noticing sudden, unexplained training epoch stalls, the culprit is rarely the compute silicon. Instead, it is almost always a networking bottleneck: packet drops under heavy RoCEv2 (RDMA ...
NVIDIA MCX755106AS-HEAT ConnectX-7: QSFP112 Transceiver and Cable Compatibility Guide
Selene Gong

Quick Take: Deploying the 400GbE NVIDIA ConnectX-7 (MCX755106AS-HEAT) requires strict adherence to the QSFP112 form factor and its 112G PAM4 SerDes architecture. This guide resolves critical physical-layer challenges, including FEC mismatches, thermal throttling, and transceiver ...
How to Configure Socket-Direct with NVIDIA MCX755106AS-HEAT ConnectX-7 for Ultra-Low Latency AI Clusters
Selene Gong

Quick Take: Bypass NUMA bottlenecks and eliminate inter-socket UPI/Infinity Fabric latency in high-density AI clusters. This guide provides a step-by-step walkthrough to configure Socket-Direct on the NVIDIA MCX755106AS-HEAT ConnectX-7 NIC, splitting the PCIe Gen 5.0 x16 interface ...