Microsoft has shared new details on the networking innovations powering Fairwater, its next-generation AI supercomputer designed to support extremely large-scale AI training workloads. The announcement highlights how Microsoft is rethinking AI infrastructure with a new approach to resilient, fault-tolerant networking for AI supercomputers.
At the center of this innovation is Multipath Reliable Connection (MRC), a new AI-optimized networking technology developed in collaboration with AMD, Broadcom, Intel, NVIDIA, and OpenAI.
Key Highlights:
AI-Optimized Networking at Massive Scale
Microsoft’s new networking architecture is designed to support 100,000+ GPUs while maintaining high reliability and performance for large AI training jobs.
Traditional network failures such as:
- Packet loss
- Slow links
- Switch failures
- Congestion
can severely impact AI model training at scale. Microsoft’s MRC technology is designed to ensure AI workloads continue running efficiently even during network disruptions.
What Makes MRC Different
Unlike traditional networking approaches that depend heavily on dynamic routing and lossless fabrics, MRC uses:
- Multi-path networking
- Endpoint-driven traffic control
- Intelligent packet distribution
- Fast recovery mechanisms
This allows the network to automatically adapt to failures without interrupting large AI training operations.
Improved AI Infrastructure Resilience
Microsoft’s architecture enables:
- Graceful performance degradation during failures
- Faster recovery from network disruptions
- Better GPU utilization
- Reduced downtime during large-scale AI workloads
The system is designed around the idea that failures are expected, not exceptional.
Static SRv6 Routing for Predictable Performance
Microsoft also introduced the use of static SRv6 routing, reducing dependency on complex dynamic routing systems and improving network predictability, diagnostics, and operational stability at AI supercomputer scale.
Open Source Contributions
To help accelerate AI infrastructure innovation across the industry, Microsoft is open-sourcing several MRC-related technologies, including:
- MRC transport APIs
- NCCL MRC plugins
- SRv6 enhancements for SONiC networking
This supports broader adoption of resilient AI networking technologies across enterprise and hyperscale environments.
Why This Matters
As organizations increasingly invest in AI, large language models, and high-performance computing, resilient infrastructure becomes critical. Microsoft’s latest innovations demonstrate how AI infrastructure is evolving to support:
- Faster AI model training
- Higher GPU efficiency
- Reduced operational disruption
- Scalable AI workloads at enterprise scale
This marks another major step in Microsoft’s broader strategy to build the next generation of AI-ready cloud and supercomputing infrastructure.