Microsoft Update: Building Resilient AI Supercomputer Networks withMultipath Reliable Connection (MRC)

Microsoft has shared new details on the networking innovations powering Fairwater, its next-generation AI supercomputer designed to support extremely large-scale AI training workloads. The announcement highlights how Microsoft is rethinking AI infrastructure with a new approach to resilient, fault-tolerant networking for AI supercomputers.

At the center of this innovation is Multipath Reliable Connection (MRC), a new AI-optimized networking technology developed in collaboration with AMD, Broadcom, Intel, NVIDIA, and OpenAI.

Key Highlights:

AI-Optimized Networking at Massive Scale

Microsoft’s new networking architecture is designed to support 100,000+ GPUs while maintaining high reliability and performance for large AI training jobs.

Traditional network failures such as:

  • Packet loss
  • Slow links
  • Switch failures
  • Congestion

can severely impact AI model training at scale. Microsoft’s MRC technology is designed to ensure AI workloads continue running efficiently even during network disruptions.

What Makes MRC Different

Unlike traditional networking approaches that depend heavily on dynamic routing and lossless fabrics, MRC uses:

  • Multi-path networking
  • Endpoint-driven traffic control
  • Intelligent packet distribution
  • Fast recovery mechanisms

This allows the network to automatically adapt to failures without interrupting large AI training operations.

Improved AI Infrastructure Resilience

Microsoft’s architecture enables:

  • Graceful performance degradation during failures
  • Faster recovery from network disruptions
  • Better GPU utilization
  • Reduced downtime during large-scale AI workloads

The system is designed around the idea that failures are expected, not exceptional.

Static SRv6 Routing for Predictable Performance

Microsoft also introduced the use of static SRv6 routing, reducing dependency on complex dynamic routing systems and improving network predictability, diagnostics, and operational stability at AI supercomputer scale.

Open Source Contributions

To help accelerate AI infrastructure innovation across the industry, Microsoft is open-sourcing several MRC-related technologies, including:

  • MRC transport APIs
  • NCCL MRC plugins
  • SRv6 enhancements for SONiC networking

This supports broader adoption of resilient AI networking technologies across enterprise and hyperscale environments.

Why This Matters

As organizations increasingly invest in AI, large language models, and high-performance computing, resilient infrastructure becomes critical. Microsoft’s latest innovations demonstrate how AI infrastructure is evolving to support:

  • Faster AI model training
  • Higher GPU efficiency
  • Reduced operational disruption
  • Scalable AI workloads at enterprise scale

This marks another major step in Microsoft’s broader strategy to build the next generation of AI-ready cloud and supercomputing infrastructure.

Apply Job