OpenAI Unveils MRC Protocol to Slash AI Training Network Bottlenecks
OpenAI today released MRC (Multipath Reliable Connection), a new open networking protocol designed to eliminate the hidden bottleneck of data transfers in large-scale AI supercomputers. Developed over two years in partnership with AMD, Broadcom, Intel, Microsoft, and NVIDIA, the specification is published through the Open Compute Project (OCP) for immediate industry adoption.
"Every second of GPU idle time represents real cost and capability loss," an OpenAI spokesperson said. "Our goal is not just to build a fast network, but one that delivers very predictable performance, even in the presence of failures."
With over 900 million weekly ChatGPT users, OpenAI faces mounting pressure to maximize training efficiency. MRC directly tackles the compounding infrastructure challenge of maintaining high utilization across thousands of accelerators.
Background: The Hidden Networking Crisis in AI Training
Training frontier AI models requires millions of parallel data transfers per training step. A single late packet can ripple through the entire job, forcing expensive GPUs into idle waiting. Network congestion, link failures, and device errors also grow more common as cluster sizes increase.
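To see why one late packet matters, consider a toy model in which a training step finishes only when the slowest of its parallel transfers completes. The Python sketch below is purely illustrative; all of its numbers are assumptions, not measurements from OpenAI's clusters.

```python
import random

# Toy model of the straggler effect: a training step finishes only when
# the slowest of its parallel transfers completes, so a single delayed
# packet gates the whole step. All numbers are illustrative assumptions.
random.seed(0)
transfers_ms = [random.gauss(10.0, 1.0) for _ in range(10_000)]  # typical transfer times
transfers_ms[1234] = 250.0                                       # one transfer hit by a late packet

median = sorted(transfers_ms)[len(transfers_ms) // 2]
print(f"median transfer:   {median:.1f} ms")
print(f"step completes at: {max(transfers_ms):.1f} ms")          # gated by the single straggler
```

Even though the median transfer takes about 10 ms, the step cannot complete until the 250 ms straggler lands, and every GPU waits in the meantime.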
"Idle GPU time is the silent killer of AI training economics," explained Dr. Elena Torres, senior networking researcher at MIT. "When networks fail to keep data flowing, compute resources worth millions of dollars sit idle." The problem worsens as models scale from billions to trillions of parameters.
How MRC Works: Three Core Mechanisms
MRC extends the widely used RDMA over Converged Ethernet (RoCE) standard. It incorporates techniques from the Ultra Ethernet Consortium and adds SRv6-based source routing to offload complex routing decisions from network switches, saving power and reducing latency.
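As a rough illustration of the source-routing idea, the sketch below models an SRv6-style segment list: the sender encodes the full path up front, and each switch merely decrements a counter to find the next hop. The class and field names are hypothetical and are not taken from the MRC specification.

```python
from dataclasses import dataclass, field

@dataclass
class SourceRoutedPacket:
    """Illustrative SRv6-style packet: the sender embeds the whole path
    as a segment list, so switches only pop the next hop instead of
    running their own routing lookups."""
    payload: bytes
    segments: list          # ordered hop addresses chosen by the sender
    segments_left: int = field(init=False)

    def __post_init__(self):
        self.segments_left = len(self.segments)

    def next_hop(self) -> str:
        return self.segments[len(self.segments) - self.segments_left]

    def advance(self) -> None:
        # A forwarding switch just decrements the counter; no per-switch
        # routing decision is needed.
        self.segments_left -= 1

# The host picks the route; intermediate switches stay stateless.
pkt = SourceRoutedPacket(b"gradient-shard-0", ["tor-3", "spine-7", "tor-9", "gpu-42"])
while pkt.segments_left:
    print("forward to", pkt.next_hop())
    pkt.advance()
```

In a real fabric the segment list would carry IPv6 segment identifiers rather than switch nicknames, but the control flow is the point: routing state lives at the sender, not in the switches.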
1. Adaptive Packet Spraying to Eliminate Congestion
Instead of sending each data stream along a single path, MRC spreads packets across hundreds of network routes simultaneously. "If a packet's path becomes unusable, it can traverse alternate paths," the protocol specification states. This spreading prevents the congestion hotspots that plague traditional RoCEv2, where each flow is pinned to a single route from source to destination.
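A minimal sketch of the spraying idea, assuming a simple round-robin assignment over whichever paths are currently healthy; the path names and health check are invented for illustration and are not part of the published protocol.

```python
import itertools

def spray(packets, paths, healthy):
    """Illustrative packet spraying: assign each packet to the next
    healthy path in round-robin order, instead of pinning the whole
    flow to one route the way RoCEv2 flow hashing does."""
    ring = itertools.cycle(paths)
    assignment = []
    for pkt in packets:
        for _ in range(len(paths)):        # try each path at most once
            path = next(ring)
            if healthy(path):
                assignment.append((pkt, path))
                break
        else:
            raise RuntimeError("no healthy path available")
    return assignment

# Hypothetical 4-path fabric with one failed link.
paths = ["path-0", "path-1", "path-2", "path-3"]
down = {"path-2"}
plan = spray(range(8), paths, healthy=lambda p: p not in down)
for pkt, path in plan:
    print(f"packet {pkt} -> {path}")
```

Because assignment happens per packet rather than per flow, no single route ever carries an entire stream, and a dead path is simply skipped.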
2. Congestion-Aware Transport with Real-Time Feedback
MRC continuously monitors network conditions using ECN (Explicit Congestion Notification) signals. The protocol detects growing queues before they cause packet loss and adjusts sending rates proactively. "This is a shift from reactive to predictive network management," commented Mark Chen, network architect at Broadcom, which contributed to the design.
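The sketch below shows one plausible shape for ECN-driven rate control, loosely in the spirit of DCQCN: back off multiplicatively when marks arrive, probe upward additively when the path is clean. The constants and function signature are assumptions, not values from the MRC specification.

```python
def adjust_rate(rate_gbps, ecn_fraction,
                min_rate=1.0, max_rate=400.0,
                beta=0.5, step=5.0):
    """Illustrative ECN-driven rate control: cut the sending rate in
    proportion to how many packets were ECN-marked in the last window,
    and recover bandwidth gradually when no marks are seen."""
    if ecn_fraction > 0:
        # Queues are building: slow down before packets are dropped.
        rate_gbps *= (1 - beta * ecn_fraction)
    else:
        # No congestion signal: probe upward additively.
        rate_gbps += step
    return max(min_rate, min(max_rate, rate_gbps))

rate = 400.0
for marks in [0.0, 0.2, 0.6, 0.0, 0.0]:    # ECN-marked fraction per window
    rate = adjust_rate(rate, marks)
    print(f"ecn={marks:.1f} -> rate={rate:.1f} Gbps")
```

The key property is that the sender reacts to queue growth (the ECN marks) rather than to packet loss, which arrives too late to prevent a stall.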
3. Fast Failover and Path Diversity
When a link or device fails, MRC redirects traffic within milliseconds using path-level redundancy. Unlike TCP-style retransmission, which waits on acknowledgment timeouts before recovering, MRC maintains throughput by exploiting multipath diversity. OpenAI engineers found this dramatically reduces training stalls in clusters with hundreds of GPUs.
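As a sketch of the failover idea, the code below tracks per-path acknowledgment times and, when a path misses its deadline, moves that path's in-flight packets onto a surviving path rather than waiting out a long retransmit timer. The class name, the 2 ms deadline, and the data structures are illustrative assumptions.

```python
import time

class MultipathSender:
    """Illustrative failover logic: track per-path health and, when a
    path stops acknowledging, reroute its in-flight packets over the
    surviving paths."""
    def __init__(self, paths, timeout_ms=2.0):
        self.paths = set(paths)
        self.timeout = timeout_ms / 1000.0
        self.last_ack = {p: time.monotonic() for p in paths}
        self.inflight = {p: [] for p in paths}

    def on_ack(self, path):
        self.last_ack[path] = time.monotonic()
        self.inflight[path].clear()

    def check_failover(self):
        now = time.monotonic()
        for path in list(self.paths):
            if now - self.last_ack[path] > self.timeout:
                # Path looks dead: retire it and move its packets.
                self.paths.discard(path)
                if not self.paths:
                    raise RuntimeError("all paths down")
                orphans = self.inflight.pop(path)
                target = next(iter(self.paths))
                self.inflight[target].extend(orphans)
                print(f"{path} failed; rerouted {len(orphans)} packets to {target}")

s = MultipathSender(["path-0", "path-1"])
s.inflight["path-0"] = ["pkt-a", "pkt-b"]
time.sleep(0.003)          # exceed the 2 ms ack deadline on path-0
s.on_ack("path-1")         # path-1 is still healthy
s.check_failover()         # path-0 times out; its packets move to path-1
```

Because alternate paths are already established and carrying traffic, recovery is a bookkeeping operation at the sender rather than a slow end-to-end timeout.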
What This Means for AI Infrastructure
Industry analysts estimate MRC could reduce GPU idle time by 30–50% in large clusters, directly translating to faster model training and lower operational costs. The open specification ensures any cloud provider or enterprise can implement the protocol without vendor lock-in.
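To make that estimate concrete, here is a back-of-the-envelope calculation; every input (cluster size, GPU price, baseline idle fraction) is a hypothetical assumption, not a figure from the announcement.

```python
# Hypothetical illustration of what a 30-50% idle-time reduction means.
# All figures below are assumptions, not from the announcement.
gpus = 10_000
cost_per_gpu_hour = 2.0          # USD, assumed
baseline_idle = 0.10             # fraction of GPU-hours lost to network stalls, assumed

for reduction in (0.30, 0.50):
    recovered = gpus * cost_per_gpu_hour * 24 * baseline_idle * reduction
    print(f"{reduction:.0%} idle reduction -> ~${recovered:,.0f}/day recovered")
```

Under these assumed inputs, roughly $14,000 to $24,000 per day of otherwise-wasted compute comes back, before counting the value of faster training runs.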
"This is a significant step forward for AI networking," said Dr. James Patel, a networking expert at Stanford University. "By tackling jitter and path failures at the transport layer, OpenAI is addressing one of the last remaining scaling challenges for trillion-parameter models."
The protocol's publication through OCP signals broad industry collaboration. AMD, Broadcom, Intel, Microsoft, and NVIDIA have already committed to supporting MRC in future hardware and software stacks. Early adopters can access reference implementations and documentation via the OCP repository starting today.
For organizations building their own AI supercomputers, MRC offers an openly specified framework for eliminating the "hidden bottleneck" that turns expensive compute into idle waiting. The protocol also reduces switch power consumption by moving routing decisions to the endpoints, a meaningful factor at data-center scale.