OptiNIC: Tail-Optimized RDMA for Distributed ML

Published:Dec 28, 2025 02:24
1 min read
ArXiv

Analysis

This paper addresses the critical tail latency problem in distributed ML training, a significant bottleneck as workloads scale. OptiNIC offers a novel approach by relaxing traditional RDMA reliability guarantees, leveraging ML's tolerance for data loss. This domain-specific optimization, eliminating retransmissions and in-order delivery, promises substantial performance improvements in time-to-accuracy and throughput. The evaluation across public clouds validates the effectiveness of the proposed approach, making it a valuable contribution to the field.

Reference

OptiNIC improves time-to-accuracy (TTA) by 2x and increases throughput by 1.6x for training and inference, respectively.