OptiNIC: Tail-Optimized RDMA for Distributed ML
Analysis
This paper addresses the tail latency problem in distributed ML training, a significant bottleneck as workloads scale. OptiNIC takes a novel approach: it relaxes traditional RDMA reliability guarantees, exploiting ML's tolerance for data loss. By eliminating retransmissions and in-order delivery, this domain-specific optimization promises substantial improvements in time-to-accuracy and throughput. The evaluation across public clouds validates the effectiveness of the approach, making it a valuable contribution to the field.
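The paper's transport details are not reproduced in this summary, but a minimal sketch of the "send once, never retransmit, adapt the timeout" idea might look like the following. The names `post_send`, `poll_done`, and the EWMA-style timeout update are hypothetical placeholders for illustration, not OptiNIC's actual API.

```python
import time
from typing import Callable, List

def send_without_retransmit(
    chunks: List[bytes],
    post_send: Callable[[int, bytes], None],  # hypothetical: post one chunk to the NIC
    poll_done: Callable[[int], bool],         # hypothetical: poll chunk completion
    base_timeout_s: float = 0.001,
) -> List[int]:
    """Send every chunk exactly once; never retransmit. Returns indices treated as lost."""
    lost: List[int] = []
    timeout = base_timeout_s
    for idx, chunk in enumerate(chunks):
        post_send(idx, chunk)
        start = time.monotonic()
        while not poll_done(idx):
            if time.monotonic() - start > timeout:
                # Instead of retransmitting, declare the chunk lost and move on;
                # the ML layer is expected to tolerate or repair the gap.
                lost.append(idx)
                break
        else:
            # Runs only when the chunk completed (no break): adapt the timeout
            # toward recently observed completion latency.
            elapsed = time.monotonic() - start
            timeout = 0.9 * timeout + 0.1 * max(elapsed, base_timeout_s)
    return lost
```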
Key Takeaways
- OptiNIC is a domain-specific RDMA transport designed for distributed ML workloads.
- It eliminates retransmissions and in-order delivery, prioritizing speed over strict reliability.
- OptiNIC uses adaptive timeouts and shifts loss recovery to the ML pipeline (see the sketch after this list).
- Evaluation shows significant improvements in TTA, throughput, and latency compared to traditional RDMA.
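Shifting loss recovery into the ML pipeline, as the third takeaway describes, could look roughly like the sketch below. The stale-fill policy and the function name `recover_gradient` are assumptions for illustration, not the paper's documented mechanism.

```python
import numpy as np
from typing import Dict

def recover_gradient(
    received: Dict[int, np.ndarray],   # chunk index -> gradient chunk that arrived
    stale: Dict[int, np.ndarray],      # previous iteration's chunks, kept as fallbacks
    num_chunks: int,
    chunk_size: int,
) -> np.ndarray:
    """Reassemble a flattened gradient even when the transport dropped some chunks.

    Hypothetical policy: a missing chunk is filled with its value from the previous
    step, or zeros if none exists, so the training step never blocks waiting for a
    retransmission.
    """
    parts = []
    for i in range(num_chunks):
        if i in received:
            parts.append(received[i])
        elif i in stale:
            parts.append(stale[i])              # reuse stale-but-usable gradient data
        else:
            parts.append(np.zeros(chunk_size))  # worst case: treat the chunk as a zero update
    return np.concatenate(parts)
```

In this framing, the transport reports which chunk indices it gave up on and the trainer fills those gaps before applying the optimizer step; whether OptiNIC reuses stale gradients, rescales the partial sum, or simply skips lost data is a design detail not specified in the summary above.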
“OptiNIC improves time-to-accuracy (TTA) by 2x and increases throughput by 1.6x for training and inference, respectively.”