Fault-Tolerant Collective Communication for LLMs
Analysis
Key Takeaways
- Addresses the problem of network failures in large-scale LLM training and inference.
- Introduces R^2CCL, a fault-tolerant collective communication library.
- Leverages multi-NIC hardware for failover and load redistribution (a rough sketch of this idea appears after the quote below).
- Demonstrates significant performance improvements over the AdapCC and DejaVu baselines.
- Shows low overheads under NIC failures: less than 1% for training and less than 3% for inference.
“R^2CCL is highly robust to NIC failures, incurring less than 1% training and less than 3% inference overheads.”
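To make the failover-and-redistribution idea concrete, here is a minimal Python sketch. It is not R^2CCL's actual API; the `Nic` class, the even-split rebalancing policy, and the `mlx5_*` device names are assumptions made purely for illustration of how traffic shares might shift from a failed NIC onto its healthy peers.

```python
class Nic:
    """One network interface with a health flag and an assigned traffic share."""
    def __init__(self, name: str):
        self.name = name
        self.healthy = True
        self.share = 0.0  # fraction of the collective's traffic routed here

def redistribute(nics: list[Nic]) -> None:
    """Spread traffic evenly across the NICs still marked healthy (assumed policy)."""
    healthy = [n for n in nics if n.healthy]
    if not healthy:
        raise RuntimeError("no healthy NICs left; communicator cannot make progress")
    for n in nics:
        n.share = 1.0 / len(healthy) if n.healthy else 0.0

def on_nic_failure(nics: list[Nic], failed_name: str) -> None:
    """Mark the failed NIC unhealthy, then rebalance the survivors."""
    for n in nics:
        if n.name == failed_name:
            n.healthy = False
    redistribute(nics)

# Usage: four NICs carry 25% each; after one fails, survivors carry ~33% each.
nics = [Nic(f"mlx5_{i}") for i in range(4)]
redistribute(nics)
on_nic_failure(nics, "mlx5_2")
print({n.name: round(n.share, 2) for n in nics})
```

The sketch only captures the load-redistribution step; the paper's reported overheads suggest the real library also overlaps this rerouting with ongoing communication rather than pausing the collective.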