Anomaly Detection Benchmarks: Navigating Imbalanced Industrial Data
Published: Jan 5, 2026 05:00 · 1 min read · ArXiv ML
Analysis
This paper provides valuable insights into the performance of various anomaly detection algorithms under extreme class imbalance, a common challenge in industrial applications. The use of a synthetic dataset allows for controlled experimentation and benchmarking, but the generalizability of the findings to real-world industrial datasets needs further investigation. The study's conclusion that the optimal detector depends on the number of faulty examples is crucial for practitioners.
Key Takeaways
- Anomaly detection performance is highly sensitive to the number of faulty examples in the training data.
- Unsupervised methods (kNN/LOF) perform well with very few faulty examples (<20).
- Semi-supervised (XGBOD) and supervised (SVM/CatBoost) methods show significant performance gains with 30-50 faulty examples, especially with higher dimensionality.
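To illustrate the low-label regime where unsupervised detectors shine, here is a minimal sketch using scikit-learn's kNN distances and Local Outlier Factor on a synthetic imbalanced dataset. The dataset, dimensionality, and thresholds are illustrative assumptions, not the paper's actual benchmark or code.

```python
# Sketch: unsupervised anomaly scoring (kNN distance and LOF) when only
# a handful of faulty examples exist. Both detectors are fit on healthy
# data alone, so faulty examples are never needed for training.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors

rng = np.random.default_rng(0)
healthy = rng.normal(0.0, 1.0, size=(1000, 8))  # majority (healthy) class
faulty = rng.normal(4.0, 1.0, size=(10, 8))     # <20 faulty examples

# kNN-based score: distance to the k-th nearest healthy training point.
knn = NearestNeighbors(n_neighbors=5).fit(healthy)

def knn_score(x):
    dist, _ = knn.kneighbors(x)
    return dist[:, -1]  # larger distance => more anomalous

# LOF in novelty mode: fit on healthy data only, then score new points.
# decision_function is positive for inliers, negative for outliers.
lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(healthy)

print("kNN scores (faulty):", knn_score(faulty).round(2))
print("LOF decisions (faulty):", lof.decision_function(faulty).round(2))
```

In practice the contamination level and neighbourhood sizes would be tuned on the data at hand; the point is only that both scores separate faulty points from healthy ones without any labelled faults.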
Reference
“Our findings reveal that the best detector is highly dependant on the total number of faulty examples in the training dataset, with additional healthy examples offering insignificant benefits in most cases.”