Is 399 rows × 24 features too small for a medical classification model?
Published: Jan 3, 2026 05:13 • 1 min read • r/learnmachinelearning
Analysis
The post asks whether a small tabular dataset (399 samples, 24 features) is adequate for a binary classification task in a medical context. The author wants to know whether this size is workable for classical machine learning and whether data augmentation is beneficial at this scale. Their approach of median imputation, missingness indicators, and a focus on validation and leakage prevention is sound given the dataset's limitations. The core question is how much performance such a small dataset can support and whether tabular augmentation offers any real benefit.
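To make the described approach concrete, here is a minimal sketch of keeping median imputation and missingness indicators inside a cross-validated pipeline, so that imputation medians are computed on each training fold only and never leak into the held-out fold. It assumes scikit-learn; the `X` and `y` arrays are synthetic stand-ins for illustration, not the author's data.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical stand-in for the 399 x 24 dataset, with NaNs for missing values.
rng = np.random.default_rng(0)
X = rng.normal(size=(399, 24))
X[rng.random(X.shape) < 0.1] = np.nan  # inject ~10% missingness
y = rng.integers(0, 2, size=399)       # binary labels

# Because imputation lives inside the Pipeline, cross_val_score refits it
# per fold: the medians come from training data only, preventing leakage.
clf = Pipeline([
    # add_indicator=True appends a binary missingness column per feature
    # that had missing values, matching the author's stated approach.
    ("impute", SimpleImputer(strategy="median", add_indicator=True)),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
print(f"AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```

A regularized logistic regression is used here only as a representative classical baseline; any small-data-friendly estimator could take its place in the final pipeline step.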
Key Takeaways
- The dataset (399 samples, 24 features) is small, which caps the model complexity it can support and the performance that can be reliably measured.
- Classical ML techniques (e.g., regularized linear models or gradient-boosted trees) are the most appropriate choice at this size.
- Data augmentation for tabular data at this scale is of questionable value and may not yield meaningful improvements.
- Robust validation and leakage prevention are crucial given the overfitting risk; see the validation sketch after this list.
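On the validation side, one reasonable option at this sample size is repeated stratified k-fold, which averages over many random splits to reduce the variance of a single k-fold estimate. The snippet below is a sketch continuing the pipeline example above (it reuses `clf`, `X`, and `y` from there); `RepeatedStratifiedKFold` is standard scikit-learn.

```python
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Repeating the 5-fold split 20 times smooths out the split-to-split
# variance that a single k-fold run exhibits on only 399 samples.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=20, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
print(f"AUC over {scores.size} resampled folds: "
      f"{scores.mean():.3f} +/- {scores.std():.3f}")
```

The spread of the fold scores is itself informative here: a wide standard deviation is a warning that any single headline number from this dataset is unstable.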
Reference
“The author is working on a disease prediction model with a small tabular dataset and is questioning the feasibility of using classical ML techniques.”