Scalable Framework for logP Prediction
Analysis
Key Takeaways
- Developed a scalable framework for logP prediction using a large curated dataset.
- Identified the importance of molecular weight as a predictor using SHAP analysis.
- Demonstrated the superiority of tree-based ensemble methods over linear models.
- Achieved optimal performance with a stratified modeling strategy.
- Showed that descriptor-based ensemble models are competitive with graph neural networks.
“Tree-based ensemble methods, including Random Forest and XGBoost, proved inherently robust to this violation, achieving an R-squared of 0.765 and RMSE of 0.731 logP units on the test set.”
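The two metrics in the quote, R-squared and RMSE, can be computed from measured and predicted logP values as in the minimal sketch below. It is only an illustration of the standard definitions, not the authors' evaluation code. The function names `rmse` and `r_squared` and the sample values are hypothetical and do not come from the study.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error, in the same units as the target (logP units)."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical measured and predicted logP values, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]

print(f"R-squared: {r_squared(y_true, y_pred):.3f}")
print(f"RMSE: {rmse(y_true, y_pred):.3f} logP units")
```

RMSE is expressed in logP units, so it reads directly as a typical prediction error, while R-squared gives the fraction of variance in measured logP that the model explains.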