Background Identifying and analyzing the factors that influence occupational injury is an important application of feature selection. In recent years, with the rise of machine learning, combining feature selection with Boosting algorithms has offered a new approach to constructing occupational injury prediction models.
Objective To evaluate the applicability of Boosting-based models in predicting the severity of miners' non-fatal occupational injuries, and to provide a basis for rationally predicting the severity level of such injuries.
Methods Publicly available data from the US Mine Safety and Health Administration (MSHA) on metal miners' non-fatal occupational injuries from 2001 to 2021 were used. The outcome variable was lost working days, dichotomized as < 105 d (minor injury) and ≥ 105 d (serious injury). Four feature sets were obtained by four feature selection methods: least absolute shrinkage and selection operator (Lasso) regression, stepwise regression, univariate analysis + Lasso regression, and univariate analysis + stepwise regression. Logistic regression, gradient boosting decision tree (GBDT), and extreme gradient boosting (XGBoost) were trained on each of the four feature sets, yielding a total of 12 prediction models of the severity of miners' non-fatal occupational injuries. Area under the curve (AUC), sensitivity, specificity, and Youden index were calculated for model evaluation.
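The modeling pipeline described above can be sketched as follows. This is an illustrative example on synthetic data, not the authors' code: Lasso-style selection is approximated here with an L1-penalized logistic regression wrapped in scikit-learn's `SelectFromModel`, followed by a GBDT classifier evaluated with AUC, sensitivity, specificity, and the Youden index; all parameter values are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the MSHA injury data (binary outcome:
# 0 = minor injury, 1 = serious injury).
X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Lasso-style feature selection: keep features whose L1-penalized
# coefficients are non-zero.
selector = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1))
X_tr_sel = selector.fit_transform(X_tr, y_tr)
X_te_sel = selector.transform(X_te)

# Train a GBDT model on the selected feature set.
gbdt = GradientBoostingClassifier(random_state=0).fit(X_tr_sel, y_tr)
prob = gbdt.predict_proba(X_te_sel)[:, 1]
pred = (prob >= 0.5).astype(int)

# Evaluation metrics: AUC, sensitivity, specificity, Youden index.
tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
youden = sensitivity + specificity - 1
auc = roc_auc_score(y_te, prob)
print(f"AUC={auc:.4f} Se={sensitivity:.4f} "
      f"Sp={specificity:.4f} Youden={youden:.4f}")
```

The same template would be repeated for each feature-selection method and each classifier to produce the 12-model comparison.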
Results According to the results of the four feature selection methods, age, time of accident occurrence, total length of service, cause of injury, activity that triggered the injury, body part injured, nature of injury, and outcome of injury were identified as influencing factors of the severity of miners' non-fatal occupational injuries. Feature set 4, obtained by univariate analysis + stepwise regression, was the optimal set, and the GBDT model trained on it showed the best predictive performance, with a specificity of 0.7530, a sensitivity of 0.9490, and a Youden index of 0.7020. The AUC values of the logistic regression, GBDT, and XGBoost models trained on feature set 4 were 0.8526 (95%CI: 0.8387, 0.8750), 0.8640 (95%CI: 0.8474, 0.8806), and 0.8603 (95%CI: 0.8439, 0.8773), respectively, higher than the corresponding AUC values obtained with feature set 2: 0.8487 (95%CI: 0.8203, 0.8669), 0.8110 (95%CI: 0.8012, 0.8344), and 0.8439 (95%CI: 0.8245, 0.8561). The AUC values of the GBDT and XGBoost models trained on feature set 4 were higher than that of the logistic regression model.
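One common way to obtain AUC confidence intervals like those reported above is a percentile bootstrap over the test set. The abstract does not state how the intervals were computed, so the following is a hypothetical sketch on toy labels and probabilities, not a reproduction of the reported values.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Toy test-set labels and predicted probabilities (assumed values).
y_true = rng.integers(0, 2, size=500)
y_prob = np.clip(y_true * 0.6 + rng.normal(0.2, 0.25, size=500), 0, 1)

# Percentile bootstrap: resample cases with replacement and recompute AUC.
boots = []
for _ in range(1000):
    idx = rng.integers(0, len(y_true), size=len(y_true))
    if len(np.unique(y_true[idx])) < 2:  # AUC needs both classes present
        continue
    boots.append(roc_auc_score(y_true[idx], y_prob[idx]))

lo, hi = np.percentile(boots, [2.5, 97.5])
auc = roc_auc_score(y_true, y_prob)
print(f"AUC={auc:.4f} (95%CI: {lo:.4f}, {hi:.4f})")
```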
Conclusion Prediction models built on predictors screened by a combination of two feature selection methods perform better than those built with a single feature selection method. Moreover, given the optimal feature set, Boosting-based models outperform the traditional logistic regression model.