Applicability of feature selection combined with Boosting algorithm in severity prediction of non-fatal occupational injuries in miners
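Publicly available data from the US Mine Safety and Health Administration (MSHA) on non-fatal occupational injuries among metal miners from 2001 to 2021 were used, with lost working days < 105 d defined as minor injury and ≥ 105 d as serious injury as the outcome variable. Four feature selection methods, namely least absolute shrinkage and selection operator (Lasso) regression, stepwise regression, single factor + Lasso regression, and single factor + stepwise regression, were used to screen out four different feature sets. Two Boosting-based models, gradient boosting decision tree (GBDT) and extreme gradient boosting (XGBoost), were selected, and the four feature sets were used to train three models (logistic regression, GBDT, and XGBoost), yielding 12 models for predicting the severity level of miners' non-fatal occupational injuries. The area under the curve (AUC), sensitivity, specificity, and Youden index of the prediction models served as the main evaluation indicators.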
Identifying and analyzing the factors that influence occupational injury is an important application of feature selection. In recent years, with the rise of machine learning algorithms, combining feature selection with Boosting algorithms has offered a new approach to constructing occupational injury prediction models.
To evaluate the applicability of Boosting algorithm-based models in predicting the severity of miners' non-fatal occupational injuries, and to provide a basis for rationally predicting the severity level of such injuries.
Publicly available data from the US Mine Safety and Health Administration (MSHA) on non-fatal occupational injuries among metal miners from 2001 to 2021 were used. The outcome variable was injury severity, defined by lost working days: < 105 d (minor injury) or ≥ 105 d (serious injury). Four feature sets were screened out by four feature selection methods: least absolute shrinkage and selection operator (Lasso) regression, stepwise regression, single factor + Lasso regression, and single factor + stepwise regression. Logistic regression, gradient boosting decision tree (GBDT), and extreme gradient boosting (XGBoost) models were each trained with the four feature sets, giving a total of 12 prediction models of the severity of miners' non-fatal occupational injuries. The area under the curve (AUC), sensitivity, specificity, and Youden index of each model were calculated for model evaluation.
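The outcome coding and evaluation metrics described above can be illustrated with a minimal sketch. It uses only the Python standard library. The function names, the toy data, and the 0.5 classification threshold are assumptions made for illustration and are not taken from the study. The AUC is computed here through its rank interpretation (the probability that a randomly chosen serious injury receives a higher score than a randomly chosen minor injury), which may differ from the software used by the authors.

```python
from typing import Sequence, Tuple


def label_severity(lost_days: float, cutoff: float = 105) -> int:
    """Code the outcome: 1 = serious injury (>= cutoff days), 0 = minor injury."""
    return 1 if lost_days >= cutoff else 0


def sensitivity_specificity_youden(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float, float]:
    """Return (sensitivity, specificity, Youden index) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return sens, spec, sens + spec - 1.0


def auc_score(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: lost working days and model-predicted probabilities.
lost_days = [3, 200, 105, 50, 10, 150, 104, 120]
scores = [0.1, 0.9, 0.4, 0.6, 0.2, 0.8, 0.3, 0.7]

labels = [label_severity(d) for d in lost_days]
preds = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

sens, spec, youden = sensitivity_specificity_youden(labels, preds)
auc = auc_score(labels, scores)
print(f"sensitivity={sens:.4f} specificity={spec:.4f} "
      f"Youden={youden:.4f} AUC={auc:.4f}")
```

In practice the scores would come from a fitted logistic regression, GBDT, or XGBoost model, and the threshold could be chosen to maximize the Youden index rather than fixed in advance.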
According to the results of the four feature selection methods, age, time of accident occurrence, total length of service, cause of injury, activities that triggered injury occurrence, body part of injury, nature of injury, and outcome of injury were identified as influencing factors of non-fatal occupational injury severity in miners. Feature set 4, screened out by single factor + stepwise regression, was the optimal set, and the GBDT model presented the best predictive performance in predicting the severity of non-fatal occupational injuries. The associated specificity, sensitivity, and Youden index were 0.7530, 0.9490, and 0.7020, respectively. The AUC values of the logistic regression, GBDT, and XGBoost models trained by feature set 4 were 0.8526 (95%CI: 0.8387, 0.8750), 0.8640 (95%CI: 0.8474, 0.8806), and 0.8603 (95%CI: 0.8439, 0.8773), respectively, higher than the corresponding AUC values of the models trained by feature set 2: 0.8487 (95%CI: 0.8203, 0.8669), 0.8110 (95%CI: 0.8012, 0.8344), and 0.8439 (95%CI: 0.8245, 0.8561), respectively. The AUC values of the GBDT and XGBoost models trained by feature set 4 were higher than that of the logistic regression model.
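As a consistency check on the reported values for the GBDT model, the Youden index can be recomputed from its standard definition, sensitivity plus specificity minus one (the symbol J is introduced here for convenience):

```latex
J = \text{sensitivity} + \text{specificity} - 1 = 0.9490 + 0.7530 - 1 = 0.7020
```

The result agrees with the reported Youden index of 0.7020.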
Prediction models built on predictors screened out by combining two feature selection methods perform better than those built on predictors from a single feature selection method. In addition, with the optimal feature set, the Boosting-based models outperform the traditional logistic regression model.