Applicability of feature selection combined with Boosting algorithm in severity prediction of non-fatal occupational injuries in miners

莫有桦; 徐婷; 孟诗迪; 朱晓俊; 樊晶光

doi:10.11836/JEOM23172

Abstract:
Background Identifying and analyzing the factors that influence occupational injury is a central task of feature selection. With the rise of machine learning algorithms, combining feature selection with Boosting algorithms offers a new approach to building occupational injury prediction models.
Objective To evaluate the applicability of Boosting algorithm-based models in predicting the severity of miners' non-fatal occupational injuries, and to provide a basis for predicting that severity in a scientific and reasonable way.
Methods Publicly available data from the US Mine Safety and Health Administration (MSHA) on non-fatal occupational injuries in metal miners from 2001 to 2021 were used. The outcome variable was injury severity, defined as minor (< 105 lost working days) or serious (≥ 105 lost working days). Four feature selection methods, namely least absolute shrinkage and selection operator (Lasso) regression, stepwise regression, single-factor (univariate) analysis + Lasso regression, and single-factor analysis + stepwise regression, were used to screen out four different feature sets. Two Boosting-based models, gradient boosting decision tree (GBDT) and extreme gradient boosting (XGBoost), were selected, and logistic regression, GBDT, and XGBoost were each trained on the four feature sets, yielding 12 prediction models of the severity of miners' non-fatal occupational injuries. Area under the curve (AUC), sensitivity, specificity, and Youden index were the main evaluation indicators.
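The outcome definition and the 4 × 3 experimental design described in the Methods can be sketched as follows. This is a minimal illustration, not the authors' code: the function and variable names are hypothetical, and the numbering of feature sets 1 and 3 is assumed from the order in which the selection methods are listed (the abstract only states explicitly that set 2 comes from stepwise regression and set 4 from single factor + stepwise regression).

```python
from itertools import product

# Cut-off stated in the Methods: >= 105 lost working days counts as serious.
SEVERITY_THRESHOLD_DAYS = 105


def label_severity(lost_working_days: int) -> str:
    """Map lost working days to the binary outcome used in the abstract."""
    if lost_working_days < 0:
        raise ValueError("lost working days cannot be negative")
    if lost_working_days >= SEVERITY_THRESHOLD_DAYS:
        return "serious"
    return "minor"


# Each selection method yields one feature set (numbering partly assumed).
FEATURE_SELECTION_METHODS = [
    "Lasso regression",
    "stepwise regression",
    "single factor + Lasso regression",
    "single factor + stepwise regression",
]
MODEL_FAMILIES = ["logistic regression", "GBDT", "XGBoost"]


def build_experiment_grid() -> list:
    """Pair every feature set with every model family (4 x 3 = 12 models)."""
    numbered_methods = enumerate(FEATURE_SELECTION_METHODS, start=1)
    return [
        {"feature_set": number, "selection_method": method, "model": model}
        for (number, method), model in product(numbered_methods, MODEL_FAMILIES)
    ]


grid = build_experiment_grid()
print(len(grid))                                  # number of candidate models
print(label_severity(104), label_severity(105))   # values on either side of the cut-off
```

Each entry of `grid` stands for one model to be trained and evaluated. The training step itself is omitted because the abstract does not specify hyperparameters or software.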
Results Across the four feature selection methods, eight features were identified as the main factors influencing the severity of miners' non-fatal occupational injuries: age, time of accident, total length of service, cause of injury, activity at the time of injury, injured body part, nature of injury, and outcome of injury. Feature set 4, screened by single-factor analysis + stepwise regression, was the optimal feature set, and the GBDT model built on it showed the best performance in predicting the severity of non-fatal occupational injuries, with specificity, sensitivity, and Youden index of 0.7530, 0.9490, and 0.7020, respectively. The AUC values of the logistic regression, GBDT, and XGBoost models trained on feature set 4 were 0.8526 (95% CI: 0.8387, 0.8750), 0.8640 (95% CI: 0.8474, 0.8806), and 0.8603 (95% CI: 0.8439, 0.8773), respectively. Each was higher than the AUC of the corresponding model trained on feature set 2, which was screened by stepwise regression alone: 0.8487 (95% CI: 0.8203, 0.8669), 0.8110 (95% CI: 0.8012, 0.8344), and 0.8439 (95% CI: 0.8245, 0.8561), respectively. With feature set 4, the GBDT and XGBoost models also had higher AUC values than the logistic regression model.
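As a consistency check on the reported metrics, the Youden index is defined as sensitivity + specificity − 1. The short sketch below applies that standard definition to the sensitivity and specificity reported for the GBDT model on feature set 4. The helper names and the confusion-matrix counts in the second part are hypothetical and purely illustrative; they are not data from the study.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of truly serious injuries that the model flags as serious."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Share of truly minor injuries that the model flags as minor."""
    return true_neg / (true_neg + false_pos)


def youden_index(sens: float, spec: float) -> float:
    """Youden's J statistic: sensitivity + specificity - 1."""
    return sens + spec - 1.0


# Values reported in the Results for GBDT trained on feature set 4.
reported_j = youden_index(sens=0.9490, spec=0.7530)
print(f"{reported_j:.4f}")

# Hypothetical confusion-matrix counts, used only to show the full calculation.
example_sens = sensitivity(true_pos=90, false_neg=10)
example_spec = specificity(true_neg=80, false_pos=20)
example_j = youden_index(example_sens, example_spec)
print(f"{example_sens:.2f} {example_spec:.2f} {example_j:.2f}")
```

The first calculation reproduces the Youden index of 0.7020 stated in the Results, so the three reported figures are mutually consistent.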
Conclusion Prediction models built on predictors screened by a combination of two feature selection methods outperform those built on predictors screened by a single feature selection method. With the optimal feature set, Boosting-based models for predicting the severity of non-fatal injuries also outperform the traditional logistic regression model.
