LightGBM模型及模型可解释性方法在预测职业伤害严重程度中的探讨

Exploration of predicting occupational injury severity based on LightGBM model and model interpretability method

  • 摘要:
    背景 轻量级梯度提升机算法(LightGBM)以其高效、快速等特点成为预测模型中的热门选择。然而,由于机器学习模型存在“黑盒”特性,导致模型可解释性较差。目前很少有研究从LightGBM模型及模型可解释性的角度评估职业伤害的严重程度。
    目的 评估LightGBM模型及模型可解释性方法在职业伤害预测中的应用价值。
    方法 应用美国矿山安全与健康管理局(MSHA)1983—2022年采矿业工人职业伤害数据集,以伤害程度(死亡/致命性职业伤害和永久/部分残疾)作为结局变量,以伤害发生的月份、年龄、性别、事故发生时间、轮班开始时间、事故发生时间与轮班开始时间间隔、总工龄、矿山总工龄、现矿山工龄、职业伤害致因、事故类型、伤害发生活动(即伤害发生时工人正在进行的活动)、伤害来源、受伤部位、作业环境类型、产品类别、伤害性质共17个指标作为预测变量。通过最小绝对收缩与选择算子算法(Lasso)回归方法筛选特征集。应用LightGBM构建职业伤害预测模型,以预测模型的曲线下面积(AUC)为主要评价指标,AUC越接近1,说明模型预测性能越好。应用Shapley 加法解释(SHAP)法对模型可解释性进行评价。
    结果 通过Lasso回归,识别出关键影响因素7个,分别为事故发生时间与轮班开始时间间隔、现矿山工龄、职业伤害致因、事故类型、受伤部位、伤害性质、作业环境类型。基于Lasso回归特征筛选构建的LightGBM模型预测性能良好,其AUC值、准确度、特异度、灵敏度分别为0.9941(95%CI:0.9917~0.9966)、0.9743、0.9781、0.9640,预测的致死性职业伤害概率与实际的致死性职业伤害概率一致性较高。在职业伤害预测模型中,通过SHAP值分析各指标的重要性,发现受伤部位和伤害性质是影响模型预测结果的两个主要特征,其他特征的影响较小。受伤部位的SHAP值分布广泛,尤其是头颈部和多部位的受伤,对预测致死性风险的模型有显著影响。伤害性质也对模型有不同方向的影响,窒息/溺水、挤压和多部位受伤对工人发生致死性职业伤害风险影响较大。
    结论 LightGBM模型能够高效地处理大规模数据并提供高精度的预测结果。模型可解释性研究有助于更准确地探索、分析采矿业工人发生致死性职业伤害的各种风险关键因素,并进一步揭示这些因素间的复杂交互作用,从而为劳动工人提供更好的预防干预保护措施和最佳的资源配置。

     

    Abstract:
    Background Light gradient boosting machine (LightGBM) has become a popular choice for building prediction models because of its efficiency and speed. However, the "black box" nature of machine learning models makes them difficult to interpret. To date, few studies have assessed occupational injury severity using LightGBM together with model interpretability methods.
    Objective To evaluate the application value of LightGBM models and model interpretability methods in occupational injury prediction.
    Methods The U.S. Mine Safety and Health Administration (MSHA) data set of occupational injuries among mining workers from 1983 to 2022 was used. Injury severity (death/fatal occupational injury and permanent/partial disability) was the outcome variable. A total of 17 predictor variables were considered: month of injury occurrence, age, sex, time of accident, shift start time, interval between shift start and accident, total work experience, total mining experience, experience at this mine, cause of injury, accident type, activity at the time of injury (what the worker was doing when injured), source of injury, body part of injury, work environment type, product category, and nature of injury. Features were selected using least absolute shrinkage and selection operator (Lasso) regression. A LightGBM model was then built as the occupational injury prediction model, with the area under the curve (AUC) as the primary evaluation metric; the closer the AUC is to 1, the better the model's predictive performance. Model interpretability was assessed using Shapley additive explanations (SHAP).
    Results Lasso regression identified seven key influencing factors: interval between shift start and accident, experience at this mine, cause of injury, accident type, body part of injury, nature of injury, and work environment type. The LightGBM model built on the Lasso-selected features showed good predictive performance, with an AUC of 0.9941 (95% CI: 0.9917, 0.9966), accuracy of 0.9743, specificity of 0.9781, and sensitivity of 0.9640. The predicted probability of fatal occupational injury agreed closely with the observed probability. SHAP analysis of feature importance showed that body part of injury and nature of injury were the two main features driving the model's predictions, while the other features had relatively small effects. SHAP values for body part of injury were widely distributed; injuries to the head and neck and multi-part injuries in particular had a marked effect on the predicted risk of fatality. Nature of injury influenced the predictions in different directions, with suffocation/drowning, crushing, and multi-part injuries contributing most to the risk of fatal occupational injury.
    Conclusion The LightGBM model can process large-scale data efficiently and deliver highly accurate predictions. Model interpretability analysis helps to identify and analyze the key risk factors for fatal occupational injuries among mining workers more accurately and to reveal the complex interactions among these factors, which in turn supports better preventive interventions and protective measures for workers and more effective resource allocation.
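
The evaluation metrics and the SHAP attribution described in the abstract can be illustrated with a small, self-contained Python sketch. It is not the authors' code and uses neither the MSHA data nor LightGBM: the labels, scores, and `toy_model` with its three feature names (`head_neck`, `crushing`, `long_shift`) are hypothetical and chosen only for illustration. The first part computes accuracy, sensitivity, specificity, and a rank-based AUC; the second computes exact Shapley values by enumerating feature coalitions, which is practical only for a handful of features.

```python
from itertools import combinations
from math import factorial


def confusion_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity for binary labels (1 = fatal)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


def auc(y_true, scores):
    """AUC as the chance that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def exact_shapley(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    Features outside a coalition are replaced by their baseline value.
    """
    n = len(x)

    def value(coalition):
        return model([x[i] if i in coalition else baseline[i]
                      for i in range(n)])

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


# Hypothetical labels and predicted probabilities (illustration only).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
metrics = confusion_metrics(y_true, y_pred)
auc_value = auc(y_true, scores)


def toy_model(features):
    """Hypothetical risk score with one interaction term (not LightGBM)."""
    head_neck, crushing, long_shift = features
    return (0.05 + 0.5 * head_neck + 0.2 * crushing
            + 0.2 * head_neck * crushing + 0.01 * long_shift)


x = [1, 1, 0]          # case to explain
baseline = [0, 0, 0]   # reference case
phi = exact_shapley(toy_model, x, baseline)
```

In this toy case the Shapley values sum to the difference between the model output for `x` and for the baseline, which is the additivity property that SHAP relies on. For tree ensembles such as LightGBM, the `shap` library provides a tree-specific explainer, so coalitions do not have to be enumerated this way.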

     
