Background Light gradient boosting machine (LightGBM) has become a popular choice for prediction models because of its efficiency and speed. However, the "black box" nature of machine learning models makes them difficult to interpret. To date, few studies have evaluated the severity of occupational injuries using LightGBM models together with model interpretability methods.
Objective To evaluate the application value of LightGBM models and model interpretability methods in occupational injury prediction.
Methods The Mine Safety and Health Administration (MSHA) occupational injury data set of mining industry workers from 1983 to 2022 was used. Injury severity (fatal occupational injury or permanent/partial disability) was the outcome variable, and the predictor variables included the month of occurrence, age, sex, time of accident, time since beginning of shift, accident time interval from shift start, total experience, total mining experience, experience at this mine, cause of injury, accident type, activity of injury, source of injury, body part of injury, work environment type, product category, and nature of injury. Features were screened using least absolute shrinkage and selection operator (Lasso) regression. A LightGBM model was then used to predict occupational injury severity, with the area under the curve (AUC) as the primary evaluation metric; an AUC closer to 1 indicates better predictive performance. The model was interpreted using Shapley additive explanations (SHAP).
Results Lasso regression identified seven key predictors: accident time interval from shift start, experience at this mine, cause of injury, accident type, body part of injury, nature of injury, and work environment type. The LightGBM model built on the Lasso-selected features showed good predictive performance, with an AUC of 0.9941 (95%CI: 0.9917, 0.9966), accuracy of 0.9743, specificity of 0.9781, and sensitivity of 0.9640. The predicted probability of fatal occupational injury was highly consistent with the observed probability. Feature importance, assessed by SHAP values, showed that body part of injury and nature of injury were the two main features driving the model's predictions, while the impacts of the other features were relatively small. The SHAP values for body part of injury were widely distributed and had a substantial impact on the predicted risk of death, particularly for injuries to the head and neck and for multi-part injuries. Nature of injury influenced the predictions in both directions, with suffocation/drowning, crushing, and multi-part injuries having a greater impact on the risk of fatal occupational injury.
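As a reminder of how the accuracy, sensitivity, and specificity reported above are defined, the following minimal sketch computes them from confusion-matrix counts. The function name `binary_metrics` and the counts are hypothetical, chosen only to make the arithmetic easy to follow; they are not taken from the study.

```python
def binary_metrics(tp, fn, tn, fp):
    """Return accuracy, sensitivity and specificity from confusion-matrix counts.

    tp/fn: fatal cases predicted as fatal / as non-fatal.
    tn/fp: non-fatal cases predicted as non-fatal / as fatal.
    """
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,     # share of all cases classified correctly
        "sensitivity": tp / (tp + fn),     # share of fatal cases detected
        "specificity": tn / (tn + fp),     # share of non-fatal cases correctly ruled out
    }


# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=90, fn=10, tn=190, fp=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Sensitivity and specificity are each computed within one outcome class, so they do not depend on how common fatal injuries are in the data, whereas accuracy is a prevalence-weighted mix of the two.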
Conclusion The LightGBM model can efficiently process large-scale data and provide highly accurate predictions. Research on model interpretability helps to explore and analyze the key risk factors for fatal occupational injuries among mining workers more accurately, and helps to reveal the complex interactions among these factors. This, in turn, supports better preventive interventions and protective measures, as well as better allocation of resources for workers.