@Article{info:doi/10.2196/15601, author="Rahman, Quazi Abidur and Janmohamed, Tahir and Clarke, Hance and Ritvo, Paul and Heffernan, Jane and Katz, Joel", title="Interpretability and Class Imbalance in Prediction Models for Pain Volatility in Manage My Pain App Users: Analysis Using Feature Selection and Majority Voting Methods", journal="JMIR Med Inform", year="2019", month="Nov", day="20", volume="7", number="4", pages="e15601", keywords="chronic pain; pain volatility; data mining; cluster analysis; machine learning; prediction model; Manage My Pain", abstract="Background: Pain volatility is an important factor in chronic pain experience and adaptation. Previously, we employed machine learning methods to define and predict pain volatility levels from users of the Manage My Pain app. Reducing the number of features is important to help increase the interpretability of such prediction models. Prediction results also need to be consolidated from multiple random subsamples to address the class imbalance issue. Objective: This study aimed to: (1) increase the interpretability of previously developed pain volatility models by identifying the most important features that distinguish high from low volatility users; and (2) consolidate prediction results from models derived from multiple random subsamples while addressing the class imbalance issue. Methods: A total of 132 features were extracted from the first month of app use to develop machine learning--based models for predicting pain volatility at the sixth month of app use. Three feature selection methods were applied to identify features that were significantly better predictors than other members of the large features set used for developing the prediction models: (1) Gini impurity criterion; (2) information gain criterion; and (3) Boruta. We then combined the three groups of important features determined by these algorithms to produce the final list of important features. Three machine learning methods were then employed to conduct prediction experiments using the selected important features: (1) logistic regression with ridge estimators; (2) logistic regression with least absolute shrinkage and selection operator; and (3) random forests. Multiple random under-sampling of the majority class was conducted to address class imbalance in the dataset. Subsequently, a majority voting approach was employed to consolidate prediction results from these multiple subsamples. The total number of users included in this study was 879, with a total number of 391,255 pain records. 
Results: A threshold of 1.6 was established using clustering methods to differentiate between 2 classes: low volatility (n=694) and high volatility (n=185). The overall prediction accuracy is approximately 70{\%} for both random forests and logistic regression models when using 132 features. Overall, 9 important features were identified using 3 feature selection methods. Of these 9 features, 2 are from the app use category and the other 7 are related to pain statistics. After consolidating models that were developed using random subsamples by majority voting, logistic regression models performed equally well using 132 or 9 features. Random forests performed better than logistic regression methods in predicting the high volatility class. The consolidated accuracy of random forests does not drop significantly (601/879; 68.4{\%} vs 618/879; 70.3{\%}) when only 9 important features are included in the prediction model. Conclusions: We employed feature selection methods to identify important features in predicting future pain volatility. To address class imbalance, we consolidated models that were developed using multiple random subsamples by majority voting. Reducing the number of features did not result in a significant decrease in the consolidated prediction accuracy. ", issn="2291-9694", doi="10.2196/15601", url="http://medinform.www.mybigtv.com/2019/4/e15601/", url="https://doi.org/10.2196/15601", url="http://www.ncbi.nlm.nih.gov/pubmed/31746764" }