@Article{info:doi/10.2196/41342,
author="Chen, Pei-Fu and He, Tai-Liang and Lin, Sheng-Che and Chu, Yuan-Chia and Kuo, Chen-Tsung and Lai, Feipei and Wang, Ssu-Ming and Zhu, Wan-Xuan and Chen, Kuan-Chih and Kuo, Lu-Cheng and Hung, Fang-Ming and Lin, Yu-Cheng and Tsai, I-Chang and Chiu, Chi-Hao and Chang, Shu-Chih and Yang, Chi-Yu",
title="Training a Deep Contextualized Language Model for International Classification of Diseases, 10th Revision Classification via Federated Learning: Model Development and Validation Study",
journal="JMIR Med Inform",
year="2022",
month="11",
day="10",
volume="10",
number="11",
pages="e41342",
keywords="federated learning; International Classification of Diseases; machine learning; natural language processing",
abstract="Background: Automated coding of clinical documents with the International Classification of Diseases, 10th Revision (ICD-10) enables statistical analyses and reimbursement. With the development of natural language processing models, new transformer architectures with attention mechanisms have outperformed previous models. Although multicenter training can improve a model's performance and external validity, the privacy of clinical documents must be protected. We used federated learning to train a model with multicenter data, without sharing the data itself. Objective: This study aims to train an ICD-10 multilabel classification model via federated learning. Methods: Text data from discharge notes in electronic medical records were collected from the following three medical centers: Far Eastern Memorial Hospital, National Taiwan University Hospital, and Taipei Veterans General Hospital. After comparing the performance of different variants of bidirectional encoder representations from transformers (BERT), PubMedBERT was chosen for the word embeddings. With regard to preprocessing, the nonalphanumeric characters were retained because the model's performance decreased after the removal of these characters. To explain the outputs of our model, we added a label attention mechanism to the model architecture. The model was trained with data from each of the three hospitals separately and via federated learning. The models trained via federated learning and the models trained with local data were compared on a testing set that was composed of data from the three hospitals. The micro F1 score was used to evaluate model performance across all 3 centers. Results: The F1 scores of PubMedBERT, RoBERTa (Robustly Optimized BERT Pretraining Approach), ClinicalBERT, and BioBERT (BERT for Biomedical Text Mining) were 0.735, 0.692, 0.711, and 0.721, respectively. The F1 score of the model that retained nonalphanumeric characters was 0.8120, whereas the F1 score after removing these characters was 0.7875, a decrease of 0.0245 (3.11{\%}). The F1 scores on the testing set were 0.6142, 0.4472, 0.5353, and 0.2522 for the federated learning, Far Eastern Memorial Hospital, National Taiwan University Hospital, and Taipei Veterans General Hospital models, respectively. The explainable predictions were displayed with highlighted input words via the label attention architecture. Conclusions: Federated learning was used to train the ICD-10 classification model on multicenter clinical text while protecting data privacy. The model's performance was better than that of models that were trained locally.",
issn="2291-9694",
doi="10.2196/41342",
url="https://medinform.jmir.org/2022/11/e41342",
url="https://doi.org/10.2196/41342",
url="http://www.ncbi.nlm.nih.gov/pubmed/36355417"
}