@Article{info:doi/10.2196/36711, author="Guardiolle, Vianney and Bazoge, Adrien and Morin, Emmanuel and Daille, B{\'e}atrice and Toublant, Delphine and Bouzill{\'e}, Guillaume and Merel, Youenn and Pierre-Jean, Morgane and Filiot, Alexandre and Cuggia, Marc and Wargny, Matthieu and Lamer, Antoine and Gourraud, Pierre-Antoine", title="Linking Biomedical Data Warehouse Records With the National Mortality Database in France: Large-scale Matching Algorithm", journal="JMIR Med Inform", year="2022", month="Nov", day="1", volume="10", number="11", pages="e36711", keywords="data warehousing; clinical data warehouse; medical informatics applications; medical record linkage; French National Mortality Database; data reuse; open data, R", abstract="Background: Often missing from or uncertain in a biomedical data warehouse (BDW), vital status after discharge is central to the value of a BDW in medical research. The French National Mortality Database (FNMD) offers open-source nominative records of every death. Matching large-scale BDWs records with the FNMD combines multiple challenges: absence of unique common identifiers between the 2 databases, names changing over life, clerical errors, and the exponential growth of the number of comparisons to compute. Objective: We aimed to develop a new algorithm for matching BDW records to the FNMD and evaluated its performance. Methods: We developed a deterministic algorithm based on advanced data cleaning and knowledge of the naming system and the Damerau-Levenshtein distance (DLD). The algorithm's performance was independently assessed using BDW data of 3 university hospitals: Lille, Nantes, and Rennes. Specificity was evaluated with living patients on January 1, 2016 (ie, patients with at least 1 hospital encounter before and after this date). Sensitivity was evaluated with patients recorded as deceased between January 1, 2001, and December 31, 2020. The DLD-based algorithm was compared to a direct matching algorithm with minimal data cleaning as a reference. Results: All centers combined, sensitivity was 11{\%} higher for the DLD-based algorithm (93.3{\%}, 95{\%} CI 92.8-93.9) than for the direct algorithm (82.7{\%}, 95{\%} CI 81.8-83.6; P<.001). 
Sensitivity was superior for men at 2 centers (Nantes: 87{\%}, 95{\%} CI 85.1-89 vs 83.6{\%}, 95{\%} CI 81.4-85.8; P=.006; Rennes: 98.6{\%}, 95{\%} CI 98.1-99.2 vs 96{\%}, 95{\%} CI 94.9-97.1; P<.001) and for patients born in France at all centers (Nantes: 85.8{\%}, 95{\%} CI 84.3-87.3 vs 74.9{\%}, 95{\%} CI 72.8-77.0; P<.001). The DLD-based algorithm revealed significant differences in sensitivity among centers (Nantes, 85.3{\%} vs Lille and Rennes, 97.3{\%}, P<.001). Specificity was >98{\%} in all subgroups. Our algorithm matched tens of millions of death records from BDWs, with parallel computing capabilities and low RAM requirements. We used the Inseehop open-source R script for this measurement. Conclusions: Overall, sensitivity/recall was 11{\%} higher using the DLD-based algorithm than that using the direct algorithm. This shows the importance of advanced data cleaning and knowledge of a naming system through DLD use. Statistically significant differences in sensitivity between groups could be found and must be considered when performing an analysis to avoid differential biases. Our algorithm, originally conceived for linking a BDW with the FNMD, can be used to match any large-scale databases. While matching operations using names are considered sensitive computational operations, the Inseehop package released here is easy to run on premises, thereby facilitating compliance with cybersecurity local framework. The use of an advanced deterministic matching algorithm such as the DLD-based algorithm is an insightful example of combining open-source external data to improve the usage value of BDWs. ", issn="2291-9694", doi="10.2196/36711", url="https://medinform.www.mybigtv.com/2022/11/e36711", url="https://doi.org/10.2196/36711", url="http://www.ncbi.nlm.nih.gov/pubmed/36318244" }