@Article{info:doi/10.2196/jmir.1123, author="Himmel, Wolfgang and Reincke, Ulrich and Michelmann, Hans Wilhelm", title="Text Mining and Natural Language Processing Approaches for Automatic Categorization of Lay Requests to Web-Based Expert Forums", journal="J Med Internet Res", year="2009", month="Jul", day="22", volume="11", number="3", pages="e25", keywords="text mining; qualitative research; natural language processing; consumer health informatics; Internet; remote consultation", abstract="Background: Both healthy and sick people increasingly use electronic media to obtain medical information and advice. For example, Internet users may send requests to Web-based expert forums, or so-called ``ask the doctor'' services. Objective: To automatically classify lay requests to an Internet medical expert forum using a combination of different text-mining strategies. Methods: We first manually classified a sample of 988 requests directed to an involuntary childlessness forum on the German website ``Rund ums Baby'' (``Everything about Babies'') into one or more of 38 categories belonging to two dimensions (``subject matter'' and ``expectations''). After creating start and synonym lists, we calculated the average Cramer's V statistic for the association of each word with each category. We also used principal component analysis and singular value decomposition as further text-mining strategies. With these measures we trained regression models and determined, on the basis of the best regression models, for any request the probability of belonging to each of the 38 different categories, with a cutoff of 50{\%}. Recall and precision of a test sample were calculated as a measure of quality for the automatic classification. Results: According to the manual classification of 988 documents, 102 (10{\%}) documents fell into the category ``in vitro fertilization (IVF),'' 81 (8{\%}) into the category ``ovulation,'' 79 (8{\%}) into ``cycle,'' and 57 (6{\%}) into ``semen analysis.'' These were the four most frequent categories in the subject matter dimension (consisting of 32 categories). The expectation dimension comprised six categories; we classified 533 documents (54{\%}) as ``general information'' and 351 (36{\%}) as a wish for ``treatment recommendations.'' The generation of indicator variables based on the chi-square analysis and Cramer's V proved to be the best approach for automatic classification in about half of the categories.
In combination with the two other approaches, 100{\%} precision and 100{\%} recall were realized in 18 (47{\%}) out of the 38 categories in the test sample. For 35 (92{\%}) categories, precision and recall were better than 80{\%}. For some categories, the input variables (ie, ``words'') also included variables from other categories, most often with a negative sign. For example, absence of words predictive for ``menstruation'' was a strong indicator for the category ``pregnancy test.'' Conclusions: Our approach suggests a way of automatically classifying and analyzing unstructured information in Internet expert forums. The technique can perform a preliminary categorization of new requests and help Internet medical experts to better handle the mass of information and to give professional feedback. ", issn="1438-8871", doi="10.2196/jmir.1123", url="//www.mybigtv.com/2009/3/e25/", url="https://doi.org/10.2196/jmir.1123", url="http://www.ncbi.nlm.nih.gov/pubmed/19632978" }