@Article{info:doi/10.2196/jmir.8164,
author="Sarker, Abeed and Chandrashekar, Pramod and Magge, Arjun and Cai, Haitao and Klein, Ari and Gonzalez, Graciela",
title="Discovering Cohorts of Pregnant Women From Social Media for Safety Surveillance and Analysis",
journal="J Med Internet Res",
year="2017",
month="10",
day="30",
volume="19",
number="10",
pages="e361",
keywords="natural language processing; machine learning; text mining; social media; pregnancy; cohort studies",
abstract="Background: Pregnancy exposure registries are the primary sources of information about the safety of maternal usage of medications during pregnancy. Such registries enroll pregnant women in a voluntary fashion early on in pregnancy and follow them until the end of pregnancy or longer to systematically collect information regarding specific pregnancy outcomes. Although the model of pregnancy registries has distinct advantages over other study designs, they are faced with numerous challenges and limitations such as low enrollment rate, high cost, and selection bias. Objective: The primary objectives of this study were to systematically assess whether social media (Twitter) can be used to discover cohorts of pregnant women and to develop and deploy a natural language processing and machine learning pipeline for the automatic collection of cohort information. In addition, we also attempted to ascertain, in a preliminary fashion, what types of longitudinal information may potentially be mined from the collected cohort information. Methods: Our discovery of pregnant women relies on detecting pregnancy-indicating tweets (PITs), which are statements posted by pregnant women regarding their pregnancies. We used a set of 14 patterns to first detect potential PITs. We manually annotated a sample of 14,156 of the retrieved user posts to distinguish real PITs from false positives and trained a supervised classification system to detect real PITs. We optimized the classification system via cross validation, with features and settings targeted toward optimizing precision for the positive class. For users identified to be posting real PITs via automatic classification, our pipeline collected all their available past and future posts from which other information (eg, medication usage and fetal outcomes) may be mined. Results: Our rule-based PIT detection approach retrieved over 200,000 posts over a period of 18 months. Manual annotation agreement for three annotators was very high at kappa ($\kappa$)=.79. On a blind test set, the implemented classifier obtained an overall F1 score of 0.84 (0.88 for the pregnancy class and 0.68 for the nonpregnancy class). Precision for the pregnancy class was 0.93, and recall was 0.84. Feature analysis showed that the combination of dense and sparse vectors for classification achieved optimal performance. Employing the trained classifier resulted in the identification of 71,954 users from the collected posts. Over 250 million posts were retrieved for these users, which provided a multitude of longitudinal information about them. Conclusions: Social media sources such as Twitter can be used to identify large cohorts of pregnant women and to gather longitudinal information via automated processing of their postings. Considering the many drawbacks and limitations of pregnancy registries, social media mining may provide beneficial complementary information. Although the cohort sizes identified over social media are large, future research will have to assess the completeness of the information available through them.",
issn="1438-8871",
doi="10.2196/jmir.8164",
url="//www.mybigtv.com/2017/10/e361/",
url="https://doi.org/10.2196/jmir.8164",
url="http://www.ncbi.nlm.nih.gov/pubmed/29084707"
}