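@Article{info:doi/10.2196/17853, author="Zowalla, Richard and Wetter, Thomas and Pfeifer, Daniel", title="Crawling the German Health Web: Exploratory Study and Graph Analysis", journal="J Med Internet Res", year="2020", month="Jul", day="24", volume="22", number="7", pages="e17853", keywords="health information; internet; web crawling", abstract="Background: The internet has become an increasingly important resource for health information. However, with a growing number of web pages, it is nearly impossible for humans to manually keep track of evolving and continuously changing content in the health domain. To better understand the nature of all web-based health information available in a specific language, it is important to identify (1) information hubs for the health domain, (2) content providers of high prestige, and (3) important topics and trends in the health-related web. In this context, an automatic web crawling approach can provide the necessary data for a computational and statistical analysis to answer (1) to (3). Objective: This study demonstrates the suitability of a focused crawler for acquiring the German Health Web (GHW), which includes all health-related web content of the three mostly German-speaking countries Germany, Austria, and Switzerland. Based on the gathered data, we provide a preliminary analysis of the GHW's graph structure, covering its size, its most important content providers, and the ratio of public to private stakeholders. In addition, we report our experiences in building and operating such a highly scalable crawler. Methods: A support vector machine classifier was trained on a large data set acquired from various German content providers to distinguish between health-related and non-health-related web pages. The classifier was evaluated using accuracy, recall, and precision on an 80/20 training/test split (TD1) and against a crowd-validated data set (TD2). To implement the crawler, we extended the open-source framework StormCrawler. The actual crawl was conducted for 227 days. The crawler was evaluated using harvest rate, and its recall was estimated using a seed-target approach. Results: In total, n=22,405 seed URLs with the country-code top-level domains .de: 85.36{\%} (19,126/22,405), .at: 6.83{\%} (1530/22,405), and .ch: 7.81{\%} (1749/22,405) were collected from Curlie and a previous crawl. The text classifier achieved an accuracy on TD1 of 0.937 (TD2=0.966), a precision on TD1 of 0.934 (TD2=0.954), and a recall on TD1 of 0.944 (TD2=0.989). The crawl yielded 13.5 million presumably relevant and 119.5 million nonrelevant web pages. The average harvest rate was 19.76{\%}; recall was 0.821 (4105/5000 targets found). The resulting host-aggregated graph contains 215,372 nodes and 403,175 edges (network diameter=25; average path length=6.466; average degree=1.872; average in-degree=1.892; average out-degree=1.845; modularity=0.723). Among the 25 top-ranked pages for each country (according to PageRank), 40{\%} (30/75) were websites published by public institutions, 25{\%} (19/75) were published by nonprofit organizations, and 35{\%} (26/75) by private organizations or individuals.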
Conclusions: The results indicate that the presented crawler is a suitable method for acquiring a large fraction of the GHW. As desired, the computed statistical data allow major information hubs and important content providers on the GHW to be determined. In the future, the acquired data may be used to assess important topics and trends and to build health-specific search engines.", issn="1438-8871", doi="10.2196/17853", url="//www.mybigtv.com/2020/7/e17853/", url="https://doi.org/10.2196/17853", url="http://www.ncbi.nlm.nih.gov/pubmed/32706701" }