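TY - JOUR
AU - Zowalla, Richard
AU - Wetter, Thomas
AU - Pfeifer, Daniel
PY - 2020
DA - 2020/7/24
TI - Crawling the German Health Web: Exploratory Study and Graph Analysis
JO - J Med Internet Res
SP - e17853
VL - 22
IS - 7
KW - health information
KW - internet
KW - web crawling
KW - distributed systems
AB - Background: The internet has become an increasingly important resource for health information. However, with a growing number of web pages, it is nearly impossible for humans to manually keep track of evolving and continuously changing content in the health domain. To better understand the nature of all web-based health information available in a specific language, it is important to identify (1) information hubs for the health domain, (2) content providers of high prestige, and (3) important topics and trends in the health-related web. In this context, an automatic web crawling approach can provide the necessary data for a computational and statistical analysis to answer (1) to (3). Objective: This study demonstrates the suitability of a focused crawler for acquiring the German Health Web (GHW), which includes all health-related web content of the three mostly German-speaking countries Germany, Austria, and Switzerland. Based on the gathered data, we provide a preliminary analysis of the GHW's graph structure, covering its size, its most important content providers, and the ratio of public to private stakeholders. In addition, we report our experiences in building and operating such a highly scalable crawler. Methods: A support vector machine classifier was trained on a large data set acquired from various German content providers to distinguish between health-related and non-health-related web pages. The classifier was evaluated using accuracy, recall, and precision on an 80/20 training/test split (TD1) and on a crowd-validated data set (TD2). To implement the crawler, we extended the open-source framework StormCrawler. The actual crawl was conducted for 227 days. The crawler was evaluated using harvest rate, and its recall was estimated using a seed-target approach. Results: In total, n=22,405 seed URLs with country-code top-level domains .de: 85.36% (19,126/22,405), .at: 6.83% (1530/22,405), and .ch: 7.81% (1749/22,405) were collected from Curlie and a previous crawl. The text classifier achieved an accuracy on TD1 of 0.937 (TD2=0.966), a precision on TD1 of 0.934 (TD2=0.954), and a recall on TD1 of 0.944 (TD2=0.989). The crawl yielded 13.5 million presumably relevant and 119.5 million nonrelevant web pages. The average harvest rate was 19.76%; recall was 0.821 (4105/5000 targets found). The resulting host-aggregated graph contains 215,372 nodes and 403,175 edges (network diameter=25; average path length=6.466; average degree=1.872; average in-degree=1.892; average out-degree=1.845; modularity=0.723). Among the 25 top-ranked pages for each country (according to PageRank), 40% (30/75) were websites published by public institutions, 25% (19/75) were published by nonprofit organizations, and 35% (26/75) by private organizations or individuals. Conclusions: The results indicate that the presented crawler is a suitable method for acquiring a large fraction of the GHW. As desired, the computed statistical data allow major information hubs and important content providers on the GHW to be determined. In the future, the acquired data may be used to assess important topics and trends, and also to build health-specific search engines.
SN - 1438-8871
UR - //www.mybigtv.com/2020/7/e17853/
UR - https://doi.org/10.2196/17853
UR - http://www.ncbi.nlm.nih.gov/pubmed/32706701
DO - 10.2196/17853
ID - info:doi/10.2196/17853
ER -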