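TY  - JOUR
AU  - Daniulaityte, Raminta
AU  - Chen, Lu
AU  - Lamy, Francois R
AU  - Carlson, Robert G
AU  - Thirunarayan, Krishnaprasad
AU  - Sheth, Amit
PY  - 2016
DA  - 2016/10/24
TI  - "When 'Bad' is 'Good'": Identifying Personal Communication and Sentiment in Drug-Related Tweets
JO  - JMIR Public Health Surveill
SP  - e162
VL  - 2
IS  - 2
KW  - social media
KW  - Twitter
KW  - cannabis
KW  - synthetic cannabinoids
KW  - machine learning
KW  - sentiment analysis
KW  - eDrugTrends
AB  - Background: To harness the full potential of social media for epidemiological surveillance of drug abuse trends, the field needs a greater level of automation in processing and analyzing social media content. Objective: The objective of this study is to describe the development of supervised machine learning techniques for the eDrugTrends platform that automatically classify tweets by type/source of communication (personal, official/media, retail) and by sentiment (positive, negative, neutral) expressed in cannabis- and synthetic cannabinoid–related tweets. Methods: Tweets were collected using the Twitter streaming application programming interface and filtered through the eDrugTrends platform using keywords related to cannabis, marijuana edibles, marijuana concentrates, and synthetic cannabinoids. After creating coding rules and assessing intercoder reliability, a manually labeled data set (N=4000) was developed by coding several batches of randomly selected subsets of tweets drawn from the pool of 15,623,869 tweets collected by eDrugTrends (May-November 2015). Of the 4000 tweets, 25% (1000/4000) were used to build source classifiers and 75% (3000/4000) were used for sentiment classifiers. Logistic regression (LR), naive Bayes (NB), and support vector machines (SVM) were used to train the classifiers. Source classification (n=1000) tested Approach 1, which used short URLs, and Approach 2, in which URLs were expanded and included in the bag-of-words analysis. For sentiment classification, Approach 1 used all tweets regardless of their source/type (n=3000), whereas Approach 2 applied sentiment classification to personal communication tweets only (2633/3000, 88%). Multiclass and binary classification tasks were examined, and the performance of the machine learning sentiment classifiers was compared with the Valence Aware Dictionary for sEntiment Reasoning (VADER), a lexicon- and rule-based method. The performance of each classifier was assessed using 5-fold cross-validation that calculated average F-scores. A one-tailed t test was used to determine whether differences in F-scores were statistically significant. Results: In multiclass source classification, the use of expanded URLs did not contribute to significant improvement in classifier performance (0.7972 vs 0.8102 for SVM, P=.19). In binary classification, the identification of all source categories improved significantly when unshortened URLs were used, with personal communication tweets benefiting the most (0.8736 vs 0.8200, P<.001). In multiclass sentiment classification Approach 1, SVM (0.6723) performed similarly to NB (0.6683) and LR (0.6703). In Approach 2, SVM (0.7062) did not differ significantly from NB (0.6980, P=.13) or LR (F=0.6931, P=.05), but it was over 40% more accurate than VADER (F=0.5030, P<.001). In the multiclass task, improvements in sentiment classification (Approach 2 vs Approach 1) did not reach statistical significance (eg, SVM: 0.7062 vs 0.6723, P=.052). In binary sentiment classification (positive vs negative), Approach 2 (focus on personal communication tweets only) improved classification results, compared with Approach 1, for LR (0.8752 vs 0.8516, P=.04) and SVM (0.8800 vs 0.8557, P=.045). Conclusions: The study provides an example of the use of supervised machine learning methods to categorize cannabis- and synthetic cannabinoid–related tweets with fairly high accuracy. Use of these content analysis tools along with geographic identification capabilities developed by the eDrugTrends platform will provide powerful methods for tracking regional changes in user opinions related to cannabis and synthetic cannabinoid use over time.
SN  - 2369-2960
UR  - http://publichealth.www.mybigtv.com/2016/2/e162/
UR  - https://doi.org/10.2196/publichealth.6327
UR  - http://www.ncbi.nlm.nih.gov/pubmed/27777215
DO  - 10.2196/publichealth.6327
ID  - info:doi/10.2196/publichealth.6327
ER  - 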