@Article{info:doi/10.2196/publichealth.6327,
author="Daniulaityte, Raminta and Chen, Lu and Lamy, Francois R and Carlson, Robert G and Thirunarayan, Krishnaprasad and Sheth, Amit",
title="``When `Bad' is `Good''': Identifying Personal Communication and Sentiment in Drug-Related Tweets",
journal="JMIR Public Health and Surveillance",
year="2016",
month="Oct",
day="24",
volume="2",
number="2",
pages="e162",
keywords="social media; Twitter; cannabis; synthetic cannabinoids; machine learning; sentiment analysis",
abstract="Background: To harness the full potential of social media for epidemiological surveillance of drug abuse trends, the field needs a greater level of automation in processing and analyzing social media content. Objective: The objective of the study is to describe the development of supervised machine-learning techniques for the eDrugTrends platform to automatically classify tweets by type/source of communication (personal, official/media, retail) and sentiment (positive, negative, neutral) expressed in cannabis- and synthetic cannabinoid--related tweets. Methods: Tweets were collected using the Twitter streaming Application Programming Interface and filtered through the eDrugTrends platform using keywords related to cannabis, marijuana edibles, marijuana concentrates, and synthetic cannabinoids. After creating coding rules and assessing intercoder reliability, a manually labeled data set (N=4000) was developed by coding several batches of randomly selected subsets of tweets extracted from the pool of 15,623,869 collected by eDrugTrends (May-November 2015). Out of 4000 tweets, 25{\%} (1000/4000) were used to build source classifiers and 75{\%} (3000/4000) were used for sentiment classifiers. Logistic Regression (LR), Naive Bayes (NB), and Support Vector Machines (SVM) were used to train the classifiers. Source classification (n=1000) tested Approach 1 that used short URLs, and Approach 2 where URLs were expanded and included into the bag-of-words analysis. For sentiment classification, Approach 1 used all tweets, regardless of their source/type (n=3000), while Approach 2 applied sentiment classification to personal communication tweets only (2633/3000, 88{\%}). Multiclass and binary classification tasks were examined, and machine-learning sentiment classifier performance was compared with Valence Aware Dictionary for sEntiment Reasoning (VADER), a lexicon and rule-based method. The performance of each classifier was assessed using 5-fold cross validation that calculated average F-scores. One-tailed t test was used to determine if differences in F-scores were statistically significant. 
Results: In multiclass source classification, the use of expanded URLs did not contribute to significant improvement in classifier performance (0.7972 vs 0.8102 for SVM, P=.19). In binary classification, the identification of all source categories improved significantly when unshortened URLs were used, with personal communication tweets benefiting the most (0.8736 vs 0.8200, P<.001). In multiclass sentiment classification Approach 1, SVM (0.6723) performed similarly to NB (0.6683) and LR (0.6703). In Approach 2, SVM (0.7062) did not differ from NB (0.6980, P=.13) or LR (F=0.6931, P=.05), but it was over 40{\%} more accurate than VADER (F=0.5030, P<.001). In the multiclass task, improvements in sentiment classification (Approach 2 vs Approach 1) did not reach statistical significance (eg, SVM: 0.7062 vs 0.6723, P=.052). In binary sentiment classification (positive vs negative), Approach 2 (focus on personal communication tweets only) improved classification results, compared with Approach 1, for LR (0.8752 vs 0.8516, P=.04) and SVM (0.8800 vs 0.8557, P=.045). Conclusions: The study provides an example of the use of supervised machine learning methods to categorize cannabis- and synthetic cannabinoid--related tweets with fairly high accuracy. Use of these content analysis tools along with geographic identification capabilities developed by the eDrugTrends platform will provide powerful methods for tracking regional changes in user opinions related to cannabis and synthetic cannabinoids use over time and across different regions.",
issn="2369-2960",
doi="10.2196/publichealth.6327",
url="http://publichealth.www.mybigtv.com/2016/2/e162/",
url="https://doi.org/10.2196/publichealth.6327",
url="http://www.ncbi.nlm.nih.gov/pubmed/27777215"
}