@Article{info:doi/10.2196/jmir.4392, author="Cole-Lewis, Heather and Varghese, Arun and Sanders, Amy and Schwarz, Mary and Pugatch, Jillian and Augustson, Erik", title="Assessing Electronic Cigarette-Related Tweets for Sentiment and Content Using Supervised Machine Learning", journal="J Med Internet Res", year="2015", month="8", day="25", volume="17", number="8", pages="e208", keywords="social media; Twitter; e-cigarettes", abstract="Background: Electronic cigarettes (e-cigarettes) continue to be a growing topic among social media users, especially on Twitter. The ability to analyze conversations about e-cigarettes in real time can provide important insight into trends in the public's knowledge, attitudes, and beliefs surrounding e-cigarettes, and subsequently guide public health interventions. Objective: Our aim was to establish a supervised machine learning algorithm to build predictive classification models that assess Twitter data for a range of factors related to e-cigarettes. Methods: Manual content analysis was conducted for 17,098 tweets. These tweets were coded for five categories: e-cigarette relevance, sentiment, user description, genre, and theme. Machine learning classification models were then built for each of these five categories, and word groupings (n-grams) were used to define the feature space for each classifier. Results: Predictive performance scores for classification models indicated that the models correctly labeled the tweets with the appropriate variables between 68.40{\%} and 99.34{\%} of the time, and the percentage of maximum possible improvement over a random baseline that was achieved by the classification models ranged from 41.59{\%} to 80.62{\%}. Classifiers with the highest performance scores that also achieved the highest percentage of the maximum possible improvement over a random baseline were Policy/Government (performance: 0.94; {\%} improvement: 80.62{\%}), Relevance (performance: 0.94; {\%} improvement: 75.26{\%}), Ad or Promotion (performance: 0.89; {\%} improvement: 72.69{\%}), and Marketing (performance: 0.91; {\%} improvement: 72.56{\%}). The most appropriate word-grouping unit (n-gram) was 1 for the majority of classifiers. Performance continued to marginally increase with the size of the training dataset of manually annotated data, but eventually leveled off. Even at low dataset sizes of 4000 observations, performance characteristics were fairly sound. Conclusions: Social media outlets like Twitter can uncover real-time snapshots of personal sentiment, knowledge, attitudes, and behavior that are not as accessible, at this scale, through any other offline platform. Using the vast data available through social media presents an opportunity for social science and public health methodologies to utilize computational methodologies to enhance and extend research and practice.
This study was successful in automating a complex five-category manual content analysis of e-cigarette-related content on Twitter using machine learning techniques. The study details machine learning model specifications that provided the best accuracy for data related to e-cigarettes, as well as a replicable methodology to allow extension of these methods to additional topics. ", issn="1438-8871", doi="10.2196/jmir.4392", url="//www.mybigtv.com/2015/8/e208/", url="https://doi.org/10.2196/jmir.4392", url="http://www.ncbi.nlm.nih.gov/pubmed/26307512" }