Netinfo Security, 2017, Vol. 17, Issue (1): 57-62. doi: 10.3969/j.issn.1671-1122.2017.01.009


Research of Weibo Short Text Classification Based on Word2vec

Qian ZHANG, Zhangmin GAO, Jiayong LIU

  1. College of Electronics and Information Engineering, Sichuan University, Chengdu, Sichuan 610065, China
  • Received: 2016-10-01 Online: 2017-01-20 Published: 2020-05-12
  • About the authors:

    Qian ZHANG (born 1987), male, from Guizhou, doctoral candidate; main research interests: network information security and data mining. Zhangmin GAO (born 1991), male, from Hubei, master's candidate; main research interests: data mining and machine learning. Jiayong LIU (born 1962), male, from Sichuan, professor, Ph.D.; main research interests: network data analysis and information security.

  • Supported by:
    Fund of the National Defense Key Laboratory of Secure Communication [9140C110401140C11053]

Abstract:

As the volume of information on Weibo and other social media expands rapidly, automatic classification of this content is urgently needed to help users quickly find the information they want and to filter out spam. Traditional text classification models suffer from the curse of dimensionality and carry no semantic features. To address these problems, this paper studies the classification of Weibo short texts based on the Word2vec model. Because Word2vec cannot distinguish how important each word in a text is, TF-IDF is introduced to weight the Word2vec word vectors, giving a weighted Word2vec classification model. Finally, the weighted Word2vec features are concatenated with the TF-IDF features. Experimental results show that the combined model, with stop words removed, achieves higher classification accuracy than both the weighted Word2vec model without stop words and the traditional TF-IDF model with or without stop words.

Key words: short text classification, Word2vec, TFIDF, support vector machine (SVM)
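The abstract describes two steps: weighting word vectors by TF-IDF, then concatenating the result with the TF-IDF features. The following is a minimal sketch of that idea, not the authors' implementation. Several things here are assumptions: the TF-IDF variant (`tf * (log(N/df) + 1)`), the use of a weighted average rather than a weighted sum, the function names, and the tiny hand-written `embeddings` dictionary, which stands in for vectors that would normally come from a trained Word2vec model.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return (vocab, per-document {word: tf-idf}) for tokenized documents.

    Assumed variant: tf = count / len(doc), idf = log(N / df) + 1.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    vocab = sorted(df)
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            w: (c / len(doc)) * (math.log(n_docs / df[w]) + 1.0)
            for w, c in counts.items()
        })
    return vocab, weights


def weighted_doc_vector(doc_weights, embeddings, dim):
    """TF-IDF-weighted average of word vectors; words without a vector are skipped."""
    total = [0.0] * dim
    weight_sum = 0.0
    for word, weight in doc_weights.items():
        vec = embeddings.get(word)
        if vec is None:
            continue
        for i in range(dim):
            total[i] += weight * vec[i]
        weight_sum += weight
    if weight_sum == 0.0:
        return total  # no known words: all-zero vector
    return [x / weight_sum for x in total]


def combined_features(docs, embeddings, dim):
    """Concatenate the weighted word-vector average with the TF-IDF vector."""
    vocab, weights = tfidf_weights(docs)
    features = []
    for w in weights:
        dense = weighted_doc_vector(w, embeddings, dim)
        sparse = [w.get(term, 0.0) for term in vocab]
        features.append(dense + sparse)
    return vocab, features


# Toy data (hypothetical): two tokenized posts and 2-dimensional "word vectors".
docs = [["weibo", "spam", "spam"], ["weibo", "news"]]
embeddings = {"weibo": [1.0, 0.0], "spam": [0.0, 1.0], "news": [1.0, 1.0]}
vocab, features = combined_features(docs, embeddings, dim=2)
```

Each row of `features` has `dim + len(vocab)` entries and could then be passed to an SVM classifier, for example scikit-learn's `sklearn.svm.LinearSVC`. Stop-word removal, which the abstract reports as beneficial, would be applied to `docs` before this step.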

CLC Number: