Netinfo Security ›› 2016, Vol. 16 ›› Issue (5): 64-70. doi: 10.3969/j.issn.1671-1122.2016.05.010

• Original Article •

Research on Short Text Representation Based on Sentential Semantic Components

Hai SHANG, Senlin LUO, Lei HAN, Ji ZHANG

  1. Information System and Security & Countermeasures Experimental Center, Beijing Institute of Technology, Beijing 100081, China
  • Received: 2016-03-14 Online: 2016-05-20 Published: 2020-05-13

Abstract:

With the development of the mobile Internet and information technology, short text data such as comments and microblog posts has grown explosively. Because short texts are sparse, an effective short text representation algorithm is needed to improve the results of text clustering and classification, hot event detection, public opinion analysis, and similar tasks. This paper proposes a short text representation algorithm based on sentential semantic components. Without changing the dimension of the feature space, the method uses sentential semantic components and a topic model to obtain semantically correlated words, and expands the short text with those words according to topic selection rules. This reduces the number of zero-valued dimensions in the feature vectors that represent the text. Short text classification experiments are carried out on the Sogou corpus. The results show that the accuracy of short text classification reaches 0.7958, which is better than that of other methods. In summary, by expanding short text with semantically correlated words, the proposed representation method effectively mitigates the sparseness problem and improves the performance of short text classification.
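
The idea of expanding a short text inside a fixed feature space can be illustrated with a toy sketch. It is not the authors' algorithm: the vocabulary, the `TOPIC_WORDS` table (a stand-in for a trained topic model), the overlap-based `select_topic` rule, and the `top_k` parameter are all hypothetical, and no sentential semantic component analysis is performed. The sketch only shows that adding topic-related words lowers the number of zero-valued dimensions while the vector length stays the same.

```python
from collections import Counter

# Hypothetical fixed vocabulary: this is the feature space, and its size never changes.
VOCAB = ["match", "goal", "team", "coach", "stock", "market", "price", "bank"]

# Hypothetical topic -> related-words table, standing in for a trained topic model.
TOPIC_WORDS = {
    "sports": ["match", "goal", "team", "coach"],
    "finance": ["stock", "market", "price", "bank"],
}


def vectorize(tokens):
    """Bag-of-words count vector over VOCAB (out-of-vocabulary tokens are ignored)."""
    counts = Counter(tokens)
    return [counts[word] for word in VOCAB]


def select_topic(tokens):
    """Assumed selection rule: the topic sharing the most words with the text, or None."""
    best_topic, best_overlap = None, 0
    for topic, words in TOPIC_WORDS.items():
        overlap = len(set(tokens) & set(words))
        if overlap > best_overlap:
            best_topic, best_overlap = topic, overlap
    return best_topic


def expand(tokens, top_k=2):
    """Append up to top_k words of the selected topic that the text does not already contain."""
    topic = select_topic(tokens)
    if topic is None:
        return list(tokens)
    extra = [w for w in TOPIC_WORDS[topic] if w not in tokens][:top_k]
    return list(tokens) + extra


def zero_dims(vector):
    """Number of zero-valued dimensions in a feature vector."""
    return sum(1 for value in vector if value == 0)


if __name__ == "__main__":
    short_text = ["team", "wins"]
    before = vectorize(short_text)
    after = vectorize(expand(short_text))
    print(len(before), zero_dims(before))
    print(len(after), zero_dims(after))
```

In this toy setting a text that matches no topic is left unchanged, which is one simple way to avoid adding unrelated words; the paper's actual topic selection rules are not described in the abstract.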

Key words: text representation, sentential semantic components, topic model, text classification

CLC Number: