Netinfo Security ›› 2015, Vol. 15 ›› Issue (5): 16-20. doi: 10.3969/j.issn.1671-1122.2015.05.003

A Text Similarity Evaluation Algorithm for Structured Data of Institutional Repository

WU Xu1,2,3, GUO Fang-yu1,2, XIE Xia-qing3, XU Jin1,2

  1. School of Computer Science, Beijing University of Posts and Telecommunications, Beijing 100876, China
  2. Key Laboratory of Trustworthy Distributed Computing and Service (BUPT), Ministry of Education, Beijing 100876, China
  3. Beijing University of Posts and Telecommunications Library, Beijing 100876, China
  • Received: 2015-04-10 Online: 2015-05-10 Published: 2018-07-16

Abstract:

An institutional repository holds the digital output that members of an institution create in the course of their work. Its purpose is to collect, organize, preserve, retrieve and provide access to that output over a network. Its text data is structured and discrete. Personalized recommendation technology can effectively improve the visibility and utilization of institutional repositories by shifting the existing "user-driven" paradigm to a "knowledge-driven" one, so that repository users can access academic information more efficiently. To this end, existing similarity measures from both domestic and international research are studied, and the idea that words with different weights contribute differently to the overall similarity is introduced. A text similarity evaluation algorithm based on TF-IDF and word matching is presented. It filters invalid data by analyzing the DC (Dublin Core) metadata format, calculates the weights of words in a specified domain, and counts the number of matching words. The text similarity is then computed with normalization by text length. The feasibility of the algorithm is validated on manually created experimental data, and the results show that it calculates the similarity of structured text data reasonably.
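
The abstract names the steps of the method (filter by Dublin Core field, weight words, count matches, normalize by length) but gives no formulas. The sketch below is one possible reading of those steps and is not the authors' implementation. The field list `TEXT_FIELDS`, the smoothed IDF formula, whitespace tokenization, and the Dice-style normalization in `similarity` are all assumptions made for illustration.

```python
import math
from collections import Counter

# Assumption: only these Dublin Core fields carry text worth comparing;
# all other fields (creator, date, identifier, ...) are filtered out.
TEXT_FIELDS = ("title", "subject", "description")


def tokenize(record):
    """Keep the chosen DC fields and split them into lowercase words."""
    words = []
    for field in TEXT_FIELDS:
        words.extend(record.get(field, "").lower().split())
    return words


def idf_weights(token_lists):
    """Smoothed inverse document frequency for every word in the corpus."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each word once per record
    return {w: math.log((1 + n_docs) / (1 + df)) + 1.0
            for w, df in doc_freq.items()}


def similarity(tokens_a, tokens_b, idf):
    """Weighted count of matched words, normalized by weighted text length."""
    count_a, count_b = Counter(tokens_a), Counter(tokens_b)
    shared = count_a.keys() & count_b.keys()
    matched = sum(min(count_a[w], count_b[w]) * idf.get(w, 1.0) for w in shared)
    len_a = sum(c * idf.get(w, 1.0) for w, c in count_a.items())
    len_b = sum(c * idf.get(w, 1.0) for w, c in count_b.items())
    if len_a == 0 or len_b == 0:
        return 0.0  # a record with no usable text matches nothing
    return 2.0 * matched / (len_a + len_b)


# Hypothetical records, made up for this example.
records = [
    {"title": "text similarity for institutional repository", "creator": "A"},
    {"title": "text similarity for institutional repository", "creator": "B"},
    {"title": "text similarity with word matching", "creator": "C"},
    {"title": "deep network training", "creator": "A"},
]
tokens = [tokenize(r) for r in records]
idf = idf_weights(tokens)
scores = [similarity(tokens[0], t, idf) for t in tokens[1:]]
```

Under these assumptions the score lies in [0, 1]: records whose retained fields are identical score 1 even if filtered fields such as `creator` differ, records with no shared words score 0, and rarer shared words raise the score more than common ones.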

Key words: institutional repository, discrete data, structured data, word matching, TF-IDF, text similarity

CLC Number: