Netinfo Security (信息网络安全) ›› 2015, Vol. 15 ›› Issue (5): 16-20. doi: 10.3969/j.issn.1671-1122.2015.05.003

• Technical Research •

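  • About the authors:
    WU Xu (b. 1963), female, from Jilin, research fellow, master's degree; main research interests: service science, and information science and technology. GUO Fang-yu (b. 1990), female, from Hebei, master's student; main research interests: information science and technology, and trusted computing. XIE Xia-qing (b. 1988), female, from Shanxi, assistant librarian, master's degree; main research interests: service science, and information science and technology. XU Jin (b. 1990), male, from Shandong, Ph.D. candidate; main research interests: information science and technology, and trusted computing.

  • Funding:
    National High Technology Research and Development Program of China [2012AA01A404]; Phase III Project of the Information Resource Assurance System, Ministry of Education of China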

A Text Similarity Evaluation Algorithm for Structured Data of Institutional Repository

WU Xu (1,2,3), GUO Fang-yu (1,2), XIE Xia-qing (3), XU Jin (1,2)

  1. School of Computer Science, Beijing University of Posts and Telecommunications, Beijing 100876, China
  2. Key Laboratory of Trustworthy Distributed Computing and Service (BUPT), Ministry of Education, Beijing 100876, China
  3. Beijing University of Posts and Telecommunications Library, Beijing 100876, China
  • Received: 2015-04-10   Online: 2015-05-10   Published: 2018-07-16

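Abstract (translated from the Chinese):

An institutional repository is a knowledge base built on the digital products of all kinds that the members of an institution create in the course of their work. It relies on the network and exists to collect, organize, preserve, retrieve, and provide access to that material. Its text data sets are mostly structured and discrete. Personalized recommendation technology can effectively raise the exposure and utilization of institutional repository resources, shifting the existing "user-driven" model to a "knowledge-driven" one so that repository users can obtain academic information more efficiently. To this end, building on a study of existing similarity measures from China and abroad, this paper introduces the idea that words with different weights affect overall similarity differently and proposes a text similarity evaluation algorithm based on TF-IDF and word matching. The algorithm analyzes the DC (Dublin Core) metadata format, selects the valid data in it, computes the weight of specific words within specified fields, counts their matches, and computes text similarity on the basis of text-length normalization. Experiments computed similarity on a manually built text test set. Statistical analysis shows that the algorithm computes the similarity of structured, discrete text data reasonably and reduces the vector dimensionality involved in similarity computation over the discrete data sets of an institutional repository. The computed results agree well with the actual data, indicating that the algorithm is feasible and of practical value.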


Abstract:

An institutional repository holds the various digital products created by an institution's members in the course of their work. Supported by the network, its purpose is to collect, organize, preserve, retrieve, and provide access to these products. Its text data sets are structured and discrete. Personalized recommendation technology can effectively improve the visibility and utilization of institutional repositories, shifting the existing "user-driven" paradigm to a "knowledge-driven" one and allowing repository users to access academic information more efficiently. To this end, existing similarity measures from China and abroad are studied, and the idea that words with different weights have different effects on overall similarity is introduced. A text similarity evaluation algorithm based on TF-IDF and word matching is presented. It filters out invalid data by analyzing the DC (Dublin Core) metadata format, calculates the weight of certain words in a specified field, and counts the number of matches. Text similarity is then calculated on the basis of text-length normalization. The paper validates the feasibility of the algorithm on manually created experimental data and shows that it can calculate the similarity of structured text data reasonably.

Key words: institutional repository, discrete data, structured data, word matching, TF-IDF, text similarity
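
The abstract names the ingredients of the method (field selection from Dublin Core records, TF-IDF weights, word matching, length normalization) but not its formulas. The following is a minimal sketch of how those pieces could fit together, not the authors' algorithm. It assumes whitespace tokenization (Chinese text would need a word segmenter), a hypothetical set of text fields `DC_TEXT_FIELDS`, a smoothed IDF, and plain TF-IDF cosine similarity evaluated only over the words two records share. All function names are illustrative.

```python
import math
from collections import Counter

# Assumption: which Dublin Core fields count as "valid" text is a free choice here.
DC_TEXT_FIELDS = ("title", "subject", "description")


def tokens(record):
    """Lowercased whitespace tokens from the selected fields of one record."""
    words = []
    for field in DC_TEXT_FIELDS:
        words.extend(record.get(field, "").lower().split())
    return words


def idf_table(docs):
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def similarity(doc_a, doc_b, idf):
    """Cosine of TF-IDF weights; only words present in both records contribute."""
    if not doc_a or not doc_b:
        return 0.0
    tf_a, tf_b = Counter(doc_a), Counter(doc_b)

    def weight(tf, term, length):
        # Term frequency is divided by record length (length normalization).
        return tf[term] / length * idf.get(term, 1.0)

    matched = sum(
        weight(tf_a, t, len(doc_a)) * weight(tf_b, t, len(doc_b))
        for t in tf_a.keys() & tf_b.keys()
    )
    norm_a = math.sqrt(sum(weight(tf_a, t, len(doc_a)) ** 2 for t in tf_a))
    norm_b = math.sqrt(sum(weight(tf_b, t, len(doc_b)) ** 2 for t in tf_b))
    return matched / (norm_a * norm_b)


# Made-up records for illustration only.
records = [
    {"title": "text similarity with tf-idf", "subject": "information retrieval"},
    {"title": "tf-idf weighting for text similarity", "subject": "retrieval"},
    {"title": "network intrusion detection", "subject": "security"},
]
docs = [tokens(r) for r in records]
idf = idf_table(docs)
print(similarity(docs[0], docs[1], idf))  # records sharing several words
print(similarity(docs[0], docs[2], idf))  # records sharing no words
```

Because the sum runs only over shared words, the cost of comparing two short metadata records depends on their vocabulary overlap rather than on the size of the full corpus vocabulary, which is one plausible reading of the abstract's remark about reduced vector dimensionality.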

CLC number: