A Text Similarity Evaluation Algorithm for Structured Data of Institutional Repository

doi:10.3969/j.issn.1671-1122.2015.05.003

Netinfo Security, 2015, Vol. 15, Issue (5): 16-20.

WU Xu^1,2,3, GUO Fang-yu^1,2, XIE Xia-qing^3, XU Jin^1,2

1. School of Computer Science, Beijing University of Posts and Telecommunications, Beijing 100876, China
2. Key Laboratory of Trustworthy Distributed Computing and Service (BUPT), Ministry of Education, Beijing 100876, China
3. Beijing University of Posts and Telecommunications Library, Beijing 100876, China

Received: 2015-04-10 Online: 2015-05-10 Published: 2018-07-16

About the authors: WU Xu (born 1963), female, from Jilin, research fellow, master's degree; main research interests: service science and information technology. GUO Fang-yu (born 1990), female, from Hebei, master's student; main research interests: information technology and trusted computing. XIE Xia-qing (born 1988), female, from Shanxi, assistant librarian, master's degree; main research interests: service science and information technology. XU Jin (born 1990), male, from Shandong, doctoral student; main research interests: information technology and trusted computing.

Supported by: National High Technology Research and Development Program of China [2012AA01A404]; Phase III Project of the Information Resource Support System of the Ministry of Education of China

Abstract

An institutional repository is a knowledge base whose content consists of the digital products created by members of an institution in the course of their work. It relies on the network and aims to collect, organize, preserve, retrieve, and provide access to these products. Its text data sets are mostly structured and discrete. Personalized recommendation technology can effectively improve the visibility and utilization of institutional repository resources, shifting the existing "user-driven" paradigm to a "knowledge-driven" one so that repository users can obtain academic information more efficiently. To this end, building on a study of existing similarity measures from China and abroad, the paper introduces the idea that words with different weights affect overall similarity differently, and proposes a text similarity evaluation algorithm based on TF-IDF and word matching. The algorithm analyzes the DC (Dublin Core) metadata format, keeps the valid data, calculates the weight of specific words in specified fields, counts the number of matches, and computes text similarity after normalizing for text length. The experiments compute similarity on a manually built text test set. Statistical analysis shows that the algorithm computes the similarity of structured, discrete text data reasonably and reduces the vector dimension needed when computing similarity over discrete institutional repository data sets. The results agree well with the actual data, which indicates that the algorithm is feasible and of practical value.

Key words: institutional repository, discrete data, structured data, word matching, TF-IDF, text similarity
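The steps named in the abstract (select valid Dublin Core fields, weight words with TF-IDF, count matches, normalize by text length) can be illustrated with a minimal Python sketch. This is not the authors' formula, which the abstract does not give. The field list `VALID_FIELDS`, whitespace tokenization, the smoothed IDF variant, and normalization by the geometric mean of the two weighted lengths are all assumptions made for illustration. Chinese text would additionally need a word segmenter in place of `split()`.

```python
import math
from collections import Counter

# Assumption: which Dublin Core elements count as "valid" text fields.
VALID_FIELDS = ("title", "subject", "description")


def tokens(record):
    """Collect lowercase, whitespace-separated tokens from the selected fields."""
    words = []
    for field in VALID_FIELDS:
        words.extend(record.get(field, "").lower().split())
    return words


def idf_table(docs):
    """Smoothed IDF, log((1 + N) / (1 + df)) + 1, for every word in the corpus."""
    n = len(docs)
    df = Counter(word for doc in docs for word in set(doc))
    return {w: math.log((1 + n) / (1 + count)) + 1.0 for w, count in df.items()}


def weighted_length(counts, idf):
    """Text length measured as the sum of IDF weights over all tokens."""
    return sum(n * idf.get(w, 1.0) for w, n in counts.items())


def similarity(a, b, idf):
    """IDF-weighted count of matched words, normalized by both text lengths.

    Each shared word contributes min(count in a, count in b) matches,
    scaled by its IDF weight. Dividing by the geometric mean of the two
    weighted lengths keeps the score in [0, 1].
    """
    if not a or not b:
        return 0.0
    ca, cb = Counter(a), Counter(b)
    matched = sum(min(ca[w], cb[w]) * idf.get(w, 1.0) for w in ca.keys() & cb.keys())
    return matched / math.sqrt(weighted_length(ca, idf) * weighted_length(cb, idf))
```

Because only words that actually occur in both records are compared, the sketch never builds a vector over the full vocabulary, which is in the spirit of the dimension reduction the abstract mentions. Rare words carry a higher IDF weight, so a match on a rare term raises the score more than a match on a common one.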

CLC number: TP309

WU Xu, GUO Fang-yu, XIE Xia-qing, XU Jin. A Text Similarity Evaluation Algorithm for Structured Data of Institutional Repository[J]. Netinfo Security, 2015, 15(5): 16-20.

