Netinfo Security (信息网络安全), 2017, Vol. 17, Issue (2): 12-21. doi: 10.3969/j.issn.1671-1122.2017.02.003


Research on User-Customized Topic Web Crawler Technology for Specialized Information Acquisition

Limin XUE1, Qi WU1,2, Jun LI1

  1. Department of Information, Naval Command College, Nanjing, Jiangsu 211800, China
  2. Navy Unit 92853, Xingcheng, Liaoning 125106, China
  • Received: 2016-11-28 Online: 2017-02-20 Published: 2020-05-12
  • About the authors:
    Limin XUE (薛丽敏, born 1968), female, from Shanxi, associate professor, master's degree, main research interest: information security; Qi WU (吴琦, born 1987), female, from Heilongjiang, engineer, master's degree, main research interest: information security; Jun LI (李骏, born 1979), male, from Jiangsu, engineer, bachelor's degree, main research interest: information security.
  • Funding:
    National Natural Science Foundation of China [11202239]

Abstract:

In the era of big data, the Internet has become an important venue for information collection across all industries. With network information resources growing explosively, screening out the required information quickly and efficiently is a pressing practical problem. Building an information screening mechanism between the Internet's massive data and the personnel who collect specialized information, tailored to their specific needs, can greatly improve the efficiency of specialized information acquisition. A topic Web crawler is the first stage that any Internet information acquisition method requires. To improve the accuracy of specialized information collection, this paper studies user-customized topic Web crawler technology for the public Internet. To address the difficulty of information screening in the big data era, the user's interest preferences are integrated into the crawling process of the topic Web crawler, which strengthens information screening. Experiments show that the proposed method improves precision.

Key words: big data, topic Web crawler, PageRank algorithm, behavior analysis, user customization
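
The abstract does not give the scoring formula, so the following is only an illustrative sketch of the general idea of folding user interest preferences into a topic crawler's link ordering, together with the precision measure used for evaluation. The function names, the term-overlap relevance measure, the `user_weights` dictionary, and the blending parameter `alpha` are all assumptions made for illustration and are not taken from the paper.

```python
import heapq


def precision(retrieved, relevant):
    """Fraction of retrieved pages that are actually relevant."""
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0
    return len(retrieved & set(relevant)) / len(retrieved)


def score(page_terms, topic_terms, user_weights, alpha=0.5):
    """Hypothetical priority: blend of topic overlap and user interest.

    alpha weights topic relevance; (1 - alpha) weights user interest.
    Both components are normalized to the range [0, 1].
    """
    terms = set(page_terms)
    topic = len(terms & set(topic_terms)) / len(topic_terms) if topic_terms else 0.0
    total = sum(user_weights.values())
    interest = sum(user_weights.get(t, 0.0) for t in terms) / total if total else 0.0
    return alpha * topic + (1 - alpha) * interest


def rank_frontier(candidates, topic_terms, user_weights, alpha=0.5):
    """Return candidate URLs ordered from highest to lowest priority.

    candidates maps each URL to the list of terms describing it.
    Scores are negated because heapq implements a min-heap.
    """
    heap = [(-score(terms, topic_terms, user_weights, alpha), url)
            for url, terms in candidates.items()]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


if __name__ == "__main__":
    candidates = {
        "a": ["crawler", "security"],
        "b": ["sports"],
        "c": ["crawler"],
    }
    order = rank_frontier(candidates, ["crawler", "security"],
                          {"security": 2.0, "crawler": 1.0})
    print(order)
    print(precision(order[:2], ["a", "c"]))
```

In this sketch, raising `alpha` makes the ordering depend more on topic overlap, while lowering it lets the user's weighted interests dominate; a real crawler would likely use a richer relevance model, such as the PageRank-based approach named in the key words.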

CLC number: