Netinfo Security, 2018, Vol. 18, Issue (1): 58-66. doi: 10.3969/j.issn.1671-1122.2018.01.009

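Research and Implementation on Parallel Crawl Method for Source Code Based on MapReduce

Junyan MA1,2, Guosun ZENG1

  1. Department of Computer Science and Technology, Tongji University, Shanghai 200092, China
  2. Tongji Branch, National Engineering & Technology Center of High Performance Computer, Shanghai 200092, China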
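  • Received: 2017-10-31 Online: 2018-01-20 Published: 2020-05-11
  • About the authors:
    Junyan MA (born 1993), female, from Shandong, is a master's student whose main research interests are software engineering, source code search, and parallel computing. Guosun ZENG (born 1964), male, from Jiangxi, is a professor with a Ph.D. whose main research interests are information management, parallel computing, and trusted software.

  • Supported by:
    National High Technology Research and Development Program of China (863 Program) [2009AA012201]; Tongji University Experimental Teaching Reform Fund [0800104214]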

Abstract:

As the volume of open source code on the Internet grows, finding and reusing existing code has become a trend in software engineering. To search open source code quickly and accurately, this paper designs a parallel source code crawling method based on the MapReduce computing model and implements a parallel source code crawling system. First, it analyzes current open source code repositories, selects suitable sites and objects to crawl, and defines the workflow and goals of parallel crawling. Second, it designs the Map and Reduce methods according to the characteristics of source code and, on that basis, proposes a parallel crawling algorithm for open source code. Finally, it carries out parallel crawling of open source code in a cluster computing environment. Experiments show that, compared with the traditional method, multi-machine parallel search of source code is markedly faster and returns more reliable results.

Key words: software engineering, source code searching, parallel crawling, MapReduce, open source code
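
The abstract names a Map method and a Reduce method but does not spell them out. As a rough illustration of how a crawl can be split into those two phases, the sketch below may help. It is not the authors' algorithm: the function names (`map_crawl`, `shuffle`, `reduce_merge`, `crawl`), the use of a content hash as the key, the in-memory `FAKE_WEB` table standing in for real HTTP requests, and the thread pool standing in for a cluster are all assumptions made for this example.

```python
import hashlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the network: file URL -> source text.
FAKE_WEB = {
    "https://example.org/repoA/sort.py": "def sort(xs): return sorted(xs)",
    "https://example.org/repoA/util.py": "def ident(x): return x",
    "https://example.org/repoB/sort.py": "def sort(xs): return sorted(xs)",
}


def fetch(url):
    """Pretend to download one source file (assumption: no real I/O)."""
    return FAKE_WEB[url]


def map_crawl(url, fetch_fn):
    """Map: fetch one URL and emit a (content_hash, (url, source)) pair."""
    source = fetch_fn(url)
    digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
    return [(digest, (url, source))]


def shuffle(pairs):
    """Group the emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_merge(digest, values):
    """Reduce: keep one copy of each distinct file and list every URL it came from."""
    urls = sorted(url for url, _ in values)
    return {"digest": digest, "urls": urls, "source": values[0][1]}


def crawl(urls, fetch_fn, workers=4):
    """Run the map phase on a thread pool, then shuffle and reduce locally."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = pool.map(lambda u: map_crawl(u, fetch_fn), urls)
        pairs = [pair for emitted in mapped for pair in emitted]
    groups = shuffle(pairs)
    return [reduce_merge(key, vals) for key, vals in sorted(groups.items())]


if __name__ == "__main__":
    for record in crawl(list(FAKE_WEB), fetch):
        print(record["urls"])
```

Keying on a content hash means identical files crawled from different repositories meet in the same reduce call, which is one simple way to deduplicate results. A real deployment would replace `fetch` with an HTTP client and the thread pool with a MapReduce framework.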

CLC number: