Netinfo Security ›› 2018, Vol. 18 ›› Issue (1): 58-66. doi: 10.3969/j.issn.1671-1122.2018.01.009

• Original Article •

Research and Implementation on Parallel Crawl Method for Source Code Based on MapReduce

Junyan MA1,2, Guosun ZENG1

  1. Department of Computer Science and Technology, Tongji University, Shanghai 200092, China
  2. Tongji Branch, National Engineering & Technology Center of High Performance Computer, Shanghai 200092, China
  • Received: 2017-10-31 Online: 2018-01-20 Published: 2020-05-11

Abstract:

As the volume of open source code grows, reusing existing code has become a common practice in software development. To search open source code quickly and accurately, this paper proposes a MapReduce-based method for crawling source code in parallel. First, current open source code repositories are analyzed to select suitable sites and crawl objects, and the overall parallel crawling process and its targets are defined. Second, the Map and Reduce methods are designed according to the characteristics of source code, and a parallel source code crawling algorithm is proposed. Finally, parallel crawling of open source code is implemented in a cluster computing environment. Experiments show that, compared with the traditional method, multi-machine parallel crawling is markedly faster and yields more reliable search results.
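
The abstract does not spell out the Map and Reduce designs, so the following is only an illustrative sketch of the general map/reduce crawling pattern, not the authors' algorithm. It assumes a Map step that fetches one source file and emits a (content hash, record) pair, and a Reduce step that groups pairs by hash and keeps one record per group. The names `FAKE_SITE`, `fetch`, `map_crawl`, `reduce_dedup`, and `parallel_crawl` are hypothetical, an in-memory dictionary stands in for a real code-hosting site, and a thread pool stands in for a cluster.

```python
import hashlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a code-hosting site: URL -> (language, source text).
FAKE_SITE = {
    "repo/a/sort.py": ("python", "def sort(xs): return sorted(xs)"),
    "repo/b/sort.py": ("python", "def sort(xs): return sorted(xs)"),  # same content
    "repo/a/Main.java": ("java", "class Main {}"),
}


def fetch(url):
    """Pretend to download one source file (assumption: replaces a real HTTP request)."""
    return FAKE_SITE[url]


def map_crawl(url):
    """Map step: fetch one URL and emit a (content_hash, record) pair."""
    language, text = fetch(url)
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return digest, {"url": url, "language": language, "text": text}


def reduce_dedup(pairs):
    """Reduce step: group records by content hash and keep one record per group."""
    groups = defaultdict(list)
    for key, record in pairs:
        groups[key].append(record)
    # Keep the record with the smallest URL so the result is deterministic.
    return [min(records, key=lambda r: r["url"]) for records in groups.values()]


def parallel_crawl(urls, workers=4):
    """Run the map step on a thread pool, then reduce the collected pairs."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pairs = list(pool.map(map_crawl, urls))
    return sorted(reduce_dedup(pairs), key=lambda r: r["url"])


if __name__ == "__main__":
    for record in parallel_crawl(list(FAKE_SITE)):
        print(record["language"], record["url"])
```

Keying the Map output by a content hash is one simple way to let the Reduce step discard duplicate files fetched by different workers; a real deployment would replace `fetch` with network requests and the thread pool with a MapReduce framework.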

Key words: software engineering, source code searching, parallel crawling, MapReduce, open source code

CLC Number: