Netinfo Security (信息网络安全), 2021, Vol. 21, Issue (12): 118-125. doi: 10.3969/j.issn.1671-1122.2021.12.016

• Selected Papers •

Tor Anonymous Traffic Identification Method Based on Weighted Stacking Ensemble Learning

WANG Xirui, LU Tianliang, ZHANG Jianling, DING Meng

  1. College of Information Technology and Internet Security, People's Public Security University of China, Beijing 100038, China
  • Received: 2021-08-16 Online: 2021-12-10 Published: 2022-01-11
  • Contact: LU Tianliang E-mail: lutianliang@ppsuc.edu.cn
  • About the authors: WANG Xirui (b. 1998), male, from Jiangsu, master's student; research interests: network information security, network attack and defense | LU Tianliang (b. 1985), male, from Hebei, associate professor, Ph.D.; research interests: network information security, malicious code analysis and detection | ZHANG Jianling (b. 1965), male, from Hebei, associate professor, M.S.; research interests: computer science and technology, artificial intelligence | DING Meng (b. 1980), male, from Beijing, associate professor, M.S.; research interests: examination of electronic evidence
  • Supported by:
    Fundamental Research Funds of People's Public Security University of China, Special Research on New Types of Crime (2021XXFZ003)

Abstract:

The Tor network is often used by criminals to carry out illegal activities, so identifying Tor traffic efficiently is important for network supervision and for fighting crime. To address the sparsity of Tor traffic in real environments and the low identification accuracy that comes with it, this paper proposes a Tor traffic identification method based on a weighted Stacking model, following the idea of ensemble learning. Time-related features are extracted at the flow level, and the 14 features with the highest information gain are selected to form the input data set. KNN, SVM, and XGBoost are each improved with a different weighting scheme and used as base learners, and XGBoost serves as the meta-learner of a two-layer Stacking model. In a comparison with 10 other algorithms on a public data set, the proposed model achieves higher accuracy than most of them and a lower false negative rate, which better matches the requirements of Tor traffic identification in real network environments.

Key words: anonymous network, Tor, unbalanced data, Stacking
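The feature-selection step described in the abstract (ranking features by information gain and keeping the top 14) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how continuous features were discretized, so equal-width binning is an assumption here, and the feature names (`mean_iat`, `flow_duration`, `idle_max`), the toy flows, and the helper names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels, bins=10):
    """IG(Y; X) = H(Y) - H(Y | X) for one numeric feature X.

    Assumption: X is discretized into equal-width bins; the paper's
    actual discretization is not stated in the abstract.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # constant feature: single bin
    groups = {}
    for value, label in zip(values, labels):
        index = min(int((value - lo) / width), bins - 1)
        groups.setdefault(index, []).append(label)
    conditional = sum(len(g) / len(labels) * entropy(g)
                      for g in groups.values())
    return entropy(labels) - conditional


def top_k_features(rows, labels, names, k=14):
    """Return the k feature names with the highest information gain."""
    scores = {name: information_gain([row[i] for row in rows], labels)
              for i, name in enumerate(names)}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy flows: (mean_iat, flow_duration, idle_max)
names = ["mean_iat", "flow_duration", "idle_max"]
rows = [
    (0.90, 5.0, 0.3),
    (0.80, 1.0, 0.7),
    (0.95, 4.0, 0.3),
    (0.10, 5.0, 0.7),
    (0.20, 1.0, 0.3),
    (0.15, 4.0, 0.7),
]
labels = ["tor", "tor", "tor", "nonTor", "nonTor", "nonTor"]

selected = top_k_features(rows, labels, names, k=2)
print(selected)
```

With real flow records, `k=14` would reproduce the feature count reported in the abstract. The selected columns would then feed the weighted base learners (KNN, SVM, XGBoost) and the XGBoost meta-learner, which this sketch does not cover.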

CLC number: