IoT Device Identification Method Based on Pre-Trained Transformers

doi:10.3969/j.issn.1671-1122.2024.08.013

Abstract

Efficient IoT device identification helps network administrators quickly isolate anomalous or vulnerable IoT devices on a local area network, preventing attackers from exploiting device vulnerabilities to enter the internal network, lie dormant, and launch deeper follow-up attacks. However, existing machine learning-based identification methods generally suffer from a complex feature selection process and unstable flow features, which reduce identification accuracy. This paper therefore proposes an IoT device identification method based on pre-trained Transformers, which processes device traffic with the IoTBERT model. IoTBERT consists of two core components, a pre-training unit and a device identification unit. The pre-training unit trains an ALBERT model on unlabeled IoT device traffic, encoding data features into high-dimensional feature vectors to obtain a traffic feature representation model. The device identification unit fine-tunes the parameter weights of the pre-trained model with labeled data and combines it with a residual network to identify IoT devices at the packet level. The method learns traffic feature representations and makes classification decisions automatically, without hand-crafted feature engineering or a manually built multi-stage pipeline; it maps the encoding of raw packets directly to class labels, achieving end-to-end IoT device identification. Experiments on the public datasets Aalto, UNSW and CIC IoT show that the method identifies IoT devices effectively from individual packets, with average identification accuracies of 97.2%, 92.1% and 99.8%, respectively.

Key words: Internet of Things, device identification, representation learning, pre-trained model
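
The abstract says that raw packets are encoded and mapped directly to class labels, and the comparison table lists the IoTBERT feature type as a hexadecimal encoding of packets. The following is a minimal sketch of what such an encoding step could look like, not the authors' tokenizer. The one-token-per-byte scheme, the special tokens [CLS], [SEP] and [PAD], and the function name packet_to_tokens are assumptions made for illustration; only the sequence length of 128 comes from the parameter table.

```python
from typing import List

SEQ_LENGTH = 128  # seq_length from the parameter table; everything else here is illustrative


def packet_to_tokens(packet: bytes, seq_length: int = SEQ_LENGTH) -> List[str]:
    """Encode raw packet bytes as a fixed-length list of hex tokens (hypothetical scheme)."""
    # One two-character hex token per byte, e.g. 0x45 -> "45".
    hex_tokens = [f"{b:02x}" for b in packet]
    # Truncate so that the two special tokens still fit.
    body = hex_tokens[: seq_length - 2]
    tokens = ["[CLS]"] + body + ["[SEP]"]
    # Pad short packets up to the fixed sequence length.
    tokens += ["[PAD]"] * (seq_length - len(tokens))
    return tokens


sample = bytes.fromhex("4500003c1c46")  # first bytes of a made-up IPv4 header
print(packet_to_tokens(sample)[:8])
# ['[CLS]', '45', '00', '00', '3c', '1c', '46', '[SEP]']
```

A token list of this kind would then be converted to vocabulary ids before being fed to the pre-trained encoder; a real system might use a subword vocabulary instead of single bytes.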

CLC number: TP309

XING Changyou, WANG Zipeng, ZHANG Guomin, DING Ke. IoT Device Identification Method Based on Pre-Trained Transformers[J]. Netinfo Security, 2024, 24(8): 1277-1290.

References (28)

[1]	ROSEN M. Driving the Digital Agenda Requires Strategic Architecture[EB/OL]. (2015-04-22)[2024-02-11]. https://idc-cema.com/dwn/SF_177701.
[2]	Fortune Business Insights. IoT Market Size, Growth IoT Industry Report 2026[EB/OL]. (2019-01-01)[2024-02-11]. https://www.fortunebusinessinsights.com/industry-reports/internet-of-things-iot-market-100307.
[3]	FENG Guangsheng, JIANG Shunpeng, HU Xianlang, et al. New Research Progress on Intrusion Detection Techniques for the Internet of Things[J]. Netinfo Security, 2024, 24(2): 167-178.
	冯光升, 蒋舜鹏, 胡先浪, 等. 面向物联网的入侵检测技术研究新进展[J]. 信息网络安全, 2024, 24(2): 167-178.
[4]	ZHANG Xiaolu, UPTON O, BEEBE N L, et al. IoT Botnet Forensics: A Comprehensive Digital Forensic Case Study on Mirai Botnet Servers[EB/OL]. (2020-04-12)[2024-02-11]. https://www.sciencedirect.com/science/article/pii/S2666281720300214.
[5]	LIU Yongxin, WANG Jian, LI Jianqiang, et al. Machine Learning for the Detection and Identification of Internet of Things Devices: A Survey[J]. IEEE Internet of Things Journal, 2022, 9(1): 298-320.
[6]	WU Hua, WU Qiuyan, CHENG Guang, et al. SFIM: Identify User Behavior Based on Stable Features[J]. Peer-to-Peer Networking and Applications, 2021, 14(6): 3674-3687.
[7]	ZHANG Haozhen, YU Le, XIAO Xi, et al. TFE-GNN: A Temporal Fusion Encoder Using Graph Neural Networks for Fine-Grained Encrypted Traffic Classification[C]// ACM. Proceedings of the ACM Web Conference 2023. New York: ACM, 2023: 2066-2075.
[8]	ZENG Jun, WANG Ziwei, YU Yang, et al. Word Embedding Methods in Natural Language Processing: A Review[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(1): 24-43.
	曾骏, 王子威, 于扬, 等. 自然语言处理领域中的词嵌入方法综述[J]. 计算机科学与探索, 2024, 18(1): 24-43. doi: 10.3778/j.issn.1673-9418.2303056
[9]	DEVLIN J, CHANG Mingwei, LEE K, et al. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding[EB/OL]. (2018-10-12)[2024-02-11]. https://arxiv.org/abs/1810.04805v2.
[10]	LAN Zhenzhong, CHEN Mingda, GOODMAN S, et al. ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations[EB/OL]. (2019-09-26)[2024-02-11]. https://arxiv.org/abs/1909.11942v6.
[11]	SENGUPTA S, GANGULY N, DE P, et al. Exploiting Diversity in Android TLS Implementations for Mobile App Traffic Classification[C]// ACM. The World Wide Web Conference. New York: ACM, 2019: 1657-1668.
[12]	HE Hongye, YANG Zhiguo, CHEN Xiangning. PERT: Payload Encoding Representation from Transformer for Encrypted Traffic Classification[C]// IEEE. 2020 ITU Kaleidoscope: Industry-Driven Digital Transformation (ITU K). New York: IEEE, 2020: 111-118.
[13]	LIN Xinjie, XIONG Gang, GOU Gaopeng, et al. ET-BERT: A Contextualized Datagram Representation with Pre-Training Transformers for Encrypted Traffic Classification[C]// ACM. Proceedings of the ACM Web Conference 2022. New York: ACM, 2022: 633-642.
[14]	MIETTINEN M, MARCHAL S, HAFEEZ I, et al. IoT SENTINEL: Automated Device-Type Identification for Security Enforcement in IoT[C]// IEEE. 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS). New York: IEEE, 2017: 2177-2184.
[15]	HAMAD S A, ZHANG W E, SHENG Q Z, et al. IoT Device Identification via Network-Flow Based Fingerprinting and Learning[C]// IEEE. 2019 18th IEEE International Conference on Trust, Security and Privacy in Computing and Communications/13th IEEE International Conference on Big Data Science and Engineering (TrustCom/BigDataSE). New York: IEEE, 2019: 103-111.
[16]	SIVANATHAN A, GHARAKHEILI H H, LOI F, et al. Classifying IoT Devices in Smart Environments Using Network Traffic Characteristics[J]. IEEE Transactions on Mobile Computing, 2019, 18(8): 1745-1759.
[17]	YIN Feihong, YANG Li, WANG Yuchen, et al. IoT ETEI: End-to-End IoT Device Identification Method[C]// IEEE. 2021 IEEE Conference on Dependable and Secure Computing (DSC). New York: IEEE, 2021: 1-8.
[18]	KOSTAS K, JUST M, LONES M A. IoTDevID: A Behavior-Based Device Identification Method for the IoT[J]. IEEE Internet of Things Journal, 2022, 9(23): 23741-23749.
[19]	LI Zhihua, WANG Zhihao. IoT Device Identification Method Based on LCNN and LSTM Hybrid Structure[J]. Netinfo Security, 2023, 23(6): 43-54.
	李志华, 王志豪. 基于LCNN和LSTM混合结构的物联网设备识别方法[J]. 信息网络安全, 2023, 23(6): 43-54.
[20]	LUO Yantian, CHEN Xu, GE Ning, et al. Transformer-Based Device-Type Identification in Heterogeneous IoT Traffic[J]. IEEE Internet of Things Journal, 2023, 10(6): 5050-5062.
[21]	WANG Juan, ZHONG Jing, LI Jiangqi. IoT-Portrait: Automatically Identifying IoT Devices via Transformer with Incremental Learning[J]. Future Internet, 2023, 15(3): 102-115.
[22]	PENG Chuang, TAN Xiaobin, XIE Peng, et al. IoT Device Identification Base on Inter-Flow Correlation Analysis Using Graph Neural Network[C]// IEEE. 2023 9th International Conference on Big Data Computing and Communications (BigCom). New York: IEEE, 2023: 24-31.
[23]	REZAEI S, LIU Xin. Deep Learning for Encrypted Traffic Classification: An Overview[J]. IEEE Communications Magazine, 2019, 57(5): 76-81. doi: 10.1109/MCOM.2019.1800819
[24]	SENNRICH R, HADDOW B, BIRCH A. Neural Machine Translation of Rare Words with Subword Units[EB/OL]. (2015-08-31)[2024-02-11]. https://arxiv.org/abs/1508.07909v5.
[25]	HWANG R H, PENG Mingchun, NGUYEN V L, et al. An LSTM Based Deep Learning Approach for Classifying Malicious Traffic at the Packet Level[EB/OL]. (2019-08-19)[2024-02-11]. https://doi.org/10.3390/app9163414.
[26]	HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep Residual Learning for Image Recognition[C]// IEEE. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2016: 770-778.
[27]	DADKHAH S, MAHDIKHANI H, DANSO P K, et al. Towards the Development of a Realistic Multidimensional IoT Profiling Dataset[C]// IEEE. 2022 19th Annual International Conference on Privacy, Security & Trust (PST). New York: IEEE, 2022: 1-11.
[28]	LOSHCHILOV I, HUTTER F. Decoupled Weight Decay Regularization[EB/OL]. (2017-11-14)[2024-02-11]. https://arxiv.org/abs/1711.05101v3.

Parameter	Value
hidden_size	768
layers_num	12
heads_num	12
feedforward_size	3072
seq_length	128
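
The parameter values above can be collected in a small configuration object and checked for internal consistency. This is only a sketch: the class name EncoderConfig and the method head_dim are hypothetical and do not correspond to any particular library's API; the five values are taken from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    """Hypothetical container for the encoder hyperparameters listed in the table."""
    hidden_size: int = 768
    layers_num: int = 12
    heads_num: int = 12
    feedforward_size: int = 3072
    seq_length: int = 128

    def head_dim(self) -> int:
        # Multi-head attention splits the hidden vector evenly across heads.
        if self.hidden_size % self.heads_num != 0:
            raise ValueError("hidden_size must be divisible by heads_num")
        return self.hidden_size // self.heads_num


cfg = EncoderConfig()
print(cfg.head_dim())                           # 64
print(cfg.feedforward_size // cfg.hidden_size)  # 4
```

With these values each attention head works on 64 dimensions and the feed-forward layer is four times the hidden size, which is the usual proportion in BERT-base-sized encoders.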

Dataset	Device name	Precision	Recall	F1-score
Aalto	Aria	94.1%	94.1%	0.941
	D-LinkCam	98.5%	95.0%	0.967
	D-LinkDayCam	95.4%	81.9%	0.881
	D-LinkDoorSensor	99.9%	99.7%	0.998
	D-LinkHomeHub	97.9%	96.1%	0.970
	D-LinkSensor	96.1%	96.1%	0.961
	D-LinkSiren	90.0%	97.1%	0.934
	D-LinkSwitch	98.2%	94.9%	0.965
	D-LinkWaterSensor	97.2%	95.7%	0.965
	EdimaxCam1	93.3%	85.2%	0.891
	EdimaxCam2	83.3%	86.2%	0.847
	EdnetCam1	68.7%	82.5%	0.750
	EdnetCam2	75.6%	87.2%	0.810
	EdimaxPlug1101W	92.9%	92.6%	0.928
	EdimaxPlug2101W	95.8%	87.2%	0.913
	EdnetGateway	90.8%	94.4%	0.926
	HueBridge	98.9%	99.9%	0.994
	HueSwitch	99.9%	99.5%	0.997
	iKettle2	45.7%	50.0%	0.478
	Lightify	99.1%	98.3%	0.987
	MAXGateway	98.1%	100%	0.991
	SmarterCoffee	60.9%	53.8%	0.571
	TP-LinkPlugHS100	95.5%	95.9%	0.957
	TP-LinkPlugHS110	92.4%	93.6%	0.930
	WeMoInsightSwitch	94.6%	98.0%	0.962
	WeMoInsightSwitch2	99.2%	98.6%	0.989
	WeMoLink	99.2%	99.0%	0.991
	WeMoSwitch	99.3%	98.5%	0.989
	WeMoSwitch2	95.3%	93.5%	0.944
	Withings	90.3%	99.1%	0.945
	Home Matic Plug	100%	100%	1
UNSW	Amazon Echo	99.8%	99.1%	0.995
	Belkin Wemo Motion Sensor	77.3%	57.3%	0.658
	Belkin Wemo Switch	67.5%	91.7%	0.778
	Blipcare Blood Pressure Meter	85.7%	54.5%	0.667
	Dropcam	100%	100%	1
	HP Printer	97.9%	100%	0.989
	iHome	98.4%	47.7%	0.642
	Insteon Camera	99.8%	99.9%	0.998
	Light Bulbs LiFX Smart Bulb	97.8%	91.9%	0.947
	NEST Protect Smoke Alarm	100%	93.2%	0.965
	Netatmo Weather Station	78.4%	92.8%	0.850
	Netatmo Welcome	99.9%	97.6%	0.988
	PIX-STAR Photo-Frame	92.6%	92.4%	0.925
	Samsung SmartCam	99.7%	89.5%	0.943
	Smart Things	99.7%	99.7%	0.997
	TP-Link Cloud Camera	99.5%	95.4%	0.974
	TP-Link Smart Plug	87.7%	86.1%	0.869
	Triby Speaker	85.7%	82.8%	0.842
	Withings Smart Sleep Sensor	99.7%	98.8%	0.992
	Withings Smart Baby Monitor	99.8%	99.9%	0.999
	Withings Smart Scale	100%	97.6%	0.988
	Non-IoT	99.4%	97.6%	0.985
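
The F1-score column in the table above can be related to the precision and recall columns through the standard harmonic-mean definition, F1 = 2PR / (P + R). The sketch below recomputes it for one row. The function name f1_score is illustrative, and because the published precision and recall are themselves rounded, a recomputed value may differ from the table in the last digit for some rows.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0  # convention when both precision and recall are zero
    return 2 * precision * recall / (precision + recall)


# D-LinkDayCam row: precision 95.4%, recall 81.9%
print(round(f1_score(0.954, 0.819), 3))  # 0.881

# A device with perfect precision and recall has an F1-score of exactly 1
print(f1_score(1.0, 1.0))  # 1.0
```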

Device name	CIC IoT Dataset (idle scenario)			CIC IoT Dataset (active scenario)
	Precision	Recall	F1-score	Precision	Recall	F1-score
Amazon Alexa Echo Dot	99.9%	99.8%	0.998	99.7%	99.7%	0.997
Amazon Alexa Echo Spot	99.5%	99.8%	0.996	99.5%	99.5%	0.995
Amazon Alexa Echo Studio	99.8%	99.7%	0.997	99.7%	99.4%	0.996
Amazon Plug	100%	99.7%	0.998	99.8%	99.8%	0.998
AMCREST Wi-Fi Camera	100%	100%	1	100%	99.9%	1
Arlo Base Station	99.9%	100%	1	99.8%	99.9%	0.999
Arlo Q Camera	99.9%	99.6%	0.997	99.9%	99.9%	0.999
Atomi Coffee Maker	100%	100%	1	100%	100%	1
DCS8000LHA1 D-Link Mini Camera	100%	100%	1	100%	100%	1
Google Nest Mini	100%	100%	1	100%	100%	1
Gosund ESP_032979 Plug	100%	100%	1	100%	100%	1
HeimVision Smart Wi-Fi Camera	100%	99.9%	1	100%	100%	1
HeimVision SmartLife Radio Lamp	100%	100%	1	100%	100%	1
iRobot Roomba	100%	98.7%	0.994	100%	99.5%	0.997
Philips Hue Bridge	100%	100%	1	100%	100%	1
Sonos One Speaker	99.9%	99.9%	0.999	99.8%	100%	0.999

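Identification method	Feature type	Dataset	Metric	Result
IoTSentinel	Packet header fields	Aalto	Accuracy	81.5%
Method of Ref. [15]	Flow statistical features	Aalto	F1-score	0.903
IoTDevID	Packet header fields and payload	Aalto	Accuracy	83.3%
IoTDevID	Packet header fields and payload	UNSW	Accuracy	94.3%
Method of Ref. [16]	Flow statistical features and protocol fields	UNSW	Accuracy	98.4%
Method of Ref. [22]	Flow statistical features and inter-flow features	CIC IoT (idle scenario)	Accuracy	98.5%
Method of Ref. [22]	Flow statistical features and inter-flow features	CIC IoT (active scenario)	Accuracy	98.9%
IoTBERT	Hexadecimal encoding of packets	Aalto	Accuracy	97.2%
		Aalto	F1-score	0.915
		UNSW	Accuracy	92.1%
		CIC IoT (idle scenario)	Accuracy	99.9%
		CIC IoT (active scenario)	Accuracy	99.8%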