基于最大信息系数的主成分分析差分隐私数据发布算法

doi:10.3969/j.issn.1671-1122.2020.02.006

信息网络安全 ›› 2020, Vol. 20 ›› Issue (2): 37-48.doi: 10.3969/j.issn.1671-1122.2020.02.006

基于最大信息系数的主成分分析差分隐私数据发布算法

彭长根^1,^2,³, 赵园园^1,³(), 樊玫玫¹

1.贵州大学数学与统计学院公共大数据国家重点实验室,贵阳 550025
2.贵州大学计算机科学与技术学院,贵阳 550025
3.贵州大学密码学与数据安全研究所,贵阳 550025

收稿日期:2019-10-27 出版日期:2020-02-10 发布日期:2020-05-11
作者简介:
作者简介：彭长根（1963—）,男,贵州,教授,博士,主要研究方向为密码学与大数据安全;赵园园（1993—）,女,河南,硕士研究生,主要研究方向为隐私保护与数据安全;樊玫玫（1981—）,女,贵州,讲师,硕士,主要研究方向为密码学与信息安全。
基金资助:
国家自然科学基金[U1836205, 61662009, 61772008, 11761020];贵州省科技重大专项[[2018]3001,[2018]3007,[2017]3002];贵州省科技支撑计划[[2019]2004,[2018]2162];贵州省科技基础研究计划[[2017]1045,[2019]1049];贵州财经大学科研基金[2017XJC01];贵州大学研究生创新基金[研理工2016068];贵州省研究生科研基金立项课题[KYJJ2017005]

A Differential Private Data Publishing Algorithm via Principal Component Analysis Based on Maximum Information Coefficient

PENG Changgen^1,^2,³, ZHAO Yuanyuan^1,³(), FAN Meimei¹

1. State Key Laboratory of Public Big Data, College of Mathematics and Statistics, Guizhou University, Guiyang 550025, China
2. College of Computer Science and Technology, Guizhou University, Guiyang 550025, China
3. Institute of Cryptography & Data Security, Guzihou University, Guiyang 550025, China

Received:2019-10-27 Online:2020-02-10 Published:2020-05-11

摘要/Abstract

摘要：

数据的私密性与可用性是隐私保护领域的核心问题。主成分分析差分隐私算法将高维数据的主成分降维与加噪融合,可有效对高维数据进行隐私保护并保持数据的高可用性。现有的主成分分析差分隐私保护算法依赖于皮尔逊相关系数,在隐私保护过程中仅能捕获高维数据的线性关系,且未考虑将差分预算在降维后的数据集上进行优化分配,导致算法普适性不足、数据效用不高。针对此问题,文章提出一种基于最大信息系数的主成分分析差分隐私（MIC-PCA-DP）数据发布算法。实验结果表明,文章所提出的隐私保护算法适用于线性关系、非线性关系、非函数依赖关系等数据关系的降维保持,其主成分降维可实现高效的原始数据信息承载;与经典的差分隐私算法以及PCA-based PPDP算法相比,在相同隐私保护强度下,对数据添加的噪声更小,能够有效保持数据的可用性。

关键词: 差分隐私, 主成分分析, 降维, 数据发布

Abstract:

The privacy and availability of data are important issues of privacy protection. Principal component analysis (PCA) differential privacy can effectively protect the privacy of high-dimensional data and maintain the high availability of data by integrating the dimensionality reduction and noise addition of the principal component of high-dimensional data. The existing principal component analysis differential privacy protection algorithm relies on Pearson correlation coefficient, which can only capture the linear relationship of high-dimensional data in the privacy protection process, and does not consider to optimize the allocation of difference budget on the data set after dimension reduction, resulting in insufficient applicability of the algorithm and low data utility. To this end, a differential private data publishing algorithm (MIC-PCA-DPPD) is proposed via principal component analysis based on maximum information coefficient. The experimental results show that the proposed privacy protection algorithm is suitable for dimensionality maintenance of linear relationships, nonlinear relationships, multi-function relationships, etc. Its principal component dimensionality reduction can achieve efficient raw data information bearing. Compared with the classic differential privacy algorithm and PCA-based PPDP algorithm, the noise added to the data is smaller and the data availability can be effectively maintained, under the same constraint of privacy protection intensity.

Key words: differential privacy, principal components analysis, dimension reduction, data publication

中图分类号:

TP309

彭长根, 赵园园, 樊玫玫. 基于最大信息系数的主成分分析差分隐私数据发布算法[J]. 信息网络安全, 2020, 20(2): 37-48.

PENG Changgen, ZHAO Yuanyuan, FAN Meimei. A Differential Private Data Publishing Algorithm via Principal Component Analysis Based on Maximum Information Coefficient[J]. Netinfo Security, 2020, 20(2): 37-48.

图/表 8

图1

图2

图3

图4

图5

图6

图7

图8

参考文献 25

[1]	DWORK C.Differential Privacy[C]//Springer. The 33rd International Colloquium on Automata, Languages, and Programming, July 9-16, 2006, Venice, Italy. Berlin: Spinger, 2006: 1-12.
[2]	DWORK C, LEI J.Differential Privacy and Robust Statistics[C]//SIGACT. 41st ACM Symposium on Theory of Computing(STOC 2009), May 31-June 2, 2009, Bethesda, Maryland. New York: ACM, 2009: 371-380.
[3]	HARDT M, ROTH A.Beating Randomized Response on Incoherent Matrices[C]//SIGACT. 44th ACM Symposium on Theory of Computing(STOC 2012), May 19-22, 2012, New York, USA. New York: ACM, 2012: 1255-1268.
[4]	DWORK C, ROTH A.The Algorithmic Foundations of Differential Privacy[J]. Foundations and Trends in Theoretical Computer Science, 2013, 9(3-4): 211-407.
[5]	BLUM A, DWORK C, MCSHERRY F, et al.Practical Privacy: The SULQ Framework[C]//SIGACT. 24th ACM Sigmod-Sigact-Sigart Symposium on Principles of Database Systems(PODS 2005), June 13-15, 2005, Baltimore, Maryland. New York: ACM, 2005: 128-138.
[6]	JIANG Xiacqiang, JI Zhanglong, WANG Shuang, et al.Differential-Private Data Publishing Through Component Analysis[J]. Transactions on Data Privacy, 2013, 6(1): 19-34.
[7]	DWORK C, TALWAR K, THAKURTA A, et al.Analyze Gauss: Optimal Bounds for Privacy-Preserving Principal Component Analysis[C]//SIGACT. 46th ACM Symposium on Theory of Computing(STOC 2014), May 31-June 3, 2014, New York, USA. New York: ACM, 2014: 11-20.
[8]	CHAUDHURI K, SARWATE A D, SINHA K.A Near-Optimal Algorithm for Differentially-Private Principal Components[J]. Journal of Machine Learning Research, 2013, 14(9): 2905-2943.
[9]	KAPRALOV M, TALWAR K.On Differentially Private Low Rank Approximation[C]//SIGACT. 24th Annual ACM-SIAM Symposium on Discrete Algorithms(SODA 2013), January 6-8, 2013, New Orleans, Los Angeles, USA. New York: ACM, 2013: 1395-1414.
[10]	HARDT M, ROTH A.Beyond Worst-Case Analysis in Private Singular Vector Computation[C]//SIGACT. 45th ACM Symposium on the Theory of Computing, June 1-4, 2013, Palo Alto, California, USA. New York: ACM, 2013: 331-340.
[11]	JIANG Wuxuan, XIE Cong, ZHANG Zhihua.Wishart Mechanism for Differentially Private Principal Components Analysis[J]. Computer Science, 2015, 18(VI): 458-473.
[12]	HARDT M, PRICE E.The Noisy Power Method: A Meta Algorithm with Applications[C]//NIPS. Annual Conference on Neural Information Processing Systems 2014, December 8-13, 2014, Montreal, Quebec, Canada. Boston: MIT Press, 2014: 2861-2869.
[13]	HALKON, MARTINSSON P G, TROPP J A. Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions[J]. SIAM Review, 2011, 53(2): 217-288.
[14]	KALAI A, VEMPALA S.Efficient Algorithms for Online Decision Problems[J]. Journal of Computer and System Sciences, 2005, 71(3): 291-307.
[15]	ZHANG Yuxuan, WEI Jianghong, LI Ji, et al.Graph Degree Histogram Publication Method with Node-Differential Privacy[J]. Journal of Computer Research and Development, 2019, 6(3): 508-520.
	张宇轩,魏江宏,李霁,等.点差分隐私下图数据的度直方图发布方法[J]. 计算机研究与发展,2019,6(3):508-520.
[16]	MCSHERRY F.Privacy Integrated Queries: An Extensible Platform for Privacy-Preserving Data Analysis[C]//SIGACT. 2009 ACM SIGMOD International Conference on Management of Data, June 29-July 2, 2009, Providence, Rhode Island, USA. New York: ACM, 2009: 19-30.
[17]	DWORK C.Differential Privacy: A Survey of Results[C]//Springer. 5th International Conference Theory and Applications of Models of Computation, April 25-29, Xi’an, China. Berlin: Spinger, 2008: 1-19.
[18]	DWORK C, MCSHERRY F, NISSIM K, et al.Calibrating Data to Sensitivity in Private Data Analysis[C]// Springer. the 3rd Theory of Cryptography Conference, March 4-7, 2006, New York, USA. Berlin: Springer, 2006: 265-284.
[19]	MCSHERRY F, TALWAR K.Mechanism Design via Differential Privacy[C]//IEEE. 48th Annual IEEE Symposium on Foundations of Computer Science, October 21-23, 2007, Providence, Rhode Island, USA. New York: IEEE, 2007: 94-103.
[20]	RESHEF D N, RESHEF Y A, FINUCANE H K, et al.Detecting Novel Associations in Large Data Sets[J]. Science, 2011, 334(6062): 1518-1524.
[21]	WOLD S.Principal Component Analysis[J]. Chemometrics & Intelligent Laboratory Systems, 1987, 2(1): 37-52.
[22]	TIPPING M E, BISHOP C M.Probabilistic Principal Component Analysis[J]. Journal of the Royal Statistical Society Series B(Statistical Methodology), 1999, 61(3): 611-622.
[23]	MOHAMMED N, CHEN R, FUNG B C M, et al. Differentially Private Data Release for Data Mining[C]//SIGACT. 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 21-24, 2011, San Diego, California, USA. New York: ACM, 2011: 493-501.
[24]	KAISER, H F.The Application of Electronic Computers to Factor Analysis[J]. Educational and Psychological Measurement, 1960, 20(1): 141-151.
[25]	Residential Building Data Set Data Set[EB/OL]. 2018-2-19.

基于最大信息系数的主成分分析差分隐私数据发布算法

A Differential Private Data Publishing Algorithm via Principal Component Analysis Based on Maximum Information Coefficient

RichHTML

PDF (PC)

可视化

摘要/Abstract

引用本文

使用本文

图/表 8

参考文献 25

相关文章 10

编辑推荐

Metrics

本文评价

[1]	蒋辰, 杨庚, 白云璐, 马君梅. 面向隐私保护的频繁项集挖掘算法[J]. 信息网络安全, 2019, 19(4): 73-81.
[2]	刘敬浩, 毛思平, 付晓梅. 基于ICA算法与深度神经网络的入侵检测模型[J]. 信息网络安全, 2019, 19(3): 1-10.
[3]	傅彦铭, 李振铎. 基于拉普拉斯机制的差分隐私保护k-means++聚类算法研究[J]. 信息网络安全, 2019, 19(2): 43-52.
[4]	杨威超, 郭渊博, 钟雅, 甄帅辉. 基于设备型号分类和BP神经网络的物联网流量异常检测[J]. 信息网络安全, 2019, 19(12): 53-63.
[5]	戚名钰, 刘铭, 傅彦铭. 基于PCA的SVM网络入侵检测研究[J]. 信息网络安全, 2015, 15(2): 15-18.
[6]	. 基于两个方向二维核主成分分析的手指静脉识别[J]. , 2014, 14(4): 51-.
[7]	王文明;谭毓安. 基于EM聚类算法的机器人视觉场景深度分类方法[J]. , 2013, 13(6): 0-0.
[8]	张新超;董建锋;张婧. 主成份分析在网络流量异常检测中的应用研究[J]. , 2012, 12(1): 0-0.
[9]	郑辉. 基于KPCA组合核函数SVM的网络危险因素识别[J]. , 2010, (2): 0-0.
[10]	孙子文;周治平;李慧. 基于小波子带特征函数矩和主成分分析的图像隐写分析方法[J]. , 2009, 9(7): 0-0.