信息网络安全 ›› 2020, Vol. 20 ›› Issue (2): 37-48.doi: 10.3969/j.issn.1671-1122.2020.02.006

• • 上一篇    下一篇

基于最大信息系数的主成分分析差分隐私数据发布算法

彭长根1,2,3, 赵园园1,3(), 樊玫玫1   

  1. 1.贵州大学数学与统计学院公共大数据国家重点实验室,贵阳 550025
    2.贵州大学计算机科学与技术学院,贵阳 550025
    3.贵州大学密码学与数据安全研究所,贵阳 550025
  • 收稿日期:2019-10-27 出版日期:2020-02-10 发布日期:2020-05-11
  • 作者简介:

    作者简介:彭长根(1963—),男,贵州,教授,博士,主要研究方向为密码学与大数据安全;赵园园(1993—),女,河南,硕士研究生,主要研究方向为隐私保护与数据安全;樊玫玫(1981—),女,贵州,讲师,硕士,主要研究方向为密码学与信息安全。

  • 基金资助:
    国家自然科学基金[U1836205, 61662009, 61772008, 11761020];贵州省科技重大专项[[2018]3001,[2018]3007,[2017]3002];贵州省科技支撑计划[[2019]2004,[2018]2162];贵州省科技基础研究计划[[2017]1045,[2019]1049];贵州财经大学科研基金[2017XJC01];贵州大学研究生创新基金[研理工2016068];贵州省研究生科研基金立项课题[KYJJ2017005]

A Differential Private Data Publishing Algorithm via Principal Component Analysis Based on Maximum Information Coefficient

PENG Changgen1,2,3, ZHAO Yuanyuan1,3(), FAN Meimei1   

  1. 1. State Key Laboratory of Public Big Data, College of Mathematics and Statistics, Guizhou University, Guiyang 550025, China
    2. College of Computer Science and Technology, Guizhou University, Guiyang 550025, China
    3. Institute of Cryptography & Data Security, Guzihou University, Guiyang 550025, China
  • Received:2019-10-27 Online:2020-02-10 Published:2020-05-11

摘要:

数据的私密性与可用性是隐私保护领域的核心问题。主成分分析差分隐私算法将高维数据的主成分降维与加噪融合,可有效对高维数据进行隐私保护并保持数据的高可用性。现有的主成分分析差分隐私保护算法依赖于皮尔逊相关系数,在隐私保护过程中仅能捕获高维数据的线性关系,且未考虑将差分预算在降维后的数据集上进行优化分配,导致算法普适性不足、数据效用不高。针对此问题,文章提出一种基于最大信息系数的主成分分析差分隐私(MIC-PCA-DP)数据发布算法。实验结果表明,文章所提出的隐私保护算法适用于线性关系、非线性关系、非函数依赖关系等数据关系的降维保持,其主成分降维可实现高效的原始数据信息承载;与经典的差分隐私算法以及PCA-based PPDP算法相比,在相同隐私保护强度下,对数据添加的噪声更小,能够有效保持数据的可用性。

关键词: 差分隐私, 主成分分析, 降维, 数据发布

Abstract:

The privacy and availability of data are important issues of privacy protection. Principal component analysis (PCA) differential privacy can effectively protect the privacy of high-dimensional data and maintain the high availability of data by integrating the dimensionality reduction and noise addition of the principal component of high-dimensional data. The existing principal component analysis differential privacy protection algorithm relies on Pearson correlation coefficient, which can only capture the linear relationship of high-dimensional data in the privacy protection process, and does not consider to optimize the allocation of difference budget on the data set after dimension reduction, resulting in insufficient applicability of the algorithm and low data utility. To this end, a differential private data publishing algorithm (MIC-PCA-DPPD) is proposed via principal component analysis based on maximum information coefficient. The experimental results show that the proposed privacy protection algorithm is suitable for dimensionality maintenance of linear relationships, nonlinear relationships, multi-function relationships, etc. Its principal component dimensionality reduction can achieve efficient raw data information bearing. Compared with the classic differential privacy algorithm and PCA-based PPDP algorithm, the noise added to the data is smaller and the data availability can be effectively maintained, under the same constraint of privacy protection intensity.

Key words: differential privacy, principal components analysis, dimension reduction, data publication

中图分类号: