Nearly Optimal Rates of Privacy-preserving Sparse Generalized Eigenvalue Problem

Research output: Contribution to journalArticlepeer-review

Abstract

In this paper, we study the (sparse) Generalized Eigenvalue Problem (GEP), which arises in a number of modern statistical learning models, such as principal component analysis (PCA), canonical correlation analysis (CCA), Fisher&#x0027;s discriminant analysis (FDA) and sliced inverse regression (SIR). We provide the first study on GEP in the differential privacy (DP) model under both deterministic and stochastic settings. In the low dimensional case, we provide a <inline-formula><tex-math notation="LaTeX">$\rho$</tex-math></inline-formula>-Concentrated DP (CDP) method namely DP-Rayleigh Flow and show if the initial vector is close enough to the optimal vector, its output has an <inline-formula><tex-math notation="LaTeX">$\ell _{2}$</tex-math></inline-formula>-norm estimation error of <inline-formula><tex-math notation="LaTeX">$\tilde{O}(\frac{d}{n}+\frac{d}{n^{2}\rho })$</tex-math></inline-formula> (under some mild assumptions), where <inline-formula><tex-math notation="LaTeX">$d$</tex-math></inline-formula> is the dimension and <inline-formula><tex-math notation="LaTeX">$n$</tex-math></inline-formula> is the sample size. Next, we discuss how to find such an initial parameter privately. In the high dimensional sparse case where <inline-formula><tex-math notation="LaTeX">$d\gg n$</tex-math></inline-formula>, we propose the DP-Truncated Rayleigh Flow method whose output could achieve an error of <inline-formula><tex-math notation="LaTeX">$\tilde{O}(\frac{s\log d}{n}+\frac{s\log d}{n^{2}\rho })$</tex-math></inline-formula> for various statistical models, where <inline-formula><tex-math notation="LaTeX">$s$</tex-math></inline-formula> is the sparsity of the underlying parameter. Moreover, we show that these errors in the stochastic setting are optimal up to a factor of <inline-formula><tex-math notation="LaTeX">$\text{Poly}(\log n)$</tex-math></inline-formula> by providing the lower bounds of PCA and SIR under the statistical setting and in the CDP model. Finally, to give a separation between <inline-formula><tex-math notation="LaTeX">$\epsilon$</tex-math></inline-formula>-DP and <inline-formula><tex-math notation="LaTeX">$\rho$</tex-math></inline-formula>-CDP for GEP, we also provide the lower bound <inline-formula><tex-math notation="LaTeX">$\Omega (\frac{d}{n}+\frac{d^{2}}{n^{2}\epsilon ^{2}})$</tex-math></inline-formula> and <inline-formula><tex-math notation="LaTeX">$\Omega (\frac{s\log d}{n}+\frac{s^{2}\log ^{2} d}{n^{2}\epsilon ^{2}})$</tex-math></inline-formula> of private minimax risk for PCA, under the statistical setting and <inline-formula><tex-math notation="LaTeX">$\epsilon$</tex-math></inline-formula>-DP model, in low and high dimensional sparse case respectively. Finally, extensive experiments on both synthetic and real-world data support our previous theoretical analysis.

Original language English (US) 1-14 14 IEEE Transactions on Knowledge and Data Engineering 36 8 https://doi.org/10.1109/TKDE.2023.3330775 Accepted/In press - 2023

IEEE

Keywords

• Differential privacy
• Dimension Reduction
• Dimensionality reduction
• Eigenvalues and eigenfunctions
• Estimation error
• Generalized Eigenvalue Problem
• Principal component analysis
• Privacy
• Sliced Inverse Regression
• Stochastic processes
• Upper bound

ASJC Scopus subject areas

• Information Systems
• Computer Science Applications
• Computational Theory and Mathematics

Fingerprint

Dive into the research topics of 'Nearly Optimal Rates of Privacy-preserving Sparse Generalized Eigenvalue Problem'. Together they form a unique fingerprint.