We study the bias introduced in Differentially-Private Stochastic Gradient Descent (DP-SGD) with clipped or normalized per-sample gradient. As one of the most popular but artificial operations to ensure bounded sensitivity, gradient clipping enables composite privacy analysis of many iterative optimization methods without additional assumptions on either learning models or input data. Despite its wide applicability, gradient clipping also presents theoretical challenges in systematically instructing improvement of privacy or utility. In general, without an assumption on globally-bounded gradient, classic convergence analyses do not apply to clipped gradient descent. Further, given limited understanding of the utility loss, many existing improvements to DP-SGD are heuristic, especially in the applications of private deep learning.In this paper, we provide meaningful theoretical analysis validated by thorough empirical results of DP-SGD. We point out that the bias caused by gradient clipping is underestimated in previous works. For generic non-convex optimization via DP-SGD, we show one key factor contributing to the bias is the sampling noise of stochastic gradient to be clipped. Accordingly, we use the developed theory to build a series of improvements for sampling noise reduction from various perspectives. From an optimization angle, we study variance reduction techniques and propose inner-outer momentum. At the learning model (neural network) level, we propose several tricks to enhance network internal normalization and BatchClipping to carefully clip the gradient of a batch of samples. For data preprocessing, we provide theoretical justification of recently proposed improvements via data normalization and (self-)augmentation.Putting these systematic improvements together, private deep learning via DP-SGD can be significantly strengthened in many tasks. For example, in computer vision applications, with an (ϵ = 8, δ = 10 −5 ) DP guarantee, we successfully train Re...
Bibliographical noteKAUST Repository Item: Exported on 2023-07-24
Acknowledged KAUST grant number(s): BAS/1/1689-01-01, FCC/1/1976-49-01, REI/1/4811-10-01, RGC/3/4816-01-01, URF/1/4663-01-01
Acknowledgements: We would like to thank Jun Wan for very helpful discussion and for reading several drafts of this paper. We also thank the anonymous reviewers for their constructive feedback. Hanshen Xiao was supported in part by DSTA, Singapore, and a MathWorks fellowship. Di Wang and Zihang Xiang were supported by BAS/1/1689-01-01, URF/1/4663-01-01, FCC/1/1976-49-01, RGC/3/4816-01-01, and REI/1/4811-10-01 of King Abdullah University of Science and Technology (KAUST) and KAUSTSDAIA Center of Excellence in Data Science and Artificial Intelligence.