Abstract
Training large machine learning models requires a distributed computing approach, with communication of the model updates being the bottleneck. For this reason, several methods based on compression of the updates, via sparsification or quantization, were recently proposed. However, none of the prior methods is able to learn the gradients, which renders them incapable of converging to the true optimum in the batch mode. In this work we propose a new method, DIANA, which resolves this issue via compression of gradient differences. We provide theory in the strongly convex and nonconvex settings that shows improved convergence rates, and use it to obtain the first convergence rate for the previously proposed method TernGrad. Finally, we provide theory to support non-smooth regularizers.
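The key idea, compressing gradient differences rather than the gradients themselves, can be sketched in a few lines. The Python/NumPy snippet below is a minimal illustration based only on the description in this abstract: the compressor (random sparsification), the names `rand_sparsify`, `Worker`, `server_step`, and the step sizes `alpha` and `lr` are illustrative assumptions, not the exact algorithm or notation of the paper.

```python
import numpy as np

def rand_sparsify(v, k):
    """Hypothetical unbiased compressor: keep k random coordinates, rescaled by d/k."""
    d = v.size
    out = np.zeros_like(v)
    idx = np.random.choice(d, size=k, replace=False)
    out[idx] = v[idx] * (d / k)
    return out

class Worker:
    """Keeps a local gradient estimate h and communicates only a
    compressed difference between the current gradient and h."""
    def __init__(self, dim, alpha=0.1, k=10):
        self.h = np.zeros(dim)  # learned estimate of this worker's gradient
        self.alpha = alpha      # step size for updating the estimate
        self.k = k              # sparsity level of the compressor

    def message(self, g):
        delta_hat = rand_sparsify(g - self.h, self.k)  # compress the *difference*
        self.h += self.alpha * delta_hat               # h gradually learns g
        return delta_hat

def server_step(x, h_avg, messages, lr, alpha):
    """One server update: reconstruct a gradient estimate from the
    compressed differences and take a gradient step."""
    mean_msg = np.mean(messages, axis=0)
    g_hat = h_avg + mean_msg          # estimate of the average gradient
    h_avg = h_avg + alpha * mean_msg  # mirrors the workers' updates of h
    return x - lr * g_hat, h_avg

# Toy usage (hypothetical problem): two workers holding quadratics.
dim = 100
workers = [Worker(dim, k=10) for _ in range(2)]
x, h_avg = np.zeros(dim), np.zeros(dim)
for _ in range(200):
    grads = [2.0 * (x - 1.0), 2.0 * (x + 1.0)]           # local gradients
    msgs = [w.message(g) for w, g in zip(workers, grads)]
    x, h_avg = server_step(x, h_avg, msgs, lr=0.1, alpha=0.1)
```

Intuitively, because each worker updates its estimate `h` with the same compressed message the server receives, the server can track the average estimate exactly, and the transmitted differences shrink as the estimates learn the local gradients; this is what the abstract refers to as learning the gradients and what enables convergence to the true optimum in the batch mode.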
| Original language | English (US) |
| --- | --- |
| Journal | Optimization Methods and Software |
| DOIs | |
| State | Accepted/In press - 2024 |
Bibliographical note
Publisher Copyright: © 2024 Informa UK Limited, trading as Taylor & Francis Group.
Keywords
- communication compression
- convex optimization
- distributed optimization
- federated learning
- non-convex optimization
ASJC Scopus subject areas
- Software
- Control and Optimization
- Applied Mathematics