Finding Low-Rank Matrix Weights in DNNs via Riemannian Optimization: RAdaGrad and RAdamW

Posted by: 刘茜茜 | Date posted: 2026-06-04

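Jiangsu Center for Applied Mathematics (China University of Mining and Technology) Academic Seminar Series

Title: Finding Low-Rank Matrix Weights in DNNs via Riemannian Optimization: RAdaGrad and RAdamW

Speaker: Jian-Feng Cai (蔡剑锋)

Time: Wednesday, June 10, 2026, 20:00-21:00

Tencent Meeting ID: 909-994-803

All faculty and students are welcome to attend!

School of Mathematics

About the speaker and the talk: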

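Jian-Feng Cai is a Professor in the Department of Mathematics at the Hong Kong University of Science and Technology (HKUST). He received his bachelor's and master's degrees from Fudan University and his Ph.D. from the Chinese University of Hong Kong. After completing his Ph.D., he held postdoctoral positions at the National University of Singapore and the University of California, Los Angeles. Before joining HKUST, he was an assistant professor at the University of Iowa. His research focuses on mathematical theory and algorithms in data science and imaging.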

Abstract: Finding low-rank matrix weights is a key technique for addressing the high memory usage and computational demands of large models. Most existing algorithms rely on a factorization of the low-rank weight matrices, which is non-unique and redundant. Their convergence is slow, especially when the target low-rank matrices are ill-conditioned, because the convergence rate depends on the condition numbers of both the Jacobian operator of the factorization and the Hessian of the loss function with respect to the weight matrix. To address this challenge, we adopt the Riemannian gradient descent (RGD) algorithm on the Riemannian manifold of fixed-rank matrices to update the entire low-rank weight matrix. This algorithm completely avoids the factorization, thereby eliminating the negative impact of the Jacobian condition number. Furthermore, by leveraging the geometric structure of the Riemannian manifold and selecting an appropriate metric, it mitigates the negative impact of the Hessian condition number. This results in our two plug-and-play optimizers, RAdaGrad and RAdamW, which are RGD with metrics adapted from AdaGrad and AdamW and restricted to the manifold. Our algorithms can be integrated with various deep neural network architectures without any modifications. We evaluate their effectiveness through fine-tuning experiments on large language models and diffusion models. Experimental results consistently demonstrate that our algorithms outperform state-of-the-art methods. Additionally, our algorithms are effective not only for fine-tuning large models but also for deep neural network (DNN) compression.
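
For readers unfamiliar with RGD on the fixed-rank manifold, the following is a minimal NumPy sketch of one common textbook form of the update: project the Euclidean gradient onto the tangent space at the current point, take a step, and retract back to rank r by truncated SVD. It uses the plain Euclidean metric, so it is not RAdaGrad or RAdamW, whose adapted metrics are not specified in the abstract. The function names (`truncate`, `project_tangent`, `rgd_step`, `toy_demo`) and the toy least-squares loss are illustrative assumptions, not the speaker's implementation.

```python
import numpy as np


def truncate(W, r):
    """Best rank-r approximation of W, returned as SVD factors (U, s, Vt)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :r], s[:r], Vt[:r, :]


def project_tangent(U, Vt, G):
    """Orthogonal projection of G onto the tangent space of the fixed-rank
    manifold at W = U diag(s) Vt:  P(G) = UU^T G + G VV^T - UU^T G VV^T."""
    UtG = U.T @ G          # r x n
    GV = G @ Vt.T          # m x r
    return U @ UtG + GV @ Vt - U @ (UtG @ Vt.T) @ Vt


def rgd_step(U, s, Vt, G, lr):
    """One Riemannian gradient step (Euclidean metric, assumed here):
    tangent projection followed by a truncated-SVD retraction."""
    r = U.shape[1]
    W = (U * s) @ Vt                   # current rank-r weight matrix
    xi = project_tangent(U, Vt, G)     # Riemannian gradient
    return truncate(W - lr * xi, r)    # retract back onto the manifold


def toy_demo(steps=50, lr=0.5, seed=0):
    """Toy problem (an assumption for illustration): fit a rank-r target M
    by minimizing 0.5 * ||W - M||_F^2 from a perturbed starting point."""
    rng = np.random.default_rng(seed)
    m, n, r = 20, 15, 3
    M = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
    U, s, Vt = truncate(M + 0.1 * rng.standard_normal((m, n)), r)
    loss0 = 0.5 * np.linalg.norm((U * s) @ Vt - M) ** 2
    for _ in range(steps):
        G = (U * s) @ Vt - M           # Euclidean gradient of the toy loss
        U, s, Vt = rgd_step(U, s, Vt, G, lr)
    W = (U * s) @ Vt
    loss1 = 0.5 * np.linalg.norm(W - M) ** 2
    return loss0, loss1, np.linalg.matrix_rank(W)


if __name__ == "__main__":
    print(toy_demo())
```

The weight matrix is updated as a whole and stays at rank at most r after every step; no pair of factors is optimized separately, which is the property the abstract contrasts with factorization-based methods.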