The Neural Differential Manifold (NDM) proposes re-conceptualizing neural networks as differentiable manifolds where parameters define a Riemannian metric, rather than operating in Euclidean spaces. This architecture aims to intrinsically regularize representations, enhance interpretability through geometric meaning, and enable more efficient optimization.
Damek Davis and Benjamin Recht introduce a unified theoretical framework for reinforcement learning algorithms applied to large language model post-training with binary rewards. Their work reveals that diverse methods optimize a monotonically transformed probability of a correct answer, with the specific transformation dictated by the advantage weights.
MuonBP introduces a block-periodic orthogonalization scheme to address the communication overhead of the Muon optimizer in distributed training for large language models. The method achieves an 8% increase in per-iteration throughput and reduces wall-clock time by 10-13% to reach target perplexity, while maintaining or slightly improving validation perplexity compared to baseline Muon.
In finite dimension, the long-time and metastable behavior of a gradient flow perturbated by a small Brownian noise is well understood. A similar situation arises when a Wasserstein gradient flow over a space of probability measure is approximated by a system of mean-field interacting particles, but classical results do not apply in these infinite-dimensional settings. This work is concerned with the situation where the objective function of the optimization problem contains an entropic penalization, so that the particle system is a Langevin diffusion process. We consider a very simple class of models, for which the infinite-dimensional behavior is fully characterized by a finite-dimensional process. The goal is to have a flexible class of benchmarks to fix some objectives, conjectures and (counter-)examples for the general situation. Inspired by the systematic study of these toy models, one application is presented on the continuous Curie-Weiss model in a symmetric double-well potential. We show that, at the critical temperature, although the N-particle Gibbs measure does not satisfy a uniform-in-N standard log-Sobolev inequality (the optimal constant growing like N), it does satisfy a more general Lojasiewicz inequality uniformly in N, inducing uniform polynomial long-time convergence rates, propagation of chaos at stationarity and uniformly in time, and creation of chaos.
Stephen J. Wright's paper synthesizes the historical and ongoing interaction between theoretical properties and practical performance in continuous optimization algorithms, covering linear programming and unconstrained minimization. It explains how theoretical advancements influence practical solvers and how empirical observations drive new theoretical investigations, tracing the evolution of algorithms across decades.
Seesaw is an algorithm developed by researchers at Harvard and UC Berkeley that establishes a theoretical equivalence between learning rate decay and batch size ramp-up for large language model pretraining. This principled approach accelerates LLM training by approximately 36% in wall-clock time while maintaining model performance.
Cautious Weight Decay (CWD) modifies decoupled weight decay to apply regularization only when the optimizer's update direction aligns with the parameter's sign, ensuring optimizers target the original loss function. This approach consistently improves validation loss and accuracy across large language models and vision models, integrating as a drop-in change without requiring new hyperparameter tuning.
Researchers at Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) introduced a unified theoretical framework using preconditioned matrix norms, demonstrating that diverse optimization algorithms like steepest descent, quasi-Newton, and adaptive methods are special cases. This framework enables the systematic design of new optimizers, exemplified by MuAdam, which achieves competitive performance across various deep learning tasks.
We propose a variant of the approximate Bregman proximal gradient (ABPG) algorithm for minimizing the sum of a smooth nonconvex function and a nonsmooth convex function. Although ABPG is known to converge globally to a stationary point even when the smooth part of the objective function lacks globally Lipschitz continuous gradients, and its iterates can often be expressed in closed form, ABPG relies on an Armijo line search to guarantee global convergence. Such reliance can slow down performance in practice. To overcome this limitation, we propose the ABPG with a variable metric Armijo--Wolfe line search. Under the variable metric Armijo--Wolfe condition, we establish the global subsequential convergence of our algorithm. Moreover, assuming the Kurdyka--Łojasiewicz property, we also establish that our algorithm globally converges to a stationary point. Numerical experiments on ℓp regularized least squares problems and nonnegative linear inverse problems demonstrate that our algorithm outperforms existing algorithms.
Out-of-distribution (OOD) detection plays a crucial role in ensuring the robustness and reliability of machine learning systems deployed in real-world applications. Recent approaches have explored the use of unlabeled data, showing potential for enhancing OOD detection capabilities. However, effectively utilizing unlabeled in-the-wild data remains challenging due to the mixed nature of both in-distribution (InD) and OOD samples. The lack of a distinct set of OOD samples complicates the task of training an optimal OOD classifier. In this work, we introduce Medix, a novel framework designed to identify potential outliers from unlabeled data using the median operation. We use the median because it provides a stable estimate of the central tendency, as an OOD detection mechanism, due to its robustness against noise and outliers. Using these identified outliers, along with labeled InD data, we train a robust OOD classifier. From a theoretical perspective, we derive error bounds that demonstrate Medix achieves a low error rate. Empirical results further substantiate our claims, as Medix outperforms existing methods across the board in open-world settings, confirming the validity of our theoretical insights.
Learning rate warm-up - increasing the learning rate at the beginning of training - has become a ubiquitous heuristic in modern deep learning, yet its theoretical foundations remain poorly understood. In this work, we provide a principled explanation for why warm-up improves training. We rely on a generalization of the (L0,L1)-smoothness condition, which bounds local curvature as a linear function of the loss sub-optimality and exhibits desirable closure properties. We demonstrate both theoretically and empirically that this condition holds for common neural architectures trained with mean-squared error and cross-entropy losses. Under this assumption, we prove that Gradient Descent with a warm-up schedule achieves faster convergence than with a fixed step-size, establishing upper and lower complexity bounds. Finally, we validate our theoretical insights through experiments on language and vision models, confirming the practical benefits of warm-up schedules.
Diffusion models (DMs) excel in image generation, but suffer from slow inference and the training-inference discrepancies. Although gradient-based solvers like DPM-Solver accelerate the denoising inference, they lack theoretical foundations in information transmission efficiency. In this work, we introduce an information-theoretic perspective on the inference processes of DMs, revealing that successful denoising fundamentally reduces conditional entropy in reverse transitions. This principle leads to our key insights into the inference processes: (1) data prediction parameterization outperforms its noise counterpart, and (2) optimizing conditional variance offers a reference-free way to minimize both transition and reconstruction errors. Based on these insights, we propose an entropy-aware variance optimized method for the generative process of DMs, called EVODiff, which systematically reduces uncertainty by optimizing conditional entropy during denoising. Extensive experiments on DMs validate our insights and demonstrate that our method significantly and consistently outperforms state-of-the-art (SOTA) gradient-based solvers. For example, compared to the DPM-Solver++, EVODiff reduces the reconstruction error by up to 45.5\% (FID improves from 5.10 to 2.78) at 10 function evaluations (NFE) on CIFAR-10, cuts the NFE cost by 25\% (from 20 to 15 NFE) for high-quality samples on ImageNet-256, and improves text-to-image generation while reducing artifacts. Code is available at this https URL.
Imaging inverse problems aims to recover high-dimensional signals from undersampled, noisy measurements, a fundamentally ill-posed task with infinite solutions in the null-space of the sensing operator. To resolve this ambiguity, prior information is typically incorporated through handcrafted regularizers or learned models that constrain the solution space. However, these priors typically ignore the task-specific structure of that null-space. In this work, we propose \textit{Non-Linear Projections of the Null-Space} (NPN), a novel class of regularization that, instead of enforcing structural constraints in the image domain, promotes solutions that lie in a low-dimensional projection of the sensing matrix's null-space with a neural network. Our approach has two key advantages: (1) Interpretability: by focusing on the structure of the null-space, we design sensing-matrix-specific priors that capture information orthogonal to the signal components that are fundamentally blind to the sensing process. (2) Flexibility: NPN is adaptable to various inverse problems, compatible with existing reconstruction frameworks, and complementary to conventional image-domain priors. We provide theoretical guarantees on convergence and reconstruction accuracy when used within plug-and-play methods. Empirical results across diverse sensing matrices demonstrate that NPN priors consistently enhance reconstruction fidelity in various imaging inverse problems, such as compressive sensing, deblurring, super-resolution, computed tomography, and magnetic resonance imaging, with plug-and-play methods, unrolling networks, deep image prior, and diffusion models.
Recently, a new optimization method based on the linear minimization oracle (LMO), called Muon, has been attracting increasing attention since it can train neural networks faster than existing adaptive optimization methods, such as Adam. In this paper, we study how Muon can be utilized in federated learning. We first show that straightforwardly using Muon as the local optimizer of FedAvg does not converge to the stationary point since the LMO is a biased operator. We then propose FedMuon which can mitigate this issue. We also analyze how solving the LMO approximately affects the convergence rate and find that, surprisingly, FedMuon can converge for any number of Newton-Schulz iterations, while it can converge faster as we solve the LMO more accurately. Through experiments, we demonstrated that FedMuon can outperform the state-of-the-art federated learning methods.
The LogSumExp function, also known as the free energy, plays a central role in many important optimization problems, including entropy-regularized optimal transport and distributionally robust optimization (DRO). It is also the dual to the Kullback-Leibler (KL) divergence, which is widely used in machine learning. In practice, when the number of exponential terms inside the logarithm is large or infinite, optimization becomes challenging since computing the gradient requires differentiating every term. Previous approaches that replace the full sum with a small batch introduce significant bias. We propose a novel approximation to LogSumExp that can be efficiently optimized using stochastic gradient methods. This approximation is rooted in a sound modification of the KL divergence in the dual, resulting in a new f-divergence called the safe KL divergence. The accuracy of the approximation is controlled by a tunable parameter and can be made arbitrarily small. Like the LogSumExp, our approximation preserves convexity. Moreover, when applied to an L-smooth function bounded from below, the smoothness constant of the resulting objective scales linearly with L. Experiments in DRO and continuous optimal transport demonstrate the advantages of our approach over state-of-the-art baselines and the effective treatment of numerical issues associated with the standard LogSumExp and KL.
Ensuring safe and effective collaboration between humans and autonomous legged robots is a fundamental challenge in shared autonomy, particularly for teleoperated systems navigating cluttered environments. Conventional shared-control approaches often rely on fixed blending strategies that fail to capture the dynamics of legged locomotion and may compromise safety. This paper presents a teleoperator-aware, safety-critical, adaptive nonlinear model predictive control (ANMPC) framework for shared autonomy of quadrupedal robots in obstacle-avoidance tasks. The framework employs a fixed arbitration weight between human and robot actions but enhances this scheme by modeling the human input with a noisily rational Boltzmann model, whose parameters are adapted online using a projected gradient descent (PGD) law from observed joystick commands. Safety is enforced through control barrier function (CBF) constraints integrated into a computationally efficient NMPC, ensuring forward invariance of safe sets despite uncertainty in human behavior. The control architecture is hierarchical: a high-level CBF-based ANMPC (10 Hz) generates blended human-robot velocity references, a mid-level dynamics-aware NMPC (60 Hz) enforces reduced-order single rigid body (SRB) dynamics to track these references, and a low-level nonlinear whole-body controller (500 Hz) imposes the full-order dynamics via quadratic programming to track the mid-level trajectories. Extensive numerical and hardware experiments, together with a user study, on a Unitree Go2 quadrupedal robot validate the framework, demonstrating real-time obstacle avoidance, online learning of human intent parameters, and safe teleoperator collaboration.
Riemannian accelerated gradient methods have been well studied for smooth optimization, typically treating geodesically convex and geodesically strongly convex cases separately. However, their extension to nonsmooth problems on manifolds with theoretical acceleration remains underexplored. To address this issue, we propose a unified Riemannian accelerated proximal gradient method for problems of the form F(x)=f(x)+h(x) on manifolds, where f can be either geodesically convex or geodesically strongly convex, and h is ρ-retraction-convex, possibly nonsmooth. We rigorously establish accelerated convergence rate under reasonable conditions. Additionally, we introduce a safeguard mechanism to ensure global convergence in non-convex settings. Numerical results validate the theoretical acceleration of the proposed method.
This survey collects, within a unified framework, various results (primarily by the authors themselves) on the use of Deterministic Infinite-Dimensional Optimal Control Theory to address applied economic models. The main aim is to illustrate, through several examples, the typical features of such models (including state constraints, non-Lipschitz data, and non-regularizing differential operators) and the corresponding methods needed to handle them. This necessitates developing aspects of the existing Deterministic Infinite-Dimensional Optimal Control Theory (see, e.g., the book by Li and Yong, 2012) in specific and often nontrivial directions. Given the breadth of this area, we emphasize the Dynamic Programming Approach and its application to problems where explicit or quasi-explicit solutions of the associated Hamilton-Jacobi-Bellman (HJB) equations can be obtained. We also provide insights and references for cases where such explicit solutions are not available.
We propose a prox-regular-type low-rank constrained nonconvex nonsmooth optimization model for Robust Low-Rank Matrix Recovery (RLRMR), i.e., estimate problem of low-rank matrix from an observed signal corrupted by outliers. For RLRMR, the ℓ1-norm has been utilized as a convex loss to detect outliers as well as to keep tractability of optimization models. Nevertheless, the ℓ1-norm is not necessarily an ideal robust loss because the ℓ1-norm tends to overpenalize entries corrupted by outliers of large magnitude. In contrast, the proposed model can employ a weakly convex function as a more robust loss, against outliers, than the ℓ1-norm. For the proposed model, we present (i) a projected variable smoothing-type algorithm applicable for the minimization of a nonsmooth weakly convex function over a prox-regular set, and (ii) a convergence analysis of the proposed algorithm in terms of stationary point. Numerical experiments demonstrate the effectiveness of the proposed model compared with the existing models that employ the ℓ1-norm.
Control barrier functions (CBFs) are a powerful tool for the constrained control of nonlinear systems; however, the majority of results in the literature focus on systems subject to a single CBF constraint, making it challenging to synthesize provably safe controllers that handle multiple state constraints. This paper presents a framework for constrained control of nonlinear systems subject to box constraints on the systems' vector-valued outputs using multiple CBFs. Our results illustrate that when the output has a vector relative degree, the CBF constraints encoding these box constraints are compatible, and the resulting optimization-based controller is locally Lipschitz continuous and admits a closed-form expression. Additional results are presented to characterize the degradation of nominal tracking objectives in the presence of safety constraints. Simulations of a planar quadrotor are presented to demonstrate the efficacy of the proposed framework.
There are no more papers matching your filters at the moment.