$$ \newcommand{\qed}{\tag*{$\square$}} \newcommand{\span}{\operatorname{span}} \newcommand{\dim}{\operatorname{dim}} \newcommand{\rank}{\operatorname{rank}} \newcommand{\norm}[1]{\|#1\|} \newcommand{\grad}{\nabla} \newcommand{\prox}[1]{\operatorname{prox}_{#1}} \newcommand{\inner}[2]{\langle{#1}, {#2}\rangle} \newcommand{\mat}[1]{\mathcal{M}[#1]} \newcommand{\null}[1]{\operatorname{null} \left(#1\right)} \newcommand{\range}[1]{\operatorname{range} \left(#1\right)} \newcommand{\rowvec}[1]{\begin{bmatrix} #1 \end{bmatrix}^T} \newcommand{\Reals}{\mathbf{R}} \newcommand{\RR}{\mathbf{R}} \newcommand{\Complex}{\mathbf{C}} \newcommand{\Field}{\mathbf{F}} \newcommand{\Pb}{\operatorname{Pr}} \newcommand{\E}[1]{\operatorname{E}[#1]} \newcommand{\Var}[1]{\operatorname{Var}[#1]} \newcommand{\argmin}[2]{\underset{#1}{\operatorname{argmin}} {#2}} \newcommand{\optmin}[3]{ \begin{align*} & \underset{#1}{\text{minimize}} & & #2 \\ & \text{subject to} & & #3 \end{align*} } \newcommand{\optmax}[3]{ \begin{align*} & \underset{#1}{\text{maximize}} & & #2 \\ & \text{subject to} & & #3 \end{align*} } \newcommand{\optfind}[2]{ \begin{align*} & {\text{find}} & & #1 \\ & \text{subject to} & & #2 \end{align*} } $$
Let $f : \Reals^n \to \Reals \cup \{+\infty\}$ be a closed, convex, proper (CCP) function (that is, its epigraph is a nonempty closed convex set), and let $\lambda$ be a positive scalar. The proximal operator of the function $f$ is defined as

$$\prox{\lambda f}(v) = \argmin{x}{\left(f(x) + \frac{1}{2\lambda}\norm{x - v}_2^2\right)}.$$
Because the function minimized above is strongly convex, it has a unique minimizer for every $v \in \Reals^n$. The name “proximal” comes from the observation that $\prox{\lambda f}(v)$ is found by trading off minimizing $f$ and proximity to $v$.
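For instance, when $f = \norm{\cdot}_1$, the proximal operator has a well-known closed form, soft thresholding: $\prox{\lambda \norm{\cdot}_1}(v)_i = \operatorname{sign}(v_i)\max(|v_i| - \lambda,\, 0)$. Here is a minimal NumPy sketch (my illustration, not part of the original text) that checks the closed form against a brute-force minimization of the defining objective in one dimension:

```python
import numpy as np

def soft_threshold(v, lam):
    """Closed-form prox of lam * |.|, applied elementwise for the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

# Brute-force check in 1-D: minimize |x| + (x - v)^2 / (2 lam) over a fine grid.
v, lam = 0.8, 0.5
grid = np.linspace(-3, 3, 2_000_001)
objective = np.abs(grid) + (grid - v) ** 2 / (2 * lam)
print(grid[objective.argmin()], soft_threshold(v, lam))  # both approximately 0.3
```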
If $f \equiv 0$, then $\prox{\lambda f}(v) = v$.
If $f(x) = \inner{a}{x} + b$, with $a \in \Reals^n$ and $b \in \Reals$, then $\prox{\lambda f}(v) = v - \lambda a$.
If $f(x) = g(Qx)$, where $Q$ is an orthogonal matrix, then $\prox{\lambda f}(v) = Q^T \prox{\lambda g}(Qv)$.
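These properties are easy to check numerically. The sketch below is my own illustration; `is_minimizer` is a hypothetical helper that tests local optimality of a candidate prox point by random perturbation:

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam = 4, 0.3
v = rng.normal(size=n)

def prox_objective(f, x, v, lam):
    return f(x) + np.sum((x - v) ** 2) / (2 * lam)

def is_minimizer(f, cand, v, lam, trials=10000, scale=1e-3):
    """Sanity check: no nearby random perturbation improves the prox objective."""
    base = prox_objective(f, cand, v, lam)
    return all(prox_objective(f, cand + scale * rng.normal(size=n), v, lam) >= base
               for _ in range(trials))

# Property: f(x) = <a, x> + b  =>  prox_{lam f}(v) = v - lam * a.
a, b = rng.normal(size=n), 1.7
print(is_minimizer(lambda x: a @ x + b, v - lam * a, v, lam))  # True

# Property: f(x) = g(Qx), Q orthogonal  =>  prox_{lam f}(v) = Q^T prox_{lam g}(Qv).
soft = lambda w: np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)  # prox of lam*||.||_1
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
print(is_minimizer(lambda x: np.abs(Q @ x).sum(), Q.T @ soft(Q @ v), v, lam))  # True
```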
Assume $f$ is CCP. The minimizers of $f$ are precisely the fixed points of $\prox{\lambda f}$. To see one direction, let $x^\star$ be a minimizer of $f$. Then

$$f(x^\star) + \frac{1}{2\lambda}\norm{x^\star - x^\star}_2^2 = f(x^\star) \leq f(x) + \frac{1}{2\lambda}\norm{x - x^\star}_2^2 \quad \text{for all } x,$$

so the proximal point of $x^\star$ with respect to $\lambda f$ is $x^\star$ itself.
For the other direction, we’ll assume $f$ is subdifferentiable everywhere on its domain. Assume $x$ is a fixed point of $\prox{\lambda f}$. Note that $x = \prox{\lambda f}(x)$ means that $0 \in \partial f(x) + \frac{1}{\lambda}(x - x) = \partial f(x)$. If $0 \in \partial f(x)$, then $f(y) \geq f(x)$ for all $y$, i.e., $x$ minimizes $f$.
The proximal operator of a CCP function is firmly nonexpansive; this means that repeatedly applying $\prox{\lambda f}$ will yield a fixed point (and thus a minimizer) of $f$, provided a minimizer exists.
The Moreau decomposition connects proximal operators with convex duality. The decomposition is

$$v = \prox{f}(v) + \prox{f^*}(v),$$

where $f^*$ is the convex conjugate of $f$.
Proof (source: these slides). Let $u = \prox{f}(v)$. By the optimality condition for the proximal minimization, $v - u \in \partial f(u)$, which holds if and only if $u \in \partial f^*(v - u)$; this in turn says that $v - u$ minimizes $w \mapsto f^*(w) + \frac{1}{2}\norm{w - v}_2^2$, i.e., $v - u = \prox{f^*}(v)$. Hence $v = u + (v - u) = \prox{f}(v) + \prox{f^*}(v)$. $\square$
The Moreau decomposition generalizes the notion of orthogonal complements of subspaces. If $L$ is a subspace and $L^\perp$ is its orthogonal complement, then $v = \Pi_L(v) + \Pi_{L^\perp}(v)$ ($\Pi$ is the orthogonal projection operator). This follows from the Moreau decomposition by noting that the conjugate of the indicator function $I_L$ is $I_{L^\perp}$, $\prox{I_L} = \Pi_L$, and $\prox{I_{L^\perp}} = \Pi_{L^\perp}$.
In fact, the Moreau decomposition shows how convex cones play a role analogous to subspaces. If $K$ is a convex cone and $K^\circ$ is its polar cone ($K^\circ = \{y : \inner{y}{x} \leq 0 \ \text{for all}\ x \in K\}$), then

$$v = \Pi_K(v) + \Pi_{K^\circ}(v),$$

by an argument similar to the one made for orthogonal complements of subspaces. In particular,

$$\inner{\Pi_K(v)}{\Pi_{K^\circ}(v)} = 0,$$

that is, $\Pi_K(v) \perp \Pi_{K^\circ}(v)$.
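For a concrete instance, take $K$ to be the nonnegative orthant, whose polar cone is the nonpositive orthant; projection onto each is elementwise clipping. A quick NumPy check (my example) of the decomposition and the orthogonality of its components:

```python
import numpy as np

# Moreau cone decomposition, checked for K = R^n_+ (whose polar cone is R^n_-):
# v = Pi_K(v) + Pi_{K°}(v), with the two components orthogonal.
rng = np.random.default_rng(2)
v = rng.normal(size=6)
proj_K = np.maximum(v, 0.0)       # projection onto the nonnegative orthant
proj_polar = np.minimum(v, 0.0)   # projection onto the polar (nonpositive) cone
print(np.allclose(v, proj_K + proj_polar))   # True
print(np.isclose(proj_K @ proj_polar, 0.0))  # True: components are orthogonal
```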
The proximal operator of the indicator $I_C$ of a closed convex set $C$ is the projection operator $\Pi_C$ onto $C$. In this sense, proximal operators can be viewed as generalized projections.
The Moreau envelope of the function $f$, with parameter $\lambda$, is defined as

$$M_{\lambda f}(v) = \inf_x \left(f(x) + \frac{1}{2\lambda}\norm{x - v}_2^2\right).$$
The point attaining the infimum above is $\prox{\lambda f}(v)$. The Moreau envelope is the infimal convolution of $f$ and $\frac{1}{2\lambda}\norm{\cdot}_2^2$, where the infimal convolution of functions $f$ and $g$ is defined as

$$(f \,\square\, g)(v) = \inf_x \left(f(x) + g(v - x)\right).$$
A useful fact is that $(f \,\square\, g)^* = f^* + g^*$. This means that $M_{\lambda f} = \left(f^* + \frac{\lambda}{2}\norm{\cdot}_2^2\right)^*$. Because the conjugate of a CCP function is smooth when the function is strongly convex, $M_{\lambda f}$ can be viewed as a smoothed version of $f$. Finally, using facts about conjugate functions, it can be seen that

$$\grad M_{\lambda f}(v) = \frac{1}{\lambda}\left(v - \prox{\lambda f}(v)\right).$$
This provides a useful interpretation: rearranging gives $\prox{\lambda f}(v) = v - \lambda \grad M_{\lambda f}(v)$, so iteration of the proximal operator is essentially gradient descent on a smoothed form of $f$.
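As a sanity check (my example, not from the original text): for $f = |\cdot|$ on $\Reals$, the Moreau envelope is the Huber function, and the gradient formula above can be compared against a finite-difference approximation:

```python
import numpy as np

lam = 0.5
prox = lambda v: np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)  # prox of lam*|.|
envelope = lambda v: np.abs(prox(v)) + (prox(v) - v) ** 2 / (2 * lam)

v, h = 0.3, 1e-6
grad_formula = (v - prox(v)) / lam                        # (v - prox(v)) / lam
grad_fd = (envelope(v + h) - envelope(v - h)) / (2 * h)   # central difference
print(np.isclose(grad_formula, grad_fd, atol=1e-5))       # True
```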
It is easy to show that

$$\prox{\lambda f}(v) = (I + \lambda \partial f)^{-1}(v).$$
The right-hand side is called the resolvent of the operator $\lambda \partial f$. It is in fact a function (i.e., it is single-valued), because the proximal operator furnishes the minimizer of a strongly convex function.
The proximal operator, evaluated at $v$, for the first-order Taylor expansion of a function $f$ near a point $v$ is $v - \lambda \grad f(v)$, a gradient step; the operator for the second-order Taylor expansion is $v - \left(\grad^2 f(v) + \frac{1}{\lambda} I\right)^{-1} \grad f(v)$, a regularized Newton (Levenberg–Marquardt) step.
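A small check of the second-order statement (a sketch with synthetic data): for a convex quadratic, the second-order Taylor expansion is the function itself, so its prox should coincide exactly with the regularized Newton step.

```python
import numpy as np

rng = np.random.default_rng(3)
n, lam = 5, 0.7
M = rng.normal(size=(n, n))
A = M @ M.T + np.eye(n)          # positive definite Hessian
b, v = rng.normal(size=n), rng.normal(size=n)

grad = A @ v - b                 # gradient of f(x) = x'Ax/2 - b'x at v
# prox of lam*f at v, by solving the quadratic optimality condition:
prox_v = np.linalg.solve(A + np.eye(n) / lam, b + v / lam)
# regularized Newton (Levenberg-Marquardt) step from v:
newton_v = v - np.linalg.solve(A + np.eye(n) / lam, grad)
print(np.allclose(prox_v, newton_v))  # True
```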
The proximal minimization algorithm is a fixed point iteration of a proximal operator:

$$x^{k+1} = \prox{\lambda f}(x^k).$$
The iteration converges to a fixed point (provided $f$ has a minimizer) because the proximal operator of a CCP function is firmly nonexpansive.
The proximal minimization algorithm can be interpreted as gradient descent on the Moreau envelope of $f$. It can also be interpreted as disappearing quadratic regularization, in that as the iterates converge to a fixed point, the regularization term $\frac{1}{2\lambda}\norm{x - x^k}_2^2$ goes to $0$.
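A minimal sketch of proximal minimization, assuming a convex quadratic $f(x) = \frac{1}{2}x^TAx - b^Tx$ so that the proximal operator reduces to a linear solve (the data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(4)
n, lam = 5, 1.0
M = rng.normal(size=(n, n))
A = M @ M.T + np.eye(n)          # positive definite
b = rng.normal(size=n)

# f(x) = x'Ax/2 - b'x; prox_{lam f}(v) solves (A + I/lam) x = b + v/lam.
x = np.zeros(n)
for _ in range(200):
    x = np.linalg.solve(A + np.eye(n) / lam, b + x / lam)  # x <- prox_{lam f}(x)

print(np.allclose(x, np.linalg.solve(A, b)))  # True: converged to the minimizer
```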
Gradient descent and proximal minimization correspond to algorithms obtained by discretizing the gradient flow equation

$$\frac{d}{dt}x(t) = -\grad f(x(t)).$$
A forward discretization yields the gradient descent update

$$\frac{x^{k+1} - x^k}{h} = -\grad f(x^k) \quad \Longleftrightarrow \quad x^{k+1} = x^k - h\grad f(x^k),$$
while a backward discretization yields the proximal update

$$\frac{x^{k+1} - x^k}{h} = -\grad f(x^{k+1}) \quad \Longleftrightarrow \quad x^{k+1} = (I + h\grad f)^{-1}(x^k) = \prox{h f}(x^k).$$
The proximal gradient method is an algorithm for solving

$$\underset{x}{\text{minimize}} \quad f(x) + g(x),$$
where $f$ is differentiable and $g$ is proximable (and both are CCP). The algorithm is the iteration

$$x^{k+1} = \prox{\lambda g}\left(x^k - \lambda \grad f(x^k)\right).$$
When $g$ is the indicator of a convex set, the proximal gradient method reduces to the projected gradient method. The proximal gradient method converges at rate $O(1/k)$ if $\grad f$ is Lipschitz continuous with constant $L$, as long as the step sizes are no larger than $1/L$.
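A sketch of the proximal gradient iteration for a lasso-type problem, with $f(x) = \frac{1}{2}\norm{Ax - b}_2^2$ and $g = \gamma\norm{\cdot}_1$ (the prox of $g$ is soft thresholding); the problem data here is synthetic:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, gamma = 20, 10, 0.1
A, b = rng.normal(size=(m, n)), rng.normal(size=m)

L = np.linalg.eigvalsh(A.T @ A).max()   # Lipschitz constant of grad f
t = 1.0 / L                             # step size
soft = lambda w, k: np.sign(w) * np.maximum(np.abs(w) - k, 0.0)

x = np.zeros(n)
for _ in range(500):
    grad = A.T @ (A @ x - b)            # gradient of f(x) = ||Ax - b||^2 / 2
    x = soft(x - t * grad, t * gamma)   # prox of (t * gamma) * ||.||_1

obj = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + gamma * np.abs(x).sum()
print(obj(x))  # objective value after 500 proximal gradient iterations
```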
The proximal gradient method can be interpreted as a fixed point iteration of the forward-backward operator

$$(I + \lambda \partial g)^{-1}(I - \lambda \grad f).$$
As was the case for proximal minimization, the proximal gradient method can be viewed as a method for solving gradient flows. The gradient flow system is

$$\frac{d}{dt}x(t) = -\grad f(x(t)) - \grad g(x(t)).$$
We discretize the flow as

$$\frac{x^{k+1} - x^k}{h} = -\grad f(x^k) - \grad g(x^{k+1});$$
note that we use a forward discretization for the gradient of $f$ and a backward discretization for the gradient of $g$. Rearranging, we obtain

$$x^{k+1} = (I + h\grad g)^{-1}\left(x^k - h\grad f(x^k)\right) = \prox{h g}\left(x^k - h\grad f(x^k)\right).$$
The resolvent $(I + h\grad g)^{-1}$ is called a backward operator, and the right-hand term $I - h\grad f$ a forward operator.
The proximal gradient method can be accelerated by changing the update to

$$\begin{aligned} y^{k+1} &= x^k + \omega^k (x^k - x^{k-1}), \\ x^{k+1} &= \prox{\lambda^k g}\left(y^{k+1} - \lambda^k \grad f(y^{k+1})\right). \end{aligned}$$
For suitably chosen $\omega^k$ (e.g., $\omega^k = \frac{k}{k+3}$) and $\lambda^k$, this method has an optimal (for first-order methods) rate: $O(1/k^2)$.
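The same lasso example as above, with the momentum term added (again a sketch on synthetic data; $\omega^k = k/(k+3)$ and a fixed step size $\lambda^k = 1/L$):

```python
import numpy as np

rng = np.random.default_rng(6)
m, n, gamma = 20, 10, 0.1
A, b = rng.normal(size=(m, n)), rng.normal(size=m)
t = 1.0 / np.linalg.eigvalsh(A.T @ A).max()
soft = lambda w, k: np.sign(w) * np.maximum(np.abs(w) - k, 0.0)

x = x_prev = np.zeros(n)
for k in range(500):
    y = x + (k / (k + 3)) * (x - x_prev)             # momentum (extrapolation) step
    x_prev = x
    x = soft(y - t * A.T @ (A @ y - b), t * gamma)   # proximal gradient step at y

print(0.5 * np.sum((A @ x - b) ** 2) + gamma * np.abs(x).sum())
```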
ADMM is an algorithm for solving

$$\underset{x}{\text{minimize}} \quad f(x) + g(x),$$
where $f$ and $g$ are CCP. The algorithm is the iteration

$$\begin{aligned} x^{k+1} &= \prox{\lambda f}(z^k - u^k), \\ z^{k+1} &= \prox{\lambda g}(x^{k+1} + u^k), \\ u^{k+1} &= u^k + x^{k+1} - z^{k+1}. \end{aligned}$$
Under mild assumptions, the sequences $x^k$ and $z^k$ converge to the same point. Hence, $u^k$ can be interpreted as a running sum of the errors $x^{k+1} - z^{k+1}$, and ADMM can be interpreted as an integral control method. ADMM can also be understood as an algorithm for minimizing the augmented Lagrangian of the lifted problem

$$\optmin{x,\, z}{f(x) + g(z)}{x = z,}$$
in which case $u^k$ can be interpreted as a sequence converging to the dual variable for the equality constraint (up to a constant scale factor).
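A sketch of the ADMM iteration above on the same lasso-type problem ($f(x) = \frac{1}{2}\norm{Ax - b}_2^2$, $g = \gamma\norm{\cdot}_1$, synthetic data); both proximal operators have closed forms:

```python
import numpy as np

rng = np.random.default_rng(7)
m, n, gamma, lam = 20, 10, 0.1, 1.0
A, b = rng.normal(size=(m, n)), rng.normal(size=m)
soft = lambda w, k: np.sign(w) * np.maximum(np.abs(w) - k, 0.0)

# Form the x-update system once: prox_{lam f}(v) solves (A'A + I/lam) x = A'b + v/lam.
H = A.T @ A + np.eye(n) / lam
Atb = A.T @ b

x = z = u = np.zeros(n)
for _ in range(300):
    x = np.linalg.solve(H, Atb + (z - u) / lam)  # x <- prox_{lam f}(z - u)
    z = soft(x + u, lam * gamma)                 # z <- prox_{lam g}(x + u)
    u = u + x - z                                # running sum of errors (dual update)

print(np.linalg.norm(x - z))  # primal residual, near 0 at convergence
```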
ADMM is a fixed-point iteration for finding a point satisfying

$$0 \in \partial f(x) + \partial g(x).$$
Fixed points satisfy

$$\begin{aligned} x &= \prox{\lambda f}(z - u), \\ z &= \prox{\lambda g}(x + u), \\ u &= u + x - z. \end{aligned}$$
The last equation implies $x = z$. Hence the fixed point conditions are

$$x = \prox{\lambda f}(x - u), \qquad x = \prox{\lambda g}(x + u),$$
which in turn mean

$$-\frac{u}{\lambda} \in \partial f(x), \qquad \frac{u}{\lambda} \in \partial g(x),$$
and in particular

$$0 = -\frac{u}{\lambda} + \frac{u}{\lambda} \in \partial f(x) + \partial g(x).$$
This implies that $0 \in \partial (f + g)(x)$, i.e., $x$ is optimal.