$$ \newcommand{\qed}{\tag*{$\square$}} \newcommand{\span}{\operatorname{span}} \newcommand{\dim}{\operatorname{dim}} \newcommand{\rank}{\operatorname{rank}} \newcommand{\norm}[1]{\|#1\|} \newcommand{\grad}{\nabla} \newcommand{\prox}[1]{\operatorname{prox}_{#1}} \newcommand{\inner}[2]{\langle{#1}, {#2}\rangle} \newcommand{\mat}[1]{\mathcal{M}[#1]} \newcommand{\null}[1]{\operatorname{null} \left(#1\right)} \newcommand{\range}[1]{\operatorname{range} \left(#1\right)} \newcommand{\rowvec}[1]{\begin{bmatrix} #1 \end{bmatrix}^T} \newcommand{\Reals}{\mathbf{R}} \newcommand{\RR}{\mathbf{R}} \newcommand{\Complex}{\mathbf{C}} \newcommand{\Field}{\mathbf{F}} \newcommand{\Pb}{\operatorname{Pr}} \newcommand{\E}[1]{\operatorname{E}[#1]} \newcommand{\Var}[1]{\operatorname{Var}[#1]} \newcommand{\argmin}[2]{\underset{#1}{\operatorname{argmin}} {#2}} \newcommand{\optmin}[3]{ \begin{align*} & \underset{#1}{\text{minimize}} & & #2 \\ & \text{subject to} & & #3 \end{align*} } \newcommand{\optmax}[3]{ \begin{align*} & \underset{#1}{\text{maximize}} & & #2 \\ & \text{subject to} & & #3 \end{align*} } \newcommand{\optfind}[2]{ \begin{align*} & {\text{find}} & & #1 \\ & \text{subject to} & & #2 \end{align*} } $$
Let $f : \Reals^n \to \Reals \cup \{+\infty\}$ be a closed, convex, proper (CCP) function (that is, its epigraph is a nonempty closed convex set), and let $\lambda$ be a positive scalar. The proximal operator of the function $f$ is defined as

$$\prox{\lambda f}(v) = \argmin{x}{\left(f(x) + \frac{1}{2\lambda}\norm{x - v}_2^2\right)}.$$
Because the function minimized above is strongly convex, it has a unique minimizer for every $v \in \Reals^n$. The name “proximal” comes from the observation that $\prox{\lambda f}(v)$ is found by trading off minimizing $f$ and proximity to $v$.
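For instance, when $f = \norm{\cdot}_1$, the proximal operator has a well-known closed form, soft thresholding: $\prox{\lambda \norm{\cdot}_1}(v)_i = \operatorname{sign}(v_i)\max(|v_i| - \lambda,\, 0)$. Here is a minimal NumPy sketch (my illustration, not part of the original text) that checks the closed form against a brute-force minimization of the defining objective in one dimension:

```python
import numpy as np

def soft_threshold(v, lam):
    """Closed-form prox of lam * |.|, applied elementwise for the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

# Brute-force check in 1-D: minimize |x| + (x - v)^2 / (2 lam) over a fine grid.
v, lam = 0.8, 0.5
grid = np.linspace(-3, 3, 2_000_001)
objective = np.abs(grid) + (grid - v) ** 2 / (2 * lam)
print(grid[objective.argmin()], soft_threshold(v, lam))  # both approximately 0.3
```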
If $f \equiv 0$, then $\prox{\lambda f}(v) = v$.
If $f(x) = \inner{a}{x} + b$, with $a \in \Reals^n$ and $b \in \Reals$, then $\prox{\lambda f}(v) = v - \lambda a$.
If $f(x) = g(Qx)$, where $Q$ is an orthogonal matrix, then $\prox{\lambda f}(v) = Q^T \prox{\lambda g}(Qv)$.
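These properties are easy to check numerically. The sketch below is my own illustration; `is_minimizer` is a hypothetical helper that tests local optimality of a candidate prox point by random perturbation:

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam = 4, 0.3
v = rng.normal(size=n)

def prox_objective(f, x, v, lam):
    return f(x) + np.sum((x - v) ** 2) / (2 * lam)

def is_minimizer(f, cand, v, lam, trials=10000, scale=1e-3):
    """Sanity check: no nearby random perturbation improves the prox objective."""
    base = prox_objective(f, cand, v, lam)
    return all(prox_objective(f, cand + scale * rng.normal(size=n), v, lam) >= base
               for _ in range(trials))

# Property: f(x) = <a, x> + b  =>  prox_{lam f}(v) = v - lam * a.
a, b = rng.normal(size=n), 1.7
print(is_minimizer(lambda x: a @ x + b, v - lam * a, v, lam))  # True

# Property: f(x) = g(Qx), Q orthogonal  =>  prox_{lam f}(v) = Q^T prox_{lam g}(Qv).
soft = lambda w: np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)  # prox of lam*||.||_1
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
print(is_minimizer(lambda x: np.abs(Q @ x).sum(), Q.T @ soft(Q @ v), v, lam))  # True
```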
Assume $f$ is CCP. The minimizers of $f$ are precisely the fixed points of $\prox{\lambda f}$. To see one direction, let $x^\star$ be a minimizer of $f$. Then

$$f(x^\star) + \frac{1}{2\lambda}\norm{x^\star - x^\star}_2^2 = f(x^\star) \leq f(x) + \frac{1}{2\lambda}\norm{x - x^\star}_2^2 \quad \text{for all } x,$$

so the proximal point of $x^\star$ with respect to $\lambda f$ is $x^\star$ itself.
For the other direction, we’ll assume $f$ is subdifferentiable everywhere on its domain. Assume $x$ is a fixed point of $\prox{\lambda f}$. Note that $x = \prox{\lambda f}(x)$ means that $0 \in \partial f(x) + \frac{1}{\lambda}(x - x) = \partial f(x)$. If $0 \in \partial f(x)$, then $f(y) \geq f(x)$ for all $y$, i.e., $x$ minimizes $f$.
The proximal operator of a CCP function is firmly nonexpansive; this means that repeatedly applying $\prox{\lambda f}$ will yield a fixed point (and thus a minimizer) of $f$, provided a minimizer exists.
The Moreau decomposition connects proximal operators with convex duality. The decomposition is

$$v = \prox{f}(v) + \prox{f^*}(v),$$

where $f^*$ is the convex conjugate of $f$.
Proof (source: these slides). Let $u = \prox{f}(v)$. By the optimality condition for the proximal minimization, $v - u \in \partial f(u)$, which holds if and only if $u \in \partial f^*(v - u)$; this in turn says that $v - u$ minimizes $w \mapsto f^*(w) + \frac{1}{2}\norm{w - v}_2^2$, i.e., $v - u = \prox{f^*}(v)$. Hence $v = u + (v - u) = \prox{f}(v) + \prox{f^*}(v)$. $\square$
The Moreau decomposition generalizes the notion of orthogonal complements of subspaces. If $L$ is a subspace and $L^\perp$ is its orthogonal complement, then $v = \Pi_L(v) + \Pi_{L^\perp}(v)$ ($\Pi$ is the orthogonal projection operator). This follows from the Moreau decomposition by noting that the conjugate of the indicator function $I_L$ is $I_{L^\perp}$, $\prox{I_L} = \Pi_L$, and $\prox{I_{L^\perp}} = \Pi_{L^\perp}$.
In fact, the Moreau decomposition shows how convex cones play a role analogous to subspaces. If $K$ is a convex cone and $K^\circ$ is its polar cone ($K^\circ = \{y : \inner{y}{x} \leq 0 \ \text{for all}\ x \in K\}$), then

$$v = \Pi_K(v) + \Pi_{K^\circ}(v),$$

by an argument similar to the one made for orthogonal complements of subspaces. In particular,

$$\inner{\Pi_K(v)}{\Pi_{K^\circ}(v)} = 0,$$

that is, $\Pi_K(v) \perp \Pi_{K^\circ}(v)$.
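For a concrete instance, take $K$ to be the nonnegative orthant, whose polar cone is the nonpositive orthant; projection onto each is elementwise clipping. A quick NumPy check (my example) of the decomposition and the orthogonality of its components:

```python
import numpy as np

# Moreau cone decomposition, checked for K = R^n_+ (whose polar cone is R^n_-):
# v = Pi_K(v) + Pi_{K°}(v), with the two components orthogonal.
rng = np.random.default_rng(2)
v = rng.normal(size=6)
proj_K = np.maximum(v, 0.0)       # projection onto the nonnegative orthant
proj_polar = np.minimum(v, 0.0)   # projection onto the polar (nonpositive) cone
print(np.allclose(v, proj_K + proj_polar))   # True
print(np.isclose(proj_K @ proj_polar, 0.0))  # True: components are orthogonal
```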
The proximal operator of the indicator $I_C$ of a closed convex set $C$ is the projection operator $\Pi_C$ onto $C$. In this sense, proximal operators can be viewed as generalized projections.
The Moreau envelope of the function $f$, with parameter $\lambda$, is defined as

$$M_{\lambda f}(v) = \inf_x \left(f(x) + \frac{1}{2\lambda}\norm{x - v}_2^2\right).$$
The point attaining the infimum above is $\prox{\lambda f}(v)$. The Moreau envelope is the infimal convolution of $f$ and $\frac{1}{2\lambda}\norm{\cdot}_2^2$, where the infimal convolution of functions $f$ and $g$ is defined as

$$(f \,\square\, g)(v) = \inf_x \left(f(x) + g(v - x)\right).$$
A useful fact is that $(f \,\square\, g)^* = f^* + g^*$. This means that $M_{\lambda f} = \left(f^* + \frac{\lambda}{2}\norm{\cdot}_2^2\right)^*$. Because the conjugate of a CCP function is smooth when the function is strongly convex, $M_{\lambda f}$ can be viewed as a smoothed version of $f$. Finally, using facts about conjugate functions, it can be seen that

$$\grad M_{\lambda f}(v) = \frac{1}{\lambda}\left(v - \prox{\lambda f}(v)\right).$$
This provides a useful interpretation: rearranging gives $\prox{\lambda f}(v) = v - \lambda \grad M_{\lambda f}(v)$, so iteration of the proximal operator is essentially gradient descent on a smoothed form of $f$.
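As a sanity check (my example, not from the original text): for $f = |\cdot|$ on $\Reals$, the Moreau envelope is the Huber function, and the gradient formula above can be compared against a finite-difference approximation:

```python
import numpy as np

lam = 0.5
prox = lambda v: np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)  # prox of lam*|.|
envelope = lambda v: np.abs(prox(v)) + (prox(v) - v) ** 2 / (2 * lam)

v, h = 0.3, 1e-6
grad_formula = (v - prox(v)) / lam                        # (v - prox(v)) / lam
grad_fd = (envelope(v + h) - envelope(v - h)) / (2 * h)   # central difference
print(np.isclose(grad_formula, grad_fd, atol=1e-5))       # True
```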
It is easy to show that

$$\prox{\lambda f}(v) = (I + \lambda \partial f)^{-1}(v).$$
The right-hand side is called the resolvent of the operator $\lambda \partial f$. It is in fact a function (i.e., it is single-valued), because the proximal operator furnishes the minimizer of a strongly convex function.
The proximal operator, evaluated at $v$, for the first-order Taylor expansion of a function $f$ near a point $v$ is $v - \lambda \grad f(v)$, a gradient step; the operator for the second-order Taylor expansion is $v - \left(\grad^2 f(v) + \frac{1}{\lambda} I\right)^{-1} \grad f(v)$, a regularized Newton (Levenberg–Marquardt) step.
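A small check of the second-order statement (a sketch with synthetic data): for a convex quadratic, the second-order Taylor expansion is the function itself, so its prox should coincide exactly with the regularized Newton step.

```python
import numpy as np

rng = np.random.default_rng(3)
n, lam = 5, 0.7
M = rng.normal(size=(n, n))
A = M @ M.T + np.eye(n)          # positive definite Hessian
b, v = rng.normal(size=n), rng.normal(size=n)

grad = A @ v - b                 # gradient of f(x) = x'Ax/2 - b'x at v
# prox of lam*f at v, by solving the quadratic optimality condition:
prox_v = np.linalg.solve(A + np.eye(n) / lam, b + v / lam)
# regularized Newton (Levenberg-Marquardt) step from v:
newton_v = v - np.linalg.solve(A + np.eye(n) / lam, grad)
print(np.allclose(prox_v, newton_v))  # True
```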
The proximal minimization algorithm is a fixed point iteration of a proximal operator:

$$x^{k+1} = \prox{\lambda f}(x^k).$$
The iteration converges to a fixed point (provided $f$ has a minimizer) because the proximal operator of a CCP function is firmly nonexpansive.
The proximal minimization algorithm can be interpreted as gradient descent on the Moreau envelope of $f$. It can also be interpreted as disappearing quadratic regularization, in that as the iterates converge to a fixed point, the regularization term $\frac{1}{2\lambda}\norm{x - x^k}_2^2$ goes to $0$.
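A minimal sketch of proximal minimization, assuming a convex quadratic $f(x) = \frac{1}{2}x^TAx - b^Tx$ so that the proximal operator reduces to a linear solve (the data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(4)
n, lam = 5, 1.0
M = rng.normal(size=(n, n))
A = M @ M.T + np.eye(n)          # positive definite
b = rng.normal(size=n)

# f(x) = x'Ax/2 - b'x; prox_{lam f}(v) solves (A + I/lam) x = b + v/lam.
x = np.zeros(n)
for _ in range(200):
    x = np.linalg.solve(A + np.eye(n) / lam, b + x / lam)  # x <- prox_{lam f}(x)

print(np.allclose(x, np.linalg.solve(A, b)))  # True: converged to the minimizer
```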
Gradient descent and proximal minimization correspond to algorithms obtained by discretizing the gradient flow equation

$$\frac{d}{dt}x(t) = -\grad f(x(t)).$$
A forward discretization yields the gradient descent update

$$\frac{x^{k+1} - x^k}{h} = -\grad f(x^k) \quad \Longleftrightarrow \quad x^{k+1} = x^k - h\grad f(x^k),$$
while a backward discretization yields the proximal update

$$\frac{x^{k+1} - x^k}{h} = -\grad f(x^{k+1}) \quad \Longleftrightarrow \quad x^{k+1} = (I + h\grad f)^{-1}(x^k) = \prox{h f}(x^k).$$
The proximal gradient method is an algorithm for solving

$$\underset{x}{\text{minimize}} \quad f(x) + g(x),$$
where $f$ is differentiable and $g$ is proximable (and both are CCP). The algorithm is the iteration

$$x^{k+1} = \prox{\lambda g}\left(x^k - \lambda \grad f(x^k)\right).$$
When $g$ is the indicator of a convex set, the proximal gradient method reduces to the projected gradient method. The proximal gradient method converges at rate $O(1/k)$ if $\grad f$ is Lipschitz continuous with constant $L$, as long as the step sizes are no larger than $1/L$.
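A sketch of the proximal gradient iteration for a lasso-type problem, with $f(x) = \frac{1}{2}\norm{Ax - b}_2^2$ and $g = \gamma\norm{\cdot}_1$ (the prox of $g$ is soft thresholding); the problem data here is synthetic:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, gamma = 20, 10, 0.1
A, b = rng.normal(size=(m, n)), rng.normal(size=m)

L = np.linalg.eigvalsh(A.T @ A).max()   # Lipschitz constant of grad f
t = 1.0 / L                             # step size
soft = lambda w, k: np.sign(w) * np.maximum(np.abs(w) - k, 0.0)

x = np.zeros(n)
for _ in range(500):
    grad = A.T @ (A @ x - b)            # gradient of f(x) = ||Ax - b||^2 / 2
    x = soft(x - t * grad, t * gamma)   # prox of (t * gamma) * ||.||_1

obj = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + gamma * np.abs(x).sum()
print(obj(x))  # objective value after 500 proximal gradient iterations
```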
The proximal gradient method can be interpreted as a fixed point iteration of the forward-backward operator

$$(I + \lambda \partial g)^{-1}(I - \lambda \grad f).$$
As was the case for proximal minimization, the proximal gradient method can be viewed as a method for solving gradient flows. The gradient flow system is

$$\frac{d}{dt}x(t) = -\grad f(x(t)) - \grad g(x(t)).$$
We discretize the flow as

$$\frac{x^{k+1} - x^k}{h} = -\grad f(x^k) - \grad g(x^{k+1});$$
note that we use a forward discretization for the gradient of $f$ and a backward discretization for the gradient of $g$. Rearranging, we obtain

$$x^{k+1} = (I + h\grad g)^{-1}\left(x^k - h\grad f(x^k)\right) = \prox{h g}\left(x^k - h\grad f(x^k)\right).$$
The resolvent $(I + h\grad g)^{-1}$ is called a backward operator, and the right-hand term $I - h\grad f$ a forward operator.
The proximal gradient method can be accelerated by changing the update to

$$\begin{aligned} y^{k+1} &= x^k + \omega^k (x^k - x^{k-1}), \\ x^{k+1} &= \prox{\lambda^k g}\left(y^{k+1} - \lambda^k \grad f(y^{k+1})\right). \end{aligned}$$
For suitably chosen $\omega^k$ (e.g., $\omega^k = \frac{k}{k+3}$) and $\lambda^k$, this method has an optimal (for first-order methods) rate: $O(1/k^2)$.
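The same lasso example as above, with the momentum term added (again a sketch on synthetic data; $\omega^k = k/(k+3)$ and a fixed step size $\lambda^k = 1/L$):

```python
import numpy as np

rng = np.random.default_rng(6)
m, n, gamma = 20, 10, 0.1
A, b = rng.normal(size=(m, n)), rng.normal(size=m)
t = 1.0 / np.linalg.eigvalsh(A.T @ A).max()
soft = lambda w, k: np.sign(w) * np.maximum(np.abs(w) - k, 0.0)

x = x_prev = np.zeros(n)
for k in range(500):
    y = x + (k / (k + 3)) * (x - x_prev)             # momentum (extrapolation) step
    x_prev = x
    x = soft(y - t * A.T @ (A @ y - b), t * gamma)   # proximal gradient step at y

print(0.5 * np.sum((A @ x - b) ** 2) + gamma * np.abs(x).sum())
```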
ADMM is an algorithm for solving

$$\underset{x}{\text{minimize}} \quad f(x) + g(x),$$
where $f$ and $g$ are CCP. The algorithm is the iteration

$$\begin{aligned} x^{k+1} &= \prox{\lambda f}(z^k - u^k), \\ z^{k+1} &= \prox{\lambda g}(x^{k+1} + u^k), \\ u^{k+1} &= u^k + x^{k+1} - z^{k+1}. \end{aligned}$$
Under mild assumptions, the sequences $x^k$ and $z^k$ converge to the same point. Hence, $u^k$ can be interpreted as a running sum of the errors $x^{k+1} - z^{k+1}$, and ADMM can be interpreted as an integral control method. ADMM can also be understood as an algorithm for minimizing the augmented Lagrangian of the lifted problem

$$\optmin{x,\, z}{f(x) + g(z)}{x = z,}$$
in which case $u^k$ can be interpreted as a sequence converging to the dual variable for the equality constraint (up to a constant scale factor).
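A sketch of the ADMM iteration above on the same lasso-type problem ($f(x) = \frac{1}{2}\norm{Ax - b}_2^2$, $g = \gamma\norm{\cdot}_1$, synthetic data); both proximal operators have closed forms:

```python
import numpy as np

rng = np.random.default_rng(7)
m, n, gamma, lam = 20, 10, 0.1, 1.0
A, b = rng.normal(size=(m, n)), rng.normal(size=m)
soft = lambda w, k: np.sign(w) * np.maximum(np.abs(w) - k, 0.0)

# Form the x-update system once: prox_{lam f}(v) solves (A'A + I/lam) x = A'b + v/lam.
H = A.T @ A + np.eye(n) / lam
Atb = A.T @ b

x = z = u = np.zeros(n)
for _ in range(300):
    x = np.linalg.solve(H, Atb + (z - u) / lam)  # x <- prox_{lam f}(z - u)
    z = soft(x + u, lam * gamma)                 # z <- prox_{lam g}(x + u)
    u = u + x - z                                # running sum of errors (dual update)

print(np.linalg.norm(x - z))  # primal residual, near 0 at convergence
```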
ADMM is a fixed-point iteration for finding a point satisfying

$$0 \in \partial f(x) + \partial g(x).$$
Fixed points satisfy

$$\begin{aligned} x &= \prox{\lambda f}(z - u), \\ z &= \prox{\lambda g}(x + u), \\ u &= u + x - z. \end{aligned}$$
The last equation implies $x = z$. Hence the fixed point conditions are

$$x = \prox{\lambda f}(x - u), \qquad x = \prox{\lambda g}(x + u),$$
which in turn mean

$$-\frac{u}{\lambda} \in \partial f(x), \qquad \frac{u}{\lambda} \in \partial g(x),$$
and in particular

$$0 = -\frac{u}{\lambda} + \frac{u}{\lambda} \in \partial f(x) + \partial g(x).$$
This implies that $0 \in \partial (f + g)(x)$, i.e., $x$ is optimal.