$$ \newcommand{\pmi}{\operatorname{pmi}} \newcommand{\inner}[2]{\langle{#1}, {#2}\rangle} \newcommand{\Pb}{\operatorname{Pr}} \newcommand{\E}{\mathbb{E}} \newcommand{\RR}{\mathbf{R}} \newcommand{\script}[1]{\mathcal{#1}} \newcommand{\Set}[2]{\{{#1} : {#2}\}} \newcommand{\argmin}[2]{\underset{#1}{\operatorname{argmin}} {#2}} \newcommand{\optmin}[3]{ \begin{align*} & \underset{#1}{\text{minimize}} & & #2 \\ & \text{subject to} & & #3 \end{align*} } \newcommand{\optmax}[3]{ \begin{align*} & \underset{#1}{\text{maximize}} & & #2 \\ & \text{subject to} & & #3 \end{align*} } \newcommand{\optfind}[2]{ \begin{align*} & {\text{find}} & & #1 \\ & \text{subject to} & & #2 \end{align*} } $$
Nut graf: fast training prevents overfitting, and the paper gives a mathematical way to state this.
Recall a definition: Let $S = (z_1, \ldots, z_n)$ be the training data, and assume both $S$ and the test data are generated by draws from the same distribution $\script{D}$. Let $R_S[w] := \frac{1}{n} \sum_{i=1}^{n} f(w; z_i)$ be the error incurred by a model parametrized by $w$ on a set of samples $S$, where each sample $z_i$ is drawn from $\script{D}$ and $f(w; z)$ is the loss on a single example $z$. The generalization error is defined as

$$ \epsilon_{\text{gen}}(w) := \E_{z \sim \script{D}}\left[ f(w; z) \right] - R_S[w]. $$

The former term is the population risk $R[w]$ and the latter term is the empirical risk $R_S[w]$. If the weights $w = A(S)$, with $S$ a sample and $A$ a randomized algorithm, then the expected generalization error is defined as the expectation of the generalization error, taken over the randomness in $A$ and $S$. And if two samples $S$ and $S'$ differ in at most one example, then $A$ is $\epsilon$-uniformly stable if

$$ \sup_{z} \; \E_A\left[ f(A(S); z) - f(A(S'); z) \right] \leq \epsilon, $$

where $f(A(S); z)$ is the loss function evaluated at the example $z$ with weights $A(S)$. Key idea: such an algorithm has expected generalization error bounded by $\epsilon$.
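To make the definition concrete, here is a minimal sketch (my own, not from the paper) of how one might probe uniform stability numerically: run SGD with shared randomness on two datasets $S$ and $S'$ that differ in a single example, then take the worst average loss gap over a small grid of probe points $z$. The model, loss, and names (`sgd`, `squared_loss`) are hypothetical choices for illustration.

```python
# Empirically probe epsilon-uniform stability: same SGD randomness, two
# training sets differing in one example, compare losses of the two models.
import numpy as np

def squared_loss(w, z):
    """f(w; z) for a 1-D linear model with z = (x, y)."""
    x, y = z
    return 0.5 * (w * x - y) ** 2

def grad(w, z):
    x, y = z
    return (w * x - y) * x

def sgd(data, steps, step_size, rng):
    """Plain SGD: at each step, pick one example uniformly at random."""
    w = 0.0
    for _ in range(steps):
        i = rng.integers(len(data))
        w -= step_size * grad(w, data[i])
    return w

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=n)
Y = 2.0 * X + 0.1 * rng.normal(size=n)
S = list(zip(X, Y))

# S' differs from S in exactly one (perturbed) example.
S_prime = list(S)
S_prime[0] = (X[0], Y[0] + 5.0)

# Average over the algorithm's randomness (the example order), sharing seeds
# so both runs see the same sequence of sampled indices.
probe_points = [(x, 2.0 * x) for x in np.linspace(-2, 2, 21)]
gaps = np.zeros(len(probe_points))
runs = 50
for r in range(runs):
    w  = sgd(S,       steps=500, step_size=0.01, rng=np.random.default_rng(r))
    wp = sgd(S_prime, steps=500, step_size=0.01, rng=np.random.default_rng(r))
    gaps += [abs(squared_loss(w, z) - squared_loss(wp, z)) for z in probe_points]
gaps /= runs

# The max over probe points is a crude stand-in for the sup over z in the
# definition of epsilon-uniform stability.
print("empirical stability estimate:", gaps.max())
```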
The key results are stability bounds for the application of SGD to $\beta$-smooth, $L$-Lipschitz, convex functions and to certain classes of nonconvex functions. The results for the former require each step size $\alpha_t$ to be bounded by $2/\beta$. The results for the latter require that the function lie in $[0, 1]$, be $L$-Lipschitz and $\beta$-smooth, and that the sequence of step sizes be inversely related to the step number, $\alpha_t \leq c/t$ for some constant $c$.
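If I am reading the convex result correctly, the stability bound grows like $(2L^2/n) \sum_t \alpha_t$, so with a constant step size $\alpha \leq 2/\beta$ it is linear in the number of steps $T$ and shrinks with the sample size $n$. A back-of-the-envelope sketch of that arithmetic (the constants below are made up):

```python
# Evaluate the convex-case stability bound, which (as I read the paper)
# takes the form (2 L^2 / n) * sum_t alpha_t for step sizes alpha_t <= 2/beta.
# L, beta, n, and T are arbitrary illustrative values.
L, beta = 1.0, 10.0        # Lipschitz and smoothness constants of the loss
n = 50_000                 # number of training examples
alpha = 2.0 / beta         # largest step size the convex result allows

for T in (1_000, 10_000, 100_000):
    eps_stab = (2 * L**2 / n) * sum(alpha for _ in range(T))
    print(f"T = {T:>7}: stability (hence generalization) bound ~ {eps_stab:.3f}")
```

The point of the arithmetic: fewer passes over the data (smaller $T$) directly tightens the bound, which is the sense in which training faster generalizes better.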
The upshots are cool: regularization, gradient clipping, dropout, proximal steps, and model averaging all improve the authors’ bounds.