$$ \newcommand{\pmi}{\operatorname{pmi}} \newcommand{\inner}[2]{\langle{#1}, {#2}\rangle} \newcommand{\Pb}{\operatorname{Pr}} \newcommand{\E}{\mathbb{E}} \newcommand{\RR}{\mathbf{R}} \newcommand{\script}[1]{\mathcal{#1}} \newcommand{\Set}[2]{\{{#1} : {#2}\}} \newcommand{\argmin}[2]{\underset{#1}{\operatorname{argmin}} {#2}} \newcommand{\optmin}[3]{ \begin{align*} & \underset{#1}{\text{minimize}} & & #2 \\ & \text{subject to} & & #3 \end{align*} } \newcommand{\optmax}[3]{ \begin{align*} & \underset{#1}{\text{maximize}} & & #2 \\ & \text{subject to} & & #3 \end{align*} } \newcommand{\optfind}[2]{ \begin{align*} & {\text{find}} & & #1 \\ & \text{subject to} & & #2 \end{align*} } $$
Nut graf: fast training prevents overfitting, and the paper gives a mathematical way to state this.
Recall a definition: Let $S = (z_1, \ldots, z_n)$ be the training data, and assume both $S$ and the test data are generated by draws from the same distribution $\script{D}$. Let $R_S[w] := \frac{1}{n} \sum_{i=1}^{n} f(w; z_i)$ be the error incurred by a model parametrized by $w$ on a set of samples $S$, where each sample $z_i$ is drawn from $\script{D}$ and $f(w; z)$ is the loss on a single example $z$. The generalization error is defined as

$$ \epsilon_{\text{gen}}(w) := \E_{z \sim \script{D}}\left[ f(w; z) \right] - R_S[w]. $$

The former term is the population risk $R[w]$ and the latter term is the empirical risk $R_S[w]$. If the weights $w = A(S)$, with $S$ a sample and $A$ a randomized algorithm, then the expected generalization error is defined as the expectation of the generalization error, taken over the randomness in $A$ and $S$. And if two samples $S$ and $S'$ differ in at most one example, then $A$ is $\epsilon$-uniformly stable if

$$ \sup_{z} \; \E_A\left[ f(A(S); z) - f(A(S'); z) \right] \leq \epsilon, $$

where $f(A(S); z)$ is the loss function evaluated at the example $z$ with weights $A(S)$. Key idea: such an algorithm has expected generalization error bounded by $\epsilon$.
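To make the definition concrete, here is a minimal sketch (my own, not from the paper) of how one might probe uniform stability numerically: run SGD with shared randomness on two datasets $S$ and $S'$ that differ in a single example, then take the worst average loss gap over a small grid of probe points $z$. The model, loss, and names (`sgd`, `squared_loss`) are hypothetical choices for illustration.

```python
# Empirically probe epsilon-uniform stability: same SGD randomness, two
# training sets differing in one example, compare losses of the two models.
import numpy as np

def squared_loss(w, z):
    """f(w; z) for a 1-D linear model with z = (x, y)."""
    x, y = z
    return 0.5 * (w * x - y) ** 2

def grad(w, z):
    x, y = z
    return (w * x - y) * x

def sgd(data, steps, step_size, rng):
    """Plain SGD: at each step, pick one example uniformly at random."""
    w = 0.0
    for _ in range(steps):
        i = rng.integers(len(data))
        w -= step_size * grad(w, data[i])
    return w

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=n)
Y = 2.0 * X + 0.1 * rng.normal(size=n)
S = list(zip(X, Y))

# S' differs from S in exactly one (perturbed) example.
S_prime = list(S)
S_prime[0] = (X[0], Y[0] + 5.0)

# Average over the algorithm's randomness (the example order), sharing seeds
# so both runs see the same sequence of sampled indices.
probe_points = [(x, 2.0 * x) for x in np.linspace(-2, 2, 21)]
gaps = np.zeros(len(probe_points))
runs = 50
for r in range(runs):
    w  = sgd(S,       steps=500, step_size=0.01, rng=np.random.default_rng(r))
    wp = sgd(S_prime, steps=500, step_size=0.01, rng=np.random.default_rng(r))
    gaps += [abs(squared_loss(w, z) - squared_loss(wp, z)) for z in probe_points]
gaps /= runs

# The max over probe points is a crude stand-in for the sup over z in the
# definition of epsilon-uniform stability.
print("empirical stability estimate:", gaps.max())
```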
The key results are stability bounds for the application of SGD to $\beta$-smooth, $L$-Lipschitz, convex functions and to certain classes of nonconvex functions. The results for the former require each step size $\alpha_t$ to be bounded by $2/\beta$. The results for the latter require that the function lie in $[0, 1]$, be $L$-Lipschitz and $\beta$-smooth, and that the sequence of step sizes be inversely related to the step number, $\alpha_t \leq c/t$ for some constant $c$.
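If I am reading the convex result correctly, the stability bound grows like $(2L^2/n) \sum_t \alpha_t$, so with a constant step size $\alpha \leq 2/\beta$ it is linear in the number of steps $T$ and shrinks with the sample size $n$. A back-of-the-envelope sketch of that arithmetic (the constants below are made up):

```python
# Evaluate the convex-case stability bound, which (as I read the paper)
# takes the form (2 L^2 / n) * sum_t alpha_t for step sizes alpha_t <= 2/beta.
# L, beta, n, and T are arbitrary illustrative values.
L, beta = 1.0, 10.0        # Lipschitz and smoothness constants of the loss
n = 50_000                 # number of training examples
alpha = 2.0 / beta         # largest step size the convex result allows

for T in (1_000, 10_000, 100_000):
    eps_stab = (2 * L**2 / n) * sum(alpha for _ in range(T))
    print(f"T = {T:>7}: stability (hence generalization) bound ~ {eps_stab:.3f}")
```

The point of the arithmetic: fewer passes over the data (smaller $T$) directly tightens the bound, which is the sense in which training faster generalizes better.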
The upshots are cool: regularization, gradient clipping, dropout, proximal steps, and model averaging all improve the authors’ bounds.