$$ \newcommand{\pmi}{\operatorname{pmi}} \newcommand{\inner}[2]{\langle{#1}, {#2}\rangle} \newcommand{\Pb}{\operatorname{Pr}} \newcommand{\E}{\mathbb{E}} \newcommand{\RR}{\mathbf{R}} \newcommand{\script}[1]{\mathcal{#1}} \newcommand{\Set}[2]{\{{#1} : {#2}\}} \newcommand{\argmin}[2]{\underset{#1}{\operatorname{argmin}} {#2}} \newcommand{\optmin}[3]{ \begin{align*} & \underset{#1}{\text{minimize}} & & #2 \\ & \text{subject to} & & #3 \end{align*} } \newcommand{\optmax}[3]{ \begin{align*} & \underset{#1}{\text{maximize}} & & #2 \\ & \text{subject to} & & #3 \end{align*} } \newcommand{\optfind}[2]{ \begin{align*} & {\text{find}} & & #1 \\ & \text{subject to} & & #2 \end{align*} } $$
The thesis:
Use word embeddings computed using one of the popular methods on unlabeled corpus like Wikipedia, represent the sentence by a weighted average of the word vectors, and then modify them a bit using PCA/SVD. This weighting improves performance by about 10% to 30% in textual similarity tasks, and beats sophisticated supervised methods including RNN’s and LSTMs … This simple method should be used as the baseline to beat in the future, especially when labeled training data is scarce or nonexistent.
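This paper uses a modified version of the generative model proposed in Arora et al. (2016) in order to obtain a closed-form estimate of the sentence vector. The sentence (discourse) vector $c_s$ is assumed to be time-invariant, i.e. fixed while the words of a sentence are emitted. Two modifications account for the observations that words sometimes appear out of context, and that some frequent words appear very often regardless of the topic of conversation. In math, the probability that word $w$ is emitted in sentence $s$ is
$$ \Pb[w \mid c_s] = \alpha\, p(w) + (1 - \alpha) \frac{\exp(\inner{\tilde{c}_s}{v_w})}{Z_{\tilde{c}_s}}, \qquad \tilde{c}_s = \beta c_0 + (1 - \beta) c_s, \quad c_0 \perp c_s, $$
where $p(w)$ is the unigram probability of $w$ in the corpus, $v_w$ is its word vector, $c_0$ is a common discourse vector shared by all sentences, $Z_{\tilde{c}_s} = \sum_{w'} \exp(\inner{\tilde{c}_s}{v_{w'}})$ is the normalizing constant, and $\alpha, \beta$ are scalar hyper-parameters. The $\alpha\, p(w)$ term lets words appear out of context, and $c_0$ captures frequent words that occur regardless of topic.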
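The MLE derivation is short, and the upshot is nice: up to scaling, the estimate of $\tilde{c}_s$ is a weighted average of the vectors of the words in the sentence,
$$ \tilde{c}_s \propto \sum_{w \in s} \frac{a}{p(w) + a}\, v_w, \qquad a = \frac{1 - \alpha}{\alpha Z}. $$
Here $v_w$ and $p(w)$ are the vector and unigram probability of word $w$, so frequent words are down-weighted; $Z$ is the partition function, taken to be roughly constant.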
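
A sketch of the derivation, in the paper's notation. Two assumptions are made here: the word vectors are roughly uniformly dispersed, so the partition function is approximately a constant $Z$, and $\tilde{c}_s$ is constrained to unit norm. The log-likelihood of a single word $w$ and its gradient at the origin are

$$ f_w(\tilde{c}_s) = \log\left[ \alpha\, p(w) + (1 - \alpha) \frac{\exp(\inner{v_w}{\tilde{c}_s})}{Z} \right], \qquad \nabla f_w(0) = \frac{(1 - \alpha)/Z}{\alpha\, p(w) + (1 - \alpha)/Z}\, v_w = \frac{a}{p(w) + a}\, v_w, $$

with $a = (1 - \alpha)/(\alpha Z)$. A first-order Taylor expansion around $\tilde{c}_s = 0$ then gives

$$ \sum_{w \in s} f_w(\tilde{c}_s) \approx \text{const} + \sum_{w \in s} \frac{a}{p(w) + a}\, \inner{v_w}{\tilde{c}_s}, $$

which is linear in $\tilde{c}_s$ and is therefore maximized over unit vectors by taking $\tilde{c}_s$ proportional to $\sum_{w \in s} \frac{a}{p(w) + a}\, v_w$.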
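In practice, the weighting parameter $a$ is treated as a hyper-parameter. The final sentence vector is obtained by subtracting out its projection onto the first principal component $u$ of the sentence vectors in the dataset, in order to “denoise” the data:
$$ v_s \leftarrow v_s - u u^\top v_s $$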
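
A minimal NumPy sketch of the procedure described above: each word is weighted by $a / (a + p(w))$, where $p(w)$ is its unigram probability, the weighted vectors are averaged, and the projection onto the first principal direction is removed. This is an illustration under stated assumptions, not the authors' code. The names `sif_embeddings`, `word_vectors` and `word_prob` are hypothetical, the default `a=1e-3` is only a placeholder for the hyper-parameter, and the principal direction is taken from an uncentered SVD of the sentence matrix (centering first would give textbook PCA).

```python
from collections import Counter

import numpy as np


def sif_embeddings(sentences, word_vectors, word_prob, a=1e-3):
    """Weighted-average sentence vectors with the first principal direction removed.

    sentences:    list of token lists
    word_vectors: dict token -> 1-D numpy array (all of the same length)
    word_prob:    dict token -> estimated unigram probability p(w)
    a:            weighting hyper-parameter (assumed default, tune in practice)
    """
    dim = len(next(iter(word_vectors.values())))
    rows = []
    for tokens in sentences:
        known = [w for w in tokens if w in word_vectors]
        if not known:
            # No known words: fall back to the zero vector.
            rows.append(np.zeros(dim))
            continue
        # Frequent words (large p(w)) get weights close to 0, rare words close to 1.
        weights = np.array([a / (a + word_prob.get(w, 0.0)) for w in known])
        vectors = np.stack([word_vectors[w] for w in known])
        rows.append(weights @ vectors / len(known))
    X = np.stack(rows)  # shape (n_sentences, dim)

    # First right singular vector of X (uncentered), used as the direction u.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    u = vt[0]
    # v_s <- v_s - u u^T v_s, applied to every row at once.
    return X - np.outer(X @ u, u)


# Toy usage with random word vectors and unigram probabilities from a tiny corpus.
rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "dog", "ran"]
word_vectors = {w: rng.normal(size=4) for w in vocab}
corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
counts = Counter(w for s in corpus for w in s)
total = sum(counts.values())
word_prob = {w: c / total for w, c in counts.items()}

emb = sif_embeddings(corpus, word_vectors, word_prob)
print(emb.shape)
```

In real use, `word_prob` would be estimated from a large corpus rather than from the sentences being embedded, and the direction `u` would be computed once on a reference set of sentences and reused for new ones.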
The experimental upshot is also nice: the resulting sentence vectors match or outperform neural methods on textual similarity and entailment tasks. The paper reports that they do not beat RNNs and LSTMs on sentiment tasks.