$$ \newcommand{\pmi}{\operatorname{pmi}} \newcommand{\inner}[2]{\langle{#1}, {#2}\rangle} \newcommand{\Pb}{\operatorname{Pr}} \newcommand{\E}{\mathbb{E}} \newcommand{\RR}{\mathbf{R}} \newcommand{\script}[1]{\mathcal{#1}} \newcommand{\Set}[2]{\{{#1} : {#2}\}} \newcommand{\argmin}[2]{\underset{#1}{\operatorname{argmin}} {#2}} \newcommand{\optmin}[3]{ \begin{align*} & \underset{#1}{\text{minimize}} & & #2 \\ & \text{subject to} & & #3 \end{align*} } \newcommand{\optmax}[3]{ \begin{align*} & \underset{#1}{\text{maximize}} & & #2 \\ & \text{subject to} & & #3 \end{align*} } \newcommand{\optfind}[2]{ \begin{align*} & {\text{find}} & & #1 \\ & \text{subject to} & & #2 \end{align*} } $$
There are two popular classes of word embedding techniques:

1. Count-based methods, which build a matrix of word-word co-occurrence statistics and then re-weight and/or factor it (e.g., LSA, GloVe).
2. Prediction-based methods, which train a neural model to predict a word from its context (e.g., word2vec).
A re-weighting scheme used in (1) is to replace the co-occurrence statistics with the pointwise mutual information between two words.
The pointwise mutual information (PMI) of a pair of outcomes $x$, $y$ of discrete random variables $X$, $Y$ measures the extent to which their joint distribution differs from the product of the marginal distributions:

$$\pmi(x; y) = \log \frac{\Pb(x, y)}{\Pb(x)\Pb(y)} = \log \frac{\Pb(x \mid y)}{\Pb(x)} = \log \frac{\Pb(y \mid x)}{\Pb(y)}.$$
Note that $\pmi(x; y)$ attains its maximum when $\Pb(x \mid y) = 1$ or $\Pb(y \mid x) = 1$, i.e., when one outcome completely determines the other; in that case it equals $-\log \Pb(x)$ or $-\log \Pb(y)$, respectively.
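As a concrete illustration of this re-weighting (not from the paper), here is a minimal sketch that turns a dense, symmetric co-occurrence count matrix into PMI scores; the function name and the plug-in probability estimates are illustrative choices.

```python
import numpy as np

def pmi_matrix(counts):
    """Re-weight a symmetric word-word co-occurrence count matrix into PMI scores.

    counts[i, j] = number of times words i and j co-occur in a context window.
    Pairs that never co-occur get -inf; in practice these entries are often
    clipped to 0 ("positive PMI").
    """
    total = counts.sum()
    p_joint = counts / total                 # plug-in estimate of Pr(x, y)
    p_marg = counts.sum(axis=1) / total      # plug-in estimate of Pr(x)
    with np.errstate(divide="ignore"):
        return np.log(p_joint) - np.log(np.outer(p_marg, p_marg))
```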
It is observed empirically that the PMI matrix is approximately low-rank: there exist low-dimensional word vectors $v_w$ such that $\pmi(w, w') \approx \inner{v_w}{v_{w'}}$ for most word pairs.
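One way to see this relationship in practice (a sketch following common conventions, not the paper's procedure) is to factor the clipped PMI matrix with a truncated SVD and take the rows of $U_d \sqrt{S_d}$ as word vectors:

```python
import numpy as np

def embed_from_pmi(pmi, d=100):
    """Rank-d factorization of the clipped PMI matrix, so that inner products
    of the returned vectors approximately reproduce the PMI scores."""
    ppmi = np.maximum(pmi, 0.0)                      # -inf / negative entries -> 0
    u, s, _ = np.linalg.svd(ppmi, hermitian=True)    # ppmi is symmetric
    return u[:, :d] * np.sqrt(s[:d])                 # word vectors: rows of U_d sqrt(S_d)

# Example, assuming the pmi_matrix sketch above:
# vectors = embed_from_pmi(pmi_matrix(counts), d=100)
# vectors[i] @ vectors[j] then approximates pmi[i, j]
```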
This paper proposes a generative model for word embeddings that provides a theoretical justification of this empirical relationship as well as of word2vec and GloVe. The key assumption it makes is that the word vectors, the latent variables of the model, are spatially isotropic (intuition: “no preferred direction in space”). Isotropy of the low-dimensional vectors also helps explain the linear structure of word vectors (e.g., analogies).
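A quick way to build intuition for the isotropy assumption (a simulation sketch, not from the paper) is to draw word vectors i.i.d. from an isotropic Gaussian and check that the partition function $Z(c) = \sum_w \exp(\inner{c}{v_w})$ is nearly the same for every unit discourse vector $c$; this concentration is what the paper's analysis relies on.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, d = 10_000, 100

# Isotropic word vectors: i.i.d. Gaussian, so there is no preferred direction.
V = rng.normal(size=(n_words, d))

def partition_function(c):
    """Z(c) = sum_w exp(<c, v_w>) for a unit discourse vector c."""
    return float(np.exp(V @ c).sum())

# Z(c) varies by only a few percent across random unit directions c.
for _ in range(5):
    c = rng.normal(size=d)
    c /= np.linalg.norm(c)
    print(partition_function(c))
```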
A time-step model: at time $t$, word $w_t$ is produced by a slow random walk of a discourse vector $c_t \in \RR^d$ that represents the topic of conversation. Each word $w$ has a latent vector $v_w \in \RR^d$ that measures its correlation with the discourse vector. In particular,

$$\Pb(w_t = w \mid c_t) \propto \exp(\inner{c_t}{v_w}),$$

where $w_t$ is the $t$-th word and the discourse vector takes a small step at each time, $c_{t+1} = c_t + \epsilon_t$ for a small random displacement $\epsilon_t$. Under this model, the authors prove that the co-occurrence probabilities $\Pb(w, w')$ and marginal probabilities $\Pb(w)$ are functions of the word vectors; this is useful when optimizing the likelihood function.
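To make the generative process concrete, here is a small simulation sketch; the vocabulary size, dimension, step size, and the projection of $c_t$ back onto the unit sphere are illustrative choices, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, d, n_steps = 5_000, 50, 1_000

# Latent word vectors, drawn isotropically as the model assumes.
V = rng.normal(size=(n_words, d))

def emit_word(c):
    """Sample a word w with Pr(w | c) proportional to exp(<c, v_w>)."""
    logits = V @ c
    p = np.exp(logits - logits.max())     # subtract max for numerical stability
    return rng.choice(n_words, p=p / p.sum())

# Slow random walk of the discourse vector c_t on the unit sphere.
c = rng.normal(size=d)
c /= np.linalg.norm(c)
corpus = []
for t in range(n_steps):
    corpus.append(emit_word(c))
    c = c + 0.05 * rng.normal(size=d)     # small random displacement epsilon_t
    c /= np.linalg.norm(c)                # keep the discourse vector on the sphere
```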