$$ \newcommand{\pmi}{\operatorname{pmi}} \newcommand{\inner}[2]{\langle{#1}, {#2}\rangle} \newcommand{\Pb}{\operatorname{Pr}} \newcommand{\E}{\mathbb{E}} \newcommand{\RR}{\mathbf{R}} \newcommand{\script}[1]{\mathcal{#1}} \newcommand{\Set}[2]{\{{#1} : {#2}\}} \newcommand{\argmin}[2]{\underset{#1}{\operatorname{argmin}} {#2}} \newcommand{\optmin}[3]{ \begin{align*} & \underset{#1}{\text{minimize}} & & #2 \\ & \text{subject to} & & #3 \end{align*} } \newcommand{\optmax}[3]{ \begin{align*} & \underset{#1}{\text{maximize}} & & #2 \\ & \text{subject to} & & #3 \end{align*} } \newcommand{\optfind}[2]{ \begin{align*} & {\text{find}} & & #1 \\ & \text{subject to} & & #2 \end{align*} } $$
TensorFlow uses a dataflow graph to represent both computation and state. Nodes represent computations that can own or update mutable state, and edges carry tensors, or multi-dimensional arrays, between nodes. The system uses synchronous replication successfully, contradicting the folklore that asynchronous replication is required for scalability.
DistBelief, the predecessor of TensorFlow, was limited by its parameter server architecture; at times it is desirable to offload computation onto the server that owns the data. As such, TensorFlow eschews the separation of workers and parameter servers in favor of a hybrid model. The key design principles of TensorFlow are dataflow graphs of primitive operators, deferred execution, and a common abstraction for heterogeneous accelerators.
Note that a batch dataflow model, which favors large batches of computation and requires immutable inputs and deterministic computation, is a poor fit for stochastic gradient descent, which makes frequent, small updates to shared model parameters. TensorFlow instead allows for mutable state ‘‘that can be shared between different executions of the graph’’ and for ‘‘concurrent executions on overlapping subgraphs.’’
A tensor is a multi-dimensional array that stores primitive types; one of those primitive types is a string, which can hold arbitrary binary data. All tensors are dense in order to ensure that memory allocation and serialization can be implemented efficiently. Sparse vectors can be encoded as either variable-length string elements or tuples of dense tensors. One or more dimensions of a tensor’s shape may be unknown or vary between executions.
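As a minimal sketch (using the TensorFlow 1.x Python API; the particular values are made up), a sparse matrix can be encoded as a tuple of dense tensors holding the indices, the nonzero values, and the logical shape; tf.SparseTensor is a thin wrapper around exactly this encoding.

```python
import tensorflow as tf

# Encode a sparse 3x4 matrix as a tuple of dense tensors.
indices = tf.constant([[0, 1], [2, 3]], dtype=tf.int64)   # positions of the nonzeros
values = tf.constant([4.0, 5.0])                           # the nonzero entries
dense_shape = tf.constant([3, 4], dtype=tf.int64)          # logical shape

sparse = tf.SparseTensor(indices=indices, values=values, dense_shape=dense_shape)
dense = tf.sparse_tensor_to_dense(sparse)  # materialize as an ordinary dense tensor

with tf.Session() as sess:
    print(sess.run(dense))
```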
Stateless operations map one list of tensors to another list of tensors. The simplest way to think of such an operation is as a mathematical function.
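For instance, a tiny graph of stateless operations behaves exactly like a pure function from its input tensors to its output tensors; a sketch in the TensorFlow 1.x Python API:

```python
import tensorflow as tf

# MatMul, Add, and Ones are stateless operations: they map input tensors
# to output tensors with no side effects.
x = tf.placeholder(tf.float32, shape=[2, 2])
y = tf.placeholder(tf.float32, shape=[2, 2])
z = tf.matmul(x, y) + tf.ones([2, 2])

with tf.Session() as sess:
    print(sess.run(z, feed_dict={x: [[1, 0], [0, 1]], y: [[2, 3], [4, 5]]}))
```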
A variable is a stateful operation. Each variable owns a mutable buffer that, for example, holds the model parameters as a model is trained. Variables take no inputs. They instead expose a read operation and various write operations. An example of a write operation is AssignAdd, which is semantically equivalent to the familiar ‘‘plus-equals.’’
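Sketched in the TensorFlow 1.x Python API, a variable’s mutable buffer persists across calls to run, and assign_add is the ‘‘plus-equals’’ write:

```python
import tensorflow as tf

# The variable owns a mutable buffer; reads and writes are themselves
# operations in the graph.
w = tf.Variable(tf.zeros([3]), name="weights")
update = tf.assign_add(w, tf.ones([3]))  # w += [1, 1, 1]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(update)
    sess.run(update)
    print(sess.run(w))  # [2. 2. 2.] -- state persists between executions
```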
Queues allow for concurrent access to the tensors that they hold. They can provide backpressure when they are full and are used to implement streaming computation between subgraphs.
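A rough sketch of a queue connecting a producer subgraph to a consumer subgraph, again using the TensorFlow 1.x Python API (the capacity and tensor shapes here are arbitrary):

```python
import tensorflow as tf

# A FIFOQueue holds tensors; enqueue blocks when the queue is full,
# which provides backpressure between producer and consumer subgraphs.
queue = tf.FIFOQueue(capacity=10, dtypes=[tf.float32])
enqueue = queue.enqueue([tf.random_normal([4])])  # producer subgraph
item = queue.dequeue()                            # consumer subgraph

with tf.Session() as sess:
    sess.run(enqueue)
    print(sess.run(item))
```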
Every operation is placed on a device, and each device assembles its operations into a subgraph. TensorFlow is ‘‘optimized for executing large subgraphs repeatedly with low latency.’’
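Placement can also be requested explicitly. A small sketch with tf.device (TensorFlow 1.x API; soft placement lets the example run even without a GPU):

```python
import tensorflow as tf

# Pin operations to devices; each device then executes the subgraph of
# operations placed on it.
with tf.device("/cpu:0"):
    a = tf.constant([[1.0, 2.0]])
with tf.device("/gpu:0"):  # falls back to CPU because of allow_soft_placement
    b = tf.matmul(a, tf.transpose(a))

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    print(sess.run(b))
```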
Conditional statements and other control flow primitives are supported.
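For example, tf.cond builds a conditional directly into the dataflow graph, and only the branch selected at runtime is executed (TensorFlow 1.x API sketch):

```python
import tensorflow as tf

# y = x^2 if x > 0, else y = -x.
x = tf.placeholder(tf.float32)
y = tf.cond(x > 0, lambda: tf.square(x), lambda: -x)

with tf.Session() as sess:
    print(sess.run(y, feed_dict={x: 3.0}))   # 9.0
    print(sess.run(y, feed_dict={x: -2.0}))  # 2.0
```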
Users can hand-roll their gradients if they so desire, or they can rely on automatic differentiation. It is simple to implement new optimization algorithms on top of TensorFlow; no changes to the core system are required.
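As a sketch of both points in the TensorFlow 1.x Python API, tf.gradients adds the backward-pass operations to the graph automatically, and a new optimizer can be written entirely in terms of ordinary graph operations (the quadratic loss and step size below are just for illustration):

```python
import tensorflow as tf

# Hand-rolled gradient-descent step built from graph operations only.
w = tf.Variable([1.0, -2.0])
loss = tf.reduce_sum(tf.square(w))    # simple quadratic loss
grad = tf.gradients(loss, [w])[0]     # dloss/dw, produced by autodiff
step = tf.assign_sub(w, 0.1 * grad)   # one SGD update: w -= 0.1 * grad

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(10):
        sess.run(step)
    print(sess.run(w))  # w shrinks toward [0, 0]
```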
Both asynchronous and synchronous SGD are supported; the latter converges to a good solution faster than the former (both in practice and, I assume, in theory). In the synchronous scheme, adding a few redundant backup workers and taking the updates from whichever replicas finish first improves throughput by up to 10 percent.
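A very rough sketch of how this looks with TensorFlow 1.x’s tf.train.SyncReplicasOptimizer: aggregating updates from only 4 of 5 replicas means the slowest worker’s update each step is simply dropped. (The cluster setup, the loss, and the training loop are omitted; the commented-out line uses hypothetical loss and global_step tensors.)

```python
import tensorflow as tf

# Synchronous SGD with one backup worker: each global step waits for
# 4 replica updates out of 5 running workers.
opt = tf.train.GradientDescentOptimizer(learning_rate=0.01)
sync_opt = tf.train.SyncReplicasOptimizer(
    opt,
    replicas_to_aggregate=4,   # updates aggregated per global step
    total_num_replicas=5)      # one extra, redundant worker
# train_op = sync_opt.minimize(loss, global_step=global_step)
```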