RNN Compressive Memory Part 1: A high level introduction.
This is the first post of a series dedicated to the Compressive Memory of Recurrent Neural Networks. It is inspired by a recent DeepMind paper published on arXiv in November 2019.
Currently, the ambition of the series is to follow this plan:
Part 1 (here): A high level introduction to Compressive Memory mechanics starting from basic RNNs;
Part 2: a detailed explanation of the Transformer-XL;
Part 3: an implementation using PyTorch (soon);
Part 4: finally, its application to time series (soon).
Most likely, this will be fine-tuned over time.
Big thanks to Gautier Marti and Vincent Zoonekynd for their suggestions and proof-reading!
Update: Additional diagrams (14 March 2020)
Recurrent Neural Networks (RNN)
From simple RNNs to LSTMs
Traditional neural networks were developed to train/run on information provided in a single step in a consistent format (e.g. images with identical resolution). Conceptually, a neural network could similarly be taught on sequential information (e.g. a video as a series of images) by treating it as a single sample, but that would require (1) being trained on the full sequence (e.g. an entire video), and (2) being able to cope with information of variable length (i.e. short vs. long videos). (1) is computationally intractable, and (2) means that units analysing later parts of the video would not receive as much training as earlier units, when ideally they should all share the same amount of training.
The original RNN addresses those issues:
Sequences are chopped into small, consistent sub-sequences (say, a segment of 10 images, or a group of 20 words).
An RNN layer is a group of blocks (or cells), each receiving a single element of the segment as input. Note that here layer does not have the traditional meaning of a layer of neural units fully connected to a previous layer of units. It is a layer of RNN cells. Within each cell, quite a few things happen, including using layers of neural units. From here on, a layer will refer to an RNN layer and not a layer of neural units.
Within a layer, cells are identical: they have the same parameters.
Although each element of a sequence might be of interest on its own, it only becomes really meaningful in the context of the other elements. Each cell contains a state vector (called the hidden state). Each cell is trained using an individual element from a segment and the hidden state from the preceding cell. Training the network means training the creation of those states. Passing the hidden state along transfers some context or memory from prior elements of the segment. The cells receiving a segment form a single layer. Each cell would typically (but not necessarily) also include an additional sub-cell to create an output as a function of the hidden state. In that case, the output of a layer can then be used as input to a new RNN layer.
A layer is trained by passing hidden states from prior cells to later cells. The hidden state from prior elements is used to contextualise a current element. To use context from later elements (e.g. in English, a noun giving context to a preceding adjective), a separate layer is trained where context instead passes from later to prior elements. Those forward and backward layers jointly create a bidirectional RNN.
Historically, RNNs applied to NLP deal with elements which are either one-hot encoded (letters or, more efficiently, tokens), or word embeddings often normalised as unit vectors (for example see Word2Vec and GloVe). RNN cells therefore deal with values between 0 and 1. Typically, non-linearity is brought by \(\tanh\) or sigmoid activations which keep values within a bounded range. Those activation functions quickly have very flat gradients. Segments often have 10s or 100s of elements. Because of vanishing gradients, a hidden state receives little information from distant cells (training gradients are hardly influenced by the gradients of distant cells).
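To make the hidden-state mechanics concrete, here is a minimal sketch of a vanilla RNN cell unrolled over one segment, written in PyTorch since that is what later parts of this series will use. All names and dimensions are illustrative, not taken from any particular model.

```python
import torch

# Minimal sketch of a vanilla RNN cell unrolled over one segment.
# Dimensions and names (d_in, d_hidden, segment_len) are illustrative only.
d_in, d_hidden, segment_len = 8, 16, 10

W_x = torch.randn(d_hidden, d_in) * 0.1      # input-to-hidden parameters
W_h = torch.randn(d_hidden, d_hidden) * 0.1  # hidden-to-hidden parameters
b = torch.zeros(d_hidden)

segment = torch.randn(segment_len, d_in)     # one element per cell
h = torch.zeros(d_hidden)                    # initial hidden state

hidden_states = []
for x_t in segment:
    # Each "cell" applies the same parameters to its own element and to the
    # hidden state received from the previous cell (tanh keeps values in [-1, 1]).
    h = torch.tanh(W_x @ x_t + W_h @ h + b)
    hidden_states.append(h)
```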
Long Short-Term Memory RNNs
Long Short-Term Memory RNNs (LSTMs) address this by passing two states:
a hidden state \(h\) as described above trained with non-linearity: this is the short-term memory; and,
another hidden state \(c\) (called context) weighting previous contexts with a simple exponential moving average (in Gated Recurrent Units) or a slightly more complicated version thereof in the original LSTM model structure. Determining the optimal exponential decay is part of the training process. This minimally processed state is the long-term memory.
LSTMs can also be made bidirectional.
Without going into further details, note that each \(\sigma\) and \(\tanh\) orange block represents a matrix of parameters to be learned.
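As an illustration, an off-the-shelf PyTorch LSTM returns exactly the two states described above (the sizes and flags below are illustrative, not taken from any particular model).

```python
import torch
import torch.nn as nn

# Sketch: PyTorch's built-in LSTM returns the two states described above.
# Sizes are illustrative; the bidirectional flag is shown only for reference.
lstm = nn.LSTM(input_size=8, hidden_size=16, num_layers=1,
               batch_first=True, bidirectional=False)

segment = torch.randn(1, 10, 8)       # (batch, segment length, element size)
outputs, (h_n, c_n) = lstm(segment)

# h_n: final hidden state ("short-term" memory), shape (1, 1, 16)
# c_n: final cell/context state ("long-term" memory), shape (1, 1, 16)
```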
Attention
RNNs were further extended with an attention mechanism. The blog posts on attention by Jay Alammar and Lilian Weng are good introductions.
A multi-layer RNN takes the output of a layer and uses it as input for the next. With the attention mechanism, the outputs first go through an attention unit.
Beyond LSTM: Transformers
RNNs were then simplified (insert large air quotes) with Transformers (using what is called self-attention), which significantly reduce the number of model parameters and can be efficiently parallelised with minimal impact on model performance. For an extremely clear introduction to those significant improvements, you cannot do better than reading the posts by Peter Bloem on transformers. The following assumes that you are broadly familiar with those ideas.
The basic transformer structure uses self-attention: for a given element (the query), the transformer looks at the other elements of the segment (the keys) to determine how much ‘attention’ each of them deserves, i.e. how much they influence the query’s contribution to the hidden state.
Broadly:
The query is projected into a linear space (by a matrix \(W_q\)). That is basically an embedding which is learned as part of the model training.
All the other elements, the keys, are projected into another linear space (by a matrix \(W_k\)); another embedding which is learned as part of the model training.
The similarity (perhaps affinity would be a better word) between the projected query and each projected key is calculated with a dot product / cosine similarity. This is exactly the approach of basic recommender systems, with the difference that there the recommendation is between sets of a completely different nature (for example the affinity between users and movies). Note that although the query and keys are elements of identical type, they are embedded into different spaces with different projection matrices.
We now have a vector of the same size as the segment length (one similarity score per input element), usually normalised with a softmax. These weights are used to mix the values, which are projections of the input elements through a third matrix \(W_v\).
The triplet \(\left( W_q, W_k, W_v \right)\) is called an attention head. Actual models include multiple heads (of the order of 10), and the output of a transformer layer can then feed into a new transformer layer. A minimal code sketch of a single head follows below.
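Here is a minimal sketch of a single attention head implementing the steps above (the scaling and softmax normalisation are standard details; all names and sizes are illustrative).

```python
import torch
import torch.nn.functional as F

# Minimal single-head self-attention sketch over one segment.
# d_model, d_head and all variable names are illustrative.
d_model, d_head, segment_len = 16, 16, 10

W_q = torch.randn(d_model, d_head) * 0.1  # query projection
W_k = torch.randn(d_model, d_head) * 0.1  # key projection
W_v = torch.randn(d_model, d_head) * 0.1  # value projection

x = torch.randn(segment_len, d_model)     # one row per input element

q, k, v = x @ W_q, x @ W_k, x @ W_v
# Similarity between each projected query and every projected key,
# scaled and normalised with a softmax.
scores = q @ k.T / d_head ** 0.5          # (segment_len, segment_len)
weights = F.softmax(scores, dim=-1)
# Each output is an attention-weighted mix of the values.
out = weights @ v                         # (segment_len, d_head)
```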
This model is great until you notice that the dot product / cosine similarity is commutative and does not reflect whether a key element is located before or after the query element: order is fundamental to sequential information (“quick fly” vs. “fly quick”). To address this, the input elements are always enriched with a positional embedding: they are concatenated with positional information showing where they stand within a segment.
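As a small illustration, positional information could be attached as follows; this sketch concatenates a learned positional embedding, although adding it to the input elements is also a common choice.

```python
import torch

# Sketch: enriching a segment with positional information, here by
# concatenating a learned positional embedding. Sizes are illustrative.
segment_len, d_model, d_pos = 10, 16, 4

x = torch.randn(segment_len, d_model)                   # input elements
pos_embedding = torch.nn.Embedding(segment_len, d_pos)  # learned positions
positions = torch.arange(segment_len)

x_with_pos = torch.cat([x, pos_embedding(positions)], dim=-1)  # (10, 20)
```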
Note that a transformer layer is trained on a segment using only the information from that segment. This is fine to train on sentences, but it cannot really account for more distant relationships between words within a lengthy paragraph, let alone a full text.
Transformer-XL
Transformers have been further improved with the Transformer-XL (XL = extra long), which is trained using hidden states from previous segments, therefore using information from several segments to improve the model’s memory span.
Conceptually, this is an obvious extension of the basic transformer to increase its memory span. But there is a fundamental problem. Going back to the basic transformer, each element includes its absolute position within the segment. The position of the first word of the segment is 1, that of the last one is, say, 250. Such a scheme breaks down as soon as the state of the previous segment is taken into account. Word 1 of the current segment obviously comes before word 250 of the same segment, but it has to come after word 250 of the previous segment. The absolute position encoding does not reflect the relative position of elements located in different segments.
The key contribution of the Transformer-XL is to develop a relative positional encoding that allows hidden state information to cross segment boundaries. In their implementation, the authors evaluate that the attention length (basically how many hidden states are used) is 450% longer than that of the basic transformer. That’s going from sentence length to a full paragraph, but still far from a complete book.
A side, but impressive, benefit is that the evaluation speed of the model (its use once trained) is significantly increased thanks to the relative addressing (the paper reports up to a 1,800-fold speed-up depending on the attention length).
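Conceptually, the memory mechanism can be pictured with the sketch below: queries from the current segment attend over both the cached (gradient-free) hidden states of the previous segment and the current segment itself. This is only a sketch of the idea; the actual Transformer-XL also replaces absolute positions with its relative positional encoding, which is omitted here, and all names and sizes are illustrative.

```python
import torch
import torch.nn.functional as F

# Conceptual sketch (not the full Transformer-XL): the current segment's
# queries attend over both the cached hidden states of the previous segment
# and the current segment itself. Relative positional terms are omitted.
d_model, seg_len = 16, 10

W_q = torch.randn(d_model, d_model) * 0.1
W_k = torch.randn(d_model, d_model) * 0.1
W_v = torch.randn(d_model, d_model) * 0.1

prev_hidden = torch.randn(seg_len, d_model)   # cached from segment T-1
current = torch.randn(seg_len, d_model)       # segment T

context = torch.cat([prev_hidden.detach(), current], dim=0)  # (2 * seg_len, d_model)
q = current @ W_q                    # queries only for the current segment
k, v = context @ W_k, context @ W_v  # keys/values span both segments

weights = F.softmax(q @ k.T / d_model ** 0.5, dim=-1)
out = weights @ v                    # (seg_len, d_model)
```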
Compressive Transformers
Full text understanding cannot be achieved by simply lengthening segment sizes from 100s to the word count of a typical novel (about 100,000). When training a model routinely takes 10s of hours on GPU clusters, an increase by 3 orders of magnitude is not realistic.
In a recent paper, DeepMind proposes a new model called the Compressive Transformer.
Introduction
Transformer-XL uses the hidden state of a prior segment (\(h_{T-1}\)) to improve the training of the current segment (\(h_{T}\)). When moving to the next segment, training \(h_{T+1}\) only uses \(h_{T}\), and \(h_{T-1}\) is discarded. To increase the memory span, one could train using more past segments, at the expense of an increase in memory usage and computation time (which grows quadratically with the attention length). The actual Transformer-XL uses the hidden states of several previous segments, but the discarding mechanism remains.
The key contribution of the Compressive Transformers is the ability to retain salient information from those otherwise discarded past states. Instead of being discarded, they are stored in compressed form.
Each Transformer-XL layer is now trained with prior hidden states (primary memory) and the compressed memory of older hidden states.
As an aside, although not explicitly mentioned, we should note that the ‘-XL’ aspect of the Transformer-XL and the memory compression mechanics are conceptually independent from the actual type of RNN cell. Simple RNNs, GRUs or LSTMs could be trained using the hidden states of past segments (not dissimilar to state/context peeking into past cells in certain RNN variants). But the performance benefit of the Transformer-XL is such that the paper only focuses on it.
Compression scheme
As compared to Transformer-XL, the key difference is the compression scheme. The rest of the model seems identical.
Size parameters
The size of the model is described with a few size parameters:
\(n_s\): size of a segment = the number of cells in a layer.
\(n_m\): number of hidden states in the primary uncompressed memory (like the Transformer-XL). \(n_m\) is a multiple of \(n_s\). The primary memory is a FIFO buffer: the first (oldest) memories will be the first to be later compressed.
\(n_{cm}\): number of compressed hidden states in the compressed memory. States in the compressed memory summarise old segments of size \(n_s\) that drop out of the primary memory. \(c\) is the compression rate turning \(n_s\) primary memory entries into \(n_s / c\) compressed memory entries. There are two conceivable ways of applying this compression rate, which both shrink the stored information by the same factor:
\(c\) uncompressed hidden states could be merged into a single compressed hidden state of identical size. This pools the information of a group of elements (e.g. \(c\) words) into one state, so an old segment of \(n_s\) states yields \(n_s / c\) compressed states. This is the approach taken in the paper, using the compression functions described further below.
The alternative would be dimension reduction: each uncompressed hidden state would be compressed into a new hidden state that is \(c\) times smaller. If the hidden state of a Transformer-XL cell has size \(n_h\), hidden states in the primary memory would keep that size while the compressed hidden states would have a size of \(n_h / c\). The authors do not use this approach: compressed states need to keep the same size as the other hidden states so that they can be attended to with the same projection matrices.
By way of example, a segment could have 100 cells (\(n_s = 100\)). This segment could be trained with the hidden states of the past 3 segments (\(n_m = 3 \times n_s = 300\)). When moving to the next segment, the oldest segment of 100 hidden states drops out of the primary memory and becomes available for compression; with a compression rate of \(c = 4\), it creates 25 new compressed hidden states.
This example is for a single layer. The same scheme would be replicated for each layer of the model.
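Below is a sketch of this memory bookkeeping for a single layer, using the numbers of the example above and a simple mean-pooling placeholder as the compression function (the actual compression schemes are described in the next section; all names are illustrative).

```python
import torch

# Sketch of the memory bookkeeping in the example above, for one layer.
# n_s = 100, n_m = 300, c = 4; the compression function is a placeholder.
n_s, n_m, c, d_hidden = 100, 300, 4, 16

primary_memory = torch.zeros(n_m, d_hidden)   # FIFO of uncompressed states
compressed_memory = torch.zeros(0, d_hidden)

def compress(old_states):
    # Placeholder: mean-pool each group of c consecutive states into one state.
    return old_states.reshape(-1, c, old_states.shape[-1]).mean(dim=1)

def update_memories(primary, compressed, new_hidden):
    # The oldest segment drops out of the primary memory and is compressed.
    oldest, primary = primary[:n_s], primary[n_s:]
    compressed = torch.cat([compressed, compress(oldest)], dim=0)
    primary = torch.cat([primary, new_hidden], dim=0)  # append newest states
    return primary, compressed

new_hidden = torch.randn(n_s, d_hidden)        # states from the latest segment
primary_memory, compressed_memory = update_memories(
    primary_memory, compressed_memory, new_hidden)
# primary_memory: (300, 16); compressed_memory gains 25 states of size 16.
```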
Note that the paper only contemplates a single set of compressed memories. There could also be multiple generations of compressed memories: the primary memory compressing into generation 1, which then compresses into generation 2, and so on.
Compression functions
A compressed hidden state is created from \(c\) primary memory hidden states. When training on texts with word embeddings, the authors used a value of \(c=3\) or \(c=4\).
Several compression schemes are explored in the paper:
max or mean pooling with a stride of \(c\). This is typical of image convolution networks - no explanation required (this scheme and the next one are sketched in code after this list).
1-dimensional convolution with a stride of \(c\). This is also typical of image convolution networks, apart from being one-dimensional. It requires parameter training.
dilated convolution. In practice, image convolutions have been shown to be inadequate for sequential information where dependencies can be at both short and long ranges: working at different scales makes sense. Dilated convolutions use convolution filters that are contracted and dilated versions of a template to be trained.
a most-used mechanism that identifies and retains the hidden states that matter most, as gauged by the attention they received during the cells’ training.
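Here is a sketch of the first two compression functions, condensing groups of \(c\) consecutive hidden states into single states (all names and sizes are illustrative).

```python
import torch
import torch.nn.functional as F

# Sketches of the first two compression functions: mean pooling and a 1D
# convolution, both with a stride of c over the sequence of old hidden
# states, mapping n_s states to n_s / c compressed states.
n_s, d_hidden, c = 100, 16, 4
old_states = torch.randn(n_s, d_hidden)

x = old_states.T.unsqueeze(0)                    # (1, d_hidden, n_s)

# (1) mean pooling: no parameters to train.
pooled = F.avg_pool1d(x, kernel_size=c, stride=c)
pooled = pooled.squeeze(0).T                     # (n_s / c, d_hidden)

# (2) 1D convolution: kernel and stride of c, parameters learned in training.
conv = torch.nn.Conv1d(d_hidden, d_hidden, kernel_size=c, stride=c)
convolved = conv(x).squeeze(0).T                 # (n_s / c, d_hidden)
```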
Compression training
Training the compression parameters is done separately from the optimisation of the Transformer-XL cells.
The purpose of the compressed memory is to provide a compressed and lossy representation of the primary memory (hidden states), or of the attention paid to those states: the quality of the compression mechanics is assessed by how well the original information can be regenerated from it. In essence, the compressed hidden states are to the primary memory what the learned representation vector is to the input of an auto-encoder. This is the training mechanics used by the authors.
As in an auto-encoder, the representation is learned by comparing the original information to its reconstruction. This training is kept completely independent from the training of the transformers: the auto-encoding loss and gradients do not impact the attention heads’ parameters.
Conversely, the loss and gradients of the attention heads’ training do not flow into the training of the compression scheme.
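A minimal sketch of this separation, assuming a simple convolutional compressor and decoder (both illustrative, not the paper's exact setup): the old hidden states are detached before compression, so the reconstruction loss only updates the compression parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch: keeping the compression training separate from the transformer
# training. The old hidden states are detached before compression, and an
# auto-encoding style reconstruction loss trains only the compressor/decoder.
# The decoder and the plain MSE loss are illustrative simplifications.
n_s, d_hidden, c = 100, 16, 4

compressor = nn.Conv1d(d_hidden, d_hidden, kernel_size=c, stride=c)
decoder = nn.ConvTranspose1d(d_hidden, d_hidden, kernel_size=c, stride=c)

old_states = torch.randn(n_s, d_hidden, requires_grad=True)  # from the transformer

x = old_states.detach().T.unsqueeze(0)   # stop gradients into the transformer
compressed = compressor(x)               # (1, d_hidden, n_s / c)
reconstructed = decoder(compressed)      # (1, d_hidden, n_s)

# The reconstruction loss only updates the compressor/decoder parameters.
loss = F.mse_loss(reconstructed, x)
loss.backward()
```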
Summary
This was a high level introduction to RNNs all the way up to Compressive Memory mechanics. Next, the algorithm’s nitty-gritty.