Neural Network - Incremental Growth
DRAFT 1
We all have laptops. But let's face it: even in times of 32 GB of RAM and NVMe drives, forget about running any interesting TensorFlow model. You need an external GPU, you build your own rig, or you very quickly pay a small fortune for cloud instances.
Back in 1993, I read a paper about growing neural networks neuron by neuron. I have no other precise recollection of this paper apart from the models considered being of the order of tens of neurons and the weight optimisation being done on a global basis, i.e. not layer by layer like backpropagation. Nowadays, it is still too often the case that finding a network structure that solves a particular problem is a random walk: how many layers, with how many neurons, with which activation functions? Regularisation methods? Drop-out rate? Training batch size? The list goes on.
This got me thinking about how a training heuristic could incrementally modify a network structure given a particular training set and, apart maybe from a few hyperparameters, do so with no external intervention. At regular training intervals, a layer1 will be modified depending on what it seems able or unable to achieve. As we will see, we will use unsupervised learning methods to do this: a layer modification will be independent of the actual learning problem and automatic.
Many others have looked into this. But what I found regarding self-organising networks is pre-2000, and nothing in the context of deep learning. So it seems that the topic has either gone out of fashion because of the current amounts of computing power, or been set aside for reasons unknown. (See references at the end.) In any event, it is an interesting enough question to research.
Background
Let us look at a simple 1-D (dense) layer and decompose what exactly it does. Basically, a layer computes y = σ(Wx + b): if the input x has n units and the output y has m units, W is an m × n weight matrix, b a bias vector and σ the activation function.
Then, looking at the matrix W representing the layer, can we identify which parts are (1) useless, (2) useful and complex enough, or (3) useful but too simplistic?
Here, "complex enough" or "too simplistic" are basically synonyms of "one layer is enough" or "more layers are necessary".
The idea is to look for important/complex information where the network needs to grow more complex, and to identify trivial information which can be discarded, or which can be viewed as minor adjustments to improve error rates (basically overfitting…).
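For concreteness, here is a minimal NumPy sketch of the kind of dense layer whose weight matrix W is the object we will dissect (dimensions and names are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_out = 8, 5                      # made-up layer dimensions
W = rng.normal(size=(n_out, n_in))      # the layer matrix we will analyse
b = np.zeros(n_out)                     # bias vector

def dense_layer(x, W, b, activation=np.tanh):
    """A plain 1-D (dense) layer: y = activation(W @ x + b)."""
    return activation(W @ x + b)

x = rng.normal(size=n_in)
y = dense_layer(x, W, b)
print(y.shape)   # (5,)
```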
Caveat: note that we ignore the activation function. Activation functions are key to introducing non-linearity: without them, a network is only a linear function, i.e. of no interest. They have a clear impact on the performance of a network.
Singular value decomposition
There exist many ways to decompose a matrix. Singular value decomposition (SVD) expresses a matrix W as W = U Σ Vᵀ, where U and V are orthonormal matrices and Σ is a diagonal matrix of singular values (referred to, loosely, as eigenvalues and eigenvectors below).
In a statistical context, SVD (with the eigenvalues ordered by decreasing value) is how one does principal component analysis (PCA).
In a geometrical context, SVD:
- takes a vector (expressed in the standard orthonormal basis);
- re-expresses it in a new basis made of the eigenvectors (a basis that would only exceptionally coincide with the standard one);
- dilates/compresses those components by the relevant eigenvalues;
- and returns the resulting vector expressed back in the standard orthonormal basis.
As presented here, this explanation requires a bit more intellectual gymnastics when the matrix is not square (i.e. when the input and output layers have different dimensions), but the principle remains identical.
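A small NumPy sketch of this rotate-scale-rotate reading (dimensions are arbitrary; the reduced SVD is used for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(5, 8))              # a non-square layer matrix

# Reduced SVD: W = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(W, full_matrices=False)

x = rng.normal(size=8)

coords = Vt @ x      # 1. re-express x in the basis of right singular vectors
scaled = s * coords  # 2. dilate/compress each component by its singular value
y = U @ scaled       # 3. express the result back in the output space

assert np.allclose(y, W @ x)   # same as applying the layer matrix directly
```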
Where next?
Taking the statistical and geometrical points of view together, the layer (the matrix W, ignoring the activation) can be analysed through its SVD.
Intuitively, the simplest and most useless layer is an identity matrix2: it merely passes its inputs through.
Compared to that identity matrix, the SVD shows:
- which combinations of the input units are interesting; this is expressed by how much the input vector is rotated in space;
- independently of whether a combination is complicated or not (i.e. multiple units, or a unit passthrough), how much an input is amplified (as expressed by the eigenvalues).
The idea is then to produce a 2x2 decision matrix with high/low rotation messiness on one axis and high/low eigenvalues on the other.
A picture gives the intuition of what we are after:

[Figure: Transformation of the Layer Matrix]
Looking from top to bottom at what the “after” matrices would be:
- Part of the original layer, immediately followed by a new layer (we will see below what that would look like). The intuition is that this part is really messing things up down the line, or seems very sensitive.
- Part of the original layer where the number of units would be increased (here doubled, as an example).
- Part of the original layer kept functionally essentially as is.
- The rest is deleted, being either insensitive to its input or outputting nothing (within a certain precision). This is basically a form of regularisation preventing the overall model from being too sensitive. I am aware that there are other types of regularisation, but that will go in the limitations category.
The next layer would take as input all the transformed outputs.
In practice, the picture presents the matrices as separate for ease of understanding. In reality, the same effect would be achieved if the three dark-blue sub-layers were merged into a single layer.
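A minimal sketch of that merge, assuming the three kept sub-layers all read the full original input and produce disjoint groups of output units (all names and shapes below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n_in = 8

# Hypothetical sub-layer matrices obtained from the split of the original layer:
W_new_layer_part = rng.normal(size=(2, n_in))   # part to be followed by a new layer
W_grown_part     = rng.normal(size=(4, n_in))   # part whose unit count was increased
W_kept_part      = rng.normal(size=(3, n_in))   # part kept essentially as is

# Applying the three sub-layers separately and concatenating their outputs...
x = rng.normal(size=n_in)
y_separate = np.concatenate([W_new_layer_part @ x, W_grown_part @ x, W_kept_part @ x])

# ...is the same as applying one merged layer obtained by stacking the rows.
W_merged = np.vstack([W_new_layer_part, W_grown_part, W_kept_part])
assert np.allclose(W_merged @ x, y_separate)
```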
Back to SVD
Let us assume that there are n input units and m output units, so that the layer matrix W has dimension m × n.
Note that instead of using an eigendecomposition (which would require W to be square), we use the SVD, which exists for any matrix.
The SVD of W is then:
W = U Σ Vᵀ
where U is an m × m orthonormal matrix, Σ an m × n diagonal matrix holding the singular values sorted by decreasing value, and V an n × n orthonormal matrix.
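In NumPy, a sketch of this bookkeeping (arbitrary dimensions; full_matrices=True yields the full m × m and n × n bases described above):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 5, 8                      # output and input dimensions of the layer
W = rng.normal(size=(m, n))

# Full SVD: U is m x m, Vt is n x n, s holds the min(m, n) singular values
U, s, Vt = np.linalg.svd(W, full_matrices=True)
print(U.shape, s.shape, Vt.shape)   # (5, 5) (5,) (8, 8)

# Rebuild W by placing the singular values on the diagonal of an m x n matrix
Sigma = np.zeros((m, n))
Sigma[:min(m, n), :min(m, n)] = np.diag(s)
assert np.allclose(U @ Sigma @ Vt, W)
```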
Regularisation
At this stage, we can regularise all components.
Vector coordinates
For each vector (each column of U and of V), coordinates below a de minimis threshold are set to zero.
After this regularisation, the matrices will not be orthonormal anymore. They can easily be made normal again by scaling each vector up by the inverse of its new norm.
We need a way to measure the rotation messiness of each vector. As a shortcut, we can use the proportion of non-zero vector coordinates (after the de minimis regularisation).
Eigenvalues
The same can be done for the eigenvalues: those below a certain threshold are set to zero.
Threshold
Where to set the thresholds is to be experimented with: the mean? The median, since it is more robust? Some quartile?
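A sketch of this regularisation step (the de minimis threshold, the re-normalisation, the rotation-messiness proxy and the use of the median are all open knobs; the values below are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
W = rng.normal(size=(5, 8))
U, s, Vt = np.linalg.svd(W, full_matrices=False)

def regularise_vectors(M, de_minimis=0.1):
    """Zero out small coordinates column by column, then re-normalise each column."""
    M = np.where(np.abs(M) < de_minimis, 0.0, M)
    norms = np.linalg.norm(M, axis=0)
    norms[norms == 0] = 1.0                  # avoid dividing by zero for dead columns
    return M / norms

U_reg = regularise_vectors(U)
V_reg = regularise_vectors(Vt.T)

# Rotation messiness: proportion of non-zero coordinates per (right) singular vector
messiness = np.count_nonzero(V_reg, axis=0) / V_reg.shape[0]

# Singular values below a threshold (here the median, as one candidate) are dropped
s_reg = np.where(s < np.median(s), 0.0, s)

print(messiness, s_reg)
```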
2-by-2 decision matrix
Based on those regularisations, we would propose a decision matrix crossing high/low rotation messiness with high/low eigenvalues, each quadrant mapping to one of the transformations described above.
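As a rough illustration only (the helper and thresholds below are hypothetical, and the exact quadrant-to-action mapping is left open here), this is how each singular direction could be bucketed into the four quadrants:

```python
import numpy as np

def quadrant(messiness, singular_value, messiness_threshold, value_threshold):
    """Classify one singular direction on the 2x2 grid: (high/low messiness, high/low value)."""
    return (
        "high" if messiness >= messiness_threshold else "low",
        "high" if singular_value >= value_threshold else "low",
    )

# Example scores (made up), one per singular direction:
messiness = np.array([0.9, 0.4, 0.8, 0.2, 0.1])
s = np.array([3.1, 2.0, 0.7, 0.5, 0.1])
m_thr, v_thr = np.median(messiness), np.median(s)

for m_i, s_i in zip(messiness, s):
    print(quadrant(m_i, s_i, m_thr, v_thr))
```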
[TODO] Other Principal Components methods
SVD is PCA: it projects information onto hyperplanes.
Reflect on non-linear versions: Principal Curves, Kernel Principal Components, Sparse Principal Components, Independent Component Analysis (Elements of Statistical Learning, s. 14.5 et seq.).
Limitations and further questions
Limitations
Only 1-D layers are considered. Higher-order SVD is in principle feasible for higher-order tensors. Other methods?
We delete the eigenvectors associated with low eigenvalues and limited rotations. There are other forms of regularisation, e.g. random weight cancelling, that would not care about anything eigen-related.
What is the real impact of ignoring the activation function? PCA requires centered values. Geometrically, uncentered values would mean more limited rotations since samples would sit in a quadrant far from 0.
Further questions
The final structure is a direct product of the training set. What if the training is done differently (batches sized or ordered differently)?
What about training many variants with different subsets of the training set and using ensemble methods?
The eigenvalues could be modified when creating the new layers. By decreasing the highest eigenvalues (in absolute value), we effectively regularise the layers' outputs. This decrease could bring additional non-linearity if the compression ratio depends on the eigenvalue (e.g. replacing it by its square root). And this non-linearity would not bring additional complexity to the back-propagation algorithm, or to auto-differentiated functions: it only modifies the final values of the new matrices.
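A sketch of that compression (the square-root rule is the example from the text; applying it once, when the new matrix is built, keeps backpropagation untouched):

```python
import numpy as np

rng = np.random.default_rng(5)
W = rng.normal(size=(5, 8))
U, s, Vt = np.linalg.svd(W, full_matrices=False)

# Compress the singular values, e.g. by taking their square root
# (singular values are non-negative, so this is well defined).
s_compressed = np.sqrt(s)

# The rebuilt matrix is a plain constant as far as training is concerned:
# backpropagation only ever sees W_new, not the compression step.
W_new = U @ np.diag(s_compressed) @ Vt
```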
Literature
Here are a few summary literature references related to the topic.
The Elements of Statistical Learning
The ESL (top of p. 409) proposes PCA to interpret layers, i.e. to improve the interpretability of the decisions made by a network.
Neural Network Implementations for PCA and Its Extensions
http://downloads.hindawi.com/archive/2012/847305.pdf
Uses neural networks as a substitute for PCA.
An Incremental Neural Network Construction Algorithm for Training Multilayer Perceptrons
Aran, Oya, and Ethem Alpaydin. “An incremental neural network construction algorithm for training multilayer perceptrons.” Artificial Neural Networks and Neural Information Processing. Istanbul, Turkey: ICANN/ICONIP (2003).
https://www.cmpe.boun.edu.tr/~ethem/files/papers/aran03incremental.pdf
Kohonen Maps
Self-Organising Network
A Self-Organising Network That Grows When Required (2002)
https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.8763
The Cascade-Correlation Learning Architecture
https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.125.6421
Growth with quick freeze as a way to avoid the expense of back-propagation.
SOINN: Self-Organizing Incremental Neural Network
http://www.haselab.info/soinn-e.html https://cs.nju.edu.cn/rinc/SOINN/Tutorial.pdf
Seems focused on neuron by neuron evolution.
We will only consider modifying the network layer by layer, not neuron by neuron.↩︎
This could actually be a big limitation of this discussion. In reality, even an identity matrix yields changes by piping the inputs through a new round of non-linearity, which is not necessarily identical to that of the preceding layer.↩︎