NLP with Biologically-Inspired Neural Networks
Is it possible to reverse engineer the human brain, specifically the neocortex? Numenta has proposed a novel machine learning algorithm that is a bold attempt at doing exactly that.
This biologically-inspired neural network, called Hierarchical Temporal Memory (HTM), differs in several major ways from the Artificial Neural Networks we have all become familiar with from Deep Learning.
For instance, HTM has:
- Mini-columns (like a capsule in a capsule network)
- Binary-valued neurons!
- Two types of weights (analogous to Feedforward weights and Self-attention weights)
But aside from these differences in detail, HTM stands apart as an overall ML model because it performs:
- Simple Hebbian Learning (no Complex Backpropagation Calculations)
- A novel style of Unsupervised Learning (not Supervised Learning)
- Online Learning (i.e. Continuously Learning Live)
- Few-Shot Learning
In this post we attempt to demystify this powerful machine learning algorithm by relating its underlying components to analogous concepts in Deep Learning. In our next post we will use this biologically-inspired neural network to learn and solve a few NLP tasks.
For an Artificial Neural Network (ANN), a layer simply contains neurons.
However, for Hierarchical Temporal Memory (HTM), a layer contains HTM capsules (mini-columns) — each of which contains HTM neurons (pyramidal cells).
Furthermore, while ANN neurons output continuous values, HTM neurons are simply binary (i.e. ‘firing’ or ‘non-firing’).
Since HTM neurons are binary-valued, inputs and outputs are encoded as a ‘sparse distributed representation’ (SDR) x, i.e. a sparse binary vector in which only a small fraction of the bits are 1.
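To make the idea of an SDR concrete, here is a minimal sketch in plain Python. The helper names `to_sdr` and `overlap`, and the vector size of 64, are our own illustrative choices rather than part of Numenta's API.

```python
def to_sdr(active_indices, size):
    """Return a binary list of length `size` with 1s at `active_indices`."""
    active = set(active_indices)
    return [1 if i in active else 0 for i in range(size)]


def overlap(a, b):
    """Count the positions where both SDRs are 1 (a simple similarity score)."""
    return sum(x & y for x, y in zip(a, b))


# Two hypothetical inputs that share two of their three active bits.
x = to_sdr([3, 17, 42], size=64)
y = to_sdr([3, 17, 50], size=64)

sparsity = sum(x) / len(x)   # fraction of bits that are 1
shared = overlap(x, y)       # number of active bits the two SDRs have in common
```

Because only a few bits are active, two SDRs that share many active bits are very unlikely to do so by chance, which is what makes overlap a useful similarity measure.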
ANN neurons have weights (weighted synapses) that connect to other neurons, but HTM neurons have two different types of weighted connections:
- ‘Proximal Connections’ Wp between neurons in different layers (analogous to interlayer, feedforward weights) and
- ‘Distal Connections’ Wd between neurons in the same layer (i.e. intralayer / intra-column, lateral connections analogous to self-attention weights)
Using these two sets of weights, a layer makes two kinds of prediction:
1) A feedforward prediction H for the next layer
2) A self-attention prediction A determined by the input’s context within a sequence (an initially context-free input is represented by all mini-columns firing at once, i.e. all 1s)
Feedforward Predictions (Spatial Pooler)
The spatial pooler σ projects a layer’s firing neurons H into a representational space using feedforward weights Wp (similar to mainstream feedforward neural networks).
Here, a high-pass filter f binarises the weights, determining whether each one is ‘connected’ (i.e. above some predefined threshold θ) or not. The binarisation filter f also acts as the non-linear activation function, since thresholding to binary values is inherently non-linear.
Weights are iteratively optimised using Hebbian learning (unlike mainstream neural networks, which use a more compute-intensive optimisation known as backpropagation). Weights for all ‘firing’ neurons H get incremented by the learning rate δ, while weights for ‘non-firing’ neurons H_bar get decremented.
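The feedforward step can be sketched roughly as follows. This is a simplified illustration, not Numenta's implementation: the function names are hypothetical, and picking the `k` mini-columns with the largest overlap (k-winners-take-all) is an assumption we add to keep the output sparse.

```python
def binarise(weights, theta):
    """High-pass filter f: 1 where a weight exceeds theta ('connected'), else 0."""
    return [[1 if w > theta else 0 for w in row] for row in weights]


def spatial_pooler(x, w_p, theta, k):
    """Return a binary vector marking the (at most) k best-matching mini-columns.

    x   : input SDR (list of 0/1)
    w_p : proximal weights, one row per mini-column, one entry per input bit
    """
    connected = binarise(w_p, theta)
    # Overlap = number of connected synapses that see an active input bit.
    overlaps = [sum(c & xi for c, xi in zip(row, x)) for row in connected]
    # Assumption: keep the k columns with the largest overlap (ties -> lower index).
    ranked = sorted(range(len(overlaps)), key=lambda j: (-overlaps[j], j))
    winners = {j for j in ranked[:k] if overlaps[j] > 0}
    return [1 if j in winners else 0 for j in range(len(overlaps))]


x = [1, 1, 0, 0]
w_p = [
    [0.9, 0.8, 0.1, 0.2],  # column 0: connected to input bits 0 and 1
    [0.6, 0.2, 0.7, 0.1],  # column 1: connected to input bits 0 and 2
    [0.1, 0.3, 0.9, 0.8],  # column 2: connected to input bits 2 and 3
]
s = spatial_pooler(x, w_p, theta=0.5, k=1)
```

With these toy weights, column 0 has the largest overlap with the input, so it is the only column that fires when `k=1`.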
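One possible reading of this Hebbian rule is sketched below. We assume, following the usual description of the spatial pooler, that only the mini-columns that fired are updated, that their weights from firing inputs go up by δ and their weights from non-firing inputs go down by δ, and that weights are clipped to [0, 1]. The function name `hebbian_update` is hypothetical.

```python
def hebbian_update(w_p, x, active_columns, delta):
    """Return updated proximal weights (the input weights are not modified).

    For each mini-column that fired, strengthen weights from firing inputs
    and weaken weights from non-firing inputs. Other columns are untouched.
    """
    updated = []
    for row, is_active in zip(w_p, active_columns):
        if not is_active:
            updated.append(list(row))
            continue
        updated.append([
            min(1.0, w + delta) if xi else max(0.0, w - delta)
            for w, xi in zip(row, x)
        ])
    return updated


w_p = [[0.5, 0.5, 0.5],
       [0.2, 0.2, 0.2]]
x = [1, 0, 1]            # input bits 0 and 2 are firing
active_columns = [1, 0]  # only mini-column 0 fired
new_w_p = hebbian_update(w_p, x, active_columns, delta=0.1)
```

Note that there is no gradient and no error signal here: each weight changes based only on whether the two things it connects were active together.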
Attention (Temporal Memory)
The anticipated state A_{t+1} is predicted by a ‘temporal memory’ τ (similar to an attention mechanism). It uses the self-attention weights Wd to predict which columns will most likely fire on the next time step in this particular context, i.e. the columns that have active distal connections to the currently firing neurons H_t.
Here, the conditional gate g ‘bursts’ any mini-column that has no connected neurons, so that all the neurons in that column act as if they were connected (if none, then all).
Only the weights Wd which connect to the next active neurons S are optimised (using Hebbian learning); weights for all other neurons are left unchanged.
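The prediction and bursting steps might look like the sketch below. We make several simplifying assumptions: neurons are stored in one flat list grouped by mini-column, `w_d[i][j]` is the distal weight from neuron `j` to neuron `i`, and a single connected synapse to a firing neuron is enough to predict a neuron (Numenta's full algorithm uses dendritic segments with activation thresholds). All names are hypothetical.

```python
def temporal_memory(h_t, w_d, theta, cells_per_column):
    """Return the anticipated neuron states for the next time step.

    h_t : currently firing neurons (flat list of 0/1, grouped by mini-column)
    w_d : distal weights, w_d[i][j] = weight from neuron j to neuron i
    """
    n = len(h_t)
    # A neuron is predicted if it has a connected synapse to a firing neuron.
    predicted = [
        1 if any(w_d[i][j] > theta and h_t[j] for j in range(n)) else 0
        for i in range(n)
    ]
    # Gate g: a mini-column with no predicted neurons 'bursts' (if none, then all).
    anticipated = []
    for start in range(0, n, cells_per_column):
        column = predicted[start:start + cells_per_column]
        anticipated.extend(column if any(column) else [1] * cells_per_column)
    return anticipated


# Two mini-columns with two neurons each; neuron 0 is currently firing.
h_t = [1, 0, 0, 0]
w_d = [[0.1] * 4 for _ in range(4)]
w_d[2][0] = 0.8  # neuron 2 has learned to expect neuron 0
a_next = temporal_memory(h_t, w_d, theta=0.5, cells_per_column=2)
```

In this toy example, the second mini-column anticipates only neuron 2, while the first mini-column has no prediction and therefore bursts. With no context at all (no neurons firing), every mini-column bursts, which matches the all-1s representation of a context-free input.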
Since a mini-column ‘fires’ if any of its neurons are connected, only one neuron per mini-column needs to learn a distal connection (self-attentive weight); more than one would carry redundant information. Therefore, only the weight of the mini-column’s most connected neuron (the one with the strongest weight already) is incremented by some learning rate δ, and the weights of all other neurons in that mini-column are decremented. (Numenta also describes decrementing the weights of neurons which are connected to non-firing mini-columns H_bar; this is one of the few details omitted from this post.)
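A rough sketch of this update is given below. It rests on assumptions of our own: ‘most connected’ is measured as the summed weight from the previously firing neurons, ties go to the lower index, and weights are clipped to [0, 1]. The function name `distal_update` and the flat neuron layout are hypothetical.

```python
def distal_update(w_d, h_prev, next_active_columns, cells_per_column, delta):
    """Return updated distal weights (the input weights are not modified).

    In each mini-column that fires next, the neuron already most connected to
    the previously firing neurons is strengthened; its siblings are weakened.
    """
    updated = [list(row) for row in w_d]
    sources = [j for j, fired in enumerate(h_prev) if fired]
    for col, is_active in enumerate(next_active_columns):
        if not is_active:
            continue  # weights of non-firing columns are left unchanged
        cells = range(col * cells_per_column, (col + 1) * cells_per_column)
        best = max(cells, key=lambda i: sum(w_d[i][j] for j in sources))
        for i in cells:
            step = delta if i == best else -delta
            for j in sources:
                updated[i][j] = min(1.0, max(0.0, updated[i][j] + step))
    return updated


# Two mini-columns with two neurons each; neuron 0 fired, then column 1 fires.
w_d = [[0.3] * 4 for _ in range(4)]
w_d[2][0] = 0.4  # neuron 2 is already slightly more connected to neuron 0
new_w_d = distal_update(w_d, h_prev=[1, 0, 0, 0], next_active_columns=[0, 1],
                        cells_per_column=2, delta=0.1)
```

After the update, neuron 2 is more strongly tied to neuron 0 and its sibling neuron 3 is less so, which means that a single neuron in the mini-column comes to represent this particular context.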
The spatial pooler σ predictions S (i.e. feedforward predictions) are compared with the temporal memory τ predictions A (i.e. anticipated predictions based on self-attention weights). The predictions on which both agree make up the final contextualised predictions H (i.e. the hidden layer), which serve as the input to the next layer. This means that predictions which align with what was anticipated for the given context are given preference by the algorithm.
A prediction H can be mapped to an output SDR x by returning the layer’s ‘firing’ mini-columns: a mini-column is set to 1 if it contains one or more firing neurons, and to 0 otherwise.
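Assuming that S holds one bit per mini-column and A holds one bit per neuron (our own choice of layout), the joint prediction can be sketched as an element-wise AND. The function name `contextualise` is hypothetical.

```python
def contextualise(s, a, cells_per_column):
    """Return H: a neuron fires only if its mini-column is active in S
    and the neuron itself is anticipated in A."""
    return [s[i // cells_per_column] & a[i] for i in range(len(a))]


s = [0, 1]        # feedforward prediction: only mini-column 1 is active
a = [1, 1, 1, 0]  # anticipated neurons (mini-column 0 is bursting)
h = contextualise(s, a, cells_per_column=2)
```

Here only neuron 2 ends up firing: its mini-column was selected by the spatial pooler and it was the neuron anticipated for this context.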
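As a small illustration, and assuming H is stored as a flat list of neurons grouped by mini-column, the mapping back to an SDR could be written like this (the function name is hypothetical):

```python
def to_output_sdr(h, cells_per_column):
    """Collapse neuron states to one bit per mini-column:
    1 if any neuron in the mini-column is firing, else 0."""
    return [
        1 if any(h[start:start + cells_per_column]) else 0
        for start in range(0, len(h), cells_per_column)
    ]


h = [0, 0, 1, 0]  # two mini-columns with two neurons each
x_out = to_output_sdr(h, cells_per_column=2)
```

The output has one bit per mini-column, so it can be fed to the next layer as an ordinary SDR.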