NLP with Biologically-Inspired Neural Networks

Mohammed Terry-Jack

December 21, 2021

Is it possible to reverse engineer the human brain, or more precisely the neocortex? Numenta has proposed a novel machine learning algorithm that is a bold attempt at doing exactly that.

This biologically-inspired neural network, called Hierarchical Temporal Memory (HTM), differs in some major ways from the Artificial Neural Networks we have all become familiar with from Deep Learning.

For instance, HTM has:

Mini-columns (like a capsule in a capsule network)
Binary-valued neurons!
Two types of weights (analogous to Feedforward weights and Self-attention weights)

But aside from these differences in detail, HTM stands apart as an overall ML model because it performs:

Simple Hebbian Learning (no Complex Backpropagation Calculations)
A novel style of Unsupervised Learning (not Supervised Learning)
Online Learning (i.e. Continuously Learning Live)
Few-shot learning

In this post we shall attempt to demystify this powerful machine learning algorithm by relating its underlying components to analogous concepts in Deep Learning. In our next post we will use this biologically-inspired neural network to learn and solve a few NLP tasks.

Neurons

For an Artificial Neural Network (ANN), a layer simply contains neurons.

However, for Hierarchical Temporal Memory (HTM), a layer contains HTM capsules (mini-columns) — each of which contains HTM neurons (pyramidal cells).

Let there be n ‘pyramidal cells’ (HTM neurons) per ‘mini-column’ (HTM capsule) and let there be m mini-columns per ‘region’ (HTM layer).

*Image 1: (left) a pyramidal cell [HTM neuron], (centre) a mini-column [HTM capsule], (right) a region [HTM layer].*

*Image 2: Cells [HTM neurons] and (mini-)columns [HTM capsules] in a network with three regions [HTM Layers].*

Furthermore, while ANN neurons output continuous values, HTM neurons are simply binary (i.e. ‘firing’ or ‘non-firing’).

And since HTM neurons are binary-valued, inputs and outputs are encoded as a ‘sparse distributed representation’ (SDR) x, i.e. a sparse binary vector.

*SDRs encoding the words ‘cat’, ‘dog’ and ‘fish’.*
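
To make the idea of an SDR concrete, here is a minimal sketch that builds sparse binary vectors and counts the bits two of them share. The vector length, the number of active bits and the helper names (`random_sdr`, `overlap`) are our own assumptions for illustration. A real encoder would give related words overlapping bits rather than sampling them at random.

```python
import random

SDR_SIZE = 64      # assumed vector length
ACTIVE_BITS = 4    # assumed number of 1s (roughly 6% of the bits)


def random_sdr(rng, size=SDR_SIZE, active=ACTIVE_BITS):
    """Return a sparse binary vector with exactly `active` ones."""
    on = set(rng.sample(range(size), active))
    return [1 if i in on else 0 for i in range(size)]


def overlap(a, b):
    """Count the positions where both SDRs are 1."""
    return sum(x & y for x, y in zip(a, b))


rng = random.Random(0)
cat, dog = random_sdr(rng), random_sdr(rng)
print(sum(cat), overlap(cat, cat), overlap(cat, dog))
```

The overlap count is the usual way to compare two SDRs: the more active bits they share, the more similar the things they encode are taken to be.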

Weights

ANN neurons have weights (weighted synapses) that connect to other neurons, but HTM neurons have two different types of weighted connections:

‘Proximal connections’ Wp between neurons in different layers (analogous to inter-layer, feedforward weights), and
‘Distal connections’ Wd between neurons within the same layer (i.e. intra-layer / intra-column, lateral connections analogous to self-attention weights)

*Image 3: A region / layer depicting the two types of weights (lines) between cells / neurons (circles) arranged in mini-columns (rows). The proximal connections / feedforward weights (bottom) and distal connections / self-attention weights (top).*

Layers

Each HTM layer produces two predictions:

1) A feedforward prediction H, which is passed on to the next layer

2) A self-attention prediction A, determined by the input’s context within a sequence (an input with no context yet is represented by all mini-columns firing at once, i.e. all 1s)

Feedforward Predictions (Spatial Pooler)

*Image 5: A network with two regions (layers) with the cell (neuron) and mini-column labelled. The arrows show the direction of the proximal connections (feedforward weights) similar to other feedforward neural networks.*

The spatial pooler σ projects a layer’s firing neurons H into a representational space using feedforward weights Wp (similar to mainstream feedforward neural networks).

A high-pass filter f binarises the weights, determining whether each one is ‘connected’ (i.e. above some predefined threshold θ) or not. (The binarisation filter f also acts as the non-linear activation function, since thresholding to binary values is itself non-linear.)
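
The feedforward step can be sketched in a few lines of plain Python. This is only an illustration under our own assumptions: the threshold value, the toy weights and the function names are hypothetical, and picking the `k` mini-columns with the largest overlap is borrowed from common descriptions of the spatial pooler rather than stated above.

```python
THETA = 0.5   # assumed connection threshold
K = 2         # assumed number of winning mini-columns


def binarise(weights, theta=THETA):
    """High-pass filter f: 1 where a weight exceeds theta, else 0."""
    return [[1 if w > theta else 0 for w in row] for row in weights]


def spatial_pooler(x, w_p, k=K, theta=THETA):
    """Return a binary vector marking the k mini-columns with most overlap."""
    connected = binarise(w_p, theta)
    # Overlap = number of firing inputs that reach a column via connected weights.
    overlaps = [sum(c & xi for c, xi in zip(row, x)) for row in connected]
    # Rank columns by overlap (ties go to the lower index) and keep the top k.
    winners = sorted(range(len(overlaps)), key=lambda j: (-overlaps[j], j))[:k]
    return [1 if j in winners else 0 for j in range(len(overlaps))]


x = [1, 1, 0, 0]                      # input SDR
w_p = [[0.9, 0.8, 0.1, 0.2],          # one row of weights per mini-column
       [0.6, 0.1, 0.7, 0.1],
       [0.2, 0.3, 0.9, 0.9]]
s = spatial_pooler(x, w_p)
print(s)
```

Because the weights are binarised before use, the overlap is just a count of shared 1s, which keeps the forward pass cheap.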

Weights are iteratively optimised using Hebbian learning (unlike mainstream neural networks, which use a more compute-intensive optimisation known as backpropagation). Weights for all ‘firing’ neurons H are incremented by the learning rate δ, while weights for ‘non-firing’ neurons H_bar are decremented.
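
A minimal sketch of this Hebbian rule follows. The learning rate, the clipping of weights to [0, 1] and the choice to update only the rows of winning mini-columns are assumptions we make for the example, and `hebbian_update` is a hypothetical name.

```python
DELTA = 0.1  # assumed learning rate


def hebbian_update(w_p, x, winners, delta=DELTA):
    """Strengthen weights from firing inputs, weaken those from silent ones.

    Only rows of winning mini-columns are touched (an assumption), and
    weights are clipped to the range [0, 1].
    """
    updated = []
    for row, won in zip(w_p, winners):
        if not won:
            updated.append(list(row))  # losing columns keep their weights
            continue
        updated.append([
            min(1.0, w + delta) if xi else max(0.0, w - delta)
            for w, xi in zip(row, x)
        ])
    return updated


w_p = [[0.45, 0.95, 0.05], [0.5, 0.5, 0.5]]
new_w = hebbian_update(w_p, x=[1, 1, 0], winners=[1, 0])
print(new_w)
```

Note that no gradients are involved: each weight changes using only the activity at its two ends.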

Attention (Temporal Memory)

The anticipated state A_{t+1} is predicted by a ‘temporal memory’ τ (similar to an attention mechanism), which uses the self-attention weights Wd to predict the columns that will most likely fire on the next time step in this particular context (i.e. the columns that have active distal connections to the currently firing neurons H_t).

*Image 7: The ‘connected’ (red lines) distal connections (self-attention weights) between firing neurons (red circles) in a region (layer).*

A conditional gate g ‘bursts’ any mini-column that has no connected neurons, so that all the neurons in that column act as if they were connected (if none, then all).

*Image 8: Mini-columns of neurons ‘bursting’ because they do not have any ‘connected’ weights (if none, then all).*
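
The prediction step and the bursting gate can be sketched together. This is a simplified illustration, not Numenta's implementation: we assume cells are stored in one flat list grouped by mini-column, that a single connected weight from a firing cell is enough to make a cell predictive, and the names and threshold are hypothetical.

```python
THETA = 0.5  # assumed connection threshold


def temporal_memory(h_t, w_d, n_cells, theta=THETA):
    """Predict which cells are expected to fire at the next time step.

    h_t : flat binary list of currently firing cells, grouped by mini-column
    w_d : w_d[i][j] is the distal weight from cell j to cell i
    """
    predictive = [
        1 if any(f and w > theta for f, w in zip(h_t, row)) else 0
        for row in w_d
    ]
    # Gate g: a mini-column with no predictive cell bursts (if none, then all).
    anticipated = []
    for start in range(0, len(predictive), n_cells):
        column = predictive[start:start + n_cells]
        anticipated.extend(column if any(column) else [1] * n_cells)
    return anticipated


# Two mini-columns of two cells each; cell 0 fires and is connected to cell 3.
w_d = [[0.0] * 4 for _ in range(4)]
w_d[3][0] = 0.8
a_next = temporal_memory([1, 0, 0, 0], w_d, n_cells=2)
print(a_next)
```

With no learned weights at all, every mini-column bursts and the result is all 1s, which matches the context-free case described under Layers.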

Only the weights Wd that connect to the next active neurons S are optimised (using Hebbian learning); weights for all other neurons are left unchanged.

Since a mini-column ‘fires’ if any of its neurons are connected, only one neuron per mini-column needs to learn a distal connection (self-attention weight); more than one would be redundant. Therefore, only the mini-column’s most connected neuron (the one that already has the strongest weight) is incremented by the learning rate δ, and all other neurons in that mini-column are decremented. (Numenta also describes decrementing neurons that are connected to non-firing mini-columns H_bar, but that is one of the few details omitted here.)
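
One way to sketch this update rule is shown below. It rests on several assumptions of ours: ‘most connected’ is read as the largest summed weight from the previously firing cells, ties go to the first cell, weights are clipped to [0, 1], and `learn_distal` is a hypothetical name.

```python
DELTA = 0.1  # assumed learning rate


def learn_distal(w_d, h_prev, active_columns, n_cells, delta=DELTA):
    """Reinforce one cell per active mini-column and weaken its siblings.

    w_d[i][j] is the distal weight from cell j to cell i. Returns a new
    weight matrix and leaves the input untouched.
    """
    w = [list(row) for row in w_d]
    firing = [j for j, f in enumerate(h_prev) if f]
    for col in active_columns:
        cells = range(col * n_cells, (col + 1) * n_cells)
        # Pick the cell with the largest summed weight from firing cells.
        best = max(cells, key=lambda i: sum(w[i][j] for j in firing))
        for i in cells:
            step = delta if i == best else -delta
            for j in firing:
                w[i][j] = min(1.0, max(0.0, w[i][j] + step))
    return w


# Two mini-columns of two cells; cell 0 fired before column 1 became active.
w_d = [[0.0] * 4 for _ in range(4)]
w_d[2][0], w_d[3][0] = 0.2, 0.4
new_w_d = learn_distal(w_d, h_prev=[1, 0, 0, 0], active_columns=[1], n_cells=2)
print(new_w_d[2][0], new_w_d[3][0])
```

Over repeated presentations the favoured cell pulls ahead of its siblings, so a single cell per mini-column ends up representing a given context.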

Output Predictions

*Image 9: A region (layer) depicting cells (neurons) which are red for ‘firing’ (connected via feedforward weights / spatial pooler predictions S) and yellow for ‘predictive’ (connected via self-attention weights / temporal memory predictions A). The output H are those neurons which are both red and yellow (jointly predicted).*

The spatial pooler σ predictions (i.e. feedforward predictions) S are compared with the temporal memory τ predictions (i.e. anticipated predictions based on self-attention weights) A. The predictions they share make up the final contextualised prediction (i.e. hidden layer) H, which serves as the input to the next layer. This means the algorithm favours predictions that align with what was anticipated for the given context.
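
In code, the joint prediction is an element-wise AND. The sketch below assumes S is given per mini-column and A per cell (in one flat list grouped by mini-column), so each column's value is applied to all of its cells. That layout and the name `contextual_output` are our own choices for illustration.

```python
def contextual_output(s_columns, a_cells, n_cells):
    """Keep the cells that sit in a firing column and were also anticipated."""
    return [a & s_columns[i // n_cells] for i, a in enumerate(a_cells)]


s = [0, 1]           # spatial pooler: only mini-column 1 fires
a = [1, 1, 0, 1]     # temporal memory: column 0 burst, cell 3 was predicted
h = contextual_output(s, a, n_cells=2)
print(h)
```

If the firing column had burst instead, all of its cells would survive the AND, signalling an input that arrived without a learned context.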

*Image 10: A region/layer updated over three time steps, with the second time step uncovered to reveal how the temporal memory* τ *updates interact with the spatial pooler* σ *updates.*

A prediction H can be mapped to an output SDR x by returning the layer’s ‘firing’ mini-columns (i.e. mini-columns containing one or more firing neurons = 1).
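
This mapping amounts to an OR over the cells of each mini-column. A minimal sketch, assuming the same flat, column-grouped layout of cells (the name `to_sdr` is hypothetical):

```python
def to_sdr(h_cells, n_cells):
    """Mark each mini-column that contains at least one firing cell."""
    return [
        1 if any(h_cells[start:start + n_cells]) else 0
        for start in range(0, len(h_cells), n_cells)
    ]


# Three mini-columns of two cells each.
x_out = to_sdr([0, 0, 0, 1, 1, 1], n_cells=2)
print(x_out)
```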
