Information Theory and Artificial Intelligence

Mohammed Terry-Jack

July 27, 2023

9 min read

Artificial Intelligence (AI) and Machine Learning (ML) differ from other fields as they tend to lack a coherent theoretical basis. In fact, they have often been compared with alchemy than science, with ML practitioners experimenting on combinations of mathematical and computational techniques by trial and error, tweaking and fine-tuning various hyperparameters (and/or prompts) until, by some touch of luck, they stumble upon that perfect recipe that magically works and should no more be touched!

Information Theory (IT), by contrast, has a strong theoretical and mathematically rigorous basis and it has become well-respected and oft-utilised in many scientific fields (such as physics, communication theory, dynamical systems, etc)

In brief, it is a theory which formalises the notion of “information” and defines what it could be, how it can be quantified, in which ways it may underpin and influence a system, etc. The field actually shares a common origin story with AI (both stemming from early work in Cybernetics). However, they both quickly branched off and parted ways, flourishing into their own respective fields which continue to grow to this day

Recently there has been a push in academia for research which combines Information Theory with Machine Learning.

For instance, the popular Elsevier journal Physica D: Nonlinear Phenomena released some special issues dedicated to work at the interface of Machine Learning and Algorithmic Information Theory (ML and AIT), of which we contributed a couple of papers (e.g. 0–1 Test for Chaos… and FT-bounded Kolmogorov Complexity)

In a recent paper entitled “Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors, the authors combine an information- theoretic distance metric with a classical ML algorithm — the k-nearest-neighbours (KNN) classifier — to demonstrate that even this very simple classifier, when combined with the power of Information Theory, can outperform SOTA Deep Learning (DL) models! In the authors’ own words, their approach is:

“a non-parametric alternative to DNNs that’s easy, lightweight, and universal in text classification” with “results that are competitive with non-pretrained deep learning methods on six in-distribution datasets. It even outperforms BERT” and their “method also excels in the few- shot setting, where labeled data are too scarce to train DNNs effectively”.

To summarise their approach, they combine a KNN classifier with a custom metric that measures the information distance between strings of text. To measure the information distance of texts they rely on a concept from AIT called Kolmogorov Complexity.

In essence, the length of a string of text may be long and yet contain very little information (e.g. “aaaaaaaaaaaaaaaaaaaa”) but it will also contain a lot of redundancy (i.e. repetitions, patterns, etc) which can utilised to rewrite the string into a more concise and efficient form and so, the length of its summarised representation would end up being very short (e.g. “a”*20). However, if the original string does not contain any redundancy, then its summary form would remain long and it can be safely assumed that the string contains more information.

In short (pun-intended), the length of a string’s optimally summarised representation is an accurate indication to the amount of information it contains. (For more detailed examples and definitions of Kolmogorov Complexity, please see here and here)

A visualisation to show the relative space taken by strings of text both before (left) and after (right) a compressor like gzip has been applied.

Now the Kolmogorov Complexity (K) is technically incomputable, however, it can be approximated using any lossless compressor (the authors use gzip as their compressor of choice).

>> 23
>> 23
>> 23

The information distance presented in the paper is defined as follows:

which can be “interpreted as how many bytes do we still need to encode y based on the information of x”. In other words, if two texts (x and y) are identical (informationally speaking) then no additional bytes (of information) would be required to encode text y than the bytes already used to encode text x.

So if there is no difference (informationally) between two strings of text, their information distance from one another is 0

>> 0
>> 0

Similarly, the more different they are (informationally), the further their information distance grows.

>> 7

The information distance is not upper bounded, so the final information-theoretic metric presented in the paper was normalised (to be within 0 and 1) to give the Normalised Compression Distance (NCD)

(where the notation C here refers to the compressor-based approximation of K — since the ideal K is an incomputable function)
>> 0.25925925925925924

However, for most cases, using the information distance (without normalisation) works just as fine. Here is an example where we can use it to rank texts:

Lets use a list of food items for the text

>> [‘Shrimp and prawn as food’]
>> [‘Cheeseburgers’, ‘Hamburgers’]
>> [‘Chicken balls’, ‘Chicken nuggets’, ‘Chicken steak’, ‘Chicken strip

Now lets use the information distance in combination with a KNN classifier.

The KNN classifier from Scikit learn is convenient to use, however, it requires vectors as inputs (not strings or bytes) and since our metric takes in strings (not vectors) we need to momentarily convert the strings into vectors and back to strings again (just to bypass the sklearn input format validation checks — or alternatively we must code up our own KNN classifier as the authors of the paper did)

>> “testing”

Now lets get the training data. We use the huggingface api to download nlu_evaluation_data

Now fit the KNN classifier (super quick)

>> KNeighborsClassifier(metric=<function <lambda> at 0x13c78e940>)

Et voila! Our classifier is now ready to be tested

>> ’iot_hue_lightdim’
>> sample: 77
text: could you order sushi for tonight dinner
expected: takeaway_order
predicted: takeaway_order

In conclusion, the intersection of Information Theory and Machine Learning has shown tremendous potential in enhancing classical machine learning algorithms. While AI and ML have historically relied on experimental techniques, the infusion of information-theoretic principles brings a new level of mathematical rigour and efficiency to the field. Such a potent interdisciplinary mix will undoubtedly open up new avenues for research and innovation and contribute to the maturity and evolution of AI.

It’s an exciting time for the world of AI, where the magic of algorithms and the depth of information theory converge to create cutting-edge solutions that push the boundaries of what’s possible.

Any comments or questions? Let us know what you think.

Follow us also on Twitter and LinkedIn for more.

Share with others

Read this post on Medium


Read More