# Probabilistic Matrix Factorisation Mohammed Terry-Jack

August 12, 2019

### Information Retrieval

Information Retrieval (IR) is widely used to solve a range of problems, including:

• Recommendations (i.e. matching Users to Items — e.g. Movies, Songs, etc.)
• Classification (i.e. matching Inputs to Classes)
• Dialogue Systems (i.e. matching Utterances to Responses)

To demonstrate how an IR-based dialogue assistant may work, let’s suppose we have a set of utterances stored along with their corresponding replies.

<pre> utterance_replies = {
"this is a test"           :  "good luck",
"i have a cat"             :  "aw thats nice",
"i like dogs"              :  "what type of dog do you have?",
"i want to go for a walk"  :  "where to?",
} </pre>

We embed each utterance into a vector space.

U = np.array([embedding_spacy(utterance).vector for utterance in utterance_replies])</pre>

And any new utterances are also embedded into this same vector space so that they become comparable.

<pre>new_utterances = [
"this is an exam",
"i have a kitten",
"i love puppies",
"i want to go for a stroll",
]V = np.array([embedding_spacy(utterance).vector for utterance in new_utterances])</pre>

For each new utterance, we find the most similar utterance stored (i.e. its nearest neighbour). We could do this by naively looping through all utterance vectors and comparing them using some distance metric (i.e. cosine similarity, euclidean distance, etc).

<pre>from scipy.spatial.distance import cosinefor v,utterance in zip(V,utterance_replies):
results = sorted([(cosine(v,u), new_utterance ) for u,new_utterance in zip(U,new_utterances)])
print(utterance, results)</pre>

However, this is quite inefficient and fails to scale to larger knowledge bases (i.e. comparing millions of utterances would be slightly time consuming).

### Dot Product

Instead, we compute the dot product to compare all the new utterances (V) with all the known utterances (U) in a single operation.

<pre>R = U @ V.T</pre>

<pre>R_normalised = R.div(R.sum(axis=1), axis=0)
nearest_neighbour = R_normalised.idxmax()</pre>

We now use the retrieved nearest neighbours to return their reply as the reply to the new incoming utterance.

E.g. “I love puppies” (~ “I love dogs” — > “what type of dog to you have?”)

<pre>utterance_replies[nearest_neighbour["i love puppies"]]</pre>

E.g. “I want to go for a stroll” (~ “i want to go for a walk” — > “where to”)

<pre>utterance_replies[nearest_neighbour["i want to go for a stroll"]]
</pre>

In fact, we can even cut out the middleman and forget the nearest neighbour part altogether. Instead of retrieving the most similar utterance to get its reply, we can retrieve replies directly by comparing them with the new utterances.

x` → x → y (nearest neighbour approach)

x` → y (direct comparison)

But in order to do that, we must have the replies embedded in the same vector space as the utterances so that they are comparable. Matrix Factorisation is one way to do this.

### Matrix Factorisation

Rather than using the dot product operation to use two matrices (U and V) to produce a single matrix (R), now we shall be starting with a single matrix (R) and factorising it into the two matrices (U and V —our utterance and reply vectors, or question and answer vectors, or user and item vectors, etc).

To illustrate this, we shall use IR on a classification task (such as sentiment analysis). Suppose we have a matrix (R) with example sentences that have been classified into one of three possible classes (positive, neutral or negative).

<pre> sentences = [
"How are you",
"I like it",
"I hate you",
"Shut up",
"how are you",
"thanks a lot",
]classes = ["Positive","Neutral","Negative"]
</pre>

<pre>R.T.idxmax()</pre>

We could factorise R (the initial matrix) to obtain both matrices U (sentence vectors) and V (class vectors), but since we already have suitable manner to obtain vector representations for our sentences (U), we shall keep these fixed and embed our class labels (V) into the same vector space by obtaining the “best fit” for V using the pseudo-inverse of U.

<pre>V = np.linalg.pinv(U) @ R </pre>

Et Voila! Now that we’ve embedded the class labels, the classification process boils down to retrieving the most similar class for each sentence.

<pre> test_inputs = [
"Hows things",
"I really like it",
"I hate her",
"how you doing",
"thank you my friend",
]U_new = np.array([embedding_spacy(test).vector for test in test_inputs])R_hat = U_new.T @ V
R_hat.T.idxmax()
</pre>

Doing classification via embeddings and IR (rather than more traditional approaches with classifiers), has multiple advantages. It is extremely fast during inference and it solves the multi-label classification problem! So it isn’t just good at handling multiple classes (multi-class classification ) but it can also predict multiple class labels per prediction (multi-label classification) e.g. “Hi, how are you” = “Greeting” + “Other”).

<pre> for new_sentence in R_hat.T:
print(new_sentence)
print(R_hat[new_sentence].sort_values(ascending=False))
</pre> Multi-Label classification: Notice how the predicted class labels for these sentences: “Hi how are you” and “Hi” are both “greetings” but the former has an additional label prediction of “other”.

If we didn't have an appropriate embedding method to convert the sentence vectors (U) either, we can simply use Matrix Factorisation to obtain both the sentence vectors (U) and label vectors (V). We would start with randomly initialised vectors and iteratively alternate this “fitting” process (by fixing one matrix of vectors while changing the other, then fixing the other and changing the first, etc).

<pre> for _ in range(5):
V = np.linalg.pinv(U.T) @ R
U = np.linalg.pinv(V.T) @ R.T
</pre>

This is almost how Alternating Least Squares (ALS) works (except the pseudo-inverse is not used to minimise the error).

### Probabilistic Matrix Factorisation

Creating the initial matrix (R) is usually an ongoing process which means R will be changing continually as it is updated with ever newer information, and if it is a large matrix, values are likely to be missing (NaN), making it a sparse (incomplete) matrix.

We might want to impute the missing values, but such guesses introduce error into our matrix and unnecessarily converts our high-quality sparse matrix into a low-quality dense matrix. Instead, it's easier, safer and more computationally efficient, to approximate the matrix factorisation process from the few known values we have via probabilistic approaches like Stochastic Gradient Descent (SGD) (which is good at generalising to unseen data, whereas the pseudo-inverse would overfit to the known values).

Essentially, SGD is used to minimise the error (difference between the observed Matrix R and the one predicted by the dot product of the matrix factors (R_predicted = U @ V.T)). Ideally, the predicted and real matrices should be identical (error = R-R_predicted = 0). We can use the squared error too, error² = (R-R_predicted)² . We can also add in some regularisation terms to penalise overfitting, error = (R-R_predicted)² + lambda_normalisation * (U² + V²).

Let’s imagine we want to get user vectors (U) and film vectors (V) from the above sparse matrix of user-film ratings (R). We start by randomly initialising our user vectors (U) and film vectors (V) with N features (e.g. N is essentially how many dimensions each vector will have and can be as small or large as you like — in practise, its usually set somewhere between 50 to 300).

<pre> n_features = 300

U = np.random.normal(scale=1./n_features, size=(len(users), n_features))

V = np.random.normal(scale=1./n_features, size=(len(films), n_features))
</pre>

We then prepare the training samples

<pre> training_samples = [(i, j, R[i, j]) for i in range(len(users)) for j in range(len(films)) if R[i, j] > 0]
</pre>

and set the hyper-parameters for our learning algorithm (SGD)

<pre> learning_rate = .1
lambda_regularisation = .01
iterations = 30
</pre>

And iteratively fine-tune U and V

<pre> for _ in range(iterations):
np.random.shuffle(training_samples)
for i, j, r in training_samples:
r_hat = U[i, :] @ V[j, :].T
error = (r - r_hat)
U[i, :] += learning_rate * (error * V[j, :] - lambda_regularisation * U[i,:])
V[j, :] += learning_rate * (error * U[i, :] - lambda_regularisation * V[j,:])
</pre>

So there we have it — approximated vectors that embed both the users and the items (U and V) in the same vector space for direct comparison. Using the dot product on them will produce a resultant matrix (R_hat) that is nearly identical to the known values in the user-item matrix (R).

I hope this gentle introduction to Probabilistic Matrix Factorisation gives a better idea on how it works and why it is so useful.

## Featured 