Methods to Grow Your Own Data Sets for Conversational AI

Mohammed Terry-Jack

January 14, 2020

9 min read

Scaling from tiny to large amounts of NLP and Dialogue data.

Supervised Learning algorithms require a significant amount of labelled training examples to properly approximate a function that is robust to the richness and variation inherent in natural languages. Providing too few examples may produce a model which fails to generalise to underlying patterns, making it brittle and easily broken when exposed to unseen examples encountered in the wild.

However, collecting and annotating training data within a domain demands considerable time and resources. Fortunately, there are a range of high-quality augmentation techniques to artificially inflate textual datasets, including methods using state-of-the-art language models like BERT and GPT-2.

Recent Language Models within NLP

Contents

1. Generating Longer Conversations (using GPT-2)

2. Inserting Words (using BERT)

3. Back-Translation (aka Spinning)

4. Substituting Synonyms (with POS filtering)

5. Shifting

1. Generating Longer Conversations (using GPT-2)

Given a tiny dataset of just three conversations:


OpenAI’s massive GPT-2 language model was trained on so much data that it is able to generate very realistic sentences. We can use this fact to produce new variant examples by extending each conversation’s final sentence (e.g. “i’m just in a bad mood” → “…because I lost in the qualifiers”):


Similarly, we can use it to append additional sentences to the end of the conversation too. In fact, each time you run this language model you get slightly different results, so you could re-run this augmentation method multiple times to introduce even more variant conversations into your dataset.


First we will need to install GPT-2-simple, an open-source library designed to make accessing this powerful language model very easy. We download the 774M parameter version of the model and load it up.

<pre> !pip3 install gpt-2-simple
from nltk.tokenize import sent_tokenize
import gpt_2_simple as gpt2

model_name = "774M"
gpt2.download_gpt2(model_name=model_name)

# keep a reference to the TensorFlow session so we can reuse it when generating
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(
  sess,
  model_name=model_name
)
</pre>

We create a small function which takes an example conversation as input and calls GPT-2 to generate roughly the next 100 words it thinks could follow on from this conversation (we have chosen to extend the conversation by a single sentence [:n+1], but feel free to modify this to extend the conversation further).

<pre> def _extend_conversation(conversation_as_string):
 generated_samples = gpt2.generate(
   sess,
   model_name=model_name,
   prefix=conversation_as_string,
   length=100,
   return_as_list = True
 )
 n = len(
   sent_tokenize(
     conversation_as_string
   )
 )
 return sent_tokenize(
   generated_samples[0]
 )[:n+1]
</pre>
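
As a rough usage sketch (the tiny dataset below is purely illustrative, as are the names tiny_dataset and augmented_dataset), each conversation can be joined into a single string, extended, and the result appended back as a new training example:

<pre> # illustrative only: a hypothetical tiny dataset of conversations
tiny_dataset = [
  ["hi", "hey, how are you?", "i'm just in a bad mood"],
]

augmented_dataset = list(tiny_dataset)
for conversation in tiny_dataset:
  # the extended conversation (original sentences plus one generated sentence)
  # becomes an additional training example
  augmented_dataset.append(
    _extend_conversation(" ".join(conversation))
  )
</pre>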

2. Inserting Words (using BERT)

Google’s BERT is another powerful language model which has revolutionised NLP, but it is trained slightly differently from other language models like GPT-2. This difference makes it well suited to predicting masked words: a word is masked (hidden) and BERT uses the surrounding words to predict what the masked word could be. E.g. “One day she [MASK] down the hall” → “One day she ran down the hall”

We can insert masks between words in a complete sentence and trick BERT into predicting new words, extending the sentence from the middle (as opposed to the end). E.g. “the fox” → “the [MASK] fox” → “the brown fox” → “the [MASK] brown fox” → “the striped brown fox” …



Therefore, we use BERT by iteratively placing a mask between every pair of words in a given conversation to produce multiple variant conversations. We could even repeat this process on the newly created variant conversations to produce even more variations.

Variant conversations produced by BERT


To use this method, we must first install the open-source library pytorch-pretrained-bert, and download the language model along with its accompanying tokeniser.

<pre> !pip3 install -U pytorch-pretrained-bert
from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM
import torch

model_name = 'bert-base-uncased'
bert_tokeniser = BertTokenizer.from_pretrained(model_name)
bert_model = BertForMaskedLM.from_pretrained(model_name)
bert_model.eval()  # switch off dropout so predictions are deterministic
</pre>

We also need to create a function which appropriately formats the input string with special tokens ([CLS], [SEP]), splits the string into tokens (using the model’s accompanying tokeniser) and then inserts a special mask token ([MASK]) to indicate the word we wish the model to predict.

<pre> def _format_model_input(text, tokeniser, insert_mask_at_idx):
 tokens = tokeniser.tokenize(
   f"[CLS] {text} [SEP]"
 )
 tokens_with_mask = tokens[:insert_mask_at_idx] + [
   "[MASK]"
 ] + tokens[insert_mask_at_idx:]
 return torch.tensor(
   [
     tokeniser.convert_tokens_to_ids(tokens_with_mask)
   ]
 )
</pre>

For the output of the model, we want to return the sentence with an additional word inserted between two other words. Therefore, we create a function which converts the token indexes back into words (using the same tokeniser), and then fetches the token index predicted by the model at the location of the mask token (converting it to a word in the same way). We then join the words together into a single string and clean it up a bit by removing any special tokenisation symbols (e.g. ##, the [CLS] at the beginning, the [SEP] at the end, etc.).

<pre> def _format_model_output(model_output, token_idxs, tokeniser, masked_idx):
 tokens = tokeniser.convert_ids_to_tokens(
   token_idxs.tolist()[0]
 )
 tokens[masked_idx] = tokeniser.convert_ids_to_tokens(
   [
     torch.argmax(
       model_output[0, masked_idx]
     ).item()
   ]
 )[0]
 return ' '.join(tokens[1:-1]).replace("##","")
</pre>

We now define a function to connect everything together: the input formatting, the BERT model and the output formatting.

<pre> def _insert_mask_and_predict(sentence, model, tokeniser, masked_idx):
  tokens_with_mask_inserted = _format_model_input(
    text = sentence,
    tokeniser = tokeniser,
    insert_mask_at_idx = masked_idx,
  )
  # a single segment, so every token gets segment id 0
  segment_ids = torch.zeros_like(tokens_with_mask_inserted)
  with torch.no_grad():
    return _format_model_output(
      model_output = model(
        tokens_with_mask_inserted,
        segment_ids
      ),
      tokeniser = tokeniser,
      token_idxs = tokens_with_mask_inserted,
      masked_idx = masked_idx,
    )
</pre>


You may have noticed that the function requires you to specify the position at which you want to insert the mask. Well, why not try each and every position in turn? That is exactly what this next function does: it iteratively creates variants by placing a mask at every index in the sentence until it reaches the end and fails, at which point it returns all the newly created variants.

<pre> def _insert_words(example):
  new_examples = [example]
  idx = 1
  try:
    while True:
      new_examples.append(
        _insert_mask_and_predict(
          sentence = example,
          model = bert_model,
          tokeniser = bert_tokeniser,
          masked_idx = idx
        )
      )
      idx += 1
  except IndexError:
    # we have run past the end of the sentence;
    # discard the final variant, whose mask fell beyond the last word
    new_examples.pop()
    return new_examples
</pre>
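
As a quick, purely illustrative usage sketch (the example sentence is our own), each sentence of a conversation can be passed through this function and the resulting variants collected:

<pre> # illustrative only: produce word-insertion variants of a single sentence
for variant in _insert_words("i'm just in a bad mood"):
  print(variant)
</pre>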

3. Back-Translation (aka Spinning)

Back translation (a.k.a. spinning) uses a machine translation model to translate a sentence into a foreign language and then back again into the original language. We found this method very good at producing natural-sounding sentences which are grammatically consistent yet slightly different from the original.


Spinning the text using Spanish (top) or Arabic (bottom) as the foreign language.


First we need to import textblob, or any other open-source library with access to free translation.

<pre> from textblob import TextBlob </pre>

Next, we define a function which translates the sentence into some specified foreign language and back again into English (the assumed original language). This could also be extended to include translations into more than one foreign language before being translated back into English.

<pre> def _spin_text(text, foreign_language):
 try:
   spun_text = _clean_word(
     TextBlob(
       TextBlob(text).translate(
         from_lang="en",
         to=foreign_language
       ).raw
     ).translate(
       from_lang=foreign_language,
       to="en"
     ).raw
   )
   return spun_text if spun_text != _clean_word(text) else None
 except:
   return None
</pre>

If the translation fails, or the spun sentence turns out identical to the original (disregarding formatting or punctuation changes), then the function returns None.

<pre> from string import punctuation

def _clean_word(word):
  return word.lower().strip(punctuation)
</pre>
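
A minimal usage sketch (note that TextBlob’s translate method calls an online translation service, so this assumes an internet connection and a TextBlob version which still supports translation; the example sentence is illustrative):

<pre> # illustrative only: spin a sentence through Spanish and back into English
spun = _spin_text("i'm just in a bad mood", "es")
if spun is not None:
  print(spun)
</pre>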

4. Substituting Synonyms (with POS filtering)

A more classical technique is to pick a word in the sentence and substitute it for one of its synonyms. This method can produce a huge number of variations by substituting one word in the conversation at a time (e.g. “I’m just in a bad mood” → “I’m simply in a bad mood”).

“hi …” → “how do you do …”


“I’m just in a bad mood” → “I'm simply in a bad mood”


We can scan through a sentence and substitute each word with its synonyms to produce a variant sentence.


In fact, substituting combinations of words will lead to an exponentially large pool of variants for any given example. However, this method needs careful filtering (we show you one filtering technique using POS tags below) since many of these variants will not sound natural if synonyms are substituted blindly. For instance, “a bad mood” → “a unsound mood” or “nice to meet you Carla” → “nice to conform to you Carla”, etc.


We first download the relevant files from NLTK (an older yet giant NLP library).

<pre> import nltk
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import wordnet as wn
</pre>

We define a function which fetches synonyms for a word from WordNet (a massive, hand-curated web of words and their various relations with one another).

<pre> def synonyms(word, pos_tag):
 return list(
   {
     lemma.replace("_"," ").replace("-"," ") for synset in wn.synsets(
       _clean_word(word),
       pos_tag,
     ) for lemma in synset.lemma_names()
   }
 )
</pre>

We have added a filter here which takes into account the word’s part-of-speech before fetching any synonyms (i.e. is it a noun, verb, adjective, adverb, etc). For example, the word “test” can be used as a noun or a verb:

Synonyms for the Noun “test”


Synonyms for the Verb “test”
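
For illustration, the two lists of synonyms above can be fetched with calls like these (the exact output depends on the installed WordNet version):

<pre> print(synonyms("test", wn.NOUN))  # synonyms for the noun "test"
print(synonyms("test", wn.VERB))  # synonyms for the verb "test"
</pre>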


We then define another function to automatically infer the part-of-speech tag:


Fortunately, NLTK has some pre-trained POS taggers which considerably simplify our lives:

<pre> def _infer_pos_tags(tokens):
 return [
   (
     token,
     _convert_nltk_to_wordnet_tag(nltk_tag)
   ) for token,nltk_tag in nltk.pos_tag(tokens)
 ]
</pre>

This function takes in tokens, as opposed to a string, so be sure to split the string into words or use one of NLTK’s provided tokenisers (nltk.word_tokenize(some_string_to_be_tokenised)). The function outputs the resulting POS tags returned by the tagger but first converts them into the POS notation required for compatibility with WordNet:

<pre> def _convert_nltk_to_wordnet_tag(pos_tag):
 if pos_tag.startswith("N"):
   return wn.NOUN
 if pos_tag.startswith("V"):
   return wn.VERB
 if pos_tag.startswith("R"):
   return wn.ADV
 if pos_tag.startswith("J"):
   return wn.ADJ
</pre>
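
Putting these pieces together, a rough sketch of the whole substitution step might look as follows (the helper name _substitute_synonyms is our own, purely illustrative):

<pre> def _substitute_synonyms(sentence):
  # tag each token with a WordNet-compatible part-of-speech tag
  tagged_tokens = _infer_pos_tags(nltk.word_tokenize(sentence))
  variants = []
  for i, (token, pos_tag) in enumerate(tagged_tokens):
    if pos_tag is None:
      continue  # skip words whose POS has no WordNet equivalent
    for synonym in synonyms(token, pos_tag):
      if _clean_word(synonym) == _clean_word(token):
        continue  # skip the word itself
      # swap in the synonym at position i to create one variant sentence
      new_tokens = [t for t, _ in tagged_tokens]
      new_tokens[i] = synonym
      variants.append(" ".join(new_tokens))
  return variants
</pre>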

5. Shifting

Our final technique is a very basic one which can be applied to any time-series data (like conversations). The position of each sentence (or data point) in the conversation (training example) is simply shifted (offset) by one place to produce a valid variant. E.g. “hi”, “how are you”, “fine thanks”… → “how are you”, “fine thanks”…

You can also combine time-series samples (conversations) by appending them to each other, producing a “new”, longer example:
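
A minimal sketch of both ideas, treating a conversation as a list of sentences (the function names _shift and _combine are illustrative):

<pre> def _shift(conversation):
  # drop the first sentence so every remaining one moves up a place
  return conversation[1:]

def _combine(conversation_a, conversation_b):
  # append one conversation to another to form a longer "new" example
  return conversation_a + conversation_b
</pre>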


Finally, if you wish to make your model robust to textual errors which can occur in real-world scenarios, you can create variants by inserting textual noise (e.g. random spelling mistakes, additions, deletions and word order changes, etc).
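
One simple, purely illustrative way of injecting such noise (the helper _add_noise and its noise_probability parameter are our own) is to randomly swap adjacent characters within a few words:

<pre> import random

def _add_noise(sentence, noise_probability=0.1):
  noisy_words = []
  for word in sentence.split():
    if len(word) > 3 and random.random() < noise_probability:
      i = random.randrange(len(word) - 1)
      # swap two adjacent characters to simulate a typo
      word = word[:i] + word[i+1] + word[i] + word[i+2:]
    noisy_words.append(word)
  return " ".join(noisy_words)
</pre>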

Conclusion

We at Wluper combined the above five techniques, added a little bit of magic sauce, and were able to achieve a 10,000-fold increase in training data for actual conversations, without even exhausting the number of possible variants each of these five techniques alone can offer. Powerful, yet efficient.

As well as state-of-the-art language models, some of our undisclosed NLP augmentation methods involve multi-task learning, semi-supervised learning, and even unsupervised learning. One specific example is clustering, a great way to find natural variations of simple intents like greetings, requesting alternatives, etc.

We use clustering algorithms to analyse public datasets and discover groups of phrases which share some underlying semantic relationship. Although the clusters do not tell us explicitly which similarities their phrases share, these can be inferred using other techniques; further, a seed phrase with a known class label can be used to resolve such ambiguities.

We are constantly striving to find efficient and effective methods to grow our quality data sets and allow our Dialogue system to become better, more accurate, and more robust.

You can find code snippets for the above NLP and Dialogue data augmentation in the “dsag” repo on our GitHub page.

If you liked this article and want to support Wluper, please share it and follow us on Twitter and LinkedIn.
