Not Another Conversational AI Report

Mohammed Terry-Jack

July 22, 2019


But an overview of State-of-the-Art and Emerging Dialogue Management — Part 1: State Tracking.

The term ‘Conversational AI’ is fast fading into a fog of ambiguity, not too dissimilar to the way its parent term, ‘AI’, has almost vanished into the Vague Abyss of meaninglessness. High-level reports with fuzzy buzzwords, colourful diagrams and pretty patterns will not cut it for the curious reader who seeks to sink their teeth into this mysterious fruit. We at Wluper have delved into the mists of misconception surrounding Conversational AI, analysing resources from academia and industry to serve up five of the freshest, juiciest findings on Dialogue Management.

This razor-sharp Wluper-Writeup (NOT report) shares various approaches and solutions used by researchers and businesses. What follows is an overview of emerging research trends and state-of-the-art solutions with particular emphasis on advancements related to several areas within Dialogue Management, including:

Part 1: State Tracking ✔️
Part 2: Dialogue Policies
Part 3: Common Sense
Part 4: Dynamic Memory
Part 5: Learning

State Tracking

The first part of our series covers a core component of every spoken dialogue system: Dialogue State Tracking. Machines, if they are to communicate with us, must keep track of what users say and want. As with humans, this is a crucial prerequisite for a proper dialogue.

The global states of the various systems that we have evaluated were composed of one or both of the following:

  1. Persona Information
  2. Dialogue History

The global state was typically encoded as a state vector so that downstream models (e.g. the dialogue policy) could take the dialogue state as an input.

The Persona Information and Dialogue History are initially encoded as separate vectors, and later combined (e.g. concatenated) to form a single State Vector. Some of the more sophisticated approaches for combining the vectors involve neural architectures, such as a fully connected feed-forward neural network that compresses the multiple vectors down to one (Alexa Prize teams).
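As a minimal sketch of this combination step (the dimensions, random weights and the `combine_state` helper are illustrative, not taken from any of the systems above), the two vectors can be concatenated and compressed by one fully connected layer:

```python
import numpy as np

def combine_state(persona_vec, history_vec, W, b):
    """Concatenate the persona and dialogue-history vectors, then
    compress them to a single state vector with one fully
    connected layer (tanh activation)."""
    x = np.concatenate([persona_vec, history_vec])  # shape: (dp + dh,)
    return np.tanh(W @ x + b)                       # shape: (d_state,)

# Toy dimensions: 4-d persona, 6-d history, 3-d state.
rng = np.random.default_rng(0)
persona = rng.normal(size=4)
history = rng.normal(size=6)
W = rng.normal(size=(3, 10))
b = np.zeros(3)
state = combine_state(persona, history, W, b)
print(state.shape)  # (3,)
```

In a real system `W` and `b` would be learnt jointly with the dialogue policy rather than drawn at random.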

LittleBaby (who came third in ConvAI) merges the vectors by feeding them sequentially into a GRU and taking the GRU’s final output as the State Vector.

Lost In Conversation [2] (the ConvAI winner) combines the vectors using the shared multi-head attentions of a modified Transformer.

1.1 Persona Information

A key observation from the Alexa Prize competition was that Conversational User Experience (CUX) takes the least effort to incorporate and yet leads to the highest gains. Systems are often strong at conversing on topics they were trained on (e.g. movies) and drive the conversation toward these competency areas rather than toward areas the user is more interested in. Many of the successful Alexa Prize teams would automatically adapt their dialogue manager based on a user’s interests, while teams who “didn’t give much emphasis to CUX were not received as top performers by Alexa users”. XiaoIce’s developers refer to CUX as “emotional intelligence (EQ)” and consider it just as important as XiaoIce’s “IQ”, since it enables the system to learn social skills, empathy (e.g. a comforting skill is triggered if an extremely negative user sentiment is detected) and even a sense of humour.

Incorporating Persona Information into the state vector helps the system learn to personalise responses to the user’s persona and generate more attractive responses and a more pleasant conversation. XiaoIce even incorporates its own persona information into the state vector too, so that it can generate responses in-line with its personality (“an 18-year-old girl who is reliable, sympathetic, affectionate” and “never comes across as egotistical…despite being extremely knowledgeable”).

The Persona Information is a learnt vector encoding the characteristic behaviours and emotional patterns that make up a user’s distinctive character. Behavioural features include content-based signals such as topics of interest and the entities the user likes talking about. Emotional features include sentiment, emotion, opinion, etc. (e.g. Alexa Prize teams like Eve measure the user’s mood with pre-trained sentiment models like VADER, while Alquist trains its own sentiment analysis model on movie reviews using a bi-directional GRU; XiaoIce’s empathetic computing module outputs a set of empathy labels for a given text). The Persona Information’s features are updated throughout the conversation.
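One simple way to keep such features up to date across turns (a sketch of the general idea, not any team’s published method; the `update_persona` helper and its feature names are invented for illustration) is an exponential moving average over per-turn signals:

```python
def update_persona(persona, turn_features, alpha=0.3):
    """Blend the latest turn's features (e.g. a sentiment score or
    topic-interest signal) into the running persona features with
    an exponential moving average; alpha controls how quickly old
    evidence is forgotten."""
    return {k: (1 - alpha) * persona.get(k, 0.0) + alpha * v
            for k, v in turn_features.items()}

persona = {"sentiment": 0.0, "interest_movies": 0.0}
persona = update_persona(persona, {"sentiment": 1.0, "interest_movies": 1.0})
persona = update_persona(persona, {"sentiment": -1.0, "interest_movies": 1.0})
print(persona)  # sentiment drops after the negative turn; movie interest keeps rising
```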

The Persona Information vector could also be learnt as a vector of latent features, using machine-learning techniques from adaptive recommendation systems (e.g. Netflix’s recommendation system uses skip-gram networks to learn user vectors).

1.2 Dialogue History

In our experience at Wluper, a user’s utterance can quite often be ambiguous and carry insufficient information when viewed in isolation from the conversation (especially short, bland utterances like “OK”, “why” or “I don’t know”). To make sense of it, additional context from the dialogue history is necessary (i.e. the user utterances and system responses/actions over the past N turns).

Successful approaches for vectorising past utterances and replies (the dialogue history) involved embedding each string at multiple levels:

  1. Sentence-level
  2. Word-level
  3. Character-level (and ngrams)
  4. NLP features (e.g. dialogue acts, etc)

1.2.1 Sentence-level Embeddings

GunRock (the University of California’s Alexa Prize entry [7]) uses Google’s Universal Sentence Encoder to embed utterances and responses at the sentence level. These contextualised embeddings are often sufficient on their own; however, they require loading large pre-trained models, which can be burdensome compared with other (older) methods that simply rely on pre-computed vectors to encode the sentence at the level of its individual words.

1.2.2 Word-level Embeddings

Many also take advantage of the wide selection of pre-trained word embeddings available (e.g. Word2Vec, GloVe, fastText, ELMo, Numberbatch, Poincaré). DeepPavlov even allows you to fine-tune ELMo embeddings on your own data. Since sentences vary in length (some containing more words than others), various methods are employed to merge a sentence’s word vectors into a single, fixed-length vector that represents the entire sentence (e.g. combinations of mean, max, min and sum pooling).
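The pooling step can be sketched as follows (a toy example with 2-d vectors; real embeddings are typically hundreds of dimensions, and the mean/max combination shown is just one common choice):

```python
import numpy as np

def pool_sentence(word_vectors):
    """Merge a variable number of word vectors into one
    fixed-length sentence vector by concatenating mean-pooled
    and max-pooled features."""
    W = np.stack(word_vectors)  # (n_words, dim)
    return np.concatenate([W.mean(axis=0), W.max(axis=0)])

vecs = [np.array([1.0, -2.0]), np.array([3.0, 0.0]), np.array([-1.0, 4.0])]
sent = pool_sentence(vecs)
print(sent)  # [1.  0.66666667  3.  4.]
```

Note the output length is fixed (2 × dim) regardless of how many words the sentence contains.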

Unfortunately, word order is lost as the word vectors merge together, so LittleBaby uses more sophisticated merging methods (a bi-directional GRU and a CNN) to ensure the relative positional information of words is embedded too, as do some Alexa Prize teams (an LSTM encoder). Hugging Face and others use an attention mechanism and/or the Transformer to ensure words are weighted appropriately when merged.

Just before passing the word vectors into the Transformer’s multi-head attention, Hugging Face sums them with positional embeddings, plus embeddings indicating which utterance each word belongs to (i.e. the user’s utterance, the system’s response, etc.), thereby encoding both the word order within a sentence and the order of the sentences themselves.
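A hedged sketch of this input scheme (random lookup tables and illustrative dimensions; in the real model these embedding tables are learnt jointly with the Transformer):

```python
import numpy as np

dim = 4
rng = np.random.default_rng(1)
position_emb = rng.normal(size=(50, dim))    # one row per token position
segment_emb = {"user": rng.normal(size=dim),    # who said the token
               "system": rng.normal(size=dim)}

def embed_token(word_vec, position, speaker):
    """Input embedding = word vector + positional embedding
    + segment embedding (user vs. system utterance)."""
    return word_vec + position_emb[position] + segment_emb[speaker]

tok = embed_token(np.zeros(dim), position=0, speaker="user")
print(tok.shape)  # (4,)
```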

1.2.3 Character-level Embeddings

One problem with word-level embeddings is that typos and out-of-vocabulary (OOV) words cannot be represented. DeepPavlov also offers pre-trained spell-checkers for correcting typos prior to embedding; however, character-level embeddings are a more robust alternative for encoding words with errors, and are also useful for capturing surface features such as capitalisation. Some teams encoded sentences at the character level via an LSTM, while others used Tf-Idf-weighted n-grams (BabyFace).
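To see why character n-grams are robust to typos, consider a toy Jaccard-overlap comparison (the `char_ngrams` and `overlap` helpers are invented for illustration, not from any cited system):

```python
def char_ngrams(word, n=3):
    """Represent a word by its padded character trigrams, so that
    typos and out-of-vocabulary words still share most of their
    features with correctly spelt neighbours."""
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def overlap(a, b):
    """Jaccard similarity between two words' trigram sets."""
    ga, gb = char_ngrams(a), char_ngrams(b)
    return len(ga & gb) / len(ga | gb)

# The typo "moveis" still looks far more like "movies" than an
# unrelated word does.
print(overlap("movies", "moveis") > overlap("movies", "sports"))  # True
```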

1.2.4 NLP Features

NLP features (e.g. sentiment, dialogue act, topic, intent) provide a further source of context by “reading between the lines” for information not explicitly mentioned in the dialogue history. The accuracy of Alexa Prize systems increased from a 55% baseline to 76% after the inclusion of additional contextual features such as topic and dialogue act.

Topics (e.g. Sports, Politics, Entertainment, Fashion, Technology, etc) are an important feature for ensuring the system is able to sustain a topical discussion and hold a dialogue on a particular topic over multiple turns (deep topical conversation problem) rather than constantly switching to unrelated topics (which would annoy the user). Most Alexa Prize teams (Tartan, Slugbot, Fantom, Eve, Gunrock, etc) used the provided topical classifier: an attention-based Deep Average Network (ADAN) which can simultaneously classify a topic while detecting entities (words with high attention). However, a handful (GunRock, Alana) opted to detect topics from named entities using knowledge graphs (e.g. Google Knowledge Graph, Microsoft Concept Graph, etc).
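The ADAN adds attention over words; its average-then-classify core can be sketched as follows (random weights and toy labels, purely illustrative):

```python
import numpy as np

def dan_topic(word_vectors, W, topics):
    """Deep Average Network sketch: average the word vectors,
    apply a linear layer, then softmax over topic labels."""
    avg = np.mean(word_vectors, axis=0)       # (dim,)
    logits = W @ avg                          # (n_topics,)
    probs = np.exp(logits - logits.max())     # stable softmax
    probs /= probs.sum()
    return topics[int(np.argmax(probs))], probs

topics = ["Sports", "Movies"]
rng = np.random.default_rng(2)
W = rng.normal(size=(2, 3))                   # untrained toy weights
label, probs = dan_topic([rng.normal(size=3) for _ in range(5)], W, topics)
print(label, probs)
```

The trained ADAN additionally learns per-word attention weights, so words with high attention double as detected entities.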

Intents (e.g. request_change_topic, request_exit, etc) represent the underlying goals of the user (i.e. what they wished to achieve from their utterance). DeepPavlov provides a number of pre-trained intent classifiers implemented in sklearn or keras or via fine-tuning BERT.

Utterances often become complex in real conversations, with pauses, hesitations and compound sentences (where the user expresses multiple thoughts in a single sentence — e.g. “I love watching the Harry Potter movies, but I think the books are more enjoyable”). To overcome this problem, some Alexa Prize teams resort to sentence segmenting (parsing a sentence to break it into smaller segments) prior to predicting the intent (e.g. “Alexa that is cool what do you think of the Avengers” → “Alexa <BRK> that is cool <BRK> what do you think of the Avengers <BRK>”). GunRock trains bi-LSTM encoder-decoders with attention on Common Crawl data annotated with special break tokens.
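GunRock’s segmenter is a learnt neural model; as a crude rule-based stand-in for illustration only, one could insert break tokens before common discourse-cue words:

```python
# Toy stand-in for a learnt sentence segmenter: insert a <BRK>
# token before common discourse-cue words and at the end.
CUES = {"but", "what", "and", "because"}

def segment(utterance):
    out = []
    for word in utterance.split():
        if word.lower() in CUES and out:
            out.append("<BRK>")
        out.append(word)
    out.append("<BRK>")
    return " ".join(out)

print(segment("Alexa that is cool what do you think of the Avengers"))
# Alexa that is cool <BRK> what do you think of the Avengers <BRK>
```

A hand-written cue list like this misses many break points a trained model would catch (e.g. after the wake word “Alexa”), which is precisely why GunRock learns the segmentation instead.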

Dialogue Acts are very broad intents (e.g. “greeting”, “question”, “yes-no-question”, “statement-opinion”, “opinion request”, “request for information”, “delivery of information”). The Alexa Prize provided three pre-trained dialogue act classifiers using a Deep Average Network (DAN) or a bi-LSTM. Some Alexa Prize teams (e.g. SlugBot, Iris, GunRock, Alquist) opted to train their own classifiers (e.g. GunRock [7] trained a CNN and a bi-LSTM on the Switchboard Dialog Act Corpus (SWDA)).

Entities are words which refer to a specific object (e.g. a person, organisation, etc.) or event in the real world (e.g. in the utterance “what do you think about the Mars Mission”, the words “Mars Mission” form an entity as they refer to a specific event), and extracting them is useful for retrieving relevant information during response generation. Entities are also core to the meaning and intent of a sentence. Most Alexa Prize teams use an ensemble of pre-trained NER models (e.g. Stanford CoreNLP, spaCy, Google NLP, Alexa’s ASK NLU). Alquist trains its own model on manually labelled utterances collected from conversations, while Iris trains a CNN on entities from DBpedia.
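A simple majority-vote ensemble over several taggers might look like this (the stand-in taggers are placeholders; a real system would wrap spaCy, Stanford CoreNLP, etc. behind the same callable interface):

```python
from collections import Counter

def ensemble_ner(utterance, taggers, min_votes=2):
    """Run several NER taggers over the utterance and keep only
    the entities that at least `min_votes` of them agree on."""
    votes = Counter()
    for tag in taggers:
        votes.update(set(tag(utterance)))
    return {entity for entity, n in votes.items() if n >= min_votes}

# Stand-in taggers with hard-coded outputs, for illustration only.
tagger_a = lambda text: ["Mars Mission"]
tagger_b = lambda text: ["Mars Mission", "Mars"]
tagger_c = lambda text: []

print(ensemble_ner("what do you think about the Mars Mission",
                   [tagger_a, tagger_b, tagger_c]))  # {'Mars Mission'}
```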

Iris and Fantom take alternative approaches to detecting entities, using a large database to recognise relevant ones. DeepPavlov suggests slotfill_raw, which uses a type of fuzzy search, and also offers keyword extraction via regular-expression matching (regex), Tf-Idf ranking and slot filling.

Morphological features (sentence structure, parse information, part of speech, gender, case, number, mood, tense) are sometimes extracted to gain additional signals. Alquist (an Alexa Prize entry from the Czech Technical University in Prague [12]) uses publicly available annotators from Stanford CoreNLP. DeepPavlov offers fine-grained morphological tagging (i.e. detecting case, gender, number, mood, tense, etc.).

Hand-Engineered Features (based on items such as response confidence, utterance length, and whether or not an utterance or word has been repeated) are extracted by two Alexa Prize teams (SlugBot and Eve). These features come in handy when training systems to avoid replies which are too short and bland, or to avoid repetition in their choice of replies (comments about Lost In Conversation revealed that it would say “awesome” too much). Features like the ratio of questions asked by the user and system may even help the system mirror the conversational style of the user (it was noted that Hugging Face asked too many questions, which made it less natural and more annoying).
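A few such features can be computed directly from the turn history (the exact feature set below is illustrative, not SlugBot’s or Eve’s):

```python
def hand_features(user_turns, system_turns):
    """Illustrative hand-engineered dialogue features: average
    user utterance length, how repetitive the system's replies
    are, and the ratio of questions asked by each side."""
    def q_ratio(turns):
        return sum(t.strip().endswith("?") for t in turns) / max(len(turns), 1)
    return {
        "avg_user_len": sum(len(t.split()) for t in user_turns)
                        / max(len(user_turns), 1),
        # 0.0 if every reply is unique, approaching 1.0 as replies repeat
        "system_repetition": 1 - len(set(system_turns))
                             / max(len(system_turns), 1),
        "user_q_ratio": q_ratio(user_turns),
        "system_q_ratio": q_ratio(system_turns),
    }

feats = hand_features(["hi", "why?"],
                      ["awesome", "awesome", "do you like films?"])
print(feats)
```

Comparing `user_q_ratio` and `system_q_ratio` is one cheap way to spot a system that, like Hugging Face, asks far more questions than its user does.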


To comprehensively scope this vast and rapidly changing landscape, we draw from a triad of secondary sources (which, in turn, directed our search toward specific architectures and publications):

1. Competitions (e.g. Alexa Prize, DSTC, Loebner Prize, ConvAI)

2. Toolkits/Building Frameworks (e.g. Rasa, DeepPavlov)

3. Commercial (e.g. Microsoft’s XiaoIce)

We have collected the best practices and approaches used across these various sources and, although we found few cases where all the techniques are used in combination, they are by no means exclusive, so we recommend using them together to reap the benefits of all.

Hugging Face, for instance, came second in the ConvAI competition (and actually came first on the automated metrics scoreboard), but it fails at more subtle conversational cues, like balancing the number of questions asked (it asks far too many). Microsoft’s XiaoIce, on the other hand, is very well rounded and incorporates almost everything we have found. Many others (e.g. Google Assistant, Siri, Alexa) simply don’t compare at all.

The Loebner Prize is largely focused on rule-based approaches, while ConvAI consists mostly of end-to-end DL models. The Alexa Challenge, however, often combines approaches and even features novel knowledge-graph approaches too. Each approach has its pros and cons, which we shall discuss in the next post.

The main negative about many of the systems analysed is that they only utilise one or two approaches and miss out on the many other (complementary) approaches being explored in this field. This may be because the space is so young and growing so fast that teams simply don’t realise what they are missing!

What’s Next

Now that we have got an overview on the Dialogue Management landscape with various approaches and solutions, we can dig into Dialogue Policies next.

In a typical Conversational AI implementation, the dialogue policy decides how the system should respond given the current state, and we are going to learn about methods, strengths, and weaknesses of different policies.

If you liked this article and want to support Wluper, please share it and follow us on Twitter and Linkedin.

Read part 2: How do Dialogue Systems decide what to say or which actions to take?
