How do Dialogue Systems Decide What to Say or Which Actions to Take?

Mohammed Terry-Jack

October 17, 2019

Continuation of our blog series on emerging Dialogue Management — Part 2: Dialogue Policies.

Welcome back to our series on state-of-the-art research in Dialogue Management. In the previous post we focused on various aspects related to State Tracking (including how to take the personality of the user and system into account). In this post we look at the strengths and weaknesses of various types of Dialogue Policies.

Part 1: State Tracking ✔️

Part 2: Dialogue Policies ◄ you are here

Part 3: Common Sense

Part 4: Dynamic Memory

Part 5: Learning

A dialogue policy decides how the system should respond given the current state. XiaoIce, Rasa and various Alexa Prize teams use a hybrid approach: a hierarchical ensemble of policies, each trained for a different kind of conversation (e.g. form filling, question-answering, story-telling, profanity, jokes, comforting, the weather, the news, recommending restaurants, films, music, sports, etc). Offering a diversity of conversation modes in this way helps to maintain a user’s interest.

To reduce the computational load, one or two decision-making mechanisms control which policy or policies are used at a given time (e.g. Alquist decides based on the detected topic). The use cases of policies usually do not overlap; however, if multiple policies are triggered simultaneously, there is usually a method for selecting the winning policy (e.g. Alana assigns static priorities to all policies while XiaoIce bases the decision on the confidence scores of each trigger).
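The two arbitration strategies just described (static priorities versus trigger confidence) can be sketched in a few lines. This is only an illustration of the idea, not the actual Alana or XiaoIce code: the `Candidate` class, the policy names and the priority table are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    policy: str        # name of the policy that produced the reply
    reply: str         # proposed system response
    confidence: float  # confidence of the policy's trigger, in [0, 1]


# Hypothetical static ranking: a lower number means a higher priority.
STATIC_PRIORITY = {"rules": 0, "retrieval": 1, "generative": 2}


def select_by_priority(candidates):
    """Pick the candidate whose policy ranks highest in a fixed table."""
    return min(candidates, key=lambda c: STATIC_PRIORITY[c.policy])


def select_by_confidence(candidates):
    """Pick the candidate whose trigger reported the highest confidence."""
    return max(candidates, key=lambda c: c.confidence)


candidates = [
    Candidate("retrieval", "The Louvre is in Paris.", 0.82),
    Candidate("rules", "Sorry, could you rephrase that?", 0.40),
    Candidate("generative", "I love museums!", 0.65),
]

print(select_by_priority(candidates).policy)
print(select_by_confidence(candidates).policy)
```

Static priorities are predictable and easy to debug, whereas confidence-based selection adapts to the input but depends on the scores of different policies being comparable.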

There are several types of policies, each with its own strengths and weaknesses:

Rule-based policies
Retrieval-based policies
Generative policies
Knowledge Graph-based policies

1. Rule-based Policies

Weaknesses. Rule-based policies are one of the earliest and easiest approaches to conversational AI (e.g. ELIZA), involving a mountain of carefully hand-crafted responses and heuristics. This approach struggles with many-turn dialogues (the rules are brittle and easily broken by unexpected dialogues) and is extremely time-consuming to create and maintain (the rules do not scale well).

Strengths. Apart from one or two chatbots (Mitsuku, Tutor and other Loebner Prize winners), conversational AI systems of today only use rule-based policies to support other policies, because they are computationally efficient and can save the system the expense of processing frequent queries or commonly cached utterances which require little thought. Heriot-Watt University’s Alana [6] used a rule-based approach to script custom replies for common out-of-scope utterances (“I love you” or “you are stupid”) that were not easily handled by their other dialogue policies. GunRock scripts multiple responses and selects one at random to make the predefined responses less repetitive. It further creates dynamic template responses which can be customised by swapping out the values of their slots.
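As a rough illustration of scripted replies with random selection and slot-based templates, consider the sketch below. The utterances, replies, template and function names are invented for this example and are not taken from Alana or GunRock.

```python
import random

# Hypothetical scripted replies for common out-of-scope utterances.
SCRIPTED = {
    "i love you": [
        "That's kind of you to say!",
        "Thank you, I enjoy talking to you too.",
    ],
    "you are stupid": [
        "I'm still learning. Could you tell me what went wrong?",
    ],
}

# Hypothetical template whose slots are filled in at run time.
WEATHER_TEMPLATE = "The weather in {city} is {condition} today."


def rule_based_reply(utterance, rng=random):
    """Return a randomly chosen scripted reply, or None if no rule matches."""
    options = SCRIPTED.get(utterance.strip().lower())
    return rng.choice(options) if options else None


def fill_template(template, **slots):
    """Swap slot values into a template response."""
    return template.format(**slots)


print(rule_based_reply("I love you"))
print(fill_template(WEATHER_TEMPLATE, city="Paris", condition="sunny"))
```

Returning `None` when no rule fires lets the system hand the turn over to another policy, which matches the supporting role described above.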

Methods. Mitsuku, Tutor and Alexa Prize teams used AIML to create their rule-based policies. Rasa also offers heuristic Memoization and Mapping Policies [26] and DeepPavlov offers a similar Pattern Matching Skill.

2. Retrieval-based Policies

Weaknesses. Retrieval-based policies take a search-oriented approach to conversations which assumes dialogues are a sequence of search queries (initiated by the user) and search results (provided by the system). Naturally, this type of approach favours task-oriented dialogues like FAQ answering, Open Domain Question Answering (the task of finding an exact answer in Wikipedia articles to any question asked), and eCommerce (i.e. the user’s utterances are treated as search queries and task actions become the search results), each of which is implemented in DeepPavlov. Seeing dialogues in this way, however, leads to conversations that are very direct, dull and user-driven (i.e. the system merely reacts to what the user says as opposed to proactively initiating ideas and topics for a more natural and engaging dialogue).

Strengths. Retrieval-based policies guarantee high-quality responses that are coherent and well formed (since they are retrieved from human-generated conversations or texts). Alana scrapes text snippets from humorous subreddits like “ShowerThoughts” and “Today I Learned”. Other Alexa Prize competitors make use of various datasets, including News API, EVI, Wikidata, IMDB, ESPN, Washington Post, DuckDuckGo, Rotten Tomatoes, Spotify, Bing, Common Alexa Prize Chats (CAPC), online forums, social media (e.g. Twitter), movie subtitles (e.g. Cornell Movie Dialogs, OpenSubtitles), Jabberwacky chatbot chat logs, CNN chat show transcripts, etc. ConvAI makes mention of PersonaChat, DailyDialog and Reddit (files.pushshift.io/reddit/comments). [29] provides a very comprehensive list of datasets.

Furthermore, the quality of the responses can easily be controlled by filtering and curating the data. Alexa Prize teams filtered out inappropriate content (e.g. sexual language, profanity and other toxic phrases), and Alana [11] noticed that users give low ratings if the system reports news with strong political opinions, so the team curates its data to ensure the content is politically neutral. XiaoIce filters out personally identifiable information and spelling mistakes, and curates its content to retain only empathetic utterances that fit XiaoIce’s personality.

Method. Using a specified distance measure (e.g. cosine similarity), the distance from the query (e.g. the system’s current state) to each candidate response (e.g. possible system actions, answers, replies, etc) is computed and the closest is retrieved as the result.
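A minimal sketch of this retrieval step, assuming the query and the candidate responses have already been embedded as plain lists of floats. The three-dimensional vectors and the action names are toy values chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def retrieve(query_vec, candidates):
    """Return the candidate whose vector is most similar to the query.

    `candidates` maps each candidate response (or action) to its vector.
    """
    return max(candidates, key=lambda r: cosine_similarity(query_vec, candidates[r]))


# Toy embeddings of the current state and of three possible system actions.
state = [1.0, 0.2, 0.0]
actions = {
    "book_table": [0.9, 0.1, 0.0],
    "give_weather": [0.1, 1.0, 0.1],
    "tell_joke": [0.0, 0.1, 1.0],
}

print(retrieve(state, actions))
```

Maximising cosine similarity is equivalent to minimising cosine distance, so the most similar candidate is also the closest one.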

Pre-Trained Embeddings. There are various ways to embed the search queries and responses into a shared vector space so that they become mathematically comparable. DeepPavlov offers TF-IDF and BLEU ranking retrieval methods while most Alexa Prize teams prefer pre-trained word or sentence embeddings (such as those detailed in section 1.2).
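To make the TF-IDF option concrete, here is a small self-contained sketch of TF-IDF ranking over a handful of FAQ entries. It is written from scratch for illustration and does not use DeepPavlov’s API; the whitespace tokeniser, the smoothed IDF formula and the example questions are assumptions of this sketch.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Compute a vocabulary and smoothed inverse document frequencies."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    doc_freq = Counter(t for doc in tokenized for t in set(doc))
    n = len(docs)
    idf = {t: math.log((1 + n) / (1 + doc_freq[t])) + 1 for t in vocab}
    return vocab, idf


def embed(text, vocab, idf):
    """Embed text as a TF-IDF vector over the fitted vocabulary."""
    tf = Counter(text.lower().split())
    return [tf[t] * idf[t] for t in vocab]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def best_match(query, docs):
    """Return the document whose TF-IDF vector is closest to the query."""
    vocab, idf = fit_idf(docs)
    q = embed(query, vocab, idf)
    scores = [cosine(q, embed(d, vocab, idf)) for d in docs]
    return docs[scores.index(max(scores))]


faq = [
    "what are your opening hours",
    "how do i reset my password",
    "where is the nearest store",
]

print(best_match("reset password", faq))
```

Words that do not appear in the fitted vocabulary are simply ignored, which is one reason pre-trained embeddings tend to generalise better to unseen phrasings.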

Training Custom Embeddings. More sophisticated approaches involve jointly learning similar vector representations for dialogue states (the query) and system actions (the replies) in a supervised setting. DeepPavlov offers Siamese Neural Networks for this task while Rasa offers the Keras Policy [26] (a bi-LSTM with attention). Rasa also has the Embedding Policy, which uses a newer embedding algorithm, the Recurrent Embedding Dialogue Policy (REDP) [27], specifically designed to handle adversarial interjections (i.e. users talking off topic during the completion of a task). It does this by attending over the dialogue history to learn which user utterances and system actions are important for deciding the next action. System actions are initially represented as bags of features (e.g. the class hierarchy, the tokens derived from the action’s name or label such as “utter_explain_details_hotel → {utter, explain, details, hotel}”, the functions executed by the action, etc) such that similar actions have more features in common than dissimilar actions. These features are then transformed by the network’s attended embedding layer to produce the final representation.
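The initial bag-of-features representation of actions can be illustrated as follows. This sketch covers only the name-token features and uses Jaccard overlap to show that similar actions share more features; the overlap measure and the second and third action names are assumptions made for illustration, and REDP itself goes on to learn embeddings from these features rather than comparing them directly.

```python
def action_features(action_name):
    """Split an action label into its tokens, treated as a bag of features."""
    return set(action_name.split("_"))


def jaccard(a, b):
    """Fraction of features shared by two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


hotel = action_features("utter_explain_details_hotel")
restaurant = action_features("utter_explain_details_restaurant")
flights = action_features("action_search_flights")

print(sorted(hotel))
print(jaccard(hotel, restaurant))
print(jaccard(hotel, flights))
```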

End-to-End Models. There are also retrieval-based models which are end-to-end differentiable (i.e. the model can learn to output text responses or action functions without requiring an explicit representation for the dialogue state and other internal components). DeepPavlov offers two: the Key-Value Retrieval Network and the Hybrid Code Network (HCN), which combines an RNN with domain-specific knowledge and system action templates. Alquist also experimented with HCNs in the Alexa Prize while Ruby Star uses Dynamic Memory Networks.

LittleBaby uses the Profile-Encoded Multi-Turn Response Selection via Multi-Grained Deep Match Network while ConvAI organisers suggest the “Retrieve and Refine” model [31].

3. Generative Policies

Weaknesses. Although retrieval-based policies guarantee coherent responses, they are but bounded imitations of the natural variety found in human-human dialogues. Generative policies, on the other hand, compose natural sentences on the fly for a broad range of topics (e.g. books, games, holidays, movies, music, news, sports, general knowledge, etc) without needing an external knowledge base (i.e. they are ungrounded). Despite the rapid improvements of end-to-end generative models (e.g. GPT-2), they are still not favoured over more reliable, retrieval-based methods because they too often utter nonsensical and informationally inconsistent sentences.

Strengths. Although generative policies cannot always guarantee high-quality responses, their robustness and high coverage nicely complement retrieval-based policies. For instance, if conversations drift into new or previously unseen topics and the retrieved candidate responses become too sparse or irrelevant, a generative policy would kick in and compose additional candidates for inclusion in the rank and retrieval pool. This is how XiaoIce’s retrieval-based and neural-model-based generators work together.
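The fallback behaviour can be sketched as below. This is a simplified illustration rather than XiaoIce’s actual logic: the thresholds, the function names and the stub generator are hypothetical, and a real system would re-rank the merged pool with a learned ranker.

```python
def build_candidate_pool(retrieved, generate, min_candidates=3, min_score=0.5):
    """Merge retrieved replies with generated ones when retrieval falls short.

    retrieved: list of (reply, score) pairs from the retrieval-based policy.
    generate:  callable taking n and returning n (reply, score) pairs.
    """
    # Drop retrieved candidates that are judged irrelevant.
    pool = [(reply, score) for reply, score in retrieved if score >= min_score]
    # If too few relevant candidates remain, ask the generator to top up.
    if len(pool) < min_candidates:
        pool.extend(generate(min_candidates - len(pool)))
    return sorted(pool, key=lambda c: c[1], reverse=True)


def stub_generator(n):
    """Stand-in for a neural generator; returns n placeholder replies."""
    return [(f"generated reply {i}", 0.6) for i in range(n)]


retrieved = [("The Louvre is in Paris.", 0.9), ("Okay.", 0.2)]
for reply, score in build_candidate_pool(retrieved, stub_generator):
    print(score, reply)
```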

Methods. In the Alexa Prize, MilaBot experimented with a combination of neural architectures including a Hierarchical Latent Variable Encoder-Decoder (VHRED), a Dual Encoder and a Skip Thought Network. XiaoIce follows a popular sequence-to-sequence GRU architecture for its generative policy while ConvAI recommends a similar baseline model (an LSTM-based attentive sequence-to-sequence model). Teams that performed highly in the ConvAI competition implemented variations of the Transformer for their generative policies (Lost In Conversation modified the OpenAI GPT transformer architecture while Hugging Face fine-tuned the BERT transformer architecture).

4. Knowledge Graph-based Policies

Retrieval-based and generative policies both share a common problem: they produce dull dialogues. The Alexa Prize notes that retrieval-based policies produce responses which are on topic but not necessarily interesting. SlugBot astutely observes that they are too reactive to form engaging conversations: “it will not be possible to carry on a 20 minute conversation if SlugBot is simply responding to user initiatives”. Generative policies are better at generalising to the unseen but they are still not engaging. XiaoIce mentions its generative policy “often generates well-formed but short responses” while even generative models like Hugging Face (ConvAI runner-up) are a “bit boring” and Lost In Conversation (ConvAI winner) are “not that bad, just really uninteresting” and “not super interesting, but it’s able to respond well”. Knowledge Graph-based policies are a novel approach incorporated into XiaoIce and at least two Alexa Prize teams to complement their generative and retrieval-based policies for a more stimulating and engaging discourse. KG-based policies give a system the ability to take the initiative during conversations and proactively add information and steer a conversation into new yet related topics.

Many teams activate their KG-based policy when the measured “dullness” of their response passes a specified threshold. Alana v1 [11] used a word2vec embedding of the phrase “I don’t know” as a representation of a dull reply and measured the similarity of sentences to it to calculate their dullness score (i.e. more similar sentences are more dull). Another Alexa Prize team trained a classifier (a bi-LSTM with a fully connected layer to merge the encoder states) to rate retrieved responses as “appropriate”, “inappropriate” or “potentially appropriate”. XiaoIce’s topic switching mechanism is triggered if it has insufficient knowledge to talk about a topic in depth (i.e. its reply simply repeats the user inputs or contains no new information) or the user exhibits signs of boredom (bland user utterances like “OK”, “I see”, “go on”, etc).
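A dullness score of the kind attributed to Alana v1 might look roughly like this. The sketch assumes sentence vectors are formed by averaging word vectors and uses tiny hand-made three-dimensional vectors in place of real word2vec embeddings; the vocabulary, the vector values and the 0.8 threshold are all invented for illustration.

```python
import math

# Toy stand-ins for word2vec vectors (a real system would load a trained model).
TOY_VECTORS = {
    "i": [0.1, 0.0, 0.9],
    "don't": [0.0, 0.1, 0.8],
    "know": [0.1, 0.1, 0.9],
    "not": [0.0, 0.2, 0.8],
    "sure": [0.1, 0.1, 0.7],
    "eiffel": [0.9, 0.3, 0.0],
    "tower": [0.8, 0.4, 0.1],
    "opened": [0.7, 0.6, 0.0],
}
DIM = 3


def sentence_vector(sentence):
    """Average the vectors of the known words in a sentence."""
    vectors = [TOY_VECTORS[w] for w in sentence.lower().split() if w in TOY_VECTORS]
    if not vectors:
        return [0.0] * DIM
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(DIM)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


DULL_REFERENCE = sentence_vector("I don't know")


def dullness(reply):
    """Similarity of a reply to the reference dull phrase."""
    return cosine(sentence_vector(reply), DULL_REFERENCE)


def should_switch_to_kg_policy(reply, threshold=0.8):
    """Trigger the KG-based policy when the reply is too dull."""
    return dullness(reply) > threshold


print(should_switch_to_kg_policy("I am not sure"))
print(should_switch_to_kg_policy("The Eiffel Tower opened"))
```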

Methods. KG-based policies try to link entities mentioned by the user (e.g. “Paris”) to a knowledge graph so that related entities (e.g. “Eiffel Tower”, “Louvre”, “Mona Lisa”, etc) can guide the system toward a more informative reply. Alana’s Ontology bot (Contextualised Linked Concept Generator [6]) links entities to a Wikipedia graph, while other teams used Evi, FreeBase, Microsoft Concept Graph, etc. SlugBot’s discourse relation dialogue model links extracted entities to its own domain ontology (UniSlug) to find additional entities related by time, contingency (one entity causally influences the other), comparison (including contrast and concession) or expansion (continuing to talk about the same topic but with a more specific attribute of the entity, e.g. “Paris” > “Eiffel Tower”). XiaoIce also uses its own knowledge graph (a collection of head-relation-tail triples); however, it then uses a custom model (a boosted tree ranker) to rank the related entities according to their features (e.g. how related the entity is, whether it has been discussed yet, whether it is related to the user’s interests or persona information, whether it is related to the news and thus current, whether it is popular on the internet, whether it is often well received by other users, etc).
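The lookup-then-rank idea can be sketched with a tiny triple store. The triples, the relation names and the hand-written scoring function are hypothetical; in particular, the simple additive score merely stands in for a learned ranker such as XiaoIce’s boosted tree model, which is not reproduced here.

```python
# A toy knowledge graph stored as (head, relation, tail) triples.
TRIPLES = [
    ("Paris", "has_landmark", "Eiffel Tower"),
    ("Paris", "has_museum", "Louvre"),
    ("Paris", "capital_of", "France"),
    ("Louvre", "exhibits", "Mona Lisa"),
]


def related_entities(entity, triples):
    """Return (relation, neighbour) pairs for every triple touching the entity."""
    related = []
    for head, relation, tail in triples:
        if head == entity:
            related.append((relation, tail))
        elif tail == entity:
            related.append((relation, head))
    return related


def rank_entities(candidates, discussed, user_interests):
    """Order candidates by a hand-written score (a stand-in for a learned ranker)."""
    def score(item):
        _, entity = item
        value = 0.0
        if entity not in discussed:
            value += 1.0  # prefer entities that have not been talked about yet
        if entity in user_interests:
            value += 0.5  # prefer entities matching the user's interests
        return value

    return sorted(candidates, key=score, reverse=True)


candidates = related_entities("Paris", TRIPLES)
ranked = rank_entities(candidates, discussed={"Eiffel Tower"}, user_interests={"Louvre"})
for relation, entity in ranked:
    print(relation, entity)
```

The top-ranked entity can then seed the next system turn, which is how a KG-based policy steers the conversation toward a new yet related topic.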

In summary, there are many ways in which we can combine different types of Dialogue Policies to achieve a more balanced, fluent and well-rounded conversation.

What’s Next

Now that we have learnt more about the various approaches to Dialogue Policies, we will next look at one of the most fascinating topics in Conversational AI: Common Sense, every intelligent assistant's ultimate “life” goal.

If you liked this article and want to support Wluper, please share it and follow us on Twitter and Linkedin.
