Adversarial Bootstrapping for Dialogue Model Training

Oluwatobi O. Olabiyi, Erik T. Mueller, Christopher Larson, Tarek Lahlou
Capital One Conversation Research, Vienna, VA
{oluwatobi.olabiyi, erik.mueller, christopher.larson2, tarek.lahlou}

Abstract

Open domain neural dialogue models, despite their successes, are known to produce responses that lack relevance, diversity, and in many cases coherence. These shortcomings stem from the limited ability of common training objectives to directly express these properties as well as their interplay with training datasets and model architectures. Toward addressing these problems, this paper proposes bootstrapping a dialogue response generator with an adversarially trained discriminator as an effective solution. The proposed method involves training a neural generator in both auto-regressive and traditional teacher-forcing modes, with the maximum likelihood loss of the auto-regressive outputs weighted by the score from a metric-based discriminator model. The discriminator input is a mixture of ground truth labels, the teacher-forcing outputs of the generator, and distractors sampled from the dataset, thereby allowing for richer feedback on the autoregressive outputs of the generator. To improve the calibration of the discriminator output, we also bootstrap the discriminator by matching the intermediate features of the ground truth and the generator’s autoregressive output. We explore different sampling and adversarial policy optimization strategies during training in order to understand how to encourage response diversity without sacrificing relevance. Our experiments show that adversarial bootstrapping is effective at addressing exposure bias, leading to improvements in response relevance and coherence. The improvement is demonstrated with state-of-the-art results on the Movie and Ubuntu dialogue datasets with respect to human evaluations and BLEU, ROUGE, and DISTINCT scores.

1 Introduction

End-to-end neural dialogue models have demonstrated the ability to generate reasonable responses to human interlocutors. However, a significant gap remains between these state-of-the-art dialogue models and human-level discourse. The fundamental problem with neural dialogue modeling is exemplified by generic responses, such as I don’t know, I’m not sure, or how are you, when models are conditioned on broad ranges of dialogue contexts. In addition to the limited contextual information in single-turn Seq2Seq models [Sutskever, Vinyals, and Le2014, Vinyals and Le2015, Li et al.2016a], which has motivated hierarchical recurrent encoder-decoder (HRED) multi-turn models [Serban et al.2016, Xing et al.2017, Serban et al.2017b, Serban et al.2017a, Olabiyi et al.2018, Olabiyi, Khazan, and Mueller2018, Olabiyi et al.2019], previous work points to three underlying reasons why neural models fail at dialogue response generation.

i) Exposure Bias: As with language and machine translation models, traditional conversation models are trained with the model input taken from the ground truth rather than from a previous model output (a method known as teacher forcing [Williams and Zipser1989]). During inference, however, the model conditions on its own past outputs, i.e., it is used autoregressively. Interestingly, training with teacher forcing does not present a significant problem in the machine translation setting, since the conditional distribution of the target given the source is well constrained. In the dialogue setting, on the other hand, this is problematic since the learning task is unconstrained [Lowe et al.2015]: there are several suitable target responses per dialogue context and vice versa. This discrepancy between training and inference is known as exposure bias [Williams and Zipser1989, Lamb et al.2016] and significantly limits the informativeness of the responses, as decoding errors compound rapidly during inference. Training methods that incorporate autoregressive sampling into model training have been explored to address this [Li et al.2016b, Li et al.2017, Yu et al.2017, Che et al.2017, Zhang et al.2017, Xu et al.2017, Zhang et al.2018b].

ii) Training data: The inherent problem with dialogue training data, although identified, has not been adequately addressed in the literature [Sharath, Tandon, and Bauer2017]. Human conversations contain a large number of generic, uninformative responses with little or no semantic meaning, giving rise to a classic class-imbalance problem. This problem also exists at the word and turn level; human dialogue [Banchs2012, Serban et al.2017b] contains non-uniform sequence entropy that is concave with respect to the token position, with the tokens at the beginning and end of a sequence having lower entropy than those in the middle (see Fig. 1). This initial positive entropy gradient can create learning barriers for recurrent models, and is a primary contributing factor to their short, generic outputs. In [Shao et al.2017], glimpse-based decoding is seemingly able to circumvent this problem by breaking this data-induced pattern, but at the expense of both training and inference time.

iii) Training Objective: Most existing dialogue models are trained using the maximum likelihood estimation (MLE) [Sutskever, Vinyals, and Le2014, Vinyals and Le2015, Serban et al.2016, Xing et al.2017] with teacher forcing because autoregressive sampling leads to unstable training. Unfortunately, the use of MLE is incongruent with the redundant nature of dialogue datasets, exacerbates the exposure bias problem in dialogue datasets, and is the primary factor leading to uninteresting and generic responses. Alternative training frameworks that complement MLE with other constraints such as generative adversarial networks, reinforcement learning, and variational auto-encoders that specifically encourage diversity have been explored to overcome the limitations of the MLE objective alone [Li et al.2016a, Li et al.2016b, Li et al.2017, Yu et al.2017, Che et al.2017, Zhang et al.2017, Xu et al.2017, Serban et al.2017b, Zhang et al.2018b, Olabiyi et al.2018, Olabiyi, Khazan, and Mueller2018, Olabiyi et al.2019].

In this paper, we propose an adversarial bootstrapping framework for training dialogue models. This framework tackles the class imbalance caused by the redundancy in dialogue training data and addresses the problem of exposure bias in dialogue models. Bootstrapping has been proposed in the literature as a way to handle data with noisy, subjective, and incomplete labels by combining cross-entropy losses from both the ground truth (i.e., teacher forcing) and model outputs (i.e., autoregression) [Reed et al.2015, Grandvalet and Bengio2005, Grandvalet and Bengio2006]. Here, we first extend its use to dialogue model training to encourage the generation of high-variance response sequences for a given ground truth target [Reed et al.2015]. This should reduce the tendency of dialogue models to reproduce the generic and uninteresting target responses present in the training data. We achieve this by training a discriminator adversarially and using its feedback to weight the cross-entropy loss from the model-predicted target. The gradient from this feedback encourages the dialogue model to generate a wide range of structured outputs. Second, we bootstrap the discriminator to improve the calibration of its output, using the similarity between the representations of the generator’s autoregressive output and the ground truth, taken from an intermediate layer of the discriminator, as an additional target for the discriminator. This further improves the diversity of the generator’s output without sacrificing relevance. We apply adversarial bootstrapping to multi-turn dialogue models. Architecture-wise, we employ an HRED generator and an HRED discriminator, depicted in Figure 2, with a shared hierarchical recurrent encoder.
In our experiments, the proposed adversarial bootstrapping demonstrates state-of-the-art performance on the Movie and Ubuntu datasets as measured by both automatic metrics (BLEU, ROUGE, and DISTINCT scores) and human evaluation.

Figure 1: Positional Entropy of Movie and Ubuntu datasets - Applying a greedy training objective to the datasets can achieve low overall entropy just by overfitting to low entropy regions, resulting in short and generic responses.
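The concave positional-entropy pattern described above can be illustrated with a short sketch. The toy corpus below is ours, not the Movie or Ubuntu data; it simply shows low token variety at the start and end of sequences and high variety in the middle:

```python
import math
from collections import Counter

def positional_entropy(corpus, position):
    """Shannon entropy (bits) of the token distribution at one position,
    over all sequences long enough to reach that position."""
    counts = Counter(seq[position] for seq in corpus if len(seq) > position)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Toy corpus: deterministic first and last tokens, varied middle tokens.
corpus = [
    ["i", "like", "cats", "."],
    ["i", "hate", "dogs", "."],
    ["i", "saw", "birds", "."],
]
entropies = [positional_entropy(corpus, t) for t in range(4)]
```

Here `entropies` is zero at the first and last positions and peaks in the middle, mirroring the concave shape in Fig. 1.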

2 Related Work

The literature on dialogue modeling, even in the multi-turn scenario, is vast (see [Serban et al.2016, Xing et al.2017, Serban et al.2017b, Serban et al.2017a, Olabiyi et al.2018, Olabiyi, Khazan, and Mueller2018, Olabiyi et al.2019, Li et al.2016b]), so in this section we focus on the key relevant previous papers. The proposed adversarial bootstrapping is closely related to the use of reinforcement learning for dialogue response generation with an adversarially trained discriminator serving as a reward function [Li et al.2017]. First, we employ a different discriminator training strategy from Li et al. (2017). The negative samples for our discriminator consist of the generator’s deterministic teacher-forcing output and distractors sampled from the training set. This makes the discriminator’s task more challenging and improves the quality of the feedback to the generator by discouraging the generation of high-frequency generic responses. Also, while [Li et al.2017] sample over all possible outputs of the generator, we take samples from the generator’s top_k outputs or its MAP output with Gaussian noise as an additional input. This allows our model to explore mostly plausible trajectories during training, in contrast to [Li et al.2017], where the discriminator mostly scores the generated samples very low. The top_k sampling strategy also mitigates the gradient variance problem found in the traditional policy optimization employed by [Li et al.2017]. Finally, we bootstrap our discriminator with the similarity between the intermediate representations of the generator’s autoregressive output and the ground truth to improve the calibration of the discriminator output.

3 Model

Let $x_i$ denote the context or conversation history up to turn $i$ and let $y_i$ denote the associated target response. Provided input-target samples $(x_i, y_i)$, we aim to learn a generative model $p_{\theta_G}(y \mid x_i)$ which scores representative hypotheses given arbitrary dialogue contexts such that responses that are indistinguishable from informative and diverse target responses are favored with high scores and otherwise given low scores. Notationally, we write the collection of possible responses at turn $i$ as the set $\mathcal{Y}_i$ containing elements $y_i^j = (y_i^{j,1}, \ldots, y_i^{j,T_j})$ where $T_j$ is the length of the $j$-th candidate response and $y_i^{j,t}$ is the $t$-th word of that response.

3.1 Generator Bootstrapping

To achieve the goal outlined above, we propose an Adversarial Bootstrapping (AB) approach to training multi-turn dialogue models such as the one depicted in Fig. 2. The adversarial bootstrapping for the generator can be expressed according to the objective

$$\mathcal{L}_{AB}(\theta_G) = -\sum_i \sum_{y \in \mathcal{Y}_i} t_G(y)\,\log p_{\theta_G}(y \mid x_i) \qquad (1)$$

where $t_G(y)$ is the target variable that controls the generator training. Indeed, hard bootstrapping [Reed et al.2015] is one such special case of (1) wherein $t_G(y) = \beta$ for $y = y_i$, $t_G(y) = 1 - \beta$ for $y = \hat{y}_i$ (the model’s own output), and $t_G(y) = 0$ otherwise, where $\beta$ is a hyperparameter. Similarly, MLE is another special case in which $t_G(y) = 1$ for $y = y_i$ and $t_G(y) = 0$ otherwise. It is reasonable to assume from these formulations that bootstrapping will outperform MLE since it does not assume all negative outputs are equally wrong.
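The special-case targets above can be sketched as follows. This is a minimal illustration in our own notation (the original symbols were lost in extraction); for clarity it assumes the ground-truth response and the model output are distinct:

```python
def hard_bootstrap_target(y, y_ground_truth, y_model, beta):
    """Hard-bootstrapping target t_G(y): mass beta on the ground truth,
    (1 - beta) on the model's own output, and 0 elsewhere."""
    if y == y_ground_truth:
        return beta
    if y == y_model:
        return 1.0 - beta
    return 0.0

def mle_target(y, y_ground_truth):
    """MLE is the special case with all target mass on the ground truth."""
    return 1.0 if y == y_ground_truth else 0.0
```

MLE treats every non-ground-truth response as equally wrong, whereas the bootstrapped target gives partial credit to the model's own output.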

Figure 2: A multi-turn recurrent architecture with adversarial bootstrapping: - The generator and discriminator share the same encoder (through the context state) and the same word embeddings. The generator also uses the word embeddings as the output projection weights. The encoder and the discriminator RNNs are bidirectional while the context and generator RNNs are unidirectional.

Interestingly, Li et al. (2017) make use of the MLE setting but additionally rely on the sampling stochasticity to obtain non-zero credit assignment information from the discriminator for the generator policy updates. To avoid this inconsistency, we instead modify the generator target to

$$t_G(y) = \begin{cases} \beta, & y = y_i \\ (1-\beta)\,Q_{\theta_D}(x_i, y), & y = \hat{y}_i \sim p_{\theta_G}(\cdot \mid x_i) \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

where $\beta$ is a hyperparameter and $Q_{\theta_D}(\cdot)$ is the bootstrapped target obtained from a neural network discriminator with parameters $\theta_D$. The first two assignments in (2) are also used in training the discriminator in addition to the human-generated distractors, denoted $\bar{y}_i$, from the dataset. In detail, we make use of the term

$$t_D(y) = \begin{cases} 1, & y = y_i \\ 0, & y = \hat{y}_i \\ 0, & y = \bar{y}_i \end{cases} \qquad (3)$$

within the context of the objective function. Namely, the discriminator objective is the cross-entropy between the output and the target of the discriminator given by

$$\mathcal{L}(\theta_D) = -\sum_i \big[\, t_D(y)\,\log Q_{\theta_D}(x_i, y) + \big(1 - t_D(y)\big)\log\big(1 - Q_{\theta_D}(x_i, y)\big) \big] \qquad (4)$$
The inclusion of human-generated negative samples encourages the discriminator to assign low scores to high frequency, generic target responses in the dataset, thereby discouraging the generator from reproducing them.
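The discriminator's cross-entropy training over the three sample types (ground truth, the generator's teacher-forcing output, and a dataset distractor) can be sketched as follows; the scores below are made-up placeholders, not values from the paper:

```python
import math

def bce(score, target):
    """Binary cross-entropy between a discriminator score and its target."""
    eps = 1e-12
    return -(target * math.log(score + eps)
             + (1.0 - target) * math.log(1.0 - score + eps))

# One discriminator training triple: ground-truth response (target 1),
# teacher-forcing output of the generator (target 0), and a distractor
# response sampled from the dataset (target 0).
scores_and_targets = [(0.8, 1.0), (0.3, 0.0), (0.1, 0.0)]
loss = sum(bce(s, t) for s, t in scores_and_targets) / len(scores_and_targets)
```

Because distractors are real human responses, the discriminator cannot simply learn to reward fluent-looking text; it must score responses in context.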

3.2 Discriminator Bootstrapping

In addition to the generator bootstrapping with the discriminator in Section 3.1, we can also bootstrap the discriminator using a similarity measure, $s(\cdot,\cdot)$, between the latent representations of the sampled generator outputs, $\hat{y}_i$, and the ground truth, both encoded by the discriminator, i.e.,

$$t_D(y) = \begin{cases} 1, & y = y_i \\ s\big(h_D(\hat{y}_i),\, h_D(y_i)\big), & y = \hat{y}_i \\ 0, & y = \bar{y}_i \end{cases} \qquad (5)$$

In our experiments, we chose the cosine similarity metric for $s(\cdot,\cdot)$ and the output of the discriminator before the logit projection for $h_D(\cdot)$. This helps to better calibrate the discriminator’s judgment of the generator’s outputs.
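A sketch of this similarity-based soft target, using cosine similarity over hypothetical pre-logit feature vectors (the vectors below are illustrative, not real discriminator activations):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical pre-logit discriminator features of the autoregressive
# sample and of the ground truth; their similarity replaces the hard
# 0/1 label as the discriminator target for the sampled output.
h_sample = [0.2, 0.9, -0.4]
h_truth = [0.1, 1.0, -0.5]
soft_target = cosine_similarity(h_sample, h_truth)
```

A sample whose features nearly match the ground truth receives a target close to 1 rather than a hard 0.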

3.3 Sampling Strategy

To backpropagate the learning signal for the case where $y = \hat{y}_i$, we explore both stochastic and deterministic policy gradient methods. For stochastic policies, we approximate the gradient of (1) w.r.t. $\theta_G$ by Monte Carlo samples using the REINFORCE policy gradient method [Li et al.2017, Glynn1990, Williams1992], i.e.,

$$\nabla_{\theta_G} \mathcal{L}_{AB} \approx -\,\mathbb{E}_{\hat{y}_i \sim p_{\theta_G}}\big[(1-\beta)\,Q_{\theta_D}(x_i, \hat{y}_i)\,\nabla_{\theta_G} \log p_{\theta_G}(\hat{y}_i \mid x_i)\big] \qquad (6)$$

For deterministic policies, we approximate the gradient according to [Silver et al.2014, Zhang et al.2018b]

$$\nabla_{\theta_G} \mathcal{L}_{AB} \approx -\,(1-\beta)\,\nabla_{\hat{y}_i} Q_{\theta_D}(x_i, \hat{y}_i)\,\nabla_{\theta_G}\,\hat{y}_i \qquad (7)$$

where $\hat{y}_i = G_{\theta_G}(x_i, z)$ and $z$ is the source of randomness. We denote the model trained with Eq. 7 as aBoots_gau. To reduce the variance of Eq. 6, we propose a novel approach of sampling from the top_k generator outputs using (i) a categorical distribution based on the output logits (aBoots_cat), similar to the treatment in [Radford et al.2019], and (ii) a uniform distribution (aBoots_uni), where top_k is a hyperparameter. This is especially useful for dialogue modeling with large vocabulary sizes.
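The two top_k sampling strategies can be sketched as follows. This is a simplified stand-in for a single decoding step, not the authors' implementation; `mode="cat"` renormalizes the top-k logits into a categorical distribution, while `mode="uni"` picks among them uniformly:

```python
import math
import random

def top_k_sample(logits, k, mode="cat", rng=random):
    """Sample a token index from the k highest-scoring logits, either from
    the renormalized categorical distribution ("cat") or uniformly ("uni")."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    if mode == "uni":
        return rng.choice(top)
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]  # softmax numerators
    total = sum(weights)
    r, acc = rng.random() * total, 0.0
    for i, w in zip(top, weights):
        acc += w
        if r <= acc:
            return i
    return top[-1]
```

Restricting sampling to the top-k candidates keeps the sampled trajectories plausible, which is the variance-reduction intuition described above.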

3.4 Encoder

Referring to the network architecture in Fig. 2, the generator and discriminator share the same encoder. The encoder uses two RNNs to handle multi-turn representations, similar to the approach in [Serban et al.2016]. First, during turn $i$, a bidirectional encoder RNN, $eRNN(\cdot)$, initialized with a zero state, maps the conversation context comprising the sequence of input symbols $x_i = (x_i^1, \ldots, x_i^{T_i})$, where $T_i$ is the sequence length, into a sequence of hidden state vectors $(h_i^1, \ldots, h_i^{T_i})$ according to

$$h_i^t = eRNN\big(E(x_i^t),\, h_i^{t-1}\big), \quad t = 1, \ldots, T_i \qquad (8)$$

where $E(\cdot)$ is the embedding lookup and $E \in \mathbb{R}^{h \times V}$ is the embedding matrix with embedding dimension $h$ and vocabulary size $V$. The vector representation of the input sequence, $h_i^e$, is the pooling over the encoded sequence $(h_i^1, \ldots, h_i^{T_i})$ [Serban et al.2016]. In addition, we use the output sequence as an attention memory for the generator, as depicted in Fig. 2. This is done to improve the relevance of the generated response.

To capture the dialogue context up to turn $i$, we use a unidirectional context RNN, $cRNN(\cdot)$, to combine the past dialogue context $h_{i-1}^c$ with the pooled encoding $h_i^e$ as

$$h_i^c = cRNN\big(h_i^e,\, h_{i-1}^c\big) \qquad (9)$$
Note that we decided not to allow attention at turn-level, as was done in [Xing et al.2017], since the number of dialogue turns is unpredictable. Also, the use of a single vector representation helps to simplify both training and inference procedures. We however think that converting the turn-level sequential memory to a random access memory [Graves, Wayne, and Danihelka2014, Graves et al.2016] should lead to further system performance improvements without much impact on inference time. We hope to explore this direction in the future.
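The two-level flow of the hierarchical encoder (per-token states pooled into an utterance vector, which then updates a running context state) can be illustrated with a deliberately simplified, non-recurrent stand-in: mean pooling and a fixed blend take the place of the trained encoder and context GRUs:

```python
def mean_pool(states):
    """Pool a sequence of per-token hidden states into one utterance vector."""
    n = len(states)
    return [sum(s[d] for s in states) / n for d in range(len(states[0]))]

def context_step(prev_context, utterance_vec, mix=0.5):
    """Toy stand-in for the unidirectional context RNN: blend the previous
    context state with the pooled utterance representation."""
    return [(1 - mix) * c + mix * u
            for c, u in zip(prev_context, utterance_vec)]

# Two turns of 2-d token states; the context vector summarizes both turns.
turn1 = [[1.0, 0.0], [3.0, 2.0]]
turn2 = [[0.0, 4.0], [2.0, 0.0]]
context = [0.0, 0.0]
for turn in (turn1, turn2):
    context = context_step(context, mean_pool(turn))
```

The point is only the data flow: each turn is collapsed to a single vector before the turn-level state is updated, which is why no turn-level attention is needed.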

3.5 Generator

The generator, denoted $dRNN(\cdot)$, is a unidirectional decoder RNN with an attention mechanism [Bahdanau, Cho, and Bengio2015, Luong et al.2015]. Similar to [Serban et al.2016], the decoder RNN is initialized with the last state of the context RNN. The generator outputs a hidden state representation $h_i^{d,t}$ for each previous token according to

$$h_i^{d,t} = dRNN\big(E(y_i^{t-1}),\, h_i^{d,t-1},\, a_i^t\big) \qquad (10)$$

where $a_i^t$ is the attention over the encoded sequence $(h_i^1, \ldots, h_i^{T_i})$. When the generator is run in teacher-forcing mode, as is typically done during training, the previous token is taken from the ground truth, i.e., $y_i^{t-1}$; during inference (autoregressive mode), we use the generator’s previous decoded output, i.e., $\hat{y}_i^{t-1}$.

The decoder hidden state $h_i^{d,t}$ is mapped to a probability distribution, typically through a logistic layer followed by a softmax, yielding

$$p_{\theta_G}\big(y_i^t \mid y_i^{1:t-1}, x_i\big) = \mathrm{softmax}\big(E\, h_i^{d,t} / \tau + b\big) \qquad (11)$$

where $\tau$ is a hyperparameter (the softmax temperature), $E$ is the output projection matrix, and $b$ is the logit bias. Note that the output projection matrix is the same as the embedding matrix, similar to [Vaswani et al.2017]. The generative model can then be derived as:

$$p_{\theta_G}(y_i \mid x_i) = \prod_{t=1}^{T_i} p_{\theta_G}\big(y_i^t \mid y_i^{1:t-1}, x_i\big) \qquad (12)$$
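The token-level distribution and the autoregressive factorization described above can be sketched as follows, where `tau` plays the role of the softmax temperature hyperparameter:

```python
import math

def softmax_with_temperature(logits, tau=1.0):
    """Temperature-scaled softmax over a vector of output logits."""
    m = max(logits)
    exps = [math.exp((l - m) / tau) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sequence_log_prob(step_logits, token_ids, tau=1.0):
    """log p(y | x) as the sum of per-step token log-probabilities,
    mirroring the autoregressive factorization of the generative model."""
    return sum(math.log(softmax_with_temperature(logits, tau)[t])
               for logits, t in zip(step_logits, token_ids))
```

A higher temperature flattens the per-token distribution, trading relevance for diversity at decoding time.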
3.6 Discriminator

The discriminator, $Q_{\theta_D}(\cdot)$, is a binary classifier that takes as input a response sequence $y_i$ and a dialogue context $x_i$, and is trained with the output labels provided in Eq. 3. The discriminator, as shown in Figure 2, is an RNN, $dRNN_D(\cdot)$, that shares the hierarchical encoder and the word embeddings with the generator, with its initial state being the final state of the context RNN. The last layer of the discriminator RNN is fed to a logistic layer and a sigmoid function to produce a normalized (action-value function) value of the pair of dialogue context (state) and response (action). This definition is more general than that in [Li et al.2017], as the target is not just a binary classification of the dialogue context and response pair as either human-generated (1) or machine-generated (0). That is, the target also includes the distractor classification and the output of a similarity function. This formulation allows us to view the discriminator as a metric learner [Kulis2013], as opposed to the generator, which is a maximum likelihood learner. Therefore, adversarial bootstrapping can be viewed as joint metric and maximum likelihood learning.

We explore two options for estimating this value, i.e., at the word or the utterance level. At the utterance level, we use Eq. 4 in conjunction with a unidirectional discriminator RNN. The value is calculated using the last output of $dRNN_D(\cdot)$, i.e.,

$$Q_{\theta_D}(x_i, y_i) = \sigma\big(W_D\, h_{D,i}^{T_i} + b_D\big) \qquad (13)$$

where $h_{D,i}^{T_i}$ is the final state of the discriminator RNN, and $W_D$ and $b_D$ are the logit projection and bias respectively. At the word level, the discriminator RNN (we use a bidirectional RNN in our implementation) produces a word-level evaluation. The normalized value and the adversarial bootstrapping objective function are then respectively given by

$$Q_{\theta_D}(x_i, y_i^t) = \sigma\big(W_D\, h_{D,i}^t + b_D\big) \qquad (14)$$

$$\mathcal{L}(\theta_D) = -\frac{1}{T_i} \sum_{t=1}^{T_i} \big[\, t_D(y)\,\log Q_{\theta_D}(x_i, y_i^t) + \big(1 - t_D(y)\big)\log\big(1 - Q_{\theta_D}(x_i, y_i^t)\big) \big] \qquad (15)$$

Model         Movie (Relevance | Diversity)                     Ubuntu (Relevance | Diversity)
              BLEU-2   ROUGE-2   DIST-1/DIST-2   NASL           BLEU-2   ROUGE-2   DIST-1/DIST-2   NASL
HRED 0.0474 0.0384 0.0026/0.0056 0.535 0.0177 0.0483 0.0203/0.0466 0.892
VHRED 0.0606 0.1181 0.0048/0.0163 0.831 0.0171 0.0855 0.0297/0.0890 0.873
hredGAN_u 0.0493 0.2416 0.0167/0.1306 0.884 0.0137 0.0716 0.0260/0.0847 1.379
hredGAN_w 0.0613 0.3244 0.0179/0.1720 1.540 0.0216 0.1168 0.0516/0.1821 1.098
DAIM 0.0155 0.0077 0.0005/0.0006 0.721 0.0015 0.0131 0.0013/0.0048 1.626
Transformer 0.0360 0.0760 0.0107/0.0243 1.602 0.0030 0.0384 0.0465/0.0949 0.566
aBoots_u_gau 0.0642 0.3326 0.0526/0.2475 0.764 0.0115 0.2064 0.1151/0.4188 0.819
aBoots_w_gau 0.0749 0.3755 0.0621/0.3051 0.874 0.0107 0.1712 0.1695/0.7661 1.235
aBoots_u_uni 0.0910 0.4015 0.0660/0.3677 0.975 0.0156 0.1851 0.0989/0.4181 0.970
aBoots_w_uni 0.0902 0.4048 0.0672/0.3653 0.972 0.0143 0.1984 0.1214/0.5443 1.176
aBoots_u_cat 0.0880 0.4063 0.0624/0.3417 0.918 0.0210 0.1491 0.0523/0.1795 1.040
aBoots_w_cat 0.0940 0.3973 0.0613/0.3476 1.016 0.0233 0.2292 0.1288/0.5190 1.208

Table 1: Automatic evaluation of generator performance

4 Training

We trained the generator and the discriminator simultaneously, with two samples for the generator and three for the discriminator. In all our experiments, we use the generator’s teacher-forcing outputs to train the discriminator (i.e., the $\hat{y}_i$ cases of Eqs. 2 and 3). The encoder parameters are included with the generator’s, i.e., we did not update the encoder during discriminator updates. Each RNN has 3 layers of GRU cells with a hidden state size $h$ of 512. The word embedding size is the same as $h$, and the vocabulary size $V$ is 50,000. Other hyperparameters include $\beta$, the softmax temperature $\tau$, and the top_k values for the sampling-based variants. Although we used a single top_k value during training, we avoided training separate models for multiple top_k values by searching for the optimum top_k (between 1 and 20) on the validation set using the BLEU score; the obtained optimum values were used for inference. Other training settings include the initial learning rate with a decay factor applied when the generator loss has increased over two iterations, a batch size of 64, and gradient clipping. All parameters are initialized with Xavier uniform random initialization [Glorot and Bengio2010]. Due to the large vocabulary size, we use a sampled softmax loss [Jean et al.2015] for the generator to limit the GPU memory requirement and expedite training; however, we use the full softmax for evaluation. The model is trained end-to-end using stochastic gradient descent.

5 Experiments

5.1 Setup

We evaluated the proposed adversarial bootstrapping (aBoots), with both generator and discriminator bootstrapping, on the Movie Triples and Ubuntu Dialogue corpora, randomly split into training, validation, and test sets in 90%, 5%, and 5% proportions. We performed minimal preprocessing of the datasets, replacing all words except the top 50,000 most frequent with an UNK symbol. The Movie dataset [Serban et al.2016] spans a wide range of topics with few spelling mistakes and contains about 240,000 dialogue triples, which makes it suitable for studying the relevance-versus-diversity tradeoff in multi-turn conversations. The Ubuntu dataset, extracted from the Ubuntu Relay Chat Channel [Serban et al.2017b], contains about 1.85 million conversations with an average of 5 utterances per conversation. This dataset is well suited to training dialogue models that can provide expert knowledge/recommendations in domain-specific conversations.

We explore different variants of aBoots along the choice of discrimination level, either word (_w) or utterance (_u), and sampling strategy, either uniform (_uni), categorical (_cat), or with Gaussian noise (_gau). We compare their performance with existing state-of-the-art dialogue models including (V)HRED (implementation from [Serban et al.2016, Serban et al.2017b]), hredGAN [Olabiyi et al.2018], and DAIM (implementation from [Zhang et al.2018b]). For completeness, we also include results from a transformer-based Seq2Seq model [Vaswani et al.2017].

We compare the performance of the models based on the informativeness (a combination of relevance and diversity metrics) of the generated responses. For relevance, we adopt the BLEU-2 [Papineni et al.2002] and ROUGE-2 [Lin2004] scores. For diversity, we adopt the distinct unigram (DIST-1) and bigram (DIST-2) scores [Li et al.2016a], as well as the normalized average sequence length (NASL) score [Olabiyi et al.2018].
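The DIST-n diversity metrics can be sketched as the ratio of unique to total n-grams across the generated responses (a common formulation; the exact normalization in [Li et al.2016a] may differ):

```python
def distinct_n(responses, n):
    """DIST-n: ratio of unique n-grams to total n-grams across responses."""
    ngrams = [tuple(tokens[i:i + n])
              for tokens in responses
              for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

responses = [["i", "am", "fine"], ["i", "am", "ok"]]
dist1 = distinct_n(responses, 1)  # 4 unique unigrams out of 6 total
dist2 = distinct_n(responses, 2)  # 3 unique bigrams out of 4 total
```

A model that keeps repeating the same generic phrases scores low on DIST-n even if each individual response is fluent.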

For human evaluation, we follow a setup similar to [Li et al.2016a], employing crowd-sourced judges to evaluate a random selection of 200 samples. We present both the multi-turn context and the generated responses from the models to 3 judges and ask them to rank the response quality in terms of informativeness; ties are not allowed. The informativeness measure captures temporal appropriateness, i.e., the degree to which the generated response is temporally and semantically appropriate for the dialogue context, as well as other factors such as response length and repetition. For analysis, we pair the models and compute the average number of times each model is ranked higher than the other.

Model Pair Movie Ubuntu
aBoots_w_cat – DAIM 0.957 – 0.043 0.960 – 0.040
aBoots_w_cat – HRED 0.645 – 0.355 0.770 – 0.230
aBoots_w_cat – VHRED 0.610 – 0.390 0.746 – 0.254
aBoots_w_cat – hredGAN_w 0.550 – 0.450 0.556 – 0.444

Table 2: Human evaluation of generator performance based on response informativeness
Model         Movie (Relevance | Diversity)                     Ubuntu (Relevance | Diversity)
              BLEU-2   ROUGE-2   DIST-1/DIST-2   NASL           BLEU-2   ROUGE-2   DIST-1/DIST-2   NASL
aBoots_g_u_gau 0.0638 0.3193 0.0498/0.2286 0.778 0.0150 0.1298 0.0480/0.1985 0.960
aBoots_g_w_gau 0.0729 0.3678 0.0562/0.3049 1.060 0.0123 0.1370 0.0646/0.1820 0.841
aBoots_g_u_uni 0.0801 0.3972 0.0655/0.3414 0.869 0.0124 0.1424 0.0636/0.1853 0.870
aBoots_g_w_uni 0.0860 0.4046 0.0671/0.3514 0.838 0.0170 0.2049 0.1074/0.4646 1.349
aBoots_g_u_cat 0.0836 0.3887 0.0597/0.3276 0.917 0.0131 0.1214 0.0597/0.3276 1.060
aBoots_g_w_cat 0.0928 0.4029 0.0613/0.3358 0.976 0.0202 0.2343 0.1254/0.4805 0.873

Table 3: Automatic evaluation of models with the generator bootstrapping only

6 Results and Discussion

6.1 Quantitative Evaluation

The quantitative measures reported in Table 1 show that adversarial bootstrapping gives the best overall relevance and diversity performance in comparison to the other models, (V)HRED, hredGAN, DAIM, and Transformer, on both the Movie and Ubuntu datasets. We believe the combination of improved discriminator training and the policy-based objective is responsible for the observed improvement. On the other hand, (V)HRED and hredGAN suffer a performance loss due to exposure bias, since autoregressive sampling is not included in their training. Although DAIM uses autoregressive sampling, its poor performance shows the limitation of a single-turn architecture and a GAN objective compared to the multi-turn architecture and policy-based objective in aBoots. The transformer Seq2Seq model, which performs better than RNNs on machine translation, also suffers from exposure bias and overfits very quickly to the low-entropy regions in the data, leading to poor inference performance. Also, the results from the aBoots models indicate that word-level discrimination performs better than utterance-level discrimination, consistent with the result reported for the hredGAN model in [Olabiyi et al.2018]. While it is difficult to identify why some models generate very long responses, we observe that models with Gaussian noise inputs (e.g., hredGAN and aBoots_gau) may be using the latent Gaussian distribution to better encode response length information; indeed, this is an area of ongoing work. Among the aBoots variants, we observe that models trained with a stochastic policy, aBoots_cat and aBoots_uni, outperform those trained with a deterministic policy, aBoots_gau. Notably, for the stochastic policy there is a relevance-versus-diversity tradeoff between top_k categorical and uniform sampling: categorical sampling tends to perform better on relevance but worse on diversity.
We believe that this is because top_k categorical sampling causes the generator to exploit high-likelihood trajectories (i.e., ones more likely to be encountered during inference) more than uniform sampling over the top candidates does, while still allowing the policy to explore. This, however, comes with some loss of diversity, although not a significant one. Overall, the automatic evaluation indicates that adversarial bootstrapping trained with a stochastic policy using the top_k categorical sampling strategy gives the best performance.

6.2 Qualitative Evaluation

As part of our evaluation we also consider scores from human judges. Specifically, we had each evaluator compare responses from five models: aBoots_w_cat, hredGAN_w, (V)HRED, and DAIM. The pairwise human preferences are reported in Table 2. These data indicate a significant preference for responses generated by aBoots_w_cat as compared to both (V)HRED and DAIM. We observe that aBoots_w_cat is preferred over hredGAN_w on average, although not by a significant margin. We note that this score was computed from only 200 evaluation samples, which is likely too small to demonstrate a strong preference for aBoots_w_cat. It is also worth noting that the hredGAN_w model represents a strong baseline, based on previous human evaluations [Olabiyi et al.2018], against which to compare our adversarially trained models. It is interesting that although the automatic evaluation scores hredGAN_w much lower than aBoots_w_cat on relevance, the long responses from hredGAN_w, which indicate strong diversity, have a considerable impact on how human evaluators judge the informativeness of the responses it generates. Table 4 shows example responses from the models.

Model Response
Context_0 we ’ re grateful , person . it ’ s not that . those craftsman tools . they ’ re a dream come true .
Context_1 your dad and i were raised with a certain work ethic , that ’ s all .
HRED person .
VHRED person , mr . person . person , i ’ m not going to tell you , person . i ’ m sorry . i ’ m sorry , person . i ’ m sorry
hredGAN_w person , i know you ’ re see . continued_utterance person , please .
aBoots person , i ’ m a little worried about my future . i ’ m not here to marry you .
Context_0 where would they be going ?
Context_1 we ’ ve got two choices . one of them is wrong .
HRED i ’ m sorry , person .
VHRED i know .
hredGAN_w person , i know , it ’ s a real hair . with you a little little little , little little little little somethin
aBoots i don ’ t care where they are , i don ’ t know anyone will be outside .
Context_0 DJones: update manager won’t detect 12.04.1 as a new version if you are already running 12.04, because 12.04.1 = 12.04 + lots of package updates
Context_1 DJones I am running 10.04.
Context_2 what does ’gksu update-manager c’ report?
HRED I don’t know what the problem is
VHRED you mean ?
hredGAN_w yes , I get the new one , I just want to get the new one
aBoots it says that command not found
Context_0 libgnome2-dev is the package that has that pkgconfig file
Context_1 After I installed libgnome2-dev I got a different output. sorry it is so long.
Context_2 I thought that was what was going on thanks. :) I ran the command ./configure && sudo make && sudo make install would that have worked.
HRED I don’t know what the problem is
VHRED what do you mean ? __eou__ what are you trying ?
hredGAN_w I have no idea , I can just copy the files and copy files to the other one
aBoots yes I did . I just did sudo apt-get update and it worked
Table 4: Sample responses of DAIM, (V)HRED, hredGAN, and aBoots.

6.3 Ablation Studies

In this section, we examine the effect of partial bootstrapping on model performance. Here, the similarity target in Eq. (5) is excluded from the discriminator. The automatic evaluation results for all the generator-only bootstrapped variants, denoted aBoots_g, are reported in Table 3. The table shows that generator models bootstrapped by a discriminator that is not itself bootstrapped generally perform worse than ones with a bootstrapped discriminator. This improvement is particularly evident in the best performing variant, aBoots_w_cat. We attribute the improvement to the better calibration of the discriminator obtained from bootstrapping its output with the similarity measure between the generator’s autoregressive output and the ground truth during training.

7 Conclusion

In this paper, we have proposed a novel training technique, adversarial bootstrapping, for dialogue modeling. The proposed method addresses the issues of data-induced redundancy and exposure bias in dialogue models trained with maximum likelihood. This is achieved by bootstrapping the teacher-forcing MLE objective with feedback on autoregressive outputs from an adversarially trained discriminator. This feedback discourages the generator from producing the bland and generic responses that are characteristic of MLE training. Experimental results indicate that a doubly bootstrapped system performs better than one in which only the generator is bootstrapped. Also, the aBoots variant characterized by top_k categorical sampling, stochastic policy optimization, and word-level discrimination gives the best performance. The results demonstrate that the proposed method leads to models that generate more relevant and diverse responses than existing methods.

