A Causality-Guided Prediction of the TED Talk Ratings from the Speech-Transcripts using Neural Networks


Md Iftekhar Tanveer, Md Kamrul Hassan, Daniel Gildea, M. Ehsan Hoque
University of Rochester
{itanveer,mhasan8,gildea,mehoque}@cs.rochester.edu
Abstract

Automated prediction of public speaking performance enables novel systems for tutoring public speaking skills. We use the largest open repository—TED Talks—to predict the ratings provided by the online viewers. The dataset contains over 2200 talk transcripts and the associated meta information including over 5.5 million ratings from spontaneous visitors to the website. We carefully removed the bias present in the dataset (e.g., the speakers’ reputations, popularity gained by publicity, etc.) by modeling the data generating process using a causal diagram. We use a word sequence based recurrent architecture and a dependency tree based recursive architecture as the neural networks for predicting the TED talk ratings. Our neural network models can predict the ratings with an average F-score of 0.77 which largely outperforms the competitive baseline method.

1 Introduction

While the demand for physical and manual labor is gradually declining, there is a growing need for a workforce with soft skills. Arguably, few of these skills are valued more highly than the ability to speak in public. According to an article in Forbes (Gallo, 2014), 70% of employed Americans agree that public speaking skills are critical to their success at work. Yet, it is one of the most dreaded acts. Many people rate the fear of public speaking even higher than the fear of death (Wallechinsky et al., 2005). To alleviate the situation, several automated systems are now available that can quantify behavioral data for participants to reflect on (Fung et al., 2015). Predicting the viewers’ ratings from the speech transcripts would enable these systems to generate feedback on the potential audience behavior.

Predicting human behavior, however, is challenging due to its huge variability and the way the variables interact with each other. Running Randomized Control Trials (RCT) to decouple each variable is not always feasible and is also expensive. It is possible to collect a large amount of observational data due to the advent of content sharing platforms such as YouTube, Massive Open Online Courses (MOOC), or ted.com. However, the uncontrolled variables in an observational dataset always carry the risk of incorporating the effects of “data bias” into the prediction model. Recently, the problems of using biased datasets are becoming apparent. Buolamwini and Gebru (2018) showed that the error rates in commercial face-detectors are substantially higher for dark-skinned females than for light-skinned males due to the bias in the training dataset. The unfortunate incident of Google’s photo app tagging African-American people as “Gorilla” (Guynn, 2015) also highlights the severity of this issue.

We address the data bias issue as much as possible by carefully analyzing the relationships of different variables in the data generating process. We use a Causal Diagram (Pearl and Mackenzie, 2018; Pearl, 2009) to analyze and remove the effects of the data bias (e.g., the speakers’ reputations, popularity gained by publicity, etc.) in our prediction model. In order to make the prediction model less biased to the speakers’ race and gender, we confine our analysis to the transcripts only. Besides, we normalize the ratings to remove the effects of the unwanted variables such as the speakers’ reputations, publicity, contemporary hot topics, etc.

For our analysis, we curate an observational dataset of public speech transcripts and other meta-data collected from the ted.com website. This website contains a large collection of high-quality public speeches that are freely available to watch, share, rate, and comment on. Every day, numerous people watch and annotate their perceptions about the talks. Our dataset contains public speech transcripts and over 5.5 million ratings from the spontaneous viewers of the talks. The viewers annotate each talk by 14 different labels—Beautiful, Confusing, Courageous, Fascinating, Funny, Informative, Ingenious, Inspiring, Jaw-Dropping, Long-winded, Obnoxious, OK, Persuasive, and Unconvincing.

We use two neural network architectures in the prediction task. In the first architecture, we use an LSTM (Hochreiter and Schmidhuber, 1997) for a sequential input of the words within the sentences of the transcripts. In the second architecture, we use a TreeLSTM (Tai et al., 2015) to represent the input sentences in the form of dependency trees. Our experiments show that the dependency tree based model can predict the TED talk ratings with slightly higher performance (average F-score 0.77) than the word sequence model (average F-score 0.76). To the best of our knowledge, this is the best performance in the literature on predicting the TED talk ratings. We compare the performance of these two models with a baseline of classical machine learning techniques using hand-engineered features and find that the neural networks largely outperform the classical methods. We believe this gain in performance is achieved by the networks’ ability to better capture the natural relationship among the words (as compared to the hand-engineered feature selection approach in the baseline methods) and the correlations among the different rating labels.

2 Background Research

In this section, we describe relevant prior work on behavioral prediction.

2.1 Predicting Human Behavior

An example of human behavioral prediction research is automatic essay grading, which has a long history (Valenti et al., 2003). Recently, deep neural network based solutions (Alikaniotis et al., 2016; Taghipour and Ng, 2016) have become popular in this field. Farag et al. (2018) proposed an adversarial approach for the essay scoring task, and Jin et al. (2018) proposed a two-stage deep neural network based solution. Predicting the helpfulness of online reviews (Martin and Pu, 2014; Yang et al., 2015; Liu et al., 2017a; Chen et al., 2018) is another example of predicting human behavior. Bertero and Fung (2016) proposed a combined Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) framework to predict humor in dialogues; their method achieved an 8% improvement over a Conditional Random Field baseline. Jaech et al. (2016) analyzed the performance of phonological pun detection using various natural language processing techniques. In general, behavioral prediction encompasses numerous areas such as predicting outcomes in job interviews (Naim et al., 2016), hirability (Nguyen and Gatica-Perez, 2016), and presentation performance (Tanveer et al., 2015; Chen et al., 2017; Tanveer et al., 2018). However, the practice of explicitly modeling the data generating process is relatively uncommon. In this paper, we extend the prior work by explicitly modeling the data generating process in order to remove the data bias.

2.2 Predicting the TED Talk Performance

There is a limited amount of work on predicting the TED talk ratings. In most cases, TED talk performances are analyzed through introspection (Gallo, 2014; Bull, 2016; Sugimoto et al., 2013; Tsou et al., 2014; Drasovean and Tagg, 2015).

Chen and Lee (2017) analyzed the TED talks for humor detection. Liu et al. (2017b) analyzed the transcripts of the TED talks to predict audience engagement in the form of applause. Haider et al. (2017) predicted user interest (engaging vs. non-engaging) from high-level visual features (e.g., camera angles) and audience applause. Pappas and Popescu-Belis (2013) proposed a sentiment-aware nearest-neighbor model for multimedia recommendation over the TED talks. Weninger et al. (2013) predicted the TED talk ratings from the linguistic features of the transcripts; this work is the most similar to ours. However, we propose a new prediction framework using neural networks.

3 Dataset

The data for this study was gathered from the ted.com website on November 15, 2017. We removed the talks published within six months of the crawling date to make sure each talk had enough ratings for a robust analysis. More specifically, we filtered out any talk that

(1) was published less than six months prior to the crawling date,

(2) contained any of the following keywords: live music, dance, music, performance, or entertainment, or

(3) contained fewer than 450 words in the transcript.

This filtering left a total of 2,231 talks in the dataset; a sketch of the filtering step follows below.
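For concreteness, the filtering step can be sketched as follows, assuming the crawled metadata sits in a pandas DataFrame. The column names (published_date, tags, transcript) are illustrative assumptions, not the authors' actual crawler output:

```python
import pandas as pd

EXCLUDED_KEYWORDS = {"live music", "dance", "music", "performance", "entertainment"}
CRAWL_DATE = pd.Timestamp("2017-11-15")

def filter_talks(talks: pd.DataFrame) -> pd.DataFrame:
    """Keep talks that are old enough, are not performances, and have long transcripts."""
    old_enough = talks["published_date"] <= CRAWL_DATE - pd.DateOffset(months=6)
    no_performance = ~talks["tags"].apply(
        lambda tags: any(t.lower() in EXCLUDED_KEYWORDS for t in tags)
    )
    long_enough = talks["transcript"].str.split().str.len() >= 450
    return talks[old_enough & no_performance & long_enough]
```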

We collected the manual transcriptions and the total view counts for each video. We also collected the “ratings”, which are the counts of the viewer-annotated labels. The viewers can annotate a talk from a selection of 14 different labels provided on the website. The labels are not mutually exclusive; viewers can choose at most three labels for each talk, and if only one label is chosen, it is counted three times. We count the total number of annotations under each label, as shown in Figure 1. The ratings are treated as the ground truth about the audience perception. A summary of the dataset characteristics is shown in Table 1.

Figure 1: Counts of all the 14 different rating categories (labels) in the dataset
Property Quantity
Number of talks 2,231
Total length of all talks 513.49 Hours
Total number of ratings 5,574,444
Average ratings per talk 2498.6
Minimum ratings per talk 88
Total word count 5,489,628
Total sentence count 295,338
Table 1: Dataset Properties

4 Modeling the Data Generating Process

In order to analyze and remove any bias present in the dataset, we model the data generating process using a Causal Diagram. For a delightful understanding of the importance of this step, please refer to Chapter 6 of Pearl and Mackenzie (2018).

The (assumed) causal diagram of the TED talk data generating process is shown in Figure 2.

Figure 2: Causal Diagram of the Data Generating Process of TED Talks

We know that the popularity of a talk (i.e., the Total Views) depends on the speech contents (i.e., the Transcript). Therefore, the Transcript causes the Total Views to change, and thus we draw an arrow from Transcripts to Total Views. Although we know that the popularity also depends on the nonverbal contents (e.g., prosody, facial expressions), we remove those modalities from our prediction system in order to eliminate any gender or racial bias. The Transcripts also cause the distribution of the Rating Counts to change. The Total Views, in turn, also cause the Rating Counts to change, because the more people watch a specific talk, the higher the rating counts. We can safely assume that this arrow from Total Views to Rating Counts models a linear relationship.

The causal relationships so far reveal that the Total Views act as a “mediator” (Pearl and Mackenzie, 2018) between the Transcripts and the Rating Counts and thus help our prediction. However, it is easy to see that the Total Views are affected by various biases present in the dataset. For example, the longer a TED talk remains on the web, the more views it gets. Therefore, the “Age” of a talk causes the Total Views to change. We can imagine many other variables (e.g., how much the talk is publicized, the speakers’ reputations) that can affect the Total Views. We, however, do not want these variables to affect our prediction. Fortunately, all these variables can affect the Rating Counts only through the Total Views, because the viewers must arrive at the page in order to annotate the ratings. Therefore, we can remove the effects of the unwanted variables by removing the effect of the Total Views from the Rating Counts, with the help of the linearity assumption mentioned before. We normalize the rating counts of each talk as in the following equation:

$r^{\text{scaled}}_i = \dfrac{r_i}{\sum_{j=1}^{14} r_j}$   (1)

where $r_i$ represents the count of the $i$-th rating label in a talk. Let us assume that in a talk with $V$ total views, a fraction $f_i$ of the viewers annotate the rating category $i$, i.e., $r_i = f_i V$. Then the scaled rating becomes $r^{\text{scaled}}_i = \frac{f_i V}{V \sum_j f_j} = \frac{f_i}{\sum_j f_j}$. Notice that the Total Views, $V$, gets canceled from the numerator and the denominator for each talk. This process successfully removes the effect of $V$, as evident in Table 2. Scaling the rating counts reduces the average correlation with the Total Views from 0.56 to -0.03, and it reduces the average correlation with the Age of the Talks from 0.15 to 0.06. Therefore, removing the effect of the Total Views also reduces the effect of the Age of the Talks on the ratings; it should work similarly for the other unwanted variables as well.

Total Views Age of Talks
noscale scale noscale scale
Beaut. 0.52 0.01 0.03 -0.14
Conf. 0.39 -0.12 0.27 0.20
Cour. 0.52 -0.003 0.01 0.15
Fasc. 0.78 0.05 0.15 0.06
Funny 0.57 0.14 0.10 0.10
Info. 0.76 -0.08 0.07 -0.19
Ingen. 0.59 -0.06 0.18 0.10
Insp. 0.79 0.1 0.05 -0.15
Jaw-Dr. 0.51 0.1 0.18 0.23
Long. 0.44 -0.17 0.36 0.31
Obnox. 0.27 -0.11 0.19 0.17
OK 0.72 -0.16 0.21 0.14
Pers. 0.72 -0.01 0.12 0.02
Unconv. 0.29 -0.14 0.18 0.15
Avg. 0.56 -0.03 0.15 0.06
Table 2: Correlation coefficients of each category of the ratings with the Total Views and the “Age” of Talks

We binarize the scaled ratings by thresholding over the median value, which results in a high class and a low class for each category of the ratings; the high label indicates having a rating higher than the median value. In the rest of the paper, the ratings refer to these scaled and binarized rating labels, which we model from the transcripts using neural networks as discussed in the following sections.
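The scaling of equation (1) together with the median-based binarization amounts to only a few lines of numpy. The sketch below is ours (the array shapes and the function name are assumptions), not the authors' released code:

```python
import numpy as np

def scale_and_binarize(rating_counts: np.ndarray) -> np.ndarray:
    """rating_counts: (n_talks, 14) raw per-label counts.
    Returns binary labels: 1 if a talk's scaled rating exceeds the per-label median."""
    # Eq. (1): divide each label's count by the talk's total rating count,
    # which cancels the Total Views term discussed in Section 4.
    scaled = rating_counts / rating_counts.sum(axis=1, keepdims=True)
    # Threshold each of the 14 categories at its own median value.
    median = np.median(scaled, axis=0, keepdims=True)
    return (scaled > median).astype(np.int64)
```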

5 Network Architectures

We implement two neural networks to model the scaled and binarized ratings from the transcripts. The architectures of these networks are described below.

5.1 Word Sequence Model

We use a Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) neural network to model the word sequences in the transcripts. However, the transcripts have around 2,460 words on average (see Table 1). It is difficult to model such a long chain even with an LSTM due to the vanishing/exploding gradient problem. We therefore adopt a “Bag-of-Sentences” model, where we model each sentence using the LSTM and average the outputs for predicting the scaled and binarized rating counts. A pictorial illustration of this model is shown in Figure 3.

Figure 3: An illustration of the Word Sequence Model

Each sentence in the transcript is represented by a sequence of word vectors $\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_{M_j}$. (In this paper, we represent column vectors as lowercase boldface letters; matrices or higher-dimensional tensors as uppercase boldface letters; and scalars as lowercase regular letters. We use a prime symbol ($'$) to represent the transpose operation.) Here, each $\mathbf{w}_t$ is the pre-trained, 300-dimensional GloVe word vector (Pennington et al., 2014). We use an LSTM to obtain an embedding vector, $\mathbf{s}_j$, for the $j$-th sentence in the talk transcript. These vectors are averaged and passed through a feed-forward network to produce a 14-dimensional output vector corresponding to the categories of the ratings. An element-wise sigmoid ($\sigma$) activation function is applied to the output vector. The mathematical description of the model is as follows:

$\mathbf{i}_t = \sigma(\mathbf{W}_i \mathbf{w}_t + \mathbf{U}_i \mathbf{h}_{t-1} + \mathbf{b}_i)$   (2)
$\mathbf{f}_t = \sigma(\mathbf{W}_f \mathbf{w}_t + \mathbf{U}_f \mathbf{h}_{t-1} + \mathbf{b}_f)$   (3)
$\mathbf{o}_t = \sigma(\mathbf{W}_o \mathbf{w}_t + \mathbf{U}_o \mathbf{h}_{t-1} + \mathbf{b}_o)$   (4)
$\mathbf{u}_t = \tanh(\mathbf{W}_u \mathbf{w}_t + \mathbf{U}_u \mathbf{h}_{t-1} + \mathbf{b}_u)$   (5)
$\mathbf{c}_t = \mathbf{i}_t \circ \mathbf{u}_t + \mathbf{f}_t \circ \mathbf{c}_{t-1}$   (6)
$\mathbf{h}_t = \mathbf{o}_t \circ \tanh(\mathbf{c}_t)$   (7)
$\mathbf{s}_j = \mathbf{h}_{M_j}$   (8)
$\bar{\mathbf{s}} = \frac{1}{S} \sum_{j=1}^{S} \mathbf{s}_j$   (9)
$\hat{\mathbf{y}} = \sigma(\mathbf{W}_y \bar{\mathbf{s}} + \mathbf{b}_y)$   (10)

Here, equations (2) to (7) represent the definitive characteristics of the LSTM. The vectors $\mathbf{i}_t$, $\mathbf{f}_t$, and $\mathbf{o}_t$ are the input, forget, and output gates (at the $t$-th position), respectively; $\mathbf{c}_t$ and $\mathbf{h}_t$ represent the memory cell and the hidden state of the LSTM. The notation $\circ$ represents the Hadamard (element-wise) product between two vectors. We chose the sentence embeddings to have 128 dimensions; therefore, the dimensions of the transformation matrices $\mathbf{W}$'s and $\mathbf{U}$'s and the vectors $\mathbf{b}$'s are $128 \times 300$, $128 \times 128$, and $128$, respectively. The $\mathbf{W}$'s, $\mathbf{U}$'s, $\mathbf{b}$'s, and the feed-forward parameters $\mathbf{W}_y$ and $\mathbf{b}_y$ are the free parameters of the network, which are learned through back-propagation. The output vector $\hat{\mathbf{y}}$ represents the predicted probabilities of the 14 rating categories. For the $j$-th sentence, the index $t$ varies from $1$ to $M_j$, where $M_j$ is the number of words in that sentence; $S$ represents the total number of sentences in the transcript. We use zero vectors to initialize the memory cell ($\mathbf{c}_0$) and the hidden state ($\mathbf{h}_0$), and as the vector of any out-of-vocabulary word.
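A condensed PyTorch sketch of this bag-of-sentences architecture is given below. The layer sizes follow the text (300-dimensional GloVe inputs, 128-dimensional sentence embeddings, 14 outputs); everything else, including the class name and the batching scheme, is an implementation assumption rather than the authors' released code:

```python
import torch
import torch.nn as nn

class WordSequenceModel(nn.Module):
    """Bag-of-sentences LSTM: encode each sentence, average, then predict 14 ratings."""
    def __init__(self, word_dim=300, sent_dim=128, n_ratings=14):
        super().__init__()
        self.lstm = nn.LSTM(word_dim, sent_dim, batch_first=True)
        self.out = nn.Linear(sent_dim, n_ratings)

    def forward(self, sentences):
        # sentences: list of (1, n_words, word_dim) tensors, one per sentence of a talk
        embeddings = []
        for sent in sentences:
            _, (h_n, _) = self.lstm(sent)          # final hidden state as sentence embedding
            embeddings.append(h_n.squeeze(0))      # (1, sent_dim)
        avg = torch.stack(embeddings).mean(dim=0)  # average over all sentences, Eq. (9)
        return torch.sigmoid(self.out(avg))        # (1, 14) predicted rating probabilities
```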

5.2 Dependency Tree-based Model

We are interested in representing the sentences as hierarchical trees of dependent words. We use a freely available dependency parser named SyntaxNet (https://opensource.google.com/projects/syntaxnet) (Andor et al., 2016) to extract the dependency tree corresponding to each sentence. The child-sum TreeLSTM (Tai et al., 2015) is used to process the dependency trees. As shown in Figure 4, the parts-of-speech and dependency types of the words are used in addition to the GloVe word vectors. We concatenate a parts-of-speech embedding ($\mathbf{p}_j$) and a dependency-type embedding ($\mathbf{d}_j$) with each word vector. These embeddings are learned through back-propagation along with the other free parameters of the network. The mathematical description of the model is as follows:

$\mathbf{x}_j = [\mathbf{w}_j' \; \mathbf{p}_j' \; \mathbf{d}_j']'$   (11)
$\tilde{\mathbf{h}}_j = \sum_{k \in C(j)} \mathbf{h}_k$   (12)
$\mathbf{i}_j = \sigma(\mathbf{W}_i \mathbf{x}_j + \mathbf{U}_i \tilde{\mathbf{h}}_j + \mathbf{b}_i)$   (13)
$\mathbf{f}_{jk} = \sigma(\mathbf{W}_f \mathbf{x}_j + \mathbf{U}_f \mathbf{h}_k + \mathbf{b}_f), \quad k \in C(j)$   (14)
$\mathbf{o}_j = \sigma(\mathbf{W}_o \mathbf{x}_j + \mathbf{U}_o \tilde{\mathbf{h}}_j + \mathbf{b}_o)$   (15)
$\mathbf{u}_j = \tanh(\mathbf{W}_u \mathbf{x}_j + \mathbf{U}_u \tilde{\mathbf{h}}_j + \mathbf{b}_u)$   (16)
$\mathbf{c}_j = \mathbf{i}_j \circ \mathbf{u}_j + \sum_{k \in C(j)} \mathbf{f}_{jk} \circ \mathbf{c}_k$   (17)
$\mathbf{h}_j = \mathbf{o}_j \circ \tanh(\mathbf{c}_j)$   (18)
$\mathbf{s}_m = \mathbf{h}_{\mathrm{root}(m)}$   (19)
$\bar{\mathbf{s}} = \frac{1}{S} \sum_{m=1}^{S} \mathbf{s}_m$   (20)
$\hat{\mathbf{y}} = \sigma(\mathbf{W}_y \bar{\mathbf{s}} + \mathbf{b}_y)$   (21)

Here, equation (11) represents the concatenation of the pre-trained GloVe word vector $\mathbf{w}_j$ with the learnable embeddings for the parts of speech ($\mathbf{p}_j$) and the dependency type ($\mathbf{d}_j$) of the $j$-th word. $C(j)$ represents the set of all the children of node $j$. The parent-child relations of the TreeLSTM nodes come from the dependency tree, and zero vectors are used as the children of leaf nodes. A node is processed recursively using equations (12) through (18). Notably, these equations are similar to equations (2) to (7), except that the memory cell and hidden states flow hierarchically from the children to the parent instead of moving sequentially. Each node contains a forget gate $\mathbf{f}_{jk}$ for each child. The sentence embedding vector $\mathbf{s}_m$ is obtained from the root node of the $m$-th sentence.
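The node-level computation of equations (12) to (18) can be sketched in PyTorch as follows. This is a child-sum TreeLSTM cell in the spirit of Tai et al. (2015); the dimensions, names, and the fused gate layer are our assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

class ChildSumTreeLSTMCell(nn.Module):
    """One node of a child-sum TreeLSTM (Tai et al., 2015)."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.iou = nn.Linear(in_dim + hid_dim, 3 * hid_dim)   # input, output, update gates
        self.W_f = nn.Linear(in_dim, hid_dim)
        self.U_f = nn.Linear(hid_dim, hid_dim, bias=False)

    def forward(self, x, child_h, child_c):
        # x: (in_dim,) node input; child_h, child_c: (n_children, hid_dim)
        h_tilde = child_h.sum(dim=0)                           # Eq. (12): sum of children states
        i, o, u = self.iou(torch.cat([x, h_tilde])).chunk(3)   # Eqs. (13), (15), (16)
        i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)
        f = torch.sigmoid(self.W_f(x) + self.U_f(child_h))     # Eq. (14): one forget gate per child
        c = i * u + (f * child_c).sum(dim=0)                   # Eq. (17)
        h = o * torch.tanh(c)                                  # Eq. (18)
        return h, c
```

For a leaf node, child_h and child_c would simply be single rows of zeros, matching the zero-vector children mentioned above.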

Figure 4: An illustration of the Dependency Tree-based Model

6 Training the Networks

We implemented the networks in PyTorch (pytorch.org). Details of the training procedure are described in the following subsections.

6.1 Optimization

We use multi-label Binary Cross-Entropy loss as defined below for the backpropagation of the gradients:

$\mathcal{L}(\hat{\mathbf{y}}, \mathbf{y}) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$   (22)

Here, $\hat{\mathbf{y}}$ is the model output and $\mathbf{y}$ is the ground-truth label obtained from the data; $\hat{y}_i$ and $y_i$ represent the $i$-th elements of $\hat{\mathbf{y}}$ and $\mathbf{y}$, respectively. $N$ represents the number of rating categories (14 in our case).
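Equation (22) corresponds to the standard binary cross-entropy averaged over the 14 labels, e.g., PyTorch's nn.BCELoss. A brief, self-contained illustration with stand-in tensors:

```python
import torch
import torch.nn as nn

criterion = nn.BCELoss()  # averages the per-label binary cross-entropy of Eq. (22)

logits = torch.randn(1, 14, requires_grad=True)   # stand-in for the network output
y_hat = torch.sigmoid(logits)                     # predicted probabilities for 14 categories
y_true = torch.randint(0, 2, (1, 14)).float()     # binarized ground-truth ratings
loss = criterion(y_hat, y_true)
loss.backward()                                   # gradients flow back into the network
```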

We randomly split the training dataset in a 9:1 ratio and name the two parts the training and development subsets, respectively. The networks are trained over the training subset. We use the loss on the development subset to tune the hyper-parameters, to adjust the learning rate, to control the regularization strength, and to select the best model for the final evaluation. The training loop is terminated when the loss over the development subset saturates. The model parameters are saved only when the loss over the development subset is lower than in any previous iteration.

We experiment with two optimization algorithms: Adam (Kingma and Ba, 2014) and Adagrad (Duchi et al., 2011). The learning rate is varied over an exponential range, and the optimization algorithms are evaluated with several mini-batch sizes. We obtain the best results using Adagrad and Adam with appropriately tuned learning rates. The training loop ran for 50 iterations, which mostly saturates the development set loss. We conducted a number of experiments with various parameter settings. Each experiment usually takes about 120 hours to make 50 iterations over the dataset when running on an Nvidia K20 GPU.
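A schematic of this procedure (optimizer choice plus saving the model only when the development loss improves) is sketched below; model, train_loader, dev_loss, and the learning rate are placeholders assumed to be defined elsewhere, not the authors' actual settings:

```python
import copy
import torch
import torch.nn as nn

criterion = nn.BCELoss()
optimizer = torch.optim.Adagrad(model.parameters(), lr=1e-2)  # learning rate is illustrative

best_dev_loss, best_state = float("inf"), None
for epoch in range(50):                          # 50 passes over the training subset
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    current = dev_loss(model)                    # loss on the development subset
    if current < best_dev_loss:                  # save parameters only on improvement
        best_dev_loss = current
        best_state = copy.deepcopy(model.state_dict())
```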

Figure 5: Effect of regularization on the training and development subset loss

6.2 Regularization

Neural networks are often regularized using Dropout (Srivastava et al., 2014) to prevent overfitting, where the elements of a layer's output are set to zero with some probability $p$ during training. A naive application of dropout to the LSTM's hidden state disrupts its ability to retain long-term memory. We resolve this issue using the weight-dropping technique (Wan et al., 2013; Merity et al., 2018). In this technique, instead of applying the dropout operation between time-steps, it is applied to the hidden-to-hidden weight matrices (Wan et al., 2013), i.e., the $\mathbf{U}$ matrices in equations (2) to (5) and (13) to (16). We use the original dropout method in the fully connected layers. The dropout probability $p$ is tuned using the development subset, as with the other hyper-parameters. The effect of regularization is shown in Figure 5. We also experimented with weight-decay regularization, which adds the average $L_2$-norm of all the network parameters to the loss function. However, weight-decay adversely affected the training process in our neural network models.
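To make the distinction concrete, the sketch below shows a single hand-rolled LSTM step (equations (2) to (7)) in which dropout is applied to the hidden-to-hidden weight matrix rather than to the hidden state. In practice the dropped copy of U would be sampled once per sequence and reused at every time-step; the function name and tensor layout are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def lstm_step_weight_drop(x_t, h_prev, c_prev, W, U, b, p=0.5, training=True):
    """One LSTM step with weight dropping: dropout hits the recurrent weights U,
    not the hidden state h. W: (4*hid, in), U: (4*hid, hid), b: (4*hid,)."""
    U_dropped = F.dropout(U, p=p, training=training)       # drop recurrent weights, not h
    gates = x_t @ W.T + h_prev @ U_dropped.T + b            # pre-activations for all gates
    i, f, o, u = gates.chunk(4, dim=-1)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    u = torch.tanh(u)
    c_t = i * u + f * c_prev                                 # Eq. (6)
    h_t = o * torch.tanh(c_t)                                # Eq. (7)
    return h_t, c_t
```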

7 Baseline Methods

We compare the performance of the neural network models against several machine learning techniques. We also compare our results with the one reported in Weninger et al. (2013).

We use a psycholinguistic lexicon named “Linguistic Inquiry and Word Count” (LIWC) (Pennebaker et al., 2001) for extracting the language features. We count the total number of words under each of the 64 word categories provided in the LIWC lexicon and normalize these counts by the total number of words in the transcript. The LIWC categories include functional words (e.g., articles, quantifiers, pronouns), various content categories (e.g., anxiety, insight), positive emotions (e.g., happy, kind), negative emotions (e.g., sad, angry), and more. These features have been used in several related works (Ranganath et al., 2009; Zechner et al., 2009; Naim et al., 2016; Liu et al., 2017b) with good prediction performance.
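Schematically, the feature extraction reduces to normalized per-category word counts. The sketch below assumes a lexicon dictionary mapping each of the 64 category names to a set of words; the real LIWC lexicon is proprietary and also supports wildcard prefixes, which this sketch ignores:

```python
from collections import Counter

def liwc_features(transcript, lexicon):
    """Normalized word-category counts for one transcript.
    `lexicon` maps each LIWC category name to a set of words (assumed available)."""
    words = transcript.lower().split()
    counts = Counter(words)
    total = len(words)
    return {
        category: sum(counts[w] for w in vocab) / total
        for category, vocab in lexicon.items()
    }
```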

We use the Linear Support Vector Machine (SVM) (Vapnik and Chervonenkis, 1964) and LASSO (Tibshirani, 1996) as the baseline classifiers. In SVM, the following objective function is minimized:

$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \ \frac{1}{2} \|\mathbf{w}\|_2^2 + C \sum_{i} \xi_i$   (23)
subject to $\ y_i(\mathbf{w}'\mathbf{x}_i + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$ for every training instance $i$,

where $\mathbf{w}$ is the weight vector, $b$ is the bias term, and the $\xi_i$ are the slack variables; $\|\mathbf{w}\|_2$ refers to the $L_2$ norm of the vector $\mathbf{w}$. In these equations, we assume that the “higher than median” and “lower than median” classes are represented by the values $+1$ and $-1$, respectively.

We adapt the original LASSO (Tibshirani, 1996) regression model for classification purposes. It is equivalent to logistic regression with $L_1$ norm regularization. It works by solving the following optimization problem:

$\min_{\mathbf{w}, b} \ \|\mathbf{w}\|_1 + C \sum_{i} \log\left(1 + e^{-y_i(\mathbf{w}'\mathbf{x}_i + b)}\right)$   (24)

where $C$ is the inverse of the regularization strength and $\|\mathbf{w}\|_1$ is the $L_1$ norm of $\mathbf{w}$. The $L_1$ norm regularization is known to push the coefficients of the irrelevant features down to zero, thus reducing the predictor variance.
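Both baselines can be realized with scikit-learn by training one independent binary classifier per rating category; the regularization constants below are placeholders rather than the values used in our experiments:

```python
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

# X: (n_talks, 64) normalized LIWC features; Y: (n_talks, 14) binarized ratings.
def train_baselines(X, Y):
    svm_models, lasso_models = [], []
    for k in range(Y.shape[1]):
        # Linear SVM of Eq. (23), one classifier per rating category
        svm_models.append(LinearSVC(C=1.0).fit(X, Y[:, k]))
        # L1-penalized logistic regression, i.e., the LASSO-style classifier of Eq. (24)
        lasso_models.append(
            LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, Y[:, k])
        )
    return svm_models, lasso_models
```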

Model Avg. F-sc. Avg. Prec. Avg. Rec. Avg. Acc.
Word Seq 0.76 0.76 0.76 0.76
Dep. Tree 0.77 0.77 0.77 0.77
Dep. Tree (Unscaled) 0.67 0.70 0.68 0.68
LinearSVM 0.69 0.69 0.69 0.69
LASSO 0.69 0.70 0.70 0.70
Weninger et al. (SVM) - - 0.71 -
Table 3: Average F-score, Precision, Recall and Accuracy for various models. Due to the choice of the median thresholds, the precision, recall, F-score, and accuracy values are practically identical in our experiments.

8 Experimental Results

We allocated a randomly sampled subset of the TED talks as a reserved test set. Data from this subset was never used for training the models or for tuning the hyper-parameters; we used it only for evaluating the models saved during the training process. All the results shown in this section are computed over this test subset.

We evaluate the predictive models by computing four performance metrics: F-score, Precision, Recall, and Accuracy. We compute the average of each metric over all the rating categories, as shown in Table 3. The first two rows represent the average performances of the Word Sequence model and the Dependency Tree based model, respectively. These models were trained and tested on the scaled rating counts. The dependency tree based model shows a slightly better performance than the word sequence model. We also trained and tested the dependency tree model with unscaled rating counts (third row in Table 3). Notably, the same network architecture that performs best for the scaled ratings performs much worse when predicting the unscaled ratings. Modeling the data generating process and removing the effects of the unwanted variables raised the average F-score from 0.67 to 0.77, and this is achieved without the inclusion of any additional data. We believe this is because the unscaled ratings are affected by the biases present in the dataset, which are difficult to predict using the transcripts only; removing the biases therefore makes the prediction problem easier. We compare our results with Weninger et al. (2013) as well. The average recall for their best performing classifier (SVM) is shown in the last row of the table, which is similar to our baseline methods.

Ratings Word Seq. Dep. Tree Weninger et al. (SVM)
Beautiful 0.88 0.91 0.80
Confusing 0.70 0.74 0.56
Courageous 0.84 0.89 0.79
Fascinating 0.75 0.76 0.80
Funny 0.78 0.77 0.76
Informative 0.81 0.83 0.78
Ingenious 0.80 0.81 0.74
Inspiring 0.72 0.77 0.72
Jaw-dropping 0.68 0.72 0.72
Longwinded 0.73 0.70 0.63
Obnoxious 0.64 0.64 0.61
OK 0.73 0.70 0.61
Persuasive 0.83 0.84 0.78
Unconvincing 0.70 0.70 0.61
Average 0.76 0.77 0.71
Table 4: Recall for the various rating categories. We report recall to allow comparison with the results of Weninger et al. (2013).

In Table 4, we present the recall values for all the different rating categories. We choose recall over the other metrics to make a comparison with the results reported by Weninger et al. (2013). However, all the other metrics (Accuracy, Precision, and F-score) have practically the same value as the recall due to our choice of the median threshold while binarizing the ratings. The highest recall (0.91, for the dependency tree model) is observed for the Beautiful ratings, while the lowest (0.64) is for Obnoxious. We observe a trend in the table that the ratings with fewer counts (shown in Figure 1) are usually more difficult to predict.

Table 4 provides a clearer picture of how the dependency tree based neural network performs better than the word sequence neural network. The former achieves a higher recall for most of the rating categories (9 out of 14). Only in three cases (Funny, Longwinded, and OK) does the word sequence model achieve higher performance than the dependency tree model, and both models perform equally well for the Obnoxious and Unconvincing rating categories. It is important to realize that the dependency trees we extracted from the sentences of the transcripts were not manually annotated; they were extracted using SyntaxNet, which itself introduces some error. Andor et al. (2016) report a high, but not perfect, accuracy for their model. We expected to notice an impact of this error in our results. However, the results show that the additional information (parts-of-speech tags, dependency types, and tree structures) benefited the prediction performance despite the errors in the dependency trees. We think the hierarchical tree structure resolves some ambiguities in the sentence semantics that are not available to the word sequence model.

Finally, a comparison with the results from Weninger et al. (2013) reveals that the neural network models perform better for almost every rating category except Fascinating and Jaw-Dropping. A neural network is a universal function approximator (Cybenko, 1989; Hornik, 1991) and is thus expected to perform better. Yet we think another reason for its advantage is its ability to process a more faithful representation of the transcripts. In the baseline methods, the transcripts are provided as words without any order. In the neural counterparts, however, it is possible to maintain a more natural representation of the words: either the sequence or the syntactic relationships among them through a dependency tree. Besides, the neural networks intrinsically capture the correlations among the rating categories, while the baseline methods consider each category as a separate classification problem. These are possibly a few of the reasons why the neural networks are a better choice for the TED talk rating prediction task.

9 Conclusion

In summary, we presented neural network based architectures to predict the TED talk ratings from the speech transcripts. We carefully modeled the data generating process from known causal relations in order to remove the effects of data bias. Our experimental results show that this approach effectively removes the data bias from the prediction model, improving the average F-score from 0.67 to 0.77. This result indicates that modeling the data generating process and removing the effects of unwanted variables can lead to a higher predictive capacity even with a moderately sized dataset such as ours. The neural network architectures provide state-of-the-art prediction performance, outperforming the competitive baseline method in the literature.

Our results also show that the dependency tree based neural network architecture performs better in predicting the TED talk ratings as compared to a word sequence model. The exact reason why this happens, however, remains to be explored in the future.

References

  • D. Alikaniotis, H. Yannakoudakis and M. Rei (2016) Automatic text scoring using neural networks. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 715–725.
  • D. Andor, C. Alberti, D. Weiss, A. Severyn, A. Presta, K. Ganchev, S. Petrov and M. Collins (2016) Globally normalized transition-based neural networks. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).
  • D. Bertero and P. Fung (2016) A long short-term memory framework for predicting humor in dialogues. In Proceedings of the 2016 Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pp. 130–135.
  • P. Bull (2016) Claps and claptrap: the analysis of speaker-audience interaction in political speeches. Journal of Social and Political Psychology 4 (1), pp. 473–492.
  • J. Buolamwini and T. Gebru (2018) Gender shades: intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency, pp. 77–91.
  • C. Chen, Y. Yang, J. Zhou, X. Li and F. S. Bao (2018) Cross-domain review helpfulness prediction based on convolutional neural networks with auxiliary domain discriminators. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), Vol. 2, pp. 602–607.
  • L. Chen and C. M. Lee (2017) Convolutional neural network for humor recognition. arXiv preprint arXiv:1702.02584.
  • L. Chen, R. Zhao, C. W. Leong, B. Lehman, G. Feng and M. E. Hoque (2017) Automated video interview judgment on a large-sized corpus collected online. In ACII, pp. 504–509.
  • G. Cybenko (1989) Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 2 (4), pp. 303–314.
  • A. Drasovean and C. Tagg (2015) Evaluative language and its solidarity-building role on TED.com: an appraisal and corpus analysis. Language@Internet 12.
  • J. Duchi, E. Hazan and Y. Singer (2011) Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12 (Jul).
  • Y. Farag, H. Yannakoudakis and T. Briscoe (2018) Neural automated essay scoring and coherence modeling for adversarially crafted input. In Proceedings of the 2018 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), pp. 263–271.
  • M. Fung, Y. Jin, R. Zhao and M. E. Hoque (2015) ROC Speak: semi-automated personalized feedback on nonverbal behavior from recorded videos. In Ubicomp, pp. 1167–1178.
  • C. Gallo (2014) Article in Forbes.
  • C. Gallo (Ed.) (2014) Talk like TED. Emerald Group Publishing Limited.
  • J. Guynn (2015) Google Photos labeled black people ‘gorillas’. USA Today.
  • F. Haider, F. A. Salim, S. Luz, C. Vogel, O. Conlan and N. Campbell (2017) Visual, laughter, applause and spoken expression features for predicting engagement within TED talks. Feedback 10, pp. 20.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
  • K. Hornik (1991) Approximation capabilities of multilayer feedforward networks. Neural Networks 4 (2), pp. 251–257.
  • A. Jaech, R. Koncel-Kedziorski and M. Ostendorf (2016) Phonological pun-derstanding. In Proceedings of the 2016 Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pp. 654–663.
  • C. Jin, B. He, K. Hui and L. Sun (2018) TDNN: a two-stage deep neural network for prompt-independent automated essay scoring. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 1088–1097.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR).
  • H. Liu, Y. Gao, P. Lv, M. Li, S. Geng, M. Li and H. Wang (2017a) Using argument-based features to predict and analyse review helpfulness. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1358–1363.
  • Z. Liu, A. Xu, M. Zhang, J. Mahmud and V. Sinha (2017b) Fostering user engagement: rhetorical devices for applause generation learnt from TED talks. arXiv preprint arXiv:1704.02362.
  • L. Martin and P. Pu (2014) Prediction of helpful reviews using emotions extraction. In Proceedings of the 28th AAAI Conference on Artificial Intelligence (AAAI-14).
  • S. Merity, N. S. Keskar and R. Socher (2018) Regularizing and optimizing LSTM language models. In 6th International Conference on Learning Representations.
  • I. Naim, M. I. Tanveer, D. Gildea and E. Hoque (2016) Automated analysis and prediction of job interview performance. IEEE Transactions on Affective Computing.
  • L. S. Nguyen and D. Gatica-Perez (2016) Hirability in the wild: analysis of online conversational video resumes. IEEE Transactions on Multimedia 18 (7), pp. 1422–1437.
  • N. Pappas and A. Popescu-Belis (2013) Sentiment analysis of user comments for one-class collaborative filtering over TED talks. In SIGIR ’13, pp. 773–776.
  • J. Pearl and D. Mackenzie (2018) The Book of Why. Basic Books.
  • J. Pearl (2009) Causality. 2nd edition, Cambridge University Press.
  • J. W. Pennebaker, M. E. Francis and R. J. Booth (2001) Linguistic Inquiry and Word Count: LIWC 2001. Mahwah: Lawrence Erlbaum Associates 71.
  • J. Pennington, R. Socher and C. D. Manning (2014) GloVe: global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
  • R. Ranganath, D. Jurafsky and D. McFarland (2009) It’s not you, it’s me: detecting flirting and its misperception in speed-dates. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP), pp. 334–342.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958.
  • C. R. Sugimoto, M. Thelwall, V. Larivière, A. Tsou, P. Mongeon and B. Macaluso (2013) Scientists popularizing science: characteristics and impact of TED talk presenters. PLoS ONE 8 (4), pp. e62403.
  • K. Taghipour and H. T. Ng (2016) A neural approach to automated essay scoring. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1882–1891.
  • K. S. Tai, R. Socher and C. D. Manning (2015) Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pp. 1556–1566.
  • M. I. Tanveer, J. Liu and M. E. Hoque (2015) Unsupervised extraction of human-interpretable nonverbal behavioral cues in a public speaking scenario. In 23rd Annual ACM Conference on Multimedia, pp. 863–866.
  • M. I. Tanveer, S. Samrose, R. A. Baten and M. E. Hoque (2018) Awe the audience: how the narrative trajectories affect audience perception in public speaking. In Proceedings of the Conference on Human Factors in Computing Systems (CHI), New York, NY, USA, pp. 24:1–24:12.
  • R. Tibshirani (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, pp. 267–288.
  • A. Tsou, M. Thelwall, P. Mongeon and C. R. Sugimoto (2014) A community of curious souls: an analysis of commenting behavior on TED talks videos. PLoS ONE 9 (4), pp. e93609.
  • S. Valenti, F. Neri and A. Cucchiarelli (2003) An overview of current research on automated essay grading. Journal of Information Technology Education: Research 2, pp. 319–330.
  • V. Vapnik and A. Chervonenkis (1964) A note on one class of perceptrons. Automation and Remote Control 25 (1).
  • D. Wallechinsky, A. Wallace, J. Farrow and I. Basen (2005) The Book of Lists: The Original Compendium of Curious Information. Knopf Canada.
  • L. Wan, M. Zeiler, S. Zhang, Y. Le Cun and R. Fergus (2013) Regularization of neural networks using DropConnect. In Proceedings of the International Conference of Machine Learning (ICML), pp. 1058–1066.
  • F. Weninger, P. Staudt and B. Schuller (2013) Words that fascinate the listener: predicting affective ratings of on-line lectures. International Journal of Distance Education Technologies (IJDET) 11 (2), pp. 110–123.
  • Y. Yang, Y. Yan, M. Qiu and F. Bao (2015) Semantic analysis and helpfulness prediction of text for online product reviews. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Vol. 2, pp. 38–44.
  • K. Zechner, D. Higgins, X. Xi and D. M. Williamson (2009) Automatic scoring of non-native spontaneous speech in tests of spoken English. Speech Communication 51 (10), pp. 883–895.