Multimodal Learning for Classroom Activity Detection


Classroom activity detection (CAD) focuses on accurately classifying whether the teacher or a student is speaking and recording the length of individual utterances during a class. A CAD solution helps teachers get instant feedback on their pedagogical instructions, which can improve educators’ teaching skills and, in turn, student achievement. However, CAD is very challenging because (1) the CAD model needs to generalize well across different teachers and students; (2) data from both vocal and language modalities has to be fused carefully so that they complement each other; and (3) the solution should not rely heavily on additional recording devices. In this paper, we address the above challenges with a novel attention-based neural framework. Our framework not only extracts both speech and language information, but also utilizes an attention mechanism to capture long-term semantic dependence. Our framework is device-free and is able to take any classroom recording as input. The proposed CAD learning framework is evaluated in two real-world education applications. The experimental results demonstrate the benefits of learning an attention-based neural network from classroom data with different modalities, and show that our approach outperforms state-of-the-art baselines on various evaluation metrics.


Hang Li, Yu Kang, Wenbiao Ding, Song Yang, Songfan Yang, Gale Yan Huang, Zitao Liu (corresponding author)
TAL AI Lab, TAL Education Group, Beijing, China


Multimodal Learning, Classroom Activity Detection, K-12 Education

1 Introduction

Teacher-student interaction analysis in live classrooms, with the goal of accurately quantifying classroom activities such as lecturing and discussion, is crucial for student achievement [1, 2, 3, 4, 5, 6, 7]. It not only gives students the opportunity to work through their understanding and learn from the ideas of others, but also gives teachers epistemic feedback on their instruction, which is important for honing their teaching skills [8, 9, 10, 11]. Such analysis usually takes a classroom recording (e.g., as collected by an audio or video recorder) as input and outputs pedagogical annotations of classroom activities.

Most current practices of classroom dialogic analysis are logistically complex and expensive, requiring observer rubrics, observer training, and continuous assessment to maintain a pool of qualified observers [12, 10]. Even with performance support tools developed by Nystrand and colleagues, including the live coding software CLASS 4.25 [1, 13], approximately 4 hours of coding time are still required per 1 hour of classroom observation. This is unsustainable for scalable research, let alone for providing day-to-day feedback for teacher professional development.

In this work, we focus on the fundamental and classic classroom activity detection (CAD) problem and aim to automatically distinguish whether the teacher or a student is speaking and to record both the length of individual utterances and the total amount of talk during a class. An example annotation trace for a class is illustrated in Figure 1. The activity patterns identified by CAD during lessons give valuable information about the quantity and distribution of classroom talk, and therefore help teachers improve their interactions with students so as to improve student achievement.

Figure 1: A graphical illustration of CAD results in a sample class. The x-axis represents time within the class.

A large spectrum of models has been developed and successfully applied to CAD problems [14, 2, 15, 8]. However, CAD in real-world scenarios poses numerous challenges. First, vocal information alone is usually not enough to solve the CAD task in the multimodal classroom environment. The teacher’s voice might be very close to some student’s voice, which poses a hard modeling problem, since the existing well-developed approaches either focus on identifying each individual speaker in a clean environment or rely on extra recording devices for such activity detection. Second, teacher-student conversations in real-world classrooms are very casual and open-ended. It is difficult to capture the latent semantic information, and how to model long-term sentence-level dependence remains a big concern. Third, the CAD solution should be flexible and should not rely on additional recording devices, such as portable microphones that collect only the teacher’s audio.

In this paper, we study and develop a novel solution to CAD problems that can learn neural network models from the real-world multimodal classroom environment. More specifically, we present an attention-based multimodal learning framework that (1) fuses the multimodal information with attention-based networks, such that the teachers’ or students’ semantic ambiguities can be alleviated by vocal attention scores; and (2) directly learns and predicts from classroom recordings and does not rely on any additional recording device for teachers.

2 Related Work

There is a long research history on the use of audio (and video) to study instructional practices and student behaviors in live classrooms, and many approaches and schemes have been designed for CAD annotation for different purposes [14, 2, 15, 8]. For example, Owens et al. develop Decibel Analysis for Research in Teaching (DART), which analyzes the volume and variance of classroom recordings to predict the amount of time spent on single-voice (e.g., lecture), multiple-voice (e.g., pair discussion), and no-voice (e.g., clicker question thinking) activities [2]. Cosbey et al. improve the DART performance by using deep and recurrent neural network (RNN) architectures [15], and present a comprehensive experimental comparison of deep neural network (DNN), gated recurrent unit (GRU) and RNN models. Mu et al. present the Automatic Classification of Online Discussions with Extracted Attributes (ACODEA) framework for fully automatic segmentation and classification of online discussions [14]. ACODEA focuses on learners’ acquisition of argumentation knowledge and categorizes the content along micro-argumentation dimensions such as Claim, Ground, Warrant, and Inadequate Claim [16]. Donnelly et al. aim to provide teachers with formative feedback on their instruction by training Naive Bayes models to identify occurrences of key instructional segments, such as Question & Answer, Procedures and Directions, and Supervised Seatwork [8].

The closest related work is the research by Wang et al. [4], who conduct CAD by using the LENA system [5] and identify three discourse activities: teacher lecturing, class discussion and student group work. Our work differs from that of Wang et al. in that we develop a novel attention-based multimodal neural framework to conduct CAD tasks in a real-world device-free environment, whereas Wang et al. need teachers to wear the LENA system during the entire teaching process and use differences in volume and pitch to assess when teachers or students are speaking.

Please note that CAD is different from classic speaker verification [17, 18, 19] and speaker diarization [20] in that (1) there is no two-stage enrollment-verification process in CAD tasks; and (2) not every speaker needs to be identified.

3 Our Approach

3.1 Problem Statement

Let $S$ be the sequence of segments of a classroom recording, i.e., $S = \{s_i\}_{i=1}^{N}$, where $s_i$ denotes the $i$th segment and $N$ is the total number of segments. Let $Y$ be the corresponding label sequence, i.e., $Y = \{y_i\}_{i=1}^{N}$, where each $y_i$ represents the classroom activity type, i.e., whether the segment is spoken by a student or a teacher. For each segment $s_i$, we extract both the acoustic feature $x_i^a \in \mathbb{R}^{d_1}$ and the text feature $x_i^t \in \mathbb{R}^{d_2}$, where $d_1$ and $d_2$ are the dimensionalities of $x_i^a$ and $x_i^t$. Let $X^a$ and $X^t$ be the acoustic and text feature matrices of sequence $S$, i.e., $X^a \in \mathbb{R}^{N \times d_1}$ and $X^t \in \mathbb{R}^{N \times d_2}$. With the aforementioned notations and definitions, we can now formally define the CAD problem as a sequence labeling problem:

Given a classroom recording segment sequence $S$ and the corresponding acoustic and text feature matrices $X^a$ and $X^t$, the goal is to find the most probable classroom activity type sequence $\hat{Y}$ as follows:

$$\hat{Y} = \arg\max_{Y \in \mathcal{Y}} P(Y \mid X^a, X^t)$$

where $\mathcal{Y}$ is the collection of all possible labeling sequences and $\hat{Y}$ is the predicted classroom activity type sequence.

3.2 The Proposed Framework

Multimodal Attention Layer

In order to capture the information from both vocal and language modalities in the classroom environment, we design a novel multimodal attention layer that alleviates language ambiguity by using a voice attention mechanism. Most classroom conversations are open-ended, and it is very difficult to determine the activity type of a sentence when only the sentence itself is considered. Furthermore, not every contextual segment contributes equally to the labeling task, especially when the context is a mix of segments from teachers and students. Therefore, we use acoustic information as a complementary resource. More specifically, for each segment $s_i$, we utilize not only its own acoustic and language information but also the contextual information within the entire classroom recording. Moreover, the contextual segments are automatically weighted by the voice attention scores. The voice attention scores aim to cluster segments from the same subject (teacher or student), as illustrated in Figure 2(a).

Figure 2: (a) Multimodal attention layer. (b) The proposed neural framework.

As shown in Figure 2(a), each segment is first treated as a query, and we compute the attention scores between the query and all the remaining segments by using the acoustic features. We choose the standard scaled dot-product as our attention function [21]. The scaled dot-product scores can be viewed as substitutes for the cosine similarities between voice embedding vectors, which have been commonly used to measure the acoustic similarity between different speakers’ utterances [17, 20]. After that, we compute the multimodal representation by applying the attention scores to the contextual language features. The above process can be concisely expressed in matrix form as $Q = K = X^a$ and $V = X^t$, where $Q$, $K$ and $V$ are the query, key and value matrices in the standard attention framework [21].

Both $X^a$ and $X^t$ are from pre-trained models. The acoustic feature $X^a$ is obtained from the state-of-the-art LSTM-based speaker embedding network proposed by Wan et al. [17]. The d-vector generated by such a network has been shown to be effective in representing the voice characteristics of different speakers in speaker verification and speaker diarization problems [20]. The language feature $X^t$ comes from the word embedding network proposed by Mikolov et al. [22], which is widely used in many neural language understanding models [21, 23]. In practice, to achieve better performance on the classroom-specific datasets, we also fine-tune the pre-trained models with linear transformation operators, i.e., $Q = K = X^a W^a$ and $V = X^t W^t$, where $W^a \in \mathbb{R}^{d_1 \times d}$ and $W^t \in \mathbb{R}^{d_2 \times d}$ are the linear projection matrices.

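The attention score matrix $A$ is computed through the dot product of $Q$ and $K$, and the softmax function is then used to normalize the score values. Finally, the output embedding matrix $H$ is calculated by the dot product of $A$ and the value matrix $V$. The complete equation is shown as follows:

$$H = AV = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$$

where $d$ is the projected dimension of $Q$ and $K$.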

With our multimodal attention layer, the acoustic features serve as a bridge that connects the scattered semantic features across different segments.
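The fusion step described above can be illustrated with a minimal pure-Python sketch. It assumes the acoustic features act as both queries and keys and the text features act as values, and it omits the learned linear projections. The function names and the toy feature values are hypothetical, and each segment also attends to itself here, which is a simplification.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def multimodal_attention(acoustic, text):
    """Weight contextual text features by acoustic similarity.

    acoustic: N x d matrix used as both queries and keys (assumption: no projection).
    text:     N x d_v matrix used as values.
    Returns (scores, fused): an N x N score matrix and an N x d_v fused matrix.
    """
    d = len(acoustic[0])
    keys_t = [list(col) for col in zip(*acoustic)]  # transpose to d x N
    logits = matmul(acoustic, keys_t)               # N x N dot products
    scores = [softmax([v / math.sqrt(d) for v in row]) for row in logits]
    fused = matmul(scores, text)                    # attend over text features
    return scores, fused


# Toy data: segments 0 and 1 share a voice, segment 2 is a different speaker.
acoustic = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
text = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
scores, fused = multimodal_attention(acoustic, text)
print(scores[0])
print(fused[0])
```

In this toy case, the segments with similar voice embeddings receive almost all of each other's attention mass, so the fused representation of a segment is dominated by text features from the same speaker.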

The Overall Architecture

By utilizing the above multimodal attention layer, we are able to learn fused multimodal embeddings for each segment. Similar to [24], we add a residual connection to the multimodal attention block, implemented by concatenation, and stack the resulting per-segment vectors into a matrix. Moreover, to better capture the contextual sequential information within the entire classroom recording, we integrate the position information of the sequence by using a bidirectional LSTM (BiLSTM) layer after the residual layer [23]. We denote the hidden representation of the BiLSTM as $O$ and the size of its hidden vector as $d_h$. Finally, we use a two-layer fully-connected position-wise feed-forward network to conduct the final predictions, i.e., $\hat{Y} = \mathrm{softmax}(\mathrm{FCN}(O))$, where $\mathrm{softmax}(\cdot)$ denotes the softmax function and $\mathrm{FCN}(\cdot)$ denotes the fully-connected network. The entire framework is illustrated in Figure 2(b).

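In our multimodal learning framework, we use the binary cross-entropy loss to optimize the prediction accuracy, which is defined as $\mathcal{L}_{c} = -\sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]$, where $p_i$ is the prediction probability for segment $s_i$.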

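Besides that, we also want to optimize the multimodal attention scores with the existing label information. Therefore, we introduce an attention score regularization operator $\mathcal{R}(A)$ on the attention score matrix $A$, which aims to penalize the similarity scores between two segments when they are from different activity types. It is defined as $\mathcal{R}(A) = \sum_{i=1}^{N} \sum_{j=1}^{N} a_{ij} \cdot \mathbb{1}(y_i \neq y_j)$, where $a_{ij}$ is the $(i, j)$th element of $A$ and represents the attention score between $s_i$ and $s_j$, and $\mathbb{1}(\cdot)$ is the indicator function.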

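Therefore, the final loss function in our multimodal learning framework is $\mathcal{L} = \mathcal{L}_{c} + \beta \mathcal{R}(A)$, where $\mathcal{L}_{c}$ is the binary cross-entropy loss, $\mathcal{R}(A)$ is the attention score regularizer, and $\beta$ is a hyperparameter that is selected (in all experiments) by internal cross-validation while optimizing the models’ predictive performance.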
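As a rough illustration of the combined objective described above, the following sketch adds a binary cross-entropy term to a weighted penalty on attention scores between segments of different activity types. The function names, the 0/1 label encoding, and the choice to sum rather than average the per-segment terms are assumptions made for this example.

```python
import math


def binary_cross_entropy(labels, probs, eps=1e-12):
    """Sum of per-segment binary cross-entropy terms (labels are 0 or 1)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def attention_regularizer(scores, labels):
    """Sum of attention scores between segments whose labels differ."""
    n = len(labels)
    return sum(
        scores[i][j]
        for i in range(n)
        for j in range(n)
        if labels[i] != labels[j]
    )


def total_loss(labels, probs, scores, beta):
    """Cross-entropy plus beta times the attention score penalty."""
    return binary_cross_entropy(labels, probs) + beta * attention_regularizer(scores, labels)


# Toy example: two segments of one type followed by one of the other type.
labels = [1, 1, 0]
probs = [0.9, 0.8, 0.2]
uniform = [[1.0 / 3.0] * 3 for _ in range(3)]
print(total_loss(labels, probs, uniform, beta=10.0))
```

The penalty is zero when no attention mass crosses between the two activity types, so minimizing it pushes the attention scores toward grouping segments of the same type.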

4 Experiments

4.1 Experimental Setup

In the following experiments, we first feed each classroom recording to a publicly available third-party ASR transcription service to (1) generate the unlabeled segment sequence; (2) filter out silence or noise segments; and (3) obtain the raw sentence transcriptions. Then, similar to [20], each segment-level audio signal is transformed into frames of width 25ms and step 10ms, and log-mel-filterbank energies of dimension 40 are extracted from each frame. After that, we build sliding windows of a fixed length (240ms) and step (120ms) on these frames. We run the pre-trained acoustic neural network on each window and compute the acoustic features, i.e., $x_i^a$, by averaging these window-level embeddings. Similarly, the text features, i.e., $x_i^t$, are generated from a pre-trained word embedding network.
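The framing and windowing arithmetic above can be sketched as follows. This is only an illustration: the helper names are hypothetical, spans that do not fit entirely inside a segment are simply dropped (the padding policy is not stated in the text), and the window-level embeddings are placeholder values rather than outputs of the pre-trained acoustic network.

```python
def sliding_spans(duration_ms, width_ms, step_ms):
    """Return (start, end) spans in milliseconds that fit inside the segment."""
    spans = []
    start = 0
    while start + width_ms <= duration_ms:
        spans.append((start, start + width_ms))
        start += step_ms
    return spans


def mean_embedding(window_embeddings):
    """Average a list of equal-length window-level embeddings."""
    n = len(window_embeddings)
    return [sum(col) / n for col in zip(*window_embeddings)]


# Frames (25 ms wide, 10 ms step) and windows (240 ms long, 120 ms step)
# for a hypothetical one-second segment.
frames = sliding_spans(1000, width_ms=25, step_ms=10)
windows = sliding_spans(1000, width_ms=240, step_ms=120)
print(len(frames), len(windows))

# Placeholder window embeddings standing in for the network outputs.
segment_feature = mean_embedding([[1.0, 3.0], [3.0, 5.0]])
print(segment_feature)
```

Under these assumptions, a one-second segment yields 98 frames and 7 windows, and the segment-level acoustic feature is the element-wise mean of the window embeddings.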

The projected dimension, i.e., $d$, and the number of neurons in the BiLSTM, i.e., $d_h$, are set to 64 and 100, respectively. The numbers of neurons in the final two-layer fully connected network are 128 and 2. We use ReLU as the activation function. We set the regularization weight $\beta$ to 10 and use the ADAM optimizer with a learning rate of 0.001. We set the batch size and the number of training epochs to 64 and 20, respectively.

We compare the different methods using accuracy and F1 score. Unlike standard classification evaluation, in which all examples count equally, the lengths of different segments in CAD tasks vary a lot. Therefore, evaluation results are weighted by the time span of each segment. The weight is computed as the proportion of each segment’s duration over the sum of all segment durations in the class recording.
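A minimal sketch of duration-weighted metrics is given below. The text specifies only that results are weighted by each segment's share of the total duration, so applying the weights to the true-positive, false-positive and false-negative counts in F1 is an assumption, as are the function names and the toy labels.

```python
def weighted_accuracy(y_true, y_pred, durations):
    """Accuracy where each segment counts in proportion to its duration."""
    total = sum(durations)
    correct = sum(d for t, p, d in zip(y_true, y_pred, durations) if t == p)
    return correct / total


def weighted_f1(y_true, y_pred, durations, positive):
    """Duration-weighted F1 for one activity type treated as the positive class."""
    tp = fp = fn = 0.0
    for t, p, d in zip(y_true, y_pred, durations):
        if p == positive and t == positive:
            tp += d
        elif p == positive:
            fp += d
        elif t == positive:
            fn += d
    if tp == 0.0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: "T" is teacher, "S" is student, durations are in seconds.
y_true = ["T", "T", "S", "S"]
y_pred = ["T", "S", "S", "S"]
durations = [10.0, 2.0, 4.0, 4.0]
print(weighted_accuracy(y_true, y_pred, durations))
print(weighted_f1(y_true, y_pred, durations, positive="T"))
print(weighted_f1(y_true, y_pred, durations, positive="S"))
```

Here the single misclassified segment lasts only 2 of the 20 seconds, so the weighted accuracy is 0.9, whereas counting segments equally would give 0.75.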

4.2 Baselines

We choose the following state-of-the-art CAD-related approaches as our baselines: (1) BiLSTM with acoustic features (BiLSTM$_a$): we train the BiLSTM model with acoustic features only; the text features are completely ignored and there is no multimodal attention fusion; (2) BiLSTM with text features (BiLSTM$_t$): similar to BiLSTM$_a$, but only the text features are used; (3) attention-based BiLSTM with acoustic features (Attn-BiLSTM$_a$): a self-attention layer is added before the BiLSTM model and, similar to BiLSTM$_a$, only acoustic features are used; (4) attention-based BiLSTM with text features (Attn-BiLSTM$_t$): similar to Attn-BiLSTM$_a$, but only the text features are used; (5) BiLSTM with concatenated features (BiLSTM$_{a+t}$): both the acoustic and text features are used, and their concatenation is fed into the BiLSTM model directly without multimodal attention fusion; (6) spectral clustering with d-vectors (Spectral): spectral clustering on speaker-discriminative embeddings (a.k.a. d-vectors) [20]. It first extracts d-vectors from acoustic features and then applies spectral clustering to cluster all the segments. After that, a classifier takes both acoustic and text features and predicts the final activity type; (7) unbounded interleaved-state recurrent neural networks (UIS-RNN): similar to Spectral, but instead of using spectral clustering, UIS-RNN uses a distance-dependent Chinese restaurant process [25].

4.3 Datasets

To assess the proposed framework, we conduct several experiments on two real-world K-12 education datasets.

Online Classroom Data (“Online”): We collect 400 online class recordings from a third-party online education platform. The data is recorded by webcams via live streaming. After generating segment sequences according to the steps in Section 4.1, we label each segment as either Teacher or Student. For segments containing both teacher and student speech, we assign the dominant activity type in the segment. The average duration of the classroom recordings is 60 minutes, and the average length of the segment sequences is 700. We train our model with 350 recordings and use the remaining 50 recordings as the test set.

Offline Classroom Data (“Offline”): We also collect another 50 recordings from offline classroom environments as an additional test set. The data is obtained by indoor cameras installed on the ceilings of the classrooms.

4.4 Results & Analysis

The results show that our approach achieves the best or tied-best score on every metric on both the Online and Offline datasets. Specifically, from Table 1, we find the following: (1) when comparing acoustic-feature-only models (BiLSTM$_a$, Attn-BiLSTM$_a$) to text-feature-only models (BiLSTM$_t$, Attn-BiLSTM$_t$), we can see that models based on acoustic features in general perform worse than models based on text features. We believe this is because two similar voice segments may have different activity types, whereas the corresponding spoken terms tend to differ; (2) comparing BiLSTM$_a$ with Attn-BiLSTM$_a$, and BiLSTM$_t$ with Attn-BiLSTM$_t$, blindly incorporating an attention mechanism does not reliably improve the performance in CAD tasks, because the sequence is a mix of teacher-spoken and student-spoken segments; (3) the performance on Teacher is better than on Student in general, because the majority of segments in the classroom recordings are spoken by teachers. The percentage of student talk time is 22% on both the Online and Offline datasets.

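| Method | Online Acc | Online F1$_T$ | Online F1$_S$ | Offline Acc | Offline F1$_T$ | Offline F1$_S$ |
|---|---|---|---|---|---|---|
| BiLSTM$_a$ | 0.76 | 0.84 | 0.54 | 0.72 | 0.82 | 0.35 |
| Attn-BiLSTM$_a$ | 0.73 | 0.80 | 0.58 | 0.77 | 0.86 | 0.28 |
| BiLSTM$_t$ | 0.83 | 0.90 | 0.48 | 0.78 | 0.87 | 0.28 |
| Attn-BiLSTM$_t$ | 0.78 | 0.87 | 0.13 | 0.78 | 0.87 | 0.03 |
| BiLSTM$_{a+t}$ | 0.81 | 0.88 | 0.61 | 0.79 | 0.87 | 0.24 |
| Spectral | 0.78 | 0.86 | 0.53 | 0.73 | 0.83 | 0.37 |
| UIS-RNN | 0.81 | 0.88 | 0.58 | 0.71 | 0.81 | 0.35 |
| Ours | 0.92 | 0.95 | 0.81 | 0.80 | 0.87 | 0.48 |

Table 1: Experimental results on Online and Offline datasets. Acc denotes accuracy; F1$_T$ and F1$_S$ indicate the F1 scores for activity types Teacher and Student.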

5 Conclusion

In this paper, we presented an attention-based multimodal learning framework for CAD problems. Our approach is able to fuse data from different modalities and capture long-term semantic dependence without any additional recording device. Experimental results demonstrated that our approach outperforms other state-of-the-art CAD learning approaches in terms of accuracy and F1 score.


References

  1. Martin Nystrand, “Research on the role of classroom discourse as it affects reading comprehension,” Research in the Teaching of English, pp. 392–412, 2006.
  2. Melinda T Owens, Shannon B Seidel, Mike Wong, Travis E Bejines, Susanne Lietz, Joseph R Perez, Shangheng Sit, Zahur-Saleh Subedar, Gigi N Acker, Susan F Akana, et al., “Classroom sound can be used to classify teaching practices in college science courses,” Proceedings of the National Academy of Sciences, vol. 114, no. 12, pp. 3085–3090, 2017.
  3. Sidney K D’Mello, Andrew M Olney, Nathan Blanchard, Borhan Samei, Xiaoyi Sun, Brooke Ward, and Sean Kelly, “Multimodal capture of teacher-student interactions for automated dialogic analysis in live classrooms,” in ICMI. ACM, 2015, pp. 557–566.
  4. Zuowei Wang, Xingyu Pan, Kevin F Miller, and Kai S Cortina, “Automatic classification of activities in classroom discourse,” Computers & Education, vol. 78, pp. 115–123, 2014.
  5. Hillary Ganek and Alice Eriks-Brophy, “The language environment analysis (LENA) system: A literature review,” in Proceedings of the Joint Workshop on NLP for Computer Assisted Language Learning and NLP for Language Acquisition at SLTC, Umeå, 16th November 2016. Linköping University Electronic Press, 2016, number 130, pp. 24–32.
  6. Jiahao Chen, Hang Li, Wenxin Wang, Wenbiao Ding, Gale Yan Huang, and Zitao Liu, “A multimodal alerting system for online class quality assurance,” in International Conference on Artificial Intelligence in Education. Springer, 2019, pp. 381–385.
  7. Wenbiao Ding, Guowei Xu, Tianqiao Liu, Weiping Fu, Yujia Song, Chaoyou Guo, Cong Kong, Songfan Yang, Gale Yan Huang, and Zitao Liu, “Dolphin: A spoken language proficiency assessment system for elementary education,” in Proceedings of The Web Conference 2020, 2020.
  8. Patrick J Donnelly, Nathan Blanchard, Borhan Samei, Andrew M Olney, Xiaoyi Sun, Brooke Ward, Sean Kelly, Martin Nystrand, and Sidney K D’Mello, “Automatic teacher modeling from live classroom audio,” in Proceedings of the 2016 Conference on User Modeling Adaptation and Personalization. ACM, 2016, pp. 45–53.
  9. Patrick J Donnelly, Nathaniel Blanchard, Borhan Samei, Andrew M Olney, Xiaoyi Sun, Brooke Ward, Sean Kelly, Martin Nystrand, and Sidney K D’Mello, “Multi-sensor modeling of teacher instructional segments in live classrooms,” in ICMI. ACM, 2016, pp. 177–184.
  10. Andrew M Olney, Patrick J Donnelly, Borhan Samei, and Sidney K D’Mello, “Assessing the dialogic properties of classroom discourse: Proportion models for imbalanced classes,” International Educational Data Mining Society, 2017.
  11. Tianqiao Liu, Wenbiao Ding, Zhiwei Wang, Jiliang Tang, Gale Yan Huang, and Zitao Liu, “Automatic short answer grading via multiway attention networks,” in International Conference on Artificial Intelligence in Education. Springer, 2019, pp. 169–173.
  12. Jeff Archer, Steven Cantrell, Steven L Holtzman, Jilliam N Joe, Cynthia M Tocci, and Jess Wood, Better feedback for better teaching: A practical guide to improving classroom observations, John Wiley & Sons, 2016.
  13. Martin Nystrand, “CLASS: A Windows Laptop Computer System for the In-Class Analysis of Classroom Discourse,” [Online; accessed 10-October-2019].
  14. Jin Mu, Karsten Stegmann, Elijah Mayfield, Carolyn Rosé, and Frank Fischer, “The ACODEA framework: Developing segmentation and classification schemes for fully automatic analysis of online discussions,” International Journal of Computer-supported Collaborative Learning, vol. 7, no. 2, pp. 285–305, 2012.
  15. Robin Cosbey, Allison Wusterbarth, and Brian Hutchinson, “Deep learning for classroom activity detection from audio,” in ICASSP. IEEE, 2019, pp. 3727–3731.
  16. Armin Weinberger and Frank Fischer, “A framework to analyze argumentative knowledge construction in computer-supported collaborative learning,” Computers & Education, vol. 46, no. 1, pp. 71–95, 2006.
  17. Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno, “Generalized end-to-end loss for speaker verification,” in ICASSP. IEEE, 2018, pp. 4879–4883.
  18. FA Rezaur rahman Chowdhury, Quan Wang, Ignacio Lopez Moreno, and Li Wan, “Attention-based models for text-dependent speaker verification,” in ICASSP. IEEE, 2018, pp. 5359–5363.
  19. Georg Heigold, Ignacio Moreno, Samy Bengio, and Noam Shazeer, “End-to-end text-dependent speaker verification,” in ICASSP. IEEE, 2016, pp. 5115–5119.
  20. Quan Wang, Carlton Downey, Li Wan, Philip Andrew Mansfield, and Ignacio Lopez Moreno, “Speaker diarization with LSTM,” in ICASSP. IEEE, 2018, pp. 5239–5243.
  21. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” in NIPS, 2017, pp. 5998–6008.
  22. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
  23. Zhiheng Huang, Wei Xu, and Kai Yu, “Bidirectional LSTM-CRF models for sequence tagging,” arXiv preprint arXiv:1508.01991, 2015.
  24. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
  25. Aonan Zhang, Quan Wang, Zhenyao Zhu, John Paisley, and Chong Wang, “Fully supervised speaker diarization,” in ICASSP. IEEE, 2019, pp. 6301–6305.