Score-level Multi Cue Fusion for Sign Language Recognition



Sign Languages are expressed through hand and upper body gestures as well as facial expressions. Therefore, Sign Language Recognition (SLR) needs to focus on all such cues. Previous work uses hand-crafted mechanisms or network aggregation to extract the different cue features and increase SLR performance. This is slow and involves complicated architectures. We propose a more straightforward approach that trains separate cue models specializing in the dominant hand, hands, face, and upper body regions. We compare the performance of 3D Convolutional Neural Network (CNN) models specializing in these regions, combine them through score-level fusion, and also evaluate a weighted alternative. Our experimental results show the effectiveness of mixed convolutional models. Their fusion yields an accuracy improvement over the baseline using the full upper body. Furthermore, we include a discussion of fusion settings, which can help future work on Sign Language Translation (SLT).

Sign Language Recognition, Turkish Sign Language (TID), 3D Convolutional Neural Networks, Score-level Fusion

1 Introduction

Sign Language is the means of communication of the Deaf, and each Deaf culture has its own sign language. Sign languages differ from the spoken language of the culture. Communication between the Deaf and the hearing relies mostly on the Deaf individual learning the spoken language and using lipreading and written text to communicate: a huge and unfair burden on the Deaf. The reverse, teaching the general population at least some sign language, may be more feasible, and educational courses are available for this purpose. However, gaining expertise in sign language is difficult, and the communication problem is still unsolved. Automatic interpretation of sign languages is a necessary step not only for enabling human-computer interaction but also for facilitating communication between Deaf and hearing individuals.

Automatic Sign Language Recognition (ASLR) refers to a broad field with different tasks, such as recognizing isolated sign glosses and continuous sign sentences. The objective of an ASLR system is to infer the meaning of the sign glosses or sentences and translate it to the spoken language. Recently, progress in these efforts has accelerated: Sign Language Translation (SLT) has become an active research problem for creating interactive sign language interfaces for the Deaf [2, 1, 3, 18]. A number of recent papers on the topic made use of neural network generated features. However, while the quality and representative power of these features are essential in SLT, it is difficult to evaluate the representative potential of the elements in a pipeline setting where the overall system error is cumulative. For this reason, in this study, we aim to evaluate 3D Residual CNN based sign language embeddings in terms of explanatory power in an ASLR setting where temporal mix-up between signs and co-articulation is minimal. In the general case of Isolated SLR, the system aims to process a sign gloss and assign it to a single sign gloss label. In the limited context of a supervised learning set-up, labels are glosses, which are transcription symbols assigned by sign language experts. There may be a single signer or multiple signers in communication; however, the ASLR system should be signer independent.

To convey the meaning of a performed sign gloss, Sign Languages use multiple channels, which are manifested as visual cues. We can classify these visual cues into two categories: (1) manual cues, including hand shape and movement, and (2) non-manual cues, including facial expressions and upper body pose, which carry details without definitive large displacements.

Solving the problem of Isolated SLR requires specialized methods, which can be grouped into two categories. The first category uses handcrafted features, focusing on video trajectories and flow maps [29, 19, 27]. The second category uses machine learning algorithms and neural networks to improve classification performance [14, 25, 19]. 3D CNN models have proven successful in various video tasks [23, 24]. Li et al. [14] adopted the same architecture in SLR and reported improved performance. However, Özdemir et al. [19] compared 3D CNN models with handcrafted methods and found that 3D CNNs are inferior to the state-of-the-art handcrafted Improved Dense Trajectories (IDT) approach.

The aim of this work is to investigate why 3D CNN models may fail to show similar success in sign language recognition and to observe what modifications improve their performance. We hypothesize that the performance drop occurs because of the common practice of scaling images to a smaller size and sampling frames [23, 24], which is motivated by computational requirements and the difficulty of training bigger neural networks. One solution is to handle the negative effect of the sampling by increasing the model complexity as in [6, 30, 12], yet this increases computational requirements. Instead, we first apply attentive data selection at the pre-processing phase by determining cues in SLR data. Second, we divide the problem into multiple cues and train different expert classifiers on each kind of dense feature. Third, we refine the expert cue network knowledge into one result by applying score-level fusion.

The paper is organized as follows. Section 2 reviews related work, Section 3 explains the presented method, Section 4 presents the experimental results, Section 5 contains the analysis of the experiments, and Section 6 presents the conclusions.

2 Related Work

Sign Language Recognition (SLR) aims to infer meaning from a performed sign. In the sign classification task, an isolated sign gloss is assigned to a class label. A sign gloss, the written language counterpart of the performed sign, can be used as a mid-level or final stage label for sign language recognition.

SLR is closely connected with video recognition and human action recognition methods, and similar architectures have been used for both. Two popular approaches to sign language representation use handcrafted features and deep neural network based methods.

Prior to the performance leap achieved by neural networks, hand-crafted features were the best performing approach for representing human actions in a sequential video setting. For a two-frame dynamic flow map estimation, optical flow is used to generate feature-level information. These features represent the video better than RGB image sequences in cases where the motion information is more indicative than appearance [5]. There exist numerous handcrafted feature extraction methods applied to image sequences, such as STIP [15] and spatio-temporal local binary patterns [28]. State-of-the-art performance with constructed features in action recognition and isolated sign language recognition was obtained using Improved Dense Trajectories [27, 19], which is an outlier independent, trajectory-based, motion specialized feature extractor.

Neural Network based methods focus on convolutional architectures for the classification task. Simonyan et al. [21] use a branched CNN architecture that splits the information into spatial and temporal streams, and fuses them to perform video classification. Tran et al. [23] use 3D convolutional kernels to build a 3D CNN variant to process video data in an end-to-end fashion.

One prerequisite for using deep neural networks is the presence of large datasets with ground truth annotations. Recently, large-scale isolated sign language recognition datasets have become publicly available. Isolated SL datasets contain videos of a user performing a single gloss, usually a single word or a phrase. MS-ASL [25] is an American Isolated SL dataset including 200 native performers performing more than a thousand word categories. WL-ASL [14] is a bigger dataset with two thousand word categories performed by one hundred people. For other languages, Chinese [29] and Turkish [19] datasets are among those available. Popular human activity recognition datasets [22, 13, 10, 11] are also used as extra data and for finetuning in Isolated SLR. Continuous SL datasets are acquired in a less controlled setting, where a user can perform longer sign sequences [8, 7].

SLR methods often use video pre-processing to reduce network bias and variance, and to increase network performance. Random cropping is one of the popular spatial augmentation techniques when training CNNs. Since CNN variants such as the popular ResNet50 network [9] have a small input spatial resolution, such methods increase the translation invariance of the models by processing different parts of the image in higher resolution compared to directly downsampling the whole image frame.

Temporal pre-processing techniques operate on the temporal dimension of the video data. The aim is to locate the dense temporal regions which have an increased likelihood of the action flow. In recent work, different approaches are applied for the temporal activity localization, e.g., exploiting both short term and long term samples [26], combining high and low-frequency learners [6], and detecting active window boundaries for the long sequences [17]. Our work differs by applying cue selection before the training phase and combining the classifiers in the feature construction stage.

Combining both pre-processing techniques offers an opportunity to exploit covariance between these spatio-temporal features. Spatio-temporal pre-processing can improve the signal-to-noise ratio of the processed data when the region of interest is selected from dense regions. This process has been shown to be beneficial on other video recognition tasks, e.g., when extracted through handcrafted methods such as optical flow [21], or directly through 3D CNNs [24]. In SLR, due to the nature of the task, SL videos consist of sparse hand and upper body movements as well as facial expressions. It is possible to use this domain-specific knowledge to exploit spatio-temporal sampling using a guided pre-processing technique. Spatio-temporal multi cue networks [30] exploit spatial regions of interest by first using a branch to estimate the region of interest, then training different networks for each unit. However, applying sampling at the training phase becomes more computationally expensive and requires deeper architectures. Our score-level multi cue fusion approach addresses this problem as described in the next section.

3 Method

In this section, we describe our method. We firstly describe the mixed convolutional model, follow up with our multi cue sampling process, and finally discuss the score-level fusion method.

3.1 3D Resnets with Mixed Convolutions

Mixed convolutional networks are 3D Residual CNNs [23], which use 3D convolutional kernels to process video frames in an end-to-end fashion. Tran et al. [24] investigate the success of 3D CNNs and share two effective variants with strong empirical results. The first is mixed convolutional networks, and the second is residual bottleneck based 2+1 convolutional networks.

The mixed CNN variant builds on the plain 2D residual networks, with the difference that the first layers are replaced with 3D convolutional kernels. While the first layers are capable of processing input video directly with 3D convolutional kernels, later layers efficiently model the semantic knowledge using 2D convolutional kernels. Then, a fully connected layer is employed after the final convolutional layer for the video classification task.

Mixed convolutional networks are denoted MCx, where x is the number of 3D convolutional layer blocks. Following the baseline, we empirically experiment with different mixed convolutional variants and employ the MC3 variant of the mixed convolutional network.

3.2 Spatial and Temporal Sampling

The message in a sign gloss is conveyed through manual and non-manual cues. Information is conveyed through the shape and configuration of the hand, body, and face regions. The informative regions and intervals can be sampled with the help of a state-of-the-art pose estimation approach such as OpenPose [4]. Making use of pose estimation allows researchers to filter the entire frame by cropping specific regions according to keypoints, which are hand, face, and upper body keypoints in the case of SLR.

We would like to sample informative body regions to increase efficiency and to filter out noise. Our approach is two-fold: (1) we design an SLR system that extracts the body, hand, and face regions by cropping the RGB frame spatially (Figure 1) using the pose data provided by Özdemir et al. [19]; (2) we focus on the temporally dense regions, defining the active window as the temporal window in which the active hand is moving. Then, we filter out the sparse frames and only feed the network with the frames in the active window (Figure 2).

Using isolated sign gloss clips guarantees that the temporal sequence is centered on the hand movement. The following steps are used to extract the active window at the center.

  1. Use the moving hand detection framework in 4.2.1 to detect the active hand(s).

  2. Define a selected hand as the active hand. If both hands are active, select the dominant hand.

  3. For the selected hand, track hand movements using Euclidean distance. Keep the frame ids of the start of the first hand movement and the end of the last hand movement.

  4. Define two thresholds, one for the start boundary and one for the end boundary. Filter the boundary regions from the start and end frame ids using the corresponding thresholds, and use the extracted frames for training.

In some videos, the movement is not in the middle of the video. We detect such exception cases by checking the position of the hand relative to the hip. We also filter out segments too short to be a sign.
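The steps above can be sketched as follows. This is a minimal illustration and not the exact procedure used in our experiments: the function name `active_window`, the reading of the two thresholds as displacements from the initial and final resting positions of the hand, the threshold values, and the fallback to the whole clip are all assumptions made for the example.

```python
import math


def active_window(track, start_thr, end_thr):
    """Return (first, last) frame indices bounding the hand movement.

    track: list of (x, y) positions of the selected hand, one per frame.
    start_thr, end_thr: displacement thresholds in pixels (hypothetical).
    """
    rest_start, rest_end = track[0], track[-1]
    # First frame in which the hand has left its initial resting position.
    first = next(
        (i for i, p in enumerate(track) if math.dist(p, rest_start) > start_thr),
        None,
    )
    # Last frame in which the hand is still away from its final resting position.
    last = next(
        (i for i in range(len(track) - 1, -1, -1)
         if math.dist(track[i], rest_end) > end_thr),
        None,
    )
    if first is None or last is None or last < first:
        return 0, len(track) - 1  # no clear movement: keep the whole clip
    return first, last


# Toy trajectory: the hand rests, rises, and returns to rest.
toy_track = [(0, 0)] * 3 + [(0, 100), (0, 200), (0, 100)] + [(0, 0)] * 3
toy_window = active_window(toy_track, start_thr=50, end_thr=50)
```

For the toy trajectory, only the three frames in which the hand is raised fall inside the window. The exception handling described above (hand position relative to the hip, minimum segment length) is not covered by this sketch.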

Figure 1: Spatial Sampling operation is visualized. From left to right; cue regions selected for the process, and hand crop settings
Figure 2: Different temporal sampling operations are shown in the above figure. Selected frames are shown with color. Two branches represent uniform sampling and the Active Window Based Sampling Process

3.3 Multi Cue Score Fusion

Extracting multiple cues from different settings allows each model to build expertise on its own cue. Therefore, the knowledge of the individual models needs to be combined, which amounts to combining weak expert classifiers. Zhou et al. [30] experiment with distillation at training time, by training a large-scale model consisting of expert components. This has the drawback of increasing model complexity and training time. Simonyan et al. [21] combine different branches while training, but process the spatial and temporal branches separately at test time using a score fusion approach. They propose, first, direct score fusion by averaging the network outputs and, second, training a meta classifier on top of the extracted features. We follow the former score fusion approach since it has less model complexity and can achieve better run-time performance.

We experiment with two different multi cue fusion settings. First, we apply the averaging operation to the softmax outputs of the cue networks. Second, we apply a weighted fusion, where each cue network is weighted by its validation set performance.
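Both fusion settings can be illustrated with a short sketch. The scores and validation accuracies below are made-up toy numbers chosen for illustration, not results from our experiments, and the helper names are hypothetical. Normalizing the weights so that they sum to one is an assumption of this sketch; it does not change the predicted class.

```python
def average_fusion(scores):
    """Average the per-class softmax scores of several cue models."""
    n_models = len(scores)
    return [sum(col) / n_models for col in zip(*scores)]


def weighted_fusion(scores, val_acc):
    """Weight each cue model's scores by its validation accuracy."""
    total = sum(val_acc)
    return [
        sum(w * s for w, s in zip(val_acc, col)) / total
        for col in zip(*scores)
    ]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Toy softmax outputs for three classes from three cue models.
hand = [0.6, 0.3, 0.1]
body = [0.3, 0.5, 0.2]
face = [0.1, 0.6, 0.3]
toy_val_acc = [0.9, 0.5, 0.1]  # hypothetical validation accuracies

avg_pred = argmax(average_fusion([hand, body, face]))
weighted_pred = argmax(weighted_fusion([hand, body, face], toy_val_acc))
```

In this toy case plain averaging selects class 1, whereas the weighted variant follows the more reliable hand model and selects class 0.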

Figure 3: Score-level multi cue fusion operation applied at test time. Note that the cue networks have different test weights even though the architecture is the same

4 Experiments

4.1 Experimental Setup


To achieve a competitive experimental setting, and to implement our proposal effectively, we have used a recently published Turkish Isolated SLR dataset, BosphorusSign22k [19]. The dataset contains 6 native signers performing 744 different sign glosses. Each category is labeled with a sign gloss that describes the performed sign. As its name suggests, the dataset contains roughly 22k video clips. The authors also share 3D body pose keypoints in Kinectv2 format, and 2D body and hand keypoints obtained from OpenPose [4].

Evaluation Metric

Following the work of Özdemir et al. [19], we evaluate on the sign language classification task, which is defined as estimating the corresponding sign gloss for a given input video at test time. Performance is measured as the accuracy over all test predictions. Out of all 6 performers, the video clips of User 4 are defined as the test set, which is about 1/6 of the total dataset and includes samples from all of the 744 classes.

Implementation Details

Our experimental setting follows the neural network based experimental setting of the baseline paper [19]. We apply the proposed preprocessing pipeline, resize the image, crop the center square region, and then resize via bilinear interpolation to the network input resolution. Then, we adopt the PyTorch implementation [20] of the mixed convolutional MC3 CNN model, which was pretrained on the Kinetics dataset [11]. In our experiments, we only fine-tune the last residual blocks and apply uniform frame sampling to the input video frames. All experiments have been performed with a batch size of 32 on an Nvidia 1080TI GPU (with 11GB memory).

Our replicated network resulted in a lower accuracy than the one reported in Özdemir et al. [19]. We suspect that the difference is caused by randomized states such as optimizer initialization and by different hyperparameter choices such as the learning rate.

4.2 Experimental Results

Spatial sampling.

Spatial sampling operation is applied through two phases. First, the cue region is detected, cropped, and optionally concatenated in a multiple cue setting. Secondly, sampling is applied using bilinear interpolation.

Body Setting. Following the standard SLR pipeline, we crop the human body region before training.

Hand Setting. SLR work suggests that the dominant hand, the most used hand, conveys the most information in communication. To detect the dominant hand in the BosphorusSign22k dataset, we employ a hand motion tracking algorithm. The detection process is achieved by the following:

  1. Detect the Thumb keypoints on each frame,

  2. Define the first thumb keypoint on each hand as two anchors,

  3. If a thumb keypoint in a later frame is farther from its anchor than the threshold, conclude that the hand is moving.

We use Euclidean distance to compare keypoints against the threshold, which is predefined as 150 pixels. Table 1 provides the detection results for moving hands on the BosphorusSign22k dataset. After the detection process, we have seen that signers in the dataset predominantly use their left hands when performing a sign.
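The detection steps above can be sketched as follows. The function names and the toy keypoint tracks are hypothetical; only the thumb anchor, the Euclidean distance, and the 150-pixel threshold come from the description. Handling of missing or low-confidence keypoints is omitted.

```python
import math


def is_moving(thumb_track, threshold=150.0):
    """True if the thumb keypoint ever moves farther than `threshold`
    pixels away from its position in the first frame (the anchor)."""
    anchor = thumb_track[0]
    return any(math.dist(anchor, p) > threshold for p in thumb_track[1:])


def active_hands(left_track, right_track, threshold=150.0):
    """Classify a clip as 'both', 'left', 'right', or 'none'."""
    left = is_moving(left_track, threshold)
    right = is_moving(right_track, threshold)
    if left and right:
        return "both"
    if left:
        return "left"
    if right:
        return "right"
    return "none"


# Toy thumb tracks (x, y) in pixels: the left hand rises, the right hand rests.
toy_left = [(100, 400), (120, 300), (150, 200)]
toy_right = [(500, 400), (505, 398), (503, 401)]
toy_result = active_hands(toy_left, toy_right)
```

Here the left thumb ends about 206 pixels from its anchor, so only the left hand is marked as active.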

| Distribution | Relative Frequency (%) |
|---|---|
| Both Active | 66.44 |
| Only Left Active | 33.07 |
| Only Right Active | 0.40 |

| Crop Setting | Accuracy (%) |
|---|---|
| Single Hand | 79.13 |
| Both Hands | 85.81 |
| Mixed | 86.25 |

Table 1: Hand spatial sampling settings. The first table shows the hand activity distribution in the BosphorusSign22k dataset. The second table shows the test accuracy of the different hand crop settings

During signing, only one hand may be active, or both hands may be active. We have adopted three different policies: (1) The single cue setting selects the dominant hand, and hand crops are obtained around keypoint #2. (2) The both cue setting selects both hands; hand crops are obtained around the Thumb keypoint and concatenated horizontally. (3) The mixed setting uses the single cue setting when a single hand is active and the both cue setting when both hands are active. All three settings are followed by downsampling with bilinear interpolation. Experimental results are provided in the second part of Table 1.
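The three policies can be sketched as a selection of crop boxes. The function names, the crop size of 100 pixels, and the choice of the active hand in the mixed single-hand case are assumptions made for illustration; clamping boxes to the frame border and the actual pixel cropping, concatenation, and downsampling are left out.

```python
def crop_box(center, size):
    """Square box (left, top, right, bottom) of side `size` around a keypoint."""
    cx, cy = center
    half = size // 2
    return (cx - half, cy - half, cx - half + size, cy - half + size)


def select_hand_crops(left_kp, right_kp, active, policy, dominant="left", size=100):
    """Return the crop boxes to use for one frame.

    active: 'left', 'right', or 'both', as found by the moving hand detection.
    policy: 'single', 'both', or 'mixed'.
    """
    boxes = {"left": crop_box(left_kp, size), "right": crop_box(right_kp, size)}
    both = [boxes["left"], boxes["right"]]  # to be concatenated horizontally
    if policy == "single":
        return [boxes[dominant]]
    if policy == "both":
        return both
    if policy == "mixed":
        if active == "both":
            return both
        return [boxes[active]] if active in boxes else [boxes[dominant]]
    raise ValueError(f"unknown policy: {policy}")


toy_single = select_hand_crops((200, 300), (400, 300), "left", "mixed")
toy_both = select_hand_crops((200, 300), (400, 300), "both", "mixed")
```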

Face Setting. Signers often provide cues through facial expressions or lip movements (mouthings) that can give hints about the sign gloss. For this purpose, we have also experimented with a face setting where we crop the entire face from the frames. To crop the face, we have used the Nose keypoint, which is provided with the OpenPose [4] keypoints. After cropping the face, we resize the crops to the target resolution.

Score-Level Fusion. We follow the insight that different cue models can capture different subsets of features, which can lead to better results when combined effectively. Standard fusion is applied by averaging softmax outputs as in [21]. In the weighted setting, we weight the scores of each model in proportion to its validation accuracy via standard multiplication. Table 2 provides the results of the fusion.

| Setting | Spatial Acc@1 | Spatial Acc@5 | Temporal Acc@1 | Temporal Acc@5 | S&T Combined Acc@1 | S&T Combined Acc@5 |
|---|---|---|---|---|---|---|
| Body | 75.73 | 93.88 | 81.83 | 96.02 | 86.91 | 98.17 |
| Hand | 86.25 | 97.61 | 88.70 | 97.59 | 91.73 | 98.72 |
| Face | 24.27 | 44.45 | 37.00 | 57.89 | 39.12 | 59.33 |
| Fusion | 90.63 | 98.92 | 93.88 | 99.65 | 94.47 | 99.78 |
| Weighted Fusion | 92.18 | 99.27 | 94.03 | 99.56 | 94.94 | 99.76 |

Table 2: Classification accuracy (%) of the sampling and fusion settings. Three different settings are provided in the table. From left to right: (1) single cue spatial sampling results, (2) Active Window Based Temporal Sampling applied to each crop, and (3) Spatial & Temporal settings combined in one setting. Note that the bottom two rows contain the fusion results of the three models above in each setting.

Temporal Sampling

The standard SLR training pipeline uses uniform frame sampling. We propose active window based temporal sampling, which first extracts the dense cue regions and then applies uniform selection. The active window is detected as the part in which the active hand is moving, and the rest of the temporal information is discarded.

We used double thresholding to find the active window, with separate start and end thresholds whose values were chosen empirically. Using the temporal sampling framework, we segmented the active window for each video. Then, we applied uniform sampling along with our standard training pipeline. Experimental results can be seen in Table 2.
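The difference between the two sampling schemes can be illustrated with frame indices alone. The sketch below assumes a hypothetical helper `uniform_indices` and toy clip lengths; the number of sampled frames is an example value, not the one used in our experiments.

```python
def uniform_indices(start, end, num_frames):
    """Pick `num_frames` indices evenly spaced over the inclusive range
    [start, end]. Short ranges yield repeated indices."""
    if num_frames == 1:
        return [(start + end) // 2]
    step = (end - start) / (num_frames - 1)
    return [start + round(i * step) for i in range(num_frames)]


# Toy clip of 100 frames whose active window spans frames 30 to 60.
whole_clip = uniform_indices(0, 99, 4)      # standard uniform sampling
active_only = uniform_indices(30, 60, 4)    # sampling inside the active window
```

Sampling inside the active window spends all selected frames on the part of the clip where the hand is moving, instead of spreading them over idle frames at the beginning and end.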

Spatio-Temporal Sampling

We applied active window based temporal sampling on top of the spatial multi cue regions. Our experiments have shown that the final spatio-temporal sampling framework improves on both single cue settings. With the addition of score-level fusion, test accuracy reached 94.94%, which is the best result among all proposed settings, as seen in Table 2.

Our best setting improves on our baseline neural network setting [19]. It also improves on their previous best hand-crafted result, reaching an accuracy of 94.94%. Whereas the previous best method uses an input spatial resolution more than ten times bigger, complicated hand-crafted methods [27], and a second stage SVM classifier, our approach only contains a 3D CNN and a sampling pipeline. A comparison with the baseline results is shown in Table 3.

| Method | Acc@1 | Acc@5 |
|---|---|---|
| Baseline IDT [19] | 88.53 | - |
| Baseline MC3_18 [19] | 78.85 | 94.76 |
| Weighted Fusion - S&T Combined | 94.94 | 99.76 |

Table 3: Comparison with the baseline approaches, the IDT and MC3_18 models.

5 Discussion and Analysis

Top-1 accuracy alone is not sufficiently informative when considering whether fusion will be beneficial. Top-N accuracy measures how often the top N ranked predictions contain the correct class. In our experiments, we also analyze Top-5 accuracy along with Top-1 accuracy. Top-N accuracy increases with increasing N and is expected to settle at 1 as N approaches the number of classes. Our Top-N accuracy analysis can be seen in Figure 4.
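As a brief illustration of the metric, the sketch below computes Top-N accuracy from per-class scores. The function name and the toy scores are hypothetical.

```python
def top_n_accuracy(scores, labels, n):
    """Fraction of samples whose true label is among the n highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:n]:
            hits += 1
    return hits / len(labels)


# Toy scores for three samples over three classes.
toy_scores = [
    [0.60, 0.30, 0.10],  # true class 0: correct at rank 1
    [0.20, 0.50, 0.30],  # true class 2: runner-up
    [0.40, 0.35, 0.25],  # true class 1: runner-up
]
toy_labels = [0, 2, 1]
top1 = top_n_accuracy(toy_scores, toy_labels, 1)
top2 = top_n_accuracy(toy_scores, toy_labels, 2)
```

In this toy case two of the three samples are missed at rank 1 but recovered at rank 2, which is the kind of jump between ranks 1 and 2 that makes fusion with another cue promising.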

In the plot on the left-hand side, we report Top-N accuracy of the individual cues. Our analysis shows that hand cue yields the best performance, which is followed by the body cue. In both, there is a sharp increase between ranks 1 and 2. This shows that in a large number of cases, although the correct class fails to be predicted, it is the runner-up. This explains why the fusion is beneficial. Although the Top-N accuracy of the face cue is much lower, it is still beneficial for fusion.

Figure 4: Comparison of the models in the Top-N accuracy setting. The horizontal axis denotes the increasing N value, and the vertical axis denotes the accuracy value. First plot shows the single cue setting comparison, and second plot shows the multi cue setting additive comparison. Despite the difference in single setting performance, each cue boosts the fusion results.

Top-N accuracy of the multi cue fusion is given on the right-hand side of Figure 4. We start with the hand model, then include the body model, and finally add the face model to the mix. This analysis allows us to see the cumulative progress over different fusion models. We observe that the Top-2 accuracy of hand alone is higher than the Top-1 accuracy of both fusion settings. We believe that this observation is why the weighted fusion outperforms score fusion, and it shows that more advanced models can attain higher performance.

5.1 Spatial Ablation Study

To analyze which cue benefits the fusion results the most, we performed score fusion on all pairs of cue settings. With this ablation study, we were able to observe the effect of each cue on the overall fusion. For example, to find the effect of the face model, we subtract the accuracy of the Body+Hand setting from that of the Body+Hand+Face setting. Table 4 shows the results of the ablation study. In our analysis, we can see that the hand cue has the largest effect on the fusion with 9.22%, followed by the body cue with 5.18%.

| Setting | Accuracy | Excluded Cue | Effect (%) |
|---|---|---|---|
| Body + Hand | 91.80 | Face | 2.08 |
| Body + Face | 84.66 | Hand | 9.22 |
| Hands + Face | 88.70 | Body | 5.18 |

Table 4: Effects of excluding individual cue units from the final fusion model. Using the different two cue settings and their performance, we infer the effect of the excluded cue on the final mix.
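The effects in Table 4 follow from a simple subtraction, reproduced below. The full three cue accuracy of 93.88 is not listed in Table 4; it is taken here from the Temporal Fusion entry of Table 2, which is the value consistent with all three rows. The variable names are illustrative.

```python
# Accuracy of the full Body + Hand + Face fusion (Temporal setting, Table 2).
full_fusion = 93.88

# Accuracy of each two cue fusion, keyed by the cue that was left out (Table 4).
pair_accuracy = {"Face": 91.80, "Hand": 84.66, "Body": 88.70}

# Effect of a cue = full fusion accuracy minus accuracy without that cue.
effects = {cue: round(full_fusion - acc, 2) for cue, acc in pair_accuracy.items()}
most_important = max(effects, key=effects.get)
```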

We also analyze the two most effective cues by comparing their gloss based performance. As a comparison metric, we adopted the F1-score, which accounts for both false positives and false negatives and is thus more suitable for gloss based evaluation.
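For reference, a per-gloss F1-score can be computed as in the following sketch. The function name and the toy predictions are hypothetical and unrelated to the results reported below.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1-score of a single class: harmonic mean of precision and recall."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no correct prediction for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy ground truth and predictions over three glosses.
toy_true = ["HEAD", "HEAD", "EAT", "EAT", "PHONE"]
toy_pred = ["HEAD", "EAT", "EAT", "EAT", "HEAD"]
toy_f1 = {g: f1_for_class(toy_true, toy_pred, g) for g in ("HEAD", "EAT", "PHONE")}
```

A gloss that is never predicted correctly, like PHONE in the toy data, receives an F1-score of 0.00, which is how the zero entries in a gloss based comparison arise.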

Gloss Based Cue Comparison. In Table 5, we share the top ten sign glosses for which the hand cue model has a major advantage over the body cue model.

| Sign Gloss | Hand | Body | Fusion | Sign Gloss | Hand | Body | Fusion |
|---|---|---|---|---|---|---|---|
| Aspirin | 0.62 | 0.00 | 0.67 | Internet_2 | 1.00 | 0.33 | 1.00 |
| Deposit(v)_2 | 0.89 | 0.25 | 1.00 | Noon | 0.91 | 0.33 | 1.00 |
| Exchange(v) | 0.57 | 0.00 | 1.00 | Shout(v)_2 | 0.91 | 0.33 | 0.91 |
| Head | 0.89 | 0.29 | 1.00 | Sleep(v) | 0.62 | 0.00 | 0.67 |
| Identify(v) | 0.89 | 0.00 | 1.00 | Turn(v) | 1.00 | 0.40 | 0.89 |

Table 5: F1-score comparison for the top ten sign glosses on which hand sampling outperforms body sampling (sorted in alphabetical order)
Figure 5: Class confusions of IDENTIFY(v) sign gloss for the body cue model

In Figure 5, we provide a detailed analysis of the IDENTIFY(v) sign gloss. The IDENTIFY(v) sign gloss is performed using only the left hand, touching the head with the index finger while the rest of the fingers are in a semi-open position. In this particular example, the body cue model only succeeds at its 5th guess, while the hand cue model makes the correct prediction. Figure 5 also shows the misclassifications of the body cue model, which are the HEAD, EAT, PSYCHOLOGY, and PHONE sign glosses. We inspect each confusion as follows:

  • The HEAD sign differs from IDENTIFY(v) in that all fingers other than the index finger are in a closed position.

  • The EAT sign is performed by moving the left hand close to the mouth with all fingers in a closed position.

  • The PSYCHOLOGY and PHONE sign glosses are performed with the left hand and have open and semi-closed hand shapes, respectively.

By evaluating the confused cases, we have concluded that the hand model has an advantage in capturing hand shape information, possibly due to the increased spatial resolution of the hand region.

Effect of the Score-Level Fusion. We share the fusion results of the body and hand cue models in Table 5. The data show that the fusion model successfully captures the hand cue features. We have also seen that the fusion model outperforms both single cue models in 7 out of the 10 glosses.

5.2 Analysis of Method on Types of Gestures Recognized

To further analyze the types of signs on which the proposed method performs well and on which it fails, we have labeled the sign glosses in the dataset according to specific sign attributes. The sign classes are grouped into categories such as one-handed signs, two-handed signs, mono-morphemic signs, compound signs, and signs involving repetitive and circular movements of the hands.

Table 6 summarizes the analysis: The experiments are performed using temporal sampling with the best performing mixed convolution approach. Attribute-wise accuracy scores are calculated using the test set samples belonging to the classes containing the selected attributes. Overall, the accuracy scores in Table 6 demonstrate that for nearly all the subsets in the dataset, hand, body, and face-based features show consistency in their relative performance.

| | One Handed | Two Handed | Circ. | Not Circ. | Rep. | Not Rep. | Mono | Comp. | All |
|---|---|---|---|---|---|---|---|---|---|
| Number of Classes with Selected Attribute | 234 | 510 | 75 | 669 | 457 | 287 | 375 | 369 | 744 |
| Body | 72.94 | 86.10 | 86.43 | 81.33 | 80.77 | 83.55 | 77.40 | 86.48 | 81.83 |
| Hand | 83.78 | 91.07 | 91.40 | 88.41 | 87.54 | 90.59 | 85.36 | 92.24 | 88.70 |
| Face | 45.33 | 33.01 | 29.41 | 37.82 | 36.39 | 37.99 | 33.79 | 40.52 | 37.00 |
| Fusion | 91.82 | 94.86 | 96.15 | 93.63 | 93.45 | 94.57 | 92.08 | 95.78 | 93.88 |
| W.Fusion | 90.87 | 95.55 | 95.70 | 93.85 | 93.37 | 95.09 | 92.21 | 95.96 | 94.03 |

Table 6: Analysis of the temporal sampling based recognition approach with respect to signs with certain grammatical sign attributes: one-handed signs, two-handed signs, signs with and without circular movements, signs with and without repetitive movements, mono-morphemic signs, and compound signs. Accuracy values are in %
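The attribute-wise scores are ordinary accuracies restricted to the test samples whose class carries the selected attribute. A minimal sketch, with a hypothetical function name and toy labels, is:

```python
def attribute_accuracy(y_true, y_pred, classes_with_attribute):
    """Accuracy over the samples whose true class has the selected attribute."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t in classes_with_attribute]
    if not pairs:
        return None  # no test sample carries this attribute
    return sum(t == p for t, p in pairs) / len(pairs)


# Toy example: classes 0 and 1 are one-handed, classes 2 and 3 are two-handed.
toy_true = [0, 0, 1, 2, 2, 3]
toy_pred = [0, 1, 1, 2, 2, 3]
one_handed_acc = attribute_accuracy(toy_true, toy_pred, {0, 1})
two_handed_acc = attribute_accuracy(toy_true, toy_pred, {2, 3})
```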

Looking at the results for different attributes one by one, we can see that signs involving two moving hands are better recognized than the one-handed sign glosses in the dataset. The performance difference can be explained by the fact that in one-handed signs, the weight of handshape may be more critical than the two-handed signs. The relative positioning and appearance of both hands, which is more apparent, may be easier to represent for the neural network.

Secondly, compound signs have a greater recognition accuracy than mono-morphemic signs (see the Comp. and Mono columns of Table 6). Considering the number of signing hands, the amount of additional information in the form of consecutive morphemes present in an isolated sign makes recognition easier, thus improving system performance. From this result, we can infer that the representation power of the method is higher when a sign is longer and contains different hand shape and position combinations.

Looking at repetitive gestures, we see an improvement in accuracy when the signs do not contain repetitive hand gestures. The issue with repetitions, to which we attribute this difference, is that the temporal and spatial forms of repetitions are more prone to differ between performances and users than the static hand shape parts of the signs, which follow specific rules.

Finally, we take a look at circular signs, which include circular hand and arm movements involving at least one entire rotation. These signs are dynamic signs where the hands do not stop while presenting a handshape. As these signs do not conform to the movement-hold phonological model of sign languages [16], representing them by choosing temporal frames is more complicated, which reduces the effectiveness of keyframe based approaches [12]. Overall, the method performs well with circular signs, which makes fusion with methods focusing more on the handshape of signs a promising future lead.

6 Conclusion

In this paper, we have proposed a score-level multi cue fusion approach for the Isolated SLR task. Unlike previous work [19, 12], we focused on both spatial and temporal cues. We employed 3D Residual CNNs [24] and trained a separate expert model for each cue. We distilled the expert knowledge using weighted and unweighted score-level fusion. In our experiments, we have seen that our approach outperforms the baseline results on the BosphorusSign22k Turkish Isolated SL dataset [19].

We have provided the single cue and multi cue Top-N accuracies to demonstrate the incremental performance gain with each cue. Our gloss-level study shows that each cue model has specific expertise and provides an indispensable knowledge source to the fusion model. Our analysis of sign gloss attributes suggests that the method performs better on temporally more complex signs with two-handed gestures, while performing comparatively worse on mono-morphemic gestures with a single hand. For that reason, the primary route to better performance lies in improving handshape recognition. Possible strategies involve increasing model depth, finding better optimization techniques, or increasing the model input size. We hope that this work will extend the SLR cues into other Sign Language problems, help progress in unresolved SL tasks such as translation, and help uncover language-independent cues.
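For readers unfamiliar with the Top-N accuracy metric mentioned above, a minimal sketch follows: a sample counts as correct when its true class is among the N highest-scoring classes. The function name and the toy scores are hypothetical and are not taken from the experiments.

```python
def top_n_accuracy(score_rows, labels, n):
    """Fraction of samples whose true label is among the n top-scoring classes."""
    if not labels or len(score_rows) != len(labels):
        raise ValueError("need one non-empty score row per label")
    hits = 0
    for scores, label in zip(score_rows, labels):
        # Rank class indices from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        hits += int(label in ranked[:n])
    return hits / len(labels)


# Toy fused scores for three samples over three classes (illustrative only).
score_rows = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.4, 0.5, 0.1],
]
labels = [0, 1, 0]
top1 = top_n_accuracy(score_rows, labels, 1)
top2 = top_n_accuracy(score_rows, labels, 2)
print(top1, top2)
```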


This work has been supported by the TUBITAK Project No. 117E059 and TAM Project No. 2007K120610 under the Turkish Ministry of Development.


  1. N. C. Camgoz, S. Hadfield, O. Koller, R. Bowden and H. Ney (2018) Neural Sign Language Translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  2. N. C. Camgoz, O. Koller, S. Hadfield and R. Bowden (2020) Multi-channel transformers for multi-articulatory sign language translation. arXiv preprint arXiv:2009.00299. Cited by: §1.
  3. N. C. Camgoz, O. Koller, S. Hadfield and R. Bowden (2020) Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  4. Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei and Y. A. Sheikh (2019) OpenPose: realtime multi-person 2d pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1. Cited by: §3.2, §4.1.1, §4.2.1.
  5. J. Carreira and A. Zisserman (2017-07) Quo vadis, action recognition? a new model and the kinetics dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4724–4733. Cited by: §2.
  6. C. Feichtenhofer, H. Fan, J. Malik and K. He (2019) SlowFast networks for video recognition. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 6201–6210. Cited by: §1, §2.
  7. J. Forster, C. Schmidt, O. Koller, M. Bellgardt and H. Ney (2014) Extensions of the Sign Language Recognition and Translation Corpus RWTH-PHOENIX-Weather. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Cited by: §2.
  8. T. Hanke, L. König, S. Wagner and S. Matthes (2010) DGS Corpus & Dicta-Sign: The Hamburg Studio Setup. In Proceedings of the Representation and Processing of Sign Languages: Corpora and Sign Language Technologies, Cited by: §2.
  9. K. He, X. Zhang, S. Ren and J. Sun (2016-06) Deep residual learning for image recognition. In CVPR, 2016, pp. 770–778. Cited by: §2.
  10. A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar and L. Fei-Fei (2014-06) Large-scale video classification with convolutional neural networks. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725–1732. Cited by: §2.
  11. W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back and P. Natsev (2017) The kinetics human action video dataset. arXiv:1705.06950. Cited by: §2, §4.1.3.
  12. A. A. Kındıroğlu, O. Özdemir and L. Akarun (2019) Temporal accumulative features for sign language recognition. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pp. 1288–1297. Cited by: §1, §5.2, §6.
  13. H. Kuehne, H. Jhuang, E. Garrote, T. Poggio and T. Serre (2011-11) HMDB: a large video database for human motion recognition. In 2011 International Conference on Computer Vision, pp. 2556–2563. Cited by: §2.
  14. D. Li, C. R. Opazo, X. Yu and H. Li (2020) Word-level deep sign language recognition from video: a new large-scale dataset and methods comparison. In 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1448–1458. Cited by: §1, §2.
  15. Y. Li, R. Xia, Q. Huang, W. Xie and X. Li (2017) Survey of spatio-temporal interest point detection algorithms in video. IEEE Access 5, pp. 10323–10331. Cited by: §2.
  16. S. K. Liddell and R. E. Johnson (1989) American sign language: the phonological base. Sign language studies 64 (1), pp. 195–277. Cited by: §5.2.
  17. T. Lin, X. Liu, X. Li, E. Ding and S. Wen (2019) BMN: boundary-matching network for temporal action proposal generation. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3888–3897. Cited by: §2.
  18. A. Orbay and L. Akarun (2020) Neural sign language translation by learning tokenization. arXiv preprint arXiv:2002.00479. Cited by: §1.
  19. O. Özdemir, A. A. Kındıroğlu, N. C. Camgöz and L. Akarun (2020) BosphorusSign22k sign language recognition dataset. arXiv preprint arXiv:2004.01283. Cited by: §1, §2, §2, §3.2, §4.1.1, §4.1.2, §4.1.3, §4.1.3, §4.2.3, Table 3, §6.
  20. A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga and A. Lerer (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, Cited by: §4.1.3.
  21. K. Simonyan and A. Zisserman (2014) Two-stream convolutional networks for action recognition in videos. In NIPS, Vol. 1, pp. 568–576. Cited by: §2, §2, §3.3, §4.2.1.
  22. K. Soomro, A. R. Zamir and M. Shah (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402. Cited by: §2.
  23. D. Tran, L. Bourdev, R. Fergus, L. Torresani and M. Paluri (2015-12) Learning spatiotemporal features with 3d convolutional networks. In ICCV, 2015, pp. 4489–4497. Cited by: §1, §1, §2, §3.1.
  24. D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun and M. Paluri (2018) A closer look at spatiotemporal convolutions for action recognition. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6450–6459. Cited by: §1, §1, §2, §3.1, §6.
  25. H. Vaezi Joze and O. Koller (2019-09) MS-asl: a large-scale data set and benchmark for understanding american sign language. In The British Machine Vision Conference (BMVC), Cited by: §1, §2.
  26. G. Varol, I. Laptev and C. Schmid (2017) Long-term temporal convolutions for action recognition. IEEE transactions on pattern analysis and machine intelligence 40 (6), pp. 1510–1517. Cited by: §2.
  27. H. Wang and C. Schmid (2013) Action recognition with improved trajectories. In 2013 IEEE International Conference on Computer Vision, pp. 3551–3558. Cited by: §1, §2, §4.2.3.
  28. Y. Wang, J. See, R. C. Phan and Y. Oh (2015) Efficient spatio-temporal local binary patterns for spontaneous facial micro-expression recognition. PloS one 10 (5), pp. e0124674. Cited by: §2.
  29. J. Zhang, W. Zhou, C. Xie, J. Pu and H. Li (2016) Chinese sign language recognition with adaptive hmm. In 2016 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. Cited by: §1, §2.
  30. H. Zhou, W. Zhou, Y. Zhou and H. Li (2020) Spatial-temporal multi-cue network for continuous sign language recognition. In AAAI, pp. 13009–13016. Cited by: §1, §2, §3.3.