Towards Automatic Screening of Typical and Atypical Behaviors in Children With Autism
Autism spectrum disorders (ASD) impact the cognitive, social, communicative and behavioral abilities of an individual. The development of new clinical decision support systems is important for reducing the delay between presentation of symptoms and an accurate diagnosis. In this work, we contribute a new database consisting of video clips of typical (normal) and atypical (such as hand flapping, spinning or rocking) behaviors, displayed in natural settings, which have been collected from the YouTube video website. We propose a preliminary non-intrusive approach that uses pretrained deep neural networks to identify skeleton keypoints in video clips of the human body, extracts features from them, and performs body movement analysis to differentiate typical and atypical behaviors of children. Experimental results on the newly contributed database show that our platform performs best with a decision tree as the classifier when compared to other popular methodologies, and offers a baseline against which alternate approaches may be developed and tested.
Autism spectrum disorders (ASD) are a group of developmental disorders which involve difficulties with social and communicative processes throughout the lifetime of the affected individual. These disorders are characterized by atypicalities in a range of processes and behaviors across the spectrum of the individual's day-to-day life. ASD is characterized by stereotyped delays in development of social and communicative skills as well as a pattern of restricted behaviors, interests and activities. At the extremes these disorders may manifest in non-verbality or self-injurious behaviors.
Currently the cause of ASD is not known, or not fully understood. However, as understanding of the signs of the disorders increases, the age of practical diagnosis is being reduced and the accuracy of these diagnoses is improving. There is, however, evidence of a significant delay in diagnosis caused by the lack of access to specialized resources and expert clinicians. This can in turn delay interventions and therapies which may be of significant benefit to the affected individual as well as their caregivers.
Diagnosis of ASD in an individual is a complex process requiring significant investment from multiple disciplines within the clinical field. Typically there will be a series of interviews with the parents or carers of the child in question as well as, dependent upon the age of the child, involvement from teachers and primary care physicians. There will also be a series of structured play sessions with the child through which the clinician is able to directly observe the range of diagnostic signs.
Motor stereotypies are a group of common signs displayed across the various ASD characteristics. These are often repetitive movements which are atypical for the developmental level of the individual being observed. Often these stereotypies are displayed more prominently in situations where the individual is excited, stressed or anxious; however, they may similarly be elicited during periods of boredom, tiredness or sensory isolation. We envision an opportunity for a set of tools which may be used in the diagnostic process to assist clinicians as well as associated domain specialists. These tools would speed up the diagnostic process, shortening the delay between initial presentation of concern and suitable interventions.
I Related Work
The autism diagnosis procedure has been the focus of a number of studies over the past decades, in which many machine learning and computer vision methodologies have been investigated to improve the efficiency or accuracy of current methods [30, 12, 20]. Similarly, alternatives to the highly structured current approaches have been investigated, in which unsupervised rating scales were combined with clinical interviews.
In recent years there has been significant investigation into the use of computer vision methods to identify and classify a number of behavioral atypicalities, including non-intrusive (at a distance) video based monitoring of epileptic seizures in children [19, 16, 15], human activity recognition [18, 17] and assisting social interactions at various levels [20, 33, 6, 9, 32, 21]. The advantage of computer vision techniques over methods which involve placing measuring devices on the subject's body is their non-invasiveness: a carefully placed camera is less likely to interfere with the session and will likewise reduce stress on the child in question. Such non-intrusive measurement at a distance does not alter the natural behavior of the child and thereby improves the effectiveness of ASD assessment. These attributes are highly desirable to clinicians and practitioners evaluating ASD in children.
Many of these prior studies approach the problem by observing individuals in structured and semi-structured sessions. These observations are made in settings similar to the standard ASD diagnostic process, where a clinician interacts with the individual according to a defined set of guidelines such as the Autism Diagnostic Observation Schedule (ADOS). The aim of these structured sessions is to elicit display of atypical behaviors in the child in a controlled manner. In the majority of these investigations it was necessary to develop a standardized schema for the interviews, with the interactions chosen so as to highlight the areas of interest to the study [11, 3]. From these studies a limited number of databases have been used in the academic community [24, 23], and none of them is publicly available.
Recently, Dawson et al. reported that differences in motor function are an early feature of ASD. They explored midline head and body posture control by detecting facial landmarks and head pose angles of children. Their findings show that toddlers with ASD exhibited a significantly higher rate of head movement than non-ASD toddlers, suggesting difficulties in maintaining midline head position while engaging attentional systems.
II YouTube ASD Database
The newly developed database presented in this work is collected from publicly available files on the YouTube video platform. Initial searches were made to select a variety of short video clips highlighting stereotypical behaviors seen in children between the ages of around 3 years and 12 years, with the aim of capturing a wide variety of behaviors in a variety of settings. It also presents ground-truth clips, similar in number and length, against which the differences in behavior may be compared. Of particular importance to this selection process were the frame rate and resolution of the video. An attempt was made to include as wide a range of video subjects as possible. Ground truths are created using the captions and descriptions of the user videos and the expert knowledge of the authors, in collaboration with an autism assessment center.
For each video a number of short sequences were chosen to highlight the key behaviors being displayed and any differences between typical behaviors and the atypical actions which are the focus of this study. Each sequence of around 3 to 12 seconds was selected based upon the positioning of the subject within the frame and the presence of other subjects within those frames. The videos focus primarily on stimming behaviors, which are usually quite rapid and atypical, such as hand flapping, spinning, jumping, rocking back and forth, or repetitive playing or fiddling with toys or objects. These rapid atypical stimming movements are often missed or overlooked by caregivers and busy parents. Every care is taken to diversify our database as much as possible. For this database, more than 90 minutes of video were collected and analyzed, resulting in a database consisting of around 330 seconds of focused sequences.
Our database, termed the ‘YouTube ASD Database’, consists of 68 video clips: 35 atypical and 33 typical (normal movements). Its size is comparable to that of existing databases in this area but, unlike the existing ones, our database would be available to access via the original postings on the YouTube platform, with start and end frame numbers provided for each selected sequence to allow for further investigation. It should thereby help to accelerate research in automatic non-intrusive screening of behaviors in autistic children and related areas.
III Modeling Typical and Atypical Behaviors in Children
Understanding atypical and typical behaviors in children is a very challenging problem because it involves uncertainty in the appearance of the limbs and in their movements, which are driven by the whims of the performer (the child). In this work, our idea is to capture the spatial (relationship) information by selecting key points of the limbs, to track them over time using the key point detections in subsequent frames of a video, and finally to compute the temporal information in the form of displacement, velocity and periodicity. The framework of our proposed skeleton based keypoint feature extraction and periodicity estimation is shown in Fig. 1. This approach allows us to take a short video sequence and extract time-series data in an efficient and automated manner.
III-A Behaviors in Unstructured Videos: Challenges Involved
Due to the uncontrolled nature of the videos being used there are a number of challenges which must be addressed. However, we expect these to be the same challenges that would be encountered with real-world, carer-supplied home videos.
Single viewpoint: The videos included in this data set are by their nature captured only from a single viewpoint. This leads to difficulties with identifying behaviors which are occurring normal to the plane of the video. Additionally these videos generally maintain focus on the face and upper torso of the subject thereby removing potential data about the lower-body actions.
Spatial variance: It is rare that a non-professional video would be taken using a fixed position camera, and there is also a high chance of the subject moving around within the frame. These global movements must be accounted for in any feature extraction process so as not to introduce additional noise into the data.
Mixed behaviors: In some cases the subject may begin the sequence displaying one type of atypical behavior and during the captured period change to a different behavior or change the spatial alignment of the behavior they are demonstrating.
Multiple subjects: A number of the videos present behavior in settings where multiple people are visible within the frame.
In order to handle the challenges discussed above, OpenPose is selected as the primary method of extracting key-point data from the input images. OpenPose is a two-part program consisting of a convolutional neural network able to identify each of the major key-points and a multi-person part affinity field parser which is able to assemble these key-points into a skeleton representing a single person within the field of view. The advantages of the OpenPose implementation over other attempts to solve the same problem are the relatively short processing time required per image or frame and the robustness of the results it provides. Another advantage of the approach used by OpenPose is the inclusion of confidence values for each body part in the data output: each of the 25 body key points is given an $x$-coordinate, a $y$-coordinate and a probability value. Where multiple candidate locations could be assigned to a single key-point of a single identified person there is an option to include those data; by default, however, it provides only the locations it is most confident of. It also allows these key-points to be rendered over the original image, which is useful for visual verification of its results as well as investigation of any anomalies.
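As an illustration of the per-keypoint output described above, the following minimal sketch reads the six upper-limb keypoints of one person from a single frame. It assumes the JSON layout that OpenPose writes with its JSON output option (a `people` list whose entries hold a flat `pose_keypoints_2d` array of x, y, confidence triples) and the BODY_25 keypoint ordering; the paper does not state which output format was used, and `parse_frame` and the example frame are hypothetical.

```python
import json

# BODY_25 indices of the six upper-limb keypoints (assumed OpenPose ordering).
UPPER_LIMB = {
    "RShoulder": 2, "RElbow": 3, "RWrist": 4,
    "LShoulder": 5, "LElbow": 6, "LWrist": 7,
}

def parse_frame(json_text, person=0):
    """Return {name: (x, y, confidence)} for one person in one frame."""
    people = json.loads(json_text)["people"]
    if len(people) <= person:
        return {}  # nobody detected in this frame
    flat = people[person]["pose_keypoints_2d"]  # [x0, y0, c0, x1, y1, c1, ...]
    return {name: tuple(flat[3 * i: 3 * i + 3]) for name, i in UPPER_LIMB.items()}

# Hypothetical frame: 25 keypoints, all zero except the right wrist (index 4).
flat = [0.0] * 75
flat[12:15] = [320.5, 240.0, 0.91]
example = json.dumps({"people": [{"pose_keypoints_2d": flat}]})
print(parse_frame(example)["RWrist"])
```

Selecting a fixed `person` index is a simplification: it is only reliable when a single subject is in the frame.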
From each image, positional information is extracted for each keypoint visible within the frame; some samples are shown in Figs. 2 and 3. These keypoint locations are compiled on a clip-wise basis before being passed to the next stage in our process. The pretrained OpenPose model provides a certainty estimate for each keypoint position. For the purposes of our study only detections with a certainty above 0.6 are used, in order to retain as much information as possible whilst also removing anomalous identifications, and cubic spline interpolation is used to reconstruct the values lost in the discarded frames. There are cases where keypoints are obscured for a significant period of time; the interpolation algorithm is therefore only employed for short periods of missing data, to maintain confidence in the values reported.
We propose to track only the upper limb positions (wrist, elbow, shoulder) for these experiments, as these keypoints are identified to be the most significant group within the source database clips. For each clip a frame-wise change of position is calculated, approximating the instantaneous velocity of each keypoint between consecutive frames. These velocity estimates are calculated for both the $x$ and $y$ axes of the original images, together with a Euclidean distance measure of absolute velocity.
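The confidence filtering and gap filling described above might look like the following sketch for a single coordinate of a single keypoint. The 0.6 threshold is taken from the text; the maximum gap length `max_gap`, the treatment of a confidence of exactly 0.6, and the use of SciPy's `CubicSpline` are assumptions made for illustration, since the paper only states that interpolation is limited to short periods of missing data.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def fill_short_gaps(values, conf, threshold=0.6, max_gap=5):
    """Discard low-confidence samples, then spline-fill short interior gaps.

    Gaps longer than max_gap frames (a hypothetical limit), and gaps at the
    start or end of the clip, are left as NaN.
    """
    values = np.asarray(values, dtype=float)
    valid = np.asarray(conf, dtype=float) >= threshold
    out = np.where(valid, values, np.nan)
    if valid.sum() < 4:
        return out  # too few trusted samples to fit a cubic spline
    frames = np.arange(len(values))
    spline = CubicSpline(frames[valid], values[valid])
    i, n = 0, len(values)
    while i < n:
        if valid[i]:
            i += 1
            continue
        j = i
        while j < n and not valid[j]:
            j += 1  # j is the first valid frame after the gap (or n)
        if i > 0 and j < n and (j - i) <= max_gap:
            out[i:j] = spline(frames[i:j])
        i = j
    return out

# Synthetic example: a linear track with two low-confidence frames.
x = 2.0 * np.arange(10)
c = np.array([0.9, 0.9, 0.9, 0.2, 0.3, 0.9, 0.9, 0.9, 0.9, 0.9])
filled = fill_short_gaps(x, c)
print(filled)
```

Leaving long gaps as NaN keeps the decision about how to handle heavily occluded keypoints explicit for later stages.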
III-B Sparse Structure Identification
OpenPose offers a significant number of utility options allowing for customization of the algorithm as well as how the data is reported. The authors have noted that accuracy can be increased in a number of ways, however this comes with associated increases in hardware utilization and processing time. A manual optimization of the options was performed to reach the maximal accuracy possible with the computing resources on hand (i7 8086k 5GHz, 16GB RAM, GTX1070). Specifically, we have to perform the following tasks to identify the body structure.
Single/Multi-person identification: The creators of OpenPose report the ability to identify multiple subjects within a single frame without a significant increase in processing time. We found this claim to be accurate; however, difficulties were encountered with the labeling applied by OpenPose, where person identification tags showed limited correlation between consecutive frames.
Hand-structure identification: A visual inspection of the database clips described in Section II shows significant movement confined to the hands of the individuals in question, which has proven to be a problem for the data extraction methods attempted. OpenPose offers a hand identification algorithm; however, it had difficulty correctly identifying hand locations in a majority of the images passed to it. This is especially true of images with significant motion blur, which unfortunately are the frames containing the most useful data. Additionally, OpenPose had some difficulty assigning detected key-points to the correct hand of the subject, especially when both hands are in close proximity or when a hand is close to the subject's foot. In other situations different body parts were identified as being part of the hand. As such we elected not to include data concerning the position of hand substructures, due to the level of uncertainty.
Image deblurring: A number of attempts were made to improve the accuracy and response on a selection of test images. Simple image sharpening based upon edge detection provided no additional accuracy, potentially due to the dispersed nature of the ‘edges’ of the relevant limbs and digits. Hand identification accuracy was not found to improve across a small number of test frames, even though there was some success in removing motion blur and revealing hidden information. In addition to the lack of improvement in accuracy, processing each of the images tested took around 7 seconds, with a maximum resolution during testing of below 1080p (as shown in the example Figs. 2 and 3, the maximum possible size was 720p) due to hardware limitations and the graphics RAM requirements of the neural network.
Key point selection and confidence score: Visual observation of the chosen video clips shows that the upper body tends to be better represented than the lower body. As such, the lower body key-points were removed from the test database for processing. Similarly, the facial key points were removed to limit the quantity of data to be managed. This left six key points: the right shoulder, right elbow, right wrist, left shoulder, left elbow and left wrist. The data from a number of sample videos were visually inspected against the rendered output frames to find any misidentified key-points. This inspection showed that OpenPose had misidentified the locations of multiple key points, and inspection of the coordinate data showed that these points had correspondingly been given a low probability rating. It was therefore deemed necessary to filter the key points on a frame-wise basis according to their probability values. A threshold value of 0.6 was chosen, with all key points below this value discarded. This value was chosen to reduce the number of misidentifications whilst retaining a sufficient number of data points for further processing.
Length metric: Due to the variable nature of the videos with regard to framing and subject depth, a metric was devised to account for this as well as to control for changes in depth-wise position within the frame. The Euclidean distance between the shoulder and elbow is measured on both arms in each frame, so that the distance of a movement can be expressed in terms of upper arm length. There may be cases where this value is miscalculated due to the angle at which the limb is presented to the camera. For instance, if the subject were facing the camera with both elbows raised towards it, there would be significant foreshortening of the upper arms and a much smaller value than expected would be computed. To account for these effects of foreshortening, a rolling median is computed over the metric series to reduce the effect of such anomalous data.
Velocity calculation: For each pair of consecutive frames in the video clip the difference in both the $x$ and $y$ position of the key-point is calculated, which when divided by the length metric described above provides a frame-wise velocity estimate in terms of “upper arm lengths per frame”. From these, an overall velocity is calculated as the Euclidean norm of the two components. Therefore, for each key-point there are three values for each consecutive frame pair within the source video clip: $\Delta x$, the change in $x$ co-ordinate ($x$ velocity); $\Delta y$, the change in $y$ co-ordinate ($y$ velocity); and $\Delta d = \sqrt{\Delta x^2 + \Delta y^2}$, the change in overall position (total velocity). Some samples are shown in Figs. 4-5.
Periodicity estimate: By computing the autocorrelation of the signal over the full range of lags (up to the length of the signal), a correlogram is produced which describes the periodic nature of the sequence. A highly periodic sequence will produce a periodic correlogram with peaks at or approaching a value of 1, and similarly troughs approaching a value of -1. Because signals of differing length cannot be compared lag by lag, the strength of their correlations is compared instead. For each set of values computed for each key point, an autocorrelation series is calculated for each possible lag over the length of the signal. From this series the local maxima are located and the correlation coefficient at each of those points is recorded. The overall maximum, the mean and the standard deviation of the coefficients at all local maxima in the series are then computed to give a measure of the strength of periodicity for each signal, thereby providing a signature of that observation which may further be used to classify given data.
Initial analysis of these velocity curves showed a tendency for the atypical examples to display periodicity in one or more key points. Our feature extraction pipeline therefore estimates the ‘periodicity’ of each time series from the magnitude of the local maxima in its autocorrelogram, as shown in Figs. 4 and 5. For each of the three velocity measures of each of the six tracked keypoints, we record five features: the mean velocity and the standard deviation of velocity (both calculated over the duration of the signal), and the maximum, mean and standard deviation of the autocorrelogram maxima (each calculated over the full range of lags in the video clip). Each clip is thus decomposed into a vector of length 90 (3 velocity measures × 6 keypoints × 5 features), with additional values for the name of the clip and the baseline classification of Typical (given value 0) or Atypical (given value 1).
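The length metric, velocity calculation and periodicity estimate described in the last three items can be sketched together with NumPy. Several details are assumptions rather than the authors' stated choices: the two arms are averaged into one length value, the rolling median uses a hypothetical 15-frame window, and the autocorrelation uses the common biased estimator, whose peaks shrink as the lag grows.

```python
import numpy as np

def rolling_median(x, window=15):
    """Centered rolling median; the window shrinks at the ends of the series."""
    x = np.asarray(x, dtype=float)
    half = window // 2
    return np.array([np.median(x[max(0, i - half): i + half + 1])
                     for i in range(len(x))])

def length_metric(r_shoulder, r_elbow, l_shoulder, l_elbow, window=15):
    """Per-frame upper arm length in pixels from (T, 2) arrays.

    Averaging the two arms is an assumption made for this sketch.
    """
    right = np.linalg.norm(np.asarray(r_shoulder, float) - np.asarray(r_elbow, float), axis=1)
    left = np.linalg.norm(np.asarray(l_shoulder, float) - np.asarray(l_elbow, float), axis=1)
    return rolling_median((right + left) / 2.0, window)

def velocities(track, length):
    """Velocity of one keypoint in upper arm lengths per frame: (vx, vy, v)."""
    step = np.diff(np.asarray(track, dtype=float), axis=0) / length[1:, None]
    vx, vy = step[:, 0], step[:, 1]
    return vx, vy, np.hypot(vx, vy)

def autocorrelogram(signal):
    """Normalized autocorrelation for lags 0..T-1 (equal to 1 at lag 0)."""
    s = np.asarray(signal, dtype=float)
    s = s - s.mean()
    acf = np.correlate(s, s, mode="full")[len(s) - 1:]
    return acf / acf[0] if acf[0] > 0 else np.zeros_like(acf)

def periodicity(signal):
    """Maximum, mean and standard deviation of the correlogram's local maxima."""
    r = autocorrelogram(signal)
    mid = r[1:-1]
    peaks = mid[(mid > r[:-2]) & (mid > r[2:])]
    if peaks.size == 0:
        return 0.0, 0.0, 0.0
    return float(peaks.max()), float(peaks.mean()), float(peaks.std())

# Synthetic comparison: a sinusoid with a 20-frame period against white noise.
t = np.arange(200)
periodic = np.sin(2 * np.pi * t / 20)
noise = np.random.default_rng(0).normal(size=200)
print(periodicity(periodic)[0], periodicity(noise)[0])
```

On the synthetic signals the sinusoid yields a much larger maximum peak than the noise, which is the contrast the periodicity features are meant to capture.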
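To make the structure of the 90-element vector concrete, the sketch below assembles it from per-keypoint velocity series. The feature names, the nested-dictionary input layout and the function names are hypothetical; only the counts (six keypoints, three velocity measures, five statistics) come from the text.

```python
import numpy as np

KEYPOINTS = ["RShoulder", "RElbow", "RWrist", "LShoulder", "LElbow", "LWrist"]
MEASURES = ["vx", "vy", "v"]
STATS = ["vel_mean", "vel_std", "acf_max", "acf_mean", "acf_std"]

def acf_peaks(signal):
    """Values of the normalized autocorrelation at its local maxima."""
    s = np.asarray(signal, dtype=float)
    s = s - s.mean()
    acf = np.correlate(s, s, mode="full")[len(s) - 1:]
    if acf[0] <= 0:
        return np.array([])  # constant signal: no periodicity to measure
    r = acf / acf[0]
    mid = r[1:-1]
    return mid[(mid > r[:-2]) & (mid > r[2:])]

def series_features(signal):
    """Five statistics for one velocity series, in the order of STATS."""
    signal = np.asarray(signal, dtype=float)
    peaks = acf_peaks(signal)
    if peaks.size == 0:
        peak_stats = [0.0, 0.0, 0.0]
    else:
        peak_stats = [peaks.max(), peaks.mean(), peaks.std()]
    return [signal.mean(), signal.std()] + peak_stats

def clip_features(series):
    """series[keypoint][measure] -> (feature names, vector of length 90)."""
    names, values = [], []
    for kp in KEYPOINTS:
        for m in MEASURES:
            for stat, value in zip(STATS, series_features(series[kp][m])):
                names.append(f"{kp}_{m}_{stat}")
                values.append(float(value))
    return names, np.array(values)

# Synthetic stand-in for one clip: 120 frames of random velocities.
rng = np.random.default_rng(0)
demo = {kp: {m: rng.normal(size=120) for m in MEASURES} for kp in KEYPOINTS}
names, vector = clip_features(demo)
print(len(names), vector.shape)  # 90 (90,)
```

The clip name and the 0/1 class label would be stored alongside this vector rather than inside it.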
Our data set consists of 68 short clips split between typical and atypical displays of behavior. For each clip a short OpenCV script is used to extract the sequence of individual frames at a rate of 30 frames per second (fps), which are saved in .png format. These frames are subsequently passed individually to an open source key-point identification tool [31, 4]. This tool uses a Caffe deep convolutional neural network (DCNN) trained to identify 25 key points within a presented image. Initially trained on the 19 keypoint COCO dataset, it has since been expanded to identify 25 important locations within the human figure. This software is chosen due to its exceptional results in the 2016 COCO keypoints challenge [5, 7].
The extracted features, which encode both the spatial information (skeleton key points) and the temporal information (the velocities of body parts over time), are then evaluated with three popular machine learning algorithms to assess the efficacy of the baseline methodologies. The algorithms used are:
Linear Support Vector Machine (SVM): The data are projected into a multi-dimensional space and a linear hyperplane is calculated to separate as best as possible the data of the two classes.
Decision Tree: A method of dividing the data based upon a sequence of linear rules, leading to a final classification of the input data.
Random Forest: An ensemble learning method where multiple decision trees are generated and a voting mechanism is used to arrive at an eventual overall classification.
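The three classifiers can be instantiated and compared with a few lines of scikit-learn. The library choice, the hyperparameters and the synthetic stand-in data are all assumptions: the paper does not name its implementation, and the random features below only mimic the shape of the real data (68 clips, 90 features, 35 atypical and 33 typical). The 5-fold protocol follows the description in the experimental section.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

def make_classifiers(seed=0):
    """The three baseline classifiers, with assumed default-like settings."""
    return {
        "Linear SVM": LinearSVC(max_iter=10000, random_state=seed),
        "Decision Tree": DecisionTreeClassifier(random_state=seed),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=seed),
    }

def evaluate(X, y, seed=0):
    """Mean and standard deviation of each metric over 5 stratified folds."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scoring = ["f1", "precision", "recall", "accuracy"]
    results = {}
    for name, clf in make_classifiers(seed).items():
        scores = cross_validate(clf, X, y, cv=cv, scoring=scoring)
        results[name] = {m: (scores[f"test_{m}"].mean(), scores[f"test_{m}"].std())
                         for m in scoring}
    return results

# Synthetic stand-in data; the first five features carry a weak class signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(68, 90))
y = np.array([1] * 35 + [0] * 33)  # 1 = atypical, 0 = typical
X[y == 1, :5] += 1.0
results = evaluate(X, y)
for name, metrics in results.items():
    print(name, {m: round(mean, 2) for m, (mean, std) in metrics.items()})
```

With so few clips the fold-to-fold spread is large, which is why the standard deviation is reported next to each mean.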
Table V: Mean (standard deviation) of each performance measure over the 5 cross-validation folds.
| Method | Avg F1 | Avg Precision | Avg Recall | Avg Accuracy |
| Random Forest | 0.63 (0.24) | 0.70 (0.28) | 0.60 (0.24) | 0.69 (0.14) |
| Linear SVM | 0.66 (0.12) | 0.71 (0.19) | 0.68 (0.22) | 0.68 (0.06) |
| Decision Tree | 0.71 (0.09) | 0.73 (0.19) | 0.75 (0.14) | 0.71 (0.08) |
IV Experimental Results and Analysis
We conduct our experiments on the YouTube ASD database using a benchmarking baseline framework with three classification methods (random forest, linear SVM and decision tree), against which future methodologies can be developed and compared. The database is randomly partitioned into 80% training data and 20% testing data, and a 5-fold cross-validation procedure is followed. Default parameters with minor optimizations are used for all methods in all experiments. Table V shows the average over the 5 folds, with the respective standard deviation, of each of the main performance measures for the classification methods used: precision (the ratio of true positives to the sum of true positives and false positives; this shows how well the classifier avoids labeling a negative sample as positive), recall (the ratio of true positives to the sum of true positives and false negatives; this shows how well the classifier finds all the positive samples), F1-score (twice the ratio of the product of precision and recall to the sum of precision and recall) and accuracy (the ratio of the sum of true positives and true negatives to the size of the test batch; this represents the overall accuracy of the classification attempt). Based upon the results shown in Table V, the linear SVM achieves lower overall accuracy than the other two methods; however, its accuracy varies less across folds. The best average accuracy, 71%, is obtained with the decision tree.
Table VI shows the average confusion matrices, in percent (%), obtained by evaluating the baseline methodologies on the YouTube ASD database, with a different type of bracket used for each methodology. It is evident from Tables V and VI that the benchmarking baseline framework performs best with the decision tree as the classifier.
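The four performance measures can be computed directly from the counts of a confusion matrix, as in the sketch below. The counts in the example are hypothetical and are not taken from the paper's results; they correspond to one imagined test fold of 14 clips.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1-score and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical fold: 7 atypical clips (5 found, 2 missed) and
# 7 typical clips (5 correctly rejected, 2 flagged as atypical).
metrics = classification_metrics(tp=5, fp=2, fn=2, tn=5)
print(metrics)
```

Here atypical is treated as the positive class, matching the label convention of 1 for Atypical and 0 for Typical.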
V Conclusions and Future Work
In this work, we have contributed a new database (see the YouTube ASD Database entry in the reference list) consisting of video clips of typical (normal) behavior and of atypical ASD behavioral stereotypies, such as hand flapping, spinning, jumping, rocking back and forth, and repetitive playing or fiddling with toys or objects, in uncontrolled settings similar to home or family environments. These rapid atypical stimming movements are often missed or overlooked by carers and busy parents. The database should help to advance research on non-intrusive behavioral monitoring, assessment or early intervention for children with ASD.
In future work, we would need to extend the ground truth of the database to be annotated by multiple experts in the ASD research areas. Similarly this database may need to be expanded to include additional source videos and clips as they are found or made available online.
We have presented a non-intrusive baseline analytic framework that uses skeleton keypoint detectors, based on pretrained deep neural networks developed on human body images, to extract features, and that subsequently performs body movement analysis on videos to differentiate typical and atypical rapid stimming behaviors of children.
Experimental results on the YouTube ASD database show that our baseline method performs best with a decision tree as the classifier when compared to the other baseline methods, and provides a basis against which future methodologies can be developed and compared. More robust algorithms for detecting and tracking the limbs, head and other body parts would yield more accurate key point features and would thereby help to improve the accuracy of future ASD behavior analysis.
It is envisioned that once a suitable feature extraction and classification pipeline has been developed, a tool can be constructed in which an overlapping sliding window is applied to a submitted video, marking the time-points of sequences where atypical behaviors are identified, after which a scoring system is applied to the video as a whole. This may require an ensemble approach in which other behavioral markers are investigated in order to capture the full extent of the physical and behavioral stereotypies important to the diagnostic procedure.
References
- [1] “YouTube website,” https://www.youtube.com/, accessed: 20-08-2018.
- [2] H. Abbas, F. Garberson, E. Glover, and D. P. Wall, “Machine learning approach for early detection of autism by combining questionnaire and home video screening,” Journal of the American Medical Informatics Association, vol. 25, no. 8, pp. 1000–1007, 2018.
- [3] K. Campbell, K. L. Carpenter, J. Hashemi, S. Espinosa, S. Marsan, Borg et al., “Computer vision analysis captures atypical attention in toddlers with autism,” Autism, pp. 619–628, 2019.
- [4] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2d pose estimation using part affinity fields,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7291–7299.
- [5] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2d pose estimation using part affinity fields,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- [6] S. Chia, B. Mandal, Q. Xu, L. Li, and J. Lim, “Enhancing social interaction with seamless face recognition on google glass: Leveraging opportunistic multi-tasking on smart phones,” in Proceedings of the 17th International Conference on Human-Computer Interaction with Mobile Devices and Services Adjunct, MobileHCI ’15, Copenhagen, Denmark, August 24-27, 2015, 2015, pp. 750–757.
- [7] M. COCO, “Common objects in context - keypoint challenge 2016,” http://cocodataset.org/, 2016.
- [8] L. Crane, J. W. Chester, L. Goddard, L. A. Henry, and E. Hill, “Experiences of autism diagnosis: A survey of over 1000 parents in the united kingdom,” Autism, vol. 20, no. 2, pp. 153–162, 2016.
- [9] T. Gan, Y. Wong, B. Mandal, V. Chandrasekhar, and M. S. Kankanhalli, “Multi-sensor self-quantification of presentations,” in Proceedings of the 23rd Annual ACM Conference on Multimedia Conference, MM ’15, Brisbane, Australia, October 26 - 30, 2015, 2015, pp. 601–610.
- [10] G. Dawson, K. Campbell, J. Hashemi, and S. J. L. et al., “Atypical postural control can be detected via computer vision analysis in toddlers with autism spectrum disorder,” Scientific Reports, vol. 8, no. 17008, 2018.
- [11] J. Hashemi, M. Tepper, T. Vallin Spina, A. Esler, V. Morellas, N. Papanikolopoulos, H. Egger, G. Dawson, and G. Sapiro, “Computer vision tools for low-cost and noninvasive measurement of autism-related behaviors in infants,” Autism research and treatment, vol. 2014, 2014.
- [12] J. Kosmicki, V. Sochat, M. Duda, and D. Wall, “Searching for a minimal set of behaviors for autism detection through feature selection-based machine learning,” Translational psychiatry, vol. 5, no. 2, p. e514, 2015.
- [13] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision. Springer, 2014, pp. 740–755.
- [14] C. Lord, S. Risi, L. Lambrecht, E. H. Cook, B. L. Leventhal, P. C. DiLavore, A. Pickles, and M. Rutter, “The autism diagnostic observation schedule-generic: A standard measure of social and communication deficits associated with the spectrum of autism,” Journal of autism and developmental disorders, vol. 30, no. 3, pp. 205–223, 2000.
- [15] H. Lu, H.-L. Eng, B. Mandal, D. W. S. Chan, and Y.-L. Ng, “Markerless video analysis for movement quantification in pediatric epilepsy monitoring,” in IEEE International Engineering in Medicine and Biology Conference (EMBC), Sep 2011, pp. 8275–8278.
- [16] H. Lu, Y. Pan, B. Mandal, H. Eng, C. Guan, and D. W. S. Chan, “Quantifying limb movements in epileptic seizures through color-based video analysis,” IEEE Transactions on Biomedical Engineering, vol. 60, no. 2, pp. 461–469, 2013.
- [17] B. Mandal and H.-L. Eng, “3-parameter based eigenfeature regularization for human activity recognition,” in IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), Mar 2010, pp. 954–957.
- [18] B. Mandal and H.-L. Eng, “Regularized discriminant analysis for holistic human activity recognition,” IEEE Intelligent Systems, vol. 27, no. 1, pp. 21–31, 2012.
- [19] B. Mandal, H.-L. Eng, H. Lu, D. W. S. Chan, and Y.-L. Ng, “Non-intrusive head movement analysis of videotaped seizures of epileptic origin,” in IEEE International Engineering in Medicine and Biology Conference (EMBC), Sep 2012, pp. 6060–6063.
- [20] B. Mandal, S. Chia, L. Li, V. Chandrasekhar, C. Tan, and J. Lim, “A wearable face recognition system on google glass for assisting social interactions,” in Computer Vision - ACCV 2014 Workshops - Singapore, Singapore, November 1-2, 2014, Revised Selected Papers, Part III, 2014, pp. 419–433.
- [21] B. Mandal, L. Li, V. Chandrasekhar, and J. Lim, “Whole space subclass discriminant analysis for face recognition,” in 2015 IEEE International Conference on Image Processing, ICIP 2015, Quebec City, QC, Canada, September 27-30, 2015, 2015, pp. 329–333.
- [22] B. Mandal, W. Zhikai, L. Li, and A. A. Kassim, “Performance evaluation of local descriptors and distance measures on benchmarks and first-person-view videos for face identification,” Neurocomputing, vol. 184, pp. 107–116, 2016.
- [23] S. Rajagopalan, A. Dhall, and R. Goecke, “Self-stimulatory behaviours in the wild for autism diagnosis,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2013, pp. 755–761.
- [24] J. Rehg, G. Abowd, A. Rozga, M. Romero, M. Clements, S. Sclaroff, I. Essa, O. Ousley, Y. Li, C. Kim et al., “Decoding children’s social behavior,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2013, pp. 3414–3421.
- [25] M. D. Samad, N. Diawara, J. L. Bobzien, J. W. Harrington, M. A. Witherow, and K. M. Iftekharuddin, “A feasibility study of autism behavioral markers in spontaneous facial, visual, and hand movement response data,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 26, no. 2, pp. 353–361, 2018.
- [26] National Health Service, “Autism spectrum disorder,” https://www.nhs.uk/conditions/autism/causes, 2016.
- [27] T. Simon, H. Joo, I. Matthews, and Y. Sheikh, “Hand keypoint detection in single images using multiview bootstrapping,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- [28] H. S. Singer, “Motor stereotypies,” in Seminars in pediatric neurology, vol. 16, no. 2. Elsevier, 2009, pp. 77–81.
- [29] YouTubeASD Data, “YouTube ASD Database,” https://drive.google.com/drive/folders/1j-8ytJjrGadIy-h6I79957kLXhHQzLjQ?usp=sharing, 2019.
- [30] D. Wall, J. Kosmicki, T. Deluca, E. Harstad, and V. Fusaro, “Use of machine learning to shorten observation-based screening and diagnosis of autism,” Translational psychiatry, vol. 2, no. 4, p. e100, 2012.
- [31] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, “Convolutional pose machines,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4724–4732.
- [32] Q. Xu, S. C. Chia, B. Mandal, L. Li, J.-H. Lim, M. A. Mukawa, and C. Tan, “Socioglass: social interaction assistance with face recognition on google glass,” Scientific Phone Apps and Mobile Devices, vol. 2, no. 1, pp. 1–4, 2016.
- [33] Q. Xu, M. Mukawa, L. Li, J. Lim, C. Tan, S. Chia, T. Gan, and B. Mandal, “Exploring users’ attitudes towards social interaction assistance on google glass,” in Proceedings of the 6th Augmented Human International Conference, AH 2015, Singapore, March 9-11, 2015, 2015, pp. 9–12.