Cricket stroke extraction: Towards creation of a large-scale cricket actions dataset

Arpan Gupta (ORCID: 0000-0002-9417-3169) and Sakthi Balan M. (ORCID: 0000-0003-1817-7173)
The LNM Institute of Information Technology, Jaipur, Rajasthan, India
Email: {arpan, sakthi.balan}@lnmiit.ac.in
https://www.lnmiit.ac.in/
Abstract

In this paper, we deal with the problem of temporal action localization for a large-scale dataset of untrimmed cricket videos. Our action of interest is a cricket stroke played by a batsman, which is usually covered by cameras placed in the stands of the cricket ground at both ends of the cricket pitch. After applying a sequence of preprocessing steps, we obtain millions of frames across the videos in the dataset at a constant frame rate and resolution. The localization method is a generalized one, which applies a trained random forest model for CUTs detection (using summed-up grayscale histogram difference features) and two linear SVM camera models (CAM1 and CAM2) for first frame detection, trained on HOG features of CAM1 and CAM2 video shots. CAM1 and CAM2 shots are assumed to be part of the cricket stroke. At the predicted boundary positions, the HOG features of the first frames are computed, and a simple algorithm is used to combine the positively predicted camera shots. In order to make the process as generic as possible, we did not use any domain-specific knowledge, such as tracking or specific shape and motion features.

A detailed analysis of our methodology is provided, along with the metrics used for the evaluation of the individual models and the final predicted segments. We achieved a weighted mean TIoU of 0.5097 over a small sample of the test set.

Keywords:
Cricket stroke extraction · Sports video processing · HOG · Shot boundary detection · Untrimmed videos · Temporal localization

1 Introduction

Vision researchers are still a long way from achieving human-level understanding of videos by a machine. Though we get good results for basic tasks on images, extending the same methods to videos may not be that straightforward. Major challenges in understanding videos include camera motion, illumination changes, partial occlusion, etc. Many of these challenges can be avoided in applications where a stationary camera is used, since the moving foreground objects are much easier to segment out. On the other hand, video content from a non-stationary camera, such as telecast videos, movies and ego-centric videos, gives rise to all these challenges. It is tough to come up with a unified model that deals with all of the above challenges while, at the same time, ensuring real-time processing of the video frames.

Vision researchers, after seeing the success of deep neural networks [18] on ImageNet [25], have tried to apply them to activity recognition tasks [32, 15, 27, 3]. The need for large-scale annotated video datasets for training deep neural networks was a driving factor that resulted in a number of benchmark video datasets, such as Sports-1M [15], Youtube-8M [2], ActivityNet [10] and Kinetics [16].

Activity recognition in sports telecast videos is an active area of research, and there have been quite a few works that focus on sports events. Thomas et al. [31] and Wang et al. [33] provide detailed surveys of some of the current and past works, respectively. Though Sports-1M and UCF Sports [30] are among the largest available sports datasets, they cannot be used to learn models for any one sport in particular, as there is not enough data to recognize all types of events within a single sport.

We take a large-scale dataset of untrimmed cricket videos and try to localize the individual cricket strokes played by the batsman, which is our event of interest, hereafter referred to as a stroke. Usually, annotating a dataset of such scale is done using a crowd-sourcing platform like Amazon Mechanical Turk, similar to the works in [19, 10, 25]. In this work, we hand-annotate only a small subset of the validation and test set videos for evaluation purposes, and bootstrap the model to make predictions on the entire dataset.

Our motivation for this work is to come up with a generalized solution for developing a large-scale dataset from domain-specific telecast videos. We do not use any form of data (audio, text, etc.) other than the RGB frames of the untrimmed videos, and make minimum assumptions regarding our domain of interest, i.e., cricket. Though there are highly accurate tracking systems, like Hawk-Eye [1], which are already being used for the Decision Review System (DRS), their data is private and they rely on their own set of stationary cameras and sensors installed in the sporting ground. A dataset of such scale can be used to train, for example, C3D [14] type deep neural networks, and later solve the problem of automatic content-based recognition of the types of cricket strokes. A direct use of such a model would be in automatic commentary generation, apart from content-based browsing and indexing of cricket videos.

We collected a large set of untrimmed cricket telecast videos, performed a series of pre-processing steps to give them a uniform frame rate and resolution, and then applied our model for the prediction of temporally localized stroke segments. The applied model involves simple shot boundary detection using a summed-up grayscale histogram difference feature, a pair of camera models that recognize the starting frames of specific camera shots that are part of the stroke, and finally stroke segment predictions obtained by combining the positively predicted camera shots.

Section 2 provides a brief review of the related works. Our methodology is explained in detail in Section 3, which includes the details about the sub-problems of boundary detection, training of camera models and prediction of stroke segments. Section 4 describes our experimental setup and the evaluation metrics used for boundary detection, camera models, and action localization. Finally, we give the results in Section 5, followed by the conclusion in Section 6.

2 Related Work

The problem of action recognition in videos has picked up pace with the onset of deep neural networks [15, 32, 27, 8, 34, 21]. These works modify the architectures of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) and train them on video data. As a result, they produce state-of-the-art results at the cost of an increased number of parameters and training time.

The tasks of classification, tracking, segmentation and temporal localization are quite inter-dependent and may involve similar approaches and features. In some works, the problem of temporal action localization has also been tackled using an end-to-end learning network [35, 37]. Segmenting the object of interest and tracking it in the sequence of frames has been done in a few works like [36, 13, 12].

Some of the above approaches need pre-trained models that can be fine-tuned on their own problem-specific datasets, while others use large benchmark datasets for training purposes. Applying such techniques to a sporting event requires a lot of hand-annotated training data, which is hard to get or may include a lot of noise. Automated ways of creating datasets may involve using a third-party API, like the YouTube Data API (as done in [15]), or extraction using the text metadata of videos, which may not be accurate at all. Nga et al. [22] proposed a method to extract action videos based on their tags; deciding the relevancy of a tag is, in itself, a research problem. For these reasons, content-based action extraction from videos is a good choice for the automatic construction of a large-scale action dataset. Hu et al. [11] provide a survey of content-based methods for video extraction.

Content-based retrieval of actions from untrimmed videos is tough when the number of actions increases or when a more generalized set of actions is considered. The second problem is avoided in the case of sports videos, as the videos of a particular sport have the same types of actions performed at intervals. We take cricket as our test case and try to come up with a framework for cricket stroke extraction. A cricket stroke is a cricketing shot played by a batsman when a bowler bowls a ball. In this paper, we refer to the cricket shot as a stroke, so that it is not confused with a video shot, which can be defined as a sequence of frames captured by a single camera over a time interval during which it does not switch to any other camera.

Coming up with a purely content-based, domain-specific event extraction for untrimmed cricket telecast videos has not been attempted by many. Sharma et al. [26] tried to annotate videos by mapping the segments to the text commentary using dynamic programming alignment, but such commentary may not always be available or may be noisy. Moreover, their dataset is quite small compared to what we are trying to achieve. Some other works have also looked at the extraction of cricketing events, but they have not considered it at such a scale as ours, or have not tried to generalize their solutions. Kumar et al. [20] and some similar works have tried to classify frames based on the ground, pitch, or players present in them, or came up with a rule-based solution to classify motion features. None of them have analyzed their results on an entirely new set of cricket videos.

In our work, we collect a large set of raw untrimmed cricket videos and report our results on a subset of these videos, that have been hand-annotated for the temporal localization task. The architecture (Figure 1) can be generalized to other sports, since there are no cricket specific assumptions involved.

Temporal localization using deep neural networks has been quite successful recently [38, 9]. There are, however, other works that do not use deep neural nets, such as Soomro et al. [29], who use an unsupervised spectral clustering approach to find similar action types and localize them. Our approach also does not use deep neural networks, as our objective is to come up with a dataset that is large enough to train CNNs with millions of parameters. Even pre-trained networks need a sufficiently large amount of labeled data for fine-tuning. We did the labeling for only a small set of highlight videos (about 1 GB of World Cup T20 2016 footage) and bootstrapped simple machine learning models trained only on grayscale histogram differences and HOG [7] features.

3 Cricket stroke extraction

Extracting cricket strokes from untrimmed videos is analogous to the action detection (localization) task, where the action of interest is a cricket stroke being played by a batsman.

Figure 1: Our framework for prediction of cricket strokes based on the learned models for shot boundary detection (SBD), camera models (CAM1 and CAM2) and combining the predictions using Algorithm 1

The live telecast videos of a sports event are created by a set of cameras that have the task of covering most of the sporting arena. There are two types of cameras: fixed and moving. The fixed cameras (assigned to camera-men) have a defined objective of capturing a specific sporting activity, while the moving cameras, e.g., spider-cams in cricket grounds, may be controlled remotely and may not necessarily cover the relevant sporting activity at all times.

The main sporting activity in a cricket match is a bowler bowling a ball, followed by a batsman playing a stroke and then scoring runs. An over consists of 6 such deliveries, each of which is, potentially, an event of interest. Automatically recognizing the outcome of each delivery is a tough problem, which may require a complex system with domain knowledge and learned models that interact in a rule-based manner, as done in [20]. In our work, we consider the more basic problem of localizing a cricketing event, namely the stroke play and its direction. Here, the starting point of our event of interest is the camera shot where the bowler is about to deliver the ball, and the ending point is the camera shot that captures the direction of the cricket stroke played. Generally, our event of interest is captured by one camera shot or at most two subsequent camera shots. An illustration is provided in Figure 2, showing the starting and ending frames of the event of interest. The major portion of the event is captured by two subsequent camera shots, CAM1 and CAM2 (note that we use CAM1 and CAM2 to refer to the camera shots as well as to the models trained on the first frames of these shots), where CAM2 captures a wider area to locate the movement of the ball and then, gradually, focuses on it. There may also be a case where the batsman just taps the ball and it does not travel beyond the field-of-view of CAM1. Therefore, we need to segment out either CAM1 shots alone or CAM1 followed by CAM2 shots.

The above type of modeling can be applied (with minor changes) to a number of other sports, like tennis, badminton, or baseball. An illustration of our temporal cricket stroke localization model is given in Figure 1. The pipeline of simple models, trained on only a small set of sample videos, is used to predict temporal stroke segments in a large set of raw untrimmed cricket telecast videos. The evaluation of the bootstrapped predictions is done on a subset of the main dataset, called the test set sample. This subset, along with a validation set sample, has been hand-annotated with ground-truth cricket stroke segments. An evaluation over these subsets gives an estimate of what we can expect as the overall accuracy.

A learning-based approach for detection and localization, in order to obtain generalized results, would require a large amount of labeled data, where the labels contain a minimum amount of noise. Such a dataset for cricket telecast videos may help the research community come up with better learned models for this particular sporting domain. Choosing a purely content-based modeling approach, we need to minimize the manual annotation effort, which leaves us with the idea of bootstrapping a smaller model's predictions to the large raw video dataset, where a "smaller" model is one having only a few parameters.

Our method is purely content-based, since we do not use any extra information other than the RGB video frames and the features extracted from them. The training of simple machine learning models on a shallow and a high-dimensional feature descriptor, for localization, has been performed using a small highlights video dataset. Here, the shallow feature is the summed-up absolute value of the histogram differences of consecutive grayscale frames, used for shot boundary detection (CUTs), and HOG is the high-dimensional feature descriptor of a frame, used to recognize the starting frames of CAM1 and CAM2 video shots. The shallow descriptor is computationally less expensive than the high-dimensional feature descriptor.

The steps involved in the cricket stroke extraction are as follows:

  1. Dividing the telecast videos into individual camera shots using shot boundary detection. Here, only the CUTs are considered, since the detection of gradual transitions involves an additional overhead. The CUT predictions are made using a random forest [5, 23] model, trained on the summed-up absolute values of the histogram differences of consecutive grayscale video frames taken from our sample dataset (refer to Table 1).

  2. A small dataset of first frames was created, using only the sample dataset videos, for training the CAM1 and CAM2 camera models. In each of these sets, nearly half are positive samples and half are negative samples. We trained two linear SVMs [6, 23] on the HOG [7] features of the training subset of these first frames.

  3. Given the boundary predictions and the first-frame recognition models, a simple approach is to extract only those video shots whose first-frame HOG features give positive predictions. These will be the cricket strokes. Algorithm 1 describes a simple approach for extracting these events.

  4. The evaluation of the extracted cricket strokes can be done by defining an evaluation metric and having a test set of human-annotated cricket stroke segments. We choose the 1D version of the IoU metric (Intersection over Union), called the temporal IoU (TIoU), and measure how much overlap there is between our predictions and the ground-truth annotations.

The steps are described in greater detail below.

3.1 Preprocessing

The raw cricket telecast videos, collected from different sources like YouTube and Hotstar, had different frame rates and resolutions. All the videos were resized to a fixed resolution with a constant frame rate of 25 FPS. This step was carried out using FFmpeg. Having a constant frame rate and frame dimensions ensures the uniformity of the motion features that may be extracted from the videos.
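As a concrete illustration, this normalization can be scripted around FFmpeg, for example by calling it from Python. The target height below is only illustrative (the exact resolution we used is not repeated here); the 25 FPS rate is the one stated in Section 4.1.

import subprocess

def normalize_video(src_path, dst_path, fps=25, height=360):
    """Re-encode a video to a constant frame rate and a fixed height;
    the width is scaled to preserve the aspect ratio (rounded to an even value)."""
    cmd = [
        "ffmpeg", "-y", "-i", src_path,
        "-vf", f"scale=-2:{height}",  # fixed height, width follows the aspect ratio
        "-r", str(fps),               # constant frame rate
        "-an",                        # drop audio; only the RGB frames are used
        dst_path,
    ]
    subprocess.run(cmd, check=True)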

3.2 Shot boundary detection

Shot boundary detection has been studied for decades and is often the first step of any content-based video processing system. There are two types of boundaries: hard CUT transitions (which occur where one camera shot ends abruptly and the next begins) and gradual transitions (fades, wipes, dissolves, etc.). Our focus is only on the detection of CUT boundaries, since CUTs are the most common in sports telecast videos. As future work, one might focus on the detection of gradual transitions as well, but that would tend to increase the overall processing time, which needs to be minimized when dealing with any kind of large-scale data processing. The CUT predictions are used when we need to jump directly from one camera shot to the next in an untrimmed video. Iterating over the CUT predictions and extracting the HOG features from only the first frames speeds up our processing.

We tested histogram differences (grayscale and RGB) and weighted-χ² differences of consecutive frames. Equation 1 gives the summed-up absolute value of the histogram differences for consecutive grayscale frames t and t+1, where N is the number of bins representing the different gray-levels and H_t(i) is the number of pixels of frame t falling in bin i:

d_t = Σ_{i=1}^{N} | H_t(i) - H_{t+1}(i) |        (1)
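A minimal sketch of this feature computation with OpenCV is shown below; the function name is ours, and how the per-transition values are grouped before being fed to the classifier is an implementation detail not fixed by the text.

import cv2
import numpy as np

def grayscale_hist_diffs(video_path, n_bins=256):
    """Summed absolute differences of grayscale histograms of consecutive
    frames (the quantity of Eq. (1)); returns one value per frame transition."""
    cap = cv2.VideoCapture(video_path)
    diffs, prev_hist = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        hist = cv2.calcHist([gray], [0], None, [n_bins], [0, 256]).ravel()
        if prev_hist is not None:
            diffs.append(float(np.abs(hist - prev_hist).sum()))
        prev_hist = hist
    cap.release()
    return np.asarray(diffs)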

3.3 Camera Models

We may assume that each fixed camera has a defined task, i.e., it will regularly cover similar actions being performed. The starting scene of a stroke has a bowler taking a run-up to the bowling crease and the batsman standing at the other end of the pitch, ready to face the delivery. The stroke can be detected by identifying the first frame of this video shot (using the predicted CUT positions) and segmenting out the shot up to the next CUT position. The accuracy of this method relies on how accurate the CUT predictions are and how accurately we detect the first frames of the stroke. Figure 2 shows a sample cricket stroke containing CAM1 and CAM2 video shots.

Figure 2: The sequence of frames in a cricket stroke. As the ball goes outside the field-of-view of CAM1, the camera is switched to CAM2. Note that there is a CUT between the two shots.

3.4 Extraction Algorithm

Algorithm 1 is a simple approach to localize our activities of interest, given the individual pretrained camera models and a set of CUT predictions for an untrimmed video. The final predictions for the stroke segments may be of any length, which includes segments obtained from a false positive CUT followed by a false positive CAM1/CAM2 prediction. This occurs mostly at places where there are gradual transitions. Such false positive segments are of short duration and can simply be neglected by a filtering step, which is explained in Section 4.4.

1: procedure LocalizeActions(video, cuts, CAM1, CAM2) ▷ video: the input video; cuts: list of predicted CUTs; CAM1, CAM2: trained camera models
2:     segments ← [ ]; inStroke ← False
3:     for i ← 1 to |cuts| do ▷ Iterate over the CUT positions
4:         f ← first frame of the shot starting at cuts[i]
5:         x ← HOG(grayscale(f))
6:         p1 ← CAM1(x)
7:         p2 ← CAM2(x)
8:         if inStroke = False then
9:             if p1 is positive then ▷ positive prediction for CAM1
10:                start ← cuts[i]; inStroke ← True
11:        else
12:            if p2 is not positive then ▷ stroke does not continue into a CAM2 shot
13:                end ← cuts[i] − 1
14:                append [start, end] to segments ▷ save a predicted segment to list
15:                inStroke ← False
16:                if p1 is positive then ▷ a new stroke may begin at this CUT
17:                    start ← cuts[i]; inStroke ← True
18:    if inStroke = True then ▷ the last stroke extends to the end of the video
19:        end ← getTotalFrames(video) − 1 ▷ get number of frames in video
20:        append [start, end] to segments
21:    return segments
Algorithm 1 Extract cricket strokes
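A runnable Python counterpart of this procedure is sketched below, under the assumption that the CUT positions, the per-shot first-frame HOG features, and the two trained classifiers (with a scikit-learn-style predict()) are already available; the variable names are ours.

def localize_strokes(cut_positions, first_frame_feats, cam1, cam2, n_frames):
    """Combine CAM1/CAM2 first-frame predictions at the CUT positions into
    [start, end] stroke segments (a sketch of Algorithm 1's behaviour)."""
    segments, start = [], None
    for i, cut in enumerate(cut_positions):
        x = first_frame_feats[i].reshape(1, -1)
        is_cam1 = cam1.predict(x)[0] == 1
        is_cam2 = cam2.predict(x)[0] == 1
        if start is None:
            if is_cam1:                      # a stroke begins with a CAM1 shot
                start = cut
        else:
            if not is_cam2:                  # the stroke does not continue into CAM2
                segments.append([start, cut - 1])
                start = cut if is_cam1 else None
    if start is not None:                    # the last stroke runs to the end of the video
        segments.append([start, n_frames - 1])
    return segments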

3.5 Bootstrapping the predictions

We refer to the final predictions made on the large dataset as the bootstrapped predictions, since the trained models used an entirely different dataset (the Highlight videos) for training. The accuracy of these predictions depends largely on the Highlight videos.

4 Experimentation Details

This section describes our experimental setup in detail.

4.1 Data Description

The raw cricket telecast videos were first transformed to have a constant frame rate (25 FPS) and resolution using FFmpeg. We had a sample dataset of only 26 highlight videos, each of around 2-5 minutes duration. This dataset was set aside for experimentation and for training the intermediate CUTs and CAM models. The main dataset of untrimmed videos (also referred to as the full dataset) had 273 GB (millions of frames) of untrimmed videos from various sources, containing Test Match videos, ODI videos and T20 videos. We neglected any local cricket match videos, where the telecast camera was positioned at a non-standard position. A brief summary of both datasets is given in Table 1. Further, we partitioned the datasets into training, validation and testing sets (Table 1). The highlight videos are annotated with CUT positions and stroke segments. A subset of the validation set videos and a subset of the test set videos of the full dataset are manually annotated with stroke segments; they are used for the final evaluation.

Property                Highlights dataset            Full dataset
No. of Videos           26                            —
Total Size              ~1 GB (approx.)               273 GB (approx.)
Dimensions (H, W)       —                             —
Frame Rate (FPS)        25                            25
Event Description       ICC World Cup T20 2016        Varied sources
Training set            16 videos                     —
Validation set          5 videos                      —
Test set                5 videos                      —
Annotations             CUTs, strokes                 strokes on a subset
Table 1: Details of our datasets

4.2 Shot Boundary Detection

We tested a number of approaches for detecting the shot boundaries (CUTs) based only on the sum of the absolute values of the histogram differences of consecutive frames: global thresholding on grayscale/RGB differences [28], weighted-χ² differences [17], and simple classifiers applied to these features. The best performance on the sample dataset's test set was given by a random forest model applied to the grayscale histogram differences of consecutive frames (Table 3).
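For completeness, a hedged sketch of how such a CUT classifier could be trained with scikit-learn is given below. Treating the scalar difference of Eq. (1) as the only input feature is our assumption for the sketch; a small window of consecutive difference values would work the same way.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_cut_detector(diff_values, is_cut_labels, n_estimators=100):
    """Train a random forest on per-transition histogram-difference values.
    diff_values   : concatenated outputs of grayscale_hist_diffs() over the
                    training videos (one value per frame transition)
    is_cut_labels : 1 if the annotated CUT falls on that transition, else 0"""
    X = np.asarray(diff_values, dtype=float).reshape(-1, 1)
    y = np.asarray(is_cut_labels)
    return RandomForestClassifier(n_estimators=n_estimators).fit(X, y)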

Evaluation Criteria: We follow the TRECVid evaluation process for the detection of CUT transitions only; the details are specified in [24]. A false positive is referred to as an insertion, while a false negative is a deletion. The presence of gradual transitions, or setting a low threshold (in the case of global thresholding), creates insertions and, as a result, tends to reduce the overall precision, while setting the global threshold to a high value misses actual CUT boundaries and reduces the overall recall. Therefore, the F-score is a suitable evaluation criterion for CUT boundary detection.

4.3 Camera Models

The only assumption in our work that is specific to cricket, as illustrated in Figure 2, is that CAM1 and CAM1 + CAM2 video shots comprise our events of interest and therefore need to be localized. These video shots could be recognized by extracting a number of features, such as shape features, tracking of cricketing objects, textures, etc., but all of these may not be generally applicable to other sporting domains. We notice that the first frames of CAM1 shots are quite "similar", and the same is the case with the first frames of CAM2 shots. We chose to extract a fine-grained HOG feature vector from the grayscale first frames of CAM1 and CAM2 shots and train simple machine learning models on them. The sample datasets for these two models include about half positive samples and half negative samples each. These samples were extracted manually from a subset of the highlight sample videos dataset, and the negative frames are the first frames of some other random camera shots that are not cricket strokes. Figure 3 shows a few frames used for training the CAM models.
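A minimal sketch of this training step, assuming OpenCV's HOG descriptor and scikit-learn's LinearSVC, is given below. The window, block and cell sizes are hypothetical; the text only states that a fine-grained HOG descriptor of the grayscale first frame is used.

import cv2
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical HOG geometry: 64x64 window, 16x16 blocks, 8x8 stride and cells, 9 bins.
hog = cv2.HOGDescriptor((64, 64), (16, 16), (8, 8), (8, 8), 9)

def hog_feature(frame_bgr, size=(64, 64)):
    """HOG descriptor of the grayscale, resized first frame of a shot."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.resize(gray, size)
    return hog.compute(gray).ravel()

def train_cam_model(pos_frames, neg_frames):
    """Train a linear SVM on HOG features of positive/negative first frames."""
    X = np.stack([hog_feature(f) for f in pos_frames + neg_frames])
    y = np.array([1] * len(pos_frames) + [0] * len(neg_frames))
    return LinearSVC().fit(X, y)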

Figure 3: A few training examples from our CAM1 and CAM2 sample datasets

4.4 Filtering final predictions

The final predictions are of the form [s, e], where s and e are the starting and ending frame positions, respectively, of a predicted segment. The length (e − s) of a cricket stroke should be sufficiently large, considering the fact that the action takes time and the frame rate is constant at 25 FPS. A low value of (e − s) occurs mainly due to gradual transitions, fast camera motion, or advertisements occurring in between the telecast video. We apply a filtering step to remove any segment whose length falls below a chosen threshold. The results for different threshold values on the validation set samples are given in Figure 4. We set the threshold to the best-performing value from this sweep and use it when reporting our final accuracy on the test set samples.
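As a small illustration, the filtering step amounts to dropping short segments; the threshold min_len is the value chosen from the validation sweep of Figure 4.

def filter_segments(segments, min_len):
    """Keep only predicted [start, end] segments spanning at least min_len frames."""
    return [(s, e) for (s, e) in segments if (e - s) >= min_len]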

4.5 Temporal Intersection over Union (IoU)

The metric used for the evaluation of the localization task is the Weighted Mean Temporal IoU, motivated by [4]. If, for a specific untrimmed video v, the set of predicted segments is P_v = {p_1, ..., p_{m_v}} and the set of ground truth segments is G_v = {g_1, ..., g_{n_v}}, then the Weighted Mean TIoU is given by Equation 2, where the mean is taken over the TIoU values of the untrimmed videos, each weighted by the number of ground truth segments n_v in that video, and M is the total number of untrimmed videos in the test dataset. Each segment p_j or g_i is of the form [start segment position, end segment position]. In Equation 3, TIoU_v is calculated by taking the maximum overlap of each ground truth segment with all the predicted segments and vice versa.

Weighted Mean TIoU = ( Σ_{v=1}^{M} n_v · TIoU_v ) / ( Σ_{v=1}^{M} n_v )        (2)

TIoU_v = ( Σ_{i=1}^{n_v} max_j IoU(g_i, p_j) + Σ_{j=1}^{m_v} max_i IoU(g_i, p_j) ) / ( n_v + m_v )        (3)
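A small Python sketch of this metric, following the reconstruction of Equations 2 and 3 above (segment ends are treated as exclusive when computing interval lengths), could look as follows.

def interval_iou(a, b):
    """IoU of two 1-D segments a = [s1, e1] and b = [s2, e2]."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def tiou(preds, gts):
    """Per-video TIoU (Eq. 3): best overlap of each ground-truth segment with
    the predictions, and vice versa, averaged over all segments."""
    if not preds or not gts:
        return 0.0
    gt_best = [max(interval_iou(g, p) for p in preds) for g in gts]
    pr_best = [max(interval_iou(g, p) for g in gts) for p in preds]
    return (sum(gt_best) + sum(pr_best)) / (len(gts) + len(preds))

def weighted_mean_tiou(videos):
    """Eq. 2: per-video TIoU weighted by the number of ground-truth segments.
    `videos` is a list of (preds, gts) pairs, one pair per untrimmed video."""
    num = sum(len(gts) * tiou(preds, gts) for preds, gts in videos)
    den = sum(len(gts) for _, gts in videos)
    return num / den if den else 0.0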

4.6 A Note on Data Parallelism

When dealing with any large-scale data processing, parallelism is essential. The extraction of a single feature from the entire dataset may take weeks if performed serially. For our dataset of 273 GB, the serial extraction of grayscale histogram differences takes more than 10 days; extracting a fine-grained feature like HOG, which is computationally expensive, would take much longer. We followed a data parallelism approach for all feature extraction and prediction steps. The untrimmed videos were divided into batches, and each batch was parallelized over a fixed number of processes running on different cores. Table 2 shows the approximate time for some of the feature extraction operations, with different batch sizes and running on "#Jobs" cores in parallel.

Feature        Sorted Videos?   Batch Size   #Jobs   Time (approx.)
HDiffs Gray    Unsorted         —            —       — hours
HDiffs BGR     Unsorted         —            —       — hours
HDiffs BGR     Sorted           —            —       — hours
Wt-Diffs       Sorted           —            —       — hours
Table 2: Execution times with and without data parallelization. The "Sorted Videos?" column indicates whether the videos were sorted by size.
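A sketch of the batching scheme, using joblib (our choice for the illustration; any process pool would do), is given below. The batch_size and n_jobs values are illustrative; the actual settings are listed in Table 2.

from joblib import Parallel, delayed

def extract_in_batches(video_paths, extract_fn, batch_size=50, n_jobs=8):
    """Run a per-video feature extractor over batches of videos in parallel
    and collect the results keyed by video path."""
    results = {}
    for i in range(0, len(video_paths), batch_size):
        batch = video_paths[i:i + batch_size]
        feats = Parallel(n_jobs=n_jobs)(delayed(extract_fn)(p) for p in batch)
        results.update(dict(zip(batch, feats)))
    return results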

5 Results and Discussion

In this section, we describe the results of our individual models and our final predicted segments.

Shot Boundary Detection: CUTs detection worked best with the grayscale histogram difference feature. The number of histogram bins was set to the maximum (256), since any lower value decreased the accuracy. In the case of global thresholding, the weighted-χ² difference values perform better than the grayscale and RGB histogram differences. The accuracy of the trained random forest model exceeds the global thresholding approach by a large margin, and a random forest trained on the summed-up absolute grayscale histogram difference features works better than one trained on the weighted-χ² differences. Refer to Table 3 for the accuracy of the CUTs model on the test set of our Highlight videos dataset.

Model                 Feature         Dataset      Precision   Recall   F-score
SBD RF (Bins: 256)    HDiffs (Gray)   Highlights   —           —        —
SBD RF (Bins: 128)    HDiffs (Gray)   Highlights   —           —        —
Table 3: Evaluation results for the shot boundary detection model

Camera Models: We train linear SVMs on the HOG features of the first frames. The accuracy values for the two trained models are given in Table 4. Out of 111 test samples, there was 1 false negative and 1 false positive for CAM1, while for CAM2 there were 4 false positives and 1 false negative. For the final predictions, we used these models on the HOG features extracted from the first frames.

Model                Feature   #Test samples   Error
CAM1 (Linear SVM)    HOG       —               —
CAM2 (Linear SVM)    HOG       —               —
Table 4: Evaluation results for the CAM models.

TIoU: The weighted mean TIoU metric can be applied to any temporal localization task for untrimmed videos. A minimum value of 0 for this metric denotes that our predictions are completely off, while the maximum value of 1 denotes that we predict perfectly. Our predictions of the stroke segments were filtered, as explained in Section 4.4, and then evaluated on the validation set sample videos; refer to Figure 4 for the results on the validation set samples. The filtering parameter was set so that any predicted segment shorter than the chosen threshold is neglected. The final evaluation was carried out on the untrimmed videos taken from the test set of our full dataset, giving a weighted mean TIoU of 0.5097.

We also tried modifying Algorithm 1 to make predictions on the first 5 frames of a shot and keep only those CAM shots for which a certain number of the 5 predictions are positive, but in each case the accuracy was lower than what has been reported.

Figure 4: Evaluation on validation set samples after filtering out segments shorter than a given size

6 Conclusion and Future work

In this work, we demonstrated a simple approach for the extraction of similar sporting actions from telecast videos by performing experiments on a large-scale dataset of cricket videos. Here, our action of interest was a cricket stroke played by a batsman. The sequence of extraction steps that we followed may be generalized to other sporting events as well. Extracting cricket strokes is a temporal localization problem, for which the final accuracy (weighted mean TIoU) was 0.5097, which is reasonably accurate considering that we did not use any complicated approach, such as training CNNs or extracting motion information.

Our objective is to come up with a large-scale cricket actions dataset, which can be used to train deep neural networks for cricket video understanding. There is a lot of scope for improvement of our results, which is our future work. In addition, given the individual cricket stroke segments, we would like to cluster them into different types by looking at motion features. These clusters should represent the different types of cricketing strokes, such as strokes towards mid-wicket, towards long-off, etc.

References
