A Proposed Artificial Intelligence Model for Real-Time Human Action Localization and Tracking
Abstract- In recent years, artificial intelligence (AI) based on deep learning (DL) has sparked tremendous global interest. DL is widely used today and has expanded into various interesting areas. It is becoming more popular in cross-subject research, such as studies of smart city systems, which combine computer science with engineering applications. Human action detection is one of these areas. Human action detection is an interesting challenge due to its stringent requirements in terms of computing speed and accuracy. High-accuracy real-time object tracking is also considered a significant challenge. This paper integrates the YOLO detection network, which is considered a state-of-the-art tool for real-time object detection, with motion vectors and the Coyote Optimization Algorithm (COA) to construct a real-time human action localization and tracking system. The proposed system starts with the extraction of motion information from a compressed video stream and the extraction of appearance information from RGB frames using an object detector. Then, a fusion step between the two streams is performed, and the results are fed into the proposed action tracking model. The COA is used in object tracking due to its accuracy and fast convergence. The basic foundation of the proposed model is the utilization of motion vectors, which already exist in a compressed video bit stream and provide sufficient information to improve the localization of the target action without requiring high consumption of computational resources compared with other popular methods of extracting motion information, such as optical flows. This advantage allows the proposed approach to be implemented in challenging environments where the computational resources are limited, such as Internet of Things (IoT) systems. 
The experimental results obtained using the proposed model show its superiority with respect to various other online and offline systems in terms of accuracy and calculation time for the detection and tracking of human actions in various video sequences.
Artificial intelligence (AI) is intelligence exhibited by machines. In computer science, the field of AI research is defined as the study of “intelligent agents”: machines that can perceive their environment and act accordingly to maximize their chances of achieving a certain goal. Colloquially, the term “artificial intelligence” is used to refer to cases in which a machine imitates “cognitive” functions similar to those of humans. Until circa 2012, however, AI research was largely restricted to sophisticated technology businesses, governments, and research organizations. Since then, AI has moved from the hypothetical into real-world business solutions. Many of these solutions are enabled by the broad availability of graphics processing units (GPUs), which make parallel processing faster and cheaper.
AI now includes many sub-fields and relies on a variety of techniques, such as neural networks (e.g., brain modeling, time series prediction, classification), evolutionary computation (e.g., genetic algorithms, genetic programming), swarm intelligence (e.g., particle swarm optimization), machine vision (e.g., object recognition, image understanding), robotics (e.g., intelligent control, autonomous exploration), expert systems (e.g., decision support systems, teaching systems), speech processing (e.g., speech recognition and synthesis), natural language processing (e.g., machine translation), planning (e.g., scheduling, game playing) and machine learning (e.g., decision tree learning, version space learning). Most of these fields and techniques have aspects of both science and engineering.
Historically, researchers have turned to nature for guidance when designing AI systems. Not surprisingly, the first model to be explored was the one most familiar to us: our own brains. Starting with the perceptrons of the 1950s and continuing to this day, neural networks and other neurologically inspired architectures have been the dominant models in AI research. Recently, neural-based deep learning (DL) has become one of the preeminent machine learning techniques based on learning data representations. DL (also known as deep structured learning, hierarchical learning or deep machine learning) is the study of artificial neural networks and associated machine learning algorithms containing more than one hidden layer. A deep network has many layers between its input and output (and these layers are not made up of neurons, although it may help to think of them that way). In DL, a computer model learns directly from images, text, or sound to perform classification tasks. DL models can attain state-of-the-art accuracy, sometimes surpassing human performance. Models are trained using large sets of labeled data and neural network architectures that contain many layers. Various DL architectures exist, such as deep neural networks, convolutional deep neural networks, deep belief networks and recurrent neural networks.
Besides neural networks, there is another sub-field of nature-inspired AI, namely swarm intelligence. Billions of years of evolution have generated at least one alternative way of constructing high-level intelligence that is not neural but rather collective.
Swarm intelligence methods are based on the study of collective behavior in distributed systems. Such a system is made up of a population of simple agents that interact locally with each other and with their environment. The scheme is initialized with a population of individuals (i.e., candidate solutions). These individuals are then modified over several iterations by imitating the social behavior of insects or animals in an attempt to find the optimum in the problem space. Exploration of the search space is enhanced by changing each candidate solution according to the past experience of the individual and its interactions with other members of the population and with the surroundings.
This work aims to utilize recent research in various AI sub-fields to provide an automated solution for human action localization and tracking in real-time videos. The problem of automated action understanding is becoming increasingly relevant as the enormous technological developments occurring in daily life are giving rise to a need for end-to-end security and monitoring solutions that can function in challenging environments. Localization of real-time action corresponds to the task of simultaneously locating actions and detecting their classes in real-time from input video streams. Real-time action localization is a challenging problem that requires expensive features that are difficult or impossible to extract due to the real-time processing requirements and device limitations.
Advances in the localization and tracking of human behavior are linked to advances in research fields such as object identification, human dynamics, domain adaptation and semantic segmentation. Over the years, the associated techniques have developed from early systems, whose applications were often restricted to certain controlled settings, to sophisticated solutions that can learn from millions of videos and be applied to almost all daily operations. Given the wide spectrum of associated applications, from video surveillance to human-computer interaction, scientific milestones in action localization are being reached increasingly quickly, so that techniques that used to be efficient become obsolete within an ever-shorter timeframe.
Enabled by rapid developments in AI and machine learning and by the success that has been achieved using DL approaches in the processing of still images for tasks such as human action recognition, video analysis tasks such as recognition and detection have been evolving from the relatively simple classification of present states to the prediction of future states. The localization and classification of actions must be performed even before the actions are fully observed. Many successful solutions have been introduced for both real-time action localization and offline action localization. Most of these solutions follow the two-stream approach, in which there are two input streams, i.e., appearance and motion streams, with appearance information being extracted from RGB frames and motion information being extracted from the optical flows of the input. Such a two-stream architecture can achieve excellent performance for action localization, with high accuracy and speed.
This paper presents a new model that enables the detection and tracking of human actions and actors in real-time videos. The resource requirements of the proposed approach are low compared with those of previous approaches, allowing the proposed model to be implemented in challenging environments where computing resources are limited, such as Internet of Things (IoT) systems. This model can provide a complete solution for surveillance scenarios in which there is a need to detect actions and also track the corresponding actors in an environment with occlusion. This is considered an important issue in security systems.
Contributions. The main contributions of this work can be summarized as follows:
First: Propose a novel real-time action localization model that can be implemented in challenging environments (e.g., IoT systems).
Second: Address the challenges of optical flow methods by using motion vectors instead of optical flows to extract motion information, while still achieving good accuracy compared with state-of-the-art approaches, in which the calculation of the optical flows is the most expensive step and prevents real-time performance.
Third: Ensure that the proposed model can be implemented in challenging environments, such as an IoT environment, in which the available resources and speed are limited compared with DL setups that use powerful CPU and GPU devices.
Fourth: Improve the accuracy of the motion detection network by training an optical flow detection network and then using the weights of this network for transfer learning to a motion-vector-based detection network.
Fifth: Incorporate the Coyote Optimization Algorithm (COA) to track, in real time, the actor performing each detected action, and introduce a new COA-based tracking model.
The rest of the paper is structured as follows. In Section 2, the basics and background related to the proposed approach are first discussed, followed by a discussion of recent state-of-the-art approaches for action detection and localization in Section 3. The proposed system is presented in Section 4, and the results are discussed in Section 5. Section 6 summarizes the main findings of this paper.
II Basics and Background
II-A Coyote Optimization Algorithm (COA)
The Coyote Optimization Algorithm (COA) is a population-based algorithm inspired by the behavior of coyotes (the species Canis latrans); it is classified as both a swarm intelligence and an evolutionary heuristic. In the COA, the coyote population is split into $N_p$ packs with $N_c$ coyotes each. In this first version of the algorithm, the number of coyotes per pack is static and identical for all packs, so the total population size is the product of $N_p$ and $N_c$. For simplicity, solitary (or transient) coyotes are not considered. Each coyote is a feasible solution to the optimization problem, and the quality of its social condition is given by the cost of the objective function. The COA mechanism is designed around the coyotes' social conditions, which correspond to the decision variables of a global optimization problem. Thus, the social condition soc (set of decision variables) of the cth coyote of the pth pack at the tth instant of time is written as:
$$soc_c^{p,t} = \vec{x} = (x_1, x_2, \ldots, x_D)$$
This also determines the coyote's adaptation to the environment (the objective function cost), $fit_c^{p,t} \in \mathbb{R}$. The first step in the COA is to initialize the global population of coyotes. Because the COA is a stochastic algorithm, the initial social condition of each coyote is determined randomly, by assigning random values within the search space to the cth coyote of the pth pack in the jth dimension:
$$soc_{c,j}^{p,t} = lb_j + r_j \, (ub_j - lb_j)$$
where $lb_j$ and $ub_j$ are the lower and upper bounds of the jth decision variable, respectively, D is the dimension of the search space, and $r_j$ is a real random number in the range [0, 1]. The coyotes' adaptation to their current social conditions is then evaluated:
$$fit_c^{p,t} = f(soc_c^{p,t})$$
The COA considers only one alpha per pack, namely the coyote that is best adapted to the environment. For a minimization problem, the alpha of the pth pack at the tth instant of time is defined as:
$$alpha^{p,t} = \{\, soc_c^{p,t} \mid \arg\min_{c \in \{1, \ldots, N_c\}} f(soc_c^{p,t}) \,\}$$
Because of the clear signs of swarm intelligence in this species, the COA assumes that the coyotes are organized enough to share their social conditions and contribute to the maintenance of the pack. Thus, the COA pools the information from all coyotes and computes the cultural tendency of the pack as the median social condition of all coyotes of that pack:
$$cult_j^{p,t} = \begin{cases} O^{p,t}_{\frac{N_c+1}{2},\,j}, & N_c \text{ is odd} \\[4pt] \dfrac{O^{p,t}_{\frac{N_c}{2},\,j} + O^{p,t}_{\frac{N_c}{2}+1,\,j}}{2}, & \text{otherwise} \end{cases}$$
where $O^{p,t}$ represents the ranked social conditions of all coyotes of the pth pack at the tth instant of time, for each j in the range [1, D]. To compute the cultural influence, the COA assumes that coyotes are under the alpha influence ($\delta_1$) and the pack influence ($\delta_2$):
$$\delta_1 = alpha^{p,t} - soc_{cr_1}^{p,t}, \qquad \delta_2 = cult^{p,t} - soc_{cr_2}^{p,t}$$
where $cr_1$ and $cr_2$ are randomly selected coyotes of the pack. The coyote's new social condition is updated using the alpha and the pack influence:
$$new\_soc_c^{p,t} = soc_c^{p,t} + r_1 \, \delta_1 + r_2 \, \delta_2$$
where $r_1$ and $r_2$ are the weights of the alpha and the pack influence, respectively, initially defined as random numbers in the range [0, 1]. The new social condition is then evaluated:
$$new\_fit_c^{p,t} = f(new\_soc_c^{p,t})$$
The coyote's cognitive capacity decides whether the new social condition is better than the old one, in which case it is kept:
$$soc_c^{p,t+1} = \begin{cases} new\_soc_c^{p,t}, & new\_fit_c^{p,t} < fit_c^{p,t} \\ soc_c^{p,t}, & \text{otherwise} \end{cases}$$
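To make the update rules of this subsection concrete, the following is a minimal sketch of one update pass over a single pack, written in plain Python. The function names (`init_pack`, `coa_pack_step`), the use of `statistics.median` for the cultural tendency, and the clipping of new positions to the bounds are assumptions made for illustration; the sketch covers only the steps described above and is not the authors' implementation.

```python
import random
from statistics import median

def init_pack(n_c, lb, ub, rng):
    """Random initial social conditions: lb_j + r_j * (ub_j - lb_j)."""
    return [[l + rng.random() * (u - l) for l, u in zip(lb, ub)]
            for _ in range(n_c)]

def coa_pack_step(soc, fit, f, lb, ub, rng):
    """Update every coyote of one pack once (requires at least 3 coyotes)."""
    n_c, dim = len(soc), len(lb)
    # Alpha: the coyote with the lowest objective cost (minimization).
    alpha = list(soc[min(range(n_c), key=fit.__getitem__)])
    # Cultural tendency: per-dimension median of the pack.
    cult = [median(s[j] for s in soc) for j in range(dim)]
    for c in range(n_c):
        # Two distinct random coyotes, different from the current one.
        cr1, cr2 = rng.sample([i for i in range(n_c) if i != c], 2)
        r1, r2 = rng.random(), rng.random()
        new_soc = []
        for j in range(dim):
            delta1 = alpha[j] - soc[cr1][j]  # alpha influence
            delta2 = cult[j] - soc[cr2][j]   # pack influence
            x = soc[c][j] + r1 * delta1 + r2 * delta2
            new_soc.append(min(max(x, lb[j]), ub[j]))  # assumption: clip to bounds
        new_fit = f(new_soc)
        if new_fit < fit[c]:  # keep the new condition only if it is better
            soc[c], fit[c] = new_soc, new_fit
    return soc, fit
```

Because a new social condition is accepted only when it lowers the cost, the best cost in the pack can never get worse from one pass to the next.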
II-B Motion Vector
Motion vectors were originally proposed for video encoding, where they store the image changes from one frame to the next. The motion data of the respective image blocks are used to decrease the video bit rate. We can use these vectors to detect and track motion and to replace traditional motion information extractors such as optical flow.
A motion vector is similar to an optical flow. Both are two-dimensional vectors that describe the motion of corresponding pixels between two consecutive frames. Unlike optical flows, motion vectors are widely used in video coding standards such as HEVC and H.264. Motion vectors are available in a compressed video stream and can be obtained directly at almost no computational cost. This property makes motion vectors an attractive substitute for optical flows for efficient action analysis.
In video compression standards such as H.264 and HEVC, the input video frames are coarsely divided into macro-blocks (MBs), which form the basis for inter (and intra) prediction. Inter-predicted MBs are (optionally) partitioned into blocks that are predicted via motion vectors representing the displacement from matching blocks in previous or subsequent frames. MB motion information can be extracted from a compressed video bitstream using various available libraries.
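As an illustration of how block-level motion information can be turned into frames that a detection network can consume, the sketch below paints each block's displacement into two dense images, one for the x component and one for the y component. The record layout `(x, y, w, h, dx, dy)` and the function name are hypothetical; real decoders expose motion vectors in their own formats, which would need to be mapped to this layout first.

```python
def rasterize_motion_vectors(blocks, height, width):
    """Paint block-level motion vectors into dense X and Y motion frames.

    blocks: iterable of (x, y, w, h, dx, dy), where (x, y) is the top-left
    pixel of the block and (dx, dy) is its displacement (hypothetical layout).
    """
    mv_x = [[0.0] * width for _ in range(height)]
    mv_y = [[0.0] * width for _ in range(height)]
    for x, y, w, h, dx, dy in blocks:
        # Clamp the block to the frame so partial blocks at the border are safe.
        for row in range(max(y, 0), min(y + h, height)):
            for col in range(max(x, 0), min(x + w, width)):
                mv_x[row][col] = dx
                mv_y[row][col] = dy
    return mv_x, mv_y
```

Pixels that belong to no inter-predicted block keep a zero displacement, which is one simple convention for intra-coded regions.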
III Literature Review
This section reviews related work on the problem of action detection and tracking in a video sequence. We divide the main related work into three categories: 1) action recognition, 2) action localization, and 3) object tracking.
III-A Action Recognition
In this section, we briefly present the main families of methods for action recognition. Action recognition (classification) can be considered a video classification problem: it can be defined as assigning one of a set of predefined action classes to a video. Action recognition consists of three main steps: first, feature extraction; second, building a representation of the video based on the extracted features; and finally, classification of the video using this representation. There is a large amount of work on action recognition, including several recent surveys.
One work provides a real-time action recognition method with high performance and accuracy. It accelerates the two-stream CNN architecture to reach a speed of 390.7 FPS, but it addresses only the action recognition problem, not the action localization problem.
Another work deals with human action recognition from video sequences. Motivated by the exemplary results obtained via deep learning and automatic feature learning approaches in computer vision, the authors focus on learning salient spatial features via a convolutional neural network (CNN) and then map their temporal relationships using Long Short-Term Memory (LSTM) networks.
III-B Action Localization
Action localization, also called action detection, refers to the problem of recognizing actions together with their extent. In this paper, we concentrate on human action localization in time and space. We also consider, as a perspective, actions performed by animals. Significant attention has been given to action localization in the last few years. Action localization focuses on detecting the actions inside videos. It has been the subject of fewer studies than action recognition, and it is a much more demanding problem: successful action localization requires the action class to be correctly recognized and its spatio-temporal location to be identified. Action recognition can therefore be considered a sub-problem of action localization.
In this section, we review most techniques for online and offline action localization.
III-B1 Offline Action Localization
In the offline mode, the full video is available, so sub-actions can be identified, all frames can be processed, and processing time is not a challenge.
One work introduces models that localize and classify the actions in a video using convolutional neural networks (CNNs) on kinematic and static cues. Motion saliency is used to eliminate regions that are unlikely to contain the action, and spatiotemporal feature representations are extracted to build strong classifiers. Two CNNs, similar to R-CNN, are trained for the action detection task: a spatial CNN, which captures the actor's appearance and cues from the scene, and a motion CNN, which operates on the optical flow and captures motion patterns and the movement of the actor. The final prediction is made by specially trained SVM classifiers, which are trained on the spatiotemporal representations produced by both CNNs. The method is tested on the UCF Sports and J-HMDB datasets. On UCF Sports, at a threshold of 0.6, the authors report an improvement of 87.3% and a mean AUC of 41.2% in comparison with other approaches; on J-HMDB, a larger dataset, they achieve an accuracy of 62.5%.
Another work introduces a strategy for producing 2D+t sequences of bounding boxes, called tubelets. It makes two contributions: first, super-voxels are used instead of super-pixels to produce spatio-temporal shapes; second, independent motion evidence is used as a feature to distinguish the action motion from the background motion. The approach is tested on two datasets, UCF Sports and MSR-II. On UCF Sports, it obtains an accuracy of 80.24%. On MSR-II, the tubelets are reported to perform significantly better, with 46.0% for Boxing, 31.4% for Handclapping and 85.8% for Hand-waving.
A further work aims at predicting the action class of a partly observed video before the action ends. It develops a new learning formulation to capture temporal evolution. For the early recognition of incomplete actions, the authors propose a new multiple temporal scale support vector machine (MTSSVM), formulated on the basis of the structured SVM. They test the proposed MTSSVM approach on three datasets: the UT-Interaction dataset (UTI) Set 1 (UTI 1) and Set 2 (UTI 2), and the BIT-Interaction dataset (BIT). On UTI 1, the approach obtains a recognition accuracy of 78.33% when only half of the frames of the test videos are observed, and 95% with full videos. On UTI 2, the MTSSVM achieves 75% accuracy when only half of the frames of the test videos are observed, and 83.33% with full videos. On BIT, the method achieves a recognition accuracy of 60.16% when only the first 50% of the frames of the test videos are observed.
Finally, one work introduces a method that aims to obtain temporal action segments from untrimmed videos. It proposes a framework for scoring temporal segments according to how likely they are to contain an action. The method is tested within a standard activity detection framework on two datasets, the MSR-II Action dataset and the THUMOS 2014 Detection Challenge dataset. On MSR-II, it achieves a score of 60.3, compared with the 54.5% obtained by APT. On the THUMOS 2014 detection task, it achieves 13.5%, compared with the 14.3% obtained by the top performer in THUMOS 2014.
III-B2 Online Action Localization
In online action localization, the action localization problem becomes an action detection and prediction problem. The video stream is processed frame by frame, and processing time is a challenge due to the real-time constraints. Little work has been done in this setting, and much remains to be done.
One work introduces online action localization in streaming video to localize and predict actions in an online manner. Pose estimation is used to learn a mid-level, super-pixel-based foreground model at every instant, and dynamic programming on SVM scores is used to predict the confidences and labels of action segments. The approach is tested on two datasets, JHMDB and UCF Sports, for various overlap thresholds (10%-60%) of the observed video. For the JHMDB dataset, the results initially improve but then deteriorate at an observation percentage of 60%. For UCF Sports, localization improves and then worsens unexpectedly at an observation percentage of 15%.
Another work provides a real-time system for action detection, with a method for detecting human actions in 3D skeleton sequences. The system extracts features from 3D skeleton data and performs action detection using a support vector machine (SVM) classifier with a linear kernel. Feature selection is performed using the recursive feature elimination algorithm (SVM-RFE). The results are tested on two datasets, MSRC-12 and G3D. On MSRC-12, improvements of up to 7.7% are reported, with 16.5% for the “image and text” modality and 8.8% for the “text” modality. On G3D, the result for the “Fighting” action reaches 0.937.
A further work introduces a random forest (RF)-based online action detection framework. It uses convolutional neural network (CNN)-based features obtained from the raw RGBD images together with the relationships between the temporal contexts present in the past and future frames. A random forest algorithm is used in which the refined RF parameters are learned with the aid of these contexts. The results are tested on three datasets: MSR Action3D, G3D and OAD. On the MSR Action3D dataset, in the unsegmented setting, the method obtains an improvement of at most 6% compared with other methods. On G3D, the 'RF' baseline is less competitive than the compared approaches. On OAD, the performance of the 'RF' baseline is similar to that of the compared approaches.
A major step forward was achieved by a work that successfully integrates SSD into its approach, providing the best speed and accuracy among the compared approaches.
III-C Object Tracking
Object tracking can be formulated as an optimization problem: estimating a target's locations and movement in a video sequence, given the target's initial position in the first frame. The optimization methods used can be classified as either deterministic or stochastic. Deterministic methods, such as the Snakes model, Mean-shift and Trust region, iteratively search for the optimal solution of the objective function. Stochastic methods, such as the Kalman filter (KF) and the particle filter (PF), are generally fast, but their efficiency is limited in many real-world object tracking systems. For example, the KF cannot deal with non-Gaussian or nonlinear problems, and the PF tends to suffer from heavy computing costs because it requires a large number of particles to represent the posterior density of the object state.
Over the last few years, many researchers have used particle swarm optimization (PSO) in object tracking to overcome the limitations of the conventional KF and PF. PSO is a population-based stochastic heuristic algorithm designed to tackle complex optimization problems. It is motivated by the behavior of natural swarms, such as bird flocking and fish schooling. One work applies PSO to object tracking by means of pixel-flying particles, where the tracking of objects emerges from the interaction between the particles and their environment. Another implements a sequential PSO strategy by integrating temporal continuity information into PSO. A third implements a classified PSO algorithm that applies different search strategies to particles based on their fitness values. Overall, owing to its rapid exploration capabilities, PSO is an efficient way to track objects. Premature convergence, however, is an important issue with PSO. Once particles converge prematurely to a specific region of the search space, the entire swarm stagnates. This leads to poor results, especially if there is partial or complete occlusion in the video sequence, because the swarm cannot escape sub-optimal solutions. Some PSO variants add one or more terms to the PSO equations to enhance tracking efficiency. However, this increases the complexity of PSO, so the computational cost is generally high.
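For reference, the stagnation issue can be seen directly in the canonical inertia-weight form of the PSO update, written here in generic notation that is not taken from any of the trackers reviewed above: $x_i$ and $v_i$ are the position and velocity of particle $i$, $p_i$ is its personal best, $g$ is the swarm's global best, $\omega$ is the inertia weight, $c_1$ and $c_2$ are acceleration coefficients, and $r_1, r_2$ are random numbers in [0, 1].

```latex
\begin{aligned}
v_i &\leftarrow \omega\, v_i + c_1 r_1 \,(p_i - x_i) + c_2 r_2 \,(g - x_i), \\
x_i &\leftarrow x_i + v_i .
\end{aligned}
```

When the swarm has collapsed onto one region, $x_i \approx p_i \approx g$ for every particle, so both attraction terms vanish and, for $\omega < 1$, the velocities decay geometrically. The particles then have no mechanism for leaving that region, even if the target reappears elsewhere after an occlusion.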
IV Proposed Method of Human Action Localization and Tracking
The localization of human actions of interest across multiple frames and the tracking of the actors performing these actions are still challenging tasks. To this end, the proposed model attempts to satisfy all requirements of the real-time human action localization and tracking problems and to ensure that any human action captured in a video will be localized and that the corresponding actor will be successfully tracked. Additionally, we attempt to improve the convenience of the model with respect to IoT system requirements. With this model, we introduce a fully real-time approach for human action localization and tracking in videos based on learning to directly detect an action and predict its class. Our input is a video that consists of a sequence of frames captured over time; we will process this video frame by frame. We can think of our video as a volume of data in which each data frame is a function of space and time.
IV-A Action Localization and Tracking Model
For the first phase of real-time human action localization and tracking, we adopt a two-stream architecture, as shown in Figure 1. The proposed real-time human action localization model consists of a video decoder and a two-stream architecture (in which each stream contains a YOLO or SSD detection network). The video decoder takes a compressed video stream as input and directly collects the RGB and motion vector frames during the decoding stage. The RGB and motion vector frames are then fed into two distinct streams to perform real-time action localization and prediction for each frame. The primary distinction between our proposed technique for real-time human action localization and other methods is that our technique does not involve optical flow computations during these steps. This allows our model to be more suitable for use in an IoT setting.
Additionally, YOLO or SSD is used in our framework to obtain high-level motion and appearance data. Optical flow computations, which are considered the most time-consuming part of past techniques, are absent from our real-time action localization model, making it more convenient to execute in real-time situations.
First, we extract the current RGB frame and the current motion vector (x, y) frames and input each frame into the corresponding network, which has been trained on data of the same type. Each detection network outputs the class of the detected action and its bounding box, if found. We take the information related to the action location, which is the current location of the actor, and input it into the human tracking model (a COA-based tracker), which uses this location information to extract the features of the target actor to be tracked and start the tracking process.
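The per-frame flow described above can be summarized as a short control loop. The sketch below is only a structural outline: every callable (`rgb_detector`, `mv_detector`, `fuse`, `tracker`) and the `(label, box, score)` detection format are hypothetical placeholders for the decoder outputs, the detection networks, the fusion step and the COA-based tracker, none of which are specified at this level in the text.

```python
def localize_and_track(frames, rgb_detector, mv_detector, fuse, tracker):
    """Process a decoded stream frame by frame.

    frames yields (rgb_frame, mv_frame) pairs from the video decoder.
    Each detector returns a list of (label, box, score) detections.
    """
    results = []
    for rgb_frame, mv_frame in frames:
        appearance = rgb_detector(rgb_frame)      # appearance stream
        motion = mv_detector(mv_frame)            # motion stream
        detections = fuse(appearance, motion)     # fused action locations
        tracks = tracker(rgb_frame, detections)   # tracker seeded by locations
        results.append((detections, tracks))
    return results
```

Keeping the four stages as interchangeable callables mirrors the description above, in which either YOLO or SSD can serve as the detection network without changing the rest of the pipeline.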
IV-B Detection Network
In this paper, we use YOLO, a recent approach to object detection, for bounding box prediction and classification. YOLO treats object detection in each frame as a regression problem and outputs spatially separated bounding boxes and associated class probabilities. YOLO is faster than region-proposal-based methods because it uses a single network to predict the bounding boxes and class probabilities in a single-pass evaluation. We have trained three YOLO detection networks: one to process the RGB frames, another to process the x components of the motion vectors, and a third to process the y components of the motion vectors. Additionally, following prior work, we have trained a detection network on optical flow frames and have used the weights of this network for transfer learning to the motion vector frames.
IV-C Motion Vector
This paper uses a motion-based detection network to improve the scores of the appearance-based detection network. Motion vectors, which already exist in a compressed video bitstream, provide sufficient information to indicate the approximate location of a target object. A motion vector is comparable to an optical flow and can be used for object tracking and person detection. Both are two-dimensional vectors used to describe the motion of corresponding pixels in two consecutive frames. In contrast to optical flows, however, motion vectors are commonly used in different video coding standards (e.g., HEVC and H.264). Thus, they are available from the compressed video stream and can be directly obtained at nearly no computational cost. This advantage makes motion vectors a powerful substitute for optical flows for efficient action analysis.
In prior work, a deep neural network framework based on motion vectors was presented to address the main difficulty arising from the noisy and inaccurate block-wise motion information offered by motion vector images. Motion vectors contain only block-level information and suffer from noisy and inaccurate motion information. Thus, training a CNN on motion vectors with high accuracy is difficult. Experiments demonstrate that directly using motion vectors in place of optical flow information leads to accuracy degradations of 7%, 10% and 26% on UCF101-24. Our aim is to take advantage of the real-time processing enabled by motion vectors to develop a model that is suitable for implementation in challenging environments, such as IoT systems, while still achieving a high detection accuracy comparable to that achieved using the optical flow approach. Driven by this motivation, we have applied the transfer learning approach to design multiple methods of leveraging the rich and fine-grained features that can be learned by an optical-flow-based network to improve our motion-vector-based network. These methods can be regarded as transferring the knowledge learned in the optical flow domain to the motion vector domain.
IV-D Fusion of Appearance and Motion
In the two-stream model, the motion features are processed separately and are then fused with the appearance stream. The fusion step is an important component of the two-stream model. Researchers have used various fusion methods, such as mean, max and multiplicative fusion. We consider two of these methods: mean fusion and max fusion. Algorithm 2 shows how our fusion algorithm works in detail.
For the motion stream, we first treat the Y motion as the basis stream and compare the X motion to it as follows: Let BMVX denote the X-motion-based detection box with the maximum overlap with a given Y-motion-based detection box, denoted by BMVY. If this maximum overlap, quantified in terms of the intersection over union (IoU), is greater than a given threshold , we calculate the fused box as follows:
The second approach is to let BMVY denote the Y-motion-based detection box with the maximum overlap with a given X-motion-based detection box, BMVX. Similarly, if this maximum overlap, quantified in terms of the IoU, is greater than a given threshold , we again calculate the fused box using equation 11.
To extract the final bounding box, let denote the fused motion-based detection box with the maximum overlap with a given appearance-based detection box, denoted by . If this maximum overlap, quantified in terms of the IoU, is greater than a given threshold , we calculate the final fused box as shown in equation 12 below. This fusion method takes the box from each stream and returns the average of the appearance and motion streams.
As an alternative, the max fusion method, shown in equation 13, takes the box from each stream and returns the maximum between the appearance and motion streams. In this method, to ensure that the max fusion result will take advantage of the complementary contributions of the two streams, we take the maximum between the boxes at each point, not the maximum area.
Algorithm 2 shows the sequence followed in our approach to fuse the different streams to reach the required high accuracy. Here, we attempt to increase the accuracy by fusing the different bounding boxes identified from the different streams.
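The two fusion rules described above can be illustrated with a minimal Python sketch. It is not the authors' Algorithm 2: the corner box format (x1, y1, x2, y2), the function names, and the choice to keep the appearance box when the overlap does not exceed the threshold are all assumptions made for illustration. The max rule is interpreted literally as an element-wise maximum over the box coordinates.

```python
from typing import List, Tuple

# Assumed box format: (x1, y1, x2, y2) corner coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_fuse(a: Box, b: Box) -> Box:
    """Mean fusion: average the two boxes coordinate by coordinate."""
    return tuple((p + q) / 2 for p, q in zip(a, b))


def max_fuse(a: Box, b: Box) -> Box:
    """Max fusion: maximum at each coordinate, not the larger-area box."""
    return tuple(max(p, q) for p, q in zip(a, b))


def fuse_streams(appearance_box: Box, motion_boxes: List[Box],
                 threshold: float, mode: str = "mean") -> Box:
    """Fuse an appearance box with its best-overlapping motion box."""
    if not motion_boxes:
        return appearance_box
    best = max(motion_boxes, key=lambda m: iou(appearance_box, m))
    if iou(appearance_box, best) <= threshold:
        # Assumption: without sufficient overlap, keep the appearance box.
        return appearance_box
    return mean_fuse(appearance_box, best) if mode == "mean" else max_fuse(appearance_box, best)
```

For example, `fuse_streams((0, 0, 10, 10), [(2, 2, 12, 12)], 0.2)` averages the two boxes because their IoU (about 0.47) exceeds the threshold, whereas a threshold of 0.5 would leave the appearance box unchanged.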
IV-E Object Tracking Using the COA
Object tracking is described as estimating the trajectory of a target object through a video frame sequence. The ultimate goal of human tracking is to automatically initialize and track all humans in a scene. In this paper, we apply the COA  for human actor tracking. Each actor being tracked is represented by a rectangular window centered on the middle point of the target person. These rectangular windows are obtained from our action localization and detection network after the fusion of the different streams to obtain the most accurate location of each action and the corresponding actor. Each such location is represented in the form P = (x, y, w, h), where the (x, y) coordinates are the centre of the box and the (w, h) parameters are the width and height of the box, respectively. We then apply the COA to search this four-dimensional feature space. Because the motion of an object is usually continuous, we can suppose that the target is moving to some new location that is near its location in the previous frame. First, a set of coyotes (individuals/solutions) is initialized in a sub-search space around the target object, which is defined as A = ( , , , ), where the ( , ) coordinates represent the central point of the sub-search space and the ( , ) parameters are its width and height, respectively, and are equal to (w, h), i.e., the width and height of the bounding box of the target object. The search space is updated in each frame in accordance with the following equations:
where and represent the target object’s velocities along the horizontal and vertical axes in the last frame and are obtained using eq. 8; the image extracted from each search window is then compared with the target object to evaluate the fitness of that coyote.
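As a rough illustration of the initialization step, the sketch below shifts the search window by the last-frame velocity and scatters candidate coyotes inside it. The update equations are not reproduced in this text, so the constant-velocity shift, the uniform sampling, and the names `shift_search_window` and `init_coyotes` are assumptions rather than the authors' formulation.

```python
import random
from typing import List, Tuple

# Assumed window format: (cx, cy, w, h), i.e. centre point plus width and height.
Window = Tuple[float, float, float, float]


def shift_search_window(prev_box: Window, velocity: Tuple[float, float]) -> Window:
    """Move the window centre by the target's last-frame velocity (assumed update)."""
    cx, cy, w, h = prev_box
    vx, vy = velocity
    return (cx + vx, cy + vy, w, h)


def init_coyotes(window: Window, n: int, rng: random.Random) -> List[Window]:
    """Place n candidate windows with centres drawn uniformly inside the search window."""
    cx, cy, w, h = window
    return [
        (cx + rng.uniform(-w / 2, w / 2), cy + rng.uniform(-h / 2, h / 2), w, h)
        for _ in range(n)
    ]


# Usage: one pack of candidates per tracked actor.
window = shift_search_window((100.0, 50.0, 20.0, 40.0), (3.0, -2.0))
pack = init_coyotes(window, 5, random.Random(0))
```

Each candidate keeps the target's width and height, matching the statement that the sub-search space inherits (w, h) from the target bounding box.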
A fitness function is used to quantify the strength of each coyote. It yields the fitness value (F), which measures the similarity of each candidate coyote to the target object and is used to determine, for each coyote, whether its new condition is better than its previous one according to equation 10. This equation represents the personal evaluation of each coyote, which will drive the coyote to move towards the best position that it has found so far. Thus, the velocity in the search space is dynamically adjusted according to the experience of each coyote or the collective experience of its pack.
There are many approaches for calculating the fitness value, most of which are based on histogram comparisons. In this paper, our aim is to develop a real-time tracking approach, meaning that the computational burden and processing time must be reduced. Therefore, we use an alternative approach  to measure the similarity between each coyote and the target object based on the L2 distance, which can be geometrically interpreted as the Euclidean distance between two vectors. This distance takes the following form:
where Ps denotes a pixel of the image captured by the search window of a coyote and Pt denotes a pixel of the target image. To find the fitness value F, the dl2 value is divided by , where Am represents the area of the search window for the target object, and is then normalized to the range of 0 to 1. A value of 0 indicates a complete mismatch, and a perfect match corresponds to a value of 1. That is, the higher the fitness value, the more similar the coyote’s search window is to the target object. The fitness function for a coyote is defined as follows:
In the tracking of a moving object, occlusion is one of the most challenging problems faced: the target may be covered by some background feature or may even disappear from the field of view and then re-enter the scene from an uncertain location. Using the technique presented in  for detecting and handling occlusion, we build a histogram-based target model from the beginning of the video sequence to the current time point, which we will replace with the target search window image. The target model is recalled and compared against the current best-adapted coyote, which will be the next alpha in the pack. If the likelihood of similarity is lower than some predefined occlusion detection threshold, denoted by FT, then the target is marked as lost and the search space is extended.
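A distance-based fitness of this kind can be sketched in a few lines of Python. The exact normalization term is not reproduced in this text, so the sketch assumes 8-bit pixels and divides by the largest possible distance for a window of the given size; that divisor, and the function names, are illustrative assumptions rather than the paper's equation.

```python
import math
from typing import Sequence


def l2_distance(ps: Sequence[float], pt: Sequence[float]) -> float:
    """Euclidean distance between two flattened pixel vectors of equal length."""
    if len(ps) != len(pt):
        raise ValueError("candidate and target must contain the same number of pixels")
    return math.sqrt(sum((s - t) ** 2 for s, t in zip(ps, pt)))


def fitness(ps: Sequence[float], pt: Sequence[float], max_pixel: float = 255.0) -> float:
    """Similarity in [0, 1]: 1 is a perfect match, 0 a complete mismatch.

    Assumed normalization: divide by the worst-case distance
    max_pixel * sqrt(number of pixels), then subtract from 1.
    """
    if len(pt) == 0:
        raise ValueError("target must contain at least one pixel")
    worst = max_pixel * math.sqrt(len(pt))
    return 1.0 - l2_distance(ps, pt) / worst
```

Because the computation is a single pass over the pixels with no histogram construction, it stays cheap enough to evaluate once per coyote per iteration.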
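The occlusion test described above reduces to a threshold check followed by a widening of the search window. The following sketch shows that control flow only; the growth factor of 1.5 and the function name `handle_occlusion` are hypothetical, since the text does not state how far the search space is extended.

```python
from typing import Tuple

# Assumed window format: (cx, cy, w, h).
Window = Tuple[float, float, float, float]


def handle_occlusion(best_fitness: float, window: Window, ft: float,
                     growth: float = 1.5) -> Tuple[bool, Window]:
    """Return (lost, window); enlarge the window when the best coyote scores below FT."""
    cx, cy, w, h = window
    if best_fitness < ft:
        # Target marked as lost: search a larger area around the same centre.
        return True, (cx, cy, w * growth, h * growth)
    return False, window
```

With `ft = 0.5`, a best fitness of 0.3 marks the target as lost and returns a window 1.5 times wider and taller, while a best fitness of 0.8 leaves the window unchanged.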
V Experimental results
In this experimental section, we first introduce the dataset used in our experiments and then analyse the experimental results for the proposed model of human action detection and tracking.
We evaluate our model on UCF-101-24, a subset of UCF-101, which is one of the largest and most diversified and challenging action datasets. UCF-101 includes a total of 101 action classes, which are divided into five types: Human-Object Interaction, Body-Motion, Human-Human Interaction, Sports, and Playing Musical Instruments. Each video includes only one category of action, but it may contain multiple action instances of the same action class (up to 12 in a video) with different temporal and spatial boundaries. UCF-101-24 consists of 24 of the 101 classes and includes spatio-temporal localization annotation in the form of bounding box annotations for human targets, released as part of the THUMOS-2015 challenge.
V-A1 Experimental Settings
The goal is for our model to be suitable for implementation in challenging environments such as IoT environments; thus, the training of our model was a one-time process. We used a machine with a Tesla K80 GPU, an Intel Xeon CPU and 14 gigabytes of RAM. We considered the RGB frames first and trained the appearance model; this phase took three days to complete. Then, we used the method introduced in  to generate optical flow frames from the RGB frames and used these frames to train our motion stream detection network. Finally, we used the motion vector frames generated by the video compression tools and applied transfer learning to extend the trained optical flow weights to a motion-vector-based motion model.
V-B Human Action Detection and Localization Results
Experiments were designed to study the effectiveness of the appearance stream, the motion stream and the two streams after fusion. For the fusion of the motion streams using equation 11, we found that the first approach achieves an accuracy of up to 23.25% at a threshold of 0.2, whereas the accuracy of the second approach reaches 25.93% at the same threshold; thus, we chose to use the second approach to obtain all further results. When we use equation 12 to fuse the motion and appearance streams, an accuracy of 72.12% is achieved at a threshold of 0.2, and an accuracy of 62.74% at a threshold of 0.5.
Figure 2(a, b and d) shows the inputs to our model obtained from the video decoder, which converts a video stream into motion vector frames and RGB frames. Figure (2-a) shows the X-component motion vector information, Figure (2-b) shows the Y-component motion vector information, and Figure (2-c) shows the corresponding optical flow frame, which is not an input to our network but rather is the output of  . We trained and tested an initial motion detection network on the optical flow information and then used the trained weights to perform transfer learning on the motion vector dataset. Figure (2-d) shows the RGB frame used to detect the appearance information for our action model.
Table I shows the class-specific average precision (AP) in % achieved for the videos in each action category of UCF-101-24 by using the detection networks based on the appearance stream and the two fused motion vector streams separately and by using the multi-stream (appearance and motion) fusion model.
The results were generated using a threshold of 0.20. For 12 of the 24 action classes, our appearance + motion fusion technique yields the best APs. The appearance-based detection network alone achieves the best APs for 11 of the 24 classes.
|Saha et al. ||39.6||49.7||66.9||73.2||14.1||93.6|
|Singh et al. ||42.0||64.6||73.7||75.2||41.5||100.0|
|Saha et al. ||85.9||99.8||68.3||94.1||63.1||57.2|
|Singh et al. ||86.5||97.9||62.1||96.0||77.6||69.7|
|Saha et al. ||75.1||89.6||31.1||85.1||79.6||96.1|
|Singh et al. ||76.1||96.1||22.2||87.4||81.0||87.1|
|Saha et al. ||89.1||63.2||33.6||52.7||20.9||75.6|
|Singh et al. ||82.4||62.1||37.4||59.4||21.7||85.1|
Table II presents the results we obtained on UCF-101-24. To demonstrate the strength of our model in localizing and predicting the actions of different actors in different environments, we compare it with the top offline and online competitors , , , and  in terms of detection performance.
All of the results reported in Table II for other competitors were obtained using the offline approach, although  also presented an online model with a speed of 28 fps and a mean AP (mAP) of 70.2%. All of the results of our model were obtained online, at a speed of 57 fps for our appearance-stream-only method and at a speed of more than 52 fps for multi-stream fusion.
|Weinzaepfel et al. ||46.8||–||–||–|
|Peng and Schmid ||73.5||32.1||02.7||07.3|
|Saha et al. ||66.6||36.4||07.9||14.4|
|Singh et al. ||73.5||46.3||15.0||20.4|
|Ours-A + M (Mean-Fuse)||72.12||62.74|
Figure 4 shows examples of the human action localization results obtained on UCF101-24 . Each row represents one test video clip from UCF-101. As shown in the top rows, our model can successfully localize more than one action per frame with high accuracy and speed. The results show the superiority of our model in detecting multiple actions in the same frame and in localizing and predicting small targets, as in the surfing examples.
Table III compares our detection speed to those reported by Singh et al.  and Saha et al. . Our model has a detection speed of 57 fps when using RGB frames only and a speed of 52 fps when using motion vector frames.
|Framework modules||A||A+M||A+M |
|Flow computation (ms)||-||2||7|
|Detection network time (ms)||14.9||14.9||21.8|
|Overall speed (fps )||57||52||34.7|
|not include time to generate tube|
V-C Results for Real-Time Human Tracking Using the COA
We test our model using the UCF-101-24  dataset. A value of  was used in the COA to determine whether to conduct a global search. A lower value of FT corresponds to a more global search, which generally requires more iterations and coyotes to cover a larger search space while maintaining tracking accuracy. A higher value of FT leads to a less global search and consequently fewer iterations and coyotes, which can improve the tracking speed. However, it will sometimes result in failure to handle occlusion problems.
Figure 4 shows the tracking results for our model, which has a speed of up to 20 fps when only the CPU is used for processing. The model has the ability to track more than one actor by using multiple swarms, one for each actor.
V-D Tracking Results Obtained Using the COA vs. YOLO
In this section, we show the advantages and disadvantages of using a COA-based tracker as a tracking model by comparing it with a tracking model based on YOLO , which is based on an approach that we call tracking by detection because the detection network attempts to detect the target in each frame. As seen in Figure 5, the COA-based tracker achieves full accuracy and success in tracking the target object across different frames (248 out of 248 frames) and can also distinguish the target actor from other actors that collide with it, as shown in rows 3 and 4. However, the COA has a drawback that arises from the adopted fitness function: the tracker requires a clear target object, and it may fail in detecting the intended target and instead select another target that has a close similarity with the true target. Moreover, in the COA-based approach, an additional tracker is needed to track each additional object to be tracked (one tracker for each target actor). The COA has a target fitness function with a minimum threshold of = . Every value represents the best solution in each target frame.
As seen in Figure 6, YOLOv2  shows lower accuracy than the COA-based tracker in detecting the actor in every frame of the selected video. YOLO must detect every target actor while treating every frame as a new source; therefore, YOLO alone cannot identify that a given detected object is the same object as an object detected in the previous frame. YOLOv2 achieves good accuracy and success in detecting the target object in different frames (157 out of 248 frames).
As seen in Figure 7, YOLOv3  shows higher accuracy than YOLOv2 , successfully detecting the actor in every frame of the selected video. YOLOv3 also detects every target actor. YOLOv3 can distinguish objects from other objects that collide with them, as shown in rows 3 and 4; however, because it still treats every frame as a new source, YOLOv3 alone still cannot identify that a detected object is the same object as an object detected in the previous frame.
Tracking actors in real time requires both high speed and good accuracy. A COA-based tracker achieves good performance in little processing time but also has certain disadvantages. YOLOv3 achieves good accuracy in detecting target objects but treats each frame as a new source of data. In future work, we will attempt to combine YOLOv3 with the COA to develop a high-accuracy real-time tracker that can detect target objects efficiently and cope with the collision problem. Note that YOLOv3 achieves full accuracy and success in detecting the target object in different frames (248 out of 248 frames).
VI Conclusion and Future Work
Real-time action localization refers to the task of simultaneously localizing actions and identifying their classes from a video stream. It is a challenging problem that requires expensive features that are difficult or impossible to extract due to the real-time processing requirements. The localization and classification of actions must be performed even before the actions are fully observed. In this work, we have introduced a fast and accurate model that can achieve action detection with an accuracy of 72% (mAP) and a speed of 57 fps. Our model can simultaneously detect multiple actions. It also produces good results for action prediction in videos using information available in the current frame only. The proposed tracking model based on the COA has the ability to track more than one actor by using multiple swarms, one for each actor.
In future work, we will consider a multi-camera scenario in which it is necessary to detect an action and track the corresponding actor across different cameras as the actor moves from one place to another. We will also consider the tracking of multiple objects. Furthermore, we will combine the COA with YOLOv3 to develop an efficient tracker that can cope with collisions while maintaining a high detection accuracy rate.
-  P. Ongsulee. Artificial Intelligence, Machine Learning and Deep Learning, 2017 Fifteenth International Conference on ICT and Knowledge Engineering, Bangkok, Thailand, DOI: 10.1109/ICTKE.2017.8259629, 2017.
-  S.J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach, Third Edition, Prentice Hall Series in Artificial Intelligence, ISBN-10: 0-13-604259-7, Pearson Education, Inc., Upper Saddle River, New Jersey, 2010.
-  U. Michelucci. Applied Deep Learning: A Case-Based Approach to Understanding Deep Neural Networks, DOI:10.1007/978-1-4842-3790-8, Apress, Berkeley, CA, 2018.
-  Y. LeCun, Y. Bengio, G. Hinton. Deep learning. Nature, 521(7553), pp. 436-444, 2015.
-  R.S. Parpinelli, H.S. Lopes. New inspirations in swarm intelligence: a survey, Int. J. Bio Inspired Comput, 3, pp. 1-16, 2011.
-  B. K. Panigrahi, Y. Shi, and M.H. Lim . Handbook of Swarm Intelligence: Concepts, Principles and Applications, Springer-Verlag Berlin Heidelberg, DOI: 10.1007/978-3-642-17390-5, ISBN: 978-3-642-17390-5
-  P. Weinzaepfel, Z. Harchaoui, and C. Schmid. Learning to track for spatio-temporal action localization. In IEEE Int. Conf. on Computer Vision and Pattern Recognition, June 2015.
-  G. Gkioxari and J. Malik. Finding action tubes. In IEEE Int. Conf. on Computer Vision and Pattern Recognition, 2015.
-  Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y. and Berg, A.C., 2016, October. Ssd: Single shot multibox detector. In European conference on computer vision (pp. 21-37). Springer, Cham.
-  Singh, G., Saha, S., Sapienza, M., Torr, P.H. and Cuzzolin, F., 2017, October. Online Real-Time Multiple Spatiotemporal Action Localisation and Prediction. In ICCV (pp. 3657-3666).
-  X. Peng and C. Schmid. Multi-region two-stream R-CNN for action detection. In ECCV, 2016.
-  S. Saha, G. Singh, M. Sapienza, P. H. Torr, and F. Cuzzolin. Deep learning for detecting multiple space-time action tubes in videos. In BMVC, 2016.
-  J. Gemert, M. Jain, E. Gati, C. G. Snoek, et al. APT: Action localization proposals from dense trajectories. In BMVC, 2015.
-  Z. Shou, D. Wang, and S.-F. Chang. Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs. In CVPR, 2016
-  F. C. Heilbron, J. C. Niebles, and B. Ghanem. Fast Temporal Activity Proposals for Efficient Detection of Human Actions in Untrimmed Videos. In CVPR, 2016.
-  Jain, M.; Van Gemert, J.; Jégou, H.; Bouthemy, P.; Snoek, C. Action localization with tubelets from motion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014.
-  T. Brox, A. Bruhn, N. Papenberg, and J. Weickert. High accuracy optical flow estimation based on a theory for warping. In ECCV, 2004.
-  T. Brox and J. Malik. Large displacement optical flow: descriptor matching in variational motion estimation. TPAMI, 2011.
-  Redmon, J. and Farhadi, A., 2017. YOLO9000: better, faster, stronger. arXiv preprint.
-  Girshick, R., Donahue, J., Darrell, T. and Malik, J., 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 580-587).
-  Soomro, K., Zamir, A.R. and Shah, M., 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.
-  Rodriguez, M.D., Ahmed, J. and Shah, M., 2008, June. Action mach a spatio-temporal maximum average correlation height filter for action recognition. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on (pp. 1-8). IEEE.
-  Jhuang, H., Gall, J., Zuffi, S., Schmid, C. and Black, M.J., 2013. Towards understanding action recognition. In Proceedings of the IEEE international conference on computer vision (pp. 3192-3199).
-  L. Cao, Z. Liu, and T. S. Huang. Cross-dataset action detection. In CVPR, Jun. 2010.
-  Baek, S., Kim, K.I. and Kim, T.K., 2017, March. Real-time online action detection forests using spatio-temporal contexts. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV) (pp. 158-167). IEEE.
-  Ryoo, M., Aggarwal, J.: UT-Interaction Dataset, ICPR contest on Semantic Description of Human Activities, SDHA (2010)
-  Kong, Y., Jia, Y., Fu, Y.: Learning human interaction by interactive phrases. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part I. LNCS, vol. 7572, pp. 300–313. Springer, Heidelberg (2012).
-  Y.-G. Jiang, J. Liu, A. Roshan Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar. THUMOS challenge: Action recognition with a large number of classes. http://crcv.ucf.edu/THUMOS14/, 2014.
-  Kong, Yu, Dmitry Kit, and Yun Fu. "A discriminative model with multiple temporal scales for action prediction." European Conference on Computer Vision. Springer International Publishing, 2014.
-  V. Bloom, D. Makris, and V. Argyriou. G3d: A gaming action dataset and real time action recognition evaluation framework. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on, pages 7–12. IEEE, 2012.
-  A. Sharaf, M. Torki, M. E. Hussein, and M. El-Saban. Realtime multi-scale action detection from 3D skeleton data. In WACV, 2015.
-  W. Li, Z. Zhang, and Z. Liu. Action recognition based on a bag of 3D points. In CVPR Workshop, 2010
-  Y. Li, C. Lan, J. Xing, W. Zeng, C. Yuan, and J. Liu. Online human action detection using joint classification-regression recurrent neural networks. In ECCV, 2016.
-  B. Zhang, L. Wang, Z. Wang, Y. Qiao, and H. Wang. Realtime action recognition with enhanced motion vector cnns. CVPR, 2016.
-  D. Weinland, R. Ronfard, and E. Boyer. A survey of vision based methods for action representation, segmentation and recognition. CVIU, 115(2), 2011.
-  J. K. Aggarwal and M. S. Ryoo. Human activity analysis: A review. ACM Computing Surveys (CSUR), 43(3), 2011.
-  M. Jain, J. C. van Gemert, and C. G. Snoek. What do 15,000 object categories tell us about classifying and localizing actions? In CVPR, 2015.
-  L. Wang, Y. Qiao, and X. Tang. Video action detection with relational dynamic-poselets. In ECCV. 2014.
-  K. Soomro, H. Idrees and M. Shah, "Predicting the Where and What of Actors and Actions through Online Action Localization," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Trans. Circuits Syst. Video Techn., vol. 13, no. 7, pp. 560–576, Jul. 2003.
-  Liu, G., Chung, Y.Y. and Yeh, W.C., 2016, July. A simplified swarm optimization for object tracking. In Neural Networks (IJCNN), 2016 International Joint Conference on (pp. 169-176). IEEE.
-  Pierezan, J. and Coelho, L.D.S., 2018, July. Coyote Optimization Algorithm: A New Metaheuristic for Global Optimization Problems. In 2018 IEEE Congress on Evolutionary Computation (CEC) (pp. 1-8). IEEE.
-  Hu, J.S., Juan, C.W. and Wang, J.J., 2008. A spatial-color mean-shift object tracking algorithm with scale and orientation estimation. Pattern Recognition Letters, 29(16), pp.2165-2173.
-  Li, Z., Tang, Q.L. and Sang, N., 2008. Improved mean shift algorithm for occlusion pedestrian tracking. Electronics Letters, 44(10), pp.622-623.
-  Seo, K.H. and Lee, J.J., 2005, November. Real-time object tracking and segmentation using adaptive color snake model. In Industrial Electronics Society, 2005. IECON 2005. 31st Annual Conference of IEEE (pp. 5-pp). IEEE.
-  Liu, T.L. and Chen, H.T., 2004. Real-time tracking using trust-region methods. IEEE Transactions on Pattern Analysis & Machine Intelligence, (3), pp.397-402.
-  Maybeck, P.S., 1990. The Kalman filter: An introduction to concepts. In Autonomous robot vehicles (pp. 194-204). Springer, New York, NY.
-  Cuevas, E.V., Zaldivar, D. and Rojas, R., 2005. Kalman filter for vision tracking.
-  Huang, Y., Huang, T.S. and Niemann, H., 2002, June. Segmentation-based object tracking using image warping and kalman filtering. In Image Processing. 2002. Proceedings. 2002 International Conference on (Vol. 3, pp. 601-604). IEEE.
-  Arulampalam, M.S., Maskell, S., Gordon, N. and Clapp, T., 2002. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on signal processing, 50(2), pp.174-188.
-  Yang, C., Duraiswami, R. and Davis, L., 2005, October. Fast multiple object tracking via a hierarchical particle filter. In Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on (Vol. 1, pp. 212-219). IEEE.
-  Cho, J.U., Jin, S.H., Dai Pham, X., Jeon, J.W., Byun, J.E. and Kang, H., 2006, October. A real-time object tracking system using a particle filter. In Intelligent Robots and Systems, 2006 IEEE/RSJ International Conference on (pp. 2822-2827). IEEE.
-  Kennedy, J., 2006. Swarm intelligence. In Handbook of nature-inspired and innovative computing (pp. 187-219). Springer, Boston, MA.
-  Anton-Canalis, L., Hernandez-Tejera, M. and Sanchez-Nielsen, E., 2006, October. Particle swarms as video sequence inhabitants for object tracking in computer vision. In Intelligent Systems Design and Applications, 2006. ISDA’06. Sixth International Conference on (Vol. 2, pp. 604-609). IEEE.
-  Zhang, X., Hu, W., Maybank, S., Li, X. and Zhu, M., 2008, June. Sequential particle swarm optimization for visual tracking. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on (pp. 1-8). IEEE.
-  Sha, F., Bae, C., Liu, G., Zhao, X., Chung, Y.Y. and Yeh, W., 2015, May. A categorized particle swarm optimization for object tracking. In Evolutionary Computation (CEC), 2015 IEEE Congress on (pp. 2737-2744). IEEE.
-  Redmon, J. and Farhadi, A., 2018. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767.
-  Bross, B., 2013. High efficiency video coding (HEVC) text specification draft 10 (for FDIS & last call). In Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29/WG 11, 12th Meeting, Geneva,(Jan. 2013).
-  Gammulle, H., Denman, S., Sridharan, S. and Fookes, C., 2017, March. Two stream lstm: A deep fusion framework for human action recognition. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV) (pp. 177-186). IEEE.