Associative Embedding for Game-Agnostic Team Discrimination
Assigning team labels to players in a sport game is not a trivial task when no prior is known about the visual appearance of each team. Our work builds on a Convolutional Neural Network (CNN) to learn a descriptor, namely a pixel-wise embedding vector, that is similar for pixels depicting players from the same team, and dissimilar when pixels correspond to distinct teams. The advantage of this idea is that no per-game learning is needed, allowing efficient team discrimination as soon as the game starts. In principle, the approach follows the associative embedding framework introduced in [Newell17a] to differentiate instances of objects. Our work is however different in that it derives the embeddings from a lightweight segmentation network and, more fundamentally, because it considers the assignment of the same embedding to unconnected pixels, as required by pixels of distinct players from the same team. Excellent results, both in terms of team labelling accuracy and generalization to new games/arenas, have been achieved on panoramic views of a large variety of basketball games involving player interactions and occlusions. This makes our method a good candidate to integrate team separation in many CNN-based sport analytics pipelines.
1 Introduction
Team sports analytics has numerous applications, ranging from broadcast content enrichment to game statistical analysis for coaches [Chen16, Thomas17, Zheng16]. Assigning team labels to detected players is of particular interest when investigating the relationship between team positioning and sport action success/failure statistics [Bialkowski14, Hobbs18, Liu14], but also for some specific tasks such as offside detection in soccer [Dorazio09] or ball ownership prediction in basketball [Wei15].
Many previous works have investigated computer vision methods to detect and track team sport players [Cioppa18, Dorazio09, Lu18, Lu13, Manafifard17, Parisot17, Tong11]. They can detect individual players, but generally resort to impractical manual intervention or to unreliable heuristics to adapt their processing pipeline to recognize the players’ team. Specifically, they generally need human intervention to adjust the team discriminant features (e.g. RGB histogram in [Lu13], or CNN features in [Lu18]) to the game at hand [Bialkowski14, Liu14, Lu18, Lu13]. A few methods have attempted to derive game-specific team features in an automatic manner [Dorazio09, Tong11]. They consider the unsupervised clustering of color histograms [Dorazio09] or bags of color features [Tong11] computed on the spatial support of the players that are detected in the game at hand. Those methods depend on how well color discriminates the two teams, and are also quite sensitive to occlusions and to the quality of player detection and segmentation [Manafifard17]. This probably explains why those previous works have been demonstrated in outdoor and highly contrasted scenes, as encountered in soccer for example. We show in Section 4 that those methods fail to address real-life indoor cases.
As observed in [Lu18], indoor sports analytics have to deal with lower color contrast between players and background, and more dynamic scenes, with more frequent occlusions. [Parisot16, Parisot17] also point out the low illumination, the strong reflections induced by dynamic advertising boards, the severe shadows, the large player density and the lack of color discrimination in indoor scenes.
In our work, we do not arbitrarily select a handcrafted feature to discriminate the teams. We do not consider a framework that requires game-specific adjustment either. Instead we adopt a generic learning-based strategy that aims at predicting a feature vector in each pixel, in such a way that, independently of the game at hand, similar vectors are predicted in pixels lying in players from the same team, while distinct vectors are assigned to pairs of pixels that correspond to distinct teams. In other words, we train a neural network to separate, in an embedding space, the pixels of different teams and to group those of a common team. A simple and efficient clustering algorithm can then be used to dissociate different teams in an image. Hence, we do not rely on explicit recognition of specific teams, but rather learn how to map player pixels to a feature space that promotes team clustering, whatever the team appearance. Although teams change at each game, there is thus no need for fine tuning or specific manual annotation for new games. The approach has been inspired by the associative embedding strategy recently introduced to discriminate instances in object detection problems [Newell17b, Newell17a]. However, differently from [Newell17b, Newell17a], it is demonstrated using a lightweight ICNet convolutional neural network (opening broader deployment perspectives than the heavy stacked hourglass architecture promoted in [Newell17b, Newell17a]) and, to our knowledge, is the first work assigning similar embeddings to unconnected pixels, thereby extending the field of application of pixel-wise associative embedding.
To validate our method, we have trained our network on a representative set of images captured in a variety of games and arenas. Since only a few player keypoints (head, pelvis, and feet) have been annotated in addition to the player team index, the player segmentation component of our network has been trained with approximate ground-truth masks, corresponding to ellipses connecting the keypoints. Our CNN model is validated on games (teams) and arenas that have not been seen during training. It achieves above 90% team recognition accuracy, despite the challenging scenes (indoor, dynamic background, low contrast) and the inaccurate segmentation ground-truth considered during training. Interestingly, the lightweight backbone makes the solution realistic for real-time deployment.
Our paper is organized as follows. Section 2 reviews the related works associated with CNN-based sport analysis, segmentation, and associative embedding. Section 3 then introduces our proposed method, using an ICNet variant to both segment the players and compute pixel-wise team discriminant embeddings. The experiments presented in Section 4 demonstrate the relevance of our approach, while conclusions and some perspectives are provided in Section 5.
2 Related works
Recent developments in computer vision make extensive use of Convolutional Neural Networks [Russakovsky15]. This section reviews the specific type of CNN, named Fully Convolutional Network (FCN), that is used for image segmentation. It then introduces the recent associative embedding methods considered to turn object class segmentation into object instance segmentation.
2.1 Fully Convolutional Network (FCN)
Fully Convolutional Networks are characterized by the fact that they output spatial feature maps, strictly computed by the recursive application of convolutional layers, generally completed with ReLU activation and batch-normalization or dropout regularization layers.
In recent works dealing with sport video analysis, FCNs have been considered for specific segmentation tasks, including player jersey number extraction [Gerke17] and soccer field line and player segmentation [Cioppa18]. In [Lu18], a two-step architecture, inspired by [Yang16] and [Yu16], is even proposed to extract player bounding boxes with team labels. The network however needs to be trained on a game-per-game basis, which is impractical for large scale deployment. None of these works is thus able to differentiate player teams without requiring a dedicated training for each game, as proposed in Section 3, where a real-time amenable FCN provides the player segmentation mask, as well as a pixel-wise team-discriminant feature vector.
There are two main categories of real-time FCNs: encoder-decoder networks and multi-scale networks.
Encoder-decoder architectures adopt the encoder structure of classification networks, but replace their dense classification layers by fully convolutional layers that upsample and convolve the coded features up to pixel-wise resolution.
SegNet (Segmentation Network) [Badrinarayanan17] was the first segmentation architecture to reach near real-time inference. It is a symmetrical encoder-decoder network, with skip connection of pooling indices from encoder layers to decoder layers. ENet (Efficient Neural Network) [Paszke16] follows SegNet, but comes with various improvements, the most prominent of which is the use of a decoder that is smaller than the encoder.
Quite recently, several authors proposed to adopt multi-scale architectures to better balance accuracy and inference complexity. Considering multiple scales makes it possible to exploit both a large receptive field and a fine image resolution, with a reduced number of network layers. Among those networks, ICNet (Image Cascade Network) [Zhao18] is based on PSPNet (Pyramid Scene Parsing Network) [Zhao17], a state-of-the-art network for non real-time segmentation. ICNet encodes the features at three scales. The coarsest branch is a PSPNet, while finer ones are lighter networks, which allows segmentation to be inferred in real time. Two-column network [Wu17], BiSeNet (Bilateral Segmentation Network) [Yu18], GUN (Guided Upsampling Network) [Mazzini18] and ContextNet [Poudel18] are composed of two branches.
2.2 Associative embedding
An embedding vector denotes a local descriptor that characterizes a signal locally in a way that can support a task of interest. Embeddings are thus not defined a priori. Instead, they are defined in an indirect manner, to support the task of interest. In computer vision, FCNs have recently been considered to compute pixel-wise embeddings in a variety of contexts related to pixel clustering or pixel association tasks. In this context, FCN training is not supervised to output a specified value. Rather, FCN training supervises the relations between the embedded vectors, and checks that they are consistent with the task of interest.
In [Vondrick18], the embedding vector is used to compute the similarity between two pixel neighborhoods from two distinct images, typically to support a tracking task. Interestingly, a proxy task that consists in predicting the (known) color of a target frame based on the color in a reference frame is used to supervise the training of the FCN computing the embeddings. Good embeddings indeed result in relevant pixel associations, and in accurate color predictions. This reveals that an FCN can be trained in an indirect manner to support various higher-level tasks based on richer pixel-wise embedding.
Of special interest with respect to our team discrimination problem, associative embeddings have been introduced in [Newell17b, Newell17a] and used in [Law18, Newell17b, Newell17a] to associate pixels sharing a common semantic property, namely the fact that they belong to the same object instance. Authors in [Newell17a] introduced associative embedding in the context of multi-person pose estimation from joint detection and grouping, and extended it to instance segmentation. More recently, [Law18] proposed CornerNet, a new state-of-the-art one-shot bounding box object detector, by using associative embedding to group top-left and bottom-right box corners. In all these publications, the network is trained to give close embeddings to pixels from the same instance and distant embeddings to pixels corresponding to different instances. All these works are based on the same heavy stacked hourglass architecture. However, [Newell17b] suggests that the approach is not strictly restricted to this architecture, as long as two important properties are fulfilled: first, the network should have access both to global and local information; second, pixel-wise prediction at fine resolution is recommended, in order to avoid that a vector is subject to concurrent instances. This makes ICNet a prime candidate to segment players and compute team-specific embeddings in real time, since it computes features at three scales instead of two for other lightweight multi-branch FCN architectures.
3 Team segmentation using pixel-wise associative embedding
Player team discrimination is not a conventional segmentation problem since the visual specificities of each class are not known in advance. This section explains how associative embedding can be combined with player segmentation to address this problem.
3.1 Team discrimination & player segmentation
We propose to adopt the associative embedding paradigm to support the team discrimination task. In short, we design a fully convolutional network so that, in addition to a player segmentation mask, it outputs for each pixel a $D$-dimensional feature vector that is similar for pixels that correspond to players of the same team, while being distinct for pixels associated with distinct teams. As explained in the previous section, embedding learning is not based on an explicit supervision. Instead, embeddings are envisioned as a latent pixel-wise representation, trained to support a pixel-wise association task, typically to group [Law18] or match [Vondrick18] pixels together. In the context of object detection, associative embedding has been applied with success in [Law18, Newell17a] to group pixels corresponding to the same object instance. In these works, multiple hourglass-shaped networks are stacked recursively in order to progressively refine the 1-D embedding value that aims to differentiate object instances in a given class. Our work differs from [Newell17a, Newell17b] and [Law18] in two main aspects.
First, and because we target real-time deployment, the stacked hourglass architecture is replaced by an ICNet [Zhao18] backbone, as illustrated in Figure 1. As stated in [Zhao18], ICNet reaches 30 FPS for images of $1024 \times 2048$ pixels on one Titan X GPU card. We use ICNet because its multi-scale encoders, along with a spatial pyramidal pooling, give access to a reasonably large receptive field (important to share embedding information spatially) while preserving the opportunity to exploit high-resolution image signal locally (important for a fine characterization of the content).
Second, our work deals with the problem of associating pixels of players that are scattered across the whole image. This is in contrast with the association of neighboring/connected pixels generally considered in traditional association tasks [Law18, Newell17a].
3.2 Network architecture
The ICNet network architecture has mostly been left unchanged. Only the final convolution layer has been adapted to provide $1 + D$ channels, where $D$ is the embedding dimension. Those comprise one channel for semantic segmentation, with a sigmoid activation, along with $D$ channels for embeddings with linear activation. Figure 1 presents the player segmentation channel in blue while the channels for embeddings are represented in orange.
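The split of the final layer into a segmentation channel and embedding channels can be sketched for a single pixel as follows. This is only an illustration: the function name `split_head` and the convention that the segmentation channel comes first are assumptions, and `D` simply denotes the number of embedding channels.

```python
import math

def split_head(pixel_output):
    """Split one pixel's raw output (1 + D values) into (player_prob, embedding).

    Assumption: the first value is the segmentation logit, the remaining
    D values are the embedding channels.
    """
    seg_logit, *embedding = pixel_output
    # Numerically stable sigmoid on the segmentation channel.
    if seg_logit >= 0:
        player_prob = 1.0 / (1.0 + math.exp(-seg_logit))
    else:
        e = math.exp(seg_logit)
        player_prob = e / (1.0 + e)
    # Embedding channels use a linear activation, so they are returned as-is.
    return player_prob, embedding

prob, emb = split_head([0.0, 1.5, -2.0])  # D = 2 in this toy example
```

In a real network the same operation would be applied to whole feature maps at once; the per-pixel form is used here only to keep the sketch dependency-free.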
A number of loss functions are combined to train the network.
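Along with the multi-scale semantic segmentation loss from [Zhao18], composed of $L_{seg}^{1}$, $L_{seg}^{2}$ and $L_{seg}^{3}$, we add an embedding loss inspired by [Newell17a, Newell17b, Law18]. It comprises two components, $L_{pull}$ and $L_{push}$, which respectively pull teammates' embeddings together and push opponents' embeddings away from each other.
$L_{pull}$ and $L_{push}$ only apply to the finest resolution.
We have defined all loss components based on mean square distances.
The $L_{seg}^{1}$, $L_{seg}^{2}$ and $L_{seg}^{3}$ losses are defined as:
$$L_{seg}^{s} = \frac{1}{H_s W_s} \sum_{i=1}^{H_s} \sum_{j=1}^{W_s} \left( p^{s}_{ij} - g^{s}_{ij} \right)^2 \tag{1}$$
with $H_s$ and $W_s$ being the layer height and width, while $p^{s}$ and $g^{s}$ respectively denote the predicted and ground-truth player masks at scale $s$. Similarly, $L_{pull}$ is formulated as:
$$L_{pull} = \frac{1}{N} \sum_{t} \sum_{x \in P_t} \left\| e_x - E_t \right\|^2 \tag{2}$$
where $e_x$ is the embedding predicted at pixel $x$, $P_t$ is the set of pixels of team $t$, $N = \sum_t |P_t|$ is the number of player pixels, and $E_t$ is the mean of the embeddings predicted across the pixels of team $t$, i.e. $E_t = \frac{1}{|P_t|} \sum_{x \in P_t} e_x$.
In [Newell17a], the push loss is expressed as a mean over pairs of pixels of a cost function that is chosen to be high (low) when pixels that are not supposed to receive the same embedding have a similar (different) embedding. Recently, [Newell17b] and [Law18] employed a “margin-based penalty”, and wrote that this is the most reliable formulation they tested. Hence, we also adopt a margin-based penalty loss. Formally, $L_{push}$ is defined similarly to $L_{pull}$, except that rather than penalizing embeddings that are far away from their centroid, it penalizes embeddings that are too close to the centroid of another team:
$$L_{push} = \frac{1}{N} \sum_{t} \sum_{x \in P_t} \sum_{t' \neq t} \max\left(0, \Delta - \left\| e_x - E_{t'} \right\|\right)^2 \tag{3}$$
where $\Delta$ denotes the margin.
Our global objective function finally becomes:
$$L = \sum_{s=1}^{3} \lambda_{s} L_{seg}^{s} + \lambda_{pull} L_{pull} + \lambda_{push} L_{push} \tag{4}$$
with the $\lambda$ loss factors having to be tuned (chosen values are explained in Section 3.3).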
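To make the pull and push terms concrete, here is a minimal pure-Python sketch. It assumes that both terms are averaged over player pixels, that distances are Euclidean, and that the push term uses a squared hinge with a margin; the function names and the default `margin=1.0` are illustrative choices, not values taken from the text.

```python
import math

def dist(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def team_centroids(embeddings, teams):
    """Mean embedding of the player pixels of each team."""
    groups = {}
    for e, t in zip(embeddings, teams):
        groups.setdefault(t, []).append(e)
    return {t: [sum(col) / len(es) for col in zip(*es)] for t, es in groups.items()}

def pull_push_losses(embeddings, teams, margin=1.0):
    """Return (pull, push), both averaged over the player pixels.

    pull: mean squared distance of each pixel embedding to its own team centroid.
    push: mean squared hinge penalty for being closer than `margin` (assumed
          hyperparameter) to the centroid of another team.
    """
    n = len(embeddings)
    if n == 0:
        return 0.0, 0.0
    centroids = team_centroids(embeddings, teams)
    pull = sum(dist(e, centroids[t]) ** 2 for e, t in zip(embeddings, teams)) / n
    push = 0.0
    for e, t in zip(embeddings, teams):
        for other, c in centroids.items():
            if other != t:
                push += max(0.0, margin - dist(e, c)) ** 2
    return pull, push / n

# Two well-separated teams in a 1-D embedding space: both terms vanish.
print(pull_push_losses([[0.0], [0.0], [2.0], [2.0]], [0, 0, 1, 1]))
```

In training, the same computation would be written with differentiable tensor operations; the list-based version only illustrates which pixels contribute to which term.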
At inference, upsampling of last layer is inserted before activation (respectively bilinear and nearest neighbor interpolations for segmentation and embedding channels). Then, a clustering algorithm is required to group pixels in teams. Fortunately, as observed in [Newell17a], the network does a great job at separating the embeddings for distinct teams, so that a simple and greedy method such as the one detailed in Algorithm 1 is able to handle the clustering properly. As appears from the pseudocode, our naive clustering algorithm relies on the assumption that a player pixel embedding surrounded by similar embeddings is representative of its team embedding. Given a team embedding vector, player pixels are likely to be assigned to that team if their embedding lies in a sphere of radius around the team embedding. We incorporate a refinement step in which we compute the centroid of the selected pixels. Then, to resolve ambiguities, player pixels are associated to the closest of the centroids.
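Algorithm 1 is not reproduced in this text, so the following is only a sketch of a greedy clustering in the spirit described above. The seed selection rule (the remaining embedding with the most neighbours within `radius` in embedding space) is an assumption standing in for "surrounded by similar embeddings", and `radius` and `max_teams` are hypothetical parameters.

```python
import math

def dist(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def greedy_team_clustering(embeddings, radius, max_teams=2):
    """Greedy clustering of player-pixel embeddings; returns (labels, centroids).

    Quadratic in the number of pixels, so meant for illustration only.
    """
    remaining = list(range(len(embeddings)))
    centroids = []
    while remaining and len(centroids) < max_teams:
        # Seed: the remaining embedding with the densest neighbourhood (assumed rule).
        seed = max(
            remaining,
            key=lambda i: sum(dist(embeddings[i], embeddings[j]) <= radius for j in remaining),
        )
        # Select the pixels whose embedding lies in the sphere around the seed.
        members = [j for j in remaining if dist(embeddings[seed], embeddings[j]) <= radius]
        # Refinement step: replace the seed by the centroid of the selected pixels.
        selected = [embeddings[j] for j in members]
        centroids.append([sum(col) / len(selected) for col in zip(*selected)])
        member_set = set(members)
        remaining = [j for j in remaining if j not in member_set]
    # Resolve ambiguities: every pixel goes to the closest centroid.
    labels = [
        min(range(len(centroids)), key=lambda c: dist(e, centroids[c]))
        for e in embeddings
    ]
    return labels, centroids

labels, centroids = greedy_team_clustering(
    [[0.0], [0.1], [0.2], [3.0], [3.1]], radius=0.5
)
```

The final nearest-centroid pass also assigns pixels that fell outside every sphere, which is one simple way to guarantee that each player pixel receives a team label.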
3.3 Implementation details and hyperparameters
Our network is trained to extract players only, and to estimate associative embeddings for team discrimination. Referees and other non-player persons are part of the background class. Our work is based on the PyTorch ICNet implementation [mshahsemseg]. Parameters have been empirically tuned. For the training, we employ the Adam optimizer [Kingma15]. The loss factors defined in Equation 4 are: and as in original ICNet [Zhao18], and are thus very different from those in [Newell17a, Newell17b, Zhao18] because our pull and push loss definitions are averaged over pixels rather than over instances. Our best found learning rate is , and has been implemented with the “poly” learning rate decay taken from [Zhao17, Zhao18, Yu18] and their own sources. Compared to them, we apply the decay by epochs instead of iterations, but we keep the same power of 0.9. Hence, the learning rate at epoch $e$ is $lr_0 \left(1 - \frac{e}{E}\right)^{0.9}$, with $E$ denoting the total number of epochs, and $lr_0$ being the base learning rate defined above. All but the last layers of ICNet are initialized with pretrained Cityscapes ([Cordts16]) weights from [Zhao18], but a full training is done as the point of view adopted for sport field coverage is too different from the frontal point of view considered by cars in Cityscapes. Minibatch size is 16 and batch-normalization is applied. Neither weight decay regularization nor dropout is added, but the following random data augmentation is considered: mirror flipping, central rotation of maximum 10 degrees, scaling such that , color jitter in the perceptually uniform CIE L*C*h color space fixed to L , C and h degrees, to keep natural colors. We trained the network on crops of pixels, located randomly in scaled images. Validation is performed on pixels patches, extracted from images scaled such as its equals 512.
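The epoch-wise "poly" decay described above can be written in a few lines. The function name and the `base_lr=1e-3` used in the example are placeholders for illustration only; the base learning rate actually used is not recoverable from this text.

```python
def poly_lr(base_lr, epoch, total_epochs, power=0.9):
    """Polynomial ("poly") learning rate decay applied per epoch.

    Returns base_lr at epoch 0 and decays to 0 at epoch == total_epochs.
    """
    return base_lr * (1.0 - epoch / total_epochs) ** power

# Placeholder base learning rate, chosen only to show the shape of the schedule.
schedule = [poly_lr(1e-3, epoch, total_epochs=10) for epoch in range(10)]
```

With a power below 1, the schedule stays close to the base rate for most of training and drops quickly near the end.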
For each model, we select the parameters of the best epoch according to a validation score defined as the mean of intersection over union of the two teams, between prediction and our approximate reference masks. Inference for testing is done on court images downsampled to and padded to preserve the aspect ratio.
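The validation score can be sketched as a mean of per-team intersection over union values. In this illustration, masks are represented as sets of `(row, col)` pixel coordinates, predicted and reference team labels are assumed to be already matched, and an empty prediction compared with an empty reference is scored 1.0; all three choices are assumptions made for the example.

```python
def iou(pred_pixels, ref_pixels):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    pred, ref = set(pred_pixels), set(ref_pixels)
    union = pred | ref
    # Convention assumed here: two empty masks agree perfectly.
    return len(pred & ref) / len(union) if union else 1.0

def mean_team_iou(pred_masks, ref_masks):
    """Mean IoU over teams; pred_masks[i] is assumed to match ref_masks[i]."""
    return sum(iou(p, r) for p, r in zip(pred_masks, ref_masks)) / len(ref_masks)

score = mean_team_iou(
    [{(0, 0), (0, 1)}, {(1, 0)}],   # predicted masks for the two teams
    [{(0, 1), (0, 2)}, {(1, 0)}],   # approximate reference masks
)
```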
In our implementation, we adopted -D embeddings, mainly because more dimensions a priori get more ability to capture/encode visual team characteristics unambiguously. We expect this ability to become especially useful when the receptive field does not cover the whole scene. In that case, the embedding prediction in one pixel may not be able to rely on a teammate appearance or on the absence of collision with an opponent embedding when those players are far and disconnected from the pixel of interest. The embeddings have thus to be consistent across the scene, despite their relatively local receptive field. In other words, they have to capture local team characteristics unambiguously. In practice, ICNet builds a global receptive field, and our trials provided similar results with 1- to 5-D embeddings.
4 Experimental validation
To assess our method, this section first introduces an original dataset, and associated evaluation metrics. It then runs a K-fold cross-validation procedure, and compares the performance of our associative embedding team discrimination, with a conventional color histogram clustering, applied on top of instance segmentation.
4.1 Dataset characteristics
To demonstrate our solution, we have considered a proprietary basketball dataset. It involves a large variety of games and sport halls: images come from 40 different games and 27 different arenas. Images show innumerable situations: occlusions between teammates and/or opponents, regular player distribution, absence or presence of all the players, images from training sessions and professional games with public, various game actions, still and moving players, presence of referees, managers, mascots, dynamic LED advertisements, photographers or other humans, various lighting conditions, different image sizes (smaller dimension is generally close or superior to 1000 pixels). This dataset is composed of 648 images covering a bit more than half of the sport field. Each player has been manually annotated. Annotations considered in our work include a team label (Team A vs. Team B), and an approximate player mask. This mask has been derived from manual annotation of head, pelvis, and feet. It consists of seven ellipses approximately covering the head, the body (between head and pelvis), the pelvis, the legs (between pelvis and each foot), and the feet. Occlusions between ellipses of players located at different depths have been taken into account. Similarly to [Cioppa18], our experiments reveal that the network can learn despite the coarseness of the masks. Player size in images feeding the network (scaling strategy in Section 3.3) is around pixels.
4.2 Evaluation metrics
Our network enables player segmentation, as well as team discrimination. Evaluation metrics should thus reflect whether players have been properly detected, and whether teammates have received the same team label. Therefore, we consider the following counters and metrics, to be computed on a set of test images:
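$N_{miss}$: Number of missing players
$N_{correct}$: Number of correct team associations
$N_{incorrect}$: Number of incorrect team associations
Missed players rate, $\frac{N_{miss}}{N_{miss} + N_{correct} + N_{incorrect}}$
Correct team assignments rate, $\frac{N_{correct}}{N_{correct} + N_{incorrect}}$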
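The two rates can be computed from the three counters as sketched below. The denominators are assumptions: the missed players rate is taken over all annotated players, and the correct team assignments rate over detected players only. The function and argument names are illustrative.

```python
def evaluation_rates(n_missed, n_correct, n_incorrect):
    """Return (missed_players_rate, correct_team_assignments_rate).

    Assumed denominators: all annotated players for the missed rate,
    detected players (correct + incorrect) for the team assignment rate.
    """
    detected = n_correct + n_incorrect
    total = detected + n_missed
    missed_rate = n_missed / total if total else 0.0
    correct_rate = n_correct / detected if detected else 0.0
    return missed_rate, correct_rate

# Toy counters, not results from the paper.
missed_rate, correct_rate = evaluation_rates(n_missed=10, n_correct=81, n_incorrect=9)
```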
We now explain how the outputs of our network, namely the player segmentation mask and the map of team labels derived from the embeddings clusters, are turned into those evaluation metrics. (Since accurate ground-truth segmentation masks are not available from the dataset, see Section 4.1, the segmentation quality cannot be assessed based on conventional intersection over union metrics.) Given a reference segmentation mask and a team label for each player instance, a simple majority vote strategy is adopted. A player is considered to be detected when the majority of pixels in the player instance segmentation mask are part of the segmentation mask predicted by the network. In that case, the majority label observed in the instance mask defines the team of the player. In practice, since our ground-truth mask only provides a rough approximation of the actual player instance silhouette, we resort to the part of the instance mask that is the most relevant for team classification, i.e. to the two ellipses that respectively cover the body and the pelvis area. Since pixels that are in the central part of the body and pelvis ellipses are less likely to be part of the background, only the pixels that are sufficiently close to the main principal axis of the body/pelvis shape are considered. (A distance threshold equal to one third of the maximal distance between ellipse border and principal axis has been adopted. Changing this threshold does not significantly impact the results.)
In order to validate the proposed team discrimination method with available data, we consider a K-fold cross-validation framework. It partitions the 648-image dataset into K disjoint subsets, named folds. Each K-fold iteration preserves one fold for the test, and uses the other folds for training and validation. Average and standard deviation metrics can then be computed based on the K iterations of the training/testing procedure. In our case, ten folds of approximately equal size have been considered. Moreover, to assess whether the model generalizes properly on new games and new arenas, we construct the folds so that each fold contains images from distinct games and/or arenas. Table 1 lists cross-game folds characteristics, and Table 2 cross-arena folds characteristics.
To estimate the value to give to our results, we compare them to a baseline reference. Since most previous methods recognize teams based on color histograms [Dorazio09, Lu13, Tong11], generally after team-specific training, we compare associative embeddings to a method that collects color histograms on player instances, before clustering them into two sets. In practice, as for the associative embedding evaluation, only the player pixels that are sufficiently close to the body/pelvis principal axis are considered to build the histogram in RGB, with 8 bins per dimension (512-dimensional histogram). The adopted clustering is the [scikit-learn] implementation of the variational inference algorithm with a Dirichlet process prior [Blei06], used to fit at most two Gaussians representing our two clusters (two teams). This method has the advantage of being able to automatically reduce the number of prototypes, which is useful when fewer than two teams are visible in an image.
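The majority vote described above can be sketched for one reference player as follows. The data layout is an assumption made for the example: `instance_pixels` lists the `(row, col)` pixels retained for the player (the central body/pelvis region), and `predicted_labels` maps each pixel predicted as player to its team label.

```python
from collections import Counter

def evaluate_player(instance_pixels, predicted_labels):
    """Majority vote for one reference player.

    Returns None when the player is missed (no strict majority of its pixels
    is predicted as player), otherwise the majority team label.
    """
    votes = [predicted_labels[p] for p in instance_pixels if p in predicted_labels]
    if 2 * len(votes) <= len(instance_pixels):
        return None  # player considered as not detected
    return Counter(votes).most_common(1)[0][0]

pixels = [(0, 0), (0, 1), (0, 2), (0, 3)]
team = evaluate_player(pixels, {(0, 0): "A", (0, 1): "A", (0, 2): "B"})
```

The returned label would then be compared with the reference team label, up to the arbitrary naming of the two clusters, to update the counters of correct and incorrect associations.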
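One way to build folds that never split a game (or an arena) across folds is sketched below. The text does not say how the folds were actually constructed, so the greedy balancing rule, the function name, and the input layout (a mapping from image identifier to game or arena identifier) are all assumptions.

```python
def grouped_folds(image_groups, k):
    """Partition images into k folds without splitting any group across folds.

    image_groups: dict mapping image id -> group id (a game or an arena).
    Groups are placed largest first into the currently smallest fold, which
    keeps fold sizes roughly equal (greedy heuristic, assumed for illustration).
    """
    by_group = {}
    for image, group in image_groups.items():
        by_group.setdefault(group, []).append(image)
    folds = [[] for _ in range(k)]
    for group in sorted(by_group, key=lambda g: (-len(by_group[g]), str(g))):
        smallest = min(folds, key=len)
        smallest.extend(by_group[group])
    return folds

images = {"a1": "g1", "a2": "g1", "b1": "g2", "c1": "g3", "c2": "g3", "c3": "g3"}
folds = grouped_folds(images, k=2)
```

Each fold is then used once for testing while the remaining folds serve for training and validation.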
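The 512-dimensional descriptor of the baseline can be sketched as a joint RGB histogram with 8 bins per channel. Normalising the histogram to sum to 1 is an assumption (the text does not specify it), and the clustering step itself is left out of this sketch.

```python
def rgb_histogram(pixels, bins_per_channel=8):
    """Joint RGB histogram of a player's pixels.

    pixels: iterable of (r, g, b) integers in 0..255.
    Returns a list of length bins_per_channel ** 3, normalised to sum to 1
    (normalisation is an assumption). bins_per_channel must divide 256.
    """
    width = 256 // bins_per_channel
    hist = [0.0] * bins_per_channel ** 3
    count = 0
    for r, g, b in pixels:
        # Flatten the three per-channel bin indices into one histogram index.
        idx = ((r // width) * bins_per_channel + g // width) * bins_per_channel + b // width
        hist[idx] += 1.0
        count += 1
    return [h / count for h in hist] if count else hist

h = rgb_histogram([(0, 0, 0), (255, 255, 255), (255, 255, 255), (32, 0, 0)])
```

One such vector per player instance would then be passed to the clustering stage, for example a Bayesian Gaussian mixture limited to two components.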
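Table 1: Cross-game folds characteristics.
| Fold | 1 .. 3 | 4 | 5 .. 9 | 10 |
Table 2: Cross-arena folds characteristics.
| Fold | 1 | 2 .. 5 | 6 | 7 | 8 .. 9 | 10 |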
Results of cross-game validation are presented in Table 3, while cross-validation on sport halls is presented in Table 4. Standard deviations are low, demonstrating the weak dependence on a specific set of training data. The rate of missing detections is about 11%, which is an acceptable rate considering that our backbone is the real-time ICNet model [Zhao18], applied to arduous indoor basketball images. It could probably be improved with a finer tuning of hyperparameters, as well as more accurate segmentation masks and a formulation that involves a class for referees (see failure cases analysis below). More recent and more effective segmentation networks could also be considered as long as they are compatible with associative embedding.
In Figure 23, we observe that players are generally well detected but roughly segmented, probably due to our approximate training masks. However, segmentation masks are very clean compared to the background-subtracted foreground masks derived for this kind of scene (see for example [Parisot17]). Therefore, they could advantageously replace those masks in algorithms using camera calibration to detect individual players from the segmentation mask [Alahi11, Carr12, Delannay09].
In terms of team assignment, [Lu18] mentions that they cannot achieve good cross-game team assignment without fine-tuning. In comparison, our method reaches more than 90% of correct team assignments while testing on games and sport halls that are not seen during training. The baseline Bayesian color histogram clustering only reaches 62% of correct team assignments, which confirms that the team assignment task in the context of indoor sport is extremely difficult, as described in Section 1. We get nearly identical results for cross-arena evaluation.
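Table 3: Cross-game validation (mean ± standard deviation over the folds).
| Method | Missed players rate | Correct team assignments rate |
| Associative Embedding | 0.11 ± 0.04 | 0.91 ± 0.04 |
| Color Histogram | - | 0.62 ± 0.02 |
Table 4: Cross-arena validation (mean ± standard deviation over the folds).
| Method | Missed players rate | Correct team assignments rate |
| Associative Embedding | 0.11 ± 0.06 | 0.91 ± 0.03 |
| Color Histogram | - | 0.63 ± 0.02 |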
Qualitative results are shown in Figure 23. As written in Section 3.3, we intend to extract players only, excluding referees and other humans. Images belong to testing folds, meaning that they originate from games or arenas not seen during training. Team masks are drawn in red and blue.
The first five rows in Figure 23 illustrate how well the proposed method can deal with indoor basketball conditions. Players in fast movement and low contrast are detected and well grouped in teams. Occlusions, LED advertisements, and artificial lighting are not a major problem. Associative embedding has a low sensitivity to high color similarities between background and foreground. Specific treacherous scenes with players of only one team and some other humans are correctly handled.
We estimate to of the number of annotated players, the quantity of isolated regions that could fit humans, extracted in addition to reference instances. These detections come from referees and other unwanted persons on or close to the ground, and in certain cases from scenery elements. In basketball, the proportion of the number of referees related to the players is from to (we usually count 2 or 3 referees in a complete field, while players are ). Thus, it is interesting to see that our FCN trained on players generally avoids referees and other people. However, this is a challenging task, as can be seen in the two prominent failure cases shown in the last two rows of Figure 23, where referees' shirts or pants are visually similar to a team. In the first example, a referee is detected as a player and included in a team (referee on the right, under the basket), and a player is filtered out of the predicted player class, probably because it is seen as a referee by the network (background player beside a referee). In the second example, the dark pants of a referee and a coach in the back of the court are assimilated to the team in black. This sample also presents a severe occlusion involving four players; inside and around this area, detection is inaccurate and team assignment of the orange player mixed with black teammates is lost.
5 Conclusion
Associative embedding is considered to address the team assignment problem in team sport competitions. It offers the advantage of discriminating teams in sport scenes, without requiring an impractical per-game training. Promising results are obtained on a challenging basketball dataset, with little tuning and only approximate player mask annotations. In this work, the embeddings come with a player segmentation mask from a relatively simple multi-scale CNN, rather than the stacked hourglass network considered in previous works [Law18, Newell17b, Newell17a]. Our work could be extended to support instance segmentation, by using either instance embeddings [Newell17a] or projective geometry [Alahi11, Carr12, Delannay09]. Future investigations of interest include the explicit recognition of referees, a deeper analysis of the embeddings distribution and a more careful weighting of losses [Kendall18].