Soccer Field Localization from a Single Image
In this work, we propose a novel way of efficiently localizing a soccer field from a single broadcast image of the game. Related work in this area relies on manually annotating a few key frames and extending the localization to similar images, or installing fixed specialized cameras in the stadium from which the layout of the field can be obtained. In contrast, we formulate this problem as a branch and bound inference in a Markov random field where an energy function is defined in terms of field cues such as grass, lines and circles. Moreover, our approach is fully automatic and depends only on single images from the broadcast video of the game. We demonstrate the effectiveness of our method by applying it to various games and obtain promising results. Finally, we posit that our approach can be applied easily to other sports such as hockey and basketball.
Keywords: Sports Analytics, 3D vision, Homography Estimation
1 Introduction
According to recent studies
Core to most analytics is the ability to automatically extract valuable information from video. Being able to identify team formations and strategies as well as assessing the performance of individual players is reliant upon understanding where the actions are taking place in 3D space.
Most approaches to player detection [1, 2, 3, 4], game event recognition [5], and team tactical analysis [6, 7, 8] perform field localization either by semi-manual methods [9, 10, 11, 12, 13, 14, 15, 16, 17] or by obtaining the game data from fixed and calibrated camera systems installed around the venue.
In this paper, we tackle the challenging task of field localization as applied to a single broadcast image. We propose a method that requires no manual initialization and is applicable to any video of the game recorded with a single camera. The input to our system is a single image and the 3D model of the field, and the output is the mapping that takes the image to the model as illustrated in Fig. 1. In particular, we frame the field localization problem as inference in a Markov Random Field. We parametrize the field in terms of four rays, cast from two automatically detected horizontal vanishing points. The rays correspond to the outer lines of the field and thus define the field’s precise localization. Our MRF energy uses several potentials that exploit semantic segmentation of the image in terms of “grass”, as well as agreement between the lines found in the image and those defined by the known model of the field. All of our potentials can be efficiently computed. We perform inference with branch-and-bound, achieving on average 0.7 seconds running time per frame. The weights in our MRF are learned using a structured SVM [18].
We focus our efforts on the game of soccer as it is more challenging than other sports, such as hockey or basketball. A hockey rink or a basketball court is much smaller than a soccer field and is in a closed venue. In contrast, a soccer field is usually in an open stadium exposed to different weather and lighting conditions, which might create difficulties in identifying the important markings of the field. Furthermore, the texture and pattern of the grass in a soccer field differ from one stadium to another, in contrast to, say, a hockey rink, which is always white. We note however that our method is sport agnostic and is easily extendable as long as the sport venue has known dimensions and primitive markings such as lines and circles.
To evaluate our method, we collected a dataset of 259 images from 12 games in the World Cup 2014. We report the Intersection over Union (IOU) scores of our method against the ground truth, and show very promising results. In the following, we start with a discussion of related literature, and then describe our method. The experimental section provides an exhaustive evaluation of our method, and we finish with a conclusion and a discussion of future work.
2 Related Work
A variety of approaches have been developed in industry and academia to tackle the field localization problem. In the industrial setting, companies such as Pixelot and Prozone have proposed a hardware approach to field localization by developing advanced calibrated camera systems that are installed in a sporting venue. This requires expensive equipment, which is only possible at the highest performance level. Alternatively, companies such as Stathleates rely entirely on human workers for establishing the homography between the field and the model for every frame of the game.
In the academic setting, the common approach to field registration is to first initialize the system by either searching over a large parameter space (e.g. camera parameters) or by manually establishing a homography for various representative keyframes of the game and then propagating this homography throughout the consecutive frames. In order to avoid accumulated errors, the system needs to be reinitialized by manual intervention. Many methods have been developed which exploit geometric primitives such as lines and/or circles to estimate the camera parameters [9, 10, 11, 12, 13]. These approaches rely on Hough transforms or RANSAC and require manually specified color and texture heuristics.
An approach to limit the search space of the camera parameters is to find the two principal vanishing points corresponding to the field lines [19, 20] and only look at the lines and intersection points that are in accordance with these vanishing points and which satisfy certain cross ratios. The efficacy of the method was demonstrated only on goal areas where there are lots of visible lines. However, this approach faces problems for views of the centre of the field, where there are usually fewer lines and thus one cannot estimate the vanishing point reliably.
In [21], the authors proposed an approach that matches images of the game to 3D models of the stadium for initial camera parameter estimation. However, these 3D models only exist for well-known stadiums, limiting the applicability of the proposed approach.
Recent approaches, applied to Hockey, Soccer and American Football [14, 15, 16, 17] require a manually specified homography for a representative set of keyframe images per recording. In contrast, in this paper we propose a method that only relies on images taken from a single camera. Also no temporal information or manual initialization is required. Our approach could be used, for example in conjunction with [14, 15] to automatically produce smooth high quality field estimates from video.
3 3D Soccer Field Registration
The goal of this paper is to automatically compute the transformation between a broadcast image of a soccer field, and the 3D geometric model of the field. In this section, we first show how to parameterize the problem by making use of the vanishing points, reducing the effective number of degrees of freedom to be estimated. We then formulate the problem as energy minimization in a Markov random field that encourages agreement between the model and the image in terms of grass segmentation as well as the location of the primitives (i.e., lines and ellipses) that define the soccer field. Furthermore, we show that inference can be solved exactly very efficiently via branch and bound.
3.1 Field Model and Parameterization
Assuming that the ground is planar, a soccer field can be represented by a 2D rectangle embedded in a 3D space. The rectangle can be defined by two long line segments referred to as touchlines and two shorter line segments, each behind a goal post, referred to as goallines. Each soccer field also has a set of vertical and horizontal lines defining the goal areas, the penalty boxes, and the midfield. Additionally, a full circle and two semicircles are marked, which define the distances that opposing players should maintain from the ball at kickoff. We refer the reader to Fig. 1 for an illustration of the geometric field model.
The transformation between the field in the broadcast image and our 3D model can be parameterized with a homography $H$, which is a $3 \times 3$ invertible matrix defining a bijection that maps lines to lines between 2D projective spaces [22]. The matrix $H$ has 8 degrees of freedom and encapsulates the transformation of the broadcast image to the soccer field model. A common way to estimate this homography is by detecting points and lines in the image and associating them with points and lines in the soccer field model. Given these correspondences, the homography can be estimated in closed form using the Direct Linear Transform (DLT) algorithm [22]. While a closed form solution is very attractive, the problem lies in the fact that the association of lines/points between the image and the soccer model is not known a priori. Thus, in order to solve for the homography, one needs to evaluate all possible assignments. As a consequence, DLT-like algorithms are typically used in the scenario where a nearby solution is already known (from a keyframe or previous frame), and search is done over a small set of possible associations.
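To make the closed-form estimation concrete, the following is a minimal sketch of the inhomogeneous form of DLT for exactly four point correspondences, fixing $h_{33} = 1$ and solving the resulting $8 \times 8$ linear system. It assumes the correspondences are already known, which is precisely what is not available in our setting. The function names (`solve_linear`, `homography_from_points`, `apply_homography`) are hypothetical and exact rational arithmetic is used only to keep the example free of rounding issues.

```python
from fractions import Fraction


def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination in exact rational arithmetic.

    Raises StopIteration if the system is singular (degenerate correspondences).
    """
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Find a row with a non-zero entry in this column and move it up.
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        pv = M[col][col]
        M[col] = [v / pv for v in M[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


def homography_from_points(src, dst):
    """Estimate H (3x3, with h33 = 1) such that dst ~ H * src, from four pairs."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h1 x + h2 y + h3) / (h7 x + h8 y + 1), and similarly for v.
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(A, b)
    return [h[0:3], h[3:6], [h[6], h[7], Fraction(1)]]


def apply_homography(H, pt):
    """Map a 2D point through H and divide by the homogeneous coordinate."""
    x, y = pt
    X = H[0][0] * x + H[0][1] * y + H[0][2]
    Y = H[1][0] * x + H[1][1] * y + H[1][2]
    W = H[2][0] * x + H[2][1] * y + H[2][2]
    return (X / W, Y / W)
```

With more than four correspondences, or with line correspondences, one would normally use the homogeneous least-squares formulation instead of this exact solve.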
In this paper, we follow a very different approach, which jointly solves for the association and the estimation of the homography. Towards this goal, we first reduce the effective number of degrees of freedom of the homography. In an image of the field, parallel lines intersect at two orthogonal vanishing points. If we can estimate the vanishing points reliably, we can reduce the number of degrees of freedom from 8 to 4. We defer the discussion about how we estimate the vanishing points to section 5.
For convenience of presentation, we refer to the lines parallel to the touchlines as horizontal lines, and the lines parallel to the goallines as vertical lines. Let $x$ be an image of the field. Denote by $vp_V$ and $vp_H$ the (orthogonal) vertical and horizontal vanishing points respectively. Since a football stadium conforms to a Manhattan world, there also exists a third vanishing point which is orthogonal to both $vp_V$ and $vp_H$. We omit this third vanishing point from our model since there are usually not enough lines to compute it reliably.
We define a hypothesis field by four rays emanating from the vanishing points. The rays $y_1$ and $y_2$ originate from $vp_H$ and correspond to the touchlines. Similarly, the rays $y_3$ and $y_4$ originate from $vp_V$ and correspond to the goallines. As depicted in Fig. 2, a hypothesis field is constructed by the intersection of the four rays. Let the tuple $y = (y_1, \dots, y_4) \in \mathcal{Y}$ be the parametrization of the field, where we have discretized the set of possible candidate rays. Each ray $y_i$ falls in an interval $[y^{init}_{i,\min}, y^{init}_{i,\max}]$ and $\mathcal{Y} = \prod_{i=1}^{4} [y^{init}_{i,\min}, y^{init}_{i,\max}]$ is the product space of these four integer intervals. Thus $\mathcal{Y}$ corresponds to a grid.
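As an illustration of this parameterization, the sketch below builds the four corners of a hypothesis field by intersecting two rays from the horizontal vanishing point with two rays from the vertical one, using homogeneous coordinates. It is only a sketch under assumptions: here each ray is specified by one extra image point it passes through, whereas in our model a ray is an integer index on the discretized grid. The names `line_through`, `intersect` and `field_corners` are hypothetical.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def line_through(p, q):
    """Homogeneous line through two 2D points."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))


def intersect(l1, l2):
    """Intersection of two homogeneous lines (w == 0 would mean parallel lines)."""
    x, y, w = cross(l1, l2)
    return (x / w, y / w)


def field_corners(vp_h, vp_v, through_h, through_v):
    """Corners of the quadrilateral cut out by four rays.

    through_h: two image points fixing the touchline rays (y1, y2) cast from vp_h.
    through_v: two image points fixing the goalline rays (y3, y4) cast from vp_v.
    Returned order: (y1,y3), (y1,y4), (y2,y3), (y2,y4).
    """
    rays_h = [line_through(vp_h, p) for p in through_h]
    rays_v = [line_through(vp_v, p) for p in through_v]
    return [intersect(rh, rv) for rh in rays_h for rv in rays_v]
```

Once the two vanishing points are fixed, only the choice of the four rays remains, which is the reduction from 8 to 4 degrees of freedom.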
3.2 Field Estimation as Energy Minimization
In this section, we formulate the problem as one of inference in a Markov random field. In particular, given an image $x$ of the field, we obtain the best prediction $\hat{y}$ by solving the following inference task:
$$\hat{y} = \arg\max_{y \in \mathcal{Y}} w^{\top} \phi(x, y) \qquad (1)$$
with $\phi(x, y)$ a feature vector encoding various potential functions and $w$ the set of corresponding weights, which we learn using structured SVMs [18]. In particular, our energy defines different potentials encoding the fact that the field should contain mostly grass, and high scoring configurations prefer the projection of the field primitives (i.e., lines and circles) to be aligned with the detected primitives in the image (i.e. detected line segments and conic edges). In the following we discuss the potentials in more detail.
3.2.1 Grass Potential
This potential encodes the fact that a soccer field is made of grass. We perform semantic segmentation of the broadcast image into grass vs. non-grass. Towards this goal, we exploit the prediction from a CNN trained using DeepLab [23] for our binary segmentation task. Given a hypothesis field $y$, let $F_y$ denote the field restricted to the image $x$. We would like to maximize the number of grass pixels in $F_y$. Hence, we define a potential function, denoted by $\phi_{G1}(x, y)$, that counts the percentage of total grass pixels that fall inside the hypothesis field $F_y$. However, note that for any hypothesis $y'$ with $F_y \subset F_{y'}$, $F_{y'}$ would have at least as many grass pixels as $F_y$. This introduces a bias towards hypotheses that correspond to zoomed-in cameras. We thus define three additional potentials such that we try to minimize the number of grass pixels outside the field $F_y$ and the number of non-grass pixels inside $F_y$, while maximizing the number of non-grass pixels outside $F_y$. We denote these potentials as $\phi_{G2}(x, y)$, $\phi_{G3}(x, y)$ and $\phi_{G4}(x, y)$ respectively. We refer the reader to Fig. 3 for an illustration.
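The four grass potentials can be sketched as simple pixel counts over a binary grass mask and a binary mask of the hypothesis field. This is a hedged illustration rather than our implementation: the function name `grass_potentials` is hypothetical, and normalizing the grass counts by the total number of grass pixels and the non-grass counts by the total number of non-grass pixels is an assumption made for the example.

```python
def grass_potentials(grass, inside):
    """Return (G1, G2, G3, G4) for a binary grass mask and a binary field mask.

    G1: fraction of grass pixels inside the field      (to be maximized)
    G2: fraction of grass pixels outside the field     (to be minimized)
    G3: fraction of non-grass pixels inside the field  (to be minimized)
    G4: fraction of non-grass pixels outside the field (to be maximized)
    Normalization is an assumption of this sketch.
    """
    g_in = g_out = n_in = n_out = 0
    for grass_row, inside_row in zip(grass, inside):
        for is_grass, is_inside in zip(grass_row, inside_row):
            if is_grass and is_inside:
                g_in += 1
            elif is_grass:
                g_out += 1
            elif is_inside:
                n_in += 1
            else:
                n_out += 1
    total_grass = max(g_in + g_out, 1)   # avoid division by zero
    total_other = max(n_in + n_out, 1)
    return (g_in / total_grass, g_out / total_grass,
            n_in / total_other, n_out / total_other)
```

A hypothesis that covers the whole image obtains a perfect first potential but also the worst possible third potential, which is how the additional potentials counteract the bias described above.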
3.2.2 Line Potentials
The observable lines corresponding to the white markings of the soccer field provide strong clues on the location of the touchlines and goallines. This is because their positions and lengths must always adhere to the FIFA specifications. In a soccer field there are 7 vertical and 10 horizontal line segments as depicted in Fig. 1. Using the line detector of [24], we find all the line segments in the image and also the vanishing points as described in section 5. A byproduct of our vanishing point estimation procedure is that each detected line segment is assigned to $vp_V$, $vp_H$ or none (e.g. line segments that fall on the ellipse edges) as demonstrated in Fig. 4. We then define a scoring function $\phi_{\ell_i}(x, y)$ for each line $\ell_i$, $i = 1, \dots, 17$, that is large when the image evidence agrees with the predicted line position obtained by reprojecting the model using the hypothesis $y$. The exact reprojection can be easily obtained by using the invariance property of cross ratios [22], Fig. 5(a). Given the exact position of a line $\ell_i$ on the grid $\mathcal{Y}$, the score $\phi_{\ell_i}(x, y)$ counts the percentage of line segment pixels that are aligned with the same vanishing point, Fig. 5(b). We refer the reader to the supplementary material for more details.
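The invariance of cross ratios can be illustrated in one dimension. In the sketch below, four collinear model positions are mapped by a hypothetical 1D projective map `project` (standing in for the homography restricted to a line), the cross ratio is shown to be preserved, and the image position of a fourth point is recovered from three known image positions and the model cross ratio. The exact procedure we use is described in the supplementary material; the function names and the particular map are assumptions of this example.

```python
from fractions import Fraction


def cross_ratio(a, b, c, d):
    """Cross ratio of four collinear points given by 1D coordinates."""
    return ((a - c) * (b - d)) / ((b - c) * (a - d))


def fourth_point(a, b, c, ratio):
    """Solve cross_ratio(a, b, c, d) == ratio for d."""
    return ((a - c) * b - ratio * (b - c) * a) / ((a - c) - ratio * (b - c))


def project(t):
    """Hypothetical 1D projective map t -> (2t + 1) / (t + 4)."""
    return (2 * t + 1) / (t + 4)


# Model positions of four parallel lines along a transversal, and their images.
model = [Fraction(k) for k in range(4)]
image = [project(t) for t in model]

# The cross ratio measured in the image equals the one measured in the model,
# so the fourth image position follows from the other three.
recovered = fourth_point(image[0], image[1], image[2], cross_ratio(*model))
```

In the same way, once the outer lines of a hypothesis field are fixed, the known model spacing determines where every inner marking must reproject.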
3.2.3 Circle Potentials
A soccer field has white markings corresponding to a full circle centered at the middle of the field and two circular arcs next to the penalty area, all three with the same radius. When the geometric model of the field undergoes a homography $H$, these circular shapes transform to conics in the image. Similar to the line potentials, we seek to construct potential functions that count the percentage of supporting pixels for each circular shape given a hypothesis field $y$. These supporting pixels are edge pixels that do not fall on any line segments belonging to $vp_V$ or $vp_H$. Unlike the projected line segments, the projected circles are not aligned with the grid $\mathcal{Y}$. However, as shown in Fig. 6, we note that there are two unique inner and outer rectangles for each circular shape in the model which transform in the image to quadrilaterals aligned with the vanishing points. Their position in the grid can be computed similarly to lines using cross ratios. We define a potential $\phi_{C_i}(x, y)$ for each conic $C_i$ which simply counts the percentage of (non horizontal/vertical) line pixels inside the region defined by the two quadrilaterals.
4 Exact Inference via Branch and Bound
Note that the cardinality of our configuration space $\mathcal{Y}$, i.e. the number of hypothesis fields, is of the order $O(N_H^2 N_V^2)$, where $N_H$ and $N_V$ denote the numbers of candidate rays cast from $vp_H$ and $vp_V$; this is a very large number. In this section, we show how to solve the inference task in Eq. (1) efficiently and exactly. Towards this goal, we design a branch and bound [25] (BBound) optimization over the space of all parametrized soccer fields. We take advantage of generalizations of integral images to 3D [26] to compute our bounds very efficiently.
Our BBound algorithm thus requires three key ingredients:
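A branching mechanism that can divide any set $Y \subseteq \mathcal{Y}$ into two disjoint subsets of parametrized fields.
A set function $\bar{f}$ such that $\bar{f}(Y) \geq \max_{y \in Y} w^{\top} \phi(x, y)$.
A priority queue which orders sets of parametrized fields $Y$ according to $\bar{f}$.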
In what follows, we describe the first two components in detail.
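The three ingredients above can be sketched as a generic best-first loop. This is an illustrative sketch, not our implementation: `branch_and_bound`, `split`, `upper_bound` and `is_single` are hypothetical names, and the toy objective (maximizing $-(y - 7)^2$ over the integers 0 to 15) merely stands in for the field score. The loop returns the exact maximizer provided the bound is valid for every set and tight on sets containing a single hypothesis.

```python
import heapq
from itertools import count


def branch_and_bound(root, split, upper_bound, is_single):
    """Best-first branch and bound over sets of hypotheses.

    Returns (best_set, best_value, iterations). heapq is a min-heap, so the
    negated bound is stored to pop the most promising set first.
    """
    tie = count()  # tie-breaker so the heap never compares hypothesis sets
    heap = [(-upper_bound(root), next(tie), root)]
    iterations = 0
    while heap:
        neg_bound, _, subset = heapq.heappop(heap)
        iterations += 1
        if is_single(subset):
            # The bound is tight on single hypotheses, and no other set in the
            # queue has a larger bound, so this hypothesis is optimal.
            return subset, -neg_bound, iterations
        for child in split(subset):
            heapq.heappush(heap, (-upper_bound(child), next(tie), child))
    raise ValueError("empty search space")


# Toy problem: hypothesis sets are inclusive integer intervals (lo, hi).
def toy_score(y):
    return -(y - 7) ** 2


def toy_bound(interval):
    lo, hi = interval
    if lo <= 7 <= hi:
        return 0
    return max(toy_score(lo), toy_score(hi))


def toy_split(interval):
    lo, hi = interval
    mid = (lo + hi) // 2
    return [(lo, mid), (mid + 1, hi)]


def toy_is_single(interval):
    return interval[0] == interval[1]
```

Calling `branch_and_bound((0, 15), toy_split, toy_bound, toy_is_single)` explores only a handful of intervals instead of all 16 candidates, which is the behaviour we rely on for the much larger space of fields.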
4.1 Branching
Suppose that $Y = \prod_{i=1}^{4} [y_{i,\min}, y_{i,\max}] \subseteq \mathcal{Y}$ is a set of hypothesis fields. At each iteration of the branch and bound algorithm we need to divide $Y$ into two disjoint subsets $Y_1$ and $Y_2$ of hypothesis fields. This is achieved by dividing the largest interval $[y_{i,\min}, y_{i,\max}]$ in half and keeping the other intervals the same.
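A minimal sketch of this branching step, assuming a set of hypothesis fields is stored as a list of four inclusive integer intervals (the name `split_largest` is hypothetical):

```python
def split_largest(intervals):
    """Split a product of inclusive integer intervals along its largest interval.

    intervals: list of (lo, hi) pairs, one per ray.
    Returns two disjoint lists of intervals whose union is the input set.
    """
    i = max(range(len(intervals)), key=lambda k: intervals[k][1] - intervals[k][0])
    lo, hi = intervals[i]
    if lo == hi:
        raise ValueError("the set contains a single hypothesis and cannot be split")
    mid = (lo + hi) // 2
    left, right = list(intervals), list(intervals)
    left[i] = (lo, mid)        # lower half of the largest interval
    right[i] = (mid + 1, hi)   # upper half; the other intervals are unchanged
    return left, right
```

Splitting the largest interval keeps the subsets roughly balanced, so the bounds tighten along every ray as the search proceeds.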
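4.2 Bounding
We need to construct a set function $\bar{f}$ that upper bounds $w^{\top} \phi(x, y)$ for all $y \in Y$, where $Y \subseteq \mathcal{Y}$ is any subset of parametrized fields. Since all potential function components of $\phi(x, y)$ are positive proportions, we decompose $\phi(x, y)$ into potentials with strictly positive weights and those with weights that are either zero or negative:
$$w^{\top} \phi(x, y) = w_{neg}^{\top} \phi_{neg}(x, y) + w_{pos}^{\top} \phi_{pos}(x, y) \qquad (2)$$
with $w_{neg}$, $w_{pos}$ the vectors of negative and positive weights respectively.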
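We define the upper bound on Eq. (2) to be the sum of an upper bound on the positive features and a lower bound on the negative ones,
$$\bar{f}(Y) = w_{neg}^{\top} \phi^{-}_{neg}(x, Y) + w_{pos}^{\top} \phi^{+}_{pos}(x, Y)$$
where $\phi^{-}$ and $\phi^{+}$ denote lower and upper bounds of the potentials over $Y$. This is a valid bound because a non-positive weight multiplied by a lower bound, or a positive weight multiplied by an upper bound, can only overestimate the corresponding term. In what follows, we construct a lower bound and an upper bound for all the potential functions of our energy.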
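The combination of per-potential bounds into a bound on the score can be sketched in a few lines. The names `score` and `upper_bound` are hypothetical, and the per-potential lower and upper bounds are assumed to be given.

```python
def score(w, phi):
    """Linear score w^T phi for one hypothesis."""
    return sum(wi * fi for wi, fi in zip(w, phi))


def upper_bound(w, phi_lower, phi_upper):
    """Upper bound on the score over a set of hypotheses.

    Potentials with strictly positive weights use their upper bound; potentials
    with zero or negative weights use their lower bound.
    """
    return sum(wi * (hi if wi > 0 else lo)
               for wi, lo, hi in zip(w, phi_lower, phi_upper))
```

Any feature vector whose components lie between the given lower and upper bounds scores no higher than the value returned by `upper_bound`.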
Bounds for the Grass Potential:
Let $y_\cap$ be the smallest possible field in $Y$, and let $y_\cup$ be the largest. We now show how to construct the bounds for $\phi_{G1}(x, y)$, and note that one can construct the other grass potential bounds in a similar fashion. Recall that $\phi_{G1}(x, y)$ counts the percentage of grass pixels inside the field. Since any possible field $y \in Y$ contains the smallest possible field $y_\cap$ and is contained in the largest possible field $y_\cup$ (Fig. 3b), we can define the upper bound as the percentage of grass pixels inside the largest possible field and the lower bound as the percentage of grass pixels inside the smallest possible field. Thus:
$$\phi^{+}_{G1}(x, Y) = \phi_{G1}(x, y_\cup), \qquad \phi^{-}_{G1}(x, Y) = \phi_{G1}(x, y_\cap)$$
We refer the reader to Fig. 3(b) for an illustration.
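A one-dimensional analogue makes the argument easy to check. In the sketch below a "field" is an inclusive range of cells $[a, b]$ on a strip of grass/non-grass cells, a set of hypotheses is given by a range for $a$ and a range for $b$, and the bounds are read off the largest and smallest members of the set. The names `make_counter` and `grass_bounds` are hypothetical, and the reduction to one dimension is purely illustrative.

```python
from itertools import accumulate


def make_counter(mask):
    """Return a function giving the fraction of all 1-cells inside [a, b]."""
    prefix = [0] + list(accumulate(mask))  # prefix[k] = number of 1-cells before k
    total = prefix[-1]

    def fraction_inside(a, b):
        if b < a or total == 0:
            return 0.0
        return (prefix[b + 1] - prefix[a]) / total

    return fraction_inside


def grass_bounds(fraction_inside, a_range, b_range):
    """Lower and upper bounds of the grass fraction over a set of 1D fields."""
    (a_min, a_max), (b_min, b_max) = a_range, b_range
    upper = fraction_inside(a_min, b_max)  # largest field in the set
    lower = fraction_inside(a_max, b_min)  # smallest field in the set
    return lower, upper
```

Because the count is monotone under set inclusion, every field in the set is sandwiched between the two returned values.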
Bounds for the Line Potentials:
We compute our bounds by finding a lower bound and an upper bound for each line independently. Since the method is similar for all the lines, we illustrate it only for the left vertical penalty line $\ell$ (Fig. 5a). For a hypothesis set of fields $Y$, we find the upper bound $\phi^{+}_{\ell}(x, Y)$ by computing the maximum value of $\phi_{\ell}(x, y)$ in the horizontal direction (i.e. along the rays from $vp_H$) but only for the maximally extended projection of $\ell$ in the vertical direction (i.e. along the rays from $vp_V$). This is demonstrated in Fig. 7(a). Finding a lower bound $\phi^{-}_{\ell}(x, Y)$ consists instead of finding the minimum of $\phi_{\ell}(x, y)$ for minimally extended projections of $\ell$.
Note that for a set of hypothesis fields $Y$, this task requires a linear search over all the possible rays in the horizontal direction (for vertical lines) at each iteration of branch and bound. However, as the branch and bound continues, the search space becomes smaller and finding the maximum becomes faster.
Bounds for the Circle Potentials:
Given the definition of the circle potentials in section 3.2.3 and a set of hypothesis fields $Y$, we aim to construct lower and upper bounds for each circle potential. For an upper bound, we simply let $\phi^{+}_{C_i}(x, Y)$ be the percentage of non-vp line pixels contained in the region between the smallest inner and largest outer quadrilaterals, as depicted in Fig. 7(b). A lower bound is obtained in a similar fashion.
4.3 Integral Accumulators for Efficient Potentials and Bounds
We construct five 2D accumulators corresponding to the grass pixels, non-grass pixels, horizontal line edges, vertical line edges, and non-vp line edges. In contrast to [27], and in the same spirit as [28], our accumulators are aligned with the two orthogonal vanishing points and count the fraction of features in the regions of the image $x$ corresponding to quadrilaterals bounded by two rays from each vanishing point. In this manner, the computation of a potential function over any region in $\mathcal{Y}$ boils down to four accumulator lookups. Since we defined all the lower and upper bounds in terms of their corresponding potential functions, we use the same accumulators to compute the bounds in constant time.
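The four-lookup property can be sketched with a standard integral accumulator over a 2D grid. As an assumption of this example, `grid[r][c]` holds the feature count of the cell bounded by consecutive rays `r`, `r + 1` from one vanishing point and `c`, `c + 1` from the other; with that indexing the usual axis-aligned recurrence applies unchanged. The names `integral_accumulator` and `region_sum` are hypothetical.

```python
def integral_accumulator(grid):
    """Cumulative sums: acc[r][c] = sum of grid[0..r-1][0..c-1]."""
    h, w = len(grid), len(grid[0])
    acc = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        for c in range(w):
            acc[r + 1][c + 1] = (grid[r][c] + acc[r][c + 1]
                                 + acc[r + 1][c] - acc[r][c])
    return acc


def region_sum(acc, r0, c0, r1, c1):
    """Sum over cells r0..r1 and c0..c1 (inclusive) using four lookups."""
    return (acc[r1 + 1][c1 + 1] - acc[r0][c1 + 1]
            - acc[r1 + 1][c0] + acc[r0][c0])
```

Dividing a region sum by the total count gives the fraction of features in the region, so each potential and each bound costs a constant number of lookups once the accumulators are built.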
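4.4 Learning
We use a structured support vector machine (SSVM) to learn the parameters $w$ of the log-linear model. Given a dataset composed of training pairs $\{x^{(i)}, y^{(i)}\}_{i=1}^{N}$, we obtain $w$ by minimizing the following objective
$$\min_{w} \; \frac{1}{2} \|w\|^2 + \frac{C}{N} \sum_{i=1}^{N} \max_{y \in \mathcal{Y}} \left( \Delta(y^{(i)}, y) + w^{\top} \phi(x^{(i)}, y) - w^{\top} \phi(x^{(i)}, y^{(i)}) \right)$$
where $C$ is a regularization parameter and $\Delta$ is a loss function measuring the distance between the ground truth labeling $y^{(i)}$ and a prediction $y$, with $\Delta(y^{(i)}, y) = 0$ if and only if $y = y^{(i)}$. In particular, we employ a parallel cutting plane implementation.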
The loss function is defined very similarly to the grass potentials. However, instead of segmenting the image into grass vs. non-grass pixels, we segment the grid $\mathcal{Y}$ into field vs. non-field cells by reprojecting the ground truth field into the image. The loss $\Delta(y^{(i)}, y)$ of a hypothesis field $y$ for a training instance $(x^{(i)}, y^{(i)})$ is then measured on these cells.
Note that the loss can be computed using integral accumulators, and loss augmented inference can be performed efficiently and exactly using our BBound.
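For intuition, the per-example term of a margin-rescaled structured hinge loss can be written out by brute force over a small label set. This sketch assumes the standard SSVM formulation and replaces loss-augmented inference with plain enumeration, whereas we perform it with BBound; `structured_hinge`, `phi` and `loss` are hypothetical names.

```python
def structured_hinge(w, phi, loss, x, y_true, labels):
    """Per-example structured hinge term with margin rescaling.

    phi(x, y) returns a feature list, loss(y_true, y) a non-negative number
    that is zero only when y == y_true, and labels enumerates all candidates.
    """
    def score(y):
        return sum(wi * fi for wi, fi in zip(w, phi(x, y)))

    # Loss-augmented inference: the candidate that violates the margin most.
    augmented = max(loss(y_true, y) + score(y) for y in labels)
    return augmented - score(y_true)
```

The term is zero exactly when the ground truth outscores every other candidate by at least that candidate's loss.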
5 Vanishing Point Estimation
In a Manhattan world, such as a soccer stadium, there are three principal orthogonal vanishing points. Our goal is to find the two orthogonal vanishing points $vp_V$ and $vp_H$ that correspond to the lines of the soccer field. We forgo the estimation of the third orthogonal vanishing point since in a broadcast image of the field there are usually not many lines corresponding to this vanishing point. However, a reasonable assumption is to take the direction of the third vanishing point to be the direction of gravity since the main camera rarely rotates. We find an initial estimate of the positions of $vp_V$ and $vp_H$ by deploying the line voting procedure of [29]. This procedure is robust when there are enough line segments for each vanishing point. In some cases, for example when the camera is facing the centre of the field (Fig. 4b), there might not be enough line segments belonging to $vp_V$ to estimate its position reliably, but enough to distinguish its corresponding line segments. In this case, we take the line segments that belong to neither vanishing point and fit an ellipse [30], which is an approximation to the conic in the centre of the field. We then take the 4 endpoints of the ellipse's axes and also one additional point corresponding to the crossing of the ellipse's minor axis from the grass region to the non-grass region to find an approximate homography, which in turn gives us an approximate $vp_V$.
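As a simple illustration of what a vanishing point estimate is, the sketch below computes the point that minimizes the sum of squared distances to the lines supporting a group of segments. This is a swapped-in least-squares technique chosen for brevity, not the line voting procedure of [29] that we use, and it only applies to finite vanishing points. The names `segment_to_line` and `vanishing_point` are hypothetical.

```python
import math


def segment_to_line(p, q):
    """Unit-normal line (a, b, c) with a*x + b*y + c = 0 through p and q."""
    a, b = q[1] - p[1], p[0] - q[0]
    c = -(a * p[0] + b * p[1])
    n = math.hypot(a, b)
    return (a / n, b / n, c / n)


def vanishing_point(segments):
    """Least-squares intersection of the lines supporting the segments."""
    saa = sab = sbb = sac = sbc = 0.0
    for p, q in segments:
        a, b, c = segment_to_line(p, q)
        saa += a * a
        sab += a * b
        sbb += b * b
        sac += a * c
        sbc += b * c
    det = saa * sbb - sab * sab
    if abs(det) < 1e-12:
        raise ValueError("lines are (nearly) parallel; no finite intersection")
    # Solve the 2x2 normal equations by Cramer's rule.
    x = (-sac * sbb + sab * sbc) / det
    y = (-saa * sbc + sab * sac) / det
    return (x, y)
```

With few or nearly parallel segments the normal equations become ill-conditioned, which mirrors the unreliability discussed above for views of the centre of the field.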
6 Experiments
To assess our method, we recorded 12 games from the World Cup 2014 held in Brazil. Out of these games we annotated 259 images with the ground truth fields and also the grass segmentations. We used 6 games with 154 images for the training and validation sets, and 105 images from 6 other games for the test set. The images consist of different views of the field with different grass textures. Some images, due to the rain, seem blurry and lack some lines. We remind the reader that these images do not have a temporal ordering.
Out of the 259 images, the vanishing point estimation failed for 5 images in the training/validation set and for 3 images in the test set. We discarded these failure cases from our training and evaluation. In what follows we assess different components of our method.
Grass Segmentation: Grass segmentation is a major component of our method since it has its own potentials and is also used for restricting the set of detected line segments in the image to the ones that correspond to white markings of the field. Most of the existing approaches mentioned in the related work section use heuristics based on color and hue information to segment the image into grass vs. non-grass pixels. We found these heuristics to be unreliable at times since the texture and color of the grass can differ from one stadium to another. Moreover, at some games, the spectators wear clothing with colors similar to the grass, which makes the task of grass segmentation even more difficult.
As a result, we fine-tune the CNN component of the DeepLab network [23] on the train/validation images annotated with grass and non-grass pixels. Our trained CNN grass segmentation method achieves an Intersection over Union (IOU) score of 0.98 on the test set. Some grass segmentation examples are shown in Fig. 4.
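For reference, the IOU score between two binary masks can be computed as in the following minimal sketch (the function name `iou` is hypothetical, and returning 1.0 for two empty masks is a convention chosen for the example):

```python
def iou(mask_a, mask_b):
    """Intersection over Union of two binary masks of equal size."""
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            if a and b:
                inter += 1
            if a or b:
                union += 1
    # Two empty masks are treated as a perfect match.
    return inter / union if union else 1.0
```

The same measure applies to field localization by rasterizing the predicted and ground truth fields into masks.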
Ablation studies: In Table 1 we present the IOU score of test images based on employing different potentials in our energy function. For each set of features we used the weights corresponding to the value of $C$ that maximizes the IOU score on the validation set. We notice that just including the grass potentials achieves a very low test IOU of 0.57. This is expected since grass potentials by themselves do not take into account the geometry of the field. However, when we include line and circle potentials, the test IOU increases by about 30%.
| Potentials | Mean Test IOU | Median Test IOU |
Comparison of Our Method to Two Baselines: There is currently no baseline in the literature for automatic field localization in the game of soccer. As such, we derive two baselines based on our segmentation and line segment detection methods. As the first baseline, for each test image we retrieve its nearest neighbour (NN) image from the training/validation sets based on the grass segmentation IOU and apply the homography of the training/validation image to the test image. The second baseline is similar, but instead of the NN based on grass, we retrieve based on the distance transform computed from the edges [31]. Note that these approaches could be considered similar to the keyframe initialization methods of [14, 15, 16, 17]. In contrast to those papers, here we retrieve the closest homography from a set of different games.
In Table 2, we compare the IOU of these baselines with our learned branch and bound inference method. We observe that if we use only the grass potentials, our method performs similarly to the NN baseline with grass segmentation. Using NN with line segment detections improves the baseline. When we introduce potential functions based on lines, the IOU metric increases by about 30%. Our method with the best set of features outperforms the baselines by about 34%. The best set of features, which achieves an IOU of 90%, has 4 weights for the grass potentials, one shared weight for the vertical lines, one shared weight for the horizontal lines, and similarly one shared weight for the circles. By releasing our dataset and the annotations, we hope that other baselines will be established.
| Method | Mean Test IOU | Median Test IOU |
| Nearest neighbour based on grass segmentation | 0.56 | 0.64 |
| Nearest neighbour based on lines distance transform | 0.59 | 0.66 |
| Our method with just grass potentials | | |
| Our method with line potentials | | |
| Our method, best features (G+VerL+HorL+C) | 0.90 | 0.94 |
Qualitative Results: In Fig. 8 we project the model on a few test images using the homography obtained with our best features (G+VerL+HorL+C). We also project the image on the model of the field. We observe great agreement between the image and the model.
Failure Modes: Fig. 9 shows failure modes, which are mainly due to the circle potentials failing.
Speed and Number of Iterations: For the best set of features (denoted with G+VerL+HorL+C in Table 1), inference takes on average 0.7 seconds (median 0.5 seconds) and on average 2964 BBound iterations (median 1848 iterations). Times were measured on one core of an AMD Opteron 6136.
7 Conclusion and Future Work
In this paper, we presented a new framework for fast and automatic field localization as applied to the game of soccer. We framed this problem as a branch and bound inference task in a Markov Random Field. We evaluated our method on a collection of broadcast images recorded from World Cup 2014. As mentioned, we do not take into account temporal information in our energy function. For future work, we intend to construct temporal potential functions and evaluate our method on video sequences. We also plan to incorporate player detection and tracking in our framework. Finally, we aim to extend our method to other team sports such as hockey, basketball, rugby and American Football.
References
[1] Okuma, K., Taleghani, A., De Freitas, N., Little, J.J., Lowe, D.G.: A boosted particle filter: Multitarget detection and tracking. In: ECCV. (2004)
[2] Tong, X., Liu, J., Wang, T., Zhang, Y.: Automatic player labeling, tracking and field registration and trajectory mapping in broadcast soccer video. TIST (2011)
[3] Okuma, K., Lowe, D.G., Little, J.J.: Self-learning for player localization in sports video. arXiv preprint arXiv:1307.7198 (2013)
[4] Lu, W.L., Ting, J.A., Little, J.J., Murphy, K.P.: Learning to track and identify players from broadcast sports videos. PAMI (2013)
[5] Gao, X., Niu, Z., Tao, D., Li, X.: Non-goal scene analysis for soccer video. Neurocomputing (2011)
[6] Niu, Z., Gao, X., Tian, Q.: Tactic analysis based on real-world ball trajectory in soccer video. Pattern Recognition (2012)
[7] Franks, A., Miller, A., Bornn, L., Goldsberry, K., et al.: Characterizing the spatial structure of defensive skill in professional basketball. The Annals of Applied Statistics (2015)
[8] Liu, Y., Liang, D., Huang, Q., Gao, W.: Extracting 3D information from broadcast soccer video. Image and Vision Computing (2006)
[9] Kim, H., Hong, K.S.: Soccer video mosaicking using self-calibration and line tracking. In: ICPR. (2000)
[10] Yamada, A., Shirai, Y., Miura, J.: Tracking players and a ball in video image sequence and estimating camera parameters for 3D interpretation of soccer games. In: ICPR. (2002)
[11] Farin, D., Krabbe, S., Effelsberg, W., et al.: Robust camera calibration for sport videos using court models. In: Electronic Imaging 2004. (2003)
[12] Watanabe, T., Haseyama, M., Kitajima, H.: A soccer field tracking method with wire frame model from TV images. In: ICIP. (2004)
[13] Wang, F., Sun, L., Yang, B., Yang, S.: Fast arc detection algorithm for play field registration in soccer video mining. In: SMC. (2006)
[14] Gupta, A., Little, J.J., Woodham, R.J.: Using line and ellipse features for rectification of broadcast hockey video. In: CRV. (2011)
[15] Okuma, K., Little, J.J., Lowe, D.G.: Automatic rectification of long image sequences. In: ACV. (2004)
[16] Dubrofsky, E., Woodham, R.J.: Combining line and point correspondences for homography estimation. In: Advances in Visual Computing. (2008)
[17] Hess, R., Fern, A.: Improved video registration using non-distinctive local image features. In: CVPR. (2007)
[18] Tsochantaridis, I., Joachims, T., Hofmann, T., Altun, Y.: Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research (2005) 1453–1484
[19] Hayet, J.B., Piater, J., Verly, J.: Robust incremental rectification of sports video sequences. In: BMVC. (2004)
[20] Hayet, J.B., Piater, J.: On-line rectification of sport sequences with moving cameras. In: MICAI. (2007)
[21] Gedikli, S., Bandouch, J., Hoyningen-Huene, N.V., Kirchlechner, B., Beetz, M.: An adaptive vision system for tracking soccer players from variable camera settings. In: ICVS. (2007)
[22] Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision. Second edn. Cambridge University Press, ISBN: 0521540518 (2004)
[23] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic image segmentation with deep convolutional nets and fully connected CRFs. (2015)
[24] von Gioi, R.G., Jakubowicz, J., Morel, J.M., Randall, G.: LSD: a line segment detector. Image Processing On Line (2012)
[25] Lampert, C.H., Blaschko, M.B., Hofmann, T.: Efficient subwindow search: A branch and bound framework for object localization. PAMI (2009)
[26] Schwing, A.G., Hazan, T., Pollefeys, M., Urtasun, R.: Efficient structured prediction for 3D indoor scene understanding. In: CVPR. (2012) 2815–2822
[27] Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: CVPR. (2001)
[28] Schwing, A., Fidler, S., Pollefeys, M., Urtasun, R.: Box in the box: Joint 3D layout and object reasoning from single images. In: ICCV. (2013)
[29] Hedau, V., Hoiem, D., Forsyth, D.: Recovering the spatial layout of cluttered rooms. In: ICCV. (2009)
[30] Fitzgibbon, A., Pilu, M., Fisher, R.B.: Direct least square fitting of ellipses. PAMI (1999)
[31] Meijster, A., Roerdink, J.B., Hesselink, W.H.: A general algorithm for computing distance transforms in linear time. In: Mathematical Morphology and its Applications to Image and Signal Processing. (2002)