Spin Detection in Robotic Table Tennis
In table tennis the rotation (spin) of the ball plays a crucial role. A table tennis match will feature a variety of strokes. Each generates different amounts and types of spin. To develop a robot which can compete with a human player, the robot needs to be able to detect spin, so that it can plan an appropriate return stroke. In this paper we compare three methods for estimating spin. The first two approaches use a high-speed camera that captures the ball in flight at a frame rate of 380 Hz. This camera allows the movement of the circular brand logo printed on the ball to be seen. The first approach uses background difference to determine the position of the logo. In a second alternative, a CNN is trained to predict the orientation of the logo. The third method evaluates the trajectory of the ball and derives the rotation from the effect of the Magnus force. In a demonstration, our robot must respond to different spin types in a real table tennis rally against a human opponent.
One of the most difficult tasks when playing table tennis is judging the amount of spin on a ball. To achieve the goal of beating human players of different levels, a table tennis robot needs to be able to accurately predict spin. A lot of prior knowledge is required to assign the right spin to a shot. The major factor used by human players to judge spin is the opponent’s stroke. It is, however, difficult to detect stroke movement with a camera. Such an approach would also require training with a number of different people and rackets.
Some professional players with excellent eyesight are able to see the rotation of the ball from the movement of the brand logo. By recording the ball with high-speed cameras, it is possible to identify markers on the ball and detect its rotation. This is the most common approach in the literature. Tamaki et al. [1] use black lines on the ball for tracking. Zhang et al. [2, 3] use the logo printed on the ball. Building on Zhang et al., the first method described in this paper uses a high-speed camera at 380 fps to capture the ball and estimates the spin from the movement of the brand logo on the ball.
Another promising approach is to use measurements of the ball’s trajectory to determine spin. In our third approach we detect the spin from the ball’s trajectory through the effect of the Magnus force. Huang et al. [4] used a similar approach, involving a physical force model which included the Magnus force, to determine the rotation of the ball. Zhao et al. [5, 6] replace the norm of the velocity necessary to calculate the air resistance. Thus, a differential equation can be solved and one can fit the speed and spin values. Blank et al. [7] capture stroke motion using an IMU mounted on the bat to predict the rotation of the ball. Gao et al. [8] track the table tennis bat using stereo cameras and use a neural network to classify the different types of strokes.
Going beyond the topic of this research work, these results could have an impact on spin detection in other research areas, especially research focusing on sports with fast flying and strongly rotating balls, like tennis, baseball or football. As well as developing robots for these sports, spin detection can also be used for match analysis or for evaluating and improving player technique. There are various general robotic applications where it is necessary to determine the rotation of objects. In the case of table tennis, processing time is the key factor in determining whether or not the application will be successful. In modern highly-dynamic robotic systems, time-optimized object pose detection is essential, e.g. when grasping objects in human-robot collaboration or during autonomous driving in high-speed traffic.
II Spin Estimation from the Brand Logo by Background Subtraction
A PointGrey Grasshopper3 camera is mounted on the ceiling above the center of the table tennis table. The camera can achieve 162 fps at full resolution (1920 x 1200). A very high ball spin exceeds 100 revolutions per second. At 162 fps, the ball’s brand logo would then be visible only every second frame. We therefore selected a ROI of 1920 x 400 and record at 380 fps, which is possible with this camera type.
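The frame-rate argument can be checked with a few lines of arithmetic. The sketch below computes how far a ball spinning at 100 revolutions per second turns between two consecutive frames at the two frame rates mentioned above. The function name is illustrative and not part of the described system.

```python
def rotation_per_frame_deg(spin_rps: float, fps: float) -> float:
    """Angle in degrees the ball turns between two consecutive frames."""
    return 360.0 * spin_rps / fps


full_res = rotation_per_frame_deg(100.0, 162.0)  # about 222 degrees
roi = rotation_per_frame_deg(100.0, 380.0)       # about 95 degrees
print(f"{full_res:.1f} deg/frame at 162 fps, {roi:.1f} deg/frame at 380 fps")
```

At 162 fps the ball turns by more than half a revolution per frame, so the logo alternates between the visible and the hidden side. At 380 fps the step is roughly a quarter revolution.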
II-A Ball Detection
II-B Logo Contour Detection
Ball detection yields an image containing only the ball. The process of marking all the pixels that belong to the brand logo is described in figure 2. For all pixels of the logo contour, we want to know the 3D positions on the ball.
II-C 3D Projection
The ball extraction also gives the radius of the ball in pixels. This is calculated by fitting a circle to the ball blob. For each contour pixel, we first calculate its position relative to the ball’s center. The $x$ and $y$ components are then divided by the ball’s radius. Since our 3D point must lie on the unit sphere, the $z$ component can be derived from $x^2 + y^2 + z^2 = 1$, i.e. $|z| = \sqrt{1 - x^2 - y^2}$, with the sign fixed by the hemisphere facing the camera.
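The projection step can be sketched as follows. This is a minimal illustration, assuming the hemisphere facing the camera has non-negative $z$. The function name and the sample pixel values are hypothetical.

```python
import math


def contour_to_sphere(pixels, center, radius_px):
    """Map 2D contour pixels to 3D points on the unit sphere.

    pixels: iterable of (px, py) pixel coordinates.
    center: (cx, cy) ball center in pixels; radius_px: fitted ball radius.
    Assumption: the hemisphere facing the camera has z >= 0.
    """
    points = []
    for px, py in pixels:
        x = (px - center[0]) / radius_px
        y = (py - center[1]) / radius_px
        # max() guards against pixels marginally outside the fitted circle.
        z = math.sqrt(max(0.0, 1.0 - x * x - y * y))
        points.append((x, y, z))
    return points


points = contour_to_sphere([(30, 30), (45, 30), (30, 18)], center=(30, 30), radius_px=25.0)
```

The pixel at the ball center maps to the pole of the visible hemisphere, while pixels closer to the rim map to points with smaller $z$.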
II-D Brand Logo Center
The next step involves calculating the logo center. In our first approach, we simply normalize the average of all 3D contour points. This does not take into account the fact that contour points closer to the ball’s center in the image are more frequent. The centroid can also be calculated iteratively using Ritter’s bounding sphere [10] on the 3D contour points. Normalization projects the centroid onto the unit sphere. As this did not significantly boost accuracy, we used the first approach, which was faster.
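A minimal sketch of the first variant (normalized mean of the contour points) is given below. The synthetic contour is a hypothetical test input, not data from the experiments.

```python
import math


def logo_center(points_3d):
    """Normalized mean of the 3D contour points (the faster variant)."""
    n = len(points_3d)
    mean = [sum(p[i] for p in points_3d) / n for i in range(3)]
    norm = math.sqrt(sum(c * c for c in mean))
    if norm == 0.0:
        raise ValueError("mean of the contour points is the zero vector")
    return tuple(c / norm for c in mean)


# Synthetic contour: a small circle around the direction (0, 0, 1).
contour = [
    (math.sin(0.3) * math.cos(phi), math.sin(0.3) * math.sin(phi), math.cos(0.3))
    for phi in (2.0 * math.pi * k / 12 for k in range(12))
]
center = logo_center(contour)
```

For an evenly sampled circular contour the normalized mean coincides with the circle's axis. For unevenly sampled contours it is biased towards the denser part, which is the effect described in the text.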
II-E Circular Segment Fitting
On the camera images only one side of the ball is visible. Therefore, brand logos may be only partially in view when they are located at the edge of the shown area. Figure 3 shows a contour transformed into the 3D ball coordinate space for such an edge case. In this case the contour does not form a circle but a 2D circle segment, so the center position cannot be obtained as before. We approximate the area from the contour points (green crosses in figure 3) and its centroid by
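We know the actual radius $R$ of the logo from measurements. We can therefore derive the angle $\theta$ from the area $A$:
$$A = \frac{R^2}{2}\,(\theta - \sin\theta).$$
The distance $d$ from the centroid to the real center, see [11], is given by
$$d = \frac{4 R \sin^3(\theta/2)}{3\,(\theta - \sin\theta)}.$$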
To get the real 3D center we rotate the centroid by angle around the axis . The circular segment fitting stabilizes the spin detection for the challenging edge case compared to the original approach of Zhang et al.
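The step from area to center offset can be illustrated numerically. The sketch below assumes the standard planar circular-segment relations, $A = \frac{R^2}{2}(\theta - \sin\theta)$ for the area and $d = \frac{4R\sin^3(\theta/2)}{3(\theta - \sin\theta)}$ for the centroid offset, and solves the first one for $\theta$ by bisection. The solver and the function names are illustrative choices and not necessarily what the original implementation uses.

```python
import math


def segment_angle(area: float, radius: float, tol: float = 1e-12) -> float:
    """Solve area = radius**2 / 2 * (theta - sin(theta)) for theta by bisection."""
    target = 2.0 * area / radius**2
    if not 0.0 < target < 2.0 * math.pi:
        raise ValueError("area must lie between 0 and the full circle area")
    lo, hi = 0.0, 2.0 * math.pi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # theta - sin(theta) is monotonically increasing on [0, 2*pi].
        if mid - math.sin(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def centroid_offset(theta: float, radius: float) -> float:
    """Distance from the circle center to the centroid of the segment."""
    return 4.0 * radius * math.sin(theta / 2.0) ** 3 / (3.0 * (theta - math.sin(theta)))


# Sanity check with a half disc: theta = pi and offset = 4R / (3 pi).
theta = segment_angle(area=math.pi / 2.0, radius=1.0)
offset = centroid_offset(theta, radius=1.0)
```

The half-disc case reproduces the well-known centroid of a semicircle, which is a quick way to validate the two relations before applying them to partially visible logos.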
II-F Fitting Rotation
After processing to images captured every from the ball’s trajectory as described above, we can estimate the ball’s spin. For this purpose, we fit a plane through the center points. The fitted plane should minimize the distance to the points. Additionally, the distance of the points to the circle created by intersecting the plane with the ball, represented as the unit sphere, should be minimized. An example is shown in figure 5.
To get the angular velocity, we project the logo positions onto the plane and calculate the angle between two consecutive logo positions. If the logo was not found on two or more successive images, the ball has made a half revolution. The rotation is therefore described not by the short angle between the points before and after but by the large angle . At the end we have a sequence of the accumulated angles and fit a regression line to the sequence. The gradient of that line gives us the angular velocity.
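The fitting procedure can be sketched with NumPy as follows. This is a simplified illustration under stated assumptions: the plane is fitted by plain least squares (the additional circle-distance term is omitted), consecutive observations are assumed to be less than half a revolution apart (so missing frames are not handled), and the logo is assumed not to sit on the rotation axis. The function name and the synthetic data are hypothetical.

```python
import numpy as np


def estimate_spin(centers, times):
    """Estimate rotation axis and angular velocity (rad/s) from logo centers.

    centers: (N, 3) logo centers on the unit sphere; times: (N,) in seconds.
    """
    centers = np.asarray(centers, dtype=float)
    times = np.asarray(times, dtype=float)
    # Least-squares plane: its normal is the direction of least variance.
    _, _, vt = np.linalg.svd(centers - centers.mean(axis=0))
    u, w = vt[0], vt[1]
    axis = np.cross(u, w)  # right-handed with the in-plane basis (u, w)
    # Project onto the plane and accumulate the in-plane angles.
    angles = np.unwrap(np.arctan2(centers @ w, centers @ u))
    # The slope of the regression line is the angular velocity.
    omega = np.polyfit(times, angles, 1)[0]
    if omega < 0.0:  # report a non-negative rate about a flipped axis
        axis, omega = -axis, -omega
    return axis, float(omega)


# Synthetic example: 50 revolutions per second about the z axis, sampled at 380 Hz.
true_omega = 2.0 * np.pi * 50.0
t = np.arange(8) / 380.0
pts = np.column_stack([0.6 * np.cos(true_omega * t), 0.6 * np.sin(true_omega * t), np.full(8, 0.8)])
axis, omega = estimate_spin(pts, t)
```

Because the angles are measured about the axis through the ball center rather than about the mean of the observed points, the estimate stays unbiased even when the observations cover only part of the circle.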
To evaluate logo position detection, we use a dataset of balls with human-labeled orientations (see section III-A), which we collected to train the CNN as described in the next section. The total dataset includes 4656 images. In 245 images, the algorithm wrongly found no brand logo. There were also 40 images where a logo was found, but the human did not label it. For the rest there were 2080 images without visible logo and 2291 with a visible logo. On these balls, the average angular error to the labeled ground truth logo position was with a standard deviation of .
III Spin Estimation by CNN on Ball Image
Our second approach uses a Convolutional Neural Network (CNN) to estimate the visibility of the logo and the 3D pose of the ball. We then use the same algorithm as in section II-F to estimate the rotation axis and angular velocity.
To train and test the network, a total of 4656 images were recorded using our PointGrey Grasshopper3 camera. The images were cropped around the table tennis ball to have a fixed size of 60 x 60 pixels. 46.7% were labeled as having no visible brand logo. The ball’s pose was labeled with the help of a 3D scene containing a ball with realistic logo texture. The 3D scene was modeled with the open-source 3D computer graphics software Blender [12]. Each real ball image was placed transparently over the scene. Next, the 3D ball model can be optimized to fit the actual image and the pose can be read out by the Blender Python API.
In addition to the recorded data, augmentation was used to extend the dataset and make the network more robust. The following augmentation procedures have been tested: rotations, Gaussian noise (stdev ), Gaussian blur (stdev ), change of brightness(), saturation () and hue (). The results are given in table II in the experiments section.
III-C Network Architecture
Related work on pose detection with neural networks favours the residual network architecture from He et al. [13]. In Mahendran et al. [14] the top performing network is an 18-layer ResNet. Salehi et al. [15] have chosen a ResNet-18 as well.
III-D Network Output
There are several mathematical representations of a rotation. One can use rotation matrices, Euler angles, the axis-angle representation, or quaternions. Rotation matrices are not a good fit for the output of our network, as more parameters need to be estimated and one needs to ensure that the result is within the matrix subgroup of rotation matrices. With Euler angles it is difficult to represent continuous rotations. As a result, we trained networks to predict the pose of the table tennis ball in either axis-angle representation or in quaternions. For either representation, the output is concatenated with a real number for the visibility of the brand logo. The range of the visibility value is to match the -positions away from the camera. In the dataset non-visible logos are labeled as .
|model||GAP||FC||dropout||train. loss (cond.)||test loss (cond.)||classification accuracy||geodesic in deg.||vector angle in deg.|
III-E Loss Functions and Metrics
The proposed neural network has to learn two tasks simultaneously. It needs to classify whether the brand logo of the ball is visible and predict the pose of the ball. If the logo is not visible, the pose cannot be determined. In this case, the network should not learn any incorrect poses. Hence, we define a conditional loss function that splits the loss into the two tasks:
where denotes the binary ground truth visibility value. For a visible logo, the value is . Otherwise it is . Therefore, we call it conditional loss.
When outputting in axis angle or quaternion representation, we adjust the pose losses for ambiguity. In both representations, the negative value gives the same orientation since the rotation in the opposite direction about the negative axis corresponds to the original rotation. We tested both and norms to get the following conditional losses:
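The idea of a conditional, sign-invariant loss can be sketched in plain Python. The exact form of the loss used in the experiments is not reproduced here. The sketch assumes a squared-error visibility term, a Euclidean pose term, and a ground truth visibility of 1 for a visible logo and 0 otherwise. All of these are illustrative assumptions.

```python
import math


def conditional_loss(pred_pose, pred_vis, true_pose, true_vis):
    """Visibility loss plus a pose loss that only counts for visible logos.

    pred_pose, true_pose: lists of quaternions (4-tuples).
    pred_vis, true_vis: lists of floats; true_vis is assumed to be 1 for a
    visible logo and 0 otherwise. Illustrative form, not the paper's exact loss.
    """
    total = 0.0
    for qp, vp, qt, vt in zip(pred_pose, pred_vis, true_pose, true_vis):
        vis_term = (vp - vt) ** 2
        # q and -q describe the same orientation, so use the closer of the two.
        dist_pos = math.sqrt(sum((a - b) ** 2 for a, b in zip(qt, qp)))
        dist_neg = math.sqrt(sum((a + b) ** 2 for a, b in zip(qt, qp)))
        total += vis_term + vt * min(dist_pos, dist_neg)
    return total / len(true_vis)


truth = [(1.0, 0.0, 0.0, 0.0), (1.0, 0.0, 0.0, 0.0)]
# First sample: sign-flipped but equivalent pose. Second sample: logo not visible.
prediction = [(-1.0, 0.0, 0.0, 0.0), (0.0, 1.0, 0.0, 0.0)]
loss_perfect = conditional_loss(prediction, [1.0, 0.0], truth, [1.0, 0.0])
loss_vis_error = conditional_loss(prediction, [0.5, 0.0], truth, [1.0, 0.0])
```

In this example the wrong pose of the second sample does not contribute, because its logo is labeled as not visible, and the sign-flipped quaternion of the first sample is not penalized.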
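A more complex, but fairly exact measurement of the accuracy of rotations is the geodesic distance in $SO(3)$. For two rotations this metric returns the angle (from axis-angle representation) of the rotation aligning them both. If $R_1, R_2$ are rotation matrices, the geodesic distance is calculated as
$$d(R_1, R_2) = \arccos\left(\frac{\operatorname{tr}(R_1^T R_2) - 1}{2}\right).$$
For quaternions $q_1, q_2$ the geodesic distance is computed by
$$d(q_1, q_2) = 2 \arccos\left(\lvert \langle q_1, q_2 \rangle \rvert\right),$$
where $\lvert \cdot \rvert$ denotes the absolute value and $\langle \cdot, \cdot \rangle$ is the inner product of 4-dimensional quaternion vectors. As before, we define a new loss function based on this distance.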
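Both forms of the geodesic distance can be checked against each other with a small example. The sketch below uses the standard definitions, $\arccos((\operatorname{tr}(R_1^T R_2) - 1)/2)$ for rotation matrices and $2\arccos(|\langle q_1, q_2\rangle|)$ for unit quaternions, with quaternions written as $(w, x, y, z)$. The component order is an assumption of this example.

```python
import math


def geodesic_matrix(r1, r2):
    """Angle (rad) of the rotation aligning two 3x3 rotation matrices."""
    # trace(r1^T r2) equals the sum of the element-wise products.
    trace = sum(r1[i][j] * r2[i][j] for i in range(3) for j in range(3))
    return math.acos(max(-1.0, min(1.0, (trace - 1.0) / 2.0)))


def geodesic_quaternion(q1, q2):
    """Angle (rad) between two unit quaternions; q and -q count as equal."""
    dot = abs(sum(a * b for a, b in zip(q1, q2)))
    return 2.0 * math.acos(min(1.0, dot))


identity_m = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
quarter_turn_m = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # 90 deg about z
s = math.sqrt(0.5)
identity_q = (1.0, 0.0, 0.0, 0.0)
quarter_turn_q = (s, 0.0, 0.0, s)  # same rotation as a quaternion

d_matrix = geodesic_matrix(identity_m, quarter_turn_m)
d_quat = geodesic_quaternion(identity_q, quarter_turn_q)
```

Both calls return a quarter turn, and negating either quaternion leaves the result unchanged because of the absolute value.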
The most difficult part of the rotation for the networks to determine is the logo’s orientation about its center. We therefore also want to evaluate the accuracy of the network’s prediction of the position of the logo on the ball only, i.e. without considering whether it is rotated in itself. For that we need an additional metric not affected by the orientation. We convert the rotation of the ball to logo positions, represented by points on the unit sphere, by rotating the base logo position . The metric is then the vector angle describing the angle between two positions.
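A minimal sketch of the vector angle metric is given below. The base logo position `(0, 0, 1)` and the quaternion component order `(w, x, y, z)` are hypothetical choices made for this example only.

```python
import math

BASE_LOGO_POSITION = (0.0, 0.0, 1.0)  # hypothetical base position


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])


def rotate(q, v):
    """Rotate vector v by the unit quaternion q = (w, x, y, z)."""
    w, u = q[0], q[1:]
    t = tuple(2.0 * c for c in cross(u, v))
    ut = cross(u, t)
    return tuple(v[i] + w * t[i] + ut[i] for i in range(3))


def vector_angle(q_pred, q_true, base=BASE_LOGO_POSITION):
    """Angle (rad) between the logo positions implied by two ball poses."""
    p, t = rotate(q_pred, base), rotate(q_true, base)
    dot = sum(a * b for a, b in zip(p, t))
    return math.acos(max(-1.0, min(1.0, dot)))


s = math.sqrt(0.5)
identity = (1.0, 0.0, 0.0, 0.0)
about_logo_axis = (s, 0.0, 0.0, s)  # 90 deg about the logo's own axis
about_x = (s, s, 0.0, 0.0)          # 90 deg about the x axis

angle_in_place = vector_angle(about_logo_axis, identity)
angle_moved = vector_angle(about_x, identity)
```

A rotation of the logo about its own center gives a vector angle of zero although the geodesic distance is a quarter turn, which is exactly the insensitivity to in-place rotation that the metric is meant to provide.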
The network is used on several images of the ball trajectory. For the final spin estimation the poses outputted from the networks are converted to logo positions as described in the previous paragraph. We then use the same algorithm as in section II-F to estimate rotation axis and angular velocity.
IV CNN Experiments
IV-A Training Setup
The dataset from section III-A was split into training and test set with an 80:20 ratio. As a result, 3725 images were used for training and 931 for testing. The networks were trained with TensorFlow using an Nvidia GeForce GTX 1080 Ti graphics card.
IV-B Architecture Tuning
In the first experiment we wanted to tweak the architectures proposed in section III-C. The baseline ResNet model was tested with several modifications. For comparison, we also included a VGG network [16] with modifications similar to the best ResNet variant. All models were optimized with stochastic gradient descent including a momentum of . An initial learning rate of was decayed by every steps. All models were set to output quaternions and were trained using the conditional loss from section III-E.
The results of the different models can be seen in table I. Variants A-G are built upon a ResNet architecture. Network A is a standard 18-layer ResNet with global average pooling (GAP) after the convolutional layers at the end, as proposed in the original ResNet paper. Expanding the idea of Mahendran et al. [14], networks B-D use two additional fully-connected layers (FC) with 512 neurons each right before the final regressor. This modification should improve the transformation from feature space to pose space. In E-G we incorporate a combination of the two approaches starting with a GAP layer. Within each of these categories the models differ only in the dropout rate. An exception is model C, where an initial learning rate of was necessary.
As expected, most ResNet variants outperformed the VGG network. The best results were obtained by network G using both GAP and FC layers at a dropout rate of .
IV-C Augmentation Comparison
We continue to use model G with the different augmentation techniques from section III-B. The results can be seen in table II. Only the rotation and the noise increased the accuracy. For further experiments we therefore used the augmentation model N with both of these methods.
|model||augmentation||train. loss (cond.)||test loss (cond.)||class. accuracy||geodesic in deg.||vector angle in deg.|
IV-D Test of Output Representations and Loss Functions
In this section we compare the proposed loss functions as well as the two representations, axis angle and quaternions, as described in section III-E. While a network can theoretically transform one representation into the other, the performance may differ depending on the target output. Mahendran et al. [14] observed very different results for these rotation representations. We tested the conditional and loss, the unconditional loss for comparison, and the geodesic loss function. For this test the hyperparameters from before are used except for the initial learning rate. The initial learning rate for the geodesic loss was 0.01. For the unconditional loss it was also set to 0.01.
|loss function||rotation representation||train. loss||test loss||geodesic in deg.||vector angle in deg.|
In our experiments quaternions outperformed axis angles, except for the unconditional loss. This is somewhat unexpected, as the conversion between these representations is particularly simple. It may result from the fact that it is slightly easier to apply rotations by using quaternions. The difference could also be caused by the additional scaling. The scaling is required to bring the angle part of the axis angle into the range in order to work with activation. As expected, the conditional loss has an advantage over the unconditional loss. Conditional and geodesic losses for quaternions are on par with each other when evaluated on the geodesic metric, but the conditional loss has a lower vector angle error. This is worth noting, as it describes the accuracy for finding the correct logo center, which we use to derive the spin of the ball.
IV-E Inference Time and Model Complexity
In our scenario, it is not just accuracy that matters; the time for the evaluation (inference time) is also important. From the camera we get images at 380 Hz. This results in a processing time budget of about 2.6 ms per image for real-time performance. Segmenting the ball out and cropping takes ms, leaving ms for the network. The first tested network achieves an inference time of ms. We try two different ways to accelerate the model. Firstly, we create shallower networks by removing layers. Secondly, we reduce the breadth by halving or quartering the number of feature maps per layer. The compared networks are presented in table IV and the results can be found in table V. The inference time can be reduced by more than half without significant loss of accuracy. The deeper models with fewer features per layer work slightly better. The model at the bottom with 20 layers and quartered filter breadth appears to be a good candidate for our system.
|layer name||output size||20 weight||18 weight||14 weight||10 weight||6 weight|
|conv3_x||||, 64, stride 2|
|||global average pooling|
|layers||accuracy||in deg.||in deg.||time in ms|
Our best performing network is an 18-layer ResNet plus global average pooling and two fully connected layers (see table IV) trained with rotation and noise augmentation:
|layers||class. acc.||geodesic||vector angle|
V Spin Estimation from the Trajectory
In this section we introduce a way of estimating the spin from the trajectory of the ball. We utilize the fact that the rotation of the ball acts on the ball via the Magnus force. Previous work on the topic has been done by Huang et al. [4].
The forces are depicted in figure 7. The gravitational force is directed towards the ground. The drag force coming from the air resistance acts in the opposite direction to the flight of the ball. The Magnus force is perpendicular to the rotation axis and the flight direction. The acceleration of the ball is therefore calculated by
$$\dot{v} = -k_D \lVert v \rVert\, v + k_M\, \omega \times v - (0, 0, g)^T.$$
The notation is shortened with $k_D = \frac{1}{2m} C_D \rho A$ and $k_M = \frac{1}{2m} C_M \rho A r$, where the constants are the mass of the ball $m$, the gravitational acceleration $g$, the drag coefficient $C_D$, the density of the air $\rho$, the lift coefficient $C_M$, the ball radius $r$, and the ball’s cross-section $A$. For a ball with medium to heavy spin the forces all have similar magnitudes, as can be seen in figure 9.
Given 3D observations of the ball with ball positions at times we first fit a curve to get a smoothed trajectory. For each axis we fit a third-degree polynomial using a standard least-squares algorithm. Then, we have a function representing a smoothed version of the trajectory. In addition, we also have the speed with its derivative .
Next, we want to fit the spin of the ball. We choose a time span and equidistant time points between and . At each time point we take the speed state . Using a no-spin motion model considering only gravitation and drag, we predict a look-ahead speed value at time denoted by . The difference should be due to the Magnus force. Assuming constant acceleration within this time step we have
for each . The whole equation system can be transformed into with a matrix and an m-dimensional vector of accelerations . We then get a least-squares solution for . In figure 9 the forces are shown for each step of the look-ahead fitting of an example trajectory.
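The overall idea (smooth the trajectory, remove gravity and drag, attribute the remainder to the Magnus term, solve by least squares) can be sketched with NumPy. This is a simplified variant and not the look-ahead scheme described above: it uses the acceleration of the fitted polynomial directly. The force model $\dot{v} = g - k_D \lVert v \rVert v + k_M\, \omega \times v$ and the numeric constants `K_D` and `K_M` are assumptions chosen for illustration.

```python
import numpy as np

G = np.array([0.0, 0.0, -9.81])  # gravity in m/s^2
K_D = 0.11    # assumed drag constant in 1/m
K_M = 0.0034  # assumed Magnus constant (dimensionless)


def smooth_states(times, positions, eval_times):
    """Fit a cubic polynomial per axis; return velocities and accelerations."""
    positions = np.asarray(positions, dtype=float)
    vel, acc = [], []
    for axis in range(3):
        poly = np.polynomial.Polynomial.fit(times, positions[:, axis], deg=3)
        vel.append(poly.deriv(1)(eval_times))
        acc.append(poly.deriv(2)(eval_times))
    return np.column_stack(vel), np.column_stack(acc)


def skew(v):
    """Matrix [v]x such that skew(v) @ w equals the cross product v x w."""
    return np.array([[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]])


def fit_magnus_spin(velocities, accelerations):
    """Least-squares angular velocity (rad/s) explaining the residual acceleration."""
    rows, rhs = [], []
    for v, a in zip(velocities, accelerations):
        # What remains after removing gravity and drag is attributed to Magnus.
        rhs.append(a - G + K_D * np.linalg.norm(v) * v)
        # k_M * (omega x v) = -k_M * [v]x @ omega, which is linear in omega.
        rows.append(-K_M * skew(v))
    omega, *_ = np.linalg.lstsq(np.vstack(rows), np.concatenate(rhs), rcond=None)
    return omega


# Synthetic check: accelerations generated from the same model.
omega_true = np.array([30.0, -200.0, 10.0])
velocities = np.array([[5.0, 1.0, 2.0], [4.5, 0.8, 1.0], [4.0, 0.6, 0.0], [3.6, 0.5, -1.0]])
accelerations = np.array(
    [G - K_D * np.linalg.norm(v) * v + K_M * np.cross(omega_true, v) for v in velocities]
)
omega_fit = fit_magnus_spin(velocities, accelerations)
```

A single velocity sample cannot determine the spin component parallel to the flight direction, since it produces no Magnus force. Stacking several samples with changing flight direction makes the system well determined.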
V-B Preprocessing: Outlier filtering
The process is sensitive to outliers. Even when they have only a slight impact on the fitted trajectory, outliers can produce unrealistic fitted spin values. Misrecognitions occur especially at the beginning of the trajectory, when a part of the human body, e.g. the hand, is detected instead of the ball due to its roughly circular shape. For the first 20 balls we select the last 5 balls and make a polynomial fit as above. If the error for the ball is below a specific threshold we start again with balls to otherwise we remove ball as outlier. Repeating this process we remove detected objects which do not belong to the trajectory at the beginning.
With the position , speed and spin we predict the future trajectory. The improvement for the prediction can be seen in table VI. We tested backspin, side spin and topspin at three different speed settings with our TTmatic ball throwing machine. For comparison, a Kalman filter is used to predict only position and speed without considering the angular velocity of the ball. The statistic includes 90 trajectories in total, 10 for each of the nine settings. The estimated spin values significantly improve bounce estimation. In contrast to the first two approaches using the ball’s logo, the spin can be used for predicting the future trajectory without adjusting the Magnus coefficient . As we divide by it for spin estimation, we multiply by it again for prediction. For the other methods, we found no constant independent of the spin type that gave good results.
|With fitted spin||Without spin value|
In this paper, we looked at three different algorithms to detect the spin of a table tennis ball. The first two approaches can be compared by evaluating the angular error between the actual and the predicted logo position. The background subtraction method gives a larger angular error of than the most accurate convolutional neural network with an error of . A fast network gave an error of . However, background subtraction is much faster, with a processing time of ms, compared to the most accurate and the fast network at ms and ms, respectively.
Unfortunately, there are no ground truth values available for the final spin prediction, so the methods cannot be compared directly. We therefore evaluate how well the algorithms classify spin. With our TTmatic ball throwing machine, 50 trajectories were recorded for each of nine settings: three types of spin, each at three different strengths. The machine does not allow the speed and rotation of balls to be set independently, so faster spin is accompanied by a higher velocity. For each algorithm and setting, the median spin is calculated. This 3D vector defines a cluster center in three-dimensional space. Each spin value is then assigned to the nearest cluster center. The accuracy indicates what percentage of the trajectories belonging to a setting has been assigned to the corresponding cluster. The results are shown in table VI.
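The evaluation protocol can be sketched in plain Python. The component-wise median and the Euclidean distance are assumptions of this sketch, and the setting names and spin values are made-up illustrative data, not measurements.

```python
import math
from statistics import median


def cluster_accuracy(spins_by_setting):
    """Fraction of trajectories per setting assigned to their own median.

    spins_by_setting: dict mapping a setting name to a list of 3D spin vectors.
    """
    names = list(spins_by_setting)
    # One cluster center per setting: the component-wise median spin.
    centers = [
        tuple(median(s[i] for s in spins_by_setting[name]) for i in range(3))
        for name in names
    ]
    accuracy = {}
    for idx, name in enumerate(names):
        spins = spins_by_setting[name]
        hits = 0
        for spin in spins:
            dists = [math.dist(spin, c) for c in centers]
            hits += dists.index(min(dists)) == idx
        accuracy[name] = hits / len(spins)
    return accuracy


example = {
    "topspin": [(0.0, 100.0, 0.0), (0.0, 110.0, 0.0), (0.0, 90.0, 0.0)],
    "backspin": [(0.0, -100.0, 0.0), (0.0, -95.0, 0.0), (0.0, 20.0, 0.0)],
}
result = cluster_accuracy(example)
```

In the example the third backspin value lies closer to the topspin median and is therefore counted as misclassified, so the backspin accuracy is two thirds.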
Surprisingly, the algorithms are similar in accuracy. A drop in performance is noticeable for balls with a lot of side spin. In many of these cases, the logo rotates around itself at the top of the ball and hardly changes position. Here the first two variants reach their limit. When the same situation occurs on the invisible side of the ball, the logo cannot be seen and evaluated with these methods. The third algorithm does not depend on the brand logo, although it had difficulty distinguishing between the medium and high backspin because the two median values were relatively close to each other. All in all, a good classification can be achieved with all methods. An improvement would probably be achieved by combining an approach using the brand logo with the Magnus force fitting.
VII Evaluation on a Table Tennis Robot
The success of spin detection is demonstrated with a real table tennis robot. For this demonstration we used the trajectory Magnus force fitting. It is easier to set up and uses fewer resources, as no additional camera hardware is necessary. The ball’s positions are captured either way in order to predict its future trajectory. In a real table tennis environment, our KUKA Agilus KR6 R900 robot arm has to respond to different spin types. The return stroke is programmed simply but efficiently by moving the table tennis racket attached to the robot with a velocity of m/s towards the ball. The orientation of the bat is given in Euler angles in the order . The angles and are defined linearly dependent on the -position of the hitting point and the -angle is linearly dependent on the -component of the fitted rotational velocity of the ball. A video demonstration of the experiment is sent with this submission (and on YouTube: https://youtu.be/SjE1Ptu0bTo). The rubber of the bat is a professional table tennis rubber with high friction. A lot of spin therefore acts on the ball after contact with the bat. As far as we are aware, to date no other table tennis robot has achieved the feat of returning the ball under such challenging conditions. In future, we plan to go from cooperative to competitive strokes. Although our robot control approach is effective for cooperative spin play, it clearly has limits in terms of adaptability. In a next step, the predicted spin may help to train the speed and orientation of the bat using reinforcement learning.
[1] T. Tamaki, H. Wang, B. Raytchev, K. Kaneda, and Y. Ushiyama, “Estimating the spin of a table tennis ball using inverse compositional image alignment,” in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2012, pp. 1457–1460.
[2] Y. Zhang, Y. Zhao, R. Xiong, Y. Wang, J. Wang, and J. Chu, “Spin observation and trajectory prediction of a ping-pong ball,” in 2014 IEEE International Conference on Robotics and Automation (ICRA), May 2014, pp. 4108–4114.
[3] Y. Zhang, R. Xiong, Y. Zhao, and J. Wang, “Real-time spin estimation of ping-pong ball using its natural brand,” IEEE Transactions on Instrumentation and Measurement, vol. 64, no. 8, pp. 2280–2290, Aug 2015.
[4] Y. Huang, D. Xu, M. Tan, and H. Su, “Trajectory prediction of spinning ball for ping-pong player robot,” in 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, Sept 2011, pp. 3434–3439.
[5] Y. Zhao, Y. Zhang, R. Xiong, and J. Wang, “Optimal state estimation of spinning ping-pong ball using continuous motion model,” IEEE Transactions on Instrumentation and Measurement, vol. 64, no. 8, pp. 2208–2216, Aug 2015.
[6] Y. Zhao, R. Xiong, and Y. Zhang, “Model based motion state estimation and trajectory prediction of spinning ball for ping-pong robots using expectation-maximization algorithm,” Journal of Intelligent & Robotic Systems, vol. 87, no. 3, pp. 407–423, Sep 2017. [Online]. Available: https://doi.org/10.1007/s10846-017-0515-8
[7] P. Blank, B. H. Groh, and B. M. Eskofier, “Ball speed and spin estimation in table tennis using a racket-mounted inertial sensor,” in Proceedings of the 2017 ACM International Symposium on Wearable Computers, ser. ISWC ’17. New York, NY, USA: ACM, 2017, pp. 2–9. [Online]. Available: http://doi.acm.org/10.1145/3123021.3123040
[8] Y. Gao, J. Tebbe, J. Krismer, and A. Zell, “Markerless racket pose detection and stroke classification based on stereo vision for table tennis robots,” in 2019 Third IEEE International Conference on Robotic Computing (IRC), Feb 2019, pp. 189–196.
[9] J. Tebbe, Y. Gao, M. Sastre-Rienietz, and A. Zell, “A table tennis robot system using an industrial kuka robot arm,” in Pattern Recognition, T. Brox, A. Bruhn, and M. Fritz, Eds. Cham: Springer International Publishing, 2019, pp. 33–45.
[10] J. Ritter, “Graphics gems,” A. S. Glassner, Ed. San Diego, CA, USA: Academic Press Professional, Inc., 1990, ch. An Efficient Bounding Sphere, pp. 301–303. [Online]. Available: http://dl.acm.org/citation.cfm?id=90767.90836
[11] Wikipedia contributors, “List of centroids — Wikipedia, the free encyclopedia,” 2019, [Online; accessed 27-February-2019]. [Online]. Available: https://en.wikipedia.org/w/index.php?title=List_of_centroids&oldid=883001666
[12] Blender Online Community, Blender - a 3D modelling and rendering package, Blender Foundation, Blender Institute, Amsterdam, 2016. [Online]. Available: http://www.blender.org
[13] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” Tech. Rep. [Online]. Available: http://image-net.org/challenges/LSVRC/2015/
[14] S. Mahendran, H. Ali, and R. Vidal, “3D Pose Regression using Convolutional Neural Networks,” Tech. Rep., 2017. [Online]. Available: https://shapenet.cs.stanford.edu/media/syn
[15] S. S. M. Salehi, S. Khan, D. Erdogmus, and A. Gholipour, “Real-time Deep Pose Estimation with Geodesic Loss for Image-to-Template Rigid Registration,” Tech. Rep. [Online]. Available: https://github.com/SadeghMSalehi/DeepRegistration
[16] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2015.