Identifying top football players and springboard clubs from football player collaboration and club transfer networks
We consider all players and clubs in the top twenty world football leagues in the last fifteen seasons. The purpose of this paper is to reveal the top football players and identify springboard clubs. To do that, we construct two separate weighted networks. The player collaboration network consists of players who are connected to each other if they ever played together at the same club. In the directed club transfer network, clubs are connected if players were ever transferred from one club to another. To get meaningful results, we apply different network analysis methods to our networks. Our approach based on PageRank reveals Cristiano Ronaldo as the top player. Using a variation of betweenness centrality, we identify Standard Liege as the best springboard club.
Keywords: football network, sports networks, network analysis, measures of centrality
Football is probably the most popular sport in the world, with around 265 million active players around the globe BCount (), and even more people enjoy watching it. Every year football clubs spend large sums of money in an attempt to build a strong team by buying good players from their rivals.
Most of the data available on official football unions or tournament websites normally only addresses a specific match, tournament or season. In order to collect the data for top leagues in the last few seasons, we need to look elsewhere.
A very interesting website from this perspective is www.transfermarkt.co.uk. It contains all the major leagues including all the clubs, rankings, players, information about the players and also their estimated market value.
In an attempt to analyse players and the connections between them, we construct a large network of professional football players from different clubs in different leagues. We are particularly interested in the influence of teammates on a football player and in whether it is possible to identify the best players based on knowing with whom they play now and where they played in the past. These analyses let us rank players according to different metrics. In a football player network, two players are connected to each other if they have ever played for the same club. Such a network can be represented by a bipartite graph consisting of clubs and players. Every player is connected to all the clubs he has played for, and through the nodes that represent clubs we are able to see which players played together for a specific club. For simpler analysis we separate these problems and project the bipartite graph onto a network constructed only from nodes representing football players. Two nodes are connected to each other if the players were ever teammates. This is an undirected network.
Apart from the analysis of players, we also want to identify the best springboard clubs, which are the players' entry point into the best football clubs in the world. Because we do not include information about clubs in the first network, we construct a second network, a club transfer network. Clubs from the top twenty leagues are the nodes, and two clubs are connected if any player was ever transferred from one to the other. The edge points from the club that sold the player to the club that bought the player.
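The projection from the bipartite player-club graph onto a player-only network can be sketched as follows. This is a minimal illustration, not the scripts used for the paper: the record layout `(player, club, season)`, the function name `project_players`, and the toy names are all assumptions, and the sketch treats two players as teammates only when they share a club in the same season.

```python
from collections import defaultdict
from itertools import combinations


def project_players(records):
    """Project (player, club, season) records onto a player-player network.

    Returns a dict mapping an unordered player pair (a frozenset) to the
    set of seasons in which both players were at the same club.
    """
    squads = defaultdict(set)  # (club, season) -> players in that squad
    for player, club, season in records:
        squads[(club, season)].add(player)

    shared = defaultdict(set)  # frozenset({p, q}) -> seasons played together
    for (club, season), players in squads.items():
        for p, q in combinations(sorted(players), 2):
            shared[frozenset((p, q))].add(season)
    return dict(shared)


# Hypothetical toy records, not taken from the paper's data set.
records = [
    ("player_a", "club_x", 2010),
    ("player_b", "club_x", 2010),
    ("player_a", "club_x", 2011),
    ("player_b", "club_y", 2011),
    ("player_c", "club_y", 2011),
]
edges = project_players(records)
```

In this toy input, `player_a` and `player_c` never share a squad, so the projection contains no edge between them. The set of shared seasons stored on each edge is the kind of information an edge weight can later draw on.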
Preliminary analysis on an unweighted undirected player collaboration network shows that weighted networks are needed in order to extract information about the best players. We expect very well known football players to come on top when analysing a weighted player collaboration network. In order to identify springboard clubs a weighted directed club transfer network has to be constructed. Weights of the edges are calculated using different equations that take into account multiple metrics. Using those networks we identify the top players in the world of football and the top springboard clubs.
2 Related Work
As pointed out in pena2012network (), football data has become more easily available in recent years, since FIFA has published more data on individual matches on its website. Many authors took advantage of that and constructed different networks in order to analyse them and extract information.
In pena2012network () the authors used some interesting approaches to reveal key players of a certain team, performing analysis on a passing network of a specific team. They showed we are able to identify different kinds of strategies of a team such as focusing passes on a single player or evenly distributing passes between all players in the team. They performed several analyses on a team passing network using very well known network analysis methods such as PageRank, Betweenness centrality and Closeness centrality.
Player contribution to a team was also analysed in duch2010quantifying (). They used a variation of betweenness centrality of the player with regard to opponent’s goal, which authors denoted as flow centrality. We use similar network theory methods, but we adapt them to test different theories. In cotta2013network () they dug a little deeper but followed the same idea. They only concentrated on one specific team and constructed more networks for the same match, introducing the time dimension.
Although there are various papers on football network analysis, the majority of football networks consider only a certain match or tournament. In this paper we construct a much larger network consisting of thousands of players. In other sports, such as cricket, some authors have already tried to identify the best individuals among all the players that played over a certain time period. Very interesting networks covering sportsmen throughout several decades were analysed in mukherjee2012identifying (); radicchi2011best (). In radicchi2011best () the authors attempted to find out who is the best tennis player in the history of the sport. We try to construct a somewhat similar network, but since football is very different from tennis the networks still differ a lot. The main difference is that football is a team sport, so we cannot simply link players based on their matches. Here players are connected based on their affiliation to a club.
3.1 Data Extraction and Network Construction
In this paper we analyse a large set of football players throughout the past fifteen seasons. In order to collect this data we use the site www.transfermarkt.co.uk, which is becoming the leading portal for information about football players. Several scripts are used to extract relevant data for different clubs and players. The network is constructed from players of the 20 most valuable football leagues from 2001 to 2016. The leagues and their values are presented in Table 3.
Using the gathered data we construct two separate networks, the first consisting of football players and the other of football clubs. The football player network is a player collaboration network where players are connected if they ever played together at the same club. It is an undirected weighted network consisting of 36,214 nodes and 1,412,232 edges. Other basic network properties are shown in Table 1. The club network is a directed transfer network between all the clubs in the top twenty world football leagues. Nodes represent clubs, and a club is connected to another club if a player was ever transferred from the first club to the second. It is a directed weighted network consisting of 330 nodes and 12,841 edges. Other basic network properties are shown in Table 2.
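As a rough illustration of how the directed club transfer network can be assembled, the sketch below counts transfers per ordered pair of clubs. The input format (a list of `(selling club, buying club)` pairs), the function name `build_transfer_network`, and the club names are assumptions made for the example only.

```python
from collections import Counter


def build_transfer_network(transfers):
    """Count transfers for every ordered (selling club, buying club) pair."""
    counts = Counter()
    for seller, buyer in transfers:
        if seller != buyer:  # skip degenerate records with identical clubs
            counts[(seller, buyer)] += 1
    return counts


# Hypothetical toy transfers, not taken from the paper's data set.
transfers = [
    ("club_x", "club_y"),
    ("club_x", "club_y"),
    ("club_y", "club_x"),
    ("club_y", "club_z"),
]
network = build_transfer_network(transfers)
nodes = {club for edge in network for club in edge}
```

Each key of `network` is a directed edge from seller to buyer, so transfers in opposite directions between the same two clubs are kept as separate edges.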
|Fraction of nodes in LCC||99.96%|
|Fraction of nodes in LCC||99.90%|
|League||Value [£]||League||Value [£]|
|Premier League (ENG)||3.01bn||Pro League (BEL)||354m|
|La Liga (ESP)||2.25bn||Primera Division (ARG)||306m|
|Serie A (ITA)||1.79bn||Premier Liga (UKR)||299m|
|1. Bundesliga (GER)||1.65bn||Super League (GRE)||227m|
|Ligue 1 (FRA)||1.06bn||Super League (SWI)||175m|
|Super Lig (TUR)||698m||MLS (USA)||162m|
|Premier Liga (RUS)||638m||Liga 1 (ROM)||118m|
|Serie A (BRA)||608m||1. HNL (CRO)||115m|
|Liga NOS (POR)||574m||Bundesliga (AUT)||110m|
|Eredivisie (NED)||375m||Premiership (SCO)||85m|
3.2 Player Network Analysis
In order to reveal the best players in our network, we choose an appropriate method of determining node importance. Since we want to identify the best players of the last fifteen seasons, we expect the best known and most valued names of football to be at the top of the list. So as not to neglect younger players, we also separate players into age groups and analyse each age group individually in order to identify the most promising players. Since our network is a collaboration network, we have to distinguish edges by their importance. The players that play with the best players are usually good themselves. Players with a lower value may change many clubs and teammates in a couple of seasons, but a value-based weighting penalises their edges. In general, player market value is a good indicator of the quality of a player. Therefore we choose market value as the core property for calculating the edge weight. Since our data spans fifteen seasons, we have to take inflation into account, so that good players that played in the past are not penalised. We gather the average inflation rate from InfRatio ().
The final formula for the weight of a specific edge (Equation 1) combines the market values of the two players connected by the edge, the seasons in which the two players played together, and the average inflation ratio per year for Europe in the last 13 years. The result is divided by 100,000 to obtain smaller numbers. To calculate which node is the most important, we choose one of the most popular node importance algorithms, PageRank page1997pagerank (). We calculate the PageRank score of every node in our weighted network. The most promising players are those with the highest score in their age group.
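To make the scoring step concrete, here is a minimal power-iteration sketch of PageRank on an undirected weighted network, where a node passes its score to neighbours in proportion to edge weight. It is an illustration under stated assumptions rather than the implementation used for the paper: the damping factor 0.85, the fixed iteration count, the function name `weighted_pagerank`, and the toy weights are all assumed.

```python
from collections import defaultdict


def weighted_pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank on an undirected weighted graph.

    edges maps a pair (u, v) to a positive weight; each edge is used in
    both directions. Every node has at least one edge, so there are no
    dangling nodes to handle.
    """
    neighbours = defaultdict(dict)
    for (u, v), weight in edges.items():
        neighbours[u][v] = weight
        neighbours[v][u] = weight

    nodes = sorted(neighbours)
    n = len(nodes)
    # Total edge weight per node, used to normalise outgoing shares.
    strength = {u: sum(neighbours[u].values()) for u in nodes}

    rank = dict.fromkeys(nodes, 1.0 / n)
    for _ in range(iterations):
        new_rank = dict.fromkeys(nodes, (1.0 - damping) / n)
        for u in nodes:
            for v, weight in neighbours[u].items():
                new_rank[v] += damping * rank[u] * weight / strength[u]
        rank = new_rank
    return rank


# Hypothetical toy network: one heavy edge and one light edge.
toy_edges = {("player_a", "player_b"): 10.0, ("player_b", "player_c"): 1.0}
rank = weighted_pagerank(toy_edges)
```

In the toy network, `player_b` sits on both edges and ends up with the highest score, while `player_c`, attached only by the light edge, ends up with the lowest. This mirrors the intuition that sharing heavily weighted edges with important teammates raises a player's score.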
3.3 Club Network Analysis
From the club transfer network we want to identify the springboard clubs. These are the clubs where younger players gather experience and are later sold to better or even the best clubs in the world. Similar to the player collaboration network, this network has to be weighted as well. We are able to extract the number of transfers in both directions for all pairs of clubs but the absolute number does not provide the necessary information for springboard club identification. Thus, we have to weight every edge, representing the number of transfers from one club to another, with a weight related to the importance of the destination club. The importance of the destination club is calculated using two different equations. One is based on average ranking of the destination club in the past fifteen seasons and the ranking of the league they play in, and the other one is based on the destination club value. Both equations are stated and explained below.
The weight in Equation 2 is calculated as the reciprocal of the destination club's average ranking in the past fifteen seasons, multiplied by our predefined ranking of the destination club's league. The predefined league rankings can be found in Table 4 and are defined for the purpose of this paper.
The weight in Equation 3 is calculated as the destination club's average value in the past fifteen seasons, divided by a constant to lower the weight values.
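The two destination-club weights can be written out as small functions that follow the verbal description only. The function and parameter names are assumptions. The divisor in the value-based weight is not recoverable from the text, so `scale` below is a placeholder rather than the constant used in the paper.

```python
def weight_by_rank(avg_rank, league_rank):
    """Rank-based weight: (1 / average club ranking) * league ranking."""
    return league_rank / avg_rank


def weight_by_value(avg_value, scale=1_000_000):
    """Value-based weight: average club value divided by a constant.

    The actual constant is missing from the text; `scale` is an
    assumed placeholder.
    """
    return avg_value / scale


# Hypothetical destination club: average league position 2 in a league
# with predefined ranking 100 (the Table 4 value for La Liga).
w_rank = weight_by_rank(avg_rank=2.0, league_rank=100)
# Hypothetical average club value of 250 million.
w_value = weight_by_value(avg_value=250_000_000)
```

With these inputs the rank-based weight is 50.0, and a club finishing fourth on average in the same league would get 25.0. A better average position or a stronger league therefore raises the weight of edges pointing to that club.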
To identify springboard clubs we have to choose a different method from the one we use for the player collaboration network. The most important features of this network are the transfer paths from less valuable clubs to the most valuable clubs. A club is considered a springboard if it is involved in many transfers to the most valuable clubs. Thus, betweenness centrality freeman1977set () is the most suitable measure. We implement the fast betweenness algorithm discussed in fastBet (). Since our network is weighted, we have to modify the proposed algorithm so that it takes weights into account. The only difference from the proposed algorithm is in the calculation of path lengths, where we do not add one for every hop but use the edge weight instead. Because a larger weight is better in our network and we want to favour edges with larger weights, we use the reciprocal of the weight as the edge length.
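The modification described above can be sketched as a Brandes-style betweenness computation in which breadth-first search is replaced by Dijkstra's algorithm and each edge has length 1 / weight. This is an illustrative sketch, not the authors' code: the function name, the unnormalised scores, and the toy clubs are assumptions.

```python
import heapq
from collections import defaultdict


def weighted_betweenness(edges):
    """Unnormalised betweenness centrality on a directed weighted graph.

    edges maps (source, target) to a positive weight. A larger weight is
    treated as a shorter step by using 1 / weight as the edge length.
    """
    adj = defaultdict(list)
    nodes = set()
    for (u, v), weight in edges.items():
        adj[u].append((v, 1.0 / weight))
        nodes.update((u, v))

    score = dict.fromkeys(nodes, 0.0)
    for s in nodes:
        # Dijkstra from s, counting shortest paths and their predecessors.
        dist = {s: 0.0}
        sigma = defaultdict(float)  # number of shortest paths from s
        sigma[s] = 1.0
        preds = defaultdict(list)
        order, done = [], set()
        heap = [(0.0, s)]
        while heap:
            d, v = heapq.heappop(heap)
            if v in done:
                continue  # stale heap entry
            done.add(v)
            order.append(v)
            for w, length in adj[v]:
                nd = d + length
                if w not in dist or nd < dist[w]:
                    dist[w] = nd
                    sigma[w] = sigma[v]
                    preds[w] = [v]
                    heapq.heappush(heap, (nd, w))
                elif nd == dist[w]:  # another shortest path of equal length
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest node back to s.
        delta = dict.fromkeys(order, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score


# Hypothetical toy network: the two-step route has heavier edges than
# the direct edge, so it is the shorter path after taking reciprocals.
toy = {
    ("small", "springboard"): 2.0,
    ("springboard", "top"): 2.0,
    ("small", "top"): 0.5,
}
scores = weighted_betweenness(toy)
```

In the toy network, the route from `small` to `top` through `springboard` has length 0.5 + 0.5 = 1.0, which is shorter than the direct edge of length 2.0. The intermediate club therefore receives all of the betweenness.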
|League||Ranking||League||Ranking|
|La Liga (ESP)||100||Premier League (UKR)||20|
|Premier League (ENG)||95||Super League (SWI)||20|
|Serie A (ITA)||85||Serie A (BRA)||20|
|Bundesliga (GER)||75||Super Lig (TUR)||15|
|Ligue 1 (FRA)||50||Primera Division (ARG)||15|
|Primera Liga (POR)||40||Super League (GRE)||13|
|Eredivisie (NED)||40||Liga 1 (ROM)||12|
|Pro League (BEL)||25||1. HNL (CRO)||10|
|Premier League (RUS)||25||Bundesliga (AUT)||10|
|Premiership (SCO)||20||MLS (USA)||5|
4 Results and Discussion
4.1 Top players
After running the analysis on the player collaboration network, we find that the best player is Cristiano Ronaldo. He is followed by several other players that have played for several of the best clubs. Table 5, which lists the top 20 players identified by our algorithm together with their scores, shows that the value of a player is not the only thing that affects his score. Players like Beckham, Ronaldinho, Kaká and Keane, whose market value has decreased considerably because of their age but who played for many important clubs in their careers, have high scores. Most players on the top 20 list are still active today and are playing in the best leagues.
The most promising players in each age group are listed in Tables 9, 8, 7 and 6. When assessing how promising a player is, the most important factor besides his value and the values of his teammates is the player's age. Since our network is an undirected network connecting two players, age cannot simply be added to the weight equation. Including age in the weight equation would favour not only players that have valuable teammates but also players that have younger teammates, which is not desired. Therefore, for identifying the most promising players, the network can stay the same and we just need to interpret the results differently. We divide players into groups based on their age and compare only the scores of players in the same group. On average, older players have higher scores, which is expected, as they have played more seasons and therefore have a higher degree. Thus, the separation into age groups is beneficial. Some of the most promising players according to our algorithm already play for the best clubs, and others, despite their young age, play an important role in their clubs.
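The age-group comparison amounts to ranking PageRank scores within each group instead of across the whole network, which can be sketched as below. The age bins, the function name `top_by_age_group`, and the player data are hypothetical, since the boundaries of the groups are not stated in the text.

```python
from collections import defaultdict


def top_by_age_group(players, bins, k=3):
    """Return the k highest-scoring player names in each age bin.

    players is a list of (name, age, score); bins is a list of
    inclusive (low, high) age ranges.
    """
    groups = defaultdict(list)
    for name, age, score in players:
        for low, high in bins:
            if low <= age <= high:
                groups[(low, high)].append((score, name))
                break
    return {
        group: [name for _, name in sorted(members, reverse=True)[:k]]
        for group, members in groups.items()
    }


# Hypothetical scores and ages, not taken from the paper's tables.
players = [
    ("player_a", 17, 0.000020),
    ("player_b", 17, 0.000023),
    ("player_c", 21, 0.000143),
    ("player_d", 22, 0.000175),
    ("player_e", 30, 0.000400),
]
bins = [(16, 18), (19, 23)]  # assumed bins for illustration
best = top_by_age_group(players, bins, k=1)
```

Here `player_e` has the highest score overall but falls outside both bins, so the ranking surfaces `player_b` and `player_d` as the leaders of their respective groups.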
Based on the results, we can conclude that PageRank is an appropriate algorithm for determining the best players in our weighted network.
|Player||PageRank score||Value 2015/16 [£]|
|Daniele De Rossi||0.000389||5.250.000|
|Player||PageRank score||Player||PageRank score|
|Gianluigi Donnarumma||0.000020||Hachim Mastour||0.000023|
|Alexandru Petrus||0.000011||Ianis Hagi||0.000020|
|Maximiliano Romero||0.000010||Dani Olmo||0.000017|
|Robert Moldoveanu||0.000009||Martin Ödegaard||0.000015|
|Vlad Dragomir||0.000009||Reece Oxford||0.000015|
|Player||PageRank score||Player||PageRank score|
|Youri Tielemans||0.000070||Alen Halilovic||0.000052|
|Ante Ćorić||0.000036||Timo Werner||0.000049|
|Andrija Balić||0.000035||Fabrice Olinga||0.000044|
|Player||PageRank score||Player||PageRank score|
|Max Meyer||0.000058||Mateo Kovacic||0.000107|
|Adrien Rabiot||0.000053||Domenico Berardi||0.000091|
|Ángel Correa||0.000052||Raheem Sterling||0.000089|
|Dorin Rotariu||0.000052||Gerard Deulofeu||0.000086|
|Player||PageRank score||Player||PageRank score|
|Julian Draxler||0.000143||Mario Götze||0.000175|
|Raphaël Varane||0.000104||Christian Eriksen||0.000157|
|Luciano Vietto||0.000096||Jack Wilshere||0.000148|
4.2 Springboard Clubs Identification
The club transfer network analysis shows that the best springboard club among the clubs in the top twenty leagues is Standard Liege. The results match expectations, since the top 15 list contains none of the most valuable and best clubs in the world. The top 15 clubs by betweenness centrality, with scores calculated on the network using both weight equations, are listed in Table 10. The results differ only slightly between the two proposed weight equations. The top two clubs are the same regardless of the weight, and the third and the fourth switch positions if we change the weight equation. Most of the clubs on the top 15 lists are from less valuable leagues; these clubs normally buy younger players that are more affordable and sell the ones whose value rises above a certain level. This makes them a perfect springboard for younger and less experienced players. Because of this transfer activity, such clubs get a high betweenness centrality score, as they play an important role in the transfer paths from less valuable clubs to the best clubs.
|Club ranking using betweenness centrality|
|Club||Score by value (Eq. 3)||Club||Score by rank (Eq. 2)|
|Standard Liege||0.013605||Standard Liege||0.012823|
|AEK Athens||0.011217||AEK Athens||0.012240|
|SL Benfica||0.010937||Sporting CP||0.010424|
|Sporting CP||0.010312||SL Benfica||0.010172|
|Skoda Xanthi||0.009605||AS Monaco||0.009275|
|Dinamo Bukarest||0.008743||FC Porto||0.008988|
|AS Monaco||0.008704||Rubin Kazan||0.008884|
|Dinamo Zagreb||0.008675||CFR Cluj||0.008681|
|Olympiacos Pir.||0.008553||Skoda Xanthi||0.008638|
|CFR Cluj||0.008542||Dinamo Bukarest||0.008518|
|Steaua Bucharest||0.008180||Olympiacos Pir.||0.008397|
|Udinese Calcio||0.007899||Rangers FC||0.008216|
|FC Porto||0.007889||Dinamo Zagreb||0.008170|
|Celtic FC||0.007849||Iraklis Thess.||0.007925|
|Petrolul Ploiesti||0.007794||Red Bull Salzburg||0.007907|
The player collaboration network for the past fifteen seasons of the top twenty football leagues consists of over 36 thousand nodes and over 1.4 million edges. Therefore, time- and space-consuming algorithms can prove too demanding to run on regular computers. The weighted PageRank algorithm, however, was able to calculate the scores for all the players in a very reasonable time. With the PageRank algorithm and a proper edge weight, we are able to identify the top players of the last fifteen seasons. A very important factor in the weight equation is the inflation rate, which ensures that older players whose nominal value never reached that of the best players of recent seasons are also present on the top players list.
Using the same network, we are also able to identify the most promising football players by separating their PageRank scores into age groups. Using this approach, we compare only players of similar age that have played for a similar number of seasons. This ensures the same conditions for all the players in a specific age group. The results highlight some young players that already play for the best football clubs and some young players from less known clubs, where they play an essential role.
The results of the club transfer network analysis closely match our initial hypothesis that clubs from less valuable leagues would come out on top. We are able to identify springboard clubs from the data on player transfers in the past fifteen seasons by constructing a directed weighted network, with weights based on either the club value or the club rankings in the past seasons. On the proposed network, we use a weighted betweenness centrality algorithm to reveal the best springboard clubs in the top football leagues in the world. Our algorithm identifies clubs from the Belgian, Greek and Portuguese leagues as the best springboard clubs.
- (1) FIFA, Big Count (2006).
- (2) J. L. Peña, H. Touchette, A network theory analysis of football strategies, arXiv preprint arXiv:1206.6904.
- (3) J. Duch, J. S. Waitzman, L. A. N. Amaral, Quantifying the performance of individual players in a team activity, PloS one 5 (6) (2010) e10937.
- (4) C. Cotta, A. M. Mora, J. J. Merelo, C. Merelo-Molina, A network analysis of the 2010 FIFA World Cup champion team play, Journal of Systems Science and Complexity 26 (1) (2013) 21–42.
- (5) S. Mukherjee, Identifying the greatest team and captain - a complex network approach to cricket matches, Physica A: Statistical Mechanics and its Applications 391 (23) (2012) 6066–6076.
- (6) F. Radicchi, M. Perc, Who is the best player ever? a complex network analysis of the history of professional tennis, PloS one 6 (2) (2011) e17249.
- (7) Eurostat, HICP - inflation rate (2015).
- (8) L. Page, S. Brin, R. Motwani, T. Winograd, Pagerank: Bringing order to the web, Tech. rep., Stanford Digital Libraries Working Paper (1997).
- (9) L. C. Freeman, A set of measures of centrality based on betweenness, Sociometry (1977) 35–41.
- (10) U. Brandes, A faster algorithm for betweenness centrality, Journal of Mathematical Sociology 25 (2) (2001) 163–177.