An Elo-based rating system for TopCoder SRM
We present an Elo-based rating system for programming contests. We justify a definition of performance using the logarithm of a player's rank. We apply the definition to rate TopCoder SRMs. We improve the accuracy, guided by experimental results. We compare our results with SRM ratings.
SRMs (Single Round Matches) are regular programming contests organized by TopCoder since 2001. TopCoder has developed its own rating system, which has been used throughout the contests' 19-year history. Various shortcomings of SRM ratings have been documented on TopCoder forums and elsewhere. Our purpose here is not to discuss these issues, but to provide a concrete proposal to remedy them. Players have sometimes asked: what would SRM ratings be if they were Elo-based? We would like to obtain a reasonable answer to this question.
There is no standard method for applying Elo ratings to rounds of more than two players. We could consider a ranking of players as the set of results between each pair of players. However, such results are not independent of each other: a ranked result is the product of a single performance by each player. Instead, we will consider the ranking as a tournament. From desirable properties, we will deduce a formula for performance in ranked games.
We then use this formula specifically to rate SRMs. Our goal is to more accurately predict the players' performances after each round.
2.1 Rank performance
Let $p(r, n)$ be the performance of a player ranked $r$ in a round of $n$ players. We consider the ranking as an elimination tournament, and count the number of wins.
If the tournament has $2^m$ players, the winner must win $m$ rounds, so we have: $p(1, 2^m) = m$.
The player ranked $2r$ has one less win than rank $r$: $p(2r, n) = p(r, n) - 1$.
Multiplying both $r$ and $n$ by any $k > 0$ does not change the number of wins: $p(kr, kn) = p(r, n)$.
With these constraints, we find the only solution is: $p(r, n) = \log_2(n / r)$.
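The formula and its three defining constraints can be checked numerically; a minimal sketch in Python (function name is ours):

```python
from math import log2

def perf(rank, n):
    """Rank performance p(r, n): the number of wins of a player
    ranked `rank` in an elimination tournament of n players."""
    return log2(n / rank)

# The winner of a 2**m-player tournament wins m rounds.
assert perf(1, 8) == 3
# Rank 2r has one less win than rank r.
assert perf(4, 8) == perf(2, 8) - 1
# Scaling both rank and field size leaves the win count unchanged.
assert perf(3, 8) == perf(6, 16)
```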
We use the standard Elo formula for expectations. A player with rating $r_1$ is expected to outperform a player with rating $r_2$ with probability: $E = 1 / (1 + 10^{(r_2 - r_1)/400})$.
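The standard Elo expectation can be sketched directly (function name is ours):

```python
def expectation(r1, r2):
    """Probability that a player rated r1 outperforms a player rated r2,
    per the standard Elo formula."""
    return 1 / (1 + 10 ** ((r2 - r1) / 400))

# Equal ratings give a 50% expectation.
assert expectation(1200, 1200) == 0.5
# A 400-point advantage corresponds to 10:1 odds.
assert abs(expectation(1600, 1200) - 10 / 11) < 1e-12
```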
As with SRM ratings, the rating of new players is initially 1200.
In programming contests like SRM, ties generally reflect a limitation of the problem set or scoring, rather than an unexpected performance by the tied players. We would like ties not to affect the ratings.
We experimented with several accounting methods. We find the most accurate is to split ties equally among the tied players, in both the actual and the expected ranks. Slightly less accurate is to compute expected ranks regardless of ties, then split the tied actual ranks as expected.
We now split the ties equally.
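One way to split ties equally in the actual ranks, assuming the convention that a rank is one plus the number of strictly better scores plus half the number of tied opponents (our reading of the text):

```python
def ranks(scores):
    """Actual ranks with ties split equally among the tied players."""
    out = []
    for i, s in enumerate(scores):
        better = sum(t > s for t in scores)
        tied = sum(t == s for j, t in enumerate(scores) if j != i)
        out.append(1 + better + tied / 2)
    return out

# Two players tied for first share ranks 1 and 2.
assert ranks([50, 50, 10]) == [1.5, 1.5, 3]
```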
2.4 Relative performance
We have the results of a round, which may include multiple divisions and ties. We consider the results of each division separately. The result of a round is a list of scores $s_1, \ldots, s_n$, where $s_i$ is the score obtained by player $i$.
We compute the rank $r_i$ and expected rank $R_i$ for each player $i$, splitting ties equally: $r_i = 1 + \sum_{j \neq i} ([s_j > s_i] + \frac{1}{2}[s_j = s_i])$ and $R_i = 1 + \sum_{j \neq i} E_{j,i}$, where $E_{j,i}$ is the Elo expectation that player $j$ outperforms player $i$.
The relative performance of a player in the round is the difference of actual and expected rank performance: $\pi_i = p(r_i, n) - p(R_i, n)$.
This can be written as: $\pi_i = \log_2(R_i / r_i)$.
The performance equals a number of wins above or below expectations in a tournament of appropriately matched players.
Rank performance is convex (Figure 1). The sum of the performances of a set of ranks is maximal if the ranks are distinct, and minimal if the ranks are uncertain or tied. Having split ties equally in actual and expected ranks, the expected ranks are at least as tied as the actual ranks. This ensures the sum of performances in a round is positive or zero. We have: $\sum_i \pi_i \geq 0$.
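Expected ranks, relative performances, and the nonnegative-sum property can be sketched as follows (helper names are ours; reading the expected rank as one plus the sum of the opponents' Elo expectations is our assumption):

```python
from math import log2

def expectation(r1, r2):
    """Standard Elo expectation that a player rated r1 outperforms r2."""
    return 1 / (1 + 10 ** ((r2 - r1) / 400))

def expected_ranks(ratings):
    """Expected rank: 1 plus the sum of each opponent's win expectation."""
    return [1 + sum(expectation(rj, ri) for j, rj in enumerate(ratings) if j != i)
            for i, ri in enumerate(ratings)]

def performances(actual_ranks, exp_ranks):
    """Relative performance pi_i = log2(R_i / r_i)."""
    return [log2(R / r) for r, R in zip(actual_ranks, exp_ranks)]

ratings = [1600, 1400, 1200]
R = expected_ranks(ratings)
# The highest-rated player has the best expected rank.
assert R[0] < R[1] < R[2]
# An upset: the lowest-rated player wins the round.
pi = performances([3, 2, 1], R)
assert pi[2] > 0 > pi[0]
# With distinct actual ranks, the sum of performances is nonnegative.
assert sum(pi) >= 0
```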
The rank is expected in logarithm (Figure 2). A player's performance may average to zero in several ways.
We will compute rating changes preserving these properties.
We define the prediction error for expected rank $R_i$ and actual rank $r_i$ as the magnitude of the performance: $e_i = |\log_2(R_i / r_i)|$.
Our primary accuracy metric is the average error over all participants in all rated rounds.
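A sketch of the metric under one plausible reading, assumed here, that the error is the magnitude of the difference between actual and expected rank performance:

```python
from math import log2

def prediction_error(actual_ranks, exp_ranks):
    """Assumed error: |log2(R / r)| averaged over participants.
    The paper's exact definition may differ."""
    return sum(abs(log2(R / r))
               for r, R in zip(actual_ranks, exp_ranks)) / len(actual_ranks)

# A perfect prediction has zero error.
assert prediction_error([1, 2, 3], [1, 2, 3]) == 0
# Being off by a factor of 2 in rank costs one bit on average.
assert prediction_error([2, 4], [1, 2]) == 1
```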
3 Proposed rating system
We have the performances $\pi_i$ of each player in an SRM. We would like to compute rating changes $\Delta_i$ which better predict future performances.
3.1 Initial factor
A performance $\pi$ is outperforming expectations by a factor $2^{\pi}$ in win:loss odds. A rating difference $d$ is expecting a better performance by a factor $10^{d/400}$. We can convert the performances into rating units: $\Delta_0 = 400 \log_{10}(2) \cdot \pi \approx 120.4\,\pi$.
If we expected the same performances the next round, and had no other information, this would be a reasonable rating change. Thus we consider $\Delta_0$ as Elo-based, or 'Elo'.
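The conversion from performance (in units of wins, i.e. doubled odds) to Elo rating units follows from equating $2^{\pi} = 10^{d/400}$; a sketch (names are ours):

```python
from math import log10, log2

# delta = 400 * log10(2) * pi: rating points per win above expectation.
ELO_PER_WIN = 400 * log10(2)

def initial_delta(pi):
    """Initial Elo-based rating change for a performance pi."""
    return ELO_PER_WIN * pi

# One win above expectation is worth about 120.4 rating points.
assert abs(initial_delta(1) - 120.412) < 1e-3
# A performance of log2(10) wins corresponds to exactly 400 points.
assert abs(initial_delta(log2(10)) - 400) < 1e-9
```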
3.2 Fixed K
Here we compute $\Delta = K\Delta_0$, with $K$ minimizing the error. We find the most accurate choice of $K$ (Figure 3). As we add factors to $\Delta$, we automatically adjust $K$ to remain the most accurate.
3.3 Weight factor
A player's prior rating should weigh in according to the player's experience. Let $w$ be the weight, such that $\Delta = wK\Delta_0$, and let $t$ be the round number for the player, starting with $t = 1$. We experimented with several possibilities; Figure 4 shows choices of the parameter, and we choose the most accurate.
With the weight factor, the error is 0.7562.
3.4 Variance factor
A player's performance has variance for various reasons, not all of which predict future performance. We compute the derivative of a player's expected performance per change in rating, writing the expected rank minus one as the loss:win ratio relative to the player's current rating, and express the result in units of performance.
Extrapolating linearly, we can solve for the rating at which the observed performance would have been expected. The linear extrapolation may be conservative. In some cases no extrapolation is possible; in others, the performance carries more bits of precision, hence more likely predicts future performance.
We compute a correction in this direction, accounting for the extrapolation and a multiplier. We find the most accurate multiplier (Figure 5), and choose it. We define the variance factor $v$ accordingly. We now have: $\Delta = vwK\Delta_0$.
With the variance factor, the error is 0.7536.
3.5 Maximum factor
An unexpected performance predicts future performance less reliably than a consistent performance. The ratings gain accuracy if we limit the magnitude of the performance to a maximum $M$ using a sigmoid. A sigmoid preserves symmetry, and exact linearity of performance around $0$. We define the adjusted performance $\pi'$ as a sigmoid of $\pi$ bounded by $M$. We find the most accurate $M$ (Figure 6), and choose it. The rating change for each player is now computed from $\pi'$ in place of $\pi$.
With the maximum factor, the error is 0.7513.
3.6 Natural inflation
We have computed rating changes which make the ratings more accurate after a round. Each player is expected a performance of $0$, thus has an expected rating change of $0$. The ratings are stable in expectation.
Because performances have a positive sum, more rating is won by the outperforming players than is lost by the underperforming players: the ratings have net inflation. Because the players gain experience during a round, the players on average have better performances after the round; thus inflation predicts future performance better than deflation. Because we minimized the prediction error, the average rating change should approximately predict the next performances of participants relative to non-participants. We define this rate of inflation as natural inflation.
For comparison with SRM ratings, we consider natural inflation as stable. We refer to this Elo-based implementation as 'Elo' in our results.
We have rating changes predicting the players' relative performances. Now we would like to estimate the performances over time, such that players with stable performances have stable ratings. To maintain relative accuracy, we will not adjust our current parameters.
3.8 Performance bonus
As long as the expected performance is exactly zero, the expected rating change of any player is zero. However, the expected performance in a round is a better performance than not participating: players having practiced already have better performances before the round. Thus a zero expectation predicts future performance less accurately than a slightly positive one.
We adjust the expectation to expect inflation, as if the ratings had increased. We choose a parameter, then add the difference in expected performance to each player's performance.
We find the most accurate value (Figure 7), and little accuracy can be gained from this parameter alone.
3.9 New players
So far we have used a constant initial rating for new players. However, the performance of new players is not constant. As the performance of existing players improves, SRMs become more difficult. This raises the barrier to entry.
Before participating, potential players have opportunities to practice on recent rounds. Some may be experienced players coming from other platforms. The performance of new players thus improves over time, and we adjust the initial rating for inflation.
We choose a parameter: the increase in the initial rating per 100 rounds. After each round, we adjust the initial rating by this amount; simultaneously, we adjust the remaining parameters for accuracy. We find the most accurate values (Figure 8).
With this, our parameters estimating a stable performance are fixed.
We first compare our 'Elo' implementation to SRM ratings. Table 1 shows the average computed rating changes, performances, and prediction error, using our definitions.
The rows of Table 1 are grouped as follows:
- The first row is our primary metric.
- The players by experience.
- Existing players, in each division.
- In each division, the top and bottom half ranks.
Because SRM ratings use a different definition of performance, we include results using independent metrics. Each round, we compute rank correlation statistics.
For each metric, we compute the fraction of rounds where 'Elo' better predicted the result than SRM ratings, splitting ties equally. Table 2 shows the percentages.
Our 'Elo' implementation generally predicts the players' relative performances better than SRM ratings do. The ranks are also better predicted, with predictions improving with the number of players. Our 'Elo2' adjustments improve stability and slightly improve accuracy.
Our primary metric considers all the players' performances in all SRMs. The predictions are empirically accurate on average, but not necessarily precise for any single player or at any given time.
We include source code and charts in the appendix. Other results are posted on our web site.
We would like to thank Ivan Kazmenko for reviewing this paper and helpful comments.
- http://tc.eloranked.com
- TopCoder: Algorithm competition rating system. https://community.topcoder.com/tc?module=Static&d1=help&d2=ratings
- TopCoder. https://www.topcoder.com/about/
- Wikipedia: Elo rating system. https://en.wikipedia.org/wiki/Elo_rating_system
- Wikipedia: Rank correlation. https://en.wikipedia.org/wiki/Rank_correlation