|
# Glicko-2 Ranking System Implementation |
|
|
|
## Overview |
|
|
|
The Glicko-2 ranking system is used in this project to rank devices based on their performance in benchmark tests, specifically measuring token generation speed (tokens/second) and prompt processing speed (tokens/second). This document explains both the theoretical foundations of Glicko-2 and its specific implementation in our system. |
|
|
|
## Glicko-2 Theory |
|
|
|
Glicko-2 is an improvement over the original Glicko system, which itself improved on the Elo rating system. It was developed by Mark Glickman and is particularly well suited to situations where:
|
|
|
1. Devices have different numbers of benchmark runs |
|
2. There's uncertainty about a device's true performance capabilities |
|
3. Performance metrics need to be compared across different model sizes and configurations |
|
|
|
### Key Components |
|
|
|
1. **Rating (μ)**: A numerical value representing a device's relative performance level (higher is better) |
|
2. **Rating Deviation (RD)**: The uncertainty in the performance rating |
|
3. **Volatility (σ)**: A measure of how consistent a device's performance is across different benchmarks |
|
|
|
### Rating System Parameters |
|
|
|
- **Initial Rating**: 1500 (standard starting point on the Glicko-2 scale) |
|
- **Initial RD**: 350 (high uncertainty for new devices) |
|
- **Volatility**: 0.06 (controls how quickly performance ratings can change) |
|
- **Tau**: 0.5 (system constant that limits the change in volatility) |
|
|
|
Note: The rating numbers themselves are on a relative scale and don't directly correspond to tokens/second. Instead, they represent relative performance levels where higher numbers indicate better performance. The actual token generation and prompt processing speeds (in tokens/second) are used to determine the relative performance outcomes that update these ratings. |
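
As a minimal sketch (the class and constant names below are illustrative assumptions, not the project's actual code), the per-device rating state initialized with these parameters might look like:

```python
from dataclasses import dataclass

# Parameters listed above
INITIAL_RATING = 1500.0    # standard Glicko-2 starting rating
INITIAL_RD = 350.0         # high uncertainty for new devices
INITIAL_VOLATILITY = 0.06  # expected consistency of performance
TAU = 0.5                  # system constant limiting volatility changes

@dataclass
class DeviceRating:
    """Per-device Glicko-2 state (hypothetical structure)."""
    device_id: str
    rating: float = INITIAL_RATING
    rd: float = INITIAL_RD
    volatility: float = INITIAL_VOLATILITY
    benchmark_runs: int = 0
```

A device with no benchmarks therefore starts at 1500 ± 350, and its RD shrinks as benchmark comparisons accumulate.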
|
|
|
## Implementation Details |
|
|
|
### Data Preparation |
|
|
|
Before applying Glicko-2, we preprocess the benchmark data: |
|
|
|
1. Filter out emulators and exclude iOS devices running with insufficient GPU layers, so that results remain comparable across iOS devices

2. Normalize scores within each model group to account for different model difficulties (see the sketch after this list)
|
3. Convert continuous performance metrics into relative comparisons: |
|
- For each pair of devices running the same model, we compare their token generation and prompt processing speeds |
|
- If a device is faster in both metrics, it "wins" the comparison (outcome = 1) |
|
- If a device is slower in both metrics, it "loses" the comparison (outcome = 0) |
|
- If one device is faster in one metric but slower in the other, it's considered a "draw" (outcome = 0.5) |
|
- This conversion is necessary because Glicko-2 works with discrete outcomes (win/loss/draw) rather than continuous performance values |
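
A minimal sketch of the per-model normalization in step 2, assuming min-max scaling within each model group (the exact scheme is an assumption made for illustration):

```python
# Hypothetical normalization: scale each metric to [0, 1] within its
# "Model ID" group so that different model sizes are comparable.
for col in ["Token Generation", "Prompt Processing"]:
    grouped = df.groupby("Model ID")[col]
    df[f"{col} (normalized)"] = (df[col] - grouped.transform("min")) / (
        grouped.transform("max") - grouped.transform("min")
    )
```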
|
|
|
For example, for the outcome conversion in step 3, if:
|
- Device A: Token Generation = 50 tokens/sec, Prompt Processing = 30 tokens/sec |
|
- Device B: Token Generation = 45 tokens/sec, Prompt Processing = 25 tokens/sec |
|
|
|
Then Device A "wins" this comparison because it's faster in both metrics. This relative outcome (1 for Device A, 0 for Device B) is what's used to update the Glicko-2 ratings. |
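
A sketch of this conversion as a standalone helper (a hypothetical function, not necessarily the project's exact code):

```python
def compare_devices(token_speed1, prompt_speed1, token_speed2, prompt_speed2):
    """Convert two devices' speeds into an outcome from device 1's view:
    1.0 = win, 0.0 = loss, 0.5 = draw."""
    if token_speed1 > token_speed2 and prompt_speed1 > prompt_speed2:
        return 1.0  # device 1 is faster on both metrics
    if token_speed1 < token_speed2 and prompt_speed1 < prompt_speed2:
        return 0.0  # device 1 is slower on both metrics
    return 0.5      # mixed result counts as a draw

# Device A vs. Device B from the example above
outcome = compare_devices(50, 30, 45, 25)  # -> 1.0
```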
|
|
|
### Match Processing |
|
|
|
For each model, we compare devices pairwise based on their token generation and prompt processing speeds: |
|
|
|
```python
# Example of match processing: pair up devices that ran the same model
for model, group in df.groupby("Model ID"):
    devices = group["Normalized Device ID"].unique()
    for i in range(len(devices)):
        for j in range(i + 1, len(devices)):
            device1 = devices[i]
            device2 = devices[j]

            # Compare performance metrics
            token_speed1 = group[group["Normalized Device ID"] == device1]["Token Generation"].iloc[0]
            token_speed2 = group[group["Normalized Device ID"] == device2]["Token Generation"].iloc[0]

            prompt_speed1 = group[group["Normalized Device ID"] == device1]["Prompt Processing"].iloc[0]
            prompt_speed2 = group[group["Normalized Device ID"] == device2]["Prompt Processing"].iloc[0]

            # Determine performance outcome
            if token_speed1 > token_speed2 and prompt_speed1 > prompt_speed2:
                outcome = 1  # device1 performs better
            elif token_speed1 < token_speed2 and prompt_speed1 < prompt_speed2:
                outcome = 0  # device2 performs better
            else:
                outcome = 0.5  # mixed performance

            # `outcome` then feeds the rating update described below
```
|
|
|
### Rating Updates |
|
|
|
The Glicko-2 system updates performance ratings after each benchmark comparison: |
|
|
|
1. **Calculate Expected Performance**: |
|
```python
import math

def expected_performance(rating1, rating2, rd1, rd2):
    """Expected score of device 1 in a comparison against device 2."""
    q = math.log(10) / 400
    g_rd = 1 / math.sqrt(1 + 3 * q**2 * (rd2**2) / math.pi**2)
    return 1 / (1 + 10**(-g_rd * (rating1 - rating2) / 400))
```
|
|
|
2. **Update Performance Rating and RD**: |
|
```python
def update_performance(rating, rd, outcome, expected, g_rd):
    """Update a device's rating and RD after one comparison.

    `g_rd` is the g(RD) factor of the opponent, computed as in
    expected_performance above.
    """
    q = math.log(10) / 400
    d_squared = 1 / (q**2 * g_rd**2 * expected * (1 - expected))
    new_rd = math.sqrt(1 / (1 / rd**2 + 1 / d_squared))
    new_rating = rating + q / (1 / rd**2 + 1 / d_squared) * g_rd * (outcome - expected)
    return new_rating, new_rd
```
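
Putting the two helpers together for a single comparison (an illustrative sketch with made-up ratings; the actual code may batch updates per rating period):

```python
# Hypothetical example: device A (1520 +/- 120) beats device B (1480 +/- 90)
q = math.log(10) / 400
g_rd_b = 1 / math.sqrt(1 + 3 * q**2 * (90**2) / math.pi**2)

expected_a = expected_performance(1520, 1480, 120, 90)
new_rating_a, new_rd_a = update_performance(1520, 120, outcome=1, expected=expected_a, g_rd=g_rd_b)

# Device A's rating rises above 1520 and its RD drops below 120, because
# the observed outcome (1) exceeded the expected score (about 0.56).
```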
|
|
|
### Confidence Thresholds |
|
|
|
We implement several confidence thresholds: |
|
|
|
1. **Minimum Benchmarks**: Devices must have at least 5 benchmark runs to be included in confident rankings |
|
2. **Performance Deviation**: Devices with an RD above 100 (on the rating scale, not in tokens/second) are considered less reliable
|
3. **Performance Consistency**: High volatility indicates inconsistent performance across benchmarks |
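
A sketch of how these thresholds might be applied when deciding which devices appear in the confident rankings (constant and function names are illustrative):

```python
MIN_BENCHMARKS = 5  # minimum benchmark runs for a confident ranking
MAX_RD = 100        # rating deviations above this are flagged as unreliable

def is_confident(device: DeviceRating) -> bool:
    """Return True if a device meets both confidence thresholds.

    DeviceRating is the hypothetical state class sketched earlier.
    """
    return device.benchmark_runs >= MIN_BENCHMARKS and device.rd <= MAX_RD
```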
|
|
|
## Practical Considerations |
|
|
|
### Handling Sparse Data |
|
|
|
The system is designed to handle sparse benchmark data by: |
|
1. Using conservative initial performance ratings for new devices |
|
2. Increasing RD for devices with few benchmark runs |
|
3. Implementing a minimum benchmark threshold |
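
For point 2, one common approach (borrowed from the original Glicko system; whether this project uses exactly this formula is an assumption) is to let the RD drift back toward its initial value when a device has little or stale data:

```python
def inflate_rd(rd, periods_without_data, c=34.6):
    """Drift a device's RD back toward INITIAL_RD when benchmarks are sparse.

    Uses math and INITIAL_RD from the sketches above; `c` controls how
    quickly uncertainty grows, and 34.6 is a hypothetical value.
    """
    return min(math.sqrt(rd**2 + c**2 * periods_without_data), INITIAL_RD)
```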
|
|
|
### Performance Metrics |
|
|
|
We track several performance metrics: |
|
- Combined performance rating (overall, derived from both metrics)

- Token generation rating (based on token generation speed in tokens/second)

- Prompt processing rating (based on prompt processing speed in tokens/second)

- Rating deviation (uncertainty on the rating scale)
|
- Number of benchmark runs |
|
- Performance comparison statistics |
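
A hypothetical ranking record illustrating these fields (names and values are invented for illustration, not the project's actual schema):

```python
example_entry = {
    "device": "Example Device",
    "combined_rating": 1582.4,           # overall performance rating
    "token_generation_rating": 1597.1,   # rating from token generation comparisons
    "prompt_processing_rating": 1568.0,  # rating from prompt processing comparisons
    "rating_deviation": 74.2,            # uncertainty on the rating scale
    "benchmark_runs": 12,
    "wins": 18, "losses": 7, "draws": 4, # comparison statistics
}
```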
|
|
|
### Visualization |
|
|
|
The system provides: |
|
1. Overall performance rankings with confidence intervals |
|
2. Platform-specific performance statistics |
|
3. Head-to-head performance comparison tools |
|
4. Performance trend analysis across different model sizes |
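
For the confidence intervals in point 1, a common Glicko convention is to report roughly a 95% interval as the rating plus or minus about two rating deviations (whether this project uses exactly this width is an assumption):

```python
def confidence_interval(device: DeviceRating, z: float = 1.96):
    """Approximate 95% interval for a device's true performance level."""
    return (device.rating - z * device.rd, device.rating + z * device.rd)
```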
|
|
|
## Advantages Over Other Systems |
|
|
|
1. **Better Handling of Performance Uncertainty**: Explicit modeling of performance measurement uncertainty |
|
2. **More Accurate with Fewer Benchmarks**: Can provide meaningful performance ratings with limited data |
|
3. **Dynamic Performance Updates**: Volatility parameter allows for appropriate rating changes |
|
4. **Transparent Confidence**: Performance deviations provide clear confidence measures |
|
|
|
## Limitations |
|
|
|
1. **Computational Complexity**: More complex than Elo, requiring more calculations |
|
2. **Parameter Sensitivity**: Results can be sensitive to system parameters |
|
3. **Continuous Metrics**: Requires conversion of continuous performance metrics (tokens/second) to relative comparisons |
|
|
|
## References |
|
|
|
1. Glickman, M. E. (2001). "The Glicko-2 Rating System"
|
2. Glickman, M. E. (1999). "Parameter estimation in large dynamic paired comparison experiments" |
|
3. Glickman, M. E. (2001). "Dynamic paired comparison models with stochastic variances" |