|
# Glicko-2 Ranking System Implementation |
|
|
|
## Overview |
|
|
|
The Glicko-2 ranking system is used in this project to rank devices based on their performance in benchmark tests, specifically measuring token generation speed (tokens/second) and prompt processing speed (tokens/second). This document explains both the theoretical foundations of Glicko-2 and its specific implementation in our system. |
|
|
|
## Glicko-2 Theory |
|
|
|
Glicko-2 is an improvement over the original Glicko system, which itself improved on the Elo rating system. It was developed by Mark Glickman and is particularly well suited to situations where:
|
|
|
1. Devices have different numbers of benchmark runs |
|
2. There's uncertainty about a device's true performance capabilities |
|
3. Performance metrics need to be compared across different model sizes and configurations |
|
|
|
### Key Components |
|
|
|
1. **Rating (μ)**: A numerical value representing a device's relative performance level (higher is better) |
|
2. **Rating Deviation (RD)**: The uncertainty in the performance rating |
|
3. **Volatility (σ)**: A measure of how consistent a device's performance is across different benchmarks |
|
|
|
### Rating System Parameters |
|
|
|
- **Initial Rating**: 1500 (standard starting point on the Glicko-2 scale) |
|
- **Initial RD**: 350 (high uncertainty for new devices) |
|
- **Volatility**: 0.06 (controls how quickly performance ratings can change) |
|
- **Tau**: 0.5 (system constant that limits the change in volatility) |
|
|
|
Note: The rating numbers themselves are on a relative scale and don't directly correspond to tokens/second. Instead, they represent relative performance levels where higher numbers indicate better performance. The actual token generation and prompt processing speeds (in tokens/second) are used to determine the relative performance outcomes that update these ratings. |
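
As a minimal sketch (the class and constant names below are illustrative assumptions, not the project's actual code), the per-device rating state initialized with these parameters might look like:

```python
from dataclasses import dataclass

# Parameters listed above
INITIAL_RATING = 1500.0    # standard Glicko-2 starting rating
INITIAL_RD = 350.0         # high uncertainty for new devices
INITIAL_VOLATILITY = 0.06  # expected consistency of performance
TAU = 0.5                  # system constant limiting volatility changes

@dataclass
class DeviceRating:
    """Per-device Glicko-2 state (hypothetical structure)."""
    device_id: str
    rating: float = INITIAL_RATING
    rd: float = INITIAL_RD
    volatility: float = INITIAL_VOLATILITY
    benchmark_runs: int = 0
```

A device with no benchmarks therefore starts at 1500 ± 350, and its RD shrinks as benchmark comparisons accumulate.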
|
|
|
## Implementation Details |
|
|
|
### Data Preparation |
|
|
|
Before applying Glicko-2, we preprocess the benchmark data: |
|
|
|
1. Filter out emulators and exclude iOS devices running with insufficient GPU layers, so that results remain comparable across iOS devices

2. Normalize scores within each model group to account for different model difficulties (see the sketch after this list)
|
3. Convert continuous performance metrics into relative comparisons: |
|
- For each pair of devices running the same model, we compare their token generation and prompt processing speeds |
|
- If a device is faster in both metrics, it "wins" the comparison (outcome = 1) |
|
- If a device is slower in both metrics, it "loses" the comparison (outcome = 0) |
|
- If one device is faster in one metric but slower in the other, it's considered a "draw" (outcome = 0.5) |
|
- This conversion is necessary because Glicko-2 works with discrete outcomes (win/loss/draw) rather than continuous performance values |
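
A minimal sketch of the per-model normalization in step 2, assuming min-max scaling within each model group (the exact scheme is an assumption made for illustration):

```python
# Hypothetical normalization: scale each metric to [0, 1] within its
# "Model ID" group so that different model sizes are comparable.
for col in ["Token Generation", "Prompt Processing"]:
    grouped = df.groupby("Model ID")[col]
    df[f"{col} (normalized)"] = (df[col] - grouped.transform("min")) / (
        grouped.transform("max") - grouped.transform("min")
    )
```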
|
|
|
For example, for the outcome conversion in step 3, if:
|
- Device A: Token Generation = 50 tokens/sec, Prompt Processing = 30 tokens/sec |
|
- Device B: Token Generation = 45 tokens/sec, Prompt Processing = 25 tokens/sec |
|
|
|
Then Device A "wins" this comparison because it's faster in both metrics. This relative outcome (1 for Device A, 0 for Device B) is what's used to update the Glicko-2 ratings. |
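
A sketch of this conversion as a standalone helper (a hypothetical function, not necessarily the project's exact code):

```python
def compare_devices(token_speed1, prompt_speed1, token_speed2, prompt_speed2):
    """Convert two devices' speeds into an outcome from device 1's view:
    1.0 = win, 0.0 = loss, 0.5 = draw."""
    if token_speed1 > token_speed2 and prompt_speed1 > prompt_speed2:
        return 1.0  # device 1 is faster on both metrics
    if token_speed1 < token_speed2 and prompt_speed1 < prompt_speed2:
        return 0.0  # device 1 is slower on both metrics
    return 0.5      # mixed result counts as a draw

# Device A vs. Device B from the example above
outcome = compare_devices(50, 30, 45, 25)  # -> 1.0
```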
|
|
|
### Match Processing |
|
|
|
For each model, we compare devices pairwise based on their token generation and prompt processing speeds: |
|
|
|
```python
# Example of match processing: pair up devices that ran the same model
for model, group in df.groupby("Model ID"):
    devices = group["Normalized Device ID"].unique()
    for i in range(len(devices)):
        for j in range(i + 1, len(devices)):
            device1 = devices[i]
            device2 = devices[j]

            # Compare performance metrics
            token_speed1 = group[group["Normalized Device ID"] == device1]["Token Generation"].iloc[0]
            token_speed2 = group[group["Normalized Device ID"] == device2]["Token Generation"].iloc[0]

            prompt_speed1 = group[group["Normalized Device ID"] == device1]["Prompt Processing"].iloc[0]
            prompt_speed2 = group[group["Normalized Device ID"] == device2]["Prompt Processing"].iloc[0]

            # Determine performance outcome
            if token_speed1 > token_speed2 and prompt_speed1 > prompt_speed2:
                outcome = 1  # device1 performs better
            elif token_speed1 < token_speed2 and prompt_speed1 < prompt_speed2:
                outcome = 0  # device2 performs better
            else:
                outcome = 0.5  # mixed performance

            # `outcome` then feeds the rating update described below
```
|
|
|
### Rating Updates |
|
|
|
The Glicko-2 system updates performance ratings after each benchmark comparison: |
|
|
|
1. **Calculate Expected Performance**: |
|
```python
import math

def expected_performance(rating1, rating2, rd1, rd2):
    """Expected score of device 1 in a comparison against device 2."""
    q = math.log(10) / 400
    g_rd = 1 / math.sqrt(1 + 3 * q**2 * (rd2**2) / math.pi**2)
    return 1 / (1 + 10**(-g_rd * (rating1 - rating2) / 400))
```
|
|
|
2. **Update Performance Rating and RD**: |
|
```python
def update_performance(rating, rd, outcome, expected, g_rd):
    """Update a device's rating and RD after one comparison.

    `g_rd` is the g(RD) factor of the opponent, computed as in
    expected_performance above.
    """
    q = math.log(10) / 400
    d_squared = 1 / (q**2 * g_rd**2 * expected * (1 - expected))
    new_rd = math.sqrt(1 / (1 / rd**2 + 1 / d_squared))
    new_rating = rating + q / (1 / rd**2 + 1 / d_squared) * g_rd * (outcome - expected)
    return new_rating, new_rd
```
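
Putting the two helpers together for a single comparison (an illustrative sketch with made-up ratings; the actual code may batch updates per rating period):

```python
# Hypothetical example: device A (1520 +/- 120) beats device B (1480 +/- 90)
q = math.log(10) / 400
g_rd_b = 1 / math.sqrt(1 + 3 * q**2 * (90**2) / math.pi**2)

expected_a = expected_performance(1520, 1480, 120, 90)
new_rating_a, new_rd_a = update_performance(1520, 120, outcome=1, expected=expected_a, g_rd=g_rd_b)

# Device A's rating rises above 1520 and its RD drops below 120, because
# the observed outcome (1) exceeded the expected score (about 0.56).
```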
|
|
|
### Confidence Thresholds |
|
|
|
We implement several confidence thresholds: |
|
|
|
1. **Minimum Benchmarks**: Devices must have at least 5 benchmark runs to be included in confident rankings |
|
2. **Performance Deviation**: Devices with an RD above 100 (on the rating scale, not in tokens/second) are considered less reliable
|
3. **Performance Consistency**: High volatility indicates inconsistent performance across benchmarks |
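
A sketch of how these thresholds might be applied when deciding which devices appear in the confident rankings (constant and function names are illustrative):

```python
MIN_BENCHMARKS = 5  # minimum benchmark runs for a confident ranking
MAX_RD = 100        # rating deviations above this are flagged as unreliable

def is_confident(device: DeviceRating) -> bool:
    """Return True if a device meets both confidence thresholds.

    DeviceRating is the hypothetical state class sketched earlier.
    """
    return device.benchmark_runs >= MIN_BENCHMARKS and device.rd <= MAX_RD
```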
|
|
|
## Practical Considerations |
|
|
|
### Handling Sparse Data |
|
|
|
The system is designed to handle sparse benchmark data by: |
|
1. Using conservative initial performance ratings for new devices |
|
2. Increasing RD for devices with few benchmark runs |
|
3. Implementing a minimum benchmark threshold |
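
For point 2, one common approach (borrowed from the original Glicko system; whether this project uses exactly this formula is an assumption) is to let the RD drift back toward its initial value when a device has little or stale data:

```python
def inflate_rd(rd, periods_without_data, c=34.6):
    """Drift a device's RD back toward INITIAL_RD when benchmarks are sparse.

    Uses math and INITIAL_RD from the sketches above; `c` controls how
    quickly uncertainty grows, and 34.6 is a hypothetical value.
    """
    return min(math.sqrt(rd**2 + c**2 * periods_without_data), INITIAL_RD)
```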
|
|
|
### Performance Metrics |
|
|
|
We track several performance metrics: |
|
- Combined performance rating (overall, derived from both metrics)

- Token generation rating (based on token generation speed in tokens/second)

- Prompt processing rating (based on prompt processing speed in tokens/second)

- Rating deviation (uncertainty on the rating scale)
|
- Number of benchmark runs |
|
- Performance comparison statistics |
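
A hypothetical ranking record illustrating these fields (names and values are invented for illustration, not the project's actual schema):

```python
example_entry = {
    "device": "Example Device",
    "combined_rating": 1582.4,           # overall performance rating
    "token_generation_rating": 1597.1,   # rating from token generation comparisons
    "prompt_processing_rating": 1568.0,  # rating from prompt processing comparisons
    "rating_deviation": 74.2,            # uncertainty on the rating scale
    "benchmark_runs": 12,
    "wins": 18, "losses": 7, "draws": 4, # comparison statistics
}
```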
|
|
|
### Visualization |
|
|
|
The system provides: |
|
1. Overall performance rankings with confidence intervals |
|
2. Platform-specific performance statistics |
|
3. Head-to-head performance comparison tools |
|
4. Performance trend analysis across different model sizes |
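
For the confidence intervals in point 1, a common Glicko convention is to report roughly a 95% interval as the rating plus or minus about two rating deviations (whether this project uses exactly this width is an assumption):

```python
def confidence_interval(device: DeviceRating, z: float = 1.96):
    """Approximate 95% interval for a device's true performance level."""
    return (device.rating - z * device.rd, device.rating + z * device.rd)
```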
|
|
|
## Advantages Over Other Systems |
|
|
|
1. **Better Handling of Performance Uncertainty**: Explicit modeling of performance measurement uncertainty |
|
2. **More Accurate with Fewer Benchmarks**: Can provide meaningful performance ratings with limited data |
|
3. **Dynamic Performance Updates**: Volatility parameter allows for appropriate rating changes |
|
4. **Transparent Confidence**: Performance deviations provide clear confidence measures |
|
|
|
## Limitations |
|
|
|
1. **Computational Complexity**: More complex than Elo, requiring more calculations |
|
2. **Parameter Sensitivity**: Results can be sensitive to system parameters |
|
3. **Continuous Metrics**: Requires conversion of continuous performance metrics (tokens/second) to relative comparisons |
|
|
|
## References |
|
|
|
1. Glickman, M. E. (2001). "The Glicko-2 Rating System"
|
2. Glickman, M. E. (1999). "Parameter estimation in large dynamic paired comparison experiments" |
|
3. Glickman, M. E. (2001). "Dynamic paired comparison models with stochastic variances" |