	# FormulaOne-Leaderboard / src / about.py

galb-dai

Add figure.

0600810 about 1 month ago

raw

history blame

11.6 kB

	# The paper's URL for linking
	PAPER_URL = "https://arxiv.org/abs/2507.13337"

	WHAT_IS_F1_HTML_TOP = f"""
	<div class="f1-container">
	<header class="text-center mb-12">
	<h1 class="text-4xl md:text-5xl font-bold text-gray-900 f1-h1">FormulaOne</h1>
	</header>

	<section>
	<p class="text-lg mb-4 f1-p">Frontier AI models have recently demonstrated strong performance on mathematical and algorithmic benchmarks, including earning <a href="https://deepmind.google/discover/blog/advanced-version-of-gemini-with-deep-think-officially-achieves-gold-medal-standard-at-the-international-mathematical-olympiad/" target="_blank" rel="noopener noreferrer" class="f1-a">gold medals in olympiads</a>, and attaining <a href="https://arxiv.org/html/2502.06807v1" target="_blank" rel="noopener noreferrer" class="f1-a">top percentile ratings</a> in competitive programming contests. How well do such benchmarks capture the true depth of algorithmic reasoning, as it arises in real-world research problems?</p>

	<p class="text-lg mb-4 f1-p">We believe that existing benchmarks fail to capture the deep reasoning skills required for complex, research-level algorithmic problems. To address this gap, <a href="{PAPER_URL}" target="_blank" rel="noopener noreferrer" class="f1-a">we introduce <strong>FormulaOne</strong></a>.</p>

	<p class="mb-4 f1-p"><strong>FormulaOne</strong> consists of 220 novel dynamic programming problems over graphs. The problems are organised into three categories, ranging from moderate difficulty all the way up to research level.</p>

	<div class="f1-grid-wrap" role="region" aria-label="FormulaOne categories">
	<div class="f1-grid-table" role="table">
	<div class="f1-grid-row f1-grid-head" role="row">
	<div class="f1-grid-cell" role="columnheader">Category</div>
	<div class="f1-grid-cell" role="columnheader">Size</div>
	<div class="f1-grid-cell" role="columnheader">Description</div>
	</div>
	<div class="f1-grid-row" role="row">
	<div class="f1-grid-cell" role="cell">Warmup</div>
	<div class="f1-grid-cell" role="cell">100</div>
	<div class="f1-grid-cell" role="cell">A set of “easier” problems.</div>
	</div>
	<div class="f1-grid-row" role="row">
	<div class="f1-grid-cell" role="cell">Tier 1</div>
	<div class="f1-grid-cell" role="cell">100</div>
	<div class="f1-grid-cell" role="cell">A set of challenging problems.</div>
	</div>
	<div class="f1-grid-row" role="row">
	<div class="f1-grid-cell" role="cell">Tier 2</div>
	<div class="f1-grid-cell" role="cell">20</div>
	<div class="f1-grid-cell" role="cell">A set of highly challenging problems.</div>
	</div>
	</div>
	</div>
	</section>
	</div>
	"""

	WHAT_IS_F1_HTML_BOTTOM_TOP = """
	<div class="f1-container">
	<section>
	<p class="mb-4 f1-p">The last category, Tier 2, is incredibly demanding. It requires resolving many points of uncertainty and involves an array of reasoning steps, including topological and geometric insight, knowledge of mathematical domains such as extremal graph theory and logic, combinatorial considerations, precise implementation, and more.</p>
	<p class="f1-p">Despite <a href="https://epoch.ai/frontiermath" target="_blank" rel="noopener noreferrer" class="f1-a">impressive</a> <a href="https://artificialanalysis.ai/evaluations/gpqa-diamond" target="_blank" rel="noopener noreferrer" class="f1-a">performance</a> on existing benchmarks, presently <strong>no model solves even a single FormulaOne Tier 2 problem</strong>.<sup><a href="#evaluation" class="f1-a">1</a></sup></p>
	</section>

	<section>
	<h2 class="f1-h2">An “Infinite Well” of Problems</h2>
	<p class="mb-4 f1-p">While the problems are often natural to state, their solutions are far from obvious. The solvability of this vast class of problems is guaranteed by an algorithmic <strong>meta-theorem</strong> due to <a href="https://en.wikipedia.org/wiki/Courcelle%27s_theorem" target="_blank" rel="noopener noreferrer" class="f1-a">Courcelle</a>, which broadly states:</p>
	<blockquote class="my-6 f1-blockquote">
	“For every sufficiently tree-like graph, any problem definable in an expressive formal logic — Monadic Second-Order (MSO) logic — can be solved by a dynamic programming algorithm that operates in time linear in the order of the graph.”
	</blockquote>
	<p class="f1-p">The key is to use a structure known as a tree decomposition, which organises the graph’s vertices into a series of overlapping sets, or “bags”, that are themselves arranged in a tree.</p>
	<figure class="f1-figure">
	<img src="/file=assets/bag_modifications.png" alt="An illustration of local modifications to bags (dashed boxes)" class="max-w-full md:max-w-2xl mx-auto rounded-lg shadow-md">
	<figcaption class="f1-figcaption">An illustration of local modifications to bags: Introduce, Forget, and Join.</figcaption>
	</figure>
	<p class="mb-4 f1-p">An algorithm can then traverse this tree of bags, solving the problem piece by piece using dynamic programming. This process involves designing a “state” that summarises all necessary information about the partial solution within a bag, and then defining how this state transforms as vertices are introduced or forgotten, or as bags are joined.</p>
	<!-- VIDEO INSERTED HERE VIA gr.Video IN app.py -->
	"""

	WHAT_IS_F1_HTML_BOTTOM_TAIL = """
	<p class="f1-p">The deceptive simplicity of the problem statements belies the <strong>extraordinary difficulty</strong> of discovering the correct dynamic programming solution. This process is riddled with subtle combinatorial and logical pitfalls, demanding a profound understanding of the problem’s underlying structure. For a detailed walkthrough of the fifteen interdependent reasoning steps required to solve a single hard problem — <code>Maximal-Cluster-Graph</code> — <a href="https://arxiv.org/pdf/2507.13337#appendix.A" target="_blank" rel="noopener noreferrer" class="f1-a">see the appendix of our paper</a>.</p>
	</section>

	<section id="evaluation">
	<h2 class="f1-h2">Evaluation</h2>
	<p class="mb-4 f1-p">To give models the best possible chance of success, we provide a generous few-shot prompt that covers a broad array of the ideas and techniques involved in solving these problems. All models were evaluated using their highest available reasoning settings and with the maximum context length permitted.</p>
	<p class="mb-4 f1-p">Each submitted solution is subjected to a rigorous and automated <a href="https://arxiv.org/pdf/2507.13337#section.4" target="_blank" rel="noopener noreferrer" class="f1-a">test suite</a> that measures three key aspects of its validity:</p>
	<ul class="list-disc list-inside space-y-2 mb-6">
	<li class="f1-li"><strong>Correctness:</strong> The output of the submitted algorithm must be correct on all graphs.</li>
	<li class="f1-li"><strong>Consistency:</strong> The solution must produce the same output for a given graph, regardless of the specific tree decomposition.</li>
	<li class="f1-li"><strong>Efficiency:</strong> The solution must be truly <a href="https://en.wikipedia.org/wiki/Parameterized_complexity" target="_blank" rel="noopener noreferrer" class="f1-a">fixed-parameter linear</a>.</li>
	</ul>
	<p class="mb-4 f1-p">To support research and encourage community contributions, the <code>FormulaOne-Warmup</code> dataset is released as a public resource for training and fine-tuning models. The complete test suite for all 100 Warmup problems is available, alongside a standalone evaluation environment, in our <a href="https://github.com/double-ai/formulaone-dataset/tree/main" target="_blank" rel="noopener noreferrer" class="f1-a">GitHub repository</a>.</p>
	<p class="f1-p">To maintain the integrity of the core benchmark, only a minimal subset of tests is released for the Tier 1 and Tier 2 problems.</p>

	<h2 class="f1-h2">Model Accuracy</h2>
	<p class="mb-4 f1-p">On the <strong>FormulaOne-Warmup</strong> problems, frontier models perform reasonably well. This confirms they have a foundational capability for these types of algorithmic tasks.</p>
	<figure class="f1-figure">
	<img src="/file=assets/warmup_performance.png" alt="Plot showing model performance on FormulaOne-Warmup" class="max-w-full md:max-w-2xl mx-auto rounded-lg shadow-md">
	<figcaption class="f1-figcaption">Performance of frontier models on the FormulaOne-Warmup dataset.</figcaption>
	</figure>
	<p class="mb-4 f1-p">However, as the reasoning depth increases in <strong>Tier 1</strong>, and solutions require the discovery and integration of novel and more complex state representations, model performance drops off sharply.</p>
	<figure class="f1-figure">
	<img src="/file=assets/tier1_performance.png" alt="Plot showing model performance on Tier 1" class="max-w-full md:max-w-2xl mx-auto rounded-lg shadow-md">
	<figcaption class="f1-figcaption">Performance of frontier reasoning models on the FormulaOne Tier 1 problems.</figcaption>
	</figure>
	<p class="f1-p">This trend culminates in <strong>Tier 2</strong>, where the difficulty is characteristic of exploratory research problems. On this set of 20 problems, no current frontier model solves even a single one. This result starkly illustrates the gap that remains between high performance on existing benchmarks and the deep algorithmic reasoning required for truly complex problems.</p>
	</section>
	</div>
	"""


	EVALUATION_QUEUE_TEXT = """
	## Submitting to the FormulaOne Leaderboard

	This leaderboard evaluates systems on the FormulaOne core dataset. Submissions consist of a .jsonl file with solution code for each problem.

	### 📁 I. Format Your Submission File

	Your submission must be a .jsonl file with one entry per problem:

	```json
	{"problem_id": "1", "solution": "<your Python code here>"}
	{"problem_id": "2", "solution": "<your Python code here>"}
	...
	```

	- problem_id: Must match the official list of FormulaOne core problems.
	- solution: Python code implementing the required callback functions.
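
Solutions of this kind typically run a dynamic program over a tree decomposition, with one handler per node type. The sketch below only illustrates that pattern on a toy problem (the size of a maximum independent set). The tuple-based node encoding and the names `leaf`, `introduce`, `forget`, `join`, `solve`, and `chain` are assumptions made for this example; they are not the benchmark's callback interface, so refer to the dataset repository for the actual required signatures.

```python
# Toy sketch: maximum independent set size via DP over a nice tree decomposition.
# The node encoding and every function name here are hypothetical.

def leaf():
    # Empty bag: the only partial solution selects nothing.
    return {frozenset(): 0}

def introduce(table, v, adj):
    # v enters the bag: keep each state, and also try selecting v
    # when none of its neighbours is already selected in the bag.
    out = {}
    for chosen, size in table.items():
        out[chosen] = size
        if not (adj[v] & chosen):
            out[chosen | {v}] = size + 1
    return out

def forget(table, v):
    # v leaves the bag: merge states that differ only on v, keeping the best.
    out = {}
    for chosen, size in table.items():
        key = chosen - {v}
        out[key] = max(out.get(key, 0), size)
    return out

def join(left, right):
    # Both children share the same bag: states must agree on the bag,
    # and selected bag vertices are counted once, not twice.
    return {s: left[s] + right[s] - len(s) for s in left.keys() & right.keys()}

def solve(node, adj):
    # Evaluate the decomposition bottom-up; returns the table at this node.
    kind = node[0]
    if kind == "leaf":
        return leaf()
    if kind == "introduce":
        return introduce(solve(node[2], adj), node[1], adj)
    if kind == "forget":
        return forget(solve(node[2], adj), node[1])
    if kind == "join":
        return join(solve(node[1], adj), solve(node[2], adj))
    raise ValueError(f"unknown node kind: {kind!r}")

def chain(steps, child=("leaf",)):
    # Stack introduce/forget nodes on top of a child, first step lowest.
    for kind, v in steps:
        child = (kind, v, child)
    return child

# Path graph a - b - c, described by two different decompositions.
ADJ = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
PATH_DECOMP = chain([("introduce", "a"), ("introduce", "b"), ("forget", "a"),
                     ("introduce", "c"), ("forget", "b"), ("forget", "c")])
JOIN_DECOMP = chain(
    [("forget", "b")],
    ("join",
     chain([("introduce", "a"), ("introduce", "b"), ("forget", "a")]),
     chain([("introduce", "c"), ("introduce", "b"), ("forget", "c")])),
)

print(solve(PATH_DECOMP, ADJ)[frozenset()])  # prints 2
print(solve(JOIN_DECOMP, ADJ)[frozenset()])  # prints 2
```

Both decompositions describe the same graph and yield the same answer, which is the behaviour a decomposition-independent solution should have.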

	📄 Full list of problem_ids:
	View the [FormulaOne core dataset](https://github.com/double-ai/formulaone-dataset-release/dataset/formulaone) for the complete list of problem IDs.

	⚠️ Validation Rules:
	Submissions must:
	- Contain exactly two fields per entry: ["problem_id", "solution"]
	- Include all required problems (no missing/unknown IDs)
	- Provide solutions as Python strings
	- Avoid duplicates
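
Before uploading, it may help to check a file against these rules locally. The following is a minimal sketch and not the leaderboard's own validator: the function name `validate_submission` is hypothetical, `expected_ids` stands in for the official ID list (which you would load from the dataset), and treating `problem_id` as a string is an assumption based on the example above.

```python
import json

REQUIRED_KEYS = {"problem_id", "solution"}

def validate_submission(lines, expected_ids):
    # Returns a list of problems found; an empty list means no rule was violated.
    errors = []
    seen = set()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            errors.append(f"line {number}: not valid JSON")
            continue
        if not isinstance(entry, dict) or set(entry) != REQUIRED_KEYS:
            errors.append(f"line {number}: expected exactly the keys {sorted(REQUIRED_KEYS)}")
            continue
        problem_id, solution = entry["problem_id"], entry["solution"]
        # Assumption: IDs are strings, as in the example entries.
        if not isinstance(problem_id, str) or not isinstance(solution, str):
            errors.append(f"line {number}: problem_id and solution must be strings")
            continue
        if problem_id in seen:
            errors.append(f"line {number}: duplicate problem_id {problem_id!r}")
        seen.add(problem_id)
    missing = sorted(expected_ids - seen)
    unknown = sorted(seen - expected_ids)
    if missing:
        errors.append(f"missing problem_ids: {missing}")
    if unknown:
        errors.append(f"unknown problem_ids: {unknown}")
    return errors

# Build the JSONL lines from plain dictionaries, one JSON object per line.
entries = [
    {"problem_id": "1", "solution": "<your Python code here>"},
    {"problem_id": "2", "solution": "<your Python code here>"},
]
lines = [json.dumps(entry) for entry in entries]

print(validate_submission(lines, {"1", "2"}))  # prints []
```

To check a file on disk, pass the open file handle as `lines`, since iterating over a text file yields one line at a time.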

	### 📤 II. Submit via the UI below

	- Upload your `.jsonl` file.
	- Fill in the following fields:
	  - System Name
	  - Organization
	  - System Type
	- Click Submit.

	### ⏱️ After Submission

	Submissions are validated and evaluated within ~24 hours. Results will appear on the leaderboard once processed.
	"""


	CITATION_BUTTON_LABEL = """📚 How to cite FormulaOne"""
	CITATION_BUTTON_TEXT = r"""
	@misc{beniamini2025formulaonemeasuringdepthalgorithmic,
	title={FormulaOne: Measuring the Depth of Algorithmic Reasoning Beyond Competitive Programming},
	author={Gal Beniamini and Yuval Dor and Alon Vinnikov and Shir Granot Peled and Or Weinstein and Or Sharir and Noam Wies and Tomer Nussbaum and Nadav Schweiger and Ido Ben Shaul and Tomer Zekharya and Yoav Levine and Shai Shalev-Shwartz and Amnon Shashua},
	year={2025},
	eprint={2507.13337},
	archivePrefix={arXiv},
	primaryClass={cs.AI},
	url={https://arxiv.org/abs/2507.13337},
	}
	"""