from dataclasses import dataclass
from enum import Enum


@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str


class Tasks(Enum):
    task0 = Task("anli_r1", "acc", "ANLI")
    task1 = Task("logiqa", "acc_norm", "LogiQA")
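
# Illustrative usage (an assumption, not part of the original file): the
# leaderboard code is expected to read these Task fields when building its
# results table, e.g.:
#   cols = [task.value.col_name for task in Tasks]  # -> ["ANLI", "LogiQA"]
#   anli_metric = Tasks.task0.value.metric          # -> "acc"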


NUM_FEWSHOT = 0


TITLE = """<h1 align="center" id="space-title">Eval-Anything Leaderboard</h1>"""


INTRODUCTION_TEXT = """
Eval-Anything is a framework designed specifically for evaluating all-modality models; it is part of the [Align-Anything](https://github.com/PKU-Alignment/align-anything) framework. It consists of two main tasks: All-Modality Understanding (AMU) and All-Modality Generation (AMG). AMU assesses a model's ability to simultaneously process and integrate information from all modalities, including text, images, audio, and video. AMG, in turn, evaluates a model's capability to autonomously select output modalities based on user instructions and to use different modalities synergistically when generating output. Eval-Anything aims to comprehensively assess the ability of all-modality models to handle heterogeneous data from multiple sources, providing a reliable evaluation tool for this field.

**Note:** Since most current open-source models lack support for all-modality output, (β) indicates that the model is used as an agent to invoke [AudioLDM2-Large](https://huggingface.co/cvssp/audioldm2-large) and [FLUX.1-schnell](https://huggingface.co/black-forest-labs/FLUX.1-schnell) for audio and image generation.
"""


LLM_BENCHMARKS_TEXT = """
"""

EVALUATION_QUEUE_TEXT = """
"""

CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = """
@misc{align_anything,
    author = {PKU-Alignment Team},
    title = {Align Anything: training all modality models to follow instructions with unified language feedback},
    year = {2024},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\\url{https://github.com/PKU-Alignment/align-anything}},
}
"""


ABOUT_TEXT = """
"""

SUBMISSION_TEXT = """
<h1 align="center">
How to submit models/results to the leaderboard?
</h1>

We welcome the community to submit evaluation results for new models. These results will be added as non-verified; however, authors are required to upload their generations so that other members can verify the results.

### 1 - Running Evaluation 🏃

We have written a detailed guide for running the evaluation on your model; you can find it in the [align-anything repository](https://github.com/PKU-Alignment/align-anything/tree/main/align_anything/evaluation/benchmarks/leaderboard). This process generates a JSON file and a ZIP file summarizing the results, along with the raw generations and metric files.

### 2 - Submitting Results 📩

To submit your results, create a **Pull Request** in the community tab to add them under the [`community_results` folder](https://huggingface.co/spaces/PKU-Alignment/EvalAnything-LeaderBoard/tree/main/community_results) in this repository:
- Create a folder named `ORG_MODELNAME_USERNAME`, for example `PKU-Alignment_gemini1.5-pro_XuyaoWang`.
- Place your JSON file and ZIP file with grouped scores from the guide, along with the generations folder and metrics folder, inside this newly created folder (see the layout sketch below).
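
For illustration, a submission folder might look like the following sketch (the file names here are illustrative assumptions; use the names produced by the evaluation guide):

```text
community_results/
└── PKU-Alignment_gemini1.5-pro_XuyaoWang/
    ├── results.json        # JSON summary of scores
    ├── results.zip         # ZIP file with grouped scores
    ├── generations/        # raw model outputs
    └── metrics/            # per-task metric files
```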

The title of the PR should be `[Community Submission] Model: org/model, Username: your_username`, replacing `org` and `model` with those of the model you evaluated, for example `[Community Submission] Model: PKU-Alignment/gemini1.5-pro, Username: XuyaoWang`.

### 3 - Getting your model verified ✅

A verified result in Eval-Anything indicates that a core maintainer has decoded the outputs from the model and performed the evaluation. To have your model verified, please follow these steps:

1. Email us and provide a brief rationale for why your model should be verified.
2. Await our response and approval before proceeding.
3. Prepare a script that decodes from your model and runs without requiring a local GPU. Typically, this should be the same script used for your model contribution. We strongly recommend modifying the scripts in [align-anything](https://github.com/PKU-Alignment/align-anything/tree/main/align_anything/evaluation/benchmarks/leaderboard) to adapt them to your model.
4. Generate temporary OpenAI API keys for running the script and share them with us; specifically, we need the keys for evaluation.
5. We will check and execute your script, update the results, and inform you so that you can revoke the temporary keys.

**Please note that we will not re-evaluate the same model. Due to sampling variance, the results might slightly differ from your initial ones. We will replace your previous community results with the verified ones.**
"""