---
title: README
emoji: π
colorFrom: purple
colorTo: indigo
sdk: static
pinned: false
---
<p align="center">
  <img src="logo.png" alt="Logo of LatestEval" width="400" height="auto" />
</p>
# "Uncheatable" LLMs Evaluation - LatestEval
Humans receive new test questions every exam, but LLMs? They've been evaluated with the same benchmarks for too long. Why not assess LLMs with fresh tests, just as we test our students? In this project, we introduce LatestEval, which automatically constructs language model benchmarks from the latest materials (e.g., arXiv, BBC, GitHub) to prevent "cheating" and data contamination.
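As a rough illustration of the idea (not the project's actual pipeline), the sketch below pulls the newest arXiv abstracts and keeps only those submitted after a cutoff date, so the test material postdates any plausible training cutoff. The `feedparser` dependency, the `cs.CL` category, and the cutoff date are assumptions made for this example.

```python
# Minimal sketch: fetch recent arXiv abstracts and keep only those submitted
# after a cutoff date. This is an illustration of the "fresh material" idea,
# not the official LatestEval pipeline. Requires `pip install feedparser`.
from datetime import datetime, timezone

import feedparser

CUTOFF = datetime(2023, 12, 1, tzinfo=timezone.utc)  # hypothetical cutoff date
ARXIV_API = (
    "http://export.arxiv.org/api/query"
    "?search_query=cat:cs.CL&sortBy=submittedDate&sortOrder=descending&max_results=50"
)


def fetch_fresh_abstracts():
    """Return (title, abstract) pairs for papers submitted after CUTOFF."""
    feed = feedparser.parse(ARXIV_API)
    fresh = []
    for entry in feed.entries:
        published = datetime.strptime(
            entry.published, "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if published >= CUTOFF:
            fresh.append((entry.title, entry.summary))
    return fresh


if __name__ == "__main__":
    for title, _abstract in fetch_fresh_abstracts()[:5]:
        print(title)
```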
**News!!**
- **15 Dec, 2023** - This project was accepted to the main track of **AAAI 2024**! Check out the paper here: [Dynamic Test Construction with Latest Materials](https://arxiv.org/abs/2312.12343).
# Key Features
1. We maintain a QA benchmark that is refreshed every half month using the latest online resources (those created within the preceding half month; see the sketch after this list). This approach aims to avoid 1) LLMs being trained on the test set (cheating); and 2) the unintentional inclusion of test questions in the training dataset (data contamination).
2. We analyzed real Human-AI conversations to ensure the automated benchmark aligns well with real-life applications (see the paper for more details).
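For illustration only, the sketch below shows one way the refresh window and the automated item construction could fit together: a half-month freshness filter for candidate documents, and a simple fill-in-the-blank item built from a fresh passage. The 15-day window, the sentence-level masking, and the helper names are assumptions for this example, not the construction procedure described in the paper.

```python
# Illustrative sketch (assumed approach, not the paper's exact templates):
# keep documents created within the last half month, then build a simple
# fill-in-the-blank QA item by masking one sentence of a fresh passage.
import random
import re
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(days=15)  # "half month" refresh window (assumption)


def is_fresh(created_at, now=None):
    """True if the document (timezone-aware datetime) falls inside the window."""
    now = now or datetime.now(timezone.utc)
    return now - created_at <= WINDOW


def make_cloze_item(passage):
    """Mask one sentence of the passage and use it as the reference answer."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", passage) if s.strip()]
    answer = random.choice(sentences)
    question = passage.replace(answer, "____")
    return {"context": question, "answer": answer}
```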