more wording fixes
README.md CHANGED
@@ -50,7 +50,7 @@ If you ask it questions that lie significantly beyond those tasks, it can fail.
 
 We tested ether0, along with some experts and frontier models, on [a benchmark we developed](https://huggingface.co/datasets/futurehouse/ether0-benchmark/).
 The benchmark is made from commonly used tasks - like reaction prediction in USPTO, molecular captioning from PubChem, or predicting GHS classification.
-The benchmark is different in two ways: all answers are a molecule, and we balanced it so that each task is 25 questions
+The benchmark is different in two ways: all answers are a molecule, and we balanced it so that each task is 25 questions (a reasonable amount for frontier model evals).
 The tasks generally follow previously reported numbers - e.g., a reaction prediction accuracy of 80% here would be about the same on a withheld split of the USPTO-50k dataset.
 
 The results below are the model weights released in this repo. This is different than the preprint, which has pre-safety mitigation benchmarks.
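A minimal sketch of pulling the benchmark dataset linked in the README, assuming the standard Hugging Face `datasets` API; the split and column names are not stated in the README, so the snippet only inspects whatever the dataset actually exposes rather than assuming a schema:

```python
# Sketch: load and inspect the ether0 benchmark referenced above.
# The dataset id comes from the README link; which splits/columns exist
# is not specified there, so we just print what load_dataset returns.
from datasets import load_dataset

bench = load_dataset("futurehouse/ether0-benchmark")

print(bench)  # shows available splits and their column names

first_split = next(iter(bench))       # whichever split is present
print(bench[first_split][0])          # one example question/answer record
```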