futurehouse/ether0

whitead committed 29 days ago

Commit 79cf380 (verified) · 1 parent: bcde5cf

more wording fixes

Files changed (1)

README.md +1 -1

@@ -50,7 +50,7 @@ If you ask it questions that lie significantly beyond those tasks, it can fail.
 We tested ether0, along with some experts and frontier models, on [a benchmark we developed](https://huggingface.co/datasets/futurehouse/ether0-benchmark/).
 The benchmark is made from commonly used tasks - like reaction prediction in USPTO, molecular captioning from PubChem, or predicting GHS classification.
-The benchmark is different in two ways: all answers are a molecule, and we balanced it so that each task is 25 questions so that is reasonable length for frontier model evals.
+The benchmark is different in two ways: all answers are a molecule, and we balanced it so that each task is 25 questions (a reasonable amount for frontier model evals).
 The tasks generally follow previously reported numbers - e.g., a reaction prediction accuracy of 80% here would be about the same on a withheld split of the USPTO-50k dataset.
 The results below are the model weights released in this repo. This is different than the preprint, which has pre-safety mitigation benchmarks.