whitead committed on
Commit 79cf380 · verified · 1 Parent(s): bcde5cf

more wording fixes

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -50,7 +50,7 @@ If you ask it questions that lie significantly beyond those tasks, it can fail.

  We tested ether0, along with some experts and frontier models, on [a benchmark we developed](https://huggingface.co/datasets/futurehouse/ether0-benchmark/).
  The benchmark is made from commonly used tasks - like reaction prediction in USPTO, molecular captioning from PubChem, or predicting GHS classification.
- The benchmark is different in two ways: all answers are a molecule, and we balanced it so that each task is 25 questions so that is reasonable length for frontier model evals.
+ The benchmark is different in two ways: all answers are a molecule, and we balanced it so that each task is 25 questions (a reasonable amount for frontier model evals).
  The tasks generally follow previously reported numbers - e.g., a reaction prediction accuracy of 80% here would be about the same on a withheld split of the USPTO-50k dataset.

  The results below are the model weights released in this repo. This is different than the preprint, which has pre-safety mitigation benchmarks.