futurehouse/ether0

whitead committed on Jun 5

Commit bcde5cf (verified) · 1 Parent(s): 1a1879f

added notes on benchmark

Files changed (1)

README.md +11 -2

@@ -43,11 +43,20 @@ It has been trained specifically for these tasks:
 * natural product elucidation (formula + organism to SMILES)
 * blood-brain barrier permeability
-<img src="./images/benchmarks.png" width="800">
 For example, you can ask "Propose a molecule with a pKa of 9.2" or "Modify CCCCC(O)=OH to increase its pKa by about 1 unit." You cannot ask it "What is the pKa of CCCCC(O)=OH?"
 If you ask it questions that lie significantly beyond those tasks, it can fail. You can combine properties, although we haven't significantly benchmarked this.
+## Benchmarks
+We tested ether0, along with some experts and frontier models, on [a benchmark we developed](https://huggingface.co/datasets/futurehouse/ether0-benchmark/).
+The benchmark is made from commonly used tasks - like reaction prediction in USPTO, molecular captioning from PubChem, or predicting GHS classification.
+The benchmark is different in two ways: all answers are a molecule, and we balanced it so that each task is 25 questions so that is reasonable length for frontier model evals.
+The tasks generally follow previously reported numbers - e.g., a reaction prediction accuracy of 80% here would be about the same on a withheld split of the USPTO-50k dataset.
+The results below are the model weights released in this repo. This is different than the preprint, which has pre-safety mitigation benchmarks.
+<img src="./images/benchmarks.png" width="800">
 ## Limitations
 It does not know general synonyms and it has poor textbook knowledge (e.g. it does not perform especially well on chembench).
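
Note on the added Benchmarks text: the README says the benchmark was balanced so that each task has 25 questions. The commit does not show how that was done, so the following is only a minimal sketch of one way to draw a fixed-size sample per task. The `task` field, the `balance_by_task` helper, and the toy records are hypothetical and are not taken from the ether0-benchmark dataset.

```python
import random
from collections import Counter, defaultdict


def balance_by_task(questions, per_task=25, seed=0):
    """Sample exactly `per_task` questions from each task, without replacement.

    `questions` is a list of dicts with a hypothetical "task" key.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    by_task = defaultdict(list)
    for question in questions:
        by_task[question["task"]].append(question)

    balanced = []
    for task in sorted(by_task):  # sorted for a deterministic task order
        pool = by_task[task]
        if len(pool) < per_task:
            raise ValueError(f"task {task!r} has only {len(pool)} questions")
        balanced.extend(rng.sample(pool, per_task))
    return balanced


# Toy records standing in for benchmark rows (assumed shape, not the real schema).
toy_questions = [
    {"task": task, "id": i}
    for task, count in [("reaction-prediction", 40), ("molecular-captioning", 30)]
    for i in range(count)
]

subset = balance_by_task(toy_questions)
print(Counter(q["task"] for q in subset))
```

Equal task sizes keep a per-task accuracy comparable across tasks and keep the total question count small enough for frontier model evals, which is the motivation the README gives.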