Commit bcde5cf · verified · committed by whitead · 1 Parent(s): 1a1879f

added notes on benchmark

  1. README.md +11 -2
README.md CHANGED
@@ -43,11 +43,20 @@ It has been trained specifically for these tasks:
  * natural product elucidation (formula + organism to SMILES)
  * blood-brain barrier permeability

- <img src="./images/benchmarks.png" width="800">
-
  For example, you can ask "Propose a molecule with a pKa of 9.2" or "Modify CCCCC(O)=OH to increase its pKa by about 1 unit." You cannot ask it "What is the pKa of CCCCC(O)=OH?"
  If you ask it questions that lie significantly beyond those tasks, it can fail. You can combine properties, although we haven't significantly benchmarked this.

  ## Limitations

  It does not know general synonyms and it has poor textbook knowledge (e.g. it does not perform especially well on chembench).
 
  * natural product elucidation (formula + organism to SMILES)
  * blood-brain barrier permeability

  For example, you can ask "Propose a molecule with a pKa of 9.2" or "Modify CCCCC(O)=OH to increase its pKa by about 1 unit." You cannot ask it "What is the pKa of CCCCC(O)=OH?"
  If you ask it questions that lie significantly beyond those tasks, it can fail. You can combine properties, although we haven't significantly benchmarked this.

+ ## Benchmarks
+
+ We tested ether0, along with some experts and frontier models, on [a benchmark we developed](https://huggingface.co/datasets/futurehouse/ether0-benchmark/).
+ The benchmark is built from commonly used tasks, such as reaction prediction on USPTO, molecular captioning from PubChem, and GHS classification prediction.
+ The benchmark differs in two ways: every answer is a molecule, and we balanced it so that each task has 25 questions, which keeps it a reasonable length for frontier model evals.
+ Scores on these tasks generally track previously reported numbers; e.g., a reaction prediction accuracy of 80% here would be about the same on a withheld split of the USPTO-50k dataset.
+
+ The results below are for the model weights released in this repo. They differ from the preprint, which reports benchmarks from before safety mitigation.
+
+ <img src="./images/benchmarks.png" width="800">
+
  ## Limitations

  It does not know general synonyms and it has poor textbook knowledge (e.g. it does not perform especially well on chembench).
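
The added Benchmarks section says the benchmark was balanced so that each task has 25 questions. As a rough illustration of what that kind of balancing could look like, here is a minimal sketch using only the Python standard library. It is not the authors' actual procedure: the function name `balance_by_task`, the `"task"` key, and the toy records are all hypothetical, and the real dataset's column names may differ.

```python
import random
from collections import defaultdict


def balance_by_task(records, per_task=25, seed=0):
    """Return a subset holding exactly `per_task` records for each task.

    `records` is an iterable of dicts with a "task" key (a hypothetical
    field name chosen for this sketch). Raises ValueError if any task
    has fewer than `per_task` records.
    """
    # Group the records by their task label.
    by_task = defaultdict(list)
    for rec in records:
        by_task[rec["task"]].append(rec)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    balanced = []
    for task in sorted(by_task):  # sorted for a deterministic task order
        pool = by_task[task]
        if len(pool) < per_task:
            raise ValueError(
                f"task {task!r} has only {len(pool)} records, need {per_task}"
            )
        # Sample without replacement so no question is repeated.
        balanced.extend(rng.sample(pool, per_task))
    return balanced


if __name__ == "__main__":
    # Toy data: two made-up tasks with 40 questions each.
    toy = [
        {"task": task, "id": i}
        for task in ("reaction-prediction", "molecular-caption")
        for i in range(40)
    ]
    subset = balance_by_task(toy)
    print(len(subset), "questions across", len({r["task"] for r in subset}), "tasks")
```

With equal-sized tasks, an unweighted mean over all questions equals the mean of per-task accuracies, which is one practical reason to balance an eval this way.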