added notes on benchmark
README.md
CHANGED
@@ -43,11 +43,20 @@ It has been trained specifically for these tasks:
|
|
43 |
* natural product elucidation (formula + organism to SMILES)
|
44 |
* blood-brain barrier permeability
|
45 |
|
46 |
-
<img src="./images/benchmarks.png" width="800">
|
47 |
-
|
48 |
For example, you can ask "Propose a molecule with a pKa of 9.2" or "Modify CCCCC(=O)O to increase its pKa by about 1 unit." You cannot ask it "What is the pKa of CCCCC(=O)O?"
|
49 |
If you ask it questions that lie significantly beyond those tasks, it can fail. You can combine properties, although we haven't significantly benchmarked this.
|
50 |
|
51 |
## Limitations
|
52 |
|
53 |
It does not know general synonyms and it has poor textbook knowledge (e.g. it does not perform especially well on chembench).
|
|
|
43 |
* natural product elucidation (formula + organism to SMILES)
|
44 |
* blood-brain barrier permeability
|
45 |
|
|
|
|
|
46 |
For example, you can ask "Propose a molecule with a pKa of 9.2" or "Modify CCCCC(=O)O to increase its pKa by about 1 unit." You cannot ask it "What is the pKa of CCCCC(=O)O?"
|
47 |
If you ask it questions that lie significantly beyond those tasks, it can fail. You can combine properties, although we haven't significantly benchmarked this.
|
48 |
|
49 |
+
## Benchmarks
|
50 |
+
|
51 |
+
We tested ether0, along with some experts and frontier models, on [a benchmark we developed](https://huggingface.co/datasets/futurehouse/ether0-benchmark/).
|
52 |
+
The benchmark is built from commonly used tasks, such as reaction prediction on USPTO, molecular captioning from PubChem, and GHS classification prediction.
|
53 |
+
It differs from those source tasks in two ways: every answer is a molecule, and each task is balanced to 25 questions, which keeps the benchmark a reasonable length for frontier model evals.
|
54 |
+
Results on these tasks generally track previously reported numbers: for example, a reaction prediction accuracy of 80% here would correspond to about the same accuracy on a withheld split of the USPTO-50k dataset.
|
55 |
+
|
56 |
+
The results below are for the model weights released in this repo. They differ from the results in the preprint, which reports benchmarks from before safety mitigation.
|
57 |
+
|
58 |
+
<img src="./images/benchmarks.png" width="800">
|
59 |
+
|
60 |
## Limitations
|
61 |
|
62 |
It does not know general synonyms and it has poor textbook knowledge (e.g. it does not perform especially well on chembench).
|
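
As an illustration of the balancing that the added Benchmarks section describes (25 questions per task), here is a minimal Python sketch. It is not the procedure the authors used; the `balance_by_task` helper, the `task` and `id` field names, and the toy rows are all hypothetical and chosen only for the example.

```python
import random
from collections import defaultdict


def balance_by_task(rows, per_task=25, seed=0):
    """Keep at most `per_task` randomly chosen rows for each task label.

    Hypothetical helper: the real benchmark's schema and sampling
    procedure are not described in the README.
    """
    by_task = defaultdict(list)
    for row in rows:
        by_task[row["task"]].append(row)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    balanced = []
    for task in sorted(by_task):
        group = by_task[task]
        # Sample without replacement; groups smaller than per_task are kept whole.
        k = min(per_task, len(group))
        balanced.extend(rng.sample(group, k))
    return balanced


# Toy data with made-up task names and ids (assumptions, not real benchmark rows).
rows = [{"task": "reaction-prediction", "id": i} for i in range(40)]
rows += [{"task": "molecule-captioning", "id": i} for i in range(30)]

subset = balance_by_task(rows)
counts = defaultdict(int)
for row in subset:
    counts[row["task"]] += 1
print(dict(counts))
```

Capping every task at the same size keeps the total evaluation short and prevents tasks with many available questions from dominating an averaged score.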