---
license: cc-by-sa-4.0
tags:
- DNA
- biology
- genomics
- protein
- kmer
- cancer
- gleason-grade-group
---
## Project Description 
This repository contains the trained model for our paper **Fine-tuning a Sentence Transformer for DNA & Protein tasks**, currently under review at BMC Bioinformatics. The model, called **simcse-dna**, is based on the original implementation of **SimCSE [1]**. It was adapted for DNA downstream tasks by training on a small sample of k-mer tokens generated from the human reference genome, and can be used to generate sentence embeddings for DNA tasks.
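The k-mer tokenisation step is not included in this repository. As a minimal sketch, assuming overlapping 6-mers (the `kmer_tokens` helper below is hypothetical, not part of the released code):

```python
# Hypothetical helper (not part of the released code): split a DNA
# sequence into overlapping k-mer "words", assuming k = 6 as used
# for this model.
def kmer_tokens(sequence, k=6):
    """Return the list of overlapping k-mers of `sequence`."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# A 6-mer tokenisation of a short sequence:
print(kmer_tokens("ATGCGTACG"))  # ['ATGCGT', 'TGCGTA', 'GCGTAC', 'CGTACG']
```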

###  Prerequisites 
-----------
Please see the original [SimCSE](https://github.com/princeton-nlp/SimCSE) repository for installation details. The model is also hosted on Zenodo (DOI: 10.5281/zenodo.11046580).

### Usage 

Run the following code to get the sentence embeddings:

```python 
import torch
from transformers import AutoModel, AutoTokenizer

# Load the trained model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("dsfsi/simcse-dna")
model = AutoModel.from_pretrained("dsfsi/simcse-dna")

# sentences is your list of n DNA k-mer tokens of size 6
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Get the embeddings (one vector per input sentence)
with torch.no_grad():
    embeddings = model(**inputs, output_hidden_states=True, return_dict=True).pooler_output
```
The retrieved embeddings can be utilized as input for a machine learning classifier to perform classification.
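As a sketch of that classification step, the pooled embeddings can be fed to, for example, scikit-learn's `LogisticRegression` (one of the classifiers evaluated below). Random arrays stand in here for real embeddings and labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data standing in for real model outputs: 100 embeddings of
# dimension 768 (the BERT-base pooler size) with binary task labels.
# In practice X would be embeddings.numpy() from the snippet above.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 768))
y = rng.integers(0, 2, size=100)

# Train a simple classifier on the embeddings and score it on a held-out split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")
```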

## Performance on evaluation tasks

Details of the datasets and how to access them will be provided in the paper **(TBA)**.

**Table:** Accuracy scores (with 95% confidence intervals) across datasets T1–T8 for each model and embedding method. 

| Model | Embed.    | T1             | T2             | T3             | T4             | T5             | T6             | T7             | T8             |
|-------|-----------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|
| LR    | Proposed  | _0.65 ± 0.01_  | _0.67 ± 0.0_   | _0.85 ± 0.01_  | _0.64 ± 0.01_  | _0.80 ± 0.0_   | _0.49 ± 0.0_   | _0.33 ± 0.0_   | _0.70 ± 0.01_  |
|       | DNABERT   | 0.62 ± 0.01    | 0.65 ± 0.0     | 0.84 ± 0.04    | 0.69 ± 0.01    | 0.85 ± 0.01    | 0.49 ± 0.0     | 0.33 ± 0.0     | 0.60 ± 0.01    |
|       | NT        | **0.66 ± 0.0** | **0.67 ± 0.0** | 0.84 ± 0.01    | **0.73 ± 0.0** | **0.85 ± 0.01**| **0.81 ± 0.0** | **0.62 ± 0.01**| **0.99 ± 0.0** |
| LGBM  | Proposed  | _0.64 ± 0.01_  | _0.66 ± 0.0_   | _0.90 ± 0.02_  | _0.61 ± 0.01_  | _0.78 ± 0.0_   | _0.49 ± 0.0_   | _0.33 ± 0.0_   | _0.81 ± 0.01_  |
|       | DNABERT   | 0.62 ± 0.01    | 0.65 ± 0.01    | 0.90 ± 0.02    | 0.65 ± 0.01    | 0.83 ± 0.0     | 0.49 ± 0.0     | 0.33 ± 0.0     | 0.75 ± 0.01    |
|       | NT        | 0.63 ± 0.01    | 0.66 ± 0.0     | **0.91 ± 0.02**| 0.72 ± 0.0     | **0.85 ± 0.0** | **0.80 ± 0.0** | **0.59 ± 0.01**| 0.97 ± 0.0     |
| XGB   | Proposed  | _0.60 ± 0.01_  | _0.62 ± 0.0_   | _0.90 ± 0.02_  | _0.60 ± 0.0_   | _0.77 ± 0.0_   | _0.49 ± 0.0_   | _0.33 ± 0.0_   | _0.85 ± 0.01_  |
|       | DNABERT   | 0.59 ± 0.01    | 0.62 ± 0.01    | 0.90 ± 0.01    | 0.64 ± 0.01    | 0.82 ± 0.01    | 0.49 ± 0.0     | 0.33 ± 0.0     | 0.79 ± 0.01    |
|       | NT        | 0.61 ± 0.01    | 0.64 ± 0.0     | 0.90 ± 0.02    | **0.89 ± 0.03**| **0.85 ± 0.01**| **0.81 ± 0.01**| **0.60 ± 0.01**| 0.98 ± 0.0     |
| RF    | Proposed  | _0.61 ± 0.0_   | _0.66 ± 0.01_  | _0.90 ± 0.02_  | _0.61 ± 0.01_  | _0.77 ± 0.0_   | _0.49 ± 0.0_   | _0.33 ± 0.0_   | _0.86 ± 0.0_   |
|       | DNABERT   | 0.60 ± 0.0     | 0.66 ± 0.01    | 0.90 ± 0.02    | 0.63 ± 0.01    | 0.82 ± 0.0     | 0.49 ± 0.0     | 0.33 ± 0.0     | 0.81 ± 0.01    |
|       | NT        | 0.62 ± 0.01    | **0.67 ± 0.01**| 0.90 ± 0.01    | 0.71 ± 0.01    | **0.85 ± 0.0** | **0.79 ± 0.0** | **0.55 ± 0.01**| 0.97 ± 0.0     |


**Table:** F1-scores (with 95% confidence intervals) across datasets T1–T8 for each model and embedding method. 

| Model | Embed.    | T1             | T2             | T3             | T4             | T5             | T6             | T7             | T8             |
|-------|-----------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|
| LR    | Proposed  | **_0.78 ± 0.0_** | **_0.80 ± 0.01_** | _0.20 ± 0.05_  | _0.64 ± 0.01_  | _0.79 ± 0.0_   | _0.13 ± 0.37_  | _0.16 ± 0.0_   | _0.70 ± 0.01_  |
|       | DNABERT   | 0.75 ± 0.01    | 0.78 ± 0.0     | 0.47 ± 0.09    | 0.69 ± 0.01    | 0.84 ± 0.01    | 0.13 ± 0.37    | 0.16 ± 0.0     | 0.59 ± 0.01    |
|       | NT        | 0.56 ± 0.01    | 0.54 ± 0.0     | **0.78 ± 0.01**| **0.73 ± 0.0** | **0.85 ± 0.01**| **0.81 ± 0.0** | **0.62 ± 0.01**| **0.99 ± 0.0** |
| LGBM  | Proposed  | _0.76 ± 0.01_  | _0.79 ± 0.0_   | _0.60 ± 0.11_  | _0.63 ± 0.01_  | _0.77 ± 0.0_   | _0.47 ± 0.20_  | _0.26 ± 0.04_  | _0.82 ± 0.0_   |
|       | DNABERT   | 0.74 ± 0.0     | 0.78 ± 0.0     | 0.60 ± 0.08    | 0.66 ± 0.01    | 0.82 ± 0.01    | 0.47 ± 0.20    | 0.26 ± 0.04    | 0.75 ± 0.01    |
|       | NT        | 0.59 ± 0.01    | 0.56 ± 0.0     | **0.89 ± 0.02**| **0.72 ± 0.01**| **0.85 ± 0.0** | **0.80 ± 0.0** | **0.59 ± 0.01**| **0.97 ± 0.0** |
| XGB   | Proposed  | _0.72 ± 0.01_  | _0.75 ± 0.0_   | _0.59 ± 0.08_  | _0.60 ± 0.0_   | _0.76 ± 0.0_   | _0.47 ± 0.20_  | _0.26 ± 0.04_  | _0.85 ± 0.01_  |
|       | DNABERT   | 0.71 ± 0.01    | 0.75 ± 0.01    | 0.58 ± 0.05    | 0.64 ± 0.01    | 0.82 ± 0.01    | 0.47 ± 0.20    | 0.26 ± 0.04    | 0.79 ± 0.01    |
|       | NT        | 0.59 ± 0.01    | 0.57 ± 0.01    | 0.72 ± 0.01    | **0.85 ± 0.01**| **0.85 ± 0.01**| **0.81 ± 0.01**| **0.60 ± 0.01**| **0.99 ± 0.0** |
| RF    | Proposed  | _0.73 ± 0.0_   | _0.79 ± 0.0_   | _0.58 ± 0.08_  | _0.61 ± 0.01_  | _0.75 ± 0.0_   | _0.53 ± 0.17_  | _0.24 ± 0.05_  | _0.86 ± 0.0_   |
|       | DNABERT   | 0.72 ± 0.0     | 0.79 ± 0.0     | 0.59 ± 0.09    | 0.63 ± 0.01    | 0.80 ± 0.01    | 0.53 ± 0.17    | 0.24 ± 0.05    | 0.82 ± 0.01    |
|       | NT        | 0.59 ± 0.01    | 0.56 ± 0.01    | **0.89 ± 0.02**| **0.71 ± 0.01**| **0.84 ± 0.0** | **0.79 ± 0.0** | **0.55 ± 0.01**| **0.97 ± 0.0** |

## Authors 
-----------

* Mpho Mokoatle, Vukosi Marivate, Darlington Mapiye, Riana Bornman, Vanessa M. Hayes
* Contact details: [email protected]

## Citation 
-----------
Bibtex Reference **TBA**

### References

<a id="1">[1]</a> 
Gao, Tianyu, Xingcheng Yao, and Danqi Chen. "SimCSE: Simple Contrastive Learning of Sentence Embeddings." arXiv preprint arXiv:2104.08821 (2021).