---
license: cc-by-sa-4.0
tags:
- DNA
- biology
- genomics
- protein
- kmer
- cancer
- gleason-grade-group
---
## Project Description 
This repository contains the trained model for our paper **Fine-tuning a Sentence Transformer for DNA & Protein tasks**, currently under review at BMC Bioinformatics. The model, called **simcse-dna**, is based on the original implementation of **SimCSE [1]**. It was adapted for DNA downstream tasks by training on a small sample of k-mer tokens generated from the human reference genome, and can be used to generate sentence embeddings for DNA tasks.
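The k-mer tokenisation step is not included in this repository. As a minimal sketch, assuming overlapping 6-mers (the `kmer_tokens` helper below is hypothetical, not part of the released code):

```python
# Hypothetical helper (not part of the released code): split a DNA
# sequence into overlapping k-mer "words", assuming k = 6 as used
# for this model.
def kmer_tokens(sequence, k=6):
    """Return the list of overlapping k-mers of `sequence`."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# A 6-mer tokenisation of a short sequence:
print(kmer_tokens("ATGCGTACG"))  # ['ATGCGT', 'TGCGTA', 'GCGTAC', 'CGTACG']
```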

###  Prerequisites 
-----------
Please see the original [SimCSE](https://github.com/princeton-nlp/SimCSE) repository for installation details. The model is also hosted on Zenodo (DOI: 10.5281/zenodo.11046580).

### Usage 

Run the following code to get the sentence embeddings:

```python 
import torch
from transformers import AutoModel, AutoTokenizer

# Load the trained model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("dsfsi/simcse-dna")
model = AutoModel.from_pretrained("dsfsi/simcse-dna")

# sentences is your list of n DNA k-mer tokens of size 6
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Get the embeddings (one vector per input sentence)
with torch.no_grad():
    embeddings = model(**inputs, output_hidden_states=True, return_dict=True).pooler_output
```
The retrieved embeddings can be utilized as input for a machine learning classifier to perform classification.
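As a sketch of that classification step, the pooled embeddings can be fed to, for example, scikit-learn's `LogisticRegression` (one of the classifiers evaluated below). Random arrays stand in here for real embeddings and labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data standing in for real model outputs: 100 embeddings of
# dimension 768 (the BERT-base pooler size) with binary task labels.
# In practice X would be embeddings.numpy() from the snippet above.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 768))
y = rng.integers(0, 2, size=100)

# Train a simple classifier on the embeddings and score it on a held-out split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")
```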

## Performance on evaluation tasks

Details of the datasets and how to access them will be provided in the paper **(TBA)**.

**Table:** Accuracy scores (with 95% confidence intervals) across datasets T1–T8 for each model and embedding method. 

| Model | Embed.    | T1             | T2             | T3             | T4             | T5             | T6             | T7             | T8             |
|-------|-----------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|
| LR    | Proposed  | _0.65 ± 0.01_  | _0.67 ± 0.0_   | _0.85 ± 0.01_  | _0.64 ± 0.01_  | _0.80 ± 0.0_   | _0.49 ± 0.0_   | _0.33 ± 0.0_   | _0.70 ± 0.01_  |
|       | DNABERT   | 0.62 ± 0.01    | 0.65 ± 0.0     | 0.84 ± 0.04    | 0.69 ± 0.01    | 0.85 ± 0.01    | 0.49 ± 0.0     | 0.33 ± 0.0     | 0.60 ± 0.01    |
|       | NT        | **0.66 ± 0.0** | **0.67 ± 0.0** | 0.84 ± 0.01    | **0.73 ± 0.0** | **0.85 ± 0.01**| **0.81 ± 0.0** | **0.62 ± 0.01**| **0.99 ± 0.0** |
| LGBM  | Proposed  | _0.64 ± 0.01_  | _0.66 ± 0.0_   | _0.90 ± 0.02_  | _0.61 ± 0.01_  | _0.78 ± 0.0_   | _0.49 ± 0.0_   | _0.33 ± 0.0_   | _0.81 ± 0.01_  |
|       | DNABERT   | 0.62 ± 0.01    | 0.65 ± 0.01    | 0.90 ± 0.02    | 0.65 ± 0.01    | 0.83 ± 0.0     | 0.49 ± 0.0     | 0.33 ± 0.0     | 0.75 ± 0.01    |
|       | NT        | 0.63 ± 0.01    | 0.66 ± 0.0     | **0.91 ± 0.02**| 0.72 ± 0.0     | **0.85 ± 0.0** | **0.80 ± 0.0** | **0.59 ± 0.01**| 0.97 ± 0.0     |
| XGB   | Proposed  | _0.60 ± 0.01_  | _0.62 ± 0.0_   | _0.90 ± 0.02_  | _0.60 ± 0.0_   | _0.77 ± 0.0_   | _0.49 ± 0.0_   | _0.33 ± 0.0_   | _0.85 ± 0.01_  |
|       | DNABERT   | 0.59 ± 0.01    | 0.62 ± 0.01    | 0.90 ± 0.01    | 0.64 ± 0.01    | 0.82 ± 0.01    | 0.49 ± 0.0     | 0.33 ± 0.0     | 0.79 ± 0.01    |
|       | NT        | 0.61 ± 0.01    | 0.64 ± 0.0     | 0.90 ± 0.02    | **0.89 ± 0.03**| **0.85 ± 0.01**| **0.81 ± 0.01**| **0.60 ± 0.01**| 0.98 ± 0.0     |
| RF    | Proposed  | _0.61 ± 0.0_   | _0.66 ± 0.01_  | _0.90 ± 0.02_  | _0.61 ± 0.01_  | _0.77 ± 0.0_   | _0.49 ± 0.0_   | _0.33 ± 0.0_   | _0.86 ± 0.0_   |
|       | DNABERT   | 0.60 ± 0.0     | 0.66 ± 0.01    | 0.90 ± 0.02    | 0.63 ± 0.01    | 0.82 ± 0.0     | 0.49 ± 0.0     | 0.33 ± 0.0     | 0.81 ± 0.01    |
|       | NT        | 0.62 ± 0.01    | **0.67 ± 0.01**| 0.90 ± 0.01    | 0.71 ± 0.01    | **0.85 ± 0.0** | **0.79 ± 0.0** | **0.55 ± 0.01**| 0.97 ± 0.0     |


**Table:** F1-scores (with 95% confidence intervals) across datasets T1–T8 for each model and embedding method. 

| Model | Embed.    | T1             | T2             | T3             | T4             | T5             | T6             | T7             | T8             |
|-------|-----------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|
| LR    | Proposed  | **_0.78 ± 0.0_** | **_0.80 ± 0.01_** | _0.20 ± 0.05_  | _0.64 ± 0.01_  | _0.79 ± 0.0_   | _0.13 ± 0.37_  | _0.16 ± 0.0_   | _0.70 ± 0.01_  |
|       | DNABERT   | 0.75 ± 0.01    | 0.78 ± 0.0     | 0.47 ± 0.09    | 0.69 ± 0.01    | 0.84 ± 0.01    | 0.13 ± 0.37    | 0.16 ± 0.0     | 0.59 ± 0.01    |
|       | NT        | 0.56 ± 0.01    | 0.54 ± 0.0     | **0.78 ± 0.01**| **0.73 ± 0.0** | **0.85 ± 0.01**| **0.81 ± 0.0** | **0.62 ± 0.01**| **0.99 ± 0.0** |
| LGBM  | Proposed  | _0.76 ± 0.01_  | _0.79 ± 0.0_   | _0.60 ± 0.11_  | _0.63 ± 0.01_  | _0.77 ± 0.0_   | _0.47 ± 0.20_  | _0.26 ± 0.04_  | _0.82 ± 0.0_   |
|       | DNABERT   | 0.74 ± 0.0     | 0.78 ± 0.0     | 0.60 ± 0.08    | 0.66 ± 0.01    | 0.82 ± 0.01    | 0.47 ± 0.20    | 0.26 ± 0.04    | 0.75 ± 0.01    |
|       | NT        | 0.59 ± 0.01    | 0.56 ± 0.0     | **0.89 ± 0.02**| **0.72 ± 0.01**| **0.85 ± 0.0** | **0.80 ± 0.0** | **0.59 ± 0.01**| **0.97 ± 0.0** |
| XGB   | Proposed  | _0.72 ± 0.01_  | _0.75 ± 0.0_   | _0.59 ± 0.08_  | _0.60 ± 0.0_   | _0.76 ± 0.0_   | _0.47 ± 0.20_  | _0.26 ± 0.04_  | _0.85 ± 0.01_  |
|       | DNABERT   | 0.71 ± 0.01    | 0.75 ± 0.01    | 0.58 ± 0.05    | 0.64 ± 0.01    | 0.82 ± 0.01    | 0.47 ± 0.20    | 0.26 ± 0.04    | 0.79 ± 0.01    |
|       | NT        | 0.59 ± 0.01    | 0.57 ± 0.01    | 0.72 ± 0.01    | **0.85 ± 0.01**| **0.85 ± 0.01**| **0.81 ± 0.01**| **0.60 ± 0.01**| **0.99 ± 0.0** |
| RF    | Proposed  | _0.73 ± 0.0_   | _0.79 ± 0.0_   | _0.58 ± 0.08_  | _0.61 ± 0.01_  | _0.75 ± 0.0_   | _0.53 ± 0.17_  | _0.24 ± 0.05_  | _0.86 ± 0.0_   |
|       | DNABERT   | 0.72 ± 0.0     | 0.79 ± 0.0     | 0.59 ± 0.09    | 0.63 ± 0.01    | 0.80 ± 0.01    | 0.53 ± 0.17    | 0.24 ± 0.05    | 0.82 ± 0.01    |
|       | NT        | 0.59 ± 0.01    | 0.56 ± 0.01    | **0.89 ± 0.02**| **0.71 ± 0.01**| **0.84 ± 0.0** | **0.79 ± 0.0** | **0.55 ± 0.01**| **0.97 ± 0.0** |

## Authors 
-----------

* Mpho Mokoatle, Vukosi Marivate, Darlington Mapiye, Riana Bornman, Vanessa M. Hayes
* Contact details: [email protected]

## Citation 
-----------
Bibtex Reference **TBA**

### References

<a id="1">[1]</a> 
Gao, Tianyu, Xingcheng Yao, and Danqi Chen. "SimCSE: Simple Contrastive Learning of Sentence Embeddings." arXiv preprint arXiv:2104.08821 (2021).