---
license: cc-by-nc-4.0
---

## ProkBERT-mini Model

ProkBERT-mini-k6s1 is part of the ProkBERT family of genomic language models, specifically designed for microbiome applications. The model is optimized for DNA sequence analysis and uses an overlapping k-mer tokenization strategy to capture local context in genomic sequences.

### Model Details

**Developed by:** Neural Bioinformatics Research Group

**Architecture:** ProkBERT-mini-k6s1 is based on the MegatronBert architecture, a variant of BERT optimized for large-scale training. The model uses learnable relative key-value positional embeddings and maps input vectors into a 384-dimensional space.

**Tokenizer:** The model uses a 6-mer tokenizer with a shift of 1 (k6s1), designed to handle DNA sequences efficiently.
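
To make the k6s1 scheme concrete, here is a small illustrative sketch in plain Python. This is not the actual ProkBERT tokenizer (which also handles vocabulary lookup and special tokens); it only shows how a DNA sequence decomposes into overlapping 6-mers with a shift of 1:

```python
def kmer_tokenize(sequence: str, k: int = 6, shift: int = 1) -> list[str]:
    """Slide a window of length k over the sequence, advancing by `shift`."""
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, shift)]

tokens = kmer_tokenize("ATGAAACGCAT")
print(tokens)
# ['ATGAAA', 'TGAAAC', 'GAAACG', 'AAACGC', 'AACGCA', 'ACGCAT']
```

With a shift of 1, consecutive tokens overlap in 5 of their 6 nucleotides, which is why the masking strategy described under Training Data and Process treats neighboring tokens jointly.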

**Parameters:**

| Parameter         | Value                      |
|-------------------|----------------------------|
| Model size        | 20.6 million parameters    |
| Max. context size | 1024 bp                    |
| Training data     | 206.65 billion nucleotides |
| Layers            | 6                          |
| Attention heads   | 6                          |
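
The table can be cross-checked against the released checkpoint. The sketch below assumes the model is published on the Hugging Face Hub and that the custom tokenizer is loaded with `trust_remote_code=True`; the model ID is a placeholder, so substitute the actual repository name:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL_ID = "neuralbioinfo/prokbert-mini"  # assumed Hub ID; adjust as needed

# trust_remote_code=True is assumed to be needed for the custom k-mer tokenizer.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID, trust_remote_code=True)

cfg = model.config
print(cfg.hidden_size)           # expected: 384
print(cfg.num_hidden_layers)     # expected: 6
print(cfg.num_attention_heads)   # expected: 6
print(sum(p.numel() for p in model.parameters()))  # roughly 20.6M parameters
```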

### Intended Use

**Intended Use Cases:** ProkBERT-mini-k6s1 is intended for bioinformatics researchers and practitioners focusing on genomic sequence analysis, including:

- Sequence classification tasks (see the fine-tuning sketch after this list)
- Exploration of genomic patterns and features
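
As a starting point for classification, the hedged sketch below attaches a fresh classification head to the pretrained encoder. The Hub ID, the binary label set, and the toy sequences are illustrative assumptions; a real workflow would fine-tune on a labeled genomic dataset, e.g. with the `Trainer` API:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "neuralbioinfo/prokbert-mini"  # assumed Hub ID; adjust as needed

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
# The classification head (2 labels here) is randomly initialized and
# must be fine-tuned before its predictions mean anything.
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID, num_labels=2, trust_remote_code=True
)

# Toy batch: two DNA fragments with hypothetical binary labels.
sequences = ["ATGAAACGCATTAGCACCACC", "TTGACAGCTAGCTCAGTCCT"]
labels = torch.tensor([0, 1])

batch = tokenizer(sequences, padding=True, return_tensors="pt")
outputs = model(**batch, labels=labels)
print(outputs.loss)  # backpropagate this loss during fine-tuning
```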

**Out-of-scope Uses:** The model is not intended for non-genomic data or for applications outside bioinformatics.

### Training Data and Process

**Overview:** The model was pretrained on a large and diverse dataset of genomic sequences to ensure broad coverage and robust learning.

**Training Process:**

- **Masked Language Modeling (MLM):** The MLM objective was adapted for genomic sequences, with a masking strategy that accounts for overlapping k-mers (illustrated in the sketch after this list).
- **Training Phases:** The model was first trained with complete sequence restoration and selective masking, followed by a phase on variable-length datasets for increased complexity.
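
Why overlapping k-mers need special care: with a shift of 1, each nucleotide appears in up to six neighboring tokens, so masking a single token still leaks the hidden bases through its overlaps. The toy sketch below only illustrates this idea by masking a chosen token together with every neighbor that overlaps it; it is not ProkBERT's actual masking code:

```python
import random

MASK = "[MASK]"

def mask_with_overlaps(tokens: list[str], k: int = 6) -> tuple[list[str], int]:
    """Mask one randomly chosen k-mer plus all neighbors that overlap it.

    With shift 1, tokens i-(k-1) .. i+(k-1) share nucleotides with token i,
    so they must be hidden too, or the masked bases leak to the model.
    """
    i = random.randrange(len(tokens))
    lo, hi = max(0, i - (k - 1)), min(len(tokens), i + k)
    return tokens[:lo] + [MASK] * (hi - lo) + tokens[hi:], i

tokens = ["ATGAAA", "TGAAAC", "GAAACG", "AAACGC", "AACGCA", "ACGCAT"]
masked, target = mask_with_overlaps(tokens)
print(target, masked)
```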

### Evaluation Results

| Metric                     | Result       | Notes |
|----------------------------|--------------|-------|
| Metric 1 (e.g., accuracy)  | To be filled |       |
| Metric 2 (e.g., precision) | To be filled |       |
| Metric 3 (e.g., recall)    | To be filled |       |

*Additional details and metrics can be included as they become available.*

### Ethical Considerations and Limitations

As with all models in the bioinformatics domain, ProkBERT-mini-k6s1 should be used responsibly. Testing and evaluation have been conducted within specific genomic contexts, and the model's outputs in other scenarios are not guaranteed. Users should exercise caution and perform additional validation for their specific use cases.

### Reporting Issues

Please report any issues with the model or its outputs to the Neural Bioinformatics Research Group through the following channels:

- **Model issues:** [GitHub repository](https://github.com/nbrg-ppcu/prokbert)
- **Feedback and inquiries:** [[email protected]](mailto:[email protected])