---
license: cc-by-nc-4.0
---

## ProkBERT-mini Model

ProkBERT-mini-k6s1 is part of the ProkBERT family of genomic language models, designed specifically for microbiome applications. The model is optimized for DNA sequence analysis and employs a 6-mer tokenization strategy with a shift of 1 (k6s1) to capture and interpret complex genomic data effectively.

### Model Details

**Developed by:** Neural Bioinformatics Research Group

**Architecture:** ProkBERT-mini-k6s1 is based on the MegatronBert architecture, a variant of the BERT model optimized for large-scale training. The model employs learnable relative key-value positional embeddings and maps input vectors into a 384-dimensional space.

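For quick orientation, the snippet below sketches how these hyperparameters could be checked with the Hugging Face `transformers` library. The repository ID is illustrative and `trust_remote_code=True` is included only in case the repository ships custom code; adjust both to the actual hosting path of this checkpoint.

```python
# Minimal sketch for inspecting the architecture hyperparameters.
# NOTE: the repository ID below is an assumption for illustration.
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "neuralbioinfo/prokbert-mini-k6s1",  # assumed repo ID
    trust_remote_code=True,              # in case the repo ships custom code
)

print(config.model_type)           # expected: "megatron-bert"
print(config.hidden_size)          # expected: 384
print(config.num_hidden_layers)    # expected: 6
print(config.num_attention_heads)  # expected: 6
```
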
**Tokenizer:** The model uses a 6-mer tokenizer with a shift of 1 (k6s1), specifically designed to handle DNA sequences efficiently.

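To make the k6s1 scheme concrete, here is a simplified, dependency-free sketch of how a DNA sequence decomposes into overlapping 6-mers with a shift of 1. This is not the actual ProkBERT tokenizer (which additionally handles the vocabulary, special tokens and padding); it only illustrates the windowing.

```python
def kmer_tokenize(sequence: str, k: int = 6, shift: int = 1) -> list[str]:
    """Split a DNA sequence into overlapping k-mers (simplified illustration).

    With k=6 and shift=1 (the k6s1 setting), consecutive tokens overlap by
    five bases, so neighbouring tokens share most of their local context.
    """
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, shift)]

print(kmer_tokenize("ATGAAACGCATTAGC"))
# ['ATGAAA', 'TGAAAC', 'GAAACG', 'AAACGC', 'AACGCA', 'ACGCAT',
#  'CGCATT', 'GCATTA', 'CATTAG', 'ATTAGC']
```
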
**Parameters:**

| Parameter         | Value                      |
|-------------------|----------------------------|
| Model size        | 20.6 million parameters    |
| Max. context size | 1024 bp                    |
| Training data     | 206.65 billion nucleotides |
| Layers            | 6                          |
| Attention heads   | 6                          |

### Intended Use

**Intended Use Cases:** ProkBERT-mini-k6s1 is intended for bioinformatics researchers and practitioners focusing on genomic sequence analysis, including:
- Sequence classification tasks (see the sketch after this list)
- Exploration of genomic patterns and features

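As a hedged starting point for such tasks, the sketch below wraps the pretrained encoder in a standard `transformers` classification head. The repository ID, the two-label setup and the direct tokenizer call are illustrative assumptions; see the ProkBERT GitHub repository for the supported fine-tuning workflow.

```python
# Illustrative sketch only: repo ID, label count and tokenizer usage are assumptions.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "neuralbioinfo/prokbert-mini-k6s1"  # assumed repo ID
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=2,            # e.g. a binary label of interest
    trust_remote_code=True,
)

sequence = "ATGAAACGCATTAGCACCACCATTACCACCA"
inputs = tokenizer(sequence, return_tensors="pt")
logits = model(**inputs).logits
print(logits.shape)  # (1, num_labels)
```
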
**Out-of-scope Uses:** The model is not intended for use in non-genomic contexts or for applications outside the realm of bioinformatics.

### Training Data and Process

**Overview:** The model was pretrained on a comprehensive dataset of genomic sequences to ensure broad coverage and robust learning.

**Training Process:**
- **Masked Language Modeling (MLM):** The MLM objective was adapted for genomic sequences, with a masking strategy designed for overlapping k-mers (a simplified illustration follows after this list).
- **Training Phases:** The model was first trained with complete sequence restoration and selective masking, followed by a subsequent phase on variable-length datasets for increased complexity.

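The snippet below is a simplified illustration (not the actual ProkBERT training code) of why masking needs special care with overlapping k-mers: a single masked genomic position is visible to every token whose window covers it, so all of those tokens have to be hidden together.

```python
# Simplified illustration (not the ProkBERT training code): with overlapping
# k-mers (k=6, shift=1), masking one genomic position means hiding every
# token whose window covers that position.
def tokens_covering_position(pos: int, num_tokens: int,
                             k: int = 6, shift: int = 1) -> list[int]:
    """Return indices of k-mer tokens whose window includes base `pos`."""
    covering = []
    for t in range(num_tokens):
        start = t * shift
        if start <= pos < start + k:
            covering.append(t)
    return covering

# For a sequence tokenized into 10 overlapping 6-mers, masking base 7 affects
# tokens 2..7; leaving any of them visible would leak the masked base.
print(tokens_covering_position(pos=7, num_tokens=10))  # [2, 3, 4, 5, 6, 7]
```
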
### Evaluation Results

| Metric                     | Result       | Notes |
|----------------------------|--------------|-------|
| Metric 1 (e.g., Accuracy)  | To be filled |       |
| Metric 2 (e.g., Precision) | To be filled |       |
| Metric 3 (e.g., Recall)    | To be filled |       |

*Additional details and metrics can be included as they become available.*

### Ethical Considerations and Limitations

As with all models in the bioinformatics domain, ProkBERT-mini-k6s1 should be used responsibly. Testing and evaluation have been conducted within specific genomic contexts, and the model's outputs in other scenarios are not guaranteed. Users should exercise caution and perform additional testing as necessary for their specific use cases.

### Reporting Issues

Please report any issues with the model or its outputs to the Neural Bioinformatics Research Group through the following means:

- **Model issues:** [GitHub repository](https://github.com/nbrg-ppcu/prokbert)
- **Feedback and inquiries:** [[email protected]](mailto:[email protected])