license: cc-by-nc-4.0
ProkBERT-mini Model
ProkBERT-mini-k6s1 is part of the ProkBERT family of genomic language models, specifically designed for microbiome applications. The model is optimized for DNA sequence analysis and provides robust, high-resolution solutions.
Simple Usage Example
The following example demonstrates how to use the ProkBERT-mini model for processing a DNA sequence:
```python
from transformers import MegatronBertForMaskedLM
from prokbert.prokbert_tokenizer import ProkBERTTokenizer
# Tokenization parameters
tokenization_parameters = {
    'kmer': 6,
    'shift': 1
}
# Initialize the tokenizer and model
tokenizer = ProkBERTTokenizer(tokenization_params=tokenization_parameters, operation_space='sequence')
model = MegatronBertForMaskedLM.from_pretrained("neuralbioinfo/prokbert-mini-k6s1")
# Example DNA sequence
sequence = 'ATGTCCGCGGGACCT'
# Tokenize the sequence
inputs = tokenizer(sequence, return_tensors="pt")
# Ensure that inputs have a batch dimension
inputs = {key: value.unsqueeze(0) for key, value in inputs.items()}
# Generate outputs from the model
outputs = model(**inputs)
```
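The `outputs` object follows the standard Hugging Face masked-language-model format, so the per-position predictions can be inspected directly from its logits. The snippet below is a minimal sketch; it assumes that ProkBERTTokenizer exposes the usual `convert_ids_to_tokens` method of Hugging Face tokenizers.

```python
# Pick the most likely vocabulary id at every position of the (single) input sequence.
predicted_ids = outputs.logits.argmax(dim=-1)[0]

# Map the ids back to k-mer tokens; convert_ids_to_tokens is assumed to behave as it
# does for standard Hugging Face tokenizers.
print(tokenizer.convert_ids_to_tokens(predicted_ids.tolist()))
```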
Model Details
Developed by: Neural Bioinformatics Research Group
Architecture: ProkBERT-mini-k6s1 is based on the MegatronBert architecture, a variant of the BERT model optimized for large-scale training. The model employs a learnable relative key-value positional embedding, mapping input vectors into a 384-dimensional space.
Tokenizer: The model uses a 6-mer tokenizer with a shift of 1 (k6s1), specifically designed to handle DNA sequences efficiently.
Parameters (a verification sketch follows the table):
Parameter | Value |
---|---|
Model Size | 20.6 million parameters |
Max. Context Size | 1024 bp |
Training Data | 206.65 billion nucleotides |
Layers | 6 |
Attention Heads | 6 |
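These figures can be cross-checked against the published checkpoint. The sketch below is a quick verification using the standard Hugging Face `AutoConfig` API; the checkpoint name is the one assumed in the usage example above.

```python
from transformers import AutoConfig

# Load the checkpoint configuration and print the fields that correspond to the table above.
config = AutoConfig.from_pretrained("neuralbioinfo/prokbert-mini-k6s1")
print(config.hidden_size)           # embedding dimension, expected 384
print(config.num_hidden_layers)     # expected 6
print(config.num_attention_heads)   # expected 6
```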
Intended Use
Intended Use Cases: ProkBERT-mini-k6s1 is intended for bioinformatics researchers and practitioners focusing on genomic sequence analysis, including:
- Sequence classification tasks (a minimal fine-tuning sketch follows this list)
- Exploration of genomic patterns and features
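As a concrete starting point for the classification use case listed above, the sketch below attaches a sequence-classification head to the pretrained backbone via the generic `MegatronBertForSequenceClassification` class from `transformers`. This is an illustrative setup rather than the ProkBERT project's own fine-tuning recipe: the checkpoint name, `num_labels=2`, and the reuse of the tokenizer from the usage example are assumptions, and the classification head is randomly initialised, so it still needs fine-tuning on labelled data.

```python
import torch
from transformers import MegatronBertForSequenceClassification

# Pretrained backbone plus a freshly initialised binary classification head (assumed setup).
cls_model = MegatronBertForSequenceClassification.from_pretrained(
    "neuralbioinfo/prokbert-mini-k6s1", num_labels=2
)

# Reuse the ProkBERTTokenizer instance created in the usage example above.
inputs = tokenizer('ATGTCCGCGGGACCT', return_tensors="pt")
inputs = {key: value.unsqueeze(0) for key, value in inputs.items()}

with torch.no_grad():
    logits = cls_model(**inputs).logits  # shape: (1, 2), untrained head
print(logits.softmax(dim=-1))
```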
Segmentation and Tokenization in ProkBERT Models
Preprocessing Sequence Data
Transformer models, including ProkBERT, have a context size limitation. ProkBERT's design accommodates context sizes significantly larger than an average gene but smaller than the average bacterial genome. The initial stage of our pipeline involves two primary steps: segmentation and tokenization. For more details about tokenization, please see the following notebook: Tokenization Notebook in Google Colab.
For more details about segmentation, please see the following notebook: Segmentation Notebook in Google Colab.
Segmentation
Segmentation is crucial for Genomic Language Models (GLMs) because they process limited-size chunks of sequence data, typically up to about 4 kb. Segmentation divides the sequence into smaller parts and can be either contiguous, splitting the sequence into disjoint segments, or random, sampling segments of length L at random start positions.
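To make the distinction concrete, here is a small plain-Python illustration of the two strategies; it is only a sketch of the idea, not the library's implementation (the actual segmentation is performed by `segment_sequences`, shown further below).

```python
import random

def contiguous_segments(seq, L):
    # Disjoint, back-to-back windows of length L (the final window may be shorter).
    return [seq[i:i + L] for i in range(0, len(seq), L)]

def random_segments(seq, L, n):
    # n windows of length L sampled at random start positions (they may overlap).
    return [seq[s:s + L] for s in (random.randint(0, len(seq) - L) for _ in range(n))]

seq = "ATGTCCGCGGGACCTAGCATCGATCGGTACCA"
print(contiguous_segments(seq, 10))
print(random_segments(seq, 10, 3))
```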
The first practical step in segmentation involves loading the sequence from a FASTA file, often including the reverse complement of the sequence.
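The reverse complement itself is a standard operation; the short sketch below illustrates conceptually what `adding_reverse_complement=True` contributes for each contig (it is not the library's exact code).

```python
# Reverse complement of a DNA sequence: complement each base, then reverse the result.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

print(reverse_complement("ATGTCCGCGGGACCT"))  # AGGTCCCGCGGACAT
```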
Tokenization Process
After segmentation, sequences are encoded into a numerical (vector) representation. The Local Context-Aware (LCA) tokenization method allows the model to use a broader context and reduce computational demands while maintaining the information-rich local context.
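The plain-Python sketch below illustrates what LCA tokenization produces for a short segment with k = 6; the actual encoding (vocabulary lookup, special tokens, padding) is handled by ProkBERTTokenizer.

```python
def kmer_tokens(segment, k=6, shift=1):
    # Overlapping k-mers taken every `shift` bases; with shift=1, consecutive tokens
    # overlap in k-1 characters, preserving the information-rich local context.
    return [segment[i:i + k] for i in range(0, len(segment) - k + 1, shift)]

print(kmer_tokens("ATGTCCGCGGGACCT", k=6, shift=1))
# ['ATGTCC', 'TGTCCG', 'GTCCGC', 'TCCGCG', 'CCGCGG', 'CGCGGG', 'GCGGGA', 'CGGGAC', 'GGGACC', 'GGACCT']
print(kmer_tokens("ATGTCCGCGGGACCT", k=6, shift=2))
# ['ATGTCC', 'GTCCGC', 'CCGCGG', 'GCGGGA', 'GGGACC']
```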
Basic Steps for Preprocessing:
- Load Fasta Files: Begin by loading the raw sequence data from FASTA files.
- Segment the Raw Sequences: Apply segmentation parameters to split the sequences into manageable segments.
- Tokenize the Segmented Database: Use the defined tokenization parameters to convert the segments into tokenized forms.
- Create a Padded/Truncated Array: Generate a uniform array structure, padding or truncating as necessary.
- Save the Array to HDF: Store the processed data in an HDF (Hierarchical Data Format) file for efficient retrieval and use in training models.
```python
import pkg_resources
from os.path import join
from prokbert.sequtils import *
# Directory for pretraining FASTA files
pretraining_fasta_files_dir = pkg_resources.resource_filename('prokbert','data/pretraining')
# Define segmentation and tokenization parameters
segmentation_params = {
    'max_length': 256,  # Maximum segment length (L)
    'min_length': 6,
    'type': 'random'
}
tokenization_parameters = {
    'kmer': 6,
    'shift': 1,
    'max_segment_length': 2003,
    'token_limit': 2000
}
# Setup configuration
defconfig = SeqConfig()
segmentation_params = defconfig.get_and_set_segmentation_parameters(segmentation_params)
tokenization_params = defconfig.get_and_set_tokenization_parameters(tokenization_parameters)
# Load and segment sequences
input_fasta_files = [join(pretraining_fasta_files_dir, file) for file in get_non_empty_files(pretraining_fasta_files_dir)]
sequences = load_contigs(input_fasta_files, IsAddHeader=True, adding_reverse_complement=True, AsDataFrame=True, to_uppercase=True, is_add_sequence_id=True)
segment_db = segment_sequences(sequences, segmentation_params, AsDataFrame=True)
# Tokenization
tokenized = batch_tokenize_segments_with_ids(segment_db, tokenization_params)
expected_max_token = max(len(arr) for arrays in tokenized.values() for arr in arrays)
X, torchdb = get_rectangular_array_from_tokenized_dataset(tokenized, tokenization_params['shift'], expected_max_token)
# Save to HDF file
hdf_file = '/tmp/pretraining.h5'
save_to_hdf(X, hdf_file, database=torchdb, compression=True)
```
Installation of ProkBERT (if needed)
To set up ProkBERT in your environment (e.g., a Jupyter or Colab notebook), the following snippet installs the package only if it is not already available:
```python
try:
    import prokbert
    print("ProkBERT is already installed.")
except ImportError:
    !pip install prokbert
    print("Installed ProkBERT.")
```
Training Data and Process
Overview: The model was pretrained on a comprehensive dataset of genomic sequences to ensure broad coverage and robust learning.
Training Process:
- Masked Language Modeling (MLM): The MLM objective was adapted to genomic sequences by masking overlapping k-mers (see the sketch after this list).
- Training Phases: The model underwent initial training with complete sequence restoration and selective masking, followed by a phase with variable-length datasets for increased complexity.
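Because consecutive k-mers overlap, masking a single token would still leak the hidden bases through its neighbours, so every token whose window covers the selected nucleotide has to be masked together. The sketch below illustrates the idea; it is not the actual training code.

```python
def tokens_covering_position(pos, n_tokens, k=6, shift=1):
    # Token i covers nucleotide positions [i*shift, i*shift + k - 1]; return every
    # token index whose window includes `pos`.
    return [i for i in range(n_tokens) if i * shift <= pos <= i * shift + k - 1]

# A 15 bp segment tokenised into overlapping 6-mers with shift 1 yields 10 tokens;
# hiding the nucleotide at position 7 therefore requires masking tokens 2..7.
print(tokens_covering_position(pos=7, n_tokens=10))  # [2, 3, 4, 5, 6, 7]
```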
Evaluation Results
Metric | Result | Notes |
---|---|---|
Metric 1 (e.g., Accuracy) | To be filled | |
Metric 2 (e.g., Precision) | To be filled | |
Metric 3 (e.g., Recall) | To be filled | |
Additional details and metrics can be included as they become available.
Ethical Considerations and Limitations
As with all models in the bioinformatics domain, ProkBERT-mini-k6s1 should be used responsibly. Testing and evaluation have been conducted within specific genomic contexts, and the model's outputs in other scenarios are not guaranteed. Users should exercise caution and perform additional testing as necessary for their specific use cases.
Reporting Issues
Please report any issues with the model or its outputs to the Neural Bioinformatics Research Group through the following means:
- Model issues: GitHub repository link
- Feedback and inquiries: [email protected]
Reference
If you use ProkBERT-mini in your research, please cite the following paper:

```bibtex
@ARTICLE{10.3389/fmicb.2023.1331233,
  AUTHOR={Ligeti, Balázs and Szepesi-Nagy, István and Bodnár, Babett and Ligeti-Nagy, Noémi and Juhász, János},
  TITLE={ProkBERT family: genomic language models for microbiome applications},
  JOURNAL={Frontiers in Microbiology},
  VOLUME={14},
  YEAR={2024},
  URL={https://www.frontiersin.org/articles/10.3389/fmicb.2023.1331233},
  DOI={10.3389/fmicb.2023.1331233},
  ISSN={1664-302X},
  ABSTRACT={...}
}
```