# BERT Hash Nano Models
This is a set of three Nano BERT models with a modified embeddings layer. The embeddings layer keeps the standard BERT vocabulary (30,522 tokens) but projects each token into a much smaller dimensional space and then re-encodes it to the hidden size. This method is inspired by MUVERA: Multi-Vector Retrieval via Fixed Dimensional Encodings.
The number of projections acts like a hash. Setting the projections parameter to 5 is like generating a 160-bit hash (5 x float32) for each token. That hash is then projected to the hidden size.
This significantly reduces the number of parameters necessary for token embeddings.
For example:
Standard token embeddings:
- 30,522 (vocab size) x 768 (hidden size) = 23,440,896 parameters
- 23,440,896 x 4 (float32) = 93,763,584 bytes
Hash token embeddings:
- 30,522 (vocab size) x 5 (hash buckets) + 5 x 768 (projection matrix) = 156,450 parameters
- 156,450 x 4 (float32) = 625,800 bytes
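As a minimal sketch (not the model's actual implementation), the hash token embeddings can be viewed as a small per-token code followed by a shared linear projection up to the hidden size. The `HashTokenEmbeddings` class below is illustrative only: it omits the position and token type embeddings, layer normalization and dropout that a full BERT embeddings layer includes, but it reproduces the parameter count above.

```python
from torch import nn

class HashTokenEmbeddings(nn.Module):
    """Illustrative sketch: small per-token hash codes projected to the hidden size."""

    def __init__(self, vocab_size=30522, projections=5, hidden_size=768):
        super().__init__()

        # Small learned code ("hash") per token: vocab_size x projections parameters
        self.hash = nn.Embedding(vocab_size, projections)

        # Shared projection to the hidden size: projections x hidden_size parameters
        self.project = nn.Linear(projections, hidden_size, bias=False)

    def forward(self, input_ids):
        # (batch, seq) -> (batch, seq, projections) -> (batch, seq, hidden_size)
        return self.project(self.hash(input_ids))

embeddings = HashTokenEmbeddings()

# 30,522 x 5 + 5 x 768 = 156,450
print("Parameters:", sum(p.numel() for p in embeddings.parameters()))
```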
These models are pre-trained on the same training corpus as BERT (with a copy of Wikipedia from 2025) as recommended in the paper Well-Read Students Learn Better: On the Importance of Pre-training Compact Models.
Below is a subset of GLUE scores on the dev set using the run_glue.py script provided by Hugging Face Transformers with the following parameters.

```bash
python run_glue.py \
  --model_name_or_path <model path> \
  --task_name <task name> \
  --do_train \
  --do_eval \
  --max_seq_length 128 \
  --per_device_train_batch_size 32 \
  --learning_rate 1e-4 \
  --num_train_epochs 4 \
  --output_dir outputs \
  --trust_remote_code True
```
| Model | Parameters | MNLI (acc m/mm) | MRPC (f1/acc) | SST-2 (acc) |
|---|---|---|---|---|
| baseline (bert-tiny) | 4.4M | 0.7114 / 0.7161 | 0.8318 / 0.7353 | 0.8222 |
| bert-hash-femto | 0.243M | 0.5697 / 0.5750 | 0.8122 / 0.6838 | 0.7821 |
| bert-hash-pico | 0.448M | 0.6228 / 0.6363 | 0.8205 / 0.7083 | 0.7878 |
| bert-hash-nano | 0.969M | 0.6565 / 0.6670 | 0.8172 / 0.7083 | 0.8131 |
## Usage
These models can be loaded with Hugging Face Transformers as shown below. Since this is a custom architecture, `trust_remote_code` needs to be set.
```python
from transformers import AutoModel

model = AutoModel.from_pretrained("neuml/bert-hash-femto", trust_remote_code=True)
```
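A quick inference sketch follows. It assumes the tokenizer files are published in the same model repository; because the vocabulary is standard BERT, the bert-base-uncased tokenizer should also work.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("neuml/bert-hash-femto")
model = AutoModel.from_pretrained("neuml/bert-hash-femto", trust_remote_code=True)

# Encode a sentence and run a forward pass
inputs = tokenizer("BERT hash models pack a punch", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# (batch size, sequence length, hidden size)
print(outputs.last_hidden_state.shape)
```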
## Training
Training your own Nano model is simple. All you need is a Hugging Face dataset and the code below, which uses txtai.
```python
from datasets import concatenate_datasets, load_dataset
from transformers import AutoTokenizer

from txtai.pipeline import HFTrainer

# Model definition files from this repository
from configuration_bert_hash import *
from modeling_bert_hash import *

# Load the training dataset and the standard BERT tokenizer
dataset = load_dataset("path to target HF dataset")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Nano-sized configuration with hash token embeddings (16 projections)
config = BertHashConfig(
    hidden_size=128,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=512,
    projections=16
)
model = BertHashForMaskedLM(config)

print(config)
print("Total parameters:", sum(p.numel() for p in model.bert.parameters()))

train = HFTrainer()

# Train using MLM
train((model, tokenizer), dataset, task="language-modeling", output_dir="model",
    fp16=True, learning_rate=1e-3, per_device_train_batch_size=64, num_train_epochs=3,
    warmup_steps=2500, weight_decay=0.01, adam_epsilon=1e-6,
    tokenizers=True, dataloader_num_workers=20,
    save_strategy="steps", save_steps=5000, logging_steps=500,
)
```
## Future Work
This work demonstrates that much smaller models can still be productive.
The hope is that it opens the door for many to build small encoder models that pack a punch. Models like these can be trained in a matter of hours on consumer GPUs.
Imagine more specialized models like this for medical, legal, scientific and other domains.