# BERT Hash Nano Models
This is a set of three Nano BERT models with a modified embeddings layer. The embeddings layer keeps the standard BERT vocabulary (30,522 tokens) but projects each token into a much smaller dimensional space and then re-encodes it to the hidden size. This method is inspired by MUVERA: Multi-Vector Retrieval via Fixed Dimensional Encodings.
The number of projections acts like a hash. Setting the projections parameter to 5 is like generating a 160-bit hash (5 x float32) for each token. That hash is then projected to the hidden size.
This significantly reduces the number of parameters necessary for token embeddings.
For example:
Standard token embeddings:
- 30,522 (vocab size) x 768 (hidden size) = 23,440,896 parameters
- 23,440,896 x 4 (float32) = 93,763,584 bytes
Hash token embeddings:
- 30,522 (vocab size) x 5 (hash buckets) + 5 x 768 (projection matrix) = 156,450 parameters
- 156,450 x 4 (float32) = 625,800 bytes
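As a minimal sketch (not the model's actual implementation), the hash token embeddings can be viewed as a small per-token code followed by a shared linear projection up to the hidden size. The `HashTokenEmbeddings` class below is illustrative only: it omits the position and token type embeddings, layer normalization and dropout that a full BERT embeddings layer includes, but it reproduces the parameter count above.

```python
from torch import nn

class HashTokenEmbeddings(nn.Module):
    """Illustrative sketch: small per-token hash codes projected to the hidden size."""

    def __init__(self, vocab_size=30522, projections=5, hidden_size=768):
        super().__init__()

        # Small learned code ("hash") per token: vocab_size x projections parameters
        self.hash = nn.Embedding(vocab_size, projections)

        # Shared projection to the hidden size: projections x hidden_size parameters
        self.project = nn.Linear(projections, hidden_size, bias=False)

    def forward(self, input_ids):
        # (batch, seq) -> (batch, seq, projections) -> (batch, seq, hidden_size)
        return self.project(self.hash(input_ids))

embeddings = HashTokenEmbeddings()

# 30,522 x 5 + 5 x 768 = 156,450
print("Parameters:", sum(p.numel() for p in embeddings.parameters()))
```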
These models are pre-trained on the same training corpus as BERT (with a copy of Wikipedia from 2025) as recommended in the paper Well-Read Students Learn Better: On the Importance of Pre-training Compact Models.
Below is a subset of GLUE scores on the dev set using the run_glue.py script provided by Hugging Face Transformers with the following parameters.

```bash
python run_glue.py \
  --model_name_or_path <model path> \
  --task_name <task name> \
  --do_train \
  --do_eval \
  --max_seq_length 128 \
  --per_device_train_batch_size 32 \
  --learning_rate 1e-4 \
  --num_train_epochs 4 \
  --output_dir outputs \
  --trust_remote_code True
```
| Model | Parameters | MNLI (acc m/mm) | MRPC (f1/acc) | SST-2 (acc) |
|---|---|---|---|---|
| baseline (bert-tiny) | 4.4M | 0.7114 / 0.7161 | 0.8318 / 0.7353 | 0.8222 |
| bert-hash-femto | 0.243M | 0.5697 / 0.5750 | 0.8122 / 0.6838 | 0.7821 |
| bert-hash-pico | 0.448M | 0.6228 / 0.6363 | 0.8205 / 0.7083 | 0.7878 |
| bert-hash-nano | 0.969M | 0.6565 / 0.6670 | 0.8172 / 0.7083 | 0.8131 |
## Usage
These models can be loaded with Hugging Face Transformers as shown below. Since this is a custom architecture, `trust_remote_code` needs to be set.
```python
from transformers import AutoModel

model = AutoModel.from_pretrained("neuml/bert-hash-femto", trust_remote_code=True)
```
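A quick inference sketch follows. It assumes the tokenizer files are published in the same model repository; because the vocabulary is standard BERT, the bert-base-uncased tokenizer should also work.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("neuml/bert-hash-femto")
model = AutoModel.from_pretrained("neuml/bert-hash-femto", trust_remote_code=True)

# Encode a sentence and run a forward pass
inputs = tokenizer("BERT hash models pack a punch", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# (batch size, sequence length, hidden size)
print(outputs.last_hidden_state.shape)
```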
## Training
Training your own Nano model is simple. All you need is a Hugging Face dataset and the code below, which uses txtai.
```python
from datasets import concatenate_datasets, load_dataset
from transformers import AutoTokenizer

from txtai.pipeline import HFTrainer

# Model definition files from this repository
from configuration_bert_hash import *
from modeling_bert_hash import *

# Load the training dataset and the standard BERT tokenizer
dataset = load_dataset("path to target HF dataset")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Nano-sized configuration with hash token embeddings (16 projections)
config = BertHashConfig(
    hidden_size=128,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=512,
    projections=16
)
model = BertHashForMaskedLM(config)

print(config)
print("Total parameters:", sum(p.numel() for p in model.bert.parameters()))

train = HFTrainer()

# Train using MLM
train((model, tokenizer), dataset, task="language-modeling", output_dir="model",
    fp16=True, learning_rate=1e-3, per_device_train_batch_size=64, num_train_epochs=3,
    warmup_steps=2500, weight_decay=0.01, adam_epsilon=1e-6,
    tokenizers=True, dataloader_num_workers=20,
    save_strategy="steps", save_steps=5000, logging_steps=500,
)
```
## Future Work
This work demonstrates that much smaller models can still be productive.
The hope is that it opens the door for many to build small encoder models that pack a punch. Models like these can be trained in a matter of hours on consumer GPUs.
Imagine more specialized models like this for medical, legal, scientific and other domains.