SA-BERT-V1: Saudi-Dialect Embeddings
Model Details
- Fine-Tuned Model ID: Omartificial-Intelligence-Space/SA-BERT-V1
- License: Apache 2.0
- Designed For: Saudi Dialect
- Model Type: Sentence-Embedding (BERT encoder with mean-pooling)
- Architecture: 12-layer Transformer, 768-dim hidden states
- Embedding Size: 768
- Pretrained On: UBC-NLP/MARBERTv2
- Fine-Tuned On: Over 500K Saudi-dialect sentences covering diverse topics and regional variations (Hijazi, Najdi, and more)
- Supported Language: Arabic (Saudi dialect)
- Intended Tasks: Semantic similarity, clustering, retrieval, downstream classification
SA-BERT-V1 delivers strong Saudi-dialect understanding, achieving a +0.0022 in-vs-cross similarity gap and a mean cosine similarity of 0.98 across 44 specialized categories, setting a new standard for Arabic dialect sentence embeddings.
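Because the model is a plain BERT encoder with mean pooling, it can also be wrapped with the sentence-transformers library for convenient batch encoding. The snippet below is an unofficial convenience sketch, not a shipped configuration: it assumes sentence-transformers is installed and that you are authenticated (e.g. via huggingface-cli login) if the checkpoint is gated, and it simply rebuilds the mean-pooling setup described above.
from sentence_transformers import SentenceTransformer, models

# Rebuild the mean-pooling setup described above on top of the HF checkpoint.
word = models.Transformer("Omartificial-Intelligence-Space/SA-BERT-V1", max_seq_length=256)
pooling = models.Pooling(word.get_word_embedding_dimension(), pooling_mode="mean")
st_model = SentenceTransformer(modules=[word, pooling])

embeddings = st_model.encode(["كيف حالك؟", "شتبي من البقالة؟"])
print(embeddings.shape)  # expected: (2, 768)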



▪️ SA-BERT-V1 shows a positive in-vs-cross gap and high absolute similarity, demonstrating the effectiveness of targeted Saudi-dialect fine-tuning.
▪️ In vs. cross: both averages sit around 0.98, with a slight positive gap (+0.0023), meaning same-topic embeddings are closer together.
▪️ Performance: exceptional clustering for Saudi dialect; well suited to retrieval or grouping tasks.
▪️ The evaluations, both the similarity metrics and the in-vs-cross gap plots, were run on a held-out test set of 1,280 Saudi-dialect sentences covering 44 diverse categories (e.g. Greetings, Weather, Law & Justice).
▪️ The dataset was created and released by Omartificial-Intelligence-Space to evaluate embedding models; intra-category and cross-category pairs are sampled from the test set to compute the following (a minimal sketch of the pairing procedure is shown after this list):
◽️ Average in-category / cross-category cosine similarities
◽️ Top-5 most/least similar pairs
◽️ Per-category average similarities
▪️ Access Test Samples: saudi-dialect-test-samples
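For illustration, the pairing procedure can be sketched as follows. This is a minimal, unofficial sketch, not the released evaluation script: it assumes each test sentence has already been embedded (for example with the embed_sentence helper from the Implementation Example below) and tagged with its category, and it pairs sentences exhaustively rather than sampling.
import itertools
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    # Cosine similarity between two 1-D embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def in_vs_cross_gap(samples):
    # `samples` is a list of (embedding, category) tuples.
    # Returns (mean in-category similarity, mean cross-category similarity, gap).
    in_sims, cross_sims = [], []
    for (e1, c1), (e2, c2) in itertools.combinations(samples, 2):
        sim = cosine(e1, e2)
        (in_sims if c1 == c2 else cross_sims).append(sim)
    in_avg, cross_avg = float(np.mean(in_sims)), float(np.mean(cross_sims))
    return in_avg, cross_avg, in_avg - cross_avg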
Implementation Example
import torch
from transformers import AutoTokenizer, AutoModel
# Configuration
MODEL_ID = "Omartificial-Intelligence-Space/SA-BERT-V1"
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, token="PASS_READ_TOKEN_HERE")
model = AutoModel.from_pretrained(MODEL_ID, token="PASS_READ_TOKEN_HERE").to(DEVICE).eval()

def embed_sentence(text: str) -> torch.Tensor:
    """
    Tokenizes `text`, feeds it through SA-BERT-V1, and returns
    a 768-dimensional mean-pooled sentence embedding.
    """
    # Encode the text
    enc = tokenizer(
        text,
        truncation=True,
        padding="max_length",
        max_length=256,
        return_tensors="pt"
    ).to(DEVICE)

    # Forward pass
    with torch.no_grad():
        outputs = model(**enc).last_hidden_state  # shape: (1, seq_len, 768)

    # Mean-pooling over valid tokens
    mask = enc["attention_mask"].unsqueeze(-1)    # shape: (1, seq_len, 1)
    summed = (outputs * mask).sum(dim=1)          # shape: (1, 768)
    counts = mask.sum(dim=1).clamp(min=1e-9)      # shape: (1, 1)
    embedding = summed / counts                   # shape: (1, 768)

    return embedding.squeeze(0)                   # shape: (768,)

# Example usage
if __name__ == "__main__":
    sentences = [
        "شتبي من البقالة؟",
        "كيف حالك؟",
        "وش رايك في الموضوع هذا؟"
    ]
    for s in sentences:
        vec = embed_sentence(s)
        print(f"Sentence: {s}\nEmbedding shape: {vec.shape}\n")
Citation
If you use SA-BERT-V1 in your research or applications, please cite:
@misc{nacar2025SABERTV1,
  title        = {SA-BERT-V1: Fine-Tuned Saudi-Dialect Embeddings},
  author       = {Nacar, Omer},
  year         = {2025},
  publisher    = {Omartificial-Intelligence-Space},
  howpublished = {\url{https://huggingface.co/Omartificial-Intelligence-Space/SA-BERT-V1}},
}
@inproceedings{abdul-mageed-etal-2021-arbert,
  title     = "{ARBERT} {\&} {MARBERT}: Deep Bidirectional Transformers for {A}rabic",
  author    = "Abdul-Mageed, Muhammad and Elmadany, AbdelRahim and Nagoudi, El Moatez Billah",
  booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
  year      = "2021",
  publisher = "Association for Computational Linguistics",
  pages     = "7088--7105",
}