SA-BERT-V1: Saudi-Dialect Embeddings

MarBERTv2-SA Logo

Model Details

  • Fine-Tuned Model ID: Omartificial-Intelligence-Space/SA-BERT-V1
  • License: Apache 2.0
  • Designed For: Saudi Dialect
  • Model Type: Sentence-Embedding (BERT encoder with mean-pooling)
  • Architecture: 12-layer Transformer, 768-dim hidden states, ~163M parameters (F32, Safetensors)
  • Embedding Size: 768
  • Pretrained On: UBC-NLP/MARBERTv2
  • Fine-Tuned On: Over 500K Saudi-dialect sentences covering diverse topics and regional variations (Hijazi, Najdi, and more)
  • Supported Language: Arabic (Saudi dialect)
  • Intended Tasks: Semantic similarity, clustering, retrieval, downstream classification
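
As a quick illustration of the clustering use case, the 768-dim embeddings can be passed to any off-the-shelf clusterer. The sketch below is an assumption-laden example, not part of the model card's official code: it assumes scikit-learn is installed and uses the `embed_sentence` helper defined in the Implementation Example further down; the sentences and cluster count are illustrative only.

import numpy as np
from sklearn.cluster import KMeans

# Illustrative Saudi-dialect sentences to cluster
sentences = [
    "كيف حالك؟",
    "وش رايك في الموضوع هذا؟",
    "شتبي من البقالة؟",
]

# Embed each sentence with the embed_sentence helper (defined below)
embeddings = np.stack([embed_sentence(s).cpu().numpy() for s in sentences])

# Group the sentences into 2 clusters by embedding similarity
labels = KMeans(n_clusters=2, n_init=10).fit_predict(embeddings)
print(labels)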

SA-BERT-V1 delivers strong Saudi-dialect understanding, achieving a +0.0022 in-vs-cross similarity gap and a mean cosine similarity of 0.98 across 44 specialized categories, making it well suited for Saudi-dialect sentence-embedding tasks.

Similarity Comparison and Gap Analysis

[Figures: in-vs-cross similarity gap analysis and model comparison plots]

▪️SA-BERT-V1 shows a positive in–cross gap and high absolute similarity, proving the effectiveness of targeted Saudi-dialect fine-tuning.

▪️In vs Cross: Both around 0.98, with a slight positive gap (+0.0023), meaning same-topic embeddings are closer.

▪️Performance: Exceptional clustering for Saudi dialect; ideal for retrieval or grouping tasks.

▪️The evaluations, both the similarity metrics and the in-vs-cross gap plots, were run on a held-out test set of 1,280 Saudi-dialect sentences covering 44 diverse categories (e.g., Greetings, Weather, Law & Justice).

▪️The dataset was created by the space and released to evaluate embedding models; intra-category and cross-category pairs are sampled from it to compute the following (a minimal sketch of this computation follows the list):

◽️Average in-category / cross-category cosine similarities
◽️Top-5 most/least similar pairs
◽️Per-category average similarities

▪️ Access Test Samples: saudi-dialect-test-samples
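
Below is a minimal sketch of how such an in-category vs. cross-category evaluation could be reproduced; the `samples` structure and the `embed` callable are illustrative assumptions rather than the exact evaluation script (`embed` could wrap the `embed_sentence` helper from the Implementation Example below).

import itertools
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    # Cosine similarity between two 1-D vectors
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def in_vs_cross_gap(samples, embed):
    """
    samples: list of (sentence, category) tuples from the test set.
    embed:   callable mapping a sentence to a 1-D numpy vector,
             e.g. lambda s: embed_sentence(s).cpu().numpy().
    Returns (mean in-category similarity, mean cross-category similarity, gap).
    """
    vectors = [(embed(sentence), category) for sentence, category in samples]
    in_sims, cross_sims = [], []
    for (u, cat_u), (v, cat_v) in itertools.combinations(vectors, 2):
        (in_sims if cat_u == cat_v else cross_sims).append(cosine(u, v))
    mean_in, mean_cross = float(np.mean(in_sims)), float(np.mean(cross_sims))
    return mean_in, mean_cross, mean_in - mean_cross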


Implementation Example

import torch
from transformers import AutoTokenizer, AutoModel

# Configuration
MODEL_ID = "Omartificial-Intelligence-Space/SA-BERT-V1"
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, token="PASS_READ_TOKEN_HERE")
model     = AutoModel.from_pretrained(MODEL_ID, token="PASS_READ_TOKEN_HERE").to(DEVICE).eval()

def embed_sentence(text: str) -> torch.Tensor:
    """
    Tokenizes `text`, feeds it through SA-BERT-V1, and returns
    a 768-dimensional mean-pooled sentence embedding.
    """
    # Encode the text
    enc = tokenizer(
        text,
        truncation=True,
        padding="max_length",
        max_length=256,
        return_tensors="pt"
    ).to(DEVICE)

    # Forward pass
    with torch.no_grad():
        outputs = model(**enc).last_hidden_state  # shape: (1, seq_len, 768)

    # Mean-pooling over valid tokens
    mask = enc["attention_mask"].unsqueeze(-1)           # shape: (1, seq_len, 1)
    summed = (outputs * mask).sum(dim=1)                 # shape: (1, 768)
    counts = mask.sum(dim=1).clamp(min=1e-9)              # shape: (1, 1)
    embedding = summed / counts                          # shape: (1, 768)

    return embedding.squeeze(0)  # shape: (768,)

# Example usage
if __name__ == "__main__":
    sentences = [
        "شتبي من البقالة؟",
        "كيف حالك؟",
        "وش رايك في الموضوع هذا؟"
    ]
    for s in sentences:
        vec = embed_sentence(s)
        print(f"Sentence: {s}\nEmbedding shape: {vec.shape}\n")
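
To demonstrate the semantic-similarity task, two of the example embeddings can be compared with cosine similarity; this snippet is a small usage sketch that assumes the `embed_sentence` helper above.

import torch.nn.functional as F

# Compare two of the example sentences via cosine similarity
vec_a = embed_sentence("كيف حالك؟")
vec_b = embed_sentence("وش رايك في الموضوع هذا؟")
score = F.cosine_similarity(vec_a, vec_b, dim=0).item()
print(f"Cosine similarity: {score:.4f}")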

Citation

If you use SA-BERT-V1 in your research or applications, please cite:

@misc{nacar2025SABERTV1,
  title={SA-BERT-V1: Fine-Tuned Saudi-Dialect Embeddings},
  author={Nacar, Omer},
  year={2025},
  publisher={Omartificial-Intelligence-Space},
  howpublished={\url{https://huggingface.co/Omartificial-Intelligence-Space/SA-BERT-V1}},
}

@inproceedings{abdul-mageed-etal-2021-arbert,
    title = "{ARBERT} {\&} {MARBERT}: Deep Bidirectional Transformers for {A}rabic",
    author = "Abdul-Mageed, Muhammad and Elmadany, AbdelRahim and Nagoudi, El Moatez Billah",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    year = "2021",
    publisher = "Association for Computational Linguistics",
    pages = "7088--7105",
}