# Artiwise ModernBERT - Base Turkish Uncased
We present Artiwise ModernBERT for Turkish 🎉, a BERT model with a modernized architecture and an increased context size (512 tokens for older BERT models vs. 8192 for ModernBERT).

This model is a Turkish adaptation of ModernBERT, fine-tuned from answerdotai/ModernBERT-base using only the Turkish portion of CulturaX.
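To illustrate the longer context window, here is a minimal sketch (the repeated dummy input is purely illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("artiwise-ai/modernbert-base-tr-uncased")

# A deliberately long input: classic BERT models are limited to 512 tokens,
# while ModernBERT accepts sequences of up to 8192 tokens.
long_text = " ".join(["örnek cümle"] * 3000)
encoded = tokenizer(long_text, return_tensors="pt", truncation=True, max_length=8192)
print(encoded["input_ids"].shape)  # sequence length capped at 8192, not 512
```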
## Stats
- Training Data: CulturaX 192GB (tr)
- Base Model: answerdotai/ModernBERT-base
## Benchmark
The benchmark results below demonstrate that Artiwise ModernBERT consistently outperforms existing Turkish BERT variants across multiple domains and masking levels, highlighting its superior generalization capabilities.
| Dataset & Mask Level | Artiwise ModernBERT | ytu-ce-cosmos/turkish-base-bert-uncased | dbmdz/bert-base-turkish-uncased |
|---|---|---|---|
| QA Dataset (5% mask) | 74.50 | 60.84 | 48.57 |
| QA Dataset (10% mask) | 72.18 | 58.75 | 46.29 |
| QA Dataset (15% mask) | 69.46 | 56.50 | 44.30 |
| Review Dataset (5% mask) | 62.67 | 48.57 | 35.38 |
| Review Dataset (10% mask) | 59.60 | 45.77 | 33.04 |
| Review Dataset (15% mask) | 56.51 | 43.05 | 31.05 |
| Biomedical Dataset (5% mask) | 58.11 | 50.78 | 40.82 |
| Biomedical Dataset (10% mask) | 55.55 | 48.37 | 38.51 |
| Biomedical Dataset (15% mask) | 52.71 | 45.82 | 36.44 |
For each dataset (QA, Reviews, Biomedical) and each masking level (5%, 10%, 15%), we randomly masked the specified percentage of tokens in every input example and then measured each model's accuracy at predicting those masked tokens. All models were evaluated in bfloat16 precision.

Our experiments used three datasets: the Turkish Biomedical Corpus, the Turkish Product Reviews dataset, and the general-domain QA corpus turkish_v2.
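A minimal sketch of this evaluation protocol is shown below. The function name, the single random masking pass per example, and the argmax scoring are our assumptions; the published numbers come from the authors' own pipeline. Models can be loaded in bfloat16 as in the benchmark via `AutoModelForMaskedLM.from_pretrained(..., torch_dtype=torch.bfloat16)`.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

def masked_prediction_accuracy(model, tokenizer, texts, mask_ratio=0.15):
    """Mask `mask_ratio` of the tokens in each text and report the % of
    masked positions whose argmax prediction recovers the original token."""
    model.eval()
    correct, total = 0, 0
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True)
        input_ids = enc["input_ids"].clone()
        # Only mask ordinary tokens, never special tokens such as [CLS]/[SEP]
        special = torch.tensor(tokenizer.get_special_tokens_mask(
            input_ids[0].tolist(), already_has_special_tokens=True), dtype=torch.bool)
        candidates = (~special).nonzero(as_tuple=True)[0]
        n_mask = max(1, int(len(candidates) * mask_ratio))
        picked = candidates[torch.randperm(len(candidates))[:n_mask]]
        labels = input_ids[0, picked].clone()
        input_ids[0, picked] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(input_ids=input_ids,
                           attention_mask=enc["attention_mask"]).logits
        preds = logits[0, picked].argmax(dim=-1)
        correct += (preds == labels).sum().item()
        total += n_mask
    return 100.0 * correct / total
```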
## Model Usage
Note: torch >= 2.6.0 and transformers >= 4.50.0 are required for the model to function properly.

Also, do not use the `do_lower_case=True` flag with the tokenizer. Instead, convert your text to lowercase as follows:

```python
text = text.replace("I", "ı").lower()
```

This is due to a known issue with the tokenizer.
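For context, a quick illustration of why the manual replacement is needed (the example string is ours): Python's built-in `str.lower()` applies English casing rules, mapping `"I"` to the dotted `"i"`, whereas Turkish lowercases `"I"` to the dotless `"ı"`.

```python
print("ISTANBUL".lower())                    # "istanbul" - dotted i, wrong for Turkish
print("ISTANBUL".replace("I", "ı").lower())  # "ıstanbul" - dotless ı, as the model expects
```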
Load the model with 🤗 Transformers:
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("artiwise-ai/modernbert-base-tr-uncased")
model = AutoModelForMaskedLM.from_pretrained("artiwise-ai/modernbert-base-tr-uncased")

# Example sentence with masked token
text = "Türkiye'nin başkenti [MASK]'dır."
# Lowercase manually (see the note above), then restore the mask token,
# which the lowercasing would otherwise turn into "[mask]"
text = text.replace("I", "ı").lower().replace("[mask]", tokenizer.mask_token)

# Tokenize and prepare input
inputs = tokenizer(text, return_tensors="pt")

# Get the position of the masked token
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]

# Forward pass
with torch.no_grad():
    outputs = model(**inputs)

# Get the predictions for the masked token
logits = outputs.logits
mask_token_logits = logits[0, mask_token_index, :]
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

# Print the predictions
print(f"Original text: {text}")
print("Top 5 predictions for [MASK]:")
for token in top_5_tokens:
    print(f"- {tokenizer.decode([token])}")
```