PLWordNet Semantic Embedder (bi-encoder)

A Polish semantic embedder trained on pairs constructed from plWordNet (Słowosieć) semantic relations and external descriptions of meanings. Every relation between lexical units and synsets is transformed into training/evaluation examples.

The dataset combines several signals about meanings: emotion annotations, definitions, and external descriptions (Wikipedia articles split into sentences). The embedder mirrors the semantic relations: it pulls together embeddings of meanings linked by “positive” relations (e.g., synonymy, hypernymy/hyponymy as defined in the dataset) and pushes apart embeddings of meanings linked by “negative” relations (e.g., antonymy or mutually exclusive relations). Source code and training scripts:

Model summary

  • Architecture: bi-encoder built with sentence-transformers (transformer encoder + pooling).
  • Use cases: semantic similarity and semantic search for Polish words, senses, definitions, and sentences.
  • Objective: CosineSimilarityLoss on positive/negative pairs (illustrated in the sketch after this list).
  • Behavior: preserves the topology of semantic relations derived from plWordNet.
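
The objective can be pictured as regressing the cosine similarity of the two encoded texts onto the pair’s gold label. A minimal, illustrative sketch of what CosineSimilarityLoss optimizes (not the actual training code):

# Python
import torch
import torch.nn.functional as F

def cosine_similarity_loss(u: torch.Tensor, v: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
    # CosineSimilarityLoss minimizes the squared error between cos(u, v) and the
    # gold label (e.g., 1.0 for positive pairs, 0.0 for negative ones).
    cos = F.cosine_similarity(u, v, dim=-1)
    return F.mse_loss(cos, label)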

Training data

Constructed from plWordNet relations between lexical units and synsets; each relation yields example pairs. Augmented with:

  • definitions,
  • usage examples (including emotion annotations where available),
  • external descriptions from Wikipedia (split into sentences).

Positive pairs correspond to relations expected to increase similarity; negative pairs correspond to relations expected to decrease similarity. Additional hard/soft negatives may include unrelated meanings.
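
A minimal sketch of how such relation-derived pairs could be packed for the trainer described below (the example pairs and the 1.0/0.0 label convention are illustrative assumptions, not the exact dataset schema):

# Python
from datasets import Dataset

# Hypothetical pairs: positive relations labeled 1.0, negative relations 0.0.
train_dataset = Dataset.from_dict({
    "sentence1": ["student", "ciepły", "pies"],
    "sentence2": ["żak", "zimny", "zwierzę domowe"],
    "score": [1.0, 0.0, 1.0],
})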

Training details

  • Trainer: SentenceTransformerTrainer
  • Loss: CosineSimilarityLoss
  • Evaluator: EmbeddingSimilarityEvaluator (cosine)
  • Typical hyperparameters (combined in the sketch after this list):
    • epochs: 5
    • per-device batch size: 10 (gradient accumulation: 4)
    • learning rate: 5e-6 (AdamW fused)
    • weight decay: 0.01
    • warmup: 20k steps
    • fp16: true
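
A hedged sketch of how these settings could be assembled with the sentence-transformers trainer (the output directory is a placeholder, train_dataset is the pair dataset sketched above, and the base checkpoint used for fine-tuning is not specified here):

# Python
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CosineSimilarityLoss

model = SentenceTransformer("radlab/semantic-euro-bert-encoder-v1", trust_remote_code=True)
loss = CosineSimilarityLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="output/plwordnet-embedder",  # placeholder
    num_train_epochs=5,
    per_device_train_batch_size=10,
    gradient_accumulation_steps=4,
    learning_rate=5e-6,
    weight_decay=0.01,
    warmup_steps=20_000,
    fp16=True,
    optim="adamw_torch_fused",
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # (sentence1, sentence2, score) pairs as sketched above
    loss=loss,
)
trainer.train()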

Evaluation

  • Task: semantic similarity on dev/test splits built from the relation-derived pairs.
  • Metric: cosine-based correlation (Spearman/Pearson) where applicable, or discrimination between positive and negative pairs (see the sketch below).
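
A minimal sketch of the cosine-based evaluation, assuming a held-out set of relation-derived pairs with gold scores (the pairs and split name below are illustrative):

# Python
from sentence_transformers import SentenceTransformer, SimilarityFunction
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("radlab/semantic-euro-bert-encoder-v1", trust_remote_code=True)

evaluator = EmbeddingSimilarityEvaluator(
    sentences1=["student", "ciepły"],
    sentences2=["żak", "zimny"],
    scores=[1.0, 0.0],
    main_similarity=SimilarityFunction.COSINE,
    name="plwordnet-dev",  # illustrative split name
)
print(evaluator(model))  # Spearman/Pearson correlations for the embedding similarities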


How to use

Sentence-Transformers:

# Python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("radlab/semantic-euro-bert-encoder-v1", trust_remote_code=True)

texts = ["zamek", "drzwi", "wiadro", "horyzont", "ocean"]
emb = model.encode(texts, convert_to_tensor=True, normalize_embeddings=True)
scores = util.cos_sim(emb, emb)
print(scores)  # higher = more semantically similar
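
For semantic search over a larger collection, the same normalized embeddings can be scored with util.semantic_search (a minimal sketch; the corpus and query are illustrative):

# Python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("radlab/semantic-euro-bert-encoder-v1", trust_remote_code=True)

corpus = ["zamek królewski", "zamek błyskawiczny", "wiadro na wodę", "linia horyzontu"]
corpus_emb = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)
query_emb = model.encode(["twierdza"], convert_to_tensor=True, normalize_embeddings=True)

hits = util.semantic_search(query_emb, corpus_emb, top_k=2)  # top-2 matches for the query
print(hits)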

Transformers (feature extraction):

# Python
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F

name = "radlab/semantic-euro-bert-encoder-v1"
tok = AutoTokenizer.from_pretrained(name)
mdl = AutoModel.from_pretrained(name, trust_remote_code=True)

texts = ["student", "żak"]
tokens = tok(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    out = mdl(**tokens)
    # Mean-pool token embeddings, ignoring padding via the attention mask.
    mask = tokens["attention_mask"].unsqueeze(-1).float()
    emb = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
    emb = F.normalize(emb, p=2, dim=1)

sim = emb @ emb.T  # cosine similarity, since embeddings are L2-normalized
print(sim)
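
Note: the manual mean pooling above is an approximation; depending on the pooling configured for this model, it may not exactly reproduce the embeddings returned by the SentenceTransformer API, which should be preferred when possible.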