radlab/semantic-euro-bert-encoder-v1


pkedzia committed 2 days ago

Commit bbc621c · verified · 1 Parent(s): 8f2b44f

Create README.md

Files changed (1)

README.md +69 -0

README.md ADDED

	@@ -0,0 +1,69 @@

+---
+license: apache-2.0
+datasets:
+- clarin-knext/wsd_plwordnet_glex
+language:
+- pl
+- en
+- de
+base_model:
+- EuroBERT/EuroBERT-610m
+tags:
+- sentence-transformers
+- embeddings
+- plwordnet
+- semantic-relations
+- semantic-search
+---
+# PLWordNet Semantic Embedder (bi-encoder)
+A Polish semantic embedder trained on pairs constructed from plWordNet (Słowosieć) semantic relations and external descriptions of meanings.
+Every relation between lexical units and synsets is converted into training and evaluation examples.
+The dataset combines several signals about how meanings are used: emotion annotations, definitions, and external descriptions (Wikipedia text split into sentences).
+The embedder mimics semantic relations: it pulls together embeddings linked by “positive” relations
+(e.g., synonymy, hypernymy/hyponymy as defined in the dataset) and pushes apart embeddings linked by “negative”
+relations (e.g., antonymy or mutually exclusive relations).
+Source code and training scripts:
+- GitHub: [https://github.com/radlab-dev-group/radlab-plwordnet](https://github.com/radlab-dev-group/radlab-plwordnet)
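The pull/push behavior described above follows from regressing cosine similarity onto a target score. Below is a minimal pure-Python sketch, assuming the default formulation of `CosineSimilarityLoss` in `sentence-transformers` (mean squared error between the cosine similarity of the two embeddings and the pair's label). The toy vectors and the labels 1.0 and 0.0 are illustrative assumptions, not values taken from the dataset.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_similarity_loss(pairs):
    """Mean squared error between cos(u, v) and the target label."""
    errors = [(cosine(u, v) - label) ** 2 for u, v, label in pairs]
    return sum(errors) / len(errors)

# Toy 2-d "embeddings" (hypothetical values).
aligned = ([1.0, 0.0], [1.0, 0.0])     # cosine = 1
orthogonal = ([1.0, 0.0], [0.0, 1.0])  # cosine = 0

# A positive pair that is already aligned and a negative pair that is
# already orthogonal incur no loss.
good = cosine_similarity_loss([(*aligned, 1.0), (*orthogonal, 0.0)])

# Swapping the labels penalizes both pairs.
bad = cosine_similarity_loss([(*aligned, 0.0), (*orthogonal, 1.0)])
```

Minimizing this quantity moves positive pairs toward cosine 1 and negative pairs toward the lower target, which is the geometric effect the description refers to.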
+## Model summary
+- **Architecture**: bi-encoder built with `sentence-transformers` (transformer encoder + pooling).
+- **Use cases**: semantic similarity and semantic search for Polish words, senses, definitions, and sentences.
+- **Objective**: CosineSimilarityLoss on positive/negative pairs.
+- **Behavior**: preserves the topology of semantic relations derived from plWordNet.
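A minimal usage sketch for the similarity and search use cases listed above. It assumes the checkpoint loads through the standard `SentenceTransformer` class under the repository id shown on this page. Depending on the installed `transformers` version, EuroBERT-based checkpoints may need `trust_remote_code=True`; that is an assumption to verify. The helper names `embed`, `rank_by_cosine`, and `demo`, and the example inputs, are hypothetical.

```python
import math

MODEL_ID = "radlab/semantic-euro-bert-encoder-v1"  # repository id from this page

def embed(texts, model_id=MODEL_ID, trust_remote_code=False):
    """Encode texts with the bi-encoder (downloads the model on first use)."""
    # Imported lazily so the ranking helper below works without the library.
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer(model_id, trust_remote_code=trust_remote_code)
    return model.encode(texts, normalize_embeddings=True).tolist()

def rank_by_cosine(query_vec, candidate_vecs):
    """Return (index, cosine) pairs sorted from most to least similar."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norms
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(candidate_vecs)]
    return sorted(scored, key=lambda item: item[1], reverse=True)

def demo():
    """Rank a few illustrative Polish phrases against a query word."""
    texts = ["zamek", "twierdza obronna", "suwak w kurtce", "pogoda"]
    query, *candidates = embed(texts)
    for index, score in rank_by_cosine(query, candidates):
        print(texts[index + 1], round(score, 3))
```

Calling `demo()` requires network access and the `sentence-transformers` package; `rank_by_cosine` on its own works with any precomputed vectors.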
+## Training data
+Constructed from plWordNet relations between lexical units and synsets; each relation yields example pairs.
+Augmented with:
+  - definitions,
+  - usage examples (including emotion annotations where available),
+  - external descriptions from Wikipedia (split into sentences).
+Positive pairs correspond to relations expected to increase similarity;
+negative pairs correspond to relations expected to decrease similarity.
+Additional hard/soft negatives may include unrelated meanings.
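To illustrate how relations can become scored pairs, here is a minimal sketch. The relation names, the target scores, the example texts, and the `build_pairs` helper are all hypothetical; the actual mapping lives in the linked training scripts and may differ.

```python
# Hypothetical relation -> target similarity mapping (illustrative only).
RELATION_SCORES = {
    "synonymy": 1.0,
    "hypernymy": 0.8,
    "hyponymy": 0.8,
    "antonymy": 0.0,
}

def build_pairs(relations, texts):
    """Turn (source_id, relation, target_id) triples into (text_a, text_b, score)."""
    pairs = []
    for source_id, relation, target_id in relations:
        score = RELATION_SCORES.get(relation)
        if score is None:
            continue  # skip relations that have no assigned score
        pairs.append((texts[source_id], texts[target_id], score))
    return pairs

# Made-up lexical units keyed by made-up ids.
texts = {1: "samochód", 2: "auto", 3: "pojazd", 4: "gorący", 5: "zimny"}
relations = [
    (1, "synonymy", 2),
    (1, "hypernymy", 3),
    (4, "antonymy", 5),
    (1, "meronymy", 3),  # not in the mapping, so it is skipped
]
pairs = build_pairs(relations, texts)
```

Each resulting triple has the shape a cosine-regression objective expects: two texts and a target similarity.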
+## Training details
+- **Trainer**: `SentenceTransformerTrainer`
+- **Loss**: `CosineSimilarityLoss`
+- **Evaluator**: `EmbeddingSimilarityEvaluator` (cosine)
+- Typical **hyperparameters**:
+    - epochs: 5
+    - per-device batch size: 10 (gradient accumulation: 4)
+    - learning rate: 5e-6 (AdamW fused)
+    - weight decay: 0.01
+    - warmup: 20k steps
+    - fp16: true
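The hyperparameters above could be expressed roughly as follows. This is a sketch, not the authors' configuration file: it assumes `sentence-transformers` 3.0 or later, maps "AdamW fused" to `optim="adamw_torch_fused"`, and reads the 20k warmup as a step count. The output directory is a hypothetical placeholder.

```python
# Hyperparameters as listed above, expressed as keyword arguments.
TRAINING_KWARGS = {
    "num_train_epochs": 5,
    "per_device_train_batch_size": 10,
    "gradient_accumulation_steps": 4,
    "learning_rate": 5e-6,
    "weight_decay": 0.01,
    "optim": "adamw_torch_fused",  # assumed mapping of "AdamW fused"
    "warmup_steps": 20_000,        # assumption: the 20k warmup is a step count
    "fp16": True,
}

def effective_batch_size(kwargs, num_devices=1):
    """Number of examples contributing to each optimizer step."""
    return (
        kwargs["per_device_train_batch_size"]
        * kwargs["gradient_accumulation_steps"]
        * num_devices
    )

def make_training_args(output_dir="output/plwordnet-embedder"):  # hypothetical path
    """Build the arguments object; fp16 requires a CUDA-capable device."""
    # Imported lazily so the arithmetic above runs without the library.
    from sentence_transformers import SentenceTransformerTrainingArguments
    return SentenceTransformerTrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)
```

With a per-device batch of 10 and 4 accumulation steps, each optimizer update sees 40 examples per device, which is the figure to keep in mind when comparing the learning rate with other setups.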
+## Evaluation
+- **Task**: semantic similarity on dev/test splits built from the relation-derived pairs.
+- **Metric**: correlation between cosine similarity and gold scores (Spearman/Pearson) where applicable, or discrimination between positive and negative pairs.
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/644addfe9279988e0cbc296b/DCepnAcPcv4EblAmtgu7R.png)
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/644addfe9279988e0cbc296b/TWHyVDItYwNbFEyI0i--n.png)
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/644addfe9279988e0cbc296b/o-CFHkDYw62Lyh1MKvG4M.png)
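For reference, the two correlation metrics named in the evaluation section can be sketched in pure Python. This is not the implementation used by `EmbeddingSimilarityEvaluator`; it only shows what is being measured. The gold and predicted scores are made-up values.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up gold labels and model cosine scores for four pairs.
gold = [1.0, 0.8, 0.0, 0.0]
predicted = [0.9, 0.7, 0.2, 0.1]
pearson_score = pearson(gold, predicted)
spearman_score = spearman(gold, predicted)
```

Spearman depends only on the ordering of the scores, so it is the more forgiving of the two when the model's cosine values are monotonically related to the gold scores but not on the same scale.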