pkedzia committed
Commit bbc621c (verified) · 1 Parent(s): 8f2b44f

Create README.md

Files changed (1)
  1. README.md +69 -0
README.md ADDED
@@ -0,0 +1,69 @@
+ ---
+ license: apache-2.0
+ datasets:
+ - clarin-knext/wsd_plwordnet_glex
+ language:
+ - pl
+ - en
+ - de
+ base_model:
+ - EuroBERT/EuroBERT-610m
+ tags:
+ - sentence-transformers
+ - embeddings
+ - plwordnet
+ - semantic-relations
+ - semantic-search
+ ---
+
+ # PLWordNet Semantic Embedder (bi-encoder)
+
+ A Polish semantic embedder trained on pairs constructed from plWordNet (Słowosieć) semantic relations and external descriptions of meanings.
+ Every relation between lexical units and synsets is converted into training and evaluation examples.
+
+ The dataset combines several signals about each meaning: emotion annotations, definitions, and external descriptions (Wikipedia articles split into sentences).
+ The embedder mimics semantic relations: it pulls together embeddings linked by “positive” relations
+ (e.g., synonymy, hypernymy/hyponymy as defined in the dataset) and pushes apart embeddings linked by “negative”
+ relations (e.g., antonymy or mutually exclusive relations).
+
+ Source code and training scripts:
+ - GitHub: [https://github.com/radlab-dev-group/radlab-plwordnet](https://github.com/radlab-dev-group/radlab-plwordnet)
+
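The "pull together / push apart" behavior can be written down as a formula. The sketch below assumes the default configuration of `CosineSimilarityLoss` in `sentence-transformers` (mean squared error on the cosine score). It also assumes that each pair $i$ carries a target $y_i$ that is high for positive relations and low for negative ones; the exact target values are not stated in this README.

```latex
\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \bigl( \cos(\mathbf{u}_i, \mathbf{v}_i) - y_i \bigr)^2,
\qquad
\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u}^\top \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}
```

Here $\mathbf{u}_i$ and $\mathbf{v}_i$ are the embeddings of the two texts in pair $i$. Minimizing the loss moves the cosine of each pair toward its target, which raises similarity for positive pairs and lowers it for negative ones.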
+ ## Model summary
+
+ - **Architecture**: bi-encoder built with `sentence-transformers` (transformer encoder + pooling).
+ - **Use cases**: semantic similarity and semantic search for Polish words, senses, definitions, and sentences.
+ - **Objective**: `CosineSimilarityLoss` on positive/negative pairs.
+ - **Behavior**: trained to preserve the topology of semantic relations derived from plWordNet.
+
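A minimal sketch of the semantic-search use case, under stated assumptions: `MODEL_ID` is a hypothetical placeholder (this README does not give the repository id), `trust_remote_code=True` is an assumption that may or may not be needed for a EuroBERT-based checkpoint, and the helper names `embed`, `cosine`, and `rank` are illustrative rather than part of any library.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical placeholder: replace with the actual Hugging Face repo id of this model.
MODEL_ID = "<namespace>/<model-name>"


def embed(texts: Sequence[str], model_id: str = MODEL_ID) -> List[List[float]]:
    """Encode texts with sentence-transformers (imported lazily; downloads the model)."""
    from sentence_transformers import SentenceTransformer

    # Assumption: remote code may be required for some EuroBERT-based checkpoints.
    model = SentenceTransformer(model_id, trust_remote_code=True)
    return model.encode(list(texts), normalize_embeddings=True).tolist()


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_vec: Sequence[float],
         candidate_vecs: Sequence[Sequence[float]]) -> List[Tuple[int, float]]:
    """Return (candidate index, cosine score) pairs, most similar first."""
    scored = [(i, cosine(query_vec, c)) for i, c in enumerate(candidate_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Example usage (requires the model to be available):
#   vectors = embed(["zamek", "twierdza", "suwak"])
#   print(rank(vectors[0], vectors[1:]))
```

The ranking step is kept in plain Python so it can be read independently of the model; for larger candidate sets, a vectorized or index-based search would be the more usual choice.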
+ ## Training data
+
+ Pairs are constructed from plWordNet relations between lexical units and synsets; each relation yields example pairs.
+ The pairs are augmented with:
+ - definitions,
+ - usage examples (including emotion annotations where available),
+ - external descriptions from Wikipedia (split into sentences).
+
+ Positive pairs correspond to relations expected to increase similarity;
+ negative pairs correspond to relations expected to decrease similarity.
+ Additional hard/soft negatives may include unrelated meanings.
+
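The pair construction described above can be sketched as follows. This is an illustration only: the relation names come from this README, but the numeric targets in `RELATION_LABELS`, the `Pair` type, and the `relations_to_pairs` helper are hypothetical and do not reflect the actual dataset-building code.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Hypothetical target scores per relation type (illustrative values, not from the dataset).
RELATION_LABELS = {
    "synonymy": 1.0,
    "hypernymy": 0.8,
    "hyponymy": 0.8,
    "antonymy": 0.0,
}


@dataclass(frozen=True)
class Pair:
    text_a: str
    text_b: str
    label: float


def relations_to_pairs(relations: Iterable[Tuple[str, str, str]]) -> List[Pair]:
    """Turn (source, relation, target) triples into labeled pairs.

    Triples whose relation type has no entry in RELATION_LABELS are skipped.
    """
    pairs: List[Pair] = []
    for source, relation, target in relations:
        label = RELATION_LABELS.get(relation)
        if label is None:
            continue
        pairs.append(Pair(source, target, label))
    return pairs


triples = [
    ("pies", "hypernymy", "zwierzę"),
    ("gorący", "antonymy", "zimny"),
    ("koło", "meronymy", "rower"),  # not in the mapping, so it is dropped
]
example_pairs = relations_to_pairs(triples)
```

Each `Pair` maps naturally onto the two-texts-plus-float-score layout that a cosine-based loss consumes; definitions or Wikipedia sentences could stand in for `text_a` or `text_b` in the same way.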
+ ## Training details
+ - **Trainer**: `SentenceTransformerTrainer`
+ - **Loss**: `CosineSimilarityLoss`
+ - **Evaluator**: `EmbeddingSimilarityEvaluator` (cosine)
+ - Typical **hyperparameters**:
+   - epochs: 5
+   - per-device batch size: 10 (gradient accumulation: 4)
+   - learning rate: 5e-6 (AdamW fused)
+   - weight decay: 0.01
+   - warmup: 20k steps
+   - fp16: true
+
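The hyperparameters listed above could be expressed as training arguments roughly as follows. This is a sketch, not the project's training script: the keys are standard Hugging Face `TrainingArguments` names, reading "20k" as a warmup step count is an assumption, and `effective_batch_size` and `build_training_args` are illustrative helpers.

```python
from typing import Any, Dict

# Values from the hyperparameter list; keys follow Hugging Face TrainingArguments naming.
HYPERPARAMS: Dict[str, Any] = {
    "num_train_epochs": 5,
    "per_device_train_batch_size": 10,
    "gradient_accumulation_steps": 4,
    "learning_rate": 5e-6,
    "optim": "adamw_torch_fused",
    "weight_decay": 0.01,
    "warmup_steps": 20_000,  # assumption: "20k" is a step count, not a ratio
    "fp16": True,
}


def effective_batch_size(params: Dict[str, Any], num_devices: int = 1) -> int:
    """Number of examples contributing to each optimizer step."""
    return (
        params["per_device_train_batch_size"]
        * params["gradient_accumulation_steps"]
        * num_devices
    )


def build_training_args(output_dir: str):
    """Create training arguments (lazy import; fp16 and the fused optimizer need a GPU)."""
    from sentence_transformers import SentenceTransformerTrainingArguments

    return SentenceTransformerTrainingArguments(output_dir=output_dir, **HYPERPARAMS)
```

With a per-device batch of 10 and 4 accumulation steps, each optimizer update sees 40 pairs per device.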
+ ## Evaluation
+ - **Task**: semantic similarity on dev/test splits built from the relation-derived pairs.
+ - **Metric**: correlation between cosine scores and gold labels (Spearman/Pearson) where applicable, or discrimination between positive and negative pairs.
+
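To make the metrics above concrete, here is a small pure-Python sketch of Pearson and Spearman correlation between cosine scores and gold labels, plus a simple positive/negative discrimination accuracy. The 0.5 threshold and all function names are assumptions for illustration; in practice `EmbeddingSimilarityEvaluator` or a statistics library would compute the correlations.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; NaN if either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def pair_accuracy(scores: Sequence[float], labels: Sequence[float],
                  threshold: float = 0.5) -> float:
    """Share of pairs whose thresholded score agrees with the thresholded label."""
    hits = sum((s >= threshold) == (l >= threshold) for s, l in zip(scores, labels))
    return hits / len(scores)
```

Because relation-derived labels contain many ties (for example, all positives sharing one target), the tie-averaged ranking matters for the Spearman value.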
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/644addfe9279988e0cbc296b/DCepnAcPcv4EblAmtgu7R.png)
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/644addfe9279988e0cbc296b/TWHyVDItYwNbFEyI0i--n.png)
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/644addfe9279988e0cbc296b/o-CFHkDYw62Lyh1MKvG4M.png)