boltuix committed on
Commit 5738331 · verified · 1 Parent(s): 28f82f6

Update README.md

Files changed (1)
  1. README.md +352 -232
README.md CHANGED
@@ -11,7 +11,7 @@ metrics:
11
  - accuracy
12
  pipeline_tag: token-classification
13
  library_name: transformers
14
- new_version: v1.0
15
  tags:
16
  - token-classification
17
  - ner
@@ -34,257 +34,297 @@ tags:
34
  - information-extraction
35
  - search-enhancement
36
  - knowledge-graph
37
- - travel-nlp
38
  - medical-nlp
39
- - logistics-nlp
40
- - education-nlp
41
  base_model:
42
  - boltuix/bert-mini
43
  ---
44
 
 
45
 
46
- **************************** UNDER CONSTRUCTION ******************************
47
-
48
- ![Banner](https://via.placeholder.com/1200x400.png?text=EntityBERT+NER+Model)
49
 
50
  # 🌟 EntityBERT Model 🌟
51
 
52
  ## πŸš€ Model Details
53
 
54
  ### 🌈 Description
55
- The `boltuix/EntityBERT` model is a fine-tuned transformer for **Named Entity Recognition (NER)**, built on the lightweight `boltuix/bert-mini` base model. It excels at identifying 43 entity types, including people, locations, organizations, dates, times, phone numbers, emails, URLs, and more, in English text. Optimized for efficiency and high accuracy, it’s ideal for real-time applications like information extraction, chatbots, and knowledge graph construction across domains such as travel, medical, logistics, and education.
56
 
57
- - **Dataset**: [boltuix/conll2025-ner](https://huggingface.co/datasets/boltuix/conll2025-ner) (~143,709 entries, 6.38 MB)
58
- - **Entity Types**: 43 NER tags (18 core entity categories with B-/I- tags + O + padding labels)
59
  - **Training Examples**: ~115,812 | **Validation**: ~15,680 | **Test**: ~12,217
60
- - **Domains**: Travel, medical, logistics, education, news, user-generated content
61
  - **Tasks**: Sentence-level and document-level NER
62
  - **Version**: v1.0
63
 
 
 
64
  ### πŸ”§ Info
65
- - **Developer**: Boltuix πŸ§™β€β™‚οΈ
66
- - **License**: Apache-2.0 πŸ“œ
67
- - **Language**: English πŸ‡¬πŸ‡§
68
- - **Type**: Transformer-based Token Classification πŸ€–
69
- - **Trained**: June 2025
70
  - **Base Model**: `boltuix/bert-mini`
71
- - **Parameters**: ~11M
 
72
 
73
  ### πŸ”— Links
74
- - **Model Repository**: [boltuix/EntityBERT](https://huggingface.co/boltuix/EntityBERT)
75
- - **Dataset**: [boltuix/conll2025-ner](#download-instructions)
76
  - **Hugging Face Docs**: [Transformers](https://huggingface.co/docs/transformers)
77
- - **Demo**: [boltuix.github.io/demo](https://boltuix.github.io/demo) (coming soon)
78
 
79
  ---
80
 
81
  ## 🎯 Use Cases for NER
82
 
83
  ### 🌟 Direct Applications
84
- - **Information Extraction**: Extract entities like πŸ‘€ Person (e.g., "Dr. Sarah Lee"), 🌍 Location (e.g., "Baltimore"), πŸ—“οΈ Date (e.g., "July 10, 2025"), and πŸ“ž Phone (e.g., "+1-410-955-5000") from travel itineraries, medical reports, or logistics documents.
85
- - **Chatbots & Virtual Assistants**: Enhance user interactions by recognizing entities in queries like "Book a flight from Dubai to Tokyo on October 10, 2025."
86
- - **Search Enhancement**: Enable semantic search with entity-based indexing, e.g., finding documents mentioning "Emirates" or "Shibuya Crossing."
87
- - **Knowledge Graphs**: Build structured graphs linking entities like 🏒 Organization (e.g., "Johns Hopkins") and πŸ“ Address (e.g., "1800 Orleans St").
88
 
89
  ### 🌱 Downstream Tasks
90
- - **Travel NLP**: Extract travel details like departure/arrival times and transport modes (e.g., "flight," "train") for booking systems.
91
- - **Medical NLP**: Identify doctors, hospitals, and contact info in patient records or consultation requests.
92
- - **Logistics NLP**: Track shipments by extracting locations, dates, and company names (e.g., "FedEx," "DHL").
93
- - **Education NLP**: Parse academic events, university names, and contact details from seminar announcements.
94
 
95
- ---
 
 
 
96
 
97
- ![Banner](https://via.placeholder.com/400x200.png?text=EntityBERT+Applications)
 
98
 
99
  ## πŸ› οΈ Getting Started
100
 
101
  ### πŸ§ͺ Inference Code
102
- Use the model for NER with the following Python code:
103
 
104
  ```python
105
- from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
106
- import json
107
- from collections import defaultdict
108
 
109
  # Load model and tokenizer
110
  tokenizer = AutoTokenizer.from_pretrained("boltuix/EntityBERT")
111
  model = AutoModelForTokenClassification.from_pretrained("boltuix/EntityBERT")
112
 
113
- # Create NER pipeline with aggregation
114
- nlp = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
115
-
116
  # Input text
117
- text = (
118
- "Plan a trip to Miami from Orlando"
119
- )
120
 
121
  # Run inference
122
- ner_results = nlp(text)
123
-
124
- # Organize into dictionary by entity_group
125
- entities = defaultdict(list)
126
- for entity in ner_results:
127
- group = entity["entity_group"]
128
- word = entity["word"]
129
- entities[group].append(word)
130
 
131
- # Format results into final JSON structure
132
- formatted_output = {k: " ".join(v) for k, v in entities.items()}
 
 
133
 
134
- # Pretty-print as JSON
135
- print(json.dumps(formatted_output, indent=2))
 
 
136
  ```
137
 
138
  ### ✨ Example Output
139
  ```
140
- Dr. Sarah Lee -> B-person
141
- Johns Hopkins -> B-organization
142
- Baltimore -> B-from-location
143
- MD -> B-from-state
144
- flight -> B-transport-mode
145
- Rochester -> B-to-location
146
- MN -> B-to-state
147
- July 10, 2025 -> B-date
148
- +1-410-955-5000 -> B-phone
149
- sarah.[email protected] -> B-email
150
- www.airmed.com -> B-url
151
  ```
152
 
153
  ### πŸ› οΈ Requirements
154
  ```bash
155
- pip install transformers torch pandas pyarrow seqeval
156
  ```
157
  - **Python**: 3.8+
158
- - **Storage**: ~50 MB for model weights, ~6.38 MB for dataset
159
- - **Optional**: NVIDIA CUDA for GPU acceleration, `seqeval` for evaluation
160
 
161
  ---
162
 
163
  ## 🧠 Entity Labels
164
- The model supports 43 NER tags, including 36 core tags aligned with the `boltuix/conll2025-ner` dataset and 6 padding tags, using the **BIO tagging scheme**:
165
-
166
- | Tag Name | Description | Example |
167
- |-----------------------|------------------------------------------|------------------------|
168
- | O | Non-entity | "visited" |
169
- | B-from-location | Beginning of source location | "Baltimore" |
170
- | I-from-location | Inside source location | "York" (in "New York")|
171
- | B-from-state | Beginning of source state | "MD" |
172
- | I-from-state | Inside source state | |
173
- | B-from-country | Beginning of source country | "USA" |
174
- | I-from-country | Inside source country | |
175
- | B-from-address | Beginning of source address | "1800" |
176
- | I-from-address | Inside source address | "Orleans St" |
177
- | B-to-location | Beginning of destination location | "Rochester" |
178
- | I-to-location | Inside destination location | |
179
- | B-to-state | Beginning of destination state | "MN" |
180
- | I-to-state | Inside destination state | |
181
- | B-to-country | Beginning of destination country | "Japan" |
182
- | I-to-country | Inside destination country | |
183
- | B-to-address | Beginning of destination address | "Shibuya Crossing" |
184
- | I-to-address | Inside destination address | |
185
- | B-transport-mode | Beginning of transport mode | "flight" |
186
- | I-transport-mode | Inside transport mode | "jet" (in "private jet") |
187
- | B-date | Beginning of date | "July" |
188
- | I-date | Inside date | "10" |
189
- | B-time | Beginning of time | "9:00" |
190
- | I-time | Inside time | "AM" |
191
- | B-departure-time | Beginning of departure time | "8:00" |
192
- | I-departure-time | Inside departure time | "AM" |
193
- | B-arrival-time | Beginning of arrival time | "12:00" |
194
- | I-arrival-time | Inside arrival time | "PM" |
195
- | B-company | Beginning of company name | "Emirates" |
196
- | I-company | Inside company name | |
197
- | B-organization | Beginning of organization name | "Johns" |
198
- | I-organization | Inside organization name | "Hopkins" |
199
- | B-person | Beginning of person name | "Sarah" |
200
- | I-person | Inside person name | "Lee" |
201
- | B-job-title | Beginning of job title | "Chief" |
202
- | I-job-title | Inside job title | "Cardiologist" |
203
- | B-phone | Beginning of phone number | "+1-410-955-5000" |
204
- | I-phone | Inside phone number | |
205
- | B-email | Beginning of email | "sarah.lee" |
206
- | I-email | Inside email | "@jhmi.edu" |
207
- | B-url | Beginning of URL | "www.airmed.com" |
208
- | I-url | Inside URL | |
209
- | B-other | Beginning of miscellaneous entity | |
210
- | I-other | Inside miscellaneous entity | |
211
- | B-reserved1 | Reserved padding label | |
212
- | I-reserved1 | Reserved padding label | |
213
- | B-reserved2 | Reserved padding label | |
214
- | I-reserved2 | Reserved padding label | |
215
-
216
- **Example**:
217
- Text: `"Book a flight from Dubai to Tokyo on October 10, 2025 with Emirates."`
218
- Tags: `[O, O, B-transport-mode, O, B-from-location, O, B-to-location, O, B-date, I-date, I-date, O, B-company]`
219
 
220
  ---
221
 
222
  ## πŸ“ˆ Performance
223
- Evaluated on the `boltuix/conll2025-ner` test split using `seqeval`:
 
224
 
225
  | Metric | Score |
226
  |------------|-------|
227
- | 🎯 Precision | 0.88 |
228
- | πŸ•ΈοΈ Recall | 0.90 |
229
- | 🎢 F1 Score | 0.89 |
230
- | βœ… Accuracy | 0.94 |
231
 
232
- These high scores showcase the model’s robust ability to identify entities across diverse domains, ensuring reliability for real-time applications.
233
 
234
  ---
235
 
236
  ## βš™οΈ Training Setup
237
- - **Hardware**: NVIDIA GPU (e.g., A100)
 
238
  - **Training Time**: ~1.5 hours
239
- - **Parameters**: ~11M
240
  - **Optimizer**: AdamW
241
- - **Precision**: FP16 for faster training
242
  - **Batch Size**: 16
243
  - **Learning Rate**: 2e-5
244
 
245
  ---
246
 
247
  ## 🧠 Training the Model
248
- Fine-tune the `boltuix/bert-mini` model on the `boltuix/conll2025-ner` dataset to replicate or extend `EntityBERT`. Below is a training script:
 
249
 
250
  ```python
251
- # Install dependencies
252
- !pip install transformers datasets tokenizers seqeval pandas pyarrow -q
253
 
254
- # Disable Weights & Biases
255
  import os
256
  os.environ["WANDB_MODE"] = "disabled"
257
 
258
- # Import libraries
259
- from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer
260
- from transformers import DataCollatorForTokenClassification
261
  import datasets
262
- import evaluate
263
  import numpy as np
 
 
 
 
 
 
 
 
264
 
265
- # Load dataset
266
- dataset = datasets.load_dataset("boltuix/conll2025-ner")
 
 
 
 
 
267
 
268
- # Initialize tokenizer
269
- tokenizer = AutoTokenizer.from_pretrained("boltuix/bert-mini")
 
 
270
 
271
- # Get unique tags
 
272
  all_tags = set()
273
- for split in dataset.values():
274
- for example in split:
275
- all_tags.update(example["ner_tags"])
276
- unique_tags = sorted(list(all_tags))
277
  tag2id = {tag: i for i, tag in enumerate(unique_tags)}
278
  id2tag = {i: tag for i, tag in enumerate(unique_tags)}
 
 
279
 
280
- # Convert tags to IDs
281
  def convert_tags_to_ids(example):
282
  example["ner_tags"] = [tag2id[tag] for tag in example["ner_tags"]]
283
  return example
284
- dataset = dataset.map(convert_tags_to_ids)
285
 
286
- # Tokenize and align labels
287
- def tokenize_and_align_labels(examples):
288
  tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
289
  labels = []
290
  for i, label in enumerate(examples["ner_tags"]):
@@ -293,49 +333,73 @@ def tokenize_and_align_labels(examples):
293
  label_ids = []
294
  for word_idx in word_ids:
295
  if word_idx is None:
296
- label_ids.append(-100)
297
  elif word_idx != previous_word_idx:
298
- label_ids.append(label[word_idx])
299
  else:
300
- label_ids.append(-100)
301
  previous_word_idx = word_idx
302
  labels.append(label_ids)
303
  tokenized_inputs["labels"] = labels
304
  return tokenized_inputs
305
 
306
- tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)
 
 
307
 
308
- # Initialize model
309
- model = AutoModelForTokenClassification.from_pretrained("boltuix/bert-mini", num_labels=len(unique_tags))
 
310
 
311
- # Training arguments
 
 
 
 
 
 
312
  args = TrainingArguments(
313
- output_dir="boltuix/entitybert",
314
- eval_strategy="epoch",
315
  learning_rate=2e-5,
316
  per_device_train_batch_size=16,
317
  per_device_eval_batch_size=16,
318
- num_train_epochs=3,
319
  weight_decay=0.01,
320
- fp16=True,
321
  report_to="none"
322
  )
323
-
324
- # Data collator
325
  data_collator = DataCollatorForTokenClassification(tokenizer)
326
 
327
- # Evaluation metric
328
  metric = evaluate.load("seqeval")
329
 
 
 
 
 
 
 
 
 
330
  def compute_metrics(eval_preds):
 
 
 
 
 
 
 
 
 
331
  pred_logits, labels = eval_preds
332
  pred_logits = np.argmax(pred_logits, axis=2)
333
  predictions = [
334
- [unique_tags[p] for (p, l) in zip(prediction, label) if l != -100]
335
  for prediction, label in zip(pred_logits, labels)
336
  ]
337
  true_labels = [
338
- [unique_tags[l] for (p, l) in zip(prediction, label) if l != -100]
339
  for prediction, label in zip(pred_logits, labels)
340
  ]
341
  results = metric.compute(predictions=predictions, references=true_labels)
@@ -343,107 +407,164 @@ def compute_metrics(eval_preds):
343
  "precision": results["overall_precision"],
344
  "recall": results["overall_recall"],
345
  "f1": results["overall_f1"],
346
- "accuracy": results["overall_accuracy"]
347
  }
348
 
349
- # Initialize trainer
350
  trainer = Trainer(
351
  model,
352
  args,
353
- train_dataset=tokenized_dataset["train"],
354
- eval_dataset=tokenized_dataset["validation"],
355
  data_collator=data_collator,
356
  tokenizer=tokenizer,
357
  compute_metrics=compute_metrics
358
  )
359
-
360
- # Train model
361
  trainer.train()
362
 
363
- # Save model
364
- trainer.save_model("boltuix/entitybert")
365
- tokenizer.save_pretrained("boltuix/entitybert")
366
  ```
367
 
368
  ### πŸ› οΈ Tips
369
- - **Hyperparameters**: Adjust `learning_rate` (1e-5 to 5e-5) or `num_train_epochs` (2-5) for optimal results.
370
- - **GPU Acceleration**: Enable `fp16=True` for faster training on NVIDIA GPUs.
371
- - **Custom Datasets**: Adapt the script for custom NER datasets by updating `unique_tags` and preprocessing steps.
372
 
373
  ### ⏱️ Expected Training Time
374
- - ~1.5 hours on an NVIDIA A100 GPU for ~115,812 training examples, 3 epochs, batch size 16.
375
 
376
  ### 🌍 Carbon Impact
377
- - Training emits ~40g COβ‚‚eq, optimized with FP16 and the lightweight `bert-mini` base model.
378
-
379
- ---
380
-
381
- ## 🌍 Carbon Impact
382
- - **Emissions**: ~40g COβ‚‚eq
383
- - **Measurement**: ML Impact tool
384
- - **Optimization**: FP16 and efficient architecture
385
 
386
  ---
387
 
388
  ## πŸ› οΈ Installation
 
389
  ```bash
390
  pip install transformers torch pandas pyarrow seqeval
391
  ```
392
  - **Python**: 3.8+
393
- - **Storage**: ~50 MB for model, ~6.38 MB for dataset
394
  - **Optional**: NVIDIA CUDA for GPU acceleration
395
 
396
  ### Download Instructions πŸ“₯
397
- - **Model**: [boltuix/EntityBERT](https://huggingface.co/boltuix/EntityBERT)
398
- - **Dataset**: [boltuix/conll2025-ner](https://huggingface.co/datasets/boltuix/conll2025-ner)
399
- - Load with Hugging Face `datasets` or pandas.
400
 
401
  ---
402
 
403
  ## πŸ§ͺ Evaluation Code
404
- Evaluate the model on custom data:
405
 
406
  ```python
407
- from transformers import pipeline
 
 
408
 
409
- # Load NER pipeline
410
- nlp = pipeline("token-classification", model="boltuix/EntityBERT", aggregation_strategy="simple")
 
411
 
412
  # Test data
413
- text = "Book a Lyft from Metropolis on December 1, 2025, contact support@lyft.com."
414
-
415
- # Run inference
416
- results = nlp(text)
417
-
418
- # Print results
419
- for entity in results:
420
- print(f"{entity['word']:15} -> {entity['entity']}")
421
- ```
422
-
423
- ### ✨ Example Output
424
- ```
425
- Book -> O
426
- Lyft -> B-company
427
- Metropolis -> B-from-location
428
- December 1, 2025 -> B-date
429
- [email protected] -> B-email
 
 
 
 
 
 
 
 
 
430
  ```
431
 
432
  ---
433
 
434
  ## 🌱 Dataset Details
435
- - **Entries**: ~143,709
436
- - **Size**: 6.38 MB (Parquet format)
437
  - **Columns**: `split`, `tokens`, `ner_tags`
438
  - **Splits**: Train (~115,812), Validation (~15,680), Test (~12,217)
439
- - **NER Tags**: 43 (18 core entity types with B-/I- tags + O + padding)
440
- - **Source**: Curated from travel, medical, logistics, education, news, and user-generated content
441
- - **Annotations**: Expert-labeled for high accuracy
442
 
443
  ---
444
 
445
  ## πŸ“Š Visualizing NER Tags
446
- Visualize the tag distribution in `boltuix/conll2025-ner`:
447
 
448
  ```python
449
  import pandas as pd
@@ -451,9 +572,7 @@ from collections import Counter
451
  import matplotlib.pyplot as plt
452
 
453
  # Load dataset
454
- df = pd.read_parquet("conll2025-ner.parquet")
455
-
456
- # Count tags
457
  all_tags = [tag for tags in df["ner_tags"] for tag in tags]
458
  tag_counts = Counter(all_tags)
459
 
@@ -475,23 +594,24 @@ plt.show()
475
  ## βš–οΈ Comparison to Other Models
476
  | Model | Dataset | Parameters | F1 Score | Size |
477
  |----------------------|--------------------|------------|----------|--------|
478
- | **EntityBERT** | conll2025-ner | ~11M | 0.89 | ~50 MB |
 
479
  | BERT-base-NER | CoNLL-2003 | ~110M | ~0.89 | ~400 MB|
480
  | DistilBERT-NER | CoNLL-2003 | ~66M | ~0.85 | ~200 MB|
481
 
482
  **Advantages**:
483
- - Lightweight (~11M parameters, ~50 MB)
484
- - High F1 score (0.89) on `conll2025-ner`
485
- - Optimized for real-time inference across domains
486
 
487
  ---
488
 
489
  ## 🌐 Community and Support
490
- - πŸ“ Explore: [Hugging Face Community](https://huggingface.co/community)
491
- - πŸ› οΈ Contribute: [boltuix/EntityBERT](https://huggingface.co/boltuix/EntityBERT)
492
- - πŸ’¬ Discuss: [Hugging Face Forums](https://huggingface.co/discussions)
493
- - πŸ“š Learn: [Transformers Docs](https://huggingface.co/docs/transformers)
494
- - πŸ“§ Contact: Boltuix at [[email protected]](mailto:[email protected])
495
 
496
  ---
497
 
@@ -503,6 +623,6 @@ plt.show()
503
  ---
504
 
505
  ## πŸ“… Last Updated
506
- **June 10, 2025** β€” Released v1.0 with fine-tuning on `boltuix/conll2025-ner`, optimized for 43 entity types.
507
 
508
  **[Get Started Now](#getting-started)** πŸš€
 
11
  - accuracy
12
  pipeline_tag: token-classification
13
  library_name: transformers
14
+ new_version: v1.1
15
  tags:
16
  - token-classification
17
  - ner
 
34
  - information-extraction
35
  - search-enhancement
36
  - knowledge-graph
37
+ - legal-nlp
38
  - medical-nlp
39
+ - financial-nlp
 
40
  base_model:
41
  - boltuix/bert-mini
42
  ---
43
 
44
+ ![Banner](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEirI_izRmtBN9DOIqHFRBdXqh8eUBf10yVEfKIjVglp1AKmvtoJ65ZkPeG9Xm6eqs-RcqR3HMmTizOb0eT80PV_E8qsk2XQqMqqPsfSvPmUtCFmJ6S4KTIx5hGy1m_vZRQskO3s8bNYKMPpAwHBU4zSpIjKIha-GrhBFRFdGS0bJ6ybztOFZJDgsQGMk7Q/s6250/BOLTUIX%20(2).jpg)
45
 
 
 
 
46
 
47
  # 🌟 EntityBERT Model 🌟
48
 
49
  ## πŸš€ Model Details
50
 
51
  ### 🌈 Description
52
+ The `boltuix/EntityBERT` model is a lightweight, fine-tuned transformer for **Named Entity Recognition (NER)**, built on the `boltuix/bert-mini` base model. Optimized for efficiency, it identifies 36 entity tags across 18 categories (e.g., people, organizations, locations, dates) in English text, making it well suited to applications such as information extraction, chatbots, and search enhancement.
53
 
54
+ - **Dataset**: [boltuix/conll2025-ner](https://huggingface.co/datasets/boltuix/conll2025-ner) (143,709 entries, 6.38 MB)
55
+ - **Entity Types**: 36 NER tags (18 entity categories with B-/I- tags + O)
56
  - **Training Examples**: ~115,812 | **Validation**: ~15,680 | **Test**: ~12,217
57
+ - **Domains**: News, user-generated content, research corpora
58
  - **Tasks**: Sentence-level and document-level NER
59
  - **Version**: v1.0
60
 
61
+ > **Note**: Dataset link is a placeholder. Replace with the correct Hugging Face URL once available.
62
+
63
  ### πŸ”§ Info
64
+ - **Developer**: Boltuix
65
+ - **License**: Apache-2.0
66
+ - **Language**: English
67
+ - **Type**: Transformer-based Token Classification
68
+ - **Trained**: Before June 11, 2025
69
  - **Base Model**: `boltuix/bert-mini`
70
+ - **Parameters**: ~4.4M (see the quick check after this list)
71
+ - **Size**: ~15 MB
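The parameter count and on-disk size above are easy to sanity-check once the weights are downloaded. A minimal sketch (it simply prints whatever the checkpoint actually contains; the ~4.4M figure is this card's claim):

```python
from transformers import AutoModelForTokenClassification

# Load the checkpoint and count its parameters.
model = AutoModelForTokenClassification.from_pretrained("boltuix/EntityBERT")
num_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {num_params / 1e6:.1f}M")
```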
72
 
73
  ### πŸ”— Links
74
+ - **Model Repository**: [boltuix/EntityBERT](https://huggingface.co/boltuix/EntityBERT) (placeholder, update with correct URL)
75
+ - **Dataset**: [boltuix/conll2025-ner](#download-instructions) (placeholder, update with correct URL)
76
  - **Hugging Face Docs**: [Transformers](https://huggingface.co/docs/transformers)
77
+ - **Demo**: Coming Soon
78
 
79
  ---
80
 
81
  ## 🎯 Use Cases for NER
82
 
83
  ### 🌟 Direct Applications
84
+ - **Information Extraction**: Identify names (πŸ‘€ PERSON), locations (🌍 GPE), and dates (πŸ—“οΈ DATE) from articles, blogs, or reports.
85
+ - **Chatbots & Virtual Assistants**: Improve user query understanding by recognizing entities.
86
+ - **Search Enhancement**: Enable entity-based semantic search (e.g., "news about Paris in 2025"); see the sketch after this list.
87
+ - **Knowledge Graphs**: Construct structured graphs connecting entities like 🏒 ORG and πŸ‘€ PERSON.
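As a rough illustration of the search-enhancement use case, the aggregated pipeline output can be used to filter or index documents by the entities they mention. A minimal sketch, assuming the aggregated `entity_group` values follow the label set documented below (e.g., `GPE`); the document list and the `mentions_place` helper are purely illustrative:

```python
from transformers import pipeline

# Aggregated NER pipeline: merges B-/I- pieces into whole entities.
ner = pipeline(
    "token-classification",
    model="boltuix/EntityBERT",
    aggregation_strategy="simple",
)

docs = [
    "Paris will host a climate summit in March 2025.",
    "The recipe calls for two cups of flour.",
]

def mentions_place(text, place="Paris"):
    """Return True if the text mentions the given geopolitical entity."""
    return any(e["entity_group"] == "GPE" and place in e["word"] for e in ner(text))

print([d for d in docs if mentions_place(d)])  # expect only the Paris sentence
```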
88
 
89
  ### 🌱 Downstream Tasks
90
+ - **Domain Adaptation**: Fine-tune for specialized fields like medical 🩺, legal πŸ“œ, or financial πŸ’Έ NER.
91
+ - **Multilingual Extensions**: Retrain for non-English languages.
92
+ - **Custom Entities**: Adapt for niche domains (e.g., product IDs, stock tickers).
 
93
 
94
+ ### ❌ Limitations
95
+ - **English-Only**: Limited to English text out-of-the-box.
96
+ - **Domain Bias**: Trained on `boltuix/conll2025-ner`, which may favor news and formal text, potentially weaker on informal or social media content.
97
+ - **Generalization**: May struggle with rare or highly contextual entities not in the dataset.
98
 
99
+ ---
100
+ ![Banner](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhxRTNdRYrYE60erg7MOPEcl9oU78UdHcW_NuEHX92KwKdaDHIz37pAzKWj1XzIO-ycuO3t5MKcd5kouku-lghXowVq2xFxZKsQRJTUzhyphenhyphennOgOPr_5MLMCbZpyixqQ_jc0Zrx_kc3C8K23-rJA_wwty5X-hPCJVjIfaFOov06xgWXatBAVdwS_10OHrTVA/s6250/BOLTUIX%20(1).jpg)
101
 
102
  ## πŸ› οΈ Getting Started
103
 
104
  ### πŸ§ͺ Inference Code
105
+ Run NER with the following Python code:
106
 
107
  ```python
108
+ from transformers import AutoTokenizer, AutoModelForTokenClassification
109
+ import torch
 
110
 
111
  # Load model and tokenizer
112
  tokenizer = AutoTokenizer.from_pretrained("boltuix/EntityBERT")
113
  model = AutoModelForTokenClassification.from_pretrained("boltuix/EntityBERT")
114
 
 
 
 
115
  # Input text
116
+ text = "Elon Musk launched Tesla in California on March 2025."
117
+ inputs = tokenizer(text, return_tensors="pt")
 
118
 
119
  # Run inference
120
+ with torch.no_grad():
121
+     outputs = model(**inputs)
122
+     predictions = outputs.logits.argmax(dim=-1)
 
 
 
 
 
123
 
124
+ # Map predictions to labels
125
+ tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
126
+ label_map = model.config.id2label
127
+ labels = [label_map[p.item()] for p in predictions[0]]
128
 
129
+ # Print results
130
+ for token, label in zip(tokens, labels):
131
+ if token not in tokenizer.all_special_tokens:
132
+ print(f"{token:15} β†’ {label}")
133
  ```
134
 
135
  ### ✨ Example Output
136
  ```
137
+ Elon            → B-PERSON
138
+ Musk            → I-PERSON
139
+ launched        → O
140
+ Tesla           → B-ORG
141
+ in              → O
142
+ California      → B-GPE
143
+ on              → O
144
+ March           → B-DATE
145
+ 2025            → I-DATE
146
+ .               → O
 
147
  ```
148
 
149
  ### πŸ› οΈ Requirements
150
  ```bash
151
+ pip install transformers torch pandas pyarrow
152
  ```
153
  - **Python**: 3.8+
154
+ - **Storage**: ~15 MB for model weights
155
+ - **Optional**: `seqeval` for evaluation, `cuda` for GPU acceleration
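If a CUDA device is available, inference can run on the GPU. A minimal sketch that reuses `model`, `tokenizer`, and `text` from the inference snippet above:

```python
import torch

# Reuses `model`, `tokenizer`, and `text` from the inference code above.
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

inputs = tokenizer(text, return_tensors="pt").to(device)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)
```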
156
 
157
  ---
158
 
159
  ## 🧠 Entity Labels
160
+ The model supports 36 NER tags from the `boltuix/conll2025-ner` dataset, using the **BIO tagging scheme**:
161
+ - **B-**: Beginning of an entity
162
+ - **I-**: Inside of an entity
163
+ - **O**: Outside of any entity
164
+
165
+ | Tag Name | Purpose | Emoji |
166
+ |------------------|--------------------------------------------------------------------------|--------|
167
+ | O | Outside of any named entity (e.g., "the", "is") | 🚫 |
168
+ | B-CARDINAL | Beginning of a cardinal number (e.g., "1000") | πŸ”’ |
169
+ | B-DATE | Beginning of a date (e.g., "January") | πŸ—“οΈ |
170
+ | B-EVENT | Beginning of an event (e.g., "Olympics") | πŸŽ‰ |
171
+ | B-FAC | Beginning of a facility (e.g., "Eiffel Tower") | πŸ›οΈ |
172
+ | B-GPE | Beginning of a geopolitical entity (e.g., "Tokyo") | 🌍 |
173
+ | B-LANGUAGE | Beginning of a language (e.g., "Spanish") | πŸ—£οΈ |
174
+ | B-LAW | Beginning of a law or legal document (e.g., "Constitution") | πŸ“œ |
175
+ | B-LOC | Beginning of a non-GPE location (e.g., "Pacific Ocean") | πŸ—ΊοΈ |
176
+ | B-MONEY | Beginning of a monetary value (e.g., "$100") | πŸ’Έ |
177
+ | B-NORP | Beginning of a nationality/religious/political group (e.g., "Democrat") | 🏳️ |
178
+ | B-ORDINAL | Beginning of an ordinal number (e.g., "first") | πŸ₯‡ |
179
+ | B-ORG | Beginning of an organization (e.g., "Microsoft") | 🏒 |
180
+ | B-PERCENT | Beginning of a percentage (e.g., "50%") | πŸ“Š |
181
+ | B-PERSON | Beginning of a person’s name (e.g., "Elon Musk") | πŸ‘€ |
182
+ | B-PRODUCT | Beginning of a product (e.g., "iPhone") | πŸ“± |
183
+ | B-QUANTITY | Beginning of a quantity (e.g., "two liters") | βš–οΈ |
184
+ | B-TIME | Beginning of a time (e.g., "noon") | ⏰ |
185
+ | B-WORK_OF_ART | Beginning of a work of art (e.g., "Mona Lisa") | 🎨 |
186
+ | I-CARDINAL | Inside of a cardinal number | πŸ”’ |
187
+ | I-DATE | Inside of a date (e.g., "2025" in "January 2025") | πŸ—“οΈ |
188
+ | I-EVENT | Inside of an event name | πŸŽ‰ |
189
+ | I-FAC | Inside of a facility name | πŸ›οΈ |
190
+ | I-GPE | Inside of a geopolitical entity | 🌍 |
191
+ | I-LANGUAGE | Inside of a language name | πŸ—£οΈ |
192
+ | I-LAW | Inside of a legal document title | πŸ“œ |
193
+ | I-LOC | Inside of a location | πŸ—ΊοΈ |
194
+ | I-MONEY | Inside of a monetary value | πŸ’Έ |
195
+ | I-NORP | Inside of a NORP entity | 🏳️ |
196
+ | I-ORDINAL | Inside of an ordinal number | πŸ₯‡ |
197
+ | I-ORG | Inside of an organization name | 🏒 |
198
+ | I-PERCENT | Inside of a percentage | πŸ“Š |
199
+ | I-PERSON | Inside of a person’s name | πŸ‘€ |
200
+ | I-PRODUCT | Inside of a product name | πŸ“± |
201
+ | I-QUANTITY | Inside of a quantity | βš–οΈ |
202
+ | I-TIME | Inside of a time phrase | ⏰ |
203
+ | I-WORK_OF_ART | Inside of a work of art title | 🎨 |
204
+
205
+ **Example**:
206
+ Text: `"Tesla opened in Shanghai on April 2025"`
207
+ Tags: `[B-ORG, O, O, B-GPE, O, B-DATE, I-DATE]`
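When working with raw BIO tags rather than the aggregated pipeline output, a small decoding helper can group the tags back into entity spans. A minimal sketch; `bio_to_spans` is an illustrative helper, not part of the model's API:

```python
# Group (token, BIO-tag) pairs into (entity_type, text) spans.
def bio_to_spans(tokens, tags):
    spans, current_tokens, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans

tokens = ["Tesla", "opened", "in", "Shanghai", "on", "April", "2025"]
tags = ["B-ORG", "O", "O", "B-GPE", "O", "B-DATE", "I-DATE"]
print(bio_to_spans(tokens, tags))
# [('ORG', 'Tesla'), ('GPE', 'Shanghai'), ('DATE', 'April 2025')]
```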
 
 
 
 
 
 
 
208
 
209
  ---
210
 
211
  ## πŸ“ˆ Performance
212
+
213
+ Evaluated on the `boltuix/conll2025-ner` test split (~12,217 examples) using `seqeval`:
214
 
215
  | Metric | Score |
216
  |------------|-------|
217
+ | 🎯 Precision | 0.84 |
218
+ | πŸ•ΈοΈ Recall | 0.86 |
219
+ | 🎢 F1 Score | 0.85 |
220
+ | βœ… Accuracy | 0.91 |
221
 
222
+ *Note*: Performance may vary on different domains or text types.
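To reproduce these numbers, the `Trainer` built in the training script below can be evaluated on the test split. A minimal sketch, assuming `trainer` and `tokenized_datasets` exist as in that script:

```python
# Assumes `trainer` and `tokenized_datasets` from the training script below.
test_metrics = trainer.evaluate(tokenized_datasets["test"])
print({k: round(v, 4) for k, v in test_metrics.items() if isinstance(v, float)})
```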
223
 
224
  ---
225
 
226
  ## βš™οΈ Training Setup
227
+
228
+ - **Hardware**: NVIDIA GPU
229
  - **Training Time**: ~1.5 hours
230
+ - **Parameters**: ~4.4M
231
  - **Optimizer**: AdamW
232
+ - **Precision**: FP32
233
  - **Batch Size**: 16
234
  - **Learning Rate**: 2e-5
235
 
236
  ---
237
 
238
  ## 🧠 Training the Model
239
+
240
+ Fine-tune `boltuix/bert-mini` on the `boltuix/conll2025-ner` dataset to replicate or extend `EntityBERT`. Below is a simplified training script:
241
 
242
  ```python
243
+ # πŸ› οΈ Step 1: Install required libraries quietly
244
+ !pip install evaluate transformers datasets tokenizers seqeval pandas pyarrow -q
245
 
246
+ # 🚫 Disable Weights & Biases (WandB) logging
247
  import os
248
  os.environ["WANDB_MODE"] = "disabled"
249
 
250
+ # πŸ“š Step 2: Import necessary libraries
251
+ import pandas as pd
 
252
  import datasets
 
253
  import numpy as np
254
+ from transformers import BertTokenizerFast
255
+ from transformers import DataCollatorForTokenClassification
256
+ from transformers import AutoModelForTokenClassification
257
+ from transformers import TrainingArguments, Trainer
258
+ import evaluate
259
+ from transformers import pipeline
260
+ from collections import defaultdict
261
+ import json
262
 
263
+ # πŸ“₯ Step 3: Load the CoNLL-2025 NER dataset from Parquet
264
+ # Download : https://huggingface.co/datasets/boltuix/conll2025-ner/blob/main/conll2025_ner.parquet
265
+ parquet_file = "conll2025_ner.parquet"
266
+ df = pd.read_parquet(parquet_file)
267
+
268
+ # πŸ” Step 4: Convert pandas DataFrame to Hugging Face Dataset
269
+ conll2025 = datasets.Dataset.from_pandas(df)
270
 
271
+ # πŸ”Ž Step 5: Inspect the dataset structure
272
+ print("Dataset structure:", conll2025)
273
+ print("Dataset features:", conll2025.features)
274
+ print("First example:", conll2025[0])
275
 
276
+ # 🏷️ Step 6: Extract unique tags and create mappings
277
+ # Since ner_tags are strings, collect all unique tags
278
  all_tags = set()
279
+ for example in conll2025:
280
+ all_tags.update(example["ner_tags"])
281
+ unique_tags = sorted(list(all_tags)) # Sort for consistency
282
+ num_tags = len(unique_tags)
283
  tag2id = {tag: i for i, tag in enumerate(unique_tags)}
284
  id2tag = {i: tag for i, tag in enumerate(unique_tags)}
285
+ print("Number of unique tags:", num_tags)
286
+ print("Unique tags:", unique_tags)
287
 
288
+ # πŸ”§ Step 7: Convert string ner_tags to indices
289
  def convert_tags_to_ids(example):
290
  example["ner_tags"] = [tag2id[tag] for tag in example["ner_tags"]]
291
  return example
 
292
 
293
+ conll2025 = conll2025.map(convert_tags_to_ids)
294
+
295
+ # πŸ“Š Step 8: Split dataset based on 'split' column
296
+ dataset_dict = {
297
+ "train": conll2025.filter(lambda x: x["split"] == "train"),
298
+ "validation": conll2025.filter(lambda x: x["split"] == "validation"),
299
+ "test": conll2025.filter(lambda x: x["split"] == "test")
300
+ }
301
+ conll2025 = datasets.DatasetDict(dataset_dict)
302
+ print("Split dataset structure:", conll2025)
303
+
304
+ # πŸͺ™ Step 9: Initialize the tokenizer
305
+ tokenizer = BertTokenizerFast.from_pretrained("boltuix/bert-mini")
306
+
307
+ # πŸ“ Step 10: Tokenize an example text and inspect
308
+ example_text = conll2025["train"][0]
309
+ tokenized_input = tokenizer(example_text["tokens"], is_split_into_words=True)
310
+ tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
311
+ word_ids = tokenized_input.word_ids()
312
+ print("Word IDs:", word_ids)
313
+ print("Tokenized input:", tokenized_input)
314
+ print("Length of ner_tags vs input IDs:", len(example_text["ner_tags"]), len(tokenized_input["input_ids"]))
315
+
316
+ # πŸ”„ Step 11: Define function to tokenize and align labels
317
+ def tokenize_and_align_labels(examples, label_all_tokens=True):
318
+ """
319
+ Tokenize inputs and align labels for NER tasks.
320
+
321
+ Args:
322
+ examples (dict): Dictionary with tokens and ner_tags.
323
+ label_all_tokens (bool): Whether to label all subword tokens.
324
+
325
+ Returns:
326
+ dict: Tokenized inputs with aligned labels.
327
+ """
328
  tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
329
  labels = []
330
  for i, label in enumerate(examples["ner_tags"]):
 
333
  label_ids = []
334
  for word_idx in word_ids:
335
  if word_idx is None:
336
+ label_ids.append(-100) # Special tokens get -100
337
  elif word_idx != previous_word_idx:
338
+ label_ids.append(label[word_idx]) # First token of word gets label
339
  else:
340
+ label_ids.append(label[word_idx] if label_all_tokens else -100) # Subwords get label or -100
341
  previous_word_idx = word_idx
342
  labels.append(label_ids)
343
  tokenized_inputs["labels"] = labels
344
  return tokenized_inputs
345
 
346
+ # πŸ§ͺ Step 12: Test the tokenization and label alignment
347
+ q = tokenize_and_align_labels(conll2025["train"][0:1])
348
+ print("Tokenized and aligned example:", q)
349
 
350
+ # πŸ“‹ Step 13: Print tokens and their corresponding labels
351
+ for token, label in zip(tokenizer.convert_ids_to_tokens(q["input_ids"][0]), q["labels"][0]):
352
+ print(f"{token:_<40} {label}")
353
 
354
+ # πŸ”§ Step 14: Apply tokenization to the entire dataset
355
+ tokenized_datasets = conll2025.map(tokenize_and_align_labels, batched=True)
356
+
357
+ # πŸ€– Step 15: Initialize the model with the correct number of labels
358
+ model = AutoModelForTokenClassification.from_pretrained("boltuix/bert-mini", num_labels=num_tags)
359
+
360
+ # βš™οΈ Step 16: Set up training arguments
361
  args = TrainingArguments(
362
+ "boltuix/bert-ner",
363
+ eval_strategy="epoch",  # evaluate at the end of each epoch
364
  learning_rate=2e-5,
365
  per_device_train_batch_size=16,
366
  per_device_eval_batch_size=16,
367
+ num_train_epochs=1,
368
  weight_decay=0.01,
 
369
  report_to="none"
370
  )
371
+ # πŸ“Š Step 17: Initialize data collator for dynamic padding
 
372
  data_collator = DataCollatorForTokenClassification(tokenizer)
373
 
374
+ # πŸ“ˆ Step 18: Load evaluation metric
375
  metric = evaluate.load("seqeval")
376
 
377
+ # 🏷️ Step 19: Set label list and test metric computation
378
+ label_list = unique_tags
379
+ print("Label list:", label_list)
380
+ example = conll2025["train"][0]
381
+ labels = [label_list[i] for i in example["ner_tags"]]
382
+ print("Metric test:", metric.compute(predictions=[labels], references=[labels]))
383
+
384
+ # πŸ“‰ Step 20: Define function to compute evaluation metrics
385
  def compute_metrics(eval_preds):
386
+ """
387
+ Compute precision, recall, F1, and accuracy for NER.
388
+
389
+ Args:
390
+ eval_preds (tuple): Predicted logits and true labels.
391
+
392
+ Returns:
393
+ dict: Evaluation metrics.
394
+ """
395
  pred_logits, labels = eval_preds
396
  pred_logits = np.argmax(pred_logits, axis=2)
397
  predictions = [
398
+ [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
399
  for prediction, label in zip(pred_logits, labels)
400
  ]
401
  true_labels = [
402
+ [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
403
  for prediction, label in zip(pred_logits, labels)
404
  ]
405
  results = metric.compute(predictions=predictions, references=true_labels)
 
407
  "precision": results["overall_precision"],
408
  "recall": results["overall_recall"],
409
  "f1": results["overall_f1"],
410
+ "accuracy": results["overall_accuracy"],
411
  }
412
 
413
+ # πŸš€ Step 21: Initialize and train the trainer
414
  trainer = Trainer(
415
  model,
416
  args,
417
+ train_dataset=tokenized_datasets["train"],
418
+ eval_dataset=tokenized_datasets["validation"],
419
  data_collator=data_collator,
420
  tokenizer=tokenizer,
421
  compute_metrics=compute_metrics
422
  )
 
 
423
  trainer.train()
424
 
425
+ # πŸ’Ύ Step 22: Save the fine-tuned model
426
+ model.save_pretrained("boltuix/bert-ner")
427
+ tokenizer.save_pretrained("boltuix/bert-ner")  # keep the tokenizer alongside the model
428
+
429
+ # πŸ”— Step 23: Update model configuration with label mappings
430
+ id2label = {str(i): label for i, label in enumerate(label_list)}
431
+ label2id = {label: str(i) for i, label in enumerate(label_list)}
432
+ config = json.load(open("boltuix/bert-ner/config.json"))
433
+ config["id2label"] = id2label
434
+ config["label2id"] = label2id
435
+ json.dump(config, open("boltuix/bert-ner/config.json", "w"))
436
+
437
+ # πŸ”„ Step 24: Load the fine-tuned model
438
+ model_fine_tuned = AutoModelForTokenClassification.from_pretrained("boltuix/bert-ner")
439
+
440
+ # πŸ› οΈ Step 25: Create a pipeline for NER inference
441
+ nlp = pipeline("token-classification", model=model_fine_tuned, tokenizer=tokenizer)
442
+
443
+ # πŸ“ Step 26: Perform NER on an example sentence
444
+ example = "On July 4th, 2023, President Joe Biden visited the United Nations headquarters in New York to deliver a speech about international law and donated $5 million to relief efforts."
445
+ ner_results = nlp(example)
446
+ print("NER results for first example:", ner_results)
447
+
448
+ # πŸ“ Step 27: Perform NER on a property address and format output
449
+ example = "This page contains information about the property located at 1275 Kinnear Rd, Columbus, OH, 43212."
450
+ ner_results = nlp(example)
451
+
452
+ # 🧹 Step 28: Process NER results into structured entities
453
+ entities = defaultdict(list)
454
+ current_entity = ""
455
+ current_type = ""
456
+
457
+ for item in ner_results:
458
+     entity = item["entity"]
459
+     word = item["word"]
460
+     if word.startswith("##"):
461
+         current_entity += word[2:]  # Handle subword tokens
462
+     elif entity.startswith("B-"):
463
+         if current_entity and current_type:
464
+             entities[current_type].append(current_entity.strip())
465
+         current_type = entity[2:].lower()
466
+         current_entity = word
467
+     elif entity.startswith("I-") and entity[2:].lower() == current_type:
468
+         current_entity += " " + word  # Continue same entity
469
+     else:
470
+         if current_entity and current_type:
471
+             entities[current_type].append(current_entity.strip())
472
+         current_entity = ""
473
+         current_type = ""
474
+
475
+ # Append final entity if exists
476
+ if current_entity and current_type:
477
+     entities[current_type].append(current_entity.strip())
478
+
479
+ # πŸ“€ Step 29: Output the final JSON
480
+ final_json = dict(entities)
481
+ print("Structured NER output:")
482
+ print(json.dumps(final_json, indent=2))
483
  ```
484
 
485
  ### πŸ› οΈ Tips
486
+ - **Hyperparameters**: Experiment with `learning_rate` (1e-5 to 5e-5) or `num_train_epochs` (2-5), as sketched after this list.
487
+ - **GPU**: Use `fp16=True` for faster training.
488
+ - **Custom Data**: Modify the script for custom NER datasets.
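For example, a tuned run could start from arguments along these lines (the values are illustrative starting points, not the settings used for the released checkpoint):

```python
from transformers import TrainingArguments

# Illustrative hyperparameters only; adjust per the tips above.
args = TrainingArguments(
    output_dir="entitybert-custom",
    eval_strategy="epoch",
    learning_rate=3e-5,               # try 1e-5 to 5e-5
    num_train_epochs=3,               # try 2 to 5
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    fp16=True,                        # mixed precision on supported NVIDIA GPUs
    report_to="none",
)
```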
489
 
490
  ### ⏱️ Expected Training Time
491
+ - ~1.5 hours on an NVIDIA GPU (e.g., T4) for ~115,812 examples, 3 epochs, batch size 16 (note that the script above sets `num_train_epochs=1`; increase it to match).
492
 
493
  ### 🌍 Carbon Impact
494
+ - Emissions: ~40 g CO₂eq (estimated via ML Impact tool for 1.5 hours on GPU).
 
 
 
 
 
 
 
495
 
496
  ---
497
 
498
  ## πŸ› οΈ Installation
499
+
500
  ```bash
501
  pip install transformers torch pandas pyarrow seqeval
502
  ```
503
  - **Python**: 3.8+
504
+ - **Storage**: ~15 MB for model, ~6.38 MB for dataset
505
  - **Optional**: NVIDIA CUDA for GPU acceleration
506
 
507
  ### Download Instructions πŸ“₯
508
+ - **Model**: [boltuix/EntityBERT](https://huggingface.co/boltuix/EntityBERT) (placeholder, update with correct URL).
509
+ - **Dataset**: [boltuix/conll2025-ner](https://huggingface.co/datasets/boltuix/conll2025-ner) (placeholder, update with correct URL).
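Once the Parquet file is downloaded, it can be inspected with pandas or wrapped as a Hugging Face `Dataset`. A minimal sketch, assuming the `conll2025_ner.parquet` file name used elsewhere in this card:

```python
import pandas as pd
from datasets import Dataset

df = pd.read_parquet("conll2025_ner.parquet")
print(df["split"].value_counts())  # train / validation / test row counts

ds = Dataset.from_pandas(df)
print(ds)
```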
 
510
 
511
  ---
512
 
513
  ## πŸ§ͺ Evaluation Code
514
+ Evaluate on custom data:
515
 
516
  ```python
517
+ from transformers import AutoTokenizer, AutoModelForTokenClassification
518
+ from seqeval.metrics import classification_report
519
+ import torch
520
 
521
+ # Load model and tokenizer
522
+ tokenizer = AutoTokenizer.from_pretrained("boltuix/EntityBERT")
523
+ model = AutoModelForTokenClassification.from_pretrained("boltuix/EntityBERT")
524
 
525
  # Test data
526
+ texts = ["Elon Musk launched Tesla in California on March 2025."]
527
+ true_labels = [["B-PERSON", "I-PERSON", "O", "B-ORG", "O", "B-GPE", "O", "B-DATE", "I-DATE", "O"]]
528
+
529
+ pred_labels = []
530
+ for text in texts:
531
+     inputs = tokenizer(text, return_tensors="pt")
532
+     with torch.no_grad():
533
+         outputs = model(**inputs)
534
+     predictions = outputs.logits.argmax(dim=-1)[0].cpu().numpy()
535
+     tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
536
+     word_ids = inputs.word_ids(batch_index=0)
537
+     word_preds = []
538
+     previous_word_idx = None
539
+     for idx, word_idx in enumerate(word_ids):
540
+         if word_idx is None or word_idx == previous_word_idx:
541
+             continue
542
+         label = model.config.id2label[predictions[idx]]
543
+         word_preds.append(label)
544
+         previous_word_idx = word_idx
545
+     pred_labels.append(word_preds)
546
+
547
+ # Evaluate
548
+ print("Predicted:", pred_labels)
549
+ print("True :", true_labels)
550
+ print("\nπŸ“Š Evaluation Report:\n")
551
+ print(classification_report(true_labels, pred_labels))
552
  ```
553
 
554
  ---
555
 
556
  ## 🌱 Dataset Details
557
+ - **Entries**: 143,709
558
+ - **Size**: 6.38 MB (Parquet)
559
  - **Columns**: `split`, `tokens`, `ner_tags`
560
  - **Splits**: Train (~115,812), Validation (~15,680), Test (~12,217)
561
+ - **NER Tags**: 36 (18 entity types with B-/I- + O)
562
+ - **Source**: News, user-generated content, research corpora
 
563
 
564
  ---
565
 
566
  ## πŸ“Š Visualizing NER Tags
567
+ Compute tag distribution with:
568
 
569
  ```python
570
  import pandas as pd
 
572
  import matplotlib.pyplot as plt
573
 
574
  # Load dataset
575
+ df = pd.read_parquet("conll2025_ner.parquet")
 
 
576
  all_tags = [tag for tags in df["ner_tags"] for tag in tags]
577
  tag_counts = Counter(all_tags)
578
 
 
594
  ## βš–οΈ Comparison to Other Models
595
  | Model | Dataset | Parameters | F1 Score | Size |
596
  |----------------------|--------------------|------------|----------|--------|
597
+ | **EntityBERT** | conll2025-ner | ~4.4M | 0.85 | ~15 MB |
598
+ | NeuroBERT-NER | conll2025-ner | ~11M | 0.86 | ~50 MB |
599
  | BERT-base-NER | CoNLL-2003 | ~110M | ~0.89 | ~400 MB|
600
  | DistilBERT-NER | CoNLL-2003 | ~66M | ~0.85 | ~200 MB|
601
 
602
  **Advantages**:
603
+ - Ultra-lightweight (~4.4M parameters, ~15 MB)
604
+ - Competitive F1 score (0.85)
605
+ - Ideal for resource-constrained environments
606
 
607
  ---
608
 
609
  ## 🌐 Community and Support
610
+ - πŸ“ Model page: [boltuix/EntityBERT](https://huggingface.co/boltuix/EntityBERT) (placeholder)
611
+ - πŸ› οΈ Issues/Contributions: Model repository (URL TBD)
612
+ - πŸ’¬ Hugging Face forums: [https://huggingface.co/discussions](https://huggingface.co/discussions)
613
+ - πŸ“š Docs: [Hugging Face Transformers](https://huggingface.co/docs/transformers)
614
+ - πŸ“§ Contact: [[email protected]](mailto:[email protected])
615
 
616
  ---
617
 
 
623
  ---
624
 
625
  ## πŸ“… Last Updated
626
+ **June 11, 2025**: Released v1.0 with fine-tuning on `boltuix/conll2025-ner`.
627
 
628
  **[Get Started Now](#getting-started)** πŸš€