from dataclasses import dataclass
from enum import Enum

@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str
# Select your tasks here
# ---------------------------------------------------
class Tasks(Enum):
    # task_key in the json file, metric_key in the json file, name to display in the leaderboard
    emea_ner = Task("emea_ner", "f1", "EMEA NER")
    medline_ner = Task("medline_ner", "f1", "MEDLINE NER")
NUM_FEWSHOT = 0  # number of few-shot examples (models here are fine-tuned, so 0)
# ---------------------------------------------------
# Your leaderboard name
TITLE = """<h1 align="center" id="space-title">🏥 French Medical NLP Leaderboard</h1>"""
# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
This leaderboard evaluates French NLP models on biomedical Named Entity Recognition (NER) tasks.
We focus on BERT-like models, with plans to extend to other architectures.

**Current Tasks:**
- **EMEA NER**: Named Entity Recognition on French medical texts from EMEA (European Medicines Agency)
- **MEDLINE NER**: Named Entity Recognition on French medical abstracts from MEDLINE

**Entity Types:** ANAT, CHEM, DEVI, DISO, GEOG, LIVB, OBJC, PHEN, PHYS, PROC
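
As an illustration, under the IOB2 tagging scheme used for evaluation, these ten entity types yield the following tag set (a sketch; each dataset defines its own label order):

```python
# The ten entity types above; under IOB2 each type gets a B- (begin)
# and an I- (inside) tag, plus a single O tag for non-entity tokens.
ENTITY_TYPES = ["ANAT", "CHEM", "DEVI", "DISO", "GEOG",
                "LIVB", "OBJC", "PHEN", "PHYS", "PROC"]
LABELS = ["O"] + [prefix + "-" + etype
                  for etype in ENTITY_TYPES
                  for prefix in ("B", "I")]
print(len(LABELS))  # 21 tags in total
```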
"""
# Which evaluations are you running? How can people reproduce them?
LLM_BENCHMARKS_TEXT = """
## How it works

We evaluate models by **fine-tuning** them on each French medical NER task, following the CamemBERT-bio methodology.

**Fine-tuning Parameters:**
- **Optimizer**: AdamW (following the CamemBERT-bio paper)
- **Learning Rate**: 5e-5 (optimal from the Optuna search, unchanged)
- **Scheduler**: Cosine with restarts (22.4% warmup ratio)
- **Steps**: 2000 (same as the paper)
- **Batch Size**: 4 (CPU constraint)
- **Gradient Accumulation**: 4 steps (effective batch size 16)
- **Max Length**: 512 tokens
- **Output**: Simple linear layer (no CRF)
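
For orientation, the settings above can be written out as plain values (the names below merely mirror common `transformers.TrainingArguments` fields and are illustrative, not the exact training script):

```python
# Hyperparameters as stated above; the variable names are hypothetical
# and only mirror common transformers.TrainingArguments fields.
learning_rate = 5e-5
lr_scheduler_type = "cosine_with_restarts"
warmup_ratio = 0.224
max_steps = 2000
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_seq_length = 512

# Gradient accumulation multiplies the per-step batch size.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 16
```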

**Evaluation**: seqeval with the IOB2 scheme, reporting entity-level **micro F1**, precision, and recall.
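
To make "entity-level micro F1" concrete, here is a minimal pure-Python sketch of the IOB2 span matching that seqeval performs (illustrative only; the leaderboard itself uses seqeval):

```python
def iob2_spans(tags):
    # Collect (start, end, type) spans from an IOB2 tag sequence
    # (end is exclusive). A span closes at O, at a new B- tag, or
    # when the I- type no longer matches.
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):
        if start is not None and (tag == "O" or tag.startswith("B-") or tag[2:] != etype):
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans

y_true = ["B-DISO", "I-DISO", "O", "B-CHEM"]
y_pred = ["B-DISO", "I-DISO", "O", "B-PROC"]

true_spans = iob2_spans(y_true)
pred_spans = iob2_spans(y_pred)

# A predicted entity counts only if start, end, AND type all match.
tp = sum(1 for span in pred_spans if span in true_spans)
precision = tp / len(pred_spans)
recall = tp / len(true_spans)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)  # 0.5 0.5 0.5
```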

## Reproducibility

Results are obtained through proper fine-tuning, not zero-shot evaluation; each model is fine-tuned independently on each task.

**Datasets:**
- EMEA: `rntc/quaero-frenchmed-ner-emea-sen`
- MEDLINE: `rntc/quaero-frenchmed-ner-medline`
"""
EVALUATION_QUEUE_TEXT = """
## Before submitting a model

### 1) Ensure your model is compatible with AutoClasses:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("your_model_name")
model = AutoModelForTokenClassification.from_pretrained("your_model_name")
```

### 2) Model requirements:

- Must be a fine-tuned model for token classification (not just a base model)
- Should be trained on French medical NER data
- Must be publicly available on the Hugging Face Hub
- Safetensors format is preferred, for faster loading

### 3) Expected performance:

- Base models without fine-tuning will get very low scores (~0.02 F1)
- Fine-tuned models should achieve significantly higher scores

### 4) Model card recommendations:

- Specify the training dataset used
- Include model architecture details
- Add performance metrics if available
- Use an open license

## Troubleshooting

If your model fails evaluation:

1. Check that it loads properly with `AutoModelForTokenClassification`
2. Verify it is trained for token classification (not just language modeling)
3. Ensure the model is public and accessible
"""
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""
"""