from dataclasses import dataclass
from enum import Enum


@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str


# Select your tasks here
# ---------------------------------------------------
class Tasks(Enum):
    # task_key in the json file, metric_key in the json file, name to display in the leaderboard
    emea_ner = Task("emea_ner", "f1", "EMEA NER")
    medline_ner = Task("medline_ner", "f1", "MEDLINE NER")


NUM_FEWSHOT = 0  # Change with your few shot
# ---------------------------------------------------
# Your leaderboard name
TITLE = """<h1 align="center" id="space-title">🏥 French Medical NLP Leaderboard</h1>"""

# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
This leaderboard evaluates French NLP models on biomedical Named Entity Recognition (NER) tasks.
We focus on BERT-like models, with plans to extend to other architectures.

**Current Tasks:**
- **EMEA NER**: Named Entity Recognition on French medical texts from EMEA (European Medicines Agency)
- **MEDLINE NER**: Named Entity Recognition on French medical abstracts from MEDLINE

**Entity Types:** ANAT, CHEM, DEVI, DISO, GEOG, LIVB, OBJC, PHEN, PHYS, PROC
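
Under the IOB2 tagging used for evaluation, these 10 entity types expand to 21 token-level labels. A sketch (the exact label order comes from each dataset's features, so treat this as illustrative):

```python
ENTITY_TYPES = ["ANAT", "CHEM", "DEVI", "DISO", "GEOG",
                "LIVB", "OBJC", "PHEN", "PHYS", "PROC"]
# "O" plus a B-/I- pair per type: 1 + 2 * 10 = 21 labels.
LABELS = ["O"] + [f"{prefix}-{t}" for t in ENTITY_TYPES for prefix in ("B", "I")]
```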
""" | |
# Which evaluations are you running? how can people reproduce what you have?
LLM_BENCHMARKS_TEXT = f"""
## How it works

We evaluate models by **fine-tuning** them on French medical NER tasks, following the CamemBERT-bio methodology:

**Fine-tuning Parameters** (see the sketch after this list):
- **Optimizer**: AdamW (following the CamemBERT-bio paper)
- **Learning Rate**: 5e-5 (optimal from the Optuna search, kept unchanged)
- **Scheduler**: Cosine with restarts (22.4% warmup ratio)
- **Steps**: 2000 (same as the paper)
- **Batch Size**: 4 (CPU constraint)
- **Gradient Accumulation**: 4 steps (effective batch size 16)
- **Max Length**: 512 tokens
- **Output**: Simple linear layer (no CRF)
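
As a rough guide, these settings map onto Hugging Face `TrainingArguments` as follows. This is a minimal sketch, not the leaderboard's actual harness; `output_dir` is a placeholder:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="ner-finetune",                 # hypothetical path
    optim="adamw_torch",                       # AdamW
    learning_rate=5e-5,                        # unchanged from the Optuna search
    lr_scheduler_type="cosine_with_restarts",
    warmup_ratio=0.224,                        # 22.4% warmup
    max_steps=2000,
    per_device_train_batch_size=4,             # CPU constraint
    gradient_accumulation_steps=4,             # effective batch size 16
)
# Max length (512 tokens) is applied at tokenization time,
# e.g. tokenizer(..., truncation=True, max_length=512).
```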
**Evaluation**: Uses seqeval with the IOB2 scheme for entity-level **micro F1**, precision, and recall (see the snippet below).
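
For illustration, strict entity-level scoring with seqeval looks like this (toy tags, not real predictions):

```python
from seqeval.metrics import f1_score
from seqeval.scheme import IOB2

# One sentence: gold vs. predicted IOB2 tags.
y_true = [["O", "B-DISO", "I-DISO", "O", "B-CHEM"]]
y_pred = [["O", "B-DISO", "I-DISO", "O", "O"]]

# Strict entity-level matching under IOB2 (micro-averaged by default).
print(f1_score(y_true, y_pred, mode="strict", scheme=IOB2))
```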
## Reproducibility

Results are obtained through proper fine-tuning, not zero-shot evaluation. Each model is fine-tuned independently on each task.

**Datasets** (loadable as shown below):
- EMEA: `rntc/quaero-frenchmed-ner-emea-sen`
- MEDLINE: `rntc/quaero-frenchmed-ner-medline`
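
Both datasets are public on the Hub and can be pulled with the `datasets` library; inspect the splits and column names yourself before wiring up training:

```python
from datasets import load_dataset

emea = load_dataset("rntc/quaero-frenchmed-ner-emea-sen")
medline = load_dataset("rntc/quaero-frenchmed-ner-medline")

# Print the dataset dicts to see available splits and features.
print(emea)
print(medline)
```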
""" | |
EVALUATION_QUEUE_TEXT = """
## Before submitting a model

### 1) Ensure your model is compatible with AutoClasses:
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("your_model_name")
model = AutoModelForTokenClassification.from_pretrained("your_model_name")
```

### 2) Model requirements:
- Must be a fine-tuned model for token classification (not just a base model)
- Should be trained on French medical NER data
- Must be publicly available on the Hugging Face Hub
- Prefer the safetensors format for faster loading

### 3) Expected performance:
- Base models without fine-tuning will get very low scores (~0.02 F1)
- Fine-tuned models should achieve significantly higher scores

### 4) Model card recommendations:
- Specify the training dataset used
- Include model architecture details
- Add performance metrics if available
- Use an open license

## Troubleshooting

If your model fails evaluation:
1. Check that it loads properly with AutoModelForTokenClassification
2. Verify it's trained for token classification, not just language modeling (see the config check below)
3. Ensure the model is public and accessible
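
A quick sanity check that reads only the model config (`your_model_name` is a placeholder):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("your_model_name")  # placeholder repo id
print(config.architectures)  # should list a ...ForTokenClassification class
print(config.id2label)       # fine-tuned NER models expose entity labels here,
                             # not generic LABEL_0 / LABEL_1 placeholders
```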
""" | |

CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""
"""