Spaces:

rntc
/

leaderboard-test

Sleeping

File size: 3,467 Bytes

from dataclasses import dataclass
from enum import Enum

@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str


# Select your tasks here
# ---------------------------------------------------
class Tasks(Enum):
    # task_key in the json file, metric_key in the json file, name to display in the leaderboard 
    emea_ner = Task("emea_ner", "f1", "EMEA NER")
    medline_ner = Task("medline_ner", "f1", "MEDLINE NER")

NUM_FEWSHOT = 0 # Change with your few shot
# ---------------------------------------------------



# Your leaderboard name
TITLE = """<h1 align="center" id="space-title">🏥 French Medical NLP Leaderboard</h1>"""

# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
This leaderboard evaluates French NLP models on biomedical Named Entity Recognition (NER) tasks.
We focus on BERT-like models with plans to extend to other architectures.

**Current Tasks:**
- **EMEA NER**: Named Entity Recognition on French medical texts from EMEA (European Medicines Agency)
- **MEDLINE NER**: Named Entity Recognition on French medical abstracts from MEDLINE

**Entity Types:** ANAT, CHEM, DEVI, DISO, GEOG, LIVB, OBJC, PHEN, PHYS, PROC
"""

# Which evaluations are you running? how can people reproduce what you have?
LLM_BENCHMARKS_TEXT = f"""
## How it works

We evaluate models by **fine-tuning** them on French medical NER tasks following the CamemBERT-bio methodology:

**Fine-tuning Parameters:**
- **Optimizer**: AdamW (following CamemBERT-bio paper)
- **Learning Rate**: 5e-5 (optimal from Optuna search - unchanged)
- **Scheduler**: Cosine with restarts (22.4% warmup ratio)
- **Steps**: 2000 (same as paper)
- **Batch Size**: 4 (CPU constraint)
- **Gradient Accumulation**: 4 steps (effective batch size 16)
- **Max Length**: 512 tokens
- **Output**: Simple linear layer (no CRF)

**Evaluation**: Uses seqeval with IOB2 scheme for entity-level **micro F1**, precision, and recall.

## Reproducibility
Results are obtained through proper fine-tuning, not zero-shot evaluation. Each model is fine-tuned independently on each task.

**Datasets:**
- EMEA: `rntc/quaero-frenchmed-ner-emea-sen`
- MEDLINE: `rntc/quaero-frenchmed-ner-medline`
"""

EVALUATION_QUEUE_TEXT = """
## Before submitting a model

### 1) Ensure your model is compatible with AutoClasses:
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
tokenizer = AutoTokenizer.from_pretrained("your_model_name")
model = AutoModelForTokenClassification.from_pretrained("your_model_name")
```

### 2) Model requirements:
- Must be a fine-tuned model for token classification (not just a base model)
- Should be trained on French medical NER data
- Must be publicly available on Hugging Face Hub
- Prefer safetensors format for faster loading

### 3) Expected performance:
- Base models without fine-tuning will get very low scores (~0.02 F1)
- Fine-tuned models should achieve significantly higher scores

### 4) Model card recommendations:
- Specify the training dataset used
- Include model architecture details
- Add performance metrics if available
- Use an open license

## Troubleshooting
If your model fails evaluation:
1. Check that it loads properly with AutoModelForTokenClassification
2. Verify it's trained for token classification (not just language modeling)
3. Ensure the model is public and accessible
"""

CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""
"""