from dataclasses import dataclass
from enum import Enum

@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str
# Select your tasks here
# ---------------------------------------------------
class Tasks(Enum):
    # task_key in the json file, metric_key in the json file, name to display in the leaderboard
    emea_ner = Task("emea_ner", "f1", "EMEA NER")
    medline_ner = Task("medline_ner", "f1", "MEDLINE NER")
NUM_FEWSHOT = 0  # number of few-shot examples (models here are fine-tuned, so 0)
# ---------------------------------------------------
# Your leaderboard name
TITLE = """<h1 align="center" id="space-title">🏥 French Medical NLP Leaderboard</h1>"""
# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
This leaderboard evaluates French NLP models on biomedical Named Entity Recognition (NER) tasks.
We focus on BERT-like models, with plans to extend to other architectures.

**Current Tasks:**
- **EMEA NER**: Named Entity Recognition on French medical texts from EMEA (European Medicines Agency)
- **MEDLINE NER**: Named Entity Recognition on French medical abstracts from MEDLINE

**Entity Types:** ANAT, CHEM, DEVI, DISO, GEOG, LIVB, OBJC, PHEN, PHYS, PROC
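
As an illustration, under the IOB2 tagging scheme used for evaluation, these ten entity types yield the following tag set (a sketch; each dataset defines its own label order):

```python
# The ten entity types above; under IOB2 each type gets a B- (begin)
# and an I- (inside) tag, plus a single O tag for non-entity tokens.
ENTITY_TYPES = ["ANAT", "CHEM", "DEVI", "DISO", "GEOG",
                "LIVB", "OBJC", "PHEN", "PHYS", "PROC"]
LABELS = ["O"] + [prefix + "-" + etype
                  for etype in ENTITY_TYPES
                  for prefix in ("B", "I")]
print(len(LABELS))  # 21 tags in total
```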
"""
# Which evaluations are you running? How can people reproduce them?
LLM_BENCHMARKS_TEXT = """
## How it works

We evaluate models by **fine-tuning** them on each French medical NER task, following the CamemBERT-bio methodology.

**Fine-tuning Parameters:**
- **Optimizer**: AdamW (following the CamemBERT-bio paper)
- **Learning Rate**: 5e-5 (optimal from the Optuna search, unchanged)
- **Scheduler**: Cosine with restarts (22.4% warmup ratio)
- **Steps**: 2000 (same as the paper)
- **Batch Size**: 4 (CPU constraint)
- **Gradient Accumulation**: 4 steps (effective batch size 16)
- **Max Length**: 512 tokens
- **Output**: Simple linear layer (no CRF)
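
For orientation, the settings above can be written out as plain values (the names below merely mirror common `transformers.TrainingArguments` fields and are illustrative, not the exact training script):

```python
# Hyperparameters as stated above; the variable names are hypothetical
# and only mirror common transformers.TrainingArguments fields.
learning_rate = 5e-5
lr_scheduler_type = "cosine_with_restarts"
warmup_ratio = 0.224
max_steps = 2000
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_seq_length = 512

# Gradient accumulation multiplies the per-step batch size.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 16
```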

**Evaluation**: seqeval with the IOB2 scheme, reporting entity-level **micro F1**, precision, and recall.
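
To make "entity-level micro F1" concrete, here is a minimal pure-Python sketch of the IOB2 span matching that seqeval performs (illustrative only; the leaderboard itself uses seqeval):

```python
def iob2_spans(tags):
    # Collect (start, end, type) spans from an IOB2 tag sequence
    # (end is exclusive). A span closes at O, at a new B- tag, or
    # when the I- type no longer matches.
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):
        if start is not None and (tag == "O" or tag.startswith("B-") or tag[2:] != etype):
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans

y_true = ["B-DISO", "I-DISO", "O", "B-CHEM"]
y_pred = ["B-DISO", "I-DISO", "O", "B-PROC"]

true_spans = iob2_spans(y_true)
pred_spans = iob2_spans(y_pred)

# A predicted entity counts only if start, end, AND type all match.
tp = sum(1 for span in pred_spans if span in true_spans)
precision = tp / len(pred_spans)
recall = tp / len(true_spans)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)  # 0.5 0.5 0.5
```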

## Reproducibility

Results are obtained through proper fine-tuning, not zero-shot evaluation; each model is fine-tuned independently on each task.

**Datasets:**
- EMEA: `rntc/quaero-frenchmed-ner-emea-sen`
- MEDLINE: `rntc/quaero-frenchmed-ner-medline`
"""
EVALUATION_QUEUE_TEXT = """
## Before submitting a model

### 1) Ensure your model is compatible with AutoClasses:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("your_model_name")
model = AutoModelForTokenClassification.from_pretrained("your_model_name")
```

### 2) Model requirements:

- Must be a fine-tuned model for token classification (not just a base model)
- Should be trained on French medical NER data
- Must be publicly available on the Hugging Face Hub
- Safetensors format is preferred, for faster loading

### 3) Expected performance:

- Base models without fine-tuning will get very low scores (~0.02 F1)
- Fine-tuned models should achieve significantly higher scores

### 4) Model card recommendations:

- Specify the training dataset used
- Include model architecture details
- Add performance metrics if available
- Use an open license

## Troubleshooting

If your model fails evaluation:

1. Check that it loads properly with `AutoModelForTokenClassification`
2. Verify it is trained for token classification (not just language modeling)
3. Ensure the model is public and accessible
"""
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""
"""