---
license: apache-2.0
---

<div align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/67d9504a41d31cc626fcecc8/cE7UgFfJJ2gUHJr0SSEhc.png">
</div>

[Paper](https://arxiv.org/abs/2503.19740) - [GitHub](https://github.com/visurg-ai/LEMON)

This repository provides the models used in the data curation pipeline described in [LEMON: A Large Endoscopic MONocular Dataset and Foundation Model for Perception in Surgical Settings](https://arxiv.org/abs/2503.19740), which we used to construct the LEMON dataset. For more details about the LEMON dataset and our LemonFM foundation model, please visit our [GitHub repository](https://github.com/visurg-ai/LEMON).

If you use our dataset, model, or code in your research, please cite our paper:

```
@misc{che2025lemonlargeendoscopicmonocular,
      title={LEMON: A Large Endoscopic MONocular Dataset and Foundation Model for Perception in Surgical Settings},
      author={Chengan Che and Chao Wang and Tom Vercauteren and Sophia Tsoka and Luis C. Garcia-Peraza-Herrera},
      year={2025},
      eprint={2503.19740},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.19740},
}
```

This Hugging Face repository includes video storyboard classification models, frame classification models, and non-surgical object detection models. The model loader used in the examples below is available in [model_loader.py](https://huggingface.co/visurg/Surg3M_curation_models/blob/main/model_loader.py).

<div align="center">
<table style="margin-left: auto; margin-right: auto;">
  <tr>
    <th>Model</th>
    <th>Architecture</th>
    <th>Download</th>
  </tr>
  <tr>
    <td>Video storyboard classification models</td>
    <td>ResNet-18</td>
    <td><a href="https://huggingface.co/visurg/Surg3M_curation_models/tree/main/video_storyboard_classification">Full ckpt</a></td>
  </tr>
  <tr>
    <td>Frame classification models</td>
    <td>ResNet-18</td>
    <td><a href="https://huggingface.co/visurg/Surg3M_curation_models/tree/main/frame_classification">Full ckpt</a></td>
  </tr>
  <tr>
    <td>Non-surgical object detection models</td>
    <td>YOLOv8-Nano</td>
    <td><a href="https://huggingface.co/visurg/Surg3M_curation_models/tree/main/nonsurgical_object_detection">Full ckpt</a></td>
  </tr>
</table>
</div>

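
The checkpoints can also be fetched programmatically. The following is a minimal sketch, assuming the `huggingface_hub` package is installed. The repository ID and folder names are taken from the download links in the table; the `download_models` helper, its `kind` keys, and the `checkpoints` directory are hypothetical names chosen for illustration.

```python
from pathlib import Path

# Repository ID and folder names come from the download links in the table.
REPO_ID = "visurg/Surg3M_curation_models"
MODEL_FOLDERS = {
    "storyboard": "video_storyboard_classification",
    "frame": "frame_classification",
    "detection": "nonsurgical_object_detection",
}


def download_models(kind: str, local_dir: str = "checkpoints") -> Path:
    """Download one model folder plus model_loader.py; return the folder path.

    `kind` and `local_dir` are hypothetical names used for this sketch.
    """
    # Imported lazily so the constants above work without the dependency.
    from huggingface_hub import snapshot_download

    folder = MODEL_FOLDERS[kind]
    root = snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=[f"{folder}/*", "model_loader.py"],
        local_dir=local_dir,
    )
    return Path(root) / folder
```

Calling `download_models("frame")` should leave the frame classification checkpoints under `checkpoints/frame_classification`. The file names inside each folder are not listed here, so inspect the returned directory to pick the checkpoint to load.
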
The data curation pipeline leading to the clean videos in the LEMON dataset is as follows:

<div align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/67d9504a41d31cc626fcecc8/jzw36jlPT-V_I-Vm01OzO.png">
</div>

Usage
--------

**Video storyboard classification models** are used in step **2** of the data curation pipeline to classify a video storyboard as either surgical or non-surgical. They can be used as follows:

```python
import torch
import torchvision
from PIL import Image
from model_loader import build_model

# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Build the model. Set model_path to a checkpoint downloaded from the
# video_storyboard_classification folder of this repository.
net = build_model(mode='classify')
model_path = 'path/to/your/checkpoint'

# Enable multi-GPU support and load the weights
net = torch.nn.DataParallel(net)
torch.backends.cudnn.benchmark = True
state = torch.load(model_path, map_location=torch.device('cpu'))
net.load_state_dict(state['net'])
net.to(device)
net.eval()

# Load the video storyboard and convert it to a PyTorch tensor
img_path = 'path/to/your/image.jpg'
img = Image.open(img_path).convert('RGB')
img = img.resize((224, 224))
transform = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize(
        (0.4299694, 0.29676908, 0.27707579),
        (0.24373249, 0.20208984, 0.19319402)
    )
])
img_tensor = transform(img).unsqueeze(0).to(device)

# Classify the storyboard as surgical or non-surgical
with torch.no_grad():
    outputs = net(img_tensor)
```

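The classifier output still has to be turned into a surgical or non-surgical decision. The sketch below shows one way to do that, assuming the network returns one logit per class for each input. The `CLASS_NAMES` order is a hypothetical placeholder (this README does not state which index is "surgical"), so verify it against the training code before relying on it. The helper names `softmax` and `predict_label` are also illustrative.

```python
import math
from typing import List, Sequence, Tuple

# ASSUMPTION: hypothetical index-to-label order; verify before use.
CLASS_NAMES = ("non-surgical", "surgical")


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a flat sequence of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(
    logits: Sequence[float],
    class_names: Sequence[str] = CLASS_NAMES,
) -> Tuple[str, float]:
    """Return the most likely class name and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]


# With the snippet above, the logits would be: outputs[0].tolist()
label, confidence = predict_label([-1.2, 2.3])
print(label, round(confidence, 3))
```

Working on plain Python lists keeps the helper independent of the device the model runs on.
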
**Frame classification models** are used in step **3** of the data curation pipeline to classify a frame as either surgical or non-surgical. They can be used as follows:

```python
import torch
import torchvision
from PIL import Image
from model_loader import build_model

# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Build the model. Set model_path to a checkpoint downloaded from the
# frame_classification folder of this repository.
net = build_model(mode='classify')
model_path = 'path/to/your/checkpoint'

# Enable multi-GPU support and load the weights
net = torch.nn.DataParallel(net)
torch.backends.cudnn.benchmark = True
state = torch.load(model_path, map_location=torch.device('cpu'))
net.load_state_dict(state['net'])
net.to(device)
net.eval()

# Load the frame and convert it to a PyTorch tensor
img_path = 'path/to/your/image.jpg'
img = Image.open(img_path).convert('RGB')
img = img.resize((224, 224))
transform = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize(
        (0.4299694, 0.29676908, 0.27707579),
        (0.24373249, 0.20208984, 0.19319402)
    )
])
img_tensor = transform(img).unsqueeze(0).to(device)

# Classify the frame as surgical or non-surgical
with torch.no_grad():
    outputs = net(img_tensor)
```

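Once every frame of a video has a score, the non-surgical frames can be dropped. The sketch below illustrates that filtering step under stated assumptions: each frame has already been converted to a surgical probability, and frames at or above a threshold are kept. The function name `keep_surgical_frames` and the 0.5 threshold are hypothetical and are not taken from the LEMON pipeline.

```python
from typing import Iterable, List, Tuple


def keep_surgical_frames(
    scored_frames: Iterable[Tuple[str, float]],
    threshold: float = 0.5,  # hypothetical cut-off, tune for your data
) -> List[str]:
    """Return the frame paths whose surgical probability is >= threshold.

    `scored_frames` yields (frame_path, surgical_probability) pairs and
    the input order is preserved.
    """
    return [path for path, prob in scored_frames if prob >= threshold]


scores = [("frame_0001.jpg", 0.93), ("frame_0002.jpg", 0.12), ("frame_0003.jpg", 0.71)]
print(keep_surgical_frames(scores))
```

A higher threshold trades recall for cleaner output, which may suit curation better than the default when false positives are costly.
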
**Non-surgical object detection models** are used to mask out non-surgical regions in surgical frames (e.g. user interface information). They can be used as follows:

```python
import torch
import torchvision
from PIL import Image
from model_loader import build_model

# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Build the model. Set model_path to a checkpoint downloaded from the
# nonsurgical_object_detection folder of this repository.
net = build_model(mode='mask')
model_path = 'path/to/your/checkpoint'

# Enable multi-GPU support and load the weights
net = torch.nn.DataParallel(net)
torch.backends.cudnn.benchmark = True
state = torch.load(model_path, map_location=torch.device('cpu'))
net.load_state_dict(state['net'])
net.to(device)
net.eval()

# Load the surgical frame and convert it to a PyTorch tensor
img_path = 'path/to/your/image.jpg'
img = Image.open(img_path).convert('RGB')
img = img.resize((224, 224))
transform = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize(
        (0.4299694, 0.29676908, 0.27707579),
        (0.24373249, 0.20208984, 0.19319402)
    )
])
img_tensor = transform(img).unsqueeze(0).to(device)

# Detect the non-surgical regions in the frame
with torch.no_grad():
    outputs = net(img_tensor)
```
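
Because the frame is resized to 224x224 before inference, any detected region has to be mapped back to the original resolution before it can be masked. The sketch below shows that coordinate mapping, assuming the detections are available as `(x1, y1, x2, y2)` pixel boxes in the resized image. That box format, the `scale_box` name, and the default sizes are assumptions for illustration; the actual output format of the detector is defined by `model_loader.py`.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) in pixels


def scale_box(
    box: Box,
    resized: Tuple[int, int] = (224, 224),     # (width, height) fed to the model
    original: Tuple[int, int] = (1920, 1080),  # hypothetical source frame size
) -> Tuple[int, int, int, int]:
    """Map a box from the resized image back to the original frame."""
    x1, y1, x2, y2 = box
    scale_x = original[0] / resized[0]
    scale_y = original[1] / resized[1]
    return (
        round(x1 * scale_x),
        round(y1 * scale_y),
        round(x2 * scale_x),
        round(y2 * scale_y),
    )


print(scale_box((0, 0, 112, 112)))
```

The scaled box can then be blacked out on the original frame, for example with Pillow's `ImageDraw.Draw(frame).rectangle(box, fill=(0, 0, 0))`.
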