---
license: apache-2.0
---

<div align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/67d9504a41d31cc626fcecc8/cE7UgFfJJ2gUHJr0SSEhc.png">
</div>




[📚 Paper](https://arxiv.org/abs/2503.19740) - [🤖 GitHub](https://github.com/visurg-ai/LEMON)

We provide the models used in the data curation pipeline of [📚 LEMON: A Large Endoscopic MONocular Dataset and Foundation Model for Perception in Surgical Settings](https://arxiv.org/abs/2503.19740), which were used to construct the LEMON dataset. For more details about the LEMON dataset and our LemonFM foundation model, please visit our GitHub repository at [🤖 GitHub](https://github.com/visurg-ai/LEMON).


If you use our dataset, model, or code in your research, please cite our paper:

```bibtex
@misc{che2025lemonlargeendoscopicmonocular,
  title={LEMON: A Large Endoscopic MONocular Dataset and Foundation Model for Perception in Surgical Settings},
  author={Chengan Che and Chao Wang and Tom Vercauteren and Sophia Tsoka and Luis C. Garcia-Peraza-Herrera},
  year={2025},
  eprint={2503.19740},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.19740},
}
```



This Hugging Face repository includes video storyboard classification models, frame classification models, and non-surgical object detection models. The model loader can be found in [model_loader.py](https://huggingface.co/visurg/Surg3M_curation_models/blob/main/model_loader.py).


<div align="center">
<table style="margin-left: auto; margin-right: auto;">
<tr>
<th>Model</th>
<th>Architecture</th>
<th>Download</th>
</tr>
<tr>
<td>Video storyboard classification models</td>
<td>ResNet-18</td>
<td><a href="https://huggingface.co/visurg/Surg3M_curation_models/tree/main/video_storyboard_classification">Full ckpt</a></td>
</tr>
<tr>
<td>Frame classification models</td>
<td>ResNet-18</td>
<td><a href="https://huggingface.co/visurg/Surg3M_curation_models/tree/main/frame_classification">Full ckpt</a></td>
</tr>
<tr>
<td>Non-surgical object detection models</td>
<td>YOLOv8-Nano</td>
<td><a href="https://huggingface.co/visurg/Surg3M_curation_models/tree/main/nonsurgical_object_detection">Full ckpt</a></td>
</tr>
</table>
</div>


The data curation pipeline that produces the clean videos in the LEMON dataset is as follows:
<div align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/67d9504a41d31cc626fcecc8/jzw36jlPT-V_I-Vm01OzO.png">
</div>
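
Each model family enters this pipeline at a different point: storyboard classification in step 2, frame classification in step 3, and masking of non-surgical regions in the surgical frames. As a rough illustration of that control flow only, the sketch below chains the three as successive filters. `Video`, `curate` and the three callables are hypothetical stand-ins for the models in this repository, and the other steps shown in the figure are not modelled.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Video:
    """Hypothetical container: a video's storyboard image plus its decoded frames."""
    storyboard: object
    frames: List[object]


def curate(
    videos: Sequence[Video],
    is_surgical_storyboard: Callable[[object], bool],  # stand-in for the step 2 classifier
    is_surgical_frame: Callable[[object], bool],       # stand-in for the step 3 classifier
    mask_nonsurgical: Callable[[object], object],      # stand-in for detection + masking
) -> List[List[object]]:
    """Return, for each kept video, its surgical frames with non-surgical regions masked."""
    curated = []
    for video in videos:
        # Step 2: drop whole videos whose storyboard is classified as non-surgical
        if not is_surgical_storyboard(video.storyboard):
            continue
        # Step 3: keep only the frames classified as surgical
        frames = [f for f in video.frames if is_surgical_frame(f)]
        # Masking: obliterate non-surgical regions (e.g. UI overlays) in the kept frames
        curated.append([mask_nonsurgical(f) for f in frames])
    return curated


# Toy usage with strings in place of images and models
videos = [Video("sb_surg", ["s1", "x", "s2"]), Video("sb_other", ["s3"])]
print(curate(videos, lambda sb: sb.startswith("sb_surg"),
             lambda f: f.startswith("s"), lambda f: f.upper()))  # → [['S1', 'S2']]
```

Passing the models in as callables keeps the orchestration independent of how each checkpoint is loaded.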

Usage
--------
The video storyboard classification models are used in step 2 of the data curation pipeline to classify a video storyboard as either surgical or non-surgical. They can be used as follows:
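```python
import torch
import torchvision
from PIL import Image
from model_loader import build_model

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Build the classifier and point to a downloaded checkpoint
net = build_model(mode='classify')
model_path = 'path/to/video_storyboard_classification/checkpoint'  # placeholder: a checkpoint file from video_storyboard_classification/

# Enable multi-GPU support; the model is wrapped before the weights are loaded
net = torch.nn.DataParallel(net)
torch.backends.cudnn.benchmark = True
state = torch.load(model_path, map_location=torch.device('cpu'))
net.load_state_dict(state['net'])
net = net.to(device)  # weights must live on the same device as the input
net.eval()

# Load the video storyboard and convert it to a normalized PyTorch tensor
img_path = 'path/to/your/image.jpg'
img = Image.open(img_path).convert('RGB')  # ensure 3 channels (drops alpha, expands grayscale)
img = img.resize((224, 224))
transform = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize(
        (0.4299694, 0.29676908, 0.27707579),
        (0.24373249, 0.20208984, 0.19319402),
    ),
])
img_tensor = transform(img).unsqueeze(0).to(device)

# Classify the storyboard (inference only, so gradients are disabled)
with torch.no_grad():
    outputs = net(img_tensor)
```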
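
The preprocessing above follows standard torchvision semantics: `ToTensor` scales 8-bit RGB values from [0, 255] to [0.0, 1.0], and `Normalize` then computes `(x - mean) / std` per channel using the statistics given in the snippet. The pure-Python sketch below reproduces that arithmetic for a single pixel, which can help when checking a preprocessing pipeline re-implemented outside PyTorch; the function name `normalize_pixel` is hypothetical.

```python
# Per-channel statistics copied from the usage snippet (RGB order)
MEAN = (0.4299694, 0.29676908, 0.27707579)
STD = (0.24373249, 0.20208984, 0.19319402)


def normalize_pixel(rgb):
    """Mimic ToTensor() followed by Normalize(MEAN, STD) for one 8-bit RGB pixel."""
    # ToTensor: uint8 [0, 255] -> float [0, 1]; Normalize: (x - mean) / std
    return tuple((value / 255.0 - m) / s for value, m, s in zip(rgb, MEAN, STD))


# A white pixel maps to (1 - mean) / std in each channel
print(normalize_pixel((255, 255, 255)))
```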

The frame classification models are used in step 3 of the data curation pipeline to classify a frame as either surgical or non-surgical. They can be used as follows:

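```python
import torch
import torchvision
from PIL import Image
from model_loader import build_model

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Build the classifier and point to a downloaded checkpoint
net = build_model(mode='classify')
model_path = 'path/to/frame_classification/checkpoint'  # placeholder: a checkpoint file from frame_classification/

# Enable multi-GPU support; the model is wrapped before the weights are loaded
net = torch.nn.DataParallel(net)
torch.backends.cudnn.benchmark = True
state = torch.load(model_path, map_location=torch.device('cpu'))
net.load_state_dict(state['net'])
net = net.to(device)  # weights must live on the same device as the input
net.eval()

# Load the frame and convert it to a normalized PyTorch tensor
img_path = 'path/to/your/image.jpg'
img = Image.open(img_path).convert('RGB')  # ensure 3 channels (drops alpha, expands grayscale)
img = img.resize((224, 224))
transform = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize(
        (0.4299694, 0.29676908, 0.27707579),
        (0.24373249, 0.20208984, 0.19319402),
    ),
])
img_tensor = transform(img).unsqueeze(0).to(device)

# Classify the frame (inference only, so gradients are disabled)
with torch.no_grad():
    outputs = net(img_tensor)
```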
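
Step 3 yields one score vector per frame. This README does not state the output format or which index denotes the surgical class, so the sketch below assumes one raw score (logit) per class and takes the surgical index as a parameter (`surgical_index`, hypothetical, as are the function names). It converts each frame's scores to a probability with a softmax, thresholds it, and groups consecutive surgical frames into inclusive `(start, end)` index ranges, a convenient form for cutting a video. With PyTorch and `outputs` of shape `(batch, num_classes)`, `torch.softmax(outputs, dim=1)` gives the same probabilities for a batch.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one frame's raw class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def surgical_segments(
    frame_scores: Sequence[Sequence[float]],
    surgical_index: int = 0,  # ASSUMPTION: which output index means "surgical"
    threshold: float = 0.5,
) -> List[Tuple[int, int]]:
    """Group consecutive surgical frames into inclusive (start, end) index ranges."""
    segments = []
    start = None
    for i, scores in enumerate(frame_scores):
        is_surgical = softmax(scores)[surgical_index] >= threshold
        if is_surgical and start is None:
            start = i  # a surgical run begins at this frame
        elif not is_surgical and start is not None:
            segments.append((start, i - 1))  # the run ended on the previous frame
            start = None
    if start is not None:
        segments.append((start, len(frame_scores) - 1))  # run reaches the last frame
    return segments


print(surgical_segments([[2.0, 0.0], [0.0, 3.0], [1.0, 0.0]]))  # → [(0, 0), (2, 2)]
```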

The non-surgical object detection models are used to mask out non-surgical regions in surgical frames (e.g., user interface information). They can be used as follows:

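```python
import torch
import torchvision
from PIL import Image
from model_loader import build_model

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Build the non-surgical object detector and point to a downloaded checkpoint
net = build_model(mode='mask')
model_path = 'path/to/nonsurgical_object_detection/checkpoint'  # placeholder: a checkpoint file from nonsurgical_object_detection/

# Enable multi-GPU support; the model is wrapped before the weights are loaded
net = torch.nn.DataParallel(net)
torch.backends.cudnn.benchmark = True
state = torch.load(model_path, map_location=torch.device('cpu'))
net.load_state_dict(state['net'])
net = net.to(device)  # weights must live on the same device as the input
net.eval()

# Load the frame and convert it to a normalized PyTorch tensor
img_path = 'path/to/your/image.jpg'
img = Image.open(img_path).convert('RGB')  # ensure 3 channels (drops alpha, expands grayscale)
img = img.resize((224, 224))
transform = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize(
        (0.4299694, 0.29676908, 0.27707579),
        (0.24373249, 0.20208984, 0.19319402),
    ),
])
img_tensor = transform(img).unsqueeze(0).to(device)

# Run the detector (inference only); the format of `outputs` is defined by build_model(mode='mask') in model_loader.py
with torch.no_grad():
    outputs = net(img_tensor)
```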
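
Once non-surgical regions have been detected, masking them is straightforward. The sketch below assumes the boxes are `(x1, y1, x2, y2)` pixel coordinates on the 224×224 network input, with `x2`/`y2` exclusive; this README does not specify the detector's output format, and `scale_box` and `mask_regions` are hypothetical helpers. Because the snippet resizes each frame before inference, boxes are first rescaled to the original `(width, height)` (the order PIL's `Image.size` uses), rounding outward so the region stays covered, and then filled with black. Nested lists stand in for image arrays to keep the logic easy to test; with NumPy the fill is a slice assignment such as `frame[y1:y2, x1:x2] = 0`.

```python
import math
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]
Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2); x2 and y2 are exclusive


def scale_box(box: Sequence[float], from_size: Tuple[int, int], to_size: Tuple[int, int]) -> Box:
    """Map a box between (width, height) image sizes, rounding outward and clamping."""
    sx = to_size[0] / from_size[0]
    sy = to_size[1] / from_size[1]
    x1, y1, x2, y2 = box
    width, height = to_size
    # Floor the start and ceil the end so the rescaled box still covers the whole region
    return (
        max(0, math.floor(x1 * sx)),
        max(0, math.floor(y1 * sy)),
        min(width, math.ceil(x2 * sx)),
        min(height, math.ceil(y2 * sy)),
    )


def mask_regions(frame: List[List[Pixel]], boxes: Sequence[Box], fill: Pixel = (0, 0, 0)) -> List[List[Pixel]]:
    """Return a copy of an H x W grid of RGB pixels with every box set to `fill`."""
    masked = [row[:] for row in frame]  # copy rows so the input frame is untouched
    for x1, y1, x2, y2 in boxes:
        for y in range(y1, y2):
            for x in range(x1, x2):
                masked[y][x] = fill
    return masked


# A box predicted on the 224x224 input, mapped back to a 448x224 frame
print(scale_box((0, 0, 112, 56), (224, 224), (448, 224)))  # → (0, 0, 224, 56)
```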