Plonk

Sleeping

Plonk / DATASET.md

squash: merge all unpushed commits

c4c7cee 8 months ago

1.24 kB

	### Dataset
	To download the datataset, run:
	```python
	# download the full dataset
	from huggingface_hub import snapshot_download
	snapshot_download(repo_id="osv5m/osv5m", local_dir="datasets/osv5m", repo_type='dataset')
	```

	and finally extract:
	```python
	import os
	import zipfile
	for root, dirs, files in os.walk("datasets/osv5m"):
	for file in files:
	if file.endswith(".zip"):
	with zipfile.ZipFile(os.path.join(root, file), 'r') as zip_ref:
	zip_ref.extractall(root)
	os.remove(os.path.join(root, file))
	```

	You can also directly load the dataset using `load_dataset`:
	```python
	from datasets import load_dataset
	dataset = load_dataset('osv5m/osv5m', full=False)
	```
	where with `full` you can specify whether you want to load the complete metadata (default: `False`).

	If you only want to download the test set, you can run the script below:
	```python
	from huggingface_hub import hf_hub_download
	for i in range(5):
	hf_hub_download(repo_id="osv5m/osv5m", filename=str(i).zfill(2)+'.zip', subfolder="images/test", repo_type='dataset', local_dir="datasets/osv5m")
	hf_hub_download(repo_id="osv5m/osv5m", filename="README.md", repo_type='dataset', local_dir="datasets/osv5m")
	```