How the data got here:
- Downloaded the `DocLayNet_core.zip` dataset using `wget`.
- Uploaded the downloaded `DocLayNet_core.zip` file to the Hugging Face space.
- Loaded a dataset (`chainyo/rvl-cdip-invoice`) using the `load_dataset` function from the Hugging Face `datasets` library.
- Extracted and saved each image from the `train` portion of the loaded dataset into the `RVL-CDIP-invoice` directory.
- Compressed the `RVL-CDIP-invoice` directory into a zip file (`RVL-CDIP-invoice.zip`) using the `zip` command.
- Uploaded the zip file to the Hugging Face space.
# # !wget https://codait-cos-dax.s3.us.cloud-object-storage.appdomain.cloud/dax-doclaynet/1.0.0/DocLayNet_core.zip -O DocLayNet_core.zip
# upload_to_huggingface_space(
#     space_name = HUGGINGFACE_SPACE_NAME,
#     private = True,
#     path_or_fileobj = './DocLayNet_core.zip',
#     path_in_repo = 'data/raw/DocLayNet_core.zip')
#
#
# invoices = load_dataset('chainyo/rvl-cdip-invoice')
# # can also be found at: https://huggingface.co/datasets/aharley/rvl_cdip
# os.mkdir('./RVL-CDIP-invoice')
# for index, invoice in enumerate(tqdm(invoices['train'])):
#     invoice['image'].save(f'./RVL-CDIP-invoice/{index}.png', format="png")
#
# !ls ./RVL-CDIP-invoice -1 | wc -l
# !zip -r RVL-CDIP-invoice.zip ./RVL-CDIP-invoice
#
# upload_to_huggingface_space(
#     space_name = HUGGINGFACE_SPACE_NAME,
#     private = True,
#     path_or_fileobj = './RVL-CDIP-invoice.zip',
#     path_in_repo = 'data/raw/RVL-CDIP-invoice.zip')
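
The `upload_to_huggingface_space` helper called above is not defined in this section. As an illustration only, here is one way such a helper might be written on top of `huggingface_hub`, using `HfApi.create_repo` and `HfApi.upload_file`. The body, the optional `api` parameter (added so the function can be exercised without network access), and the choice of `space_sdk="static"` are all assumptions, not the project's actual implementation.

```python
def upload_to_huggingface_space(space_name, private, path_or_fileobj,
                                path_in_repo, api=None):
    """Upload one file to a Hugging Face Space, creating the Space if needed.

    Hypothetical sketch: only the four keyword names match the calls in this
    notebook; everything else is an assumption.
    """
    if api is None:
        # Imported lazily so the module loads even without huggingface_hub.
        from huggingface_hub import HfApi
        api = HfApi()  # picks up the token from `huggingface-cli login` or HF_TOKEN

    # exist_ok=True makes repeated calls idempotent. Spaces require an SDK;
    # "static" is an assumed placeholder for a Space used only as file storage.
    api.create_repo(
        repo_id=space_name,
        repo_type="space",
        space_sdk="static",
        private=private,
        exist_ok=True,
    )

    return api.upload_file(
        path_or_fileobj=path_or_fileobj,
        path_in_repo=path_in_repo,
        repo_id=space_name,
        repo_type="space",
    )
```

With this shape, `space_name` would need to be a full repository id such as `"username/space-name"`, and a write-enabled token must be available in the environment.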