How the data got here:

1. Downloaded [DocLayNet_core.zip](https://codait-cos-dax.s3.us.cloud-object-storage.appdomain.cloud/dax-doclaynet/1.0.0/DocLayNet_core.zip) with `wget`.
2. Uploaded `DocLayNet_core.zip` to the Hugging Face Space as `data/raw/DocLayNet_core.zip`.
3. Loaded the `chainyo/rvl-cdip-invoice` dataset with the `load_dataset` function from the Hugging Face `datasets` library.
4. Saved each image in the dataset's `train` split as a PNG file in the `RVL-CDIP-invoice` directory.
5. Compressed the `RVL-CDIP-invoice` directory into `RVL-CDIP-invoice.zip` with the `zip` command.
6. Uploaded `RVL-CDIP-invoice.zip` to the Hugging Face Space as `data/raw/RVL-CDIP-invoice.zip`.
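
The export and compression steps above can also be run outside a notebook. What follows is a minimal sketch, not the exact code used for this repository: `export_images`, `zip_directory`, and `main` are hypothetical helper names, and `shutil.make_archive` stands in for the `zip` command.

```python
import shutil
from pathlib import Path


def export_images(records, out_dir):
    """Save each record's image as <index>.png in out_dir and return the count."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)  # unlike os.mkdir, safe to re-run
    count = 0
    for index, record in enumerate(records):
        # Each record is expected to expose a PIL-style image under "image".
        record["image"].save(out / f"{index}.png", format="png")
        count += 1
    return count


def zip_directory(src_dir):
    """Create <src_dir>.zip next to src_dir, keeping the folder name inside the archive."""
    src = Path(src_dir).resolve()
    archive = shutil.make_archive(
        str(src), "zip", root_dir=src.parent, base_dir=src.name
    )
    return Path(archive)


def main():
    # Third-party dependency; downloading the dataset needs network access.
    from datasets import load_dataset

    invoices = load_dataset("chainyo/rvl-cdip-invoice")
    saved = export_images(invoices["train"], "./RVL-CDIP-invoice")
    print(f"saved {saved} images")
    print(f"wrote {zip_directory('./RVL-CDIP-invoice')}")
```

Calling `main()` downloads the dataset and writes `RVL-CDIP-invoice.zip` into the current directory; the two helpers can be reused for any iterable of records that carry an `image` with a `save` method.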

```
# !wget https://codait-cos-dax.s3.us.cloud-object-storage.appdomain.cloud/dax-doclaynet/1.0.0/DocLayNet_core.zip -O DocLayNet_core.zip
# upload_to_huggingface_space(
#     space_name = HUGGINGFACE_SPACE_NAME,
#     private = True,
#     path_or_fileobj = './DocLayNet_core.zip',
#     path_in_repo = 'data/raw/DocLayNet_core.zip')
# 
# 
# invoices = load_dataset('chainyo/rvl-cdip-invoice')
# # can also be found at: https://huggingface.co/datasets/aharley/rvl_cdip
# os.mkdir('./RVL-CDIP-invoice')
# for index, invoice in enumerate(tqdm(invoices['train'])):
#   invoice['image'].save(f'./RVL-CDIP-invoice/{index}.png', format="png")
# 
# !ls ./RVL-CDIP-invoice -1 | wc -l
# !zip -r RVL-CDIP-invoice.zip ./RVL-CDIP-invoice
# 
# upload_to_huggingface_space(
#     space_name = HUGGINGFACE_SPACE_NAME,
#     private = True,
#     path_or_fileobj = './RVL-CDIP-invoice.zip',
#     path_in_repo = 'data/raw/RVL-CDIP-invoice.zip')
```
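
The block above calls `upload_to_huggingface_space`, whose definition is not included here. One possible implementation, assuming it wraps `HfApi.create_repo` and `HfApi.upload_file` from the `huggingface_hub` package, is sketched below. The `space_sdk` default and the `api` parameter are assumptions added for illustration; the original helper may behave differently.

```python
def upload_to_huggingface_space(space_name, private, path_or_fileobj, path_in_repo,
                                space_sdk="static", api=None):
    """Upload one file to a Hugging Face Space, creating the Space if needed.

    Assumed reimplementation of the helper used above, not the original code.
    """
    if api is None:
        # Imported lazily so the function can be exercised with a stub client.
        from huggingface_hub import HfApi
        api = HfApi()  # uses the token from `huggingface-cli login` or HF_TOKEN

    # Creating a Space requires an SDK; "static" is an arbitrary choice here.
    api.create_repo(
        repo_id=space_name,
        repo_type="space",
        space_sdk=space_sdk,
        private=private,
        exist_ok=True,  # do not fail if the Space already exists
    )
    return api.upload_file(
        path_or_fileobj=path_or_fileobj,
        path_in_repo=path_in_repo,
        repo_id=space_name,
        repo_type="space",
    )
```

With this definition, the commented-out calls above would work once uncommented, provided `HUGGINGFACE_SPACE_NAME` holds a repository id of the form `user/space`.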