README.md
data/raw/README.md
How the data got here:

1. Downloaded [`DocLayNet_core.zip`](https://codait-cos-dax.s3.us.cloud-object-storage.appdomain.cloud/dax-doclaynet/1.0.0/DocLayNet_core.zip) with `wget`.
2. Uploaded `DocLayNet_core.zip` to the Hugging Face Space as `data/raw/DocLayNet_core.zip`.
3. Loaded the `chainyo/rvl-cdip-invoice` dataset with the `load_dataset` function from the Hugging Face `datasets` library.
4. Saved each image from the `train` split of that dataset as a PNG file in the `RVL-CDIP-invoice` directory.
5. Compressed the `RVL-CDIP-invoice` directory into `RVL-CDIP-invoice.zip` with the `zip` command.
6. Uploaded `RVL-CDIP-invoice.zip` to the Hugging Face Space as `data/raw/RVL-CDIP-invoice.zip`.

```
# Notebook cell: lines starting with `!` are IPython shell escapes.
# `upload_to_huggingface_space` and `HUGGINGFACE_SPACE_NAME` are not defined in this README.
import os

from datasets import load_dataset
from tqdm import tqdm

# Download DocLayNet and upload the archive to the Space.
!wget https://codait-cos-dax.s3.us.cloud-object-storage.appdomain.cloud/dax-doclaynet/1.0.0/DocLayNet_core.zip -O DocLayNet_core.zip
upload_to_huggingface_space(
    space_name=HUGGINGFACE_SPACE_NAME,
    private=True,
    path_or_fileobj='./DocLayNet_core.zip',
    path_in_repo='data/raw/DocLayNet_core.zip')

# Load the invoice dataset and save every training image as a PNG.
invoices = load_dataset('chainyo/rvl-cdip-invoice')
# can also be found at: https://huggingface.co/datasets/aharley/rvl_cdip
os.makedirs('./RVL-CDIP-invoice', exist_ok=True)
for index, invoice in enumerate(tqdm(invoices['train'])):
    invoice['image'].save(f'./RVL-CDIP-invoice/{index}.png', format='png')

# Count the saved images, then compress the directory.
!ls ./RVL-CDIP-invoice -1 | wc -l
!zip -r RVL-CDIP-invoice.zip ./RVL-CDIP-invoice

# Upload the invoice archive to the Space.
upload_to_huggingface_space(
    space_name=HUGGINGFACE_SPACE_NAME,
    private=True,
    path_or_fileobj='./RVL-CDIP-invoice.zip',
    path_in_repo='data/raw/RVL-CDIP-invoice.zip')
```
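
The `upload_to_huggingface_space` helper is called above but not shown in this README. The following is a minimal sketch of what such a helper might look like, not the project's actual implementation. It assumes the `huggingface_hub` library, that `space_name` is a full repo id such as `user/space-name`, and that a Space SDK of `static` is acceptable. The `api` parameter is a hypothetical addition that allows a stand-in client to be passed in for testing.

```python
from typing import Any, Optional


def upload_to_huggingface_space(
    space_name: str,
    path_or_fileobj: str,
    path_in_repo: str,
    private: bool = True,
    api: Optional[Any] = None,
) -> Any:
    """Hypothetical stand-in for the helper used in this README.

    Creates the Space if needed, then uploads one file into it.
    """
    if api is None:
        # Imported lazily so the sketch can be inspected without the library.
        from huggingface_hub import HfApi

        api = HfApi()  # uses the locally stored Hugging Face token, if any

    # Assumption: the Space may not exist yet, so create it idempotently.
    # Spaces require an SDK; "static" is an arbitrary choice for file storage.
    api.create_repo(
        repo_id=space_name,
        repo_type="space",
        space_sdk="static",
        private=private,
        exist_ok=True,
    )

    # Upload the local file to the given path inside the Space repository.
    return api.upload_file(
        path_or_fileobj=path_or_fileobj,
        path_in_repo=path_in_repo,
        repo_id=space_name,
        repo_type="space",
    )
```

With this sketch, the two upload calls in the block above would work unchanged, provided `HUGGINGFACE_SPACE_NAME` holds the Space's repo id and a token with write access is available.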