Commit e59c79f by mckabue · 1 Parent(s): bd7e669
Files changed (1): data/raw/README.md +33 -0
data/raw/README.md ADDED
@@ -0,0 +1,33 @@
+ How the data got here:
+
+ 1. Downloaded the [DocLayNet_core.zip dataset](https://codait-cos-dax.s3.us.cloud-object-storage.appdomain.cloud/dax-doclaynet/1.0.0/DocLayNet_core.zip) using `wget`.
+ 2. Uploaded the downloaded `DocLayNet_core.zip` file to the Hugging Face Space.
+ 3. Loaded the `chainyo/rvl-cdip-invoice` dataset using the `load_dataset` function from the Hugging Face `datasets` library.
+ 4. Extracted each image from the `train` split of the loaded dataset and saved it as a PNG in the `RVL-CDIP-invoice` directory.
+ 5. Compressed the `RVL-CDIP-invoice` directory into a zip file (`RVL-CDIP-invoice.zip`) using the `zip` command.
+ 6. Uploaded the zip file to the Hugging Face Space.
+
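The compression step relies on the shell `zip` command and a separate `ls | wc -l` count. Where no shell is available, roughly the same result can be produced from Python with the standard library alone. The following is a minimal sketch, not the method used for this commit; the function name `zip_directory` is hypothetical and does not appear in the original notebook.

```python
import zipfile
from pathlib import Path


def zip_directory(source_dir: str, archive_path: str) -> int:
    """Zip every file under source_dir and return how many files were archived.

    Hypothetical helper: a stdlib stand-in for `zip -r` plus `ls | wc -l`.
    The archive must be written outside source_dir.
    """
    source = Path(source_dir)
    count = 0
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source.rglob("*")):
            if path.is_file():
                # Store names relative to the parent directory so the archive
                # keeps the top-level folder name.
                arcname = path.relative_to(source.parent).as_posix()
                archive.write(path, arcname=arcname)
                count += 1
    return count
```

Calling `zip_directory('./RVL-CDIP-invoice', './RVL-CDIP-invoice.zip')` returns the number of files written, which can be compared against the number of rows in the `train` split as a quick sanity check.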
+ ```
+ # !wget https://codait-cos-dax.s3.us.cloud-object-storage.appdomain.cloud/dax-doclaynet/1.0.0/DocLayNet_core.zip -O DocLayNet_core.zip
+ # upload_to_huggingface_space(
+ #     space_name = HUGGINGFACE_SPACE_NAME,
+ #     private = True,
+ #     path_or_fileobj = './DocLayNet_core.zip',
+ #     path_in_repo = 'data/raw/DocLayNet_core.zip')
+ #
+ #
+ # invoices = load_dataset('chainyo/rvl-cdip-invoice')
+ # # can also be found at: https://huggingface.co/datasets/aharley/rvl_cdip
+ # os.mkdir('./RVL-CDIP-invoice')
+ # for index, invoice in enumerate(tqdm(invoices['train'])):
+ #     invoice['image'].save(f'./RVL-CDIP-invoice/{index}.png', format='png')
+ #
+ # !ls -1 ./RVL-CDIP-invoice | wc -l
+ # !zip -r RVL-CDIP-invoice.zip ./RVL-CDIP-invoice
+ #
+ # upload_to_huggingface_space(
+ #     space_name = HUGGINGFACE_SPACE_NAME,
+ #     private = True,
+ #     path_or_fileobj = './RVL-CDIP-invoice.zip',
+ #     path_in_repo = 'data/raw/RVL-CDIP-invoice.zip')
+ ```
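The README calls `upload_to_huggingface_space` without showing its definition. One plausible shape for such a helper, built on `huggingface_hub.HfApi`, is sketched below. This is an assumption about how it might work, not the author's actual implementation: the `api` and `space_sdk` parameters are hypothetical additions, and `"static"` is an arbitrary default for the Space SDK.

```python
def upload_to_huggingface_space(space_name, private, path_or_fileobj,
                                path_in_repo, api=None, space_sdk="static"):
    """Create the Space if needed, then upload one file into it.

    Hypothetical reconstruction. `api` lets a caller pass a preconfigured
    (or fake) client; by default a real HfApi client is created.
    """
    if api is None:
        # Third-party dependency: pip install huggingface_hub
        from huggingface_hub import HfApi
        api = HfApi()

    # exist_ok=True makes repeated calls safe once the Space exists.
    api.create_repo(
        repo_id=space_name,
        repo_type="space",
        space_sdk=space_sdk,
        private=private,
        exist_ok=True,
    )
    return api.upload_file(
        path_or_fileobj=path_or_fileobj,
        path_in_repo=path_in_repo,
        repo_id=space_name,
        repo_type="space",
    )
```

With the default client, authentication is expected to come from a prior `huggingface-cli login` or the `HF_TOKEN` environment variable. Passing the client in as a parameter keeps the helper testable without network access.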