Konthee committed on
Commit
bd07a27
·
verified ·
1 Parent(s): 468c25b

Update README.md

  1. README.md +25 -6
README.md CHANGED
@@ -1,11 +1,30 @@
  ---
- license: apache-2.0
  ---
  <br>
- (Quickstart)

  ```python
- from transformers import AutoModel, AutoConfig
- config = AutoConfig.from_pretrained("Konthee/CLIPTextCamembertModelWithProjection-contrastive", trust_remote_code=True)
- model = AutoModel.from_pretrained("Konthee/CLIPTextCamembertModelWithProjection-contrastive", trust_remote_code=True)
- ```

  ---
+ language:
+ - th
+ - en
+ tags:
+ - openthaigpt
  ---
  <br>

+ ## How to use
  ```python
+ from transformers import AutoModel, AutoProcessor
+ model = AutoModel.from_pretrained("openthaigpt/CLIPTextCamembertModelWithProjection-contrastive", trust_remote_code=True)
+ processor = AutoProcessor.from_pretrained("openthaigpt/CLIPTextCamembertModelWithProjection-contrastive", trust_remote_
+ ```
+
+ ### Preprocessing
+
+ Texts are preprocessed with the following rules:
+
+ - Replace HTML forms of characters with the actual characters, such as `&nbsp;` with a space and `<br />` with a line break [[Howard and Ruder, 2018]](https://arxiv.org/abs/1801.06146).
+ - Remove empty brackets (`()`, `{}`, and `[]`) that sometimes come up as a result of text extraction, such as from Wikipedia.
+ - Replace line breaks with spaces.
+ - Replace multiple consecutive spaces with a single space.
+ - Reduce runs of more than 3 repetitive characters, such as ดีมากกก to ดีมาก [[Howard and Ruder, 2018]](https://arxiv.org/abs/1801.06146).
+ - Apply word-level tokenization using the `newmm` dictionary-based maximal matching tokenizer of [[Phatthiyaphaibun et al., 2020]](https://zenodo.org/record/4319685#.YA4xEGQzaDU).
+ - Replace repetitive words; this is done post-tokenization, unlike [[Howard and Ruder, 2018]](https://arxiv.org/abs/1801.06146), since Thai words are not delimited by spaces as English words are.
+ - Replace spaces with `<_>`. The SentencePiece tokenizer combines spaces with other tokens. Since spaces serve as punctuation in Thai, marking boundaries such as the end of a sentence much like periods in English, combining them with other tokens would drop an important feature for tasks such as word tokenization and sentence breaking. Therefore, we opt to explicitly mark spaces with `<_>`.
+
+ <br>
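
The preprocessing rules listed in the README can be sketched in a few lines of Python. This is a minimal illustration, not the code the model authors ran. The names `preprocess_text`, `whitespace_tokenize`, and `SPACE_TOKEN` are hypothetical. The default tokenizer is a whitespace stand-in rather than `newmm`. Repeated words are assumed to collapse to a single occurrence. The repetition threshold `min_run=3` is an assumption chosen so that the README's own example (ดีมากกก to ดีมาก) holds.

```python
import html
import re

SPACE_TOKEN = "<_>"  # explicit space marker named in the rules


def whitespace_tokenize(text):
    """Stand-in tokenizer (assumption): runs of whitespace and non-whitespace.

    This is NOT newmm; pass a real Thai word tokenizer via `tokenize`.
    """
    return re.findall(r"\s+|\S+", text)


def preprocess_text(text, tokenize=whitespace_tokenize, min_run=3):
    """Apply the eight listed rules in order (hypothetical helper)."""
    # 1. HTML forms -> actual characters (<br /> -> line break, &nbsp; -> space).
    text = re.sub(r"<br\s*/?>", "\n", text)
    text = html.unescape(text.replace("&nbsp;", " "))
    # 2. Remove empty brackets left behind by text extraction.
    text = re.sub(r"\(\)|\{\}|\[\]", "", text)
    # 3. Replace line breaks with spaces.
    text = re.sub(r"[\r\n]+", " ", text)
    # 4. Replace multiple spaces with a single space.
    text = re.sub(r" {2,}", " ", text)
    # 5. Collapse runs of `min_run` or more identical characters to one
    #    (threshold is an assumption, see the lead-in).
    text = re.sub(r"(\S)\1{%d,}" % (min_run - 1), r"\1", text)
    # 6. Word-level tokenization (pluggable; the default is only a stand-in).
    tokens = tokenize(text)
    # 7. Drop a token when it repeats the token immediately before it.
    deduped = [t for i, t in enumerate(tokens) if i == 0 or t != tokens[i - 1]]
    # 8. Mark spaces explicitly so a subword tokenizer cannot absorb them.
    return "".join(deduped).replace(" ", SPACE_TOKEN)


print(preprocess_text("ดีมากกก"))                 # ดีมาก
print(preprocess_text("a&nbsp;b<br />c  ()  d"))  # a<_>b<_>c<_>d
```

To get closer to the described pipeline, a caller could pass a real tokenizer, for example `tokenize=lambda t: word_tokenize(t, engine="newmm")` with `word_tokenize` imported from `pythainlp.tokenize`, assuming PyThaiNLP is installed. Keeping the tokenizer pluggable leaves the sketch runnable with the standard library alone.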