manu committed on
Commit 5d7f6da · 1 Parent(s): f9e3cdf

Create README.md

Files changed (1)
  1. README.md +11 -0
README.md ADDED
@@ -0,0 +1,11 @@
+ ---
+ datasets:
+ - manu/tok_corpus
+ language:
+ - fr
+ - en
+ ---
+
+ BPE Tokenizer fitted on a custom corpus, with digit separation, byte fallback and other features from LlamaTokenizer.
+
+ Only fitted on 100,000 samples (7.5M words).
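
For readers unfamiliar with the two features named in the README, the following is a toy sketch of what "digit separation" and "byte fallback" usually mean in Llama-style tokenizers. It is not the code used to fit this tokenizer: the function names `split_digits` and `byte_fallback` and the tiny `vocab` are hypothetical, and the `<0xNN>` token format is assumed from the SentencePiece/Llama convention.

```python
import re


def split_digits(text):
    """Pre-tokenize so that every ASCII digit becomes its own piece.

    Runs of non-digit characters are kept together; each digit is isolated,
    so a number such as "2023" can never be merged into a single BPE token.
    """
    return re.findall(r"[0-9]|[^0-9]+", text)


def byte_fallback(piece, vocab):
    """Return the piece itself if it is in the vocabulary.

    Otherwise fall back to one <0xNN> token per UTF-8 byte, so that no input
    ever maps to an unknown token. `vocab` is a hypothetical set of pieces.
    """
    if piece in vocab:
        return [piece]
    return [f"<0x{b:02X}>" for b in piece.encode("utf-8")]


if __name__ == "__main__":
    vocab = {"en ", "2", "0", "3"}  # illustrative only
    for piece in split_digits("en 2023"):
        print(piece, "->", byte_fallback(piece, vocab))
    print("é ->", byte_fallback("é", vocab))
```

In a real BPE tokenizer the merge rules run between these two steps; the sketch only shows the pre-tokenization rule and the out-of-vocabulary behavior.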