Temuriy committed on
Commit
31c253b
·
verified ·
1 Parent(s): c346869

Update README.md

Files changed (1)
  1. README.md +116 -3
README.md CHANGED
@@ -1,3 +1,116 @@
1
- ---
2
- license: cc-by-nc-4.0
3
- ---
1
+ ---
2
+ license: cc-by-nc-4.0
3
+ language:
4
+ - pt
5
+ tags:
6
+ - ai-detection
7
+ - text-classification
8
+ - portuguese
9
+ - bert
10
+ - transformers
11
+ base_model: neuralmind/bert-base-portuguese-cased
12
+ pipeline_tag: text-classification
13
+ datasets:
14
+ - wiki40b-pt # From consolidated human sources
15
+ - oscar-pt # From consolidated human sources
16
+ - cc100-pt # From consolidated human sources
17
+ - europarl-pt # From consolidated human sources
18
+ - opus-books-pt # From consolidated human sources
19
+ - Detecting-ai/ai_pt_corpus # AI-generated corpus
20
+ model_type: bert
21
+ ---
22
+
23
+ # 🇧🇷 pt-ai-detector
24
+
25
+ **pt-ai-detector** is a BERT-base model fine-tuned to decide whether a Portuguese sentence or paragraph was written by a *human* (`label = 0`) or generated by *AI* (`label = 1`).
26
+
27
+ | Metric | Value |
28
+ | --------------------- | ------------------------------ |
29
+ | **Train data** | 1 000 000 human + 1 000 000 AI |
30
+ | **Balanced test set** | 1 954 190 (½ human, ½ AI) |
31
+ | **Accuracy** | ≈ 99 % |
32
+ | **F1 (macro)** | ≈ 0.99 |
33
+ | **Model size**        | ≈ 110 M parameters (≈ 430 MB)  |
34
+
35
+ ---
36
+
37
+ ## 📖 Quick usage
38
+
39
+ ```python
40
+ from transformers import pipeline
41
+
42
+ clf = pipeline(
43
+ "text-classification",
44
+ model="Detecting-ai/pt-ai-detector",
45
+ tokenizer="Detecting-ai/pt-ai-detector",
46
+ device=0 # set -1 for CPU
47
+ )
48
+
49
+ text = "A inteligência artificial está transformando a educação."
50
+ print(clf(text))  # e.g. [{'label': 'AI', 'score': 0.987}]
51
+ ```
52
+
53
+ | id | label |
54
+ | -- | ----- |
55
+ | 0 | Human |
56
+ | 1 | AI |
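
The id-to-label table above can be applied to pipeline output with a small helper. This is a minimal sketch, not part of the model repository: `to_label_id` is a hypothetical name, and it assumes the pipeline returns either the names `Human`/`AI` from the table or the generic `LABEL_0`/`LABEL_1` names that Transformers falls back to when a model config defines no custom label names.

```python
# Hypothetical helper: normalise a text-classification result to the
# integer ids from the table above (0 = Human, 1 = AI).
NAME_TO_ID = {"human": 0, "ai": 1, "label_0": 0, "label_1": 1}


def to_label_id(prediction: dict) -> int:
    """Map one pipeline result, e.g. {'label': 'AI', 'score': 0.98}, to 0 or 1."""
    name = str(prediction["label"]).strip().lower()
    if name not in NAME_TO_ID:
        # Fail loudly rather than guess when the label name is unknown.
        raise ValueError(f"unexpected label name: {prediction['label']!r}")
    return NAME_TO_ID[name]
```

Raising on an unknown name, rather than defaulting to one class, keeps a misconfigured label map from silently skewing results.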
57
+
58
+ ---
59
+
60
+ ## 🏋️‍♂️ Training details
61
+
62
+ * **Base model:** `neuralmind/bert-base-portuguese-cased`
63
+ * **Epochs:** 3 (fp16 mixed precision on 1 × A100 GPU)
64
+ * **Batch size:** 32
65
+ * **Optimizer / learning rate:** AdamW, 2 × 10⁻⁵
66
+ * **Loss:** Cross-entropy
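
As a rough sanity check on these settings, the number of optimizer steps follows from the dataset size and the batch size. The sketch below assumes the 2 000 000 training examples from the summary table, no gradient accumulation, and that 32 is the global batch size. The README confirms only the listed figures, not these assumptions.

```python
import math

train_examples = 1_000_000 + 1_000_000  # human + AI, from the summary table
batch_size = 32                          # assumed to be the global batch size
epochs = 3

# One optimizer step per batch (assumes no gradient accumulation).
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch)  # 62500
print(total_steps)      # 187500
```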
67
+
68
+ ### Data sources
69
+
70
+ | Corpus | Lines used | Notes |
71
+ | ------------------------------------------------------------------------- | ----------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
72
+ | Human corpus (wiki40b-pt, oscar-pt, cc100-pt, europarl-pt, opus-books-pt) | 1 M sampled | Diverse Portuguese web/news/books |
73
+ | **AI corpus** (`Detecting-ai/ai_pt_corpus`)                               | 1 M         | Generated with **OpenAI models** (various GPT-4 / GPT-3.5 variants); prompts cover essays, news, tweets, dialogs, and paraphrases; sampling temperature T = 0.6–1.0 |
74
+
75
+ Datasets were **balanced 1 : 1** and shuffled before training.
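
One way to build a balanced, shuffled training set like the one described is sketched below. It is not the authors' pipeline: `build_balanced` is a hypothetical function, the inputs are plain Python lists, and the fixed seed is an assumption made for reproducibility.

```python
import random


def build_balanced(human_texts, ai_texts, seed=42):
    """Downsample the larger class, attach labels (0 = human, 1 = AI), shuffle."""
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    n = min(len(human_texts), len(ai_texts))
    human = rng.sample(human_texts, n)
    ai = rng.sample(ai_texts, n)
    examples = [(text, 0) for text in human] + [(text, 1) for text in ai]
    rng.shuffle(examples)
    return examples
```

Sampling both classes down to the size of the smaller one yields the 1 : 1 ratio. Shuffling afterwards avoids feeding the model long runs of a single label.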
76
+
77
+ ---
78
+
79
+ ## 🚦 Intended use
80
+
81
+ Detect AI-generated Portuguese text in essays, articles, chats, support tickets, etc.
82
+
83
+ ### Limitations
84
+
85
+ * Not trained on code or non-Portuguese text.
86
+ * Accuracy may drop on texts shorter than 10 tokens or on heavily paraphrased AI output.
87
+ * **Commercial use is disallowed** (CC-BY-NC-4.0).
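
Given the caveat about short inputs, a caller may want to flag texts that are too short to classify reliably. The sketch below rests on stated assumptions: `is_long_enough` is a hypothetical helper, and it counts whitespace-separated words as a cheap stand-in for model tokens. Word counts usually undercount WordPiece tokens, so the check errs on the side of flagging.

```python
MIN_TOKENS = 10  # threshold taken from the limitation listed above


def is_long_enough(text: str, min_tokens: int = MIN_TOKENS) -> bool:
    """Return True if the text has at least `min_tokens` whitespace-separated words."""
    return len(text.split()) >= min_tokens
```

A stricter check would count tokens with the model's own tokenizer instead of splitting on whitespace.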
88
+
89
+ ---
90
+
91
+ ## ⚠️ Future work
92
+
93
+ * Evaluate on adversarial paraphrases.
94
+ * Distill/quantize for edge deployment.
95
+ * Extend to multilingual detection.
96
+
97
+ ---
98
+
99
+ ## 📜 Citation
100
+
101
+ ```bibtex
102
+ @misc{abdurazzoqov2025ptaidetector,
103
+ title = {pt-ai-detector: Detecting AI-generated Portuguese Text},
104
+ author = {Abdulla Abdurazzoqov},
105
+ year = {2025},
106
+ howpublished = {Hugging Face hub},
107
+ url = {https://huggingface.co/Detecting-ai/pt-ai-detector}
108
+ }
109
+ ```
110
+
111
+ ---
112
+
113
+ ## 💬 License
114
+
115
+ Creative Commons **CC-BY-NC-4.0** — free for research & personal use; commercial use requires written permission.
116
+