+---
+license: cc-by-nc-4.0
+language:
+  - pt
+tags:
+  - ai-detection
+  - text-classification
+  - portuguese
+  - bert
+  - transformers
+base_model: neuralmind/bert-base-portuguese-cased
+pipeline_tag: text-classification
+datasets:
+  - wiki40b-pt    # From consolidated human sources
+  - oscar-pt      # From consolidated human sources
+  - cc100-pt      # From consolidated human sources
+  - europarl-pt   # From consolidated human sources
+  - opus-books-pt # From consolidated human sources
+  - Detecting-ai/ai_pt_corpus # AI-generated corpus
+model_type: bert
+---
+# 🇧🇷 pt-ai-detector
+**pt-ai-detector** is a BERT-base model fine-tuned to decide whether a Portuguese sentence or paragraph was written by a *human* (`label = 0`) or generated by *AI* (`label = 1`).
+| Metric                | Value                          |
+| --------------------- | ------------------------------ |
+| **Train data**        | 1 000 000 human + 1 000 000 AI |
+| **Balanced test set** | 1 954 190 (½ human, ½ AI)      |
+| **Accuracy**          | ≈ 99 %                         |
+| **F1 (macro)**        | ≈ 0.99                         |
+| **Model size**        | ≈ 110 M parameters (≈ 430 MB)  |
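For readers who want to reproduce the two headline metrics, the sketch below shows how accuracy and macro-F1 follow from a two-class confusion matrix. The function name `binary_metrics` and the counts passed to it are hypothetical placeholders for illustration; they are not this model's evaluation results.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy and macro-F1 for a two-class problem (AI = positive class)."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Per-class F1 = 2*TP / (2*TP + FP + FN).
    f1_ai = 2 * tp / (2 * tp + fp + fn)
    # For the Human class the roles of TP and TN are swapped.
    f1_human = 2 * tn / (2 * tn + fn + fp)
    # Macro-F1 is the unweighted mean of the per-class F1 scores.
    return {"accuracy": accuracy, "f1_macro": (f1_ai + f1_human) / 2}


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=95, fp=3, fn=5, tn=97)
print(metrics)
```

On a balanced test set the two numbers tend to be close, which is consistent with the table reporting roughly the same value for both.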
+---
+## 📖 Quick usage
+```python
+from transformers import pipeline
+
+# Load the fine-tuned classifier; the tokenizer comes from the same repo.
+clf = pipeline(
+    "text-classification",
+    model="Detecting-ai/pt-ai-detector",
+    tokenizer="Detecting-ai/pt-ai-detector",
+    device=0,  # first GPU; set to -1 for CPU
+)
+
+text = "A inteligência artificial está transformando a educação."
+print(clf(text))  # e.g. [{'label': 'AI', 'score': 0.987}]
+```
+| id | label |
+| -- | ----- |
+| 0  | Human |
+| 1  | AI    |
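The mapping above can be applied to the pipeline output to get numeric ids, optionally discarding low-confidence predictions. This is a minimal sketch: it assumes the pipeline returns the label strings `Human` and `AI` as in the usage example, and the helper `to_label_id` and its `min_score` cutoff are hypothetical, not part of the model or of `transformers`.

```python
LABEL2ID = {"Human": 0, "AI": 1}  # mirrors the id/label table above


def to_label_id(prediction: dict, min_score: float = 0.5):
    """Return 0 (Human) or 1 (AI), or None when the score is below min_score."""
    if prediction["score"] < min_score:
        return None  # treat as "undecided" instead of forcing a label
    return LABEL2ID[prediction["label"]]


# Same shape as the pipeline output in the usage example.
example_output = [{"label": "AI", "score": 0.987}]
print([to_label_id(p, min_score=0.9) for p in example_output])
```

Returning `None` for uncertain cases leaves the decision to the caller, which may be preferable when a false accusation of AI authorship is costly.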
+---
+## 🏋️‍♂️ Training details
+* **Base model:** `neuralmind/bert-base-portuguese-cased`
+* **Epochs:** 3 (fp16 on 1 × A100)
+* **Batch size:** 32
+* **Optimizer/LR:** AdamW, learning rate 2 × 10⁻⁵
+* **Loss:** Cross-entropy
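As a quick sanity check on the training budget, the number of optimizer steps implied by these settings can be worked out from the training-set size in the metrics table. The calculation assumes that 32 is the effective batch size (single device, no gradient accumulation), which the card does not state explicitly.

```python
import math

train_examples = 1_000_000 + 1_000_000  # human + AI examples
batch_size = 32                          # assumed to be the effective batch size
epochs = 3

# One optimizer step per batch; the last batch of an epoch may be partial.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 62500 187500
```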
+### Data sources
+| Corpus                                                                    | Lines used  | Notes                                                                                                                                      |
+| ------------------------------------------------------------------------- | ----------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
+| Human corpus (wiki40b-pt, oscar-pt, cc100-pt, europarl-pt, opus-books-pt) | 1 M sampled | Diverse Portuguese web/news/books                                                                                                          |
+| **AI corpus** (`Detecting-ai/ai_pt_corpus`)                               | 1 M         | Generated with **OpenAI models** (various GPT-4 / GPT-3.5 variants); prompts cover essays, news, tweets, dialogs, and paraphrases; sampling temperature T = 0.6–1.0 |
+
+Datasets were **balanced 1 : 1** and shuffled before training.
+
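The balancing and shuffling step can be sketched as follows. This is an illustrative reconstruction, not the author's preprocessing script: the function `build_balanced_dataset`, the downsampling strategy, and the fixed seed are all assumptions.

```python
import random


def build_balanced_dataset(human_texts, ai_texts, seed=42):
    """Return shuffled (text, label) pairs with equal counts per class."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    n = min(len(human_texts), len(ai_texts))
    # Downsample the larger class so both contribute n examples.
    human = rng.sample(human_texts, n)
    ai = rng.sample(ai_texts, n)
    # Labels follow the card: 0 = Human, 1 = AI.
    examples = [(t, 0) for t in human] + [(t, 1) for t in ai]
    rng.shuffle(examples)
    return examples


# Toy inputs standing in for the real corpora.
data = build_balanced_dataset(
    ["h1", "h2", "h3", "h4", "h5"],
    ["a1", "a2", "a3"],
)
print(len(data), sum(label for _, label in data))
```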
+---
+## 🚦 Intended use
+Detect AI-generated Portuguese text in essays, articles, chats, support tickets, etc.
+### Limitations
+* Not trained on code or on text in languages other than Portuguese.
+* Accuracy may drop on texts shorter than 10 tokens or on heavily paraphrased AI output.
+* **Commercial use is not permitted** without written permission (CC-BY-NC-4.0).
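One way to respect the short-text limitation is to check input length before trusting a prediction. The helper below is a hypothetical sketch; it counts whitespace-separated words, which only approximates the model's WordPiece tokenizer. Subword tokenizers usually produce at least as many tokens as there are words, so this check tends to err on the cautious side.

```python
MIN_TOKENS = 10  # threshold taken from the limitation above


def is_long_enough(text: str, min_tokens: int = MIN_TOKENS) -> bool:
    """Rough length check; whitespace words stand in for model tokens."""
    return len(text.split()) >= min_tokens


short = "A inteligência artificial está transformando a educação."
longer = (
    "A inteligência artificial está transformando a educação "
    "em escolas e universidades de todo o país."
)
print(is_long_enough(short), is_long_enough(longer))
```

Inputs that fail the check could be skipped or flagged as unreliable rather than classified.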
+---
+## ⚠️ Future work
+* Evaluate on adversarial paraphrases.
+* Distill/quantize for edge deployment.
+* Extend to multilingual detection.
+---
+## 📜 Citation
+```bibtex
+@misc{abdurazzoqov2025ptaidetector,
+  title   = {pt-ai-detector: Detecting AI-generated Portuguese Text},
+  author  = {Abdulla Abdurazzoqov},
+  year    = {2025},
+  howpublished = {Hugging Face Hub},
+  url     = {https://huggingface.co/Detecting-ai/pt-ai-detector}
+}
+```
+---
+## 💬 License
+Creative Commons **CC-BY-NC-4.0**: free for non-commercial research and personal use with attribution; commercial use requires written permission.