---
license: cc-by-nc-4.0
language:
- pt
tags:
- ai-detection
- text-classification
- portuguese
- bert
- transformers
base_model: neuralmind/bert-base-portuguese-cased
pipeline_tag: text-classification
datasets:
- wiki40b-pt # From consolidated human sources
- oscar-pt # From consolidated human sources
- cc100-pt # From consolidated human sources
- europarl-pt # From consolidated human sources
- opus-books-pt # From consolidated human sources
- Detecting-ai/ai_pt_corpus # AI-generated corpus
model_type: bert
---
# 🇧🇷 pt-ai-detector
**pt-ai-detector** is a BERT-base model fine-tuned to decide whether a Portuguese sentence or paragraph was written by a *human* (`label = 0`) or generated by *AI* (`label = 1`).
| Metric | Value |
| --------------------- | ------------------------------ |
| **Train data** | 1 000 000 human + 1 000 000 AI |
| **Balanced test set** | 1 954 190 (½ human, ½ AI) |
| **Accuracy** | ≈ 99 % |
| **F1 (macro)** | ≈ 0.99 |
| **Model size** | 434 M parameters (≈ 430 MB) |
---
## 📖 Quick usage
**Try it live** at [detecting-ai.com](https://detecting-ai.com/pt) – our team at Detecting-ai built this model and demo so you can instantly test any Portuguese text online.
```python
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="Detecting-ai/pt-ai-detector",
    tokenizer="Detecting-ai/pt-ai-detector",
    device=0,  # set device=-1 to run on CPU
)

text = "A inteligência artificial está transformando a educação."
print(clf(text))  # → [{'label': 'AI', 'score': 0.987}]
```
| id | label |
| -- | ----- |
| 0 | Human |
| 1 | AI |
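Depending on how `id2label` is set in the checkpoint's config, the pipeline may return the generic names `LABEL_0`/`LABEL_1` instead of `Human`/`AI`. A small sketch that maps raw labels onto the table above (`ID2LABEL` and `pretty` are illustrative names, not part of the model):

```python
# Mapping taken from the id/label table above; the "LABEL_*" keys are the
# generic names transformers emits when id2label is absent from the config.
ID2LABEL = {"LABEL_0": "Human", "LABEL_1": "AI"}

def pretty(preds):
    """Rename raw pipeline labels; leaves already-readable labels untouched."""
    return [{**p, "label": ID2LABEL.get(p["label"], p["label"])} for p in preds]

print(pretty([{"label": "LABEL_1", "score": 0.987}]))
# → [{'label': 'AI', 'score': 0.987}]
```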
---
## 🏋️‍♂️ Training details
* **Base model:** `neuralmind/bert-base-portuguese-cased`
* **Epochs:** 3 (fp16 on 1 × A100)
* **Batch size:** 32
* **Optimizer/LR:** AdamW 2 × 10⁻⁵
* **Loss:** Cross-entropy
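The bullets above correspond roughly to a `transformers.Trainer` configuration. A minimal config sketch, assuming a standard fine-tuning setup (the `output_dir` and everything not listed in the bullets are assumptions, not details from this card):

```python
from transformers import TrainingArguments

# Hyperparameters mirror the card: 3 epochs, batch size 32, AdamW @ 2e-5, fp16.
# output_dir is an illustrative placeholder.
args = TrainingArguments(
    output_dir="pt-ai-detector",       # assumption: local checkpoint directory
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,                # AdamW is the Trainer's default optimizer
    fp16=True,                         # mixed precision, as on the A100 run
)
```

Cross-entropy is the default loss for sequence classification heads in `transformers`, so it needs no explicit setting here.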
### Data sources
| Corpus | Lines used | Notes |
| ------------------------------------------------------------------------- | ----------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
| Human corpus (wiki40b-pt, oscar-pt, cc100-pt, europarl-pt, opus-books-pt) | 1 M sampled | Diverse Portuguese web/news/books |
| **AI corpus** (`Detecting-ai/ai_pt_corpus`) | 1 M | Generated with **OpenAI models** (various GPT-4 / GPT-3.5 variants); prompts cover essays, news, tweets, dialogs, paraphrases, T = 0.6–1.0 |
The two corpora were **balanced 1 : 1** and shuffled before training.
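The balancing step can be sketched in plain Python (the function name and seed are assumptions; the actual pipeline presumably used `datasets` utilities at corpus scale):

```python
import random

def balance_and_shuffle(human, ai, seed=42):
    """Downsample the larger class to a 1:1 ratio, label, and shuffle (sketch)."""
    n = min(len(human), len(ai))
    rng = random.Random(seed)
    data = [(t, 0) for t in rng.sample(human, n)]  # label 0 = Human
    data += [(t, 1) for t in rng.sample(ai, n)]    # label 1 = AI
    rng.shuffle(data)
    return data

pairs = balance_and_shuffle(["h1", "h2", "h3"], ["a1", "a2"])
# 2 human + 2 AI examples, interleaved in random order
```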
---
## 🚦 Intended use
Detect AI-generated Portuguese text in essays, articles, chats, support tickets, etc.
### Limitations
* Not trained on code or non-Portuguese text.
* Accuracy may drop on texts shorter than ~10 tokens or on heavily paraphrased AI output.
* **Commercial use is disallowed** (CC-BY-NC-4.0).
---
## ⚠️ Future work
* Evaluate on adversarial paraphrases.
* Distill/quantize for edge deployment.
* Extend to multilingual detection.
---
## 📜 Citation
```bibtex
@misc{abdurazzoqov2025ptaidetector,
title = {pt-ai-detector: Detecting AI-generated Portuguese Text},
author = {Abdulla Abdurazzoqov},
year = {2025},
  howpublished = {Hugging Face Hub},
url = {https://huggingface.co/Detecting-ai/pt-ai-detector}
}
```
---
## 💬 License
Creative Commons **CC-BY-NC-4.0** — free for research & personal use; commercial use requires written permission.