---
license: cc-by-nc-4.0
language:
- pt
tags:
- ai-detection
- text-classification
- portuguese
- bert
- transformers
base_model: neuralmind/bert-base-portuguese-cased
pipeline_tag: text-classification
datasets:
- wiki40b-pt
- oscar-pt
- cc100-pt
- europarl-pt
- opus-books-pt
- Detecting-ai/ai_pt_corpus
model_type: bert
---

# 🇧🇷 pt-ai-detector

**pt-ai-detector** is a BERT-base model fine-tuned to decide whether a Portuguese sentence or paragraph was written by a *human* (`label = 0`) or generated by *AI* (`label = 1`).

| Metric                | Value                          |
| --------------------- | ------------------------------ |
| **Train data**        | 1 000 000 human + 1 000 000 AI |
| **Balanced test set** | 1 954 190 (½ human, ½ AI)      |
| **Accuracy**          | ≈ 99 %                         |
| **F1 (macro)**        | ≈ 0.99                         |
| **Model size**        | ≈ 110 M parameters (≈ 430 MB)  |

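For readers less familiar with the two headline metrics, here is a minimal sketch of how accuracy and macro-averaged F1 follow from a binary confusion matrix. The counts are invented for illustration and are not the model's actual test results; the function name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy and macro F1 for a two-class problem (AI = positive class)."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total

    def f1(true_pos: int, false_pos: int, false_neg: int) -> float:
        denom = 2 * true_pos + false_pos + false_neg
        return 2 * true_pos / denom if denom else 0.0

    f1_ai = f1(tp, fp, fn)      # AI treated as the positive class
    f1_human = f1(tn, fn, fp)   # Human treated as the positive class
    return {"accuracy": accuracy, "f1_macro": (f1_ai + f1_human) / 2}


# Illustrative counts only (NOT the reported evaluation results).
metrics = binary_metrics(tp=990, fp=12, fn=10, tn=988)
print(metrics)
```

Macro F1 averages the per-class F1 scores with equal weight, which is why it stays close to accuracy on a balanced test set like the one described above.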

---

## 📖 Quick usage

**Try it live** at [detecting-ai.com](https://detecting-ai.com/pt) – our team at Detecting-ai built this model and demo so you can instantly test any Portuguese text online.

```python
import torch
from transformers import pipeline

# Load the fine-tuned classifier; use the first GPU when available, else CPU.
clf = pipeline(
    "text-classification",
    model="Detecting-ai/pt-ai-detector",
    tokenizer="Detecting-ai/pt-ai-detector",
    device=0 if torch.cuda.is_available() else -1,
)

text = "A inteligência artificial está transformando a educação."
print(clf(text))  # e.g. [{'label': 'AI', 'score': 0.987}]
```

| id | label |
| -- | ----- |
| 0  | Human |
| 1  | AI    |

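By default the pipeline returns only the top label and its score. A small helper can turn that into a single probability that the text is AI-generated and apply a decision threshold. This is a sketch: `p_ai`, `is_ai`, and the 0.5 default threshold are illustrative assumptions, and the code relies on the label names `Human` / `AI` from the table above.

```python
def p_ai(result: dict) -> float:
    """Convert one pipeline result into the probability of the AI class."""
    label, score = result["label"], float(result["score"])
    if label == "AI":
        return score
    if label == "Human":
        return 1.0 - score  # two-class softmax: probabilities sum to 1
    raise ValueError(f"unexpected label: {label!r}")


def is_ai(result: dict, threshold: float = 0.5) -> bool:
    """Flag a text as AI-generated when P(AI) reaches the threshold."""
    return p_ai(result) >= threshold


# Sample output copied from the usage example (no model call here).
sample = {"label": "AI", "score": 0.987}
print(p_ai(sample), is_ai(sample, threshold=0.9))  # 0.987 True
```

With the real pipeline, `clf(text)` returns a list, so the call would be `p_ai(clf(text)[0])`. Raising the threshold trades some recall for fewer human texts flagged as AI.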

---

## 🏋️ Training details

* **Base model:** `neuralmind/bert-base-portuguese-cased`
* **Epochs:** 3 (fp16 on 1 × A100)
* **Batch size:** 32
* **Optimizer/LR:** AdamW, learning rate 2 × 10⁻⁵
* **Loss:** Cross-entropy

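As a rough sanity check on the schedule, the hyperparameters above imply the following number of optimizer steps. This is back-of-the-envelope arithmetic, assuming all 2 000 000 training examples are seen once per epoch and that each optimizer step uses one batch of 32 (no gradient accumulation); the actual training script may differ.

```python
import math

train_examples = 1_000_000 + 1_000_000  # human + AI, from the table at the top
batch_size = 32
epochs = 3

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch)  # 62500
print(total_steps)      # 187500
```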

### Data sources

| Corpus | Lines used | Notes |
| ------ | ---------- | ----- |
| Human corpus (wiki40b-pt, oscar-pt, cc100-pt, europarl-pt, opus-books-pt) | 1 M sampled | Diverse Portuguese web, news, and book text |
| **AI corpus** (`Detecting-ai/ai_pt_corpus`) | 1 M | Generated with **OpenAI models** (various GPT-4 / GPT-3.5 variants); prompts cover essays, news, tweets, dialogs, and paraphrases; sampling temperature T = 0.6–1.0 |

The human and AI corpora were **balanced 1 : 1** and shuffled before training.

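A minimal sketch of 1 : 1 balancing and shuffling, assuming both corpora fit in memory as lists of strings. The function name `build_balanced_dataset`, the seed, and the toy sentences are illustrative; this is not the author's actual preprocessing code.

```python
import random


def build_balanced_dataset(human, ai, seed=42):
    """Return shuffled (text, label) pairs with equal class counts."""
    rng = random.Random(seed)
    n = min(len(human), len(ai))          # size of the smaller class
    human_sample = rng.sample(human, n)   # downsample without replacement
    ai_sample = rng.sample(ai, n)
    pairs = [(t, 0) for t in human_sample] + [(t, 1) for t in ai_sample]
    rng.shuffle(pairs)                    # mix classes before training
    return pairs


human_texts = ["Fui ao mercado ontem.", "O jogo acabou tarde.", "Chove muito hoje."]
ai_texts = ["A tecnologia evolui rapidamente.", "A educação é fundamental."]
data = build_balanced_dataset(human_texts, ai_texts)
print(len(data), sum(label for _, label in data))  # 4 2
```

Labels follow the card's convention: `0` for human text and `1` for AI text.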

---

## 🚦 Intended use

Detect AI-generated Portuguese text in essays, articles, chats, support tickets, etc.

### Limitations

* Not trained on code or on non-Portuguese text.
* Accuracy may drop on texts shorter than 10 tokens or on heavily paraphrased AI output.
* **Commercial use is not permitted** without written permission (CC-BY-NC-4.0).

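One way to act on the short-text limitation is to abstain instead of returning an unreliable verdict. The sketch below uses a whitespace word count as a cheap stand-in for the model's WordPiece token count; since every word yields at least one token, the check errs on the side of abstaining. `MIN_TOKENS`, `classify_or_abstain`, and the `UNCERTAIN` label are hypothetical names, and `clf` is any callable with the pipeline's interface.

```python
MIN_TOKENS = 10  # threshold taken from the limitation above


def classify_or_abstain(text, clf, min_tokens=MIN_TOKENS):
    """Run the classifier only when the text is long enough to be reliable."""
    n_words = len(text.split())  # rough proxy for the tokenizer's count
    if n_words < min_tokens:
        return {"label": "UNCERTAIN", "score": None, "reason": "text too short"}
    return clf(text)[0]


# Stand-in classifier so the example runs without downloading the model.
def fake_clf(text):
    return [{"label": "AI", "score": 0.9}]


print(classify_or_abstain("Bom dia!", fake_clf)["label"])  # UNCERTAIN
long_text = "A inteligência artificial está transformando a educação em muitas escolas do país."
print(classify_or_abstain(long_text, fake_clf)["label"])  # AI
```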

---

## ⚠️ Future work

* Evaluate on adversarial paraphrases.
* Distill/quantize for edge deployment.
* Extend to multilingual detection.

---

## 📜 Citation

```bibtex
@misc{abdurazzoqov2025ptaidetector,
  title        = {pt-ai-detector: Detecting AI-generated Portuguese Text},
  author       = {Abdulla Abdurazzoqov},
  year         = {2025},
  howpublished = {Hugging Face Hub},
  url          = {https://huggingface.co/Detecting-ai/pt-ai-detector}
}
```

---

## 💬 License

Creative Commons **CC-BY-NC-4.0**: free for research and personal use; commercial use requires written permission.