
	---
	license: cc-by-nc-4.0
	language:
	- pt
	tags:
	- ai-detection
	- text-classification
	- portuguese
	- bert
	- transformers
	base_model: neuralmind/bert-base-portuguese-cased
	pipeline_tag: text-classification
	datasets:
	- wiki40b-pt # From consolidated human sources
	- oscar-pt # From consolidated human sources
	- cc100-pt # From consolidated human sources
	- europarl-pt # From consolidated human sources
	- opus-books-pt # From consolidated human sources
	- Detecting-ai/ai_pt_corpus # AI-generated corpus
	model_type: bert
	---

	# 🇧🇷 pt-ai-detector

	pt-ai-detector is a BERT-base model fine-tuned to decide whether a Portuguese sentence or paragraph was written by a human (`label = 0`) or generated by AI (`label = 1`).

	| Item | Value |
	| --------------------- | ------------------------------ |
	| Train data | 1 000 000 human + 1 000 000 AI |
	| Balanced test set | 1 954 190 (½ human, ½ AI) |
	| Accuracy | ≈ 99 % |
	| F1 (macro) | ≈ 0.99 |
	| Model size | ≈ 110 M parameters (≈ 430 MB) |
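
For reference, accuracy and macro-F1 can be computed from a binary confusion matrix as sketched below. The counts in the example are made up for illustration and are not the model's evaluation results; the function name `accuracy_and_macro_f1` is hypothetical.

```python
from __future__ import annotations


def accuracy_and_macro_f1(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float]:
    """Return (accuracy, macro-F1) for a binary task.

    tp/fp/fn/tn are counted with AI (label 1) as the positive class.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total

    def f1(true_pos: int, false_pos: int, false_neg: int) -> float:
        denom = 2 * true_pos + false_pos + false_neg
        return 2 * true_pos / denom if denom else 0.0

    f1_ai = f1(tp, fp, fn)     # AI treated as the positive class
    f1_human = f1(tn, fn, fp)  # Human as positive: fp and fn swap roles
    return accuracy, (f1_ai + f1_human) / 2


# Illustrative counts only (NOT the model's real results).
acc, macro_f1 = accuracy_and_macro_f1(tp=990, fp=10, fn=10, tn=990)
print(f"accuracy={acc:.3f}  macro-F1={macro_f1:.3f}")
```

Macro-F1 averages the per-class F1 scores with equal weight, so on a balanced test set it stays close to accuracy.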

	---

	## 📖 Quick usage
	Try it live at [detecting-ai.com](https://detecting-ai.com/pt): the Detecting-ai team built this model and the demo so you can test any Portuguese text online.

	```python
	from transformers import pipeline

	clf = pipeline(
	    "text-classification",
	    model="Detecting-ai/pt-ai-detector",
	    tokenizer="Detecting-ai/pt-ai-detector",
	    device=0,  # GPU 0; set to -1 for CPU
	)

	text = "A inteligência artificial está transformando a educação."
	print(clf(text))  # e.g. [{'label': 'AI', 'score': 0.987}]
	```

	| id | label |
	| -- | ----- |
	| 0 | Human |
	| 1 | AI |
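
The label mapping above can be applied to the pipeline output as in this sketch. It assumes the pipeline returns the label strings `Human` and `AI` exactly as listed in the table. The helpers `to_label_id` and `classify`, the `threshold` parameter, and the use of `truncation=True` for long inputs are illustrative choices, not part of the original card.

```python
from __future__ import annotations

LABEL2ID = {"Human": 0, "AI": 1}  # mapping from the table above


def to_label_id(prediction: dict, threshold: float = 0.5) -> int:
    """Map one pipeline prediction to 0 (Human) or 1 (AI).

    `threshold` is a hypothetical knob: an "AI" prediction only counts
    as AI when its score reaches the threshold.
    """
    is_ai = LABEL2ID[prediction["label"]] == 1
    return 1 if is_ai and prediction["score"] >= threshold else 0


def classify(texts: list[str], threshold: float = 0.5) -> list[int]:
    """Classify a batch of texts; downloads the model on first use."""
    from transformers import pipeline  # imported lazily: heavy dependency

    clf = pipeline("text-classification", model="Detecting-ai/pt-ai-detector")
    # truncation=True cuts inputs that exceed the model's maximum length.
    predictions = clf(texts, truncation=True)
    return [to_label_id(p, threshold) for p in predictions]


# Offline demo on a prediction shaped like the pipeline output shown earlier.
print(to_label_id({"label": "AI", "score": 0.987}))
```

Calling `classify(["..."])` with real texts requires `transformers` and network access to fetch the model. Raising `threshold` trades recall on AI text for fewer false alarms on human text.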

	---

	## 🏋️‍♂️ Training details

	* Base model: `neuralmind/bert-base-portuguese-cased`
	* Epochs: 3 (fp16 on 1 × A100)
	* Batch size: 32
	* Optimizer / learning rate: AdamW, 2 × 10⁻⁵
	* Loss: Cross-entropy
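
As a rough sketch, the hyperparameters above could be expressed as a Hugging Face `Trainer` configuration. The card does not say which training framework was used, so `Trainer`, the `output_dir` value, and the helper names are assumptions. The step count assumes the 2 000 000 training examples listed in the summary table, a single GPU, and no gradient accumulation.

```python
import math

# Values taken from the list above.
HYPERPARAMS = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 32,
    "learning_rate": 2e-5,
    "fp16": True,
}


def total_optimizer_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps on one GPU without gradient accumulation."""
    return math.ceil(num_examples / batch_size) * epochs


def build_training_args(output_dir: str = "pt-ai-detector-out"):
    """Build TrainingArguments; needs `transformers` (and a GPU for fp16)."""
    from transformers import TrainingArguments  # lazy import: heavy dependency

    # AdamW is the Trainer's default optimizer, and cross-entropy is the
    # default loss of BertForSequenceClassification for single-label tasks.
    return TrainingArguments(output_dir=output_dir, **HYPERPARAMS)


steps = total_optimizer_steps(
    2_000_000,
    HYPERPARAMS["per_device_train_batch_size"],
    HYPERPARAMS["num_train_epochs"],
)
print(steps)  # 187500
```

Under these assumptions, one epoch is 62 500 steps, which helps when choosing warm-up and checkpoint intervals.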

	### Data sources

	| Corpus | Lines used | Notes |
	| ------ | ---------- | ----- |
	| Human corpus (wiki40b-pt, oscar-pt, cc100-pt, europarl-pt, opus-books-pt) | 1 M sampled | Diverse Portuguese web/news/books |
	| AI corpus (`Detecting-ai/ai_pt_corpus`) | 1 M | Generated with OpenAI models (various GPT-4 / GPT-3.5 variants); prompts cover essays, news, tweets, dialogs, and paraphrases; sampling temperature T = 0.6–1.0 |

	Datasets were balanced 1 : 1 and shuffled before training.
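
A minimal sketch of the 1 : 1 balancing and shuffling step, assuming both corpora fit in memory as lists of strings. The function name `build_balanced_dataset`, the fixed `seed`, and downsampling the larger class are assumptions; the card does not describe the exact procedure.

```python
import random


def build_balanced_dataset(human, ai, seed=42):
    """Return shuffled (text, label) pairs with equal class counts."""
    rng = random.Random(seed)
    n = min(len(human), len(ai))                        # size of the smaller class
    examples = [(t, 0) for t in rng.sample(human, n)]   # 0 = Human
    examples += [(t, 1) for t in rng.sample(ai, n)]     # 1 = AI
    rng.shuffle(examples)                               # mix the two classes
    return examples


data = build_balanced_dataset(["h1", "h2", "h3"], ["a1", "a2"])
print(len(data), sum(label for _, label in data))
```

Shuffling matters here because concatenated but unshuffled classes would give the model long single-label runs during training.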

	---

	## 🚦 Intended use

	Detect AI-generated Portuguese text in essays, articles, chats, support tickets, etc.

	### Limitations

	* Not trained on code or non-Portuguese text.
	* Accuracy may drop on texts shorter than 10 tokens or on heavily paraphrased AI output.
	* Commercial use is not permitted under the license (CC-BY-NC-4.0).
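
One way to act on the short-text limitation is to flag inputs below the 10-token mark instead of trusting their score. This sketch uses whitespace splitting as a stand-in for token counting, whereas the model's own tokenizer would give subword counts. `MIN_TOKENS`, `whitespace_token_count`, and `is_too_short` are hypothetical names.

```python
from typing import Callable

MIN_TOKENS = 10  # threshold from the limitation above


def whitespace_token_count(text: str) -> int:
    """Stand-in tokenizer: counts whitespace-separated words."""
    return len(text.split())


def is_too_short(
    text: str,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> bool:
    """True when the text has fewer than MIN_TOKENS tokens."""
    return count_tokens(text) < MIN_TOKENS


print(is_too_short("Olá, tudo bem?"))
```

To count subword tokens instead, a callable such as `lambda t: len(tokenizer.tokenize(t))` could be passed, with `tokenizer` loaded via `AutoTokenizer.from_pretrained("Detecting-ai/pt-ai-detector")`.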

	---

	## ⚠️ Future work

	* Evaluate on adversarial paraphrases.
	* Distill/quantize for edge deployment.
	* Extend to multilingual detection.

	---

	## 📜 Citation

	```bibtex
	@misc{abdurazzoqov2025ptaidetector,
	  title        = {pt-ai-detector: Detecting AI-generated Portuguese Text},
	  author       = {Abdulla Abdurazzoqov},
	  year         = {2025},
	  howpublished = {Hugging Face hub},
	  url          = {https://huggingface.co/Detecting-ai/pt-ai-detector}
	}
	```

	---

	## 💬 License

	Creative Commons CC-BY-NC-4.0: free for research and personal use; commercial use requires written permission.