---
license: cc-by-nc-4.0
language:
  - pt
tags:
  - ai-detection
  - text-classification
  - portuguese
  - bert
  - transformers
base_model: neuralmind/bert-base-portuguese-cased
pipeline_tag: text-classification
datasets:
  - wiki40b-pt
  - oscar-pt
  - cc100-pt
  - europarl-pt
  - opus-books-pt
  - Detecting-ai/ai_pt_corpus
model_type: bert
---

🇧🇷 pt-ai-detector

pt-ai-detector is a BERT-base model fine-tuned to decide whether a Portuguese sentence or paragraph was written by a human (label = 0) or generated by AI (label = 1).

| Metric | Value |
|---|---|
| Train data | 1 000 000 human + 1 000 000 AI |
| Balanced test set | 1 954 190 (½ human, ½ AI) |
| Accuracy | ≈ 99 % |
| F1 (macro) | ≈ 0.99 |
| Model size | ≈ 110 M parameters (≈ 430 MB) |
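
For readers unfamiliar with the two metrics, here is a minimal sketch of how accuracy and macro-averaged F1 are computed for a two-label task. The label lists are made-up toy data, not this model's evaluation results, and the helper names are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_for_label(y_true, y_pred, label):
    """F1 score treating `label` as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of the per-label F1 scores."""
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)


# Toy data for illustration only: 0 = Human, 1 = AI.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1]

toy_accuracy = accuracy(y_true, y_pred)
toy_macro_f1 = macro_f1(y_true, y_pred)
print(f"accuracy={toy_accuracy:.3f} macro_f1={toy_macro_f1:.3f}")
```

Because the test set is balanced, accuracy and macro F1 are expected to land close to each other, which matches the figures reported in the table.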

📖 Quick usage

Try it live at detecting-ai.com – our team at Detecting-ai built this model and demo so you can instantly test any Portuguese text online.

from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="Detecting-ai/pt-ai-detector",
    tokenizer="Detecting-ai/pt-ai-detector",
    device=0  # set -1 for CPU
)

text = "A inteligência artificial está transformando a educação."
print(clf(text))        # → [{'label': 'AI', 'score': 0.987}]

| id | label |
|---|---|
| 0 | Human |
| 1 | AI |
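
The `score` in the pipeline output is a probability obtained by applying a softmax to the model's two output logits, and the `label` comes from the id/label table above. A minimal pure-Python sketch of that mapping follows; the logit values are invented for illustration, and `ID2LABEL` is assumed to match the model's configuration.

```python
import math

# Mirrors the id/label table; assumed to match the model's config.
ID2LABEL = {0: "Human", 1: "AI"}


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_prediction(logits):
    """Return the top label and its probability, shaped like the pipeline output."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": probs[best]}


# Hypothetical logits for one text: index 0 = Human, index 1 = AI.
prediction = to_prediction([-2.0, 2.0])
print(prediction)
```

A score near 0.5 means the two logits were close, so such predictions are better treated as undecided than as a firm verdict.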

🏋️‍♂️ Training details

  • Base model: neuralmind/bert-base-portuguese-cased
  • Epochs: 3 (fp16 on 1 × A100)
  • Batch size: 32
  • Optimizer/LR: AdamW, 2 × 10⁻⁵
  • Loss: Cross-entropy
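
As a rough cross-check of these settings, the sketch below derives the number of optimizer steps from the training-set size reported in this card, the batch size, and the epoch count, and shows the cross-entropy loss for a single example. It assumes one optimizer step per batch (no gradient accumulation) and that every training example is seen once per epoch; the card states neither explicitly.

```python
import math

TRAIN_EXAMPLES = 1_000_000 + 1_000_000  # human + AI, as reported in this card
BATCH_SIZE = 32
EPOCHS = 3

# Assumption: one optimizer step per batch (no gradient accumulation).
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS


def cross_entropy(probs, target):
    """Negative log-likelihood of the correct class for one example."""
    return -math.log(probs[target])


# Hypothetical prediction: 90 % probability on the correct label (1 = AI).
example_loss = cross_entropy([0.1, 0.9], target=1)

print(steps_per_epoch, total_steps, round(example_loss, 4))
```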

Data sources

| Corpus | Lines used | Notes |
|---|---|---|
| Human corpus (wiki40b-pt, oscar-pt, cc100-pt, europarl-pt, opus-books-pt) | 1 M sampled | Diverse Portuguese web/news/books |
| AI corpus (Detecting-ai/ai_pt_corpus) | 1 M | Generated with OpenAI models (various GPT-4 / GPT-3.5 variants); prompts cover essays, news, tweets, dialogs, paraphrases; sampling temperature T = 0.6–1.0 |

Datasets were balanced 1 : 1 and shuffled before training.
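
A minimal sketch of 1 : 1 balancing and shuffling, assuming both corpora fit in memory as lists of strings. The function name, the seed, and the toy sentences are illustrative; the card does not describe the actual preprocessing script.

```python
import random


def balance_and_shuffle(human_texts, ai_texts, seed=42):
    """Downsample the larger class, attach labels, and shuffle the result."""
    rng = random.Random(seed)
    n = min(len(human_texts), len(ai_texts))
    # Sample without replacement so both classes contribute n examples.
    human = rng.sample(human_texts, n)
    ai = rng.sample(ai_texts, n)
    examples = [(text, 0) for text in human] + [(text, 1) for text in ai]
    rng.shuffle(examples)
    return examples


# Toy corpora for illustration only.
human_corpus = ["frase humana 1", "frase humana 2", "frase humana 3"]
ai_corpus = ["frase gerada 1", "frase gerada 2"]

dataset = balance_and_shuffle(human_corpus, ai_corpus)
label_counts = {
    label: sum(1 for _, lab in dataset if lab == label) for label in (0, 1)
}
print(label_counts)
```

Fixing the seed keeps the split reproducible, which matters when the same balanced set is reused for later comparisons.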


🚦 Intended use

Detect AI-generated Portuguese text in essays, articles, chats, support tickets, etc.

Limitations

  • Not trained on code or on text in languages other than Portuguese.
  • Accuracy may drop on texts shorter than 10 tokens or on heavily paraphrased AI output.
  • Commercial use is not permitted without written permission (CC-BY-NC-4.0).
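
One way to act on the short-text limitation is to abstain rather than return a verdict on very short inputs. The sketch below is hypothetical: it counts whitespace-separated words as a cheap stand-in for the model's WordPiece tokens (so the threshold is only approximate), `guarded_predict` and the `UNCERTAIN` label are names made up here, and `classify` is any callable with the same output shape as the pipeline in Quick usage.

```python
MIN_TOKENS = 10  # threshold taken from the limitation above


def guarded_predict(text, classify, min_tokens=MIN_TOKENS):
    """Run `classify` only when the text is long enough to be reliable."""
    # Whitespace words only approximate the tokenizer's subword tokens.
    n_tokens = len(text.split())
    if n_tokens < min_tokens:
        return {"label": "UNCERTAIN", "score": None, "tokens": n_tokens}
    result = classify(text)[0]
    return {"label": result["label"], "score": result["score"], "tokens": n_tokens}


def fake_classifier(text):
    """Stand-in for the real pipeline so the sketch runs offline."""
    return [{"label": "AI", "score": 0.9}]


short = guarded_predict("Bom dia a todos.", fake_classifier)
long = guarded_predict(
    "A inteligência artificial está transformando a educação em muitas escolas do país.",
    fake_classifier,
)
print(short["label"], long["label"])
```

In real use, `fake_classifier` would be replaced by the `clf` pipeline object, and counting tokens with the model's own tokenizer would give a more faithful threshold.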

⚠️ Future work

  • Evaluate on adversarial paraphrases.
  • Distill/quantize for edge deployment.
  • Extend to multilingual detection.

📜 Citation

@misc{abdurazzoqov2025ptaidetector,
  title   = {pt-ai-detector: Detecting AI-generated Portuguese Text},
  author  = {Abdulla Abdurazzoqov},
  year    = {2025},
  howpublished = {Hugging Face hub},
  url     = {https://huggingface.co/Detecting-ai/pt-ai-detector}
}

💬 License

Creative Commons CC-BY-NC-4.0: free for research and personal use; commercial use requires written permission.