|
--- |
|
license: apache-2.0 |
|
base_model: allenai/longformer-base-4096 |
|
tags: |
|
- text-classification |
|
- ai-generated-text-detection |
|
- social-media |
|
- longformer |
|
language: |
|
- en |
|
datasets: |
|
- tarryzhang/AIGTBench |
|
metrics: |
|
- accuracy |
|
- f1 |
|
library_name: transformers |
|
pipeline_tag: text-classification |
|
--- |
|
|
|
# OSM-Det: Online Social Media Detector |
|
|
|
## Model Description |
|
|
|
**OSM-Det** (Online Social Media Detector) is an AI-generated text detection model designed specifically for social media content. It was introduced in the paper "[*Are We in the AI-Generated Text World Already? Quantifying and Monitoring AIGT on Social Media*](https://arxiv.org/abs/2412.18148)".
|
<div align="center"> |
|
<img src="pipeline.jpg" alt="AIGTBench Pipeline" width="800"/> |
|
</div> |
|
|
|
## Model Details |
|
|
|
- **Base Model**: [allenai/longformer-base-4096](https://huggingface.co/allenai/longformer-base-4096) |
|
- **Model Type**: Text Classification (Binary) |
|
- **Architecture**: Longformer with classification head |
|
- **Max Sequence Length**: 4096 tokens |
|
- **Training Data**: [AIGTBench](https://huggingface.co/datasets/tarryzhang/AIGTBench) |
|
|
|
### Quick Start |
|
|
|
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained("tarryzhang/OSM-Det")
tokenizer = AutoTokenizer.from_pretrained("tarryzhang/OSM-Det")

# Example text
text = "Your text to analyze here..."

# Tokenize and predict
inputs = tokenizer(text, return_tensors="pt", max_length=4096, truncation=True, padding=True)
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=1).item()

# Interpret results
labels = ["Human-written", "AI-generated"]
confidence = predictions[0][predicted_class].item()

print(f"Prediction: {labels[predicted_class]}")
print(f"Confidence: {confidence:.3f}")
```
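On a GPU machine, inference is faster if the model and inputs are moved to the same device first. A minimal sketch (nothing here is specific to OSM-Det; it is standard PyTorch device handling):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

inputs = tokenizer(text, return_tensors="pt", max_length=4096, truncation=True, padding=True)
inputs = {k: v.to(device) for k, v in inputs.items()}  # tensors must live on the same device as the model
with torch.no_grad():
    predictions = torch.nn.functional.softmax(model(**inputs).logits, dim=-1)
```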
|
|
|
### Batch Processing |
|
|
|
```python
def detect_ai_text_batch(texts, model, tokenizer, max_length=4096, batch_size=32):
    results = []

    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i+batch_size]

        # Tokenize batch
        inputs = tokenizer(
            batch_texts,
            return_tensors="pt",
            max_length=max_length,
            truncation=True,
            padding=True
        )

        # Predict
        with torch.no_grad():
            outputs = model(**inputs)
            predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
            predicted_classes = torch.argmax(predictions, dim=1)

        # Store results
        for j, text in enumerate(batch_texts):
            pred_class = predicted_classes[j].item()
            confidence = predictions[j][pred_class].item()
            results.append({
                'text': text,
                'prediction': 'AI-generated' if pred_class == 1 else 'Human-written',
                'confidence': confidence,
                'ai_probability': predictions[j][1].item(),
                'human_probability': predictions[j][0].item()
            })

    return results
```
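For example, assuming the model and tokenizer from the Quick Start are already loaded:

```python
texts = [
    "Your first text to analyze...",
    "Your second text to analyze...",
]
results = detect_ai_text_batch(texts, model, tokenizer)
for r in results:
    print(f"{r['prediction']} ({r['confidence']:.3f}): {r['text'][:60]}")
```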
|
|
|
## Labels |
|
|
|
- **0**: Human-written text |
|
- **1**: AI-generated text |
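If you prefer to read the mapping from the checkpoint rather than hard-code it, you can inspect the config's `id2label` field. Whether it carries descriptive names or generic `LABEL_0`/`LABEL_1` placeholders depends on how the checkpoint was exported:

```python
print(model.config.id2label)  # e.g. {0: 'LABEL_0', 1: 'LABEL_1'} or descriptive names
```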
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
OSM-Det was trained on [AIGTBench](https://huggingface.co/datasets/tarryzhang/AIGTBench), which includes: |
|
- **28.77M AI-generated samples** from 12 different LLMs |
|
- **13.55M human-written samples** |
|
- Content from **Medium, Quora, and Reddit** platforms |
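A quick way to inspect the training data is to pull it from the Hub. The split name below is an assumption; consult the AIGTBench dataset card for the actual splits and schema:

```python
from datasets import load_dataset

# "train" is an assumed split name; check the dataset card for the real schema
ds = load_dataset("tarryzhang/AIGTBench", split="train")
print(ds[0])
```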
|
|
|
### Training Configuration |
|
|
|
- **Base Model**: Longformer-base-4096 |
|
- **Training Epochs**: 10 |
|
- **Batch Size**: 5 per device |
|
- **Gradient Accumulation**: 8 steps |
|
- **Learning Rate**: 2e-5 |
|
- **Weight Decay**: 0.01 |
|
- **Max Sequence Length**: 4096 tokens |
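With 5 samples per device and 8 gradient-accumulation steps, the effective batch size is 40 per device. A minimal `TrainingArguments` setup matching the listed hyperparameters might look like the sketch below; this illustrates the configuration and is not the authors' training script (`output_dir` is a placeholder, and the dataset/collator wiring is omitted):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="osm-det",            # placeholder output path
    num_train_epochs=10,
    per_device_train_batch_size=5,
    gradient_accumulation_steps=8,   # effective batch size: 5 * 8 = 40 per device
    learning_rate=2e-5,
    weight_decay=0.01,
)
```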
|
|
|
## Citation |
|
|
|
```bibtex
@inproceedings{SZSZLBZH25,
  title     = {{Are We in the AI-Generated Text World Already? Quantifying and Monitoring AIGT on Social Media}},
  author    = {Zhen Sun and Zongmin Zhang and Xinyue Shen and Ziyi Zhang and Yule Liu and Michael Backes and Yang Zhang and Xinlei He},
  booktitle = {{Annual Meeting of the Association for Computational Linguistics (ACL)}},
  pages     = {},
  publisher = {ACL},
  year      = {2025}
}
```
|
|
|
## Contact |
|
|
|
- **Paper**: https://arxiv.org/abs/2412.18148 |
|
- **Dataset**: https://huggingface.co/datasets/tarryzhang/AIGTBench |
|
- **Contact**: [email protected] |
|
|
|
## License |
|
|
|
Apache 2.0 |
|
|