---
license: apache-2.0
base_model: allenai/longformer-base-4096
tags:
- text-classification
- ai-generated-text-detection
- social-media
- longformer
language:
- en
datasets:
- tarryzhang/AIGTBench
metrics:
- accuracy
- f1
library_name: transformers
pipeline_tag: text-classification
---
# OSM-Det: Online Social Media Detector
## Model Description
**OSM-Det** (Online Social Media Detector) is an AI-generated text detection model specifically designed for social media content. This model is introduced in the paper "[*Are We in the AI-Generated Text World Already? Quantifying and Monitoring AIGT on Social Media*](https://arxiv.org/abs/2412.18148)".
<div align="center">
<img src="pipeline.jpg" alt="AIGTBench Pipeline" width="800"/>
</div>
## Model Details
- **Base Model**: [allenai/longformer-base-4096](https://huggingface.co/allenai/longformer-base-4096)
- **Model Type**: Text Classification (Binary)
- **Architecture**: Longformer with classification head
- **Max Sequence Length**: 4096 tokens
- **Training Data**: [AIGTBench](https://huggingface.co/datasets/tarryzhang/AIGTBench)
## Usage
### Quick Start
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained("tarryzhang/OSM-Det")
tokenizer = AutoTokenizer.from_pretrained("tarryzhang/OSM-Det")
# Example text
text = "Your text to analyze here..."
# Tokenize and predict
inputs = tokenizer(text, return_tensors="pt", max_length=4096, truncation=True, padding=True)
with torch.no_grad():
    outputs = model(**inputs)

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(predictions, dim=1).item()
# Interpret results
labels = ["Human-written", "AI-generated"]
confidence = predictions[0][predicted_class].item()
print(f"Prediction: {labels[predicted_class]}")
print(f"Confidence: {confidence:.3f}")
```
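Alternatively, since this checkpoint is tagged `text-classification`, the high-level `pipeline` API should also work; a minimal sketch (the exact label strings returned depend on the checkpoint's `id2label` config, which may carry generic names):

```python
from transformers import pipeline

# A minimal sketch using the generic text-classification pipeline.
# The returned label string depends on the checkpoint's id2label config
# (possibly generic "LABEL_0"/"LABEL_1"); per this card, 0 = human, 1 = AI.
detector = pipeline("text-classification", model="tarryzhang/OSM-Det")
result = detector("Your text to analyze here...", truncation=True, max_length=4096)
print(result)  # e.g. [{'label': 'LABEL_1', 'score': 0.97}]
```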
### Batch Processing
```python
def detect_ai_text_batch(texts, model, tokenizer, max_length=4096, batch_size=32):
    results = []
    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i+batch_size]

        # Tokenize batch
        inputs = tokenizer(
            batch_texts,
            return_tensors="pt",
            max_length=max_length,
            truncation=True,
            padding=True
        )

        # Predict
        with torch.no_grad():
            outputs = model(**inputs)
            predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
            predicted_classes = torch.argmax(predictions, dim=1)

        # Store results
        for j, text in enumerate(batch_texts):
            pred_class = predicted_classes[j].item()
            confidence = predictions[j][pred_class].item()
            results.append({
                'text': text,
                'prediction': 'AI-generated' if pred_class == 1 else 'Human-written',
                'confidence': confidence,
                'ai_probability': predictions[j][1].item(),
                'human_probability': predictions[j][0].item()
            })
    return results
```
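A short usage example for the helper above, reusing `model` and `tokenizer` from Quick Start (the sample texts are placeholders):

```python
texts = [
    "First social media post to check...",
    "Second social media post to check...",
]
for r in detect_ai_text_batch(texts, model, tokenizer):
    print(f"{r['prediction']} ({r['confidence']:.3f}): {r['text'][:50]}")
```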
## Labels
- **0**: Human-written text
- **1**: AI-generated text
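If you prefer not to hard-code this mapping, it can be read from the loaded model's config (a sketch; fall back to the table above if the config only carries generic `LABEL_0`/`LABEL_1` names):

```python
# Read the label mapping from the loaded model's config.
# Transformers always populates id2label, but the names may be the
# generic defaults; the documented mapping is 0 = human, 1 = AI.
print(model.config.id2label)  # e.g. {0: 'LABEL_0', 1: 'LABEL_1'}
```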
## Training Details
### Training Data
OSM-Det was trained on [AIGTBench](https://huggingface.co/datasets/tarryzhang/AIGTBench), which includes:
- **28.77M AI-generated samples** from 12 different LLMs
- **13.55M human-written samples**
- Content from **Medium, Quora, and Reddit** platforms
### Training Configuration
- **Base Model**: Longformer-base-4096
- **Training Epochs**: 10
- **Batch Size**: 5 per device
- **Gradient Accumulation**: 8 steps
- **Learning Rate**: 2e-5
- **Weight Decay**: 0.01
- **Max Sequence Length**: 4096 tokens
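For reference, the reported hyperparameters map onto Hugging Face `TrainingArguments` roughly as below. This is a sketch, not the authors' released training script; the output path is hypothetical, and dataset preparation plus the `Trainer` wiring are omitted. Note the effective batch size works out to 5 × 8 = 40 per device.

```python
from transformers import TrainingArguments

# Sketch of the reported hyperparameters (not the authors' exact script).
training_args = TrainingArguments(
    output_dir="osm-det-longformer",   # hypothetical output path
    num_train_epochs=10,
    per_device_train_batch_size=5,
    gradient_accumulation_steps=8,     # effective batch size: 5 * 8 = 40
    learning_rate=2e-5,
    weight_decay=0.01,
)
```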
## Citation
```bibtex
@inproceedings{SZSZLBZH25,
title = {{Are We in the AI-Generated Text World Already? Quantifying and Monitoring AIGT on Social Media}},
author = {Zhen Sun and Zongmin Zhang and Xinyue Shen and Ziyi Zhang and Yule Liu and Michael Backes and Yang Zhang and Xinlei He},
booktitle = {{Annual Meeting of the Association for Computational Linguistics (ACL)}},
pages = {},
publisher = {ACL},
year = {2025}
}
```
## Contact
- **Paper**: https://arxiv.org/abs/2412.18148
- **Dataset**: https://huggingface.co/datasets/tarryzhang/AIGTBench
- **Contact**: [email protected]
## License
Apache 2.0