---
license: apache-2.0
base_model: allenai/longformer-base-4096
tags:
- text-classification
- ai-generated-text-detection
- social-media
- longformer
language:
- en
datasets:
- tarryzhang/AIGTBench
metrics:
- accuracy
- f1
library_name: transformers
pipeline_tag: text-classification
---
# OSM-Det: Online Social Media Detector
## Model Description
**OSM-Det** (Online Social Media Detector) is an AI-generated text detection model designed specifically for social media content. It was introduced in the paper "[*Are We in the AI-Generated Text World Already? Quantifying and Monitoring AIGT on Social Media*](https://arxiv.org/abs/2412.18148)".
<div align="center">
<img src="pipeline.jpg" alt="AIGTBench Pipeline" width="800"/>
</div>
## Model Details
- **Base Model**: [allenai/longformer-base-4096](https://huggingface.co/allenai/longformer-base-4096)
- **Model Type**: Text Classification (Binary)
- **Architecture**: Longformer with classification head
- **Max Sequence Length**: 4096 tokens
- **Training Data**: [AIGTBench](https://huggingface.co/datasets/tarryzhang/AIGTBench)
### Quick Start
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained("tarryzhang/OSM-Det")
tokenizer = AutoTokenizer.from_pretrained("tarryzhang/OSM-Det")
# Example text
text = "Your text to analyze here..."
# Tokenize and predict
inputs = tokenizer(text, return_tensors="pt", max_length=4096, truncation=True, padding=True)
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=1).item()
# Interpret results
labels = ["Human-written", "AI-generated"]
confidence = predictions[0][predicted_class].item()
print(f"Prediction: {labels[predicted_class]}")
print(f"Confidence: {confidence:.3f}")
```
### Batch Processing
```python
import torch


def detect_ai_text_batch(texts, model, tokenizer, max_length=4096, batch_size=32):
    """Classify a list of texts and return one result dict per text."""
    results = []
    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i + batch_size]

        # Tokenize batch
        inputs = tokenizer(
            batch_texts,
            return_tensors="pt",
            max_length=max_length,
            truncation=True,
            padding=True,
        )

        # Predict
        with torch.no_grad():
            outputs = model(**inputs)
            predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
            predicted_classes = torch.argmax(predictions, dim=1)

        # Store results
        for j, text in enumerate(batch_texts):
            pred_class = predicted_classes[j].item()
            confidence = predictions[j][pred_class].item()
            results.append({
                'text': text,
                'prediction': 'AI-generated' if pred_class == 1 else 'Human-written',
                'confidence': confidence,
                'ai_probability': predictions[j][1].item(),
                'human_probability': predictions[j][0].item(),
            })
    return results
```
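
The dictionaries returned by `detect_ai_text_batch` can be post-processed without running the model again. The following is a minimal sketch, assuming you want a stricter decision rule than the argmax. The helper `summarize_results` and the `0.9` threshold are hypothetical choices for illustration, not part of the released code or the paper, and the example records are hand-written rather than real model outputs.

```python
def summarize_results(results, ai_threshold=0.9):
    """Summarize result dicts shaped like those from detect_ai_text_batch.

    ai_threshold is a hypothetical cutoff: a text is flagged only when its
    'ai_probability' is at least this value.
    """
    flagged = [r for r in results if r["ai_probability"] >= ai_threshold]
    total = len(results)
    return {
        "total": total,
        "flagged": len(flagged),
        "flagged_share": len(flagged) / total if total else 0.0,
        "flagged_texts": [r["text"] for r in flagged],
    }


# Hand-written example records (not real model outputs).
example_results = [
    {"text": "post A", "ai_probability": 0.97, "human_probability": 0.03},
    {"text": "post B", "ai_probability": 0.62, "human_probability": 0.38},
    {"text": "post C", "ai_probability": 0.08, "human_probability": 0.92},
]
summary = summarize_results(example_results)
print(summary["flagged"], "of", summary["total"], "texts flagged")
```

Raising the threshold trades recall for fewer false alarms on human-written posts; a suitable value depends on the application and should be chosen on held-out data.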
## Labels
- **0**: Human-written text
- **1**: AI-generated text
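
To make the label convention concrete, here is a minimal sketch that turns a pair of raw logits into a label name and a probability using only the standard library. The `ID2LABEL` dictionary and the `interpret_logits` helper are illustrative names, not part of the model's API, and the logits in the example are made up.

```python
import math

# Label ids as documented above.
ID2LABEL = {0: "Human-written", 1: "AI-generated"}


def interpret_logits(logits):
    """Apply a softmax to [human_logit, ai_logit] and return (label, probability)."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    denom = sum(exps)
    probs = [e / denom for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


label, prob = interpret_logits([-1.2, 2.3])  # made-up logits
print(label, round(prob, 3))
```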
## Training Details
### Training Data
OSM-Det was trained on [AIGTBench](https://huggingface.co/datasets/tarryzhang/AIGTBench), which includes:
- **28.77M AI-generated samples** from 12 different LLMs
- **13.55M human-written samples**
- Content from **Medium, Quora, and Reddit** platforms
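
The sample counts above imply a class imbalance. As a quick arithmetic check, the sketch below uses only the figures listed here (in millions) to compute the share of AI-generated samples and the accuracy that a trivial always-predict-one-class baseline would reach on data with the same class ratio. The variable names are illustrative.

```python
# Sample counts from the list above, in millions.
ai_samples_m = 28.77
human_samples_m = 13.55

total_m = ai_samples_m + human_samples_m
ai_share = ai_samples_m / total_m
human_share = human_samples_m / total_m

# A classifier that always predicts the larger class is correct
# exactly on that class's share of the data.
majority_baseline_accuracy = max(ai_share, human_share)

print(f"Total samples: {total_m:.2f}M")
print(f"AI-generated share: {ai_share:.1%}")
print(f"Majority-class baseline accuracy: {majority_baseline_accuracy:.1%}")
```

On imbalanced data like this, accuracy alone can be misleading, which is one reason to read it alongside F1.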
### Training Configuration
- **Base Model**: Longformer-base-4096
- **Training Epochs**: 10
- **Batch Size**: 5 per device
- **Gradient Accumulation**: 8 steps
- **Learning Rate**: 2e-5
- **Weight Decay**: 0.01
- **Max Sequence Length**: 4096 tokens
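
Because gradient accumulation is used, the number of samples per optimizer step is larger than the per-device batch size. The sketch below collects the values above in a dictionary keyed by the corresponding Hugging Face `TrainingArguments` parameter names and computes the effective batch size. The number of devices is not stated in this card, so `num_devices` is a placeholder assumption, and this is not the authors' training script.

```python
# Values from the list above, keyed by TrainingArguments parameter names.
training_config = {
    "num_train_epochs": 10,
    "per_device_train_batch_size": 5,
    "gradient_accumulation_steps": 8,
    "learning_rate": 2e-5,
    "weight_decay": 0.01,
}


def effective_batch_size(config, num_devices=1):
    """Samples contributing to one optimizer step across all devices."""
    return (
        config["per_device_train_batch_size"]
        * config["gradient_accumulation_steps"]
        * num_devices
    )


print(effective_batch_size(training_config))                 # single device
print(effective_batch_size(training_config, num_devices=4))  # hypothetical 4 devices
```

To reuse these settings, the dictionary can be unpacked into `TrainingArguments` together with an `output_dir` of your choice.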
## Citation
```bibtex
@inproceedings{SZSZLBZH25,
  title = {{Are We in the AI-Generated Text World Already? Quantifying and Monitoring AIGT on Social Media}},
  author = {Zhen Sun and Zongmin Zhang and Xinyue Shen and Ziyi Zhang and Yule Liu and Michael Backes and Yang Zhang and Xinlei He},
  booktitle = {{Annual Meeting of the Association for Computational Linguistics (ACL)}},
  pages = {},
  publisher = {ACL},
  year = {2025}
}
```
## Contact
- **Paper**: https://arxiv.org/abs/2412.18148
- **Dataset**: https://huggingface.co/datasets/tarryzhang/AIGTBench
- **Contact**: [email protected]
## License
Apache 2.0