---
license: apache-2.0
base_model: allenai/longformer-base-4096
tags:
- text-classification
- ai-generated-text-detection
- social-media
- longformer
language:
- en
datasets:
- tarryzhang/AIGTBench
metrics:
- accuracy
- f1
library_name: transformers
pipeline_tag: text-classification
---

# OSM-Det: Online Social Media Detector

## Model Description

**OSM-Det** (Online Social Media Detector) is an AI-generated text detection model designed specifically for social media content. It was introduced in the paper "[*Are We in the AI-Generated Text World Already? Quantifying and Monitoring AIGT on Social Media*](https://arxiv.org/abs/2412.18148)".
<div align="center">
  <img src="pipeline.jpg" alt="AIGTBench Pipeline" width="800"/>
</div>

## Model Details

- **Base Model**: [allenai/longformer-base-4096](https://huggingface.co/allenai/longformer-base-4096)
- **Model Type**: Text Classification (Binary)
- **Architecture**: Longformer with classification head
- **Max Sequence Length**: 4096 tokens
- **Training Data**: [AIGTBench](https://huggingface.co/datasets/tarryzhang/AIGTBench)

### Quick Start

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained("tarryzhang/OSM-Det")
tokenizer = AutoTokenizer.from_pretrained("tarryzhang/OSM-Det")

# Example text
text = "Your text to analyze here..."

# Tokenize and predict
inputs = tokenizer(text, return_tensors="pt", max_length=4096, truncation=True, padding=True)
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=1).item()

# Interpret results
labels = ["Human-written", "AI-generated"]
confidence = predictions[0][predicted_class].item()

print(f"Prediction: {labels[predicted_class]}")
print(f"Confidence: {confidence:.3f}")
```

### Batch Processing

```python
import torch

def detect_ai_text_batch(texts, model, tokenizer, max_length=4096, batch_size=32):
    """Classify a list of texts in batches and return one result dict per text."""
    results = []
    
    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i+batch_size]
        
        # Tokenize batch
        inputs = tokenizer(
            batch_texts, 
            return_tensors="pt", 
            max_length=max_length, 
            truncation=True, 
            padding=True
        )
        
        # Predict
        with torch.no_grad():
            outputs = model(**inputs)
            predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
            predicted_classes = torch.argmax(predictions, dim=1)
            
        # Store results
        for j, text in enumerate(batch_texts):
            pred_class = predicted_classes[j].item()
            confidence = predictions[j][pred_class].item()
            results.append({
                'text': text,
                'prediction': 'AI-generated' if pred_class == 1 else 'Human-written',
                'confidence': confidence,
                'ai_probability': predictions[j][1].item(),
                'human_probability': predictions[j][0].item()
            })
    
    return results
```
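The result dictionaries returned above can be aggregated into a corpus-level estimate, for example the share of texts flagged as AI-generated. The following is a minimal sketch, not part of the released code: `summarize_detections` and its `threshold` parameter are hypothetical names, and the sample entries are hand-written stand-ins for real model output.

```python
def summarize_detections(results, threshold=0.5):
    """Count how many results have an AI probability at or above `threshold`.

    `results` is assumed to be a list of dicts that each contain an
    'ai_probability' key, like the dicts built by detect_ai_text_batch.
    """
    total = len(results)
    if total == 0:
        return {"total": 0, "ai_count": 0, "ai_share": 0.0}
    ai_count = sum(1 for r in results if r["ai_probability"] >= threshold)
    return {"total": total, "ai_count": ai_count, "ai_share": ai_count / total}


# Hand-written stand-ins for model output (not real predictions).
example_results = [
    {"ai_probability": 0.91, "human_probability": 0.09},
    {"ai_probability": 0.12, "human_probability": 0.88},
    {"ai_probability": 0.67, "human_probability": 0.33},
    {"ai_probability": 0.40, "human_probability": 0.60},
]

summary = summarize_detections(example_results)
print(summary)
```

Raising `threshold` trades recall for precision on the AI-generated class; a suitable value depends on the application and is not specified by this model card.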

## Labels

- **0**: Human-written text
- **1**: AI-generated text
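
To make the mapping from the two output logits to these labels explicit, here is a dependency-free sketch of the softmax-and-argmax step used in the usage examples. `ID2LABEL` and `classify_logits` are illustrative names, not identifiers shipped with the model, and the logit values are made up.

```python
import math

# Label ids as listed above.
ID2LABEL = {0: "Human-written", 1: "AI-generated"}


def classify_logits(logits):
    """Turn a pair of raw logits into (label, confidence) via softmax."""
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Pick the index with the highest probability (ties resolve to the lower id).
    best = max(range(len(probs)), key=lambda i: probs[i])
    return ID2LABEL[best], probs[best]


# Made-up logits for illustration only.
label, confidence = classify_logits([-1.0, 2.0])
print(label, round(confidence, 3))
```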

## Training Details

### Training Data

OSM-Det was trained on [AIGTBench](https://huggingface.co/datasets/tarryzhang/AIGTBench), which includes:
- **28.77M AI-generated samples** from 12 different LLMs
- **13.55M human-written samples** 
- Content from **Medium, Quora, and Reddit** platforms

### Training Configuration

- **Base Model**: Longformer-base-4096
- **Training Epochs**: 10
- **Batch Size**: 5 per device
- **Gradient Accumulation**: 8 steps
- **Learning Rate**: 2e-5
- **Weight Decay**: 0.01
- **Max Sequence Length**: 4096 tokens
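
With gradient accumulation, the number of sequences contributing to each optimizer step is the per-device batch size multiplied by the accumulation steps and the number of devices. The sketch below works through that arithmetic; the device count is not stated in this card, so `num_devices` is an assumption that defaults to 1, and `effective_batch_size` is an illustrative helper name.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Sequences that contribute to a single optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


# Values from the configuration above; num_devices=1 is an assumption.
per_step = effective_batch_size(per_device_batch_size=5, gradient_accumulation_steps=8)
print(per_step)  # 5 * 8 * 1 = 40
```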

## Citation

```bibtex
@inproceedings{SZSZLBZH25,
    title = {{Are We in the AI-Generated Text World Already? Quantifying and Monitoring AIGT on Social Media}},
    author = {Zhen Sun and Zongmin Zhang and Xinyue Shen and Ziyi Zhang and Yule Liu and Michael Backes and Yang Zhang and Xinlei He},
    booktitle = {{Annual Meeting of the Association for Computational Linguistics (ACL)}},
    pages = {},
    publisher = {ACL},
    year = {2025}
}
```

## Contact

- **Paper**: https://arxiv.org/abs/2412.18148
- **Dataset**: https://huggingface.co/datasets/tarryzhang/AIGTBench
- **Contact**: [email protected]

## License

Apache 2.0