---
license: apache-2.0
base_model: allenai/longformer-base-4096
tags:
- text-classification
- ai-generated-text-detection
- social-media
- longformer
language:
- en
datasets:
- tarryzhang/AIGTBench
metrics:
- accuracy
- f1
library_name: transformers
pipeline_tag: text-classification
---

# OSM-Det: Online Social Media Detector

## Model Description

**OSM-Det** (Online Social Media Detector) is an AI-generated text detection model designed specifically for social media content. It was introduced in the paper "[*Are We in the AI-Generated Text World Already? Quantifying and Monitoring AIGT on Social Media*](https://arxiv.org/abs/2412.18148)".

<div align="center">
  <img src="pipeline.jpg" alt="AIGTBench Pipeline" width="800"/>
</div>

## Model Details

- **Base Model**: [allenai/longformer-base-4096](https://huggingface.co/allenai/longformer-base-4096)
- **Model Type**: Text Classification (Binary)
- **Architecture**: Longformer with classification head
- **Max Sequence Length**: 4096 tokens
- **Training Data**: [AIGTBench](https://huggingface.co/datasets/tarryzhang/AIGTBench)

### Quick Start

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained("tarryzhang/OSM-Det")
tokenizer = AutoTokenizer.from_pretrained("tarryzhang/OSM-Det")

# Example text
text = "Your text to analyze here..."

# Tokenize and predict
inputs = tokenizer(text, return_tensors="pt", max_length=4096, truncation=True, padding=True)
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=1).item()

# Interpret results
labels = ["Human-written", "AI-generated"]
confidence = predictions[0][predicted_class].item()

print(f"Prediction: {labels[predicted_class]}")
print(f"Confidence: {confidence:.3f}")
```

### Batch Processing

```python
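import torch


def detect_ai_text_batch(texts, model, tokenizer, max_length=4096, batch_size=32):
    """Classify texts in batches; returns one dict per text with the label and class probabilities."""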
    results = []
    
    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i+batch_size]
        
        # Tokenize batch
        inputs = tokenizer(
            batch_texts, 
            return_tensors="pt", 
            max_length=max_length, 
            truncation=True, 
            padding=True
        )
        
        # Predict
        with torch.no_grad():
            outputs = model(**inputs)
            predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
            predicted_classes = torch.argmax(predictions, dim=1)
            
        # Store results
        for j, text in enumerate(batch_texts):
            pred_class = predicted_classes[j].item()
            confidence = predictions[j][pred_class].item()
            results.append({
                'text': text,
                'prediction': 'AI-generated' if pred_class == 1 else 'Human-written',
                'confidence': confidence,
                'ai_probability': predictions[j][1].item(),
                'human_probability': predictions[j][0].item()
            })
    
    return results
```
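
The list returned by `detect_ai_text_batch` can be aggregated to estimate how much of a collection is flagged as AI-generated. The following is a minimal sketch: `summarize_results` and its `min_confidence` parameter are hypothetical names that are not part of this repository, and the sample records are hand-written placeholders in the same shape as the function's output, not real model predictions.

```python
def summarize_results(results, min_confidence=0.0):
    """Aggregate detect_ai_text_batch output (hypothetical helper, not part of the model repo)."""
    # Keep only predictions at or above the requested confidence.
    kept = [r for r in results if r["confidence"] >= min_confidence]
    ai_count = sum(1 for r in kept if r["prediction"] == "AI-generated")
    return {
        "total": len(results),
        "kept": len(kept),
        "ai_generated": ai_count,
        # Share of AI-flagged texts among the kept predictions.
        "ai_ratio": ai_count / len(kept) if kept else 0.0,
    }


# Placeholder records shaped like the function's output (not real model output).
sample = [
    {"text": "a", "prediction": "AI-generated", "confidence": 0.98,
     "ai_probability": 0.98, "human_probability": 0.02},
    {"text": "b", "prediction": "Human-written", "confidence": 0.95,
     "ai_probability": 0.05, "human_probability": 0.95},
    {"text": "c", "prediction": "AI-generated", "confidence": 0.55,
     "ai_probability": 0.55, "human_probability": 0.45},
]

summary = summarize_results(sample, min_confidence=0.9)
print(summary)
```

Filtering on confidence trades coverage for precision: a higher `min_confidence` drops borderline predictions, so `ai_ratio` is then computed over fewer texts.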

## Labels

- **0**: Human-written text
- **1**: AI-generated text
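
The examples in this card pick the class with the highest probability. If a stricter decision rule is needed, one option is to threshold the probability of label 1 instead. The sketch below uses plain Python so it runs without the model; `label_from_logits` and `ai_threshold` are illustrative names and the threshold values are assumptions, not settings recommended by the authors.

```python
import math

ID2LABEL = {0: "Human-written", 1: "AI-generated"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_from_logits(logits, ai_threshold=0.5):
    """Map two-class logits [human, ai] to a label name and the AI probability."""
    ai_prob = softmax(logits)[1]
    # Predict label 1 only when the AI probability exceeds the threshold.
    label_id = 1 if ai_prob > ai_threshold else 0
    return ID2LABEL[label_id], ai_prob


print(label_from_logits([-1.0, 2.0]))                     # default threshold
print(label_from_logits([-1.0, 2.0], ai_threshold=0.99))  # stricter threshold
```

With the default threshold of 0.5 this matches the argmax rule for two classes whenever the probabilities differ; raising the threshold reduces false "AI-generated" flags at the cost of missing some AI-generated text.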

## Training Details

### Training Data

OSM-Det was trained on [AIGTBench](https://huggingface.co/datasets/tarryzhang/AIGTBench), which includes:
- **28.77M AI-generated samples** from 12 different LLMs
- **13.55M human-written samples** 
- Content from **Medium, Quora, and Reddit** platforms
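
The two sample counts imply an imbalanced label distribution. The arithmetic below only restates the figures listed above; the variable names are illustrative.

```python
# Sample counts (in millions) as listed above.
ai_samples_m = 28.77
human_samples_m = 13.55

total_m = ai_samples_m + human_samples_m
ai_share = ai_samples_m / total_m
human_share = human_samples_m / total_m

print(f"Total samples: {total_m:.2f}M")             # 42.32M
print(f"AI-generated share: {ai_share:.1%}")        # 68.0%
print(f"Human-written share: {human_share:.1%}")    # 32.0%
```

If an evaluation split keeps this ratio (an assumption; this card does not state the split composition), a classifier that always predicts label 1 would reach about 68% accuracy, so F1 is a useful complement to accuracy when reading results.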

### Training Configuration

- **Base Model**: Longformer-base-4096
- **Training Epochs**: 10
- **Batch Size**: 5 per device
- **Gradient Accumulation**: 8 steps
- **Learning Rate**: 2e-5
- **Weight Decay**: 0.01
- **Max Sequence Length**: 4096 tokens
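
The per-device batch size and gradient accumulation steps combine into the number of sequences seen per optimizer step. The sketch below works through that arithmetic; the number of devices is not stated in this card, so `num_devices` is an assumption that defaults to 1, and the dictionary keys are illustrative names rather than the authors' training script.

```python
# Hyperparameters as listed above (key names are illustrative).
training_config = {
    "num_train_epochs": 10,
    "per_device_train_batch_size": 5,
    "gradient_accumulation_steps": 8,
    "learning_rate": 2e-5,
    "weight_decay": 0.01,
    "max_seq_length": 4096,
}


def effective_batch_size(config, num_devices=1):
    """Sequences contributing to each optimizer step (num_devices is an assumption)."""
    return (
        config["per_device_train_batch_size"]
        * config["gradient_accumulation_steps"]
        * num_devices
    )


print(effective_batch_size(training_config))  # 40 sequences per step on one device
```

Gradient accumulation is a common way to reach a larger effective batch when 4096-token sequences leave little memory for a large per-device batch.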

## Citation

```bibtex
@inproceedings{SZSZLBZH25,
    title = {{Are We in the AI-Generated Text World Already? Quantifying and Monitoring AIGT on Social Media}},
    author = {Zhen Sun and Zongmin Zhang and Xinyue Shen and Ziyi Zhang and Yule Liu and Michael Backes and Yang Zhang and Xinlei He},
    booktitle = {{Annual Meeting of the Association for Computational Linguistics (ACL)}},
    pages = {},
    publisher = {ACL},
    year = {2025}
}
```

## Contact

- **Paper**: https://arxiv.org/abs/2412.18148
- **Dataset**: https://huggingface.co/datasets/tarryzhang/AIGTBench
- **Contact**: [email protected]

## License

Apache 2.0