|
--- |
|
license: apache-2.0 |
|
base_model: allenai/longformer-base-4096 |
|
tags: |
|
- text-classification |
|
- ai-generated-text-detection |
|
- social-media |
|
- longformer |
|
language: |
|
- en |
|
datasets: |
|
- tarryzhang/AIGTBench |
|
metrics: |
|
- accuracy |
|
- f1 |
|
library_name: transformers |
|
pipeline_tag: text-classification |
|
--- |
|
|
|
# OSM-Det: Online Social Media Detector |
|
|
|
## Model Description |
|
|
|
**OSM-Det** (Online Social Media Detector) is an AI-generated text detection model designed specifically for social media content. It was introduced in the paper "[*Are We in the AI-Generated Text World Already? Quantifying and Monitoring AIGT on Social Media*](https://arxiv.org/abs/2412.18148)".
|
<div align="center"> |
|
<img src="pipeline.jpg" alt="AIGTBench Pipeline" width="800"/> |
|
</div> |
|
|
|
## Model Details |
|
|
|
- **Base Model**: [allenai/longformer-base-4096](https://huggingface.co/allenai/longformer-base-4096) |
|
- **Model Type**: Text Classification (Binary) |
|
- **Architecture**: Longformer with classification head |
|
- **Max Sequence Length**: 4096 tokens |
|
- **Training Data**: [AIGTBench](https://huggingface.co/datasets/tarryzhang/AIGTBench) |
|
|
|
### Quick Start |
|
|
|
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained("tarryzhang/OSM-Det")
tokenizer = AutoTokenizer.from_pretrained("tarryzhang/OSM-Det")

# Example text
text = "Your text to analyze here..."

# Tokenize and predict
inputs = tokenizer(text, return_tensors="pt", max_length=4096, truncation=True, padding=True)
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=1).item()

# Interpret results
labels = ["Human-written", "AI-generated"]
confidence = predictions[0][predicted_class].item()

print(f"Prediction: {labels[predicted_class]}")
print(f"Confidence: {confidence:.3f}")
```
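On a GPU machine, inference is faster if the model and inputs are moved to the same device first. A minimal sketch (nothing here is specific to OSM-Det; it is standard PyTorch device handling):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

inputs = tokenizer(text, return_tensors="pt", max_length=4096, truncation=True, padding=True)
inputs = {k: v.to(device) for k, v in inputs.items()}  # tensors must live on the same device as the model
with torch.no_grad():
    predictions = torch.nn.functional.softmax(model(**inputs).logits, dim=-1)
```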
|
|
|
### Batch Processing |
|
|
|
```python
def detect_ai_text_batch(texts, model, tokenizer, max_length=4096, batch_size=32):
    results = []

    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i+batch_size]

        # Tokenize batch
        inputs = tokenizer(
            batch_texts,
            return_tensors="pt",
            max_length=max_length,
            truncation=True,
            padding=True
        )

        # Predict
        with torch.no_grad():
            outputs = model(**inputs)
            predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
            predicted_classes = torch.argmax(predictions, dim=1)

        # Store results
        for j, text in enumerate(batch_texts):
            pred_class = predicted_classes[j].item()
            confidence = predictions[j][pred_class].item()
            results.append({
                'text': text,
                'prediction': 'AI-generated' if pred_class == 1 else 'Human-written',
                'confidence': confidence,
                'ai_probability': predictions[j][1].item(),
                'human_probability': predictions[j][0].item()
            })

    return results
```
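For example, assuming the model and tokenizer from the Quick Start are already loaded:

```python
texts = [
    "Your first text to analyze...",
    "Your second text to analyze...",
]
results = detect_ai_text_batch(texts, model, tokenizer)
for r in results:
    print(f"{r['prediction']} ({r['confidence']:.3f}): {r['text'][:60]}")
```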
|
|
|
## Labels |
|
|
|
- **0**: Human-written text |
|
- **1**: AI-generated text |
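If you prefer to read the mapping from the checkpoint rather than hard-code it, you can inspect the config's `id2label` field. Whether it carries descriptive names or generic `LABEL_0`/`LABEL_1` placeholders depends on how the checkpoint was exported:

```python
print(model.config.id2label)  # e.g. {0: 'LABEL_0', 1: 'LABEL_1'} or descriptive names
```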
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
OSM-Det was trained on [AIGTBench](https://huggingface.co/datasets/tarryzhang/AIGTBench), which includes: |
|
- **28.77M AI-generated samples** from 12 different LLMs |
|
- **13.55M human-written samples** |
|
- Content from **Medium, Quora, and Reddit** platforms |
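A quick way to inspect the training data is to pull it from the Hub. The split name below is an assumption; consult the AIGTBench dataset card for the actual splits and schema:

```python
from datasets import load_dataset

# "train" is an assumed split name; check the dataset card for the real schema
ds = load_dataset("tarryzhang/AIGTBench", split="train")
print(ds[0])
```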
|
|
|
### Training Configuration |
|
|
|
- **Base Model**: Longformer-base-4096 |
|
- **Training Epochs**: 10 |
|
- **Batch Size**: 5 per device |
|
- **Gradient Accumulation**: 8 steps |
|
- **Learning Rate**: 2e-5 |
|
- **Weight Decay**: 0.01 |
|
- **Max Sequence Length**: 4096 tokens |
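With 5 samples per device and 8 gradient-accumulation steps, the effective batch size is 40 per device. A minimal `TrainingArguments` setup matching the listed hyperparameters might look like the sketch below; this illustrates the configuration and is not the authors' training script (`output_dir` is a placeholder, and the dataset/collator wiring is omitted):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="osm-det",            # placeholder output path
    num_train_epochs=10,
    per_device_train_batch_size=5,
    gradient_accumulation_steps=8,   # effective batch size: 5 * 8 = 40 per device
    learning_rate=2e-5,
    weight_decay=0.01,
)
```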
|
|
|
## Citation |
|
|
|
```bibtex
@inproceedings{SZSZLBZH25,
  title     = {{Are We in the AI-Generated Text World Already? Quantifying and Monitoring AIGT on Social Media}},
  author    = {Zhen Sun and Zongmin Zhang and Xinyue Shen and Ziyi Zhang and Yule Liu and Michael Backes and Yang Zhang and Xinlei He},
  booktitle = {{Annual Meeting of the Association for Computational Linguistics (ACL)}},
  pages     = {},
  publisher = {ACL},
  year      = {2025}
}
```
|
|
|
## Contact |
|
|
|
- **Paper**: https://arxiv.org/abs/2412.18148 |
|
- **Dataset**: https://huggingface.co/datasets/tarryzhang/AIGTBench |
|
- **Contact**: [email protected] |
|
|
|
## License |
|
|
|
Apache 2.0 |
|
|