tarryzhang/OSM-Det

@@ -1,199 +1,146 @@
 ---
 library_name: transformers
-tags: []
 ---
-# Model Card for Model ID
-<!-- Provide a quick summary of what the model is/does. -->
 ## Model Details
-### Model Description
-<!-- Provide a longer summary of what this model is. -->
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
-### Model Sources [optional]
-<!-- Provide the basic links for the model. -->
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
-## Uses
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-### Direct Use
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
-### Downstream Use [optional]
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
-### Out-of-Scope Use
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
-## Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-[More Information Needed]
-### Recommendations
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-## How to Get Started with the Model
-Use the code below to get started with the model.
-[More Information Needed]
 ## Training Details
 ### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
-### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-#### Preprocessing [optional]
-[More Information Needed]
-#### Training Hyperparameters
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-#### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
-## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
-### Results
-[More Information Needed]
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
-## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
-### Model Architecture and Objective
-[More Information Needed]
-### Compute Infrastructure
-[More Information Needed]
-#### Hardware
-[More Information Needed]
-#### Software
-[More Information Needed]
-## Citation [optional]
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-**BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
-## Model Card Contact
-[More Information Needed]

 ---
+license: apache-2.0
+base_model: allenai/longformer-base-4096
+tags:
+- text-classification
+- ai-generated-text-detection
+- social-media
+- longformer
+language:
+- en
+datasets:
+- tarryzhang/AIGTBench
+metrics:
+- accuracy
+- f1
 library_name: transformers
+pipeline_tag: text-classification
 ---
+# OSM-Det: Online Social Media Detector
+## Model Description
+**OSM-Det** (Online Social Media Detector) is a state-of-the-art AI-generated text detection model designed for social media content. It was introduced in the paper "[*Are We in the AI-Generated Text World Already? Quantifying and Monitoring AIGT on Social Media*](https://arxiv.org/abs/2412.18148)".
 ## Model Details
+- **Base Model**: [allenai/longformer-base-4096](https://huggingface.co/allenai/longformer-base-4096)
+- **Model Type**: Text Classification (Binary)
+- **Architecture**: Longformer with classification head
+- **Max Sequence Length**: 4096 tokens
+- **Training Data**: [AIGTBench](https://huggingface.co/datasets/tarryzhang/AIGTBench)
+### Quick Start
+```python
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+import torch
+# Load model and tokenizer
+model = AutoModelForSequenceClassification.from_pretrained("tarryzhang/OSM-Det")
+tokenizer = AutoTokenizer.from_pretrained("tarryzhang/OSM-Det")
+# Example text
+text = "Your text to analyze here..."
+# Tokenize and predict
+inputs = tokenizer(text, return_tensors="pt", max_length=4096, truncation=True, padding=True)
+with torch.no_grad():
+    outputs = model(**inputs)
+    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
+    predicted_class = torch.argmax(predictions, dim=1).item()
+# Interpret results
+labels = ["Human-written", "AI-generated"]
+confidence = predictions[0][predicted_class].item()
+print(f"Prediction: {labels[predicted_class]}")
+print(f"Confidence: {confidence:.3f}")
+```
+### Batch Processing
+```python
+import torch
+def detect_ai_text_batch(texts, model, tokenizer, max_length=4096, batch_size=32):
+    """Classify a list of texts in batches and return one result dict per text."""
+    results = []
+    for i in range(0, len(texts), batch_size):
+        batch_texts = texts[i:i+batch_size]
+        # Tokenize batch
+        inputs = tokenizer(
+            batch_texts,
+            return_tensors="pt",
+            max_length=max_length,
+            truncation=True,
+            padding=True
+        )
+        # Predict
+        with torch.no_grad():
+            outputs = model(**inputs)
+            predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
+            predicted_classes = torch.argmax(predictions, dim=1)
+        # Store results
+        for j, text in enumerate(batch_texts):
+            pred_class = predicted_classes[j].item()
+            confidence = predictions[j][pred_class].item()
+            results.append({
+                'text': text,
+                'prediction': 'AI-generated' if pred_class == 1 else 'Human-written',
+                'confidence': confidence,
+                'ai_probability': predictions[j][1].item(),
+                'human_probability': predictions[j][0].item()
+            })
+    return results
+```
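The list returned by `detect_ai_text_batch` can be aggregated into a corpus-level summary, such as the share of texts flagged as AI-generated. The following is a minimal sketch, assuming result dictionaries with the keys shown above. The helper `summarize_results` and the sample values are hypothetical and only for illustration.

```python
def summarize_results(results):
    """Aggregate per-text detection results into corpus-level statistics."""
    total = len(results)
    if total == 0:
        return {"total": 0, "ai_share": 0.0, "mean_ai_probability": 0.0}
    # Count texts whose predicted label is AI-generated.
    ai_count = sum(1 for r in results if r["prediction"] == "AI-generated")
    # Average the AI-class probability over all texts.
    mean_ai_prob = sum(r["ai_probability"] for r in results) / total
    return {
        "total": total,
        "ai_share": ai_count / total,
        "mean_ai_probability": mean_ai_prob,
    }


# Hypothetical results, shaped like the output of detect_ai_text_batch.
sample_results = [
    {"text": "a", "prediction": "AI-generated", "confidence": 0.9,
     "ai_probability": 0.9, "human_probability": 0.1},
    {"text": "b", "prediction": "Human-written", "confidence": 0.8,
     "ai_probability": 0.2, "human_probability": 0.8},
]
summary = summarize_results(sample_results)
print(summary)
```

The summary keeps the raw mean probability alongside the hard-label share, which helps when many predictions sit close to the decision boundary.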
+## Labels
+- **0**: Human-written text
+- **1**: AI-generated text
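With two classes, taking the argmax of the softmax output amounts to a 0.5 cutoff on the AI-generated probability. Where false positives are costly, a stricter cutoff may be preferable. The following is a minimal sketch of that idea. The helper `label_with_threshold` and the 0.9 value are illustrative assumptions, not settings from the paper.

```python
# Label mapping as documented in this model card.
LABELS = {0: "Human-written", 1: "AI-generated"}


def label_with_threshold(ai_probability, threshold=0.5):
    """Map the probability of class 1 to a label using a decision threshold."""
    if not 0.0 <= ai_probability <= 1.0:
        raise ValueError("ai_probability must be in [0, 1]")
    # Flag as AI-generated only when the probability exceeds the threshold.
    return LABELS[1] if ai_probability > threshold else LABELS[0]


print(label_with_threshold(0.7))                 # default 0.5 cutoff
print(label_with_threshold(0.7, threshold=0.9))  # stricter, hypothetical cutoff
```

Any non-default threshold should be chosen on held-out data rather than taken from this example.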
 ## Training Details
 ### Training Data
+OSM-Det was trained on [AIGTBench](https://huggingface.co/datasets/tarryzhang/AIGTBench), which includes:
+- **28.77M AI-generated samples** from 12 different LLMs
+- **13.55M human-written samples**
+- Content from **Medium, Quora, and Reddit** platforms
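The sample counts above imply an imbalanced class distribution, which is worth keeping in mind when reading accuracy figures. The following is a short sketch of the arithmetic, using only the counts listed above. Whether any evaluation split follows the same ratio is not stated here, so the majority-class share is only a rough reference point.

```python
# Sample counts in millions, as listed in the Training Data section.
ai_samples_m = 28.77
human_samples_m = 13.55

total_m = ai_samples_m + human_samples_m
ai_share = ai_samples_m / total_m
human_share = human_samples_m / total_m

print(f"Total samples: {total_m:.2f}M")
print(f"AI-generated share: {ai_share:.1%}")
print(f"Human-written share: {human_share:.1%}")
```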
+### Training Configuration
+- **Base Model**: Longformer-base-4096
+- **Training Epochs**: 10
+- **Batch Size**: 5 per device
+- **Gradient Accumulation**: 8 steps
+- **Learning Rate**: 2e-5
+- **Weight Decay**: 0.01
+- **Max Sequence Length**: 4096 tokens
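The per-device batch size and gradient accumulation steps above combine into an effective batch size per optimizer step. The following is a minimal sketch of that calculation. The number of devices is not stated in this card, so `num_devices` is a hypothetical parameter that defaults to 1.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of samples contributing to a single optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


# Values from the Training Configuration list; num_devices is an assumption.
per_device_effective = effective_batch_size(5, 8)
print(per_device_effective)  # 40
```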
+## Citation
+```bibtex
+@inproceedings{SZSZLBZH25,
+    title = {{Are We in the AI-Generated Text World Already? Quantifying and Monitoring AIGT on Social Media}},
+    author = {Zhen Sun and Zongmin Zhang and Xinyue Shen and Ziyi Zhang and Yule Liu and Michael Backes and Yang Zhang and Xinlei He},
+    booktitle = {{Annual Meeting of the Association for Computational Linguistics (ACL)}},
+    pages = {},
+    publisher = {ACL},
+    year = {2025}
+}
+```
+## Contact
+- **Paper**: https://arxiv.org/abs/2412.18148
+- **Dataset**: https://huggingface.co/datasets/tarryzhang/AIGTBench
+- **Contact**: [email protected]
+## License
+Apache 2.0