# Sarcasm Detection with BERT
This repository contains a fine-tuned BERT model for detecting sarcasm in headlines and short text. The fine-tuned model reaches 93.86% accuracy on the held-out evaluation split when distinguishing sarcastic from non-sarcastic content.
---
## Model Details
- **Model Name:** BERT-Base-Uncased Fine-tuned for Sarcasm Detection
- **Model Architecture:** BERT Base (110M parameters)
- **Task:** Binary Classification (Sarcastic vs Non-Sarcastic)
- **Dataset:** Sarcasm Headlines Dataset
- **Quantization:** Float16 (for optimized deployment)
- **Fine-tuning Framework:** Hugging Face Transformers
---
## Dataset
The model was trained on the **Sarcasm Headlines Dataset**, which contains the following (a loading sketch follows this list):
- **Total Samples:** 26,709 headlines
- **Features:**
- `headline`: The text content to classify
- `is_sarcastic`: Binary label (1 for sarcastic, 0 for non-sarcastic)
- **Train/Test Split:** 90% training, 10% evaluation
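A minimal loading sketch, assuming the Kaggle JSON-lines file `Sarcasm_Headlines_Dataset.json` is available locally; the file name and split seed are assumptions, not artifacts of this repository:

```python
from datasets import load_dataset

# Load the JSON-lines headlines file (file name assumed; adjust to your local copy).
dataset = load_dataset(
    "json",
    data_files="Sarcasm_Headlines_Dataset.json",
    split="train",
)

# Keep only the two fields used for training.
dataset = dataset.remove_columns(
    [c for c in dataset.column_names if c not in ("headline", "is_sarcastic")]
)

# 90% train / 10% evaluation, matching the split described above (seed is an assumption).
splits = dataset.train_test_split(test_size=0.1, seed=42)
train_ds, eval_ds = splits["train"], splits["test"]
print(train_ds.num_rows, eval_ds.num_rows)
```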
---
## Performance Metrics
| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | 0.2048 | 0.1821 | 92.96% |
| 2 | 0.1138 | 0.2792 | 91.01% |
| 3 | 0.0586 | 0.2372 | **93.86%** |
**Final Model Performance:**
- **Best Accuracy:** 93.86%
- **Final Training Loss:** 0.146
---
## Installation
```bash
pip install transformers datasets evaluate scikit-learn torch
```
---
## Usage
### Quick Start
```python
from transformers import pipeline

# Load the fine-tuned model and tokenizer from the local directory
classifier = pipeline(
    "text-classification",
    model="./sarcasm_model",
    tokenizer="./sarcasm_model",
)

# Test examples
test_inputs = [
    "I'm absolutely thrilled to be stuck in traffic again.",
    "The weather is nice and sunny today.",
    "Oh great, another email from the boss with more tasks.",
]

for sentence in test_inputs:
    result = classifier(sentence)[0]
    label = "Sarcastic" if result["label"] == "LABEL_1" else "Not Sarcastic"
    print(f"'{sentence}' → {label} (Confidence: {result['score']:.2f})")
```
### Manual Model Loading
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained("./sarcasm_model")
tokenizer = AutoTokenizer.from_pretrained("./sarcasm_model")

# Tokenize input
text = "Oh wonderful, another Monday morning!"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)

# Inference
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

predicted_class = outputs.logits.argmax(dim=1).item()
label_mapping = {0: "Not Sarcastic", 1: "Sarcastic"}
confidence = predictions[0][predicted_class].item()

print(f"Prediction: {label_mapping[predicted_class]} (Confidence: {confidence:.2f})")
```
---
## Training Configuration
### Model Parameters
- **Base Model:** `bert-base-uncased`
- **Number of Labels:** 2 (binary classification)
- **Max Sequence Length:** 128 tokens
- **Tokenization:** WordPiece with padding and truncation
### Training Arguments
- **Learning Rate:** 2e-5
- **Batch Size:** 16 (training), 32 (evaluation)
- **Epochs:** 3
- **Weight Decay:** 0.01
- **Evaluation Strategy:** Every epoch
- **Optimizer:** AdamW (default; see the `Trainer` sketch below)
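As a hedged sketch (not the exact training script), the settings above map onto the Hugging Face `Trainer` API roughly as follows. `train_ds` and `eval_ds` are the splits from the dataset sketch above, and the output and logging paths are assumptions chosen to match the file layout below:

```python
import numpy as np
import evaluate
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def tokenize(batch):
    # WordPiece tokenization with padding/truncation to 128 tokens, as listed above.
    return tokenizer(batch["headline"], padding="max_length", truncation=True, max_length=128)

train_tok = train_ds.map(tokenize, batched=True).rename_column("is_sarcastic", "labels")
eval_tok = eval_ds.map(tokenize, batched=True).rename_column("is_sarcastic", "labels")

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=preds, references=labels)

args = TrainingArguments(
    output_dir="./sarcasm_model",       # assumed output path, matching the repo layout
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,
    eval_strategy="epoch",              # older transformers versions call this evaluation_strategy
    logging_dir="./logs",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_tok,
    eval_dataset=eval_tok,
    compute_metrics=compute_metrics,
)
trainer.train()
```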
### Hardware Requirements
- **GPU:** NVIDIA Tesla T4 (or equivalent)
- **Memory:** ~4GB GPU memory for training
- **Training Time:** ~18 minutes for 3 epochs
---
## Model Architecture
The model uses BERT's transformer architecture with:
- **Encoder Layers:** 12
- **Attention Heads:** 12
- **Hidden Size:** 768
- **Vocabulary Size:** 30,522
- **Classification Head:** Linear layer (768 β†’ 2)
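These dimensions can be read back from the saved configuration. A small check, assuming the fine-tuned model lives in `./sarcasm_model`:

```python
from transformers import AutoConfig, AutoModelForSequenceClassification

config = AutoConfig.from_pretrained("./sarcasm_model")
print(config.num_hidden_layers)    # 12 encoder layers
print(config.num_attention_heads)  # 12 attention heads
print(config.hidden_size)          # 768
print(config.vocab_size)           # 30522

# Total parameter count (~110M, including the 768 -> 2 classification head)
model = AutoModelForSequenceClassification.from_pretrained("./sarcasm_model")
print(sum(p.numel() for p in model.parameters()))
```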
---
## File Structure
```
sarcasm-detection/
├── sarcasm_model/              # Main fine-tuned model
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   ├── special_tokens_map.json
│   ├── vocab.txt
│   └── tokenizer.json
├── quantized-model/            # Float16 quantized version
│   ├── config.json
│   ├── model.safetensors
│   └── tokenizer files...
├── logs/                       # Training logs
├── sarcasm-detection.ipynb     # Training notebook
└── README.md                   # This file
```
---
## Quantization
A quantized version of the model is available for deployment optimization:
```python
import torch
from transformers import AutoModelForSequenceClassification

# Load the quantized model (Float16) and keep the weights in half precision
quantized_model = AutoModelForSequenceClassification.from_pretrained("./quantized-model")
quantized_model = quantized_model.to(dtype=torch.float16)
```
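For reference, a hedged sketch of how such a Float16 copy could be produced and saved; the exact procedure used to create `quantized-model/` may have differed:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the full-precision fine-tuned model, cast the weights to Float16, and save a copy.
model = AutoModelForSequenceClassification.from_pretrained("./sarcasm_model")
model = model.half()  # equivalent to .to(dtype=torch.float16)

model.save_pretrained("./quantized-model")
AutoTokenizer.from_pretrained("./sarcasm_model").save_pretrained("./quantized-model")
```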
**Benefits of Quantization:**
- **Reduced Memory Usage:** ~50% smaller model size
- **Faster Inference:** Improved speed on compatible hardware
- **Minimal Accuracy Loss:** Maintains classification performance
---
## Limitations
- **Domain Specificity:** Trained primarily on headlines; may not generalize perfectly to other text types
- **Context Dependency:** Sarcasm detection can be highly context-dependent and subjective
- **Cultural Nuances:** May not capture sarcasm patterns from different cultural contexts
- **Short Text Focus:** Optimized for headline-length text (typically under 128 tokens)
---
## Potential Improvements
- **Data Augmentation:** Include more diverse sarcasm examples
- **Ensemble Methods:** Combine multiple models for better accuracy
- **Context Integration:** Incorporate additional context beyond the headline
- **Multi-language Support:** Extend to other languages
- **Real-time Processing:** Optimize for streaming applications
---
## Applications
- **Social Media Monitoring:** Detect sarcastic comments and posts
- **Content Moderation:** Identify potentially misleading sarcastic content
- **Sentiment Analysis Enhancement:** Improve sentiment classification accuracy
- **News Analysis:** Analyze editorial tone and bias in headlines
- **Customer Feedback:** Better understand customer sentiment in reviews
---
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{sarcasm_detection_bert,
  title={BERT-based Sarcasm Detection for Headlines},
  author={Your Name},
  year={2025},
  note={Fine-tuned BERT model for binary sarcasm classification}
}
```
---
## Contributing
Contributions are welcome! Please feel free to:
- Report bugs or issues
- Suggest improvements
- Add new features
- Improve documentation
---
## License
This project is licensed under the MIT License. The underlying BERT model follows Google's Apache 2.0 license.
---
## Acknowledgments
- **Hugging Face** for the Transformers library
- **Google Research** for the original BERT model
- **Kaggle** for providing the Sarcasm Headlines Dataset
- **PyTorch** for the deep learning framework