xlm-r-spam-binary: Spam Review Detection for Vietnamese Text
This model is a fine-tuned version of xlm-roberta-large, trained on the ViSpamReviews dataset to detect spam in Vietnamese e-commerce reviews.
Model Details
- Base Model: xlm-roberta-large
- Description: XLM-RoBERTa Large - Multilingual model
- Dataset: ViSpamReviews (Vietnamese Spam Review Dataset)
- Fine-tuning Framework: HuggingFace Transformers
- Task: Spam Review Detection (binary)
- Number of Classes: 2
Hyperparameters
- Max sequence length: 256
- Learning rate: 5e-5
- Batch size: 32
- Epochs: 100
- Early stopping patience: 5
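The epoch count and early stopping patience interact: 100 is an upper bound, and training ends sooner once the validation metric stops improving for 5 consecutive epochs. The sketch below illustrates that rule in plain Python. It is not the training code used for this model; the function name, the "higher is better" assumption, and the validation scores are all hypothetical, and the card does not say which metric was monitored.

```python
def train_with_early_stopping(val_scores, max_epochs=100, patience=5):
    """Simulate early stopping over a list of per-epoch validation scores.

    Assumes a higher score is better (e.g. macro-F1); this is an assumption,
    since the card does not state the monitored metric.
    Returns (last_epoch_run, best_epoch, best_score), with epochs counted from 1.
    """
    best_score = float("-inf")
    best_epoch = 0
    epochs_without_improvement = 0
    last_epoch = 0
    for epoch, score in enumerate(val_scores[:max_epochs], start=1):
        last_epoch = epoch
        if score > best_score:
            # New best: remember it and reset the patience counter
            best_score = score
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # patience exhausted, stop training
    return last_epoch, best_epoch, best_score


# Hypothetical validation scores, for illustration only
scores = [0.70, 0.80, 0.85, 0.84, 0.85, 0.83, 0.84, 0.82, 0.81, 0.80]
stopped, best, best_score = train_with_early_stopping(scores)
print(f"stopped after epoch {stopped}, best epoch {best} ({best_score:.2f})")
```

With these made-up scores the best value is reached at epoch 3, and the run stops at epoch 8 after five epochs without a strict improvement.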
Dataset
The model was trained on the ViSpamReviews dataset, which contains 19,860 Vietnamese e-commerce review samples, split as follows:
- Train set: 14,299 samples (72%)
- Validation set: 1,590 samples (8%)
- Test set: 3,971 samples (20%)
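As a quick sanity check, the split sizes above can be summed and converted to percentages. The snippet uses only the numbers stated in this card; the variable names are illustrative.

```python
# Split sizes as reported in this model card
splits = {"train": 14_299, "validation": 1_590, "test": 3_971}

total = sum(splits.values())
# Share of each split, rounded to the nearest whole percent
shares = {name: round(100 * n / total) for name, n in splits.items()}

print(total)
print(shares)
```

The total matches the 19,860 samples reported for the dataset, and the rounded shares match the 72% / 8% / 20% split.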
Labels
- Non-spam (0): Genuine product reviews
- Spam (1): Fake or promotional reviews
Results
The model was evaluated on the test set with the following metrics:
- Accuracy: 0.9020
- Macro-F1: 0.8763
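Accuracy and Macro-F1 measure different things: accuracy counts correct predictions overall, while Macro-F1 averages the per-class F1 scores with equal weight, so a weaker class pulls it down more. The following is a minimal plain-Python sketch of both metrics. The helper names and the toy labels are hypothetical and are not the evaluation data or code used for this model.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, cls):
    """F1 score for one class, treating it as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Toy labels for illustration only: 0 = Non-spam, 1 = Spam
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

print(f"accuracy: {accuracy(y_true, y_pred):.2f}")
print(f"macro-F1: {macro_f1(y_true, y_pred):.2f}")
```

In this toy case a single missed spam review leaves accuracy at 0.90 but drops Macro-F1 to about 0.80, because the spam class F1 falls to 2/3.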
Usage
You can use this model for spam review detection in Vietnamese text. Below is an example:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
# Load model and tokenizer
model_name = "visolex/xlm-r-spam-binary"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Example review text
text = "Sản phẩm này rất tốt, shop giao hàng nhanh!"
# Tokenize
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
# Predict
with torch.no_grad():
    outputs = model(**inputs)
    predicted_class = outputs.logits.argmax(dim=-1).item()
    probabilities = torch.softmax(outputs.logits, dim=-1)
# Map to label
label_map = {0: "Non-spam", 1: "Spam"}
predicted_label = label_map[predicted_class]
confidence = probabilities[0][predicted_class].item()
print(f"Text: {text}")
print(f"Predicted: {predicted_label} (confidence: {confidence:.2%})")
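To make the last step of the example concrete, here is a dependency-free sketch of how two logits become a label and a confidence value: softmax turns the logits into probabilities, and the largest one selects the class. The `decode` helper and the logit values are hypothetical; only the label mapping comes from this card.

```python
import math

# Label mapping as documented in this model card
LABEL_MAP = {0: "Non-spam", 1: "Spam"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Return (label, confidence) for one example's logits."""
    probs = softmax(logits)
    predicted_class = max(range(len(probs)), key=probs.__getitem__)
    return LABEL_MAP[predicted_class], probs[predicted_class]


# Hypothetical logits, not produced by the actual model
label, confidence = decode([2.0, -1.0])
print(f"Predicted: {label} (confidence: {confidence:.2%})")
```

The confidence here is just the softmax probability of the winning class, so it should be read as a relative score rather than a calibrated probability.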
Citation
If you use this model, please cite:
@misc{{model_key}_spam_detection,
  title={XLM-RoBERTa Large - Multilingual model},
  author={ViSoLex Team},
  year={2025},
  howpublished={\url{https://huggingface.co/visolex/xlm-r-spam-binary}}
}
License
This model is released under the Apache-2.0 license.
Acknowledgments
- Base model: FacebookAI/xlm-roberta-large
- Dataset: ViSpamReviews (Vietnamese Spam Review Dataset)
- ViSoLex Toolkit for Vietnamese NLP