---
language:
- en
- ru
- zh
- de
- es
- fr
- ja
- it
- pt
- ar
tags:
- text-classification
- spam-detection
- content-filtering
- security
- nlp
license: apache-2.0
base_model: FacebookAI/xlm-roberta-large
metrics:
- accuracy
library_name: transformers
---
# 🦅 Nickup Swallow (v1)
> **"Swallows spam, leaves the essence."**
**Nickup Swallow** is a high-performance, multilingual text classification model specifically engineered to act as a **Gatekeeper** for modern AI search engines and browsers.
It filters out aggressive spam, SEO content farms, adult content, and scams, ensuring that downstream LLMs (like GPT/T5) process only high-quality, relevant data.
## ✨ Key Features
* **🌍 True Multilingualism:** Built on `XLM-RoBERTa-Large`.
  * **Verified & tested:** English, Russian, Chinese, German, Spanish, French, Japanese.
  * **Supported:** 100+ additional languages covered by the base architecture.
* **🏎️ Blazing Fast:** Optimized for low-latency inference; ideal for real-time filtering layers (see the inference sketch after this list).
* **💎 Exclusive Dataset:** Trained on a unique, custom-parsed dataset of **230,000+** search snippets. The data was meticulously collected and labeled using Knowledge Distillation techniques specifically for this project.
* **🛡️ High Recall Philosophy:** The model is tuned to be a strict filter against "Hard Spam" (Casinos, Malware, Adult) while being lenient enough to preserve valuable information.
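Since low-latency filtering is the whole point, here is a minimal sketch of a throughput-friendly setup: half-precision weights on GPU with batched, no-grad inference. This is illustrative rather than official serving code — it assumes a CUDA device is available and that fp16 fits your accuracy budget, and it borrows the `Nickup-Swallow-v1` repo ID and the `label 1 = Useful` convention from the Usage section below.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical low-latency setup: fp16 weights on GPU, inference mode only.
model_name = "Nickup-Swallow-v1"  # repo ID as used in the Usage section
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, torch_dtype=torch.float16
).to("cuda").eval()

@torch.inference_mode()
def classify_batch(texts):
    # Pad to the longest text in the batch; max_length=256 matches the Usage example.
    inputs = tokenizer(texts, return_tensors="pt", padding=True,
                       truncation=True, max_length=256).to("cuda")
    probs = model(**inputs).logits.softmax(dim=-1)
    return probs[:, 1].tolist()  # assumes label 1 = Useful (see Usage below)

# Example: score a batch of snippets in one forward pass.
print(classify_batch([
    "Python is a high-level programming language.",
    "BUY VIAGRA!!! BEST CASINO 100% FREE SPINS CLICK HERE",
]))
```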
## 📊 Model Performance
| Metric | Value | Notes |
| :--- | :--- | :--- |
| **Accuracy** | **89.32%** | On a strict, balanced validation set |
| **Training Time** | ~3 hours | Single NVIDIA T4 (free Google Colab tier) |
| **Base Model** | XLM-RoBERTa-Large | 550M params |
## 🧪 Examples (Real-world Tests)
The model is highly confident in distinguishing academic/technical content from low-quality spam.
| Input Text | Language | Verdict | Confidence |
| :--- | :---: | :---: | :---: |
| *"Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability."* | EN 🇺🇸 | **✅ Useful** | **99.9%** |
| *"История правления Петра I: краткая биография и реформы"* | RU 🇷🇺 | **✅ Useful** | **97.4%** |
| *"Die Relativitätstheorie beschäftigt sich mit der Struktur von Raum und Zeit."* | DE 🇩🇪 | **✅ Useful** | **99.7%** |
| *"人工智能是计算机科学的一个分支。"* | CN 🇨🇳 | **✅ Useful** | **99.6%** |
| *"BUY VIAGRA!!! BEST CASINO 100% FREE SPINS CLICK HERE"* | EN 🇺🇸 | **🗑️ Spam** | **99.4%** (Spam) |
| *"СКАЧАТЬ БЕСПЛАТНО БЕЗ СМС РЕГИСТРАЦИИ КЛЮЧИ АКТИВАЦИИ"* | RU 🇷🇺 | **🗑️ Spam** | **97.5%** (Spam) |
| *"ALARGA TU PENE 5 CM EN UNA SEMANA"* | ES 🇪🇸 | **🗑️ Spam** | **92.3%** (Spam) |
## 💻 Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

# Load from Hugging Face
model_name = "Nickup-Swallow-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def classify(text):
    """Return the probability that `text` is useful (label 1)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = F.softmax(outputs.logits, dim=-1)
    # Label 1 = Useful, Label 0 = Spam
    return probs[0][1].item()

# Try it out
text = "Download free cracked software no virus 2024"
score = classify(text)
if score < 0.15:  # Threshold can be adjusted for higher recall
    print(f"⛔ Blocked (Confidence: {1 - score:.2%})")
else:
    print(f"✅ Allowed (Confidence: {score:.2%})")
```