🦅 Nickup Swallow (v1)

"Swallows spam, leaves the essence."

Nickup Swallow is a high-performance, multilingual text classification model specifically engineered to act as a Gatekeeper for modern AI search engines and browsers.

It filters out aggressive spam, SEO content farms, adult content, and scams, ensuring that downstream LLMs (like GPT/T5) process only high-quality, relevant data.

✨ Key Features

  • 🌍 True Multilingualism: Built on XLM-RoBERTa-Large.
    • Verified & Tested: English, Russian, Chinese, German, Spanish, French, Japanese.
    • Supported: 100+ additional languages covered by the base architecture.
  • 🏎️ Blazing Fast: Optimized for low-latency inference. Ideal for real-time filtering layers.
  • 💎 Exclusive Dataset: Trained on a unique, custom-parsed dataset of 230,000+ search snippets. The data was meticulously collected and labeled using Knowledge Distillation techniques specifically for this project.
  • 🛡️ High Recall Philosophy: The model is tuned to be a strict filter against "Hard Spam" (Casinos, Malware, Adult) while being lenient enough to preserve valuable information.

📊 Model Performance

| Metric | Value | Notes |
| --- | --- | --- |
| Accuracy | 89.32% | On a strict, balanced validation set |
| Training Time | ~3 hours | Trained on NVIDIA T4 (Google Colab FREE!) |
| Base Model | XLM-RoBERTa-Large | 550M params |
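
The accuracy above was measured on the project's own validation split, which is not published here. As a rough guide, a minimal sketch of how you could run a comparable evaluation on your own labeled snippets might look like the following (the file `val.csv` and its `text`/`label` columns are hypothetical, not part of this release):

# Evaluation sketch (assumes a hypothetical val.csv with "text" and "label" columns)
import pandas as pd
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "NickupAI/Nickup-Swallow-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

df = pd.read_csv("val.csv")  # columns: text, label (1 = useful, 0 = spam)

correct = 0
for text, label in zip(df["text"], df["label"]):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        probs = F.softmax(model(**inputs).logits, dim=-1)
    pred = int(probs[0][1] > 0.5)  # predict "useful" if its probability exceeds 0.5
    correct += int(pred == label)

print(f"Accuracy: {correct / len(df):.2%}")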

🧪 Examples (Real-world Tests)

The model is highly confident in distinguishing academic/technical content from low-quality spam.

| Input Text | Language | Verdict | Confidence |
| --- | --- | --- | --- |
| "Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability." | EN 🇺🇸 | ✅ Useful | 99.9% |
| "История правления Петра I: краткая биография и реформы" | RU 🇷🇺 | ✅ Useful | 97.4% |
| "Die Relativitätstheorie beschäftigt sich mit der Struktur von Raum und Zeit." | DE 🇩🇪 | ✅ Useful | 99.7% |
| "人工智能是计算机科学的一个分支。" | ZH 🇨🇳 | ✅ Useful | 99.6% |
| "BUY VIAGRA!!! BEST CASINO 100% FREE SPINS CLICK HERE" | EN 🇺🇸 | 🗑️ Spam | 99.4% (Spam) |
| "СКАЧАТЬ БЕСПЛАТНО БЕЗ СМС РЕГИСТРАЦИИ КЛЮЧИ АКТИВАЦИИ" | RU 🇷🇺 | 🗑️ Spam | 97.5% (Spam) |
| "ALARGA TU PENE 5 CM EN UNA SEMANA" | ES 🇪🇸 | 🗑️ Spam | 92.3% (Spam) |

💻 Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

# Load from Hugging Face
model_name = "NickupAI/Nickup-Swallow-v1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def classify(text):
    """Return the probability that `text` is useful (Label 1 = Useful, Label 0 = Spam)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        outputs = model(**inputs)
        probs = F.softmax(outputs.logits, dim=-1)

    useful_prob = probs[0][1].item()
    return useful_prob

# Try it out
text = "Download free cracked software no virus 2024"
score = classify(text)

if score < 0.15:  # Raise the threshold to block more aggressively (higher spam recall)
    print(f"⛔ Blocked (Confidence: {1 - score:.2%})")
else:
    print(f"✅ Allowed (Confidence: {score:.2%})")