---
language:
- en
- ru
- zh
- de
- es
- fr
- ja
- it
- pt
- ar
tags:
- text-classification
- spam-detection
- content-filtering
- security
- nlp
license: apache-2.0
base_model: FacebookAI/xlm-roberta-large
metrics:
- accuracy
library_name: transformers
---
# 🦅 Nickup Swallow (v1)
> **"Swallows spam, leaves the essence."**
**Nickup Swallow** is a high-performance, multilingual text classification model specifically engineered to act as a **Gatekeeper** for modern AI search engines and browsers.
It filters out aggressive spam, SEO content farms, adult content, and scams, ensuring that downstream LLMs (like GPT/T5) process only high-quality, relevant data.
## ✨ Key Features
* **🌍 True Multilingualism:** Built on `XLM-RoBERTa-Large`.
  * **Verified & tested:** English, Russian, Chinese, German, Spanish, French, Japanese.
  * **Supported:** 100+ additional languages covered by the base architecture.
* **🏎️ Blazing Fast:** Optimized for low-latency inference; ideal for real-time filtering layers (see the inference sketch after this list).
* **💎 Exclusive Dataset:** Trained on a unique, custom-parsed dataset of **230,000+** search snippets. The data was meticulously collected and labeled using Knowledge Distillation techniques specifically for this project.
* **🛡️ High Recall Philosophy:** The model is tuned to be a strict filter against "Hard Spam" (Casinos, Malware, Adult) while being lenient enough to preserve valuable information.
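Since low-latency filtering is the whole point, here is a minimal sketch of a throughput-friendly setup: half-precision weights on GPU with batched, no-grad inference. This is illustrative rather than official serving code — it assumes a CUDA device is available and that fp16 fits your accuracy budget, and it borrows the `Nickup-Swallow-v1` repo ID and the `label 1 = Useful` convention from the Usage section below.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical low-latency setup: fp16 weights on GPU, inference mode only.
model_name = "Nickup-Swallow-v1"  # repo ID as used in the Usage section
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, torch_dtype=torch.float16
).to("cuda").eval()

@torch.inference_mode()
def classify_batch(texts):
    # Pad to the longest text in the batch; max_length=256 matches the Usage example.
    inputs = tokenizer(texts, return_tensors="pt", padding=True,
                       truncation=True, max_length=256).to("cuda")
    probs = model(**inputs).logits.softmax(dim=-1)
    return probs[:, 1].tolist()  # assumes label 1 = Useful (see Usage below)

# Example: score a batch of snippets in one forward pass.
print(classify_batch([
    "Python is a high-level programming language.",
    "BUY VIAGRA!!! BEST CASINO 100% FREE SPINS CLICK HERE",
]))
```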
## 📊 Model Performance
| Metric | Value | Notes |
| :--- | :--- | :--- |
| **Accuracy** | **89.32%** | On a strict, balanced validation set |
| **Training Time** | ~3 hours | Single NVIDIA T4 (free Google Colab tier) |
| **Base Model** | XLM-RoBERTa-Large | 550M params |
## 🧪 Examples (Real-world Tests)
The model is highly confident in distinguishing academic/technical content from low-quality spam.
| Input Text | Language | Verdict | Confidence |
| :--- | :---: | :---: | :---: |
| *"Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability."* | EN 🇺🇸 | **✅ Useful** | **99.9%** |
| *"История правления Петра I: краткая биография и реформы"* | RU 🇷🇺 | **✅ Useful** | **97.4%** |
| *"Die Relativitätstheorie beschäftigt sich mit der Struktur von Raum und Zeit."* | DE 🇩🇪 | **✅ Useful** | **99.7%** |
| *"人工智能是计算机科学的一个分支。"* | CN 🇨🇳 | **✅ Useful** | **99.6%** |
| *"BUY VIAGRA!!! BEST CASINO 100% FREE SPINS CLICK HERE"* | EN 🇺🇸 | **🗑️ Spam** | **99.4%** (Spam) |
| *"СКАЧАТЬ БЕСПЛАТНО БЕЗ СМС РЕГИСТРАЦИИ КЛЮЧИ АКТИВАЦИИ"* | RU 🇷🇺 | **🗑️ Spam** | **97.5%** (Spam) |
| *"ALARGA TU PENE 5 CM EN UNA SEMANA"* | ES 🇪🇸 | **🗑️ Spam** | **92.3%** (Spam) |
## 💻 Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

# Load from Hugging Face
model_name = "Nickup-Swallow-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def classify(text):
    """Return the probability that `text` is useful (label 1)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = F.softmax(outputs.logits, dim=-1)
    # Label 1 = Useful, Label 0 = Spam
    return probs[0][1].item()

# Try it out
text = "Download free cracked software no virus 2024"
score = classify(text)
if score < 0.15:  # Threshold can be adjusted for higher recall
    print(f"⛔ Blocked (Confidence: {1 - score:.2%})")
else:
    print(f"✅ Allowed (Confidence: {score:.2%})")
```