🛡️ MLP Cybersecurity Classifier

This repository hosts a lightweight scikit-learn MLP classifier that separates cybersecurity-related content from other text, using sentence-transformer embeddings as features. It supports English and German input texts.

📊 Training Data

The model was trained on a multilingual dataset of cybersecurity and non-cybersecurity news articles. The dataset is publicly available on Zenodo:
🔗 https://zenodo.org/records/16417939

📦 Model Details

Architecture: MLPClassifier with hidden layers (128, 64)
Embedding model: intfloat/multilingual-e5-large
Input: Cleaned article or report text (stopwords removed), embedded with the model above
Output: Binary label (e.g., Cybersecurity or Not Cybersecurity)
Languages: English, German
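
To make the architecture line concrete, here is a minimal sketch of how a classifier of this shape could be fitted with scikit-learn. It is not the training script used for this model: the random vectors stand in for real article embeddings, the label strings follow the example labels given above, and `max_iter` and `random_state` are illustrative assumptions. The only values taken from this card are the hidden layer sizes (128, 64) and the embedding model, whose output vectors have 1024 dimensions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

EMBEDDING_DIM = 1024  # vector size produced by intfloat/multilingual-e5-large

# Placeholder data (assumption): random vectors stand in for article embeddings.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, EMBEDDING_DIM))
y_train = np.array(["Cybersecurity", "Not Cybersecurity"] * 100)

# Two hidden layers with 128 and 64 units, as listed in Model Details.
# max_iter and random_state are illustrative, not documented settings.
clf = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=50, random_state=0)
clf.fit(X_train, y_train)

# predict returns the string labels; predict_proba returns one column per
# class, in the order given by clf.classes_.
X_new = rng.normal(size=(3, EMBEDDING_DIM))
labels = clf.predict(X_new)
probs = clf.predict_proba(X_new)
print(clf.classes_, labels, probs.shape)
```

With placeholder data the predictions carry no meaning; the point is only the shape of the pipeline. scikit-learn may emit a convergence warning for the short `max_iter`, which is harmless here.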

🔧 Usage

from sentence_transformers import SentenceTransformer
from huggingface_hub import hf_hub_download
import joblib

# 1. Load the embedding model
embedder = SentenceTransformer("intfloat/multilingual-e5-large")

# 2. Load the pretrained MLP classifier from Hugging Face Hub
model_path = hf_hub_download(repo_id="selfconstruct3d/cybersec_classifier", filename="cybersec_classifier.pkl")
model = joblib.load(model_path)

# 3. Example input texts (can be in English or German)
texts = [
    "A new ransomware attack has affected critical infrastructure in Germany.",
    "The local sports club hosted its annual summer festival this weekend."
]

# 4. Generate embeddings
embeddings = embedder.encode(texts, convert_to_numpy=True, show_progress_bar=False)

# 5. Predict cybersecurity relevance
predictions = model.predict(embeddings)

# 6. Output results
for text, label in zip(texts, predictions):
    print(f"Text: {text}\nPrediction: {label}\n")
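
Model Details describes the input as cleaned text with stopwords removed, while the example above embeds raw sentences. The exact cleaning procedure is not documented here, so the following is only a sketch of what such a step might look like if run before step 4. The function name `remove_stopwords` and the tiny stopword lists are hypothetical; a real setup would likely use a fuller list per language.

```python
import re

# Hypothetical, deliberately small stopword lists (assumption): the lists
# used when training this model are not documented.
STOPWORDS = {
    "en": {"a", "an", "the", "has", "in", "its", "this", "of", "and", "is"},
    "de": {"der", "die", "das", "ein", "eine", "und", "in", "hat", "ist", "im"},
}

def remove_stopwords(text: str, lang: str = "en") -> str:
    """Drop stopwords (case-insensitive) and punctuation, keeping word order."""
    stop = STOPWORDS[lang]
    tokens = re.findall(r"\w+", text)  # \w is Unicode-aware, so umlauts survive
    return " ".join(t for t in tokens if t.lower() not in stop)

cleaned = remove_stopwords(
    "A new ransomware attack has affected critical infrastructure in Germany."
)
print(cleaned)
# new ransomware attack affected critical infrastructure Germany
```

The cleaned strings could then be passed to `embedder.encode` in place of the raw `texts`.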
