# Native Log Translator

Maps heterogeneous cloud and OS logs to a unified, normalized schema.
Fine-tuned from microsoft/codebert-base using LoRA (PEFT) on a curated dataset
of multi-provider security logs. Trained on Kaggle T4 x2 GPUs in FP16.
## Quick Start
```python
import torch
from transformers import RobertaTokenizer, RobertaForCausalLM, RobertaConfig
from peft import PeftModel

MODEL_REPO = "Swapnanil09/native-log-translator-qlora"
BASE_MODEL = "microsoft/codebert-base"

tokenizer = RobertaTokenizer.from_pretrained(MODEL_REPO)

config = RobertaConfig.from_pretrained(BASE_MODEL)
config.is_decoder = True  # enable causal-LM behavior for the encoder checkpoint

base = RobertaForCausalLM.from_pretrained(
    BASE_MODEL,
    config=config,
    ignore_mismatched_sizes=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, MODEL_REPO)
model.eval()

def translate_log(log_input):
    prompt = f"<log>{log_input}</log>\n<schema>"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=60, temperature=0.1,
                             do_sample=True, pad_token_id=tokenizer.eos_token_id)
    decoded = tokenizer.decode(out[0], skip_special_tokens=True)
    return decoded.split("<schema>")[-1].strip()

print(translate_log("AzureSignInLogs | ResultType=0"))
# event_type: authentication_success
# provider: azure
# risk_level: low
```
## Output Schema
| Field | Description | Values |
|---|---|---|
| event_type | Normalized event category | e.g. authentication_success, privilege_escalation |
| provider | Source cloud / OS | azure, aws, gcp, windows, linux, paloalto, cisco, fortinet |
| risk_level | Severity classification | low, medium, high, critical |
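The model emits the schema as plain `key: value` lines. A minimal sketch of turning that text into a dict for downstream use (the `parse_schema` helper name is our own, not part of the repo):

```python
def parse_schema(schema_text: str) -> dict:
    """Parse the 'key: value' lines emitted after the <schema> tag into a dict."""
    fields = {}
    for line in schema_text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return fields

# Example against the Quick Start output:
parsed = parse_schema(
    "event_type: authentication_success\nprovider: azure\nrisk_level: low"
)
print(parsed)
# {'event_type': 'authentication_success', 'provider': 'azure', 'risk_level': 'low'}
```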
## Supported Log Sources
| Provider | Log Type |
|---|---|
| Azure | SignInLogs, Activity, NSGFlowLogs |
| AWS | CloudTrail |
| GCP | Audit Logs |
| Windows | Security Events (4624, 4625, 4688, 4698, 4720, 4732, 1102 ...) |
| Linux | Syslog (auth, kern) |
| Network | Palo Alto, Cisco, Fortinet (CommonSecurityLog) |
## Training Details
| Setting | Value |
|---|---|
| Base model | microsoft/codebert-base |
| Method | LoRA (PEFT) |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| Target modules | query, key, value |
| Epochs | 15 |
| Batch size | 8 per device |
| Gradient accumulation | 4 steps |
| Learning rate | 2e-4 |
| Precision | FP16 |
| Hardware | Kaggle T4 x2 |
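The table above corresponds roughly to the following PEFT configuration. This is a sketch, not the repo's actual training script; the task type is an assumption and dropout is left at the library default:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                                      # LoRA rank (table: 16)
    lora_alpha=32,                             # scaling factor (table: 32)
    target_modules=["query", "key", "value"],  # attention projections (table)
    task_type="CAUSAL_LM",                     # assumption: decoder-style fine-tune
)
# model = get_peft_model(base, lora_config)  # wrap the CodeBERT base before training
```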
## Intended Use
- SIEM normalization pipelines
- Multi-cloud SOC log ingestion
- Security event correlation
- Threat detection preprocessing
## Limitations
- Trained on a small curated dataset; production use should involve fine-tuning on your own log corpus
- May not generalize to vendor-specific log formats not seen during training
- Not a replacement for rule-based parsers in high-stakes pipelines without validation
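Given the caveats above, a lightweight guard that rejects outputs outside the documented value sets may help before ingestion. A sketch; the function name and pass/fail semantics are our own choices, not part of the repo:

```python
# Allowed values taken from the Output Schema table above.
ALLOWED_PROVIDERS = {"azure", "aws", "gcp", "windows", "linux",
                     "paloalto", "cisco", "fortinet"}
ALLOWED_RISK = {"low", "medium", "high", "critical"}

def validate_output(fields: dict) -> bool:
    """Return True only if the parsed model output stays within the documented schema."""
    return (
        "event_type" in fields
        and fields.get("provider") in ALLOWED_PROVIDERS
        and fields.get("risk_level") in ALLOWED_RISK
    )

print(validate_output({"event_type": "authentication_success",
                       "provider": "azure", "risk_level": "low"}))   # True
print(validate_output({"event_type": "authentication_success",
                       "provider": "oracle", "risk_level": "low"}))  # False
```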