VezilkaLLM
Despite the recent explosion of open Large Language Models (LLMs), many languages, especially low-resource ones, have been left behind. Macedonian, spoken by over 2 million people, has had almost no dedicated LLMs to date.
We're excited to change that.
Today, we're introducing VezilkaLLM, the first 4B parameter base model trained specifically for Macedonian, achieving performance on par with existing 7B and 8B parameter models, while being significantly more efficient to train and deploy.
Why the Macedonian Language Needs Its Own LLM
Macedonian is widely recognized as a low-resource language in Natural Language Processing (NLP). Publicly available datasets are scarce, scattered, and often noisy. Many books and government documents still exist only as scanned PDFs or non-machine-readable formats, making it hard to build quality corpora.
Most multilingual LLMs offer only basic support for Macedonian, and their performance on the language is diluted by limited training data, suboptimal vocabulary coverage, and inefficient tokenization.
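To make the tokenization point concrete, here is a small illustrative sketch that counts how many subword pieces a short Macedonian sentence is split into. The multilingual tokenizer used here is chosen purely as an example (it is not one of the baselines discussed below), and the exact fragmentation ratio will vary by tokenizer.

```python
from transformers import AutoTokenizer

# "Artificial intelligence is becoming increasingly important for the Macedonian language."
sentence = "Вештачката интелигенција станува сè поважна за македонскиот јазик."

# A widely used multilingual tokenizer, picked only to illustrate the point;
# vocabularies with little Macedonian coverage tend to split words into many pieces.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
tokens = tokenizer.tokenize(sentence)

print(f"{len(sentence.split())} words -> {len(tokens)} subword tokens")
print(tokens[:15])
```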
Our goal: build a performant, efficient, and clean base model, trained on a high-quality, fully Macedonian dataset, openly shared with the community.
The Dataset (by LVSTCK)
We used the Macedonian Corpus constructed by LVSTCK, a high-quality, deduplicated dataset consisting of 1.47 billion words (16.78 GB) of Macedonian text from books, academic texts, web content, and government documents. The LVSTCK team cleaned and deduplicated the corpus using C4-like and Gopher-style filtering, high-confidence language detection (≥0.65), sentence-level and MinHash deduplication, and full PII removal. Additionally, their preprocessing included chunking for training efficiency and manual correction of OCR and formatting errors, resulting in a high-quality Macedonian dataset available for language modeling.
Link to the dataset on Hugging Face.
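If you want to explore the corpus yourself, a minimal sketch with the Hugging Face datasets library might look like the following. The repository id and column name are placeholders, so check the link above for the exact values.

```python
from datasets import load_dataset

# Repository id and column name are placeholders -- see the dataset link above
# for the actual values on Hugging Face.
corpus = load_dataset("LVSTCK/macedonian-corpus", split="train", streaming=True)

# Peek at the first few documents without downloading the full 16.78 GB corpus.
for example in corpus.take(3):
    print(example["text"][:200], "\n---")
```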
The Model: 4B Parameters of Macedonian Fluency
We fine-tuned the 4B-parameter gemma-3-4b-pt model to create VezilkaLLM, a compact yet powerful base LLM tailored specifically for the Macedonian language. While most existing Macedonian models are either multilingual with limited fluency or heavyweight 7B–8B models, our 4B model strikes a balance between efficiency and expressiveness. It handles Macedonian grammar, vocabulary, and structure with great fluency, even after just one epoch of training.
Built with a focus on accessibility and scalability, VezilkaLLM is lightweight enough to fine-tune or deploy on a single GPU, yet strong enough to serve as a foundational model for downstream tasks such as summarization, question answering, or chatbot development in Macedonian.
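As a quick illustration of how lightweight the model is to use, here is a minimal generation sketch with Hugging Face Transformers. The repository id is a placeholder (substitute the actual VezilkaLLM checkpoint name), and the sketch assumes the released checkpoint loads as a standard text-only causal LM.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repository id -- replace with the actual VezilkaLLM checkpoint.
model_id = "LVSTCK/VezilkaLLM"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # same precision the model was trained in
    device_map="auto",           # a 4B model fits on a single modern GPU
)

# "Skopje is the capital of" -- as a base model, VezilkaLLM continues the text.
prompt = "Скопје е главниот град на"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=60, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because this is a base model, it continues text rather than following instructions, so prompts should be phrased as passages to complete.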
Training Recipe: Optimizing VezilkaLLM for Macedonian Fluency
The model was fine-tuned using the Hugging Face Transformers Trainer API on a single NVIDIA H100 GPU, leveraging bfloat16 precision and the AdamW optimizer for efficient training. With a per-device batch size of 2 and 16 gradient accumulation steps, we achieved an effective batch size of 32, suitable for stable convergence. Gradient checkpointing was enabled to reduce memory usage, and training ran for one epoch with a cosine learning rate schedule and a warmup ratio of 0.2. The model was trained with a context window of 8192 tokens, enabling it to capture longer dependencies and improve coherence over extended passages. Thanks to the power of the H100 and a carefully tuned training recipe, we were able to achieve great fluency and stability, on a budget and in record time. Stay tuned for a full technical report on the training process and performance benchmarks!
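For a concrete picture of the recipe, here is a minimal Trainer configuration sketch. The hyperparameters mirror the ones described above; anything not stated explicitly (the learning rate, logging cadence, and how the corpus is packed into 8192-token sequences via `model`, `tokenizer`, and `train_dataset`) is an assumption left as a placeholder.

```python
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

training_args = TrainingArguments(
    output_dir="vezilka-llm",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,   # 2 x 16 = effective batch size of 32
    gradient_checkpointing=True,      # trade compute for memory on the single H100
    bf16=True,                        # bfloat16 precision
    optim="adamw_torch",              # AdamW optimizer
    lr_scheduler_type="cosine",
    warmup_ratio=0.2,
    logging_steps=100,                # cadence not specified above; placeholder
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,                      # gemma-3-4b-pt loaded in bfloat16
    args=training_args,
    train_dataset=train_dataset,      # corpus packed into 8192-token sequences
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```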
Performance: Compact Yet Competitive
While this is just the first training run (1 epoch), early metrics are promising:
- lower perplexity than comparable multilingual LLMs on Macedonian test sets,
- generation quality and fluency on par with 7B–8B models, and
- faster inference and lower memory requirements.
We're currently working on instruct tuning, evaluation benchmarks, and chat capabilities. More results will be shared soon.
Evaluation
We evaluated VezilkaLLM using the macedonian-llm-eval benchmark, comparing it with three strong baselines, including two larger models.
As shown in Table 1, despite its smaller size, VezilkaLLM closely matches the performance of both the MKLLM-7B and domestic-yak-8B base models across tasks. It also outperforms its own foundation model, Gemma 3 4B, which shows how impactful language-specific fine-tuning can be for low-resource languages like Macedonian.
In short: VezilkaLLM holds its own against much larger contenders, while remaining lightweight and efficient enough to run on modest hardware.
| Model | ARC Challenge | ARC Easy | BoolQ | HellaSwag | OpenBookQA | PIQA | Winogrande | NQ Open |
|---|---|---|---|---|---|---|---|---|
| gemma-3-4b-pt | 0.28 | 0.48 | 0.75 | 0.39 | 0.25 | 0.62 | 0.59 | 0.00 |
| VezilkaLLM | 0.30 | 0.50 | 0.72 | 0.41 | 0.25 | 0.65 | 0.59 | 0.03 |
| domestic-yak-8B | 0.31 | 0.52 | 0.77 | 0.43 | 0.29 | 0.67 | 0.63 | 0.04 |
| MKLLM-7B | 0.32 | 0.54 | 0.71 | 0.43 | 0.28 | 0.62 | 0.62 | 0.03 |
Table 1: Model Performance Comparison Across Evaluation Benchmarks
We present the evaluation results visually in Figure 1.
Figure 1: Model Performance Comparison Across Evaluation Benchmarks
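For readers who want to reproduce numbers like these, here is a rough sketch of an evaluation run, assuming the benchmark exposes the standard lm-evaluation-harness Python interface. The task identifiers and model repository id below are placeholders, not the benchmark's actual names; consult the macedonian-llm-eval repository for the real ones.

```python
import lm_eval

# Assumes an lm-evaluation-harness style interface; task names and the
# model repository id are placeholders for illustration only.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=LVSTCK/VezilkaLLM,dtype=bfloat16",
    tasks=["arc_challenge_mk", "arc_easy_mk", "hellaswag_mk", "piqa_mk"],
    batch_size=8,
)

# Print accuracy and related metrics per task.
for task, metrics in results["results"].items():
    print(task, metrics)
```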
What's Next?
This release is just the beginning. Our roadmap includes:
- instruction fine-tuning for tasks like Q&A, summarization, and chat,
- standard benchmark evaluations on Macedonian tasks, and
- collaborations with academic institutions and researchers.
Get Involved!
If you're working on Macedonian NLP (or any low-resource language), we'd love to collaborate! This model was trained to empower researchers, educators, and developers in Macedonia and beyond.
Let's bring more languages to the LLM world!
Limitations & Considerations
VezilkaLLM is a base language model and is not intended for direct deployment in end-user applications without additional fine-tuning and safety measures.
No Built-in Safety or Moderation
VezilkaLLM does not include moderation tools or guardrails. It can generate unsafe, inappropriate, or biased content, especially in open-ended prompts. Use with care and always evaluate downstream outputs.
Hallucinations & Factuality
Like most base LLMs, VezilkaLLM may produce factually incorrect or misleading information. This tendency can be more noticeable when responding to prompts about Macedonian-specific topics, as the training corpus for these domains is comparatively small.
Bias and Fairness
Although steps were taken to clean and deduplicate the dataset, VezilkaLLM may reflect biases present in its training data. Additional bias analysis and mitigation may be required before using it in production, especially in sensitive or high-impact settings.
Domain Coverage
The model is trained on a broad dataset, but domain balance is not uniform. It performs well on common domains like news or general discourse, but may underperform on specialized topics such as science, law, or medicine due to limited representation.
Chat Readiness
This is the base version of the model and is not optimized for dialogue. If you're looking for a conversational or instruction-following variant, stay tuned for the upcoming instruct version of VezilkaLLM.