llama-3.2-1b-ko-cpt — morph100_content variant

Continued pretraining of unsloth/Llama-3.2-1B-unsloth-bnb-4bit on Korean Wikipedia (ko_wiki_public) with a content-POS morpheme tokenizer extension (+100 Korean tokens) and rsLoRA r=256, α=256.

Submission for CAS4133 Assignment 1 (Yonsei).

Final Eval (frozen 2,125-doc held-out test set)

| Metric | Value |
|---|---|
| eval_loss | 1.9971 |
| perplexity | 7.368 |
| baseline (notebook reference) | eval_loss 2.0516 / PPL 7.780 |
| Δ vs baseline | -0.0545 loss / -0.412 PPL |
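
Perplexity here is simply the exponential of the mean token-level cross-entropy (eval_loss), so the reported numbers can be cross-checked directly:

import math

math.exp(1.9971)  # ≈ 7.368 (this run)
math.exp(2.0516)  # ≈ 7.780 (baseline)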

Tokenizer Extension

  • +100 Korean morpheme tokens added to the LLaMA tokenizer (extend mode, vocab 128,256 -> 128,356)
  • POS whitelist: [NNG, NNP, VV, VA, MAG] (content words only — common/proper nouns, verbs, adjectives, adverbs)
  • Functional morphemes (particles and verb endings, 조사/어미) were deliberately excluded; they caused NaN/inf gradient explosions in the all-POS variants
  • Selection: freq_natural (top-k by surface-form frequency, min_freq=10) over the filtered training corpus
  • Embedding init: subword-mean of the base LLaMA tokenizer pieces (see the sketch below)
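
A minimal sketch of the extension step, independent of the Unsloth/4-bit loading path actually used in the notebook. The token list is a placeholder (the real list is the 100 selected content-POS morphemes); it only illustrates the add-tokens / resize / subword-mean-init pattern.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base_id = "unsloth/Llama-3.2-1B-unsloth-bnb-4bit"

base_tok = AutoTokenizer.from_pretrained(base_id)  # unextended copy, used for subword-mean init
tok = AutoTokenizer.from_pretrained(base_id)
new_tokens = ["대한민국", "정부", "지역"]            # placeholder morphemes, not the actual top-100 list
tok.add_tokens(new_tokens)                          # 128,256 -> 128,256 + len(new_tokens)

model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
model.resize_token_embeddings(len(tok))

emb = model.get_input_embeddings().weight
with torch.no_grad():
    for t in new_tokens:
        piece_ids = base_tok(t, add_special_tokens=False).input_ids   # old subword pieces
        emb[tok.convert_tokens_to_ids(t)] = emb[piece_ids].mean(dim=0)
# lm_head rows are initialized the same way when the output embeddings are untied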

Training Configuration

| Component | Value |
|---|---|
| Base model | unsloth/Llama-3.2-1B-unsloth-bnb-4bit |
| Adapter | rsLoRA, r=256, alpha=256, dropout=0.0 |
| Target modules | q, k, v, o, gate, up, down + embed_tokens, lm_head |
| Optimizer | AdamW (8-bit), lr=2e-4, cosine schedule, warmup_ratio=0.05 |
| Batch | bs=8 × grad_accum=4 (effective 32), seq_len=1024 |
| Steps | 2,500 |
| Precision | bf16, 4-bit base in NF4 |
| Hardware | 1× RTX 3090 (24 GB), ~5h31m wall-clock |
| Train / test split | 139,394 / 2,125 documents (super_strict filter) |
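
The table maps onto a peft/transformers configuration roughly like the sketch below. This is an approximation, not the actual notebook cell; in particular, how embed_tokens/lm_head are attached (LoRA targets vs. modules_to_save) depends on the Unsloth-specific wiring.

from peft import LoraConfig
from transformers import TrainingArguments

lora_cfg = LoraConfig(
    r=256,
    lora_alpha=256,
    lora_dropout=0.0,
    use_rslora=True,                       # rsLoRA: scaling alpha / sqrt(r) instead of alpha / r
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    modules_to_save=["embed_tokens", "lm_head"],   # assumption: full training of the resized embeddings
    task_type="CAUSAL_LM",
)

train_args = TrainingArguments(
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,         # effective batch size 32
    max_steps=2500,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    optim="adamw_8bit",
    bf16=True,
    output_dir="outputs",
)
# seq_len=1024 is applied at the dataset/packing stage (e.g. the trainer's max_seq_length), not here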

Data Filtering (super_strict)

Triple filter applied to ko_wiki_public (a minimal sketch follows the list):

  1. min/max chars
  2. Korean character ratio threshold
  3. content-density (drop list-heavy / link-stub pages)
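
The sketch below shows what such a filter looks like. The function name and all thresholds are illustrative only; the actual super_strict values are not listed in this card.

import re

HANGUL = re.compile(r"[가-힣]")

def keep_document(text,
                  min_chars=200, max_chars=20_000,
                  min_korean_ratio=0.5,
                  max_list_line_ratio=0.5):
    # 1. min/max character bounds
    n = len(text)
    if not (min_chars <= n <= max_chars):
        return False
    # 2. Korean character ratio threshold
    if len(HANGUL.findall(text)) / n < min_korean_ratio:
        return False
    # 3. content density: drop pages dominated by list / link-stub lines
    lines = [l for l in text.splitlines() if l.strip()]
    if not lines:
        return False
    listy = sum(l.lstrip().startswith(("*", "-", "•")) for l in lines)
    return listy / len(lines) <= max_list_line_ratio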

Ablations

| Variant | Tokenizer extension | eval_loss | PPL |
|---|---|---|---|
| baseline (notebook) | none | 2.0516 | 7.780 |
| r256/a256 (no extension) | none | 1.9902 | 7.317 |
| morph100_content (this repo) | +100 content tokens | 1.9971 | 7.368 |
| morph200_content | +200 content tokens | 2.0041 | 7.420 |
| morph100 (all-POS) | +100 mixed tokens | NaN | inf |
| morph200 (all-POS) | +200 mixed tokens | NaN | inf |

Key finding: content-POS filtering is essential. Including particles and verb endings (조사/어미) in the extension causes immediate gradient explosion under the rsLoRA r=256 + mixed-precision embed/lm_head training setup. Under the fixed 2,500-step budget, the smaller extension (+100) also outperforms the larger one (+200).

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base_id = "unsloth/Llama-3.2-1B-unsloth-bnb-4bit"
adapter_id = "gdvstd/llama-3.2-1b-ko-cpt"

tok = AutoTokenizer.from_pretrained(adapter_id)  # extended tokenizer
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16, device_map="auto")
base.resize_token_embeddings(len(tok))           # match extended vocab (128,356) before loading the adapter
model = PeftModel.from_pretrained(base, adapter_id)
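
For a quick sanity check after loading, a short generation call works as usual; the prompt and decoding settings here are arbitrary.

model.eval()
prompt = "대한민국의 수도는"   # "The capital of South Korea is ..."
inputs = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))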