llama-3.2-1b-ko-cpt — morph100_content variant

Continued pretraining of unsloth/Llama-3.2-1B-unsloth-bnb-4bit on Korean Wikipedia (ko_wiki_public) with a content-POS morpheme tokenizer extension (+100 Korean tokens) and rsLoRA r=256, α=256.

Submission for CAS4133 Assignment 1 (Yonsei).

Final Eval (frozen 2,125-doc held-out test set)

| Metric | Value |
|---|---|
| eval_loss | 1.9971 |
| perplexity | 7.368 |
| baseline (notebook reference) | eval_loss 2.0516 / PPL 7.780 |
| Δ vs baseline | -0.0545 loss / -0.412 PPL |
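
Perplexity here is simply the exponential of the mean token-level cross-entropy (eval_loss), so the reported numbers can be cross-checked directly:

import math

math.exp(1.9971)  # ≈ 7.368 (this run)
math.exp(2.0516)  # ≈ 7.780 (baseline)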

Tokenizer Extension

  • +100 Korean morpheme tokens added to the LLaMA tokenizer (extend mode, vocab 128,256 -> 128,356)
  • POS whitelist: [NNG, NNP, VV, VA, MAG] (content words only — common/proper nouns, verbs, adjectives, adverbs)
  • Functional morphemes (particles and verb endings, 조사/어미) were deliberately excluded; they caused NaN/inf gradient explosions in the all-POS variants
  • Selection: freq_natural (top-k by surface-form frequency, min_freq=10) over the filtered training corpus
  • Embedding init: subword-mean of the base LLaMA tokenizer pieces (see the sketch below)
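
A minimal sketch of the extension step, independent of the Unsloth/4-bit loading path actually used in the notebook. The token list is a placeholder (the real list is the 100 selected content-POS morphemes); it only illustrates the add-tokens / resize / subword-mean-init pattern.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base_id = "unsloth/Llama-3.2-1B-unsloth-bnb-4bit"

base_tok = AutoTokenizer.from_pretrained(base_id)  # unextended copy, used for subword-mean init
tok = AutoTokenizer.from_pretrained(base_id)
new_tokens = ["대한민국", "정부", "지역"]            # placeholder morphemes, not the actual top-100 list
tok.add_tokens(new_tokens)                          # 128,256 -> 128,256 + len(new_tokens)

model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
model.resize_token_embeddings(len(tok))

emb = model.get_input_embeddings().weight
with torch.no_grad():
    for t in new_tokens:
        piece_ids = base_tok(t, add_special_tokens=False).input_ids   # old subword pieces
        emb[tok.convert_tokens_to_ids(t)] = emb[piece_ids].mean(dim=0)
# lm_head rows are initialized the same way when the output embeddings are untied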

Training Configuration

| Component | Value |
|---|---|
| Base model | unsloth/Llama-3.2-1B-unsloth-bnb-4bit |
| Adapter | rsLoRA, r=256, alpha=256, dropout=0.0 |
| Target modules | q, k, v, o, gate, up, down + embed_tokens, lm_head |
| Optimizer | AdamW (8-bit), lr=2e-4, cosine schedule, warmup_ratio=0.05 |
| Batch | bs=8 × grad_accum=4 (effective 32), seq_len=1024 |
| Steps | 2,500 |
| Precision | bf16, 4-bit base in NF4 |
| Hardware | 1× RTX 3090 (24 GB), ~5h31m wall-clock |
| Train / test split | 139,394 / 2,125 documents (super_strict filter) |
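
The table maps onto a peft/transformers configuration roughly like the sketch below. This is an approximation, not the actual notebook cell; in particular, how embed_tokens/lm_head are attached (LoRA targets vs. modules_to_save) depends on the Unsloth-specific wiring.

from peft import LoraConfig
from transformers import TrainingArguments

lora_cfg = LoraConfig(
    r=256,
    lora_alpha=256,
    lora_dropout=0.0,
    use_rslora=True,                       # rsLoRA: scaling alpha / sqrt(r) instead of alpha / r
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    modules_to_save=["embed_tokens", "lm_head"],   # assumption: full training of the resized embeddings
    task_type="CAUSAL_LM",
)

train_args = TrainingArguments(
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,         # effective batch size 32
    max_steps=2500,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    optim="adamw_8bit",
    bf16=True,
    output_dir="outputs",
)
# seq_len=1024 is applied at the dataset/packing stage (e.g. the trainer's max_seq_length), not here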

Data Filtering (super_strict)

Triple filter applied to ko_wiki_public (a minimal sketch follows the list):

  1. min/max chars
  2. Korean character ratio threshold
  3. content-density (drop list-heavy / link-stub pages)
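
The sketch below shows what such a filter looks like. The function name and all thresholds are illustrative only; the actual super_strict values are not listed in this card.

import re

HANGUL = re.compile(r"[가-힣]")

def keep_document(text,
                  min_chars=200, max_chars=20_000,
                  min_korean_ratio=0.5,
                  max_list_line_ratio=0.5):
    # 1. min/max character bounds
    n = len(text)
    if not (min_chars <= n <= max_chars):
        return False
    # 2. Korean character ratio threshold
    if len(HANGUL.findall(text)) / n < min_korean_ratio:
        return False
    # 3. content density: drop pages dominated by list / link-stub lines
    lines = [l for l in text.splitlines() if l.strip()]
    if not lines:
        return False
    listy = sum(l.lstrip().startswith(("*", "-", "•")) for l in lines)
    return listy / len(lines) <= max_list_line_ratio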

Ablations

| Variant | Tokenizer extension | eval_loss | PPL |
|---|---|---|---|
| baseline (notebook) | none | 2.0516 | 7.780 |
| r256/a256 (no extension) | none | 1.9902 | 7.317 |
| morph100_content (this repo) | +100 content tokens | 1.9971 | 7.368 |
| morph200_content | +200 content tokens | 2.0041 | 7.420 |
| morph100 (all-POS) | +100 mixed tokens | NaN | inf |
| morph200 (all-POS) | +200 mixed tokens | NaN | inf |

Key finding: content-POS filtering is essential. Including particles and verb endings (조사/어미) in the extension causes immediate gradient explosion under the rsLoRA r=256 + mixed-precision embed/lm_head training setup. Under the fixed 2,500-step budget, the smaller extension (+100) also outperforms the larger one (+200).

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base_id = "unsloth/Llama-3.2-1B-unsloth-bnb-4bit"
adapter_id = "gdvstd/llama-3.2-1b-ko-cpt"

tok = AutoTokenizer.from_pretrained(adapter_id)  # extended tokenizer
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16, device_map="auto")
base.resize_token_embeddings(len(tok))           # match extended vocab (128,356) before loading the adapter
model = PeftModel.from_pretrained(base, adapter_id)
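
For a quick sanity check after loading, a short generation call works as usual; the prompt and decoding settings here are arbitrary.

model.eval()
prompt = "대한민국의 수도는"   # "The capital of South Korea is ..."
inputs = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))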