SanskritGPT-Vedic

A GPT-2 Transformer trained from scratch on accented Vedic Sanskrit — with native support for Udatta (॑) and Anudatta (॒) pitch accent marks across three Vedas.



Model Summary

SanskritGPT-Vedic is a decoder-only Transformer (GPT-2 architecture) trained entirely from scratch on a consolidated, accent-annotated corpus of three Vedas: Rigveda, Yajurveda, and Atharvaveda.

The model learns to generate Devanagari Sanskrit verse — including Vedic pitch accent marks — in the stylistic register of the selected Veda, guided by style-control tokens.

This is a computational linguistic research experiment — not an authoritative source of scripture. The model captures statistical patterns of Vedic Sanskrit phonology and prosody, including accent density, syllabic structure, and inter-Veda stylistic variation.


🔗 Links

| Resource | Link |
|---|---|
| Live Demo (Gradio App) | spaces/Dhruvil8/SanskritGPT-Vedic |
| Source Code & Notebooks | github.com/Dhruvil-8/SanskritGPT-Vedic |
| Related Epic Model | Dhruvil8/SanskritGPT-Itihasa |

Model Specifications

| Attribute | Value |
|---|---|
| Architecture | GPT-2 (decoder-only Transformer) |
| Parameters | ~10.1 million |
| Layers | 6 |
| Attention Heads | 8 |
| Embedding Dimension | 320 |
| Context Window | 512 tokens |
| Tokenizer | Unigram + Metaspace (`PreTrainedTokenizerFast`) |
| Vocabulary Size | 8,000 |
| Weight Format | Safetensors (~39 MB) |
| Training Hardware | Google Colab (CPU/GPU) |
| Framework | PyTorch + Hugging Face Transformers |
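The ~10.1M parameter figure follows directly from these hyperparameters. Below is a back-of-the-envelope sketch assuming the standard GPT-2 layout (tied input/output embeddings, biases on all linear layers); the actual checkpoint may differ by a small margin:

```python
def gpt2_param_count(vocab_size=8000, n_ctx=512, d_model=320, n_layers=6):
    """Approximate parameter count for a GPT-2 style decoder."""
    # Token + learned positional embeddings (output head is tied to token embeddings)
    embeddings = vocab_size * d_model + n_ctx * d_model
    # Per block: QKV projection, attention output projection,
    # 4x-wide MLP (up + down), and two LayerNorms (weight + bias each)
    per_block = (
        (3 * d_model * d_model + 3 * d_model)    # QKV
        + (d_model * d_model + d_model)          # attention output
        + (d_model * 4 * d_model + 4 * d_model)  # MLP up-projection
        + (4 * d_model * d_model + d_model)      # MLP down-projection
        + 2 * (2 * d_model)                      # two LayerNorms
    )
    final_ln = 2 * d_model
    return embeddings + n_layers * per_block + final_ln

print(f"{gpt2_param_count() / 1e6:.1f}M")  # ≈ 10.1M
```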

Style-Control Tokens

The model uses special control tokens to guide generation toward the style of a specific Veda:

| Token | Veda | Description |
|---|---|---|
| `<RIG>` | Rigveda | Hymns to the Devas; fire, soma, dawn |
| `<YAJUR>` | Yajurveda | Ritual sacrificial formulae |
| `<ATHARVA>` | Atharvaveda | Practical and apotropaic hymns |
| `<eos>` | (n/a) | End-of-verse marker |

Quick Start

Installation

```shell
pip install transformers torch tokenizers sentencepiece
```

Load and Generate

```python
from transformers import PreTrainedTokenizerFast, GPT2LMHeadModel
import torch

# Load model and tokenizer from the Hugging Face Hub
model_name = "Dhruvil8/SanskritGPT-Vedic"
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
model.eval()

def generate_vedic_verse(veda="RIG", max_length=100, temperature=0.8):
    """
    Generate a Vedic Sanskrit verse with accent marks.

    Args:
        veda: One of "RIG", "YAJUR", "ATHARVA".
        max_length: Maximum tokens to generate.
        temperature: Sampling temperature (0.5 = focused, 1.2 = creative).
    """
    prompt = f"<{veda}> "
    inputs = tokenizer(prompt, return_tensors="pt")

    with torch.no_grad():
        output_ids = model.generate(
            inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_length=max_length,
            do_sample=True,
            temperature=temperature,
            top_p=0.9,
            repetition_penalty=1.1,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
        )

    return tokenizer.decode(output_ids[0], skip_special_tokens=False)

# Generate one verse in each Veda's style
print(generate_vedic_verse("RIG"))
print(generate_vedic_verse("YAJUR"))
print(generate_vedic_verse("ATHARVA"))
```

Expected Output (Example)

```
<RIG> यस्य॑ ते॒ यस्य॒ त्यर॑त॒ यस्य॒ मक्ष॑तं॒ दर्गुद॑ति । <eos>
```

Note the presence of Udatta (॑) and Anudatta (॒) accent marks in the generated output.
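Since decoding with `skip_special_tokens=False` keeps the control tokens in the output, a small post-processing helper (hypothetical, assuming only the token set listed above) can strip them for display while preserving the accent marks:

```python
import re

def strip_control_tokens(text):
    """Remove <RIG>/<YAJUR>/<ATHARVA>/<eos> markers, keeping the verse itself."""
    return re.sub(r"<(RIG|YAJUR|ATHARVA|eos)>\s*", "", text).strip()

print(strip_control_tokens("<RIG> यस्य॑ ते॒ <eos>"))  # यस्य॑ ते॒
```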


Training Details

Dataset

  • Source: DharmicData by bhavykhatri
  • Content: Verse-level texts of the Rigveda, Yajurveda, and Atharvaveda with full Vedic accent annotations.
  • Language: Vedic Sanskrit in Devanagari script with svaras (pitch accents).

Training Procedure

  • Tokenizer: Custom Unigram tokenizer with Metaspace pre-tokenization, wrapped in PreTrainedTokenizerFast. Vocabulary size 8,000 — trained on the accented corpus to natively preserve Devanagari Unicode and Vedic accent characters (॑ ॒).
  • Style Tokens: <RIG>, <YAJUR>, <ATHARVA>, <eos>.
  • Optimizer: AdamW with weight decay.
  • Sequence Length: 512 tokens.
  • Framework: Hugging Face Transformers + PyTorch.
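The tokenizer setup described above can be sketched with the `tokenizers` library. This is a minimal sketch only — the actual training corpus and exact trainer settings are not published here:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

def train_vedic_tokenizer(corpus, vocab_size=8000):
    """Train a Unigram tokenizer with Metaspace pre-tokenization on accented text."""
    tok = Tokenizer(models.Unigram())
    tok.pre_tokenizer = pre_tokenizers.Metaspace()
    trainer = trainers.UnigramTrainer(
        vocab_size=vocab_size,
        special_tokens=["<RIG>", "<YAJUR>", "<ATHARVA>", "<eos>", "<unk>"],
        unk_token="<unk>",
    )
    tok.train_from_iterator(corpus, trainer=trainer)
    # Wrap for drop-in use with transformers
    return PreTrainedTokenizerFast(
        tokenizer_object=tok, eos_token="<eos>", unk_token="<unk>"
    )
```

Because the Unigram model is trained directly on the accented corpus, the Devanagari characters and accent marks (॑ ॒) survive as first-class symbols rather than byte fallbacks.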

Evaluation

| Metric | Value |
|---|---|
| Evaluation Loss | 6.5157 |
| Perplexity (PPL) | 675.69 |
| Udatta Ratio — Ground Truth | 6.56% |
| Udatta Ratio — Generated | 6.62% |
| Anudatta Ratio — Ground Truth | 8.69% |
| Anudatta Ratio — Generated | 8.39% |

Accent Fidelity Note: The generated accent densities deviate from ground truth by well under one percentage point (0.06 pp for Udatta, 0.30 pp for Anudatta), indicating the model has learned Vedic prosodic patterns with high fidelity.
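The accent-density metrics can be reproduced with a simple character count. This is a sketch that assumes ratios are taken over non-whitespace characters; U+0951 is the Devanagari udatta stress sign (॑) and U+0952 the anudatta sign (॒):

```python
def accent_ratios(text):
    """Return (udatta_ratio, anudatta_ratio) over non-whitespace characters."""
    chars = [c for c in text if not c.isspace()]
    n = len(chars) or 1  # guard against empty input
    return text.count("\u0951") / n, text.count("\u0952") / n

udatta, anudatta = accent_ratios("यस्य॑ ते॒ यस्य॒")
```

Running this over a large sample of generated verses and comparing against the corpus gives the ratio table above.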


Intended Use

Appropriate Use Cases

  • Computational Linguistics Research: Analysis of Vedic phonological and accentual patterns.
  • Digital Humanities: Exploration of AI-assisted generation of archaic Sanskrit forms.
  • Educational Demos: Demonstrating Transformer capabilities on a highly specialized, accented, non-Latin script.
  • Comparative Study: Contrasting stylistic variation across the three Vedic traditions in the training corpus (Rigveda, Yajurveda, Atharvaveda).

Out-of-Scope Uses

  • Religious or Ritual use: Generated text is not authentic Vedic scripture.
  • Scholarly Translation: The model does not understand semantic meaning.
  • Authoritative Attribution: Generated text must not be attributed to classical authors or traditions.

Limitations

  • Statistical Mimicry: The model learns prosodic and phonological patterns from the corpus. It does not possess traditional Vedic knowledge or semantic understanding.
  • High Perplexity: A PPL of ~675 reflects the extreme difficulty of the task (tiny 8K vocabulary modelling highly-inflected accented Sanskrit). Accent density fidelity is more meaningful than raw perplexity here.
  • Corpus Size: The Vedic corpus is small by modern NLP standards, which limits generalization.
  • Not Scripture: Generated text is not authentic Vedic scripture and must never be used for ritual, recitation, or canonical scholarly interpretation.
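On the perplexity point: PPL is simply the exponential of the evaluation loss, so the two reported metrics are internally consistent and can be checked in one line:

```python
import math

eval_loss = 6.5157
perplexity = math.exp(eval_loss)  # ≈ 675.7, matching the reported 675.69
```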

Citation

If you use this model in your research, please cite:

```bibtex
@misc{sanskritgpt-vedic-2026,
  author       = {Dhruvil},
  title        = {SanskritGPT-Vedic: A GPT-2 Transformer for Accented Vedic Sanskrit Verse Generation},
  year         = {2026},
  publisher    = {Hugging Face},
  url          = {https://huggingface.co/Dhruvil8/SanskritGPT-Vedic},
  note         = {Trained from scratch on accent-annotated Rigveda, Yajurveda, and Atharvaveda.}
}
```

License

This project is released under the MIT License. The underlying Vedic Sanskrit texts are in the public domain.


This model is an AI-assisted computational experiment. It reflects the statistical patterns of the corpus — not the wisdom, tradition, or oral lineage of the Vedas. For authentic Vedic guidance, consult qualified scholars and traditional sources.
