SanskritGPT-Vedic

A GPT-2 Transformer trained from scratch on accented Vedic Sanskrit — with native support for Udatta (॑) and Anudatta (॒) pitch accent marks across three Vedas.



Model Summary

SanskritGPT-Vedic is a decoder-only Transformer (GPT-2 architecture) trained entirely from scratch on a consolidated, accent-annotated corpus of three Vedas: Rigveda, Yajurveda, and Atharvaveda.

The model learns to generate Devanagari Sanskrit verse — including Vedic pitch accent marks — in the stylistic register of the selected Veda, guided by style-control tokens.

This is a computational linguistic research experiment — not an authoritative source of scripture. The model captures statistical patterns of Vedic Sanskrit phonology and prosody, including accent density, syllabic structure, and inter-Veda stylistic variation.


🔗 Links

| Resource | Link |
|---|---|
| Live Demo (Gradio App) | spaces/Dhruvil8/SanskritGPT-Vedic |
| Source Code & Notebooks | github.com/Dhruvil-8/SanskritGPT-Vedic |
| Related Epic Model | Dhruvil8/SanskritGPT-Itihasa |

Model Specifications

| Attribute | Value |
|---|---|
| Architecture | GPT-2 (decoder-only Transformer) |
| Parameters | ~10.1 million |
| Layers | 6 |
| Attention Heads | 8 |
| Embedding Dimension | 320 |
| Context Window | 512 tokens |
| Tokenizer | Unigram + Metaspace (`PreTrainedTokenizerFast`) |
| Vocabulary Size | 8,000 |
| Weight Format | Safetensors (~39 MB) |
| Training Hardware | Google Colab (CPU/GPU) |
| Framework | PyTorch + Hugging Face Transformers |
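The ~10.1M parameter figure follows directly from these hyperparameters. Below is a back-of-the-envelope sketch assuming the standard GPT-2 layout (tied input/output embeddings, biases on all linear layers); the actual checkpoint may differ by a small margin:

```python
def gpt2_param_count(vocab_size=8000, n_ctx=512, d_model=320, n_layers=6):
    """Approximate parameter count for a GPT-2 style decoder."""
    # Token + learned positional embeddings (output head is tied to token embeddings)
    embeddings = vocab_size * d_model + n_ctx * d_model
    # Per block: QKV projection, attention output projection,
    # 4x-wide MLP (up + down), and two LayerNorms (weight + bias each)
    per_block = (
        (3 * d_model * d_model + 3 * d_model)    # QKV
        + (d_model * d_model + d_model)          # attention output
        + (d_model * 4 * d_model + 4 * d_model)  # MLP up-projection
        + (4 * d_model * d_model + d_model)      # MLP down-projection
        + 2 * (2 * d_model)                      # two LayerNorms
    )
    final_ln = 2 * d_model
    return embeddings + n_layers * per_block + final_ln

print(f"{gpt2_param_count() / 1e6:.1f}M")  # ≈ 10.1M
```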

Style-Control Tokens

The model uses special control tokens to guide generation toward the style of a specific Veda:

| Token | Veda | Description |
|---|---|---|
| `<RIG>` | Rigveda | Hymns to the Devas; fire, soma, dawn |
| `<YAJUR>` | Yajurveda | Ritual sacrificial formulae |
| `<ATHARVA>` | Atharvaveda | Practical and apotropaic hymns |
| `<eos>` | (n/a) | End-of-verse marker |

Quick Start

Installation

```shell
pip install transformers torch tokenizers sentencepiece
```

Load and Generate

```python
from transformers import PreTrainedTokenizerFast, GPT2LMHeadModel
import torch

# Load model and tokenizer from the Hugging Face Hub
model_name = "Dhruvil8/SanskritGPT-Vedic"
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
model.eval()

def generate_vedic_verse(veda="RIG", max_length=100, temperature=0.8):
    """
    Generate a Vedic Sanskrit verse with accent marks.

    Args:
        veda: One of "RIG", "YAJUR", "ATHARVA".
        max_length: Maximum tokens to generate.
        temperature: Sampling temperature (0.5 = focused, 1.2 = creative).
    """
    prompt = f"<{veda}> "
    inputs = tokenizer(prompt, return_tensors="pt")

    with torch.no_grad():
        output_ids = model.generate(
            inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_length=max_length,
            do_sample=True,
            temperature=temperature,
            top_p=0.9,
            repetition_penalty=1.1,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
        )

    return tokenizer.decode(output_ids[0], skip_special_tokens=False)

# Generate one verse in each Veda's style
print(generate_vedic_verse("RIG"))
print(generate_vedic_verse("YAJUR"))
print(generate_vedic_verse("ATHARVA"))
```

Expected Output (Example)

```
<RIG> यस्य॑ ते॒ यस्य॒ त्यर॑त॒ यस्य॒ मक्ष॑तं॒ दर्गुद॑ति । <eos>
```

Note the presence of Udatta (॑) and Anudatta (॒) accent marks in the generated output.
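Since decoding with `skip_special_tokens=False` keeps the control tokens in the output, a small post-processing helper (hypothetical, assuming only the token set listed above) can strip them for display while preserving the accent marks:

```python
import re

def strip_control_tokens(text):
    """Remove <RIG>/<YAJUR>/<ATHARVA>/<eos> markers, keeping the verse itself."""
    return re.sub(r"<(RIG|YAJUR|ATHARVA|eos)>\s*", "", text).strip()

print(strip_control_tokens("<RIG> यस्य॑ ते॒ <eos>"))  # यस्य॑ ते॒
```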


Training Details

Dataset

  • Source: DharmicData by bhavykhatri
  • Content: Verse-level texts of the Rigveda, Yajurveda, and Atharvaveda with full Vedic accent annotations.
  • Language: Vedic Sanskrit in Devanagari script with svaras (pitch accents).

Training Procedure

  • Tokenizer: Custom Unigram tokenizer with Metaspace pre-tokenization, wrapped in PreTrainedTokenizerFast. Vocabulary size 8,000 — trained on the accented corpus to natively preserve Devanagari Unicode and Vedic accent characters (॑ ॒).
  • Style Tokens: <RIG>, <YAJUR>, <ATHARVA>, <eos>.
  • Optimizer: AdamW with weight decay.
  • Sequence Length: 512 tokens.
  • Framework: Hugging Face Transformers + PyTorch.
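The tokenizer setup described above can be sketched with the `tokenizers` library. This is a minimal sketch only — the actual training corpus and exact trainer settings are not published here:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

def train_vedic_tokenizer(corpus, vocab_size=8000):
    """Train a Unigram tokenizer with Metaspace pre-tokenization on accented text."""
    tok = Tokenizer(models.Unigram())
    tok.pre_tokenizer = pre_tokenizers.Metaspace()
    trainer = trainers.UnigramTrainer(
        vocab_size=vocab_size,
        special_tokens=["<RIG>", "<YAJUR>", "<ATHARVA>", "<eos>", "<unk>"],
        unk_token="<unk>",
    )
    tok.train_from_iterator(corpus, trainer=trainer)
    # Wrap for drop-in use with transformers
    return PreTrainedTokenizerFast(
        tokenizer_object=tok, eos_token="<eos>", unk_token="<unk>"
    )
```

Because the Unigram model is trained directly on the accented corpus, the Devanagari characters and accent marks (॑ ॒) survive as first-class symbols rather than byte fallbacks.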

Evaluation

| Metric | Value |
|---|---|
| Evaluation Loss | 6.5157 |
| Perplexity (PPL) | 675.69 |
| Udatta Ratio — Ground Truth | 6.56% |
| Udatta Ratio — Generated | 6.62% |
| Anudatta Ratio — Ground Truth | 8.69% |
| Anudatta Ratio — Generated | 8.39% |

Accent Fidelity Note: The generated accent densities deviate from ground truth by well under one percentage point (0.06 pp for Udatta, 0.30 pp for Anudatta), indicating the model has learned Vedic prosodic patterns with high fidelity.
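The accent-density metrics can be reproduced with a simple character count. This is a sketch that assumes ratios are taken over non-whitespace characters; U+0951 is the Devanagari udatta stress sign (॑) and U+0952 the anudatta sign (॒):

```python
def accent_ratios(text):
    """Return (udatta_ratio, anudatta_ratio) over non-whitespace characters."""
    chars = [c for c in text if not c.isspace()]
    n = len(chars) or 1  # guard against empty input
    return text.count("\u0951") / n, text.count("\u0952") / n

udatta, anudatta = accent_ratios("यस्य॑ ते॒ यस्य॒")
```

Running this over a large sample of generated verses and comparing against the corpus gives the ratio table above.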


Intended Use

Appropriate Use Cases

  • Computational Linguistics Research: Analysis of Vedic phonological and accentual patterns.
  • Digital Humanities: Exploration of AI-assisted generation of archaic Sanskrit forms.
  • Educational Demos: Demonstrating Transformer capabilities on a highly specialized, accented, non-Latin script.
  • Comparative Study: Contrasting stylistic variation across the three Vedic traditions in the training corpus (Rigveda, Yajurveda, Atharvaveda).

Out-of-Scope Uses

  • Religious or Ritual use: Generated text is not authentic Vedic scripture.
  • Scholarly Translation: The model does not understand semantic meaning.
  • Authoritative Attribution: Generated text must not be attributed to classical authors or traditions.

Limitations

  • Statistical Mimicry: The model learns prosodic and phonological patterns from the corpus. It does not possess traditional Vedic knowledge or semantic understanding.
  • High Perplexity: A PPL of ~675 reflects the extreme difficulty of the task (tiny 8K vocabulary modelling highly-inflected accented Sanskrit). Accent density fidelity is more meaningful than raw perplexity here.
  • Corpus Size: The Vedic corpus is small by modern NLP standards, which limits generalization.
  • Not Scripture: Generated text is not authentic Vedic scripture and must never be used for ritual, recitation, or canonical scholarly interpretation.
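On the perplexity point: PPL is simply the exponential of the evaluation loss, so the two reported metrics are internally consistent and can be checked in one line:

```python
import math

eval_loss = 6.5157
perplexity = math.exp(eval_loss)  # ≈ 675.7, matching the reported 675.69
```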

Citation

If you use this model in your research, please cite:

```bibtex
@misc{sanskritgpt-vedic-2026,
  author       = {Dhruvil},
  title        = {SanskritGPT-Vedic: A GPT-2 Transformer for Accented Vedic Sanskrit Verse Generation},
  year         = {2026},
  publisher    = {Hugging Face},
  url          = {https://huggingface.co/Dhruvil8/SanskritGPT-Vedic},
  note         = {Trained from scratch on accent-annotated Rigveda, Yajurveda, and Atharvaveda.}
}
```

License

This project is released under the MIT License. The underlying Vedic Sanskrit texts are in the public domain.


This model is an AI-assisted computational experiment. It reflects the statistical patterns of the corpus — not the wisdom, tradition, or oral lineage of the Vedas. For authentic Vedic guidance, consult qualified scholars and traditional sources.
