NabuOCR: Neural Cuneiform Transliteration

NabuOCR is an OCR model for transcribing ancient cuneiform tablets directly from images to Unicode. Named after Nabu, the Mesopotamian god of writing and scribes, this model bridges a 5,000-year gap between humanity's earliest writing system and cutting-edge computer vision.

NabuOCR was made for the ERNIE AI Developer Challenge; the submission video is available here: https://www.youtube.com/embed/hqmjepRLdfU?si=aJHpWdc12ThgWIxD

Overview

NabuOCR processes images of cuneiform tablets and outputs Unicode transcriptions of cuneiform signs. While Assyriologists typically use ATF (ASCII Transliteration Format), ATF's complexity proved too challenging for the 0.9B model within training constraints. Unicode transcription is a meaningful intermediate step: a model that can reliably identify which signs appear on a tablet is doing real work, even if a human still needs to add the scholarly apparatus.

Built by fine-tuning PaddleOCR-VL on cuneiform tablet images, NabuOCR can handle multi-view images of tablets and produce transcriptions of each face using markers like @obverse, @reverse, @left, @right, @top, and @bottom.
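
For example, a transcription of a photo showing two faces might be laid out like this (the signs below are arbitrary placeholders and the exact layout is an assumption; only the face markers reflect the model's actual output vocabulary):

```
@obverse
𒈗 𒀭 𒆠
@reverse
𒂍 𒀭 𒈗
```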

Features

NabuOCR is based on the efficient 0.9B parameter PaddleOCR-VL model with an expanded tokenizer that includes all unique cuneiform signs from the dataset plus special face markers. The model was trained on diverse tablet conditions from multiple periods.

It employs end-to-end transcription rather than a multi-stage pipeline, allowing it to leverage full tablet context when making predictions. It handles multi-view images containing obverse, reverse, and edge views all at once.

Example Output

(Figures: example transcription results, result-demo-1 and result-demo-2)

Training

Base Model

NabuOCR is built on PaddleOCR-VL with an expanded tokenizer vocabulary to include cuneiform Unicode codepoints and special face markers (@obverse, @reverse, @left, @right, @top, @bottom).

Dataset

The training data was built from the Cuneiform Digital Library Initiative (CDLI). Starting from 135,255 ATF transliterations, aggressive filtering removed damaged tablets, tablets outside the Sumerian/Akkadian scope, entries without images, low-quality black-and-white photos, and images with noisy backgrounds. The result was 33,257 high-quality examples, split into 32,257 training samples and 1,000 held-out test samples. ATF was converted to Unicode for the final targets.
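
As a rough illustration of that last step, here is a minimal sketch of mapping ATF sign readings to Unicode codepoints (the tiny mapping table and tokenization are simplifications; a real converter needs a full sign list and handling for determinatives, damage brackets, and other ATF annotations):

```python
import re

# Hypothetical reading-to-codepoint table; a real conversion would cover
# thousands of readings from a complete sign list.
SIGN_MAP = {
    "an": "\U0001202D",     # 𒀭 CUNEIFORM SIGN AN
    "lugal": "\U00012217",  # 𒈗 CUNEIFORM SIGN LUGAL
    "ki": "\U000121A0",     # 𒆠 CUNEIFORM SIGN KI
}

def atf_line_to_unicode(line: str) -> str:
    """Convert one ATF text line like '1. lugal-an ki' to Unicode signs."""
    line = re.sub(r"^\d+\.\s*", "", line)         # strip the ATF line number
    readings = re.split(r"[-\s]+", line.strip())  # signs joined by '-' or spaces
    return "".join(SIGN_MAP.get(r, "") for r in readings)

print(atf_line_to_unicode("1. lugal-an ki"))  # -> 𒈗𒀭𒆠
```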

SFT

The model was trained using Unsloth's FastVisionModel wrapper for full fine-tuning with gradient checkpointing, using the following hyperparameters (a configuration sketch follows the list):

  • Epochs: 2 (~32,000 steps)
  • Batch size: 2
  • Learning rate: 2e-5 with linear decay
  • Warmup: 5% of training steps
  • Optimizer: AdamW (8-bit)
  • Precision: BF16
  • Max sequence length: 16,000 tokens
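
Expressed as code, the setup looks roughly like the sketch below (Unsloth's FastVisionModel and TRL's SFTConfig are real APIs, but the checkpoint id and dataset wiring are assumptions, and argument names can shift between releases):

```python
from unsloth import FastVisionModel
from trl import SFTConfig, SFTTrainer

# Load the base model for full fine-tuning (no adapter at this stage).
model, tokenizer = FastVisionModel.from_pretrained(
    "PaddlePaddle/PaddleOCR-VL",           # assumed base checkpoint id
    full_finetuning=True,
    use_gradient_checkpointing="unsloth",
)

args = SFTConfig(
    num_train_epochs=2,
    per_device_train_batch_size=2,
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.05,
    optim="adamw_8bit",
    bf16=True,
    max_seq_length=16_000,
)

trainer = SFTTrainer(model=model, args=args, train_dataset=train_dataset)  # dataset not shown
trainer.train()
```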

(Figure: SFT training loss curve, sft-loss)

GRPO

Group Relative Policy Optimization (GRPO) was applied on top of the SFT checkpoint using the DR-GRPO loss. Unlike SFT, which learns from ground truth, GRPO generates multiple completions per image, scores them with reward functions, and updates the model to favor higher-scoring outputs (a setup sketch follows the list).

  • LoRA rank: 256 (RSLoRA with α=16)
  • Trainable parameters: 239M of 1.2B (20%)
  • Generations per prompt: 4
  • Batch size: 16
  • Learning rate: 5e-6 with cosine decay
  • Warmup: 3% of training steps
  • Optimizer: AdamW (8-bit)
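
In TRL terms, this stage corresponds roughly to the following setup (a sketch: GRPOConfig and LoraConfig are real APIs from trl and peft, but the exact arguments used in training/ may differ):

```python
from peft import LoraConfig
from trl import GRPOConfig

# Rank-stabilized LoRA adapter applied on top of the SFT checkpoint.
lora_config = LoraConfig(r=256, lora_alpha=16, use_rslora=True)

grpo_config = GRPOConfig(
    loss_type="dr_grpo",              # DR-GRPO loss variant
    num_generations=4,                # completions sampled per prompt
    per_device_train_batch_size=16,
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    optim="adamw_8bit",
)
```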

The reward function combined five components: weighted Token Error Rate using glyph visual similarity and curriculum learning, length deviation penalty, repetition penalty, line structure accuracy, and cuneiform character ratio. The adapter was merged back into the base model at 16-bit precision.
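
The full reward implementation lives in training/; its overall shape is a weighted sum along the lines of the sketch below (the weights and the simpler components are illustrative assumptions, and the glyph-similarity TER and line-structure terms are referenced but not reproduced here):

```python
import re

# The Unicode blocks that cover cuneiform signs.
CUNEIFORM = re.compile(
    r"[\U00012000-\U000123FF\U00012400-\U0001247F\U00012480-\U0001254F]"
)

def cuneiform_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are cuneiform signs."""
    chars = [c for c in text if not c.isspace()]
    return sum(bool(CUNEIFORM.match(c)) for c in chars) / max(len(chars), 1)

def repetition_penalty(text: str, n: int = 4) -> float:
    """Negative score when the output degenerates into repeated n-grams."""
    grams = [text[i : i + n] for i in range(len(text) - n + 1)]
    return -(1.0 - len(set(grams)) / len(grams)) if grams else 0.0

def length_penalty(pred: str, ref: str) -> float:
    """Penalize outputs much shorter or longer than the reference."""
    return -abs(len(pred) - len(ref)) / max(len(ref), 1)

def reward(pred: str, ref: str) -> float:
    # Weights are illustrative placeholders, not the trained values.
    return (
        2.0 * (1.0 - weighted_ter(pred, ref))       # glyph-similarity TER (see training/)
        + 1.0 * line_structure_score(pred, ref)     # line/face structure match (see training/)
        + 0.5 * length_penalty(pred, ref)
        + 0.5 * repetition_penalty(pred)
        + 0.5 * cuneiform_ratio(pred)
    )
```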

(Figure: GRPO reward over training steps, grpo-reward)

Story

For a more detailed account of how this model was trained, see STORY.md. To read the code used for training, see training/.

Performance

Evaluated on a held-out test set of 1,000 tablets using Token Error Rate (TER). Lower is better; 0% means a perfect transcription.
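
For reference, TER here is the usual edit-distance error rate over signs; a minimal implementation, under the assumption that each Unicode codepoint counts as one sign, is:

```python
def ter(pred: str, ref: str) -> float:
    """Token Error Rate: Levenshtein distance over signs / reference length."""
    m, n = len(pred), len(ref)
    d = list(range(n + 1))                  # distances for the previous row
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            prev, d[j] = d[j], min(
                d[j] + 1,                           # deletion
                d[j - 1] + 1,                       # insertion
                prev + (pred[i - 1] != ref[j - 1])  # substitution
            )
    return d[n] / max(n, 1)
```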

(Figure: TER on the held-out test set, performance)

Usage

Best Practices

Provide high-resolution images when possible (minimum 800x800 recommended) and include all visible sides of the tablet in a single image.

Ensure that the photographs are well-lit and have high contrast so that characters are readable, and remove excessive background from images.
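
One way to apply these recommendations programmatically is a small preprocessing pass like the Pillow sketch below (the helper and its 800-pixel threshold mirror the guidance above but are not part of NabuOCR itself):

```python
from PIL import Image, ImageOps

def prepare_tablet_image(path: str, min_side: int = 800) -> Image.Image:
    """Upscale small photos and stretch contrast before running inference."""
    img = Image.open(path).convert("RGB")
    if min(img.size) < min_side:  # meet the 800x800 minimum-resolution guideline
        scale = min_side / min(img.size)
        img = img.resize((round(img.width * scale), round(img.height * scale)))
    return ImageOps.autocontrast(img)  # boost contrast so impressions stay legible
```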

For more details on the best image format, see the CDLI guidelines.

Limitations

NabuOCR performs best on well-preserved tablets with clear impressions and may struggle with heavily damaged or eroded sections.

Note that the model only supports Sumerian and Akkadian, and it has limited support for complex literary texts with unusual sign variants.

Citation

If you use NabuOCR in your research, please cite:

```bibtex
@software{nabuocr2025,
  title={NabuOCR: Neural Cuneiform Transliteration},
  author={Zack Williams},
  year={2025},
  url={https://huggingface.co/boatbomber/NabuOCR}
}
```

Acknowledgments
