PhoneticXeus

PhoneticXeus is a multilingual phone recognition model that applies self-conditioned CTC on top of the XEUS speech encoder. It was trained on IPAPack data covering 100+ languages with IPA transcriptions. For details, see the paper "An Empirical Recipe for Universal Phone Recognition".

GitHub: changelinglab/PhoneticXeus

Files

File	Description
`checkpoint-22000.ckpt`	Model checkpoint (PyTorch Lightning)
`ipa_vocab.json`	IPA vocabulary (token-to-id mapping)
`config_tree.log`	Hydra config used for training
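
Since `ipa_vocab.json` is described as a token-to-id mapping, decoding model output ids back to IPA symbols needs the inverse map. The following is a minimal sketch, assuming the file is a flat JSON object of the form `{"token": id, ...}`; the helper name `load_vocab` is hypothetical and not part of the PhoneticXeus codebase.

```python
import json


def load_vocab(path):
    """Load a token-to-id JSON mapping and build the inverse id-to-token map.

    Assumes a flat JSON object such as {"a": 1, "b": 2}; the real file
    layout may differ.
    """
    with open(path, encoding="utf-8") as f:
        token_to_id = json.load(f)
    # Invert the mapping so predicted ids can be turned back into IPA tokens.
    id_to_token = {idx: tok for tok, idx in token_to_id.items()}
    if len(id_to_token) != len(token_to_id):
        raise ValueError("vocabulary ids are not unique")
    return token_to_id, id_to_token
```

Reading with `encoding="utf-8"` matters here because IPA symbols are outside ASCII.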

Usage

1. Install PhoneticXeus

git clone git@github.com:changelinglab/PhoneticXeus.git
cd PhoneticXeus
make install
source .venv/bin/activate

2. Download checkpoint

from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download("changelinglab/PhoneticXeus", "checkpoint-22000.ckpt")

3. Run inference

import torch
import torchaudio
from src.model.xeusphoneme.builders import build_xeus_pr_inference

# Build inference object
inference = build_xeus_pr_inference(
    work_dir="exp/cache/xeus",        # cache dir for XEUS base weights
    checkpoint=ckpt_path,              # path to downloaded checkpoint
    vocab_file="src/model/xeusphoneme/resources/ipa_vocab.json",
    hf_repo="espnet/xeus",            # base encoder weights
    device="cuda" if torch.cuda.is_available() else "cpu",
)

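# Transcribe audio (resampled to 16 kHz if needed)
waveform, sr = torchaudio.load("path/to/audio.wav")  # shape: (channels, samples)
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)
waveform = waveform.mean(dim=0)  # downmix to mono; equals squeeze(0) for single-channel audio

results = inference(waveform)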
print(results[0]["processed_transcript"])
# e.g., "h ə l oʊ w ɝ l d"

4. Distributed inference on evaluation sets

python src/main.py \
    experiment=inference/transcribe_xeuspr_selfctc \
    data=powsmeval data.dataset_name=doreco \
    inference.inference_runner.checkpoint=path/to/checkpoint-22000.ckpt

See Running Inference for SLURM-based distributed inference.

Model Architecture

Encoder: XEUS (E-Branchformer, 18 layers, 1024-dim)
CTC: Self-conditioned intermediate CTC at layers 4, 8, 12 (encoder-only, no decoder)
Vocabulary: 395 IPA tokens
Training: CTC loss with self-conditioning, 22k steps on IPAPack accent-mix data
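
Because the model is encoder-only with a CTC output, a phone sequence can be read off the per-frame predictions without a separate decoder. The sketch below illustrates standard greedy CTC decoding (merge consecutive repeats, then drop blanks). It is not taken from the PhoneticXeus code; the function name and the choice of `blank_id=0` are assumptions for illustration.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse per-frame argmax ids into a label sequence.

    blank_id=0 is an assumption; check the vocabulary for the real blank index.
    """
    # groupby merges runs of identical consecutive ids; blanks are dropped
    # afterwards, so a blank between two equal labels keeps both labels.
    return [label for label, _ in groupby(frame_ids) if label != blank_id]
```

For example, `ctc_greedy_collapse([0, 1, 1, 0, 1, 2, 2, 0])` keeps the two occurrences of label 1 that are separated by a blank and merges the repeated 2s.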

Metrics

Evaluated with PhoneRecognitionEvaluator from PhoneticXeus:

PER (Phone Error Rate)
PFER (Phone Feature Error Rate)
FED (Feature Edit Distance)
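
As a rough illustration of the first metric, PER is commonly computed as the Levenshtein distance between reference and hypothesis phone sequences, divided by the reference length. The sketch below assumes space-separated phone strings like the `processed_transcript` example above; it is not the `PhoneRecognitionEvaluator` implementation, which may normalize or tokenize differently. PFER and FED additionally require articulatory feature vectors and are not sketched here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phone_error_rate(ref_transcript, hyp_transcript):
    """PER for space-separated phone strings (hypothetical helper)."""
    ref = ref_transcript.split()
    hyp = hyp_transcript.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref, hyp) / len(ref)
```

Splitting on whitespace keeps multi-character phones such as `oʊ` as single tokens, which a character-level comparison would not.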

Citation

If you use this model, please cite:

@misc{phoneticxeus2025,
    title={PhoneticXeus: Multilingual Phone Recognition},
    url={https://github.com/changelinglab/PhoneticXeus},
    year={2025}
}
