PhoneticXeus
PhoneticXeus is a multilingual phone recognition model that applies self-conditioned CTC on top of the XEUS speech encoder. It was trained on IPAPack data, which covers 100+ languages with IPA transcriptions. For details, see the paper: An Empirical Recipe for Universal Phone Recognition.
GitHub: changelinglab/PhoneticXeus
Files
| File | Description |
|---|---|
| checkpoint-22000.ckpt | Model checkpoint (PyTorch Lightning) |
| ipa_vocab.json | IPA vocabulary (token-to-id mapping) |
| config_tree.log | Hydra config used for training |
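As a minimal sketch of how ipa_vocab.json might be used, the snippet below loads a token-to-id mapping and inverts it so that ids can be turned back into IPA tokens. It assumes the file is a flat JSON object of the form `{"token": id}`; the helper names `load_vocab` and `ids_to_tokens` are hypothetical and not part of the PhoneticXeus codebase.

```python
import json
from pathlib import Path


def load_vocab(path):
    """Load a token-to-id mapping (assumed: flat JSON object {"token": id})."""
    with Path(path).open(encoding="utf-8") as f:
        token_to_id = json.load(f)
    # Invert the mapping so that ids can be turned back into IPA tokens.
    id_to_token = {idx: tok for tok, idx in token_to_id.items()}
    if len(id_to_token) != len(token_to_id):
        raise ValueError("vocabulary ids are not unique")
    return token_to_id, id_to_token


def ids_to_tokens(ids, id_to_token):
    """Map a sequence of ids to the corresponding tokens."""
    return [id_to_token[i] for i in ids]
```

If the real file nests the mapping under another key, only the `json.load` line would need to change.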
Usage
1. Install PhoneticXeus
git clone git@github.com:changelinglab/PhoneticXeus.git
cd PhoneticXeus
make install
source .venv/bin/activate
2. Download checkpoint
from huggingface_hub import hf_hub_download
ckpt_path = hf_hub_download("changelinglab/PhoneticXeus", "checkpoint-22000.ckpt")
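The same call can fetch the other files listed above. The following is a sketch only: it assumes that all three files sit at the root of the Hugging Face repository, and `download_model_files` is a hypothetical helper, not something the PhoneticXeus package provides. The `fetch` argument exists so the function can be exercised without network access.

```python
def download_model_files(
    repo_id="changelinglab/PhoneticXeus",
    filenames=("checkpoint-22000.ckpt", "ipa_vocab.json", "config_tree.log"),
    fetch=None,
):
    """Return {filename: local_path} for each requested file."""
    if fetch is None:
        # Imported lazily so the helper can be tested with a stand-in fetcher.
        from huggingface_hub import hf_hub_download
        fetch = hf_hub_download
    return {name: fetch(repo_id, name) for name in filenames}
```

Calling `download_model_files()["checkpoint-22000.ckpt"]` would then give the same path as `ckpt_path` above, provided the assumption about the repository layout holds.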
3. Run inference
import torch
import torchaudio
from src.model.xeusphoneme.builders import build_xeus_pr_inference
# Build inference object
inference = build_xeus_pr_inference(
    work_dir="exp/cache/xeus",  # cache dir for XEUS base weights
    checkpoint=ckpt_path,  # path to downloaded checkpoint
    vocab_file="src/model/xeusphoneme/resources/ipa_vocab.json",
    hf_repo="espnet/xeus",  # base encoder weights
    device="cuda" if torch.cuda.is_available() else "cpu",
)
# Transcribe audio
waveform, sr = torchaudio.load("path/to/audio.wav")
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)
results = inference(waveform.squeeze(0))
print(results[0]["processed_transcript"])
# e.g., "h ə l oʊ w ɝ l d"
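As background on how an encoder-only CTC model can arrive at a space-separated phone string like the one above, here is a minimal sketch of greedy CTC decoding: take the most likely label per frame, collapse consecutive repeats, then drop blanks. Whether PhoneticXeus decodes exactly this way is an assumption, as are the blank id of 0 and the toy vocabulary; `ctc_greedy_decode` is a hypothetical helper.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0):
    """Collapse repeated frame labels, then remove blanks (blank_id is assumed)."""
    collapsed = [label for label, _ in groupby(frame_ids)]
    return " ".join(id_to_token[i] for i in collapsed if i != blank_id)


# Toy vocabulary for illustration only (not the real ipa_vocab.json).
toy_vocab = {0: "<blank>", 1: "h", 2: "ə", 3: "l", 4: "oʊ"}
frames = [0, 1, 1, 0, 2, 2, 2, 3, 0, 3, 4, 4, 0]
print(ctc_greedy_decode(frames, toy_vocab))  # prints "h ə l l oʊ"
```

The blank between the two runs of `l` is what lets CTC emit the same phone twice in a row; without it the repeats would collapse into one.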
4. Distributed inference on evaluation sets
python src/main.py \
experiment=inference/transcribe_xeuspr_selfctc \
data=powsmeval data.dataset_name=doreco \
inference.inference_runner.checkpoint=path/to/checkpoint-22000.ckpt
See Running Inference for SLURM-based distributed inference.
Model Architecture
- Encoder: XEUS (E-Branchformer, 18 layers, 1024-dim)
- CTC: Self-conditioned intermediate CTC at layers 4, 8, 12 (encoder-only, no decoder)
- Vocabulary: 395 IPA tokens
- Training: CTC loss with self-conditioning, 22k steps on IPAPack accent-mix data
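The self-conditioning idea can be sketched in a few lines of NumPy. At each conditioning layer, the hidden states are projected to CTC posteriors, and those posteriors are projected back and added to the hidden states so that later layers see the intermediate prediction. This is an illustration with toy sizes and stand-in layers, not the repository's PyTorch implementation; sharing one CTC head across intermediate and final layers and omitting layer normalization are assumptions made for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_conditioned_encoder(x, layers, w_ctc, w_back, cond_layers=(4, 8, 12)):
    """Run encoder layers, feeding intermediate CTC posteriors back in.

    x: (frames, dim), w_ctc: (dim, vocab), w_back: (vocab, dim).
    Returns the final posteriors and the list of intermediate posteriors.
    """
    intermediate = []
    for idx, layer in enumerate(layers, start=1):
        x = layer(x)
        if idx in cond_layers:
            post = softmax(x @ w_ctc)
            intermediate.append(post)  # would receive an auxiliary CTC loss in training
            x = x + post @ w_back  # condition later layers on the prediction
    return softmax(x @ w_ctc), intermediate


rng = np.random.default_rng(0)
dim, vocab, frames = 16, 8, 5  # toy sizes; the model uses 1024-dim and 395 tokens
weights = [rng.normal(scale=0.1, size=(dim, dim)) for _ in range(18)]
# Stand-in residual layers; the real encoder uses E-Branchformer blocks.
layers = [lambda h, w=w: h + np.tanh(h @ w) for w in weights]
w_ctc = rng.normal(scale=0.1, size=(dim, vocab))
w_back = rng.normal(scale=0.1, size=(vocab, dim))
final, inter = self_conditioned_encoder(
    rng.normal(size=(frames, dim)), layers, w_ctc, w_back
)
print(final.shape, len(inter))  # prints "(5, 8) 3"
```

During training, the CTC loss on the final posteriors would be combined with CTC losses on each intermediate posterior; the weighting between them is not specified here.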
Metrics
The model is evaluated with the PhoneRecognitionEvaluator from the PhoneticXeus codebase, which reports:
- PER (Phone Error Rate)
- PFER (Phone Feature Error Rate)
- FED (Feature Edit Distance)
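To make PER concrete, here is a minimal sketch that computes it as the Levenshtein distance between reference and hypothesis phone sequences, divided by the reference length. It assumes transcripts are space-separated phone strings, as in the inference example, and it is not the PhoneRecognitionEvaluator implementation; `edit_distance` and `phone_error_rate` are hypothetical names. PFER and FED additionally need a phone-to-feature table and are not covered.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, using two rolling rows."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(
                min(
                    prev[j] + 1,  # deletion
                    curr[j - 1] + 1,  # insertion
                    prev[j - 1] + (r != h),  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def phone_error_rate(ref_transcript, hyp_transcript):
    """PER = edit distance over phones / number of reference phones."""
    ref = ref_transcript.split()
    hyp = hyp_transcript.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref, hyp) / len(ref)


# One substitution (ə -> ɛ) and one deletion (d) over 8 reference phones.
print(phone_error_rate("h ə l oʊ w ɝ l d", "h ɛ l oʊ w ɝ l"))  # prints 0.25
```

Because the distance is computed over whitespace-separated tokens, a multi-character phone such as `oʊ` counts as a single unit.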
Citation
If you use this model, please cite:
@misc{phoneticxeus2025,
  title={PhoneticXeus: Multilingual Phone Recognition},
  url={https://github.com/changelinglab/PhoneticXeus},
  year={2025}
}