Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait
Jie Sun 3 Guangliang Cheng 1 Yifei Zhang 5 Bin Dong 4 Kaizhu Huang 4
4 Duke Kunshan University 5 Ricoh Software Research Center
Comparative videos
https://github.com/user-attachments/assets/08ebc6e0-41c5-4bf4-8ee8-2f7d317d92cd
Demo
Gradio Demo: KDTalker
The model was trained using only 4,282 video clips from VoxCeleb.
To Do List
- Train a community version using more datasets
- Release training code
Environment
KDTalker can run on a single RTX 4090 or RTX 3090.
1. Clone the code and prepare the environment
Note: Make sure your system has git, conda, and FFmpeg installed.
git clone https://github.com/chaolongy/KDTalker
cd KDTalker
# create env using conda
conda create -n KDTalker python=3.9
conda activate KDTalker
conda install pytorch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
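Before creating the environment, it may help to confirm the prerequisites from the note above are in place. The sketch below is a minimal, hypothetical pre-flight check (not part of the KDTalker repo) using only the standard library:

```python
import shutil
import sys

# Hypothetical pre-flight check for the tools the README requires
# (git, conda, FFmpeg); none of this is part of the KDTalker codebase.
def check_prerequisites():
    """Return a dict mapping each required tool to whether it is on PATH."""
    return {tool: shutil.which(tool) is not None for tool in ("git", "conda", "ffmpeg")}

def python_ok(version=sys.version_info):
    """The conda env above pins Python 3.9; check major.minor matches."""
    return (version[0], version[1]) == (3, 9)

if __name__ == "__main__":
    for tool, found in check_prerequisites().items():
        print(f"{tool}: {'found' if found else 'MISSING'}")
```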
2. Download pretrained weights
First, download all LivePortrait pretrained weights from Google Drive. Unzip and place them in ./pretrained_weights
.
Ensure the directory structure is as follows:
pretrained_weights
├── insightface
│ └── models
│ └── buffalo_l
│ ├── 2d106det.onnx
│ └── det_10g.onnx
└── liveportrait
├── base_models
│ ├── appearance_feature_extractor.pth
│ ├── motion_extractor.pth
│ ├── spade_generator.pth
│ └── warping_module.pth
├── landmark.onnx
└── retargeting_models
└── stitching_retargeting_module.pth
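To catch a misplaced file early, the layout above can be verified with a short script. This is a sketch using only the standard library; the `missing_weights` helper is illustrative, not part of the repo:

```python
from pathlib import Path

# Expected pretrained-weight files, relative to ./pretrained_weights,
# transcribed from the directory tree in the README.
EXPECTED = [
    "insightface/models/buffalo_l/2d106det.onnx",
    "insightface/models/buffalo_l/det_10g.onnx",
    "liveportrait/base_models/appearance_feature_extractor.pth",
    "liveportrait/base_models/motion_extractor.pth",
    "liveportrait/base_models/spade_generator.pth",
    "liveportrait/base_models/warping_module.pth",
    "liveportrait/landmark.onnx",
    "liveportrait/retargeting_models/stitching_retargeting_module.pth",
]

def missing_weights(root="pretrained_weights"):
    """Return the expected weight files that are missing under `root`."""
    base = Path(root)
    return [rel for rel in EXPECTED if not (base / rel).exists()]

if __name__ == "__main__":
    gone = missing_weights()
    if gone:
        print("Missing files:", *gone, sep="\n  ")
    else:
        print("All pretrained weights found.")
```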
You can download the weights for the face detector, audio extractor, and KDTalker from Google Drive. Put them in ./ckpts
.
Alternatively, you can download all of the above weights from Hugging Face.
Inference
python inference.py -source_image ./example/source_image/WDA_BenCardin1_000.png -driven_audio ./example/driven_audio/WDA_BenCardin1_000.wav -output ./results/output.mp4
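To process several portrait/audio pairs, the command above can be wrapped in a small driver. The sketch below assumes each source image has an audio file with the same stem; the helper names and directory layout are illustrative, not from the repo:

```python
import subprocess
from pathlib import Path

def build_command(image, audio, output):
    """Assemble the inference.py invocation shown in the README."""
    return [
        "python", "inference.py",
        "-source_image", str(image),
        "-driven_audio", str(audio),
        "-output", str(output),
    ]

def batch_commands(image_dir, audio_dir, out_dir):
    """Pair each .png with the .wav of the same stem and build commands."""
    cmds = []
    for image in sorted(Path(image_dir).glob("*.png")):
        audio = Path(audio_dir) / (image.stem + ".wav")
        if audio.exists():
            cmds.append(build_command(image, audio, Path(out_dir) / f"{image.stem}.mp4"))
    return cmds

if __name__ == "__main__":
    for cmd in batch_commands("./example/source_image", "./example/driven_audio", "./results"):
        subprocess.run(cmd, check=True)
```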
Contact
Our code is released under the CC-BY-NC 4.0 license and is intended solely for research purposes. If you have any questions or wish to use it for commercial purposes, please contact us at [email protected]
Citation
If you find this code helpful for your research, please cite:
@misc{yang2025kdtalker,
title={Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait},
author={Chaolong Yang and Kai Yao and Yuyao Yan and Chenru Jiang and Weiguang Zhao and Jie Sun and Guangliang Cheng and Yifei Zhang and Bin Dong and Kaizhu Huang},
year={2025},
eprint={2503.12963},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.12963},
}
Acknowledgements
We thank these projects for their public code and generous help: SadTalker, LivePortrait, Wav2Lip, Face-vid2vid, etc.