shipra-99 committed
Commit 7aea5cd · 0 Parent(s)

Initial commit with essential files only

This view is limited to 50 files because it contains too many changes. See raw diff.
.gitignore ADDED
@@ -0,0 +1,14 @@
1
+ cat > .gitignore << EOF
2
+ # Binary and large files
3
+ *.pkl
4
+ *.mp4
5
+ *.npy
6
+ *.wav
7
+ # Demo binary files
8
+ demo/**/*.mp4
9
+ demo/**/*.npy
10
+ # Large model files
11
+ experiments/
12
+ # Any other large files
13
+ visualise/teaser_01.png
14
+ EOF
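As committed above, the .gitignore itself contains the heredoc wrapper lines (`cat > .gitignore << EOF` and `EOF`), which git will read as literal patterns. A minimal cleaned-up version, keeping only the intended patterns, would be:

```gitignore
# Binary and large files
*.pkl
*.mp4
*.npy
*.wav
# Demo binary files
demo/**/*.mp4
demo/**/*.npy
# Large model files
experiments/
# Any other large files
visualise/teaser_01.png
```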
Dockerfile ADDED
@@ -0,0 +1,34 @@
1
+ FROM python:3.7-cuda11.3-runtime
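+ # NOTE (hedged): the official python:3.7 images on Docker Hub do not ship CUDA, so a
+ # "3.7-cuda11.3-runtime" tag is unlikely to resolve. A CUDA base image (e.g.
+ # nvidia/cuda:11.3.1-cudnn8-runtime-ubuntu20.04) with Python 3.7 installed on top may
+ # be needed for GPU inference.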
2
+
3
+ # System dependencies
4
+ RUN apt-get update && apt-get install -y \
5
+ ffmpeg \
6
+ libgl1-mesa-glx \
7
+ git \
8
+ wget \
9
+ unzip \
10
+ && rm -rf /var/lib/apt/lists/*
11
+
12
+ # Set up a non-root user for Hugging Face Space compatibility
13
+ RUN useradd -m -u 1000 user
14
+ USER user
15
+ WORKDIR /home/user
16
+
17
+ # Copy your code
18
+ COPY --chown=user . .
19
+
20
+ # Install Python dependencies
21
+ RUN pip install --no-cache-dir -r requirements.txt
22
+
23
+ # Create directories for model files if they don't exist
24
+ RUN mkdir -p visualise/smplx_model
25
+ RUN mkdir -p experiments
26
+ RUN mkdir -p visualise/video/body-pixel
27
+
28
+ # Set environment variables for GPU
29
+ ENV PYTHONUNBUFFERED=1
30
+ ENV NVIDIA_VISIBLE_DEVICES=all
31
+ ENV NVIDIA_DRIVER_CAPABILITIES=all
32
+
33
+ # Default command - modify if you have a different entry point
34
+ CMD ["python", "app.py"]
README.md ADDED
@@ -0,0 +1,116 @@
1
+ title: TalkSHOW Speech-to-Motion Translation
2
+ emoji: 🎙️
3
+ colorFrom: blue
4
+ colorTo: purple
5
+ sdk: docker
6
+ app_port: 7860
7
+ pinned: false
8
+ license: mit
9
+
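One hedged note on the Space configuration: Hugging Face Spaces only reads these keys when they sit inside a YAML front-matter block delimited by `---` lines, which this README appears to omit. The intended header would look like:

```yaml
---
title: TalkSHOW Speech-to-Motion Translation
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: mit
---
```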
10
+ # Team 14 - TalkSHOW: Generating Holistic 3D Human Motion from Speech
11
+
12
+ Contributors - Abinaya Odeti, Shipra, Shravani, Vishal
13
+
14
+ ![teaser](visualise/teaser_01.png)
15
+
16
+ ## About
17
+
18
+ This repository hosts the implementation of "TalkSHOW: A Speech-to-Motion Translation System", which maps raw audio input to full-body 3D motion using the SMPL-X model. It enables synchronized generation of expressive human body motion (including face, hands, and body) from speech input — supporting real-time animation, virtual avatars, and digital storytelling.
19
+
20
+ ## Highlights
21
+
22
+ Translates raw .wav audio into natural whole-body motion (jaw, pose, expressions, hands) using deep learning.
23
+
24
+ Based on SMPL-X model for realistic 3D human mesh generation.
25
+
26
+ Modular pipeline with support for face-body composition.
27
+
28
+ Visualization with OpenGL & FFmpeg for final rendered video.
29
+
30
+ End-to-end customizable configuration with audio models, latent generation, and rendering.
31
+
32
+ ## Prerequisites
33
+
34
+ Python 3.7+
35
+
36
+ Anaconda for environment management
37
+
38
+ Install required packages:
39
+
40
+ ```bash
41
+ pip install -r requirements.txt
42
+ ```
43
+ Install FFmpeg
44
+
45
+ ➤ Extract the FFmpeg ZIP and add its `bin` folder to the system PATH
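A quick way to confirm FFmpeg is reachable from the shell (the repo does not pin a specific FFmpeg version):

```bash
ffmpeg -version   # should print the FFmpeg version and build configuration
```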
46
+
47
+
48
+ ## Getting started
49
+
50
+ The visualization code was tested on `Windows 10`, and it requires:
51
+
52
+ * Python 3.7
53
+ * conda3 or miniconda3
54
+ * CUDA capable GPU (one is enough)
55
+
56
+
57
+
58
+ ### 1. Setup and Steps
59
+
60
+ Clone the repo:
61
+ ```bash
62
+ git clone https://github.com/YOUR_USERNAME/TALKSHOW-speech-to-motion-translation-system.git
63
+ cd TALKSHOW-speech-to-motion-translation-system
64
+ ```
65
+ Create conda environment:
66
+ ```bash
67
+ conda create -n talkshow python=3.7 -y
68
+ conda activate talkshow
69
+ pip install -r requirements.txt
70
+ ```
71
+
72
+ ### 2. Download models
73
+ Download or place the required checkpoints:
74
+ Download [**pretrained models**](https://drive.google.com/file/d/1bC0ZTza8HOhLB46WOJ05sBywFvcotDZG/view?usp=sharing),
75
+ unzip it and place it in the TalkSHOW folder, i.e. ``path-to-TalkSHOW/experiments``.
76
+
77
+ Download [**smplx model**](https://drive.google.com/file/d/1Ly_hQNLQcZ89KG0Nj4jYZwccQiimSUVn/view?usp=share_link) (please register on the official [**SMPL-X webpage**](https://smpl-x.is.tue.mpg.de) before using it)
78
+ and place it in ``path-to-TalkSHOW/visualise/smplx_model``.
79
+ The pipeline can visualise the test set and generated results (in each video, left: generated result | right: ground truth).
80
+ The videos and generated motion data are saved in ``./visualise/video/body-pixel``:
81
+
82
+ SMPLX Model Weights – visualise/smplx_model/SMPLX_NEUTRAL_2020.npz
83
+
84
+ Extra joints, regressors, YAML configs – inside visualise/smplx_model/
85
+
86
+ Also, ensure `vq_path` in `body_pixel.json` points to a valid `.pth` checkpoint (in `./experiments/.../ckpt-*.pth`)
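After downloading, the expected layout (assembled from the paths above and the `vq_path` / SMPL-X paths in `config/body_pixel.json`) is roughly:

```
TalkSHOW/
├── experiments/
│   └── 2022-10-31-smplx_S2G-body-vq-3d/
│       └── ckpt-99.pth
└── visualise/
    └── smplx_model/
        ├── SMPLX_NEUTRAL_2020.npz
        ├── smplx_extra_joints.yaml
        └── SMPLX_to_J14.pkl
```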
87
+
88
+
89
+ ### 3. 🎙️ Running Inference
90
+
91
+ To generate a 3D animated video from an audio file:
92
+ ```bash
93
+ python scripts/demo.py \
94
+ --config_file ./config/body_pixel.json \
95
+ --infer \
96
+ --audio_file ./demo_audio/1st-page.wav \
97
+ --id 0 \
98
+ --whole_body
99
+ ```
100
+ Change Input
101
+ Replace the `--audio_file` value with your own .wav file path.
102
+
103
+
104
+ ### 4. Output
105
+ The final 3D animated video will be saved under:
106
+ ```bash
107
+ visualise/video/body-pixel2/<audio_file_name>/1st-page.mp4
108
+ ```
109
+ The exact command used to run the project:
110
+ ```bash
111
+ python scripts/demo.py --config_file ./config/body_pixel.json --infer --audio_file ./demo_audio/1st-page.wav --id 0 --whole_body
112
+ ```
113
+
114
+ ### Contact
115
+
116
+ For issues or questions, raise an issue or contact the contributors directly!
__init__.py ADDED
File without changes
app.py ADDED
@@ -0,0 +1,65 @@
1
+ import gradio as gr
2
+ import os
3
+ import subprocess
4
+ import time
5
+ import shutil
6
+
7
+ def process_audio(audio_file):
8
+ # Generate a unique timestamp for this run
9
+ timestamp = str(int(time.time()))
10
+
11
+ # Save the uploaded audio to a temporary location
12
+ temp_audio_path = f"./temp_{timestamp}.wav"
13
+ # gr.Audio(type="filepath") passes a path string, so copy the file instead of writing raw bytes
+ shutil.copy(audio_file, temp_audio_path)
15
+
16
+ # Create output directory if it doesn't exist
17
+ os.makedirs("visualise/video/body-pixel2", exist_ok=True)
18
+
19
+ # Run the TALKSHOW inference script
20
+ cmd = [
21
+ "python", "scripts/demo.py",
22
+ "--config_file", "./config/body_pixel.json",
23
+ "--infer",
24
+ "--audio_file", temp_audio_path,
25
+ "--id", "0",
26
+ "--whole_body"
27
+ ]
28
+
29
+ try:
30
+ result = subprocess.run(
31
+ cmd,
32
+ stdout=subprocess.PIPE,
33
+ stderr=subprocess.PIPE,
34
+ text=True
35
+ )
36
+
37
+ # Get the output video path
38
+ audio_name = os.path.basename(temp_audio_path).split('.')[0].replace("temp_", "")
39
+ output_dir = f"visualise/video/body-pixel2/{audio_name}"
40
+ output_path = f"{output_dir}/1st-page.mp4"
41
+
42
+ # Check if the output video was created
43
+ if os.path.exists(output_path):
44
+ return output_path
45
+ else:
46
+ # a single Video output is declared, so surface failures as a Gradio error instead of a tuple
+ raise gr.Error(f"TalkSHOW inference failed: {result.stderr}")
47
+
48
+ finally:
49
+ # Clean up temporary file
50
+ if os.path.exists(temp_audio_path):
51
+ os.remove(temp_audio_path)
52
+
53
+ # Create Gradio interface
54
+ demo = gr.Interface(
55
+ fn=process_audio,
56
+ inputs=gr.Audio(type="filepath"),
57
+ outputs=gr.Video(),
58
+ title="TalkSHOW: Speech-to-Motion Translation System",
59
+ description="Convert speech audio to realistic 3D human motion using the SMPL-X model.",
60
+ examples=[["demo_audio/1st-page.wav"]]
61
+ )
62
+
63
+ # Launch the app
64
+ if __name__ == "__main__":
65
+ demo.launch(server_name="0.0.0.0", server_port=7860)
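Once the Space (or a local `python app.py`) is running, the Gradio endpoint can also be called programmatically. A minimal sketch with `gradio_client` — the Space name `shipra-99/TalkSHOW` is an assumption, substitute the real one; newer `gradio_client` releases may require wrapping the audio path with `handle_file()`:

```python
# hedged sketch: query the running Gradio app from Python
from gradio_client import Client

client = Client("shipra-99/TalkSHOW")   # or Client("http://localhost:7860") for a local run
video_path = client.predict(
    "demo_audio/1st-page.wav",          # audio input, passed as a file path
    api_name="/predict",                # default endpoint name for a single gr.Interface
)
print(video_path)                       # path to the generated .mp4
```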
config/LS3DCG.json ADDED
@@ -0,0 +1,65 @@
1
+ {
2
+ "config_root_path": "/is/cluster/scratch/hyi/ExpressiveBody/SMPLifyX4/scripts",
3
+ "dataset_load_mode": "pickle",
4
+ "store_file_path": "store.pkl",
5
+ "smplx_npz_path": "visualise/smplx_model/SMPLX_NEUTRAL_2020.npz",
6
+ "extra_joint_path": "visualise/smplx_model/smplx_extra_joints.yaml",
7
+ "j14_regressor_path": "visualise/smplx_model/SMPLX_to_J14.pkl",
8
+ "param": {
9
+ "w_j": 1,
10
+ "w_b": 1,
11
+ "w_h": 1
12
+ },
13
+ "Data": {
14
+ "data_root": "../ExpressiveWholeBodyDatasetv1.0/",
15
+ "pklname": "_3d_mfcc.pkl",
16
+ "whole_video": false,
17
+ "pose": {
18
+ "normalization": false,
19
+ "convert_to_6d": false,
20
+ "norm_method": "all",
21
+ "augmentation": false,
22
+ "generate_length": 88,
23
+ "pre_pose_length": 0,
24
+ "pose_dim": 99,
25
+ "expression": true
26
+ },
27
+ "aud": {
28
+ "feat_method": "mfcc",
29
+ "aud_feat_dim": 64,
30
+ "aud_feat_win_size": null,
31
+ "context_info": false
32
+ }
33
+ },
34
+ "Model": {
35
+ "model_type": "body",
36
+ "model_name": "s2g_LS3DCG",
37
+ "code_num": 2048,
38
+ "AudioOpt": "Adam",
39
+ "encoder_choice": "mfcc",
40
+ "gan": false
41
+ },
42
+ "DataLoader": {
43
+ "batch_size": 128,
44
+ "num_workers": 0
45
+ },
46
+ "Train": {
47
+ "epochs": 100,
48
+ "max_gradient_norm": 5,
49
+ "learning_rate": {
50
+ "generator_learning_rate": 1e-4,
51
+ "discriminator_learning_rate": 1e-4
52
+ },
53
+ "weights": {
54
+ "keypoint_loss_weight": 1.0,
55
+ "gan_loss_weight": 1.0
56
+ }
57
+ },
58
+ "Log": {
59
+ "save_every": 50,
60
+ "print_every": 200,
61
+ "name": "LS3DCG"
62
+ },
63
+ "device": "cpu"
64
+ }
65
+
config/body_pixel.json ADDED
@@ -0,0 +1,63 @@
1
+ {
2
+ "config_root_path": "/is/cluster/scratch/hyi/ExpressiveBody/SMPLifyX4/scripts",
3
+ "dataset_load_mode": "json",
4
+ "store_file_path": "store.pkl",
5
+ "smplx_npz_path": "visualise/smplx_model/SMPLX_NEUTRAL_2020.npz",
6
+ "extra_joint_path": "visualise/smplx_model/smplx_extra_joints.yaml",
7
+ "j14_regressor_path": "visualise/smplx_model/SMPLX_to_J14.pkl",
8
+ "param": {
9
+ "w_j": 1,
10
+ "w_b": 1,
11
+ "w_h": 1
12
+ },
13
+ "Data": {
14
+ "data_root": "../ExpressiveWholeBodyDatasetv1.0/",
15
+ "pklname": "_3d_mfcc.pkl",
16
+ "whole_video": false,
17
+ "pose": {
18
+ "normalization": false,
19
+ "convert_to_6d": false,
20
+ "norm_method": "all",
21
+ "augmentation": false,
22
+ "generate_length": 88,
23
+ "pre_pose_length": 0,
24
+ "pose_dim": 99,
25
+ "expression": true
26
+ },
27
+ "aud": {
28
+ "feat_method": "mfcc",
29
+ "aud_feat_dim": 64,
30
+ "aud_feat_win_size": null,
31
+ "context_info": false
32
+ }
33
+ },
34
+ "Model": {
35
+ "model_type": "body",
36
+ "model_name": "s2g_body_pixel",
37
+ "composition": true,
38
+ "code_num": 2048,
39
+ "bh_model": true,
40
+ "AudioOpt": "Adam",
41
+ "encoder_choice": "mfcc",
42
+ "gan": false,
43
+ "vq_path": "./experiments/2022-10-31-smplx_S2G-body-vq-3d/ckpt-99.pth"
44
+ },
45
+ "DataLoader": {
46
+ "batch_size": 128,
47
+ "num_workers": 0
48
+ },
49
+ "Train": {
50
+ "epochs": 100,
51
+ "max_gradient_norm": 5,
52
+ "learning_rate": {
53
+ "generator_learning_rate": 1e-4,
54
+ "discriminator_learning_rate": 1e-4
55
+ }
56
+ },
57
+ "Log": {
58
+ "save_every": 50,
59
+ "print_every": 200,
60
+ "name": "body-pixel2"
61
+ }
62
+ }
63
+
config/body_vq.json ADDED
@@ -0,0 +1,62 @@
1
+ {
2
+ "config_root_path": "/is/cluster/scratch/hyi/ExpressiveBody/SMPLifyX4/scripts",
3
+ "dataset_load_mode": "json",
4
+ "store_file_path": "store.pkl",
5
+ "smplx_npz_path": "visualise/smplx_model/SMPLX_NEUTRAL_2020.npz",
6
+ "extra_joint_path": "visualise/smplx_model/smplx_extra_joints.yaml",
7
+ "j14_regressor_path": "visualise/smplx_model/SMPLX_to_J14.pkl",
8
+ "param": {
9
+ "w_j": 1,
10
+ "w_b": 1,
11
+ "w_h": 1
12
+ },
13
+ "Data": {
14
+ "data_root": "../ExpressiveWholeBodyDatasetv1.0/",
15
+ "pklname": "_3d_mfcc.pkl",
16
+ "whole_video": false,
17
+ "pose": {
18
+ "normalization": false,
19
+ "convert_to_6d": false,
20
+ "norm_method": "all",
21
+ "augmentation": false,
22
+ "generate_length": 88,
23
+ "pre_pose_length": 0,
24
+ "pose_dim": 99,
25
+ "expression": true
26
+ },
27
+ "aud": {
28
+ "feat_method": "mfcc",
29
+ "aud_feat_dim": 64,
30
+ "aud_feat_win_size": null,
31
+ "context_info": false
32
+ }
33
+ },
34
+ "Model": {
35
+ "model_type": "body",
36
+ "model_name": "s2g_body_vq",
37
+ "composition": true,
38
+ "code_num": 2048,
39
+ "bh_model": true,
40
+ "AudioOpt": "Adam",
41
+ "encoder_choice": "mfcc",
42
+ "gan": false
43
+ },
44
+ "DataLoader": {
45
+ "batch_size": 128,
46
+ "num_workers": 0
47
+ },
48
+ "Train": {
49
+ "epochs": 100,
50
+ "max_gradient_norm": 5,
51
+ "learning_rate": {
52
+ "generator_learning_rate": 1e-4,
53
+ "discriminator_learning_rate": 1e-4
54
+ }
55
+ },
56
+ "Log": {
57
+ "save_every": 50,
58
+ "print_every": 200,
59
+ "name": "body-vq"
60
+ }
61
+ }
62
+
config/face.json ADDED
@@ -0,0 +1,59 @@
1
+ {
2
+ "config_root_path": "/is/cluster/scratch/hyi/ExpressiveBody/SMPLifyX4/scripts",
3
+ "dataset_load_mode": "json",
4
+ "store_file_path": "store.pkl",
5
+ "smplx_npz_path": "visualise/smplx_model/SMPLX_NEUTRAL_2020.npz",
6
+ "extra_joint_path": "visualise/smplx_model/smplx_extra_joints.yaml",
7
+ "j14_regressor_path": "visualise/smplx_model/SMPLX_to_J14.pkl",
8
+ "param": {
9
+ "w_j": 1,
10
+ "w_b": 1,
11
+ "w_h": 1
12
+ },
13
+ "Data": {
14
+ "data_root": "../ExpressiveWholeBodyDatasetv1.0/",
15
+ "pklname": "_3d_wv2.pkl",
16
+ "whole_video": true,
17
+ "pose": {
18
+ "normalization": false,
19
+ "convert_to_6d": false,
20
+ "norm_method": "all",
21
+ "augmentation": false,
22
+ "generate_length": 88,
23
+ "pre_pose_length": 0,
24
+ "pose_dim": 99,
25
+ "expression": true
26
+ },
27
+ "aud": {
28
+ "feat_method": "mfcc",
29
+ "aud_feat_dim": 64,
30
+ "aud_feat_win_size": null,
31
+ "context_info": false
32
+ }
33
+ },
34
+ "Model": {
35
+ "model_type": "face",
36
+ "model_name": "s2g_face",
37
+ "AudioOpt": "SGD",
38
+ "encoder_choice": "faceformer",
39
+ "gan": false
40
+ },
41
+ "DataLoader": {
42
+ "batch_size": 1,
43
+ "num_workers": 0
44
+ },
45
+ "Train": {
46
+ "epochs": 100,
47
+ "max_gradient_norm": 5,
48
+ "learning_rate": {
49
+ "generator_learning_rate": 1e-4,
50
+ "discriminator_learning_rate": 1e-4
51
+ }
52
+ },
53
+ "Log": {
54
+ "save_every": 50,
55
+ "print_every": 1000,
56
+ "name": "face"
57
+ }
58
+ }
59
+
data_utils/__init__.py ADDED
@@ -0,0 +1,3 @@
1
+ # from .dataloader_csv import MultiVidData as csv_data
2
+ from .dataloader_torch import MultiVidData as torch_data
3
+ from .utils import get_melspec, get_mfcc, get_mfcc_old, get_mfcc_psf, get_mfcc_psf_min, get_mfcc_ta
data_utils/apply_split.py ADDED
@@ -0,0 +1,51 @@
1
+ import os
2
+ from tqdm import tqdm
3
+ import pickle
4
+ import shutil
5
+
6
+ speakers = ['seth', 'oliver', 'conan', 'chemistry']
7
+ source_data_root = "../expressive_body-V0.7"
8
+ data_root = "D:/Downloads/SHOW_dataset_v1.0/ExpressiveWholeBodyDatasetReleaseV1.0"
9
+
10
+ f_read = open('split_more_than_2s.pkl', 'rb')
11
+ f_save = open('none.pkl', 'wb')
12
+ data_split = pickle.load(f_read)
13
+ none_split = []
14
+
15
+ train = val = test = 0
16
+
17
+ for speaker_name in speakers:
18
+ speaker_root = os.path.join(data_root, speaker_name)
19
+
20
+ videos = [v for v in data_split[speaker_name]]
21
+
22
+ for vid in tqdm(videos, desc="Processing training data of {}......".format(speaker_name)):
23
+ for split in data_split[speaker_name][vid]:
24
+ for seq in data_split[speaker_name][vid][split]:
25
+
26
+ seq = seq.replace('\\', '/')
27
+ old_file_path = os.path.join(data_root, speaker_name, vid, seq.split('/')[-1])
28
+ old_file_path = old_file_path.replace('\\', '/')
29
+ new_file_path = seq.replace(source_data_root.split('/')[-1], data_root.split('/')[-1])
30
+ try:
31
+ shutil.move(old_file_path, new_file_path)
32
+ if split == 'train':
33
+ train = train + 1
34
+ elif split == 'test':
35
+ test = test + 1
36
+ elif split == 'val':
37
+ val = val + 1
38
+ except FileNotFoundError:
39
+ none_split.append(old_file_path)
40
+ print(f"The file {old_file_path} does not exists.")
41
+ except shutil.Error:
42
+ none_split.append(old_file_path)
43
+ print(f"The file {old_file_path} does not exists.")
44
+
45
+ print(none_split.__len__())
46
+ pickle.dump(none_split, f_save)
47
+ f_save.close()
48
+
49
+ print(train, val, test)
50
+
51
+
data_utils/axis2matrix.py ADDED
@@ -0,0 +1,29 @@
1
+ import numpy as np
2
+ import math
3
+ import scipy.linalg as linalg
4
+
5
+
6
+ def rotate_mat(axis, radian):
7
+
8
+ a = np.cross(np.eye(3), axis / linalg.norm(axis) * radian)
9
+
10
+ rot_matrix = linalg.expm(a)
11
+
12
+ return rot_matrix
13
+
14
+ def aaa2mat(axis, sin, cos):
15
+ i = np.eye(3)
16
+ nnt = np.dot(axis.T, axis)
17
+ s = np.asarray([[0, -axis[0,2], axis[0,1]],
18
+ [axis[0,2], 0, -axis[0,0]],
19
+ [-axis[0,1], axis[0,0], 0]])
20
+ r = cos * i + (1-cos)*nnt +sin * s
21
+ return r
22
+
23
+ rand_axis = np.asarray([[1,0,0]])
24
+ #旋转角度
25
+ r = math.pi/2
26
+ #返回旋转矩阵
27
+ rot_matrix = rotate_mat(rand_axis, r)
28
+ r2 = aaa2mat(rand_axis, np.sin(r), np.cos(r))
29
+ print(rot_matrix)
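Both constructions above target the same rotation (matrix exponential of the scaled cross-product matrix vs. the explicit Rodrigues formula), so a quick sanity check is to compare them:

```python
# sanity check: the two rotation constructions should agree numerically
import numpy as np
assert np.allclose(rot_matrix, r2, atol=1e-8)
```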
data_utils/consts.py ADDED
The diff for this file is too large to render. See raw diff
 
data_utils/dataloader_torch.py ADDED
@@ -0,0 +1,279 @@
1
+ import sys
2
+ import os
3
+ sys.path.append(os.getcwd())
4
+ import os
5
+ from tqdm import tqdm
6
+ from data_utils.utils import *
7
+ import torch.utils.data as data
8
+ from data_utils.mesh_dataset import SmplxDataset
9
+ from transformers import Wav2Vec2Processor
10
+
11
+
12
+ class MultiVidData():
13
+ def __init__(self,
14
+ data_root,
15
+ speakers,
16
+ split='train',
17
+ limbscaling=False,
18
+ normalization=False,
19
+ norm_method='new',
20
+ split_trans_zero=False,
21
+ num_frames=25,
22
+ num_pre_frames=25,
23
+ num_generate_length=None,
24
+ aud_feat_win_size=None,
25
+ aud_feat_dim=64,
26
+ feat_method='mel_spec',
27
+ context_info=False,
28
+ smplx=False,
29
+ audio_sr=16000,
30
+ convert_to_6d=False,
31
+ expression=False,
32
+ config=None
33
+ ):
34
+ self.data_root = data_root
35
+ self.speakers = speakers
36
+ self.split = split
37
+ if split == 'pre':
38
+ self.split = 'train'
39
+ self.norm_method=norm_method
40
+ self.normalization = normalization
41
+ self.limbscaling = limbscaling
42
+ self.convert_to_6d = convert_to_6d
43
+ self.num_frames=num_frames
44
+ self.num_pre_frames=num_pre_frames
45
+ if num_generate_length is None:
46
+ self.num_generate_length = num_frames
47
+ else:
48
+ self.num_generate_length = num_generate_length
49
+ self.split_trans_zero=split_trans_zero
50
+
51
+ dataset = SmplxDataset
52
+
53
+ if self.split_trans_zero:
54
+ self.trans_dataset_list = []
55
+ self.zero_dataset_list = []
56
+ else:
57
+ self.all_dataset_list = []
58
+ self.dataset={}
59
+ self.complete_data=[]
60
+ self.config=config
61
+ load_mode=self.config.dataset_load_mode
62
+
63
+ ######################load with pickle file
64
+ if load_mode=='pickle':
65
+ import pickle
66
+ import subprocess
67
+
68
+ # store_file_path='/tmp/store.pkl'
69
+ # cp /is/cluster/scratch/hyi/ExpressiveBody/SMPLifyX4/scripts/store.pkl /tmp/store.pkl
70
+ # subprocess.run(f'cp /is/cluster/scratch/hyi/ExpressiveBody/SMPLifyX4/scripts/store.pkl {store_file_path}',shell=True)
71
+
72
+ # f = open(self.config.store_file_path, 'rb+')
73
+ f = open(self.split+config.Data.pklname, 'rb+')
74
+ self.dataset=pickle.load(f)
75
+ f.close()
76
+ for key in self.dataset:
77
+ self.complete_data.append(self.dataset[key].complete_data)
78
+ ######################load with pickle file
79
+
80
+ ######################load with a csv file
81
+ elif load_mode=='csv':
82
+
83
+ # imported here from one of my code folders; to be integrated properly later
84
+ try:
85
+ sys.path.append(self.config.config_root_path)
86
+ from config import config_path
87
+ from csv_parser import csv_parse
88
+
89
+ except ImportError as e:
90
+ print(f'err: {e}')
91
+ raise ImportError('config root path error...')
92
+
93
+
94
+ for speaker_name in self.speakers:
95
+ # df_intervals=pd.read_csv(self.config.voca_csv_file_path)
96
+ df_intervals=None
97
+ df_intervals=df_intervals[df_intervals['speaker']==speaker_name]
98
+ df_intervals = df_intervals[df_intervals['dataset'] == self.split]
99
+
100
+ print(f'speaker {speaker_name} train interval length: {len(df_intervals)}')
101
+ for iter_index, (_, interval) in tqdm(
102
+ (enumerate(df_intervals.iterrows())),desc=f'load {speaker_name}'
103
+ ):
104
+
105
+ (
106
+ interval_index,
107
+ interval_speaker,
108
+ interval_video_fn,
109
+ interval_id,
110
+
111
+ start_time,
112
+ end_time,
113
+ duration_time,
114
+ start_time_10,
115
+ over_flow_flag,
116
+ short_dur_flag,
117
+
118
+ big_video_dir,
119
+ small_video_dir_name,
120
+ speaker_video_path,
121
+
122
+ voca_basename,
123
+ json_basename,
124
+ wav_basename,
125
+ voca_top_clip_path,
126
+ voca_json_clip_path,
127
+ voca_wav_clip_path,
128
+
129
+ audio_output_fn,
130
+ image_output_path,
131
+ pifpaf_output_path,
132
+ mp_output_path,
133
+ op_output_path,
134
+ deca_output_path,
135
+ pixie_output_path,
136
+ cam_output_path,
137
+ ours_output_path,
138
+ merge_output_path,
139
+ multi_output_path,
140
+ gt_output_path,
141
+ ours_images_path,
142
+ pkl_fil_path,
143
+ )=csv_parse(interval)
144
+
145
+ if not os.path.exists(pkl_fil_path) or not os.path.exists(audio_output_fn):
146
+ continue
147
+
148
+ key=f'{interval_video_fn}/{small_video_dir_name}'
149
+ self.dataset[key] = dataset(
150
+ data_root=pkl_fil_path,
151
+ speaker=speaker_name,
152
+ audio_fn=audio_output_fn,
153
+ audio_sr=audio_sr,
154
+ fps=num_frames,
155
+ feat_method=feat_method,
156
+ audio_feat_dim=aud_feat_dim,
157
+ train=(self.split == 'train'),
158
+ load_all=True,
159
+ split_trans_zero=self.split_trans_zero,
160
+ limbscaling=self.limbscaling,
161
+ num_frames=self.num_frames,
162
+ num_pre_frames=self.num_pre_frames,
163
+ num_generate_length=self.num_generate_length,
164
+ audio_feat_win_size=aud_feat_win_size,
165
+ context_info=context_info,
166
+ convert_to_6d=convert_to_6d,
167
+ expression=expression,
168
+ config=self.config
169
+ )
170
+ self.complete_data.append(self.dataset[key].complete_data)
171
+ ######################load with a csv file
172
+
173
+ ######################origin load method
174
+ elif load_mode=='json':
175
+
176
+ # if self.split == 'train':
177
+ # import pickle
178
+ # f = open('store.pkl', 'rb+')
179
+ # self.dataset=pickle.load(f)
180
+ # f.close()
181
+ # for key in self.dataset:
182
+ # self.complete_data.append(self.dataset[key].complete_data)
183
+ # else:https://pytorch-tutorial-assets.s3.amazonaws.com/VOiCES_devkit/source-16k/train/sp0307/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav
184
+ # if config.Model.model_type == 'face':
185
+ am = Wav2Vec2Processor.from_pretrained("vitouphy/wav2vec2-xls-r-300m-phoneme")
186
+ am_sr = 16000
187
+ # else:
188
+ # am, am_sr = None, None
189
+ for speaker_name in self.speakers:
190
+ speaker_root = os.path.join(self.data_root, speaker_name)
191
+
192
+ videos=[v for v in os.listdir(speaker_root) ]
193
+ print(videos)
194
+
195
+ haode = huaide = 0
196
+
197
+ for vid in tqdm(videos, desc="Processing training data of {}......".format(speaker_name)):
198
+ source_vid=vid
199
+ # vid_pth=os.path.join(speaker_root, source_vid, 'images/half', self.split)
200
+ vid_pth = os.path.join(speaker_root, source_vid, self.split)
201
+ if smplx == 'pose':
202
+ seqs = [s for s in os.listdir(vid_pth) if (s.startswith('clip'))]
203
+ else:
204
+ try:
205
+ seqs = [s for s in os.listdir(vid_pth)]
206
+ except:
207
+ continue
208
+
209
+ for s in seqs:
210
+ seq_root=os.path.join(vid_pth, s)
211
+ key = seq_root # correspond to clip******
212
+ audio_fname = os.path.join(speaker_root, source_vid, self.split, s, '%s.wav' % (s))
213
+ motion_fname = os.path.join(speaker_root, source_vid, self.split, s, '%s.pkl' % (s))
214
+ if not os.path.isfile(audio_fname) or not os.path.isfile(motion_fname):
215
+ huaide = huaide + 1
216
+ continue
217
+
218
+ self.dataset[key]=dataset(
219
+ data_root=seq_root,
220
+ speaker=speaker_name,
221
+ motion_fn=motion_fname,
222
+ audio_fn=audio_fname,
223
+ audio_sr=audio_sr,
224
+ fps=num_frames,
225
+ feat_method=feat_method,
226
+ audio_feat_dim=aud_feat_dim,
227
+ train=(self.split=='train'),
228
+ load_all=True,
229
+ split_trans_zero=self.split_trans_zero,
230
+ limbscaling=self.limbscaling,
231
+ num_frames=self.num_frames,
232
+ num_pre_frames=self.num_pre_frames,
233
+ num_generate_length=self.num_generate_length,
234
+ audio_feat_win_size=aud_feat_win_size,
235
+ context_info=context_info,
236
+ convert_to_6d=convert_to_6d,
237
+ expression=expression,
238
+ config=self.config,
239
+ am=am,
240
+ am_sr=am_sr,
241
+ whole_video=config.Data.whole_video
242
+ )
243
+ self.complete_data.append(self.dataset[key].complete_data)
244
+ haode = haode + 1
245
+ print("huaide:{}, haode:{}".format(huaide, haode))
246
+ import pickle
247
+
248
+ f = open(self.split+config.Data.pklname, 'wb')
249
+ pickle.dump(self.dataset, f)
250
+ f.close()
251
+ ######################origin load method
252
+
253
+ self.complete_data=np.concatenate(self.complete_data, axis=0)
254
+
255
+ # assert self.complete_data.shape[-1] == (12+21+21)*2
256
+ self.normalize_stats = {}
257
+
258
+ self.data_mean = None
259
+ self.data_std = None
260
+
261
+ def get_dataset(self):
262
+ self.normalize_stats['mean'] = self.data_mean
263
+ self.normalize_stats['std'] = self.data_std
264
+
265
+ for key in list(self.dataset.keys()):
266
+ if self.dataset[key].complete_data.shape[0] < self.num_generate_length:
267
+ continue
268
+ self.dataset[key].num_generate_length = self.num_generate_length
269
+ self.dataset[key].get_dataset(self.normalization, self.normalize_stats, self.split)
270
+ self.all_dataset_list.append(self.dataset[key].all_dataset)
271
+
272
+ if self.split_trans_zero:
273
+ self.trans_dataset = data.ConcatDataset(self.trans_dataset_list)
274
+ self.zero_dataset = data.ConcatDataset(self.zero_dataset_list)
275
+ else:
276
+ self.all_dataset = data.ConcatDataset(self.all_dataset_list)
277
+
278
+
279
+
data_utils/dataset_preprocess.py ADDED
@@ -0,0 +1,170 @@
1
+ import os
2
+ import pickle
3
+ from tqdm import tqdm
4
+ import shutil
5
+ import torch
6
+ import numpy as np
7
+ import librosa
8
+ import random
9
+
10
+ speakers = ['seth', 'conan', 'oliver', 'chemistry']
11
+ data_root = "../ExpressiveWholeBodyDatasetv1.0/"
12
+ split = 'train'
13
+
14
+
15
+
16
+ def split_list(full_list,shuffle=False,ratio=0.2):
17
+ n_total = len(full_list)
18
+ offset_0 = int(n_total * ratio)
19
+ offset_1 = int(n_total * ratio * 2)
20
+ if n_total==0 or offset_1<1:
21
+ return [],full_list
22
+ if shuffle:
23
+ random.shuffle(full_list)
24
+ sublist_0 = full_list[:offset_0]
25
+ sublist_1 = full_list[offset_0:offset_1]
26
+ sublist_2 = full_list[offset_1:]
27
+ return sublist_0, sublist_1, sublist_2
28
+
29
+
30
+ def moveto(list, file):
31
+ for f in list:
32
+ before, after = '/'.join(f.split('/')[:-1]), f.split('/')[-1]
33
+ new_path = os.path.join(before, file)
34
+ new_path = os.path.join(new_path, after)
35
+ # os.makedirs(new_path)
36
+ # os.path.isdir(new_path)
37
+ # shutil.move(f, new_path)
38
+
39
+ #转移到新目录
40
+ shutil.copytree(f, new_path)
41
+ #删除原train里的文件
42
+ shutil.rmtree(f)
43
+ return None
44
+
45
+
46
+ def read_pkl(data):
47
+ betas = np.array(data['betas'])
48
+
49
+ jaw_pose = np.array(data['jaw_pose'])
50
+ leye_pose = np.array(data['leye_pose'])
51
+ reye_pose = np.array(data['reye_pose'])
52
+ global_orient = np.array(data['global_orient']).squeeze()
53
+ body_pose = np.array(data['body_pose_axis'])
54
+ left_hand_pose = np.array(data['left_hand_pose'])
55
+ right_hand_pose = np.array(data['right_hand_pose'])
56
+
57
+ full_body = np.concatenate(
58
+ (jaw_pose, leye_pose, reye_pose, global_orient, body_pose, left_hand_pose, right_hand_pose), axis=1)
59
+
60
+ expression = np.array(data['expression'])
61
+ full_body = np.concatenate((full_body, expression), axis=1)
62
+
63
+ if (full_body.shape[0] < 90) or (torch.isnan(torch.from_numpy(full_body)).sum() > 0):
64
+ return 1
65
+ else:
66
+ return 0
67
+
68
+
69
+ for speaker_name in speakers:
70
+ speaker_root = os.path.join(data_root, speaker_name)
71
+
72
+ videos = [v for v in os.listdir(speaker_root)]
73
+ print(videos)
74
+
75
+ haode = huaide = 0
76
+ total_seqs = []
77
+
78
+ for vid in tqdm(videos, desc="Processing training data of {}......".format(speaker_name)):
79
+ # for vid in videos:
80
+ source_vid = vid
81
+ vid_pth = os.path.join(speaker_root, source_vid)
82
+ # vid_pth = os.path.join(speaker_root, source_vid, 'images/half', split)
83
+ t = os.path.join(speaker_root, source_vid, 'test')
84
+ v = os.path.join(speaker_root, source_vid, 'val')
85
+
86
+ # if os.path.exists(t):
87
+ # shutil.rmtree(t)
88
+ # if os.path.exists(v):
89
+ # shutil.rmtree(v)
90
+ try:
91
+ seqs = [s for s in os.listdir(vid_pth)]
92
+ except:
93
+ continue
94
+ # if len(seqs) == 0:
95
+ # shutil.rmtree(os.path.join(speaker_root, source_vid))
96
+ # None
97
+ for s in seqs:
98
+ quality = 0
99
+ total_seqs.append(os.path.join(vid_pth,s))
100
+ seq_root = os.path.join(vid_pth, s)
101
+ key = seq_root # correspond to clip******
102
+ audio_fname = os.path.join(speaker_root, source_vid, s, '%s.wav' % (s))
103
+
104
+ # delete the data without audio or the audio file could not be read
105
+ if os.path.isfile(audio_fname):
106
+ try:
107
+ audio = librosa.load(audio_fname)
108
+ except:
109
+ # print(key)
110
+ shutil.rmtree(key)
111
+ huaide = huaide + 1
112
+ continue
113
+ else:
114
+ huaide = huaide + 1
115
+ # print(key)
116
+ shutil.rmtree(key)
117
+ continue
118
+
119
+ # check motion file
120
+ motion_fname = os.path.join(speaker_root, source_vid, s, '%s.pkl' % (s))
121
+ try:
122
+ f = open(motion_fname, 'rb+')
123
+ except:
124
+ shutil.rmtree(key)
125
+ huaide = huaide + 1
126
+ continue
127
+
128
+ data = pickle.load(f)
129
+ w = read_pkl(data)
130
+ f.close()
131
+ quality = quality + w
132
+
133
+ if w == 1:
134
+ shutil.rmtree(key)
135
+ # print(key)
136
+ huaide = huaide + 1
137
+ continue
138
+
139
+ haode = haode + 1
140
+
141
+ print("huaide:{}, haode:{}, total_seqs:{}".format(huaide, haode, total_seqs.__len__()))
142
+
143
+ for speaker_name in speakers:
144
+ speaker_root = os.path.join(data_root, speaker_name)
145
+
146
+ videos = [v for v in os.listdir(speaker_root)]
147
+ print(videos)
148
+
149
+ haode = huaide = 0
150
+ total_seqs = []
151
+
152
+ for vid in tqdm(videos, desc="Processing training data of {}......".format(speaker_name)):
153
+ # for vid in videos:
154
+ source_vid = vid
155
+ vid_pth = os.path.join(speaker_root, source_vid)
156
+ try:
157
+ seqs = [s for s in os.listdir(vid_pth)]
158
+ except:
159
+ continue
160
+ for s in seqs:
161
+ quality = 0
162
+ total_seqs.append(os.path.join(vid_pth, s))
163
+ print("total_seqs:{}".format(total_seqs.__len__()))
164
+ # split the dataset
165
+ test_list, val_list, train_list = split_list(total_seqs, True, 0.1)
166
+ print(len(test_list), len(val_list), len(train_list))
167
+ moveto(train_list, 'train')
168
+ moveto(test_list, 'test')
169
+ moveto(val_list, 'val')
170
+
data_utils/get_j.py ADDED
@@ -0,0 +1,51 @@
1
+ import torch
2
+
3
+
4
+ def to3d(poses, config):
5
+ if config.Data.pose.convert_to_6d:
6
+ if config.Data.pose.expression:
7
+ poses_exp = poses[:, -100:]
8
+ poses = poses[:, :-100]
9
+
10
+ poses = poses.reshape(poses.shape[0], -1, 5)
11
+ sin, cos = poses[:, :, 3], poses[:, :, 4]
12
+ pose_angle = torch.atan2(sin, cos)
13
+ poses = (poses[:, :, :3] * pose_angle.unsqueeze(dim=-1)).reshape(poses.shape[0], -1)
14
+
15
+ if config.Data.pose.expression:
16
+ poses = torch.cat([poses, poses_exp], dim=-1)
17
+ return poses
18
+
19
+
20
+ def get_joint(smplx_model, betas, pred):
21
+ joint = smplx_model(betas=betas.repeat(pred.shape[0], 1),
22
+ expression=pred[:, 165:265],
23
+ jaw_pose=pred[:, 0:3],
24
+ leye_pose=pred[:, 3:6],
25
+ reye_pose=pred[:, 6:9],
26
+ global_orient=pred[:, 9:12],
27
+ body_pose=pred[:, 12:75],
28
+ left_hand_pose=pred[:, 75:120],
29
+ right_hand_pose=pred[:, 120:165],
30
+ return_verts=True)['joints']
31
+ return joint
32
+
33
+
34
+ def get_joints(smplx_model, betas, pred):
35
+ if len(pred.shape) == 3:
36
+ B = pred.shape[0]
37
+ x = 4 if B>= 4 else B
38
+ T = pred.shape[1]
39
+ pred = pred.reshape(-1, 265)
40
+ smplx_model.batch_size = L = T * x
41
+
42
+ times = pred.shape[0] // smplx_model.batch_size
43
+ joints = []
44
+ for i in range(times):
45
+ joints.append(get_joint(smplx_model, betas, pred[i*L:(i+1)*L]))
46
+ joints = torch.cat(joints, dim=0)
47
+ joints = joints.reshape(B, T, -1, 3)
48
+ else:
49
+ smplx_model.batch_size = pred.shape[0]
50
+ joints = get_joint(smplx_model, betas, pred)
51
+ return joints
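A hedged sketch of calling `get_joints` with an SMPL-X body model; the `smplx.create` arguments and the betas size are assumptions (they depend on how the rest of the repo instantiates the model), not something this commit pins down:

```python
# sketch only: shapes follow the slicing inside get_joint()
# (165 pose dims + 100 expression dims per frame)
import torch
import smplx
from data_utils.get_j import get_joints

body_model = smplx.create('visualise/smplx_model', model_type='smplx',
                          use_pca=False, num_expression_coeffs=100)  # assumed settings
betas = torch.zeros(1, 10)                      # default num_betas; checkpoints may differ
pred = torch.zeros(2, 88, 265)                  # (batch, frames, 165 + 100)
joints = get_joints(body_model, betas, pred)    # -> (2, 88, num_joints, 3)
```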
data_utils/hand_component.json ADDED
The diff for this file is too large to render. See raw diff
 
data_utils/lower_body.py ADDED
@@ -0,0 +1,143 @@
1
+ import numpy as np
2
+ import torch
3
+
4
+ lower_pose = torch.tensor(
5
+ [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0747, -0.0158, -0.0152, -1.1826512813568115, 0.23866955935955048,
6
+ 0.15146760642528534, -1.2604516744613647, -0.3160211145877838,
7
+ -0.1603458970785141, 1.1654603481292725, 0.0, 0.0, 1.2521806955337524, 0.041598282754421234, -0.06312154978513718,
8
+ 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
9
+ lower_pose_stand = torch.tensor([
10
+ 8.9759e-04, 7.1074e-04, -5.9163e-06, 8.9759e-04, 7.1074e-04, -5.9163e-06,
11
+ 3.0747, -0.0158, -0.0152,
12
+ -3.6665e-01, -8.8455e-03, 1.6113e-01, -3.6665e-01, -8.8455e-03, 1.6113e-01,
13
+ -3.9716e-01, -4.0229e-02, -1.2637e-01,
14
+ 7.9163e-01, 6.8519e-02, -1.5091e-01, 7.9163e-01, 6.8519e-02, -1.5091e-01,
15
+ 7.8632e-01, -4.3810e-02, 1.4375e-02,
16
+ -1.0675e-01, 1.2635e-01, 1.6711e-02, -1.0675e-01, 1.2635e-01, 1.6711e-02, ])
17
+ # lower_pose_stand = torch.tensor(
18
+ # [6.4919e-02, 3.3018e-02, 1.7485e-02, 8.9759e-04, 7.1074e-04, -5.9163e-06,
19
+ # 3.0747, -0.0158, -0.0152,
20
+ # -3.3633e+00, -9.3915e-02, 3.0996e-01, -3.6665e-01, -8.8455e-03, 1.6113e-01,
21
+ # 1.1654603481292725, 0.0, 0.0,
22
+ # 4.4167e-01, 6.7183e-03, -3.6379e-03, 7.9163e-01, 6.8519e-02, -1.5091e-01,
23
+ # 0.0, 0.0, 0.0,
24
+ # 2.2910e-02, -2.4797e-02, -5.5657e-03, -1.0675e-01, 1.2635e-01, 1.6711e-02,])
25
+ lower_body = [0, 1, 3, 4, 6, 7, 9, 10]
26
+ count_part = [6, 9, 12, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
27
+ 29, 30, 31, 32, 33, 34, 35, 36, 37,
28
+ 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54]
29
+ fix_index = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
30
+ 29,
31
+ 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
32
+ 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,
33
+ 65, 66, 67, 68, 69, 70, 71, 72, 73, 74]
34
+ all_index = np.ones(275)
35
+ all_index[fix_index] = 0
36
+ c_index = []
37
+ i = 0
38
+ for num in all_index:
39
+ if num == 1:
40
+ c_index.append(i)
41
+ i = i + 1
42
+ c_index = np.asarray(c_index)
43
+
44
+ fix_index_3d = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
45
+ 21, 22, 23, 24, 25, 26,
46
+ 30, 31, 32, 33, 34, 35,
47
+ 45, 46, 47, 48, 49, 50]
48
+ all_index_3d = np.ones(165)
49
+ all_index_3d[fix_index_3d] = 0
50
+ c_index_3d = []
51
+ i = 0
52
+ for num in all_index_3d:
53
+ if num == 1:
54
+ c_index_3d.append(i)
55
+ i = i + 1
56
+ c_index_3d = np.asarray(c_index_3d)
57
+
58
+ c_index_6d = []
59
+ i = 0
60
+ for num in all_index_3d:
61
+ if num == 1:
62
+ c_index_6d.append(2*i)
63
+ c_index_6d.append(2 * i + 1)
64
+ i = i + 1
65
+ c_index_6d = np.asarray(c_index_6d)
66
+
67
+
68
+ def part2full(input, stand=False):
69
+ if stand:
70
+ # lp = lower_pose_stand.unsqueeze(dim=0).repeat(input.shape[0], 1).to(input.device)
71
+ lp = torch.zeros_like(lower_pose)
72
+ lp[6:9] = torch.tensor([3.0747, -0.0158, -0.0152])
73
+ lp = lp.unsqueeze(dim=0).repeat(input.shape[0], 1).to(input.device)
74
+ else:
75
+ lp = lower_pose.unsqueeze(dim=0).repeat(input.shape[0], 1).to(input.device)
76
+
77
+ input = torch.cat([input[:, :3],
78
+ lp[:, :15],
79
+ input[:, 3:6],
80
+ lp[:, 15:21],
81
+ input[:, 6:9],
82
+ lp[:, 21:27],
83
+ input[:, 9:12],
84
+ lp[:, 27:],
85
+ input[:, 12:]]
86
+ , dim=1)
87
+ return input
88
+
89
+
90
+ def pred2poses(input, gt):
91
+ input = torch.cat([input[:, :3],
92
+ gt[0:1, 3:18].repeat(input.shape[0], 1),
93
+ input[:, 3:6],
94
+ gt[0:1, 21:27].repeat(input.shape[0], 1),
95
+ input[:, 6:9],
96
+ gt[0:1, 30:36].repeat(input.shape[0], 1),
97
+ input[:, 9:12],
98
+ gt[0:1, 39:45].repeat(input.shape[0], 1),
99
+ input[:, 12:]]
100
+ , dim=1)
101
+ return input
102
+
103
+
104
+ def poses2poses(input, gt):
105
+ input = torch.cat([input[:, :3],
106
+ gt[0:1, 3:18].repeat(input.shape[0], 1),
107
+ input[:, 18:21],
108
+ gt[0:1, 21:27].repeat(input.shape[0], 1),
109
+ input[:, 27:30],
110
+ gt[0:1, 30:36].repeat(input.shape[0], 1),
111
+ input[:, 36:39],
112
+ gt[0:1, 39:45].repeat(input.shape[0], 1),
113
+ input[:, 45:]]
114
+ , dim=1)
115
+ return input
116
+
117
+ def poses2pred(input, stand=False):
118
+ if stand:
119
+ lp = lower_pose_stand.unsqueeze(dim=0).repeat(input.shape[0], 1).to(input.device)
120
+ # lp = torch.zeros_like(lower_pose).unsqueeze(dim=0).repeat(input.shape[0], 1).to(input.device)
121
+ else:
122
+ lp = lower_pose.unsqueeze(dim=0).repeat(input.shape[0], 1).to(input.device)
123
+ input = torch.cat([input[:, :3],
124
+ lp[:, :15],
125
+ input[:, 18:21],
126
+ lp[:, 15:21],
127
+ input[:, 27:30],
128
+ lp[:, 21:27],
129
+ input[:, 36:39],
130
+ lp[:, 27:],
131
+ input[:, 45:]]
132
+ , dim=1)
133
+ return input
134
+
135
+
136
+ rearrange = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]\
137
+ # ,22, 23, 24, 25, 40, 26, 41,
138
+ # 27, 42, 28, 43, 29, 44, 30, 45, 31, 46, 32, 47, 33, 48, 34, 49, 35, 50, 36, 51, 37, 52, 38, 53, 39, 54, 55,
139
+ # 57, 56, 59, 58, 60, 63, 61, 64, 62, 65, 66, 71, 67, 72, 68, 73, 69, 74, 70, 75]
140
+
141
+ symmetry = [0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1]#, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
142
+ # 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
143
+ # 1, 1, 1, 1, 1, 1]
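The `c_index*` arrays built above select the trainable (non-lower-body) channels out of the full pose vector; a small sketch of applying that selection (the 165-dim axis-angle layout is taken from this commit, the tensors are placeholders):

```python
# sketch: keep only the channels not fixed by fix_index_3d
import numpy as np
from data_utils.lower_body import c_index_3d

full_pose = np.zeros((10, 165))          # 10 frames of 55 joints x 3 axis-angle values
upper_pose = full_pose[:, c_index_3d]    # lower-body / global channels removed
print(upper_pose.shape)                  # -> (10, 129) given the 36 fixed indices above
```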
data_utils/mesh_dataset.py ADDED
@@ -0,0 +1,348 @@
1
+ import pickle
2
+ import sys
3
+ import os
4
+
5
+ sys.path.append(os.getcwd())
6
+
7
+ import json
8
+ from glob import glob
9
+ from data_utils.utils import *
10
+ import torch.utils.data as data
11
+ from data_utils.consts import speaker_id
12
+ from data_utils.lower_body import count_part
13
+ import random
14
+ from data_utils.rotation_conversion import axis_angle_to_matrix, matrix_to_rotation_6d
15
+
16
+ with open('data_utils/hand_component.json') as file_obj:
17
+ comp = json.load(file_obj)
18
+ left_hand_c = np.asarray(comp['left'])
19
+ right_hand_c = np.asarray(comp['right'])
20
+
21
+
22
+ def to3d(data):
23
+ left_hand_pose = np.einsum('bi,ij->bj', data[:, 75:87], left_hand_c[:12, :])
24
+ right_hand_pose = np.einsum('bi,ij->bj', data[:, 87:99], right_hand_c[:12, :])
25
+ data = np.concatenate((data[:, :75], left_hand_pose, right_hand_pose), axis=-1)
26
+ return data
27
+
28
+
29
+ class SmplxDataset():
30
+ '''
31
+ creat a dataset for every segment and concat.
32
+ '''
33
+
34
+ def __init__(self,
35
+ data_root,
36
+ speaker,
37
+ motion_fn,
38
+ audio_fn,
39
+ audio_sr,
40
+ fps,
41
+ feat_method='mel_spec',
42
+ audio_feat_dim=64,
43
+ audio_feat_win_size=None,
44
+
45
+ train=True,
46
+ load_all=False,
47
+ split_trans_zero=False,
48
+ limbscaling=False,
49
+ num_frames=25,
50
+ num_pre_frames=25,
51
+ num_generate_length=25,
52
+ context_info=False,
53
+ convert_to_6d=False,
54
+ expression=False,
55
+ config=None,
56
+ am=None,
57
+ am_sr=None,
58
+ whole_video=False
59
+ ):
60
+
61
+ self.data_root = data_root
62
+ self.speaker = speaker
63
+
64
+ self.feat_method = feat_method
65
+ self.audio_fn = audio_fn
66
+ self.audio_sr = audio_sr
67
+ self.fps = fps
68
+ self.audio_feat_dim = audio_feat_dim
69
+ self.audio_feat_win_size = audio_feat_win_size
70
+ self.context_info = context_info # for aud feat
71
+ self.convert_to_6d = convert_to_6d
72
+ self.expression = expression
73
+
74
+ self.train = train
75
+ self.load_all = load_all
76
+ self.split_trans_zero = split_trans_zero
77
+ self.limbscaling = limbscaling
78
+ self.num_frames = num_frames
79
+ self.num_pre_frames = num_pre_frames
80
+ self.num_generate_length = num_generate_length
81
+ # print('num_generate_length ', self.num_generate_length)
82
+
83
+ self.config = config
84
+ self.am_sr = am_sr
85
+ self.whole_video = whole_video
86
+ load_mode = self.config.dataset_load_mode
87
+
88
+ if load_mode == 'pickle':
89
+ raise NotImplementedError
90
+
91
+ elif load_mode == 'csv':
92
+ import pickle
93
+ with open(data_root, 'rb') as f:
94
+ u = pickle._Unpickler(f)
95
+ data = u.load()
96
+ self.data = data[0]
97
+ if self.load_all:
98
+ self._load_npz_all()
99
+
100
+ elif load_mode == 'json':
101
+ self.annotations = glob(data_root + '/*pkl')
102
+ if len(self.annotations) == 0:
103
+ raise FileNotFoundError(data_root + ' are empty')
104
+ self.annotations = sorted(self.annotations)
105
+ self.img_name_list = self.annotations
106
+
107
+ if self.load_all:
108
+ self._load_them_all(am, am_sr, motion_fn)
109
+
110
+ def _load_npz_all(self):
111
+ self.loaded_data = {}
112
+ self.complete_data = []
113
+ data = self.data
114
+ shape = data['body_pose_axis'].shape[0]
115
+ self.betas = data['betas']
116
+ self.img_name_list = []
117
+ for index in range(shape):
118
+ img_name = f'{index:6d}'
119
+ self.img_name_list.append(img_name)
120
+
121
+ jaw_pose = data['jaw_pose'][index]
122
+ leye_pose = data['leye_pose'][index]
123
+ reye_pose = data['reye_pose'][index]
124
+ global_orient = data['global_orient'][index]
125
+ body_pose = data['body_pose_axis'][index]
126
+ left_hand_pose = data['left_hand_pose'][index]
127
+ right_hand_pose = data['right_hand_pose'][index]
128
+
129
+ full_body = np.concatenate(
130
+ (jaw_pose, leye_pose, reye_pose, global_orient, body_pose, left_hand_pose, right_hand_pose))
131
+ assert full_body.shape[0] == 99
132
+ if self.convert_to_6d:
133
+ full_body = to3d(full_body)
134
+ full_body = torch.from_numpy(full_body)
135
+ full_body = matrix_to_rotation_6d(axis_angle_to_matrix(full_body))
136
+ full_body = np.asarray(full_body)
137
+ if self.expression:
138
+ expression = data['expression'][index]
139
+ full_body = np.concatenate((full_body, expression))
140
+ # full_body = np.concatenate((full_body, non_zero))
141
+ else:
142
+ full_body = to3d(full_body)
143
+ if self.expression:
144
+ expression = data['expression'][index]
145
+ full_body = np.concatenate((full_body, expression))
146
+
147
+ self.loaded_data[img_name] = full_body.reshape(-1)
148
+ self.complete_data.append(full_body.reshape(-1))
149
+
150
+ self.complete_data = np.array(self.complete_data)
151
+
152
+ if self.audio_feat_win_size is not None:
153
+ self.audio_feat = get_mfcc_old(self.audio_fn).transpose(1, 0)
154
+ # print(self.audio_feat.shape)
155
+ else:
156
+ if self.feat_method == 'mel_spec':
157
+ self.audio_feat = get_melspec(self.audio_fn, fps=self.fps, sr=self.audio_sr, n_mels=self.audio_feat_dim)
158
+ elif self.feat_method == 'mfcc':
159
+ self.audio_feat = get_mfcc(self.audio_fn,
160
+ smlpx=True,
161
+ sr=self.audio_sr,
162
+ n_mfcc=self.audio_feat_dim,
163
+ win_size=self.audio_feat_win_size
164
+ )
165
+
166
+ def _load_them_all(self, am, am_sr, motion_fn):
167
+ self.loaded_data = {}
168
+ self.complete_data = []
169
+ f = open(motion_fn, 'rb+')
170
+ data = pickle.load(f)
171
+
172
+ self.betas = np.array(data['betas'])
173
+
174
+ jaw_pose = np.array(data['jaw_pose'])
175
+ leye_pose = np.array(data['leye_pose'])
176
+ reye_pose = np.array(data['reye_pose'])
177
+ global_orient = np.array(data['global_orient']).squeeze()
178
+ body_pose = np.array(data['body_pose_axis'])
179
+ left_hand_pose = np.array(data['left_hand_pose'])
180
+ right_hand_pose = np.array(data['right_hand_pose'])
181
+
182
+ full_body = np.concatenate(
183
+ (jaw_pose, leye_pose, reye_pose, global_orient, body_pose, left_hand_pose, right_hand_pose), axis=1)
184
+ assert full_body.shape[1] == 99
185
+
186
+
187
+ if self.convert_to_6d:
188
+ full_body = to3d(full_body)
189
+ full_body = torch.from_numpy(full_body)
190
+ full_body = matrix_to_rotation_6d(axis_angle_to_matrix(full_body.reshape(-1, 55, 3))).reshape(-1, 330)
191
+ full_body = np.asarray(full_body)
192
+ if self.expression:
193
+ expression = np.array(data['expression'])
194
+ full_body = np.concatenate((full_body, expression), axis=1)
195
+
196
+ else:
197
+ full_body = to3d(full_body)
198
+ expression = np.array(data['expression'])
199
+ full_body = np.concatenate((full_body, expression), axis=1)
200
+
201
+ self.complete_data = full_body
202
+ self.complete_data = np.array(self.complete_data)
203
+
204
+ if self.audio_feat_win_size is not None:
205
+ self.audio_feat = get_mfcc_old(self.audio_fn).transpose(1, 0)
206
+ else:
207
+ # if self.feat_method == 'mel_spec':
208
+ # self.audio_feat = get_melspec(self.audio_fn, fps=self.fps, sr=self.audio_sr, n_mels=self.audio_feat_dim)
209
+ # elif self.feat_method == 'mfcc':
210
+ self.audio_feat = get_mfcc_ta(self.audio_fn,
211
+ smlpx=True,
212
+ fps=30,
213
+ sr=self.audio_sr,
214
+ n_mfcc=self.audio_feat_dim,
215
+ win_size=self.audio_feat_win_size,
216
+ type=self.feat_method,
217
+ am=am,
218
+ am_sr=am_sr,
219
+ encoder_choice=self.config.Model.encoder_choice,
220
+ )
221
+ # with open(audio_file, 'w', encoding='utf-8') as file:
222
+ # file.write(json.dumps(self.audio_feat.__array__().tolist(), indent=0, ensure_ascii=False))
223
+
224
+ def get_dataset(self, normalization=False, normalize_stats=None, split='train'):
225
+
226
+ class __Worker__(data.Dataset):
227
+ def __init__(child, index_list, normalization, normalize_stats, split='train') -> None:
228
+ super().__init__()
229
+ child.index_list = index_list
230
+ child.normalization = normalization
231
+ child.normalize_stats = normalize_stats
232
+ child.split = split
233
+
234
+ def __getitem__(child, index):
235
+ num_generate_length = self.num_generate_length
236
+ num_pre_frames = self.num_pre_frames
237
+ seq_len = num_generate_length + num_pre_frames
238
+ # print(num_generate_length)
239
+
240
+ index = child.index_list[index]
241
+ index_new = index + random.randrange(0, 5, 3)
242
+ if index_new + seq_len > self.complete_data.shape[0]:
243
+ index_new = index
244
+ index = index_new
245
+
246
+ if child.split in ['val', 'pre', 'test'] or self.whole_video:
247
+ index = 0
248
+ seq_len = self.complete_data.shape[0]
249
+ seq_data = []
250
+ assert index + seq_len <= self.complete_data.shape[0]
251
+ # print(seq_len)
252
+ seq_data = self.complete_data[index:(index + seq_len), :]
253
+ seq_data = np.array(seq_data)
254
+
255
+ '''
256
+ audio feature,
257
+ '''
258
+ if not self.context_info:
259
+ if not self.whole_video:
260
+ audio_feat = self.audio_feat[index:index + seq_len, ...]
261
+ if audio_feat.shape[0] < seq_len:
262
+ audio_feat = np.pad(audio_feat, [[0, seq_len - audio_feat.shape[0]], [0, 0]],
263
+ mode='reflect')
264
+
265
+ assert audio_feat.shape[0] == seq_len and audio_feat.shape[1] == self.audio_feat_dim
266
+ else:
267
+ audio_feat = self.audio_feat
268
+
269
+ else: # including feature and history
270
+ if self.audio_feat_win_size is None:
271
+ audio_feat = self.audio_feat[index:index + seq_len + num_pre_frames, ...]
272
+ if audio_feat.shape[0] < seq_len + num_pre_frames:
273
+ audio_feat = np.pad(audio_feat,
274
+ [[0, seq_len + self.num_frames - audio_feat.shape[0]], [0, 0]],
275
+ mode='constant')
276
+
277
+ assert audio_feat.shape[0] == self.num_frames + seq_len and audio_feat.shape[
278
+ 1] == self.audio_feat_dim
279
+
280
+ if child.normalization:
281
+ data_mean = child.normalize_stats['mean'].reshape(1, -1)
282
+ data_std = child.normalize_stats['std'].reshape(1, -1)
283
+ seq_data[:, :330] = (seq_data[:, :330] - data_mean) / data_std
284
+ if child.split in['train', 'test']:
285
+ if self.convert_to_6d:
286
+ if self.expression:
287
+ data_sample = {
288
+ 'poses': seq_data[:, :330].astype(np.float).transpose(1, 0),
289
+ 'expression': seq_data[:, 330:].astype(np.float).transpose(1, 0),
290
+ # 'nzero': seq_data[:, 375:].astype(np.float).transpose(1, 0),
291
+ 'aud_feat': audio_feat.astype(np.float).transpose(1, 0),
292
+ 'speaker': speaker_id[self.speaker],
293
+ 'betas': self.betas,
294
+ 'aud_file': self.audio_fn,
295
+ }
296
+ else:
297
+ data_sample = {
298
+ 'poses': seq_data[:, :330].astype(np.float).transpose(1, 0),
299
+ 'nzero': seq_data[:, 330:].astype(np.float).transpose(1, 0),
300
+ 'aud_feat': audio_feat.astype(np.float).transpose(1, 0),
301
+ 'speaker': speaker_id[self.speaker],
302
+ 'betas': self.betas
303
+ }
304
+ else:
305
+ if self.expression:
306
+ data_sample = {
307
+ 'poses': seq_data[:, :165].astype(np.float).transpose(1, 0),
308
+ 'expression': seq_data[:, 165:].astype(np.float).transpose(1, 0),
309
+ 'aud_feat': audio_feat.astype(np.float).transpose(1, 0),
310
+ # 'wv2_feat': wv2_feat.astype(np.float).transpose(1, 0),
311
+ 'speaker': speaker_id[self.speaker],
312
+ 'aud_file': self.audio_fn,
313
+ 'betas': self.betas
314
+ }
315
+ else:
316
+ data_sample = {
317
+ 'poses': seq_data.astype(np.float).transpose(1, 0),
318
+ 'aud_feat': audio_feat.astype(np.float).transpose(1, 0),
319
+ 'speaker': speaker_id[self.speaker],
320
+ 'betas': self.betas
321
+ }
322
+ return data_sample
323
+ else:
324
+ data_sample = {
325
+ 'poses': seq_data[:, :330].astype(np.float).transpose(1, 0),
326
+ 'expression': seq_data[:, 330:].astype(np.float).transpose(1, 0),
327
+ # 'nzero': seq_data[:, 325:].astype(np.float).transpose(1, 0),
328
+ 'aud_feat': audio_feat.astype(np.float).transpose(1, 0),
329
+ 'aud_file': self.audio_fn,
330
+ 'speaker': speaker_id[self.speaker],
331
+ 'betas': self.betas
332
+ }
333
+ return data_sample
334
+ def __len__(child):
335
+ return len(child.index_list)
336
+
337
+ if split == 'train':
338
+ index_list = list(
339
+ range(0, min(self.complete_data.shape[0], self.audio_feat.shape[0]) - self.num_generate_length - self.num_pre_frames,
340
+ 6))
341
+ elif split in ['val', 'test']:
342
+ index_list = list([0])
343
+ if self.whole_video:
344
+ index_list = list([0])
345
+ self.all_dataset = __Worker__(index_list, normalization, normalize_stats, split)
346
+
347
+ def __len__(self):
348
+ return len(self.img_name_list)
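One detail worth illustrating from `to3d()` above: the 99-dim per-frame pose stores each hand as 12 PCA coefficients, which are expanded to 45 axis-angle values per hand via the components in `hand_component.json` (assumed here to hold 45-dim rows), giving the 165-dim full pose:

```python
# sketch of the hand-PCA expansion performed by to3d()
import json
import numpy as np

with open('data_utils/hand_component.json') as f:
    comp = json.load(f)
left_c = np.asarray(comp['left'])            # (n_components, 45) — assumption on layout
pose_99 = np.zeros((4, 99))                  # jaw/eyes/orient/body (75) + 12 + 12 hand coeffs
left_45 = np.einsum('bi,ij->bj', pose_99[:, 75:87], left_c[:12, :])   # -> (4, 45)
```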
data_utils/rotation_conversion.py ADDED
@@ -0,0 +1,551 @@
1
+ # Copyright (c) Facebook, Inc. and its affiliates. All rights reserved.
2
+ # Check PYTORCH3D_LICENCE before use
3
+
4
+ import functools
5
+ from typing import Optional
6
+
7
+ import torch
8
+ import torch.nn.functional as F
9
+
10
+
11
+ """
12
+ The transformation matrices returned from the functions in this file assume
13
+ the points on which the transformation will be applied are column vectors.
14
+ i.e. the R matrix is structured as
15
+
16
+ R = [
17
+ [Rxx, Rxy, Rxz],
18
+ [Ryx, Ryy, Ryz],
19
+ [Rzx, Rzy, Rzz],
20
+ ] # (3, 3)
21
+
22
+ This matrix can be applied to column vectors by post multiplication
23
+ by the points e.g.
24
+
25
+ points = [[0], [1], [2]] # (3 x 1) xyz coordinates of a point
26
+ transformed_points = R * points
27
+
28
+ To apply the same matrix to points which are row vectors, the R matrix
29
+ can be transposed and pre multiplied by the points:
30
+
31
+ e.g.
32
+ points = [[0, 1, 2]] # (1 x 3) xyz coordinates of a point
33
+ transformed_points = points * R.transpose(1, 0)
34
+ """
35
+
36
+
37
+ def quaternion_to_matrix(quaternions):
38
+ """
39
+ Convert rotations given as quaternions to rotation matrices.
40
+
41
+ Args:
42
+ quaternions: quaternions with real part first,
43
+ as tensor of shape (..., 4).
44
+
45
+ Returns:
46
+ Rotation matrices as tensor of shape (..., 3, 3).
47
+ """
48
+ r, i, j, k = torch.unbind(quaternions, -1)
49
+ two_s = 2.0 / (quaternions * quaternions).sum(-1)
50
+
51
+ o = torch.stack(
52
+ (
53
+ 1 - two_s * (j * j + k * k),
54
+ two_s * (i * j - k * r),
55
+ two_s * (i * k + j * r),
56
+ two_s * (i * j + k * r),
57
+ 1 - two_s * (i * i + k * k),
58
+ two_s * (j * k - i * r),
59
+ two_s * (i * k - j * r),
60
+ two_s * (j * k + i * r),
61
+ 1 - two_s * (i * i + j * j),
62
+ ),
63
+ -1,
64
+ )
65
+ return o.reshape(quaternions.shape[:-1] + (3, 3))
66
+
67
+
68
+ def _copysign(a, b):
69
+ """
70
+ Return a tensor where each element has the absolute value taken from the,
71
+ corresponding element of a, with sign taken from the corresponding
72
+ element of b. This is like the standard copysign floating-point operation,
73
+ but is not careful about negative 0 and NaN.
74
+
75
+ Args:
76
+ a: source tensor.
77
+ b: tensor whose signs will be used, of the same shape as a.
78
+
79
+ Returns:
80
+ Tensor of the same shape as a with the signs of b.
81
+ """
82
+ signs_differ = (a < 0) != (b < 0)
83
+ return torch.where(signs_differ, -a, a)
84
+
85
+
86
+ def _sqrt_positive_part(x):
87
+ """
88
+ Returns torch.sqrt(torch.max(0, x))
89
+ but with a zero subgradient where x is 0.
90
+ """
91
+ ret = torch.zeros_like(x)
92
+ positive_mask = x > 0
93
+ ret[positive_mask] = torch.sqrt(x[positive_mask])
94
+ return ret
95
+
96
+
97
+ def matrix_to_quaternion(matrix):
98
+ """
99
+ Convert rotations given as rotation matrices to quaternions.
100
+
101
+ Args:
102
+ matrix: Rotation matrices as tensor of shape (..., 3, 3).
103
+
104
+ Returns:
105
+ quaternions with real part first, as tensor of shape (..., 4).
106
+ """
107
+ if matrix.size(-1) != 3 or matrix.size(-2) != 3:
108
+ raise ValueError(f"Invalid rotation matrix shape {matrix.shape}.")
109
+ m00 = matrix[..., 0, 0]
110
+ m11 = matrix[..., 1, 1]
111
+ m22 = matrix[..., 2, 2]
112
+ o0 = 0.5 * _sqrt_positive_part(1 + m00 + m11 + m22)
113
+ x = 0.5 * _sqrt_positive_part(1 + m00 - m11 - m22)
114
+ y = 0.5 * _sqrt_positive_part(1 - m00 + m11 - m22)
115
+ z = 0.5 * _sqrt_positive_part(1 - m00 - m11 + m22)
116
+ o1 = _copysign(x, matrix[..., 2, 1] - matrix[..., 1, 2])
117
+ o2 = _copysign(y, matrix[..., 0, 2] - matrix[..., 2, 0])
118
+ o3 = _copysign(z, matrix[..., 1, 0] - matrix[..., 0, 1])
119
+ return torch.stack((o0, o1, o2, o3), -1)
120
+
121
+
122
+ def _axis_angle_rotation(axis: str, angle):
123
+ """
124
+ Return the rotation matrices for rotations about one of the axes
125
+ of an Euler-angle convention, for each value of the angle given.
126
+
127
+ Args:
128
+ axis: Axis label "X" or "Y" or "Z".
129
+ angle: any shape tensor of Euler angles in radians
130
+
131
+ Returns:
132
+ Rotation matrices as tensor of shape (..., 3, 3).
133
+ """
134
+
135
+ cos = torch.cos(angle)
136
+ sin = torch.sin(angle)
137
+ one = torch.ones_like(angle)
138
+ zero = torch.zeros_like(angle)
139
+
140
+ if axis == "X":
141
+ R_flat = (one, zero, zero, zero, cos, -sin, zero, sin, cos)
142
+ if axis == "Y":
143
+ R_flat = (cos, zero, sin, zero, one, zero, -sin, zero, cos)
144
+ if axis == "Z":
145
+ R_flat = (cos, -sin, zero, sin, cos, zero, zero, zero, one)
146
+
147
+ return torch.stack(R_flat, -1).reshape(angle.shape + (3, 3))
148
+
149
+
150
+ def euler_angles_to_matrix(euler_angles, convention: str):
151
+ """
152
+ Convert rotations given as Euler angles in radians to rotation matrices.
153
+
154
+ Args:
155
+ euler_angles: Euler angles in radians as tensor of shape (..., 3).
156
+ convention: Convention string of three uppercase letters from
157
+ {"X", "Y", and "Z"}.
158
+
159
+ Returns:
160
+ Rotation matrices as tensor of shape (..., 3, 3).
161
+ """
162
+ if euler_angles.dim() == 0 or euler_angles.shape[-1] != 3:
163
+ raise ValueError("Invalid input euler angles.")
164
+ if len(convention) != 3:
165
+ raise ValueError("Convention must have 3 letters.")
166
+ if convention[1] in (convention[0], convention[2]):
167
+ raise ValueError(f"Invalid convention {convention}.")
168
+ for letter in convention:
169
+ if letter not in ("X", "Y", "Z"):
170
+ raise ValueError(f"Invalid letter {letter} in convention string.")
171
+ matrices = map(_axis_angle_rotation, convention, torch.unbind(euler_angles, -1))
172
+ return functools.reduce(torch.matmul, matrices)
173
+
174
+
175
+ def _angle_from_tan(
176
+ axis: str, other_axis: str, data, horizontal: bool, tait_bryan: bool
177
+ ):
178
+ """
179
+ Extract the first or third Euler angle from the two members of
180
+ the matrix which are positive constant times its sine and cosine.
181
+
182
+ Args:
183
+ axis: Axis label "X" or "Y" or "Z" for the angle we are finding.
184
+ other_axis: Axis label "X" or "Y" or "Z" for the middle axis in the
185
+ convention.
186
+ data: Rotation matrices as tensor of shape (..., 3, 3).
187
+ horizontal: Whether we are looking for the angle for the third axis,
188
+ which means the relevant entries are in the same row of the
189
+ rotation matrix. If not, they are in the same column.
190
+ tait_bryan: Whether the first and third axes in the convention differ.
191
+
192
+ Returns:
193
+ Euler Angles in radians for each matrix in data as a tensor
194
+ of shape (...).
195
+ """
196
+
197
+ i1, i2 = {"X": (2, 1), "Y": (0, 2), "Z": (1, 0)}[axis]
198
+ if horizontal:
199
+ i2, i1 = i1, i2
200
+ even = (axis + other_axis) in ["XY", "YZ", "ZX"]
201
+ if horizontal == even:
202
+ return torch.atan2(data[..., i1], data[..., i2])
203
+ if tait_bryan:
204
+ return torch.atan2(-data[..., i2], data[..., i1])
205
+ return torch.atan2(data[..., i2], -data[..., i1])
206
+
207
+
208
+ def _index_from_letter(letter: str):
209
+ if letter == "X":
210
+ return 0
211
+ if letter == "Y":
212
+ return 1
213
+ if letter == "Z":
214
+ return 2
215
+
216
+
217
+ def matrix_to_euler_angles(matrix, convention: str):
218
+ """
219
+ Convert rotations given as rotation matrices to Euler angles in radians.
220
+
221
+ Args:
222
+ matrix: Rotation matrices as tensor of shape (..., 3, 3).
223
+ convention: Convention string of three uppercase letters.
224
+
225
+ Returns:
226
+ Euler angles in radians as tensor of shape (..., 3).
227
+ """
228
+ if len(convention) != 3:
229
+ raise ValueError("Convention must have 3 letters.")
230
+ if convention[1] in (convention[0], convention[2]):
231
+ raise ValueError(f"Invalid convention {convention}.")
232
+ for letter in convention:
233
+ if letter not in ("X", "Y", "Z"):
234
+ raise ValueError(f"Invalid letter {letter} in convention string.")
235
+ if matrix.size(-1) != 3 or matrix.size(-2) != 3:
236
+ raise ValueError(f"Invalid rotation matrix shape {matrix.shape}.")
237
+ i0 = _index_from_letter(convention[0])
238
+ i2 = _index_from_letter(convention[2])
239
+ tait_bryan = i0 != i2
240
+ if tait_bryan:
241
+ central_angle = torch.asin(
242
+ matrix[..., i0, i2] * (-1.0 if i0 - i2 in [-1, 2] else 1.0)
243
+ )
244
+ else:
245
+ central_angle = torch.acos(matrix[..., i0, i0])
246
+
247
+ o = (
248
+ _angle_from_tan(
249
+ convention[0], convention[1], matrix[..., i2], False, tait_bryan
250
+ ),
251
+ central_angle,
252
+ _angle_from_tan(
253
+ convention[2], convention[1], matrix[..., i0, :], True, tait_bryan
254
+ ),
255
+ )
256
+ return torch.stack(o, -1)
257
+
258
+
259
+ def random_quaternions(
260
+ n: int, dtype: Optional[torch.dtype] = None, device=None, requires_grad=False
261
+ ):
262
+ """
263
+ Generate random quaternions representing rotations,
264
+ i.e. versors with nonnegative real part.
265
+
266
+ Args:
267
+ n: Number of quaternions in a batch to return.
268
+ dtype: Type to return.
269
+ device: Desired device of returned tensor. Default:
270
+ uses the current device for the default tensor type.
271
+ requires_grad: Whether the resulting tensor should have the gradient
272
+ flag set.
273
+
274
+ Returns:
275
+ Quaternions as tensor of shape (N, 4).
276
+ """
277
+ o = torch.randn((n, 4), dtype=dtype, device=device, requires_grad=requires_grad)
278
+ s = (o * o).sum(1)
279
+ o = o / _copysign(torch.sqrt(s), o[:, 0])[:, None]
280
+ return o
281
+
282
+
283
+ def random_rotations(
284
+ n: int, dtype: Optional[torch.dtype] = None, device=None, requires_grad=False
285
+ ):
286
+ """
287
+ Generate random rotations as 3x3 rotation matrices.
288
+
289
+ Args:
290
+ n: Number of rotation matrices in a batch to return.
291
+ dtype: Type to return.
292
+ device: Device of returned tensor. Default: if None,
293
+ uses the current device for the default tensor type.
294
+ requires_grad: Whether the resulting tensor should have the gradient
295
+ flag set.
296
+
297
+ Returns:
298
+ Rotation matrices as tensor of shape (n, 3, 3).
299
+ """
300
+ quaternions = random_quaternions(
301
+ n, dtype=dtype, device=device, requires_grad=requires_grad
302
+ )
303
+ return quaternion_to_matrix(quaternions)
304
+
305
+
306
+ def random_rotation(
307
+ dtype: Optional[torch.dtype] = None, device=None, requires_grad=False
308
+ ):
309
+ """
310
+ Generate a single random 3x3 rotation matrix.
311
+
312
+ Args:
313
+ dtype: Type to return
314
+ device: Device of returned tensor. Default: if None,
315
+ uses the current device for the default tensor type
316
+ requires_grad: Whether the resulting tensor should have the gradient
317
+ flag set
318
+
319
+ Returns:
320
+ Rotation matrix as tensor of shape (3, 3).
321
+ """
322
+ return random_rotations(1, dtype, device, requires_grad)[0]
323
+
324
+
325
+ def standardize_quaternion(quaternions):
326
+ """
327
+ Convert a unit quaternion to a standard form: one in which the real
328
+ part is non negative.
329
+
330
+ Args:
331
+ quaternions: Quaternions with real part first,
332
+ as tensor of shape (..., 4).
333
+
334
+ Returns:
335
+ Standardized quaternions as tensor of shape (..., 4).
336
+ """
337
+ return torch.where(quaternions[..., 0:1] < 0, -quaternions, quaternions)
338
+
339
+
340
+ def quaternion_raw_multiply(a, b):
341
+ """
342
+ Multiply two quaternions.
343
+ Usual torch rules for broadcasting apply.
344
+
345
+ Args:
346
+ a: Quaternions as tensor of shape (..., 4), real part first.
347
+ b: Quaternions as tensor of shape (..., 4), real part first.
348
+
349
+ Returns:
350
+ The product of a and b, a tensor of quaternions shape (..., 4).
351
+ """
352
+ aw, ax, ay, az = torch.unbind(a, -1)
353
+ bw, bx, by, bz = torch.unbind(b, -1)
354
+ ow = aw * bw - ax * bx - ay * by - az * bz
355
+ ox = aw * bx + ax * bw + ay * bz - az * by
356
+ oy = aw * by - ax * bz + ay * bw + az * bx
357
+ oz = aw * bz + ax * by - ay * bx + az * bw
358
+ return torch.stack((ow, ox, oy, oz), -1)
359
+
360
+
361
+ def quaternion_multiply(a, b):
362
+ """
363
+ Multiply two quaternions representing rotations, returning the quaternion
364
+ representing their composition, i.e. the versor with nonnegative real part.
365
+ Usual torch rules for broadcasting apply.
366
+
367
+ Args:
368
+ a: Quaternions as tensor of shape (..., 4), real part first.
369
+ b: Quaternions as tensor of shape (..., 4), real part first.
370
+
371
+ Returns:
372
+ The product of a and b, a tensor of quaternions of shape (..., 4).
373
+ """
374
+ ab = quaternion_raw_multiply(a, b)
375
+ return standardize_quaternion(ab)
376
+
377
+
378
+ def quaternion_invert(quaternion):
379
+ """
380
+ Given a quaternion representing rotation, get the quaternion representing
381
+ its inverse.
382
+
383
+ Args:
384
+ quaternion: Quaternions as tensor of shape (..., 4), with real part
385
+ first, which must be versors (unit quaternions).
386
+
387
+ Returns:
388
+ The inverse, a tensor of quaternions of shape (..., 4).
389
+ """
390
+
391
+ return quaternion * quaternion.new_tensor([1, -1, -1, -1])
392
+
393
+
394
+ def quaternion_apply(quaternion, point):
395
+ """
396
+ Apply the rotation given by a quaternion to a 3D point.
397
+ Usual torch rules for broadcasting apply.
398
+
399
+ Args:
400
+ quaternion: Tensor of quaternions, real part first, of shape (..., 4).
401
+ point: Tensor of 3D points of shape (..., 3).
402
+
403
+ Returns:
404
+ Tensor of rotated points of shape (..., 3).
405
+ """
406
+ if point.size(-1) != 3:
407
+ raise ValueError(f"Points are not in 3D, {point.shape}.")
408
+ real_parts = point.new_zeros(point.shape[:-1] + (1,))
409
+ point_as_quaternion = torch.cat((real_parts, point), -1)
410
+ out = quaternion_raw_multiply(
411
+ quaternion_raw_multiply(quaternion, point_as_quaternion),
412
+ quaternion_invert(quaternion),
413
+ )
414
+ return out[..., 1:]
415
+
416
+
417
+ def axis_angle_to_matrix(axis_angle):
418
+ """
419
+ Convert rotations given as axis/angle to rotation matrices.
420
+
421
+ Args:
422
+ axis_angle: Rotations given as a vector in axis angle form,
423
+ as a tensor of shape (..., 3), where the magnitude is
424
+ the angle turned anticlockwise in radians around the
425
+ vector's direction.
426
+
427
+ Returns:
428
+ Rotation matrices as tensor of shape (..., 3, 3).
429
+ """
430
+ return quaternion_to_matrix(axis_angle_to_quaternion(axis_angle))
431
+
432
+
433
+ def matrix_to_axis_angle(matrix):
434
+ """
435
+ Convert rotations given as rotation matrices to axis/angle.
436
+
437
+ Args:
438
+ matrix: Rotation matrices as tensor of shape (..., 3, 3).
439
+
440
+ Returns:
441
+ Rotations given as a vector in axis angle form, as a tensor
442
+ of shape (..., 3), where the magnitude is the angle
443
+ turned anticlockwise in radians around the vector's
444
+ direction.
445
+ """
446
+ return quaternion_to_axis_angle(matrix_to_quaternion(matrix))
447
+
448
+
449
+ def axis_angle_to_quaternion(axis_angle):
450
+ """
451
+ Convert rotations given as axis/angle to quaternions.
452
+
453
+ Args:
454
+ axis_angle: Rotations given as a vector in axis angle form,
455
+ as a tensor of shape (..., 3), where the magnitude is
456
+ the angle turned anticlockwise in radians around the
457
+ vector's direction.
458
+
459
+ Returns:
460
+ quaternions with real part first, as tensor of shape (..., 4).
461
+ """
462
+ angles = torch.norm(axis_angle, p=2, dim=-1, keepdim=True)
463
+ half_angles = 0.5 * angles
464
+ eps = 1e-6
465
+ small_angles = angles.abs() < eps
466
+ sin_half_angles_over_angles = torch.empty_like(angles)
467
+ sin_half_angles_over_angles[~small_angles] = (
468
+ torch.sin(half_angles[~small_angles]) / angles[~small_angles]
469
+ )
470
+ # for x small, sin(x/2) is about x/2 - (x/2)^3/6
471
+ # so sin(x/2)/x is about 1/2 - (x*x)/48
472
+ sin_half_angles_over_angles[small_angles] = (
473
+ 0.5 - (angles[small_angles] * angles[small_angles]) / 48
474
+ )
475
+ quaternions = torch.cat(
476
+ [torch.cos(half_angles), axis_angle * sin_half_angles_over_angles], dim=-1
477
+ )
478
+ return quaternions
479
+
480
+
481
+ def quaternion_to_axis_angle(quaternions):
482
+ """
483
+ Convert rotations given as quaternions to axis/angle.
484
+
485
+ Args:
486
+ quaternions: quaternions with real part first,
487
+ as tensor of shape (..., 4).
488
+
489
+ Returns:
490
+ Rotations given as a vector in axis angle form, as a tensor
491
+ of shape (..., 3), where the magnitude is the angle
492
+ turned anticlockwise in radians around the vector's
493
+ direction.
494
+ """
495
+ norms = torch.norm(quaternions[..., 1:], p=2, dim=-1, keepdim=True)
496
+ half_angles = torch.atan2(norms, quaternions[..., :1])
497
+ angles = 2 * half_angles
498
+ eps = 1e-6
499
+ small_angles = angles.abs() < eps
500
+ sin_half_angles_over_angles = torch.empty_like(angles)
501
+ sin_half_angles_over_angles[~small_angles] = (
502
+ torch.sin(half_angles[~small_angles]) / angles[~small_angles]
503
+ )
504
+ # for x small, sin(x/2) is about x/2 - (x/2)^3/6
505
+ # so sin(x/2)/x is about 1/2 - (x*x)/48
506
+ sin_half_angles_over_angles[small_angles] = (
507
+ 0.5 - (angles[small_angles] * angles[small_angles]) / 48
508
+ )
509
+ return quaternions[..., 1:] / sin_half_angles_over_angles
510
+
511
+
512
+ def rotation_6d_to_matrix(d6: torch.Tensor) -> torch.Tensor:
513
+ """
514
+ Converts 6D rotation representation by Zhou et al. [1] to rotation matrix
515
+ using Gram--Schmidt orthogonalisation per Section B of [1].
516
+ Args:
517
+ d6: 6D rotation representation, of size (*, 6)
518
+
519
+ Returns:
520
+ batch of rotation matrices of size (*, 3, 3)
521
+
522
+ [1] Zhou, Y., Barnes, C., Lu, J., Yang, J., & Li, H.
523
+ On the Continuity of Rotation Representations in Neural Networks.
524
+ IEEE Conference on Computer Vision and Pattern Recognition, 2019.
525
+ Retrieved from http://arxiv.org/abs/1812.07035
526
+ """
527
+
528
+ a1, a2 = d6[..., :3], d6[..., 3:]
529
+ b1 = F.normalize(a1, dim=-1)
530
+ b2 = a2 - (b1 * a2).sum(-1, keepdim=True) * b1
531
+ b2 = F.normalize(b2, dim=-1)
532
+ b3 = torch.cross(b1, b2, dim=-1)
533
+ return torch.stack((b1, b2, b3), dim=-2)
534
+
535
+
536
+ def matrix_to_rotation_6d(matrix: torch.Tensor) -> torch.Tensor:
537
+ """
538
+ Converts rotation matrices to 6D rotation representation by Zhou et al. [1]
539
+ by dropping the last row. Note that 6D representation is not unique.
540
+ Args:
541
+ matrix: batch of rotation matrices of size (*, 3, 3)
542
+
543
+ Returns:
544
+ 6D rotation representation, of size (*, 6)
545
+
546
+ [1] Zhou, Y., Barnes, C., Lu, J., Yang, J., & Li, H.
547
+ On the Continuity of Rotation Representations in Neural Networks.
548
+ IEEE Conference on Computer Vision and Pattern Recognition, 2019.
549
+ Retrieved from http://arxiv.org/abs/1812.07035
550
+ """
551
+ return matrix[..., :2, :].clone().reshape(*matrix.size()[:-2], 6)
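
A quick round-trip is a handy sanity check for the conversions above. The sketch below is a minimal example that assumes this file is importable as `data_utils.rotation_conversion` (the path used by `evaluation/FGD.py`) and relies only on functions defined in this module.

```python
# Minimal round-trip sanity check for the rotation conversions above.
import torch
from data_utils.rotation_conversion import (
    axis_angle_to_matrix, matrix_to_axis_angle,
    matrix_to_rotation_6d, rotation_6d_to_matrix,
)

aa = 0.5 * torch.randn(8, 3)            # batch of axis-angle rotations (angle < pi)
R = axis_angle_to_matrix(aa)            # (8, 3, 3) rotation matrices
d6 = matrix_to_rotation_6d(R)           # (8, 6) continuous 6D representation
R_rec = rotation_6d_to_matrix(d6)       # back to rotation matrices
aa_rec = matrix_to_axis_angle(R_rec)    # back to axis-angle

print((R - R_rec).abs().max())          # should be close to zero (numerical noise)
print((aa - aa_rec).abs().max())        # should be small for angles below pi
```
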
data_utils/split_train_val_test.py ADDED
@@ -0,0 +1,27 @@
1
+ import os
2
+ import json
3
+ import shutil
4
+
5
+ if __name__ =='__main__':
6
+ id_list = "chemistry conan oliver seth"
7
+ id_list = id_list.split(' ')
8
+
9
+ old_root = '/home/usename/talkshow_data/ExpressiveWholeBodyDatasetReleaseV1.0'
10
+ new_root = '/home/usename/talkshow_data/ExpressiveWholeBodyDatasetReleaseV1.0/talkshow_data_splited'
11
+
12
+ with open('train_val_test.json') as f:
13
+ split_info = json.load(f)
14
+ phase_list = ['train', 'val', 'test']
15
+ for phase in phase_list:
16
+ phase_path_list = split_info[phase]
17
+ for p in phase_path_list:
18
+ old_path = os.path.join(old_root, p)
19
+ if not os.path.exists(old_path):
20
+ print(f'{old_path} not found, skipping')
21
+ continue
22
+ new_path = os.path.join(new_root, phase, p)
23
+ dir_name = os.path.dirname(new_path)
24
+ if not os.path.isdir(dir_name):
25
+ os.makedirs(dir_name, exist_ok=True)
26
+ shutil.move(old_path, new_path)
27
+
data_utils/train_val_test.json ADDED
The diff for this file is too large to render. See raw diff
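Since the split file itself is not rendered, note that `split_train_val_test.py` above only assumes it maps each split name to a list of clip paths relative to the dataset root. A hedged sketch of that layout is shown below; the example paths are placeholders, not entries taken from the released JSON.

```python
# Hypothetical sketch of the layout consumed by split_train_val_test.py;
# the paths below are placeholders, not real clips from the released JSON.
import json

split_info = {
    "train": ["oliver/clip_0001", "seth/clip_0002"],
    "val":   ["conan/clip_0003"],
    "test":  ["chemistry/clip_0004"],
}
with open("train_val_test.json", "w") as f:
    json.dump(split_info, f, indent=2)
```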
 
data_utils/utils.py ADDED
@@ -0,0 +1,318 @@
1
+ import numpy as np
2
+ # import librosa  # had to do this because librosa is not supported on my server
3
+ import python_speech_features
4
+ from scipy.io import wavfile
5
+ from scipy import signal
6
+ import librosa
7
+ import torch
8
+ import torchaudio as ta
9
+ import torchaudio.functional as ta_F
10
+ import torchaudio.transforms as ta_T
11
+ # import pyloudnorm as pyln
12
+
13
+
14
+ def load_wav_old(audio_fn, sr = 16000):
15
+ sample_rate, sig = wavfile.read(audio_fn)
16
+ if sample_rate != sr:
17
+ result = int((sig.shape[0]) / sample_rate * sr)
18
+ x_resampled = signal.resample(sig, result)
19
+ x_resampled = x_resampled.astype(np.float64)
20
+ return x_resampled, sr
21
+
22
+ sig = sig / (2**15)
23
+ return sig, sample_rate
24
+
25
+
26
+ def get_mfcc(audio_fn, eps=1e-6, fps=25, smlpx=False, sr=16000, n_mfcc=64, win_size=None):
27
+
28
+ y, sr = librosa.load(audio_fn, sr=sr, mono=True)
29
+
30
+ if win_size is None:
31
+ hop_len=int(sr / fps)
32
+ else:
33
+ hop_len=int(sr / win_size)
34
+
35
+ n_fft=2048
36
+
37
+ C = librosa.feature.mfcc(
38
+ y = y,
39
+ sr = sr,
40
+ n_mfcc = n_mfcc,
41
+ hop_length = hop_len,
42
+ n_fft = n_fft
43
+ )
44
+
45
+ if C.shape[0] == n_mfcc:
46
+ C = C.transpose(1, 0)
47
+
48
+ return C
49
+
50
+
51
+ def get_melspec(audio_fn, eps=1e-6, fps = 25, sr=16000, n_mels=64):
52
+ raise NotImplementedError
53
+ '''
54
+ # y, sr = load_wav(audio_fn=audio_fn, sr=sr)
55
+
56
+ # hop_len = int(sr / fps)
57
+ # n_fft = 2048
58
+
59
+ # C = librosa.feature.melspectrogram(
60
+ # y = y,
61
+ # sr = sr,
62
+ # n_fft=n_fft,
63
+ # hop_length=hop_len,
64
+ # n_mels = n_mels,
65
+ # fmin=0,
66
+ # fmax=8000)
67
+
68
+
69
+ # mask = (C == 0).astype(np.float)
70
+ # C = mask * eps + (1-mask) * C
71
+
72
+ # C = np.log(C)
73
+ # #weird error may occur here
74
+ # assert not (np.isnan(C).any()), audio_fn
75
+ # if C.shape[0] == n_mels:
76
+ # C = C.transpose(1, 0)
77
+
78
+ # return C
79
+ '''
80
+
81
+ def extract_mfcc(audio,sample_rate=16000):
82
+ mfcc = zip(*python_speech_features.mfcc(audio,sample_rate, numcep=64, nfilt=64, nfft=2048, winstep=0.04))
83
+ mfcc = np.stack([np.array(i) for i in mfcc])
84
+ return mfcc
85
+
86
+ def get_mfcc_psf(audio_fn, eps=1e-6, fps=25, smlpx=False, sr=16000, n_mfcc=64, win_size=None):
87
+ y, sr = load_wav_old(audio_fn, sr=sr)
88
+
89
+ if y.shape.__len__() > 1:
90
+ y = (y[:,0]+y[:,1])/2
91
+
92
+ if win_size is None:
93
+ hop_len=int(sr / fps)
94
+ else:
95
+ hop_len=int(sr/ win_size)
96
+
97
+ n_fft=2048
98
+
99
+ #hard coded for 25 fps
100
+ if not smlpx:
101
+ C = python_speech_features.mfcc(y, sr, numcep=n_mfcc, nfilt=n_mfcc, nfft=n_fft, winstep=0.04)
102
+ else:
103
+ C = python_speech_features.mfcc(y, sr, numcep=n_mfcc, nfilt=n_mfcc, nfft=n_fft, winstep=1.01/15)
104
+ # if C.shape[0] == n_mfcc:
105
+ # C = C.transpose(1, 0)
106
+
107
+ return C
108
+
109
+
110
+ def get_mfcc_psf_min(audio_fn, eps=1e-6, fps=25, smlpx=False, sr=16000, n_mfcc=64, win_size=None):
111
+ y, sr = load_wav_old(audio_fn, sr=sr)
112
+
113
+ if y.shape.__len__() > 1:
114
+ y = (y[:, 0] + y[:, 1]) / 2
115
+ n_fft = 2048
116
+
117
+ slice_len = 22000 * 5
118
+ slice = y.size // slice_len
119
+
120
+ C = []
121
+
122
+ for i in range(slice):
123
+ if i != (slice - 1):
124
+ feat = python_speech_features.mfcc(y[i*slice_len:(i+1)*slice_len], sr, numcep=n_mfcc, nfilt=n_mfcc, nfft=n_fft, winstep=1.01 / 15)
125
+ else:
126
+ feat = python_speech_features.mfcc(y[i * slice_len:], sr, numcep=n_mfcc, nfilt=n_mfcc, nfft=n_fft, winstep=1.01 / 15)
127
+
128
+ C.append(feat)
129
+
130
+ return C
131
+
132
+
133
+ def audio_chunking(audio: torch.Tensor, frame_rate: int = 30, chunk_size: int = 16000):
134
+ """
135
+ :param audio: 1 x T tensor containing a 16kHz audio signal
136
+ :param frame_rate: frame rate for video (we need one audio chunk per video frame)
137
+ :param chunk_size: number of audio samples per chunk
138
+ :return: num_chunks x chunk_size tensor containing sliced audio
139
+ """
140
+ samples_per_frame = chunk_size // frame_rate
141
+ padding = (chunk_size - samples_per_frame) // 2
142
+ audio = torch.nn.functional.pad(audio.unsqueeze(0), pad=[padding, padding]).squeeze(0)
143
+ anchor_points = list(range(chunk_size//2, audio.shape[-1]-chunk_size//2, samples_per_frame))
144
+ audio = torch.cat([audio[:, i-chunk_size//2:i+chunk_size//2] for i in anchor_points], dim=0)
145
+ return audio
146
+
147
+
148
+ def get_mfcc_ta(audio_fn, eps=1e-6, fps=15, smlpx=False, sr=16000, n_mfcc=64, win_size=None, type='mfcc', am=None, am_sr=None, encoder_choice='mfcc'):
149
+ if am is None:
150
+ audio, sr_0 = ta.load(audio_fn)
151
+ if sr != sr_0:
152
+ audio = ta.transforms.Resample(sr_0, sr)(audio)
153
+ if audio.shape[0] > 1:
154
+ audio = torch.mean(audio, dim=0, keepdim=True)
155
+
156
+ n_fft = 2048
157
+ if fps == 15:
158
+ hop_length = 1467
159
+ elif fps == 30:
160
+ hop_length = 734
161
+ win_length = hop_length * 2
162
+ n_mels = 256
163
+ n_mfcc = 64
164
+
165
+ if type == 'mfcc':
166
+ mfcc_transform = ta_T.MFCC(
167
+ sample_rate=sr,
168
+ n_mfcc=n_mfcc,
169
+ melkwargs={
170
+ "n_fft": n_fft,
171
+ "n_mels": n_mels,
172
+ # "win_length": win_length,
173
+ "hop_length": hop_length,
174
+ "mel_scale": "htk",
175
+ },
176
+ )
177
+ audio_ft = mfcc_transform(audio).squeeze(dim=0).transpose(0,1).numpy()
178
+ elif type == 'mel':
179
+ # audio = 0.01 * audio / torch.mean(torch.abs(audio))
180
+ mel_transform = ta_T.MelSpectrogram(
181
+ sample_rate=sr, n_fft=n_fft, win_length=None, hop_length=hop_length, n_mels=n_mels
182
+ )
183
+ audio_ft = mel_transform(audio).squeeze(0).transpose(0,1).numpy()
184
+ # audio_ft = torch.log(audio_ft.clamp(min=1e-10, max=None)).transpose(0,1).numpy()
185
+ elif type == 'mel_mul':
186
+ audio = 0.01 * audio / torch.mean(torch.abs(audio))
187
+ audio = audio_chunking(audio, frame_rate=fps, chunk_size=sr)
188
+ mel_transform = ta_T.MelSpectrogram(
189
+ sample_rate=sr, n_fft=n_fft, win_length=int(sr/20), hop_length=int(sr/100), n_mels=n_mels
190
+ )
191
+ audio_ft = mel_transform(audio).squeeze(1)
192
+ audio_ft = torch.log(audio_ft.clamp(min=1e-10, max=None)).numpy()
193
+ else:
194
+ speech_array, sampling_rate = librosa.load(audio_fn, sr=16000)
195
+
196
+ if encoder_choice == 'faceformer':
197
+ # audio_ft = np.squeeze(am(speech_array, sampling_rate=16000).input_values).reshape(-1, 1)
198
+ audio_ft = speech_array.reshape(-1, 1)
199
+ elif encoder_choice == 'meshtalk':
200
+ audio_ft = 0.01 * speech_array / np.mean(np.abs(speech_array))
201
+ elif encoder_choice == 'onset':
202
+ audio_ft = librosa.onset.onset_detect(y=speech_array, sr=16000, units='time').reshape(-1, 1)
203
+ else:
204
+ audio, sr_0 = ta.load(audio_fn)
205
+ if sr != sr_0:
206
+ audio = ta.transforms.Resample(sr_0, sr)(audio)
207
+ if audio.shape[0] > 1:
208
+ audio = torch.mean(audio, dim=0, keepdim=True)
209
+
210
+ n_fft = 2048
211
+ if fps == 15:
212
+ hop_length = 1467
213
+ elif fps == 30:
214
+ hop_length = 734
215
+ win_length = hop_length * 2
216
+ n_mels = 256
217
+ n_mfcc = 64
218
+
219
+ mfcc_transform = ta_T.MFCC(
220
+ sample_rate=sr,
221
+ n_mfcc=n_mfcc,
222
+ melkwargs={
223
+ "n_fft": n_fft,
224
+ "n_mels": n_mels,
225
+ # "win_length": win_length,
226
+ "hop_length": hop_length,
227
+ "mel_scale": "htk",
228
+ },
229
+ )
230
+ audio_ft = mfcc_transform(audio).squeeze(dim=0).transpose(0, 1).numpy()
231
+ return audio_ft
232
+
233
+
234
+ def get_mfcc_sepa(audio_fn, fps=15, sr=16000):
235
+ audio, sr_0 = ta.load(audio_fn)
236
+ if sr != sr_0:
237
+ audio = ta.transforms.Resample(sr_0, sr)(audio)
238
+ if audio.shape[0] > 1:
239
+ audio = torch.mean(audio, dim=0, keepdim=True)
240
+
241
+ n_fft = 2048
242
+ if fps == 15:
243
+ hop_length = 1467
244
+ elif fps == 30:
245
+ hop_length = 734
246
+ n_mels = 256
247
+ n_mfcc = 64
248
+
249
+ mfcc_transform = ta_T.MFCC(
250
+ sample_rate=sr,
251
+ n_mfcc=n_mfcc,
252
+ melkwargs={
253
+ "n_fft": n_fft,
254
+ "n_mels": n_mels,
255
+ # "win_length": win_length,
256
+ "hop_length": hop_length,
257
+ "mel_scale": "htk",
258
+ },
259
+ )
260
+ audio_ft_0 = mfcc_transform(audio[0, :sr*2]).squeeze(dim=0).transpose(0,1).numpy()
261
+ audio_ft_1 = mfcc_transform(audio[0, sr*2:]).squeeze(dim=0).transpose(0,1).numpy()
262
+ audio_ft = np.concatenate((audio_ft_0, audio_ft_1), axis=0)
263
+ return audio_ft, audio_ft_0.shape[0]
264
+
265
+
266
+ def get_mfcc_old(wav_file):
267
+ sig, sample_rate = load_wav_old(wav_file)
268
+ mfcc = extract_mfcc(sig)
269
+ return mfcc
270
+
271
+
272
+ def smooth_geom(geom, mask: torch.Tensor = None, filter_size: int = 9, sigma: float = 2.0):
273
+ """
274
+ :param geom: T x V x 3 tensor containing a temporal sequence of length T with V vertices in each frame
275
+ :param mask: V-dimensional Tensor containing a mask with vertices to be smoothed
276
+ :param filter_size: size of the Gaussian filter
277
+ :param sigma: standard deviation of the Gaussian filter
278
+ :return: T x V x 3 tensor containing smoothed geometry (i.e., smoothed in the area indicated by the mask)
279
+ """
280
+ assert filter_size % 2 == 1, f"filter size must be odd but is {filter_size}"
281
+ # Gaussian smoothing (low-pass filtering)
282
+ fltr = np.arange(-(filter_size // 2), filter_size // 2 + 1)
283
+ fltr = np.exp(-0.5 * fltr ** 2 / sigma ** 2)
284
+ fltr = torch.Tensor(fltr) / np.sum(fltr)
285
+ # apply fltr
286
+ fltr = fltr.view(1, 1, -1).to(device=geom.device)
287
+ T, V = geom.shape[1], geom.shape[2]
288
+ g = torch.nn.functional.pad(
289
+ geom.permute(2, 0, 1).view(V, 1, T),
290
+ pad=[filter_size // 2, filter_size // 2], mode='replicate'
291
+ )
292
+ g = torch.nn.functional.conv1d(g, fltr).view(V, 1, T)
293
+ smoothed = g.permute(1, 2, 0).contiguous()
294
+ # blend smoothed signal with original signal
295
+ if mask is None:
296
+ return smoothed
297
+ else:
298
+ return smoothed * mask[None, :, None] + geom * (-mask[None, :, None] + 1)
299
+
300
+ if __name__ == '__main__':
301
+ audio_fn = '../sample_audio/clip000028_tCAkv4ggPgI.wav'
302
+
303
+ C = get_mfcc_psf(audio_fn)
304
+ print(C.shape)
305
+
306
+ C_2 = get_mfcc(audio_fn)
307
+ print(C_2.shape)
308
+
309
+ print(C)
310
+ print(C_2)
311
+ print((C == C_2).all())
312
+ # print(y.shape, sr)
313
+ # mel_spec = get_melspec(audio_fn)
314
+ # print(mel_spec.shape)
315
+ # mfcc = get_mfcc(audio_fn, sr = 16000)
316
+ # print(mfcc.shape)
317
+ # print(mel_spec.max(), mel_spec.min())
318
+ # print(mfcc.max(), mfcc.min())
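
As a usage sketch, the torchaudio-based extractor above turns a wav file into per-frame features; the audio path below is a placeholder and the exact frame count depends on the clip length.

```python
# Hedged usage sketch for the audio feature extractors above.
# 'some_clip.wav' is a placeholder path, not a file shipped with the repo.
from data_utils.utils import get_mfcc_ta, get_mfcc_psf

feat_ta = get_mfcc_ta('some_clip.wav', fps=30, sr=16000, type='mfcc')
print(feat_ta.shape)    # roughly (num_frames_at_30fps, 64) MFCCs via torchaudio

feat_psf = get_mfcc_psf('some_clip.wav', fps=25)
print(feat_psf.shape)   # (num_windows, 64) MFCCs via python_speech_features
```
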
download_models.py ADDED
@@ -0,0 +1,28 @@
1
+ import os
2
+ import urllib.request
3
+ import zipfile
4
+ import subprocess
5
+
6
+ def download_file(url, output_path):
7
+ os.makedirs(os.path.dirname(output_path), exist_ok=True)
8
+ if not os.path.exists(output_path):
9
+ print(f"Downloading {url} to {output_path}...")
10
+ urllib.request.urlretrieve(url, output_path)
11
+ print("Download complete!")
12
+ else:
13
+ print(f"File already exists: {output_path}")
14
+
15
+ def main():
16
+ # Create necessary directories
17
+ os.makedirs("experiments", exist_ok=True)
18
+ os.makedirs("visualise/smplx_model", exist_ok=True)
19
+
20
+ # Here you would need to add URLs to download your models
21
+ # For example:
22
+ # download_file("YOUR_MODEL_URL", "experiments/your_model.pth")
23
+ # download_file("SMPLX_MODEL_URL", "visualise/smplx_model/SMPLX_NEUTRAL_2020.npz")
24
+
25
+ print("Setup complete!")
26
+
27
+ if __name__ == "__main__":
28
+ main()
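
The helper above can also be pointed at a zipped checkpoint release; a hedged sketch follows, where the URL is a placeholder to be replaced with the links published alongside the TalkSHOW and SMPL-X releases.

```python
# Hedged sketch: fetching and unpacking a checkpoint archive with download_file.
# The URL below is a placeholder, not a real release link.
import zipfile
from download_models import download_file

ARCHIVE_URL = "https://example.com/talkshow_checkpoints.zip"  # placeholder
download_file(ARCHIVE_URL, "experiments/checkpoints.zip")
with zipfile.ZipFile("experiments/checkpoints.zip") as zf:
    zf.extractall("experiments/")
```
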
evaluation/FGD.py ADDED
@@ -0,0 +1,199 @@
1
+ import time
2
+
3
+ import numpy as np
4
+ import torch
5
+ import torch.nn.functional as F
6
+ from scipy import linalg
7
+ import math
8
+ from data_utils.rotation_conversion import axis_angle_to_matrix, matrix_to_rotation_6d
9
+
10
+ import warnings
11
+ warnings.filterwarnings("ignore", category=RuntimeWarning) # ignore warnings
12
+
13
+
14
+ change_angle = torch.tensor([6.0181e-05, 5.1597e-05, 2.1344e-04, 2.1899e-04])
15
+ class EmbeddingSpaceEvaluator:
16
+ def __init__(self, ae, vae, device):
17
+
18
+ # init embed net
19
+ self.ae = ae
20
+ # self.vae = vae
21
+
22
+ # storage
23
+ self.real_feat_list = []
24
+ self.generated_feat_list = []
25
+ self.real_joints_list = []
26
+ self.generated_joints_list = []
27
+ self.real_6d_list = []
28
+ self.generated_6d_list = []
29
+ self.audio_beat_list = []
30
+
31
+ def reset(self):
32
+ self.real_feat_list = []
33
+ self.generated_feat_list = []
34
+
35
+ def get_no_of_samples(self):
36
+ return len(self.real_feat_list)
37
+
38
+ def push_samples(self, generated_poses, real_poses):
39
+ # self.net.eval()
40
+ # convert poses to latent features
41
+ real_feat, real_poses = self.ae.extract(real_poses)
42
+ generated_feat, generated_poses = self.ae.extract(generated_poses)
43
+
44
+ num_joints = real_poses.shape[2] // 3
45
+
46
+ real_feat = real_feat.squeeze()
47
+ generated_feat = generated_feat.reshape(generated_feat.shape[0]*generated_feat.shape[1], -1)
48
+
49
+ self.real_feat_list.append(real_feat.data.cpu().numpy())
50
+ self.generated_feat_list.append(generated_feat.data.cpu().numpy())
51
+
52
+ # real_poses = matrix_to_rotation_6d(axis_angle_to_matrix(real_poses.reshape(-1, 3))).reshape(-1, num_joints, 6)
53
+ # generated_poses = matrix_to_rotation_6d(axis_angle_to_matrix(generated_poses.reshape(-1, 3))).reshape(-1, num_joints, 6)
54
+ #
55
+ # self.real_feat_list.append(real_poses.data.cpu().numpy())
56
+ # self.generated_feat_list.append(generated_poses.data.cpu().numpy())
57
+
58
+ def push_joints(self, generated_poses, real_poses):
59
+ self.real_joints_list.append(real_poses.data.cpu())
60
+ self.generated_joints_list.append(generated_poses.squeeze().data.cpu())
61
+
62
+ def push_aud(self, aud):
63
+ self.audio_beat_list.append(aud.squeeze().data.cpu())
64
+
65
+ def get_MAAC(self):
66
+ ang_vel_list = []
67
+ for real_joints in self.real_joints_list:
68
+ real_joints[:, 15:21] = real_joints[:, 16:22]
69
+ vec = real_joints[:, 15:21] - real_joints[:, 13:19]
70
+ inner_product = torch.einsum('kij,kij->ki', [vec[:, 2:], vec[:, :-2]])
71
+ inner_product = torch.clamp(inner_product, -1, 1, out=None)
72
+ angle = torch.acos(inner_product) / math.pi
73
+ ang_vel = (angle[1:] - angle[:-1]).abs().mean(dim=0)
74
+ ang_vel_list.append(ang_vel.unsqueeze(dim=0))
75
+ all_vel = torch.cat(ang_vel_list, dim=0)
76
+ MAAC = all_vel.mean(dim=0)
77
+ return MAAC
78
+
79
+ def get_BCscore(self):
80
+ thres = 0.01
81
+ sigma = 0.1
82
+ sum_1 = 0
83
+ total_beat = 0
84
+ for joints, audio_beat_time in zip(self.generated_joints_list, self.audio_beat_list):
85
+ motion_beat_time = []
86
+ if joints.dim() == 4:
87
+ joints = joints[0]
88
+ joints[:, 15:21] = joints[:, 16:22]
89
+ vec = joints[:, 15:21] - joints[:, 13:19]
90
+ inner_product = torch.einsum('kij,kij->ki', [vec[:, 2:], vec[:, :-2]])
91
+ inner_product = torch.clamp(inner_product, -1, 1, out=None)
92
+ angle = torch.acos(inner_product) / math.pi
93
+ ang_vel = (angle[1:] - angle[:-1]).abs() / change_angle / len(change_angle)
94
+
95
+ angle_diff = torch.cat((torch.zeros(1, 4), ang_vel), dim=0)
96
+
97
+ sum_2 = 0
98
+ for i in range(angle_diff.shape[1]):
99
+ motion_beat_time = []
100
+ for t in range(1, joints.shape[0]-1):
101
+ if (angle_diff[t][i] < angle_diff[t - 1][i] and angle_diff[t][i] < angle_diff[t + 1][i]):
102
+ if (angle_diff[t - 1][i] - angle_diff[t][i] >= thres or angle_diff[t + 1][i] - angle_diff[
103
+ t][i] >= thres):
104
+ motion_beat_time.append(float(t) / 30.0)
105
+ if (len(motion_beat_time) == 0):
106
+ continue
107
+ motion_beat_time = torch.tensor(motion_beat_time)
108
+ sum = 0
109
+ for audio in audio_beat_time:
110
+ sum += np.power(math.e, -(np.power((audio.item() - motion_beat_time), 2)).min() / (2 * sigma * sigma))
111
+ sum_2 = sum_2 + sum
112
+ total_beat = total_beat + len(audio_beat_time)
113
+ sum_1 = sum_1 + sum_2
114
+ return sum_1/total_beat
115
+
116
+
117
+ def get_scores(self):
118
+ generated_feats = np.vstack(self.generated_feat_list)
119
+ real_feats = np.vstack(self.real_feat_list)
120
+
121
+ def frechet_distance(samples_A, samples_B):
122
+ A_mu = np.mean(samples_A, axis=0)
123
+ A_sigma = np.cov(samples_A, rowvar=False)
124
+ B_mu = np.mean(samples_B, axis=0)
125
+ B_sigma = np.cov(samples_B, rowvar=False)
126
+ try:
127
+ frechet_dist = self.calculate_frechet_distance(A_mu, A_sigma, B_mu, B_sigma)
128
+ except ValueError:
129
+ frechet_dist = 1e+10
130
+ return frechet_dist
131
+
132
+ ####################################################################
133
+ # frechet distance
134
+ frechet_dist = frechet_distance(generated_feats, real_feats)
135
+
136
+ ####################################################################
137
+ # distance between real and generated samples on the latent feature space
138
+ dists = []
139
+ for i in range(real_feats.shape[0]):
140
+ d = np.sum(np.absolute(real_feats[i] - generated_feats[i])) # MAE
141
+ dists.append(d)
142
+ feat_dist = np.mean(dists)
143
+
144
+ return frechet_dist, feat_dist
145
+
146
+ @staticmethod
147
+ def calculate_frechet_distance(mu1, sigma1, mu2, sigma2, eps=1e-6):
148
+ """ from https://github.com/mseitzer/pytorch-fid/blob/master/fid_score.py """
149
+ """Numpy implementation of the Frechet Distance.
150
+ The Frechet distance between two multivariate Gaussians X_1 ~ N(mu_1, C_1)
151
+ and X_2 ~ N(mu_2, C_2) is
152
+ d^2 = ||mu_1 - mu_2||^2 + Tr(C_1 + C_2 - 2*sqrt(C_1*C_2)).
153
+ Stable version by Dougal J. Sutherland.
154
+ Params:
155
+ -- mu1 : Numpy array containing the activations of a layer of the
156
+ inception net (like returned by the function 'get_predictions')
157
+ for generated samples.
158
+ -- mu2 : The sample mean over activations, precalculated on an
159
+ representative data set.
160
+ -- sigma1: The covariance matrix over activations for generated samples.
161
+ -- sigma2: The covariance matrix over activations, precalculated on an
162
+ representative data set.
163
+ Returns:
164
+ -- : The Frechet Distance.
165
+ """
166
+
167
+ mu1 = np.atleast_1d(mu1)
168
+ mu2 = np.atleast_1d(mu2)
169
+
170
+ sigma1 = np.atleast_2d(sigma1)
171
+ sigma2 = np.atleast_2d(sigma2)
172
+
173
+ assert mu1.shape == mu2.shape, \
174
+ 'Training and test mean vectors have different lengths'
175
+ assert sigma1.shape == sigma2.shape, \
176
+ 'Training and test covariances have different dimensions'
177
+
178
+ diff = mu1 - mu2
179
+
180
+ # Product might be almost singular
181
+ covmean, _ = linalg.sqrtm(sigma1.dot(sigma2), disp=False)
182
+ if not np.isfinite(covmean).all():
183
+ msg = ('fid calculation produces singular product; '
184
+ 'adding %s to diagonal of cov estimates') % eps
185
+ print(msg)
186
+ offset = np.eye(sigma1.shape[0]) * eps
187
+ covmean = linalg.sqrtm((sigma1 + offset).dot(sigma2 + offset))
188
+
189
+ # Numerical error might give slight imaginary component
190
+ if np.iscomplexobj(covmean):
191
+ if not np.allclose(np.diagonal(covmean).imag, 0, atol=1e-3):
192
+ m = np.max(np.abs(covmean.imag))
193
+ raise ValueError('Imaginary component {}'.format(m))
194
+ covmean = covmean.real
195
+
196
+ tr_covmean = np.trace(covmean)
197
+
198
+ return (diff.dot(diff) + np.trace(sigma1) +
199
+ np.trace(sigma2) - 2 * tr_covmean)
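
The evaluator above is driven by an external pose auto-encoder whose `extract()` method returns latent features. The sketch below is a minimal, hedged example with a stand-in encoder and random pose tensors, just to show the call sequence; the real pipeline loads a trained auto-encoder checkpoint instead, and the shapes here are illustrative only.

```python
# Hedged usage sketch for EmbeddingSpaceEvaluator with a stand-in auto-encoder.
import torch
from evaluation.FGD import EmbeddingSpaceEvaluator

class DummyAE:
    """Stand-in for the trained pose auto-encoder; returns (features, poses)."""
    def extract(self, poses):
        # (B, T, C) poses -> (B, 1, C) "latent" features (a real model is learned)
        return poses.mean(dim=1, keepdim=True), poses

real = torch.randn(200, 88, 9)             # (batch, frames, pose dims) - illustrative
fake = real + 0.1 * torch.randn_like(real)

evaluator = EmbeddingSpaceEvaluator(ae=DummyAE(), vae=None, device='cpu')
evaluator.push_samples(fake, real)
fgd, feat_dist = evaluator.get_scores()
print(f'FGD: {fgd:.4f}  latent L1: {feat_dist:.4f}')
```
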
evaluation/__init__.py ADDED
File without changes
evaluation/diversity_LVD.py ADDED
@@ -0,0 +1,64 @@
1
+ '''
2
+ LVD: different initial pose
3
+ diversity: same initial pose
4
+ '''
5
+ import os
6
+ import sys
7
+ sys.path.append(os.getcwd())
8
+
9
+ from glob import glob
10
+
11
+ from argparse import ArgumentParser
12
+ import json
13
+
14
+ from evaluation.util import *
15
+ from evaluation.metrics import *
16
+ from tqdm import tqdm
17
+
18
+ parser = ArgumentParser()
19
+ parser.add_argument('--speaker', required=True, type=str)
20
+ parser.add_argument('--post_fix', nargs='+', default=['base'], type=str)
21
+ args = parser.parse_args()
22
+
23
+ speaker = args.speaker
24
+ test_audios = sorted(glob('pose_dataset/videos/test_audios/%s/*.wav'%(speaker)))
25
+
26
+ LVD_list = []
27
+ diversity_list = []
28
+
29
+ for aud in tqdm(test_audios):
30
+ base_name = os.path.splitext(aud)[0]
31
+ gt_path = get_full_path(aud, speaker, 'val')
32
+ _, gt_poses, _ = get_gts(gt_path)
33
+ gt_poses = gt_poses[np.newaxis,...]
34
+ # print(gt_poses.shape)#(seq_len, 135*2)pose, lhand, rhand, face
35
+ for post_fix in args.post_fix:
36
+ pred_path = base_name + '_'+post_fix+'.json'
37
+ pred_poses = np.array(json.load(open(pred_path)))
38
+ # print(pred_poses.shape)#(B, seq_len, 108)
39
+ pred_poses = cvt25(pred_poses, gt_poses)
40
+ # print(pred_poses.shape)#(B, seq, pose_dim)
41
+
42
+ gt_valid_points = hand_points(gt_poses)
43
+ pred_valid_points = hand_points(pred_poses)
44
+
45
+ lvd = LVD(gt_valid_points, pred_valid_points)
46
+ # div = diversity(pred_valid_points)
47
+
48
+ LVD_list.append(lvd)
49
+ # diversity_list.append(div)
50
+
51
+ # gt_velocity = peak_velocity(gt_valid_points, order=2)
52
+ # pred_velocity = peak_velocity(pred_valid_points, order=2)
53
+
54
+ # gt_consistency = velocity_consistency(gt_velocity, pred_velocity)
55
+ # pred_consistency = velocity_consistency(pred_velocity, gt_velocity)
56
+
57
+ # gt_consistency_list.append(gt_consistency)
58
+ # pred_consistency_list.append(pred_consistency)
59
+
60
+ lvd = np.mean(LVD_list)
61
+ # diversity_list = np.mean(diversity_list)
62
+
63
+ print('LVD:', lvd)
64
+ # print("diversity:", diversity_list)
evaluation/get_quality_samples.py ADDED
@@ -0,0 +1,62 @@
1
+ '''
2
+ '''
3
+ import os
4
+ import sys
5
+ sys.path.append(os.getcwd())
6
+
7
+ from glob import glob
8
+
9
+ from argparse import ArgumentParser
10
+ import json
11
+
12
+ from evaluation.util import *
13
+ from evaluation.metrics import *
14
+ from tqdm import tqdm
15
+
16
+ parser = ArgumentParser()
17
+ parser.add_argument('--speaker', required=True, type=str)
18
+ parser.add_argument('--post_fix', nargs='+', default=['paper_model'], type=str)
19
+ args = parser.parse_args()
20
+
21
+ speaker = args.speaker
22
+ test_audios = sorted(glob('pose_dataset/videos/test_audios/%s/*.wav'%(speaker)))
23
+
24
+ quality_samples={'gt':[]}
25
+ for post_fix in args.post_fix:
26
+ quality_samples[post_fix] = []
27
+
28
+ for aud in tqdm(test_audios):
29
+ base_name = os.path.splitext(aud)[0]
30
+ gt_path = get_full_path(aud, speaker, 'val')
31
+ _, gt_poses, _ = get_gts(gt_path)
32
+ gt_poses = gt_poses[np.newaxis,...]
33
+ gt_valid_points = valid_points(gt_poses)
34
+ # print(gt_valid_points.shape)
35
+ quality_samples['gt'].append(gt_valid_points)
36
+
37
+ for post_fix in args.post_fix:
38
+ pred_path = base_name + '_'+post_fix+'.json'
39
+ pred_poses = np.array(json.load(open(pred_path)))
40
+ # print(pred_poses.shape)#(B, seq_len, 108)
41
+ pred_poses = cvt25(pred_poses, gt_poses)
42
+ # print(pred_poses.shape)#(B, seq, pose_dim)
43
+
44
+ pred_valid_points = valid_points(pred_poses)[0:1]
45
+ quality_samples[post_fix].append(pred_valid_points)
46
+
47
+ quality_samples['gt'] = np.concatenate(quality_samples['gt'], axis=1)
48
+ for post_fix in args.post_fix:
49
+ quality_samples[post_fix] = np.concatenate(quality_samples[post_fix], axis=1)
50
+
51
+ print('gt:', quality_samples['gt'].shape)
52
+ quality_samples['gt'] = quality_samples['gt'].tolist()
53
+ for post_fix in args.post_fix:
54
+ print(post_fix, ':', quality_samples[post_fix].shape)
55
+ quality_samples[post_fix] = quality_samples[post_fix].tolist()
56
+
57
+ save_dir = '../../experiments/'
58
+ os.makedirs(save_dir, exist_ok=True)
59
+ save_name = os.path.join(save_dir, 'quality_samples_%s.json'%(speaker))
60
+ with open(save_name, 'w') as f:
61
+ json.dump(quality_samples, f)
62
+
evaluation/metrics.py ADDED
@@ -0,0 +1,109 @@
1
+ '''
2
+ Warning: metrics are for reference only, may have limited significance
3
+ '''
4
+ import os
5
+ import sys
6
+ sys.path.append(os.getcwd())
7
+ import numpy as np
8
+ import torch
9
+
10
+ from data_utils.lower_body import rearrange, symmetry
11
+ import torch.nn.functional as F
12
+
13
+ def data_driven_baselines(gt_kps):
14
+ '''
15
+ gt_kps: T, D
16
+ '''
17
+ gt_velocity = np.abs(gt_kps[1:] - gt_kps[:-1])
18
+
19
+ mean= np.mean(gt_velocity, axis=0)[np.newaxis] #(1, D)
20
+ mean = np.mean(np.abs(gt_velocity-mean))
21
+ last_step = gt_kps[1] - gt_kps[0]
22
+ last_step = last_step[np.newaxis] #(1, D)
23
+ last_step = np.mean(np.abs(gt_velocity-last_step))
24
+ return last_step, mean
25
+
26
+ def Batch_LVD(gt_kps, pr_kps, symmetrical, weight):
27
+ if gt_kps.shape[0] > pr_kps.shape[1]:
28
+ length = pr_kps.shape[1]
29
+ else:
30
+ length = gt_kps.shape[0]
31
+ gt_kps = gt_kps[:length]
32
+ pr_kps = pr_kps[:, :length]
33
+ global symmetry
34
+ symmetry = torch.tensor(symmetry).bool()
35
+
36
+ if symmetrical:
37
+ # rearrange for compute symmetric. ns means non-symmetrical joints, ys means symmetrical joints.
38
+ gt_kps = gt_kps[:, rearrange]
39
+ ns_gt_kps = gt_kps[:, ~symmetry]
40
+ ys_gt_kps = gt_kps[:, symmetry]
41
+ ys_gt_kps = ys_gt_kps.reshape(ys_gt_kps.shape[0], -1, 2, 3)
42
+ ns_gt_velocity = (ns_gt_kps[1:] - ns_gt_kps[:-1]).norm(p=2, dim=-1)
43
+ ys_gt_velocity = (ys_gt_kps[1:] - ys_gt_kps[:-1]).norm(p=2, dim=-1)
44
+ left_gt_vel = ys_gt_velocity[:, :, 0].sum(dim=-1)
45
+ right_gt_vel = ys_gt_velocity[:, :, 1].sum(dim=-1)
46
+ move_side = torch.where(left_gt_vel>right_gt_vel, torch.ones(left_gt_vel.shape).cuda(), torch.zeros(left_gt_vel.shape).cuda())
47
+ ys_gt_velocity = torch.mul(ys_gt_velocity[:, :, 0].transpose(0,1), move_side) + torch.mul(ys_gt_velocity[:, :, 1].transpose(0,1), ~move_side.bool())
48
+ ys_gt_velocity = ys_gt_velocity.transpose(0,1)
49
+ gt_velocity = torch.cat([ns_gt_velocity, ys_gt_velocity], dim=1)
50
+
51
+ pr_kps = pr_kps[:, :, rearrange]
52
+ ns_pr_kps = pr_kps[:, :, ~symmetry]
53
+ ys_pr_kps = pr_kps[:, :, symmetry]
54
+ ys_pr_kps = ys_pr_kps.reshape(ys_pr_kps.shape[0], ys_pr_kps.shape[1], -1, 2, 3)
55
+ ns_pr_velocity = (ns_pr_kps[:, 1:] - ns_pr_kps[:, :-1]).norm(p=2, dim=-1)
56
+ ys_pr_velocity = (ys_pr_kps[:, 1:] - ys_pr_kps[:, :-1]).norm(p=2, dim=-1)
57
+ left_pr_vel = ys_pr_velocity[:, :, :, 0].sum(dim=-1)
58
+ right_pr_vel = ys_pr_velocity[:, :, :, 1].sum(dim=-1)
59
+ move_side = torch.where(left_pr_vel > right_pr_vel, torch.ones(left_pr_vel.shape).cuda(),
60
+ torch.zeros(left_pr_vel.shape).cuda())
61
+ ys_pr_velocity = torch.mul(ys_pr_velocity[..., 0].permute(2, 0, 1), move_side) + torch.mul(
62
+ ys_pr_velocity[..., 1].permute(2, 0, 1), ~move_side.bool())
63
+ ys_pr_velocity = ys_pr_velocity.permute(1, 2, 0)
64
+ pr_velocity = torch.cat([ns_pr_velocity, ys_pr_velocity], dim=2)
65
+ else:
66
+ gt_velocity = (gt_kps[1:] - gt_kps[:-1]).norm(p=2, dim=-1)
67
+ pr_velocity = (pr_kps[:, 1:] - pr_kps[:, :-1]).norm(p=2, dim=-1)
68
+
69
+ if weight:
70
+ w = F.softmax(gt_velocity.sum(dim=1).normal_(), dim=0)
71
+ else:
72
+ w = 1 / gt_velocity.shape[0]
73
+
74
+ v_diff = ((pr_velocity - gt_velocity).abs().sum(dim=-1) * w).sum(dim=-1).mean()
75
+
76
+ return v_diff
77
+
78
+
79
+ def LVD(gt_kps, pr_kps, symmetrical=False, weight=False):
80
+ gt_kps = gt_kps.squeeze()
81
+ pr_kps = pr_kps.squeeze()
82
+ if len(pr_kps.shape) == 4:
83
+ return Batch_LVD(gt_kps, pr_kps, symmetrical, weight)
84
+ # length = np.minimum(gt_kps.shape[0], pr_kps.shape[0])
85
+ length = gt_kps.shape[0]-10
86
+ # gt_kps = gt_kps[25:length]
87
+ # pr_kps = pr_kps[25:length] #(T, D)
88
+ # if pr_kps.shape[0] < gt_kps.shape[0]:
89
+ # pr_kps = np.pad(pr_kps, [[0, int(gt_kps.shape[0]-pr_kps.shape[0])], [0, 0]], mode='constant')
90
+
91
+ gt_velocity = (gt_kps[1:] - gt_kps[:-1]).norm(p=2, dim=-1)
92
+ pr_velocity = (pr_kps[1:] - pr_kps[:-1]).norm(p=2, dim=-1)
93
+
94
+ return (pr_velocity-gt_velocity).abs().sum(dim=-1).mean()
95
+
96
+ def diversity(kps):
97
+ '''
98
+ kps: bs, seq, dim
99
+ '''
100
+ dis_list = []
101
+ #the distance between each pair
102
+ for i in range(kps.shape[0]):
103
+ for j in range(i+1, kps.shape[0]):
104
+ seq_i = kps[i]
105
+ seq_j = kps[j]
106
+
107
+ dis = np.mean(np.abs(seq_i - seq_j))
108
+ dis_list.append(dis)
109
+ return np.mean(dis_list)
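
The two metrics above can be exercised directly on dummy keypoint sequences; the shapes below are illustrative only (the repo feeds 2D keypoints prepared by `evaluation/util.py`).

```python
# Hedged sketch: LVD and diversity on dummy keypoint sequences.
import numpy as np
import torch
from evaluation.metrics import LVD, diversity

gt = torch.randn(120, 54, 2)               # (frames, joints, xy) ground truth
pred = gt + 0.05 * torch.randn_like(gt)    # a prediction close to ground truth
print('LVD:', LVD(gt, pred).item())        # mean absolute velocity difference

samples = np.random.randn(4, 120, 108)     # (batch, frames, flattened pose)
print('diversity:', diversity(samples))    # mean pairwise L1 distance
```
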
evaluation/mode_transition.py ADDED
@@ -0,0 +1,60 @@
1
+ import os
2
+ import sys
3
+ sys.path.append(os.getcwd())
4
+
5
+ from glob import glob
6
+
7
+ from argparse import ArgumentParser
8
+ import json
9
+
10
+ from evaluation.util import *
11
+ from evaluation.metrics import *
12
+ from tqdm import tqdm
13
+
14
+ parser = ArgumentParser()
15
+ parser.add_argument('--speaker', required=True, type=str)
16
+ parser.add_argument('--post_fix', nargs='+', default=['paper_model'], type=str)
17
+ args = parser.parse_args()
18
+
19
+ speaker = args.speaker
20
+ test_audios = sorted(glob('pose_dataset/videos/test_audios/%s/*.wav'%(speaker)))
21
+
22
+ precision_list=[]
23
+ recall_list=[]
24
+ accuracy_list=[]
25
+
26
+ for aud in tqdm(test_audios):
27
+ base_name = os.path.splitext(aud)[0]
28
+ gt_path = get_full_path(aud, speaker, 'val')
29
+ _, gt_poses, _ = get_gts(gt_path)
30
+ if gt_poses.shape[0] < 50:
31
+ continue
32
+ gt_poses = gt_poses[np.newaxis,...]
33
+ # print(gt_poses.shape)#(seq_len, 135*2)pose, lhand, rhand, face
34
+ for post_fix in args.post_fix:
35
+ pred_path = base_name + '_'+post_fix+'.json'
36
+ pred_poses = np.array(json.load(open(pred_path)))
37
+ # print(pred_poses.shape)#(B, seq_len, 108)
38
+ pred_poses = cvt25(pred_poses, gt_poses)
39
+ # print(pred_poses.shape)#(B, seq, pose_dim)
40
+
41
+ gt_valid_points = valid_points(gt_poses)
42
+ pred_valid_points = valid_points(pred_poses)
43
+
44
+ # print(gt_valid_points.shape, pred_valid_points.shape)
45
+
46
+ gt_mode_transition_seq = mode_transition_seq(gt_valid_points, speaker)#(B, N)
47
+ pred_mode_transition_seq = mode_transition_seq(pred_valid_points, speaker)#(B, N)
48
+
49
+ # baseline = np.random.randint(0, 2, size=pred_mode_transition_seq.shape)
50
+ # pred_mode_transition_seq = baseline
51
+ precision, recall, accuracy = mode_transition_consistency(pred_mode_transition_seq, gt_mode_transition_seq)
52
+ precision_list.append(precision)
53
+ recall_list.append(recall)
54
+ accuracy_list.append(accuracy)
55
+ print(len(precision_list), len(recall_list), len(accuracy_list))
56
+ precision_list = np.mean(precision_list)
57
+ recall_list = np.mean(recall_list)
58
+ accuracy_list = np.mean(accuracy_list)
59
+
60
+ print('precision, recall, accu:', precision_list, recall_list, accuracy_list)
evaluation/peak_velocity.py ADDED
@@ -0,0 +1,65 @@
1
+ import os
2
+ import sys
3
+ sys.path.append(os.getcwd())
4
+
5
+ from glob import glob
6
+
7
+ from argparse import ArgumentParser
8
+ import json
9
+
10
+ from evaluation.util import *
11
+ from evaluation.metrics import *
12
+ from tqdm import tqdm
13
+
14
+ parser = ArgumentParser()
15
+ parser.add_argument('--speaker', required=True, type=str)
16
+ parser.add_argument('--post_fix', nargs='+', default=['paper_model'], type=str)
17
+ args = parser.parse_args()
18
+
19
+ speaker = args.speaker
20
+ test_audios = sorted(glob('pose_dataset/videos/test_audios/%s/*.wav'%(speaker)))
21
+
22
+ gt_consistency_list=[]
23
+ pred_consistency_list=[]
24
+
25
+ for aud in tqdm(test_audios):
26
+ base_name = os.path.splitext(aud)[0]
27
+ gt_path = get_full_path(aud, speaker, 'val')
28
+ _, gt_poses, _ = get_gts(gt_path)
29
+ gt_poses = gt_poses[np.newaxis,...]
30
+ # print(gt_poses.shape)#(seq_len, 135*2)pose, lhand, rhand, face
31
+ for post_fix in args.post_fix:
32
+ pred_path = base_name + '_'+post_fix+'.json'
33
+ pred_poses = np.array(json.load(open(pred_path)))
34
+ # print(pred_poses.shape)#(B, seq_len, 108)
35
+ pred_poses = cvt25(pred_poses, gt_poses)
36
+ # print(pred_poses.shape)#(B, seq, pose_dim)
37
+
38
+ gt_valid_points = hand_points(gt_poses)
39
+ pred_valid_points = hand_points(pred_poses)
40
+
41
+ gt_velocity = peak_velocity(gt_valid_points, order=2)
42
+ pred_velocity = peak_velocity(pred_valid_points, order=2)
43
+
44
+ gt_consistency = velocity_consistency(gt_velocity, pred_velocity)
45
+ pred_consistency = velocity_consistency(pred_velocity, gt_velocity)
46
+
47
+ gt_consistency_list.append(gt_consistency)
48
+ pred_consistency_list.append(pred_consistency)
49
+
50
+ gt_consistency_list = np.concatenate(gt_consistency_list)
51
+ pred_consistency_list = np.concatenate(pred_consistency_list)
52
+
53
+ print(gt_consistency_list.max(), gt_consistency_list.min())
54
+ print(pred_consistency_list.max(), pred_consistency_list.min())
55
+ print(np.mean(gt_consistency_list), np.mean(pred_consistency_list))
56
+ print(np.std(gt_consistency_list), np.std(pred_consistency_list))
57
+
58
+ draw_cdf(gt_consistency_list, save_name='%s_gt.jpg'%(speaker), color='slateblue')
59
+ draw_cdf(pred_consistency_list, save_name='%s_pred.jpg'%(speaker), color='lightskyblue')
60
+
61
+ to_excel(gt_consistency_list, '%s_gt.xlsx'%(speaker))
62
+ to_excel(pred_consistency_list, '%s_pred.xlsx'%(speaker))
63
+
64
+ np.save('%s_gt.npy'%(speaker), gt_consistency_list)
65
+ np.save('%s_pred.npy'%(speaker), pred_consistency_list)
evaluation/util.py ADDED
@@ -0,0 +1,148 @@
1
+ import os
2
+ from glob import glob
3
+ import numpy as np
4
+ import json
5
+ from matplotlib import pyplot as plt
6
+ import pandas as pd
7
+ def get_gts(clip):
8
+ '''
9
+ clip: abs path to the clip dir
10
+ '''
11
+ keypoints_files = sorted(glob(os.path.join(clip, 'keypoints_new/person_1')+'/*.json'))
12
+
13
+ upper_body_points = list(np.arange(0, 25))
14
+ poses = []
15
+ confs = []
16
+ neck_to_nose_len = []
17
+ mean_position = []
18
+ for kp_file in keypoints_files:
19
+ kp_load = json.load(open(kp_file, 'r'))['people'][0]
20
+ posepts = kp_load['pose_keypoints_2d']
21
+ lhandpts = kp_load['hand_left_keypoints_2d']
22
+ rhandpts = kp_load['hand_right_keypoints_2d']
23
+ facepts = kp_load['face_keypoints_2d']
24
+
25
+ neck = np.array(posepts).reshape(-1,3)[1]
26
+ nose = np.array(posepts).reshape(-1,3)[0]
27
+ x_offset = abs(neck[0]-nose[0])
28
+ y_offset = abs(neck[1]-nose[1])
29
+ neck_to_nose_len.append(y_offset)
30
+ mean_position.append([neck[0],neck[1]])
31
+
32
+ keypoints=np.array(posepts+lhandpts+rhandpts+facepts).reshape(-1,3)[:,:2]
33
+
34
+ upper_body = keypoints[upper_body_points, :]
35
+ hand_points = keypoints[25:, :]
36
+ keypoints = np.vstack([upper_body, hand_points])
37
+
38
+ poses.append(keypoints)
39
+
40
+ if len(neck_to_nose_len) > 0:
41
+ scale_factor = np.mean(neck_to_nose_len)
42
+ else:
43
+ raise ValueError(clip)
44
+ mean_position = np.mean(np.array(mean_position), axis=0)
45
+
46
+ unlocalized_poses = np.array(poses).copy()
47
+ localized_poses = []
48
+ for i in range(len(poses)):
49
+ keypoints = poses[i]
50
+ neck = keypoints[1].copy()
51
+
52
+ keypoints[:, 0] = (keypoints[:, 0] - neck[0]) / scale_factor
53
+ keypoints[:, 1] = (keypoints[:, 1] - neck[1]) / scale_factor
54
+ localized_poses.append(keypoints.reshape(-1))
55
+
56
+ localized_poses=np.array(localized_poses)
57
+ return unlocalized_poses, localized_poses, (scale_factor, mean_position)
58
+
59
+ def get_full_path(wav_name, speaker, split):
60
+ '''
61
+ get clip path from aud file
62
+ '''
63
+ wav_name = os.path.basename(wav_name)
64
+ wav_name = os.path.splitext(wav_name)[0]
65
+ clip_name, vid_name = wav_name[:10], wav_name[11:]
66
+
67
+ full_path = os.path.join('pose_dataset/videos/', speaker, 'clips', vid_name, 'images/half', split, clip_name)
68
+
69
+ assert os.path.isdir(full_path), full_path
70
+
71
+ return full_path
72
+
73
+ def smooth(res):
74
+ '''
75
+ res: (B, seq_len, pose_dim)
76
+ '''
77
+ window = [res[:, 7, :], res[:, 8, :], res[:, 9, :], res[:, 10, :], res[:, 11, :], res[:, 12, :]]
78
+ w_size=7
79
+ for i in range(10, res.shape[1]-3):
80
+ window.append(res[:, i+3, :])
81
+ if len(window) > w_size:
82
+ window = window[1:]
83
+
84
+ if (i%25) in [22, 23, 24, 0, 1, 2, 3]:
85
+ res[:, i, :] = np.mean(window, axis=1)
86
+
87
+ return res
88
+
89
+ def cvt25(pred_poses, gt_poses=None):
90
+ '''
91
+ gt_poses: (1, seq_len, 270), 135 *2
92
+ pred_poses: (B, seq_len, 108), 54 * 2
93
+ '''
94
+ if gt_poses is None:
95
+ gt_poses = np.zeros_like(pred_poses)
96
+ else:
97
+ gt_poses = gt_poses.repeat(pred_poses.shape[0], axis=0)
98
+
99
+ length = min(pred_poses.shape[1], gt_poses.shape[1])
100
+ pred_poses = pred_poses[:, :length, :]
101
+ gt_poses = gt_poses[:, :length, :]
102
+ gt_poses = gt_poses.reshape(gt_poses.shape[0], gt_poses.shape[1], -1, 2)
103
+ pred_poses = pred_poses.reshape(pred_poses.shape[0], pred_poses.shape[1], -1, 2)
104
+
105
+ gt_poses[:, :, [1, 2, 3, 4, 5, 6, 7], :] = pred_poses[:, :, 1:8, :]
106
+ gt_poses[:, :, 25:25+21+21, :] = pred_poses[:, :, 12:, :]
107
+
108
+ return gt_poses.reshape(gt_poses.shape[0], gt_poses.shape[1], -1)
109
+
110
+ def hand_points(seq):
111
+ '''
112
+ seq: (B, seq_len, 135*2)
113
+ hands only
114
+ '''
115
+ hand_idx = [1, 2, 3, 4,5 ,6,7] + list(range(25, 25+21+21))
116
+ seq = seq.reshape(seq.shape[0], seq.shape[1], -1, 2)
117
+ return seq[:, :, hand_idx, :].reshape(seq.shape[0], seq.shape[1], -1)
118
+
119
+ def valid_points(seq):
120
+ '''
121
+ hands with some head points
122
+ '''
123
+ valid_idx = [0, 1, 2, 3, 4,5 ,6,7, 8, 9, 10, 11] + list(range(25, 25+21+21))
124
+ seq = seq.reshape(seq.shape[0], seq.shape[1], -1, 2)
125
+
126
+ seq = seq[:, :, valid_idx, :].reshape(seq.shape[0], seq.shape[1], -1)
127
+ assert seq.shape[-1] == 108, seq.shape
128
+ return seq
129
+
130
+ def draw_cdf(seq, save_name='cdf.jpg', color='slateblue'):
131
+ plt.figure()
132
+ plt.hist(seq, bins=100, range=(0, 100), color=color)
133
+ plt.savefig(save_name)
134
+
135
+ def to_excel(seq, save_name='res.xlsx'):
136
+ '''
137
+ seq: (T)
138
+ '''
139
+ df = pd.DataFrame(seq)
140
+ writer = pd.ExcelWriter(save_name)
141
+ df.to_excel(writer, 'sheet1')
142
+ writer.save()
143
+ writer.close()
144
+
145
+
146
+ if __name__ == '__main__':
147
+ random_data = np.random.randint(0, 10, 100)
148
+ draw_cdf(random_data)
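For readers skimming this diff: the localization pass above re-centers each frame on the neck joint and rescales it by the clip's average neck-to-nose distance. Below is a minimal, self-contained sketch of that normalization on a single frame; the numbers are hypothetical and this snippet is not part of the repository.

```python
import numpy as np

# one frame of (x, y) keypoints: nose, neck, one extra joint (made-up values)
keypoints = np.array([[5.0, 10.0],
                      [5.0, 14.0],
                      [9.0, 14.0]])
neck = keypoints[1].copy()
scale_factor = abs(neck[1] - keypoints[0, 1])  # neck-to-nose y-offset = 4.0

localized = (keypoints - neck) / scale_factor
print(localized)  # neck -> (0, 0), nose -> (0, -1), extra joint -> (1, 0)
```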
losses/__init__.py ADDED
@@ -0,0 +1 @@
1
+ from .losses import *
losses/losses.py ADDED
@@ -0,0 +1,91 @@
1
+ import os
2
+ import sys
3
+
4
+ sys.path.append(os.getcwd())
5
+
6
+ import torch
7
+ import torch.nn as nn
8
+ import torch.nn.functional as F
9
+ import numpy as np
10
+
11
+ class KeypointLoss(nn.Module):
12
+ def __init__(self):
13
+ super(KeypointLoss, self).__init__()
14
+
15
+ def forward(self, pred_seq, gt_seq, gt_conf=None):
16
+ #pred_seq: (B, C, T)
17
+ if gt_conf is not None:
18
+ gt_conf = gt_conf >= 0.01
19
+ return F.mse_loss(pred_seq[gt_conf], gt_seq[gt_conf], reduction='mean')
20
+ else:
21
+ return F.mse_loss(pred_seq, gt_seq)
22
+
23
+
24
+ class KLLoss(nn.Module):
25
+ def __init__(self, kl_tolerance):
26
+ super(KLLoss, self).__init__()
27
+ self.kl_tolerance = kl_tolerance
28
+
29
+ def forward(self, mu, var, mul=1):
30
+ kl_tolerance = self.kl_tolerance * mul * var.shape[1] / 64
31
+ kld_loss = -0.5 * torch.sum(1 + var - mu**2 - var.exp(), dim=1)
32
+ # kld_loss = -0.5 * torch.sum(1 + (var-1) - (mu) ** 2 - (var-1).exp(), dim=1)
33
+ if self.kl_tolerance is not None:
34
+ # above_line = kld_loss[kld_loss > self.kl_tolerance]
35
+ # if len(above_line) > 0:
36
+ # kld_loss = torch.mean(kld_loss)
37
+ # else:
38
+ # kld_loss = 0
39
+ kld_loss = torch.where(kld_loss > kl_tolerance, kld_loss, torch.tensor(kl_tolerance, device=kld_loss.device))
40
+ # else:
41
+ kld_loss = torch.mean(kld_loss)
42
+ return kld_loss
43
+
44
+
45
+ class L2KLLoss(nn.Module):
46
+ def __init__(self, kl_tolerance):
47
+ super(L2KLLoss, self).__init__()
48
+ self.kl_tolerance = kl_tolerance
49
+
50
+ def forward(self, x):
51
+ # TODO: check
52
+ kld_loss = torch.sum(x ** 2, dim=1)
53
+ if self.kl_tolerance is not None:
54
+ above_line = kld_loss[kld_loss > self.kl_tolerance]
55
+ if len(above_line) > 0:
56
+ kld_loss = torch.mean(kld_loss)
57
+ else:
58
+ kld_loss = 0
59
+ else:
60
+ kld_loss = torch.mean(kld_loss)
61
+ return kld_loss
62
+
63
+ class L2RegLoss(nn.Module):
64
+ def __init__(self):
65
+ super(L2RegLoss, self).__init__()
66
+
67
+ def forward(self, x):
68
+ #TODO: check
69
+ return torch.sum(x**2)
70
+
71
+
72
+ class L2Loss(nn.Module):
73
+ def __init__(self):
74
+ super(L2Loss, self).__init__()
75
+
76
+ def forward(self, x):
77
+ # TODO: check
78
+ return torch.sum(x ** 2)
79
+
80
+
81
+ class AudioLoss(nn.Module):
82
+ def __init__(self):
83
+ super(AudioLoss, self).__init__()
84
+
85
+ def forward(self, dynamics, gt_poses):
86
+ #pay attention, normalized
87
+ mean = torch.mean(gt_poses, dim=-1).unsqueeze(-1)
88
+ gt = gt_poses - mean
89
+ return F.mse_loss(dynamics, gt)
90
+
91
+ L1Loss = nn.L1Loss
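KLLoss above treats `var` as a log-variance and applies a free-bits style floor so the KL term cannot fall below a tolerance. A minimal sketch of that computation with assumed shapes follows (the real loss additionally rescales the tolerance by `mul * var.shape[1] / 64`); it is illustrative only, not repository code.

```python
import torch

mu = torch.zeros(4, 64)        # posterior means, (B, latent_dim)
log_var = torch.zeros(4, 64)   # posterior log-variances
kl_tolerance = 5.0

# analytic KL( N(mu, exp(log_var)) || N(0, I) ), summed over the latent dim
kld = -0.5 * torch.sum(1 + log_var - mu ** 2 - log_var.exp(), dim=1)
kld = torch.clamp(kld, min=kl_tolerance)   # free-bits floor, as in KLLoss
print(kld.mean())                          # tensor(5.) when the posterior equals the prior
```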
nets/LS3DCG.py ADDED
@@ -0,0 +1,414 @@
1
+ '''
2
+ not exactly the same as the official repo, but the results are good
3
+ '''
4
+ import sys
5
+ import os
6
+
7
+ from data_utils.lower_body import c_index_3d, c_index_6d
8
+
9
+ sys.path.append(os.getcwd())
10
+
11
+ import numpy as np
12
+ import torch
13
+ import torch.nn as nn
14
+ import torch.optim as optim
15
+ import torch.nn.functional as F
16
+ import math
17
+
18
+ from nets.base import TrainWrapperBaseClass
19
+ from nets.layers import SeqEncoder1D
20
+ from losses import KeypointLoss, L1Loss, KLLoss
21
+ from data_utils.utils import get_melspec, get_mfcc_psf, get_mfcc_ta
22
+ from nets.utils import denormalize
23
+
24
+ class Conv1d_tf(nn.Conv1d):
25
+ """
26
+ Conv1d with the padding behavior from TF
27
+ modified from https://github.com/mlperf/inference/blob/482f6a3beb7af2fb0bd2d91d6185d5e71c22c55f/others/edge/object_detection/ssd_mobilenet/pytorch/utils.py
28
+ """
29
+
30
+ def __init__(self, *args, **kwargs):
31
+ super(Conv1d_tf, self).__init__(*args, **kwargs)
32
+ self.padding = kwargs.get("padding", "same")
33
+
34
+ def _compute_padding(self, input, dim):
35
+ input_size = input.size(dim + 2)
36
+ filter_size = self.weight.size(dim + 2)
37
+ effective_filter_size = (filter_size - 1) * self.dilation[dim] + 1
38
+ out_size = (input_size + self.stride[dim] - 1) // self.stride[dim]
39
+ total_padding = max(
40
+ 0, (out_size - 1) * self.stride[dim] + effective_filter_size - input_size
41
+ )
42
+ additional_padding = int(total_padding % 2 != 0)
43
+
44
+ return additional_padding, total_padding
45
+
46
+ def forward(self, input):
47
+ if self.padding == "VALID":
48
+ return F.conv1d(
49
+ input,
50
+ self.weight,
51
+ self.bias,
52
+ self.stride,
53
+ padding=0,
54
+ dilation=self.dilation,
55
+ groups=self.groups,
56
+ )
57
+ rows_odd, padding_rows = self._compute_padding(input, dim=0)
58
+ if rows_odd:
59
+ input = F.pad(input, [0, rows_odd])
60
+
61
+ return F.conv1d(
62
+ input,
63
+ self.weight,
64
+ self.bias,
65
+ self.stride,
66
+ padding=(padding_rows // 2),
67
+ dilation=self.dilation,
68
+ groups=self.groups,
69
+ )
70
+
71
+
72
+ def ConvNormRelu(in_channels, out_channels, type='1d', downsample=False, k=None, s=None, norm='bn', padding='valid'):
73
+ if k is None and s is None:
74
+ if not downsample:
75
+ k = 3
76
+ s = 1
77
+ else:
78
+ k = 4
79
+ s = 2
80
+
81
+ if type == '1d':
82
+ conv_block = Conv1d_tf(in_channels, out_channels, kernel_size=k, stride=s, padding=padding)
83
+ if norm == 'bn':
84
+ norm_block = nn.BatchNorm1d(out_channels)
85
+ elif norm == 'ln':
86
+ norm_block = nn.LayerNorm(out_channels)
87
+ elif type == '2d':
88
+ conv_block = Conv2d_tf(in_channels, out_channels, kernel_size=k, stride=s, padding=padding)
89
+ norm_block = nn.BatchNorm2d(out_channels)
90
+ else:
91
+ assert False
92
+
93
+ return nn.Sequential(
94
+ conv_block,
95
+ norm_block,
96
+ nn.LeakyReLU(0.2, True)
97
+ )
98
+
99
+ class Decoder(nn.Module):
100
+ def __init__(self, in_ch, out_ch):
101
+ super(Decoder, self).__init__()
102
+ self.up1 = nn.Sequential(
103
+ ConvNormRelu(in_ch // 2 + in_ch, in_ch // 2),
104
+ ConvNormRelu(in_ch // 2, in_ch // 2),
105
+ nn.Upsample(scale_factor=2, mode='nearest')
106
+ )
107
+ self.up2 = nn.Sequential(
108
+ ConvNormRelu(in_ch // 4 + in_ch // 2, in_ch // 4),
109
+ ConvNormRelu(in_ch // 4, in_ch // 4),
110
+ nn.Upsample(scale_factor=2, mode='nearest')
111
+ )
112
+ self.up3 = nn.Sequential(
113
+ ConvNormRelu(in_ch // 8 + in_ch // 4, in_ch // 8),
114
+ ConvNormRelu(in_ch // 8, in_ch // 8),
115
+ nn.Conv1d(in_ch // 8, out_ch, 1, 1)
116
+ )
117
+
118
+ def forward(self, x, x1, x2, x3):
119
+ x = F.interpolate(x, x3.shape[2])
120
+ x = torch.cat([x, x3], dim=1)
121
+ x = self.up1(x)
122
+ x = F.interpolate(x, x2.shape[2])
123
+ x = torch.cat([x, x2], dim=1)
124
+ x = self.up2(x)
125
+ x = F.interpolate(x, x1.shape[2])
126
+ x = torch.cat([x, x1], dim=1)
127
+ x = self.up3(x)
128
+ return x
129
+
130
+
131
+ class EncoderDecoder(nn.Module):
132
+ def __init__(self, n_frames, each_dim):
133
+ super().__init__()
134
+ self.n_frames = n_frames
135
+
136
+ self.down1 = nn.Sequential(
137
+ ConvNormRelu(64, 64, '1d', False),
138
+ ConvNormRelu(64, 128, '1d', False),
139
+ )
140
+ self.down2 = nn.Sequential(
141
+ ConvNormRelu(128, 128, '1d', False),
142
+ ConvNormRelu(128, 256, '1d', False),
143
+ )
144
+ self.down3 = nn.Sequential(
145
+ ConvNormRelu(256, 256, '1d', False),
146
+ ConvNormRelu(256, 512, '1d', False),
147
+ )
148
+ self.down4 = nn.Sequential(
149
+ ConvNormRelu(512, 512, '1d', False),
150
+ ConvNormRelu(512, 1024, '1d', False),
151
+ )
152
+
153
+ self.down = nn.MaxPool1d(kernel_size=2)
154
+ self.up = nn.Upsample(scale_factor=2, mode='nearest')
155
+
156
+ self.face_decoder = Decoder(1024, each_dim[0] + each_dim[3])
157
+ self.body_decoder = Decoder(1024, each_dim[1])
158
+ self.hand_decoder = Decoder(1024, each_dim[2])
159
+
160
+ def forward(self, spectrogram, time_steps=None):
161
+ if time_steps is None:
162
+ time_steps = self.n_frames
163
+
164
+ x1 = self.down1(spectrogram)
165
+ x = self.down(x1)
166
+ x2 = self.down2(x)
167
+ x = self.down(x2)
168
+ x3 = self.down3(x)
169
+ x = self.down(x3)
170
+ x = self.down4(x)
171
+ x = self.up(x)
172
+
173
+ face = self.face_decoder(x, x1, x2, x3)
174
+ body = self.body_decoder(x, x1, x2, x3)
175
+ hand = self.hand_decoder(x, x1, x2, x3)
176
+
177
+ return face, body, hand
178
+
179
+
180
+ class Generator(nn.Module):
181
+ def __init__(self,
182
+ each_dim,
183
+ training=False,
184
+ device=None
185
+ ):
186
+ super().__init__()
187
+
188
+ self.training = training
189
+ self.device = device
190
+
191
+ self.encoderdecoder = EncoderDecoder(15, each_dim)
192
+
193
+ def forward(self, in_spec, time_steps=None):
194
+ if time_steps is not None:
195
+ self.gen_length = time_steps
196
+
197
+ face, body, hand = self.encoderdecoder(in_spec)
198
+ out = torch.cat([face, body, hand], dim=1)
199
+ out = out.transpose(1, 2)
200
+
201
+ return out
202
+
203
+
204
+ class Discriminator(nn.Module):
205
+ def __init__(self, input_dim):
206
+ super().__init__()
207
+ self.net = nn.Sequential(
208
+ ConvNormRelu(input_dim, 128, '1d'),
209
+ ConvNormRelu(128, 256, '1d'),
210
+ nn.MaxPool1d(kernel_size=2),
211
+ ConvNormRelu(256, 256, '1d'),
212
+ ConvNormRelu(256, 512, '1d'),
213
+ nn.MaxPool1d(kernel_size=2),
214
+ ConvNormRelu(512, 512, '1d'),
215
+ ConvNormRelu(512, 1024, '1d'),
216
+ nn.MaxPool1d(kernel_size=2),
217
+ nn.Conv1d(1024, 1, 1, 1),
218
+ nn.Sigmoid()
219
+ )
220
+
221
+ def forward(self, x):
222
+ x = x.transpose(1, 2)
223
+
224
+ out = self.net(x)
225
+ return out
226
+
227
+
228
+ class TrainWrapper(TrainWrapperBaseClass):
229
+ def __init__(self, args, config) -> None:
230
+ self.args = args
231
+ self.config = config
232
+ self.device = torch.device(self.args.gpu)
233
+ self.global_step = 0
234
+ self.convert_to_6d = self.config.Data.pose.convert_to_6d
235
+ self.init_params()
236
+
237
+ self.generator = Generator(
238
+ each_dim=self.each_dim,
239
+ training=not self.args.infer,
240
+ device=self.device,
241
+ ).to(self.device)
242
+ self.discriminator = Discriminator(
243
+ input_dim=self.each_dim[1] + self.each_dim[2] + 64
244
+ ).to(self.device)
245
+ if self.convert_to_6d:
246
+ self.c_index = c_index_6d
247
+ else:
248
+ self.c_index = c_index_3d
249
+ self.MSELoss = KeypointLoss().to(self.device)
250
+ self.L1Loss = L1Loss().to(self.device)
251
+ super().__init__(args, config)
252
+
253
+ def init_params(self):
254
+ scale = 1
255
+
256
+ global_orient = round(0 * scale)
257
+ leye_pose = reye_pose = round(0 * scale)
258
+ jaw_pose = round(3 * scale)
259
+ body_pose = round((63 - 24) * scale)
260
+ left_hand_pose = right_hand_pose = round(45 * scale)
261
+
262
+ expression = 100
263
+
264
+ b_j = 0
265
+ jaw_dim = jaw_pose
266
+ b_e = b_j + jaw_dim
267
+ eye_dim = leye_pose + reye_pose
268
+ b_b = b_e + eye_dim
269
+ body_dim = global_orient + body_pose
270
+ b_h = b_b + body_dim
271
+ hand_dim = left_hand_pose + right_hand_pose
272
+ b_f = b_h + hand_dim
273
+ face_dim = expression
274
+
275
+ self.dim_list = [b_j, b_e, b_b, b_h, b_f]
276
+ self.full_dim = jaw_dim + eye_dim + body_dim + hand_dim
277
+ self.pose = int(self.full_dim / round(3 * scale))
278
+ self.each_dim = [jaw_dim, eye_dim + body_dim, hand_dim, face_dim]
279
+
280
+ def __call__(self, bat):
281
+ assert (not self.args.infer), "infer mode"
282
+ self.global_step += 1
283
+
284
+ loss_dict = {}
285
+
286
+ aud, poses = bat['aud_feat'].to(self.device).to(torch.float32), bat['poses'].to(self.device).to(torch.float32)
287
+ expression = bat['expression'].to(self.device).to(torch.float32)
288
+ jaw = poses[:, :3, :]
289
+ poses = poses[:, self.c_index, :]
290
+
291
+ pred = self.generator(in_spec=aud)
292
+
293
+ D_loss, D_loss_dict = self.get_loss(
294
+ pred_poses=pred.detach(),
295
+ gt_poses=poses,
296
+ aud=aud,
297
+ mode='training_D',
298
+ )
299
+
300
+ self.discriminator_optimizer.zero_grad()
301
+ D_loss.backward()
302
+ self.discriminator_optimizer.step()
303
+
304
+ G_loss, G_loss_dict = self.get_loss(
305
+ pred_poses=pred,
306
+ gt_poses=poses,
307
+ aud=aud,
308
+ expression=expression,
309
+ jaw=jaw,
310
+ mode='training_G',
311
+ )
312
+ self.generator_optimizer.zero_grad()
313
+ G_loss.backward()
314
+ self.generator_optimizer.step()
315
+
316
+ total_loss = None
317
+ loss_dict = {}
318
+ for key in list(D_loss_dict.keys()) + list(G_loss_dict.keys()):
319
+ loss_dict[key] = G_loss_dict.get(key, 0) + D_loss_dict.get(key, 0)
320
+
321
+ return total_loss, loss_dict
322
+
323
+ def get_loss(self,
324
+ pred_poses,
325
+ gt_poses,
326
+ aud=None,
327
+ jaw=None,
328
+ expression=None,
329
+ mode='training_G',
330
+ ):
331
+ loss_dict = {}
332
+ aud = aud.transpose(1, 2)
333
+ gt_poses = gt_poses.transpose(1, 2)
334
+ gt_aud = torch.cat([gt_poses, aud], dim=2)
335
+ pred_aud = torch.cat([pred_poses[:, :, 103:], aud], dim=2)
336
+
337
+ if mode == 'training_D':
338
+ dis_real = self.discriminator(gt_aud)
339
+ dis_fake = self.discriminator(pred_aud)
340
+ dis_error = self.MSELoss(torch.ones_like(dis_real).to(self.device), dis_real) + self.MSELoss(
341
+ torch.zeros_like(dis_fake).to(self.device), dis_fake)
342
+ loss_dict['dis'] = dis_error
343
+
344
+ return dis_error, loss_dict
345
+ elif mode == 'training_G':
346
+ jaw_loss = self.L1Loss(pred_poses[:, :, :3], jaw.transpose(1, 2))
347
+ face_loss = self.MSELoss(pred_poses[:, :, 3:103], expression.transpose(1, 2))
348
+ body_loss = self.L1Loss(pred_poses[:, :, 103:142], gt_poses[:, :, :39])
349
+ hand_loss = self.L1Loss(pred_poses[:, :, 142:], gt_poses[:, :, 39:])
350
+ l1_loss = jaw_loss + face_loss + body_loss + hand_loss
351
+
352
+ dis_output = self.discriminator(pred_aud)
353
+ gen_error = self.MSELoss(torch.ones_like(dis_output).to(self.device), dis_output)
354
+ gen_loss = self.config.Train.weights.keypoint_loss_weight * l1_loss + self.config.Train.weights.gan_loss_weight * gen_error
355
+
356
+ loss_dict['gen'] = gen_error
357
+ loss_dict['jaw_loss'] = jaw_loss
358
+ loss_dict['face_loss'] = face_loss
359
+ loss_dict['body_loss'] = body_loss
360
+ loss_dict['hand_loss'] = hand_loss
361
+ return gen_loss, loss_dict
362
+ else:
363
+ raise ValueError(mode)
364
+
365
+ def infer_on_audio(self, aud_fn, fps=30, initial_pose=None, norm_stats=None, id=None, B=1, **kwargs):
366
+ output = []
367
+ assert self.args.infer, "train mode"
368
+ self.generator.eval()
369
+
370
+ if self.config.Data.pose.normalization:
371
+ assert norm_stats is not None
372
+ data_mean = norm_stats[0]
373
+ data_std = norm_stats[1]
374
+
375
+ pre_length = self.config.Data.pose.pre_pose_length
376
+ generate_length = self.config.Data.pose.generate_length
377
+ # assert pre_length == initial_pose.shape[-1]
378
+ # pre_poses = initial_pose.permute(0, 2, 1).to(self.device).to(torch.float32)
379
+ # B = pre_poses.shape[0]
380
+
381
+ aud_feat = get_mfcc_ta(aud_fn, sr=22000, fps=fps, smlpx=True, type='mfcc').transpose(1, 0)
382
+ num_poses_to_generate = aud_feat.shape[-1]
383
+ aud_feat = aud_feat[np.newaxis, ...].repeat(B, axis=0)
384
+ aud_feat = torch.tensor(aud_feat, dtype=torch.float32).to(self.device)
385
+
386
+ with torch.no_grad():
387
+ pred_poses = self.generator(aud_feat)
388
+ pred_poses = pred_poses.cpu().numpy()
389
+ output = pred_poses.squeeze()
390
+
391
+ return output
392
+
393
+ def generate(self, aud, id):
394
+ self.generator.eval()
395
+ pred_poses = self.generator(aud)
396
+ return pred_poses
397
+
398
+
399
+ if __name__ == '__main__':
400
+ from trainer.options import parse_args
401
+
402
+ parser = parse_args()
403
+ args = parser.parse_args(
404
+ ['--exp_name', '0', '--data_root', '0', '--speakers', '0', '--pre_pose_length', '4', '--generate_length', '64',
405
+ '--infer'])
406
+
407
+ generator = TrainWrapper(args)
408
+
409
+ aud_fn = '../sample_audio/jon.wav'
410
+ initial_pose = torch.randn(64, 108, 4)
411
+ norm_stats = (np.random.randn(108), np.random.randn(108))
412
+ output = generator.infer_on_audio(aud_fn, initial_pose, norm_stats)
413
+
414
+ print(output.shape)
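Conv1d_tf above reproduces TensorFlow's "same" padding: the output length is ceil(T / stride), with any odd amount of padding added on the right. A standalone sketch of that arithmetic, with hypothetical sizes and plain torch calls rather than the class itself:

```python
import torch
import torch.nn.functional as F

T, k, s = 25, 4, 1                               # length, kernel, stride
out_size = (T + s - 1) // s                      # ceil(T / s) = 25
total_pad = max(0, (out_size - 1) * s + k - T)   # 3

x = torch.randn(2, 8, T)
if total_pad % 2:                                # odd total padding: pad one extra step on the right
    x = F.pad(x, [0, 1])
w = torch.randn(16, 8, k)
y = F.conv1d(x, w, padding=total_pad // 2)
print(y.shape)                                   # torch.Size([2, 16, 25])
```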
nets/__init__.py ADDED
@@ -0,0 +1,8 @@
1
+ from .smplx_face import TrainWrapper as s2g_face
2
+ from .smplx_body_vq import TrainWrapper as s2g_body_vq
3
+ from .smplx_body_pixel import TrainWrapper as s2g_body_pixel
4
+ from .body_ae import TrainWrapper as s2g_body_ae
5
+ from .LS3DCG import TrainWrapper as LS3DCG
6
+ from .base import TrainWrapperBaseClass
7
+
8
+ from .utils import normalize, denormalize
nets/base.py ADDED
@@ -0,0 +1,89 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ import torch.optim as optim
4
+
5
+ class TrainWrapperBaseClass():
6
+ def __init__(self, args, config) -> None:
7
+ self.init_optimizer()
8
+
9
+ def init_optimizer(self) -> None:
10
+ print('using Adam')
11
+ self.generator_optimizer = optim.Adam(
12
+ self.generator.parameters(),
13
+ lr = self.config.Train.learning_rate.generator_learning_rate,
14
+ betas=[0.9, 0.999]
15
+ )
16
+ if self.discriminator is not None:
17
+ self.discriminator_optimizer = optim.Adam(
18
+ self.discriminator.parameters(),
19
+ lr = self.config.Train.learning_rate.discriminator_learning_rate,
20
+ betas=[0.9, 0.999]
21
+ )
22
+
23
+ def __call__(self, bat):
24
+ raise NotImplementedError
25
+
26
+ def get_loss(self, **kwargs):
27
+ raise NotImplementedError
28
+
29
+ def state_dict(self):
30
+ model_state = {
31
+ 'generator': self.generator.state_dict(),
32
+ 'generator_optim': self.generator_optimizer.state_dict(),
33
+ 'discriminator': self.discriminator.state_dict() if self.discriminator is not None else None,
34
+ 'discriminator_optim': self.discriminator_optimizer.state_dict() if self.discriminator is not None else None
35
+ }
36
+ return model_state
37
+
38
+ def parameters(self):
39
+ return self.generator.parameters()
40
+
41
+ def load_state_dict(self, state_dict):
42
+ if 'generator' in state_dict:
43
+ self.generator.load_state_dict(state_dict['generator'])
44
+ else:
45
+ self.generator.load_state_dict(state_dict)
46
+
47
+ if 'generator_optim' in state_dict and self.generator_optimizer is not None:
48
+ self.generator_optimizer.load_state_dict(state_dict['generator_optim'])
49
+
50
+ if self.discriminator is not None:
51
+ self.discriminator.load_state_dict(state_dict['discriminator'])
52
+
53
+ if 'discriminator_optim' in state_dict and self.discriminator_optimizer is not None:
54
+ self.discriminator_optimizer.load_state_dict(state_dict['discriminator_optim'])
55
+
56
+ def infer_on_audio(self, aud_fn, initial_pose=None, norm_stats=None, **kwargs):
57
+ raise NotImplementedError
58
+
59
+ def init_params(self):
60
+ if self.config.Data.pose.convert_to_6d:
61
+ scale = 2
62
+ else:
63
+ scale = 1
64
+
65
+ global_orient = round(0 * scale)
66
+ leye_pose = reye_pose = round(0 * scale)
67
+ jaw_pose = round(0 * scale)
68
+ body_pose = round((63 - 24) * scale)
69
+ left_hand_pose = right_hand_pose = round(45 * scale)
70
+ if self.expression:
71
+ expression = 100
72
+ else:
73
+ expression = 0
74
+
75
+ b_j = 0
76
+ jaw_dim = jaw_pose
77
+ b_e = b_j + jaw_dim
78
+ eye_dim = leye_pose + reye_pose
79
+ b_b = b_e + eye_dim
80
+ body_dim = global_orient + body_pose
81
+ b_h = b_b + body_dim
82
+ hand_dim = left_hand_pose + right_hand_pose
83
+ b_f = b_h + hand_dim
84
+ face_dim = expression
85
+
86
+ self.dim_list = [b_j, b_e, b_b, b_h, b_f]
87
+ self.full_dim = jaw_dim + eye_dim + body_dim + hand_dim
88
+ self.pose = int(self.full_dim / round(3 * scale))
89
+ self.each_dim = [jaw_dim, eye_dim + body_dim, hand_dim, face_dim]
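The bookkeeping in init_params is easier to follow with concrete numbers. Here is a worked example for the axis-angle case (scale = 1, expression enabled); it is plain arithmetic for illustration, not repository code.

```python
scale = 1
jaw_dim = 0 * scale
eye_dim = 0 * scale + 0 * scale
body_dim = 0 * scale + (63 - 24) * scale   # 39: body pose without global orient
hand_dim = 45 * scale + 45 * scale         # 90: left + right hand
face_dim = 100                             # expression coefficients

full_dim = jaw_dim + eye_dim + body_dim + hand_dim
print(full_dim, full_dim // (3 * scale))                   # 129 channels -> 43 joints
print([jaw_dim, eye_dim + body_dim, hand_dim, face_dim])   # each_dim = [0, 39, 90, 100]
```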
nets/body_ae.py ADDED
@@ -0,0 +1,152 @@
1
+ import os
2
+ import sys
3
+
4
+ sys.path.append(os.getcwd())
5
+
6
+ from nets.base import TrainWrapperBaseClass
7
+ from nets.spg.s2glayers import Discriminator as D_S2G
8
+ from nets.spg.vqvae_1d import AE as s2g_body
9
+ import torch
10
+ import torch.optim as optim
11
+ import torch.nn.functional as F
12
+
13
+ from data_utils.lower_body import c_index, c_index_3d, c_index_6d
14
+
15
+
16
+ def separate_aa(aa):
17
+ aa = aa[:, :, :].reshape(aa.shape[0], aa.shape[1], -1, 5)
18
+ axis = F.normalize(aa[:, :, :, :3], dim=-1)
19
+ angle = F.normalize(aa[:, :, :, 3:5], dim=-1)
20
+ return axis, angle
21
+
22
+
23
+ class TrainWrapper(TrainWrapperBaseClass):
24
+ '''
25
+ a wrapper receving a batch from data_utils and calculate loss
26
+ '''
27
+
28
+ def __init__(self, args, config):
29
+ self.args = args
30
+ self.config = config
31
+ self.device = torch.device(self.args.gpu)
32
+ self.global_step = 0
33
+
34
+ self.gan = False
35
+ self.convert_to_6d = self.config.Data.pose.convert_to_6d
36
+ self.preleng = self.config.Data.pose.pre_pose_length
37
+ self.expression = self.config.Data.pose.expression
38
+ self.epoch = 0
39
+ self.init_params()
40
+ self.num_classes = 4
41
+ self.g = s2g_body(self.each_dim[1] + self.each_dim[2], embedding_dim=64, num_embeddings=0,
42
+ num_hiddens=1024, num_residual_layers=2, num_residual_hiddens=512).to(self.device)
43
+ if self.gan:
44
+ self.discriminator = D_S2G(
45
+ pose_dim=110 + 64, pose=self.pose
46
+ ).to(self.device)
47
+ else:
48
+ self.discriminator = None
49
+
50
+ if self.convert_to_6d:
51
+ self.c_index = c_index_6d
52
+ else:
53
+ self.c_index = c_index_3d
54
+
55
+ super().__init__(args, config)
56
+
57
+ def init_optimizer(self):
58
+
59
+ self.g_optimizer = optim.Adam(
60
+ self.g.parameters(),
61
+ lr=self.config.Train.learning_rate.generator_learning_rate,
62
+ betas=[0.9, 0.999]
63
+ )
64
+
65
+ def state_dict(self):
66
+ model_state = {
67
+ 'g': self.g.state_dict(),
68
+ 'g_optim': self.g_optimizer.state_dict(),
69
+ 'discriminator': self.discriminator.state_dict() if self.discriminator is not None else None,
70
+ 'discriminator_optim': self.discriminator_optimizer.state_dict() if self.discriminator is not None else None
71
+ }
72
+ return model_state
73
+
74
+
75
+ def __call__(self, bat):
76
+ # assert (not self.args.infer), "infer mode"
77
+ self.global_step += 1
78
+
79
+ total_loss = None
80
+ loss_dict = {}
81
+
82
+ aud, poses = bat['aud_feat'].to(self.device).to(torch.float32), bat['poses'].to(self.device).to(torch.float32)
83
+
84
+ # id = bat['speaker'].to(self.device) - 20
85
+ # id = F.one_hot(id, self.num_classes)
86
+
87
+ poses = poses[:, self.c_index, :]
88
+ gt_poses = poses[:, :, self.preleng:].permute(0, 2, 1)
89
+
90
+ loss = 0
91
+ loss_dict, loss = self.vq_train(gt_poses[:, :], 'g', self.g, loss_dict, loss)
92
+
93
+ return total_loss, loss_dict
94
+
95
+ def vq_train(self, gt, name, model, dict, total_loss, pre=None):
96
+ x_recon = model(gt_poses=gt, pre_state=pre)
97
+ loss, loss_dict = self.get_loss(pred_poses=x_recon, gt_poses=gt, pre=pre)
98
+ # total_loss = total_loss + loss
99
+
100
+ if name == 'g':
101
+ optimizer_name = 'g_optimizer'
102
+
103
+ optimizer = getattr(self, optimizer_name)
104
+ optimizer.zero_grad()
105
+ loss.backward()
106
+ optimizer.step()
107
+
108
+ for key in list(loss_dict.keys()):
109
+ dict[name + key] = loss_dict.get(key, 0).item()
110
+ return dict, total_loss
111
+
112
+ def get_loss(self,
113
+ pred_poses,
114
+ gt_poses,
115
+ pre=None
116
+ ):
117
+ loss_dict = {}
118
+
119
+
120
+ rec_loss = torch.mean(torch.abs(pred_poses - gt_poses))
121
+ v_pr = pred_poses[:, 1:] - pred_poses[:, :-1]
122
+ v_gt = gt_poses[:, 1:] - gt_poses[:, :-1]
123
+ velocity_loss = torch.mean(torch.abs(v_pr - v_gt))
124
+
125
+ if pre is None:
126
+ f0_vel = 0
127
+ else:
128
+ v0_pr = pred_poses[:, 0] - pre[:, -1]
129
+ v0_gt = gt_poses[:, 0] - pre[:, -1]
130
+ f0_vel = torch.mean(torch.abs(v0_pr - v0_gt))
131
+
132
+ gen_loss = rec_loss + velocity_loss + f0_vel
133
+
134
+ loss_dict['rec_loss'] = rec_loss
135
+ loss_dict['velocity_loss'] = velocity_loss
136
+ # loss_dict['e_q_loss'] = e_q_loss
137
+ if pre is not None:
138
+ loss_dict['f0_vel'] = f0_vel
139
+
140
+ return gen_loss, loss_dict
141
+
142
+ def load_state_dict(self, state_dict):
143
+ self.g.load_state_dict(state_dict['g'])
144
+
145
+ def extract(self, x):
146
+ self.g.eval()
147
+ if x.shape[2] > self.full_dim:
148
+ if x.shape[2] == 239:
149
+ x = x[:, :, 102:]
150
+ x = x[:, :, self.c_index]
151
+ feat = self.g.encode(x)
152
+ return feat.transpose(1, 2), x
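get_loss above combines an L1 reconstruction term with an L1 penalty on frame-to-frame velocities (plus a first-frame velocity term when a pre-state is given). A minimal sketch with synthetic tensors; the shapes are assumptions, not values taken from the repository.

```python
import torch

pred = torch.randn(2, 88, 129)   # (B, seq_len, pose_dim), hypothetical sizes
gt = torch.randn(2, 88, 129)

rec_loss = torch.mean(torch.abs(pred - gt))
v_pr = pred[:, 1:] - pred[:, :-1]
v_gt = gt[:, 1:] - gt[:, :-1]
velocity_loss = torch.mean(torch.abs(v_pr - v_gt))
print((rec_loss + velocity_loss).item())
```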
nets/init_model.py ADDED
@@ -0,0 +1,35 @@
1
+ from nets import *
2
+
3
+
4
+ def init_model(model_name, args, config):
5
+
6
+ if model_name == 's2g_face':
7
+ generator = s2g_face(
8
+ args,
9
+ config,
10
+ )
11
+ elif model_name == 's2g_body_vq':
12
+ generator = s2g_body_vq(
13
+ args,
14
+ config,
15
+ )
16
+ elif model_name == 's2g_body_pixel':
17
+ generator = s2g_body_pixel(
18
+ args,
19
+ config,
20
+ )
21
+ elif model_name == 's2g_body_ae':
22
+ generator = s2g_body_ae(
23
+ args,
24
+ config,
25
+ )
26
+ elif model_name == 's2g_LS3DCG':
27
+ generator = LS3DCG(
28
+ args,
29
+ config,
30
+ )
31
+ else:
32
+ raise ValueError(model_name)
33
+ return generator
34
+
35
+
nets/layers.py ADDED
@@ -0,0 +1,1052 @@
1
+ import os
2
+ import sys
3
+
4
+ sys.path.append(os.getcwd())
5
+
6
+ import torch
7
+ import torch.nn as nn
8
+ import numpy as np
9
+
10
+
11
+ # TODO: be aware of the actual network structures
12
+
13
+ def get_log(x):
14
+ log = 0
15
+ while x > 1:
16
+ if x % 2 == 0:
17
+ x = x // 2
18
+ log += 1
19
+ else:
20
+ raise ValueError('x is not a power of 2')
21
+
22
+ return log
23
+
24
+
25
+ class ConvNormRelu(nn.Module):
26
+ '''
27
+ (B,C_in,H,W) -> (B, C_out, H, W)
28
+ some kernel sizes make the output length differ from H/s
29
+ # TODO: there might be some problems with the residual path
30
+ '''
31
+
32
+ def __init__(self,
33
+ in_channels,
34
+ out_channels,
35
+ type='1d',
36
+ leaky=False,
37
+ downsample=False,
38
+ kernel_size=None,
39
+ stride=None,
40
+ padding=None,
41
+ p=0,
42
+ groups=1,
43
+ residual=False,
44
+ norm='bn'):
45
+ '''
46
+ conv-bn-relu
47
+ '''
48
+ super(ConvNormRelu, self).__init__()
49
+ self.residual = residual
50
+ self.norm_type = norm
51
+ # kernel_size = k
52
+ # stride = s
53
+
54
+ if kernel_size is None and stride is None:
55
+ if not downsample:
56
+ kernel_size = 3
57
+ stride = 1
58
+ else:
59
+ kernel_size = 4
60
+ stride = 2
61
+
62
+ if padding is None:
63
+ if isinstance(kernel_size, int) and isinstance(stride, tuple):
64
+ padding = tuple(int((kernel_size - st) / 2) for st in stride)
65
+ elif isinstance(kernel_size, tuple) and isinstance(stride, int):
66
+ padding = tuple(int((ks - stride) / 2) for ks in kernel_size)
67
+ elif isinstance(kernel_size, tuple) and isinstance(stride, tuple):
68
+ padding = tuple(int((ks - st) / 2) for ks, st in zip(kernel_size, stride))
69
+ else:
70
+ padding = int((kernel_size - stride) / 2)
71
+
72
+ if self.residual:
73
+ if downsample:
74
+ if type == '1d':
75
+ self.residual_layer = nn.Sequential(
76
+ nn.Conv1d(
77
+ in_channels=in_channels,
78
+ out_channels=out_channels,
79
+ kernel_size=kernel_size,
80
+ stride=stride,
81
+ padding=padding
82
+ )
83
+ )
84
+ elif type == '2d':
85
+ self.residual_layer = nn.Sequential(
86
+ nn.Conv2d(
87
+ in_channels=in_channels,
88
+ out_channels=out_channels,
89
+ kernel_size=kernel_size,
90
+ stride=stride,
91
+ padding=padding
92
+ )
93
+ )
94
+ else:
95
+ if in_channels == out_channels:
96
+ self.residual_layer = nn.Identity()
97
+ else:
98
+ if type == '1d':
99
+ self.residual_layer = nn.Sequential(
100
+ nn.Conv1d(
101
+ in_channels=in_channels,
102
+ out_channels=out_channels,
103
+ kernel_size=kernel_size,
104
+ stride=stride,
105
+ padding=padding
106
+ )
107
+ )
108
+ elif type == '2d':
109
+ self.residual_layer = nn.Sequential(
110
+ nn.Conv2d(
111
+ in_channels=in_channels,
112
+ out_channels=out_channels,
113
+ kernel_size=kernel_size,
114
+ stride=stride,
115
+ padding=padding
116
+ )
117
+ )
118
+
119
+ in_channels = in_channels * groups
120
+ out_channels = out_channels * groups
121
+ if type == '1d':
122
+ self.conv = nn.Conv1d(in_channels=in_channels, out_channels=out_channels,
123
+ kernel_size=kernel_size, stride=stride, padding=padding,
124
+ groups=groups)
125
+ self.norm = nn.BatchNorm1d(out_channels)
126
+ self.dropout = nn.Dropout(p=p)
127
+ elif type == '2d':
128
+ self.conv = nn.Conv2d(in_channels=in_channels, out_channels=out_channels,
129
+ kernel_size=kernel_size, stride=stride, padding=padding,
130
+ groups=groups)
131
+ self.norm = nn.BatchNorm2d(out_channels)
132
+ self.dropout = nn.Dropout2d(p=p)
133
+ if norm == 'gn':
134
+ self.norm = nn.GroupNorm(2, out_channels)
135
+ elif norm == 'ln':
136
+ self.norm = nn.LayerNorm(out_channels)
137
+ if leaky:
138
+ self.relu = nn.LeakyReLU(negative_slope=0.2)
139
+ else:
140
+ self.relu = nn.ReLU()
141
+
142
+ def forward(self, x, **kwargs):
143
+ if self.norm_type == 'ln':
144
+ out = self.dropout(self.conv(x))
145
+ out = self.norm(out.transpose(1,2)).transpose(1,2)
146
+ else:
147
+ out = self.norm(self.dropout(self.conv(x)))
148
+ if self.residual:
149
+ residual = self.residual_layer(x)
150
+ out += residual
151
+ return self.relu(out)
152
+
153
+
154
+ class UNet1D(nn.Module):
155
+ def __init__(self,
156
+ input_channels,
157
+ output_channels,
158
+ max_depth=5,
159
+ kernel_size=None,
160
+ stride=None,
161
+ p=0,
162
+ groups=1):
163
+ super(UNet1D, self).__init__()
164
+ self.pre_downsampling_conv = nn.ModuleList([])
165
+ self.conv1 = nn.ModuleList([])
166
+ self.conv2 = nn.ModuleList([])
167
+ self.upconv = nn.Upsample(scale_factor=2, mode='nearest')
168
+ self.max_depth = max_depth
169
+ self.groups = groups
170
+
171
+ self.pre_downsampling_conv.append(ConvNormRelu(input_channels, output_channels,
172
+ type='1d', leaky=True, downsample=False,
173
+ kernel_size=kernel_size, stride=stride, p=p, groups=groups))
174
+ self.pre_downsampling_conv.append(ConvNormRelu(output_channels, output_channels,
175
+ type='1d', leaky=True, downsample=False,
176
+ kernel_size=kernel_size, stride=stride, p=p, groups=groups))
177
+
178
+ for i in range(self.max_depth):
179
+ self.conv1.append(ConvNormRelu(output_channels, output_channels,
180
+ type='1d', leaky=True, downsample=True,
181
+ kernel_size=kernel_size, stride=stride, p=p, groups=groups))
182
+
183
+ for i in range(self.max_depth):
184
+ self.conv2.append(ConvNormRelu(output_channels, output_channels,
185
+ type='1d', leaky=True, downsample=False,
186
+ kernel_size=kernel_size, stride=stride, p=p, groups=groups))
187
+
188
+ def forward(self, x):
189
+
190
+ input_size = x.shape[-1]
191
+
192
+ assert get_log(
193
+ input_size) >= self.max_depth, 'num_frames must be a power of 2 and its power must be greater than max_depth'
194
+
195
+ x = nn.Sequential(*self.pre_downsampling_conv)(x)
196
+
197
+ residuals = []
198
+ residuals.append(x)
199
+ for i, conv1 in enumerate(self.conv1):
200
+ x = conv1(x)
201
+ if i < self.max_depth - 1:
202
+ residuals.append(x)
203
+
204
+ for i, conv2 in enumerate(self.conv2):
205
+ x = self.upconv(x) + residuals[self.max_depth - i - 1]
206
+ x = conv2(x)
207
+
208
+ return x
209
+
210
+
211
+ class UNet2D(nn.Module):
212
+ def __init__(self):
213
+ super(UNet2D, self).__init__()
214
+ raise NotImplementedError('2D UNet is weird')
215
+
216
+
217
+ class AudioPoseEncoder1D(nn.Module):
218
+ '''
219
+ (B, C, T) -> (B, C*2, T) -> ... -> (B, C_out, T)
220
+ '''
221
+
222
+ def __init__(self,
223
+ C_in,
224
+ C_out,
225
+ kernel_size=None,
226
+ stride=None,
227
+ min_layer_nums=None
228
+ ):
229
+ super(AudioPoseEncoder1D, self).__init__()
230
+ self.C_in = C_in
231
+ self.C_out = C_out
232
+
233
+ conv_layers = nn.ModuleList([])
234
+ cur_C = C_in
235
+ num_layers = 0
236
+ while cur_C < self.C_out:
237
+ conv_layers.append(ConvNormRelu(
238
+ in_channels=cur_C,
239
+ out_channels=cur_C * 2,
240
+ kernel_size=kernel_size,
241
+ stride=stride
242
+ ))
243
+ cur_C *= 2
244
+ num_layers += 1
245
+
246
+ if (cur_C != C_out) or (min_layer_nums is not None and num_layers < min_layer_nums):
247
+ while (cur_C != C_out) or num_layers < min_layer_nums:
248
+ conv_layers.append(ConvNormRelu(
249
+ in_channels=cur_C,
250
+ out_channels=C_out,
251
+ kernel_size=kernel_size,
252
+ stride=stride
253
+ ))
254
+ num_layers += 1
255
+ cur_C = C_out
256
+
257
+ self.conv_layers = nn.Sequential(*conv_layers)
258
+
259
+ def forward(self, x):
260
+ '''
261
+ x: (B, C, T)
262
+ '''
263
+ x = self.conv_layers(x)
264
+ return x
265
+
266
+
267
+ class AudioPoseEncoder2D(nn.Module):
268
+ '''
269
+ (B, C, T) -> (B, 1, C, T) -> ... -> (B, C_out, T)
270
+ '''
271
+
272
+ def __init__(self):
273
+ raise NotImplementedError
274
+
275
+
276
+ class AudioPoseEncoderRNN(nn.Module):
277
+ '''
278
+ (B, C, T)->(B, T, C)->(B, T, C_out)->(B, C_out, T)
279
+ '''
280
+
281
+ def __init__(self,
282
+ C_in,
283
+ hidden_size,
284
+ num_layers,
285
+ rnn_cell='gru',
286
+ bidirectional=False
287
+ ):
288
+ super(AudioPoseEncoderRNN, self).__init__()
289
+ if rnn_cell == 'gru':
290
+ self.cell = nn.GRU(input_size=C_in, hidden_size=hidden_size, num_layers=num_layers, batch_first=True,
291
+ bidirectional=bidirectional)
292
+ elif rnn_cell == 'lstm':
293
+ self.cell = nn.LSTM(input_size=C_in, hidden_size=hidden_size, num_layers=num_layers, batch_first=True,
294
+ bidirectional=bidirectional)
295
+ else:
296
+ raise ValueError('invalid rnn cell:%s' % (rnn_cell))
297
+
298
+ def forward(self, x, state=None):
299
+
300
+ x = x.permute(0, 2, 1)
301
+ x, state = self.cell(x, state)
302
+ x = x.permute(0, 2, 1)
303
+
304
+ return x
305
+
306
+
307
+ class AudioPoseEncoderGraph(nn.Module):
308
+ '''
309
+ (B, C, T)->(B, 2, V, T)->(B, 2, T, V)->(B, D, T, V)
310
+ '''
311
+
312
+ def __init__(self,
313
+ layers_config, # expected to be a list of (C_in, C_out, kernel_size) tuples
314
+ A, # adjacent matrix (num_parts, V, V)
315
+ residual,
316
+ local_bn=False,
317
+ share_weights=False
318
+ ) -> None:
319
+ super().__init__()
320
+ self.A = A
321
+ self.num_joints = A.shape[1]
322
+ self.num_parts = A.shape[0]
323
+ self.C_in = layers_config[0][0]
324
+ self.C_out = layers_config[-1][1]
325
+
326
+ self.conv_layers = nn.ModuleList([
327
+ GraphConvNormRelu(
328
+ C_in=c_in,
329
+ C_out=c_out,
330
+ A=self.A,
331
+ residual=residual,
332
+ local_bn=local_bn,
333
+ kernel_size=k,
334
+ share_weights=share_weights
335
+ ) for (c_in, c_out, k) in layers_config
336
+ ])
337
+
338
+ self.conv_layers = nn.Sequential(*self.conv_layers)
339
+
340
+ def forward(self, x):
341
+ '''
342
+ x: (B, C, T), C should be num_joints*D
343
+ output: (B, D, T, V)
344
+ '''
345
+ B, C, T = x.shape
346
+ x = x.view(B, self.num_joints, self.C_in, T)  # (B, V, D, T); D is the per-joint feature dim, note that V comes before D here
347
+ x = x.permute(0, 2, 3, 1) # (B, D, T, V)
348
+ assert x.shape[1] == self.C_in
349
+
350
+ x_conved = self.conv_layers(x)
351
+
352
+ # x_conved = x_conved.permute(0, 3, 1, 2).contiguous().view(B, self.C_out*self.num_joints, T)#(B, V*C_out, T)
353
+
354
+ return x_conved
355
+
356
+
357
+ class SeqEncoder2D(nn.Module):
358
+ '''
359
+ seq_encoder, encoding a seq to a vector
360
+ (B, C, T)->(B, 2, V, T)->(B, 2, T, V) -> (B, 32, )->...->(B, C_out)
361
+ '''
362
+
363
+ def __init__(self,
364
+ C_in, # should be 2
365
+ T_in,
366
+ C_out,
367
+ num_joints,
368
+ min_layer_num=None,
369
+ residual=False
370
+ ):
371
+ super(SeqEncoder2D, self).__init__()
372
+ self.C_in = C_in
373
+ self.C_out = C_out
374
+ self.T_in = T_in
375
+ self.num_joints = num_joints
376
+
377
+ conv_layers = nn.ModuleList([])
378
+ conv_layers.append(ConvNormRelu(
379
+ in_channels=C_in,
380
+ out_channels=32,
381
+ type='2d',
382
+ residual=residual
383
+ ))
384
+
385
+ cur_C = 32
386
+ cur_H = T_in
387
+ cur_W = num_joints
388
+ num_layers = 1
389
+ while (cur_C < C_out) or (cur_H > 1) or (cur_W > 1):
390
+ ks = [3, 3]
391
+ st = [1, 1]
392
+
393
+ if cur_H > 1:
394
+ if cur_H > 4:
395
+ ks[0] = 4
396
+ st[0] = 2
397
+ else:
398
+ ks[0] = cur_H
399
+ st[0] = cur_H
400
+ if cur_W > 1:
401
+ if cur_W > 4:
402
+ ks[1] = 4
403
+ st[1] = 2
404
+ else:
405
+ ks[1] = cur_W
406
+ st[1] = cur_W
407
+
408
+ conv_layers.append(ConvNormRelu(
409
+ in_channels=cur_C,
410
+ out_channels=min(C_out, cur_C * 2),
411
+ type='2d',
412
+ kernel_size=tuple(ks),
413
+ stride=tuple(st),
414
+ residual=residual
415
+ ))
416
+ cur_C = min(cur_C * 2, C_out)
417
+ if cur_H > 1:
418
+ if cur_H > 4:
419
+ cur_H //= 2
420
+ else:
421
+ cur_H = 1
422
+ if cur_W > 1:
423
+ if cur_W > 4:
424
+ cur_W //= 2
425
+ else:
426
+ cur_W = 1
427
+ num_layers += 1
428
+
429
+ if min_layer_num is not None and (num_layers < min_layer_num):
430
+ while num_layers < min_layer_num:
431
+ conv_layers.append(ConvNormRelu(
432
+ in_channels=C_out,
433
+ out_channels=C_out,
434
+ type='2d',
435
+ kernel_size=1,
436
+ stride=1,
437
+ residual=residual
438
+ ))
439
+ num_layers += 1
440
+
441
+ self.conv_layers = nn.Sequential(*conv_layers)
442
+ self.num_layers = num_layers
443
+
444
+ def forward(self, x):
445
+ B, C, T = x.shape
446
+ x = x.view(B, self.num_joints, self.C_in, T) # (B, V, D, T) V in front
447
+ x = x.permute(0, 2, 3, 1) # (B, D, T, V)
448
+ assert x.shape[1] == self.C_in and x.shape[-1] == self.num_joints
449
+
450
+ x = self.conv_layers(x)
451
+ return x.squeeze()
452
+
453
+
454
+ class SeqEncoder1D(nn.Module):
455
+ '''
456
+ (B, C, T)->(B, D)
457
+ '''
458
+
459
+ def __init__(self,
460
+ C_in,
461
+ C_out,
462
+ T_in,
463
+ min_layer_nums=None
464
+ ):
465
+ super(SeqEncoder1D, self).__init__()
466
+ conv_layers = nn.ModuleList([])
467
+ cur_C = C_in
468
+ cur_T = T_in
469
+ self.num_layers = 0
470
+ while (cur_C < C_out) or (cur_T > 1):
471
+ ks = 3
472
+ st = 1
473
+ if cur_T > 1:
474
+ if cur_T > 4:
475
+ ks = 4
476
+ st = 2
477
+ else:
478
+ ks = cur_T
479
+ st = cur_T
480
+
481
+ conv_layers.append(ConvNormRelu(
482
+ in_channels=cur_C,
483
+ out_channels=min(C_out, cur_C * 2),
484
+ type='1d',
485
+ kernel_size=ks,
486
+ stride=st
487
+ ))
488
+ cur_C = min(cur_C * 2, C_out)
489
+ if cur_T > 1:
490
+ if cur_T > 4:
491
+ cur_T = cur_T // 2
492
+ else:
493
+ cur_T = 1
494
+ self.num_layers += 1
495
+
496
+ if min_layer_nums is not None and (self.num_layers < min_layer_nums):
497
+ while self.num_layers < min_layer_nums:
498
+ conv_layers.append(ConvNormRelu(
499
+ in_channels=C_out,
500
+ out_channels=C_out,
501
+ type='1d',
502
+ kernel_size=1,
503
+ stride=1
504
+ ))
505
+ self.num_layers += 1
506
+ self.conv_layers = nn.Sequential(*conv_layers)
507
+
508
+ def forward(self, x):
509
+ x = self.conv_layers(x)
510
+ return x.squeeze()
511
+
512
+
513
+ class SeqEncoderRNN(nn.Module):
514
+ '''
515
+ (B, C, T) -> (B, T, C) -> (B, D)
516
+ LSTM/GRU-FC
517
+ '''
518
+
519
+ def __init__(self,
520
+ hidden_size,
521
+ in_size,
522
+ num_rnn_layers,
523
+ rnn_cell='gru',
524
+ bidirectional=False
525
+ ):
526
+ super(SeqEncoderRNN, self).__init__()
527
+ self.hidden_size = hidden_size
528
+ self.in_size = in_size
529
+ self.num_rnn_layers = num_rnn_layers
530
+ self.bidirectional = bidirectional
531
+
532
+ if rnn_cell == 'gru':
533
+ self.cell = nn.GRU(input_size=self.in_size, hidden_size=self.hidden_size, num_layers=self.num_rnn_layers,
534
+ batch_first=True, bidirectional=bidirectional)
535
+ elif rnn_cell == 'lstm':
536
+ self.cell = nn.LSTM(input_size=self.in_size, hidden_size=self.hidden_size, num_layers=self.num_rnn_layers,
537
+ batch_first=True, bidirectional=bidirectional)
538
+
539
+ def forward(self, x, state=None):
540
+
541
+ x = x.permute(0, 2, 1)
542
+ B, T, C = x.shape
543
+ x, _ = self.cell(x, state)
544
+ if self.bidirectional:
545
+ out = torch.cat([x[:, -1, :self.hidden_size], x[:, 0, self.hidden_size:]], dim=-1)
546
+ else:
547
+ out = x[:, -1, :]
548
+ assert out.shape[0] == B
549
+ return out
550
+
551
+
552
+ class SeqEncoderGraph(nn.Module):
553
+ '''
+ graph-conv sequence encoder: (B, C, T) -> (B, embedding_size)
554
+ '''
555
+
556
+ def __init__(self,
557
+ embedding_size,
558
+ layer_configs,
559
+ residual,
560
+ local_bn,
561
+ A,
562
+ T,
563
+ share_weights=False
564
+ ) -> None:
565
+ super().__init__()
566
+
567
+ self.C_in = layer_configs[0][0]
568
+ self.C_out = embedding_size
569
+
570
+ self.num_joints = A.shape[1]
571
+
572
+ self.graph_encoder = AudioPoseEncoderGraph(
573
+ layers_config=layer_configs,
574
+ A=A,
575
+ residual=residual,
576
+ local_bn=local_bn,
577
+ share_weights=share_weights
578
+ )
579
+
580
+ cur_C = layer_configs[-1][1]
581
+ self.spatial_pool = ConvNormRelu(
582
+ in_channels=cur_C,
583
+ out_channels=cur_C,
584
+ type='2d',
585
+ kernel_size=(1, self.num_joints),
586
+ stride=(1, 1),
587
+ padding=(0, 0)
588
+ )
589
+
590
+ temporal_pool = nn.ModuleList([])
591
+ cur_H = T
592
+ num_layers = 0
593
+ self.temporal_conv_info = []
594
+ while cur_C < self.C_out or cur_H > 1:
595
+ self.temporal_conv_info.append(cur_C)
596
+ ks = [3, 1]
597
+ st = [1, 1]
598
+
599
+ if cur_H > 1:
600
+ if cur_H > 4:
601
+ ks[0] = 4
602
+ st[0] = 2
603
+ else:
604
+ ks[0] = cur_H
605
+ st[0] = cur_H
606
+
607
+ temporal_pool.append(ConvNormRelu(
608
+ in_channels=cur_C,
609
+ out_channels=min(self.C_out, cur_C * 2),
610
+ type='2d',
611
+ kernel_size=tuple(ks),
612
+ stride=tuple(st)
613
+ ))
614
+ cur_C = min(cur_C * 2, self.C_out)
615
+
616
+ if cur_H > 1:
617
+ if cur_H > 4:
618
+ cur_H //= 2
619
+ else:
620
+ cur_H = 1
621
+
622
+ num_layers += 1
623
+
624
+ self.temporal_pool = nn.Sequential(*temporal_pool)
625
+ print("graph seq encoder info: temporal pool:", self.temporal_conv_info)
626
+ self.num_layers = num_layers
627
+ # need fc?
628
+
629
+ def forward(self, x):
630
+ '''
631
+ x: (B, C, T)
632
+ '''
633
+ B, C, T = x.shape
634
+ x = self.graph_encoder(x)
635
+ x = self.spatial_pool(x)
636
+ x = self.temporal_pool(x)
637
+ x = x.view(B, self.C_out)
638
+
639
+ return x
640
+
641
+
642
+ class SeqDecoder2D(nn.Module):
643
+ '''
644
+ (B, D)->(B, D, 1, 1)->(B, C_out, C, T)->(B, C_out, T)
645
+ '''
646
+
647
+ def __init__(self):
648
+ super(SeqDecoder2D, self).__init__()
649
+ raise NotImplementedError
650
+
651
+
652
+ class SeqDecoder1D(nn.Module):
653
+ '''
654
+ (B, D)->(B, D, 1)->...->(B, C_out, T)
655
+ '''
656
+
657
+ def __init__(self,
658
+ D_in,
659
+ C_out,
660
+ T_out,
661
+ min_layer_num=None
662
+ ):
663
+ super(SeqDecoder1D, self).__init__()
664
+ self.T_out = T_out
665
+ self.min_layer_num = min_layer_num
666
+
667
+ cur_t = 1
668
+
669
+ self.pre_conv = ConvNormRelu(
670
+ in_channels=D_in,
671
+ out_channels=C_out,
672
+ type='1d'
673
+ )
674
+ self.num_layers = 1
675
+ self.upconv = nn.Upsample(scale_factor=2, mode='nearest')
676
+ self.conv_layers = nn.ModuleList([])
677
+ cur_t *= 2
678
+ while cur_t <= T_out:
679
+ self.conv_layers.append(ConvNormRelu(
680
+ in_channels=C_out,
681
+ out_channels=C_out,
682
+ type='1d'
683
+ ))
684
+ cur_t *= 2
685
+ self.num_layers += 1
686
+
687
+ post_conv = nn.ModuleList([ConvNormRelu(
688
+ in_channels=C_out,
689
+ out_channels=C_out,
690
+ type='1d'
691
+ )])
692
+ self.num_layers += 1
693
+ if min_layer_num is not None and self.num_layers < min_layer_num:
694
+ while self.num_layers < min_layer_num:
695
+ post_conv.append(ConvNormRelu(
696
+ in_channels=C_out,
697
+ out_channels=C_out,
698
+ type='1d'
699
+ ))
700
+ self.num_layers += 1
701
+ self.post_conv = nn.Sequential(*post_conv)
702
+
703
+ def forward(self, x):
704
+
705
+ x = x.unsqueeze(-1)
706
+ x = self.pre_conv(x)
707
+ for conv in self.conv_layers:
708
+ x = self.upconv(x)
709
+ x = conv(x)
710
+
711
+ x = torch.nn.functional.interpolate(x, size=self.T_out, mode='nearest')
712
+ x = self.post_conv(x)
713
+ return x
714
+
715
+
716
+ class SeqDecoderRNN(nn.Module):
717
+ '''
718
+ (B, D)->(B, C_out, T)
719
+ '''
720
+
721
+ def __init__(self,
722
+ hidden_size,
723
+ C_out,
724
+ T_out,
725
+ num_layers,
726
+ rnn_cell='gru'
727
+ ):
728
+ super(SeqDecoderRNN, self).__init__()
729
+ self.num_steps = T_out
730
+ if rnn_cell == 'gru':
731
+ self.cell = nn.GRU(input_size=C_out, hidden_size=hidden_size, num_layers=num_layers, batch_first=True,
732
+ bidirectional=False)
733
+ elif rnn_cell == 'lstm':
734
+ self.cell = nn.LSTM(input_size=C_out, hidden_size=hidden_size, num_layers=num_layers, batch_first=True,
735
+ bidirectional=False)
736
+ else:
737
+ raise ValueError('invalid rnn cell:%s' % (rnn_cell))
738
+
739
+ self.fc = nn.Linear(hidden_size, C_out)
740
+
741
+ def forward(self, hidden, frame_0):
742
+ frame_0 = frame_0.permute(0, 2, 1)
743
+ dec_input = frame_0
744
+ outputs = []
745
+ for i in range(self.num_steps):
746
+ frame_out, hidden = self.cell(dec_input, hidden)
747
+ frame_out = self.fc(frame_out)
748
+ dec_input = frame_out
749
+ outputs.append(frame_out)
750
+ output = torch.cat(outputs, dim=1)
751
+ return output.permute(0, 2, 1)
752
+
753
+
754
+ class SeqTranslator2D(nn.Module):
755
+ '''
756
+ (B, C, T)->(B, 1, C, T)-> ... -> (B, 1, C_out, T_out)
757
+ '''
758
+
759
+ def __init__(self,
760
+ C_in=64,
761
+ C_out=108,
762
+ T_in=75,
763
+ T_out=25,
764
+ residual=True
765
+ ):
766
+ super(SeqTranslator2D, self).__init__()
767
+ print("Warning: hard coded")
768
+ self.C_in = C_in
769
+ self.C_out = C_out
770
+ self.T_in = T_in
771
+ self.T_out = T_out
772
+ self.residual = residual
773
+
774
+ self.conv_layers = nn.Sequential(
775
+ ConvNormRelu(1, 32, '2d', kernel_size=5, stride=1),
776
+ ConvNormRelu(32, 32, '2d', kernel_size=5, stride=1, residual=self.residual),
777
+ ConvNormRelu(32, 32, '2d', kernel_size=5, stride=1, residual=self.residual),
778
+
779
+ ConvNormRelu(32, 64, '2d', kernel_size=5, stride=(4, 3)),
780
+ ConvNormRelu(64, 64, '2d', kernel_size=5, stride=1, residual=self.residual),
781
+ ConvNormRelu(64, 64, '2d', kernel_size=5, stride=1, residual=self.residual),
782
+
783
+ ConvNormRelu(64, 128, '2d', kernel_size=5, stride=(4, 1)),
784
+ ConvNormRelu(128, 108, '2d', kernel_size=3, stride=(4, 1)),
785
+ ConvNormRelu(108, 108, '2d', kernel_size=(1, 3), stride=1, residual=self.residual),
786
+
787
+ ConvNormRelu(108, 108, '2d', kernel_size=(1, 3), stride=1, residual=self.residual),
788
+ ConvNormRelu(108, 108, '2d', kernel_size=(1, 3), stride=1),
789
+ )
790
+
791
+ def forward(self, x):
792
+ assert len(x.shape) == 3 and x.shape[1] == self.C_in and x.shape[2] == self.T_in
793
+ x = x.view(x.shape[0], 1, x.shape[1], x.shape[2])
794
+ x = self.conv_layers(x)
795
+ x = x.squeeze(2)
796
+ return x
797
+
798
+
799
+ class SeqTranslator1D(nn.Module):
800
+ '''
801
+ (B, C, T)->(B, C_out, T)
802
+ '''
803
+
804
+ def __init__(self,
805
+ C_in,
806
+ C_out,
807
+ kernel_size=None,
808
+ stride=None,
809
+ min_layers_num=None,
810
+ residual=True,
811
+ norm='bn'
812
+ ):
813
+ super(SeqTranslator1D, self).__init__()
814
+
815
+ conv_layers = nn.ModuleList([])
816
+ conv_layers.append(ConvNormRelu(
817
+ in_channels=C_in,
818
+ out_channels=C_out,
819
+ type='1d',
820
+ kernel_size=kernel_size,
821
+ stride=stride,
822
+ residual=residual,
823
+ norm=norm
824
+ ))
825
+ self.num_layers = 1
826
+ if min_layers_num is not None and self.num_layers < min_layers_num:
827
+ while self.num_layers < min_layers_num:
828
+ conv_layers.append(ConvNormRelu(
829
+ in_channels=C_out,
830
+ out_channels=C_out,
831
+ type='1d',
832
+ kernel_size=kernel_size,
833
+ stride=stride,
834
+ residual=residual,
835
+ norm=norm
836
+ ))
837
+ self.num_layers += 1
838
+ self.conv_layers = nn.Sequential(*conv_layers)
839
+
840
+ def forward(self, x):
841
+ return self.conv_layers(x)
842
+
843
+
844
+ class SeqTranslatorRNN(nn.Module):
845
+ '''
846
+ (B, C, T)->(B, C_out, T)
847
+ LSTM-FC
848
+ '''
849
+
850
+ def __init__(self,
851
+ C_in,
852
+ C_out,
853
+ hidden_size,
854
+ num_layers,
855
+ rnn_cell='gru'
856
+ ):
857
+ super(SeqTranslatorRNN, self).__init__()
858
+
859
+ if rnn_cell == 'gru':
860
+ self.enc_cell = nn.GRU(input_size=C_in, hidden_size=hidden_size, num_layers=num_layers, batch_first=True,
861
+ bidirectional=False)
862
+ self.dec_cell = nn.GRU(input_size=C_out, hidden_size=hidden_size, num_layers=num_layers, batch_first=True,
863
+ bidirectional=False)
864
+ elif rnn_cell == 'lstm':
865
+ self.enc_cell = nn.LSTM(input_size=C_in, hidden_size=hidden_size, num_layers=num_layers, batch_first=True,
866
+ bidirectional=False)
867
+ self.dec_cell = nn.LSTM(input_size=C_out, hidden_size=hidden_size, num_layers=num_layers, batch_first=True,
868
+ bidirectional=False)
869
+ else:
870
+ raise ValueError('invalid rnn cell:%s' % (rnn_cell))
871
+
872
+ self.fc = nn.Linear(hidden_size, C_out)
873
+
874
+ def forward(self, x, frame_0):
875
+
876
+ num_steps = x.shape[-1]
877
+ x = x.permute(0, 2, 1)
878
+ frame_0 = frame_0.permute(0, 2, 1)
879
+ _, hidden = self.enc_cell(x, None)
880
+
881
+ outputs = []
882
+ for i in range(num_steps):
883
+ inputs = frame_0
884
+ output_frame, hidden = self.dec_cell(inputs, hidden)
885
+ output_frame = self.fc(output_frame)
886
+ frame_0 = output_frame
887
+ outputs.append(output_frame)
888
+ outputs = torch.cat(outputs, dim=1)
889
+ return outputs.permute(0, 2, 1)
890
+
891
+
892
+ class ResBlock(nn.Module):
893
+ def __init__(self,
894
+ input_dim,
895
+ fc_dim,
896
+ afn,
897
+ nfn
898
+ ):
899
+ '''
900
+ afn: activation fn
901
+ nfn: normalization fn
902
+ '''
903
+ super(ResBlock, self).__init__()
904
+
905
+ self.input_dim = input_dim
906
+ self.fc_dim = fc_dim
907
+ self.afn = afn
908
+ self.nfn = nfn
909
+
910
+ if self.afn != 'relu':
911
+ raise ValueError('Wrong')
912
+
913
+ if self.nfn == 'layer_norm':
914
+ raise ValueError('wrong')
915
+
916
+ self.layers = nn.Sequential(
917
+ nn.Linear(self.input_dim, self.fc_dim // 2),
918
+ nn.ReLU(),
919
+ nn.Linear(self.fc_dim // 2, self.fc_dim // 2),
920
+ nn.ReLU(),
921
+ nn.Linear(self.fc_dim // 2, self.fc_dim),
922
+ nn.ReLU()
923
+ )
924
+
925
+ self.shortcut_layer = nn.Sequential(
926
+ nn.Linear(self.input_dim, self.fc_dim),
927
+ nn.ReLU(),
928
+ )
929
+
930
+ def forward(self, inputs):
931
+ return self.layers(inputs) + self.shortcut_layer(inputs)
932
+
933
+
934
+ class AudioEncoder(nn.Module):
935
+ def __init__(self, channels, padding=3, kernel_size=8, conv_stride=2, conv_pool=None, augmentation=False):
936
+ super(AudioEncoder, self).__init__()
937
+ self.in_channels = channels[0]
938
+ self.augmentation = augmentation
939
+
940
+ model = []
941
+ acti = nn.LeakyReLU(0.2)
942
+
943
+ nr_layer = len(channels) - 1
944
+
945
+ for i in range(nr_layer):
946
+ if conv_pool is None:
947
+ model.append(nn.ReflectionPad1d(padding))
948
+ model.append(nn.Conv1d(channels[i], channels[i + 1], kernel_size=kernel_size, stride=conv_stride))
949
+ model.append(acti)
950
+ else:
951
+ model.append(nn.ReflectionPad1d(padding))
952
+ model.append(nn.Conv1d(channels[i], channels[i + 1], kernel_size=kernel_size, stride=conv_stride))
953
+ model.append(acti)
954
+ model.append(conv_pool(kernel_size=2, stride=2))
955
+
956
+ if self.augmentation:
957
+ model.append(
958
+ nn.Conv1d(channels[-1], channels[-1], kernel_size=kernel_size, stride=conv_stride)
959
+ )
960
+ model.append(acti)
961
+
962
+ self.model = nn.Sequential(*model)
963
+
964
+ def forward(self, x):
965
+
966
+ x = x[:, :self.in_channels, :]
967
+ x = self.model(x)
968
+ return x
969
+
970
+
971
+ class AudioDecoder(nn.Module):
972
+ def __init__(self, channels, kernel_size=7, ups=25):
973
+ super(AudioDecoder, self).__init__()
974
+
975
+ model = []
976
+ pad = (kernel_size - 1) // 2
977
+ acti = nn.LeakyReLU(0.2)
978
+
979
+ for i in range(len(channels) - 2):
980
+ model.append(nn.Upsample(scale_factor=2, mode='nearest'))
981
+ model.append(nn.ReflectionPad1d(pad))
982
+ model.append(nn.Conv1d(channels[i], channels[i + 1],
983
+ kernel_size=kernel_size, stride=1))
984
+ if i == 0 or i == 1:
985
+ model.append(nn.Dropout(p=0.2))
986
+ if not i == len(channels) - 2:
987
+ model.append(acti)
988
+
989
+ model.append(nn.Upsample(size=ups, mode='nearest'))
990
+ model.append(nn.ReflectionPad1d(pad))
991
+ model.append(nn.Conv1d(channels[-2], channels[-1],
992
+ kernel_size=kernel_size, stride=1))
993
+
994
+ self.model = nn.Sequential(*model)
995
+
996
+ def forward(self, x):
997
+ return self.model(x)
998
+
999
+
1000
+ class Audio2Pose(nn.Module):
1001
+ def __init__(self, pose_dim, embed_size, augmentation, ups=25):
1002
+ super(Audio2Pose, self).__init__()
1003
+ self.pose_dim = pose_dim
1004
+ self.embed_size = embed_size
1005
+ self.augmentation = augmentation
1006
+
1007
+ self.aud_enc = AudioEncoder(channels=[13, 64, 128, 256], padding=2, kernel_size=7, conv_stride=1,
1008
+ conv_pool=nn.AvgPool1d, augmentation=self.augmentation)
1009
+ if self.augmentation:
1010
+ self.aud_dec = AudioDecoder(channels=[512, 256, 128, pose_dim])
1011
+ else:
1012
+ self.aud_dec = AudioDecoder(channels=[256, 256, 128, pose_dim], ups=ups)
1013
+
1014
+ if self.augmentation:
1015
+ self.pose_enc = nn.Sequential(
1016
+ nn.Linear(self.embed_size // 2, 256),
1017
+ nn.LayerNorm(256)
1018
+ )
1019
+
1020
+ def forward(self, audio_feat, dec_input=None):
1021
+
1022
+ B = audio_feat.shape[0]
1023
+
1024
+ aud_embed = self.aud_enc.forward(audio_feat)
1025
+
1026
+ if self.augmentation:
1027
+ dec_input = dec_input.squeeze(0)
1028
+ dec_embed = self.pose_enc(dec_input)
1029
+ dec_embed = dec_embed.unsqueeze(2)
1030
+ dec_embed = dec_embed.expand(dec_embed.shape[0], dec_embed.shape[1], aud_embed.shape[-1])
1031
+ aud_embed = torch.cat([aud_embed, dec_embed], dim=1)
1032
+
1033
+ out = self.aud_dec.forward(aud_embed)
1034
+ return out
1035
+
1036
+
1037
+ if __name__ == '__main__':
1038
+ import numpy as np
1039
+ import os
1040
+ import sys
1041
+
1042
+ test_model = SeqEncoder2D(
1043
+ C_in=2,
1044
+ T_in=25,
1045
+ C_out=512,
1046
+ num_joints=54,
1047
+ )
1048
+ print(test_model.num_layers)
1049
+
1050
+ input = torch.randn((64, 108, 25))
1051
+ output = test_model(input)
1052
+ print(output.shape)
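The `__main__` block above only exercises `SeqEncoder2D`. A minimal sketch of driving the `Audio2Pose` path added in this file is below; it assumes the repo root is on `PYTHONPATH` and that `Audio2Pose` lives in `nets/layers.py` as shown in this commit. The dimensions are illustrative, not the training configuration.

```python
# Hedged usage sketch for Audio2Pose (illustrative dimensions, not the training config).
import torch
from nets.layers import Audio2Pose  # assumes the repo root is on PYTHONPATH

# 13-channel MFCC features over 100 audio frames, batch of 2.
mfcc = torch.randn(2, 13, 100)
model = Audio2Pose(pose_dim=129, embed_size=256, augmentation=False, ups=25)

with torch.no_grad():
    poses = model(mfcc)
print(poses.shape)  # expected: torch.Size([2, 129, 25]) -- pose channels x `ups` output frames
```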
nets/smplx_body_pixel.py ADDED
@@ -0,0 +1,326 @@
1
+ import os
2
+ import sys
3
+
4
+ import torch
5
+ from torch.optim.lr_scheduler import StepLR
6
+
7
+ sys.path.append(os.getcwd())
8
+
9
+ from nets.layers import *
10
+ from nets.base import TrainWrapperBaseClass
11
+ from nets.spg.gated_pixelcnn_v2 import GatedPixelCNN as pixelcnn
12
+ from nets.spg.vqvae_1d import VQVAE as s2g_body, Wav2VecEncoder
13
+ from nets.spg.vqvae_1d import AudioEncoder
14
+ from nets.utils import parse_audio, denormalize
15
+ from data_utils import get_mfcc, get_melspec, get_mfcc_old, get_mfcc_psf, get_mfcc_psf_min, get_mfcc_ta
16
+ import numpy as np
17
+ import torch.optim as optim
18
+ import torch.nn.functional as F
19
+ from sklearn.preprocessing import normalize
20
+
21
+ from data_utils.lower_body import c_index, c_index_3d, c_index_6d
22
+ from data_utils.utils import smooth_geom, get_mfcc_sepa
23
+
24
+
25
+ class TrainWrapper(TrainWrapperBaseClass):
26
+ '''
27
+ a wrapper receiving a batch from data_utils and computing the loss
28
+ '''
29
+
30
+ def __init__(self, args, config):
31
+ self.args = args
32
+ self.config = config
33
+ self.device = torch.device(self.args.gpu)
34
+ self.global_step = 0
35
+
36
+ self.convert_to_6d = self.config.Data.pose.convert_to_6d
37
+ self.expression = self.config.Data.pose.expression
38
+ self.epoch = 0
39
+ self.init_params()
40
+ self.num_classes = 4
41
+ self.audio = True
42
+ self.composition = self.config.Model.composition
43
+ self.bh_model = self.config.Model.bh_model
44
+
45
+ if self.audio:
46
+ self.audioencoder = AudioEncoder(in_dim=64, num_hiddens=256, num_residual_layers=2, num_residual_hiddens=256).to(self.device)
47
+ else:
48
+ self.audioencoder = None
49
+ if self.convert_to_6d:
50
+ dim, layer = 512, 10
51
+ else:
52
+ dim, layer = 256, 15
53
+ self.generator = pixelcnn(2048, dim, layer, self.num_classes, self.audio, self.bh_model).to(self.device)
54
+ self.g_body = s2g_body(self.each_dim[1], embedding_dim=64, num_embeddings=config.Model.code_num, num_hiddens=1024,
55
+ num_residual_layers=2, num_residual_hiddens=512).to(self.device)
56
+ self.g_hand = s2g_body(self.each_dim[2], embedding_dim=64, num_embeddings=config.Model.code_num, num_hiddens=1024,
57
+ num_residual_layers=2, num_residual_hiddens=512).to(self.device)
58
+
59
+ model_path = self.config.Model.vq_path
60
+ model_ckpt = torch.load(model_path, map_location=torch.device('cpu'))
61
+ self.g_body.load_state_dict(model_ckpt['generator']['g_body'])
62
+ self.g_hand.load_state_dict(model_ckpt['generator']['g_hand'])
63
+
64
+ if torch.cuda.device_count() > 1:
65
+ self.g_body = torch.nn.DataParallel(self.g_body, device_ids=[0, 1])
66
+ self.g_hand = torch.nn.DataParallel(self.g_hand, device_ids=[0, 1])
67
+ self.generator = torch.nn.DataParallel(self.generator, device_ids=[0, 1])
68
+ if self.audioencoder is not None:
69
+ self.audioencoder = torch.nn.DataParallel(self.audioencoder, device_ids=[0, 1])
70
+
71
+ self.discriminator = None
72
+ if self.convert_to_6d:
73
+ self.c_index = c_index_6d
74
+ else:
75
+ self.c_index = c_index_3d
76
+
77
+ super().__init__(args, config)
78
+
79
+ def init_optimizer(self):
80
+
81
+ print('using Adam')
82
+ self.generator_optimizer = optim.Adam(
83
+ self.generator.parameters(),
84
+ lr=self.config.Train.learning_rate.generator_learning_rate,
85
+ betas=[0.9, 0.999]
86
+ )
87
+ if self.audioencoder is not None:
88
+ opt = self.config.Model.AudioOpt
89
+ if opt == 'Adam':
90
+ self.audioencoder_optimizer = optim.Adam(
91
+ self.audioencoder.parameters(),
92
+ lr=self.config.Train.learning_rate.generator_learning_rate,
93
+ betas=[0.9, 0.999]
94
+ )
95
+ else:
96
+ print('using SGD')
97
+ self.audioencoder_optimizer = optim.SGD(
98
+ filter(lambda p: p.requires_grad,self.audioencoder.parameters()),
99
+ lr=self.config.Train.learning_rate.generator_learning_rate*10,
100
+ momentum=0.9,
101
+ nesterov=False,
102
+ )
103
+
104
+ def state_dict(self):
105
+ model_state = {
106
+ 'generator': self.generator.state_dict(),
107
+ 'generator_optim': self.generator_optimizer.state_dict(),
108
+ 'audioencoder': self.audioencoder.state_dict() if self.audio else None,
109
+ 'audioencoder_optim': self.audioencoder_optimizer.state_dict() if self.audio else None,
110
+ 'discriminator': self.discriminator.state_dict() if self.discriminator is not None else None,
111
+ 'discriminator_optim': self.discriminator_optimizer.state_dict() if self.discriminator is not None else None
112
+ }
113
+ return model_state
114
+
115
+ def load_state_dict(self, state_dict):
116
+
117
+ from collections import OrderedDict
118
+ new_state_dict = OrderedDict() # create new OrderedDict that does not contain `module.`
119
+ for k, v in state_dict.items():
120
+ sub_dict = OrderedDict()
121
+ if v is not None:
122
+ for k1, v1 in v.items():
123
+ name = k1.replace('module.', '')
124
+ sub_dict[name] = v1
125
+ new_state_dict[k] = sub_dict
126
+ state_dict = new_state_dict
127
+ if 'generator' in state_dict:
128
+ self.generator.load_state_dict(state_dict['generator'])
129
+ else:
130
+ self.generator.load_state_dict(state_dict)
131
+
132
+ if 'generator_optim' in state_dict and self.generator_optimizer is not None:
133
+ self.generator_optimizer.load_state_dict(state_dict['generator_optim'])
134
+
135
+ if self.discriminator is not None:
136
+ self.discriminator.load_state_dict(state_dict['discriminator'])
137
+
138
+ if 'discriminator_optim' in state_dict and self.discriminator_optimizer is not None:
139
+ self.discriminator_optimizer.load_state_dict(state_dict['discriminator_optim'])
140
+
141
+ if 'audioencoder' in state_dict and self.audioencoder is not None:
142
+ self.audioencoder.load_state_dict(state_dict['audioencoder'])
143
+
144
+ def init_params(self):
145
+ if self.config.Data.pose.convert_to_6d:
146
+ scale = 2
147
+ else:
148
+ scale = 1
149
+
150
+ global_orient = round(0 * scale)
151
+ leye_pose = reye_pose = round(0 * scale)
152
+ jaw_pose = round(0 * scale)
153
+ body_pose = round((63 - 24) * scale)
154
+ left_hand_pose = right_hand_pose = round(45 * scale)
155
+ if self.expression:
156
+ expression = 100
157
+ else:
158
+ expression = 0
159
+
160
+ b_j = 0
161
+ jaw_dim = jaw_pose
162
+ b_e = b_j + jaw_dim
163
+ eye_dim = leye_pose + reye_pose
164
+ b_b = b_e + eye_dim
165
+ body_dim = global_orient + body_pose
166
+ b_h = b_b + body_dim
167
+ hand_dim = left_hand_pose + right_hand_pose
168
+ b_f = b_h + hand_dim
169
+ face_dim = expression
170
+
171
+ self.dim_list = [b_j, b_e, b_b, b_h, b_f]
172
+ self.full_dim = jaw_dim + eye_dim + body_dim + hand_dim
173
+ self.pose = int(self.full_dim / round(3 * scale))
174
+ self.each_dim = [jaw_dim, eye_dim + body_dim, hand_dim, face_dim]
175
+
176
+ def __call__(self, bat):
177
+ # assert (not self.args.infer), "infer mode"
178
+ self.global_step += 1
179
+
180
+ total_loss = None
181
+ loss_dict = {}
182
+
183
+ aud, poses = bat['aud_feat'].to(self.device).to(torch.float32), bat['poses'].to(self.device).to(torch.float32)
184
+
185
+ id = bat['speaker'].to(self.device) - 20
186
+ # id = F.one_hot(id, self.num_classes)
187
+
188
+ poses = poses[:, self.c_index, :]
189
+
190
+ aud = aud.permute(0, 2, 1)
191
+ gt_poses = poses.permute(0, 2, 1)
192
+
193
+ with torch.no_grad():
194
+ self.g_body.eval()
195
+ self.g_hand.eval()
196
+ if torch.cuda.device_count() > 1:
197
+ _, body_latents = self.g_body.module.encode(gt_poses=gt_poses[..., :self.each_dim[1]], id=id)
198
+ _, hand_latents = self.g_hand.module.encode(gt_poses=gt_poses[..., self.each_dim[1]:], id=id)
199
+ else:
200
+ _, body_latents = self.g_body.encode(gt_poses=gt_poses[..., :self.each_dim[1]], id=id)
201
+ _, hand_latents = self.g_hand.encode(gt_poses=gt_poses[..., self.each_dim[1]:], id=id)
202
+ latents = torch.cat([body_latents.unsqueeze(dim=-1), hand_latents.unsqueeze(dim=-1)], dim=-1)
203
+ latents = latents.detach()
204
+
205
+ if self.audio:
206
+ audio = self.audioencoder(aud[:, :].transpose(1, 2), frame_num=latents.shape[1]*4).unsqueeze(dim=-1).repeat(1, 1, 1, 2)
207
+ logits = self.generator(latents[:, :], id, audio)
208
+ else:
209
+ logits = self.generator(latents, id)
210
+ logits = logits.permute(0, 2, 3, 1).contiguous()
211
+
212
+ self.generator_optimizer.zero_grad()
213
+ if self.audio:
214
+ self.audioencoder_optimizer.zero_grad()
215
+
216
+ loss = F.cross_entropy(logits.view(-1, logits.shape[-1]), latents.view(-1))
217
+ loss.backward()
218
+
219
+ grad = torch.nn.utils.clip_grad_norm_(self.generator.parameters(), self.config.Train.max_gradient_norm)
220
+
221
+ if torch.isnan(grad).sum() > 0:
222
+ print('warning: NaN gradient detected')
223
+
224
+ loss_dict['grad'] = grad.item()
225
+ loss_dict['ce_loss'] = loss.item()
226
+ self.generator_optimizer.step()
227
+ if self.audio:
228
+ self.audioencoder_optimizer.step()
229
+
230
+ return total_loss, loss_dict
231
+
232
+ def infer_on_audio(self, aud_fn, initial_pose=None, norm_stats=None, exp=None, var=None, w_pre=False, rand=None,
233
+ continuity=False, id=None, fps=15, sr=22000, B=1, am=None, am_sr=None, frame=0,**kwargs):
234
+ '''
235
+ initial_pose: (B, C, T), normalized
236
+ (aud_fn, txgfile) -> generated motion (B, T, C)
237
+ '''
238
+ output = []
239
+
240
+ assert self.args.infer, "train mode"
241
+ self.generator.eval()
242
+ self.g_body.eval()
243
+ self.g_hand.eval()
244
+
245
+ if continuity:
246
+ aud_feat, gap = get_mfcc_sepa(aud_fn, sr=sr, fps=fps)
247
+ else:
248
+ aud_feat = get_mfcc_ta(aud_fn, sr=sr, fps=fps, smlpx=True, type='mfcc', am=am)
249
+ aud_feat = aud_feat.transpose(1, 0)
250
+ aud_feat = aud_feat[np.newaxis, ...].repeat(B, axis=0)
251
+ aud_feat = torch.tensor(aud_feat, dtype=torch.float32).to(self.device)
252
+
253
+ if id is None:
254
+ id = torch.tensor([0]).to(self.device)
255
+ else:
256
+ id = id.repeat(B)
257
+
258
+ with torch.no_grad():
259
+ aud_feat = aud_feat.permute(0, 2, 1)
260
+ if continuity:
261
+ self.audioencoder.eval()
262
+ pre_pose = {}
263
+ pre_pose['b'] = pre_pose['h'] = None
264
+ pre_latents, pre_audio, body_0, hand_0 = self.infer(aud_feat[:, :gap], frame, id, B, pre_pose=pre_pose)
265
+ pre_pose['b'] = body_0[:, :, -4:].transpose(1,2)
266
+ pre_pose['h'] = hand_0[:, :, -4:].transpose(1,2)
267
+ _, _, body_1, hand_1 = self.infer(aud_feat[:, gap:], frame, id, B, pre_latents, pre_audio, pre_pose)
268
+ body = torch.cat([body_0, body_1], dim=2)
269
+ hand = torch.cat([hand_0, hand_1], dim=2)
270
+
271
+ else:
272
+ if self.audio:
273
+ self.audioencoder.eval()
274
+ audio = self.audioencoder(aud_feat.transpose(1, 2), frame_num=frame).unsqueeze(dim=-1).repeat(1, 1, 1, 2)
275
+ latents = self.generator.generate(id, shape=[audio.shape[2], 2], batch_size=B, aud_feat=audio)
276
+ else:
277
+ latents = self.generator.generate(id, shape=[aud_feat.shape[1]//4, 2], batch_size=B)
278
+
279
+ body_latents = latents[..., 0]
280
+ hand_latents = latents[..., 1]
281
+
282
+ body, _ = self.g_body.decode(b=body_latents.shape[0], w=body_latents.shape[1], latents=body_latents)
283
+ hand, _ = self.g_hand.decode(b=hand_latents.shape[0], w=hand_latents.shape[1], latents=hand_latents)
284
+
285
+ pred_poses = torch.cat([body, hand], dim=1).transpose(1,2).cpu().numpy()
286
+
287
+ output = pred_poses
288
+
289
+ return output
290
+
291
+ def infer(self, aud_feat, frame, id, B, pre_latents=None, pre_audio=None, pre_pose=None):
292
+ audio = self.audioencoder(aud_feat.transpose(1, 2), frame_num=frame).unsqueeze(dim=-1).repeat(1, 1, 1, 2)
293
+ latents = self.generator.generate(id, shape=[audio.shape[2], 2], batch_size=B, aud_feat=audio,
294
+ pre_latents=pre_latents, pre_audio=pre_audio)
295
+
296
+ body_latents = latents[..., 0]
297
+ hand_latents = latents[..., 1]
298
+
299
+ body, _ = self.g_body.decode(b=body_latents.shape[0], w=body_latents.shape[1],
300
+ latents=body_latents, pre_state=pre_pose['b'])
301
+ hand, _ = self.g_hand.decode(b=hand_latents.shape[0], w=hand_latents.shape[1],
302
+ latents=hand_latents, pre_state=pre_pose['h'])
303
+
304
+ return latents, audio, body, hand
305
+
306
+ def generate(self, aud, id, frame_num=0):
307
+
308
+ self.generator.eval()
309
+ self.g_body.eval()
310
+ self.g_hand.eval()
311
+ aud_feat = aud.permute(0, 2, 1)
312
+ if self.audio:
313
+ self.audioencoder.eval()
314
+ audio = self.audioencoder(aud_feat.transpose(1, 2), frame_num=frame_num).unsqueeze(dim=-1).repeat(1, 1, 1, 2)
315
+ latents = self.generator.generate(id, shape=[audio.shape[2], 2], batch_size=aud.shape[0], aud_feat=audio)
316
+ else:
317
+ latents = self.generator.generate(id, shape=[aud_feat.shape[1] // 4, 2], batch_size=aud.shape[0])
318
+
319
+ body_latents = latents[..., 0]
320
+ hand_latents = latents[..., 1]
321
+
322
+ body = self.g_body.decode(b=body_latents.shape[0], w=body_latents.shape[1], latents=body_latents)
323
+ hand = self.g_hand.decode(b=hand_latents.shape[0], w=hand_latents.shape[1], latents=hand_latents)
324
+
325
+ pred_poses = torch.cat([body, hand], dim=1).transpose(1, 2)
326
+ return pred_poses
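For reference, the cross-entropy objective optimized in `__call__` above can be reproduced in isolation. The sketch below is self-contained and uses illustrative shapes (batch 4, 22 latent frames, two streams for body/hand, a 2048-entry codebook), not the actual model outputs.

```python
# Self-contained sketch of the PixelCNN training objective used in TrainWrapper.__call__:
# predict a categorical distribution over VQ code indices for each cell of the
# (frames x body/hand) latent grid and train with plain cross-entropy.
import torch
import torch.nn.functional as F

B, T, S, K = 4, 22, 2, 2048                 # batch, latent frames, streams, codebook size
latents = torch.randint(0, K, (B, T, S))    # target code indices from the frozen VQ-VAEs
logits = torch.randn(B, K, T, S)            # stand-in for the generator output
logits = logits.permute(0, 2, 3, 1).contiguous()             # -> (B, T, S, K), as in the wrapper
loss = F.cross_entropy(logits.view(-1, K), latents.view(-1))
print(loss.item())
```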
nets/smplx_body_vq.py ADDED
@@ -0,0 +1,302 @@
1
+ import os
2
+ import sys
3
+
4
+ from torch.optim.lr_scheduler import StepLR
5
+
6
+ sys.path.append(os.getcwd())
7
+
8
+ from nets.layers import *
9
+ from nets.base import TrainWrapperBaseClass
10
+ from nets.spg.s2glayers import Generator as G_S2G, Discriminator as D_S2G
11
+ from nets.spg.vqvae_1d import VQVAE as s2g_body
12
+ from nets.utils import parse_audio, denormalize
13
+ from data_utils import get_mfcc, get_melspec, get_mfcc_old, get_mfcc_psf, get_mfcc_psf_min, get_mfcc_ta
14
+ import numpy as np
15
+ import torch.optim as optim
16
+ import torch.nn.functional as F
17
+ from sklearn.preprocessing import normalize
18
+
19
+ from data_utils.lower_body import c_index, c_index_3d, c_index_6d
20
+
21
+
22
+ class TrainWrapper(TrainWrapperBaseClass):
23
+ '''
24
+ a wrapper receiving a batch from data_utils and computing the loss
25
+ '''
26
+
27
+ def __init__(self, args, config):
28
+ self.args = args
29
+ self.config = config
30
+ self.device = torch.device(self.args.gpu)
31
+ self.global_step = 0
32
+
33
+ self.convert_to_6d = self.config.Data.pose.convert_to_6d
34
+ self.expression = self.config.Data.pose.expression
35
+ self.epoch = 0
36
+ self.init_params()
37
+ self.num_classes = 4
38
+ self.composition = self.config.Model.composition
39
+ if self.composition:
40
+ self.g_body = s2g_body(self.each_dim[1], embedding_dim=64, num_embeddings=config.Model.code_num, num_hiddens=1024,
41
+ num_residual_layers=2, num_residual_hiddens=512).to(self.device)
42
+ self.g_hand = s2g_body(self.each_dim[2], embedding_dim=64, num_embeddings=config.Model.code_num, num_hiddens=1024,
43
+ num_residual_layers=2, num_residual_hiddens=512).to(self.device)
44
+ else:
45
+ self.g = s2g_body(self.each_dim[1] + self.each_dim[2], embedding_dim=64, num_embeddings=config.Model.code_num,
46
+ num_hiddens=1024, num_residual_layers=2, num_residual_hiddens=512).to(self.device)
47
+
48
+ self.discriminator = None
49
+
50
+ if self.convert_to_6d:
51
+ self.c_index = c_index_6d
52
+ else:
53
+ self.c_index = c_index_3d
54
+
55
+ super().__init__(args, config)
56
+
57
+ def init_optimizer(self):
58
+ print('using Adam')
59
+ if self.composition:
60
+ self.g_body_optimizer = optim.Adam(
61
+ self.g_body.parameters(),
62
+ lr=self.config.Train.learning_rate.generator_learning_rate,
63
+ betas=[0.9, 0.999]
64
+ )
65
+ self.g_hand_optimizer = optim.Adam(
66
+ self.g_hand.parameters(),
67
+ lr=self.config.Train.learning_rate.generator_learning_rate,
68
+ betas=[0.9, 0.999]
69
+ )
70
+ else:
71
+ self.g_optimizer = optim.Adam(
72
+ self.g.parameters(),
73
+ lr=self.config.Train.learning_rate.generator_learning_rate,
74
+ betas=[0.9, 0.999]
75
+ )
76
+
77
+ def state_dict(self):
78
+ if self.composition:
79
+ model_state = {
80
+ 'g_body': self.g_body.state_dict(),
81
+ 'g_body_optim': self.g_body_optimizer.state_dict(),
82
+ 'g_hand': self.g_hand.state_dict(),
83
+ 'g_hand_optim': self.g_hand_optimizer.state_dict(),
84
+ 'discriminator': self.discriminator.state_dict() if self.discriminator is not None else None,
85
+ 'discriminator_optim': self.discriminator_optimizer.state_dict() if self.discriminator is not None else None
86
+ }
87
+ else:
88
+ model_state = {
89
+ 'g': self.g.state_dict(),
90
+ 'g_optim': self.g_optimizer.state_dict(),
91
+ 'discriminator': self.discriminator.state_dict() if self.discriminator is not None else None,
92
+ 'discriminator_optim': self.discriminator_optimizer.state_dict() if self.discriminator is not None else None
93
+ }
94
+ return model_state
95
+
96
+ def init_params(self):
97
+ if self.config.Data.pose.convert_to_6d:
98
+ scale = 2
99
+ else:
100
+ scale = 1
101
+
102
+ global_orient = round(0 * scale)
103
+ leye_pose = reye_pose = round(0 * scale)
104
+ jaw_pose = round(0 * scale)
105
+ body_pose = round((63 - 24) * scale)
106
+ left_hand_pose = right_hand_pose = round(45 * scale)
107
+ if self.expression:
108
+ expression = 100
109
+ else:
110
+ expression = 0
111
+
112
+ b_j = 0
113
+ jaw_dim = jaw_pose
114
+ b_e = b_j + jaw_dim
115
+ eye_dim = leye_pose + reye_pose
116
+ b_b = b_e + eye_dim
117
+ body_dim = global_orient + body_pose
118
+ b_h = b_b + body_dim
119
+ hand_dim = left_hand_pose + right_hand_pose
120
+ b_f = b_h + hand_dim
121
+ face_dim = expression
122
+
123
+ self.dim_list = [b_j, b_e, b_b, b_h, b_f]
124
+ self.full_dim = jaw_dim + eye_dim + body_dim + hand_dim
125
+ self.pose = int(self.full_dim / round(3 * scale))
126
+ self.each_dim = [jaw_dim, eye_dim + body_dim, hand_dim, face_dim]
127
+
128
+ def __call__(self, bat):
129
+ # assert (not self.args.infer), "infer mode"
130
+ self.global_step += 1
131
+
132
+ total_loss = None
133
+ loss_dict = {}
134
+
135
+ aud, poses = bat['aud_feat'].to(self.device).to(torch.float32), bat['poses'].to(self.device).to(torch.float32)
136
+
137
+ # id = bat['speaker'].to(self.device) - 20
138
+ # id = F.one_hot(id, self.num_classes)
139
+
140
+ poses = poses[:, self.c_index, :]
141
+ gt_poses = poses.permute(0, 2, 1)
142
+ b_poses = gt_poses[..., :self.each_dim[1]]
143
+ h_poses = gt_poses[..., self.each_dim[1]:]
144
+
145
+ if self.composition:
146
+ loss = 0
147
+ loss_dict, loss = self.vq_train(b_poses[:, :], 'b', self.g_body, loss_dict, loss)
148
+ loss_dict, loss = self.vq_train(h_poses[:, :], 'h', self.g_hand, loss_dict, loss)
149
+ else:
150
+ loss = 0
151
+ loss_dict, loss = self.vq_train(gt_poses[:, :], 'g', self.g, loss_dict, loss)
152
+
153
+ return total_loss, loss_dict
154
+
155
+ def vq_train(self, gt, name, model, dict, total_loss, pre=None):
156
+ e_q_loss, x_recon = model(gt_poses=gt, pre_state=pre)
157
+ loss, loss_dict = self.get_loss(pred_poses=x_recon, gt_poses=gt, e_q_loss=e_q_loss, pre=pre)
158
+ # total_loss = total_loss + loss
159
+
160
+ if name == 'b':
161
+ optimizer_name = 'g_body_optimizer'
162
+ elif name == 'h':
163
+ optimizer_name = 'g_hand_optimizer'
164
+ elif name == 'g':
165
+ optimizer_name = 'g_optimizer'
166
+ else:
167
+ raise ValueError("model name must be 'b', 'h' or 'g'")
168
+ optimizer = getattr(self, optimizer_name)
169
+ optimizer.zero_grad()
170
+ loss.backward()
171
+ optimizer.step()
172
+
173
+ for key in list(loss_dict.keys()):
174
+ dict[name + key] = loss_dict.get(key, 0).item()
175
+ return dict, total_loss
176
+
177
+ def get_loss(self,
178
+ pred_poses,
179
+ gt_poses,
180
+ e_q_loss,
181
+ pre=None
182
+ ):
183
+ loss_dict = {}
184
+
185
+
186
+ rec_loss = torch.mean(torch.abs(pred_poses - gt_poses))
187
+ v_pr = pred_poses[:, 1:] - pred_poses[:, :-1]
188
+ v_gt = gt_poses[:, 1:] - gt_poses[:, :-1]
189
+ velocity_loss = torch.mean(torch.abs(v_pr - v_gt))
190
+
191
+ if pre is None:
192
+ f0_vel = 0
193
+ else:
194
+ v0_pr = pred_poses[:, 0] - pre[:, -1]
195
+ v0_gt = gt_poses[:, 0] - pre[:, -1]
196
+ f0_vel = torch.mean(torch.abs(v0_pr - v0_gt))
197
+
198
+ gen_loss = rec_loss + e_q_loss + velocity_loss + f0_vel
199
+
200
+ loss_dict['rec_loss'] = rec_loss
201
+ loss_dict['velocity_loss'] = velocity_loss
202
+ # loss_dict['e_q_loss'] = e_q_loss
203
+ if pre is not None:
204
+ loss_dict['f0_vel'] = f0_vel
205
+
206
+ return gen_loss, loss_dict
207
+
208
+ def infer_on_audio(self, aud_fn, initial_pose=None, norm_stats=None, exp=None, var=None, w_pre=False, continuity=False,
209
+ id=None, fps=15, sr=22000, smooth=False, **kwargs):
210
+ '''
211
+ initial_pose: (B, C, T), normalized
212
+ (aud_fn, txgfile) -> generated motion (B, T, C)
213
+ '''
214
+ output = []
215
+
216
+ assert self.args.infer, "train mode"
217
+ if self.composition:
218
+ self.g_body.eval()
219
+ self.g_hand.eval()
220
+ else:
221
+ self.g.eval()
222
+
223
+ if self.config.Data.pose.normalization:
224
+ assert norm_stats is not None
225
+ data_mean = norm_stats[0]
226
+ data_std = norm_stats[1]
227
+
228
+ # assert initial_pose.shape[-1] == pre_length
229
+ if initial_pose is not None:
230
+ gt = initial_pose[:, :, :].to(self.device).to(torch.float32)
231
+ pre_poses = initial_pose[:, :, :15].permute(0, 2, 1).to(self.device).to(torch.float32)
232
+ poses = initial_pose.permute(0, 2, 1).to(self.device).to(torch.float32)
233
+ B = pre_poses.shape[0]
234
+ else:
235
+ gt = None
236
+ pre_poses = None
237
+ B = 1
238
+
239
+ if type(aud_fn) == torch.Tensor:
240
+ aud_feat = torch.tensor(aud_fn, dtype=torch.float32).to(self.device)
241
+ num_poses_to_generate = aud_feat.shape[-1]
242
+ else:
243
+ aud_feat = get_mfcc_ta(aud_fn, sr=sr, fps=fps, smlpx=True, type='mfcc').transpose(1, 0)
244
+ aud_feat = aud_feat[:, :]
245
+ num_poses_to_generate = aud_feat.shape[-1]
246
+ aud_feat = aud_feat[np.newaxis, ...].repeat(B, axis=0)
247
+ aud_feat = torch.tensor(aud_feat, dtype=torch.float32).to(self.device)
248
+
249
+ # pre_poses = torch.randn(pre_poses.shape).to(self.device).to(torch.float32)
250
+ if id is None:
251
+ id = F.one_hot(torch.tensor([[0]]), self.num_classes).to(self.device)
252
+
253
+ with torch.no_grad():
254
+ aud_feat = aud_feat.permute(0, 2, 1)
255
+ gt_poses = gt[:, self.c_index].permute(0, 2, 1)
256
+ if self.composition:
257
+ if continuity:
258
+ pred_poses_body = []
259
+ pred_poses_hand = []
260
+ pre_b = None
261
+ pre_h = None
262
+ for i in range(5):
263
+ _, pred_body = self.g_body(gt_poses=gt_poses[:, i*60:(i+1)*60, :self.each_dim[1]], pre_state=pre_b)
264
+ pre_b = pred_body[..., -1:].transpose(1,2)
265
+ pred_poses_body.append(pred_body)
266
+ _, pred_hand = self.g_hand(gt_poses=gt_poses[:, i*60:(i+1)*60, self.each_dim[1]:], pre_state=pre_h)
267
+ pre_h = pred_hand[..., -1:].transpose(1,2)
268
+ pred_poses_hand.append(pred_hand)
269
+
270
+ pred_poses_body = torch.cat(pred_poses_body, dim=2)
271
+ pred_poses_hand = torch.cat(pred_poses_hand, dim=2)
272
+ else:
273
+ _, pred_poses_body = self.g_body(gt_poses=gt_poses[..., :self.each_dim[1]], id=id)
274
+ _, pred_poses_hand = self.g_hand(gt_poses=gt_poses[..., self.each_dim[1]:], id=id)
275
+ pred_poses = torch.cat([pred_poses_body, pred_poses_hand], dim=1)
276
+ else:
277
+ _, pred_poses = self.g(gt_poses=gt_poses, id=id)
278
+ pred_poses = pred_poses.transpose(1, 2).cpu().numpy()
279
+ output = pred_poses
280
+
281
+ if self.config.Data.pose.normalization:
282
+ output = denormalize(output, data_mean, data_std)
283
+
284
+ if smooth:
285
+ lamda = 0.8
286
+ smooth_f = 10
287
+ frame = 149
288
+ for i in range(smooth_f):
289
+ f = frame + i
290
+ l = lamda * (i + 1) / smooth_f
291
+ output[0, f] = (1 - l) * output[0, f - 1] + l * output[0, f]
292
+
293
+ output = np.concatenate(output, axis=1)
294
+
295
+ return output
296
+
297
+ def load_state_dict(self, state_dict):
298
+ if self.composition:
299
+ self.g_body.load_state_dict(state_dict['g_body'])
300
+ self.g_hand.load_state_dict(state_dict['g_hand'])
301
+ else:
302
+ self.g.load_state_dict(state_dict['g'])
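The loss assembled in `get_loss` above combines an L1 reconstruction term with an L1 velocity term (plus the VQ commitment loss returned by the model). A self-contained sketch with illustrative shapes:

```python
# Sketch of the reconstruction + velocity terms from TrainWrapper.get_loss
# (illustrative tensors; e_q_loss would come from the VQ-VAE itself).
import torch

pred = torch.randn(4, 88, 129)   # (B, T, C) reconstructed poses
gt = torch.randn(4, 88, 129)     # (B, T, C) ground-truth poses

rec_loss = torch.mean(torch.abs(pred - gt))
v_pr = pred[:, 1:] - pred[:, :-1]            # frame-to-frame velocities
v_gt = gt[:, 1:] - gt[:, :-1]
velocity_loss = torch.mean(torch.abs(v_pr - v_gt))

gen_loss = rec_loss + velocity_loss          # + e_q_loss (+ f0_vel when a previous clip is given)
print(gen_loss.item())
```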
nets/smplx_face.py ADDED
@@ -0,0 +1,238 @@
1
+ import os
2
+ import sys
3
+
4
+ sys.path.append(os.getcwd())
5
+
6
+ from nets.layers import *
7
+ from nets.base import TrainWrapperBaseClass
8
+ # from nets.spg.faceformer import Faceformer
9
+ from nets.spg.s2g_face import Generator as s2g_face
10
+ from losses import KeypointLoss
11
+ from nets.utils import denormalize
12
+ from data_utils import get_mfcc_psf, get_mfcc_psf_min, get_mfcc_ta
13
+ import numpy as np
14
+ import torch.optim as optim
15
+ import torch.nn.functional as F
16
+ from sklearn.preprocessing import normalize
17
+ import smplx
18
+
19
+
20
+ class TrainWrapper(TrainWrapperBaseClass):
21
+ '''
22
+ a wrapper receiving a batch from data_utils and computing the loss
23
+ '''
24
+
25
+ def __init__(self, args, config):
26
+ self.args = args
27
+ self.config = config
28
+ self.device = torch.device(self.args.gpu)
29
+ self.global_step = 0
30
+
31
+ self.convert_to_6d = self.config.Data.pose.convert_to_6d
32
+ self.expression = self.config.Data.pose.expression
33
+ self.epoch = 0
34
+ self.init_params()
35
+ self.num_classes = 4
36
+
37
+ self.generator = s2g_face(
38
+ n_poses=self.config.Data.pose.generate_length,
39
+ each_dim=self.each_dim,
40
+ dim_list=self.dim_list,
41
+ training=not self.args.infer,
42
+ device=self.device,
43
+ identity=False if self.convert_to_6d else True,
44
+ num_classes=self.num_classes,
45
+ ).to(self.device)
46
+
47
+ # self.generator = Faceformer().to(self.device)
48
+
49
+ self.discriminator = None
50
+ self.am = None
51
+
52
+ self.MSELoss = KeypointLoss().to(self.device)
53
+ super().__init__(args, config)
54
+
55
+ def init_optimizer(self):
56
+ self.generator_optimizer = optim.SGD(
57
+ filter(lambda p: p.requires_grad,self.generator.parameters()),
58
+ lr=0.001,
59
+ momentum=0.9,
60
+ nesterov=False,
61
+ )
62
+
63
+ def init_params(self):
64
+ if self.convert_to_6d:
65
+ scale = 2
66
+ else:
67
+ scale = 1
68
+
69
+ global_orient = round(3 * scale)
70
+ leye_pose = reye_pose = round(3 * scale)
71
+ jaw_pose = round(3 * scale)
72
+ body_pose = round(63 * scale)
73
+ left_hand_pose = right_hand_pose = round(45 * scale)
74
+ if self.expression:
75
+ expression = 100
76
+ else:
77
+ expression = 0
78
+
79
+ b_j = 0
80
+ jaw_dim = jaw_pose
81
+ b_e = b_j + jaw_dim
82
+ eye_dim = leye_pose + reye_pose
83
+ b_b = b_e + eye_dim
84
+ body_dim = global_orient + body_pose
85
+ b_h = b_b + body_dim
86
+ hand_dim = left_hand_pose + right_hand_pose
87
+ b_f = b_h + hand_dim
88
+ face_dim = expression
89
+
90
+ self.dim_list = [b_j, b_e, b_b, b_h, b_f]
91
+ self.full_dim = jaw_dim + eye_dim + body_dim + hand_dim + face_dim
92
+ self.pose = int(self.full_dim / round(3 * scale))
93
+ self.each_dim = [jaw_dim, eye_dim + body_dim, hand_dim, face_dim]
94
+
95
+ def __call__(self, bat):
96
+ # assert (not self.args.infer), "infer mode"
97
+ self.global_step += 1
98
+
99
+ total_loss = None
100
+ loss_dict = {}
101
+
102
+ aud, poses = bat['aud_feat'].to(self.device).to(torch.float32), bat['poses'].to(self.device).to(torch.float32)
103
+ id = bat['speaker'].to(self.device) - 20
104
+ id = F.one_hot(id, self.num_classes)
105
+
106
+ aud = aud.permute(0, 2, 1)
107
+ gt_poses = poses.permute(0, 2, 1)
108
+
109
+ if self.expression:
110
+ expression = bat['expression'].to(self.device).to(torch.float32)
111
+ gt_poses = torch.cat([gt_poses, expression.permute(0, 2, 1)], dim=2)
112
+
113
+ pred_poses, _ = self.generator(
114
+ aud,
115
+ gt_poses,
116
+ id,
117
+ )
118
+
119
+ G_loss, G_loss_dict = self.get_loss(
120
+ pred_poses=pred_poses,
121
+ gt_poses=gt_poses,
122
+ pre_poses=None,
123
+ mode='training_G',
124
+ gt_conf=None,
125
+ aud=aud,
126
+ )
127
+
128
+ self.generator_optimizer.zero_grad()
129
+ G_loss.backward()
130
+ grad = torch.nn.utils.clip_grad_norm_(self.generator.parameters(), self.config.Train.max_gradient_norm)
131
+ loss_dict['grad'] = grad.item()
132
+ self.generator_optimizer.step()
133
+
134
+ for key in list(G_loss_dict.keys()):
135
+ loss_dict[key] = G_loss_dict.get(key, 0).item()
136
+
137
+ return total_loss, loss_dict
138
+
139
+ def get_loss(self,
140
+ pred_poses,
141
+ gt_poses,
142
+ pre_poses,
143
+ aud,
144
+ mode='training_G',
145
+ gt_conf=None,
146
+ exp=1,
147
+ gt_nzero=None,
148
+ pre_nzero=None,
149
+ ):
150
+ loss_dict = {}
151
+
152
+
153
+ [b_j, b_e, b_b, b_h, b_f] = self.dim_list
154
+
155
+ MSELoss = torch.mean(torch.abs(pred_poses[:, :, :6] - gt_poses[:, :, :6]))
156
+ if self.expression:
157
+ expl = torch.mean((pred_poses[:, :, -100:] - gt_poses[:, :, -100:])**2)
158
+ else:
159
+ expl = 0
160
+
161
+ gen_loss = expl + MSELoss
162
+
163
+ loss_dict['MSELoss'] = MSELoss
164
+ if self.expression:
165
+ loss_dict['exp_loss'] = expl
166
+
167
+ return gen_loss, loss_dict
168
+
169
+ def infer_on_audio(self, aud_fn, id=None, initial_pose=None, norm_stats=None, w_pre=False, frame=None, am=None, am_sr=16000, **kwargs):
170
+ '''
171
+ initial_pose: (B, C, T), normalized
172
+ (aud_fn, txgfile) -> generated motion (B, T, C)
173
+ '''
174
+ output = []
175
+
176
+ # assert self.args.infer, "train mode"
177
+ self.generator.eval()
178
+
179
+ if self.config.Data.pose.normalization:
180
+ assert norm_stats is not None
181
+ data_mean = norm_stats[0]
182
+ data_std = norm_stats[1]
183
+
184
+ # assert initial_pose.shape[-1] == pre_length
185
+ if initial_pose is not None:
186
+ gt = initial_pose[:,:,:].permute(0, 2, 1).to(self.generator.device).to(torch.float32)
187
+ pre_poses = initial_pose[:,:,:15].permute(0, 2, 1).to(self.generator.device).to(torch.float32)
188
+ poses = initial_pose.permute(0, 2, 1).to(self.generator.device).to(torch.float32)
189
+ B = pre_poses.shape[0]
190
+ else:
191
+ gt = None
192
+ pre_poses=None
193
+ B = 1
194
+
195
+ if type(aud_fn) == torch.Tensor:
196
+ aud_feat = torch.tensor(aud_fn, dtype=torch.float32).to(self.generator.device)
197
+ num_poses_to_generate = aud_feat.shape[-1]
198
+ else:
199
+ aud_feat = get_mfcc_ta(aud_fn, am=am, am_sr=am_sr, fps=30, encoder_choice='faceformer')
200
+ aud_feat = aud_feat[np.newaxis, ...].repeat(B, axis=0)
201
+ aud_feat = torch.tensor(aud_feat, dtype=torch.float32).to(self.generator.device).transpose(1, 2)
202
+ if frame is None:
203
+ frame = aud_feat.shape[2]*30//16000
204
+ #
205
+ if id is None:
206
+ id = torch.tensor([[0, 0, 0, 0]], dtype=torch.float32, device=self.generator.device)
207
+ else:
208
+ id = F.one_hot(id, self.num_classes).to(self.generator.device)
209
+
210
+ with torch.no_grad():
211
+ pred_poses = self.generator(aud_feat, pre_poses, id, time_steps=frame)[0]
212
+ pred_poses = pred_poses.cpu().numpy()
213
+ output = pred_poses
214
+
215
+ if self.config.Data.pose.normalization:
216
+ output = denormalize(output, data_mean, data_std)
217
+
218
+ return output
219
+
220
+
221
+ def generate(self, wv2_feat, frame):
222
+ '''
223
+ initial_pose: (B, C, T), normalized
224
+ (aud_fn, txgfile) -> generated motion (B, T, C)
225
+ '''
226
+ output = []
227
+
228
+ # assert self.args.infer, "train mode"
229
+ self.generator.eval()
230
+
231
+ B = 1
232
+
233
+ id = torch.tensor([[0, 0, 0, 0]], dtype=torch.float32, device=self.generator.device)
234
+ id = id.repeat(wv2_feat.shape[0], 1)
235
+
236
+ with torch.no_grad():
237
+ pred_poses = self.generator(wv2_feat, None, id, time_steps=frame)[0]
238
+ return pred_poses
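The face branch's loss in `get_loss` above is an L1 term on the leading six pose channels plus an MSE term on the 100 expression coefficients. A self-contained sketch with illustrative shapes:

```python
# Sketch of the face loss from TrainWrapper.get_loss (illustrative shapes only).
import torch

pred_poses = torch.randn(2, 88, 106)   # (B, T, C): leading pose channels + 100 expression dims
gt_poses = torch.randn(2, 88, 106)

pose_l1 = torch.mean(torch.abs(pred_poses[:, :, :6] - gt_poses[:, :, :6]))    # jaw / eye rotation channels
exp_mse = torch.mean((pred_poses[:, :, -100:] - gt_poses[:, :, -100:]) ** 2)  # expression MSE
face_loss = pose_l1 + exp_mse
print(face_loss.item())
```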
nets/spg/gated_pixelcnn_v2.py ADDED
@@ -0,0 +1,179 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ import torch.nn.functional as F
4
+
5
+
6
+ def weights_init(m):
7
+ classname = m.__class__.__name__
8
+ if classname.find('Conv') != -1:
9
+ try:
10
+ nn.init.xavier_uniform_(m.weight.data)
11
+ m.bias.data.fill_(0)
12
+ except AttributeError:
13
+ print("Skipping initialization of ", classname)
14
+
15
+
16
+ class GatedActivation(nn.Module):
17
+ def __init__(self):
18
+ super().__init__()
19
+
20
+ def forward(self, x):
21
+ x, y = x.chunk(2, dim=1)
22
+ return torch.tanh(x) * torch.sigmoid(y)
23
+
24
+
25
+ class GatedMaskedConv2d(nn.Module):
26
+ def __init__(self, mask_type, dim, kernel, residual=True, n_classes=10, bh_model=False):
27
+ super().__init__()
28
+ assert kernel % 2 == 1, "Kernel size must be odd"
29
+ self.mask_type = mask_type
30
+ self.residual = residual
31
+ self.bh_model = bh_model
32
+
33
+ self.class_cond_embedding = nn.Embedding(n_classes, 2 * dim)
34
+ self.class_cond_embedding = self.class_cond_embedding.to("cpu")
35
+
36
+ kernel_shp = (kernel // 2 + 1, 3 if self.bh_model else 1) # (ceil(n/2), n)
37
+ padding_shp = (kernel // 2, 1 if self.bh_model else 0)
38
+ self.vert_stack = nn.Conv2d(
39
+ dim, dim * 2,
40
+ kernel_shp, 1, padding_shp
41
+ )
42
+
43
+ self.vert_to_horiz = nn.Conv2d(2 * dim, 2 * dim, 1)
44
+
45
+ kernel_shp = (1, 2)
46
+ padding_shp = (0, 1)
47
+ self.horiz_stack = nn.Conv2d(
48
+ dim, dim * 2,
49
+ kernel_shp, 1, padding_shp
50
+ )
51
+
52
+ self.horiz_resid = nn.Conv2d(dim, dim, 1)
53
+
54
+ self.gate = GatedActivation()
55
+
56
+ def make_causal(self):
57
+ self.vert_stack.weight.data[:, :, -1].zero_() # Mask final row
58
+ self.horiz_stack.weight.data[:, :, :, -1].zero_() # Mask final column
59
+
60
+ def forward(self, x_v, x_h, h):
61
+ if self.mask_type == 'A':
62
+ self.make_causal()
63
+
64
+ h = h.to(self.class_cond_embedding.weight.device)
65
+ h = self.class_cond_embedding(h)
66
+
67
+ h_vert = self.vert_stack(x_v)
68
+ h_vert = h_vert[:, :, :x_v.size(-2), :]
69
+ out_v = self.gate(h_vert + h[:, :, None, None])
70
+
71
+ if self.bh_model:
72
+ h_horiz = self.horiz_stack(x_h)
73
+ h_horiz = h_horiz[:, :, :, :x_h.size(-1)]
74
+ v2h = self.vert_to_horiz(h_vert)
75
+
76
+ out = self.gate(v2h + h_horiz + h[:, :, None, None])
77
+ if self.residual:
78
+ out_h = self.horiz_resid(out) + x_h
79
+ else:
80
+ out_h = self.horiz_resid(out)
81
+ else:
82
+ if self.residual:
83
+ out_v = self.horiz_resid(out_v) + x_v
84
+ else:
85
+ out_v = self.horiz_resid(out_v)
86
+ out_h = out_v
87
+
88
+ return out_v, out_h
89
+
90
+
91
+ class GatedPixelCNN(nn.Module):
92
+ def __init__(self, input_dim=256, dim=64, n_layers=15, n_classes=10, audio=False, bh_model=False):
93
+ super().__init__()
94
+ self.dim = dim
95
+ self.audio = audio
96
+ self.bh_model = bh_model
97
+
98
+ if self.audio:
99
+ self.embedding_aud = nn.Conv2d(256, dim, 1, 1, padding=0)
100
+ self.fusion_v = nn.Conv2d(dim * 2, dim, 1, 1, padding=0)
101
+ self.fusion_h = nn.Conv2d(dim * 2, dim, 1, 1, padding=0)
102
+
103
+ # Create embedding layer to embed input
104
+ self.embedding = nn.Embedding(input_dim, dim)
105
+
106
+ # Building the PixelCNN layer by layer
107
+ self.layers = nn.ModuleList()
108
+
109
+ # Initial block with Mask-A convolution
110
+ # Rest with Mask-B convolutions
111
+ for i in range(n_layers):
112
+ mask_type = 'A' if i == 0 else 'B'
113
+ kernel = 7 if i == 0 else 3
114
+ residual = False if i == 0 else True
115
+
116
+ self.layers.append(
117
+ GatedMaskedConv2d(mask_type, dim, kernel, residual, n_classes, bh_model)
118
+ )
119
+
120
+ # Add the output layer
121
+ self.output_conv = nn.Sequential(
122
+ nn.Conv2d(dim, 512, 1),
123
+ nn.ReLU(True),
124
+ nn.Conv2d(512, input_dim, 1)
125
+ )
126
+
127
+ self.apply(weights_init)
128
+
129
+ self.dp = nn.Dropout(0.1)
130
+ self.to("cpu")
131
+
132
+ def forward(self, x, label, aud=None):
133
+ shp = x.size() + (-1,)
134
+ x = self.embedding(x.view(-1)).view(shp) # (B, H, W, C)
135
+ x = x.permute(0, 3, 1, 2) # (B, C, W, W)
136
+
137
+ x_v, x_h = (x, x)
138
+ for i, layer in enumerate(self.layers):
139
+ if i == 1 and self.audio is True:
140
+ aud = self.embedding_aud(aud)
141
+ a = torch.ones(aud.shape[-2]).to(aud.device)
142
+ a = self.dp(a)
143
+ aud = (aud.transpose(-1, -2) * a).transpose(-1, -2)
144
+ x_v = self.fusion_v(torch.cat([x_v, aud], dim=1))
145
+ if self.bh_model:
146
+ x_h = self.fusion_h(torch.cat([x_h, aud], dim=1))
147
+ x_v, x_h = layer(x_v, x_h, label)
148
+
149
+ if self.bh_model:
150
+ return self.output_conv(x_h)
151
+ else:
152
+ return self.output_conv(x_v)
153
+
154
+ def generate(self, label, shape=(8, 8), batch_size=64, aud_feat=None, pre_latents=None, pre_audio=None):
155
+ param = next(self.parameters())
156
+ x = torch.zeros(
157
+ (batch_size, *shape),
158
+ dtype=torch.int64, device=param.device
159
+ )
160
+ if pre_latents is not None:
161
+ x = torch.cat([pre_latents, x], dim=1)
162
+ aud_feat = torch.cat([pre_audio, aud_feat], dim=2)
163
+ h0 = pre_latents.shape[1]
164
+ h = h0 + shape[0]
165
+ else:
166
+ h0 = 0
167
+ h = shape[0]
168
+
169
+ for i in range(h0, h):
170
+ for j in range(shape[1]):
171
+ if self.audio:
172
+ logits = self.forward(x, label, aud_feat)
173
+ else:
174
+ logits = self.forward(x, label)
175
+ probs = F.softmax(logits[:, :, i, j], -1)
176
+ x.data[:, i, j].copy_(
177
+ probs.multinomial(1).squeeze().data
178
+ )
179
+ return x[:, h0:h]
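A minimal usage sketch for the class above; it assumes the repo root is on `PYTHONPATH` and uses a deliberately tiny configuration so that sampling runs quickly on CPU (the training setup in `smplx_body_pixel.py` uses `input_dim=2048` with far more layers).

```python
# Hedged usage sketch for GatedPixelCNN.generate (tiny illustrative configuration).
import torch
from nets.spg.gated_pixelcnn_v2 import GatedPixelCNN  # assumes repo root on PYTHONPATH

model = GatedPixelCNN(input_dim=32, dim=16, n_layers=3, n_classes=4, audio=False, bh_model=True)
label = torch.tensor([0, 1])                          # one speaker/class id per sample

with torch.no_grad():
    codes = model.generate(label, shape=(8, 2), batch_size=2)   # autoregressive sampling
print(codes.shape)   # expected: torch.Size([2, 8, 2]) integer code indices in [0, 32)
```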
nets/spg/s2g_face.py ADDED
@@ -0,0 +1,226 @@
1
+ '''
2
+ not exactly the same as the official repo but the results are good
3
+ '''
4
+ import sys
5
+ import os
6
+
7
+ from transformers import Wav2Vec2Processor
8
+
9
+ from .wav2vec import Wav2Vec2Model
10
+ from torchaudio.sox_effects import apply_effects_tensor
11
+
12
+ sys.path.append(os.getcwd())
13
+
14
+ import numpy as np
15
+ import torch
16
+ import torch.nn as nn
17
+ import torch.nn.functional as F
18
+ import torchaudio as ta
19
+ import math
20
+ from nets.layers import SeqEncoder1D, SeqTranslator1D, ConvNormRelu
21
+
22
+
23
+ """ from https://github.com/ai4r/Gesture-Generation-from-Trimodal-Context.git """
24
+
25
+
26
+ def audio_chunking(audio: torch.Tensor, frame_rate: int = 30, chunk_size: int = 16000):
27
+ """
28
+ :param audio: 1 x T tensor containing a 16kHz audio signal
29
+ :param frame_rate: frame rate for video (we need one audio chunk per video frame)
30
+ :param chunk_size: number of audio samples per chunk
31
+ :return: num_chunks x chunk_size tensor containing sliced audio
32
+ """
33
+ samples_per_frame = 16000 // frame_rate
34
+ padding = (chunk_size - samples_per_frame) // 2
35
+ audio = torch.nn.functional.pad(audio.unsqueeze(0), pad=[padding, padding]).squeeze(0)
36
+ anchor_points = list(range(chunk_size//2, audio.shape[-1]-chunk_size//2, samples_per_frame))
37
+ audio = torch.cat([audio[:, i-chunk_size//2:i+chunk_size//2] for i in anchor_points], dim=0)
38
+ return audio
39
+
40
+
41
+ class MeshtalkEncoder(nn.Module):
42
+ def __init__(self, latent_dim: int = 128, model_name: str = 'audio_encoder'):
43
+ """
44
+ :param latent_dim: size of the latent audio embedding
45
+ :param model_name: name of the model, used to load and save the model
46
+ """
47
+ super().__init__()
48
+
49
+ self.melspec = ta.transforms.MelSpectrogram(
50
+ sample_rate=16000, n_fft=2048, win_length=800, hop_length=160, n_mels=80
51
+ )
52
+
53
+ conv_len = 5
54
+ self.convert_dimensions = torch.nn.Conv1d(80, 128, kernel_size=conv_len)
55
+ self.weights_init(self.convert_dimensions)
56
+ self.receptive_field = conv_len
57
+
58
+ convs = []
59
+ for i in range(6):
60
+ dilation = 2 * (i % 3 + 1)
61
+ self.receptive_field += (conv_len - 1) * dilation
62
+ convs += [torch.nn.Conv1d(128, 128, kernel_size=conv_len, dilation=dilation)]
63
+ self.weights_init(convs[-1])
64
+ self.convs = torch.nn.ModuleList(convs)
65
+ self.code = torch.nn.Linear(128, latent_dim)
66
+
67
+ self.apply(lambda x: self.weights_init(x))
68
+
69
+ def weights_init(self, m):
70
+ if isinstance(m, torch.nn.Conv1d):
71
+ torch.nn.init.xavier_uniform_(m.weight)
72
+ try:
73
+ torch.nn.init.constant_(m.bias, .01)
74
+ except:
75
+ pass
76
+
77
+ def forward(self, audio: torch.Tensor):
78
+ """
79
+ :param audio: B x T x 16000 Tensor containing 1 sec of audio centered around the current time frame
80
+ :return: code: B x T x latent_dim Tensor containing a latent audio code/embedding
81
+ """
82
+ B, T = audio.shape[0], audio.shape[1]
83
+ x = self.melspec(audio).squeeze(1)
84
+ x = torch.log(x.clamp(min=1e-10, max=None))
85
+ if T == 1:
86
+ x = x.unsqueeze(1)
87
+
88
+ # Convert to the right dimensionality
89
+ x = x.view(-1, x.shape[2], x.shape[3])
90
+ x = F.leaky_relu(self.convert_dimensions(x), .2)
91
+
92
+ # Process stacks
93
+ for conv in self.convs:
94
+ x_ = F.leaky_relu(conv(x), .2)
95
+ if self.training:
96
+ x_ = F.dropout(x_, .2)
97
+ l = (x.shape[2] - x_.shape[2]) // 2
98
+ x = (x[:, :, l:-l] + x_) / 2
99
+
100
+ x = torch.mean(x, dim=-1)
101
+ x = x.view(B, T, x.shape[-1])
102
+ x = self.code(x)
103
+
104
+ return {"code": x}
105
+
106
+
107
+ class AudioEncoder(nn.Module):
108
+ def __init__(self, in_dim, out_dim, identity=False, num_classes=0):
109
+ super().__init__()
110
+ self.identity = identity
111
+ if self.identity:
112
+ in_dim = in_dim + 64
113
+ self.id_mlp = nn.Conv1d(num_classes, 64, 1, 1)
114
+ self.first_net = SeqTranslator1D(in_dim, out_dim,
115
+ min_layers_num=3,
116
+ residual=True,
117
+ norm='ln'
118
+ )
119
+ self.grus = nn.GRU(out_dim, out_dim, 1, batch_first=True)
120
+ self.dropout = nn.Dropout(0.1)
121
+ # self.att = nn.MultiheadAttention(out_dim, 4, dropout=0.1, batch_first=True)
122
+
123
+ def forward(self, spectrogram, pre_state=None, id=None, time_steps=None):
124
+
125
+ spectrogram = spectrogram
126
+ spectrogram = self.dropout(spectrogram)
127
+ if self.identity:
128
+ id = id.reshape(id.shape[0], -1, 1).repeat(1, 1, spectrogram.shape[2]).to(torch.float32)
129
+ id = self.id_mlp(id)
130
+ spectrogram = torch.cat([spectrogram, id], dim=1)
131
+ x1 = self.first_net(spectrogram)# .permute(0, 2, 1)
132
+ if time_steps is not None:
133
+ x1 = F.interpolate(x1, size=time_steps, align_corners=False, mode='linear')
134
+ # x1, _ = self.att(x1, x1, x1)
135
+ # x1, hidden_state = self.grus(x1)
136
+ # x1 = x1.permute(0, 2, 1)
137
+ hidden_state=None
138
+
139
+ return x1, hidden_state
140
+
141
+
142
+ class Generator(nn.Module):
143
+ def __init__(self,
144
+ n_poses,
145
+ each_dim: list,
146
+ dim_list: list,
147
+ training=False,
148
+ device=None,
149
+ identity=True,
150
+ num_classes=0,
151
+ ):
152
+ super().__init__()
153
+
154
+ self.training = training
155
+ self.device = device
156
+ self.gen_length = n_poses
157
+ self.identity = identity
158
+
159
+ norm = 'ln'
160
+ in_dim = 256
161
+ out_dim = 256
162
+
163
+ self.encoder_choice = 'faceformer'
164
+
165
+ if self.encoder_choice == 'meshtalk':
166
+ self.audio_encoder = MeshtalkEncoder(latent_dim=in_dim)
167
+ elif self.encoder_choice == 'faceformer':
168
+ # wav2vec 2.0 weights initialization
169
+ self.audio_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h") # "vitouphy/wav2vec2-xls-r-300m-phoneme""facebook/wav2vec2-base-960h"
170
+ self.audio_encoder.feature_extractor._freeze_parameters()
171
+ self.audio_feature_map = nn.Linear(768, in_dim)
172
+ else:
173
+ self.audio_encoder = AudioEncoder(in_dim=64, out_dim=out_dim)
174
+
175
+ self.audio_middle = AudioEncoder(in_dim, out_dim, identity, num_classes)
176
+
177
+ self.dim_list = dim_list
178
+
179
+ self.decoder = nn.ModuleList()
180
+ self.final_out = nn.ModuleList()
181
+
182
+ self.decoder.append(nn.Sequential(
183
+ ConvNormRelu(out_dim, 64, norm=norm),
184
+ ConvNormRelu(64, 64, norm=norm),
185
+ ConvNormRelu(64, 64, norm=norm),
186
+ ))
187
+ self.final_out.append(nn.Conv1d(64, each_dim[0], 1, 1))
188
+
189
+ self.decoder.append(nn.Sequential(
190
+ ConvNormRelu(out_dim, out_dim, norm=norm),
191
+ ConvNormRelu(out_dim, out_dim, norm=norm),
192
+ ConvNormRelu(out_dim, out_dim, norm=norm),
193
+ ))
194
+ self.final_out.append(nn.Conv1d(out_dim, each_dim[3], 1, 1))
195
+
196
+ def forward(self, in_spec, gt_poses=None, id=None, pre_state=None, time_steps=None):
197
+ if self.training:
198
+ time_steps = gt_poses.shape[1]
199
+
200
+ # vector, hidden_state = self.audio_encoder(in_spec, pre_state, time_steps=time_steps)
201
+ if self.encoder_choice == 'meshtalk':
202
+ in_spec = audio_chunking(in_spec.squeeze(-1), frame_rate=30, chunk_size=16000)
203
+ feature = self.audio_encoder(in_spec.unsqueeze(0))["code"].transpose(1, 2)
204
+ elif self.encoder_choice == 'faceformer':
205
+ hidden_states = self.audio_encoder(in_spec.reshape(in_spec.shape[0], -1), frame_num=time_steps).last_hidden_state
206
+ feature = self.audio_feature_map(hidden_states).transpose(1, 2)
207
+ else:
208
+ feature, hidden_state = self.audio_encoder(in_spec, pre_state, time_steps=time_steps)
209
+
210
+ # hidden_states = in_spec
211
+
212
+ feature, _ = self.audio_middle(feature, id=id)
213
+
214
+ out = []
215
+
216
+ for i in range(self.decoder.__len__()):
217
+ mid = self.decoder[i](feature)
218
+ mid = self.final_out[i](mid)
219
+ out.append(mid)
220
+
221
+ out = torch.cat(out, dim=1)
222
+ out = out.transpose(1, 2)
223
+
224
+ return out, None
225
+
226
+
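The `audio_chunking` helper at the top of this file slices a 16 kHz waveform into one overlapping one-second window per video frame. A small sketch of what it produces is below; importing this module pulls in its dependencies (e.g. `transformers`, `torchaudio`), so the project requirements must be installed.

```python
# Hedged sketch of audio_chunking's output shape (assumes project requirements installed).
import torch
from nets.spg.s2g_face import audio_chunking

wav = torch.randn(1, 32000)                              # 2 s of 16 kHz audio, shape (1, T)
chunks = audio_chunking(wav, frame_rate=30, chunk_size=16000)
print(chunks.shape)                                      # (60, 16000): one 1 s chunk per 30 fps frame
```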
nets/spg/s2glayers.py ADDED
@@ -0,0 +1,522 @@
1
+ '''
2
+ not exactly the same as the official repo but the results are good
3
+ '''
4
+ import sys
5
+ import os
6
+
7
+ sys.path.append(os.getcwd())
8
+
9
+ import numpy as np
10
+ import torch
11
+ import torch.nn as nn
12
+ import torch.nn.functional as F
13
+ import math
14
+ from nets.layers import SeqEncoder1D, SeqTranslator1D
15
+
16
+ """ from https://github.com/ai4r/Gesture-Generation-from-Trimodal-Context.git """
17
+
18
+
19
+ class Conv2d_tf(nn.Conv2d):
20
+ """
21
+ Conv2d with the padding behavior from TF
22
+ from https://github.com/mlperf/inference/blob/482f6a3beb7af2fb0bd2d91d6185d5e71c22c55f/others/edge/object_detection/ssd_mobilenet/pytorch/utils.py
23
+ """
24
+
25
+ def __init__(self, *args, **kwargs):
26
+ super(Conv2d_tf, self).__init__(*args, **kwargs)
27
+ self.padding = kwargs.get("padding", "SAME")
28
+
29
+ def _compute_padding(self, input, dim):
30
+ input_size = input.size(dim + 2)
31
+ filter_size = self.weight.size(dim + 2)
32
+ effective_filter_size = (filter_size - 1) * self.dilation[dim] + 1
33
+ out_size = (input_size + self.stride[dim] - 1) // self.stride[dim]
34
+ total_padding = max(
35
+ 0, (out_size - 1) * self.stride[dim] + effective_filter_size - input_size
36
+ )
37
+ additional_padding = int(total_padding % 2 != 0)
38
+
39
+ return additional_padding, total_padding
40
+
41
+ def forward(self, input):
42
+ if self.padding == "VALID":
43
+ return F.conv2d(
44
+ input,
45
+ self.weight,
46
+ self.bias,
47
+ self.stride,
48
+ padding=0,
49
+ dilation=self.dilation,
50
+ groups=self.groups,
51
+ )
52
+ rows_odd, padding_rows = self._compute_padding(input, dim=0)
53
+ cols_odd, padding_cols = self._compute_padding(input, dim=1)
54
+ if rows_odd or cols_odd:
55
+ input = F.pad(input, [0, cols_odd, 0, rows_odd])
56
+
57
+ return F.conv2d(
58
+ input,
59
+ self.weight,
60
+ self.bias,
61
+ self.stride,
62
+ padding=(padding_rows // 2, padding_cols // 2),
63
+ dilation=self.dilation,
64
+ groups=self.groups,
65
+ )
66
+
67
+
68
+ class Conv1d_tf(nn.Conv1d):
69
+ """
70
+ Conv1d with the padding behavior from TF
71
+ modified from https://github.com/mlperf/inference/blob/482f6a3beb7af2fb0bd2d91d6185d5e71c22c55f/others/edge/object_detection/ssd_mobilenet/pytorch/utils.py
72
+ """
73
+
74
+ def __init__(self, *args, **kwargs):
75
+ super(Conv1d_tf, self).__init__(*args, **kwargs)
76
+ self.padding = kwargs.get("padding")
77
+
78
+ def _compute_padding(self, input, dim):
79
+ input_size = input.size(dim + 2)
80
+ filter_size = self.weight.size(dim + 2)
81
+ effective_filter_size = (filter_size - 1) * self.dilation[dim] + 1
82
+ out_size = (input_size + self.stride[dim] - 1) // self.stride[dim]
83
+ total_padding = max(
84
+ 0, (out_size - 1) * self.stride[dim] + effective_filter_size - input_size
85
+ )
86
+ additional_padding = int(total_padding % 2 != 0)
87
+
88
+ return additional_padding, total_padding
89
+
90
+ def forward(self, input):
91
+ # if self.padding == "valid":
92
+ # return F.conv1d(
93
+ # input,
94
+ # self.weight,
95
+ # self.bias,
96
+ # self.stride,
97
+ # padding=0,
98
+ # dilation=self.dilation,
99
+ # groups=self.groups,
100
+ # )
101
+ rows_odd, padding_rows = self._compute_padding(input, dim=0)
102
+ if rows_odd:
103
+ input = F.pad(input, [0, rows_odd])
104
+
105
+ return F.conv1d(
106
+ input,
107
+ self.weight,
108
+ self.bias,
109
+ self.stride,
110
+ padding=(padding_rows // 2),
111
+ dilation=self.dilation,
112
+ groups=self.groups,
113
+ )
114
+
115
+
116
+ def ConvNormRelu(in_channels, out_channels, type='1d', downsample=False, k=None, s=None, padding='valid', groups=1,
117
+ nonlinear='lrelu', bn='bn'):
118
+ if k is None and s is None:
119
+ if not downsample:
120
+ k = 3
121
+ s = 1
122
+ padding = 'same'
123
+ else:
124
+ k = 4
125
+ s = 2
126
+ padding = 'valid'
127
+
128
+ if type == '1d':
129
+ conv_block = Conv1d_tf(in_channels, out_channels, kernel_size=k, stride=s, padding=padding, groups=groups)
130
+ norm_block = nn.BatchNorm1d(out_channels)
131
+ elif type == '2d':
132
+ conv_block = Conv2d_tf(in_channels, out_channels, kernel_size=k, stride=s, padding=padding, groups=groups)
133
+ norm_block = nn.BatchNorm2d(out_channels)
134
+ else:
135
+ assert False
136
+ if bn != 'bn':
137
+ if bn == 'gn':
138
+ norm_block = nn.GroupNorm(1, out_channels)
139
+ elif bn == 'ln':
140
+ norm_block = nn.LayerNorm(out_channels)
141
+ else:
142
+ norm_block = nn.Identity()
143
+ if nonlinear == 'lrelu':
144
+ nlinear = nn.LeakyReLU(0.2, True)
145
+ elif nonlinear == 'tanh':
146
+ nlinear = nn.Tanh()
147
+ elif nonlinear == 'none':
148
+ nlinear = nn.Identity()
149
+
150
+ return nn.Sequential(
151
+ conv_block,
152
+ norm_block,
153
+ nlinear
154
+ )
155
+
156
+
157
+ class UnetUp(nn.Module):
158
+ def __init__(self, in_ch, out_ch):
159
+ super(UnetUp, self).__init__()
160
+ self.conv = ConvNormRelu(in_ch, out_ch)
161
+
162
+ def forward(self, x1, x2):
163
+ # x1 = torch.repeat_interleave(x1, 2, dim=2)
164
+ # x1 = x1[:, :, :x2.shape[2]]
165
+ x1 = torch.nn.functional.interpolate(x1, size=x2.shape[2], mode='linear')
166
+ x = x1 + x2
167
+ x = self.conv(x)
168
+ return x
169
+
170
+
171
+ class UNet(nn.Module):
172
+ def __init__(self, input_dim, dim):
173
+ super(UNet, self).__init__()
174
+ # dim = 512
175
+ self.down1 = nn.Sequential(
176
+ ConvNormRelu(input_dim, input_dim, '1d', False),
177
+ ConvNormRelu(input_dim, dim, '1d', False),
178
+ ConvNormRelu(dim, dim, '1d', False)
179
+ )
180
+ self.gru = nn.GRU(dim, dim, 1, batch_first=True)
181
+ self.down2 = ConvNormRelu(dim, dim, '1d', True)
182
+ self.down3 = ConvNormRelu(dim, dim, '1d', True)
183
+ self.down4 = ConvNormRelu(dim, dim, '1d', True)
184
+ self.down5 = ConvNormRelu(dim, dim, '1d', True)
185
+ self.down6 = ConvNormRelu(dim, dim, '1d', True)
186
+ self.up1 = UnetUp(dim, dim)
187
+ self.up2 = UnetUp(dim, dim)
188
+ self.up3 = UnetUp(dim, dim)
189
+ self.up4 = UnetUp(dim, dim)
190
+ self.up5 = UnetUp(dim, dim)
191
+
192
+ def forward(self, x1, pre_pose=None, w_pre=False):
193
+ x2_0 = self.down1(x1)
194
+ if w_pre:
195
+ i = 1
196
+ x2_pre = self.gru(x2_0[:,:,0:i].permute(0,2,1), pre_pose[:,:,-1:].permute(2,0,1).contiguous())[0].permute(0,2,1)
197
+ x2 = torch.cat([x2_pre, x2_0[:,:,i:]], dim=-1)
198
+ # x2 = torch.cat([pre_pose, x2_0], dim=2) # [B, 512, 15]
199
+ else:
200
+ # x2 = self.gru(x2_0.transpose(1, 2))[0].transpose(1,2)
201
+ x2 = x2_0
202
+ x3 = self.down2(x2)
203
+ x4 = self.down3(x3)
204
+ x5 = self.down4(x4)
205
+ x6 = self.down5(x5)
206
+ x7 = self.down6(x6)
207
+ x = self.up1(x7, x6)
208
+ x = self.up2(x, x5)
209
+ x = self.up3(x, x4)
210
+ x = self.up4(x, x3)
211
+ x = self.up5(x, x2) # [B, 512, 15]
212
+ return x, x2_0
213
+
214
+
215
+ class AudioEncoder(nn.Module):
216
+ def __init__(self, n_frames, template_length, pose=False, common_dim=512):
217
+ super().__init__()
218
+ self.n_frames = n_frames
219
+ self.pose = pose
220
+ self.step = 0
221
+ self.weight = 0
222
+ if self.pose:
223
+ # self.first_net = nn.Sequential(
224
+ # ConvNormRelu(1, 64, '2d', False),
225
+ # ConvNormRelu(64, 64, '2d', True),
226
+ # ConvNormRelu(64, 128, '2d', False),
227
+ # ConvNormRelu(128, 128, '2d', True),
228
+ # ConvNormRelu(128, 256, '2d', False),
229
+ # ConvNormRelu(256, 256, '2d', True),
230
+ # ConvNormRelu(256, 256, '2d', False),
231
+ # ConvNormRelu(256, 256, '2d', False, padding='VALID')
232
+ # )
233
+ # decoder_layer = nn.TransformerDecoderLayer(d_model=args.feature_dim, nhead=4,
234
+ # dim_feedforward=2 * args.feature_dim, batch_first=True)
235
+ # a = nn.TransformerDecoder
236
+ self.first_net = SeqTranslator1D(256, 256,
237
+ min_layers_num=4,
238
+ residual=True
239
+ )
240
+ self.dropout_0 = nn.Dropout(0.1)
241
+ self.mu_fc = nn.Conv1d(256, 128, 1, 1)
242
+ self.var_fc = nn.Conv1d(256, 128, 1, 1)
243
+ self.trans_motion = SeqTranslator1D(common_dim, common_dim,
244
+ kernel_size=1,
245
+ stride=1,
246
+ min_layers_num=3,
247
+ residual=True
248
+ )
249
+ # self.att = nn.MultiheadAttention(64 + template_length, 4, dropout=0.1)
250
+ self.unet = UNet(128 + template_length, common_dim)
251
+
252
+ else:
253
+ self.first_net = SeqTranslator1D(256, 256,
254
+ min_layers_num=4,
255
+ residual=True
256
+ )
257
+ self.dropout_0 = nn.Dropout(0.1)
258
+ # self.att = nn.MultiheadAttention(256, 4, dropout=0.1)
259
+ self.unet = UNet(256, 256)
260
+ self.dropout_1 = nn.Dropout(0.0)
261
+
262
+ def forward(self, spectrogram, time_steps=None, template=None, pre_pose=None, w_pre=False):
263
+ self.step = self.step + 1
264
+ if self.pose:
265
+ spect = spectrogram.transpose(1, 2)
266
+ if w_pre:
267
+ spect = spect[:, :, :]
268
+
269
+ out = self.first_net(spect)
270
+ out = self.dropout_0(out)
271
+
272
+ mu = self.mu_fc(out)
273
+ var = self.var_fc(out)
274
+ audio = self.__reparam(mu, var)
275
+ # audio = out
276
+
277
+ # template = self.trans_motion(template)
278
+ x1 = torch.cat([audio, template], dim=1)#.permute(2,0,1)
279
+ # x1 = out
280
+ #x1, _ = self.att(x1, x1, x1)
281
+ #x1 = x1.permute(1,2,0)
282
+ x1, x2_0 = self.unet(x1, pre_pose=pre_pose, w_pre=w_pre)
283
+ else:
284
+ spectrogram = spectrogram.transpose(1, 2)
285
+ x1 = self.first_net(spectrogram)#.permute(2,0,1)
286
+ #out, _ = self.att(out, out, out)
287
+ #out = out.permute(1, 2, 0)
288
+ x1 = self.dropout_0(x1)
289
+ x1, x2_0 = self.unet(x1)
290
+ x1 = self.dropout_1(x1)
291
+ mu = None
292
+ var = None
293
+
294
+ return x1, (mu, var), x2_0
295
+
296
+ def __reparam(self, mu, log_var):
297
+ std = torch.exp(0.5 * log_var)
298
+ eps = torch.randn_like(std, device='cuda')
299
+ z = eps * std + mu
300
+ return z
301
+
302
+
303
+ class Generator(nn.Module):
304
+ def __init__(self,
305
+ n_poses,
306
+ pose_dim,
307
+ pose,
308
+ n_pre_poses,
309
+ each_dim: list,
310
+ dim_list: list,
311
+ use_template=False,
312
+ template_length=0,
313
+ training=False,
314
+ device=None,
315
+ separate=False,
316
+ expression=False
317
+ ):
318
+ super().__init__()
319
+
320
+ self.use_template = use_template
321
+ self.template_length = template_length
322
+ self.training = training
323
+ self.device = device
324
+ self.separate = separate
325
+ self.pose = pose
326
+ self.decoderf = True
327
+ self.expression = expression
328
+
329
+ common_dim = 256
330
+
331
+ if self.use_template:
332
+ assert template_length > 0
333
+ # self.KLLoss = KLLoss(kl_tolerance=self.config.Train.weights.kl_tolerance).to(self.device)
334
+ # self.pose_encoder = SeqEncoder1D(
335
+ # C_in=pose_dim,
336
+ # C_out=512,
337
+ # T_in=n_poses,
338
+ # min_layer_nums=6
339
+ #
340
+ # )
341
+ self.pose_encoder = SeqTranslator1D(pose_dim - 50, common_dim,
342
+ # kernel_size=1,
343
+ # stride=1,
344
+ min_layers_num=3,
345
+ residual=True
346
+ )
347
+ self.mu_fc = nn.Conv1d(common_dim, template_length, kernel_size=1, stride=1)
348
+ self.var_fc = nn.Conv1d(common_dim, template_length, kernel_size=1, stride=1)
349
+
350
+ else:
351
+ self.template_length = 0
352
+
353
+ self.gen_length = n_poses
354
+
355
+ self.audio_encoder = AudioEncoder(n_poses, template_length, True, common_dim)
356
+ self.speech_encoder = AudioEncoder(n_poses, template_length, False)
357
+
358
+ # self.pre_pose_encoder = SeqEncoder1D(
359
+ # C_in=pose_dim,
360
+ # C_out=128,
361
+ # T_in=15,
362
+ # min_layer_nums=3
363
+ #
364
+ # )
365
+ # self.pmu_fc = nn.Linear(128, 64)
366
+ # self.pvar_fc = nn.Linear(128, 64)
367
+
368
+ self.pre_pose_encoder = SeqTranslator1D(pose_dim-50, common_dim,
369
+ min_layers_num=5,
370
+ residual=True
371
+ )
372
+ self.decoder_in = 256 + 64
373
+ self.dim_list = dim_list
374
+
375
+ if self.separate:
376
+ self.decoder = nn.ModuleList()
377
+ self.final_out = nn.ModuleList()
378
+
379
+ self.decoder.append(nn.Sequential(
380
+ ConvNormRelu(256, 64),
381
+ ConvNormRelu(64, 64),
382
+ ConvNormRelu(64, 64),
383
+ ))
384
+ self.final_out.append(nn.Conv1d(64, each_dim[0], 1, 1))
385
+
386
+ self.decoder.append(nn.Sequential(
387
+ ConvNormRelu(common_dim, common_dim),
388
+ ConvNormRelu(common_dim, common_dim),
389
+ ConvNormRelu(common_dim, common_dim),
390
+ ))
391
+ self.final_out.append(nn.Conv1d(common_dim, each_dim[1], 1, 1))
392
+
393
+ self.decoder.append(nn.Sequential(
394
+ ConvNormRelu(common_dim, common_dim),
395
+ ConvNormRelu(common_dim, common_dim),
396
+ ConvNormRelu(common_dim, common_dim),
397
+ ))
398
+ self.final_out.append(nn.Conv1d(common_dim, each_dim[2], 1, 1))
399
+
400
+ if self.expression:
401
+ self.decoder.append(nn.Sequential(
402
+ ConvNormRelu(256, 256),
403
+ ConvNormRelu(256, 256),
404
+ ConvNormRelu(256, 256),
405
+ ))
406
+ self.final_out.append(nn.Conv1d(256, each_dim[3], 1, 1))
407
+ else:
408
+ self.decoder = nn.Sequential(
409
+ ConvNormRelu(self.decoder_in, 512),
410
+ ConvNormRelu(512, 512),
411
+ ConvNormRelu(512, 512),
412
+ ConvNormRelu(512, 512),
413
+ ConvNormRelu(512, 512),
414
+ ConvNormRelu(512, 512),
415
+ )
416
+ self.final_out = nn.Conv1d(512, pose_dim, 1, 1)
417
+
418
+ def __reparam(self, mu, log_var):
419
+ std = torch.exp(0.5 * log_var)
420
+ eps = torch.randn_like(std, device=self.device)
421
+ z = eps * std + mu
422
+ return z
423
+
424
+ def forward(self, in_spec, pre_poses, gt_poses, template=None, time_steps=None, w_pre=False, norm=True):
425
+ if time_steps is not None:
426
+ self.gen_length = time_steps
427
+
428
+ if self.use_template:
429
+ if self.training:
430
+ if w_pre:
431
+ in_spec = in_spec[:, 15:, :]
432
+ pre_pose = self.pre_pose_encoder(gt_poses[:, 14:15, :-50].permute(0, 2, 1))
433
+ pose_enc = self.pose_encoder(gt_poses[:, 15:, :-50].permute(0, 2, 1))
434
+ mu = self.mu_fc(pose_enc)
435
+ var = self.var_fc(pose_enc)
436
+ template = self.__reparam(mu, var)
437
+ else:
438
+ pre_pose = None
439
+ pose_enc = self.pose_encoder(gt_poses[:, :, :-50].permute(0, 2, 1))
440
+ mu = self.mu_fc(pose_enc)
441
+ var = self.var_fc(pose_enc)
442
+ template = self.__reparam(mu, var)
443
+ elif pre_poses is not None:
444
+ if w_pre:
445
+ pre_pose = pre_poses[:, -1:, :-50]
446
+ if norm:
447
+ pre_pose = pre_pose.reshape(1, -1, 55, 5)
448
+ pre_pose = torch.cat([F.normalize(pre_pose[..., :3], dim=-1),
449
+ F.normalize(pre_pose[..., 3:5], dim=-1)],
450
+ dim=-1).reshape(1, -1, 275)
451
+ pre_pose = self.pre_pose_encoder(pre_pose.permute(0, 2, 1))
452
+ template = torch.randn([in_spec.shape[0], self.template_length, self.gen_length ]).to(
453
+ in_spec.device)
454
+ else:
455
+ pre_pose = None
456
+ template = torch.randn([in_spec.shape[0], self.template_length, self.gen_length]).to(in_spec.device)
457
+ elif gt_poses is not None:
458
+ template = self.pre_pose_encoder(gt_poses[:, :, :-50].permute(0, 2, 1))
459
+ elif template is None:
460
+ pre_pose = None
461
+ template = torch.randn([in_spec.shape[0], self.template_length, self.gen_length]).to(in_spec.device)
462
+ else:
463
+ template = None
464
+ mu = None
465
+ var = None
466
+
467
+ a_t_f, (mu2, var2), x2_0 = self.audio_encoder(in_spec, time_steps=time_steps, template=template, pre_pose=pre_pose, w_pre=w_pre)
468
+ s_f, _, _ = self.speech_encoder(in_spec, time_steps=time_steps)
469
+
470
+ out = []
471
+
472
+ if self.separate:
473
+ for i in range(len(self.decoder)):
474
+ if i == 0 or i == 3:
475
+ mid = self.decoder[i](s_f)
476
+ else:
477
+ mid = self.decoder[i](a_t_f)
478
+ mid = self.final_out[i](mid)
479
+ out.append(mid)
480
+ out = torch.cat(out, dim=1)
481
+
482
+ else:
483
+ out = self.decoder(a_t_f)
484
+ out = self.final_out(out)
485
+
486
+ out = out.transpose(1, 2)
487
+
488
+ if self.training:
489
+ if w_pre:
490
+ return out, template, mu, var, (mu2, var2, x2_0, pre_pose)
491
+ else:
492
+ return out, template, mu, var, (mu2, var2, None, None)
493
+ else:
494
+ return out
495
+
496
+
497
+ class Discriminator(nn.Module):
498
+ def __init__(self, pose_dim, pose):
499
+ super().__init__()
500
+ self.net = nn.Sequential(
501
+ Conv1d_tf(pose_dim, 64, kernel_size=4, stride=2, padding='SAME'),
502
+ nn.LeakyReLU(0.2, True),
503
+ ConvNormRelu(64, 128, '1d', True),
504
+ ConvNormRelu(128, 256, '1d', k=4, s=1),
505
+ Conv1d_tf(256, 1, kernel_size=4, stride=1, padding='SAME'),
506
+ )
507
+
508
+ def forward(self, x):
509
+ x = x.transpose(1, 2)
510
+
511
+ out = self.net(x)
512
+ return out
513
+
514
+
515
+ def main():
516
+ d = Discriminator(275, 55)
517
+ x = torch.randn([8, 60, 275])
518
+ result = d(x)
519
+
520
+
521
+ if __name__ == "__main__":
522
+ main()
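Both `__reparam` helpers above implement the standard VAE reparameterization trick, and `UnetUp` matches temporal lengths with linear interpolation before its additive skip connection. The following is a minimal, self-contained sketch of those two ideas only; the tensor shapes are illustrative assumptions, not values taken from this repository:

```python
import torch
import torch.nn.functional as F

def reparam(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    # z = mu + eps * std, with std = exp(0.5 * log_var) and eps ~ N(0, I)
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)           # device/dtype follow mu and log_var
    return mu + eps * std

# Additive skip connection with length matching, as in UnetUp.forward
x1 = torch.randn(2, 128, 8)               # coarse features [B, C, T1]
x2 = torch.randn(2, 128, 15)              # skip features   [B, C, T2]
x1 = F.interpolate(x1, size=x2.shape[2], mode='linear')
merged = x1 + x2                           # [2, 128, 15]

mu, log_var = torch.zeros(2, 64, 15), torch.zeros(2, 64, 15)
z = reparam(mu, log_var)                   # [2, 64, 15]
```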
nets/spg/vqvae_1d.py ADDED
@@ -0,0 +1,235 @@
1
+ import os
2
+ import numpy as np
3
+ import torch
4
+ import torch.nn as nn
5
+ import torch.nn.functional as F
6
+ from .wav2vec import Wav2Vec2Model
7
+ from .vqvae_modules import VectorQuantizerEMA, ConvNormRelu, Res_CNR_Stack
8
+
9
+
10
+
11
+ class AudioEncoder(nn.Module):
12
+ def __init__(self, in_dim, num_hiddens, num_residual_layers, num_residual_hiddens):
13
+ super(AudioEncoder, self).__init__()
14
+ self._num_hiddens = num_hiddens
15
+ self._num_residual_layers = num_residual_layers
16
+ self._num_residual_hiddens = num_residual_hiddens
17
+
18
+ self.project = ConvNormRelu(in_dim, self._num_hiddens // 4, leaky=True)
19
+
20
+ self._enc_1 = Res_CNR_Stack(self._num_hiddens // 4, self._num_residual_layers, leaky=True)
21
+ self._down_1 = ConvNormRelu(self._num_hiddens // 4, self._num_hiddens // 2, leaky=True, residual=True,
22
+ sample='down')
23
+ self._enc_2 = Res_CNR_Stack(self._num_hiddens // 2, self._num_residual_layers, leaky=True)
24
+ self._down_2 = ConvNormRelu(self._num_hiddens // 2, self._num_hiddens, leaky=True, residual=True, sample='down')
25
+ self._enc_3 = Res_CNR_Stack(self._num_hiddens, self._num_residual_layers, leaky=True)
26
+
27
+ def forward(self, x, frame_num=0):
28
+ h = self.project(x)
29
+ h = self._enc_1(h)
30
+ h = self._down_1(h)
31
+ h = self._enc_2(h)
32
+ h = self._down_2(h)
33
+ h = self._enc_3(h)
34
+ return h
35
+
36
+
37
+ class Wav2VecEncoder(nn.Module):
38
+ def __init__(self, num_hiddens, num_residual_layers):
39
+ super(Wav2VecEncoder, self).__init__()
40
+ self._num_hiddens = num_hiddens
41
+ self._num_residual_layers = num_residual_layers
42
+
43
+ self.audio_encoder = Wav2Vec2Model.from_pretrained(
44
+ "facebook/wav2vec2-base-960h") # "vitouphy/wav2vec2-xls-r-300m-phoneme""facebook/wav2vec2-base-960h"
45
+ self.audio_encoder.feature_extractor._freeze_parameters()
46
+
47
+ self.project = ConvNormRelu(768, self._num_hiddens, leaky=True)
48
+
49
+ self._enc_1 = Res_CNR_Stack(self._num_hiddens, self._num_residual_layers, leaky=True)
50
+ self._down_1 = ConvNormRelu(self._num_hiddens, self._num_hiddens, leaky=True, residual=True, sample='down')
51
+ self._enc_2 = Res_CNR_Stack(self._num_hiddens, self._num_residual_layers, leaky=True)
52
+ self._down_2 = ConvNormRelu(self._num_hiddens, self._num_hiddens, leaky=True, residual=True, sample='down')
53
+ self._enc_3 = Res_CNR_Stack(self._num_hiddens, self._num_residual_layers, leaky=True)
54
+
55
+ def forward(self, x, frame_num):
56
+ h = self.audio_encoder(x.squeeze(), frame_num=frame_num).last_hidden_state.transpose(1, 2)
57
+ h = self.project(h)
58
+ h = self._enc_1(h)
59
+ h = self._down_1(h)
60
+ h = self._enc_2(h)
61
+ h = self._down_2(h)
62
+ h = self._enc_3(h)
63
+ return h
64
+
65
+
66
+ class Encoder(nn.Module):
67
+ def __init__(self, in_dim, embedding_dim, num_hiddens, num_residual_layers, num_residual_hiddens):
68
+ super(Encoder, self).__init__()
69
+ self._num_hiddens = num_hiddens
70
+ self._num_residual_layers = num_residual_layers
71
+ self._num_residual_hiddens = num_residual_hiddens
72
+
73
+ self.project = ConvNormRelu(in_dim, self._num_hiddens // 4, leaky=True)
74
+
75
+ self._enc_1 = Res_CNR_Stack(self._num_hiddens // 4, self._num_residual_layers, leaky=True)
76
+ self._down_1 = ConvNormRelu(self._num_hiddens // 4, self._num_hiddens // 2, leaky=True, residual=True,
77
+ sample='down')
78
+ self._enc_2 = Res_CNR_Stack(self._num_hiddens // 2, self._num_residual_layers, leaky=True)
79
+ self._down_2 = ConvNormRelu(self._num_hiddens // 2, self._num_hiddens, leaky=True, residual=True, sample='down')
80
+ self._enc_3 = Res_CNR_Stack(self._num_hiddens, self._num_residual_layers, leaky=True)
81
+
82
+ self.pre_vq_conv = nn.Conv1d(self._num_hiddens, embedding_dim, 1, 1)
83
+
84
+ def forward(self, x):
85
+ h = self.project(x)
86
+ h = self._enc_1(h)
87
+ h = self._down_1(h)
88
+ h = self._enc_2(h)
89
+ h = self._down_2(h)
90
+ h = self._enc_3(h)
91
+ h = self.pre_vq_conv(h)
92
+ return h
93
+
94
+
95
+ class Frame_Enc(nn.Module):
96
+ def __init__(self, in_dim, num_hiddens):
97
+ super(Frame_Enc, self).__init__()
98
+ self.in_dim = in_dim
99
+ self.num_hiddens = num_hiddens
100
+
101
+ # self.enc = transformer_Enc(in_dim, num_hiddens, 2, 8, 256, 256, 256, 256, 0, dropout=0.1, n_position=4)
102
+ self.proj = nn.Conv1d(in_dim, num_hiddens, 1, 1)
103
+ self.enc = Res_CNR_Stack(num_hiddens, 2, leaky=True)
104
+ self.proj_1 = nn.Conv1d(256*4, num_hiddens, 1, 1)
105
+ self.proj_2 = nn.Conv1d(256*4, num_hiddens*2, 1, 1)
106
+
107
+ def forward(self, x):
108
+ # x = self.enc(x, None)[0].reshape(x.shape[0], -1, 1)
109
+ x = self.enc(self.proj(x)).reshape(x.shape[0], -1, 1)
110
+ second_last = self.proj_2(x)
111
+ last = self.proj_1(x)
112
+ return second_last, last
113
+
114
+
115
+
116
+ class Decoder(nn.Module):
117
+ def __init__(self, out_dim, embedding_dim, num_hiddens, num_residual_layers, num_residual_hiddens, ae=False):
118
+ super(Decoder, self).__init__()
119
+ self._num_hiddens = num_hiddens
120
+ self._num_residual_layers = num_residual_layers
121
+ self._num_residual_hiddens = num_residual_hiddens
122
+
123
+ self.aft_vq_conv = nn.Conv1d(embedding_dim, self._num_hiddens, 1, 1)
124
+
125
+ self._dec_1 = Res_CNR_Stack(self._num_hiddens, self._num_residual_layers, leaky=True)
126
+ self._up_2 = ConvNormRelu(self._num_hiddens, self._num_hiddens // 2, leaky=True, residual=True, sample='up')
127
+ self._dec_2 = Res_CNR_Stack(self._num_hiddens // 2, self._num_residual_layers, leaky=True)
128
+ self._up_3 = ConvNormRelu(self._num_hiddens // 2, self._num_hiddens // 4, leaky=True, residual=True,
129
+ sample='up')
130
+ self._dec_3 = Res_CNR_Stack(self._num_hiddens // 4, self._num_residual_layers, leaky=True)
131
+
132
+ if ae:
133
+ self.frame_enc = Frame_Enc(out_dim, self._num_hiddens // 4)
134
+ self.gru_sl = nn.GRU(self._num_hiddens // 2, self._num_hiddens // 2, 1, batch_first=True)
135
+ self.gru_l = nn.GRU(self._num_hiddens // 4, self._num_hiddens // 4, 1, batch_first=True)
136
+
137
+ self.project = nn.Conv1d(self._num_hiddens // 4, out_dim, 1, 1)
138
+
139
+ def forward(self, h, last_frame=None):
140
+
141
+ h = self.aft_vq_conv(h)
142
+ h = self._dec_1(h)
143
+ h = self._up_2(h)
144
+ h = self._dec_2(h)
145
+ h = self._up_3(h)
146
+ h = self._dec_3(h)
147
+
148
+ recon = self.project(h)
149
+ return recon, None
150
+
151
+
152
+ class Pre_VQ(nn.Module):
153
+ def __init__(self, num_hiddens, embedding_dim, num_chunks):
154
+ super(Pre_VQ, self).__init__()
155
+ self.conv = nn.Conv1d(num_hiddens, num_hiddens, 1, 1, 0, groups=num_chunks)
156
+ self.bn = nn.GroupNorm(num_chunks, num_hiddens)
157
+ self.relu = nn.ReLU()
158
+ self.proj = nn.Conv1d(num_hiddens, embedding_dim, 1, 1, 0, groups=num_chunks)
159
+
160
+ def forward(self, x):
161
+ x = self.conv(x)
162
+ x = self.bn(x)
163
+ x = self.relu(x)
164
+ x = self.proj(x)
165
+ return x
166
+
167
+
168
+ class VQVAE(nn.Module):
169
+ """VQ-VAE"""
170
+
171
+ def __init__(self, in_dim, embedding_dim, num_embeddings,
172
+ num_hiddens, num_residual_layers, num_residual_hiddens,
173
+ commitment_cost=0.25, decay=0.99, share=False):
174
+ super().__init__()
175
+ self.in_dim = in_dim
176
+ self.embedding_dim = embedding_dim
177
+ self.num_embeddings = num_embeddings
178
+ self.share_code_vq = share
179
+
180
+ self.encoder = Encoder(in_dim, embedding_dim, num_hiddens, num_residual_layers, num_residual_hiddens)
181
+ self.vq_layer = VectorQuantizerEMA(embedding_dim, num_embeddings, commitment_cost, decay)
182
+ self.decoder = Decoder(in_dim, embedding_dim, num_hiddens, num_residual_layers, num_residual_hiddens)
183
+
184
+ def forward(self, gt_poses, id=None, pre_state=None):
185
+ z = self.encoder(gt_poses.transpose(1, 2))
186
+ if not self.training:
187
+ e, _ = self.vq_layer(z)
188
+ x_recon, cur_state = self.decoder(e, pre_state.transpose(1, 2) if pre_state is not None else None)
189
+ return e, x_recon
190
+
191
+ e, e_q_loss = self.vq_layer(z)
192
+ gt_recon, cur_state = self.decoder(e, pre_state.transpose(1, 2) if pre_state is not None else None)
193
+
194
+ return e_q_loss, gt_recon.transpose(1, 2)
195
+
196
+ def encode(self, gt_poses, id=None):
197
+ z = self.encoder(gt_poses.transpose(1, 2))
198
+ e, latents = self.vq_layer(z)
199
+ return e, latents
200
+
201
+ def decode(self, b, w, e=None, latents=None, pre_state=None):
202
+ if e is not None:
203
+ x = self.decoder(e, pre_state.transpose(1, 2) if pre_state is not None else None)
204
+ else:
205
+ e = self.vq_layer.quantize(latents)
206
+ e = e.view(b, w, -1).permute(0, 2, 1).contiguous()
207
+ x = self.decoder(e, pre_state.transpose(1, 2) if pre_state is not None else None)
208
+ return x
209
+
210
+
211
+ class AE(nn.Module):
212
+ """VQ-VAE"""
213
+
214
+ def __init__(self, in_dim, embedding_dim, num_embeddings,
215
+ num_hiddens, num_residual_layers, num_residual_hiddens):
216
+ super().__init__()
217
+ self.in_dim = in_dim
218
+ self.embedding_dim = embedding_dim
219
+ self.num_embeddings = num_embeddings
220
+
221
+ self.encoder = Encoder(in_dim, embedding_dim, num_hiddens, num_residual_layers, num_residual_hiddens)
222
+ self.decoder = Decoder(in_dim, embedding_dim, num_hiddens, num_residual_layers, num_residual_hiddens, True)
223
+
224
+ def forward(self, gt_poses, id=None, pre_state=None):
225
+ z = self.encoder(gt_poses.transpose(1, 2))
226
+ if not self.training:
227
+ x_recon, cur_state = self.decoder(z, pre_state.transpose(1, 2) if pre_state is not None else None)
228
+ return z, x_recon
229
+ gt_recon, cur_state = self.decoder(z, pre_state.transpose(1, 2) if pre_state is not None else None)
230
+
231
+ return gt_recon.transpose(1, 2)
232
+
233
+ def encode(self, gt_poses, id=None):
234
+ z = self.encoder(gt_poses.transpose(1, 2))
235
+ return z
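In this file the `Encoder` halves the temporal resolution twice (so the input length should be divisible by 4), the EMA vector quantizer maps each latent frame to its nearest codebook entry, and the `Decoder` upsamples back to the original length. A hedged usage sketch follows; the dimensions (`in_dim=108`, batch of 4, 88 frames) are illustrative assumptions rather than the configuration used elsewhere in this repository, and it assumes the repository's Python dependencies are installed:

```python
import torch
from nets.spg.vqvae_1d import VQVAE

model = VQVAE(in_dim=108, embedding_dim=64, num_embeddings=1024,
              num_hiddens=256, num_residual_layers=2, num_residual_hiddens=256)
poses = torch.randn(4, 88, 108)            # [B, T, C]; T divisible by 4

model.train()
e_q_loss, recon = model(poses)             # commitment loss and reconstruction [B, T, C]

model.eval()
with torch.no_grad():
    e, recon = model(poses)                # quantized latents and reconstruction
```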
nets/spg/vqvae_modules.py ADDED
@@ -0,0 +1,380 @@
1
+ import os
2
+ import numpy as np
3
+ import torch
4
+ import torch.nn as nn
5
+ import torch.nn.functional as F
6
+ from torchvision import datasets, transforms
7
+ import matplotlib.pyplot as plt
8
+
9
+
10
+
11
+
12
+ class CasualCT(nn.Module):
13
+ def __init__(self,
14
+ in_channels,
15
+ out_channels,
16
+ leaky=False,
17
+ p=0,
18
+ groups=1, ):
19
+ '''
20
+ conv-bn-relu
21
+ '''
22
+ super(CasualCT, self).__init__()
23
+ padding = 0
24
+ kernel_size = 2
25
+ stride = 2
26
+ in_channels = in_channels * groups
27
+ out_channels = out_channels * groups
28
+
29
+ self.conv = nn.ConvTranspose1d(in_channels=in_channels, out_channels=out_channels,
30
+ kernel_size=kernel_size, stride=stride, padding=padding,
31
+ groups=groups)
32
+ self.norm = nn.BatchNorm1d(out_channels)
33
+ self.dropout = nn.Dropout(p=p)
34
+ if leaky:
35
+ self.relu = nn.LeakyReLU(negative_slope=0.2)
36
+ else:
37
+ self.relu = nn.ReLU()
38
+
39
+ def forward(self, x, **kwargs):
40
+ out = self.norm(self.dropout(self.conv(x)))
41
+ return self.relu(out)
42
+
43
+
44
+ class CasualConv(nn.Module):
45
+ def __init__(self,
46
+ in_channels,
47
+ out_channels,
48
+ leaky=False,
49
+ p=0,
50
+ groups=1,
51
+ downsample=False):
52
+ '''
53
+ conv-bn-relu
54
+ '''
55
+ super(CasualConv, self).__init__()
56
+ padding = 0
57
+ kernel_size = 2
58
+ stride = 1
59
+ self.downsample = downsample
60
+ if self.downsample:
61
+ kernel_size = 2
62
+ stride = 2
63
+
64
+ in_channels = in_channels * groups
65
+ out_channels = out_channels * groups
66
+ self.conv = nn.Conv1d(in_channels=in_channels, out_channels=out_channels,
67
+ kernel_size=kernel_size, stride=stride, padding=padding,
68
+ groups=groups)
69
+ self.norm = nn.BatchNorm1d(out_channels)
70
+ self.dropout = nn.Dropout(p=p)
71
+ if leaky:
72
+ self.relu = nn.LeakyReLU(negative_slope=0.2)
73
+ else:
74
+ self.relu = nn.ReLU()
75
+
76
+ def forward(self, x, pre_state=None):
77
+ if not self.downsample:
78
+ if pre_state is not None:
79
+ x = torch.cat([pre_state, x], dim=-1)
80
+ else:
81
+ zeros = torch.zeros([x.shape[0], x.shape[1], 1], device=x.device)
82
+ x = torch.cat([zeros, x], dim=-1)
83
+ out = self.norm(self.dropout(self.conv(x)))
84
+ return self.relu(out)
85
+
86
+
87
+ class ConvNormRelu(nn.Module):
88
+ '''
89
+ (B,C_in,H,W) -> (B, C_out, H, W)
90
+ some kernel sizes make the output length differ from H/s
91
+ # TODO: there might be some problems with the residual path
92
+ '''
93
+
94
+ def __init__(self,
95
+ in_channels,
96
+ out_channels,
97
+ leaky=False,
98
+ sample='none',
99
+ p=0,
100
+ groups=1,
101
+ residual=False,
102
+ norm='bn'):
103
+ '''
104
+ conv-bn-relu
105
+ '''
106
+ super(ConvNormRelu, self).__init__()
107
+ self.residual = residual
108
+ self.norm_type = norm
109
+ padding = 1
110
+
111
+ if sample == 'none':
112
+ kernel_size = 3
113
+ stride = 1
114
+ elif sample == 'one':
115
+ padding = 0
116
+ kernel_size = stride = 1
117
+ else:
118
+ kernel_size = 4
119
+ stride = 2
120
+
121
+ if self.residual:
122
+ if sample == 'down':
123
+ self.residual_layer = nn.Conv1d(
124
+ in_channels=in_channels,
125
+ out_channels=out_channels,
126
+ kernel_size=kernel_size,
127
+ stride=stride,
128
+ padding=padding)
129
+ elif sample == 'up':
130
+ self.residual_layer = nn.ConvTranspose1d(
131
+ in_channels=in_channels,
132
+ out_channels=out_channels,
133
+ kernel_size=kernel_size,
134
+ stride=stride,
135
+ padding=padding)
136
+ else:
137
+ if in_channels == out_channels:
138
+ self.residual_layer = nn.Identity()
139
+ else:
140
+ self.residual_layer = nn.Sequential(
141
+ nn.Conv1d(
142
+ in_channels=in_channels,
143
+ out_channels=out_channels,
144
+ kernel_size=kernel_size,
145
+ stride=stride,
146
+ padding=padding
147
+ )
148
+ )
149
+
150
+ in_channels = in_channels * groups
151
+ out_channels = out_channels * groups
152
+ if sample == 'up':
153
+ self.conv = nn.ConvTranspose1d(in_channels=in_channels, out_channels=out_channels,
154
+ kernel_size=kernel_size, stride=stride, padding=padding,
155
+ groups=groups)
156
+ else:
157
+ self.conv = nn.Conv1d(in_channels=in_channels, out_channels=out_channels,
158
+ kernel_size=kernel_size, stride=stride, padding=padding,
159
+ groups=groups)
160
+ self.norm = nn.BatchNorm1d(out_channels)
161
+ self.dropout = nn.Dropout(p=p)
162
+ if leaky:
163
+ self.relu = nn.LeakyReLU(negative_slope=0.2)
164
+ else:
165
+ self.relu = nn.ReLU()
166
+
167
+ def forward(self, x, **kwargs):
168
+ out = self.norm(self.dropout(self.conv(x)))
169
+ if self.residual:
170
+ residual = self.residual_layer(x)
171
+ out += residual
172
+ return self.relu(out)
173
+
174
+
175
+ class Res_CNR_Stack(nn.Module):
176
+ def __init__(self,
177
+ channels,
178
+ layers,
179
+ sample='none',
180
+ leaky=False,
181
+ casual=False,
182
+ ):
183
+ super(Res_CNR_Stack, self).__init__()
184
+
185
+ if casual:
186
+ kernal_size = 1
187
+ padding = 0
188
+ conv = CasualConv
189
+ else:
190
+ kernal_size = 3
191
+ padding = 1
192
+ conv = ConvNormRelu
193
+
194
+ if sample == 'one':
195
+ kernal_size = 1
196
+ padding = 0
197
+
198
+ self._layers = nn.ModuleList()
199
+ for i in range(layers):
200
+ self._layers.append(conv(channels, channels, leaky=leaky, sample=sample))
201
+ self.conv = nn.Conv1d(channels, channels, kernal_size, 1, padding)
202
+ self.norm = nn.BatchNorm1d(channels)
203
+ self.relu = nn.ReLU()
204
+
205
+ def forward(self, x, pre_state=None):
206
+ # cur_state = []
207
+ h = x
208
+ for i in range(len(self._layers)):
209
+ # cur_state.append(h[..., -1:])
210
+ h = self._layers[i](h, pre_state=pre_state[i] if pre_state is not None else None)
211
+ h = self.norm(self.conv(h))
212
+ return self.relu(h + x)
213
+
214
+
215
+ class ExponentialMovingAverage(nn.Module):
216
+ """Maintains an exponential moving average for a value.
217
+
218
+ This module keeps track of a hidden exponential moving average that is
219
+ initialized as a vector of zeros which is then normalized to give the average.
220
+ This gives us a moving average which isn't biased towards either zero or the
221
+ initial value. Reference (https://arxiv.org/pdf/1412.6980.pdf)
222
+
223
+ Initially:
224
+ hidden_0 = 0
225
+ Then iteratively:
226
+ hidden_i = hidden_{i-1} - (hidden_{i-1} - value) * (1 - decay)
227
+ average_i = hidden_i / (1 - decay^i)
228
+ """
229
+
230
+ def __init__(self, init_value, decay):
231
+ super().__init__()
232
+
233
+ self.decay = decay
234
+ self.counter = 0
235
+ self.register_buffer("hidden", torch.zeros_like(init_value))
236
+
237
+ def forward(self, value):
238
+ self.counter += 1
239
+ self.hidden.sub_((self.hidden - value) * (1 - self.decay))
240
+ average = self.hidden / (1 - self.decay ** self.counter)
241
+ return average
242
+
243
+
244
+ class VectorQuantizerEMA(nn.Module):
245
+ """
246
+ VQ-VAE layer: Input any tensor to be quantized. Use EMA to update embeddings.
247
+ Args:
248
+ embedding_dim (int): the dimensionality of the tensors in the
249
+ quantized space. Inputs to the modules must be in this format as well.
250
+ num_embeddings (int): the number of vectors in the quantized space.
251
+ commitment_cost (float): scalar which controls the weighting of the loss terms (see
252
+ equation 4 in the paper - this variable is Beta).
253
+ decay (float): decay for the moving averages.
254
+ epsilon (float): small float constant to avoid numerical instability.
255
+ """
256
+
257
+ def __init__(self, embedding_dim, num_embeddings, commitment_cost, decay,
258
+ epsilon=1e-5):
259
+ super().__init__()
260
+ self.embedding_dim = embedding_dim
261
+ self.num_embeddings = num_embeddings
262
+ self.commitment_cost = commitment_cost
263
+ self.epsilon = epsilon
264
+
265
+ # initialize embeddings as buffers
266
+ embeddings = torch.empty(self.num_embeddings, self.embedding_dim)
267
+ nn.init.xavier_uniform_(embeddings)
268
+ self.register_buffer("embeddings", embeddings)
269
+ self.ema_dw = ExponentialMovingAverage(self.embeddings, decay)
270
+
271
+ # also maintain ema_cluster_size, which records the running size of each code's cluster
272
+ self.ema_cluster_size = ExponentialMovingAverage(torch.zeros((self.num_embeddings,)), decay)
273
+
274
+ def forward(self, x):
275
+ # [B, C, H, W] -> [B, H, W, C]
276
+ x = x.permute(0, 2, 1).contiguous()
277
+ # [B, H, W, C] -> [BHW, C]
278
+ flat_x = x.reshape(-1, self.embedding_dim)
279
+
280
+ encoding_indices = self.get_code_indices(flat_x)
281
+ quantized = self.quantize(encoding_indices)
282
+ quantized = quantized.view_as(x) # [B, W, C]
283
+
284
+ if not self.training:
285
+ quantized = quantized.permute(0, 2, 1).contiguous()
286
+ return quantized, encoding_indices.view(quantized.shape[0], quantized.shape[2])
287
+
288
+ # update embeddings with EMA
289
+ with torch.no_grad():
290
+ encodings = F.one_hot(encoding_indices, self.num_embeddings).float()
291
+ updated_ema_cluster_size = self.ema_cluster_size(torch.sum(encodings, dim=0))
292
+ n = torch.sum(updated_ema_cluster_size)
293
+ updated_ema_cluster_size = ((updated_ema_cluster_size + self.epsilon) /
294
+ (n + self.num_embeddings * self.epsilon) * n)
295
+ dw = torch.matmul(encodings.t(), flat_x) # sum encoding vectors of each cluster
296
+ updated_ema_dw = self.ema_dw(dw)
297
+ normalised_updated_ema_w = (
298
+ updated_ema_dw / updated_ema_cluster_size.reshape(-1, 1))
299
+ self.embeddings.data = normalised_updated_ema_w
300
+
301
+ # commitment loss
302
+ e_latent_loss = F.mse_loss(x, quantized.detach())
303
+ loss = self.commitment_cost * e_latent_loss
304
+
305
+ # Straight Through Estimator
306
+ quantized = x + (quantized - x).detach()
307
+
308
+ quantized = quantized.permute(0, 2, 1).contiguous()
309
+ return quantized, loss
310
+
311
+ def get_code_indices(self, flat_x):
312
+ # compute L2 distance
313
+ distances = (
314
+ torch.sum(flat_x ** 2, dim=1, keepdim=True) +
315
+ torch.sum(self.embeddings ** 2, dim=1) -
316
+ 2. * torch.matmul(flat_x, self.embeddings.t())
317
+ ) # [N, M]
318
+ encoding_indices = torch.argmin(distances, dim=1) # [N,]
319
+ return encoding_indices
320
+
321
+ def quantize(self, encoding_indices):
322
+ """Returns embedding tensor for a batch of indices."""
323
+ return F.embedding(encoding_indices, self.embeddings)
324
+
325
+
326
+
327
+ class Casual_Encoder(nn.Module):
328
+ def __init__(self, in_dim, embedding_dim, num_hiddens, num_residual_layers, num_residual_hiddens):
329
+ super(Casual_Encoder, self).__init__()
330
+ self._num_hiddens = num_hiddens
331
+ self._num_residual_layers = num_residual_layers
332
+ self._num_residual_hiddens = num_residual_hiddens
333
+
334
+ self.project = nn.Conv1d(in_dim, self._num_hiddens // 4, 1, 1)
335
+ self._enc_1 = Res_CNR_Stack(self._num_hiddens // 4, self._num_residual_layers, leaky=True, casual=True)
336
+ self._down_1 = CasualConv(self._num_hiddens // 4, self._num_hiddens // 2, leaky=True, downsample=True)
337
+ self._enc_2 = Res_CNR_Stack(self._num_hiddens // 2, self._num_residual_layers, leaky=True, casual=True)
338
+ self._down_2 = CasualConv(self._num_hiddens // 2, self._num_hiddens, leaky=True, downsample=True)
339
+ self._enc_3 = Res_CNR_Stack(self._num_hiddens, self._num_residual_layers, leaky=True, casual=True)
340
+ # self.pre_vq_conv = nn.Conv1d(self._num_hiddens, embedding_dim, 1, 1)
341
+
342
+ def forward(self, x):
343
+ h = self.project(x)
344
+ h, _ = self._enc_1(h)
345
+ h = self._down_1(h)
346
+ h, _ = self._enc_2(h)
347
+ h = self._down_2(h)
348
+ h, _ = self._enc_3(h)
349
+ # h = self.pre_vq_conv(h)
350
+ return h
351
+
352
+
353
+ class Casual_Decoder(nn.Module):
354
+ def __init__(self, out_dim, embedding_dim, num_hiddens, num_residual_layers, num_residual_hiddens):
355
+ super(Casual_Decoder, self).__init__()
356
+ self._num_hiddens = num_hiddens
357
+ self._num_residual_layers = num_residual_layers
358
+ self._num_residual_hiddens = num_residual_hiddens
359
+
360
+ # self.aft_vq_conv = nn.Conv1d(embedding_dim, self._num_hiddens, 1, 1)
361
+ self._dec_1 = Res_CNR_Stack(self._num_hiddens, self._num_residual_layers, leaky=True, casual=True)
362
+ self._up_2 = CasualCT(self._num_hiddens, self._num_hiddens // 2, leaky=True)
363
+ self._dec_2 = Res_CNR_Stack(self._num_hiddens // 2, self._num_residual_layers, leaky=True, casual=True)
364
+ self._up_3 = CasualCT(self._num_hiddens // 2, self._num_hiddens // 4, leaky=True)
365
+ self._dec_3 = Res_CNR_Stack(self._num_hiddens // 4, self._num_residual_layers, leaky=True, casual=True)
366
+ self.project = nn.Conv1d(self._num_hiddens//4, out_dim, 1, 1)
367
+
368
+ def forward(self, h, pre_state=None):
369
+ cur_state = []
370
+ # h = self.aft_vq_conv(x)
371
+ h, s = self._dec_1(h, pre_state[0] if pre_state is not None else None)
372
+ cur_state.append(s)
373
+ h = self._up_2(h)
374
+ h, s = self._dec_2(h, pre_state[1] if pre_state is not None else None)
375
+ cur_state.append(s)
376
+ h = self._up_3(h)
377
+ h, s = self._dec_3(h, pre_state[2] if pre_state is not None else None)
378
+ cur_state.append(s)
379
+ recon = self.project(h)
380
+ return recon, cur_state
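The core of `VectorQuantizerEMA` above is nearest-codebook lookup via squared L2 distances, a commitment loss, and a straight-through estimator so gradients flow back to the encoder, while the codebook itself is updated with the bias-corrected exponential moving averages described in the `ExponentialMovingAverage` docstring. The sketch below restates only the lookup and straight-through steps in isolation, with small made-up tensor sizes for illustration:

```python
import torch
import torch.nn.functional as F

# Nearest-codebook quantization with a straight-through estimator,
# mirroring get_code_indices / quantize in VectorQuantizerEMA above.
torch.manual_seed(0)
codebook = torch.randn(8, 4)               # [num_embeddings, embedding_dim]
x = torch.randn(6, 4, requires_grad=True)  # [N, embedding_dim], flattened encoder output

d = (x.pow(2).sum(1, keepdim=True)
     + codebook.pow(2).sum(1)
     - 2.0 * x @ codebook.t())             # squared L2 distances [N, 8]
idx = d.argmin(dim=1)                      # nearest code index per vector
quantized = F.embedding(idx, codebook)     # [N, 4]

commitment = F.mse_loss(x, quantized.detach())   # pulls encoder outputs toward the codes
quantized = x + (quantized - x).detach()         # straight-through: gradient flows to x
```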
nets/spg/wav2vec.py ADDED
@@ -0,0 +1,143 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ import torch.nn.functional as F
4
+ import numpy as np
5
+ import copy
6
+ import math
7
+ from transformers import Wav2Vec2Model,Wav2Vec2Config
8
+ from transformers.modeling_outputs import BaseModelOutput
9
+ from typing import Optional, Tuple
10
+ _CONFIG_FOR_DOC = "Wav2Vec2Config"
11
+
12
+ # the implementation of Wav2Vec2Model is borrowed from https://huggingface.co/transformers/_modules/transformers/models/wav2vec2/modeling_wav2vec2.html#Wav2Vec2Model
13
+ # initialize our encoder with the pre-trained wav2vec 2.0 weights.
14
+ def _compute_mask_indices(
15
+ shape: Tuple[int, int],
16
+ mask_prob: float,
17
+ mask_length: int,
18
+ attention_mask: Optional[torch.Tensor] = None,
19
+ min_masks: int = 0,
20
+ ) -> np.ndarray:
21
+ bsz, all_sz = shape
22
+ mask = np.full((bsz, all_sz), False)
23
+
24
+ all_num_mask = int(
25
+ mask_prob * all_sz / float(mask_length)
26
+ + np.random.rand()
27
+ )
28
+ all_num_mask = max(min_masks, all_num_mask)
29
+ mask_idcs = []
30
+ padding_mask = attention_mask.ne(1) if attention_mask is not None else None
31
+ for i in range(bsz):
32
+ if padding_mask is not None:
33
+ sz = all_sz - padding_mask[i].long().sum().item()
34
+ num_mask = int(
35
+ mask_prob * sz / float(mask_length)
36
+ + np.random.rand()
37
+ )
38
+ num_mask = max(min_masks, num_mask)
39
+ else:
40
+ sz = all_sz
41
+ num_mask = all_num_mask
42
+
43
+ lengths = np.full(num_mask, mask_length)
44
+
45
+ if sum(lengths) == 0:
46
+ lengths[0] = min(mask_length, sz - 1)
47
+
48
+ min_len = min(lengths)
49
+ if sz - min_len <= num_mask:
50
+ min_len = sz - num_mask - 1
51
+
52
+ mask_idc = np.random.choice(sz - min_len, num_mask, replace=False)
53
+ mask_idc = np.asarray([mask_idc[j] + offset for j in range(len(mask_idc)) for offset in range(lengths[j])])
54
+ mask_idcs.append(np.unique(mask_idc[mask_idc < sz]))
55
+
56
+ min_len = min([len(m) for m in mask_idcs])
57
+ for i, mask_idc in enumerate(mask_idcs):
58
+ if len(mask_idc) > min_len:
59
+ mask_idc = np.random.choice(mask_idc, min_len, replace=False)
60
+ mask[i, mask_idc] = True
61
+ return mask
62
+
63
+ # linear interpolation layer
64
+ def linear_interpolation(features, input_fps, output_fps, output_len=None):
65
+ features = features.transpose(1, 2)
66
+ seq_len = features.shape[2] / float(input_fps)
67
+ if output_len is None:
68
+ output_len = int(seq_len * output_fps)
69
+ output_features = F.interpolate(features,size=output_len,align_corners=False,mode='linear')
70
+ return output_features.transpose(1, 2)
71
+
72
+
73
+ class Wav2Vec2Model(Wav2Vec2Model):
74
+ def __init__(self, config):
75
+ super().__init__(config)
76
+ def forward(
77
+ self,
78
+ input_values,
79
+ attention_mask=None,
80
+ output_attentions=None,
81
+ output_hidden_states=None,
82
+ return_dict=None,
83
+ frame_num=None
84
+ ):
85
+ self.config.output_attentions = True
86
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
87
+ output_hidden_states = (
88
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
89
+ )
90
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
91
+
92
+ hidden_states = self.feature_extractor(input_values)
93
+ hidden_states = hidden_states.transpose(1, 2)
94
+
95
+ hidden_states = linear_interpolation(hidden_states, 50, 30,output_len=frame_num)
96
+
97
+ if attention_mask is not None:
98
+ output_lengths = self._get_feat_extract_output_lengths(attention_mask.sum(-1))
99
+ attention_mask = torch.zeros(
100
+ hidden_states.shape[:2], dtype=hidden_states.dtype, device=hidden_states.device
101
+ )
102
+ attention_mask[
103
+ (torch.arange(attention_mask.shape[0], device=hidden_states.device), output_lengths - 1)
104
+ ] = 1
105
+ attention_mask = attention_mask.flip([-1]).cumsum(-1).flip([-1]).bool()
106
+
107
+ hidden_states = self.feature_projection(hidden_states)
108
+
109
+ if self.config.apply_spec_augment and self.training:
110
+ batch_size, sequence_length, hidden_size = hidden_states.size()
111
+ if self.config.mask_time_prob > 0:
112
+ mask_time_indices = _compute_mask_indices(
113
+ (batch_size, sequence_length),
114
+ self.config.mask_time_prob,
115
+ self.config.mask_time_length,
116
+ attention_mask=attention_mask,
117
+ min_masks=2,
118
+ )
119
+ hidden_states[torch.from_numpy(mask_time_indices)] = self.masked_spec_embed.to(hidden_states.dtype)
120
+ if self.config.mask_feature_prob > 0:
121
+ mask_feature_indices = _compute_mask_indices(
122
+ (batch_size, hidden_size),
123
+ self.config.mask_feature_prob,
124
+ self.config.mask_feature_length,
125
+ )
126
+ mask_feature_indices = torch.from_numpy(mask_feature_indices).to(hidden_states.device)
127
+ hidden_states[mask_feature_indices[:, None].expand(-1, sequence_length, -1)] = 0
128
+ encoder_outputs = self.encoder(
129
+ hidden_states[0],
130
+ attention_mask=attention_mask,
131
+ output_attentions=output_attentions,
132
+ output_hidden_states=output_hidden_states,
133
+ return_dict=return_dict,
134
+ )
135
+ hidden_states = encoder_outputs[0]
136
+ if not return_dict:
137
+ return (hidden_states,) + encoder_outputs[1:]
138
+
139
+ return BaseModelOutput(
140
+ last_hidden_state=hidden_states,
141
+ hidden_states=encoder_outputs.hidden_states,
142
+ attentions=encoder_outputs.attentions,
143
+ )
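`linear_interpolation` above resamples the wav2vec 2.0 frame features (roughly 50 fps) to the motion frame rate (30 fps here, or an explicit `output_len`) so that audio features line up one-to-one with pose frames. A standalone restatement of that helper, with an illustrative input size, is shown below:

```python
import torch
import torch.nn.functional as F

def linear_interpolation(features, input_fps, output_fps, output_len=None):
    # features: [B, T, C]; resample along T from input_fps to output_fps
    features = features.transpose(1, 2)                       # [B, C, T]
    if output_len is None:
        output_len = int(features.shape[2] / float(input_fps) * output_fps)
    features = F.interpolate(features, size=output_len,
                             mode='linear', align_corners=False)
    return features.transpose(1, 2)                           # [B, output_len, C]

feats_50fps = torch.randn(1, 100, 768)      # ~2 s of wav2vec 2.0 features at 50 fps
feats_30fps = linear_interpolation(feats_50fps, 50, 30)       # -> [1, 60, 768]
```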
nets/utils.py ADDED
@@ -0,0 +1,122 @@
1
+ import json
2
+ import textgrid as tg
3
+ import numpy as np
4
+
5
+ def get_parameter_size(model):
6
+ total_num = sum(p.numel() for p in model.parameters())
7
+ trainable_num = sum(p.numel() for p in model.parameters() if p.requires_grad)
8
+ return total_num, trainable_num
9
+
10
+ def denormalize(kps, data_mean, data_std):
11
+ '''
12
+ kps: (B, T, C)
13
+ '''
14
+ data_std = data_std.reshape(1, 1, -1)
15
+ data_mean = data_mean.reshape(1, 1, -1)
16
+ return (kps * data_std) + data_mean
17
+
18
+ def normalize(kps, data_mean, data_std):
19
+ '''
20
+ kps: (B, T, C)
21
+ '''
22
+ data_std = data_std.squeeze().reshape(1, 1, -1)
23
+ data_mean = data_mean.squeeze().reshape(1, 1, -1)
24
+
25
+ return (kps-data_mean) / data_std
26
+
27
+ def parse_audio(textgrid_file):
28
+ '''a demo implementation'''
29
+ words=['but', 'as', 'to', 'that', 'with', 'of', 'the', 'and', 'or', 'not', 'which', 'what', 'this', 'for', 'because', 'if', 'so', 'just', 'about', 'like', 'by', 'how', 'from', 'whats', 'now', 'very', 'that', 'also', 'actually', 'who', 'then', 'well', 'where', 'even', 'today', 'between', 'than', 'when']
30
+ txt=tg.TextGrid.fromFile(textgrid_file)
31
+
32
+ total_time=int(np.ceil(txt.maxTime))
33
+ code_seq=np.zeros(total_time)
34
+
35
+ word_level=txt[0]
36
+
37
+ for i in range(len(word_level)):
38
+ start_time=word_level[i].minTime
39
+ end_time=word_level[i].maxTime
40
+ mark=word_level[i].mark
41
+
42
+ if mark in words:
43
+ start=int(np.round(start_time))
44
+ end=int(np.round(end_time))
45
+
46
+ if start >= len(code_seq) or end >= len(code_seq):
47
+ code_seq[-1] = 1
48
+ else:
49
+ code_seq[start]=1
50
+
51
+ return code_seq
52
+
53
+
54
+ def get_path(model_name, model_type):
55
+ if model_name == 's2g_body_pixel':
56
+ if model_type == 'mfcc':
57
+ return './experiments/2022-10-09-smplx_S2G-body-pixel-aud-3p/ckpt-99.pth'
58
+ elif model_type == 'wv2':
59
+ return './experiments/2022-10-28-smplx_S2G-body-pixel-wv2-sg2/ckpt-99.pth'
60
+ elif model_type == 'random':
61
+ return './experiments/2022-10-09-smplx_S2G-body-pixel-random-3p/ckpt-99.pth'
62
+ elif model_type == 'wbhmodel':
63
+ return './experiments/2022-11-02-smplx_S2G-body-pixel-w-bhmodel/ckpt-99.pth'
64
+ elif model_type == 'wobhmodel':
65
+ return './experiments/2022-11-02-smplx_S2G-body-pixel-wo-bhmodel/ckpt-99.pth'
66
+ elif model_name == 's2g_body':
67
+ if model_type == 'a+m-vae':
68
+ return './experiments/2022-10-19-smplx_S2G-body-audio-motion-vae/ckpt-99.pth'
69
+ elif model_type == 'a-vae':
70
+ return './experiments/2022-10-18-smplx_S2G-body-audiovae/ckpt-99.pth'
71
+ elif model_type == 'a-ed':
72
+ return './experiments/2022-10-18-smplx_S2G-body-audioae/ckpt-99.pth'
73
+ elif model_name == 's2g_LS3DCG':
74
+ return './experiments/2022-10-19-smplx_S2G-LS3DCG/ckpt-99.pth'
75
+ elif model_name == 's2g_body_vq':
76
+ if model_type == 'n_com_1024':
77
+ return './experiments/2022-10-29-smplx_S2G-body-vq-cn1024/ckpt-99.pth'
78
+ elif model_type == 'n_com_2048':
79
+ return './experiments/2022-10-29-smplx_S2G-body-vq-cn2048/ckpt-99.pth'
80
+ elif model_type == 'n_com_4096':
81
+ return './experiments/2022-10-29-smplx_S2G-body-vq-cn4096/ckpt-99.pth'
82
+ elif model_type == 'n_com_8192':
83
+ return './experiments/2022-11-02-smplx_S2G-body-vq-cn8192/ckpt-99.pth'
84
+ elif model_type == 'n_com_16384':
85
+ return './experiments/2022-11-02-smplx_S2G-body-vq-cn16384/ckpt-99.pth'
86
+ elif model_type == 'n_com_170000':
87
+ return './experiments/2022-10-30-smplx_S2G-body-vq-cn170000/ckpt-99.pth'
88
+ elif model_type == 'com_1024':
89
+ return './experiments/2022-10-29-smplx_S2G-body-vq-composition/ckpt-99.pth'
90
+ elif model_type == 'com_2048':
91
+ return './experiments/2022-10-31-smplx_S2G-body-vq-composition2048/ckpt-99.pth'
92
+ elif model_type == 'com_4096':
93
+ return './experiments/2022-10-31-smplx_S2G-body-vq-composition4096/ckpt-99.pth'
94
+ elif model_type == 'com_8192':
95
+ return './experiments/2022-11-02-smplx_S2G-body-vq-composition8192/ckpt-99.pth'
96
+ elif model_type == 'com_16384':
97
+ return './experiments/2022-11-02-smplx_S2G-body-vq-composition16384/ckpt-99.pth'
98
+
99
+
100
+ def get_dpath(model_name, model_type):
101
+ if model_name == 's2g_body_pixel':
102
+ if model_type == 'audio':
103
+ return './experiments/2022-10-26-smplx_S2G-d-pixel-aud/ckpt-9.pth'
104
+ elif model_type == 'wv2':
105
+ return './experiments/2022-11-04-smplx_S2G-d-pixel-wv2/ckpt-9.pth'
106
+ elif model_type == 'random':
107
+ return './experiments/2022-10-26-smplx_S2G-d-pixel-random/ckpt-9.pth'
108
+ elif model_type == 'wbhmodel':
109
+ return './experiments/2022-11-10-smplx_S2G-hD-wbhmodel/ckpt-9.pth'
110
+ # return './experiments/2022-11-05-smplx_S2G-d-pixel-wbhmodel/ckpt-9.pth'
111
+ elif model_type == 'wobhmodel':
112
+ return './experiments/2022-11-10-smplx_S2G-hD-wobhmodel/ckpt-9.pth'
113
+ # return './experiments/2022-11-05-smplx_S2G-d-pixel-wobhmodel/ckpt-9.pth'
114
+ elif model_name == 's2g_body':
115
+ if model_type == 'a+m-vae':
116
+ return './experiments/2022-10-26-smplx_S2G-d-audio+motion-vae/ckpt-9.pth'
117
+ elif model_type == 'a-vae':
118
+ return './experiments/2022-10-26-smplx_S2G-d-audio-vae/ckpt-9.pth'
119
+ elif model_type == 'a-ed':
120
+ return './experiments/2022-10-26-smplx_S2G-d-audio-ae/ckpt-9.pth'
121
+ elif model_name == 's2g_LS3DCG':
122
+ return './experiments/2022-10-26-smplx_S2G-d-ls3dcg/ckpt-9.pth'
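`get_path` and `get_dpath` simply map a model name and variant to a checkpoint under `./experiments/`, and `normalize`/`denormalize` are inverses given the same statistics. A hedged usage sketch follows; it assumes the repository's dependencies (for example the `textgrid` package imported by `nets/utils.py`) are installed, and the returned checkpoint paths only resolve if the corresponding `experiments/` folders exist:

```python
import numpy as np
from nets.utils import get_path, get_dpath, normalize, denormalize

# Checkpoint lookup for the generator and its discriminator
gen_ckpt = get_path('s2g_body_pixel', 'wv2')
dis_ckpt = get_dpath('s2g_body_pixel', 'wv2')

# normalize/denormalize round-trip with matching statistics
kps = np.random.randn(2, 60, 275)           # [B, T, C], illustrative shape
mean, std = kps.mean((0, 1)), kps.std((0, 1)) + 1e-8
assert np.allclose(denormalize(normalize(kps, mean, std), mean, std), kps)
```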