sshravani committed on
Commit ec76118 · 1 Parent(s): f8b6a4b

Added visualise folder files from GitHub repo

This view is limited to 50 files because it contains too many changes. See raw diff.
.DS_Store DELETED
Binary file (8.2 kB)
 
.gitattributes DELETED
@@ -1 +0,0 @@
- demo_audio/1st-page.wav filter=lfs diff=lfs merge=lfs -text
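The deleted `.gitattributes` entry is the Git LFS rule that kept the demo audio out of regular Git storage. As a minimal sketch (assuming Git LFS is installed and the same rule were being recreated), such an entry is normally generated like this:

```bash
# Register the LFS filter once per machine, then track the demo audio file.
git lfs install
git lfs track "demo_audio/1st-page.wav"
# git lfs track writes the "filter=lfs diff=lfs merge=lfs -text" line into .gitattributes.
git add .gitattributes demo_audio/1st-page.wav
```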
 
 
.gitignore DELETED
@@ -1,13 +0,0 @@
- cat > .gitignore << EOF
- # Binary and large files
- *.pkl
- *.mp4
- *.npy
- # Demo binary files
- demo/**/*.mp4
- demo/**/*.npy
- # Large model files
- experiments/
- # Any other large files
- visualise/teaser_01.png
- EOF
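Note that the deleted `.gitignore` contains the shell heredoc wrapper (`cat > .gitignore << EOF` and `EOF`) as literal lines, which suggests the snippet was pasted into the file rather than executed. As a sketch of the apparent intent (same patterns, run from the repository root in a POSIX shell), the heredoc itself would be:

```bash
# Writes only the ignore rules; the cat/EOF wrapper lines never end up in the file.
cat > .gitignore << 'EOF'
# Binary and large files
*.pkl
*.mp4
*.npy
# Demo binary files
demo/**/*.mp4
demo/**/*.npy
# Large model files
experiments/
# Any other large files
visualise/teaser_01.png
EOF
```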
 
Dockerfile DELETED
@@ -1,41 +0,0 @@
- FROM pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime
-
- # Install system dependencies
- RUN apt-get update && apt-get install -y \
- ffmpeg \
- libgl1-mesa-glx \
- git \
- wget \
- unzip \
- libsndfile1 \
- && rm -rf /var/lib/apt/lists/*
-
- # Set up a non-root user for Hugging Face Space compatibility
- RUN useradd -m -u 1000 user
- USER user
- WORKDIR /home/user
-
- # Copy project files
- COPY --chown=user requirements.txt .
- COPY --chown=user . .
-
- # Install Python dependencies
- RUN pip install --no-cache-dir -r requirements.txt
-
- # Create necessary directories
- RUN mkdir -p visualise/smplx_model \
- && mkdir -p experiments \
- && mkdir -p visualise/video/body-pixel \
- && mkdir -p visualise/video/body-pixel2 \
- && mkdir -p demo_audio
-
- # Set environment variables for GPU and Python
- ENV PYTHONUNBUFFERED=1
- ENV NVIDIA_VISIBLE_DEVICES=all
- ENV NVIDIA_DRIVER_CAPABILITIES=all
-
- # Expose Gradio port
- EXPOSE 7860
-
- # Default command
- CMD ["python", "app.py"]
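The Dockerfile above is written for a Hugging Face Space (non-root user, Gradio on port 7860), but it can also be built and run locally. A minimal sketch, assuming the NVIDIA Container Toolkit is installed for GPU access and using `talkshow` as an arbitrary image tag:

```bash
# Build from the repository root, where the Dockerfile lives.
docker build -t talkshow .
# Run with GPU access and publish the Gradio port declared by EXPOSE 7860.
docker run --gpus all -p 7860:7860 talkshow
```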
 
README.md DELETED
@@ -1,118 +0,0 @@
- ---
- title: TalkSHOW Speech-to-Motion Translation
- emoji: 🎙️
- colorFrom: blue
- colorTo: purple
- sdk: docker
- app_port: 7860
- pinned: false
- license: mit
- ---
-
- # Team 14 - TalkSHOW: Generating Holistic 3D Human Motion from Speech
-
- Contributors: Abinaya Odeti, Shipra, Shravani, Vishal
-
- ![teaser](visualise/teaser_01.png)
-
- ## About
-
- This repository hosts the implementation of "TalkSHOW: A Speech-to-Motion Translation System", which maps raw audio input to full-body 3D motion using the SMPL-X model. It enables synchronized generation of expressive human motion (face, hands, and body) from speech input, supporting real-time animation, virtual avatars, and digital storytelling.
-
- ## Highlights
-
- Translates raw .wav audio into natural whole-body motion (jaw, pose, expressions, hands) using deep learning.
-
- Based on the SMPL-X model for realistic 3D human mesh generation.
-
- Modular pipeline with support for face-body composition.
-
- Visualization with OpenGL and FFmpeg for the final rendered video.
-
- End-to-end customizable configuration covering audio models, latent generation, and rendering.
-
- ## Prerequisites
-
- Python 3.7+
-
- Anaconda for environment management
-
- Install the required packages:
-
- ```bash
- pip install -r requirements.txt
- ```
- Install FFmpeg
-
- ➤ Extract the FFmpeg ZIP and add its bin folder to the system PATH
-
-
- ## Getting started
-
- The visualization code was tested on `Windows 10`, and it requires:
-
- * Python 3.7
- * Anaconda3 or Miniconda3
- * a CUDA-capable GPU (one is enough)
-
-
-
- ### 1. Setup
-
- Clone the repo:
- ```bash
- git clone https://github.com/YOUR_USERNAME/TALKSHOW-speech-to-motion-translation-system.git
- cd TalkSHOW
- ```
- Create the conda environment:
- ```bash
- conda create -n talkshow python=3.7 -y
- conda activate talkshow
- pip install -r requirements.txt
- ```
-
- ### 2. Download models
- Download or place the required checkpoints:
- Download the [**pretrained models**](https://drive.google.com/file/d/1bC0ZTza8HOhLB46WOJ05sBywFvcotDZG/view?usp=sharing),
- unzip them, and place them in the TalkSHOW folder, i.e. ``path-to-TalkSHOW/experiments``.
-
- Download the [**smplx model**](https://drive.google.com/file/d/1Ly_hQNLQcZ89KG0Nj4jYZwccQiimSUVn/view?usp=share_link) (please register on the official [**SMPLX webpage**](https://smpl-x.is.tue.mpg.de) before using it)
- and place it in ``path-to-TalkSHOW/visualise/smplx_model``.
- To visualise the test set and the generated results (in each video, left: generated result | right: ground truth),
- the videos and generated motion data are saved in ``./visualise/video/body-pixel``.
-
- SMPLX model weights – visualise/smplx_model/SMPLX_NEUTRAL_2020.npz
-
- Extra joints, regressors, YAML configs – inside visualise/smplx_model/
-
- Also, ensure that vq_path in body_pixel.json points to a valid .pth model (in ./experiments/.../ckpt-*.pth).
-
-
- ### 3. 🎙️ Running Inference
-
- To generate a 3D animated video from an audio file:
- ```bash
- python scripts/demo.py \
- --config_file ./config/body_pixel.json \
- --infer \
- --audio_file ./demo_audio/1st-page.wav \
- --id 0 \
- --whole_body
- ```
- Change input:
- Replace the --audio_file value with your own .wav file path.
-
-
- ### 4. Output
- The final 3D animated video will be saved under:
- ```bash
- visualise/video/body-pixel2/<audio_file_name>/1st-page.mp4
- ```
- The exact command used to run the project:
- ```bash
- python scripts/demo.py --config_file ./config/body_pixel.json --infer --audio_file ./demo_audio/1st-page.wav --id 0 --whole_body
- ```
-
- ### Contact
-
- For issues or questions, open an issue or contact the contributors directly!
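Before running inference, a quick sanity check of the layout described in the README above (FFmpeg on PATH, SMPL-X weights and pretrained checkpoints in place) can save a failed run. A hedged sketch, assuming the paths named in the setup steps:

```bash
# FFmpeg must be callable for the final video rendering.
ffmpeg -version
# SMPL-X neutral model and the unzipped pretrained checkpoints.
ls visualise/smplx_model/SMPLX_NEUTRAL_2020.npz
ls experiments/
# Demo audio used by the example command.
ls demo_audio/1st-page.wav
```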
 
__init__.py DELETED
File without changes
app.py DELETED
@@ -1,98 +0,0 @@
1
- import gradio as gr
2
- import os
3
- import subprocess
4
- import time
5
- import logging
6
- import traceback
7
-
8
- def process_audio(audio_file):
9
- # Configure detailed logging
10
- logging.basicConfig(level=logging.DEBUG,
11
- format='%(asctime)s - %(levelname)s - %(message)s')
12
- logger = logging.getLogger(__name__)
13
-
14
- try:
15
- # Detailed logging for input
16
- logger.info(f"Received audio file: {audio_file}")
17
- logger.info(f"Audio file exists: {os.path.exists(audio_file)}")
18
-
19
- # Validate input file
20
- if not audio_file or not os.path.exists(audio_file):
21
- raise ValueError(f"Invalid or non-existent audio file: {audio_file}")
22
-
23
- # Ensure output directory exists
24
- os.makedirs("visualise/video/body-pixel2", exist_ok=True)
25
-
26
- # Debugging: print current working directory and file details
27
- logger.debug(f"Current working directory: {os.getcwd()}")
28
- logger.debug(f"Audio file path: {os.path.abspath(audio_file)}")
29
- logger.debug(f"Audio file size: {os.path.getsize(audio_file)} bytes")
30
-
31
- # Construct command with full paths
32
- cmd = [
33
- "python",
34
- os.path.abspath("scripts/demo.py"),
35
- "--config_file", os.path.abspath("config/body_pixel.json"),
36
- "--infer",
37
- "--audio_file", os.path.abspath(audio_file),
38
- "--id", "0",
39
- "--whole_body"
40
- ]
41
-
42
- logger.info(f"Executing command: {' '.join(cmd)}")
43
-
44
- # Run with more detailed error capture
45
- result = subprocess.run(
46
- cmd,
47
- stdout=subprocess.PIPE,
48
- stderr=subprocess.PIPE,
49
- text=True,
50
- cwd=os.getcwd(), # Ensure correct working directory
51
- timeout=1800
52
- )
53
-
54
- # Log full command output
55
- logger.info(f"Command STDOUT: {result.stdout}")
56
- logger.error(f"Command STDERR: {result.stderr}")
57
-
58
- # Determine output video path
59
- audio_name = os.path.splitext(os.path.basename(audio_file))[0]
60
- output_dir = f"visualise/video/body-pixel2/{audio_name}"
61
- output_path = f"{output_dir}/1st-page.mp4"
62
-
63
- logger.info(f"Expected output path: {output_path}")
64
-
65
- # Check output video
66
- if os.path.exists(output_path):
67
- logger.info(f"Output video found: {output_path}")
68
- return output_path
69
- else:
70
- logger.error("Output video not generated")
71
- return None, f"Error: Output video not generated. STDERR: {result.stderr}"
72
-
73
- except subprocess.TimeoutExpired:
74
- logger.error("Inference process timed out")
75
- return None, "Error: Inference process took too long"
76
-
77
- except Exception as e:
78
- logger.error(f"Unexpected error: {str(e)}")
79
- logger.error(traceback.format_exc())
80
- return None, f"Unexpected error: {str(e)}"
81
-
82
- # Gradio Interface for 3.x compatibility
83
- demo = gr.Interface(
84
- fn=process_audio,
85
- inputs=gr.inputs.File(type="file", label="Upload Audio File"),
86
- outputs=gr.outputs.Video(label="Generated Motion Video"),
87
- title="TalkSHOW: Speech-to-Motion Translation System",
88
- description="Convert speech audio to realistic 3D human motion using the SMPL-X model.",
89
- examples=[["demo_audio/1st-page.wav"]]
90
- )
91
-
92
- # Launch with comprehensive logging
93
- if __name__ == "__main__":
94
- demo.launch(
95
- server_name="0.0.0.0",
96
- server_port=7860,
97
- debug=True # Enable Gradio debug mode
98
- )
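For local testing outside Docker, the Space entry point above can be launched directly. A minimal sketch, assuming the dependencies from requirements.txt and the model files described in the README are already in place:

```bash
pip install -r requirements.txt
python app.py
# The Gradio UI is then served on http://localhost:7860 (server_name 0.0.0.0, server_port 7860 above).
```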
 
config/LS3DCG.json DELETED
@@ -1,65 +0,0 @@
1
- {
2
- "config_root_path": "/is/cluster/scratch/hyi/ExpressiveBody/SMPLifyX4/scripts",
3
- "dataset_load_mode": "pickle",
4
- "store_file_path": "store.pkl",
5
- "smplx_npz_path": "visualise/smplx_model/SMPLX_NEUTRAL_2020.npz",
6
- "extra_joint_path": "visualise/smplx_model/smplx_extra_joints.yaml",
7
- "j14_regressor_path": "visualise/smplx_model/SMPLX_to_J14.pkl",
8
- "param": {
9
- "w_j": 1,
10
- "w_b": 1,
11
- "w_h": 1
12
- },
13
- "Data": {
14
- "data_root": "../ExpressiveWholeBodyDatasetv1.0/",
15
- "pklname": "_3d_mfcc.pkl",
16
- "whole_video": false,
17
- "pose": {
18
- "normalization": false,
19
- "convert_to_6d": false,
20
- "norm_method": "all",
21
- "augmentation": false,
22
- "generate_length": 88,
23
- "pre_pose_length": 0,
24
- "pose_dim": 99,
25
- "expression": true
26
- },
27
- "aud": {
28
- "feat_method": "mfcc",
29
- "aud_feat_dim": 64,
30
- "aud_feat_win_size": null,
31
- "context_info": false
32
- }
33
- },
34
- "Model": {
35
- "model_type": "body",
36
- "model_name": "s2g_LS3DCG",
37
- "code_num": 2048,
38
- "AudioOpt": "Adam",
39
- "encoder_choice": "mfcc",
40
- "gan": false
41
- },
42
- "DataLoader": {
43
- "batch_size": 128,
44
- "num_workers": 0
45
- },
46
- "Train": {
47
- "epochs": 100,
48
- "max_gradient_norm": 5,
49
- "learning_rate": {
50
- "generator_learning_rate": 1e-4,
51
- "discriminator_learning_rate": 1e-4
52
- },
53
- "weights": {
54
- "keypoint_loss_weight": 1.0,
55
- "gan_loss_weight": 1.0
56
- }
57
- },
58
- "Log": {
59
- "save_every": 50,
60
- "print_every": 200,
61
- "name": "LS3DCG"
62
- },
63
- "device": "cpu"
64
- }
65
-
 
config/body_pixel.json DELETED
@@ -1,63 +0,0 @@
1
- {
2
- "config_root_path": "/is/cluster/scratch/hyi/ExpressiveBody/SMPLifyX4/scripts",
3
- "dataset_load_mode": "json",
4
- "store_file_path": "store.pkl",
5
- "smplx_npz_path": "visualise/smplx_model/SMPLX_NEUTRAL_2020.npz",
6
- "extra_joint_path": "visualise/smplx_model/smplx_extra_joints.yaml",
7
- "j14_regressor_path": "visualise/smplx_model/SMPLX_to_J14.pkl",
8
- "param": {
9
- "w_j": 1,
10
- "w_b": 1,
11
- "w_h": 1
12
- },
13
- "Data": {
14
- "data_root": "../ExpressiveWholeBodyDatasetv1.0/",
15
- "pklname": "_3d_mfcc.pkl",
16
- "whole_video": false,
17
- "pose": {
18
- "normalization": false,
19
- "convert_to_6d": false,
20
- "norm_method": "all",
21
- "augmentation": false,
22
- "generate_length": 88,
23
- "pre_pose_length": 0,
24
- "pose_dim": 99,
25
- "expression": true
26
- },
27
- "aud": {
28
- "feat_method": "mfcc",
29
- "aud_feat_dim": 64,
30
- "aud_feat_win_size": null,
31
- "context_info": false
32
- }
33
- },
34
- "Model": {
35
- "model_type": "body",
36
- "model_name": "s2g_body_pixel",
37
- "composition": true,
38
- "code_num": 2048,
39
- "bh_model": true,
40
- "AudioOpt": "Adam",
41
- "encoder_choice": "mfcc",
42
- "gan": false,
43
- "vq_path": "./experiments/2022-10-31-smplx_S2G-body-vq-3d/ckpt-99.pth"
44
- },
45
- "DataLoader": {
46
- "batch_size": 128,
47
- "num_workers": 0
48
- },
49
- "Train": {
50
- "epochs": 100,
51
- "max_gradient_norm": 5,
52
- "learning_rate": {
53
- "generator_learning_rate": 1e-4,
54
- "discriminator_learning_rate": 1e-4
55
- }
56
- },
57
- "Log": {
58
- "save_every": 50,
59
- "print_every": 200,
60
- "name": "body-pixel2"
61
- }
62
- }
63
-
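The README above asks that `vq_path` in this config point to an existing checkpoint. A quick hedged check, using the value shown in the deleted config:

```bash
# Print the configured vq_path, then confirm the checkpoint file exists.
python -c "import json; print(json.load(open('config/body_pixel.json'))['Model']['vq_path'])"
ls -lh ./experiments/2022-10-31-smplx_S2G-body-vq-3d/ckpt-99.pth
```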
 
config/body_vq.json DELETED
@@ -1,62 +0,0 @@
1
- {
2
- "config_root_path": "/is/cluster/scratch/hyi/ExpressiveBody/SMPLifyX4/scripts",
3
- "dataset_load_mode": "json",
4
- "store_file_path": "store.pkl",
5
- "smplx_npz_path": "visualise/smplx_model/SMPLX_NEUTRAL_2020.npz",
6
- "extra_joint_path": "visualise/smplx_model/smplx_extra_joints.yaml",
7
- "j14_regressor_path": "visualise/smplx_model/SMPLX_to_J14.pkl",
8
- "param": {
9
- "w_j": 1,
10
- "w_b": 1,
11
- "w_h": 1
12
- },
13
- "Data": {
14
- "data_root": "../ExpressiveWholeBodyDatasetv1.0/",
15
- "pklname": "_3d_mfcc.pkl",
16
- "whole_video": false,
17
- "pose": {
18
- "normalization": false,
19
- "convert_to_6d": false,
20
- "norm_method": "all",
21
- "augmentation": false,
22
- "generate_length": 88,
23
- "pre_pose_length": 0,
24
- "pose_dim": 99,
25
- "expression": true
26
- },
27
- "aud": {
28
- "feat_method": "mfcc",
29
- "aud_feat_dim": 64,
30
- "aud_feat_win_size": null,
31
- "context_info": false
32
- }
33
- },
34
- "Model": {
35
- "model_type": "body",
36
- "model_name": "s2g_body_vq",
37
- "composition": true,
38
- "code_num": 2048,
39
- "bh_model": true,
40
- "AudioOpt": "Adam",
41
- "encoder_choice": "mfcc",
42
- "gan": false
43
- },
44
- "DataLoader": {
45
- "batch_size": 128,
46
- "num_workers": 0
47
- },
48
- "Train": {
49
- "epochs": 100,
50
- "max_gradient_norm": 5,
51
- "learning_rate": {
52
- "generator_learning_rate": 1e-4,
53
- "discriminator_learning_rate": 1e-4
54
- }
55
- },
56
- "Log": {
57
- "save_every": 50,
58
- "print_every": 200,
59
- "name": "body-vq"
60
- }
61
- }
62
-
 
config/face.json DELETED
@@ -1,59 +0,0 @@
1
- {
2
- "config_root_path": "/is/cluster/scratch/hyi/ExpressiveBody/SMPLifyX4/scripts",
3
- "dataset_load_mode": "json",
4
- "store_file_path": "store.pkl",
5
- "smplx_npz_path": "visualise/smplx_model/SMPLX_NEUTRAL_2020.npz",
6
- "extra_joint_path": "visualise/smplx_model/smplx_extra_joints.yaml",
7
- "j14_regressor_path": "visualise/smplx_model/SMPLX_to_J14.pkl",
8
- "param": {
9
- "w_j": 1,
10
- "w_b": 1,
11
- "w_h": 1
12
- },
13
- "Data": {
14
- "data_root": "../ExpressiveWholeBodyDatasetv1.0/",
15
- "pklname": "_3d_wv2.pkl",
16
- "whole_video": true,
17
- "pose": {
18
- "normalization": false,
19
- "convert_to_6d": false,
20
- "norm_method": "all",
21
- "augmentation": false,
22
- "generate_length": 88,
23
- "pre_pose_length": 0,
24
- "pose_dim": 99,
25
- "expression": true
26
- },
27
- "aud": {
28
- "feat_method": "mfcc",
29
- "aud_feat_dim": 64,
30
- "aud_feat_win_size": null,
31
- "context_info": false
32
- }
33
- },
34
- "Model": {
35
- "model_type": "face",
36
- "model_name": "s2g_face",
37
- "AudioOpt": "SGD",
38
- "encoder_choice": "faceformer",
39
- "gan": false
40
- },
41
- "DataLoader": {
42
- "batch_size": 1,
43
- "num_workers": 0
44
- },
45
- "Train": {
46
- "epochs": 100,
47
- "max_gradient_norm": 5,
48
- "learning_rate": {
49
- "generator_learning_rate": 1e-4,
50
- "discriminator_learning_rate": 1e-4
51
- }
52
- },
53
- "Log": {
54
- "save_every": 50,
55
- "print_every": 1000,
56
- "name": "face"
57
- }
58
- }
59
-
 
data_utils/__init__.py DELETED
@@ -1,3 +0,0 @@
1
- # from .dataloader_csv import MultiVidData as csv_data
2
- from .dataloader_torch import MultiVidData as torch_data
3
- from .utils import get_melspec, get_mfcc, get_mfcc_old, get_mfcc_psf, get_mfcc_psf_min, get_mfcc_ta
 
data_utils/apply_split.py DELETED
@@ -1,51 +0,0 @@
1
- import os
2
- from tqdm import tqdm
3
- import pickle
4
- import shutil
5
-
6
- speakers = ['seth', 'oliver', 'conan', 'chemistry']
7
- source_data_root = "../expressive_body-V0.7"
8
- data_root = "D:/Downloads/SHOW_dataset_v1.0/ExpressiveWholeBodyDatasetReleaseV1.0"
9
-
10
- f_read = open('split_more_than_2s.pkl', 'rb')
11
- f_save = open('none.pkl', 'wb')
12
- data_split = pickle.load(f_read)
13
- none_split = []
14
-
15
- train = val = test = 0
16
-
17
- for speaker_name in speakers:
18
- speaker_root = os.path.join(data_root, speaker_name)
19
-
20
- videos = [v for v in data_split[speaker_name]]
21
-
22
- for vid in tqdm(videos, desc="Processing training data of {}......".format(speaker_name)):
23
- for split in data_split[speaker_name][vid]:
24
- for seq in data_split[speaker_name][vid][split]:
25
-
26
- seq = seq.replace('\\', '/')
27
- old_file_path = os.path.join(data_root, speaker_name, vid, seq.split('/')[-1])
28
- old_file_path = old_file_path.replace('\\', '/')
29
- new_file_path = seq.replace(source_data_root.split('/')[-1], data_root.split('/')[-1])
30
- try:
31
- shutil.move(old_file_path, new_file_path)
32
- if split == 'train':
33
- train = train + 1
34
- elif split == 'test':
35
- test = test + 1
36
- elif split == 'val':
37
- val = val + 1
38
- except FileNotFoundError:
39
- none_split.append(old_file_path)
40
- print(f"The file {old_file_path} does not exists.")
41
- except shutil.Error:
42
- none_split.append(old_file_path)
43
- print(f"The file {old_file_path} does not exists.")
44
-
45
- print(none_split.__len__())
46
- pickle.dump(none_split, f_save)
47
- f_save.close()
48
-
49
- print(train, val, test)
50
-
51
-
 
data_utils/axis2matrix.py DELETED
@@ -1,29 +0,0 @@
1
- import numpy as np
2
- import math
3
- import scipy.linalg as linalg
4
-
5
-
6
- def rotate_mat(axis, radian):
7
-
8
- a = np.cross(np.eye(3), axis / linalg.norm(axis) * radian)
9
-
10
- rot_matrix = linalg.expm(a)
11
-
12
- return rot_matrix
13
-
14
- def aaa2mat(axis, sin, cos):
15
- i = np.eye(3)
16
- nnt = np.dot(axis.T, axis)
17
- s = np.asarray([[0, -axis[0,2], axis[0,1]],
18
- [axis[0,2], 0, -axis[0,0]],
19
- [-axis[0,1], axis[0,0], 0]])
20
- r = cos * i + (1-cos)*nnt +sin * s
21
- return r
22
-
23
- rand_axis = np.asarray([[1,0,0]])
24
- # rotation angle
25
- r = math.pi/2
26
- # return the rotation matrix
27
- rot_matrix = rotate_mat(rand_axis, r)
28
- r2 = aaa2mat(rand_axis, np.sin(r), np.cos(r))
29
- print(rot_matrix)
 
data_utils/consts.py DELETED
The diff for this file is too large to render. See raw diff
 
data_utils/dataloader_torch.py DELETED
@@ -1,279 +0,0 @@
1
- import sys
2
- import os
3
- sys.path.append(os.getcwd())
4
- import os
5
- from tqdm import tqdm
6
- from data_utils.utils import *
7
- import torch.utils.data as data
8
- from data_utils.mesh_dataset import SmplxDataset
9
- from transformers import Wav2Vec2Processor
10
-
11
-
12
- class MultiVidData():
13
- def __init__(self,
14
- data_root,
15
- speakers,
16
- split='train',
17
- limbscaling=False,
18
- normalization=False,
19
- norm_method='new',
20
- split_trans_zero=False,
21
- num_frames=25,
22
- num_pre_frames=25,
23
- num_generate_length=None,
24
- aud_feat_win_size=None,
25
- aud_feat_dim=64,
26
- feat_method='mel_spec',
27
- context_info=False,
28
- smplx=False,
29
- audio_sr=16000,
30
- convert_to_6d=False,
31
- expression=False,
32
- config=None
33
- ):
34
- self.data_root = data_root
35
- self.speakers = speakers
36
- self.split = split
37
- if split == 'pre':
38
- self.split = 'train'
39
- self.norm_method=norm_method
40
- self.normalization = normalization
41
- self.limbscaling = limbscaling
42
- self.convert_to_6d = convert_to_6d
43
- self.num_frames=num_frames
44
- self.num_pre_frames=num_pre_frames
45
- if num_generate_length is None:
46
- self.num_generate_length = num_frames
47
- else:
48
- self.num_generate_length = num_generate_length
49
- self.split_trans_zero=split_trans_zero
50
-
51
- dataset = SmplxDataset
52
-
53
- if self.split_trans_zero:
54
- self.trans_dataset_list = []
55
- self.zero_dataset_list = []
56
- else:
57
- self.all_dataset_list = []
58
- self.dataset={}
59
- self.complete_data=[]
60
- self.config=config
61
- load_mode=self.config.dataset_load_mode
62
-
63
- ######################load with pickle file
64
- if load_mode=='pickle':
65
- import pickle
66
- import subprocess
67
-
68
- # store_file_path='/tmp/store.pkl'
69
- # cp /is/cluster/scratch/hyi/ExpressiveBody/SMPLifyX4/scripts/store.pkl /tmp/store.pkl
70
- # subprocess.run(f'cp /is/cluster/scratch/hyi/ExpressiveBody/SMPLifyX4/scripts/store.pkl {store_file_path}',shell=True)
71
-
72
- # f = open(self.config.store_file_path, 'rb+')
73
- f = open(self.split+config.Data.pklname, 'rb+')
74
- self.dataset=pickle.load(f)
75
- f.close()
76
- for key in self.dataset:
77
- self.complete_data.append(self.dataset[key].complete_data)
78
- ######################load with pickle file
79
-
80
- ######################load with a csv file
81
- elif load_mode=='csv':
82
-
83
- # imported here from one of my code folders; to be integrated properly later
84
- try:
85
- sys.path.append(self.config.config_root_path)
86
- from config import config_path
87
- from csv_parser import csv_parse
88
-
89
- except ImportError as e:
90
- print(f'err: {e}')
91
- raise ImportError('config root path error...')
92
-
93
-
94
- for speaker_name in self.speakers:
95
- # df_intervals=pd.read_csv(self.config.voca_csv_file_path)
96
- df_intervals=None
97
- df_intervals=df_intervals[df_intervals['speaker']==speaker_name]
98
- df_intervals = df_intervals[df_intervals['dataset'] == self.split]
99
-
100
- print(f'speaker {speaker_name} train interval length: {len(df_intervals)}')
101
- for iter_index, (_, interval) in tqdm(
102
- (enumerate(df_intervals.iterrows())),desc=f'load {speaker_name}'
103
- ):
104
-
105
- (
106
- interval_index,
107
- interval_speaker,
108
- interval_video_fn,
109
- interval_id,
110
-
111
- start_time,
112
- end_time,
113
- duration_time,
114
- start_time_10,
115
- over_flow_flag,
116
- short_dur_flag,
117
-
118
- big_video_dir,
119
- small_video_dir_name,
120
- speaker_video_path,
121
-
122
- voca_basename,
123
- json_basename,
124
- wav_basename,
125
- voca_top_clip_path,
126
- voca_json_clip_path,
127
- voca_wav_clip_path,
128
-
129
- audio_output_fn,
130
- image_output_path,
131
- pifpaf_output_path,
132
- mp_output_path,
133
- op_output_path,
134
- deca_output_path,
135
- pixie_output_path,
136
- cam_output_path,
137
- ours_output_path,
138
- merge_output_path,
139
- multi_output_path,
140
- gt_output_path,
141
- ours_images_path,
142
- pkl_fil_path,
143
- )=csv_parse(interval)
144
-
145
- if not os.path.exists(pkl_fil_path) or not os.path.exists(audio_output_fn):
146
- continue
147
-
148
- key=f'{interval_video_fn}/{small_video_dir_name}'
149
- self.dataset[key] = dataset(
150
- data_root=pkl_fil_path,
151
- speaker=speaker_name,
152
- audio_fn=audio_output_fn,
153
- audio_sr=audio_sr,
154
- fps=num_frames,
155
- feat_method=feat_method,
156
- audio_feat_dim=aud_feat_dim,
157
- train=(self.split == 'train'),
158
- load_all=True,
159
- split_trans_zero=self.split_trans_zero,
160
- limbscaling=self.limbscaling,
161
- num_frames=self.num_frames,
162
- num_pre_frames=self.num_pre_frames,
163
- num_generate_length=self.num_generate_length,
164
- audio_feat_win_size=aud_feat_win_size,
165
- context_info=context_info,
166
- convert_to_6d=convert_to_6d,
167
- expression=expression,
168
- config=self.config
169
- )
170
- self.complete_data.append(self.dataset[key].complete_data)
171
- ######################load with a csv file
172
-
173
- ######################origin load method
174
- elif load_mode=='json':
175
-
176
- # if self.split == 'train':
177
- # import pickle
178
- # f = open('store.pkl', 'rb+')
179
- # self.dataset=pickle.load(f)
180
- # f.close()
181
- # for key in self.dataset:
182
- # self.complete_data.append(self.dataset[key].complete_data)
183
- # else:https://pytorch-tutorial-assets.s3.amazonaws.com/VOiCES_devkit/source-16k/train/sp0307/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav
184
- # if config.Model.model_type == 'face':
185
- am = Wav2Vec2Processor.from_pretrained("vitouphy/wav2vec2-xls-r-300m-phoneme")
186
- am_sr = 16000
187
- # else:
188
- # am, am_sr = None, None
189
- for speaker_name in self.speakers:
190
- speaker_root = os.path.join(self.data_root, speaker_name)
191
-
192
- videos=[v for v in os.listdir(speaker_root) ]
193
- print(videos)
194
-
195
- haode = huaide = 0
196
-
197
- for vid in tqdm(videos, desc="Processing training data of {}......".format(speaker_name)):
198
- source_vid=vid
199
- # vid_pth=os.path.join(speaker_root, source_vid, 'images/half', self.split)
200
- vid_pth = os.path.join(speaker_root, source_vid, self.split)
201
- if smplx == 'pose':
202
- seqs = [s for s in os.listdir(vid_pth) if (s.startswith('clip'))]
203
- else:
204
- try:
205
- seqs = [s for s in os.listdir(vid_pth)]
206
- except:
207
- continue
208
-
209
- for s in seqs:
210
- seq_root=os.path.join(vid_pth, s)
211
- key = seq_root # correspond to clip******
212
- audio_fname = os.path.join(speaker_root, source_vid, self.split, s, '%s.wav' % (s))
213
- motion_fname = os.path.join(speaker_root, source_vid, self.split, s, '%s.pkl' % (s))
214
- if not os.path.isfile(audio_fname) or not os.path.isfile(motion_fname):
215
- huaide = huaide + 1
216
- continue
217
-
218
- self.dataset[key]=dataset(
219
- data_root=seq_root,
220
- speaker=speaker_name,
221
- motion_fn=motion_fname,
222
- audio_fn=audio_fname,
223
- audio_sr=audio_sr,
224
- fps=num_frames,
225
- feat_method=feat_method,
226
- audio_feat_dim=aud_feat_dim,
227
- train=(self.split=='train'),
228
- load_all=True,
229
- split_trans_zero=self.split_trans_zero,
230
- limbscaling=self.limbscaling,
231
- num_frames=self.num_frames,
232
- num_pre_frames=self.num_pre_frames,
233
- num_generate_length=self.num_generate_length,
234
- audio_feat_win_size=aud_feat_win_size,
235
- context_info=context_info,
236
- convert_to_6d=convert_to_6d,
237
- expression=expression,
238
- config=self.config,
239
- am=am,
240
- am_sr=am_sr,
241
- whole_video=config.Data.whole_video
242
- )
243
- self.complete_data.append(self.dataset[key].complete_data)
244
- haode = haode + 1
245
- print("huaide:{}, haode:{}".format(huaide, haode))
246
- import pickle
247
-
248
- f = open(self.split+config.Data.pklname, 'wb')
249
- pickle.dump(self.dataset, f)
250
- f.close()
251
- ######################origin load method
252
-
253
- self.complete_data=np.concatenate(self.complete_data, axis=0)
254
-
255
- # assert self.complete_data.shape[-1] == (12+21+21)*2
256
- self.normalize_stats = {}
257
-
258
- self.data_mean = None
259
- self.data_std = None
260
-
261
- def get_dataset(self):
262
- self.normalize_stats['mean'] = self.data_mean
263
- self.normalize_stats['std'] = self.data_std
264
-
265
- for key in list(self.dataset.keys()):
266
- if self.dataset[key].complete_data.shape[0] < self.num_generate_length:
267
- continue
268
- self.dataset[key].num_generate_length = self.num_generate_length
269
- self.dataset[key].get_dataset(self.normalization, self.normalize_stats, self.split)
270
- self.all_dataset_list.append(self.dataset[key].all_dataset)
271
-
272
- if self.split_trans_zero:
273
- self.trans_dataset = data.ConcatDataset(self.trans_dataset_list)
274
- self.zero_dataset = data.ConcatDataset(self.zero_dataset_list)
275
- else:
276
- self.all_dataset = data.ConcatDataset(self.all_dataset_list)
277
-
278
-
279
-
 
data_utils/dataset_preprocess.py DELETED
@@ -1,170 +0,0 @@
1
- import os
2
- import pickle
3
- from tqdm import tqdm
4
- import shutil
5
- import torch
6
- import numpy as np
7
- import librosa
8
- import random
9
-
10
- speakers = ['seth', 'conan', 'oliver', 'chemistry']
11
- data_root = "../ExpressiveWholeBodyDatasetv1.0/"
12
- split = 'train'
13
-
14
-
15
-
16
- def split_list(full_list,shuffle=False,ratio=0.2):
17
- n_total = len(full_list)
18
- offset_0 = int(n_total * ratio)
19
- offset_1 = int(n_total * ratio * 2)
20
- if n_total==0 or offset_1<1:
21
- return [],full_list
22
- if shuffle:
23
- random.shuffle(full_list)
24
- sublist_0 = full_list[:offset_0]
25
- sublist_1 = full_list[offset_0:offset_1]
26
- sublist_2 = full_list[offset_1:]
27
- return sublist_0, sublist_1, sublist_2
28
-
29
-
30
- def moveto(list, file):
31
- for f in list:
32
- before, after = '/'.join(f.split('/')[:-1]), f.split('/')[-1]
33
- new_path = os.path.join(before, file)
34
- new_path = os.path.join(new_path, after)
35
- # os.makedirs(new_path)
36
- # os.path.isdir(new_path)
37
- # shutil.move(f, new_path)
38
-
39
- # move to the new directory
40
- shutil.copytree(f, new_path)
41
- # remove the original files from the train split
42
- shutil.rmtree(f)
43
- return None
44
-
45
-
46
- def read_pkl(data):
47
- betas = np.array(data['betas'])
48
-
49
- jaw_pose = np.array(data['jaw_pose'])
50
- leye_pose = np.array(data['leye_pose'])
51
- reye_pose = np.array(data['reye_pose'])
52
- global_orient = np.array(data['global_orient']).squeeze()
53
- body_pose = np.array(data['body_pose_axis'])
54
- left_hand_pose = np.array(data['left_hand_pose'])
55
- right_hand_pose = np.array(data['right_hand_pose'])
56
-
57
- full_body = np.concatenate(
58
- (jaw_pose, leye_pose, reye_pose, global_orient, body_pose, left_hand_pose, right_hand_pose), axis=1)
59
-
60
- expression = np.array(data['expression'])
61
- full_body = np.concatenate((full_body, expression), axis=1)
62
-
63
- if (full_body.shape[0] < 90) or (torch.isnan(torch.from_numpy(full_body)).sum() > 0):
64
- return 1
65
- else:
66
- return 0
67
-
68
-
69
- for speaker_name in speakers:
70
- speaker_root = os.path.join(data_root, speaker_name)
71
-
72
- videos = [v for v in os.listdir(speaker_root)]
73
- print(videos)
74
-
75
- haode = huaide = 0
76
- total_seqs = []
77
-
78
- for vid in tqdm(videos, desc="Processing training data of {}......".format(speaker_name)):
79
- # for vid in videos:
80
- source_vid = vid
81
- vid_pth = os.path.join(speaker_root, source_vid)
82
- # vid_pth = os.path.join(speaker_root, source_vid, 'images/half', split)
83
- t = os.path.join(speaker_root, source_vid, 'test')
84
- v = os.path.join(speaker_root, source_vid, 'val')
85
-
86
- # if os.path.exists(t):
87
- # shutil.rmtree(t)
88
- # if os.path.exists(v):
89
- # shutil.rmtree(v)
90
- try:
91
- seqs = [s for s in os.listdir(vid_pth)]
92
- except:
93
- continue
94
- # if len(seqs) == 0:
95
- # shutil.rmtree(os.path.join(speaker_root, source_vid))
96
- # None
97
- for s in seqs:
98
- quality = 0
99
- total_seqs.append(os.path.join(vid_pth,s))
100
- seq_root = os.path.join(vid_pth, s)
101
- key = seq_root # correspond to clip******
102
- audio_fname = os.path.join(speaker_root, source_vid, s, '%s.wav' % (s))
103
-
104
- # delete the data without audio or the audio file could not be read
105
- if os.path.isfile(audio_fname):
106
- try:
107
- audio = librosa.load(audio_fname)
108
- except:
109
- # print(key)
110
- shutil.rmtree(key)
111
- huaide = huaide + 1
112
- continue
113
- else:
114
- huaide = huaide + 1
115
- # print(key)
116
- shutil.rmtree(key)
117
- continue
118
-
119
- # check motion file
120
- motion_fname = os.path.join(speaker_root, source_vid, s, '%s.pkl' % (s))
121
- try:
122
- f = open(motion_fname, 'rb+')
123
- except:
124
- shutil.rmtree(key)
125
- huaide = huaide + 1
126
- continue
127
-
128
- data = pickle.load(f)
129
- w = read_pkl(data)
130
- f.close()
131
- quality = quality + w
132
-
133
- if w == 1:
134
- shutil.rmtree(key)
135
- # print(key)
136
- huaide = huaide + 1
137
- continue
138
-
139
- haode = haode + 1
140
-
141
- print("huaide:{}, haode:{}, total_seqs:{}".format(huaide, haode, total_seqs.__len__()))
142
-
143
- for speaker_name in speakers:
144
- speaker_root = os.path.join(data_root, speaker_name)
145
-
146
- videos = [v for v in os.listdir(speaker_root)]
147
- print(videos)
148
-
149
- haode = huaide = 0
150
- total_seqs = []
151
-
152
- for vid in tqdm(videos, desc="Processing training data of {}......".format(speaker_name)):
153
- # for vid in videos:
154
- source_vid = vid
155
- vid_pth = os.path.join(speaker_root, source_vid)
156
- try:
157
- seqs = [s for s in os.listdir(vid_pth)]
158
- except:
159
- continue
160
- for s in seqs:
161
- quality = 0
162
- total_seqs.append(os.path.join(vid_pth, s))
163
- print("total_seqs:{}".format(total_seqs.__len__()))
164
- # split the dataset
165
- test_list, val_list, train_list = split_list(total_seqs, True, 0.1)
166
- print(len(test_list), len(val_list), len(train_list))
167
- moveto(train_list, 'train')
168
- moveto(test_list, 'test')
169
- moveto(val_list, 'val')
170
-
 
data_utils/get_j.py DELETED
@@ -1,51 +0,0 @@
1
- import torch
2
-
3
-
4
- def to3d(poses, config):
5
- if config.Data.pose.convert_to_6d:
6
- if config.Data.pose.expression:
7
- poses_exp = poses[:, -100:]
8
- poses = poses[:, :-100]
9
-
10
- poses = poses.reshape(poses.shape[0], -1, 5)
11
- sin, cos = poses[:, :, 3], poses[:, :, 4]
12
- pose_angle = torch.atan2(sin, cos)
13
- poses = (poses[:, :, :3] * pose_angle.unsqueeze(dim=-1)).reshape(poses.shape[0], -1)
14
-
15
- if config.Data.pose.expression:
16
- poses = torch.cat([poses, poses_exp], dim=-1)
17
- return poses
18
-
19
-
20
- def get_joint(smplx_model, betas, pred):
21
- joint = smplx_model(betas=betas.repeat(pred.shape[0], 1),
22
- expression=pred[:, 165:265],
23
- jaw_pose=pred[:, 0:3],
24
- leye_pose=pred[:, 3:6],
25
- reye_pose=pred[:, 6:9],
26
- global_orient=pred[:, 9:12],
27
- body_pose=pred[:, 12:75],
28
- left_hand_pose=pred[:, 75:120],
29
- right_hand_pose=pred[:, 120:165],
30
- return_verts=True)['joints']
31
- return joint
32
-
33
-
34
- def get_joints(smplx_model, betas, pred):
35
- if len(pred.shape) == 3:
36
- B = pred.shape[0]
37
- x = 4 if B>= 4 else B
38
- T = pred.shape[1]
39
- pred = pred.reshape(-1, 265)
40
- smplx_model.batch_size = L = T * x
41
-
42
- times = pred.shape[0] // smplx_model.batch_size
43
- joints = []
44
- for i in range(times):
45
- joints.append(get_joint(smplx_model, betas, pred[i*L:(i+1)*L]))
46
- joints = torch.cat(joints, dim=0)
47
- joints = joints.reshape(B, T, -1, 3)
48
- else:
49
- smplx_model.batch_size = pred.shape[0]
50
- joints = get_joint(smplx_model, betas, pred)
51
- return joints
 
data_utils/hand_component.json DELETED
The diff for this file is too large to render. See raw diff
 
data_utils/lower_body.py DELETED
@@ -1,143 +0,0 @@
1
- import numpy as np
2
- import torch
3
-
4
- lower_pose = torch.tensor(
5
- [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0747, -0.0158, -0.0152, -1.1826512813568115, 0.23866955935955048,
6
- 0.15146760642528534, -1.2604516744613647, -0.3160211145877838,
7
- -0.1603458970785141, 1.1654603481292725, 0.0, 0.0, 1.2521806955337524, 0.041598282754421234, -0.06312154978513718,
8
- 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
9
- lower_pose_stand = torch.tensor([
10
- 8.9759e-04, 7.1074e-04, -5.9163e-06, 8.9759e-04, 7.1074e-04, -5.9163e-06,
11
- 3.0747, -0.0158, -0.0152,
12
- -3.6665e-01, -8.8455e-03, 1.6113e-01, -3.6665e-01, -8.8455e-03, 1.6113e-01,
13
- -3.9716e-01, -4.0229e-02, -1.2637e-01,
14
- 7.9163e-01, 6.8519e-02, -1.5091e-01, 7.9163e-01, 6.8519e-02, -1.5091e-01,
15
- 7.8632e-01, -4.3810e-02, 1.4375e-02,
16
- -1.0675e-01, 1.2635e-01, 1.6711e-02, -1.0675e-01, 1.2635e-01, 1.6711e-02, ])
17
- # lower_pose_stand = torch.tensor(
18
- # [6.4919e-02, 3.3018e-02, 1.7485e-02, 8.9759e-04, 7.1074e-04, -5.9163e-06,
19
- # 3.0747, -0.0158, -0.0152,
20
- # -3.3633e+00, -9.3915e-02, 3.0996e-01, -3.6665e-01, -8.8455e-03, 1.6113e-01,
21
- # 1.1654603481292725, 0.0, 0.0,
22
- # 4.4167e-01, 6.7183e-03, -3.6379e-03, 7.9163e-01, 6.8519e-02, -1.5091e-01,
23
- # 0.0, 0.0, 0.0,
24
- # 2.2910e-02, -2.4797e-02, -5.5657e-03, -1.0675e-01, 1.2635e-01, 1.6711e-02,])
25
- lower_body = [0, 1, 3, 4, 6, 7, 9, 10]
26
- count_part = [6, 9, 12, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
27
- 29, 30, 31, 32, 33, 34, 35, 36, 37,
28
- 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54]
29
- fix_index = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
30
- 29,
31
- 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
32
- 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,
33
- 65, 66, 67, 68, 69, 70, 71, 72, 73, 74]
34
- all_index = np.ones(275)
35
- all_index[fix_index] = 0
36
- c_index = []
37
- i = 0
38
- for num in all_index:
39
- if num == 1:
40
- c_index.append(i)
41
- i = i + 1
42
- c_index = np.asarray(c_index)
43
-
44
- fix_index_3d = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
45
- 21, 22, 23, 24, 25, 26,
46
- 30, 31, 32, 33, 34, 35,
47
- 45, 46, 47, 48, 49, 50]
48
- all_index_3d = np.ones(165)
49
- all_index_3d[fix_index_3d] = 0
50
- c_index_3d = []
51
- i = 0
52
- for num in all_index_3d:
53
- if num == 1:
54
- c_index_3d.append(i)
55
- i = i + 1
56
- c_index_3d = np.asarray(c_index_3d)
57
-
58
- c_index_6d = []
59
- i = 0
60
- for num in all_index_3d:
61
- if num == 1:
62
- c_index_6d.append(2*i)
63
- c_index_6d.append(2 * i + 1)
64
- i = i + 1
65
- c_index_6d = np.asarray(c_index_6d)
66
-
67
-
68
- def part2full(input, stand=False):
69
- if stand:
70
- # lp = lower_pose_stand.unsqueeze(dim=0).repeat(input.shape[0], 1).to(input.device)
71
- lp = torch.zeros_like(lower_pose)
72
- lp[6:9] = torch.tensor([3.0747, -0.0158, -0.0152])
73
- lp = lp.unsqueeze(dim=0).repeat(input.shape[0], 1).to(input.device)
74
- else:
75
- lp = lower_pose.unsqueeze(dim=0).repeat(input.shape[0], 1).to(input.device)
76
-
77
- input = torch.cat([input[:, :3],
78
- lp[:, :15],
79
- input[:, 3:6],
80
- lp[:, 15:21],
81
- input[:, 6:9],
82
- lp[:, 21:27],
83
- input[:, 9:12],
84
- lp[:, 27:],
85
- input[:, 12:]]
86
- , dim=1)
87
- return input
88
-
89
-
90
- def pred2poses(input, gt):
91
- input = torch.cat([input[:, :3],
92
- gt[0:1, 3:18].repeat(input.shape[0], 1),
93
- input[:, 3:6],
94
- gt[0:1, 21:27].repeat(input.shape[0], 1),
95
- input[:, 6:9],
96
- gt[0:1, 30:36].repeat(input.shape[0], 1),
97
- input[:, 9:12],
98
- gt[0:1, 39:45].repeat(input.shape[0], 1),
99
- input[:, 12:]]
100
- , dim=1)
101
- return input
102
-
103
-
104
- def poses2poses(input, gt):
105
- input = torch.cat([input[:, :3],
106
- gt[0:1, 3:18].repeat(input.shape[0], 1),
107
- input[:, 18:21],
108
- gt[0:1, 21:27].repeat(input.shape[0], 1),
109
- input[:, 27:30],
110
- gt[0:1, 30:36].repeat(input.shape[0], 1),
111
- input[:, 36:39],
112
- gt[0:1, 39:45].repeat(input.shape[0], 1),
113
- input[:, 45:]]
114
- , dim=1)
115
- return input
116
-
117
- def poses2pred(input, stand=False):
118
- if stand:
119
- lp = lower_pose_stand.unsqueeze(dim=0).repeat(input.shape[0], 1).to(input.device)
120
- # lp = torch.zeros_like(lower_pose).unsqueeze(dim=0).repeat(input.shape[0], 1).to(input.device)
121
- else:
122
- lp = lower_pose.unsqueeze(dim=0).repeat(input.shape[0], 1).to(input.device)
123
- input = torch.cat([input[:, :3],
124
- lp[:, :15],
125
- input[:, 18:21],
126
- lp[:, 15:21],
127
- input[:, 27:30],
128
- lp[:, 21:27],
129
- input[:, 36:39],
130
- lp[:, 27:],
131
- input[:, 45:]]
132
- , dim=1)
133
- return input
134
-
135
-
136
- rearrange = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]\
137
- # ,22, 23, 24, 25, 40, 26, 41,
138
- # 27, 42, 28, 43, 29, 44, 30, 45, 31, 46, 32, 47, 33, 48, 34, 49, 35, 50, 36, 51, 37, 52, 38, 53, 39, 54, 55,
139
- # 57, 56, 59, 58, 60, 63, 61, 64, 62, 65, 66, 71, 67, 72, 68, 73, 69, 74, 70, 75]
140
-
141
- symmetry = [0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1]#, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
142
- # 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
143
- # 1, 1, 1, 1, 1, 1]
 
 
data_utils/mesh_dataset.py DELETED
@@ -1,348 +0,0 @@
1
- import pickle
2
- import sys
3
- import os
4
-
5
- sys.path.append(os.getcwd())
6
-
7
- import json
8
- from glob import glob
9
- from data_utils.utils import *
10
- import torch.utils.data as data
11
- from data_utils.consts import speaker_id
12
- from data_utils.lower_body import count_part
13
- import random
14
- from data_utils.rotation_conversion import axis_angle_to_matrix, matrix_to_rotation_6d
15
-
16
- with open('data_utils/hand_component.json') as file_obj:
17
- comp = json.load(file_obj)
18
- left_hand_c = np.asarray(comp['left'])
19
- right_hand_c = np.asarray(comp['right'])
20
-
21
-
22
- def to3d(data):
23
- left_hand_pose = np.einsum('bi,ij->bj', data[:, 75:87], left_hand_c[:12, :])
24
- right_hand_pose = np.einsum('bi,ij->bj', data[:, 87:99], right_hand_c[:12, :])
25
- data = np.concatenate((data[:, :75], left_hand_pose, right_hand_pose), axis=-1)
26
- return data
27
-
28
-
29
- class SmplxDataset():
30
- '''
31
- create a dataset for every segment and concatenate.
32
- '''
33
-
34
- def __init__(self,
35
- data_root,
36
- speaker,
37
- motion_fn,
38
- audio_fn,
39
- audio_sr,
40
- fps,
41
- feat_method='mel_spec',
42
- audio_feat_dim=64,
43
- audio_feat_win_size=None,
44
-
45
- train=True,
46
- load_all=False,
47
- split_trans_zero=False,
48
- limbscaling=False,
49
- num_frames=25,
50
- num_pre_frames=25,
51
- num_generate_length=25,
52
- context_info=False,
53
- convert_to_6d=False,
54
- expression=False,
55
- config=None,
56
- am=None,
57
- am_sr=None,
58
- whole_video=False
59
- ):
60
-
61
- self.data_root = data_root
62
- self.speaker = speaker
63
-
64
- self.feat_method = feat_method
65
- self.audio_fn = audio_fn
66
- self.audio_sr = audio_sr
67
- self.fps = fps
68
- self.audio_feat_dim = audio_feat_dim
69
- self.audio_feat_win_size = audio_feat_win_size
70
- self.context_info = context_info # for aud feat
71
- self.convert_to_6d = convert_to_6d
72
- self.expression = expression
73
-
74
- self.train = train
75
- self.load_all = load_all
76
- self.split_trans_zero = split_trans_zero
77
- self.limbscaling = limbscaling
78
- self.num_frames = num_frames
79
- self.num_pre_frames = num_pre_frames
80
- self.num_generate_length = num_generate_length
81
- # print('num_generate_length ', self.num_generate_length)
82
-
83
- self.config = config
84
- self.am_sr = am_sr
85
- self.whole_video = whole_video
86
- load_mode = self.config.dataset_load_mode
87
-
88
- if load_mode == 'pickle':
89
- raise NotImplementedError
90
-
91
- elif load_mode == 'csv':
92
- import pickle
93
- with open(data_root, 'rb') as f:
94
- u = pickle._Unpickler(f)
95
- data = u.load()
96
- self.data = data[0]
97
- if self.load_all:
98
- self._load_npz_all()
99
-
100
- elif load_mode == 'json':
101
- self.annotations = glob(data_root + '/*pkl')
102
- if len(self.annotations) == 0:
103
- raise FileNotFoundError(data_root + ' are empty')
104
- self.annotations = sorted(self.annotations)
105
- self.img_name_list = self.annotations
106
-
107
- if self.load_all:
108
- self._load_them_all(am, am_sr, motion_fn)
109
-
110
- def _load_npz_all(self):
111
- self.loaded_data = {}
112
- self.complete_data = []
113
- data = self.data
114
- shape = data['body_pose_axis'].shape[0]
115
- self.betas = data['betas']
116
- self.img_name_list = []
117
- for index in range(shape):
118
- img_name = f'{index:6d}'
119
- self.img_name_list.append(img_name)
120
-
121
- jaw_pose = data['jaw_pose'][index]
122
- leye_pose = data['leye_pose'][index]
123
- reye_pose = data['reye_pose'][index]
124
- global_orient = data['global_orient'][index]
125
- body_pose = data['body_pose_axis'][index]
126
- left_hand_pose = data['left_hand_pose'][index]
127
- right_hand_pose = data['right_hand_pose'][index]
128
-
129
- full_body = np.concatenate(
130
- (jaw_pose, leye_pose, reye_pose, global_orient, body_pose, left_hand_pose, right_hand_pose))
131
- assert full_body.shape[0] == 99
132
- if self.convert_to_6d:
133
- full_body = to3d(full_body)
134
- full_body = torch.from_numpy(full_body)
135
- full_body = matrix_to_rotation_6d(axis_angle_to_matrix(full_body))
136
- full_body = np.asarray(full_body)
137
- if self.expression:
138
- expression = data['expression'][index]
139
- full_body = np.concatenate((full_body, expression))
140
- # full_body = np.concatenate((full_body, non_zero))
141
- else:
142
- full_body = to3d(full_body)
143
- if self.expression:
144
- expression = data['expression'][index]
145
- full_body = np.concatenate((full_body, expression))
146
-
147
- self.loaded_data[img_name] = full_body.reshape(-1)
148
- self.complete_data.append(full_body.reshape(-1))
149
-
150
- self.complete_data = np.array(self.complete_data)
151
-
152
- if self.audio_feat_win_size is not None:
153
- self.audio_feat = get_mfcc_old(self.audio_fn).transpose(1, 0)
154
- # print(self.audio_feat.shape)
155
- else:
156
- if self.feat_method == 'mel_spec':
157
- self.audio_feat = get_melspec(self.audio_fn, fps=self.fps, sr=self.audio_sr, n_mels=self.audio_feat_dim)
158
- elif self.feat_method == 'mfcc':
159
- self.audio_feat = get_mfcc(self.audio_fn,
160
- smlpx=True,
161
- sr=self.audio_sr,
162
- n_mfcc=self.audio_feat_dim,
163
- win_size=self.audio_feat_win_size
164
- )
165
-
166
- def _load_them_all(self, am, am_sr, motion_fn):
167
- self.loaded_data = {}
168
- self.complete_data = []
169
- f = open(motion_fn, 'rb+')
170
- data = pickle.load(f)
171
-
172
- self.betas = np.array(data['betas'])
173
-
174
- jaw_pose = np.array(data['jaw_pose'])
175
- leye_pose = np.array(data['leye_pose'])
176
- reye_pose = np.array(data['reye_pose'])
177
- global_orient = np.array(data['global_orient']).squeeze()
178
- body_pose = np.array(data['body_pose_axis'])
179
- left_hand_pose = np.array(data['left_hand_pose'])
180
- right_hand_pose = np.array(data['right_hand_pose'])
181
-
182
- full_body = np.concatenate(
183
- (jaw_pose, leye_pose, reye_pose, global_orient, body_pose, left_hand_pose, right_hand_pose), axis=1)
184
- assert full_body.shape[1] == 99
185
-
186
-
187
- if self.convert_to_6d:
188
- full_body = to3d(full_body)
189
- full_body = torch.from_numpy(full_body)
190
- full_body = matrix_to_rotation_6d(axis_angle_to_matrix(full_body.reshape(-1, 55, 3))).reshape(-1, 330)
191
- full_body = np.asarray(full_body)
192
- if self.expression:
193
- expression = np.array(data['expression'])
194
- full_body = np.concatenate((full_body, expression), axis=1)
195
-
196
- else:
197
- full_body = to3d(full_body)
198
- expression = np.array(data['expression'])
199
- full_body = np.concatenate((full_body, expression), axis=1)
200
-
201
- self.complete_data = full_body
202
- self.complete_data = np.array(self.complete_data)
203
-
204
- if self.audio_feat_win_size is not None:
205
- self.audio_feat = get_mfcc_old(self.audio_fn).transpose(1, 0)
206
- else:
207
- # if self.feat_method == 'mel_spec':
208
- # self.audio_feat = get_melspec(self.audio_fn, fps=self.fps, sr=self.audio_sr, n_mels=self.audio_feat_dim)
209
- # elif self.feat_method == 'mfcc':
210
- self.audio_feat = get_mfcc_ta(self.audio_fn,
211
- smlpx=True,
212
- fps=30,
213
- sr=self.audio_sr,
214
- n_mfcc=self.audio_feat_dim,
215
- win_size=self.audio_feat_win_size,
216
- type=self.feat_method,
217
- am=am,
218
- am_sr=am_sr,
219
- encoder_choice=self.config.Model.encoder_choice,
220
- )
221
- # with open(audio_file, 'w', encoding='utf-8') as file:
222
- # file.write(json.dumps(self.audio_feat.__array__().tolist(), indent=0, ensure_ascii=False))
223
-
224
- def get_dataset(self, normalization=False, normalize_stats=None, split='train'):
225
-
226
- class __Worker__(data.Dataset):
227
- def __init__(child, index_list, normalization, normalize_stats, split='train') -> None:
228
- super().__init__()
229
- child.index_list = index_list
230
- child.normalization = normalization
231
- child.normalize_stats = normalize_stats
232
- child.split = split
233
-
234
- def __getitem__(child, index):
235
- num_generate_length = self.num_generate_length
236
- num_pre_frames = self.num_pre_frames
237
- seq_len = num_generate_length + num_pre_frames
238
- # print(num_generate_length)
239
-
240
- index = child.index_list[index]
241
- index_new = index + random.randrange(0, 5, 3)
242
- if index_new + seq_len > self.complete_data.shape[0]:
243
- index_new = index
244
- index = index_new
245
-
246
- if child.split in ['val', 'pre', 'test'] or self.whole_video:
247
- index = 0
248
- seq_len = self.complete_data.shape[0]
249
- seq_data = []
250
- assert index + seq_len <= self.complete_data.shape[0]
251
- # print(seq_len)
252
- seq_data = self.complete_data[index:(index + seq_len), :]
253
- seq_data = np.array(seq_data)
254
-
255
- '''
256
- audio feature,
257
- '''
258
- if not self.context_info:
259
- if not self.whole_video:
260
- audio_feat = self.audio_feat[index:index + seq_len, ...]
261
- if audio_feat.shape[0] < seq_len:
262
- audio_feat = np.pad(audio_feat, [[0, seq_len - audio_feat.shape[0]], [0, 0]],
263
- mode='reflect')
264
-
265
- assert audio_feat.shape[0] == seq_len and audio_feat.shape[1] == self.audio_feat_dim
266
- else:
267
- audio_feat = self.audio_feat
268
-
269
- else: # including feature and history
270
- if self.audio_feat_win_size is None:
271
- audio_feat = self.audio_feat[index:index + seq_len + num_pre_frames, ...]
272
- if audio_feat.shape[0] < seq_len + num_pre_frames:
273
- audio_feat = np.pad(audio_feat,
274
- [[0, seq_len + self.num_frames - audio_feat.shape[0]], [0, 0]],
275
- mode='constant')
276
-
277
- assert audio_feat.shape[0] == self.num_frames + seq_len and audio_feat.shape[
278
- 1] == self.audio_feat_dim
279
-
280
- if child.normalization:
281
- data_mean = child.normalize_stats['mean'].reshape(1, -1)
282
- data_std = child.normalize_stats['std'].reshape(1, -1)
283
- seq_data[:, :330] = (seq_data[:, :330] - data_mean) / data_std
284
- if child.split in['train', 'test']:
285
- if self.convert_to_6d:
286
- if self.expression:
287
- data_sample = {
288
- 'poses': seq_data[:, :330].astype(np.float).transpose(1, 0),
289
- 'expression': seq_data[:, 330:].astype(np.float).transpose(1, 0),
290
- # 'nzero': seq_data[:, 375:].astype(np.float).transpose(1, 0),
291
- 'aud_feat': audio_feat.astype(np.float).transpose(1, 0),
292
- 'speaker': speaker_id[self.speaker],
293
- 'betas': self.betas,
294
- 'aud_file': self.audio_fn,
295
- }
296
- else:
297
- data_sample = {
298
- 'poses': seq_data[:, :330].astype(np.float).transpose(1, 0),
299
- 'nzero': seq_data[:, 330:].astype(np.float).transpose(1, 0),
300
- 'aud_feat': audio_feat.astype(np.float).transpose(1, 0),
301
- 'speaker': speaker_id[self.speaker],
302
- 'betas': self.betas
303
- }
304
- else:
305
- if self.expression:
306
- data_sample = {
307
- 'poses': seq_data[:, :165].astype(np.float).transpose(1, 0),
308
- 'expression': seq_data[:, 165:].astype(np.float).transpose(1, 0),
309
- 'aud_feat': audio_feat.astype(np.float).transpose(1, 0),
310
- # 'wv2_feat': wv2_feat.astype(np.float).transpose(1, 0),
311
- 'speaker': speaker_id[self.speaker],
312
- 'aud_file': self.audio_fn,
313
- 'betas': self.betas
314
- }
315
- else:
316
- data_sample = {
317
- 'poses': seq_data.astype(np.float).transpose(1, 0),
318
- 'aud_feat': audio_feat.astype(np.float).transpose(1, 0),
319
- 'speaker': speaker_id[self.speaker],
320
- 'betas': self.betas
321
- }
322
- return data_sample
323
- else:
324
- data_sample = {
325
- 'poses': seq_data[:, :330].astype(np.float).transpose(1, 0),
326
- 'expression': seq_data[:, 330:].astype(np.float).transpose(1, 0),
327
- # 'nzero': seq_data[:, 325:].astype(np.float).transpose(1, 0),
328
- 'aud_feat': audio_feat.astype(np.float).transpose(1, 0),
329
- 'aud_file': self.audio_fn,
330
- 'speaker': speaker_id[self.speaker],
331
- 'betas': self.betas
332
- }
333
- return data_sample
334
- def __len__(child):
335
- return len(child.index_list)
336
-
337
- if split == 'train':
338
- index_list = list(
339
- range(0, min(self.complete_data.shape[0], self.audio_feat.shape[0]) - self.num_generate_length - self.num_pre_frames,
340
- 6))
341
- elif split in ['val', 'test']:
342
- index_list = list([0])
343
- if self.whole_video:
344
- index_list = list([0])
345
- self.all_dataset = __Worker__(index_list, normalization, normalize_stats, split)
346
-
347
- def __len__(self):
348
- return len(self.img_name_list)
 
 
data_utils/rotation_conversion.py DELETED
@@ -1,551 +0,0 @@
1
- # Copyright (c) Facebook, Inc. and its affiliates. All rights reserved.
2
- # Check PYTORCH3D_LICENCE before use
3
-
4
- import functools
5
- from typing import Optional
6
-
7
- import torch
8
- import torch.nn.functional as F
9
-
10
-
11
- """
12
- The transformation matrices returned from the functions in this file assume
13
- the points on which the transformation will be applied are column vectors.
14
- i.e. the R matrix is structured as
15
-
16
- R = [
17
- [Rxx, Rxy, Rxz],
18
- [Ryx, Ryy, Ryz],
19
- [Rzx, Rzy, Rzz],
20
- ] # (3, 3)
21
-
22
- This matrix can be applied to column vectors by post multiplication
23
- by the points e.g.
24
-
25
- points = [[0], [1], [2]] # (3 x 1) xyz coordinates of a point
26
- transformed_points = R * points
27
-
28
- To apply the same matrix to points which are row vectors, the R matrix
29
- can be transposed and pre multiplied by the points:
30
-
31
- e.g.
32
- points = [[0, 1, 2]] # (1 x 3) xyz coordinates of a point
33
- transformed_points = points * R.transpose(1, 0)
34
- """
35
-
36
-
37
- def quaternion_to_matrix(quaternions):
38
- """
39
- Convert rotations given as quaternions to rotation matrices.
40
-
41
- Args:
42
- quaternions: quaternions with real part first,
43
- as tensor of shape (..., 4).
44
-
45
- Returns:
46
- Rotation matrices as tensor of shape (..., 3, 3).
47
- """
48
- r, i, j, k = torch.unbind(quaternions, -1)
49
- two_s = 2.0 / (quaternions * quaternions).sum(-1)
50
-
51
- o = torch.stack(
52
- (
53
- 1 - two_s * (j * j + k * k),
54
- two_s * (i * j - k * r),
55
- two_s * (i * k + j * r),
56
- two_s * (i * j + k * r),
57
- 1 - two_s * (i * i + k * k),
58
- two_s * (j * k - i * r),
59
- two_s * (i * k - j * r),
60
- two_s * (j * k + i * r),
61
- 1 - two_s * (i * i + j * j),
62
- ),
63
- -1,
64
- )
65
- return o.reshape(quaternions.shape[:-1] + (3, 3))
66
-
67
-
68
- def _copysign(a, b):
69
- """
70
- Return a tensor where each element has the absolute value taken from the,
71
- corresponding element of a, with sign taken from the corresponding
72
- element of b. This is like the standard copysign floating-point operation,
73
- but is not careful about negative 0 and NaN.
74
-
75
- Args:
76
- a: source tensor.
77
- b: tensor whose signs will be used, of the same shape as a.
78
-
79
- Returns:
80
- Tensor of the same shape as a with the signs of b.
81
- """
82
- signs_differ = (a < 0) != (b < 0)
83
- return torch.where(signs_differ, -a, a)
84
-
85
-
86
- def _sqrt_positive_part(x):
87
- """
88
- Returns torch.sqrt(torch.max(0, x))
89
- but with a zero subgradient where x is 0.
90
- """
91
- ret = torch.zeros_like(x)
92
- positive_mask = x > 0
93
- ret[positive_mask] = torch.sqrt(x[positive_mask])
94
- return ret
95
-
96
-
97
- def matrix_to_quaternion(matrix):
98
- """
99
- Convert rotations given as rotation matrices to quaternions.
100
-
101
- Args:
102
- matrix: Rotation matrices as tensor of shape (..., 3, 3).
103
-
104
- Returns:
105
- quaternions with real part first, as tensor of shape (..., 4).
106
- """
107
- if matrix.size(-1) != 3 or matrix.size(-2) != 3:
108
- raise ValueError(f"Invalid rotation matrix shape f{matrix.shape}.")
109
- m00 = matrix[..., 0, 0]
110
- m11 = matrix[..., 1, 1]
111
- m22 = matrix[..., 2, 2]
112
- o0 = 0.5 * _sqrt_positive_part(1 + m00 + m11 + m22)
113
- x = 0.5 * _sqrt_positive_part(1 + m00 - m11 - m22)
114
- y = 0.5 * _sqrt_positive_part(1 - m00 + m11 - m22)
115
- z = 0.5 * _sqrt_positive_part(1 - m00 - m11 + m22)
116
- o1 = _copysign(x, matrix[..., 2, 1] - matrix[..., 1, 2])
117
- o2 = _copysign(y, matrix[..., 0, 2] - matrix[..., 2, 0])
118
- o3 = _copysign(z, matrix[..., 1, 0] - matrix[..., 0, 1])
119
- return torch.stack((o0, o1, o2, o3), -1)
120
-
121
-
122
- def _axis_angle_rotation(axis: str, angle):
123
- """
124
- Return the rotation matrices for one of the rotations about an axis
125
- of which Euler angles describe, for each value of the angle given.
126
-
127
- Args:
128
- axis: Axis label "X" or "Y or "Z".
129
- angle: any shape tensor of Euler angles in radians
130
-
131
- Returns:
132
- Rotation matrices as tensor of shape (..., 3, 3).
133
- """
134
-
135
- cos = torch.cos(angle)
136
- sin = torch.sin(angle)
137
- one = torch.ones_like(angle)
138
- zero = torch.zeros_like(angle)
139
-
140
- if axis == "X":
141
- R_flat = (one, zero, zero, zero, cos, -sin, zero, sin, cos)
142
- if axis == "Y":
143
- R_flat = (cos, zero, sin, zero, one, zero, -sin, zero, cos)
144
- if axis == "Z":
145
- R_flat = (cos, -sin, zero, sin, cos, zero, zero, zero, one)
146
-
147
- return torch.stack(R_flat, -1).reshape(angle.shape + (3, 3))
148
-
149
-
150
- def euler_angles_to_matrix(euler_angles, convention: str):
151
- """
152
- Convert rotations given as Euler angles in radians to rotation matrices.
153
-
154
- Args:
155
- euler_angles: Euler angles in radians as tensor of shape (..., 3).
156
- convention: Convention string of three uppercase letters from
157
- {"X", "Y", and "Z"}.
158
-
159
- Returns:
160
- Rotation matrices as tensor of shape (..., 3, 3).
161
- """
162
- if euler_angles.dim() == 0 or euler_angles.shape[-1] != 3:
163
- raise ValueError("Invalid input euler angles.")
164
- if len(convention) != 3:
165
- raise ValueError("Convention must have 3 letters.")
166
- if convention[1] in (convention[0], convention[2]):
167
- raise ValueError(f"Invalid convention {convention}.")
168
- for letter in convention:
169
- if letter not in ("X", "Y", "Z"):
170
- raise ValueError(f"Invalid letter {letter} in convention string.")
171
- matrices = map(_axis_angle_rotation, convention, torch.unbind(euler_angles, -1))
172
- return functools.reduce(torch.matmul, matrices)
173
-
174
-
175
- def _angle_from_tan(
176
- axis: str, other_axis: str, data, horizontal: bool, tait_bryan: bool
177
- ):
178
- """
179
- Extract the first or third Euler angle from the two members of
180
- the matrix which are positive constant times its sine and cosine.
181
-
182
- Args:
183
- axis: Axis label "X" or "Y or "Z" for the angle we are finding.
184
- other_axis: Axis label "X" or "Y or "Z" for the middle axis in the
185
- convention.
186
- data: Rotation matrices as tensor of shape (..., 3, 3).
187
- horizontal: Whether we are looking for the angle for the third axis,
188
- which means the relevant entries are in the same row of the
189
- rotation matrix. If not, they are in the same column.
190
- tait_bryan: Whether the first and third axes in the convention differ.
191
-
192
- Returns:
193
- Euler Angles in radians for each matrix in data as a tensor
194
- of shape (...).
195
- """
196
-
197
- i1, i2 = {"X": (2, 1), "Y": (0, 2), "Z": (1, 0)}[axis]
198
- if horizontal:
199
- i2, i1 = i1, i2
200
- even = (axis + other_axis) in ["XY", "YZ", "ZX"]
201
- if horizontal == even:
202
- return torch.atan2(data[..., i1], data[..., i2])
203
- if tait_bryan:
204
- return torch.atan2(-data[..., i2], data[..., i1])
205
- return torch.atan2(data[..., i2], -data[..., i1])
206
-
207
-
208
- def _index_from_letter(letter: str):
209
- if letter == "X":
210
- return 0
211
- if letter == "Y":
212
- return 1
213
- if letter == "Z":
214
- return 2
215
-
216
-
217
- def matrix_to_euler_angles(matrix, convention: str):
218
- """
219
- Convert rotations given as rotation matrices to Euler angles in radians.
220
-
221
- Args:
222
- matrix: Rotation matrices as tensor of shape (..., 3, 3).
223
- convention: Convention string of three uppercase letters.
224
-
225
- Returns:
226
- Euler angles in radians as tensor of shape (..., 3).
227
- """
228
- if len(convention) != 3:
229
- raise ValueError("Convention must have 3 letters.")
230
- if convention[1] in (convention[0], convention[2]):
231
- raise ValueError(f"Invalid convention {convention}.")
232
- for letter in convention:
233
- if letter not in ("X", "Y", "Z"):
234
- raise ValueError(f"Invalid letter {letter} in convention string.")
235
- if matrix.size(-1) != 3 or matrix.size(-2) != 3:
236
- raise ValueError(f"Invalid rotation matrix shape f{matrix.shape}.")
237
- i0 = _index_from_letter(convention[0])
238
- i2 = _index_from_letter(convention[2])
239
- tait_bryan = i0 != i2
240
- if tait_bryan:
241
- central_angle = torch.asin(
242
- matrix[..., i0, i2] * (-1.0 if i0 - i2 in [-1, 2] else 1.0)
243
- )
244
- else:
245
- central_angle = torch.acos(matrix[..., i0, i0])
246
-
247
- o = (
248
- _angle_from_tan(
249
- convention[0], convention[1], matrix[..., i2], False, tait_bryan
250
- ),
251
- central_angle,
252
- _angle_from_tan(
253
- convention[2], convention[1], matrix[..., i0, :], True, tait_bryan
254
- ),
255
- )
256
- return torch.stack(o, -1)
257
-
258
-
259
- def random_quaternions(
260
- n: int, dtype: Optional[torch.dtype] = None, device=None, requires_grad=False
261
- ):
262
- """
263
- Generate random quaternions representing rotations,
264
- i.e. versors with nonnegative real part.
265
-
266
- Args:
267
- n: Number of quaternions in a batch to return.
268
- dtype: Type to return.
269
- device: Desired device of returned tensor. Default:
270
- uses the current device for the default tensor type.
271
- requires_grad: Whether the resulting tensor should have the gradient
272
- flag set.
273
-
274
- Returns:
275
- Quaternions as tensor of shape (N, 4).
276
- """
277
- o = torch.randn((n, 4), dtype=dtype, device=device, requires_grad=requires_grad)
278
- s = (o * o).sum(1)
279
- o = o / _copysign(torch.sqrt(s), o[:, 0])[:, None]
280
- return o
281
-
282
-
283
- def random_rotations(
284
- n: int, dtype: Optional[torch.dtype] = None, device=None, requires_grad=False
285
- ):
286
- """
287
- Generate random rotations as 3x3 rotation matrices.
288
-
289
- Args:
290
- n: Number of rotation matrices in a batch to return.
291
- dtype: Type to return.
292
- device: Device of returned tensor. Default: if None,
293
- uses the current device for the default tensor type.
294
- requires_grad: Whether the resulting tensor should have the gradient
295
- flag set.
296
-
297
- Returns:
298
- Rotation matrices as tensor of shape (n, 3, 3).
299
- """
300
- quaternions = random_quaternions(
301
- n, dtype=dtype, device=device, requires_grad=requires_grad
302
- )
303
- return quaternion_to_matrix(quaternions)
304
-
305
-
306
- def random_rotation(
307
- dtype: Optional[torch.dtype] = None, device=None, requires_grad=False
308
- ):
309
- """
310
- Generate a single random 3x3 rotation matrix.
311
-
312
- Args:
313
- dtype: Type to return
314
- device: Device of returned tensor. Default: if None,
315
- uses the current device for the default tensor type
316
- requires_grad: Whether the resulting tensor should have the gradient
317
- flag set
318
-
319
- Returns:
320
- Rotation matrix as tensor of shape (3, 3).
321
- """
322
- return random_rotations(1, dtype, device, requires_grad)[0]
323
-
324
-
325
- def standardize_quaternion(quaternions):
326
- """
327
- Convert a unit quaternion to a standard form: one in which the real
328
- part is non negative.
329
-
330
- Args:
331
- quaternions: Quaternions with real part first,
332
- as tensor of shape (..., 4).
333
-
334
- Returns:
335
- Standardized quaternions as tensor of shape (..., 4).
336
- """
337
- return torch.where(quaternions[..., 0:1] < 0, -quaternions, quaternions)
338
-
339
-
340
- def quaternion_raw_multiply(a, b):
341
- """
342
- Multiply two quaternions.
343
- Usual torch rules for broadcasting apply.
344
-
345
- Args:
346
- a: Quaternions as tensor of shape (..., 4), real part first.
347
- b: Quaternions as tensor of shape (..., 4), real part first.
348
-
349
- Returns:
350
- The product of a and b, a tensor of quaternions shape (..., 4).
351
- """
352
- aw, ax, ay, az = torch.unbind(a, -1)
353
- bw, bx, by, bz = torch.unbind(b, -1)
354
- ow = aw * bw - ax * bx - ay * by - az * bz
355
- ox = aw * bx + ax * bw + ay * bz - az * by
356
- oy = aw * by - ax * bz + ay * bw + az * bx
357
- oz = aw * bz + ax * by - ay * bx + az * bw
358
- return torch.stack((ow, ox, oy, oz), -1)
359
-
360
-
361
- def quaternion_multiply(a, b):
362
- """
363
- Multiply two quaternions representing rotations, returning the quaternion
364
- representing their composition, i.e. the versor with nonnegative real part.
365
- Usual torch rules for broadcasting apply.
366
-
367
- Args:
368
- a: Quaternions as tensor of shape (..., 4), real part first.
369
- b: Quaternions as tensor of shape (..., 4), real part first.
370
-
371
- Returns:
372
- The product of a and b, a tensor of quaternions of shape (..., 4).
373
- """
374
- ab = quaternion_raw_multiply(a, b)
375
- return standardize_quaternion(ab)
376
-
377
-
378
- def quaternion_invert(quaternion):
379
- """
380
- Given a quaternion representing rotation, get the quaternion representing
381
- its inverse.
382
-
383
- Args:
384
- quaternion: Quaternions as tensor of shape (..., 4), with real part
385
- first, which must be versors (unit quaternions).
386
-
387
- Returns:
388
- The inverse, a tensor of quaternions of shape (..., 4).
389
- """
390
-
391
- return quaternion * quaternion.new_tensor([1, -1, -1, -1])
392
-
393
-
394
- def quaternion_apply(quaternion, point):
395
- """
396
- Apply the rotation given by a quaternion to a 3D point.
397
- Usual torch rules for broadcasting apply.
398
-
399
- Args:
400
- quaternion: Tensor of quaternions, real part first, of shape (..., 4).
401
- point: Tensor of 3D points of shape (..., 3).
402
-
403
- Returns:
404
- Tensor of rotated points of shape (..., 3).
405
- """
406
- if point.size(-1) != 3:
407
- raise ValueError(f"Points are not in 3D, f{point.shape}.")
408
- real_parts = point.new_zeros(point.shape[:-1] + (1,))
409
- point_as_quaternion = torch.cat((real_parts, point), -1)
410
- out = quaternion_raw_multiply(
411
- quaternion_raw_multiply(quaternion, point_as_quaternion),
412
- quaternion_invert(quaternion),
413
- )
414
- return out[..., 1:]
415
-
416
-
417
- def axis_angle_to_matrix(axis_angle):
418
- """
419
- Convert rotations given as axis/angle to rotation matrices.
420
-
421
- Args:
422
- axis_angle: Rotations given as a vector in axis angle form,
423
- as a tensor of shape (..., 3), where the magnitude is
424
- the angle turned anticlockwise in radians around the
425
- vector's direction.
426
-
427
- Returns:
428
- Rotation matrices as tensor of shape (..., 3, 3).
429
- """
430
- return quaternion_to_matrix(axis_angle_to_quaternion(axis_angle))
431
-
432
-
433
- def matrix_to_axis_angle(matrix):
434
- """
435
- Convert rotations given as rotation matrices to axis/angle.
436
-
437
- Args:
438
- matrix: Rotation matrices as tensor of shape (..., 3, 3).
439
-
440
- Returns:
441
- Rotations given as a vector in axis angle form, as a tensor
442
- of shape (..., 3), where the magnitude is the angle
443
- turned anticlockwise in radians around the vector's
444
- direction.
445
- """
446
- return quaternion_to_axis_angle(matrix_to_quaternion(matrix))
447
-
448
-
449
- def axis_angle_to_quaternion(axis_angle):
450
- """
451
- Convert rotations given as axis/angle to quaternions.
452
-
453
- Args:
454
- axis_angle: Rotations given as a vector in axis angle form,
455
- as a tensor of shape (..., 3), where the magnitude is
456
- the angle turned anticlockwise in radians around the
457
- vector's direction.
458
-
459
- Returns:
460
- quaternions with real part first, as tensor of shape (..., 4).
461
- """
462
- angles = torch.norm(axis_angle, p=2, dim=-1, keepdim=True)
463
- half_angles = 0.5 * angles
464
- eps = 1e-6
465
- small_angles = angles.abs() < eps
466
- sin_half_angles_over_angles = torch.empty_like(angles)
467
- sin_half_angles_over_angles[~small_angles] = (
468
- torch.sin(half_angles[~small_angles]) / angles[~small_angles]
469
- )
470
- # for x small, sin(x/2) is about x/2 - (x/2)^3/6
471
- # so sin(x/2)/x is about 1/2 - (x*x)/48
472
- sin_half_angles_over_angles[small_angles] = (
473
- 0.5 - (angles[small_angles] * angles[small_angles]) / 48
474
- )
475
- quaternions = torch.cat(
476
- [torch.cos(half_angles), axis_angle * sin_half_angles_over_angles], dim=-1
477
- )
478
- return quaternions
479
-
480
-
481
- def quaternion_to_axis_angle(quaternions):
482
- """
483
- Convert rotations given as quaternions to axis/angle.
484
-
485
- Args:
486
- quaternions: quaternions with real part first,
487
- as tensor of shape (..., 4).
488
-
489
- Returns:
490
- Rotations given as a vector in axis angle form, as a tensor
491
- of shape (..., 3), where the magnitude is the angle
492
- turned anticlockwise in radians around the vector's
493
- direction.
494
- """
495
- norms = torch.norm(quaternions[..., 1:], p=2, dim=-1, keepdim=True)
496
- half_angles = torch.atan2(norms, quaternions[..., :1])
497
- angles = 2 * half_angles
498
- eps = 1e-6
499
- small_angles = angles.abs() < eps
500
- sin_half_angles_over_angles = torch.empty_like(angles)
501
- sin_half_angles_over_angles[~small_angles] = (
502
- torch.sin(half_angles[~small_angles]) / angles[~small_angles]
503
- )
504
- # for x small, sin(x/2) is about x/2 - (x/2)^3/6
505
- # so sin(x/2)/x is about 1/2 - (x*x)/48
506
- sin_half_angles_over_angles[small_angles] = (
507
- 0.5 - (angles[small_angles] * angles[small_angles]) / 48
508
- )
509
- return quaternions[..., 1:] / sin_half_angles_over_angles
510
-
511
-
512
- def rotation_6d_to_matrix(d6: torch.Tensor) -> torch.Tensor:
513
- """
514
- Converts 6D rotation representation by Zhou et al. [1] to rotation matrix
515
- using Gram--Schmidt orthogonalisation per Section B of [1].
516
- Args:
517
- d6: 6D rotation representation, of size (*, 6)
518
-
519
- Returns:
520
- batch of rotation matrices of size (*, 3, 3)
521
-
522
- [1] Zhou, Y., Barnes, C., Lu, J., Yang, J., & Li, H.
523
- On the Continuity of Rotation Representations in Neural Networks.
524
- IEEE Conference on Computer Vision and Pattern Recognition, 2019.
525
- Retrieved from http://arxiv.org/abs/1812.07035
526
- """
527
-
528
- a1, a2 = d6[..., :3], d6[..., 3:]
529
- b1 = F.normalize(a1, dim=-1)
530
- b2 = a2 - (b1 * a2).sum(-1, keepdim=True) * b1
531
- b2 = F.normalize(b2, dim=-1)
532
- b3 = torch.cross(b1, b2, dim=-1)
533
- return torch.stack((b1, b2, b3), dim=-2)
534
-
535
-
536
- def matrix_to_rotation_6d(matrix: torch.Tensor) -> torch.Tensor:
537
- """
538
- Converts rotation matrices to 6D rotation representation by Zhou et al. [1]
539
- by dropping the last row. Note that 6D representation is not unique.
540
- Args:
541
- matrix: batch of rotation matrices of size (*, 3, 3)
542
-
543
- Returns:
544
- 6D rotation representation, of size (*, 6)
545
-
546
- [1] Zhou, Y., Barnes, C., Lu, J., Yang, J., & Li, H.
547
- On the Continuity of Rotation Representations in Neural Networks.
548
- IEEE Conference on Computer Vision and Pattern Recognition, 2019.
549
- Retrieved from http://arxiv.org/abs/1812.07035
550
- """
551
- return matrix[..., :2, :].clone().reshape(*matrix.size()[:-2], 6)
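
A short usage sketch, not part of the deleted file, of the converters above (taken from PyTorch3D): they move between axis-angle rotations, rotation matrices, quaternions and the continuous 6D representation of Zhou et al. The batch size and joint count below are illustrative, and the import path refers to the module as it existed before this deletion.

```python
# Round-trip check for the rotation helpers above (illustrative shapes).
import torch
from data_utils.rotation_conversion import (
    axis_angle_to_matrix, matrix_to_axis_angle,
    matrix_to_rotation_6d, rotation_6d_to_matrix)

aa = 0.3 * torch.randn(8, 55, 3)              # batch of 55 joints in axis-angle
rot = axis_angle_to_matrix(aa)                # (8, 55, 3, 3)
d6 = matrix_to_rotation_6d(rot)               # (8, 55, 6) continuous representation
rot_back = rotation_6d_to_matrix(d6)          # Gram-Schmidt back to matrices
aa_back = matrix_to_axis_angle(rot_back)      # back to axis-angle
print(torch.allclose(rot, rot_back, atol=1e-5))                       # True
print(torch.allclose(axis_angle_to_matrix(aa_back), rot, atol=1e-4))  # True
```
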
data_utils/split_train_val_test.py DELETED
@@ -1,27 +0,0 @@
1
- import os
2
- import json
3
- import shutil
4
-
5
- if __name__ =='__main__':
6
- id_list = "chemistry conan oliver seth"
7
- id_list = id_list.split(' ')
8
-
9
- old_root = '/home/usename/talkshow_data/ExpressiveWholeBodyDatasetReleaseV1.0'
10
- new_root = '/home/usename/talkshow_data/ExpressiveWholeBodyDatasetReleaseV1.0/talkshow_data_splited'
11
-
12
- with open('train_val_test.json') as f:
13
- split_info = json.load(f)
14
- phase_list = ['train', 'val', 'test']
15
- for phase in phase_list:
16
- phase_path_list = split_info[phase]
17
- for p in phase_path_list:
18
- old_path = os.path.join(old_root, p)
19
- if not os.path.exists(old_path):
20
- print(f'{old_path} not found, continue' )
21
- continue
22
- new_path = os.path.join(new_root, phase, p)
23
- dir_name = os.path.dirname(new_path)
24
- if not os.path.isdir(dir_name):
25
- os.makedirs(dir_name, exist_ok=True)
26
- shutil.move(old_path, new_path)
27
-
data_utils/train_val_test.json DELETED
The diff for this file is too large to render. See raw diff
 
data_utils/utils.py DELETED
@@ -1,318 +0,0 @@
1
- import numpy as np
2
- # import librosa #has to do this cause librosa is not supported on my server
3
- import python_speech_features
4
- from scipy.io import wavfile
5
- from scipy import signal
6
- import librosa
7
- import torch
8
- import torchaudio as ta
9
- import torchaudio.functional as ta_F
10
- import torchaudio.transforms as ta_T
11
- # import pyloudnorm as pyln
12
-
13
-
14
- def load_wav_old(audio_fn, sr = 16000):
15
- sample_rate, sig = wavfile.read(audio_fn)
16
- if sample_rate != sr:
17
- result = int((sig.shape[0]) / sample_rate * sr)
18
- x_resampled = signal.resample(sig, result)
19
- x_resampled = x_resampled.astype(np.float64)
20
- return x_resampled, sr
21
-
22
- sig = sig / (2**15)
23
- return sig, sample_rate
24
-
25
-
26
- def get_mfcc(audio_fn, eps=1e-6, fps=25, smlpx=False, sr=16000, n_mfcc=64, win_size=None):
27
-
28
- y, sr = librosa.load(audio_fn, sr=sr, mono=True)
29
-
30
- if win_size is None:
31
- hop_len=int(sr / fps)
32
- else:
33
- hop_len=int(sr / win_size)
34
-
35
- n_fft=2048
36
-
37
- C = librosa.feature.mfcc(
38
- y = y,
39
- sr = sr,
40
- n_mfcc = n_mfcc,
41
- hop_length = hop_len,
42
- n_fft = n_fft
43
- )
44
-
45
- if C.shape[0] == n_mfcc:
46
- C = C.transpose(1, 0)
47
-
48
- return C
49
-
50
-
51
- def get_melspec(audio_fn, eps=1e-6, fps = 25, sr=16000, n_mels=64):
52
- raise NotImplementedError
53
- '''
54
- # y, sr = load_wav(audio_fn=audio_fn, sr=sr)
55
-
56
- # hop_len = int(sr / fps)
57
- # n_fft = 2048
58
-
59
- # C = librosa.feature.melspectrogram(
60
- # y = y,
61
- # sr = sr,
62
- # n_fft=n_fft,
63
- # hop_length=hop_len,
64
- # n_mels = n_mels,
65
- # fmin=0,
66
- # fmax=8000)
67
-
68
-
69
- # mask = (C == 0).astype(np.float)
70
- # C = mask * eps + (1-mask) * C
71
-
72
- # C = np.log(C)
73
- # #wierd error may occur here
74
- # assert not (np.isnan(C).any()), audio_fn
75
- # if C.shape[0] == n_mels:
76
- # C = C.transpose(1, 0)
77
-
78
- # return C
79
- '''
80
-
81
- def extract_mfcc(audio,sample_rate=16000):
82
- mfcc = zip(*python_speech_features.mfcc(audio,sample_rate, numcep=64, nfilt=64, nfft=2048, winstep=0.04))
83
- mfcc = np.stack([np.array(i) for i in mfcc])
84
- return mfcc
85
-
86
- def get_mfcc_psf(audio_fn, eps=1e-6, fps=25, smlpx=False, sr=16000, n_mfcc=64, win_size=None):
87
- y, sr = load_wav_old(audio_fn, sr=sr)
88
-
89
- if y.shape.__len__() > 1:
90
- y = (y[:,0]+y[:,1])/2
91
-
92
- if win_size is None:
93
- hop_len=int(sr / fps)
94
- else:
95
- hop_len=int(sr/ win_size)
96
-
97
- n_fft=2048
98
-
99
- #hard coded for 25 fps
100
- if not smlpx:
101
- C = python_speech_features.mfcc(y, sr, numcep=n_mfcc, nfilt=n_mfcc, nfft=n_fft, winstep=0.04)
102
- else:
103
- C = python_speech_features.mfcc(y, sr, numcep=n_mfcc, nfilt=n_mfcc, nfft=n_fft, winstep=1.01/15)
104
- # if C.shape[0] == n_mfcc:
105
- # C = C.transpose(1, 0)
106
-
107
- return C
108
-
109
-
110
- def get_mfcc_psf_min(audio_fn, eps=1e-6, fps=25, smlpx=False, sr=16000, n_mfcc=64, win_size=None):
111
- y, sr = load_wav_old(audio_fn, sr=sr)
112
-
113
- if y.shape.__len__() > 1:
114
- y = (y[:, 0] + y[:, 1]) / 2
115
- n_fft = 2048
116
-
117
- slice_len = 22000 * 5
118
- slice = y.size // slice_len
119
-
120
- C = []
121
-
122
- for i in range(slice):
123
- if i != (slice - 1):
124
- feat = python_speech_features.mfcc(y[i*slice_len:(i+1)*slice_len], sr, numcep=n_mfcc, nfilt=n_mfcc, nfft=n_fft, winstep=1.01 / 15)
125
- else:
126
- feat = python_speech_features.mfcc(y[i * slice_len:], sr, numcep=n_mfcc, nfilt=n_mfcc, nfft=n_fft, winstep=1.01 / 15)
127
-
128
- C.append(feat)
129
-
130
- return C
131
-
132
-
133
- def audio_chunking(audio: torch.Tensor, frame_rate: int = 30, chunk_size: int = 16000):
134
- """
135
- :param audio: 1 x T tensor containing a 16kHz audio signal
136
- :param frame_rate: frame rate for video (we need one audio chunk per video frame)
137
- :param chunk_size: number of audio samples per chunk
138
- :return: num_chunks x chunk_size tensor containing sliced audio
139
- """
140
- samples_per_frame = chunk_size // frame_rate
141
- padding = (chunk_size - samples_per_frame) // 2
142
- audio = torch.nn.functional.pad(audio.unsqueeze(0), pad=[padding, padding]).squeeze(0)
143
- anchor_points = list(range(chunk_size//2, audio.shape[-1]-chunk_size//2, samples_per_frame))
144
- audio = torch.cat([audio[:, i-chunk_size//2:i+chunk_size//2] for i in anchor_points], dim=0)
145
- return audio
146
-
147
-
148
- def get_mfcc_ta(audio_fn, eps=1e-6, fps=15, smlpx=False, sr=16000, n_mfcc=64, win_size=None, type='mfcc', am=None, am_sr=None, encoder_choice='mfcc'):
149
- if am is None:
150
- audio, sr_0 = ta.load(audio_fn)
151
- if sr != sr_0:
152
- audio = ta.transforms.Resample(sr_0, sr)(audio)
153
- if audio.shape[0] > 1:
154
- audio = torch.mean(audio, dim=0, keepdim=True)
155
-
156
- n_fft = 2048
157
- if fps == 15:
158
- hop_length = 1467
159
- elif fps == 30:
160
- hop_length = 734
161
- win_length = hop_length * 2
162
- n_mels = 256
163
- n_mfcc = 64
164
-
165
- if type == 'mfcc':
166
- mfcc_transform = ta_T.MFCC(
167
- sample_rate=sr,
168
- n_mfcc=n_mfcc,
169
- melkwargs={
170
- "n_fft": n_fft,
171
- "n_mels": n_mels,
172
- # "win_length": win_length,
173
- "hop_length": hop_length,
174
- "mel_scale": "htk",
175
- },
176
- )
177
- audio_ft = mfcc_transform(audio).squeeze(dim=0).transpose(0,1).numpy()
178
- elif type == 'mel':
179
- # audio = 0.01 * audio / torch.mean(torch.abs(audio))
180
- mel_transform = ta_T.MelSpectrogram(
181
- sample_rate=sr, n_fft=n_fft, win_length=None, hop_length=hop_length, n_mels=n_mels
182
- )
183
- audio_ft = mel_transform(audio).squeeze(0).transpose(0,1).numpy()
184
- # audio_ft = torch.log(audio_ft.clamp(min=1e-10, max=None)).transpose(0,1).numpy()
185
- elif type == 'mel_mul':
186
- audio = 0.01 * audio / torch.mean(torch.abs(audio))
187
- audio = audio_chunking(audio, frame_rate=fps, chunk_size=sr)
188
- mel_transform = ta_T.MelSpectrogram(
189
- sample_rate=sr, n_fft=n_fft, win_length=int(sr/20), hop_length=int(sr/100), n_mels=n_mels
190
- )
191
- audio_ft = mel_transform(audio).squeeze(1)
192
- audio_ft = torch.log(audio_ft.clamp(min=1e-10, max=None)).numpy()
193
- else:
194
- speech_array, sampling_rate = librosa.load(audio_fn, sr=16000)
195
-
196
- if encoder_choice == 'faceformer':
197
- # audio_ft = np.squeeze(am(speech_array, sampling_rate=16000).input_values).reshape(-1, 1)
198
- audio_ft = speech_array.reshape(-1, 1)
199
- elif encoder_choice == 'meshtalk':
200
- audio_ft = 0.01 * speech_array / np.mean(np.abs(speech_array))
201
- elif encoder_choice == 'onset':
202
- audio_ft = librosa.onset.onset_detect(y=speech_array, sr=16000, units='time').reshape(-1, 1)
203
- else:
204
- audio, sr_0 = ta.load(audio_fn)
205
- if sr != sr_0:
206
- audio = ta.transforms.Resample(sr_0, sr)(audio)
207
- if audio.shape[0] > 1:
208
- audio = torch.mean(audio, dim=0, keepdim=True)
209
-
210
- n_fft = 2048
211
- if fps == 15:
212
- hop_length = 1467
213
- elif fps == 30:
214
- hop_length = 734
215
- win_length = hop_length * 2
216
- n_mels = 256
217
- n_mfcc = 64
218
-
219
- mfcc_transform = ta_T.MFCC(
220
- sample_rate=sr,
221
- n_mfcc=n_mfcc,
222
- melkwargs={
223
- "n_fft": n_fft,
224
- "n_mels": n_mels,
225
- # "win_length": win_length,
226
- "hop_length": hop_length,
227
- "mel_scale": "htk",
228
- },
229
- )
230
- audio_ft = mfcc_transform(audio).squeeze(dim=0).transpose(0, 1).numpy()
231
- return audio_ft
232
-
233
-
234
- def get_mfcc_sepa(audio_fn, fps=15, sr=16000):
235
- audio, sr_0 = ta.load(audio_fn)
236
- if sr != sr_0:
237
- audio = ta.transforms.Resample(sr_0, sr)(audio)
238
- if audio.shape[0] > 1:
239
- audio = torch.mean(audio, dim=0, keepdim=True)
240
-
241
- n_fft = 2048
242
- if fps == 15:
243
- hop_length = 1467
244
- elif fps == 30:
245
- hop_length = 734
246
- n_mels = 256
247
- n_mfcc = 64
248
-
249
- mfcc_transform = ta_T.MFCC(
250
- sample_rate=sr,
251
- n_mfcc=n_mfcc,
252
- melkwargs={
253
- "n_fft": n_fft,
254
- "n_mels": n_mels,
255
- # "win_length": win_length,
256
- "hop_length": hop_length,
257
- "mel_scale": "htk",
258
- },
259
- )
260
- audio_ft_0 = mfcc_transform(audio[0, :sr*2]).squeeze(dim=0).transpose(0,1).numpy()
261
- audio_ft_1 = mfcc_transform(audio[0, sr*2:]).squeeze(dim=0).transpose(0,1).numpy()
262
- audio_ft = np.concatenate((audio_ft_0, audio_ft_1), axis=0)
263
- return audio_ft, audio_ft_0.shape[0]
264
-
265
-
266
- def get_mfcc_old(wav_file):
267
- sig, sample_rate = load_wav_old(wav_file)
268
- mfcc = extract_mfcc(sig)
269
- return mfcc
270
-
271
-
272
- def smooth_geom(geom, mask: torch.Tensor = None, filter_size: int = 9, sigma: float = 2.0):
273
- """
274
- :param geom: T x V x 3 tensor containing a temporal sequence of length T with V vertices in each frame
275
- :param mask: V-dimensional Tensor containing a mask with vertices to be smoothed
276
- :param filter_size: size of the Gaussian filter
277
- :param sigma: standard deviation of the Gaussian filter
278
- :return: T x V x 3 tensor containing smoothed geometry (i.e., smoothed in the area indicated by the mask)
279
- """
280
- assert filter_size % 2 == 1, f"filter size must be odd but is {filter_size}"
281
- # Gaussian smoothing (low-pass filtering)
282
- fltr = np.arange(-(filter_size // 2), filter_size // 2 + 1)
283
- fltr = np.exp(-0.5 * fltr ** 2 / sigma ** 2)
284
- fltr = torch.Tensor(fltr) / np.sum(fltr)
285
- # apply fltr
286
- fltr = fltr.view(1, 1, -1).to(device=geom.device)
287
- T, V = geom.shape[1], geom.shape[2]
288
- g = torch.nn.functional.pad(
289
- geom.permute(2, 0, 1).view(V, 1, T),
290
- pad=[filter_size // 2, filter_size // 2], mode='replicate'
291
- )
292
- g = torch.nn.functional.conv1d(g, fltr).view(V, 1, T)
293
- smoothed = g.permute(1, 2, 0).contiguous()
294
- # blend smoothed signal with original signal
295
- if mask is None:
296
- return smoothed
297
- else:
298
- return smoothed * mask[None, :, None] + geom * (-mask[None, :, None] + 1)
299
-
300
- if __name__ == '__main__':
301
- audio_fn = '../sample_audio/clip000028_tCAkv4ggPgI.wav'
302
-
303
- C = get_mfcc_psf(audio_fn)
304
- print(C.shape)
305
-
306
- C_2 = get_mfcc_librosa(audio_fn)
307
- print(C.shape)
308
-
309
- print(C)
310
- print(C_2)
311
- print((C == C_2).all())
312
- # print(y.shape, sr)
313
- # mel_spec = get_melspec(audio_fn)
314
- # print(mel_spec.shape)
315
- # mfcc = get_mfcc(audio_fn, sr = 16000)
316
- # print(mfcc.shape)
317
- # print(mel_spec.max(), mel_spec.min())
318
- # print(mfcc.max(), mfcc.min())
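
The block below is a sketch, not repository code, showing how the torchaudio-based extractor above can be called on a raw `.wav`; the file path is a placeholder and the printed shape depends on the clip length. The import path refers to the module as it existed before this deletion.

```python
# Illustrative call of the MFCC front-end defined above.
from data_utils.utils import get_mfcc_ta

audio_fn = 'demo_audio/1st-page.wav'   # placeholder path to a 16 kHz-compatible wav
feat = get_mfcc_ta(audio_fn, fps=30, sr=16000, type='mfcc')
print(feat.shape)   # (num_frames, 64): one 64-dim MFCC vector per output frame
```
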
demo_audio/1st-page.wav DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:5fd78f4976c2fded490d274a9d4f20b5ebbc8e3c4e9f08ff2f69b38f92786818
3
- size 410190
demo_audio/yoy.py DELETED
File without changes
download_models.py DELETED
@@ -1,28 +0,0 @@
1
- import os
2
- import urllib.request
3
- import zipfile
4
- import subprocess
5
-
6
- def download_file(url, output_path):
7
- os.makedirs(os.path.dirname(output_path), exist_ok=True)
8
- if not os.path.exists(output_path):
9
- print(f"Downloading {url} to {output_path}...")
10
- urllib.request.urlretrieve(url, output_path)
11
- print("Download complete!")
12
- else:
13
- print(f"File already exists: {output_path}")
14
-
15
- def main():
16
- # Create necessary directories
17
- os.makedirs("experiments", exist_ok=True)
18
- os.makedirs("visualise/smplx_model", exist_ok=True)
19
-
20
- # Here you would need to add URLs to download your models
21
- # For example:
22
- # download_file("YOUR_MODEL_URL", "experiments/your_model.pth")
23
- # download_file("SMPLX_MODEL_URL", "visualise/smplx_model/SMPLX_NEUTRAL_2020.npz")
24
-
25
- print("Setup complete!")
26
-
27
- if __name__ == "__main__":
28
- main()
evaluation/FGD.py DELETED
@@ -1,199 +0,0 @@
1
- import time
2
-
3
- import numpy as np
4
- import torch
5
- import torch.nn.functional as F
6
- from scipy import linalg
7
- import math
8
- from data_utils.rotation_conversion import axis_angle_to_matrix, matrix_to_rotation_6d
9
-
10
- import warnings
11
- warnings.filterwarnings("ignore", category=RuntimeWarning) # ignore warnings
12
-
13
-
14
- change_angle = torch.tensor([6.0181e-05, 5.1597e-05, 2.1344e-04, 2.1899e-04])
15
- class EmbeddingSpaceEvaluator:
16
- def __init__(self, ae, vae, device):
17
-
18
- # init embed net
19
- self.ae = ae
20
- # self.vae = vae
21
-
22
- # storage
23
- self.real_feat_list = []
24
- self.generated_feat_list = []
25
- self.real_joints_list = []
26
- self.generated_joints_list = []
27
- self.real_6d_list = []
28
- self.generated_6d_list = []
29
- self.audio_beat_list = []
30
-
31
- def reset(self):
32
- self.real_feat_list = []
33
- self.generated_feat_list = []
34
-
35
- def get_no_of_samples(self):
36
- return len(self.real_feat_list)
37
-
38
- def push_samples(self, generated_poses, real_poses):
39
- # self.net.eval()
40
- # convert poses to latent features
41
- real_feat, real_poses = self.ae.extract(real_poses)
42
- generated_feat, generated_poses = self.ae.extract(generated_poses)
43
-
44
- num_joints = real_poses.shape[2] // 3
45
-
46
- real_feat = real_feat.squeeze()
47
- generated_feat = generated_feat.reshape(generated_feat.shape[0]*generated_feat.shape[1], -1)
48
-
49
- self.real_feat_list.append(real_feat.data.cpu().numpy())
50
- self.generated_feat_list.append(generated_feat.data.cpu().numpy())
51
-
52
- # real_poses = matrix_to_rotation_6d(axis_angle_to_matrix(real_poses.reshape(-1, 3))).reshape(-1, num_joints, 6)
53
- # generated_poses = matrix_to_rotation_6d(axis_angle_to_matrix(generated_poses.reshape(-1, 3))).reshape(-1, num_joints, 6)
54
- #
55
- # self.real_feat_list.append(real_poses.data.cpu().numpy())
56
- # self.generated_feat_list.append(generated_poses.data.cpu().numpy())
57
-
58
- def push_joints(self, generated_poses, real_poses):
59
- self.real_joints_list.append(real_poses.data.cpu())
60
- self.generated_joints_list.append(generated_poses.squeeze().data.cpu())
61
-
62
- def push_aud(self, aud):
63
- self.audio_beat_list.append(aud.squeeze().data.cpu())
64
-
65
- def get_MAAC(self):
66
- ang_vel_list = []
67
- for real_joints in self.real_joints_list:
68
- real_joints[:, 15:21] = real_joints[:, 16:22]
69
- vec = real_joints[:, 15:21] - real_joints[:, 13:19]
70
- inner_product = torch.einsum('kij,kij->ki', [vec[:, 2:], vec[:, :-2]])
71
- inner_product = torch.clamp(inner_product, -1, 1, out=None)
72
- angle = torch.acos(inner_product) / math.pi
73
- ang_vel = (angle[1:] - angle[:-1]).abs().mean(dim=0)
74
- ang_vel_list.append(ang_vel.unsqueeze(dim=0))
75
- all_vel = torch.cat(ang_vel_list, dim=0)
76
- MAAC = all_vel.mean(dim=0)
77
- return MAAC
78
-
79
- def get_BCscore(self):
80
- thres = 0.01
81
- sigma = 0.1
82
- sum_1 = 0
83
- total_beat = 0
84
- for joints, audio_beat_time in zip(self.generated_joints_list, self.audio_beat_list):
85
- motion_beat_time = []
86
- if joints.dim() == 4:
87
- joints = joints[0]
88
- joints[:, 15:21] = joints[:, 16:22]
89
- vec = joints[:, 15:21] - joints[:, 13:19]
90
- inner_product = torch.einsum('kij,kij->ki', [vec[:, 2:], vec[:, :-2]])
91
- inner_product = torch.clamp(inner_product, -1, 1, out=None)
92
- angle = torch.acos(inner_product) / math.pi
93
- ang_vel = (angle[1:] - angle[:-1]).abs() / change_angle / len(change_angle)
94
-
95
- angle_diff = torch.cat((torch.zeros(1, 4), ang_vel), dim=0)
96
-
97
- sum_2 = 0
98
- for i in range(angle_diff.shape[1]):
99
- motion_beat_time = []
100
- for t in range(1, joints.shape[0]-1):
101
- if (angle_diff[t][i] < angle_diff[t - 1][i] and angle_diff[t][i] < angle_diff[t + 1][i]):
102
- if (angle_diff[t - 1][i] - angle_diff[t][i] >= thres or angle_diff[t + 1][i] - angle_diff[
103
- t][i] >= thres):
104
- motion_beat_time.append(float(t) / 30.0)
105
- if (len(motion_beat_time) == 0):
106
- continue
107
- motion_beat_time = torch.tensor(motion_beat_time)
108
- sum = 0
109
- for audio in audio_beat_time:
110
- sum += np.power(math.e, -(np.power((audio.item() - motion_beat_time), 2)).min() / (2 * sigma * sigma))
111
- sum_2 = sum_2 + sum
112
- total_beat = total_beat + len(audio_beat_time)
113
- sum_1 = sum_1 + sum_2
114
- return sum_1/total_beat
115
-
116
-
117
- def get_scores(self):
118
- generated_feats = np.vstack(self.generated_feat_list)
119
- real_feats = np.vstack(self.real_feat_list)
120
-
121
- def frechet_distance(samples_A, samples_B):
122
- A_mu = np.mean(samples_A, axis=0)
123
- A_sigma = np.cov(samples_A, rowvar=False)
124
- B_mu = np.mean(samples_B, axis=0)
125
- B_sigma = np.cov(samples_B, rowvar=False)
126
- try:
127
- frechet_dist = self.calculate_frechet_distance(A_mu, A_sigma, B_mu, B_sigma)
128
- except ValueError:
129
- frechet_dist = 1e+10
130
- return frechet_dist
131
-
132
- ####################################################################
133
- # frechet distance
134
- frechet_dist = frechet_distance(generated_feats, real_feats)
135
-
136
- ####################################################################
137
- # distance between real and generated samples on the latent feature space
138
- dists = []
139
- for i in range(real_feats.shape[0]):
140
- d = np.sum(np.absolute(real_feats[i] - generated_feats[i])) # MAE
141
- dists.append(d)
142
- feat_dist = np.mean(dists)
143
-
144
- return frechet_dist, feat_dist
145
-
146
- @staticmethod
147
- def calculate_frechet_distance(mu1, sigma1, mu2, sigma2, eps=1e-6):
148
- """ from https://github.com/mseitzer/pytorch-fid/blob/master/fid_score.py """
149
- """Numpy implementation of the Frechet Distance.
150
- The Frechet distance between two multivariate Gaussians X_1 ~ N(mu_1, C_1)
151
- and X_2 ~ N(mu_2, C_2) is
152
- d^2 = ||mu_1 - mu_2||^2 + Tr(C_1 + C_2 - 2*sqrt(C_1*C_2)).
153
- Stable version by Dougal J. Sutherland.
154
- Params:
155
- -- mu1 : Numpy array containing the activations of a layer of the
156
- inception net (like returned by the function 'get_predictions')
157
- for generated samples.
158
- -- mu2 : The sample mean over activations, precalculated on an
159
- representative data set.
160
- -- sigma1: The covariance matrix over activations for generated samples.
161
- -- sigma2: The covariance matrix over activations, precalculated on an
162
- representative data set.
163
- Returns:
164
- -- : The Frechet Distance.
165
- """
166
-
167
- mu1 = np.atleast_1d(mu1)
168
- mu2 = np.atleast_1d(mu2)
169
-
170
- sigma1 = np.atleast_2d(sigma1)
171
- sigma2 = np.atleast_2d(sigma2)
172
-
173
- assert mu1.shape == mu2.shape, \
174
- 'Training and test mean vectors have different lengths'
175
- assert sigma1.shape == sigma2.shape, \
176
- 'Training and test covariances have different dimensions'
177
-
178
- diff = mu1 - mu2
179
-
180
- # Product might be almost singular
181
- covmean, _ = linalg.sqrtm(sigma1.dot(sigma2), disp=False)
182
- if not np.isfinite(covmean).all():
183
- msg = ('fid calculation produces singular product; '
184
- 'adding %s to diagonal of cov estimates') % eps
185
- print(msg)
186
- offset = np.eye(sigma1.shape[0]) * eps
187
- covmean = linalg.sqrtm((sigma1 + offset).dot(sigma2 + offset))
188
-
189
- # Numerical error might give slight imaginary component
190
- if np.iscomplexobj(covmean):
191
- if not np.allclose(np.diagonal(covmean).imag, 0, atol=1e-3):
192
- m = np.max(np.abs(covmean.imag))
193
- raise ValueError('Imaginary component {}'.format(m))
194
- covmean = covmean.real
195
-
196
- tr_covmean = np.trace(covmean)
197
-
198
- return (diff.dot(diff) + np.trace(sigma1) +
199
- np.trace(sigma2) - 2 * tr_covmean)
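
As a sanity check on the Fréchet-distance helper above, the sketch below (not repository code; the feature sets are synthetic) feeds it two Gaussian samples. A pure mean shift of 0.5 in every one of 32 dimensions should give a distance close to 32 × 0.5² = 8, up to sampling noise.

```python
# Self-contained check of calculate_frechet_distance on illustrative data.
import numpy as np
from evaluation.FGD import EmbeddingSpaceEvaluator

rng = np.random.default_rng(0)
feats_real = rng.standard_normal((2000, 32))
feats_fake = rng.standard_normal((2000, 32)) + 0.5   # shifted distribution

def stats(x):
    return np.mean(x, axis=0), np.cov(x, rowvar=False)

mu_r, sig_r = stats(feats_real)
mu_f, sig_f = stats(feats_fake)
fgd = EmbeddingSpaceEvaluator.calculate_frechet_distance(mu_r, sig_r, mu_f, sig_f)
print(fgd)   # roughly 8 for this mean shift
```
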
evaluation/__init__.py DELETED
File without changes
evaluation/diversity_LVD.py DELETED
@@ -1,64 +0,0 @@
1
- '''
2
- LVD: different initial pose
3
- diversity: same initial pose
4
- '''
5
- import os
6
- import sys
7
- sys.path.append(os.getcwd())
8
-
9
- from glob import glob
10
-
11
- from argparse import ArgumentParser
12
- import json
13
-
14
- from evaluation.util import *
15
- from evaluation.metrics import *
16
- from tqdm import tqdm
17
-
18
- parser = ArgumentParser()
19
- parser.add_argument('--speaker', required=True, type=str)
20
- parser.add_argument('--post_fix', nargs='+', default=['base'], type=str)
21
- args = parser.parse_args()
22
-
23
- speaker = args.speaker
24
- test_audios = sorted(glob('pose_dataset/videos/test_audios/%s/*.wav'%(speaker)))
25
-
26
- LVD_list = []
27
- diversity_list = []
28
-
29
- for aud in tqdm(test_audios):
30
- base_name = os.path.splitext(aud)[0]
31
- gt_path = get_full_path(aud, speaker, 'val')
32
- _, gt_poses, _ = get_gts(gt_path)
33
- gt_poses = gt_poses[np.newaxis,...]
34
- # print(gt_poses.shape)#(seq_len, 135*2)pose, lhand, rhand, face
35
- for post_fix in args.post_fix:
36
- pred_path = base_name + '_'+post_fix+'.json'
37
- pred_poses = np.array(json.load(open(pred_path)))
38
- # print(pred_poses.shape)#(B, seq_len, 108)
39
- pred_poses = cvt25(pred_poses, gt_poses)
40
- # print(pred_poses.shape)#(B, seq, pose_dim)
41
-
42
- gt_valid_points = hand_points(gt_poses)
43
- pred_valid_points = hand_points(pred_poses)
44
-
45
- lvd = LVD(gt_valid_points, pred_valid_points)
46
- # div = diversity(pred_valid_points)
47
-
48
- LVD_list.append(lvd)
49
- # diversity_list.append(div)
50
-
51
- # gt_velocity = peak_velocity(gt_valid_points, order=2)
52
- # pred_velocity = peak_velocity(pred_valid_points, order=2)
53
-
54
- # gt_consistency = velocity_consistency(gt_velocity, pred_velocity)
55
- # pred_consistency = velocity_consistency(pred_velocity, gt_velocity)
56
-
57
- # gt_consistency_list.append(gt_consistency)
58
- # pred_consistency_list.append(pred_consistency)
59
-
60
- lvd = np.mean(LVD_list)
61
- # diversity_list = np.mean(diversity_list)
62
-
63
- print('LVD:', lvd)
64
- # print("diversity:", diversity_list)
evaluation/get_quality_samples.py DELETED
@@ -1,62 +0,0 @@
1
- '''
2
- '''
3
- import os
4
- import sys
5
- sys.path.append(os.getcwd())
6
-
7
- from glob import glob
8
-
9
- from argparse import ArgumentParser
10
- import json
11
-
12
- from evaluation.util import *
13
- from evaluation.metrics import *
14
- from tqdm import tqdm
15
-
16
- parser = ArgumentParser()
17
- parser.add_argument('--speaker', required=True, type=str)
18
- parser.add_argument('--post_fix', nargs='+', default=['paper_model'], type=str)
19
- args = parser.parse_args()
20
-
21
- speaker = args.speaker
22
- test_audios = sorted(glob('pose_dataset/videos/test_audios/%s/*.wav'%(speaker)))
23
-
24
- quality_samples={'gt':[]}
25
- for post_fix in args.post_fix:
26
- quality_samples[post_fix] = []
27
-
28
- for aud in tqdm(test_audios):
29
- base_name = os.path.splitext(aud)[0]
30
- gt_path = get_full_path(aud, speaker, 'val')
31
- _, gt_poses, _ = get_gts(gt_path)
32
- gt_poses = gt_poses[np.newaxis,...]
33
- gt_valid_points = valid_points(gt_poses)
34
- # print(gt_valid_points.shape)
35
- quality_samples['gt'].append(gt_valid_points)
36
-
37
- for post_fix in args.post_fix:
38
- pred_path = base_name + '_'+post_fix+'.json'
39
- pred_poses = np.array(json.load(open(pred_path)))
40
- # print(pred_poses.shape)#(B, seq_len, 108)
41
- pred_poses = cvt25(pred_poses, gt_poses)
42
- # print(pred_poses.shape)#(B, seq, pose_dim)
43
-
44
- pred_valid_points = valid_points(pred_poses)[0:1]
45
- quality_samples[post_fix].append(pred_valid_points)
46
-
47
- quality_samples['gt'] = np.concatenate(quality_samples['gt'], axis=1)
48
- for post_fix in args.post_fix:
49
- quality_samples[post_fix] = np.concatenate(quality_samples[post_fix], axis=1)
50
-
51
- print('gt:', quality_samples['gt'].shape)
52
- quality_samples['gt'] = quality_samples['gt'].tolist()
53
- for post_fix in args.post_fix:
54
- print(post_fix, ':', quality_samples[post_fix].shape)
55
- quality_samples[post_fix] = quality_samples[post_fix].tolist()
56
-
57
- save_dir = '../../experiments/'
58
- os.makedirs(save_dir, exist_ok=True)
59
- save_name = os.path.join(save_dir, 'quality_samples_%s.json'%(speaker))
60
- with open(save_name, 'w') as f:
61
- json.dump(quality_samples, f)
62
-
evaluation/metrics.py DELETED
@@ -1,109 +0,0 @@
1
- '''
2
- Warning: metrics are for reference only, may have limited significance
3
- '''
4
- import os
5
- import sys
6
- sys.path.append(os.getcwd())
7
- import numpy as np
8
- import torch
9
-
10
- from data_utils.lower_body import rearrange, symmetry
11
- import torch.nn.functional as F
12
-
13
- def data_driven_baselines(gt_kps):
14
- '''
15
- gt_kps: T, D
16
- '''
17
- gt_velocity = np.abs(gt_kps[1:] - gt_kps[:-1])
18
-
19
- mean= np.mean(gt_velocity, axis=0)[np.newaxis] #(1, D)
20
- mean = np.mean(np.abs(gt_velocity-mean))
21
- last_step = gt_kps[1] - gt_kps[0]
22
- last_step = last_step[np.newaxis] #(1, D)
23
- last_step = np.mean(np.abs(gt_velocity-last_step))
24
- return last_step, mean
25
-
26
- def Batch_LVD(gt_kps, pr_kps, symmetrical, weight):
27
- if gt_kps.shape[0] > pr_kps.shape[1]:
28
- length = pr_kps.shape[1]
29
- else:
30
- length = gt_kps.shape[0]
31
- gt_kps = gt_kps[:length]
32
- pr_kps = pr_kps[:, :length]
33
- global symmetry
34
- symmetry = torch.tensor(symmetry).bool()
35
-
36
- if symmetrical:
37
- # rearrange for compute symmetric. ns means non-symmetrical joints, ys means symmetrical joints.
38
- gt_kps = gt_kps[:, rearrange]
39
- ns_gt_kps = gt_kps[:, ~symmetry]
40
- ys_gt_kps = gt_kps[:, symmetry]
41
- ys_gt_kps = ys_gt_kps.reshape(ys_gt_kps.shape[0], -1, 2, 3)
42
- ns_gt_velocity = (ns_gt_kps[1:] - ns_gt_kps[:-1]).norm(p=2, dim=-1)
43
- ys_gt_velocity = (ys_gt_kps[1:] - ys_gt_kps[:-1]).norm(p=2, dim=-1)
44
- left_gt_vel = ys_gt_velocity[:, :, 0].sum(dim=-1)
45
- right_gt_vel = ys_gt_velocity[:, :, 1].sum(dim=-1)
46
- move_side = torch.where(left_gt_vel>right_gt_vel, torch.ones(left_gt_vel.shape).cuda(), torch.zeros(left_gt_vel.shape).cuda())
47
- ys_gt_velocity = torch.mul(ys_gt_velocity[:, :, 0].transpose(0,1), move_side) + torch.mul(ys_gt_velocity[:, :, 1].transpose(0,1), ~move_side.bool())
48
- ys_gt_velocity = ys_gt_velocity.transpose(0,1)
49
- gt_velocity = torch.cat([ns_gt_velocity, ys_gt_velocity], dim=1)
50
-
51
- pr_kps = pr_kps[:, :, rearrange]
52
- ns_pr_kps = pr_kps[:, :, ~symmetry]
53
- ys_pr_kps = pr_kps[:, :, symmetry]
54
- ys_pr_kps = ys_pr_kps.reshape(ys_pr_kps.shape[0], ys_pr_kps.shape[1], -1, 2, 3)
55
- ns_pr_velocity = (ns_pr_kps[:, 1:] - ns_pr_kps[:, :-1]).norm(p=2, dim=-1)
56
- ys_pr_velocity = (ys_pr_kps[:, 1:] - ys_pr_kps[:, :-1]).norm(p=2, dim=-1)
57
- left_pr_vel = ys_pr_velocity[:, :, :, 0].sum(dim=-1)
58
- right_pr_vel = ys_pr_velocity[:, :, :, 1].sum(dim=-1)
59
- move_side = torch.where(left_pr_vel > right_pr_vel, torch.ones(left_pr_vel.shape).cuda(),
60
- torch.zeros(left_pr_vel.shape).cuda())
61
- ys_pr_velocity = torch.mul(ys_pr_velocity[..., 0].permute(2, 0, 1), move_side) + torch.mul(
62
- ys_pr_velocity[..., 1].permute(2, 0, 1), ~move_side.long())
63
- ys_pr_velocity = ys_pr_velocity.permute(1, 2, 0)
64
- pr_velocity = torch.cat([ns_pr_velocity, ys_pr_velocity], dim=2)
65
- else:
66
- gt_velocity = (gt_kps[1:] - gt_kps[:-1]).norm(p=2, dim=-1)
67
- pr_velocity = (pr_kps[:, 1:] - pr_kps[:, :-1]).norm(p=2, dim=-1)
68
-
69
- if weight:
70
- w = F.softmax(gt_velocity.sum(dim=1).normal_(), dim=0)
71
- else:
72
- w = 1 / gt_velocity.shape[0]
73
-
74
- v_diff = ((pr_velocity - gt_velocity).abs().sum(dim=-1) * w).sum(dim=-1).mean()
75
-
76
- return v_diff
77
-
78
-
79
- def LVD(gt_kps, pr_kps, symmetrical=False, weight=False):
80
- gt_kps = gt_kps.squeeze()
81
- pr_kps = pr_kps.squeeze()
82
- if len(pr_kps.shape) == 4:
83
- return Batch_LVD(gt_kps, pr_kps, symmetrical, weight)
84
- # length = np.minimum(gt_kps.shape[0], pr_kps.shape[0])
85
- length = gt_kps.shape[0]-10
86
- # gt_kps = gt_kps[25:length]
87
- # pr_kps = pr_kps[25:length] #(T, D)
88
- # if pr_kps.shape[0] < gt_kps.shape[0]:
89
- # pr_kps = np.pad(pr_kps, [[0, int(gt_kps.shape[0]-pr_kps.shape[0])], [0, 0]], mode='constant')
90
-
91
- gt_velocity = (gt_kps[1:] - gt_kps[:-1]).norm(p=2, dim=-1)
92
- pr_velocity = (pr_kps[1:] - pr_kps[:-1]).norm(p=2, dim=-1)
93
-
94
- return (pr_velocity-gt_velocity).abs().sum(dim=-1).mean()
95
-
96
- def diversity(kps):
97
- '''
98
- kps: bs, seq, dim
99
- '''
100
- dis_list = []
101
- #the distance between each pair
102
- for i in range(kps.shape[0]):
103
- for j in range(i+1, kps.shape[0]):
104
- seq_i = kps[i]
105
- seq_j = kps[j]
106
-
107
- dis = np.mean(np.abs(seq_i - seq_j))
108
- dis_list.append(dis)
109
- return np.mean(dis_list)
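
For orientation, here is a minimal call of the non-batched `LVD` metric above on random joint trajectories; it is a sketch with assumed shapes, not repository code, and it returns the per-frame velocity error summed over joints and averaged over time.

```python
# Illustrative LVD call; shapes (T frames, 54 joints, 3D) are assumptions.
import torch
from evaluation.metrics import LVD

gt_kps = torch.randn(120, 54, 3)
pr_kps = torch.randn(120, 54, 3)
print(LVD(gt_kps, pr_kps))   # scalar tensor: |Δvelocity| summed over joints, averaged over frames
```
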
evaluation/mode_transition.py DELETED
@@ -1,60 +0,0 @@
1
- import os
2
- import sys
3
- sys.path.append(os.getcwd())
4
-
5
- from glob import glob
6
-
7
- from argparse import ArgumentParser
8
- import json
9
-
10
- from evaluation.util import *
11
- from evaluation.metrics import *
12
- from tqdm import tqdm
13
-
14
- parser = ArgumentParser()
15
- parser.add_argument('--speaker', required=True, type=str)
16
- parser.add_argument('--post_fix', nargs='+', default=['paper_model'], type=str)
17
- args = parser.parse_args()
18
-
19
- speaker = args.speaker
20
- test_audios = sorted(glob('pose_dataset/videos/test_audios/%s/*.wav'%(speaker)))
21
-
22
- precision_list=[]
23
- recall_list=[]
24
- accuracy_list=[]
25
-
26
- for aud in tqdm(test_audios):
27
- base_name = os.path.splitext(aud)[0]
28
- gt_path = get_full_path(aud, speaker, 'val')
29
- _, gt_poses, _ = get_gts(gt_path)
30
- if gt_poses.shape[0] < 50:
31
- continue
32
- gt_poses = gt_poses[np.newaxis,...]
33
- # print(gt_poses.shape)#(seq_len, 135*2)pose, lhand, rhand, face
34
- for post_fix in args.post_fix:
35
- pred_path = base_name + '_'+post_fix+'.json'
36
- pred_poses = np.array(json.load(open(pred_path)))
37
- # print(pred_poses.shape)#(B, seq_len, 108)
38
- pred_poses = cvt25(pred_poses, gt_poses)
39
- # print(pred_poses.shape)#(B, seq, pose_dim)
40
-
41
- gt_valid_points = valid_points(gt_poses)
42
- pred_valid_points = valid_points(pred_poses)
43
-
44
- # print(gt_valid_points.shape, pred_valid_points.shape)
45
-
46
- gt_mode_transition_seq = mode_transition_seq(gt_valid_points, speaker)#(B, N)
47
- pred_mode_transition_seq = mode_transition_seq(pred_valid_points, speaker)#(B, N)
48
-
49
- # baseline = np.random.randint(0, 2, size=pred_mode_transition_seq.shape)
50
- # pred_mode_transition_seq = baseline
51
- precision, recall, accuracy = mode_transition_consistency(pred_mode_transition_seq, gt_mode_transition_seq)
52
- precision_list.append(precision)
53
- recall_list.append(recall)
54
- accuracy_list.append(accuracy)
55
- print(len(precision_list), len(recall_list), len(accuracy_list))
56
- precision_list = np.mean(precision_list)
57
- recall_list = np.mean(recall_list)
58
- accuracy_list = np.mean(accuracy_list)
59
-
60
- print('precision, recall, accu:', precision_list, recall_list, accuracy_list)
evaluation/peak_velocity.py DELETED
@@ -1,65 +0,0 @@
1
- import os
2
- import sys
3
- sys.path.append(os.getcwd())
4
-
5
- from glob import glob
6
-
7
- from argparse import ArgumentParser
8
- import json
9
-
10
- from evaluation.util import *
11
- from evaluation.metrics import *
12
- from tqdm import tqdm
13
-
14
- parser = ArgumentParser()
15
- parser.add_argument('--speaker', required=True, type=str)
16
- parser.add_argument('--post_fix', nargs='+', default=['paper_model'], type=str)
17
- args = parser.parse_args()
18
-
19
- speaker = args.speaker
20
- test_audios = sorted(glob('pose_dataset/videos/test_audios/%s/*.wav'%(speaker)))
21
-
22
- gt_consistency_list=[]
23
- pred_consistency_list=[]
24
-
25
- for aud in tqdm(test_audios):
26
- base_name = os.path.splitext(aud)[0]
27
- gt_path = get_full_path(aud, speaker, 'val')
28
- _, gt_poses, _ = get_gts(gt_path)
29
- gt_poses = gt_poses[np.newaxis,...]
30
- # print(gt_poses.shape)#(seq_len, 135*2)pose, lhand, rhand, face
31
- for post_fix in args.post_fix:
32
- pred_path = base_name + '_'+post_fix+'.json'
33
- pred_poses = np.array(json.load(open(pred_path)))
34
- # print(pred_poses.shape)#(B, seq_len, 108)
35
- pred_poses = cvt25(pred_poses, gt_poses)
36
- # print(pred_poses.shape)#(B, seq, pose_dim)
37
-
38
- gt_valid_points = hand_points(gt_poses)
39
- pred_valid_points = hand_points(pred_poses)
40
-
41
- gt_velocity = peak_velocity(gt_valid_points, order=2)
42
- pred_velocity = peak_velocity(pred_valid_points, order=2)
43
-
44
- gt_consistency = velocity_consistency(gt_velocity, pred_velocity)
45
- pred_consistency = velocity_consistency(pred_velocity, gt_velocity)
46
-
47
- gt_consistency_list.append(gt_consistency)
48
- pred_consistency_list.append(pred_consistency)
49
-
50
- gt_consistency_list = np.concatenate(gt_consistency_list)
51
- pred_consistency_list = np.concatenate(pred_consistency_list)
52
-
53
- print(gt_consistency_list.max(), gt_consistency_list.min())
54
- print(pred_consistency_list.max(), pred_consistency_list.min())
55
- print(np.mean(gt_consistency_list), np.mean(pred_consistency_list))
56
- print(np.std(gt_consistency_list), np.std(pred_consistency_list))
57
-
58
- draw_cdf(gt_consistency_list, save_name='%s_gt.jpg'%(speaker), color='slateblue')
59
- draw_cdf(pred_consistency_list, save_name='%s_pred.jpg'%(speaker), color='lightskyblue')
60
-
61
- to_excel(gt_consistency_list, '%s_gt.xlsx'%(speaker))
62
- to_excel(pred_consistency_list, '%s_pred.xlsx'%(speaker))
63
-
64
- np.save('%s_gt.npy'%(speaker), gt_consistency_list)
65
- np.save('%s_pred.npy'%(speaker), pred_consistency_list)
evaluation/util.py DELETED
@@ -1,148 +0,0 @@
1
- import os
2
- from glob import glob
3
- import numpy as np
4
- import json
5
- from matplotlib import pyplot as plt
6
- import pandas as pd
7
- def get_gts(clip):
8
- '''
9
- clip: abs path to the clip dir
10
- '''
11
- keypoints_files = sorted(glob(os.path.join(clip, 'keypoints_new/person_1')+'/*.json'))
12
-
13
- upper_body_points = list(np.arange(0, 25))
14
- poses = []
15
- confs = []
16
- neck_to_nose_len = []
17
- mean_position = []
18
- for kp_file in keypoints_files:
19
- kp_load = json.load(open(kp_file, 'r'))['people'][0]
20
- posepts = kp_load['pose_keypoints_2d']
21
- lhandpts = kp_load['hand_left_keypoints_2d']
22
- rhandpts = kp_load['hand_right_keypoints_2d']
23
- facepts = kp_load['face_keypoints_2d']
24
-
25
- neck = np.array(posepts).reshape(-1,3)[1]
26
- nose = np.array(posepts).reshape(-1,3)[0]
27
- x_offset = abs(neck[0]-nose[0])
28
- y_offset = abs(neck[1]-nose[1])
29
- neck_to_nose_len.append(y_offset)
30
- mean_position.append([neck[0],neck[1]])
31
-
32
- keypoints=np.array(posepts+lhandpts+rhandpts+facepts).reshape(-1,3)[:,:2]
33
-
34
- upper_body = keypoints[upper_body_points, :]
35
- hand_points = keypoints[25:, :]
36
- keypoints = np.vstack([upper_body, hand_points])
37
-
38
- poses.append(keypoints)
39
-
40
- if len(neck_to_nose_len) > 0:
41
- scale_factor = np.mean(neck_to_nose_len)
42
- else:
43
- raise ValueError(clip)
44
- mean_position = np.mean(np.array(mean_position), axis=0)
45
-
46
- unlocalized_poses = np.array(poses).copy()
47
- localized_poses = []
48
- for i in range(len(poses)):
49
- keypoints = poses[i]
50
- neck = keypoints[1].copy()
51
-
52
- keypoints[:, 0] = (keypoints[:, 0] - neck[0]) / scale_factor
53
- keypoints[:, 1] = (keypoints[:, 1] - neck[1]) / scale_factor
54
- localized_poses.append(keypoints.reshape(-1))
55
-
56
- localized_poses=np.array(localized_poses)
57
- return unlocalized_poses, localized_poses, (scale_factor, mean_position)
58
-
59
- def get_full_path(wav_name, speaker, split):
60
- '''
61
- get clip path from aud file
62
- '''
63
- wav_name = os.path.basename(wav_name)
64
- wav_name = os.path.splitext(wav_name)[0]
65
- clip_name, vid_name = wav_name[:10], wav_name[11:]
66
-
67
- full_path = os.path.join('pose_dataset/videos/', speaker, 'clips', vid_name, 'images/half', split, clip_name)
68
-
69
- assert os.path.isdir(full_path), full_path
70
-
71
- return full_path
72
-
73
- def smooth(res):
74
- '''
75
- res: (B, seq_len, pose_dim)
76
- '''
77
- window = [res[:, 7, :], res[:, 8, :], res[:, 9, :], res[:, 10, :], res[:, 11, :], res[:, 12, :]]
78
- w_size=7
79
- for i in range(10, res.shape[1]-3):
80
- window.append(res[:, i+3, :])
81
- if len(window) > w_size:
82
- window = window[1:]
83
-
84
- if (i%25) in [22, 23, 24, 0, 1, 2, 3]:
85
- res[:, i, :] = np.mean(window, axis=1)
86
-
87
- return res
88
-
89
- def cvt25(pred_poses, gt_poses=None):
90
- '''
91
- gt_poses: (1, seq_len, 270), 135 *2
92
- pred_poses: (B, seq_len, 108), 54 * 2
93
- '''
94
- if gt_poses is None:
95
- gt_poses = np.zeros_like(pred_poses)
96
- else:
97
- gt_poses = gt_poses.repeat(pred_poses.shape[0], axis=0)
98
-
99
- length = min(pred_poses.shape[1], gt_poses.shape[1])
100
- pred_poses = pred_poses[:, :length, :]
101
- gt_poses = gt_poses[:, :length, :]
102
- gt_poses = gt_poses.reshape(gt_poses.shape[0], gt_poses.shape[1], -1, 2)
103
- pred_poses = pred_poses.reshape(pred_poses.shape[0], pred_poses.shape[1], -1, 2)
104
-
105
- gt_poses[:, :, [1, 2, 3, 4, 5, 6, 7], :] = pred_poses[:, :, 1:8, :]
106
- gt_poses[:, :, 25:25+21+21, :] = pred_poses[:, :, 12:, :]
107
-
108
- return gt_poses.reshape(gt_poses.shape[0], gt_poses.shape[1], -1)
109
-
110
- def hand_points(seq):
111
- '''
112
- seq: (B, seq_len, 135*2)
113
- hands only
114
- '''
115
- hand_idx = [1, 2, 3, 4,5 ,6,7] + list(range(25, 25+21+21))
116
- seq = seq.reshape(seq.shape[0], seq.shape[1], -1, 2)
117
- return seq[:, :, hand_idx, :].reshape(seq.shape[0], seq.shape[1], -1)
118
-
119
- def valid_points(seq):
120
- '''
121
- hands with some head points
122
- '''
123
- valid_idx = [0, 1, 2, 3, 4,5 ,6,7, 8, 9, 10, 11] + list(range(25, 25+21+21))
124
- seq = seq.reshape(seq.shape[0], seq.shape[1], -1, 2)
125
-
126
- seq = seq[:, :, valid_idx, :].reshape(seq.shape[0], seq.shape[1], -1)
127
- assert seq.shape[-1] == 108, seq.shape
128
- return seq
129
-
130
- def draw_cdf(seq, save_name='cdf.jpg', color='slateblue'):
131
- plt.figure()
132
- plt.hist(seq, bins=100, range=(0, 100), color=color)
133
- plt.savefig(save_name)
134
-
135
- def to_excel(seq, save_name='res.xlsx'):
136
- '''
137
- seq: (T)
138
- '''
139
- df = pd.DataFrame(seq)
140
- writer = pd.ExcelWriter(save_name)
141
- df.to_excel(writer, 'sheet1')
142
- writer.save()
143
- writer.close()
144
-
145
-
146
- if __name__ == '__main__':
147
- random_data = np.random.randint(0, 10, 100)
148
- draw_cdf(random_data)
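A minimal usage sketch of the helpers above (assuming `evaluation/util.py` is still importable and its dependencies are installed); the dummy shapes follow the docstrings: predictions carry 54 upper-body/hand joints, ground truth the full 135 OpenPose joints, each as (x, y) pairs:

```python
import numpy as np
from evaluation.util import cvt25, hand_points, valid_points

B, T = 4, 88
pred = np.random.randn(B, T, 108)   # (B, seq_len, 54 * 2) predicted keypoints
gt = np.random.randn(1, T, 270)     # (1, seq_len, 135 * 2) ground-truth keypoints

full = cvt25(pred, gt)              # paste predictions into the 135-joint layout -> (B, T, 270)
hands = hand_points(full)           # arms + both hands only -> (B, T, 98)
valid = valid_points(full)          # arms, hands and a few head joints -> (B, T, 108)
print(full.shape, hands.shape, valid.shape)
```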
 
losses/__init__.py DELETED
@@ -1 +0,0 @@
1
- from .losses import *
 
 
losses/losses.py DELETED
@@ -1,91 +0,0 @@
1
- import os
2
- import sys
3
-
4
- sys.path.append(os.getcwd())
5
-
6
- import torch
7
- import torch.nn as nn
8
- import torch.nn.functional as F
9
- import numpy as np
10
-
11
- class KeypointLoss(nn.Module):
12
- def __init__(self):
13
- super(KeypointLoss, self).__init__()
14
-
15
- def forward(self, pred_seq, gt_seq, gt_conf=None):
16
- #pred_seq: (B, C, T)
17
- if gt_conf is not None:
18
- gt_conf = gt_conf >= 0.01
19
- return F.mse_loss(pred_seq[gt_conf], gt_seq[gt_conf], reduction='mean')
20
- else:
21
- return F.mse_loss(pred_seq, gt_seq)
22
-
23
-
24
- class KLLoss(nn.Module):
25
- def __init__(self, kl_tolerance):
26
- super(KLLoss, self).__init__()
27
- self.kl_tolerance = kl_tolerance
28
-
29
- def forward(self, mu, var, mul=1):
30
- kl_tolerance = self.kl_tolerance * mul * var.shape[1] / 64
31
- kld_loss = -0.5 * torch.sum(1 + var - mu**2 - var.exp(), dim=1)
32
- # kld_loss = -0.5 * torch.sum(1 + (var-1) - (mu) ** 2 - (var-1).exp(), dim=1)
33
- if self.kl_tolerance is not None:
34
- # above_line = kld_loss[kld_loss > self.kl_tolerance]
35
- # if len(above_line) > 0:
36
- # kld_loss = torch.mean(kld_loss)
37
- # else:
38
- # kld_loss = 0
39
- kld_loss = torch.where(kld_loss > kl_tolerance, kld_loss, torch.tensor(kl_tolerance, device=kld_loss.device))
40
- # else:
41
- kld_loss = torch.mean(kld_loss)
42
- return kld_loss
43
-
44
-
45
- class L2KLLoss(nn.Module):
46
- def __init__(self, kl_tolerance):
47
- super(L2KLLoss, self).__init__()
48
- self.kl_tolerance = kl_tolerance
49
-
50
- def forward(self, x):
51
- # TODO: check
52
- kld_loss = torch.sum(x ** 2, dim=1)
53
- if self.kl_tolerance is not None:
54
- above_line = kld_loss[kld_loss > self.kl_tolerance]
55
- if len(above_line) > 0:
56
- kld_loss = torch.mean(kld_loss)
57
- else:
58
- kld_loss = 0
59
- else:
60
- kld_loss = torch.mean(kld_loss)
61
- return kld_loss
62
-
63
- class L2RegLoss(nn.Module):
64
- def __init__(self):
65
- super(L2RegLoss, self).__init__()
66
-
67
- def forward(self, x):
68
- #TODO: check
69
- return torch.sum(x**2)
70
-
71
-
72
- class L2Loss(nn.Module):
73
- def __init__(self):
74
- super(L2Loss, self).__init__()
75
-
76
- def forward(self, x):
77
- # TODO: check
78
- return torch.sum(x ** 2)
79
-
80
-
81
- class AudioLoss(nn.Module):
82
- def __init__(self):
83
- super(AudioLoss, self).__init__()
84
-
85
- def forward(self, dynamics, gt_poses):
86
- #pay attention, normalized
87
- mean = torch.mean(gt_poses, dim=-1).unsqueeze(-1)
88
- gt = gt_poses - mean
89
- return F.mse_loss(dynamics, gt)
90
-
91
- L1Loss = nn.L1Loss
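A minimal sketch of how these losses compose in a training step, on dummy tensors; the 0.01 KL weight is illustrative, not a value taken from the repo's configs:

```python
import torch
from losses import KeypointLoss, KLLoss

keypoint_loss = KeypointLoss()
kl_loss_fn = KLLoss(kl_tolerance=0.02)

pred = torch.randn(8, 129, 88)      # (B, C, T) predicted pose sequence
gt = torch.randn(8, 129, 88)
conf = torch.rand(8, 129, 88)       # per-keypoint confidences, thresholded at 0.01 inside

rec = keypoint_loss(pred, gt, gt_conf=conf)     # masked MSE over confident keypoints
mu, logvar = torch.randn(8, 64), torch.randn(8, 64)
kl = kl_loss_fn(mu, logvar)                     # KL hinged at the tolerance
total = rec + 0.01 * kl                         # illustrative weighting
```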
 
nets/LS3DCG.py DELETED
@@ -1,414 +0,0 @@
1
- '''
2
- not exactly the same as the official repo but the results are good
3
- '''
4
- import sys
5
- import os
6
-
7
- from data_utils.lower_body import c_index_3d, c_index_6d
8
-
9
- sys.path.append(os.getcwd())
10
-
11
- import numpy as np
12
- import torch
13
- import torch.nn as nn
14
- import torch.optim as optim
15
- import torch.nn.functional as F
16
- import math
17
-
18
- from nets.base import TrainWrapperBaseClass
19
- from nets.layers import SeqEncoder1D
20
- from losses import KeypointLoss, L1Loss, KLLoss
21
- from data_utils.utils import get_melspec, get_mfcc_psf, get_mfcc_ta
22
- from nets.utils import denormalize
23
-
24
- class Conv1d_tf(nn.Conv1d):
25
- """
26
- Conv1d with the padding behavior from TF
27
- modified from https://github.com/mlperf/inference/blob/482f6a3beb7af2fb0bd2d91d6185d5e71c22c55f/others/edge/object_detection/ssd_mobilenet/pytorch/utils.py
28
- """
29
-
30
- def __init__(self, *args, **kwargs):
31
- super(Conv1d_tf, self).__init__(*args, **kwargs)
32
- self.padding = kwargs.get("padding", "same")
33
-
34
- def _compute_padding(self, input, dim):
35
- input_size = input.size(dim + 2)
36
- filter_size = self.weight.size(dim + 2)
37
- effective_filter_size = (filter_size - 1) * self.dilation[dim] + 1
38
- out_size = (input_size + self.stride[dim] - 1) // self.stride[dim]
39
- total_padding = max(
40
- 0, (out_size - 1) * self.stride[dim] + effective_filter_size - input_size
41
- )
42
- additional_padding = int(total_padding % 2 != 0)
43
-
44
- return additional_padding, total_padding
45
-
46
- def forward(self, input):
47
- if self.padding == "VALID":
48
- return F.conv1d(
49
- input,
50
- self.weight,
51
- self.bias,
52
- self.stride,
53
- padding=0,
54
- dilation=self.dilation,
55
- groups=self.groups,
56
- )
57
- rows_odd, padding_rows = self._compute_padding(input, dim=0)
58
- if rows_odd:
59
- input = F.pad(input, [0, rows_odd])
60
-
61
- return F.conv1d(
62
- input,
63
- self.weight,
64
- self.bias,
65
- self.stride,
66
- padding=(padding_rows // 2),
67
- dilation=self.dilation,
68
- groups=self.groups,
69
- )
70
-
71
-
72
- def ConvNormRelu(in_channels, out_channels, type='1d', downsample=False, k=None, s=None, norm='bn', padding='valid'):
73
- if k is None and s is None:
74
- if not downsample:
75
- k = 3
76
- s = 1
77
- else:
78
- k = 4
79
- s = 2
80
-
81
- if type == '1d':
82
- conv_block = Conv1d_tf(in_channels, out_channels, kernel_size=k, stride=s, padding=padding)
83
- if norm == 'bn':
84
- norm_block = nn.BatchNorm1d(out_channels)
85
- elif norm == 'ln':
86
- norm_block = nn.LayerNorm(out_channels)
87
- elif type == '2d':
88
- conv_block = Conv2d_tf(in_channels, out_channels, kernel_size=k, stride=s, padding=padding)
89
- norm_block = nn.BatchNorm2d(out_channels)
90
- else:
91
- assert False
92
-
93
- return nn.Sequential(
94
- conv_block,
95
- norm_block,
96
- nn.LeakyReLU(0.2, True)
97
- )
98
-
99
- class Decoder(nn.Module):
100
- def __init__(self, in_ch, out_ch):
101
- super(Decoder, self).__init__()
102
- self.up1 = nn.Sequential(
103
- ConvNormRelu(in_ch // 2 + in_ch, in_ch // 2),
104
- ConvNormRelu(in_ch // 2, in_ch // 2),
105
- nn.Upsample(scale_factor=2, mode='nearest')
106
- )
107
- self.up2 = nn.Sequential(
108
- ConvNormRelu(in_ch // 4 + in_ch // 2, in_ch // 4),
109
- ConvNormRelu(in_ch // 4, in_ch // 4),
110
- nn.Upsample(scale_factor=2, mode='nearest')
111
- )
112
- self.up3 = nn.Sequential(
113
- ConvNormRelu(in_ch // 8 + in_ch // 4, in_ch // 8),
114
- ConvNormRelu(in_ch // 8, in_ch // 8),
115
- nn.Conv1d(in_ch // 8, out_ch, 1, 1)
116
- )
117
-
118
- def forward(self, x, x1, x2, x3):
119
- x = F.interpolate(x, x3.shape[2])
120
- x = torch.cat([x, x3], dim=1)
121
- x = self.up1(x)
122
- x = F.interpolate(x, x2.shape[2])
123
- x = torch.cat([x, x2], dim=1)
124
- x = self.up2(x)
125
- x = F.interpolate(x, x1.shape[2])
126
- x = torch.cat([x, x1], dim=1)
127
- x = self.up3(x)
128
- return x
129
-
130
-
131
- class EncoderDecoder(nn.Module):
132
- def __init__(self, n_frames, each_dim):
133
- super().__init__()
134
- self.n_frames = n_frames
135
-
136
- self.down1 = nn.Sequential(
137
- ConvNormRelu(64, 64, '1d', False),
138
- ConvNormRelu(64, 128, '1d', False),
139
- )
140
- self.down2 = nn.Sequential(
141
- ConvNormRelu(128, 128, '1d', False),
142
- ConvNormRelu(128, 256, '1d', False),
143
- )
144
- self.down3 = nn.Sequential(
145
- ConvNormRelu(256, 256, '1d', False),
146
- ConvNormRelu(256, 512, '1d', False),
147
- )
148
- self.down4 = nn.Sequential(
149
- ConvNormRelu(512, 512, '1d', False),
150
- ConvNormRelu(512, 1024, '1d', False),
151
- )
152
-
153
- self.down = nn.MaxPool1d(kernel_size=2)
154
- self.up = nn.Upsample(scale_factor=2, mode='nearest')
155
-
156
- self.face_decoder = Decoder(1024, each_dim[0] + each_dim[3])
157
- self.body_decoder = Decoder(1024, each_dim[1])
158
- self.hand_decoder = Decoder(1024, each_dim[2])
159
-
160
- def forward(self, spectrogram, time_steps=None):
161
- if time_steps is None:
162
- time_steps = self.n_frames
163
-
164
- x1 = self.down1(spectrogram)
165
- x = self.down(x1)
166
- x2 = self.down2(x)
167
- x = self.down(x2)
168
- x3 = self.down3(x)
169
- x = self.down(x3)
170
- x = self.down4(x)
171
- x = self.up(x)
172
-
173
- face = self.face_decoder(x, x1, x2, x3)
174
- body = self.body_decoder(x, x1, x2, x3)
175
- hand = self.hand_decoder(x, x1, x2, x3)
176
-
177
- return face, body, hand
178
-
179
-
180
- class Generator(nn.Module):
181
- def __init__(self,
182
- each_dim,
183
- training=False,
184
- device=None
185
- ):
186
- super().__init__()
187
-
188
- self.training = training
189
- self.device = device
190
-
191
- self.encoderdecoder = EncoderDecoder(15, each_dim)
192
-
193
- def forward(self, in_spec, time_steps=None):
194
- if time_steps is not None:
195
- self.gen_length = time_steps
196
-
197
- face, body, hand = self.encoderdecoder(in_spec)
198
- out = torch.cat([face, body, hand], dim=1)
199
- out = out.transpose(1, 2)
200
-
201
- return out
202
-
203
-
204
- class Discriminator(nn.Module):
205
- def __init__(self, input_dim):
206
- super().__init__()
207
- self.net = nn.Sequential(
208
- ConvNormRelu(input_dim, 128, '1d'),
209
- ConvNormRelu(128, 256, '1d'),
210
- nn.MaxPool1d(kernel_size=2),
211
- ConvNormRelu(256, 256, '1d'),
212
- ConvNormRelu(256, 512, '1d'),
213
- nn.MaxPool1d(kernel_size=2),
214
- ConvNormRelu(512, 512, '1d'),
215
- ConvNormRelu(512, 1024, '1d'),
216
- nn.MaxPool1d(kernel_size=2),
217
- nn.Conv1d(1024, 1, 1, 1),
218
- nn.Sigmoid()
219
- )
220
-
221
- def forward(self, x):
222
- x = x.transpose(1, 2)
223
-
224
- out = self.net(x)
225
- return out
226
-
227
-
228
- class TrainWrapper(TrainWrapperBaseClass):
229
- def __init__(self, args, config) -> None:
230
- self.args = args
231
- self.config = config
232
- self.device = torch.device(self.args.gpu)
233
- self.global_step = 0
234
- self.convert_to_6d = self.config.Data.pose.convert_to_6d
235
- self.init_params()
236
-
237
- self.generator = Generator(
238
- each_dim=self.each_dim,
239
- training=not self.args.infer,
240
- device=self.device,
241
- ).to(self.device)
242
- self.discriminator = Discriminator(
243
- input_dim=self.each_dim[1] + self.each_dim[2] + 64
244
- ).to(self.device)
245
- if self.convert_to_6d:
246
- self.c_index = c_index_6d
247
- else:
248
- self.c_index = c_index_3d
249
- self.MSELoss = KeypointLoss().to(self.device)
250
- self.L1Loss = L1Loss().to(self.device)
251
- super().__init__(args, config)
252
-
253
- def init_params(self):
254
- scale = 1
255
-
256
- global_orient = round(0 * scale)
257
- leye_pose = reye_pose = round(0 * scale)
258
- jaw_pose = round(3 * scale)
259
- body_pose = round((63 - 24) * scale)
260
- left_hand_pose = right_hand_pose = round(45 * scale)
261
-
262
- expression = 100
263
-
264
- b_j = 0
265
- jaw_dim = jaw_pose
266
- b_e = b_j + jaw_dim
267
- eye_dim = leye_pose + reye_pose
268
- b_b = b_e + eye_dim
269
- body_dim = global_orient + body_pose
270
- b_h = b_b + body_dim
271
- hand_dim = left_hand_pose + right_hand_pose
272
- b_f = b_h + hand_dim
273
- face_dim = expression
274
-
275
- self.dim_list = [b_j, b_e, b_b, b_h, b_f]
276
- self.full_dim = jaw_dim + eye_dim + body_dim + hand_dim
277
- self.pose = int(self.full_dim / round(3 * scale))
278
- self.each_dim = [jaw_dim, eye_dim + body_dim, hand_dim, face_dim]
279
-
280
- def __call__(self, bat):
281
- assert (not self.args.infer), "infer mode"
282
- self.global_step += 1
283
-
284
- loss_dict = {}
285
-
286
- aud, poses = bat['aud_feat'].to(self.device).to(torch.float32), bat['poses'].to(self.device).to(torch.float32)
287
- expression = bat['expression'].to(self.device).to(torch.float32)
288
- jaw = poses[:, :3, :]
289
- poses = poses[:, self.c_index, :]
290
-
291
- pred = self.generator(in_spec=aud)
292
-
293
- D_loss, D_loss_dict = self.get_loss(
294
- pred_poses=pred.detach(),
295
- gt_poses=poses,
296
- aud=aud,
297
- mode='training_D',
298
- )
299
-
300
- self.discriminator_optimizer.zero_grad()
301
- D_loss.backward()
302
- self.discriminator_optimizer.step()
303
-
304
- G_loss, G_loss_dict = self.get_loss(
305
- pred_poses=pred,
306
- gt_poses=poses,
307
- aud=aud,
308
- expression=expression,
309
- jaw=jaw,
310
- mode='training_G',
311
- )
312
- self.generator_optimizer.zero_grad()
313
- G_loss.backward()
314
- self.generator_optimizer.step()
315
-
316
- total_loss = None
317
- loss_dict = {}
318
- for key in list(D_loss_dict.keys()) + list(G_loss_dict.keys()):
319
- loss_dict[key] = G_loss_dict.get(key, 0) + D_loss_dict.get(key, 0)
320
-
321
- return total_loss, loss_dict
322
-
323
- def get_loss(self,
324
- pred_poses,
325
- gt_poses,
326
- aud=None,
327
- jaw=None,
328
- expression=None,
329
- mode='training_G',
330
- ):
331
- loss_dict = {}
332
- aud = aud.transpose(1, 2)
333
- gt_poses = gt_poses.transpose(1, 2)
334
- gt_aud = torch.cat([gt_poses, aud], dim=2)
335
- pred_aud = torch.cat([pred_poses[:, :, 103:], aud], dim=2)
336
-
337
- if mode == 'training_D':
338
- dis_real = self.discriminator(gt_aud)
339
- dis_fake = self.discriminator(pred_aud)
340
- dis_error = self.MSELoss(torch.ones_like(dis_real).to(self.device), dis_real) + self.MSELoss(
341
- torch.zeros_like(dis_fake).to(self.device), dis_fake)
342
- loss_dict['dis'] = dis_error
343
-
344
- return dis_error, loss_dict
345
- elif mode == 'training_G':
346
- jaw_loss = self.L1Loss(pred_poses[:, :, :3], jaw.transpose(1, 2))
347
- face_loss = self.MSELoss(pred_poses[:, :, 3:103], expression.transpose(1, 2))
348
- body_loss = self.L1Loss(pred_poses[:, :, 103:142], gt_poses[:, :, :39])
349
- hand_loss = self.L1Loss(pred_poses[:, :, 142:], gt_poses[:, :, 39:])
350
- l1_loss = jaw_loss + face_loss + body_loss + hand_loss
351
-
352
- dis_output = self.discriminator(pred_aud)
353
- gen_error = self.MSELoss(torch.ones_like(dis_output).to(self.device), dis_output)
354
- gen_loss = self.config.Train.weights.keypoint_loss_weight * l1_loss + self.config.Train.weights.gan_loss_weight * gen_error
355
-
356
- loss_dict['gen'] = gen_error
357
- loss_dict['jaw_loss'] = jaw_loss
358
- loss_dict['face_loss'] = face_loss
359
- loss_dict['body_loss'] = body_loss
360
- loss_dict['hand_loss'] = hand_loss
361
- return gen_loss, loss_dict
362
- else:
363
- raise ValueError(mode)
364
-
365
- def infer_on_audio(self, aud_fn, fps=30, initial_pose=None, norm_stats=None, id=None, B=1, **kwargs):
366
- output = []
367
- assert self.args.infer, "train mode"
368
- self.generator.eval()
369
-
370
- if self.config.Data.pose.normalization:
371
- assert norm_stats is not None
372
- data_mean = norm_stats[0]
373
- data_std = norm_stats[1]
374
-
375
- pre_length = self.config.Data.pose.pre_pose_length
376
- generate_length = self.config.Data.pose.generate_length
377
- # assert pre_length == initial_pose.shape[-1]
378
- # pre_poses = initial_pose.permute(0, 2, 1).to(self.device).to(torch.float32)
379
- # B = pre_poses.shape[0]
380
-
381
- aud_feat = get_mfcc_ta(aud_fn, sr=22000, fps=fps, smlpx=True, type='mfcc').transpose(1, 0)
382
- num_poses_to_generate = aud_feat.shape[-1]
383
- aud_feat = aud_feat[np.newaxis, ...].repeat(B, axis=0)
384
- aud_feat = torch.tensor(aud_feat, dtype=torch.float32).to(self.device)
385
-
386
- with torch.no_grad():
387
- pred_poses = self.generator(aud_feat)
388
- pred_poses = pred_poses.cpu().numpy()
389
- output = pred_poses.squeeze()
390
-
391
- return output
392
-
393
- def generate(self, aud, id):
394
- self.generator.eval()
395
- pred_poses = self.generator(aud)
396
- return pred_poses
397
-
398
-
399
- if __name__ == '__main__':
400
- from trainer.options import parse_args
401
-
402
- parser = parse_args()
403
- args = parser.parse_args(
404
- ['--exp_name', '0', '--data_root', '0', '--speakers', '0', '--pre_pose_length', '4', '--generate_length', '64',
405
- '--infer'])
406
-
407
- generator = TrainWrapper(args)
408
-
409
- aud_fn = '../sample_audio/jon.wav'
410
- initial_pose = torch.randn(64, 108, 4)
411
- norm_stats = (np.random.randn(108), np.random.randn(108))
412
- output = generator.infer_on_audio(aud_fn, initial_pose, norm_stats)
413
-
414
- print(output.shape)
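As a side note, `Conv1d_tf` above reproduces TensorFlow's "same" padding, so the temporal length of its output is always `ceil(T / stride)` regardless of kernel size; a minimal sketch (assuming `nets/LS3DCG.py` and the repo's dependencies are importable) that checks this:

```python
import torch
from nets.LS3DCG import Conv1d_tf

conv = Conv1d_tf(64, 128, kernel_size=4, stride=2)   # no explicit padding -> TF-style "same"
x = torch.randn(2, 64, 75)
y = conv(x)
print(y.shape)   # torch.Size([2, 128, 38]), i.e. ceil(75 / 2) frames
```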
 
nets/__init__.py DELETED
@@ -1,8 +0,0 @@
1
- from .smplx_face import TrainWrapper as s2g_face
2
- from .smplx_body_vq import TrainWrapper as s2g_body_vq
3
- from .smplx_body_pixel import TrainWrapper as s2g_body_pixel
4
- from .body_ae import TrainWrapper as s2g_body_ae
5
- from .LS3DCG import TrainWrapper as LS3DCG
6
- from .base import TrainWrapperBaseClass
7
-
8
- from .utils import normalize, denormalize
 
nets/base.py DELETED
@@ -1,89 +0,0 @@
1
- import torch
2
- import torch.nn as nn
3
- import torch.optim as optim
4
-
5
- class TrainWrapperBaseClass():
6
- def __init__(self, args, config) -> None:
7
- self.init_optimizer()
8
-
9
- def init_optimizer(self) -> None:
10
- print('using Adam')
11
- self.generator_optimizer = optim.Adam(
12
- self.generator.parameters(),
13
- lr = self.config.Train.learning_rate.generator_learning_rate,
14
- betas=[0.9, 0.999]
15
- )
16
- if self.discriminator is not None:
17
- self.discriminator_optimizer = optim.Adam(
18
- self.discriminator.parameters(),
19
- lr = self.config.Train.learning_rate.discriminator_learning_rate,
20
- betas=[0.9, 0.999]
21
- )
22
-
23
- def __call__(self, bat):
24
- raise NotImplementedError
25
-
26
- def get_loss(self, **kwargs):
27
- raise NotImplementedError
28
-
29
- def state_dict(self):
30
- model_state = {
31
- 'generator': self.generator.state_dict(),
32
- 'generator_optim': self.generator_optimizer.state_dict(),
33
- 'discriminator': self.discriminator.state_dict() if self.discriminator is not None else None,
34
- 'discriminator_optim': self.discriminator_optimizer.state_dict() if self.discriminator is not None else None
35
- }
36
- return model_state
37
-
38
- def parameters(self):
39
- return self.generator.parameters()
40
-
41
- def load_state_dict(self, state_dict):
42
- if 'generator' in state_dict:
43
- self.generator.load_state_dict(state_dict['generator'])
44
- else:
45
- self.generator.load_state_dict(state_dict)
46
-
47
- if 'generator_optim' in state_dict and self.generator_optimizer is not None:
48
- self.generator_optimizer.load_state_dict(state_dict['generator_optim'])
49
-
50
- if self.discriminator is not None:
51
- self.discriminator.load_state_dict(state_dict['discriminator'])
52
-
53
- if 'discriminator_optim' in state_dict and self.discriminator_optimizer is not None:
54
- self.discriminator_optimizer.load_state_dict(state_dict['discriminator_optim'])
55
-
56
- def infer_on_audio(self, aud_fn, initial_pose=None, norm_stats=None, **kwargs):
57
- raise NotImplementedError
58
-
59
- def init_params(self):
60
- if self.config.Data.pose.convert_to_6d:
61
- scale = 2
62
- else:
63
- scale = 1
64
-
65
- global_orient = round(0 * scale)
66
- leye_pose = reye_pose = round(0 * scale)
67
- jaw_pose = round(0 * scale)
68
- body_pose = round((63 - 24) * scale)
69
- left_hand_pose = right_hand_pose = round(45 * scale)
70
- if self.expression:
71
- expression = 100
72
- else:
73
- expression = 0
74
-
75
- b_j = 0
76
- jaw_dim = jaw_pose
77
- b_e = b_j + jaw_dim
78
- eye_dim = leye_pose + reye_pose
79
- b_b = b_e + eye_dim
80
- body_dim = global_orient + body_pose
81
- b_h = b_b + body_dim
82
- hand_dim = left_hand_pose + right_hand_pose
83
- b_f = b_h + hand_dim
84
- face_dim = expression
85
-
86
- self.dim_list = [b_j, b_e, b_b, b_h, b_f]
87
- self.full_dim = jaw_dim + eye_dim + body_dim + hand_dim
88
- self.pose = int(self.full_dim / round(3 * scale))
89
- self.each_dim = [jaw_dim, eye_dim + body_dim, hand_dim, face_dim]
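For orientation, a minimal sketch of the contract this base class expects from its subclasses: build `self.generator` (and optionally `self.discriminator`) and store `args`/`config` before calling `super().__init__`, which then creates the Adam optimizers. The subclass below is hypothetical and for illustration only:

```python
import torch.nn as nn
from nets.base import TrainWrapperBaseClass

class MyWrapper(TrainWrapperBaseClass):
    def __init__(self, args, config):
        self.args = args
        self.config = config                     # must expose Train.learning_rate.* fields
        self.generator = nn.Linear(64, 64)       # stand-in for a real generator network
        self.discriminator = None                # no GAN branch in this sketch
        super().__init__(args, config)           # init_optimizer() builds generator_optimizer

    def __call__(self, bat):
        raise NotImplementedError                # per-batch loss computation goes here
```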
 
nets/body_ae.py DELETED
@@ -1,152 +0,0 @@
1
- import os
2
- import sys
3
-
4
- sys.path.append(os.getcwd())
5
-
6
- from nets.base import TrainWrapperBaseClass
7
- from nets.spg.s2glayers import Discriminator as D_S2G
8
- from nets.spg.vqvae_1d import AE as s2g_body
9
- import torch
10
- import torch.optim as optim
11
- import torch.nn.functional as F
12
-
13
- from data_utils.lower_body import c_index, c_index_3d, c_index_6d
14
-
15
-
16
- def separate_aa(aa):
17
- aa = aa[:, :, :].reshape(aa.shape[0], aa.shape[1], -1, 5)
18
- axis = F.normalize(aa[:, :, :, :3], dim=-1)
19
- angle = F.normalize(aa[:, :, :, 3:5], dim=-1)
20
- return axis, angle
21
-
22
-
23
- class TrainWrapper(TrainWrapperBaseClass):
24
- '''
25
- a wrapper receiving a batch from data_utils and calculating the loss
26
- '''
27
-
28
- def __init__(self, args, config):
29
- self.args = args
30
- self.config = config
31
- self.device = torch.device(self.args.gpu)
32
- self.global_step = 0
33
-
34
- self.gan = False
35
- self.convert_to_6d = self.config.Data.pose.convert_to_6d
36
- self.preleng = self.config.Data.pose.pre_pose_length
37
- self.expression = self.config.Data.pose.expression
38
- self.epoch = 0
39
- self.init_params()
40
- self.num_classes = 4
41
- self.g = s2g_body(self.each_dim[1] + self.each_dim[2], embedding_dim=64, num_embeddings=0,
42
- num_hiddens=1024, num_residual_layers=2, num_residual_hiddens=512).to(self.device)
43
- if self.gan:
44
- self.discriminator = D_S2G(
45
- pose_dim=110 + 64, pose=self.pose
46
- ).to(self.device)
47
- else:
48
- self.discriminator = None
49
-
50
- if self.convert_to_6d:
51
- self.c_index = c_index_6d
52
- else:
53
- self.c_index = c_index_3d
54
-
55
- super().__init__(args, config)
56
-
57
- def init_optimizer(self):
58
-
59
- self.g_optimizer = optim.Adam(
60
- self.g.parameters(),
61
- lr=self.config.Train.learning_rate.generator_learning_rate,
62
- betas=[0.9, 0.999]
63
- )
64
-
65
- def state_dict(self):
66
- model_state = {
67
- 'g': self.g.state_dict(),
68
- 'g_optim': self.g_optimizer.state_dict(),
69
- 'discriminator': self.discriminator.state_dict() if self.discriminator is not None else None,
70
- 'discriminator_optim': self.discriminator_optimizer.state_dict() if self.discriminator is not None else None
71
- }
72
- return model_state
73
-
74
-
75
- def __call__(self, bat):
76
- # assert (not self.args.infer), "infer mode"
77
- self.global_step += 1
78
-
79
- total_loss = None
80
- loss_dict = {}
81
-
82
- aud, poses = bat['aud_feat'].to(self.device).to(torch.float32), bat['poses'].to(self.device).to(torch.float32)
83
-
84
- # id = bat['speaker'].to(self.device) - 20
85
- # id = F.one_hot(id, self.num_classes)
86
-
87
- poses = poses[:, self.c_index, :]
88
- gt_poses = poses[:, :, self.preleng:].permute(0, 2, 1)
89
-
90
- loss = 0
91
- loss_dict, loss = self.vq_train(gt_poses[:, :], 'g', self.g, loss_dict, loss)
92
-
93
- return total_loss, loss_dict
94
-
95
- def vq_train(self, gt, name, model, dict, total_loss, pre=None):
96
- x_recon = model(gt_poses=gt, pre_state=pre)
97
- loss, loss_dict = self.get_loss(pred_poses=x_recon, gt_poses=gt, pre=pre)
98
- # total_loss = total_loss + loss
99
-
100
- if name == 'g':
101
- optimizer_name = 'g_optimizer'
102
-
103
- optimizer = getattr(self, optimizer_name)
104
- optimizer.zero_grad()
105
- loss.backward()
106
- optimizer.step()
107
-
108
- for key in list(loss_dict.keys()):
109
- dict[name + key] = loss_dict.get(key, 0).item()
110
- return dict, total_loss
111
-
112
- def get_loss(self,
113
- pred_poses,
114
- gt_poses,
115
- pre=None
116
- ):
117
- loss_dict = {}
118
-
119
-
120
- rec_loss = torch.mean(torch.abs(pred_poses - gt_poses))
121
- v_pr = pred_poses[:, 1:] - pred_poses[:, :-1]
122
- v_gt = gt_poses[:, 1:] - gt_poses[:, :-1]
123
- velocity_loss = torch.mean(torch.abs(v_pr - v_gt))
124
-
125
- if pre is None:
126
- f0_vel = 0
127
- else:
128
- v0_pr = pred_poses[:, 0] - pre[:, -1]
129
- v0_gt = gt_poses[:, 0] - pre[:, -1]
130
- f0_vel = torch.mean(torch.abs(v0_pr - v0_gt))
131
-
132
- gen_loss = rec_loss + velocity_loss + f0_vel
133
-
134
- loss_dict['rec_loss'] = rec_loss
135
- loss_dict['velocity_loss'] = velocity_loss
136
- # loss_dict['e_q_loss'] = e_q_loss
137
- if pre is not None:
138
- loss_dict['f0_vel'] = f0_vel
139
-
140
- return gen_loss, loss_dict
141
-
142
- def load_state_dict(self, state_dict):
143
- self.g.load_state_dict(state_dict['g'])
144
-
145
- def extract(self, x):
146
- self.g.eval()
147
- if x.shape[2] > self.full_dim:
148
- if x.shape[2] == 239:
149
- x = x[:, :, 102:]
150
- x = x[:, :, self.c_index]
151
- feat = self.g.encode(x)
152
- return feat.transpose(1, 2), x
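The objective in `get_loss` above is plain L1 on the poses plus L1 on frame-to-frame velocities (and, when a pre-pose window is given, on the first-frame velocity); a self-contained sketch of the same computation on dummy tensors:

```python
import torch

pred = torch.randn(4, 88, 128)   # (B, T, pose_dim)
gt = torch.randn(4, 88, 128)

rec_loss = (pred - gt).abs().mean()
v_pred = pred[:, 1:] - pred[:, :-1]
v_gt = gt[:, 1:] - gt[:, :-1]
velocity_loss = (v_pred - v_gt).abs().mean()

loss = rec_loss + velocity_loss
print(loss.item())
```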
 
nets/init_model.py DELETED
@@ -1,35 +0,0 @@
1
- from nets import *
2
-
3
-
4
- def init_model(model_name, args, config):
5
-
6
- if model_name == 's2g_face':
7
- generator = s2g_face(
8
- args,
9
- config,
10
- )
11
- elif model_name == 's2g_body_vq':
12
- generator = s2g_body_vq(
13
- args,
14
- config,
15
- )
16
- elif model_name == 's2g_body_pixel':
17
- generator = s2g_body_pixel(
18
- args,
19
- config,
20
- )
21
- elif model_name == 's2g_body_ae':
22
- generator = s2g_body_ae(
23
- args,
24
- config,
25
- )
26
- elif model_name == 's2g_LS3DCG':
27
- generator = LS3DCG(
28
- args,
29
- config,
30
- )
31
- else:
32
- raise ValueError(model_name)
33
- return generator
34
-
35
-
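A usage sketch of the factory above, wrapped in a helper so it stays self-contained; `args` and `config` are assumed to be the parsed command-line options and the loaded experiment config used throughout the repo:

```python
from nets.init_model import init_model

def build_generators(args, config):
    """Build the body/hand and face generators from the same args/config objects."""
    body_generator = init_model('s2g_body_pixel', args, config)
    face_generator = init_model('s2g_face', args, config)
    return body_generator, face_generator
```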
 
nets/layers.py DELETED
@@ -1,1052 +0,0 @@
1
- import os
2
- import sys
3
-
4
- sys.path.append(os.getcwd())
5
-
6
- import torch
7
- import torch.nn as nn
8
- import numpy as np
9
-
10
-
11
- # TODO: be aware of the actual network structures
12
-
13
- def get_log(x):
14
- log = 0
15
- while x > 1:
16
- if x % 2 == 0:
17
- x = x // 2
18
- log += 1
19
- else:
20
- raise ValueError('x is not a power of 2')
21
-
22
- return log
23
-
24
-
25
- class ConvNormRelu(nn.Module):
26
- '''
27
- (B,C_in,H,W) -> (B, C_out, H, W)
28
- some kernel/stride combinations make the output length differ from H/s
29
- # TODO: there might be some problems with the residual path
30
- '''
31
-
32
- def __init__(self,
33
- in_channels,
34
- out_channels,
35
- type='1d',
36
- leaky=False,
37
- downsample=False,
38
- kernel_size=None,
39
- stride=None,
40
- padding=None,
41
- p=0,
42
- groups=1,
43
- residual=False,
44
- norm='bn'):
45
- '''
46
- conv-bn-relu
47
- '''
48
- super(ConvNormRelu, self).__init__()
49
- self.residual = residual
50
- self.norm_type = norm
51
- # kernel_size = k
52
- # stride = s
53
-
54
- if kernel_size is None and stride is None:
55
- if not downsample:
56
- kernel_size = 3
57
- stride = 1
58
- else:
59
- kernel_size = 4
60
- stride = 2
61
-
62
- if padding is None:
63
- if isinstance(kernel_size, int) and isinstance(stride, tuple):
64
- padding = tuple(int((kernel_size - st) / 2) for st in stride)
65
- elif isinstance(kernel_size, tuple) and isinstance(stride, int):
66
- padding = tuple(int((ks - stride) / 2) for ks in kernel_size)
67
- elif isinstance(kernel_size, tuple) and isinstance(stride, tuple):
68
- padding = tuple(int((ks - st) / 2) for ks, st in zip(kernel_size, stride))
69
- else:
70
- padding = int((kernel_size - stride) / 2)
71
-
72
- if self.residual:
73
- if downsample:
74
- if type == '1d':
75
- self.residual_layer = nn.Sequential(
76
- nn.Conv1d(
77
- in_channels=in_channels,
78
- out_channels=out_channels,
79
- kernel_size=kernel_size,
80
- stride=stride,
81
- padding=padding
82
- )
83
- )
84
- elif type == '2d':
85
- self.residual_layer = nn.Sequential(
86
- nn.Conv2d(
87
- in_channels=in_channels,
88
- out_channels=out_channels,
89
- kernel_size=kernel_size,
90
- stride=stride,
91
- padding=padding
92
- )
93
- )
94
- else:
95
- if in_channels == out_channels:
96
- self.residual_layer = nn.Identity()
97
- else:
98
- if type == '1d':
99
- self.residual_layer = nn.Sequential(
100
- nn.Conv1d(
101
- in_channels=in_channels,
102
- out_channels=out_channels,
103
- kernel_size=kernel_size,
104
- stride=stride,
105
- padding=padding
106
- )
107
- )
108
- elif type == '2d':
109
- self.residual_layer = nn.Sequential(
110
- nn.Conv2d(
111
- in_channels=in_channels,
112
- out_channels=out_channels,
113
- kernel_size=kernel_size,
114
- stride=stride,
115
- padding=padding
116
- )
117
- )
118
-
119
- in_channels = in_channels * groups
120
- out_channels = out_channels * groups
121
- if type == '1d':
122
- self.conv = nn.Conv1d(in_channels=in_channels, out_channels=out_channels,
123
- kernel_size=kernel_size, stride=stride, padding=padding,
124
- groups=groups)
125
- self.norm = nn.BatchNorm1d(out_channels)
126
- self.dropout = nn.Dropout(p=p)
127
- elif type == '2d':
128
- self.conv = nn.Conv2d(in_channels=in_channels, out_channels=out_channels,
129
- kernel_size=kernel_size, stride=stride, padding=padding,
130
- groups=groups)
131
- self.norm = nn.BatchNorm2d(out_channels)
132
- self.dropout = nn.Dropout2d(p=p)
133
- if norm == 'gn':
134
- self.norm = nn.GroupNorm(2, out_channels)
135
- elif norm == 'ln':
136
- self.norm = nn.LayerNorm(out_channels)
137
- if leaky:
138
- self.relu = nn.LeakyReLU(negative_slope=0.2)
139
- else:
140
- self.relu = nn.ReLU()
141
-
142
- def forward(self, x, **kwargs):
143
- if self.norm_type == 'ln':
144
- out = self.dropout(self.conv(x))
145
- out = self.norm(out.transpose(1,2)).transpose(1,2)
146
- else:
147
- out = self.norm(self.dropout(self.conv(x)))
148
- if self.residual:
149
- residual = self.residual_layer(x)
150
- out += residual
151
- return self.relu(out)
152
-
153
-
154
- class UNet1D(nn.Module):
155
- def __init__(self,
156
- input_channels,
157
- output_channels,
158
- max_depth=5,
159
- kernel_size=None,
160
- stride=None,
161
- p=0,
162
- groups=1):
163
- super(UNet1D, self).__init__()
164
- self.pre_downsampling_conv = nn.ModuleList([])
165
- self.conv1 = nn.ModuleList([])
166
- self.conv2 = nn.ModuleList([])
167
- self.upconv = nn.Upsample(scale_factor=2, mode='nearest')
168
- self.max_depth = max_depth
169
- self.groups = groups
170
-
171
- self.pre_downsampling_conv.append(ConvNormRelu(input_channels, output_channels,
172
- type='1d', leaky=True, downsample=False,
173
- kernel_size=kernel_size, stride=stride, p=p, groups=groups))
174
- self.pre_downsampling_conv.append(ConvNormRelu(output_channels, output_channels,
175
- type='1d', leaky=True, downsample=False,
176
- kernel_size=kernel_size, stride=stride, p=p, groups=groups))
177
-
178
- for i in range(self.max_depth):
179
- self.conv1.append(ConvNormRelu(output_channels, output_channels,
180
- type='1d', leaky=True, downsample=True,
181
- kernel_size=kernel_size, stride=stride, p=p, groups=groups))
182
-
183
- for i in range(self.max_depth):
184
- self.conv2.append(ConvNormRelu(output_channels, output_channels,
185
- type='1d', leaky=True, downsample=False,
186
- kernel_size=kernel_size, stride=stride, p=p, groups=groups))
187
-
188
- def forward(self, x):
189
-
190
- input_size = x.shape[-1]
191
-
192
- assert get_log(
193
- input_size) >= self.max_depth, 'num_frames must be a power of 2 and its log2 must be at least max_depth'
194
-
195
- x = nn.Sequential(*self.pre_downsampling_conv)(x)
196
-
197
- residuals = []
198
- residuals.append(x)
199
- for i, conv1 in enumerate(self.conv1):
200
- x = conv1(x)
201
- if i < self.max_depth - 1:
202
- residuals.append(x)
203
-
204
- for i, conv2 in enumerate(self.conv2):
205
- x = self.upconv(x) + residuals[self.max_depth - i - 1]
206
- x = conv2(x)
207
-
208
- return x
209
-
210
-
211
- class UNet2D(nn.Module):
212
- def __init__(self):
213
- super(UNet2D, self).__init__()
214
- raise NotImplementedError('2D UNet is weird')
215
-
216
-
217
- class AudioPoseEncoder1D(nn.Module):
218
- '''
219
- (B, C, T) -> (B, C*2, T) -> ... -> (B, C_out, T)
220
- '''
221
-
222
- def __init__(self,
223
- C_in,
224
- C_out,
225
- kernel_size=None,
226
- stride=None,
227
- min_layer_nums=None
228
- ):
229
- super(AudioPoseEncoder1D, self).__init__()
230
- self.C_in = C_in
231
- self.C_out = C_out
232
-
233
- conv_layers = nn.ModuleList([])
234
- cur_C = C_in
235
- num_layers = 0
236
- while cur_C < self.C_out:
237
- conv_layers.append(ConvNormRelu(
238
- in_channels=cur_C,
239
- out_channels=cur_C * 2,
240
- kernel_size=kernel_size,
241
- stride=stride
242
- ))
243
- cur_C *= 2
244
- num_layers += 1
245
-
246
- if (cur_C != C_out) or (min_layer_nums is not None and num_layers < min_layer_nums):
247
- while (cur_C != C_out) or num_layers < min_layer_nums:
248
- conv_layers.append(ConvNormRelu(
249
- in_channels=cur_C,
250
- out_channels=C_out,
251
- kernel_size=kernel_size,
252
- stride=stride
253
- ))
254
- num_layers += 1
255
- cur_C = C_out
256
-
257
- self.conv_layers = nn.Sequential(*conv_layers)
258
-
259
- def forward(self, x):
260
- '''
261
- x: (B, C, T)
262
- '''
263
- x = self.conv_layers(x)
264
- return x
265
-
266
-
267
- class AudioPoseEncoder2D(nn.Module):
268
- '''
269
- (B, C, T) -> (B, 1, C, T) -> ... -> (B, C_out, T)
270
- '''
271
-
272
- def __init__(self):
273
- raise NotImplementedError
274
-
275
-
276
- class AudioPoseEncoderRNN(nn.Module):
277
- '''
278
- (B, C, T)->(B, T, C)->(B, T, C_out)->(B, C_out, T)
279
- '''
280
-
281
- def __init__(self,
282
- C_in,
283
- hidden_size,
284
- num_layers,
285
- rnn_cell='gru',
286
- bidirectional=False
287
- ):
288
- super(AudioPoseEncoderRNN, self).__init__()
289
- if rnn_cell == 'gru':
290
- self.cell = nn.GRU(input_size=C_in, hidden_size=hidden_size, num_layers=num_layers, batch_first=True,
291
- bidirectional=bidirectional)
292
- elif rnn_cell == 'lstm':
293
- self.cell = nn.LSTM(input_size=C_in, hidden_size=hidden_size, num_layers=num_layers, batch_first=True,
294
- bidirectional=bidirectional)
295
- else:
296
- raise ValueError('invalid rnn cell:%s' % (rnn_cell))
297
-
298
- def forward(self, x, state=None):
299
-
300
- x = x.permute(0, 2, 1)
301
- x, state = self.cell(x, state)
302
- x = x.permute(0, 2, 1)
303
-
304
- return x
305
-
306
-
307
- class AudioPoseEncoderGraph(nn.Module):
308
- '''
309
- (B, C, T)->(B, 2, V, T)->(B, 2, T, V)->(B, D, T, V)
310
- '''
311
-
312
- def __init__(self,
313
- layers_config, # expected to be a list of (C_in, C_out, kernel_size) tuples
314
- A, # adjacent matrix (num_parts, V, V)
315
- residual,
316
- local_bn=False,
317
- share_weights=False
318
- ) -> None:
319
- super().__init__()
320
- self.A = A
321
- self.num_joints = A.shape[1]
322
- self.num_parts = A.shape[0]
323
- self.C_in = layers_config[0][0]
324
- self.C_out = layers_config[-1][1]
325
-
326
- self.conv_layers = nn.ModuleList([
327
- GraphConvNormRelu(
328
- C_in=c_in,
329
- C_out=c_out,
330
- A=self.A,
331
- residual=residual,
332
- local_bn=local_bn,
333
- kernel_size=k,
334
- share_weights=share_weights
335
- ) for (c_in, c_out, k) in layers_config
336
- ])
337
-
338
- self.conv_layers = nn.Sequential(*self.conv_layers)
339
-
340
- def forward(self, x):
341
- '''
342
- x: (B, C, T), C should be num_joints*D
343
- output: (B, D, T, V)
344
- '''
345
- B, C, T = x.shape
346
- x = x.view(B, self.num_joints, self.C_in, T) # (B, V, D, T); D is the per-joint feature dim; note V comes before D here
347
- x = x.permute(0, 2, 3, 1) # (B, D, T, V)
348
- assert x.shape[1] == self.C_in
349
-
350
- x_conved = self.conv_layers(x)
351
-
352
- # x_conved = x_conved.permute(0, 3, 1, 2).contiguous().view(B, self.C_out*self.num_joints, T)#(B, V*C_out, T)
353
-
354
- return x_conved
355
-
356
-
357
- class SeqEncoder2D(nn.Module):
358
- '''
359
- seq_encoder, encoding a seq to a vector
360
- (B, C, T)->(B, 2, V, T)->(B, 2, T, V) -> (B, 32, )->...->(B, C_out)
361
- '''
362
-
363
- def __init__(self,
364
- C_in, # should be 2
365
- T_in,
366
- C_out,
367
- num_joints,
368
- min_layer_num=None,
369
- residual=False
370
- ):
371
- super(SeqEncoder2D, self).__init__()
372
- self.C_in = C_in
373
- self.C_out = C_out
374
- self.T_in = T_in
375
- self.num_joints = num_joints
376
-
377
- conv_layers = nn.ModuleList([])
378
- conv_layers.append(ConvNormRelu(
379
- in_channels=C_in,
380
- out_channels=32,
381
- type='2d',
382
- residual=residual
383
- ))
384
-
385
- cur_C = 32
386
- cur_H = T_in
387
- cur_W = num_joints
388
- num_layers = 1
389
- while (cur_C < C_out) or (cur_H > 1) or (cur_W > 1):
390
- ks = [3, 3]
391
- st = [1, 1]
392
-
393
- if cur_H > 1:
394
- if cur_H > 4:
395
- ks[0] = 4
396
- st[0] = 2
397
- else:
398
- ks[0] = cur_H
399
- st[0] = cur_H
400
- if cur_W > 1:
401
- if cur_W > 4:
402
- ks[1] = 4
403
- st[1] = 2
404
- else:
405
- ks[1] = cur_W
406
- st[1] = cur_W
407
-
408
- conv_layers.append(ConvNormRelu(
409
- in_channels=cur_C,
410
- out_channels=min(C_out, cur_C * 2),
411
- type='2d',
412
- kernel_size=tuple(ks),
413
- stride=tuple(st),
414
- residual=residual
415
- ))
416
- cur_C = min(cur_C * 2, C_out)
417
- if cur_H > 1:
418
- if cur_H > 4:
419
- cur_H //= 2
420
- else:
421
- cur_H = 1
422
- if cur_W > 1:
423
- if cur_W > 4:
424
- cur_W //= 2
425
- else:
426
- cur_W = 1
427
- num_layers += 1
428
-
429
- if min_layer_num is not None and (num_layers < min_layer_num):
430
- while num_layers < min_layer_num:
431
- conv_layers.append(ConvNormRelu(
432
- in_channels=C_out,
433
- out_channels=C_out,
434
- type='2d',
435
- kernel_size=1,
436
- stride=1,
437
- residual=residual
438
- ))
439
- num_layers += 1
440
-
441
- self.conv_layers = nn.Sequential(*conv_layers)
442
- self.num_layers = num_layers
443
-
444
- def forward(self, x):
445
- B, C, T = x.shape
446
- x = x.view(B, self.num_joints, self.C_in, T) # (B, V, D, T) V in front
447
- x = x.permute(0, 2, 3, 1) # (B, D, T, V)
448
- assert x.shape[1] == self.C_in and x.shape[-1] == self.num_joints
449
-
450
- x = self.conv_layers(x)
451
- return x.squeeze()
452
-
453
-
454
- class SeqEncoder1D(nn.Module):
455
- '''
456
- (B, C, T)->(B, D)
457
- '''
458
-
459
- def __init__(self,
460
- C_in,
461
- C_out,
462
- T_in,
463
- min_layer_nums=None
464
- ):
465
- super(SeqEncoder1D, self).__init__()
466
- conv_layers = nn.ModuleList([])
467
- cur_C = C_in
468
- cur_T = T_in
469
- self.num_layers = 0
470
- while (cur_C < C_out) or (cur_T > 1):
471
- ks = 3
472
- st = 1
473
- if cur_T > 1:
474
- if cur_T > 4:
475
- ks = 4
476
- st = 2
477
- else:
478
- ks = cur_T
479
- st = cur_T
480
-
481
- conv_layers.append(ConvNormRelu(
482
- in_channels=cur_C,
483
- out_channels=min(C_out, cur_C * 2),
484
- type='1d',
485
- kernel_size=ks,
486
- stride=st
487
- ))
488
- cur_C = min(cur_C * 2, C_out)
489
- if cur_T > 1:
490
- if cur_T > 4:
491
- cur_T = cur_T // 2
492
- else:
493
- cur_T = 1
494
- self.num_layers += 1
495
-
496
- if min_layer_nums is not None and (self.num_layers < min_layer_nums):
497
- while self.num_layers < min_layer_nums:
498
- conv_layers.append(ConvNormRelu(
499
- in_channels=C_out,
500
- out_channels=C_out,
501
- type='1d',
502
- kernel_size=1,
503
- stride=1
504
- ))
505
- self.num_layers += 1
506
- self.conv_layers = nn.Sequential(*conv_layers)
507
-
508
- def forward(self, x):
509
- x = self.conv_layers(x)
510
- return x.squeeze()
511
-
512
-
513
- class SeqEncoderRNN(nn.Module):
514
- '''
515
- (B, C, T) -> (B, T, C) -> (B, D)
516
- LSTM/GRU-FC
517
- '''
518
-
519
- def __init__(self,
520
- hidden_size,
521
- in_size,
522
- num_rnn_layers,
523
- rnn_cell='gru',
524
- bidirectional=False
525
- ):
526
- super(SeqEncoderRNN, self).__init__()
527
- self.hidden_size = hidden_size
528
- self.in_size = in_size
529
- self.num_rnn_layers = num_rnn_layers
530
- self.bidirectional = bidirectional
531
-
532
- if rnn_cell == 'gru':
533
- self.cell = nn.GRU(input_size=self.in_size, hidden_size=self.hidden_size, num_layers=self.num_rnn_layers,
534
- batch_first=True, bidirectional=bidirectional)
535
- elif rnn_cell == 'lstm':
536
- self.cell = nn.LSTM(input_size=self.in_size, hidden_size=self.hidden_size, num_layers=self.num_rnn_layers,
537
- batch_first=True, bidirectional=bidirectional)
538
-
539
- def forward(self, x, state=None):
540
-
541
- x = x.permute(0, 2, 1)
542
- B, T, C = x.shape
543
- x, _ = self.cell(x, state)
544
- if self.bidirectional:
545
- out = torch.cat([x[:, -1, :self.hidden_size], x[:, 0, self.hidden_size:]], dim=-1)
546
- else:
547
- out = x[:, -1, :]
548
- assert out.shape[0] == B
549
- return out
550
-
551
-
552
- class SeqEncoderGraph(nn.Module):
553
- '''
554
- '''
555
-
556
- def __init__(self,
557
- embedding_size,
558
- layer_configs,
559
- residual,
560
- local_bn,
561
- A,
562
- T,
563
- share_weights=False
564
- ) -> None:
565
- super().__init__()
566
-
567
- self.C_in = layer_configs[0][0]
568
- self.C_out = embedding_size
569
-
570
- self.num_joints = A.shape[1]
571
-
572
- self.graph_encoder = AudioPoseEncoderGraph(
573
- layers_config=layer_configs,
574
- A=A,
575
- residual=residual,
576
- local_bn=local_bn,
577
- share_weights=share_weights
578
- )
579
-
580
- cur_C = layer_configs[-1][1]
581
- self.spatial_pool = ConvNormRelu(
582
- in_channels=cur_C,
583
- out_channels=cur_C,
584
- type='2d',
585
- kernel_size=(1, self.num_joints),
586
- stride=(1, 1),
587
- padding=(0, 0)
588
- )
589
-
590
- temporal_pool = nn.ModuleList([])
591
- cur_H = T
592
- num_layers = 0
593
- self.temporal_conv_info = []
594
- while cur_C < self.C_out or cur_H > 1:
595
- self.temporal_conv_info.append(cur_C)
596
- ks = [3, 1]
597
- st = [1, 1]
598
-
599
- if cur_H > 1:
600
- if cur_H > 4:
601
- ks[0] = 4
602
- st[0] = 2
603
- else:
604
- ks[0] = cur_H
605
- st[0] = cur_H
606
-
607
- temporal_pool.append(ConvNormRelu(
608
- in_channels=cur_C,
609
- out_channels=min(self.C_out, cur_C * 2),
610
- type='2d',
611
- kernel_size=tuple(ks),
612
- stride=tuple(st)
613
- ))
614
- cur_C = min(cur_C * 2, self.C_out)
615
-
616
- if cur_H > 1:
617
- if cur_H > 4:
618
- cur_H //= 2
619
- else:
620
- cur_H = 1
621
-
622
- num_layers += 1
623
-
624
- self.temporal_pool = nn.Sequential(*temporal_pool)
625
- print("graph seq encoder info: temporal pool:", self.temporal_conv_info)
626
- self.num_layers = num_layers
627
- # need fc?
628
-
629
- def forward(self, x):
630
- '''
631
- x: (B, C, T)
632
- '''
633
- B, C, T = x.shape
634
- x = self.graph_encoder(x)
635
- x = self.spatial_pool(x)
636
- x = self.temporal_pool(x)
637
- x = x.view(B, self.C_out)
638
-
639
- return x
640
-
641
-
642
- class SeqDecoder2D(nn.Module):
643
- '''
644
- (B, D)->(B, D, 1, 1)->(B, C_out, C, T)->(B, C_out, T)
645
- '''
646
-
647
- def __init__(self):
648
- super(SeqDecoder2D, self).__init__()
649
- raise NotImplementedError
650
-
651
-
652
- class SeqDecoder1D(nn.Module):
653
- '''
654
- (B, D)->(B, D, 1)->...->(B, C_out, T)
655
- '''
656
-
657
- def __init__(self,
658
- D_in,
659
- C_out,
660
- T_out,
661
- min_layer_num=None
662
- ):
663
- super(SeqDecoder1D, self).__init__()
664
- self.T_out = T_out
665
- self.min_layer_num = min_layer_num
666
-
667
- cur_t = 1
668
-
669
- self.pre_conv = ConvNormRelu(
670
- in_channels=D_in,
671
- out_channels=C_out,
672
- type='1d'
673
- )
674
- self.num_layers = 1
675
- self.upconv = nn.Upsample(scale_factor=2, mode='nearest')
676
- self.conv_layers = nn.ModuleList([])
677
- cur_t *= 2
678
- while cur_t <= T_out:
679
- self.conv_layers.append(ConvNormRelu(
680
- in_channels=C_out,
681
- out_channels=C_out,
682
- type='1d'
683
- ))
684
- cur_t *= 2
685
- self.num_layers += 1
686
-
687
- post_conv = nn.ModuleList([ConvNormRelu(
688
- in_channels=C_out,
689
- out_channels=C_out,
690
- type='1d'
691
- )])
692
- self.num_layers += 1
693
- if min_layer_num is not None and self.num_layers < min_layer_num:
694
- while self.num_layers < min_layer_num:
695
- post_conv.append(ConvNormRelu(
696
- in_channels=C_out,
697
- out_channels=C_out,
698
- type='1d'
699
- ))
700
- self.num_layers += 1
701
- self.post_conv = nn.Sequential(*post_conv)
702
-
703
- def forward(self, x):
704
-
705
- x = x.unsqueeze(-1)
706
- x = self.pre_conv(x)
707
- for conv in self.conv_layers:
708
- x = self.upconv(x)
709
- x = conv(x)
710
-
711
- x = torch.nn.functional.interpolate(x, size=self.T_out, mode='nearest')
712
- x = self.post_conv(x)
713
- return x
714
-
715
-
716
- class SeqDecoderRNN(nn.Module):
717
- '''
718
- (B, D)->(B, C_out, T)
719
- '''
720
-
721
- def __init__(self,
722
- hidden_size,
723
- C_out,
724
- T_out,
725
- num_layers,
726
- rnn_cell='gru'
727
- ):
728
- super(SeqDecoderRNN, self).__init__()
729
- self.num_steps = T_out
730
- if rnn_cell == 'gru':
731
- self.cell = nn.GRU(input_size=C_out, hidden_size=hidden_size, num_layers=num_layers, batch_first=True,
732
- bidirectional=False)
733
- elif rnn_cell == 'lstm':
734
- self.cell = nn.LSTM(input_size=C_out, hidden_size=hidden_size, num_layers=num_layers, batch_first=True,
735
- bidirectional=False)
736
- else:
737
- raise ValueError('invalid rnn cell:%s' % (rnn_cell))
738
-
739
- self.fc = nn.Linear(hidden_size, C_out)
740
-
741
- def forward(self, hidden, frame_0):
742
- frame_0 = frame_0.permute(0, 2, 1)
743
- dec_input = frame_0
744
- outputs = []
745
- for i in range(self.num_steps):
746
- frame_out, hidden = self.cell(dec_input, hidden)
747
- frame_out = self.fc(frame_out)
748
- dec_input = frame_out
749
- outputs.append(frame_out)
750
- output = torch.cat(outputs, dim=1)
751
- return output.permute(0, 2, 1)
752
-
753
-
754
- class SeqTranslator2D(nn.Module):
755
- '''
756
- (B, C, T)->(B, 1, C, T)-> ... -> (B, 1, C_out, T_out)
757
- '''
758
-
759
- def __init__(self,
760
- C_in=64,
761
- C_out=108,
762
- T_in=75,
763
- T_out=25,
764
- residual=True
765
- ):
766
- super(SeqTranslator2D, self).__init__()
767
- print("Warning: hard coded")
768
- self.C_in = C_in
769
- self.C_out = C_out
770
- self.T_in = T_in
771
- self.T_out = T_out
772
- self.residual = residual
773
-
774
- self.conv_layers = nn.Sequential(
775
- ConvNormRelu(1, 32, '2d', kernel_size=5, stride=1),
776
- ConvNormRelu(32, 32, '2d', kernel_size=5, stride=1, residual=self.residual),
777
- ConvNormRelu(32, 32, '2d', kernel_size=5, stride=1, residual=self.residual),
778
-
779
- ConvNormRelu(32, 64, '2d', kernel_size=5, stride=(4, 3)),
780
- ConvNormRelu(64, 64, '2d', kernel_size=5, stride=1, residual=self.residual),
781
- ConvNormRelu(64, 64, '2d', kernel_size=5, stride=1, residual=self.residual),
782
-
783
- ConvNormRelu(64, 128, '2d', kernel_size=5, stride=(4, 1)),
784
- ConvNormRelu(128, 108, '2d', kernel_size=3, stride=(4, 1)),
785
- ConvNormRelu(108, 108, '2d', kernel_size=(1, 3), stride=1, residual=self.residual),
786
-
787
- ConvNormRelu(108, 108, '2d', kernel_size=(1, 3), stride=1, residual=self.residual),
788
- ConvNormRelu(108, 108, '2d', kernel_size=(1, 3), stride=1),
789
- )
790
-
791
- def forward(self, x):
792
- assert len(x.shape) == 3 and x.shape[1] == self.C_in and x.shape[2] == self.T_in
793
- x = x.view(x.shape[0], 1, x.shape[1], x.shape[2])
794
- x = self.conv_layers(x)
795
- x = x.squeeze(2)
796
- return x
797
-
798
-
799
- class SeqTranslator1D(nn.Module):
800
- '''
801
- (B, C, T)->(B, C_out, T)
802
- '''
803
-
804
- def __init__(self,
805
- C_in,
806
- C_out,
807
- kernel_size=None,
808
- stride=None,
809
- min_layers_num=None,
810
- residual=True,
811
- norm='bn'
812
- ):
813
- super(SeqTranslator1D, self).__init__()
814
-
815
- conv_layers = nn.ModuleList([])
816
- conv_layers.append(ConvNormRelu(
817
- in_channels=C_in,
818
- out_channels=C_out,
819
- type='1d',
820
- kernel_size=kernel_size,
821
- stride=stride,
822
- residual=residual,
823
- norm=norm
824
- ))
825
- self.num_layers = 1
826
- if min_layers_num is not None and self.num_layers < min_layers_num:
827
- while self.num_layers < min_layers_num:
828
- conv_layers.append(ConvNormRelu(
829
- in_channels=C_out,
830
- out_channels=C_out,
831
- type='1d',
832
- kernel_size=kernel_size,
833
- stride=stride,
834
- residual=residual,
835
- norm=norm
836
- ))
837
- self.num_layers += 1
838
- self.conv_layers = nn.Sequential(*conv_layers)
839
-
840
- def forward(self, x):
841
- return self.conv_layers(x)
842
-
843
-
844
- class SeqTranslatorRNN(nn.Module):
845
- '''
846
- (B, C, T)->(B, C_out, T)
847
- LSTM-FC
848
- '''
849
-
850
- def __init__(self,
851
- C_in,
852
- C_out,
853
- hidden_size,
854
- num_layers,
855
- rnn_cell='gru'
856
- ):
857
- super(SeqTranslatorRNN, self).__init__()
858
-
859
- if rnn_cell == 'gru':
860
- self.enc_cell = nn.GRU(input_size=C_in, hidden_size=hidden_size, num_layers=num_layers, batch_first=True,
861
- bidirectional=False)
862
- self.dec_cell = nn.GRU(input_size=C_out, hidden_size=hidden_size, num_layers=num_layers, batch_first=True,
863
- bidirectional=False)
864
- elif rnn_cell == 'lstm':
865
- self.enc_cell = nn.LSTM(input_size=C_in, hidden_size=hidden_size, num_layers=num_layers, batch_first=True,
866
- bidirectional=False)
867
- self.dec_cell = nn.LSTM(input_size=C_out, hidden_size=hidden_size, num_layers=num_layers, batch_first=True,
868
- bidirectional=False)
869
- else:
870
- raise ValueError('invalid rnn cell:%s' % (rnn_cell))
871
-
872
- self.fc = nn.Linear(hidden_size, C_out)
873
-
874
- def forward(self, x, frame_0):
875
-
876
- num_steps = x.shape[-1]
877
- x = x.permute(0, 2, 1)
878
- frame_0 = frame_0.permute(0, 2, 1)
879
- _, hidden = self.enc_cell(x, None)
880
-
881
- outputs = []
882
- for i in range(num_steps):
883
- inputs = frame_0
884
- output_frame, hidden = self.dec_cell(inputs, hidden)
885
- output_frame = self.fc(output_frame)
886
- frame_0 = output_frame
887
- outputs.append(output_frame)
888
- outputs = torch.cat(outputs, dim=1)
889
- return outputs.permute(0, 2, 1)
890
-
891
-
892
- class ResBlock(nn.Module):
893
- def __init__(self,
894
- input_dim,
895
- fc_dim,
896
- afn,
897
- nfn
898
- ):
899
- '''
900
- afn: activation fn
901
- nfn: normalization fn
902
- '''
903
- super(ResBlock, self).__init__()
904
-
905
- self.input_dim = input_dim
906
- self.fc_dim = fc_dim
907
- self.afn = afn
908
- self.nfn = nfn
909
-
910
- if self.afn != 'relu':
911
- raise ValueError('only relu activation is supported')
912
-
913
- if self.nfn == 'layer_norm':
914
- raise ValueError('layer_norm is not supported here')
915
-
916
- self.layers = nn.Sequential(
917
- nn.Linear(self.input_dim, self.fc_dim // 2),
918
- nn.ReLU(),
919
- nn.Linear(self.fc_dim // 2, self.fc_dim // 2),
920
- nn.ReLU(),
921
- nn.Linear(self.fc_dim // 2, self.fc_dim),
922
- nn.ReLU()
923
- )
924
-
925
- self.shortcut_layer = nn.Sequential(
926
- nn.Linear(self.input_dim, self.fc_dim),
927
- nn.ReLU(),
928
- )
929
-
930
- def forward(self, inputs):
931
- return self.layers(inputs) + self.shortcut_layer(inputs)
932
-
933
-
934
- class AudioEncoder(nn.Module):
935
- def __init__(self, channels, padding=3, kernel_size=8, conv_stride=2, conv_pool=None, augmentation=False):
936
- super(AudioEncoder, self).__init__()
937
- self.in_channels = channels[0]
938
- self.augmentation = augmentation
939
-
940
- model = []
941
- acti = nn.LeakyReLU(0.2)
942
-
943
- nr_layer = len(channels) - 1
944
-
945
- for i in range(nr_layer):
946
- if conv_pool is None:
947
- model.append(nn.ReflectionPad1d(padding))
948
- model.append(nn.Conv1d(channels[i], channels[i + 1], kernel_size=kernel_size, stride=conv_stride))
949
- model.append(acti)
950
- else:
951
- model.append(nn.ReflectionPad1d(padding))
952
- model.append(nn.Conv1d(channels[i], channels[i + 1], kernel_size=kernel_size, stride=conv_stride))
953
- model.append(acti)
954
- model.append(conv_pool(kernel_size=2, stride=2))
955
-
956
- if self.augmentation:
957
- model.append(
958
- nn.Conv1d(channels[-1], channels[-1], kernel_size=kernel_size, stride=conv_stride)
959
- )
960
- model.append(acti)
961
-
962
- self.model = nn.Sequential(*model)
963
-
964
- def forward(self, x):
965
-
966
- x = x[:, :self.in_channels, :]
967
- x = self.model(x)
968
- return x
969
-
970
-
971
- class AudioDecoder(nn.Module):
972
- def __init__(self, channels, kernel_size=7, ups=25):
973
- super(AudioDecoder, self).__init__()
974
-
975
- model = []
976
- pad = (kernel_size - 1) // 2
977
- acti = nn.LeakyReLU(0.2)
978
-
979
- for i in range(len(channels) - 2):
980
- model.append(nn.Upsample(scale_factor=2, mode='nearest'))
981
- model.append(nn.ReflectionPad1d(pad))
982
- model.append(nn.Conv1d(channels[i], channels[i + 1],
983
- kernel_size=kernel_size, stride=1))
984
- if i == 0 or i == 1:
985
- model.append(nn.Dropout(p=0.2))
986
- if not i == len(channels) - 2:
987
- model.append(acti)
988
-
989
- model.append(nn.Upsample(size=ups, mode='nearest'))
990
- model.append(nn.ReflectionPad1d(pad))
991
- model.append(nn.Conv1d(channels[-2], channels[-1],
992
- kernel_size=kernel_size, stride=1))
993
-
994
- self.model = nn.Sequential(*model)
995
-
996
- def forward(self, x):
997
- return self.model(x)
998
-
999
-
1000
- class Audio2Pose(nn.Module):
1001
- def __init__(self, pose_dim, embed_size, augmentation, ups=25):
1002
- super(Audio2Pose, self).__init__()
1003
- self.pose_dim = pose_dim
1004
- self.embed_size = embed_size
1005
- self.augmentation = augmentation
1006
-
1007
- self.aud_enc = AudioEncoder(channels=[13, 64, 128, 256], padding=2, kernel_size=7, conv_stride=1,
1008
- conv_pool=nn.AvgPool1d, augmentation=self.augmentation)
1009
- if self.augmentation:
1010
- self.aud_dec = AudioDecoder(channels=[512, 256, 128, pose_dim])
1011
- else:
1012
- self.aud_dec = AudioDecoder(channels=[256, 256, 128, pose_dim], ups=ups)
1013
-
1014
- if self.augmentation:
1015
- self.pose_enc = nn.Sequential(
1016
- nn.Linear(self.embed_size // 2, 256),
1017
- nn.LayerNorm(256)
1018
- )
1019
-
1020
- def forward(self, audio_feat, dec_input=None):
1021
-
1022
- B = audio_feat.shape[0]
1023
-
1024
- aud_embed = self.aud_enc.forward(audio_feat)
1025
-
1026
- if self.augmentation:
1027
- dec_input = dec_input.squeeze(0)
1028
- dec_embed = self.pose_enc(dec_input)
1029
- dec_embed = dec_embed.unsqueeze(2)
1030
- dec_embed = dec_embed.expand(dec_embed.shape[0], dec_embed.shape[1], aud_embed.shape[-1])
1031
- aud_embed = torch.cat([aud_embed, dec_embed], dim=1)
1032
-
1033
- out = self.aud_dec.forward(aud_embed)
1034
- return out
1035
-
1036
-
1037
- if __name__ == '__main__':
1038
- import numpy as np
1039
- import os
1040
- import sys
1041
-
1042
- test_model = SeqEncoder2D(
1043
- C_in=2,
1044
- T_in=25,
1045
- C_out=512,
1046
- num_joints=54,
1047
- )
1048
- print(test_model.num_layers)
1049
-
1050
- input = torch.randn((64, 108, 25))
1051
- output = test_model(input)
1052
- print(output.shape)
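The `Audio2Pose` module above simply chains the convolutional `AudioEncoder` with the upsampling `AudioDecoder`, so a 13-channel MFCC sequence comes out as a fixed-length pose sequence. A minimal smoke test, assuming the classes in this file are importable; `pose_dim=108` and the 100-frame input are illustrative values, not taken from the repo:

```python
import torch

# Sketch only: pose_dim and the input length are assumptions for the example.
model = Audio2Pose(pose_dim=108, embed_size=512, augmentation=False, ups=25)
mfcc = torch.randn(2, 13, 100)   # (batch, MFCC channels, audio frames)
poses = model(mfcc)              # AudioDecoder upsamples to a fixed ups=25 frames
print(poses.shape)               # expected: torch.Size([2, 108, 25])
```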
nets/smplx_body_pixel.py DELETED
@@ -1,326 +0,0 @@
1
- import os
2
- import sys
3
-
4
- import torch
5
- from torch.optim.lr_scheduler import StepLR
6
-
7
- sys.path.append(os.getcwd())
8
-
9
- from nets.layers import *
10
- from nets.base import TrainWrapperBaseClass
11
- from nets.spg.gated_pixelcnn_v2 import GatedPixelCNN as pixelcnn
12
- from nets.spg.vqvae_1d import VQVAE as s2g_body, Wav2VecEncoder
13
- from nets.spg.vqvae_1d import AudioEncoder
14
- from nets.utils import parse_audio, denormalize
15
- from data_utils import get_mfcc, get_melspec, get_mfcc_old, get_mfcc_psf, get_mfcc_psf_min, get_mfcc_ta
16
- import numpy as np
17
- import torch.optim as optim
18
- import torch.nn.functional as F
19
- from sklearn.preprocessing import normalize
20
-
21
- from data_utils.lower_body import c_index, c_index_3d, c_index_6d
22
- from data_utils.utils import smooth_geom, get_mfcc_sepa
23
-
24
-
25
- class TrainWrapper(TrainWrapperBaseClass):
26
- '''
27
- a wrapper receiving a batch from data_utils and calculating the loss
28
- '''
29
-
30
- def __init__(self, args, config):
31
- self.args = args
32
- self.config = config
33
- self.device = torch.device(self.args.gpu)
34
- self.global_step = 0
35
-
36
- self.convert_to_6d = self.config.Data.pose.convert_to_6d
37
- self.expression = self.config.Data.pose.expression
38
- self.epoch = 0
39
- self.init_params()
40
- self.num_classes = 4
41
- self.audio = True
42
- self.composition = self.config.Model.composition
43
- self.bh_model = self.config.Model.bh_model
44
-
45
- if self.audio:
46
- self.audioencoder = AudioEncoder(in_dim=64, num_hiddens=256, num_residual_layers=2, num_residual_hiddens=256).to(self.device)
47
- else:
48
- self.audioencoder = None
49
- if self.convert_to_6d:
50
- dim, layer = 512, 10
51
- else:
52
- dim, layer = 256, 15
53
- self.generator = pixelcnn(2048, dim, layer, self.num_classes, self.audio, self.bh_model).to(self.device)
54
- self.g_body = s2g_body(self.each_dim[1], embedding_dim=64, num_embeddings=config.Model.code_num, num_hiddens=1024,
55
- num_residual_layers=2, num_residual_hiddens=512).to(self.device)
56
- self.g_hand = s2g_body(self.each_dim[2], embedding_dim=64, num_embeddings=config.Model.code_num, num_hiddens=1024,
57
- num_residual_layers=2, num_residual_hiddens=512).to(self.device)
58
-
59
- model_path = self.config.Model.vq_path
60
- model_ckpt = torch.load(model_path, map_location=torch.device('cpu'))
61
- self.g_body.load_state_dict(model_ckpt['generator']['g_body'])
62
- self.g_hand.load_state_dict(model_ckpt['generator']['g_hand'])
63
-
64
- if torch.cuda.device_count() > 1:
65
- self.g_body = torch.nn.DataParallel(self.g_body, device_ids=[0, 1])
66
- self.g_hand = torch.nn.DataParallel(self.g_hand, device_ids=[0, 1])
67
- self.generator = torch.nn.DataParallel(self.generator, device_ids=[0, 1])
68
- if self.audioencoder is not None:
69
- self.audioencoder = torch.nn.DataParallel(self.audioencoder, device_ids=[0, 1])
70
-
71
- self.discriminator = None
72
- if self.convert_to_6d:
73
- self.c_index = c_index_6d
74
- else:
75
- self.c_index = c_index_3d
76
-
77
- super().__init__(args, config)
78
-
79
- def init_optimizer(self):
80
-
81
- print('using Adam')
82
- self.generator_optimizer = optim.Adam(
83
- self.generator.parameters(),
84
- lr=self.config.Train.learning_rate.generator_learning_rate,
85
- betas=[0.9, 0.999]
86
- )
87
- if self.audioencoder is not None:
88
- opt = self.config.Model.AudioOpt
89
- if opt == 'Adam':
90
- self.audioencoder_optimizer = optim.Adam(
91
- self.audioencoder.parameters(),
92
- lr=self.config.Train.learning_rate.generator_learning_rate,
93
- betas=[0.9, 0.999]
94
- )
95
- else:
96
- print('using SGD')
97
- self.audioencoder_optimizer = optim.SGD(
98
- filter(lambda p: p.requires_grad,self.audioencoder.parameters()),
99
- lr=self.config.Train.learning_rate.generator_learning_rate*10,
100
- momentum=0.9,
101
- nesterov=False,
102
- )
103
-
104
- def state_dict(self):
105
- model_state = {
106
- 'generator': self.generator.state_dict(),
107
- 'generator_optim': self.generator_optimizer.state_dict(),
108
- 'audioencoder': self.audioencoder.state_dict() if self.audio else None,
109
- 'audioencoder_optim': self.audioencoder_optimizer.state_dict() if self.audio else None,
110
- 'discriminator': self.discriminator.state_dict() if self.discriminator is not None else None,
111
- 'discriminator_optim': self.discriminator_optimizer.state_dict() if self.discriminator is not None else None
112
- }
113
- return model_state
114
-
115
- def load_state_dict(self, state_dict):
116
-
117
- from collections import OrderedDict
118
- new_state_dict = OrderedDict() # create new OrderedDict that does not contain `module.`
119
- for k, v in state_dict.items():
120
- sub_dict = OrderedDict()
121
- if v is not None:
122
- for k1, v1 in v.items():
123
- name = k1.replace('module.', '')
124
- sub_dict[name] = v1
125
- new_state_dict[k] = sub_dict
126
- state_dict = new_state_dict
127
- if 'generator' in state_dict:
128
- self.generator.load_state_dict(state_dict['generator'])
129
- else:
130
- self.generator.load_state_dict(state_dict)
131
-
132
- if 'generator_optim' in state_dict and self.generator_optimizer is not None:
133
- self.generator_optimizer.load_state_dict(state_dict['generator_optim'])
134
-
135
- if self.discriminator is not None:
136
- self.discriminator.load_state_dict(state_dict['discriminator'])
137
-
138
- if 'discriminator_optim' in state_dict and self.discriminator_optimizer is not None:
139
- self.discriminator_optimizer.load_state_dict(state_dict['discriminator_optim'])
140
-
141
- if 'audioencoder' in state_dict and self.audioencoder is not None:
142
- self.audioencoder.load_state_dict(state_dict['audioencoder'])
143
-
144
- def init_params(self):
145
- if self.config.Data.pose.convert_to_6d:
146
- scale = 2
147
- else:
148
- scale = 1
149
-
150
- global_orient = round(0 * scale)
151
- leye_pose = reye_pose = round(0 * scale)
152
- jaw_pose = round(0 * scale)
153
- body_pose = round((63 - 24) * scale)
154
- left_hand_pose = right_hand_pose = round(45 * scale)
155
- if self.expression:
156
- expression = 100
157
- else:
158
- expression = 0
159
-
160
- b_j = 0
161
- jaw_dim = jaw_pose
162
- b_e = b_j + jaw_dim
163
- eye_dim = leye_pose + reye_pose
164
- b_b = b_e + eye_dim
165
- body_dim = global_orient + body_pose
166
- b_h = b_b + body_dim
167
- hand_dim = left_hand_pose + right_hand_pose
168
- b_f = b_h + hand_dim
169
- face_dim = expression
170
-
171
- self.dim_list = [b_j, b_e, b_b, b_h, b_f]
172
- self.full_dim = jaw_dim + eye_dim + body_dim + hand_dim
173
- self.pose = int(self.full_dim / round(3 * scale))
174
- self.each_dim = [jaw_dim, eye_dim + body_dim, hand_dim, face_dim]
175
-
176
- def __call__(self, bat):
177
- # assert (not self.args.infer), "infer mode"
178
- self.global_step += 1
179
-
180
- total_loss = None
181
- loss_dict = {}
182
-
183
- aud, poses = bat['aud_feat'].to(self.device).to(torch.float32), bat['poses'].to(self.device).to(torch.float32)
184
-
185
- id = bat['speaker'].to(self.device) - 20
186
- # id = F.one_hot(id, self.num_classes)
187
-
188
- poses = poses[:, self.c_index, :]
189
-
190
- aud = aud.permute(0, 2, 1)
191
- gt_poses = poses.permute(0, 2, 1)
192
-
193
- with torch.no_grad():
194
- self.g_body.eval()
195
- self.g_hand.eval()
196
- if torch.cuda.device_count() > 1:
197
- _, body_latents = self.g_body.module.encode(gt_poses=gt_poses[..., :self.each_dim[1]], id=id)
198
- _, hand_latents = self.g_hand.module.encode(gt_poses=gt_poses[..., self.each_dim[1]:], id=id)
199
- else:
200
- _, body_latents = self.g_body.encode(gt_poses=gt_poses[..., :self.each_dim[1]], id=id)
201
- _, hand_latents = self.g_hand.encode(gt_poses=gt_poses[..., self.each_dim[1]:], id=id)
202
- latents = torch.cat([body_latents.unsqueeze(dim=-1), hand_latents.unsqueeze(dim=-1)], dim=-1)
203
- latents = latents.detach()
204
-
205
- if self.audio:
206
- audio = self.audioencoder(aud[:, :].transpose(1, 2), frame_num=latents.shape[1]*4).unsqueeze(dim=-1).repeat(1, 1, 1, 2)
207
- logits = self.generator(latents[:, :], id, audio)
208
- else:
209
- logits = self.generator(latents, id)
210
- logits = logits.permute(0, 2, 3, 1).contiguous()
211
-
212
- self.generator_optimizer.zero_grad()
213
- if self.audio:
214
- self.audioencoder_optimizer.zero_grad()
215
-
216
- loss = F.cross_entropy(logits.view(-1, logits.shape[-1]), latents.view(-1))
217
- loss.backward()
218
-
219
- grad = torch.nn.utils.clip_grad_norm_(self.generator.parameters(), self.config.Train.max_gradient_norm)
220
-
221
- if torch.isnan(grad).sum() > 0:
222
- print('Warning: NaN encountered in gradient norm')
223
-
224
- loss_dict['grad'] = grad.item()
225
- loss_dict['ce_loss'] = loss.item()
226
- self.generator_optimizer.step()
227
- if self.audio:
228
- self.audioencoder_optimizer.step()
229
-
230
- return total_loss, loss_dict
231
-
232
- def infer_on_audio(self, aud_fn, initial_pose=None, norm_stats=None, exp=None, var=None, w_pre=False, rand=None,
233
- continuity=False, id=None, fps=15, sr=22000, B=1, am=None, am_sr=None, frame=0,**kwargs):
234
- '''
235
- initial_pose: (B, C, T), normalized
236
- (aud_fn, txgfile) -> generated motion (B, T, C)
237
- '''
238
- output = []
239
-
240
- assert self.args.infer, "train mode"
241
- self.generator.eval()
242
- self.g_body.eval()
243
- self.g_hand.eval()
244
-
245
- if continuity:
246
- aud_feat, gap = get_mfcc_sepa(aud_fn, sr=sr, fps=fps)
247
- else:
248
- aud_feat = get_mfcc_ta(aud_fn, sr=sr, fps=fps, smlpx=True, type='mfcc', am=am)
249
- aud_feat = aud_feat.transpose(1, 0)
250
- aud_feat = aud_feat[np.newaxis, ...].repeat(B, axis=0)
251
- aud_feat = torch.tensor(aud_feat, dtype=torch.float32).to(self.device)
252
-
253
- if id is None:
254
- id = torch.tensor([0]).to(self.device)
255
- else:
256
- id = id.repeat(B)
257
-
258
- with torch.no_grad():
259
- aud_feat = aud_feat.permute(0, 2, 1)
260
- if continuity:
261
- self.audioencoder.eval()
262
- pre_pose = {}
263
- pre_pose['b'] = pre_pose['h'] = None
264
- pre_latents, pre_audio, body_0, hand_0 = self.infer(aud_feat[:, :gap], frame, id, B, pre_pose=pre_pose)
265
- pre_pose['b'] = body_0[:, :, -4:].transpose(1,2)
266
- pre_pose['h'] = hand_0[:, :, -4:].transpose(1,2)
267
- _, _, body_1, hand_1 = self.infer(aud_feat[:, gap:], frame, id, B, pre_latents, pre_audio, pre_pose)
268
- body = torch.cat([body_0, body_1], dim=2)
269
- hand = torch.cat([hand_0, hand_1], dim=2)
270
-
271
- else:
272
- if self.audio:
273
- self.audioencoder.eval()
274
- audio = self.audioencoder(aud_feat.transpose(1, 2), frame_num=frame).unsqueeze(dim=-1).repeat(1, 1, 1, 2)
275
- latents = self.generator.generate(id, shape=[audio.shape[2], 2], batch_size=B, aud_feat=audio)
276
- else:
277
- latents = self.generator.generate(id, shape=[aud_feat.shape[1]//4, 2], batch_size=B)
278
-
279
- body_latents = latents[..., 0]
280
- hand_latents = latents[..., 1]
281
-
282
- body, _ = self.g_body.decode(b=body_latents.shape[0], w=body_latents.shape[1], latents=body_latents)
283
- hand, _ = self.g_hand.decode(b=hand_latents.shape[0], w=hand_latents.shape[1], latents=hand_latents)
284
-
285
- pred_poses = torch.cat([body, hand], dim=1).transpose(1,2).cpu().numpy()
286
-
287
- output = pred_poses
288
-
289
- return output
290
-
291
- def infer(self, aud_feat, frame, id, B, pre_latents=None, pre_audio=None, pre_pose=None):
292
- audio = self.audioencoder(aud_feat.transpose(1, 2), frame_num=frame).unsqueeze(dim=-1).repeat(1, 1, 1, 2)
293
- latents = self.generator.generate(id, shape=[audio.shape[2], 2], batch_size=B, aud_feat=audio,
294
- pre_latents=pre_latents, pre_audio=pre_audio)
295
-
296
- body_latents = latents[..., 0]
297
- hand_latents = latents[..., 1]
298
-
299
- body, _ = self.g_body.decode(b=body_latents.shape[0], w=body_latents.shape[1],
300
- latents=body_latents, pre_state=pre_pose['b'])
301
- hand, _ = self.g_hand.decode(b=hand_latents.shape[0], w=hand_latents.shape[1],
302
- latents=hand_latents, pre_state=pre_pose['h'])
303
-
304
- return latents, audio, body, hand
305
-
306
- def generate(self, aud, id, frame_num=0):
307
-
308
- self.generator.eval()
309
- self.g_body.eval()
310
- self.g_hand.eval()
311
- aud_feat = aud.permute(0, 2, 1)
312
- if self.audio:
313
- self.audioencoder.eval()
314
- audio = self.audioencoder(aud_feat.transpose(1, 2), frame_num=frame_num).unsqueeze(dim=-1).repeat(1, 1, 1, 2)
315
- latents = self.generator.generate(id, shape=[audio.shape[2], 2], batch_size=aud.shape[0], aud_feat=audio)
316
- else:
317
- latents = self.generator.generate(id, shape=[aud_feat.shape[1] // 4, 2], batch_size=aud.shape[0])
318
-
319
- body_latents = latents[..., 0]
320
- hand_latents = latents[..., 1]
321
-
322
- body = self.g_body.decode(b=body_latents.shape[0], w=body_latents.shape[1], latents=body_latents)
323
- hand = self.g_hand.decode(b=hand_latents.shape[0], w=hand_latents.shape[1], latents=hand_latents)
324
-
325
- pred_poses = torch.cat([body, hand], dim=1).transpose(1, 2)
326
- return pred_poses
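`TrainWrapper.__call__` above freezes the two pretrained VQ-VAEs, looks up the codebook indices of the ground-truth body and hand motion, and trains the audio-conditioned PixelCNN to predict those indices with cross-entropy. A stripped-down sketch of that objective (2048 codes as in the `pixelcnn(2048, ...)` call above; batch and sequence sizes are illustrative):

```python
import torch
import torch.nn.functional as F

B, T, num_codes = 4, 22, 2048                   # illustrative sizes
latents = torch.randint(num_codes, (B, T, 2))   # body/hand code indices from the VQ-VAEs
logits = torch.randn(B, T, 2, num_codes)        # PixelCNN output after the permute above
loss = F.cross_entropy(logits.reshape(-1, num_codes), latents.reshape(-1))
```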
nets/smplx_body_vq.py DELETED
@@ -1,302 +0,0 @@
1
- import os
2
- import sys
3
-
4
- from torch.optim.lr_scheduler import StepLR
5
-
6
- sys.path.append(os.getcwd())
7
-
8
- from nets.layers import *
9
- from nets.base import TrainWrapperBaseClass
10
- from nets.spg.s2glayers import Generator as G_S2G, Discriminator as D_S2G
11
- from nets.spg.vqvae_1d import VQVAE as s2g_body
12
- from nets.utils import parse_audio, denormalize
13
- from data_utils import get_mfcc, get_melspec, get_mfcc_old, get_mfcc_psf, get_mfcc_psf_min, get_mfcc_ta
14
- import numpy as np
15
- import torch.optim as optim
16
- import torch.nn.functional as F
17
- from sklearn.preprocessing import normalize
18
-
19
- from data_utils.lower_body import c_index, c_index_3d, c_index_6d
20
-
21
-
22
- class TrainWrapper(TrainWrapperBaseClass):
23
- '''
24
- a wrapper receiving a batch from data_utils and calculating the loss
25
- '''
26
-
27
- def __init__(self, args, config):
28
- self.args = args
29
- self.config = config
30
- self.device = torch.device(self.args.gpu)
31
- self.global_step = 0
32
-
33
- self.convert_to_6d = self.config.Data.pose.convert_to_6d
34
- self.expression = self.config.Data.pose.expression
35
- self.epoch = 0
36
- self.init_params()
37
- self.num_classes = 4
38
- self.composition = self.config.Model.composition
39
- if self.composition:
40
- self.g_body = s2g_body(self.each_dim[1], embedding_dim=64, num_embeddings=config.Model.code_num, num_hiddens=1024,
41
- num_residual_layers=2, num_residual_hiddens=512).to(self.device)
42
- self.g_hand = s2g_body(self.each_dim[2], embedding_dim=64, num_embeddings=config.Model.code_num, num_hiddens=1024,
43
- num_residual_layers=2, num_residual_hiddens=512).to(self.device)
44
- else:
45
- self.g = s2g_body(self.each_dim[1] + self.each_dim[2], embedding_dim=64, num_embeddings=config.Model.code_num,
46
- num_hiddens=1024, num_residual_layers=2, num_residual_hiddens=512).to(self.device)
47
-
48
- self.discriminator = None
49
-
50
- if self.convert_to_6d:
51
- self.c_index = c_index_6d
52
- else:
53
- self.c_index = c_index_3d
54
-
55
- super().__init__(args, config)
56
-
57
- def init_optimizer(self):
58
- print('using Adam')
59
- if self.composition:
60
- self.g_body_optimizer = optim.Adam(
61
- self.g_body.parameters(),
62
- lr=self.config.Train.learning_rate.generator_learning_rate,
63
- betas=[0.9, 0.999]
64
- )
65
- self.g_hand_optimizer = optim.Adam(
66
- self.g_hand.parameters(),
67
- lr=self.config.Train.learning_rate.generator_learning_rate,
68
- betas=[0.9, 0.999]
69
- )
70
- else:
71
- self.g_optimizer = optim.Adam(
72
- self.g.parameters(),
73
- lr=self.config.Train.learning_rate.generator_learning_rate,
74
- betas=[0.9, 0.999]
75
- )
76
-
77
- def state_dict(self):
78
- if self.composition:
79
- model_state = {
80
- 'g_body': self.g_body.state_dict(),
81
- 'g_body_optim': self.g_body_optimizer.state_dict(),
82
- 'g_hand': self.g_hand.state_dict(),
83
- 'g_hand_optim': self.g_hand_optimizer.state_dict(),
84
- 'discriminator': self.discriminator.state_dict() if self.discriminator is not None else None,
85
- 'discriminator_optim': self.discriminator_optimizer.state_dict() if self.discriminator is not None else None
86
- }
87
- else:
88
- model_state = {
89
- 'g': self.g.state_dict(),
90
- 'g_optim': self.g_optimizer.state_dict(),
91
- 'discriminator': self.discriminator.state_dict() if self.discriminator is not None else None,
92
- 'discriminator_optim': self.discriminator_optimizer.state_dict() if self.discriminator is not None else None
93
- }
94
- return model_state
95
-
96
- def init_params(self):
97
- if self.config.Data.pose.convert_to_6d:
98
- scale = 2
99
- else:
100
- scale = 1
101
-
102
- global_orient = round(0 * scale)
103
- leye_pose = reye_pose = round(0 * scale)
104
- jaw_pose = round(0 * scale)
105
- body_pose = round((63 - 24) * scale)
106
- left_hand_pose = right_hand_pose = round(45 * scale)
107
- if self.expression:
108
- expression = 100
109
- else:
110
- expression = 0
111
-
112
- b_j = 0
113
- jaw_dim = jaw_pose
114
- b_e = b_j + jaw_dim
115
- eye_dim = leye_pose + reye_pose
116
- b_b = b_e + eye_dim
117
- body_dim = global_orient + body_pose
118
- b_h = b_b + body_dim
119
- hand_dim = left_hand_pose + right_hand_pose
120
- b_f = b_h + hand_dim
121
- face_dim = expression
122
-
123
- self.dim_list = [b_j, b_e, b_b, b_h, b_f]
124
- self.full_dim = jaw_dim + eye_dim + body_dim + hand_dim
125
- self.pose = int(self.full_dim / round(3 * scale))
126
- self.each_dim = [jaw_dim, eye_dim + body_dim, hand_dim, face_dim]
127
-
128
- def __call__(self, bat):
129
- # assert (not self.args.infer), "infer mode"
130
- self.global_step += 1
131
-
132
- total_loss = None
133
- loss_dict = {}
134
-
135
- aud, poses = bat['aud_feat'].to(self.device).to(torch.float32), bat['poses'].to(self.device).to(torch.float32)
136
-
137
- # id = bat['speaker'].to(self.device) - 20
138
- # id = F.one_hot(id, self.num_classes)
139
-
140
- poses = poses[:, self.c_index, :]
141
- gt_poses = poses.permute(0, 2, 1)
142
- b_poses = gt_poses[..., :self.each_dim[1]]
143
- h_poses = gt_poses[..., self.each_dim[1]:]
144
-
145
- if self.composition:
146
- loss = 0
147
- loss_dict, loss = self.vq_train(b_poses[:, :], 'b', self.g_body, loss_dict, loss)
148
- loss_dict, loss = self.vq_train(h_poses[:, :], 'h', self.g_hand, loss_dict, loss)
149
- else:
150
- loss = 0
151
- loss_dict, loss = self.vq_train(gt_poses[:, :], 'g', self.g, loss_dict, loss)
152
-
153
- return total_loss, loss_dict
154
-
155
- def vq_train(self, gt, name, model, dict, total_loss, pre=None):
156
- e_q_loss, x_recon = model(gt_poses=gt, pre_state=pre)
157
- loss, loss_dict = self.get_loss(pred_poses=x_recon, gt_poses=gt, e_q_loss=e_q_loss, pre=pre)
158
- # total_loss = total_loss + loss
159
-
160
- if name == 'b':
161
- optimizer_name = 'g_body_optimizer'
162
- elif name == 'h':
163
- optimizer_name = 'g_hand_optimizer'
164
- elif name == 'g':
165
- optimizer_name = 'g_optimizer'
166
- else:
167
- raise ValueError("model's name must be b or h")
168
- optimizer = getattr(self, optimizer_name)
169
- optimizer.zero_grad()
170
- loss.backward()
171
- optimizer.step()
172
-
173
- for key in list(loss_dict.keys()):
174
- dict[name + key] = loss_dict.get(key, 0).item()
175
- return dict, total_loss
176
-
177
- def get_loss(self,
178
- pred_poses,
179
- gt_poses,
180
- e_q_loss,
181
- pre=None
182
- ):
183
- loss_dict = {}
184
-
185
-
186
- rec_loss = torch.mean(torch.abs(pred_poses - gt_poses))
187
- v_pr = pred_poses[:, 1:] - pred_poses[:, :-1]
188
- v_gt = gt_poses[:, 1:] - gt_poses[:, :-1]
189
- velocity_loss = torch.mean(torch.abs(v_pr - v_gt))
190
-
191
- if pre is None:
192
- f0_vel = 0
193
- else:
194
- v0_pr = pred_poses[:, 0] - pre[:, -1]
195
- v0_gt = gt_poses[:, 0] - pre[:, -1]
196
- f0_vel = torch.mean(torch.abs(v0_pr - v0_gt))
197
-
198
- gen_loss = rec_loss + e_q_loss + velocity_loss + f0_vel
199
-
200
- loss_dict['rec_loss'] = rec_loss
201
- loss_dict['velocity_loss'] = velocity_loss
202
- # loss_dict['e_q_loss'] = e_q_loss
203
- if pre is not None:
204
- loss_dict['f0_vel'] = f0_vel
205
-
206
- return gen_loss, loss_dict
207
-
208
- def infer_on_audio(self, aud_fn, initial_pose=None, norm_stats=None, exp=None, var=None, w_pre=False, continuity=False,
209
- id=None, fps=15, sr=22000, smooth=False, **kwargs):
210
- '''
211
- initial_pose: (B, C, T), normalized
212
- (aud_fn, txgfile) -> generated motion (B, T, C)
213
- '''
214
- output = []
215
-
216
- assert self.args.infer, "train mode"
217
- if self.composition:
218
- self.g_body.eval()
219
- self.g_hand.eval()
220
- else:
221
- self.g.eval()
222
-
223
- if self.config.Data.pose.normalization:
224
- assert norm_stats is not None
225
- data_mean = norm_stats[0]
226
- data_std = norm_stats[1]
227
-
228
- # assert initial_pose.shape[-1] == pre_length
229
- if initial_pose is not None:
230
- gt = initial_pose[:, :, :].to(self.device).to(torch.float32)
231
- pre_poses = initial_pose[:, :, :15].permute(0, 2, 1).to(self.device).to(torch.float32)
232
- poses = initial_pose.permute(0, 2, 1).to(self.device).to(torch.float32)
233
- B = pre_poses.shape[0]
234
- else:
235
- gt = None
236
- pre_poses = None
237
- B = 1
238
-
239
- if type(aud_fn) == torch.Tensor:
240
- aud_feat = torch.tensor(aud_fn, dtype=torch.float32).to(self.device)
241
- num_poses_to_generate = aud_feat.shape[-1]
242
- else:
243
- aud_feat = get_mfcc_ta(aud_fn, sr=sr, fps=fps, smlpx=True, type='mfcc').transpose(1, 0)
244
- aud_feat = aud_feat[:, :]
245
- num_poses_to_generate = aud_feat.shape[-1]
246
- aud_feat = aud_feat[np.newaxis, ...].repeat(B, axis=0)
247
- aud_feat = torch.tensor(aud_feat, dtype=torch.float32).to(self.device)
248
-
249
- # pre_poses = torch.randn(pre_poses.shape).to(self.device).to(torch.float32)
250
- if id is None:
251
- id = F.one_hot(torch.tensor([[0]]), self.num_classes).to(self.device)
252
-
253
- with torch.no_grad():
254
- aud_feat = aud_feat.permute(0, 2, 1)
255
- gt_poses = gt[:, self.c_index].permute(0, 2, 1)
256
- if self.composition:
257
- if continuity:
258
- pred_poses_body = []
259
- pred_poses_hand = []
260
- pre_b = None
261
- pre_h = None
262
- for i in range(5):
263
- _, pred_body = self.g_body(gt_poses=gt_poses[:, i*60:(i+1)*60, :self.each_dim[1]], pre_state=pre_b)
264
- pre_b = pred_body[..., -1:].transpose(1,2)
265
- pred_poses_body.append(pred_body)
266
- _, pred_hand = self.g_hand(gt_poses=gt_poses[:, i*60:(i+1)*60, self.each_dim[1]:], pre_state=pre_h)
267
- pre_h = pred_hand[..., -1:].transpose(1,2)
268
- pred_poses_hand.append(pred_hand)
269
-
270
- pred_poses_body = torch.cat(pred_poses_body, dim=2)
271
- pred_poses_hand = torch.cat(pred_poses_hand, dim=2)
272
- else:
273
- _, pred_poses_body = self.g_body(gt_poses=gt_poses[..., :self.each_dim[1]], id=id)
274
- _, pred_poses_hand = self.g_hand(gt_poses=gt_poses[..., self.each_dim[1]:], id=id)
275
- pred_poses = torch.cat([pred_poses_body, pred_poses_hand], dim=1)
276
- else:
277
- _, pred_poses = self.g(gt_poses=gt_poses, id=id)
278
- pred_poses = pred_poses.transpose(1, 2).cpu().numpy()
279
- output = pred_poses
280
-
281
- if self.config.Data.pose.normalization:
282
- output = denormalize(output, data_mean, data_std)
283
-
284
- if smooth:
285
- lamda = 0.8
286
- smooth_f = 10
287
- frame = 149
288
- for i in range(smooth_f):
289
- f = frame + i
290
- l = lamda * (i + 1) / smooth_f
291
- output[0, f] = (1 - l) * output[0, f - 1] + l * output[0, f]
292
-
293
- output = np.concatenate(output, axis=1)
294
-
295
- return output
296
-
297
- def load_state_dict(self, state_dict):
298
- if self.composition:
299
- self.g_body.load_state_dict(state_dict['g_body'])
300
- self.g_hand.load_state_dict(state_dict['g_hand'])
301
- else:
302
- self.g.load_state_dict(state_dict['g'])
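`get_loss` above combines an L1 reconstruction term, an L1 velocity term on frame-to-frame differences, the codebook/commitment loss returned by the VQ-VAE and, when a previous clip is provided, a first-frame velocity term. Restated in isolation as a sketch (tensors are assumed to be `(B, T, C)` pose sequences):

```python
import torch

def vq_motion_loss(pred, gt, e_q_loss):
    # L1 on the poses plus L1 on their temporal differences, as in get_loss above
    rec = torch.mean(torch.abs(pred - gt))
    vel = torch.mean(torch.abs((pred[:, 1:] - pred[:, :-1]) - (gt[:, 1:] - gt[:, :-1])))
    return rec + vel + e_q_loss
```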
nets/smplx_face.py DELETED
@@ -1,238 +0,0 @@
1
- import os
2
- import sys
3
-
4
- sys.path.append(os.getcwd())
5
-
6
- from nets.layers import *
7
- from nets.base import TrainWrapperBaseClass
8
- # from nets.spg.faceformer import Faceformer
9
- from nets.spg.s2g_face import Generator as s2g_face
10
- from losses import KeypointLoss
11
- from nets.utils import denormalize
12
- from data_utils import get_mfcc_psf, get_mfcc_psf_min, get_mfcc_ta
13
- import numpy as np
14
- import torch.optim as optim
15
- import torch.nn.functional as F
16
- from sklearn.preprocessing import normalize
17
- import smplx
18
-
19
-
20
- class TrainWrapper(TrainWrapperBaseClass):
21
- '''
22
- a wrapper receiving a batch from data_utils and calculating the loss
23
- '''
24
-
25
- def __init__(self, args, config):
26
- self.args = args
27
- self.config = config
28
- self.device = torch.device(self.args.gpu)
29
- self.global_step = 0
30
-
31
- self.convert_to_6d = self.config.Data.pose.convert_to_6d
32
- self.expression = self.config.Data.pose.expression
33
- self.epoch = 0
34
- self.init_params()
35
- self.num_classes = 4
36
-
37
- self.generator = s2g_face(
38
- n_poses=self.config.Data.pose.generate_length,
39
- each_dim=self.each_dim,
40
- dim_list=self.dim_list,
41
- training=not self.args.infer,
42
- device=self.device,
43
- identity=False if self.convert_to_6d else True,
44
- num_classes=self.num_classes,
45
- ).to(self.device)
46
-
47
- # self.generator = Faceformer().to(self.device)
48
-
49
- self.discriminator = None
50
- self.am = None
51
-
52
- self.MSELoss = KeypointLoss().to(self.device)
53
- super().__init__(args, config)
54
-
55
- def init_optimizer(self):
56
- self.generator_optimizer = optim.SGD(
57
- filter(lambda p: p.requires_grad,self.generator.parameters()),
58
- lr=0.001,
59
- momentum=0.9,
60
- nesterov=False,
61
- )
62
-
63
- def init_params(self):
64
- if self.convert_to_6d:
65
- scale = 2
66
- else:
67
- scale = 1
68
-
69
- global_orient = round(3 * scale)
70
- leye_pose = reye_pose = round(3 * scale)
71
- jaw_pose = round(3 * scale)
72
- body_pose = round(63 * scale)
73
- left_hand_pose = right_hand_pose = round(45 * scale)
74
- if self.expression:
75
- expression = 100
76
- else:
77
- expression = 0
78
-
79
- b_j = 0
80
- jaw_dim = jaw_pose
81
- b_e = b_j + jaw_dim
82
- eye_dim = leye_pose + reye_pose
83
- b_b = b_e + eye_dim
84
- body_dim = global_orient + body_pose
85
- b_h = b_b + body_dim
86
- hand_dim = left_hand_pose + right_hand_pose
87
- b_f = b_h + hand_dim
88
- face_dim = expression
89
-
90
- self.dim_list = [b_j, b_e, b_b, b_h, b_f]
91
- self.full_dim = jaw_dim + eye_dim + body_dim + hand_dim + face_dim
92
- self.pose = int(self.full_dim / round(3 * scale))
93
- self.each_dim = [jaw_dim, eye_dim + body_dim, hand_dim, face_dim]
94
-
95
- def __call__(self, bat):
96
- # assert (not self.args.infer), "infer mode"
97
- self.global_step += 1
98
-
99
- total_loss = None
100
- loss_dict = {}
101
-
102
- aud, poses = bat['aud_feat'].to(self.device).to(torch.float32), bat['poses'].to(self.device).to(torch.float32)
103
- id = bat['speaker'].to(self.device) - 20
104
- id = F.one_hot(id, self.num_classes)
105
-
106
- aud = aud.permute(0, 2, 1)
107
- gt_poses = poses.permute(0, 2, 1)
108
-
109
- if self.expression:
110
- expression = bat['expression'].to(self.device).to(torch.float32)
111
- gt_poses = torch.cat([gt_poses, expression.permute(0, 2, 1)], dim=2)
112
-
113
- pred_poses, _ = self.generator(
114
- aud,
115
- gt_poses,
116
- id,
117
- )
118
-
119
- G_loss, G_loss_dict = self.get_loss(
120
- pred_poses=pred_poses,
121
- gt_poses=gt_poses,
122
- pre_poses=None,
123
- mode='training_G',
124
- gt_conf=None,
125
- aud=aud,
126
- )
127
-
128
- self.generator_optimizer.zero_grad()
129
- G_loss.backward()
130
- grad = torch.nn.utils.clip_grad_norm_(self.generator.parameters(), self.config.Train.max_gradient_norm)
131
- loss_dict['grad'] = grad.item()
132
- self.generator_optimizer.step()
133
-
134
- for key in list(G_loss_dict.keys()):
135
- loss_dict[key] = G_loss_dict.get(key, 0).item()
136
-
137
- return total_loss, loss_dict
138
-
139
- def get_loss(self,
140
- pred_poses,
141
- gt_poses,
142
- pre_poses,
143
- aud,
144
- mode='training_G',
145
- gt_conf=None,
146
- exp=1,
147
- gt_nzero=None,
148
- pre_nzero=None,
149
- ):
150
- loss_dict = {}
151
-
152
-
153
- [b_j, b_e, b_b, b_h, b_f] = self.dim_list
154
-
155
- MSELoss = torch.mean(torch.abs(pred_poses[:, :, :6] - gt_poses[:, :, :6]))
156
- if self.expression:
157
- expl = torch.mean((pred_poses[:, :, -100:] - gt_poses[:, :, -100:])**2)
158
- else:
159
- expl = 0
160
-
161
- gen_loss = expl + MSELoss
162
-
163
- loss_dict['MSELoss'] = MSELoss
164
- if self.expression:
165
- loss_dict['exp_loss'] = expl
166
-
167
- return gen_loss, loss_dict
168
-
169
- def infer_on_audio(self, aud_fn, id=None, initial_pose=None, norm_stats=None, w_pre=False, frame=None, am=None, am_sr=16000, **kwargs):
170
- '''
171
- initial_pose: (B, C, T), normalized
172
- (aud_fn, txgfile) -> generated motion (B, T, C)
173
- '''
174
- output = []
175
-
176
- # assert self.args.infer, "train mode"
177
- self.generator.eval()
178
-
179
- if self.config.Data.pose.normalization:
180
- assert norm_stats is not None
181
- data_mean = norm_stats[0]
182
- data_std = norm_stats[1]
183
-
184
- # assert initial_pose.shape[-1] == pre_length
185
- if initial_pose is not None:
186
- gt = initial_pose[:,:,:].permute(0, 2, 1).to(self.generator.device).to(torch.float32)
187
- pre_poses = initial_pose[:,:,:15].permute(0, 2, 1).to(self.generator.device).to(torch.float32)
188
- poses = initial_pose.permute(0, 2, 1).to(self.generator.device).to(torch.float32)
189
- B = pre_poses.shape[0]
190
- else:
191
- gt = None
192
- pre_poses=None
193
- B = 1
194
-
195
- if type(aud_fn) == torch.Tensor:
196
- aud_feat = torch.tensor(aud_fn, dtype=torch.float32).to(self.generator.device)
197
- num_poses_to_generate = aud_feat.shape[-1]
198
- else:
199
- aud_feat = get_mfcc_ta(aud_fn, am=am, am_sr=am_sr, fps=30, encoder_choice='faceformer')
200
- aud_feat = aud_feat[np.newaxis, ...].repeat(B, axis=0)
201
- aud_feat = torch.tensor(aud_feat, dtype=torch.float32).to(self.generator.device).transpose(1, 2)
202
- if frame is None:
203
- frame = aud_feat.shape[2]*30//16000
204
- #
205
- if id is None:
206
- id = torch.tensor([[0, 0, 0, 0]], dtype=torch.float32, device=self.generator.device)
207
- else:
208
- id = F.one_hot(id, self.num_classes).to(self.generator.device)
209
-
210
- with torch.no_grad():
211
- pred_poses = self.generator(aud_feat, pre_poses, id, time_steps=frame)[0]
212
- pred_poses = pred_poses.cpu().numpy()
213
- output = pred_poses
214
-
215
- if self.config.Data.pose.normalization:
216
- output = denormalize(output, data_mean, data_std)
217
-
218
- return output
219
-
220
-
221
- def generate(self, wv2_feat, frame):
222
- '''
223
- initial_pose: (B, C, T), normalized
224
- (aud_fn, txgfile) -> generated motion (B, T, C)
225
- '''
226
- output = []
227
-
228
- # assert self.args.infer, "train mode"
229
- self.generator.eval()
230
-
231
- B = 1
232
-
233
- id = torch.tensor([[0, 0, 0, 0]], dtype=torch.float32, device=self.generator.device)
234
- id = id.repeat(wv2_feat.shape[0], 1)
235
-
236
- with torch.no_grad():
237
- pred_poses = self.generator(wv2_feat, None, id, time_steps=frame)[0]
238
- return pred_poses
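The face branch above is trained with a plain combination of an L1 term on the first six pose channels and an MSE term on the 100 expression coefficients appended to each frame; as a sketch:

```python
import torch

def face_loss(pred, gt):
    pose_l1 = torch.mean(torch.abs(pred[:, :, :6] - gt[:, :, :6]))    # first six pose channels
    exp_mse = torch.mean((pred[:, :, -100:] - gt[:, :, -100:]) ** 2)  # expression coefficients
    return pose_l1 + exp_mse
```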
nets/spg/gated_pixelcnn_v2.py DELETED
@@ -1,179 +0,0 @@
1
- import torch
2
- import torch.nn as nn
3
- import torch.nn.functional as F
4
-
5
-
6
- def weights_init(m):
7
- classname = m.__class__.__name__
8
- if classname.find('Conv') != -1:
9
- try:
10
- nn.init.xavier_uniform_(m.weight.data)
11
- m.bias.data.fill_(0)
12
- except AttributeError:
13
- print("Skipping initialization of ", classname)
14
-
15
-
16
- class GatedActivation(nn.Module):
17
- def __init__(self):
18
- super().__init__()
19
-
20
- def forward(self, x):
21
- x, y = x.chunk(2, dim=1)
22
- return F.tanh(x) * F.sigmoid(y)
23
-
24
-
25
- class GatedMaskedConv2d(nn.Module):
26
- def __init__(self, mask_type, dim, kernel, residual=True, n_classes=10, bh_model=False):
27
- super().__init__()
28
- assert kernel % 2 == 1, print("Kernel size must be odd")
29
- self.mask_type = mask_type
30
- self.residual = residual
31
- self.bh_model = bh_model
32
-
33
- self.class_cond_embedding = nn.Embedding(n_classes, 2 * dim)
34
- self.class_cond_embedding = self.class_cond_embedding.to("cpu")
35
-
36
- kernel_shp = (kernel // 2 + 1, 3 if self.bh_model else 1) # (ceil(n/2), n)
37
- padding_shp = (kernel // 2, 1 if self.bh_model else 0)
38
- self.vert_stack = nn.Conv2d(
39
- dim, dim * 2,
40
- kernel_shp, 1, padding_shp
41
- )
42
-
43
- self.vert_to_horiz = nn.Conv2d(2 * dim, 2 * dim, 1)
44
-
45
- kernel_shp = (1, 2)
46
- padding_shp = (0, 1)
47
- self.horiz_stack = nn.Conv2d(
48
- dim, dim * 2,
49
- kernel_shp, 1, padding_shp
50
- )
51
-
52
- self.horiz_resid = nn.Conv2d(dim, dim, 1)
53
-
54
- self.gate = GatedActivation()
55
-
56
- def make_causal(self):
57
- self.vert_stack.weight.data[:, :, -1].zero_() # Mask final row
58
- self.horiz_stack.weight.data[:, :, :, -1].zero_() # Mask final column
59
-
60
- def forward(self, x_v, x_h, h):
61
- if self.mask_type == 'A':
62
- self.make_causal()
63
-
64
- h = h.to(self.class_cond_embedding.weight.device)
65
- h = self.class_cond_embedding(h)
66
-
67
- h_vert = self.vert_stack(x_v)
68
- h_vert = h_vert[:, :, :x_v.size(-2), :]
69
- out_v = self.gate(h_vert + h[:, :, None, None])
70
-
71
- if self.bh_model:
72
- h_horiz = self.horiz_stack(x_h)
73
- h_horiz = h_horiz[:, :, :, :x_h.size(-1)]
74
- v2h = self.vert_to_horiz(h_vert)
75
-
76
- out = self.gate(v2h + h_horiz + h[:, :, None, None])
77
- if self.residual:
78
- out_h = self.horiz_resid(out) + x_h
79
- else:
80
- out_h = self.horiz_resid(out)
81
- else:
82
- if self.residual:
83
- out_v = self.horiz_resid(out_v) + x_v
84
- else:
85
- out_v = self.horiz_resid(out_v)
86
- out_h = out_v
87
-
88
- return out_v, out_h
89
-
90
-
91
- class GatedPixelCNN(nn.Module):
92
- def __init__(self, input_dim=256, dim=64, n_layers=15, n_classes=10, audio=False, bh_model=False):
93
- super().__init__()
94
- self.dim = dim
95
- self.audio = audio
96
- self.bh_model = bh_model
97
-
98
- if self.audio:
99
- self.embedding_aud = nn.Conv2d(256, dim, 1, 1, padding=0)
100
- self.fusion_v = nn.Conv2d(dim * 2, dim, 1, 1, padding=0)
101
- self.fusion_h = nn.Conv2d(dim * 2, dim, 1, 1, padding=0)
102
-
103
- # Create embedding layer to embed input
104
- self.embedding = nn.Embedding(input_dim, dim)
105
-
106
- # Building the PixelCNN layer by layer
107
- self.layers = nn.ModuleList()
108
-
109
- # Initial block with Mask-A convolution
110
- # Rest with Mask-B convolutions
111
- for i in range(n_layers):
112
- mask_type = 'A' if i == 0 else 'B'
113
- kernel = 7 if i == 0 else 3
114
- residual = False if i == 0 else True
115
-
116
- self.layers.append(
117
- GatedMaskedConv2d(mask_type, dim, kernel, residual, n_classes, bh_model)
118
- )
119
-
120
- # Add the output layer
121
- self.output_conv = nn.Sequential(
122
- nn.Conv2d(dim, 512, 1),
123
- nn.ReLU(True),
124
- nn.Conv2d(512, input_dim, 1)
125
- )
126
-
127
- self.apply(weights_init)
128
-
129
- self.dp = nn.Dropout(0.1)
130
- self.to("cpu")
131
-
132
- def forward(self, x, label, aud=None):
133
- shp = x.size() + (-1,)
134
- x = self.embedding(x.view(-1)).view(shp) # (B, H, W, C)
135
- x = x.permute(0, 3, 1, 2) # (B, C, W, W)
136
-
137
- x_v, x_h = (x, x)
138
- for i, layer in enumerate(self.layers):
139
- if i == 1 and self.audio is True:
140
- aud = self.embedding_aud(aud)
141
- a = torch.ones(aud.shape[-2]).to(aud.device)
142
- a = self.dp(a)
143
- aud = (aud.transpose(-1, -2) * a).transpose(-1, -2)
144
- x_v = self.fusion_v(torch.cat([x_v, aud], dim=1))
145
- if self.bh_model:
146
- x_h = self.fusion_h(torch.cat([x_h, aud], dim=1))
147
- x_v, x_h = layer(x_v, x_h, label)
148
-
149
- if self.bh_model:
150
- return self.output_conv(x_h)
151
- else:
152
- return self.output_conv(x_v)
153
-
154
- def generate(self, label, shape=(8, 8), batch_size=64, aud_feat=None, pre_latents=None, pre_audio=None):
155
- param = next(self.parameters())
156
- x = torch.zeros(
157
- (batch_size, *shape),
158
- dtype=torch.int64, device=param.device
159
- )
160
- if pre_latents is not None:
161
- x = torch.cat([pre_latents, x], dim=1)
162
- aud_feat = torch.cat([pre_audio, aud_feat], dim=2)
163
- h0 = pre_latents.shape[1]
164
- h = h0 + shape[0]
165
- else:
166
- h0 = 0
167
- h = shape[0]
168
-
169
- for i in range(h0, h):
170
- for j in range(shape[1]):
171
- if self.audio:
172
- logits = self.forward(x, label, aud_feat)
173
- else:
174
- logits = self.forward(x, label)
175
- probs = F.softmax(logits[:, :, i, j], -1)
176
- x.data[:, i, j].copy_(
177
- probs.multinomial(1).squeeze().data
178
- )
179
- return x[:, h0:h]
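`generate` above samples the latent grid autoregressively, one position at a time, re-running the network after each draw; the optional `pre_latents`/`pre_audio` arguments let a new clip continue from an already generated one. A hypothetical CPU-only sampling call (2048 codes and 4 speaker classes mirror the wrapper above; the other sizes are arbitrary and kept small so the sketch runs quickly):

```python
import torch

model = GatedPixelCNN(input_dim=2048, dim=64, n_layers=4, n_classes=4,
                      audio=False, bh_model=True)
speaker = torch.tensor([0])                                   # class-conditioning label
codes = model.generate(speaker, shape=(22, 2), batch_size=1)  # (1, 22, 2) body/hand indices
```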
nets/spg/s2g_face.py DELETED
@@ -1,226 +0,0 @@
1
- '''
2
- not exactly the same as the official repo but the results are good
3
- '''
4
- import sys
5
- import os
6
-
7
- from transformers import Wav2Vec2Processor
8
-
9
- from .wav2vec import Wav2Vec2Model
10
- from torchaudio.sox_effects import apply_effects_tensor
11
-
12
- sys.path.append(os.getcwd())
13
-
14
- import numpy as np
15
- import torch
16
- import torch.nn as nn
17
- import torch.nn.functional as F
18
- import torchaudio as ta
19
- import math
20
- from nets.layers import SeqEncoder1D, SeqTranslator1D, ConvNormRelu
21
-
22
-
23
- """ from https://github.com/ai4r/Gesture-Generation-from-Trimodal-Context.git """
24
-
25
-
26
- def audio_chunking(audio: torch.Tensor, frame_rate: int = 30, chunk_size: int = 16000):
27
- """
28
- :param audio: 1 x T tensor containing a 16kHz audio signal
29
- :param frame_rate: frame rate for video (we need one audio chunk per video frame)
30
- :param chunk_size: number of audio samples per chunk
31
- :return: num_chunks x chunk_size tensor containing sliced audio
32
- """
33
- samples_per_frame = 16000 // frame_rate
34
- padding = (chunk_size - samples_per_frame) // 2
35
- audio = torch.nn.functional.pad(audio.unsqueeze(0), pad=[padding, padding]).squeeze(0)
36
- anchor_points = list(range(chunk_size//2, audio.shape[-1]-chunk_size//2, samples_per_frame))
37
- audio = torch.cat([audio[:, i-chunk_size//2:i+chunk_size//2] for i in anchor_points], dim=0)
38
- return audio
39
-
40
-
41
- class MeshtalkEncoder(nn.Module):
42
- def __init__(self, latent_dim: int = 128, model_name: str = 'audio_encoder'):
43
- """
44
- :param latent_dim: size of the latent audio embedding
45
- :param model_name: name of the model, used to load and save the model
46
- """
47
- super().__init__()
48
-
49
- self.melspec = ta.transforms.MelSpectrogram(
50
- sample_rate=16000, n_fft=2048, win_length=800, hop_length=160, n_mels=80
51
- )
52
-
53
- conv_len = 5
54
- self.convert_dimensions = torch.nn.Conv1d(80, 128, kernel_size=conv_len)
55
- self.weights_init(self.convert_dimensions)
56
- self.receptive_field = conv_len
57
-
58
- convs = []
59
- for i in range(6):
60
- dilation = 2 * (i % 3 + 1)
61
- self.receptive_field += (conv_len - 1) * dilation
62
- convs += [torch.nn.Conv1d(128, 128, kernel_size=conv_len, dilation=dilation)]
63
- self.weights_init(convs[-1])
64
- self.convs = torch.nn.ModuleList(convs)
65
- self.code = torch.nn.Linear(128, latent_dim)
66
-
67
- self.apply(lambda x: self.weights_init(x))
68
-
69
- def weights_init(self, m):
70
- if isinstance(m, torch.nn.Conv1d):
71
- torch.nn.init.xavier_uniform_(m.weight)
72
- try:
73
- torch.nn.init.constant_(m.bias, .01)
74
- except:
75
- pass
76
-
77
- def forward(self, audio: torch.Tensor):
78
- """
79
- :param audio: B x T x 16000 Tensor containing 1 sec of audio centered around the current time frame
80
- :return: code: B x T x latent_dim Tensor containing a latent audio code/embedding
81
- """
82
- B, T = audio.shape[0], audio.shape[1]
83
- x = self.melspec(audio).squeeze(1)
84
- x = torch.log(x.clamp(min=1e-10, max=None))
85
- if T == 1:
86
- x = x.unsqueeze(1)
87
-
88
- # Convert to the right dimensionality
89
- x = x.view(-1, x.shape[2], x.shape[3])
90
- x = F.leaky_relu(self.convert_dimensions(x), .2)
91
-
92
- # Process stacks
93
- for conv in self.convs:
94
- x_ = F.leaky_relu(conv(x), .2)
95
- if self.training:
96
- x_ = F.dropout(x_, .2)
97
- l = (x.shape[2] - x_.shape[2]) // 2
98
- x = (x[:, :, l:-l] + x_) / 2
99
-
100
- x = torch.mean(x, dim=-1)
101
- x = x.view(B, T, x.shape[-1])
102
- x = self.code(x)
103
-
104
- return {"code": x}
105
-
106
-
107
- class AudioEncoder(nn.Module):
108
- def __init__(self, in_dim, out_dim, identity=False, num_classes=0):
109
- super().__init__()
110
- self.identity = identity
111
- if self.identity:
112
- in_dim = in_dim + 64
113
- self.id_mlp = nn.Conv1d(num_classes, 64, 1, 1)
114
- self.first_net = SeqTranslator1D(in_dim, out_dim,
115
- min_layers_num=3,
116
- residual=True,
117
- norm='ln'
118
- )
119
- self.grus = nn.GRU(out_dim, out_dim, 1, batch_first=True)
120
- self.dropout = nn.Dropout(0.1)
121
- # self.att = nn.MultiheadAttention(out_dim, 4, dropout=0.1, batch_first=True)
122
-
123
- def forward(self, spectrogram, pre_state=None, id=None, time_steps=None):
124
-
125
- spectrogram = spectrogram
126
- spectrogram = self.dropout(spectrogram)
127
- if self.identity:
128
- id = id.reshape(id.shape[0], -1, 1).repeat(1, 1, spectrogram.shape[2]).to(torch.float32)
129
- id = self.id_mlp(id)
130
- spectrogram = torch.cat([spectrogram, id], dim=1)
131
- x1 = self.first_net(spectrogram)# .permute(0, 2, 1)
132
- if time_steps is not None:
133
- x1 = F.interpolate(x1, size=time_steps, align_corners=False, mode='linear')
134
- # x1, _ = self.att(x1, x1, x1)
135
- # x1, hidden_state = self.grus(x1)
136
- # x1 = x1.permute(0, 2, 1)
137
- hidden_state=None
138
-
139
- return x1, hidden_state
140
-
141
-
142
- class Generator(nn.Module):
143
- def __init__(self,
144
- n_poses,
145
- each_dim: list,
146
- dim_list: list,
147
- training=False,
148
- device=None,
149
- identity=True,
150
- num_classes=0,
151
- ):
152
- super().__init__()
153
-
154
- self.training = training
155
- self.device = device
156
- self.gen_length = n_poses
157
- self.identity = identity
158
-
159
- norm = 'ln'
160
- in_dim = 256
161
- out_dim = 256
162
-
163
- self.encoder_choice = 'faceformer'
164
-
165
- if self.encoder_choice == 'meshtalk':
166
- self.audio_encoder = MeshtalkEncoder(latent_dim=in_dim)
167
- elif self.encoder_choice == 'faceformer':
168
- # wav2vec 2.0 weights initialization
169
- self.audio_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h") # "vitouphy/wav2vec2-xls-r-300m-phoneme""facebook/wav2vec2-base-960h"
170
- self.audio_encoder.feature_extractor._freeze_parameters()
171
- self.audio_feature_map = nn.Linear(768, in_dim)
172
- else:
173
- self.audio_encoder = AudioEncoder(in_dim=64, out_dim=out_dim)
174
-
175
- self.audio_middle = AudioEncoder(in_dim, out_dim, identity, num_classes)
176
-
177
- self.dim_list = dim_list
178
-
179
- self.decoder = nn.ModuleList()
180
- self.final_out = nn.ModuleList()
181
-
182
- self.decoder.append(nn.Sequential(
183
- ConvNormRelu(out_dim, 64, norm=norm),
184
- ConvNormRelu(64, 64, norm=norm),
185
- ConvNormRelu(64, 64, norm=norm),
186
- ))
187
- self.final_out.append(nn.Conv1d(64, each_dim[0], 1, 1))
188
-
189
- self.decoder.append(nn.Sequential(
190
- ConvNormRelu(out_dim, out_dim, norm=norm),
191
- ConvNormRelu(out_dim, out_dim, norm=norm),
192
- ConvNormRelu(out_dim, out_dim, norm=norm),
193
- ))
194
- self.final_out.append(nn.Conv1d(out_dim, each_dim[3], 1, 1))
195
-
196
- def forward(self, in_spec, gt_poses=None, id=None, pre_state=None, time_steps=None):
197
- if self.training:
198
- time_steps = gt_poses.shape[1]
199
-
200
- # vector, hidden_state = self.audio_encoder(in_spec, pre_state, time_steps=time_steps)
201
- if self.encoder_choice == 'meshtalk':
202
- in_spec = audio_chunking(in_spec.squeeze(-1), frame_rate=30, chunk_size=16000)
203
- feature = self.audio_encoder(in_spec.unsqueeze(0))["code"].transpose(1, 2)
204
- elif self.encoder_choice == 'faceformer':
205
- hidden_states = self.audio_encoder(in_spec.reshape(in_spec.shape[0], -1), frame_num=time_steps).last_hidden_state
206
- feature = self.audio_feature_map(hidden_states).transpose(1, 2)
207
- else:
208
- feature, hidden_state = self.audio_encoder(in_spec, pre_state, time_steps=time_steps)
209
-
210
- # hidden_states = in_spec
211
-
212
- feature, _ = self.audio_middle(feature, id=id)
213
-
214
- out = []
215
-
216
- for i in range(self.decoder.__len__()):
217
- mid = self.decoder[i](feature)
218
- mid = self.final_out[i](mid)
219
- out.append(mid)
220
-
221
- out = torch.cat(out, dim=1)
222
- out = out.transpose(1, 2)
223
-
224
- return out, None
225
-
226
-
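`audio_chunking` above pads a 16 kHz waveform and slices one `chunk_size`-sample window per video frame, which is how the MeshTalk-style encoder consumes audio. An illustrative call (the 2-second input is an assumption for the example):

```python
import torch

wav = torch.randn(1, 16000 * 2)                                # 2 s of mono 16 kHz audio
chunks = audio_chunking(wav, frame_rate=30, chunk_size=16000)
print(chunks.shape)                                            # about (60, 16000): one chunk per 30 fps frame
```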
nets/spg/s2glayers.py DELETED
@@ -1,522 +0,0 @@
1
- '''
2
- not exactly the same as the official repo but the results are good
3
- '''
4
- import sys
5
- import os
6
-
7
- sys.path.append(os.getcwd())
8
-
9
- import numpy as np
10
- import torch
11
- import torch.nn as nn
12
- import torch.nn.functional as F
13
- import math
14
- from nets.layers import SeqEncoder1D, SeqTranslator1D
15
-
16
- """ from https://github.com/ai4r/Gesture-Generation-from-Trimodal-Context.git """
17
-
18
-
19
- class Conv2d_tf(nn.Conv2d):
20
- """
21
- Conv2d with the padding behavior from TF
22
- from https://github.com/mlperf/inference/blob/482f6a3beb7af2fb0bd2d91d6185d5e71c22c55f/others/edge/object_detection/ssd_mobilenet/pytorch/utils.py
23
- """
24
-
25
- def __init__(self, *args, **kwargs):
26
- super(Conv2d_tf, self).__init__(*args, **kwargs)
27
- self.padding = kwargs.get("padding", "SAME")
28
-
29
- def _compute_padding(self, input, dim):
30
- input_size = input.size(dim + 2)
31
- filter_size = self.weight.size(dim + 2)
32
- effective_filter_size = (filter_size - 1) * self.dilation[dim] + 1
33
- out_size = (input_size + self.stride[dim] - 1) // self.stride[dim]
34
- total_padding = max(
35
- 0, (out_size - 1) * self.stride[dim] + effective_filter_size - input_size
36
- )
37
- additional_padding = int(total_padding % 2 != 0)
38
-
39
- return additional_padding, total_padding
40
-
41
- def forward(self, input):
42
- if self.padding == "VALID":
43
- return F.conv2d(
44
- input,
45
- self.weight,
46
- self.bias,
47
- self.stride,
48
- padding=0,
49
- dilation=self.dilation,
50
- groups=self.groups,
51
- )
52
- rows_odd, padding_rows = self._compute_padding(input, dim=0)
53
- cols_odd, padding_cols = self._compute_padding(input, dim=1)
54
- if rows_odd or cols_odd:
55
- input = F.pad(input, [0, cols_odd, 0, rows_odd])
56
-
57
- return F.conv2d(
58
- input,
59
- self.weight,
60
- self.bias,
61
- self.stride,
62
- padding=(padding_rows // 2, padding_cols // 2),
63
- dilation=self.dilation,
64
- groups=self.groups,
65
- )
66
-
67
-
68
- class Conv1d_tf(nn.Conv1d):
69
- """
70
- Conv1d with the padding behavior from TF
71
- modified from https://github.com/mlperf/inference/blob/482f6a3beb7af2fb0bd2d91d6185d5e71c22c55f/others/edge/object_detection/ssd_mobilenet/pytorch/utils.py
72
- """
73
-
74
- def __init__(self, *args, **kwargs):
75
- super(Conv1d_tf, self).__init__(*args, **kwargs)
76
- self.padding = kwargs.get("padding")
77
-
78
- def _compute_padding(self, input, dim):
79
- input_size = input.size(dim + 2)
80
- filter_size = self.weight.size(dim + 2)
81
- effective_filter_size = (filter_size - 1) * self.dilation[dim] + 1
82
- out_size = (input_size + self.stride[dim] - 1) // self.stride[dim]
83
- total_padding = max(
84
- 0, (out_size - 1) * self.stride[dim] + effective_filter_size - input_size
85
- )
86
- additional_padding = int(total_padding % 2 != 0)
87
-
88
- return additional_padding, total_padding
89
-
90
- def forward(self, input):
91
- # if self.padding == "valid":
92
- # return F.conv1d(
93
- # input,
94
- # self.weight,
95
- # self.bias,
96
- # self.stride,
97
- # padding=0,
98
- # dilation=self.dilation,
99
- # groups=self.groups,
100
- # )
101
- rows_odd, padding_rows = self._compute_padding(input, dim=0)
102
- if rows_odd:
103
- input = F.pad(input, [0, rows_odd])
104
-
105
- return F.conv1d(
106
- input,
107
- self.weight,
108
- self.bias,
109
- self.stride,
110
- padding=(padding_rows // 2),
111
- dilation=self.dilation,
112
- groups=self.groups,
113
- )
114
-
115
-
116
- def ConvNormRelu(in_channels, out_channels, type='1d', downsample=False, k=None, s=None, padding='valid', groups=1,
117
- nonlinear='lrelu', bn='bn'):
118
- if k is None and s is None:
119
- if not downsample:
120
- k = 3
121
- s = 1
122
- padding = 'same'
123
- else:
124
- k = 4
125
- s = 2
126
- padding = 'valid'
127
-
128
- if type == '1d':
129
- conv_block = Conv1d_tf(in_channels, out_channels, kernel_size=k, stride=s, padding=padding, groups=groups)
130
- norm_block = nn.BatchNorm1d(out_channels)
131
- elif type == '2d':
132
- conv_block = Conv2d_tf(in_channels, out_channels, kernel_size=k, stride=s, padding=padding, groups=groups)
133
- norm_block = nn.BatchNorm2d(out_channels)
134
- else:
135
- assert False
136
- if bn != 'bn':
137
- if bn == 'gn':
138
- norm_block = nn.GroupNorm(1, out_channels)
139
- elif bn == 'ln':
140
- norm_block = nn.LayerNorm(out_channels)
141
- else:
142
- norm_block = nn.Identity()
143
- if nonlinear == 'lrelu':
144
- nlinear = nn.LeakyReLU(0.2, True)
145
- elif nonlinear == 'tanh':
146
- nlinear = nn.Tanh()
147
- elif nonlinear == 'none':
148
- nlinear = nn.Identity()
149
-
150
- return nn.Sequential(
151
- conv_block,
152
- norm_block,
153
- nlinear
154
- )
155
-
156
-
157
- class UnetUp(nn.Module):
158
- def __init__(self, in_ch, out_ch):
159
- super(UnetUp, self).__init__()
160
- self.conv = ConvNormRelu(in_ch, out_ch)
161
-
162
- def forward(self, x1, x2):
163
- # x1 = torch.repeat_interleave(x1, 2, dim=2)
164
- # x1 = x1[:, :, :x2.shape[2]]
165
- x1 = torch.nn.functional.interpolate(x1, size=x2.shape[2], mode='linear')
166
- x = x1 + x2
167
- x = self.conv(x)
168
- return x
169
-
170
-
171
- class UNet(nn.Module):
172
- def __init__(self, input_dim, dim):
173
- super(UNet, self).__init__()
174
- # dim = 512
175
- self.down1 = nn.Sequential(
176
- ConvNormRelu(input_dim, input_dim, '1d', False),
177
- ConvNormRelu(input_dim, dim, '1d', False),
178
- ConvNormRelu(dim, dim, '1d', False)
179
- )
180
- self.gru = nn.GRU(dim, dim, 1, batch_first=True)
181
- self.down2 = ConvNormRelu(dim, dim, '1d', True)
182
- self.down3 = ConvNormRelu(dim, dim, '1d', True)
183
- self.down4 = ConvNormRelu(dim, dim, '1d', True)
184
- self.down5 = ConvNormRelu(dim, dim, '1d', True)
185
- self.down6 = ConvNormRelu(dim, dim, '1d', True)
186
- self.up1 = UnetUp(dim, dim)
187
- self.up2 = UnetUp(dim, dim)
188
- self.up3 = UnetUp(dim, dim)
189
- self.up4 = UnetUp(dim, dim)
190
- self.up5 = UnetUp(dim, dim)
191
-
192
- def forward(self, x1, pre_pose=None, w_pre=False):
193
- x2_0 = self.down1(x1)
194
- if w_pre:
195
- i = 1
196
- x2_pre = self.gru(x2_0[:,:,0:i].permute(0,2,1), pre_pose[:,:,-1:].permute(2,0,1).contiguous())[0].permute(0,2,1)
197
- x2 = torch.cat([x2_pre, x2_0[:,:,i:]], dim=-1)
198
- # x2 = torch.cat([pre_pose, x2_0], dim=2) # [B, 512, 15]
199
- else:
200
- # x2 = self.gru(x2_0.transpose(1, 2))[0].transpose(1,2)
201
- x2 = x2_0
202
- x3 = self.down2(x2)
203
- x4 = self.down3(x3)
204
- x5 = self.down4(x4)
205
- x6 = self.down5(x5)
206
- x7 = self.down6(x6)
207
- x = self.up1(x7, x6)
208
- x = self.up2(x, x5)
209
- x = self.up3(x, x4)
210
- x = self.up4(x, x3)
211
- x = self.up5(x, x2) # [B, 512, 15]
212
- return x, x2_0
213
-
214
-
215
- class AudioEncoder(nn.Module):
-     def __init__(self, n_frames, template_length, pose=False, common_dim=512):
-         super().__init__()
-         self.n_frames = n_frames
-         self.pose = pose
-         self.step = 0
-         self.weight = 0
-         if self.pose:
-             # self.first_net = nn.Sequential(
-             #     ConvNormRelu(1, 64, '2d', False),
-             #     ConvNormRelu(64, 64, '2d', True),
-             #     ConvNormRelu(64, 128, '2d', False),
-             #     ConvNormRelu(128, 128, '2d', True),
-             #     ConvNormRelu(128, 256, '2d', False),
-             #     ConvNormRelu(256, 256, '2d', True),
-             #     ConvNormRelu(256, 256, '2d', False),
-             #     ConvNormRelu(256, 256, '2d', False, padding='VALID')
-             # )
-             # decoder_layer = nn.TransformerDecoderLayer(d_model=args.feature_dim, nhead=4,
-             #                                            dim_feedforward=2 * args.feature_dim, batch_first=True)
-             # a = nn.TransformerDecoder
-             self.first_net = SeqTranslator1D(256, 256,
-                                              min_layers_num=4,
-                                              residual=True
-                                              )
-             self.dropout_0 = nn.Dropout(0.1)
-             self.mu_fc = nn.Conv1d(256, 128, 1, 1)
-             self.var_fc = nn.Conv1d(256, 128, 1, 1)
-             self.trans_motion = SeqTranslator1D(common_dim, common_dim,
-                                                 kernel_size=1,
-                                                 stride=1,
-                                                 min_layers_num=3,
-                                                 residual=True
-                                                 )
-             # self.att = nn.MultiheadAttention(64 + template_length, 4, dropout=0.1)
-             self.unet = UNet(128 + template_length, common_dim)
-
-         else:
-             self.first_net = SeqTranslator1D(256, 256,
-                                              min_layers_num=4,
-                                              residual=True
-                                              )
-             self.dropout_0 = nn.Dropout(0.1)
-             # self.att = nn.MultiheadAttention(256, 4, dropout=0.1)
-             self.unet = UNet(256, 256)
-             self.dropout_1 = nn.Dropout(0.0)
-
-     def forward(self, spectrogram, time_steps=None, template=None, pre_pose=None, w_pre=False):
-         self.step = self.step + 1
-         if self.pose:
-             spect = spectrogram.transpose(1, 2)
-             if w_pre:
-                 spect = spect[:, :, :]
-
-             out = self.first_net(spect)
-             out = self.dropout_0(out)
-
-             mu = self.mu_fc(out)
-             var = self.var_fc(out)
-             audio = self.__reparam(mu, var)
-             # audio = out
-
-             # template = self.trans_motion(template)
-             x1 = torch.cat([audio, template], dim=1)  # .permute(2,0,1)
-             # x1 = out
-             # x1, _ = self.att(x1, x1, x1)
-             # x1 = x1.permute(1,2,0)
-             x1, x2_0 = self.unet(x1, pre_pose=pre_pose, w_pre=w_pre)
-         else:
-             spectrogram = spectrogram.transpose(1, 2)
-             x1 = self.first_net(spectrogram)  # .permute(2,0,1)
-             # out, _ = self.att(out, out, out)
-             # out = out.permute(1, 2, 0)
-             x1 = self.dropout_0(x1)
-             x1, x2_0 = self.unet(x1)
-             x1 = self.dropout_1(x1)
-             mu = None
-             var = None
-
-         return x1, (mu, var), x2_0
-
-     def __reparam(self, mu, log_var):
-         std = torch.exp(0.5 * log_var)
-         eps = torch.randn_like(std, device='cuda')
-         z = eps * std + mu
-         return z
-
-
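`__reparam` is the standard VAE reparameterisation trick: a sample z = mu + exp(0.5 · log_var) · eps with eps ~ N(0, I), written so gradients flow through mu and log_var. A self-contained illustration (on CPU for simplicity; the method above hard-codes device='cuda'):

```python
# Reparameterisation trick: differentiable sampling from N(mu, exp(log_var)).
import torch

mu = torch.zeros(8, 128, 88, requires_grad=True)
log_var = torch.zeros(8, 128, 88, requires_grad=True)

std = torch.exp(0.5 * log_var)
eps = torch.randn_like(std)
z = eps * std + mu            # same formula as __reparam above

z.sum().backward()            # gradients reach both distribution parameters
print(mu.grad.shape, log_var.grad.shape)
```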
- class Generator(nn.Module):
-     def __init__(self,
-                  n_poses,
-                  pose_dim,
-                  pose,
-                  n_pre_poses,
-                  each_dim: list,
-                  dim_list: list,
-                  use_template=False,
-                  template_length=0,
-                  training=False,
-                  device=None,
-                  separate=False,
-                  expression=False
-                  ):
-         super().__init__()
-
-         self.use_template = use_template
-         self.template_length = template_length
-         self.training = training
-         self.device = device
-         self.separate = separate
-         self.pose = pose
-         self.decoderf = True
-         self.expression = expression
-
-         common_dim = 256
-
-         if self.use_template:
-             assert template_length > 0
-             # self.KLLoss = KLLoss(kl_tolerance=self.config.Train.weights.kl_tolerance).to(self.device)
-             # self.pose_encoder = SeqEncoder1D(
-             #     C_in=pose_dim,
-             #     C_out=512,
-             #     T_in=n_poses,
-             #     min_layer_nums=6
-             #
-             # )
-             self.pose_encoder = SeqTranslator1D(pose_dim - 50, common_dim,
-                                                 # kernel_size=1,
-                                                 # stride=1,
-                                                 min_layers_num=3,
-                                                 residual=True
-                                                 )
-             self.mu_fc = nn.Conv1d(common_dim, template_length, kernel_size=1, stride=1)
-             self.var_fc = nn.Conv1d(common_dim, template_length, kernel_size=1, stride=1)
-
-         else:
-             self.template_length = 0
-
-         self.gen_length = n_poses
-
-         self.audio_encoder = AudioEncoder(n_poses, template_length, True, common_dim)
-         self.speech_encoder = AudioEncoder(n_poses, template_length, False)
-
-         # self.pre_pose_encoder = SeqEncoder1D(
-         #     C_in=pose_dim,
-         #     C_out=128,
-         #     T_in=15,
-         #     min_layer_nums=3
-         #
-         # )
-         # self.pmu_fc = nn.Linear(128, 64)
-         # self.pvar_fc = nn.Linear(128, 64)
-
-         self.pre_pose_encoder = SeqTranslator1D(pose_dim - 50, common_dim,
-                                                 min_layers_num=5,
-                                                 residual=True
-                                                 )
-         self.decoder_in = 256 + 64
-         self.dim_list = dim_list
-
-         if self.separate:
-             self.decoder = nn.ModuleList()
-             self.final_out = nn.ModuleList()
-
-             self.decoder.append(nn.Sequential(
-                 ConvNormRelu(256, 64),
-                 ConvNormRelu(64, 64),
-                 ConvNormRelu(64, 64),
-             ))
-             self.final_out.append(nn.Conv1d(64, each_dim[0], 1, 1))
-
-             self.decoder.append(nn.Sequential(
-                 ConvNormRelu(common_dim, common_dim),
-                 ConvNormRelu(common_dim, common_dim),
-                 ConvNormRelu(common_dim, common_dim),
-             ))
-             self.final_out.append(nn.Conv1d(common_dim, each_dim[1], 1, 1))
-
-             self.decoder.append(nn.Sequential(
-                 ConvNormRelu(common_dim, common_dim),
-                 ConvNormRelu(common_dim, common_dim),
-                 ConvNormRelu(common_dim, common_dim),
-             ))
-             self.final_out.append(nn.Conv1d(common_dim, each_dim[2], 1, 1))
-
-             if self.expression:
-                 self.decoder.append(nn.Sequential(
-                     ConvNormRelu(256, 256),
-                     ConvNormRelu(256, 256),
-                     ConvNormRelu(256, 256),
-                 ))
-                 self.final_out.append(nn.Conv1d(256, each_dim[3], 1, 1))
-         else:
-             self.decoder = nn.Sequential(
-                 ConvNormRelu(self.decoder_in, 512),
-                 ConvNormRelu(512, 512),
-                 ConvNormRelu(512, 512),
-                 ConvNormRelu(512, 512),
-                 ConvNormRelu(512, 512),
-                 ConvNormRelu(512, 512),
-             )
-             self.final_out = nn.Conv1d(512, pose_dim, 1, 1)
-
-     def __reparam(self, mu, log_var):
-         std = torch.exp(0.5 * log_var)
-         eps = torch.randn_like(std, device=self.device)
-         z = eps * std + mu
-         return z
-
-     def forward(self, in_spec, pre_poses, gt_poses, template=None, time_steps=None, w_pre=False, norm=True):
-         if time_steps is not None:
-             self.gen_length = time_steps
-
-         if self.use_template:
-             if self.training:
-                 if w_pre:
-                     in_spec = in_spec[:, 15:, :]
-                     pre_pose = self.pre_pose_encoder(gt_poses[:, 14:15, :-50].permute(0, 2, 1))
-                     pose_enc = self.pose_encoder(gt_poses[:, 15:, :-50].permute(0, 2, 1))
-                     mu = self.mu_fc(pose_enc)
-                     var = self.var_fc(pose_enc)
-                     template = self.__reparam(mu, var)
-                 else:
-                     pre_pose = None
-                     pose_enc = self.pose_encoder(gt_poses[:, :, :-50].permute(0, 2, 1))
-                     mu = self.mu_fc(pose_enc)
-                     var = self.var_fc(pose_enc)
-                     template = self.__reparam(mu, var)
-             elif pre_poses is not None:
-                 if w_pre:
-                     pre_pose = pre_poses[:, -1:, :-50]
-                     if norm:
-                         pre_pose = pre_pose.reshape(1, -1, 55, 5)
-                         pre_pose = torch.cat([F.normalize(pre_pose[..., :3], dim=-1),
-                                               F.normalize(pre_pose[..., 3:5], dim=-1)],
-                                              dim=-1).reshape(1, -1, 275)
-                     pre_pose = self.pre_pose_encoder(pre_pose.permute(0, 2, 1))
-                     template = torch.randn([in_spec.shape[0], self.template_length, self.gen_length]).to(
-                         in_spec.device)
-                 else:
-                     pre_pose = None
-                     template = torch.randn([in_spec.shape[0], self.template_length, self.gen_length]).to(in_spec.device)
-             elif gt_poses is not None:
-                 template = self.pre_pose_encoder(gt_poses[:, :, :-50].permute(0, 2, 1))
-             elif template is None:
-                 pre_pose = None
-                 template = torch.randn([in_spec.shape[0], self.template_length, self.gen_length]).to(in_spec.device)
-         else:
-             template = None
-             mu = None
-             var = None
-
-         a_t_f, (mu2, var2), x2_0 = self.audio_encoder(in_spec, time_steps=time_steps, template=template, pre_pose=pre_pose, w_pre=w_pre)
-         s_f, _, _ = self.speech_encoder(in_spec, time_steps=time_steps)
-
-         out = []
-
-         if self.separate:
-             for i in range(self.decoder.__len__()):
-                 if i == 0 or i == 3:
-                     mid = self.decoder[i](s_f)
-                 else:
-                     mid = self.decoder[i](a_t_f)
-                 mid = self.final_out[i](mid)
-                 out.append(mid)
-             out = torch.cat(out, dim=1)
-
-         else:
-             out = self.decoder(a_t_f)
-             out = self.final_out(out)
-
-         out = out.transpose(1, 2)
-
-         if self.training:
-             if w_pre:
-                 return out, template, mu, var, (mu2, var2, x2_0, pre_pose)
-             else:
-                 return out, template, mu, var, (mu2, var2, None, None)
-         else:
-             return out
-
-
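When `separate=True`, the Generator decodes each body part with its own head and concatenates the per-part outputs along the channel axis (heads 0 and 3 read the speech features `s_f`, the others read the audio-plus-template features `a_t_f`). A minimal sketch of that per-part decoding pattern; the part dimensions below are illustrative assumptions, not values taken from the repo's config:

```python
# Separate per-part heads over shared features, concatenated along channels.
import torch
import torch.nn as nn

each_dim = [3, 63, 90, 100]                          # hypothetical jaw/body/hand/expression sizes
feat = torch.randn(2, 256, 88)                       # shared features [batch, channels, frames]

heads = nn.ModuleList(nn.Conv1d(256, d, 1, 1) for d in each_dim)
out = torch.cat([head(feat) for head in heads], dim=1).transpose(1, 2)
print(out.shape)                                     # torch.Size([2, 88, 256]); frames x summed part dims
```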
- class Discriminator(nn.Module):
-     def __init__(self, pose_dim, pose):
-         super().__init__()
-         self.net = nn.Sequential(
-             Conv1d_tf(pose_dim, 64, kernel_size=4, stride=2, padding='SAME'),
-             nn.LeakyReLU(0.2, True),
-             ConvNormRelu(64, 128, '1d', True),
-             ConvNormRelu(128, 256, '1d', k=4, s=1),
-             Conv1d_tf(256, 1, kernel_size=4, stride=1, padding='SAME'),
-         )
-
-     def forward(self, x):
-         x = x.transpose(1, 2)
-
-         out = self.net(x)
-         return out
-
-
- def main():
-     d = Discriminator(275, 55)
-     x = torch.randn([8, 60, 275])
-     result = d(x)
-
-
- if __name__ == "__main__":
-     main()
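The smoke test above feeds an [8, 60, 275] pose batch through the discriminator, which returns a per-timestep score map rather than a single scalar. A rough standalone stand-in built from plain `nn.Conv1d` (explicit padding approximates the TF-style 'SAME' padding of `Conv1d_tf`; layer widths follow the definition above, everything else is an assumption):

```python
# Approximate 1D convolutional critic producing a score per temporal patch.
import torch
import torch.nn as nn

critic = nn.Sequential(
    nn.Conv1d(275, 64, kernel_size=4, stride=2, padding=1),
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv1d(64, 128, kernel_size=4, stride=2, padding=1),   # stands in for ConvNormRelu(64, 128, '1d', True)
    nn.BatchNorm1d(128),
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv1d(128, 256, kernel_size=4, stride=1, padding=2),  # stands in for ConvNormRelu(128, 256, k=4, s=1)
    nn.BatchNorm1d(256),
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv1d(256, 1, kernel_size=4, stride=1, padding=2),
)

poses = torch.randn(8, 60, 275)             # [batch, frames, pose_dim], as in main()
scores = critic(poses.transpose(1, 2))      # [8, 1, T'] realism score map
print(scores.shape)
```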