# Customize the architecture of SyncNet

The config file of SyncNet defines the architectures of audio and visual encoders. Let's first look at an example of an audio encoder:

```yaml
audio_encoder: # input (1, 80, 52)
  in_channels: 1
  block_out_channels: [32, 64, 128, 256, 512, 1024, 2048]
  downsample_factors: [[2, 1], 2, 2, 1, 2, 2, [2, 3]]
  attn_blocks: [0, 0, 0, 1, 1, 0, 0]
  dropout: 0.0
```

The above architecture accepts a `1 x 80 x 52` image (mel spectrogram) and outputs a `2048 x 1 x 1` feature map. If the resolution of the input image changes, you need to redefine the `downsample_factors` so that the output has shape `D x 1 x 1`, which allows it to be used to compute cosine similarity. Also adjust the `block_out_channels`; in most cases, deeper networks require more channels to store more features. We recommend reading the [EfficientNet](https://arxiv.org/abs/1905.11946) paper, which discusses how to balance the depth and width of CNNs. The `attn_blocks` entry defines whether a given layer has a self-attention layer, where 1 indicates presence and 0 indicates absence.
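When redefining `downsample_factors` for a new input resolution, it helps to verify that the factors actually reduce the spatial dimensions to `1 x 1`. The sketch below assumes each stage shrinks its input by the given factor with floor division (an assumption about the implementation; actual behavior depends on the convolution kernel, stride, and padding), and the helper name `output_size` is hypothetical:

```python
def output_size(size, factors):
    """Estimate the spatial output size after applying downsample_factors.

    Assumes floor-division downsampling per stage; scalar factors apply
    to both dimensions, pairs apply per dimension (height, width).
    """
    h, w = size
    for f in factors:
        fh, fw = (f, f) if isinstance(f, int) else f
        h //= fh
        w //= fw
    return h, w

# Audio encoder from the config above: (80, 52) mel spectrogram
print(output_size((80, 52), [(2, 1), 2, 2, 1, 2, 2, (2, 3)]))  # (1, 1)
```

If the result is not `(1, 1)`, the feature map cannot be flattened into a single `D`-dimensional embedding for cosine similarity, and the factors need rebalancing.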

Now we look at an example of a visual encoder:

```yaml
visual_encoder: # input (48, 128, 256)
  in_channels: 48 # (16 x 3)
  block_out_channels: [64, 128, 256, 256, 512, 1024, 2048, 2048]
  downsample_factors: [[1, 2], 2, 2, 2, 2, 2, 2, 2]
  attn_blocks: [0, 0, 0, 0, 1, 1, 0, 0]
  dropout: 0.0
```

Note how `in_channels` is determined: it equals `num_frames * image_channels`. For a pixel-space SyncNet, `image_channels` is 3, while for a latent-space SyncNet, `image_channels` equals the `latent_channels` of the VAE you are using, typically 4 (SD 1.5, SDXL) or 16 (FLUX, SD3). In the example above, the visual encoder has an input frame length of 16 and is a pixel-space SyncNet, so `in_channels` is `16 x 3 = 48`.
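The relationship above can be captured in a one-line helper. The function name `syncnet_in_channels` is hypothetical, used only to illustrate the arithmetic:

```python
def syncnet_in_channels(num_frames, image_channels):
    # in_channels = num_frames * image_channels:
    # frames are stacked along the channel dimension before the first conv
    return num_frames * image_channels

print(syncnet_in_channels(16, 3))   # 48, pixel-space SyncNet (RGB)
print(syncnet_in_channels(16, 4))   # 64, latent-space SyncNet with an SD 1.5 / SDXL VAE
print(syncnet_in_channels(16, 16))  # 256, latent-space SyncNet with a FLUX / SD3 VAE
```

If you change the input frame length or switch between pixel space and latent space, recompute `in_channels` accordingly; a mismatch will cause a shape error at the first convolution.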