BorisovMaksim committed
Commit d5e939a · 1 Parent(s): 791e637

Update README.md

README.md CHANGED
@@ -8,12 +8,62 @@ sdk_version: 3.28.1
  app_file: app.py
  pinned: false
  ---
- # Docker
- ## Build
- `` docker build . --tag python-docker ``
- ## Run
- `` docker run -p 7860:7860 -e GRADIO_SERVER_NAME=0.0.0.0` -it python-docker:latest ``
+ This repo implements a web interface for the DEMUCS model proposed in [Real Time Speech Enhancement in the Waveform Domain](https://arxiv.org/abs/2006.12847).
+ The model was trained from scratch in PyTorch. It is based on an encoder-decoder architecture with skip connections and is optimized in both the time and frequency domains, using multiple loss functions.
+ You can record your voice in noisy conditions and get a denoised version from the DEMUCS model. A Spectral Gating denoiser is also included as a baseline.
+
+ # Running
+ Without Docker:
+
+ <pre><code>pip install -r requirement.txt
+ python app.py</code></pre>
+
+ Using Docker:
+ <pre><code>docker build . --tag python-docker
+ docker run -p 7860:7860 -e GRADIO_SERVER_NAME=0.0.0.0 -it python-docker:latest</code></pre>
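Setting `GRADIO_SERVER_NAME=0.0.0.0` makes the app listen on all interfaces, so the port published with `-p 7860:7860` is reachable from outside the container. A minimal sketch of how `app.py` might resolve the variable (`resolve_server_name` is a hypothetical helper; recent Gradio versions also read this environment variable themselves):

```python
import os

def resolve_server_name(default: str = "127.0.0.1") -> str:
    """Return the host the Gradio app should bind to.

    Inside a container the app must listen on 0.0.0.0 rather than
    localhost, otherwise the published port is not reachable from the
    host; that is why `docker run` sets GRADIO_SERVER_NAME.
    """
    return os.environ.get("GRADIO_SERVER_NAME", default)
```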
+
+ # Data
+ This project uses the [Valentini](https://datashare.ed.ac.uk/handle/10283/2791) dataset, a parallel database of clean and noisy speech. The database was designed to train and test speech enhancement methods that operate at 48 kHz. It contains 56 speakers and ~10 GB of speech data.
+
+ To improve the model further, a much larger training set from the [DNS](https://www.bing.com/search?q=dns+challenge&cvid=3773a401b19d40269d725a02faf6f79c&aqs=edge.0.69i59j69i57j0l6j69i60.1021j0j4&FORM=ANAB01&PC=U531) challenge can be used.
+
+ # Training
+ The training process is implemented in PyTorch. The data consists of (noisy speech, clean speech) pairs loaded as 2-second samples, randomly cut from the audio and padded if necessary. The model is optimized with SGD. Two loss functions are used: L1 loss and MultiResolutionSTFTLoss, the sum of STFT losses over different window sizes, hop sizes, and FFT sizes.
+
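The random 2-second crop with padding can be sketched as follows (a NumPy illustration assuming 48 kHz audio; the real dataloader is in PyTorch and the helper name is illustrative):

```python
import numpy as np

SAMPLE_RATE = 48_000          # Valentini audio is 48 kHz
SEGMENT = 2 * SAMPLE_RATE     # 2-second training samples

def random_segment(noisy: np.ndarray, clean: np.ndarray, rng=np.random):
    """Cut the same random 2-second window from a (noisy, clean) pair,
    zero-padding both when the recording is shorter than 2 seconds."""
    n = len(noisy)
    if n < SEGMENT:
        pad = SEGMENT - n
        return np.pad(noisy, (0, pad)), np.pad(clean, (0, pad))
    start = rng.randint(0, n - SEGMENT + 1)
    return (noisy[start:start + SEGMENT],
            clean[start:start + SEGMENT])
```

Cutting both waveforms at the same offset keeps the pair aligned, which the losses below require.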
+ $$L_{STFT} = L_{sc} + L_{mag}$$
+
+ $$L_{sc} = \frac{\| |STFT(\tilde{x})| - |STFT(x)| \|_{F}}{\| |STFT(x)| \|_{F}}$$
+
+ $$L_{mag} = \frac{1}{T} \| \log|STFT(\tilde{x})| - \log|STFT(x)| \|_{1}$$
+
+ where $\tilde{x}$ is the enhanced waveform, $x$ is the clean waveform, and $T$ is the number of time frames in the STFT.
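The loss can be illustrated in plain NumPy (a minimal sketch: the actual training code operates on PyTorch tensors, and the window/hop values here are only plausible defaults, not the repo's settings):

```python
import numpy as np

def stft_mag(x, fft_size, hop):
    """Magnitude spectrogram via a framed real FFT with a Hann window."""
    win = np.hanning(fft_size)
    frames = [x[i:i + fft_size] * win
              for i in range(0, len(x) - fft_size + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def stft_loss(enhanced, clean, fft_size, hop, eps=1e-8):
    """L_sc + L_mag for a single STFT resolution."""
    S_e = stft_mag(enhanced, fft_size, hop)
    S_c = stft_mag(clean, fft_size, hop)
    # Spectral convergence: Frobenius-norm ratio of the magnitude error.
    l_sc = np.linalg.norm(S_e - S_c) / (np.linalg.norm(S_c) + eps)
    # Log-magnitude loss: mean absolute log-spectrogram difference.
    l_mag = np.abs(np.log(S_e + eps) - np.log(S_c + eps)).mean()
    return l_sc + l_mag

def multi_res_stft_loss(enhanced, clean,
                        resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Sum of STFT losses over several (fft_size, hop) resolutions."""
    return sum(stft_loss(enhanced, clean, n, h) for n, h in resolutions)
```

Summing over several resolutions prevents the model from overfitting the artifacts of any single STFT window.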
+
+ # Metrics
+ - Perceptual Evaluation of Speech Quality ([PESQ](https://torchmetrics.readthedocs.io/en/stable/audio/perceptual_evaluation_speech_quality.html))
+ - Short-Time Objective Intelligibility ([STOI](https://torchmetrics.readthedocs.io/en/stable/audio/short_time_objective_intelligibility.html))
+
+ PESQ estimates the overall speech quality after denoising, while STOI estimates speech intelligibility after denoising; STOI is highly correlated with the intelligibility of degraded speech signals.
+
+ # Experiments
+ Experiments are tracked on a local [Weights & Biases](https://wandb.ai/site) server. Configs for the different experiments are managed with [hydra](https://hydra.cc/), which makes it easy to track configs and override parameters.
+
+
+ | Experiment | Description | Result |
+ |--------------|:-----:|--------------------------------------------------------|
+ | Baseline | Initial experiment with L1 loss | Poor quality |
+ | Baseline_L1_Multi_STFT_loss | Changed loss to Multi STFT + L1 loss | Better performance |
+ | L1_Multi_STFT_no_resample | Tried to train without resampling | No improvement, probably because of the ReLU on the last layer |
+ | Updated_DEMUCS | Removed the ReLU from the last layer | Significant improvement |
+ | wav_normalization | Tried to normalize wavs by std during training | Small improvement |
+ | original_sr | Trained with the original sample rate | Significant improvement |
+ | increased_L | Increased the number of encoder-decoder pairs from 3 to 5 | Performance comparable with original_sr |
+ | double_sr | Trained with double sample rate | Small improvement |
+ | replicate paper | Lowered the learning rate and fixed a bug in the dataloader | Massive improvement! |
+
+
+ ![img.png](images/img.png)