new commits

Commit 5b49c1c526 by lhzstar, 2023-10-27 15:03:22 +08:00
100 changed files with 10126 additions and 0 deletions

28
.gitignore vendored Normal file

@@ -0,0 +1,28 @@
saved_models/
out_audios/
launch.json
*.pyc
*.aux
*.log
*.out
*.synctex.gz
*.suo
*__pycache__
*.idea
*.ipynb_checkpoints
*.pickle
*.npy
*.bz2
*.blg
*.bbl
*.bcf
*.toc
*.sh
*.pt
*.whl
*.m4a
log/
syn_results
toolbox_results
dim_reduction_results

18
CHANGELOG.md Normal file

@@ -0,0 +1,18 @@
## What's new
**2022.05.19** We now compute the GE2E loss in the encoder on CUDA instead of the originally configured CPU, which speeds up encoder training.<br>
**2022.07.15** We added loss animation plots for the synthesizer and vocoder.<br>
**2022.07.19** We added response-time reporting and Griffin-Lim vocoder results to demo_toolbox.<br>
**2022.07.29** We added model validation for the encoder, synthesizer, and vocoder.<br>
**2022.08.02** We added VoxCeleb train and dev data for the encoder, and a [noisereduce](https://github.com/timsainb/noisereduce) denoising step for the vocoder's output wav.<br>
**2022.08.06** We split long input text into short sentences with spaCy before feeding it to the synthesizer. Make sure to install the spaCy model en_core_web_sm with
`python -m spacy download en_core_web_sm`<br>
**2022.09.02** We set prop_decrease=0.6 for male and 0.9 for female voices in the noisereduce call (output filtering uses different parameters for male and female voices).<br>
**2022.09.26** We added speed adjustment for output audio using Praat; install parselmouth with `pip install praat-parselmouth`.<br>
**2022.10.10** We added a voice filter (voice beautification) for input audio; the input audio embedding and the standard audio embedding are blended with a 7:3 weight ratio (see the sketch below).<br>
**2022.10.25** We set small values (<0.06) in the embedding vector to zero.<br>
**2022.10.26** The split frequency for input audio is 170 Hz; the split frequency for output noise reduction is 165 Hz.<br>
**2022.12.01** Merged the single sentences into the input.<br>
**2022.12.31** Added speaker embedding dimension-reduction visualization results.<br>
**2023.01.01** Added more text preprocessing and cleaning for TTS text input.<br>
**2023.02.27** Preprocessed ASCII characters and abbreviations.<br>
**2023.06.09** We added VCTK train and dev data for the synthesizer. We also combine a [deep learning denoiser](https://github.com/facebookresearch/denoiser) with the [noisereduce](https://github.com/timsainb/noisereduce) denoiser for better output wav quality.<br>
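
The voice filter and small-value zeroing described above can be summarized with a short sketch. This is illustrative only: `input_embed` and `standard_embed` stand in for embeddings produced by the speaker encoder, and the 0.7/0.3 weights and 0.06 threshold come from the entries above.
```
import numpy as np

def filter_embedding(input_embed: np.ndarray,
                     standard_embed: np.ndarray,
                     weight: float = 0.7,
                     zero_thres: float = 0.06) -> np.ndarray:
    """Blend the user's embedding with a standard-voice embedding (7:3 by default),
    then zero out small components (2022.10.10 / 2022.10.25 entries)."""
    embed = weight * input_embed + (1.0 - weight) * standard_embed
    embed[embed < zero_thres] = 0.0  # suppress small (noisy) components
    return embed

# Random stand-ins for real encoder embeddings, just to show the call
rng = np.random.default_rng(0)
blended = filter_embedding(rng.random(256), rng.random(256))
```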

24
LICENSE.md Normal file

@@ -0,0 +1,24 @@
MIT License
Modified & original work Copyright (c) 2019 Corentin Jemine (https://github.com/CorentinJ)
Original work Copyright (c) 2018 Rayhane Mama (https://github.com/Rayhane-mamah)
Original work Copyright (c) 2019 fatchord (https://github.com/fatchord)
Original work Copyright (c) 2015 braindead (https://github.com/braindead)
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

132
README.md Normal file

@@ -0,0 +1,132 @@
# Real-Time Voice Cloning v2
### What is this?
It is an improved version of [Real-Time-Voice-Cloning](https://github.com/CorentinJ/Real-Time-Voice-Cloning). Our emotion voice cloning implementation is [here](https://github.com/liuhaozhe6788/voice-cloning-collab/tree/add_emotion)!
## Installation
1. Install [ffmpeg](https://ffmpeg.org/download.html#get-packages). This is necessary for reading audio files.
2. Create a new conda environment with
```
conda create -n rtvc python=3.7.13
```
3. Install [PyTorch](https://download.pytorch.org/whl/torch_stable.html). Pick the proposed CUDA version if you have a GPU, otherwise pick the CPU version.
The versions used in this project: `torch==1.9.1+cu111`
`torchvision==0.10.1+cu111`
4. Install the remaining requirements with
```
pip install -r requirements.txt
```
5. Install the spaCy model en_core_web_sm with
`python -m spacy download en_core_web_sm` (a quick check is shown below)
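The spaCy model is used to split long input text into short sentences before it is fed to the synthesizer. A quick, illustrative check that the model is installed (the sample sentence is arbitrary):
```
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Hello there. This is a quick check that sentence splitting works.")
print([sent.text.strip() for sent in doc.sents])
# ['Hello there.', 'This is a quick check that sentence splitting works.']
```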
## Training
### Encoder
**Download dataset**
1. [LibriSpeech](https://www.openslr.org/12): train-other-500 for training, dev-other for validation
(extract as <datasets_root>/LibriSpeech/<dataset_name>)
2. [VoxCeleb1](https://mm.kaist.ac.kr/datasets/voxceleb/): Dev A - D for training, Test for validation, plus the metadata file `vox1_meta.csv` (extract as <datasets_root>/VoxCeleb1/ and <datasets_root>/VoxCeleb1/vox1_meta.csv)
3. [VoxCeleb2](https://mm.kaist.ac.kr/datasets/voxceleb/): Dev A - H for training, Test for validation
(extract as <datasets_root>/VoxCeleb2/)
**Encoder preprocessing**
```
python encoder_preprocess.py <datasets_root>
```
**Encoder training**
It is recommended to start a visdom server to monitor training with
```
visdom
```
then start training with
```
python encoder_train.py <model_id> <datasets_root>/SV2TTS/encoder
```
### Synthesizer
**Download dataset**
1. [LibriSpeech](https://www.openslr.org/12): train-clean-100 and train-clean-360 for training, dev-clean for validation (extract as <datasets_root>/LibriSpeech/<dataset_name>)
2. [LibriSpeech alignments](https://drive.google.com/file/d/1WYfgr31T-PPwMcxuAq09XZfHQO5Mw8fE/view?usp=sharing): merge the directory structure with the LibriSpeech datasets you have downloaded (do not include alignments for datasets you have not downloaded, or the scripts will assume you have those datasets)
3. [VCTK](https://datashare.ed.ac.uk/handle/10283/3443): used for training and validation
**Synthesizer preprocessing:**
```
python synthesizer_preprocess_audio.py <datasets_root>
python synthesizer_preprocess_embeds.py <datasets_root>/SV2TTS/synthesizer
```
**Synthesizer training:**
```
python synthesizer_train.py <model_id> <datasets_root>/SV2TTS/synthesizer --use_tb
```
If you want to monitor the training progress, run
```
tensorboard --logdir log/vc/synthesizer --host localhost --port 8088
```
### Vocoder
**Download dataset**
Same as the synthesizer. You can skip this step if you have already downloaded the synthesizer training dataset.
**Vocoder preprocessing:**
```
python vocoder_preprocess.py <datasets_root>
```
**Vocoder training:**
```
python vocoder_train.py <model_id> <datasets_root> --use_tb
```
If you want to monitor the training progress, run
```
tensorboard --logdir log/vc/vocoder --host localhost --port 8080
```
**Note:**
Training checkpoints are saved periodically, so you can rerun the training command and training will resume from the most recent checkpoint if one exists.
## Inference
**Terminal:**
```
python demo_cli.py
```
First enter the number of reference audios, then the audio file paths, and finally the text to be synthesized. The attention alignments and mel spectrogram are saved in syn_results/, and the generated audio is saved in out_audios/.
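If you prefer to call the models from your own script rather than the interactive prompts, a minimal sketch along the lines of demo_cli.py looks like this; the model directory, reference audio path, and sample text are placeholders, and the calls mirror the code in this repository:
```
from pathlib import Path

import numpy as np
import soundfile as sf

import encoder.inference as encoder_infer
from synthesizer.inference import Synthesizer_infer
from vocoder import inference as vocoder

run_dir = Path("saved_models/default")  # assumed location of the pretrained models
encoder_infer.load_model(run_dir / "encoder.pt")
synthesizer = Synthesizer_infer(run_dir / "synthesizer.pt")
vocoder.load_model(run_dir / "vocoder.pt")

# Embed the reference voice, synthesize a mel spectrogram, then vocode it
wav = encoder_infer.preprocess_wav("reference.wav")  # placeholder reference audio
embed = encoder_infer.embed_utterance(wav)
specs, aligns, stop_tokens = synthesizer.synthesize_spectrograms(
    ["Hello, this is a cloned voice."], [embed], require_visualization=True)
out = vocoder.infer_waveform(specs[0], target=vocoder.hp.voc_target,
                             overlap=vocoder.hp.voc_overlap, crossfade=vocoder.hp.is_crossfade)
sf.write("cloned.wav", out.astype(np.float32), synthesizer.sample_rate)
```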
**GUI demo:**
```
python demo_toolbox.py
```
## Dimension reduction visualization
**Download dataset:**
[LibriSpeech](https://www.openslr.org/12): test-other
(extract as <datasets_root>/LibriSpeech/<dataset_name>)
**Preprocessing:**
```
python encoder_test_preprocess.py <datasets_root>
```
**Visualization:**
```
python encoder_test_visualization.py <model_id> <datasets_root>
```
The results are saved in dim_reduction_results/.
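For reference, the kind of 2-D projection produced by this step can be sketched as follows. This is only an illustration using scikit-learn's t-SNE on random stand-in embeddings; encoder_test_visualization.py may use a different projection method and its own plotting:
```
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Stand-in data: 10 speakers x 20 utterances of 256-dim embeddings
rng = np.random.default_rng(0)
speaker_ids = np.repeat(np.arange(10), 20)
embeds = rng.normal(size=(200, 256)) + speaker_ids[:, None]  # crude per-speaker offsets

proj = TSNE(n_components=2, init="pca", random_state=0).fit_transform(embeds)
plt.scatter(proj[:, 0], proj[:, 1], c=speaker_ids, cmap="tab10", s=10)
plt.title("Speaker embeddings projected to 2-D (illustrative)")
plt.savefig("dim_reduction_example.png")
```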
## Pretrained models
You can download the pretrained models from [this link](https://drive.google.com/drive/folders/11DFU_JBGet_HEwUoPZGDfe-fDZ42eqiG) and extract them to saved_models/default.
## Demo results
The audio results are available [here](https://liuhaozhe6788.github.io/voice-cloning-collab/index.html).

7
css/bootstrap.min.css vendored Normal file

File diff suppressed because one or more lines are too long

196
css/custom.css Normal file

@@ -0,0 +1,196 @@
body {
font-family: "Roboto", "HelveticaNeue", "Helvetica Neue", Helvetica, Arial, sans-serif;
background-color: #FCFCFC;
-webkit-font-smoothing: antialiased;
font-size: 1.8em;
line-height: 1.5;
font-weight: 300;
width: 100%
}
h1, h2, h3, h4, h5, h6 {
color: #263c4c;
}
h2, h3, h4, h5, h6 {
margin-top: 5rem;
margin-bottom: 3rem;
font-weight: bold;
padding-bottom: 10px;
}
h1 { font-size: 3.0rem; }
h2 {
margin-top: 6rem;
font-size: 2.6rem;
}
h3 { font-size: 2.1rem; }
h4,
h5,
h6 { font-size: 1.9rem; }
h2.entry-title {
font-size: 2.1rem;
margin-top: 0;
font-weight: 400;
border-bottom: none;
}
li {
margin-bottom: 0.5rem;
margin-left: 0.7em;
}
img {
max-width: 100%;
height: auto;
vertical-align: middle;
border: 0;
margin: 1em 0;
}
header,
footer {
margin: 4rem 0;
text-align: center;
}
main {
margin: 4rem 0;
}
.container {
width: 90%;
/* max-width: 700px; */
}
.header-logo img {
border-radius: 50%;
border: 2px solid #E1E1E1;
}
.header-logo img:hover {
border-color: #F1F1F1;
}
.site-title {
margin-top: 2rem;
}
.entry-title {
margin-bottom: 0;
}
.entry-title a {
text-decoration: none;
}
.entry-meta {
display: inline-block;
margin-bottom: 2rem;
font-size: 1.6rem;
color: #888;
}
.footer-link {
margin: 2rem 0;
}
.hr {
height: 1px;
margin: 2rem 0;
background: #E1E1E1;
background: -webkit-gradient(linear, left top, right top, from(white), color-stop(#E1E1E1), to(white));
background: -webkit-linear-gradient(left, white, #E1E1E1, white);
background: linear-gradient(to right, white, #E1E1E1, white);
}
article .social {
height: 40px;
padding: 10px 0;
}
address {
margin: 0;
font-size:0.9em;
max-height: 60px;
font-weight: 300;
font-style: normal;
display: block;
}
address a {
text-decoration: none;
}
.avatar-bottom img {
border-radius: 50%;
border: 1px solid #E1E1E1;
float: left;
max-width: 100%;
vertical-align: middle;
width: 32px;
height: 32px;
margin: 0 20px 0 0;
margin-top: -7px;
}
.avatar-bottom img:hover {
border-color: #F1F1F1;
}
.copyright {
font-size:0.9em;
font-weight: 300;
}
.github {
float: right;
}
blockquote {
position: relative;
padding: 10px 10px 10px 32px;
box-sizing: border-box;
font-style: italic;
color: #464646;
background: #e0e0e0;
}
blockquote:before{
display: inline-block;
position: absolute;
top: 0;
left: 0;
vertical-align: middle;
content: "\f10d";
font-family: FontAwesome;
color: #e0e0e0;
font-size: 22px;
line-height: 1;
z-index: 2;
}
blockquote:after{
position: absolute;
content: '';
left: 0;
top: 0;
border-width: 0 0 40px 40px;
border-style: solid;
border-color: transparent #ffffff;
}
blockquote p {
position: relative;
padding: 0;
margin: 10px 0;
z-index: 3;
line-height: 1.7;
}
blockquote cite {
display: block;
text-align: right;
color: #888888;
font-size: 0.9em;
}

427
css/normalize.css vendored Normal file

@@ -0,0 +1,427 @@
/*! normalize.css v3.0.2 | MIT License | git.io/normalize */
/**
* 1. Set default font family to sans-serif.
* 2. Prevent iOS text size adjust after orientation change, without disabling
* user zoom.
*/
html {
font-family: sans-serif; /* 1 */
-ms-text-size-adjust: 100%; /* 2 */
-webkit-text-size-adjust: 100%; /* 2 */
}
/**
* Remove default margin.
*/
body {
margin: 0;
}
/* HTML5 display definitions
========================================================================== */
/**
* Correct `block` display not defined for any HTML5 element in IE 8/9.
* Correct `block` display not defined for `details` or `summary` in IE 10/11
* and Firefox.
* Correct `block` display not defined for `main` in IE 11.
*/
article,
aside,
details,
figcaption,
figure,
footer,
header,
hgroup,
main,
menu,
nav,
section,
summary {
display: block;
}
/**
* 1. Correct `inline-block` display not defined in IE 8/9.
* 2. Normalize vertical alignment of `progress` in Chrome, Firefox, and Opera.
*/
audio,
canvas,
progress,
video {
display: inline-block; /* 1 */
vertical-align: baseline; /* 2 */
}
/**
* Prevent modern browsers from displaying `audio` without controls.
* Remove excess height in iOS 5 devices.
*/
audio:not([controls]) {
display: none;
height: 0;
}
/**
* Address `[hidden]` styling not present in IE 8/9/10.
* Hide the `template` element in IE 8/9/11, Safari, and Firefox < 22.
*/
[hidden],
template {
display: none;
}
/* Links
========================================================================== */
/**
* Remove the gray background color from active links in IE 10.
*/
a {
background-color: transparent;
}
/**
* Improve readability when focused and also mouse hovered in all browsers.
*/
a:active,
a:hover {
outline: 0;
}
/* Text-level semantics
========================================================================== */
/**
* Address styling not present in IE 8/9/10/11, Safari, and Chrome.
*/
abbr[title] {
border-bottom: 1px dotted;
}
/**
* Address style set to `bolder` in Firefox 4+, Safari, and Chrome.
*/
b,
strong {
font-weight: bold;
}
/**
* Address styling not present in Safari and Chrome.
*/
dfn {
font-style: italic;
}
/**
* Address variable `h1` font-size and margin within `section` and `article`
* contexts in Firefox 4+, Safari, and Chrome.
*/
h1 {
font-size: 2em;
margin: 0.67em 0;
}
/**
* Address styling not present in IE 8/9.
*/
mark {
background: #ff0;
color: #000;
}
/**
* Address inconsistent and variable font size in all browsers.
*/
small {
font-size: 80%;
}
/**
* Prevent `sub` and `sup` affecting `line-height` in all browsers.
*/
sub,
sup {
font-size: 75%;
line-height: 0;
position: relative;
vertical-align: baseline;
}
sup {
top: -0.5em;
}
sub {
bottom: -0.25em;
}
/* Embedded content
========================================================================== */
/**
* Remove border when inside `a` element in IE 8/9/10.
*/
img {
border: 0;
}
/**
* Correct overflow not hidden in IE 9/10/11.
*/
svg:not(:root) {
overflow: hidden;
}
/* Grouping content
========================================================================== */
/**
* Address margin not present in IE 8/9 and Safari.
*/
figure {
margin: 1em 40px;
}
/**
* Address differences between Firefox and other browsers.
*/
hr {
-moz-box-sizing: content-box;
box-sizing: content-box;
height: 0;
}
/**
* Contain overflow in all browsers.
*/
pre {
overflow: auto;
}
/**
* Address odd `em`-unit font size rendering in all browsers.
*/
code,
kbd,
pre,
samp {
font-family: monospace, monospace;
font-size: 1em;
}
/* Forms
========================================================================== */
/**
* Known limitation: by default, Chrome and Safari on OS X allow very limited
* styling of `select`, unless a `border` property is set.
*/
/**
* 1. Correct color not being inherited.
* Known issue: affects color of disabled elements.
* 2. Correct font properties not being inherited.
* 3. Address margins set differently in Firefox 4+, Safari, and Chrome.
*/
button,
input,
optgroup,
select,
textarea {
color: inherit; /* 1 */
font: inherit; /* 2 */
margin: 0; /* 3 */
}
/**
* Address `overflow` set to `hidden` in IE 8/9/10/11.
*/
button {
overflow: visible;
}
/**
* Address inconsistent `text-transform` inheritance for `button` and `select`.
* All other form control elements do not inherit `text-transform` values.
* Correct `button` style inheritance in Firefox, IE 8/9/10/11, and Opera.
* Correct `select` style inheritance in Firefox.
*/
button,
select {
text-transform: none;
}
/**
* 1. Avoid the WebKit bug in Android 4.0.* where (2) destroys native `audio`
* and `video` controls.
* 2. Correct inability to style clickable `input` types in iOS.
* 3. Improve usability and consistency of cursor style between image-type
* `input` and others.
*/
button,
html input[type="button"], /* 1 */
input[type="reset"],
input[type="submit"] {
-webkit-appearance: button; /* 2 */
cursor: pointer; /* 3 */
}
/**
* Re-set default cursor for disabled elements.
*/
button[disabled],
html input[disabled] {
cursor: default;
}
/**
* Remove inner padding and border in Firefox 4+.
*/
button::-moz-focus-inner,
input::-moz-focus-inner {
border: 0;
padding: 0;
}
/**
* Address Firefox 4+ setting `line-height` on `input` using `!important` in
* the UA stylesheet.
*/
input {
line-height: normal;
}
/**
* It's recommended that you don't attempt to style these elements.
* Firefox's implementation doesn't respect box-sizing, padding, or width.
*
* 1. Address box sizing set to `content-box` in IE 8/9/10.
* 2. Remove excess padding in IE 8/9/10.
*/
input[type="checkbox"],
input[type="radio"] {
box-sizing: border-box; /* 1 */
padding: 0; /* 2 */
}
/**
* Fix the cursor style for Chrome's increment/decrement buttons. For certain
* `font-size` values of the `input`, it causes the cursor style of the
* decrement button to change from `default` to `text`.
*/
input[type="number"]::-webkit-inner-spin-button,
input[type="number"]::-webkit-outer-spin-button {
height: auto;
}
/**
* 1. Address `appearance` set to `searchfield` in Safari and Chrome.
* 2. Address `box-sizing` set to `border-box` in Safari and Chrome
* (include `-moz` to future-proof).
*/
input[type="search"] {
-webkit-appearance: textfield; /* 1 */
-moz-box-sizing: content-box;
-webkit-box-sizing: content-box; /* 2 */
box-sizing: content-box;
}
/**
* Remove inner padding and search cancel button in Safari and Chrome on OS X.
* Safari (but not Chrome) clips the cancel button when the search input has
* padding (and `textfield` appearance).
*/
input[type="search"]::-webkit-search-cancel-button,
input[type="search"]::-webkit-search-decoration {
-webkit-appearance: none;
}
/**
* Define consistent border, margin, and padding.
*/
fieldset {
border: 1px solid #c0c0c0;
margin: 0 2px;
padding: 0.35em 0.625em 0.75em;
}
/**
* 1. Correct `color` not being inherited in IE 8/9/10/11.
* 2. Remove padding so people aren't caught out if they zero out fieldsets.
*/
legend {
border: 0; /* 1 */
padding: 0; /* 2 */
}
/**
* Remove default vertical scrollbar in IE 8/9/10/11.
*/
textarea {
overflow: auto;
}
/**
* Don't inherit the `font-weight` (applied by a rule above).
* NOTE: the default cannot safely be changed in Chrome and Safari on OS X.
*/
optgroup {
font-weight: bold;
}
/* Tables
========================================================================== */
/**
* Remove most spacing between table cells.
*/
table {
border-collapse: collapse;
border-spacing: 0;
}
td,
th {
padding: 0;
}

418
css/skeleton.css vendored Normal file

@@ -0,0 +1,418 @@
/*
* Skeleton V2.0.4
* Copyright 2014, Dave Gamache
* www.getskeleton.com
* Free to use under the MIT license.
* http://www.opensource.org/licenses/mit-license.php
* 12/29/2014
*/
/* Table of contents
- Grid
- Base Styles
- Typography
- Links
- Buttons
- Forms
- Lists
- Code
- Tables
- Spacing
- Utilities
- Clearing
- Media Queries
*/
/* Grid
*/
.container {
position: relative;
width: 100%;
max-width: 960px;
margin: 0 auto;
padding: 0 20px;
box-sizing: border-box; }
.column,
.columns {
width: 100%;
float: left;
box-sizing: border-box; }
/* For devices larger than 400px */
@media (min-width: 400px) {
.container {
width: 85%;
padding: 0; }
}
/* For devices larger than 550px */
@media (min-width: 550px) {
.container {
width: 80%; }
.column,
.columns {
margin-left: 4%; }
.column:first-child,
.columns:first-child {
margin-left: 0; }
.one.column,
.one.columns { width: 4.66666666667%; }
.two.columns { width: 13.3333333333%; }
.three.columns { width: 22%; }
.four.columns { width: 30.6666666667%; }
.five.columns { width: 39.3333333333%; }
.six.columns { width: 48%; }
.seven.columns { width: 56.6666666667%; }
.eight.columns { width: 65.3333333333%; }
.nine.columns { width: 74.0%; }
.ten.columns { width: 82.6666666667%; }
.eleven.columns { width: 91.3333333333%; }
.twelve.columns { width: 100%; margin-left: 0; }
.one-third.column { width: 30.6666666667%; }
.two-thirds.column { width: 65.3333333333%; }
.one-half.column { width: 48%; }
/* Offsets */
.offset-by-one.column,
.offset-by-one.columns { margin-left: 8.66666666667%; }
.offset-by-two.column,
.offset-by-two.columns { margin-left: 17.3333333333%; }
.offset-by-three.column,
.offset-by-three.columns { margin-left: 26%; }
.offset-by-four.column,
.offset-by-four.columns { margin-left: 34.6666666667%; }
.offset-by-five.column,
.offset-by-five.columns { margin-left: 43.3333333333%; }
.offset-by-six.column,
.offset-by-six.columns { margin-left: 52%; }
.offset-by-seven.column,
.offset-by-seven.columns { margin-left: 60.6666666667%; }
.offset-by-eight.column,
.offset-by-eight.columns { margin-left: 69.3333333333%; }
.offset-by-nine.column,
.offset-by-nine.columns { margin-left: 78.0%; }
.offset-by-ten.column,
.offset-by-ten.columns { margin-left: 86.6666666667%; }
.offset-by-eleven.column,
.offset-by-eleven.columns { margin-left: 95.3333333333%; }
.offset-by-one-third.column,
.offset-by-one-third.columns { margin-left: 34.6666666667%; }
.offset-by-two-thirds.column,
.offset-by-two-thirds.columns { margin-left: 69.3333333333%; }
.offset-by-one-half.column,
.offset-by-one-half.columns { margin-left: 52%; }
}
/* Base Styles
*/
/* NOTE
html is set to 62.5% so that all the REM measurements throughout Skeleton
are based on 10px sizing. So basically 1.5rem = 15px :) */
html {
font-size: 62.5%; }
body {
font-size: 1.5em; /* currently ems cause chrome bug misinterpreting rems on body element */
line-height: 1.6;
font-weight: 400;
font-family: "Raleway", "HelveticaNeue", "Helvetica Neue", Helvetica, Arial, sans-serif;
color: #222; }
/* Typography
*/
h1, h2, h3, h4, h5, h6 {
margin-top: 0;
margin-bottom: 2rem;
font-weight: 300; }
h1 { font-size: 4.0rem; line-height: 1.2; letter-spacing: -.1rem;}
h2 { font-size: 3.6rem; line-height: 1.25; letter-spacing: -.1rem; }
h3 { font-size: 3.0rem; line-height: 1.3; letter-spacing: -.1rem; }
h4 { font-size: 2.4rem; line-height: 1.35; letter-spacing: -.08rem; }
h5 { font-size: 1.8rem; line-height: 1.5; letter-spacing: -.05rem; }
h6 { font-size: 1.5rem; line-height: 1.6; letter-spacing: 0; }
/* Larger than phablet */
@media (min-width: 550px) {
h1 { font-size: 5.0rem; }
h2 { font-size: 4.2rem; }
h3 { font-size: 3.6rem; }
h4 { font-size: 3.0rem; }
h5 { font-size: 2.4rem; }
h6 { font-size: 1.5rem; }
}
p {
margin-top: 0; }
/* Links
*/
a {
color: #1EAEDB; }
a:hover {
color: #0FA0CE; }
/* Buttons
*/
.button,
button,
input[type="submit"],
input[type="reset"],
input[type="button"] {
display: inline-block;
height: 38px;
padding: 0 30px;
color: #555;
text-align: center;
font-size: 11px;
font-weight: 600;
line-height: 38px;
letter-spacing: .1rem;
text-transform: uppercase;
text-decoration: none;
white-space: nowrap;
background-color: transparent;
border-radius: 4px;
border: 1px solid #bbb;
cursor: pointer;
box-sizing: border-box; }
.button:hover,
button:hover,
input[type="submit"]:hover,
input[type="reset"]:hover,
input[type="button"]:hover,
.button:focus,
button:focus,
input[type="submit"]:focus,
input[type="reset"]:focus,
input[type="button"]:focus {
color: #333;
border-color: #888;
outline: 0; }
.button.button-primary,
button.button-primary,
input[type="submit"].button-primary,
input[type="reset"].button-primary,
input[type="button"].button-primary {
color: #FFF;
background-color: #33C3F0;
border-color: #33C3F0; }
.button.button-primary:hover,
button.button-primary:hover,
input[type="submit"].button-primary:hover,
input[type="reset"].button-primary:hover,
input[type="button"].button-primary:hover,
.button.button-primary:focus,
button.button-primary:focus,
input[type="submit"].button-primary:focus,
input[type="reset"].button-primary:focus,
input[type="button"].button-primary:focus {
color: #FFF;
background-color: #1EAEDB;
border-color: #1EAEDB; }
/* Forms
*/
input[type="email"],
input[type="number"],
input[type="search"],
input[type="text"],
input[type="tel"],
input[type="url"],
input[type="password"],
textarea,
select {
height: 38px;
padding: 6px 10px; /* The 6px vertically centers text on FF, ignored by Webkit */
background-color: #fff;
border: 1px solid #D1D1D1;
border-radius: 4px;
box-shadow: none;
box-sizing: border-box; }
/* Removes awkward default styles on some inputs for iOS */
input[type="email"],
input[type="number"],
input[type="search"],
input[type="text"],
input[type="tel"],
input[type="url"],
input[type="password"],
textarea {
-webkit-appearance: none;
-moz-appearance: none;
appearance: none; }
textarea {
min-height: 65px;
padding-top: 6px;
padding-bottom: 6px; }
input[type="email"]:focus,
input[type="number"]:focus,
input[type="search"]:focus,
input[type="text"]:focus,
input[type="tel"]:focus,
input[type="url"]:focus,
input[type="password"]:focus,
textarea:focus,
select:focus {
border: 1px solid #33C3F0;
outline: 0; }
label,
legend {
display: block;
margin-bottom: .5rem;
font-weight: 600; }
fieldset {
padding: 0;
border-width: 0; }
input[type="checkbox"],
input[type="radio"] {
display: inline; }
label > .label-body {
display: inline-block;
margin-left: .5rem;
font-weight: normal; }
/* Lists
*/
ul {
list-style: circle inside; }
ol {
list-style: decimal inside; }
ol, ul {
padding-left: 0;
margin-top: 0; }
ul ul,
ul ol,
ol ol,
ol ul {
margin: 1.5rem 0 1.5rem 3rem;
font-size: 90%; }
li {
margin-bottom: 1rem; }
/* Code
*/
code {
padding: .2rem .5rem;
margin: 0 .2rem;
font-size: 90%;
white-space: nowrap;
background: #F1F1F1;
border: 1px solid #E1E1E1;
border-radius: 4px; }
pre > code {
display: block;
padding: 1rem 1.5rem;
white-space: pre; }
/* Tables
*/
th,
td {
padding: 6px 5px;
text-align: left;
border-bottom: 1px solid #E1E1E1; }
th:first-child,
td:first-child {
padding-left: 0; }
th:last-child,
td:last-child {
padding-right: 0; }
/* Spacing
*/
button,
.button {
margin-bottom: 1rem; }
input,
textarea,
select,
fieldset {
margin-bottom: 0.5rem; }
pre,
blockquote,
dl,
figure,
table,
p,
ul,
ol,
form {
margin-bottom: 1.5rem; }
/* Utilities
*/
.u-full-width {
width: 100%;
box-sizing: border-box; }
.u-max-full-width {
max-width: 100%;
box-sizing: border-box; }
.u-pull-right {
float: right; }
.u-pull-left {
float: left; }
/* Misc
*/
hr {
margin-top: 3rem;
margin-bottom: 3.5rem;
border-width: 0;
border-top: 1px solid #E1E1E1; }
/* Clearing
*/
/* Self Clearing Goodness */
.container:after,
.row:after,
.u-cf {
content: "";
display: table;
clear: both; }
/* Media Queries
*/
/*
Note: The best way to structure the use of media queries is to create the queries
near the relevant code. For example, if you wanted to change the styles for buttons
on small devices, paste the mobile query code up in the buttons section and style it
there.
*/
/* Larger than mobile */
@media (min-width: 400px) {}
/* Larger than phablet (also point when grid becomes active) */
@media (min-width: 550px) {}
/* Larger than tablet */
@media (min-width: 750px) {}
/* Larger than desktop */
@media (min-width: 1000px) {}
/* Larger than Desktop HD */
@media (min-width: 1200px) {}

330
demo_cli.py Normal file

@@ -0,0 +1,330 @@
import argparse
from ctypes import alignment
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
from pathlib import Path
import spacy
import time
if __name__ == '__main__':
parser = argparse.ArgumentParser(
formatter_class=argparse.ArgumentDefaultsHelpFormatter
)
parser.add_argument("--run_id", type=str, default="default", help= \
"Name for this model. By default, training outputs will be stored to saved_models/<run_id>/. If a model state "
"from the same run ID was previously saved, the training will restart from there. Pass -f to overwrite saved "
"states and restart from scratch.")
parser.add_argument("-m", "--models_dir", type=Path, default="saved_models",
help="Directory containing all saved models")
parser.add_argument("--weight", type=float, default=1,
help="weight of input audio for voice filter")
parser.add_argument("--griffin_lim",
action="store_true",
help="if True, use griffin-lim, else use vocoder")
parser.add_argument("--cpu", action="store_true", help=\
"If True, processing is done on CPU, even when a GPU is available.")
parser.add_argument("--no_sound", action="store_true", help=\
"If True, audio won't be played.")
parser.add_argument("--seed", type=int, default=None, help=\
"Optional random number seed value to make toolbox deterministic.")
args = parser.parse_args()
arg_dict = vars(args)
# print_args(args, parser)
# Hide GPUs from Pytorch to force CPU processing
if arg_dict.pop("cpu"):
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
print("Running a test of your configuration...\n")
import numpy as np
import soundfile as sf
import torch
import encoder.inference
import encoder.params_data
from synthesizer.inference import Synthesizer_infer
from synthesizer.utils.cleaners import add_breaks, english_cleaners_predict
from vocoder import inference as vocoder
from vocoder.display import save_attention_multiple, save_spectrogram, save_stop_tokens
from utils.argutils import print_args
from utils.default_models import ensure_default_models
from speed_changer.fixSpeed import *
if torch.cuda.is_available():
device_id = torch.cuda.current_device()
gpu_properties = torch.cuda.get_device_properties(device_id)
## Print some environment information (for debugging purposes)
print("Found %d GPUs available. Using GPU %d (%s) of compute capability %d.%d with "
"%.1fGb total memory.\n" %
(torch.cuda.device_count(),
device_id,
gpu_properties.name,
gpu_properties.major,
gpu_properties.minor,
gpu_properties.total_memory / 1e9))
else:
print("Using CPU for inference.\n")
## Load the models one by one.
if not args.griffin_lim:
print("Preparing the encoder, the synthesizer and the vocoder...")
else:
print("Preparing the encoder and the synthesizer...")
ensure_default_models(args.run_id, Path("saved_models"))
encoder.inference.load_model(list(args.models_dir.glob(f"{args.run_id}/encoder.pt"))[0])
synthesizer = Synthesizer_infer(list(args.models_dir.glob(f"{args.run_id}/synthesizer.pt"))[0])
if not args.griffin_lim:
vocoder.load_model(list(args.models_dir.glob(f"{args.run_id}/vocoder.pt"))[0])
# ## Run a test
# print("Testing your configuration with small inputs.")
# # Forward an audio waveform of zeroes that lasts 1 second. Notice how we can get the encoder's
# # sampling rate, which may differ.
# # If you're unfamiliar with digital audio, know that it is encoded as an array of floats
# # (or sometimes integers, but mostly floats in this project) ranging from -1 to 1.
# # The sampling rate is the number of values (samples) recorded per second, it is set to
# # 16000 for the encoder. Creating an array of length <sampling_rate> will always correspond
# # to an audio of 1 second.
# print("\tTesting the encoder...")
# encoder.embed_utterance(np.zeros(encoder.sampling_rate))
# # Create a dummy embedding. You would normally use the embedding that encoder.embed_utterance
# # returns, but here we're going to make one ourselves just for the sake of showing that it's
# # possible.
# embed = np.random.rand(speaker_embedding_size)
# # Embeddings are L2-normalized (this isn't important here, but if you want to make your own
# # embeddings it will be).
# embed /= np.linalg.norm(embed)
# # The synthesizer can handle multiple inputs with batching. Let's create another embedding to
# # illustrate that
# embeds = [embed, np.zeros(speaker_embedding_size)]
# texts = ["test 1", "test 2"]
# print("\tTesting the synthesizer... (loading the model will output a lot of text)")
# mels = synthesizer.synthesize_spectrograms(texts, embeds)
# # The vocoder synthesizes one waveform at a time, but it's more efficient for long ones. We
# # can concatenate the mel spectrograms to a single one.
# mel = np.concatenate(mels, axis=1)
# # The vocoder can take a callback function to display the generation. More on that later. For
# # now we'll simply hide it like this:
# if not args.griffin_lim:
# no_action = lambda *args: None
# print("\tTesting the vocoder...")
# # For the sake of making this test short, we'll pass a short target length. The target length
# # is the length of the wav segments that are processed in parallel. E.g. for audio sampled
# # at 16000 Hertz, a target length of 8000 means that the target audio will be cut in chunks of
# # 0.5 seconds which will all be generated together. The parameters here are absurdly short, and
# # that has a detrimental effect on the quality of the audio. The default parameters are
# # recommended in general.
# vocoder.infer_waveform(mel, target=200, overlap=50, progress_callback=no_action)
# print("All test passed! You can now synthesize speech.\n\n")
## Interactive speech generation
print("This is a GUI-less example of interface to SV2TTS. The purpose of this script is to "
"show how you can interface this project easily with your own. See the source code for "
"an explanation of what is happening.\n")
print("Interactive generation loop")
num_generated = 0
nlp = spacy.load('en_core_web_sm')
weight = arg_dict["weight"] # weight of the user's voice embedding for the voice filter (voice beautification)
amp = 1
while True:
# try:
# Get the reference audio filepath
num_of_input_audio = 1
for i in range(num_of_input_audio):
# Computing the embedding
# First, we load the wav using the function that the speaker encoder provides. This is
# important: there is preprocessing that must be applied.
# The following two methods are equivalent:
# - Directly load from the filepath:
# preprocessed_wav = encoder.preprocess_wav(in_fpath)
# - If the wav is already loaded:
# get duration info from input audio
message2 = "Reference voice: enter an audio folder of a voice to be cloned (mp3, " \
f"wav, m4a, flac, ...):({i+1}/{num_of_input_audio})\n"
in_fpath = Path(input(message2).replace("\"", "").replace("\'", ""))
fpath_without_ext = os.path.splitext(str(in_fpath))[0]
speaker_name = os.path.normpath(fpath_without_ext).split(os.sep)[-1]
is_wav_file, single_wav, wav_path = TransFormat(in_fpath, 'wav')
if not is_wav_file:
os.remove(wav_path) # remove intermediate wav files
# merge
if i == 0:
wav = single_wav
else:
wav = np.append(wav, single_wav)
# write to disk
path_ori, _ = os.path.split(wav_path)
file_ori = 'temp.wav'
fpath = os.path.join(path_ori, file_ori)
sf.write(fpath, wav, samplerate=encoder.params_data.sampling_rate)
# adjust the speed
totDur_ori, nPause_ori, arDur_ori, nSyl_ori, arRate_ori = AudioAnalysis(path_ori, file_ori)
DelFile(path_ori, '.TextGrid')
os.remove(fpath)
preprocessed_wav = encoder.inference.preprocess_wav(wav)
print("Loaded input audio file succesfully")
# Then we derive the embedding. There are many functions and parameters that the
# speaker encoder interfaces. These are mostly for in-depth research. You will typically
# only use this function (with its default parameters):
input_embed = encoder.inference.embed_utterance(preprocessed_wav)
# Choose standard audio
fft_max_freq = vocoder.get_dominant_freq(preprocessed_wav)
print(f"\nthe dominant frequency of input audio is {fft_max_freq}Hz")
if fft_max_freq < encoder.params_data.split_freq:
vocoder.hp.sex = 1
standard_fpath = "standard_audios/male_1.wav"
else:
vocoder.hp.sex = 0
standard_fpath = "standard_audios/female_1.wav"
if os.path.exists(standard_fpath):
standard_wav = Synthesizer_infer.load_preprocess_wav(standard_fpath)
preprocessed_standard_wav = encoder.inference.preprocess_wav(standard_wav)
print("Loaded standard audio file successfully")
standard_embed = encoder.inference.embed_utterance(preprocessed_standard_wav)
embed1=np.copy(input_embed).dot(weight)
embed2=np.copy(standard_embed).dot(1 - weight)
embed=embed1+embed2
else:
embed = np.copy(input_embed)
embed[embed < encoder.params_data.set_zero_thres] = 0 # zero out small (noisy) values in the embedding
embed = embed * amp
start_syn = time.time()
# Generating the spectrogram
text = input("Write a sentence to be synthesized:\n")
# If seed is specified, reset torch seed and force synthesizer reload
if args.seed is not None:
torch.manual_seed(args.seed)
synthesizer = Synthesizer_infer(args.syn_model_fpath)
# The synthesizer works in batch, so you need to put your data in a list or numpy array
def preprocess_text(text):
text = add_breaks(text)
text = english_cleaners_predict(text)
texts = [i.text.strip() for i in nlp(text).sents] # split paragraph to sentences
return texts
texts = preprocess_text(text)
print(f"the list of inputs texts:\n{texts}")
# embeds = [embed] * len(texts)
specs = []
alignments = []
stop_tokens = []
for text in texts:
spec, align, stop_token = synthesizer.synthesize_spectrograms([text], [embed], require_visualization=True)
specs.append(spec[0])
alignments.append(align[0])
stop_tokens.append(stop_token[0])
breaks = [spec.shape[1] for spec in specs]
spec = np.concatenate(specs, axis=1)
## Save synthesizer visualization results
if not os.path.exists("syn_results"):
os.mkdir("syn_results")
save_attention_multiple(alignments, "syn_results/attention")
save_stop_tokens(stop_tokens, "syn_results/stop_tokens")
save_spectrogram(spec, "syn_results/mel")
print("Created the mel spectrogram")
end_syn = time.time()
print(f"Prediction time of synthesizer is {end_syn - start_syn}s")
start_voc = time.time()
## Generating the waveform
print("Synthesizing the waveform:")
# If seed is specified, reset torch seed and reload vocoder
if args.seed is not None:
torch.manual_seed(args.seed)
vocoder.load_model(args.voc_model_fpath)
# Synthesizing the waveform is fairly straightforward. Remember that the longer the
# spectrogram, the more time-efficient the vocoder.
if not args.griffin_lim:
wav = vocoder.infer_waveform(spec, target=vocoder.hp.voc_target, overlap=vocoder.hp.voc_overlap, crossfade=vocoder.hp.is_crossfade)
else:
wav = Synthesizer_infer.griffin_lim(spec)
end_voc = time.time()
print(f"Prediction time of vocoder is {end_voc - start_voc}s")
print(f"Prediction time of TTS is {end_voc - start_syn}s")
# Add breaks
b_ends = np.cumsum(np.array(breaks) * Synthesizer_infer.hparams.hop_size)
b_starts = np.concatenate(([0], b_ends[:-1]))
wavs = [wav[start:end] for start, end, in zip(b_starts, b_ends)]
breaks = [np.zeros(int(0.15 * Synthesizer_infer.sample_rate))] * len(breaks)
wav = np.concatenate([i for w, b in zip(wavs, breaks) for i in (w, b)])
# Trim excess silences to compensate for gaps in spectrograms (issue #53)
# generated_wav = encoder.inference.preprocess_wav(wav)
wav = wav / np.abs(wav).max() * 4
# Save it on the disk
# filename = "demo_output_%02d.wav" % num_generated
if not os.path.exists("out_audios"):
os.mkdir("out_audios")
dir_path = os.path.dirname(os.path.realpath(__file__)) # current dir
filename = os.path.join(dir_path, f"out_audios/{speaker_name}_syn.wav")
# print(wav.dtype)
sf.write(filename, wav.astype(np.float32), synthesizer.sample_rate)
num_generated += 1
print("\nSaved output (havent't change speed) as %s\n\n" % filename)
# Fix Speed(generate new audio)
fix_file = work(totDur_ori,
nPause_ori,
arDur_ori,
nSyl_ori,
arRate_ori,
filename)
print(f"\nSaved output (fixed speed) as {fix_file}\n\n")
# # Play the audio (non-blocking)
# if not args.no_sound:
# import sounddevice as sd
# try:
# sd.stop()
# sd.play(wav, synthesizer.sample_rate)
# except sd.PortAudioError as e:
# print("\nCaught exception: %s" % repr(e))
# print("Continuing without audio playback. Suppress this message with the \"--no_sound\" flag.\n")
# except:
# raise
# except Exception as e:
# print("Caught exception: %s" % repr(e))
# print("Restarting\n")

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.


@@ -0,0 +1 @@
Life was like a box of chocolates, you never know what you're gonna get.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.


@@ -0,0 +1 @@
In 2014, P&G recorded $83.1 billion in sales. On August 1, 2014, P&G announced it was streamlining the company, dropping and selling off around 100 brands from its product portfolio in order to focus on the remaining 65 brands, which produced 95% of the company's profits.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.


@@ -0,0 +1 @@
Mechanics is a branch of physics that deals with the behavior of physical bodies under the influence of various forces. The study of mechanics is important in understanding the behavior of machines, the motion of objects, and the principles of engineering. Mechanics has been an essential part of physics since ancient times and has continued to evolve with advancements in science and technology. This paper will discuss the principles of mechanics, the laws of motion, and the applications of mechanics in engineering and technology.

41
demo_toolbox.py Normal file

@@ -0,0 +1,41 @@
import argparse
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
from pathlib import Path
from toolbox import Toolbox
from utils.argutils import print_args
from utils.default_models import ensure_default_models
if __name__ == '__main__':
parser = argparse.ArgumentParser(
description="Runs the toolbox.",
formatter_class=argparse.ArgumentDefaultsHelpFormatter
)
parser.add_argument("--run_id", type=str, default="20230609", help= \
"Name for this model. By default, training outputs will be stored to saved_models/<run_id>/. If a model state "
"from the same run ID was previously saved, the training will restart from there. Pass -f to overwrite saved "
"states and restart from scratch.")
parser.add_argument("-d", "--datasets_root", type=Path, help= \
"Path to the directory containing your datasets. See toolbox/__init__.py for a list of "
"supported datasets.", default=None)
parser.add_argument("-m", "--models_dir", type=Path, default="saved_models",
help="Directory containing all saved models")
parser.add_argument("--cpu", action="store_true", help=\
"If True, all inference will be done on CPU")
parser.add_argument("--seed", type=int, default=None, help=\
"Optional random number seed value to make toolbox deterministic.")
args = parser.parse_args()
arg_dict = vars(args)
print_args(args, parser)
# Hide GPUs from Pytorch to force CPU processing
if arg_dict.pop("cpu"):
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
# Remind the user to download pretrained models if needed
ensure_default_models(args.run_id, args.models_dir)
# Launch the toolbox
Toolbox(**arg_dict)

BIN
docs/images/audio_icon.png Normal file

Binary file not shown.


Binary file not shown.


0
encoder/__init__.py Normal file

136
encoder/audio.py Normal file

@@ -0,0 +1,136 @@
from scipy.ndimage.morphology import binary_dilation
from encoder.params_data import *
from pathlib import Path
from typing import Optional, Union
from warnings import warn
import numpy as np
import librosa
import struct
import os
from pydub import AudioSegment
import noisereduce
try:
import webrtcvad
except:
warn("Unable to import 'webrtcvad'. This package enables noise removal and is recommended.")
webrtcvad=None
int16_max = (2 ** 15) - 1
def preprocess_wav(fpath_or_wav: Union[str, Path, np.ndarray],
source_sr: Optional[int] = None,
normalize: Optional[bool] = True,
trim_silence: Optional[bool] = True):
"""
Applies the preprocessing operations used in training the Speaker Encoder to a waveform
either on disk or in memory. The waveform will be resampled to match the data hyperparameters.
:param fpath_or_wav: either a filepath to an audio file (many extensions are supported, not
just .wav), or the waveform as a numpy array of floats.
:param source_sr: if passing an audio waveform, the sampling rate of the waveform before
preprocessing. After preprocessing, the waveform's sampling rate will match the data
hyperparameters. If passing a filepath, the sampling rate will be automatically detected and
this argument will be ignored.
"""
# Load the wav from disk if needed
if isinstance(fpath_or_wav, str) or isinstance(fpath_or_wav, Path):
# if str(fpath_or_wav).endswith(".m4a"):
# try:
# track = AudioSegment.from_file(fpath_or_wav, format="m4a")
# except:
# return []
# fpath = os.path.splitext(str(fpath_or_wav))[0]
# path_components = os.path.normpath(fpath).split(os.sep)
# wav_dir = Path("D:\\liuhaozhe").joinpath(f"VoxCeleb2_wav") # local path
# wav_dir.mkdir(exist_ok=True)
# wav_name = "_".join(path_components[-6: ])
# wav_path = wav_dir.joinpath(f"{wav_name}.wav")
# track.export(wav_path, format="wav")
# wav, source_sr = librosa.load(str(wav_path), sr=None)
# else:
wav, source_sr = librosa.load(str(fpath_or_wav), sr=None)
else:
wav = fpath_or_wav
# Resample the wav if needed
if source_sr is not None and source_sr != sampling_rate:
wav = librosa.resample(wav, source_sr, sampling_rate)
# Apply the preprocessing: normalize volume and shorten long silences
if normalize:
wav = normalize_volume(wav, audio_norm_target_dBFS, increase_only=True)
if webrtcvad and trim_silence:
wav = trim_long_silences(wav)
return wav
def wav_to_mel_spectrogram(wav):
"""
Derives a mel spectrogram ready to be used by the encoder from a preprocessed audio waveform.
Note: this is not a log-mel spectrogram.
"""
frames = librosa.feature.melspectrogram(
wav,
sampling_rate,
n_fft=int(sampling_rate * mel_window_length / 1000),
hop_length=int(sampling_rate * mel_window_step / 1000),
n_mels=mel_n_channels
)
return frames.astype(np.float32).T
def trim_long_silences(wav):
"""
Ensures that segments without voice in the waveform remain no longer than a
threshold determined by the VAD parameters in params.py.
:param wav: the raw waveform as a numpy array of floats
:return: the same waveform with silences trimmed away (length <= original wav length)
"""
# Compute the voice detection window size
samples_per_window = (vad_window_length * sampling_rate) // 1000
# Trim the end of the audio to have a multiple of the window size
wav = wav[:len(wav) - (len(wav) % samples_per_window)]
# Convert the float waveform to 16-bit mono PCM
pcm_wave = struct.pack("%dh" % len(wav), *(np.round(wav * int16_max)).astype(np.int16))
# Perform voice activation detection
voice_flags = []
vad = webrtcvad.Vad(mode=3)
for window_start in range(0, len(wav), samples_per_window):
window_end = window_start + samples_per_window
voice_flags.append(vad.is_speech(pcm_wave[window_start * 2:window_end * 2],
sample_rate=sampling_rate))
voice_flags = np.array(voice_flags)
# Smooth the voice detection with a moving average
def moving_average(array, width):
array_padded = np.concatenate((np.zeros((width - 1) // 2), array, np.zeros(width // 2)))
ret = np.cumsum(array_padded, dtype=float)
ret[width:] = ret[width:] - ret[:-width]
return ret[width - 1:] / width
audio_mask = moving_average(voice_flags, vad_moving_average_width)
audio_mask = np.round(audio_mask).astype(np.bool)
# Dilate the voiced regions
audio_mask = binary_dilation(audio_mask, np.ones(vad_max_silence_length + 1))
audio_mask = np.repeat(audio_mask, samples_per_window)
return wav[audio_mask == True]
def normalize_volume(wav, target_dBFS, increase_only=False, decrease_only=False):
if increase_only and decrease_only:
raise ValueError("Both increase only and decrease only are set")
dBFS_change = target_dBFS - 10 * np.log10(np.mean(wav ** 2))
if (dBFS_change < 0 and increase_only) or (dBFS_change > 0 and decrease_only):
return wav
return wav * (10 ** (dBFS_change / 20))

45
encoder/config.py Normal file

@@ -0,0 +1,45 @@
librispeech_datasets = {
"train": {
"clean": ["LibriSpeech/train-clean-100", "LibriSpeech/train-clean-360"],
"other": ["LibriSpeech/train-other-500"]
},
"test": {
"clean": ["LibriSpeech/test-clean"],
"other": ["LibriSpeech/test-other"]
},
"dev": {
"clean": ["LibriSpeech/dev-clean"],
"other": ["LibriSpeech/dev-other"]
},
}
libritts_datasets = {
"train": {
"clean": ["LibriTTS/train-clean-100", "LibriTTS/train-clean-360"],
"other": ["LibriTTS/train-other-500"]
},
"test": {
"clean": ["LibriTTS/test-clean"],
"other": ["LibriTTS/test-other"]
},
"dev": {
"clean": ["LibriTTS/dev-clean"],
"other": ["LibriTTS/dev-other"]
},
}
voxceleb_datasets = {
"voxceleb1" : {
"train": ["VoxCeleb1/wav"],
"test": ["VoxCeleb1/test_wav"]
},
"voxceleb2" : {
"train": ["VoxCeleb2/dev/aac"],
"test": ["VoxCeleb2/test_wav"]
}
}
other_datasets = [
"LJSpeech-1.1",
"VCTK-Corpus/wav48",
]
anglophone_nationalites = ["australia", "canada", "ireland", "uk", "usa"]


@@ -0,0 +1,2 @@
from encoder.data_objects.speaker_verification_dataset import Train_Dataset, Dev_Dataset
from encoder.data_objects.speaker_verification_dataset import DataLoader


@@ -0,0 +1,37 @@
import random
class RandomCycler:
"""
Creates an internal copy of a sequence and allows access to its items in a constrained random
order. For a source sequence of n items and one or several consecutive queries of a total
of m items, the following guarantees hold (one implies the other):
- Each item will be returned between m // n and ((m - 1) // n) + 1 times.
- Between two appearances of the same item, there may be at most 2 * (n - 1) other items.
"""
def __init__(self, source):
if len(source) == 0:
raise Exception("Can't create RandomCycler from an empty collection")
self.all_items = list(source)
self.next_items = []
def sample(self, count: int):
shuffle = lambda l: random.sample(l, len(l))
out = []
while count > 0:
if count >= len(self.all_items):
out.extend(shuffle(list(self.all_items)))
count -= len(self.all_items)
continue
n = min(count, len(self.next_items))
out.extend(self.next_items[:n])
count -= n
self.next_items = self.next_items[n:]
if len(self.next_items) == 0:
self.next_items = shuffle(list(self.all_items))
return out
def __next__(self):
return self.sample(1)[0]


@@ -0,0 +1,40 @@
from encoder.data_objects.random_cycler import RandomCycler
from encoder.data_objects.utterance import Utterance
from pathlib import Path
# Contains the set of utterances of a single speaker
class Speaker:
def __init__(self, root: Path):
self.root = root
self.name = root.name
self.utterances = None
self.utterance_cycler = None
def _load_utterances(self):
with self.root.joinpath("_sources.txt").open("r") as sources_file:
sources = [l.split(",") for l in sources_file]
sources = {frames_fname: wave_fpath for frames_fname, wave_fpath in sources}
self.utterances = [Utterance(self.root.joinpath(f), w) for f, w in sources.items()]
self.utterance_cycler = RandomCycler(self.utterances)
def random_partial(self, count, n_frames):
"""
Samples a batch of <count> unique partial utterances from the disk in a way that all
utterances come up at least once every two cycles and in a random order every time.
:param count: The number of partial utterances to sample from the set of utterances from
that speaker. Utterances are guaranteed not to be repeated if <count> is not larger than
the number of utterances available.
:param n_frames: The number of frames in the partial utterance.
:return: A list of tuples (utterance, frames, range) where utterance is an Utterance,
frames are the frames of the partial utterances and range is the range of the partial
utterance with regard to the complete utterance.
"""
if self.utterances is None:
self._load_utterances()
utterances = self.utterance_cycler.sample(count)
a = [(u,) + u.random_partial(n_frames) for u in utterances]
return a


@@ -0,0 +1,13 @@
import numpy as np
from typing import List
from encoder.data_objects.speaker import Speaker
class SpeakerBatch:
def __init__(self, speakers: List[Speaker], utterances_per_speaker: int, n_frames: int):
self.speakers = speakers
self.partials = {s: s.random_partial(utterances_per_speaker, n_frames) for s in speakers}
# Array of shape (n_speakers * n_utterances, n_frames, mel_n), e.g. for 3 speakers with
# 4 utterances each of 160 frames of 40 mel coefficients: (12, 160, 40)
self.data = np.array([frames for s in speakers for _, frames, _ in self.partials[s]])


@@ -0,0 +1,76 @@
from encoder.data_objects.random_cycler import RandomCycler
from encoder.data_objects.speaker_batch import SpeakerBatch
from encoder.data_objects.utterance_batch import UtteranceBatch
from encoder.data_objects.speaker import Speaker
from encoder.params_data import partials_n_frames
from torch.utils.data import Dataset, DataLoader
from pathlib import Path
from os import listdir
from os.path import isfile
import numpy as np
# TODO: improve with a pool of speakers for data efficiency
class Train_Dataset(Dataset):
def __init__(self, datasets_root: Path):
self.root = datasets_root
speaker_dirs = [f for f in self.root.glob("*") if f.is_dir()]
if len(speaker_dirs) == 0:
raise Exception("No speakers found. Make sure you are pointing to the directory "
"containing all preprocessed speaker directories.")
self.speakers = [Speaker(speaker_dir) for speaker_dir in speaker_dirs]
self.speaker_cycler = RandomCycler(self.speakers)
def __len__(self):
return int(1e8)
def __getitem__(self, index):
return next(self.speaker_cycler)
def get_logs(self):
log_string = ""
for log_fpath in self.root.glob("*.txt"):
with log_fpath.open("r") as log_file:
log_string += "".join(log_file.readlines())
return log_string
class Dev_Dataset(Dataset):
def __init__(self, datasets_root: Path):
self.root = datasets_root
speaker_dirs = [f for f in self.root.glob("*") if f.is_dir()]
if len(speaker_dirs) == 0:
raise Exception("No speakers found. Make sure you are pointing to the directory "
"containing all preprocessed speaker directories.")
self.speakers = [Speaker(speaker_dir) for speaker_dir in speaker_dirs]
self.speaker_cycler = RandomCycler(self.speakers)
def __len__(self):
return len(self.speakers)
def __getitem__(self, index):
return next(self.speaker_cycler)
class DataLoader(DataLoader):
def __init__(self, dataset, speakers_per_batch, utterances_per_speaker, shuffle, sampler=None,
batch_sampler=None, num_workers=0, pin_memory=False, timeout=0,
worker_init_fn=None):
self.utterances_per_speaker = utterances_per_speaker
super().__init__(
dataset=dataset,
batch_size=speakers_per_batch,
shuffle=shuffle,
sampler=sampler,
batch_sampler=batch_sampler,
num_workers=num_workers,
collate_fn=self.collate,
pin_memory=pin_memory,
drop_last=False,
timeout=timeout,
worker_init_fn=worker_init_fn
)
def collate(self, speakers):
return SpeakerBatch(speakers, self.utterances_per_speaker, partials_n_frames)


@@ -0,0 +1,29 @@
import numpy as np
class Utterance:
def __init__(self, frames_fpath, wave_fpath):
self.frames_fpath = frames_fpath
self.wave_fpath = wave_fpath
def get_frames(self):
# frame_len = len(np.load(self.frames_fpath))
return np.load(self.frames_fpath)
def random_partial(self, n_frames):
"""
Crops the frames into a partial utterance of n_frames
:param n_frames: The number of frames of the partial utterance
:return: the partial utterance frames and a tuple indicating the start and end of the
partial utterance in the complete utterance.
"""
frames = self.get_frames()
if frames.shape[0] == n_frames:
start = 0
else:
start = np.random.randint(0, frames.shape[0] - n_frames)
end = start + n_frames
# frame_len = end - start
# frames_trim = frames[start:end]
return frames[start:end], (start, end)


@@ -0,0 +1,10 @@
from pathlib import Path
import numpy as np
from typing import List
from encoder.data_objects.utterance import Utterance
class UtteranceBatch:
def __init__(self, utterance_path: List[Path], n_frames: int):
self.utterance = Utterance(utterance_path, None)
self.data = np.array(self.utterance.random_partial(n_frames)[0])

178
encoder/inference.py Normal file

@@ -0,0 +1,178 @@
from encoder.params_data import *
from encoder.model import SpeakerEncoder
from encoder.audio import preprocess_wav # We want to expose this function from here
from matplotlib import cm
from encoder import audio
from pathlib import Path
import numpy as np
import torch
_model = None # type: SpeakerEncoder
_device = None # type: torch.device
def load_model(weights_fpath: Path, device=None):
"""
Loads the model in memory. If this function is not explicitly called, it will be run on the
first call to embed_frames() with the default weights file.
:param weights_fpath: the path to saved model weights.
:param device: either a torch device or the name of a torch device (e.g. "cpu", "cuda"). The
model will be loaded and will run on this device. Outputs will however always be on the CPU.
If None, will default to your GPU if it's available, otherwise your CPU.
"""
# TODO: I think the slow loading of the encoder might have something to do with the device it
# was saved on. Worth investigating.
global _model, _device
if device is None:
_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
elif isinstance(device, str):
_device = torch.device(device)
_model = SpeakerEncoder(_device, torch.device("cpu"))
checkpoint = torch.load(weights_fpath, _device)
_model.load_state_dict(checkpoint["model_state"])
_model.eval()
print("Loaded encoder \"%s\" trained to step %d" % (weights_fpath.name, checkpoint["step"]))
def is_loaded():
return _model is not None
def embed_frames_batch(frames_batch):
"""
Computes embeddings for a batch of mel spectrograms.
:param frames_batch: a batch of mel spectrograms as a numpy array of float32 of shape
(batch_size, n_frames, n_channels)
:return: the embeddings as a numpy array of float32 of shape (batch_size, model_embedding_size)
"""
if _model is None:
raise Exception("Model was not loaded. Call load_model() before inference.")
frames = torch.from_numpy(frames_batch).to(_device)
embed = _model.forward(frames).detach().cpu().numpy()
return embed
def compute_partial_slices(n_samples, partial_utterance_n_frames=partials_n_frames,
min_pad_coverage=0.75, overlap=0.5):
"""
Computes where to split an utterance waveform and its corresponding mel spectrogram to obtain
partial utterances of <partial_utterance_n_frames> each. Both the waveform and the mel
spectrogram slices are returned, so as to make each partial utterance waveform correspond to
its spectrogram. This function assumes that the mel spectrogram parameters used are those
defined in params_data.py.
The returned ranges may index past the end of the waveform. It is
recommended that you pad the waveform with zeros up to wave_slices[-1].stop.
:param n_samples: the number of samples in the waveform
:param partial_utterance_n_frames: the number of mel spectrogram frames in each partial
utterance
:param min_pad_coverage: when reaching the last partial utterance, it may or may not have
enough frames. If at least <min_pad_coverage> of <partial_utterance_n_frames> are present,
then the last partial utterance will be considered, as if we padded the audio. Otherwise,
it will be discarded, as if we trimmed the audio. If there aren't enough frames for 1 partial
utterance, this parameter is ignored so that the function always returns at least 1 slice.
:param overlap: by how much consecutive partial utterances should overlap. If set to 0, the partial
utterances are entirely disjoint.
:return: the waveform slices and mel spectrogram slices as lists of array slices. Index
respectively the waveform and the mel spectrogram with these slices to obtain the partial
utterances.
"""
assert 0 <= overlap < 1
assert 0 < min_pad_coverage <= 1
samples_per_frame = int((sampling_rate * mel_window_step / 1000))
n_frames = int(np.ceil((n_samples + 1) / samples_per_frame))
frame_step = max(int(np.round(partial_utterance_n_frames * (1 - overlap))), 1)
# Compute the slices
wav_slices, mel_slices = [], []
steps = max(1, n_frames - partial_utterance_n_frames + frame_step + 1)
for i in range(0, steps, frame_step):
mel_range = np.array([i, i + partial_utterance_n_frames])
wav_range = mel_range * samples_per_frame
mel_slices.append(slice(*mel_range))
wav_slices.append(slice(*wav_range))
# Evaluate whether extra padding is warranted or not
last_wav_range = wav_slices[-1]
coverage = (n_samples - last_wav_range.start) / (last_wav_range.stop - last_wav_range.start)
if coverage < min_pad_coverage and len(mel_slices) > 1:
mel_slices = mel_slices[:-1]
wav_slices = wav_slices[:-1]
return wav_slices, mel_slices
def embed_utterance(wav, using_partials=True, return_partials=False, **kwargs):
"""
Computes an embedding for a single utterance.
# TODO: handle multiple wavs to benefit from batching on GPU
:param wav: a preprocessed (see audio.py) utterance waveform as a numpy array of float32
:param using_partials: if True, then the utterance is split in partial utterances of
<partial_utterance_n_frames> frames and the utterance embedding is computed from their
normalized average. If False, the embedding is instead computed by feeding the entire
spectrogram to the network.
:param return_partials: if True, the partial embeddings will also be returned along with the
wav slices that correspond to the partial embeddings.
:param kwargs: additional arguments to compute_partial_slices()
:return: the embedding as a numpy array of float32 of shape (model_embedding_size,). If
<return_partials> is True, the partial utterances as a numpy array of float32 of shape
(n_partials, model_embedding_size) and the wav partials as a list of slices will also be
returned. If <using_partials> is simultaneously set to False, both these values will be None
instead.
"""
# Process the entire utterance if not using partials
if not using_partials:
frames = audio.wav_to_mel_spectrogram(wav)
embed = embed_frames_batch(frames[None, ...])[0]
if return_partials:
return embed, None, None
return embed
# Compute where to split the utterance into partials and pad if necessary
wave_slices, mel_slices = compute_partial_slices(len(wav), **kwargs)
max_wave_length = wave_slices[-1].stop
if max_wave_length >= len(wav):
wav = np.pad(wav, (0, max_wave_length - len(wav)), "constant")
# Split the utterance into partials
frames = audio.wav_to_mel_spectrogram(wav)
frames_batch = np.array([frames[s] for s in mel_slices])
partial_embeds = embed_frames_batch(frames_batch)
# Compute the utterance embedding from the partial embeddings
raw_embed = np.mean(partial_embeds, axis=0)
embed = raw_embed / np.linalg.norm(raw_embed, 2)
if return_partials:
return embed, partial_embeds, wave_slices
return embed
def embed_speaker(wavs, **kwargs):
raise NotImplementedError()
def plot_embedding_as_heatmap(embed, ax=None, title="", shape=None, color_range=(0, 0.30)):
import matplotlib.pyplot as plt
if ax is None:
ax = plt.gca()
if shape is None:
height = int(np.sqrt(len(embed)))
shape = (height, -1)
embed = embed.reshape(shape)
cmap = cm.get_cmap()
mappable = ax.imshow(embed, cmap=cmap)
cbar = plt.colorbar(mappable, ax=ax, fraction=0.046, pad=0.04)
sm = cm.ScalarMappable(cmap=cmap)
sm.set_clim(*color_range)
ax.set_xticks([]), ax.set_yticks([])
ax.set_title(title)
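A minimal usage sketch for this module (hedged: the checkpoint path is hypothetical, while the sample audio ships with the repo; embed_utterance returns an L2-normalized vector as implemented above):

from pathlib import Path
import numpy as np
from encoder import inference as encoder_infer

encoder_infer.load_model(Path("saved_models/my_run/encoder.pt"))   # hypothetical checkpoint path
wav = encoder_infer.preprocess_wav(Path("samples/260-123286-0000.flac"))
embed = encoder_infer.embed_utterance(wav)                         # shape: (model_embedding_size,)
assert np.isclose(np.linalg.norm(embed), 1.0, atol=1e-4)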

135
encoder/model.py Normal file
View File

@@ -0,0 +1,135 @@
from encoder.params_model import *
from encoder.params_data import *
from scipy.interpolate import interp1d
from sklearn.metrics import roc_curve
from torch.nn.utils import clip_grad_norm_
from scipy.optimize import brentq
from torch import nn
import numpy as np
import torch
class SpeakerEncoder(nn.Module):
def __init__(self, device, loss_device):
super().__init__()
self.loss_device = loss_device
# Network definition
self.lstm = nn.LSTM(input_size=mel_n_channels,
hidden_size=model_hidden_size,
num_layers=model_num_layers,
batch_first=True).to(device)
self.linear = nn.Linear(in_features=model_hidden_size,
out_features=model_embedding_size).to(device)
self.relu = torch.nn.ReLU().to(device)
# Cosine similarity scaling (with fixed initial parameter values)
self.similarity_weight = nn.Parameter(torch.tensor([10.], device=loss_device))
self.similarity_bias = nn.Parameter(torch.tensor([-5.], device=loss_device)) ####modified####
# Loss
self.loss_fn = nn.CrossEntropyLoss().to(loss_device)
def do_gradient_ops(self):
# Gradient scale
self.similarity_weight.grad *= 0.01
self.similarity_bias.grad *= 0.01
# Gradient clipping
clip_grad_norm_(self.parameters(), 3, norm_type=2)
def forward(self, utterances, hidden_init=None):
"""
Computes the embeddings of a batch of utterance spectrograms.
:param utterances: batch of mel-scale filterbanks of the same duration, as a tensor of shape
(batch_size, n_frames, n_channels)
:param hidden_init: initial hidden state of the LSTM as a tensor of shape (num_layers,
batch_size, hidden_size). Will default to a tensor of zeros if None.
:return: the embeddings as a tensor of shape (batch_size, embedding_size)
"""
# Pass the input through the LSTM layers and retrieve all outputs, the final hidden state
# and the final cell state.
out, (hidden, cell) = self.lstm(utterances, hidden_init)
# We take only the hidden state of the last layer
embeds_raw = self.relu(self.linear(hidden[-1]))
# L2-normalize it
embeds = embeds_raw / (torch.norm(embeds_raw, dim=1, keepdim=True) + 1e-5)
return embeds
def similarity_matrix(self, embeds):
"""
Computes the similarity matrix according to section 2.1 of GE2E.
:param embeds: the embeddings as a tensor of shape (speakers_per_batch,
utterances_per_speaker, embedding_size)
:return: the similarity matrix as a tensor of shape (speakers_per_batch,
utterances_per_speaker, speakers_per_batch)
"""
speakers_per_batch, utterances_per_speaker = embeds.shape[:2]
# Inclusive centroids (1 per speaker). Cloning is needed for reverse differentiation
centroids_incl = torch.mean(embeds, dim=1, keepdim=True)
centroids_incl = centroids_incl.clone() / (torch.norm(centroids_incl, dim=2, keepdim=True) + 1e-5)
# Exclusive centroids (1 per utterance)
centroids_excl = (torch.sum(embeds, dim=1, keepdim=True) - embeds)
centroids_excl /= (utterances_per_speaker - 1)
centroids_excl = centroids_excl.clone() / (torch.norm(centroids_excl, dim=2, keepdim=True) + 1e-5)
# Similarity matrix. The cosine similarity of already L2-normalized vectors is simply the dot
# product of these vectors (which is just an element-wise multiplication reduced by a sum).
# We vectorize the computation for efficiency.
sim_matrix = torch.zeros(speakers_per_batch, utterances_per_speaker,
speakers_per_batch).to(self.loss_device)
mask_matrix = 1 - np.eye(speakers_per_batch, dtype=int)
for j in range(speakers_per_batch):
mask = np.where(mask_matrix[j])[0]
sim_matrix[mask, :, j] = (embeds[mask] * centroids_incl[j]).sum(dim=2)
sim_matrix[j, :, j] = (embeds[j] * centroids_excl[j]).sum(dim=1)
## Even more vectorized version (slower maybe because of transpose)
# sim_matrix2 = torch.zeros(speakers_per_batch, speakers_per_batch, utterances_per_speaker
# ).to(self.loss_device)
# eye = np.eye(speakers_per_batch, dtype=np.int)
# mask = np.where(1 - eye)
# sim_matrix2[mask] = (embeds[mask[0]] * centroids_incl[mask[1]]).sum(dim=2)
# mask = np.where(eye)
# sim_matrix2[mask] = (embeds * centroids_excl).sum(dim=2)
# sim_matrix2 = sim_matrix2.transpose(1, 2)
sim_matrix = sim_matrix * self.similarity_weight + self.similarity_bias
return sim_matrix
def loss(self, embeds):
"""
Computes the softmax loss according to section 2.1 of GE2E.
:param embeds: the embeddings as a tensor of shape (speakers_per_batch,
utterances_per_speaker, embedding_size)
:return: the loss and the EER for this batch of embeddings.
"""
speakers_per_batch, utterances_per_speaker = embeds.shape[:2]
# Loss
sim_matrix = self.similarity_matrix(embeds)
sim_matrix = sim_matrix.reshape((speakers_per_batch * utterances_per_speaker,
speakers_per_batch))
ground_truth = np.repeat(np.arange(speakers_per_batch), utterances_per_speaker)
target = torch.from_numpy(ground_truth).long().to(self.loss_device)
loss = self.loss_fn(sim_matrix, target)
# EER (not backpropagated)
with torch.no_grad():
inv_argmax = lambda i: np.eye(1, speakers_per_batch, i, dtype=int)[0]
labels = np.array([inv_argmax(i) for i in ground_truth])
preds = sim_matrix.detach().cpu().numpy()
# Snippet from https://yangcha.github.io/EER-ROC/
fpr, tpr, thresholds = roc_curve(labels.flatten(), preds.flatten())
eer = brentq(lambda x: 1. - x - interp1d(fpr, tpr)(x), 0., 1.)
return loss, eer
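A small shape self-check for the GE2E code above (a hedged sketch on CPU with random mel frames; it only verifies tensor shapes, not training behaviour):

import torch
from encoder.model import SpeakerEncoder
from encoder.params_data import mel_n_channels, partials_n_frames

device = loss_device = torch.device("cpu")
model = SpeakerEncoder(device, loss_device)
speakers, utterances = 4, 5
frames = torch.rand(speakers * utterances, partials_n_frames, mel_n_channels)
embeds = model(frames).view(speakers, utterances, -1)
sim = model.similarity_matrix(embeds)   # (speakers, utterances, speakers)
loss, eer = model.loss(embeds)          # scalar cross-entropy loss and an EER in [0, 1]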

34
encoder/params_data.py Normal file
View File

@@ -0,0 +1,34 @@
## Mel-filterbank
mel_window_length = 25 # In milliseconds
mel_window_step = 10 # In milliseconds
mel_n_channels = 40
## Audio
sampling_rate = 16000
# Number of spectrogram frames in a partial utterance
partials_n_frames = 160 # 1600 ms
# Number of spectrogram frames at inference
inference_n_frames = 80 # 800 ms
## Voice Activation Detection
# Window size of the VAD. Must be either 10, 20 or 30 milliseconds.
# This sets the granularity of the VAD. Should not need to be changed.
vad_window_length = 30 # In milliseconds
# Number of frames to average together when performing the moving average smoothing.
# The larger this value, the larger the VAD variations must be to not get smoothed out.
vad_moving_average_width = 8
# Maximum number of consecutive silent frames a segment can have.
vad_max_silence_length = 6
## Audio volume normalization
audio_norm_target_dBFS = -30
# Cutoff frequency (Hz) used to classify the input voice as male or female
split_freq = 170
# Threshold below which embedding values are zeroed out (embedding denoising)
set_zero_thres = 0.06
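For reference, the frame timing implied by these values can be checked directly (a short sketch; the durations in the comments above follow from a 10 ms mel window step at 16 kHz):

from encoder.params_data import sampling_rate, mel_window_step, partials_n_frames, inference_n_frames

samples_per_frame = int(sampling_rate * mel_window_step / 1000)   # 160 samples per 10 ms frame
print(partials_n_frames * mel_window_step)    # 1600 ms per training partial
print(inference_n_frames * mel_window_step)   # 800 ms per inference window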

11
encoder/params_model.py Normal file
View File

@@ -0,0 +1,11 @@
## Model parameters
model_hidden_size = 256
model_embedding_size = 256
model_num_layers = 3
## Training parameters
learning_rate_init = 5e-6
speakers_per_batch = 64
utterances_per_speaker = 10

232
encoder/preprocess.py Normal file
View File

@@ -0,0 +1,232 @@
from datetime import datetime
from functools import partial
from multiprocessing import Pool
from pathlib import Path
import numpy as np
from tqdm import tqdm
from encoder import audio
from encoder.config import librispeech_datasets, anglophone_nationalites
from encoder.params_data import *
_AUDIO_EXTENSIONS = ("wav", "flac", "m4a", "mp3")
class DatasetLog:
"""
Registers metadata about the dataset in a text file.
"""
def __init__(self, root, name):
self.text_file = open(Path(root, "Log_%s.txt" % name.replace("/", "_")), "w")
self.sample_data = dict()
start_time = str(datetime.now().strftime("%A %d %B %Y at %H:%M"))
self.write_line("Creating dataset %s on %s" % (name, start_time))
self.write_line("-----")
self._log_params()
def _log_params(self):
from encoder import params_data
self.write_line("Parameter values:")
for param_name in (p for p in dir(params_data) if not p.startswith("__")):
value = getattr(params_data, param_name)
self.write_line("\t%s: %s" % (param_name, value))
self.write_line("-----")
def write_line(self, line):
self.text_file.write("%s\n" % line)
def add_sample(self, **kwargs):
for param_name, value in kwargs.items():
if param_name not in self.sample_data:
self.sample_data[param_name] = []
self.sample_data[param_name].append(value)
def finalize(self):
self.write_line("Statistics:")
for param_name, values in self.sample_data.items():
self.write_line("\t%s:" % param_name)
self.write_line("\t\tmin %.3f, max %.3f" % (np.min(values), np.max(values)))
self.write_line("\t\tmean %.3f, median %.3f" % (np.mean(values), np.median(values)))
self.write_line("-----")
end_time = str(datetime.now().strftime("%A %d %B %Y at %H:%M"))
self.write_line("Finished on %s" % end_time)
self.text_file.close()
def _init_preprocess_dataset(dataset_name, datasets_root, out_dir):
dataset_root = datasets_root.joinpath(dataset_name)
if not dataset_root.exists():
print("Couldn\'t find %s, skipping this dataset." % dataset_root)
return None, None
return dataset_root, DatasetLog(out_dir, dataset_name)
def _preprocess_speaker(speaker_dir: Path, datasets_root: Path, out_dir: Path, skip_existing: bool):
out_dir.mkdir(exist_ok=True)
# Give a name to the speaker that includes its dataset
speaker_name = "_".join(speaker_dir.relative_to(datasets_root).parts)
# Create an output directory with that name, as well as a txt file containing a
# reference to each source file.
speaker_out_dir = out_dir.joinpath(speaker_name)
speaker_out_dir.mkdir(exist_ok=True)
sources_fpath = speaker_out_dir.joinpath("_sources.txt")
# There's a possibility that the preprocessing was interrupted earlier, check if
# there already is a sources file.
if sources_fpath.exists():
try:
with sources_fpath.open("r") as sources_file:
existing_fnames = {line.split(",")[0] for line in sources_file}
except:
existing_fnames = {}
else:
existing_fnames = {}
# Gather all audio files for that speaker recursively
sources_file = sources_fpath.open("a" if skip_existing else "w")
audio_durs = []
for extension in _AUDIO_EXTENSIONS:
for in_fpath in speaker_dir.glob("**/*.%s" % extension):
# Check if the target output file already exists
out_fname = "_".join(in_fpath.relative_to(speaker_dir).parts)
out_fname = out_fname.replace(".%s" % extension, ".npy")
if skip_existing and out_fname in existing_fnames:
continue
# Load and preprocess the waveform
wav = audio.preprocess_wav(in_fpath)
if len(wav) == 0:
continue
# Create the mel spectrogram, discard those that are too short
frames = audio.wav_to_mel_spectrogram(wav)
if len(frames) < partials_n_frames:
continue
out_fpath = speaker_out_dir.joinpath(out_fname)
np.save(out_fpath, frames)
sources_file.write("%s,%s\n" % (out_fname, in_fpath))
audio_durs.append(len(wav) / sampling_rate)
sources_file.close()
return audio_durs
def _preprocess_speaker_dirs(speaker_dirs, dataset_name, datasets_root, out_dir, skip_existing, logger):
print("%s: Preprocessing data for %d speakers." % (dataset_name, len(speaker_dirs)))
# Process the utterances for each speaker
work_fn = partial(_preprocess_speaker, datasets_root=datasets_root, out_dir=out_dir, skip_existing=skip_existing)
with Pool(4) as pool:
tasks = pool.imap(work_fn, speaker_dirs)
for sample_durs in tqdm(tasks, dataset_name, len(speaker_dirs), unit="speakers"):
for sample_dur in sample_durs:
logger.add_sample(duration=sample_dur)
logger.finalize()
print("Done preprocessing %s.\n" % dataset_name)
def preprocess_librispeech(datasets_root: Path, out_dir: Path, skip_existing=False):
# preprocess train dataset
for dataset_name in librispeech_datasets["train"]["other"]:
# Initialize the preprocessing
dataset_root, logger = _init_preprocess_dataset(dataset_name, datasets_root, out_dir)
if not dataset_root:
return
# Preprocess all speakers
speaker_dirs = list(dataset_root.glob("*"))
_preprocess_speaker_dirs(speaker_dirs, dataset_name, datasets_root, out_dir.joinpath("train"), skip_existing, logger)
# preprocess dev dataset
for dataset_name in librispeech_datasets["dev"]["other"]:
# Initialize the preprocessing
dataset_root, logger = _init_preprocess_dataset(dataset_name, datasets_root, out_dir)
if not dataset_root:
return
# Preprocess all speakers
speaker_dirs = list(dataset_root.glob("*"))
_preprocess_speaker_dirs(speaker_dirs, dataset_name, datasets_root, out_dir.joinpath("dev"), skip_existing, logger)
def preprocess_voxceleb1(datasets_root: Path, out_dir: Path, skip_existing=False):
# Initialize the preprocessing
dataset_name = "VoxCeleb1"
dataset_root, logger = _init_preprocess_dataset(dataset_name, datasets_root, out_dir)
if not dataset_root:
return
train_dataset_root = dataset_root.joinpath("train")
dev_dataset_root = dataset_root.joinpath("dev")
# Preprocess train data
# Get the contents of the meta file
with train_dataset_root.joinpath("vox1_meta.csv").open("r") as metafile:
metadata = [line.split("\t") for line in metafile][1:]
# Select the ID and the nationality, filter out non-anglophone speakers
nationalities = {line[0]: line[3] for line in metadata}
keep_speaker_ids = [speaker_id for speaker_id, nationality in nationalities.items() if
nationality.lower() in anglophone_nationalites]
print("VoxCeleb1: using samples from %d (presumed anglophone) speakers out of %d." %
(len(keep_speaker_ids), len(nationalities)))
# Get the speaker directories for anglophone speakers only
train_speaker_dirs = train_dataset_root.joinpath("wav").glob("*")
train_speaker_dirs = [speaker_dir for speaker_dir in train_speaker_dirs if
speaker_dir.name in keep_speaker_ids]
print("VoxCeleb1 train: found %d anglophone speakers on the disk, %d missing (this is normal)." %
(len(train_speaker_dirs), len(keep_speaker_ids) - len(train_speaker_dirs)))
# Preprocess all speakers
_preprocess_speaker_dirs(train_speaker_dirs, dataset_name, datasets_root, out_dir.joinpath("train"), skip_existing, logger)
# Preprocess dev data
# Get the contents of the meta file
with dev_dataset_root.joinpath("vox1_meta.csv").open("r") as metafile:
metadata = [line.split("\t") for line in metafile][1:]
# Select the ID and the nationality, filter out non-anglophone speakers
nationalities = {line[0]: line[3] for line in metadata}
keep_speaker_ids = [speaker_id for speaker_id, nationality in nationalities.items() if
nationality.lower() in anglophone_nationalites]
print("VoxCeleb1: using samples from %d (presumed anglophone) speakers out of %d." %
(len(keep_speaker_ids), len(nationalities)))
# Get the speaker directories for anglophone speakers only
dev_speaker_dirs = dev_dataset_root.joinpath("wav").glob("*")
dev_speaker_dirs = [speaker_dir for speaker_dir in dev_speaker_dirs if
speaker_dir.name in keep_speaker_ids]
print("VoxCeleb1 dev: found %d anglophone speakers on the disk, %d missing (this is normal)." %
(len(dev_speaker_dirs), len(keep_speaker_ids) - len(dev_speaker_dirs)))
# Preprocess all speakers
_preprocess_speaker_dirs(dev_speaker_dirs, dataset_name, datasets_root, out_dir.joinpath("dev"), skip_existing, logger)
def preprocess_voxceleb2(datasets_root: Path, out_dir: Path, skip_existing=False):
# Initialize the preprocessing
dataset_name = "VoxCeleb2"
dataset_root, logger = _init_preprocess_dataset(dataset_name, datasets_root, out_dir)
if not dataset_root:
return
train_dataset_root = dataset_root.joinpath("train")
dev_dataset_root = dataset_root.joinpath("dev")
# Get the speaker directories
# Preprocess all speakers
speaker_dirs = list(train_dataset_root.joinpath("dev", "aac").glob("*"))
_preprocess_speaker_dirs(speaker_dirs, dataset_name, datasets_root, out_dir.joinpath("train"), skip_existing, logger)
# Get the speaker directories
# Preprocess all speakers
speaker_dirs = list(dev_dataset_root.joinpath("aac").glob("*"))
_preprocess_speaker_dirs(speaker_dirs, dataset_name, datasets_root, out_dir.joinpath("dev"), skip_existing, logger)

184
encoder/train.py Normal file
View File

@@ -0,0 +1,184 @@
from pathlib import Path
import numpy as np
from os.path import exists
import torch
from encoder.data_objects import DataLoader, Train_Dataset, Dev_Dataset
from encoder.model import SpeakerEncoder
from encoder.params_model import *
from encoder.visualizations import Visualizations
from utils.profiler import Profiler
def sync(device: torch.device):
# For correct profiling (cuda operations are async)
if device.type == "cuda":
torch.cuda.synchronize(device)
def update_lr(optimizer, lr):
for param_group in optimizer.param_groups:
param_group["lr"] = lr
def train(run_id: str, clean_data_root: Path, models_dir: Path, umap_every: int, save_every: int,
backup_every: int, vis_every: int, force_restart: bool, visdom_server: str,
no_visdom: bool):
# Create a dataset and a dataloader
train_dataset = Train_Dataset(clean_data_root.joinpath("train"))
dev_dataset = Dev_Dataset(clean_data_root.joinpath("dev"))
train_loader = DataLoader(
train_dataset,
speakers_per_batch,
utterances_per_speaker,
shuffle=True,
num_workers=8,
pin_memory=True
)
dev_batch = len(dev_dataset)
dev_loader = DataLoader(
dev_dataset,
dev_batch,
utterances_per_speaker,
shuffle=False,
num_workers=2,
pin_memory=True
)
# Setup the device on which to run the forward pass and the loss. These can be different,
# because the forward pass is faster on the GPU whereas the loss is often (depending on your
# hyperparameters) faster on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# FIXME: currently, the gradient is None if loss_device is cuda
# loss_device = torch.device("cpu")
loss_device = torch.device("cuda" if torch.cuda.is_available() else "cpu") ####modified####
# Create the model and the optimizer
model = SpeakerEncoder(device, loss_device)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate_init)
current_lr = learning_rate_init
init_step = 1
# Configure file path for the model
model_dir = models_dir / run_id
model_dir.mkdir(exist_ok=True, parents=True)
state_fpath = model_dir / "encoder.pt"
# Load any existing model
if not force_restart:
if state_fpath.exists():
print("Found existing model \"%s\", loading it and resuming training." % run_id)
checkpoint = torch.load(state_fpath)
init_step = checkpoint["step"]
print(f"Resuming training from step {init_step}")
model.load_state_dict(checkpoint["model_state"])
optimizer.load_state_dict(checkpoint["optimizer_state"])
optimizer.param_groups[0]["lr"] = learning_rate_init
else:
print("No model \"%s\" found, starting training from scratch." % run_id)
else:
print("Starting the training from scratch.")
# Initialize the visualization environment
vis = Visualizations(run_id, vis_every, server=visdom_server, disabled=no_visdom)
vis.log_dataset(train_dataset)
vis.log_params()
device_name = str(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU")
vis.log_implementation({"Device": device_name})
best_eer_file_path = "encoder_loss/best_eer.npy"
if not exists("encoder_loss"):
import os
os.mkdir("encoder_loss")
best_eer = np.load(best_eer_file_path)[0] if exists(best_eer_file_path) else 1
# Training loop
profiler = Profiler(summarize_every=1000, disabled=False)
for step, speaker_batch in enumerate(train_loader, init_step):
model.train()
profiler.tick("Blocking, waiting for batch (threaded)")
# Data to GPU mem
inputs = torch.from_numpy(speaker_batch.data).to(device)
sync(device)
profiler.tick("Data to %s" % device)
# Forward pass
embeds = model(inputs)
sync(device)
profiler.tick("Forward pass")
embeds_loss = embeds.view((speakers_per_batch, utterances_per_speaker, -1)).to(loss_device)
loss, eer = model.loss(embeds_loss)
sync(loss_device)
profiler.tick("Loss")
# Backward pass
model.zero_grad() # Sets gradients of all model parameters to zero
loss.backward() # Calc gradients of all model parameters
profiler.tick("Backward pass")
model.do_gradient_ops()
optimizer.step() # do gradient descent of all model parameters
profiler.tick("Parameter update")
# Update visualizations
# learning_rate = optimizer.param_groups[0]["lr"]
# Overwrite the latest version of the model
if save_every != 0 and step % save_every == 0:
current_lr *= 0.995
update_lr(optimizer, current_lr)
dev_loss, dev_eer, dev_embeds = validate(dev_loader, model, dev_batch, device, loss_device)
sync(device)
sync(loss_device)
profiler.tick("validate")
vis.update(loss.item(), eer, step, dev_loss, dev_eer)
if dev_eer < best_eer:
best_eer = dev_eer
np.save(best_eer_file_path, np.array([best_eer]))
print("Saving the model (step %d)" % step)
torch.save({
"step": step + 1,
"model_state": model.state_dict(),
"optimizer_state": optimizer.state_dict(),
}, state_fpath)
else:
vis.update(loss.item(), eer, step)
# Draw projections and save them to the backup folder
if umap_every != 0 and step % umap_every == 0:
print("Drawing and saving projections (step %d)" % step)
projection_fpath = model_dir / f"umap_{step:06d}.png"
dev_projection_fpath = model_dir / f"dev_umap_{step:06d}.png"
embeds = embeds.detach().cpu().numpy()
dev_embeds = dev_embeds.detach().cpu().numpy()
vis.draw_projections(embeds, dev_embeds, utterances_per_speaker, step, projection_fpath, dev_projection_fpath)
vis.save()
# # Make a backup
# if backup_every != 0 and step % backup_every == 0:
# print("Making a backup (step %d)" % step)
# backup_fpath = model_dir / f"encoder_{step:06d}.bak"
# torch.save({
# "step": step + 1,
# "model_state": model.state_dict(),
# "optimizer_state": optimizer.state_dict(),
# }, backup_fpath)
profiler.tick("Extras (visualizations, saving)")
def validate(dev_loader: DataLoader, model: SpeakerEncoder, dev_batch, device, loss_device):
model.eval()
losses = []
eers = []
with torch.no_grad():
for step, speaker_batch in enumerate(dev_loader, 1):
frames = torch.from_numpy(speaker_batch.data).to(device)
embeds = model.forward(frames)
embeds_loss = embeds.view((dev_batch, utterances_per_speaker, -1)).to(loss_device)
loss, eer = model.loss(embeds_loss)
losses.append(loss.item())
eers.append(eer)
return sum(losses) / len(losses), sum(eers) / len(eers), embeds.detach()
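Since current_lr is multiplied by 0.995 at every save point, the effective learning rate decays geometrically with the number of completed saves. A hedged sketch of that schedule (save_every = 1000 is the default of encoder_train.py below):

from encoder.params_model import learning_rate_init

def lr_at_step(step, save_every=1000, decay=0.995):
    # one decay is applied each time step reaches a multiple of save_every
    return learning_rate_init * decay ** (step // save_every)

# e.g. lr_at_step(100_000) ≈ 5e-6 * 0.995**100 ≈ 3.0e-6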

215
encoder/visualizations.py Normal file
View File

@@ -0,0 +1,215 @@
from datetime import datetime
from time import perf_counter as timer
import numpy as np
import umap
import visdom
from encoder.data_objects.speaker_verification_dataset import Train_Dataset
colormap = np.array([
[76, 255, 0],
[0, 127, 70],
[255, 0, 0],
[255, 217, 38],
[0, 135, 255],
[165, 0, 165],
[255, 167, 255],
[0, 255, 255],
[255, 96, 38],
[142, 76, 0],
[33, 0, 127],
[0, 0, 0],
[183, 183, 183],
], dtype=float) / 255
class Visualizations:
def __init__(self, env_name=None, update_every=10, server="http://localhost", disabled=False):
# Tracking data
self.last_update_timestamp = timer()
self.update_every = update_every
self.step_times = []
self.train_losses = []
self.train_eers = []
print("Updating the visualizations every %d steps." % update_every)
# If visdom is disabled TODO: use a better paradigm for that
self.disabled = disabled
if self.disabled:
return
# Set the environment name
now = str(datetime.now().strftime("%d-%m %Hh%M"))
if env_name is None:
self.env_name = now
else:
self.env_name = "%s (%s)" % (env_name, now)
# Connect to visdom and open the corresponding window in the browser
try:
self.vis = visdom.Visdom(server, env=self.env_name, raise_exceptions=True)
except ConnectionError:
raise Exception("No visdom server detected. Run the command \"visdom\" in your CLI to "
"start it.")
# webbrowser.open("http://localhost:8097/env/" + self.env_name)
# Create the windows
self.loss_win = None
self.eer_win = None
# self.lr_win = None
self.implementation_win = None
self.projection_win = None
self.dev_projection_win = None
self.implementation_string = ""
def log_params(self):
if self.disabled:
return
from encoder import params_data
from encoder import params_model
param_string = "<b>Model parameters</b>:<br>"
for param_name in (p for p in dir(params_model) if not p.startswith("__")):
value = getattr(params_model, param_name)
param_string += "\t%s: %s<br>" % (param_name, value)
param_string += "<b>Data parameters</b>:<br>"
for param_name in (p for p in dir(params_data) if not p.startswith("__")):
value = getattr(params_data, param_name)
param_string += "\t%s: %s<br>" % (param_name, value)
self.vis.text(param_string, opts={"title": "Parameters"})
def log_dataset(self, dataset: Train_Dataset):
if self.disabled:
return
dataset_string = ""
dataset_string += "<b>Speakers</b>: %s\n" % len(dataset.speakers)
dataset_string += "\n" + dataset.get_logs()
dataset_string = dataset_string.replace("\n", "<br>")
self.vis.text(dataset_string, opts={"title": "Dataset"})
def log_implementation(self, params):
if self.disabled:
return
implementation_string = ""
for param, value in params.items():
implementation_string += "<b>%s</b>: %s\n" % (param, value)
implementation_string = implementation_string.replace("\n", "<br>")
self.implementation_string = implementation_string
self.implementation_win = self.vis.text(
implementation_string,
opts={"title": "Training implementation"}
)
def update(self, loss, eer, step, dev_loss=None, dev_eer=None):
# Update the tracking data
now = timer()
self.step_times.append(1000 * (now - self.last_update_timestamp))
self.last_update_timestamp = now
self.train_losses.append(loss)
self.train_eers.append(eer)
print(".", end="")
# Update the plots every <update_every> steps
if step % self.update_every != 0:
return
time_string = "Step time: mean: %5dms std: %5dms" % \
(int(np.mean(self.step_times)), int(np.std(self.step_times)))
print("\nStep %6d Train Loss: %.4f Train EER: %.4f Dev Loss: %.4f Dev EER: %.4f %s" %
(step, np.mean(self.train_losses), np.mean(self.train_eers), dev_loss, dev_eer, time_string))
if not self.disabled:
loss_win_id = 'win1'
self.loss_win = self.vis.line(
[np.mean(self.train_losses)],
[step],
win=loss_win_id,
name="Avg. train Loss",
update="append" if loss_win_id else "None",
opts=dict(
xlabel="Step",
ylabel="Loss",
title="Loss",
)
)
self.vis.line(
[dev_loss],
[step],
win=loss_win_id,
name="Avg. dev Loss",
update="append"
)
err_win_id = 'win2'
self.eer_win = self.vis.line(
[np.mean(self.train_eers)],
[step],
win=err_win_id,
name="Avg. train EER",
update="append" if err_win_id else "None",
opts=dict(
xlabel="Step",
ylabel="EER",
title="Equal error rate"
)
)
self.vis.line(
[dev_eer],
[step],
win=err_win_id,
name="Avg. dev EER",
update="append"
)
if self.implementation_win is not None:
self.vis.text(
self.implementation_string + ("<b>%s</b>" % time_string),
win=self.implementation_win,
opts={"title": "Training implementation"},
)
# Reset the tracking
self.train_losses.clear()
self.train_eers.clear()
self.step_times.clear()
def draw_projections(self, embeds, dev_embeds, utterances_per_speaker, step, out_fpath=None, dev_out_fpath=None, max_speakers=10):
import matplotlib.pyplot as plt
max_speakers = min(max_speakers, len(colormap))
# draw train umap projections
embeds = embeds[:max_speakers * utterances_per_speaker]
n_speakers = len(embeds) // utterances_per_speaker
ground_truth = np.repeat(np.arange(n_speakers), utterances_per_speaker)
colors = [colormap[i] for i in ground_truth]
reducer = umap.UMAP()
projected = reducer.fit_transform(embeds)
plt.scatter(projected[:, 0], projected[:, 1], c=colors)
plt.gca().set_aspect("equal", "datalim")
plt.title("UMAP projection (step %d)" % step)
if not self.disabled:
self.projection_win = self.vis.matplot(plt, win=self.projection_win)
if out_fpath is not None:
plt.savefig(out_fpath)
plt.clf()
# draw dev umap projections
dev_embeds = dev_embeds[:max_speakers * utterances_per_speaker]
n_speakers = len(dev_embeds) // utterances_per_speaker
ground_truth = np.repeat(np.arange(n_speakers), utterances_per_speaker)
colors = [colormap[i] for i in ground_truth]
reducer = umap.UMAP()
projected = reducer.fit_transform(dev_embeds)
plt.scatter(projected[:, 0], projected[:, 1], c=colors)
plt.gca().set_aspect("equal", "datalim")
plt.title("dev UMAP projection (step %d)" % step)
if not self.disabled:
self.dev_projection_win = self.vis.matplot(plt, win=self.dev_projection_win)
if dev_out_fpath is not None:
plt.savefig(dev_out_fpath)
plt.clf()
def save(self):
if not self.disabled:
self.vis.save([self.env_name])

71
encoder_preprocess.py Normal file
View File

@@ -0,0 +1,71 @@
from encoder.preprocess import preprocess_librispeech, preprocess_voxceleb1, preprocess_voxceleb2
from utils.argutils import print_args
from pathlib import Path
import argparse
if __name__ == "__main__":
class MyFormatter(argparse.ArgumentDefaultsHelpFormatter, argparse.RawDescriptionHelpFormatter):
pass
parser = argparse.ArgumentParser(
description="Preprocesses audio files from datasets, encodes them as mel spectrograms and "
"writes them to the disk. This will allow you to train the encoder. The "
"datasets required are at least one of VoxCeleb1, VoxCeleb2 and LibriSpeech. "
"Ideally, you should have all three. You should extract them as they are "
"after having downloaded them and put them in a same directory, e.g.:\n"
"-[datasets_root]\n"
" -LibriSpeech\n"
" -train-other-500\n"
" -VoxCeleb1\n"
" -wav\n"
" -vox1_meta.csv\n"
" -VoxCeleb2\n"
" -dev",
formatter_class=MyFormatter
)
parser.add_argument("datasets_root", type=Path, help=\
"Path to the directory containing your LibriSpeech/TTS and VoxCeleb datasets.")
parser.add_argument("-o", "--out_dir", type=Path, default=argparse.SUPPRESS, help=\
"Path to the output directory that will contain the mel spectrograms. If left out, "
"defaults to <datasets_root>/SV2TTS/encoder/")
parser.add_argument("-d", "--datasets", type=str,
default="librispeech_other,voxceleb2,voxceleb1", help=\
"Comma-separated list of the name of the datasets you want to preprocess. Only the train "
"set of these datasets will be used. Possible names: librispeech_other, voxceleb1, "
"voxceleb2.")
parser.add_argument("-s", "--skip_existing", action="store_true", help=\
"Whether to skip existing output files with the same name. Useful if this script was "
"interrupted.")
parser.add_argument("--no_trim", action="store_true", help=\
"Preprocess audio without trimming silences (not recommended).")
args = parser.parse_args()
# Verify webrtcvad is available
if not args.no_trim:
try:
import webrtcvad
except:
raise ModuleNotFoundError("Package 'webrtcvad' not found. This package enables "
"noise removal and is recommended. Please install and try again. If installation fails, "
"use --no_trim to disable this error message.")
del args.no_trim
# Process the arguments
args.datasets = args.datasets.split(",")
if not hasattr(args, "out_dir"):
args.out_dir = args.datasets_root.joinpath("SV2TTS", "encoder")
assert args.datasets_root.exists()
args.out_dir.mkdir(exist_ok=True, parents=True)
# Preprocess the datasets
print_args(args, parser)
preprocess_func = {
"voxceleb1": preprocess_voxceleb1,
"voxceleb2": preprocess_voxceleb2,
"librispeech_other": preprocess_librispeech,
}
args = vars(args)
for dataset in args.pop("datasets"):
print("Preprocessing %s" % dataset)
preprocess_func[dataset](**args)
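A typical invocation (illustrative path), matching the directory layout described in the help text above; the mel spectrograms are written to <datasets_root>/SV2TTS/encoder/ by default:

python encoder_preprocess.py /path/to/datasets_root -d librispeech_other,voxceleb1,voxceleb2 -s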

167
encoder_test_preprocess.py Normal file
View File

@@ -0,0 +1,167 @@
from datetime import datetime
from functools import partial
from multiprocessing import Pool
from pathlib import Path
import argparse
import numpy as np
from tqdm import tqdm
from encoder import audio
from encoder.config import librispeech_datasets, anglophone_nationalites
from encoder.params_data import *
_AUDIO_EXTENSIONS = ("wav", "flac", "m4a", "mp3")
class DatasetLog:
"""
Registers metadata about the dataset in a text file.
"""
def __init__(self, root, name):
self.text_file = open(Path(root, "Log_%s.txt" % name.replace("/", "_")), "w")
self.sample_data = dict()
start_time = str(datetime.now().strftime("%A %d %B %Y at %H:%M"))
self.write_line("Creating dataset %s on %s" % (name, start_time))
self.write_line("-----")
self._log_params()
def _log_params(self):
from encoder import params_data
self.write_line("Parameter values:")
for param_name in (p for p in dir(params_data) if not p.startswith("__")):
value = getattr(params_data, param_name)
self.write_line("\t%s: %s" % (param_name, value))
self.write_line("-----")
def write_line(self, line):
self.text_file.write("%s\n" % line)
def add_sample(self, **kwargs):
for param_name, value in kwargs.items():
if param_name not in self.sample_data:
self.sample_data[param_name] = []
self.sample_data[param_name].append(value)
def finalize(self):
self.write_line("Statistics:")
for param_name, values in self.sample_data.items():
self.write_line("\t%s:" % param_name)
self.write_line("\t\tmin %.3f, max %.3f" % (np.min(values), np.max(values)))
self.write_line("\t\tmean %.3f, median %.3f" % (np.mean(values), np.median(values)))
self.write_line("-----")
end_time = str(datetime.now().strftime("%A %d %B %Y at %H:%M"))
self.write_line("Finished on %s" % end_time)
self.text_file.close()
def _init_preprocess_dataset(dataset_name, datasets_root, out_dir):
dataset_root = datasets_root.joinpath(dataset_name)
if not dataset_root.exists():
print("Couldn\'t find %s, skipping this dataset." % dataset_root)
return None, None
return dataset_root, DatasetLog(out_dir, dataset_name)
def _preprocess_speaker(speaker_dir: Path, datasets_root: Path, out_dir: Path, skip_existing: bool):
out_dir.mkdir(exist_ok=True)
# Give a name to the speaker that includes its dataset
speaker_name = "_".join(speaker_dir.relative_to(datasets_root).parts)
# Create an output directory with that name, as well as a txt file containing a
# reference to each source file.
speaker_out_dir = out_dir.joinpath(speaker_name)
speaker_out_dir.mkdir(exist_ok=True)
sources_fpath = speaker_out_dir.joinpath("_sources.txt")
# There's a possibility that the preprocessing was interrupted earlier, check if
# there already is a sources file.
if sources_fpath.exists():
try:
with sources_fpath.open("r") as sources_file:
existing_fnames = {line.split(",")[0] for line in sources_file}
except:
existing_fnames = {}
else:
existing_fnames = {}
# Gather all audio files for that speaker recursively
sources_file = sources_fpath.open("a" if skip_existing else "w")
audio_durs = []
for extension in _AUDIO_EXTENSIONS:
for in_fpath in speaker_dir.glob("**/*.%s" % extension):
# Check if the target output file already exists
out_fname = "_".join(in_fpath.relative_to(speaker_dir).parts)
out_fname = out_fname.replace(".%s" % extension, ".npy")
if skip_existing and out_fname in existing_fnames:
continue
# Load and preprocess the waveform
wav = audio.preprocess_wav(in_fpath)
if len(wav) == 0:
continue
# Create the mel spectrogram, discard those that are too short
frames = audio.wav_to_mel_spectrogram(wav)
if len(frames) < partials_n_frames:
continue
out_fpath = speaker_out_dir.joinpath(out_fname)
np.save(out_fpath, frames)
sources_file.write("%s,%s\n" % (out_fname, in_fpath))
audio_durs.append(len(wav) / sampling_rate)
sources_file.close()
return audio_durs
def _preprocess_speaker_dirs(speaker_dirs, dataset_name, datasets_root, out_dir, skip_existing, logger):
print("%s: Preprocessing data for %d speakers." % (dataset_name, len(speaker_dirs)))
# Process the utterances for each speaker
work_fn = partial(_preprocess_speaker, datasets_root=datasets_root, out_dir=out_dir, skip_existing=skip_existing)
with Pool(4) as pool:
tasks = pool.imap(work_fn, speaker_dirs)
for sample_durs in tqdm(tasks, dataset_name, len(speaker_dirs), unit="speakers"):
for sample_dur in sample_durs:
logger.add_sample(duration=sample_dur)
logger.finalize()
print("Done preprocessing %s.\n" % dataset_name)
def preprocess_librispeechtest(datasets_root: Path, out_dir: Path, skip_existing=False):
# preprocess test dataset
for dataset_name in librispeech_datasets["test"]["other"]:
# Initialize the preprocessing
dataset_root, logger = _init_preprocess_dataset(dataset_name, datasets_root, out_dir)
if not dataset_root:
return
# Preprocess all speakers
speaker_dirs = list(dataset_root.glob("*"))
_preprocess_speaker_dirs(speaker_dirs, dataset_name, datasets_root, out_dir.joinpath("test"), skip_existing, logger)
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Preprocesses audio files from librispeech test other dataset, encodes them as mel spectrograms and "
"writes them to the disk.",
formatter_class=argparse.ArgumentDefaultsHelpFormatter
)
parser.add_argument("datasets_root", type=Path, help=\
"Path to the directory containing your LibriSpeech/TTS and VoxCeleb datasets.")
parser.add_argument("-o", "--out_dir", type=Path, default=argparse.SUPPRESS, help=\
"Path to the output directory that will contain the mel spectrograms. If left out, "
"defaults to <datasets_root>/SV2TTS/encoder/")
parser.add_argument("-s", "--skip_existing", action="store_true", help=\
"Whether to skip existing output files with the same name. Useful if this script was "
"interrupted.")
args = parser.parse_args()
if not hasattr(args, "out_dir"):
args.out_dir = args.datasets_root.joinpath("SV2TTS", "encoder")
assert args.datasets_root.exists()
args.out_dir.mkdir(exist_ok=True, parents=True)
args = vars(args)
preprocess_librispeechtest(**args)
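Likewise, a typical invocation of the test-set preprocessing script above (illustrative path; output goes to <datasets_root>/SV2TTS/encoder/test by default):

python encoder_test_preprocess.py /path/to/datasets_root -s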

View File

@@ -0,0 +1,156 @@
from pathlib import Path
import argparse
import os
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.manifold import MDS
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import TSNE
import umap
import torch
from encoder.data_objects import DataLoader, Train_Dataset, Dev_Dataset
from encoder.model import SpeakerEncoder
from encoder.params_model import *
from encoder.params_data import *
colormap = np.array([
[76, 255, 0],
[0, 255, 76],
[0, 76, 255],
[0, 127, 70],
[70, 127, 0],
[127, 70, 0],
[255, 0, 0],
[255, 217, 38],
[255, 38, 217],
[38, 217, 255],
[0, 135, 255],
[135, 0, 255],
[255, 135, 0],
[165, 0, 165],
[0, 165, 165],
[165, 165, 0],
[255, 167, 255],
[167, 255, 255],
[255, 255, 167],
[0, 255, 255],
[255, 0, 255],
[255, 255, 0],
[255, 96, 38],
[96, 255, 38],
[38, 96, 255],
[142, 76, 0],
[142, 0, 76],
[0, 76, 142],
[33, 0, 127],
[0, 33, 127],
[33, 127, 0],
[0, 0, 0],
[183, 183, 183],
], dtype=float) / 255
def draw_scatterplot(x, labels, num_speakers, algo):
sns.color_palette("tab10")
colors = [colormap[i] for i in labels]
plt.scatter(x=x[:, 0], y=x[:, 1], c=colors)
plt.title(f"{algo}({num_speakers} speakers)")
if not os.path.exists("dim_reduction_results"):
os.mkdir("dim_reduction_results")
plt.savefig(f"dim_reduction_results/{algo}_{num_speakers}.png", dpi=600)
plt.clf()
def test_visualization(run_id: str, clean_data_root: Path, models_dir: Path):
test_dataset = Dev_Dataset(clean_data_root.joinpath("test"))
num_speakers = len(test_dataset)
test_loader = DataLoader(
test_dataset,
num_speakers,
utterances_per_speaker,
shuffle=False,
num_workers=4,
pin_memory=True
)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
loss_device = torch.device("cuda" if torch.cuda.is_available() else "cpu") ####modified####
# Create the model and the optimizer
model = SpeakerEncoder(device, loss_device)
# Configure file path for the model
model_dir = models_dir / run_id
model_dir.mkdir(exist_ok=True, parents=True)
state_fpath = model_dir / "encoder.pt"
# Load any existing model
if state_fpath.exists():
print("Found existing model \"%s\", loading it and test." % run_id)
checkpoint = torch.load(state_fpath)
model.load_state_dict(checkpoint["model_state"])
model.eval()
with torch.no_grad():
for step, speaker_batch in enumerate(test_loader, 1):
frames = torch.from_numpy(speaker_batch.data).to(device)
embeds = model.forward(frames)
num_speakers_for_visualization = num_speakers
embeds_cpu = embeds.detach().cpu().numpy()[:num_speakers_for_visualization*utterances_per_speaker, :]
labels = np.repeat(np.arange(num_speakers_for_visualization), utterances_per_speaker)
embeds_pca = PCA(n_components=2).fit_transform(embeds_cpu)
draw_scatterplot(embeds_pca, labels, num_speakers_for_visualization, "PCA")
embeds_mds = MDS(n_components=2).fit_transform(embeds_cpu)
draw_scatterplot(embeds_mds, labels, num_speakers_for_visualization, "MDS")
embeds_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(embeds_cpu, labels)
draw_scatterplot(embeds_lda, labels, num_speakers_for_visualization, "LDA")
embeds_tsne = TSNE(n_components=2, perplexity=10).fit_transform(embeds_cpu)
draw_scatterplot(embeds_tsne, labels, num_speakers_for_visualization, "T-SNE")
embeds_umap = umap.UMAP(n_components=2).fit_transform(embeds_cpu)
draw_scatterplot(embeds_umap, labels, num_speakers_for_visualization, "UMAP")
embeds_cpu_zero_op = np.copy(embeds_cpu)
embeds_cpu_zero_op[embeds_cpu_zero_op < set_zero_thres] = 0
embeds_tsne = TSNE(n_components=2, perplexity=10).fit_transform(embeds_cpu_zero_op)
draw_scatterplot(embeds_tsne, labels, num_speakers_for_visualization, "T-SNE_zero_op")
embeds_umap = umap.UMAP(n_components=2).fit_transform(embeds_cpu_zero_op)
draw_scatterplot(embeds_umap, labels, num_speakers_for_visualization, "UMAP_zero_op")
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Trains the speaker encoder. You must have run encoder_preprocess.py first.",
formatter_class=argparse.ArgumentDefaultsHelpFormatter
)
parser.add_argument("run_id", type=str, help= \
"Name for this model. By default, training outputs will be stored to saved_models/<run_id>/. If a model state "
"from the same run ID was previously saved, the training will restart from there. Pass -f to overwrite saved "
"states and restart from scratch.")
parser.add_argument("clean_data_root", type=Path, help= \
"Path to the output directory of encoder_preprocess.py. If you left the default "
"output directory when preprocessing, it should be <datasets_root>/SV2TTS/encoder/.")
parser.add_argument("-m", "--models_dir", type=Path, default="saved_models", help=\
"Path to the root directory that contains all models. A directory <run_name> will be created under this root."
"It will contain the saved model weights, as well as backups of those weights and plots generated during "
"training.")
args = parser.parse_args()
args = vars(args)
test_visualization(**args)

44
encoder_train.py Normal file
View File

@@ -0,0 +1,44 @@
from utils.argutils import print_args
from encoder.train import train
from pathlib import Path
import argparse
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Trains the speaker encoder. You must have run encoder_preprocess.py first.",
formatter_class=argparse.ArgumentDefaultsHelpFormatter
)
parser.add_argument("run_id", type=str, help= \
"Name for this model. By default, training outputs will be stored to saved_models/<run_id>/. If a model state "
"from the same run ID was previously saved, the training will restart from there. Pass -f to overwrite saved "
"states and restart from scratch.")
parser.add_argument("clean_data_root", type=Path, help= \
"Path to the output directory of encoder_preprocess.py. If you left the default "
"output directory when preprocessing, it should be <datasets_root>/SV2TTS/encoder/.")
parser.add_argument("-m", "--models_dir", type=Path, default="saved_models", help=\
"Path to the root directory that contains all models. A directory <run_name> will be created under this root."
"It will contain the saved model weights, as well as backups of those weights and plots generated during "
"training.")
parser.add_argument("-v", "--vis_every", type=int, default=1000, help= \
"Number of steps between updates of the loss and the plots.")
parser.add_argument("-u", "--umap_every", type=int, default=2000, help= \
"Number of steps between updates of the umap projection. Set to 0 to never update the "
"projections.")
parser.add_argument("-s", "--save_every", type=int, default=1000, help= \
"Number of steps between updates of the model on the disk. Set to 0 to never save the "
"model.")
parser.add_argument("-b", "--backup_every", type=int, default=5000, help= \
"Number of steps between backups of the model. Set to 0 to never make backups of the "
"model.")
parser.add_argument("-f", "--force_restart", action="store_true", help= \
"Do not load any saved model.")
parser.add_argument("--visdom_server", type=str, default="http://localhost")
parser.add_argument("--no_visdom", action="store_true", help= \
"Disable visdom.")
args = parser.parse_args()
# Run the training
print_args(args, parser)
train(**vars(args))
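A typical training invocation (illustrative run id and path; pass --no_visdom if no visdom server is running, or -f to discard any previously saved state):

python encoder_train.py my_run /path/to/datasets_root/SV2TTS/encoder/ --no_visdom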

186
index.html Normal file
View File

@@ -0,0 +1,186 @@
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="generator" content="Hugo 0.88.1" />
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/8.4/styles/github.min.css">
<link rel="stylesheet" href="css/custom.css">
<link rel="stylesheet" href="css/normalize.css">
<title>Voice Cloning</title>
<link href="css/bootstrap.min.css" rel="stylesheet">
</head>
<body data-new-gr-c-s-check-loaded="14.1091.0" data-gr-ext-installed="">
<div class="container" >
<header role="banner">
</header>
<main role="main">
<article itemscope itemtype="https://schema.org/BlogPosting">
<div class="container pt-5 mt-5 shadow p-5 mb-5 bg-white rounded">
<div class="text-center">
<h1>Real-Time Voice Cloning v2</h1>
</div>
<br>
paper: <a href="https://arxiv.org/pdf/1806.04558.pdf">https://arxiv.org/pdf/1806.04558.pdf</a>
<br>
<br>
code: <a href="https://github.com/liuhaozhe6788/voice-cloning-collab">https://github.com/liuhaozhe6788/voice-cloning-collab</a>
<br>
</div>
<div class="container pt-5 mt-5 shadow p-5 mb-5 bg-white rounded">
<h2 id="model-overview" style="text-align: center;">Model Overview</h2>
<p style="text-align: center;">
<img src="docs/images/voice_cloning_arch.png" height="400" width="800">
<br>
The architecture is the same as that in the paper.
</p>
</div>
<div class="container pt-5 mt-5 shadow p-5 mb-5 bg-white rounded">
<h2 id="libriSpeech-test-samples" style="text-align: center;">LibriSpeech test Samples</h2>
<div class="table-responsive pt-3">
<table class="table table-hover pt-2">
<thead>
<tr>
<th style="text-align: center">Speaker Prompt</th>
<th style="text-align: center">Text</th>
<th style="text-align: center">Generated Audio</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="3" align = "center">
<audio controls src="samples/260-123286-0000.flac"></audio>
<a href="samples/260-123286-0000.flac">
</a>
</td>
<td>Life was like a box of chocolates, you never know what you're gonna get.</td>
<td align = "center">
<audio controls src="demo_results/text1/260-123286-0000_syn.wav"></audio>
<a href="demo_results/text1/260-123286-0000_syn.wav">
</a>
</td>
</tr>
<tr>
<td>In 2014, P&G recorded $83.1 billion in sales. On August 1, 2014, P&G announced it was streamlining the company, dropping and selling off around 100 brands from its product portfolio in order to focus on the remaining 65 brands, which produced 95% of the company's profits.</td>
<td align = "center">
<audio controls src="demo_results/text2/260-123286-0000_syn.wav"></audio>
<a href="demo_results/text2/260-123286-0000_syn.wav">
</a>
</td>
</tr>
<tr>
<td>Mechanics is a branch of physics that deals with the behavior of physical bodies under the influence of various forces. The study of mechanics is important in understanding the behavior of machines, the motion of objects, and the principles of engineering. Mechanics has been an essential part of physics since ancient times and has continued to evolve with advancements in science and technology. This paper will discuss the principles of mechanics, the laws of motion, and the applications of mechanics in engineering and technology.</td>
<td align = "center">
<audio controls src="demo_results/text3/260-123286-0000_syn.wav"></audio>
<a href="demo_results/text3/260-123286-0000_syn.wav">
</a>
</td>
</tr>
<tr>
<td rowspan="3" align = "center">
<audio controls src="samples/1688-142285-0000.flac"></audio>
<a href="samples/1688-142285-0000.flac">
</a>
</td>
<td>Life was like a box of chocolates, you never know what you're gonna get.</td>
<td align = "center">
<audio controls src="demo_results/text1/1688-142285-0000_syn.wav"></audio>
<a href="demo_results/text1/1688-142285-0000_syn.wav">
</a>
</td>
</tr>
<tr>
<td>In 2014, P&G recorded $83.1 billion in sales. On August 1, 2014, P&G announced it was streamlining the company, dropping and selling off around 100 brands from its product portfolio in order to focus on the remaining 65 brands, which produced 95% of the company's profits.</td>
<td align = "center">
<audio controls src="demo_results/text2/1688-142285-0000_syn.wav"></audio>
<a href="demo_results/text2/1688-142285-0000_syn.wav">
</a>
</td>
</tr>
<tr>
<td>Mechanics is a branch of physics that deals with the behavior of physical bodies under the influence of various forces. The study of mechanics is important in understanding the behavior of machines, the motion of objects, and the principles of engineering. Mechanics has been an essential part of physics since ancient times and has continued to evolve with advancements in science and technology. This paper will discuss the principles of mechanics, the laws of motion, and the applications of mechanics in engineering and technology.</td>
<td align = "center">
<audio controls src="demo_results/text3/1688-142285-0000_syn.wav"></audio>
<a href="demo_results/text3/1688-142285-0000_syn.wav">
</a>
</td>
</tr>
<tr>
<td rowspan="3" align = "center">
<audio controls src="samples/4294-9934-0000.flac"></audio>
<a href="samples/4294-9934-0000.flac">
</a>
</td>
<td>Life was like a box of chocolates, you never know what you're gonna get.</td>
<td align = "center">
<audio controls src="demo_results/text1/4294-9934-0000_syn.wav"></audio>
<a href="demo_results/text1/4294-9934-0000_syn.wav">
</a>
</td>
</tr>
<tr>
<td>In 2014, P&G recorded $83.1 billion in sales. On August 1, 2014, P&G announced it was streamlining the company, dropping and selling off around 100 brands from its product portfolio in order to focus on the remaining 65 brands, which produced 95% of the company's profits.</td>
<td align = "center">
<audio controls src="demo_results/text2/4294-9934-0000_syn.wav"></audio>
<a href="demo_results/text2/4294-9934-0000_syn.wav">
</a>
</td>
</tr>
<tr>
<td>Mechanics is a branch of physics that deals with the behavior of physical bodies under the influence of various forces. The study of mechanics is important in understanding the behavior of machines, the motion of objects, and the principles of engineering. Mechanics has been an essential part of physics since ancient times and has continued to evolve with advancements in science and technology. This paper will discuss the principles of mechanics, the laws of motion, and the applications of mechanics in engineering and technology.</td>
<td align = "center">
<audio controls src="demo_results/text3/4294-9934-0000_syn.wav"></audio>
<a href="demo_results/text3/4294-9934-0000_syn.wav">
</a>
</td>
</tr>
<tr>
<td rowspan="3" align = "center">
<audio controls src="samples/7176-88083-0000.flac"></audio>
<a href="samples/7176-88083-0000.flac">
</a>
</td>
<td>Life was like a box of chocolates, you never know what you're gonna get.</td>
<td align = "center">
<audio controls src="demo_results/text1/7176-88083-0000_syn.wav"></audio>
<a href="demo_results/text1/7176-88083-0000_syn.wav">
</a>
</td>
</tr>
<tr>
<td>In 2014, P&G recorded $83.1 billion in sales. On August 1, 2014, P&G announced it was streamlining the company, dropping and selling off around 100 brands from its product portfolio in order to focus on the remaining 65 brands, which produced 95% of the company's profits.</td>
<td align = "center">
<audio controls src="demo_results/text2/7176-88083-0000_syn.wav"></audio>
<a href="demo_results/text2/7176-88083-0000_syn.wav">
</a>
</td>
</tr>
<tr>
<td>Mechanics is a branch of physics that deals with the behavior of physical bodies under the influence of various forces. The study of mechanics is important in understanding the behavior of machines, the motion of objects, and the principles of engineering. Mechanics has been an essential part of physics since ancient times and has continued to evolve with advancements in science and technology. This paper will discuss the principles of mechanics, the laws of motion, and the applications of mechanics in engineering and technology.</td>
<td align = "center">
<audio controls src="demo_results/text3/7176-88083-0000_syn.wav"></audio>
<a href="demo_results/text3/7176-88083-0000_syn.wav">
</a>
</td>
</tr>
</tbody>
</table>
</div>
</div>
</article>
</main>
</div>
</body>
</html>

BIN
requirements.txt Normal file

Binary file not shown.


BIN
samples/4294-9934-0000.flac Normal file

Binary file not shown.


2
samples/README.md Normal file
View File

@@ -0,0 +1,2 @@
260-123286-0000.flac and 7176-88083-0000.flac are from LibriSpeech test-clean.
1688-142285-0000.flac and 4294-9934-0000.flac are from LibriSpeech test-other.


105
speed_changer/fixSpeed.py Normal file
View File

@@ -0,0 +1,105 @@
import os
from ffmpeg import audio
from pathlib import Path
import numpy as np
import parselmouth
from synthesizer.inference import Synthesizer_infer
from synthesizer.hparams import syn_hparams
import soundfile as sf
from parselmouth.praat import run_file
high_lim_speed_factor = 1.5
low_lim_speed_factor = 0.4
def AudioAnalysis(dir, file):
sound = os.path.join(dir, file)
dir_path = os.path.dirname(os.path.realpath(__file__)) # current dir
source_run = os.path.join(dir_path, "myspsolution.praat")
try:
objects = run_file(source_run, -20, 2, 0.27, "yes",sound, dir, 80, 400, 0.01, capture_output=True, return_variables = True)
# The 4th argument maps to Minimum_pause_duration in the original Praat script; lower it slightly if problems occur
totDur = objects[2]['originaldur']
nPause = objects[2]['npause']
arDur = objects[2]['speakingtot']
nSyl = objects[2]['voicedcount']
arRate = objects[2]['articulationrate']
except Exception:
totDur = 0
nPause = 0
arDur = 0
nSyl = 0
arRate = 0
print("Try again the sound of the audio was not clear")
return round(totDur, 2), int(nPause), round(arDur, 2), int(nSyl), round(arRate, 2)
def FixSpeed(totDur_ori: float,
nPause_ori: int,
arDur_ori: float,
nSyl_ori: int,
arRate_ori: float,
audio_syn):
speed_factor = 0
path_syn, filename_syn = os.path.split(audio_syn)
name_syn, suffix_syn = os.path.splitext(filename_syn)
totDur_syn, nPause_syn, arDur_syn, nSyl_syn, arRate_syn = AudioAnalysis(path_syn, filename_syn)
print(f"for original audio:\n\ttotDur = {totDur_ori}s\n\tnPause = {nPause_ori}\n\tarDur = {arDur_ori}s\n\tnSyl = {nSyl_ori}\n\tarRate = {arRate_ori} per second\n-----")
print(f"for synthesized audio:\n\ttotDur = {totDur_syn}s\n\tnPause = {nPause_syn}\n\tarDur = {arDur_syn}s\n\tnSyl = {nSyl_syn}\n\tarRate = {arRate_syn} per second\n-----")
if arRate_syn == 0:
print("exception!\n The speed factor is abnormal")
return audio_syn, speed_factor
speed_factor = round(arRate_ori/arRate_syn, 2)
print(f"speed_factor = {speed_factor}")
if speed_factor > high_lim_speed_factor or\
speed_factor < low_lim_speed_factor:
print("exception!\n The speed factor is abnormal")
return audio_syn, speed_factor
else:
out_file = os.path.join(path_syn, name_syn + "_{}".format(speed_factor) + suffix_syn)
audio.a_speed(audio_syn, speed_factor, out_file)
os.remove(audio_syn) # remove intermediate wav files
print(f"Finished!\nThe path of out_file is {out_file}")
return out_file, speed_factor
def TransFormat(fullpath, out_suffix):
is_wav_file = False # whether the original audio already has a .wav extension
path_, name = os.path.split(fullpath)
name, suffix = os.path.splitext(name)
wav = Synthesizer_infer.load_preprocess_wav(fullpath)
if suffix == ".wav": # 如果原始音频的后缀为.wav则不用进行格式转换
is_wav_file = True
return is_wav_file, wav, str(fullpath)
else: # otherwise convert the original audio to .wav
out_file = os.path.join(path_, name + "." + str(out_suffix))
sf.write(out_file, wav.astype(np.float32), syn_hparams.sample_rate)
return is_wav_file, wav, str(out_file)
def DelFile(rootDir, matchText: str):
fileList = os.listdir(rootDir)
for file in fileList:
if matchText in file:
delFile = os.path.join(rootDir, file)
os.remove(delFile)
print("Deleted", delFile)
def work(totDur_ori: float,
nPause_ori: int,
arDur_ori: float,
nSyl_ori: int,
arRate_ori: float,
audio_syn):
fix_file, speed_factor = FixSpeed(totDur_ori,
nPause_ori,
arDur_ori,
nSyl_ori,
arRate_ori,
audio_syn)
# DelFile(in_path, '.TextGrid')
out_path, _ = os.path.split(audio_syn)
DelFile(out_path, '.TextGrid')
return fix_file, speed_factor
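(Illustrative usage sketch, not part of the committed file; the reference and output paths below are hypothetical placeholders.) The intended flow is: measure the reference recording with AudioAnalysis, then hand its statistics plus the synthesized wav to work(), which stretches or compresses the output so its articulation rate matches the reference:

# usage sketch -- illustrative only; paths are placeholders
from speed_changer.fixSpeed import AudioAnalysis, work

ref_dir, ref_file = "samples", "4294-9934-0000.flac"   # reference speaker recording
syn_wav = "out_audios/4294-9934-0000_syn.wav"          # hypothetical synthesizer output (.wav)

# duration, pause count, articulation duration, syllable count, articulation rate
totDur, nPause, arDur, nSyl, arRate = AudioAnalysis(ref_dir, ref_file)

# rescale the synthesized wav so its articulation rate matches the reference
fixed_wav, speed_factor = work(totDur, nPause, arDur, nSyl, arRate, syn_wav)
print(fixed_wav, speed_factor)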

627
speed_changer/myspsolution.praat Normal file
View File

@@ -0,0 +1,627 @@
###########################################################################
# The library was developed based upon the idea introduced #
# by Nivja DeJong and Ton Wempe [1], Paul Boersma and David Weenink [2], #
# Carlo Gussenhoven [3], #
# S.M Witt and S.J. Young [4] #
# Peaks in intensity (dB) that are preceded and followed by dips in #
# intensity are considered as potential syllable cores. #
# #
# Praat Script voice analysis #
# Copyright (C) 2017 Shahab Sabahi #
# #
# This program is a Mysolutions software intellectual property: #
# you can redistribute it and/or modify it under the terms #
# of the Mysolutions Permision. #
# #
# This program is distributed in the hope that it will be useful, #
# but WITHOUT ANY WARRANTY; without even the implied warranty of #
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. #
# #
# #
###########################################################################
#
# modified 2018 July by Shahab Sabahi,
# bug fixed concerning summing total pause, May 28th 2017
###########################################################################
clearinfo
# select all
# Remove
form Counting Syllables in Sound Utterances
real Silence_threshold_(dB) -20
real Minimum_dip_between_peaks_(dB) 2
real Minimum_pause_duration_(s) 0.27
boolean Keep_Soundfiles_and_Textgrids 1
sentence soundin
sentence directory
positive Minimum_pitch_(Hz) 80
positive Maximum_pitch_(Hz) 400
positive Time_step_(s) 0.01
endform
# shorten variables
silencedb = 'silence_threshold'
mindip = 'minimum_dip_between_peaks'
showtext = 'keep_Soundfiles_and_Textgrids'
minpause = 'minimum_pause_duration'
# read files
Read from file... 'soundin$'
# use object ID
soundname$ = selected$("Sound")
soundid = selected("Sound")
originaldur = Get total duration
# allow non-zero starting time
bt = Get starting time
# Use intensity to get threshold
To Intensity... 50 0 yes
intid = selected("Intensity")
start = Get time from frame number... 1
nframes = Get number of frames
end = Get time from frame number... 'nframes'
# estimate noise floor
minint = Get minimum... 0 0 Parabolic
# estimate noise max
maxint = Get maximum... 0 0 Parabolic
#get .99 quantile to get maximum (without influence of non-speech sound bursts)
max99int = Get quantile... 0 0 0.99
# estimate Intensity threshold
threshold = max99int + silencedb
threshold2 = maxint - max99int
threshold3 = silencedb - threshold2
if threshold < minint
threshold = minint
endif
# get pauses (silences) and speakingtime
To TextGrid (silences)... threshold3 minpause 0.1 silent sounding
textgridid = selected("TextGrid")
silencetierid = Extract tier... 1
silencetableid = Down to TableOfReal... sounding
nsounding = Get number of rows
npauses = 'nsounding'
speakingtot = 0
for ipause from 1 to npauses
beginsound = Get value... 'ipause' 1
endsound = Get value... 'ipause' 2
speakingdur = 'endsound' - 'beginsound'
speakingtot = 'speakingdur' + 'speakingtot'
endfor
select 'intid'
Down to Matrix
matid = selected("Matrix")
# Convert intensity to sound
To Sound (slice)... 1
sndintid = selected("Sound")
# use total duration, not end time, to find out duration of intdur
# in order to allow nonzero starting times.
intdur = Get total duration
intmax = Get maximum... 0 0 Parabolic
# estimate peak positions (all peaks)
To PointProcess (extrema)... Left yes no Sinc70
ppid = selected("PointProcess")
numpeaks = Get number of points
# fill array with time points
for i from 1 to numpeaks
t'i' = Get time from index... 'i'
endfor
# fill array with intensity values
select 'sndintid'
peakcount = 0
for i from 1 to numpeaks
value = Get value at time... t'i' Cubic
if value > threshold
peakcount += 1
int'peakcount' = value
timepeaks'peakcount' = t'i'
endif
endfor
# fill array with valid peaks: only intensity values if preceding
# dip in intensity is greater than mindip
select 'intid'
validpeakcount = 0
currenttime = timepeaks1
currentint = int1
for p to peakcount-1
following = p + 1
followingtime = timepeaks'following'
dip = Get minimum... 'currenttime' 'followingtime' None
diffint = abs(currentint - dip)
if diffint > mindip
validpeakcount += 1
validtime'validpeakcount' = timepeaks'p'
endif
currenttime = timepeaks'following'
currentint = Get value at time... timepeaks'following' Cubic
endfor
# Look for only voiced parts
select 'soundid'
To Pitch (ac)... 0.02 30 4 no 0.03 0.25 0.01 0.35 0.25 450
# keep track of id of Pitch
pitchid = selected("Pitch")
voicedcount = 0
for i from 1 to validpeakcount
querytime = validtime'i'
select 'textgridid'
whichinterval = Get interval at time... 1 'querytime'
whichlabel$ = Get label of interval... 1 'whichinterval'
select 'pitchid'
value = Get value at time... 'querytime' Hertz Linear
if value <> undefined
if whichlabel$ = "sounding"
voicedcount = voicedcount + 1
voicedpeak'voicedcount' = validtime'i'
endif
endif
endfor
# calculate time correction due to shift in time for Sound object versus
# intensity object
timecorrection = originaldur/intdur
# Insert voiced peaks in TextGrid
if showtext > 0
select 'textgridid'
Insert point tier... 1 syllables
for i from 1 to voicedcount
position = voicedpeak'i' * timecorrection
Insert point... 1 position 'i'
endfor
endif
Save as text file: "'directory$'/'soundname$'.TextGrid"
# use object ID
Read from file... 'soundin$'
soundname$ = selected$("Sound")
soundid = selected("Sound")
fileName$ = "f0points'soundname$'.txt"
# Calculate F0 values
To Pitch... time_step minimum_pitch maximum_pitch
numberOfFrames = Get number of frames
# Loop through all frames in the Pitch object:
select Pitch 'soundname$'
unit$ = "Hertz"
min_Hz = Get minimum... 0 0 Hertz Parabolic
min$ = "'min_Hz'"
max_Hz = Get maximum... 0 0 Hertz Parabolic
max$ = "'max_Hz'"
mean_Hz = Get mean... 0 0 Hertz
mean$ = "'mean_Hz'"
stdev_Hz = Get standard deviation... 0 0 Hertz
stdev$ = "'stdev_Hz'"
median_Hz = Get quantile... 0 0 0.50 Hertz
median$ = "'median_Hz'"
quantile25_Hz = Get quantile... 0 0 0.25 Hertz
quantile25$ = "'quantile25_Hz'"
quantile75_Hz = Get quantile... 0 0 0.75 Hertz
quantile75$ = "'quantile75_Hz'"
# Collect and save the pitch values from the individual frames to the text file:
quantile250 = 'quantile25$'
quantile750 = 'quantile75$'
meanall = 'mean$'
sd='stdev$'
medi='median$'
mini='min$'
maxi='max$'
# clean up before next sound file is opened
select 'intid'
plus 'matid'
plus 'sndintid'
plus 'ppid'
plus 'pitchid'
plus 'silencetierid'
plus 'silencetableid'
Read from file... 'soundin$'
soundname$ = selected$ ("Sound")
To Formant (burg)... 0 5 5500 0.025 50
Read from file... 'directory$'/'soundname$'.TextGrid
int=Get number of intervals... 2
appendInfoLine:"int = ", 'int'
if int<2
warning$="A noisy background or unnatural-sounding speech detected. No result try again"
appendInfoLine: warning$
# exitScript()
endif
# We then calculate F1, F2 and F3
fff= 0
eee= 0
inside= 0
outside= 0
for k from 2 to 'int'
select TextGrid 'soundname$'
label$ = Get label of interval... 2 'k'
if label$ <> ""
# calculates the onset and offset
vowel_onset = Get starting point... 2 'k'
vowel_offset = Get end point... 2 'k'
select Formant 'soundname$'
f_one = Get mean... 1 vowel_onset vowel_offset Hertz
f_two = Get mean... 2 vowel_onset vowel_offset Hertz
f_three = Get mean... 3 vowel_onset vowel_offset Hertz
appendInfoLine: "f_one = ", 'f_one'
appendInfoLine: "f_two = ", 'f_two'
appendInfoLine: "f_three = ", 'f_three'
ff = 'f_two'/'f_one'
lnf1 = 'f_one'
lnf2f1 = ('f_two'/'f_one')
uplim =(-0.012*'lnf1')+13.17
lowlim =(-0.0148*'lnf1')+8.18
f1uplim =(lnf2f1-13.17)/-0.012
f1lowlim =(lnf2f1-8.18)/-0.0148
if lnf1>='f1lowlim' and lnf1<='f1uplim'
inside = 'inside'+1
else
outside = 'outside'+1
endif
fff = 'fff'+'f1uplim'
eee = 'eee'+'f1lowlim'
ffff = 'fff'/'int'
eeee = 'eee'/'int'
pron =('inside'*100)/('inside'+'outside')
prom =('outside'*100)/('inside'+'outside')
prob1 = invBinomialP ('pron'/100, 'inside', 'inside'+'outside')
prob = 'prob1:2'
endif
endfor
lnf0 = (ln(f_one)-5.65)/0.31
f00 = exp (lnf0)
Remove
if showtext < 1
select 'soundid'
plus 'textgridid'
Remove
endif
# summarize results in Info window
speakingrate = 'voicedcount'/'originaldur'
speakingraterp = ('voicedcount'/'originaldur')*100/3.93
articulationrate = 'voicedcount'/'speakingtot'
articulationraterp = ('voicedcount'/'speakingtot')*100/4.64
npause = 'npauses'-1
asd = 'speakingtot'/'voicedcount'
avenumberofwords = ('voicedcount'/1.74)/'speakingtot'
avenumberofwordsrp = (('voicedcount'/1.74)/'speakingtot')*100/2.66
nuofwrdsinchunk = (('voicedcount'/1.74)/'speakingtot')* 'speakingtot'/'npauses'
nuofwrdsinchunkrp = ((('voicedcount'/1.74)/'speakingtot')* 'speakingtot'/'npauses')*100/9
avepauseduratin = ('originaldur'-'speakingtot')/('npauses'-1)
avepauseduratinrp = (('originaldur'-'speakingtot')/('npauses'-1))*100/0.75
balance = ('voicedcount'/'originaldur')/('voicedcount'/'speakingtot')
balancerp = (('voicedcount'/'originaldur')/('voicedcount'/'speakingtot'))*100/0.85
nuofwrds= ('voicedcount'/1.74)
f1norm = -0.0118*'pron'*'pron'+0.5072*'pron'+394.34
inpro = ('nuofwrds'*60/'originaldur')
polish = 'originaldur'/2
# Read the saved pitch points as a Matrix object:
if meanall<150
q25='quantile250'/100
q75='quantile750'/140
mr= 'meanall'/119
else
q25='quantile250'/183
q75='quantile750'/237
mr= 'meanall'/210
endif
# Convert the original minimum and maximum parameters in order to define the x scale of the
if q25<=1 and q75<=1 and mr>=0.95 and mr<=1.05
ins=10
elsif q25<=1 and q75<=1 and mr>=0.9 and mr<=1.1
ins=9
elsif q25<=1 and q75<=1 and mr>=0.85 and mr<=1.15
ins=8
elsif mr>=0.9 and mr<=1.1
ins=7
elsif mr>=0.8 and mr<=1.2
ins=6
elsif mr<=0.8
ins=4
else
ins=5
endif
#SCORING
if f00<90 or f00>255
z=1.16
elsif f00<97 or f00>245
z=2
elsif f00<115 or f00>245
z=3
elsif f00<=245 or f00>=115
z=4
else
z=1
endif
if nuofwrdsinchunk>=6.24 and avepauseduratin<=1.0
l=4
elsif nuofwrdsinchunk>=6.24 and avepauseduratin>1.0
l=3.6
elsif nuofwrdsinchunk>=4.4 and nuofwrdsinchunk<=6.24 and avepauseduratin<=1.15
l=3.3
elsif nuofwrdsinchunk>=4.4 and nuofwrdsinchunk<=6.24 and avepauseduratin>1.15
l=3
elsif nuofwrdsinchunk<4.4 and avepauseduratin<=1.15
l=2
elsif nuofwrdsinchunk<=4.4 and avepauseduratin>1.15
l=1.16
else
l=1
endif
if balance>=0.69 and avenumberofwords>=2.60
o=4
elsif balance>=0.60 and avenumberofwords>=2.43
o=3.5
elsif balance>=0.5 and avenumberofwords>=2.25
o=3
elsif balance>=0.5 and avenumberofwords>=2.07
o=2
elsif balance>=0.5 and avenumberofwords>=1.95
o=1.16
else
o=1
endif
if speakingrate<=4.26 and speakingrate>=3.16
q=4
elsif speakingrate<=3.16 and speakingrate>=2.54
q=3.5
elsif speakingrate<=2.54 and speakingrate>=1.91
q=3
elsif speakingrate<=1.91 and speakingrate>=1.28
q=2
elsif speakingrate<=1.28 and speakingrate>=1.0
q=1.16
else
q=1
endif
if balance>=0.69 and articulationrate>=4.54
w=4
elsif balance>=0.60 and articulationrate>=4.22
w=3.5
elsif balance>=0.50 and articulationrate>=3.91
w=3
elsif balance>=0.5 and articulationrate>=3.59
w=2
elsif balance>=0.5 and articulationrate>=3.10
w=1.16
else
w=1
endif
if inpro>=119 and ('f1norm'*1.1)>=f1lowlim
r = 4
elsif inpro>=119 and ('f1norm'*1.1)<f1lowlim
r = 3.8
elsif inpro<119 and inpro>=100 and ('f1norm'*1.1)>=f1lowlim
r = 3.6
elsif inpro<119 and inpro>=100 and ('f1norm'*1.1)<f1lowlim
r = 3.4
elsif inpro<100 and inpro>=80 and ('f1norm'*1.1)>=f1lowlim
r= 3.2
elsif inpro<100 and inpro>=80 and ('f1norm'*1.1)<f1lowlim
r = 2.8
elsif inpro<80 and inpro>=70 and ('f1norm'*1.1)>=f1lowlim
r = 2.4
elsif inpro<70 and inpro>=60 and ('f1norm'*1.1)>=f1lowlim
r = 2
elsif inpro<70 and inpro>=60 and ('f1norm'*1.1)<f1lowlim
r = 1.1
else
r = 0.3
endif
if articulationrate>=4.80 and balance>=0.8
qr = 4
elsif articulationrate>=4.80 and balance<0.8
qr = 3.8
elsif articulationrate<4.80 and articulationrate>=4.65 and balance>=0.8
qr = 3.6
elsif articulationrate<4.80 and articulationrate>=4.65 and balance<0.8
qr = 3.4
elsif articulationrate<4.65 and articulationrate>=4.55 and balance>=0.8
qr= 3.2
elsif articulationrate<4.65 and articulationrate>=4.55 and balance<0.8
qr = 2.8
elsif articulationrate<4.55 and articulationrate>=4.40 and balance>=0.8
qr = 2.4
elsif articulationrate<4.40 and articulationrate>=4.30 and balance>=0.8
qr = 2
elsif articulationrate<4.40 and articulationrate>=4.30 and balance<0.8
qr = 1.5
else
qr = 1
endif
# summarize SCORE in Info window
totalscore =(l*2+z*4+o*3+qr*3+w*4+r*4)/20
totalscale= 'totalscore'*25
if totalscore>=3.6
a=4
elsif totalscore>=0.6 and totalscore<2
a=1
elsif totalscore>=2 and totalscore<3
a=2
elsif totalscore>=3 and totalscore<3.6
a=3
else
a=0.5
endif
if totalscale>=90
s=4
elsif totalscale>=15 and totalscale<50
s=1
elsif totalscale>=50 and totalscale<75
s=2
elsif totalscale>=75 and totalscale<90
s=3
else
s=0.5
endif
#vvv=a+('totalscale'/100)
vvv=totalscore+('totalscale'/100)
if vvv>=4
u=4*(1-(randomInteger(1,16)/100))
else
u=vvv-(randomInteger(1,16)/100)
endif
if totalscore>=4
xx=30
elsif totalscore>=3.80 and totalscore<4
xx=29
elsif totalscore>=3.60 and totalscore<3.80
xx=28
elsif totalscore>=3.5 and totalscore<3.6
xx=27
elsif totalscore>=3.3 and totalscore<3.5
xx=26
elsif totalscore>=3.15 and totalscore<3.3
xx=25
elsif totalscore>=3.08 and totalscore<3.15
xx=24
elsif totalscore>=3 and totalscore<3.08
xx=23
elsif totalscore>=2.83 and totalscore<3
xx=22
elsif totalscore>=2.60 and totalscore<2.83
xx=21
elsif totalscore>=2.5 and totalscore<2.60
xx=20
elsif totalscore>=2.30 and totalscore<2.50
xx=19
elsif totalscore>=2.23 and totalscore<2.30
xx=18
elsif totalscore>=2.15 and totalscore<2.23
xx=17
elsif totalscore>=2 and totalscore<2.15
xx=16
elsif totalscore>=1.93 and totalscore<2
xx=15
elsif totalscore>=1.83 and totalscore<1.93
xx=14
elsif totalscore>=1.74 and totalscore<1.83
xx=13
elsif totalscore>=1.66 and totalscore<1.74
xx=12
elsif totalscore>=1.50 and totalscore<1.66
xx=11
elsif totalscore>=1.33 and totalscore<1.50
xx=10
else
xx=9
endif
overscore = xx*4/30
ov = overscore
if xx>=25
xxban$="C"
elsif xx>=20 and xx<25
xxban$="B2"
elsif xx>=16 and xx<20
xxban$="B1"
elsif xx>=10 and xx<16
xxban$="A2"
else
xxban$="A1"
endif
qaz = 0.18
rr = (r*4+qr*2+z*1)/7
lu = (l*1+w*2+inpro*4/125)/4
td = (w*1+o*2+inpro*1/125)/3.25
facts=(ln(7/4)*4/7+ln(7/2)*2/7+ln(7)*1/7+ln(4)*1/4+ln(2)*1/2+ln(4)*1/4+ln(3.25)*1/3.25+ln(3.25/2)*2/3.25+ln(3.25/0.25)*0.25/3.25+ln(14.25/7)*7/14.25+ln(14.25/4)*4/14.25+ln(14.25/3.35)*3.25/14.25)
totsco = (r*ln(7/4)*4/7+qr*ln(7/2)*2/7+z*ln(7)*1/7+l*ln(4)*1/4+w*ln(2)*1/2+ln(4)*1/4*inpro*4/125+w*ln(3.25)*1/3.25+o*ln(3.25/2)*2/3.25+ln(3.25/0.25)*0.25/3.25*inpro*4/125)/facts
if totalscore>=4
totsco=3.9
else
totsco=totalscore
endif
rrr = rr*qaz
lulu = lu*qaz
tdtd = td*qaz
totscoo = totsco*qaz
whx=rrr*cos(1.309)
why=rrr*sin(1.309)
who=4*qaz
probpron=(r/4)
lstd=(10*l)/4
ostd=(10*o)/4
wstd=(10*w)/4
rstd=(10*r)/4
zstd=(10*z)/4
qstd=(10*qr)/4
Erase all
appendInfoLine: "1. voicedcount = ", 'voicedcount:0'
appendInfoLine: "2. npause = ", 'npause:0'
appendInfoLine: "3. speakingrate = ", 'speakingrate:2'
appendInfoLine: "4. articulationrate = ", 'articulationrate:2'
appendInfoLine: "5. speakingtot = ", 'speakingtot:2'
appendInfoLine: "6. originaldur = ", 'originaldur:2'
appendInfoLine: "7. balance = ", 'balance:1'
appendInfoLine: "8. meanall = ", 'meanall:2'
appendInfoLine: "9. sd = ", 'sd:2'
appendInfoLine: "10. medi = ", 'medi:1'
appendInfoLine: "11. mini = ", 'mini:0'
appendInfoLine: "12. maxi = ", 'maxi:0'
appendInfoLine: "13. quantile250 = ", 'quantile250:0'
appendInfoLine: "14. quantile750 = ", 'quantile750:0'
appendInfoLine: "15. probpron = ", 'probpron:2'

24
synthesizer/LICENSE.txt Normal file
View File

@@ -0,0 +1,24 @@
MIT License
Original work Copyright (c) 2018 Rayhane Mama (https://github.com/Rayhane-mamah)
Original work Copyright (c) 2019 fatchord (https://github.com/fatchord)
Modified work Copyright (c) 2019 Corentin Jemine (https://github.com/CorentinJ)
Modified work Copyright (c) 2020 blue-fish (https://github.com/blue-fish)
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

1
synthesizer/__init__.py Normal file
View File

@@ -0,0 +1 @@
#

206
synthesizer/audio.py Normal file
View File

@@ -0,0 +1,206 @@
import librosa
import librosa.filters
import numpy as np
from scipy import signal
from scipy.io import wavfile
import soundfile as sf
def load_wav(path, sr):
return librosa.core.load(path, sr=sr)[0]
def save_wav(wav, path, sr):
wav *= 32767 / max(0.01, np.max(np.abs(wav)))
#proposed by @dsmiller
wavfile.write(path, sr, wav.astype(np.int16))
def save_wavenet_wav(wav, path, sr):
sf.write(path, wav.astype(np.float32), sr)
def preemphasis(wav, k, preemphasize=True):
if preemphasize:
return signal.lfilter([1, -k], [1], wav)
return wav
def inv_preemphasis(wav, k, inv_preemphasize=True):
if inv_preemphasize:
return signal.lfilter([1], [1, -k], wav)
return wav
#From https://github.com/r9y9/wavenet_vocoder/blob/master/audio.py
def start_and_end_indices(quantized, silence_threshold=2):
for start in range(quantized.size):
if abs(quantized[start] - 127) > silence_threshold:
break
for end in range(quantized.size - 1, 1, -1):
if abs(quantized[end] - 127) > silence_threshold:
break
assert abs(quantized[start] - 127) > silence_threshold
assert abs(quantized[end] - 127) > silence_threshold
return start, end
def get_hop_size(hparams):
hop_size = hparams.hop_size
if hop_size is None:
assert hparams.frame_shift_ms is not None
hop_size = int(hparams.frame_shift_ms / 1000 * hparams.sample_rate)
return hop_size
def linearspectrogram(wav, hparams):
D = _stft(preemphasis(wav, hparams.preemphasis, hparams.preemphasize), hparams)
S = _amp_to_db(np.abs(D), hparams) - hparams.ref_level_db
if hparams.signal_normalization:
return _normalize(S, hparams)
return S
def melspectrogram(wav, hparams):
D = _stft(preemphasis(wav, hparams.preemphasis, hparams.preemphasize), hparams)
S = _amp_to_db(_linear_to_mel(np.abs(D), hparams), hparams) - hparams.ref_level_db
if hparams.signal_normalization:
return _normalize(S, hparams)
return S
def inv_linear_spectrogram(linear_spectrogram, hparams):
"""Converts linear spectrogram to waveform using librosa"""
if hparams.signal_normalization:
D = _denormalize(linear_spectrogram, hparams)
else:
D = linear_spectrogram
S = _db_to_amp(D + hparams.ref_level_db) #Convert back to linear
if hparams.use_lws:
processor = _lws_processor(hparams)
D = processor.run_lws(S.astype(np.float64).T ** hparams.power)
y = processor.istft(D).astype(np.float32)
return inv_preemphasis(y, hparams.preemphasis, hparams.preemphasize)
else:
return inv_preemphasis(_griffin_lim(S ** hparams.power, hparams), hparams.preemphasis, hparams.preemphasize)
def inv_mel_spectrogram(mel_spectrogram, hparams):
"""Converts mel spectrogram to waveform using librosa"""
if hparams.signal_normalization:
D = _denormalize(mel_spectrogram, hparams)
else:
D = mel_spectrogram
S = _mel_to_linear(_db_to_amp(D + hparams.ref_level_db), hparams) # Convert back to linear
if hparams.use_lws:
processor = _lws_processor(hparams)
D = processor.run_lws(S.astype(np.float64).T ** hparams.power)
y = processor.istft(D).astype(np.float32)
return inv_preemphasis(y, hparams.preemphasis, hparams.preemphasize)
else:
return inv_preemphasis(_griffin_lim(S ** hparams.power, hparams), hparams.preemphasis, hparams.preemphasize)
def _lws_processor(hparams):
import lws
return lws.lws(hparams.n_fft, get_hop_size(hparams), fftsize=hparams.win_size, mode="speech")
def _griffin_lim(S, hparams):
"""librosa implementation of Griffin-Lim
Based on https://github.com/librosa/librosa/issues/434
"""
angles = np.exp(2j * np.pi * np.random.rand(*S.shape))
S_complex = np.abs(S).astype(complex)  # np.complex was removed in recent NumPy; use the builtin complex
y = _istft(S_complex * angles, hparams)
for i in range(hparams.griffin_lim_iters):
angles = np.exp(1j * np.angle(_stft(y, hparams)))
y = _istft(S_complex * angles, hparams)
return y
def _stft(y, hparams):
if hparams.use_lws:
return _lws_processor(hparams).stft(y).T
else:
return librosa.stft(y=y, n_fft=hparams.n_fft, hop_length=get_hop_size(hparams), win_length=hparams.win_size)
def _istft(y, hparams):
return librosa.istft(y, hop_length=get_hop_size(hparams), win_length=hparams.win_size)
##########################################################
#Those are only correct when using lws!!! (This was messing with Wavenet quality for a long time!)
def num_frames(length, fsize, fshift):
"""Compute number of time frames of spectrogram
"""
pad = (fsize - fshift)
if length % fshift == 0:
M = (length + pad * 2 - fsize) // fshift + 1
else:
M = (length + pad * 2 - fsize) // fshift + 2
return M
def pad_lr(x, fsize, fshift):
"""Compute left and right padding
"""
M = num_frames(len(x), fsize, fshift)
pad = (fsize - fshift)
T = len(x) + 2 * pad
r = (M - 1) * fshift + fsize - T
return pad, pad + r
##########################################################
#Librosa correct padding
def librosa_pad_lr(x, fsize, fshift):
return 0, (x.shape[0] // fshift + 1) * fshift - x.shape[0]
# Conversions
_mel_basis = None
_inv_mel_basis = None
def _linear_to_mel(spectogram, hparams):
global _mel_basis
if _mel_basis is None:
_mel_basis = _build_mel_basis(hparams)
return np.dot(_mel_basis, spectogram)
def _mel_to_linear(mel_spectrogram, hparams):
global _inv_mel_basis
if _inv_mel_basis is None:
_inv_mel_basis = np.linalg.pinv(_build_mel_basis(hparams))
return np.maximum(1e-10, np.dot(_inv_mel_basis, mel_spectrogram))
def _build_mel_basis(hparams):
assert hparams.fmax <= hparams.sample_rate // 2
return librosa.filters.mel(sr=hparams.sample_rate, n_fft=hparams.n_fft, n_mels=hparams.num_mels,
fmin=hparams.fmin, fmax=hparams.fmax)  # keyword arguments required by newer librosa versions
def _amp_to_db(x, hparams):
min_level = np.exp(hparams.min_level_db / 20 * np.log(10))
return 20 * np.log10(np.maximum(min_level, x))
def _db_to_amp(x):
return np.power(10.0, (x) * 0.05)
def _normalize(S, hparams):
if hparams.allow_clipping_in_normalization:
if hparams.symmetric_mels:
return np.clip((2 * hparams.max_abs_value) * ((S - hparams.min_level_db) / (-hparams.min_level_db)) - hparams.max_abs_value,
-hparams.max_abs_value, hparams.max_abs_value)
else:
return np.clip(hparams.max_abs_value * ((S - hparams.min_level_db) / (-hparams.min_level_db)), 0, hparams.max_abs_value)
assert S.max() <= 0 and S.min() - hparams.min_level_db >= 0
if hparams.symmetric_mels:
return (2 * hparams.max_abs_value) * ((S - hparams.min_level_db) / (-hparams.min_level_db)) - hparams.max_abs_value
else:
return hparams.max_abs_value * ((S - hparams.min_level_db) / (-hparams.min_level_db))
def _denormalize(D, hparams):
if hparams.allow_clipping_in_normalization:
if hparams.symmetric_mels:
return (((np.clip(D, -hparams.max_abs_value,
hparams.max_abs_value) + hparams.max_abs_value) * -hparams.min_level_db / (2 * hparams.max_abs_value))
+ hparams.min_level_db)
else:
return ((np.clip(D, 0, hparams.max_abs_value) * -hparams.min_level_db / hparams.max_abs_value) + hparams.min_level_db)
if hparams.symmetric_mels:
return (((D + hparams.max_abs_value) * -hparams.min_level_db / (2 * hparams.max_abs_value)) + hparams.min_level_db)
else:
return ((D * -hparams.min_level_db / hparams.max_abs_value) + hparams.min_level_db)
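(Illustrative sketch, not part of the committed file; the output path is a placeholder.) These helpers give a simple mel round trip: build a normalized mel spectrogram from a wav and invert it back to audio with Griffin-Lim:

# mel round-trip sketch -- illustrative only
from synthesizer import audio
from synthesizer.hparams import syn_hparams

wav = audio.load_wav("samples/4294-9934-0000.flac", sr=syn_hparams.sample_rate)
mel = audio.melspectrogram(wav, syn_hparams)            # shape (num_mels, frames), normalized dB
wav_hat = audio.inv_mel_spectrogram(mel, syn_hparams)   # Griffin-Lim reconstruction
audio.save_wav(wav_hat, "out_audios/roundtrip.wav", sr=syn_hparams.sample_rate)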

95
synthesizer/hparams.py Normal file
View File

@@ -0,0 +1,95 @@
import ast
import pprint
class HParams(object):
def __init__(self, **kwargs): self.__dict__.update(kwargs)
def __setitem__(self, key, value): setattr(self, key, value)
def __getitem__(self, key): return getattr(self, key)
def __repr__(self): return pprint.pformat(self.__dict__)
def parse(self, string):
# Overrides hparams from a comma-separated string of name=value pairs
if len(string) > 0:
overrides = [s.split("=") for s in string.split(",")]
keys, values = zip(*overrides)
keys = list(map(str.strip, keys))
values = list(map(str.strip, values))
for k in keys:
self.__dict__[k] = ast.literal_eval(values[keys.index(k)])
return self
syn_hparams = HParams(
### Signal Processing (used in both synthesizer and vocoder)
sample_rate = 16000,
n_fft = 800,
num_mels = 80,
hop_size = 200, # Tacotron uses 12.5 ms frame shift (set to sample_rate * 0.0125)
win_size = 800, # Tacotron uses 50 ms frame length (set to sample_rate * 0.050)
fmin = 55,
min_level_db = -100,
ref_level_db = 20,
max_abs_value = 4., # Gradient explodes if too big, premature convergence if too small.
preemphasis = 0.97, # Filter coefficient to use if preemphasize is True
preemphasize = True,
### Tacotron Text-to-Speech (TTS)
tts_embed_dims = 512, # Embedding dimension for the graphemes/phoneme inputs
tts_encoder_dims = 256,
tts_decoder_dims = 128,
tts_postnet_dims = 512,
tts_encoder_K = 5,
tts_lstm_dims = 1024,
tts_postnet_K = 5,
tts_num_highways = 4,
tts_dropout = 0.5,
tts_cleaner_names = ["english_cleaners"],
tts_start_threshold = -1.2,
tts_stop_threshold = -1.2, # Value below which audio generation ends.
# For example, for a range of [-4, 4], this
# will terminate the sequence at the first
# frame that has all values < -3.4
### Tacotron Training
tts_schedule = [(2, 1e-3, 40_000, 12), # Progressive training schedule
(2, 5e-4, 80_000, 12), # (r, lr, step, batch_size)
(2, 2e-4, 160_000, 12), #
(2, 1e-4, 320_000, 64), # r = reduction factor (# of mel frames
(2, 3e-5, 640_000, 64), # synthesized for each decoder iteration)
(2, 1e-5, 1280_000, 64),
(2, 5e-6, 2560_000, 64),
(2, 1e-6, 5120_000, 64)],
# lr = learning rate
tts_clip_grad_norm = 1.0, # clips the gradient norm to prevent explosion - set to None if not needed
tts_eval_interval = 100, # Number of steps between model evaluation (sample generation)
# Set to -1 to generate after completing epoch, or 0 to disable
tts_eval_num_samples = 1, # Makes this number of samples
### Data Preprocessing
max_mel_frames = 900,
rescale = True,
rescaling_max = 0.9,
synthesis_batch_size = 16, # For vocoder preprocessing and inference.
### Mel Visualization and Griffin-Lim
signal_normalization = True,
power = 1.5,
griffin_lim_iters = 60,
### Audio processing options
fmax = 7600, # Should not exceed (sample_rate // 2)
allow_clipping_in_normalization = True, # Used when signal_normalization = True
clip_mels_length = True, # If true, discards samples exceeding max_mel_frames
use_lws = False, # "Fast spectrogram phase recovery using local weighted sums"
symmetric_mels = True, # Sets mel range to [-max_abs_value, max_abs_value] if True,
# and [0, max_abs_value] if False
### SV2TTS
speaker_embedding_size = 256, # Dimension for the speaker embedding
silence_min_duration_split = 0.4, # Duration in seconds of a silence for an utterance to be split
utterance_min_duration = 1, # Duration in seconds below which utterances are discarded
)
def hparams_debug_string():
return str(syn_hparams)
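(Illustrative sketch.) parse() applies ast.literal_eval to each name=value pair in a comma-separated string, so individual hyperparameters can be overridden at run time without editing this file:

# hparams override sketch -- illustrative only
from synthesizer.hparams import syn_hparams

syn_hparams.parse("tts_dropout=0.4,griffin_lim_iters=30,rescale=False")
print(syn_hparams.tts_dropout, syn_hparams.griffin_lim_iters, syn_hparams.rescale)  # 0.4 30 False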

174
synthesizer/inference.py Normal file
View File

@@ -0,0 +1,174 @@
import torch
from synthesizer import audio
from synthesizer.hparams import syn_hparams
from synthesizer.models.tacotron import Tacotron
from synthesizer.utils.symbols import symbols
from synthesizer.utils.text import text_to_sequence
from vocoder.display import simple_table
from pathlib import Path
from typing import Union, List
import numpy as np
import librosa
class Synthesizer_infer:
sample_rate = syn_hparams.sample_rate
hparams = syn_hparams
def __init__(self, model_fpath: Path, verbose=True):
"""
The model isn't instantiated and loaded in memory until needed or until load() is called.
:param model_fpath: path to the trained model file
:param verbose: if False, prints less information when using the model
"""
self.model_fpath = model_fpath
self.verbose = verbose
# Check for GPU
if torch.cuda.is_available():
self.device = torch.device("cuda")
else:
self.device = torch.device("cpu")
if self.verbose:
print("Synthesizer using device:", self.device)
# Tacotron model will be instantiated later on first use.
self._model = None
def is_loaded(self):
"""
Whether the model is loaded in memory.
"""
return self._model is not None
def load(self):
"""
Instantiates and loads the model given the weights file that was passed in the constructor.
"""
self._model = Tacotron(embed_dims=syn_hparams.tts_embed_dims,
num_chars=len(symbols),
encoder_dims=syn_hparams.tts_encoder_dims,
decoder_dims=syn_hparams.tts_decoder_dims,
n_mels=syn_hparams.num_mels,
fft_bins=syn_hparams.num_mels,
postnet_dims=syn_hparams.tts_postnet_dims,
encoder_K=syn_hparams.tts_encoder_K,
lstm_dims=syn_hparams.tts_lstm_dims,
postnet_K=syn_hparams.tts_postnet_K,
num_highways=syn_hparams.tts_num_highways,
dropout=syn_hparams.tts_dropout,
stop_threshold=syn_hparams.tts_stop_threshold,
speaker_embedding_size=syn_hparams.speaker_embedding_size).to(self.device)
self._model.load(self.model_fpath)
self._model.eval()
if self.verbose:
print("Loaded synthesizer \"%s\" trained to step %d" % (self.model_fpath.name, self._model.state_dict()["step"]))
def synthesize_spectrograms(self, texts: List[str],
embeddings: Union[np.ndarray, List[np.ndarray]],
require_visualization=False):
"""
Synthesizes mel spectrograms from texts and speaker embeddings.
:param texts: a list of N text prompts to be synthesized
:param embeddings: a numpy array or list of speaker embeddings of shape (N, 256)
:param require_visualization: if True, a matrix representing the alignments between the
characters
and each decoder output step will be returned for each spectrogram
:return: a list of N melspectrograms as numpy arrays of shape (80, Mi), where Mi is the
sequence length of spectrogram i, and possibly the alignments.
"""
# Load the model on the first request.
if not self.is_loaded():
self.load()
# Preprocess text inputs
inputs = [text_to_sequence(text.strip()) for text in texts]
if not isinstance(embeddings, list):
embeddings = [embeddings]
# Batch inputs
batched_inputs = [inputs[i:i+syn_hparams.synthesis_batch_size]
for i in range(0, len(inputs), syn_hparams.synthesis_batch_size)]
batched_embeds = [embeddings[i:i+syn_hparams.synthesis_batch_size]
for i in range(0, len(embeddings), syn_hparams.synthesis_batch_size)]
specs = []
for i, batch in enumerate(batched_inputs, 1):
if self.verbose:
print(f"\n| Generating {i}/{len(batched_inputs)}")
# Pad texts so they are all the same length
text_lens = [len(text) for text in batch]
max_text_len = max(text_lens)
chars = [pad1d(text, max_text_len) for text in batch]
chars = np.stack(chars)
# Stack speaker embeddings into 2D array for batch processing
speaker_embeds = np.stack(batched_embeds[i-1])
# Convert to tensor
chars = torch.tensor(chars).long().to(self.device)
speaker_embeddings = torch.tensor(speaker_embeds).float().to(self.device)
# Inference
_, mels, alignments, stop_tokens = self._model.generate(chars, speaker_embeddings)
mels = mels.detach().cpu().numpy()
alignments = alignments.detach().cpu().numpy()
stop_tokens = stop_tokens.detach().cpu().numpy()
for m in mels:
# Trim silence from end of each spectrogram
while np.max(m[:, -1]) < syn_hparams.tts_stop_threshold:
if m.shape[-1] == 1:
break
m = m[:, :-1]
# Trim silence from start of each spectrogram
while np.max(m[:, 0]) < syn_hparams.tts_start_threshold:
if m.shape[-1] == 1:
break
m = m[:, 1:]
specs.append(m)
if self.verbose:
print("\n\nDone.\n")
return (specs, alignments, stop_tokens) if require_visualization else specs
@staticmethod
def load_preprocess_wav(fpath):
"""
Loads and preprocesses an audio file under the same conditions the audio files were used to
train the synthesizer.
"""
wav = librosa.load(str(fpath), sr=syn_hparams.sample_rate)[0]
if syn_hparams.rescale:
wav = wav / np.abs(wav).max() * syn_hparams.rescaling_max
return wav
@staticmethod
def make_spectrogram(fpath_or_wav: Union[str, Path, np.ndarray]):
"""
Creates a mel spectrogram from an audio file in the same manner as the mel spectrograms that
were fed to the synthesizer when training.
"""
if isinstance(fpath_or_wav, str) or isinstance(fpath_or_wav, Path):
wav = Synthesizer_infer.load_preprocess_wav(fpath_or_wav)
else:
wav = fpath_or_wav
mel_spectrogram = audio.melspectrogram(wav, syn_hparams).astype(np.float32)
return mel_spectrogram
@staticmethod
def griffin_lim(mel):
"""
Inverts a mel spectrogram using Griffin-Lim. The mel spectrogram is expected to have been built
with the same parameters present in hparams.py.
"""
return audio.inv_mel_spectrogram(mel, syn_hparams)
def pad1d(x, max_len, pad_value=0):
return np.pad(x, (0, max_len - len(x)), mode="constant", constant_values=pad_value)
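(Illustrative end-to-end sketch, not part of the committed file; the checkpoint path and the random 256-dim embedding are placeholders, a real embedding would come from the speaker encoder.)

# synthesizer inference sketch -- illustrative only
from pathlib import Path
import numpy as np
from synthesizer.inference import Synthesizer_infer

syn = Synthesizer_infer(Path("saved_models/synthesizer.pt"))   # hypothetical checkpoint path
embed = np.random.rand(256).astype(np.float32)                 # placeholder for an encoder speaker embedding
texts = ["Life was like a box of chocolates, you never know what you're gonna get."]

specs = syn.synthesize_spectrograms(texts, [embed])            # list of (80, Mi) mel spectrograms
wav = Synthesizer_infer.griffin_lim(specs[0])                  # quick Griffin-Lim preview (no vocoder)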

525
synthesizer/models/tacotron.py Normal file
View File

@@ -0,0 +1,525 @@
import os
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from pathlib import Path
from typing import Union
class HighwayNetwork(nn.Module):
def __init__(self, size):
super().__init__()
self.W1 = nn.Linear(size, size)
self.W2 = nn.Linear(size, size)
self.W1.bias.data.fill_(0.)
def forward(self, x):
x1 = self.W1(x)
x2 = self.W2(x)
g = torch.sigmoid(x2)
y = g * F.relu(x1) + (1. - g) * x
return y
class Encoder(nn.Module):
def __init__(self, embed_dims, num_chars, encoder_dims, K, num_highways, dropout):
super().__init__()
prenet_dims = (encoder_dims, encoder_dims)
cbhg_channels = encoder_dims
self.embedding = nn.Embedding(num_chars, embed_dims)
self.pre_net = PreNet(embed_dims, fc1_dims=prenet_dims[0], fc2_dims=prenet_dims[1],
dropout=dropout)
self.cbhg = CBHG(K=K, in_channels=cbhg_channels, channels=cbhg_channels,
proj_channels=[cbhg_channels, cbhg_channels],
num_highways=num_highways)
def forward(self, x, speaker_embedding=None):
x = self.embedding(x)
x = self.pre_net(x)
x.transpose_(1, 2)
x = self.cbhg(x)
if speaker_embedding is not None:
x = self.add_speaker_embedding(x, speaker_embedding)
return x
def add_speaker_embedding(self, x, speaker_embedding):
# SV2TTS
# The input x is the encoder output and is a 3D tensor with size (batch_size, num_chars, tts_embed_dims)
# When training, speaker_embedding is also a 2D tensor with size (batch_size, speaker_embedding_size)
# (for inference, speaker_embedding is a 1D tensor with size (speaker_embedding_size))
# This concats the speaker embedding for each char in the encoder output
# Save the dimensions as human-readable names
batch_size = x.size()[0]
num_chars = x.size()[1]
if speaker_embedding.dim() == 1:
idx = 0
else:
idx = 1
# Start by making a copy of each speaker embedding to match the input text length
# The output of this has size (batch_size, num_chars * tts_embed_dims)
speaker_embedding_size = speaker_embedding.size()[idx]
e = speaker_embedding.repeat_interleave(num_chars, dim=idx)
# Reshape it and transpose
e = e.reshape(batch_size, speaker_embedding_size, num_chars)
e = e.transpose(1, 2)
# Concatenate the tiled speaker embedding with the encoder output
x = torch.cat((x, e), 2)
return x
class BatchNormConv(nn.Module):
def __init__(self, in_channels, out_channels, kernel, relu=True):
super().__init__()
self.conv = nn.Conv1d(in_channels, out_channels, kernel, stride=1, padding=kernel // 2, bias=False)
self.bnorm = nn.BatchNorm1d(out_channels)
self.relu = relu
def forward(self, x):
x = self.conv(x)
x = F.relu(x) if self.relu is True else x
return self.bnorm(x)
class CBHG(nn.Module):
def __init__(self, K, in_channels, channels, proj_channels, num_highways):
super().__init__()
# List of all rnns to call `flatten_parameters()` on
self._to_flatten = []
self.bank_kernels = [i for i in range(1, K + 1)]
self.conv1d_bank = nn.ModuleList()
for k in self.bank_kernels:
conv = BatchNormConv(in_channels, channels, k)
self.conv1d_bank.append(conv)
self.maxpool = nn.MaxPool1d(kernel_size=2, stride=1, padding=1)
self.conv_project1 = BatchNormConv(len(self.bank_kernels) * channels, proj_channels[0], 3)
self.conv_project2 = BatchNormConv(proj_channels[0], proj_channels[1], 3, relu=False)
# Fix the highway input if necessary
if proj_channels[-1] != channels:
self.highway_mismatch = True
self.pre_highway = nn.Linear(proj_channels[-1], channels, bias=False)
else:
self.highway_mismatch = False
self.highways = nn.ModuleList()
for i in range(num_highways):
hn = HighwayNetwork(channels)
self.highways.append(hn)
self.rnn = nn.GRU(channels, channels // 2, batch_first=True, bidirectional=True)
self._to_flatten.append(self.rnn)
# Avoid fragmentation of RNN parameters and associated warning
self._flatten_parameters()
def forward(self, x):
# Although we `_flatten_parameters()` on init, when using DataParallel
# the model gets replicated, making it no longer guaranteed that the
# weights are contiguous in GPU memory. Hence, we must call it again
self._flatten_parameters()
# Save these for later
residual = x
seq_len = x.size(-1)
conv_bank = []
# Convolution Bank
for conv in self.conv1d_bank:
c = conv(x) # Convolution
conv_bank.append(c[:, :, :seq_len])
# Stack along the channel axis
conv_bank = torch.cat(conv_bank, dim=1)
# dump the last padding to fit residual
x = self.maxpool(conv_bank)[:, :, :seq_len]
# Conv1d projections
x = self.conv_project1(x)
x = self.conv_project2(x)
# Residual Connect
x = x + residual
# Through the highways
x = x.transpose(1, 2)
if self.highway_mismatch is True:
x = self.pre_highway(x)
for h in self.highways: x = h(x)
# And then the RNN
x, _ = self.rnn(x)
return x
def _flatten_parameters(self):
"""Calls `flatten_parameters` on all the rnns used by the WaveRNN. Used
to improve efficiency and avoid PyTorch yelling at us."""
[m.flatten_parameters() for m in self._to_flatten]
class PreNet(nn.Module):
def __init__(self, in_dims, fc1_dims=256, fc2_dims=128, dropout=0.5):
super().__init__()
self.fc1 = nn.Linear(in_dims, fc1_dims)
self.fc2 = nn.Linear(fc1_dims, fc2_dims)
self.p = dropout
def forward(self, x):
x = self.fc1(x)
x = F.relu(x)
x = F.dropout(x, self.p, self.training)
x = self.fc2(x)
x = F.relu(x)
x = F.dropout(x, self.p, self.training)
return x
class Attention(nn.Module):
def __init__(self, attn_dims):
super().__init__()
self.W = nn.Linear(attn_dims, attn_dims, bias=False)
self.v = nn.Linear(attn_dims, 1, bias=False)
def forward(self, encoder_seq_proj, query, t):
# print(encoder_seq_proj.shape)
# Transform the query vector
query_proj = self.W(query).unsqueeze(1)
# Compute the scores
u = self.v(torch.tanh(encoder_seq_proj + query_proj))
scores = F.softmax(u, dim=1)
return scores.transpose(1, 2)
class LSA(nn.Module):
def __init__(self, attn_dim, kernel_size=31, filters=32):
super().__init__()
self.conv = nn.Conv1d(1, filters, padding=(kernel_size - 1) // 2, kernel_size=kernel_size, bias=True)
self.L = nn.Linear(filters, attn_dim, bias=False)
self.W = nn.Linear(attn_dim, attn_dim, bias=True) # Include the attention bias in this term
self.v = nn.Linear(attn_dim, 1, bias=False)
self.cumulative = None
self.attention = None
def init_attention(self, encoder_seq_proj):
device = next(self.parameters()).device # use same device as parameters
b, t, c = encoder_seq_proj.size()
self.cumulative = torch.zeros(b, t, device=device)
self.attention = torch.zeros(b, t, device=device)
def forward(self, encoder_seq_proj, query, t, chars):
if t == 0: self.init_attention(encoder_seq_proj)
processed_query = self.W(query).unsqueeze(1)
location = self.cumulative.unsqueeze(1)
processed_loc = self.L(self.conv(location).transpose(1, 2))
u = self.v(torch.tanh(processed_query + encoder_seq_proj + processed_loc))
u = u.squeeze(-1)
# Mask zero padding chars
u = u * (chars != 0).float()
# Smooth Attention
# scores = torch.sigmoid(u) / torch.sigmoid(u).sum(dim=1, keepdim=True)
scores = F.softmax(u, dim=1)
self.attention = scores
self.cumulative = self.cumulative + self.attention
return scores.unsqueeze(-1).transpose(1, 2)
class Decoder(nn.Module):
# Class variable because its value doesn't change between instances,
# yet it ought to be scoped by the class because it's a property of a Decoder
max_r = 20
def __init__(self, n_mels, encoder_dims, decoder_dims, lstm_dims,
dropout, speaker_embedding_size):
super().__init__()
self.register_buffer("r", torch.tensor(1, dtype=torch.int))
self.n_mels = n_mels
prenet_dims = (decoder_dims * 2, decoder_dims * 2)
self.prenet = PreNet(n_mels, fc1_dims=prenet_dims[0], fc2_dims=prenet_dims[1],
dropout=dropout)
self.attn_net = LSA(decoder_dims)
self.attn_rnn = nn.GRUCell(encoder_dims + prenet_dims[1] + speaker_embedding_size, decoder_dims)
self.rnn_input = nn.Linear(encoder_dims + decoder_dims + speaker_embedding_size, lstm_dims)
self.res_rnn1 = nn.LSTMCell(lstm_dims, lstm_dims)
self.res_rnn2 = nn.LSTMCell(lstm_dims, lstm_dims)
self.mel_proj = nn.Linear(lstm_dims, n_mels * self.max_r, bias=False)
self.stop_proj = nn.Linear(encoder_dims + speaker_embedding_size + lstm_dims, 1)
def zoneout(self, prev, current, p=0.1):
device = next(self.parameters()).device # Use same device as parameters
mask = torch.zeros(prev.size(), device=device).bernoulli_(p)
return prev * mask + current * (1 - mask)
def forward(self, encoder_seq, encoder_seq_proj, prenet_in,
hidden_states, cell_states, context_vec, t, chars):
# Need this for reshaping mels
batch_size = encoder_seq.size(0)
# Unpack the hidden and cell states
attn_hidden, rnn1_hidden, rnn2_hidden = hidden_states
rnn1_cell, rnn2_cell = cell_states
# PreNet for the Attention RNN
prenet_out = self.prenet(prenet_in)
# Compute the Attention RNN hidden state
attn_rnn_in = torch.cat([context_vec, prenet_out], dim=-1)
attn_hidden = self.attn_rnn(attn_rnn_in.squeeze(1), attn_hidden)
# Compute the attention scores
scores = self.attn_net(encoder_seq_proj, attn_hidden, t, chars)
# Dot product to create the context vector
context_vec = scores @ encoder_seq
context_vec = context_vec.squeeze(1)
# Concat Attention RNN output w. Context Vector & project
x = torch.cat([context_vec, attn_hidden], dim=1)
x = self.rnn_input(x)
# Compute first Residual RNN
rnn1_hidden_next, rnn1_cell = self.res_rnn1(x, (rnn1_hidden, rnn1_cell))
if self.training:
rnn1_hidden = self.zoneout(rnn1_hidden, rnn1_hidden_next)
else:
rnn1_hidden = rnn1_hidden_next
x = x + rnn1_hidden
# Compute second Residual RNN
rnn2_hidden_next, rnn2_cell = self.res_rnn2(x, (rnn2_hidden, rnn2_cell))
if self.training:
rnn2_hidden = self.zoneout(rnn2_hidden, rnn2_hidden_next)
else:
rnn2_hidden = rnn2_hidden_next
x = x + rnn2_hidden
# Project Mels
mels = self.mel_proj(x)
mels = mels.view(batch_size, self.n_mels, self.max_r)[:, :, :self.r]
hidden_states = (attn_hidden, rnn1_hidden, rnn2_hidden)
cell_states = (rnn1_cell, rnn2_cell)
# Stop token prediction
s = torch.cat((x, context_vec), dim=1)
s = self.stop_proj(s)
stop_tokens = torch.sigmoid(s)
return mels, scores, hidden_states, cell_states, context_vec, stop_tokens
class Tacotron(nn.Module):
def __init__(self, embed_dims, num_chars, encoder_dims, decoder_dims, n_mels,
fft_bins, postnet_dims, encoder_K, lstm_dims, postnet_K, num_highways,
dropout, stop_threshold, speaker_embedding_size):
super().__init__()
self.n_mels = n_mels
self.lstm_dims = lstm_dims
self.encoder_dims = encoder_dims
self.decoder_dims = decoder_dims
self.speaker_embedding_size = speaker_embedding_size
self.encoder = Encoder(embed_dims, num_chars, encoder_dims,
encoder_K, num_highways, dropout)
self.encoder_proj = nn.Linear(encoder_dims + speaker_embedding_size, decoder_dims, bias=False)
self.decoder = Decoder(n_mels, encoder_dims, decoder_dims, lstm_dims,
dropout, speaker_embedding_size)
self.postnet = CBHG(postnet_K, n_mels, postnet_dims,
[postnet_dims, fft_bins], num_highways)
self.post_proj = nn.Linear(postnet_dims, fft_bins, bias=False)
self.init_model()
self.num_params()
self.register_buffer("step", torch.zeros(1, dtype=torch.long))
self.register_buffer("stop_threshold", torch.tensor(stop_threshold, dtype=torch.float32))
@property
def r(self):
return self.decoder.r.item()
@r.setter
def r(self, value):
self.decoder.r = self.decoder.r.new_tensor(value, requires_grad=False)
def forward(self, x, m, speaker_embedding):
device = next(self.parameters()).device # use same device as parameters
self.step += 1
batch_size, _, steps = m.size()
# Initialise all hidden states and pack into tuple
attn_hidden = torch.zeros(batch_size, self.decoder_dims, device=device)
rnn1_hidden = torch.zeros(batch_size, self.lstm_dims, device=device)
rnn2_hidden = torch.zeros(batch_size, self.lstm_dims, device=device)
hidden_states = (attn_hidden, rnn1_hidden, rnn2_hidden)
# Initialise all lstm cell states and pack into tuple
rnn1_cell = torch.zeros(batch_size, self.lstm_dims, device=device)
rnn2_cell = torch.zeros(batch_size, self.lstm_dims, device=device)
cell_states = (rnn1_cell, rnn2_cell)
# <GO> Frame for start of decoder loop
go_frame = torch.zeros(batch_size, self.n_mels, device=device)
# Need an initial context vector
context_vec = torch.zeros(batch_size, self.encoder_dims + self.speaker_embedding_size, device=device)
# SV2TTS: Run the encoder with the speaker embedding
# The projection avoids unnecessary matmuls in the decoder loop
encoder_seq = self.encoder(x, speaker_embedding)
encoder_seq_proj = self.encoder_proj(encoder_seq)
# Need a couple of lists for outputs
mel_outputs, attn_scores, stop_outputs = [], [], []
# Run the decoder loop
for t in range(0, steps, self.r):
prenet_in = m[:, :, t - 1] if t > 0 else go_frame
mel_frames, scores, hidden_states, cell_states, context_vec, stop_tokens = \
self.decoder(encoder_seq, encoder_seq_proj, prenet_in,
hidden_states, cell_states, context_vec, t, x)
mel_outputs.append(mel_frames)
attn_scores.append(scores)
stop_outputs.extend([stop_tokens] * self.r)
# Concat the mel outputs into sequence
mel_outputs = torch.cat(mel_outputs, dim=2)
# Post-Process for Linear Spectrograms
postnet_out = self.postnet(mel_outputs)
linear = self.post_proj(postnet_out)
linear = linear.transpose(1, 2)
# For easy visualisation
attn_scores = torch.cat(attn_scores, 1)
# attn_scores = attn_scores.cpu().data.numpy()
stop_outputs = torch.cat(stop_outputs, 1)
return mel_outputs, linear, attn_scores, stop_outputs
def generate(self, x, speaker_embedding=None, steps=2000):
self.eval()
device = next(self.parameters()).device # use same device as parameters
batch_size, _ = x.size()
# Need to initialise all hidden states and pack into tuple for tidiness
attn_hidden = torch.zeros(batch_size, self.decoder_dims, device=device)
rnn1_hidden = torch.zeros(batch_size, self.lstm_dims, device=device)
rnn2_hidden = torch.zeros(batch_size, self.lstm_dims, device=device)
hidden_states = (attn_hidden, rnn1_hidden, rnn2_hidden)
# Need to initialise all lstm cell states and pack into tuple for tidiness
rnn1_cell = torch.zeros(batch_size, self.lstm_dims, device=device)
rnn2_cell = torch.zeros(batch_size, self.lstm_dims, device=device)
cell_states = (rnn1_cell, rnn2_cell)
# Need a <GO> Frame for start of decoder loop
go_frame = torch.zeros(batch_size, self.n_mels, device=device)
# Need an initial context vector
context_vec = torch.zeros(batch_size, self.encoder_dims + self.speaker_embedding_size, device=device)
# SV2TTS: Run the encoder with the speaker embedding
# The projection avoids unnecessary matmuls in the decoder loop
encoder_seq = self.encoder(x, speaker_embedding)
encoder_seq_proj = self.encoder_proj(encoder_seq)
# Need a couple of lists for outputs
mel_outputs, attn_scores, stop_outputs = [], [], []
# Run the decoder loop
for t in range(0, steps, self.r):
prenet_in = mel_outputs[-1][:, :, -1] if t > 0 else go_frame
mel_frames, scores, hidden_states, cell_states, context_vec, stop_tokens = \
self.decoder(encoder_seq, encoder_seq_proj, prenet_in,
hidden_states, cell_states, context_vec, t, x)
mel_outputs.append(mel_frames)
attn_scores.append(scores)
stop_outputs.extend([stop_tokens] * self.r)
if t == 0:
first_stop_token = stop_tokens
# Stop the loop once all stop tokens in the batch exceed the threshold and the sequence has reached a minimum length
# if torch.gt(stop_tokens, first_stop_token*10).all() and t > (1 * self.r):
# break
if (stop_tokens > 0.01).all() and t > (20 * self.r): break
if torch.cuda.is_available():
torch.cuda.empty_cache()
# Concat the mel outputs into sequence
mel_outputs = torch.cat(mel_outputs, dim=2)
# Post-Process for Linear Spectrograms
postnet_out = self.postnet(mel_outputs)
linear = self.post_proj(postnet_out)
linear = linear.transpose(1, 2)
# For easy visualisation
attn_scores = torch.cat(attn_scores, 1)
stop_outputs = torch.cat(stop_outputs, 1)
self.train()
return mel_outputs, linear, attn_scores, stop_outputs
def init_model(self):
for p in self.parameters():
if p.dim() > 1: nn.init.xavier_uniform_(p)
def get_step(self):
return self.step.data.item()
def reset_step(self):
# assignment to parameters or buffers is overloaded, updates internal dict entry
self.step = self.step.data.new_tensor(1)
def log(self, path, msg):
with open(path, "a") as f:
print(msg, file=f)
def load(self, path, optimizer=None):
# Use device of model params as location for loaded state
device = "cpu"
checkpoint = torch.load(str(path), map_location=device)
self.load_state_dict(checkpoint["model_state"])
if "optimizer_state" in checkpoint and optimizer is not None:
optimizer.load_state_dict(checkpoint["optimizer_state"])
def save(self, path, optimizer=None):
if optimizer is not None:
torch.save({
"model_state": self.state_dict(),
"optimizer_state": optimizer.state_dict(),
}, str(path))
else:
torch.save({
"model_state": self.state_dict(),
}, str(path))
def num_params(self, print_out=True):
parameters = filter(lambda p: p.requires_grad, self.parameters())
parameters = sum([np.prod(p.size()) for p in parameters]) / 1_000_000
if print_out:
print("Trainable Parameters: %.3fM" % parameters)
return parameters

402
synthesizer/preprocess.py Normal file
View File

@@ -0,0 +1,402 @@
from multiprocessing.pool import Pool
from synthesizer import audio
from functools import partial
from itertools import chain, groupby
from encoder import inference as encoder_infer
from pathlib import Path
from utils import logmmse
from tqdm import tqdm
import numpy as np
import librosa
import random
def preprocess_librispeech(datasets_root: Path, out_dir: Path, n_processes: int, skip_existing: bool, hparams,
datasets_name: str, subfolders: str, no_alignments=False):
# Gather the input directories of LibriSpeech
dataset_root = datasets_root.joinpath(datasets_name)
input_dirs = [dataset_root.joinpath(subfolder.strip()) for subfolder in subfolders.split(",")]
print("\n ".join(map(str, ["Using data from:"] + input_dirs)))
assert all(input_dir.exists() for input_dir in input_dirs)
train_input_dirs = input_dirs[: -1]
dev_input_dirs = input_dirs[-1: ]
# Create the output directories for each output file type
train_out_dir = out_dir.joinpath("train")
train_out_dir.mkdir(exist_ok=True)
train_out_dir.joinpath("mels").mkdir(exist_ok=True)
train_out_dir.joinpath("audio").mkdir(exist_ok=True)
# Create a metadata file
train_metadata_fpath = train_out_dir.joinpath("train.txt")
train_metadata_file = train_metadata_fpath.open("a" if skip_existing else "w", encoding="utf-8")
dev_out_dir = out_dir.joinpath("dev")
dev_out_dir.mkdir(exist_ok=True)
dev_out_dir.joinpath("mels").mkdir(exist_ok=True)
dev_out_dir.joinpath("audio").mkdir(exist_ok=True)
# Create a metadata file
dev_metadata_fpath = dev_out_dir.joinpath("dev.txt")
dev_metadata_file = dev_metadata_fpath.open("a" if skip_existing else "w", encoding="utf-8")
# Preprocess the train dataset
train_speaker_dirs = list(chain.from_iterable(train_input_dir.glob("*") for train_input_dir in train_input_dirs))
func = partial(preprocess_speaker, out_dir=train_out_dir, skip_existing=skip_existing,
hparams=hparams, no_alignments=no_alignments)
job = Pool(n_processes).imap(func, train_speaker_dirs)
for speaker_metadata in tqdm(job, datasets_name, len(train_speaker_dirs), unit="speakers"):
for metadatum in speaker_metadata:
train_metadata_file.write("|".join(str(x) for x in metadatum) + "\n")
train_metadata_file.close()
# Verify the contents of the metadata file
with train_metadata_fpath.open("r", encoding="utf-8") as train_metadata_file:
metadata = [line.split("|") for line in train_metadata_file]
mel_frames = sum([int(m[4]) for m in metadata])
timesteps = sum([int(m[3]) for m in metadata])
sample_rate = hparams.sample_rate
hours = (timesteps / sample_rate) / 3600
print("The train dataset consists of %d utterances, %d mel frames, %d audio timesteps (%.2f hours)." %
(len(metadata), mel_frames, timesteps, hours))
print("Max input length (text chars): %d" % max(len(m[5]) for m in metadata))
print("Max mel frames length: %d" % max(int(m[4]) for m in metadata))
print("Max audio timesteps length: %d" % max(int(m[3]) for m in metadata))
# Preprocess the dev dataset
dev_speaker_dirs = list(chain.from_iterable(dev_input_dir.glob("*") for dev_input_dir in dev_input_dirs))
func = partial(preprocess_speaker, out_dir=dev_out_dir, skip_existing=skip_existing,
hparams=hparams, no_alignments=no_alignments)
job = Pool(n_processes).imap(func, dev_speaker_dirs)
for speaker_metadata in tqdm(job, datasets_name, len(dev_speaker_dirs), unit="speakers"):
for metadatum in speaker_metadata:
dev_metadata_file.write("|".join(str(x) for x in metadatum) + "\n")
dev_metadata_file.close()
# Verify the contents of the metadata file
with dev_metadata_fpath.open("r", encoding="utf-8") as dev_metadata_file:
metadata = [line.split("|") for line in dev_metadata_file]
mel_frames = sum([int(m[4]) for m in metadata])
timesteps = sum([int(m[3]) for m in metadata])
sample_rate = hparams.sample_rate
hours = (timesteps / sample_rate) / 3600
print("The dev dataset consists of %d utterances, %d mel frames, %d audio timesteps (%.2f hours)." %
(len(metadata), mel_frames, timesteps, hours))
print("Max input length (text chars): %d" % max(len(m[5]) for m in metadata))
print("Max mel frames length: %d" % max(int(m[4]) for m in metadata))
print("Max audio timesteps length: %d" % max(int(m[3]) for m in metadata))
def preprocess_vctk(datasets_root: Path, out_dir: Path, n_processes: int, skip_existing: bool, hparams,
datasets_name: str, subfolders: str, no_alignments=True):
# Gather the input directories of VCTK
dataset_root = datasets_root.joinpath(datasets_name)
input_dir = dataset_root.joinpath(subfolders)
print("Using data from:" + str(input_dir))
assert input_dir.exists()
paths = [*input_dir.rglob("*.flac")]
# train dev audio data split
train_input_fpaths = []
dev_input_fpaths = []
pairs = sorted([(p.parts[-2].split('_')[0], p) for p in paths])
del paths
for _, group in groupby(pairs, lambda pair: pair[0]):
paths = sorted([p for _, p in group if "mic1.flac" in str(p)]) # only get mic1 flac file
random.seed(0)
random.shuffle(paths)
n = round(len(paths) * 0.9)
train_input_fpaths.extend(paths[:n])
# dev dataset has the same speakers as train dataset
dev_input_fpaths.extend(paths[n:])
# Create the output directories for each output file type
train_out_dir = out_dir.joinpath("train")
train_out_dir.mkdir(exist_ok=True)
train_out_dir.joinpath("mels").mkdir(exist_ok=True)
train_out_dir.joinpath("audio").mkdir(exist_ok=True)
dev_out_dir = out_dir.joinpath("dev")
dev_out_dir.mkdir(exist_ok=True)
dev_out_dir.joinpath("mels").mkdir(exist_ok=True)
dev_out_dir.joinpath("audio").mkdir(exist_ok=True)
# Preprocess the train dataset
preprocess_data(train_input_fpaths, mode="train", out_dir=train_out_dir, skip_existing=skip_existing, hparams=hparams, no_alignments=no_alignments)
# Preprocess the dev dataset
preprocess_data(dev_input_fpaths, mode="dev", out_dir=dev_out_dir, skip_existing=skip_existing, hparams=hparams, no_alignments=no_alignments)
def preprocess_speaker(speaker_dir, out_dir: Path, skip_existing: bool, hparams, no_alignments: bool):
metadata = []
for book_dir in speaker_dir.glob("*"):
if no_alignments:
# Gather the utterance audios and texts
# LibriTTS uses .wav but we will include extensions for compatibility with other datasets
extensions = ["*.wav", "*.flac", "*.mp3"]
for extension in extensions:
wav_fpaths = book_dir.glob(extension)
for wav_fpath in wav_fpaths:
# Load the audio waveform
wav, _ = librosa.load(str(wav_fpath), sr=hparams.sample_rate)
if hparams.rescale:
wav = wav / np.abs(wav).max() * hparams.rescaling_max
# Get the corresponding text
# Check for .txt (for compatibility with other datasets)
text_fpath = wav_fpath.with_suffix(".txt")
if not text_fpath.exists():
# Check for .normalized.txt (LibriTTS)
text_fpath = wav_fpath.with_suffix(".normalized.txt")
assert text_fpath.exists()
with text_fpath.open("r") as text_file:
text = "".join([line for line in text_file])
text = text.replace("\"", "")
text = text.strip()
# Process the utterance
metadata.append(process_utterance(wav, text, out_dir, str(wav_fpath.with_suffix("").name),
skip_existing, hparams))
else:
# Process alignment file (LibriSpeech support)
# Gather the utterance audios and texts
try:
alignments_fpath = next(book_dir.glob("*.alignment.txt"))
with alignments_fpath.open("r") as alignments_file:
alignments = [line.rstrip().split(" ") for line in alignments_file]
except StopIteration:
# A few alignment files will be missing
continue
# Iterate over each entry in the alignments file
for wav_fname, words, end_times in alignments:
wav_fpath = book_dir.joinpath(wav_fname + ".flac")
assert wav_fpath.exists()
words = words.replace("\"", "").split(",")
end_times = list(map(float, end_times.replace("\"", "").split(",")))
# Process each sub-utterance
wavs, texts = split_on_silences(wav_fpath, words, end_times, hparams)
for i, (wav, text) in enumerate(zip(wavs, texts)):
sub_basename = "%s_%02d" % (wav_fname, i)
metadata.append(process_utterance(wav, text, out_dir, sub_basename,
skip_existing, hparams))
return [m for m in metadata if m is not None]
def preprocess_data(wav_fpaths, mode, out_dir: Path, skip_existing: bool, hparams, no_alignments: bool):
assert mode in ["train", "dev"]
# Create a metadata file
metadata_fpath = out_dir.joinpath(f"{mode}.txt")
metadata_file = metadata_fpath.open("a", encoding="utf-8")
if no_alignments:
for wav_fpath in tqdm(wav_fpaths, desc=mode):
# Load the audio waveform
wav, _ = librosa.load(str(wav_fpath), sr=hparams.sample_rate)
if hparams.rescale:
wav = wav / np.abs(wav).max() * hparams.rescaling_max
# Get the corresponding text
# Check for .txt (for compatibility with other datasets)
base_name = "_".join(wav_fpath.name.split(".")[0].split("_")[: -1]) + ".txt"
text_fpath = wav_fpath.with_name(base_name)
if not text_fpath.exists():
continue
with text_fpath.open("r") as text_file:
text = "".join([line for line in text_file])
text = text.replace("\"", "")
text = text.strip()
# Process the utterance
metadata = process_utterance(wav, text, out_dir, str(wav_fpath.with_suffix("").name), skip_existing, hparams, trim_silence=False)
if metadata is not None:
metadata_file.write("|".join(str(x) for x in metadata) + "\n")
metadata_file.close()
# Verify the contents of the metadata file
with metadata_fpath.open("r", encoding="utf-8") as metadata_file:
metadata = [line.split("|") for line in metadata_file]
mel_frames = sum([int(m[4]) for m in metadata])
timesteps = sum([int(m[3]) for m in metadata])
sample_rate = hparams.sample_rate
hours = (timesteps / sample_rate) / 3600
print(f"The {mode} dataset consists of %d utterances, %d mel frames, %d audio timesteps (%.2f hours)." %
(len(metadata), mel_frames, timesteps, hours))
print("Max input length (text chars): %d" % max(len(m[5]) for m in metadata))
print("Max mel frames length: %d" % max(int(m[4]) for m in metadata))
print("Max audio timesteps length: %d" % max(int(m[3]) for m in metadata))
def split_on_silences(wav_fpath, words, end_times, hparams):
# Load the audio waveform
wav, _ = librosa.load(str(wav_fpath), sr=hparams.sample_rate)
if hparams.rescale:
wav = wav / np.abs(wav).max() * hparams.rescaling_max
words = np.array(words)
start_times = np.array([0.0] + end_times[:-1])
end_times = np.array(end_times)
assert len(words) == len(end_times) == len(start_times)
assert words[0] == "" and words[-1] == ""
# Find pauses that are too long
mask = (words == "") & (end_times - start_times >= hparams.silence_min_duration_split)
mask[0] = mask[-1] = True
breaks = np.where(mask)[0]
# Profile the noise from the silences and perform noise reduction on the waveform
silence_times = [[start_times[i], end_times[i]] for i in breaks]
silence_times = (np.array(silence_times) * hparams.sample_rate).astype(np.int64)
noisy_wav = np.concatenate([wav[stime[0]:stime[1]] for stime in silence_times])
if len(noisy_wav) > hparams.sample_rate * 0.02:
profile = logmmse.profile_noise(noisy_wav, hparams.sample_rate)
wav = logmmse.denoise(wav, profile, eta=0)
# Re-attach segments that are too short
segments = list(zip(breaks[:-1], breaks[1:]))
segment_durations = [start_times[end] - end_times[start] for start, end in segments]
i = 0
while i < len(segments) and len(segments) > 1:
if segment_durations[i] < hparams.utterance_min_duration:
# See if the segment can be re-attached with the right or the left segment
left_duration = float("inf") if i == 0 else segment_durations[i - 1]
right_duration = float("inf") if i == len(segments) - 1 else segment_durations[i + 1]
joined_duration = segment_durations[i] + min(left_duration, right_duration)
# Do not re-attach if it causes the joined utterance to be too long
if joined_duration > hparams.hop_size * hparams.max_mel_frames / hparams.sample_rate:
i += 1
continue
# Re-attach the segment with the neighbour of shortest duration
j = i - 1 if left_duration <= right_duration else i
segments[j] = (segments[j][0], segments[j + 1][1])
segment_durations[j] = joined_duration
del segments[j + 1], segment_durations[j + 1]
else:
i += 1
# Split the utterance
segment_times = [[end_times[start], start_times[end]] for start, end in segments]
segment_times = (np.array(segment_times) * hparams.sample_rate).astype(np.int64)
wavs = [wav[segment_time[0]:segment_time[1]] for segment_time in segment_times]
texts = [" ".join(words[start + 1:end]).replace(" ", " ") for start, end in segments]
# # DEBUG: play the audio segments (run with -n=1)
# import sounddevice as sd
# if len(wavs) > 1:
# print("This sentence was split in %d segments:" % len(wavs))
# else:
# print("There are no silences long enough for this sentence to be split:")
# for wav, text in zip(wavs, texts):
# # Pad the waveform with 1 second of silence because sounddevice tends to cut them early
# # when playing them. You shouldn't need to do that in your parsers.
# wav = np.concatenate((wav, [0] * 16000))
# print("\t%s" % text)
# sd.play(wav, 16000, blocking=True)
# print("")
return wavs, texts
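# NOTE (illustrative sketch, not part of this file): a toy version of the pause mask above.
# The numbers are made up and 0.4 s stands in for hparams.silence_min_duration_split.
if __name__ == "__main__":
    toy_words = np.array(["", "hello", "", "world", ""])
    toy_end = np.array([0.5, 1.0, 1.6, 2.1, 2.4])
    toy_start = np.array([0.0, 0.5, 1.0, 1.6, 2.1])
    toy_mask = (toy_words == "") & (toy_end - toy_start >= 0.4)
    print(np.where(toy_mask)[0])  # [0 2]; split_on_silences then also forces the first and last entries to True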
def process_utterance(wav: np.ndarray, text: str, out_dir: Path, basename: str,
skip_existing: bool, hparams, trim_silence=True):
## FOR REFERENCE:
# For you not to lose your head if you ever wish to change things here or implement your own
# synthesizer.
# - Both the audios and the mel spectrograms are saved as numpy arrays
# - There is no processing done to the audios that will be saved to disk beyond volume
# normalization (in split_on_silences)
# - However, pre-emphasis is applied to the audios before computing the mel spectrogram. This
# is why we re-apply it on the audio on the side of the vocoder.
# - Librosa pads the waveform before computing the mel spectrogram. Here, the waveform is saved
# without extra padding. This means that you won't have an exact relation between the length
# of the wav and of the mel spectrogram. See the vocoder data loader.
# Skip existing utterances if needed
mel_fpath = out_dir.joinpath("mels", "mel-%s.npy" % basename)
wav_fpath = out_dir.joinpath("audio", "audio-%s.npy" % basename)
if skip_existing and mel_fpath.exists() and wav_fpath.exists():
return None
# Trim silence
wav = encoder_infer.preprocess_wav(wav, normalize=False, trim_silence=trim_silence)
# Skip utterances that are too short
if len(wav) < hparams.utterance_min_duration * hparams.sample_rate:
return None
# Compute the mel spectrogram
mel_spectrogram = audio.melspectrogram(wav, hparams).astype(np.float32)
mel_frames = mel_spectrogram.shape[1]
# Skip utterances that are too long
if mel_frames > hparams.max_mel_frames and hparams.clip_mels_length:
return None
# Write the spectrogram, embed and audio to disk
np.save(mel_fpath, mel_spectrogram.T, allow_pickle=False)
np.save(wav_fpath, wav, allow_pickle=False)
# Return a tuple describing this training example
return wav_fpath.name, mel_fpath.name, "embed-%s.npy" % basename, len(wav), mel_frames, text
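# NOTE (illustrative sketch, not part of this file): one way to read back a metadata line written
# from the tuple returned above. Field order: audio fname | mel fname | embed fname | n_timesteps |
# n_mel_frames | text. The helper name is hypothetical.
def parse_metadata_line(line: str):
    wav_fname, mel_fname, embed_fname, n_samples, n_frames, text = line.rstrip("\n").split("|")
    return wav_fname, mel_fname, embed_fname, int(n_samples), int(n_frames), text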
def embed_utterance(fpaths, encoder_model_fpath):
if not encoder_infer.is_loaded():
encoder_infer.load_model(encoder_model_fpath)
# Compute the speaker embedding of the utterance
wav_fpath, embed_fpath = fpaths
wav = np.load(wav_fpath)
wav = encoder_infer.preprocess_wav(wav)
embed = encoder_infer.embed_utterance(wav)
np.save(embed_fpath, embed, allow_pickle=False)
def create_embeddings(synthesizer_root: Path, encoder_model_fpath: Path, n_processes: int):
# create train embeddings
train_wav_dir = synthesizer_root.joinpath("train/audio")
train_metadata_fpath = synthesizer_root.joinpath("train/train.txt")
assert train_wav_dir.exists() and train_metadata_fpath.exists()
train_embed_dir = synthesizer_root.joinpath("train/embeds")
train_embed_dir.mkdir(exist_ok=True)
# Gather the input wave filepath and the target output embed filepath
with train_metadata_fpath.open("r") as metadata_file:
metadata = [line.split("|") for line in metadata_file]
fpaths = [(train_wav_dir.joinpath(m[0]), train_embed_dir.joinpath(m[2])) for m in metadata]
# TODO: improve on the multiprocessing, it's terrible. Disk I/O is the bottleneck here.
# Embed the utterances in separate processes
func = partial(embed_utterance, encoder_model_fpath=encoder_model_fpath)
job = Pool(n_processes).imap(func, fpaths)
list(tqdm(job, "Embedding", len(fpaths), unit="utterances"))
# create dev embeddings
dev_wav_dir = synthesizer_root.joinpath("dev/audio")
dev_metadata_fpath = synthesizer_root.joinpath("dev/dev.txt")
assert dev_wav_dir.exists() and dev_metadata_fpath.exists()
dev_embed_dir = synthesizer_root.joinpath("dev/embeds")
dev_embed_dir.mkdir(exist_ok=True)
# Gather the input wave filepath and the target output embed filepath
with dev_metadata_fpath.open("r") as metadata_file:
metadata = [line.split("|") for line in metadata_file]
fpaths = [(dev_wav_dir.joinpath(m[0]), dev_embed_dir.joinpath(m[2])) for m in metadata]
# TODO: improve on the multiprocessing, it's terrible. Disk I/O is the bottleneck here.
# Embed the utterances in separate processes
func = partial(embed_utterance, encoder_model_fpath=encoder_model_fpath)
job = Pool(n_processes).imap(func, fpaths)
list(tqdm(job, "Embedding", len(fpaths), unit="utterances"))

130
synthesizer/synthesize.py Normal file
View File

@@ -0,0 +1,130 @@
import platform
from functools import partial
from pathlib import Path
import numpy as np
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm
from synthesizer.hparams import hparams_debug_string
from synthesizer.models.tacotron import Tacotron
from synthesizer.synthesizer_dataset import SynthesizerDataset, collate_synthesizer
from synthesizer.utils import data_parallel_workaround
from synthesizer.utils.symbols import symbols
def run_synthesis(in_dir: Path, out_dir: Path, syn_model_fpath: Path, hparams):
# This generates ground truth-aligned mels for vocoder training
train_in_dir = in_dir.joinpath("train")
train_out_dir = out_dir.joinpath("train")
dev_in_dir = in_dir.joinpath("dev")
dev_out_dir = out_dir.joinpath("dev")
train_synth_dir = train_out_dir / "mels_gta"
train_synth_dir.mkdir(exist_ok=True, parents=True)
dev_synth_dir = dev_out_dir / "mels_gta"
dev_synth_dir.mkdir(exist_ok=True, parents=True)
print(hparams_debug_string())
# Check for GPU
if torch.cuda.is_available():
device = torch.device("cuda")
if hparams.synthesis_batch_size % torch.cuda.device_count() != 0:
raise ValueError("`hparams.synthesis_batch_size` must be evenly divisible by n_gpus!")
else:
device = torch.device("cpu")
print("Synthesizer using device:", device)
# Instantiate Tacotron model
model = Tacotron(embed_dims=hparams.tts_embed_dims,
num_chars=len(symbols),
encoder_dims=hparams.tts_encoder_dims,
decoder_dims=hparams.tts_decoder_dims,
n_mels=hparams.num_mels,
fft_bins=hparams.num_mels,
postnet_dims=hparams.tts_postnet_dims,
encoder_K=hparams.tts_encoder_K,
lstm_dims=hparams.tts_lstm_dims,
postnet_K=hparams.tts_postnet_K,
num_highways=hparams.tts_num_highways,
dropout=0., # Use zero dropout for gta mels
stop_threshold=hparams.tts_stop_threshold,
speaker_embedding_size=hparams.speaker_embedding_size).to(device)
# Load the weights
print("\nLoading weights at %s" % syn_model_fpath)
model.load(syn_model_fpath)
print("Tacotron weights loaded from step %d" % model.step)
# Synthesize using same reduction factor as the model is currently trained
r = np.int32(model.r)
# Set model to eval mode (disable gradient and zoneout)
model.eval()
# Initialize the dataset
train_metadata_fpath = train_in_dir.joinpath("train.txt")
train_mel_dir = train_in_dir.joinpath("mels")
train_embed_dir = train_in_dir.joinpath("embeds")
dev_metadata_fpath = dev_in_dir.joinpath("dev.txt")
dev_mel_dir = dev_in_dir.joinpath("mels")
dev_embed_dir = dev_in_dir.joinpath("embeds")
train_dataset = SynthesizerDataset(train_metadata_fpath, train_mel_dir, train_embed_dir, hparams)
dev_dataset = SynthesizerDataset(dev_metadata_fpath, dev_mel_dir, dev_embed_dir, hparams)
collate_fn = partial(collate_synthesizer, r=r, hparams=hparams)
train_data_loader = DataLoader(train_dataset, hparams.synthesis_batch_size, collate_fn=collate_fn, num_workers=2)
dev_data_loader = DataLoader(dev_dataset, hparams.synthesis_batch_size, collate_fn=collate_fn, num_workers=2)
# Generate train GTA mels
train_meta_out_fpath = train_out_dir / "synthesized.txt"
with train_meta_out_fpath.open("w") as file:
for i, (texts, mels, embeds, idx) in tqdm(enumerate(train_data_loader), total=len(train_data_loader)):
texts, mels, embeds = texts.to(device), mels.to(device), embeds.to(device)
# Parallelize model onto GPUS using workaround due to python bug
# if device.type == "cuda" and torch.cuda.device_count() > 1:
# _, mels_out, _ = data_parallel_workaround(model, texts, mels, embeds)
# else:
_, mels_out, _, _ = model(texts, mels, embeds)
for j, k in enumerate(idx):
# Note: outputs mel-spectrogram files and target ones have same names, just different folders
mel_filename = Path(train_synth_dir).joinpath(train_dataset.metadata[k][1])
mel_out = mels_out[j].detach().cpu().numpy().T
# Use the length of the ground truth mel to remove padding from the generated mels
mel_out = mel_out[:int(train_dataset.metadata[k][4])]
# Write the spectrogram to disk
np.save(mel_filename, mel_out, allow_pickle=False)
# Write metadata into the synthesized file
file.write("|".join(train_dataset.metadata[k]))
# Generate dev GTA mels
dev_meta_out_fpath = dev_out_dir / "synthesized.txt"
with dev_meta_out_fpath.open("w") as file:
for i, (texts, mels, embeds, idx) in tqdm(enumerate(dev_data_loader), total=len(dev_data_loader)):
texts, mels, embeds = texts.to(device), mels.to(device), embeds.to(device)
# Parallelize model onto GPUS using workaround due to python bug
# if device.type == "cuda" and torch.cuda.device_count() > 1:
# _, mels_out, _ = data_parallel_workaround(model, texts, mels, embeds)
# else:
_, mels_out, _, _ = model(texts, mels, embeds)
for j, k in enumerate(idx):
# Note: outputs mel-spectrogram files and target ones have same names, just different folders
mel_filename = Path(dev_synth_dir).joinpath(dev_dataset.metadata[k][1])
mel_out = mels_out[j].detach().cpu().numpy().T
# Use the length of the ground truth mel to remove padding from the generated mels
mel_out = mel_out[:int(dev_dataset.metadata[k][4])]
# Write the spectrogram to disk
np.save(mel_filename, mel_out, allow_pickle=False)
# Write metadata into the synthesized file
file.write("|".join(dev_dataset.metadata[k]))

92
synthesizer/synthesizer_dataset.py Normal file
View File

@@ -0,0 +1,92 @@
import torch
from torch.utils.data import Dataset
import numpy as np
from pathlib import Path
from synthesizer.utils.text import text_to_sequence
class SynthesizerDataset(Dataset):
def __init__(self, metadata_fpath: Path, mel_dir: Path, embed_dir: Path, hparams):
print("Using inputs from:\n\t%s\n\t%s\n\t%s" % (metadata_fpath, mel_dir, embed_dir))
with metadata_fpath.open("r") as metadata_file:
metadata = [line.split("|") for line in metadata_file]
mel_fnames = [x[1] for x in metadata if int(x[4])]
mel_fpaths = [mel_dir.joinpath(fname) for fname in mel_fnames]
embed_fnames = [x[2] for x in metadata if int(x[4])]
embed_fpaths = [embed_dir.joinpath(fname) for fname in embed_fnames]
self.samples_fpaths = list(zip(mel_fpaths, embed_fpaths))
self.samples_texts = [x[5].strip() for x in metadata if int(x[4])]
self.metadata = metadata
self.hparams = hparams
print("Found %d samples" % len(self.samples_fpaths))
def __getitem__(self, index):
# Sometimes index may be a list of 2 (not sure why this happens)
# If that is the case, return a single item corresponding to first element in index
if isinstance(index, list):
index = index[0]
mel_path, embed_path = self.samples_fpaths[index]
mel = np.load(mel_path).T.astype(np.float32)
# Load the embed
embed = np.load(embed_path)
# Get the text and clean it
text = text_to_sequence(self.samples_texts[index], self.hparams.tts_cleaner_names)
# Convert the list returned by text_to_sequence to a numpy array
text = np.asarray(text).astype(np.int32)
return text, mel.astype(np.float32), embed.astype(np.float32), index
def __len__(self):
return len(self.samples_fpaths)
def collate_synthesizer(batch, r, hparams):
# Text
x_lens = [len(x[0]) for x in batch]
max_x_len = max(x_lens)
chars = [pad1d(x[0], max_x_len) for x in batch]
chars = np.stack(chars)
# Mel spectrogram
spec_lens = [x[1].shape[-1] for x in batch]
max_spec_len = max(spec_lens) + 1
if max_spec_len % r != 0:
max_spec_len += r - max_spec_len % r
# WaveRNN mel spectrograms are normalized to [0, 1] so zero padding adds silence
# By default, SV2TTS uses symmetric mels, where -1*max_abs_value is silence.
if hparams.symmetric_mels:
mel_pad_value = -1 * hparams.max_abs_value
else:
mel_pad_value = 0
mel = [pad2d(x[1], max_spec_len, pad_value=mel_pad_value) for x in batch]
mel = np.stack(mel)
# Speaker embedding (SV2TTS)
embeds = np.array([x[2] for x in batch])
# Index (for vocoder preprocessing)
indices = [x[3] for x in batch]
# Convert all to tensor
chars = torch.tensor(chars).long()
mel = torch.tensor(mel)
embeds = torch.tensor(embeds)
return chars, mel, embeds, indices
def pad1d(x, max_len, pad_value=0):
return np.pad(x, (0, max_len - len(x)), mode="constant", constant_values=pad_value)
def pad2d(x, max_len, pad_value=0):
return np.pad(x, ((0, 0), (0, max_len - x.shape[-1])), mode="constant", constant_values=pad_value)
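# NOTE (illustrative sketch, not part of this file): quick demonstration of the padding helpers above.
if __name__ == "__main__":
    print(pad1d(np.array([3, 7, 2]), 5))                  # [3 7 2 0 0]
    print(pad2d(np.ones((2, 3)), 5, pad_value=-4).shape)  # (2, 5); the two new columns hold -4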

389
synthesizer/train.py Normal file
View File

@@ -0,0 +1,389 @@
from datetime import datetime
from functools import partial
from pathlib import Path
from os.path import exists
import os
import torch
import torch.nn.functional as F
from torch import optim
from torch.utils.data import DataLoader
from synthesizer import audio
from synthesizer.models.tacotron import Tacotron
from synthesizer.synthesizer_dataset import SynthesizerDataset, collate_synthesizer
from synthesizer.utils import ValueWindow, data_parallel_workaround
from synthesizer.utils.plot import plot_spectrogram
from synthesizer.utils.symbols import symbols
from synthesizer.utils.text import sequence_to_text
from vocoder.display import *
def np_now(x: torch.Tensor): return x.detach().cpu().numpy()
def time_string():
return datetime.now().strftime("%Y-%m-%d %H:%M")
def sync(device: torch.device):
# For correct profiling (cuda operations are async)
if device.type == "cuda":
torch.cuda.synchronize(device)
def train(run_id: str, syn_dir: Path, models_dir: Path, save_every: int, backup_every: int, force_restart: bool, use_tb: bool,
hparams):
if use_tb:
print("Use Tensorboard")
import tensorflow as tf
import datetime
# Create a timestamped TensorBoard log directory
log_dir = "log/vc/synthesizer/tensorboard/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
train_summary_writer = tf.summary.create_file_writer(log_dir)
models_dir.mkdir(exist_ok=True)
model_dir = models_dir.joinpath(run_id)
plot_dir = model_dir.joinpath("plots")
wav_dir = model_dir.joinpath("wavs")
mel_output_dir = model_dir.joinpath("mel-spectrograms")
meta_folder = model_dir.joinpath("metas")
model_dir.mkdir(exist_ok=True)
plot_dir.mkdir(exist_ok=True)
wav_dir.mkdir(exist_ok=True)
mel_output_dir.mkdir(exist_ok=True)
meta_folder.mkdir(exist_ok=True)
weights_fpath = model_dir / f"synthesizer.pt"
train_metadata_fpath = syn_dir.joinpath("train/train.txt")
dev_metadata_fpath = syn_dir.joinpath("dev/dev.txt")
print("Checkpoint path: {}".format(weights_fpath))
print("Loading training data from: {}".format(train_metadata_fpath))
print("Using model: Tacotron")
# Bookkeeping
time_window = ValueWindow(100)
loss_window = ValueWindow(100)
# From WaveRNN/train_tacotron.py
if torch.cuda.is_available():
device = torch.device("cuda")
for session in hparams.tts_schedule:
_, _, _, batch_size = session
if batch_size % torch.cuda.device_count() != 0:
raise ValueError("`batch_size` must be evenly divisible by n_gpus!")
else:
device = torch.device("cpu")
print("Using device:", device)
# Instantiate Tacotron Model
print("\nInitialising Tacotron Model...\n")
model = Tacotron(embed_dims=hparams.tts_embed_dims,
num_chars=len(symbols),
encoder_dims=hparams.tts_encoder_dims,
decoder_dims=hparams.tts_decoder_dims,
n_mels=hparams.num_mels,
fft_bins=hparams.num_mels,
postnet_dims=hparams.tts_postnet_dims,
encoder_K=hparams.tts_encoder_K,
lstm_dims=hparams.tts_lstm_dims,
postnet_K=hparams.tts_postnet_K,
num_highways=hparams.tts_num_highways,
dropout=hparams.tts_dropout,
stop_threshold=hparams.tts_stop_threshold,
speaker_embedding_size=hparams.speaker_embedding_size).to(device)
# Initialize the optimizer
optimizer = optim.Adam(model.parameters())
# train_loss_file_path = "synthesizer_loss/synthesizer_train_loss.npy"
# dev_loss_file_path = "synthesizer_loss/synthesizer_dev_loss.npy"
# if not exists("synthesizer_loss"):
# import os
# os.mkdir("synthesizer_loss")
# Load the weights
if force_restart or not weights_fpath.exists():
print("\nStarting the training of Tacotron from scratch\n")
model.save(weights_fpath)
# Embeddings metadata
char_embedding_fpath = meta_folder.joinpath("CharacterEmbeddings.tsv")
with open(char_embedding_fpath, "w", encoding="utf-8") as f:
for symbol in symbols:
if symbol == " ":
symbol = "\\s" # For visual purposes, swap space with \s
f.write("{}\n".format(symbol))
# losses = []
# dev_losses = []
else:
print("\nLoading weights at %s" % weights_fpath)
model.load(weights_fpath, optimizer)
print("Tacotron weights loaded from step %d" % model.step)
# losses = list(np.load(train_loss_file_path)) if exists(train_loss_file_path) else []
# dev_losses = list(np.load(dev_loss_file_path)) if exists(dev_loss_file_path) else []
# Initialize the dataset
train_mel_dir = syn_dir.joinpath("train/mels")
train_embed_dir = syn_dir.joinpath("train/embeds")
dev_mel_dir = syn_dir.joinpath("dev/mels")
dev_embed_dir = syn_dir.joinpath("dev/embeds")
train_dataset = SynthesizerDataset(train_metadata_fpath, train_mel_dir, train_embed_dir, hparams)
dev_dataset = SynthesizerDataset(dev_metadata_fpath, dev_mel_dir, dev_embed_dir, hparams)
best_loss_file_path = "synthesizer_loss/best_loss.npy"
best_loss = np.load(best_loss_file_path)[0] if exists(best_loss_file_path) else 1000
if not exists("synthesizer_loss"):
os.makedirs("synthesizer_loss")
# profiler = Profiler(summarize_every=10, disabled=False)
for i, session in enumerate(hparams.tts_schedule):
current_step = model.get_step()
r, lr, max_step, batch_size = session
training_steps = max_step - current_step
# Do we need to change to the next session?
if current_step >= max_step:
# Are there no further sessions than the current one?
if i == len(hparams.tts_schedule) - 1:
# We have completed training. Save the model and exit
model.save(weights_fpath, optimizer)
break
else:
# There is a following session, go to it
continue
model.r = r
# Begin the training
simple_table([(f"Steps with r={r}", str(training_steps // 1000) + "k Steps"),
("Batch Size", batch_size),
("Learning Rate", lr),
("Outputs/Step (r)", model.r)])
for p in optimizer.param_groups:
p["lr"] = lr
collate_fn = partial(collate_synthesizer, r=r, hparams=hparams)
train_dataloader = DataLoader(train_dataset, batch_size, shuffle=True, num_workers=4, collate_fn=collate_fn, pin_memory=True)
total_iters = len(train_dataset)
steps_per_epoch = np.ceil(total_iters / batch_size).astype(np.int32)
epochs = np.ceil(training_steps / steps_per_epoch).astype(np.int32)
for epoch in range(1, epochs+1):
for i, (texts, mels, embeds, idx) in enumerate(train_dataloader, 1):
start_time = time.time()
# profiler.tick("Blocking, waiting for batch (threaded)")
# Generate stop tokens for training
stop = torch.ones(mels.shape[0], mels.shape[2])
for j, k in enumerate(idx):
stop[j, :int(train_dataset.metadata[k][4])-1] = 0
texts = texts.to(device)
mels = mels.to(device)
embeds = embeds.to(device)
stop = stop.to(device)
# sync(device)
# profiler.tick("Data to %s" % device)
# Forward pass
# Parallelize model onto GPUS using workaround due to python bug
# if device.type == "cuda" and torch.cuda.device_count() > 1:
# m1_hat, m2_hat, attention, stop_pred = data_parallel_workaround(model, texts, mels, embeds)
# else:
m1_hat, m2_hat, attention, stop_pred = model(texts, mels, embeds)
# sync(device)
# profiler.tick("Forward pass")
# Backward pass
m1_loss = F.mse_loss(m1_hat, mels) + F.l1_loss(m1_hat, mels)
m2_loss = F.mse_loss(m2_hat, mels)
stop_loss = F.binary_cross_entropy(stop_pred, stop)
loss = m1_loss + m2_loss + stop_loss
# sync(device)
# profiler.tick("Loss")
optimizer.zero_grad()
loss.backward()
# profiler.tick("Backward pass")
if hparams.tts_clip_grad_norm is not None:
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), hparams.tts_clip_grad_norm)
if np.isnan(grad_norm.cpu()):
print("grad_norm was NaN!")
optimizer.step()
# profiler.tick("Parameter update")
time_window.append(time.time() - start_time)
loss_window.append(loss.item())
step = model.get_step()
k = step // 1000
msg = f"| Epoch: {epoch}/{epochs} ({i}/{steps_per_epoch}) | Train Loss: {loss_window.average:#.4} | " \
f"{1./time_window.average:#.2} steps/s | Step: {k}k | "
stream(msg)
if use_tb:
with train_summary_writer.as_default():
tf.summary.scalar('train_loss', loss_window.average, step=step)
tf.summary.scalar('learning_rate', lr, step=step)
# Backup or save model as appropriate
# if backup_every != 0 and step % backup_every == 0 :
# backup_fpath = weights_fpath.parent / f"synthesizer_{k:06d}.pt"
# model.save(backup_fpath, optimizer)
torch.cuda.empty_cache()
if save_every != 0 and i % save_every == 0:
dev_loss = validate(dev_dataset, model, collate_fn)
msg = f"\n| Epoch: {epoch}/{epochs} ({i}/{steps_per_epoch}) | Train Loss: {loss_window.average:#.4} | " \
f"Dev Loss: {dev_loss:#.4} | {1./time_window.average:#.2} steps/s | Step: {k}k | "
print(msg)
if use_tb:
with train_summary_writer.as_default():
tf.summary.scalar('val_loss', dev_loss, step=step)
# losses.append(loss_window.average)
# np.save(train_loss_file_path, np.array(losses, dtype=float))
# dev_losses.append(dev_loss)
# np.save(dev_loss_file_path, np.array(dev_losses, dtype=float))
# Must save latest optimizer state to ensure that resuming training
# doesn't produce artifacts
if dev_loss < best_loss:
best_loss = dev_loss
np.save(best_loss_file_path, np.array([best_loss]))
model.save(weights_fpath, optimizer)
# Evaluate model to generate dev samples
# epoch_eval = hparams.tts_eval_interval == -1 and i == steps_per_epoch # If epoch is done
# step_eval = hparams.tts_eval_interval > 0 and i % hparams.tts_eval_interval == 0 # Every N steps
# if step_eval:
# generate train samples
# for sample_idx in range(hparams.tts_eval_num_samples):
# # At most, generate samples equal to number in the batch
# if sample_idx + 1 <= len(texts):
# # Remove padding from mels using frame length in metadata
# mel_length = int(train_dataset.metadata[idx[sample_idx]][4])
# mel_prediction = np_now(m2_hat[sample_idx]).T[:mel_length]
# target_spectrogram = np_now(mels[sample_idx]).T[:mel_length]
# attention_len = mel_length // model.r
# eval_model(attention=np_now(attention[sample_idx][:, :attention_len]),
# mel_prediction=mel_prediction,
# target_spectrogram=target_spectrogram,
# input_seq=np_now(texts[sample_idx]),
# step=step,
# plot_dir=plot_dir,
# mel_output_dir=mel_output_dir,
# wav_dir=wav_dir,
# sample_num=sample_idx + 1,
# loss=loss,
# hparams=hparams,
# if_dev="train")
# generate dev samples
# for sample_idx in range(hparams.tts_eval_num_samples):
# # At most, generate samples equal to number in the batch
# if sample_idx + 1 <= len(dev_input_texts):
# # Remove padding from mels using frame length in metadata
# mel_length = int(dev_dataset.metadata[dev_idx[sample_idx]][4])
# dev_mel_prediction = np_now(dev_m2_hat[sample_idx]).T[:mel_length]
# target_spectrogram = np_now(dev_target_mels[sample_idx]).T[:mel_length]
# attention_len = mel_length // model.r
# eval_model(attention=np_now(dev_attention[sample_idx][:, :attention_len]),
# mel_prediction=dev_mel_prediction,
# target_spectrogram=target_spectrogram,
# input_seq=np_now(dev_input_texts[sample_idx]),
# step=step,
# plot_dir=plot_dir,
# mel_output_dir=mel_output_dir,
# wav_dir=wav_dir,
# sample_num=sample_idx + 1,
# loss=dev_loss,
# hparams=hparams,
# if_dev="dev")
# Break out of loop to update training schedule
if step >= max_step:
break
# Add line break after every epoch
print("")
def eval_model(attention, mel_prediction, target_spectrogram, input_seq, step,
plot_dir, mel_output_dir, wav_dir, sample_num, loss, hparams, if_dev = None):
# Save some results for evaluation
attention_path = str(plot_dir.joinpath("{}_attention_step_{}_sample_{}".format(if_dev, step, sample_num)))
save_attention_multiple(attention, attention_path)
# save predicted mel spectrogram to disk (debug)
mel_output_fpath = mel_output_dir.joinpath("{}-mel-prediction-step-{}_sample_{}.npy".format(if_dev, step, sample_num))
np.save(str(mel_output_fpath), mel_prediction, allow_pickle=False)
# save griffin lim inverted wav for debug (mel -> wav)
wav = audio.inv_mel_spectrogram(mel_prediction.T, hparams)
wav_fpath = wav_dir.joinpath("{}-step-{}-wave-from-mel_sample_{}.wav".format(if_dev, step, sample_num))
audio.save_wav(wav, str(wav_fpath), sr=hparams.sample_rate)
# save real and predicted mel-spectrogram plot to disk (control purposes)
spec_fpath = plot_dir.joinpath("{}-step-{}-mel-spectrogram_sample_{}.png".format(if_dev, step, sample_num))
title_str = "{}, {}, step={}, {} loss={:.5f}".format("Tacotron", time_string(), step, if_dev, loss)
plot_spectrogram(mel_prediction, str(spec_fpath), title=title_str,
target_spectrogram=target_spectrogram,
max_len=target_spectrogram.size // hparams.num_mels)
print("Input at step {}: {}".format(step, sequence_to_text(input_seq)))
def validate(dataset, model, collate_fn):
model.eval()
with torch.no_grad():
losses = []
dataloader = DataLoader(dataset, 32, num_workers=4, shuffle=False, collate_fn=collate_fn)
for i, (texts, mels, embeds, idx) in enumerate(dataloader, 1):
# Generate stop tokens for training
stop = torch.ones(mels.shape[0], mels.shape[2])
for j, k in enumerate(idx):
stop[j, :int(dataset.metadata[k][4])-1] = 0
texts = texts.cuda()
mels = mels.cuda()
embeds = embeds.cuda()
stop = stop.cuda()
# Forward pass
# Parallelize model onto GPUS using workaround due to python bug
# if device.type == "cuda" and torch.cuda.device_count() > 1:
# m1_hat, m2_hat, attention, stop_pred = data_parallel_workaround(model, texts, mels, embeds)
# else:
m1_hat, m2_hat, attention, stop_pred = model(texts, mels, embeds)
# Backward pass
m1_loss = F.mse_loss(m1_hat, mels) + F.l1_loss(m1_hat, mels)
m2_loss = F.mse_loss(m2_hat, mels)
stop_loss = F.binary_cross_entropy(stop_pred, stop)
loss = m1_loss + m2_loss + stop_loss
losses.append(loss.item())
model.train()
torch.cuda.empty_cache()
return sum(losses) / len(losses)
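# NOTE (illustrative sketch, not part of this file): a toy version of the stop-token targets built in
# train() and validate() above. Frames up to one before the true utterance end are 0, the rest 1.
if __name__ == "__main__":
    toy_stop = torch.ones(1, 6)   # one mel with 6 (padded) frames
    toy_stop[0, :4 - 1] = 0       # pretend the utterance really ends after 4 frames
    print(toy_stop)               # tensor([[0., 0., 0., 1., 1., 1.]])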

45
synthesizer/utils/__init__.py Normal file
View File

@@ -0,0 +1,45 @@
import torch
_output_ref = None
_replicas_ref = None
def data_parallel_workaround(model, *input):
global _output_ref
global _replicas_ref
device_ids = list(range(torch.cuda.device_count()))
output_device = device_ids[0]
replicas = torch.nn.parallel.replicate(model, device_ids)
# input.shape = (num_args, batch, ...)
inputs = torch.nn.parallel.scatter(input, device_ids)
# inputs.shape = (num_gpus, num_args, batch/num_gpus, ...)
replicas = replicas[:len(inputs)]
outputs = torch.nn.parallel.parallel_apply(replicas, inputs)
y_hat = torch.nn.parallel.gather(outputs, output_device)
_output_ref = outputs
_replicas_ref = replicas
return y_hat
class ValueWindow():
def __init__(self, window_size=100):
self._window_size = window_size
self._values = []
def append(self, x):
self._values = self._values[-(self._window_size - 1):] + [x]
@property
def sum(self):
return sum(self._values)
@property
def count(self):
return len(self._values)
@property
def average(self):
return self.sum / max(1, self.count)
def reset(self):
self._values = []
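# NOTE (illustrative sketch, not part of this file): ValueWindow keeps only the last window_size values,
# which is what the training loop uses for its running loss/time averages.
if __name__ == "__main__":
    window = ValueWindow(window_size=3)
    for value in (1.0, 2.0, 3.0, 4.0):
        window.append(value)
    print(window.count, window.average)  # 3 3.0 -- only the last three values are kept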

62
synthesizer/utils/cmudict.py Normal file
View File

@@ -0,0 +1,62 @@
import re
valid_symbols = [
"AA", "AA0", "AA1", "AA2", "AE", "AE0", "AE1", "AE2", "AH", "AH0", "AH1", "AH2",
"AO", "AO0", "AO1", "AO2", "AW", "AW0", "AW1", "AW2", "AY", "AY0", "AY1", "AY2",
"B", "CH", "D", "DH", "EH", "EH0", "EH1", "EH2", "ER", "ER0", "ER1", "ER2", "EY",
"EY0", "EY1", "EY2", "F", "G", "HH", "IH", "IH0", "IH1", "IH2", "IY", "IY0", "IY1",
"IY2", "JH", "K", "L", "M", "N", "NG", "OW", "OW0", "OW1", "OW2", "OY", "OY0",
"OY1", "OY2", "P", "R", "S", "SH", "T", "TH", "UH", "UH0", "UH1", "UH2", "UW",
"UW0", "UW1", "UW2", "V", "W", "Y", "Z", "ZH"
]
_valid_symbol_set = set(valid_symbols)
class CMUDict:
"""Thin wrapper around CMUDict data. http://www.speech.cs.cmu.edu/cgi-bin/cmudict"""
def __init__(self, file_or_path, keep_ambiguous=True):
if isinstance(file_or_path, str):
with open(file_or_path, encoding="latin-1") as f:
entries = _parse_cmudict(f)
else:
entries = _parse_cmudict(file_or_path)
if not keep_ambiguous:
entries = {word: pron for word, pron in entries.items() if len(pron) == 1}
self._entries = entries
def __len__(self):
return len(self._entries)
def lookup(self, word):
"""Returns list of ARPAbet pronunciations of the given word."""
return self._entries.get(word.upper())
_alt_re = re.compile(r"\([0-9]+\)")
def _parse_cmudict(file):
cmudict = {}
for line in file:
if len(line) and (line[0] >= "A" and line[0] <= "Z" or line[0] == "'"):
parts = line.split("  ")  # CMUdict separates the word and its pronunciation with two spaces
word = re.sub(_alt_re, "", parts[0])
pronunciation = _get_pronunciation(parts[1])
if pronunciation:
if word in cmudict:
cmudict[word].append(pronunciation)
else:
cmudict[word] = [pronunciation]
return cmudict
def _get_pronunciation(s):
parts = s.strip().split(" ")
for part in parts:
if part not in _valid_symbol_set:
return None
return " ".join(parts)

235
synthesizer/utils/cleaners.py Normal file
View File

@@ -0,0 +1,235 @@
"""
Cleaners are transformations that run over the input text at both training and eval time.
Cleaners can be selected by passing a comma-delimited list of cleaner names as the "cleaners"
hyperparameter. Some cleaners are English-specific. You'll typically want to use:
1. "english_cleaners" for English text
2. "transliteration_cleaners" for non-English text that can be transliterated to ASCII using
the Unidecode library (https://pypi.python.org/pypi/Unidecode)
3. "basic_cleaners" if you do not want to transliterate (in this case, you should also update
the symbols in symbols.py to match your data).
"""
import re
from unidecode import unidecode
from synthesizer.utils.numbers import normalize_numbers
# http://www.speech.cs.cmu.edu/cgi-bin/cmudict
_alphabet2pronunciation = {
'A': 'eiiy',
'B': 'bee',
'b': 'bee',
'C': 'see',
'c': 'see',
'D': 'dee',
'd': 'dee',
'E': 'eee',
'e': 'eee',
'F': 'efph',
'f': 'efph',
'G': 'jee',
'g': 'jee',
'H': 'eiich',
'h': 'eiich',
'I': 'eye',
'i': 'eye',
'J': 'jay',
'j': 'jay',
'K': 'kay',
'k': 'kay',
'L': 'ell',
'l': 'ell',
'M': 'emm',
'm': 'emm',
'N': 'enn',
'n': 'enn',
'O': 'oww',
'o': 'oww',
'P': 'pee',
'p': 'pee',
'Q': 'kyuw',
'q': 'kyuw',
'R': 'arr',
'r': 'arr',
'S': 'ess',
's': 'ess',
'T': 'tee',
't': 'tee',
'U': 'yyou',
'u': 'yyou',
'V': 'wee',
'v': 'wee',
'W': 'dablyu',
'w': 'dablyu',
'X': 'ecks',
'x': 'ecks',
'Y': 'why',
'y': 'why',
'Z': 'zee',
'z': 'zee'
}
_abbreviations_lowercase = ["lol", "pov", "tbh", "omg"]
# Regular expression matching whitespace:
_whitespace_regex = re.compile(r"\s+")
# Regular expressions for matching abbreviations:
_abbreviations_lowercase_regex = re.compile(rf"\b(?!')({'|'.join(_abbreviations_lowercase)})\b(?!')")
_abbreviations_capital_regex = re.compile(r"\b(?!')([A-Z0-9]*[A-Z][A-Z0-9]*)(?!')\b")
_abbreviations_capital_plural_regex = re.compile(r"\b(?!')([A-Z0-9]*[A-Z][A-Z0-9]*s)(?!')\b")
# List of (regular expression, replacement) pairs for abbreviations with ending '.':
_abbreviations_dot_tail_regex = [(re.compile(r"\b%s\." % x[0], re.IGNORECASE), x[1]) for x in [
("mrs", "misess"),
("mr", "mister"),
("dr", "doctor"),
("st", "saint"),
("co", "company"),
("jr", "junior"),
("maj", "major"),
("gen", "general"),
("drs", "doctors"),
("rev", "reverend"),
("lt", "lieutenant"),
("hon", "honorable"),
("sgt", "sergeant"),
("capt", "captain"),
("esq", "esquire"),
("ltd", "limited"),
("col", "colonel"),
("ft", "fort"),
]]
# List of (regular expression, replacement) pairs for special char abbreviation:
_abbreviations_special_char_regex = [(re.compile(r"%s" % x[0], re.IGNORECASE), x[1]) for x in [
("#(\w+)", r'\1.'), # split the hashtag word
("@", " at "),
('~', ' to '),
('&', ' and '),
('%', ' percent '),
('\+', ' plus '),
('-', ' ')]]
# convert words that do not pronounce properly
_words_convert_regex = [(re.compile(rf"\b{x[0]}\b", flags=re.IGNORECASE), x[1]) for x in [
("etc", "et cetera"),
("guy", "guuy"),
("guys", "gize")
]]
def replace_special_char(text):
# replace special characters
for regex, replacement in _abbreviations_special_char_regex:
text = re.sub(regex, replacement, text)
return text
def letter2pronunciation(text):
# uppercase some abbreviations that may not be uppercase
text = re.sub(_abbreviations_lowercase_regex, lambda match: match.group(1).upper() + '.', text)
def convert(match):
char_list = [*match]
if char_list[-1] == 's' and len(char_list) < 5:
for idx in range(len(char_list)):
if idx < len(char_list) - 1:
char_list[idx] = _alphabet2pronunciation.get(char_list[idx], char_list[idx])
else:
char_list[idx - 1] += char_list[idx]
return " ".join(char_list[:idx])
elif len(char_list) < 4:
char_list = map(lambda char: _alphabet2pronunciation.get(char, char), char_list)
return " ".join(char_list)
else: return "".join(char_list)
# split abbreviations consisting of one or more capital letters and zero or more numbers in single form to individual letters
# and convert the letters to pronunciation
text = re.sub(_abbreviations_capital_regex, lambda match: convert(match.group(1)), text)
# split abbreviations consisting of one or more capital letters and zero or more numbers in plural form to individual letters
# and convert the letters to pronunciation
text = re.sub(_abbreviations_capital_plural_regex, lambda match: convert(match.group(1)), text)
return text
def expand_abbreviations(text):
# expand abbreviations ending with dot
for regex, replacement in _abbreviations_dot_tail_regex:
text = re.sub(regex, replacement, text)
# expand other abbreviations
for regex, replacement in _words_convert_regex:
text = re.sub(regex, replacement, text)
return text
def expand_numbers(text):
return normalize_numbers(text)
def lowercase(text):
"""lowercase input tokens."""
return text.lower()
def collapse_whitespace(text):
return re.sub(_whitespace_regex, " ", text)
def convert_to_ascii(text):
return unidecode(text)
def split_conj(text):
wordtable=['at','on','in','during','for','before','after','since','until',
'between','under','above','below','by','beside','near','next to','outside','inside',
'behind','with','through']
a='\\b('+"|".join([' ' + i for i in wordtable])+')\\b'
b=re.sub(a,r". \1",text)
return b
def add_breaks(text):
text = re.sub(r"(\d{1,3}(,\d{3})+)\.?(\d+)?", lambda x: x.group(1).replace(",", "") + (("." + x.group(3)) if x.group(3) else ""), text) # remove comma in numbers
text = text.replace('-', ' ')
text = text.replace(',', '. ')
text = text.replace(';', '. ')
text = text.replace(':', '. ')
text = text.replace('!', '. ')
text = text.replace('?', '. ')
return text
def basic_cleaners(text):
"""Basic pipeline that lowercases and collapses whitespace without transliteration."""
text = lowercase(text)
text = collapse_whitespace(text)
return text
def transliteration_cleaners(text):
"""Pipeline for non-English text that transliterates to ASCII."""
text = convert_to_ascii(text)
text = lowercase(text)
text = collapse_whitespace(text)
return text
def english_cleaners_predict(text):
"""Pipeline for English text, including number and abbreviation expansion for prediction."""
text = convert_to_ascii(text)
text = replace_special_char(text)
text = expand_abbreviations(text)
text = letter2pronunciation(text)
text = lowercase(text)
text = expand_numbers(text)
# text = split_conj(text)
text = collapse_whitespace(text)
return text
def english_cleaners(text):
"""Pipeline for English text, including number and abbreviation expansion for training preprocessing."""
text = convert_to_ascii(text)
text = lowercase(text)
text = expand_numbers(text)
text = expand_abbreviations(text)
text = collapse_whitespace(text)
return text
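# NOTE (illustrative sketch, not part of this file): running the prediction-time pipeline on a made-up
# sentence. The exact output depends on the rules above, so none is asserted here.
if __name__ == "__main__":
    print(english_cleaners_predict("Dr. Smith bought 3 NFTs for $20 @ the cafe."))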

69
synthesizer/utils/numbers.py Normal file
View File

@@ -0,0 +1,69 @@
import re
import inflect
_inflect = inflect.engine()
_comma_number_re = re.compile(r"([0-9][0-9\,]+[0-9])")
_decimal_number_re = re.compile(r"([0-9]+\.[0-9]+)")
_pounds_re = re.compile(r"£([0-9\,]*[0-9]+)")
_dollars_re = re.compile(r"\$([0-9\.\,]*[0-9]+)")
_ordinal_re = re.compile(r"[0-9]+(st|nd|rd|th)")
_number_re = re.compile(r"[0-9]+")
def _remove_commas(m):
return m.group(1).replace(",", "")
def _expand_decimal_point(m):
return m.group(1).replace(".", " point ")
def _expand_dollars(m):
match = m.group(1)
parts = match.split(".")
if len(parts) > 2:
return match + " dollars" # Unexpected format
dollars = int(parts[0]) if parts[0] else 0
cents = int(parts[1]) if len(parts) > 1 and parts[1] else 0
if dollars and cents:
dollar_unit = "dollar" if dollars == 1 else "dollars"
cent_unit = "cent" if cents == 1 else "cents"
return "%s %s, %s %s" % (dollars, dollar_unit, cents, cent_unit)
elif dollars:
dollar_unit = "dollar" if dollars == 1 else "dollars"
return "%s %s" % (dollars, dollar_unit)
elif cents:
cent_unit = "cent" if cents == 1 else "cents"
return "%s %s" % (cents, cent_unit)
else:
return "zero dollars"
def _expand_ordinal(m):
return _inflect.number_to_words(m.group(0))
def _expand_number(m):
num = int(m.group(0))
if num > 1000 and num < 3000:
if num == 2000:
return " two thousand "
elif num > 2000 and num < 2010:
return " two thousand " + _inflect.number_to_words(num % 100) + " "
elif num % 100 == 0:
return " " + _inflect.number_to_words(num // 100) + " hundred "
else:
return " " + _inflect.number_to_words(num, andword="", zero="oh", group=2).replace(", ", " ") + " "
else:
return " " + _inflect.number_to_words(num, andword="") + " "
def normalize_numbers(text):
# text = re.sub(_comma_number_re, _remove_commas, text)
text = re.sub(_pounds_re, r"\1 pounds", text)
text = re.sub(_dollars_re, _expand_dollars, text)
text = re.sub(_decimal_number_re, _expand_decimal_point, text)
text = re.sub(_ordinal_re, _expand_ordinal, text)
text = re.sub(_number_re, _expand_number, text)
return text
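# NOTE (illustrative sketch, not part of this file): a few made-up inputs. Integers between 1000 and
# 3000 get the year-style readings coded above; everything else becomes a plain cardinal or ordinal.
if __name__ == "__main__":
    for sample in ("2000", "2005", "1984", "42", "13th", "$2.50"):
        print(sample, "->", normalize_numbers(sample))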

82
synthesizer/utils/plot.py Normal file
View File

@@ -0,0 +1,82 @@
import numpy as np
def split_title_line(title_text, max_words=5):
"""
Splits a title string into lines of at most `max_words` words each,
so that long plot titles remain readable.
"""
seq = title_text.split()
return "\n".join([" ".join(seq[i:i + max_words]) for i in range(0, len(seq), max_words)])
def plot_alignment(alignment, path, title=None, split_title=False, max_len=None):
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
if max_len is not None:
alignment = alignment[:, :max_len]
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111)
im = ax.imshow(
alignment,
aspect="auto",
origin="lower",
interpolation="none")
fig.colorbar(im, ax=ax)
xlabel = "Decoder timestep"
if split_title:
title = split_title_line(title)
plt.xlabel(xlabel)
plt.title(title)
plt.ylabel("Encoder timestep")
plt.tight_layout()
plt.savefig(path, format="png")
plt.close()
def plot_spectrogram(pred_spectrogram, path, title=None, split_title=False, target_spectrogram=None, max_len=None, auto_aspect=False):
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
if max_len is not None:
target_spectrogram = target_spectrogram[:max_len]
pred_spectrogram = pred_spectrogram[:max_len]
if split_title:
title = split_title_line(title)
fig = plt.figure(figsize=(10, 8))
# Set common labels
fig.text(0.5, 0.18, title, horizontalalignment="center", fontsize=16)
#target spectrogram subplot
if target_spectrogram is not None:
ax1 = fig.add_subplot(311)
ax2 = fig.add_subplot(312)
if auto_aspect:
im = ax1.imshow(np.rot90(target_spectrogram), aspect="auto", interpolation="none")
else:
im = ax1.imshow(np.rot90(target_spectrogram), interpolation="none")
ax1.set_title("Target Mel-Spectrogram")
fig.colorbar(mappable=im, shrink=0.65, orientation="horizontal", ax=ax1)
ax2.set_title("Predicted Mel-Spectrogram")
else:
ax2 = fig.add_subplot(211)
if auto_aspect:
im = ax2.imshow(np.rot90(pred_spectrogram), aspect="auto", interpolation="none")
else:
im = ax2.imshow(np.rot90(pred_spectrogram), interpolation="none")
fig.colorbar(mappable=im, shrink=0.65, orientation="horizontal", ax=ax2)
plt.tight_layout()
plt.savefig(path, format="png")
plt.close()

17
synthesizer/utils/symbols.py Normal file
View File

@@ -0,0 +1,17 @@
"""
Defines the set of symbols used in text input to the model.
The default is a set of ASCII characters that works well for English or text that has been run
through Unidecode. For other data, you can modify _characters. See TRAINING_DATA.md for details.
"""
# from . import cmudict
_pad = "_"
_eos = "~"
_characters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!\'\"(),-.:;? "
# Prepend "@" to ARPAbet symbols to ensure uniqueness (some are the same as uppercase letters):
#_arpabet = ["@" + s for s in cmudict.valid_symbols]
# Export all symbols:
symbols = [_pad, _eos] + list(_characters) #+ _arpabet

75
synthesizer/utils/text.py Normal file
View File

@@ -0,0 +1,75 @@
from synthesizer.utils.symbols import symbols
from synthesizer.utils import cleaners
import re
# Mappings from symbol to numeric ID and vice versa:
_symbol_to_id = {s: i for i, s in enumerate(symbols)}
_id_to_symbol = {i: s for i, s in enumerate(symbols)}
# Regular expression matching text enclosed in curly braces:
_curly_re = re.compile(r"(.*?)\{(.+?)\}(.*)")
def text_to_sequence(text, cleaner_names=[]):
"""Converts a string of text to a sequence of IDs corresponding to the symbols in the text.
The text can optionally have ARPAbet sequences enclosed in curly braces embedded
in it. For example, "Turn left on {HH AW1 S S T AH0 N} Street."
Args:
text: string to convert to a sequence
cleaner_names: names of the cleaner functions to run the text through
Returns:
List of integers corresponding to the symbols in the text
"""
sequence = []
# Check for curly braces and treat their contents as ARPAbet:
while len(text):
m = _curly_re.match(text)
if not m:
sequence += _symbols_to_sequence(_clean_text(text, cleaner_names))
break
sequence += _symbols_to_sequence(_clean_text(m.group(1), cleaner_names))
sequence += _arpabet_to_sequence(m.group(2))
text = m.group(3)
# Append EOS token
sequence.append(_symbol_to_id["~"])
return sequence
def sequence_to_text(sequence):
"""Converts a sequence of IDs back to a string"""
result = ""
for symbol_id in sequence:
if symbol_id in _id_to_symbol:
s = _id_to_symbol[symbol_id]
# Enclose ARPAbet back in curly braces:
if len(s) > 1 and s[0] == "@":
s = "{%s}" % s[1:]
result += s
return result.replace("}{", " ")
def _clean_text(text, cleaner_names):
for name in cleaner_names:
cleaner = getattr(cleaners, name)
if not cleaner:
raise Exception("Unknown cleaner: %s" % name)
text = cleaner(text)
return text
def _symbols_to_sequence(symbols):
return [_symbol_to_id[s] for s in symbols if _should_keep_symbol(s)]
def _arpabet_to_sequence(text):
return _symbols_to_sequence(["@" + s for s in text.split()])
def _should_keep_symbol(s):
return s in _symbol_to_id and s not in ("_", "~")
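# NOTE (illustrative sketch, not part of this file): a small round trip. Every sequence ends with the
# EOS id for "~", and sequence_to_text reproduces the cleaned string plus that trailing marker.
if __name__ == "__main__":
    seq = text_to_sequence("printing, then", ["basic_cleaners"])
    print(seq[-1] == _symbol_to_id["~"])  # True
    print(sequence_to_text(seq))          # printing, then~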

55
synthesizer_preprocess_audio.py Normal file
View File

@@ -0,0 +1,55 @@
from synthesizer.preprocess import preprocess_librispeech, preprocess_vctk
from synthesizer.hparams import syn_hparams
from utils.argutils import print_args
from pathlib import Path
import argparse
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Preprocesses audio files from datasets, encodes them as mel spectrograms "
"and writes them to the disk. Audio files are also saved, to be used by the "
"vocoder for training.",
formatter_class=argparse.ArgumentDefaultsHelpFormatter
)
parser.add_argument("datasets_root", type=Path, help=\
"Path to the directory containing your LibriSpeech/TTS datasets.")
parser.add_argument("-o", "--out_dir", type=Path, default=argparse.SUPPRESS, help=\
"Path to the output directory that will contain the mel spectrograms, the audios and the "
"embeds. Defaults to <datasets_root>/SV2TTS/synthesizer/")
parser.add_argument("-n", "--n_processes", type=int, default=4, help=\
"Number of processes in parallel.")
parser.add_argument("-s", "--skip_existing", action="store_true", help=\
"Whether to overwrite existing files with the same name. Useful if the preprocessing was "
"interrupted.")
parser.add_argument("--hparams", type=str, default="", help=\
"Hyperparameter overrides as a comma-separated list of name-value pairs")
parser.add_argument("--datasets_names", type=list, default=["LibriSpeech","VCTK"], help=\
"Name of the dataset directory to process.")
parser.add_argument("--all_subfolders", type=list, default=["train-clean-100,train-clean-360,dev-clean", "wav48_silence_trimmed"], help=\
"Comma-separated list of subfolders to process inside your dataset directory")
args = parser.parse_args()
# Process the arguments
if not hasattr(args, "out_dir"):
args.out_dir = args.datasets_root.joinpath("SV2TTS", "synthesizer")
# Create directories
assert args.datasets_root.exists()
args.out_dir.mkdir(exist_ok=True, parents=True)
# Preprocess the dataset
print_args(args, parser)
args.hparams = syn_hparams.parse(args.hparams)
preprocess_func = {
"LibriSpeech": preprocess_librispeech,
"VCTK": preprocess_vctk,
}
args = vars(args)
for i in range(len(args["datasets_names"])):
dataset = args["datasets_names"][i]
subfolders = args["all_subfolders"][i]
print("Preprocessing %s" % dataset)
preprocess_func[dataset](datasets_root=args["datasets_root"], out_dir=args["out_dir"], n_processes=args["n_processes"], skip_existing=args["skip_existing"], hparams=args["hparams"],
datasets_name=dataset, subfolders=subfolders)

25
synthesizer_preprocess_embeds.py Normal file
View File

@@ -0,0 +1,25 @@
from synthesizer.preprocess import create_embeddings
from utils.argutils import print_args
from pathlib import Path
import argparse
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Creates embeddings for the synthesizer from the LibriSpeech utterances.",
formatter_class=argparse.ArgumentDefaultsHelpFormatter
)
parser.add_argument("synthesizer_root", type=Path, help=\
"Path to the synthesizer training data that contains the audios and the train.txt file. "
"If you let everything as default, it should be <datasets_root>/SV2TTS/synthesizer/.")
parser.add_argument("-e", "--encoder_model_fpath", type=Path,
default="saved_models/default/encoder.pt", help=\
"Path your trained encoder model.")
parser.add_argument("-n", "--n_processes", type=int, default=4, help= \
"Number of parallel processes. An encoder is created for each, so you may need to lower "
"this value on GPUs with low memory. Set it to 1 if CUDA is unhappy.")
args = parser.parse_args()
# Preprocess the dataset
print_args(args, parser)
create_embeddings(**vars(args))

38
synthesizer_train.py Normal file
View File

@@ -0,0 +1,38 @@
from pathlib import Path
from synthesizer.hparams import syn_hparams
from synthesizer.train import train
from utils.argutils import print_args
import argparse
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("run_id", type=str, help= \
"Name for this model. By default, training outputs will be stored to saved_models/<run_id>/. If a model state "
"from the same run ID was previously saved, the training will restart from there. Pass -f to overwrite saved "
"states and restart from scratch.")
parser.add_argument("syn_dir", type=Path, help= \
"Path to the synthesizer directory that contains the ground truth mel spectrograms, "
"the wavs and the embeds.")
parser.add_argument("-m", "--models_dir", type=Path, default="saved_models", help=\
"Path to the output directory that will contain the saved model weights and the logs.")
parser.add_argument("-s", "--save_every", type=int, default=1000, help= \
"Number of steps between updates of the model on the disk. Set to 0 to never save the "
"model.")
parser.add_argument("-b", "--backup_every", type=int, default=25000, help= \
"Number of steps between backups of the model. Set to 0 to never make backups of the "
"model.")
parser.add_argument("-f", "--force_restart", action="store_true", help= \
"Do not load any saved model and restart from scratch.")
parser.add_argument("--use_tb", action="store_true", help= \
"Use Tensorboard support")
parser.add_argument("--hparams", default="", help=\
"Hyperparameter overrides as a comma-separated list of name=value pairs")
args = parser.parse_args()
print_args(args, parser)
args.hparams = syn_hparams.parse(args.hparams)
# Run the training
train(**vars(args))
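For reference, the --hparams string accepted above is handed to syn_hparams.parse as comma-separated name=value pairs; a minimal sketch of that call (the override name in the comment is purely illustrative):
from synthesizer.hparams import syn_hparams

hparams = syn_hparams.parse("")          # empty string keeps every default value
# hparams = syn_hparams.parse("tts_r=2") # hypothetical single override in name=value form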

406
toolbox/__init__.py Normal file
View File

@@ -0,0 +1,406 @@
import sys
import traceback
from pathlib import Path
from time import perf_counter as timer
import re
import numpy as np
import torch
import soundfile as sf
import librosa
import spacy
import encoder
from encoder import inference as encoder_infer
from synthesizer.inference import Synthesizer_infer
from synthesizer.utils.cleaners import add_breaks, english_cleaners_predict
from vocoder.display import save_attention_multiple, save_spectrogram, save_stop_tokens
from synthesizer.hparams import syn_hparams
from toolbox.ui import UI
from toolbox.utterance import Utterance
from vocoder import inference as vocoder
from speed_changer.fixSpeed import *
import time
# Use this directory structure for your datasets, or modify it to fit your needs
recognized_datasets = [
"LibriSpeech/dev-clean",
"LibriSpeech/dev-other",
"LibriSpeech/test-clean",
"LibriSpeech/test-other",
"LibriSpeech/train-clean-100",
"LibriSpeech/train-clean-360",
"LibriSpeech/train-other-500",
"LibriTTS/dev-clean",
"LibriTTS/dev-other",
"LibriTTS/test-clean",
"LibriTTS/test-other",
"LibriTTS/train-clean-100",
"LibriTTS/train-clean-360",
"LibriTTS/train-other-500",
"LJSpeech-1.1",
"VoxCeleb1/wav",
"VoxCeleb1/test_wav",
"VoxCeleb2/dev/aac",
"VoxCeleb2/test/aac",
"VCTK-Corpus/wav48",
]
# Maximum of generated wavs to keep on memory
MAX_WAVS = 15
class Toolbox:
def __init__(self, run_id: str, datasets_root: Path, models_dir: Path, seed: int=None):
sys.excepthook = self.excepthook
self.datasets_root = datasets_root
self.utterances = set()
self.current_generated = (None, None, None, None) # speaker_name, spec, breaks, wav
self.synthesizer = None # type: Synthesizer_infer
self.current_wav = None
self.waves_list = []
self.waves_count = 0
self.waves_namelist = []
self.start_generate_time = None
self.nlp = spacy.load('en_core_web_sm')
if not os.path.exists("toolbox_results"):
os.mkdir("toolbox_results")
# Check for webrtcvad (enables removal of silences in vocoder output)
try:
import webrtcvad
self.trim_silences = True
except ImportError:
self.trim_silences = False
# Initialize the events and the interface
self.ui = UI()
self.reset_ui(run_id, models_dir, seed)
self.setup_events()
self.ui.start()
def excepthook(self, exc_type, exc_value, exc_tb):
traceback.print_exception(exc_type, exc_value, exc_tb)
self.ui.log("Exception: %s" % exc_value)
def setup_events(self):
# Dataset, speaker and utterance selection
self.ui.browser_load_button.clicked.connect(lambda: self.load_from_browser())
random_func = lambda level: lambda: self.ui.populate_browser(self.datasets_root,
recognized_datasets,
level)
self.ui.random_dataset_button.clicked.connect(random_func(0))
self.ui.random_speaker_button.clicked.connect(random_func(1))
self.ui.random_utterance_button.clicked.connect(random_func(2))
self.ui.dataset_box.currentIndexChanged.connect(random_func(1))
self.ui.speaker_box.currentIndexChanged.connect(random_func(2))
# Model selection
self.ui.encoder_box.currentIndexChanged.connect(self.init_encoder)
def func():
self.synthesizer = None
self.ui.synthesizer_box.currentIndexChanged.connect(func)
self.ui.vocoder_box.currentIndexChanged.connect(self.init_vocoder)
# Utterance selection
func = lambda: self.load_from_browser(self.ui.browse_file())
self.ui.browser_browse_button.clicked.connect(func)
func = lambda: self.ui.draw_utterance(self.ui.selected_utterance, "current")
self.ui.utterance_history.currentIndexChanged.connect(func)
func = lambda: self.ui.play(self.ui.selected_utterance.wav, Synthesizer_infer.sample_rate)
self.ui.play_button.clicked.connect(func)
self.ui.stop_button.clicked.connect(self.ui.stop)
self.ui.record_button.clicked.connect(self.record)
#Audio
self.ui.setup_audio_devices(Synthesizer_infer.sample_rate)
#Wav playback & save
func = lambda: self.replay_last_wav()
self.ui.replay_wav_button.clicked.connect(func)
func = lambda: self.export_current_wave()
self.ui.export_wav_button.clicked.connect(func)
self.ui.waves_cb.currentIndexChanged.connect(self.set_current_wav)
# Generation
func = lambda: self.synthesize() or self.vocode()
self.ui.generate_button.clicked.connect(func)
self.ui.synthesize_button.clicked.connect(self.synthesize)
self.ui.vocode_button.clicked.connect(self.vocode)
self.ui.random_seed_checkbox.clicked.connect(self.update_seed_textbox)
# UMAP legend
self.ui.clear_button.clicked.connect(self.clear_utterances)
def set_current_wav(self, index):
self.current_wav = self.waves_list[index]
def export_current_wave(self):
self.ui.save_audio_file(self.current_wav, Synthesizer_infer.sample_rate)
def replay_last_wav(self):
self.ui.play(self.current_wav, Synthesizer_infer.sample_rate)
def reset_ui(self, run_id: str, models_dir: Path, seed: int=None):
self.ui.populate_browser(self.datasets_root, recognized_datasets, 0, True)
self.ui.populate_models(run_id, models_dir)
self.ui.populate_gen_options(seed, self.trim_silences)
def load_from_browser(self, fpath=None):
if fpath is None:
fpath = Path(self.datasets_root,
self.ui.current_dataset_name,
self.ui.current_speaker_name,
self.ui.current_utterance_name)
name = str(fpath.relative_to(self.datasets_root))
speaker_name = self.ui.current_dataset_name + '_' + self.ui.current_speaker_name
# Select the next utterance
if self.ui.auto_next_checkbox.isChecked():
self.ui.browser_select_next()
elif fpath == "":
return
else:
name = fpath.name
speaker_name = fpath.parent.name
# Get the wav from the disk. We take the wav with the vocoder/synthesizer format for
# playback, so as to have a fair comparison with the generated audio
wav = Synthesizer_infer.load_preprocess_wav(fpath)
self.ui.log("Loaded %s" % name)
self.add_real_utterance(wav, name, speaker_name)
def record(self):
wav = self.ui.record_one(encoder_infer.sampling_rate, 5)
if wav is None:
return
self.ui.play(wav, encoder_infer.sampling_rate)
speaker_name = "user01"
name = speaker_name + "_rec_%05d" % np.random.randint(100000)
self.add_real_utterance(wav, name, speaker_name)
def add_real_utterance(self, wav, name, speaker_name):
# Compute the mel spectrogram
spec = Synthesizer_infer.make_spectrogram(wav)
self.ui.draw_spec(spec, "current")
path_ori = os.getcwd()
file_ori = 'temp.wav'
fpath = os.path.join(path_ori, file_ori)
sf.write(fpath, wav, samplerate=encoder.params_data.sampling_rate)
# adjust the speed
self.wav_ori_info = AudioAnalysis(path_ori, file_ori)
DelFile(path_ori, '.TextGrid')
os.remove(fpath)
# Compute the embedding
if not encoder_infer.is_loaded():
self.init_encoder()
encoder_wav = encoder_infer.preprocess_wav(wav)
embed, partial_embeds, _ = encoder_infer.embed_utterance(encoder_wav, return_partials=True)
embed[embed < encoder.params_data.set_zero_thres] = 0  # zero out near-silent (noise) embedding values
# Add the utterance
utterance = Utterance(name, speaker_name, wav, spec, embed, partial_embeds, False)
self.utterances.add(utterance)
self.ui.register_utterance(utterance)
# Plot it
self.ui.draw_embed(embed, name, "current")
self.ui.draw_umap_projections(self.utterances)
self.ui.wav_ori_fig.savefig(f"toolbox_results/{name}_info.png", dpi=500)
if len(self.utterances) >= self.ui.min_umap_points:
self.ui.umap_fig.savefig(f"toolbox_results/umap_{len(self.utterances)}.png", dpi=500)
def clear_utterances(self):
self.utterances.clear()
self.ui.draw_umap_projections(self.utterances)
def synthesize(self):
self.start_generate_time = time.time()
self.ui.log("Generating the mel spectrogram...")
self.ui.set_loading(1)
# Update the synthesizer random seed
if self.ui.random_seed_checkbox.isChecked():
seed = int(self.ui.seed_textbox.text())
self.ui.populate_gen_options(seed, self.trim_silences)
else:
seed = None
if seed is not None:
torch.manual_seed(seed)
# Synthesize the spectrogram
if self.synthesizer is None or seed is not None:
self.init_synthesizer()
embed = self.ui.selected_utterance.embed
def preprocess_text(text):
text = add_breaks(text)
text = english_cleaners_predict(text)
texts = [i.text.strip() for i in self.nlp(text).sents] # split paragraph to sentences
return texts
texts = preprocess_text(self.ui.text_prompt.toPlainText())
print(f"the list of inputs texts:\n{texts}")
embeds = [embed] * len(texts)
specs, alignments, stop_tokens = self.synthesizer.synthesize_spectrograms(texts, embeds, require_visualization=True)
breaks = [spec.shape[1] for spec in specs]
spec = np.concatenate(specs, axis=1)
save_attention_multiple(alignments, "toolbox_results/attention")
save_stop_tokens(stop_tokens, "toolbox_results/stop_tokens")
self.ui.draw_spec(spec, "generated")
self.current_generated = (self.ui.selected_utterance.speaker_name, spec, breaks, None)
self.ui.set_loading(0)
def vocode(self):
speaker_name, spec, breaks, _ = self.current_generated
assert spec is not None
# Initialize the vocoder model and make it deterministic, if the user provides a seed
if self.ui.random_seed_checkbox.isChecked():
seed = int(self.ui.seed_textbox.text())
self.ui.populate_gen_options(seed, self.trim_silences)
else:
seed = None
if seed is not None:
torch.manual_seed(seed)
# Synthesize the waveform
if not vocoder.is_loaded() or seed is not None:
self.init_vocoder()
def vocoder_progress(i, seq_len, b_size, gen_rate):
real_time_factor = (gen_rate / Synthesizer_infer.sample_rate) * 1000
line = "Waveform generation: %d/%d (batch size: %d, rate: %.1fkHz - %.2fx real time)" \
% (i * b_size, seq_len * b_size, b_size, gen_rate, real_time_factor)
self.ui.log(line, "overwrite")
self.ui.set_loading(i, seq_len)
if self.ui.current_vocoder_fpath is not None and not self.ui.griffin_lim_checkbox.isChecked():
self.ui.log("")
wav = vocoder.infer_waveform(spec, target=vocoder.hp.voc_target, overlap=vocoder.hp.voc_overlap, crossfade=vocoder.hp.is_crossfade, progress_callback=vocoder_progress)
else:
self.ui.log("Waveform generation with Griffin-Lim... ")
wav = Synthesizer_infer.griffin_lim(spec)
self.ui.set_loading(0)
self.ui.log(" Done!", "append")
self.ui.log(f"Generate time: {time.time() - self.start_generate_time}s")
# Add breaks
b_ends = np.cumsum(np.array(breaks) * Synthesizer_infer.hparams.hop_size)
b_starts = np.concatenate(([0], b_ends[:-1]))
wavs = [wav[start:end] for start, end, in zip(b_starts, b_ends)]
breaks = [np.zeros(int(0.15 * Synthesizer_infer.sample_rate))] * len(breaks)
wav = np.concatenate([i for w, b in zip(wavs, breaks) for i in (w, b)])
# Trim excessive silences
if self.ui.trim_silences_checkbox.isChecked():
wav = encoder_infer.preprocess_wav(wav)
path_ori = os.getcwd()
file_ori = 'temp.wav'
filename = os.path.join(path_ori, file_ori)
sf.write(filename, wav.astype(np.float32), syn_hparams.sample_rate)
self.ui.log("\nSaved output (haven't change speed) as %s\n\n" % filename)
# Fix Speed(generate new audio)
fix_file, speed_factor = work(*self.wav_ori_info, filename)
self.ui.log(f"\nSaved output (fixed speed) as {fix_file}\n\n")
wav, _ = librosa.load(fix_file, sr=syn_hparams.sample_rate)  # newer librosa requires sr as a keyword
os.remove(fix_file)
# Play it
wav = wav / np.abs(wav).max() * 4
self.ui.play(wav, Synthesizer_infer.sample_rate)
# Name it (history displayed in combobox)
# TODO better naming for the combobox items?
wav_name = str(self.waves_count + 1)
#Update waves combobox
self.waves_count += 1
if self.waves_count > MAX_WAVS:
self.waves_list.pop()
self.waves_namelist.pop()
self.waves_list.insert(0, wav)
self.waves_namelist.insert(0, wav_name)
self.ui.waves_cb.disconnect()
self.ui.waves_cb_model.setStringList(self.waves_namelist)
self.ui.waves_cb.setCurrentIndex(0)
self.ui.waves_cb.currentIndexChanged.connect(self.set_current_wav)
# Update current wav
self.set_current_wav(0)
#Enable replay and save buttons:
self.ui.replay_wav_button.setDisabled(False)
self.ui.export_wav_button.setDisabled(False)
# Compute the embedding
# TODO: this is problematic with different sampling rates, gotta fix it
if not encoder_infer.is_loaded():
self.init_encoder()
encoder_wav = encoder_infer.preprocess_wav(wav)
embed, partial_embeds, _ = encoder_infer.embed_utterance(encoder_wav, return_partials=True)
# Add the utterance
name = speaker_name + "_gen_%05d_" % np.random.randint(100000) + str(speed_factor)
utterance = Utterance(name, speaker_name, wav, spec, embed, partial_embeds, True)
self.utterances.add(utterance)
# Plot it
self.ui.draw_embed(embed, name, "generated")
self.ui.draw_umap_projections(self.utterances)
self.ui.wav_gen_fig.savefig(f"toolbox_results/{name}_info.png", dpi=500)
if len(self.utterances) >= self.ui.min_umap_points:
self.ui.umap_fig.savefig(f"toolbox_results/umap_{len(self.utterances)}.png", dpi=500)
def init_encoder(self):
model_fpath = self.ui.current_encoder_fpath
self.ui.log("Loading the encoder %s... " % model_fpath)
self.ui.set_loading(1)
start = timer()
encoder_infer.load_model(model_fpath)
self.ui.log("Done (%dms)." % int(1000 * (timer() - start)), "append")
self.ui.set_loading(0)
def init_synthesizer(self):
model_fpath = self.ui.current_synthesizer_fpath
self.ui.log("Loading the synthesizer %s... " % model_fpath)
self.ui.set_loading(1)
start = timer()
self.synthesizer = Synthesizer_infer(model_fpath)
self.ui.log("Done (%dms)." % int(1000 * (timer() - start)), "append")
self.ui.set_loading(0)
def init_vocoder(self):
model_fpath = self.ui.current_vocoder_fpath
# Case of Griffin-lim
if model_fpath is None:
return
self.ui.log("Loading the vocoder %s... " % model_fpath)
self.ui.set_loading(1)
start = timer()
vocoder.load_model(model_fpath)
self.ui.log("Done (%dms)." % int(1000 * (timer() - start)), "append")
self.ui.set_loading(0)
def update_seed_textbox(self):
self.ui.update_seed_textbox()
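A hedged sketch of how this class is typically launched; an entry script (presumably demo_toolbox.py) would do essentially this, and the paths below are placeholders. Instantiating Toolbox opens the Qt window and blocks until it is closed:
from pathlib import Path
from toolbox import Toolbox

Toolbox(run_id="default", datasets_root=Path("datasets"), models_dir=Path("saved_models"))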

611
toolbox/ui.py Normal file
View File

@@ -0,0 +1,611 @@
import sys
from pathlib import Path
from time import sleep
from typing import List, Set
from warnings import filterwarnings, warn
import matplotlib.pyplot as plt
import numpy as np
import sounddevice as sd
import soundfile as sf
import umap
from PyQt5.QtCore import Qt, QStringListModel
from PyQt5.QtWidgets import *
from matplotlib.backends.backend_qt5agg import FigureCanvasQTAgg as FigureCanvas
from encoder.inference import plot_embedding_as_heatmap
from toolbox.utterance import Utterance
filterwarnings("ignore")
colormap = np.array([
[0, 127, 70],
[255, 0, 0],
[255, 217, 38],
[0, 135, 255],
[165, 0, 165],
[255, 167, 255],
[97, 142, 151],
[0, 255, 255],
[255, 96, 38],
[142, 76, 0],
[33, 0, 127],
[0, 0, 0],
[183, 183, 183],
[76, 255, 0],
], dtype=float) / 255  # np.float was removed in NumPy >= 1.24
default_text = \
"We have to reduce the number of plastic bags."
class UI(QDialog):
min_umap_points = 4
max_log_lines = 5
max_saved_utterances = 20
def draw_utterance(self, utterance: Utterance, which):
self.draw_spec(utterance.spec, which)
self.draw_embed(utterance.embed, utterance.name, which)
def draw_embed(self, embed, name, which):
embed_ax, _ = self.current_ax if which == "current" else self.gen_ax
embed_ax.figure.suptitle("" if embed is None else name)
## Embedding
# Clear the plot
if len(embed_ax.images) > 0:
embed_ax.images[0].colorbar.remove()
embed_ax.clear()
# Draw the embed
if embed is not None:
plot_embedding_as_heatmap(embed, embed_ax)
embed_ax.set_title("embedding")
embed_ax.set_aspect("equal", "datalim")
embed_ax.set_xticks([])
embed_ax.set_yticks([])
embed_ax.figure.canvas.draw()
def draw_spec(self, spec, which):
_, spec_ax = self.current_ax if which == "current" else self.gen_ax
## Spectrogram
# Draw the spectrogram
spec_ax.clear()
if spec is not None:
spec_ax.imshow(spec, aspect="auto", interpolation="none")
spec_ax.set_title("mel spectrogram")
spec_ax.set_xticks([])
spec_ax.set_yticks([])
spec_ax.figure.canvas.draw()
if which != "current":
self.vocode_button.setDisabled(spec is None)
def draw_umap_projections(self, utterances: Set[Utterance]):
def umap_progress(i, seq_len):
self.set_loading(i, seq_len)
self.umap_ax.clear()
speakers = np.unique([u.speaker_name for u in utterances])
colors = {speaker_name: colormap[i] for i, speaker_name in enumerate(speakers)}
embeds = [u.embed for u in utterances]
# Display a message if there aren't enough points
if len(utterances) < self.min_umap_points:
self.umap_ax.text(.5, .5, "Add %d more points to\ngenerate the projections" %
(self.min_umap_points - len(utterances)),
horizontalalignment='center', fontsize=15)
self.umap_ax.set_title("")
# Compute the projections
else:
if not self.umap_hot:
self.log(
"Drawing UMAP projections for the first time, this will take a few seconds.")
self.umap_hot = True
reducer = umap.UMAP(int(np.ceil(np.sqrt(len(embeds)))), metric="cosine")
projections = reducer.fit_transform(embeds)
speakers_done = set()
i = 0
for projection, utterance in zip(projections, utterances):
i+=1
color = colors[utterance.speaker_name]
mark = "x" if "_gen_" in utterance.name else "o"
label = None if utterance.speaker_name in speakers_done else utterance.speaker_name
speakers_done.add(utterance.speaker_name)
self.umap_ax.scatter(projection[0], projection[1], c=[color], marker=mark,
label=label)
self.set_loading(i, projections.shape[0])
self.umap_ax.legend(prop={'size': 10})
self.set_loading(0)
# Draw the plot
self.umap_ax.set_aspect("equal", "datalim")
self.umap_ax.set_xticks([])
self.umap_ax.set_yticks([])
self.umap_ax.figure.canvas.draw()
def save_audio_file(self, wav, sample_rate):
dialog = QFileDialog()
dialog.setDefaultSuffix(".wav")
fpath, _ = dialog.getSaveFileName(
parent=self,
caption="Select a path to save the audio file",
filter="Audio Files (*.flac *.wav)"
)
if fpath:
#Default format is wav
if Path(fpath).suffix == "":
fpath += ".wav"
sf.write(fpath, wav, sample_rate)
def setup_audio_devices(self, sample_rate):
input_devices = []
output_devices = []
for device in sd.query_devices():
# Check if valid input
try:
sd.check_input_settings(device=device["name"], samplerate=sample_rate)
input_devices.append(device["name"])
except:
pass
# Check if valid output
try:
sd.check_output_settings(device=device["name"], samplerate=sample_rate)
output_devices.append(device["name"])
except Exception as e:
# Log a warning only if the device is not an input
if not device["name"] in input_devices:
warn("Unsupported output device %s for the sample rate: %d \nError: %s" % (device["name"], sample_rate, str(e)))
if len(input_devices) == 0:
self.log("No audio input device detected. Recording may not work.")
self.audio_in_device = None
else:
self.audio_in_device = input_devices[0]
if len(output_devices) == 0:
self.log("No supported output audio devices were found! Audio output may not work.")
self.audio_out_devices_cb.addItems(["None"])
self.audio_out_devices_cb.setDisabled(True)
else:
self.audio_out_devices_cb.clear()
self.audio_out_devices_cb.addItems(output_devices)
self.audio_out_devices_cb.currentTextChanged.connect(self.set_audio_device)
self.set_audio_device()
def set_audio_device(self):
output_device = self.audio_out_devices_cb.currentText()
if output_device == "None":
output_device = None
# If None, sounddevice queries portaudio
sd.default.device = (self.audio_in_device, output_device)
def play(self, wav, sample_rate):
try:
sd.stop()
sd.play(wav, sample_rate)
except Exception as e:
print(e)
self.log("Error in audio playback. Try selecting a different audio output device.")
self.log("Your device must be connected before you start the toolbox.")
def stop(self):
sd.stop()
def record_one(self, sample_rate, duration):
self.record_button.setText("Recording...")
self.record_button.setDisabled(True)
self.log("Recording %d seconds of audio" % duration)
sd.stop()
try:
wav = sd.rec(duration * sample_rate, sample_rate, 1)
except Exception as e:
print(e)
self.log("Could not record anything. Is your recording device enabled?")
self.log("Your device must be connected before you start the toolbox.")
return None
for i in np.arange(0, duration, 0.1):
self.set_loading(i, duration)
sleep(0.1)
self.set_loading(duration, duration)
sd.wait()
self.log("Done recording.")
self.record_button.setText("Record")
self.record_button.setDisabled(False)
return wav.squeeze()
@property
def current_dataset_name(self):
return self.dataset_box.currentText()
@property
def current_speaker_name(self):
return self.speaker_box.currentText()
@property
def current_utterance_name(self):
return self.utterance_box.currentText()
def browse_file(self):
fpath = QFileDialog().getOpenFileName(
parent=self,
caption="Select an audio file",
filter="Audio Files (*.mp3 *.flac *.wav *.m4a)"
)
return Path(fpath[0]) if fpath[0] != "" else ""
@staticmethod
def repopulate_box(box, items, random=False):
"""
Resets a box and adds a list of items. Pass a list of (item, data) pairs instead to join
data to the items
"""
box.blockSignals(True)
box.clear()
for item in items:
item = list(item) if isinstance(item, tuple) else [item]
box.addItem(str(item[0]), *item[1:])
if len(items) > 0:
box.setCurrentIndex(np.random.randint(len(items)) if random else 0)
box.setDisabled(len(items) == 0)
box.blockSignals(False)
def populate_browser(self, datasets_root: Path, recognized_datasets: List, level: int,
random=True):
# Select a random dataset
if level <= 0:
if datasets_root is not None:
datasets = [datasets_root.joinpath(d) for d in recognized_datasets]
datasets = [d.relative_to(datasets_root) for d in datasets if d.exists()]
self.browser_load_button.setDisabled(len(datasets) == 0)
if datasets_root is None or len(datasets) == 0:
msg = "Warning: you d" + ("id not pass a root directory for datasets as argument" \
if datasets_root is None else "o not have any of the recognized datasets" \
" in %s" % datasets_root)
self.log(msg)
msg += ".\nThe recognized datasets are:\n\t%s\nFeel free to add your own. You " \
"can still use the toolbox by recording samples yourself." % \
("\n\t".join(recognized_datasets))
print(msg, file=sys.stderr)
self.random_utterance_button.setDisabled(True)
self.random_speaker_button.setDisabled(True)
self.random_dataset_button.setDisabled(True)
self.utterance_box.setDisabled(True)
self.speaker_box.setDisabled(True)
self.dataset_box.setDisabled(True)
self.browser_load_button.setDisabled(True)
self.auto_next_checkbox.setDisabled(True)
return
self.repopulate_box(self.dataset_box, datasets, random)
# Select a random speaker
if level <= 1:
speakers_root = datasets_root.joinpath(self.current_dataset_name)
speaker_names = [d.stem for d in speakers_root.glob("*") if d.is_dir()]
self.repopulate_box(self.speaker_box, speaker_names, random)
# Select a random utterance
if level <= 2:
utterances_root = datasets_root.joinpath(
self.current_dataset_name,
self.current_speaker_name
)
utterances = []
for extension in ['mp3', 'flac', 'wav', 'm4a']:
utterances.extend(Path(utterances_root).glob("**/*.%s" % extension))
utterances = [fpath.relative_to(utterances_root) for fpath in utterances]
self.repopulate_box(self.utterance_box, utterances, random)
def browser_select_next(self):
index = (self.utterance_box.currentIndex() + 1) % len(self.utterance_box)
self.utterance_box.setCurrentIndex(index)
@property
def current_encoder_fpath(self):
return self.encoder_box.itemData(self.encoder_box.currentIndex())
@property
def current_synthesizer_fpath(self):
return self.synthesizer_box.itemData(self.synthesizer_box.currentIndex())
@property
def current_vocoder_fpath(self):
return self.vocoder_box.itemData(self.vocoder_box.currentIndex())
def populate_models(self, run_id: str, models_dir: Path):
# Encoder
encoder_fpaths = list(models_dir.glob(f"{run_id}/encoder.pt"))
if len(encoder_fpaths) == 0:
raise Exception("No encoder models found in %s" % models_dir)
self.repopulate_box(self.encoder_box, [(f.parent.name, f) for f in encoder_fpaths])
# Synthesizer
synthesizer_fpaths = list(models_dir.glob(f"{run_id}/synthesizer.pt"))
if len(synthesizer_fpaths) == 0:
raise Exception("No synthesizer models found in %s" % models_dir)
self.repopulate_box(self.synthesizer_box, [(f.parent.name, f) for f in synthesizer_fpaths])
# Vocoder
vocoder_fpaths = list(models_dir.glob(f"{run_id}/vocoder.pt"))
vocoder_items = [(f.parent.name, f) for f in vocoder_fpaths] + [("Griffin-Lim", None)]
self.repopulate_box(self.vocoder_box, vocoder_items)
@property
def selected_utterance(self):
return self.utterance_history.itemData(self.utterance_history.currentIndex())
def register_utterance(self, utterance: Utterance):
self.utterance_history.blockSignals(True)
self.utterance_history.insertItem(0, utterance.name, utterance)
self.utterance_history.setCurrentIndex(0)
self.utterance_history.blockSignals(False)
if len(self.utterance_history) > self.max_saved_utterances:
self.utterance_history.removeItem(self.max_saved_utterances)
self.play_button.setDisabled(False)
self.generate_button.setDisabled(False)
self.synthesize_button.setDisabled(False)
def log(self, line, mode="newline"):
if mode == "newline":
self.logs.append(line)
if len(self.logs) > self.max_log_lines:
del self.logs[0]
elif mode == "append":
self.logs[-1] += line
elif mode == "overwrite":
self.logs[-1] = line
log_text = '\n'.join(self.logs)
self.log_window.setText(log_text)
self.app.processEvents()
def set_loading(self, value, maximum=1):
self.loading_bar.setValue(value * 100)
self.loading_bar.setMaximum(maximum * 100)
self.loading_bar.setTextVisible(value != 0)
self.app.processEvents()
def populate_gen_options(self, seed, trim_silences):
if seed is not None:
self.random_seed_checkbox.setChecked(True)
self.seed_textbox.setText(str(seed))
self.seed_textbox.setEnabled(True)
else:
self.random_seed_checkbox.setChecked(False)
self.seed_textbox.setText(str(0))
self.seed_textbox.setEnabled(False)
if not trim_silences:
self.trim_silences_checkbox.setChecked(False)
self.trim_silences_checkbox.setDisabled(True)
def update_seed_textbox(self):
if self.random_seed_checkbox.isChecked():
self.seed_textbox.setEnabled(True)
else:
self.seed_textbox.setEnabled(False)
def reset_interface(self):
self.draw_embed(None, None, "current")
self.draw_embed(None, None, "generated")
self.draw_spec(None, "current")
self.draw_spec(None, "generated")
self.draw_umap_projections(set())
self.set_loading(0)
self.play_button.setDisabled(True)
self.generate_button.setDisabled(True)
self.synthesize_button.setDisabled(True)
self.vocode_button.setDisabled(True)
self.replay_wav_button.setDisabled(True)
self.export_wav_button.setDisabled(True)
[self.log("") for _ in range(self.max_log_lines)]
def __init__(self):
## Initialize the application
self.app = QApplication(sys.argv)
super().__init__(None)
self.setWindowTitle("SV2TTS toolbox")
## Main layouts
# Root
root_layout = QGridLayout()
self.setLayout(root_layout)
# Browser
browser_layout = QGridLayout()
root_layout.addLayout(browser_layout, 0, 0, 1, 2)
# Generation
gen_layout = QVBoxLayout()
root_layout.addLayout(gen_layout, 0, 2, 1, 2)
# Projections
self.projections_layout = QVBoxLayout()
root_layout.addLayout(self.projections_layout, 1, 0, 1, 1)
# Visualizations
vis_layout = QVBoxLayout()
root_layout.addLayout(vis_layout, 1, 1, 1, 3)
## Projections
# UMap
self.umap_fig, self.umap_ax = plt.subplots(1, 1, figsize=(3, 3), facecolor="#F0F0F0")
self.umap_fig.subplots_adjust(left=0.02, bottom=0.02, right=0.98, top=0.9)
self.projections_layout.addWidget(FigureCanvas(self.umap_fig))
self.umap_hot = False
self.clear_button = QPushButton("Clear")
self.projections_layout.addWidget(self.clear_button)
## Browser
# Dataset, speaker and utterance selection
i = 0
self.dataset_box = QComboBox()
browser_layout.addWidget(QLabel("<b>Dataset</b>"), i, 0)
browser_layout.addWidget(self.dataset_box, i + 1, 0)
self.speaker_box = QComboBox()
browser_layout.addWidget(QLabel("<b>Speaker</b>"), i, 1)
browser_layout.addWidget(self.speaker_box, i + 1, 1)
self.utterance_box = QComboBox()
browser_layout.addWidget(QLabel("<b>Utterance</b>"), i, 2)
browser_layout.addWidget(self.utterance_box, i + 1, 2)
self.browser_load_button = QPushButton("Load")
browser_layout.addWidget(self.browser_load_button, i + 1, 3)
i += 2
# Random buttons
self.random_dataset_button = QPushButton("Random")
browser_layout.addWidget(self.random_dataset_button, i, 0)
self.random_speaker_button = QPushButton("Random")
browser_layout.addWidget(self.random_speaker_button, i, 1)
self.random_utterance_button = QPushButton("Random")
browser_layout.addWidget(self.random_utterance_button, i, 2)
self.auto_next_checkbox = QCheckBox("Auto select next")
self.auto_next_checkbox.setChecked(True)
browser_layout.addWidget(self.auto_next_checkbox, i, 3)
i += 1
# Utterance box
browser_layout.addWidget(QLabel("<b>Use embedding from:</b>"), i, 0)
self.utterance_history = QComboBox()
browser_layout.addWidget(self.utterance_history, i, 1, 1, 3)
i += 1
# Random & next utterance buttons
self.browser_browse_button = QPushButton("Browse")
browser_layout.addWidget(self.browser_browse_button, i, 0)
self.record_button = QPushButton("Record")
browser_layout.addWidget(self.record_button, i, 1)
self.play_button = QPushButton("Play")
browser_layout.addWidget(self.play_button, i, 2)
self.stop_button = QPushButton("Stop")
browser_layout.addWidget(self.stop_button, i, 3)
i += 1
# Model and audio output selection
self.encoder_box = QComboBox()
browser_layout.addWidget(QLabel("<b>Encoder</b>"), i, 0)
browser_layout.addWidget(self.encoder_box, i + 1, 0)
self.synthesizer_box = QComboBox()
browser_layout.addWidget(QLabel("<b>Synthesizer</b>"), i, 1)
browser_layout.addWidget(self.synthesizer_box, i + 1, 1)
self.vocoder_box = QComboBox()
browser_layout.addWidget(QLabel("<b>Vocoder</b>"), i, 2)
browser_layout.addWidget(self.vocoder_box, i + 1, 2)
self.audio_out_devices_cb=QComboBox()
browser_layout.addWidget(QLabel("<b>Audio Output</b>"), i, 3)
browser_layout.addWidget(self.audio_out_devices_cb, i + 1, 3)
i += 2
#Replay & Save Audio
browser_layout.addWidget(QLabel("<b>Toolbox Output:</b>"), i, 0)
self.waves_cb = QComboBox()
self.waves_cb_model = QStringListModel()
self.waves_cb.setModel(self.waves_cb_model)
self.waves_cb.setToolTip("Select one of the last generated waves in this section for replaying or exporting")
browser_layout.addWidget(self.waves_cb, i, 1)
self.replay_wav_button = QPushButton("Replay")
self.replay_wav_button.setToolTip("Replay the last generated vocoder output")
browser_layout.addWidget(self.replay_wav_button, i, 2)
self.export_wav_button = QPushButton("Export")
self.export_wav_button.setToolTip("Save the last generated vocoder audio to the filesystem as a wav file")
browser_layout.addWidget(self.export_wav_button, i, 3)
i += 1
## Embed & spectrograms
vis_layout.addStretch()
gridspec_kw = {"width_ratios": [1, 4]}
self.wav_ori_fig, self.current_ax = plt.subplots(1, 2, figsize=(10, 2.25), facecolor="#F0F0F0",
gridspec_kw=gridspec_kw)
self.wav_ori_fig.subplots_adjust(left=0, bottom=0.1, right=1, top=0.8)
vis_layout.addWidget(FigureCanvas(self.wav_ori_fig))
self.wav_gen_fig, self.gen_ax = plt.subplots(1, 2, figsize=(10, 2.25), facecolor="#F0F0F0",
gridspec_kw=gridspec_kw)
self.wav_gen_fig.subplots_adjust(left=0, bottom=0.1, right=1, top=0.8)
vis_layout.addWidget(FigureCanvas(self.wav_gen_fig))
for ax in self.current_ax.tolist() + self.gen_ax.tolist():
ax.set_facecolor("#F0F0F0")
for side in ["top", "right", "bottom", "left"]:
ax.spines[side].set_visible(False)
## Generation
self.text_prompt = QPlainTextEdit(default_text)
gen_layout.addWidget(self.text_prompt, stretch=1)
self.generate_button = QPushButton("Synthesize and vocode")
gen_layout.addWidget(self.generate_button)
layout = QHBoxLayout()
self.synthesize_button = QPushButton("Synthesize only")
layout.addWidget(self.synthesize_button)
self.vocode_button = QPushButton("Vocode only")
layout.addWidget(self.vocode_button)
gen_layout.addLayout(layout)
layout_seed = QGridLayout()
self.random_seed_checkbox = QCheckBox("Random seed:")
self.random_seed_checkbox.setToolTip("When checked, makes the synthesizer and vocoder deterministic.")
layout_seed.addWidget(self.random_seed_checkbox, 0, 0)
self.seed_textbox = QLineEdit()
self.seed_textbox.setMaximumWidth(80)
layout_seed.addWidget(self.seed_textbox, 0, 1)
self.trim_silences_checkbox = QCheckBox("Enhance vocoder output")
self.trim_silences_checkbox.setChecked(False)
self.trim_silences_checkbox.setToolTip("When checked, trims excess silence in vocoder output."
" This feature requires `webrtcvad` to be installed.")
layout_seed.addWidget(self.trim_silences_checkbox, 0, 2, 1, 2)
self.griffin_lim_checkbox = QCheckBox("Griffin-Lim as vocoder")
self.griffin_lim_checkbox.setChecked(False)
self.griffin_lim_checkbox.setToolTip("When checked, Griffin-Lim is vocoder."
" This feature requires `webrtcvad` to be installed.")
layout_seed.addWidget(self.griffin_lim_checkbox, 0, 3)
gen_layout.addLayout(layout_seed)
self.loading_bar = QProgressBar()
gen_layout.addWidget(self.loading_bar)
self.log_window = QLabel()
self.log_window.setAlignment(Qt.AlignBottom | Qt.AlignLeft)
gen_layout.addWidget(self.log_window)
self.logs = []
gen_layout.addStretch()
## Set the size of the window and of the elements
max_size = QDesktopWidget().availableGeometry(self).size()
self.resize(max_size)
## Finalize the display
self.reset_interface()
self.show()
def start(self):
self.app.exec_()

5
toolbox/utterance.py Normal file
View File

@@ -0,0 +1,5 @@
from collections import namedtuple
Utterance = namedtuple("Utterance", "name speaker_name wav spec embed partial_embeds synth")
Utterance.__eq__ = lambda x, y: x.name == y.name
Utterance.__hash__ = lambda x: hash(x.name)
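Because __eq__ and __hash__ only consider the name field, two utterances with the same name collapse to a single entry in the toolbox's utterance set; a quick sketch (the field values below are placeholders):
u1 = Utterance("p225_001", "VCTK_p225", None, None, None, None, False)
u2 = Utterance("p225_001", "VCTK_p225", None, None, None, None, True)
assert u1 == u2 and len({u1, u2}) == 1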

0
utils/__init__.py Normal file
View File

40
utils/argutils.py Normal file
View File

@@ -0,0 +1,40 @@
from pathlib import Path
import numpy as np
import argparse
_type_priorities = [ # In decreasing order
Path,
str,
int,
float,
bool,
]
def _priority(o):
p = next((i for i, t in enumerate(_type_priorities) if type(o) is t), None)
if p is not None:
return p
p = next((i for i, t in enumerate(_type_priorities) if isinstance(o, t)), None)
if p is not None:
return p
return len(_type_priorities)
def print_args(args: argparse.Namespace, parser=None):
args = vars(args)
if parser is None:
priorities = list(map(_priority, args.values()))
else:
all_params = [a.dest for g in parser._action_groups for a in g._group_actions ]
priority = lambda p: all_params.index(p) if p in all_params else len(all_params)
priorities = list(map(priority, args.keys()))
pad = max(map(len, args.keys())) + 3
indices = np.lexsort((list(args.keys()), priorities))
items = list(args.items())
print("Arguments:")
for i in indices:
param, value = items[i]
print(" {0}:{1}{2}".format(param, ' ' * (pad - len(param)), value))
print("")

56
utils/default_models.py Normal file
View File

@@ -0,0 +1,56 @@
import urllib.request
from pathlib import Path
from threading import Thread
from urllib.error import HTTPError
from tqdm import tqdm
default_models = {
"encoder": ("https://drive.google.com/uc?export=download&id=1q8mEGwCkFy23KZsinbuvdKAQLqNKbYf1", 17090379),
"synthesizer": ("https://drive.google.com/u/0/uc?id=1EqFMIbvxffxtjiVrtykroF6_mUh-5Z3s&export=download&confirm=t", 370554559),
"vocoder": ("https://drive.google.com/uc?export=download&id=1cf2NO6FtI0jDuy8AV3Xgn6leO6dHjIgu", 53845290),
}
class DownloadProgressBar(tqdm):
def update_to(self, b=1, bsize=1, tsize=None):
if tsize is not None:
self.total = tsize
self.update(b * bsize - self.n)
def download(url: str, target: Path, bar_pos=0):
# Ensure the directory exists
target.parent.mkdir(exist_ok=True, parents=True)
desc = f"Downloading {target.name}"
with DownloadProgressBar(unit="B", unit_scale=True, miniters=1, desc=desc, position=bar_pos, leave=False) as t:
try:
urllib.request.urlretrieve(url, filename=target, reporthook=t.update_to)
except HTTPError:
return
def ensure_default_models(run_id: str, models_dir: Path):
# Define download tasks
jobs = []
for model_name, (url, size) in default_models.items():
target_path = models_dir / run_id / f"{model_name}.pt"
if target_path.exists():
# if target_path.stat().st_size != size:
# print(f"File {target_path} is not of expected size, redownloading...")
# else:
continue
thread = Thread(target=download, args=(url, target_path, len(jobs)))
thread.start()
jobs.append((thread, target_path, size))
# Run and join threads
for thread, target_path, size in jobs:
thread.join()
assert target_path.exists() and target_path.stat().st_size == size, \
f"Download for {target_path.name} failed. You may download models manually instead.\n" \
f"https://drive.google.com/drive/folders/1fU6umc5uQAVR2udZdHX-lDgXYzTyqG_j"

247
utils/logmmse.py Normal file
View File

@@ -0,0 +1,247 @@
# The MIT License (MIT)
#
# Copyright (c) 2015 braindead
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
#
#
# This code was extracted from the logmmse package (https://pypi.org/project/logmmse/) and I
# simply modified the interface to meet my needs.
import numpy as np
import math
from scipy.special import expn
from collections import namedtuple
NoiseProfile = namedtuple("NoiseProfile", "sampling_rate window_size len1 len2 win n_fft noise_mu2")
def profile_noise(noise, sampling_rate, window_size=0):
"""
Creates a profile of the noise in a given waveform.
:param noise: a waveform containing noise ONLY, as a numpy array of floats or ints.
:param sampling_rate: the sampling rate of the audio
:param window_size: the size of the window the logmmse algorithm operates on. A default value
will be picked if left as 0.
:return: a NoiseProfile object
"""
noise, dtype = to_float(noise)
noise += np.finfo(np.float64).eps
if window_size == 0:
window_size = int(math.floor(0.02 * sampling_rate))
if window_size % 2 == 1:
window_size = window_size + 1
perc = 50
len1 = int(math.floor(window_size * perc / 100))
len2 = int(window_size - len1)
win = np.hanning(window_size)
win = win * len2 / np.sum(win)
n_fft = 2 * window_size
noise_mean = np.zeros(n_fft)
n_frames = len(noise) // window_size
for j in range(0, window_size * n_frames, window_size):
noise_mean += np.absolute(np.fft.fft(win * noise[j:j + window_size], n_fft, axis=0))
noise_mu2 = (noise_mean / n_frames) ** 2
return NoiseProfile(sampling_rate, window_size, len1, len2, win, n_fft, noise_mu2)
def denoise(wav, noise_profile: NoiseProfile, eta=0.15):
"""
Cleans the noise from a speech waveform given a noise profile. The waveform must have the
same sampling rate as the one used to create the noise profile.
:param wav: a speech waveform as a numpy array of floats or ints.
:param noise_profile: a NoiseProfile object that was created from a similar (or a segment of
the same) waveform.
:param eta: voice threshold for noise update. While the voice activation detection value is
below this threshold, the noise profile will be continuously updated throughout the audio.
Set to 0 to disable updating the noise profile.
:return: the clean wav as a numpy array of floats or ints of the same length.
"""
wav, dtype = to_float(wav)
wav += np.finfo(np.float64).eps
p = noise_profile
nframes = int(math.floor(len(wav) / p.len2) - math.floor(p.window_size / p.len2))
x_final = np.zeros(nframes * p.len2)
aa = 0.98
mu = 0.98
ksi_min = 10 ** (-25 / 10)
x_old = np.zeros(p.len1)
xk_prev = np.zeros(p.len1)
noise_mu2 = p.noise_mu2
for k in range(0, nframes * p.len2, p.len2):
insign = p.win * wav[k:k + p.window_size]
spec = np.fft.fft(insign, p.n_fft, axis=0)
sig = np.absolute(spec)
sig2 = sig ** 2
gammak = np.minimum(sig2 / noise_mu2, 40)
if xk_prev.all() == 0:
ksi = aa + (1 - aa) * np.maximum(gammak - 1, 0)
else:
ksi = aa * xk_prev / noise_mu2 + (1 - aa) * np.maximum(gammak - 1, 0)
ksi = np.maximum(ksi_min, ksi)
log_sigma_k = gammak * ksi/(1 + ksi) - np.log(1 + ksi)
vad_decision = np.sum(log_sigma_k) / p.window_size
if vad_decision < eta:
noise_mu2 = mu * noise_mu2 + (1 - mu) * sig2
a = ksi / (1 + ksi)
vk = a * gammak
ei_vk = 0.5 * expn(1, np.maximum(vk, 1e-8))
hw = a * np.exp(ei_vk)
sig = sig * hw
xk_prev = sig ** 2
xi_w = np.fft.ifft(hw * spec, p.n_fft, axis=0)
xi_w = np.real(xi_w)
x_final[k:k + p.len2] = x_old + xi_w[0:p.len1]
x_old = xi_w[p.len1:p.window_size]
output = from_float(x_final, dtype)
output = np.pad(output, (0, len(wav) - len(output)), mode="constant")
return output
## Alternative VAD algorithm to webrctvad. It has the advantage of not requiring to install that
## darn package and it also works for any sampling rate. Maybe I'll eventually use it instead of
## webrctvad
# def vad(wav, sampling_rate, eta=0.15, window_size=0):
# """
# TODO: fix doc
# Creates a profile of the noise in a given waveform.
#
# :param wav: a waveform containing noise ONLY, as a numpy array of floats or ints.
# :param sampling_rate: the sampling rate of the audio
# :param window_size: the size of the window the logmmse algorithm operates on. A default value
# will be picked if left as 0.
# :param eta: voice threshold for noise update. While the voice activation detection value is
# below this threshold, the noise profile will be continuously updated throughout the audio.
# Set to 0 to disable updating the noise profile.
# """
# wav, dtype = to_float(wav)
# wav += np.finfo(np.float64).eps
#
# if window_size == 0:
# window_size = int(math.floor(0.02 * sampling_rate))
#
# if window_size % 2 == 1:
# window_size = window_size + 1
#
# perc = 50
# len1 = int(math.floor(window_size * perc / 100))
# len2 = int(window_size - len1)
#
# win = np.hanning(window_size)
# win = win * len2 / np.sum(win)
# n_fft = 2 * window_size
#
# wav_mean = np.zeros(n_fft)
# n_frames = len(wav) // window_size
# for j in range(0, window_size * n_frames, window_size):
# wav_mean += np.absolute(np.fft.fft(win * wav[j:j + window_size], n_fft, axis=0))
# noise_mu2 = (wav_mean / n_frames) ** 2
#
# wav, dtype = to_float(wav)
# wav += np.finfo(np.float64).eps
#
# nframes = int(math.floor(len(wav) / len2) - math.floor(window_size / len2))
# vad = np.zeros(nframes * len2, dtype=np.bool)
#
# aa = 0.98
# mu = 0.98
# ksi_min = 10 ** (-25 / 10)
#
# xk_prev = np.zeros(len1)
# noise_mu2 = noise_mu2
# for k in range(0, nframes * len2, len2):
# insign = win * wav[k:k + window_size]
#
# spec = np.fft.fft(insign, n_fft, axis=0)
# sig = np.absolute(spec)
# sig2 = sig ** 2
#
# gammak = np.minimum(sig2 / noise_mu2, 40)
#
# if xk_prev.all() == 0:
# ksi = aa + (1 - aa) * np.maximum(gammak - 1, 0)
# else:
# ksi = aa * xk_prev / noise_mu2 + (1 - aa) * np.maximum(gammak - 1, 0)
# ksi = np.maximum(ksi_min, ksi)
#
# log_sigma_k = gammak * ksi / (1 + ksi) - np.log(1 + ksi)
# vad_decision = np.sum(log_sigma_k) / window_size
# if vad_decision < eta:
# noise_mu2 = mu * noise_mu2 + (1 - mu) * sig2
# print(vad_decision)
#
# a = ksi / (1 + ksi)
# vk = a * gammak
# ei_vk = 0.5 * expn(1, np.maximum(vk, 1e-8))
# hw = a * np.exp(ei_vk)
# sig = sig * hw
# xk_prev = sig ** 2
#
# vad[k:k + len2] = vad_decision >= eta
#
# vad = np.pad(vad, (0, len(wav) - len(vad)), mode="constant")
# return vad
def to_float(_input):
if _input.dtype == np.float64:
return _input, _input.dtype
elif _input.dtype == np.float32:
return _input.astype(np.float64), _input.dtype
elif _input.dtype == np.uint8:
return (_input - 128) / 128., _input.dtype
elif _input.dtype == np.int16:
return _input / 32768., _input.dtype
elif _input.dtype == np.int32:
return _input / 2147483648., _input.dtype
raise ValueError('Unsupported wave file format')
def from_float(_input, dtype):
if dtype == np.float64:
return _input
elif dtype == np.float32:
return _input.astype(np.float32)
elif dtype == np.uint8:
return ((_input * 128) + 128).astype(np.uint8)
elif dtype == np.int16:
return (_input * 32768).astype(np.int16)
elif dtype == np.int32:
print(_input)
return (_input * 2147483648).astype(np.int32)
raise ValueError('Unsupported wave file format')
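A hedged usage sketch for the logmmse helpers: profile a noise-only segment, then denoise the full clip. The 16 kHz rate and the "first half second is pure noise" split are illustrative assumptions:
import numpy as np
from utils.logmmse import profile_noise, denoise

sampling_rate = 16000
wav = (np.random.randn(2 * sampling_rate) * 1000).astype(np.int16)   # stand-in waveform
noise_profile = profile_noise(wav[:sampling_rate // 2], sampling_rate)
clean = denoise(wav, noise_profile)
assert len(clean) == len(wav) and clean.dtype == wav.dtype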

45
utils/profiler.py Normal file
View File

@@ -0,0 +1,45 @@
from time import perf_counter as timer
from collections import OrderedDict
import numpy as np
class Profiler:
def __init__(self, summarize_every=5, disabled=False):
self.last_tick = timer()
self.logs = OrderedDict()
self.summarize_every = summarize_every
self.disabled = disabled
def tick(self, name):
if self.disabled:
return
# Log the time needed to execute that function
if not name in self.logs:
self.logs[name] = []
if len(self.logs[name]) >= self.summarize_every:
self.summarize()
self.purge_logs()
self.logs[name].append(timer() - self.last_tick)
self.reset_timer()
def purge_logs(self):
for name in self.logs:
self.logs[name].clear()
def reset_timer(self):
self.last_tick = timer()
def summarize(self):
n = max(map(len, self.logs.values()))
assert n == self.summarize_every
print("\nAverage execution time over %d steps:" % n)
name_msgs = ["%s (%d/%d):" % (name, len(deltas), n) for name, deltas in self.logs.items()]
pad = max(map(len, name_msgs))
for name_msg, deltas in zip(name_msgs, self.logs.values()):
print(" %s mean: %4.0fms std: %4.0fms" %
(name_msg.ljust(pad), np.mean(deltas) * 1000, np.std(deltas) * 1000))
print("", flush=True)

22
vocoder/LICENSE.txt Normal file
View File

@@ -0,0 +1,22 @@
MIT License
Original work Copyright (c) 2019 fatchord (https://github.com/fatchord)
Modified work Copyright (c) 2019 Corentin Jemine (https://github.com/CorentinJ)
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

108
vocoder/audio.py Normal file
View File

@@ -0,0 +1,108 @@
import math
import numpy as np
import librosa
import vocoder.hparams as hp
from scipy.signal import lfilter
import soundfile as sf
def label_2_float(x, bits) :
return 2 * x / (2**bits - 1.) - 1.
def float_2_label(x, bits) :
assert abs(x).max() <= 1.0
x = (x + 1.) * (2**bits - 1) / 2
return x.clip(0, 2**bits - 1)
def load_wav(path) :
return librosa.load(str(path), sr=hp.sample_rate)[0]
def save_wav(x, path) :
sf.write(path, x.astype(np.float32), hp.sample_rate)
def split_signal(x) :
unsigned = x + 2**15
coarse = unsigned // 256
fine = unsigned % 256
return coarse, fine
def combine_signal(coarse, fine) :
return coarse * 256 + fine - 2**15
def encode_16bits(x) :
return np.clip(x * 2**15, -2**15, 2**15 - 1).astype(np.int16)
mel_basis = None
def linear_to_mel(spectrogram):
global mel_basis
if mel_basis is None:
mel_basis = build_mel_basis()
return np.dot(mel_basis, spectrogram)
def build_mel_basis():
return librosa.filters.mel(sr=hp.sample_rate, n_fft=hp.n_fft, n_mels=hp.num_mels, fmin=hp.fmin)  # keyword args required by newer librosa
def normalize(S):
return np.clip((S - hp.min_level_db) / -hp.min_level_db, 0, 1)
def denormalize(S):
return (np.clip(S, 0, 1) * -hp.min_level_db) + hp.min_level_db
def amp_to_db(x):
return 20 * np.log10(np.maximum(1e-5, x))
def db_to_amp(x):
return np.power(10.0, x * 0.05)
def spectrogram(y):
D = stft(y)
S = amp_to_db(np.abs(D)) - hp.ref_level_db
return normalize(S)
def melspectrogram(y):
D = stft(y)
S = amp_to_db(linear_to_mel(np.abs(D)))
return normalize(S)
def stft(y):
return librosa.stft(y=y, n_fft=hp.n_fft, hop_length=hp.hop_length, win_length=hp.win_length)
def pre_emphasis(x):
return lfilter([1, -hp.preemphasis], [1], x)
def de_emphasis(x):
return lfilter([1], [1, -hp.preemphasis], x)
def encode_mu_law(x, mu) :
mu = mu - 1
fx = np.sign(x) * np.log(1 + mu * np.abs(x)) / np.log(1 + mu)
return np.floor((fx + 1) / 2 * mu + 0.5)
def decode_mu_law(y, mu, from_labels=True) :
if from_labels:
y = label_2_float(y, math.log2(mu))
mu = mu - 1
x = np.sign(y) / mu * ((1 + mu) ** np.abs(y) - 1)
return x
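A round-trip sketch for the mu-law helpers above; bits = 9 mirrors hp.bits and the test signal is illustrative:
import numpy as np
from vocoder.audio import encode_mu_law, decode_mu_law

bits = 9
x = np.linspace(-1.0, 1.0, num=11)
labels = encode_mu_law(x, mu=2 ** bits)                       # labels in [0, 2**bits - 1]
x_hat = decode_mu_law(labels, mu=2 ** bits, from_labels=True)
print(np.max(np.abs(x - x_hat)))                              # small quantization error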

157
vocoder/display.py Normal file
View File

@@ -0,0 +1,157 @@
import time
import numpy as np
import sys
def progbar(i, n, size=16):
done = (i * size) // n
bar = ''
for i in range(size):
bar += '█' if i <= done else '░'
return bar
def stream(message) :
try:
sys.stdout.write("\r{%s}" % message)
except:
#Remove non-ASCII characters from message
message = ''.join(i for i in message if ord(i)<128)
sys.stdout.write("\r{%s}" % message)
def simple_table(item_tuples) :
border_pattern = '+---------------------------------------'
whitespace = ' '
headings, cells, = [], []
for item in item_tuples :
heading, cell = str(item[0]), str(item[1])
pad_head = True if len(heading) < len(cell) else False
pad = abs(len(heading) - len(cell))
pad = whitespace[:pad]
pad_left = pad[:len(pad)//2]
pad_right = pad[len(pad)//2:]
if pad_head :
heading = pad_left + heading + pad_right
else :
cell = pad_left + cell + pad_right
headings += [heading]
cells += [cell]
border, head, body = '', '', ''
for i in range(len(item_tuples)) :
temp_head = f'| {headings[i]} '
temp_body = f'| {cells[i]} '
border += border_pattern[:len(temp_head)]
head += temp_head
body += temp_body
if i == len(item_tuples) - 1 :
head += '|'
body += '|'
border += '+'
print(border)
print(head)
print(border)
print(body)
print(border)
print(' ')
def time_since(started) :
elapsed = time.time() - started
m = int(elapsed // 60)
s = int(elapsed % 60)
if m >= 60 :
h = int(m // 60)
m = m % 60
return f'{h}h {m}m {s}s'
else :
return f'{m}m {s}s'
def save_attention(attn, path):
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(12, 6))
plt.imshow(attn.T, interpolation='nearest', aspect='auto')
fig.savefig(f'{path}.png', bbox_inches='tight')
plt.close(fig)
def save_attention_multiple(attn, path):
import matplotlib.pyplot as plt
num_plots = len(attn)
fig = plt.figure(figsize=(12, 6 * num_plots))
for i, a in enumerate(attn):
plt.subplot(num_plots, 1, i+1)
plt.imshow(a.T, interpolation='nearest', aspect='auto')
plt.xlabel("Decoder Step")
plt.ylabel("Encoder Step")
plt.title(f"Encoder-Decoder Alignment of No.{i} Sequence")
fig.savefig(f'{path}.png', bbox_inches='tight')
plt.close(fig)
def save_stop_tokens(stop, path):
import matplotlib.pyplot as plt
num_plots = len(stop)
fig = plt.figure(figsize=(12, 6 * num_plots))
for i, s in enumerate(stop):
plt.subplot(num_plots, 1, i+1)
plt.plot(s)
plt.xlabel("Timestep")
plt.ylabel("Stop Value")
plt.title(f"Stop Tokens of No.{i} Sequence")
fig.savefig(f'{path}.png', bbox_inches='tight')
plt.close(fig)
def save_spectrogram(M, path, length=None):
import matplotlib.pyplot as plt
M = np.flip(M, axis=0)
if length : M = M[:, :length]
fig = plt.figure(figsize=(12, 6))
plt.imshow(M, interpolation='nearest', aspect='auto')
plt.xlabel("Time")
plt.ylabel("Frequency")
plt.title("Generated Mel Spectrogram")
fig.savefig(f'{path}.png', bbox_inches='tight')
plt.close(fig)
def plot(array):
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(30, 5))
ax = fig.add_subplot(111)
ax.xaxis.label.set_color('grey')
ax.yaxis.label.set_color('grey')
ax.xaxis.label.set_fontsize(23)
ax.yaxis.label.set_fontsize(23)
ax.tick_params(axis='x', colors='grey', labelsize=23)
ax.tick_params(axis='y', colors='grey', labelsize=23)
plt.plot(array)
def plot_spec(M):
import matplotlib.pyplot as plt
M = np.flip(M, axis=0)
plt.figure(figsize=(18,4))
plt.imshow(M, interpolation='nearest', aspect='auto')
plt.show()

132
vocoder/distribution.py Normal file
View File

@@ -0,0 +1,132 @@
import numpy as np
import torch
import torch.nn.functional as F
def log_sum_exp(x):
""" numerically stable log_sum_exp implementation that prevents overflow """
# TF ordering
axis = len(x.size()) - 1
m, _ = torch.max(x, dim=axis)
m2, _ = torch.max(x, dim=axis, keepdim=True)
return m + torch.log(torch.sum(torch.exp(x - m2), dim=axis))
# It is adapted from https://github.com/r9y9/wavenet_vocoder/blob/master/wavenet_vocoder/mixture.py
def discretized_mix_logistic_loss(y_hat, y, num_classes=65536,
log_scale_min=None, reduce=True):
if log_scale_min is None:
log_scale_min = float(np.log(1e-14))
y_hat = y_hat.permute(0,2,1)
assert y_hat.dim() == 3
assert y_hat.size(1) % 3 == 0
nr_mix = y_hat.size(1) // 3
# (B x T x C)
y_hat = y_hat.transpose(1, 2)
# unpack parameters. (B, T, num_mixtures) x 3
logit_probs = y_hat[:, :, :nr_mix]
means = y_hat[:, :, nr_mix:2 * nr_mix]
log_scales = torch.clamp(y_hat[:, :, 2 * nr_mix:3 * nr_mix], min=log_scale_min)
# B x T x 1 -> B x T x num_mixtures
y = y.expand_as(means)
centered_y = y - means
inv_stdv = torch.exp(-log_scales)
plus_in = inv_stdv * (centered_y + 1. / (num_classes - 1))
cdf_plus = torch.sigmoid(plus_in)
min_in = inv_stdv * (centered_y - 1. / (num_classes - 1))
cdf_min = torch.sigmoid(min_in)
# log probability for edge case of 0 (before scaling)
# equivalent: torch.log(F.sigmoid(plus_in))
log_cdf_plus = plus_in - F.softplus(plus_in)
# log probability for edge case of 255 (before scaling)
# equivalent: (1 - F.sigmoid(min_in)).log()
log_one_minus_cdf_min = -F.softplus(min_in)
# probability for all other cases
cdf_delta = cdf_plus - cdf_min
mid_in = inv_stdv * centered_y
# log probability in the center of the bin, to be used in extreme cases
# (not actually used in our code)
log_pdf_mid = mid_in - log_scales - 2. * F.softplus(mid_in)
# tf equivalent
"""
log_probs = tf.where(x < -0.999, log_cdf_plus,
tf.where(x > 0.999, log_one_minus_cdf_min,
tf.where(cdf_delta > 1e-5,
tf.log(tf.maximum(cdf_delta, 1e-12)),
log_pdf_mid - np.log(127.5))))
"""
# TODO: cdf_delta <= 1e-5 actually can happen. How can we choose the value
# for num_classes=65536 case? 1e-7? not sure..
inner_inner_cond = (cdf_delta > 1e-5).float()
inner_inner_out = inner_inner_cond * \
torch.log(torch.clamp(cdf_delta, min=1e-12)) + \
(1. - inner_inner_cond) * (log_pdf_mid - np.log((num_classes - 1) / 2))
inner_cond = (y > 0.999).float()
inner_out = inner_cond * log_one_minus_cdf_min + (1. - inner_cond) * inner_inner_out
cond = (y < -0.999).float()
log_probs = cond * log_cdf_plus + (1. - cond) * inner_out
log_probs = log_probs + F.log_softmax(logit_probs, -1)
if reduce:
return -torch.mean(log_sum_exp(log_probs))
else:
return -log_sum_exp(log_probs).unsqueeze(-1)
def sample_from_discretized_mix_logistic(y, log_scale_min=None):
"""
Sample from discretized mixture of logistic distributions
Args:
y (Tensor): B x C x T
log_scale_min (float): Log scale minimum value
Returns:
Tensor: sample in range of [-1, 1].
"""
if log_scale_min is None:
log_scale_min = float(np.log(1e-14))
assert y.size(1) % 3 == 0
nr_mix = y.size(1) // 3
# B x T x C
y = y.transpose(1, 2)
logit_probs = y[:, :, :nr_mix]
# sample mixture indicator from softmax
temp = logit_probs.data.new(logit_probs.size()).uniform_(1e-5, 1.0 - 1e-5)
temp = logit_probs.data - torch.log(- torch.log(temp))
_, argmax = temp.max(dim=-1)
# (B, T) -> (B, T, nr_mix)
one_hot = to_one_hot(argmax, nr_mix)
# select logistic parameters
means = torch.sum(y[:, :, nr_mix:2 * nr_mix] * one_hot, dim=-1)
log_scales = torch.clamp(torch.sum(
y[:, :, 2 * nr_mix:3 * nr_mix] * one_hot, dim=-1), min=log_scale_min)
# sample from logistic & clip to interval
# we don't actually round to the nearest 8bit value when sampling
u = means.data.new(means.size()).uniform_(1e-5, 1.0 - 1e-5)
x = means + torch.exp(log_scales) * (torch.log(u) - torch.log(1. - u))
x = torch.clamp(torch.clamp(x, min=-1.), max=1.)
return x
def to_one_hot(tensor, n, fill_with=1.):
# perform one-hot encoding with respect to the last axis
one_hot = torch.FloatTensor(tensor.size() + (n,)).zero_()
if tensor.is_cuda:
one_hot = one_hot.cuda()
one_hot.scatter_(len(tensor.size()), tensor.unsqueeze(-1), fill_with)
return one_hot
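A hedged shape check for the sampling helper above: with 10 mixture components the network output carries 3 * 10 channels (mixture logits, means, log-scales), and the sampled waveform values are clipped to [-1, 1]:
import torch
from vocoder.distribution import sample_from_discretized_mix_logistic

nr_mix = 10
y_hat = torch.randn(2, 3 * nr_mix, 100)             # (B, C, T)
x = sample_from_discretized_mix_logistic(y_hat)
print(x.shape, float(x.min()), float(x.max()))      # torch.Size([2, 100]), values within [-1, 1]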

31
vocoder/gen_wavernn.py Normal file
View File

@@ -0,0 +1,31 @@
from vocoder.models.fatchord_version import WaveRNN
from vocoder.audio import *
def gen_devset(model: WaveRNN, dev_set, samples, batched, target, overlap, save_path):
k = model.get_step() // 1000
for i, (m, x) in enumerate(dev_set, 1):
if i > samples:
break
print('\n| Generating: %i/%i' % (i, samples))
x = x[0].numpy()
bits = 16 if hp.voc_mode == 'MOL' else hp.bits
if hp.mu_law and hp.voc_mode != 'MOL' :
x = decode_mu_law(x, 2**bits, from_labels=True)
else :
x = label_2_float(x, bits)
save_wav(x, save_path.joinpath("%dk_steps_%d_target.wav" % (k, i)))
batch_str = "gen_batched_target%d_overlap%d" % (target, overlap) if batched else \
"gen_not_batched"
save_str = save_path.joinpath("%dk_steps_%d_%s.wav" % (k, i, batch_str))
wav = model.generate(m, batched, target, overlap, hp.mu_law)
save_wav(wav, save_str)

51
vocoder/hparams.py Normal file
View File

@@ -0,0 +1,51 @@
from synthesizer.hparams import syn_hparams as _syn_hp
# Audio settings------------------------------------------------------------------------
# Match the values of the synthesizer
sample_rate = _syn_hp.sample_rate
n_fft = _syn_hp.n_fft
num_mels = _syn_hp.num_mels
hop_length = _syn_hp.hop_size
win_length = _syn_hp.win_size
fmin = _syn_hp.fmin
min_level_db = _syn_hp.min_level_db
ref_level_db = _syn_hp.ref_level_db
mel_max_abs_value = _syn_hp.max_abs_value
preemphasis = _syn_hp.preemphasis
apply_preemphasis = _syn_hp.preemphasize
bits = 9 # bit depth of signal
mu_law = True # Recommended to suppress noise if using raw bits in hp.voc_mode
# below
# WAVERNN / VOCODER --------------------------------------------------------------------------------
voc_mode = 'RAW' # either 'RAW' (softmax on raw bits) or 'MOL' (sample from
# mixture of logistics)
voc_upsample_factors = (5, 5, 8) # NB - this needs to correctly factorise hop_length
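# 5 * 5 * 8 = 200, so the synthesizer's hop size must be 200 samples for the assert in
# vocoder/train.py (np.cumprod(hp.voc_upsample_factors)[-1] == hp.hop_length) to pass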
voc_rnn_dims = 512
voc_fc_dims = 512
voc_compute_dims = 128
voc_res_out_dims = 128
voc_res_blocks = 10
# Training
voc_batch_size = 256
voc_lr = 1e-6
voc_gen_at_checkpoint = 5 # number of samples to generate at each checkpoint
voc_pad = 2 # this will pad the input so that the resnet can 'see' wider
# than input length
voc_seq_len = hop_length * 5 # must be a multiple of hop_length
# Generating / Synthesizing
voc_gen_batched = True # very fast (realtime+) single utterance batched generation
voc_target = 4000 # target number of samples to be generated in each batch entry
voc_overlap = 400 # number of samples for crossfading between batches
is_crossfade = True # crossfading or not
# Output Noise Reduce
prop_decrease_low_freq = 0.6 # prop decrease for low dominant frequency
prop_decrease_high_freq = 0.9 # prop decrease for high dominant frequency
dry = 0.1 # dry ratio for facebook denoiser
sex = -1
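# NOTE: any non-zero value here (including the -1 default) makes waveform_denoising() in
# vocoder/inference.py select prop_decrease_low_freq; a value of 0 selects prop_decrease_high_freq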

99
vocoder/inference.py Normal file
View File

@@ -0,0 +1,99 @@
from vocoder.models.fatchord_version import WaveRNN
from vocoder import hparams as hp
from scipy.fft import rfft, rfftfreq
from scipy import signal
from denoiser.pretrained import master64
import librosa
import numpy as np
import torch
import torchaudio
import noisereduce as nr
_model = None # type: WaveRNN
def load_model(weights_fpath, verbose=True):
global _model, _device
if verbose:
print("Building Wave-RNN")
_model = WaveRNN(
rnn_dims=hp.voc_rnn_dims,
fc_dims=hp.voc_fc_dims,
bits=hp.bits,
pad=hp.voc_pad,
upsample_factors=hp.voc_upsample_factors,
feat_dims=hp.num_mels,
compute_dims=hp.voc_compute_dims,
res_out_dims=hp.voc_res_out_dims,
res_blocks=hp.voc_res_blocks,
hop_length=hp.hop_length,
sample_rate=hp.sample_rate,
mode=hp.voc_mode
)
if torch.cuda.is_available():
_model = _model.cuda()
_device = torch.device('cuda')
else:
_device = torch.device('cpu')
if verbose:
print("Loading model weights at %s" % weights_fpath)
checkpoint = torch.load(weights_fpath, _device)
_model.load_state_dict(checkpoint['model_state'])
_model.eval()
def is_loaded():
return _model is not None
def infer_waveform(mel, normalize=True, batched=True, target=8000, overlap=800,
progress_callback=None, crossfade=True):
"""
Infers the waveform of a mel spectrogram output by the synthesizer (the format must match
that of the synthesizer!)
:param mel: mel spectrogram from the synthesizer, shape (num_mels, frames)
:param normalize: if True, divide the mel by hp.mel_max_abs_value before inference
:param batched: if True, use batched generation (folding with overlap + crossfade unfolding)
:param target: target number of samples generated per batch entry
:param overlap: number of overlapping samples used for crossfading between batch entries
:param progress_callback: optional callback reporting generation progress
:param crossfade: if True, apply an equal-power crossfade when unfolding batched output
:return: the generated and denoised waveform as a numpy array
"""
if _model is None:
raise Exception("Please load Wave-RNN in memory before using it")
if normalize:
mel = mel / hp.mel_max_abs_value
mel = torch.from_numpy(mel[None, ...])
wav = _model.generate(mel, batched, target, overlap, hp.mu_law, progress_callback, crossfade=crossfade)
wav = waveform_denoising(wav)
return wav
def waveform_denoising(wav):
prop_decrease = hp.prop_decrease_low_freq if hp.sex else hp.prop_decrease_high_freq
if torch.cuda.is_available():
_device = torch.device('cuda')
else:
_device = torch.device('cpu')
model = master64().to(_device)
noisy = torch.from_numpy(np.array([wav])).to(_device).float()
estimate = model(noisy)
estimate = estimate * (1-hp.dry) + noisy * hp.dry
estimate = estimate[0].cpu().detach().numpy()
return nr.reduce_noise(np.squeeze(estimate), hp.sample_rate, prop_decrease=prop_decrease)
def get_dominant_freq(wav, name="fft"):
import matplotlib.pyplot as plt
N = len(wav)
fft_mag = np.abs(rfft(wav))
fft_freq = np.real(rfftfreq(N, 1 / hp.sample_rate))
# only consider frequencies of at least 60 Hz when picking the dominant peak
fft_least_index = np.where(fft_freq >= 60)[0][0]
fft_max_index = fft_least_index + np.argmax(fft_mag[fft_least_index:])
fft_max_freq = fft_freq[fft_max_index]
# plt.clf()
# plt.plot(fft_freq, fft_mag)
# plt.savefig(f"{name}.png", dpi=300)
return fft_max_freq
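# --- Illustrative usage sketch (not part of the original file) ---
# The checkpoint path and mel file below are assumptions for the example only.
if __name__ == "__main__":
    load_model("saved_models/default/vocoder.pt")
    mel = np.load("example_mel.npy")        # (num_mels, frames) as produced by the synthesizer
    wav = infer_waveform(mel, batched=hp.voc_gen_batched, target=hp.voc_target,
                         overlap=hp.voc_overlap, crossfade=hp.is_crossfade)
    print("dominant frequency: %.1f Hz" % get_dominant_freq(wav))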

View File

@@ -0,0 +1,170 @@
import torch
import torch.nn as nn
import torch.nn.functional as F
from utils.display import *
from utils.dsp import *
class WaveRNN(nn.Module) :
def __init__(self, hidden_size=896, quantisation=256) :
super(WaveRNN, self).__init__()
self.hidden_size = hidden_size
self.split_size = hidden_size // 2
# The main matmul
self.R = nn.Linear(self.hidden_size, 3 * self.hidden_size, bias=False)
# Output fc layers
self.O1 = nn.Linear(self.split_size, self.split_size)
self.O2 = nn.Linear(self.split_size, quantisation)
self.O3 = nn.Linear(self.split_size, self.split_size)
self.O4 = nn.Linear(self.split_size, quantisation)
# Input fc layers
self.I_coarse = nn.Linear(2, 3 * self.split_size, bias=False)
self.I_fine = nn.Linear(3, 3 * self.split_size, bias=False)
# biases for the gates
self.bias_u = nn.Parameter(torch.zeros(self.hidden_size))
self.bias_r = nn.Parameter(torch.zeros(self.hidden_size))
self.bias_e = nn.Parameter(torch.zeros(self.hidden_size))
# display num params
self.num_params()
def forward(self, prev_y, prev_hidden, current_coarse) :
# Main matmul - the projection is split 3 ways
R_hidden = self.R(prev_hidden)
R_u, R_r, R_e, = torch.split(R_hidden, self.hidden_size, dim=1)
# Project the prev input
coarse_input_proj = self.I_coarse(prev_y)
I_coarse_u, I_coarse_r, I_coarse_e = \
torch.split(coarse_input_proj, self.split_size, dim=1)
# Project the prev input and current coarse sample
fine_input = torch.cat([prev_y, current_coarse], dim=1)
fine_input_proj = self.I_fine(fine_input)
I_fine_u, I_fine_r, I_fine_e = \
torch.split(fine_input_proj, self.split_size, dim=1)
# concatenate for the gates
I_u = torch.cat([I_coarse_u, I_fine_u], dim=1)
I_r = torch.cat([I_coarse_r, I_fine_r], dim=1)
I_e = torch.cat([I_coarse_e, I_fine_e], dim=1)
# Compute all gates for coarse and fine
u = F.sigmoid(R_u + I_u + self.bias_u)
r = F.sigmoid(R_r + I_r + self.bias_r)
e = F.tanh(r * R_e + I_e + self.bias_e)
hidden = u * prev_hidden + (1. - u) * e
# Split the hidden state
hidden_coarse, hidden_fine = torch.split(hidden, self.split_size, dim=1)
# Compute outputs
out_coarse = self.O2(F.relu(self.O1(hidden_coarse)))
out_fine = self.O4(F.relu(self.O3(hidden_fine)))
return out_coarse, out_fine, hidden
def generate(self, seq_len):
with torch.no_grad():
# First split up the biases for the gates
b_coarse_u, b_fine_u = torch.split(self.bias_u, self.split_size)
b_coarse_r, b_fine_r = torch.split(self.bias_r, self.split_size)
b_coarse_e, b_fine_e = torch.split(self.bias_e, self.split_size)
# Lists for the two output seqs
c_outputs, f_outputs = [], []
# Some initial inputs
out_coarse = torch.LongTensor([0]).cuda()
out_fine = torch.LongTensor([0]).cuda()
# We'll need a hidden state
hidden = self.init_hidden()
# Need a clock for display
start = time.time()
# Loop for generation
for i in range(seq_len) :
# Split into two hidden states
hidden_coarse, hidden_fine = \
torch.split(hidden, self.split_size, dim=1)
# Scale and concat previous predictions
out_coarse = out_coarse.unsqueeze(0).float() / 127.5 - 1.
out_fine = out_fine.unsqueeze(0).float() / 127.5 - 1.
prev_outputs = torch.cat([out_coarse, out_fine], dim=1)
# Project input
coarse_input_proj = self.I_coarse(prev_outputs)
I_coarse_u, I_coarse_r, I_coarse_e = \
torch.split(coarse_input_proj, self.split_size, dim=1)
# Project hidden state and split 6 ways
R_hidden = self.R(hidden)
R_coarse_u , R_fine_u, \
R_coarse_r, R_fine_r, \
R_coarse_e, R_fine_e = torch.split(R_hidden, self.split_size, dim=1)
# Compute the coarse gates
u = F.sigmoid(R_coarse_u + I_coarse_u + b_coarse_u)
r = F.sigmoid(R_coarse_r + I_coarse_r + b_coarse_r)
e = F.tanh(r * R_coarse_e + I_coarse_e + b_coarse_e)
hidden_coarse = u * hidden_coarse + (1. - u) * e
# Compute the coarse output
out_coarse = self.O2(F.relu(self.O1(hidden_coarse)))
posterior = F.softmax(out_coarse, dim=1)
distrib = torch.distributions.Categorical(posterior)
out_coarse = distrib.sample()
c_outputs.append(out_coarse)
# Project the [prev outputs and predicted coarse sample]
coarse_pred = out_coarse.float() / 127.5 - 1.
fine_input = torch.cat([prev_outputs, coarse_pred.unsqueeze(0)], dim=1)
fine_input_proj = self.I_fine(fine_input)
I_fine_u, I_fine_r, I_fine_e = \
torch.split(fine_input_proj, self.split_size, dim=1)
# Compute the fine gates
u = F.sigmoid(R_fine_u + I_fine_u + b_fine_u)
r = F.sigmoid(R_fine_r + I_fine_r + b_fine_r)
e = F.tanh(r * R_fine_e + I_fine_e + b_fine_e)
hidden_fine = u * hidden_fine + (1. - u) * e
# Compute the fine output
out_fine = self.O4(F.relu(self.O3(hidden_fine)))
posterior = F.softmax(out_fine, dim=1)
distrib = torch.distributions.Categorical(posterior)
out_fine = distrib.sample()
f_outputs.append(out_fine)
# Put the hidden state back together
hidden = torch.cat([hidden_coarse, hidden_fine], dim=1)
# Display progress
speed = (i + 1) / (time.time() - start)
stream('Gen: %i/%i -- Speed: %i', (i + 1, seq_len, speed))
coarse = torch.stack(c_outputs).squeeze(1).cpu().data.numpy()
fine = torch.stack(f_outputs).squeeze(1).cpu().data.numpy()
output = combine_signal(coarse, fine)
return output, coarse, fine
def init_hidden(self, batch_size=1) :
return torch.zeros(batch_size, self.hidden_size).cuda()
def num_params(self) :
parameters = filter(lambda p: p.requires_grad, self.parameters())
parameters = sum([np.prod(p.size()) for p in parameters]) / 1_000_000
print('Trainable Parameters: %.3f million' % parameters)
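# Illustrative usage sketch (comments only; this legacy model hard-codes .cuda(), so a GPU is assumed):
#   model = WaveRNN(hidden_size=896, quantisation=256).cuda()
#   wav, coarse, fine = model.generate(seq_len=16000)   # 16000 unconditioned samples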

440
vocoder/models/fatchord_version.py Normal file
View File

@@ -0,0 +1,440 @@
import torch
import torch.nn as nn
import torch.nn.functional as F
from vocoder.distribution import sample_from_discretized_mix_logistic
from vocoder.display import *
from vocoder.audio import *
class ResBlock(nn.Module):
def __init__(self, dims):
super().__init__()
self.conv1 = nn.Conv1d(dims, dims, kernel_size=1, bias=False)
self.conv2 = nn.Conv1d(dims, dims, kernel_size=1, bias=False)
self.batch_norm1 = nn.BatchNorm1d(dims)
self.batch_norm2 = nn.BatchNorm1d(dims)
def forward(self, x):
residual = x
x = self.conv1(x)
x = self.batch_norm1(x)
x = F.relu(x)
x = self.conv2(x)
x = self.batch_norm2(x)
return x + residual
class MelResNet(nn.Module):
def __init__(self, res_blocks, in_dims, compute_dims, res_out_dims, pad):
super().__init__()
k_size = pad * 2 + 1
self.conv_in = nn.Conv1d(in_dims, compute_dims, kernel_size=k_size, bias=False)
self.batch_norm = nn.BatchNorm1d(compute_dims)
self.layers = nn.ModuleList()
for i in range(res_blocks):
self.layers.append(ResBlock(compute_dims))
self.conv_out = nn.Conv1d(compute_dims, res_out_dims, kernel_size=1)
def forward(self, x):
x = self.conv_in(x)
x = self.batch_norm(x)
x = F.relu(x)
for f in self.layers: x = f(x)
x = self.conv_out(x)
return x
class Stretch2d(nn.Module):
def __init__(self, x_scale, y_scale):
super().__init__()
self.x_scale = x_scale
self.y_scale = y_scale
def forward(self, x):
b, c, h, w = x.size()
x = x.unsqueeze(-1).unsqueeze(3)
x = x.repeat(1, 1, 1, self.y_scale, 1, self.x_scale)
return x.view(b, c, h * self.y_scale, w * self.x_scale)
class UpsampleNetwork(nn.Module):
def __init__(self, feat_dims, upsample_scales, compute_dims,
res_blocks, res_out_dims, pad):
super().__init__()
total_scale = np.cumprod(upsample_scales)[-1]
self.indent = pad * total_scale
self.resnet = MelResNet(res_blocks, feat_dims, compute_dims, res_out_dims, pad)
self.resnet_stretch = Stretch2d(total_scale, 1)
self.up_layers = nn.ModuleList()
for scale in upsample_scales:
k_size = (1, scale * 2 + 1)
padding = (0, scale)
stretch = Stretch2d(scale, 1)
conv = nn.Conv2d(1, 1, kernel_size=k_size, padding=padding, bias=False)
conv.weight.data.fill_(1. / k_size[1])
self.up_layers.append(stretch)
self.up_layers.append(conv)
def forward(self, m):
aux = self.resnet(m).unsqueeze(1)
aux = self.resnet_stretch(aux)
aux = aux.squeeze(1)
m = m.unsqueeze(1)
for f in self.up_layers: m = f(m)
m = m.squeeze(1)[:, :, self.indent:-self.indent]
return m.transpose(1, 2), aux.transpose(1, 2)
class WaveRNN(nn.Module):
def __init__(self, rnn_dims, fc_dims, bits, pad, upsample_factors,
feat_dims, compute_dims, res_out_dims, res_blocks,
hop_length, sample_rate, mode='RAW'):
super().__init__()
self.mode = mode
self.pad = pad
if self.mode == 'RAW' :
self.n_classes = 2 ** bits
elif self.mode == 'MOL' :
self.n_classes = 30
else :
raise RuntimeError("Unknown model mode value - ", self.mode)
self.rnn_dims = rnn_dims
self.aux_dims = res_out_dims // 4
self.hop_length = hop_length
self.sample_rate = sample_rate
self.upsample = UpsampleNetwork(feat_dims, upsample_factors, compute_dims, res_blocks, res_out_dims, pad)
self.I = nn.Linear(feat_dims + self.aux_dims + 1, rnn_dims)
self.rnn1 = nn.GRU(rnn_dims, rnn_dims, batch_first=True)
self.rnn2 = nn.GRU(rnn_dims + self.aux_dims, rnn_dims, batch_first=True)
self.fc1 = nn.Linear(rnn_dims + self.aux_dims, fc_dims)
self.fc2 = nn.Linear(fc_dims + self.aux_dims, fc_dims)
self.fc3 = nn.Linear(fc_dims, self.n_classes)
self.step = nn.Parameter(torch.zeros(1).long(), requires_grad=False)
self.num_params()
def forward(self, x, mels):
self.step += 1
bsize = x.size(0)
if torch.cuda.is_available():
h1 = torch.zeros(1, bsize, self.rnn_dims).cuda()
h2 = torch.zeros(1, bsize, self.rnn_dims).cuda()
else:
h1 = torch.zeros(1, bsize, self.rnn_dims).cpu()
h2 = torch.zeros(1, bsize, self.rnn_dims).cpu()
mels, aux = self.upsample(mels)
aux_idx = [self.aux_dims * i for i in range(5)]
a1 = aux[:, :, aux_idx[0]:aux_idx[1]]
a2 = aux[:, :, aux_idx[1]:aux_idx[2]]
a3 = aux[:, :, aux_idx[2]:aux_idx[3]]
a4 = aux[:, :, aux_idx[3]:aux_idx[4]]
x = torch.cat([x.unsqueeze(-1), mels, a1], dim=2)
x = self.I(x)
res = x
x, _ = self.rnn1(x, h1)
x = x + res
res = x
x = torch.cat([x, a2], dim=2)
x, _ = self.rnn2(x, h2)
x = x + res
x = torch.cat([x, a3], dim=2)
x = F.relu(self.fc1(x))
x = torch.cat([x, a4], dim=2)
x = F.relu(self.fc2(x))
return self.fc3(x)
def generate(self, mels, batched, target, overlap, mu_law, progress_callback=None,crossfade=True):
mu_law = mu_law if self.mode == 'RAW' else False
progress_callback = progress_callback or self.gen_display
self.eval()
output = []
start = time.time()
rnn1 = self.get_gru_cell(self.rnn1)
rnn2 = self.get_gru_cell(self.rnn2)
with torch.no_grad():
if torch.cuda.is_available():
mels = mels.cuda()
else:
mels = mels.cpu()
wave_len = (mels.size(-1) - 1) * self.hop_length
mels = self.pad_tensor(mels.transpose(1, 2), pad=self.pad, side='both')
mels, aux = self.upsample(mels.transpose(1, 2))
if batched:
mels = self.fold_with_overlap(mels, target, overlap)
aux = self.fold_with_overlap(aux, target, overlap)
b_size, seq_len, _ = mels.size()
if torch.cuda.is_available():
h1 = torch.zeros(b_size, self.rnn_dims).cuda()
h2 = torch.zeros(b_size, self.rnn_dims).cuda()
x = torch.zeros(b_size, 1).cuda()
else:
h1 = torch.zeros(b_size, self.rnn_dims).cpu()
h2 = torch.zeros(b_size, self.rnn_dims).cpu()
x = torch.zeros(b_size, 1).cpu()
d = self.aux_dims
aux_split = [aux[:, :, d * i:d * (i + 1)] for i in range(4)]
for i in range(seq_len):
m_t = mels[:, i, :]
a1_t, a2_t, a3_t, a4_t = (a[:, i, :] for a in aux_split)
x = torch.cat([x, m_t, a1_t], dim=1)
x = self.I(x)
h1 = rnn1(x, h1)
x = x + h1
inp = torch.cat([x, a2_t], dim=1)
h2 = rnn2(inp, h2)
x = x + h2
x = torch.cat([x, a3_t], dim=1)
x = F.relu(self.fc1(x))
x = torch.cat([x, a4_t], dim=1)
x = F.relu(self.fc2(x))
logits = self.fc3(x)
if self.mode == 'MOL':
sample = sample_from_discretized_mix_logistic(logits.unsqueeze(0).transpose(1, 2))
output.append(sample.view(-1))
if torch.cuda.is_available():
# x = torch.FloatTensor([[sample]]).cuda()
x = sample.transpose(0, 1).cuda()
else:
x = sample.transpose(0, 1)
elif self.mode == 'RAW' :
posterior = F.softmax(logits, dim=1)
distrib = torch.distributions.Categorical(posterior)
sample = 2 * distrib.sample().float() / (self.n_classes - 1.) - 1.
output.append(sample)
x = sample.unsqueeze(-1)
else:
raise RuntimeError("Unknown model mode value - ", self.mode)
if i % 100 == 0:
gen_rate = (i + 1) / (time.time() - start) * b_size / 1000
progress_callback(i, seq_len, b_size, gen_rate)
if torch.cuda.is_available():
torch.cuda.empty_cache()
output = torch.stack(output).transpose(0, 1)
output = output.cpu().numpy()
output = output.astype(np.float64)
if batched:
output = self.xfade_and_unfold(output, target, overlap, crossfade=crossfade)
else:
output = output[0]
if mu_law:
output = decode_mu_law(output, self.n_classes, False)
if hp.apply_preemphasis:
output = de_emphasis(output)
# Fade-out at the end to avoid signal cutting out suddenly
fade_out_len = min(wave_len, 20 * self.hop_length)
fade_out = np.linspace(1, 0.5, fade_out_len)
output = output[:wave_len]
output[-fade_out_len:] *= fade_out
self.train()
return output
def gen_display(self, i, seq_len, b_size, gen_rate):
pbar = progbar(i, seq_len)
msg = f'| {pbar} {i*b_size}/{seq_len*b_size} | Batch Size: {b_size} | Gen Rate: {gen_rate:.1f}kHz | '
stream(msg)
def get_gru_cell(self, gru):
gru_cell = nn.GRUCell(gru.input_size, gru.hidden_size)
gru_cell.weight_hh.data = gru.weight_hh_l0.data
gru_cell.weight_ih.data = gru.weight_ih_l0.data
gru_cell.bias_hh.data = gru.bias_hh_l0.data
gru_cell.bias_ih.data = gru.bias_ih_l0.data
return gru_cell
def pad_tensor(self, x, pad, side='both'):
# NB - this is just a quick method i need right now
# i.e., it won't generalise to other shapes/dims
b, t, c = x.size()
total = t + 2 * pad if side == 'both' else t + pad
if torch.cuda.is_available():
padded = torch.zeros(b, total, c).cuda()
else:
padded = torch.zeros(b, total, c).cpu()
if side == 'before' or side == 'both':
padded[:, pad:pad + t, :] = x
elif side == 'after':
padded[:, :t, :] = x
return padded
def fold_with_overlap(self, x, target, overlap):
''' Fold the tensor with overlap for quick batched inference.
Overlap will be used for crossfading in xfade_and_unfold()
Args:
x (tensor) : Upsampled conditioning features.
shape=(1, timesteps, features)
target (int) : Target timesteps for each index of batch
overlap (int) : Timesteps for both xfade and rnn warmup
Return:
(tensor) : shape=(num_folds, target + 2 * overlap, features)
Details:
x = [[h1, h2, ... hn]]
Where each h is a vector of conditioning features
Eg: target=2, overlap=1 with x.size(1)=10
folded = [[h1, h2, h3, h4],
[h4, h5, h6, h7],
[h7, h8, h9, h10]]
'''
_, total_len, features = x.size()
# Calculate variables needed
num_folds = (total_len - overlap) // (target + overlap)
extended_len = num_folds * (overlap + target) + overlap
remaining = total_len - extended_len
# Pad if some time steps poking out
if remaining != 0:
num_folds += 1
padding = target + 2 * overlap - remaining
x = self.pad_tensor(x, padding, side='after')
if torch.cuda.is_available():
folded = torch.zeros(num_folds, target + 2 * overlap, features).cuda()
else:
folded = torch.zeros(num_folds, target + 2 * overlap, features).cpu()
# Get the values for the folded tensor
for i in range(num_folds):
start = i * (target + overlap)
end = start + target + 2 * overlap
folded[i] = x[:, start:end, :]
return folded
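# Worked example with the defaults from vocoder/hparams.py (target=4000, overlap=400):
# for total_len = 13600 upsampled timesteps, num_folds = (13600 - 400) // 4400 = 3,
# extended_len = 3 * 4400 + 400 = 13600, remaining = 0, and the result is folded into
# 3 rows of length target + 2 * overlap = 4800, whose 400-sample overlaps are later
# crossfaded back together by xfade_and_unfold() below.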
def xfade_and_unfold(self, y, target, overlap, crossfade=True):
''' Applies a crossfade and unfolds into a 1d array.
Args:
y (ndarray) : Batched sequences of audio samples
shape=(num_folds, target + 2 * overlap)
dtype=np.float64
overlap (int) : Timesteps for both xfade and rnn warmup
Return:
(ndarray) : audio samples in a 1d array
shape=(total_len)
dtype=np.float64
Details:
y = [[seq1],
[seq2],
[seq3]]
Apply a gain envelope at both ends of the sequences
y = [[seq1_in, seq1_target, seq1_out],
[seq2_in, seq2_target, seq2_out],
[seq3_in, seq3_target, seq3_out]]
Stagger and add up the groups of samples:
[seq1_in, seq1_target, (seq1_out + seq2_in), seq2_target, ...]
'''
num_folds, length = y.shape
target = length - 2 * overlap
total_len = num_folds * (target + overlap) + overlap
# Need some silence for the rnn warmup
silence_len = overlap // 2
fade_len = overlap - silence_len
silence = np.zeros((silence_len), dtype=np.float64)
# Equal power crossfade
if crossfade:
t = np.linspace(-1, 1, fade_len, dtype=np.float64)
fade_in = np.sqrt(0.5 * (1 + t))
fade_out = np.sqrt(0.5 * (1 - t))
else:
fade_in = fade_out = np.ones((fade_len), dtype=np.float64)
# Concat the silence to the fades
fade_in = np.concatenate([silence, fade_in])
fade_out = np.concatenate([fade_out, silence])
# Apply the gain to the overlap samples
y[:, :overlap] *= fade_in
y[:, -overlap:] *= fade_out
unfolded = np.zeros((total_len), dtype=np.float64)
# Loop to add up all the samples
for i in range(num_folds):
start = i * (target + overlap)
end = start + target + 2 * overlap
unfolded[start:end] += y[i]
return unfolded
def get_step(self) :
return self.step.data.item()
def checkpoint(self, model_dir, optimizer) :
k_steps = self.get_step() // 1000
self.save(model_dir.joinpath("checkpoint_%dk_steps.pt" % k_steps), optimizer)
def log(self, path, msg) :
with open(path, 'a') as f:
print(msg, file=f)
def load(self, path, optimizer) :
checkpoint = torch.load(path, map_location="cpu")
if "optimizer_state" in checkpoint:
self.load_state_dict(checkpoint["model_state"])
optimizer.load_state_dict(checkpoint["optimizer_state"])
else:
# Backwards compatibility
self.load_state_dict(checkpoint)
def save(self, path, optimizer) :
torch.save({
"model_state": self.state_dict(),
"optimizer_state": optimizer.state_dict(),
}, path)
def num_params(self, print_out=True):
parameters = filter(lambda p: p.requires_grad, self.parameters())
parameters = sum([np.prod(p.size()) for p in parameters]) / 1_000_000
if print_out :
print('Trainable Parameters: %.3fM' % parameters)

198
vocoder/train.py Normal file
View File

@@ -0,0 +1,198 @@
import time
from pathlib import Path
from os.path import exists
import numpy as np
import torch
import torch.nn.functional as F
from torch import no_grad, optim
from torch.utils.data import DataLoader
import vocoder.hparams as hp
from vocoder.display import stream, simple_table
from vocoder.distribution import discretized_mix_logistic_loss
from vocoder.gen_wavernn import gen_devset
from vocoder.models.fatchord_version import WaveRNN
from vocoder.vocoder_dataset import VocoderDataset, collate_vocoder
from vocoder.utils import ValueWindow
from utils.profiler import Profiler
def train(run_id: str, syn_dir: Path, voc_dir: Path, models_dir: Path, ground_truth: bool, save_every: int,
backup_every: int, force_restart: bool, use_tb: bool):
if use_tb:
print("Use Tensorboard")
import tensorflow as tf
import datetime
# Create a Tensorboard summary writer under log/vc/vocoder/tensorboard/
log_dir = "log/vc/vocoder/tensorboard/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
train_summary_writer = tf.summary.create_file_writer(log_dir)
train_syn_dir = syn_dir.joinpath("train")
train_voc_dir = voc_dir.joinpath("train")
dev_syn_dir = syn_dir.joinpath("dev")
dev_voc_dir = voc_dir.joinpath("dev")
# Check to make sure the hop length is correctly factorised
assert np.cumprod(hp.voc_upsample_factors)[-1] == hp.hop_length
# Instantiate the model
print("Initializing the model...")
model = WaveRNN(
rnn_dims=hp.voc_rnn_dims,
fc_dims=hp.voc_fc_dims,
bits=hp.bits,
pad=hp.voc_pad,
upsample_factors=hp.voc_upsample_factors,
feat_dims=hp.num_mels,
compute_dims=hp.voc_compute_dims,
res_out_dims=hp.voc_res_out_dims,
res_blocks=hp.voc_res_blocks,
hop_length=hp.hop_length,
sample_rate=hp.sample_rate,
mode=hp.voc_mode
)
if torch.cuda.is_available():
model = model.cuda()
# Initialize the optimizer
optimizer = optim.Adam(model.parameters())
for p in optimizer.param_groups:
p["lr"] = hp.voc_lr
loss_func = F.cross_entropy if model.mode == "RAW" else discretized_mix_logistic_loss
train_loss_window = ValueWindow(100)
# Load the weights
model_dir = models_dir / run_id
model_dir.mkdir(exist_ok=True)
weights_fpath = model_dir / "vocoder.pt"
# train_loss_file_path = "vocoder_loss/vocoder_train_loss.npy"
# dev_loss_file_path = "vocoder_loss/vocoder_dev_loss.npy"
# if not exists("vocoder_loss"):
# import os
# os.mkdir("vocoder_loss")
if force_restart or not weights_fpath.exists():
print("\nStarting the training of WaveRNN from scratch\n")
model.save(weights_fpath, optimizer)
# losses = []
# dev_losses = []
else:
print("\nLoading weights at %s" % weights_fpath)
model.load(weights_fpath, optimizer)
print("WaveRNN weights loaded from step %d" % model.step)
# losses = list(np.load(train_loss_file_path)) if exists(train_loss_file_path) else []
# dev_losses = list(np.load(dev_loss_file_path)) if exists(dev_loss_file_path) else []
# Initialize the dataset
train_metadata_fpath = train_syn_dir.joinpath("train.txt") if ground_truth else \
train_voc_dir.joinpath("synthesized.txt")
train_mel_dir = train_syn_dir.joinpath("mels") if ground_truth else train_voc_dir.joinpath("mels_gta")
train_wav_dir = train_syn_dir.joinpath("audio")
train_dataset = VocoderDataset(train_metadata_fpath, train_mel_dir, train_wav_dir)
dev_metadata_fpath = dev_syn_dir.joinpath("dev.txt") if ground_truth else \
dev_voc_dir.joinpath("synthesized.txt")
dev_mel_dir = dev_syn_dir.joinpath("mels") if ground_truth else dev_voc_dir.joinpath("mels_gta")
dev_wav_dir = dev_syn_dir.joinpath("audio")
dev_dataset = VocoderDataset(dev_metadata_fpath, dev_mel_dir, dev_wav_dir)
train_dataloader = DataLoader(train_dataset, hp.voc_batch_size, shuffle=True, num_workers=8, collate_fn=collate_vocoder, pin_memory=True)
dev_dataloader = DataLoader(dev_dataset, hp.voc_batch_size, shuffle=True, num_workers=8, collate_fn=collate_vocoder, pin_memory=True)
dev_dataloader_ = DataLoader(dev_dataset, 1, shuffle=True)
# Begin the training
simple_table([('Batch size', hp.voc_batch_size),
('LR', hp.voc_lr),
('Sequence Len', hp.voc_seq_len)])
# best_loss_file_path = "vocoder_loss/best_loss.npy"
# best_loss = np.load(best_loss_file_path)[0] if exists(best_loss_file_path) else 1000
# profiler = Profiler(summarize_every=10, disabled=False)
for epoch in range(1, 3500):
start = time.time()
for i, (x, y, m) in enumerate(train_dataloader, 1):
model.train()
# profiler.tick("Blocking, waiting for batch (threaded)")
if torch.cuda.is_available():
x, m, y = x.cuda(), m.cuda(), y.cuda()
# profiler.tick("Data to cuda")
# Forward pass
y_hat = model(x, m)
if model.mode == 'RAW':
y_hat = y_hat.transpose(1, 2).unsqueeze(-1)
elif model.mode == 'MOL':
y = y.float()
y = y.unsqueeze(-1)
# profiler.tick("Forward pass")
# Backward pass
loss = loss_func(y_hat, y)
# profiler.tick("Loss")
optimizer.zero_grad()
loss.backward()
# profiler.tick("Backward pass")
optimizer.step()
# profiler.tick("Parameter update")
speed = i / (time.time() - start)
train_loss_window.append(loss.item())
step = model.get_step()
k = step // 1000
msg = f"| Epoch: {epoch} ({i}/{len(train_dataloader)}) | " \
f"Train Loss: {train_loss_window.average:.4f} | " \
f"{speed:.4f}steps/s | Step: {k}k | "
stream(msg)
if use_tb:
with train_summary_writer.as_default():
tf.summary.scalar('train_loss', train_loss_window.average, step=step)
torch.cuda.empty_cache()
if backup_every != 0 and step % backup_every == 0 :
model.checkpoint(model_dir, optimizer)
if save_every != 0 and step % save_every == 0 :
dev_loss = validate(dev_dataloader, model, loss_func)
msg = f"| Epoch: {epoch} ({i}/{len(train_dataloader)}) | " \
f"Train Loss: {train_loss_window.average:.4f} | Dev Loss: {dev_loss:.4f} | " \
f"{speed:.4f}steps/s | Step: {k}k | "
stream(msg)
if use_tb:
with train_summary_writer.as_default():
tf.summary.scalar('val_loss', dev_loss, step=step)
# losses.append(train_loss_window.average)
# np.save(train_loss_file_path, np.array(losses, dtype=float))
# dev_losses.append(dev_loss)
# np.save(dev_loss_file_path, np.array(dev_losses, dtype=float))
# if dev_loss < best_loss :
# best_loss = dev_loss
# np.save(best_loss_file_path, np.array([best_loss]))
model.save(weights_fpath, optimizer)
# profiler.tick("Extra saving")
# gen_devset(model, dev_dataloader_, hp.voc_gen_at_checkpoint, hp.voc_gen_batched,
# hp.voc_target, hp.voc_overlap, model_dir)
print("")
def validate(dataloader, model, loss_func):
model.eval()
losses = []
with no_grad():
for i, (x, y, m) in enumerate(dataloader, 1):
if torch.cuda.is_available():
x, m, y = x.cuda(), m.cuda(), y.cuda()
y_hat = model(x, m)
if model.mode == 'RAW':
y_hat = y_hat.transpose(1, 2).unsqueeze(-1)
elif model.mode == 'MOL':
y = y.float()
y = y.unsqueeze(-1)
loss = loss_func(y_hat, y).item()
losses.append(loss)
torch.cuda.empty_cache()
return sum(losses) / len(losses)

22
vocoder/utils.py Normal file
View File

@@ -0,0 +1,22 @@
class ValueWindow():
def __init__(self, window_size=100):
self._window_size = window_size
self._values = []
def append(self, x):
self._values = self._values[-(self._window_size - 1):] + [x]
@property
def sum(self):
return sum(self._values)
@property
def count(self):
return len(self._values)
@property
def average(self):
return self.sum / max(1, self.count)
def reset(self):
self._values = []

84
vocoder/vocoder_dataset.py Normal file
View File

@@ -0,0 +1,84 @@
from torch.utils.data import Dataset
from pathlib import Path
from vocoder import audio
import vocoder.hparams as hp
import numpy as np
import torch
class VocoderDataset(Dataset):
def __init__(self, metadata_fpath: Path, mel_dir: Path, wav_dir: Path):
print("Using inputs from:\n\t%s\n\t%s\n\t%s" % (metadata_fpath, mel_dir, wav_dir))
with metadata_fpath.open("r") as metadata_file:
metadata = [line.split("|") for line in metadata_file]
gta_fnames = [x[1] for x in metadata if int(x[4])]
gta_fpaths = [mel_dir.joinpath(fname) for fname in gta_fnames]
wav_fnames = [x[0] for x in metadata if int(x[4])]
wav_fpaths = [wav_dir.joinpath(fname) for fname in wav_fnames]
self.samples_fpaths = list(zip(gta_fpaths, wav_fpaths))
print("Found %d samples" % len(self.samples_fpaths))
def __getitem__(self, index):
mel_path, wav_path = self.samples_fpaths[index]
# Load the mel spectrogram and adjust its range to [-1, 1]
mel = np.load(mel_path).T.astype(np.float32) / hp.mel_max_abs_value
# Load the wav
wav = np.load(wav_path)
if hp.apply_preemphasis:
wav = audio.pre_emphasis(wav)
wav = np.clip(wav, -1, 1)
# Fix for missing padding # TODO: settle on whether this is any useful
r_pad = (len(wav) // hp.hop_length + 1) * hp.hop_length - len(wav)
wav = np.pad(wav, (0, r_pad), mode='constant')
assert len(wav) >= mel.shape[1] * hp.hop_length
wav = wav[:mel.shape[1] * hp.hop_length]
assert len(wav) % hp.hop_length == 0
# Quantize the wav
if hp.voc_mode == 'RAW':
if hp.mu_law:
quant = audio.encode_mu_law(wav, mu=2 ** hp.bits)
else:
quant = audio.float_2_label(wav, bits=hp.bits)
elif hp.voc_mode == 'MOL':
quant = audio.float_2_label(wav, bits=16)
return mel.astype(np.float32), quant.astype(np.int64)
def __len__(self):
return len(self.samples_fpaths)
def collate_vocoder(batch):
mel_win = hp.voc_seq_len // hp.hop_length + 2 * hp.voc_pad
max_offsets = [x[0].shape[-1] -2 - (mel_win + 2 * hp.voc_pad) for x in batch]
mel_offsets = [np.random.randint(0, offset) for offset in max_offsets]
sig_offsets = [(offset + hp.voc_pad) * hp.hop_length for offset in mel_offsets]
mels = [x[0][:, mel_offsets[i]:mel_offsets[i] + mel_win] for i, x in enumerate(batch)]
labels = [x[1][sig_offsets[i]:sig_offsets[i] + hp.voc_seq_len + 1] for i, x in enumerate(batch)]
mels = np.stack(mels).astype(np.float32)
labels = np.stack(labels).astype(np.int64)
mels = torch.tensor(mels)
labels = torch.tensor(labels).long()
x = labels[:, :hp.voc_seq_len]
y = labels[:, 1:]
bits = 16 if hp.voc_mode == 'MOL' else hp.bits
x = audio.label_2_float(x.float(), bits)
if hp.voc_mode == 'MOL' :
y = audio.label_2_float(y.float(), bits)
return x, y, mels
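# Worked example of the windowing above (hparams: voc_seq_len = 5 * hop_length, voc_pad = 2):
# mel_win = 5 + 2 * 2 = 9 mel frames are cropped per item, and the matching label crop is
# voc_seq_len + 1 samples, so x = labels[:, :voc_seq_len] and y = labels[:, 1:] form the
# usual teacher-forcing input/target pair shifted by one sample.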

47
vocoder_preprocess.py Normal file
View File

@@ -0,0 +1,47 @@
import argparse
import os
from pathlib import Path
from synthesizer.hparams import syn_hparams
from synthesizer.synthesize import run_synthesis
from utils.argutils import print_args
if __name__ == "__main__":
class MyFormatter(argparse.ArgumentDefaultsHelpFormatter, argparse.RawDescriptionHelpFormatter):
pass
parser = argparse.ArgumentParser(
description="Creates ground-truth aligned (GTA) spectrograms from the vocoder.",
formatter_class=MyFormatter
)
parser.add_argument("datasets_root", type=Path, help=\
"Path to the directory containing your SV2TTS directory. If you specify both --in_dir and "
"--out_dir, this argument won't be used.")
parser.add_argument("-s", "--syn_model_fpath", type=Path,
default="saved_models/default/synthesizer.pt",
help="Path to a saved synthesizer")
parser.add_argument("-i", "--in_dir", type=Path, default=argparse.SUPPRESS, help= \
"Path to the synthesizer directory that contains the mel spectrograms, the wavs and the "
"embeds. Defaults to <datasets_root>/SV2TTS/synthesizer/.")
parser.add_argument("-o", "--out_dir", type=Path, default=argparse.SUPPRESS, help= \
"Path to the output vocoder directory that will contain the ground truth aligned mel "
"spectrograms. Defaults to <datasets_root>/SV2TTS/vocoder/.")
parser.add_argument("--hparams", default="", help=\
"Hyperparameter overrides as a comma-separated list of name=value pairs")
parser.add_argument("--cpu", action="store_true", help=\
"If True, processing is done on CPU, even when a GPU is available.")
args = parser.parse_args()
print_args(args, parser)
modified_hp = syn_hparams.parse(args.hparams)
if not hasattr(args, "in_dir"):
args.in_dir = args.datasets_root / "SV2TTS" / "synthesizer"
if not hasattr(args, "out_dir"):
args.out_dir = args.datasets_root / "SV2TTS" / "vocoder"
if args.cpu:
# Hide GPUs from Pytorch to force CPU processing
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
run_synthesis(args.in_dir, args.out_dir, args.syn_model_fpath, modified_hp)
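# Example invocation (paths are illustrative):
#   python vocoder_preprocess.py <datasets_root> -s saved_models/default/synthesizer.pt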

55
vocoder_train.py Normal file
View File

@@ -0,0 +1,55 @@
import argparse
from pathlib import Path
from utils.argutils import print_args
from vocoder.train import train
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Trains the vocoder from the synthesizer audios and the GTA synthesized mels, "
"or ground truth mels.",
formatter_class=argparse.ArgumentDefaultsHelpFormatter
)
parser.add_argument("run_id", type=str, help= \
"Name for this model. By default, training outputs will be stored to saved_models/<run_id>/. If a model state "
"from the same run ID was previously saved, the training will restart from there. Pass -f to overwrite saved "
"states and restart from scratch.")
parser.add_argument("datasets_root", type=Path, help= \
"Path to the directory containing your SV2TTS directory. Specifying --syn_dir or --voc_dir "
"will take priority over this argument.")
parser.add_argument("--syn_dir", type=Path, default=argparse.SUPPRESS, help= \
"Path to the synthesizer directory that contains the ground truth mel spectrograms, "
"the wavs and the embeds. Defaults to <datasets_root>/SV2TTS/synthesizer/.")
parser.add_argument("--voc_dir", type=Path, default=argparse.SUPPRESS, help= \
"Path to the vocoder directory that contains the GTA synthesized mel spectrograms. "
"Defaults to <datasets_root>/SV2TTS/vocoder/. Unused if --ground_truth is passed.")
parser.add_argument("-m", "--models_dir", type=Path, default="saved_models", help=\
"Path to the directory that will contain the saved model weights, as well as backups "
"of those weights and wavs generated during training.")
parser.add_argument("-g", "--ground_truth", action="store_true", help= \
"Train on ground truth spectrograms (<datasets_root>/SV2TTS/synthesizer/mels).")
parser.add_argument("-s", "--save_every", type=int, default=100, help= \
"Number of steps between updates of the model on the disk. Set to 0 to never save the "
"model.")
parser.add_argument("-b", "--backup_every", type=int, default=10000, help= \
"Number of steps between backups of the model. Set to 0 to never make backups of the "
"model.")
parser.add_argument("-f", "--force_restart", action="store_true", help= \
"Do not load any saved model and restart from scratch.")
parser.add_argument("--use_tb", action="store_true", help= \
"Use Tensorboard support")
args = parser.parse_args()
# Process the arguments
if not hasattr(args, "syn_dir"):
args.syn_dir = args.datasets_root / "SV2TTS" / "synthesizer"
if not hasattr(args, "voc_dir"):
args.voc_dir = args.datasets_root / "SV2TTS" / "vocoder"
del args.datasets_root
args.models_dir.mkdir(exist_ok=True)
# Run the training
print_args(args, parser)
train(**vars(args))
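# Example invocation (run id and paths are illustrative):
#   python vocoder_train.py my_run <datasets_root> --ground_truth --use_tb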