# Tutorial For Nervous Beginners

## Installation

User-friendly installation. Recommended only for synthesizing voice.

```bash
$ pip install TTS
```

Developer-friendly installation.

```bash
$ git clone https://github.com/coqui-ai/TTS
$ cd TTS
$ pip install -e .
```
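After either install, a quick check that the package is visible to Python can save some confusion later. The sketch below is only an illustration: the helper name `is_installed` is made up here, and it assumes the top-level package is called `TTS`, as in the imports used throughout this tutorial.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if `package` can be found by the current Python interpreter."""
    return importlib.util.find_spec(package) is not None


if __name__ == "__main__":
    # "TTS" is the top-level package name used by the imports in this tutorial.
    if is_installed("TTS"):
        print("TTS is importable.")
    else:
        print("TTS was not found; re-run one of the install commands above.")
```

If the check fails inside a virtual environment, make sure `pip` and `python` point to the same environment.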

## Training a `tts` Model

Below is a breakdown of a simple script that trains a GlowTTS model on the LJSpeech dataset. See the comments for an explanation of each step.

### Pure Python Way

1. Define `train.py`.

```python
import os

# GlowTTSConfig: all model related values for training, validating and testing.
from TTS.tts.configs.glow_tts_config import GlowTTSConfig

# BaseDatasetConfig: defines name, formatter and path of the dataset.
from TTS.tts.configs.shared_configs import BaseDatasetConfig

# Trainer: Where the ✨️ happens.
# TrainingArgs: Defines the set of arguments of the Trainer.
from TTS.trainer import Trainer, TrainingArgs

# load_tts_samples: loads the training and evaluation samples of a dataset.
from TTS.tts.datasets import load_tts_samples

# GlowTTS: the model class trained in this script.
from TTS.tts.models.glow_tts import GlowTTS

# AudioProcessor: feature extraction and audio I/O.
from TTS.utils.audio import AudioProcessor

# we use the same path as this script as our training folder.
output_path = os.path.dirname(os.path.abspath(__file__))

# set LJSpeech as our target dataset and define its path so that the Trainer knows what data formatter it needs.
dataset_config = BaseDatasetConfig(
    name="ljspeech", meta_file_train="metadata.csv", path=os.path.join(output_path, "../LJSpeech-1.1/")
)

# Configure the model. Every config class inherits the BaseTTSConfig to have all the fields defined for the Trainer.
config = GlowTTSConfig(
    batch_size=32,
    eval_batch_size=16,
    num_loader_workers=4,
    num_eval_loader_workers=4,
    run_eval=True,
    test_delay_epochs=-1,
    epochs=1000,
    text_cleaner="english_cleaners",
    use_phonemes=False,
    phoneme_language="en-us",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    print_step=25,
    print_eval=True,
    mixed_precision=False,
    output_path=output_path,
    datasets=[dataset_config],
)

# initialize the audio processor used for feature extraction and audio I/O.
# It is mainly used by the dataloader and the training loggers.
ap = AudioProcessor(**config.audio.to_dict())

# load a list of training samples
# Each sample is a list of `[text, audio_file_path, speaker_name]`
train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)

# initialize the model
# Models take only the config object as input.
model = GlowTTS(config)

# Initiate the Trainer.
# Trainer provides a generic API to train all the 🐸TTS models with all its perks like mixed-precision training,
# distributed training, etc.
trainer = Trainer(
    TrainingArgs(),
    config,
    output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
    training_assets={"audio_processor": ap},
)

# And kick it 🚀
trainer.fit()
```

2. Run the script.

```bash
CUDA_VISIBLE_DEVICES=0 python train.py
```

- Continue a previous run.

```bash
CUDA_VISIBLE_DEVICES=0 python train.py --continue_path path/to/previous/run/folder/
```

- Fine-tune a model.

```bash
CUDA_VISIBLE_DEVICES=0 python train.py --restore_path path/to/model/checkpoint.pth.tar
```

- Run multi-gpu training.

```bash
CUDA_VISIBLE_DEVICES=0,1,2 python TTS/bin/distribute.py --script train.py
```
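For reference, the comments in `train.py` describe each training sample as `[text, audio_file_path, speaker_name]`. The sketch below shows how such a list could be built from LJSpeech-style metadata, where each line holds a file id and two transcriptions separated by `|`. It is not the formatter shipped with 🐸TTS: the function name `parse_ljspeech_metadata`, the choice of the last column as the text, and the metadata line itself are made up for illustration.

```python
import os
from typing import List


def parse_ljspeech_metadata(root_path: str, lines: List[str], speaker_name: str = "ljspeech") -> List[list]:
    """Turn LJSpeech-style metadata lines into [text, audio_file_path, speaker_name] samples."""
    samples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Each line looks like: <file id>|<raw transcription>|<normalized transcription>
        cols = line.split("|")
        wav_path = os.path.join(root_path, "wavs", cols[0] + ".wav")
        # Assumption: use the last (normalized) column as the training text.
        text = cols[-1]
        samples.append([text, wav_path, speaker_name])
    return samples


# A made-up metadata line, not taken from the real dataset.
example = parse_ljspeech_metadata("LJSpeech-1.1", ["LJ001-0001|Dr. Smith spoke.|Doctor Smith spoke."])
print(example)
```

Keeping the samples as plain lists means the same training loop can work with any dataset, as long as a formatter produces this shape.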

### CLI Way

We still support running training from the CLI, as in the old days. The same training run can also be started as follows.

1. Define your `config.json`.

```json
{
    "run_name": "my_run",
    "model": "glow_tts",
    "batch_size": 32,
    "eval_batch_size": 16,
    "num_loader_workers": 4,
    "num_eval_loader_workers": 4,
    "run_eval": true,
    "test_delay_epochs": -1,
    "epochs": 1000,
    "text_cleaner": "english_cleaners",
    "use_phonemes": false,
    "phoneme_language": "en-us",
    "phoneme_cache_path": "phoneme_cache",
    "print_step": 25,
    "print_eval": true,
    "mixed_precision": false,
    "output_path": "recipes/ljspeech/glow_tts/",
    "datasets": [{"name": "ljspeech", "meta_file_train": "metadata.csv", "path": "recipes/ljspeech/LJSpeech-1.1/"}]
}
```

2. Start training.

```bash
$ CUDA_VISIBLE_DEVICES="0" python TTS/bin/train_tts.py --config_path config.json
```
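A typo in `config.json` is easy to make and annoying to discover only once training starts. The sketch below parses a config with Python's standard `json` module and checks a few keys up front. The helper `check_config` and the list of required keys are assumptions made for this example; they are not a schema defined by 🐸TTS.

```python
import json

# Keys taken from the example `config.json` above; treating them as required is an assumption.
REQUIRED_KEYS = ("model", "output_path", "datasets")


def check_config(text: str) -> dict:
    """Parse a JSON config string and report missing keys before training starts."""
    config = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError(f"config is missing keys: {missing}")
    for dataset in config["datasets"]:
        if "name" not in dataset or "path" not in dataset:
            raise ValueError(f"dataset entry needs 'name' and 'path': {dataset}")
    return config


example = """
{
    "model": "glow_tts",
    "output_path": "recipes/ljspeech/glow_tts/",
    "datasets": [{"name": "ljspeech", "meta_file_train": "metadata.csv", "path": "recipes/ljspeech/LJSpeech-1.1/"}]
}
"""
config = check_config(example)
print(config["model"], len(config["datasets"]))
```

Invalid JSON raises `json.JSONDecodeError` from `json.loads`, so syntax errors are also caught before any training code runs.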

## Training a `vocoder` Model

```python
import os

from TTS.trainer import Trainer, TrainingArgs
from TTS.utils.audio import AudioProcessor
from TTS.vocoder.configs import HifiganConfig
from TTS.vocoder.datasets.preprocess import load_wav_data
from TTS.vocoder.models.gan import GAN

# use the folder of this script as the training folder
output_path = os.path.dirname(os.path.abspath(__file__))

# configure the HiFi-GAN vocoder and the training run
config = HifiganConfig(
    batch_size=32,
    eval_batch_size=16,
    num_loader_workers=4,
    num_eval_loader_workers=4,
    run_eval=True,
    test_delay_epochs=5,
    epochs=1000,
    seq_len=8192,
    pad_short=2000,
    use_noise_augment=True,
    eval_split_size=10,
    print_step=25,
    print_eval=False,
    mixed_precision=False,
    lr_gen=1e-4,
    lr_disc=1e-4,
    data_path=os.path.join(output_path, "../LJSpeech-1.1/wavs/"),
    output_path=output_path,
)

# init audio processor
ap = AudioProcessor(**config.audio.to_dict())

# load training samples
eval_samples, train_samples = load_wav_data(config.data_path, config.eval_split_size)

# init model
model = GAN(config)

# init the trainer and 🚀
trainer = Trainer(
    TrainingArgs(),
    config,
    output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
    training_assets={"audio_processor": ap},
)
trainer.fit()
```

❗️ Note that you can also train from the CLI with `train_vocoder.py`, just as with the `tts` models above.
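The vocoder script holds out `eval_split_size` audio files for evaluation and trains on the rest. The sketch below shows one way such a split can be done. It is a stand-in written for this tutorial, not the code behind `load_wav_data`: the function name `split_wav_files`, the fixed seed, and the temporary dummy files are all assumptions.

```python
import random
import tempfile
from pathlib import Path
from typing import List, Tuple


def split_wav_files(data_path: str, eval_split_size: int, seed: int = 0) -> Tuple[List[str], List[str]]:
    """Return (eval_files, train_files), holding out `eval_split_size` wav files for evaluation."""
    wav_files = sorted(str(path) for path in Path(data_path).rglob("*.wav"))
    if len(wav_files) <= eval_split_size:
        raise ValueError("not enough wav files to hold out an evaluation split")
    # Shuffle with a fixed seed so the split is the same on every run.
    random.Random(seed).shuffle(wav_files)
    return wav_files[:eval_split_size], wav_files[eval_split_size:]


# Demonstrate on five empty placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(5):
        (Path(tmp) / f"clip_{i}.wav").touch()
    eval_files, train_files = split_wav_files(tmp, eval_split_size=2)
    print(len(eval_files), len(train_files))  # prints: 2 3
```

Returning the evaluation files first mirrors the order in which the script above unpacks `eval_samples, train_samples`.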

## Synthesizing Speech

You can run `tts` and synthesize speech directly from the terminal.

```bash
$ tts -h # see the help
$ tts --list_models  # list the available models.
```
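If you want to call `tts` from a Python script, for example to synthesize many sentences in a loop, you can build the command as a list and pass it to `subprocess`. The sketch below only assembles and prints the command. The helper `build_tts_command` is made up for this example, and it assumes the `--text`, `--out_path` and `--model_name` options; check `tts -h` for the options of your installed version.

```python
import shlex
from typing import List, Optional


def build_tts_command(text: str, out_path: str, model_name: Optional[str] = None) -> List[str]:
    """Build the argument list for a `tts` call that writes synthesized speech to `out_path`."""
    command = ["tts", "--text", text, "--out_path", out_path]
    if model_name is not None:
        # Pick one of the names printed by `tts --list_models`.
        command += ["--model_name", model_name]
    return command


command = build_tts_command("Hello world.", "hello.wav")
print(shlex.join(command))
# To actually run it: import subprocess; subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the text contains spaces or punctuation.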

You can call `tts-server` to start a local demo server that you can open in your favorite web browser and 🗣️.

```bash
$ tts-server -h # see the help
$ tts-server --list_models  # list the available models.
```