add demo results

liuhaozhe6788
2023-07-22 20:39:59 +08:00
parent e1b236ecd5
commit ac8ca88900
30 changed files with 297 additions and 278 deletions

131
README.md

@@ -129,4 +129,133 @@ The results are saved in dim_reduction_results/.
You can download the pretrained model from [this link](https://drive.google.com/drive/folders/19fhjjAbWq60zv1Bl6Y51snGbG1r5kaN2) and extract it as saved_models/20230609.
## Demo results
coming soon
<div align = "center">
<table style="width:100%">
<thead>
<tr>
<th>Reference Audio</th>
<th>Input Text</th>
<th>Synthetic Audio</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="3" align = "center">
<audio controls autoplay src="samples/260-123286-0000.flac"></audio>
<a href="samples/260-123286-0000.flac">
</a>
</td>
<td>Life was like a box of chocolates, you never know what you're gonna get.</td>
<td align = "center">
<audio controls autoplay src="demo_results/text1/260-123286-0000_syn_1.0.wav"></audio>
<a href="demo_results/text1/260-123286-0000_syn_1.0.wav">
</a>
</td>
</tr>
<tr>
<td>In 2014, P&G recorded $83.1 billion in sales. On August 1, 2014, P&G announced it was streamlining the company, dropping and selling off around 100 brands from its product portfolio in order to focus on the remaining 65 brands, which produced 95% of the company's profits.</td>
<td align = "center">
<audio controls autoplay src="demo_results/text2/260-123286-0000_syn_1.0.wav"></audio>
<a href="demo_results/text2/260-123286-0000_syn_1.0.wav">
</a>
</td>
</tr>
<tr>
<td>Mechanics is a branch of physics that deals with the behavior of physical bodies under the influence of various forces. The study of mechanics is important in understanding the behavior of machines, the motion of objects, and the principles of engineering. Mechanics has been an essential part of physics since ancient times and has continued to evolve with advancements in science and technology. This paper will discuss the principles of mechanics, the laws of motion, and the applications of mechanics in engineering and technology.</td>
<td align = "center">
<audio controls autoplay src="demo_results/text3/260-123286-0000_syn_0.97.wav"></audio>
<a href="demo_results/text3/260-123286-0000_syn_0.97.wav">
</a>
</td>
</tr>
<tr>
<td rowspan="3" align = "center">
<audio controls autoplay src="samples/1688-142285-0000.flac"></audio>
<a href="samples/1688-142285-0000.flac">
</a>
</td>
<td>Life was like a box of chocolates, you never know what you're gonna get.</td>
<td align = "center">
<audio controls autoplay src="demo_results/text1/1688-142285-0000_syn.wav"></audio>
<a href="demo_results/text1/1688-142285-0000_syn.wav">
</a>
</td>
</tr>
<tr>
<td>In 2014, P&G recorded $83.1 billion in sales. On August 1, 2014, P&G announced it was streamlining the company, dropping and selling off around 100 brands from its product portfolio in order to focus on the remaining 65 brands, which produced 95% of the company's profits.</td>
<td align = "center">
<audio controls autoplay src="demo_results/text2/1688-142285-0000_syn_0.77.wav"></audio>
<a href="demo_results/text2/1688-142285-0000_syn_0.77.wav">
</a>
</td>
</tr>
<tr>
<td>Mechanics is a branch of physics that deals with the behavior of physical bodies under the influence of various forces. The study of mechanics is important in understanding the behavior of machines, the motion of objects, and the principles of engineering. Mechanics has been an essential part of physics since ancient times and has continued to evolve with advancements in science and technology. This paper will discuss the principles of mechanics, the laws of motion, and the applications of mechanics in engineering and technology.</td>
<td align = "center">
<audio controls autoplay src="demo_results/text3/1688-142285-0000_syn.wav"></audio>
<a href="demo_results/text3/1688-142285-0000_syn.wav">
</a>
</td>
</tr>
<tr>
<td rowspan="3" align = "center">
<audio controls autoplay src="samples/4294-9934-0000.flac"></audio>
<a href="samples/4294-9934-0000.flac">
</a>
</td>
<td>Life was like a box of chocolates, you never know what you're gonna get.</td>
<td align = "center">
<audio controls autoplay src="demo_results/text1/4294-9934-0000_syn_0.98.wav"></audio>
<a href="demo_results/text1/4294-9934-0000_syn_0.98.wav">
</a>
</td>
</tr>
<tr>
<td>In 2014, P&G recorded $83.1 billion in sales. On August 1, 2014, P&G announced it was streamlining the company, dropping and selling off around 100 brands from its product portfolio in order to focus on the remaining 65 brands, which produced 95% of the company's profits.</td>
<td align = "center">
<audio controls autoplay src="demo_results/text2/4294-9934-0000_syn_0.78.wav"></audio>
<a href="demo_results/text2/4294-9934-0000_syn_0.78.wav">
</a>
</td>
</tr>
<tr>
<td>Mechanics is a branch of physics that deals with the behavior of physical bodies under the influence of various forces. The study of mechanics is important in understanding the behavior of machines, the motion of objects, and the principles of engineering. Mechanics has been an essential part of physics since ancient times and has continued to evolve with advancements in science and technology. This paper will discuss the principles of mechanics, the laws of motion, and the applications of mechanics in engineering and technology.</td>
<td align = "center">
<audio controls autoplay src="demo_results/text3/4294-9934-0000_syn_0.76.wav"></audio>
<a href="demo_results/text3/4294-9934-0000_syn_0.76.wav">
</a>
</td>
</tr>
<tr>
<td rowspan="3" align = "center">
<audio controls autoplay src="samples/7176-88083-0000.flac"></audio>
<a href="samples/7176-88083-0000.flac">
</a>
</td>
<td>Life was like a box of chocolates, you never know what you're gonna get.</td>
<td align = "center">
<audio controls autoplay src="demo_results/text1/7176-88083-0000_syn_1.13.wav"></audio>
<a href="demo_results/text1/7176-88083-0000_syn_1.13.wav">
</a>
</td>
</tr>
<tr>
<td>In 2014, P&G recorded $83.1 billion in sales. On August 1, 2014, P&G announced it was streamlining the company, dropping and selling off around 100 brands from its product portfolio in order to focus on the remaining 65 brands, which produced 95% of the company's profits.</td>
<td align = "center">
<audio controls autoplay src="demo_results/text2/7176-88083-0000_syn_0.76.wav"></audio>
<a href="demo_results/text2/7176-88083-0000_syn_0.76.wav">
</a>
</td>
</tr>
<tr>
<td>Mechanics is a branch of physics that deals with the behavior of physical bodies under the influence of various forces. The study of mechanics is important in understanding the behavior of machines, the motion of objects, and the principles of engineering. Mechanics has been an essential part of physics since ancient times and has continued to evolve with advancements in science and technology. This paper will discuss the principles of mechanics, the laws of motion, and the applications of mechanics in engineering and technology.</td>
<td align = "center">
<audio controls autoplay src="demo_results/text3/7176-88083-0000_syn_0.8.wav"></audio>
<a href="demo_results/text3/7176-88083-0000_syn_0.8.wav">
</a>
</td>
</tr>
</tbody>
</table>
</div>


@@ -1,7 +1,7 @@
import argparse
from ctypes import alignment
import os
# os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
from pathlib import Path
import spacy
import time
@@ -135,190 +135,191 @@ if __name__ == '__main__':
weight = arg_dict["weight"] # weight of the user's voice in the voice-beautification blend
amp = 1
# try:
# Get the reference audio filepath
# enter the number of reference audios
message1 = "Please enter the number of reference audios:\n"
num_of_input_audio = int(input(message1))
# num_of_input_audio = 1
while True:
# try:
# Get the reference audio filepath
# enter the number of reference audios
message1 = "Please enter the number of reference audios:\n"
num_of_input_audio = int(input(message1))
# num_of_input_audio = 1
for i in range(num_of_input_audio):
# Computing the embedding
# First, we load the wav using the function that the speaker encoder provides. This is
# important: there is preprocessing that must be applied.
for i in range(num_of_input_audio):
# Computing the embedding
# First, we load the wav using the function that the speaker encoder provides. This is
# important: there is preprocessing that must be applied.
# The following two methods are equivalent:
# - Directly load from the filepath:
# preprocessed_wav = encoder.preprocess_wav(in_fpath)
# - If the wav is already loaded:
# The following two methods are equivalent:
# - Directly load from the filepath:
# preprocessed_wav = encoder.preprocess_wav(in_fpath)
# - If the wav is already loaded:
# get duration info from input audio
message2 = "Reference voice: enter an audio folder of a voice to be cloned (mp3, " \
f"wav, m4a, flac, ...):({i+1}/{num_of_input_audio})\n"
in_fpath = Path(input(message2).replace("\"", "").replace("\'", ""))
# get duration info from input audio
message2 = "Reference voice: enter an audio folder of a voice to be cloned (mp3, " \
f"wav, m4a, flac, ...):({i+1}/{num_of_input_audio})\n"
in_fpath = Path(input(message2).replace("\"", "").replace("\'", ""))
fpath_without_ext = os.path.splitext(str(in_fpath))[0]
speaker_name = os.path.normpath(fpath_without_ext).split(os.sep)[-1]
fpath_without_ext = os.path.splitext(str(in_fpath))[0]
speaker_name = os.path.normpath(fpath_without_ext).split(os.sep)[-1]
is_wav_file, single_wav, wav_path = TransFormat(in_fpath, 'wav')
is_wav_file, single_wav, wav_path = TransFormat(in_fpath, 'wav')
if not is_wav_file:
os.remove(wav_path) # remove intermediate wav files
# merge
if i == 0:
wav = single_wav
if not is_wav_file:
os.remove(wav_path) # remove intermediate wav files
# merge
if i == 0:
wav = single_wav
else:
wav = np.append(wav, single_wav)
# write to disk
path_ori, _ = os.path.split(wav_path)
file_ori = 'temp.wav'
fpath = os.path.join(path_ori, file_ori)
sf.write(fpath, wav, samplerate=encoder.params_data.sampling_rate)
# adjust the speed
totDur_ori, nPause_ori, arDur_ori, nSyl_ori, arRate_ori = AudioAnalysis(path_ori, file_ori)
DelFile(path_ori, '.TextGrid')
os.remove(fpath)
preprocessed_wav = encoder.inference.preprocess_wav(wav)
print("Loaded input audio file successfully")
# Then we derive the embedding. There are many functions and parameters that the
# speaker encoder interfaces. These are mostly for in-depth research. You will typically
# only use this function (with its default parameters):
input_embed = encoder.inference.embed_utterance(preprocessed_wav)
# Choose standard audio
fft_max_freq = vocoder.get_dominant_freq(preprocessed_wav)
print(f"\nthe dominant frequency of input audio is {fft_max_freq}Hz")
if fft_max_freq < encoder.params_data.split_freq:
vocoder.hp.sex = 1
standard_fpath = "standard_audios/male_1.wav"
else:
wav = np.append(wav, single_wav)
# write to disk
path_ori, _ = os.path.split(wav_path)
file_ori = 'temp.wav'
fpath = os.path.join(path_ori, file_ori)
sf.write(fpath, wav, samplerate=encoder.params_data.sampling_rate)
vocoder.hp.sex = 0
standard_fpath = "standard_audios/female_1.wav"
# adjust the speed
totDur_ori, nPause_ori, arDur_ori, nSyl_ori, arRate_ori = AudioAnalysis(path_ori, file_ori)
DelFile(path_ori, '.TextGrid')
os.remove(fpath)
if os.path.exists(standard_fpath):
standard_wav = Synthesizer_infer.load_preprocess_wav(standard_fpath)
preprocessed_standard_wav = encoder.inference.preprocess_wav(standard_wav)
print("Loaded standard audio file successfully")
preprocessed_wav = encoder.inference.preprocess_wav(wav)
standard_embed = encoder.inference.embed_utterance(preprocessed_standard_wav)
print("Loaded input audio file successfully")
embed1=np.copy(input_embed).dot(weight)
embed2=np.copy(standard_embed).dot(1 - weight)
embed=embed1+embed2
else:
embed = np.copy(input_embed)
# Then we derive the embedding. There are many functions and parameters that the
# speaker encoder interfaces. These are mostly for in-depth research. You will typically
# only use this function (with its default parameters):
input_embed = encoder.inference.embed_utterance(preprocessed_wav)
# Choose standard audio
embed[embed < encoder.params_data.set_zero_thres]=0 # set noise-level values to zero
embed = embed * amp
fft_max_freq = vocoder.get_dominant_freq(preprocessed_wav)
print(f"\nthe dominant frequency of input audio is {fft_max_freq}Hz")
if fft_max_freq < encoder.params_data.split_freq:
vocoder.hp.sex = 1
standard_fpath = "standard_audios/male_1.wav"
else:
vocoder.hp.sex = 0
standard_fpath = "standard_audios/female_1.wav"
start_syn = time.time()
# Generating the spectrogram
text = input("Write a sentence to be synthesized:\n")
if os.path.exists(standard_fpath):
# If seed is specified, reset torch seed and force synthesizer reload
if args.seed is not None:
torch.manual_seed(args.seed)
synthesizer = Synthesizer_infer(args.syn_model_fpath)
# The synthesizer works in batch, so you need to put your data in a list or numpy array
def preprocess_text(text):
text = add_breaks(text)
text = english_cleaners_predict(text)
texts = [i.text.strip() for i in nlp(text).sents] # split paragraph to sentences
return texts
texts = preprocess_text(text)
print(f"the list of input texts:\n{texts}")
embeds = [embed] * len(texts)
specs, alignments, stop_tokens = synthesizer.synthesize_spectrograms(texts, embeds, require_visualization=True)
breaks = [spec.shape[1] for spec in specs]
spec = np.concatenate(specs, axis=1)
standard_wav = Synthesizer_infer.load_preprocess_wav(standard_fpath)
preprocessed_standard_wav = encoder.inference.preprocess_wav(standard_wav)
print("Loaded standard audio file successfully")
standard_embed = encoder.inference.embed_utterance(preprocessed_standard_wav)
## Save synthesizer visualization results
if not os.path.exists("syn_results"):
os.mkdir("syn_results")
# save_attention_multiple(alignments, "syn_results/attention")
# save_stop_tokens(stop_tokens, "syn_results/stop_tokens")
# save_spectrogram(spec, "syn_results/mel")
print("Created the mel spectrogram")
embed1=np.copy(input_embed).dot(weight)
embed2=np.copy(standard_embed).dot(1 - weight)
embed=embed1+embed2
else:
embed = np.copy(input_embed)
end_syn = time.time()
print(f"Prediction time of synthesizer is {end_syn - start_syn}s")
embed[embed < encoder.params_data.set_zero_thres]=0 # set noise-level values to zero
embed = embed * amp
start_voc = time.time()
## Generating the waveform
print("Synthesizing the waveform:")
start_syn = time.time()
# Generating the spectrogram
text = input("Write a sentence to be synthesized:\n")
# If seed is specified, reset torch seed and reload vocoder
if args.seed is not None:
torch.manual_seed(args.seed)
vocoder.load_model(args.voc_model_fpath)
# If seed is specified, reset torch seed and force synthesizer reload
if args.seed is not None:
torch.manual_seed(args.seed)
synthesizer = Synthesizer_infer(args.syn_model_fpath)
# Synthesizing the waveform is fairly straightforward. Remember that the longer the
# spectrogram, the more time-efficient the vocoder.
if not args.griffin_lim:
wav = vocoder.infer_waveform(spec, target=vocoder.hp.voc_target, overlap=vocoder.hp.voc_overlap, crossfade=vocoder.hp.is_crossfade)
else:
wav = Synthesizer_infer.griffin_lim(spec)
# The synthesizer works in batch, so you need to put your data in a list or numpy array
def preprocess_text(text):
text = add_breaks(text)
text = english_cleaners_predict(text)
texts = [i.text.strip() for i in nlp(text).sents] # split paragraph to sentences
return texts
end_voc = time.time()
print(f"Prediction time of vocoder is {end_voc - start_voc}s")
print(f"Prediction time of TTS is {end_voc - start_syn}s")
texts = preprocess_text(text)
print(f"the list of input texts:\n{texts}")
# Add breaks
b_ends = np.cumsum(np.array(breaks) * Synthesizer_infer.hparams.hop_size)
b_starts = np.concatenate(([0], b_ends[:-1]))
wavs = [wav[start:end] for start, end, in zip(b_starts, b_ends)]
breaks = [np.zeros(int(0.15 * Synthesizer_infer.sample_rate))] * len(breaks)
wav = np.concatenate([i for w, b in zip(wavs, breaks) for i in (w, b)])
embeds = [embed] * len(texts)
specs, alignments, stop_tokens = synthesizer.synthesize_spectrograms(texts, embeds, require_visualization=True)
# Trim excess silences to compensate for gaps in spectrograms (issue #53)
# generated_wav = encoder.inference.preprocess_wav(wav)
wav = wav / np.abs(wav).max() * 4
breaks = [spec.shape[1] for spec in specs]
spec = np.concatenate(specs, axis=1)
# Save it on the disk
# filename = "demo_output_%02d.wav" % num_generated
if not os.path.exists("out_audios"):
os.mkdir("out_audios")
dir_path = os.path.dirname(os.path.realpath(__file__)) # current dir
filename = os.path.join(dir_path, f"out_audios/{speaker_name}_syn.wav")
# print(wav.dtype)
sf.write(filename, wav.astype(np.float32), synthesizer.sample_rate)
num_generated += 1
print("\nSaved output (speed not yet adjusted) as %s\n\n" % filename)
## Save synthesizer visualization results
if not os.path.exists("syn_results"):
os.mkdir("syn_results")
# save_attention_multiple(alignments, "syn_results/attention")
# save_stop_tokens(stop_tokens, "syn_results/stop_tokens")
# save_spectrogram(spec, "syn_results/mel")
print("Created the mel spectrogram")
end_syn = time.time()
print(f"Prediction time of synthesizer is {end_syn - start_syn}s")
start_voc = time.time()
## Generating the waveform
print("Synthesizing the waveform:")
# If seed is specified, reset torch seed and reload vocoder
if args.seed is not None:
torch.manual_seed(args.seed)
vocoder.load_model(args.voc_model_fpath)
# Synthesizing the waveform is fairly straightforward. Remember that the longer the
# spectrogram, the more time-efficient the vocoder.
if not args.griffin_lim:
wav = vocoder.infer_waveform(spec, target=vocoder.hp.voc_target, overlap=vocoder.hp.voc_overlap, crossfade=vocoder.hp.is_crossfade)
else:
wav = Synthesizer_infer.griffin_lim(spec)
end_voc = time.time()
print(f"Prediction time of vocoder is {end_voc - start_voc}s")
print(f"Prediction time of TTS is {end_voc - start_syn}s")
# Add breaks
b_ends = np.cumsum(np.array(breaks) * Synthesizer_infer.hparams.hop_size)
b_starts = np.concatenate(([0], b_ends[:-1]))
wavs = [wav[start:end] for start, end, in zip(b_starts, b_ends)]
breaks = [np.zeros(int(0.15 * Synthesizer_infer.sample_rate))] * len(breaks)
wav = np.concatenate([i for w, b in zip(wavs, breaks) for i in (w, b)])
# Trim excess silences to compensate for gaps in spectrograms (issue #53)
# generated_wav = encoder.inference.preprocess_wav(wav)
wav = wav / np.abs(wav).max() * 4
# Save it on the disk
# filename = "demo_output_%02d.wav" % num_generated
if not os.path.exists("out_audios"):
os.mkdir("out_audios")
dir_path = os.path.dirname(os.path.realpath(__file__)) # current dir
filename = os.path.join(dir_path, f"out_audios/{speaker_name}_syn.wav")
# print(wav.dtype)
sf.write(filename, wav.astype(np.float32), synthesizer.sample_rate)
num_generated += 1
print("\nSaved output (speed not yet adjusted) as %s\n\n" % filename)
# Fix Speed(generate new audio)
fix_file = work(totDur_ori,
nPause_ori,
arDur_ori,
nSyl_ori,
arRate_ori,
filename)
print(f"\nSaved output (fixed speed) as {fix_file}\n\n")
# Fix Speed(generate new audio)
fix_file = work(totDur_ori,
nPause_ori,
arDur_ori,
nSyl_ori,
arRate_ori,
filename)
print(f"\nSaved output (fixed speed) as {fix_file}\n\n")
# # Play the audio (non-blocking)
# if not args.no_sound:
# import sounddevice as sd
# try:
# sd.stop()
# sd.play(wav, synthesizer.sample_rate)
# except sd.PortAudioError as e:
# print("\nCaught exception: %s" % repr(e))
# print("Continuing without audio playback. Suppress this message with the \"--no_sound\" flag.\n")
# except:
# raise
# # Play the audio (non-blocking)
# if not args.no_sound:
# import sounddevice as sd
# try:
# sd.stop()
# sd.play(wav, synthesizer.sample_rate)
# except sd.PortAudioError as e:
# print("\nCaught exception: %s" % repr(e))
# print("Continuing without audio playback. Suppress this message with the \"--no_sound\" flag.\n")
# except:
# raise
# except Exception as e:
# print("Caught exception: %s" % repr(e))
# print("Restarting\n")
# except Exception as e:
# print("Caught exception: %s" % repr(e))
# print("Restarting\n")
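
The speaker-embedding step in the script above mixes the reference-voice embedding with a standard-voice embedding using `weight`, zeroes components below `set_zero_thres`, and scales the result by `amp`. A minimal sketch of that arithmetic follows, written with plain Python lists instead of the NumPy arrays the script uses. The function name `blend_embeddings` and all sample values are hypothetical and chosen only for illustration.

```python
def blend_embeddings(input_embed, standard_embed, weight, zero_thres, amp=1.0):
    """Blend two speaker embeddings, suppress small components, then scale.

    weight     -- share of the reference (user) voice, between 0 and 1
    zero_thres -- components below this value are treated as noise and set to 0
    amp        -- overall gain applied after thresholding
    """
    # Convex combination: weight * input + (1 - weight) * standard
    mixed = [weight * a + (1.0 - weight) * b
             for a, b in zip(input_embed, standard_embed)]
    # Zero out noise-level components, scale the remaining ones
    return [0.0 if v < zero_thres else v * amp for v in mixed]


if __name__ == "__main__":
    # Hypothetical 3-dimensional embeddings; real ones are much longer
    user = [0.2, 0.0, 0.6]
    standard = [0.4, 0.02, 0.0]
    print(blend_embeddings(user, standard, weight=0.5, zero_thres=0.05))
```

With `weight` set to 1 the result depends only on the reference voice, which corresponds to the script's fallback branch when no standard audio file is found.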

Binary file not shown.



@@ -0,0 +1 @@
Life was like a box of chocolates, you never know what you're gonna get.

Binary file not shown.



@@ -0,0 +1 @@
In 2014, P&G recorded $83.1 billion in sales. On August 1, 2014, P&G announced it was streamlining the company, dropping and selling off around 100 brands from its product portfolio in order to focus on the remaining 65 brands, which produced 95% of the company's profits.

Binary file not shown.



@@ -0,0 +1 @@
Mechanics is a branch of physics that deals with the behavior of physical bodies under the influence of various forces. The study of mechanics is important in understanding the behavior of machines, the motion of objects, and the principles of engineering. Mechanics has been an essential part of physics since ancient times and has continued to evolve with advancements in science and technology. This paper will discuss the principles of mechanics, the laws of motion, and the applications of mechanics in engineering and technology.

Binary file not shown.


BIN
samples/4294-9934-0000.flac Normal file

Binary file not shown.


24
samples/README.md Executable file → Normal file

@@ -1,22 +1,2 @@
The audio files in this folder are provided for toolbox testing and
benchmarking purposes. These are the same reference utterances
used by the SV2TTS authors to generate the audio samples located at:
https://google.github.io/tacotron/publications/speaker_adaptation/index.html
The `p240_00000.mp3` and `p260_00000.mp3` files are compressed
versions of audios from the VCTK corpus available at:
https://datashare.is.ed.ac.uk/handle/10283/3443
VCTK.txt contains the copyright notices and licensing information.
The `1320_00000.mp3`, `3575_00000.mp3`, `6829_00000.mp3`
and `8230_00000.mp3` files are compressed versions of audios
from the LibriSpeech dataset available at: https://openslr.org/12
For these files, the following notice applies:
```
LibriSpeech (c) 2014 by Vassil Panayotov
LibriSpeech ASR corpus is licensed under a
Creative Commons Attribution 4.0 International License.
See <http://creativecommons.org/licenses/by/4.0/>.
```
260-123286-0000.flac and 7176-88083-0000.flac are from LibriSpeech test-clean.
1688-142285-0000.flac and 4294-9934-0000.flac are from LibriSpeech test-other.


@@ -1,94 +0,0 @@
---------------------------------------------------------------------
CSTR VCTK Corpus
English Multi-speaker Corpus for CSTR Voice Cloning Toolkit
(Version 0.92)
RELEASE September 2019
The Centre for Speech Technology Research
University of Edinburgh
Copyright (c) 2019
Junichi Yamagishi
jyamagis@inf.ed.ac.uk
---------------------------------------------------------------------
Overview
This CSTR VCTK Corpus includes speech data uttered by 110 English
speakers with various accents. Each speaker reads out about 400
sentences, which were selected from a newspaper, the rainbow passage
and an elicitation paragraph used for the speech accent archive.
The newspaper texts were taken from Herald Glasgow, with permission
from Herald & Times Group. Each speaker has a different set of the
newspaper texts selected based on a greedy algorithm that increases the
contextual and phonetic coverage. The details of the text selection
algorithms are described in the following paper:
C. Veaux, J. Yamagishi and S. King,
"The voice bank corpus: Design, collection and data analysis of
a large regional accent speech database,"
https://doi.org/10.1109/ICSDA.2013.6709856
The rainbow passage and elicitation paragraph are the same for all
speakers. The rainbow passage can be found at International Dialects
of English Archive:
(http://web.ku.edu/~idea/readings/rainbow.htm). The elicitation
paragraph is identical to the one used for the speech accent archive
(http://accent.gmu.edu). The details of the speech accent archive
can be found at
http://www.ualberta.ca/~aacl2009/PDFs/WeinbergerKunath2009AACL.pdf
All speech data was recorded using an identical recording setup: an
omni-directional microphone (DPA 4035) and a small diaphragm condenser
microphone with very wide bandwidth (Sennheiser MKH 800), 96kHz
sampling frequency at 24 bits and in a hemi-anechoic chamber of
the University of Edinburgh. (However, two speakers, p280 and p315
had technical issues of the audio recordings using MKH 800).
All recordings were converted into 16 bits, were downsampled to
48 kHz, and were manually end-pointed.
This corpus was originally aimed for HMM-based text-to-speech synthesis
systems, especially for speaker-adaptive HMM-based speech synthesis
that uses average voice models trained on multiple speakers and speaker
adaptation technologies. This corpus is also suitable for DNN-based
multi-speaker text-to-speech synthesis systems and waveform modeling.
COPYING
This corpus is licensed under the Creative Commons License: Attribution 4.0 International
http://creativecommons.org/licenses/by/4.0/legalcode
VCTK VARIANTS
There are several variants of the VCTK corpus:
Speech enhancement
- Noisy speech database for training speech enhancement algorithms and TTS models where we added various types of noises to VCTK artificially: http://dx.doi.org/10.7488/ds/2117
- Reverberant speech database for training speech dereverberation algorithms and TTS models where we added various types of reverberation to VCTK artificially http://dx.doi.org/10.7488/ds/1425
- Noisy reverberant speech database for training speech enhancement algorithms and TTS models http://dx.doi.org/10.7488/ds/2139
- Device Recorded VCTK where speech signals of the VCTK corpus were played back and re-recorded in office environments using relatively inexpensive consumer devices http://dx.doi.org/10.7488/ds/2316
- The Microsoft Scalable Noisy Speech Dataset (MS-SNSD) https://github.com/microsoft/MS-SNSD
ASV and anti-spoofing
- Spoofing and Anti-Spoofing (SAS) corpus, which is a collection of synthetic speech signals produced by nine techniques, two of which are speech synthesis, and seven are voice conversion. All of them were built using the VCTK corpus. http://dx.doi.org/10.7488/ds/252
- Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof 2015) Database. This database consists of synthetic speech signals produced by ten techniques and this has been used in the first Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof 2015) http://dx.doi.org/10.7488/ds/298
- ASVspoof 2019: The 3rd Automatic Speaker Verification Spoofing and Countermeasures Challenge database. This database has been used in the 3rd Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof 2019) https://doi.org/10.7488/ds/2555
ACKNOWLEDGEMENTS
The CSTR VCTK Corpus was constructed by:
Christophe Veaux (University of Edinburgh)
Junichi Yamagishi (University of Edinburgh)
Kirsten MacDonald
The research leading to these results was partly funded from EPSRC
grants EP/I031022/1 (NST) and EP/J002526/1 (CAF), from the RSE-NSFC
grant (61111130120), and from the JST CREST (uDialogue).
Please cite this corpus as follows:
Christophe Veaux, Junichi Yamagishi, Kirsten MacDonald,
"CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit",
The Centre for Speech Technology Research (CSTR),
University of Edinburgh

Binary file not shown.



@@ -458,7 +458,7 @@ class Tacotron(nn.Module):
if t == 0:
first_stop_token = stop_tokens[0]
# Stop the loop when all stop tokens in batch exceed threshold compared with the 1st token and the sequence's length exceeds threshold
if (stop_tokens > first_stop_token * 2e3).all() and t > (20 * self.r): break
if (stop_tokens > first_stop_token * 8e3).all() and t > (20 * self.r): break
# if (stop_tokens > 0.5).all() and t > (20 * self.r): break
if torch.cuda.is_available():
torch.cuda.empty_cache()
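
The stop condition changed in this hunk can be read as a standalone predicate: decoding ends once every stop token in the batch exceeds a multiple of the first step's stop token and a minimum number of steps has passed. The sketch below restates it with plain Python lists rather than tensors. The name `should_stop`, the scalar baseline, and the sample values are assumptions made for illustration; only the factors `2e3` and `8e3` and the `20 * r` minimum come from the diff.

```python
def should_stop(stop_tokens, first_stop_token, t, r, ratio=8e3, min_steps=20):
    """Return True when decoding should end at step t.

    stop_tokens      -- current stop-token value of each sequence in the batch
    first_stop_token -- baseline value recorded at step 0 (assumed scalar here)
    r                -- reduction factor (frames produced per decoder step)
    ratio            -- how far above the baseline every token must rise
    """
    long_enough = t > min_steps * r
    all_exceeded = all(s > first_stop_token * ratio for s in stop_tokens)
    return all_exceeded and long_enough


if __name__ == "__main__":
    batch = [0.9, 0.5]  # hypothetical stop-token values
    print(should_stop(batch, 1e-4, t=50, r=2, ratio=2e3))  # old factor
    print(should_stop(batch, 1e-4, t=50, r=2, ratio=8e3))  # new factor
```

Raising the factor from `2e3` to `8e3` makes the condition harder to meet, so under these assumptions generation runs longer before it is cut off.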