mirror of
https://github.com/liuhaozhe6788/voice-cloning-collab.git
synced 2026-05-18 05:04:51 +02:00
add demo results
131
README.md
@@ -129,4 +129,133 @@ The results are saved in dim_reduction_results/.
|
||||
You can download the pretrained model from [this link](https://drive.google.com/drive/folders/19fhjjAbWq60zv1Bl6Y51snGbG1r5kaN2) and extract it to saved_models/20230609
|
||||
|
||||
## Demo results
|
||||
coming soon
|
||||
<div align = "center">
|
||||
<table style="width:100%">
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Reference Audio</th>
|
||||
<th>Input Text</th>
|
||||
<th>Synthetic Audio</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td rowspan="3" align = "center">
|
||||
<audio controls autoplay src="samples/260-123286-0000.flac"></audio>
|
||||
<a href="samples/260-123286-0000.flac">
|
||||
</a>
|
||||
</td>
|
||||
<td>Life was like a box of chocolates, you never know what you're gonna get.</td>
|
||||
<td align = "center">
|
||||
<audio controls autoplay src="demo_results/text1/260-123286-0000_syn_1.0.wav"></audio>
|
||||
<a href="demo_results/text1/260-123286-0000_syn_1.0.wav">
|
||||
</a>
|
||||
</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>In 2014, P&G recorded $83.1 billion in sales. On August 1, 2014, P&G announced it was streamlining the company, dropping and selling off around 100 brands from its product portfolio in order to focus on the remaining 65 brands, which produced 95% of the company's profits.</td>
|
||||
<td align = "center">
|
||||
<audio controls autoplay src="demo_results/text2/260-123286-0000_syn_1.0.wav"></audio>
|
||||
<a href="demo_results/text2/260-123286-0000_syn_1.0.wav">
|
||||
</a>
|
||||
</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Mechanics is a branch of physics that deals with the behavior of physical bodies under the influence of various forces. The study of mechanics is important in understanding the behavior of machines, the motion of objects, and the principles of engineering. Mechanics has been an essential part of physics since ancient times and has continued to evolve with advancements in science and technology. This paper will discuss the principles of mechanics, the laws of motion, and the applications of mechanics in engineering and technology.</td>
|
||||
<td align = "center">
|
||||
<audio controls autoplay src="demo_results/text3/260-123286-0000_syn_0.97.wav"></audio>
|
||||
<a href="demo_results/text3/260-123286-0000_syn_0.97.wav">
|
||||
</a>
|
||||
</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td rowspan="3" align = "center">
|
||||
<audio controls autoplay src="samples/1688-142285-0000.flac"></audio>
|
||||
<a href="samples/1688-142285-0000.flac">
|
||||
</a>
|
||||
</td>
|
||||
<td>Life was like a box of chocolates, you never know what you're gonna get.</td>
|
||||
<td align = "center">
|
||||
<audio controls autoplay src="demo_results/text1/1688-142285-0000_syn.wav"></audio>
|
||||
<a href="demo_results/text1/1688-142285-0000_syn.wav">
|
||||
</a>
|
||||
</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>In 2014, P&G recorded $83.1 billion in sales. On August 1, 2014, P&G announced it was streamlining the company, dropping and selling off around 100 brands from its product portfolio in order to focus on the remaining 65 brands, which produced 95% of the company's profits.</td>
|
||||
<td align = "center">
|
||||
<audio controls autoplay src="demo_results/text2/1688-142285-0000_syn_0.77.wav"></audio>
|
||||
<a href="demo_results/text2/1688-142285-0000_syn_0.77.wav">
|
||||
</a>
|
||||
</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Mechanics is a branch of physics that deals with the behavior of physical bodies under the influence of various forces. The study of mechanics is important in understanding the behavior of machines, the motion of objects, and the principles of engineering. Mechanics has been an essential part of physics since ancient times and has continued to evolve with advancements in science and technology. This paper will discuss the principles of mechanics, the laws of motion, and the applications of mechanics in engineering and technology.</td>
|
||||
<td align = "center">
|
||||
<audio controls autoplay src="demo_results/text3/1688-142285-0000_syn.wav"></audio>
|
||||
<a href="demo_results/text3/1688-142285-0000_syn.wav">
|
||||
</a>
|
||||
</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td rowspan="3" align = "center">
|
||||
<audio controls autoplay src="samples/4294-9934-0000.flac"></audio>
|
||||
<a href="samples/4294-9934-0000.flac">
|
||||
</a>
|
||||
</td>
|
||||
<td>Life was like a box of chocolates, you never know what you're gonna get.</td>
|
||||
<td align = "center">
|
||||
<audio controls autoplay src="demo_results/text1/4294-9934-0000_syn_0.98.wav"></audio>
|
||||
<a href="demo_results/text1/4294-9934-0000_syn_0.98.wav">
|
||||
</a>
|
||||
</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>In 2014, P&G recorded $83.1 billion in sales. On August 1, 2014, P&G announced it was streamlining the company, dropping and selling off around 100 brands from its product portfolio in order to focus on the remaining 65 brands, which produced 95% of the company's profits.</td>
|
||||
<td align = "center">
|
||||
<audio controls autoplay src="demo_results/text2/4294-9934-0000_syn_0.78.wav"></audio>
|
||||
<a href="demo_results/text2/4294-9934-0000_syn_0.78.wav">
|
||||
</a>
|
||||
</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Mechanics is a branch of physics that deals with the behavior of physical bodies under the influence of various forces. The study of mechanics is important in understanding the behavior of machines, the motion of objects, and the principles of engineering. Mechanics has been an essential part of physics since ancient times and has continued to evolve with advancements in science and technology. This paper will discuss the principles of mechanics, the laws of motion, and the applications of mechanics in engineering and technology.</td>
|
||||
<td align = "center">
|
||||
<audio controls autoplay src="demo_results/text3/4294-9934-0000_syn_0.76.wav"></audio>
|
||||
<a href="demo_results/text3/4294-9934-0000_syn_0.76.wav">
|
||||
</a>
|
||||
</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td rowspan="3" align = "center">
|
||||
<audio controls autoplay src="samples/7176-88083-0000.flac"></audio>
|
||||
<a href="samples/7176-88083-0000.flac">
|
||||
</a>
|
||||
</td>
|
||||
<td>Life was like a box of chocolates, you never know what you're gonna get.</td>
|
||||
<td align = "center">
|
||||
<audio controls autoplay src="demo_results/text1/7176-88083-0000_syn_1.13.wav"></audio>
|
||||
<a href="demo_results/text1/7176-88083-0000_syn_1.13.wav">
|
||||
</a>
|
||||
</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>In 2014, P&G recorded $83.1 billion in sales. On August 1, 2014, P&G announced it was streamlining the company, dropping and selling off around 100 brands from its product portfolio in order to focus on the remaining 65 brands, which produced 95% of the company's profits.</td>
|
||||
<td align = "center">
|
||||
<audio controls autoplay src="demo_results/text2/7176-88083-0000_syn_0.76.wav"></audio>
|
||||
<a href="demo_results/text2/7176-88083-0000_syn_0.76.wav">
|
||||
</a>
|
||||
</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Mechanics is a branch of physics that deals with the behavior of physical bodies under the influence of various forces. The study of mechanics is important in understanding the behavior of machines, the motion of objects, and the principles of engineering. Mechanics has been an essential part of physics since ancient times and has continued to evolve with advancements in science and technology. This paper will discuss the principles of mechanics, the laws of motion, and the applications of mechanics in engineering and technology.</td>
|
||||
<td align = "center">
|
||||
<audio controls autoplay src="demo_results/text3/7176-88083-0000_syn_0.8.wav"></audio>
|
||||
<a href="demo_results/text3/7176-88083-0000_syn_0.8.wav">
|
||||
</a>
|
||||
</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
</div>
|
||||
|
||||
321
demo_cli.py
@@ -1,7 +1,7 @@
|
||||
import argparse
|
||||
from ctypes import alignment
|
||||
import os
|
||||
# os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
|
||||
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
|
||||
from pathlib import Path
|
||||
import spacy
|
||||
import time
|
||||
@@ -135,190 +135,191 @@ if __name__ == '__main__':
|
||||
weight = arg_dict["weight"] # weight of the user's voice for voice beautification
|
||||
amp = 1
|
||||
|
||||
# try:
|
||||
# Get the reference audio filepath
|
||||
# enter the number of reference audios
|
||||
message1 = "Please enter the number of reference audios:\n"
|
||||
num_of_input_audio = int(input(message1))
|
||||
# num_of_input_audio = 1
|
||||
while True:
|
||||
# try:
|
||||
# Get the reference audio filepath
|
||||
# enter the number of reference audios
|
||||
message1 = "Please enter the number of reference audios:\n"
|
||||
num_of_input_audio = int(input(message1))
|
||||
# num_of_input_audio = 1
|
||||
|
||||
for i in range(num_of_input_audio):
|
||||
# Computing the embedding
|
||||
# First, we load the wav using the function that the speaker encoder provides. This is
|
||||
# important: there is preprocessing that must be applied.
|
||||
for i in range(num_of_input_audio):
|
||||
# Computing the embedding
|
||||
# First, we load the wav using the function that the speaker encoder provides. This is
|
||||
# important: there is preprocessing that must be applied.
|
||||
|
||||
# The following two methods are equivalent:
|
||||
# - Directly load from the filepath:
|
||||
# preprocessed_wav = encoder.preprocess_wav(in_fpath)
|
||||
# - If the wav is already loaded:
|
||||
# The following two methods are equivalent:
|
||||
# - Directly load from the filepath:
|
||||
# preprocessed_wav = encoder.preprocess_wav(in_fpath)
|
||||
# - If the wav is already loaded:
|
||||
|
||||
# get duration info from input audio
|
||||
message2 = "Reference voice: enter an audio folder of a voice to be cloned (mp3, " \
|
||||
f"wav, m4a, flac, ...):({i+1}/{num_of_input_audio})\n"
|
||||
in_fpath = Path(input(message2).replace("\"", "").replace("\'", ""))
|
||||
# get duration info from input audio
|
||||
message2 = "Reference voice: enter an audio folder of a voice to be cloned (mp3, " \
|
||||
f"wav, m4a, flac, ...):({i+1}/{num_of_input_audio})\n"
|
||||
in_fpath = Path(input(message2).replace("\"", "").replace("\'", ""))
|
||||
|
||||
fpath_without_ext = os.path.splitext(str(in_fpath))[0]
|
||||
speaker_name = os.path.normpath(fpath_without_ext).split(os.sep)[-1]
|
||||
fpath_without_ext = os.path.splitext(str(in_fpath))[0]
|
||||
speaker_name = os.path.normpath(fpath_without_ext).split(os.sep)[-1]
|
||||
|
||||
is_wav_file, single_wav, wav_path = TransFormat(in_fpath, 'wav')
|
||||
is_wav_file, single_wav, wav_path = TransFormat(in_fpath, 'wav')
|
||||
|
||||
if not is_wav_file:
|
||||
os.remove(wav_path) # remove intermediate wav files
|
||||
# merge
|
||||
if i == 0:
|
||||
wav = single_wav
|
||||
if not is_wav_file:
|
||||
os.remove(wav_path) # remove intermediate wav files
|
||||
# merge
|
||||
if i == 0:
|
||||
wav = single_wav
|
||||
else:
|
||||
wav = np.append(wav, single_wav)
|
||||
# write to disk
|
||||
path_ori, _ = os.path.split(wav_path)
|
||||
file_ori = 'temp.wav'
|
||||
fpath = os.path.join(path_ori, file_ori)
|
||||
sf.write(fpath, wav, samplerate=encoder.params_data.sampling_rate)
|
||||
|
||||
# adjust the speed
|
||||
totDur_ori, nPause_ori, arDur_ori, nSyl_ori, arRate_ori = AudioAnalysis(path_ori, file_ori)
|
||||
DelFile(path_ori, '.TextGrid')
|
||||
os.remove(fpath)
|
||||
|
||||
preprocessed_wav = encoder.inference.preprocess_wav(wav)
|
||||
|
||||
print("Loaded input audio file successfully")
|
||||
|
||||
# Then we derive the embedding. There are many functions and parameters that the
|
||||
# speaker encoder interfaces. These are mostly for in-depth research. You will typically
|
||||
# only use this function (with its default parameters):
|
||||
input_embed = encoder.inference.embed_utterance(preprocessed_wav)
|
||||
# Choose standard audio
|
||||
|
||||
fft_max_freq = vocoder.get_dominant_freq(preprocessed_wav)
|
||||
print(f"\nthe dominant frequency of input audio is {fft_max_freq}Hz")
|
||||
if fft_max_freq < encoder.params_data.split_freq:
|
||||
vocoder.hp.sex = 1
|
||||
standard_fpath = "standard_audios/male_1.wav"
|
||||
else:
|
||||
wav = np.append(wav, single_wav)
|
||||
# write to disk
|
||||
path_ori, _ = os.path.split(wav_path)
|
||||
file_ori = 'temp.wav'
|
||||
fpath = os.path.join(path_ori, file_ori)
|
||||
sf.write(fpath, wav, samplerate=encoder.params_data.sampling_rate)
|
||||
vocoder.hp.sex = 0
|
||||
standard_fpath = "standard_audios/female_1.wav"
|
||||
|
||||
# adjust the speed
|
||||
totDur_ori, nPause_ori, arDur_ori, nSyl_ori, arRate_ori = AudioAnalysis(path_ori, file_ori)
|
||||
DelFile(path_ori, '.TextGrid')
|
||||
os.remove(fpath)
|
||||
if os.path.exists(standard_fpath):
|
||||
|
||||
standard_wav = Synthesizer_infer.load_preprocess_wav(standard_fpath)
|
||||
preprocessed_standard_wav = encoder.inference.preprocess_wav(standard_wav)
|
||||
print("Loaded standard audio file successfully")
|
||||
|
||||
preprocessed_wav = encoder.inference.preprocess_wav(wav)
|
||||
standard_embed = encoder.inference.embed_utterance(preprocessed_standard_wav)
|
||||
|
||||
print("Loaded input audio file successfully")
|
||||
embed1=np.copy(input_embed).dot(weight)
|
||||
embed2=np.copy(standard_embed).dot(1 - weight)
|
||||
embed=embed1+embed2
|
||||
else:
|
||||
embed = np.copy(input_embed)
|
||||
|
||||
# Then we derive the embedding. There are many functions and parameters that the
|
||||
# speaker encoder interfaces. These are mostly for in-depth research. You will typically
|
||||
# only use this function (with its default parameters):
|
||||
input_embed = encoder.inference.embed_utterance(preprocessed_wav)
|
||||
# Choose standard audio
|
||||
embed[embed < encoder.params_data.set_zero_thres]=0 # set noise values to zero
|
||||
embed = embed * amp
|
||||
|
||||
fft_max_freq = vocoder.get_dominant_freq(preprocessed_wav)
|
||||
print(f"\nthe dominant frequency of input audio is {fft_max_freq}Hz")
|
||||
if fft_max_freq < encoder.params_data.split_freq:
|
||||
vocoder.hp.sex = 1
|
||||
standard_fpath = "standard_audios/male_1.wav"
|
||||
else:
|
||||
vocoder.hp.sex = 0
|
||||
standard_fpath = "standard_audios/female_1.wav"
|
||||
start_syn = time.time()
|
||||
# Generating the spectrogram
|
||||
text = input("Write a sentence to be synthesized:\n")
|
||||
|
||||
if os.path.exists(standard_fpath):
|
||||
# If seed is specified, reset torch seed and force synthesizer reload
|
||||
if args.seed is not None:
|
||||
torch.manual_seed(args.seed)
|
||||
synthesizer = Synthesizer_infer(args.syn_model_fpath)
|
||||
|
||||
# The synthesizer works in batch, so you need to put your data in a list or numpy array
|
||||
def preprocess_text(text):
|
||||
text = add_breaks(text)
|
||||
text = english_cleaners_predict(text)
|
||||
texts = [i.text.strip() for i in nlp(text).sents] # split paragraph to sentences
|
||||
return texts
|
||||
|
||||
texts = preprocess_text(text)
|
||||
print(f"the list of input texts:\n{texts}")
|
||||
|
||||
embeds = [embed] * len(texts)
|
||||
specs, alignments, stop_tokens = synthesizer.synthesize_spectrograms(texts, embeds, require_visualization=True)
|
||||
|
||||
breaks = [spec.shape[1] for spec in specs]
|
||||
spec = np.concatenate(specs, axis=1)
|
||||
|
||||
standard_wav = Synthesizer_infer.load_preprocess_wav(standard_fpath)
|
||||
preprocessed_standard_wav = encoder.inference.preprocess_wav(standard_wav)
|
||||
print("Loaded standard audio file successfully")
|
||||
|
||||
standard_embed = encoder.inference.embed_utterance(preprocessed_standard_wav)
|
||||
## Save synthesizer visualization results
|
||||
if not os.path.exists("syn_results"):
|
||||
os.mkdir("syn_results")
|
||||
# save_attention_multiple(alignments, "syn_results/attention")
|
||||
# save_stop_tokens(stop_tokens, "syn_results/stop_tokens")
|
||||
# save_spectrogram(spec, "syn_results/mel")
|
||||
print("Created the mel spectrogram")
|
||||
|
||||
embed1=np.copy(input_embed).dot(weight)
|
||||
embed2=np.copy(standard_embed).dot(1 - weight)
|
||||
embed=embed1+embed2
|
||||
else:
|
||||
embed = np.copy(input_embed)
|
||||
end_syn = time.time()
|
||||
print(f"Prediction time of synthesizer is {end_syn - start_syn}s")
|
||||
|
||||
embed[embed < encoder.params_data.set_zero_thres]=0 # set noise values to zero
|
||||
embed = embed * amp
|
||||
start_voc = time.time()
|
||||
## Generating the waveform
|
||||
print("Synthesizing the waveform:")
|
||||
|
||||
start_syn = time.time()
|
||||
# Generating the spectrogram
|
||||
text = input("Write a sentence to be synthesized:\n")
|
||||
# If seed is specified, reset torch seed and reload vocoder
|
||||
if args.seed is not None:
|
||||
torch.manual_seed(args.seed)
|
||||
vocoder.load_model(args.voc_model_fpath)
|
||||
|
||||
# If seed is specified, reset torch seed and force synthesizer reload
|
||||
if args.seed is not None:
|
||||
torch.manual_seed(args.seed)
|
||||
synthesizer = Synthesizer_infer(args.syn_model_fpath)
|
||||
# Synthesizing the waveform is fairly straightforward. Remember that the longer the
|
||||
# spectrogram, the more time-efficient the vocoder.
|
||||
if not args.griffin_lim:
|
||||
wav = vocoder.infer_waveform(spec, target=vocoder.hp.voc_target, overlap=vocoder.hp.voc_overlap, crossfade=vocoder.hp.is_crossfade)
|
||||
else:
|
||||
wav = Synthesizer_infer.griffin_lim(spec)
|
||||
|
||||
# The synthesizer works in batch, so you need to put your data in a list or numpy array
|
||||
def preprocess_text(text):
|
||||
text = add_breaks(text)
|
||||
text = english_cleaners_predict(text)
|
||||
texts = [i.text.strip() for i in nlp(text).sents] # split paragraph to sentences
|
||||
return texts
|
||||
end_voc = time.time()
|
||||
print(f"Prediction time of vocoder is {end_voc - start_voc}s")
|
||||
print(f"Prediction time of TTS is {end_voc - start_syn}s")
|
||||
|
||||
texts = preprocess_text(text)
|
||||
print(f"the list of input texts:\n{texts}")
|
||||
# Add breaks
|
||||
b_ends = np.cumsum(np.array(breaks) * Synthesizer_infer.hparams.hop_size)
|
||||
b_starts = np.concatenate(([0], b_ends[:-1]))
|
||||
wavs = [wav[start:end] for start, end, in zip(b_starts, b_ends)]
|
||||
breaks = [np.zeros(int(0.15 * Synthesizer_infer.sample_rate))] * len(breaks)
|
||||
wav = np.concatenate([i for w, b in zip(wavs, breaks) for i in (w, b)])
|
||||
|
||||
embeds = [embed] * len(texts)
|
||||
specs, alignments, stop_tokens = synthesizer.synthesize_spectrograms(texts, embeds, require_visualization=True)
|
||||
# Trim excess silences to compensate for gaps in spectrograms (issue #53)
|
||||
# generated_wav = encoder.inference.preprocess_wav(wav)
|
||||
wav = wav / np.abs(wav).max() * 4
|
||||
|
||||
breaks = [spec.shape[1] for spec in specs]
|
||||
spec = np.concatenate(specs, axis=1)
|
||||
|
||||
# Save it on the disk
|
||||
# filename = "demo_output_%02d.wav" % num_generated
|
||||
if not os.path.exists("out_audios"):
|
||||
os.mkdir("out_audios")
|
||||
|
||||
dir_path = os.path.dirname(os.path.realpath(__file__)) # current dir
|
||||
filename = os.path.join(dir_path, f"out_audios/{speaker_name}_syn.wav")
|
||||
# print(wav.dtype)
|
||||
sf.write(filename, wav.astype(np.float32), synthesizer.sample_rate)
|
||||
num_generated += 1
|
||||
print("\nSaved output (speed not yet adjusted) as %s\n\n" % filename)
|
||||
|
||||
## Save synthesizer visualization results
|
||||
if not os.path.exists("syn_results"):
|
||||
os.mkdir("syn_results")
|
||||
# save_attention_multiple(alignments, "syn_results/attention")
|
||||
# save_stop_tokens(stop_tokens, "syn_results/stop_tokens")
|
||||
# save_spectrogram(spec, "syn_results/mel")
|
||||
print("Created the mel spectrogram")
|
||||
|
||||
end_syn = time.time()
|
||||
print(f"Prediction time of synthesizer is {end_syn - start_syn}s")
|
||||
|
||||
start_voc = time.time()
|
||||
## Generating the waveform
|
||||
print("Synthesizing the waveform:")
|
||||
|
||||
# If seed is specified, reset torch seed and reload vocoder
|
||||
if args.seed is not None:
|
||||
torch.manual_seed(args.seed)
|
||||
vocoder.load_model(args.voc_model_fpath)
|
||||
|
||||
# Synthesizing the waveform is fairly straightforward. Remember that the longer the
|
||||
# spectrogram, the more time-efficient the vocoder.
|
||||
if not args.griffin_lim:
|
||||
wav = vocoder.infer_waveform(spec, target=vocoder.hp.voc_target, overlap=vocoder.hp.voc_overlap, crossfade=vocoder.hp.is_crossfade)
|
||||
else:
|
||||
wav = Synthesizer_infer.griffin_lim(spec)
|
||||
|
||||
end_voc = time.time()
|
||||
print(f"Prediction time of vocoder is {end_voc - start_voc}s")
|
||||
print(f"Prediction time of TTS is {end_voc - start_syn}s")
|
||||
|
||||
# Add breaks
|
||||
b_ends = np.cumsum(np.array(breaks) * Synthesizer_infer.hparams.hop_size)
|
||||
b_starts = np.concatenate(([0], b_ends[:-1]))
|
||||
wavs = [wav[start:end] for start, end, in zip(b_starts, b_ends)]
|
||||
breaks = [np.zeros(int(0.15 * Synthesizer_infer.sample_rate))] * len(breaks)
|
||||
wav = np.concatenate([i for w, b in zip(wavs, breaks) for i in (w, b)])
|
||||
|
||||
# Trim excess silences to compensate for gaps in spectrograms (issue #53)
|
||||
# generated_wav = encoder.inference.preprocess_wav(wav)
|
||||
wav = wav / np.abs(wav).max() * 4
|
||||
|
||||
# Save it on the disk
|
||||
# filename = "demo_output_%02d.wav" % num_generated
|
||||
if not os.path.exists("out_audios"):
|
||||
os.mkdir("out_audios")
|
||||
|
||||
dir_path = os.path.dirname(os.path.realpath(__file__)) # current dir
|
||||
filename = os.path.join(dir_path, f"out_audios/{speaker_name}_syn.wav")
|
||||
# print(wav.dtype)
|
||||
sf.write(filename, wav.astype(np.float32), synthesizer.sample_rate)
|
||||
num_generated += 1
|
||||
print("\nSaved output (speed not yet adjusted) as %s\n\n" % filename)
|
||||
|
||||
# Fix Speed(generate new audio)
|
||||
fix_file = work(totDur_ori,
|
||||
nPause_ori,
|
||||
arDur_ori,
|
||||
nSyl_ori,
|
||||
arRate_ori,
|
||||
filename)
|
||||
print(f"\nSaved output (fixed speed) as {fix_file}\n\n")
|
||||
# Fix Speed(generate new audio)
|
||||
fix_file = work(totDur_ori,
|
||||
nPause_ori,
|
||||
arDur_ori,
|
||||
nSyl_ori,
|
||||
arRate_ori,
|
||||
filename)
|
||||
print(f"\nSaved output (fixed speed) as {fix_file}\n\n")
|
||||
|
||||
|
||||
# # Play the audio (non-blocking)
|
||||
# if not args.no_sound:
|
||||
# import sounddevice as sd
|
||||
# try:
|
||||
# sd.stop()
|
||||
# sd.play(wav, synthesizer.sample_rate)
|
||||
# except sd.PortAudioError as e:
|
||||
# print("\nCaught exception: %s" % repr(e))
|
||||
# print("Continuing without audio playback. Suppress this message with the \"--no_sound\" flag.\n")
|
||||
# except:
|
||||
# raise
|
||||
# # Play the audio (non-blocking)
|
||||
# if not args.no_sound:
|
||||
# import sounddevice as sd
|
||||
# try:
|
||||
# sd.stop()
|
||||
# sd.play(wav, synthesizer.sample_rate)
|
||||
# except sd.PortAudioError as e:
|
||||
# print("\nCaught exception: %s" % repr(e))
|
||||
# print("Continuing without audio playback. Suppress this message with the \"--no_sound\" flag.\n")
|
||||
# except:
|
||||
# raise
|
||||
|
||||
|
||||
# except Exception as e:
|
||||
# print("Caught exception: %s" % repr(e))
|
||||
# print("Restarting\n")
|
||||
# except Exception as e:
|
||||
# print("Caught exception: %s" % repr(e))
|
||||
# print("Restarting\n")
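The embedding mix in the `demo_cli.py` hunk above can be read as a weighted average of the user's embedding and a standard voice's embedding, followed by a noise floor and a gain. The following is a minimal sketch under stated assumptions: plain Python lists stand in for the NumPy arrays the script uses, and `blend_embeddings` with its parameter names is hypothetical, not a function from the repository.

```python
def blend_embeddings(input_embed, standard_embed, weight, zero_threshold, amp=1.0):
    """Hypothetical helper: blend two speaker embeddings, zero small values, scale.

    weight         -- share given to the user's voice (the standard voice gets 1 - weight)
    zero_threshold -- components below this value are treated as noise and set to 0
    amp            -- overall gain applied at the end
    """
    if len(input_embed) != len(standard_embed):
        raise ValueError("embeddings must have the same length")
    # Weighted average of the two embeddings, component by component.
    blended = [weight * a + (1.0 - weight) * b
               for a, b in zip(input_embed, standard_embed)]
    # Zero out components that fall below the noise threshold.
    cleaned = [0.0 if x < zero_threshold else x for x in blended]
    # Apply the gain.
    return [x * amp for x in cleaned]


if __name__ == "__main__":
    user = [0.0, 0.5, 1.0]
    standard = [0.25, 0.0, 0.5]
    print(blend_embeddings(user, standard, weight=0.5, zero_threshold=0.25))
```

With `weight=1.0` the standard voice has no influence, which matches the fallback in the script where the input embedding is used on its own when no standard audio file exists.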
|
||||
|
||||
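The "Add breaks" step in `demo_cli.py` cuts the vocoder output at sentence boundaries and puts a short silence after each sentence. The sketch below illustrates the idea with plain Python lists instead of NumPy arrays; `insert_breaks` and its arguments are hypothetical names, and the assumption is that each spectrogram frame corresponds to `hop_size` waveform samples.

```python
from itertools import accumulate


def insert_breaks(wav, frames_per_sentence, hop_size, sample_rate, pause_s=0.15):
    """Hypothetical helper: split wav at sentence boundaries and add pauses.

    frames_per_sentence -- number of spectrogram frames produced for each sentence
    hop_size            -- waveform samples per spectrogram frame (assumed)
    pause_s             -- length of the silence after each sentence, in seconds
    """
    # Cumulative frame counts, converted to sample offsets, mark sentence ends.
    ends = [n * hop_size for n in accumulate(frames_per_sentence)]
    starts = [0] + ends[:-1]
    silence = [0.0] * int(pause_s * sample_rate)
    out = []
    for start, end in zip(starts, ends):
        out.extend(wav[start:end])  # the sentence itself
        out.extend(silence)         # followed by a pause
    return out


if __name__ == "__main__":
    wav = [1.0, 1.0, 1.0, 1.0, 2.0, 2.0]
    print(insert_breaks(wav, frames_per_sentence=[2, 1], hop_size=2,
                        sample_rate=4, pause_s=0.5))
```

As in the script, a pause is also appended after the last sentence, so the output ends in silence.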
BIN
demo_results/text1/1688-142285-0000_syn.wav
Normal file
Binary file not shown.
BIN
demo_results/text1/260-123286-0000_syn_1.0.wav
Normal file
Binary file not shown.
BIN
demo_results/text1/4294-9934-0000_syn_0.98.wav
Normal file
Binary file not shown.
BIN
demo_results/text1/7176-88083-0000_syn_1.13.wav
Normal file
Binary file not shown.
1
demo_results/text1/README.md
Normal file
@@ -0,0 +1 @@
|
||||
Life was like a box of chocolates, you never know what you're gonna get.
|
||||
BIN
demo_results/text2/1688-142285-0000_syn_0.77.wav
Normal file
Binary file not shown.
BIN
demo_results/text2/260-123286-0000_syn_1.0.wav
Normal file
Binary file not shown.
BIN
demo_results/text2/4294-9934-0000_syn_0.78.wav
Normal file
Binary file not shown.
BIN
demo_results/text2/7176-88083-0000_syn_0.76.wav
Normal file
Binary file not shown.
1
demo_results/text2/README.md
Normal file
@@ -0,0 +1 @@
|
||||
In 2014, P&G recorded $83.1 billion in sales. On August 1, 2014, P&G announced it was streamlining the company, dropping and selling off around 100 brands from its product portfolio in order to focus on the remaining 65 brands, which produced 95% of the company's profits.
|
||||
BIN
demo_results/text3/1688-142285-0000_syn.wav
Normal file
Binary file not shown.
BIN
demo_results/text3/260-123286-0000_syn_0.97.wav
Normal file
Binary file not shown.
BIN
demo_results/text3/4294-9934-0000_syn_0.76.wav
Normal file
Binary file not shown.
BIN
demo_results/text3/7176-88083-0000_syn_0.8.wav
Normal file
Binary file not shown.
1
demo_results/text3/README.md
Normal file
@@ -0,0 +1 @@
|
||||
Mechanics is a branch of physics that deals with the behavior of physical bodies under the influence of various forces. The study of mechanics is important in understanding the behavior of machines, the motion of objects, and the principles of engineering. Mechanics has been an essential part of physics since ancient times and has continued to evolve with advancements in science and technology. This paper will discuss the principles of mechanics, the laws of motion, and the applications of mechanics in engineering and technology.
|
||||
Binary file not shown.
BIN
samples/1688-142285-0000.flac
Normal file
Binary file not shown.
BIN
samples/260-123286-0000.flac
Normal file
Binary file not shown.
Binary file not shown.
BIN
samples/4294-9934-0000.flac
Normal file
Binary file not shown.
Binary file not shown.
BIN
samples/7176-88083-0000.flac
Normal file
Binary file not shown.
Binary file not shown.
24
samples/README.md
Executable file → Normal file
@@ -1,22 +1,2 @@
|
||||
The audio files in this folder are provided for toolbox testing and
|
||||
benchmarking purposes. These are the same reference utterances
|
||||
used by the SV2TTS authors to generate the audio samples located at:
|
||||
https://google.github.io/tacotron/publications/speaker_adaptation/index.html
|
||||
|
||||
The `p240_00000.mp3` and `p260_00000.mp3` files are compressed
|
||||
versions of audios from the VCTK corpus available at:
|
||||
https://datashare.is.ed.ac.uk/handle/10283/3443
|
||||
VCTK.txt contains the copyright notices and licensing information.
|
||||
|
||||
The `1320_00000.mp3`, `3575_00000.mp3`, `6829_00000.mp3`
|
||||
and `8230_00000.mp3` files are compressed versions of audios
|
||||
from the LibriSpeech dataset available at: https://openslr.org/12
|
||||
For these files, the following notice applies:
|
||||
```
|
||||
LibriSpeech (c) 2014 by Vassil Panayotov
|
||||
|
||||
LibriSpeech ASR corpus is licensed under a
|
||||
Creative Commons Attribution 4.0 International License.
|
||||
|
||||
See <http://creativecommons.org/licenses/by/4.0/>.
|
||||
```
|
||||
260-123286-0000.flac and 7176-88083-0000.flac are from LibriSpeech test-clean.
|
||||
1688-142285-0000.flac and 4294-9934-0000.flac are from LibriSpeech test-other.
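
LibriSpeech utterance files are named `<speaker>-<chapter>-<utterance>.flac`, so the sample names above already encode the speaker. A small sketch of how such a name could be split into its parts; `parse_librispeech_name` and `UtteranceId` are hypothetical names used only for illustration:

```python
from pathlib import Path
from typing import NamedTuple


class UtteranceId(NamedTuple):
    speaker: int
    chapter: int
    utterance: int


def parse_librispeech_name(filename):
    """Hypothetical helper: split '260-123286-0000.flac' into its three IDs."""
    parts = Path(filename).stem.split("-")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a LibriSpeech-style file name: {filename!r}")
    return UtteranceId(*(int(p) for p in parts))


if __name__ == "__main__":
    for name in ("260-123286-0000.flac", "samples/4294-9934-0000.flac"):
        print(name, parse_librispeech_name(name))
```

The generated files in `demo_results/` carry an extra `_syn` suffix, so they would need that suffix stripped before being passed to a parser like this one.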
|
||||
|
||||
@@ -1,94 +0,0 @@
|
||||
---------------------------------------------------------------------
|
||||
CSTR VCTK Corpus
|
||||
English Multi-speaker Corpus for CSTR Voice Cloning Toolkit
|
||||
|
||||
(Version 0.92)
|
||||
RELEASE September 2019
|
||||
The Centre for Speech Technology Research
|
||||
University of Edinburgh
|
||||
Copyright (c) 2019
|
||||
|
||||
Junichi Yamagishi
|
||||
jyamagis@inf.ed.ac.uk
|
||||
---------------------------------------------------------------------
|
||||
|
||||
Overview
|
||||
|
||||
This CSTR VCTK Corpus includes speech data uttered by 110 English
|
||||
speakers with various accents. Each speaker reads out about 400
|
||||
sentences, which were selected from a newspaper, the rainbow passage
|
||||
and an elicitation paragraph used for the speech accent archive.
|
||||
|
||||
The newspaper texts were taken from Herald Glasgow, with permission
|
||||
from Herald & Times Group. Each speaker has a different set of the
|
||||
newspaper texts selected based on a greedy algorithm that increases the
|
||||
contextual and phonetic coverage. The details of the text selection
|
||||
algorithms are described in the following paper:
|
||||
|
||||
C. Veaux, J. Yamagishi and S. King,
|
||||
"The voice bank corpus: Design, collection and data analysis of
|
||||
a large regional accent speech database,"
|
||||
https://doi.org/10.1109/ICSDA.2013.6709856
|
||||
|
||||
The rainbow passage and elicitation paragraph are the same for all
|
||||
speakers. The rainbow passage can be found at International Dialects
|
||||
of English Archive:
|
||||
(http://web.ku.edu/~idea/readings/rainbow.htm). The elicitation
|
||||
paragraph is identical to the one used for the speech accent archive
|
||||
(http://accent.gmu.edu). The details of the speech accent archive
|
||||
can be found at
|
||||
http://www.ualberta.ca/~aacl2009/PDFs/WeinbergerKunath2009AACL.pdf
|
||||
|
||||
All speech data was recorded using an identical recording setup: an
|
||||
omni-directional microphone (DPA 4035) and a small diaphragm condenser
|
||||
microphone with very wide bandwidth (Sennheiser MKH 800), 96kHz
|
||||
sampling frequency at 24 bits and in a hemi-anechoic chamber of
|
||||
the University of Edinburgh. (However, two speakers, p280 and p315
|
||||
had technical issues of the audio recordings using MKH 800).
|
||||
All recordings were converted into 16 bits, were downsampled to
|
||||
48 kHz, and were manually end-pointed.
|
||||
|
||||
This corpus was originally aimed for HMM-based text-to-speech synthesis
|
||||
systems, especially for speaker-adaptive HMM-based speech synthesis
|
||||
that uses average voice models trained on multiple speakers and speaker
|
||||
adaptation technologies. This corpus is also suitable for DNN-based
|
||||
multi-speaker text-to-speech synthesis systems and waveform modeling.
|
||||
|
||||
COPYING
|
||||
|
||||
This corpus is licensed under the Creative Commons License: Attribution 4.0 International
|
||||
http://creativecommons.org/licenses/by/4.0/legalcode
|
||||
|
||||
VCTK VARIANTS
|
||||
There are several variants of the VCTK corpus:
|
||||
Speech enhancement
|
||||
- Noisy speech database for training speech enhancement algorithms and TTS models where we added various types of noises to VCTK artificially: http://dx.doi.org/10.7488/ds/2117
|
||||
- Reverberant speech database for training speech dereverberation algorithms and TTS models where we added various types of reverberation to VCTK artificially http://dx.doi.org/10.7488/ds/1425
|
||||
- Noisy reverberant speech database for training speech enhancement algorithms and TTS models http://dx.doi.org/10.7488/ds/2139
|
||||
- Device Recorded VCTK where speech signals of the VCTK corpus were played back and re-recorded in office environments using relatively inexpensive consumer devices http://dx.doi.org/10.7488/ds/2316
|
||||
- The Microsoft Scalable Noisy Speech Dataset (MS-SNSD) https://github.com/microsoft/MS-SNSD
|
||||
|
||||
ASV and anti-spoofing
|
||||
- Spoofing and Anti-Spoofing (SAS) corpus, which is a collection of synthetic speech signals produced by nine techniques, two of which are speech synthesis, and seven are voice conversion. All of them were built using the VCTK corpus. http://dx.doi.org/10.7488/ds/252
|
||||
- Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof 2015) Database. This database consists of synthetic speech signals produced by ten techniques and this has been used in the first Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof 2015) http://dx.doi.org/10.7488/ds/298
|
||||
- ASVspoof 2019: The 3rd Automatic Speaker Verification Spoofing and Countermeasures Challenge database. This database has been used in the 3rd Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof 2019) https://doi.org/10.7488/ds/2555
|
||||
|
||||
|
||||
ACKNOWLEDGEMENTS
|
||||
|
||||
The CSTR VCTK Corpus was constructed by:
|
||||
|
||||
Christophe Veaux (University of Edinburgh)
|
||||
Junichi Yamagishi (University of Edinburgh)
|
||||
Kirsten MacDonald
|
||||
|
||||
The research leading to these results was partly funded from EPSRC
|
||||
grants EP/I031022/1 (NST) and EP/J002526/1 (CAF), from the RSE-NSFC
|
||||
grant (61111130120), and from the JST CREST (uDialogue).
|
||||
|
||||
Please cite this corpus as follows:
|
||||
Christophe Veaux, Junichi Yamagishi, Kirsten MacDonald,
|
||||
"CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit",
|
||||
The Centre for Speech Technology Research (CSTR),
|
||||
University of Edinburgh
|
||||
|
||||
Binary file not shown.
Binary file not shown.
@@ -458,7 +458,7 @@ class Tacotron(nn.Module):
|
||||
if t == 0:
|
||||
first_stop_token = stop_tokens[0]
|
||||
# Stop the loop when all stop tokens in batch exceed threshold compared with the 1st token and the sequence's length exceeds threshold
|
||||
if (stop_tokens > first_stop_token * 2e3).all() and t > (20 * self.r): break
|
||||
if (stop_tokens > first_stop_token * 8e3).all() and t > (20 * self.r): break
|
||||
# if (stop_tokens > 0.5).all() and t > (20 * self.r): break
|
||||
if torch.cuda.is_available():
|
||||
torch.cuda.empty_cache()
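
The stopping rule in the Tacotron hunk above compares every stop token in the batch with a multiple of the first stop token, and also requires a minimum number of decoder steps. The sketch below restates that rule without PyTorch. `should_stop` and its parameters are hypothetical names, `first_stop_token` is treated as a single number, and the multiplier is left as an argument because the hunk shows both `2e3` and `8e3`.

```python
def should_stop(stop_tokens, first_stop_token, step, r, ratio, min_steps_per_r=20):
    """Hypothetical helper: decide whether the decoder loop may end.

    stop_tokens      -- stop-token values for the current step, one per batch item
    first_stop_token -- stop-token value recorded at step 0 (assumed scalar here)
    step             -- current decoder step t
    r                -- reduction factor (frames generated per decoder step)
    ratio            -- multiplier applied to first_stop_token
    """
    # Every item in the batch must exceed the scaled reference value ...
    all_exceed = all(s > first_stop_token * ratio for s in stop_tokens)
    # ... and the sequence must already be longer than the minimum length.
    return all_exceed and step > min_steps_per_r * r


if __name__ == "__main__":
    print(should_stop([0.9, 0.85], first_stop_token=1e-4, step=50, r=2, ratio=8e3))
```

A larger multiplier makes the condition harder to satisfy, so the decoder tends to run longer before it stops.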
|
||||
|
||||