diff --git a/Changelog_CN.md b/Changelog_CN.md
index ad10739..351def3 100644
--- a/Changelog_CN.md
+++ b/Changelog_CN.md
@@ -1,4 +1,21 @@
-### 20230409
+### 20230513更新
+功能更新:
+- 清除一键包内部老版本runtime内残留的infer_pack和uvr5_pack
+- 修复训练集预处理伪多进程的bug
+- 增加harvest识别音高可选通过中值滤波削弱哑音现象,可调整中值滤波半径
+- 导出音频增加后处理重采样
+- 训练n_cpu进程数从"仅调整f0提取"改为"调整数据预处理和f0提取"
+- 自动检测logs文件夹下的index路径,提供下拉列表功能
+- tab页增加"常见问题解答"(也可参考github-rvc-wiki)
+
+待完成:
+- 推理音量预处理归一化
+- 推理后处理音量包络融合输入音频的音量包络
+- 增加选项:每次epoch保存的小模型均进行提取
+
+RVC很快会陆续更新v2版的新底模(先发40k采样率的,32k和48k的仍在训练中)!大幅改善呼吸、清辅音(唇齿音)的电音和撕裂伪影,请期待!
+
+### 20230409更新
- 修正训练参数,提升显卡平均利用率,A100最高从25%提升至90%左右,V100:50%->90%左右,2060S:60%->85%左右,P40:25%->95%左右,训练速度显著提升
- 修正参数:总batch_size改为每张卡的batch_size
- 修正total_epoch:最大限制100解锁至1000;默认10提升至默认20
@@ -21,15 +38,14 @@
- 修复部分音频格式下UVR5人声伴奏分离的bug
- 实时变声迷你gui增加对非40k与不懈怠音高模型的支持
-
### 后续计划:
功能:
- 增加选项:每次epoch保存的小模型均进行提取
-- 增加选项:推理额外导出mp3至填写的路径
+- 增加选项:推理额外导出mp3至填写的路径(批量推理)
- 支持多人训练选项卡(至多4人)
--
+
底模:
- 收集呼吸wav加入训练集修正呼吸变声电音的问题
- 我们正在训练增加了歌声训练集的底模,未来会公开
-- 升级鉴别器
+- 升级鉴别器(尝试MRD)
- 升级自监督特征结构
diff --git a/docs/faq.md b/docs/faq.md
new file mode 100644
index 0000000..59283ca
--- /dev/null
+++ b/docs/faq.md
@@ -0,0 +1,89 @@
+## Q1:ffmpeg error/utf8 error.
+
+大概率不是ffmpeg问题,而是音频路径问题;
+ffmpeg读取路径带空格、()等特殊符号,可能出现ffmpeg error;训练集音频带中文路径,在写入filelist.txt的时候可能出现utf8 error;
+
+## Q2:一键训练结束没有索引
+
+显示"Training is done. The program is closed."则模型训练成功,后续紧邻的报错是假的;
+
+一键训练结束后没有added开头的索引文件,可能是因为训练集太大卡住了添加索引的步骤;已通过批处理add索引解决add索引对内存需求过大的问题。临时可尝试再次点击"训练索引"按钮。
+
+## Q3:训练结束推理没看到训练集的音色
+点刷新音色再看看,如果还没有看看训练有没有报错,控制台和webui的截图,logs/实验名下的log,都可以发给开发者看看。
+
+## Q4:如何分享模型
+ rvc_root/logs/实验名 下面存储的pth不是用来分享模型用来推理的,而是为了存储实验状态供复现,以及继续训练用的。用来分享的模型应该是weights文件夹下大小为60+MB的pth文件;
+ 后续将把weights/exp_name.pth和logs/exp_name/added_xxx.index合并打包成weights/exp_name.zip省去填写index的步骤,那么zip文件用来分享,不要分享pth文件,除非是想换机器继续训练;
+ 如果你把logs文件夹下的几百MB的pth文件复制/分享到weights文件夹下强行用于推理,可能会出现f0,tgt_sr等各种key不存在的报错。你需要用ckpt选项卡最下面,手工或自动(本地logs下如果能找到相关信息则会自动)选择是否携带音高、目标音频采样率的选项后进行ckpt小模型提取(输入路径填G开头的那个),提取完在weights文件夹下会出现60+MB的pth文件,刷新音色后可以选择使用。
+
+## Q5:Connection Error.
+也许你关闭了控制台(黑色窗口)。
+
+## Q6:WebUI弹出Expecting value: line 1 column 1 (char 0).
+请关闭系统局域网代理/全局代理。
+
+这个不仅是客户端的代理,也包括服务端的代理(例如你使用autodl设置了http_proxy和https_proxy学术加速,使用时也需要unset关掉)
+
+## Q7:不用WebUI如何通过命令训练推理
+训练脚本:
+可先跑通WebUI,消息窗内会显示数据集处理和训练用命令行;
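+
+举个示意(以下路径和数值均为占位,实际请以消息窗打印的命令为准),数据预处理一步对应的命令行大致形如:
+
+runtime\python.exe trainset_preprocess_pipeline_print.py "E:\my_dataset" 40000 8 "E:\RVC\logs\mi-test" False
+
+(参数依次为:训练集路径、目标采样率、CPU进程数、实验log目录、是否关闭并行)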
+
+推理脚本:
+https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/myinfer.py
+
+例子:
+
+runtime\python.exe myinfer.py 0 "E:\codes\py39\RVC-beta\todo-songs\1111.wav" "E:\codes\py39\logs\mi-test\added_IVF677_Flat_nprobe_7.index" harvest "test.wav" "weights/mi-test.pth" 0.6 cuda:0 True
+
+import sys
+
+f0up_key = sys.argv[1]        # 变调(半音数)
+input_path = sys.argv[2]      # 输入音频路径
+index_path = sys.argv[3]      # added开头的index文件路径
+f0method = sys.argv[4]        # harvest or pm
+opt_path = sys.argv[5]        # 输出音频路径
+model_path = sys.argv[6]      # weights下的pth模型路径
+index_rate = float(sys.argv[7])
+device = sys.argv[8]          # 例如 cuda:0
+is_half = bool(sys.argv[9])   # 注意:bool()对任意非空字符串(包括"False")都为True
+
+## Q8:Cuda error/Cuda out of memory.
+小概率是cuda配置问题、设备不支持;大概率是显存不够(out of memory);
+
+训练的话缩小batch size(如果缩小到1还不够只能更换显卡训练),推理的话酌情缩小config.py结尾的x_pad,x_query,x_center,x_max。4G以下显存(例如1060(3G)和各种2G显卡)可以直接放弃,4G显存显卡还有救。
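+
+下面是一个仅供参考的示意(变量名取自本仓库config.py结尾,数值是针对小显存的假设值,并非默认值):
+
+x_pad = 1      # 每段推理音频前后填充的时长(秒)
+x_query = 6    # 切分长音频时寻找安静切分点的查询窗口(秒)
+x_center = 38  # 切分点之间的间隔(秒)
+x_max = 41     # 低于该时长(秒)的音频不切分,一次推理完成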
+
+## Q9:total_epoch调多少比较好
+
+如果训练集音质差底噪大,20~30足够了,调太高,底模音质无法带高你的低音质训练集
+如果训练集音质高底噪低时长多,可以调高,200是ok的(训练速度很快,既然你有条件准备高音质训练集,显卡想必条件也不错,肯定不在乎多一些训练时间)
+
+## Q10:需要多少训练集时长
+ 推荐10min至50min
+ 保证音质高底噪低的情况下,如果有个人特色的音色统一,则多多益善
+ 高水平的训练集(精简+音色有特色),5min至10min也是ok的,仓库作者本人就经常这么玩
+ 也有人拿1min至2min的数据来训练并且训练成功的,但是成功经验是其他人不可复现的,不太具备参考价值。这要求训练集音色特色非常明显(比如说高频气声较明显的萝莉少女音),且音质高;
+ 1min以下时长数据目前没见有人尝试(成功)过。不建议进行这种鬼畜行为。
+
+## Q11:index rate干嘛用的,怎么调(科普)
+ 如果底模和推理源的音质高于训练集的音质,他们可以带高推理结果的音质,但代价可能是音色往底模/推理源的音色靠,这种现象叫做"音色泄露";
+ index rate用来削减/解决音色泄露问题。调到1,则理论上不存在推理源的音色泄露问题,但音质更倾向于训练集。如果训练集音质比推理源低,则index rate调高可能降低音质。调到0,则不具备利用检索混合来保护训练集音色的效果;
+ 如果训练集优质时长多,可调高total_epoch,此时模型本身不太会引用推理源和底模的音色,很少存在"音色泄露"问题,此时index_rate不重要,你甚至可以不建立/分享index索引文件。
+
+## Q12:推理怎么选gpu
+config.py文件里device cuda:后面选择卡号;
+卡号和显卡的映射关系,在训练选项卡的显卡信息栏里能看到。
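+
+举例(仅为示意,具体以你本地config.py中的写法为准),想用1号卡推理:
+
+device = "cuda:1"  # 卡号与显卡的对应关系见训练选项卡的"显卡信息"栏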
+
+## Q13:如何推理训练中间保存的pth
+通过ckpt选项卡最下面提取小模型。
+
+
+## Q14:如何中断和继续训练
+现阶段只能关闭WebUI控制台双击go-web.bat重启程序。网页参数也要刷新重新填写;
+继续训练:相同网页参数点训练模型,就会接着上次的checkpoint继续训练。
+
+## Q15:训练时出现文件页面/内存error
+进程开太多了,内存炸了。你可能可以通过如下方式解决
+1、"提取音高和处理数据使用的CPU进程数" 酌情拉低;
+2、训练集音频手工切一下,不要太长。
+
+
+
diff --git a/docs/faq_en.md b/docs/faq_en.md
new file mode 100644
index 0000000..ab7d6a3
--- /dev/null
+++ b/docs/faq_en.md
@@ -0,0 +1,95 @@
+## Q1:ffmpeg error/utf8 error.
+Most likely this is not an FFmpeg issue but an audio path issue;
+
+FFmpeg may fail to read paths that contain spaces, parentheses, or other special characters; and when the training-set audio paths contain Chinese characters, writing them into filelist.txt may cause a utf8 error.
+
+## Q2:Cannot find index file after "One-click Training".
+If it displays "Training is done. The program is closed," then the model has been trained successfully and the errors that immediately follow are spurious;
+
+If no index file starting with 'added' appears after One-click training, it may be because the training set was too large and the index-building step got stuck; this has since been resolved by adding the index in batches, which avoids the excessive memory usage. As a temporary workaround, try clicking the "Train Index" button again.
+
+## Q3:Cannot find the model in “Inferencing timbre” after training
+Click “Refresh timbre list” and check again; if still not visible, check if there are any errors during training and send screenshots of the console, web UI, and logs/experiment_name/*.log to the developers for further analysis.
+
+## Q4:How to share a model/How to use others' models?
+The pth files stored in rvc_root/logs/experiment_name are not meant for sharing or inference; they store the experiment checkpoints for reproducibility and further training. The model to be shared is the 60+MB pth file in the weights folder;
+
+In the future, weights/exp_name.pth and logs/exp_name/added_xxx.index will be merged into a single weights/exp_name.zip file to eliminate the need for manual index input; so share the zip file, not the pth file, unless you want to continue training on a different machine;
+
+Copying/sharing the several-hundred-MB pth files from the logs folder to the weights folder for forced inference may result in errors such as missing f0, tgt_sr, or other keys. You need to use the ckpt tab at the bottom to manually or automatically (automatic if the information is found under logs/exp_name) select whether to include pitch information and the target audio sample rate, then extract the smaller model (fill in the G-prefixed file as the input path). After extraction, a 60+MB pth file will appear in the weights folder, and you can refresh the timbre list and use it.
+
+## Q5:Connection Error.
+You may have closed the console (black command line window).
+
+## Q6:WebUI popup 'Expecting value: line 1 column 1 (char 0)'.
+Please disable the system LAN proxy/global proxy and then refresh. This applies not only to a client-side proxy but also to a server-side one (for example, if you set http_proxy/https_proxy on autodl for academic acceleration, unset them before running).
+
+## Q7:How to train and infer without the WebUI?
+Training script:
+You can run training in WebUI first, and the command-line versions of dataset preprocessing and training will be displayed in the message window.
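+
+As a rough sketch (the paths and numbers below are placeholders; the exact command is what the message window prints), the dataset preprocessing step corresponds to a command line like:
+
+runtime\python.exe trainset_preprocess_pipeline_print.py "E:\my_dataset" 40000 8 "E:\RVC\logs\mi-test" False
+
+(the arguments are, in order: training-set folder, target sample rate, number of CPU processes, experiment log folder, and whether to disable parallelism)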
+
+Inference script:
+https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/myinfer.py
+
+
+e.g.
+
+runtime\python.exe myinfer.py 0 "E:\codes\py39\RVC-beta\todo-songs\1111.wav" "E:\codes\py39\logs\mi-test\added_IVF677_Flat_nprobe_7.index" harvest "test.wav" "weights/mi-test.pth" 0.6 cuda:0 True
+
+
+import sys
+
+f0up_key = sys.argv[1]        # transpose in semitones
+input_path = sys.argv[2]      # source audio file
+index_path = sys.argv[3]      # added_*.index retrieval file
+f0method = sys.argv[4]        # harvest or pm
+opt_path = sys.argv[5]        # output wav path
+model_path = sys.argv[6]      # weights/*.pth model
+index_rate = float(sys.argv[7])
+device = sys.argv[8]          # e.g. cuda:0
+is_half = bool(sys.argv[9])   # caution: bool() of any non-empty string (even "False") is True
+
+## Q8:Cuda error/Cuda out of memory.
+There is a small chance that it is a CUDA configuration problem or that the device is unsupported; more likely, there is not enough GPU memory (out of memory).
+
+For training, reduce the batch size (if reducing it to 1 is still not enough, you will have to switch to a different graphics card); for inference, reduce the x_pad, x_query, x_center, and x_max settings at the end of config.py as needed. Cards with less than 4 GB of VRAM (e.g. a 3 GB 1060 or various 2 GB cards) are basically hopeless, while 4 GB cards still have a chance.
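+
+As a rough sketch (the variable names come from the end of this repo's config.py; the numbers are assumptions for a low-VRAM card, not the shipped defaults), lowering those settings might look like:
+
+x_pad = 1      # seconds of padding added around each inference chunk
+x_query = 6    # seconds searched for a quiet split point when cutting long audio
+x_center = 38  # seconds between split points
+x_max = 41     # audio shorter than this (seconds) is converted in a single pass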
+
+## Q9:How many total_epoch are optimal?
+If the training dataset's audio quality is poor and the noise floor is high, 20-30 epochs are sufficient. Setting it too high won't improve the audio quality of your low-quality training set.
+
+If the training set audio quality is high, the noise floor is low, and there is sufficient duration, you can increase it. 200 is acceptable (since training is fast, and if you're able to prepare a high-quality training set, your GPU likely can handle a longer training duration without issue).
+
+## Q10:How much training set duration is needed?
+
+A dataset of around 10min to 50min is recommended.
+
+As long as the sound quality is high and the noise floor is low, more data is better, provided the distinctive timbre stays consistent.
+
+For a high-quality training set (clean and with a distinctive timbre), 5min to 10min is fine; the repository author himself often works this way.
+
+There are some people who have trained successfully with 1min to 2min data, but the success is not reproducible by others and is not very informative.
+This requires that the training set has a very distinctive timbre (e.g. a high-frequency airy anime girl sound) and the quality of the audio is high;
+Data of less than 1min duration has not been successfully attempted so far. This is not recommended.
+
+
+## Q11:What is the index rate for and how to adjust it?
+If the tone quality of the pretrained model and the inference source is higher than that of the training set, they can raise the tone quality of the inference result, but at the cost of possibly biasing the timbre towards that of the underlying model/inference source rather than the training set; this phenomenon is generally referred to as "timbre leakage".
+
+The index rate is used to reduce/resolve the timbre leakage problem. If the index rate is set to 1, theoretically there is no timbre leakage from the inference source and the timbre quality is more biased towards the training set. If the training set has a lower sound quality than the inference source, then a higher index rate may reduce the sound quality. Turning it down to 0 does not have the effect of using retrieval blending to protect the training set tones.
+
+If the training set has good audio quality and a long duration, turn up total_epoch; in that case the model itself is less likely to borrow the timbre of the inference source or the pretrained base model, there is little "timbre leakage", the index_rate is not important, and you can even skip creating/sharing the index file.
+
+## Q12:How to choose the gpu when inferring?
+In the config.py file, select the card number after "device cuda:".
+
+The mapping between card number and graphics card can be seen in the graphics card information section of the training tab.
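+
+For example (a sketch; the exact line in your config.py may look slightly different), pointing inference at card 1 would be:
+
+device = "cuda:1"  # card numbers follow the GPU information box on the training tab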
+
+## Q13:How to use the model saved in the middle of training?
+Save via model extraction at the bottom of the ckpt processing tab.
+
+## Q14:File/memory error(when training)?
+Too many processes were spawned and there is not enough memory. You can fix it by:
+
+1. Lowering the "Number of CPU threads to use for pitch extraction and dataset processing" setting as appropriate;
+
+2. Pre-cutting the training-set audio into shorter files.
+
+
+
diff --git a/extract_feature_print.py b/extract_feature_print.py
index 7cc0601..f1c4f9a 100644
--- a/extract_feature_print.py
+++ b/extract_feature_print.py
@@ -18,6 +18,13 @@ from fairseq import checkpoint_utils
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
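+# Prefer CUDA, then Apple-silicon MPS, then CPU; MPS cannot use the half-precision path, so it is excluded from .half() below.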
+if torch.cuda.is_available():
+ device = "cuda"
+elif torch.backends.mps.is_available():
+ device = "mps"
+else:
+ device = "cpu"
+
f = open("%s/extract_f0_feature.log" % exp_dir, "a+")
@@ -60,7 +67,7 @@ models, saved_cfg, task = checkpoint_utils.load_model_ensemble_and_task(
model = models[0]
model = model.to(device)
printt("move model to %s" % device)
-if device != "cpu":
+if device not in ["mps", "cpu"]:
model = model.half()
model.eval()
@@ -83,7 +90,7 @@ else:
padding_mask = torch.BoolTensor(feats.shape).fill_(False)
inputs = {
"source": feats.half().to(device)
- if device != "cpu"
+ if device not in ["mps", "cpu"]
else feats.to(device),
"padding_mask": padding_mask.to(device),
"output_layer": 9, # layer 9
diff --git a/gui.py b/gui.py
index 3e409d2..1aed430 100644
--- a/gui.py
+++ b/gui.py
@@ -133,9 +133,9 @@ class RVC:
score, ix = self.index.search(npy, k=8)
weight = np.square(1 / score)
weight /= weight.sum(axis=1, keepdims=True)
- npy = np.sum(self.big_npy[ix] * np.expand_dims(weight, axis=2), axis=1).astype(
- "float16"
- )
+ npy = np.sum(
+ self.big_npy[ix] * np.expand_dims(weight, axis=2), axis=1
+ ).astype("float16")
feats = (
torch.from_numpy(npy).unsqueeze(0).to(device) * self.index_rate
diff --git a/i18n/en_US.json b/i18n/en_US.json
index f2d3d18..e59a5ba 100644
--- a/i18n/en_US.json
+++ b/i18n/en_US.json
@@ -11,6 +11,7 @@
"模型推理": "Model inference",
"推理音色": "Inferencing timbre",
"刷新音色列表": "Refresh timbre list",
+ "刷新音色列表和索引路径": "Refresh timbre list and index path",
"卸载音色省显存": "Unload timbre to save GPU memory",
"请选择说话人id": "Please select a speaker id",
"男转女推荐+12key, 女转男推荐-12key, 如果音域爆炸导致音色失真也可以自己调整到合适音域. ": "It is recommended +12key for male to female conversion, and -12key for female to male conversion. If the sound range explodes and the timbre is distorted, you can also adjust it to the appropriate range by yourself. ",
@@ -20,8 +21,11 @@
"crepe_hop_length": "Crepe Hop Length (Only applies to crepe): Hop length refers to the time it takes for the speaker to jump to a dramatic pitch. Lower hop lengths take more time to infer but are more pitch accurate.",
"特征检索库文件路径": "Feature search database file path",
"特征文件路径": "Feature file path",
+    "特征检索库文件路径,为空则使用下拉的选择结果": "Feature search database file path. Leave blank to use the selected result from the dropdown.",
+    "自动检测index路径,下拉式选择(dropdown)": "Auto-detected index paths in the logs folder (dropdown selection).",
"检索特征占比": "Search feature ratio",
"F0曲线文件, 可选, 一行一个音高, 代替默认F0及升降调": "F0 curve file, optional, one pitch per line, instead of the default F0 and ups and downs",
+    ">=3则使用对harvest音高识别的结果使用中值滤波,数值为滤波半径,使用可以削弱哑音": "If >=3: apply median filtering to the harvested pitch results. The value is the filter radius and can reduce breathiness.",
"转换": "Convert",
"输出信息": "Export message",
"输出音频(右下角三个点,点了可以下载)": "Export audio (three dots in the lower right corner, click to download)",
@@ -48,6 +52,7 @@
"以-分隔输入使用的卡号, 例如 0-1-2 使用卡0和卡1和卡2": "Enter the card numbers used separated by -, for example 0-1-2 use card 0 and card 1 and card 2",
"显卡信息": "GPU information",
"提取音高使用的CPU进程数": "Number of CPU threads to use for pitch extraction",
+ "提取音高和处理数据使用的CPU进程数": "Number of CPU threads to use for pitch extraction and dataset processing",
"选择音高提取算法:输入歌声可用pm提速,高质量语音但CPU差可用dio提速,harvest质量更好但慢": "Select pitch extraction algorithm: Use 'pm' for faster processing of singing voice, 'dio' for high-quality speech but slower processing, and 'harvest' for the best quality but slowest processing.",
"特征提取": "Feature extraction",
"step3: 填写训练设置, 开始训练模型和索引": "step3: Fill in the training settings, start training the model and index",
@@ -111,5 +116,7 @@
"性能设置": "performance settings",
"开始音频转换": "start audio conversion",
"停止音频转换": "stop audio conversion",
+ "常见问题解答": "FAQ (Frequently Asked Questions)",
+    "后处理重采样至最终采样率,0为不进行重采样": "Resample the output audio in post-processing to the final sample rate. Set to 0 for no resampling.",
"推理时间(ms):": "Infer Time(ms):"
}
diff --git a/i18n/zh_CN.json b/i18n/zh_CN.json
index a06d842..1221e5c 100644
--- a/i18n/zh_CN.json
+++ b/i18n/zh_CN.json
@@ -19,6 +19,8 @@
"选择音高提取算法,输入歌声可用pm提速,harvest低音好但巨慢无比": "选择音高提取算法,输入歌声可用pm提速,harvest低音好但巨慢无比",
"crepe_hop_length": "Crepe Hop Length (Only applies to crepe): Hop length refers to the time it takes for the speaker to jump to a dramatic pitch. Lower hop lengths take more time to infer but are more pitch accurate.",
"特征检索库文件路径": "特征检索库文件路径",
+ "特征检索库文件路径,为空则使用下拉的选择结果": "特征检索库文件路径,为空则使用下拉的选择结果",
+ ">=3则使用对harvest音高识别的结果使用中值滤波,数值为滤波半径,使用可以削弱哑音": ">=3则使用对harvest音高识别的结果使用中值滤波,数值为滤波半径,使用可以削弱哑音",
"特征文件路径": "特征文件路径",
"检索特征占比": "检索特征占比",
"F0曲线文件, 可选, 一行一个音高, 代替默认F0及升降调": "F0曲线文件, 可选, 一行一个音高, 代替默认F0及升降调",
@@ -48,6 +50,9 @@
"以-分隔输入使用的卡号, 例如 0-1-2 使用卡0和卡1和卡2": "以-分隔输入使用的卡号, 例如 0-1-2 使用卡0和卡1和卡2",
"显卡信息": "显卡信息",
"提取音高使用的CPU进程数": "提取音高使用的CPU进程数",
+ "提取音高和处理数据使用的CPU进程数": "提取音高和处理数据使用的CPU进程数",
+ "刷新音色列表和索引路径": "刷新音色列表和索引路径",
+ "常见问题解答": "常见问题解答",
"选择音高提取算法:输入歌声可用pm提速,高质量语音但CPU差可用dio提速,harvest质量更好但慢": "选择音高提取算法:输入歌声可用pm提速,高质量语音但CPU差可用dio提速,harvest质量更好但慢",
"特征提取": "特征提取",
"step3: 填写训练设置, 开始训练模型和索引": "step3: 填写训练设置, 开始训练模型和索引",
@@ -111,5 +116,6 @@
"性能设置": "性能设置",
"开始音频转换": "开始音频转换",
"停止音频转换": "停止音频转换",
+ "后处理重采样至最终采样率,0为不进行重采样": "后处理重采样至最终采样率,0为不进行重采样",
"推理时间(ms):": "推理时间(ms):"
}
diff --git a/i18n/zh_HK.json b/i18n/zh_HK.json
index c98c6c2..d2420ff 100644
--- a/i18n/zh_HK.json
+++ b/i18n/zh_HK.json
@@ -11,6 +11,7 @@
"模型推理": "模型推理",
"推理音色": "推理音色",
"刷新音色列表": "重新整理音色列表",
+    "刷新音色列表和索引路径": "重新整理音色列表和索引路徑",
"卸载音色省显存": "卸載音色節省 VRAM",
"请选择说话人id": "請選擇說話人ID",
"男转女推荐+12key, 女转男推荐-12key, 如果音域爆炸导致音色失真也可以自己调整到合适音域. ": "男性轉女性推薦+12key,女性轉男性推薦-12key,如果音域爆炸導致音色失真也可以自己調整到合適音域。",
@@ -19,9 +20,12 @@
"选择音高提取算法,输入歌声可用pm提速,harvest低音好但巨慢无比": "選擇音高提取演算法,輸入歌聲可用 pm 提速,harvest 低音好但巨慢無比",
"crepe_hop_length": "Crepe Hop Length (Only applies to crepe): Hop length refers to the time it takes for the speaker to jump to a dramatic pitch. Lower hop lengths take more time to infer but are more pitch accurate.",
"特征检索库文件路径": "特徵檢索庫檔案路徑",
+ "自动检测index路径,下拉式选择(dropdown)": "自動檢測index路徑,下拉式選擇(dropdown)",
+ "特征检索库文件路径,为空则使用下拉的选择结果": "特徵檢索庫檔路徑,為空則使用下拉的選擇結果",
"特征文件路径": "特徵檔案路徑",
"检索特征占比": "檢索特徵佔比",
"F0曲线文件, 可选, 一行一个音高, 代替默认F0及升降调": "F0曲線檔案,可選,一行一個音高,代替預設的F0及升降調",
+    ">=3则使用对harvest音高识别的结果使用中值滤波,数值为滤波半径,使用可以削弱哑音": ">=3則使用對harvest音高識別的結果使用中值濾波,數值為濾波半徑,使用可以削弱啞音",
"转换": "轉換",
"输出信息": "輸出訊息",
"输出音频(右下角三个点,点了可以下载)": "輸出音頻(右下角三個點,點了可以下載)",
@@ -48,6 +52,7 @@
"以-分隔输入使用的卡号, 例如 0-1-2 使用卡0和卡1和卡2": "以-分隔輸入使用的卡號, 例如 0-1-2 使用卡0和卡1和卡2",
"显卡信息": "顯示卡資訊",
"提取音高使用的CPU进程数": "提取音高使用的CPU進程數",
+ "提取音高和处理数据使用的CPU进程数": "提取音高和處理數據使用的CPU進程數",
"选择音高提取算法:输入歌声可用pm提速,高质量语音但CPU差可用dio提速,harvest质量更好但慢": "選擇音高提取算法:輸入歌聲可用pm提速,高品質語音但CPU差可用dio提速,harvest品質更好但較慢",
"特征提取": "特徵提取",
"step3: 填写训练设置, 开始训练模型和索引": "步驟3: 填寫訓練設定, 開始訓練模型和索引",
@@ -111,5 +116,7 @@
"性能设置": "效能設定",
"开始音频转换": "開始音訊轉換",
"停止音频转换": "停止音訊轉換",
+ "常见问题解答": "常見問題解答",
+ "后处理重采样至最终采样率,0为不进行重采样": "後處理重採樣至最終採樣率,0為不進行重採樣",
"推理时间(ms):": "推理時間(ms):"
}
diff --git a/i18n/zh_SG.json b/i18n/zh_SG.json
index c98c6c2..efa360f 100644
--- a/i18n/zh_SG.json
+++ b/i18n/zh_SG.json
@@ -11,6 +11,7 @@
"模型推理": "模型推理",
"推理音色": "推理音色",
"刷新音色列表": "重新整理音色列表",
+    "刷新音色列表和索引路径": "重新整理音色列表和索引路徑",
"卸载音色省显存": "卸載音色節省 VRAM",
"请选择说话人id": "請選擇說話人ID",
"男转女推荐+12key, 女转男推荐-12key, 如果音域爆炸导致音色失真也可以自己调整到合适音域. ": "男性轉女性推薦+12key,女性轉男性推薦-12key,如果音域爆炸導致音色失真也可以自己調整到合適音域。",
@@ -20,8 +21,11 @@
"crepe_hop_length": "Crepe Hop Length (Only applies to crepe): Hop length refers to the time it takes for the speaker to jump to a dramatic pitch. Lower hop lengths take more time to infer but are more pitch accurate.",
"特征检索库文件路径": "特徵檢索庫檔案路徑",
"特征文件路径": "特徵檔案路徑",
+ "自动检测index路径,下拉式选择(dropdown)": "自動檢測index路徑,下拉式選擇(dropdown)",
+ "特征检索库文件路径,为空则使用下拉的选择结果": "特徵檢索庫檔路徑,為空則使用下拉的選擇結果",
"检索特征占比": "檢索特徵佔比",
"F0曲线文件, 可选, 一行一个音高, 代替默认F0及升降调": "F0曲線檔案,可選,一行一個音高,代替預設的F0及升降調",
+    ">=3则使用对harvest音高识别的结果使用中值滤波,数值为滤波半径,使用可以削弱哑音": ">=3則使用對harvest音高識別的結果使用中值濾波,數值為濾波半徑,使用可以削弱啞音",
"转换": "轉換",
"输出信息": "輸出訊息",
"输出音频(右下角三个点,点了可以下载)": "輸出音頻(右下角三個點,點了可以下載)",
@@ -48,6 +52,7 @@
"以-分隔输入使用的卡号, 例如 0-1-2 使用卡0和卡1和卡2": "以-分隔輸入使用的卡號, 例如 0-1-2 使用卡0和卡1和卡2",
"显卡信息": "顯示卡資訊",
"提取音高使用的CPU进程数": "提取音高使用的CPU進程數",
+ "提取音高和处理数据使用的CPU进程数": "提取音高和處理數據使用的CPU進程數",
"选择音高提取算法:输入歌声可用pm提速,高质量语音但CPU差可用dio提速,harvest质量更好但慢": "選擇音高提取算法:輸入歌聲可用pm提速,高品質語音但CPU差可用dio提速,harvest品質更好但較慢",
"特征提取": "特徵提取",
"step3: 填写训练设置, 开始训练模型和索引": "步驟3: 填寫訓練設定, 開始訓練模型和索引",
@@ -111,5 +116,7 @@
"性能设置": "效能設定",
"开始音频转换": "開始音訊轉換",
"停止音频转换": "停止音訊轉換",
+ "常见问题解答": "常見問題解答",
+ "后处理重采样至最终采样率,0为不进行重采样": "後處理重採樣至最終採樣率,0為不進行重採樣",
"推理时间(ms):": "推理時間(ms):"
}
diff --git a/i18n/zh_TW.json b/i18n/zh_TW.json
index c98c6c2..efa360f 100644
--- a/i18n/zh_TW.json
+++ b/i18n/zh_TW.json
@@ -11,6 +11,7 @@
"模型推理": "模型推理",
"推理音色": "推理音色",
"刷新音色列表": "重新整理音色列表",
+    "刷新音色列表和索引路径": "重新整理音色列表和索引路徑",
"卸载音色省显存": "卸載音色節省 VRAM",
"请选择说话人id": "請選擇說話人ID",
"男转女推荐+12key, 女转男推荐-12key, 如果音域爆炸导致音色失真也可以自己调整到合适音域. ": "男性轉女性推薦+12key,女性轉男性推薦-12key,如果音域爆炸導致音色失真也可以自己調整到合適音域。",
@@ -20,8 +21,11 @@
"crepe_hop_length": "Crepe Hop Length (Only applies to crepe): Hop length refers to the time it takes for the speaker to jump to a dramatic pitch. Lower hop lengths take more time to infer but are more pitch accurate.",
"特征检索库文件路径": "特徵檢索庫檔案路徑",
"特征文件路径": "特徵檔案路徑",
+ "自动检测index路径,下拉式选择(dropdown)": "自動檢測index路徑,下拉式選擇(dropdown)",
+ "特征检索库文件路径,为空则使用下拉的选择结果": "特徵檢索庫檔路徑,為空則使用下拉的選擇結果",
"检索特征占比": "檢索特徵佔比",
"F0曲线文件, 可选, 一行一个音高, 代替默认F0及升降调": "F0曲線檔案,可選,一行一個音高,代替預設的F0及升降調",
+    ">=3则使用对harvest音高识别的结果使用中值滤波,数值为滤波半径,使用可以削弱哑音": ">=3則使用對harvest音高識別的結果使用中值濾波,數值為濾波半徑,使用可以削弱啞音",
"转换": "轉換",
"输出信息": "輸出訊息",
"输出音频(右下角三个点,点了可以下载)": "輸出音頻(右下角三個點,點了可以下載)",
@@ -48,6 +52,7 @@
"以-分隔输入使用的卡号, 例如 0-1-2 使用卡0和卡1和卡2": "以-分隔輸入使用的卡號, 例如 0-1-2 使用卡0和卡1和卡2",
"显卡信息": "顯示卡資訊",
"提取音高使用的CPU进程数": "提取音高使用的CPU進程數",
+ "提取音高和处理数据使用的CPU进程数": "提取音高和處理數據使用的CPU進程數",
"选择音高提取算法:输入歌声可用pm提速,高质量语音但CPU差可用dio提速,harvest质量更好但慢": "選擇音高提取算法:輸入歌聲可用pm提速,高品質語音但CPU差可用dio提速,harvest品質更好但較慢",
"特征提取": "特徵提取",
"step3: 填写训练设置, 开始训练模型和索引": "步驟3: 填寫訓練設定, 開始訓練模型和索引",
@@ -111,5 +116,7 @@
"性能设置": "效能設定",
"开始音频转换": "開始音訊轉換",
"停止音频转换": "停止音訊轉換",
+ "常见问题解答": "常見問題解答",
+ "后处理重采样至最终采样率,0为不进行重采样": "後處理重採樣至最終採樣率,0為不進行重採樣",
"推理时间(ms):": "推理時間(ms):"
}
diff --git a/infer-web.py b/infer-web.py
index a1e607a..7c20242 100644
--- a/infer-web.py
+++ b/infer-web.py
@@ -11,6 +11,8 @@ now_dir = os.getcwd()
sys.path.append(now_dir)
tmp = os.path.join(now_dir, "TEMP")
shutil.rmtree(tmp, ignore_errors=True)
+shutil.rmtree("%s/runtime/Lib/site-packages/infer_pack" % (now_dir), ignore_errors=True)
+shutil.rmtree("%s/runtime/Lib/site-packages/uvr5_pack" % (now_dir), ignore_errors=True)
os.makedirs(tmp, exist_ok=True)
os.makedirs(os.path.join(now_dir, "logs"), exist_ok=True)
os.makedirs(os.path.join(now_dir, "weights"), exist_ok=True)
@@ -114,10 +116,16 @@ def load_hubert():
weight_root = "weights"
weight_uvr5_root = "uvr5_weights"
+index_root = "logs"
names = []
for name in os.listdir(weight_root):
if name.endswith(".pth"):
names.append(name)
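+# Scan logs/ for .index files to populate the index-path dropdown, skipping the intermediate trained*.index files.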
+index_paths = []
+for root, dirs, files in os.walk(index_root, topdown=False):
+ for name in files:
+ if name.endswith(".index") and "trained" not in name:
+ index_paths.append("%s/%s" % (root, name))
uvr5_names = []
for name in os.listdir(weight_uvr5_root):
if name.endswith(".pth"):
@@ -126,32 +134,39 @@ for name in os.listdir(weight_uvr5_root):
def vc_single(
sid,
- input_audio,
+ input_audio_path,
f0_up_key,
f0_file,
f0_method,
file_index,
+ file_index2,
# file_big_npy,
index_rate,
+ filter_radius,
+ resample_sr,
crepe_hop_length,
): # spk_item, input_audio0, vc_transform0,f0_file,f0method0
global tgt_sr, net_g, vc, hubert_model
- if input_audio is None:
+ if input_audio_path is None:
return "You need to upload an audio", None
f0_up_key = int(f0_up_key)
try:
- audio = load_audio(input_audio, 16000)
+ audio = load_audio(input_audio_path, 16000)
times = [0, 0, 0]
if hubert_model == None:
load_hubert()
if_f0 = cpt.get("f0", 1)
file_index = (
- file_index.strip(" ")
- .strip('"')
- .strip("\n")
- .strip('"')
- .strip(" ")
- .replace("trained", "added")
+ (
+ file_index.strip(" ")
+ .strip('"')
+ .strip("\n")
+ .strip('"')
+ .strip(" ")
+ .replace("trained", "added")
+ )
+ if file_index != ""
+ else file_index2
) # 防止小白写错,自动帮他替换掉
# file_big_npy = (
# file_big_npy.strip(" ").strip('"').strip("\n").strip('"').strip(" ")
@@ -161,6 +176,7 @@ def vc_single(
net_g,
sid,
audio,
+ input_audio_path,
times,
f0_up_key,
f0_method,
@@ -168,13 +184,25 @@ def vc_single(
# file_big_npy,
index_rate,
if_f0,
+ filter_radius,
+ tgt_sr,
+ resample_sr,
crepe_hop_length,
f0_file=f0_file,
)
- print(
- "npy: ", times[0], "s, f0: ", times[1], "s, infer: ", times[2], "s", sep=""
+ if resample_sr >= 16000 and tgt_sr != resample_sr:
+ tgt_sr = resample_sr
+ index_info = (
+ "Using index:%s." % file_index
+ if os.path.exists(file_index)
+ else "Index not used."
)
- return "Success", (tgt_sr, audio_opt)
+ return "Success.\n %s\nTime:\n npy:%ss, f0:%ss, infer:%ss" % (
+ index_info,
+ times[0],
+ times[1],
+ times[2],
+ ), (tgt_sr, audio_opt)
except:
info = traceback.format_exc()
print(info)
@@ -189,8 +217,11 @@ def vc_multi(
f0_up_key,
f0_method,
file_index,
+ file_index2,
# file_big_npy,
index_rate,
+ filter_radius,
+ resample_sr,
):
try:
dir_path = (
@@ -207,14 +238,6 @@ def vc_multi(
traceback.print_exc()
paths = [path.name for path in paths]
infos = []
- file_index = (
- file_index.strip(" ")
- .strip('"')
- .strip("\n")
- .strip('"')
- .strip(" ")
- .replace("trained", "added")
- ) # 防止小白写错,自动帮他替换掉
for path in paths:
info, opt = vc_single(
sid,
@@ -223,17 +246,20 @@ def vc_multi(
None,
f0_method,
file_index,
+ file_index2,
# file_big_npy,
index_rate,
+ filter_radius,
+ resample_sr,
)
- if info == "Success":
+ if "Success" in info:
try:
tgt_sr, audio_opt = opt
wavfile.write(
"%s/%s" % (opt_root, os.path.basename(path)), tgt_sr, audio_opt
)
except:
- info = traceback.format_exc()
+ info += traceback.format_exc()
infos.append("%s->%s" % (os.path.basename(path), info))
yield "\n".join(infos)
yield "\n".join(infos)
@@ -312,7 +338,7 @@ def uvr(model_name, inp_root, save_root_vocal, paths, save_root_ins, agg):
# 一个选项卡全局只能有一个音色
def get_vc(sid):
global n_spk, tgt_sr, net_g, vc, cpt
- if sid == "":
+ if sid == "" or sid == []:
global hubert_model
if hubert_model != None: # 考虑到轮询, 需要加个判断看是否 sid 是由有模型切换到无模型的
print("clean_empty_cache")
@@ -360,7 +386,15 @@ def change_choices():
for name in os.listdir(weight_root):
if name.endswith(".pth"):
names.append(name)
- return {"choices": sorted(names), "__type__": "update"}
+ index_paths = []
+ for root, dirs, files in os.walk(index_root, topdown=False):
+ for name in files:
+ if name.endswith(".index") and "trained" not in name:
+ index_paths.append("%s/%s" % (root, name))
+ return {"choices": sorted(names), "__type__": "update"}, {
+ "choices": sorted(index_paths),
+ "__type__": "update",
+ }
def clean():
@@ -414,7 +448,7 @@ def if_done_multi(done, ps):
done[0] = True
-def preprocess_dataset(trainset_dir, exp_dir, sr, n_p=ncpu):
+def preprocess_dataset(trainset_dir, exp_dir, sr, n_p):
sr = sr_dict[sr]
os.makedirs("%s/logs/%s" % (now_dir, exp_dir), exist_ok=True)
f = open("%s/logs/%s/preprocess.log" % (now_dir, exp_dir), "w")
@@ -450,9 +484,8 @@ def preprocess_dataset(trainset_dir, exp_dir, sr, n_p=ncpu):
# but2.click(extract_f0,[gpus6,np7,f0method8,if_f0_3,trainset_dir4],[info2])
def extract_f0_feature(gpus, n_p, f0method, if_f0, exp_dir, echl):
- print("Proceeding with f0 Feature Extraction")
+
gpus = gpus.split("-")
- print("GPU Card Slot Numbers: " + str(gpus))
os.makedirs("%s/logs/%s" % (now_dir, exp_dir), exist_ok=True)
f = open("%s/logs/%s/extract_f0_feature.log" % (now_dir, exp_dir), "w")
f.close()
@@ -689,7 +722,6 @@ def train_index(exp_dir1):
infos.append("training")
yield "\n".join(infos)
index_ivf = faiss.extract_index_ivf(index) #
- # index_ivf.nprobe = int(np.power(n_ivf,0.3))
index_ivf.nprobe = 1
index.train(big_npy)
faiss.write_index(
@@ -748,7 +780,7 @@ def train1key(
cmd = (
config.python_cmd
+ " trainset_preprocess_pipeline_print.py %s %s %s %s "
- % (trainset_dir4, sr_dict[sr2], ncpu, model_log_dir)
+ % (trainset_dir4, sr_dict[sr2], np7, model_log_dir)
+ str(config.noparallel)
)
yield get_info_str(i18n("step1:正在处理数据"))
@@ -913,7 +945,6 @@ def train1key(
index = faiss.index_factory(256, "IVF%s,Flat" % n_ivf)
yield get_info_str("training index")
index_ivf = faiss.extract_index_ivf(index) #
- # index_ivf.nprobe = int(np.power(n_ivf,0.3))
index_ivf.nprobe = 1
index.train(big_npy)
faiss.write_index(
@@ -1050,8 +1081,7 @@ with gr.Blocks(theme=gr.themes.Soft()) as app:
with gr.TabItem(i18n("模型推理")):
with gr.Row():
sid0 = gr.Dropdown(label=i18n("推理音色"), choices=sorted(names))
- refresh_button = gr.Button(i18n("刷新音色列表"), variant="primary")
- refresh_button.click(fn=change_choices, inputs=[], outputs=[sid0])
+ refresh_button = gr.Button(i18n("刷新音色列表和索引路径"), variant="primary")
clean_button = gr.Button(i18n("卸载音色省显存"), variant="primary")
spk_item = gr.Slider(
minimum=0,
@@ -1079,7 +1109,7 @@ with gr.Blocks(theme=gr.themes.Soft()) as app:
)
input_audio0 = gr.Textbox(
label=i18n("输入待处理音频文件路径(默认是正确格式示例)"),
- value="E:\\codes\\py39\\vits_vc_gpu_train\\todo-songs\\冬之花clip1.wav",
+ value="E:\\codes\\py39\\test-20230416b\\todo-songs\\冬之花clip1.wav",
)
f0method0 = gr.Radio(
label=i18n("选择音高提取算法,输入歌声可用pm提速,harvest低音好但巨慢无比"),
@@ -1095,12 +1125,28 @@ with gr.Blocks(theme=gr.themes.Soft()) as app:
value=128,
interactive=True
)
+ filter_radius0 = gr.Slider(
+ minimum=0,
+ maximum=7,
+ label=i18n(">=3则使用对harvest音高识别的结果使用中值滤波,数值为滤波半径,使用可以削弱哑音"),
+ value=3,
+ step=1,
+ interactive=True,
+ )
with gr.Column():
file_index1 = gr.Textbox(
- label=i18n("特征检索库文件路径"),
- value="E:\\codes\\py39\\vits_vc_gpu_train\\logs\\mi-test-1key\\added_IVF677_Flat_nprobe_7.index",
+ label=i18n("特征检索库文件路径,为空则使用下拉的选择结果"),
+ value="",
interactive=True,
)
+ file_index2 = gr.Dropdown(
+ label=i18n("自动检测index路径,下拉式选择(dropdown)"),
+ choices=sorted(index_paths),
+ interactive=True,
+ )
+ refresh_button.click(
+ fn=change_choices, inputs=[], outputs=[sid0, file_index2]
+ )
# file_big_npy1 = gr.Textbox(
# label=i18n("特征文件路径"),
# value="E:\\codes\py39\\vits_vc_gpu_train\\logs\\mi-test-1key\\total_fea.npy",
@@ -1113,6 +1159,14 @@ with gr.Blocks(theme=gr.themes.Soft()) as app:
value=0.76,
interactive=True,
)
+ resample_sr0 = gr.Slider(
+ minimum=0,
+ maximum=48000,
+ label=i18n("后处理重采样至最终采样率,0为不进行重采样"),
+ value=0,
+ step=1,
+ interactive=True,
+ )
f0_file = gr.File(label=i18n("F0曲线文件, 可选, 一行一个音高, 代替默认F0及升降调"))
but0 = gr.Button(i18n("转换"), variant="primary")
with gr.Column():
@@ -1127,8 +1181,11 @@ with gr.Blocks(theme=gr.themes.Soft()) as app:
f0_file,
f0method0,
file_index1,
+ file_index2,
# file_big_npy1,
index_rate1,
+ filter_radius0,
+ resample_sr0,
crepe_hop_length
],
[vc_output1, vc_output2],
@@ -1149,10 +1206,23 @@ with gr.Blocks(theme=gr.themes.Soft()) as app:
value="pm",
interactive=True,
)
+ filter_radius1 = gr.Slider(
+ minimum=0,
+ maximum=7,
+ label=i18n(">=3则使用对harvest音高识别的结果使用中值滤波,数值为滤波半径,使用可以削弱哑音"),
+ value=3,
+ step=1,
+ interactive=True,
+ )
with gr.Column():
- file_index2 = gr.Textbox(
- label=i18n("特征检索库文件路径"),
- value="E:\\codes\\py39\\vits_vc_gpu_train\\logs\\mi-test-1key\\added_IVF677_Flat_nprobe_7.index",
+ file_index3 = gr.Textbox(
+ label=i18n("特征检索库文件路径,为空则使用下拉的选择结果"),
+ value="",
+ interactive=True,
+ )
+ file_index4 = gr.Dropdown(
+ label=i18n("自动检测index路径,下拉式选择(dropdown)"),
+ choices=sorted(index_paths),
interactive=True,
)
# file_big_npy2 = gr.Textbox(
@@ -1167,10 +1237,18 @@ with gr.Blocks(theme=gr.themes.Soft()) as app:
value=1,
interactive=True,
)
+ resample_sr1 = gr.Slider(
+ minimum=0,
+ maximum=48000,
+ label=i18n("后处理重采样至最终采样率,0为不进行重采样"),
+ value=0,
+ step=1,
+ interactive=True,
+ )
with gr.Column():
dir_input = gr.Textbox(
label=i18n("输入待处理音频文件夹路径(去文件管理器地址栏拷就行了)"),
- value="E:\codes\py39\\vits_vc_gpu_train\\todo-songs",
+ value="E:\codes\py39\\test-20230416b\\todo-songs",
)
inputs = gr.File(
file_count="multiple", label=i18n("也可批量输入音频文件, 二选一, 优先读文件夹")
@@ -1186,9 +1264,12 @@ with gr.Blocks(theme=gr.themes.Soft()) as app:
inputs,
vc_transform1,
f0method1,
- file_index2,
+ file_index3,
+ file_index4,
# file_big_npy2,
index_rate2,
+ filter_radius1,
+ resample_sr1,
],
[vc_output3],
)
@@ -1203,7 +1284,7 @@ with gr.Blocks(theme=gr.themes.Soft()) as app:
with gr.Column():
dir_wav_input = gr.Textbox(
label=i18n("输入待处理音频文件夹路径"),
- value="E:\\codes\\py39\\vits_vc_gpu_train\\todo-songs",
+ value="E:\\codes\\py39\\test-20230416b\\todo-songs\\todo-songs",
)
wav_inputs = gr.File(
file_count="multiple", label=i18n("也可批量输入音频文件, 二选一, 优先读文件夹")
@@ -1257,6 +1338,14 @@ with gr.Blocks(theme=gr.themes.Soft()) as app:
value=True,
interactive=True,
)
+ np7 = gr.Slider(
+ minimum=0,
+ maximum=ncpu,
+ step=1,
+ label=i18n("提取音高和处理数据使用的CPU进程数"),
+ value=ncpu,
+ interactive=True,
+ )
with gr.Group(): # 暂时单人的, 后面支持最多4人的#数据处理
gr.Markdown(
value=i18n(
@@ -1278,7 +1367,7 @@ with gr.Blocks(theme=gr.themes.Soft()) as app:
but1 = gr.Button(i18n("处理数据"), variant="primary")
info1 = gr.Textbox(label=i18n("输出信息"), value="")
but1.click(
- preprocess_dataset, [trainset_dir4, exp_dir1, sr2], [info1]
+ preprocess_dataset, [trainset_dir4, exp_dir1, sr2, np7], [info1]
)
with gr.Group():
gr.Markdown(value=i18n("step2b: 使用CPU提取音高(如果模型带音高), 使用GPU提取特征(选择卡号)"))
@@ -1291,14 +1380,6 @@ with gr.Blocks(theme=gr.themes.Soft()) as app:
)
gpu_info9 = gr.Textbox(label=i18n("显卡信息"), value=gpu_info)
with gr.Column():
- np7 = gr.Slider(
- minimum=0,
- maximum=ncpu,
- step=1,
- label=i18n("提取音高使用的CPU进程数"),
- value=ncpu,
- interactive=True,
- )
f0method8 = gr.Radio(
label=i18n(
"选择音高提取算法:输入歌声可用pm提速,高质量语音但CPU差可用dio提速,harvest质量更好但慢"
@@ -1556,6 +1637,19 @@ with gr.Blocks(theme=gr.themes.Soft()) as app:
butOnnx = gr.Button(i18n("导出Onnx模型"), variant="primary")
butOnnx.click(export_onnx, [ckpt_dir, onnx_dir, moevs], infoOnnx)
+ tab_faq = i18n("常见问题解答")
+ with gr.TabItem(tab_faq):
+ try:
+ if tab_faq == "常见问题解答":
+ with open("docs/faq.md", "r", encoding="utf8") as f:
+ info = f.read()
+ else:
+                with open("docs/faq_en.md", "r", encoding="utf8") as f:
+ info = f.read()
+ gr.Markdown(value=info)
+ except:
+ gr.Markdown(traceback.format_exc())
+
# with gr.TabItem(i18n("招募音高曲线前端编辑器")):
# gr.Markdown(value=i18n("加开发群联系我xxxxx"))
# with gr.TabItem(i18n("点击查看交流、问题反馈群号")):
diff --git a/train_nsf_sim_cache_sid_load_pretrain.py b/train_nsf_sim_cache_sid_load_pretrain.py
index 4ba6b65..6078490 100644
--- a/train_nsf_sim_cache_sid_load_pretrain.py
+++ b/train_nsf_sim_cache_sid_load_pretrain.py
@@ -1,6 +1,7 @@
import sys, os
now_dir = os.getcwd()
+sys.path.append(os.path.join(now_dir))
sys.path.append(os.path.join(now_dir, "train"))
import utils
diff --git a/vc_infer_pipeline.py b/vc_infer_pipeline.py
index c6da083..06e2906 100644
--- a/vc_infer_pipeline.py
+++ b/vc_infer_pipeline.py
@@ -2,13 +2,30 @@ import numpy as np, parselmouth, torch, pdb
from time import time as ttime
import torch.nn.functional as F
import torchcrepe # Fork feature. Use the crepe f0 algorithm. New dependency (pip install torchcrepe)
+from torch import Tensor
import scipy.signal as signal
-import pyworld, os, traceback, faiss
+import pyworld, os, traceback, faiss, librosa
from scipy import signal
-from torch import Tensor # Fork Feature. Used for pitch prediction for the torchcrepe f0 inference computation
+from functools import lru_cache
bh, ah = signal.butter(N=5, Wn=48, btype="high", fs=16000)
+input_audio_path2wav = {}
+
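+# Cache harvest f0 per (audio path, parameters); the waveform itself is passed through the
+# module-level dict above because numpy arrays are not hashable and cannot be lru_cache keys.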
+@lru_cache
+def cache_harvest_f0(input_audio_path, fs, f0max, f0min, frame_period):
+ audio = input_audio_path2wav[input_audio_path]
+ f0, t = pyworld.harvest(
+ audio,
+ fs=fs,
+ f0_ceil=f0max,
+ f0_floor=f0min,
+ frame_period=frame_period,
+ )
+ f0 = pyworld.stonemask(audio, f0, t, fs)
+ return f0
+
+
class VC(object):
def __init__(self, tgt_sr, config):
self.x_pad, self.x_query, self.x_center, self.x_max, self.is_half = (
@@ -28,60 +45,6 @@ class VC(object):
self.t_max = self.sr * self.x_max # 免查询时长阈值
self.device = config.device
- #region f0 Overhaul Region
- # Fork Feature: Get the best torch device to use for f0 algorithms that require a torch device. Will return the type (torch.device)
- def get_optimal_torch_device(self, index: int = 0) -> torch.device:
- # Get cuda device
- if torch.cuda.is_available():
- return torch.device(f"cuda:{index % torch.cuda.device_count()}") # Very fast
- elif torch.backends.mps.is_available():
- return torch.device("mps")
- # Insert an else here to grab "xla" devices if available. TO DO later. Requires the torch_xla.core.xla_model library
- # Else wise return the "cpu" as a torch device,
- return torch.device("cpu")
-
- # Get the f0 via parselmouth computation
- def get_f0_pm_computation(self, x, time_step, f0_min, f0_max, p_len):
- f0 = (
- parselmouth.Sound(x, self.sr)
- .to_pitch_ac(
- time_step=time_step / 1000,
- voicing_threshold=0.6,
- pitch_floor=f0_min,
- pitch_ceiling=f0_max,
- )
- .selected_array["frequency"]
- )
- pad_size = (p_len - len(f0) + 1) // 2
- if pad_size > 0 or p_len - len(f0) - pad_size > 0:
- f0 = np.pad(
- f0, [[pad_size, p_len - len(f0) - pad_size]], mode="constant"
- )
- return f0
-
- # Get the f0 via the pyworld computation. Fork Feature +dio along with harvest
- def get_f0_pyworld_computation(self, x, f0_min, f0_max, f0_type):
- if f0_type == "harvest":
- f0, t = pyworld.harvest(
- x.astype(np.double),
- fs=self.sr,
- f0_ceil=f0_max,
- f0_floor=f0_min,
- frame_period=10,
- )
- elif f0_type == "dio":
- f0, t = pyworld.dio(
- x.astype(np.double),
- fs=self.sr,
- f0_ceil=f0_max,
- f0_floor=f0_min,
- frame_period=10,
- )
- f0 = pyworld.stonemask(x.astype(np.double), f0, t, self.sr)
- f0 = signal.medfilt(f0, 3)
- return f0
-
- # Fork Feature: Get the f0 via the crepe algorithm from torchcrepe
def get_f0_crepe_computation(
self,
x,
@@ -122,27 +85,50 @@ class VC(object):
)
f0 = np.nan_to_num(target)
return f0 # Resized f0
-
- #endregion
- def get_f0(self, x, p_len, f0_up_key, f0_method, crepe_hop_length, inp_f0=None):
+ def get_f0(
+ self,
+ input_audio_path,
+ x,
+ p_len,
+ f0_up_key,
+ f0_method,
+ filter_radius,
+ crepe_hop_length,
+ inp_f0=None,
+ ):
+ global input_audio_path2wav
time_step = self.window / self.sr * 1000
f0_min = 50
f0_max = 1100
f0_mel_min = 1127 * np.log(1 + f0_min / 700)
f0_mel_max = 1127 * np.log(1 + f0_max / 700)
if f0_method == "pm":
- f0 = self.get_f0_pm_computation(x, time_step, f0_min, f0_max, p_len)
+ f0 = (
+ parselmouth.Sound(x, self.sr)
+ .to_pitch_ac(
+ time_step=time_step / 1000,
+ voicing_threshold=0.6,
+ pitch_floor=f0_min,
+ pitch_ceiling=f0_max,
+ )
+ .selected_array["frequency"]
+ )
+ pad_size = (p_len - len(f0) + 1) // 2
+ if pad_size > 0 or p_len - len(f0) - pad_size > 0:
+ f0 = np.pad(
+ f0, [[pad_size, p_len - len(f0) - pad_size]], mode="constant"
+ )
elif f0_method == "harvest":
- f0 = self.get_f0_pyworld_computation(x, f0_min, f0_max, "harvest")
- elif f0_method == "dio": # Fork Feature
- f0 = self.get_f0_pyworld_computation(x, f0_min, f0_max, "dio")
+ input_audio_path2wav[input_audio_path] = x.astype(np.double)
+ f0 = cache_harvest_f0(input_audio_path, self.sr, f0_max, f0_min, 10)
+ if filter_radius > 2:
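+                # note: the kernel size is fixed at 3 here; filter_radius currently only controls whether filtering is applied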
+ f0 = signal.medfilt(f0, 3)
elif f0_method == "crepe": # Fork Feature: Adding a new f0 algorithm called crepe
f0 = self.get_f0_crepe_computation(x, f0_min, f0_max, p_len, crepe_hop_length)
elif f0_method == "crepe-tiny": # For Feature add crepe-tiny model
f0 = self.get_f0_crepe_computation(x, f0_min, f0_max, p_len, crepe_hop_length, "tiny")
- print("Using the following f0 method: " + f0_method)
f0 *= pow(2, f0_up_key / 12)
# with open("test.txt","w")as f:f.write("\n".join([str(i)for i in f0.tolist()]))
tf0 = self.sr // self.window # 每秒f0点数
@@ -243,7 +229,6 @@ class VC(object):
.data.cpu()
.float()
.numpy()
- .astype(np.int16)
)
else:
audio1 = (
@@ -251,7 +236,6 @@ class VC(object):
.data.cpu()
.float()
.numpy()
- .astype(np.int16)
)
del feats, p_len, padding_mask
if torch.cuda.is_available():
@@ -267,6 +251,7 @@ class VC(object):
net_g,
sid,
audio,
+ input_audio_path,
times,
f0_up_key,
f0_method,
@@ -274,6 +259,9 @@ class VC(object):
# file_big_npy,
index_rate,
if_f0,
+ filter_radius,
+ tgt_sr,
+ resample_sr,
crepe_hop_length,
f0_file=None,
):
@@ -329,7 +317,16 @@ class VC(object):
sid = torch.tensor(sid, device=self.device).unsqueeze(0).long()
pitch, pitchf = None, None
if if_f0 == 1:
- pitch, pitchf = self.get_f0(audio_pad, p_len, f0_up_key, f0_method, crepe_hop_length, inp_f0)
+ pitch, pitchf = self.get_f0(
+ input_audio_path,
+ audio_pad,
+ p_len,
+ f0_up_key,
+ f0_method,
+ filter_radius,
+ crepe_hop_length,
+ inp_f0,
+ )
pitch = pitch[:p_len]
pitchf = pitchf[:p_len]
if self.device == "mps":
@@ -402,6 +399,11 @@ class VC(object):
)[self.t_pad_tgt : -self.t_pad_tgt]
)
audio_opt = np.concatenate(audio_opt)
+ if resample_sr >= 16000 and tgt_sr != resample_sr:
+ audio_opt = librosa.resample(
+ audio_opt, orig_sr=tgt_sr, target_sr=resample_sr
+ )
+ audio_opt = audio_opt.astype(np.int16)
del pitch, pitchf, sid
if torch.cuda.is_available():
torch.cuda.empty_cache()