RMVPE training & GUI QOL changes

RMVPE training, One Click Training removal, formant shifting, and a Stop Training button have all been added.
kalomaze
2023-07-22 18:15:38 -05:00
committed by GitHub
13 changed files with 3932 additions and 2480 deletions


@@ -1,3 +1,99 @@
# Features:
- Experimental formant shifting using StftPitchShift (tried using Praat with praatio, but to no avail)
- Added `Stop Training` button when training, no need to restart RVC every time you want to stop the training of a model!
- Auto-detect the index path for the selected model, plus general path auto-detection: no more hardcoded defaults like `E:\codes\py39\vits_vc_gpu_train\logs\mi-test-1key\total_fea.npy`. The root dir and its subfolders are resolved with
```python
os.path.abspath(os.getcwd())
```
- Audio file dropdown populated by auto-detecting files in the `/audios/` folder
- More stable Gradio version (3.34.0) with theme support
- Removed the `One Click Training` button from the `Training` tab, since it was glitchy and confused a lot of users.
- Changed default training settings to be more optimal for newer users.
- Auto-open TensorBoard localhost URL when `tensor-launch.py` is executed
- RMVPE implemented in both inferencing and training (the one in the `Training` tab doesn't work properly yet and needs some additional work)
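The path auto-detection bullets above boil down to resolving the root dir at runtime and scanning subfolders. A rough sketch (the `logs/<model>` layout and the helper name are assumptions for illustration, not the exact code):

```python
import glob
import os

def autodetect_paths(model_name):
    # Root dir is resolved at runtime instead of a hardcoded default path.
    root = os.path.abspath(os.getcwd())
    # Assumed layout: trained index files live under logs/<model_name>/.
    index_candidates = glob.glob(os.path.join(root, "logs", model_name, "*.index"))
    # The audio dropdown is filled from whatever sits in audios/.
    audio_dir = os.path.join(root, "audios")
    audios = sorted(os.listdir(audio_dir)) if os.path.isdir(audio_dir) else []
    return index_candidates, audios
```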
## Installation:
1. Either extract the archive directly or use `git clone`
2. Run `installstft.bat`. It'll automatically:
- Upgrade/Downgrade Gradio if its version isn't 3.34.0;
- Download `rmvpe.pt` if it hasn't already been installed;
- Install `StftPitchShift` if it hasn't been already installed;
3. Done! You're good to go with this tweaked RVC-WebUI :)
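The version checks in step 2 follow a common pin-check pattern; here is a Python sketch of the idea (the helper name is made up, and this is not the actual batch script):

```python
from importlib.metadata import PackageNotFoundError, version

def pin_needed(pkg, pinned):
    """True when pkg is missing or not at the pinned version."""
    try:
        return version(pkg) != pinned
    except PackageNotFoundError:
        return True

# installstft.bat would then run pip for whichever package needs it, e.g.:
#   pip install gradio==3.34.0         (if pin_needed("gradio", "3.34.0"))
#   pip install stftpitchshift==1.5.1  (if pin_needed("stftpitchshift", "1.5.1"))
# rmvpe.pt is a model checkpoint, not a package, so that step is just a
# file-existence check followed by a download when absent.
```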
## Change Gradio Theme:
- [OPTIONAL] Change Gradio's theme:
1. Open `infer-web.py` in any code/text editing software (e.g. `notepad++`, `notepad`, `vscode`, etc)
2a. Press Ctrl+F and search for `with gr.Blocks(`, then select the occurrence that isn't commented out
2b. Alternatively, go to line `1842`, where you'll see `with gr.Blocks(theme='HaleyCH/HaleyCH_Theme') as app:`
3. Go to [Gradio Theme Gallery](https://huggingface.co/spaces/gradio/theme-gallery):
3.1 Select any theme you like (e.g. [this one](https://huggingface.co/spaces/freddyaboulton/dracula_revamped))
3.2 Look at the top of the page
![image](https://github.com/alexlnkp/Mangio-RVC-Tweaks/assets/79400603/59e3e6a9-bdda-4ede-8161-00ee957c1715)
3.3 Copy the theme variable (in this case, it's `theme='freddyaboulton/dracula_revamped'`)
4. Replace `theme='HaleyCH/HaleyCH_Theme'` in `infer-web.py` with the theme value you copied from the [Gradio Theme Gallery](https://huggingface.co/spaces/gradio/theme-gallery)
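Step 4 is a plain string swap inside `infer-web.py`. As a hypothetical one-off patcher (the function name is made up, the theme is the gallery example from above):

```python
def set_theme(source, new_theme):
    # The stock line in infer-web.py reads:
    #   with gr.Blocks(theme='HaleyCH/HaleyCH_Theme') as app:
    # so swapping the theme is a literal string replacement.
    return source.replace("theme='HaleyCH/HaleyCH_Theme'",
                          "theme='%s'" % new_theme)
```

You would read `infer-web.py`, run it through `set_theme`, and write the result back.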
### Current Todo-list:
- [x] Fix the `Unload voice to save GPU memory` button traceback
- [ ] Add accordions so Firefox users get a much more compact GUI instead of [this](https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/assets/79400603/67e0cc08-82a2-4dc3-86cf-e23d1dcad9f8).
- [ ] Fix the odd way the Median Filtering slider value is used
- [ ] Replace the regular refresh buttons with the tiny ones from [AUTOMATIC1111's Stable Diffusion](https://github.com/AUTOMATIC1111/stable-diffusion-webui)
![image](https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/assets/79400603/fe37051e-4c95-4d30-9254-87d44436bb9e)
- [ ] Add a way to change Gradio's theme from the WebUI itself, like in [AUTOMATIC1111's Stable Diffusion](https://github.com/AUTOMATIC1111/stable-diffusion-webui)
![image](https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/assets/79400603/7b93e167-564a-41d3-9a76-ded20063cdc3)
- [ ] Integrate Praat into the GUI for easier f0 curve file manipulation
# Screenshots:
- ## Inference Tab:
![image](https://github.com/Mangio621/Mangio-RVC-Fork/assets/79400603/107aa15a-4e8d-4f77-a327-45f35a235fcf)
- ## UVR Tab:
![image](https://github.com/Mangio621/Mangio-RVC-Fork/assets/79400603/7e57242a-4950-40c8-bf2a-8f77e992af26)
- ## Training Tab:
![image](https://github.com/Mangio621/Mangio-RVC-Fork/assets/79400603/a19ce156-5532-4761-aa06-8a537f80c368)
- ## Ckpt-Processing Tab:
![image](https://github.com/Mangio621/Mangio-RVC-Fork/assets/79400603/0cdc285e-a184-48f3-92a7-65f6120caf2f)
The rest of the tabs are left untouched code-wise.
# Formant Shift:
![image](https://github.com/Mangio621/Mangio-RVC-Fork/assets/79400603/300ebce2-36c7-4761-b1dd-b31403ad2cd1)
- ### Click the `Apply` button every time you change the values before inferencing.
- ### Only `wav` files are supported so far; it is also very slow, so be patient.
- ### If you added a new `preset.txt` to the `\formantshiftcfg\` folder, click the button with the refresh emoji
- ### If the selected preset file got edited, pressing the refresh button reloads its values from the file
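A `\formantshiftcfg\` preset is just two lines, quefrency then timbre (the `f2m.txt` and `m2f.txt` files further down hold exactly that). Reading one can be sketched as (helper name made up):

```python
def load_formant_preset(path):
    # Line 1 = quefrency, line 2 = timbre (e.g. f2m.txt holds 8.0 and -1.2).
    with open(path) as f:
        lines = f.read().splitlines()
    return float(lines[0]), float(lines[1])
```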
<div align="center">
<h1>Mangio-RVC-Fork with v2 Support! 💻 </h1>
A fork of an easy-to-use SVC framework based on VITS with top1 retrieval 💯. This fork additionally provides a CLI interface, more f0 methods to use, and a personalized 'hybrid' f0 estimation method using nanmedian. <br><br>


@@ -17,6 +17,12 @@ from multiprocessing import Process
exp_dir = sys.argv[1]
f = open("%s/extract_f0_feature.log" % exp_dir, "a+")
DoFormant = False
with open('formanting.txt', 'r') as fvf:
content = fvf.readlines()
Quefrency, Timbre = content[1].split('\n')[0], content[2].split('\n')[0]
def printt(strr):
print(strr)
@@ -199,7 +205,7 @@ class FeatureInput(object):
return f0_median_hybrid
def compute_f0(self, path, f0_method, crepe_hop_length):
x = load_audio(path, self.fs)
x = load_audio(path, self.fs, DoFormant, Quefrency, Timbre)
p_len = x.shape[0] // self.hop
if f0_method == "pm":
time_step = 160 / 16000 * 1000
@@ -227,6 +233,14 @@ class FeatureInput(object):
frame_period=1000 * self.hop / self.fs,
)
f0 = pyworld.stonemask(x.astype(np.double), f0, t, self.fs)
elif f0_method == "rmvpe":
if hasattr(self, "model_rmvpe") == False:
from rmvpe import RMVPE
print("loading rmvpe model")
self.model_rmvpe = RMVPE(
"rmvpe.pt", is_half=False, device="cuda:0"
)
f0 = self.model_rmvpe.infer_from_audio(x, thred=0.03)
elif f0_method == "dio":
f0, t = pyworld.dio(
x.astype(np.double),

formanting.txt Normal file

@@ -0,0 +1,3 @@
False
8.0
1.2

formantshiftcfg/f2m.txt Normal file

@@ -0,0 +1,2 @@
8.0
-1.2

formantshiftcfg/m2f.txt Normal file

@@ -0,0 +1,2 @@
8.0
1.2


@@ -0,0 +1,2 @@
16.0
9.8

File diff suppressed because it is too large


@@ -1,8 +1,13 @@
import ffmpeg
import numpy as np
#import praatio
#import praatio.praat_scripts
import os
#from os.path import join
#praatEXE = join('.',os.path.abspath(os.getcwd()) + r"\Praat.exe")
def load_audio(file, sr):
def load_audio(file, sr, DoFormant, Quefrency, Timbre):
try:
# https://github.com/openai/whisper/blob/main/whisper/audio.py#L26
# This launches a subprocess to decode audio while down-mixing and resampling as necessary.
@@ -10,11 +15,44 @@ def load_audio(file, sr):
file = (
file.strip(" ").strip('"').strip("\n").strip('"').strip(" ")
) # 防止小白拷路径头尾带了空格和"和回车
out, _ = (
ffmpeg.input(file, threads=0)
.output("-", format="f32le", acodec="pcm_f32le", ac=1, ar=sr)
.run(cmd=["ffmpeg", "-nostdin"], capture_stdout=True, capture_stderr=True)
file_formanted = (
file.strip(" ").strip('"').strip("\n").strip('"').strip(" ")
)
with open('formanting.txt', 'r') as fvf:
content = fvf.readlines()
if 'True' in content[0].split('\n')[0]:
#print("true")
DoFormant = True
Quefrency, Timbre = content[1].split('\n')[0], content[2].split('\n')[0]
else:
#print("not true")
DoFormant = False
if DoFormant:
#os.system(f"stftpitchshift -i {file} -q {Quefrency} -t {Timbre} -o {file_formanted}")
#print('stftpitchshift -i "%s" -p 1.0 --rms -w 128 -v 8 -q %s -t %s -o "%s"' % (file, Quefrency, Timbre, file_formanted))
print("formanting...")
os.system('stftpitchshift -i "%s" -q %s -t %s -o "%sFORMANTED"' % (file, Quefrency, Timbre, file_formanted))
print("formanted!")
#filepraat = (os.path.abspath(os.getcwd()) + '\\' + file).replace('/','\\')
#file_formantedpraat = ('"' + os.path.abspath(os.getcwd()) + '/' + 'formanted'.join(file_formanted) + '"').replace('/','\\')
out, _ = (
ffmpeg.input('%sFORMANTED%s' % (file_formanted, '.wav'), threads=0)
.output("-", format="f32le", acodec="pcm_f32le", ac=1, ar=sr)
.run(cmd=["ffmpeg", "-nostdin"], capture_stdout=True, capture_stderr=True)
)
os.remove('%sFORMANTED%s' % (file_formanted, '.wav'))
else:
out, _ = (
ffmpeg.input(file, threads=0)
.output("-", format="f32le", acodec="pcm_f32le", ac=1, ar=sr)
.run(cmd=["ffmpeg", "-nostdin"], capture_stdout=True, capture_stderr=True)
)
except Exception as e:
raise RuntimeError(f"Failed to load audio: {e}")
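The diff above keys everything off a three-line `formanting.txt` flag file (enable flag, quefrency, timbre — see its contents earlier in this page). Its parsing logic, pulled out as a standalone helper for clarity (the function name is made up):

```python
def read_formanting(path="formanting.txt"):
    # Line 1: "True"/"False" toggle, line 2: quefrency, line 3: timbre.
    with open(path) as f:
        lines = f.read().splitlines()
    do_formant = lines[0].strip() == "True"
    return do_formant, float(lines[1]), float(lines[2])
```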


@@ -45,3 +45,4 @@ httpx==0.23.0
#onnxruntime-gpu
torchcrepe==0.0.20
fastapi==0.88
stftpitchshift==1.5.1

stop.txt Normal file


@@ -568,7 +568,28 @@ def train_and_evaluate(
),
)
)
with open("stop.txt", "r+") as tostop:
content = tostop.read()
if 'stop' in content:
logger.info("Stop Button was pressed. The program is closed.")
if hasattr(net_g, "module"):
ckpt = net_g.module.state_dict()
else:
ckpt = net_g.state_dict()
logger.info(
"saving final ckpt:%s"
% (
savee(
ckpt, hps.sample_rate, hps.if_f0, hps.name, epoch, hps.version, hps
)
)
)
tostop.truncate(0)
tostop.writelines("not")
os._exit(2333333)
if rank == 0:
logger.info("====> Epoch: {} {}".format(epoch, epoch_recorder.record()))
if epoch >= hps.total_epoch and rank == 0:
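The Stop Training check in the diff above follows a simple flag-file polling pattern: the button writes `stop` into `stop.txt`, and the training loop reads it each pass and resets it once seen. A self-contained sketch of that pattern (the helper name is made up):

```python
def stop_requested(path="stop.txt"):
    # The Stop Training button writes "stop" into the file; the training
    # loop polls it and resets the flag so the next run starts normally.
    try:
        with open(path, "r+") as f:
            if "stop" in f.read():
                f.seek(0)
                f.truncate()
                f.write("not")
                return True
    except FileNotFoundError:
        pass
    return False
```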


@@ -17,9 +17,16 @@ import multiprocessing
from my_utils import load_audio
import tqdm
DoFormant = False
Quefrency = 0.0
Timbre = 0.0
mutex = multiprocessing.Lock()
f = open("%s/preprocess.log" % exp_dir, "a+")
with open('formanting.txt', 'r') as fvf:
content = fvf.readlines()
Quefrency, Timbre = content[1].split('\n')[0], content[2].split('\n')[0]
def println(strr):
mutex.acquire()
@@ -77,7 +84,7 @@ class PreProcess:
def pipeline(self, path, idx0):
try:
audio = load_audio(path, self.sr)
audio = load_audio(path, self.sr, DoFormant, Quefrency, Timbre)
# zero phased digital filter cause pre-ringing noise...
# audio = signal.filtfilt(self.bh, self.ah, audio)
audio = signal.lfilter(self.bh, self.ah, audio)