diff --git a/README.md b/README.md
index a9b06d7..629461e 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,5 @@
 # VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild
-[Demo](https://jasonppy.github.io/VoiceCraft_web) [Paper](https://jasonppy.github.io/assets/pdfs/VoiceCraft.pdf)
-
+[![Paper](https://img.shields.io/badge/arXiv-2301.12503-brightgreen.svg?style=flat-square)](https://jasonppy.github.io/assets/pdfs/VoiceCraft.pdf) [![githubio](https://img.shields.io/badge/GitHub.io-Audio_Samples-blue?logo=Github&style=flat-square)](https://jasonppy.github.io/VoiceCraft_web/) [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/pyp1/VoiceCraft_gradio) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1IOjpglQyMTO2C3Y94LD9FY0Ocn-RJRg6?usp=sharing)
 
 ### TL;DR
 VoiceCraft is a token infilling neural codec language model, that achieves state-of-the-art performance on both **speech editing** and **zero-shot text-to-speech (TTS)** on in-the-wild data including audiobooks, internet videos, and podcasts.
@@ -8,20 +7,22 @@ VoiceCraft is a token infilling neural codec language model, that achieves state
 To clone or edit an unseen voice, VoiceCraft needs only a few seconds of reference.
 
 ## How to run inference
-There are three ways:
+There are three ways (besides running Gradio in Colab):
 
-1. with Google Colab. see [quickstart colab](#quickstart-colab)
+1. More flexible inference beyond Gradio UI in Google Colab. see [quickstart colab](#quickstart-colab)
 2. with docker. see [quickstart docker](#quickstart-docker)
-3. without docker. see [environment setup](#environment-setup)
+3. without docker. see [environment setup](#environment-setup). You can also run gradio locally if you choose this option
 
 When you are inside the docker image or you have installed all dependencies, Checkout [`inference_tts.ipynb`](./inference_tts.ipynb).
 
 If you want to do model development such as training/finetuning, I recommend following [envrionment setup](#environment-setup) and [training](#training).
 
 ## News
-:star: 03/28/2024: Model weights for giga330M and giga830M are up on HuggingFace🤗 [here](https://huggingface.co/pyp1/VoiceCraft/tree/main)!
+:star: 04/11/2024: VoiceCraft Gradio is now available on HuggingFace Spaces [here](https://huggingface.co/spaces/pyp1/VoiceCraft_gradio)! Major thanks to [@zuev-stepan](https://github.com/zuev-stepan), [@Sewlell](https://github.com/Sewlell), [@pgsoar](https://github.com/pgosar) [@Ph0rk0z](https://github.com/Ph0rk0z).
 
-:star: 04/05/2024: I finetuned giga330M with the TTS objective on gigaspeech and 1/5 of librilight, the model outperforms giga830M on TTS. Weights are [here](https://huggingface.co/pyp1/VoiceCraft/tree/main). Make sure maximal prompt + generation length <= 16 seconds (due to our limited compute, we had to drop utterances longer than 16s in training data)
+:star: 04/05/2024: I finetuned giga330M with the TTS objective on gigaspeech and 1/5 of librilight. Weights are [here](https://huggingface.co/pyp1/VoiceCraft/tree/main). Make sure maximal prompt + generation length <= 16 seconds (due to our limited compute, we had to drop utterances longer than 16s in training data). Even stronger models forthcomming, stay tuned!
+
+:star: 03/28/2024: Model weights for giga330M and giga830M are up on HuggingFace🤗 [here](https://huggingface.co/pyp1/VoiceCraft/tree/main)!
 
 ## TODO
 - [x] Codebase upload
@@ -30,9 +31,12 @@ If you want to do model development such as training/finetuning, I recommend fol
 - [x] Training guidance
 - [x] RealEdit dataset and training manifest
 - [x] Model weights (giga330M.pth, giga830M.pth, and gigaHalfLibri330M_TTSEnhanced_max16s.pth)
-- [x] Write colab notebooks for better hands-on experience
-- [ ] HuggingFace Spaces demo
-- [ ] Better guidance on training/finetuning
+- [x] Better guidance on training/finetuning
+- [x] Colab notebooks
+- [x] HuggingFace Spaces demo
+- [ ] Command line
+- [ ] Improve efficiency
+
 
 ## QuickStart Colab
 
@@ -109,7 +113,7 @@ Checkout [`inference_speech_editing.ipynb`](./inference_speech_editing.ipynb) an
 ## Gradio
 ### Run in colab
 
-[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/zuev-stepan/VoiceCraft-gradio/blob/feature/colab-notebook/voicecraft-gradio-colab.ipynb)
+[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1IOjpglQyMTO2C3Y94LD9FY0Ocn-RJRg6?usp=sharing)
 
 ### Run locally
 After environment setup install additional dependencies:
diff --git a/demo/pam.wav b/demo/pam.wav
new file mode 100644
index 0000000..2e39c45
Binary files /dev/null and b/demo/pam.wav differ
diff --git a/gradio_app.py b/gradio_app.py
index 0f82600..113f0ab 100644
--- a/gradio_app.py
+++ b/gradio_app.py
@@ -13,7 +13,7 @@ import random
 import uuid
 
 
-DEMO_PATH = os.getenv("DEMO_PATH", ".demo")
+DEMO_PATH = os.getenv("DEMO_PATH", "./demo")
 TMP_PATH = os.getenv("TMP_PATH", "./demo/temp")
 MODELS_PATH = os.getenv("MODELS_PATH", "./pretrained_models")
 device = "cuda" if torch.cuda.is_available() else "cpu"
@@ -371,20 +371,20 @@ demo_original_transcript = " But when I had approached so near to them, the comm
 
 demo_text = {
     "TTS": {
-        "smart": "I cannot believe that the same model can also do text to speech synthesis as well!",
-        "regular": "But when I had approached so near to them, the common I cannot believe that the same model can also do text to speech synthesis as well!"
+        "smart": "I cannot believe that the same model can also do text to speech synthesis too!",
+        "regular": "But when I had approached so near to them, the common I cannot believe that the same model can also do text to speech synthesis too!"
     },
     "Edit": {
         "smart": "saw the mirage of the lake in the distance,",
         "regular": "But when I saw the mirage of the lake in the distance, which the sense deceives, Lost not by distance any of its marks,"
     },
     "Long TTS": {
-        "smart": "You can run TTS on a big text!\n"
+        "smart": "You can run the model on a big text!\n"
                  "Just write it line-by-line. Or sentence-by-sentence.\n"
-                 "If some sentences sound odd, just rerun TTS on them, no need to generate the whole text again!",
-        "regular": "But when I had approached so near to them, the common You can run TTS on a big text!\n"
+                 "If some sentences sound odd, just rerun the model on them, no need to generate the whole text again!",
+        "regular": "But when I had approached so near to them, the common You can run the model on a big text!\n"
                    "But when I had approached so near to them, the common Just write it line-by-line. Or sentence-by-sentence.\n"
-                   "But when I had approached so near to them, the common If some sentences sound odd, just rerun TTS on them, no need to generate the whole text again!"
+                   "But when I had approached so near to them, the common If some sentences sound odd, just rerun the model on them, no need to generate the whole text again!"
     }
 }
 
@@ -602,9 +602,9 @@ if __name__ == "__main__":
 
     parser = argparse.ArgumentParser(description="VoiceCraft gradio app.")
 
-    parser.add_argument("--demo-path", default=".demo", help="Path to demo directory")
-    parser.add_argument("--tmp-path", default=".demo/temp", help="Path to tmp directory")
-    parser.add_argument("--models-path", default=".pretrained_models", help="Path to voicecraft models directory")
+    parser.add_argument("--demo-path", default="./demo", help="Path to demo directory")
+    parser.add_argument("--tmp-path", default="./demo/temp", help="Path to tmp directory")
+    parser.add_argument("--models-path", default="./pretrained_models", help="Path to voicecraft models directory")
     parser.add_argument("--port", default=7860, type=int, help="App port")
     parser.add_argument("--share", action="store_true", help="Launch with public url")
 
diff --git a/voicecraft-gradio-colab.ipynb b/voicecraft-gradio-colab.ipynb
index a13fb2e..4a5953c 100644
--- a/voicecraft-gradio-colab.ipynb
+++ b/voicecraft-gradio-colab.ipynb
@@ -18,7 +18,7 @@
    },
    "outputs": [],
    "source": [
-    "!git clone https://github.com/zuev-stepan/VoiceCraft-gradio"
+    "!git clone https://github.com/jasonppy/VoiceCraft"
    ]
   },
   {