# 🐶 BARK AI: but with the ability to use voice cloning on custom audio samples
---
## UPDATE: We launched a follow-up to BARK:
_a hyper-realistic AI voice cloner desktop app_
- Runs locally
- All data is yours - 100% data privacy
- No costs to run
👉 Check it out here: https://github.com/serpapps/ai-voice-cloner
---
For RVC, `git clone https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI` and train your model, or point the code at your existing model. Note: the RVC repo must be cloned inside the `bark-with-voice-clone` directory.
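For example:
```
cd bark-with-voice-clone
git clone https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI
```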
If you want to clone a voice, follow the `clone_voice.ipynb` notebook. If you want to generate audio from text, follow the `generate.ipynb` notebook.
To create a voice clone, you need an audio sample of around 5-12 seconds.
You will get the best results by making generations with your cloned voice until you find one that is really close to the source, then using that generation as the new history prompt. Since the new prompt comes from the model itself, it should theoretically be more consistent.
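A minimal sketch of that refinement loop, assuming your cloned speaker was saved as an `.npz` file by `clone_voice.ipynb` (the paths below are placeholders) and that your copy of Bark exposes `output_full=True` and `save_as_prompt` as upstream Bark does:
```python
from bark import SAMPLE_RATE, generate_audio, preload_models, save_as_prompt
from scipy.io.wavfile import write as write_wav

preload_models()

# Generate with the cloned speaker; output_full=True also returns the
# semantic/coarse/fine token history that produced this audio.
full_generation, audio_array = generate_audio(
    "Testing my cloned voice, take one.",
    history_prompt="output/my_cloned_speaker.npz",  # placeholder path
    output_full=True,
)

# Listen to the take; if it is close to the source speaker, save its token
# history and use it as the history_prompt for future generations.
write_wav("take_one.wav", SAMPLE_RATE, audio_array)
save_as_prompt("output/my_cloned_speaker_v2.npz", full_generation)
```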
- [BARK text to speech @ SERP AI](https://serp.ai/tools/bark-text-to-speech-ai-voice-clone-app/)
# Contributors
Huge shoutout & thank you to:
[gitmylo](https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer/)
for his solution to semantic token generation, which enables better voice clones and fine-tunes (HuBERT, etc.)
***
<div style="display: flex; flex-wrap: wrap;">
<a href="https://github.com/francislabountyjr" target="_blank" style="margin: 5px; display: inline-block;"><img src="https://avatars.githubusercontent.com/u/73464335?v=4" alt="francislabountyjr" style="border-radius: 50%; width: 75px; height: 75px;"></a>
<a href="https://github.com/gkucsko" target="_blank" style="margin: 5px; display: inline-block;"><img src="https://avatars.githubusercontent.com/u/5068315?v=4" alt="gkucsko" style="border-radius: 50%; width: 75px; height: 75px;"></a>
<a href="https://github.com/kmfreyberg" target="_blank" style="margin: 5px; display: inline-block;"><img src="https://avatars.githubusercontent.com/u/32879321?v=4" alt="kmfreyberg" style="border-radius: 50%; width: 75px; height: 75px;"></a>
<a href="https://github.com/Vaibhavs10" target="_blank" style="margin: 5px; display: inline-block;"><img src="https://avatars.githubusercontent.com/u/18682411?v=4" alt="Vaibhavs10" style="border-radius: 50%; width: 75px; height: 75px;"></a>
<a href="https://github.com/devinschumacher" target="_blank" style="margin: 5px; display: inline-block;"><img src="https://avatars.githubusercontent.com/u/45643901?v=4" alt="devinschumacher" style="border-radius: 50%; width: 75px; height: 75px;"></a>
<a href="https://github.com/mcamac" target="_blank" style="margin: 5px; display: inline-block;"><img src="https://avatars.githubusercontent.com/u/461009?v=4" alt="mcamac" style="border-radius: 50%; width: 75px; height: 75px;"></a>
<a href="https://github.com/fiq" target="_blank" style="margin: 5px; display: inline-block;"><img src="https://avatars.githubusercontent.com/u/236293?v=4" alt="fiq" style="border-radius: 50%; width: 75px; height: 75px;"></a>
<a href="https://github.com/zygi" target="_blank" style="margin: 5px; display: inline-block;"><img src="https://avatars.githubusercontent.com/u/2059901?v=4" alt="zygi" style="border-radius: 50%; width: 75px; height: 75px;"></a>
<a href="https://github.com/jn-jairo" target="_blank" style="margin: 5px; display: inline-block;"><img src="https://avatars.githubusercontent.com/u/5104869?v=4" alt="jn-jairo" style="border-radius: 50%; width: 75px; height: 75px;"></a>
<a href="https://github.com/gitmylo" target="_blank" style="margin: 5px; display: inline-block;"><img src="https://avatars.githubusercontent.com/u/36931363?v=4" alt="gitmylo" style="border-radius: 50%; width: 75px; height: 75px;"></a>
<a href="https://github.com/alyxdow" target="_blank" style="margin: 5px; display: inline-block;"><img src="https://avatars.githubusercontent.com/u/84633629?v=4" alt="alyxdow" style="border-radius: 50%; width: 75px; height: 75px;"></a>
<a href="https://github.com/mikeyshulman" target="_blank" style="margin: 5px; display: inline-block;"><img src="https://avatars.githubusercontent.com/u/2565833?v=4" alt="mikeyshulman" style="border-radius: 50%; width: 75px; height: 75px;"></a>
</div>
-------------------------------------------------------------------
# Original README.md
## 🤖 Usage
```python
from bark import SAMPLE_RATE, generate_audio, preload_models
from IPython.display import Audio
# download and load all models
preload_models()
# generate audio from text
text_prompt = """
Hello, my name is Serpy. And, uh — and I like pizza. [laughs]
But I also have other interests such as playing tic tac toe.
"""
audio_array = generate_audio(text_prompt)
# play text in notebook
Audio(audio_array, rate=SAMPLE_RATE)
```
[pizza.webm](https://user-images.githubusercontent.com/5068315/230490503-417e688d-5115-4eee-9550-b46a2b465ee3.webm)
To save `audio_array` as a WAV file:
```python
from scipy.io.wavfile import write as write_wav
write_wav("/path/to/audio.wav", SAMPLE_RATE, audio_array)
```
### 🌎 Foreign Language
Bark supports various languages out-of-the-box and automatically determines language from input text. When prompted with code-switched text, Bark will attempt to employ the native accent for the respective languages. English quality is best for the time being, and we expect other languages to further improve with scaling.
```python
text_prompt = """
Buenos días Miguel. Tu colega piensa que tu alemán es extremadamente malo.
But I suppose your english isn't terrible.
"""
audio_array = generate_audio(text_prompt)
```
[miguel.webm](https://user-images.githubusercontent.com/5068315/230684752-10baadfe-1e7c-46a2-8323-43282aef2c8c.webm)
### 🎶 Music
Bark can generate all types of audio, and, in principle, doesn't see a difference between speech and music. Sometimes Bark chooses to generate text as music, but you can help it out by adding music notes around your lyrics.
```python
text_prompt = """
♪ In the jungle, the mighty jungle, the lion barks tonight ♪
"""
audio_array = generate_audio(text_prompt)
```
[lion.webm](https://user-images.githubusercontent.com/5068315/230684766-97f5ea23-ad99-473c-924b-66b6fab24289.webm)
### 🎤 Voice Presets and Voice/Audio Cloning
Bark has the capability to fully clone voices, including tone, pitch, emotion and prosody. The model also attempts to preserve music, ambient noise, etc. from input audio. However, to mitigate misuse of this technology, we restrict the audio history prompts to a limited set of Suno-provided, fully synthetic options for each language. Specify one following the pattern `{lang_code}_speaker_{0-9}`.
```python
text_prompt = """
I have a silky smooth voice, and today I will tell you about
the exercise regimen of the common sloth.
"""
audio_array = generate_audio(text_prompt, history_prompt="en_speaker_1")
```
[sloth.webm](https://user-images.githubusercontent.com/5068315/230684883-a344c619-a560-4ff5-8b99-b4463a34487b.webm)
*Note: since Bark recognizes languages automatically from input text, it is possible to use, for example, a German history prompt with English text. This usually leads to English audio with a German accent.*
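For example (a sketch; `de_speaker_3` simply follows the preset pattern above):

```python
# English text with a German speaker preset usually yields English speech
# with a German accent.
audio_array = generate_audio(
    "Good evening, and welcome to the show.",
    history_prompt="de_speaker_3",
)
```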
### 👥 Speaker Prompts
You can provide certain speaker prompts such as NARRATOR, MAN, WOMAN, etc. Please note that these are not always respected, especially if a conflicting audio history prompt is given.
```python
text_prompt = """
WOMAN: I would like an oatmilk latte please.
MAN: Wow, that's expensive!
"""
audio_array = generate_audio(text_prompt)
```
[latte.webm](https://user-images.githubusercontent.com/5068315/230684864-12d101a1-a726-471d-9d56-d18b108efcb8.webm)
## 💻 Installation
```
pip install git+https://github.com/suno-ai/bark.git
```
or
```
git clone https://github.com/suno-ai/bark
cd bark && pip install .
```
## 🛠️ Hardware and Inference Speed
Bark has been tested and works on both CPU and GPU (`pytorch 2.0+`, CUDA 11.7 and CUDA 12.0).
Running Bark requires running >100M parameter transformer models.
On modern GPUs and PyTorch nightly, Bark can generate audio in roughly realtime. On older GPUs, default colab, or CPU, inference time might be 10-100x slower.
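If you are short on VRAM, one option is to load Bark's smaller checkpoints. A sketch using the `SUNO_USE_SMALL_MODELS` environment variable honored by recent upstream Bark (your fork or version may differ, so check your local `bark/generation.py`):
```python
import os

# Request the smaller model checkpoints before importing bark.
os.environ["SUNO_USE_SMALL_MODELS"] = "True"

from bark import generate_audio, preload_models

preload_models()
audio_array = generate_audio("Small models trade some quality for speed and memory.")
```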
## ⚙️ Details
Similar to [Vall-E](https://arxiv.org/abs/2301.02111) and some other amazing work in the field, Bark uses GPT-style
models to generate audio from scratch. Different from Vall-E, the initial text prompt is embedded into high-level semantic tokens without the use of phonemes. It can therefore generalize to arbitrary instructions beyond speech that occur in the training data, such as music lyrics, sound effects or other non-speech sounds. A subsequent second model is used to convert the generated semantic tokens into audio codec tokens to generate the full waveform. To enable the community to use Bark via public code we used the fantastic
[EnCodec codec](https://github.com/facebookresearch/encodec) from Facebook to act as an audio representation.
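A rough sketch of that two-stage pipeline using the intermediate functions exposed by upstream Bark (names may differ slightly in this fork):
```python
from bark import SAMPLE_RATE, preload_models, semantic_to_waveform, text_to_semantic
from IPython.display import Audio

preload_models()

# Stage 1: text -> high-level semantic tokens (no phonemes involved).
semantic_tokens = text_to_semantic("The quick brown fox jumps over the lazy dog.")

# Stage 2: semantic tokens -> EnCodec codec tokens -> waveform.
audio_array = semantic_to_waveform(semantic_tokens)
Audio(audio_array, rate=SAMPLE_RATE)
```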
Below is a list of some known non-speech sounds; a combined example follows the list.
- `[laughter]`
- `[laughs]`
- `[sighs]`
- `[music]`
- `[gasps]`
- `[clears throat]`
- `—` or `...` for hesitations
- `♪` for song lyrics
- capitalization for emphasis of a word
- `MAN/WOMAN:` for bias towards speaker
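A combined example using several of these markers with the same `generate_audio` call as above:

```python
text_prompt = """
MAN: I REALLY was not expecting that... [sighs]
♪ so I will sing a little song about it ♪ [laughs]
"""
audio_array = generate_audio(text_prompt)
```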
**Supported Languages**
| Language | Status |
| --- | --- |
| English (en) | ✅ |
| German (de) | ✅ |
| Spanish (es) | ✅ |
| French (fr) | ✅ |
| Hindi (hi) | ✅ |
| Italian (it) | ✅ |
| Japanese (ja) | ✅ |
| Korean (ko) | ✅ |
| Polish (pl) | ✅ |
| Portuguese (pt) | ✅ |
| Russian (ru) | ✅ |
| Turkish (tr) | ✅ |
| Chinese, simplified (zh) | ✅ |
| Arabic | Coming soon! |
| Bengali | Coming soon! |
| Telugu | Coming soon! |