mirror of
https://github.com/serp-ai/bark-with-voice-clone.git
synced 2025-12-15 03:07:58 +01:00
Merge branch 'suno-ai:main' into main
@@ -81,7 +81,7 @@ audio_array = generate_audio(text_prompt)
[lion.webm](https://user-images.githubusercontent.com/5068315/230684766-97f5ea23-ad99-473c-924b-66b6fab24289.webm)
-### 🎤 Voice/Audio Cloning
+### 🎤 Voice Presets and Voice/Audio Cloning
Bark can fully clone voices, including tone, pitch, emotion, and prosody, and it also attempts to preserve music, ambient noise, etc. from the input audio. However, to mitigate misuse of this technology, we limit the audio history prompts to a curated set of Suno-provided, fully synthetic options for each language. Specify a preset following the pattern `{lang_code}_speaker_{number}`.
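As a minimal sketch of the naming pattern above (the language code, speaker index, and the commented-out `history_prompt` usage are illustrative assumptions, not prescribed by this README):

```python
# Build a voice preset name following the {lang_code}_speaker_{number} pattern.
lang_code = "en"  # hypothetical language code
number = 3        # hypothetical speaker index
preset = f"{lang_code}_speaker_{number}"
print(preset)  # en_speaker_3

# The preset string would then be supplied as the audio history prompt,
# e.g. (assumed call shape, matching the generate_audio call shown earlier):
# audio_array = generate_audio(text_prompt, history_prompt=preset)
```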
@@ -202,4 +202,8 @@ If you are interested, you can sign up for early access [here](https://3os84zs17
#### How do I specify where models are downloaded and cached?
Use the `XDG_CACHE_HOME` env variable to override where models are downloaded and cached (otherwise defaults to a subdirectory of `~/.cache`).
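For example (the cache path below is a made-up placeholder; only the `XDG_CACHE_HOME` variable itself comes from the answer above):

```shell
# Redirect model downloads and caching away from the ~/.cache default
# by exporting XDG_CACHE_HOME before running any Bark code.
export XDG_CACHE_HOME=/data/bark-cache
echo "$XDG_CACHE_HOME"
```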
#### Bark's generations sometimes differ from my prompts. What's happening?
Bark is a GPT-style model. As such, it may take some creative liberties in its generations, resulting in higher-variance model outputs than traditional text-to-speech approaches.
bark/assets/prompts/readme.md | 20 lines | Normal file
@@ -0,0 +1,20 @@
# Example Prompts Data
The provided data is in the .npz format, NumPy's file format for storing multiple arrays in a single archive. Each file contains three arrays: semantic_prompt, coarse_prompt, and fine_prompt.
```semantic_prompt```
The semantic_prompt array contains a sequence of token IDs produced by the Hugging Face BERT tokenizer. These tokens encode the input text and serve as input for generating the audio output. The shape of this array is (n,), where n is the number of tokens in the input text.
```coarse_prompt```
The coarse_prompt array is an intermediate output of the text-to-speech pipeline and contains token IDs from the first two codebooks of EnCodec, the neural audio codec from Meta. This step converts the semantic tokens into a representation better suited for the subsequent step. The shape of this array is (2, m), where m is the number of tokens after conversion by EnCodec.
```fine_prompt```
The fine_prompt array is a further-processed output of the pipeline and contains token IDs from all 8 EnCodec codebooks. These represent the final stage of tokenization, and the resulting tokens are used to generate the audio output. The shape of this array is (8, p), where p is the number of tokens after further processing by EnCodec.
Overall, these arrays represent successive stages of a text-to-speech pipeline that converts text input into synthesized audio output: semantic_prompt encodes the input text, while coarse_prompt and fine_prompt capture the intermediate and final stages of audio tokenization, respectively.
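The layout described above can be checked with a short NumPy sketch. The file name and the token counts n, m, p below are made up for illustration; only the array names and the (n,), (2, m), (8, p) shapes come from this README:

```python
import numpy as np

# Write a toy prompt file with the three arrays described above.
n, m, p = 5, 12, 24  # illustrative token counts, not real prompt lengths
np.savez(
    "example_prompt.npz",
    semantic_prompt=np.zeros(n, dtype=np.int64),
    coarse_prompt=np.zeros((2, m), dtype=np.int64),
    fine_prompt=np.zeros((8, p), dtype=np.int64),
)

# Load the archive back and confirm each array matches its documented shape.
prompt = np.load("example_prompt.npz")
print(prompt["semantic_prompt"].shape)  # (5,)
print(prompt["coarse_prompt"].shape)    # (2, 12)
print(prompt["fine_prompt"].shape)      # (8, 24)
```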