Forewords
My last article on voice cloning was more than a year ago, and here we are again to adopt some of the latest advancements.
Referring to some Chinese sources such as this blog and this video, I attempted to adopt new tools for my audiobook service, such as CosyVoice, F5-TTS, GPT-SoVITS, and fish-speech.
But before we start, I recommend the following:
Install miniconda for dependency sanity
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && sudo chmod +x Miniconda3-latest-Linux-x86_64.sh && bash Miniconda3-latest-Linux-x86_64.sh
Set up the PyTorch environment as needed and confirm it with python -m torch.utils.collect_env (see the sketch after this list)
Install nvtop
for example via sudo apt install nvtop
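For the PyTorch step, a minimal sketch looks like this; the environment name is a placeholder and the default PyPI wheel is only an example, so match versions to your driver:
#Example environment; pick torch builds matching your GPU driver
conda create -n torch-env -y python=3.10
conda activate torch-env
pip install torch torchvision torchaudio
#Should report your GPU, driver and CUDA runtime
python -m torch.utils.collect_env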
GPT-SoVITS
This project is made by the same group of people behind so-vits-svc. The model’s quality improved greatly from v2 to v4. Although errors are unavoidable on long-text TTS, it is good enough for my use case.
By the time of writing this article, they have released a new version 20250606v2pro, which may differ somewhat since I was using version 20250422v4.
Alternatively, you can always use their “windows package”, which comes packed with all models and, despite its name, works on Linux servers; it is intended to provide a more user-friendly “one-click” experience.
Install on Linux
git clone https://github.com/RVC-Boss/GPT-SoVITS.git && cd GPT-SoVITS
conda create -n GPTSoVits python=3.10
conda activate GPTSoVits
#auto install script
bash install.sh --source HF --download-uvr5
#(optional) manual install
pip install -r extra-req.txt --no-deps
pip install -r requirements.txt
Install FFmpeg and other deps
sudo apt install ffmpeg
sudo apt install libsox-dev
#(optional for troubleshooting)
conda install -c conda-forge 'ffmpeg<7'
pip install -U gradio
python -m nltk.downloader averaged_perceptron_tagger_eng
(optional) Download Pretrained ASR Models for Chinese
git lfs install
#run from the GPT-SoVITS repo root
git clone https://www.modelscope.cn/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch.git tools/asr/models/speech_fsmn_vad_zh-cn-16k-common-pytorch
git clone https://www.modelscope.cn/iic/punc_ct-transformer_zh-cn-common-vocab272727-pytorch.git tools/asr/models/punc_ct-transformer_zh-cn-common-vocab272727-pytorch
git clone https://www.modelscope.cn/iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch.git tools/asr/models/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch
After all of that is done, run GRADIO_SHARE=0 GRADIO_ANALYTICS_ENABLED=0 DISABLE_TELEMETRY=1 DO_NOT_TRACK=1 python webui.py to start the server, then access it via http://ip:9874/
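Optionally, since formatting, training and inference sessions can run for a long while, the webui can be kept inside a tmux session so it survives SSH disconnects; a minimal sketch:
#Start a named session, launch the webui inside it, then detach with Ctrl-b d
tmux new -s gptsovits
#Reattach later with
tmux attach -t gptsovits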
0-Fetch dataset
preparation-for-vocal-recording
I used Audacity instead of UVR5 because the recording is clean and clear.
Use the webui's built-in audio slicer to slice the new recording.wav, and put old recordings (if any) together with the slices under output/slicer_opt
Use the built-in batch ASR tool with faster-whisper, since I’m building a multilingual model this time.
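Optionally, a quick sanity check on the sliced clips (assuming the sox CLI is installed so soxi is available):
#Number of clips and their total duration in seconds
ls output/slicer_opt/*.wav | wc -l
soxi -D output/slicer_opt/*.wav | paste -sd+ | bc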
The following issues are related exclusively to old GPU architectures; no worries for newer GPU (30x0/40x0) users.
Troubleshooting 1 RuntimeError: parallel_for failed: cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device.
pip uninstall -y ctranslate2
pip install ctranslate2==3.24.0
Troubleshooting 2 'iwrk': array([], dfitpack_int), 'u': array([], float),
pip uninstall numpy scipy
pip install numba==0.60.0 numpy==1.26.4 scipy
After transcription finishes, use the built-in labeling tool (subfix) to remove bad samples. If the web page doesn’t pop up, access ip:9871 manually. Choose Audio, Delete Audio, and Save File when finished.
1-GPT-SOVITS-TTS
1A-Dataset formatting
Fill in the empty fields and click Set One-Click Formatting:
#Text labelling file
/home/username/GPT-SoVITS/output/asr_opt/slicer_opt.list
#Audio dataset folder
output/slicer_opt
1B-Fine-tuned training
1Ba-SoVITS training
Use batch size at 1, total epoch at 5, and save_every_epoch at 1.
1Bb-GPT training
For this part, my batch size is 6 with DPO enabled. Total training epochs should be around 5-15; adjust the save frequency based on your needs.
Troubleshooting for old GPUs
Error: cuFFT doesn't support signals of half type with compute capability less than SM_53, but the device containing input half tensor only has SM_52.
Fix 1: Edit webui.py and add a new line is_half = False after from multiprocessing import cpu_count.
Fix 2: Edit GPT_SoVITS/s2_train.py and add hps.train.fp16_run = False at the beginning (along with torch.backends.cudnn.benchmark = False).
1C-Inference
Click refreshing model paths and select the models in both lists. Check Enable Parallel Inference Version, then open TTS Inference WebUI; it takes a while to load, so manually access ip:9872 if needed.
Troubleshooting for ValueError: Due to a serious vulnerability issue in torch.load
Fix by running pip install transformers==4.43
Inference Settings:
- GPT weights: e3.ckpt
- SoVITS weights: e15.pth
- Primary Reference Audio with Text and multiple reference audio
- Slice by every punct
- top_k 5
- top_p 1
- temperature 0.9
- Repetition Penalty 2
- speed_factor 1.3
Keep everything else default.
To find the best GPT weights, parameters, and random seed, first run inference on a large block of text and pick out a few problematic sentences for the next round. Then, with the seed fixed, adjust the GPT weights and parameters until the problems go away. Once the best GPT weights and parameters are found, keep them fixed and play with different seed numbers to refine the final result. Take note of the parameters whenever inference comes out perfect, for future use.
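For scripted or batch jobs, the repo also ships an api_v2.py HTTP service as an alternative to the WebUI. Below is a rough sketch of calling it with roughly the settings above; the endpoint and parameter names follow my reading of the api_v2.py docstring, so verify them against your checkout, and ref.wav plus the texts are placeholders:
#Start the API server with the default config
python api_v2.py -a 0.0.0.0 -p 9880 -c GPT_SoVITS/configs/tts_infer.yaml
#Request synthesis; the JSON mirrors the WebUI parameters
curl -o out.wav http://127.0.0.1:9880/tts -H "Content-Type: application/json" -d '{"text": "Text to synthesize.", "text_lang": "en", "ref_audio_path": "ref.wav", "prompt_text": "Transcript of the reference audio.", "prompt_lang": "en", "top_k": 5, "top_p": 1, "temperature": 0.9, "repetition_penalty": 2, "speed_factor": 1.3, "seed": 42}'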
Fish Speech
Fish-speech is contributed by the same people behind Bert-vits2, which I used for a long time.
Follow their official docs to install version 1.4 (unfortunately, v1.5 has sound quality problems when fine-tuning):
git clone --branch v1.4.3 https://github.com/fishaudio/fish-speech.git && cd fish-speech
conda create -n fish-speech python=3.10
conda activate fish-speech
pip3 install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1
apt install libsox-dev ffmpeg
apt install build-essential \
cmake \
libasound-dev \
portaudio19-dev \
libportaudio2 \
libportaudiocpp0
pip3 install -e .
Download required models
huggingface-cli download fishaudio/fish-speech-1.4 --local-dir checkpoints/fish-speech-1.4
Prepare dataset
mkdir data
cp -r /home/username/GPT-SoVITS/output/slicer_opt data/
python tools/whisper_asr.py --audio-dir data/slicer_opt --save-dir data/slicer_opt --compute-type float32
python tools/vqgan/extract_vq.py data \
--num-workers 1 --batch-size 16 \
--config-name "firefly_gan_vq" \
--checkpoint-path "checkpoints/fish-speech-1.5/firefly-gan-vq-fsq-8x1024-21hz-generator.pth"
python tools/llama/build_dataset.py \
--input "data" \
--output "data/protos" \
--text-extension .lab \
--num-workers 16
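A quick sanity check before training, assuming whisper_asr.py wrote one .lab transcript per clip (which is what --text-extension .lab above expects):
#The two counts should match, and data/protos should now exist
ls data/slicer_opt/*.wav | wc -l
ls data/slicer_opt/*.lab | wc -l
ls data/protos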
Edit the parameters with nano fish_speech/configs/text2semantic_finetune.yaml and start training:
python fish_speech/train.py --config-name text2semantic_finetune \
project=$project \
+lora@model.model.lora_config=r_8_alpha_16
Note:
- The default numbers are too high for my setup, so both num_workers and batch_size need to be lowered according to CPU cores and VRAM.
- For the first run, I set max_steps: 10000 and val_check_interval: 1000 to get 5 models at lower step counts with some diversity.
- Things like lr, weight_decay and num_warmup_steps can be further adjusted according to this article. My settings are lr: 1e-5, weight_decay: 1e-6, num_warmup_steps: 500.
- To check training metrics such as the loss curve, run tensorboard --logdir fish-speech/results/tensorboard/version_xx/ and access localhost:6006 via browser. Determine overfitting from the graph AND by actually listening to the inference result of each checkpoint.
- At first I found that overfitting starts around step 5000. I then ran a second training for 5000 steps and found the best result to be step_000004000.ckpt.
- Training requires a newer GPU with bf16 support; there is no workaround so far.
- When training a model for inference on an older GPU, use precision: 32-true in fish_speech/configs/text2semantic_finetune.yaml and result += (self.lora_dropout(x).to(torch.float32) @ self.lora_A.to(torch.float32).transpose(0, 1) @ self.lora_B.to(torch.float32).transpose(0, 1)) * self.scaling.to(torch.float32) in /home/username/miniconda3/envs/fish-speech/lib/python3.10/site-packages/loralib/layers.py.
Training takes many hours on a weak GPU. Once it finishes, convert the LoRA weights:
python tools/llama/merge_lora.py \
--lora-config r_8_alpha_16 \
--base-weight checkpoints/fish-speech-1.4 \
--lora-weight results/$project/checkpoints/step_000005000.ckpt \
--output checkpoints/fish-speech-1.4-yth-lora/
Generate prompt and semantic tokens
python tools/vqgan/inference.py \
-i "1.wav" \
--checkpoint-path "checkpoints/fish-speech-1.4/firefly-gan-vq-fsq-8x1024-21hz-generator.pth"
Troubleshooting for old GPU 1 Unable to load any of {libcudnn_ops.so.9.1.0, libcudnn_ops.so.9.1, libcudnn_ops.so.9, libcudnn_ops.so}
pip uninstall -y ctranslate2
pip install ctranslate2==3.24.0
Troubleshooting for old GPU 2 ImportError: cannot import name 'is_callable_allowed' from partially initialized module 'torch._dynamo.trace_rules'
conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=11.8 -c pytorch -c nvidia
Make it accessible from the LAN via nano tools/run_webui.py:
app.launch(server_name="0.0.0.0", server_port=7860, show_api=True)
Change --llama-checkpoint-path to the newly trained LoRA and start the WebUI (I added --half for my old GPU to avoid the bf16 error):
GRADIO_SHARE=0 GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=7860 GRADIO_ANALYTICS_ENABLED=0 DISABLE_TELEMETRY=1 DO_NOT_TRACK=1 python -m tools.webui \
--llama-checkpoint-path "checkpoints/fish-speech-1.4-yth-lora" \
--decoder-checkpoint-path "checkpoints/fish-speech-1.4/firefly-gan-vq-fsq-8x1024-21hz-generator.pth" \
--decoder-config-name firefly_gan_vq \
--half
Parameters for inferencing:
- Enable Reference Audio
- Check Text Normalization
- Iterative Prompt Length: 200
- Top-P: 0.8
- Temperature: 0.7
- Repetition Penalty: 1.5
- Set Seed
Note:
- Use higher numbers to compensate for an overfitted model, lower numbers for an underfitted one.
- Certain punctuation or tab spaces may trigger noise generation. Text normalization is supposed to address these issues, but sometimes I still need to do a manual find & replace (a sketch follows below).
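A throwaway example of that find & replace pass with sed; the characters are purely illustrative and chapter.txt is a placeholder, so substitute whatever triggers noise in your own material:
#Replace tabs with spaces and strip a couple of troublesome symbols in place
sed -i -e 's/\t/ /g' -e 's/~//g' -e 's/……/。/g' chapter.txt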
However, a Negative code found bug occurs quite frequently during inference, with no solution so far. I gave up on it.
CosyVoice
CosyVoice is one of the FunAudioLLM toolkits, developed by the same Alibaba team behind Qwen, which I use a lot.
Install
git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git && cd CosyVoice
git submodule update --init --recursive
conda create -n cosyvoice -y python=3.10
conda activate cosyvoice
conda install -y -c conda-forge pynini==2.1.5
sudo apt-get install sox libsox-dev -y
pip install -r requirements.txt
Download Pretrained Models
git lfs install
mkdir -p pretrained_models
git clone https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B pretrained_models/CosyVoice2-0.5B
git clone https://huggingface.co/FunAudioLLM/CosyVoice-300M pretrained_models/CosyVoice-300M
git clone https://huggingface.co/FunAudioLLM/CosyVoice-300M-SFT pretrained_models/CosyVoice-300M-SFT
git clone https://huggingface.co/FunAudioLLM/CosyVoice-300M-Instruct pretrained_models/CosyVoice-300M-Instruct
Run with
GRADIO_SHARE=0 GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=7860 GRADIO_ANALYTICS_ENABLED=0 DISABLE_TELEMETRY=1 DO_NOT_TRACK=1 python webui.py --port 50000 --model_dir pretrained_models/CosyVoice-300M
Troubleshooting “GLIBCXX_3.4.29 not found” with the following:
#Check which GLIBCXX versions the system and conda libstdc++ provide
strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX
strings $CONDA_PREFIX/lib/libstdc++.so.6 | grep GLIBCXX
#Option 1: point the loader at the conda libstdc++ by adding this line to ~/.bashrc
nano ~/.bashrc
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
#Option 2: find a newer libstdc++ elsewhere and re-link the outdated one (paths are examples)
find / -name "libstdc++.so*"
rm /home/username/anaconda3/lib/python3.11/site-packages/../../libstdc++.so.6
ln -s /home/username/text-generation-webui/installer_files/env/lib/libstdc++.so.6.0.29 /home/username/anaconda3/lib/python3.11/site-packages/../../libstdc++.so.6
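An alternative that avoids re-linking system libraries is to pull a newer libstdc++ into the conda environment itself; libstdcxx-ng is the conda-forge package name, and together with the LD_LIBRARY_PATH export above it usually clears the error:
#Install a recent libstdc++ inside the env, then confirm the missing symbol is present
conda install -y -c conda-forge libstdcxx-ng
strings $CONDA_PREFIX/lib/libstdc++.so.6 | grep GLIBCXX_3.4.29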
It ends up working fine, but not as well as GPT-SoVITS. I hope their 3.0 version can pump it up.
Voice Conversion
Both RVC and Seed-VC are intended to replace my good old so-vits-svc instance.
Retrieval-based-Voice-Conversion
Install
git clone https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI && cd Retrieval-based-Voice-Conversion-WebUI
conda create -n rvc -y python=3.8
conda activate rvc
pip install torch torchvision torchaudio
pip install pip==24.0
pip install -r requirements.txt
python tools/download_models.py
sudo apt install ffmpeg
wget https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/rmvpe.pt
Run with python infer-web.py, fill in the following, then click the buttons step by step with default settings:
Enter the experiment name:/path/to/raw/
Troubleshooting “enabled=hps.train.fp16_run”
Seed-VC
Install
git clone https://github.com/Plachtaa/seed-vc && cd seed-vc
conda create -n seedvc -y python=3.10
conda activate seedvc
pip install -r requirements.txt
GRADIO_SHARE=0 GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=7860 GRADIO_ANALYTICS_ENABLED=0 DISABLE_TELEMETRY=1 DO_NOT_TRACK=1 python app.py --enable-v1 --enable-v2
Settings
#V2
Diffusion Steps: 100
Length Adjust: 1
Intelligibility CFG Rate: 0
Similarity CFG Rate: 1
Top-p: 1
Temperature: 1
Repetition Penalty: 2
convert style/emotion/accent: check
#V1
Diffusion Steps: 100
Length Adjust: 1
Inference CFG Rate: 1
Use F0 conditioned model: check
Auto F0 adjust: check
Pitch shift: 0
Training
python train.py --config /home/username/seed-vc/configs/presets/config_dit_mel_seed_uvit_whisper_base_f0_44k.yml --dataset-dir /home/username/GPT-SoVITS-v4/output/slicer_opt --run-name username --batch-size 6 --max-steps 10000 --max-epochs 10000 --save-every 1000 --num-workers 1
accelerate launch train_v2.py --dataset-dir /home/username/GPT-SoVITS-v4/output/slicer_opt --run-name username-v2 --batch-size 6 --max-steps 2000 --max-epochs 2000 --save-every 200 --num-workers 0 --train-cfm
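Checkpoints land under runs/<run-name>/, which is what the commands below point at; listing them helps when picking an epoch/step, and nvtop in another terminal keeps an eye on VRAM:
#List saved checkpoints for the run named "username"
ls -lh runs/username/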
Using checkpoints
#Voice Conversion Web UI
GRADIO_SHARE=0 GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=7860 GRADIO_ANALYTICS_ENABLED=0 DISABLE_TELEMETRY=1 DO_NOT_TRACK=1 python app_vc.py --checkpoint ./runs/test01/ft_model.pth --config ./configs/presets/config_dit_mel_seed_uvit_whisper_base_f0_44k.yml --fp16 False
#Singing Voice Conversion Web UI
GRADIO_SHARE=0 GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=7860 GRADIO_ANALYTICS_ENABLED=0 DISABLE_TELEMETRY=1 DO_NOT_TRACK=1 python app_svc.py --checkpoint ./runs/username/DiT_epoch_00029_step_08000.pth --config ./configs/presets/config_dit_mel_seed_uvit_whisper_base_f0_44k.yml --fp16 False
#V2 model Web UI
GRADIO_SHARE=0 GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=7860 GRADIO_ANALYTICS_ENABLED=0 DISABLE_TELEMETRY=1 DO_NOT_TRACK=1 python app_vc_v2.py --cfm-checkpoint-path runs/Satine-V2/CFM_epoch_00000_step_00600.pth
It turned out that the V1 model with the Singing Voice Conversion Web UI (app_svc.py) performs the best.