Foreword

My last article on voice cloning was more than a year ago, and here we are again to adopt some of the latest advancements.

Referring to some Chinese sources such as this blog and this video, I attempted to adopt new tools for my audiobook service, such as CosyVoice, F5-TTS, GPT-SoVITS, and fish-speech.

But before we start, I recommend the following:

Install miniconda for dependency sanity

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && sudo chmod +x Miniconda3-latest-Linux-x86_64.sh && bash Miniconda3-latest-Linux-x86_64.sh

Set up the PyTorch environment as needed and confirm it with python -m torch.utils.collect_env
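For example, a minimal sketch (the env name is arbitrary, and pip will pick a CUDA build, so match it to your driver or pin the versions each project below asks for):

conda create -n torch-base -y python=3.10
conda activate torch-base
pip install torch torchvision torchaudio
# Verify the build, CUDA availability and driver versions
python -m torch.utils.collect_env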

Install nvtop, if desired, with sudo apt install nvtop

GPT-SoVITS

This project is made by the same group of people behind so-vits-svc. The model’s quality improved greatly from v2 to v4. Although errors are unavoidable when doing long-text TTS, it is good enough for my use case.

At the time of writing this article, they have released a new version, 20250606v2pro, which may differ somewhat since I was using version 20250422v4.

But you can always use their “windows package”, which comes packed with all the models and, despite its name, works on Linux servers; it is intended to provide a more user-friendly “one-click” experience.

Install on Linux

git clone https://github.com/RVC-Boss/GPT-SoVITS.git && cd GPT-SoVITS
conda create -n GPTSoVits python=3.10
conda activate GPTSoVits
#auto install script
bash install.sh --source HF --download-uvr5
#(optional) manual install
pip install -r extra-req.txt --no-deps
pip install -r requirements.txt

Install FFmpeg and other deps

sudo apt install ffmpeg
sudo apt install libsox-dev

#(optional for troubleshooting)
conda install -c conda-forge 'ffmpeg<7'
pip install -U gradio
python -m nltk.downloader averaged_perceptron_tagger_eng

(optional) Download Pretrained ASR Models for Chinese

git lfs install
# Run from the GPT-SoVITS repo root
git clone https://www.modelscope.cn/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch.git tools/asr/models/speech_fsmn_vad_zh-cn-16k-common-pytorch
git clone https://www.modelscope.cn/iic/punc_ct-transformer_zh-cn-common-vocab272727-pytorch.git tools/asr/models/punc_ct-transformer_zh-cn-common-vocab272727-pytorch
git clone https://www.modelscope.cn/iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch.git tools/asr/models/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch

After everything is done, run GRADIO_SHARE=0 GRADIO_ANALYTICS_ENABLED=0 DISABLE_TELEMETRY=1 DO_NOT_TRACK=1 python webui.py to start the server, then access it via http://ip:9874/
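On a headless server I keep the WebUI alive in a tmux session so it survives SSH disconnects; a sketch (the session name is arbitrary, and tmux may need sudo apt install tmux first):

tmux new -s gptsovits
GRADIO_SHARE=0 GRADIO_ANALYTICS_ENABLED=0 DISABLE_TELEMETRY=1 DO_NOT_TRACK=1 python webui.py
# Detach with Ctrl-b d; reattach later with: tmux attach -t gptsovits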

0-Fetch dataset

See the preparation-for-vocal-recording notes from my previous article.

I used Audacity instead of UVR5 because the recording is clean and clear.

Use the WebUI’s built-in audio slicer to slice the new recording.wav, and put the old recordings (if any) together under output/slicer_opt
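For old recordings from earlier sessions, I simply drop the already-sliced clips into the same folder; a sketch (the source path is hypothetical):

mkdir -p output/slicer_opt
cp /path/to/old_sliced_recordings/*.wav output/slicer_opt/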

Use the built-in batch ASR tool with Faster Whisper, since I’m building a multilingual model this time.

The following issues are related exclusively to old GPU architectures. No worries for newer GPU (30x0/40x0) users.

Troubleshooting 1: RuntimeError: parallel_for failed: cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device.

pip uninstall -y ctranslate2
pip install ctranslate2==3.24.0

Troubleshooting 2: 'iwrk': array([], dfitpack_int), 'u': array([], float),

pip uninstall numpy scipy
pip install numba==0.60.0 numpy==1.26.4 scipy

After transcription finishes, use the built-in labeling tool (subfix) to remove bad samples. If the web page doesn’t pop up, open ip:9871 manually. Tick Choose Audio, click Delete Audio, and Save File when finished.

1-GPT-SOVITS-TTS

1A-Dataset formatting

Fill in the empty fields and click Set One-Click Formatting:

#Text labelling file
/home/username/GPT-SoVITS/output/asr_opt/slicer_opt.list

#Audio dataset folder
output/slicer_opt

1B-Fine-tuned training

1Ba-SoVITS training: set batch size to 1, total epochs to 5, and save_every_epoch to 1.

1Bb-GPT training: for this part, my batch size is 6 with DPO enabled. Total training epochs should be around 5-15; adjust the save frequency as needed.

Troubleshooting for old GPUs: Error: cuFFT doesn't support signals of half type with compute capability less than SM_53, but the device containing input half tensor only has SM_52.

Fix 1: Edit webui.py and add a new line with is_half = False after from multiprocessing import cpu_count. Fix 2: Edit GPT_SoVITS/s2_train.py and add hps.train.fp16_run = False at the beginning (alongside torch.backends.cudnn.benchmark = False).
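If you prefer not to edit by hand, a rough GNU sed equivalent of the two fixes (it writes .bak backups; line placement can differ between versions):

# Fix 1: define is_half = False right after the cpu_count import in webui.py
sed -i.bak '/from multiprocessing import cpu_count/a is_half = False' webui.py
# Fix 2: force fp16_run off next to the cudnn.benchmark line in GPT_SoVITS/s2_train.py
sed -i.bak '/torch.backends.cudnn.benchmark = False/a hps.train.fp16_run = False' GPT_SoVITS/s2_train.py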

1C-inference

Click refresh model paths and select the trained models in both dropdown lists.

Check Enable Parallel Inference Version, then open TTS Inference WebUI. It takes a while to load; access ip:9872 manually if needed.

Troubleshooting for ValueError: Due to a serious vulnerability issue in torch.load: fix it with pip install transformers==4.43

Inference Settings:

  • e3.ckpt
  • e15.pth
  • Primary Reference Audio with Text and multiple reference audio
  • Slice by every punct
  • top_k 5
  • top_p 1
  • temperature 0.9
  • Repetition Penalty 2
  • speed_factor 1.3

Keep everything else default.

To find the best GPT weight, parameters, and random seed, first run inference on a large block of text and pick out a few problematic sentences for the next round. Then, with the seed fixed, adjust the GPT weight and parameters until the problems go away. Once the best GPT weight and parameters are found, fix them and play with different seed numbers to refine the final result. Take note of the parameters once inference is perfect, for future use.

Fish Speech

Fish-speech is developed by the same people behind Bert-vits2, which I used for a long time.

Follow their official docs to install version 1.4 (unfortunately, v1.5 has sound quality problems when fine-tuning).

git clone --branch v1.4.3 https://github.com/fishaudio/fish-speech.git && cd fish-speech

conda create -n fish-speech python=3.10
conda activate fish-speech

pip3 install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1

apt install libsox-dev ffmpeg 

apt install build-essential \
    cmake \
    libasound-dev \
    portaudio19-dev \
    libportaudio2 \
    libportaudiocpp0

pip3 install -e .

Download required models

huggingface-cli download fishaudio/fish-speech-1.4 --local-dir checkpoints/fish-speech-1.4

Prepare dataset

mkdir data
cp -r /home/username/GPT-SoVITS/output/slicer_opt data/
python tools/whisper_asr.py --audio-dir data/slicer_opt --save-dir data/slicer_opt --compute-type float32

python tools/vqgan/extract_vq.py data \
    --num-workers 1 --batch-size 16 \
    --config-name "firefly_gan_vq" \
    --checkpoint-path "checkpoints/fish-speech-1.5/firefly-gan-vq-fsq-8x1024-21hz-generator.pth"
	
python tools/llama/build_dataset.py \
    --input "data" \
    --output "data/protos" \
    --text-extension .lab \
    --num-workers 16
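Before training, a quick sanity check of the prepared dataset helps (a sketch; file layout as produced by the steps above):

# Each sliced .wav should now have a .lab transcript and a .npy VQ-token file
find data/slicer_opt -name "*.wav" | wc -l
find data/slicer_opt -name "*.lab" | wc -l
find data/slicer_opt -name "*.npy" | wc -l
# The packed training data should land in data/protos
ls data/protos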

Edit the parameters with nano fish_speech/configs/text2semantic_finetune.yaml and start training:

python fish_speech/train.py --config-name text2semantic_finetune \
    project=$project \
    +lora@model.model.lora_config=r_8_alpha_16
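To quickly locate the fields mentioned in the notes below inside the config, a grep sketch (key names only; the exact nesting depends on the version):

grep -nE 'max_steps|val_check_interval|num_workers|batch_size|lr|weight_decay|num_warmup_steps|precision' \
    fish_speech/configs/text2semantic_finetune.yaml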

Note:

  • The default numbers are too high for my setup, so both num_workers and batch_size need to be lowered according to CPU cores and VRAM.
  • For the first run, I set max_steps: 10000 and val_check_interval: 1000 to get 5 checkpoints at lower step counts with some diversity.
  • Things like lr, weight_decay and num_warmup_steps can be further adjusted according to this article. My settings are lr: 1e-5, weight_decay: 1e-6, num_warmup_steps: 500.
  • To check training metrics such as the loss curve, run tensorboard --logdir fish-speech/results/tensorboard/version_xx/ and access localhost:6006 via a browser (see the sketch after this list for remote access). Determine overfitting from the graph AND by actually listening to the inference result of each checkpoint.
  • At first, I found that overfitting starts around step 5000. I then ran a second training for 5000 steps and found the best result to be step_000004000.ckpt.
  • Training requires a newer GPU with bf16 support; no workaround so far.
  • When training a model for inference on an older GPU, use precision: 32-true in fish_speech/configs/text2semantic_finetune.yaml and change the LoRA forward line to result += (self.lora_dropout(x).to(torch.float32) @ self.lora_A.to(torch.float32).transpose(0, 1) @ self.lora_B.to(torch.float32).transpose(0, 1)) * self.scaling.to(torch.float32) in /home/username/miniconda3/envs/fish-speech/lib/python3.10/site-packages/loralib/layers.py.
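Since my training box is headless, I view TensorBoard over an SSH tunnel; a sketch (user and server are placeholders):

# On the local machine: forward local port 6006 to the server
ssh -L 6006:localhost:6006 user@server
# On the server: start TensorBoard as in the note above, then browse http://localhost:6006 locally
tensorboard --logdir fish-speech/results/tensorboard/version_xx/
# Alternatively, run TensorBoard with --bind_all and open http://server:6006 directly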

Training takes many hours on a weak GPU. After it finishes, merge the LoRA weights back into the base model:

python tools/llama/merge_lora.py \
    --lora-config r_8_alpha_16 \
    --base-weight checkpoints/fish-speech-1.4 \
    --lora-weight results/$project/checkpoints/step_000005000.ckpt \
    --output checkpoints/fish-speech-1.4-yth-lora/

Generate prompt and semantic tokens

python tools/vqgan/inference.py \
    -i "1.wav" \
    --checkpoint-path "checkpoints/fish-speech-1.4/firefly-gan-vq-fsq-8x1024-21hz-generator.pth"

Troubleshooting for old GPU 1: Unable to load any of {libcudnn_ops.so.9.1.0, libcudnn_ops.so.9.1, libcudnn_ops.so.9, libcudnn_ops.so}

pip uninstall -y ctranslate2
pip install ctranslate2==3.24.0

Troubleshooting for old GPU 2: ImportError: cannot import name 'is_callable_allowed' from partially initialized module 'torch._dynamo.trace_rules'

conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1  pytorch-cuda=11.8 -c pytorch -c nvidia

Make it accessible from the LAN with nano tools/run_webui.py:

app.launch(server_name="0.0.0.0", server_port=7860, show_api=True)

Change --llama-checkpoint-path to the newly trained LoRA, and start the WebUI (I added --half for my old GPU to avoid the bf16 error):

GRADIO_SHARE=0 GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=7860 GRADIO_ANALYTICS_ENABLED=0 DISABLE_TELEMETRY=1 DO_NOT_TRACK=1 python -m tools.webui \
    --llama-checkpoint-path "checkpoints/fish-speech-1.4-yth-lora" \
    --decoder-checkpoint-path "checkpoints/fish-speech-1.4/firefly-gan-vq-fsq-8x1024-21hz-generator.pth" \
    --decoder-config-name firefly_gan_vq \
    --half

Parameters for inferencing:

  • Enable Reference Audio
  • Check Text Normalization
  • Iterative Prompt Length: 200
  • Top-P: 0.8
  • Temperature: 0.7
  • Repetition Penalty: 1.5
  • Set Seed

Note:

  • Use higher numbers to compensate for an overfitted model, lower numbers for an underfitted one.
  • Certain punctuation or tab spaces may trigger noise generation. Text Normalization is supposed to address these issues, but sometimes I still need to find & replace manually.

However, a Negative code found bug occurs quite frequently during inference, with no solution so far. I gave up.

CosyVoice

CosyVoice is one of the FunAudioLLM toolkits, developed by the same team behind Alibaba’s Qwen, which I use a lot.

Install

git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git && cd CosyVoice
git submodule update --init --recursive
conda create -n cosyvoice -y python=3.10
conda activate cosyvoice
conda install -y -c conda-forge pynini==2.1.5
sudo apt-get install sox libsox-dev -y
pip install -r requirements.txt

Download Pretrained Models

git lfs install
mkdir -p pretrained_models
git clone https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B pretrained_models/CosyVoice2-0.5B
git clone https://huggingface.co/FunAudioLLM/CosyVoice-300M pretrained_models/CosyVoice-300M
git clone https://huggingface.co/FunAudioLLM/CosyVoice-300M-SFT pretrained_models/CosyVoice-300M-SFT
git clone https://huggingface.co/FunAudioLLM/CosyVoice-300M-Instruct pretrained_models/CosyVoice-300M-Instruct

Run with

GRADIO_SHARE=0 GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=7860 GRADIO_ANALYTICS_ENABLED=0 DISABLE_TELEMETRY=1 DO_NOT_TRACK=1 python webui.py --port 50000 --model_dir pretrained_models/CosyVoice-300M
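The same WebUI can presumably also be pointed at the CosyVoice2-0.5B folder downloaded above (I mainly tested 300M, so treat this as an untested sketch):

python webui.py --port 50000 --model_dir pretrained_models/CosyVoice2-0.5B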

Troubleshooting “GLIBCXX_3.4.29 not found” with this:

# Check which GLIBCXX versions the system and conda copies of libstdc++ provide
strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX
strings $CONDA_PREFIX/lib/libstdc++.so.6 | grep GLIBCXX

# Option 1: prefer conda's newer libstdc++ via LD_LIBRARY_PATH
nano ~/.bashrc
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH

# Option 2: locate a newer libstdc++ and repoint the offending symlink at it (paths are from my machine)
find / -name "libstdc++.so*"
rm /home/username/anaconda3/lib/python3.11/site-packages/../../libstdc++.so.6
ln -s /home/username/text-generation-webui/installer_files/env/lib/libstdc++.so.6.0.29 /home/username/anaconda3/lib/python3.11/site-packages/../../libstdc++.so.6

It ends up working fine, but not as well as GPT-SoVITS. I hope their 3.0 version can pump it up.

Voice Conversion

Both RVC and Seed-VC are intended to replace my good old so-vits-svc instance.

Retrieval-based-Voice-Conversion

Install

git clone https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI && cd Retrieval-based-Voice-Conversion-WebUI
conda create -n rvc -y python=3.8
conda activate rvc
pip install torch torchvision torchaudio
pip install pip==24.0
pip install -r requirements.txt
python tools/download_models.py
sudo apt install ffmpeg
wget https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/rmvpe.pt

Run with python infer-web.py, fill in the following, then click the buttons step by step with default settings:

Enter the experiment name: (any name)
Enter the path of the training folder: /path/to/raw/
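For the training folder, I reuse the clips already sliced by GPT-SoVITS; a minimal sketch (the destination just mirrors the placeholder above):

mkdir -p /path/to/raw
cp /home/username/GPT-SoVITS/output/slicer_opt/*.wav /path/to/raw/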

Troubleshooting “enabled=hps.train.fp16_run”

Seed-VC

Install

git clone https://github.com/Plachtaa/seed-vc && cd seed-vc
conda create -n seedvc -y python=3.10
conda activate seedvc
pip install -r requirements.txt
GRADIO_SHARE=0 GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=7860 GRADIO_ANALYTICS_ENABLED=0 DISABLE_TELEMETRY=1 DO_NOT_TRACK=1 python app.py --enable-v1 --enable-v2

Settings

#V2
Diffusion Steps: 100
Length Adjust: 1
Intelligibility CFG Rate: 0
Similarity CFG Rate: 1
Top-p: 1
Temperature: 1
Repetition Penalty: 2
convert style/emotion/accent: check

#V1
Diffusion Steps: 100
Length Adjust: 1
Inference CFG Rate: 1
Use F0 conditioned model: check
Auto F0 adjust: check
Pitch shift: 0

Training

python train.py --config /home/username/seed-vc/configs/presets/config_dit_mel_seed_uvit_whisper_base_f0_44k.yml --dataset-dir /home/username/GPT-SoVITS-v4/output/slicer_opt --run-name username --batch-size 6 --max-steps 10000 --max-epochs 10000 --save-every 1000 --num-workers 1

accelerate launch train_v2.py --dataset-dir /home/username/GPT-SoVITS-v4/output/slicer_opt --run-name username-v2 --batch-size 6 --max-steps 2000 --max-epochs 2000 --save-every 200 --num-workers 0 --train-cfm

Using checkpoints

#Voice Conversion Web UI
GRADIO_SHARE=0 GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=7860 GRADIO_ANALYTICS_ENABLED=0 DISABLE_TELEMETRY=1 DO_NOT_TRACK=1 python app_vc.py --checkpoint ./runs/test01/ft_model.pth --config ./configs/presets/config_dit_mel_seed_uvit_whisper_base_f0_44k.yml --fp16 False

#Singing Voice Conversion Web UI
GRADIO_SHARE=0 GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=7860 GRADIO_ANALYTICS_ENABLED=0 DISABLE_TELEMETRY=1 DO_NOT_TRACK=1 python app_svc.py --checkpoint ./runs/username/DiT_epoch_00029_step_08000.pth --config ./configs/presets/config_dit_mel_seed_uvit_whisper_base_f0_44k.yml --fp16 False

#V2 model Web UI
GRADIO_SHARE=0 GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=7860 GRADIO_ANALYTICS_ENABLED=0 DISABLE_TELEMETRY=1 DO_NOT_TRACK=1 python app_vc_v2.py --cfm-checkpoint-path runs/Satine-V2/CFM_epoch_00000_step_00600.pth

It turned out that the V1 model with the Singing Voice Conversion Web UI (app_svc.py) performs the best.