A Deep Dive into Voice Cloning with SoftVC VITS and Bert-VITS2

In the previous post, I tried out TTS Generation WebUI and found it interesting, so I decided to train a usable model with my own voice.

This voice cloning project explores both SVC for voice changing and VITS for text-to-speech. No single tool does every job.

I have tested several tools for this project. Many of the good guides, like this, this and this, are in Chinese, so I thought it would be useful to post my notes in English.

Although so-vits-svc has been archived for a few months, probably due to oppression, it is still the tool that gives the best results.

Other related tools such as so-vits-svc-fork, so-vits-svc-5.0, DDSP-SVC, and RVC offer faster or lighter optimization, more features, or better interfaces.

But given enough time and resources, none of these alternatives can compete with the results of the original so-vits-svc.

For TTS, a new tool called Bert-VITS2 works fantastically and has already matured with its final release last month. It has very different use cases, for example audio content creation.

Prepare Dataset

The audio files in the dataset should be in WAV format: 44100 Hz, 16-bit, mono, ideally 1-2 hours in total.
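
To check a folder of recordings against these requirements, something like the following could be used. It is a minimal sketch that relies only on Python's standard wave module; the function names are my own, and it only understands plain PCM WAV files.

```python
import wave
from pathlib import Path

# Target format taken from the text: 44100 Hz, 16-bit, mono.
TARGET_RATE = 44100
TARGET_SAMPLE_WIDTH = 2  # bytes per sample, i.e. 16-bit
TARGET_CHANNELS = 1


def describe_wav(path):
    """Return (sample_rate, sample_width_bytes, channels, duration_seconds)."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        seconds = wav.getnframes() / rate
        return rate, wav.getsampwidth(), wav.getnchannels(), seconds


def check_dataset(folder):
    """Return (list of format problems, total duration in hours) for *.wav in folder."""
    problems = []
    total_seconds = 0.0
    for path in sorted(Path(folder).glob("*.wav")):
        rate, width, channels, seconds = describe_wav(path)
        total_seconds += seconds
        if (rate, width, channels) != (TARGET_RATE, TARGET_SAMPLE_WIDTH, TARGET_CHANNELS):
            problems.append(f"{path.name}: {rate} Hz, {width * 8}-bit, {channels} ch")
    return problems, total_seconds / 3600
```

Calling check_dataset("dataset_raw/character") returns the files that need converting and the total length in hours, which can be compared with the 1-2 hour target.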

Extract from a Song

Ultimate Vocal Remover is the easiest tool for this job. There is a thread that explains everything in detail.

UVR Workflows

  • Remove and extract Instrumental
    • Model: VR - UVR(4_HP-Vocal-UVR)
    • Settings: 512 - 10 - GPU
    • Output Instrumental and unclean vocal
  • Remove and extract background vocal
    • Model: VR - UVR(5_HP-Karaoke-UVR)
    • Settings: 512 - 10 - GPU
    • Output background vocal and unclean main vocal
  • Remove reverb and noise
    • Model: VR - UVR-DeEcho-DeReverb & UVR-DeNoise
    • Settings: 512 - 10 - GPU - No Other Only
    • Output clean main vocal
  • (Optional) Use RipX (non-free) for a manual fine cleaning

Preparation for vocal recording

It’s better to record in a treated room with a condenser microphone; otherwise, use a directional or dynamic microphone to reduce noise.

Cheapskate’s Audio Equipment

I first got into music in high school, with the blue Sennheiser MX500 and Koss Porta Pro. I still remember recording a song for the first time, on a Sony VAIO with Cool Edit Pro.

Nowadays, as an amateur, I still resist spending a lot of money on audio hardware because it is literally a money-sucking black hole.

Nonetheless, I really appreciate the reliability of cheap production equipment.

The core of my setup is a Behringer UCA202, and it’s perfect for my use cases. I bought it for $10 during a price drop.

It is a so-called “audio interface” but basically just a sound card with multiple ports. I use RCA to 3.5mm TRS cables for my headphones: a semi-open K240s for regular output and a closed-back HD669/MDR7506 for monitor output.

All three headphones mentioned cost under $100 at their normal price, and there are clones from Samson, Tascam, Knox Gear and others for less than $50.

For the input device, I’m using a dynamic microphone because of the noise in my environment. It is an SM58 copy (Pyle) plus a Tascam DR-05 recorder (as an amplifier). Other clones such as the SL84c or wm58 would do too.

I use an XLR to 3.5mm TRS cable to connect the microphone to the MIC/external input of the recorder, and then an AUX cable to connect the line-out of the recorder to the input of the UCA202.

I don’t recommend buying an “audio interface” and a dedicated amplifier to replicate my setup. A $10 C-Media USB sound card should be good enough. The Syba model that I owned can “pre-amp” dynamic microphones directly, and even some lower-end phantom-powered microphones.

The setup can be extremely cheap ($40~60), but with the UCA202 and DR-05 the sound is much cleaner. And I really like the physical controls, versatility and portability of my good old digital recorder.

Audacity workflows

When I was getting paid as a designer, I was pretty happy with Audition. But for personal use on a fun project, Audacity is the way to avoid the chaotic evil of Adobe.


Use audio-slicer or audio-slicer (gui) to slice the audio file into small pieces for later use.

Default setting works great.

Cleaning dataset

Remove the very short clips and re-slice any that are still longer than 10 seconds.

For a large dataset, remove every clip shorter than 4 seconds. For a small dataset, remove only those under 2 seconds.

If necessary, inspect every single file manually.
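
The length rules above are easy to automate. The sketch below sorts the sliced clips into three groups by duration. The thresholds follow the numbers in this section, while the function names and the returned dictionary layout are only my own illustration.

```python
import wave
from pathlib import Path


def clip_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def triage_clips(folder, min_seconds=2.0, max_seconds=10.0):
    """Sort clips into 'remove' (too short), 'reslice' (too long) and 'keep'."""
    result = {"remove": [], "reslice": [], "keep": []}
    for path in sorted(Path(folder).glob("*.wav")):
        seconds = clip_seconds(path)
        if seconds < min_seconds:
            result["remove"].append(path.name)
        elif seconds > max_seconds:
            result["reslice"].append(path.name)
        else:
            result["keep"].append(path.name)
    return result
```

For a large dataset, pass min_seconds=4.0 instead of the default. The function only reports file names; deleting or re-slicing is left as a manual step so nothing is lost by accident.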

Match loudness

Use Audacity again with Loudness Normalization; 0 dB should do it.
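
To spot-check that the clips ended up at similar levels, the RMS level of each file can be measured. This is only a rough sketch: it computes plain RMS in dBFS for 16-bit PCM files, which is not the same measure as perceived loudness, and the function name is my own.

```python
import array
import math
import sys
import wave


def rms_dbfs(path):
    """Return the RMS level of a 16-bit PCM WAV file in dB relative to full scale."""
    with wave.open(str(path), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        samples = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")  # digital silence
    # 10*log10 of the mean square equals 20*log10 of the RMS value
    return 10 * math.log10(mean_square / 32768.0 ** 2)
```

Clips whose values differ by many dB from the rest are worth a second look before training.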


Set up environment

A virtual environment is essential for running multiple Python tools on one system. I used to use VMs and Docker, but I now find Anaconda much quicker and handier than either.

Create a new environment for so-vits-svc and activate it

conda create -n so-vits-svc python=3.8
conda activate so-vits-svc

Then install requirements

git clone https://github.com/svc-develop-team/so-vits-svc
cd so-vits-svc

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

#for linux
pip install -r requirements.txt

#for windows
pip install -r requirements_win.txt
pip install --upgrade fastapi==0.84.0
pip install --upgrade gradio==3.41.2
pip install --upgrade pydantic==1.10.12
pip install fastapi uvicorn


Download pretrained models

  • pretrain
    • wget https://huggingface.co/WitchHuntTV/checkpoint_best_legacy_500.pt/resolve/main/checkpoint_best_legacy_500.pt
    • wget https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/rmvpe.pt
  • logs/44k
    • wget https://huggingface.co/datasets/ms903/sovits4.0-768vec-layer12/resolve/main/sovits_768l12_pre_large_320k/clean_D_320000.pth
    • wget https://huggingface.co/datasets/ms903/sovits4.0-768vec-layer12/resolve/main/sovits_768l12_pre_large_320k/clean_G_320000.pth
  • logs/44k/diffusion
    • wget https://huggingface.co/datasets/ms903/Diff-SVC-refactor-pre-trained-model/resolve/main/fix_pitch_add_vctk_600k/model_0.pt
    • (Alternative) wget https://huggingface.co/datasets/ms903/DDSP-SVC-4.0/resolve/main/pre-trained-model/model_0.pt
    • (Alternative) wget https://huggingface.co/datasets/ms903/Diff-SVC-refactor-pre-trained-model/resolve/main/hubertsoft_fix_pitch_add_vctk_500k/model_0.pt
  • pretrain/nsf_hifigan
    • wget -P pretrain/ https://github.com/openvpi/vocoders/releases/download/nsf-hifigan-v1/nsf_hifigan_20221211.zip
    • unzip -od pretrain/nsf_hifigan pretrain/nsf_hifigan_20221211.zip
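
After downloading, it can be worth confirming that every file landed in the right folder and is not just a small error page. The sketch below does that. The path list mirrors the folders and file names in the list above exactly as downloaded (adjust it if you rename anything), and the 1 MB threshold is an arbitrary assumption of mine.

```python
from pathlib import Path

# Paths relative to the so-vits-svc folder, as laid out in the list above.
# Assumption: files keep the names they were downloaded with.
EXPECTED_PATHS = [
    "pretrain/checkpoint_best_legacy_500.pt",
    "pretrain/rmvpe.pt",
    "logs/44k/clean_D_320000.pth",
    "logs/44k/clean_G_320000.pth",
    "logs/44k/diffusion/model_0.pt",
    "pretrain/nsf_hifigan",  # folder created by the unzip step
]


def find_problems(root, expected=EXPECTED_PATHS, min_bytes=1_000_000):
    """Return (path, reason) pairs for missing or suspiciously small files."""
    root = Path(root)
    problems = []
    for rel in expected:
        path = root / rel
        if not path.exists():
            problems.append((rel, "missing"))
        elif path.is_file() and path.stat().st_size < min_bytes:
            problems.append((rel, "suspiciously small"))
    return problems
```

An empty result from find_problems(".") inside the so-vits-svc folder means all paths were found.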

Dataset Preparation

Put all prepared .wav audio files into dataset_raw/character

cd so-vits-svc
python resample.py --skip_loudnorm
python preprocess_flist_config.py --speech_encoder vec768l12 --vol_aug
python preprocess_hubert_f0.py --use_diff

Edit Configs

The file is located at configs/config.json

  • log_interval: how often to print the log
  • eval_interval: how often to save checkpoints
  • epochs: total number of training epochs
  • keep_ckpts: number of saved checkpoints to keep, 0 for unlimited
  • half_type: fp32 in my case
  • batch_size: the smaller the faster (rougher), the larger the slower (better)

Recommended batch_size per VRAM: 4=6G; 6=8G; 10=12G; 14=16G; 20=24G
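
The VRAM recommendation can be turned into a small helper that also writes the value into the config. This is a sketch under assumptions: I assume the settings live under a "train" key in configs/config.json, and the function names are hypothetical.

```python
import json

# (minimum VRAM in GB, batch_size) pairs copied from the recommendation above
VRAM_TO_BATCH_SIZE = [(6, 4), (8, 6), (12, 10), (16, 14), (24, 20)]


def recommend_batch_size(vram_gb):
    """Return the largest recommended batch_size that fits in vram_gb."""
    best = None
    for min_vram, batch_size in VRAM_TO_BATCH_SIZE:
        if vram_gb >= min_vram:
            best = batch_size
    if best is None:
        raise ValueError("less than 6 GB of VRAM is not covered by the table")
    return best


def update_train_config(config_path, vram_gb, keep_ckpts=3):
    """Rewrite batch_size and keep_ckpts in the config file and return the section."""
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    # Assumption: training settings are stored under the "train" key.
    train = config.setdefault("train", {})
    train["batch_size"] = recommend_batch_size(vram_gb)
    train["keep_ckpts"] = keep_ckpts
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2, ensure_ascii=False)
    return train
```

For example, update_train_config("configs/config.json", vram_gb=12) would set batch_size to 10. Back up the file first, since the helper overwrites it in place.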

Keep default for configs/diffusion.yaml


python cluster/train_cluster.py --gpu
python train_index.py -c configs/config.json
python train.py -c configs/config.json -m 44k
python train_diff.py -c configs/diffusion.yaml

On training steps:

Use train.py to train the main model. Usually 20k-30k steps give a usable model, and 50k steps and up is good enough. This can take a few days depending on GPU speed. Feel free to stop it with Ctrl+C; training will continue whenever you re-run python train.py -c configs/config.json -m 44k.

Use train_diff.py to train the diffusion model. The recommended number of training steps is about 1/3 of the steps used for the main model.

Beware of overtraining. Use tensorboard --logdir=./logs/44k to monitor the plots and see whether they go flat.

Change the learning rate from 0.0001 to 0.00005 if necessary.
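
Judging whether a curve "goes flat" is usually done by eye, but the idea can be sketched in code. The example below compares the average of the most recent loss values with the average of the window before it. The window size, the 1% threshold and the function names are all hypothetical choices of mine, not anything so-vits-svc provides.

```python
def has_plateaued(losses, window=5, min_improvement=0.01):
    """Return True if the last window improved by less than min_improvement
    (relative) over the window before it."""
    if len(losses) < 2 * window:
        return False  # not enough data to judge
    previous = sum(losses[-2 * window:-window]) / window
    recent = sum(losses[-window:]) / window
    if previous == 0:
        return True
    return (previous - recent) / abs(previous) < min_improvement


def next_learning_rate(current_lr, losses):
    """Halve the learning rate once the curve flattens, e.g. 0.0001 -> 0.00005."""
    return current_lr / 2 if has_plateaued(losses) else current_lr
```

The loss values would have to be copied from the TensorBoard plots or the training log by hand; the sketch does not read them itself.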

When done, share/transport these files for inference.

  • configs/
    • config.json
    • diffusion.yaml
  • logs/44k
    • feature_and_index.pkl
    • kmeans_10000.pt
    • model_0.pt
    • G_xxxxx.pth


It’s time to try out the trained model. I prefer the webui because it makes tweaking the parameters convenient.

Before firing it up, edit the following lines in webUI.py for LAN access:

os.system("start http://localhost:7860")
app.launch(server_name="0.0.0.0", server_port=7860)

Run python webUI.py, then access ipaddress:7860 from a web browser.

The webui has no English localization, but Immersive Translate would be helpful.

Most parameters work well with their default values. Refer to this and this to make changes.

Upload these 5 files:

main model.pt and its config.json

diffusion model.pt and its diffusion.yaml

Either cluster model kmeans_10000.pt for speaking or feature retrieval feature_and_index.pkl for singing.

F0 predictor is for speaking only, not for singing. RMVPE is recommended when using it.

Pitch change is useful when singing a feminine song with a masculine-voiced model, or vice versa.
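
To get a feeling for how large a pitch change is, here is a small worked example. It assumes the pitch change value is given in semitones, which is how I understand the setting; the function names are my own.

```python
def semitones_to_ratio(semitones):
    """Frequency ratio of a pitch shift in equal-tempered semitones."""
    return 2 ** (semitones / 12)


def shift_frequency(f0_hz, semitones):
    """Return the fundamental frequency after shifting by the given semitones."""
    return f0_hz * semitones_to_ratio(semitones)
```

A shift of +12 doubles the frequency (one octave up), and -12 halves it, so shift_frequency(220.0, 12) gives 440 Hz.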

The clustering model/feature retrieval mixing ratio controls the tone. Use 0.1 for the clearest speech, and 0.9 for the tone closest to the model.

Shallow diffusion steps should be set to around 50; it enhances the result at 30-100 steps.

Audio Editing

This step is optional. It is only needed to produce a better-sounding song.

I won’t go into detail here, since the audio editing software, or so-called DAW (digital audio workstation), that I’m using is non-free. I have no intention of advocating proprietary software, even though the entire industry is paywalled and closed-source.

Audacity supports multitrack editing, effects and a lot more. It can load some advanced VST plugins as well.

It’s not hard to find tutorials on mastering songs with Audacity.

Typically, the mastering process consists of mixing/balancing, EQ/compression, reverb, and imaging. The more advanced the tool is, the easier the process will be.

I’ll definitely spend more time on adopting Audacity for my mastering process in the future, and I recommend everyone do the same.


so-vits-svc-fork is a fork of so-vits-svc with realtime support, and its models are compatible. It is easier to use but does not support the diffusion model. For dedicated realtime voice changing, voice-changer is the better recommendation.


conda create -n so-vits-svc-fork python=3.10 pip
conda activate so-vits-svc-fork

git clone https://github.com/voicepaw/so-vits-svc-fork
cd so-vits-svc-fork

python -m pip install -U pip setuptools wheel
pip install -U torch torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install -U so-vits-svc-fork
pip install click
sudo apt-get install libportaudio2


Put dataset .wav files into so-vits-svc-fork/dataset_raw

svc pre-resample
svc pre-config

Edit batch_size in configs/44k/config.json. This fork can take a larger batch size than the original.


svc pre-hubert
svc train -t
svc train-cluster


Use the GUI with svcg. This requires a local desktop environment.

Or use the CLI: svc vc for realtime, and svc infer -m "logs/44k/xxxxx.pth" -c "configs/config.json" raw/xxx.wav for generating.


DDSP-SVC requires fewer hardware resources and runs faster than so-vits-svc. It supports both realtime use and a diffusion model (Diff-SVC).

conda create -n DDSP-SVC python=3.8
conda activate DDSP-SVC

git clone https://github.com/yxlllc/DDSP-SVC
cd DDSP-SVC

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt

Refer to Initialization section for the two files:



python draw.py
python preprocess.py -c configs/combsub.yaml
python preprocess.py -c configs/diffusion-new.yaml

Edit configs/

batch_size: 32  # 16 for diffusion
cache_all_data: false
cache_device: 'cuda'
cache_fp16: false


conda activate DDSP-SVC
python train.py -c configs/combsub.yaml
python train_diff.py -c configs/diffusion-new.yaml

tensorboard --logdir=exp


It’s recommended to use main_diff.py since it includes both DDSP and diffusion model.

python main_diff.py -i "input.wav" -diff "model_xxxxxx.pt" -o "output.wav"
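
To convert a whole folder instead of a single file, the same command can be generated once per input. The sketch below only builds the argument lists, using the -i, -diff and -o flags shown above; the function name and folder layout are hypothetical.

```python
import sys
from pathlib import Path


def build_commands(input_dir, output_dir, diff_model):
    """Return one main_diff.py command (as an argument list) per .wav in input_dir."""
    commands = []
    for wav in sorted(Path(input_dir).glob("*.wav")):
        out = Path(output_dir) / wav.name
        commands.append([
            sys.executable, "main_diff.py",
            "-i", str(wav),
            "-diff", str(diff_model),
            "-o", str(out),
        ])
    return commands
```

Each list can then be passed to subprocess.run(cmd, check=True) from inside the DDSP-SVC folder, with the environment activated and the output folder created beforehand.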

Realtime gui for voice cloning:

python gui_diff.py


Bert-VITS2 is a TTS tool, which makes it completely different from everything above. Using it, I have already created several audiobooks with my voice for my parents, and they really enjoy them.

Instead of using the original, I used the fork by v3u for easier setup.


conda create -n bert-vits2 python=3.9
conda activate bert-vits2

git clone https://github.com/v3ucn/Bert-vits2-V2.3.git
cd Bert-vits2-V2.3

pip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 torchaudio==2.0.2+cu118 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt

Download pretrained models (includes Chinese, Japanese and English):

wget -P slm/wavlm-base-plus/ https://huggingface.co/microsoft/wavlm-base-plus/resolve/main/pytorch_model.bin
wget -P emotional/clap-htsat-fused/ https://huggingface.co/laion/clap-htsat-fused/resolve/main/pytorch_model.bin
wget -P emotional/wav2vec2-large-robust-12-ft-emotion-msp-dim/ https://huggingface.co/audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim/resolve/main/pytorch_model.bin
wget -P bert/chinese-roberta-wwm-ext-large/ https://huggingface.co/hfl/chinese-roberta-wwm-ext-large/resolve/main/pytorch_model.bin
wget -P bert/bert-base-japanese-v3/ https://huggingface.co/cl-tohoku/bert-base-japanese-v3/resolve/main/pytorch_model.bin
wget -P bert/deberta-v3-large/ https://huggingface.co/microsoft/deberta-v3-large/resolve/main/pytorch_model.bin
wget -P bert/deberta-v3-large/ https://huggingface.co/microsoft/deberta-v3-large/resolve/main/pytorch_model.generator.bin
wget -P bert/deberta-v2-large-japanese/ https://huggingface.co/ku-nlp/deberta-v2-large-japanese/resolve/main/pytorch_model.bin

Create a character model folder: mkdir -p Data/xxx/models/

Download base models:

wget -P Data/xxx/models/ https://huggingface.co/OedoSoldier/Bert-VITS2-2.3/resolve/main/DUR_0.pth
wget -P Data/xxx/models/ https://huggingface.co/OedoSoldier/Bert-VITS2-2.3/resolve/main/D_0.pth
wget -P Data/xxx/models/ https://huggingface.co/OedoSoldier/Bert-VITS2-2.3/resolve/main/G_0.pth
wget -P Data/xxx/models/ https://huggingface.co/OedoSoldier/Bert-VITS2-2.3/resolve/main/WD_0.pth

#More options

Edit train_ms.py by replacing all bfloat16 with float16

Edit webui.py for LAN access:

webbrowser.open("http://localhost:7860")
app.launch(server_name="0.0.0.0", server_port=7860)

Edit Data/xxx/config.json for batch_size and spk2id


The workflow is similar to the one in the previous section.

Remove noise and silence, normalize the loudness, then put the unsliced WAV file into Data/xxx/raw.

Edit config.yml for dataset_path, num_workers and keep_ckpts.

Run python3 audio_slicer.py to slice the WAV file.

Clean the dataset (Data/xxx/raw) by removing small files that are under 2 sec.


Install whisper: pip install git+https://github.com/openai/whisper.git

To turn off language auto-detection, force English, and use the large model, edit short_audio_transcribe.py as below:

    # set the spoken language to english
    print('language: en')
    lang = 'en'
    options = whisper.DecodingOptions(language='en')
    result = whisper.decode(model, mel, options)
    # set to use large model
    parser.add_argument("--whisper_size", default="large")

    # Fixes the error "Given groups=1, weight of size [1280, 128, 3], expected input[1, 80, 3000] to have 128 channels, but got 80 channels instead" when using the large model
    mel = whisper.log_mel_spectrogram(audio, n_mels=128).to(model.device)

Run python3 short_audio_transcribe.py to start transcription

Re-sample the sliced dataset: python3 resample.py --sr 44100 --in_dir ./Data/zizek/raw/ --out_dir ./Data/zizek/wavs/

Preprocess transcription: python3 preprocess_text.py --transcription-path ./Data/zizek/esd.list

Generate BERT feature config: python3 bert_gen.py --config-path ./Data/zizek/configs/config.json
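
Before running the preprocessing steps, it can help to sanity-check the transcription file. The sketch below assumes each line of esd.list has four fields separated by "|" (audio path, speaker, language code, text). That layout is my assumption about what the transcription script writes, so compare it with your own file first. The function name is hypothetical.

```python
def check_transcript_lines(lines, expected_language="EN"):
    """Return (line_number, problem) pairs for lines that look malformed.

    Assumption: each line is 'path|speaker|language|text'.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("|")
        if len(fields) != 4:
            problems.append((number, f"expected 4 fields, found {len(fields)}"))
            continue
        path, speaker, language, text = fields
        if language != expected_language:
            problems.append(
                (number, f"language is {language!r}, expected {expected_language!r}")
            )
        if not text.strip():
            problems.append((number, "empty transcription"))
    return problems
```

It can be fed an open file directly, for example check_transcript_lines(open("Data/zizek/esd.list", encoding="utf-8")). Lines flagged here usually correspond to clips that are worth listening to again.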

Training and Inference

Run python3 train_ms.py to start training

Edit config.yml for model path:

model: "models/G_20900.pth"

Run python3 webui.py to start webui for inference


vits-simple-api is a web frontend for using trained models. I use it mainly for its long text support, which the original project doesn’t have.

git clone https://github.com/Artrajz/vits-simple-api
cd vits-simple-api
git pull https://github.com/Artrajz/vits-simple-api

conda create -n vits-simple-api python=3.10 pip
conda activate vits-simple-api

pip install -r requirements.txt

(Optional) Copy pretrained model files from Bert-vits2-V2.3/ to vits-simple-api/bert_vits2/

Copy Bert-vits2-V2.3/Data/xxx/models/G_xxxxx.pth and Bert-vits2-V2.3/Data/xxx/config.json to vits-simple-api/Model/xxx/

Edit config.py for MODEL_LIST and Default parameter as preferred

Edit Model/xxx/config.json as below:

  "data": {
    "training_files": "Data/train.list",
    "validation_files": "Data/val.list",
  "version": "2.3"

Check/Edit model_list in config.yml as [xxx/G_xxxxx.pth, xxx/config.json]

Run python app.py


  • SDP Ratio: tone
  • Noise: randomness
  • Noise_W: pronunciation
  • Length: speed
  • emotion and style are self-explanatory

Share models

In its Hugging Face repo, there are a lot of VITS models shared by others. You can try them out first and then download the ones you want from Files.

The Genshin model is widely used in some content creation communities because of its high quality. It contains hundreds of characters, although only Chinese and Japanese are supported.

In another repo, there are a lot of Bert-vits2 models made from popular Chinese streamers and VTubers.

There are already projects making AI VTubers, like this and this. I’m looking forward to seeing how this technology changes the industry in the near future.