AI-generated content can be fun or “slop”, as Simon Willison puts it, but it can also be malevolent when abused in phishing attacks.

Some of my readers may already know that I’ve recently been working on a side project, which is based on chatgpt-html and uses LLMs to detect phishing emails. I think at some point the tool should be able to detect phishing attempts in video content too, because deepfake technology is so accessible nowadays and its generated content can be quite convincing.

Previously, I have played with image generation and audio generation. It’s time to play around with videos, so let’s get started.


SadTalker is an image-to-video lip sync tool. Since I updated my Stable Diffusion this year and the SadTalker SD extension does not work with SD v1.8, I’m using the standalone version instead.

Installation is pretty simple, according to the official repo:

git clone

cd SadTalker 

conda create -n sadtalker python=3.8

conda activate sadtalker

pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url

conda install ffmpeg

pip install -r requirements.txt

pip install tts --no-cache

bash scripts/



  • Fix AttributeError: 'Row' object has no attribute 'style' using

  • Fix FFmpeg cannot edit existing files in-place with pip install gradio==4.1.1

  • Fix OpenCV: FFMPEG: fallback to use tag 0x7634706d/'mp4v' or resize the image into 256x256/512x512.

  • To serve on the LAN, edit the launch() call to launch(server_name="", server_port=7860).


  • With a low-resolution input image, the video length can be up to 5 minutes on my 24GB of VRAM (256px/512px, no still, no GFPGAN). With a high-resolution input image (2K), the video can’t get longer than 1 minute due to OOM.

  • GFPGAN makes the mouth very clear. full causes ghosting head movement; still reduces it.

  • resize doesn’t work; I have to manually prepare square photos (512x512) instead (see the ffmpeg sketch after this list).

  • When generating an anime-style image with the smile tag (in SD) to make the face easier to detect, pick an art style with the nose and mouth visible (many anime-style checkpoints don’t have them).

  • To reduce OOM

    • edit --batch_size=2 to 1 in
    • run in the Linux CLI: export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:32
    • or edit in with:
import os
# cap the CUDA allocator's split size to reduce memory fragmentation
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:32"
batch_size = 1
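
Since resize is unreliable for me, here is a minimal ffmpeg sketch for preparing a square 512x512 input photo; the file names are placeholders, and any image editor works just as well:

# center-crop to a square, then scale down to 512x512
ffmpeg -i portrait.png -vf "crop='min(iw,ih)':'min(iw,ih)',scale=512:512" face_512.png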


SadTalker-Video-Lip-Sync is a video-to-video lip sync tool. It works well and gives a little bit more motion in the results, but it also consumes more VRAM.

The installation takes more effort due to the lack of documentation.

git clone

cd SadTalker-Video-Lip-Sync

conda create -n SadTalker-Video-Lip-Sync python=3.8
conda activate SadTalker-Video-Lip-Sync

pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url

conda install ffmpeg
pip install -r requirements.txt

python -m pip install paddlepaddle-gpu==2.3.2.post112 \

Download the pretrained models and extract them:

tar -zxvf checkpoints.tar.gz

then replace the `SadTalker-Video-Lip-Sync/checkpoints` directory with the extracted one.


Use so-vits-svc or vits-simple-api to generate an audio sample.

The input video should ideally have the mouth closed and a stable head position.

Use ffmpeg to match the lengths of the input video and audio:

ffmpeg -t 30 -i tts_output_audio.wav audio.wav
ffmpeg -ss 00:00:00 -to 00:00:30 -i input_video.mp4 -c copy video.mp4
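
To double-check that the two trimmed files end up the same length, ffprobe can print each duration (same file names as above):

ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 audio.wav
ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 video.mp4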

Note: For my 24GB of VRAM, lengths under 60s are safe from OOM errors.


python --driven_audio <audio.wav> \
       --source_video <video.mp4> \
       --enhancer <none,lip,face>
# --enhancer (null for lip):
#   none: do not enhance
#   lip: only enhance the lip region
#   face: enhance the whole face (skin, nose, eyes, brows, lips)

I used it like this: python --driven_audio "/home/user/SadTalker-Video-Lip-Sync/audio.wav" --source_video "/home/user/SadTalker-Video-Lip-Sync/video.mp4" --enhancer face

Before inference starts, it will download and load more models.

If the mouth is closed in the input video, use --enhancer lip. If the person in the input video is speaking, use --enhancer face.

I couldn’t get DAIN to work at this point, but the result is already satisfying without it.


Fix for AttributeError: _2D: edit src/face3d/ and replace face_alignment.LandmarksType._2D with face_alignment.LandmarksType.TWO_D.


Wav2Lip STUDIO is another video-to-video lip sync tool. I chose the SD extension rather than the standalone version. It gives more control than SadTalker-Video-Lip-Sync. More interestingly, it has a face swap function!

The install is really easy following the official guide.

In case you need to install SD itself, my old article is outdated, so here is the new way to do it:

conda create --name stablediffusion python=3.10
conda activate stablediffusion
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
cd stable-diffusion-webui
git pull
pip install -r requirements.txt
python --listen --enable-insecure-extension-access --no-half-vae

Note: I launch with --no-half-vae because of the limitations of my old GPU.

My Workflow

  1. Use Edge TTS and/or so-vits-svc to generate an audio file with the desired voice and content.

  2. Slice the audio file into short pieces, for example ffmpeg -i input.wav -segment_time 00:01:00 -f segment output_file%03d.wav for one-minute segments.

  3. Run each audio slice through SadTalker, SadTalker-Video-Lip-Sync, or Wav2Lip STUDIO to generate the video with the desired character.

  4. Combine all video pieces in a video editing tool or with ffmpeg (a rough end-to-end sketch follows below).
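
As a rough end-to-end sketch of steps 1, 2, and 4 (the voice name, file names, and segment length are placeholder examples; step 3 is whichever lip sync tool you picked):

# 1. generate speech with Edge TTS (the edge-tts CLI comes from pip install edge-tts)
edge-tts --voice en-US-AriaNeural --text "Your demo script here." --write-media tts_output.mp3
ffmpeg -i tts_output.mp3 input.wav

# 2. slice the audio into one-minute pieces
ffmpeg -i input.wav -f segment -segment_time 00:01:00 output_file%03d.wav

# 3. run each slice through SadTalker, SadTalker-Video-Lip-Sync, or Wav2Lip STUDIO ...

# 4. concatenate the generated pieces without re-encoding (they must share codec and resolution)
printf "file '%s'\n" part_000.mp4 part_001.mp4 part_002.mp4 > list.txt
ffmpeg -f concat -safe 0 -i list.txt -c copy combined.mp4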

Disclaimer: This post is for educational purposes only. I am not responsible if you deepfake your president and push your country into fascism, commit genocide, start a nuclear war, etc. You are using this at your own risk. However, it’s very likely this won’t happen… at least not because of deepfake abuse : )