AI-generated content can be fun, or “slop” as Simon Willison calls it, but it can also be malicious when abused in phishing attacks.
Some of my readers may already know that I have recently been working on a side project, based on chatgpt-html, that uses LLMs to detect phishing emails. I think at some point the tool should be able to detect phishing attempts in video content too, because deepfake technology is so accessible nowadays and its output can be quite convincing.
Previously, I have played with image generation and audio generation. It's time to play around with video, so let's get started.
SadTalker
SadTalker is an image-to-video lip sync tool. Since I updated my Stable Diffusion install this year and the SadTalker SD extension does not work with SD WebUI v1.8, I'm using the standalone version instead.
Installation is pretty simple, following the official repo:
git clone https://github.com/OpenTalker/SadTalker.git
cd SadTalker
conda create -n sadtalker python=3.8
conda activate sadtalker
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
conda install ffmpeg
pip install -r requirements.txt
pip install tts --no-cache
bash scripts/download_models.sh
python app_sadtalker.py
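A quick sanity check that the CUDA build of PyTorch actually landed in the environment; this is a generic check, not something from the SadTalker docs:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"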
Troubleshooting
- Fix AttributeError: 'Row' object has no attribute 'style': use app_sadtalker.zip.
- Fix FFmpeg cannot edit existing files in-place: pip install gradio==4.1.1.
- Fix OpenCV: FFMPEG: fallback to use tag 0x7634706d/'mp4v': resize the image to 256x256 or 512x512.
- To serve on LAN, edit app_sadtalker.py with launch(server_name="0.0.0.0", server_port=7860).
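For reference, a minimal sketch of that last edit, assuming the Gradio Blocks object in app_sadtalker.py is named demo (Gradio's launch() does accept server_name and server_port):
# at the bottom of app_sadtalker.py; the variable name `demo` is an assumption
demo.launch(server_name="0.0.0.0", server_port=7860)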
Notes
- With a low resolution input image (256px/512px, no still mode, no GFPGAN), the video length can go up to 5 minutes on my 24GB of VRAM. With a high resolution input image (2K), the video can't get longer than 1 minute due to OOM.
- GFPGAN makes the mouth very clear. full causes a ghosting head movement; still reduces it.
- resize doesn't work; I have to manually prepare square photos (512x512) instead.
- Generate the anime style image with the smile tag (in SD) to increase detectability; the art style must have the nose and mouth visible (many anime style checkpoints don't).
- To reduce OOM:
  - edit the --batch_size default from 2 to 1 in inference.py, or
  - run in the Linux CLI: export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:32, or
  - edit app_sadtalker.py with:
    import os
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:32"
    batch_size = 1
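To see whether those tweaks actually help, I keep an eye on VRAM while a generation runs; a plain nvidia-smi watch is enough, nothing SadTalker-specific:
watch -n 1 nvidia-smi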
SadTalker-Video-Lip-Sync
SadTalker-Video-Lip-Sync is a video-to-video lip sync tool. It produces a bit more motion in the results, but also consumes more VRAM.
The installation takes more effort due to the lack of documentation.
git clone https://github.com/Zz-ww/SadTalker-Video-Lip-Sync.git
cd SadTalker-Video-Lip-Sync
conda create -n SadTalker-Video-Lip-Sync python=3.8
conda activate SadTalker-Video-Lip-Sync
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
conda install ffmpeg
pip install -r requirements.txt
python -m pip install paddlepaddle-gpu==2.3.2.post112 \
-f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
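PaddlePaddle GPU builds can be finicky, so it's worth verifying the install before moving on; run_check() is part of PaddlePaddle's public utils API:
python -c "import paddle; paddle.utils.run_check()"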
Download the pretrained models:
https://drive.google.com/file/d/1lW4mf5YNtS4MAD7ZkAauDDWp2N3_Qzs7/view?usp=sharing
tar -zxvf checkpoints.tar.gz
Then replace the `SadTalker-Video-Lip-Sync/checkpoints` directory with the extracted one.
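Concretely, the "replace" step can be as simple as deleting the repo's own checkpoints directory before extracting; this assumes the tarball sits in the repo root and unpacks to a checkpoints/ directory:
# run from inside SadTalker-Video-Lip-Sync; assumes checkpoints.tar.gz unpacks to ./checkpoints
rm -rf checkpoints
tar -zxvf checkpoints.tar.gz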
Samples
Use so-vits-svc or vits-simple-api to generate an audio sample.
The input video should preferably have the mouth closed and a stable head position.
Use ffmpeg to match the lengths of the input video and audio:
ffmpeg -t 30 -i tts_output_audio.wav audio.wav
ffmpeg -ss 00:00:00 -to 00:00:30 -i input_video.mp4 -c copy video.mp4
Note: with my 24GB of VRAM, a length under 60 seconds is safe from OOM errors.
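To double check that the trimmed clips ended up with matching lengths, ffprobe can print each duration (standard ffprobe options, nothing tool-specific):
ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 audio.wav
ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 video.mp4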
Inference
python inference.py --driven_audio <audio.wav> \
                    --source_video <video.mp4> \
                    --enhancer <none, lip, face>  # (null for lip)
# none: do not enhance
# lip: only enhance the lip region
# face: enhance the whole face (skin, nose, eye, brow, lip)
I used it like this:
python inference.py --driven_audio "/home/user/SadTalker-Video-Lip-Sync/audio.wav" --source_video "/home/user/SadTalker-Video-Lip-Sync/video.mp4" --enhancer face
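Since I end up running this over several audio slices (see the workflow at the end), a small shell loop saves some typing; the audio_*.wav naming is my own convention, not something the tool requires:
# run inference for every audio slice against the same source video
for f in audio_*.wav; do
  python inference.py --driven_audio "$f" --source_video video.mp4 --enhancer face
done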
Before inference starts, it will download and load some more models.
If the input video has the mouth closed, use --enhancer lip. If the input video shows the person speaking, use --enhancer face.
I couldn’t get DAIN to work at this point, but the result is already satisfying without it.
Troubleshooting
Fix AttributeError: _2D: edit src/face3d/extract_kp_videos.py and replace face_alignment.LandmarksType._2D with face_alignment.LandmarksType.TWO_D.
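If you prefer a one-liner, a sed substitution along these lines does the same edit (if the error persists, other files under src/ may need the same change):
sed -i 's/LandmarksType\._2D/LandmarksType.TWO_D/g' src/face3d/extract_kp_videos.py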
Wav2Lip STUDIO
Wav2Lip STUDIO is another video-to-video lip sync tool. I chose the SD extension rather than the standalone version. It gives more control than SadTalker-Video-Lip-Sync. More interestingly, it has a face swap function!
The installation is really easy if you follow the official guide.
As for installing SD itself, my old article is outdated, so here is the new way:
conda create --name stablediffusion python=3.10
conda activate stablediffusion
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
cd stable-diffusion-webui
git pull
pip install -r requirements.txt
python launch.py --listen --enable-insecure-extension-access --no-half-vae
Note: I launch with --no-half-vae because of the limitations of my old GPU.
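If you launch through webui.sh instead of launch.py, the same flags can live in webui-user.sh via COMMANDLINE_ARGS, the standard AUTOMATIC1111 convention (I launch directly, so this is just an alternative):
# webui-user.sh
export COMMANDLINE_ARGS="--listen --enable-insecure-extension-access --no-half-vae"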
My Workflow
- Use Edge TTS and/or so-vits-svc to generate an audio file with the desired voice and content.
- Slice the audio file into short pieces, for example ffmpeg -i input.wav -segment_time 00:01:00 -f segment output_file%03d.wav for 1-minute segments.
- Run each audio slice through SadTalker, SadTalker-Video-Lip-Sync, or Wav2Lip STUDIO to generate video of the desired character.
- Combine all the video pieces in a video editing tool or with ffmpeg (see the concat sketch below).
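A minimal sketch of the ffmpeg route for that last step, using the concat demuxer; the piece file names here are placeholders, and all pieces are assumed to share the same codec and resolution so stream copy works:
# list the pieces in playback order, then concatenate without re-encoding
printf "file 'piece000.mp4'\nfile 'piece001.mp4'\n" > list.txt
ffmpeg -f concat -safe 0 -i list.txt -c copy combined.mp4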
Disclaimer: This post is for educational purposes only. I am not responsible if you deepfake your president and turn your country fascist, commit genocide, start a nuclear war, etc. You are using this at your own risk. However, it's very likely none of that will happen… at least not because of deepfake abuse : )