AI-generated content can be fun, or “slop” as Simon Willison calls it, but it can also be malicious when abused in phishing attacks.
Some of my readers may already know that I have recently been working on a side project, based on chatgpt-html, that uses LLMs to detect phishing emails. I think at some point the tool should be able to detect phishing attempts in video content too, because deepfake technology is so accessible nowadays and its output can be quite convincing.
Previously, I have played with image generation and audio generation. It's time to play around with video, so let's get started.
SadTalker
SadTalker is an image-to-video lip sync tool. Since I updated my Stable Diffusion install this year and the SadTalker SD extension does not work with SD WebUI v1.8, I'm using the standalone version instead.
Installation is pretty simple, following the official repo:
git clone https://github.com/OpenTalker/SadTalker.git
cd SadTalker
conda create -n sadtalker python=3.8
conda activate sadtalker
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
conda install ffmpeg
pip install -r requirements.txt
pip install tts --no-cache
bash scripts/download_models.sh
python app_sadtalker.py
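A quick sanity check that the CUDA build of PyTorch actually landed in the environment; this is a generic check, not something from the SadTalker docs:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"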
Troubleshooting
- Fix AttributeError: 'Row' object has no attribute 'style': use app_sadtalker.zip.
- Fix FFmpeg cannot edit existing files in-place: pip install gradio==4.1.1.
- Fix OpenCV: FFMPEG: fallback to use tag 0x7634706d/'mp4v': resize the image to 256x256 or 512x512.
- To serve on LAN, edit app_sadtalker.py with launch(server_name="0.0.0.0", server_port=7860).
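For reference, a minimal sketch of that last edit, assuming the Gradio Blocks object in app_sadtalker.py is named demo (Gradio's launch() does accept server_name and server_port):
# at the bottom of app_sadtalker.py; the variable name `demo` is an assumption
demo.launch(server_name="0.0.0.0", server_port=7860)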
Notes
- With a low resolution input image (256px/512px, no still mode, no GFPGAN), the video length can go up to 5 minutes on my 24GB of VRAM. With a high resolution input image (2K), the video can't get longer than 1 minute due to OOM.
- GFPGAN makes the mouth very clear. full causes a ghosting head movement; still reduces it.
- resize doesn't work; I have to manually prepare square photos (512x512) instead.
- Generate the anime style image with the smile tag (in SD) to increase detectability; the art style must have the nose and mouth visible (many anime style checkpoints don't).
- To reduce OOM:
  - edit the --batch_size default from 2 to 1 in inference.py, or
  - run in the Linux CLI: export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:32, or
  - edit app_sadtalker.py with:
    import os
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:32"
    batch_size = 1
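To see whether those tweaks actually help, I keep an eye on VRAM while a generation runs; a plain nvidia-smi watch is enough, nothing SadTalker-specific:
watch -n 1 nvidia-smi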
SadTalker-Video-Lip-Sync
SadTalker-Video-Lip-Sync is a video-to-video lip sync tool. It produces a bit more motion in the results, but also consumes more VRAM.
The installation takes more effort due to the lack of documentation.
git clone https://github.com/Zz-ww/SadTalker-Video-Lip-Sync.git
cd SadTalker-Video-Lip-Sync
conda create -n SadTalker-Video-Lip-Sync python=3.8
conda activate SadTalker-Video-Lip-Sync
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
conda install ffmpeg
pip install -r requirements.txt
python -m pip install paddlepaddle-gpu==2.3.2.post112 \
-f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
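PaddlePaddle GPU builds can be finicky, so it's worth verifying the install before moving on; run_check() is part of PaddlePaddle's public utils API:
python -c "import paddle; paddle.utils.run_check()"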
Download the pretrained models:
https://drive.google.com/file/d/1lW4mf5YNtS4MAD7ZkAauDDWp2N3_Qzs7/view?usp=sharing
tar -zxvf checkpoints.tar.gz
Then replace the `SadTalker-Video-Lip-Sync/checkpoints` directory with the extracted one.
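Concretely, the "replace" step can be as simple as deleting the repo's own checkpoints directory before extracting; this assumes the tarball sits in the repo root and unpacks to a checkpoints/ directory:
# run from inside SadTalker-Video-Lip-Sync; assumes checkpoints.tar.gz unpacks to ./checkpoints
rm -rf checkpoints
tar -zxvf checkpoints.tar.gz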
Samples
Use so-vits-svc or vits-simple-api to generate an audio sample.
The input video should preferably have the mouth closed and a stable head position.
Use ffmpeg to match the lengths of the input video and audio:
ffmpeg -t 30 -i tts_output_audio.wav audio.wav
ffmpeg -ss 00:00:00 -to 00:00:30 -i input_video.mp4 -c copy video.mp4
Note: with my 24GB of VRAM, a length under 60 seconds is safe from OOM errors.
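To double check that the trimmed clips ended up with matching lengths, ffprobe can print each duration (standard ffprobe options, nothing tool-specific):
ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 audio.wav
ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 video.mp4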
Inference
python inference.py --driven_audio <audio.wav> \
                    --source_video <video.mp4> \
                    --enhancer <none, lip, face>  # (null for lip)
# none: do not enhance
# lip: only enhance the lip region
# face: enhance the whole face (skin, nose, eye, brow, lip)
I used it like this:
python inference.py --driven_audio "/home/user/SadTalker-Video-Lip-Sync/audio.wav" --source_video "/home/user/SadTalker-Video-Lip-Sync/video.mp4" --enhancer face
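Since I end up running this over several audio slices (see the workflow at the end), a small shell loop saves some typing; the audio_*.wav naming is my own convention, not something the tool requires:
# run inference for every audio slice against the same source video
for f in audio_*.wav; do
  python inference.py --driven_audio "$f" --source_video video.mp4 --enhancer face
done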
Before inference starts, it will download and load some more models.
If the input video has the mouth closed, use --enhancer lip. If the input video shows the person speaking, use --enhancer face.
I couldn’t get DAIN to work at this point, but the result is already satisfying without it.
Troubleshooting
Fix AttributeError: _2D: edit src/face3d/extract_kp_videos.py and replace face_alignment.LandmarksType._2D with face_alignment.LandmarksType.TWO_D.
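If you prefer a one-liner, a sed substitution along these lines does the same edit (if the error persists, other files under src/ may need the same change):
sed -i 's/LandmarksType\._2D/LandmarksType.TWO_D/g' src/face3d/extract_kp_videos.py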
Wav2Lip STUDIO
Wav2Lip STUDIO is another video-to-video lip sync tool. I chose the SD extension rather than the standalone version. It gives more control than SadTalker-Video-Lip-Sync. More interestingly, it has a face swap function!
The installation is really easy if you follow the official guide.
As for installing SD itself, my old article is outdated, so here is the new way:
conda create --name stablediffusion python=3.10
conda activate stablediffusion
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
cd stable-diffusion-webui
git pull
pip install -r requirements.txt
python launch.py --listen --enable-insecure-extension-access --no-half-vae
Note: I launch with --no-half-vae because of the limitations of my old GPU.
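If you launch through webui.sh instead of launch.py, the same flags can live in webui-user.sh via COMMANDLINE_ARGS, the standard AUTOMATIC1111 convention (I launch directly, so this is just an alternative):
# webui-user.sh
export COMMANDLINE_ARGS="--listen --enable-insecure-extension-access --no-half-vae"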
My Workflow
- Use Edge TTS and/or so-vits-svc to generate an audio file with the desired voice and content.
- Slice the audio file into short pieces, for example ffmpeg -i input.wav -segment_time 00:01:00 -f segment output_file%03d.wav for 1-minute segments.
- Run each audio slice through SadTalker, SadTalker-Video-Lip-Sync, or Wav2Lip STUDIO to generate video of the desired character.
- Combine all the video pieces in a video editing tool or with ffmpeg (see the concat sketch below).
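A minimal sketch of the ffmpeg route for that last step, using the concat demuxer; the piece file names here are placeholders, and all pieces are assumed to share the same codec and resolution so stream copy works:
# list the pieces in playback order, then concatenate without re-encoding
printf "file 'piece000.mp4'\nfile 'piece001.mp4'\n" > list.txt
ffmpeg -f concat -safe 0 -i list.txt -c copy combined.mp4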
Disclaimer: This post is for educational purposes only. I am not responsible if you deepfake your president and turn your country fascist, commit genocide, start a nuclear war, etc. You are using this at your own risk. However, it's very likely none of that will happen… at least not because of deepfake abuse : )