Stability AI released a new Stable Diffusion model that generates video frames from an input image…for FREE. This new tool has the potential to make a huge impact in the fields of content generation and marketing, just to name a few.
In the field of marketing, short video clips provide several advantages over static images:
- Captures attention: Videos are generally more engaging than static images, making them ideal for grabbing attention and conveying information effectively.
- Enhances storytelling: You can bring still images to life with movement, creating a more immersive and impactful story.
Currently, taking a still image and creating a realistic motion effect requires expensive or complex video editing software and, arguably, a lot of time. The new Stable Video Diffusion models, however, allow an image to be converted into a video without the need for video editing software. Instead, videos can be generated by anyone with some Python code and a decent Nvidia video card.
This post will guide you through the process of getting this model set up and running on your personal computer – no need to pay for expensive cloud processing platforms.
Who is Stability AI?
Stability AI is a leading generative AI research company. They develop bleeding-edge AI models with the goal of making them open-access and able to run with minimal resources. They build AI tools for working with images, language, code, and audio.
What is Stable Video Diffusion (SVD)?
Stable Diffusion is a family of powerful deep learning, image-generating models released in 2022 by Stability AI. They use diffusion techniques to transform noise into high-quality images based on text descriptions or existing visuals.
The Stable Video Diffusion (SVD) Image-to-Video is a latent diffusion model trained to generate short video clips from an image. Currently, there are two models that have been released:
- stable-video-diffusion-img2vid
- stable-video-diffusion-img2vid-xt
The first model, stable-video-diffusion-img2vid, generates up to 14 frames from a given input image. The XT model can generate up to 25 frames. Both models, however, have input arguments that allow fewer frames to be generated.
Both models generate video at the 1024×576 resolution.
Where To Find
Currently, both models (stable-video-diffusion-img2vid and stable-video-diffusion-img2vid-xt) are available on HuggingFace.co. To read more about what HuggingFace.co is, see my other article: How To Get Started With Mistral-7B Tutorial
Here are links to the model’s page on HuggingFace.co:
- stable-video-diffusion-img2vid (14 frames): https://huggingface.co/stabilityai/stable-video-diffusion-img2vid
- stable-video-diffusion-img2vid-xt (25 frames): https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt
See below for instructions on how to download the models.
Prerequisites
We will be using Python to run this model.
The files for this model total over 20GB in size. Be sure to have at least 25GB of free space before running the code below. If you do not include the *.safetensors flag, the model size will be around 10GB.
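If you would rather download the weights ahead of time (or limit which files are pulled), the huggingface_hub library can fetch the repository directly. This is optional – the code later in this post downloads the model automatically – and the allow_patterns filter below is just an illustration of how to restrict the download to the config and safetensors files:
from huggingface_hub import snapshot_download

# Download the XT model into the local HuggingFace cache.
# allow_patterns restricts the download; adjust the patterns to the files you actually need.
snapshot_download(
    repo_id="stabilityai/stable-video-diffusion-img2vid-xt",
    allow_patterns=["*.json", "*.safetensors"],
)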
Before you begin, you will need to make sure the following packages are installed in the Python environment you will be using: diffusers, transformers, and accelerate.
pip install -q -U diffusers transformers accelerate
The transformers package was created by HuggingFace.co. It includes many powerful functions that help reduce setup time when working with AI models. One of the best features is that it handles connecting to the HuggingFace repository, downloading the model, checking for any updates, and running the model – all from a single line of Python code!
Second, you will need to install PyTorch.
PyTorch is a library that provides tools for accelerating the models when running on either the CPU or GPU. The Transformers functions require PyTorch to be installed. However, it is important to make sure you install the correct PyTorch version for your system.
Click here for instructions to install PyTorch: https://pytorch.org/get-started/locally/
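Once PyTorch is installed, a quick sanity check (not part of the model code below) confirms that it can see your GPU. If the first line prints False, the pipeline will not be able to run on CUDA:
import torch

# Verify that PyTorch was built with CUDA support and can see your GPU
print(torch.cuda.is_available())          # Should print True on a working CUDA setup
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. the name of your Nvidia card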
After these packages are installed, open your favorite Python editor and copy the code below into a blank file.
Python Code
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Load the fp16 variant of the SVD-XT model (downloaded from HuggingFace on first run)
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
)
# Offload sub-models to the CPU when they are not in use to reduce GPU memory
pipe.enable_model_cpu_offload()

# Load the conditioning image and resize it to the model's native resolution
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png?download=true")
image = image.resize((1024, 576))

# Fix the random seed so the result is reproducible
generator = torch.manual_seed(42)

# Generate the video frames and save them as an MP4
frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
The code above runs the model on your GPU. Depending on which GPU you have, you may see CUDA out-of-memory errors. I had to spend several hours tweaking the code to get it running on my Nvidia GeForce RTX 3080 10GB.
The modified code below is what ended up working for me. I had to reduce the number of frames generated to no more than 10, and I also reduced the decode_chunk_size parameter from 8 to 3. You may have to play with these parameters a bit.
import gc
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video
from GPUtil import showUtilization as gpu_usage

# Load the fp16 variant of the SVD-XT model
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
)
#pipe.enable_model_cpu_offload()
pipe.to("cuda")  # Force the entire pipeline onto the GPU

# Load the conditioning image and resize it to the model's native resolution
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png?download=true")
image = image.resize((1024, 576))

generator = torch.manual_seed(42)

# Perform GPU memory cleanup and print the current GPU utilization
gc.collect()
torch.cuda.empty_cache()
gpu_usage()

# Generate 10 frames, decoding 3 at a time to stay within 10GB of VRAM
frames = pipe(image, decode_chunk_size=3, generator=generator, num_frames=10, motion_bucket_id=180, noise_aug_strength=0.3).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
This code will output a video with 10 frames instead of 14 or 25 (depending on which model you are using).
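If you still run into CUDA out-of-memory errors, the diffusers documentation also describes offloading idle sub-models to the CPU and chunking the UNet's feed-forward computation over the frame dimension. The sketch below assumes you use these calls in place of pipe.to("cuda") in the modified code above; it trades generation speed for lower VRAM usage, and decoding one frame at a time can introduce some flicker:
# Alternative memory-saving setup (sketch, based on the diffusers SVD documentation)
pipe.enable_model_cpu_offload()      # keep idle sub-models on the CPU
pipe.unet.enable_forward_chunking()  # run the UNet's feed-forward layers frame by frame

# decode_chunk_size=1 decodes a single frame at a time and uses the least memory
frames = pipe(image, decode_chunk_size=1, generator=generator, num_frames=10).frames[0]
export_to_video(frames, "generated.mp4", fps=7)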
Model Parameters
decode_chunk_size: This controls how many frames are decoded at once. It’s recommended to tweak this value based on your GPU memory. Setting decode_chunk_size=1 will decode one frame at a time and will use the least amount of memory, but the video might have some flickering.
num_frames: The number of frames the model will generate.
fps: The frames per second of the generated video.
num_inference_steps: This has the same effect as in the image generation models. Usually a larger number will generate images with more detail. The tradeoff is that the model will take longer to complete.
motion_bucket_id: This parameter determines how much motion the video will demonstrate. Lower values seem to work well (between 10 and 180). Higher values will start to distort the images.
noise_aug_strength: The amount of noise added to the conditioning image. The higher the values the less the video will resemble the conditioning image. Increasing this value will also increase the motion of the generated video.
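To see how these parameters fit together, here is an example call that reuses the pipe, image, and generator from the code above; the values are illustrative, not recommendations:
# Example call combining the parameters above (values are illustrative)
frames = pipe(
    image,
    num_frames=14,             # how many frames to generate
    num_inference_steps=25,    # more steps = more detail, but slower generation
    motion_bucket_id=127,      # higher = more motion; very high values distort the image
    noise_aug_strength=0.1,    # more noise = more motion, less resemblance to the input image
    decode_chunk_size=2,       # frames decoded at once; lower values use less GPU memory
    generator=generator,
).frames[0]
export_to_video(frames, "generated.mp4", fps=7)  # fps of the saved video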
For more information about the model, see here: https://huggingface.co/docs/diffusers/using-diffusers/svd
Examples
Below are 3 examples of images that were converted into short videos using Stable Video Diffusion:
Conclusion
This is just the beginning for these types of models. Ultimately, I believe AI-powered video generation will become the preferred method for generating action shots from pre-existing images. In the near future, expect to see similar AI tools integrated into the leading video editing software packages.
Let me know in the comments if you have any success or experience issues running the code. I will do my best to help!