Kandinsky 5.0 I2V Lite - Diffusers
This repository provides the 🤗 Diffusers integration for Kandinsky 5.0 Lite, a lightweight video generation model (2B parameters) that ranks #1 among open-source models in its class.
Project Updates
- 🔥 2025/09/29: We have open-sourced Kandinsky 5.0 T2V Lite, a lite (2B parameters) version of the Kandinsky 5.0 Video text-to-video generation model.
- 🚀 Diffusers Integration: Now available with an easy-to-use 🤗 Diffusers pipeline!
Kandinsky 5.0 Lite
Kandinsky 5.0 T2V Lite is a lightweight video generation model (2B parameters) that ranks #1 among open-source models in its class. It outperforms larger Wan models (5B and 14B) and offers the best understanding of Russian concepts in the open-source ecosystem.
We provide 9 model variants, each optimized for different use cases:
- SFT model: delivers the highest generation quality
- CFG-distilled: runs 2× faster
- Diffusion-distilled: enables low-latency generation with minimal quality loss (6× faster); see the loading sketch after this list
- Pretrain model: designed for fine-tuning by researchers and enthusiasts
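Distilled variants generally trade sampling steps and guidance for speed. A minimal loading sketch, assuming a hypothetical distilled checkpoint id and typical (unconfirmed) settings for distilled samplers:

```python
import torch
from diffusers import Kandinsky5I2VPipeline
from diffusers.utils import load_image

# Hypothetical repository id for a distilled variant -- check the hub
# for the actual checkpoint names.
pipe = Kandinsky5I2VPipeline.from_pretrained(
    "kandinskylab/Kandinsky-5.0-I2V-Lite-distilled-Diffusers",  # assumed id
    torch_dtype=torch.bfloat16,
).to("cuda")

image = load_image("https://example.com/input.jpg")  # placeholder image

# Distilled models are usually run with far fewer steps, and guidance
# is typically disabled (guidance_scale=1.0) since CFG is baked into
# the weights. Exact values here are assumptions, not official
# recommendations.
video = pipe(
    image=image,
    prompt="A football player kicking a ball",
    num_inference_steps=16,  # assumed low step count for a distilled model
    guidance_scale=1.0,      # CFG folded into the distilled weights
).frames[0]
```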
Basic Usage
```python
import torch
from diffusers import Kandinsky5I2VPipeline
from diffusers.utils import export_to_video, load_image

# Load the pipeline
pipe = Kandinsky5I2VPipeline.from_pretrained(
    "kandinskylab/Kandinsky-5.0-I2V-Lite-5s-Diffusers",
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")

# Load the conditioning image and resize it to the generation resolution
image = load_image(
    "https://frontofficesports.com/wp-content/uploads/2023/10/USATSI_19520555_168393969_lowres-scaled-e1697215176168.jpg?quality=100"
)
height = 512
width = 768
image = image.resize((width, height))

prompt = "A football player kicking a ball"
negative_prompt = "Static, 2D cartoon, cartoon, 2d animation, paintings, images, worst quality, low quality, ugly, deformed, walking backwards"

# Generate 121 frames (~5 s at 24 fps)
output = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=height,
    width=width,
    num_frames=121,
    num_inference_steps=50,
    guidance_scale=5.0,
).frames[0]

# Save the video
export_to_video(output, "output.mp4", fps=24, quality=9)
```
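If the full pipeline does not fit in GPU memory, the standard Diffusers offloading helpers should apply; a minimal sketch, assuming this pipeline exposes the usual `DiffusionPipeline` API:

```python
# Instead of pipe.to("cuda"), keep weights on the CPU and stream each
# submodule to the GPU only while it runs.
pipe.enable_model_cpu_offload()

# Decode the video latents in tiles to reduce peak VAE memory
# (supported by the HunyuanVideo VAE in Diffusers).
pipe.vae.enable_tiling()
```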
Architecture
- Latent diffusion pipeline with Flow Matching.
- Diffusion Transformer (DiT) as the main generative backbone, conditioned on text embeddings via cross-attention.
- Qwen2.5-VL and CLIP provide the text embeddings.
- HunyuanVideo 3D VAE encodes/decodes video into the latent space.
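To see how these components map onto the loaded pipeline, you can list its registered modules; `components` is a standard `DiffusionPipeline` attribute, though the exact module names this pipeline registers are an assumption to verify:

```python
# Print each registered submodule (transformer/DiT, text encoders,
# VAE, scheduler, ...). Names depend on how the pipeline registers them.
for name, module in pipe.components.items():
    print(f"{name}: {type(module).__name__}")
```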
Examples
Kandinsky 5.0 T2V Lite SFT
Citation

```bibtex
@misc{kandinsky2025,
  author = {Alexey Letunovskiy and Maria Kovaleva and Ivan Kirillov and Lev Novitskiy and Denis Koposov and
            Dmitrii Mikhailov and Anna Averchenkova and Andrey Shutkin and Julia Agafonova and Olga Kim and
            Anastasiia Kargapoltseva and Nikita Kiselev and Vladimir Arkhipkin and Vladimir Korviakov and
            Nikolai Gerasimenko and Denis Parkhomenko and Anna Dmitrienko and Anastasia Maltseva and
            Kirill Chernyshev and Ilia Vasiliev and Viacheslav Vasilev and Vladimir Polovnikov and
            Yury Kolabushin and Alexander Belykh and Mikhail Mamaev and Anastasia Aliaskina and
            Tatiana Nikulina and Polina Gavrilova and Denis Dimitrov},
  title = {Kandinsky 5.0: A family of diffusion models for Video & Image generation},
  howpublished = {\url{https://github.com/kandinskylab/Kandinsky-5}},
  year = {2025}
}
```

```bibtex
@misc{mikhailov2025nablanablaneighborhoodadaptiveblocklevel,
  title = {$\nabla$NABLA: Neighborhood Adaptive Block-Level Attention},
  author = {Dmitrii Mikhailov and Aleksey Letunovskiy and Maria Kovaleva and Vladimir Arkhipkin
            and Vladimir Korviakov and Vladimir Polovnikov and Viacheslav Vasilev
            and Evelina Sidorova and Denis Dimitrov},
  year = {2025},
  eprint = {2507.13546},
  archivePrefix = {arXiv},
  primaryClass = {cs.CV},
  url = {https://arxiv.org/abs/2507.13546},
}
```