Abstract
Helios is a 14 billion parameter autoregressive diffusion model for video generation that achieves real-time performance and high-quality long-video synthesis without conventional optimization techniques.
We introduce Helios, the first 14B video generation model that runs at 19.5 FPS on a single NVIDIA H100 GPU and supports minute-scale generation while matching the quality of a strong baseline. We make breakthroughs along three key dimensions: (1) robustness to long-video drifting without commonly used anti-drifting heuristics such as self-forcing, error-banks, or keyframe sampling; (2) real-time generation without standard acceleration techniques such as KV-cache, sparse/linear attention, or quantization; and (3) training without parallelism or sharding frameworks, enabling image-diffusion-scale batch sizes while fitting up to four 14B models within 80 GB of GPU memory. Specifically, Helios is a 14B autoregressive diffusion model with a unified input representation that natively supports T2V, I2V, and V2V tasks. To mitigate drifting in long-video generation, we characterize typical failure modes and propose simple yet effective training strategies that explicitly simulate drifting during training, while eliminating repetitive motion at its source. For efficiency, we heavily compress the historical and noisy context and reduce the number of sampling steps, yielding computational costs comparable to -- or lower than -- those of 1.3B video generative models. Moreover, we introduce infrastructure-level optimizations that accelerate both inference and training while reducing memory consumption. Extensive experiments demonstrate that Helios consistently outperforms prior methods on both short- and long-video generation. We plan to release the code, base model, and distilled model to support further development by the community.
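The abstract's idea of "explicitly simulating drifting during training" can be illustrated with a minimal sketch: corrupt the conditioning history during training so the model learns to denoise future frames given imperfect context, mimicking the errors that accumulate at inference time. The function name, the interpolation scheme, and the strength range below are all assumptions for illustration; the paper's actual training recipe is not specified in this abstract.

```python
import numpy as np

def drift_simulating_history(clean_history, rng, strength_range=(0.0, 0.5)):
    """Hypothetical sketch of drift simulation for autoregressive video training.

    clean_history: array of shape (B, T, C, H, W) holding ground-truth
    context frames. Instead of conditioning on pristine history (as naive
    teacher forcing would), we blend in noise with a per-example strength,
    so the model sees "drifted" context like it will at inference time.
    """
    b = clean_history.shape[0]
    # One corruption strength per batch element, broadcast over T, C, H, W.
    strength = rng.uniform(*strength_range, size=(b, 1, 1, 1, 1))
    noise = rng.standard_normal(clean_history.shape)
    # Linear interpolation toward noise: strength = 0 leaves history intact,
    # larger values emulate heavier accumulated generation error.
    return (1.0 - strength) * clean_history + strength * noise
```

With `strength_range=(0.0, 0.0)` the function reduces to plain teacher forcing, which makes the corruption an easily tunable knob on top of a standard training loop.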
Community
Code: https://github.com/PKU-YuanGroup/Helios
Page: https://pku-yuangroup.github.io/Helios-Page/
The following papers were recommended by the Semantic Scholar API:
- EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation (2026)
- S2DiT: Sandwich Diffusion Transformer for Mobile Streaming Video Generation (2026)
- Pathwise Test-Time Correction for Autoregressive Long Video Generation (2026)
- LoL: Longer than Longer, Scaling Video Generation to Hour (2026)
- VideoAR: Autoregressive Video Generation via Next-Frame & Scale Prediction (2026)
- Context Forcing: Consistent Autoregressive Video Generation with Long Context (2026)
- PackCache: A Training-Free Acceleration Method for Unified Autoregressive Video Generation via Compact KV-Cache (2026)
19.5 FPS on a single H100 for a 14B video generation model is pretty wild. Most video models are painfully slow, so real-time performance at this scale is a big deal, and the fact that they did it without the typical acceleration tricks (KV cache, quantization, etc.) makes it even more impressive. Handling long-video drift without anti-drifting heuristics seems tricky but important. Good analysis here: https://arxivexplained.com/helios-real-real-time-long-video-generation-model