Text2Sign: Lightweight Diffusion Model for Sign Language Video Generation

This repository contains the pretrained checkpoint and inference code for the Text2Sign model, a lightweight diffusion-based architecture for generating sign language videos from text prompts.

Model Overview

  • Architecture: 3D UNet backbone with DiT (Diffusion Transformer) blocks and a custom Transformer-based text encoder.
  • Dataset: Trained on How2Sign (ASL) video-text pairs.
  • Resolution: 64x64 RGB, 16 frames per clip.
  • Checkpoint: Provided at epoch 70.
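
The actual text encoding and sampling logic lives in inference.py and config.py. The sketch below only illustrates the clip tensor shape (batch, channels, 16 frames, 64x64) and a generic DDPM-style reverse loop; the model and text embedding are stand-ins, not the repository's 3D UNet, DiT blocks, or text encoder.

import torch

B, C, T, H, W = 1, 3, 16, 64, 64            # batch, channels, frames, height, width
num_steps = 1000
betas = torch.linspace(1e-4, 0.02, num_steps)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def model(x, t, text_emb):
    # Stand-in for the 3D UNet + DiT backbone: predicts the noise in x at step t.
    return torch.zeros_like(x)

text_emb = torch.zeros(B, 77, 512)           # hypothetical text-encoder output shape

x = torch.randn(B, C, T, H, W)               # start from pure noise
for t in reversed(range(num_steps)):
    eps = model(x, torch.tensor([t]), text_emb)
    alpha, alpha_bar = alphas[t], alpha_bars[t]
    mean = (x - (1.0 - alpha) / (1.0 - alpha_bar).sqrt() * eps) / alpha.sqrt()
    noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
    x = mean + betas[t].sqrt() * noise       # simple DDPM posterior step

video = x.clamp(-1, 1)                       # (1, 3, 16, 64, 64) generated clip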

Files

  • checkpoint_epoch_70.pt – Pretrained model weights
  • config.py – Model and generation configuration
  • inference.py – Example script for generating sign language videos from text
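
For a quick sanity check before running inference, the checkpoint can be inspected directly. The sketch below assumes a standard PyTorch save; whether the weights sit at the top level or under a nested key such as "model_state_dict" depends on how the training script stored them, so adjust accordingly.

import torch

ckpt = torch.load("checkpoint_epoch_70.pt", map_location="cpu")
# The nested key below is an assumption; use the layout from the training code.
state_dict = ckpt.get("model_state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))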

Usage

  1. Install dependencies:
    pip install torch torchvision pillow matplotlib
    
  2. Run the inference script:
    python inference.py --prompt "Hello world"
    
    This generates a video for the given prompt and saves a filmstrip image of the frames (one way to assemble such a filmstrip is sketched below).
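
The filmstrip is simply the generated frames laid out side by side. A minimal way to build one with Pillow (already in the dependency list) is sketched here, assuming the frames are available as a (16, 64, 64, 3) uint8 array; inference.py's own saving logic may differ.

import numpy as np
from PIL import Image

frames = np.zeros((16, 64, 64, 3), dtype=np.uint8)   # stand-in for generated frames

num_frames, height, width, _ = frames.shape
strip = Image.new("RGB", (width * num_frames, height))
for i, frame in enumerate(frames):
    strip.paste(Image.fromarray(frame), (i * width, 0))
strip.save("filmstrip.png")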

License

MIT
