rsdiff-sr-cascade-ep650

A T5-conditioned cascaded diffusion model for text-to-satellite-image generation at 256Γ—256, trained on RSICD.

  • FID 65.70 on the full RSICD test split (N=1093, Inception feature=2048, cascade-256, cond_scale=5.0).
  • CLIP-score 0.278 (OpenAI ViT-B/32), with a +0.046 lift over a shuffled-caption null baseline.

Code & full tech report: https://github.com/asebaq/rsdiff (REPORT.md).

Architecture

Two-stage Imagen-style cascade conditioned on a frozen T5-base text encoder.

Stage Params Resolution Conditioning
LR base UNet 27.18 M 128Γ—128 T5-base, p_uncond=0.1
SR UNet 92.66 M 128 β†’ 256 T5-base + LR image, p_uncond=0.1

Total β‰ˆ 120 M params. Sampler: DDPM, T=1000 denoising steps.

Files

File Size What it is
ckpt_sr_ep650_step89050.pt ~1.9 GB Merged cascade weights (LR base + SR)
samples/ ~2 MB 16 demo PNGs at 256Β² + 4Γ—4 grid
captions.txt 72 KB 1093 RSICD-test captions matching the demo and FID PNGs
fid_result.json β€” Headline FID (Inception feature=2048)
fid_result_f768.json β€” Cross-comparison FID (feature=768)
clip_result.json β€” OpenAI CLIP ViT-B/32 score + shuffled-baseline null

Usage

git clone https://github.com/asebaq/rsdiff
cd rsdiff
uv venv && source .venv/bin/activate
uv pip install -e ".[dev,eval]"

# pull the checkpoint
hf download asebaq/rsdiff-sr-cascade-ep650 ckpt_sr_ep650_step89050.pt -o legacy/DDPM/ckpts/

# sample 16 captions from the RSICD test split
python legacy/DDPM/sample_grid.py \
  --log_dir legacy/DDPM/logs/full_sr_gdm \
  --data_root data/RSICD_optimal \
  --ckpt legacy/DDPM/ckpts/ckpt_sr_ep650_step89050.pt \
  --n 16 --cols 4 --batch 2 --cond_scale 5.0 \
  --img_sz 128 --sr_sz 256 --ts 1000 \
  --sr --split test --seed 17

A diffusers-native sampling path is on the project roadmap; for now the bundled cascade runner (legacy/) loads this checkpoint directly.

Training data

RSICD β€” 10 921 paired satellite images and natural-language captions, official 8/1/1 train/val/test split (1093 test). At training time the first caption per image (sent1) is used as the conditioning text.

Intended use & limitations

Intended use. Research artefact for studying small-scale text-to-RS generation. Useful as a baseline for new remote-sensing diffusion work and as a starting point for downstream tasks (augmentation, change- detection priors).

Out of scope.

  • Operational or commercial remote-sensing imagery synthesis β€” visual fidelity is well below modern web-scale models.
  • Generating imagery intended to be mistaken for real satellite data.
  • Anything safety-critical (disaster response, surveillance, defence).

Known limitations.

  • Overfit drift past SR ep650. Validation FID climbs slightly after the bowl (see REPORT.md Β§4). No augmentation or weight decay; the train set is small (10 921 images).
  • Single-caption conditioning. RSICD provides 5 captions per image; this run uses only the first.
  • Pixel-space cascade. Slower at inference than a latent-diffusion port; a latent-space rewrite is on the project roadmap.
  • No memorisation probe. Partial training-set memorisation is not ruled out β€” pHash audit is on the roadmap.

License

Apache 2.0 β€” see LICENSE.

Citation

@article{sebaq2024rsdiff,
  title   = {RSDiff: remote sensing image generation from text using diffusion model},
  author  = {Sebaq, Ahmad and ElHelw, Mohamed},
  journal = {Neural Computing and Applications},
  volume  = {36},
  number  = {36},
  pages   = {23103--23111},
  year    = {2024},
  doi     = {10.1007/s00521-024-10363-3}
}
Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Dataset used to train asebaq/rsdiff-sr-cascade-ep650