---
license: apache-2.0
language:
- en
tags:
- vision-language
- vlm
- grpo
- earthmind
- geospatial
- remote-sensing
library_name: transformers
pipeline_tag: image-text-to-text
---

# EarthMind-R1

EarthMind-R1 is a vision-language model fine-tuned with GRPO (Group Relative Policy Optimization) for geospatial and remote-sensing image understanding tasks.
## Model Description

- **Base Model:** EarthMind-4B
- **Training Method:** GRPO (Group Relative Policy Optimization)
- **Training Data:** Geospatial instruction dataset
- **Fine-tuning:** LoRA adapters merged into the base weights

## Usage

### Quick Start

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer (trust_remote_code is required for the
# custom vision-language architecture and its chat() helper)
model_id = "aadex/Earthmind-R1"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load an image
image = Image.open("your_image.jpg").convert("RGB")

# Ask a question
question = "Describe what you see in this satellite image."

# Use the model's chat interface (provided by the remote code)
response = model.chat(
    tokenizer=tokenizer,
    question=question,
    images=[image],
    generation_config={
        "max_new_tokens": 512,
        "temperature": 0.7,
        "do_sample": True,
    },
)

print(response)
```

### Expected Output Format

The model is trained to produce structured responses:

```
<think>
[Reasoning about the image content]
</think>
<answer>
[Final answer to the question]
</answer>
```
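
Since the reasoning and the final answer are wrapped in tags, downstream code usually only needs the `<answer>` span. A minimal sketch of extracting it (the helper name and the fallback behavior are illustrative, not part of the model's API):

```python
import re

def extract_answer(response: str) -> str:
    """Return the text inside <answer>...</answer>, or the whole
    response if the tags are missing (e.g. a truncated generation)."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return match.group(1).strip() if match else response.strip()

response = (
    "<think>\nThe image shows regular rectangular plots.\n</think>\n"
    "<answer>\nAgricultural fields divided into rectangular plots.\n</answer>"
)
print(extract_answer(response))
# -> Agricultural fields divided into rectangular plots.
```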

## Requirements

```
torch>=2.0
transformers>=4.40
accelerate
pillow
```

## Hardware Requirements

- **Minimum:** 16 GB VRAM (with bfloat16)
- **Recommended:** 24 GB VRAM for comfortable inference
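
The minimum figure follows from a back-of-envelope estimate: a 4B-parameter model in bfloat16 needs 2 bytes per parameter for the weights alone, plus headroom for activations and the KV cache. A rough sketch (the 1.5x overhead factor is an assumption, not a measured value):

```python
# Rough VRAM estimate for bf16 inference (weights only, plus assumed overhead)
params = 4e9            # EarthMind-4B parameter count
bytes_per_param = 2     # bfloat16
weights_gb = params * bytes_per_param / 1e9
estimated_gb = weights_gb * 1.5  # assumed headroom for activations / KV cache

print(f"weights: {weights_gb:.0f} GB, with overhead: ~{estimated_gb:.0f} GB")
# -> weights: 8 GB, with overhead: ~12 GB
```

This lands comfortably under the 16 GB minimum, with the 24 GB recommendation covering longer generations and larger images.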
|
| | ## Training Details |
| |
|
| | - **Framework:** VLM-R1 + TRL |
| | - **Optimizer:** AdamW |
| | - **Learning Rate:** 1e-6 |
| | - **LoRA Configuration:** |
| | - r: 32 |
| | - alpha: 64 |
| | - dropout: 0.05 |
| | - **GRPO Settings:** |
| | - num_generations: 4 |
| | - num_iterations: 2 |
| | - beta: 0.01 |
| |
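
GRPO samples a small group of completions per prompt (`num_generations: 4` here) and normalizes each completion's reward against the group's mean and standard deviation, so no learned value model is needed. A minimal sketch of that group-relative advantage computation (the reward values are made up for illustration):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its group's mean and std,
    as in GRPO's group-relative baseline."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One prompt, num_generations=4 sampled completions with illustrative rewards
rewards = [1.0, 0.0, 1.0, 0.0]
print(group_relative_advantages(rewards))
```

Completions scoring above the group mean get positive advantages and are reinforced; those below get negative ones, while `beta` weights the KL penalty that keeps the policy close to the base model.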
|
| | ## Limitations |
| |
|
| | - Optimized for geospatial/remote sensing imagery |
| | - May not perform as well on general domain images |
| | - Response quality depends on image resolution and clarity |
| |
|
| | ## Citation |
| |
|
| | If you use this model, please cite: |
| |
|
| | ```bibtex |
| | @misc{earthmind-r1, |
| | title={EarthMind-R1: GRPO Fine-tuned Vision-Language Model for Geospatial Understanding}, |
| | author={Your Name}, |
| | year={2024}, |
| | publisher={HuggingFace} |
| | } |
| | ``` |

## License

Apache 2.0