---
license: apache-2.0
language:
- en
tags:
- vision-language
- vlm
- grpo
- earthmind
- geospatial
- remote-sensing
library_name: transformers
pipeline_tag: image-text-to-text
---

# EarthMind-R1

EarthMind-R1 is a vision-language model fine-tuned with GRPO (Group Relative Policy Optimization) for geospatial and remote-sensing image understanding tasks.
## Model Description

- **Base Model:** EarthMind-4B
- **Training Method:** GRPO (Group Relative Policy Optimization)
- **Training Data:** Geospatial instruction dataset
- **Fine-tuning:** LoRA adapters merged into the base weights

## Usage

### Quick Start

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer (trust_remote_code is required for the
# custom vision-language architecture and its chat() helper)
model_id = "aadex/Earthmind-R1"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load an image
image = Image.open("your_image.jpg").convert("RGB")

# Ask a question
question = "Describe what you see in this satellite image."

# Use the model's chat interface (provided by the remote code)
response = model.chat(
    tokenizer=tokenizer,
    question=question,
    images=[image],
    generation_config={
        "max_new_tokens": 512,
        "temperature": 0.7,
        "do_sample": True,
    },
)

print(response)
```

### Expected Output Format

The model is trained to produce structured responses:

```
<think>
[Reasoning about the image content]
</think>
<answer>
[Final answer to the question]
</answer>
```
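
Since the reasoning and the final answer are wrapped in tags, downstream code usually only needs the `<answer>` span. A minimal sketch of extracting it (the helper name and the fallback behavior are illustrative, not part of the model's API):

```python
import re

def extract_answer(response: str) -> str:
    """Return the text inside <answer>...</answer>, or the whole
    response if the tags are missing (e.g. a truncated generation)."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return match.group(1).strip() if match else response.strip()

response = (
    "<think>\nThe image shows regular rectangular plots.\n</think>\n"
    "<answer>\nAgricultural fields divided into rectangular plots.\n</answer>"
)
print(extract_answer(response))
# -> Agricultural fields divided into rectangular plots.
```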

## Requirements

```
torch>=2.0
transformers>=4.40
accelerate
pillow
```

## Hardware Requirements

- **Minimum:** 16 GB VRAM (with bfloat16)
- **Recommended:** 24 GB VRAM for comfortable inference
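
The minimum figure follows from a back-of-envelope estimate: a 4B-parameter model in bfloat16 needs 2 bytes per parameter for the weights alone, plus headroom for activations and the KV cache. A rough sketch (the 1.5x overhead factor is an assumption, not a measured value):

```python
# Rough VRAM estimate for bf16 inference (weights only, plus assumed overhead)
params = 4e9            # EarthMind-4B parameter count
bytes_per_param = 2     # bfloat16
weights_gb = params * bytes_per_param / 1e9
estimated_gb = weights_gb * 1.5  # assumed headroom for activations / KV cache

print(f"weights: {weights_gb:.0f} GB, with overhead: ~{estimated_gb:.0f} GB")
# -> weights: 8 GB, with overhead: ~12 GB
```

This lands comfortably under the 16 GB minimum, with the 24 GB recommendation covering longer generations and larger images.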
|
| | ## Training Details |
| |
|
| | - **Framework:** VLM-R1 + TRL |
| | - **Optimizer:** AdamW |
| | - **Learning Rate:** 1e-6 |
| | - **LoRA Configuration:** |
| | - r: 32 |
| | - alpha: 64 |
| | - dropout: 0.05 |
| | - **GRPO Settings:** |
| | - num_generations: 4 |
| | - num_iterations: 2 |
| | - beta: 0.01 |
| |
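
GRPO samples a small group of completions per prompt (`num_generations: 4` here) and normalizes each completion's reward against the group's mean and standard deviation, so no learned value model is needed. A minimal sketch of that group-relative advantage computation (the reward values are made up for illustration):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its group's mean and std,
    as in GRPO's group-relative baseline."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One prompt, num_generations=4 sampled completions with illustrative rewards
rewards = [1.0, 0.0, 1.0, 0.0]
print(group_relative_advantages(rewards))
```

Completions scoring above the group mean get positive advantages and are reinforced; those below get negative ones, while `beta` weights the KL penalty that keeps the policy close to the base model.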
|
| | ## Limitations |
| |
|
| | - Optimized for geospatial/remote sensing imagery |
| | - May not perform as well on general domain images |
| | - Response quality depends on image resolution and clarity |
| |
|
| | ## Citation |
| |
|
| | If you use this model, please cite: |
| |
|
| | ```bibtex |
| | @misc{earthmind-r1, |
| | title={EarthMind-R1: GRPO Fine-tuned Vision-Language Model for Geospatial Understanding}, |
| | author={Your Name}, |
| | year={2024}, |
| | publisher={HuggingFace} |
| | } |
| | ``` |

## License

Apache 2.0