# WARP.md

This file provides guidance to WARP (warp.dev) when working with code in this repository.

## Project Overview

SHARP (Single-image 3D Gaussian scene prediction) Gradio demo. Wraps Apple's SHARP model to predict 3D Gaussian scenes from single images, export `.ply` files, and optionally render camera trajectory videos. Optimized for local CUDA (4090/3090/3070 Ti) or HuggingFace Spaces GPU. Includes an MCP server for programmatic access.

## Development Commands

```bash
# Install dependencies (uses uv package manager)
uv sync

# Run the Gradio app (port 49200 by default)
uv run python app.py

# Run MCP server (stdio transport)
uv run python mcp_server.py

# Lint with ruff
uv run ruff check .
uv run ruff format .
```

## Codebase Map

```
ml-sharp/
├── app.py                    # Gradio UI (tabs: Run, Examples, About, Settings)
│   ├── build_demo()              # Main UI builder
│   ├── run_sharp()               # Inference entrypoint called by UI
│   └── discover_examples()       # Load precompiled examples
├── model_utils.py            # Core inference + rendering
│   ├── ModelWrapper              # Checkpoint loading, predictor caching
│   │   ├── predict_to_ply()          # Image → Gaussians → PLY
│   │   └── render_video()            # Gaussians → MP4 trajectory
│   ├── PredictionOutputs         # Dataclass for inference results
│   ├── configure_gpu_mode()      # Switch between local/Spaces GPU
│   └── predict_and_maybe_render_gpu  # Module-level entrypoint
├── hardware_config.py        # GPU hardware selection & persistence
│   ├── HardwareConfig            # Dataclass with mode, hardware, duration
│   ├── get_hardware_choices()    # Dropdown options
│   └── SPACES_HARDWARE_SPECS     # HF Spaces GPU specs & pricing
├── mcp_server.py             # MCP server for programmatic access
│   ├── sharp_predict             # Tool: image → PLY + video
│   ├── list_outputs              # Tool: list generated files
│   └── sharp://info              # Resource: GPU status, config
├── assets/examples/          # Precompiled example outputs
├── outputs/                  # Runtime outputs (PLY, MP4)
├── .hardware_config.json     # Persisted hardware settings
├── pyproject.toml            # Dependencies (uv)
└── WARP.md                   # This file
```

### Data Flow

```
Image → load_rgb() → predict_image() → Gaussians3D → save_ply() → PLY
                                           ↓
                                     render_video() → MP4
```

## Architecture

### Core Files

- `app.py` — Gradio UI with tabs for Run/Examples/About/Settings. Handles example discovery from `assets/examples/` via `manifest.json` or filename conventions.
- `model_utils.py` — SHARP model wrapper with checkpoint loading (HF Hub → CDN fallback), inference via `predict_to_ply()`, and CUDA video rendering via `render_video()`.
- `hardware_config.py` — GPU hardware selection between local CUDA and HuggingFace Spaces. Persists to `.hardware_config.json`.
- `mcp_server.py` — MCP server exposing the `sharp_predict` tool and the `sharp://info` resource.

### Key Patterns

**Local CUDA mode**: Model kept on GPU by default (`SHARP_KEEP_MODEL_ON_DEVICE=1`) for better performance on dedicated GPUs.

**Spaces GPU mode**: Uses the `@spaces.GPU` decorator for dynamic GPU allocation on HuggingFace Spaces. Configurable via the Settings tab.

**Checkpoint resolution order**:

1. `SHARP_CHECKPOINT_PATH` env var
2. HF Hub cache
3. HF Hub download
4. Upstream CDN via `torch.hub`

**Video rendering**: Requires CUDA (gsplat). Falls back gracefully on CPU-only systems by returning `None` for the video path.

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `SHARP_PORT` | `49200` | Gradio server port |
| `SHARP_MCP_PORT` | `49201` | MCP server port |
| `SHARP_CHECKPOINT_PATH` | — | Override local checkpoint path |
| `SHARP_HF_REPO_ID` | `apple/Sharp` | HuggingFace repo |
| `SHARP_HF_FILENAME` | `sharp_2572gikvuh.pt` | Checkpoint filename |
| `SHARP_KEEP_MODEL_ON_DEVICE` | `1` | Keep model on GPU (set `0` to free VRAM) |
| `CUDA_VISIBLE_DEVICES` | — | GPU selection (e.g., `0` or `0,1`) |
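For example, to serve on a different port and release VRAM between runs (the port value here is arbitrary):

```bash
# One-off override: serve on port 8080 and free VRAM after each prediction
SHARP_PORT=8080 SHARP_KEEP_MODEL_ON_DEVICE=0 uv run python app.py
```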
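A minimal sketch of the checkpoint resolution order from Key Patterns, driven by the environment variables above. Function names and the CDN URL are illustrative, not the actual `model_utils.py` API:

```python
import os
from pathlib import Path

def resolve_checkpoint() -> Path:
    """Illustrative sketch only; see model_utils.py for the real logic."""
    # 1. Explicit local override wins.
    override = os.environ.get("SHARP_CHECKPOINT_PATH")
    if override and Path(override).is_file():
        return Path(override)

    repo_id = os.environ.get("SHARP_HF_REPO_ID", "apple/Sharp")
    filename = os.environ.get("SHARP_HF_FILENAME", "sharp_2572gikvuh.pt")

    # 2 + 3. HF Hub: hf_hub_download() returns the cached file if present,
    # otherwise downloads it.
    try:
        from huggingface_hub import hf_hub_download
        return Path(hf_hub_download(repo_id=repo_id, filename=filename))
    except Exception:
        pass

    # 4. Last resort: upstream CDN via torch.hub (placeholder URL).
    import torch
    dst = Path.home() / ".cache" / "sharp" / filename
    dst.parent.mkdir(parents=True, exist_ok=True)
    torch.hub.download_url_to_file(f"https://cdn.example.com/{filename}", str(dst))
    return dst
```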
## Gradio API

The API is enabled by default. Access it at `http://localhost:49200/?view=api`.

### Endpoint: `/api/run_sharp`

```python
import requests

response = requests.post(
    "http://localhost:49200/api/run_sharp",
    json={
        "data": [
            "/path/to/image.jpg",  # image_path
            "rotate_forward",      # trajectory_type
            0,                     # output_long_side (0 = match input)
            60,                    # num_frames
            30,                    # fps
            True,                  # render_video
        ]
    },
)
result = response.json()["data"]
video_path, ply_path, status = result
```

## MCP Server

Run the MCP server for integration with AI agents:

```bash
uv run python mcp_server.py
```

### MCP Config (for clients like Warp)

```json
{
  "mcpServers": {
    "sharp": {
      "command": "uv",
      "args": ["run", "python", "mcp_server.py"],
      "cwd": "/home/robin/CascadeProjects/ml-sharp"
    }
  }
}
```

### Tools

- `sharp_predict(image_path, render_video=True, trajectory_type="rotate_forward", ...)` — Run inference
- `list_outputs()` — List generated PLY/MP4 files

### Resources

- `sharp://info` — GPU status, configuration
- `sharp://help` — Usage documentation

## Multi-GPU Configuration

Select a GPU via environment variable:

```bash
# Use GPU 0 (e.g., 4090)
CUDA_VISIBLE_DEVICES=0 uv run python app.py

# Use GPU 1 (e.g., 3090)
CUDA_VISIBLE_DEVICES=1 uv run python app.py
```

## HuggingFace Spaces GPU

The app supports HuggingFace Spaces paid GPUs for faster inference or larger models. Configure via the **Settings** tab.

### Available Hardware

| Hardware | VRAM | Price/hr | Best For |
|----------|------|----------|----------|
| ZeroGPU (H200) | 70GB | Free (PRO) | Demos, dynamic allocation |
| T4 small | 16GB | $0.40 | Light workloads |
| T4 medium | 16GB | $0.60 | Standard workloads |
| L4x1 | 24GB | $0.80 | Standard inference |
| L4x4 | 96GB | $3.80 | Multi-GPU |
| L40Sx1 | 48GB | $1.80 | Large models |
| L40Sx4 | 192GB | $8.30 | Very large models |
| A10G small | 24GB | $1.00 | Balanced |
| A10G large | 24GB | $1.50 | More CPU/RAM |
| A100 large | 80GB | $2.50 | Maximum VRAM |

### Deploying to Spaces

1. Push to a HuggingFace Space
2. Set hardware in the Space settings (or use `suggested_hardware` in README.md)
3. The app auto-detects the Spaces environment via the `SPACE_ID` env var

### README.md Metadata for Spaces

```yaml
---
title: SHARP - 3D Gaussian Scene Prediction
emoji: 🔪
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 6.2.0
python_version: 3.13.11
app_file: app.py
suggested_hardware: l4x1  # or zero-gpu, a100-large, etc.
startup_duration_timeout: 1h
preload_from_hub:
  - apple/Sharp sharp_2572gikvuh.pt
---
```

## Examples System

Place precompiled outputs in `assets/examples/`:

- `.{jpg,png,webp}` + `.mp4` + `.ply` files, matched by filename convention
- Or define `assets/examples/manifest.json` with `{label, image, video, ply}` entries
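A plausible `manifest.json` shape, assuming a top-level list with paths relative to `assets/examples/` — the exact schema is whatever `discover_examples()` in `app.py` parses; only the four keys come from the description above:

```json
[
  {
    "label": "Kitchen corner",
    "image": "kitchen.jpg",
    "video": "kitchen.mp4",
    "ply": "kitchen.ply"
  }
]
```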
## Multi-Image Stacking Roadmap

SHARP predicts 3D Gaussians from a single image. To "stack" multiple images into a unified scene:

### Required Components

1. **Pose Estimation** (`multi_view.py`)
   - Estimate relative camera poses between images
   - Options: COLMAP, hloc, or PnP-based
   - Transform each prediction to a common world frame
2. **Gaussian Merging** (`gaussian_merge.py`)
   - Concatenate Gaussian parameters (means, covariances, colors, opacities)
   - Deduplicate overlapping regions via density-based filtering
   - Optional: fine-tune the merged scene with photometric loss
3. **UI Changes**
   - Multi-upload widget
   - Alignment preview/validation
   - Progress indicator for multi-image processing

### Data Structures

```python
from dataclasses import dataclass
from pathlib import Path

import torch

@dataclass
class AlignedGaussians:
    gaussians: Gaussians3D         # SHARP's prediction type
    world_transform: torch.Tensor  # 4x4 SE(3)
    source_image: Path

def merge_gaussians(aligned: list[AlignedGaussians]) -> Gaussians3D:
    # 1. Transform each Gaussian's means by world_transform
    # 2. Concatenate all parameters
    # 3. Density-based pruning in overlapping regions
    ...
```

### Dependencies to Add

- `pycolmap` or `hloc` for pose estimation
- `open3d` for point cloud operations (optional)

### Implementation Phases

#### Phase 1: Basic Multi-Image Pipeline

- [ ] Add `multi_view.py` with `estimate_relative_pose(img1, img2)` using feature matching
- [ ] Add `gaussian_merge.py` with naive concatenation (no dedup)
- [ ] UI: Multi-file upload in a new "Stack" tab
- [ ] Export merged PLY

#### Phase 2: Pose Estimation Options

- [ ] Integrate COLMAP sparse reconstruction for >2 images
- [ ] Add hloc (Hierarchical Localization) as a lightweight alternative
- [ ] Fallback: manual pose input for known camera rigs

#### Phase 3: Gaussian Deduplication

- [ ] Implement KD-tree based nearest-neighbor pruning (see the sketch at the end of this file)
- [ ] Merge overlapping Gaussians by averaging parameters
- [ ] Add confidence weighting based on view angle

#### Phase 4: Refinement (Optional)

- [ ] Photometric loss optimization on the merged scene
- [ ] Iterative alignment refinement
- [ ] Support for depth priors from stereo/MVS

### API Design

```python
# multi_view.py
def estimate_poses(
    images: list[Path],
    method: Literal["colmap", "hloc", "pnp"] = "hloc",
) -> list[np.ndarray]:  # List of 4x4 world-to-camera transforms
    ...

# gaussian_merge.py
def merge_scenes(
    predictions: list[PredictionOutputs],
    poses: list[np.ndarray],
    deduplicate: bool = True,
    dedup_radius: float = 0.01,  # meters
) -> Gaussians3D:
    ...

# app.py (Stack tab)
def run_stack(
    images: list[str],  # Gradio multi-file upload
    pose_method: str,
    deduplicate: bool,
) -> tuple[str | None, str | None, str]:  # video, ply, status
    ...
```

### MCP Extension

```python
# mcp_server.py additions
@mcp.tool()
def sharp_stack(
    image_paths: list[str],
    pose_method: str = "hloc",
    deduplicate: bool = True,
    render_video: bool = True,
) -> dict:
    """Stack multiple images into a unified 3D Gaussian scene."""
    ...
```

### Technical Considerations

**Coordinate Systems**:
- SHARP outputs Gaussians in camera-centric coordinates
- Need to transform to the world frame using estimated poses (see the sketch below)
- Convention: Y-up, -Z forward (OpenGL style)

**Memory Management**:
- Each SHARP prediction uses ~50-200MB of GPU memory
- Batch processing with model unload between predictions
- Consider a streaming merge for >10 images

**Quality Metrics**:
- Reprojection error for pose validation
- Gaussian density histogram for coverage analysis
- Visual comparison with ground truth (if available)
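A minimal sketch of the camera-to-world transform from the Coordinate Systems point above, assuming means are stored as an `(N, 3)` tensor and covariances as `(N, 3, 3)` — the field names and layout are hypothetical; adapt to the real `Gaussians3D`:

```python
import torch

def transform_to_world(
    means: torch.Tensor,  # (N, 3) Gaussian centers, camera frame (hypothetical layout)
    covs: torch.Tensor,   # (N, 3, 3) covariances, camera frame
    T: torch.Tensor,      # (4, 4) SE(3) camera-to-world transform
) -> tuple[torch.Tensor, torch.Tensor]:
    R, t = T[:3, :3], T[:3, 3]
    world_means = means @ R.T + t
    # Covariances rotate as R Σ Rᵀ; translation does not affect them.
    world_covs = torch.einsum("ij,njk,lk->nil", R, covs, R)
    return world_means, world_covs
```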
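And a sketch of the Phase 3 KD-tree pruning using `scipy.spatial.cKDTree`. The radius mirrors the proposed `dedup_radius` default; a real merge would also combine colors/opacities rather than simply dropping points:

```python
import numpy as np
from scipy.spatial import cKDTree

def prune_overlaps(
    kept_means: np.ndarray,  # (M, 3) means already in the merged scene
    new_means: np.ndarray,   # (N, 3) means from the next prediction
    radius: float = 0.01,    # meters, cf. dedup_radius in merge_scenes()
) -> np.ndarray:
    """Boolean mask over new_means keeping only Gaussians with no
    existing neighbor within `radius`."""
    tree = cKDTree(kept_means)
    # Count neighbors within `radius` for each new point; keep isolated ones.
    counts = tree.query_ball_point(new_means, r=radius, return_length=True)
    return counts == 0
```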