Update README with more information about the project
README.md CHANGED
```diff
@@ -1,5 +1,5 @@
 ---
-title:
+title: Unit 8 Final Project - End-to-End AI Solution Implementation
 emoji: 🚀
 colorFrom: yellow
 colorTo: blue
@@ -7,7 +7,83 @@ sdk: gradio
 sdk_version: 6.0.1
 app_file: app.py
 pinned: false
-short_description:
+short_description: Multimodal image captioning & vibe evaluation.
 ---
 
-
```
# Assignment 8 – Multimodal Image Captioning & Vibe Evaluation

This Space implements a **multimodal AI web app** for my AI Solutions class.
The app compares two **image captioning models** on the same image, analyzes the emotional *“vibe”* of each caption, and evaluates model performance using **NLP metrics**.

The goal is to explore how **Vision-Language Models (VLMs)** and **text-based models (LLM-style components)** can work together in a single pipeline, and to provide a clear interface for testing and analysis.

---
## 🧠 What This App Does

Given an image and a user-provided *ground truth* caption, the app:

1. **Generates captions** with two image captioning models:
   - **Model 1:** BLIP image captioning
   - **Model 2:** ViT-GPT2 image captioning

2. **Detects the emotional “vibe”** of each caption using a **zero-shot text classifier** with labels such as:
   - Peaceful / Calm
   - Happy / Joy
   - Sad / Sorrow
   - Angry / Upset
   - Fear / Scared
   - Action / Violence

3. **Evaluates the captions** against the ground truth using NLP techniques:
   - **Semantic similarity** via `sentence-transformers` (cosine similarity)
   - **ROUGE-L** via the `evaluate` library (longest-common-subsequence word overlap)

4. **Displays all results** in a Gradio interface:
   - Captions for each model
   - Vibe labels + confidence scores
   - A summary block with similarity and ROUGE-L scores

This makes it easy to see not just *what* the models say, but also *how close* they are to a human caption and *how the wording affects the emotional tone*.
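In code, this flow maps naturally onto Hugging Face `transformers` pipelines. The sketch below is a minimal illustration under assumptions, not the exact contents of `app.py`: the checkpoint names (`Salesforce/blip-image-captioning-base`, `nlpconnect/vit-gpt2-image-captioning`, `facebook/bart-large-mnli`) are plausible choices for the BLIP, ViT-GPT2, and zero-shot components, and the helper function is illustrative.

```python
from transformers import pipeline
from PIL import Image

# Two image-captioning pipelines (the BLIP and ViT-GPT2 models described above).
blip_captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
vit_gpt2_captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

# Zero-shot classifier used to label the emotional "vibe" of a caption.
vibe_classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

VIBE_LABELS = [
    "peaceful / calm", "happy / joy", "sad / sorrow",
    "angry / upset", "fear / scared", "action / violence",
]

def caption_and_vibe(image: Image.Image) -> dict:
    """Generate one caption per model and classify each caption's vibe."""
    results = {}
    for name, captioner in [("BLIP", blip_captioner), ("ViT-GPT2", vit_gpt2_captioner)]:
        caption = captioner(image)[0]["generated_text"]
        vibe = vibe_classifier(caption, candidate_labels=VIBE_LABELS)
        results[name] = {
            "caption": caption,
            "vibe": vibe["labels"][0],            # highest-scoring label
            "confidence": round(vibe["scores"][0], 3),
        }
    return results
```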
---

## 🔍 Models & Libraries Used

- **Vision-Language Models (VLMs) for captioning**
  - BLIP image captioning model
  - ViT-GPT2 image captioning model

- **Text / NLP Components**
  - Zero-shot text classifier for vibe detection
  - `sentence-transformers/all-MiniLM-L6-v2` for semantic similarity
  - `evaluate` library for ROUGE-L

- **Framework / UI**
  - [Gradio](https://gradio.app/) for the web interface
  - Deployed as a **Hugging Face Space** (this repo)
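The evaluation step can be sketched in a few lines, assuming the components named above: `sentence-transformers/all-MiniLM-L6-v2` for embeddings and the `evaluate` library's ROUGE implementation. The function name and return format here are illustrative, not necessarily what `app.py` does.

```python
import evaluate
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
rouge = evaluate.load("rouge")

def score_caption(candidate: str, ground_truth: str) -> dict:
    """Compare a generated caption against the user-provided ground-truth caption."""
    # Semantic similarity: cosine similarity between sentence embeddings.
    embeddings = embedder.encode([candidate, ground_truth], convert_to_tensor=True)
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()

    # ROUGE-L: longest-common-subsequence overlap with the ground truth.
    rouge_scores = rouge.compute(predictions=[candidate], references=[ground_truth])

    return {
        "semantic_similarity": round(similarity, 3),
        "rouge_l": round(rouge_scores["rougeL"], 3),
    }
```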
---

## 🖼️ How to Use the App

1. **Upload an image**
   - Use one of the provided example images or upload your own.

2. **Enter a ground truth caption**
   - Type a short sentence that, in your own words, best describes the image.

3. **Click “Submit”**
   - The app will:
     - Run both captioning models
     - Classify the vibe of each caption
     - Compute similarity and ROUGE-L vs. your ground truth

4. **Review the outputs**
   - Compare how each model describes the scene
   - Check if the vibe matches what you expect
   - Look at the metrics to see which caption is closer to your description
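For orientation, here is a rough sketch of how these steps could be wired into a Gradio `Interface`; `analyze` is a hypothetical placeholder for the full captioning + vibe + metrics pipeline, and the actual `app.py` may use different components or layout.

```python
import gradio as gr

def analyze(image, ground_truth_caption):
    # Placeholder for the real pipeline: caption the image with both models,
    # classify each caption's vibe, and score them against the ground truth.
    return f"Results for ground truth: {ground_truth_caption!r}"

demo = gr.Interface(
    fn=analyze,
    inputs=[
        gr.Image(type="pil", label="Image"),
        gr.Textbox(label="Ground truth caption"),
    ],
    outputs=gr.Textbox(label="Captions, vibes, and metrics"),
    title="Multimodal Image Captioning & Vibe Evaluation",
)

if __name__ == "__main__":
    demo.launch()
```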
---