---
title: Unit 8 Final Project - End-to-End AI Solution Implementation
emoji: 🚀
colorFrom: yellow
colorTo: blue
sdk: gradio
sdk_version: 6.0.1
app_file: app.py
pinned: false
short_description: Multimodal image captioning & vibe evaluation.
---
# Assignment 8 – Multimodal Image Captioning & Vibe Evaluation
This Space implements a **multimodal AI web app** for my AI Solutions class.
The app compares two **image captioning models** on the same image, analyzes the emotional *“vibe”* of each caption, and evaluates model performance using **NLP metrics**.
The goal is to explore how **Vision-Language Models (VLMs)** and **text-based models (LLM-style components)** can work together in a single pipeline, and to provide a clear interface for testing and analysis.
---
## 🧠 What This App Does
Given an image and a user-provided *ground truth* caption, the app:
1. **Generates captions** with two image captioning models:
- **Model 1:** BLIP image captioning
- **Model 2:** ViT-GPT2 image captioning
2. **Detects the emotional “vibe”** of each caption using a **zero-shot text classifier** with labels such as:
- Peaceful / Calm
- Happy / Joy
- Sad / Sorrow
- Angry / Upset
- Fear / Scared
- Action / Violence
3. **Evaluates the captions** against the ground truth using NLP techniques:
- **Semantic similarity** via `sentence-transformers` (cosine similarity)
   - **ROUGE-L** via the `evaluate` library (longest-common-subsequence word overlap)
4. **Displays all results** in a Gradio interface:
- Captions for each model
- Vibe labels + confidence scores
- A summary block with similarity and ROUGE-L scores
This makes it easy to see not just *what* the models say, but also *how close* they are to a human caption and *how the wording affects the emotional tone*.
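Below is a minimal sketch of how the captioning and vibe-detection steps can be wired together with Hugging Face `transformers` pipelines. The checkpoint names (`Salesforce/blip-image-captioning-base`, `nlpconnect/vit-gpt2-image-captioning`, `facebook/bart-large-mnli`) are assumptions for illustration; the Space may use different ones.

```python
# Sketch only -- model IDs are assumed, not confirmed by this repo.
from transformers import pipeline

VIBE_LABELS = [
    "peaceful / calm", "happy / joy", "sad / sorrow",
    "angry / upset", "fear / scared", "action / violence",
]

# Two image-to-text captioners (assumed checkpoints).
blip_captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
vit_gpt2_captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

# Zero-shot text classifier for vibe detection (assumed checkpoint).
vibe_classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def caption_and_vibe(image_path: str) -> dict:
    """Generate both captions and classify the vibe of each one."""
    results = {}
    for name, captioner in [("BLIP", blip_captioner), ("ViT-GPT2", vit_gpt2_captioner)]:
        caption = captioner(image_path)[0]["generated_text"]
        vibe = vibe_classifier(caption, candidate_labels=VIBE_LABELS)
        results[name] = {
            "caption": caption,
            "vibe": vibe["labels"][0],           # top-scoring label
            "confidence": round(vibe["scores"][0], 3),
        }
    return results
```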
---
## 🔍 Models & Libraries Used
- **Vision-Language Models (VLMs) for captioning**
- BLIP image captioning model
- ViT-GPT2 image captioning model
- **Text / NLP Components**
- Zero-shot text classifier for vibe detection
- `sentence-transformers/all-MiniLM-L6-v2` for semantic similarity
- `evaluate` library for ROUGE-L
- **Framework / UI**
- [Gradio](https://gradio.app/) for the web interface
- Deployed as a **Hugging Face Space** (this repo)
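As a rough sketch of the two metrics listed above, the evaluation step could look like the following, assuming the `sentence-transformers` and `evaluate` packages:

```python
# Sketch of the evaluation metrics: cosine similarity + ROUGE-L.
from sentence_transformers import SentenceTransformer, util
import evaluate

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
rouge = evaluate.load("rouge")

def score_caption(candidate: str, ground_truth: str) -> dict:
    """Compare a generated caption against the user's ground-truth caption."""
    # Semantic similarity: cosine similarity of sentence embeddings.
    emb = embedder.encode([candidate, ground_truth], convert_to_tensor=True)
    similarity = util.cos_sim(emb[0], emb[1]).item()

    # ROUGE-L: longest-common-subsequence overlap with the reference.
    rouge_l = rouge.compute(predictions=[candidate], references=[ground_truth])["rougeL"]

    return {"semantic_similarity": round(similarity, 3), "rouge_l": round(rouge_l, 3)}
```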
---
## 🖼️ How to Use the App
1. **Upload an image**
- Use one of the provided example images or upload your own.
2. **Enter a ground truth caption**
   - Type a short sentence describing the image in your own words.
3. **Click “Submit”**
- The app will:
- Run both captioning models
- Classify the vibe of each caption
- Compute similarity and ROUGE-L vs. your ground truth
4. **Review the outputs**
- Compare how each model describes the scene
- Check if the vibe matches what you expect
- Look at the metrics to see which caption is closer to your description
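For orientation, here is a minimal sketch of how such a Gradio interface might be wired up. The function and component names are illustrative only and are not necessarily how `app.py` is structured.

```python
# Illustrative Gradio wiring -- not the Space's actual app.py.
import gradio as gr

def run_demo(image, ground_truth_caption):
    # In the real app this would call the captioning, vibe, and metric steps
    # sketched above and format the results for display.
    return "BLIP caption", "ViT-GPT2 caption", "vibe summary", "metric summary"

demo = gr.Interface(
    fn=run_demo,
    inputs=[
        gr.Image(type="filepath", label="Image"),
        gr.Textbox(label="Ground truth caption"),
    ],
    outputs=[
        gr.Textbox(label="Model 1 (BLIP) caption"),
        gr.Textbox(label="Model 2 (ViT-GPT2) caption"),
        gr.Textbox(label="Vibe labels + confidence"),
        gr.Textbox(label="Similarity & ROUGE-L summary"),
    ],
)

if __name__ == "__main__":
    demo.launch()
```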
---