---
title: Unit 8 Final Project - End-to-End AI Solution Implementation
emoji: 🚀
colorFrom: yellow
colorTo: blue
sdk: gradio
sdk_version: 6.0.1
app_file: app.py
pinned: false
short_description: Multimodal image captioning & vibe evaluation.
---

# Assignment 8 – Multimodal Image Captioning & Vibe Evaluation

This Space implements a **multimodal AI web app** for my AI Solutions class.  
The app compares two **image captioning models** on the same image, analyzes the emotional *“vibe”* of each caption, and evaluates model performance using **NLP metrics**.

The goal is to explore how **Vision-Language Models (VLMs)** and **text-only NLP components** (zero-shot classification and sentence embeddings) can work together in a single pipeline, and to provide a clear interface for testing and analysis.

---

## 🧠 What This App Does

Given an image and a user-provided *ground truth* caption, the app:

1. **Generates captions** with two image captioning models:
   - **Model 1:** BLIP image captioning  
   - **Model 2:** ViT-GPT2 image captioning  

2. **Detects the emotional “vibe”** of each caption using a **zero-shot text classifier** with labels such as:
   - Peaceful / Calm  
   - Happy / Joy  
   - Sad / Sorrow  
   - Angry / Upset  
   - Fear / Scared  
   - Action / Violence  

3. **Evaluates the captions** against the ground truth using NLP techniques:
   - **Semantic similarity** via `sentence-transformers` (cosine similarity)
   - **ROUGE-L** via the `evaluate` library (longest-common-subsequence word overlap)

4. **Displays all results** in a Gradio interface:
   - Captions for each model  
   - Vibe labels + confidence scores  
   - A summary block with similarity and ROUGE-L scores  

This makes it easy to see not just *what* the models say, but also *how close* they are to a human caption and *how the wording affects the emotional tone*.
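
The captioning and vibe-detection steps (1–2 above) can be sketched roughly as follows. This is not a copy of `app.py`: the README does not pin exact checkpoints, so the Hub model IDs below (`Salesforce/blip-image-captioning-base`, `nlpconnect/vit-gpt2-image-captioning`, and `facebook/bart-large-mnli` for zero-shot classification) are assumptions based on commonly used models, and `caption_and_vibe` is a name used here only for illustration.

```python
# Rough sketch of steps 1–2; checkpoint names are assumptions, not taken from app.py.
from transformers import pipeline

# Two VLM captioners
blip = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
vit_gpt2 = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

# Zero-shot classifier for the "vibe" labels listed above
vibe_classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
VIBE_LABELS = ["peaceful / calm", "happy / joy", "sad / sorrow",
               "angry / upset", "fear / scared", "action / violence"]

def caption_and_vibe(image):
    """Caption a PIL image with both models and classify each caption's vibe."""
    results = {}
    for name, captioner in [("BLIP", blip), ("ViT-GPT2", vit_gpt2)]:
        caption = captioner(image)[0]["generated_text"]
        vibe = vibe_classifier(caption, candidate_labels=VIBE_LABELS)
        results[name] = {
            "caption": caption,
            "vibe": vibe["labels"][0],           # top-scoring label
            "confidence": round(vibe["scores"][0], 3),
        }
    return results
```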

---

## 🔍 Models & Libraries Used

- **Vision-Language Models (VLMs) for captioning**
  - BLIP image captioning model  
  - ViT-GPT2 image captioning model  

- **Text / NLP Components**
  - Zero-shot text classifier for vibe detection  
  - `sentence-transformers/all-MiniLM-L6-v2` for semantic similarity  
  - `evaluate` library for ROUGE-L  

- **Framework / UI**
  - [Gradio](https://gradio.app/) for the web interface  
  - Deployed as a **Hugging Face Space** (this repo)
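
The evaluation step can be sketched with the two libraries named above: `sentence-transformers/all-MiniLM-L6-v2` for cosine similarity and the standard `rouge` metric from `evaluate` for ROUGE-L. This is a minimal sketch; the helper name `score_caption` is illustrative, not the app's actual function.

```python
# Metric sketch: cosine similarity + ROUGE-L against the user's ground-truth caption.
from sentence_transformers import SentenceTransformer, util
import evaluate

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
rouge = evaluate.load("rouge")  # requires the rouge_score package

def score_caption(candidate: str, ground_truth: str) -> dict:
    # Semantic similarity: cosine similarity between sentence embeddings
    emb = embedder.encode([candidate, ground_truth], convert_to_tensor=True)
    similarity = util.cos_sim(emb[0], emb[1]).item()

    # ROUGE-L: longest-common-subsequence overlap with the ground truth
    rouge_l = rouge.compute(predictions=[candidate], references=[ground_truth])["rougeL"]

    return {"semantic_similarity": round(similarity, 3), "rouge_l": round(rouge_l, 3)}
```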

---

## 🖼️ How to Use the App

1. **Upload an image**  
   - Use one of the provided example images or upload your own.

2. **Enter a ground truth caption**  
   - Type a short sentence that, in your own words, best describes the image.

3. **Click “Submit”**  
   - The app will:
     - Run both captioning models  
     - Classify the vibe of each caption  
     - Compute similarity and ROUGE-L vs. your ground truth  

4. **Review the outputs**
   - Compare how each model describes the scene  
   - Check if the vibe matches what you expect  
   - Look at the metrics to see which caption is closer to your description  
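
For reference, a stripped-down sketch of how this flow could be wired up in Gradio is shown below. The real `app.py` lays its outputs out differently (separate caption, vibe, and summary components); here everything is returned as a single JSON blob, and `analyze` simply chains the `caption_and_vibe` and `score_caption` sketches from the earlier sections.

```python
# Simplified UI wiring: image + ground-truth caption in, combined results out.
import gradio as gr

def analyze(image, ground_truth):
    results = caption_and_vibe(image)                                 # steps 1–2
    for entry in results.values():
        entry.update(score_caption(entry["caption"], ground_truth))   # step 3
    return results

demo = gr.Interface(
    fn=analyze,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Ground truth caption")],
    outputs=gr.JSON(label="Captions, vibes, and metrics"),
    title="Multimodal Image Captioning & Vibe Evaluation",
)

if __name__ == "__main__":
    demo.launch()
```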

---