jguimond committed
Commit 0fa10ed · verified · 1 Parent(s): 4dfff9e

Update README with more information about the project

Files changed (1):
  1. README.md +79 -3

README.md CHANGED

@@ -1,5 +1,5 @@
  ---
- title: Assignment 8 V3
  emoji: 🚀
  colorFrom: yellow
  colorTo: blue
@@ -7,7 +7,83 @@ sdk: gradio
  sdk_version: 6.0.1
  app_file: app.py
  pinned: false
- short_description: assignment_8 using both rouge and Zero-Shot Classifier
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

The updated README.md:

---
title: Unit 8 Final Project - End-to-End AI Solution Implementation
emoji: 🚀
colorFrom: yellow
colorTo: blue
sdk: gradio
sdk_version: 6.0.1
app_file: app.py
pinned: false
short_description: Multimodal image captioning & vibe evaluation.
---

# Assignment 8: Multimodal Image Captioning & Vibe Evaluation

This Space implements a **multimodal AI web app** for my AI Solutions class.
The app compares two **image captioning models** on the same image, analyzes the emotional *“vibe”* of each caption, and evaluates model performance using **NLP metrics**.

The goal is to explore how **Vision-Language Models (VLMs)** and **text-based models (LLM-style components)** can work together in a single pipeline, and to provide a clear interface for testing and analysis.

---

## 🧠 What This App Does

Given an image and a user-provided *ground truth* caption, the app:

1. **Generates captions** with two image captioning models (see the sketch after this section):
   - **Model 1:** BLIP image captioning
   - **Model 2:** ViT-GPT2 image captioning

2. **Detects the emotional “vibe”** of each caption using a **zero-shot text classifier** with labels such as:
   - Peaceful / Calm
   - Happy / Joy
   - Sad / Sorrow
   - Angry / Upset
   - Fear / Scared
   - Action / Violence

3. **Evaluates the captions** against the ground truth using NLP techniques:
   - **Semantic similarity** via `sentence-transformers` (cosine similarity)
   - **ROUGE-L** via the `evaluate` library (longest-common-subsequence word overlap)

4. **Displays all results** in a Gradio interface:
   - Captions for each model
   - Vibe labels + confidence scores
   - A summary block with similarity and ROUGE-L scores

This makes it easy to see not just *what* the models say, but also *how close* they are to a human caption and *how the wording affects the emotional tone*.
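
For context, the captioning step (item 1 above) can be sketched with two Hugging Face `transformers` image-to-text pipelines. This is a minimal illustration, not the Space's actual `app.py`; the exact checkpoint names (`Salesforce/blip-image-captioning-base`, `nlpconnect/vit-gpt2-image-captioning`) are assumptions, since the README only names BLIP and ViT-GPT2.

```python
# Minimal sketch of the captioning step (not the actual app.py).
# The checkpoint names below are assumptions; the README only says
# "BLIP" and "ViT-GPT2".
from PIL import Image
from transformers import pipeline

blip_captioner = pipeline(
    "image-to-text", model="Salesforce/blip-image-captioning-base"
)
vit_gpt2_captioner = pipeline(
    "image-to-text", model="nlpconnect/vit-gpt2-image-captioning"
)

def caption_both(image: Image.Image) -> dict[str, str]:
    """Run both captioning models on the same PIL image."""
    return {
        "BLIP": blip_captioner(image)[0]["generated_text"],
        "ViT-GPT2": vit_gpt2_captioner(image)[0]["generated_text"],
    }
```
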

---

## 🔍 Models & Libraries Used

- **Vision-Language Models (VLMs) for captioning**
  - BLIP image captioning model
  - ViT-GPT2 image captioning model

- **Text / NLP Components** (see the sketch after this list)
  - Zero-shot text classifier for vibe detection
  - `sentence-transformers/all-MiniLM-L6-v2` for semantic similarity
  - `evaluate` library for ROUGE-L

- **Framework / UI**
  - [Gradio](https://gradio.app/) for the web interface
  - Deployed as a **Hugging Face Space** (this repo)
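
As a rough illustration of how the text/NLP components above fit together, here is a sketch of vibe detection and caption scoring. The zero-shot checkpoint (`facebook/bart-large-mnli`) is an assumption; `sentence-transformers/all-MiniLM-L6-v2` and the `evaluate` ROUGE metric are the components listed above.

```python
# Illustrative sketch of vibe detection + caption evaluation.
# facebook/bart-large-mnli is an assumed zero-shot checkpoint; the
# embedding model and ROUGE metric are the ones listed above.
import evaluate
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

VIBE_LABELS = [
    "Peaceful / Calm", "Happy / Joy", "Sad / Sorrow",
    "Angry / Upset", "Fear / Scared", "Action / Violence",
]

vibe_classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
rouge = evaluate.load("rouge")

def detect_vibe(caption: str) -> tuple[str, float]:
    """Return the top vibe label and its confidence score."""
    result = vibe_classifier(caption, candidate_labels=VIBE_LABELS)
    return result["labels"][0], result["scores"][0]

def score_caption(caption: str, ground_truth: str) -> dict[str, float]:
    """Compare a generated caption to the user's ground-truth caption."""
    # Cosine similarity between sentence embeddings (semantic closeness).
    embeddings = embedder.encode([caption, ground_truth], convert_to_tensor=True)
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
    # ROUGE-L: longest-common-subsequence word overlap with the ground truth.
    rouge_l = rouge.compute(predictions=[caption], references=[ground_truth])["rougeL"]
    return {"similarity": similarity, "rougeL": rouge_l}
```
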

---

## 🖼️ How to Use the App

1. **Upload an image**
   - Use one of the provided example images or upload your own.

2. **Enter a ground truth caption**
   - Type a short sentence that, in your own words, best describes the image.

3. **Click “Submit”**
   - The app will:
     - Run both captioning models
     - Classify the vibe of each caption
     - Compute similarity and ROUGE-L vs. your ground truth

4. **Review the outputs**
   - Compare how each model describes the scene
   - Check if the vibe matches what you expect
   - Look at the metrics to see which caption is closer to your description
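
Tying the pieces together, a hypothetical minimal Gradio layout for this workflow might look like the sketch below, reusing the helper functions from the sketches above; the actual `app.py` in this Space may be organised quite differently.

```python
# Hypothetical minimal Gradio wiring for the workflow above, reusing
# caption_both / detect_vibe / score_caption from the earlier sketches.
# The real app.py may differ.
import gradio as gr

def analyze(image, ground_truth: str) -> str:
    rows = []
    for model_name, caption in caption_both(image).items():
        vibe, confidence = detect_vibe(caption)
        scores = score_caption(caption, ground_truth)
        rows.append(
            f"{model_name}: {caption}\n"
            f"  vibe: {vibe} ({confidence:.2f}) | "
            f"similarity: {scores['similarity']:.3f} | "
            f"ROUGE-L: {scores['rougeL']:.3f}"
        )
    return "\n\n".join(rows)

demo = gr.Interface(
    fn=analyze,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Ground truth caption")],
    outputs=gr.Textbox(label="Results"),
    title="Multimodal Image Captioning & Vibe Evaluation",
)

if __name__ == "__main__":
    demo.launch()
```
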

---