Commit a422c4e
Parent(s): 24a293d
commit the app
- README.md +83 -14
- amazon_movies_2023/gcl_embeddings.npz +3 -0
- amazon_movies_2023/title_embeddings.npz +3 -0
- amazon_movies_2023/title_embeddings_mapping.csv +3 -0
- app.py +531 -0
- ranking_agent.py +128 -0
- requirements.txt +14 -0
README.md
CHANGED
@@ -1,14 +1,83 @@
+# Movie Recommender System
+**Tag:** `agent-demo-track`
+A hybrid movie recommender system that combines collaborative filtering, language model embeddings, and graph convolutional networks to provide personalized movie recommendations.
+
+## Features
+
+- **Dual Embedding Types:**
+  - Pure Language Model (LLM) embeddings from Mistral AI
+  - Graph-enhanced embeddings (LLM + GCL) that combine language understanding with user interaction patterns
+- **Hybrid Input:**
+  - Select the movies you've enjoyed
+  - Describe what kind of movie you're looking for in natural language
+  - Adjust the weight (α) between your movie selections and text description
+- **Rich Results:**
+  - Get 10 personalized recommendations, each with a short explanation
+  - View similarity scores for each recommendation
+  - Search through a database of over 100,000 movies
+
+## Requirements
+
+1. Python 3.8+
+2. Virtual environment (recommended)
+3. Mistral AI API key (get one at https://console.mistral.ai/)
+
+Install the required packages:
+
+```bash
+pip install -r requirements.txt
+```
+
+## Environment Setup
+
+1. Create a `.env` file in the project root:
+   ```bash
+   MISTRAL_API_KEY=your_api_key_here
+   ```
+
+2. Ensure you have the necessary data files in the `amazon_movies_2023` directory:
+   - `title_embeddings.npz`: Movie title embeddings from Mistral AI
+   - `gcl_embeddings.npz`: Graph-enhanced embeddings
+   - `title_embeddings_mapping.csv`: Movie metadata mapping
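
A quick sanity check of the data files before launching might look like the sketch here. It assumes each `.npz` archive stores an `embeddings` array and an `item_ids` array, which are the keys `app.py` reads; the helper name `check_embedding_file` is hypothetical, and the demo writes a tiny stand-in archive instead of touching the real files.

```python
import os
import tempfile

import numpy as np


def check_embedding_file(path):
    """Load an .npz archive and verify it has aligned 'embeddings' and 'item_ids' arrays."""
    with np.load(path) as data:
        missing = {"embeddings", "item_ids"} - set(data.files)
        if missing:
            raise KeyError(f"{path} is missing arrays: {sorted(missing)}")
        embeddings = data["embeddings"]
        item_ids = data["item_ids"].astype(str)
    # Expect a 2-D matrix with exactly one row per item ID.
    if embeddings.ndim != 2 or len(item_ids) != embeddings.shape[0]:
        raise ValueError(f"{path}: expected one embedding row per item ID")
    return embeddings.shape, len(item_ids)


# Demo on a small stand-in archive (hypothetical data, not the real embeddings).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_embeddings.npz")
    np.savez(demo_path, embeddings=np.zeros((3, 4)), item_ids=np.array(["a", "b", "c"]))
    shape, n_ids = check_embedding_file(demo_path)
```

Pointing the same helper at `amazon_movies_2023/title_embeddings.npz` and `amazon_movies_2023/gcl_embeddings.npz` should surface a missing key or a row-count mismatch before the app starts.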
+
+## Usage
+
+1. Activate your virtual environment:
+   ```bash
+   source venv/bin/activate  # On Unix/macOS
+   ```
+
+2. Run the recommender app:
+   ```bash
+   python app.py
+   ```
+
+3. Open your browser to the local URL shown in the terminal (typically http://127.0.0.1:7860)
+
+## How It Works
+
+1. **Movie Selection:**
+   - Search and select the movies you've enjoyed
+   - The system uses these as a baseline for your taste
+
+2. **Text Preferences:**
+   - Describe what you're looking for (e.g., "A thrilling sci-fi movie with deep philosophical themes")
+   - Your description is converted to embeddings using Mistral AI
+
+3. **Preference Weighting:**
+   - Use the α slider to balance between your selected movies and text description
+   - α = 0: Only use movie history
+   - α = 1: Only use text description
+   - Values in between combine both signals
+
+4. **Embedding Types:**
+   - LLM: Pure language model embeddings for semantic understanding
+   - LLM + GCL: Graph-enhanced embeddings that also consider user interaction patterns
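
The preference weighting can be sketched roughly as follows. This is a minimal illustration, assuming the history and text embeddings share the same dimension; the helper names `blend_profile` and `top_k_similar` are hypothetical (they are not functions in `app.py`), and the toy 3-dimensional vectors stand in for real embeddings.

```python
import numpy as np


def blend_profile(e_h, e_u, alpha):
    """Combine history (e_h) and text (e_u) embeddings as alpha * e_u + (1 - alpha) * e_h."""
    if e_h is None and e_u is None:
        raise ValueError("need at least one of e_h or e_u")
    if e_h is None:
        return np.asarray(e_u, dtype=float)
    if e_u is None:
        return np.asarray(e_h, dtype=float)
    return alpha * np.asarray(e_u, dtype=float) + (1.0 - alpha) * np.asarray(e_h, dtype=float)


def top_k_similar(profile, item_embeddings, k):
    """Return (indices, scores) of the k items with the highest cosine similarity."""
    profile = profile / np.linalg.norm(profile)
    items = item_embeddings / np.linalg.norm(item_embeddings, axis=1, keepdims=True)
    scores = items @ profile
    order = np.argsort(scores)[::-1][:k]
    return order, scores[order]


# Toy catalogue of three items (hypothetical values).
items = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
e_h = np.mean(items[[0, 1]], axis=0)  # user selected items 0 and 1
e_u = np.array([0.0, 0.0, 1.0])       # text description resembles item 2

history_only = blend_profile(e_h, e_u, alpha=0.0)
text_only = blend_profile(e_h, e_u, alpha=1.0)
balanced = blend_profile(e_h, e_u, alpha=0.5)
indices, scores = top_k_similar(text_only, items, k=1)
```

With α = 1 the profile equals the text embedding, so the closest item is the one the description resembles; with α = 0 the profile is the mean of the selected movies' embeddings.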
+
+## Data Processing
+
+For information about the dataset processing pipeline, see [DATA_PROCESSING.md](DATA_PROCESSING.md)
+
+## Contributing
+
+Feel free to open issues or submit pull requests with improvements!
amazon_movies_2023/gcl_embeddings.npz
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:957c1b970c9d8371da883523871c956593e81205a499c6962696df545806f6d6
+size 580096202
amazon_movies_2023/title_embeddings.npz
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:1d134b5950985ee7009a30f370d7b3b281351893d4d440ec5131bc759cf219ab
+size 173284697
amazon_movies_2023/title_embeddings_mapping.csv
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:20e2e163e9591dcd7eaf13e72b7c0666e41c0734f303599113c161bb7c9f0bdc
+size 3386200
app.py
ADDED
@@ -0,0 +1,531 @@
1 |
+
import gradio as gr
|
2 |
+
import numpy as np
|
3 |
+
from sklearn.preprocessing import StandardScaler
|
4 |
+
import pandas as pd
|
5 |
+
import os
|
6 |
+
import zlib
|
7 |
+
from typing import Dict, List, Tuple, Optional, Literal
|
8 |
+
from langchain_mistralai import MistralAIEmbeddings
|
9 |
+
from langchain_core.embeddings import Embeddings
|
10 |
+
import os
|
11 |
+
from dotenv import load_dotenv
|
12 |
+
from ranking_agent import rank_with_ai
|
13 |
+
from scipy.sparse import load_npz
|
14 |
+
from rapidfuzz import process, fuzz
|
15 |
+
import re
|
16 |
+
from sklearn.metrics.pairwise import cosine_similarity
|
17 |
+
|
18 |
+
load_dotenv()
|
19 |
+
|
20 |
+
class MovieRecommender:
|
21 |
+
def __init__(self, data_dir: str = "amazon_movies_2023"):
|
22 |
+
self.data_dir = data_dir
|
23 |
+
self.embeddings = MistralAIEmbeddings(
|
24 |
+
model="mistral-embed",
|
25 |
+
mistral_api_key=os.getenv("MISTRAL_API_KEY")
|
26 |
+
)
|
27 |
+
# Load both types of embeddings
|
28 |
+
self.load_embeddings()
|
29 |
+
|
30 |
+
def load_embeddings(self) -> None:
|
31 |
+
# Load LLM embeddings
|
32 |
+
llm_embeddings_path = os.path.join(self.data_dir, "title_embeddings.npz")
|
33 |
+
try:
|
34 |
+
llm_data = np.load(llm_embeddings_path)
|
35 |
+
self.llm_embeddings = llm_data['embeddings']
|
36 |
+
self.llm_item_ids = llm_data['item_ids'].astype(str) # Ensure string type
|
37 |
+
print(f"Loaded LLM embeddings with shape: {self.llm_embeddings.shape}")
|
38 |
+
print(f"Number of LLM item IDs: {len(self.llm_item_ids)}")
|
39 |
+
except (IOError, zlib.error) as e:
|
40 |
+
raise RuntimeError(
|
41 |
+
f"Error loading LLM embeddings file: {str(e)}\n"
|
42 |
+
"The embeddings file appears to be corrupted or invalid."
|
43 |
+
)
|
44 |
+
|
45 |
+
# Load GCL embeddings
|
46 |
+
gcl_embeddings_path = os.path.join(self.data_dir, "gcl_embeddings.npz")
|
47 |
+
try:
|
48 |
+
gcl_data = np.load(gcl_embeddings_path)
|
49 |
+
self.gcl_embeddings = gcl_data['embeddings']
|
50 |
+
self.gcl_item_ids = gcl_data['item_ids'].astype(str) # Ensure string type
|
51 |
+
print(f"Loaded GCL embeddings with shape: {self.gcl_embeddings.shape}")
|
52 |
+
print(f"Number of GCL item IDs: {len(self.gcl_item_ids)}")
|
53 |
+
except (IOError, zlib.error) as e:
|
54 |
+
raise RuntimeError(
|
55 |
+
f"Error loading GCL embeddings file: {str(e)}\n"
|
56 |
+
"Please run gcl_embeddings.py first to generate GCL embeddings."
|
57 |
+
)
|
58 |
+
|
59 |
+
# Load movie mapping
|
60 |
+
mapping_path = os.path.join(self.data_dir, "title_embeddings_mapping.csv")
|
61 |
+
self.movies_df = pd.read_csv(mapping_path)
|
62 |
+
self.movies_df['item_id'] = self.movies_df['item_id'].astype(str) # Ensure string type
|
63 |
+
|
64 |
+
# Create standardized embeddings for both types
|
65 |
+
scaler = StandardScaler()
|
66 |
+
self.llm_embeddings = scaler.fit_transform(self.llm_embeddings)
|
67 |
+
self.gcl_embeddings = scaler.fit_transform(self.gcl_embeddings)
|
68 |
+
|
69 |
+
# Create item_id to index mappings for both types
|
70 |
+
self.llm_id_to_idx = {str(item_id): idx for idx, item_id in enumerate(self.llm_item_ids)}
|
71 |
+
self.gcl_id_to_idx = {str(item_id): idx for idx, item_id in enumerate(self.gcl_item_ids)}
|
72 |
+
|
73 |
+
# Create title to id mapping for search
|
74 |
+
self.title_to_id = dict(zip(self.movies_df['title'], self.movies_df['item_id']))
|
75 |
+
|
76 |
+
# Store all titles for search
|
77 |
+
self.all_titles = self.movies_df['title'].tolist()
|
78 |
+
|
79 |
+
print(f"Number of movies in mapping: {len(self.movies_df)}")
|
80 |
+
print(f"Number of titles with LLM embeddings: {len(set(self.llm_id_to_idx.keys()) & set(self.title_to_id.values()))}")
|
81 |
+
print(f"Number of titles with GCL embeddings: {len(set(self.gcl_id_to_idx.keys()) & set(self.title_to_id.values()))}")
|
82 |
+
|
83 |
+
# Pre-process titles for fuzzy matching
|
84 |
+
self.clean_titles = {self.clean_title_for_comparison(title): title for title in self.title_to_id.keys()}
|
85 |
+
|
86 |
+
def clean_title_for_comparison(self, title):
|
87 |
+
"""Clean title for comparison purposes"""
|
88 |
+
# Remove special characters and extra spaces
|
89 |
+
title = re.sub(r'[^\w\s]', '', str(title))
|
90 |
+
# Convert to lowercase and strip
|
91 |
+
return ' '.join(title.lower().split())
|
92 |
+
|
93 |
+
def search_movies(self, query: str) -> List[str]:
|
94 |
+
if not query:
|
95 |
+
return [] # Return empty if no query to avoid overwhelming UI
|
96 |
+
|
97 |
+
clean_query = self.clean_title_for_comparison(query)
|
98 |
+
# Use rapidfuzz to find matches across entire dataset
|
99 |
+
matches = process.extract(
|
100 |
+
clean_query,
|
101 |
+
self.clean_titles.keys(),
|
102 |
+
scorer=fuzz.WRatio, # WRatio works well for movie titles
|
103 |
+
limit=None, # No limit - show all matches
|
104 |
+
score_cutoff=60 # Only return matches with score >= 60
|
105 |
+
)
|
106 |
+
|
107 |
+
# Convert matches back to original titles
|
108 |
+
return [self.clean_titles[match[0]] for match in matches]
|
109 |
+
|
110 |
+
def get_text_embedding(self, text: str) -> np.ndarray:
|
111 |
+
"""Get embedding for text using LangChain Mistral embeddings"""
|
112 |
+
try:
|
113 |
+
embedding = self.embeddings.embed_query(text)
|
114 |
+
# Convert embedding to numpy array
|
115 |
+
embedding = np.array(embedding, dtype=np.float32)
|
116 |
+
# Normalize the embedding
|
117 |
+
if np.any(embedding): # Only normalize if not all zeros
|
118 |
+
embedding = embedding / np.linalg.norm(embedding)
|
119 |
+
return embedding
|
120 |
+
except Exception as e:
|
121 |
+
print(f"Error getting embedding from Mistral API: {str(e)}")
|
122 |
+
return None
|
123 |
+
|
124 |
+
def get_recommendations(self, selected_movies: List[str], embedding_type: str = "LLM + GCL", user_preferences: str = "", alpha: float = 0.5) -> str:
|
125 |
+
"""
|
126 |
+
Get recommendations using proper embedding aggregation:
|
127 |
+
- e_h: embedding from user history (selected movies)
|
128 |
+
- e_u: embedding from user preferences (text)
|
129 |
+
- Combined: alpha * e_u + (1-alpha) * e_h
|
130 |
+
"""
|
131 |
+
if not selected_movies and not user_preferences:
|
132 |
+
return "Please select some movies or provide preferences."
|
133 |
+
|
134 |
+
# Choose embeddings based on type
|
135 |
+
if embedding_type == "LLM + GCL":
|
136 |
+
embeddings = self.gcl_embeddings
|
137 |
+
id_to_idx = self.gcl_id_to_idx
|
138 |
+
else:
|
139 |
+
embeddings = self.llm_embeddings
|
140 |
+
id_to_idx = self.llm_id_to_idx
|
141 |
+
|
142 |
+
user_profile = None
|
143 |
+
|
144 |
+
# Get embedding from user history (e_h)
|
145 |
+
e_h = None
|
146 |
+
if selected_movies:
|
147 |
+
movie_ids = [self.title_to_id[title] for title in selected_movies if title in self.title_to_id]
|
148 |
+
if movie_ids:
|
149 |
+
selected_embeddings = []
|
150 |
+
for movie_id in movie_ids:
|
151 |
+
if movie_id in id_to_idx:
|
152 |
+
idx = id_to_idx[movie_id]
|
153 |
+
selected_embeddings.append(embeddings[idx])
|
154 |
+
|
155 |
+
if selected_embeddings:
|
156 |
+
e_h = np.mean(selected_embeddings, axis=0)
|
157 |
+
|
158 |
+
# Get embedding from user preferences (e_u)
|
159 |
+
e_u = None
|
160 |
+
if user_preferences.strip():
|
161 |
+
e_u = self.get_text_embedding(user_preferences)
|
162 |
+
|
163 |
+
# Apply aggregation algorithm
|
164 |
+
if e_h is not None and e_u is not None:
|
165 |
+
# Both available: alpha * e_u + (1-alpha) * e_h
|
166 |
+
user_profile = alpha * e_u + (1 - alpha) * e_h
|
167 |
+
print(f"Using combined embedding: α={alpha} (preferences weight)")
|
168 |
+
elif e_u is not None:
|
169 |
+
# Only preferences available
|
170 |
+
user_profile = e_u
|
171 |
+
print("Using preferences-only embedding")
|
172 |
+
elif e_h is not None:
|
173 |
+
# Only history available
|
174 |
+
user_profile = e_h
|
175 |
+
print("Using history-only embedding")
|
176 |
+
else:
|
177 |
+
return "Could not create user profile from provided input."
|
178 |
+
|
179 |
+
# Calculate similarity with all movies
|
180 |
+
# Normalize user profile and embeddings for proper cosine similarity
|
181 |
+
user_profile_norm = user_profile / np.linalg.norm(user_profile)
|
182 |
+
embeddings_norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
|
183 |
+
|
184 |
+
# Calculate cosine similarity (normalized dot product)
|
185 |
+
similarities = np.dot(embeddings_norm, user_profile_norm)
|
186 |
+
|
187 |
+
print(f"Similarity range: {similarities.min():.3f} to {similarities.max():.3f}")
|
188 |
+
|
189 |
+
# Get top 100 most similar movies
|
190 |
+
top_indices = np.argsort(similarities)[-100:][::-1]
|
191 |
+
|
192 |
+
# Filter out selected movies and create recommendations
|
193 |
+
seen_titles = set(selected_movies) if selected_movies else set()
|
194 |
+
seen_clean_titles = set(self.clean_title_for_comparison(title) for title in seen_titles)
|
195 |
+
final_recommendations = []
|
196 |
+
|
197 |
+
# Get reverse mapping for the chosen embedding type
|
198 |
+
if embedding_type == "LLM + GCL":
|
199 |
+
idx_to_id = {idx: item_id for item_id, idx in self.gcl_id_to_idx.items()}
|
200 |
+
else:
|
201 |
+
idx_to_id = {idx: item_id for item_id, idx in self.llm_id_to_idx.items()}
|
202 |
+
|
203 |
+
for idx in top_indices:
|
204 |
+
if idx not in idx_to_id:
|
205 |
+
continue
|
206 |
+
|
207 |
+
item_id = idx_to_id[idx]
|
208 |
+
|
209 |
+
# Find the title for this item_id
|
210 |
+
title = None
|
211 |
+
for t, id_ in self.title_to_id.items():
|
212 |
+
if id_ == item_id:
|
213 |
+
title = t
|
214 |
+
break
|
215 |
+
|
216 |
+
if not title:
|
217 |
+
continue
|
218 |
+
|
219 |
+
clean_title = self.clean_title_for_comparison(title)
|
220 |
+
|
221 |
+
# Skip if exact title is in seen titles
|
222 |
+
if title in seen_titles:
|
223 |
+
continue
|
224 |
+
|
225 |
+
# Skip if clean version of title is in seen titles
|
226 |
+
if clean_title in seen_clean_titles:
|
227 |
+
continue
|
228 |
+
|
229 |
+
# Skip collections/trilogies if user has seen any part
|
230 |
+
is_collection = False
|
231 |
+
for seen_title in seen_titles:
|
232 |
+
seen_clean = self.clean_title_for_comparison(seen_title)
|
233 |
+
if seen_clean in clean_title or clean_title in seen_clean:
|
234 |
+
if any(marker in title.lower() for marker in ['collection', 'trilogy', 'series', 'complete']):
|
235 |
+
is_collection = True
|
236 |
+
break
|
237 |
+
if is_collection:
|
238 |
+
continue
|
239 |
+
|
240 |
+
# Check if this is a duplicate of already recommended movie
|
241 |
+
is_duplicate = any(
|
242 |
+
fuzz.ratio(clean_title, self.clean_title_for_comparison(rec[0])) > 90
|
243 |
+
for rec in final_recommendations
|
244 |
+
)
|
245 |
+
if is_duplicate:
|
246 |
+
continue
|
247 |
+
|
248 |
+
# Add with similarity score
|
249 |
+
final_recommendations.append((title, similarities[idx]))
|
250 |
+
if len(final_recommendations) >= 100:
|
251 |
+
break
|
252 |
+
|
253 |
+
if not final_recommendations:
|
254 |
+
return "No recommendations found based on your input."
|
255 |
+
|
256 |
+
return final_recommendations[:100] # Return top 100 for ranking agent
|
257 |
+
|
258 |
+
def create_interface():
|
259 |
+
try:
|
260 |
+
recommender = MovieRecommender()
|
261 |
+
except Exception as e:
|
262 |
+
print(f"Error initializing recommender: {str(e)}")
|
263 |
+
return None
|
264 |
+
|
265 |
+
with gr.Blocks() as iface:
|
266 |
+
gr.Markdown(
|
267 |
+
"""
|
268 |
+
# Movie Recommender
|
269 |
+
Get personalized movie recommendations based on your taste and preferences!
|
270 |
+
|
271 |
+
**How to use:**
|
272 |
+
1. Search and select movies you've enjoyed (no limit!)
|
273 |
+
2. Describe what kind of movie you're looking for (optional)
|
274 |
+
3. Adjust the preference weight (α) to balance between your description and movie history
|
275 |
+
4. Get personalized recommendations
|
276 |
+
"""
|
277 |
+
)
|
278 |
+
|
279 |
+
selected_movies = gr.State([])
|
280 |
+
retrieval_results = gr.State([]) # Store retrieval results for ranking
|
281 |
+
|
282 |
+
with gr.Row():
|
283 |
+
with gr.Column():
|
284 |
+
# Movie search and selection
|
285 |
+
movie_search_input = gr.Textbox(
|
286 |
+
label="Search movies",
|
287 |
+
placeholder="Type to search...",
|
288 |
+
interactive=True,
|
289 |
+
every=True
|
290 |
+
)
|
291 |
+
|
292 |
+
# Show search results as a list of clickable buttons
|
293 |
+
search_results = gr.Radio(
|
294 |
+
choices=[],
|
295 |
+
label="Search Results",
|
296 |
+
interactive=True,
|
297 |
+
visible=True
|
298 |
+
)
|
299 |
+
|
300 |
+
# Display selected movies with functional red cross buttons
|
301 |
+
with gr.Column(elem_id="selected_movies_container") as selected_movies_container:
|
302 |
+
selected_display = gr.HTML(
|
303 |
+
label="Your Selected Movies",
|
304 |
+
value="<p><i>No movies selected yet</i></p>"
|
305 |
+
)
|
306 |
+
|
307 |
+
# Individual delete buttons (simpler approach)
|
308 |
+
delete_buttons = []
|
309 |
+
for i in range(20): # Support up to 20 movies
|
310 |
+
btn = gr.Button(f"× Remove Movie {i+1}", visible=False, size="sm", variant="secondary")
|
311 |
+
delete_buttons.append(btn)
|
312 |
+
|
313 |
+
# Clear all button
|
314 |
+
clear_btn = gr.Button("Clear All", size="sm", variant="secondary")
|
315 |
+
|
316 |
+
# User preferences text field
|
317 |
+
user_preferences = gr.Textbox(
|
318 |
+
label="Describe what kind of movie you're looking for",
|
319 |
+
placeholder="E.g., 'A thrilling sci-fi movie with deep philosophical themes'",
|
320 |
+
lines=3
|
321 |
+
)
|
322 |
+
|
323 |
+
# Alpha slider
|
324 |
+
alpha = gr.Slider(
|
325 |
+
minimum=0,
|
326 |
+
maximum=1,
|
327 |
+
value=0.5,
|
328 |
+
step=0.1,
|
329 |
+
label="Preference Weight (α)",
|
330 |
+
info="0: Use only movie history, 1: Use only your description"
|
331 |
+
)
|
332 |
+
|
333 |
+
# Embedding type selection (defaulting to GCL)
|
334 |
+
embedding_type = gr.Radio(
|
335 |
+
choices=["LLM + GCL", "LLM"],
|
336 |
+
value="LLM + GCL",
|
337 |
+
label="Embedding Type",
|
338 |
+
info="Choose between pure language model embeddings (LLM) or graph-enhanced embeddings (LLM + GCL)"
|
339 |
+
)
|
340 |
+
|
341 |
+
# Get recommendations button
|
342 |
+
recommend_btn = gr.Button("Get Recommendations", variant="primary")
|
343 |
+
|
344 |
+
with gr.Column():
|
345 |
+
# Display recommendations with streaming
|
346 |
+
recommendations = gr.Markdown(
|
347 |
+
label="Your Personalized Recommendations",
|
348 |
+
value="Recommendations will appear here"
|
349 |
+
)
|
350 |
+
|
351 |
+
def update_search_results(query):
|
352 |
+
"""Update search results based on input"""
|
353 |
+
if not query or len(query.strip()) < 2:
|
354 |
+
return gr.Radio(choices=[], visible=False)
|
355 |
+
|
356 |
+
matches = recommender.search_movies(query)
|
357 |
+
# Limit display to first 20 for UI performance
|
358 |
+
display_matches = matches[:20] if len(matches) > 20 else matches
|
359 |
+
|
360 |
+
if display_matches:
|
361 |
+
return gr.Radio(choices=display_matches, visible=True)
|
362 |
+
else:
|
363 |
+
return gr.Radio(choices=[], visible=False)
|
364 |
+
|
365 |
+
def format_selected_movies_display(movies):
|
366 |
+
"""Format selected movies with remove buttons on same line"""
|
367 |
+
if not movies:
|
368 |
+
return "<p><i>No movies selected yet</i></p>"
|
369 |
+
|
370 |
+
html_items = []
|
371 |
+
for i, movie in enumerate(movies):
|
372 |
+
html_items.append(f"""
|
373 |
+
<div style="display: flex; align-items: center; justify-content: space-between;
|
374 |
+
padding: 8px 12px; margin: 4px 0; background-color: #f8f9fa;
|
375 |
+
border-radius: 6px; border-left: 3px solid #007bff;">
|
376 |
+
<span style="flex-grow: 1; font-size: 14px; margin-right: 10px;">{i+1}. {movie}</span>
|
377 |
+
</div>
|
378 |
+
""")
|
379 |
+
|
380 |
+
return f"<div>{''.join(html_items)}</div>"
|
381 |
+
|
382 |
+
def update_delete_buttons_visibility(movies):
|
383 |
+
"""Update visibility and labels of delete buttons"""
|
384 |
+
button_updates = []
|
385 |
+
for i in range(20): # Support up to 20 movies
|
386 |
+
if i < len(movies):
|
387 |
+
movie_name = movies[i][:40] + ("..." if len(movies[i]) > 40 else "")
|
388 |
+
button_updates.append(gr.Button(f"🗑️ {movie_name}", visible=True, size="sm", variant="secondary"))
|
389 |
+
else:
|
390 |
+
button_updates.append(gr.Button(f"× Remove Movie {i+1}", visible=False, size="sm", variant="secondary"))
|
391 |
+
|
392 |
+
return button_updates
|
393 |
+
|
394 |
+
def delete_movie_by_index(index, current_movies):
|
395 |
+
"""Delete movie at specific index"""
|
396 |
+
if not current_movies or index >= len(current_movies):
|
397 |
+
return current_movies, format_selected_movies_display(current_movies)
|
398 |
+
|
399 |
+
current_movies.pop(index)
|
400 |
+
return current_movies, format_selected_movies_display(current_movies)
|
401 |
+
|
402 |
+
def handle_movie_selection(selected_movie, current_movies):
|
403 |
+
"""Handle movie selection from radio buttons"""
|
404 |
+
if not selected_movie:
|
405 |
+
return [current_movies, format_selected_movies_display(current_movies)] + update_delete_buttons_visibility(current_movies)
|
406 |
+
|
407 |
+
# Check if it's a movie title (exists in our database)
|
408 |
+
if selected_movie in recommender.title_to_id:
|
409 |
+
# It's a movie selection - add it to the list
|
410 |
+
current_movies = current_movies or []
|
411 |
+
# Remove the 5-movie limit - users can now select as many as they want
|
412 |
+
|
413 |
+
if selected_movie not in current_movies:
|
414 |
+
current_movies.append(selected_movie)
|
415 |
+
|
416 |
+
return [current_movies, format_selected_movies_display(current_movies)] + update_delete_buttons_visibility(current_movies)
|
417 |
+
else:
|
418 |
+
# Not a movie from database
|
419 |
+
return [current_movies, format_selected_movies_display(current_movies)] + update_delete_buttons_visibility(current_movies)
|
420 |
+
|
421 |
+
def clear_all_movies():
|
422 |
+
"""Clear all selected movies"""
|
423 |
+
empty_movies = []
|
424 |
+
return [empty_movies, "<p><i>No movies selected yet</i></p>"] + update_delete_buttons_visibility(empty_movies)
|
425 |
+
|
426 |
+
def get_recommendations(movies, emb_type, preferences, pref_weight):
|
427 |
+
"""Get recommendations: retrieval phase only, then delegate to ranking_agent with streaming"""
|
428 |
+
if not movies and not preferences:
|
429 |
+
yield "Please select some movies or provide preferences"
|
430 |
+
return
|
431 |
+
|
432 |
+
try:
|
433 |
+
# RETRIEVAL PHASE: Get top 100 candidates using proper embedding aggregation
|
434 |
+
print(f"\n=== RETRIEVAL PHASE ===")
|
435 |
+
print(f"Selected movies: {movies}")
|
436 |
+
print(f"User preferences: '{preferences}'")
|
437 |
+
print(f"Alpha weight: {pref_weight}")
|
438 |
+
print(f"Embedding type: {emb_type}")
|
439 |
+
|
440 |
+
yield "🔍 Searching for similar movies..."
|
441 |
+
|
442 |
+
recommendations = recommender.get_recommendations(
|
443 |
+
selected_movies=movies,
|
444 |
+
embedding_type=emb_type,
|
445 |
+
user_preferences=preferences,
|
446 |
+
alpha=pref_weight
|
447 |
+
)
|
448 |
+
|
449 |
+
# Handle error cases
|
450 |
+
if isinstance(recommendations, str):
|
451 |
+
yield recommendations
|
452 |
+
return
|
453 |
+
|
454 |
+
# Print retrieval results
|
455 |
+
print(f"\nRETRIEVAL RESULTS: Found {len(recommendations)} candidates")
|
456 |
+
print("Top 100 from retrieval phase:")
|
457 |
+
for i, (title, score) in enumerate(recommendations[:100], 1):
|
458 |
+
print(f" {i:2d}. {title} (score: {score:.3f})")
|
459 |
+
|
460 |
+
# RERANKING + EXPLANATION PHASE: Delegate to ranking_agent with streaming
|
461 |
+
print(f"\n=== RERANKING PHASE ===")
|
462 |
+
print(f"Calling rank_with_ai with:")
|
463 |
+
print(f" - {len(recommendations)} recommendations")
|
464 |
+
print(f" - preferences: '{preferences}'")
|
465 |
+
print(f" - alpha: {pref_weight}")
|
466 |
+
print(f" - user_movies: {movies}")
|
467 |
+
|
468 |
+
yield "🤖 AI is ranking and explaining your recommendations..."
|
469 |
+
|
470 |
+
# Stream the responses from ranking agent
|
471 |
+
for partial_result in rank_with_ai(
|
472 |
+
recommendations=recommendations,
|
473 |
+
user_preferences=preferences,
|
474 |
+
alpha=pref_weight,
|
475 |
+
user_movies=movies
|
476 |
+
):
|
477 |
+
yield partial_result
|
478 |
+
|
479 |
+
except Exception as e:
|
480 |
+
print(f"ERROR in get_recommendations: {str(e)}")
|
481 |
+
import traceback
|
482 |
+
traceback.print_exc()
|
483 |
+
yield f"Error getting recommendations: {str(e)}"
|
484 |
+
|
485 |
+
# Event handlers
|
486 |
+
movie_search_input.change(
|
487 |
+
fn=update_search_results,
|
488 |
+
inputs=movie_search_input,
|
489 |
+
outputs=search_results
|
490 |
+
)
|
491 |
+
|
492 |
+
search_results.change(
|
493 |
+
fn=handle_movie_selection,
|
494 |
+
inputs=[search_results, selected_movies],
|
495 |
+
outputs=[selected_movies, selected_display] + delete_buttons
|
496 |
+
)
|
497 |
+
|
498 |
+
# Add individual delete button handlers
|
499 |
+
for i, btn in enumerate(delete_buttons):
|
500 |
+
def make_delete_handler(btn_idx):
|
501 |
+
def delete_handler(current_movies):
|
502 |
+
updated_movies, updated_display = delete_movie_by_index(btn_idx, current_movies)
|
503 |
+
return [updated_movies, updated_display] + update_delete_buttons_visibility(updated_movies)
|
504 |
+
return delete_handler
|
505 |
+
|
506 |
+
btn.click(
|
507 |
+
fn=make_delete_handler(i),
|
508 |
+
inputs=[selected_movies],
|
509 |
+
outputs=[selected_movies, selected_display] + delete_buttons
|
510 |
+
)
|
511 |
+
|
512 |
+
clear_btn.click(
|
513 |
+
fn=clear_all_movies,
|
514 |
+
inputs=[],
|
515 |
+
outputs=[selected_movies, selected_display] + delete_buttons
|
516 |
+
)
|
517 |
+
|
518 |
+
recommend_btn.click(
|
519 |
+
fn=get_recommendations,
|
520 |
+
inputs=[selected_movies, embedding_type, user_preferences, alpha],
|
521 |
+
outputs=recommendations
|
522 |
+
)
|
523 |
+
|
524 |
+
return iface
|
525 |
+
|
526 |
+
if __name__ == "__main__":
|
527 |
+
iface = create_interface()
|
528 |
+
if iface is not None:
|
529 |
+
iface.launch()
|
530 |
+
else:
|
531 |
+
print("\nPlease fix the issues above and try again.")
|
ranking_agent.py
ADDED
@@ -0,0 +1,128 @@
1 |
+
from typing import List, Tuple, Dict
|
2 |
+
from langchain_core.prompts import ChatPromptTemplate
|
3 |
+
from langchain_mistralai.chat_models import ChatMistralAI
|
4 |
+
import os
|
5 |
+
from dotenv import load_dotenv
|
6 |
+
|
7 |
+
load_dotenv()
|
8 |
+
|
9 |
+
def create_ranking_chain():
|
10 |
+
"""Create a ranking chain using new RunnableSequence format"""
|
11 |
+
    prompt = ChatPromptTemplate.from_messages([
        ("system", """You are a movie recommendation expert. Your task is to select the top 10 most relevant movies from a list of recommended movies and provide the final formatted output with brief explanations.

Rules:
1. Always return exactly 10 movies
2. Consider both relevance scores and how well each movie matches user preferences
3. Pay attention to the alpha weighting parameter - it tells you how much to prioritize text preferences vs viewing history
4. Return only movies from the provided list
5. NEVER recommend movies that are already in the user's viewing history - these should be completely excluded
6. Format each movie exactly as: **1. Movie Title**\n[Exactly 2 sentences explaining why this movie matches their taste]\n\n
7. Number from 1 to 10, no additional text before or after"""),
        ("user", """Given these movie recommendations with their relevance scores:
{movie_scores}

User preferences: {preferences}

User's viewing history (DO NOT RECOMMEND ANY OF THESE): {user_movies}

Alpha weighting: {alpha}
(α=0.0 means recommendations were based entirely on viewing history, α=1.0 means entirely on text preferences, α=0.5 means equal balance)

Select the 10 most relevant movies and provide the final formatted output with explanations. Format each as:
**1. Movie Title**
[Exactly 2 sentences explaining why this movie matches their taste based on the weighted combination of their preferences and history]

**2. Movie Title**
[Exactly 2 sentences explaining why this movie matches their taste based on the weighted combination of their preferences and history]

...continue for all 10 movies.

Remember: NEVER include any movie from the user's viewing history in your recommendations.""")
    ])

    model = ChatMistralAI(
        mistral_api_key=os.environ["MISTRAL_API_KEY"],
        model="mistral-large-latest",
        temperature=0.5,
        max_tokens=1200,
        streaming=True
    )

    return prompt | model
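
The prompt asks the model never to return a title from the user's viewing history, but that rule is enforced only by instruction. A deterministic pre-filter on the candidate list could back it up before the candidates reach the model. The following is a minimal sketch under stated assumptions: `filter_seen` is a hypothetical helper that is not part of this commit, and it assumes a case-insensitive comparison on trimmed titles is an adequate match, which may not hold if the catalogue stores years or editions in the title.

```python
from typing import List, Tuple


def filter_seen(
    recommendations: List[Tuple[str, float]],
    user_movies: List[str],
) -> List[Tuple[str, float]]:
    """Drop candidates whose title matches a movie in the viewing history.

    Hypothetical helper, not part of this commit. Matching is a simple
    case-insensitive comparison on whitespace-trimmed titles.
    """
    seen = {title.strip().casefold() for title in user_movies}
    return [
        (title, score)
        for title, score in recommendations
        if title.strip().casefold() not in seen
    ]


# Illustrative data only; titles and scores are made up.
candidates = [("Alien", 0.91), ("Heat", 0.88), ("alien ", 0.80), ("Ronin", 0.75)]
print(filter_seen(candidates, ["Alien"]))
```

Filtering before the prompt is built would also keep excluded titles from using up slots among the top 100 candidates.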


def rank_with_ai(recommendations: List[Tuple[str, float]], user_preferences: str = "", alpha: float = 0.5, user_movies: List[str] = None):
    """
    Complete reranking and explanation pipeline with streaming:
    1. Takes top 100 candidates from retrieval phase
    2. Reranks to top 10 using AI
    3. Generates explanations with streaming
    4. Yields the accumulated formatted response after each streamed chunk

    Args:
        recommendations: List of (movie_title, relevance_score) tuples from retrieval phase
        user_preferences: User's textual preferences/description
        alpha: Weighting parameter (0.0 = only history matters, 1.0 = only preferences matter)
        user_movies: List of user's selected movies for context
    """
    print("\n=== RANKING_AGENT DEBUG ===")
    print(f"Received {len(recommendations) if recommendations else 0} recommendations")
    print(f"User preferences: '{user_preferences}' (length: {len(user_preferences) if user_preferences else 0})")
    print(f"Alpha: {alpha}")
    print(f"User movies: {user_movies}")

    if not recommendations:
        yield "No recommendations available."
        return

    # Take only top 100 recommendations if more are provided
    recommendations = recommendations[:100]

    try:
        # Format movie scores for ranking
        movie_scores = "\n".join(
            f"{title} (relevance: {score:.3f})"
            for title, score in recommendations
        )

        # Start with header
        result_header = "## 🎬 Your Personalized Movie Recommendations\n\n"

        if user_movies and user_preferences:
            # round() rather than int(): float error can leave a product just
            # under a whole number, and truncation would understate the percentage.
            history_pct = round((1 - alpha) * 100)
            preferences_pct = round(alpha * 100)
            result_header += f"*Based on α={alpha} weighting: {history_pct}% your viewing history + {preferences_pct}% your preferences*\n\n"
        elif user_preferences:
            result_header += f"*Based entirely on your preferences: \"{user_preferences}\"*\n\n"
        elif user_movies:
            result_header += "*Based entirely on your viewing history*\n\n"

        result_header += "---\n\n"
        yield result_header

        # Single chain that does both ranking and explanation
        ranking_chain = create_ranking_chain()
        print("Calling unified ranking + explanation chain...")

        # Stream the response directly
        accumulated_text = result_header
        for chunk in ranking_chain.stream({
            "movie_scores": movie_scores,
            "preferences": user_preferences if user_preferences else "No specific preferences provided",
            "user_movies": ", ".join(user_movies) if user_movies else "None",
            "alpha": alpha
        }):
            if chunk.content:
                accumulated_text += chunk.content
                yield accumulated_text

    except Exception as e:
        print(f"ERROR in rank_with_ai: {str(e)}")
        import traceback
        traceback.print_exc()
        # Fallback to simple format
        result = "## 🎬 Your Recommendations\n\n"
        for i, (title, score) in enumerate(recommendations[:10], 1):
            result += f"**{i}. {title}**\n"
            result += f"*Similarity: {score:.3f}*\n\n"
        yield result
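
`rank_with_ai` is a generator that yields the whole text accumulated so far on every step, not just the newest chunk, so a caller should replace its display with each value instead of appending. Gradio, which is listed in requirements.txt, accepts generator functions as event handlers and replaces the output with each yielded value, which is presumably why the function is written this way. The sketch below illustrates the consumption pattern without calling the Mistral API. `fake_rank_stream` and `last_snapshot` are hypothetical names used only for illustration.

```python
from typing import Iterator


def fake_rank_stream() -> Iterator[str]:
    """Stand-in for rank_with_ai: yields the full text so far on every step."""
    text = ""
    for piece in ["## Header\n\n", "**1. Movie A**\n", "Two sentences.\n"]:
        text += piece
        yield text


def last_snapshot(stream: Iterator[str]) -> str:
    """Keep only the most recent snapshot; each one supersedes the previous."""
    latest = ""
    for snapshot in stream:
        latest = snapshot  # overwrite, do not append
    return latest


final_text = last_snapshot(fake_rank_stream())
print(final_text)
```

Appending the yielded values instead would duplicate the header and every earlier line once per chunk.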
requirements.txt
ADDED
@@ -0,0 +1,14 @@
flask==2.0.1
numpy>=1.21.0
pandas>=1.3.0
scipy>=1.7.1
rapidfuzz>=3.0.0
requests>=2.31.0
tqdm>=4.66.1
scikit-learn>=1.0.0
datasets>=2.17.0
python-dotenv>=1.0.1
langchain>=0.1.9
langchain-core>=0.1.27
langchain-mistralai>=0.0.5
gradio>=4.19.2