yonnel committed
Commit b1c879a · 1 Parent(s): 66fef64

Add automatic data generation on startup for Hugging Face deployment
- Add start.py script that builds the index if data files don't exist
- Update Dockerfile to use the startup script, with a longer health check timeout
- Update README_HF.md with deployment instructions
- First startup will take 3-5 minutes to build the movie index automatically
- Dockerfile +8 -4
- README_HF.md +58 -0
- app/start.py +81 -0
Dockerfile
CHANGED
@@ -4,6 +4,7 @@ FROM python:3.9-slim
 RUN apt-get update && apt-get install -y \
     gcc \
     g++ \
+    curl \
     && rm -rf /var/lib/apt/lists/*
 
 # Set working directory
@@ -21,12 +22,15 @@ COPY app/ ./app/
 # Create data directory
 RUN mkdir -p app/data
 
+# Make start script executable
+RUN chmod +x app/start.py
+
 # Expose port
 EXPOSE 7860
 
-# Health check
-HEALTHCHECK --interval=30s --timeout=
+# Health check with longer timeout for initial build
+HEALTHCHECK --interval=30s --timeout=30s --start-period=300s --retries=3 \
     CMD curl -f http://localhost:7860/health || exit 1
 
-# Run the
-CMD ["
+# Run the startup script that builds index if needed, then starts the API
+CMD ["python", "app/start.py"]
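The new HEALTHCHECK only passes once the API answers on `/health`, which is why `--start-period=300s` gives the first-boot index build room to finish before failed checks count against the container. As a minimal sketch of the endpoint that `curl -f` hits, assuming it lives in `app/main.py` (that file is not touched by this commit, so this is illustrative rather than the project's actual handler):

```python
# Illustrative /health handler; app/main.py is not shown in this commit,
# so treat the route body as an assumption.
from fastapi import FastAPI

app = FastAPI()

@app.get("/health")
def health():
    # Any 200 response makes `curl -f http://localhost:7860/health` exit 0,
    # which is all the Docker HEALTHCHECK needs to mark the container healthy.
    return {"status": "ok"}
```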
README_HF.md
CHANGED
@@ -33,6 +33,64 @@ curl -X POST "https://yonnel-karl-movie-vector-backend.hf.space/explore" \
   }'
 ```
 
+# Karl Movie Vector Backend - Hugging Face Deployment
+
+This FastAPI application provides movie recommendations using vector similarity search.
+
+## Automatic Setup
+
+This application automatically builds its movie index on first startup. The process includes:
+
+1. **Data Collection**: Fetches movie data from the TMDB API
+2. **Embedding Generation**: Creates vector embeddings using the OpenAI API
+3. **Index Building**: Builds a FAISS index for fast similarity search
+4. **API Startup**: Launches the FastAPI service
+
+**First startup may take 3-5 minutes** to build the index.
+
+## Required Environment Variables
+
+Configure these in your Hugging Face Space settings:
+
+### Essential APIs
+- `OPENAI_API_KEY`: Your OpenAI API key for generating embeddings
+- `TMDB_API_KEY`: Your TMDB API key for fetching movie data
+
+### Optional Configuration
+- `API_TOKEN`: Token for API authentication (optional)
+- `LOG_LEVEL`: Logging level (default: INFO)
+
+## API Endpoints
+
+- `GET /health` - Health check
+- `POST /search` - Search for similar movies
+- `GET /movie/{movie_id}` - Get movie details
+
+## Technical Details
+
+- **Framework**: FastAPI
+- **Vector Search**: FAISS
+- **Embeddings**: OpenAI text-embedding-3-small
+- **Movie Data**: TMDB (The Movie Database)
+- **Container**: Docker
+
+## Rebuilding the Index
+
+To rebuild the movie index (e.g., to pick up newer movies):
+1. Delete the Space's persistent storage
+2. Restart the Space
+3. The index rebuilds automatically on startup
+
+## Data Files Generated
+
+The application creates these files on startup:
+- `app/data/faiss.index` - FAISS vector search index
+- `app/data/movies.npy` - Movie embeddings matrix
+- `app/data/id_map.json` - Mapping from TMDB IDs to matrix rows
+- `app/data/movie_metadata.json` - Movie metadata
+
+These files are generated automatically and don't need to be included in the repository.
+
 ## Environment Variables
 
 Set these in your Space settings:
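The "Automatic Setup" steps above are implemented by `app.build_index`, which is not part of this commit, so what follows is only a condensed, hypothetical sketch of the embedding and indexing stage: normalized OpenAI embeddings stored in a FAISS inner-product index and written out as the data files listed in the README. Function names and data shapes are assumptions for illustration.

```python
# Hypothetical sketch of the index-building stage; the real app.build_index
# module is not included in this commit, so names and shapes are assumptions.
import json
import numpy as np
import faiss
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def build_faiss_index(movies, out_dir="app/data"):
    """movies: list of dicts with 'id', 'title', 'overview' fetched from TMDB."""
    texts = [f"{m['title']}. {m['overview']}" for m in movies]
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    vectors = np.array([d.embedding for d in resp.data], dtype="float32")

    faiss.normalize_L2(vectors)                 # cosine similarity via inner product
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)

    faiss.write_index(index, f"{out_dir}/faiss.index")
    np.save(f"{out_dir}/movies.npy", vectors)
    with open(f"{out_dir}/id_map.json", "w") as f:
        json.dump({str(m["id"]): i for i, m in enumerate(movies)}, f)
```

An inner-product index over L2-normalized vectors is equivalent to cosine similarity, which is the usual choice for text-embedding search.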
app/start.py
ADDED
@@ -0,0 +1,81 @@
#!/usr/bin/env python3
"""
Startup script that builds the index if data files don't exist,
then starts the FastAPI application.
"""
import os
import subprocess
import sys
import logging

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

def check_data_files():
    """Check if all required data files exist"""
    required_files = [
        "app/data/faiss.index",
        "app/data/movies.npy",
        "app/data/id_map.json",
        "app/data/movie_metadata.json"
    ]

    missing_files = []
    for file_path in required_files:
        if not os.path.exists(file_path):
            missing_files.append(file_path)

    return missing_files

def build_index():
    """Run the build_index script"""
    logger.info("Building movie index and data files...")
    try:
        # Run build_index with a reduced dataset for faster startup on HF
        result = subprocess.run([
            sys.executable, "-m", "app.build_index",
            "--max-pages", "5"  # Reduced for faster startup
        ], check=True, capture_output=True, text=True)

        logger.info("Index built successfully!")
        logger.info(result.stdout)

    except subprocess.CalledProcessError as e:
        logger.error("Failed to build index:")
        logger.error(e.stderr)
        raise

def start_api():
    """Start the FastAPI application"""
    logger.info("Starting FastAPI application...")
    os.execv(sys.executable, [
        sys.executable, "-m", "uvicorn",
        "app.main:app",
        "--host", "0.0.0.0",
        "--port", "7860"
    ])

if __name__ == "__main__":
    logger.info("Karl Movie Vector Backend - Starting up...")

    # Check if data files exist
    missing_files = check_data_files()

    if missing_files:
        logger.info(f"Missing data files: {missing_files}")
        logger.info("This is the first startup - building index...")

        # Build the index
        build_index()

        # Verify files were created
        missing_after_build = check_data_files()
        if missing_after_build:
            logger.error(f"Still missing files after build: {missing_after_build}")
            sys.exit(1)
    else:
        logger.info("All data files present, skipping index build")

    # Start the API
    start_api()
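Note that `start_api()` hands the process over to uvicorn via `os.execv`, so the server replaces start.py as the container's main process and receives signals directly. For a quick local smoke test of the same flow outside Docker, here is a small sketch, assuming the repository root as working directory and the two keys from README_HF.md exported (this helper is not part of the repository):

```python
# Local smoke test for the startup flow; not part of the repository.
# Assumes OPENAI_API_KEY and TMDB_API_KEY are exported, mirroring the Space settings.
import os
import subprocess
import sys

missing = [k for k in ("OPENAI_API_KEY", "TMDB_API_KEY") if not os.environ.get(k)]
if missing:
    sys.exit(f"Set {', '.join(missing)} first - the index build needs both keys.")

# Equivalent to the container's CMD ["python", "app/start.py"]
subprocess.run([sys.executable, "app/start.py"], check=True)
```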