yonnel committed
Commit b1c879a · 1 Parent(s): 66fef64

Add automatic data generation on startup for Hugging Face deployment

- Add start.py script that builds index if data files don't exist
- Update Dockerfile to use startup script with longer health check timeout
- Update README_HF.md with deployment instructions
- First startup will take 3-5 minutes to build movie index automatically

Files changed (3)
  1. Dockerfile +8 -4
  2. README_HF.md +58 -0
  3. app/start.py +81 -0
Dockerfile CHANGED
@@ -4,6 +4,7 @@ FROM python:3.9-slim
 RUN apt-get update && apt-get install -y \
     gcc \
     g++ \
+    curl \
     && rm -rf /var/lib/apt/lists/*
 
 # Set working directory
@@ -21,12 +22,15 @@ COPY app/ ./app/
 # Create data directory
 RUN mkdir -p app/data
 
+# Make start script executable
+RUN chmod +x app/start.py
+
 # Expose port
 EXPOSE 7860
 
-# Health check
-HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
+# Health check with longer timeout for initial build
+HEALTHCHECK --interval=30s --timeout=30s --start-period=300s --retries=3 \
     CMD curl -f http://localhost:7860/health || exit 1
 
-# Run the application
-CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "7860"]
+# Run the startup script that builds index if needed, then starts the API
+CMD ["python", "app/start.py"]
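Note on the health check: the Dockerfile change installs `curl` only so that `HEALTHCHECK` can probe `/health`. Since the image already ships Python, the same probe could be written with the standard library. The following is a minimal sketch, not part of this commit; the function name `probe` and its default timeout are hypothetical. It returns 0 when the endpoint answers with a 2xx status and 1 otherwise, which is roughly what `curl -f ... || exit 1` does (unlike `curl -f` without `-L`, `urlopen` follows redirects).

```python
import urllib.error
import urllib.request


def probe(url: str = "http://localhost:7860/health", timeout: float = 5.0) -> int:
    """Return 0 if `url` answers with a 2xx status, otherwise 1.

    Hypothetical helper: the commit itself uses curl for this check.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 0 if 200 <= resp.status < 300 else 1
    except (urllib.error.URLError, OSError):
        # HTTPError (4xx/5xx) is a subclass of URLError; refused
        # connections and timeouts surface as URLError or OSError.
        return 1
```

If this were saved as a script that ends with `sys.exit(probe())`, the `HEALTHCHECK` command could call it with `python` instead of `curl`. Either way, Docker does not count probe failures during the `--start-period` window toward `--retries`, which is why the commit raises that window to 300 seconds for the first build.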
README_HF.md CHANGED
@@ -33,6 +33,64 @@ curl -X POST "https://yonnel-karl-movie-vector-backend.hf.space/explore" \
   }'
 ```
 
+# Karl Movie Vector Backend - Hugging Face Deployment
+
+This FastAPI application provides movie recommendations using vector similarity search.
+
+## 🚀 Automatic Setup
+
+This application will automatically build its movie index on first startup. The process includes:
+
+1. **Data Collection**: Fetches movie data from TMDB API
+2. **Embedding Generation**: Creates vector embeddings using OpenAI API
+3. **Index Building**: Builds FAISS index for fast similarity search
+4. **API Startup**: Launches the FastAPI service
+
+⏱️ **First startup may take 3-5 minutes** to build the index.
+
+## 🔧 Required Environment Variables
+
+Configure these in your Hugging Face Space settings:
+
+### Essential APIs
+- `OPENAI_API_KEY`: Your OpenAI API key for generating embeddings
+- `TMDB_API_KEY`: Your TMDB API key for fetching movie data
+
+### Optional Configuration
+- `API_TOKEN`: Token for API authentication (optional)
+- `LOG_LEVEL`: Logging level (default: INFO)
+
+## 📡 API Endpoints
+
+- `GET /health` - Health check
+- `POST /search` - Search for similar movies
+- `GET /movie/{movie_id}` - Get movie details
+
+## 🏗️ Technical Details
+
+- **Framework**: FastAPI
+- **Vector Search**: FAISS
+- **Embeddings**: OpenAI text-embedding-3-small
+- **Movie Data**: TMDB (The Movie Database)
+- **Container**: Docker
+
+## 🔄 Rebuilding Index
+
+To rebuild the movie index (e.g., to get newer movies):
+1. Delete the Space's persistent storage
+2. Restart the Space
+3. The index will rebuild automatically on startup
+
+## 📦 Data Files Generated
+
+The application creates these files on startup:
+- `app/data/faiss.index` - FAISS vector search index
+- `app/data/movies.npy` - Movie embeddings matrix
+- `app/data/id_map.json` - TMDB ID to matrix mapping
+- `app/data/movie_metadata.json` - Movie metadata
+
+These files are automatically generated and don't need to be included in the repository.
+
 ## Environment Variables
 
 Set these in your Space settings:
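Note on the environment variables: the README addition lists two required keys and two optional settings, but nothing in this commit checks them before the 3-5 minute build begins. A minimal sketch of an early check follows. The names `load_settings`, `REQUIRED_VARS`, and `OPTIONAL_DEFAULTS` are hypothetical and do not appear in the commit, and the empty-string default for `API_TOKEN` is an assumption (the README only says it is optional).

```python
import os
from typing import Dict, Mapping

# Names taken from the README addition; the grouping is illustrative.
REQUIRED_VARS = ("OPENAI_API_KEY", "TMDB_API_KEY")
OPTIONAL_DEFAULTS = {"API_TOKEN": "", "LOG_LEVEL": "INFO"}  # API_TOKEN default is assumed


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect settings from `env`, failing fast if a required key is unset or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    settings = {name: env[name] for name in REQUIRED_VARS}
    for name, default in OPTIONAL_DEFAULTS.items():
        settings[name] = env.get(name, default)
    return settings
```

Calling something like `load_settings()` at the top of the startup script would surface a missing key in the Space logs immediately, rather than partway through data collection.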
app/start.py ADDED
@@ -0,0 +1,81 @@
+#!/usr/bin/env python3
+"""
+Startup script that builds the index if data files don't exist,
+then starts the FastAPI application.
+"""
+import os
+import subprocess
+import sys
+import logging
+
+# Configure logging
+logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
+logger = logging.getLogger(__name__)
+
+def check_data_files():
+    """Check if all required data files exist"""
+    required_files = [
+        "app/data/faiss.index",
+        "app/data/movies.npy",
+        "app/data/id_map.json",
+        "app/data/movie_metadata.json"
+    ]
+
+    missing_files = []
+    for file_path in required_files:
+        if not os.path.exists(file_path):
+            missing_files.append(file_path)
+
+    return missing_files
+
+def build_index():
+    """Run the build_index script"""
+    logger.info("🔧 Building movie index and data files...")
+    try:
+        # Run build_index with reduced dataset for faster startup on HF
+        result = subprocess.run([
+            sys.executable, "-m", "app.build_index",
+            "--max-pages", "5"  # Reduced for faster startup
+        ], check=True, capture_output=True, text=True)
+
+        logger.info("✅ Index built successfully!")
+        logger.info(result.stdout)
+
+    except subprocess.CalledProcessError as e:
+        logger.error("❌ Failed to build index:")
+        logger.error(e.stderr)
+        raise
+
+def start_api():
+    """Start the FastAPI application"""
+    logger.info("🚀 Starting FastAPI application...")
+    os.execv(sys.executable, [
+        sys.executable, "-m", "uvicorn",
+        "app.main:app",
+        "--host", "0.0.0.0",
+        "--port", "7860"
+    ])
+
+if __name__ == "__main__":
+    logger.info("🎬 Karl Movie Vector Backend - Starting up...")
+
+    # Check if data files exist
+    missing_files = check_data_files()
+
+    if missing_files:
+        logger.info(f"📁 Missing data files: {missing_files}")
+        logger.info("🔄 This is the first startup - building index...")
+
+        # Build the index
+        build_index()
+
+        # Verify files were created
+        missing_after_build = check_data_files()
+        if missing_after_build:
+            logger.error(f"❌ Still missing files after build: {missing_after_build}")
+            sys.exit(1)
+    else:
+        logger.info("✅ All data files present, skipping index build")
+
+    # Start the API
+    start_api()
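Note on `build_index()`: because it calls `subprocess.run` with `capture_output=True`, the build's output only reaches the logs after the subprocess exits, so the Space shows no progress during the 3-5 minute first startup. One possible alternative, sketched here under assumptions and not part of this commit, is to stream the child's output line by line. The helper name `run_streaming` is hypothetical.

```python
import logging
import subprocess
import sys
from typing import List, Sequence

logger = logging.getLogger(__name__)


def run_streaming(cmd: Sequence[str]) -> List[str]:
    """Run `cmd`, logging each output line as it arrives.

    stderr is merged into stdout so the lines stay in order.
    Raises subprocess.CalledProcessError on a non-zero exit code.
    """
    lines: List[str] = []
    with subprocess.Popen(
        list(cmd),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        bufsize=1,  # line-buffered reads on our side of the pipe
    ) as proc:
        assert proc.stdout is not None
        for line in proc.stdout:
            line = line.rstrip("\n")
            logger.info(line)
            lines.append(line)
    # Leaving the `with` block waits for the process, so returncode is set.
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(
            proc.returncode, list(cmd), output="\n".join(lines)
        )
    return lines
```

Inside `build_index()`, the call would look like `run_streaming([sys.executable, "-m", "app.build_index", "--max-pages", "5"])`. A child Python process may still buffer its own stdout when writing to a pipe; passing `-u` or setting `PYTHONUNBUFFERED=1` is the usual way to avoid that.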