Space: yonnel/karl-movie-vector-backend

yonnel committed on Jun 11

Commit b1c879a (1 parent: 66fef64)

Add automatic data generation on startup for Hugging Face deployment

- Add start.py script that builds index if data files don't exist
- Update Dockerfile to use startup script with longer health check timeout
- Update README_HF.md with deployment instructions
- First startup will take 3-5 minutes to build movie index automatically

Files changed (3)

Dockerfile +8 -4
README_HF.md +58 -0
app/start.py +81 -0

Dockerfile CHANGED

@@ -4,6 +4,7 @@ FROM python:3.9-slim
 RUN apt-get update && apt-get install -y \
     gcc \
     g++ \
+    curl \
     && rm -rf /var/lib/apt/lists/*
 # Set working directory
@@ -21,12 +22,15 @@ COPY app/ ./app/
 # Create data directory
 RUN mkdir -p app/data
+# Make start script executable
+RUN chmod +x app/start.py
 # Expose port
 EXPOSE 7860
-# Health check
-HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
+# Health check with longer timeout for initial build
+HEALTHCHECK --interval=30s --timeout=30s --start-period=300s --retries=3 \
     CMD curl -f http://localhost:7860/health || exit 1
-# Run the application
-CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "7860"]
+# Run the startup script that builds index if needed, then starts the API
+CMD ["python", "app/start.py"]
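
Review note on the Dockerfile hunk: the health check shells out to `curl -f`, which fails on HTTP 4xx/5xx responses as well as on connection errors, and that is why `curl` has to be added to the apt packages. The same probe can be sketched with the Python standard library alone. This is a minimal sketch, not part of the commit; `probe_health` is a hypothetical name, and the 30-second default simply mirrors `--timeout=30s`.

```python
import urllib.error
import urllib.request


def probe_health(url: str, timeout: float = 30.0) -> bool:
    """Return True if `url` answers without an HTTP error, similar to `curl -f`."""
    try:
        # urlopen raises HTTPError for 4xx/5xx responses, so reaching the
        # body of the `with` block means the server answered successfully.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False
```

If something like this replaced `curl` in the `HEALTHCHECK` line, the caller would still have to turn a `False` result into a non-zero exit status, since Docker judges the check by the command's exit code.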

README_HF.md CHANGED

@@ -33,6 +33,64 @@ curl -X POST "https://yonnel-karl-movie-vector-backend.hf.space/explore" \
   }'
 ```
   }'
 ```
+# Karl Movie Vector Backend - Hugging Face Deployment
+This FastAPI application provides movie recommendations using vector similarity search.
+## 🚀 Automatic Setup
+This application will automatically build its movie index on first startup. The process includes:
+1. **Data Collection**: Fetches movie data from TMDB API
+2. **Embedding Generation**: Creates vector embeddings using OpenAI API
+3. **Index Building**: Builds FAISS index for fast similarity search
+4. **API Startup**: Launches the FastAPI service
+⏱️ **First startup may take 3-5 minutes** to build the index.
+## 🔧 Required Environment Variables
+Configure these in your Hugging Face Space settings:
+### Essential APIs
+- `OPENAI_API_KEY`: Your OpenAI API key for generating embeddings
+- `TMDB_API_KEY`: Your TMDB API key for fetching movie data
+### Optional Configuration
+- `API_TOKEN`: Token for API authentication (optional)
+- `LOG_LEVEL`: Logging level (default: INFO)
+## 📡 API Endpoints
+- `GET /health` - Health check
+- `POST /search` - Search for similar movies
+- `GET /movie/{movie_id}` - Get movie details
+## 🏗️ Technical Details
+- **Framework**: FastAPI
+- **Vector Search**: FAISS
+- **Embeddings**: OpenAI text-embedding-3-small
+- **Movie Data**: TMDB (The Movie Database)
+- **Container**: Docker
+## 🔄 Rebuilding Index
+To rebuild the movie index (e.g., to get newer movies):
+1. Delete the Space's persistent storage
+2. Restart the Space
+3. The index will rebuild automatically on startup
+## 📦 Data Files Generated
+The application creates these files on startup:
+- `app/data/faiss.index` - FAISS vector search index
+- `app/data/movies.npy` - Movie embeddings matrix
+- `app/data/id_map.json` - TMDB ID to matrix mapping
+- `app/data/movie_metadata.json` - Movie metadata
+These files are automatically generated and don't need to be included in the repository.
 ## Environment Variables
 Set these in your Space settings:
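
Review note on the README hunk: the README lists `OPENAI_API_KEY` and `TMDB_API_KEY` as essential, but the `start.py` added in this commit does not check for them before launching the build. A missing key would therefore presumably surface only as a failure inside `app.build_index`, whose contents are not shown here. A minimal sketch of an up-front check follows; `REQUIRED_ENV_VARS` and `missing_env_vars` are hypothetical names, not part of the commit.

```python
import os
from typing import List, Mapping

# Variable names taken from the "Essential APIs" list in README_HF.md.
REQUIRED_ENV_VARS = ("OPENAI_API_KEY", "TMDB_API_KEY")


def missing_env_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variable names that are unset or blank in `env`."""
    return [name for name in REQUIRED_ENV_VARS if not env.get(name, "").strip()]
```

Called before the index build, a non-empty result could be logged and turned into an early `sys.exit(1)`, so that a misconfigured Space fails within seconds rather than partway through the build.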

app/start.py ADDED

@@ -0,0 +1,81 @@

+#!/usr/bin/env python3
+"""
+Startup script that builds the index if data files don't exist,
+then starts the FastAPI application.
+"""
+import os
+import subprocess
+import sys
+import logging
+# Configure logging
+logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
+logger = logging.getLogger(__name__)
+def check_data_files():
+    """Check if all required data files exist"""
+    required_files = [
+        "app/data/faiss.index",
+        "app/data/movies.npy",
+        "app/data/id_map.json",
+        "app/data/movie_metadata.json"
+    ]
+    missing_files = []
+    for file_path in required_files:
+        if not os.path.exists(file_path):
+            missing_files.append(file_path)
+    return missing_files
+def build_index():
+    """Run the build_index script"""
+    logger.info("🔧 Building movie index and data files...")
+    try:
+        # Run build_index with reduced dataset for faster startup on HF
+        result = subprocess.run([
+            sys.executable, "-m", "app.build_index",
+            "--max-pages", "5"  # Reduced for faster startup
+        ], check=True, capture_output=True, text=True)
+        logger.info("✅ Index built successfully!")
+        logger.info(result.stdout)
+    except subprocess.CalledProcessError as e:
+        logger.error("❌ Failed to build index:")
+        logger.error(e.stderr)
+        raise
+def start_api():
+    """Start the FastAPI application"""
+    logger.info("🚀 Starting FastAPI application...")
+    os.execv(sys.executable, [
+        sys.executable, "-m", "uvicorn",
+        "app.main:app",
+        "--host", "0.0.0.0",
+        "--port", "7860"
+    ])
+if __name__ == "__main__":
+    logger.info("🎬 Karl Movie Vector Backend - Starting up...")
+    # Check if data files exist
+    missing_files = check_data_files()
+    if missing_files:
+        logger.info(f"📁 Missing data files: {missing_files}")
+        logger.info("🔄 This is the first startup - building index...")
+        # Build the index
+        build_index()
+        # Verify files were created
+        missing_after_build = check_data_files()
+        if missing_after_build:
+            logger.error(f"❌ Still missing files after build: {missing_after_build}")
+            sys.exit(1)
+    else:
+        logger.info("✅ All data files present, skipping index build")
+    # Start the API
+    start_api()
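
Review note on `app/start.py`: because `build_index()` passes `capture_output=True`, the build's output is held in memory and logged only after the subprocess exits, so the Space logs stay silent for the 3-5 minute first startup. One possible alternative is to stream the child's output as it arrives. The sketch below is an assumption about how that could look, not the author's code; `run_streaming` is a hypothetical helper.

```python
import subprocess
import sys
from typing import List


def run_streaming(cmd: List[str]) -> int:
    """Run `cmd`, echo its output line by line, and return its exit code."""
    with subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the same stream
        text=True,
        bufsize=1,  # line-buffered reads on the parent side
    ) as proc:
        assert proc.stdout is not None  # guaranteed by stdout=PIPE
        for line in proc.stdout:
            print(line, end="", flush=True)
    # Leaving the `with` block waits for the child, so returncode is set.
    return proc.returncode
```

With the command from the commit, the call would be `run_streaming([sys.executable, "-m", "app.build_index", "--max-pages", "5"])`, followed by a check that the return value is 0. The child may still buffer its own output when writing to a pipe; running it with `python -u` or setting `PYTHONUNBUFFERED=1` is the usual remedy. Simply dropping `capture_output=True` so the child inherits the parent's stdout would be an even smaller change.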