Jordi Catafal committed on
Commit dda5c3b · 1 Parent(s): 47bf3b3

trying to solve api

CLAUDE.md ADDED
@@ -0,0 +1,103 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is a FastAPI-based multilingual embedding API that provides access to 5 specialized models for generating embeddings from Spanish, Catalan, English, and multilingual text. The API is deployed on Hugging Face Spaces and serves embeddings for different use cases, including legal documents and general-purpose text.

## Available Models

The API serves 5 models with different specializations:

- **jina**: Bilingual Spanish-English (768D, 8192 tokens)
- **robertalex**: Spanish legal domain (768D, 512 tokens)
- **jina-v3**: Latest-generation multilingual (1024D, 8192 tokens)
- **legal-bert**: English legal domain (768D, 512 tokens)
- **roberta-ca**: Catalan general purpose (1024D, 512 tokens)

## Architecture

### Core Components

- **app.py**: Main FastAPI application with endpoints for embedding generation, model listing, and health checks
- **models/schemas.py**: Pydantic models for request/response validation and API documentation
- **utils/helpers.py**: Model loading, embedding generation, and memory management utilities

### Key Design Patterns

- **Global model cache**: All 5 models are loaded at startup and cached in memory for fast inference (see the sketch below)
- **Batch processing**: Memory-efficient batching, with batch sizes that vary by model complexity
- **Memory optimization**: Automatic cleanup after large batches, torch dtype optimization
- **Device handling**: Automatic CPU/GPU detection with appropriate tensor placement
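
A minimal sketch of the cache-and-device pattern, assuming Hugging Face `transformers`; the checkpoint mapping and function body are illustrative, not the exact contents of `utils/helpers.py`:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Hypothetical mapping - the actual checkpoints are defined in the repo.
MODEL_IDS = {"jina": "jinaai/jina-embeddings-v2-base-es"}

def load_models() -> dict:
    """Load every model once at startup and cache it with its device."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32
    cache = {}
    for name, repo_id in MODEL_IDS.items():
        tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
        model = AutoModel.from_pretrained(
            repo_id, torch_dtype=dtype, trust_remote_code=True
        ).to(device).eval()
        cache[name] = {"model": model, "tokenizer": tokenizer, "device": device}
    return cache
```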

## Development Commands

### Running the API

```bash
python app.py
```

The API will start on `http://0.0.0.0:7860` by default.

### Testing the API

```bash
# Using the test script
python test_api.py

# Manual testing with curl
# Health check
curl http://localhost:7860/health

# Generate embeddings
curl -X POST "http://localhost:7860/embed" \
  -H "Content-Type: application/json" \
  -d '{"texts": ["Texto de prueba"], "model": "jina"}'

# List models
curl http://localhost:7860/models
```

### Docker Development

```bash
# Build image
docker build -t spanish-embeddings-api .

# Run container
docker run -p 7860:7860 spanish-embeddings-api
```

## Important Implementation Details

### Model Loading Strategy

- All models are loaded at startup in the `load_models()` function
- Different tokenizer classes are used depending on model architecture (AutoTokenizer, RobertaTokenizer, BertTokenizer)
- Memory is optimized with torch.float16 on GPU and torch.float32 on CPU
- Each model is cached with its tokenizer, device, and pooling strategy

### Embedding Generation

- Supports two pooling strategies: mean pooling (Jina models) and CLS token (BERT-based models); see the sketch below
- Implements dynamic batching based on model complexity
- Automatic memory cleanup for large batches (>20 texts)
- Text validation and cleaning in preprocessing
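
A hedged sketch of the two pooling strategies; the helper names are hypothetical and the real code in `utils/helpers.py` may differ:

```python
import torch

def mean_pool(last_hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors, masking out padding (Jina models)."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden.dtype)  # (batch, seq, 1)
    return (last_hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

def cls_pool(last_hidden: torch.Tensor) -> torch.Tensor:
    """Use the first ([CLS]) token vector (BERT-based models)."""
    return last_hidden[:, 0]
```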

### API Rate Limiting

- Maximum 50 texts per request
- Model-specific max_length validation
- Memory-aware batch sizing

### Error Handling

- Comprehensive validation in Pydantic schemas (see the schema sketch below)
- HTTP status code mapping for different error types
- Model availability checks
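
A sketch of how the request schema could enforce these limits, assuming Pydantic v2; the field names follow the documented payload, but the real `models/schemas.py` may differ:

```python
from typing import List
from pydantic import BaseModel, ConfigDict, field_validator

class EmbeddingRequest(BaseModel):
    model_config = ConfigDict(protected_namespaces=())  # allow a field named "model"

    texts: List[str]
    model: str = "jina"
    normalize: bool = True

    @field_validator("texts")
    @classmethod
    def validate_texts(cls, v: List[str]) -> List[str]:
        if not v:
            raise ValueError("texts must not be empty")
        if len(v) > 50:  # documented cap: maximum 50 texts per request
            raise ValueError("maximum 50 texts per request")
        return [t.strip() for t in v]
```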

## Environment Variables

The following variables configure caching and runtime behavior (see the example below):

- `TRANSFORMERS_CACHE`: Model cache directory
- `HF_HOME`: Hugging Face cache directory
- `PYTORCH_CUDA_ALLOC_CONF`: CUDA memory management
- `TOKENIZERS_PARALLELISM`: Set to `false` to avoid warnings
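
For example, these can be set before the app imports `transformers` (the paths and the allocator value below are placeholders, not values mandated by the repo):

```python
import os

os.environ.setdefault("HF_HOME", "/data/hf-cache")
os.environ.setdefault("TRANSFORMERS_CACHE", "/data/hf-cache/transformers")
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:512")
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
```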

## Memory Management

The application implements several memory optimization strategies (sketched below):

- Automatic garbage collection after model loading
- Batch size reduction for large models (jina-v3, roberta-ca)
- CUDA cache clearing for GPU deployments
- Memory cleanup after processing large batches
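
A plausible shape for the cleanup routine these bullets describe (the actual `cleanup_memory()` in `utils/helpers.py` may differ):

```python
import gc
import torch

def cleanup_memory() -> None:
    """Release Python garbage and return cached CUDA blocks to the driver."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```
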
__pycache__/app.cpython-311.pyc ADDED
Binary file (6.2 kB)
 
app.py CHANGED
@@ -1,28 +1,48 @@
 from fastapi import FastAPI, HTTPException
+from fastapi.middleware.cors import CORSMiddleware
+from contextlib import asynccontextmanager
 from typing import List
 import torch
 import uvicorn
-import gc
-import os
 
 from models.schemas import EmbeddingRequest, EmbeddingResponse, ModelInfo
 from utils.helpers import load_models, get_embeddings, cleanup_memory
 
+# Global model cache
+models_cache = {}
+
+@asynccontextmanager
+async def lifespan(app: FastAPI):
+    """Application lifespan handler for startup and shutdown"""
+    # Startup
+    try:
+        global models_cache
+        print("Loading models...")
+        models_cache = load_models()
+        print("All models loaded successfully!")
+        yield
+    except Exception as e:
+        print(f"Failed to load models: {str(e)}")
+        raise
+    finally:
+        # Shutdown - cleanup resources
+        cleanup_memory()
+
 app = FastAPI(
     title="Multilingual & Legal Embedding API",
     description="Multi-model embedding API for Spanish, Catalan, English and Legal texts",
-    version="3.0.0"
+    version="3.0.0",
+    lifespan=lifespan
 )
 
-# Global model cache
-models_cache = {}
-
-@app.on_event("startup")
-async def startup_event():
-    """Load models on startup"""
-    global models_cache
-    models_cache = load_models()
-    print("All models loaded successfully!")
+# Add CORS middleware to allow cross-origin requests
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=["*"],  # In production, specify actual domains
+    allow_credentials=True,
+    allow_methods=["*"],
+    allow_headers=["*"],
+)
 
 @app.get("/")
 async def root():
@@ -122,10 +142,13 @@ async def list_models():
 @app.get("/health")
 async def health_check():
     """Health check endpoint"""
+    models_loaded = len(models_cache) == 5
     return {
-        "status": "healthy",
-        "models_loaded": len(models_cache) == 5,
-        "available_models": list(models_cache.keys())
+        "status": "healthy" if models_loaded else "degraded",
+        "models_loaded": models_loaded,
+        "available_models": list(models_cache.keys()),
+        "expected_models": ["jina", "robertalex", "jina-v3", "legal-bert", "roberta-ca"],
+        "models_count": len(models_cache)
     }
 
 if __name__ == "__main__":

models/__pycache__/__init__.cpython-311.pyc ADDED
Binary file (409 Bytes)

models/__pycache__/schemas.cpython-311.pyc ADDED
Binary file (5.72 kB)
 
test_api.py ADDED
@@ -0,0 +1,64 @@
#!/usr/bin/env python3
"""
Simple test script for the embedding API
"""

import requests
import json
import time

def test_api(base_url="http://localhost:7860"):
    """Test the API endpoints"""

    print(f"Testing API at {base_url}")

    # Test root endpoint
    try:
        response = requests.get(f"{base_url}/")
        print(f"✓ Root endpoint: {response.status_code}")
        print(f"  Response: {response.json()}")
    except Exception as e:
        print(f"✗ Root endpoint failed: {e}")
        return False

    # Test health endpoint
    try:
        response = requests.get(f"{base_url}/health")
        print(f"✓ Health endpoint: {response.status_code}")
        health_data = response.json()
        print(f"  Models loaded: {health_data.get('models_loaded', False)}")
        print(f"  Available models: {health_data.get('available_models', [])}")
    except Exception as e:
        print(f"✗ Health endpoint failed: {e}")

    # Test models endpoint
    try:
        response = requests.get(f"{base_url}/models")
        print(f"✓ Models endpoint: {response.status_code}")
        models = response.json()
        print(f"  Found {len(models)} model definitions")
    except Exception as e:
        print(f"✗ Models endpoint failed: {e}")

    # Test embedding endpoint
    try:
        payload = {
            "texts": ["Hello world", "Test text"],
            "model": "jina",
            "normalize": True
        }
        response = requests.post(f"{base_url}/embed", json=payload)
        print(f"✓ Embed endpoint: {response.status_code}")
        if response.status_code == 200:
            data = response.json()
            print(f"  Generated {data.get('num_texts', 0)} embeddings")
            print(f"  Dimensions: {data.get('dimensions', 0)}")
        else:
            print(f"  Error: {response.text}")
    except Exception as e:
        print(f"✗ Embed endpoint failed: {e}")

    return True

if __name__ == "__main__":
    test_api()
utils/__pycache__/__init__.cpython-311.pyc ADDED
Binary file (445 Bytes)

utils/__pycache__/helpers.cpython-311.pyc ADDED
Binary file (10.4 kB)