# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview

This is a FastAPI-based multilingual embedding API that exposes five specialized models for generating embeddings from Spanish, Catalan, English, and multilingual text. The API is deployed on Hugging Face Spaces and serves embeddings for different use cases, including legal documents and general-purpose text.
## API Architecture - Endpoint Per Model

The API uses a dedicated endpoint for each model, with two loading strategies (a code sketch follows the model list):
**Startup model** (loads at app initialization):

- **jina-v3** (`/embed/jina-v3`): multilingual, latest generation (1024D, 8192 tokens)
**On-demand models** (load when first requested):

- **roberta-ca** (`/embed/roberta-ca`): Catalan general purpose (1024D, 512 tokens)
- **jina** (`/embed/jina`): bilingual Spanish-English (768D, 8192 tokens)
- **robertalex** (`/embed/robertalex`): Spanish legal domain (768D, 512 tokens)
- **legal-bert** (`/embed/legal-bert`): English legal domain (768D, 512 tokens)
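A minimal sketch of the eager-plus-lazy pattern, assuming placeholder names (`MODEL_CACHE`, `load_model`, `run_embedding`, and `EmbedRequest` are illustrative, not necessarily the identifiers used in `app.py`):

```python
# Sketch only: eager loading for jina-v3, lazy loading for the rest.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
MODEL_CACHE: dict = {}  # model name -> loaded bundle

class EmbedRequest(BaseModel):
    texts: list[str]
    normalize: bool = True

def load_model(name: str) -> dict:
    # Stub standing in for the real loader (see utils/helpers.py)
    return {"name": name}

def run_embedding(bundle: dict, req: EmbedRequest) -> dict:
    # Stub standing in for the real embedding routine
    return {"model": bundle["name"], "count": len(req.texts)}

def get_model(name: str) -> dict:
    if name not in MODEL_CACHE:          # lazy: load and cache on first use
        MODEL_CACHE[name] = load_model(name)
    return MODEL_CACHE[name]

@app.on_event("startup")
def warm_startup_model():
    get_model("jina-v3")                 # eager: cached before any request

@app.post("/embed/jina-v3")
def embed_jina_v3(req: EmbedRequest):
    return run_embedding(get_model("jina-v3"), req)

@app.post("/embed/roberta-ca")
def embed_roberta_ca(req: EmbedRequest):
    return run_embedding(get_model("roberta-ca"), req)  # loads on first call
```

The stubs stand in for the real loader and embedding routine; the point is that `jina-v3` is warmed at startup while every other endpoint pays its load cost only on the first call.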
## Architecture
### Core Components
- `app.py`: Main FastAPI application with endpoints for embedding generation, model listing, and health checks
- `models/schemas.py`: Pydantic models for request/response validation and API documentation
- `utils/helpers.py`: Model loading, embedding generation, and memory management utilities
### Key Design Patterns
- **Global model cache**: Models are cached in memory once loaded (`jina-v3` at startup, the others on first request) and reused for fast inference
- **Batch processing**: Memory-efficient batching, with batch sizes tuned to each model's complexity
- **Memory optimization**: Automatic cleanup after large batches; torch dtype optimization
- **Device handling**: Automatic CPU/GPU detection with appropriate tensor placement (see the sketch after this list)
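For the device-handling and dtype choices above, a minimal sketch (the helper name is illustrative):

```python
import torch

def pick_device_and_dtype() -> tuple:
    # float16 halves memory on GPU; float32 keeps full precision on CPU
    if torch.cuda.is_available():
        return torch.device("cuda"), torch.float16
    return torch.device("cpu"), torch.float32
```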
## Development Commands
### Running the API

```bash
python app.py
```

The API starts on http://0.0.0.0:7860 by default.
### Testing the API
```bash
# Using the endpoint test script
python test_endpoints.py

# Manual testing with curl
# Health check
curl http://localhost:7860/health

# Test jina-v3 endpoint (startup model)
curl -X POST "http://localhost:7860/embed/jina-v3" \
  -H "Content-Type: application/json" \
  -d '{"texts": ["Hello world", "Hola mundo"], "normalize": true}'

# Test Catalan RoBERTa endpoint
curl -X POST "http://localhost:7860/embed/roberta-ca" \
  -H "Content-Type: application/json" \
  -d '{"texts": ["Bon dia", "Com estàs?"], "normalize": true}'

# Test Spanish legal endpoint
curl -X POST "http://localhost:7860/embed/robertalex" \
  -H "Content-Type: application/json" \
  -d '{"texts": ["Artículo primero"], "normalize": true}'

# List models
curl http://localhost:7860/models
```
### Docker Development
```bash
# Build image
docker build -t spanish-embeddings-api .

# Run container
docker run -p 7860:7860 spanish-embeddings-api
```
## Important Implementation Details
### Model Loading Strategy
- `jina-v3` is loaded at startup in the `load_models()` function; the remaining models are loaded lazily on first request and then cached (see the sketch after this list)
- Uses different tokenizer classes based on model architecture (`AutoTokenizer`, `RobertaTokenizer`, `BertTokenizer`)
- Implements memory optimization with `torch.float16` on GPU and `torch.float32` on CPU
- Each model is cached with its tokenizer, device, and pooling strategy
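A sketch of this loading path, assuming an illustrative architecture-to-tokenizer mapping (the real mapping, repo IDs, and pooling assignments may differ):

```python
import torch
from transformers import AutoModel, AutoTokenizer, BertTokenizer, RobertaTokenizer

# Illustrative mapping; the real table in utils/helpers.py may differ.
TOKENIZER_CLASSES = {
    "jina-v3": AutoTokenizer,
    "jina": AutoTokenizer,
    "roberta-ca": AutoTokenizer,
    "robertalex": RobertaTokenizer,
    "legal-bert": BertTokenizer,
}

def load_model_bundle(name: str, repo_id: str) -> dict:
    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32
    tokenizer = TOKENIZER_CLASSES[name].from_pretrained(repo_id)
    model = AutoModel.from_pretrained(repo_id, torch_dtype=dtype).to(device).eval()
    # Cache the model together with its tokenizer, device, and pooling strategy
    pooling = "mean" if name.startswith("jina") else "cls"  # per the doc above
    return {"model": model, "tokenizer": tokenizer, "device": device, "pooling": pooling}
```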
### Embedding Generation
- Supports two pooling strategies: mean pooling (Jina models) and CLS token (BERT-based models); see the sketch after this list
- Implements dynamic batching based on model complexity
- Automatic memory cleanup for large batches (>20 texts)
- Text validation and cleaning in preprocessing
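The two pooling strategies, plus the optional normalization, as a minimal sketch:

```python
import torch
import torch.nn.functional as F

def mean_pool(last_hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Jina models: average token embeddings, excluding padding positions
    mask = attention_mask.unsqueeze(-1).to(last_hidden.dtype)
    return (last_hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-6)

def cls_pool(last_hidden: torch.Tensor) -> torch.Tensor:
    # BERT-based models: take the first ([CLS]) token's embedding
    return last_hidden[:, 0]

def finalize(embeddings: torch.Tensor, normalize: bool) -> torch.Tensor:
    # Optional L2 normalization, matching the request's `normalize` flag
    return F.normalize(embeddings, p=2, dim=1) if normalize else embeddings
```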
### Request Limits
- Maximum 50 texts per request (see the validator sketch after this list)
- Model-specific `max_length` validation
- Memory-aware batch sizing
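A sketch of the 50-text cap as a Pydantic v2 validator (schema and field names are illustrative, and a Pydantic v1 schema would use `@validator` instead; see `models/schemas.py` for the real definitions):

```python
from pydantic import BaseModel, field_validator

MAX_TEXTS_PER_REQUEST = 50

class EmbeddingRequest(BaseModel):
    texts: list[str]
    normalize: bool = True

    @field_validator("texts")
    @classmethod
    def limit_batch(cls, texts: list[str]) -> list[str]:
        if not texts:
            raise ValueError("texts must contain at least one item")
        if len(texts) > MAX_TEXTS_PER_REQUEST:
            raise ValueError(f"at most {MAX_TEXTS_PER_REQUEST} texts per request")
        return texts
```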
### Error Handling
- Comprehensive validation in Pydantic schemas
- HTTP status code mapping for different error types (see the sketch after this list)
- Model availability checks
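One plausible shape for the status-code mapping, with an illustrative helper (the real handlers may differ):

```python
from typing import Callable

from fastapi import HTTPException

def embed_or_fail(run_embedding: Callable, bundle: dict, req) -> dict:
    """Map low-level failures to HTTP status codes (illustrative shape)."""
    try:
        return run_embedding(bundle, req)
    except ValueError as exc:
        # Validation-style failures -> 422 Unprocessable Entity
        raise HTTPException(status_code=422, detail=str(exc)) from exc
    except RuntimeError as exc:
        # e.g. CUDA out-of-memory -> 503 Service Unavailable
        raise HTTPException(status_code=503, detail="model temporarily unavailable") from exc
```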
## Environment Variables
- `TRANSFORMERS_CACHE`: Model cache directory
- `HF_HOME`: Hugging Face cache directory
- `PYTORCH_CUDA_ALLOC_CONF`: CUDA memory management
- `TOKENIZERS_PARALLELISM`: Set to `false` to avoid warnings
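If defaults for these are set in code rather than in the Dockerfile, they must be applied before `transformers` is imported; a sketch with hypothetical paths:

```python
import os

# Hypothetical cache locations; set before importing transformers so they
# take effect at import time.
os.environ.setdefault("TRANSFORMERS_CACHE", "/tmp/model-cache")
os.environ.setdefault("HF_HOME", "/tmp/hf-home")
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
```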
## Memory Management
The application implements several memory optimization strategies (a cleanup sketch follows this list):
- Automatic garbage collection after model loading
- Batch size reduction for large models (`jina-v3`, `roberta-ca`)
- CUDA cache clearing for GPU deployments
- Memory cleanup after processing large batches
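A minimal sketch of the post-batch cleanup (the threshold mirrors the ">20 texts" rule above; the function name is illustrative):

```python
import gc

import torch

LARGE_BATCH_THRESHOLD = 20  # mirrors the ">20 texts" cleanup rule

def cleanup_after_batch(num_texts: int) -> None:
    if num_texts > LARGE_BATCH_THRESHOLD:
        gc.collect()                    # reclaim Python-level garbage
        if torch.cuda.is_available():
            torch.cuda.empty_cache()    # return cached CUDA blocks to the driver
```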