sequenceDiagram
    participant PDF as arXiv PDF Document
    participant DL as Document Loader (PyMuPDF)
    participant TS as Text Splitter (RecursiveCharacter)
    participant EM as Embedding Model (OpenAI)
    participant VDB as Vector Database (Qdrant)
    participant DS as Dataset (Hugging Face)
    PDF->>DL: Load document
    Note over DL: extract_images=True
    DL->>TS: Pass extracted text
    TS->>EM: Send text chunks
    EM->>VDB: Store embeddings
    DL->>DS: Store metadata
    DL->>DS: Store extracted text