sequenceDiagram
    participant PDF as arXiv PDF Document
    participant DL as Document Loader (PyMuPDF)
    participant TS as Text Splitter (RecursiveCharacter)
    participant EM as Embedding Model (OpenAI)
    participant VDB as Vector Database (Qdrant)
    participant DS as Dataset (Hugging Face)

    PDF->>DL: Load document
    Note over DL: extract_images=True
    DL->>TS: Pass extracted text
    TS->>EM: Send text chunks
    EM->>VDB: Store embeddings
    DL->>DS: Store metadata
    DL->>DS: Store extracted text