esra2001 committed on
Commit 93449af · verified · 1 Parent(s): 9864977

Update app.py

Files changed (1):
  1. app.py +4017 -38

app.py CHANGED
@@ -1,51 +1,4033 @@
  from typing import List
  from langchain import hub
  from langchain.text_splitter import RecursiveCharacterTextSplitter
  from langchain_community.vectorstores import Chroma
  from langchain.vectorstores import Chroma
- import chromadb
  from langchain_core.output_parsers import StrOutputParser
  from langchain_core.runnables import RunnablePassthrough
  import bs4
  from sentence_transformers import SentenceTransformer
  from langchain_openai import OpenAIEmbeddings, ChatOpenAI
  from langchain_huggingface import HuggingFaceEmbeddings
  import ollama
  from langchain.embeddings import OllamaEmbeddings, HuggingFaceEmbeddings
  import numpy as np
  import uuid
  import os
- from dotenv import load_dotenv
- from langchain.chains import RetrievalQA
- from langchain_core.output_parsers import StrOutputParser
- from langchain_core.runnables import RunnablePassthrough
- from langchain_huggingface import HuggingFaceEmbeddings
- from langchain.memory import ConversationBufferMemory
- from langchain.chains import ConversationalRetrievalChain
- from langchain_core.prompts import PromptTemplate
- from transformers import AutoModelForSequenceClassification, AutoTokenizer
- from transformers import pipeline
- import json
- import smtplib
- from email.mime.text import MIMEText
- from email.mime.multipart import MIMEMultipart
- from email.message import EmailMessage
- import ssl
- from datetime import datetime
- from langchain.prompts import PromptTemplate
- from langchain.schema.runnable import RunnablePassthrough
- from sentence_transformers import CrossEncoder
- from langchain_openai import ChatOpenAI
- import zipfile
- import gradio as gr
- load_dotenv()

- os.environ['LANGCHAIN_TRACING_V2'] = 'true'
- os.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'
- os.environ['LANGCHAIN_API_KEY']
- os.environ["OPENAI_API_KEY"]
  embeddings_model = HuggingFaceEmbeddings(model_name="HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1.5")

  model = AutoModelForSequenceClassification.from_pretrained("facebook/bart-large-mnli")
@@ -57,18 +4039,14 @@ def detect_intent(text):
      result = classifier(text, candidate_labels=["question", "greeting", "small talk", "feedback", "thanks"])
      label = result["labels"][0]
      return label.lower()
-
- if not os.path.exists("./chroma_db_Copy"):
-     with zipfile.ZipFile("chroma_db_Copy.zip", "r") as zip_ref:
-         zip_ref.extractall("./")
-
- chroma_db_path = "./chroma_db_Copy"
  chroma_client = chromadb.PersistentClient(path=chroma_db_path)

  data = chroma_client.get_collection(name="my_dataaaa")
  vectorstore = Chroma(
      collection_name="my_dataaaa",
-     persist_directory="./chroma_db_Copy",
      embedding_function=embeddings_model
  )

@@ -109,7 +4087,8 @@ llm = ChatOpenAI(model="gpt-3.5-turbo")
  def format_docs(docs):
      return "\n\n".join(doc.page_content for doc in docs)

-

  rag_chain = (
      {
@@ -247,5 +4226,5 @@ with gr.Blocks() as chat:
      )
      gr.Markdown("© 2025 Esra Belhassen. All rights reserved")

- chat.launch()
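For context, the `classifier` referenced in the first hunk above is a zero-shot intent classifier built on the `facebook/bart-large-mnli` model loaded just before it; its construction is not shown in this diff, so the following is only a minimal sketch under that assumption:

# Sketch (assumption): how the zero-shot `classifier` used by detect_intent() is typically built.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def detect_intent(text):
    # Rank the candidate intent labels and return the top one, lower-cased.
    result = classifier(text, candidate_labels=["question", "greeting", "small talk", "feedback", "thanks"])
    return result["labels"][0].lower()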
 
1
+ import os
2
+ import uuid
3
+ import gradio as gr
4
+ from dotenv import load_dotenv
5
+ from langchain_core.output_parsers import StrOutputParser
6
+ from langchain_core.runnables import RunnableLambda, RunnablePassthrough
7
+ from langchain_core.prompts import PromptTemplate
8
+ from langchain_community.vectorstores import Chroma
9
+ from langchain_community.embeddings import HuggingFaceEmbeddings
10
+ from langchain_openai import ChatOpenAI
11
+ from langchain.chains import RetrievalQA
12
+ from langchain_community.document_loaders import UnstructuredURLLoader
13
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
14
+ from langchain_community.vectorstores.utils import filter_complex_metadata
15
+ import smtplib
16
+ from email.mime.text import MIMEText
17
+ from email.mime.multipart import MIMEMultipart
18
+ import logging
19
+ from langchain_community.document_loaders import PyPDFLoader
20
  from typing import List
21
+ from langchain_core.documents import Document
22
+ from langchain_community.document_loaders import PyPDFLoader, WebBaseLoader
23
+ from langchain_unstructured import UnstructuredLoader
24
  from langchain import hub
25
  from langchain.text_splitter import RecursiveCharacterTextSplitter
26
  from langchain_community.vectorstores import Chroma
27
  from langchain.vectorstores import Chroma
 
28
  from langchain_core.output_parsers import StrOutputParser
29
  from langchain_core.runnables import RunnablePassthrough
30
+ import os
31
  import bs4
32
  from sentence_transformers import SentenceTransformer
33
  from langchain_openai import OpenAIEmbeddings, ChatOpenAI
34
  from langchain_huggingface import HuggingFaceEmbeddings
35
  import ollama
36
  from langchain.embeddings import OllamaEmbeddings, HuggingFaceEmbeddings
37
+ from langchain_ollama import OllamaEmbeddings
38
  import numpy as np
39
+ from sklearn.decomposition import PCA
40
+ import matplotlib.pyplot as plt
41
+ import chromadb
42
  import uuid
43
  import os
44
+ from langchain.embeddings import HuggingFaceEmbeddings
+ import re
+ import pandas as pd
45
+ load_dotenv()
46
+
47
+ os.environ['LANGCHAIN_TRACING_V2'] = 'true'
48
+ os.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'
49
+ os.environ['LANGCHAIN_API_KEY']
50
+ os.environ["OPENAI_API_KEY"]
51
+
+ # Embedding model used by the embed_documents() calls below.
+ embeddings_model = HuggingFaceEmbeddings(model_name="HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1.5")
+
52
+ def clean_text(text):
+     '''This function cleans the output of the web loader.'''
+     text = text.replace('\xa0', ' ')
+     text = re.sub(r'[\n\r\t]+', ' ', text)
+     text = re.sub(r'\s+', ' ', text)
+
+     return text.strip()
59
+
60
+ chroma_db_path = "./chroma_db"
61
+ chroma_client = chromadb.PersistentClient(path=chroma_db_path)
62
+
63
+ data = chroma_client.get_collection(name="my_dataaaa")
64
+
65
+ file_path = (
66
+ "Charte.pdf"
67
+ )
68
+ loader = PyPDFLoader(file_path)
69
+ pages = []
70
+ for page in loader.lazy_load():
+     pages.append(page)
+
+ document0 = pages[0].page_content
75
+
76
+
77
+ document0
78
+
79
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
80
+ splits1 = text_splitter.split_text(document0)
81
+
82
+
83
+ splits1
84
+
85
+ embeddings1 = embeddings_model.embed_documents(
86
+ splits1
87
+ # normalize_embeddings=True,
88
+ # batch_size=256,
89
+ # show_progress_bar=True
90
+ )
91
+
92
+
93
+ ids1 = [str(uuid.uuid4()) for _ in range(len(splits1))]
94
+
95
+
96
+
97
+
98
+ data.add(
99
+ documents=splits1,
100
+ embeddings=embeddings1,
101
+ ids=ids1
102
+ )
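The same load → join → split → embed → add sequence is repeated verbatim for every PDF below. A minimal sketch of how it could be factored into a single helper (the name `ingest_pdf` and its signature are illustrative, not part of this commit):

# Sketch: one helper for the repeated PDF ingestion pattern (illustrative only).
import uuid
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def ingest_pdf(file_path, collection, embeddings_model, chunk_size=700, chunk_overlap=100):
    """Load a PDF, split it into chunks, embed the chunks and add them to a Chroma collection."""
    pages = PyPDFLoader(file_path).load()
    full_text = "\n".join(page.page_content for page in pages)
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ".", " "],
    )
    splits = splitter.split_text(full_text)
    embeddings = embeddings_model.embed_documents(splits)
    ids = [str(uuid.uuid4()) for _ in splits]
    collection.add(documents=splits, embeddings=embeddings, ids=ids)
    return splits

# Usage (equivalent to each per-document block below):
# ingest_pdf("circulaire 35-2010.pdf", data, embeddings_model)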
103
+
104
+
105
+ file_path = "circulaire 35-2010.pdf"
106
+ loader = PyPDFLoader(file_path)
107
+ pages = []
108
+ for page in loader.lazy_load():
+     pages.append(page)
+
+ document1 = [page.page_content for page in pages]
114
+
115
+
116
+
117
+ document1
118
+
119
+
120
+
121
+ document1 = "\n".join(document1)
122
+
123
+
124
+
125
+
126
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
127
+ splits2 = text_splitter.split_text(document1)
128
+
129
+
130
+
131
+ splits2
132
+
133
+
134
+
135
+
136
+ embeddings2 = embeddings_model.embed_documents(
137
+ splits2,
138
+ # normalize_embeddings=True,
139
+ # batch_size=256,
140
+ # show_progress_bar=True
141
+ )
142
+
143
+
144
+
145
+
146
+ ids2 = [str(uuid.uuid4()) for _ in range(len(splits2))]
147
+
148
+
149
+
150
+
151
+
152
+ data.add(
153
+ documents=splits2,
154
+ embeddings=embeddings2,
155
+ ids=ids2
156
+ )
157
+
158
+
159
+
160
+ file_path = "Demande de prolongation de stage MP2 Physique.pdf"
161
+ loader = PyPDFLoader(file_path)
162
+ pages = []
163
+ for page in loader.lazy_load():
+     pages.append(page)
+
+ document2 = [page.page_content for page in pages]
168
+
169
+
170
+
171
+
172
+ document2
173
+
174
+
175
+
176
+ document2 = "\n".join(document2)
177
+
178
+
179
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
180
+ splits3 = text_splitter.split_text(document2)
181
+
182
+
183
+
184
+
185
+
186
+ splits3
187
+
188
+
189
+
190
+
191
+
192
+ embeddings3 = embeddings_model.embed_documents(
193
+ splits3,
194
+ # normalize_embeddings=True,
195
+ # batch_size=256,
196
+ # show_progress_bar=True
197
+ )
198
+
199
+
200
+
201
+
202
+
203
+ ids3 = [str(uuid.uuid4()) for _ in range(len(splits3))]
204
+
205
+
206
+
207
+
208
+
209
+ data.add(
210
+ documents=splits3,
211
+ embeddings=embeddings3,
212
+ ids=ids3
213
+ )
214
+
215
+
216
+
217
+
218
+ file_path = "dérogation pdf.pdf"
219
+ loader = PyPDFLoader(file_path)
220
+ pages = []
221
+ for page in loader.lazy_load():
+     pages.append(page)
+
+ document3 = [page.page_content for page in pages]
229
+
230
+
231
+
232
+ document3
233
+
234
+
235
+ document3 = "\n".join(document3)
236
+
237
+
238
+
239
+
240
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
241
+ splits4 = text_splitter.split_text(document3)
242
+
243
+
244
+
245
+
246
+ splits4
247
+
248
+
249
+
250
+ embeddings4 = embeddings_model.embed_documents(
251
+ splits4,
252
+ # normalize_embeddings=True,
253
+ # batch_size=256,
254
+ # show_progress_bar=True
255
+ )
256
+
257
+
258
+
259
+
260
+
261
+ ids4 = [str(uuid.uuid4()) for _ in range(len(splits4))]
262
+
263
+
264
+
265
+
266
+ data.add(
267
+ documents=splits4,
268
+ embeddings=embeddings4,
269
+ ids=ids4
270
+ )
271
+
272
+
273
+
274
+ file_path = "Fiche d'évaluation de stage.pdf"
275
+ loader = PyPDFLoader(file_path)
276
+ pages = []
277
+ for page in loader.lazy_load():
+     pages.append(page)
+
+ document4 = [page.page_content for page in pages]
281
+
282
+
283
+
284
+ document4
285
+
286
+
287
+
288
+
289
+ document4 = "\n".join(document4)
290
+
291
+
292
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
293
+ splits5 = text_splitter.split_text(document4)
294
+
295
+
296
+
297
+
298
+
299
+ splits5
300
+
301
+
302
+
303
+
304
+
305
+ embeddings5 = embeddings_model.embed_documents(
306
+ splits5,
307
+ # normalize_embeddings=True,
308
+ # batch_size=256,
309
+ # show_progress_bar=True
310
+ )
311
+
312
+
313
+ ids5 = [str(uuid.uuid4()) for _ in range(len(splits5))]
314
+
315
+
316
+ data.add(
317
+ documents=splits5,
318
+ embeddings=embeddings5,
319
+ ids=ids5
320
+ )
321
+
322
+
323
+
324
+ file_path = "النظام الداخلي لكلية العلوم بالمنستير.pdf"
325
+ loader = PyPDFLoader(file_path)
326
+ pages = []
327
+ for page in loader.lazy_load():
+     pages.append(page)
+
+ document5 = [page.page_content for page in pages]
335
+
336
+
337
+
338
+
339
+ document5
340
+
341
+
342
+
343
+
344
+ document5 = "\n".join(document5)
345
+
346
+
347
+
348
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
349
+ splits6 = text_splitter.split_text(document5)
350
+
351
+
352
+
353
+ splits6
354
+
355
+
356
+
357
+
358
+ embeddings6 = embeddings_model.embed_documents(
359
+ splits6,
360
+ # normalize_embeddings=True,
361
+ # batch_size=256,
362
+ # show_progress_bar=True
363
+ )
364
+
365
+
366
+
367
+
368
+ ids6 = [str(uuid.uuid4()) for _ in range(len(splits6))]
369
+
370
+
371
+
372
+ data.add(
373
+ documents=splits6,
374
+ embeddings=embeddings6,
375
+ ids=ids6
376
+ )
377
+
378
+
379
+ file_path = "sante_mentale.pdf"
380
+ loader = PyPDFLoader(file_path)
381
+ pages = []
382
+ for page in loader.lazy_load():
+     pages.append(page)
+
+ document6 = [page.page_content for page in pages]
388
+
389
+
390
+ document6
391
+
392
+
393
+
394
+ document6 = "\n".join(document6)
395
+
396
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
397
+ splits7 = text_splitter.split_text(document6)
398
+
399
+
400
+ splits7
401
+
402
+ embeddings7 = embeddings_model.embed_documents(
403
+ splits7,
404
+ # normalize_embeddings=True,
405
+ # batch_size=256,
406
+ # show_progress_bar=True
407
+ )
408
+
409
+
410
+
411
+ ids7 = [str(uuid.uuid4()) for _ in range(len(splits7))]
412
+
413
+
414
+ data.add(
415
+ documents=splits7,
416
+ embeddings=embeddings7,
417
+ ids=ids7
418
+ )
419
+
420
+
421
+
422
+ file_path = "sante_mentale2.pdf"
423
+ loader = PyPDFLoader(file_path)
424
+ pages = []
425
+ for page in loader.lazy_load():
+     pages.append(page)
+
+ document7 = [page.page_content for page in pages]
433
+
434
+
435
+
436
+
437
+
438
+ document7
439
+
440
+
441
+
442
+
443
+ document7 = "\n".join(document7)
444
+
445
+
446
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
447
+ splits8 = text_splitter.split_text(document7)
448
+
449
+
450
+
451
+ splits8
452
+
453
+
454
+ embeddings8 = embeddings_model.embed_documents(
455
+ splits8,
456
+ # normalize_embeddings=True,
457
+ # batch_size=256,
458
+ # show_progress_bar=True
459
+ )
460
+
461
+
462
+ ids8 = [str(uuid.uuid4()) for _ in range(len(splits8))]
463
+
464
+
465
+
466
+
467
+ data.add(
468
+ documents=splits8,
469
+ embeddings=embeddings8,
470
+ ids=ids8
471
+ )
472
+
473
+
474
+
475
+ file_path = "score_pour_mastere.pdf"
476
+ loader = PyPDFLoader(file_path)
477
+ pages = []
478
+ for page in loader.lazy_load():
+     pages.append(page)
+
+ # In[99]:
+
+ document8 = [page.page_content for page in pages]
486
+
487
+
488
+ # In[100]:
489
+
490
+
491
+ document8
492
+
493
+
494
+ # # splitting DOC8 into chunks
495
+
496
+ # In[102]:
497
+
498
+
499
+ document8 = "\n".join(document8)
500
+
501
+
502
+ # In[103]:
503
+
504
+
505
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
506
+ splits9 = text_splitter.split_text(document8)
507
+
508
+
509
+ # In[104]:
510
+
511
+
512
+ splits9
513
+
514
+
515
+ # In[105]:
516
+
517
+
518
+ embeddings9 = embeddings_model.embed_documents(
519
+ splits9,
520
+ # normalize_embeddings=True,
521
+ # batch_size=256,
522
+ # show_progress_bar=True
523
+ )
524
+
525
+
526
+ # In[106]:
527
+
528
+
529
+ ids9 = [str(uuid.uuid4()) for _ in range(len(splits9))]
530
+
531
+
532
+ # In[107]:
533
+
534
+
535
+ data.add(
536
+ documents=splits9,
537
+ embeddings=embeddings9,
538
+ ids=ids9
539
+ )
540
+
541
+
542
+ # # Master RECHERCHE
543
+
544
+ # # Document 9 Recherche chimie
545
+
546
+ # In[110]:
547
+
548
+
549
+ file_path = "recherche_chimie.pdf"
550
+ loader = PyPDFLoader(file_path)
551
+ pages = []
552
+ for page in loader.lazy_load():
+     pages.append(page)
+
+ # In[111]:
+
+ document9 = [page.page_content for page in pages]
560
+
561
+
562
+ # In[112]:
563
+
564
+
565
+ document9
566
+
567
+
568
+ # # splitting DOC9 into chunks
569
+
570
+ # In[114]:
571
+
572
+
573
+ document9= "\n".join(document9)
574
+
575
+
576
+ # In[115]:
577
+
578
+
579
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
580
+ splits10 = text_splitter.split_text(document9)
581
+
582
+
583
+ # In[116]:
584
+
585
+
586
+ splits10
587
+
588
+
589
+ # In[117]:
590
+
591
+
592
+ embeddings10 = embeddings_model.embed_documents(
593
+ splits10,
594
+ # normalize_embeddings=True,
595
+ # batch_size=256,
596
+ # show_progress_bar=True
597
+ )
598
+
599
+
600
+ # In[118]:
601
+
602
+
603
+ ids10 = [str(uuid.uuid4()) for _ in range(len(splits10))]
604
+
605
+
606
+ # In[119]:
607
+
608
+
609
+ data.add(
610
+ documents=splits10,
611
+ embeddings=embeddings10,
612
+ ids=ids10
613
+ )
614
+
615
+
616
+ # # Document 10 Recherche info
617
+
618
+ # In[121]:
619
+
620
+
621
+ file_path = "recherche_info.pdf"
622
+ loader = PyPDFLoader(file_path)
623
+ pages = []
624
+ for page in loader.lazy_load():
+     pages.append(page)
+
+ # In[122]:
+
+ document10 = [page.page_content for page in pages]
632
+
633
+
634
+ # In[123]:
635
+
636
+
637
+ document10
638
+
639
+
640
+ # # splitting DOC10 into chunks
641
+
642
+ # In[125]:
643
+
644
+
645
+ document10= "\n".join(document10)
646
+
647
+
648
+ # In[126]:
649
+
650
+
651
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
652
+ splits11 = text_splitter.split_text(document10)
653
+
654
+
655
+ # In[127]:
656
+
657
+
658
+ splits11
659
+
660
+
661
+ # In[128]:
662
+
663
+
664
+ embeddings11 = embeddings_model.embed_documents(
665
+ splits11,
666
+ # normalize_embeddings=True,
667
+ # batch_size=256,
668
+ # show_progress_bar=True
669
+ )
670
+
671
+
672
+ # In[129]:
673
+
674
+
675
+ ids11 = [str(uuid.uuid4()) for _ in range(len(splits11))]
676
+
677
+
678
+ # In[130]:
679
+
680
+
681
+ data.add(
682
+ documents=splits11,
683
+ embeddings=embeddings11,
684
+ ids=ids11
685
+ )
686
+
687
+
688
+ # # Document 11 Recherche physique
689
+
690
+ # In[132]:
691
+
692
+
693
+ file_path = "recherche_phy.pdf"
694
+ loader = PyPDFLoader(file_path)
695
+ pages = []
696
+ for page in loader.lazy_load():
+     pages.append(page)
+
+ # In[133]:
+
+ document11 = [page.page_content for page in pages]
704
+
705
+
706
+ # In[134]:
707
+
708
+
709
+ document11
710
+
711
+
712
+ # # splitting DOC11 into chunks
713
+
714
+ # In[136]:
715
+
716
+
717
+ document11= "\n".join(document11)
718
+
719
+
720
+ # In[137]:
721
+
722
+
723
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
724
+ splits12 = text_splitter.split_text(document11)
725
+
726
+
727
+ # In[138]:
728
+
729
+
730
+ splits12
731
+
732
+
733
+ # In[139]:
734
+
735
+
736
+ embeddings12 = embeddings_model.embed_documents(
737
+ splits12,
738
+ # normalize_embeddings=True,
739
+ # batch_size=256,
740
+ # show_progress_bar=True
741
+ )
742
+
743
+
744
+ # In[140]:
745
+
746
+
747
+ ids12 = [str(uuid.uuid4()) for _ in range(len(splits12))]
748
+
749
+
750
+ # In[141]:
751
+
752
+
753
+ data.add(
754
+ documents=splits12,
755
+ embeddings=embeddings12,
756
+ ids=ids12
757
+ )
758
+
759
+
760
+ # # Mastere Pro
761
+
762
+ # # Document 12 PRO chimie
763
+
764
+ # In[144]:
765
+
766
+
767
+ file_path = "pro_chimie.pdf"
768
+ loader = PyPDFLoader(file_path)
769
+ pages = []
770
+ for page in loader.lazy_load():
+     pages.append(page)
+
+ # In[145]:
+
+ document12 = [page.page_content for page in pages]
778
+
779
+
780
+ # In[146]:
781
+
782
+
783
+ document12
784
+
785
+
786
+ # # splitting DOC 12 into chunks
787
+
788
+ # In[148]:
789
+
790
+
791
+ document12= "\n".join(document12)
792
+
793
+
794
+ # In[149]:
795
+
796
+
797
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
798
+ splits13= text_splitter.split_text(document12)
799
+
800
+
801
+ # In[150]:
802
+
803
+
804
+ splits13
805
+
806
+
807
+ # In[151]:
808
+
809
+
810
+ embeddings13 = embeddings_model.embed_documents(
811
+ splits13,
812
+ # normalize_embeddings=True,
813
+ # batch_size=256,
814
+ # show_progress_bar=True
815
+ )
816
+
817
+
818
+ # In[152]:
819
+
820
+
821
+ ids13 = [str(uuid.uuid4()) for _ in range(len(splits13))]
822
+
823
+
824
+ # In[153]:
825
+
826
+
827
+ data.add(
828
+ documents=splits13,
829
+ embeddings=embeddings13,
830
+ ids=ids13
831
+ )
832
+
833
+
834
+ # # Document 13 PRO info
835
+
836
+ # In[155]:
837
+
838
+
839
+ file_path = "pro_info.pdf"
840
+ loader = PyPDFLoader(file_path)
841
+ pages = []
842
+ for page in loader.lazy_load():
+     pages.append(page)
+
+ # In[156]:
+
+ document13 = [page.page_content for page in pages]
850
+
851
+
852
+ # In[157]:
853
+
854
+
855
+ document13
856
+
857
+
858
+ # # splitting DOC 13 into chunks
859
+
860
+ # In[159]:
861
+
862
+
863
+ document13= "\n".join(document13)
864
+
865
+
866
+ # In[160]:
867
+
868
+
869
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
870
+ splits14= text_splitter.split_text(document13)
871
+
872
+
873
+ # In[161]:
874
+
875
+
876
+ splits14
877
+
878
+
879
+ # In[162]:
880
+
881
+
882
+ embeddings14 = embeddings_model.embed_documents(
883
+ splits14,
884
+ # normalize_embeddings=True,
885
+ # batch_size=256,
886
+ # show_progress_bar=True
887
+ )
888
+
889
+
890
+ # In[163]:
891
+
892
+
893
+ ids14 = [str(uuid.uuid4()) for _ in range(len(splits14))]
894
+
895
+
896
+ # In[164]:
897
+
898
+
899
+ data.add(
900
+ documents=splits14,
901
+ embeddings=embeddings14,
902
+ ids=ids14
903
+ )
904
+
905
+
906
+ # # Document 14: doing two internships at the same time
907
+
908
+ # In[166]:
909
+
910
+
911
+ file_path = "deux_stage_.pdf"
912
+ loader = PyPDFLoader(file_path)
913
+ pages = []
914
+ for page in loader.lazy_load():
+     pages.append(page)
+
+ # In[167]:
+
+ document14 = [page.page_content for page in pages]
922
+
923
+
924
+ # In[168]:
925
+
926
+
927
+ document14
928
+
929
+
930
+ # # splitting DOC14 INTO chunks
931
+
932
+ # In[170]:
933
+
934
+
935
+ document14= "\n".join(document14)
936
+
937
+
938
+ # In[171]:
939
+
940
+
941
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
942
+ splits15= text_splitter.split_text(document14)
943
+
944
+
945
+ # In[172]:
946
+
947
+
948
+ splits15
949
+
950
+
951
+ # In[173]:
952
+
953
+
954
+ embeddings15= embeddings_model.embed_documents(
955
+ splits15,
956
+ # normalize_embeddings=True,
957
+ # batch_size=256,
958
+ # show_progress_bar=True
959
+ )
960
+
961
+
962
+ # In[174]:
963
+
964
+
965
+ ids15 = [str(uuid.uuid4()) for _ in range(len(splits15))]
966
+
967
+
968
+ # In[175]:
969
+
970
+
971
+ data.add(
972
+ documents=splits15,
973
+ embeddings=embeddings15,
974
+ ids=ids15
975
+ )
976
+
977
+
978
+ # # Document 15: questions with answers
979
+
980
+ # In[177]:
981
+
982
+
983
+ file_path = "Les avantages de la carte étudiante.pdf"
984
+ loader = PyPDFLoader(file_path)
985
+ pages = []
986
+ for page in loader.lazy_load():
+     pages.append(page)
+
+ # In[178]:
+
+ document15 = [page.page_content for page in pages]
994
+
995
+
996
+ # In[179]:
997
+
998
+
999
+ document15
1000
+
1001
+
1002
+ # # Splitting DOC15 into chunks
1003
+
1004
+ # In[181]:
1005
+
1006
+
1007
+ document15= "\n".join(document15)
1008
+
1009
+
1010
+ # In[182]:
1011
+
1012
+
1013
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50, separators=["\n\n", "\n", ".", " ", "\n•"])
1014
+ splits16= text_splitter.split_text(document15)
1015
+
1016
+
1017
+ # In[183]:
1018
+
1019
+
1020
+ splits16
1021
+
1022
+
1023
+ # In[184]:
1024
+
1025
+
1026
+ embeddings16 = embeddings_model.embed_documents(
1027
+ splits16,
1028
+ # normalize_embeddings=True,
1029
+ # batch_size=256,
1030
+ # show_progress_bar=True
1031
+ )
1032
+
1033
+
1034
+ # In[185]:
1035
+
1036
+
1037
+ ids16 = [str(uuid.uuid4()) for _ in range(len(splits16))]
1038
+
1039
+
1040
+ # In[186]:
1041
+
1042
+
1043
+ data.add(
1044
+ documents=splits16,
1045
+ embeddings=embeddings16,
1046
+ ids=ids16
1047
+ )
1048
+
1049
+
1050
+ # # Checking whether the data was added ✅
1051
+
1052
+ # In[188]:
1053
+
1054
+
1055
+ collection_dump = data.get(include=['embeddings'])
+ print(collection_dump)
1057
+
1058
+ # embeddings_model = SentenceTransformer("HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1.5")
1059
+ embeddings_model = HuggingFaceEmbeddings(model_name="HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1.5")
1060
+
1061
+
1062
+ # # Configure `ChromaDB` for our work
1063
+
1064
+ # In[29]:
1065
+
1066
+
1067
+ # chroma_client.delete_collection(name="my_dataaaa") # Deletes "my_dataaaa"
1068
+
1069
+
1070
+ # In[30]:
1071
+
1072
+
1073
+ chroma_db_path = "./chroma_db"
1074
+ chroma_client = chromadb.PersistentClient(path=chroma_db_path)
1075
+
1076
+
1077
+ # In[31]:
1078
+
1079
+
1080
+ data = chroma_client.get_or_create_collection(name="my_dataaaa")
1081
+
1082
+
1083
+ # # <p style="color: orange;">Document 0 Masteres-Procedure-de-Depot</p>
1084
+
1085
+ # In[33]:
1086
+
1087
+
1088
+ loader = WebBaseLoader(
1089
+ web_paths=("https://fsm.rnu.tn/fra/pages/152/Masteres-Procedure-de-Depot",),
1090
+ bs_kwargs=dict(
1091
+ parse_only=bs4.SoupStrainer(
1092
+ class_=("content")
1093
+ )
1094
+ ),
1095
+ )
1096
+ Masteres_Procedure_de_Depot = loader.load()
1097
+
1098
+
1099
+ # In[34]:
1100
+
1101
+
1102
+ Masteres_Procedure_de_Depot = [
1103
+ Document(page_content=clean_text(doc.page_content), metadata=doc.metadata)
1104
+ for doc in Masteres_Procedure_de_Depot]
1105
+ Masteres_Procedure_de_Depot
1106
+
1107
+
1108
+ # ## spliiting into chunks the doc0
1109
+
1110
+ # In[36]:
1111
+
1112
+
1113
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100)
1114
+ splits1 = text_splitter.split_documents( Masteres_Procedure_de_Depot)
1115
+
1116
+
1117
+ # In[37]:
1118
+
1119
+
1120
+ splits1
1121
+
1122
+
1123
+ # ## Saving to chromadb in data
1124
+
1125
+ # In[39]:
1126
+
1127
+
1128
+ contents1 = [doc.page_content for doc in splits1]
1129
+ metadata1 = [doc.metadata for doc in splits1]
1130
+
1131
+
1132
+ # In[40]:
1133
+
1134
+
1135
+ embeddings1 = embeddings_model.embed_documents(
1136
+ [doc.page_content for doc in splits1],
1137
+ #normalize_embeddings=True,
1138
+ #batch_size=256,
1139
+ #show_progress_bar=True
1140
+ )
1141
+ print(embeddings1)
1142
+
1143
+
1144
+ # In[41]:
1145
+
1146
+
1147
+ ids = [str(uuid.uuid4()) for _ in range(len(contents1))]
1148
+
1149
+
1150
+ # In[42]:
1151
+
1152
+
1153
+ data.add(
1154
+ documents=contents1,
1155
+ embeddings=embeddings1,
1156
+ metadatas=metadata1,
1157
+ ids=ids
1158
+ )
1159
+
1160
+
1161
+ # In[43]:
1162
+
1163
+
1164
+ # visualizing in a dataframe
1165
+ data_dict = {
1166
+ "ID": ids,
1167
+ "Document": contents1,
1168
+ "Metadata": metadata1,
1169
+ "Embedding Shape": [np.array(embed).shape for embed in embeddings1],
1170
+ }
1171
+
1172
+ df = pd.DataFrame(data_dict)
1173
+ df.tail()
1174
+
1175
+
1176
+ # In[44]:
1177
+
1178
+
1179
+ def append_data(contents, metadata, embeddings):
1180
+     '''Append the documents, metadata and embeddings to data_dict
+     so we can visualize what ends up in the Chroma collection.'''
1182
+ global df
1183
+ new_ids = list(range(len(df) + 1, len(df) + 1 + len(contents)))
1184
+
1185
+ data_dict["ID"].extend(new_ids)
1186
+ data_dict["Document"].extend(contents)
1187
+ data_dict["Metadata"].extend(metadata)
1188
+ data_dict["Embedding Shape"].extend([np.array(embed).shape for embed in embeddings])
1189
+
1190
+ df = pd.DataFrame(data_dict)
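Every web page that follows goes through the same WebBaseLoader → clean_text → split → embed → add → append_data sequence. A minimal sketch of a helper wrapping that pattern, relying on the imports and helpers already defined in this file (the name `ingest_web_page` and its signature are illustrative, not part of this commit):

# Sketch: one helper for the repeated web-page ingestion pattern (illustrative only).
def ingest_web_page(url, collection, embeddings_model, css_class="content"):
    """Scrape one page, clean and chunk it, embed the chunks and store them in Chroma."""
    loader = WebBaseLoader(
        web_paths=(url,),
        bs_kwargs=dict(parse_only=bs4.SoupStrainer(class_=(css_class))),
    )
    docs = [
        Document(page_content=clean_text(doc.page_content), metadata=doc.metadata)
        for doc in loader.load()
    ]
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "]
    )
    splits = splitter.split_documents(docs)
    contents = [doc.page_content for doc in splits]
    metadata = [doc.metadata for doc in splits]
    embeddings = embeddings_model.embed_documents(contents)
    ids = [str(uuid.uuid4()) for _ in contents]
    collection.add(documents=contents, embeddings=embeddings, metadatas=metadata, ids=ids)
    append_data(contents, metadata, embeddings)
    return splits

# Usage (equivalent to each per-page block below):
# ingest_web_page("https://fsm.rnu.tn/fra/pages/147/Theses-Inscriptions-etProcedure-de-Depot", data, embeddings_model)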
1191
+
1192
+
1193
+ # # <p style="color: orange;">Document 1 Theses-Inscriptions-etProcedure-de-Depot</p>
1194
+
1195
+ # In[46]:
1196
+
1197
+
1198
+ loader = WebBaseLoader(
1199
+ web_paths=("https://fsm.rnu.tn/fra/pages/147/Theses-Inscriptions-etProcedure-de-Depot",),
1200
+ bs_kwargs=dict(
1201
+ parse_only=bs4.SoupStrainer(
1202
+ class_=("content")
1203
+ )
1204
+ ),
1205
+ )
1206
+ Theses_Inscriptions_etProcedure_de_Depot = loader.load()
1207
+
1208
+
1209
+ # In[47]:
1210
+
1211
+
1212
+ Theses_Inscriptions_etProcedure_de_Depot = [
1213
+ Document(page_content=clean_text(doc.page_content), metadata=doc.metadata)
1214
+ for doc in Theses_Inscriptions_etProcedure_de_Depot]
1215
+ Theses_Inscriptions_etProcedure_de_Depot
1216
+
1217
+
1218
+ # ## splitting into chunks the doc1
1219
+
1220
+ # In[49]:
1221
+
1222
+
1223
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
1224
+ splits2 = text_splitter.split_documents( Theses_Inscriptions_etProcedure_de_Depot)
1225
+
1226
+
1227
+ # In[50]:
1228
+
1229
+
1230
+ splits2
1231
+
1232
+
1233
+ # In[51]:
1234
+
1235
+
1236
+ contents2= [doc.page_content for doc in splits2]
1237
+ metadata2 = [doc.metadata for doc in splits2]
1238
+
1239
+
1240
+ # In[52]:
1241
+
1242
+
1243
+ embeddings2 = embeddings_model.embed_documents(
1244
+ [doc.page_content for doc in splits2],
1245
+ # normalize_embeddings=True,
1246
+ # batch_size=256,
1247
+ # show_progress_bar=True
1248
+ )
1249
+ print(embeddings2)
1250
+
1251
+
1252
+ # In[53]:
1253
+
1254
+
1255
+ ids2= [str(uuid.uuid4()) for _ in range(len(contents2))]
1256
+
1257
+
1258
+ # In[54]:
1259
+
1260
+
1261
+ data.add(
1262
+ documents=contents2,
1263
+ embeddings=embeddings2,
1264
+ metadatas=metadata2,
1265
+ ids=ids2
1266
+ )
1267
+
1268
+
1269
+ # In[55]:
1270
+
1271
+
1272
+ append_data(contents2, metadata2, embeddings2)
1273
+
1274
+
1275
+ # In[56]:
1276
+
1277
+
1278
+ df
1279
+
1280
+
1281
+ # # <p style="color: orange;"> Document 2 رشة_بعنوان_أهمية_الصحة_النفسية</p>
1282
+
1283
+ # In[58]:
1284
+
1285
+
1286
+ loader = WebBaseLoader(
1287
+ web_paths=("https://fsm.rnu.tn/fra/articles/4798/%D9%88%D8%B1%D8%B4%D8%A9-%D8%A8%D8%B9%D9%86%D9%88%D8%A7%D9%86-%D8%A3%D9%87%D9%85%D9%8A%D8%A9-%D8%A7%D9%84%D8%B5%D8%AD%D8%A9-%D8%A7%D9%84%D9%86%D9%81%D8%B3%D9%8A%D8%A9",),
1288
+ bs_kwargs=dict(
1289
+ parse_only=bs4.SoupStrainer(
1290
+ class_=("content")
1291
+ )
1292
+ ),
1293
+ )
1294
+ warcha_mental_health = loader.load()
1295
+
1296
+
1297
+ # In[59]:
1298
+
1299
+
1300
+ warcha_mental_health = [
1301
+ Document(page_content=clean_text(doc.page_content), metadata=doc.metadata)
1302
+ for doc in warcha_mental_health]
1303
+ warcha_mental_health
1304
+
1305
+
1306
+ # ## spitting doc 2 into chunks
1307
+
1308
+ # In[61]:
1309
+
1310
+
1311
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
1312
+ splits3 = text_splitter.split_documents( warcha_mental_health)
1313
+
1314
+
1315
+ # In[62]:
1316
+
1317
+
1318
+ splits3
1319
+
1320
+
1321
+ # In[63]:
1322
+
1323
+
1324
+ contents3= [doc.page_content for doc in splits3]
1325
+ metadata3 = [doc.metadata for doc in splits3]
1326
+
1327
+
1328
+ # In[64]:
1329
+
1330
+
1331
+ embeddings3 = embeddings_model.embed_documents(
1332
+ [doc.page_content for doc in splits3],
1333
+ # normalize_embeddings=True,
1334
+ # batch_size=256,
1335
+ # show_progress_bar=True
1336
+ )
1337
+ print(embeddings3)
1338
+
1339
+
1340
+ # In[65]:
1341
+
1342
+
1343
+ ids3 = [str(uuid.uuid4()) for _ in range(len(contents3))]
1344
+
1345
+
1346
+ # In[66]:
1347
+
1348
+
1349
+ data.add(
1350
+ documents=contents3,
1351
+ embeddings=embeddings3,
1352
+ metadatas=metadata3,
1353
+ ids=ids3
1354
+ )
1355
+
1356
+
1357
+ # In[67]:
1358
+
1359
+
1360
+ append_data(contents3, metadata3, embeddings3)
1361
+
1362
+
1363
+ # In[68]:
1364
+
1365
+
1366
+ df.tail()
1367
+
1368
+
1369
+ # # <p style="color: orange;"> Document 3 festival-de-la-creativite-estudiantine</p>
1370
+
1371
+ # In[70]:
1372
+
1373
+
1374
+ loader = WebBaseLoader(
1375
+ web_paths=("https://fsm.rnu.tn/fra/articles/4795/festival-de-la-creativite-estudiantine",),
1376
+ bs_kwargs=dict(
1377
+ parse_only=bs4.SoupStrainer(
1378
+ class_=("content")
1379
+ )
1380
+ ),
1381
+ )
1382
+ festival_de_la_creativite_estudiantinet = loader.load()
1383
+
1384
+
1385
+ # In[71]:
1386
+
1387
+
1388
+ festival_de_la_creativite_estudiantinet = [
1389
+ Document(page_content=clean_text(doc.page_content), metadata=doc.metadata)
1390
+ for doc in festival_de_la_creativite_estudiantinet]
1391
+ festival_de_la_creativite_estudiantinet
1392
+
1393
+
1394
+ # ## splitting the Doc3 into chunks
1395
+
1396
+ # In[73]:
1397
+
1398
+
1399
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
1400
+ splits4 = text_splitter.split_documents( festival_de_la_creativite_estudiantinet)
1401
+
1402
+
1403
+ # In[74]:
1404
+
1405
+
1406
+ print(splits4[0].page_content) # First chunk's content
1407
+ print(splits4[0].metadata)
1408
+
1409
+
1410
+ # In[75]:
1411
+
1412
+
1413
+ contents4= [doc.page_content for doc in splits4]
1414
+ metadata4 = [doc.metadata for doc in splits4]
1415
+
1416
+
1417
+ # In[76]:
1418
+
1419
+
1420
+ embeddings4 = embeddings_model.embed_documents(
1421
+ [doc.page_content for doc in splits4],
1422
+ # normalize_embeddings=True,
1423
+ # batch_size=256,
1424
+ # show_progress_bar=True
1425
+ )
1426
+ print(embeddings4)
1427
+
1428
+
1429
+ # In[77]:
1430
+
1431
+
1432
+ ids4 = [str(uuid.uuid4()) for _ in range(len(contents4))]
1433
+
1434
+
1435
+ # In[78]:
1436
+
1437
+
1438
+ data.add(
1439
+ documents=contents4,
1440
+ embeddings=embeddings4,
1441
+ metadatas=metadata4,
1442
+ ids=ids4
1443
+ )
1444
+
1445
+
1446
+ # In[79]:
1447
+
1448
+
1449
+ append_data(contents4, metadata4, embeddings4)
1450
+
1451
+
1452
+ # In[80]:
1453
+
1454
+
1455
+ df
1456
+
1457
+
1458
+ # # <p style="color: orange;"> Document 4 bourses-d-alternance-2025</p>
1459
+
1460
+ # In[82]:
1461
+
1462
+
1463
+ loader = WebBaseLoader(
1464
+ web_paths=("https://fsm.rnu.tn/fra/articles/4813/bourses-d-alternance-2025",),
1465
+ bs_kwargs=dict(
1466
+ parse_only=bs4.SoupStrainer(
1467
+ class_=("content")
1468
+ )
1469
+ ),
1470
+ )
1471
+ Bourse_alternance = loader.load()
1472
+
1473
+
1474
+ # In[83]:
1475
+
1476
+
1477
+ Bourse_alternance = [
1478
+ Document(page_content=clean_text(doc.page_content), metadata=doc.metadata)
1479
+ for doc in Bourse_alternance]
1480
+ Bourse_alternance
1481
+
1482
+
1483
+ # ## splitting doc 4 into chunks
1484
+
1485
+ # In[85]:
1486
+
1487
+
1488
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
1489
+ splits5 = text_splitter.split_documents( Bourse_alternance)
1490
+
1491
+
1492
+ # In[86]:
1493
+
1494
+
1495
+ print(splits5[2].page_content)
1496
+ print(splits5[2].metadata)
1497
+
1498
+
1499
+ # In[87]:
1500
+
1501
+
1502
+ contents5= [doc.page_content for doc in splits5]
1503
+ metadata5 = [doc.metadata for doc in splits5]
1504
+
1505
+
1506
+ # In[88]:
1507
+
1508
+
1509
+ embeddings5 = embeddings_model.embed_documents(
1510
+ [doc.page_content for doc in splits5],
1511
+ # normalize_embeddings=True,
1512
+ # batch_size=256,
1513
+ # show_progress_bar=True
1514
+ )
1515
+ print(embeddings5)
1516
+
1517
+
1518
+ # In[89]:
1519
+
1520
+
1521
+ ids5 = [str(uuid.uuid4()) for _ in range(len(contents5))]
1522
+
1523
+
1524
+ # In[90]:
1525
+
1526
+
1527
+ data.add(
1528
+ documents=contents5,
1529
+ embeddings=embeddings5,
1530
+ metadatas=metadata5,
1531
+ ids=ids5
1532
+ )
1533
+
1534
+
1535
+ # In[91]:
1536
+
1537
+
1538
+ append_data(contents5, metadata5, embeddings5)
1539
+
1540
+
1541
+ # In[92]:
1542
+
1543
+
1544
+ df
1545
+
1546
+
1547
+ # # <p style="color: orange;"> Document 5 the-indian-council-for-cultural-relations--iccr</p>
1548
+
1549
+ # In[94]:
1550
+
1551
+
1552
+ loader = WebBaseLoader(
1553
+ web_paths=("https://fsm.rnu.tn/fra/articles/4807/the-indian-council-for-cultural-relations--iccr-",),
1554
+ bs_kwargs=dict(
1555
+ parse_only=bs4.SoupStrainer(
1556
+ class_=("content")
1557
+ )
1558
+ ),
1559
+ )
1560
+ the_indian_council_for_cultural_relations = loader.load()
1561
+
1562
+
1563
+ # In[95]:
1564
+
1565
+
1566
+ the_indian_council_for_cultural_relations = [
1567
+ Document(page_content=clean_text(doc.page_content), metadata=doc.metadata)
1568
+ for doc in the_indian_council_for_cultural_relations]
1569
+ the_indian_council_for_cultural_relations
1570
+
1571
+
1572
+ # ## splitting doc 5 into chunks
1573
+
1574
+ # In[97]:
1575
+
1576
+
1577
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
1578
+ splits6 = text_splitter.split_documents( the_indian_council_for_cultural_relations)
1579
+
1580
+
1581
+ # In[98]:
1582
+
1583
+
1584
+ splits6
1585
+
1586
+
1587
+ # In[99]:
1588
+
1589
+
1590
+ contents6= [doc.page_content for doc in splits6]
1591
+ metadata6 = [doc.metadata for doc in splits6]
1592
+
1593
+
1594
+ # In[100]:
1595
+
1596
+
1597
+ embeddings6 = embeddings_model.embed_documents(
1598
+ [doc.page_content for doc in splits6],
1599
+ # normalize_embeddings=True,
1600
+ # batch_size=256,
1601
+ # show_progress_bar=True
1602
+ )
1603
+ print(embeddings6)
1604
+
1605
+
1606
+ # In[101]:
1607
+
1608
+
1609
+ ids6 = [str(uuid.uuid4()) for _ in range(len(contents6))]
1610
+
1611
+
1612
+ # In[102]:
1613
+
1614
+
1615
+ data.add(
1616
+ documents=contents6,
1617
+ embeddings=embeddings6,
1618
+ metadatas=metadata6,
1619
+ ids=ids6
1620
+ )
1621
+
1622
+
1623
+ # In[103]:
1624
+
1625
+
1626
+ append_data(contents6, metadata6, embeddings6)
1627
+
1628
+
1629
+ # In[104]:
1630
+
1631
+
1632
+ df
1633
+
1634
+
1635
+ # In[105]:
1636
+
1637
+
1638
+ # page_url = "https://fsm.rnu.tn/useruploads/files/au2425/NV%20ICCR.pdf"
1639
+ # loader = PyPDFLoader(page_url)
1640
+
1641
+ # applications_guidelines_indian = []
1642
+ # async for doc in loader.alazy_load():
1643
+ # applications_guidelines_indian.append(doc)
1644
+
1645
+
1646
+ # In[106]:
1647
+
1648
+
1649
+ # applications_guidelines_indian
1650
+
1651
+
1652
+ # In[107]:
1653
+
1654
+
1655
+ # documents6
1656
+
1657
+
1658
+ # In[108]:
1659
+
1660
+
1661
+ # pip install "unstructured[pdf]"
1662
+
1663
+
1664
+ # # <p style="color: orange;"> Document 6 Règlement intérieur des examens</p>
1665
+
1666
+ # In[110]:
1667
+
1668
+
1669
+ loader = WebBaseLoader(
1670
+ web_paths=("https://fsm.rnu.tn/fra/pages/346/R%C3%A8glement-int%C3%A9rieur-des-examens",),
1671
+ bs_kwargs=dict(
1672
+ parse_only=bs4.SoupStrainer(
1673
+ class_=("content")
1674
+ )
1675
+ ),
1676
+ )
1677
+ Règlement_intérieur_des_examens = loader.load()
1678
+
1679
+
1680
+ # In[111]:
1681
+
1682
+
1683
+ Règlement_intérieur_des_examens = [
1684
+ Document(page_content=clean_text(doc.page_content), metadata=doc.metadata)
1685
+ for doc in Règlement_intérieur_des_examens]
1686
+ Règlement_intérieur_des_examens
1687
+
1688
+
1689
+ # ## splitting doc 6 into chunks
1690
+
1691
+ # In[113]:
1692
+
1693
+
1694
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
1695
+ splits7 = text_splitter.split_documents( Règlement_intérieur_des_examens)
1696
+
1697
+
1698
+ # In[114]:
1699
+
1700
+
1701
+ splits7
1702
+
1703
+
1704
+ # In[115]:
1705
+
1706
+
1707
+ contents7= [doc.page_content for doc in splits7]
1708
+ metadata7 = [doc.metadata for doc in splits7]
1709
+
1710
+
1711
+ # In[116]:
1712
+
1713
+
1714
+ embeddings7 = embeddings_model.embed_documents(
1715
+ [doc.page_content for doc in splits7],
1716
+ # normalize_embeddings=True,
1717
+ # batch_size=256,
1718
+ # show_progress_bar=True
1719
+ )
1720
+ print(embeddings7)
1721
+
1722
+
1723
+ # In[117]:
1724
+
1725
+
1726
+ ids7 = [str(uuid.uuid4()) for _ in range(len(contents7))]
1727
+
1728
+
1729
+ # In[118]:
1730
+
1731
+
1732
+ data.add(
1733
+ documents=contents7,
1734
+ embeddings=embeddings7,
1735
+ metadatas=metadata7,
1736
+ ids=ids7
1737
+ )
1738
+
1739
+
1740
+ # In[119]:
1741
+
1742
+
1743
+ append_data(contents7, metadata7, embeddings7)
1744
+
1745
+
1746
+ # In[120]:
1747
+
1748
+
1749
+ df
1750
+
1751
+
1752
+ # # <p style="color: orange;">Document 7 Gestion des Stages & PFE (CPE-BR-01-00)</p>
1753
+
1754
+ # In[122]:
1755
+
1756
+
1757
+ loader = WebBaseLoader(
1758
+ web_paths=("https://fsm.rnu.tn/fra/pages/73/Stages-&-PFE",),
1759
+ bs_kwargs=dict(
1760
+ parse_only=bs4.SoupStrainer(
1761
+ class_=("content")
1762
+ )
1763
+ ),
1764
+ )
1765
+ Stages_PFE = loader.load()
1766
+
1767
+
1768
+ # In[123]:
1769
+
1770
+
1771
+ Stages_PFE = [
1772
+ Document(page_content=clean_text(doc.page_content), metadata=doc.metadata)
1773
+ for doc in Stages_PFE]
1774
+ Stages_PFE
1775
+
1776
+
1777
+ # ## splitting doc 7 into chunks
1778
+
1779
+ # In[125]:
1780
+
1781
+
1782
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
1783
+ splits8 = text_splitter.split_documents( Stages_PFE)
1784
+
1785
+
1786
+ # In[126]:
1787
+
1788
+
1789
+ splits8
1790
+
1791
+
1792
+ # In[127]:
1793
+
1794
+
1795
+ contents8= [doc.page_content for doc in splits8]
1796
+ metadata8 = [doc.metadata for doc in splits8]
1797
+
1798
+
1799
+ # In[128]:
1800
+
1801
+
1802
+ embeddings8= embeddings_model.embed_documents(
1803
+ [doc.page_content for doc in splits8],
1804
+ # normalize_embeddings=True,
1805
+ # batch_size=256,
1806
+ # show_progress_bar=True
1807
+ )
1808
+ print(embeddings8)
1809
+
1810
+
1811
+ # In[129]:
1812
+
1813
+
1814
+ ids8 = [str(uuid.uuid4()) for _ in range(len(contents8))]
1815
+
1816
+
1817
+ # In[130]:
1818
+
1819
+
1820
+ data.add(
1821
+ documents=contents8,
1822
+ embeddings=embeddings8,
1823
+ metadatas=metadata8,
1824
+ ids=ids8
1825
+ )
1826
+
1827
+
1828
+ # In[131]:
1829
+
1830
+
1831
+ append_data(contents8, metadata8, embeddings8)
1832
+
1833
+
1834
+ # In[132]:
1835
+
1836
+
1837
+ df
1838
+
1839
+
1840
+ # # <p style="color: orange;">Document 8 Procédure de déroulement des stages facultatifs (CPE-IN-01-00)</p>
1841
+
1842
+ # In[134]:
1843
+
1844
+
1845
+ loader = WebBaseLoader(
1846
+ web_paths=("https://fsm.rnu.tn/fra/pages/437/Proc%C3%A9dure-de-d%C3%A9roulement-des-stages-facultatif",),
1847
+ bs_kwargs=dict(
1848
+ parse_only=bs4.SoupStrainer(
1849
+ class_=("content")
1850
+ )
1851
+ ),
1852
+ )
1853
+ Procédure_de_déroulement_des_stages_facultatifs = loader.load()
1854
+
1855
+
1856
+ # In[135]:
1857
+
1858
+
1859
+ Procédure_de_déroulement_des_stages_facultatifs = [
1860
+ Document(page_content=clean_text(doc.page_content), metadata=doc.metadata)
1861
+ for doc in Procédure_de_déroulement_des_stages_facultatifs]
1862
+ Procédure_de_déroulement_des_stages_facultatifs
1863
+
1864
+
1865
+ # ## splitting doc 8 into chunks
1866
+
1867
+ # In[137]:
1868
+
1869
+
1870
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
1871
+ splits9 = text_splitter.split_documents( Procédure_de_déroulement_des_stages_facultatifs)
1872
+
1873
+
1874
+ # In[138]:
1875
+
1876
+
1877
+ splits9
1878
+
1879
+
1880
+ # In[139]:
1881
+
1882
+
1883
+ contents9= [doc.page_content for doc in splits9]
1884
+ metadata9 = [doc.metadata for doc in splits9]
1885
+
1886
+
1887
+ # In[140]:
1888
+
1889
+
1890
+ embeddings9 = embeddings_model.embed_documents(
1891
+ [doc.page_content for doc in splits9],
1892
+ # normalize_embeddings=True,
1893
+ # batch_size=256,
1894
+ # show_progress_bar=True
1895
+ )
1896
+ print(embeddings9)
1897
+
1898
+
1899
+ # In[141]:
1900
+
1901
+
1902
+ ids9 = [str(uuid.uuid4()) for _ in range(len(contents9))]
1903
+
1904
+
1905
+ # In[142]:
1906
+
1907
+
1908
+ data.add(
1909
+ documents=contents9,
1910
+ embeddings=embeddings9,
1911
+ metadatas=metadata9,
1912
+ ids=ids9
1913
+ )
1914
+
1915
+
1916
+ # In[143]:
1917
+
1918
+
1919
+ append_data(contents9, metadata9, embeddings9)
1920
+
1921
+
1922
+ # In[144]:
1923
+
1924
+
1925
+ df
1926
+
1927
+
1928
+ # # <p style="color: orange;"> Document 9 Procédure de déroulement des stages obligatoires (CPE-IN-02-00)</p>
1929
+
1930
+ # In[146]:
1931
+
1932
+
1933
+ loader = WebBaseLoader(
1934
+ web_paths=("https://fsm.rnu.tn/fra/pages/75/Proc%C3%A9dure-de-d%C3%A9roulement-des-stages",),
1935
+ bs_kwargs=dict(
1936
+ parse_only=bs4.SoupStrainer(
1937
+ class_=("content")
1938
+ )
1939
+ ),
1940
+ )
1941
+ Procédure_de_déroulement_des_stages_obligatoires = loader.load()
1942
+
1943
+
1944
+ # In[147]:
1945
+
1946
+
1947
+ Procédure_de_déroulement_des_stages_obligatoires = [
1948
+ Document(page_content=clean_text(doc.page_content), metadata=doc.metadata)
1949
+ for doc in Procédure_de_déroulement_des_stages_obligatoires]
1950
+ Procédure_de_déroulement_des_stages_obligatoires
1951
+
1952
+
1953
+ # ## splitting doc 9 into chunks
1954
+
1955
+ # In[149]:
1956
+
1957
+
1958
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
1959
+ splits10= text_splitter.split_documents(Procédure_de_déroulement_des_stages_obligatoires)
1960
+
1961
+
1962
+ # In[150]:
1963
+
1964
+
1965
+ splits10
1966
+
1967
+
1968
+ # In[151]:
1969
+
1970
+
1971
+ contents10= [doc.page_content for doc in splits10]
1972
+ metadata10 = [doc.metadata for doc in splits10]
1973
+
1974
+
1975
+ # In[152]:
1976
+
1977
+
1978
+ embeddings10 = embeddings_model.embed_documents(
1979
+ [doc.page_content for doc in splits10],
1980
+ # normalize_embeddings=True,
1981
+ # batch_size=256,
1982
+ # show_progress_bar=True
1983
+ )
1984
+ print(embeddings10)
1985
+
1986
+
1987
+ # In[153]:
1988
+
1989
+
1990
+ ids10 = [str(uuid.uuid4()) for _ in range(len(contents10))]
1991
+
1992
+
1993
+ # In[154]:
1994
+
1995
+
1996
+ data.add(
1997
+ documents=contents10,
1998
+ embeddings=embeddings10,
1999
+ metadatas=metadata10,
2000
+ ids=ids10
2001
+ )
2002
+
2003
+
2004
+ # In[155]:
2005
+
2006
+
2007
+ append_data(contents10, metadata10, embeddings10)
2008
+
2009
+
2010
+ # In[156]:
2011
+
2012
+
2013
+ df
2014
+
2015
+
2016
+ # # <p style="color: orange;"> Document 10 Partenariat international</p>
2017
+
2018
+ # In[158]:
2019
+
2020
+
2021
+ loader = WebBaseLoader(
2022
+ web_paths=("https://fsm.rnu.tn/fra/pages/9/Partenariat-international",),
2023
+ bs_kwargs=dict(
2024
+ parse_only=bs4.SoupStrainer(
2025
+ class_=("content")
2026
+ )
2027
+ ),
2028
+ )
2029
+ Partenariat_international = loader.load()
2030
+
2031
+
2032
+ # In[159]:
2033
+
2034
+
2035
+ Partenariat_international = [
2036
+ Document(page_content=clean_text(doc.page_content), metadata=doc.metadata)
2037
+ for doc in Partenariat_international]
2038
+ Partenariat_international
2039
+
2040
+
2041
+ # ## splitting doc 10 into chunks
2042
+
2043
+ # In[161]:
2044
+
2045
+
2046
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
2047
+ splits11 = text_splitter.split_documents(Partenariat_international)
2048
+
2049
+
2050
+ # In[162]:
2051
+
2052
+
2053
+ splits11
2054
+
2055
+
2056
+ # In[163]:
2057
+
2058
+
2059
+ contents11= [doc.page_content for doc in splits11]
2060
+ metadata11 = [doc.metadata for doc in splits11]
2061
+
2062
+
2063
+ # In[164]:
2064
+
2065
+
2066
+ embeddings11 = embeddings_model.embed_documents(
2067
+ [doc.page_content for doc in splits11],
2068
+ # normalize_embeddings=True,
2069
+ # batch_size=256,
2070
+ # show_progress_bar=True
2071
+ )
2072
+ print(embeddings11)
2073
+
2074
+
2075
+ # In[165]:
2076
+
2077
+
2078
+ ids11 = [str(uuid.uuid4()) for _ in range(len(contents11))]
2079
+
2080
+
2081
+ # In[166]:
2082
+
2083
+
2084
+ data.add(
2085
+ documents=contents11,
2086
+ embeddings=embeddings11,
2087
+ metadatas=metadata11,
2088
+ ids=ids11
2089
+ )
2090
+
2091
+
2092
+ # In[167]:
2093
+
2094
+
2095
+ append_data(contents11, metadata11, embeddings11)
2096
+
2097
+
2098
+ # In[168]:
2099
+
2100
+
2101
+ df
2102
+
2103
+
2104
+ # # <p style="color: orange;"> Document 11 Communication</p>
2105
+
2106
+ # In[170]:
2107
+
2108
+
2109
+ loader = WebBaseLoader(
2110
+ web_paths=("https://fsm.rnu.tn/fra/pages/140/Communication",),
2111
+ bs_kwargs=dict(
2112
+ parse_only=bs4.SoupStrainer(
2113
+ class_=("content")
2114
+ )
2115
+ ),
2116
+ )
2117
+ Communication = loader.load()
2118
+
2119
+
2120
+ # In[171]:
2121
+
2122
+
2123
+ Communication = [
2124
+ Document(page_content=clean_text(doc.page_content), metadata=doc.metadata)
2125
+ for doc in Communication]
2126
+ Communication
2127
+
2128
+
2129
+ # ## splitting doc 11 into chunks
2130
+
2131
+ # In[173]:
2132
+
2133
+
2134
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
2135
+ splits12 = text_splitter.split_documents(Communication)
2136
+
2137
+
2138
+ # In[174]:
2139
+
2140
+
2141
+ splits12
2142
+
2143
+
2144
+ # In[175]:
2145
+
2146
+
2147
+ contents12= [doc.page_content for doc in splits12]
2148
+ metadata12 = [doc.metadata for doc in splits12]
2149
+
2150
+
2151
+ # In[176]:
2152
+
2153
+
2154
+ embeddings12 = embeddings_model.embed_documents(
2155
+ [doc.page_content for doc in splits12],
2156
+ # normalize_embeddings=True,
2157
+ # batch_size=256,
2158
+ # show_progress_bar=True
2159
+ )
2160
+ print(embeddings12)
2161
+
2162
+
2163
+ # In[177]:
2164
+
2165
+
2166
+ ids12 = [str(uuid.uuid4()) for _ in range(len(contents12))]
2167
+
2168
+
2169
+ # In[178]:
2170
+
2171
+
2172
+ data.add(
2173
+ documents=contents12,
2174
+ embeddings=embeddings12,
2175
+ metadatas=metadata12,
2176
+ ids=ids12
2177
+ )
2178
+
2179
+
2180
+ # In[179]:
2181
+
2182
+
2183
+ append_data(contents12, metadata12, embeddings12)
2184
+
2185
+
2186
+ # In[180]:
2187
+
2188
+
2189
+ df
2190
+
2191
+
2192
+ # # <p style="color: orange;"> Document 12 Liens utiles</p>
2193
+
2194
+ # In[182]:
2195
+
2196
+
2197
+ loader = WebBaseLoader(
2198
+ web_paths=("https://fsm.rnu.tn/fra/links",),
2199
+ bs_kwargs=dict(
2200
+ parse_only=bs4.SoupStrainer(
2201
+ class_=("links_container","link_item","link_tags")
2202
+ )
2203
+ ),
2204
+ )
2205
+ Liens_utiles = loader.load()
2206
+
2207
+
2208
+ # In[183]:
2209
+
2210
+
2211
+ Liens_utiles = [
2212
+ Document(page_content=clean_text(doc.page_content), metadata=doc.metadata)
2213
+ for doc in Liens_utiles]
2214
+ Liens_utiles
2215
+
2216
+
2217
+ # ## splitting doc 12 into chunks
2218
+
2219
+ # In[185]:
2220
+
2221
+
2222
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
2223
+ splits13 = text_splitter.split_documents(Liens_utiles)
2224
+
2225
+
2226
+ # In[186]:
2227
+
2228
+
2229
+ splits13
2230
+
2231
+
2232
+ # In[187]:
2233
+
2234
+
2235
+ contents13= [doc.page_content for doc in splits13]
2236
+ metadata13 = [doc.metadata for doc in splits13]
2237
+
2238
+
2239
+ # In[188]:
2240
+
2241
+
2242
+ embeddings13 = embeddings_model.embed_documents(
2243
+ [doc.page_content for doc in splits13],
2244
+ # normalize_embeddings=True,
2245
+ # batch_size=256,
2246
+ # show_progress_bar=True
2247
+ )
2248
+ print(embeddings13)
2249
+
2250
+
2251
+ # In[189]:
2252
+
2253
+
2254
+ ids13 = [str(uuid.uuid4()) for _ in range(len(contents13))]
2255
+
2256
+
2257
+ # In[190]:
2258
+
2259
+
2260
+ data.add(
2261
+ documents=contents13,
2262
+ embeddings=embeddings13,
2263
+ metadatas=metadata13,
2264
+ ids=ids13
2265
+ )
2266
+
2267
+
2268
+ # In[191]:
2269
+
2270
+
2271
+ append_data(contents13, metadata13, embeddings13)
2272
+
2273
+
2274
+ # In[192]:
2275
+
2276
+
2277
+ df
2278
+
2279
+
2280
+ # # <p style="color: orange;"> Document 13 Departement Chimie </p>
2281
+
2282
+ # In[194]:
2283
+
2284
+
2285
+ loader = WebBaseLoader(
2286
+ web_paths=("https://fsm.rnu.tn/fra/departements/CH/4/chimie",),
2287
+ bs_kwargs=dict(
2288
+ parse_only=bs4.SoupStrainer(
2289
+ class_=("content")
2290
+ )
2291
+ ),
2292
+ )
2293
+ Chimie = loader.load()
2294
+
2295
+
2296
+ # In[195]:
2297
+
2298
+
2299
+ Chimie = [
2300
+ Document(page_content=clean_text(doc.page_content), metadata=doc.metadata)
2301
+ for doc in Chimie]
2302
+ Chimie
2303
+
2304
+
2305
+ # ## splitting doc 13 into chunks
2306
+
2307
+ # In[197]:
2308
+
2309
+
2310
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
2311
+ splits14 = text_splitter.split_documents(Chimie)
2312
+
2313
+
2314
+ # In[198]:
2315
+
2316
+
2317
+ splits14
2318
+
2319
+
2320
+ # In[199]:
2321
+
2322
+
2323
+ contents14= [doc.page_content for doc in splits14]
2324
+ metadata14 = [doc.metadata for doc in splits14]
2325
+
2326
+
2327
+ # In[200]:
2328
+
2329
+
2330
+ embeddings14 = embeddings_model.embed_documents(
2331
+ [doc.page_content for doc in splits14],
2332
+ # normalize_embeddings=True,
2333
+ # batch_size=256,
2334
+ # show_progress_bar=True
2335
+ )
2336
+ print(embeddings14)
2337
+
2338
+
2339
+ # In[201]:
2340
+
2341
+
2342
+ ids14 = [str(uuid.uuid4()) for _ in range(len(contents14))]
2343
+
2344
+
2345
+ # In[202]:
2346
+
2347
+
2348
+ data.add(
2349
+ documents=contents14,
2350
+ embeddings=embeddings14,
2351
+ metadatas=metadata14,
2352
+ ids=ids14
2353
+ )
2354
+
2355
+
2356
+ # In[203]:
2357
+
2358
+
2359
+ append_data(contents14, metadata14, embeddings14)
2360
+
2361
+
2362
+ # In[204]:
2363
+
2364
+
2365
+ df
2366
+
2367
+
2368
+ # # <p style="color: orange;"> Document 14 Departement Mathematique </p>
2369
+
2370
+ # In[206]:
2371
+
2372
+
2373
+ loader = WebBaseLoader(
2374
+ web_paths=("https://fsm.rnu.tn/fra/departements/M/1/mathematiques",),
2375
+ bs_kwargs=dict(
2376
+ parse_only=bs4.SoupStrainer(
2377
+ class_=("selectEnsFilter")
2378
+ )
2379
+ ),
2380
+ )
2381
+ math = loader.load()
2382
+
2383
+
2384
+ # In[207]:
2385
+
2386
+
2387
+ math = [
2388
+ Document(page_content=clean_text(doc.page_content), metadata=doc.metadata)
2389
+ for doc in math]
2390
+ math
2391
+
2392
+
2393
+ # ## splitting doc 14 into chunks
2394
+
2395
+ # In[209]:
2396
+
2397
+
2398
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
2399
+ splits15 = text_splitter.split_documents(math)
2400
+
2401
+
2402
+ # In[210]:
2403
+
2404
+
2405
+ splits15
2406
+
2407
+
2408
+ # In[211]:
2409
+
2410
+
2411
+ contents15= [doc.page_content for doc in splits15]
2412
+ metadata15 = [doc.metadata for doc in splits15]
2413
+
2414
+
2415
+ # In[212]:
2416
+
2417
+
2418
+ embeddings15 = embeddings_model.embed_documents(
2419
+ [doc.page_content for doc in splits15],
2420
+ # normalize_embeddings=True,
2421
+ # batch_size=256,
2422
+ # show_progress_bar=True
2423
+ )
2424
+ print(embeddings15)
2425
+
2426
+
2427
+ # In[213]:
2428
+
2429
+
2430
+ ids15 = [str(uuid.uuid4()) for _ in range(len(contents15))]
2431
+
2432
+
2433
+ # In[214]:
2434
+
2435
+
2436
+ data.add(
2437
+ documents=contents15,
2438
+ embeddings=embeddings15,
2439
+ metadatas=metadata15,
2440
+ ids=ids15
2441
+ )
2442
+
2443
+
2444
+ # In[215]:
2445
+
2446
+
2447
+ append_data(contents15, metadata15, embeddings15)
2448
+
2449
+
2450
+ # In[216]:
2451
+
2452
+
2453
+ df
2454
+
2455
+
2456
+ # # <p style="color: orange;"> Document 15 Departement informatique </p>
2457
+
2458
+ # In[218]:
2459
+
2460
+
2461
+ loader = WebBaseLoader(
2462
+ web_paths=("https://fsm.rnu.tn/fra/departements/Info/2/informatique",),
2463
+ bs_kwargs=dict(
2464
+ parse_only=bs4.SoupStrainer(
2465
+ class_=("selectEnsFilter")
2466
+ )
2467
+ ),
2468
+ )
2469
+ info = loader.load()
2470
+
2471
+
2472
+ # In[219]:
2473
+
2474
+
2475
+ info = [
2476
+ Document(page_content=clean_text(doc.page_content), metadata=doc.metadata)
2477
+ for doc in info]
2478
+ info
2479
+
2480
+
2481
+ # ## splitting doc 15 into chunks
2482
+
2483
+ # In[221]:
2484
+
2485
+
2486
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
2487
+ splits16=text_splitter.split_documents(info)
2488
+
2489
+
2490
+ # In[222]:
2491
+
2492
+
2493
+ splits16
2494
+
2495
+
2496
+ # In[223]:
2497
+
2498
+
2499
+ contents16= [doc.page_content for doc in splits16]
2500
+ metadata16 = [doc.metadata for doc in splits16]
2501
+
2502
+
2503
+ # In[224]:
2504
+
2505
+
2506
+ embeddings16 = embeddings_model.embed_documents(
2507
+ [doc.page_content for doc in splits16],
2508
+ # normalize_embeddings=True,
2509
+ # batch_size=256,
2510
+ # show_progress_bar=True
2511
+ )
2512
+ print(embeddings16)
2513
+
2514
+
2515
+ # In[225]:
2516
+
2517
+
2518
+ ids16 = [str(uuid.uuid4()) for _ in range(len(contents16))]
2519
+
2520
+
2521
+ # In[226]:
2522
+
2523
+
2524
+ data.add(
2525
+ documents=contents16,
2526
+ embeddings=embeddings16,
2527
+ metadatas=metadata16,
2528
+ ids=ids16
2529
+ )
2530
+
2531
+
2532
+ # In[227]:
2533
+
2534
+
2535
+ append_data(contents16, metadata16, embeddings16)
2536
+
2537
+
2538
+ # In[228]:
2539
+
2540
+
2541
+ df
2542
+
2543
+
2544
+ # # <p style="color: orange;">Document 16 Departement Physique </p>
2545
+
2546
2547
+
2548
+ # In[231]:
2549
+
2550
+
2551
+ loader = WebBaseLoader(
2552
+ web_paths=("https://fsm.rnu.tn/fra/departements/PH/3/physique",),
2553
+ bs_kwargs=dict(
2554
+ parse_only=bs4.SoupStrainer(
2555
+ class_=("selectEnsFilter")
2556
+ )
2557
+ ),
2558
+ )
2559
+ physique = loader.load()
2560
+
2561
+
2562
+ # In[232]:
2563
+
2564
+
2565
+ physique = [
2566
+ Document(page_content=clean_text(doc.page_content), metadata=doc.metadata)
2567
+ for doc in physique]
2568
+ physique
2569
+
2570
+
2571
+ # ## splitting doc 16 into chunks
2572
+
2573
+ # In[234]:
2574
+
2575
+
2576
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
2577
+ splits17 = text_splitter.split_documents(physique)
2578
+
2579
+
2580
+ # In[235]:
2581
+
2582
+
2583
+ splits17
2584
+
2585
+
2586
+ # In[236]:
2587
+
2588
+
2589
+ contents17= [doc.page_content for doc in splits17]
2590
+ metadata17 = [doc.metadata for doc in splits17]
2591
+
2592
+
2593
+ # In[237]:
2594
+
2595
+
2596
+ embeddings17 = embeddings_model.embed_documents(
2597
+ [doc.page_content for doc in splits17],
2598
+ # normalize_embeddings=True,
2599
+ # batch_size=256,
2600
+ # show_progress_bar=True
2601
+ )
2602
+ print(embeddings17)
2603
+
2604
+
2605
+ # In[238]:
2606
+
2607
+
2608
+ ids17 = [str(uuid.uuid4()) for _ in range(len(contents17))]
2609
+
2610
+
2611
+ # In[239]:
2612
+
2613
+
2614
+ data.add(
2615
+ documents=contents17,
2616
+ embeddings=embeddings17,
2617
+ metadatas=metadata17,
2618
+ ids=ids17
2619
+ )
2620
+
2621
+
2622
+ # In[240]:
2623
+
2624
+
2625
+ append_data(contents17, metadata17, embeddings17)
2626
+
2627
+
2628
+ # In[241]:
2629
+
2630
+
2631
+ df
2632
+
2633
+
2634
+ # # <p style="color: orange;">Document 17 Enseignement Tronc Commun </p>
2635
+
2636
+ # In[243]:
2637
+
2638
+
2639
+ loader = WebBaseLoader(
2640
+ web_paths=("https://fsm.rnu.tn/fra/departements/ET/5/enseignement-tronc-commun",),
2641
+ bs_kwargs=dict(
2642
+ parse_only=bs4.SoupStrainer(
2643
+ class_=("content")
2644
+ )
2645
+ ),
2646
+ )
2647
+ Enseignement_Tronc_Commun = loader.load()
2648
+
2649
+
2650
+ # In[244]:
2651
+
2652
+
2653
+ Enseignement_Tronc_Commun = [
2654
+ Document(page_content=clean_text(doc.page_content), metadata=doc.metadata)
2655
+ for doc in Enseignement_Tronc_Commun]
2656
+ Enseignement_Tronc_Commun
2657
+
2658
+
2659
+ # ## splitting doc 17 into chunks
2660
+
2661
+ # In[246]:
2662
+
2663
+
2664
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
2665
+ splits18 = text_splitter.split_documents(Enseignement_Tronc_Commun)
2666
+
2667
+
2668
+ # In[247]:
2669
+
2670
+
2671
+ splits18
2672
+
2673
+
2674
+ # In[248]:
2675
+
2676
+
2677
+ contents18= [doc.page_content for doc in splits18]
2678
+ metadata18 = [doc.metadata for doc in splits18]
2679
+
2680
+
2681
+ # In[249]:
2682
+
2683
+
2684
+ embeddings18 = embeddings_model.embed_documents(
2685
+ [doc.page_content for doc in splits18],
2686
+ # normalize_embeddings=True,
2687
+ # batch_size=256,
2688
+ # show_progress_bar=True
2689
+ )
2690
+ print(embeddings18)
2691
+
2692
+
2693
+ # In[250]:
2694
+
2695
+
2696
+ ids18 = [str(uuid.uuid4()) for _ in range(len(contents18))]
2697
+
2698
+
2699
+ # In[251]:
2700
+
2701
+
2702
+ data.add(
2703
+ documents=contents18,
2704
+ embeddings=embeddings18,
2705
+ metadatas=metadata18,
2706
+ ids=ids18
2707
+ )
2708
+
2709
+
2710
+ # In[252]:
2711
+
2712
+
2713
+ append_data(contents18, metadata18, embeddings18)
2714
+
2715
+
2716
+ # In[253]:
2717
+
2718
+
2719
+ df
2720
+
2721
+
2722
+ # # <p style="color: orange;">Document 18 اخر بلاغ للتسجيل بالنسبة للسنة الجامعية </p>
2723
+ #
2724
+
2725
+ # In[255]:
2726
+
2727
+
2728
+ loader = WebBaseLoader(
2729
+ web_paths=("https://fsm.rnu.tn/fra/articles/4712/%D8%A7%D8%AE%D8%B1-%D8%A8%D9%84%D8%A7%D8%BA-%D9%84%D9%84%D8%AA%D8%B3%D8%AC%D9%8A%D9%84-%D8%A8%D8%A7%D9%84%D9%86%D8%B3%D8%A8%D8%A9-%D9%84%D9%84%D8%B3%D9%86%D8%A9-%D8%A7%D9%84%D8%AC%D8%A7%D9%85%D8%B9%D9%8A%D8%A9-2024-2025",),
2730
+ bs_kwargs=dict(
2731
+ parse_only=bs4.SoupStrainer(
2732
+ class_=("content")
2733
+ )
2734
+ ),
2735
+ )
2736
+ ekher_balegh = loader.load()
2737
+
2738
+
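+ # The article URL above is percent-encoded Arabic, which is hard to read in the source. An
+ # optional sketch using the standard library to print its human-readable form (assuming the
+ # loader returned at least one document and kept the URL in metadata["source"], as
+ # WebBaseLoader normally does):
+ from urllib.parse import unquote
+ print(unquote(ekher_balegh[0].metadata["source"]))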
2739
+ # In[256]:
2740
+
2741
+
2742
+ ekher_balegh = [
2743
+ Document(page_content=clean_text(doc.page_content), metadata=doc.metadata)
2744
+ for doc in ekher_balegh]
2745
+ ekher_balegh
2746
+
2747
+
2748
+ # ## splitting doc 18 into chunks
2749
+
2750
+ # In[258]:
2751
+
2752
+
2753
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
2754
+ splits19 = text_splitter.split_documents(ekher_balegh)
2755
+
2756
+
2757
+ # In[259]:
2758
+
2759
+
2760
+ splits19
2761
+
2762
+
2763
+ # In[260]:
2764
+
2765
+
2766
+ contents19= [doc.page_content for doc in splits19]
2767
+ metadata19 = [doc.metadata for doc in splits19]
2768
+
2769
+
2770
+ # In[261]:
2771
+
2772
+
2773
+ embeddings19 = embeddings_model.embed_documents(
2774
+ [doc.page_content for doc in splits19],
2775
+ # normalize_embeddings=True,
2776
+ # batch_size=256,
2777
+ # show_progress_bar=True
2778
+ )
2779
+ print(embeddings19)
2780
+
2781
+
2782
+ # In[262]:
2783
+
2784
+
2785
+ ids19 = [str(uuid.uuid4()) for _ in range(len(contents19))]
2786
+
2787
+
2788
+ # In[263]:
2789
+
2790
+
2791
+ data.add(
2792
+ documents=contents19,
2793
+ embeddings=embeddings19,
2794
+ metadatas=metadata19,
2795
+ ids=ids19
2796
+ )
2797
+
2798
+
2799
+ # In[264]:
2800
+
2801
+
2802
+ append_data(contents19, metadata19, embeddings19)
2803
+
2804
+
2805
+ # In[265]:
2806
+
2807
+
2808
+ df
2809
+
2810
+
2811
+ # # <p style="color: orange;">Document 19 Comptes extranet des étudiants 2024-2025 </p>
2812
+ #
2813
+
2814
+ # In[267]:
2815
+
2816
+
2817
+ loader = WebBaseLoader(
2818
+ web_paths=("https://fsm.rnu.tn/fra/articles/4673/comptes-extranet-des-etudiants-2024-2025",),
2819
+ bs_kwargs=dict(
2820
+ parse_only=bs4.SoupStrainer(
2821
+ class_=("content")
2822
+ )
2823
+ ),
2824
+ )
2825
+ comptes_extranet_des_etudiants = loader.load()
2826
+
2827
+
2828
+ # In[268]:
2829
+
2830
+
2831
+ comptes_extranet_des_etudiants = [
2832
+ Document(page_content=clean_text(doc.page_content), metadata=doc.metadata)
2833
+ for doc in comptes_extranet_des_etudiants]
2834
+ comptes_extranet_des_etudiants
2835
+
2836
+
2837
+
2838
+ # ## splitting doc 19 into chunks
2839
+
2840
+ # In[270]:
2841
+
2842
+
2843
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
2844
+ splits20 = text_splitter.split_documents(comptes_extranet_des_etudiants)
2845
+
2846
+
2847
+ # In[271]:
2848
+
2849
+
2850
+ splits20
2851
+
2852
+
2853
+ # In[272]:
2854
+
2855
+
2856
+ contents20= [doc.page_content for doc in splits20]
2857
+ metadata20 = [doc.metadata for doc in splits20]
2858
+
2859
+
2860
+ # In[273]:
2861
+
2862
+
2863
+ embeddings20 = embeddings_model.embed_documents(
2864
+ [doc.page_content for doc in splits20],
2865
+ # normalize_embeddings=True,
2866
+ # batch_size=256,
2867
+ # show_progress_bar=True
2868
+ )
2869
+ print(embeddings20)
2870
+
2871
+
2872
+ # In[274]:
2873
+
2874
+
2875
+ ids20 = [str(uuid.uuid4()) for _ in range(len(contents20))]
2876
+
2877
+
2878
+ # In[275]:
2879
+
2880
+
2881
+ data.add(
2882
+ documents=contents20,
2883
+ embeddings=embeddings20,
2884
+ metadatas=metadata20,
2885
+ ids=ids20
2886
+ )
2887
+
2888
+
2889
+ # In[276]:
2890
+
2891
+
2892
+ append_data(contents20, metadata20, embeddings20)
2893
+
2894
+
2895
+ # In[277]:
2896
+
2897
+
2898
+ df
2899
+
2900
+
2901
+ # # <p style="color: orange;"> Document 20 بلاغ الترسيم للسنة الجامعية </p>
2902
+ #
2903
+
2904
+ # In[279]:
2905
+
2906
+
2907
+ loader = WebBaseLoader(
2908
+ web_paths=("https://fsm.rnu.tn/fra/articles/4395/%D8%A8%D9%84%D8%A7%D8%BA-%D8%A7%D9%84%D8%AA%D8%B1%D8%B3%D9%8A%D9%85-%D9%84%D9%84%D8%B3%D9%86%D8%A9-%D8%A7%D9%84%D8%AC%D8%A7%D9%85%D8%B9%D9%8A%D8%A9-2024-2025",),
2909
+ bs_kwargs=dict(
2910
+ parse_only=bs4.SoupStrainer(
2911
+ class_=("content")
2912
+ )
2913
+ ),
2914
+ )
2915
+ balegh_tarsim = loader.load()
2916
+
2917
+
2918
+ # In[280]:
2919
+
2920
+
2921
+ balegh_tarsim = [
2922
+ Document(page_content=clean_text(doc.page_content), metadata=doc.metadata)
2923
+ for doc in balegh_tarsim]
2924
+ balegh_tarsim
2925
+
2926
+
2927
+
2928
+ # ## splitting doc 20 into chunks
2929
+
2930
+ # In[282]:
2931
+
2932
+
2933
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
2934
+ splits21 = text_splitter.split_documents(balegh_tarsim)
2935
+
2936
+
2937
+ # In[283]:
2938
+
2939
+
2940
+ splits21
2941
+
2942
+
2943
+ # In[284]:
2944
+
2945
+
2946
+ contents21= [doc.page_content for doc in splits21]
2947
+ metadata21= [doc.metadata for doc in splits21]
2948
+
2949
+
2950
+ # In[285]:
2951
+
2952
+
2953
+ embeddings21= embeddings_model.embed_documents(
2954
+ [doc.page_content for doc in splits21],
2955
+ # normalize_embeddings=True,
2956
+ # batch_size=256,
2957
+ # show_progress_bar=True
2958
+ )
2959
+ print(embeddings21)
2960
+
2961
+
2962
+ # In[286]:
2963
+
2964
+
2965
+ ids21 = [str(uuid.uuid4()) for _ in range(len(contents21))]
2966
+
2967
+
2968
+ # In[287]:
2969
+
2970
+
2971
+ data.add(
2972
+ documents=contents21,
2973
+ embeddings=embeddings21,
2974
+ metadatas=metadata21,
2975
+ ids=ids21
2976
+ )
2977
+
2978
+
2979
+ # In[288]:
2980
+
2981
+
2982
+ append_data(contents21, metadata21, embeddings21)
2983
+
2984
+
2985
+ # In[289]:
2986
+
2987
+
2988
+ df
2989
+
2990
+
2991
+ # # <p style="color: orange;">Document 21 Fiche de renseignements des diplômés </p>
2992
+ #
2993
+
2994
+ # In[291]:
2995
+
2996
+
2997
+ loader = WebBaseLoader(
2998
+ web_paths=("https://fsm.rnu.tn/fra/pages/138/Fiche-de-renseignements-des-dipl%C3%B4m%C3%A9s",),
2999
+ bs_kwargs=dict(
3000
+ parse_only=bs4.SoupStrainer(
3001
+ class_=("content")
3002
+ )
3003
+ ),
3004
+ )
3005
+ Fiche_de_renseignements_des_diplome = loader.load()
3006
+
3007
+
3008
+ # In[292]:
3009
+
3010
+
3011
+ Fiche_de_renseignements_des_diplome = [
3012
+ Document(page_content=clean_text(doc.page_content), metadata=doc.metadata)
3013
+ for doc in Fiche_de_renseignements_des_diplome]
3014
+ Fiche_de_renseignements_des_diplome
3015
+
3016
+
3017
+ # ## splitting doc 21 into chunks
3018
+
3019
+ # In[294]:
3020
+
3021
+
3022
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
3023
+ splits22 = text_splitter.split_documents(Fiche_de_renseignements_des_diplome)
3024
+
3025
+
3026
+ # In[295]:
3027
+
3028
+
3029
+ splits22
3030
+
3031
+
3032
+ # In[296]:
3033
+
3034
+
3035
+ contents22= [doc.page_content for doc in splits22]
3036
+ metadata22 = [doc.metadata for doc in splits22]
3037
+
3038
+
3039
+ # In[297]:
3040
+
3041
+
3042
+ embeddings22 = embeddings_model.embed_documents(
3043
+ [doc.page_content for doc in splits22],
3044
+ # normalize_embeddings=True,
3045
+ # batch_size=256,
3046
+ # show_progress_bar=True
3047
+ )
3048
+ print(embeddings22)
3049
+
3050
+
3051
+ # In[298]:
3052
+
3053
+
3054
+ ids22 = [str(uuid.uuid4()) for _ in range(len(contents22))]
3055
+
3056
+
3057
+ # In[299]:
3058
+
3059
+
3060
+ data.add(
3061
+ documents=contents22,
3062
+ embeddings=embeddings22,
3063
+ metadatas=metadata22,
3064
+ ids=ids22
3065
+ )
3066
+
3067
+
3068
+ # In[300]:
3069
+
3070
+
3071
+ append_data(contents22, metadata22, embeddings22)
3072
+
3073
+
3074
+ # In[301]:
3075
+
3076
+
3077
+ df
3078
+
3079
+
3080
+ # # <p style="color: orange;">Document 22 Loi de creation FSM </p>
3081
+ #
3082
+
3083
+ # In[303]:
3084
+
3085
+
3086
+ loader = WebBaseLoader(
3087
+ web_paths=("https://fsm.rnu.tn/fra/pages/1/Loi-de-cr%C3%A9ation",),
3088
+ bs_kwargs=dict(
3089
+ parse_only=bs4.SoupStrainer(
3090
+ class_=("content")
3091
+ )
3092
+ ),
3093
+ )
3094
+ loi_de_creation = loader.load()
3095
+
3096
+
3097
+ # In[304]:
3098
+
3099
+
3100
+ loi_de_creation = [
3101
+ Document(page_content=clean_text(doc.page_content), metadata=doc.metadata)
3102
+ for doc in loi_de_creation]
3103
+ loi_de_creation
3104
+
3105
+
3106
+ # ## splitting doc 22 into chunks
3107
+
3108
+ # In[306]:
3109
+
3110
+
3111
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
3112
+ splits23 = text_splitter.split_documents(loi_de_creation)
3113
+
3114
+
3115
+ # In[307]:
3116
+
3117
+
3118
+ splits23
3119
+
3120
+
3121
+ # In[308]:
3122
+
3123
+
3124
+ contents23= [doc.page_content for doc in splits23]
3125
+ metadata23 = [doc.metadata for doc in splits23]
3126
+
3127
+
3128
+ # In[309]:
3129
+
3130
+
3131
+ embeddings23 = embeddings_model.embed_documents(
3132
+ [doc.page_content for doc in splits23],
3133
+ # normalize_embeddings=True,
3134
+ # batch_size=256,
3135
+ # show_progress_bar=True
3136
+ )
3137
+ print(embeddings23)
3138
+
3139
+
3140
+ # In[310]:
3141
+
3142
+
3143
+ ids23 = [str(uuid.uuid4()) for _ in range(len(contents23))]
3144
+
3145
+
3146
+ # In[311]:
3147
+
3148
+
3149
+ data.add(
3150
+ documents=contents23,
3151
+ embeddings=embeddings23,
3152
+ metadatas=metadata23,
3153
+ ids=ids23
3154
+ )
3155
+
3156
+
3157
+ # In[312]:
3158
+
3159
+
3160
+ append_data(contents23, metadata23, embeddings23)
3161
+
3162
+
3163
+ # In[313]:
3164
+
3165
+
3166
+ df
3167
+
3168
+
3169
+ # # <p style="color: orange;">Document 23 FSM en chiffres </p>
3170
+ #
3171
+
3172
+ # In[315]:
3173
+
3174
+
3175
+ loader = WebBaseLoader(
3176
+ web_paths=("https://fsm.rnu.tn/fra/pages/3/En-chiffres",),
3177
+ bs_kwargs=dict(
3178
+ parse_only=bs4.SoupStrainer(
3179
+ class_=("content")
3180
+ )
3181
+ ),
3182
+ )
3183
+ loi_en_chiffre = loader.load()
3184
+
3185
+
3186
+ # In[316]:
3187
+
3188
+
3189
+ loi_en_chiffre = [
3190
+ Document(page_content=clean_text(doc.page_content), metadata=doc.metadata)
3191
+ for doc in loi_en_chiffre]
3192
+ loi_en_chiffre
3193
+
3194
+
3195
+ # ## splitting doc 23 into chunks
3196
+
3197
+ # In[318]:
3198
+
3199
+
3200
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
3201
+ splits24 = text_splitter.split_documents(loi_en_chiffre)
3202
+
3203
+
3204
+ # In[319]:
3205
+
3206
+
3207
+ splits24
3208
+
3209
+
3210
+ # In[320]:
3211
+
3212
+
3213
+ contents24= [doc.page_content for doc in splits24]
3214
+ metadata24 = [doc.metadata for doc in splits24]
3215
+
3216
+
3217
+ # In[321]:
3218
+
3219
+
3220
+ embeddings24 = embeddings_model.embed_documents(
3221
+ [doc.page_content for doc in splits24],
3222
+ # normalize_embeddings=True,
3223
+ # batch_size=256,
3224
+ # show_progress_bar=True
3225
+ )
3226
+ print(embeddings24)
3227
+
3228
+
3229
+ # In[322]:
3230
+
3231
+
3232
+ ids24 = [str(uuid.uuid4()) for _ in range(len(contents24))]
3233
+
3234
+
3235
+ # In[323]:
3236
+
3237
+
3238
+ data.add(
3239
+ documents=contents24,
3240
+ embeddings=embeddings24,
3241
+ metadatas=metadata24,
3242
+ ids=ids24
3243
+ )
3244
+
3245
+
3246
+ # In[324]:
3247
+
3248
+
3249
+ append_data(contents24, metadata24, embeddings24)
3250
+
3251
+
3252
+ # In[325]:
3253
+
3254
+
3255
+ df
3256
+
3257
+
3258
+ # # LICENCE
3259
+
3260
+ # # <p style="color: orange;">Document 24 PARCOURS LMD Mathématiques Appliquées</p>
3261
+ #
3262
+
3263
+ # In[328]:
3264
+
3265
+
3266
+ loader = WebBaseLoader(
3267
+ web_paths=("http://www.parcours-lmd.salima.tn/listeueetab.php?parc=ABhRHFxzAmNUZVIoBj4ENQYgX2sBPA==&etab=VjJQYQk7",),
3268
+ bs_kwargs=dict(
3269
+ parse_only=bs4.SoupStrainer(
3270
+ class_=("center")
3271
+ )
3272
+ ),
3273
+ )
3274
+ parcours_math_appli = loader.load()
3275
+
3276
+
3277
+ # In[329]:
3278
+
3279
+
3280
+ parcours_math_appli = [
3281
+ Document(page_content=clean_text(doc.page_content), metadata=doc.metadata)
3282
+ for doc in parcours_math_appli]
3283
+ parcours_math_appli
3284
+
3285
+
3286
+ # ## splitting doc 24 into chunks
3287
+
3288
+ # In[331]:
3289
+
3290
+
3291
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
3292
+ splits25 = text_splitter.split_documents(parcours_math_appli)
3293
+
3294
+
3295
+ # In[332]:
3296
+
3297
+
3298
+ splits25
3299
+
3300
+
3301
+ # In[333]:
3302
+
3303
+
3304
+ contents25= [doc.page_content for doc in splits25]
3305
+ metadata25 = [doc.metadata for doc in splits25]
3306
+
3307
+
3308
+ # In[334]:
3309
+
3310
+
3311
+ embeddings25 = embeddings_model.embed_documents(
3312
+ [doc.page_content for doc in splits25],
3313
+ # normalize_embeddings=True,
3314
+ # batch_size=256,
3315
+ # show_progress_bar=True
3316
+ )
3317
+ print(embeddings25)
3318
+
3319
+
3320
+ # In[335]:
3321
+
3322
+
3323
+ ids25 = [str(uuid.uuid4()) for _ in range(len(contents25))]
3324
+
3325
+
3326
+ # In[336]:
3327
+
3328
+
3329
+ data.add(
3330
+ documents=contents25,
3331
+ embeddings=embeddings25,
3332
+ metadatas=metadata25,
3333
+ ids=ids25
3334
+ )
3335
+
3336
+
3337
+ # In[337]:
3338
+
3339
+
3340
+ append_data(contents25, metadata25, embeddings25)
3341
+
3342
+
3343
+ # In[338]:
3344
+
3345
+
3346
+ df
3347
+
3348
+
3349
+ # # <p style="color: orange;"> Document 25 parcours lmd Computer Science</p>
3350
+ #
3351
+
3352
+ # In[340]:
3353
+
3354
+
3355
+ loader = WebBaseLoader(
3356
+ web_paths=("http://www.parcours-lmd.salima.tn/listeueetab.php?parc=UkpTHlxzUzJXZlctDjJTYFZwDDI=&etab=VjJZaAg6",),
3357
+ bs_kwargs=dict(
3358
+ parse_only=bs4.SoupStrainer(
3359
+ class_=("center")
3360
+ )
3361
+ ),
3362
+ )
3363
+ parcours_computer_science = loader.load()
3364
+
3365
+
3366
+ # In[341]:
3367
+
3368
+
3369
+ parcours_computer_science = [
3370
+ Document(page_content=clean_text(doc.page_content), metadata=doc.metadata)
3371
+ for doc in parcours_computer_science]
3372
+ parcours_computer_science
3373
+
3374
+
3375
+ # ## splitting doc 25 into chunks
3376
+
3377
+ # In[343]:
3378
+
3379
+
3380
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
3381
+ splits26 = text_splitter.split_documents(parcours_computer_science)
3382
+
3383
+
3384
+ # In[344]:
3385
+
3386
+
3387
+ splits26
3388
+
3389
+
3390
+ # In[345]:
3391
+
3392
+
3393
+ contents26= [doc.page_content for doc in splits26]
3394
+ metadata26= [doc.metadata for doc in splits26]
3395
+
3396
+
3397
+ # In[346]:
3398
+
3399
+
3400
+ embeddings26 = embeddings_model.embed_documents(
3401
+ [doc.page_content for doc in splits26],
3402
+ # normalize_embeddings=True,
3403
+ # batch_size=256,
3404
+ # show_progress_bar=True
3405
+ )
3406
+ print(embeddings26)
3407
+
3408
+
3409
+ # In[347]:
3410
+
3411
+
3412
+ ids26 = [str(uuid.uuid4()) for _ in range(len(contents26))]
3413
+
3414
+
3415
+ # In[348]:
3416
+
3417
+
3418
+ data.add(
3419
+ documents=contents26,
3420
+ embeddings=embeddings26,
3421
+ metadatas=metadata26,
3422
+ ids=ids26
3423
+ )
3424
+
3425
+
3426
+ # In[349]:
3427
+
3428
+
3429
+ append_data(contents26, metadata26, embeddings26)
3430
+
3431
+
3432
+ # In[350]:
3433
+
3434
+
3435
+ df
3436
+
3437
+
3438
+ # # <p style="color: orange;"> Document 26 Parcours LMD Mesures et Instrumentation</p>
3439
+ #
3440
+
3441
+ # In[352]:
3442
+
3443
+
3444
+ loader = WebBaseLoader(
3445
+ web_paths=("http://www.parcours-lmd.salima.tn/listeueetab.php?parc=W0NXGlp1UjNWZwN5BzkHMVN1DzsBPA==&etab=BGBYaQw+",),
3446
+ bs_kwargs=dict(
3447
+ parse_only=bs4.SoupStrainer(
3448
+ class_=("center")
3449
+ )
3450
+ ),
3451
+ )
3452
+ parcours_Mesures = loader.load()
3453
+
3454
+
3455
+ # In[353]:
3456
+
3457
+
3458
+ parcours_Mesures = [
3459
+ Document(page_content=clean_text(doc.page_content), metadata=doc.metadata)
3460
+ for doc in parcours_Mesures]
3461
+ parcours_Mesures
3462
+
3463
+
3464
+ # ## splitting doc 26 into chunks
3465
+
3466
+ # In[355]:
3467
+
3468
+
3469
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
3470
+ splits27 = text_splitter.split_documents(parcours_Mesures)
3471
+
3472
+
3473
+ # In[356]:
3474
+
3475
+
3476
+ splits27
3477
+
3478
+
3479
+ # In[357]:
3480
+
3481
+
3482
+ contents27= [doc.page_content for doc in splits27]
3483
+ metadata27= [doc.metadata for doc in splits27]
3484
+
3485
+
3486
+ # In[358]:
3487
+
3488
+
3489
+ embeddings27 = embeddings_model.embed_documents(
3490
+ [doc.page_content for doc in splits27],
3491
+ # normalize_embeddings=True,
3492
+ # batch_size=256,
3493
+ # show_progress_bar=True
3494
+ )
3495
+ print(embeddings27)
3496
+
3497
+
3498
+ # In[359]:
3499
+
3500
+
3501
+ ids27 = [str(uuid.uuid4()) for _ in range(len(contents27))]
3502
+
3503
+
3504
+ # In[360]:
3505
+
3506
+
3507
+ data.add(
3508
+ documents=contents27,
3509
+ embeddings=embeddings27,
3510
+ metadatas=metadata27,
3511
+ ids=ids27
3512
+ )
3513
+
3514
+
3515
+ # In[361]:
3516
+
3517
+
3518
+ append_data(contents27, metadata27, embeddings27)
3519
+
3520
+
3521
+ # In[362]:
3522
+
3523
+
3524
+ df
3525
+
3526
+
3527
+ # # <p style="color: orange;">Document 27 Parcours LMD Physique </p>
3528
+ #
3529
+
3530
+ # In[364]:
3531
 
 
3532
 
3533
+ loader = WebBaseLoader(
3534
+ web_paths=("http://www.parcours-lmd.salima.tn/listeueetab.php?parc=W0NZFFp1UjNcbVshDjAENlJ0X2tTbg==&etab=AWUDMl9t",),
3535
+ bs_kwargs=dict(
3536
+ parse_only=bs4.SoupStrainer(
3537
+ class_=("center")
3538
+ )
3539
+ ),
3540
+ )
3541
+ parcours_physique = loader.load()
3542
+
3543
+
3544
+ # In[365]:
3545
+
3546
+
3547
+ parcours_physique = [
3548
+ Document(page_content=clean_text(doc.page_content), metadata=doc.metadata)
3549
+ for doc in parcours_physique]
3550
+ parcours_physique
3551
+
3552
+
3553
+ # ## splitting doc 27 into chunks
3554
+
3555
+ # In[367]:
3556
+
3557
+
3558
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
3559
+ splits28 = text_splitter.split_documents(parcours_physique)
3560
+
3561
+
3562
+ # In[368]:
3563
+
3564
+
3565
+ splits28
3566
+
3567
+
3568
+ # In[369]:
3569
+
3570
+
3571
+ contents28= [doc.page_content for doc in splits28]
3572
+ metadata28= [doc.metadata for doc in splits28]
3573
+
3574
+
3575
+ # In[370]:
3576
+
3577
+
3578
+ embeddings28 = embeddings_model.embed_documents(
3579
+ [doc.page_content for doc in splits28],
3580
+ # normalize_embeddings=True,
3581
+ # batch_size=256,
3582
+ # show_progress_bar=True
3583
+ )
3584
+ print(embeddings28)
3585
+
3586
+
3587
+ # In[371]:
3588
+
3589
+
3590
+ ids28 = [str(uuid.uuid4()) for _ in range(len(contents28))]
3591
+
3592
+
3593
+ # In[372]:
3594
+
3595
+
3596
+ data.add(
3597
+ documents=contents28,
3598
+ embeddings=embeddings28,
3599
+ metadatas=metadata28,
3600
+ ids=ids28
3601
+ )
3602
+
3603
+
3604
+ # In[373]:
3605
+
3606
+
3607
+ append_data(contents28, metadata28, embeddings28)
3608
+
3609
+
3610
+ # In[374]:
3611
+
3612
+
3613
+ df
3614
+
3615
+
3616
+ # # <p style="color: orange;">Document 28 Parcours LMD chimie </p>
3617
+ #
3618
+
3619
+ # In[376]:
3620
+
3621
+
3622
+ loader = WebBaseLoader(
3623
+ web_paths=("http://www.parcours-lmd.salima.tn/listeueetab.php?parc=W0NYFV9wVDVcbQF7BzkKPQQiCz8HOg==&etab=B2NUZQAy",),
3624
+ bs_kwargs=dict(
3625
+ parse_only=bs4.SoupStrainer(
3626
+ class_=("center")
3627
+ )
3628
+ ),
3629
+ )
3630
+ parcours_chimie = loader.load()
3631
+
3632
+
3633
+ # In[377]:
3634
+
3635
+
3636
+ parcours_chimie = [
3637
+ Document(page_content=clean_text(doc.page_content), metadata=doc.metadata)
3638
+ for doc in parcours_chimie]
3639
+ parcours_chimie
3640
+
3641
+
3642
+ # ## splitting doc 28 into chunks
3643
+
3644
+ # In[379]:
3645
+
3646
+
3647
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
3648
+ splits29= text_splitter.split_documents(parcours_chimie)
3649
+
3650
+
3651
+ # In[380]:
3652
+
3653
+
3654
+ splits29
3655
+
3656
+
3657
+ # In[381]:
3658
+
3659
+
3660
+ contents29= [doc.page_content for doc in splits29]
3661
+ metadata29= [doc.metadata for doc in splits29]
3662
+
3663
+
3664
+ # In[382]:
3665
+
3666
+
3667
+ embeddings29 = embeddings_model.embed_documents(
3668
+ [doc.page_content for doc in splits29],
3669
+ # normalize_embeddings=True,
3670
+ # batch_size=256,
3671
+ # show_progress_bar=True
3672
+ )
3673
+ print(embeddings29)
3674
+
3675
+
3676
+ # In[383]:
3677
+
3678
+
3679
+ ids29 = [str(uuid.uuid4()) for _ in range(len(contents29))]
3680
+
3681
+
3682
+ # In[384]:
3683
+
3684
+
3685
+ data.add(
3686
+ documents=contents29,
3687
+ embeddings=embeddings29,
3688
+ metadatas=metadata29,
3689
+ ids=ids29
3690
+ )
3691
+
3692
+
3693
+ # In[385]:
3694
+
3695
+
3696
+ append_data(contents29, metadata29, embeddings29)
3697
+
3698
+
3699
+ # In[386]:
3700
+
3701
+
3702
+ df
3703
+
3704
+
3705
+ # # <p style="color: orange;"> Document 29 Parcours LMD Physique-Chimie</p>
3706
+ #
3707
+
3708
+ # In[388]:
3709
+
3710
+
3711
+ loader = WebBaseLoader(
3712
+ web_paths=("http://www.parcours-lmd.salima.tn/listeueetab.php?parc=Bh4HSlh3VTQGN1ctVWsAMVJ0DjA=&etab=VjJZaA0/",),
3713
+ bs_kwargs=dict(
3714
+ parse_only=bs4.SoupStrainer(
3715
+ class_=("center")
3716
+ )
3717
+ ),
3718
+ )
3719
+ parcours_physique_chimie = loader.load()
3720
+
3721
+
3722
+ # In[389]:
3723
+
3724
+
3725
+ parcours_physique_chimie = [
3726
+ Document(page_content=clean_text(doc.page_content), metadata=doc.metadata)
3727
+ for doc in parcours_physique_chimie]
3728
+ parcours_physique_chimie
3729
+
3730
+
3731
+ # ## splitting doc 29 into chunks
3732
+
3733
+ # In[391]:
3734
+
3735
+
3736
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
3737
+ splits30= text_splitter.split_documents(parcours_physique_chimie)
3738
+
3739
+
3740
+ # In[392]:
3741
+
3742
+
3743
+ splits30
3744
+
3745
+
3746
+ # In[393]:
3747
+
3748
+
3749
+ contents30= [doc.page_content for doc in splits30]
3750
+ metadata30= [doc.metadata for doc in splits30]
3751
+
3752
+
3753
+ # In[394]:
3754
+
3755
+
3756
+ embeddings30 = embeddings_model.embed_documents(
3757
+ [doc.page_content for doc in splits30],
3758
+ # normalize_embeddings=True,
3759
+ # batch_size=256,
3760
+ # show_progress_bar=True
3761
+ )
3762
+ print(embeddings30)
3763
+
3764
+
3765
+ # In[395]:
3766
+
3767
+
3768
+ ids30 = [str(uuid.uuid4()) for _ in range(len(contents30))]
3769
+
3770
+
3771
+ # In[396]:
3772
+
3773
+
3774
+ data.add(
3775
+ documents=contents30,
3776
+ embeddings=embeddings30,
3777
+ metadatas=metadata30,
3778
+ ids=ids30
3779
+ )
3780
+
3781
 
3782
+ # In[397]:
3783
+
3784
+
3785
+ append_data(contents30, metadata30, embeddings30)
3786
+
3787
+
3788
+
3789
+
3790
+
3791
+ df
3792
+
3793
+ # # <p style="color: orange;">Document 30 Demande de diplômes</p>
3794
+
3795
+
3796
+ loader = WebBaseLoader(
3797
+ web_paths=("https://fsm.rnu.tn/fra/articles/1249/demande-de-diplomes",),
3798
+ bs_kwargs=dict(
3799
+ parse_only=bs4.SoupStrainer(
3800
+ class_=("content")
3801
+ )
3802
+ ),
3803
+ )
3804
+ doc_demande_de_diplome = loader.load()
3805
+
3806
+
3807
+ # In[401]:
3808
+
3809
+
3810
+ doc_demande_de_diplome = [
3811
+ Document(page_content=clean_text(doc.page_content), metadata=doc.metadata)
3812
+ for doc in doc_demande_de_diplome]
3813
+ doc_demande_de_diplome
3814
+
3815
+
3816
+ # ## splitting doc 30 into chunks
3817
+
3818
+ # In[403]:
3819
+
3820
+
3821
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
3822
+ splits31 = text_splitter.split_documents(doc_demande_de_diplome)
3823
+
3824
+
3825
+ # In[404]:
3826
+
3827
+
3828
+ splits31
3829
+
3830
+
3831
+ # In[405]:
3832
+
3833
+
3834
+ contents31= [doc.page_content for doc in splits31]
3835
+ metadata31= [doc.metadata for doc in splits31]
3836
+
3837
+
3838
+ # In[406]:
3839
+
3840
+
3841
+ embeddings31 = embeddings_model.embed_documents(
3842
+ [doc.page_content for doc in splits31],
3843
+ # normalize_embeddings=True,
3844
+ # batch_size=256,
3845
+ # show_progress_bar=True
3846
+ )
3847
+ print(embeddings31)
3848
+
3849
+
3850
+ # In[407]:
3851
+
3852
+
3853
+ ids31 = [str(uuid.uuid4()) for _ in range(len(contents31))]
3854
+
3855
+
3856
+ # In[408]:
3857
+
3858
+
3859
+ data.add(
3860
+ documents=contents31,
3861
+ embeddings=embeddings31,
3862
+ metadatas=metadata31,
3863
+ ids=ids31
3864
+ )
3865
+
3866
+
3867
+ # In[409]:
3868
+
3869
+
3870
+ append_data(contents31, metadata31, embeddings31)
3871
+
3872
+
3873
+ # In[410]:
3874
+
3875
+
3876
+ df
3877
+
3878
+
3879
+ # # <p style="color: orange;">Document 31 Information sur le master de recherche en mathématiques </p>
3880
+ #
3881
+
3882
+ # In[412]:
3883
+
3884
+
3885
+ loader = WebBaseLoader(
3886
+ web_paths=("https://um.rnu.tn/fr/formations/formation-lmd/master/mat%C3%A8re-de-recherche-en-math%C3%A9matiques-fsm/",),
3887
+ bs_kwargs=dict(
3888
+ parse_only=bs4.SoupStrainer(
3889
+ class_=("single-post-content single-content")
3890
+ )
3891
+ ),
3892
+ )
3893
+ info_supp_mastere_math = loader.load()
3894
+
3895
+
3896
+ # In[413]:
3897
+
3898
+
3899
+ info_supp_mastere_math = [
3900
+ Document(page_content=clean_text(doc.page_content), metadata=doc.metadata)
3901
+ for doc in info_supp_mastere_math]
3902
+ info_supp_mastere_math
3903
+
3904
+
3905
+ # ## splitting doc 31 into chunks
3906
+
3907
+ # In[415]:
3908
+
3909
+
3910
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100, separators=["\n\n", "\n", ".", " "])
3911
+ splits32 = text_splitter.split_documents(info_supp_mastere_math)
3912
+
3913
+
3914
+ # In[416]:
3915
+
3916
+
3917
+ splits32
3918
+
3919
+
3920
+ # In[417]:
3921
+
3922
+
3923
+ contents32= [doc.page_content for doc in splits32]
3924
+ metadata32 = [doc.metadata for doc in splits32]
3925
+
3926
+
3927
+ # In[418]:
3928
+
3929
+
3930
+ embeddings32 = embeddings_model.embed_documents(
3931
+ [doc.page_content for doc in splits32],
3932
+ # normalize_embeddings=True,
3933
+ # batch_size=256,
3934
+ # show_progress_bar=True
3935
+ )
3936
+ print(embeddings32)
3937
+
3938
+
3939
+ # In[419]:
3940
+
3941
+
3942
+ ids32 = [str(uuid.uuid4()) for _ in range(len(contents32))]
3943
+
3944
+
3945
+ # In[420]:
3946
+
3947
+
3948
+ data.add(
3949
+ documents=contents32,
3950
+ embeddings=embeddings32,
3951
+ metadatas=metadata32,
3952
+ ids=ids32
3953
+ )
3954
+
3955
+
3956
+ # In[421]:
3957
+
3958
+
3959
+ append_data(contents32, metadata32, embeddings32)
3960
+
3961
+
3962
+
3963
+ # Note: this replaces the `data` collection handle with a plain dict of its stored embeddings;
+ # the collection itself is re-fetched from chroma_client further below before querying.
+ data = data.get(include=['embeddings'])
3964
+ print(data)
3965
+
3966
+
3967
+ # In[427]:
3968
+
3969
+
3970
+ if 'embeddings' in data:
3971
+ embeddings_array = np.array(data['embeddings'])
3972
+ print("Embeddings shape:", embeddings_array.shape)
3973
+ else:
3974
+ embeddings_array = np.array([])  # avoid a NameError in the PCA cell below when the collection is empty
+ print("No embeddings found in vectorstore.")
3975
+
3976
+
3977
+ # In[428]:
3978
+
3979
+
3980
+ if embeddings_array.size > 0:
3981
+ pca = PCA(n_components=2)
3982
+ embeddings_2d = pca.fit_transform(embeddings_array)
3983
+
3984
+ # Plot embeddings
3985
+ plt.figure(figsize=(8, 6))
3986
+ plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], alpha=0.7)
3987
+ plt.xlabel("PCA 1")
3988
+ plt.ylabel("PCA 2")
3989
+ plt.title("2D Visualization of Embeddings")
3990
+ plt.show()
3991
+ else:
3992
+ print("No embeddings available for PCA visualization.")
3993
+
3994
+
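+ # plt.show() only renders in an interactive session; when this script runs headless (for
+ # example on a Space), writing the figure to disk is an alternative. A minimal sketch reusing
+ # embeddings_array and embeddings_2d from the cell above (the filename is arbitrary):
+ if embeddings_array.size > 0:
+     fig, ax = plt.subplots(figsize=(8, 6))
+     ax.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], alpha=0.7)
+     ax.set_xlabel("PCA 1")
+     ax.set_ylabel("PCA 2")
+     ax.set_title("2D Visualization of Embeddings")
+     fig.savefig("embeddings_pca.png", dpi=150, bbox_inches="tight")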
3995
+ # # Manually testing retrieval, 2nd attempt, just checking 👌
3996
+
3997
+ # In[430]:
3998
+
3999
+
4000
+ data = chroma_client.get_collection(name="my_dataaaa")
4001
+
4002
+
4003
+ # In[431]:
4004
+
4005
+
4006
+ query_embedding = embeddings_model.embed_query("Quelles sont les documents de stage obligatoire?")
4007
+
4008
+ results = data.query(
4009
+ query_embeddings=[query_embedding],
4010
+ n_results=50
4011
+ )
4012
+
4013
+
4014
+ # In[432]:
4015
+
4016
+
4017
+ results
4018
+
4019
+
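+ # The raw results dict is hard to scan. An optional sketch pairing each retrieved chunk with
+ # its distance (Chroma returns documents, metadatas and distances by default, one inner list
+ # per query embedding):
+ for doc_text, dist in zip(results["documents"][0], results["distances"][0]):
+     print(f"{dist:.4f}  {doc_text[:120]}")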
4020
+ # In[783]:
4021
+
4022
+
4023
+ chroma_client = chromadb.PersistentClient(path="chroma_db")
4024
+ collections = chroma_client.list_collections()
4025
+ print("Available collections:", collections)
4026
+ if "my_dataaaa" in collections:
4027
+ collection = chroma_client.get_collection(name="my_dataaaa")
4028
+ print(" Successfully loaded collection:", collection)
4029
+ else:
4030
+ print("Collection 'my_dataaaa' does not exist.", collections)
4031
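+ # Depending on the chromadb version, list_collections() returns either Collection objects or
+ # plain names, so the membership test above can silently miss an existing collection. A
+ # version-tolerant sketch (optional):
+ collection_names = [getattr(c, "name", c) for c in collections]
+ if "my_dataaaa" in collection_names:
+     collection = chroma_client.get_collection(name="my_dataaaa")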
  embeddings_model = HuggingFaceEmbeddings(model_name="HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1.5")
4032
 
4033
  model = AutoModelForSequenceClassification.from_pretrained("facebook/bart-large-mnli")
 
4039
  result = classifier(text, candidate_labels=["question", "greeting", "small talk", "feedback", "thanks"])
4040
  label = result["labels"][0]
4041
  return label.lower()
4042
+
4043
+ chroma_db_path = "./chroma_db"
 
 
 
 
4044
  chroma_client = chromadb.PersistentClient(path=chroma_db_path)
4045
 
4046
  data = chroma_client.get_collection(name="my_dataaaa")
4047
  vectorstore = Chroma(
4048
  collection_name="my_dataaaa",
4049
+ persist_directory="./chroma_db",
4050
  embedding_function=embeddings_model
4051
  )
4052
 
 
4087
  def format_docs(docs):
4088
  return "\n\n".join(doc.page_content for doc in docs)
4089
 
4090
+ # Build the context string once here just to inspect it; the rag_chain below applies
+ # format_docs to the retrieved documents at query time.
+ context = format_docs(docs)
4091
+ context
4092
 
4093
  rag_chain = (
4094
  {
 
4226
  )
4227
  gr.Markdown("© 2025 Esra Belhassen. All rights reserved")
4228
 
4229
+ # share=True mainly matters for local runs; a hosted Space already serves the app publicly.
+ chat.launch(share=True)
4230