Commit 029c0b0 by ramysaidagieb · verified · 1 Parent(s): 767177a

Upload 3 files

Files changed (3)
  1. README.md +123 -0
  2. app.py +247 -0
  3. requirements.txt +7 -0
README.md ADDED
@@ -0,0 +1,123 @@
+ ---
+ title: Arabic Book Analysis AI
+ emoji: 📚
+ colorFrom: blue
+ colorTo: green
+ sdk: gradio
+ sdk_version: "4.31.0"
+ app_file: app.py
+ pinned: false
+ ---
+
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+
+ # Arabic Book Analysis AI
+
+ This Hugging Face Space hosts an AI system that analyzes Arabic books and answers questions based solely on their content. Users upload books in .docx or .pdf format; the system fine-tunes an AraBERT model on the extracted text and answers in Arabic.
+
+ ## Features
+ - Upload multiple books (.docx, .pdf).
+ - Visual training progress with detailed logs: "Loading book", "Extracting ideas", "Training in progress", "Training completed".
+ - Question-answering interface in Arabic, restricted to book content.
+ - Separate tab for question-answering only, shareable with friends.
+ - Formal answer style, suitable for scholarly texts.
+
+ ## Configuration
+ To deploy correctly on Hugging Face Spaces, configure the Space with the following settings:
+ - **SDK**: Gradio (specified in the YAML front matter above)
+ - **SDK Version**: 4.31.0 (matches `requirements.txt`)
+ - **App File**: `app.py` (entry point for the Gradio app)
+ - **Visibility**: Public (required for sharing the question-answering link)
+ - **Hardware**: CPU (default free tier is sufficient; upgrade to GPU for faster training if needed)
+ - **File Structure**:
+   - Place `app.py`, `requirements.txt`, and `README.md` in the root directory (`/`).
+   - No additional folders are required; the app expects all files at the root level.
+ - **Environment Variables**: None required; all dependencies are listed in `requirements.txt`.
+ - **Persistence**: The fine-tuned model is saved to `./fine_tuned_model` within the Space’s storage for reuse.
+ - **Hugging Face Space Name**: Use `arabic-book-analysis` for consistency with the provided links.
+
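The Configuration list says the `sdk_version` in the front matter should match the Gradio pin in `requirements.txt`. A minimal sketch of that check follows, assuming both files are available as plain text. The helper names (`front_matter_value`, `pinned_version`, `sdk_matches_pin`) are hypothetical and are not part of this commit; the front matter is read with simple string handling rather than a YAML parser.

```python
import re
from typing import Optional


def front_matter_value(readme_text: str, key: str) -> Optional[str]:
    """Return the value of `key` from the leading '---' block, if any."""
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    for line in lines[1:]:
        if line.strip() == "---":  # end of the front matter block
            break
        name, sep, value = line.partition(":")
        if sep and name.strip() == key:
            return value.strip().strip("\"'")
    return None


def pinned_version(requirements_text: str, package: str) -> Optional[str]:
    """Return the version from a 'package==version' line, if present."""
    pattern = re.compile(rf"^{re.escape(package)}==(\S+)$", re.IGNORECASE)
    for line in requirements_text.splitlines():
        match = pattern.match(line.strip())
        if match:
            return match.group(1)
    return None


def sdk_matches_pin(readme_text: str, requirements_text: str) -> bool:
    """True when the front matter sdk_version equals the gradio pin."""
    sdk_version = front_matter_value(readme_text, "sdk_version")
    return sdk_version is not None and sdk_version == pinned_version(requirements_text, "gradio")


# Sample inputs shortened from the files in this commit
README_SAMPLE = (
    "---\n"
    "title: Arabic Book Analysis AI\n"
    "sdk: gradio\n"
    'sdk_version: "4.31.0"\n'
    "app_file: app.py\n"
    "---\n"
    "# Arabic Book Analysis AI\n"
)
REQUIREMENTS_SAMPLE = "gradio==4.31.0\ntransformers==4.38.2\n"

print(sdk_matches_pin(README_SAMPLE, REQUIREMENTS_SAMPLE))  # prints True
```

Running a check like this before pushing may catch a mismatch between the two files earlier than the Space build does.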
+ ## How to Use
+ ### Training and Question Interface
+ 1. **Upload Books**:
+    - Access the main interface at `/` and select the "التدريب والسؤال" ("Training and Questions") tab.
+    - Click the "رفع الكتب" ("Upload books") field to select .docx or .pdf files.
+    - Press "رفع وتدريب" ("Upload and train") to start processing and training.
+    - View uploaded files, training logs, and status.
+    - After training, see the message: "Training process finished: Enter your question".
+
+ 2. **Ask Questions**:
+    - In the same tab, enter an Arabic question in the "أدخل سؤالك بالعربية" ("Enter your question in Arabic") field.
+    - Click "طرح السؤال" ("Ask question") to get an answer based on the book’s content.
+
+ ### Question-Answering Only Interface
+ - Share the main link with friends: `https://huggingface.co/spaces/your_huggingface_username/arabic-book-analysis`
+ - Users can:
+   - Select the "طرح الأسئلة فقط" ("Ask questions only") tab.
+   - Enter an Arabic question in the provided field.
+   - Receive answers based on the trained model’s knowledge.
+ - No training required.
+
+ ### Example
+ - Question: "ما هو قانون الإيمان وفقًا للكتاب؟" ("What is the creed according to the book?")
+ - Answer: "قانون الإيمان هو أساس العقيدة المسيحية، ويؤمن به كل الكنائس المسيحية في العالم..." ("The creed is the foundation of Christian doctrine, and all Christian churches in the world believe in it...")
+
+ ## Requirements
+ - Python 3.8+
+ - Dependencies listed in `requirements.txt`
+
+ ## Deployment
+ 1. **Create/Update Hugging Face Space**:
+    - Log in to Hugging Face with your username (replace `your_huggingface_username`).
+    - Create a new Space named "arabic-book-analysis" or update an existing one.
+    - Select "Gradio" as the SDK and set visibility to "Public".
+
+ 2. **Upload Zipped Folder**:
+    - Run the provided `create_zip.py` script to generate `arabic_book_analysis.zip`.
+    - In the Hugging Face Space, go to the "Files" tab.
+    - Upload `arabic_book_analysis.zip` and extract it.
+    - Move `app.py`, `requirements.txt`, and `README.md` from `/arabic_book_analysis/` to the root directory (`/`).
+    - Confirm that all three files are at the root.
+
+ 3. **Build and Launch**:
+    - Hugging Face Spaces will automatically install dependencies from `requirements.txt` and launch the app.
+    - Monitor build logs for errors.
+
+ 4. **Access Links**:
+    - Main interface (both tabs): `https://huggingface.co/spaces/your_huggingface_username/arabic-book-analysis`
+
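The deployment steps depend on `app.py`, `requirements.txt`, and `README.md` sitting directly in the root directory. A small local check along these lines could confirm that before uploading. It is only a sketch: `missing_root_files` and `REQUIRED_ROOT_FILES` are hypothetical names, not part of this commit.

```python
from pathlib import Path
from typing import List

# The three files the README expects at the root of the Space
REQUIRED_ROOT_FILES = ("app.py", "requirements.txt", "README.md")


def missing_root_files(root: str = ".") -> List[str]:
    """List required files that are not directly inside `root`."""
    root_path = Path(root)
    # Files in subfolders do not count; only direct children are checked
    return [name for name in REQUIRED_ROOT_FILES if not (root_path / name).is_file()]


if __name__ == "__main__":
    missing = missing_root_files(".")
    if missing:
        print("Missing from the root directory:", ", ".join(missing))
    else:
        print("All required files are at the root.")
```

A file that exists only inside a subfolder such as `arabic_book_analysis/` is reported as missing, which matches the "Files not found" situation the README describes.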
+ ## Troubleshooting Configuration and Runtime Errors
+ - **Error: "Invalid SDK"**:
+   - Ensure the Space settings select the Gradio SDK and the YAML front matter specifies `sdk: gradio`.
+ - **Error: "Files not found"**:
+   - Verify that `app.py`, `requirements.txt`, and `README.md` are in the root directory (`/`), not in subfolders.
+   - Re-upload `arabic_book_analysis.zip`, extract it, and move files to the root using the Hugging Face UI.
+ - **Error: "Dependency installation failed"**:
+   - Check `requirements.txt` for correct package versions.
+   - Review the Space’s build logs for specific errors.
+ - **Error: "Private Space access denied"**:
+   - Set the Space visibility to "Public" to enable the question-answering tab for friends.
+ - **Error: "Model not found"**:
+   - Ensure training has completed at least once so that the fine-tuned model is saved to `./fine_tuned_model`.
+ - **Error: "FileNotFoundError: No such file or directory: 'java'"**:
+   - The `arabert` library’s Farasa dependency requires Java, which is not installed by default. This project uses a custom `preprocess_arabic_text` function in `app.py` to avoid Farasa and Java requirements. If this error occurs, ensure `app.py` uses `preprocess_arabic_text` instead of `ArabertPreprocessor`.
+ - **Error: "TypeError: ArabertPreprocessor.__init__() got an unexpected keyword argument 'use_farasapy'"**:
+   - This occurred because `arabert==1.0.1` does not support the `use_farasapy` parameter. The current `app.py` avoids `ArabertPreprocessor` entirely, using a custom preprocessing function.
+ - **Error: "FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0"**:
+   - This warning from `huggingface_hub` (used by `transformers`) may cause the container to exit if warnings are treated as errors. The current `app.py` suppresses it with a `warnings.filterwarnings` call. If the error persists, check build logs for additional issues during model/tokenizer loading.
+ - **Error: "TypeError: App.create_app() missing 1 required positional argument: 'blocks'"**:
+   - This occurred due to incorrect Gradio app setup in `app.py`. The current `app.py` uses a single `gr.Blocks` app with tabs, launched directly with `demo.launch()`, avoiding manual app mounting.
+ - **Warning: "Some weights of BertForQuestionAnswering were not initialized"**:
+   - This warning appears when loading `aubmindlab/bert-base-arabertv2` for question-answering, because the checkpoint is not pre-fine-tuned for this task and its question-answering head starts untrained. It is expected, as the model is meant to be fine-tuned during training. Ensure training is completed so that these weights are trained.
+
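The `resume_download` entry above relies on a warnings filter. The following sketch illustrates how an "ignore" filter for `FutureWarning` behaves when all other warnings are treated as errors. `emit_deprecation` is a hypothetical stand-in for the library call, and the `module=` restriction used in `app.py` is left out so the example runs on its own.

```python
import warnings


def emit_deprecation() -> str:
    """Stand-in for a library call that issues a FutureWarning."""
    warnings.warn("`resume_download` is deprecated", FutureWarning)
    return "loaded"


with warnings.catch_warnings(record=True) as caught:
    # Treat every warning as an error, as a strict container might
    warnings.simplefilter("error")
    # Filters added later are checked first, so FutureWarning is ignored
    warnings.filterwarnings("ignore", category=FutureWarning)
    result = emit_deprecation()

print(result, len(caught))  # prints: loaded 0
```

Because the ignore filter is consulted before the error filter, the call returns normally and nothing is recorded. Order matters here: registering the ignore filter before `simplefilter("error")` would let the error filter win.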
+ ## Notes
+ - Ensure books are in Arabic for accurate processing.
+ - The system is optimized for the Hugging Face Spaces free tier; training may take a few minutes.
+ - Replace `your_huggingface_username` with your actual Hugging Face username.
+ - Training logs are displayed in the interface for transparency.
+ - The zipped folder simplifies uploads; ensure files are moved to the root directory after extraction.
+ - The `arabert` library’s Farasa dependency is bypassed using a custom `preprocess_arabic_text` function to avoid Java requirements. This may slightly reduce preprocessing capabilities but ensures compatibility with Hugging Face Spaces.
+ - The `huggingface_hub` warning is suppressed in `app.py`. Future updates to `transformers` may resolve this when `huggingface_hub>=1.0.0` is released.
+ - The model weights warning is normal and does not affect functionality after training.
+
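To make the trade-off of the custom preprocessing concrete, here is a worked example that mirrors the regular-expression steps of `preprocess_arabic_text` from `app.py`, with logging removed. The function name `normalize_arabic` and the sample sentence are illustrative assumptions, not part of the commit.

```python
import re

# Same diacritic ranges as in app.py
DIACRITICS = re.compile(r"[\u0617-\u061A\u064B-\u0652]")


def normalize_arabic(text: str) -> str:
    """Apply the same normalization steps as preprocess_arabic_text."""
    # Drop diacritics (tashkeel)
    text = DIACRITICS.sub("", text)
    # Unify alef variants, alef maqsura, and ta marbuta
    text = re.sub(r"[أإآ]", "ا", text)
    text = re.sub(r"ى", "ي", text)
    text = re.sub(r"ة", "ه", text)
    # Collapse whitespace, then strip punctuation
    text = re.sub(r"\s+", " ", text)
    text = re.sub(r"[^\w\s]", "", text)
    return text.strip()


sample = "قَانُونُ الإِيمَانِ،  أَسَاسُ العَقِيدَةِ."
print(normalize_arabic(sample))  # prints: قانون الايمان اساس العقيده
```

The output shows the lossy part of the normalization: hamza forms and the ta marbuta are rewritten and punctuation disappears, so answers extracted from the cleaned text will carry the same simplified spelling.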
+ ## License
+ MIT License
app.py ADDED
@@ -0,0 +1,247 @@
+ import gradio as gr
+ import os
+ import docx
+ import fitz  # PyMuPDF
+ from transformers import AutoTokenizer, AutoModelForQuestionAnswering, Trainer, TrainingArguments, pipeline
+ from datasets import Dataset
+ import re
+ import logging
+ from datetime import datetime
+ import warnings
+
+ # Suppress FutureWarning from huggingface_hub
+ warnings.filterwarnings("ignore", category=FutureWarning, module="huggingface_hub.file_download")
+
+ # Setup logging
+ logging.basicConfig(level=logging.INFO)
+ logger = logging.getLogger(__name__)
+
+ # Initialize tokenizer and model with error handling
+ model_name = "aubmindlab/bert-base-arabertv2"
+ try:
+     logger.info(f"{datetime.now()}: Loading tokenizer for {model_name}")
+     tokenizer = AutoTokenizer.from_pretrained(model_name)
+     logger.info(f"{datetime.now()}: Loading model for {model_name}")
+     model = AutoModelForQuestionAnswering.from_pretrained(model_name)
+ except Exception as e:
+     logger.error(f"{datetime.now()}: Failed to load model/tokenizer: {e}")
+     raise
+
+ # Directory to save fine-tuned model
+ MODEL_SAVE_PATH = "./fine_tuned_model"
+
+ # Custom Arabic text preprocessing function
+ def preprocess_arabic_text(text):
+     logger.info(f"{datetime.now()}: Preprocessing text (length: {len(text)} characters)")
+     # Remove Arabic diacritics
+     diacritics = re.compile(r'[\u0617-\u061A\u064B-\u0652]')
+     text = diacritics.sub('', text)
+     # Normalize Arabic characters
+     text = re.sub(r'[أإآ]', 'ا', text)
+     text = re.sub(r'ى', 'ي', text)
+     text = re.sub(r'ة', 'ه', text)
+     # Remove extra spaces and non-essential characters
+     text = re.sub(r'\s+', ' ', text)
+     text = re.sub(r'[^\w\s]', '', text)
+     logger.info(f"{datetime.now()}: Text preprocessed, new length: {len(text)} characters")
+     return text.strip()
+
+ # Function to extract text from .docx
+ def extract_text_docx(file_path):
+     logger.info(f"{datetime.now()}: Extracting text from .docx file: {file_path}")
+     try:
+         doc = docx.Document(file_path)
+         text = "\n".join([para.text for para in doc.paragraphs if para.text.strip()])
+         logger.info(f"{datetime.now()}: Successfully extracted {len(text)} characters from .docx")
+         return text
+     except Exception as e:
+         logger.error(f"{datetime.now()}: Error extracting text from .docx: {e}")
+         return ""
+
+ # Function to extract text from .pdf
+ def extract_text_pdf(file_path):
+     logger.info(f"{datetime.now()}: Extracting text from .pdf file: {file_path}")
+     try:
+         doc = fitz.open(file_path)
+         text = ""
+         for page in doc:
+             text += page.get_text()
+         logger.info(f"{datetime.now()}: Successfully extracted {len(text)} characters from .pdf")
+         return text
+     except Exception as e:
+         logger.error(f"{datetime.now()}: Error extracting text from .pdf: {e}")
+         return ""
+
+ # Function to chunk text for dataset
+ # Note: max_length counts characters (word lengths plus spaces), not tokens
+ def chunk_text(text, max_length=512):
+     logger.info(f"{datetime.now()}: Chunking text into segments")
+     words = text.split()
+     chunks = []
+     current_chunk = []
+     current_length = 0
+     for word in words:
+         current_chunk.append(word)
+         current_length += len(word) + 1
+         if current_length >= max_length:
+             chunks.append(" ".join(current_chunk))
+             current_chunk = []
+             current_length = 0
+     if current_chunk:
+         chunks.append(" ".join(current_chunk))
+     logger.info(f"{datetime.now()}: Created {len(chunks)} text chunks")
+     return chunks
+
+ # Function to prepare dataset
+ def prepare_dataset(text):
+     logger.info(f"{datetime.now()}: Preparing dataset")
+     chunks = chunk_text(text)
+     data = {"text": chunks}
+     dataset = Dataset.from_dict(data)
+     logger.info(f"{datetime.now()}: Dataset prepared with {len(dataset)} examples")
+     return dataset
+
+ # Function to tokenize dataset
+ def tokenize_dataset(dataset):
+     logger.info(f"{datetime.now()}: Tokenizing dataset")
+     def tokenize_function(examples):
+         return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)
+     tokenized_dataset = dataset.map(tokenize_function, batched=True)
+     logger.info(f"{datetime.now()}: Dataset tokenized")
+     return tokenized_dataset
+
+ # Function to fine-tune model
+ # NOTE: a question-answering model only returns a loss when each example has
+ # start_positions and end_positions. The dataset built above contains text
+ # only, so trainer.train() is expected to fail until span labels are added.
+ def fine_tune_model(dataset):
+     logger.info(f"{datetime.now()}: Starting model fine-tuning")
+     training_args = TrainingArguments(
+         output_dir="./results",
+         num_train_epochs=1,
+         per_device_train_batch_size=4,
+         save_steps=10_000,
+         save_total_limit=2,
+         logging_dir='./logs',
+         logging_steps=200,
+     )
+
+     trainer = Trainer(
+         model=model,
+         args=training_args,
+         train_dataset=dataset,
+     )
+
+     trainer.train()
+     model.save_pretrained(MODEL_SAVE_PATH)
+     tokenizer.save_pretrained(MODEL_SAVE_PATH)
+     logger.info(f"{datetime.now()}: Model fine-tuned and saved to {MODEL_SAVE_PATH}")
+
+ # Function to handle file upload and training
+ def upload_and_train(files, progress=gr.Progress()):
+     uploaded_files = []
+     all_text = ""
+     training_log = []
+
+     def log_and_update(step, desc, progress_value):
+         msg = f"{datetime.now()}: {desc}"
+         logger.info(msg)
+         training_log.append(msg)
+         progress(progress_value, desc=desc)
+         return "\n".join(training_log)
+
+     log_and_update("Starting upload", "Loading books...", 0.1)
+     # `files` is None when the button is clicked with nothing uploaded
+     for file in files or []:
+         file_name = os.path.basename(file.name)
+         uploaded_files.append(file_name)
+         # Compare extensions case-insensitively (e.g. "Book.PDF")
+         if file_name.lower().endswith(".docx"):
+             text = extract_text_docx(file.name)
+         elif file_name.lower().endswith(".pdf"):
+             text = extract_text_pdf(file.name)
+         else:
+             continue
+         all_text += text + "\n"
+
+     if not all_text.strip():
+         msg = f"{datetime.now()}: No valid text extracted from uploaded files."
+         logger.error(msg)
+         training_log.append(msg)
+         return "\n".join(training_log), uploaded_files
+
+     log_and_update("Text extraction complete", "Extracting ideas...", 0.4)
+     cleaned_text = preprocess_arabic_text(all_text)
+
+     log_and_update("Preprocessing complete", "Preparing dataset...", 0.6)
+     dataset = prepare_dataset(cleaned_text)
+     tokenized_dataset = tokenize_dataset(dataset)
+
+     log_and_update("Dataset preparation complete", "Training in progress...", 0.8)
+     fine_tune_model(tokenized_dataset)
+
+     log_and_update("Training complete", "Training completed!", 1.0)
+
+     # Example QA (only the first 512 characters of the text are used as context)
+     qa_pipeline = pipeline("question-answering", model=MODEL_SAVE_PATH, tokenizer=MODEL_SAVE_PATH)
+     example_question = "ما هو قانون الإيمان وفقًا للكتاب؟"
+     example_answer = qa_pipeline(question=example_question, context=cleaned_text[:512])["answer"]
+
+     final_message = (
+         f"Training process finished: Enter your question\n\n"
+         f"**مثال لسؤال**: {example_question}\n"
+         f"**الإجابة**: {example_answer}\n\n"
+         f"**سجل التدريب**:\n" + "\n".join(training_log)
+     )
+     return final_message, uploaded_files
+
+ # Function to answer questions
+ # NOTE: both click handlers below pass gr.State(value="") as `context`, so the
+ # context is always empty here; the pipeline needs non-empty book text.
+ def answer_question(question, context):
+     if not os.path.exists(MODEL_SAVE_PATH):
+         # "The system has not been trained yet. Please upload books and train it first."
+         return "النظام لم يتم تدريبه بعد. الرجاء رفع الكتب وتدريب النظام أولاً."
+
+     qa_pipeline = pipeline("question-answering", model=MODEL_SAVE_PATH, tokenizer=MODEL_SAVE_PATH)
+     answer = qa_pipeline(question=question, context=context[:512])["answer"]
+     return answer
+
+ # Gradio Interface with Tabs
+ with gr.Blocks(title="Arabic Book Analysis AI") as demo:
+     gr.Markdown("# نظام ذكاء اصطناعي لتحليل الكتب باللغة العربية")
+
+     with gr.Tabs():
+         with gr.TabItem("التدريب والسؤال"):
+             with gr.Row():
+                 with gr.Column():
+                     file_upload = gr.File(file_types=[".docx", ".pdf"], label="رفع الكتب", file_count="multiple")
+                     upload_button = gr.Button("رفع وتدريب")
+                     uploaded_files = gr.Textbox(label="الكتب المرفوعة")
+
+                 with gr.Column():
+                     training_status = gr.Textbox(label="حالة التدريب", lines=10)
+
+             with gr.Row():
+                 question_input = gr.Textbox(label="أدخل سؤالك بالعربية", placeholder="مثال: ما هو قانون الإيمان؟")
+                 answer_output = gr.Textbox(label="الإجابة")
+                 ask_button = gr.Button("طرح السؤال")
+
+             # Event handlers
+             upload_button.click(
+                 fn=upload_and_train,
+                 inputs=[file_upload],
+                 outputs=[training_status, uploaded_files]
+             )
+
+             ask_button.click(
+                 fn=answer_question,
+                 inputs=[question_input, gr.State(value="")],
+                 outputs=[answer_output]
+             )
+
+         with gr.TabItem("طرح الأسئلة فقط"):
+             gr.Markdown("أدخل سؤالك بالعربية وسيتم الإجابة بناءً على محتوى الكتب المدربة.")
+             question_input_qa = gr.Textbox(label="أدخل سؤالك", placeholder="مثال: ما هو قانون الإيمان؟")
+             answer_output_qa = gr.Textbox(label="الإجابة")
+             ask_button_qa = gr.Button("طرح السؤال")
+
+             ask_button_qa.click(
+                 fn=answer_question,
+                 inputs=[question_input_qa, gr.State(value="")],
+                 outputs=[answer_output_qa]
+             )
+
+ if __name__ == "__main__":
+     demo.launch()
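Both question handlers in `app.py` pass `gr.State(value="")` as the context, while an extractive question-answering pipeline needs a passage to extract from. One possible way to supply a passage is to pick the stored text chunk that shares the most words with the question. The sketch below shows that idea with plain word overlap. It is not what the commit does, and `best_chunk` is a hypothetical helper.

```python
from typing import List


def best_chunk(question: str, chunks: List[str]) -> str:
    """Return the chunk sharing the most distinct words with the question."""
    if not chunks:
        return ""
    question_words = set(question.split())
    # Score each chunk by how many question words it contains
    return max(chunks, key=lambda chunk: len(question_words & set(chunk.split())))


# Two illustrative chunks, already normalized like the app's cleaned text
chunks = [
    "الفصل الاول يتحدث عن تاريخ الكنيسه",
    "قانون الايمان هو اساس العقيده",
]
print(best_chunk("ما هو قانون الايمان", chunks))  # prints the second chunk
```

The selected chunk could then be passed as `context` to the pipeline in place of the empty state value. Word overlap is a deliberately simple heuristic; the question would need the same normalization as the chunks for the words to match.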
requirements.txt ADDED
@@ -0,0 +1,7 @@
+ gradio==4.31.0
+ transformers==4.38.2
+ datasets==2.18.0
+ arabert==1.0.1
+ python-docx==1.1.0
+ PyMuPDF==1.24.2
+ torch==2.2.1