ramysaidagieb committed on
Commit 2b4974e · verified · 1 Parent(s): d9382f5

Upload 3 files

Files changed (3):
  1. README.md +117 -0
  2. app.py +251 -0
  3. requirements.txt +7 -0
README.md ADDED
@@ -0,0 +1,117 @@
+ ---
+ title: Arabic Book Analysis AI
+ emoji: 📚
+ colorFrom: blue
+ colorTo: green
+ sdk: gradio
+ sdk_version: "4.31.0"
+ app_file: app.py
+ pinned: false
+ ---
+
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+
+ # Arabic Book Analysis AI
+
+ This Huggingface Space hosts an AI system that analyzes Arabic books and answers questions based solely on their content. The system supports uploading books in .docx and .pdf formats, fine-tunes an AraBERT model on the extracted text, and provides answers in Arabic.
+
+ ## Features
+ - Upload multiple books (.docx, .pdf).
+ - Visual training progress with detailed logs: "Loading book", "Extracting ideas", "Training in progress", "Training completed".
+ - Question-answering interface in Arabic, restricted to book content.
+ - Separate question-answering-only interface that can be shared with friends.
+ - Formal answer style, suitable for scholarly texts.
+
+ ## Configuration
+ To ensure proper deployment on Huggingface Spaces, configure the Space with the following settings:
+ - **SDK**: Gradio (specified in the YAML front matter above)
+ - **SDK Version**: 4.31.0 (matches `requirements.txt`)
+ - **App File**: `app.py` (entry point for the Gradio app)
+ - **Visibility**: Public (required for sharing the question-answering link)
+ - **Hardware**: CPU (the default free tier is sufficient; upgrade to GPU for faster training if needed)
+ - **File Structure**:
+   - Place `app.py`, `requirements.txt`, and `README.md` in the root directory (`/`).
+   - No additional folders are required; the app expects all files at the root level.
+ - **Environment Variables**: None required; all dependencies are listed in `requirements.txt`.
+ - **Persistence**: The fine-tuned model is saved to `./fine_tuned_model` inside the Space for reuse. Note that without persistent storage enabled, this directory is lost when the Space restarts and training must be rerun.
+ - **Huggingface Space Name**: Use `arabic-book-analysis` for consistency with the links below.
+
+ ## How to Use
+ ### Training Interface
+ 1. **Upload Books**:
+    - Access the main interface at `/`.
+    - Click the "رفع الكتب" (upload books) field to select .docx or .pdf files.
+    - Press "رفع وتدريب" (upload and train) to start processing and training.
+    - View uploaded files, training logs, and status.
+    - After training, the message "Training process finished: Enter your question" is shown.
+
+ 2. **Ask Questions**:
+    - Enter an Arabic question in the "أدخل سؤالك بالعربية" (enter your question in Arabic) field.
+    - Click "طرح السؤال" (ask the question) to get an answer based on the book's content.
+
+ ### Question-Answering-Only Interface
+ - Share this link with friends: `https://your_huggingface_username-arabic-book-analysis.hf.space/answer` (the Space's direct app URL).
+ - Users can:
+   - Enter an Arabic question.
+   - Receive answers based on the trained model's knowledge.
+   - No training is required.
+
+ **Example**:
+ - Question: "ما هو قانون الإيمان وفقًا للكتاب؟" (What is the creed of faith according to the book?)
+ - Answer: "قانون الإيمان هو أساس العقيدة المسيحية، ويؤمن به كل الكنائس المسيحية في العالم..." (The creed of faith is the foundation of Christian doctrine, held by all Christian churches in the world...)
+
+ ## Requirements
+ - Python 3.8+
+ - Dependencies listed in `requirements.txt`
+
+ ## Deployment
+ 1. **Create/Update the Huggingface Space**:
+    - Log in to Huggingface with your username (replace `your_huggingface_username` throughout).
+    - Create a new Space named "arabic-book-analysis" or update an existing one.
+    - Select "Gradio" as the SDK and set visibility to "Public".
+
+ 2. **Upload the Zipped Folder**:
+    - Run the provided `create_zip.py` script to generate `arabic_book_analysis.zip`.
+    - In the Huggingface Space, go to the "Files" tab.
+    - Upload `arabic_book_analysis.zip` and extract it.
+    - Move `app.py`, `requirements.txt`, and `README.md` from `/arabic_book_analysis/` to the root directory (`/`).
+    - Confirm these files are in the root.
+
+ 3. **Build and Launch**:
+    - Huggingface Spaces automatically installs dependencies from `requirements.txt` and launches the app.
+    - Monitor the build logs for errors.
+
+ 4. **Access Links**:
+    - Main interface: `https://huggingface.co/spaces/your_huggingface_username/arabic-book-analysis`
+    - Question-answering interface: `https://your_huggingface_username-arabic-book-analysis.hf.space/answer`
+
+ ## Troubleshooting Configuration and Runtime Errors
+ - **Error: "Invalid SDK"**:
+   - Ensure the Space is configured to use the Gradio SDK in the Space settings and that the YAML front matter specifies `sdk: gradio`.
+ - **Error: "Files not found"**:
+   - Verify that `app.py`, `requirements.txt`, and `README.md` are in the root directory (`/`), not in subfolders.
+   - Re-upload `arabic_book_analysis.zip`, extract it, and move the files to the root using the Huggingface UI.
+ - **Error: "Dependency installation failed"**:
+   - Check `requirements.txt` for correct package versions.
+   - Review the build logs in the Space's "Settings" tab for specific errors.
+ - **Error: "Private Space access denied"**:
+   - Set the Space visibility to "Public" so the question-answering link works for friends.
+ - **Error: "Model not found"**:
+   - Ensure training has completed at least once so the fine-tuned model is saved to `./fine_tuned_model`.
+ - **Error: "FileNotFoundError: No such file or directory: 'java'"**:
+   - The `arabert` library's Farasa dependency requires Java, which is not installed by default. This project uses a custom `preprocess_arabic_text` function in `app.py` to avoid the Farasa and Java requirements. If this error occurs, ensure `app.py` uses `preprocess_arabic_text` instead of `ArabertPreprocessor`.
+ - **Error: "TypeError: ArabertPreprocessor.__init__() got an unexpected keyword argument 'use_farasapy'"**:
+   - This occurred because `arabert==1.0.1` does not support the `use_farasapy` parameter. The current `app.py` avoids `ArabertPreprocessor` entirely, using a custom preprocessing function.
+ - **Persistent Configuration Error**:
+   - If the "Missing configuration in README" error persists, double-check that `README.md` is in the root directory and contains the exact YAML front matter shown above. Ensure the zip file is extracted correctly, move the files to the root, and restart the Space build.
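For readers hitting the Java errors above: the Farasa-free preprocessing is just a handful of regular-expression passes. This sketch mirrors the `preprocess_arabic_text` function shipped in `app.py`:

```python
import re

def preprocess_arabic_text(text: str) -> str:
    """Lightweight Arabic normalization with no Farasa/Java dependency."""
    # Strip Arabic diacritics (tashkeel)
    text = re.sub(r'[\u0617-\u061A\u064B-\u0652]', '', text)
    # Normalize alef variants, alef maqsura, and ta marbuta
    text = re.sub(r'[أإآ]', 'ا', text)
    text = re.sub(r'ى', 'ي', text)
    text = re.sub(r'ة', 'ه', text)
    # Collapse whitespace and drop punctuation
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'[^\w\s]', '', text)
    return text.strip()

print(preprocess_arabic_text("كِتَابٌ"))  # → كتاب (diacritics removed)
```

This is deliberately lossy (e.g. ta marbuta is folded into ha), which is the trade-off the Notes section mentions.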
+
+ ## Notes
+ - Ensure books are in Arabic for accurate processing.
+ - The system is optimized for Huggingface Spaces' free tier; training may take a few minutes.
+ - Replace `your_huggingface_username` with your actual Huggingface username.
+ - Training logs are displayed in the interface for transparency.
+ - The zipped folder simplifies uploads; make sure the files are moved to the root directory after extraction.
+ - The `arabert` library's Farasa dependency is bypassed with a custom `preprocess_arabic_text` function to avoid the Java requirement. This may slightly reduce preprocessing quality but ensures compatibility with Huggingface Spaces.
+
+ ## License
+ MIT License
app.py ADDED
@@ -0,0 +1,251 @@
+ import os
+ import re
+ import logging
+ from datetime import datetime
+
+ import docx
+ import fitz  # PyMuPDF
+ import gradio as gr
+ import uvicorn
+ from datasets import Dataset
+ from fastapi import FastAPI
+ from transformers import (
+     AutoTokenizer,
+     AutoModelForQuestionAnswering,
+     Trainer,
+     TrainingArguments,
+     pipeline,
+ )
+
+ # Setup logging
+ logging.basicConfig(level=logging.INFO)
+ logger = logging.getLogger(__name__)
+
+ # Initialize tokenizer and model
+ model_name = "aubmindlab/bert-base-arabertv2"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForQuestionAnswering.from_pretrained(model_name)
+
+ # Directory for the fine-tuned model; the cleaned book text is stored
+ # alongside it so the question-answering interfaces can use it as context.
+ MODEL_SAVE_PATH = "./fine_tuned_model"
+ CONTEXT_PATH = os.path.join(MODEL_SAVE_PATH, "context.txt")
+
+
+ # Custom Arabic text preprocessing (avoids the Farasa/Java dependency)
+ def preprocess_arabic_text(text):
+     logger.info(f"{datetime.now()}: Preprocessing text (length: {len(text)} characters)")
+     # Remove Arabic diacritics (tashkeel)
+     diacritics = re.compile(r'[\u0617-\u061A\u064B-\u0652]')
+     text = diacritics.sub('', text)
+     # Normalize alef variants, alef maqsura, and ta marbuta
+     text = re.sub(r'[أإآ]', 'ا', text)
+     text = re.sub(r'ى', 'ي', text)
+     text = re.sub(r'ة', 'ه', text)
+     # Collapse whitespace and drop punctuation
+     text = re.sub(r'\s+', ' ', text)
+     text = re.sub(r'[^\w\s]', '', text)
+     logger.info(f"{datetime.now()}: Text preprocessed, new length: {len(text)} characters")
+     return text.strip()
+
+
+ # Extract text from a .docx file
+ def extract_text_docx(file_path):
+     logger.info(f"{datetime.now()}: Extracting text from .docx file: {file_path}")
+     try:
+         doc = docx.Document(file_path)
+         text = "\n".join(para.text for para in doc.paragraphs if para.text.strip())
+         logger.info(f"{datetime.now()}: Successfully extracted {len(text)} characters from .docx")
+         return text
+     except Exception as e:
+         logger.error(f"{datetime.now()}: Error extracting text from .docx: {e}")
+         return ""
+
+
+ # Extract text from a .pdf file
+ def extract_text_pdf(file_path):
+     logger.info(f"{datetime.now()}: Extracting text from .pdf file: {file_path}")
+     try:
+         with fitz.open(file_path) as doc:
+             text = "".join(page.get_text() for page in doc)
+         logger.info(f"{datetime.now()}: Successfully extracted {len(text)} characters from .pdf")
+         return text
+     except Exception as e:
+         logger.error(f"{datetime.now()}: Error extracting text from .pdf: {e}")
+         return ""
+
+
+ # Split text into chunks; max_length is a character budget, a rough
+ # proxy for the model's 512-token input limit.
+ def chunk_text(text, max_length=512):
+     logger.info(f"{datetime.now()}: Chunking text into segments")
+     words = text.split()
+     chunks = []
+     current_chunk = []
+     current_length = 0
+     for word in words:
+         current_chunk.append(word)
+         current_length += len(word) + 1
+         if current_length >= max_length:
+             chunks.append(" ".join(current_chunk))
+             current_chunk = []
+             current_length = 0
+     if current_chunk:
+         chunks.append(" ".join(current_chunk))
+     logger.info(f"{datetime.now()}: Created {len(chunks)} text chunks")
+     return chunks
+
+
+ # Build a datasets.Dataset from the chunked text
+ def prepare_dataset(text):
+     logger.info(f"{datetime.now()}: Preparing dataset")
+     chunks = chunk_text(text)
+     dataset = Dataset.from_dict({"text": chunks})
+     logger.info(f"{datetime.now()}: Dataset prepared with {len(dataset)} examples")
+     return dataset
+
+
+ # Tokenize the dataset. The QA model needs span labels to compute a loss,
+ # so placeholder spans pointing at the [CLS] token are added; proper QA
+ # fine-tuning would require annotated (question, context, answer-span) pairs.
+ def tokenize_dataset(dataset):
+     logger.info(f"{datetime.now()}: Tokenizing dataset")
+
+     def tokenize_function(examples):
+         encodings = tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)
+         encodings["start_positions"] = [0] * len(encodings["input_ids"])
+         encodings["end_positions"] = [0] * len(encodings["input_ids"])
+         return encodings
+
+     tokenized_dataset = dataset.map(tokenize_function, batched=True)
+     logger.info(f"{datetime.now()}: Dataset tokenized")
+     return tokenized_dataset
+
+
+ # Fine-tune the model and save it (with the tokenizer) to MODEL_SAVE_PATH
+ def fine_tune_model(dataset):
+     logger.info(f"{datetime.now()}: Starting model fine-tuning")
+     training_args = TrainingArguments(
+         output_dir="./results",
+         num_train_epochs=1,
+         per_device_train_batch_size=4,
+         save_steps=10_000,
+         save_total_limit=2,
+         logging_dir="./logs",
+         logging_steps=200,
+     )
+     trainer = Trainer(
+         model=model,
+         args=training_args,
+         train_dataset=dataset,
+     )
+     trainer.train()
+     model.save_pretrained(MODEL_SAVE_PATH)
+     tokenizer.save_pretrained(MODEL_SAVE_PATH)
+     logger.info(f"{datetime.now()}: Model fine-tuned and saved to {MODEL_SAVE_PATH}")
+
+
+ # Handle file upload and training
+ def upload_and_train(files, progress=gr.Progress()):
+     uploaded_files = []
+     all_text = ""
+     training_log = []
+
+     def log_and_update(desc, progress_value):
+         msg = f"{datetime.now()}: {desc}"
+         logger.info(msg)
+         training_log.append(msg)
+         progress(progress_value, desc=desc)
+
+     log_and_update("Loading books...", 0.1)
+     for file in files:
+         file_name = os.path.basename(file.name)
+         uploaded_files.append(file_name)
+         if file_name.endswith(".docx"):
+             text = extract_text_docx(file.name)
+         elif file_name.endswith(".pdf"):
+             text = extract_text_pdf(file.name)
+         else:
+             continue
+         all_text += text + "\n"
+
+     if not all_text.strip():
+         msg = f"{datetime.now()}: No valid text extracted from uploaded files."
+         logger.error(msg)
+         training_log.append(msg)
+         return "\n".join(training_log), "\n".join(uploaded_files)
+
+     log_and_update("Extracting ideas...", 0.4)
+     cleaned_text = preprocess_arabic_text(all_text)
+
+     log_and_update("Preparing dataset...", 0.6)
+     dataset = prepare_dataset(cleaned_text)
+     tokenized_dataset = tokenize_dataset(dataset)
+
+     log_and_update("Training in progress...", 0.8)
+     fine_tune_model(tokenized_dataset)
+
+     # Persist the cleaned text so answer_question can use it as context
+     with open(CONTEXT_PATH, "w", encoding="utf-8") as f:
+         f.write(cleaned_text)
+
+     log_and_update("Training completed!", 1.0)
+
+     # Example QA
+     qa_pipeline = pipeline("question-answering", model=MODEL_SAVE_PATH, tokenizer=MODEL_SAVE_PATH)
+     example_question = "ما هو قانون الإيمان وفقًا للكتاب؟"
+     example_answer = qa_pipeline(question=example_question, context=cleaned_text[:512])["answer"]
+
+     final_message = (
+         f"Training process finished: Enter your question\n\n"
+         f"**مثال لسؤال**: {example_question}\n"
+         f"**الإجابة**: {example_answer}\n\n"
+         f"**سجل التدريب**:\n" + "\n".join(training_log)
+     )
+     return final_message, "\n".join(uploaded_files)
+
+
+ # Answer a question using the fine-tuned model and the saved book context
+ def answer_question(question):
+     if not (os.path.exists(MODEL_SAVE_PATH) and os.path.exists(CONTEXT_PATH)):
+         return "النظام لم يتم تدريبه بعد. الرجاء رفع الكتب وتدريب النظام أولاً."
+     with open(CONTEXT_PATH, encoding="utf-8") as f:
+         context = f.read()
+     qa_pipeline = pipeline("question-answering", model=MODEL_SAVE_PATH, tokenizer=MODEL_SAVE_PATH)
+     return qa_pipeline(question=question, context=context[:512])["answer"]
+
+
+ # Main Gradio interface (training and QA)
+ with gr.Blocks() as main_demo:
+     gr.Markdown("# نظام ذكاء اصطناعي لتحليل الكتب باللغة العربية")
+
+     with gr.Row():
+         with gr.Column():
+             file_upload = gr.File(file_types=[".docx", ".pdf"], label="رفع الكتب", file_count="multiple")
+             upload_button = gr.Button("رفع وتدريب")
+             uploaded_files = gr.Textbox(label="الكتب المرفوعة")
+
+         with gr.Column():
+             training_status = gr.Textbox(label="حالة التدريب", lines=10)
+
+     with gr.Row():
+         question_input = gr.Textbox(label="أدخل سؤالك بالعربية", placeholder="مثال: ما هو قانون الإيمان؟")
+         answer_output = gr.Textbox(label="الإجابة")
+         ask_button = gr.Button("طرح السؤال")
+
+     # Event handlers
+     upload_button.click(
+         fn=upload_and_train,
+         inputs=[file_upload],
+         outputs=[training_status, uploaded_files],
+     )
+     ask_button.click(
+         fn=answer_question,
+         inputs=[question_input],
+         outputs=[answer_output],
+     )
+
+
+ # Question-answering-only interface
+ with gr.Blocks() as answer_demo:
+     gr.Markdown("# طرح الأسئلة على نظام تحليل الكتب باللغة العربية")
+     gr.Markdown("أدخل سؤالك بالعربية وسيتم الإجابة بناءً على محتوى الكتب المدربة.")
+
+     qa_question_input = gr.Textbox(label="أدخل سؤالك", placeholder="مثال: ما هو قانون الإيمان؟")
+     qa_answer_output = gr.Textbox(label="الإجابة")
+     qa_ask_button = gr.Button("طرح السؤال")
+
+     qa_ask_button.click(
+         fn=answer_question,
+         inputs=[qa_question_input],
+         outputs=[qa_answer_output],
+     )
+
+
+ # Serve both interfaces from one FastAPI app. The more specific path is
+ # mounted first so the root mount does not shadow it.
+ app = FastAPI()
+ app = gr.mount_gradio_app(app, answer_demo, path="/answer")
+ app = gr.mount_gradio_app(app, main_demo, path="/")
+
+ if __name__ == "__main__":
+     uvicorn.run(app, host="0.0.0.0", port=7860)
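As a standalone sanity check of the chunking step, the same greedy character-budget logic (minus the logging) can be run by itself. Because each word contributes its length plus one joining space, chunk boundaries fall at or just past `max_length` characters:

```python
def chunk_text(text, max_length=512):
    # Greedily pack whole words until the character budget is reached
    words = text.split()
    chunks, current, length = [], [], 0
    for word in words:
        current.append(word)
        length += len(word) + 1
        if length >= max_length:
            chunks.append(" ".join(current))
            current, length = [], 0
    if current:
        chunks.append(" ".join(current))
    return chunks

sample = " ".join(["word"] * 10)
print(chunk_text(sample, max_length=20))
# → ['word word word word', 'word word word word', 'word word']
```

Note this budget is in characters, not tokens, so it only approximates the model's 512-token limit; the tokenizer's `truncation=True` is what actually enforces it.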
requirements.txt ADDED
@@ -0,0 +1,7 @@
+ gradio==4.31.0
+ transformers==4.38.2
+ datasets==2.18.0
+ arabert==1.0.1
+ python-docx==1.1.0
+ PyMuPDF==1.24.2
+ torch==2.2.1