ramysaidagieb committed commit 97035bf (verified · 1 parent: 463d062)

Upload 3 files

Files changed (3):
  1. README.md +117 -0
  2. app.py +246 -0
  3. requirements.txt +7 -0
README.md ADDED
@@ -0,0 +1,117 @@
---
title: Arabic Book Analysis AI
emoji: 📚
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: "4.31.0"
app_file: app.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Arabic Book Analysis AI

This Hugging Face Space hosts an AI system that analyzes Arabic books and answers questions based solely on their content. The system supports uploading books in .docx and .pdf formats, fine-tunes an AraBERT model on the extracted text, and answers in Arabic.

## Features
- Upload multiple books (.docx, .pdf).
- Visual training progress with detailed logs: "Loading book", "Extracting ideas", "Training in progress", "Training completed".
- Arabic question-answering interface, restricted to book content.
- A separate question-answering-only interface that can be shared with friends.
- Formal answer style, suitable for scholarly texts.
+ ## Configuration
26
+ To ensure proper deployment on Huggingface Spaces, configure the Space with the following settings:
27
+ - **SDK**: Gradio (specified in the YAML front matter above)
28
+ - **SDK Version**: 4.31.0 (matches `requirements.txt`)
29
+ - **App File**: `app.py` (entry point for the Gradio app)
30
+ - **Visibility**: Public (required for sharing the question-answering link)
31
+ - **Hardware**: CPU (default free tier is sufficient; upgrade to GPU for faster training if needed)
32
+ - **File Structure**:
33
+ - Place `app.py`, `requirements.txt`, and `README.md` in the root directory (`/`).
34
+ - No additional folders are required; the app expects all files at the root level.
35
+ - **Environment Variables**: None required; all dependencies are listed in `requirements.txt`.
36
+ - **Persistence**: The fine-tuned model is saved to `./fine_tuned_model` within the Space’s storage for reuse.
37
+ - **Huggingface Space Name**: Use `arabic-book-analysis` for consistency with the provided links.
38
+
39
+ ## How to Use
40
+ ### Training Interface
41
+ 1. **Upload Books**:
42
+ - Access the main interface at `/`.
43
+ - Click the "رفع الكتب" field to select .docx or .pdf files.
44
+ - Press "رفع وتدريب" to start processing and training.
45
+ - View uploaded files, training logs, and status.
46
+ - After training, see the message: "Training process finished: Enter your question".
47
+
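Each upload is routed to the matching extractor by file extension, mirroring the dispatch in `upload_and_train` in `app.py`. A minimal sketch with stand-in extractor functions:

```python
import os

def extract_text_docx(path):  # stand-in for the python-docx extractor in app.py
    return f"docx:{path}"

def extract_text_pdf(path):   # stand-in for the PyMuPDF extractor in app.py
    return f"pdf:{path}"

def extract_any(path):
    """Route a file to the right extractor by extension; unsupported types yield ''."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".docx":
        return extract_text_docx(path)
    if ext == ".pdf":
        return extract_text_pdf(path)
    return ""  # silently skip unsupported file types, as the app does

print(extract_any("book.PDF"))   # prints: pdf:book.PDF
print(extract_any("notes.txt"))  # prints an empty line
```

Lower-casing the extension also accepts uploads named `.PDF` or `.DOCX`.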
2. **Ask Questions**:
   - Enter an Arabic question in the "أدخل سؤالك بالعربية" (enter your question in Arabic) field.
   - Click "طرح السؤال" (ask the question) to get an answer based on the book's content.

### Question-Answering Only Interface
- Share this link with friends: `https://huggingface.co/spaces/your_huggingface_username/arabic-book-analysis/answer`
- Users can:
  - Enter an Arabic question.
  - Receive answers based on the trained model's knowledge.
  - No training is required.

**Example**:
- Question: "ما هو قانون الإيمان وفقًا للكتاب؟" ("What is the creed according to the book?")
- Answer: "قانون الإيمان هو أساس العقيدة المسيحية، ويؤمن به كل الكنائس المسيحية في العالم..." ("The creed is the foundation of the Christian faith, and all Christian churches in the world believe in it...")

## Requirements
- Python 3.8+
- Dependencies listed in `requirements.txt`

## Deployment
1. **Create/Update the Hugging Face Space**:
   - Log in to Hugging Face with your username (replace `your_huggingface_username`).
   - Create a new Space named "arabic-book-analysis" or update an existing one.
   - Select "Gradio" as the SDK and set visibility to "Public".

2. **Upload the Zipped Folder**:
   - Run the provided `create_zip.py` script to generate `arabic_book_analysis.zip`.
   - In the Space, go to the "Files" tab.
   - Upload `arabic_book_analysis.zip` and extract it.
   - Move `app.py`, `requirements.txt`, and `README.md` from `/arabic_book_analysis/` to the root directory (`/`).
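The `create_zip.py` script is referenced above but not included in this commit, so here is a minimal sketch of what it might look like. The file list matches this repository; the script itself (name, archive layout) is an assumption:

```python
import os
import zipfile

# Hypothetical create_zip.py: bundle the three Space files into one archive,
# stored under arabic_book_analysis/ as described in the deployment steps.
FILES = ["app.py", "requirements.txt", "README.md"]

def create_zip(archive_name="arabic_book_analysis.zip", files=FILES):
    with zipfile.ZipFile(archive_name, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in files:
            zf.write(name, arcname=f"arabic_book_analysis/{name}")
    return archive_name

if __name__ == "__main__":
    # Only zip the files that actually exist in the current directory
    present = [f for f in FILES if os.path.exists(f)]
    if present:
        print("wrote", create_zip(files=present))
```

After extraction on the Space, remember to move the three files out of `arabic_book_analysis/` into the root.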
3. **Build and Launch**:
   - Hugging Face Spaces automatically installs the dependencies from `requirements.txt` and launches the app.
   - Monitor the build logs for errors.

4. **Access Links**:
   - Main interface: `https://huggingface.co/spaces/your_huggingface_username/arabic-book-analysis`
   - Question-answering interface: `https://huggingface.co/spaces/your_huggingface_username/arabic-book-analysis/answer`

## Troubleshooting Configuration and Runtime Errors
- **Error: "Invalid SDK"**:
  - Ensure the Space is configured to use the Gradio SDK in the Space settings and that the YAML front matter specifies `sdk: gradio`.
- **Error: "Files not found"**:
  - Verify that `app.py`, `requirements.txt`, and `README.md` are in the root directory (`/`), not in subfolders.
  - Re-upload `arabic_book_analysis.zip`, extract it, and move the files to the root using the Hugging Face UI.
- **Error: "Dependency installation failed"**:
  - Check `requirements.txt` for correct package versions.
  - Review the build logs in the Space's "Settings" tab for specific errors.
- **Error: "Private Space access denied"**:
  - Set the Space visibility to "Public" so the question-answering link works for friends.
- **Error: "Model not found"**:
  - Ensure training has completed at least once so the fine-tuned model is saved to `./fine_tuned_model`.
- **Error: "FileNotFoundError: No such file or directory: 'java'"**:
  - The `arabert` library's Farasa dependency requires Java, which is not installed by default. This project passes `keep_emojis=True` to `ArabertPreprocessor` to avoid Farasa and the Java requirement. If you encounter this error, ensure `app.py` uses `keep_emojis=True`.
- **Error: "TypeError: ArabertPreprocessor.__init__() got an unexpected keyword argument 'use_farasapy'"**:
  - This occurs when the installed `arabert` version (e.g., 1.0.1) does not support the `use_farasapy` parameter. The current `app.py` uses `keep_emojis=True` instead, which is compatible with `arabert==1.0.1`.
- **Persistent Configuration Error**:
  - If the "Missing configuration in README" error persists, double-check that `README.md` is in the root directory and contains the exact YAML front matter shown above. Ensure the zip file was extracted correctly, move the files to the root, and restart the Space build.

## Notes
- Ensure books are in Arabic for accurate processing.
- The system is optimized for the Hugging Face Spaces free tier; training may take a few minutes.
- Replace `your_huggingface_username` with your actual Hugging Face username.
- Training logs are displayed in the interface for transparency.
- The zipped folder simplifies uploads; make sure the files are moved to the root directory after extraction.
- The `arabert` library's Farasa dependency is bypassed with `keep_emojis=True` to avoid the Java requirement; this may slightly reduce preprocessing quality but ensures compatibility with Hugging Face Spaces.

## License
MIT License
app.py ADDED
@@ -0,0 +1,246 @@
```python
import gradio as gr
import os
import docx
import fitz  # PyMuPDF
import re
import logging
from datetime import datetime

import uvicorn
from fastapi import FastAPI  # fastapi and uvicorn ship as Gradio dependencies
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
    pipeline,
)
from arabert.preprocess import ArabertPreprocessor
from datasets import Dataset

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Initialize the Arabic preprocessor and tokenizer.
# keep_emojis=True bypasses the Farasa segmenter and its Java dependency.
model_name = "aubmindlab/bert-base-arabertv2"
arabert_preprocessor = ArabertPreprocessor(model_name=model_name, keep_emojis=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Paths for the fine-tuned model and the saved book text
MODEL_SAVE_PATH = "./fine_tuned_model"
CONTEXT_SAVE_PATH = "./book_context.txt"


def extract_text_docx(file_path):
    """Extract text from a .docx file."""
    logger.info(f"{datetime.now()}: Extracting text from .docx file: {file_path}")
    try:
        doc = docx.Document(file_path)
        text = "\n".join(para.text for para in doc.paragraphs if para.text.strip())
        logger.info(f"{datetime.now()}: Extracted {len(text)} characters from .docx")
        return text
    except Exception as e:
        logger.error(f"{datetime.now()}: Error extracting text from .docx: {e}")
        return ""


def extract_text_pdf(file_path):
    """Extract text from a .pdf file."""
    logger.info(f"{datetime.now()}: Extracting text from .pdf file: {file_path}")
    try:
        doc = fitz.open(file_path)
        text = "".join(page.get_text() for page in doc)
        logger.info(f"{datetime.now()}: Extracted {len(text)} characters from .pdf")
        return text
    except Exception as e:
        logger.error(f"{datetime.now()}: Error extracting text from .pdf: {e}")
        return ""


def preprocess_text(text):
    """Collapse whitespace and normalize Arabic text with AraBERT's preprocessor."""
    logger.info(f"{datetime.now()}: Preprocessing text (length: {len(text)} characters)")
    text = re.sub(r"\s+", " ", text)              # Collapse runs of whitespace
    text = arabert_preprocessor.preprocess(text)  # Normalize Arabic text
    logger.info(f"{datetime.now()}: Text preprocessed, new length: {len(text)} characters")
    return text.strip()


def chunk_text(text, max_length=512):
    """Split text into chunks of roughly max_length characters on word boundaries."""
    logger.info(f"{datetime.now()}: Chunking text into segments")
    words = text.split()
    chunks, current_chunk, current_length = [], [], 0
    for word in words:
        current_chunk.append(word)
        current_length += len(word) + 1  # +1 for the joining space
        if current_length >= max_length:
            chunks.append(" ".join(current_chunk))
            current_chunk, current_length = [], 0
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    logger.info(f"{datetime.now()}: Created {len(chunks)} text chunks")
    return chunks


def prepare_dataset(text):
    """Build a Hugging Face Dataset from the text chunks."""
    logger.info(f"{datetime.now()}: Preparing dataset")
    dataset = Dataset.from_dict({"text": chunk_text(text)})
    logger.info(f"{datetime.now()}: Dataset prepared with {len(dataset)} examples")
    return dataset


def tokenize_dataset(dataset):
    """Tokenize the dataset and drop the raw text column."""
    logger.info(f"{datetime.now()}: Tokenizing dataset")

    def tokenize_function(examples):
        return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)

    tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])
    logger.info(f"{datetime.now()}: Dataset tokenized")
    return tokenized_dataset


def fine_tune_model(dataset):
    """Domain-adapt AraBERT on the book text via masked-language-modeling.

    The book text carries no question/answer labels, so we fine-tune a
    masked-LM head (whose labels are produced by the data collator) rather
    than a question-answering head; the adapted weights are then loaded by
    the question-answering pipeline.
    """
    logger.info(f"{datetime.now()}: Starting model fine-tuning")
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True)
    training_args = TrainingArguments(
        output_dir="./results",
        num_train_epochs=1,
        per_device_train_batch_size=4,
        save_steps=10_000,
        save_total_limit=2,
        logging_dir="./logs",
        logging_steps=200,
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        data_collator=data_collator,
    )
    trainer.train()
    model.save_pretrained(MODEL_SAVE_PATH)
    tokenizer.save_pretrained(MODEL_SAVE_PATH)
    logger.info(f"{datetime.now()}: Model fine-tuned and saved to {MODEL_SAVE_PATH}")


def load_context():
    """Load the saved book text used as the question-answering context."""
    if not os.path.exists(CONTEXT_SAVE_PATH):
        return ""
    with open(CONTEXT_SAVE_PATH, encoding="utf-8") as f:
        return f.read()


def upload_and_train(files, progress=gr.Progress()):
    """Extract text from the uploaded books, preprocess it, and fine-tune the model."""
    uploaded_files = []
    all_text = ""
    training_log = []

    def log_and_update(desc, progress_value):
        msg = f"{datetime.now()}: {desc}"
        logger.info(msg)
        training_log.append(msg)
        progress(progress_value, desc=desc)

    log_and_update("Loading books...", 0.1)
    for file in files:
        file_name = os.path.basename(file.name)
        uploaded_files.append(file_name)
        if file_name.endswith(".docx"):
            text = extract_text_docx(file.name)
        elif file_name.endswith(".pdf"):
            text = extract_text_pdf(file.name)
        else:
            continue  # Skip unsupported file types
        all_text += text + "\n"

    if not all_text.strip():
        msg = f"{datetime.now()}: No valid text extracted from uploaded files."
        logger.error(msg)
        training_log.append(msg)
        return "\n".join(training_log), "\n".join(uploaded_files)

    log_and_update("Extracting ideas...", 0.4)
    cleaned_text = preprocess_text(all_text)

    # Save the cleaned text so the question-answering interfaces can use it as context
    with open(CONTEXT_SAVE_PATH, "w", encoding="utf-8") as f:
        f.write(cleaned_text)

    log_and_update("Preparing dataset...", 0.6)
    dataset = prepare_dataset(cleaned_text)
    tokenized_dataset = tokenize_dataset(dataset)

    log_and_update("Training in progress...", 0.8)
    fine_tune_model(tokenized_dataset)

    log_and_update("Training completed!", 1.0)

    # Run an example question through the fine-tuned model
    qa_pipeline = pipeline("question-answering", model=MODEL_SAVE_PATH, tokenizer=MODEL_SAVE_PATH)
    example_question = "ما هو قانون الإيمان وفقًا للكتاب؟"  # "What is the creed according to the book?"
    example_answer = qa_pipeline(question=example_question, context=cleaned_text[:512])["answer"]

    final_message = (
        "Training process finished: Enter your question\n\n"
        f"**مثال لسؤال**: {example_question}\n"          # Example question
        f"**الإجابة**: {example_answer}\n\n"             # Answer
        "**سجل التدريب**:\n" + "\n".join(training_log)   # Training log
    )
    return final_message, "\n".join(uploaded_files)


def answer_question(question):
    """Answer an Arabic question with the fine-tuned model and the saved book text."""
    if not os.path.exists(MODEL_SAVE_PATH):
        # "The system has not been trained yet. Please upload books and train it first."
        return "النظام لم يتم تدريبه بعد. الرجاء رفع الكتب وتدريب النظام أولاً."

    context = load_context()
    if not context:
        # "Book text not found. Please retrain the system."
        return "لم يتم العثور على نص الكتب. الرجاء إعادة تدريب النظام."

    qa_pipeline = pipeline("question-answering", model=MODEL_SAVE_PATH, tokenizer=MODEL_SAVE_PATH)
    return qa_pipeline(question=question, context=context[:512])["answer"]


# Main Gradio interface (training and QA)
with gr.Blocks() as main_demo:
    gr.Markdown("# نظام ذكاء اصطناعي لتحليل الكتب باللغة العربية")  # "An AI system for analyzing Arabic books"

    with gr.Row():
        with gr.Column():
            file_upload = gr.File(file_types=[".docx", ".pdf"], label="رفع الكتب", file_count="multiple")
            upload_button = gr.Button("رفع وتدريب")          # "Upload and train"
            uploaded_files = gr.Textbox(label="الكتب المرفوعة")  # "Uploaded books"

        with gr.Column():
            training_status = gr.Textbox(label="حالة التدريب", lines=10)  # "Training status"

    with gr.Row():
        question_input = gr.Textbox(label="أدخل سؤالك بالعربية", placeholder="مثال: ما هو قانون الإيمان؟")
        answer_output = gr.Textbox(label="الإجابة")  # "Answer"
        ask_button = gr.Button("طرح السؤال")          # "Ask the question"

    # Event handlers
    upload_button.click(
        fn=upload_and_train,
        inputs=[file_upload],
        outputs=[training_status, uploaded_files],
    )
    ask_button.click(fn=answer_question, inputs=[question_input], outputs=[answer_output])

# Question-answering-only interface
with gr.Blocks() as answer_demo:
    gr.Markdown("# طرح الأسئلة على نظام تحليل الكتب باللغة العربية")  # "Ask the Arabic book-analysis system"
    gr.Markdown("أدخل سؤالك بالعربية وسيتم الإجابة بناءً على محتوى الكتب المدربة.")  # "Enter your question in Arabic; it is answered from the trained books."

    question_input = gr.Textbox(label="أدخل سؤالك", placeholder="مثال: ما هو قانون الإيمان؟")
    answer_output = gr.Textbox(label="الإجابة")
    ask_button = gr.Button("طرح السؤال")

    ask_button.click(fn=answer_question, inputs=[question_input], outputs=[answer_output])

# Mount both interfaces on one FastAPI app. The root mount goes last so it
# does not shadow the /answer route.
app = FastAPI()
app = gr.mount_gradio_app(app, answer_demo, path="/answer")
app = gr.mount_gradio_app(app, main_demo, path="/")

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=7860)
```
requirements.txt ADDED
@@ -0,0 +1,7 @@
```
gradio==4.31.0
transformers==4.38.2
datasets==2.18.0
arabert==1.0.1
python-docx==1.1.0
PyMuPDF==1.24.2
torch==2.2.1
```