ramysaidagieb committed on
Commit 2b4974e · verified · 1 Parent(s): d9382f5

Upload 3 files

Files changed (3):
  1. README.md +117 -0
  2. app.py +251 -0
  3. requirements.txt +7 -0
README.md ADDED
@@ -0,0 +1,117 @@
+ ---
+ title: Arabic Book Analysis AI
+ emoji: 📚
+ colorFrom: blue
+ colorTo: green
+ sdk: gradio
+ sdk_version: "4.31.0"
+ app_file: app.py
+ pinned: false
+ ---
+
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+
+ # Arabic Book Analysis AI
+
+ This Huggingface Space hosts an AI system that analyzes Arabic books and answers questions based solely on their content. The system supports uploading books in .docx and .pdf formats, fine-tunes an AraBERT model on the extracted text, and provides answers in Arabic.
+
+ ## Features
+ - Upload multiple books (.docx, .pdf).
+ - Visual training progress with detailed logs: "Loading book", "Extracting ideas", "Training in progress", "Training completed".
+ - Question-answering interface in Arabic, restricted to book content.
+ - Separate question-answering-only interface that can be shared with friends.
+ - Formal answer style, suitable for scholarly texts.
+
+ ## Configuration
+ To ensure proper deployment on Huggingface Spaces, configure the Space with the following settings:
+ - **SDK**: Gradio (specified in the YAML front matter above)
+ - **SDK Version**: 4.31.0 (matches `requirements.txt`)
+ - **App File**: `app.py` (entry point for the Gradio app)
+ - **Visibility**: Public (required for sharing the question-answering link)
+ - **Hardware**: CPU (the default free tier is sufficient; upgrade to GPU for faster training if needed)
+ - **File Structure**:
+   - Place `app.py`, `requirements.txt`, and `README.md` in the root directory (`/`).
+   - No additional folders are required; the app expects all files at the root level.
+ - **Environment Variables**: None required; all dependencies are listed in `requirements.txt`.
+ - **Persistence**: The fine-tuned model is saved to `./fine_tuned_model` inside the Space for reuse. Note that without persistent storage enabled, this directory is lost when the Space restarts and training must be rerun.
+ - **Huggingface Space Name**: Use `arabic-book-analysis` for consistency with the links below.
+
+ ## How to Use
+ ### Training Interface
+ 1. **Upload Books**:
+    - Access the main interface at `/`.
+    - Click the "رفع الكتب" (upload books) field to select .docx or .pdf files.
+    - Press "رفع وتدريب" (upload and train) to start processing and training.
+    - View uploaded files, training logs, and status.
+    - After training, the message "Training process finished: Enter your question" is shown.
+
+ 2. **Ask Questions**:
+    - Enter an Arabic question in the "أدخل سؤالك بالعربية" (enter your question in Arabic) field.
+    - Click "طرح السؤال" (ask the question) to get an answer based on the book's content.
+
+ ### Question-Answering-Only Interface
+ - Share this link with friends: `https://your_huggingface_username-arabic-book-analysis.hf.space/answer` (the Space's direct app URL).
+ - Users can:
+   - Enter an Arabic question.
+   - Receive answers based on the trained model's knowledge.
+   - No training is required.
+
+ **Example**:
+ - Question: "ما هو قانون الإيمان وفقًا للكتاب؟" (What is the creed of faith according to the book?)
+ - Answer: "قانون الإيمان هو أساس العقيدة المسيحية، ويؤمن به كل الكنائس المسيحية في العالم..." (The creed of faith is the foundation of Christian doctrine, held by all Christian churches in the world...)
+
+ ## Requirements
+ - Python 3.8+
+ - Dependencies listed in `requirements.txt`
+
+ ## Deployment
+ 1. **Create/Update the Huggingface Space**:
+    - Log in to Huggingface with your username (replace `your_huggingface_username` throughout).
+    - Create a new Space named "arabic-book-analysis" or update an existing one.
+    - Select "Gradio" as the SDK and set visibility to "Public".
+
+ 2. **Upload the Zipped Folder**:
+    - Run the provided `create_zip.py` script to generate `arabic_book_analysis.zip`.
+    - In the Huggingface Space, go to the "Files" tab.
+    - Upload `arabic_book_analysis.zip` and extract it.
+    - Move `app.py`, `requirements.txt`, and `README.md` from `/arabic_book_analysis/` to the root directory (`/`).
+    - Confirm these files are in the root.
+
+ 3. **Build and Launch**:
+    - Huggingface Spaces automatically installs dependencies from `requirements.txt` and launches the app.
+    - Monitor the build logs for errors.
+
+ 4. **Access Links**:
+    - Main interface: `https://huggingface.co/spaces/your_huggingface_username/arabic-book-analysis`
+    - Question-answering interface: `https://your_huggingface_username-arabic-book-analysis.hf.space/answer`
+
+ ## Troubleshooting Configuration and Runtime Errors
+ - **Error: "Invalid SDK"**:
+   - Ensure the Space is configured to use the Gradio SDK in the Space settings and that the YAML front matter specifies `sdk: gradio`.
+ - **Error: "Files not found"**:
+   - Verify that `app.py`, `requirements.txt`, and `README.md` are in the root directory (`/`), not in subfolders.
+   - Re-upload `arabic_book_analysis.zip`, extract it, and move the files to the root using the Huggingface UI.
+ - **Error: "Dependency installation failed"**:
+   - Check `requirements.txt` for correct package versions.
+   - Review the build logs in the Space's "Settings" tab for specific errors.
+ - **Error: "Private Space access denied"**:
+   - Set the Space visibility to "Public" so the question-answering link works for friends.
+ - **Error: "Model not found"**:
+   - Ensure training has completed at least once so the fine-tuned model is saved to `./fine_tuned_model`.
+ - **Error: "FileNotFoundError: No such file or directory: 'java'"**:
+   - The `arabert` library's Farasa dependency requires Java, which is not installed by default. This project uses a custom `preprocess_arabic_text` function in `app.py` to avoid the Farasa and Java requirements. If this error occurs, ensure `app.py` uses `preprocess_arabic_text` instead of `ArabertPreprocessor`.
+ - **Error: "TypeError: ArabertPreprocessor.__init__() got an unexpected keyword argument 'use_farasapy'"**:
+   - This occurred because `arabert==1.0.1` does not support the `use_farasapy` parameter. The current `app.py` avoids `ArabertPreprocessor` entirely, using a custom preprocessing function.
+ - **Persistent Configuration Error**:
+   - If the "Missing configuration in README" error persists, double-check that `README.md` is in the root directory and contains the exact YAML front matter shown above. Ensure the zip file is extracted correctly, move the files to the root, and restart the Space build.
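For readers hitting the Java errors above: the Farasa-free preprocessing is just a handful of regular-expression passes. This sketch mirrors the `preprocess_arabic_text` function shipped in `app.py`:

```python
import re

def preprocess_arabic_text(text: str) -> str:
    """Lightweight Arabic normalization with no Farasa/Java dependency."""
    # Strip Arabic diacritics (tashkeel)
    text = re.sub(r'[\u0617-\u061A\u064B-\u0652]', '', text)
    # Normalize alef variants, alef maqsura, and ta marbuta
    text = re.sub(r'[أإآ]', 'ا', text)
    text = re.sub(r'ى', 'ي', text)
    text = re.sub(r'ة', 'ه', text)
    # Collapse whitespace and drop punctuation
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'[^\w\s]', '', text)
    return text.strip()

print(preprocess_arabic_text("كِتَابٌ"))  # → كتاب (diacritics removed)
```

This is deliberately lossy (e.g. ta marbuta is folded into ha), which is the trade-off the Notes section mentions.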
+
+ ## Notes
+ - Ensure books are in Arabic for accurate processing.
+ - The system is optimized for Huggingface Spaces' free tier; training may take a few minutes.
+ - Replace `your_huggingface_username` with your actual Huggingface username.
+ - Training logs are displayed in the interface for transparency.
+ - The zipped folder simplifies uploads; make sure the files are moved to the root directory after extraction.
+ - The `arabert` library's Farasa dependency is bypassed with a custom `preprocess_arabic_text` function to avoid the Java requirement. This may slightly reduce preprocessing quality but ensures compatibility with Huggingface Spaces.
+
+ ## License
+ MIT License
app.py ADDED
@@ -0,0 +1,251 @@
+ import os
+ import re
+ import logging
+ from datetime import datetime
+
+ import docx
+ import fitz  # PyMuPDF
+ import gradio as gr
+ import uvicorn
+ from datasets import Dataset
+ from fastapi import FastAPI
+ from transformers import (
+     AutoTokenizer,
+     AutoModelForQuestionAnswering,
+     Trainer,
+     TrainingArguments,
+     pipeline,
+ )
+
+ # Setup logging
+ logging.basicConfig(level=logging.INFO)
+ logger = logging.getLogger(__name__)
+
+ # Initialize tokenizer and model
+ model_name = "aubmindlab/bert-base-arabertv2"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForQuestionAnswering.from_pretrained(model_name)
+
+ # Directory for the fine-tuned model; the cleaned book text is stored
+ # alongside it so the question-answering interfaces can use it as context.
+ MODEL_SAVE_PATH = "./fine_tuned_model"
+ CONTEXT_PATH = os.path.join(MODEL_SAVE_PATH, "context.txt")
+
+
+ # Custom Arabic text preprocessing (avoids the Farasa/Java dependency)
+ def preprocess_arabic_text(text):
+     logger.info(f"{datetime.now()}: Preprocessing text (length: {len(text)} characters)")
+     # Remove Arabic diacritics (tashkeel)
+     diacritics = re.compile(r'[\u0617-\u061A\u064B-\u0652]')
+     text = diacritics.sub('', text)
+     # Normalize alef variants, alef maqsura, and ta marbuta
+     text = re.sub(r'[أإآ]', 'ا', text)
+     text = re.sub(r'ى', 'ي', text)
+     text = re.sub(r'ة', 'ه', text)
+     # Collapse whitespace and drop punctuation
+     text = re.sub(r'\s+', ' ', text)
+     text = re.sub(r'[^\w\s]', '', text)
+     logger.info(f"{datetime.now()}: Text preprocessed, new length: {len(text)} characters")
+     return text.strip()
+
+
+ # Extract text from a .docx file
+ def extract_text_docx(file_path):
+     logger.info(f"{datetime.now()}: Extracting text from .docx file: {file_path}")
+     try:
+         doc = docx.Document(file_path)
+         text = "\n".join(para.text for para in doc.paragraphs if para.text.strip())
+         logger.info(f"{datetime.now()}: Successfully extracted {len(text)} characters from .docx")
+         return text
+     except Exception as e:
+         logger.error(f"{datetime.now()}: Error extracting text from .docx: {e}")
+         return ""
+
+
+ # Extract text from a .pdf file
+ def extract_text_pdf(file_path):
+     logger.info(f"{datetime.now()}: Extracting text from .pdf file: {file_path}")
+     try:
+         with fitz.open(file_path) as doc:
+             text = "".join(page.get_text() for page in doc)
+         logger.info(f"{datetime.now()}: Successfully extracted {len(text)} characters from .pdf")
+         return text
+     except Exception as e:
+         logger.error(f"{datetime.now()}: Error extracting text from .pdf: {e}")
+         return ""
+
+
+ # Split text into chunks; max_length is a character budget, a rough
+ # proxy for the model's 512-token input limit.
+ def chunk_text(text, max_length=512):
+     logger.info(f"{datetime.now()}: Chunking text into segments")
+     words = text.split()
+     chunks = []
+     current_chunk = []
+     current_length = 0
+     for word in words:
+         current_chunk.append(word)
+         current_length += len(word) + 1
+         if current_length >= max_length:
+             chunks.append(" ".join(current_chunk))
+             current_chunk = []
+             current_length = 0
+     if current_chunk:
+         chunks.append(" ".join(current_chunk))
+     logger.info(f"{datetime.now()}: Created {len(chunks)} text chunks")
+     return chunks
+
+
+ # Build a datasets.Dataset from the chunked text
+ def prepare_dataset(text):
+     logger.info(f"{datetime.now()}: Preparing dataset")
+     chunks = chunk_text(text)
+     dataset = Dataset.from_dict({"text": chunks})
+     logger.info(f"{datetime.now()}: Dataset prepared with {len(dataset)} examples")
+     return dataset
+
+
+ # Tokenize the dataset. The QA model needs span labels to compute a loss,
+ # so placeholder spans pointing at the [CLS] token are added; proper QA
+ # fine-tuning would require annotated (question, context, answer-span) pairs.
+ def tokenize_dataset(dataset):
+     logger.info(f"{datetime.now()}: Tokenizing dataset")
+
+     def tokenize_function(examples):
+         encodings = tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)
+         encodings["start_positions"] = [0] * len(encodings["input_ids"])
+         encodings["end_positions"] = [0] * len(encodings["input_ids"])
+         return encodings
+
+     tokenized_dataset = dataset.map(tokenize_function, batched=True)
+     logger.info(f"{datetime.now()}: Dataset tokenized")
+     return tokenized_dataset
+
+
+ # Fine-tune the model and save it (with the tokenizer) to MODEL_SAVE_PATH
+ def fine_tune_model(dataset):
+     logger.info(f"{datetime.now()}: Starting model fine-tuning")
+     training_args = TrainingArguments(
+         output_dir="./results",
+         num_train_epochs=1,
+         per_device_train_batch_size=4,
+         save_steps=10_000,
+         save_total_limit=2,
+         logging_dir="./logs",
+         logging_steps=200,
+     )
+     trainer = Trainer(
+         model=model,
+         args=training_args,
+         train_dataset=dataset,
+     )
+     trainer.train()
+     model.save_pretrained(MODEL_SAVE_PATH)
+     tokenizer.save_pretrained(MODEL_SAVE_PATH)
+     logger.info(f"{datetime.now()}: Model fine-tuned and saved to {MODEL_SAVE_PATH}")
+
+
+ # Handle file upload and training
+ def upload_and_train(files, progress=gr.Progress()):
+     uploaded_files = []
+     all_text = ""
+     training_log = []
+
+     def log_and_update(desc, progress_value):
+         msg = f"{datetime.now()}: {desc}"
+         logger.info(msg)
+         training_log.append(msg)
+         progress(progress_value, desc=desc)
+
+     log_and_update("Loading books...", 0.1)
+     for file in files:
+         file_name = os.path.basename(file.name)
+         uploaded_files.append(file_name)
+         if file_name.endswith(".docx"):
+             text = extract_text_docx(file.name)
+         elif file_name.endswith(".pdf"):
+             text = extract_text_pdf(file.name)
+         else:
+             continue
+         all_text += text + "\n"
+
+     if not all_text.strip():
+         msg = f"{datetime.now()}: No valid text extracted from uploaded files."
+         logger.error(msg)
+         training_log.append(msg)
+         return "\n".join(training_log), "\n".join(uploaded_files)
+
+     log_and_update("Extracting ideas...", 0.4)
+     cleaned_text = preprocess_arabic_text(all_text)
+
+     log_and_update("Preparing dataset...", 0.6)
+     dataset = prepare_dataset(cleaned_text)
+     tokenized_dataset = tokenize_dataset(dataset)
+
+     log_and_update("Training in progress...", 0.8)
+     fine_tune_model(tokenized_dataset)
+
+     # Persist the cleaned text so answer_question can use it as context
+     with open(CONTEXT_PATH, "w", encoding="utf-8") as f:
+         f.write(cleaned_text)
+
+     log_and_update("Training completed!", 1.0)
+
+     # Example QA
+     qa_pipeline = pipeline("question-answering", model=MODEL_SAVE_PATH, tokenizer=MODEL_SAVE_PATH)
+     example_question = "ما هو قانون الإيمان وفقًا للكتاب؟"
+     example_answer = qa_pipeline(question=example_question, context=cleaned_text[:512])["answer"]
+
+     final_message = (
+         f"Training process finished: Enter your question\n\n"
+         f"**مثال لسؤال**: {example_question}\n"
+         f"**الإجابة**: {example_answer}\n\n"
+         f"**سجل التدريب**:\n" + "\n".join(training_log)
+     )
+     return final_message, "\n".join(uploaded_files)
+
+
+ # Answer a question using the fine-tuned model and the saved book context
+ def answer_question(question):
+     if not (os.path.exists(MODEL_SAVE_PATH) and os.path.exists(CONTEXT_PATH)):
+         return "النظام لم يتم تدريبه بعد. الرجاء رفع الكتب وتدريب النظام أولاً."
+     with open(CONTEXT_PATH, encoding="utf-8") as f:
+         context = f.read()
+     qa_pipeline = pipeline("question-answering", model=MODEL_SAVE_PATH, tokenizer=MODEL_SAVE_PATH)
+     return qa_pipeline(question=question, context=context[:512])["answer"]
+
+
+ # Main Gradio interface (training and QA)
+ with gr.Blocks() as main_demo:
+     gr.Markdown("# نظام ذكاء اصطناعي لتحليل الكتب باللغة العربية")
+
+     with gr.Row():
+         with gr.Column():
+             file_upload = gr.File(file_types=[".docx", ".pdf"], label="رفع الكتب", file_count="multiple")
+             upload_button = gr.Button("رفع وتدريب")
+             uploaded_files = gr.Textbox(label="الكتب المرفوعة")
+
+         with gr.Column():
+             training_status = gr.Textbox(label="حالة التدريب", lines=10)
+
+     with gr.Row():
+         question_input = gr.Textbox(label="أدخل سؤالك بالعربية", placeholder="مثال: ما هو قانون الإيمان؟")
+         answer_output = gr.Textbox(label="الإجابة")
+         ask_button = gr.Button("طرح السؤال")
+
+     # Event handlers
+     upload_button.click(
+         fn=upload_and_train,
+         inputs=[file_upload],
+         outputs=[training_status, uploaded_files],
+     )
+     ask_button.click(
+         fn=answer_question,
+         inputs=[question_input],
+         outputs=[answer_output],
+     )
+
+
+ # Question-answering-only interface
+ with gr.Blocks() as answer_demo:
+     gr.Markdown("# طرح الأسئلة على نظام تحليل الكتب باللغة العربية")
+     gr.Markdown("أدخل سؤالك بالعربية وسيتم الإجابة بناءً على محتوى الكتب المدربة.")
+
+     qa_question_input = gr.Textbox(label="أدخل سؤالك", placeholder="مثال: ما هو قانون الإيمان؟")
+     qa_answer_output = gr.Textbox(label="الإجابة")
+     qa_ask_button = gr.Button("طرح السؤال")
+
+     qa_ask_button.click(
+         fn=answer_question,
+         inputs=[qa_question_input],
+         outputs=[qa_answer_output],
+     )
+
+
+ # Serve both interfaces from one FastAPI app. The more specific path is
+ # mounted first so the root mount does not shadow it.
+ app = FastAPI()
+ app = gr.mount_gradio_app(app, answer_demo, path="/answer")
+ app = gr.mount_gradio_app(app, main_demo, path="/")
+
+ if __name__ == "__main__":
+     uvicorn.run(app, host="0.0.0.0", port=7860)
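As a standalone sanity check of the chunking step, the same greedy character-budget logic (minus the logging) can be run by itself. Because each word contributes its length plus one joining space, chunk boundaries fall at or just past `max_length` characters:

```python
def chunk_text(text, max_length=512):
    # Greedily pack whole words until the character budget is reached
    words = text.split()
    chunks, current, length = [], [], 0
    for word in words:
        current.append(word)
        length += len(word) + 1
        if length >= max_length:
            chunks.append(" ".join(current))
            current, length = [], 0
    if current:
        chunks.append(" ".join(current))
    return chunks

sample = " ".join(["word"] * 10)
print(chunk_text(sample, max_length=20))
# → ['word word word word', 'word word word word', 'word word']
```

Note this budget is in characters, not tokens, so it only approximates the model's 512-token limit; the tokenizer's `truncation=True` is what actually enforces it.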
requirements.txt ADDED
@@ -0,0 +1,7 @@
+ gradio==4.31.0
+ transformers==4.38.2
+ datasets==2.18.0
+ arabert==1.0.1
+ python-docx==1.1.0
+ PyMuPDF==1.24.2
+ torch==2.2.1