shayan5422 committed on
Commit
a469ee1 · verified · 1 Parent(s): 3c2310b

Upload 9 files

Files changed (9)
  1. DEPLOYMENT_INSTRUCTIONS.txt +39 -0
  2. Dockerfile +32 -0
  3. README.md +126 -5
  4. app.py +33 -0
  5. converter.py +878 -0
  6. preserve_linebreaks.lua +29 -0
  7. requirements.txt +7 -0
  8. temp/.DS_Store +0 -0
  9. web_api.py +443 -0
DEPLOYMENT_INSTRUCTIONS.txt ADDED
@@ -0,0 +1,39 @@
+
+ # Deployment Guide for Hugging Face Spaces
+
+ ## Step 1: Create a Space
+ 1. Go to https://huggingface.co/spaces
+ 2. Click "Create new Space"
+ 3. Enter the Space name: docx-to-latex
+ 4. Select Docker as the SDK
+ 5. Click "Create Space"
+
+ ## Step 2: Upload the Files
+ Upload all files in this folder to the Space:
+
+ ### Method 1: Git
+ ```bash
+ git clone https://huggingface.co/spaces/YOUR_USERNAME/docx-to-latex
+ cd docx-to-latex
+ cp -r ../huggingface_deployment/* .
+ git add .
+ git commit -m "Add DOCX to LaTeX converter API"
+ git push
+ ```
+
+ ### Method 2: Web Interface
+ Drag and drop the files onto the Space page
+
+ ## Step 3: Test
+ After deployment, the API will be available at:
+ https://YOUR_USERNAME-docx-to-latex.hf.space/api/health
+
+ ## Copied Files:
+ - app.py
+ - web_api.py
+ - converter.py
+ - requirements.txt
+ - README.md
+ - Dockerfile
+ - .gitignore
+ - preserve_linebreaks.lua
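Once the Space has built, the quickest sanity check is the health endpoint from Step 3. A minimal sketch using `requests` (assumed to be available on your machine; YOUR_USERNAME is a placeholder for your Hugging Face username):

```python
import requests

# Placeholder URL: substitute your own username.
BASE_URL = "https://YOUR_USERNAME-docx-to-latex.hf.space"

resp = requests.get(f"{BASE_URL}/api/health", timeout=30)
resp.raise_for_status()
# Expected payload: {'status': 'healthy', 'message': 'DOCX to LaTeX API is running'}
print(resp.json())
```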
Dockerfile ADDED
@@ -0,0 +1,32 @@
+ FROM python:3.9-slim
+
+ # Set working directory
+ WORKDIR /app
+
+ # Install system dependencies
+ RUN apt-get update && apt-get install -y \
+     pandoc \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Copy requirements and install Python dependencies
+ COPY requirements.txt .
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # Copy core application files
+ COPY app.py .
+ COPY web_api.py .
+ COPY converter.py .
+ COPY preserve_linebreaks.lua .
+
+ # Create necessary directories
+ RUN mkdir -p temp/uploads temp/outputs
+
+ # Expose port
+ EXPOSE 7860
+
+ # Set environment variables
+ ENV PYTHONPATH=/app
+ ENV PORT=7860
+
+ # Run the application
+ CMD ["python", "app.py"]
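To smoke-test this image locally before pushing, one option is to build and run it (for example `docker build -t docx-to-latex .` followed by `docker run -p 7860:7860 docx-to-latex`) and then poll the exposed health endpoint from the host. A rough sketch, assuming `requests` is installed on the host:

```python
import time

import requests

URL = "http://127.0.0.1:7860/api/health"  # port 7860 is exposed by the Dockerfile

# Poll for up to ~60 seconds while the container starts.
for _ in range(30):
    try:
        resp = requests.get(URL, timeout=2)
        if resp.status_code == 200:
            print("API is up:", resp.json())
            break
    except requests.ConnectionError:
        pass
    time.sleep(2)
else:
    raise SystemExit("API did not become healthy in time")
```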
README.md CHANGED
@@ -1,10 +1,131 @@
  ---
- title: Docx To Latex
- emoji: 🐢
- colorFrom: pink
- colorTo: yellow
  sdk: docker
  pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: DOCX to LaTeX Converter
+ emoji: 📄
+ colorFrom: blue
+ colorTo: green
  sdk: docker
+ app_port: 7860
  pinned: false
+ license: mit
  ---

+ # 📄 DOCX to LaTeX Converter API
+
+ تبدیل‌کننده حرفه‌ای فایل‌های Word (DOCX) به LaTeX با قابلیت‌های پیشرفته
+
+ A professional DOCX to LaTeX converter with advanced features and a modern web interface.
+
+ ## 🌟 ویژگی‌ها / Features
+
+ ### فارسی
+ - ✅ تبدیل فایل‌های DOCX به LaTeX با کیفیت بالا
+ - ✅ استخراج و حفظ تصاویر
+ - ✅ سازگار با Overleaf
+ - ✅ حفظ فرمت‌ها و استایل‌ها
+ - ✅ تولید فهرست مطالب خودکار
+ - ✅ دانلود فایل کامل در قالب ZIP
+ - ✅ رابط API ساده و قدرتمند
+ - ✅ اجرای رایگان روی Hugging Face Spaces
+
+ ### English
+ - ✅ High-quality DOCX to LaTeX conversion
+ - ✅ Image extraction and preservation
+ - ✅ Overleaf compatibility
+ - ✅ Style and formatting preservation
+ - ✅ Automatic table of contents generation
+ - ✅ Complete ZIP package download
+ - ✅ Simple and powerful API interface
+ - ✅ Free hosting on Hugging Face Spaces
+
+ ## 🚀 استفاده / Usage
+
+ ### API Endpoints
+
+ #### 1. Health Check
+ ```bash
+ GET /api/health
+ ```
+
+ #### 2. Upload File
+ ```bash
+ POST /api/upload
+ Content-Type: multipart/form-data
+ Body: file (DOCX file)
+ ```
+
+ #### 3. Convert Document
+ ```bash
+ POST /api/convert
+ Content-Type: application/json
+ Body: {
+   "task_id": "string",
+   "output_filename": "string",
+   "options": {
+     "generateToc": boolean,
+     "extractMedia": boolean,
+     "overleafCompatible": boolean,
+     "preserveStyles": boolean,
+     "preserveLineBreaks": boolean
+   }
+ }
+ ```
+
+ #### 4. Download Complete Package
+ ```bash
+ GET /api/download-complete/{task_id}
+ ```
+
+ ### مثال استفاده / Example Usage
+
+ ```python
+ import requests
+
+ # Upload file
+ with open('document.docx', 'rb') as f:
+     response = requests.post('https://YOUR_USERNAME-docx-to-latex.hf.space/api/upload',
+                              files={'file': f})
+ task_id = response.json()['task_id']
+
+ # Convert
+ convert_response = requests.post('https://YOUR_USERNAME-docx-to-latex.hf.space/api/convert',
+                                  json={
+                                      'task_id': task_id,
+                                      'options': {
+                                          'generateToc': True,
+                                          'extractMedia': True,
+                                          'overleafCompatible': True
+                                      }
+                                  })
+
+ # Download complete package
+ download_response = requests.get(f'https://YOUR_USERNAME-docx-to-latex.hf.space/api/download-complete/{task_id}')
+ with open('converted_package.zip', 'wb') as f:
+     f.write(download_response.content)
+ ```
+
+ ## 🔧 نصب محلی / Local Installation
+
+ ```bash
+ git clone https://github.com/YOUR_USERNAME/docx-to-latex.git
+ cd docx-to-latex
+ pip install -r requirements.txt
+ python app.py
+ ```
+
+ ## 📚 مستندات / Documentation
+
+ این API امکان تبدیل فایل‌های Word به LaTeX با حفظ فرمت‌ها، تصاویر و جداول را فراهم می‌کند. خروجی نهایی شامل فایل LaTeX و پوشه تصاویر در قالب ZIP است که مستقیماً در Overleaf قابل استفاده است.
+
+ This API provides seamless conversion from Word documents to LaTeX while preserving formatting, images, and tables. The final output includes the LaTeX file and the media folder in a ZIP package ready for use in Overleaf.
+
+ ## 🤝 مشارکت / Contributing
+
+ از مشارکت‌ها استقبال می‌شود! لطفاً Issue ایجاد کرده یا Pull Request ارسال کنید.
+
+ Contributions are welcome! Please feel free to submit issues or pull requests.
+
+ ## 📄 مجوز / License
+
+ MIT License - برای جزئیات فایل LICENSE را مشاهده کنید.
+
+ MIT License - see LICENSE file for details.
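One caveat for the Local Installation section above: `pip install -r requirements.txt` only installs the Python packages, while Pandoc itself is a separate system binary (the Dockerfile installs it with apt-get). A small check using the pypandoc package already pinned in requirements.txt:

```python
import pypandoc

# Pandoc is a system dependency; pip only covers the Python side.
# This fails fast if the pandoc binary is missing from PATH.
try:
    print("pandoc version:", pypandoc.get_pandoc_version())
except OSError:
    raise SystemExit("Pandoc not found -- install it (e.g. apt-get install pandoc) and retry")
```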
app.py ADDED
@@ -0,0 +1,33 @@
+ #!/usr/bin/env python3
+ """
+ DOCX to LaTeX Converter API
+ Main entry point for Hugging Face Spaces deployment
+ """
+
+ import os
+ import sys
+
+ # Set up environment for Hugging Face Spaces
+ if 'SPACE_ID' in os.environ:
+     # Running on Hugging Face Spaces
+     PORT = int(os.environ.get('PORT', 7860))
+     HOST = '0.0.0.0'
+ else:
+     # Running locally
+     PORT = 5001
+     HOST = '127.0.0.1'
+
+ # Import the Flask app
+ from web_api import app
+
+ if __name__ == "__main__":
+     print(f"🚀 Starting DOCX to LaTeX Converter API")
+     print(f"🌐 Server running on http://{HOST}:{PORT}")
+     print(f"📖 Health check: http://{HOST}:{PORT}/api/health")
+     print(f"📚 API Documentation: https://huggingface.co/spaces/YOUR_USERNAME/docx-to-latex")
+
+     app.run(
+         host=HOST,
+         port=PORT,
+         debug=False  # Disable debug in production
+     )
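Since app.py only re-exports the Flask app defined in web_api.py, the API can also be exercised in-process with Flask's test client, without binding a port. A minimal sketch:

```python
from web_api import app

# Flask's built-in test client drives the app in-process; no server needed.
with app.test_client() as client:
    resp = client.get("/api/health")
    print(resp.status_code, resp.get_json())
    # expected: 200 {'status': 'healthy', 'message': 'DOCX to LaTeX API is running'}
```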
converter.py ADDED
@@ -0,0 +1,878 @@
1
+ import pypandoc
2
+ import os
3
+ import re
4
+ import tempfile
5
+
6
+ def convert_docx_to_latex(
7
+ docx_path: str,
8
+ latex_path: str,
9
+ generate_toc: bool = False,
10
+ extract_media_to_path: str = None,
11
+ latex_template_path: str = None,
12
+ overleaf_compatible: bool = False,
13
+ preserve_styles: bool = True,
14
+ preserve_linebreaks: bool = True
15
+ ) -> tuple[bool, str]:
16
+ """
17
+ Converts a DOCX file to a LaTeX file using pypandoc with enhanced features.
18
+
19
+ Args:
20
+ docx_path: Path to the input .docx file.
21
+ latex_path: Path to save the output .tex file.
22
+ generate_toc: If True, attempts to generate a Table of Contents.
23
+ extract_media_to_path: If specified, path to extract media to (e.g., "./media").
24
+ latex_template_path: If specified, path to a custom Pandoc LaTeX template file.
25
+ overleaf_compatible: If True, makes images work in Overleaf with relative paths.
26
+ preserve_styles: If True, preserves document styles like centering and alignment.
27
+ preserve_linebreaks: If True, preserves line breaks and proper list formatting.
28
+
29
+ Returns:
30
+ A tuple (success: bool, message: str).
31
+ """
32
+ extra_args = []
33
+
34
+ # Ensure standalone document (not fragment)
35
+ extra_args.append("--standalone")
36
+
37
+ # Basic options
38
+ if generate_toc:
39
+ extra_args.append("--toc")
40
+ if extract_media_to_path:
41
+ extra_args.append(f"--extract-media={extract_media_to_path}")
42
+ if latex_template_path and os.path.isfile(latex_template_path):
43
+ extra_args.append(f"--template={latex_template_path}")
44
+ elif latex_template_path:
45
+ pass # Template not found, Pandoc will handle the error
46
+
47
+ # Enhanced features
48
+ if overleaf_compatible:
49
+ extra_args.extend([
50
+ "--resource-path=./",
51
+ "--default-image-extension=png"
52
+ ])
53
+
54
+ if preserve_styles:
55
+ extra_args.extend([
56
+ "--from=docx+styles",
57
+ "--wrap=preserve",
58
+ "--columns=72",
59
+ "--strip-comments" # Remove comments that might cause highlighting
60
+ ])
61
+
62
+ if preserve_linebreaks:
63
+ extra_args.extend([
64
+ "--preserve-tabs",
65
+ "--wrap=preserve",
66
+ "--reference-doc=" + docx_path # Use original Word doc as reference for formatting
67
+ ])
68
+
69
+ # Create minimal Lua filter that preserves Word's original line breaks
70
+ lua_filter_content = '''
71
+ function Para(elem)
72
+ -- Preserve all line breaks exactly as they appear in Word
73
+ -- This maintains Word's original pagination and formatting
74
+ local new_content = {}
75
+
76
+ for i, item in ipairs(elem.content) do
77
+ if item.t == "SoftBreak" then
78
+ -- Convert all soft breaks to line breaks to match Word's formatting
79
+ table.insert(new_content, pandoc.LineBreak())
80
+ else
81
+ table.insert(new_content, item)
82
+ end
83
+ end
84
+
85
+ elem.content = new_content
86
+ return elem
87
+ end
88
+
89
+ function LineBlock(elem)
90
+ -- Preserve line blocks exactly as they are
91
+ return elem
92
+ end
93
+
94
+ function Span(elem)
95
+ -- Remove unwanted highlighting and formatting
96
+ if elem.attributes and elem.attributes.style then
97
+ -- Remove background colors and highlighting
98
+ local style = elem.attributes.style
99
+ if string.find(style, "background") or string.find(style, "highlight") then
100
+ elem.attributes.style = nil
101
+ end
102
+ end
103
+ return elem
104
+ end
105
+
106
+ function Div(elem)
107
+ -- Remove unwanted div formatting that causes highlighting
108
+ if elem.attributes and elem.attributes.style then
109
+ local style = elem.attributes.style
110
+ if string.find(style, "background") or string.find(style, "highlight") then
111
+ elem.attributes.style = nil
112
+ end
113
+ end
114
+ return elem
115
+ end
116
+
117
+ function RawBlock(elem)
118
+ -- Preserve raw LaTeX blocks
119
+ if elem.format == "latex" then
120
+ return elem
121
+ end
122
+ end
123
+ '''
124
+
125
+ # Create temporary Lua filter file
126
+ with tempfile.NamedTemporaryFile(mode='w', suffix='.lua', delete=False) as f:
127
+ f.write(lua_filter_content)
128
+ lua_filter_path = f.name
129
+
130
+ extra_args.append(f"--lua-filter={lua_filter_path}")
131
+
132
+ try:
133
+ # Perform conversion
134
+ pypandoc.convert_file(docx_path, 'latex', outputfile=latex_path, extra_args=extra_args)
135
+
136
+ # Clean up temporary Lua filter if created
137
+ if 'lua_filter_path' in locals():
138
+ try:
139
+ os.unlink(lua_filter_path)
140
+ except OSError:
141
+ pass
142
+
143
+ # Apply post-processing enhancements (always applied for Unicode conversion)
144
+ _apply_post_processing(latex_path, overleaf_compatible, preserve_styles, preserve_linebreaks, extract_media_to_path)
145
+
146
+ # Generate status message
147
+ enhancements = []
148
+ if overleaf_compatible:
149
+ enhancements.append("Overleaf compatibility")
150
+ if preserve_styles:
151
+ enhancements.append("style preservation")
152
+ if preserve_linebreaks:
153
+ enhancements.append("line break preservation")
154
+
155
+ if enhancements:
156
+ enhancement_msg = f" with {', '.join(enhancements)}"
157
+ else:
158
+ enhancement_msg = ""
159
+
160
+ return True, f"Conversion successful{enhancement_msg}!"
161
+
162
+ except RuntimeError as e:
163
+ # Clean up temporary Lua filter if created
164
+ if 'lua_filter_path' in locals():
165
+ try:
166
+ os.unlink(lua_filter_path)
167
+ except OSError:
168
+ pass
169
+ return False, f"RuntimeError: Could not execute Pandoc. Please ensure Pandoc is installed and in your system's PATH. Error: {e}"
170
+ except Exception as e:
171
+ # Clean up temporary Lua filter if created
172
+ if 'lua_filter_path' in locals():
173
+ try:
174
+ os.unlink(lua_filter_path)
175
+ except OSError:
176
+ pass
177
+ return False, f"Conversion failed: {e}"
178
+
179
+ def _apply_post_processing(latex_path: str, overleaf_compatible: bool, preserve_styles: bool, preserve_linebreaks: bool, extract_media_to_path: str = None):
180
+ """
181
+ Apply post-processing enhancements to the generated LaTeX file.
182
+ """
183
+ try:
184
+ with open(latex_path, 'r', encoding='utf-8') as f:
185
+ content = f.read()
186
+
187
+ # Always inject essential packages for compilation compatibility
188
+ content = _inject_essential_packages(content)
189
+
190
+ # Fix mixed mathematical expressions first to remove duplicated text
191
+ content = _fix_mixed_mathematical_expressions(content)
192
+
193
+ # Convert Unicode mathematical characters to LaTeX equivalents (always applied)
194
+ content = _convert_unicode_math_characters(content)
195
+
196
+ # Apply additional Unicode cleanup as a safety net
197
+ content = _additional_unicode_cleanup(content)
198
+
199
+ # Apply overleaf compatibility fixes
200
+ if overleaf_compatible:
201
+ content = _fix_image_paths_for_overleaf(content, extract_media_to_path)
202
+
203
+ # Apply style preservation enhancements
204
+ if preserve_styles:
205
+ content = _inject_latex_packages(content)
206
+ content = _add_centering_commands(content)
207
+
208
+ # Apply line break preservation fixes
209
+ if preserve_linebreaks:
210
+ content = _fix_line_breaks_and_spacing(content)
211
+
212
+ # Remove unwanted formatting and highlighting
213
+ content = _remove_unwanted_formatting(content)
214
+
215
+ # Fix common LaTeX compilation issues
216
+ content = _fix_compilation_issues(content)
217
+
218
+ # Write back the processed content
219
+ with open(latex_path, 'w', encoding='utf-8') as f:
220
+ f.write(content)
221
+
222
+ except Exception as e:
223
+ # Post-processing failures shouldn't break the conversion
224
+ print(f"Warning: Post-processing failed: {e}")
225
+
226
+ def _inject_essential_packages(content: str) -> str:
227
+ """
228
+ Inject essential packages that are always needed for compilation.
229
+ """
230
+ # Core packages that Pandoc might not include but are often needed
231
+ essential_packages = [
232
+ r'\usepackage[utf8]{inputenc}', # UTF-8 input encoding
233
+ r'\usepackage[T1]{fontenc}', # Font encoding
234
+ r'\usepackage{graphicx}', # For images
235
+ r'\usepackage{longtable}', # For tables
236
+ r'\usepackage{booktabs}', # Better table formatting
237
+ r'\usepackage{hyperref}', # For links (if not already included)
238
+ r'\usepackage{amsmath}', # Mathematical formatting
239
+ r'\usepackage{amssymb}', # Mathematical symbols
240
+ r'\usepackage{textcomp}', # Additional text symbols
241
+ ]
242
+
243
+ documentclass_pattern = r'\\documentclass(?:\[[^\]]*\])?\{[^}]+\}'
244
+ documentclass_match = re.search(documentclass_pattern, content)
245
+
246
+ if documentclass_match:
247
+ insert_pos = documentclass_match.end()
248
+
249
+ packages_to_insert = []
250
+ for package in essential_packages:
251
+ package_name = package.split('{')[1].split('}')[0].split(']')[0] # Extract package name
252
+ if f'usepackage' not in content or package_name not in content:
253
+ packages_to_insert.append(package)
254
+
255
+ if packages_to_insert:
256
+ package_block = '\n% Essential packages for compilation\n' + '\n'.join(packages_to_insert) + '\n'
257
+ content = content[:insert_pos] + package_block + content[insert_pos:]
258
+
259
+ # Add Unicode character definitions to handle any remaining problematic characters
260
+ unicode_definitions = r'''
261
+ % Unicode character definitions for LaTeX compatibility
262
+ \DeclareUnicodeCharacter{2003}{ } % Em space
263
+ \DeclareUnicodeCharacter{2002}{ } % En space
264
+ \DeclareUnicodeCharacter{2009}{ } % Thin space
265
+ \DeclareUnicodeCharacter{200A}{ } % Hair space
266
+ \DeclareUnicodeCharacter{2004}{ } % Three-per-em space
267
+ \DeclareUnicodeCharacter{2005}{ } % Four-per-em space
268
+ \DeclareUnicodeCharacter{2006}{ } % Six-per-em space
269
+ \DeclareUnicodeCharacter{2008}{ } % Punctuation space
270
+ \DeclareUnicodeCharacter{202F}{ } % Narrow no-break space
271
+ \DeclareUnicodeCharacter{2212}{-} % Unicode minus sign
272
+ \DeclareUnicodeCharacter{2010}{-} % Hyphen
273
+ \DeclareUnicodeCharacter{2011}{-} % Non-breaking hyphen
274
+ \DeclareUnicodeCharacter{2013}{--} % En dash
275
+ \DeclareUnicodeCharacter{2014}{---}% Em dash
276
+ '''
277
+
278
+ # Insert Unicode definitions after packages but before \begin{document}
279
+ begin_doc_match = re.search(r'\\begin\{document\}', content)
280
+ if begin_doc_match:
281
+ insert_pos_unicode = begin_doc_match.start()
282
+ content = content[:insert_pos_unicode] + unicode_definitions + '\n' + content[insert_pos_unicode:]
283
+
284
+ return content
285
+
286
+ def _convert_unicode_math_characters(content: str) -> str:
287
+ """
288
+ Convert Unicode mathematical characters to their LaTeX equivalents.
289
+ """
290
+ # Dictionary of Unicode characters to LaTeX commands
291
+ unicode_to_latex = {
292
+ # Mathematical operators
293
+ 'Δ': r'$\Delta$', # U+0394 - Greek capital letter delta
294
+ 'δ': r'$\delta$', # U+03B4 - Greek small letter delta
295
+ '∑': r'$\sum$', # U+2211 - N-ary summation
296
+ '∏': r'$\prod$', # U+220F - N-ary product
297
+ '∫': r'$\int$', # U+222B - Integral
298
+ '∂': r'$\partial$', # U+2202 - Partial differential
299
+ '∇': r'$\nabla$', # U+2207 - Nabla
300
+ '√': r'$\sqrt{}$', # U+221A - Square root
301
+ '∞': r'$\infty$', # U+221E - Infinity
302
+
303
+ # Relations and equality
304
+ '≈': r'$\approx$', # U+2248 - Almost equal to
305
+ '≠': r'$\neq$', # U+2260 - Not equal to
306
+ '≤': r'$\leq$', # U+2264 - Less-than or equal to
307
+ '≥': r'$\geq$', # U+2265 - Greater-than or equal to
308
+ '±': r'$\pm$', # U+00B1 - Plus-minus sign
309
+ '∓': r'$\mp$', # U+2213 - Minus-or-plus sign
310
+ '×': r'$\times$', # U+00D7 - Multiplication sign
311
+ '÷': r'$\div$', # U+00F7 - Division sign
312
+ '⋅': r'$\cdot$', # U+22C5 - Dot operator
313
+
314
+ # Set theory and logic
315
+ '∈': r'$\in$', # U+2208 - Element of
316
+ '∉': r'$\notin$', # U+2209 - Not an element of
317
+ '⊂': r'$\subset$', # U+2282 - Subset of
318
+ '⊃': r'$\supset$', # U+2283 - Superset of
319
+ '⊆': r'$\subseteq$', # U+2286 - Subset of or equal to
320
+ '⊇': r'$\supseteq$', # U+2287 - Superset of or equal to
321
+ '∪': r'$\cup$', # U+222A - Union
322
+ '∩': r'$\cap$', # U+2229 - Intersection
323
+ '∅': r'$\emptyset$', # U+2205 - Empty set
324
+ '∀': r'$\forall$', # U+2200 - For all
325
+ '∃': r'$\exists$', # U+2203 - There exists
326
+
327
+ # Special symbols
328
+ '∣': r'$|$', # U+2223 - Divides
329
+ '∥': r'$\parallel$', # U+2225 - Parallel to
330
+ '⊥': r'$\perp$', # U+22A5 - Up tack (perpendicular)
331
+ '∠': r'$\angle$', # U+2220 - Angle
332
+ '°': r'$^\circ$', # U+00B0 - Degree sign
333
+
334
+ # Arrows
335
+ '→': r'$\rightarrow$', # U+2192 - Rightwards arrow
336
+ '←': r'$\leftarrow$', # U+2190 - Leftwards arrow
337
+ '↔': r'$\leftrightarrow$', # U+2194 - Left right arrow
338
+ '⇒': r'$\Rightarrow$', # U+21D2 - Rightwards double arrow
339
+ '⇐': r'$\Leftarrow$', # U+21D0 - Leftwards double arrow
340
+ '⇔': r'$\Leftrightarrow$', # U+21D4 - Left right double arrow
341
+
342
+ # Accents and diacritics
343
+ 'ˉ': r'$\bar{}$', # U+02C9 - Modifier letter macron
344
+ 'ˆ': r'$\hat{}$', # U+02C6 - Modifier letter circumflex accent
345
+ 'ˇ': r'$\check{}$', # U+02C7 - Caron
346
+ '˜': r'$\tilde{}$', # U+02DC - Small tilde
347
+ '˙': r'$\dot{}$', # U+02D9 - Dot above
348
+ '¨': r'$\ddot{}$', # U+00A8 - Diaeresis
349
+
350
+ # Special minus and spaces - using explicit Unicode escape sequences
351
+ '−': r'-', # U+2212 - Minus sign (convert to regular hyphen)
352
+ '\u2003': r' ', # U+2003 - Em space (convert to regular space)
353
+ '\u2009': r' ', # U+2009 - Thin space (convert to regular space)
354
+ '\u2002': r' ', # U+2002 - En space (convert to regular space)
355
+ '\u2004': r' ', # U+2004 - Three-per-em space
356
+ '\u2005': r' ', # U+2005 - Four-per-em space
357
+ '\u2006': r' ', # U+2006 - Six-per-em space
358
+ '\u2008': r' ', # U+2008 - Punctuation space
359
+ '\u200A': r' ', # U+200A - Hair space
360
+ '\u202F': r' ', # U+202F - Narrow no-break space
361
+
362
+ # Greek letters (commonly used in math)
363
+ 'α': r'$\alpha$', # U+03B1
364
+ 'β': r'$\beta$', # U+03B2
365
+ 'γ': r'$\gamma$', # U+03B3
366
+ 'Γ': r'$\Gamma$', # U+0393
367
+ 'ε': r'$\varepsilon$', # U+03B5
368
+ 'ζ': r'$\zeta$', # U+03B6
369
+ 'η': r'$\eta$', # U+03B7
370
+ 'θ': r'$\theta$', # U+03B8
371
+ 'Θ': r'$\Theta$', # U+0398
372
+ 'ι': r'$\iota$', # U+03B9
373
+ 'κ': r'$\kappa$', # U+03BA
374
+ 'λ': r'$\lambda$', # U+03BB
375
+ 'Λ': r'$\Lambda$', # U+039B
376
+ 'μ': r'$\mu$', # U+03BC
377
+ 'ν': r'$\nu$', # U+03BD
378
+ 'ξ': r'$\xi$', # U+03BE
379
+ 'Ξ': r'$\Xi$', # U+039E
380
+ 'π': r'$\pi$', # U+03C0
381
+ 'Π': r'$\Pi$', # U+03A0
382
+ 'ρ': r'$\rho$', # U+03C1
383
+ 'σ': r'$\sigma$', # U+03C3
384
+ 'Σ': r'$\Sigma$', # U+03A3
385
+ 'τ': r'$\tau$', # U+03C4
386
+ 'υ': r'$\upsilon$', # U+03C5
387
+ 'Υ': r'$\Upsilon$', # U+03A5
388
+ 'φ': r'$\varphi$', # U+03C6
389
+ 'Φ': r'$\Phi$', # U+03A6
390
+ 'χ': r'$\chi$', # U+03C7
391
+ 'ψ': r'$\psi$', # U+03C8
392
+ 'Ψ': r'$\Psi$', # U+03A8
393
+ 'ω': r'$\omega$', # U+03C9
394
+ 'Ω': r'$\Omega$', # U+03A9
395
+ }
396
+
397
+ # Apply conversions
398
+ for unicode_char, latex_cmd in unicode_to_latex.items():
399
+ if unicode_char in content:
400
+ content = content.replace(unicode_char, latex_cmd)
401
+
402
+ # Additional aggressive Unicode space cleanup using regex
403
+ # Handle various Unicode spaces more comprehensively
404
+ content = re.sub(r'[\u2000-\u200F\u2028-\u202F\u205F\u3000]', ' ', content) # All Unicode spaces
405
+
406
+ # Handle specific problematic Unicode characters that might not be in our dictionary
407
+ content = re.sub(r'[\u2010-\u2015]', '-', content) # Various Unicode dashes
408
+ content = re.sub(r'[\u2212]', '-', content) # Unicode minus sign
409
+
410
+ # Handle specific cases where characters might appear in math environments
411
+ # Fix double math mode (e.g., $\alpha$ inside already math mode)
412
+ content = re.sub(r'\$\$([^$]+)\$\$', r'$\1$', content) # Convert display math to inline
413
+ content = re.sub(r'\$\$([^$]*)\$([^$]*)\$\$', r'$\1\2$', content) # Fix broken math
414
+
415
+ # Fix bar notation that might have been broken
416
+ content = re.sub(r'\$\\bar\{\}\$([a-zA-Z])', r'$\\bar{\1}$', content)
417
+ content = re.sub(r'([a-zA-Z])\$\\bar\{\}\$', r'$\\bar{\1}$', content)
418
+
419
+ return content
420
+
421
+ def _additional_unicode_cleanup(content: str) -> str:
422
+ """
423
+ Additional aggressive Unicode cleanup to handle any characters that slip through.
424
+ """
425
+ # Convert all common problematic Unicode spaces to regular spaces
426
+ # This covers a wider range than the dictionary approach
427
+ unicode_spaces = [
428
+ '\u00A0', # Non-breaking space
429
+ '\u1680', # Ogham space mark
430
+ '\u2000', # En quad
431
+ '\u2001', # Em quad
432
+ '\u2002', # En space
433
+ '\u2003', # Em space
434
+ '\u2004', # Three-per-em space
435
+ '\u2005', # Four-per-em space
436
+ '\u2006', # Six-per-em space
437
+ '\u2007', # Figure space
438
+ '\u2008', # Punctuation space
439
+ '\u2009', # Thin space
440
+ '\u200A', # Hair space
441
+ '\u200B', # Zero width space
442
+ '\u202F', # Narrow no-break space
443
+ '\u205F', # Medium mathematical space
444
+ '\u3000', # Ideographic space
445
+ ]
446
+
447
+ for unicode_space in unicode_spaces:
448
+ content = content.replace(unicode_space, ' ')
449
+
450
+ # Convert Unicode dashes
451
+ unicode_dashes = [
452
+ '\u2010', # Hyphen
453
+ '\u2011', # Non-breaking hyphen
454
+ '\u2012', # Figure dash
455
+ '\u2013', # En dash
456
+ '\u2014', # Em dash
457
+ '\u2015', # Horizontal bar
458
+ '\u2212', # Minus sign
459
+ ]
460
+
461
+ for unicode_dash in unicode_dashes:
462
+ if unicode_dash in ['\u2013', '\u2014']: # En and Em dashes
463
+ content = content.replace(unicode_dash, '--')
464
+ else:
465
+ content = content.replace(unicode_dash, '-')
466
+
467
+ # Use regex for any remaining problematic characters
468
+ # Remove or replace any remaining Unicode characters that commonly cause issues
469
+ content = re.sub(r'[\u2000-\u200F\u2028-\u202F\u205F\u3000]', ' ', content)
470
+ content = re.sub(r'[\u2010-\u2015\u2212]', '-', content)
471
+
472
+ return content
473
+
474
+ def _fix_mixed_mathematical_expressions(content: str) -> str:
475
+ """
476
+ Removes duplicated plain-text versions of mathematical expressions
477
+ that Pandoc sometimes generates alongside the LaTeX version by deleting
478
+ the plain text part when it is immediately followed by the LaTeX part.
479
+ """
480
+
481
+ processed_content = content
482
+
483
+ # A list of compiled regex patterns.
484
+ # Each pattern matches a plain-text formula but only if it's followed
485
+ # by its corresponding LaTeX version (using a positive lookahead).
486
+ patterns_to_remove = [
487
+ # Pattern for: hq,k=x[nq,k]...h_{q,k} = x[n_{q,k}]...
488
+ re.compile(r'h[qrs],k=x\[n[qrs],k\](?:,h[qrs],k=x\[n[qrs],k\])*\s*' +
489
+ r'(?=h_{q,k}\s*=\s*x\\\[n_{q,k}\\\],)', re.UNICODE),
490
+
491
+ # Pattern for: ∆hq,r,k=hq,k-hr,k...\Delta h_{q,r,k} = ...
492
+ re.compile(r'(?:∆h[qrs],[qrs],k=h[qrs],k-h[qrs],k\s*)+' +
493
+ r'(?=\\Delta\s*h_{q,r,k})', re.UNICODE),
494
+
495
+ # Pattern for: RRk=tr,k+1-tr,kRR_k = ...
496
+ re.compile(r'RRk=tr,k\+1-tr,k\s*' +
497
+ r'(?=RR_k\s*=\s*t_{r,k\+1})', re.UNICODE),
498
+
499
+ # Pattern for: Tmed=median{RRk}T_{\mathrm{med}}
500
+ re.compile(r'Tmed=median\{RRk\}\s*' +
501
+ r'(?=T_{\\mathrm{med}}\s*=\s*\\mathrm{median}\\{RR_k\\})', re.UNICODE),
502
+
503
+ # Pattern for: Tk=[tr,k-Tmed2, tr,k+Tmed2]\mathcal{T}_k
504
+ re.compile(r'Tk=\[tr,k-Tmed2,.*?tr,k\+Tmed2\]\s*' +
505
+ r'(?=\\mathcal\{T\}_k\s*=\s*\\\[t_{r,k})', re.UNICODE | re.DOTALL),
506
+
507
+ # Pattern for: h¯k=1|Ik|∑n∈Ikx[n]\bar h_k
508
+ re.compile(r'h¯k=1\|Ik\|∑n∈Ikx\[n\]\s*' +
509
+ r'(?=\\bar\s*h_k\s*=\s*\\frac)', re.UNICODE),
510
+
511
+ # Pattern for: Mrs=median{∆hr,s,k}M_{rs}
512
+ re.compile(r'Mrs=median\{∆hr,s,k\}\s*' +
513
+ r'(?=M_{rs}\s*=\s*\\mathrm{median})', re.UNICODE),
514
+
515
+ # Pattern for: ∆h¯k=h¯k-Mrs\Delta\bar h_k
516
+ re.compile(r'∆h¯k=h¯k-Mrs\s*' +
517
+ r'(?=\\Delta\\bar\s*h_k\s*=\s*\\bar\s*h_k)', re.UNICODE),
518
+ ]
519
+
520
+ for pattern in patterns_to_remove:
521
+ processed_content = pattern.sub('', processed_content)
522
+
523
+ return processed_content
524
+
525
+ def _fix_compilation_issues(content: str) -> str:
526
+ """
527
+ Fix common LaTeX compilation issues.
528
+ """
529
+ # Fix \tightlist command if not defined
530
+ if r'\tightlist' in content and r'\providecommand{\tightlist}' not in content:
531
+ tightlist_def = r'''
532
+ % Define \tightlist command for lists
533
+ \providecommand{\tightlist}{%
534
+ \setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}
535
+ '''
536
+ # Insert after packages but before \begin{document}
537
+ begin_doc_match = re.search(r'\\begin\{document\}', content)
538
+ if begin_doc_match:
539
+ insert_pos = begin_doc_match.start()
540
+ content = content[:insert_pos] + tightlist_def + '\n' + content[insert_pos:]
541
+
542
+ # Fix \euro command if used but not defined
543
+ if r'\euro' in content and r'usepackage{eurosym}' not in content:
544
+ content = re.sub(
545
+ r'(\\usepackage\{[^}]+\}\s*\n)',
546
+ r'\1\\usepackage{eurosym}\n',
547
+ content,
548
+ count=1
549
+ )
550
+
551
+ # Fix undefined references to figures/tables
552
+ content = re.sub(r'\\ref\{fig:([^}]+)\}', r'Figure~\\ref{fig:\1}', content)
553
+ content = re.sub(r'\\ref\{tab:([^}]+)\}', r'Table~\\ref{tab:\1}', content)
554
+
555
+ # Ensure proper figure placement
556
+ if r'\begin{figure}' in content:
557
+ content = re.sub(
558
+ r'\\begin\{figure\}(?!\[)',
559
+ r'\\begin{figure}[htbp]',
560
+ content
561
+ )
562
+
563
+ # Ensure proper table placement
564
+ if r'\begin{table}' in content:
565
+ content = re.sub(
566
+ r'\\begin\{table\}(?!\[)',
567
+ r'\\begin{table}[htbp]',
568
+ content
569
+ )
570
+
571
+ return content
572
+
573
+ def _fix_image_paths_for_overleaf(content: str, extract_media_to_path: str = None) -> str:
574
+ """
575
+ Convert absolute image paths to relative paths for Overleaf compatibility.
576
+ """
577
+ if extract_media_to_path:
578
+ # Extract the media directory name
579
+ media_dir = os.path.basename(extract_media_to_path.rstrip('/'))
580
+
581
+ # Fix paths with task IDs like: task_id_media/media/image.png -> media/image.png
582
+ # Pattern: \includegraphics{any_path/task_id_media/media/image.ext}
583
+ # Replace with: \includegraphics{media/image.ext}
584
+ pattern1 = r'\\includegraphics(\[[^\]]*\])?\{[^{}]*[a-f0-9\-]+_media[/\\]media[/\\]([^{}]+)\}'
585
+ replacement1 = r'\\includegraphics\1{media/\2}'
586
+ content = re.sub(pattern1, replacement1, content)
587
+
588
+ # Fix paths like: task_id_media/media/image.png -> media/image.png (without includegraphics)
589
+ pattern2 = r'[a-f0-9\-]+_media[/\\]media[/\\]'
590
+ replacement2 = r'media/'
591
+ content = re.sub(pattern2, replacement2, content)
592
+
593
+ # Also handle regular media paths: /absolute/path/to/media/image.ext -> media/image.ext
594
+ pattern3 = r'\\includegraphics(\[[^\]]*\])?\{[^{}]*[/\\]' + re.escape(media_dir) + r'[/\\]([^{}]+)\}'
595
+ replacement3 = r'\\includegraphics\1{' + media_dir + r'/\2}'
596
+ content = re.sub(pattern3, replacement3, content)
597
+
598
+ return content
599
+
600
+ def _remove_unwanted_formatting(content: str) -> str:
601
+ """
602
+ Remove unwanted highlighting and formatting that causes visual issues.
603
+ """
604
+ # Remove highlighting commands
605
+ content = re.sub(r'\\colorbox\{[^}]*\}\{([^}]*)\}', r'\1', content)
606
+ content = re.sub(r'\\hl\{([^}]*)\}', r'\1', content)
607
+ content = re.sub(r'\\texthl\{([^}]*)\}', r'\1', content)
608
+ content = re.sub(r'\\hlc\[[^\]]*\]\{([^}]*)\}', r'\1', content)
609
+
610
+ # Remove table cell coloring
611
+ content = re.sub(r'\\cellcolor\{[^}]*\}', '', content)
612
+ content = re.sub(r'\\rowcolor\{[^}]*\}', '', content)
613
+ content = re.sub(r'\\columncolor\{[^}]*\}', '', content)
614
+
615
+ # Remove text background colors
616
+ content = re.sub(r'\\textcolor\{[^}]*\}\{([^}]*)\}', r'\1', content)
617
+ content = re.sub(r'\\color\{[^}]*\}', '', content)
618
+
619
+ # Remove box formatting that might cause highlighting
620
+ content = re.sub(r'\\fcolorbox\{[^}]*\}\{[^}]*\}\{([^}]*)\}', r'\1', content)
621
+ content = re.sub(r'\\framebox\[[^\]]*\]\{([^}]*)\}', r'\1', content)
622
+
623
+ # Remove soul package highlighting
624
+ content = re.sub(r'\\sethlcolor\{[^}]*\}', '', content)
625
+ content = re.sub(r'\\ul\{([^}]*)\}', r'\1', content) # Remove underline if causing issues
626
+
627
+ return content
628
+
629
+ def _inject_latex_packages(content: str) -> str:
630
+ """
631
+ Inject additional LaTeX packages needed for enhanced formatting.
632
+ """
633
+ # Essential packages for enhanced conversion
634
+ essential_packages = [
635
+ r'\usepackage{graphicx}', # For images - ensure it's included
636
+ r'\usepackage{longtable}', # For tables
637
+ r'\usepackage{booktabs}', # Better table formatting
638
+ r'\usepackage{array}', # Enhanced table formatting
639
+ r'\usepackage{calc}', # For calculations
640
+ r'\usepackage{url}', # For URLs
641
+ ]
642
+
643
+ # Style enhancement packages
644
+ style_packages = [
645
+ r'\usepackage{float}', # Better float positioning
646
+ r'\usepackage{adjustbox}', # For centering and scaling
647
+ r'\usepackage{caption}', # Better caption formatting
648
+ r'\usepackage{subcaption}', # For subfigures
649
+ r'\usepackage{tabularx}', # Flexible table widths
650
+ r'\usepackage{enumitem}', # Better list formatting
651
+ r'\usepackage{setspace}', # Line spacing control
652
+ r'\usepackage{ragged2e}', # Better text alignment
653
+ r'\usepackage{amsmath}', # Mathematical formatting
654
+ r'\usepackage{amssymb}', # Mathematical symbols
655
+ r'\usepackage{needspace}', # Prevent orphaned lines and improve page breaks
656
+ ]
657
+
658
+ all_packages = essential_packages + style_packages
659
+
660
+ # Find the position after \documentclass but before any existing \usepackage or \begin{document}
661
+ documentclass_pattern = r'\\documentclass(?:\[[^\]]*\])?\{[^}]+\}'
662
+ documentclass_match = re.search(documentclass_pattern, content)
663
+
664
+ if documentclass_match:
665
+ insert_pos = documentclass_match.end()
666
+
667
+ # Find the next significant LaTeX command to insert before it
668
+ # Look for existing \usepackage, \begin{document}, or other commands
669
+ remaining_content = content[insert_pos:]
670
+ next_command_match = re.search(r'\\(?:usepackage|begin\{document\}|title|author|date)', remaining_content)
671
+
672
+ if next_command_match:
673
+ insert_pos += next_command_match.start()
674
+
675
+ # Check which packages are not already included
676
+ packages_to_insert = []
677
+ for package in all_packages:
678
+ package_name = package.replace(r'\usepackage{', '').replace('}', '')
679
+ if f'usepackage{{{package_name}}}' not in content:
680
+ packages_to_insert.append(package)
681
+
682
+ if packages_to_insert:
683
+ # Add packages with proper spacing
684
+ package_block = '\n% Enhanced conversion packages\n' + '\n'.join(packages_to_insert) + '\n\n'
685
+ content = content[:insert_pos] + package_block + content[insert_pos:]
686
+
687
+ return content
688
+
689
+ def _add_centering_commands(content: str) -> str:
690
+ """
691
+ Add centering commands to figures and tables.
692
+ """
693
+ # Add \centering to figure environments
694
+ content = re.sub(
695
+ r'(\\begin\{figure\}(?:\[[^\]]*\])?)\s*\n',
696
+ r'\1\n\\centering\n',
697
+ content
698
+ )
699
+
700
+ # Add \centering to table environments
701
+ content = re.sub(
702
+ r'(\\begin\{table\}(?:\[[^\]]*\])?)\s*\n',
703
+ r'\1\n\\centering\n',
704
+ content
705
+ )
706
+
707
+ return content
708
+
709
+ def _fix_line_breaks_and_spacing(content: str) -> str:
710
+ """
711
+ Minimal fixes to preserve Word's original formatting and pagination.
712
+ """
713
+ # Remove unwanted highlighting and color commands
714
+ content = re.sub(r'\\colorbox\{[^}]*\}\{([^}]*)\}', r'\1', content)
715
+ content = re.sub(r'\\hl\{([^}]*)\}', r'\1', content)
716
+ content = re.sub(r'\\texthl\{([^}]*)\}', r'\1', content)
717
+ content = re.sub(r'\\cellcolor\{[^}]*\}', '', content)
718
+ content = re.sub(r'\\rowcolor\{[^}]*\}', '', content)
719
+
720
+ # Only fix critical spacing issues that break compilation
721
+ # Preserve Word's original line breaks and spacing as much as possible
722
+
723
+ # Ensure proper spacing around lists but don't change internal spacing
724
+ content = re.sub(r'\n\\begin\{enumerate\}\n\n', r'\n\n\\begin{enumerate}\n', content)
725
+ content = re.sub(r'\n\n\\end\{enumerate\}\n', r'\n\\end{enumerate}\n\n', content)
726
+ content = re.sub(r'\n\\begin\{itemize\}\n\n', r'\n\n\\begin{itemize}\n', content)
727
+ content = re.sub(r'\n\n\\end\{itemize\}\n', r'\n\\end{itemize}\n\n', content)
728
+
729
+ # Minimal section spacing - preserve Word's pagination
730
+ content = re.sub(r'\n(\\(?:sub)*section\{[^}]+\})\n\n', r'\n\n\1\n\n', content)
731
+
732
+ # Only remove excessive spacing (3+ line breaks) but preserve double breaks
733
+ content = re.sub(r'\n\n\n+', r'\n\n', content)
734
+
735
+ # Ensure proper spacing around figures and tables
736
+ content = re.sub(r'\n\\begin\{figure\}', r'\n\n\\begin{figure}', content)
737
+ content = re.sub(r'\\end\{figure\}\n([A-Z])', r'\\end{figure}\n\n\1', content)
738
+ content = re.sub(r'\n\\begin\{table\}', r'\n\n\\begin{table}', content)
739
+ content = re.sub(r'\\end\{table\}\n([A-Z])', r'\\end{table}\n\n\1', content)
740
+
741
+ return content
742
+
743
+ if __name__ == '__main__':
744
+ from docx import Document
745
+ from docx.shared import Inches
746
+ from PIL import Image
747
+ import shutil
748
+
749
+ # --- Helper Functions for DOCX and Template Creation ---
750
+ def create_dummy_image(filename, size=(60, 60), color="red", img_format="PNG"):
751
+ img = Image.new('RGB', size, color=color)
752
+ img.save(filename, img_format)
753
+ print(f"Created dummy image: {filename}")
754
+
755
+ def create_test_docx_with_styles(filename):
756
+ doc = Document()
757
+ doc.add_heading("Document with Enhanced Features", level=1)
758
+
759
+ # Add paragraph with text
760
+ p1 = doc.add_paragraph("This document tests enhanced features including:")
761
+
762
+ # Add numbered list
763
+ doc.add_paragraph("First numbered item", style='List Number')
764
+ doc.add_paragraph("Second numbered item", style='List Number')
765
+ doc.add_paragraph("Third numbered item", style='List Number')
766
+
767
+ # Add some text
768
+ doc.add_paragraph("Here is some regular text between lists.")
769
+
770
+ # Add bullet list
771
+ doc.add_paragraph("First bullet point", style='List Bullet')
772
+ doc.add_paragraph("Second bullet point", style='List Bullet')
773
+
774
+ doc.add_heading("Image Section", level=2)
775
+ doc.add_paragraph("Below is a test image:")
776
+
777
+ doc.save(filename)
778
+ print(f"Created test DOCX with styles: {filename}")
779
+
780
+ def create_complex_docx(filename, img1_path, img2_path):
781
+ doc = Document()
782
+ doc.add_heading("Complex Document Title", level=1)
783
+ doc.add_paragraph("Introduction to the complex document.")
784
+ doc.add_heading("Image Section", level=2)
785
+ doc.add_picture(img1_path, width=Inches(1.0))
786
+ doc.add_paragraph("Some text after the first image.")
787
+ doc.add_picture(img2_path, width=Inches(1.0))
788
+ doc.add_heading("Conclusion Section", level=2)
789
+ doc.add_paragraph("Final remarks.")
790
+ doc.save(filename)
791
+ print(f"Created complex DOCX: {filename}")
792
+
793
+ # --- Test Files ---
794
+ docx_styles = "test_enhanced_styles.docx"
795
+ docx_complex = "test_complex_enhanced.docx"
796
+ img1 = "dummy_img1.png"
797
+ img2 = "dummy_img2.jpg"
798
+
799
+ output_enhanced_test = "output_enhanced_test.tex"
800
+ output_overleaf_test = "output_overleaf_test.tex"
801
+ media_dir = "./media_enhanced"
802
+
803
+ all_test_files = [docx_styles, docx_complex, img1, img2, output_enhanced_test, output_overleaf_test]
804
+ all_test_dirs = [media_dir]
805
+
806
+ # --- Create Test Files ---
807
+ print("--- Setting up enhanced test files ---")
808
+ create_dummy_image(img1, color="blue", img_format="PNG")
809
+ create_dummy_image(img2, color="green", img_format="JPEG")
810
+ create_test_docx_with_styles(docx_styles)
811
+ create_complex_docx(docx_complex, img1, img2)
812
+ print("--- Enhanced test file setup complete ---")
813
+
814
+ # --- Test Enhanced Features ---
815
+ print("\n--- Testing Enhanced Features ---")
816
+
817
+ # Test 1: Style preservation and line breaks
818
+ print("\n--- Test 1: Enhanced Style Preservation ---")
819
+ success, msg = convert_docx_to_latex(
820
+ docx_styles,
821
+ output_enhanced_test,
822
+ generate_toc=True,
823
+ preserve_styles=True,
824
+ preserve_linebreaks=True
825
+ )
826
+ print(f"Enhanced Test: {success}, Msg: {msg}")
827
+
828
+ if success and os.path.exists(output_enhanced_test):
829
+ with open(output_enhanced_test, 'r') as f:
830
+ content = f.read()
831
+ checks = {
832
+ 'packages': any(pkg in content for pkg in ['\\usepackage{float}', '\\usepackage{enumitem}']),
833
+ 'toc': '\\tableofcontents' in content,
834
+ 'sections': '\\section' in content,
835
+ 'lists': '\\begin{enumerate}' in content or '\\begin{itemize}' in content
836
+ }
837
+ print(f"Enhanced verification: {checks}")
838
+
839
+ # Test 2: Overleaf compatibility with images
840
+ print("\n--- Test 2: Overleaf Compatibility ---")
841
+ success, msg = convert_docx_to_latex(
842
+ docx_complex,
843
+ output_overleaf_test,
844
+ extract_media_to_path=media_dir,
845
+ overleaf_compatible=True,
846
+ preserve_styles=True,
847
+ preserve_linebreaks=True
848
+ )
849
+ print(f"Overleaf Test: {success}, Msg: {msg}")
850
+
851
+ if success and os.path.exists(output_overleaf_test):
852
+ with open(output_overleaf_test, 'r') as f:
853
+ content = f.read()
854
+ media_check = 'media/' in content and '\\includegraphics' in content
855
+ print(f"Overleaf compatibility check - relative paths: {media_check}")
856
+
857
+ media_files_exist = os.path.exists(os.path.join(media_dir, 'media'))
858
+ print(f"Media files extracted: {media_files_exist}")
859
+
860
+ # --- Cleanup ---
861
+ print("\n--- Cleaning up enhanced test files ---")
862
+ for f_path in all_test_files:
863
+ if os.path.exists(f_path):
864
+ try:
865
+ os.remove(f_path)
866
+ print(f"Removed: {f_path}")
867
+ except Exception as e:
868
+ print(f"Error removing {f_path}: {e}")
869
+
870
+ for d_path in all_test_dirs:
871
+ if os.path.isdir(d_path):
872
+ try:
873
+ shutil.rmtree(d_path)
874
+ print(f"Removed directory: {d_path}")
875
+ except Exception as e:
876
+ print(f"Error removing {d_path}: {e}")
877
+
878
+ print("--- Enhanced testing completed ---")
preserve_linebreaks.lua ADDED
@@ -0,0 +1,29 @@
+
+ -- preserve_linebreaks.lua
+ -- Filter for better preservation of line breaks and paragraph structure
+
+ function LineBreak(el)
+     return pandoc.RawInline("latex", "\\\\")
+ end
+
+ function SoftBreak(el)
+     return pandoc.RawInline("latex", " ")
+ end
+
+ function Para(el)
+     -- Add proper spacing for numbered lists and paragraph breaks
+     if #el.content > 0 then
+         return pandoc.Para(el.content)
+     end
+ end
+
+ -- Improve list formatting
+ function OrderedList(el)
+     -- Ensure proper spacing in numbered lists
+     return el
+ end
+
+ function BulletList(el)
+     -- Ensure proper spacing in bullet lists
+     return el
+ end
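This standalone filter mirrors the inline filter that converter.py writes to a temporary file, and it can be passed to Pandoc directly. A sketch via pypandoc, assuming Pandoc is installed and a document.docx is present (file names are placeholders):

```python
import pypandoc

# Apply the standalone filter during conversion.
pypandoc.convert_file(
    "document.docx",
    "latex",
    outputfile="document.tex",
    extra_args=["--standalone", "--lua-filter=preserve_linebreaks.lua"],
)
```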
requirements.txt ADDED
@@ -0,0 +1,7 @@
+ flask==2.3.3
+ flask-cors==4.0.0
+ pypandoc==1.13
+ python-docx==0.8.11
+ Pillow==10.0.0
+ werkzeug==2.3.7
+ gunicorn==21.2.0
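After `pip install -r requirements.txt`, a quick way to confirm the pinned distributions resolved is to print their installed versions; a small sketch using the standard library:

```python
from importlib.metadata import version

# Distribution names as published on PyPI (matching the pins above).
for dist in ["Flask", "Flask-Cors", "pypandoc", "python-docx", "Pillow", "Werkzeug", "gunicorn"]:
    print(f"{dist}=={version(dist)}")
```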
temp/.DS_Store ADDED
Binary file (6.15 kB)
 
web_api.py ADDED
@@ -0,0 +1,443 @@
1
+ from flask import Flask, request, jsonify, send_file
2
+ from flask_cors import CORS
3
+ import os
4
+ import tempfile
5
+ import uuid
6
+ from werkzeug.utils import secure_filename
7
+ from converter import convert_docx_to_latex
8
+ import shutil
9
+
10
+ app = Flask(__name__)
11
+ CORS(app) # Enable CORS for all routes
12
+
13
+ # Configuration
14
+ app.config['MAX_CONTENT_LENGTH'] = 16 * 1024 * 1024 # 16MB max file size
15
+ UPLOAD_FOLDER = 'temp/uploads'
16
+ OUTPUT_FOLDER = 'temp/outputs'
17
+
18
+ # Ensure directories exist
19
+ os.makedirs(UPLOAD_FOLDER, exist_ok=True)
20
+ os.makedirs(OUTPUT_FOLDER, exist_ok=True)
21
+
22
+ # Store conversion tasks
23
+ conversion_tasks = {}
24
+
25
+ @app.route('/api/health', methods=['GET'])
26
+ def health_check():
27
+ """Health check endpoint"""
28
+ return jsonify({'status': 'healthy', 'message': 'DOCX to LaTeX API is running'})
29
+
30
+ @app.route('/api/upload', methods=['POST'])
31
+ def upload_file():
32
+ """Handle file upload"""
33
+ try:
34
+ if 'file' not in request.files:
35
+ return jsonify({'error': 'No file provided'}), 400
36
+
37
+ file = request.files['file']
38
+ if file.filename == '':
39
+ return jsonify({'error': 'No file selected'}), 400
40
+
41
+ if not file.filename.lower().endswith('.docx'):
42
+ return jsonify({'error': 'Only DOCX files are allowed'}), 400
43
+
44
+ # Generate unique task ID
45
+ task_id = str(uuid.uuid4())
46
+
47
+ # Save uploaded file
48
+ filename = secure_filename(file.filename)
49
+ file_path = os.path.join(UPLOAD_FOLDER, f"{task_id}_{filename}")
50
+ file.save(file_path)
51
+
52
+ # Store task info
53
+ conversion_tasks[task_id] = {
54
+ 'status': 'uploaded',
55
+ 'original_filename': filename,
56
+ 'file_path': file_path,
57
+ 'output_filename': filename.replace('.docx', '.tex'),
58
+ 'created_at': os.path.getctime(file_path)
59
+ }
60
+
61
+ return jsonify({
62
+ 'task_id': task_id,
63
+ 'filename': filename,
64
+ 'status': 'uploaded',
65
+ 'message': 'File uploaded successfully'
66
+ })
67
+
68
+ except Exception as e:
69
+ return jsonify({'error': f'Upload failed: {str(e)}'}), 500
70
+
71
+ @app.route('/api/convert', methods=['POST'])
72
+ def convert_document():
73
+ """Convert DOCX to LaTeX"""
74
+ try:
75
+ data = request.get_json()
76
+
77
+ if not data or 'task_id' not in data:
78
+ return jsonify({'error': 'Task ID is required'}), 400
79
+
80
+ task_id = data['task_id']
81
+
82
+ if task_id not in conversion_tasks:
83
+ return jsonify({'error': 'Invalid task ID'}), 404
84
+
85
+ task = conversion_tasks[task_id]
86
+
87
+ if task['status'] != 'uploaded':
88
+ return jsonify({'error': 'Task is not ready for conversion'}), 400
89
+
90
+ # Get conversion options
91
+ options = data.get('options', {})
92
+ output_filename = data.get('output_filename', task['output_filename'])
93
+
94
+ # Update task status
95
+ task['status'] = 'converting'
96
+ task['output_filename'] = output_filename
97
+
98
+ # Prepare output paths
99
+ output_path = os.path.join(OUTPUT_FOLDER, f"{task_id}_{output_filename}")
100
+ media_path = os.path.join(OUTPUT_FOLDER, f"{task_id}_media")
101
+
102
+ # Perform conversion
103
+ success, message = convert_docx_to_latex(
104
+ docx_path=task['file_path'],
105
+ latex_path=output_path,
106
+ generate_toc=options.get('generateToc', False),
107
+ extract_media_to_path=media_path if options.get('extractMedia', True) else None,
108
+ latex_template_path=None, # Could be added later for custom templates
109
+ overleaf_compatible=options.get('overleafCompatible', True),
110
+ preserve_styles=options.get('preserveStyles', True),
111
+ preserve_linebreaks=options.get('preserveLineBreaks', True)
112
+ )
113
+
114
+ if success:
115
+ task['status'] = 'completed'
116
+ task['output_path'] = output_path
117
+ task['media_path'] = media_path if os.path.exists(media_path) else None
118
+ task['conversion_message'] = message
119
+
120
+ return jsonify({
121
+ 'task_id': task_id,
122
+ 'status': 'completed',
123
+ 'message': message,
124
+ 'output_filename': output_filename,
125
+ 'has_media': os.path.exists(media_path)
126
+ })
127
+ else:
128
+ task['status'] = 'failed'
129
+ task['error_message'] = message
130
+
131
+ return jsonify({
132
+ 'task_id': task_id,
133
+ 'status': 'failed',
134
+ 'error': message
135
+ }), 500
136
+
137
+ except Exception as e:
138
+ # Update task status if possible
139
+ if 'task_id' in locals() and task_id in conversion_tasks:
140
+ conversion_tasks[task_id]['status'] = 'failed'
141
+ conversion_tasks[task_id]['error_message'] = str(e)
142
+
143
+ return jsonify({'error': f'Conversion failed: {str(e)}'}), 500
144
+
145
+ @app.route('/api/download/<task_id>', methods=['GET'])
146
+ def download_file(task_id):
147
+ """Download converted LaTeX file"""
148
+ try:
149
+ if task_id not in conversion_tasks:
150
+ return jsonify({'error': 'Invalid task ID'}), 404
151
+
152
+ task = conversion_tasks[task_id]
153
+
154
+ if task['status'] != 'completed':
155
+ return jsonify({'error': 'Conversion not completed'}), 400
156
+
157
+ if not os.path.exists(task['output_path']):
158
+ return jsonify({'error': 'Output file not found'}), 404
159
+
160
+ return send_file(
161
+ task['output_path'],
162
+ as_attachment=True,
163
+ download_name=task['output_filename'],
164
+ mimetype='text/plain'
165
+ )
166
+
167
+ except Exception as e:
168
+ return jsonify({'error': f'Download failed: {str(e)}'}), 500
169
+
170
+ @app.route('/api/download-media/<task_id>', methods=['GET'])
171
+ def download_media(task_id):
172
+ """Download media files as a ZIP archive"""
173
+ try:
174
+ if task_id not in conversion_tasks:
175
+ return jsonify({'error': 'Invalid task ID'}), 404
176
+
177
+ task = conversion_tasks[task_id]
178
+
179
+ if task['status'] != 'completed':
180
+ return jsonify({'error': 'Conversion not completed'}), 400
181
+
182
+ if not task.get('media_path') or not os.path.exists(task['media_path']):
183
+ return jsonify({'error': 'No media files found'}), 404
184
+
185
+ # Create a ZIP file of the media directory
186
+ zip_path = task['media_path'] + '.zip'
187
+ shutil.make_archive(task['media_path'], 'zip', task['media_path'])
188
+
189
+ return send_file(
190
+ zip_path,
191
+ as_attachment=True,
192
+ download_name=f"{task['output_filename'].replace('.tex', '')}_media.zip",
193
+ mimetype='application/zip'
194
+ )
195
+
196
+ except Exception as e:
197
+ return jsonify({'error': f'Media download failed: {str(e)}'}), 500
198
+
199
+ @app.route('/api/download-complete/<task_id>', methods=['GET'])
200
+ def download_complete_package(task_id):
201
+ """Download complete package (LaTeX + media) as a ZIP archive"""
202
+ try:
203
+ if task_id not in conversion_tasks:
204
+ return jsonify({'error': 'Invalid task ID'}), 404
205
+
206
+ task = conversion_tasks[task_id]
207
+
208
+ if task['status'] != 'completed':
209
+ return jsonify({'error': 'Conversion not completed'}), 400
210
+
211
+ if not os.path.exists(task['output_path']):
212
+ return jsonify({'error': 'Output file not found'}), 404
213
+
214
+ # Create a temporary directory for the complete package
215
+ import tempfile
216
+ base_name = task['output_filename'].replace('.tex', '')
217
+
218
+ with tempfile.TemporaryDirectory() as temp_dir:
219
+ package_dir = os.path.join(temp_dir, base_name)
220
+ os.makedirs(package_dir, exist_ok=True)
221
+
222
+ # Copy and fix LaTeX file for Overleaf compatibility
223
+ latex_dest = os.path.join(package_dir, task['output_filename'])
224
+
225
+ # Read the original LaTeX file
226
+ with open(task['output_path'], 'r', encoding='utf-8') as f:
227
+ latex_content = f.read()
228
+
229
+ # Fix image paths to use relative paths suitable for Overleaf
230
+ # Convert paths like: task_id_media/media/image.png -> media/image.png
231
+ import re
232
+
233
+ # Fix paths with task IDs
234
+ latex_content = re.sub(
235
+ r'\\includegraphics(\[[^\]]*\])?\{[^{}]*[a-f0-9\-]+_media[/\\]media[/\\]([^{}]+)\}',
236
+ r'\\includegraphics\1{media/\2}',
237
+ latex_content
238
+ )
239
+
240
+ # Fix any remaining absolute paths
241
+ latex_content = re.sub(
242
+ r'\\includegraphics(\[[^\]]*\])?\{[^{}]*[/\\]media[/\\]([^{}]+)\}',
243
+ r'\\includegraphics\1{media/\2}',
244
+ latex_content
245
+ )
246
+
247
+ # Write the fixed LaTeX file
248
+ with open(latex_dest, 'w', encoding='utf-8') as f:
249
+ f.write(latex_content)
250
+
251
+ # Copy media files if they exist
252
+ if task.get('media_path') and os.path.exists(task['media_path']):
253
+ media_dest = os.path.join(package_dir, 'media')
254
+
255
+ # Check if there's a nested media folder structure
256
+ inner_media = os.path.join(task['media_path'], 'media')
257
+ if os.path.exists(inner_media):
258
+ # Copy from the inner media folder to avoid media/media/ nesting
259
+ shutil.copytree(inner_media, media_dest)
260
+ else:
261
+ # Copy the media_path directly if no nesting
262
+ shutil.copytree(task['media_path'], media_dest)
263
+
264
+ # Create README file
265
+ readme_content = f"""# {base_name} - DOCX to LaTeX Conversion
266
+
267
+ ## Package Contents:
268
+
269
+ 1. **{task['output_filename']}** - Main LaTeX file
270
+ 2. **media/** - Images and media files (if any)
271
+
272
+ ## How to Use:
273
+
274
+ ### Compiling LaTeX:
275
+ ```bash
276
+ pdflatex {task['output_filename']}
277
+ ```
278
+
279
+ ### For Overleaf:
280
+ 1. Upload all files to a new Overleaf project
281
+ 2. Set main file: {task['output_filename']}
282
+ 3. Compile the project
283
+
284
+ ### Local Compilation:
285
+ ```bash
286
+ # Basic compilation
287
+ pdflatex {task['output_filename']}
288
+
289
+ # For bibliography and cross-references
290
+ pdflatex {task['output_filename']}
291
+ bibtex {task['output_filename'].replace('.tex', '')}
292
+ pdflatex {task['output_filename']}
293
+ pdflatex {task['output_filename']}
294
+ ```
295
+
296
+ ## Features:
297
+ - Enhanced formatting preservation
298
+ - Overleaf compatibility
299
+ - Automatic image path fixing
300
+ - Unicode character conversion
301
+ - Mathematical expression optimization
302
+
303
+ ## Generated by:
304
+ DOCX to LaTeX Web Converter
305
+ https://github.com/your-username/docx-to-latex
306
+ """
307
+
308
+ readme_path = os.path.join(package_dir, 'README.txt')
309
+ with open(readme_path, 'w', encoding='utf-8') as f:
310
+ f.write(readme_content)
311
+
312
+ # Create ZIP file
313
+ zip_path = os.path.join(temp_dir, f"{base_name}_complete.zip")
314
+ shutil.make_archive(zip_path.replace('.zip', ''), 'zip', package_dir)
315
+
316
+ return send_file(
317
+ zip_path,
318
+ as_attachment=True,
319
+ download_name=f"{base_name}_complete.zip",
320
+ mimetype='application/zip'
321
+ )
322
+
323
+ except Exception as e:
324
+ return jsonify({'error': f'Complete package download failed: {str(e)}'}), 500
325
+
326
+ @app.route('/api/status/<task_id>', methods=['GET'])
327
+ def get_task_status(task_id):
328
+ """Get conversion task status"""
329
+ try:
330
+ if task_id not in conversion_tasks:
331
+ return jsonify({'error': 'Invalid task ID'}), 404
332
+
333
+ task = conversion_tasks[task_id]
334
+
335
+ response_data = {
336
+ 'task_id': task_id,
337
+ 'status': task['status'],
338
+ 'original_filename': task['original_filename'],
339
+ 'output_filename': task.get('output_filename', ''),
340
+ }
341
+
342
+ if task['status'] == 'completed':
343
+ response_data['message'] = task.get('conversion_message', 'Conversion completed successfully')
344
+ response_data['has_media'] = task.get('media_path') and os.path.exists(task['media_path'])
345
+ elif task['status'] == 'failed':
346
+ response_data['error'] = task.get('error_message', 'Conversion failed')
347
+
348
+ return jsonify(response_data)
349
+
350
+ except Exception as e:
351
+ return jsonify({'error': f'Status check failed: {str(e)}'}), 500
352
+
353
+ @app.route('/api/cleanup/<task_id>', methods=['DELETE'])
354
+ def cleanup_task(task_id):
355
+ """Clean up task files"""
356
+ try:
357
+ if task_id not in conversion_tasks:
358
+ return jsonify({'error': 'Invalid task ID'}), 404
359
+
360
+ task = conversion_tasks[task_id]
361
+
362
+ # Remove uploaded file
363
+ if os.path.exists(task['file_path']):
364
+ os.remove(task['file_path'])
365
+
366
+ # Remove output file
367
+ if task.get('output_path') and os.path.exists(task['output_path']):
368
+ os.remove(task['output_path'])
369
+
370
+ # Remove media directory
371
+ if task.get('media_path') and os.path.exists(task['media_path']):
372
+ shutil.rmtree(task['media_path'])
373
+
374
+ # Remove media ZIP if it exists
375
+ media_zip = task.get('media_path', '') + '.zip'
376
+ if os.path.exists(media_zip):
377
+ os.remove(media_zip)
378
+
379
+ # Remove task from memory
380
+ del conversion_tasks[task_id]
381
+
382
+ return jsonify({'message': 'Task cleaned up successfully'})
383
+
384
+ except Exception as e:
385
+ return jsonify({'error': f'Cleanup failed: {str(e)}'}), 500
386
+
387
+ @app.route('/api/tasks', methods=['GET'])
388
+ def list_tasks():
389
+ """List all conversion tasks (for debugging)"""
390
+ try:
391
+ tasks_summary = {}
392
+ for task_id, task in conversion_tasks.items():
393
+ tasks_summary[task_id] = {
394
+ 'status': task['status'],
395
+ 'original_filename': task['original_filename'],
396
+ 'output_filename': task.get('output_filename', ''),
397
+ 'created_at': task.get('created_at', 0)
398
+ }
399
+
400
+ return jsonify(tasks_summary)
401
+
402
+ except Exception as e:
403
+ return jsonify({'error': f'Failed to list tasks: {str(e)}'}), 500
404
+
405
+ # Cleanup old files on startup
406
+ def cleanup_old_files():
407
+ """Remove old temporary files"""
408
+ try:
409
+ import time
410
+ current_time = time.time()
411
+ cutoff_time = current_time - (24 * 60 * 60) # 24 hours ago
412
+
413
+ for folder in [UPLOAD_FOLDER, OUTPUT_FOLDER]:
414
+ if os.path.exists(folder):
415
+ for filename in os.listdir(folder):
416
+ file_path = os.path.join(folder, filename)
417
+ if os.path.isfile(file_path):
418
+ file_time = os.path.getctime(file_path)
419
+ if file_time < cutoff_time:
420
+ os.remove(file_path)
421
+ elif os.path.isdir(file_path):
422
+ dir_time = os.path.getctime(file_path)
423
+ if dir_time < cutoff_time:
424
+ shutil.rmtree(file_path)
425
+ except Exception as e:
426
+ print(f"Warning: Failed to cleanup old files: {e}")
427
+
428
+ if __name__ == '__main__':
429
+ # Cleanup old files on startup
430
+ cleanup_old_files()
431
+
432
+ # Run the Flask app
433
+ print("Starting DOCX to LaTeX API server...")
434
+ print("API endpoints:")
435
+ print(" POST /api/upload - Upload DOCX file")
436
+ print(" POST /api/convert - Convert to LaTeX")
437
+ print(" GET /api/download/<task_id> - Download LaTeX file")
438
+ print(" GET /api/download-media/<task_id> - Download media files")
439
+ print(" GET /api/status/<task_id> - Get conversion status")
440
+ print(" DELETE /api/cleanup/<task_id> - Cleanup task files")
441
+ print(" GET /api/health - Health check")
442
+
443
+ app.run(debug=True, host='0.0.0.0', port=5000)
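End to end, the whole upload/convert/download pipeline can be driven in-process with Flask's test client instead of a live server. A rough sketch, assuming Pandoc is installed locally and a sample.docx (placeholder name) sits next to the code:

```python
import io

from web_api import app

with app.test_client() as client:
    # 1. Upload a DOCX file.
    with open("sample.docx", "rb") as f:
        upload = client.post(
            "/api/upload",
            data={"file": (io.BytesIO(f.read()), "sample.docx")},
            content_type="multipart/form-data",
        )
    task_id = upload.get_json()["task_id"]

    # 2. Convert with a couple of the documented options.
    convert = client.post(
        "/api/convert",
        json={"task_id": task_id, "options": {"generateToc": False, "extractMedia": True}},
    )
    print(convert.get_json())

    # 3. Download the complete ZIP package (LaTeX + media + README).
    package = client.get(f"/api/download-complete/{task_id}")
    with open("converted_package.zip", "wb") as out:
        out.write(package.data)
```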