Stephen Zweibel committed
Commit af140e4 · 1 Parent(s): bb869fd

Update app for Hugging Face

Files changed (3)
  1. README.md +10 -59
  2. rule_extractor.py +39 -12
  3. startup_formatreview.sh +2 -2
README.md CHANGED
@@ -1,60 +1,11 @@
- # FormatReview
-
- FormatReview is a tool that helps authors ensure their manuscripts comply with journal formatting guidelines. It automatically extracts formatting rules from journal websites and analyzes documents against these rules.
-
- ## Features
-
- - **Dynamic Rule Extraction**: Automatically extracts formatting guidelines from any journal's "Instructions for Authors" page
- - **Manual Rule Input**: Allows direct pasting of formatting rules for journals where automatic extraction is difficult
- - **Flexible Rule Sources**: Supports using URL-extracted rules, manually pasted rules, or a combination of both
- - **Document Analysis**: Analyzes PDF and DOCX documents against the extracted rules
- - **Comprehensive Reports**: Provides detailed compliance reports with specific issues and recommendations
- - **User-Friendly Interface**: Simple web interface with separate tabs for uploading documents, viewing extracted rules, and reviewing analysis results
-
- ## How It Works
-
- 1. **Rule Extraction**: The application uses crawl4ai to extract formatting rules from journal websites. It employs a Large Language Model (LLM) to understand and structure the formatting requirements.
-
- 2. **Document Analysis**: The uploaded document is analyzed against the extracted rules using an LLM. The analysis checks for compliance with margins, font, line spacing, citations, section structure, and other formatting requirements.
-
- 3. **Report Generation**: A detailed compliance report is generated, highlighting any issues found and providing recommendations for fixing them.
-
- ## Technical Details
-
- - **Backend**: Python with asyncio for handling asynchronous operations
- - **Frontend**: Streamlit for the web interface
- - **LLM Integration**: OpenRouter API for accessing advanced language models
- - **Web Crawling**: crawl4ai for extracting content from journal websites
- - **Document Processing**: Support for PDF and DOCX formats
-
- ## Usage
-
- 1. Upload your manuscript (PDF or DOCX)
- 2. Provide formatting rules in one of two ways (or both):
-    - Enter the URL to the journal's "Instructions for Authors" page
-    - Paste formatting rules directly into the text area
- 3. Click "Analyze Document"
- 4. View the formatting rules in the "Formatting Rules" tab
- 5. Review the analysis results in the "Analysis Results" tab
-
- ## Requirements
-
- - Python 3.9+
- - OpenRouter API key (set in .env file)
- - Required Python packages (listed in requirements.txt)
-
- ## Installation
-
- 1. Clone the repository
- 2. Create a virtual environment: `python -m venv .venv`
- 3. Activate the virtual environment: `source .venv/bin/activate`
- 4. Install dependencies: `pip install -r requirements.txt`
- 5. Create a `.env` file with your OpenRouter API key:
-    ```
-    OPENROUTER_API_KEY=your_api_key_here
-    ```
- 6. Run the application: `streamlit run app.py`
-
- ## License
-
- MIT

+ ---
+ title: FormatReview
+ emoji: 🚀
+ colorFrom: blue
+ colorTo: green
+ sdk: streamlit
+ sdk_version: 1.29.0
+ python_version: 3.9
+ app_file: app.py
+ ---
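The YAML block that replaces the old README is the standard Hugging Face Spaces configuration header: `sdk` and `sdk_version` tell Spaces to launch a Streamlit 1.29.0 runtime, `python_version` pins the interpreter, and `app_file` names the entry point. The removed installation notes about a local `.env` file drop out because, on Spaces, a secret such as `OPENROUTER_API_KEY` is normally added in the Space settings and read from the environment. A minimal, illustrative sketch of that lookup, independent of however `config.settings` actually loads the key:

```python
import os

# Illustrative only: Hugging Face Spaces exposes secrets defined in the
# Space settings (e.g. OPENROUTER_API_KEY) to the app as environment
# variables, so no local .env file is required at runtime.
api_key = os.environ.get("OPENROUTER_API_KEY")
if not api_key:
    raise RuntimeError("OPENROUTER_API_KEY is not set; add it as a Space secret.")
```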
rule_extractor.py CHANGED
@@ -3,6 +3,7 @@ import asyncio
  import nest_asyncio
  import os
  import json
+ import httpx
  from config import settings
  from pydantic import BaseModel, Field

@@ -103,10 +104,39 @@ def get_rules_from_url(url: str) -> str:

      # Initialize the crawler and run
      async with AsyncWebCrawler() as crawler:
-         result = await crawler.arun(
-             url=url,
-             config=run_config
-         )
+         try:
+             result = await crawler.arun(
+                 url=url,
+                 config=run_config
+             )
+             logger.info(f"Crawler result for {url}: {result}")
+
+             # Handle robots.txt blocking
+             if not result.success and "robots.txt" in str(result.error_message):
+                 logger.warning(f"Crawl blocked by robots.txt for {url}. Falling back to direct download.")
+                 try:
+                     with httpx.Client() as client:
+                         response = client.get(url, follow_redirects=True)
+                         response.raise_for_status()
+
+                     raw_html = response.text
+                     logger.info(f"Successfully downloaded HTML content for {url}.")
+
+                     # Re-run crawl4ai with raw HTML
+                     raw_html_url = f"raw:{raw_html}"
+                     result = await crawler.arun(url=raw_html_url, config=run_config)
+                     logger.info(f"Crawler result for raw HTML: {result}")
+
+                 except httpx.HTTPStatusError as e:
+                     logger.error(f"HTTP error while fetching {url}: {e}", exc_info=True)
+                     return "Failed to download the page content after being blocked by robots.txt."
+                 except Exception as e:
+                     logger.error(f"An error occurred during fallback processing for {url}: {e}", exc_info=True)
+                     return "An error occurred during the fallback extraction process."
+
+         except Exception as e:
+             logger.error(f"An error occurred during crawling {url}: {e}", exc_info=True)
+             return "An error occurred while trying to extract formatting rules."

          if result.success and result.extracted_content:
              # Format the extracted data into a readable string
@@ -127,14 +157,11 @@ def get_rules_from_url(url: str) -> str:
              return formatted_rules
          elif result.success and result.markdown:
              # Fallback to markdown if structured extraction fails
+             logger.info(f"Extraction failed, falling back to markdown for {url}")
              return result.markdown
          else:
-             return "Could not extract formatting rules from the provided URL."
+             logger.warning(f"Failed to extract rules or markdown for {url}. Crawler success: {result.success}")
+             return "Could not extract formatting rules from the provided URL. The crawler did not return any content."

-     # Create a new event loop and run the async function
-     loop = asyncio.new_event_loop()
-     asyncio.set_event_loop(loop)
-     try:
-         return loop.run_until_complete(_extract_rules_async(url))
-     finally:
-         loop.close()
+     # Run the async function using the patched event loop
+     return asyncio.run(_extract_rules_async(url))
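Condensed, the new control flow in `get_rules_from_url` is a two-stage fetch: attempt the normal crawl, and if the crawler reports a robots.txt block, download the HTML directly and re-run extraction against a `raw:`-prefixed URL. A minimal sketch of that shape, where `run_config` is a placeholder for the module's own crawler configuration and httpx's async client is used in place of the synchronous one in the diff:

```python
import httpx
from crawl4ai import AsyncWebCrawler

async def crawl_with_fallback(url: str, run_config=None):
    """Try a normal crawl; if robots.txt blocks it, fetch the HTML directly."""
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=run_config)

        if not result.success and "robots.txt" in str(result.error_message):
            # Download the page ourselves, then re-run extraction on the raw
            # HTML via crawl4ai's "raw:" URL scheme, as in the diff above.
            async with httpx.AsyncClient(follow_redirects=True) as client:
                response = await client.get(url)
                response.raise_for_status()
                raw_html = response.text
            result = await crawler.arun(url=f"raw:{raw_html}", config=run_config)

        return result
```

Either path leaves `result` in the same shape, so the downstream `extracted_content` and `markdown` handling is unchanged.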
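The final hunk also swaps the hand-rolled `new_event_loop()` / `run_until_complete()` / `close()` sequence for a single `asyncio.run(...)`. That can work in an environment like Streamlit, where an event loop may already be running, only because `nest_asyncio` patches asyncio to tolerate nesting; the diff imports `nest_asyncio` but does not show where `apply()` is called, so the sketch below assumes it happens at import time:

```python
import asyncio
import nest_asyncio

# Assumption: the real module calls this once at import time. It patches
# asyncio.run()/run_until_complete() so they can be nested inside an
# already-running event loop (e.g. under Streamlit or Jupyter).
nest_asyncio.apply()

async def _extract_rules_async(url: str) -> str:
    # Stand-in for the real crawl-and-extract coroutine.
    return f"rules extracted from {url}"

def get_rules_from_url(url: str) -> str:
    # One call replaces the old new_event_loop()/run_until_complete()/close() dance.
    return asyncio.run(_extract_rules_async(url))

print(get_rules_from_url("https://example.com/instructions-for-authors"))
```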
startup_formatreview.sh CHANGED
@@ -71,8 +71,8 @@ if ! command -v tailscale &> /dev/null; then
  else
      # Expose the service via Tailscale Serve
      echo "Exposing Streamlit app via Tailscale Serve on port $STREAMLIT_PORT..."
-     echo "Setting up Funnel on port 443..."
-     tailscale funnel --https=443 --bg localhost:$STREAMLIT_PORT
+     echo "Setting up Funnel on port 8443..."
+     tailscale funnel --https=8443 --bg localhost:$STREAMLIT_PORT

      # Get the Tailscale hostname
      HOSTNAME=$(tailscale status --json | jq -r '.Self.DNSName')
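For context on the port change: Tailscale Funnel only forwards a fixed set of ports (443, 8443, and 10000), so 8443 stays within the allowed set; the move off 443 presumably avoids a conflict with whatever already binds 443 on the host.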