Stephen Zweibel committed
Commit af140e4 · 1 Parent(s): bb869fd

Update app for Hugging Face

Files changed (3)
  1. README.md +10 -59
  2. rule_extractor.py +39 -12
  3. startup_formatreview.sh +2 -2
README.md CHANGED
@@ -1,60 +1,11 @@
- # FormatReview
-
- FormatReview is a tool that helps authors ensure their manuscripts comply with journal formatting guidelines. It automatically extracts formatting rules from journal websites and analyzes documents against these rules.
-
- ## Features
-
- - **Dynamic Rule Extraction**: Automatically extracts formatting guidelines from any journal's "Instructions for Authors" page
- - **Manual Rule Input**: Allows direct pasting of formatting rules for journals where automatic extraction is difficult
- - **Flexible Rule Sources**: Supports using URL-extracted rules, manually pasted rules, or a combination of both
- - **Document Analysis**: Analyzes PDF and DOCX documents against the extracted rules
- - **Comprehensive Reports**: Provides detailed compliance reports with specific issues and recommendations
- - **User-Friendly Interface**: Simple web interface with separate tabs for uploading documents, viewing extracted rules, and reviewing analysis results
-
- ## How It Works
-
- 1. **Rule Extraction**: The application uses crawl4ai to extract formatting rules from journal websites. It employs a Large Language Model (LLM) to understand and structure the formatting requirements.
-
- 2. **Document Analysis**: The uploaded document is analyzed against the extracted rules using an LLM. The analysis checks for compliance with margins, font, line spacing, citations, section structure, and other formatting requirements.
-
- 3. **Report Generation**: A detailed compliance report is generated, highlighting any issues found and providing recommendations for fixing them.
-
- ## Technical Details
-
- - **Backend**: Python with asyncio for handling asynchronous operations
- - **Frontend**: Streamlit for the web interface
- - **LLM Integration**: OpenRouter API for accessing advanced language models
- - **Web Crawling**: crawl4ai for extracting content from journal websites
- - **Document Processing**: Support for PDF and DOCX formats
-
- ## Usage
-
- 1. Upload your manuscript (PDF or DOCX)
- 2. Provide formatting rules in one of two ways (or both):
-    - Enter the URL to the journal's "Instructions for Authors" page
-    - Paste formatting rules directly into the text area
- 3. Click "Analyze Document"
- 4. View the formatting rules in the "Formatting Rules" tab
- 5. Review the analysis results in the "Analysis Results" tab
-
- ## Requirements
-
- - Python 3.9+
- - OpenRouter API key (set in .env file)
- - Required Python packages (listed in requirements.txt)
-
- ## Installation
-
- 1. Clone the repository
- 2. Create a virtual environment: `python -m venv .venv`
- 3. Activate the virtual environment: `source .venv/bin/activate`
- 4. Install dependencies: `pip install -r requirements.txt`
- 5. Create a `.env` file with your OpenRouter API key:
-    ```
-    OPENROUTER_API_KEY=your_api_key_here
-    ```
- 6. Run the application: `streamlit run app.py`
-
- ## License
-
- MIT

+ ---
+ title: FormatReview
+ emoji: 🚀
+ colorFrom: blue
+ colorTo: green
+ sdk: streamlit
+ sdk_version: 1.29.0
+ python_version: 3.9
+ app_file: app.py
+ ---
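The YAML block that replaces the old README is the standard Hugging Face Spaces configuration header: `sdk` and `sdk_version` tell Spaces to launch a Streamlit 1.29.0 runtime, `python_version` pins the interpreter, and `app_file` names the entry point. The removed installation notes about a local `.env` file drop out because, on Spaces, a secret such as `OPENROUTER_API_KEY` is normally added in the Space settings and read from the environment. A minimal, illustrative sketch of that lookup, independent of however `config.settings` actually loads the key:

```python
import os

# Illustrative only: Hugging Face Spaces exposes secrets defined in the
# Space settings (e.g. OPENROUTER_API_KEY) to the app as environment
# variables, so no local .env file is required at runtime.
api_key = os.environ.get("OPENROUTER_API_KEY")
if not api_key:
    raise RuntimeError("OPENROUTER_API_KEY is not set; add it as a Space secret.")
```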
rule_extractor.py CHANGED
@@ -3,6 +3,7 @@ import asyncio
  import nest_asyncio
  import os
  import json
+ import httpx
  from config import settings
  from pydantic import BaseModel, Field

@@ -103,10 +104,39 @@ def get_rules_from_url(url: str) -> str:

      # Initialize the crawler and run
      async with AsyncWebCrawler() as crawler:
-         result = await crawler.arun(
-             url=url,
-             config=run_config
-         )
+         try:
+             result = await crawler.arun(
+                 url=url,
+                 config=run_config
+             )
+             logger.info(f"Crawler result for {url}: {result}")
+
+             # Handle robots.txt blocking
+             if not result.success and "robots.txt" in str(result.error_message):
+                 logger.warning(f"Crawl blocked by robots.txt for {url}. Falling back to direct download.")
+                 try:
+                     with httpx.Client() as client:
+                         response = client.get(url, follow_redirects=True)
+                         response.raise_for_status()
+
+                     raw_html = response.text
+                     logger.info(f"Successfully downloaded HTML content for {url}.")
+
+                     # Re-run crawl4ai with raw HTML
+                     raw_html_url = f"raw:{raw_html}"
+                     result = await crawler.arun(url=raw_html_url, config=run_config)
+                     logger.info(f"Crawler result for raw HTML: {result}")
+
+                 except httpx.HTTPStatusError as e:
+                     logger.error(f"HTTP error while fetching {url}: {e}", exc_info=True)
+                     return "Failed to download the page content after being blocked by robots.txt."
+                 except Exception as e:
+                     logger.error(f"An error occurred during fallback processing for {url}: {e}", exc_info=True)
+                     return "An error occurred during the fallback extraction process."
+
+         except Exception as e:
+             logger.error(f"An error occurred during crawling {url}: {e}", exc_info=True)
+             return "An error occurred while trying to extract formatting rules."

          if result.success and result.extracted_content:
              # Format the extracted data into a readable string
@@ -127,14 +157,11 @@ def get_rules_from_url(url: str) -> str:
              return formatted_rules
          elif result.success and result.markdown:
              # Fallback to markdown if structured extraction fails
+             logger.info(f"Extraction failed, falling back to markdown for {url}")
              return result.markdown
          else:
-             return "Could not extract formatting rules from the provided URL."
+             logger.warning(f"Failed to extract rules or markdown for {url}. Crawler success: {result.success}")
+             return "Could not extract formatting rules from the provided URL. The crawler did not return any content."

-     # Create a new event loop and run the async function
-     loop = asyncio.new_event_loop()
-     asyncio.set_event_loop(loop)
-     try:
-         return loop.run_until_complete(_extract_rules_async(url))
-     finally:
-         loop.close()
+     # Run the async function using the patched event loop
+     return asyncio.run(_extract_rules_async(url))
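Condensed, the new control flow in `get_rules_from_url` is a two-stage fetch: attempt the normal crawl, and if the crawler reports a robots.txt block, download the HTML directly and re-run extraction against a `raw:`-prefixed URL. A minimal sketch of that shape, where `run_config` is a placeholder for the module's own crawler configuration and httpx's async client is used in place of the synchronous one in the diff:

```python
import httpx
from crawl4ai import AsyncWebCrawler

async def crawl_with_fallback(url: str, run_config=None):
    """Try a normal crawl; if robots.txt blocks it, fetch the HTML directly."""
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=run_config)

        if not result.success and "robots.txt" in str(result.error_message):
            # Download the page ourselves, then re-run extraction on the raw
            # HTML via crawl4ai's "raw:" URL scheme, as in the diff above.
            async with httpx.AsyncClient(follow_redirects=True) as client:
                response = await client.get(url)
                response.raise_for_status()
                raw_html = response.text
            result = await crawler.arun(url=f"raw:{raw_html}", config=run_config)

        return result
```

Either path leaves `result` in the same shape, so the downstream `extracted_content` and `markdown` handling is unchanged.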
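The final hunk also swaps the hand-rolled `new_event_loop()` / `run_until_complete()` / `close()` sequence for a single `asyncio.run(...)`. That can work in an environment like Streamlit, where an event loop may already be running, only because `nest_asyncio` patches asyncio to tolerate nesting; the diff imports `nest_asyncio` but does not show where `apply()` is called, so the sketch below assumes it happens at import time:

```python
import asyncio
import nest_asyncio

# Assumption: the real module calls this once at import time. It patches
# asyncio.run()/run_until_complete() so they can be nested inside an
# already-running event loop (e.g. under Streamlit or Jupyter).
nest_asyncio.apply()

async def _extract_rules_async(url: str) -> str:
    # Stand-in for the real crawl-and-extract coroutine.
    return f"rules extracted from {url}"

def get_rules_from_url(url: str) -> str:
    # One call replaces the old new_event_loop()/run_until_complete()/close() dance.
    return asyncio.run(_extract_rules_async(url))

print(get_rules_from_url("https://example.com/instructions-for-authors"))
```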
startup_formatreview.sh CHANGED
@@ -71,8 +71,8 @@ if ! command -v tailscale &> /dev/null; then
  else
      # Expose the service via Tailscale Serve
      echo "Exposing Streamlit app via Tailscale Serve on port $STREAMLIT_PORT..."
-     echo "Setting up Funnel on port 443..."
-     tailscale funnel --https=443 --bg localhost:$STREAMLIT_PORT
+     echo "Setting up Funnel on port 8443..."
+     tailscale funnel --https=8443 --bg localhost:$STREAMLIT_PORT

      # Get the Tailscale hostname
      HOSTNAME=$(tailscale status --json | jq -r '.Self.DNSName')
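For context on the port change: Tailscale Funnel only forwards a fixed set of ports (443, 8443, and 10000), so 8443 stays within the allowed set; the move off 443 presumably avoids a conflict with whatever already binds 443 on the host.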