Stephen Zweibel committed · bb869fd
Add initial implementation of FormatReview tool with core features and configurations
- Created .gitignore to exclude environment and runtime files
- Added Streamlit configuration in .streamlit/config.toml
- Developed main application logic in app.py for document analysis and formatting rule extraction
- Implemented logging and settings management in config.py
- Integrated document analysis functionality in doc_analyzer.py
- Established rule extraction logic in rule_extractor.py
- Included requirements.txt for project dependencies
- Added startup script for running the Streamlit app
- Created test script for validating crawl4ai functionality
- .gitignore +43 -0
- .streamlit/config.toml +11 -0
- README.md +60 -0
- app.py +237 -0
- config.py +27 -0
- doc_analyzer.py +170 -0
- logic.py +52 -0
- requirements.txt +18 -0
- rule_extractor.py +140 -0
- startup_formatreview.sh +93 -0
- test_crawl.py +132 -0
.gitignore
ADDED
@@ -0,0 +1,43 @@
# Environment variables
.env

# Python virtual environment
.venv/
venv/
ENV/

# Python cache files
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# Logs and runtime files
*.log
*.pid

# IDE files
.idea/
.vscode/
*.swp
*.swo
.DS_Store

# Streamlit
.streamlit/secrets.toml
.streamlit/config.toml
ADDED
@@ -0,0 +1,11 @@
[theme]
primaryColor = "#1E88E5"
backgroundColor = "#FFFFFF"
secondaryBackgroundColor = "#F0F2F6"
textColor = "#262730"
font = "sans serif"

[server]
enableCORS = false
enableXsrfProtection = true
maxUploadSize = 200
README.md
ADDED
@@ -0,0 +1,60 @@
# FormatReview

FormatReview is a tool that helps authors ensure their manuscripts comply with journal formatting guidelines. It automatically extracts formatting rules from journal websites and analyzes documents against these rules.

## Features

- **Dynamic Rule Extraction**: Automatically extracts formatting guidelines from any journal's "Instructions for Authors" page
- **Manual Rule Input**: Allows direct pasting of formatting rules for journals where automatic extraction is difficult
- **Flexible Rule Sources**: Supports using URL-extracted rules, manually pasted rules, or a combination of both
- **Document Analysis**: Analyzes PDF and DOCX documents against the extracted rules
- **Comprehensive Reports**: Provides detailed compliance reports with specific issues and recommendations
- **User-Friendly Interface**: Simple web interface with separate tabs for uploading documents, viewing extracted rules, and reviewing analysis results

## How It Works

1. **Rule Extraction**: The application uses crawl4ai to extract formatting rules from journal websites. It employs a Large Language Model (LLM) to understand and structure the formatting requirements.

2. **Document Analysis**: The uploaded document is analyzed against the extracted rules using an LLM. The analysis checks for compliance with margins, font, line spacing, citations, section structure, and other formatting requirements.

3. **Report Generation**: A detailed compliance report is generated, highlighting any issues found and providing recommendations for fixing them.

## Technical Details

- **Backend**: Python with asyncio for handling asynchronous operations
- **Frontend**: Streamlit for the web interface
- **LLM Integration**: OpenRouter API for accessing advanced language models
- **Web Crawling**: crawl4ai for extracting content from journal websites
- **Document Processing**: Support for PDF and DOCX formats

## Usage

1. Upload your manuscript (PDF or DOCX)
2. Provide formatting rules in one of two ways (or both):
   - Enter the URL to the journal's "Instructions for Authors" page
   - Paste formatting rules directly into the text area
3. Click "Analyze Document"
4. View the formatting rules in the "Formatting Rules" tab
5. Review the analysis results in the "Analysis Results" tab

## Requirements

- Python 3.9+
- OpenRouter API key (set in .env file)
- Required Python packages (listed in requirements.txt)

## Installation

1. Clone the repository
2. Create a virtual environment: `python -m venv .venv`
3. Activate the virtual environment: `source .venv/bin/activate`
4. Install dependencies: `pip install -r requirements.txt`
5. Create a `.env` file with your OpenRouter API key:
   ```
   OPENROUTER_API_KEY=your_api_key_here
   ```
6. Run the application: `streamlit run app.py`

## License

MIT
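Reviewer note: for exercising the pipeline outside Streamlit, logic.py (added later in this commit) wraps the same two steps. A minimal sketch of driving it from a script, assuming the dependencies are installed and an OpenRouter key is set in `.env`; the `UploadedFileStub` wrapper and `manuscript.pdf` path are hypothetical, standing in for Streamlit's uploaded-file object (which exposes `.name` and `.getvalue()`), and the URL is the code4lib guidelines page used in test_crawl.py:

```python
from logic import extract_rules, analyze_uploaded_document

class UploadedFileStub:
    """Hypothetical stand-in for Streamlit's UploadedFile (.name + .getvalue())."""
    def __init__(self, path: str):
        self.name = path
        with open(path, "rb") as fh:
            self._data = fh.read()

    def getvalue(self) -> bytes:
        return self._data

# Extract rules from a journal page, then analyze a local manuscript against them.
rules = extract_rules("https://journal.code4lib.org/article-guidelines")
report = analyze_uploaded_document(UploadedFileStub("manuscript.pdf"), rules)
print(report.get("summary", report))
```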
app.py
ADDED
@@ -0,0 +1,237 @@
import streamlit as st
import json
from rule_extractor import get_rules_from_url, format_rules_for_display
from doc_analyzer import analyze_document

def combine_rules(url_rules, pasted_rules):
    """Combine URL-extracted rules and manually pasted rules"""
    combined_rules = ""

    # Check if URL rules are in JSON format and convert if needed
    if url_rules and (url_rules.strip().startswith('[') or url_rules.strip().startswith('{')):
        try:
            # Try to parse as JSON
            rules_data = json.loads(url_rules)
            if isinstance(rules_data, list) and len(rules_data) > 0:
                rules_data = rules_data[0]

            # Format the rules
            url_rules = format_rules_for_display(rules_data)
        except Exception as e:
            # If parsing fails, use as is
            pass

    # Add URL-extracted rules if available
    if url_rules:
        combined_rules += url_rules

    # Add pasted rules if available
    if pasted_rules:
        if url_rules:  # If we already have URL rules, add a separator
            combined_rules += "\n\n## Additional Manually Pasted Rules\n\n" + pasted_rules
        else:  # If no URL rules, just use pasted rules
            combined_rules = "# Manually Pasted Rules\n\n" + pasted_rules

    return combined_rules

st.set_page_config(
    page_title="FormatReview",
    page_icon="🔎",
    layout="wide",
)

st.title("FormatReview")
st.markdown("Analyze your manuscript against any journal's formatting guidelines.")

# Initialize session state for storing rules
if "rules" not in st.session_state:
    st.session_state.rules = None
if "results" not in st.session_state:
    st.session_state.results = None
if "url_rules" not in st.session_state:
    st.session_state.url_rules = None
if "pasted_rules" not in st.session_state:
    st.session_state.pasted_rules = None

# Create tabs
tab1, tab2, tab3 = st.tabs(["Document Upload", "Formatting Rules", "Analysis Results"])

with tab1:
    # --- UI Components ---
    uploaded_file = st.file_uploader("Upload your manuscript (PDF or DOCX)", type=["pdf", "docx"])

    # Rules input section
    st.subheader("Formatting Rules")
    st.markdown("You can provide formatting rules by URL, paste them directly, or both.")

    # URL input
    journal_url = st.text_input("Enter the URL to the journal's 'Instructions for Authors' page (optional if pasting rules)")

    # Pasted rules input
    pasted_rules = st.text_area(
        "Or paste formatting rules directly (optional if providing URL)",
        height=200,
        placeholder="Paste journal formatting guidelines here..."
    )

    if st.button("Analyze Document"):
        if uploaded_file is None:
            st.error("Please upload your manuscript.")
        elif not journal_url and not pasted_rules:
            st.error("Please either enter the journal's URL or paste formatting rules.")
        else:
            # Initialize combined rules
            combined_rules = ""

            # Extract rules from URL if provided
            if journal_url:
                with st.spinner("Extracting rules from URL..."):
                    url_rules = get_rules_from_url(journal_url)
                    st.session_state.url_rules = url_rules
                    combined_rules += url_rules

            # Add pasted rules if provided
            if pasted_rules:
                st.session_state.pasted_rules = pasted_rules
                if journal_url:  # If we already have URL rules, combine them
                    # Make sure URL rules are formatted before combining
                    if st.session_state.url_rules and (
                        st.session_state.url_rules.strip().startswith('[') or
                        st.session_state.url_rules.strip().startswith('{')
                    ):
                        try:
                            # Try to parse as JSON
                            rules_data = json.loads(st.session_state.url_rules)
                            if isinstance(rules_data, list) and len(rules_data) > 0:
                                rules_data = rules_data[0]

                            # Format and update the URL rules
                            formatted_url_rules = format_rules_for_display(rules_data)
                            st.session_state.url_rules = formatted_url_rules
                        except Exception as e:
                            # If parsing fails, use as is
                            pass

                    combined_rules = combine_rules(st.session_state.url_rules, pasted_rules)
                else:  # If no URL rules, just use pasted rules
                    combined_rules = "# Manually Pasted Rules\n\n" + pasted_rules

            # Store the combined rules
            st.session_state.rules = combined_rules

            # Analyze the document
            with st.spinner("Analyzing document..."):
                st.session_state.results = analyze_document(uploaded_file, st.session_state.rules)

            st.success("Analysis complete! View the results in the 'Analysis Results' tab.")

with tab2:
    st.header("Formatting Rules")

    if st.session_state.url_rules or st.session_state.pasted_rules:
        # Display URL-extracted rules if available
        if st.session_state.url_rules:
            st.subheader("Rules Extracted from URL")

            # Check if the rules look like JSON
            if isinstance(st.session_state.url_rules, str) and (
                st.session_state.url_rules.strip().startswith('[') or
                st.session_state.url_rules.strip().startswith('{')
            ):
                try:
                    # Try to parse as JSON
                    rules_data = json.loads(st.session_state.url_rules)
                    if isinstance(rules_data, list) and len(rules_data) > 0:
                        rules_data = rules_data[0]

                    # Format and display the rules
                    formatted_rules = format_rules_for_display(rules_data)
                    st.markdown(formatted_rules)
                except Exception as e:
                    # If parsing fails, just display as is
                    st.markdown(st.session_state.url_rules)
            else:
                # If not JSON, just display as is
                st.markdown(st.session_state.url_rules)

        # Display pasted rules if available
        if st.session_state.pasted_rules:
            st.subheader("Manually Pasted Rules")
            st.text_area("", value=st.session_state.pasted_rules, height=150, disabled=True)

        # Display combined rules used for analysis
        if st.session_state.rules and (st.session_state.url_rules and st.session_state.pasted_rules):
            st.subheader("Combined Rules (Used for Analysis)")

            # The combined rules should already be formatted, but check just in case
            if isinstance(st.session_state.rules, str) and (
                st.session_state.rules.strip().startswith('[') or
                st.session_state.rules.strip().startswith('{')
            ):
                try:
                    # Try to parse as JSON
                    rules_data = json.loads(st.session_state.rules)
                    if isinstance(rules_data, list) and len(rules_data) > 0:
                        rules_data = rules_data[0]

                    # Format and display the rules
                    formatted_rules = format_rules_for_display(rules_data)
                    st.markdown(formatted_rules)
                except Exception as e:
                    # If parsing fails, just display as is
                    st.markdown(st.session_state.rules)
            else:
                # If not JSON, just display as is
                st.markdown(st.session_state.rules)
    else:
        st.info("Provide formatting rules via URL or direct input to view them here.")

with tab3:
    st.header("Analysis Results")
    if st.session_state.results:
        results = st.session_state.results

        if "error" in results:
            st.error(results["error"])
        else:
            # Display summary
            st.subheader("Summary")
            summary = results.get("summary", {})
            st.write(f"**Overall Assessment**: {summary.get('overall_assessment', 'N/A')}")
            st.write(f"**Total Issues**: {summary.get('total_issues', 'N/A')}")
            st.write(f"**Critical Issues**: {summary.get('critical_issues', 'N/A')}")
            st.write(f"**Warning Issues**: {summary.get('warning_issues', 'N/A')}")

            # Display recommendations
            st.subheader("Recommendations")
            recommendations = results.get("recommendations", [])
            if recommendations:
                for rec in recommendations:
                    st.write(f"- {rec}")
            else:
                st.write("No recommendations.")

            # Display detailed report
            st.subheader("Detailed Report")
            issues = results.get("issues", [])
            if issues:
                for issue in issues:
                    severity = issue.get('severity', 'N/A').lower()
                    message = f"**{issue.get('severity', 'N/A').upper()}**: {issue.get('message', 'N/A')}"

                    if severity == 'critical':
                        st.error(message)
                    elif severity == 'warning':
                        st.warning(message)
                    elif severity == 'info':
                        st.info(message)
                    else:
                        st.success(message)

                    st.write(f"**Location**: {issue.get('location', 'N/A')}")
                    st.write(f"**Suggestion**: {issue.get('suggestion', 'N/A')}")
                    st.divider()
            else:
                st.success("No issues found.")
    else:
        st.info("Analyze a document to view results here.")
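Reviewer note: the "Analysis Results" tab above consumes `st.session_state.results` as a plain dict. The key names mirror the dict returned by `analyze_document` in doc_analyzer.py on success; the values below are purely illustrative placeholders, not actual output:

```python
# Illustrative only: key names come from doc_analyzer.analyze_document,
# the values are made up for the sake of the example.
results = {
    "raw_xml": "<compliance_report>...</compliance_report>",
    "summary": {
        "overall_assessment": "Largely compliant with minor issues.",
        "total_issues": "3",
        "critical_issues": "1",
        "warning_issues": "2",
    },
    "recommendations": ["Use the citation style required by the journal."],
    "issues": [
        {
            "severity": "critical",
            "message": "Citations do not follow the required style.",
            "location": "p. 4",
            "suggestion": "Reformat in-text citations per the journal's guidelines.",
        },
    ],
}
```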
config.py
ADDED
@@ -0,0 +1,27 @@
from dotenv import load_dotenv
load_dotenv()

import os
import logging
from pydantic import BaseModel
from typing import Optional

# Logging configuration
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    handlers=[
        logging.FileHandler("formatreview.log"),
        logging.StreamHandler()
    ]
)

class Settings(BaseModel):
    """Application settings"""
    llm_provider: str = os.getenv("LLM_PROVIDER", "openrouter").lower()
    llm_model_name: str = os.getenv("LLM_MODEL_NAME", "google/gemini-2.5-pro")
    llm_base_url: str = os.getenv("LLM_API_BASE", "https://openrouter.ai/api/v1")
    openrouter_api_key: Optional[str] = os.getenv("OPENROUTER_API_KEY")

# Instantiate settings
settings = Settings()
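For reference, these are the environment variables config.py reads. A minimal `.env` sketch follows; the defaults match those in the Settings class, the key value is a placeholder, and only the OpenRouter key has no default:

```
# Used by the default OpenRouter setup (placeholder value)
OPENROUTER_API_KEY=your_api_key_here

# Optional overrides (defaults shown, as set in config.py)
LLM_PROVIDER=openrouter
LLM_MODEL_NAME=google/gemini-2.5-pro
LLM_API_BASE=https://openrouter.ai/api/v1
```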
doc_analyzer.py
ADDED
@@ -0,0 +1,170 @@
import logging
import base64
import re
import xml.etree.ElementTree as ET
from typing import Dict, Any, Union
from config import settings
from openai import OpenAI

# This function is no longer needed as we're using markdown directly

logger = logging.getLogger(__name__)

def _extract_xml_block(text: str, tag_name: str) -> str:
    """
    Extracts the last complete XML block from a string, ignoring surrounding text.
    """
    # This regex finds all occurrences of the specified XML block
    matches = re.findall(f"<{tag_name}.*?</{tag_name}>", text, re.DOTALL)
    if matches:
        # Return the last match, which should be the assistant's response
        return matches[-1]
    logger.error(f"Could not find <{tag_name}> block in text: {text}")
    return ""

def analyze_document(uploaded_file, rules: str) -> Dict[str, Any]:
    """
    Analyzes a document against formatting rules using an LLM.

    Args:
        uploaded_file: The uploaded file (PDF or DOCX)
        rules: The formatting rules as a string (in markdown format)

    Returns:
        Dict containing analysis results
    """
    logger.info("Analyzing document against formatting rules")

    # The rules are already in markdown format, so we can use them directly
    formatted_rules = rules

    try:
        # Read the file bytes
        file_bytes = uploaded_file.getvalue()

        # Create a unified prompt
        unified_prompt = f"""
You are an expert in academic document formatting and citation. Your goal is to analyze the user's document for compliance with the journal's formatting rules and generate a comprehensive compliance report in XML format.

Your response MUST be in the following XML format. Do not include any other text or explanations outside of the XML structure.

<compliance_report>
<summary>
<overall_assessment></overall_assessment>
<total_issues></total_issues>
<critical_issues></critical_issues>
<warning_issues></warning_issues>
</summary>
<recommendations>
<recommendation></recommendation>
</recommendations>
<issues>
<issue severity="critical/warning/info">
<message></message>
<location></location>
<suggestion></suggestion>
</issue>
</issues>
</compliance_report>

**Formatting Rules to Enforce**

{formatted_rules}

**Instructions**

Please analyze the attached document and generate the compliance report.

**Important Considerations for Analysis:**
* **Citation Style:** Determine the citation style (e.g., APA, MLA, Chicago) from the document's content and the journal's requirements. The document should follow the style specified in the formatting rules.
* **Page Numbering:** When reporting the location of an issue, use the page number exactly as it is written in the document (e.g., 'vii', '12'). Do not use the PDF reader's page count (unless necessary to clarify).
* **Visual Formatting:** When assessing visual properties like line spacing, margins, or font size from a PDF, be aware that text extraction can be imperfect. Base your findings on clear and consistent evidence throughout the document. Do not flag minor variations that could be due to PDF rendering. For example, only flag a line spacing issue if it is consistently incorrect across multiple pages and sections. Assume line spacing is correct unless it is obviously and consistently wrong.
* **Rule Interpretation:** Apply the formatting rules strictly but fairly. If a rule is ambiguous, note the ambiguity in your assessment.
* **Completeness:** Ensure that you check every rule against the document and that your report is complete.
"""

        # Initialize the OpenAI client
        client = OpenAI(
            base_url=settings.llm_base_url,
            api_key=settings.openrouter_api_key,
        )

        # Encode the file as base64
        base64_file = base64.b64encode(file_bytes).decode('utf-8')

        # Determine file type
        file_extension = uploaded_file.name.split('.')[-1].lower()
        mime_type = "application/pdf" if file_extension == "pdf" else "application/vnd.openxmlformats-officedocument.wordprocessingml.document"

        try:
            # Call the LLM API
            completion = client.chat.completions.create(
                model=settings.llm_model_name,
                messages=[
                    {
                        "role": "user",
                        "content": [
                            {"type": "text", "text": unified_prompt},
                            {
                                "type": "file",
                                "file": {
                                    "file_data": f"data:{mime_type};base64,{base64_file}"
                                }
                            }
                        ],
                    }
                ],
            )
            raw_response = completion.choices[0].message.content
        except Exception as e:
            logger.error(f"An error occurred during LLM API call: {e}")
            return {"error": f"An error occurred during LLM API call: {e}"}

        # Extract the XML block
        clean_xml = _extract_xml_block(raw_response, "compliance_report")
        if not clean_xml:
            logger.error("Could not extract <compliance_report> XML block from the response.")
            return {"error": "Could not extract <compliance_report> XML block from the response."}

        logger.info(f"Final assembled report:\n{clean_xml}")

        # Parse the final XML output
        try:
            root = ET.fromstring(clean_xml)

            summary_node = root.find("summary")
            summary = {
                "overall_assessment": summary_node.findtext("overall_assessment", "No assessment available."),
                "total_issues": summary_node.findtext("total_issues", "N/A"),
                "critical_issues": summary_node.findtext("critical_issues", "N/A"),
                "warning_issues": summary_node.findtext("warning_issues", "N/A"),
            } if summary_node is not None else {}

            issues = []
            for issue_node in root.findall(".//issue"):
                issues.append({
                    "severity": issue_node.get("severity"),
                    "message": issue_node.findtext("message"),
                    "location": issue_node.findtext("location"),
                    "suggestion": issue_node.findtext("suggestion"),
                })

            recommendations = [rec.text for rec in root.findall(".//recommendation")]

            return {
                "raw_xml": clean_xml,
                "summary": summary,
                "issues": issues,
                "recommendations": recommendations,
            }

        except ET.ParseError as e:
            logger.error(f"Failed to parse final LLM output: {e}", exc_info=True)
            return {
                "raw_xml": raw_response,
                "error": "Failed to parse final LLM output."
            }

    except Exception as e:
        logger.error(f"Error analyzing document: {str(e)}")
        return {"error": f"Error analyzing document: {str(e)}"}
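Reviewer note: `_extract_xml_block` is what isolates the `<compliance_report>` element from the raw model response before parsing. A quick sanity-check sketch, assuming the module is importable from the project root with its dependencies installed:

```python
from doc_analyzer import _extract_xml_block

# Surrounding chatter is ignored; only the last matching XML block is returned.
raw = (
    "Here is your report:\n"
    "<compliance_report><summary><total_issues>0</total_issues></summary></compliance_report>\n"
    "Let me know if you need anything else."
)
print(_extract_xml_block(raw, "compliance_report"))
# prints just the <compliance_report>...</compliance_report> block, without the prose
```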
logic.py
ADDED
@@ -0,0 +1,52 @@
import logging
import json
from rule_extractor import get_rules_from_url, format_rules_for_display
from doc_analyzer import analyze_document

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def combine_rules(url_rules, pasted_rules):
    """Combine URL-extracted rules and manually pasted rules"""
    combined_rules = ""

    if url_rules and (url_rules.strip().startswith('[') or url_rules.strip().startswith('{')):
        try:
            rules_data = json.loads(url_rules)
            if isinstance(rules_data, list) and len(rules_data) > 0:
                rules_data = rules_data[0]
            url_rules = format_rules_for_display(rules_data)
        except Exception as e:
            logging.error(f"Failed to parse URL rules as JSON: {e}")

    if url_rules:
        combined_rules += url_rules

    if pasted_rules:
        if url_rules:
            combined_rules += "\n\n## Additional Manually Pasted Rules\n\n" + pasted_rules
        else:
            combined_rules = "# Manually Pasted Rules\n\n" + pasted_rules

    return combined_rules

def extract_rules(journal_url):
    """Extract formatting rules from a given URL."""
    try:
        logging.info(f"Extracting rules from URL: {journal_url}")
        rules = get_rules_from_url(journal_url)
        logging.info("Successfully extracted rules from URL.")
        return rules
    except Exception as e:
        logging.error(f"Error extracting rules from URL: {e}")
        return {"error": f"Failed to extract rules from URL: {e}"}

def analyze_uploaded_document(uploaded_file, rules):
    """Analyze the uploaded document against the provided rules."""
    try:
        logging.info(f"Analyzing document: {uploaded_file.name}")
        results = analyze_document(uploaded_file, rules)
        logging.info("Successfully analyzed document.")
        return results
    except Exception as e:
        logging.error(f"Error analyzing document: {e}")
        return {"error": f"Failed to analyze document: {e}"}
requirements.txt
ADDED
@@ -0,0 +1,18 @@
# Core dependencies
streamlit>=1.30.0
python-dotenv>=1.0.0
pydantic>=2.0.0
openai>=1.0.0
nest-asyncio>=1.5.8

# Web crawling and extraction
crawl4ai>=0.6.0
litellm>=1.0.0

# Document processing
PyPDF2>=3.0.0
python-docx>=0.8.11

# Utilities
httpx>=0.25.0
asyncio>=3.4.3
rule_extractor.py
ADDED
@@ -0,0 +1,140 @@
import logging
import asyncio
import nest_asyncio
import os
import json
from config import settings
from pydantic import BaseModel, Field

logger = logging.getLogger(__name__)

class FormattingRules(BaseModel):
    """Schema for formatting rules extraction"""
    margins: str = Field(description="Margin requirements for the manuscript")
    font: str = Field(description="Font requirements including size, type, etc.")
    line_spacing: str = Field(description="Line spacing requirements")
    citations: str = Field(description="Citation style and formatting requirements")
    sections: str = Field(description="Required sections and their structure")
    other_rules: str = Field(description="Any other formatting requirements")
    summary: str = Field(description="A brief summary of the key formatting requirements")

def format_rules_for_display(rules_data):
    """
    Format the extracted rules data into a readable markdown string.
    """
    if not rules_data:
        return "Could not extract formatting rules from the provided URL."

    formatted_rules = f"""
# Manuscript Formatting Guidelines

## Margins
{rules_data.get('margins', 'Not specified')}

## Font
{rules_data.get('font', 'Not specified')}

## Line Spacing
{rules_data.get('line_spacing', 'Not specified')}

## Citations
{rules_data.get('citations', 'Not specified')}

## Section Structure
{rules_data.get('sections', 'Not specified')}

## Other Requirements
{rules_data.get('other_rules', 'Not specified')}

## Summary
{rules_data.get('summary', 'Not specified')}
"""
    return formatted_rules

def get_rules_from_url(url: str) -> str:
    """
    Extracts formatting rules from a given URL using crawl4ai.
    """
    logger.info(f"Extracting rules from URL: {url}")

    # Apply nest_asyncio here, when the function is called
    nest_asyncio.apply()

    # Import crawl4ai modules here to avoid event loop issues at module level
    from crawl4ai import AsyncWebCrawler, CacheMode, BrowserConfig, CrawlerRunConfig, LLMConfig
    from crawl4ai.extraction_strategy import LLMExtractionStrategy

    async def _extract_rules_async(url: str) -> str:
        """
        Asynchronously extracts formatting rules from a given URL using crawl4ai.
        """
        # Configure the browser
        browser_config = BrowserConfig(verbose=True)

        # Configure the LLM extraction
        extraction_strategy = LLMExtractionStrategy(
            llm_config=LLMConfig(
                provider=f"{settings.llm_provider}/{settings.llm_model_name}",
                api_token=settings.openrouter_api_key
            ),
            schema=FormattingRules.schema(),
            extraction_type="schema",
            instruction="""
            From the crawled content, extract all formatting rules for manuscript submissions.
            Focus on requirements for margins, font, line spacing, citations, section structure,
            and any other formatting guidelines. Provide a comprehensive extraction of all
            formatting-related information.

            If a specific requirement is not mentioned in the content, include "Not specified" in the corresponding field.
            """
        )

        # Configure the crawler
        run_config = CrawlerRunConfig(
            word_count_threshold=10,
            exclude_external_links=True,
            process_iframes=True,
            remove_overlay_elements=True,
            exclude_social_media_links=True,
            check_robots_txt=True,
            semaphore_count=3,
            extraction_strategy=extraction_strategy
        )

        # Initialize the crawler and run
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(
                url=url,
                config=run_config
            )

            if result.success and result.extracted_content:
                # Format the extracted data into a readable string
                if isinstance(result.extracted_content, list) and len(result.extracted_content) > 0:
                    rules_data = result.extracted_content[0]
                elif isinstance(result.extracted_content, dict):
                    rules_data = result.extracted_content
                else:
                    # If it's a string or other type, use markdown as fallback
                    return str(result.extracted_content) if result.extracted_content else result.markdown if result.markdown else "Could not extract formatting rules from the provided URL."

                # Store the raw data for debugging
                logger.info(f"Extracted rules data: {json.dumps(rules_data, indent=2)}")

                # Format the rules for display
                formatted_rules = format_rules_for_display(rules_data)
                logger.info(f"Formatted rules: {formatted_rules[:100]}...")  # Log for debugging
                return formatted_rules
            elif result.success and result.markdown:
                # Fallback to markdown if structured extraction fails
                return result.markdown
            else:
                return "Could not extract formatting rules from the provided URL."

    # Create a new event loop and run the async function
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        return loop.run_until_complete(_extract_rules_async(url))
    finally:
        loop.close()
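Reviewer note: to see what the extracted rules look like once rendered, `format_rules_for_display` can be exercised directly with a hand-built dict whose keys match the FormattingRules fields. A small sketch, assuming the project's dependencies are installed; the sample values are invented:

```python
from rule_extractor import format_rules_for_display

# Keys match the FormattingRules schema; values here are invented examples.
sample = {
    "margins": "1 inch (2.54 cm) on all sides",
    "font": "12 pt Times New Roman",
    "line_spacing": "Double-spaced throughout",
    "citations": "APA 7th edition",
    "sections": "Title page, Abstract, Introduction, Methods, Results, Discussion, References",
    "other_rules": "Not specified",
    "summary": "Standard double-spaced manuscript with APA citations",
}
print(format_rules_for_display(sample))  # renders the markdown shown to the user
```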
startup_formatreview.sh
ADDED
@@ -0,0 +1,93 @@
#!/bin/bash
# Startup script for FormatReview
# This script starts the Streamlit service and exposes it via Tailscale Serve

# Exit on error
set -e

# --- Configuration for Streamlit App ---
STREAMLIT_APP_FILE="app.py"
STREAMLIT_PORT="8504"
STREAMLIT_PID_FILE="formatreview.pid"
STREAMLIT_LOG_FILE="formatreview.log"
# --- End Configuration ---

# Check if UV is installed
if ! command -v uv &> /dev/null; then
    echo "Error: UV is not installed. Please install UV first."
    echo "You can install UV with: pip install uv"
    exit 1
fi

# Create virtual environment if it doesn't exist
if [ ! -d ".venv" ]; then
    echo "Creating virtual environment..."
    uv venv
fi

# Activate virtual environment
echo "Activating virtual environment..."
source .venv/bin/activate

# Install dependencies
echo "Installing dependencies..."
uv pip install -r requirements.txt

# Kill any existing instances of the Streamlit app
echo "Stopping any existing instances of the Streamlit app..."
if [ -f "$STREAMLIT_PID_FILE" ]; then
    OLD_STREAMLIT_PID=$(cat $STREAMLIT_PID_FILE)
    if ps -p $OLD_STREAMLIT_PID > /dev/null; then
        kill $OLD_STREAMLIT_PID
        echo "Killed existing Streamlit app process with PID $OLD_STREAMLIT_PID"
        sleep 1  # Give it time to shut down
    else
        echo "No running Streamlit app process found with PID $OLD_STREAMLIT_PID"
    fi
fi
# Also try to kill any other streamlit processes for this specific app file and port
pkill -f "streamlit run $STREAMLIT_APP_FILE --server.port $STREAMLIT_PORT" || true
sleep 1

# Start the Streamlit app
echo "Starting Streamlit app on port $STREAMLIT_PORT..."
nohup streamlit run $STREAMLIT_APP_FILE --server.port $STREAMLIT_PORT --server.headless true > $STREAMLIT_LOG_FILE 2>&1 &
echo $! > $STREAMLIT_PID_FILE

# Check if the Streamlit service started successfully
sleep 3  # Give Streamlit a bit more time to start
if ! nc -z localhost $STREAMLIT_PORT; then
    echo "Error: Failed to start Streamlit app on port $STREAMLIT_PORT."
    cat $STREAMLIT_LOG_FILE  # Output log file for debugging
    exit 1
else
    echo "Streamlit app started successfully on port $STREAMLIT_PORT."
fi

# Check if Tailscale is installed
if ! command -v tailscale &> /dev/null; then
    echo "Warning: Tailscale is not installed. The app will only be available locally."
    echo "Install Tailscale to expose the service over your tailnet."
else
    # Expose the service via Tailscale Serve
    echo "Exposing Streamlit app via Tailscale Serve on port $STREAMLIT_PORT..."
    echo "Setting up Funnel on port 443..."
    tailscale funnel --https=443 --bg localhost:$STREAMLIT_PORT

    # Get the Tailscale hostname
    HOSTNAME=$(tailscale status --json | jq -r '.Self.DNSName')
    if [ -n "$HOSTNAME" ]; then
        echo "App may be available at a Tailscale URL. Check 'tailscale status' for details."
        echo "If using a funnel, it might be https://$HOSTNAME/"
    else
        echo "App is exposed via Tailscale Serve, but couldn't determine the primary hostname."
        echo "Check 'tailscale status' for details."
    fi
fi

echo "FormatReview is now running!"
echo "Local URL: http://localhost:$STREAMLIT_PORT"
echo "Log file: $STREAMLIT_LOG_FILE"
echo "PID file: $STREAMLIT_PID_FILE"
echo ""
echo "If Tailscale is active, the app should be accessible via a Tailscale funnel URL."
test_crawl.py
ADDED
@@ -0,0 +1,132 @@
import asyncio
import nest_asyncio
import logging
import json
from pprint import pprint
from config import settings
from pydantic import BaseModel, Field

# Configure logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s")
logger = logging.getLogger("crawl4ai_test")

class FormattingRules(BaseModel):
    """Schema for formatting rules extraction"""
    margins: str = Field(description="Margin requirements for the manuscript")
    font: str = Field(description="Font requirements including size, type, etc.")
    line_spacing: str = Field(description="Line spacing requirements")
    citations: str = Field(description="Citation style and formatting requirements")
    sections: str = Field(description="Required sections and their structure")
    other_rules: str = Field(description="Any other formatting requirements")
    summary: str = Field(description="A brief summary of the key formatting requirements")

async def test_crawl():
    """Test crawl4ai functionality"""
    from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, LLMConfig
    from crawl4ai.extraction_strategy import LLMExtractionStrategy

    url = "https://journal.code4lib.org/article-guidelines"

    # Configure the browser
    browser_config = BrowserConfig(verbose=True)

    # Configure the LLM extraction
    extraction_strategy = LLMExtractionStrategy(
        llm_config=LLMConfig(
            provider=f"{settings.llm_provider}/{settings.llm_model_name}",
            api_token=settings.openrouter_api_key
        ),
        schema=FormattingRules.schema(),
        extraction_type="schema",
        instruction="""
        From the crawled content, extract all formatting rules for manuscript submissions.
        Focus on requirements for margins, font, line spacing, citations, section structure,
        and any other formatting guidelines. Provide a comprehensive extraction of all
        formatting-related information.
        """
    )

    # Configure the crawler
    run_config = CrawlerRunConfig(
        word_count_threshold=10,
        exclude_external_links=True,
        process_iframes=True,
        remove_overlay_elements=True,
        exclude_social_media_links=True,
        check_robots_txt=True,
        semaphore_count=3,
        extraction_strategy=extraction_strategy
    )

    # Initialize the crawler and run
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            config=run_config
        )

        # Print all attributes of the result object
        logger.info(f"Result object type: {type(result)}")
        logger.info(f"Result object dir: {dir(result)}")

        # Check for success
        logger.info(f"Success: {result.success}")

        # Check for markdown
        if hasattr(result, 'markdown'):
            logger.info(f"Has markdown: {bool(result.markdown)}")
            logger.info(f"Markdown type: {type(result.markdown)}")
            logger.info(f"Markdown preview: {str(result.markdown)[:200]}...")
        else:
            logger.info("No markdown attribute")

        # Check for extracted_data
        if hasattr(result, 'extracted_data'):
            logger.info(f"Has extracted_data: {bool(result.extracted_data)}")
            logger.info(f"Extracted data: {result.extracted_data}")
        else:
            logger.info("No extracted_data attribute")

        # Check for other potential attributes
        for attr in ['data', 'extraction', 'llm_extraction', 'content', 'text', 'extracted_content']:
            if hasattr(result, attr):
                logger.info(f"Has {attr}: {bool(getattr(result, attr))}")
                logger.info(f"{attr} preview: {str(getattr(result, attr))[:200]}...")

        # Try to access _results directly
        if hasattr(result, '_results'):
            logger.info(f"Has _results: {bool(result._results)}")
            if result._results:
                first_result = result._results[0]
                logger.info(f"First result type: {type(first_result)}")
                logger.info(f"First result dir: {dir(first_result)}")

                # Check if first result has extracted_data
                if hasattr(first_result, 'extracted_data'):
                    logger.info(f"First result has extracted_data: {bool(first_result.extracted_data)}")
                    logger.info(f"First result extracted_data: {first_result.extracted_data}")

                # Check for other attributes in first result
                for attr in ['data', 'extraction', 'llm_extraction', 'content', 'text', 'extracted_content']:
                    if hasattr(first_result, attr):
                        logger.info(f"First result has {attr}: {bool(getattr(first_result, attr))}")
                        logger.info(f"First result {attr} preview: {str(getattr(first_result, attr))[:200]}...")

        return result

def main():
    """Main function"""
    # Apply nest_asyncio
    nest_asyncio.apply()

    # Create a new event loop and run the async function
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        result = loop.run_until_complete(test_crawl())
        logger.info("Test completed successfully")
    finally:
        loop.close()

if __name__ == "__main__":
    main()