Stephen Zweibel committed
Commit
bb869fd
·
0 Parent(s):

Add initial implementation of FormatReview tool with core features and configurations


- Created .gitignore to exclude environment and runtime files
- Added Streamlit configuration in .streamlit/config.toml
- Developed main application logic in app.py for document analysis and formatting rule extraction
- Implemented logging and settings management in config.py
- Integrated document analysis functionality in doc_analyzer.py
- Established rule extraction logic in rule_extractor.py
- Included requirements.txt for project dependencies
- Added startup script for running the Streamlit app
- Created test script for validating crawl4ai functionality

Files changed (11)
  1. .gitignore +43 -0
  2. .streamlit/config.toml +11 -0
  3. README.md +60 -0
  4. app.py +237 -0
  5. config.py +27 -0
  6. doc_analyzer.py +170 -0
  7. logic.py +52 -0
  8. requirements.txt +18 -0
  9. rule_extractor.py +140 -0
  10. startup_formatreview.sh +93 -0
  11. test_crawl.py +132 -0
.gitignore ADDED
@@ -0,0 +1,43 @@
+ # Environment variables
+ .env
+
+ # Python virtual environment
+ .venv/
+ venv/
+ ENV/
+
+ # Python cache files
+ __pycache__/
+ *.py[cod]
+ *$py.class
+ *.so
+ .Python
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+
+ # Logs and runtime files
+ *.log
+ *.pid
+
+ # IDE files
+ .idea/
+ .vscode/
+ *.swp
+ *.swo
+ .DS_Store
+
+ # Streamlit
+ .streamlit/secrets.toml
.streamlit/config.toml ADDED
@@ -0,0 +1,11 @@
+ [theme]
+ primaryColor = "#1E88E5"
+ backgroundColor = "#FFFFFF"
+ secondaryBackgroundColor = "#F0F2F6"
+ textColor = "#262730"
+ font = "sans serif"
+
+ [server]
+ enableCORS = false
+ enableXsrfProtection = true
+ maxUploadSize = 200
README.md ADDED
@@ -0,0 +1,60 @@
+ # FormatReview
+
+ FormatReview is a tool that helps authors ensure their manuscripts comply with journal formatting guidelines. It automatically extracts formatting rules from journal websites and analyzes documents against these rules.
+
+ ## Features
+
+ - **Dynamic Rule Extraction**: Automatically extracts formatting guidelines from any journal's "Instructions for Authors" page
+ - **Manual Rule Input**: Allows direct pasting of formatting rules for journals where automatic extraction is difficult
+ - **Flexible Rule Sources**: Supports using URL-extracted rules, manually pasted rules, or a combination of both
+ - **Document Analysis**: Analyzes PDF and DOCX documents against the extracted rules
+ - **Comprehensive Reports**: Provides detailed compliance reports with specific issues and recommendations
+ - **User-Friendly Interface**: Simple web interface with separate tabs for uploading documents, viewing extracted rules, and reviewing analysis results
+
+ ## How It Works
+
+ 1. **Rule Extraction**: The application uses crawl4ai to extract formatting rules from journal websites. It employs a Large Language Model (LLM) to understand and structure the formatting requirements.
+
+ 2. **Document Analysis**: The uploaded document is analyzed against the extracted rules using an LLM. The analysis checks for compliance with margins, font, line spacing, citations, section structure, and other formatting requirements.
+
+ 3. **Report Generation**: A detailed compliance report is generated, highlighting any issues found and providing recommendations for fixing them.
+
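The `logic.py` helpers added in this commit wrap these same steps for use outside the Streamlit UI; a minimal sketch of chaining them is shown below (the URL, file path, and `LocalFile` wrapper are illustrative only, not part of the commit):

```python
# Illustrative sketch: chain the helpers from logic.py end to end.
from logic import extract_rules, combine_rules, analyze_uploaded_document

url_rules = extract_rules("https://journal.example.org/instructions-for-authors")  # placeholder URL
if isinstance(url_rules, dict):  # extract_rules returns {"error": ...} on failure
    url_rules = ""

pasted = "Margins: 1 inch. Font: 12 pt Times New Roman, double-spaced."
rules = combine_rules(url_rules, pasted)

# analyze_uploaded_document only needs an object with .name and .getvalue(),
# so a tiny wrapper around a local file stands in for Streamlit's UploadedFile.
class LocalFile:
    def __init__(self, path):
        self.name = path
        with open(path, "rb") as f:
            self._data = f.read()

    def getvalue(self):
        return self._data

results = analyze_uploaded_document(LocalFile("manuscript.pdf"), rules)
print(results.get("summary"), results.get("issues"))
```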
+ ## Technical Details
+
+ - **Backend**: Python with asyncio for handling asynchronous operations
+ - **Frontend**: Streamlit for the web interface
+ - **LLM Integration**: OpenRouter API for accessing advanced language models
+ - **Web Crawling**: crawl4ai for extracting content from journal websites
+ - **Document Processing**: Support for PDF and DOCX formats
+
+ ## Usage
+
+ 1. Upload your manuscript (PDF or DOCX)
+ 2. Provide formatting rules in one of two ways (or both):
+    - Enter the URL to the journal's "Instructions for Authors" page
+    - Paste formatting rules directly into the text area
+ 3. Click "Analyze Document"
+ 4. View the formatting rules in the "Formatting Rules" tab
+ 5. Review the analysis results in the "Analysis Results" tab
+
+ ## Requirements
+
+ - Python 3.9+
+ - OpenRouter API key (set in .env file)
+ - Required Python packages (listed in requirements.txt)
+
+ ## Installation
+
+ 1. Clone the repository
+ 2. Create a virtual environment: `python -m venv .venv`
+ 3. Activate the virtual environment: `source .venv/bin/activate`
+ 4. Install dependencies: `pip install -r requirements.txt`
+ 5. Create a `.env` file with your OpenRouter API key:
+    ```
+    OPENROUTER_API_KEY=your_api_key_here
+    ```
+ 6. Run the application: `streamlit run app.py`
+
+ ## License
+
+ MIT
app.py ADDED
@@ -0,0 +1,237 @@
+ import streamlit as st
+ import json
+ from rule_extractor import get_rules_from_url, format_rules_for_display
+ from doc_analyzer import analyze_document
+
+ def combine_rules(url_rules, pasted_rules):
+     """Combine URL-extracted rules and manually pasted rules"""
+     combined_rules = ""
+
+     # Check if URL rules are in JSON format and convert if needed
+     if url_rules and (url_rules.strip().startswith('[') or url_rules.strip().startswith('{')):
+         try:
+             # Try to parse as JSON
+             rules_data = json.loads(url_rules)
+             if isinstance(rules_data, list) and len(rules_data) > 0:
+                 rules_data = rules_data[0]
+
+             # Format the rules
+             url_rules = format_rules_for_display(rules_data)
+         except Exception as e:
+             # If parsing fails, use as is
+             pass
+
+     # Add URL-extracted rules if available
+     if url_rules:
+         combined_rules += url_rules
+
+     # Add pasted rules if available
+     if pasted_rules:
+         if url_rules:  # If we already have URL rules, add a separator
+             combined_rules += "\n\n## Additional Manually Pasted Rules\n\n" + pasted_rules
+         else:  # If no URL rules, just use pasted rules
+             combined_rules = "# Manually Pasted Rules\n\n" + pasted_rules
+
+     return combined_rules
+
+ st.set_page_config(
+     page_title="FormatReview",
+     page_icon="🔎",
+     layout="wide",
+ )
+
+ st.title("FormatReview")
+ st.markdown("Analyze your manuscript against any journal's formatting guidelines.")
+
+ # Initialize session state for storing rules
+ if "rules" not in st.session_state:
+     st.session_state.rules = None
+ if "results" not in st.session_state:
+     st.session_state.results = None
+ if "url_rules" not in st.session_state:
+     st.session_state.url_rules = None
+ if "pasted_rules" not in st.session_state:
+     st.session_state.pasted_rules = None
+
+ # Create tabs
+ tab1, tab2, tab3 = st.tabs(["Document Upload", "Formatting Rules", "Analysis Results"])
+
+ with tab1:
+     # --- UI Components ---
+     uploaded_file = st.file_uploader("Upload your manuscript (PDF or DOCX)", type=["pdf", "docx"])
+
+     # Rules input section
+     st.subheader("Formatting Rules")
+     st.markdown("You can provide formatting rules by URL, paste them directly, or both.")
+
+     # URL input
+     journal_url = st.text_input("Enter the URL to the journal's 'Instructions for Authors' page (optional if pasting rules)")
+
+     # Pasted rules input
+     pasted_rules = st.text_area(
+         "Or paste formatting rules directly (optional if providing URL)",
+         height=200,
+         placeholder="Paste journal formatting guidelines here..."
+     )
+
+     if st.button("Analyze Document"):
+         if uploaded_file is None:
+             st.error("Please upload your manuscript.")
+         elif not journal_url and not pasted_rules:
+             st.error("Please either enter the journal's URL or paste formatting rules.")
+         else:
+             # Initialize combined rules
+             combined_rules = ""
+
+             # Extract rules from URL if provided
+             if journal_url:
+                 with st.spinner("Extracting rules from URL..."):
+                     url_rules = get_rules_from_url(journal_url)
+                     st.session_state.url_rules = url_rules
+                     combined_rules += url_rules
+
+             # Add pasted rules if provided
+             if pasted_rules:
+                 st.session_state.pasted_rules = pasted_rules
+                 if journal_url:  # If we already have URL rules, combine them
+                     # Make sure URL rules are formatted before combining
+                     if st.session_state.url_rules and (
+                         st.session_state.url_rules.strip().startswith('[') or
+                         st.session_state.url_rules.strip().startswith('{')
+                     ):
+                         try:
+                             # Try to parse as JSON
+                             rules_data = json.loads(st.session_state.url_rules)
+                             if isinstance(rules_data, list) and len(rules_data) > 0:
+                                 rules_data = rules_data[0]
+
+                             # Format and update the URL rules
+                             formatted_url_rules = format_rules_for_display(rules_data)
+                             st.session_state.url_rules = formatted_url_rules
+                         except Exception as e:
+                             # If parsing fails, use as is
+                             pass
+
+                     combined_rules = combine_rules(st.session_state.url_rules, pasted_rules)
+                 else:  # If no URL rules, just use pasted rules
+                     combined_rules = "# Manually Pasted Rules\n\n" + pasted_rules
+
+             # Store the combined rules
+             st.session_state.rules = combined_rules
+
+             # Analyze the document
+             with st.spinner("Analyzing document..."):
+                 st.session_state.results = analyze_document(uploaded_file, st.session_state.rules)
+
+             st.success("Analysis complete! View the results in the 'Analysis Results' tab.")
+
+ with tab2:
+     st.header("Formatting Rules")
+
+     if st.session_state.url_rules or st.session_state.pasted_rules:
+         # Display URL-extracted rules if available
+         if st.session_state.url_rules:
+             st.subheader("Rules Extracted from URL")
+
+             # Check if the rules look like JSON
+             if isinstance(st.session_state.url_rules, str) and (
+                 st.session_state.url_rules.strip().startswith('[') or
+                 st.session_state.url_rules.strip().startswith('{')
+             ):
+                 try:
+                     # Try to parse as JSON
+                     rules_data = json.loads(st.session_state.url_rules)
+                     if isinstance(rules_data, list) and len(rules_data) > 0:
+                         rules_data = rules_data[0]
+
+                     # Format and display the rules
+                     formatted_rules = format_rules_for_display(rules_data)
+                     st.markdown(formatted_rules)
+                 except Exception as e:
+                     # If parsing fails, just display as is
+                     st.markdown(st.session_state.url_rules)
+             else:
+                 # If not JSON, just display as is
+                 st.markdown(st.session_state.url_rules)
+
+         # Display pasted rules if available
+         if st.session_state.pasted_rules:
+             st.subheader("Manually Pasted Rules")
+             st.text_area("", value=st.session_state.pasted_rules, height=150, disabled=True)
+
+         # Display combined rules used for analysis
+         if st.session_state.rules and (st.session_state.url_rules and st.session_state.pasted_rules):
+             st.subheader("Combined Rules (Used for Analysis)")
+
+             # The combined rules should already be formatted, but check just in case
+             if isinstance(st.session_state.rules, str) and (
+                 st.session_state.rules.strip().startswith('[') or
+                 st.session_state.rules.strip().startswith('{')
+             ):
+                 try:
+                     # Try to parse as JSON
+                     rules_data = json.loads(st.session_state.rules)
+                     if isinstance(rules_data, list) and len(rules_data) > 0:
+                         rules_data = rules_data[0]
+
+                     # Format and display the rules
+                     formatted_rules = format_rules_for_display(rules_data)
+                     st.markdown(formatted_rules)
+                 except Exception as e:
+                     # If parsing fails, just display as is
+                     st.markdown(st.session_state.rules)
+             else:
+                 # If not JSON, just display as is
+                 st.markdown(st.session_state.rules)
+     else:
+         st.info("Provide formatting rules via URL or direct input to view them here.")
+
+ with tab3:
+     st.header("Analysis Results")
+     if st.session_state.results:
+         results = st.session_state.results
+
+         if "error" in results:
+             st.error(results["error"])
+         else:
+             # Display summary
+             st.subheader("Summary")
+             summary = results.get("summary", {})
+             st.write(f"**Overall Assessment**: {summary.get('overall_assessment', 'N/A')}")
+             st.write(f"**Total Issues**: {summary.get('total_issues', 'N/A')}")
+             st.write(f"**Critical Issues**: {summary.get('critical_issues', 'N/A')}")
+             st.write(f"**Warning Issues**: {summary.get('warning_issues', 'N/A')}")
+
+             # Display recommendations
+             st.subheader("Recommendations")
+             recommendations = results.get("recommendations", [])
+             if recommendations:
+                 for rec in recommendations:
+                     st.write(f"- {rec}")
+             else:
+                 st.write("No recommendations.")
+
+             # Display detailed report
+             st.subheader("Detailed Report")
+             issues = results.get("issues", [])
+             if issues:
+                 for issue in issues:
+                     severity = issue.get('severity', 'N/A').lower()
+                     message = f"**{issue.get('severity', 'N/A').upper()}**: {issue.get('message', 'N/A')}"
+
+                     if severity == 'critical':
+                         st.error(message)
+                     elif severity == 'warning':
+                         st.warning(message)
+                     elif severity == 'info':
+                         st.info(message)
+                     else:
+                         st.success(message)
+
+                     st.write(f"**Location**: {issue.get('location', 'N/A')}")
+                     st.write(f"**Suggestion**: {issue.get('suggestion', 'N/A')}")
+                     st.divider()
+             else:
+                 st.success("No issues found.")
+     else:
+         st.info("Analyze a document to view results here.")
config.py ADDED
@@ -0,0 +1,27 @@
+ from dotenv import load_dotenv
+ load_dotenv()
+
+ import os
+ import logging
+ from pydantic import BaseModel
+ from typing import Optional
+
+ # Logging configuration
+ logging.basicConfig(
+     level=logging.INFO,
+     format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
+     handlers=[
+         logging.FileHandler("formatreview.log"),
+         logging.StreamHandler()
+     ]
+ )
+
+ class Settings(BaseModel):
+     """Application settings"""
+     llm_provider: str = os.getenv("LLM_PROVIDER", "openrouter").lower()
+     llm_model_name: str = os.getenv("LLM_MODEL_NAME", "google/gemini-2.5-pro")
+     llm_base_url: str = os.getenv("LLM_API_BASE", "https://openrouter.ai/api/v1")
+     openrouter_api_key: Optional[str] = os.getenv("OPENROUTER_API_KEY")
+
+ # Instantiate settings
+ settings = Settings()
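A quick way to confirm which values `Settings` picked up, for example from a Python shell, might look like this (the variable names match the `os.getenv` calls above; the values shown are placeholders):

```python
# Placeholder values; config.py reads these names from the environment / .env.
import os
os.environ.setdefault("LLM_PROVIDER", "openrouter")
os.environ.setdefault("LLM_MODEL_NAME", "google/gemini-2.5-pro")
os.environ.setdefault("LLM_API_BASE", "https://openrouter.ai/api/v1")
os.environ.setdefault("OPENROUTER_API_KEY", "your_api_key_here")

from config import settings
print(settings.llm_provider, settings.llm_model_name, settings.llm_base_url)
```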
doc_analyzer.py ADDED
@@ -0,0 +1,170 @@
+ import logging
+ import base64
+ import re
+ import xml.etree.ElementTree as ET
+ from typing import Dict, Any, Union
+ from config import settings
+ from openai import OpenAI
+
+ # This function is no longer needed as we're using markdown directly
+
+ logger = logging.getLogger(__name__)
+
+ def _extract_xml_block(text: str, tag_name: str) -> str:
+     """
+     Extracts the last complete XML block from a string, ignoring surrounding text.
+     """
+     # This regex finds all occurrences of the specified XML block
+     matches = re.findall(f"<{tag_name}.*?</{tag_name}>", text, re.DOTALL)
+     if matches:
+         # Return the last match, which should be the assistant's response
+         return matches[-1]
+     logger.error(f"Could not find <{tag_name}> block in text: {text}")
+     return ""
+
+ def analyze_document(uploaded_file, rules: str) -> Dict[str, Any]:
+     """
+     Analyzes a document against formatting rules using an LLM.
+
+     Args:
+         uploaded_file: The uploaded file (PDF or DOCX)
+         rules: The formatting rules as a string (in markdown format)
+
+     Returns:
+         Dict containing analysis results
+     """
+     logger.info("Analyzing document against formatting rules")
+
+     # The rules are already in markdown format, so we can use them directly
+     formatted_rules = rules
+
+     try:
+         # Read the file bytes
+         file_bytes = uploaded_file.getvalue()
+
+         # Create a unified prompt
+         unified_prompt = f"""
+         You are an expert in academic document formatting and citation. Your goal is to analyze the user's document for compliance with the journal's formatting rules and generate a comprehensive compliance report in XML format.
+
+         Your response MUST be in the following XML format. Do not include any other text or explanations outside of the XML structure.
+
+         <compliance_report>
+         <summary>
+         <overall_assessment></overall_assessment>
+         <total_issues></total_issues>
+         <critical_issues></critical_issues>
+         <warning_issues></warning_issues>
+         </summary>
+         <recommendations>
+         <recommendation></recommendation>
+         </recommendations>
+         <issues>
+         <issue severity="critical/warning/info">
+         <message></message>
+         <location></location>
+         <suggestion></suggestion>
+         </issue>
+         </issues>
+         </compliance_report>
+
+         **Formatting Rules to Enforce**
+
+         {formatted_rules}
+
+         **Instructions**
+
+         Please analyze the attached document and generate the compliance report.
+
+         **Important Considerations for Analysis:**
+         * **Citation Style:** Determine the citation style (e.g., APA, MLA, Chicago) from the document's content and the journal's requirements. The document should follow the style specified in the formatting rules.
+         * **Page Numbering:** When reporting the location of an issue, use the page number exactly as it is written in the document (e.g., 'vii', '12'). Do not use the PDF reader's page count (unless necessary to clarify).
+         * **Visual Formatting:** When assessing visual properties like line spacing, margins, or font size from a PDF, be aware that text extraction can be imperfect. Base your findings on clear and consistent evidence throughout the document. Do not flag minor variations that could be due to PDF rendering. For example, only flag a line spacing issue if it is consistently incorrect across multiple pages and sections. Assume line spacing is correct unless it is obviously and consistently wrong.
+         * **Rule Interpretation:** Apply the formatting rules strictly but fairly. If a rule is ambiguous, note the ambiguity in your assessment.
+         * **Completeness:** Ensure that you check every rule against the document and that your report is complete.
+         """
+
+         # Initialize the OpenAI client
+         client = OpenAI(
+             base_url=settings.llm_base_url,
+             api_key=settings.openrouter_api_key,
+         )
+
+         # Encode the file as base64
+         base64_file = base64.b64encode(file_bytes).decode('utf-8')
+
+         # Determine file type
+         file_extension = uploaded_file.name.split('.')[-1].lower()
+         mime_type = "application/pdf" if file_extension == "pdf" else "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
+
+         try:
+             # Call the LLM API
+             completion = client.chat.completions.create(
+                 model=settings.llm_model_name,
+                 messages=[
+                     {
+                         "role": "user",
+                         "content": [
+                             {"type": "text", "text": unified_prompt},
+                             {
+                                 "type": "file",
+                                 "file": {
+                                     "file_data": f"data:{mime_type};base64,{base64_file}"
+                                 }
+                             }
+                         ],
+                     }
+                 ],
+             )
+             raw_response = completion.choices[0].message.content
+         except Exception as e:
+             logger.error(f"An error occurred during LLM API call: {e}")
+             return {"error": f"An error occurred during LLM API call: {e}"}
+
+         # Extract the XML block
+         clean_xml = _extract_xml_block(raw_response, "compliance_report")
+         if not clean_xml:
+             logger.error("Could not extract <compliance_report> XML block from the response.")
+             return {"error": "Could not extract <compliance_report> XML block from the response."}
+
+         logger.info(f"Final assembled report:\n{clean_xml}")
+
+         # Parse the final XML output
+         try:
+             root = ET.fromstring(clean_xml)
+
+             summary_node = root.find("summary")
+             summary = {
+                 "overall_assessment": summary_node.findtext("overall_assessment", "No assessment available."),
+                 "total_issues": summary_node.findtext("total_issues", "N/A"),
+                 "critical_issues": summary_node.findtext("critical_issues", "N/A"),
+                 "warning_issues": summary_node.findtext("warning_issues", "N/A"),
+             } if summary_node is not None else {}
+
+             issues = []
+             for issue_node in root.findall(".//issue"):
+                 issues.append({
+                     "severity": issue_node.get("severity"),
+                     "message": issue_node.findtext("message"),
+                     "location": issue_node.findtext("location"),
+                     "suggestion": issue_node.findtext("suggestion"),
+                 })
+
+             recommendations = [rec.text for rec in root.findall(".//recommendation")]
+
+             return {
+                 "raw_xml": clean_xml,
+                 "summary": summary,
+                 "issues": issues,
+                 "recommendations": recommendations,
+             }
+
+         except ET.ParseError as e:
+             logger.error(f"Failed to parse final LLM output: {e}", exc_info=True)
+             return {
+                 "raw_xml": raw_response,
+                 "error": "Failed to parse final LLM output."
+             }
+
+     except Exception as e:
+         logger.error(f"Error analyzing document: {str(e)}")
+         return {"error": f"Error analyzing document: {str(e)}"}
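To see what the XML-to-dict step above yields, here is a sketch that runs a hand-written `<compliance_report>` through the same `_extract_xml_block` helper and `ElementTree` calls (the sample XML is made up):

```python
# Sketch: the same extract-and-parse steps as analyze_document, on a made-up report.
import xml.etree.ElementTree as ET
from doc_analyzer import _extract_xml_block

raw_response = """Model preamble that should be ignored...
<compliance_report>
  <summary>
    <overall_assessment>Mostly compliant</overall_assessment>
    <total_issues>1</total_issues>
    <critical_issues>0</critical_issues>
    <warning_issues>1</warning_issues>
  </summary>
  <recommendations>
    <recommendation>Double-space the reference list.</recommendation>
  </recommendations>
  <issues>
    <issue severity="warning">
      <message>References are single-spaced.</message>
      <location>p. 12</location>
      <suggestion>Apply double spacing to the reference list.</suggestion>
    </issue>
  </issues>
</compliance_report>"""

clean_xml = _extract_xml_block(raw_response, "compliance_report")
root = ET.fromstring(clean_xml)
print(root.findtext("summary/overall_assessment"))            # Mostly compliant
print([i.get("severity") for i in root.findall(".//issue")])  # ['warning']
```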
logic.py ADDED
@@ -0,0 +1,52 @@
+ import logging
+ import json
+ from rule_extractor import get_rules_from_url, format_rules_for_display
+ from doc_analyzer import analyze_document
+
+ logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
+
+ def combine_rules(url_rules, pasted_rules):
+     """Combine URL-extracted rules and manually pasted rules"""
+     combined_rules = ""
+
+     if url_rules and (url_rules.strip().startswith('[') or url_rules.strip().startswith('{')):
+         try:
+             rules_data = json.loads(url_rules)
+             if isinstance(rules_data, list) and len(rules_data) > 0:
+                 rules_data = rules_data[0]
+             url_rules = format_rules_for_display(rules_data)
+         except Exception as e:
+             logging.error(f"Failed to parse URL rules as JSON: {e}")
+
+     if url_rules:
+         combined_rules += url_rules
+
+     if pasted_rules:
+         if url_rules:
+             combined_rules += "\n\n## Additional Manually Pasted Rules\n\n" + pasted_rules
+         else:
+             combined_rules = "# Manually Pasted Rules\n\n" + pasted_rules
+
+     return combined_rules
+
+ def extract_rules(journal_url):
+     """Extract formatting rules from a given URL."""
+     try:
+         logging.info(f"Extracting rules from URL: {journal_url}")
+         rules = get_rules_from_url(journal_url)
+         logging.info("Successfully extracted rules from URL.")
+         return rules
+     except Exception as e:
+         logging.error(f"Error extracting rules from URL: {e}")
+         return {"error": f"Failed to extract rules from URL: {e}"}
+
+ def analyze_uploaded_document(uploaded_file, rules):
+     """Analyze the uploaded document against the provided rules."""
+     try:
+         logging.info(f"Analyzing document: {uploaded_file.name}")
+         results = analyze_document(uploaded_file, rules)
+         logging.info("Successfully analyzed document.")
+         return results
+     except Exception as e:
+         logging.error(f"Error analyzing document: {e}")
+         return {"error": f"Failed to analyze document: {e}"}
requirements.txt ADDED
@@ -0,0 +1,18 @@
+ # Core dependencies
+ streamlit>=1.30.0
+ python-dotenv>=1.0.0
+ pydantic>=2.0.0
+ openai>=1.0.0
+ nest-asyncio>=1.5.8
+
+ # Web crawling and extraction
+ crawl4ai>=0.6.0
+ litellm>=1.0.0
+
+ # Document processing
+ PyPDF2>=3.0.0
+ python-docx>=0.8.11
+
+ # Utilities
+ httpx>=0.25.0
+ asyncio>=3.4.3
rule_extractor.py ADDED
@@ -0,0 +1,140 @@
+ import logging
+ import asyncio
+ import nest_asyncio
+ import os
+ import json
+ from config import settings
+ from pydantic import BaseModel, Field
+
+ logger = logging.getLogger(__name__)
+
+ class FormattingRules(BaseModel):
+     """Schema for formatting rules extraction"""
+     margins: str = Field(description="Margin requirements for the manuscript")
+     font: str = Field(description="Font requirements including size, type, etc.")
+     line_spacing: str = Field(description="Line spacing requirements")
+     citations: str = Field(description="Citation style and formatting requirements")
+     sections: str = Field(description="Required sections and their structure")
+     other_rules: str = Field(description="Any other formatting requirements")
+     summary: str = Field(description="A brief summary of the key formatting requirements")
+
+ def format_rules_for_display(rules_data):
+     """
+     Format the extracted rules data into a readable markdown string.
+     """
+     if not rules_data:
+         return "Could not extract formatting rules from the provided URL."
+
+     formatted_rules = f"""
+ # Manuscript Formatting Guidelines
+
+ ## Margins
+ {rules_data.get('margins', 'Not specified')}
+
+ ## Font
+ {rules_data.get('font', 'Not specified')}
+
+ ## Line Spacing
+ {rules_data.get('line_spacing', 'Not specified')}
+
+ ## Citations
+ {rules_data.get('citations', 'Not specified')}
+
+ ## Section Structure
+ {rules_data.get('sections', 'Not specified')}
+
+ ## Other Requirements
+ {rules_data.get('other_rules', 'Not specified')}
+
+ ## Summary
+ {rules_data.get('summary', 'Not specified')}
+ """
+     return formatted_rules
+
+ def get_rules_from_url(url: str) -> str:
+     """
+     Extracts formatting rules from a given URL using crawl4ai.
+     """
+     logger.info(f"Extracting rules from URL: {url}")
+
+     # Apply nest_asyncio here, when the function is called
+     nest_asyncio.apply()
+
+     # Import crawl4ai modules here to avoid event loop issues at module level
+     from crawl4ai import AsyncWebCrawler, CacheMode, BrowserConfig, CrawlerRunConfig, LLMConfig
+     from crawl4ai.extraction_strategy import LLMExtractionStrategy
+
+     async def _extract_rules_async(url: str) -> str:
+         """
+         Asynchronously extracts formatting rules from a given URL using crawl4ai.
+         """
+         # Configure the browser
+         browser_config = BrowserConfig(verbose=True)
+
+         # Configure the LLM extraction
+         extraction_strategy = LLMExtractionStrategy(
+             llm_config=LLMConfig(
+                 provider=f"{settings.llm_provider}/{settings.llm_model_name}",
+                 api_token=settings.openrouter_api_key
+             ),
+             schema=FormattingRules.schema(),
+             extraction_type="schema",
+             instruction="""
+             From the crawled content, extract all formatting rules for manuscript submissions.
+             Focus on requirements for margins, font, line spacing, citations, section structure,
+             and any other formatting guidelines. Provide a comprehensive extraction of all
+             formatting-related information.
+
+             If a specific requirement is not mentioned in the content, include "Not specified" in the corresponding field.
+             """
+         )
+
+         # Configure the crawler
+         run_config = CrawlerRunConfig(
+             word_count_threshold=10,
+             exclude_external_links=True,
+             process_iframes=True,
+             remove_overlay_elements=True,
+             exclude_social_media_links=True,
+             check_robots_txt=True,
+             semaphore_count=3,
+             extraction_strategy=extraction_strategy
+         )
+
+         # Initialize the crawler and run
+         async with AsyncWebCrawler() as crawler:
+             result = await crawler.arun(
+                 url=url,
+                 config=run_config
+             )
+
+             if result.success and result.extracted_content:
+                 # Format the extracted data into a readable string
+                 if isinstance(result.extracted_content, list) and len(result.extracted_content) > 0:
+                     rules_data = result.extracted_content[0]
+                 elif isinstance(result.extracted_content, dict):
+                     rules_data = result.extracted_content
+                 else:
+                     # If it's a string or other type, use markdown as fallback
+                     return str(result.extracted_content) if result.extracted_content else result.markdown if result.markdown else "Could not extract formatting rules from the provided URL."
+
+                 # Store the raw data for debugging
+                 logger.info(f"Extracted rules data: {json.dumps(rules_data, indent=2)}")
+
+                 # Format the rules for display
+                 formatted_rules = format_rules_for_display(rules_data)
+                 logger.info(f"Formatted rules: {formatted_rules[:100]}...")  # Log for debugging
+                 return formatted_rules
+             elif result.success and result.markdown:
+                 # Fallback to markdown if structured extraction fails
+                 return result.markdown
+             else:
+                 return "Could not extract formatting rules from the provided URL."
+
+     # Create a new event loop and run the async function
+     loop = asyncio.new_event_loop()
+     asyncio.set_event_loop(loop)
+     try:
+         return loop.run_until_complete(_extract_rules_async(url))
+     finally:
+         loop.close()
startup_formatreview.sh ADDED
@@ -0,0 +1,93 @@
+ #!/bin/bash
+ # Startup script for FormatReview
+ # This script starts the Streamlit service and exposes it via Tailscale Serve
+
+ # Exit on error
+ set -e
+
+ # --- Configuration for Streamlit App ---
+ STREAMLIT_APP_FILE="app.py"
+ STREAMLIT_PORT="8504"
+ STREAMLIT_PID_FILE="formatreview.pid"
+ STREAMLIT_LOG_FILE="formatreview.log"
+ # --- End Configuration ---
+
+ # Check if UV is installed
+ if ! command -v uv &> /dev/null; then
+     echo "Error: UV is not installed. Please install UV first."
+     echo "You can install UV with: pip install uv"
+     exit 1
+ fi
+
+ # Create virtual environment if it doesn't exist
+ if [ ! -d ".venv" ]; then
+     echo "Creating virtual environment..."
+     uv venv
+ fi
+
+ # Activate virtual environment
+ echo "Activating virtual environment..."
+ source .venv/bin/activate
+
+ # Install dependencies
+ echo "Installing dependencies..."
+ uv pip install -r requirements.txt
+
+ # Kill any existing instances of the Streamlit app
+ echo "Stopping any existing instances of the Streamlit app..."
+ if [ -f "$STREAMLIT_PID_FILE" ]; then
+     OLD_STREAMLIT_PID=$(cat $STREAMLIT_PID_FILE)
+     if ps -p $OLD_STREAMLIT_PID > /dev/null; then
+         kill $OLD_STREAMLIT_PID
+         echo "Killed existing Streamlit app process with PID $OLD_STREAMLIT_PID"
+         sleep 1 # Give it time to shut down
+     else
+         echo "No running Streamlit app process found with PID $OLD_STREAMLIT_PID"
+     fi
+ fi
+ # Also try to kill any other streamlit processes for this specific app file and port
+ pkill -f "streamlit run $STREAMLIT_APP_FILE --server.port $STREAMLIT_PORT" || true
+ sleep 1
+
+ # Start the Streamlit app
+ echo "Starting Streamlit app on port $STREAMLIT_PORT..."
+ nohup streamlit run $STREAMLIT_APP_FILE --server.port $STREAMLIT_PORT --server.headless true > $STREAMLIT_LOG_FILE 2>&1 &
+ echo $! > $STREAMLIT_PID_FILE
+
+ # Check if the Streamlit service started successfully
+ sleep 3 # Give Streamlit a bit more time to start
+ if ! nc -z localhost $STREAMLIT_PORT; then
+     echo "Error: Failed to start Streamlit app on port $STREAMLIT_PORT."
+     cat $STREAMLIT_LOG_FILE # Output log file for debugging
+     exit 1
+ else
+     echo "Streamlit app started successfully on port $STREAMLIT_PORT."
+ fi
+
+ # Check if Tailscale is installed
+ if ! command -v tailscale &> /dev/null; then
+     echo "Warning: Tailscale is not installed. The app will only be available locally."
+     echo "Install Tailscale to expose the service over your tailnet."
+ else
+     # Expose the service via Tailscale Serve
+     echo "Exposing Streamlit app via Tailscale Serve on port $STREAMLIT_PORT..."
+     echo "Setting up Funnel on port 443..."
+     tailscale funnel --https=443 --bg localhost:$STREAMLIT_PORT
+
+     # Get the Tailscale hostname
+     HOSTNAME=$(tailscale status --json | jq -r '.Self.DNSName')
+     if [ -n "$HOSTNAME" ]; then
+         echo "App may be available at a Tailscale URL. Check 'tailscale status' for details."
+         echo "If using a funnel, it might be https://$HOSTNAME/"
+     else
+         echo "App is exposed via Tailscale Serve, but couldn't determine the primary hostname."
+         echo "Check 'tailscale status' for details."
+     fi
+ fi
+
+ echo "FormatReview is now running!"
+ echo "Local URL: http://localhost:$STREAMLIT_PORT"
+ echo "Log file: $STREAMLIT_LOG_FILE"
+ echo "PID file: $STREAMLIT_PID_FILE"
+ echo ""
+ echo "If Tailscale is active, the app should be accessible via a Tailscale funnel URL."
test_crawl.py ADDED
@@ -0,0 +1,132 @@
+ import asyncio
+ import nest_asyncio
+ import logging
+ import json
+ from pprint import pprint
+ from config import settings
+ from pydantic import BaseModel, Field
+
+ # Configure logging
+ logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s")
+ logger = logging.getLogger("crawl4ai_test")
+
+ class FormattingRules(BaseModel):
+     """Schema for formatting rules extraction"""
+     margins: str = Field(description="Margin requirements for the manuscript")
+     font: str = Field(description="Font requirements including size, type, etc.")
+     line_spacing: str = Field(description="Line spacing requirements")
+     citations: str = Field(description="Citation style and formatting requirements")
+     sections: str = Field(description="Required sections and their structure")
+     other_rules: str = Field(description="Any other formatting requirements")
+     summary: str = Field(description="A brief summary of the key formatting requirements")
+
+ async def test_crawl():
+     """Test crawl4ai functionality"""
+     from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, LLMConfig
+     from crawl4ai.extraction_strategy import LLMExtractionStrategy
+
+     url = "https://journal.code4lib.org/article-guidelines"
+
+     # Configure the browser
+     browser_config = BrowserConfig(verbose=True)
+
+     # Configure the LLM extraction
+     extraction_strategy = LLMExtractionStrategy(
+         llm_config=LLMConfig(
+             provider=f"{settings.llm_provider}/{settings.llm_model_name}",
+             api_token=settings.openrouter_api_key
+         ),
+         schema=FormattingRules.schema(),
+         extraction_type="schema",
+         instruction="""
+         From the crawled content, extract all formatting rules for manuscript submissions.
+         Focus on requirements for margins, font, line spacing, citations, section structure,
+         and any other formatting guidelines. Provide a comprehensive extraction of all
+         formatting-related information.
+         """
+     )
+
+     # Configure the crawler
+     run_config = CrawlerRunConfig(
+         word_count_threshold=10,
+         exclude_external_links=True,
+         process_iframes=True,
+         remove_overlay_elements=True,
+         exclude_social_media_links=True,
+         check_robots_txt=True,
+         semaphore_count=3,
+         extraction_strategy=extraction_strategy
+     )
+
+     # Initialize the crawler and run
+     async with AsyncWebCrawler() as crawler:
+         result = await crawler.arun(
+             url=url,
+             config=run_config
+         )
+
+         # Print all attributes of the result object
+         logger.info(f"Result object type: {type(result)}")
+         logger.info(f"Result object dir: {dir(result)}")
+
+         # Check for success
+         logger.info(f"Success: {result.success}")
+
+         # Check for markdown
+         if hasattr(result, 'markdown'):
+             logger.info(f"Has markdown: {bool(result.markdown)}")
+             logger.info(f"Markdown type: {type(result.markdown)}")
+             logger.info(f"Markdown preview: {str(result.markdown)[:200]}...")
+         else:
+             logger.info("No markdown attribute")
+
+         # Check for extracted_data
+         if hasattr(result, 'extracted_data'):
+             logger.info(f"Has extracted_data: {bool(result.extracted_data)}")
+             logger.info(f"Extracted data: {result.extracted_data}")
+         else:
+             logger.info("No extracted_data attribute")
+
+         # Check for other potential attributes
+         for attr in ['data', 'extraction', 'llm_extraction', 'content', 'text', 'extracted_content']:
+             if hasattr(result, attr):
+                 logger.info(f"Has {attr}: {bool(getattr(result, attr))}")
+                 logger.info(f"{attr} preview: {str(getattr(result, attr))[:200]}...")
+
+         # Try to access _results directly
+         if hasattr(result, '_results'):
+             logger.info(f"Has _results: {bool(result._results)}")
+             if result._results:
+                 first_result = result._results[0]
+                 logger.info(f"First result type: {type(first_result)}")
+                 logger.info(f"First result dir: {dir(first_result)}")
+
+                 # Check if first result has extracted_data
+                 if hasattr(first_result, 'extracted_data'):
+                     logger.info(f"First result has extracted_data: {bool(first_result.extracted_data)}")
+                     logger.info(f"First result extracted_data: {first_result.extracted_data}")
+
+                 # Check for other attributes in first result
+                 for attr in ['data', 'extraction', 'llm_extraction', 'content', 'text', 'extracted_content']:
+                     if hasattr(first_result, attr):
+                         logger.info(f"First result has {attr}: {bool(getattr(first_result, attr))}")
+                         logger.info(f"First result {attr} preview: {str(getattr(first_result, attr))[:200]}...")
+
+         return result
+
+ def main():
+     """Main function"""
+     # Apply nest_asyncio
+     nest_asyncio.apply()
+
+     # Create a new event loop and run the async function
+     loop = asyncio.new_event_loop()
+     asyncio.set_event_loop(loop)
+     try:
+         result = loop.run_until_complete(test_crawl())
+         logger.info("Test completed successfully")
+     finally:
+         loop.close()
+
+ if __name__ == "__main__":
+     main()