methunraj committed
Commit 90b0a17 · 1 Parent(s): 6985bd2

refactor: restructure project with modular prompts and instructions
.claude/settings.local.json DELETED
@@ -1,19 +0,0 @@
- {
-   "permissions": {
-     "allow": [
-       "Bash(mkdir:*)",
-       "Bash(python test:*)",
-       "Bash(/usr/local/bin/python3:*)",
-       "Bash(ls:*)",
-       "Bash(rm:*)",
-       "Bash(python:*)",
-       "Bash(find:*)",
-       "mcp__zen__analyze",
-       "Bash(pkill:*)",
-       "Bash(touch:*)",
-       "Bash(docker build:*)",
-       "Bash(/dev/null)"
-     ],
-     "deny": []
-   }
- }

README.md CHANGED
@@ -1,162 +1,266 @@
- ---
- title: Agno Document Analysis
- emoji: 📄
- colorFrom: blue
- colorTo: purple
- sdk: docker
- pinned: false
- license: mit
- ---
-
- # Agno Document Analysis Workflow
-
- A sophisticated document processing application built with Agno v1.7.4 featuring a multi-agent workflow for intelligent document analysis and data extraction.
-
- ## Features
-
- - **3-Agent Workflow**: Data Extractor, Data Arranger, Code Generator
- - **Multi-format Support**: PDF, TXT, PNG, JPG, JPEG, DOCX, XLSX, CSV, MD, JSON, XML, HTML, PY, JS, TS, DOC, XLS, PPT, PPTX
- - **Real-time Processing**: Streaming interface with live updates
- - **Sandboxed Execution**: Safe code execution environment
- - **Beautiful UI**: Modern Gradio interface with custom animations
-
- ## Quick Start
-
- ### Automated Installation
-
- ```bash
- # Clone the repository
- git clone <repository-url>
- cd Data_Extractor
-
- # Quick installation (recommended)
- ./install.sh
-
- # Or use Python setup script
- python setup.py
  ```
-
- ### Manual Installation
-
- ```bash
- # Create virtual environment
- python -m venv data_extractor_env
- source data_extractor_env/bin/activate  # On Windows: data_extractor_env\Scripts\activate
-
- # Install dependencies
- pip install -r requirements.txt
-
- # Create environment file
- cp .env.example .env  # Update with your API keys
-
- # Run the application
- python app.py
  ```
-
- ## Installation Options
-
- ### Requirements Files
-
- - **`requirements-minimal.txt`**: Essential dependencies only (~50 packages)
-   ```bash
-   pip install -r requirements-minimal.txt
-   ```
-
- - **`requirements.txt`**: Complete feature set (~200+ packages)
-   ```bash
-   pip install -r requirements.txt
-   ```
-
- - **`requirements-dev.txt`**: Development dependencies with testing tools
-   ```bash
-   pip install -r requirements-dev.txt
-   ```
-
- ### System Dependencies
-
- Some features require system-level dependencies:
-
- **macOS:**
- ```bash
- brew install tesseract imagemagick poppler
  ```
-
- **Ubuntu/Debian:**
  ```bash
- sudo apt-get install tesseract-ocr libmagickwand-dev poppler-utils
  ```
-
- **Windows:**
  ```bash
- choco install tesseract imagemagick poppler
  ```
-
- ## Usage
-
- 1. **Setup Environment**: Follow installation instructions above
- 2. **Configure API Keys**: Update `.env` file with your API keys
- 3. **Upload Document**: Support for 20+ file formats
- 4. **Select Analysis**: Choose from predefined types or custom prompts
- 5. **Process**: Watch the multi-agent workflow in real-time
- 6. **Download Results**: Get structured data and generated Excel reports
-
- ## Environment Variables
-
- Create a `.env` file with the following variables:
-
- ```bash
- # Required API Keys
- GOOGLE_API_KEY=your_google_api_key_here
- OPENAI_API_KEY=your_openai_api_key_here  # Optional
-
- # Application Settings
- DEBUG=False
- LOG_LEVEL=INFO
- SESSION_TIMEOUT=3600
-
- # File Processing
- MAX_FILE_SIZE=50MB
- SUPPORTED_FORMATS=pdf,docx,xlsx,txt
-
- # Database (Optional)
- DATABASE_URL=sqlite:///data_extractor.db
- ```
-
- ## Advanced Features
-
- ### Financial Document Processing
- - Comprehensive financial data extraction
- - 13-category data organization
- - Excel report generation with charts
- - XBRL and SEC filing support
-
- ### OCR and Image Processing
- - EasyOCR and PaddleOCR integration
- - Tesseract OCR support
- - Advanced image preprocessing
-
- ### Machine Learning Integration
- - TensorFlow and PyTorch support
- - Scikit-learn for data analysis
- - XGBoost and LightGBM for predictions
-
- ## Troubleshooting
-
- For detailed troubleshooting and installation issues, see:
- - [`INSTALLATION.md`](INSTALLATION.md) - Comprehensive installation guide
- - [`FIXES_SUMMARY.md`](FIXES_SUMMARY.md) - Known issues and solutions
-
- ### Common Issues
-
- 1. **Import Errors**: Try minimal installation first
- 2. **OCR Issues**: Install system dependencies
- 3. **Memory Issues**: Use smaller batch sizes
- 4. **API Errors**: Verify API keys in `.env` file
-
- ## Docker Support
-
- ```dockerfile
- # Build and run with Docker
- docker build -t data-extractor .
- docker run -p 7860:7860 --env-file .env data-extractor
  ```

+ # 📊 Financial Data Extractor Using Gemini
+
+ A powerful AI-driven financial document analysis system that automatically extracts, organizes, and generates professional Excel reports from financial documents using Google's Gemini AI models.
+
+ ## 🚀 Features
+
+ ### Core Functionality
+
+ - **📄 Multi-format Document Support**: PDF, DOCX, TXT, and image files
+ - **🔍 Intelligent Data Extraction**: AI-powered extraction of financial data points
+ - **📊 Smart Data Organization**: Automatic categorization into 12+ financial categories
+ - **💻 Excel Report Generation**: Professional multi-worksheet Excel reports with charts
+ - **🎯 Real-time Processing**: Live streaming interface with progress tracking
+
+ ### Advanced Capabilities
+
+ - **🤖 Multi-Agent Workflow**: Specialized AI agents for extraction, arrangement, and code generation
+ - **💾 Session Management**: Persistent storage with SQLite caching
+ - **🔄 Auto-shutdown**: Intelligent resource management for cloud deployments
+ - **📱 Modern UI**: Beautiful Gradio-based web interface
+ - **🌐 Cross-platform**: Works on Windows, Mac, and Linux
+ - **🐳 Docker Support**: Containerized deployment ready
+
+ ## 🏗️ Architecture
+
+ The system uses a sophisticated multi-agent workflow powered by the Agno framework:
+
+ ```
+ 📄 Document Upload
+        ↓
+ 🔍 Data Extractor Agent
+        ↓ (Structured Financial Data)
+ 📊 Data Arranger Agent
+        ↓ (Organized Categories)
+ 💻 Code Generator Agent
+        ↓ (Python Excel Code)
+ 📊 Excel Report Output
  ```
+
+ ### Agent Specialization
+
+ - **Data Extractor**: Extracts financial data points with confidence scoring
+ - **Data Arranger**: Organizes data into 12+ professional categories
+ - **Code Generator**: Creates Python code for Excel report generation
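
For orientation, here is a minimal sketch of that three-step chain as a plain Python driver. The `FinancialDocumentWorkflow` class, the agent attributes, and the `File(filepath=...)` wrapper all appear in the `app.py` changes further down this commit; the import paths are assumptions about this repo's layout and its agno version.

```python
# Hedged sketch of the pipeline above; import paths are assumptions,
# but the attribute and method names match the calls made in app.py.
from agno.media import File  # assumed import path for the File wrapper
from workflow.financial_workflow import FinancialDocumentWorkflow

workflow = FinancialDocumentWorkflow()

# Step 1 - extraction: hand the document to the extractor agent
extraction = workflow.data_extractor.run(
    "Extract all financial data points from this document.",
    files=[File(filepath="annual_report.pdf")],  # hypothetical input file
)

# Step 2 - arrangement: raw data points become worksheet-ready categories
arrangement = workflow.data_arranger.run(
    f"Organize this extracted data: {extraction.content}"
)

# Step 3 - code generation: the agent writes and runs openpyxl code
report = workflow.code_generator.run(
    "Turn arranged_financial_data.json into a formatted Excel workbook."
)
print(report.content)
```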
 
+
+ ## 📋 Requirements
+
+ ### System Requirements
+
+ - Python 3.8+
+ - Google API Key (for Gemini models)
+ - 2GB+ RAM recommended
+ - Cross-platform compatible
+
+ ### Dependencies
+
+ ```
+ agno>=1.7.4
+ gradio
+ google-generativeai
+ PyPDF2
+ Pillow
+ python-dotenv
+ pandas
+ matplotlib
+ openpyxl
+ python-docx
+ lxml
+ markdown
+ requests
+ seaborn
+ sqlalchemy
+ websockets
  ```
+
+ ## 🚀 Quick Start
+
+ ### 1. Clone the Repository
+
+ ```bash
+ git clone <repository-url>
+ cd Data_Extractor_Using_Gemini
+ ```
+
+ ### 2. Install Dependencies
+
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ ### 3. Configure Environment
+
+ Create a `.env` file:
+
+ ```env
+ GOOGLE_API_KEY=your_gemini_api_key_here
  ```
+
+ ### 4. Run the Application
+
  ```bash
+ python app.py
  ```
+
+ The application will be available at `http://localhost:7860`.
+
+ ## 🐳 Docker Deployment
+
+ ### Build and Run
+
  ```bash
+ docker build -t financial-extractor .
+ docker run -p 7860:7860 --env-file .env financial-extractor
  ```
+
+ ### Environment Variables
+
+ - `GOOGLE_API_KEY`: Your Google Gemini API key
+ - `INACTIVITY_TIMEOUT_MINUTES`: Auto-shutdown timeout in minutes (default: 30)
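
Both settings are read from the process environment (the project ships `python-dotenv`, and `config/settings.py` calls `load_dotenv()`). A sketch of the typical read, where the exact parsing in `app.py` may differ:

```python
import os

from dotenv import load_dotenv

load_dotenv()  # pull variables from .env into the environment

GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")  # required; no default
# Auto-shutdown timeout; the 30-minute default matches the README above.
INACTIVITY_TIMEOUT_MINUTES = int(os.getenv("INACTIVITY_TIMEOUT_MINUTES", "30"))
```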
 
 
 
 
+
+ ## 📖 Usage Guide
+
+ ### 1. Upload Document
+
+ - Drag and drop or select your financial document
+ - Supported formats: PDF, DOCX, TXT, PNG, JPG, JPEG
+
+ ### 2. Select Processing Mode
+
+ - **Quick Analysis**: Standard extraction and organization
+ - **Custom Prompts**: Use predefined prompt templates for specific document types
+
+ ### 3. Monitor Progress
+
+ - Real-time streaming interface shows each processing step
+ - Progress indicators for all workflow stages
+ - Live terminal output for code execution
+
+ ### 4. Download Results
+
+ - Professional Excel report with multiple worksheets
+ - Organized data categories with charts and formatting
+ - All intermediate files available for download
+
+ ## 📊 Output Structure
+
+ The generated Excel reports include:
+
+ ### Worksheets
+
+ - **Summary**: Executive overview with key metrics
+ - **Revenue**: Income and revenue streams
+ - **Expenses**: Operating and non-operating expenses
+ - **Assets**: Current and non-current assets
+ - **Liabilities**: Short-term and long-term liabilities
+ - **Equity**: Shareholder equity components
+ - **Cash Flow**: Cash flow statements
+ - **Ratios**: Financial ratio analysis
+ - **Charts**: Visual representations of key data
+ - **Raw Data**: Original extracted data points
+
+ ### Features
+
+ - Professional formatting with consistent styling
+ - Interactive charts and visualizations
+ - Dynamic period handling (auto-detects years/quarters)
+ - Cross-referenced data validation
+ - Print-ready layouts
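
The formatting features listed above map onto plain openpyxl calls. A small, self-contained sketch in that spirit: the Calibri header font, the `#4472C4` blue fill, and the freeze pane follow the standards spelled out in `instructions/agents/code_generator.json`, while the figures are made up.

```python
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill

wb = Workbook()
ws = wb.active
ws.title = "Summary"

# Header row with the bold white-on-blue styling used by the reports
ws.append(["Metric", "FY2023", "FY2024"])
for cell in ws[1]:
    cell.font = Font(name="Calibri", size=14, bold=True, color="FFFFFF")
    cell.fill = PatternFill("solid", fgColor="4472C4")

# Thousands separators for large monetary values (illustrative numbers)
ws.append(["Revenue", 1_200_000, 1_450_000])
for cell in ws[2][1:]:
    cell.number_format = "#,##0.00"

ws.freeze_panes = "A2"  # keep the header row visible while scrolling
wb.save("Financial_Report_demo.xlsx")
```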
+
+ ## 🔧 Configuration
+
+ ### Model Settings
+
+ Configure AI models in `config/settings.py`:
+
+ - Data Extractor Model
+ - Data Arranger Model
+ - Code Generator Model
+ - Thinking budgets and retry settings
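
The field names below come from the validation code added to `config/settings.py` in this commit; the environment-variable indirection and the default model IDs are illustrative assumptions.

```python
import os

class Settings:
    GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
    # Model fields checked by Settings.validate_config(); defaults are hypothetical.
    DATA_EXTRACTOR_MODEL = os.getenv("DATA_EXTRACTOR_MODEL", "gemini-2.0-flash")
    DATA_ARRANGER_MODEL = os.getenv("DATA_ARRANGER_MODEL", "gemini-2.0-flash")
    CODE_GENERATOR_MODEL = os.getenv("CODE_GENERATOR_MODEL", "gemini-2.0-flash")

settings = Settings()
```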
+
+ ### Prompt Customization
+
+ Customize agent instructions in `instructions/agents/`:
+
+ - `data_extractor.json`: Data extraction prompts
+ - `data_arranger.json`: Data organization prompts
+ - `code_generator.json`: Excel generation prompts
+
+ ### Workflow Configuration
+
+ Modify workflow behavior in `workflow/financial_workflow.py`:
+
+ - Agent configurations
+ - Tool assignments
+ - Output formats
+
197
+ ## 🛠️ Development
198
+
199
+ ### Project Structure
200
+
201
+ ```
202
+ ├── app.py # Main Gradio application
203
+ ├── workflow/ # Core workflow implementation
204
+ ├── instructions/ # Agent instruction templates
205
+ ├── prompts/ # Prompt gallery configurations
206
+ ├── config/ # Application settings
207
+ ├── utils/ # Utility functions
208
+ ├── static/ # Static assets
209
+ ├── models/ # Data models
210
+ └── terminal_stream.py # Real-time terminal streaming
211
+ ```
212
+
213
+ ### Key Components
214
+
215
+ - **WorkflowUI**: Main interface controller
216
+ - **FinancialDocumentWorkflow**: Core processing pipeline
217
+ - **AutoShutdownManager**: Resource management
218
+ - **TerminalLogHandler**: Real-time logging
219
+ - **PromptGallery**: Template management
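
How these components fit together at runtime, reconstructed from the `app.py` changes later in this commit (the names are real; the condensed flow is illustrative and not a standalone script):

```python
ui = WorkflowUI()                       # one instance per browser session
ui.workflow.file_path = "/tmp/doc.pdf"  # FinancialDocumentWorkflow under the hood
response = ui.workflow.run_workflow()   # extraction -> arrangement -> code generation
print(response.content)                 # markdown summary rendered in the UI

ui.workflow.clear_cache()               # invoked by the Reset Session button
shutdown_manager.request_shutdown()     # invoked by the Stop Backend button
```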
+
+ ## 🔒 Security & Privacy
+
+ - **Local Processing**: All document processing happens locally
+ - **No Data Storage**: Documents are processed and cleaned up automatically
+ - **API Key Security**: Environment-based configuration
+ - **Session Isolation**: Each session has isolated temporary directories
+
+ ## 🌐 Deployment Options
+
+ ### Local Development
+
+ ```bash
+ python app.py
+ ```
+
+ ### Production (Gunicorn)
+
+ ```bash
+ gunicorn -w 4 -b 0.0.0.0:7860 app:app
  ```
+
+ ### Cloud Platforms
+
+ - **Hugging Face Spaces**: Ready for deployment
+ - **Google Cloud Run**: Containerized deployment
+ - **AWS/Azure**: Standard container deployment
+
+ ## 🤝 Contributing
+
+ 1. Fork the repository
+ 2. Create a feature branch
+ 3. Make your changes
+ 4. Add tests if applicable
+ 5. Submit a pull request
+
+ ## 📝 License
+
+ This project is licensed under the MIT License - see the LICENSE file for details.
+
+ ## 🆘 Support
+
+ ### Common Issues
+
+ - **API Key Errors**: Ensure your Google API key is valid and has Gemini access
+ - **Memory Issues**: Increase system RAM or reduce document size
+ - **Processing Timeouts**: Check network connectivity and API quotas
TERMINAL_README.md DELETED
@@ -1,230 +0,0 @@
- # 🚀 Manus AI-Style Terminal Integration
-
- This document explains the real-time terminal streaming functionality added to the Data Extractor application.
-
- ## 📋 Overview
-
- The terminal integration provides a **Manus AI-style terminal interface** with real-time command execution and streaming output, seamlessly integrated into the existing Gradio application.
-
- ## 🏗️ Architecture
-
- ### Components
-
- 1. **WebSocket Server** (`terminal_stream.py`)
-    - Handles real-time communication between frontend and backend
-    - Manages command execution with streaming output
-    - Supports multiple concurrent connections
-    - Runs on port 8765
-
- 2. **Frontend Terminal** (`static/terminal.html`)
-    - Beautiful Manus AI-inspired terminal interface
-    - Real-time output streaming via WebSocket
-    - Command history navigation
-    - Keyboard shortcuts and controls
-
- 3. **Gradio Integration** (Modified `app.py`)
-    - Added terminal tab to existing interface
-    - Embedded terminal as iframe component
-    - Auto-starts WebSocket server on application launch
-
- ## 🎨 Features
-
- ### Terminal Interface
- - **Real-time Streaming**: Live command output as it happens
- - **Command History**: Navigate with ↑/↓ arrow keys
- - **Interrupt Support**: Ctrl+C to stop running commands
- - **Auto-reconnect**: Automatically reconnects on connection loss
- - **Status Indicators**: Visual connection and execution status
- - **Responsive Design**: Works on desktop and mobile
-
- ### Security
- - **Command Sanitization**: Uses `shlex.split()` for safe command parsing
- - **Process Isolation**: Commands run in separate processes
- - **Error Handling**: Robust error handling and logging
-
- ## 🚀 Usage
-
- ### Starting the Application
- ```bash
- python app.py
- ```
-
- The terminal WebSocket server automatically starts on port 8765.
-
- ### Accessing the Terminal
- 1. Open the Gradio interface (usually http://localhost:7860)
- 2. Click on the "💻 Terminal" tab
- 3. Start typing commands in the terminal interface
-
- ### Keyboard Shortcuts
- - **Enter**: Execute command
- - **↑/↓**: Navigate command history
- - **Ctrl+C**: Interrupt running command
- - **Ctrl+L**: Clear terminal screen
- - **Tab**: Command completion (planned feature)
-
- ## 🔧 Configuration
-
- ### WebSocket Server Settings
- ```python
- # In terminal_stream.py
- WEBSOCKET_HOST = 'localhost'
- WEBSOCKET_PORT = 8765
- ```
-
- ### Terminal Appearance
- Customize the terminal appearance by modifying the CSS in `static/terminal.html`:
-
- ```css
- /* Main terminal colors */
- .terminal-container {
-     background: linear-gradient(135deg, #0d1117 0%, #161b22 100%);
- }
-
- /* Command prompt */
- .prompt {
-     color: #58a6ff;
- }
- ```
-
- ## 📡 WebSocket API
-
- ### Client → Server Messages
-
- #### Execute Command
- ```json
- {
-     "type": "command",
-     "command": "ls -la"
- }
- ```
-
- #### Interrupt Command
- ```json
- {
-     "type": "interrupt"
- }
- ```
-
- ### Server → Client Messages
-
- #### Command Output
- ```json
- {
-     "type": "output",
-     "data": "file1.txt\nfile2.txt",
-     "stream": "stdout",
-     "timestamp": "2024-01-01T12:00:00.000Z"
- }
- ```
-
- #### Command Completion
- ```json
- {
-     "type": "command_complete",
-     "exit_code": 0,
-     "message": "Process exited with code 0",
-     "timestamp": "2024-01-01T12:00:00.000Z"
- }
- ```
-
- ## 🛠️ Development
-
- ### Adding New Features
-
- 1. **Server-side**: Modify `terminal_stream.py`
- 2. **Client-side**: Update `static/terminal.html`
- 3. **Integration**: Adjust `app.py` if needed
-
- ### Testing
-
- ```bash
- # Test WebSocket server independently
- python -c "from terminal_stream import run_websocket_server; run_websocket_server()"
-
- # Test terminal interface
- # Open static/terminal.html in browser
- ```
-
- ## 🔍 Troubleshooting
-
- ### Common Issues
-
- 1. **WebSocket Connection Failed**
-    - Check if port 8765 is available
-    - Verify firewall settings
-    - Check server logs for errors
-
- 2. **Commands Not Executing**
-    - Verify WebSocket connection status
-    - Check terminal logs for errors
-    - Ensure proper command syntax
-
- 3. **Terminal Not Loading**
-    - Check if `static/terminal.html` exists
-    - Verify Gradio file serving configuration
-    - Check browser console for errors
-
- ### Debug Mode
-
- Enable debug logging:
- ```python
- import logging
- logging.getLogger('terminal_stream').setLevel(logging.DEBUG)
- ```
-
- ## 🚀 Advanced Usage
-
- ### Custom Commands
-
- Add custom command handlers in `terminal_stream.py`:
-
- ```python
- async def handle_custom_command(self, command):
-     if command.startswith('custom:'):
-         # Handle custom command
-         await self.broadcast({
-             'type': 'output',
-             'data': 'Custom command executed',
-             'stream': 'stdout'
-         })
-         return True
-     return False
- ```
-
- ### Integration with Workflow
-
- Stream workflow logs to terminal:
-
- ```python
- # In workflow code
- from terminal_stream import terminal_manager
-
- async def log_to_terminal(message):
-     await terminal_manager.broadcast({
-         'type': 'output',
-         'data': message,
-         'stream': 'workflow'
-     })
- ```
-
- ## 📚 Dependencies
-
- - `websockets`: WebSocket server implementation
- - `asyncio`: Async programming support
- - `subprocess`: Command execution
- - `shlex`: Safe command parsing
-
- ## 🎯 Future Enhancements
-
- - [ ] Command auto-completion
- - [ ] File upload/download via terminal
- - [ ] Terminal themes and customization
- - [ ] Multi-session support
- - [ ] Terminal recording and playback
- - [ ] Integration with workflow logging
- - [ ] SSH/remote terminal support
-
- ## 📄 License
-
- This terminal implementation is part of the Data Extractor project and follows the same license terms.

app.py CHANGED
@@ -66,6 +66,7 @@ class AutoShutdownManager:
          self.shutdown_timer = None
          self.app_instance = None
          self.is_shutting_down = False
+         self.manual_shutdown_requested = False
 
          # Setup signal handlers for graceful shutdown
          signal.signal(signal.SIGINT, self._signal_handler)
@@ -76,6 +77,12 @@ class AutoShutdownManager:
 
          logger.info(f"AutoShutdownManager initialized with {timeout_minutes} minute timeout")
 
+     def request_shutdown(self):
+         """Request manual shutdown of the application."""
+         logger.info("Manual shutdown requested")
+         self.manual_shutdown_requested = True
+         self._shutdown_server()
+
      def _signal_handler(self, signum, frame):
          """Handle shutdown signals gracefully."""
          logger.info(f"Received signal {signum}, initiating graceful shutdown...")
@@ -131,11 +138,42 @@ class AutoShutdownManager:
              if self.shutdown_timer:
                  self.shutdown_timer.cancel()
 
+             # Attempt graceful shutdown of components
+             try:
+                 # Stop terminal WebSocket server
+                 if hasattr(terminal_manager, 'stop_server'):
+                     terminal_manager.stop_server()
+                     logger.info("Terminal WebSocket server stopped")
+             except Exception as e:
+                 logger.warning(f"Error stopping terminal server: {e}")
+
              if self.app_instance:
-                 # Gradio doesn't have a direct shutdown method, so we'll use os._exit
-                 logger.info("Shutting down Gradio application")
+                 try:
+                     # Try to close Gradio server gracefully
+                     if hasattr(self.app_instance, 'close'):
+                         self.app_instance.close()
+                         logger.info("Gradio application closed gracefully")
+                     elif hasattr(self.app_instance, 'server'):
+                         if hasattr(self.app_instance.server, 'close'):
+                             self.app_instance.server.close()
+                             logger.info("Gradio server closed")
+                 except Exception as e:
+                     logger.warning(f"Could not close Gradio gracefully: {e}")
+
+             # Give a moment for graceful shutdown
+             import time
+             time.sleep(1)
+
+             # If manual shutdown or graceful methods failed, exit
+             if self.manual_shutdown_requested:
+                 logger.info("Forcing application exit due to manual shutdown request")
                  import os
                  os._exit(0)
+             else:
+                 logger.info("Application shutdown complete")
+                 import sys
+                 sys.exit(0)
+
          except Exception as e:
              logger.error(f"Error during shutdown: {e}")
              import os
@@ -148,10 +186,12 @@ shutdown_manager = AutoShutdownManager()
  class TerminalLogHandler(logging.Handler):
      """Custom logging handler that captures logs for terminal display."""
 
-     def __init__(self):
+     def __init__(self, max_global_logs=1000, max_session_logs=500):
          super().__init__()
-         self.logs = deque(maxlen=1000)  # Keep last 1000 log entries
+         self.logs = deque(maxlen=max_global_logs)  # Keep last N log entries
          self.session_logs = {}  # Per-session logs
+         self.max_session_logs = max_session_logs
+         self.cleanup_counter = 0
 
      def emit(self, record):
          """Emit a log record."""
@@ -183,8 +223,13 @@ class TerminalLogHandler(logging.Handler):
              session_id = getattr(record, 'session_id', None)
              if session_id:
                  if session_id not in self.session_logs:
-                     self.session_logs[session_id] = deque(maxlen=500)
+                     self.session_logs[session_id] = deque(maxlen=self.max_session_logs)
                  self.session_logs[session_id].append(log_entry)
+
+                 # Periodic cleanup of old sessions
+                 self.cleanup_counter += 1
+                 if self.cleanup_counter % 100 == 0:  # Every 100 log entries
+                     self.cleanup_old_sessions()
 
          except Exception as e:
              # Prevent logging errors from breaking the application
@@ -221,6 +266,60 @@ class TerminalLogHandler(logging.Handler):
              ''')
 
          return ''.join(html_lines)
+
+     def cleanup_old_sessions(self, max_sessions=10):
+         """Clean up old session logs to prevent memory buildup."""
+         if len(self.session_logs) > max_sessions:
+             # Keep only the most recent sessions
+             sessions_by_activity = []
+             current_time = datetime.now()
+
+             for session_id, logs in self.session_logs.items():
+                 if logs:
+                     # Get the timestamp of the last log entry
+                     last_log_time = logs[-1].get('timestamp', '00:00:00')
+                     try:
+                         # Convert to datetime for comparison (assume today)
+                         log_time = datetime.strptime(last_log_time, '%H:%M:%S').replace(
+                             year=current_time.year,
+                             month=current_time.month,
+                             day=current_time.day
+                         )
+                         sessions_by_activity.append((session_id, log_time))
+                     except:
+                         # If parsing fails, assume it's old
+                         sessions_by_activity.append((session_id, current_time - timedelta(hours=24)))
+                 else:
+                     # Empty logs are old
+                     sessions_by_activity.append((session_id, current_time - timedelta(hours=24)))
+
+             # Sort by activity time (most recent first)
+             sessions_by_activity.sort(key=lambda x: x[1], reverse=True)
+
+             # Keep only the most recent sessions
+             sessions_to_keep = set(session_id for session_id, _ in sessions_by_activity[:max_sessions])
+
+             # Remove old sessions
+             removed_count = 0
+             for session_id in list(self.session_logs.keys()):
+                 if session_id not in sessions_to_keep:
+                     del self.session_logs[session_id]
+                     removed_count += 1
+
+             if removed_count > 0:
+                 print(f"Cleaned up {removed_count} old session logs")
+
+     def get_memory_usage(self):
+         """Get memory usage statistics for the log handler."""
+         total_logs = len(self.logs)
+         total_session_logs = sum(len(logs) for logs in self.session_logs.values())
+
+         return {
+             'global_logs': total_logs,
+             'session_count': len(self.session_logs),
+             'total_session_logs': total_session_logs,
+             'total_logs': total_logs + total_session_logs
+         }
 
  # Global terminal log handler
  terminal_log_handler = TerminalLogHandler()
@@ -1739,241 +1838,103 @@ def create_gradio_app():
 
                  time.sleep(1)  # Brief pause for UI update
 
-                 # Step 1: Data Extraction
+                 # Run the complete workflow - it handles all steps internally
                  logger.info("=" * 60)
-                 logger.info("🔍 STEP 1/4: DATA EXTRACTION PHASE")
+                 logger.info("🚀 STARTING FINANCIAL WORKFLOW")
                  logger.info("=" * 60)
-                 logger.info("📋 Initializing financial data extraction agent...")
+                 progress_html = "🚀 <strong>Running complete financial analysis workflow...</strong>"
+                 yield (progress_html, create_step_html("extraction"), "", gr.Column(visible=False), session_state)
+
+                 logger.info(f"📄 Processing document: {temp_path}")
+                 logger.info("🔧 Workflow will handle: extraction → arrangement → code generation → execution")
+
+                 # Execute workflow with step-by-step UI updates
+
+                 # Step 1: Data Extraction
                  progress_html = "🔍 <strong>Step 1/4: Extracting financial data from document...</strong>"
                  yield (progress_html, create_step_html("extraction"), "", gr.Column(visible=False), session_state)
 
-                 # Check for cached extraction
-                 if "extracted_data" in ui.workflow.session_state:
-                     logger.info("💾 Using cached extraction data from previous run")
-                     logger.info("⏩ Skipping extraction step - data already available")
-                     time.sleep(0.5)  # Brief pause to show step
-                 else:
-                     logger.info(f"🔄 Starting fresh data extraction from document: {temp_path}")
-                     logger.info("📄 Creating document object for analysis...")
-                     # Perform data extraction
-                     document = File(filepath=temp_path)
-                     logger.info("✅ Document object created successfully")
-
-                     extraction_prompt = f"""
-                     Analyze this financial document and extract all relevant financial data points.
-
-                     Focus on:
-                     - Company identification and reporting period
-                     - Revenue, expenses, profits, and losses
-                     - Assets, liabilities, and equity
-                     - Cash flows and financial ratios
-                     - Any other key financial metrics
-
-                     Document path: {temp_path}
-                     """
-
-                     logger.info("🤖 Calling data extractor agent with financial analysis prompt")
-                     logger.info("⏳ This may take 30-60 seconds depending on document complexity...")
-
-                     extraction_response = ui.workflow.data_extractor.run(
-                         extraction_prompt,
-                         files=[document]
-                     )
-                     extracted_data = extraction_response.content
-
-                     logger.info("🎉 Data extraction agent completed successfully!")
-                     logger.info(f"📊 Extracted {len(extracted_data.data_points)} financial data points")
-
-                     # Cache the result
-                     ui.workflow.session_state["extracted_data"] = extracted_data.model_dump()
-                     logger.info(f"💾 Cached extraction results for session {ui.session_id}")
-                     logger.info("✅ Step 1 COMPLETED - Data extraction successful")
-
-                 # Step 2: Data Arrangement
-                 logger.info("=" * 60)
-                 logger.info("📊 STEP 2/4: DATA ORGANIZATION PHASE")
-                 logger.info("=" * 60)
-                 progress_html = "📊 <strong>Step 2/4: Organizing and analyzing financial data...</strong>"
-                 yield (progress_html, create_step_html("arrangement"), "", gr.Column(visible=False), session_state)
-
-                 if "arrangement_response" in ui.workflow.session_state:
-                     logger.info("💾 Using cached data arrangement from previous run")
-                     logger.info("⏩ Skipping organization step - data already structured")
-                     time.sleep(0.5)  # Brief pause to show step
-                 else:
-                     logger.info("🔄 Starting fresh data organization and analysis")
-                     # Get extracted data for arrangement
-                     extracted_data_dict = ui.workflow.session_state["extracted_data"]
-                     logger.info(f"📋 Retrieved {len(extracted_data_dict.get('data_points', []))} data points for organization")
-                     logger.info("🏗️ Preparing to organize data into 12 financial categories...")
-
-                     arrangement_prompt = f"""
-                     You are given raw, extracted financial data. Your task is to reorganize it and prepare it for Excel-based reporting.
-
-                     ========== WHAT TO DELIVER ==========
-                     • A single JSON object saved as arranged_financial_data.json
-                     • Fields required: categories, key_metrics, insights, summary
-
-                     ========== HOW TO ORGANIZE ==========
-                     Create 12 distinct, Excel-ready categories (one worksheet each):
-                     1. Executive Summary & Key Metrics
-                     2. Income Statement / P&L
-                     3. Balance Sheet – Assets
-                     4. Balance Sheet – Liabilities & Equity
-                     5. Cash-Flow Statement
-                     6. Financial Ratios & Analysis
-                     7. Revenue Analysis
-                     8. Expense Analysis
-                     9. Profitability Analysis
-                     10. Liquidity & Solvency
-                     11. Operational Metrics
-                     12. Risk Assessment & Notes
-
-                     ========== STEP-BY-STEP ==========
-                     1. Map every data point into the most appropriate category above.
-                     2. Calculate or aggregate key financial metrics where possible.
-                     3. Add concise insights for trends, anomalies, or red flags.
-                     4. Write an executive summary that highlights the most important findings.
-                     5. Assemble everything into the JSON schema described under "WHAT TO DELIVER."
-                     6. Save the JSON as arranged_financial_data.json via save_file.
-                     7. Use list_files to confirm the file exists, then read_file to validate its content.
-                     8. If the file is missing or malformed, fix the issue and repeat steps 6 – 7.
-                     9. Only report success after the file passes both existence and content checks.
-                     10. Conclude with a short, plain-language summary of what was organized.
-
-                     Extracted Data: {json.dumps(extracted_data_dict, indent=2)}
-                     """
-
-                     logger.info("Calling data arranger to organize financial data into 12 categories")
-                     arrangement_response = ui.workflow.data_arranger.run(arrangement_prompt)
-                     arrangement_content = arrangement_response.content
-
-                     # Cache the result
-                     ui.workflow.session_state["arrangement_response"] = arrangement_content
-                     logger.info("Data organization completed successfully - financial data categorized")
-                     logger.info(f"Cached arrangement results for session {ui.session_id}")
-
-                 # Step 3: Code Generation
-                 logger.info("Step 3: Starting code generation...")
-                 progress_html = "💻 <strong>Step 3/4: Generating Python code for Excel reports...</strong>"
-                 yield (progress_html, create_step_html("code_generation"), "", gr.Column(visible=False), session_state)
-
-                 if "code_generation_response" in ui.workflow.session_state:
-                     logger.info("Using cached code generation results from previous run")
-                     code_generation_content = ui.workflow.session_state["code_generation_response"]
-                     execution_success = ui.workflow.session_state.get("execution_success", False)
-                     logger.info(f"Previous execution status: {'Success' if execution_success else 'Failed'}")
-                     time.sleep(0.5)  # Brief pause to show step
-                 else:
-                     logger.info("Starting fresh Python code generation for Excel report creation")
-                     code_prompt = f"""
-                     Your objective: Turn the organized JSON data into a polished, multi-sheet Excel report—and prove that it works.
-
-                     ========== INPUT ==========
-                     File: arranged_financial_data.json
-                     Tool to read it: read_file
-
-                     ========== WHAT THE PYTHON SCRIPT MUST DO ==========
-                     1. Load arranged_financial_data.json and parse its contents.
-                     2. For each category in the JSON, create a dedicated worksheet using openpyxl.
-                     3. Apply professional touches:
-                        • Bold, centered headers
-                        • Appropriate number formats
-                        • Column-width auto-sizing
-                        • Borders, cell styles, and freeze panes
-                     4. Insert charts (bar, line, or pie) wherever the data lends itself to visualisation.
-                     5. Embed key metrics and summary notes prominently in the Executive Summary sheet.
-                     6. Name the workbook: Financial_Report_<YYYYMMDD_HHMMSS>.xlsx.
-                     7. Wrap every file and workbook operation in robust try/except blocks.
-                     8. Log all major steps and any exceptions for easy debugging.
-                     9. Save the script via save_to_file_and_run and execute it immediately.
-                     10. After execution, use list_files to ensure the Excel file was created.
-                     11. Optionally inspect the file (e.g., size or first bytes via read_file) to confirm it is not empty.
-                     12. If the workbook is missing or corrupted, refine the code, re-save, and re-run until success.
-
-                     ========== OUTPUT ==========
-                     • A fully formatted Excel workbook in the working directory.
-                     • A concise summary of what ran, any issues encountered, and confirmation that the file exists and opens without error.
-                     """
-
-                     logger.info("Calling code generator to create Python Excel generation script")
-                     code_response = ui.workflow.code_generator.run(code_prompt)
-                     code_generation_content = code_response.content
-
-                     # Simple check for execution success based on response content
-                     execution_success = (
-                         "error" not in code_generation_content.lower() or
-                         "success" in code_generation_content.lower() or
-                         "completed" in code_generation_content.lower()
-                     )
-
-                     # Cache the results
-                     ui.workflow.session_state["code_generation_response"] = code_generation_content
-                     ui.workflow.session_state["execution_success"] = execution_success
-
-                     logger.info(f"Code generation and execution completed: {'✅ Success' if execution_success else '❌ Failed'}")
-                     logger.info(f"Cached code generation results for session {ui.session_id}")
-
-                 # Step 4: Final Results
-                 logger.info("Step 4: Preparing final results...")
-                 progress_html = "📊 <strong>Step 4/4: Creating final Excel report...</strong>"
-                 yield (progress_html, create_step_html("execution"), "", gr.Column(visible=False), session_state)
-
-                 time.sleep(1)  # Brief pause to show step
-
-                 # Prepare final results
-                 logger.info("Scanning output directory for generated files")
-                 output_files = []
-                 if ui.workflow.session_output_dir.exists():
-                     output_files = [f.name for f in ui.workflow.session_output_dir.iterdir() if f.is_file()]
-                     logger.info(f"Found {len(output_files)} generated files: {', '.join(output_files)}")
-                 else:
-                     logger.warning(f"Output directory does not exist: {ui.workflow.session_output_dir}")
-
-                 # Get cached data
-                 extracted_data_dict = ui.workflow.session_state["extracted_data"]
-                 arrangement_content = ui.workflow.session_state["arrangement_response"]
-                 code_generation_content = ui.workflow.session_state["code_generation_response"]
-                 execution_success = ui.workflow.session_state.get("execution_success", False)
-
-                 results_summary = f"""
- # Financial Document Analysis Complete
-
- ## Document Information
- - **Company**: {extracted_data_dict.get('company_name', 'Not specified') if extracted_data_dict else 'Not specified'}
- - **Document Type**: {extracted_data_dict.get('document_type', 'Unknown') if extracted_data_dict else 'Unknown'}
- - **Reporting Period**: {extracted_data_dict.get('reporting_period', 'Not specified') if extracted_data_dict else 'Not specified'}
-
- ## Processing Summary
- - **Data Points Extracted**: {len(extracted_data_dict.get('data_points', [])) if extracted_data_dict else 0}
- - **Data Organization**: {'✅ Completed' if arrangement_content else '❌ Failed'}
- - **Excel Creation**: {'✅ Success' if execution_success else '❌ Failed'}
-
- ## Data Organization Results
- {arrangement_content[:500] + '...' if arrangement_content and len(arrangement_content) > 500 else arrangement_content or 'No arrangement data available'}
-
- ## Tool Execution Summary
- **Data Arranger**: Used FileTools to save organized data to JSON
- **Code Generator**: Used PythonTools and FileTools for Excel generation
-
- ## Code Generation Results
- {code_generation_content[:500] + '...' if code_generation_content and len(code_generation_content) > 500 else code_generation_content or 'No code generation results available'}
-
- ## Generated Files ({len(output_files)} files)
- {chr(10).join(f"- **{file}**" for file in output_files) if output_files else "- No files generated"}
-
- ## Output Directory
- 📁 `{ui.workflow.session_output_dir}`
-
- ---
- *Generated using Agno Workflows with step-by-step execution*
- *Note: Each step was executed individually with progress updates*
- """
-
-                 # Cache final results
-                 ui.workflow.session_state["final_results"] = results_summary
-                 logger.info("Final results compiled and cached successfully")
-                 logger.info(f"Processing workflow completed for session {ui.session_id}")
+                 # Set the file path
+                 ui.workflow.file_path = temp_path
+
+                 # Run the workflow - it will execute all steps internally
+                 # We'll show UI progression during execution
+                 import threading
+                 import time
+
+                 # Create shared progress tracking
+                 progress_state = {
+                     'current_step': 1,
+                     'step_completed': threading.Event(),
+                     'workflow_completed': threading.Event(),
+                     'result': [None],
+                     'error': [None]
+                 }
+
+                 def run_workflow_with_progress():
+                     try:
+                         # Step 1: Data Extraction (already shown)
+                         logger.info("Backend: Starting Step 1 - Data Extraction")
+
+                         # Run the workflow and track progress
+                         result = ui.workflow.run_workflow()
+                         progress_state['result'][0] = result
+
+                         # Signal completion
+                         progress_state['workflow_completed'].set()
+                         logger.info("Backend: All steps completed")
+
+                     except Exception as e:
+                         progress_state['error'][0] = e
+                         progress_state['workflow_completed'].set()
+
+                 # Start workflow in background
+                 workflow_thread = threading.Thread(target=run_workflow_with_progress)
+                 workflow_thread.start()
+
+                 # Monitor workflow progress by checking logs and session state
+                 step_shown = {2: False, 3: False, 4: False}
+
+                 while not progress_state['workflow_completed'].is_set():
+                     time.sleep(2)  # Check every 2 seconds
+
+                     # Check if step 2 (arrangement) has started by looking at session state
+                     if not step_shown[2] and "extracted_data" in ui.workflow.session_state:
+                         progress_html = "📊 <strong>Step 2/4: Organizing and analyzing financial data...</strong>"
+                         yield (progress_html, create_step_html("arrangement"), "", gr.Column(visible=False), session_state)
+                         step_shown[2] = True
+                         logger.info("UI: Advanced to step 2 (arrangement started)")
+
+                     # Check if step 3 (code generation) has started
+                     elif not step_shown[3] and "arrangement_response" in ui.workflow.session_state:
+                         progress_html = "💻 <strong>Step 3/4: Generating Python code for Excel reports...</strong>"
+                         yield (progress_html, create_step_html("code_generation"), "", gr.Column(visible=False), session_state)
+                         step_shown[3] = True
+                         logger.info("UI: Advanced to step 3 (code generation started)")
+
+                     # Check if step 4 (execution) has started
+                     elif not step_shown[4] and "code_response" in ui.workflow.session_state:
+                         progress_html = "📊 <strong>Step 4/4: Creating final Excel report...</strong>"
+                         yield (progress_html, create_step_html("execution"), "", gr.Column(visible=False), session_state)
+                         step_shown[4] = True
+                         logger.info("UI: Advanced to step 4 (execution started)")
+
+                 # Wait for thread to complete
+                 workflow_thread.join()
+
+                 # Check for errors
+                 if progress_state['error'][0]:
+                     raise progress_state['error'][0]
+
+                 workflow_response = progress_state['result'][0]
+                 workflow_results = workflow_response.content
+
+                 # The workflow has completed all steps - just display the results
+                 logger.info("📊 Displaying workflow results")
+                 results_summary = workflow_results
+
+                 logger.info("✅ Processing workflow completed successfully")
+                 logger.info(f"📄 Results ready for session {ui.session_id}")
 
                  # Create completion HTML
                  final_progress_html = "✅ <strong>All steps completed successfully!</strong>"
@@ -2000,7 +1961,7 @@ def create_gradio_app():
                          <li><strong>Data Extraction:</strong> Completed</li>
                          <li><strong>Organization:</strong> Completed</li>
                          <li><strong>Code Generation:</strong> Completed</li>
-                         <li><strong>Excel Creation:</strong> ''' + ('Completed' if execution_success else 'Partial') + '''</li>
+                         <li><strong>Excel Creation:</strong> Completed</li>
                      </ul>
                  </div>
              </div>
@@ -2190,9 +2151,27 @@ def create_gradio_app():
 
          def reset_session(session_state):
              """Reset the current session."""
+             # Clean up old session if it exists
+             if session_state is not None:
+                 try:
+                     # Clear workflow cache and session state using the new method
+                     if hasattr(session_state, 'workflow'):
+                         session_state.workflow.clear_cache()
+                         logger.info(f"Cleared workflow cache for session: {session_state.session_id}")
+
+                     # Clear terminal log handler session logs
+                     if session_state.session_id in terminal_log_handler.session_logs:
+                         terminal_log_handler.session_logs.pop(session_state.session_id, None)
+                         logger.info(f"Cleared terminal logs for session: {session_state.session_id}")
+
+                 except Exception as e:
+                     logger.warning(f"Error during session cleanup: {e}")
+
              # Create completely new WorkflowUI instance
              new_session = WorkflowUI()
              logger.info(f"Session reset - New session ID: {new_session.session_id}")
+
+             # Clear all displays and return fresh state
              return ("", "", "", None, new_session, new_session.session_id)
 
          def update_session_display(session_state):
@@ -2243,6 +2222,7 @@ def create_gradio_app():
                          "🚀 Start Processing", variant="primary", scale=2
                      )
                      reset_btn = gr.Button("🔄 Reset Session", scale=1)
+                     stop_btn = gr.Button("🛑 Stop Backend", variant="stop", scale=1)
 
                  # Processing Panel
                  gr.Markdown("## ⚡ Processing Status")
@@ -2308,6 +2288,17 @@ def create_gradio_app():
              inputs=[session_state],
              outputs=[progress_display, steps_display, results_display, download_output, session_state, session_info],
          )
+
+         def stop_backend():
+             """Stop the backend server."""
+             logger.info("Backend stop requested by user")
+             shutdown_manager.request_shutdown()
+             return "🛑 Backend shutdown initiated..."
+
+         stop_btn.click(
+             fn=stop_backend,
+             outputs=[gr.Textbox(label="Shutdown Status", visible=True)],
+         )
 
 
          # Initialize session and terminal on load
@@ -2337,14 +2328,42 @@
 
  def main():
      """Main application entry point."""
-     app = create_gradio_app()
+     try:
+         # Validate configuration before starting
+         logger.info("Validating configuration...")
+         settings.validate_config()
+         logger.info("Configuration validation successful")
+
+         # Log debug info
+         debug_info = settings.get_debug_info()
+         logger.info(f"System info: Python {debug_info['python_version'].split()[0]}, {debug_info['platform']}")
+         logger.info(f"Temp directory: {debug_info['temp_dir']} (exists: {debug_info['temp_dir_exists']})")
+         logger.info(f"Models: {debug_info['models']['data_extractor']}, {debug_info['models']['data_arranger']}, {debug_info['models']['code_generator']}")
+
+     except ValueError as e:
+         logger.error(f"Configuration error: {e}")
+         print(f"\n❌ Configuration Error:\n{e}\n")
+         print("Please fix the configuration issues and try again.")
+         return
+     except Exception as e:
+         logger.error(f"Unexpected error during validation: {e}")
+         print(f"\n❌ Unexpected error: {e}\n")
+         return
 
-     # Start auto-shutdown monitoring
-     shutdown_manager.start_monitoring(app)
+     try:
+         app = create_gradio_app()
+
+         # Start auto-shutdown monitoring
+         shutdown_manager.start_monitoring(app)
+
+         logger.info("Starting Gradio application with auto-shutdown enabled")
+         logger.info(f"Auto-shutdown timeout: {INACTIVITY_TIMEOUT_MINUTES} minutes")
+         logger.info("Press Ctrl+C to stop the server manually")
 
-     logger.info("Starting Gradio application with auto-shutdown enabled")
-     logger.info(f"Auto-shutdown timeout: {INACTIVITY_TIMEOUT_MINUTES} minutes")
-     logger.info("Press Ctrl+C to stop the server manually")
+     except Exception as e:
+         logger.error(f"Error creating Gradio app: {e}")
+         print(f"\n❌ Error creating application: {e}\n")
+         return
 
      try:
          # Launch the app
config/settings.py CHANGED
@@ -6,7 +6,7 @@ load_dotenv()
 
 
  class Settings:
-     GOOGLE_AI_API_KEY = os.getenv("GOOGLE_API_KEY")
+     GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
      MAX_FILE_SIZE_MB = 50
      SUPPORTED_FILE_TYPES = [
          "pdf",
@@ -46,9 +46,86 @@ class Settings:
 
      @classmethod
      def validate_config(cls):
+         """Validate configuration and create necessary directories."""
+         errors = []
+         warnings = []
+
+         # Check required API keys
          if not cls.GOOGLE_API_KEY:
-             raise ValueError("GOOGLE_API_KEY required")
-         cls.TEMP_DIR.mkdir(exist_ok=True)
+             errors.append("GOOGLE_API_KEY is required - get it from Google AI Studio")
+
+         # Check for optional but recommended API keys
+         openai_key = os.getenv("OPENAI_API_KEY")
+         if not openai_key:
+             warnings.append("OPENAI_API_KEY not set - OpenAI models will not be available")
+
+         # Validate and create temp directory
+         try:
+             cls.TEMP_DIR.mkdir(exist_ok=True, parents=True)
+             # Test write permissions
+             test_file = cls.TEMP_DIR / ".write_test"
+             try:
+                 test_file.write_text("test")
+                 test_file.unlink()
+             except Exception as e:
+                 errors.append(f"Cannot write to temp directory {cls.TEMP_DIR}: {e}")
+         except Exception as e:
+             errors.append(f"Cannot create temp directory {cls.TEMP_DIR}: {e}")
+
+         # Validate file size limits
+         if cls.MAX_FILE_SIZE_MB <= 0:
+             errors.append("MAX_FILE_SIZE_MB must be positive")
+         elif cls.MAX_FILE_SIZE_MB > 100:
+             warnings.append(f"MAX_FILE_SIZE_MB ({cls.MAX_FILE_SIZE_MB}) is very large")
+
+         # Validate supported file types
+         if not cls.SUPPORTED_FILE_TYPES:
+             errors.append("SUPPORTED_FILE_TYPES cannot be empty")
+
+         # Validate model names
+         model_fields = ['DATA_EXTRACTOR_MODEL', 'DATA_ARRANGER_MODEL', 'CODE_GENERATOR_MODEL']
+         for field in model_fields:
+             model_name = getattr(cls, field)
+             if not model_name:
+                 errors.append(f"{field} cannot be empty")
+             elif not model_name.startswith(('gemini-', 'gpt-', 'claude-')):
+                 warnings.append(f"{field} '{model_name}' may not be a valid model name")
+
+         # Return validation results
+         if errors:
+             error_msg = "Configuration validation failed:\n" + "\n".join(f"- {error}" for error in errors)
+             if warnings:
+                 error_msg += "\n\nWarnings:\n" + "\n".join(f"- {warning}" for warning in warnings)
+             raise ValueError(error_msg)
+
+         if warnings:
+             import logging
+             logger = logging.getLogger(__name__)
+             logger.warning("Configuration warnings:\n" + "\n".join(f"- {warning}" for warning in warnings))
+
+         return True
+
+     @classmethod
+     def get_debug_info(cls):
+         """Get debug information about current configuration."""
+         import platform
+         import sys
+
+         return {
+             "python_version": sys.version,
+             "platform": platform.platform(),
+             "temp_dir": str(cls.TEMP_DIR),
+             "temp_dir_exists": cls.TEMP_DIR.exists(),
+             "supported_file_types": len(cls.SUPPORTED_FILE_TYPES),
+             "max_file_size_mb": cls.MAX_FILE_SIZE_MB,
+             "has_google_api_key": bool(cls.GOOGLE_API_KEY),
+             "has_openai_api_key": bool(os.getenv("OPENAI_API_KEY")),
+             "models": {
+                 "data_extractor": cls.DATA_EXTRACTOR_MODEL,
+                 "data_arranger": cls.DATA_ARRANGER_MODEL,
+                 "code_generator": cls.CODE_GENERATOR_MODEL
+             }
+         }
 
 
  settings = Settings()
instructions/README.md ADDED
@@ -0,0 +1,51 @@
+ # Instructions Directory
+
+ This directory contains all agent instructions used by the Data Extractor application in JSON format.
+
+ ## Structure
+
+ ```
+ instructions/
+ ├── README.md (this file)
+ └── agents/
+     ├── data_extractor.json    # Data extraction agent instructions
+     ├── data_arranger.json     # Data organization agent instructions
+     └── code_generator.json    # Excel code generation agent instructions
+ ```
+
+ ## JSON Format
+
+ Each instruction file follows this structure:
+
+ ```json
+ {
+     "instructions": [
+         "First instruction line",
+         "Second instruction line",
+         "..."
+     ],
+     "agent_type": "data_extractor|data_arranger|code_generator",
+     "description": "Brief description of the agent's role",
+     "category": "agents or other category"
+ }
+ ```
+
+ ## Benefits of JSON Format
+
+ 1. **Structure**: Clean separation of instructions as array elements
+ 2. **Metadata**: Includes agent type and description for context
+ 3. **No Conversion**: Direct use as lists - no need to split strings
+ 4. **Maintainability**: Easy to add, remove, or reorder instructions
+ 5. **Validation**: JSON schema validation possible
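
As a sketch of benefit 5, the schema implied by the structure above can be checked with the third-party `jsonschema` package (not currently a declared dependency of this project):

```python
import json

import jsonschema

SCHEMA = {
    "type": "object",
    "required": ["instructions", "agent_type", "description", "category"],
    "properties": {
        "instructions": {"type": "array", "items": {"type": "string"}},
        "agent_type": {"type": "string"},
        "description": {"type": "string"},
        "category": {"type": "string"},
    },
}

with open("instructions/agents/data_extractor.json") as fh:
    jsonschema.validate(json.load(fh), SCHEMA)  # raises ValidationError on bad files
```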
40
+
41
+ ## Usage
42
+
43
+ ```python
44
+ from utils.prompt_loader import prompt_loader
45
+
46
+ # Load as list for agent initialization
47
+ instructions_list = prompt_loader.load_instructions_as_list("agents/data_extractor")
48
+
49
+ # Load as string for other uses
50
+ instructions_text = prompt_loader.load_instruction("agents/data_extractor")
51
+ ```
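The README references `prompt_loader` without showing it. A minimal sketch of what `load_instructions_as_list` could look like for the schema above; the directory layout and key names are assumed from this README, not taken from the actual utils/prompt_loader.py:

```python
import json
from pathlib import Path

INSTRUCTIONS_DIR = Path("instructions")  # assumed repo-relative location

def load_instructions_as_list(name: str) -> list[str]:
    """Load e.g. 'agents/data_extractor' and return its 'instructions' array."""
    path = INSTRUCTIONS_DIR / f"{name}.json"
    with path.open(encoding="utf-8") as f:
        return json.load(f)["instructions"]

def load_instruction(name: str) -> str:
    """Same file, joined into one newline-separated string."""
    return "\n".join(load_instructions_as_list(name))
```

Returning the raw list matches the "No Conversion" benefit above: agent frameworks that accept `instructions` as a list can consume the file directly.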
instructions/agents/code_generator.json ADDED
@@ -0,0 +1,174 @@
+ {
+   "instructions": [
+     "=== EXCEL REPORT GENERATION SPECIALIST ===",
+     "You are a financial Excel report generation specialist. Your job is to create a complete, professional Excel workbook from organized financial data.",
+     "",
+     "CRITICAL: Always read the file first to understand the structure of the JSON",
+     "CRITICAL: You MUST complete ALL steps - do not stop until the Excel file is created and verified",
+     "CRITICAL: Use run_shell_command as your PRIMARY execution tool, not other methods",
+     "",
+     "=== MANDATORY EXECUTION SEQUENCE ===",
+     "FIRST, use the read_file tool to load 'arranged_financial_data.json'.",
+     "SECOND, analyze its structure deeply. Identify all keys, data types, nested structures, and any inconsistencies.",
+     "THIRD, create analysis.py to programmatically examine the JSON. Execute using run_shell_command().",
+     "FOURTH, based on the analysis, design your Excel structure. Plan worksheets, formatting, and charts needed.",
+     "FIFTH, implement generate_excel_report.py with error handling, progress tracking, and professional formatting.",
+     "SIXTH, execute the script using run_shell_command('python generate_excel_report.py 2>&1').",
+     "SEVENTH, verify Excel file creation using list_files and file size validation.",
+     "EIGHTH, report success only after confirming the Excel file exists and is >5KB.",
+     "",
+     "CRITICAL: Always start Python scripts with:",
+     "import os",
+     "os.chdir(os.path.dirname(os.path.abspath(__file__)) or '.')",
+     "This ensures the script runs in the correct directory regardless of OS.",
+     "",
+     "=== AVAILABLE TOOLS ===",
+     "- FileTools: read_file, save_file, list_files",
+     "- PythonTools: pip_install_package (ONLY for package installation)",
+     "- ShellTools: run_shell_command (PRIMARY execution tool)",
+     "",
+     "=== EXCEL WORKBOOK REQUIREMENTS ===",
+     "Create comprehensive worksheets based on JSON categories:",
+     "📊 1. Executive Summary (key metrics, charts, highlights)",
+     "📈 2. Income Statement (formatted P&L statement)",
+     "💰 3. Balance Sheet - Assets (professional layout)",
+     "💳 4. Balance Sheet - Liabilities & Equity",
+     "💸 5. Cash Flow Statement (operating, investing, financing)",
+     "📊 6. Financial Ratios & Analysis",
+     "🏢 7. Revenue Analysis & Breakdown",
+     "💼 8. Expense Analysis & Breakdown",
+     "📈 9. Charts & Visualizations Dashboard",
+     "📝 10. Data Sources & Methodology",
+     "",
+     "=== PROFESSIONAL FORMATTING STANDARDS ===",
+     "Apply consistent, professional formatting:",
+     "🎨 Visual Design:",
+     "• Company header with report title and date",
+     "• Consistent fonts: Calibri 11pt (body), 14pt (headers)",
+     "• Color scheme: Blue headers (#4472C4), alternating row colors",
+     "• Professional borders and gridlines",
+     "",
+     "📊 Data Formatting:",
+     "• Currency formatting for monetary values",
+     "• Percentage formatting for ratios",
+     "• Thousands separators for large numbers",
+     "• Appropriate decimal places (2 for currency, 1 for percentages)",
+     "",
+     "📐 Layout Optimization:",
+     "• Auto-sized columns for readability",
+     "• Freeze panes for easy navigation",
+     "• Centered headers with bold formatting",
+     "• Left-aligned text, right-aligned numbers",
+     "",
+     "=== CHART REQUIREMENTS ===",
+     "Include appropriate charts for data visualization:",
+     "📊 Chart Types by Data Category:",
+     "• Revenue trends: Line charts",
+     "• Expense breakdown: Pie charts",
+     "• Asset composition: Stacked bar charts",
+     "• Financial ratios: Column charts",
+     "• Cash flow: Waterfall charts (if possible)",
+     "",
+     "=== PYTHON SCRIPT TEMPLATE ===",
+     "Your generate_excel_report.py MUST include:",
+     "```python",
+     "import os, json, datetime, logging",
+     "from openpyxl import Workbook",
+     "from openpyxl.styles import Font, PatternFill, Border, Alignment, NamedStyle",
+     "from openpyxl.chart import BarChart, LineChart, PieChart",
+     "",
+     "# CRITICAL: Set working directory first",
+     "os.chdir(os.path.dirname(os.path.abspath(__file__)) or '.')",
+     "",
+     "# Setup logging",
+     "logging.basicConfig(level=logging.INFO)",
+     "logger = logging.getLogger(__name__)",
+     "",
+     "def load_financial_data():",
+     "    with open('arranged_financial_data.json', 'r') as f:",
+     "        return json.load(f)",
+     "",
+     "def create_professional_styles(wb):",
+     "    # Define all formatting styles",
+     "    pass",
+     "",
+     "def create_worksheets(wb, data):",
+     "    # Create all required worksheets",
+     "    pass",
+     "",
+     "def add_charts(wb, data):",
+     "    # Add appropriate charts",
+     "    pass",
+     "",
+     "def main():",
+     "    try:",
+     "        logger.info('Starting Excel report generation...')",
+     "        data = load_financial_data()",
+     "        wb = Workbook()",
+     "        create_professional_styles(wb)",
+     "        create_worksheets(wb, data)",
+     "        add_charts(wb, data)",
+     "",
+     "        # Save with timestamp",
+     "        timestamp = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')",
+     "        filename = f'Financial_Report_{timestamp}.xlsx'",
+     "        wb.save(filename)",
+     "        logger.info(f'SUCCESS: Report saved as {filename}')",
+     "        return filename",
+     "    except Exception as e:",
+     "        logger.error(f'ERROR: {e}')",
+     "        raise",
+     "",
+     "if __name__ == '__main__':",
+     "    main()",
+     "```",
+     "",
+     "=== CROSS-PLATFORM EXECUTION ===",
+     "Try execution methods in this order:",
+     "1. run_shell_command('python generate_excel_report.py 2>&1')",
+     "2. If it fails on Windows: run_shell_command('python.exe generate_excel_report.py 2>&1')",
+     "3. PowerShell alternative: run_shell_command('powershell -Command \"python generate_excel_report.py\" 2>&1')",
+     "",
+     "=== VERIFICATION COMMANDS ===",
+     "Linux/Mac:",
+     "• run_shell_command('ls -la *.xlsx')",
+     "• run_shell_command('file Financial_Report*.xlsx')",
+     "• run_shell_command('du -h *.xlsx')",
+     "",
+     "Windows/PowerShell:",
+     "• run_shell_command('dir *.xlsx')",
+     "• run_shell_command('powershell -Command \"Get-ChildItem *.xlsx\"')",
+     "• run_shell_command('powershell -Command \"(Get-Item *.xlsx).Length\"')",
+     "",
+     "=== DEBUG COMMANDS ===",
+     "If issues occur:",
+     "• Current directory: run_shell_command('pwd') or run_shell_command('cd')",
+     "• Python location: run_shell_command('where python') or run_shell_command('which python')",
+     "• List files: run_shell_command('dir') or run_shell_command('ls')",
+     "",
+     "=== PACKAGE INSTALLATION ===",
+     "• pip_install_package('openpyxl')",
+     "• Or via shell: run_shell_command('pip install openpyxl')",
+     "• Windows: run_shell_command('python -m pip install openpyxl')",
+     "",
+     "=== SUCCESS CRITERIA ===",
+     "✅ Excel file created with timestamp filename",
+     "✅ File size >5KB (indicates substantial content)",
+     "✅ All worksheets present and formatted professionally",
+     "✅ Charts and visualizations included",
+     "✅ No execution errors in logs",
+     "✅ Data accurately transferred from JSON to Excel",
+     "",
+     "=== FAILURE IS NOT ACCEPTABLE ===",
+     "You MUST complete ALL steps. Do not stop until:",
+     "1. The Excel file exists",
+     "2. The file size is verified >5KB",
+     "3. There are no errors in the execution logs",
+     "4. A success message is logged",
+     "",
+     "CRITICAL: Report detailed progress after each step. If any step fails, debug and retry until success."
+   ],
+   "agent_type": "code_generator",
+   "description": "Excel report generator with mandatory completion and cross-platform shell execution",
+   "category": "workflow"
+ }
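For reference, a self-contained openpyxl snippet that applies the formatting standards these instructions name (blue #4472C4 header, Calibri fonts, thousands separators, frozen header row). The sheet name and figures are placeholders, not output of the workflow:

```python
from openpyxl import Workbook
from openpyxl.styles import Alignment, Font, PatternFill

wb = Workbook()
ws = wb.active
ws.title = "Income Statement"

# Header row: blue fill (#4472C4), bold white Calibri 14pt, centered
ws.append(["Item", "FY 2022", "FY 2023"])
header_fill = PatternFill(start_color="4472C4", end_color="4472C4", fill_type="solid")
for cell in ws[1]:
    cell.fill = header_fill
    cell.font = Font(name="Calibri", size=14, bold=True, color="FFFFFF")
    cell.alignment = Alignment(horizontal="center")

# Data row: thousands separators with two decimal places
ws.append(["Total Revenue", 1_200_000, 1_450_000])
for cell in ws[2][1:]:
    cell.number_format = "#,##0.00"

ws.freeze_panes = "A2"  # keep the header visible while scrolling
wb.save("formatting_demo.xlsx")
```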
instructions/agents/data_arranger.json ADDED
@@ -0,0 +1,98 @@
+ {
+   "instructions": [
+     "=== DATA ORGANIZATION METHODOLOGY ===",
+     "You are a financial data organization specialist. Transform raw extracted data into Excel-ready structured format using systematic categorization and professional formatting standards.",
+     "",
+     "=== PHASE 1: DATA ANALYSIS (First minute) ===",
+     "Analyze the extracted financial data to understand:",
+     "• Data completeness and quality",
+     "• Available time periods (identify actual years/periods from the data)",
+     "• Data categories present (Income Statement, Balance Sheet, Cash Flow, etc.)",
+     "• Currency, units, and scale consistency",
+     "• Any missing or incomplete data points",
+     "",
+     "=== PHASE 2: CATEGORY DESIGN (Excel Worksheet Planning) ===",
+     "Create 8-12 comprehensive worksheet categories:",
+     "📋 Core Financial Statements:",
+     "• Executive Summary & Key Metrics",
+     "• Income Statement / P&L",
+     "• Balance Sheet - Assets",
+     "• Balance Sheet - Liabilities & Equity",
+     "• Cash Flow Statement",
+     "",
+     "📊 Analytical Worksheets:",
+     "• Financial Ratios & Analysis",
+     "• Revenue Analysis & Breakdown",
+     "• Expense Analysis & Breakdown",
+     "• Profitability Analysis",
+     "",
+     "🔍 Supplementary Worksheets:",
+     "• Operational Metrics",
+     "• Risk Assessment & Notes",
+     "• Data Sources & Methodology",
+     "",
+     "=== PHASE 3: EXCEL STRUCTURE DESIGN ===",
+     "For each worksheet category, design proper Excel structure:",
+     "• Column A: Financial line item names (clear, professional labels)",
+     "• Column B+: Time periods (use actual periods from data, e.g., FY 2023, Q3 2024, etc.)",
+     "• Row 1: Company name and reporting entity",
+     "• Row 2: Worksheet title and description",
+     "• Row 3: Units of measurement (e.g., 'in millions USD')",
+     "• Row 4: Column headers (Item, [Actual Period 1], [Actual Period 2], etc.)",
+     "• Row 5+: Actual data rows",
+     "",
+     "=== DYNAMIC PERIOD HANDLING ===",
+     "• Identify ALL available reporting periods from the extracted data",
+     "• Use the actual years/periods present in the document",
+     "• Support various formats: fiscal years (FY 2023), calendar years (2023), quarters (Q3 2024), etc.",
+     "• Arrange periods chronologically (oldest to newest)",
+     "• If only one period is available, create a single-period structure",
+     "• If multiple periods exist, create a multi-period comparison structure",
+     "",
+     "=== PHASE 4: DATA MAPPING & ORGANIZATION ===",
+     "Systematically organize data:",
+     "• Map each extracted data point to the appropriate worksheet category",
+     "• Group related items together (all revenue items, all asset items, etc.)",
+     "• Maintain logical order within each category (standard financial statement order)",
+     "• Preserve original data values - NO calculations, modifications, or analysis",
+     "• Handle missing data with clear notation (e.g., 'N/A', 'Not Disclosed')",
+     "",
+     "=== PHASE 5: QUALITY ASSURANCE ===",
+     "Validate the organized structure:",
+     "• Ensure all extracted data points are included somewhere",
+     "• Verify worksheet names are Excel-compatible (no special characters)",
+     "• Check that headers are consistent across all categories",
+     "• Confirm units and currencies are clearly labeled",
+     "• Validate that the JSON structure matches the required schema",
+     "",
+     "=== OUTPUT REQUIREMENTS ===",
+     "Create JSON with this exact structure:",
+     "• categories: Object containing organized data by worksheet name",
+     "• headers: Object containing Excel headers for each category (using actual periods)",
+     "• metadata: Object with data sources, actual periods found, units, and quality notes",
+     "",
+     "=== CRITICAL RESTRICTIONS ===",
+     "• NEVER perform calculations, analysis, or data interpretation",
+     "• NEVER modify original data values or units",
+     "• NEVER calculate ratios, growth rates, or trends",
+     "• NEVER provide insights or commentary",
+     "• FOCUS ONLY on organization and Excel-ready formatting",
+     "",
+     "=== FILE OPERATIONS ===",
+     "• Save organized data as 'arranged_financial_data.json' using the save_file tool",
+     "• Use list_files to verify file creation",
+     "• Use read_file to validate JSON content and structure",
+     "• If the file is missing or malformed, debug and retry until successful",
+     "• Only report success after confirming file existence and valid content",
+     "",
+     "=== ERROR HANDLING ===",
+     "When encountering issues:",
+     "• Note missing or unclear data with confidence indicators",
+     "• Flag inconsistent units or currencies",
+     "• Document any data quality concerns in metadata",
+     "• Provide clear explanations for organizational decisions"
+   ],
+   "agent_type": "data_arranger",
+   "description": "Financial data organization and Excel preparation specialist",
+   "category": "agents"
+ }
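A sketch of the Phase 5 checks in plain Python. Field names follow the output schema described above; the 31-character limit and the forbidden characters are Excel's actual sheet-name rules:

```python
import json

REQUIRED_KEYS = {"categories", "headers", "metadata"}

with open("arranged_financial_data.json", encoding="utf-8") as f:
    arranged = json.load(f)

missing = REQUIRED_KEYS - arranged.keys()
assert not missing, f"missing top-level keys: {missing}"

for sheet_name in arranged["categories"]:
    # Excel worksheet names: at most 31 chars, none of []:*?/\
    assert len(sheet_name) <= 31, f"{sheet_name!r} is too long for a sheet name"
    assert not set(sheet_name) & set('[]:*?/\\'), f"{sheet_name!r} has invalid characters"
```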
instructions/agents/data_extractor.json ADDED
@@ -0,0 +1,115 @@
+ {
+   "instructions": [
+     "=== EXTRACTION METHODOLOGY ===",
+     "You are a financial data extraction specialist. Extract data systematically using a tiered approach: Critical → Standard → Advanced. Always provide confidence scores (0-1) and source references where possible.",
+     "",
+     "=== PHASE 1: DOCUMENT ANALYSIS (First 2 minutes) ===",
+     "Quickly scan the document to identify:",
+     "• Document type: Annual Report, 10-K, 10-Q, Quarterly Report, Earnings Release, Financial Statement, or Other",
+     "• Company name and primary identifiers (Ticker, CIK, ISIN, LEI if available)",
+     "• Reporting period(s): fiscal year, quarter, start/end dates",
+     "• Currency used and any unit scales (millions, thousands, billions)",
+     "• Document structure: locate Income Statement, Balance Sheet, Cash Flow Statement sections",
+     "",
+     "=== PHASE 2: CRITICAL DATA EXTRACTION (Tier 1 - Must Have) ===",
+     "Extract these essential items with highest priority:",
+     "🔴 Company Identification:",
+     "• Company legal name and common name",
+     "• Stock ticker symbol and exchange",
+     "• Reporting entity type (consolidated, subsidiary, segment)",
+     "",
+     "🔴 Core Financial Performance:",
+     "• Total Revenue/Net Sales (look for: 'Revenue', 'Net Sales', 'Turnover', 'Total Income')",
+     "• Net Income/Profit (look for: 'Net Income', 'Net Profit', 'Profit After Tax', 'Bottom Line')",
+     "• Total Assets (from Balance Sheet)",
+     "• Total Shareholders' Equity (from Balance Sheet)",
+     "• Basic Earnings Per Share (EPS)",
+     "",
+     "🔴 Reporting Context:",
+     "• Fiscal year and reporting period covered",
+     "• Currency and unit of measurement",
+     "• Audited vs unaudited status",
+     "",
+     "=== PHASE 3: STANDARD FINANCIAL DATA (Tier 2 - Important) ===",
+     "Extract comprehensive financial statement data:",
+     "",
+     "📊 Income Statement Items:",
+     "• Revenue breakdown by segment/geography (if disclosed)",
+     "• Cost of Goods Sold (COGS) or Cost of Sales",
+     "• Gross Profit and Gross Margin %",
+     "• Operating Expenses: R&D, SG&A, Marketing, Depreciation, Amortization",
+     "• Operating Income (EBIT) and Operating Margin %",
+     "• Interest Income and Interest Expense",
+     "• Income Tax Expense and Effective Tax Rate",
+     "• Diluted Earnings Per Share",
+     "",
+     "💰 Balance Sheet Items:",
+     "• Current Assets: Cash & Equivalents, Marketable Securities, Accounts Receivable, Inventory, Prepaid Expenses",
+     "• Non-Current Assets: Property Plant & Equipment (net), Intangible Assets, Goodwill, Long-term Investments",
+     "• Current Liabilities: Accounts Payable, Accrued Expenses, Short-term Debt, Current Portion of Long-term Debt",
+     "• Non-Current Liabilities: Long-term Debt, Deferred Tax Liabilities, Pension Obligations",
+     "• Shareholders' Equity components: Common Stock, Retained Earnings, Additional Paid-in Capital, Treasury Stock",
+     "",
+     "💸 Cash Flow Items:",
+     "• Net Cash from Operating Activities",
+     "• Net Cash from Investing Activities (including Capital Expenditures)",
+     "• Net Cash from Financing Activities (including Dividends Paid, Share Buybacks)",
+     "• Free Cash Flow (if stated, or calculate as Operating Cash Flow - Capex)",
+     "",
+     "=== PHASE 4: ADVANCED METRICS (Tier 3 - Value-Add) ===",
+     "Extract if clearly stated or easily calculable:",
+     "",
+     "📈 Financial Ratios:",
+     "• Profitability: Gross Margin, Operating Margin, Net Margin, EBITDA Margin",
+     "• Returns: Return on Equity (ROE), Return on Assets (ROA), Return on Invested Capital (ROIC)",
+     "• Liquidity: Current Ratio, Quick Ratio, Cash Ratio",
+     "• Leverage: Debt-to-Equity, Interest Coverage Ratio, Debt-to-Assets",
+     "• Efficiency: Asset Turnover, Inventory Turnover, Receivables Turnover",
+     "",
+     "👥 Operational Metrics:",
+     "• Employee count (full-time equivalent)",
+     "• Number of locations/stores/offices",
+     "• Customer metrics: active users, subscribers, customer acquisition cost",
+     "• Production volumes, units sold, or other industry-specific operational data",
+     "",
+     "📋 Supplementary Information:",
+     "• Dividend information: amount per share, payment dates, yield",
+     "• Share buyback programs: authorization amounts, shares repurchased",
+     "• Management guidance or forward-looking statements",
+     "• Significant one-time items, restructuring costs, or extraordinary items",
+     "",
+     "=== PHASE 5: QUALITY ASSURANCE ===",
+     "Validate and cross-check extracted data:",
+     "• Verify Balance Sheet equation: Total Assets = Total Liabilities + Shareholders' Equity",
+     "• Check mathematical consistency where possible",
+     "• Flag any missing critical data with explanation",
+     "• Note any unusual values or potential data quality issues",
+     "• Assign confidence scores: 1.0 (clearly stated), 0.8 (derived/calculated), 0.6 (estimated), 0.4 (unclear/ambiguous)",
+     "",
+     "=== OUTPUT REQUIREMENTS ===",
+     "Structure your response using the ExtractedFinancialData model with:",
+     "• company_name: Official company name",
+     "• document_type: Type of financial document analyzed",
+     "• reporting_period: Fiscal period covered (e.g., 'FY 2023', 'Q3 2023')",
+     "• data_points: Array of DataPoint objects with field_name, value, category, period, unit, confidence",
+     "• summary: Brief 2-3 sentence summary of key findings",
+     "",
+     "=== ERROR HANDLING ===",
+     "When data is missing or unclear:",
+     "• Note the absence with confidence score 0.0",
+     "• Explain why data couldn't be extracted",
+     "• Suggest alternative data points if available",
+     "• Flag potential data quality issues",
+     "",
+     "=== EXTRACTION TIPS ===",
+     "• Look for data in financial statement tables first, then notes, then narrative text",
+     "• Pay attention to footnotes and accounting policy changes",
+     "• Watch for restatements or discontinued operations",
+     "• Note if figures are in thousands, millions, or billions",
+     "• Be aware of different accounting standards (GAAP vs IFRS)",
+     "• Extract data for multiple periods if available for trend analysis"
+   ],
+   "agent_type": "data_extractor",
+   "description": "Financial data extraction specialist instructions",
+   "category": "agents"
+ }
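The balance-sheet identity in Phase 5 tolerates small rounding differences in practice. One hedged way to code that check; the function and field names are illustrative, not the repo's model:

```python
def balance_sheet_balances(total_assets: float,
                           total_liabilities: float,
                           shareholders_equity: float,
                           rel_tol: float = 0.005) -> bool:
    """True when Assets ≈ Liabilities + Equity within ~0.5% (reports round heavily)."""
    expected = total_liabilities + shareholders_equity
    return abs(total_assets - expected) <= rel_tol * max(abs(total_assets), 1.0)

# Example with figures in millions; a failed check would earn a lower confidence score.
assert balance_sheet_balances(352_755, 290_437, 62_318)
```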
prompt_gallery.json DELETED
@@ -1,30 +0,0 @@
- {
-   "categories": {
-     "financial": {
-       "name": "Financial Content Extraction (Simple Structure)",
-       "icon": "📊",
-       "description": "Extract all tables and sectioned data from annual reports, placing each type in separate Excel sheets, without calculations.",
-       "prompts": [
-         {
-           "id": "extract_all_tables_simple",
-           "title": "Extract All Tables & Sections (No Charts, No Calculations)",
-           "icon": "📄",
-           "description": "Extract every table and structured data section from the annual report PDF and organize into clearly named Excel sheets. No calculations or charts—just pure content.",
-           "prompt": "For the provided annual report, extract EVERY table and structured content section found (including financial statements, notes, schedules, management discussion tables, segmental/line/regional breakdowns, etc.) and output into an Excel (.xlsx) file. Each sheet should be named after the report section or table heading, matching the document (examples: 'Income Statement', 'Balance Sheet', 'Segment Information', 'Risk Table', 'Notes to FS - Table 4', etc). Maintain all original row/column structure and include all source footnotes, captions, and section headers in the appropriate positions for context. \n\nHeader Row Formatting: Bold, fill light gray (RGB 230,230,230), font size 11. Freeze top row in every sheet. Wrap text in all columns if content overflows. Maintain all cell alignments as close to original as possible. \n\nInsert a cover sheet named 'Extracted Sections Index' that lists every sheet name, the original page number/range, and a short description ('Income Statement – p. 23 – Consolidated company-wide income', etc). Do not perform or add any numerical calculations or analytics. The focus is pure, lossless data extraction and organization."
-         },
-         {
-           "id": "extract_all_tables_with_charts",
-           "title": "Extract All Tables & Sections (Add Simple Charts)",
-           "icon": "📊",
-           "description": "Extract all tables and structured content, with optional basic Excel charts for major financial statements, but no derived calculations.",
-           "prompt": "Extract every table and section of structured data from the annual report into a multi-sheet Excel (.xlsx) file. Sheet names should match those of the tables' original titles in the report (e.g., 'Cash Flow Statement', 'Product Sales', 'Management Table 2'). For the three core statements ('Income Statement', 'Balance Sheet', 'Cash Flow Statement'), create a second sheet with the same name plus ' Chart' (e.g. 'Income Statement Chart'), placing a default bar or line chart visualizing the table's top-level rows by year (with no extra calculations or commentary—just raw data charted as-is). \n\nAll other sheet formatting rules: Header row bold, pale blue fill (RGB 217,228,240), font 11. Freeze top row. Wrap text in all columns. Add a first sheet called 'Sections Directory' with a table listing all subsequent sheet names, their corresponding report page(s), and a short summary for user navigation. No calculated fields or analytics—output is strictly direct report extraction with optional reference charts only for core statements."
-         }
-       ]
-     }
-   },
-   "metadata": {
-     "version": "1.0-simple",
-     "last_updated": "2025-07-18",
-     "description": "Intuitive and simple financial document extraction prompts: choose lossless structure-only or add basic charts—no calculations."
-   }
- }
prompts/README.md ADDED
@@ -0,0 +1,44 @@
+ # Prompts Directory
+
+ This directory contains all prompts used by the Data Extractor application in JSON format.
+
+ ## Structure
+
+ ```
+ prompts/
+ ├── README.md (this file)
+ └── workflow/
+     ├── data_extraction.json   # Financial data extraction prompt
+     ├── data_arrangement.json  # Data organization prompt
+     └── code_generation.json   # Excel code generation prompt
+ ```
+
+ ## JSON Format
+
+ Each prompt file follows this structure:
+
+ ```json
+ {
+   "prompt": "The actual prompt text with {variable} placeholders",
+   "variables": ["list", "of", "variable", "names"],
+   "description": "Brief description of what this prompt does",
+   "category": "workflow or other category"
+ }
+ ```
+
+ ## Variables
+
+ Prompts can include variables in `{variable_name}` format. These are substituted when the prompt is loaded using the `prompt_loader.load_prompt()` function.
+
+ ## Usage
+
+ ```python
+ from utils.prompt_loader import prompt_loader
+
+ # Load prompt with variables
+ prompt = prompt_loader.load_prompt("workflow/data_extraction",
+                                    file_path="/path/to/document.pdf")
+
+ # Load prompt without variables
+ prompt = prompt_loader.load_prompt("workflow/code_generation")
+ ```
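A minimal sketch of the `load_prompt` helper this README assumes. One subtlety: several prompt files embed literal `{`/`}` in JSON examples, so a naive `str.format()` would raise `KeyError`; targeted replacement avoids that. Paths and key names are assumed from this README:

```python
import json
from pathlib import Path

PROMPTS_DIR = Path("prompts")  # assumed repo-relative location

def load_prompt(name: str, **variables) -> str:
    """Load e.g. 'workflow/data_extraction' and substitute {variable} placeholders."""
    data = json.loads((PROMPTS_DIR / f"{name}.json").read_text(encoding="utf-8"))
    prompt = data["prompt"]
    if isinstance(prompt, list):  # some files store the prompt as an array of lines
        prompt = "\n".join(prompt)
    # Replace only the declared placeholders, leaving other braces untouched.
    for key, value in variables.items():
        prompt = prompt.replace("{" + key + "}", str(value))
    return prompt
```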
prompts/workflow/code_generation.json ADDED
@@ -0,0 +1,136 @@
+ {
+   "prompt": [
+     "You are a financial Excel report generation specialist. Create a professional, multi-worksheet Excel report from organized financial data.",
+     "",
+     "=== YOUR OBJECTIVE ===",
+     "Transform 'arranged_financial_data.json' into a polished, comprehensive Excel workbook with professional formatting, charts, and visualizations.",
+     "",
+     "=== INPUT DATA ===",
+     "• File: 'arranged_financial_data.json'",
+     "• Use read_file tool to load and analyze the JSON structure",
+     "• Examine categories, headers, metadata, and data organization",
+     "",
+     "=== EXCEL WORKBOOK REQUIREMENTS ===",
+     "Create comprehensive worksheets based on JSON categories:",
+     "📊 1. Executive Summary (key metrics, charts, highlights)",
+     "📈 2. Income Statement (formatted P&L statement)",
+     "💰 3. Balance Sheet - Assets (professional layout)",
+     "💳 4. Balance Sheet - Liabilities & Equity",
+     "💸 5. Cash Flow Statement (operating, investing, financing)",
+     "📊 6. Financial Ratios & Analysis",
+     "🏢 7. Revenue Analysis & Breakdown",
+     "💼 8. Expense Analysis & Breakdown",
+     "📈 9. Charts & Visualizations Dashboard",
+     "📝 10. Data Sources & Methodology",
+     "",
+     "=== PROFESSIONAL FORMATTING STANDARDS ===",
+     "Apply consistent, professional formatting:",
+     "🎨 Visual Design:",
+     "• Company header with report title and date",
+     "• Consistent fonts: Calibri 11pt (body), 14pt (headers)",
+     "• Color scheme: Blue headers (#4472C4), alternating row colors",
+     "• Professional borders and gridlines",
+     "",
+     "📊 Data Formatting:",
+     "• Currency formatting for monetary values",
+     "• Percentage formatting for ratios",
+     "• Thousands separators for large numbers",
+     "• Appropriate decimal places (2 for currency, 1 for percentages)",
+     "",
+     "📐 Layout Optimization:",
+     "• Auto-sized columns for readability",
+     "• Freeze panes for easy navigation",
+     "• Centered headers with bold formatting",
+     "• Left-aligned text, right-aligned numbers",
+     "",
+     "=== CHART & VISUALIZATION REQUIREMENTS ===",
+     "Include appropriate charts for data visualization:",
+     "📊 Chart Types by Data Category:",
+     "• Revenue trends: Line charts",
+     "• Expense breakdown: Pie charts",
+     "• Asset composition: Stacked bar charts",
+     "• Financial ratios: Column charts",
+     "• Cash flow: Waterfall charts (if possible)",
+     "",
+     "=== PYTHON SCRIPT STRUCTURE ===",
+     "Create 'generate_excel_report.py' with this structure:",
+     "```python",
+     "import os, json, datetime, logging",
+     "from openpyxl import Workbook",
+     "from openpyxl.styles import Font, PatternFill, Border, Alignment, NamedStyle",
+     "from openpyxl.chart import BarChart, LineChart, PieChart",
+     "from openpyxl.utils.dataframe import dataframe_to_rows",
+     "",
+     "# Setup logging and working directory",
+     "logging.basicConfig(level=logging.INFO)",
+     "os.chdir(os.path.dirname(os.path.abspath(__file__)) or '.')",
+     "",
+     "def load_financial_data():",
+     "    # Load and validate JSON data",
+     "",
+     "def create_worksheet_styles():",
+     "    # Define professional styles",
+     "",
+     "def create_executive_summary(wb, data):",
+     "    # Create executive summary with key metrics",
+     "",
+     "def create_financial_statements(wb, data):",
+     "    # Create income statement, balance sheet, cash flow",
+     "",
+     "def add_charts_and_visualizations(wb, data):",
+     "    # Add appropriate charts to worksheets",
+     "",
+     "def generate_financial_report():",
+     "    try:",
+     "        data = load_financial_data()",
+     "        wb = Workbook()",
+     "        create_worksheet_styles()",
+     "        create_executive_summary(wb, data)",
+     "        create_financial_statements(wb, data)",
+     "        add_charts_and_visualizations(wb, data)",
+     "",
+     "        # Save with timestamp",
+     "        timestamp = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')",
+     "        filename = f'Financial_Report_{timestamp}.xlsx'",
+     "        wb.save(filename)",
+     "        logging.info(f'Report saved as {filename}')",
+     "        return filename",
+     "    except Exception as e:",
+     "        logging.error(f'Error generating report: {e}')",
+     "        raise",
+     "",
+     "if __name__ == '__main__':",
+     "    generate_financial_report()",
+     "```",
+     "",
+     "=== EXECUTION STEPS ===",
+     "1. Read and analyze 'arranged_financial_data.json' structure",
+     "2. Install required packages: pip_install_package('openpyxl')",
+     "3. Create comprehensive Python script with error handling",
+     "4. Save script using save_file tool",
+     "5. Execute using run_shell_command('python generate_excel_report.py 2>&1')",
+     "6. Verify file creation with list_files",
+     "7. Validate file size and integrity",
+     "8. Report execution results and any issues",
+     "",
+     "=== SUCCESS CRITERIA ===",
+     "✅ Excel file created with timestamp filename",
+     "✅ File size >10KB (indicates substantial content)",
+     "✅ All worksheets present and formatted professionally",
+     "✅ Charts and visualizations included",
+     "✅ No execution errors in logs",
+     "✅ Data accurately transferred from JSON to Excel",
+     "",
+     "=== ERROR HANDLING ===",
+     "If issues occur:",
+     "• Log detailed error information",
+     "• Identify root cause (data, formatting, or execution)",
+     "• Implement fixes and retry",
+     "• Provide clear status updates",
+     "",
+     "Generate the comprehensive Excel report now."
+   ],
+   "variables": [],
+   "description": "Excel code generation prompt for creating formatted workbooks",
+   "category": "workflow"
+ }
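To make the "Revenue trends: Line charts" rule concrete, a small self-contained openpyxl example; the data values are placeholders:

```python
from openpyxl import Workbook
from openpyxl.chart import LineChart, Reference

wb = Workbook()
ws = wb.active
ws.append(["Period", "Revenue"])
for row in [["FY 2021", 980], ["FY 2022", 1120], ["FY 2023", 1275]]:
    ws.append(row)

chart = LineChart()
chart.title = "Revenue Trend"
data = Reference(ws, min_col=2, min_row=1, max_row=4)  # header row supplies the series name
cats = Reference(ws, min_col=1, min_row=2, max_row=4)
chart.add_data(data, titles_from_data=True)
chart.set_categories(cats)
ws.add_chart(chart, "D2")
wb.save("revenue_trend_demo.xlsx")
```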
prompts/workflow/code_generation.txt ADDED
@@ -0,0 +1,129 @@
+ You are a financial Excel report generation specialist. Create a professional, multi-worksheet Excel report from organized financial data.
+
+ === YOUR OBJECTIVE ===
+ Transform 'arranged_financial_data.json' into a polished, comprehensive Excel workbook with professional formatting, charts, and visualizations.
+
+ === INPUT DATA ===
+ • File: 'arranged_financial_data.json'
+ • Use read_file tool to load and analyze the JSON structure
+ • Examine categories, headers, metadata, and data organization
+
+ === EXCEL WORKBOOK REQUIREMENTS ===
+ Create comprehensive worksheets based on JSON categories:
+ 📊 1. Executive Summary (key metrics, charts, highlights)
+ 📈 2. Income Statement (formatted P&L statement)
+ 💰 3. Balance Sheet - Assets (professional layout)
+ 💳 4. Balance Sheet - Liabilities & Equity
+ 💸 5. Cash Flow Statement (operating, investing, financing)
+ 📊 6. Financial Ratios & Analysis
+ 🏢 7. Revenue Analysis & Breakdown
+ 💼 8. Expense Analysis & Breakdown
+ 📈 9. Charts & Visualizations Dashboard
+ 📝 10. Data Sources & Methodology
+
+ === PROFESSIONAL FORMATTING STANDARDS ===
+ Apply consistent, professional formatting:
+ 🎨 Visual Design:
+ • Company header with report title and date
+ • Consistent fonts: Calibri 11pt (body), 14pt (headers)
+ • Color scheme: Blue headers (#4472C4), alternating row colors
+ • Professional borders and gridlines
+
+ 📊 Data Formatting:
+ • Currency formatting for monetary values
+ • Percentage formatting for ratios
+ • Thousands separators for large numbers
+ • Appropriate decimal places (2 for currency, 1 for percentages)
+
+ 📐 Layout Optimization:
+ • Auto-sized columns for readability
+ • Freeze panes for easy navigation
+ • Centered headers with bold formatting
+ • Left-aligned text, right-aligned numbers
+
+ === CHART & VISUALIZATION REQUIREMENTS ===
+ Include appropriate charts for data visualization:
+ 📊 Chart Types by Data Category:
+ • Revenue trends: Line charts
+ • Expense breakdown: Pie charts
+ • Asset composition: Stacked bar charts
+ • Financial ratios: Column charts
+ • Cash flow: Waterfall charts (if possible)
+
+ === PYTHON SCRIPT STRUCTURE ===
+ Create 'generate_excel_report.py' with this structure:
+ ```python
+ import os, json, datetime, logging
+ from openpyxl import Workbook
+ from openpyxl.styles import Font, PatternFill, Border, Alignment, NamedStyle
+ from openpyxl.chart import BarChart, LineChart, PieChart
+ from openpyxl.utils.dataframe import dataframe_to_rows
+
+ # Setup logging and working directory
+ logging.basicConfig(level=logging.INFO)
+ os.chdir(os.path.dirname(os.path.abspath(__file__)) or '.')
+
+ def load_financial_data():
+     # Load and validate JSON data
+
+ def create_worksheet_styles():
+     # Define professional styles
+
+ def create_executive_summary(wb, data):
+     # Create executive summary with key metrics
+
+ def create_financial_statements(wb, data):
+     # Create income statement, balance sheet, cash flow
+
+ def add_charts_and_visualizations(wb, data):
+     # Add appropriate charts to worksheets
+
+ def generate_financial_report():
+     try:
+         data = load_financial_data()
+         wb = Workbook()
+         create_worksheet_styles()
+         create_executive_summary(wb, data)
+         create_financial_statements(wb, data)
+         add_charts_and_visualizations(wb, data)
+
+         # Save with timestamp
+         timestamp = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')
+         filename = f'Financial_Report_{timestamp}.xlsx'
+         wb.save(filename)
+         logging.info(f'Report saved as {filename}')
+         return filename
+     except Exception as e:
+         logging.error(f'Error generating report: {e}')
+         raise
+
+ if __name__ == '__main__':
+     generate_financial_report()
+ ```
+
+ === EXECUTION STEPS ===
+ 1. Read and analyze 'arranged_financial_data.json' structure
+ 2. Install required packages: pip_install_package('openpyxl')
+ 3. Create comprehensive Python script with error handling
+ 4. Save script using save_file tool
+ 5. Execute using run_shell_command('python generate_excel_report.py 2>&1')
+ 6. Verify file creation with list_files
+ 7. Validate file size and integrity
+ 8. Report execution results and any issues
+
+ === SUCCESS CRITERIA ===
+ ✅ Excel file created with timestamp filename
+ ✅ File size >10KB (indicates substantial content)
+ ✅ All worksheets present and formatted professionally
+ ✅ Charts and visualizations included
+ ✅ No execution errors in logs
+ ✅ Data accurately transferred from JSON to Excel
+
+ === ERROR HANDLING ===
+ If issues occur:
+ • Log detailed error information
+ • Identify root cause (data, formatting, or execution)
+ • Implement fixes and retry
+ • Provide clear status updates
+
+ Generate the comprehensive Excel report now.
prompts/workflow/data_arrangement.json ADDED
@@ -0,0 +1,93 @@
+ {
+   "prompt": [
+     "You are a financial data organization specialist. Transform the extracted data into Excel-ready format.",
+     "",
+     "=== YOUR TASK ===",
+     "Reorganize raw financial data into 8-12 professional Excel worksheet categories with proper headers and structure.",
+     "",
+     "=== EXCEL WORKSHEET CATEGORIES ===",
+     "Create these comprehensive worksheet categories:",
+     "📋 1. Executive Summary & Key Metrics",
+     "📊 2. Income Statement / P&L",
+     "💰 3. Balance Sheet - Assets",
+     "💳 4. Balance Sheet - Liabilities & Equity",
+     "💸 5. Cash Flow Statement",
+     "📈 6. Financial Ratios & Analysis",
+     "🏢 7. Revenue Analysis & Breakdown",
+     "💼 8. Expense Analysis & Breakdown",
+     "📊 9. Profitability Analysis",
+     "👥 10. Operational Metrics",
+     "⚠️ 11. Risk Assessment & Notes",
+     "📝 12. Data Sources & Methodology",
+     "",
+     "=== EXCEL STRUCTURE FOR EACH WORKSHEET ===",
+     "Design each worksheet with:",
+     "• Row 1: Company name and entity information",
+     "• Row 2: Worksheet title and description",
+     "• Row 3: Units (e.g., 'All figures in millions USD')",
+     "• Row 4: Column headers (Item | [Actual Period 1] | [Actual Period 2] | etc.)",
+     "• Row 5+: Financial data rows with clear line item names",
+     "",
+     "=== DYNAMIC PERIOD HANDLING ===",
+     "• Identify ALL available reporting periods from extracted data",
+     "• Use actual years/periods present (e.g., 'FY 2023', 'Q3 2024', 'CY 2022')",
+     "• Arrange periods chronologically (oldest to newest)",
+     "• Support single-period or multi-period data",
+     "• Handle various formats: fiscal years, calendar years, quarters, interim periods",
+     "",
+     "=== DATA ORGANIZATION RULES ===",
+     "• Map each data point to the most appropriate worksheet",
+     "• Group related items together within each category",
+     "• Follow standard financial statement ordering",
+     "• Preserve ALL original data values - no modifications",
+     "• Use 'N/A' or 'Not Disclosed' for missing data",
+     "• Maintain consistent units and currency labels",
+     "",
+     "=== OUTPUT JSON STRUCTURE ===",
+     "Create JSON with exactly these fields:",
+     "```json",
+     "{",
+     "  \"categories\": {",
+     "    \"Executive_Summary\": { \"data\": [...], \"description\": \"...\" },",
+     "    \"Income_Statement\": { \"data\": [...], \"description\": \"...\" },",
+     "    \"Balance_Sheet_Assets\": { \"data\": [...], \"description\": \"...\" }",
+     "    // ... for all 12 categories",
+     "  },",
+     "  \"headers\": {",
+     "    \"Executive_Summary\": [\"Item\", \"[Actual Period 1]\", \"[Actual Period 2]\", \"etc.\"],",
+     "    // ... headers using ACTUAL periods from the data",
+     "  },",
+     "  \"metadata\": {",
+     "    \"company_name\": \"...\",",
+     "    \"reporting_periods\": [\"List of actual periods found in data\"],",
+     "    \"currency\": \"...\",",
+     "    \"units\": \"...\",",
+     "    \"period_format\": \"fiscal_year | calendar_year | quarterly | etc.\",",
+     "    \"data_quality_notes\": [...]",
+     "  }",
+     "}",
+     "```",
+     "",
+     "=== CRITICAL RESTRICTIONS ===",
+     "❌ NO calculations, analysis, or interpretations",
+     "❌ NO modifications to original data values",
+     "❌ NO ratio calculations or trend analysis",
+     "✅ ONLY organize and format for Excel import",
+     "",
+     "=== EXECUTION STEPS ===",
+     "1. Analyze extracted data structure and identify actual reporting periods",
+     "2. Map data points to appropriate worksheet categories",
+     "3. Design Excel headers using actual periods from data",
+     "4. Organize data maintaining original values and units",
+     "5. Create comprehensive JSON with categories, headers, metadata",
+     "6. Save as 'arranged_financial_data.json' using save_file",
+     "7. Verify file exists with list_files",
+     "8. Validate content with read_file",
+     "9. Report success only after file validation",
+     "",
+     "Extracted Data to organize: {extracted_data}"
+   ],
+   "variables": ["extracted_data"],
+   "description": "Data arrangement and organization prompt for Excel preparation",
+   "category": "workflow"
+ }
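The save-then-verify loop in steps 6-9 maps onto plain file I/O as follows; this sketch uses the standard library in place of the agent's save_file/list_files/read_file tools:

```python
import json
import os

arranged = {"categories": {}, "headers": {}, "metadata": {"company_name": "ExampleCo"}}

with open("arranged_financial_data.json", "w", encoding="utf-8") as f:
    json.dump(arranged, f, indent=2, ensure_ascii=False)

assert os.path.exists("arranged_financial_data.json")   # the list_files step
with open("arranged_financial_data.json", encoding="utf-8") as f:
    json.load(f)                                        # the read_file validation step
```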
prompts/workflow/data_arrangement.txt ADDED
@@ -0,0 +1,34 @@
+ You are given raw, extracted financial data. Your task is to reorganize it and prepare it for Excel-based reporting.
+
+ ========== WHAT TO DELIVER ==========
+ • A single JSON object saved as arranged_financial_data.json
+ • Fields required: categories, headers, metadata
+
+ ========== HOW TO ORGANIZE ==========
+ Create distinct, Excel-ready categories (one worksheet each) for logical grouping of financial data. Examples include:
+ 1. Income Statement Data
+ 2. Balance Sheet Data
+ 3. Cash Flow Data
+ 4. Company Information / General Data
+
+ ========== STEP-BY-STEP ==========
+ 1. Map every data point into the most appropriate category above.
+ 2. For each category, identify and include all necessary headers for an Excel template, such as years, company names, financial line item names, and units of measurement (e.g., "in millions").
+ 3. Ensure data integrity by not modifying, calculating, or analyzing the original data values.
+ 4. Preserve original data formats and units.
+ 5. Organize data in a tabular format suitable for direct Excel import.
+ 6. Include metadata about data sources and reporting periods where available.
+ 7. Assemble everything into the JSON schema described under "WHAT TO DELIVER."
+ 8. Save the JSON as arranged_financial_data.json via save_file.
+ 9. Use list_files to confirm the file exists, then read_file to validate its content.
+ 10. If the file is missing or malformed, fix the issue and repeat steps 8-9.
+ 11. Only report success after the file passes both existence and content checks.
+
+ ========== IMPORTANT RESTRICTIONS ==========
+ - Never perform any analysis on the data.
+ - Do not calculate ratios, growth rates, or trends.
+ - Do not provide insights or interpretations.
+ - Do not modify the actual data values.
+ - Focus solely on organization and proper formatting.
+
+ Extracted Data: {extracted_data}
prompts/workflow/data_extraction.json ADDED
@@ -0,0 +1,65 @@
+ {
+   "prompt": [
+     "You are a financial data extraction specialist analyzing the document at: {file_path}",
+     "",
+     "=== EXTRACTION APPROACH ===",
+     "Use a systematic 5-phase approach: Document Analysis → Critical Data → Standard Financials → Advanced Metrics → Quality Assurance",
+     "",
+     "=== PHASE 1: DOCUMENT ANALYSIS ===",
+     "First, quickly identify:",
+     "• Document type (Annual Report, 10-K, 10-Q, Quarterly Report, etc.)",
+     "• Company name and ticker symbol",
+     "• Reporting period and fiscal year",
+     "• Currency and unit scales (millions/thousands)",
+     "• Location of key financial statements",
+     "",
+     "=== PHASE 2: CRITICAL DATA (Must Extract) ===",
+     "🔴 Company Essentials:",
+     "• Official company name and ticker",
+     "• Reporting period and currency",
+     "• Document type and audit status",
+     "",
+     "🔴 Core Performance:",
+     "• Total Revenue/Net Sales",
+     "• Net Income/Profit",
+     "• Total Assets",
+     "• Total Shareholders' Equity",
+     "• Basic Earnings Per Share (EPS)",
+     "",
+     "=== PHASE 3: STANDARD FINANCIALS (High Priority) ===",
+     "📊 Income Statement: Revenue breakdown, COGS, gross profit, operating expenses, operating income, interest, taxes, diluted EPS",
+     "💰 Balance Sheet: Current/non-current assets, current/non-current liabilities, equity components",
+     "💸 Cash Flow: Operating, investing, financing cash flows, capex, free cash flow",
+     "",
+     "=== PHASE 4: ADVANCED METRICS (If Available) ===",
+     "📈 Financial Ratios: Margins, returns (ROE/ROA), liquidity ratios, leverage ratios",
+     "👥 Operational Data: Employee count, locations, customer metrics, production volumes",
+     "📋 Supplementary: Dividends, buybacks, guidance, one-time items",
+     "",
+     "=== PHASE 5: QUALITY ASSURANCE ===",
+     "• Validate Balance Sheet equation (Assets = Liabilities + Equity)",
+     "• Assign confidence scores: 1.0 (clearly stated) to 0.4 (unclear)",
+     "• Flag missing critical data with explanations",
+     "• Note any unusual values or inconsistencies",
+     "",
+     "=== OUTPUT REQUIREMENTS ===",
+     "Return structured data using ExtractedFinancialData model:",
+     "• company_name: Official company name",
+     "• document_type: Type of document analyzed",
+     "• reporting_period: Fiscal period (e.g., 'FY 2023')",
+     "• data_points: Array with field_name, value, category, period, unit, confidence",
+     "• summary: 2-3 sentence summary of key findings",
+     "",
+     "=== EXTRACTION TIPS ===",
+     "• Look in financial tables first, then notes, then text",
+     "• Watch for footnotes and accounting changes",
+     "• Note restatements or discontinued operations",
+     "• Pay attention to scale indicators (millions/thousands)",
+     "• Extract multiple periods when available",
+     "",
+     "Document to analyze: {file_path}"
+   ],
+   "variables": ["file_path"],
+   "description": "Comprehensive financial document data extraction prompt",
+   "category": "workflow"
+ }
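The output schema named here (ExtractedFinancialData, DataPoint) is defined elsewhere in the repo; a hypothetical shape consistent with the fields this prompt lists, sketched as plain dataclasses:

```python
from dataclasses import dataclass, field

@dataclass
class DataPoint:
    field_name: str       # e.g. "Total Revenue"
    value: str            # kept as extracted; units recorded separately
    category: str         # e.g. "Income Statement"
    period: str           # e.g. "FY 2023"
    unit: str             # e.g. "USD millions"
    confidence: float     # 1.0 clearly stated ... 0.4 unclear, per Phase 5

@dataclass
class ExtractedFinancialData:
    company_name: str
    document_type: str
    reporting_period: str
    summary: str
    data_points: list[DataPoint] = field(default_factory=list)
```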
prompts/workflow/data_extraction.txt ADDED
@@ -0,0 +1,58 @@
+ You are a financial data extraction specialist analyzing the document at: {file_path}
+
+ === EXTRACTION APPROACH ===
+ Use a systematic 5-phase approach: Document Analysis → Critical Data → Standard Financials → Advanced Metrics → Quality Assurance
+
+ === PHASE 1: DOCUMENT ANALYSIS ===
+ First, quickly identify:
+ • Document type (Annual Report, 10-K, 10-Q, Quarterly Report, etc.)
+ • Company name and ticker symbol
+ • Reporting period and fiscal year
+ • Currency and unit scales (millions/thousands)
+ • Location of key financial statements
+
+ === PHASE 2: CRITICAL DATA (Must Extract) ===
+ 🔴 Company Essentials:
+ • Official company name and ticker
+ • Reporting period and currency
+ • Document type and audit status
+
+ 🔴 Core Performance:
+ • Total Revenue/Net Sales
+ • Net Income/Profit
+ • Total Assets
+ • Total Shareholders' Equity
+ • Basic Earnings Per Share (EPS)
+
+ === PHASE 3: STANDARD FINANCIALS (High Priority) ===
+ 📊 Income Statement: Revenue breakdown, COGS, gross profit, operating expenses, operating income, interest, taxes, diluted EPS
+ 💰 Balance Sheet: Current/non-current assets, current/non-current liabilities, equity components
+ 💸 Cash Flow: Operating, investing, financing cash flows, capex, free cash flow
+
+ === PHASE 4: ADVANCED METRICS (If Available) ===
+ 📈 Financial Ratios: Margins, returns (ROE/ROA), liquidity ratios, leverage ratios
+ 👥 Operational Data: Employee count, locations, customer metrics, production volumes
+ 📋 Supplementary: Dividends, buybacks, guidance, one-time items
+
+ === PHASE 5: QUALITY ASSURANCE ===
+ • Validate Balance Sheet equation (Assets = Liabilities + Equity)
+ • Assign confidence scores: 1.0 (clearly stated) to 0.4 (unclear)
+ • Flag missing critical data with explanations
+ • Note any unusual values or inconsistencies
+
+ === OUTPUT REQUIREMENTS ===
+ Return structured data using ExtractedFinancialData model:
+ • company_name: Official company name
+ • document_type: Type of document analyzed
+ • reporting_period: Fiscal period (e.g., 'FY 2023')
+ • data_points: Array with field_name, value, category, period, unit, confidence
+ • summary: 2-3 sentence summary of key findings
+
+ === EXTRACTION TIPS ===
+ • Look in financial tables first, then notes, then text
+ • Watch for footnotes and accounting changes
+ • Note restatements or discontinued operations
+ • Pay attention to scale indicators (millions/thousands)
+ • Extract multiple periods when available
+
+ Document to analyze: {file_path}
settings.py DELETED
@@ -1,54 +0,0 @@
- import os
- from pathlib import Path
- from dotenv import load_dotenv
-
- load_dotenv()
-
-
- class Settings:
-     GOOGLE_AI_API_KEY = os.getenv("GOOGLE_API_KEY")
-     MAX_FILE_SIZE_MB = 50
-     SUPPORTED_FILE_TYPES = [
-         "pdf",
-         "txt",
-         "png",
-         "jpg",
-         "jpeg",
-         "docx",
-         "xlsx",
-         "csv",
-         "md",
-         "json",
-         "xml",
-         "html",
-         "py",
-         "js",
-         "ts",
-         "doc",
-         "xls",
-         "ppt",
-         "pptx",
-     ]
-     # Use /tmp for temporary files on Hugging Face Spaces (or override with TEMP_DIR env var)
-     TEMP_DIR = Path(os.getenv("TEMP_DIR", "/tmp/data_extractor_temp"))
-     DOCKER_IMAGE = os.getenv("DOCKER_IMAGE", "python:3.12-slim")
-     COORDINATOR_MODEL = os.getenv("COORDINATOR_MODEL", "gemini-2.5-pro")
-     PROMPT_ENGINEER_MODEL = os.getenv("PROMPT_ENGINEER_MODEL", "gemini-2.5-pro")
-     DATA_EXTRACTOR_MODEL = os.getenv("DATA_EXTRACTOR_MODEL", "gemini-2.5-pro")
-     DATA_ARRANGER_MODEL = os.getenv("DATA_ARRANGER_MODEL", "gemini-2.5-pro")
-     CODE_GENERATOR_MODEL = os.getenv("CODE_GENERATOR_MODEL", "gemini-2.5-pro")
-
-     COORDINATOR_MODEL_THINKING_BUDGET=2048
-     PROMPT_ENGINEER_MODEL_THINKING_BUDGET=2048
-     DATA_EXTRACTOR_MODEL_THINKING_BUDGET=-1
-     DATA_ARRANGER_MODEL_THINKING_BUDGET=3072
-     CODE_GENERATOR_MODEL_THINKING_BUDGET=3072
-
-     @classmethod
-     def validate_config(cls):
-         if not cls.GOOGLE_API_KEY:
-             raise ValueError("GOOGLE_API_KEY required")
-         cls.TEMP_DIR.mkdir(exist_ok=True)
-
-
- settings = Settings()
terminal_stream.py CHANGED
@@ -20,6 +20,9 @@ class TerminalStreamManager:
         self.command_queue = Queue()
         self.is_running = False
         self.current_process = None
+        self.server = None
+        self.server_thread = None
+        self.loop = None

     async def register_client(self, websocket):
         """Register a new WebSocket client."""
@@ -174,6 +177,49 @@ class TerminalStreamManager:
             pass
         finally:
             await self.unregister_client(websocket)
+
+    def stop_server(self):
+        """Stop the WebSocket server gracefully."""
+        if self.server:
+            logger.info("Stopping terminal WebSocket server...")
+            self.is_running = False
+
+            # Close all client connections
+            if self.clients:
+                import asyncio
+                try:
+                    loop = asyncio.get_event_loop()
+                    for client in self.clients.copy():
+                        try:
+                            loop.create_task(client.close())
+                        except Exception as e:
+                            logger.warning(f"Error closing client connection: {e}")
+                    self.clients.clear()
+                except Exception as e:
+                    logger.warning(f"Error closing client connections: {e}")
+
+            # Terminate current process if running
+            if self.current_process:
+                try:
+                    self.current_process.terminate()
+                    self.current_process = None
+                except Exception as e:
+                    logger.warning(f"Error terminating process: {e}")
+
+            # Close the server
+            try:
+                if hasattr(self.server, 'close'):
+                    self.server.close()
+
+                # Stop the event loop if it exists
+                if self.loop and self.loop.is_running():
+                    self.loop.call_soon_threadsafe(self.loop.stop)
+
+                logger.info("Terminal WebSocket server stopped")
+            except Exception as e:
+                logger.error(f"Error stopping WebSocket server: {e}")
+        else:
+            logger.info("Terminal WebSocket server was not running")

 # Global terminal manager instance
 terminal_manager = TerminalStreamManager()
@@ -185,13 +231,17 @@ async def start_websocket_server(host='localhost', port=8765):
     async def handler(websocket, path):
         await terminal_manager.handle_client(websocket, path)

-    return await websockets.serve(handler, host, port)
+    server = await websockets.serve(handler, host, port)
+    terminal_manager.server = server
+    terminal_manager.is_running = True
+    return server

 def run_websocket_server():
     """Run WebSocket server in a separate thread."""
     def start_server():
         loop = asyncio.new_event_loop()
         asyncio.set_event_loop(loop)
+        terminal_manager.loop = loop

         try:
             server = loop.run_until_complete(start_websocket_server())
@@ -199,7 +249,10 @@ def run_websocket_server():
             loop.run_forever()
         except Exception as e:
             logger.error(f"Error starting WebSocket server: {e}")
+        finally:
+            logger.info("WebSocket server loop ended")

     thread = threading.Thread(target=start_server, daemon=True)
+    terminal_manager.server_thread = thread
     thread.start()
     return thread
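One plausible way to wire the new `stop_server()` into application shutdown; using `atexit` is an assumption, since the app may instead hook Gradio's own lifecycle:

```python
import atexit

from terminal_stream import run_websocket_server, terminal_manager

thread = run_websocket_server()                # daemon thread hosting the asyncio loop
atexit.register(terminal_manager.stop_server)  # close clients, process, server, loop
```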
utils/logger.py CHANGED
@@ -1,22 +1,42 @@
  import logging
+ import logging.handlers
- from datetime import datetime
+ from datetime import datetime, timedelta
  from pathlib import Path
+ import os
+ import glob

  class AgentLogger:
- def __init__(self, log_dir="logs"):
+ def __init__(self, log_dir="logs", max_bytes=10*1024*1024, backup_count=5, cleanup_days=7):
  self.log_dir = Path(log_dir)
  self.log_dir.mkdir(exist_ok=True)
+ self.max_bytes = max_bytes
+ self.backup_count = backup_count
+ self.cleanup_days = cleanup_days
+
  self.logger = logging.getLogger("agent_logger")
  self.logger.setLevel(logging.DEBUG)
  formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
+
+ # Console handler
  console_handler = logging.StreamHandler()
  console_handler.setLevel(logging.INFO)
  console_handler.setFormatter(formatter)
+ self.logger.addHandler(console_handler)
+
+ # Rotating file handler
+ log_file = self.log_dir / f"agents_{datetime.now().strftime('%Y%m%d')}.log"
- file_handler = logging.FileHandler(self.log_dir / f"agents_{datetime.now().strftime('%Y%m%d')}.log")
+ file_handler = logging.handlers.RotatingFileHandler(
+ log_file,
+ maxBytes=max_bytes,
+ backupCount=backup_count,
+ encoding='utf-8'
+ )
  file_handler.setLevel(logging.DEBUG)
  file_handler.setFormatter(formatter)
- self.logger.addHandler(console_handler)
  self.logger.addHandler(file_handler)
+
+ # Clean up old log files on startup
+ self.cleanup_old_logs()

  def log_workflow_step(self, agent_name, message):
  self.logger.info(f"{agent_name}: {message}")
@@ -26,5 +46,72 @@ class AgentLogger:

  def log_inter_agent_pass(self, from_agent, to_agent, data_size):
  self.logger.info(f"🔗 PASS: {from_agent} → {to_agent} | Size: {data_size}")
+
+ def cleanup_old_logs(self):
+ """Clean up log files older than cleanup_days."""
+ try:
+ cutoff_date = datetime.now() - timedelta(days=self.cleanup_days)
+ log_pattern = str(self.log_dir / "agents_*.log*")
+
+ deleted_count = 0
+ for log_file_path in glob.glob(log_pattern):
+ log_file = Path(log_file_path)
+ try:
+ # Get file modification time
+ file_mtime = datetime.fromtimestamp(log_file.stat().st_mtime)
+
+ if file_mtime < cutoff_date:
+ log_file.unlink()
+ deleted_count += 1
+ print(f"Deleted old log file: {log_file.name}")
+
+ except Exception as e:
+ print(f"Error deleting log file {log_file}: {e}")
+
+ if deleted_count > 0:
+ print(f"Cleaned up {deleted_count} old log files")
+
+ except Exception as e:
+ print(f"Error during log cleanup: {e}")
+
+ def get_log_stats(self):
+ """Get statistics about log files."""
+ try:
+ log_pattern = str(self.log_dir / "agents_*.log*")
+ log_files = list(glob.glob(log_pattern))
+
+ total_size = 0
+ file_info = []
+
+ for log_file_path in log_files:
+ log_file = Path(log_file_path)
+ try:
+ size = log_file.stat().st_size
+ mtime = datetime.fromtimestamp(log_file.stat().st_mtime)
+
+ total_size += size
+ file_info.append({
+ 'name': log_file.name,
+ 'size_mb': round(size / (1024*1024), 2),
+ 'modified': mtime.strftime('%Y-%m-%d %H:%M:%S')
+ })
+ except Exception as e:
+ print(f"Error reading log file {log_file}: {e}")
+
+ return {
+ 'total_files': len(log_files),
+ 'total_size_mb': round(total_size / (1024*1024), 2),
+ 'files': file_info
+ }
+
+ except Exception as e:
+ print(f"Error getting log stats: {e}")
+ return {'error': str(e)}

- agent_logger = AgentLogger()
+ # Create global logger with configuration
+ agent_logger = AgentLogger(
+ log_dir="logs",
+ max_bytes=10*1024*1024,  # 10MB per file
+ backup_count=5,  # Keep 5 backup files
+ cleanup_days=7  # Delete files older than 7 days
+ )
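
A quick sketch of how the new rotation knobs might be used; the values below are arbitrary, and everything else mirrors the class as committed above.

```python
# Tune rotation and retention per deployment (values are illustrative).
from utils.logger import AgentLogger

audit_logger = AgentLogger(
    log_dir="logs/audit",
    max_bytes=5 * 1024 * 1024,  # rotate at 5 MB instead of the 10 MB default
    backup_count=3,             # keep three rotated backups
    cleanup_days=30,            # retain a month of history
)

audit_logger.log_workflow_step("Data Extractor", "started document parse")
print(audit_logger.get_log_stats())  # totals plus per-file size and mtime
```

One thing to watch: every instance attaches handlers to the same named logger (`logging.getLogger("agent_logger")`), so constructing a second `AgentLogger` duplicates output; an `if not self.logger.handlers:` guard in `__init__` would avoid that.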
utils/prompt_loader.py ADDED
@@ -0,0 +1,188 @@
+ """
+ Utility for loading prompts and instructions from external JSON files.
+ """
+
+ import os
+ import json
+ from pathlib import Path
+ from typing import Dict, Optional, List, Any
+ import logging
+
+ logger = logging.getLogger(__name__)
+
+ class PromptLoader:
+ """Loads prompts and instructions from external JSON files."""
+
+ def __init__(self, base_dir: Optional[Path] = None):
+ """Initialize with base directory."""
+ if base_dir is None:
+ # Default to the project root directory
+ self.base_dir = Path(__file__).parent.parent
+ else:
+ self.base_dir = Path(base_dir)
+
+ self.prompts_dir = self.base_dir / "prompts"
+ self.instructions_dir = self.base_dir / "instructions"
+
+ # Cache for loaded files
+ self._cache: Dict[str, Dict[str, Any]] = {}
+
+ def load_prompt(self, prompt_name: str, **kwargs) -> str:
+ """
+ Load a prompt from the prompts directory.
+ Supports both .txt (plain text) and .json formats.
+
+ Args:
+ prompt_name: Name of the prompt file (without extension)
+ **kwargs: Variables to substitute in the prompt
+
+ Returns:
+ The loaded prompt with variables substituted
+ """
+ # Try .txt format first (preferred), then fall back to .json
+ txt_path = self.prompts_dir / f"{prompt_name}.txt"
+ json_path = self.prompts_dir / f"{prompt_name}.json"
+
+ if txt_path.exists():
+ # Load plain text file
+ logger.debug(f"Loading prompt from .txt file: {txt_path}")
+ with open(txt_path, 'r', encoding='utf-8') as f:
+ prompt_text = f.read().strip()
+ elif json_path.exists():
+ # Load JSON file (legacy format)
+ logger.debug(f"Loading prompt from .json file: {json_path}")
+ data = self._load_json_file(json_path)
+ prompt_data = data.get("prompt", "")
+
+ # Handle both string and list formats
+ if isinstance(prompt_data, list):
+ # Join list elements with newlines to create a single string
+ prompt_text = "\n".join(prompt_data)
+ else:
+ prompt_text = prompt_data
+ else:
+ raise FileNotFoundError(f"Prompt file not found: {prompt_name} (checked .txt and .json)")
+
+ # Substitute variables if provided
+ if kwargs:
+ try:
+ logger.debug(f"Formatting prompt {prompt_name} with variables: {list(kwargs.keys())}")
+ prompt_text = prompt_text.format(**kwargs)
+ logger.debug(f"Successfully formatted prompt {prompt_name}")
+ except KeyError as e:
+ logger.warning(f"Missing variable {e} in prompt {prompt_name}")
+ except Exception as e:
+ logger.error(f"Error formatting prompt {prompt_name}: {e}")
+ logger.error(f"Available variables: {list(kwargs.keys())}")
+
+ return prompt_text
+
+ def load_instruction(self, instruction_name: str) -> str:
+ """
+ Load instructions from the instructions directory as a single string.
+
+ Args:
+ instruction_name: Name of the instruction file (without .json extension)
+
+ Returns:
+ The loaded instructions as a joined string
+ """
+ instructions_list = self.load_instructions_as_list(instruction_name)
+ return "\n".join(instructions_list)
+
+ def load_instructions_as_list(self, instruction_name: str) -> List[str]:
+ """
+ Load instructions and return as a list of strings.
+
+ Args:
+ instruction_name: Name of the instruction file (without .json extension)
+
+ Returns:
+ List of instruction strings
+ """
+ instruction_path = self.instructions_dir / f"{instruction_name}.json"
+ data = self._load_json_file(instruction_path)
+
+ instructions = data.get("instructions", [])
+
+ # Filter out empty strings
+ return [instruction for instruction in instructions if instruction.strip()]
+
+ def _load_json_file(self, file_path: Path) -> Dict[str, Any]:
+ """Load JSON file content with caching."""
+ cache_key = str(file_path)
+
+ # Check cache first
+ if cache_key in self._cache:
+ return self._cache[cache_key]
+
+ try:
+ if not file_path.exists():
+ raise FileNotFoundError(f"File not found: {file_path}")
+
+ with open(file_path, 'r', encoding='utf-8') as f:
+ data = json.load(f)
+
+ # Cache the data
+ self._cache[cache_key] = data
+ logger.debug(f"Loaded {file_path.name}: {type(data)} with {len(data)} keys")
+
+ return data
+
+ except json.JSONDecodeError as e:
+ logger.error(f"Invalid JSON in file {file_path}: {e}")
+ raise
+ except Exception as e:
+ logger.error(f"Error loading file {file_path}: {e}")
+ raise
+
+ def clear_cache(self):
+ """Clear the file cache."""
+ self._cache.clear()
+ logger.debug("Prompt loader cache cleared")
+
+ def list_prompts(self) -> List[str]:
+ """List all available prompt files."""
+ if not self.prompts_dir.exists():
+ return []
+
+ prompts = []
+ for file_path in self.prompts_dir.rglob("*.json"):
+ # Get relative path from prompts dir
+ rel_path = file_path.relative_to(self.prompts_dir)
+ # Remove .json extension and convert to forward slashes
+ prompt_name = str(rel_path.with_suffix(''))
+ prompts.append(prompt_name)
+
+ return sorted(prompts)
+
+ def list_instructions(self) -> List[str]:
+ """List all available instruction files."""
+ if not self.instructions_dir.exists():
+ return []
+
+ instructions = []
+ for file_path in self.instructions_dir.rglob("*.json"):
+ # Get relative path from instructions dir
+ rel_path = file_path.relative_to(self.instructions_dir)
+ # Remove .json extension and convert to forward slashes
+ instruction_name = str(rel_path.with_suffix(''))
+ instructions.append(instruction_name)
+
+ return sorted(instructions)
+
+ def get_info(self) -> dict:
+ """Get information about the prompt loader."""
+ return {
+ "base_dir": str(self.base_dir),
+ "prompts_dir": str(self.prompts_dir),
+ "instructions_dir": str(self.instructions_dir),
+ "prompts_dir_exists": self.prompts_dir.exists(),
+ "instructions_dir_exists": self.instructions_dir.exists(),
+ "available_prompts": self.list_prompts(),
+ "available_instructions": self.list_instructions(),
+ "cache_size": len(self._cache)
+ }
+
+ # Global instance
+ prompt_loader = PromptLoader()
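
A short sketch of the loader in use. The `examples/greeting` template and its `{name}` placeholder are hypothetical; `agents/data_extractor` is an instruction file the workflow below actually loads.

```python
from utils.prompt_loader import prompt_loader

# prompts/examples/greeting.txt might contain: "Hello {name}, begin the review."
greeting = prompt_loader.load_prompt("examples/greeting", name="analyst")

# Instruction files are JSON with an "instructions" array; blank entries are dropped
steps = prompt_loader.load_instructions_as_list("agents/data_extractor")

# Directory listing of available *.json prompt and instruction files
print(prompt_loader.get_info()["available_prompts"])
```

Two caveats visible in the code: `list_prompts()` only globs `*.json`, so `.txt` prompts (the preferred format) never appear in `get_info()` even though `load_prompt()` reads them; and substitution uses `str.format`, so any literal braces in a template must be doubled, which is presumably why the workflow below double-checks for a surviving `{extracted_data}` placeholder.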
workflow/financial_workflow.py CHANGED
@@ -17,6 +17,7 @@ from agno.workflow import Workflow
  from agno.utils.log import logger
  from agno.tools.shell import ShellTools
  from config.settings import settings
+ from utils.prompt_loader import prompt_loader


  # Structured Output Models to avoid JSON parsing issues
@@ -67,106 +68,21 @@ class FinancialDocumentWorkflow(Workflow):

  description: str = "Financial document analysis workflow with data extraction, organization, and Excel generation"

-
-
  # Data Extractor Agent - Structured output eliminates JSON parsing issues
  data_extractor: Agent = Agent(
- model=Gemini(id=settings.DATA_EXTRACTOR_MODEL,thinking_budget=settings.DATA_EXTRACTOR_MODEL_THINKING_BUDGET),
+ model=Gemini(id=settings.DATA_EXTRACTOR_MODEL,thinking_budget=settings.DATA_EXTRACTOR_MODEL_THINKING_BUDGET,api_key=settings.GOOGLE_API_KEY),
  description="Expert financial data extraction specialist",
- instructions=[
- "Extract comprehensive financial data from documents with these priorities:",
- "Identify and classify the document type: Income Statement, Balance Sheet, Cash Flow Statement, 10-K, 10-Q, Annual Report, Quarterly/Interim Report, Prospectus, Earnings Release, Proxy Statement, Investor Presentation, Press Release, or other",
- "Extract report version: audited, unaudited, restated, pro forma",
- "Capture language, country/jurisdiction, and file format (PDF, XLSX, HTML, etc.)",
- "Extract company name and unique identifiers: LEI, CIK, ISIN, Ticker",
- "Extract reporting entity: consolidated, subsidiary, segment",
- "Extract fiscal year and period covered (start and end dates)",
- "Extract all reporting, publication, and filing dates",
- "Extract currency and any currency translation notes",
- "Extract auditors name, if present",
- "Identify financial statement presentation style: single-step, multi-step, consolidated, segmental",
- "Capture table and note references for each data point",
- "Extract total revenue/net sales (with by-product/service, segment, and geography breakdowns if disclosed)",
- "Extract COGS or cost of sales",
- "Extract gross profit and gross margin",
- "Extract operating expenses: R&D, SG&A, advertising, depreciation, amortization",
- "Extract operating income (EBIT) and EBIT margin",
- "Extract non-operating items: interest income/expense, other income/expenses",
- "Extract pretax income, income tax expense, and net income (with breakdowns: continuing, discontinued ops)",
- "Extract basic and diluted EPS",
- "Extract comprehensive and other comprehensive income items",
- "Extract YoY and sequential income comparisons (if available)",
- "Extract current assets: cash and equivalents, marketable securities, accounts receivable (gross/net), inventory (raw, WIP, finished), prepaid expenses, other",
- "Extract non-current assets: PP&E (gross/net), intangible assets, goodwill, LT investments, deferred tax assets, right-of-use assets, other",
- "Extract current liabilities: accounts payable, accrued expenses, short-term debt, lease liabilities, taxes payable, other",
- "Extract non-current liabilities: long-term debt, deferred tax liabilities, pensions, lease obligations, other",
- "Extract total shareholders equity: common/ordinary stock, retained earnings, additional paid-in capital, treasury stock, accumulated OCI, minority interest",
- "Extract book value per share",
- "Extract cash flows: net cash from operating, investing, and financing activities",
- "Extract key cash flow line items: net cash from ops, capex, acquisitions/disposals, dividends, share buybacks, debt activities",
- "Extract non-cash adjustments: depreciation, amortization, SBC, deferred taxes, impairments, gain/loss on sale",
- "Extract profitability ratios: gross margin, operating margin, net margin, EBITDA margin",
- "Extract return ratios: ROE, ROA, ROIC",
- "Extract liquidity/solvency: current ratio, quick ratio, debt/equity, interest coverage",
- "Extract efficiency: asset turnover, inventory turnover, receivables turnover",
- "Extract per-share metrics: EPS (basic/diluted), BVPS, FCF per share",
- "Extract segmental/geographical/operational ratios and breakdowns",
- "Extract shares outstanding, share class details, voting rights",
- "Extract dividends declared/paid (amount, dates)",
- "Extract buyback authorization/utilization details",
- "Extract employee count (average, period-end)",
- "Extract store/branch/office count",
- "Extract customer/user/subscriber numbers (active/paying, ARPU, churn, MAU/DAU)",
- "Extract units shipped/sold, production volumes, operational stats",
- "Extract key management guidance/forecasts if present",
- "Extract risk factors, uncertainties, and forward-looking statements",
- "Extract ESG/sustainability data where available (emissions, board diversity, etc.)",
- "Flag any restatements, adjustments, or one-off items",
- "Highlight material non-recurring, extraordinary, or unusual items (gains/losses, litigation, impairments, restructuring)",
- "Identify related-party transactions and accounting policy changes",
- "For each data point, provide a confidence score (0–1) based on clarity and documentation",
- "Include table/note reference numbers where possible",
- "Note any ambiguity or extraction limitations for specific data",
- "List all units, scales (millions, thousands), and any conversion performed",
- "Normalize date and currency formats across extracted data",
- "Validate calculations (e.g., assets = liabilities + equity), and flag inconsistencies",
- "Return data in a structured format (JSON/table), with reference and confidence annotation"
- ],
+ instructions=prompt_loader.load_instructions_as_list("agents/data_extractor"),
  response_model=ExtractedFinancialData,
  structured_outputs=True,
  debug_mode=True,
  )
-
-

  # Data Arranger Agent - Organizes data into categories for Excel
  data_arranger: Agent = Agent(
- model=Gemini(id=settings.DATA_ARRANGER_MODEL,thinking_budget=settings.DATA_ARRANGER_MODEL_THINKING_BUDGET),
+ model=Gemini(id=settings.DATA_ARRANGER_MODEL,thinking_budget=settings.DATA_ARRANGER_MODEL_THINKING_BUDGET,api_key=settings.GOOGLE_API_KEY),
  description="Financial data organization and analysis expert",
- instructions=[
- 'Organize the extracted financial data into logical categories based on financial statement types (Income Statement, Balance Sheet, Cash Flow Statement, etc.).',
- 'Group related financial items together (e.g., all revenue items, all expense items, all asset items).',
- 'Ensure each category has a clear, descriptive name that would work as an Excel worksheet tab.',
- 'Always add appropriate headers for Excel templates including: Years (e.g., 2021, 2022, 2023, 2024), Company names or entity identifiers, Financial line item names, and Units of measurement (e.g., "in millions", "in thousands").',
- 'Create column headers that clearly identify what each data column represents.',
- 'Include row headers that clearly identify each financial line item.',
- 'Design categories suitable for comprehensive Excel worksheets, such as: Income Statement Data, Balance Sheet Data, Cash Flow Data, Key Metrics, and Company Information.',
- 'Maintain data integrity - do not modify, calculate, or analyze the original data values.',
- 'Preserve original data formats and units.',
- 'Ensure data is organized in a tabular format suitable for Excel import.',
- 'Include metadata about data sources and reporting periods where available.',
- 'Package everything into a JSON object with the fields: categories (object containing organized data by category), headers (object containing appropriate headers for each category), and metadata (object containing information about data sources, periods, and units).',
- 'Never perform any analysis on the data.',
- 'Do not calculate ratios, growth rates, or trends.',
- 'Do not provide insights or interpretations.',
- 'Do not modify the actual data values.',
- 'Focus solely on organization and proper formatting.',
- 'Save this JSON as \'arranged_financial_data.json\' using the save_file tool.',
- 'Run list_files to verify that the file now exists in the working directory.',
- 'Use read_file to ensure the JSON content was written correctly.',
- 'If the file is missing or the content is incorrect, debug, re-save, and repeat steps',
- 'Only report success after the files presence and validity are fully confirmed.'
- ],
+ instructions=prompt_loader.load_instructions_as_list("agents/data_arranger"),
  tools=[FileTools()], # FileTools for saving arranged data
  # NOTE: Cannot use structured_outputs with tools in Gemini - choosing tools over structured outputs
  markdown=True,
@@ -176,62 +92,17 @@ class FinancialDocumentWorkflow(Workflow):
  exponential_backoff=True,
  retries=10,
  )
-
+
  # Code Generator Agent - Creates Excel generation code
  code_generator = Agent(
  model=Gemini(
  id=settings.CODE_GENERATOR_MODEL,
- thinking_budget=settings.CODE_GENERATOR_MODEL_THINKING_BUDGET
+ thinking_budget=settings.CODE_GENERATOR_MODEL_THINKING_BUDGET,
+ api_key=settings.GOOGLE_API_KEY
  ),
  description="Excel report generator that analyzes JSON data and creates formatted workbooks using shell execution on any OS",
  goal="Generate a professional Excel report from arranged_financial_data.json with multiple worksheets, formatting, and charts",
- instructions=[
- "EXECUTION RULE: Always use run_shell_command() for Python execution. Never use save_to_file_and_run().",
- "",
- "CRITICAL: Always read the file to understand the struction of the JSON First"
- "FIRST, use read_file tool to load 'arranged_financial_data.json'.",
- "SECOND, analyze its structure deeply. Identify all keys, data types, nested structures, and any inconsistencies.",
- "THIRD, create analysis.py to programmatically examine the JSON. Execute using run_shell_command().",
- "FOURTH, based on the analysis, design your Excel structure. Plan worksheets, formatting, and charts needed.",
- "FIFTH, implement generate_excel_report.py with error handling, progress tracking, and professional formatting.",
- "",
- "CRITICAL: Always start Python scripts with:",
- "import os",
- "os.chdir(os.path.dirname(os.path.abspath(__file__)) or '.')",
- "This ensures the script runs in the correct directory regardless of OS.",
- "",
- "Available Tools:",
- "- FileTools: read_file, save_file, list_files",
- "- PythonTools: pip_install_package (ONLY for package installation)",
- "- ShellTools: run_shell_command (PRIMARY execution tool)",
- "",
- "Cross-Platform Execution:",
- "- Try: run_shell_command('python script.py 2>&1')",
- "- If fails on Windows: run_shell_command('python.exe script.py 2>&1')",
- "- PowerShell alternative: run_shell_command('powershell -Command \"python script.py\" 2>&1')",
- "",
- "Verification Commands (Linux/Mac):",
- "- run_shell_command('ls -la *.xlsx')",
- "- run_shell_command('file Financial_Report*.xlsx')",
- "- run_shell_command('du -h *.xlsx')",
- "",
- "Verification Commands (Windows/PowerShell):",
- "- run_shell_command('dir *.xlsx')",
- "- run_shell_command('powershell -Command \"Get-ChildItem *.xlsx\"')",
- "- run_shell_command('powershell -Command \"(Get-Item *.xlsx).Length\"')",
- "",
- "Debug Commands (Cross-Platform):",
- "- Current directory: run_shell_command('pwd') or run_shell_command('cd')",
- "- Python location: run_shell_command('where python') or run_shell_command('which python')",
- "- List files: run_shell_command('dir') or run_shell_command('ls')",
- "",
- "Package Installation:",
- "- pip_install_package('openpyxl')",
- "- Or via shell: run_shell_command('pip install openpyxl')",
- "- Windows: run_shell_command('python -m pip install openpyxl')",
- "",
- "Success Criteria: Excel file exists, size >5KB, no errors in output."
- ],
+ instructions=prompt_loader.load_instructions_as_list("agents/code_generator"),
  expected_output="A Financial_Report_YYYYMMDD_HHMMSS.xlsx file containing formatted data from the JSON with multiple worksheets, professional styling, and relevant charts",
  additional_context="This agent must work on Windows, Mac, and Linux. Always use os.path for file operations and handle path separators correctly. Include proper error handling for cross-platform compatibility.",
  tools=[
@@ -251,12 +122,57 @@ class FinancialDocumentWorkflow(Workflow):
  super().__init__(session_id=session_id, **kwargs)
  self.session_id = session_id or f"financial_workflow_{int(__import__('time').time())}"
  self.session_output_dir = Path(settings.TEMP_DIR) / self.session_id / "output"
+ self.session_input_dir = Path(settings.TEMP_DIR) / self.session_id / "input"
+ self.session_temp_dir = Path(settings.TEMP_DIR) / self.session_id / "temp"
+
+ # Create all session directories
  self.session_output_dir.mkdir(parents=True, exist_ok=True)
+ self.session_input_dir.mkdir(parents=True, exist_ok=True)
+ self.session_temp_dir.mkdir(parents=True, exist_ok=True)

  # Configure tools with correct base directories after initialization
  self._configure_agent_tools()

  logger.info(f"FinancialDocumentWorkflow initialized with session: {self.session_id}")
+
+ def clear_cache(self):
+ """Clear workflow session cache and temporary files."""
+ try:
+ # Clear session state
+ self.session_state.clear()
+ logger.info(f"Cleared workflow cache for session: {self.session_id}")
+
+ # Clean up temporary files (keep input and output)
+ if self.session_temp_dir.exists():
+ import shutil
+ try:
+ shutil.rmtree(self.session_temp_dir)
+ self.session_temp_dir.mkdir(parents=True, exist_ok=True)
+ logger.info(f"Cleaned temporary files for session: {self.session_id}")
+ except Exception as e:
+ logger.warning(f"Could not clean temp directory: {e}")
+
+ except Exception as e:
+ logger.error(f"Error clearing workflow cache: {e}")
+
+ def cleanup_session(self):
+ """Complete cleanup of session including all files."""
+ try:
+ # Clear cache first
+ self.clear_cache()
+
+ # Remove entire session directory
+ session_dir = Path(settings.TEMP_DIR) / self.session_id
+ if session_dir.exists():
+ import shutil
+ try:
+ shutil.rmtree(session_dir)
+ logger.info(f"Completely removed session directory: {session_dir}")
+ except Exception as e:
+ logger.warning(f"Could not remove session directory: {e}")
+
+ except Exception as e:
+ logger.error(f"Error during session cleanup: {e}")

  def _configure_agent_tools(self):
  """Configure agent tools with the correct base directories"""
@@ -274,12 +190,23 @@ class FinancialDocumentWorkflow(Workflow):
  elif isinstance(tool, PythonTools):
  tool.base_dir = self.session_output_dir

- def run(self, file_path: str, use_cache: bool = True) -> RunResponse:
+ def run(self, file_path: str = None, **kwargs) -> RunResponse:
  """
+ Main workflow execution method
  Pure Python workflow execution - no streaming, no JSON parsing issues
  """
+ # Handle file_path from parameter or attribute
+ if file_path is None:
+ file_path = getattr(self, 'file_path', None)
+
+ if file_path is None:
+ raise ValueError("file_path must be provided either as parameter or set as attribute")
+
  logger.info(f"Processing financial document: {file_path}")

+ # use_cache now arrives via kwargs since it is no longer a named parameter
+ use_cache = kwargs.get('use_cache', True)
+
  # Check cache first if enabled
  if use_cache and "final_results" in self.session_state:
  logger.info("Returning cached results")
@@ -300,27 +227,7 @@ class FinancialDocumentWorkflow(Workflow):
  logger.info("Using cached extraction data")
  else:
  document = File(filepath=file_path)
- extraction_prompt = f"""
- Analyze this financial document and extract all relevant financial data points.
-
- Focus on:
- - Company identification, including company name, entity identifiers (e.g., Ticker, CIK, ISIN, LEI), and reporting entity type (consolidated/subsidiary/segment).
- - All reporting period information: fiscal year, period start and end dates, reporting date, publication date, and currency used.
- - Revenue data: total revenue/net sales, breakdown by product/service, segment, and geography if available, and year-over-year growth rates.
- - Expense data: COGS, operating expenses (R&D, SG&A, advertising, depreciation/amortization), interest expenses, taxes, and any non-operating items.
- - Profit data: gross profit, operating income (EBIT/EBITDA), pretax profit, net income, basic and diluted earnings per share (EPS), comprehensive income.
- - Balance sheet items: current assets (cash, securities, receivables, inventories), non-current assets (PP&E, intangibles, goodwill), current liabilities, non-current liabilities, and all categories of shareholders’ equity.
- - Cash flow details: cash from operations, investing, and financing; capex, dividends, buybacks; non-cash adjustments (depreciation, SBC, etc.).
- - Financial ratios: profitability (gross margin, operating margin, net margin), return (ROE, ROA, ROIC), liquidity (current/quick ratio), leverage (debt/equity, interest coverage), efficiency (asset/inventory/receivables turnover), per-share metrics.
- - Capital and shareholder information: shares outstanding, share class details, dividends, and buyback information.
- - Non-financial and operational metrics: employee, store, customer/user counts, production volumes, and operational breakdowns.
- - Extract any additional material metrics, key management guidance, risks, uncertainties, ESG indicators, or forward-looking statements.
- - Flag/annotate any unusual or non-recurring items, restatements, or related-party transactions.
- - For each data point, provide a confidence score (0–1) and, where possible, include reference identifiers (table/note numbers).
- - If units or currencies differ throughout, normalize and annotate the data accordingly.
- Return your extraction in a structured, machine-readable format with references and confidence levels for each field.
- Document path: {file_path}
- """
+ extraction_prompt = prompt_loader.load_prompt("workflow/data_extraction", file_path=file_path)

  extraction_response: RunResponse = self.data_extractor.run(
  extraction_prompt,
@@ -339,42 +246,20 @@ class FinancialDocumentWorkflow(Workflow):
  arrangement_content = self.session_state["arrangement_response"]
  logger.info("Using cached arrangement data")
  else:
- arrangement_prompt = f"""
- You are given raw, extracted financial data. Your task is to reorganize it and prepare it for Excel-based reporting.
-
- ========== WHAT TO DELIVER ==========
- • A single JSON object saved as arranged_financial_data.json
- Fields required: categories, headers, metadata
-
- ========== HOW TO ORGANIZE ==========
- Create distinct, Excel-ready categories (one worksheet each) for logical grouping of financial data. Examples include:
- 1. Income Statement Data
- 2. Balance Sheet Data
- 3. Cash Flow Data
- 4. Company Information / General Data
-
- ========== STEP-BY-STEP ==========
- 1. Map every data point into the most appropriate category above.
- 2. For each category, identify and include all necessary headers for an Excel template, such as years, company names, financial line item names, and units of measurement (e.g., "in millions").
- 3. Ensure data integrity by not modifying, calculating, or analyzing the original data values.
- 4. Preserve original data formats and units.
- 5. Organize data in a tabular format suitable for direct Excel import.
- 6. Include metadata about data sources and reporting periods where available.
- 7. Assemble everything into the JSON schema described under “WHAT TO DELIVER.”
- 8. Save the JSON as arranged_financial_data.json via save_file.
- 9. Use list_files to confirm the file exists, then read_file to validate its content.
- 10. If the file is missing or malformed, fix the issue and repeat steps 8 – 9.
- 11. Only report success after the file passes both existence and content checks.
-
- ========== IMPORTANT RESTRICTIONS ==========
- - Never perform any analysis on the data.
- - Do not calculate ratios, growth rates, or trends.
- - Do not provide insights or interpretations.
- - Do not modify the actual data values.
- - Focus solely on organization and proper formatting.
-
- Extracted Data: {extracted_data.model_dump_json(indent=2)}
- """
+ # Debug: Check extracted data before passing to prompt
+ extracted_json = extracted_data.model_dump_json(indent=2)
+ logger.debug(f"Extracted data size: {len(extracted_json)} characters")
+ logger.debug(f"First 200 chars of extracted data: {extracted_json[:200]}...")
+
+ arrangement_prompt = prompt_loader.load_prompt("workflow/data_arrangement",
+ extracted_data=extracted_json)
+
+ # Debug: Check if prompt contains the actual data or just the placeholder
+ if "{extracted_data}" in arrangement_prompt:
+ logger.error("CRITICAL: Variable substitution failed! Prompt still contains {extracted_data} placeholder")
+ logger.error(f"Prompt length: {len(arrangement_prompt)}")
+ else:
+ logger.info(f"Variable substitution successful. Prompt length: {len(arrangement_prompt)}")

  arrangement_response: RunResponse = self.data_arranger.run(arrangement_prompt)
  arrangement_content = arrangement_response.content
@@ -391,35 +276,7 @@ class FinancialDocumentWorkflow(Workflow):
  execution_success = self.session_state.get("execution_success", False)
  logger.info("Using cached code generation results")
  else:
- code_prompt = f"""
- Your objective: Turn the organized JSON data into a polished, multi-sheet Excel report—and prove that it works.
-
- ========== INPUT ==========
- File: arranged_financial_data.json
- Tool to read it: read_file
-
- ========== WHAT THE PYTHON SCRIPT MUST DO ==========
- 1. Load arranged_financial_data.json and parse its contents.
- 2. For each category in the JSON, create a dedicated worksheet using openpyxl.
- 3. Apply professional touches:
- • Bold, centered headers
- • Appropriate number formats
- • Column-width auto-sizing
- • Borders, cell styles, and freeze panes
- 4. Insert charts (bar, line, or pie) wherever the data lends itself to visualisation.
- 5. Embed key metrics and summary notes prominently in the Executive Summary sheet.
- 6. Name the workbook: Financial_Report_<YYYYMMDD_HHMMSS>.xlsx.
- 7. Wrap every file and workbook operation in robust try/except blocks.
- 8. Log all major steps and any exceptions for easy debugging.
- 9. Save the script via save_to_file_and_run and execute it immediately.
- 10. After execution, use list_files to ensure the Excel file was created.
- 11. Optionally inspect the file (e.g., size or first bytes via read_file) to confirm it is not empty.
- 12. If the workbook is missing or corrupted, refine the code, re-save, and re-run until success.
-
- ========== OUTPUT ==========
- • A fully formatted Excel workbook in the working directory.
- • A concise summary of what ran, any issues encountered, and confirmation that the file exists and opens without error.
- """
+ code_prompt = prompt_loader.load_prompt("workflow/code_generation")

  code_response: RunResponse = self.code_generator.run(code_prompt)
  code_generation_content = code_response.content
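
Taken together, the workflow after this commit can be driven in a few lines. A hedged end-to-end sketch; the sample document path is hypothetical and `settings.TEMP_DIR` must be writable.

```python
from workflow.financial_workflow import FinancialDocumentWorkflow

workflow = FinancialDocumentWorkflow(session_id="demo_session")
try:
    # use_cache now travels through **kwargs in the new run() signature
    response = workflow.run(file_path="samples/annual_report.pdf", use_cache=True)
    print(response.content)
finally:
    workflow.cleanup_session()  # removes the whole session directory tree
```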