methunraj committed
Commit 90b0a17 · 1 Parent(s): 6985bd2

refactor: restructure project with modular prompts and instructions
.claude/settings.local.json DELETED
@@ -1,19 +0,0 @@
- {
-   "permissions": {
-     "allow": [
-       "Bash(mkdir:*)",
-       "Bash(python test:*)",
-       "Bash(/usr/local/bin/python3:*)",
-       "Bash(ls:*)",
-       "Bash(rm:*)",
-       "Bash(python:*)",
-       "Bash(find:*)",
-       "mcp__zen__analyze",
-       "Bash(pkill:*)",
-       "Bash(touch:*)",
-       "Bash(docker build:*)",
-       "Bash(/dev/null)"
-     ],
-     "deny": []
-   }
- }

README.md CHANGED
@@ -1,162 +1,266 @@
- ---
- title: Agno Document Analysis
- emoji: 📄
- colorFrom: blue
- colorTo: purple
- sdk: docker
- pinned: false
- license: mit
- ---
-
- # Agno Document Analysis Workflow
-
- A sophisticated document processing application built with Agno v1.7.4 featuring a multi-agent workflow for intelligent document analysis and data extraction.
-
- ## Features
-
- - **3-Agent Workflow**: Data Extractor, Data Arranger, Code Generator
- - **Multi-format Support**: PDF, TXT, PNG, JPG, JPEG, DOCX, XLSX, CSV, MD, JSON, XML, HTML, PY, JS, TS, DOC, XLS, PPT, PPTX
- - **Real-time Processing**: Streaming interface with live updates
- - **Sandboxed Execution**: Safe code execution environment
- - **Beautiful UI**: Modern Gradio interface with custom animations
-
- ## Quick Start
-
- ### Automated Installation
-
- ```bash
- # Clone the repository
- git clone <repository-url>
- cd Data_Extractor
-
- # Quick installation (recommended)
- ./install.sh
-
- # Or use Python setup script
- python setup.py
  ```
-
- ### Manual Installation
-
- ```bash
- # Create virtual environment
- python -m venv data_extractor_env
- source data_extractor_env/bin/activate  # On Windows: data_extractor_env\Scripts\activate
-
- # Install dependencies
- pip install -r requirements.txt
-
- # Create environment file
- cp .env.example .env  # Update with your API keys
-
- # Run the application
- python app.py
  ```
-
- ## Installation Options
-
- ### Requirements Files
-
- - **`requirements-minimal.txt`**: Essential dependencies only (~50 packages)
-   ```bash
-   pip install -r requirements-minimal.txt
-   ```
-
- - **`requirements.txt`**: Complete feature set (~200+ packages)
-   ```bash
-   pip install -r requirements.txt
-   ```
-
- - **`requirements-dev.txt`**: Development dependencies with testing tools
-   ```bash
-   pip install -r requirements-dev.txt
-   ```
-
- ### System Dependencies
-
- Some features require system-level dependencies:
-
- **macOS:**
- ```bash
- brew install tesseract imagemagick poppler
  ```
-
- **Ubuntu/Debian:**
  ```bash
- sudo apt-get install tesseract-ocr libmagickwand-dev poppler-utils
  ```
-
- **Windows:**
  ```bash
- choco install tesseract imagemagick poppler
  ```
-
- ## Usage
-
- 1. **Setup Environment**: Follow installation instructions above
- 2. **Configure API Keys**: Update `.env` file with your API keys
- 3. **Upload Document**: Support for 20+ file formats
- 4. **Select Analysis**: Choose from predefined types or custom prompts
- 5. **Process**: Watch the multi-agent workflow in real-time
- 6. **Download Results**: Get structured data and generated Excel reports
-
- ## Environment Variables
-
- Create a `.env` file with the following variables:
-
- ```bash
- # Required API Keys
- GOOGLE_API_KEY=your_google_api_key_here
- OPENAI_API_KEY=your_openai_api_key_here  # Optional
-
- # Application Settings
- DEBUG=False
- LOG_LEVEL=INFO
- SESSION_TIMEOUT=3600
-
- # File Processing
- MAX_FILE_SIZE=50MB
- SUPPORTED_FORMATS=pdf,docx,xlsx,txt
-
- # Database (Optional)
- DATABASE_URL=sqlite:///data_extractor.db
- ```
-
- ## Advanced Features
-
- ### Financial Document Processing
- - Comprehensive financial data extraction
- - 13-category data organization
- - Excel report generation with charts
- - XBRL and SEC filing support
-
- ### OCR and Image Processing
- - EasyOCR and PaddleOCR integration
- - Tesseract OCR support
- - Advanced image preprocessing
-
- ### Machine Learning Integration
- - TensorFlow and PyTorch support
- - Scikit-learn for data analysis
- - XGBoost and LightGBM for predictions
-
- ## Troubleshooting
-
- For detailed troubleshooting and installation issues, see:
- - [`INSTALLATION.md`](INSTALLATION.md) - Comprehensive installation guide
- - [`FIXES_SUMMARY.md`](FIXES_SUMMARY.md) - Known issues and solutions
-
- ### Common Issues
-
- 1. **Import Errors**: Try minimal installation first
- 2. **OCR Issues**: Install system dependencies
- 3. **Memory Issues**: Use smaller batch sizes
- 4. **API Errors**: Verify API keys in `.env` file
-
- ## Docker Support
-
- ```dockerfile
- # Build and run with Docker
- docker build -t data-extractor .
- docker run -p 7860:7860 --env-file .env data-extractor
  ```

+ # 📊 Financial Data Extractor Using Gemini
+
+ A powerful AI-driven financial document analysis system that automatically extracts, organizes, and generates professional Excel reports from financial documents using Google's Gemini AI models.
+
+ ## 🚀 Features
+
+ ### Core Functionality
+
+ - **📄 Multi-format Document Support**: PDF, DOCX, TXT, and image files
+ - **🔍 Intelligent Data Extraction**: AI-powered extraction of financial data points
+ - **📊 Smart Data Organization**: Automatic categorization into 12+ financial categories
+ - **💻 Excel Report Generation**: Professional multi-worksheet Excel reports with charts
+ - **🎯 Real-time Processing**: Live streaming interface with progress tracking
+
+ ### Advanced Capabilities
+
+ - **🤖 Multi-Agent Workflow**: Specialized AI agents for extraction, arrangement, and code generation
+ - **💾 Session Management**: Persistent storage with SQLite caching
+ - **🔄 Auto-shutdown**: Intelligent resource management for cloud deployments
+ - **📱 Modern UI**: Beautiful Gradio-based web interface
+ - **🌐 Cross-platform**: Works on Windows, Mac, and Linux
+ - **🐳 Docker Support**: Containerized deployment ready
+
+ ## 🏗️ Architecture
+
+ The system uses a sophisticated multi-agent workflow powered by the Agno framework:
+
+ ```
+ 📄 Document Upload
+        ↓
+ 🔍 Data Extractor Agent
+        ↓ (Structured Financial Data)
+ 📊 Data Arranger Agent
+        ↓ (Organized Categories)
+ 💻 Code Generator Agent
+        ↓ (Python Excel Code)
+ 📊 Excel Report Output
  ```
+
+ ### Agent Specialization
+
+ - **Data Extractor**: Extracts financial data points with confidence scoring
+ - **Data Arranger**: Organizes data into 12+ professional categories
+ - **Code Generator**: Creates Python code for Excel report generation
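
For orientation, here is a minimal sketch of that three-step chain as a plain Python driver. The `FinancialDocumentWorkflow` class, the agent attributes, and the `File(filepath=...)` wrapper all appear in the `app.py` changes further down this commit; the import paths are assumptions about this repo's layout and its agno version.

```python
# Hedged sketch of the pipeline above; import paths are assumptions,
# but the attribute and method names match the calls made in app.py.
from agno.media import File  # assumed import path for the File wrapper
from workflow.financial_workflow import FinancialDocumentWorkflow

workflow = FinancialDocumentWorkflow()

# Step 1 - extraction: hand the document to the extractor agent
extraction = workflow.data_extractor.run(
    "Extract all financial data points from this document.",
    files=[File(filepath="annual_report.pdf")],  # hypothetical input file
)

# Step 2 - arrangement: raw data points become worksheet-ready categories
arrangement = workflow.data_arranger.run(
    f"Organize this extracted data: {extraction.content}"
)

# Step 3 - code generation: the agent writes and runs openpyxl code
report = workflow.code_generator.run(
    "Turn arranged_financial_data.json into a formatted Excel workbook."
)
print(report.content)
```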
 
+
+ ## 📋 Requirements
+
+ ### System Requirements
+
+ - Python 3.8+
+ - Google API Key (for Gemini models)
+ - 2GB+ RAM recommended
+ - Cross-platform compatible
+
+ ### Dependencies
+
+ ```
+ agno>=1.7.4
+ gradio
+ google-generativeai
+ PyPDF2
+ Pillow
+ python-dotenv
+ pandas
+ matplotlib
+ openpyxl
+ python-docx
+ lxml
+ markdown
+ requests
+ seaborn
+ sqlalchemy
+ websockets
  ```
+
+ ## 🚀 Quick Start
+
+ ### 1. Clone the Repository
+
+ ```bash
+ git clone <repository-url>
+ cd Data_Extractor_Using_Gemini
+ ```
+
+ ### 2. Install Dependencies
+
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ ### 3. Configure Environment
+
+ Create a `.env` file:
+
+ ```env
+ GOOGLE_API_KEY=your_gemini_api_key_here
  ```
+
+ ### 4. Run the Application
+
  ```bash
+ python app.py
  ```
+
+ The application will be available at `http://localhost:7860`.
+
+ ## 🐳 Docker Deployment
+
+ ### Build and Run
+
  ```bash
+ docker build -t financial-extractor .
+ docker run -p 7860:7860 --env-file .env financial-extractor
  ```
+
+ ### Environment Variables
+
+ - `GOOGLE_API_KEY`: Your Google Gemini API key
+ - `INACTIVITY_TIMEOUT_MINUTES`: Auto-shutdown timeout in minutes (default: 30)
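
Both settings are read from the process environment (the project ships `python-dotenv`, and `config/settings.py` calls `load_dotenv()`). A sketch of the typical read, where the exact parsing in `app.py` may differ:

```python
import os

from dotenv import load_dotenv

load_dotenv()  # pull variables from .env into the environment

GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")  # required; no default
# Auto-shutdown timeout; the 30-minute default matches the README above.
INACTIVITY_TIMEOUT_MINUTES = int(os.getenv("INACTIVITY_TIMEOUT_MINUTES", "30"))
```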
 
 
 
 
+
+ ## 📖 Usage Guide
+
+ ### 1. Upload Document
+
+ - Drag and drop or select your financial document
+ - Supported formats: PDF, DOCX, TXT, PNG, JPG, JPEG
+
+ ### 2. Select Processing Mode
+
+ - **Quick Analysis**: Standard extraction and organization
+ - **Custom Prompts**: Use predefined prompt templates for specific document types
+
+ ### 3. Monitor Progress
+
+ - Real-time streaming interface shows each processing step
+ - Progress indicators for all workflow stages
+ - Live terminal output for code execution
+
+ ### 4. Download Results
+
+ - Professional Excel report with multiple worksheets
+ - Organized data categories with charts and formatting
+ - All intermediate files available for download
+
+ ## 📊 Output Structure
+
+ The generated Excel reports include:
+
+ ### Worksheets
+
+ - **Summary**: Executive overview with key metrics
+ - **Revenue**: Income and revenue streams
+ - **Expenses**: Operating and non-operating expenses
+ - **Assets**: Current and non-current assets
+ - **Liabilities**: Short-term and long-term liabilities
+ - **Equity**: Shareholder equity components
+ - **Cash Flow**: Cash flow statements
+ - **Ratios**: Financial ratio analysis
+ - **Charts**: Visual representations of key data
+ - **Raw Data**: Original extracted data points
+
+ ### Features
+
+ - Professional formatting with consistent styling
+ - Interactive charts and visualizations
+ - Dynamic period handling (auto-detects years/quarters)
+ - Cross-referenced data validation
+ - Print-ready layouts
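
The formatting features listed above map onto plain openpyxl calls. A small, self-contained sketch in that spirit: the Calibri header font, the `#4472C4` blue fill, and the freeze pane follow the standards spelled out in `instructions/agents/code_generator.json`, while the figures are made up.

```python
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill

wb = Workbook()
ws = wb.active
ws.title = "Summary"

# Header row with the bold white-on-blue styling used by the reports
ws.append(["Metric", "FY2023", "FY2024"])
for cell in ws[1]:
    cell.font = Font(name="Calibri", size=14, bold=True, color="FFFFFF")
    cell.fill = PatternFill("solid", fgColor="4472C4")

# Thousands separators for large monetary values (illustrative numbers)
ws.append(["Revenue", 1_200_000, 1_450_000])
for cell in ws[2][1:]:
    cell.number_format = "#,##0.00"

ws.freeze_panes = "A2"  # keep the header row visible while scrolling
wb.save("Financial_Report_demo.xlsx")
```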
+
+ ## 🔧 Configuration
+
+ ### Model Settings
+
+ Configure AI models in `config/settings.py`:
+
+ - Data Extractor Model
+ - Data Arranger Model
+ - Code Generator Model
+ - Thinking budgets and retry settings
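
The field names below come from the validation code added to `config/settings.py` in this commit; the environment-variable indirection and the default model IDs are illustrative assumptions.

```python
import os

class Settings:
    GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
    # Model fields checked by Settings.validate_config(); defaults are hypothetical.
    DATA_EXTRACTOR_MODEL = os.getenv("DATA_EXTRACTOR_MODEL", "gemini-2.0-flash")
    DATA_ARRANGER_MODEL = os.getenv("DATA_ARRANGER_MODEL", "gemini-2.0-flash")
    CODE_GENERATOR_MODEL = os.getenv("CODE_GENERATOR_MODEL", "gemini-2.0-flash")

settings = Settings()
```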
+
+ ### Prompt Customization
+
+ Customize agent instructions in `instructions/agents/`:
+
+ - `data_extractor.json`: Data extraction prompts
+ - `data_arranger.json`: Data organization prompts
+ - `code_generator.json`: Excel generation prompts
+
+ ### Workflow Configuration
+
+ Modify workflow behavior in `workflow/financial_workflow.py`:
+
+ - Agent configurations
+ - Tool assignments
+ - Output formats
+
197
+ ## 🛠️ Development
198
+
199
+ ### Project Structure
200
+
201
+ ```
202
+ ├── app.py # Main Gradio application
203
+ ├── workflow/ # Core workflow implementation
204
+ ├── instructions/ # Agent instruction templates
205
+ ├── prompts/ # Prompt gallery configurations
206
+ ├── config/ # Application settings
207
+ ├── utils/ # Utility functions
208
+ ├── static/ # Static assets
209
+ ├── models/ # Data models
210
+ └── terminal_stream.py # Real-time terminal streaming
211
+ ```
212
+
213
+ ### Key Components
214
+
215
+ - **WorkflowUI**: Main interface controller
216
+ - **FinancialDocumentWorkflow**: Core processing pipeline
217
+ - **AutoShutdownManager**: Resource management
218
+ - **TerminalLogHandler**: Real-time logging
219
+ - **PromptGallery**: Template management
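
How these components fit together at runtime, reconstructed from the `app.py` changes later in this commit (the names are real; the condensed flow is illustrative and not a standalone script):

```python
ui = WorkflowUI()                       # one instance per browser session
ui.workflow.file_path = "/tmp/doc.pdf"  # FinancialDocumentWorkflow under the hood
response = ui.workflow.run_workflow()   # extraction -> arrangement -> code generation
print(response.content)                 # markdown summary rendered in the UI

ui.workflow.clear_cache()               # invoked by the Reset Session button
shutdown_manager.request_shutdown()     # invoked by the Stop Backend button
```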
+
+ ## 🔒 Security & Privacy
+
+ - **Local Processing**: All document processing happens locally
+ - **No Data Storage**: Documents are processed and cleaned up automatically
+ - **API Key Security**: Environment-based configuration
+ - **Session Isolation**: Each session has isolated temporary directories
+
+ ## 🌐 Deployment Options
+
+ ### Local Development
+
+ ```bash
+ python app.py
+ ```
+
+ ### Production (Gunicorn)
+
+ ```bash
+ gunicorn -w 4 -b 0.0.0.0:7860 app:app
  ```
+
+ ### Cloud Platforms
+
+ - **Hugging Face Spaces**: Ready for deployment
+ - **Google Cloud Run**: Containerized deployment
+ - **AWS/Azure**: Standard container deployment
+
+ ## 🤝 Contributing
+
+ 1. Fork the repository
+ 2. Create a feature branch
+ 3. Make your changes
+ 4. Add tests if applicable
+ 5. Submit a pull request
+
+ ## 📝 License
+
+ This project is licensed under the MIT License - see the LICENSE file for details.
+
+ ## 🆘 Support
+
+ ### Common Issues
+
+ - **API Key Errors**: Ensure your Google API key is valid and has Gemini access
+ - **Memory Issues**: Increase system RAM or reduce document size
+ - **Processing Timeouts**: Check network connectivity and API quotas
TERMINAL_README.md DELETED
@@ -1,230 +0,0 @@
- # 🚀 Manus AI-Style Terminal Integration
-
- This document explains the real-time terminal streaming functionality added to the Data Extractor application.
-
- ## 📋 Overview
-
- The terminal integration provides a **Manus AI-style terminal interface** with real-time command execution and streaming output, seamlessly integrated into the existing Gradio application.
-
- ## 🏗️ Architecture
-
- ### Components
-
- 1. **WebSocket Server** (`terminal_stream.py`)
-    - Handles real-time communication between frontend and backend
-    - Manages command execution with streaming output
-    - Supports multiple concurrent connections
-    - Runs on port 8765
-
- 2. **Frontend Terminal** (`static/terminal.html`)
-    - Beautiful Manus AI-inspired terminal interface
-    - Real-time output streaming via WebSocket
-    - Command history navigation
-    - Keyboard shortcuts and controls
-
- 3. **Gradio Integration** (Modified `app.py`)
-    - Added terminal tab to existing interface
-    - Embedded terminal as iframe component
-    - Auto-starts WebSocket server on application launch
-
- ## 🎨 Features
-
- ### Terminal Interface
- - **Real-time Streaming**: Live command output as it happens
- - **Command History**: Navigate with ↑/↓ arrow keys
- - **Interrupt Support**: Ctrl+C to stop running commands
- - **Auto-reconnect**: Automatically reconnects on connection loss
- - **Status Indicators**: Visual connection and execution status
- - **Responsive Design**: Works on desktop and mobile
-
- ### Security
- - **Command Sanitization**: Uses `shlex.split()` for safe command parsing
- - **Process Isolation**: Commands run in separate processes
- - **Error Handling**: Robust error handling and logging
-
- ## 🚀 Usage
-
- ### Starting the Application
- ```bash
- python app.py
- ```
-
- The terminal WebSocket server automatically starts on port 8765.
-
- ### Accessing the Terminal
- 1. Open the Gradio interface (usually http://localhost:7860)
- 2. Click on the "💻 Terminal" tab
- 3. Start typing commands in the terminal interface
-
- ### Keyboard Shortcuts
- - **Enter**: Execute command
- - **↑/↓**: Navigate command history
- - **Ctrl+C**: Interrupt running command
- - **Ctrl+L**: Clear terminal screen
- - **Tab**: Command completion (planned feature)
-
- ## 🔧 Configuration
-
- ### WebSocket Server Settings
- ```python
- # In terminal_stream.py
- WEBSOCKET_HOST = 'localhost'
- WEBSOCKET_PORT = 8765
- ```
-
- ### Terminal Appearance
- Customize the terminal appearance by modifying the CSS in `static/terminal.html`:
-
- ```css
- /* Main terminal colors */
- .terminal-container {
-     background: linear-gradient(135deg, #0d1117 0%, #161b22 100%);
- }
-
- /* Command prompt */
- .prompt {
-     color: #58a6ff;
- }
- ```
-
- ## 📡 WebSocket API
-
- ### Client → Server Messages
-
- #### Execute Command
- ```json
- {
-     "type": "command",
-     "command": "ls -la"
- }
- ```
-
- #### Interrupt Command
- ```json
- {
-     "type": "interrupt"
- }
- ```
-
- ### Server → Client Messages
-
- #### Command Output
- ```json
- {
-     "type": "output",
-     "data": "file1.txt\nfile2.txt",
-     "stream": "stdout",
-     "timestamp": "2024-01-01T12:00:00.000Z"
- }
- ```
-
- #### Command Completion
- ```json
- {
-     "type": "command_complete",
-     "exit_code": 0,
-     "message": "Process exited with code 0",
-     "timestamp": "2024-01-01T12:00:00.000Z"
- }
- ```
-
- ## 🛠️ Development
-
- ### Adding New Features
-
- 1. **Server-side**: Modify `terminal_stream.py`
- 2. **Client-side**: Update `static/terminal.html`
- 3. **Integration**: Adjust `app.py` if needed
-
- ### Testing
-
- ```bash
- # Test WebSocket server independently
- python -c "from terminal_stream import run_websocket_server; run_websocket_server()"
-
- # Test terminal interface
- # Open static/terminal.html in browser
- ```
-
- ## 🔍 Troubleshooting
-
- ### Common Issues
-
- 1. **WebSocket Connection Failed**
-    - Check if port 8765 is available
-    - Verify firewall settings
-    - Check server logs for errors
-
- 2. **Commands Not Executing**
-    - Verify WebSocket connection status
-    - Check terminal logs for errors
-    - Ensure proper command syntax
-
- 3. **Terminal Not Loading**
-    - Check if `static/terminal.html` exists
-    - Verify Gradio file serving configuration
-    - Check browser console for errors
-
- ### Debug Mode
-
- Enable debug logging:
- ```python
- import logging
- logging.getLogger('terminal_stream').setLevel(logging.DEBUG)
- ```
-
- ## 🚀 Advanced Usage
-
- ### Custom Commands
-
- Add custom command handlers in `terminal_stream.py`:
-
- ```python
- async def handle_custom_command(self, command):
-     if command.startswith('custom:'):
-         # Handle custom command
-         await self.broadcast({
-             'type': 'output',
-             'data': 'Custom command executed',
-             'stream': 'stdout'
-         })
-         return True
-     return False
- ```
-
- ### Integration with Workflow
-
- Stream workflow logs to terminal:
-
- ```python
- # In workflow code
- from terminal_stream import terminal_manager
-
- async def log_to_terminal(message):
-     await terminal_manager.broadcast({
-         'type': 'output',
-         'data': message,
-         'stream': 'workflow'
-     })
- ```
-
- ## 📚 Dependencies
-
- - `websockets`: WebSocket server implementation
- - `asyncio`: Async programming support
- - `subprocess`: Command execution
- - `shlex`: Safe command parsing
-
- ## 🎯 Future Enhancements
-
- - [ ] Command auto-completion
- - [ ] File upload/download via terminal
- - [ ] Terminal themes and customization
- - [ ] Multi-session support
- - [ ] Terminal recording and playback
- - [ ] Integration with workflow logging
- - [ ] SSH/remote terminal support
-
- ## 📄 License
-
- This terminal implementation is part of the Data Extractor project and follows the same license terms.

app.py CHANGED
@@ -66,6 +66,7 @@ class AutoShutdownManager:
          self.shutdown_timer = None
          self.app_instance = None
          self.is_shutting_down = False
+         self.manual_shutdown_requested = False
 
          # Setup signal handlers for graceful shutdown
          signal.signal(signal.SIGINT, self._signal_handler)
@@ -76,6 +77,12 @@ class AutoShutdownManager:
 
          logger.info(f"AutoShutdownManager initialized with {timeout_minutes} minute timeout")
 
+     def request_shutdown(self):
+         """Request manual shutdown of the application."""
+         logger.info("Manual shutdown requested")
+         self.manual_shutdown_requested = True
+         self._shutdown_server()
+
      def _signal_handler(self, signum, frame):
          """Handle shutdown signals gracefully."""
          logger.info(f"Received signal {signum}, initiating graceful shutdown...")
@@ -131,11 +138,42 @@ class AutoShutdownManager:
              if self.shutdown_timer:
                  self.shutdown_timer.cancel()
 
+             # Attempt graceful shutdown of components
+             try:
+                 # Stop terminal WebSocket server
+                 if hasattr(terminal_manager, 'stop_server'):
+                     terminal_manager.stop_server()
+                     logger.info("Terminal WebSocket server stopped")
+             except Exception as e:
+                 logger.warning(f"Error stopping terminal server: {e}")
+
              if self.app_instance:
-                 # Gradio doesn't have a direct shutdown method, so we'll use os._exit
-                 logger.info("Shutting down Gradio application")
+                 try:
+                     # Try to close Gradio server gracefully
+                     if hasattr(self.app_instance, 'close'):
+                         self.app_instance.close()
+                         logger.info("Gradio application closed gracefully")
+                     elif hasattr(self.app_instance, 'server'):
+                         if hasattr(self.app_instance.server, 'close'):
+                             self.app_instance.server.close()
+                             logger.info("Gradio server closed")
+                 except Exception as e:
+                     logger.warning(f"Could not close Gradio gracefully: {e}")
+
+             # Give a moment for graceful shutdown
+             import time
+             time.sleep(1)
+
+             # If manual shutdown or graceful methods failed, exit
+             if self.manual_shutdown_requested:
+                 logger.info("Forcing application exit due to manual shutdown request")
                  import os
                  os._exit(0)
+             else:
+                 logger.info("Application shutdown complete")
+                 import sys
+                 sys.exit(0)
+
          except Exception as e:
              logger.error(f"Error during shutdown: {e}")
              import os
@@ -148,10 +186,12 @@ shutdown_manager = AutoShutdownManager()
  class TerminalLogHandler(logging.Handler):
      """Custom logging handler that captures logs for terminal display."""
 
-     def __init__(self):
+     def __init__(self, max_global_logs=1000, max_session_logs=500):
          super().__init__()
-         self.logs = deque(maxlen=1000)  # Keep last 1000 log entries
+         self.logs = deque(maxlen=max_global_logs)  # Keep last N log entries
          self.session_logs = {}  # Per-session logs
+         self.max_session_logs = max_session_logs
+         self.cleanup_counter = 0
 
      def emit(self, record):
          """Emit a log record."""
@@ -183,8 +223,13 @@ class TerminalLogHandler(logging.Handler):
              session_id = getattr(record, 'session_id', None)
              if session_id:
                  if session_id not in self.session_logs:
-                     self.session_logs[session_id] = deque(maxlen=500)
+                     self.session_logs[session_id] = deque(maxlen=self.max_session_logs)
                  self.session_logs[session_id].append(log_entry)
+
+                 # Periodic cleanup of old sessions
+                 self.cleanup_counter += 1
+                 if self.cleanup_counter % 100 == 0:  # Every 100 log entries
+                     self.cleanup_old_sessions()
 
          except Exception as e:
              # Prevent logging errors from breaking the application
@@ -221,6 +266,60 @@ class TerminalLogHandler(logging.Handler):
              ''')
 
          return ''.join(html_lines)
+
+     def cleanup_old_sessions(self, max_sessions=10):
+         """Clean up old session logs to prevent memory buildup."""
+         if len(self.session_logs) > max_sessions:
+             # Keep only the most recent sessions
+             sessions_by_activity = []
+             current_time = datetime.now()
+
+             for session_id, logs in self.session_logs.items():
+                 if logs:
+                     # Get the timestamp of the last log entry
+                     last_log_time = logs[-1].get('timestamp', '00:00:00')
+                     try:
+                         # Convert to datetime for comparison (assume today)
+                         log_time = datetime.strptime(last_log_time, '%H:%M:%S').replace(
+                             year=current_time.year,
+                             month=current_time.month,
+                             day=current_time.day
+                         )
+                         sessions_by_activity.append((session_id, log_time))
+                     except:
+                         # If parsing fails, assume it's old
+                         sessions_by_activity.append((session_id, current_time - timedelta(hours=24)))
+                 else:
+                     # Empty logs are old
+                     sessions_by_activity.append((session_id, current_time - timedelta(hours=24)))
+
+             # Sort by activity time (most recent first)
+             sessions_by_activity.sort(key=lambda x: x[1], reverse=True)
+
+             # Keep only the most recent sessions
+             sessions_to_keep = set(session_id for session_id, _ in sessions_by_activity[:max_sessions])
+
+             # Remove old sessions
+             removed_count = 0
+             for session_id in list(self.session_logs.keys()):
+                 if session_id not in sessions_to_keep:
+                     del self.session_logs[session_id]
+                     removed_count += 1
+
+             if removed_count > 0:
+                 print(f"Cleaned up {removed_count} old session logs")
+
+     def get_memory_usage(self):
+         """Get memory usage statistics for the log handler."""
+         total_logs = len(self.logs)
+         total_session_logs = sum(len(logs) for logs in self.session_logs.values())
+
+         return {
+             'global_logs': total_logs,
+             'session_count': len(self.session_logs),
+             'total_session_logs': total_session_logs,
+             'total_logs': total_logs + total_session_logs
+         }
 
  # Global terminal log handler
  terminal_log_handler = TerminalLogHandler()
@@ -1739,241 +1838,103 @@ def create_gradio_app():
 
                  time.sleep(1)  # Brief pause for UI update
 
-                 # Step 1: Data Extraction
+                 # Run the complete workflow - it handles all steps internally
                  logger.info("=" * 60)
-                 logger.info("🔍 STEP 1/4: DATA EXTRACTION PHASE")
+                 logger.info("🚀 STARTING FINANCIAL WORKFLOW")
                  logger.info("=" * 60)
-                 logger.info("📋 Initializing financial data extraction agent...")
+                 progress_html = "🚀 <strong>Running complete financial analysis workflow...</strong>"
+                 yield (progress_html, create_step_html("extraction"), "", gr.Column(visible=False), session_state)
+
+                 logger.info(f"📄 Processing document: {temp_path}")
+                 logger.info("🔧 Workflow will handle: extraction → arrangement → code generation → execution")
+
+                 # Execute workflow with step-by-step UI updates
+
+                 # Step 1: Data Extraction
                  progress_html = "🔍 <strong>Step 1/4: Extracting financial data from document...</strong>"
                  yield (progress_html, create_step_html("extraction"), "", gr.Column(visible=False), session_state)
 
-                 # Check for cached extraction
-                 if "extracted_data" in ui.workflow.session_state:
-                     logger.info("💾 Using cached extraction data from previous run")
-                     logger.info("⏩ Skipping extraction step - data already available")
-                     time.sleep(0.5)  # Brief pause to show step
-                 else:
-                     logger.info(f"🔄 Starting fresh data extraction from document: {temp_path}")
-                     logger.info("📄 Creating document object for analysis...")
-                     # Perform data extraction
-                     document = File(filepath=temp_path)
-                     logger.info("✅ Document object created successfully")
-
-                     extraction_prompt = f"""
-                     Analyze this financial document and extract all relevant financial data points.
-
-                     Focus on:
-                     - Company identification and reporting period
-                     - Revenue, expenses, profits, and losses
-                     - Assets, liabilities, and equity
-                     - Cash flows and financial ratios
-                     - Any other key financial metrics
-
-                     Document path: {temp_path}
-                     """
-
-                     logger.info("🤖 Calling data extractor agent with financial analysis prompt")
-                     logger.info("⏳ This may take 30-60 seconds depending on document complexity...")
-
-                     extraction_response = ui.workflow.data_extractor.run(
-                         extraction_prompt,
-                         files=[document]
-                     )
-                     extracted_data = extraction_response.content
-
-                     logger.info("🎉 Data extraction agent completed successfully!")
-                     logger.info(f"📊 Extracted {len(extracted_data.data_points)} financial data points")
-
-                     # Cache the result
-                     ui.workflow.session_state["extracted_data"] = extracted_data.model_dump()
-                     logger.info(f"💾 Cached extraction results for session {ui.session_id}")
-                     logger.info("✅ Step 1 COMPLETED - Data extraction successful")
-
-                 # Step 2: Data Arrangement
-                 logger.info("=" * 60)
-                 logger.info("📊 STEP 2/4: DATA ORGANIZATION PHASE")
-                 logger.info("=" * 60)
-                 progress_html = "📊 <strong>Step 2/4: Organizing and analyzing financial data...</strong>"
-                 yield (progress_html, create_step_html("arrangement"), "", gr.Column(visible=False), session_state)
-
-                 if "arrangement_response" in ui.workflow.session_state:
-                     logger.info("💾 Using cached data arrangement from previous run")
-                     logger.info("⏩ Skipping organization step - data already structured")
-                     time.sleep(0.5)  # Brief pause to show step
-                 else:
-                     logger.info("🔄 Starting fresh data organization and analysis")
-                     # Get extracted data for arrangement
-                     extracted_data_dict = ui.workflow.session_state["extracted_data"]
-                     logger.info(f"📋 Retrieved {len(extracted_data_dict.get('data_points', []))} data points for organization")
-                     logger.info("🏗️ Preparing to organize data into 12 financial categories...")
-
-                     arrangement_prompt = f"""
-                     You are given raw, extracted financial data. Your task is to reorganize it and prepare it for Excel-based reporting.
-
-                     ========== WHAT TO DELIVER ==========
-                     • A single JSON object saved as arranged_financial_data.json
-                     • Fields required: categories, key_metrics, insights, summary
-
-                     ========== HOW TO ORGANIZE ==========
-                     Create 12 distinct, Excel-ready categories (one worksheet each):
-                     1. Executive Summary & Key Metrics
-                     2. Income Statement / P&L
-                     3. Balance Sheet – Assets
-                     4. Balance Sheet – Liabilities & Equity
-                     5. Cash-Flow Statement
-                     6. Financial Ratios & Analysis
-                     7. Revenue Analysis
-                     8. Expense Analysis
-                     9. Profitability Analysis
-                     10. Liquidity & Solvency
-                     11. Operational Metrics
-                     12. Risk Assessment & Notes
-
-                     ========== STEP-BY-STEP ==========
-                     1. Map every data point into the most appropriate category above.
-                     2. Calculate or aggregate key financial metrics where possible.
-                     3. Add concise insights for trends, anomalies, or red flags.
-                     4. Write an executive summary that highlights the most important findings.
-                     5. Assemble everything into the JSON schema described under "WHAT TO DELIVER."
-                     6. Save the JSON as arranged_financial_data.json via save_file.
-                     7. Use list_files to confirm the file exists, then read_file to validate its content.
-                     8. If the file is missing or malformed, fix the issue and repeat steps 6 – 7.
-                     9. Only report success after the file passes both existence and content checks.
-                     10. Conclude with a short, plain-language summary of what was organized.
-
-                     Extracted Data: {json.dumps(extracted_data_dict, indent=2)}
-                     """
-
-                     logger.info("Calling data arranger to organize financial data into 12 categories")
-                     arrangement_response = ui.workflow.data_arranger.run(arrangement_prompt)
-                     arrangement_content = arrangement_response.content
-
-                     # Cache the result
-                     ui.workflow.session_state["arrangement_response"] = arrangement_content
-                     logger.info("Data organization completed successfully - financial data categorized")
-                     logger.info(f"Cached arrangement results for session {ui.session_id}")
-
-                 # Step 3: Code Generation
-                 logger.info("Step 3: Starting code generation...")
-                 progress_html = "💻 <strong>Step 3/4: Generating Python code for Excel reports...</strong>"
-                 yield (progress_html, create_step_html("code_generation"), "", gr.Column(visible=False), session_state)
-
-                 if "code_generation_response" in ui.workflow.session_state:
-                     logger.info("Using cached code generation results from previous run")
-                     code_generation_content = ui.workflow.session_state["code_generation_response"]
-                     execution_success = ui.workflow.session_state.get("execution_success", False)
-                     logger.info(f"Previous execution status: {'Success' if execution_success else 'Failed'}")
-                     time.sleep(0.5)  # Brief pause to show step
-                 else:
-                     logger.info("Starting fresh Python code generation for Excel report creation")
-                     code_prompt = f"""
-                     Your objective: Turn the organized JSON data into a polished, multi-sheet Excel report—and prove that it works.
-
-                     ========== INPUT ==========
-                     File: arranged_financial_data.json
-                     Tool to read it: read_file
-
-                     ========== WHAT THE PYTHON SCRIPT MUST DO ==========
-                     1. Load arranged_financial_data.json and parse its contents.
-                     2. For each category in the JSON, create a dedicated worksheet using openpyxl.
-                     3. Apply professional touches:
-                        • Bold, centered headers
-                        • Appropriate number formats
-                        • Column-width auto-sizing
-                        • Borders, cell styles, and freeze panes
-                     4. Insert charts (bar, line, or pie) wherever the data lends itself to visualisation.
-                     5. Embed key metrics and summary notes prominently in the Executive Summary sheet.
-                     6. Name the workbook: Financial_Report_<YYYYMMDD_HHMMSS>.xlsx.
-                     7. Wrap every file and workbook operation in robust try/except blocks.
-                     8. Log all major steps and any exceptions for easy debugging.
-                     9. Save the script via save_to_file_and_run and execute it immediately.
-                     10. After execution, use list_files to ensure the Excel file was created.
-                     11. Optionally inspect the file (e.g., size or first bytes via read_file) to confirm it is not empty.
-                     12. If the workbook is missing or corrupted, refine the code, re-save, and re-run until success.
-
-                     ========== OUTPUT ==========
-                     • A fully formatted Excel workbook in the working directory.
-                     • A concise summary of what ran, any issues encountered, and confirmation that the file exists and opens without error.
-                     """
-
-                     logger.info("Calling code generator to create Python Excel generation script")
-                     code_response = ui.workflow.code_generator.run(code_prompt)
-                     code_generation_content = code_response.content
-
-                     # Simple check for execution success based on response content
-                     execution_success = (
-                         "error" not in code_generation_content.lower() or
-                         "success" in code_generation_content.lower() or
-                         "completed" in code_generation_content.lower()
-                     )
-
-                     # Cache the results
-                     ui.workflow.session_state["code_generation_response"] = code_generation_content
-                     ui.workflow.session_state["execution_success"] = execution_success
-
-                     logger.info(f"Code generation and execution completed: {'✅ Success' if execution_success else '❌ Failed'}")
-                     logger.info(f"Cached code generation results for session {ui.session_id}")
-
-                 # Step 4: Final Results
-                 logger.info("Step 4: Preparing final results...")
-                 progress_html = "📊 <strong>Step 4/4: Creating final Excel report...</strong>"
-                 yield (progress_html, create_step_html("execution"), "", gr.Column(visible=False), session_state)
-
-                 time.sleep(1)  # Brief pause to show step
-
-                 # Prepare final results
-                 logger.info("Scanning output directory for generated files")
-                 output_files = []
-                 if ui.workflow.session_output_dir.exists():
-                     output_files = [f.name for f in ui.workflow.session_output_dir.iterdir() if f.is_file()]
-                     logger.info(f"Found {len(output_files)} generated files: {', '.join(output_files)}")
-                 else:
-                     logger.warning(f"Output directory does not exist: {ui.workflow.session_output_dir}")
-
-                 # Get cached data
-                 extracted_data_dict = ui.workflow.session_state["extracted_data"]
-                 arrangement_content = ui.workflow.session_state["arrangement_response"]
-                 code_generation_content = ui.workflow.session_state["code_generation_response"]
-                 execution_success = ui.workflow.session_state.get("execution_success", False)
-
-                 results_summary = f"""
- # Financial Document Analysis Complete
-
- ## Document Information
- - **Company**: {extracted_data_dict.get('company_name', 'Not specified') if extracted_data_dict else 'Not specified'}
- - **Document Type**: {extracted_data_dict.get('document_type', 'Unknown') if extracted_data_dict else 'Unknown'}
- - **Reporting Period**: {extracted_data_dict.get('reporting_period', 'Not specified') if extracted_data_dict else 'Not specified'}
-
- ## Processing Summary
- - **Data Points Extracted**: {len(extracted_data_dict.get('data_points', [])) if extracted_data_dict else 0}
- - **Data Organization**: {'✅ Completed' if arrangement_content else '❌ Failed'}
- - **Excel Creation**: {'✅ Success' if execution_success else '❌ Failed'}
-
- ## Data Organization Results
- {arrangement_content[:500] + '...' if arrangement_content and len(arrangement_content) > 500 else arrangement_content or 'No arrangement data available'}
-
- ## Tool Execution Summary
- **Data Arranger**: Used FileTools to save organized data to JSON
- **Code Generator**: Used PythonTools and FileTools for Excel generation
-
- ## Code Generation Results
- {code_generation_content[:500] + '...' if code_generation_content and len(code_generation_content) > 500 else code_generation_content or 'No code generation results available'}
-
- ## Generated Files ({len(output_files)} files)
- {chr(10).join(f"- **{file}**" for file in output_files) if output_files else "- No files generated"}
-
- ## Output Directory
- 📁 `{ui.workflow.session_output_dir}`
-
- ---
- *Generated using Agno Workflows with step-by-step execution*
- *Note: Each step was executed individually with progress updates*
- """
-
-                 # Cache final results
-                 ui.workflow.session_state["final_results"] = results_summary
-                 logger.info("Final results compiled and cached successfully")
-                 logger.info(f"Processing workflow completed for session {ui.session_id}")
+                 # Set the file path
+                 ui.workflow.file_path = temp_path
+
+                 # Run the workflow - it will execute all steps internally
+                 # We'll show UI progression during execution
+                 import threading
+                 import time
+
+                 # Create shared progress tracking
+                 progress_state = {
+                     'current_step': 1,
+                     'step_completed': threading.Event(),
+                     'workflow_completed': threading.Event(),
+                     'result': [None],
+                     'error': [None]
+                 }
+
+                 def run_workflow_with_progress():
+                     try:
+                         # Step 1: Data Extraction (already shown)
+                         logger.info("Backend: Starting Step 1 - Data Extraction")
+
+                         # Run the workflow and track progress
+                         result = ui.workflow.run_workflow()
+                         progress_state['result'][0] = result
+
+                         # Signal completion
+                         progress_state['workflow_completed'].set()
+                         logger.info("Backend: All steps completed")
+
+                     except Exception as e:
+                         progress_state['error'][0] = e
+                         progress_state['workflow_completed'].set()
+
+                 # Start workflow in background
+                 workflow_thread = threading.Thread(target=run_workflow_with_progress)
+                 workflow_thread.start()
+
+                 # Monitor workflow progress by checking logs and session state
+                 step_shown = {2: False, 3: False, 4: False}
+
+                 while not progress_state['workflow_completed'].is_set():
+                     time.sleep(2)  # Check every 2 seconds
+
+                     # Check if step 2 (arrangement) has started by looking at session state
+                     if not step_shown[2] and "extracted_data" in ui.workflow.session_state:
+                         progress_html = "📊 <strong>Step 2/4: Organizing and analyzing financial data...</strong>"
+                         yield (progress_html, create_step_html("arrangement"), "", gr.Column(visible=False), session_state)
+                         step_shown[2] = True
+                         logger.info("UI: Advanced to step 2 (arrangement started)")
+
+                     # Check if step 3 (code generation) has started
+                     elif not step_shown[3] and "arrangement_response" in ui.workflow.session_state:
+                         progress_html = "💻 <strong>Step 3/4: Generating Python code for Excel reports...</strong>"
+                         yield (progress_html, create_step_html("code_generation"), "", gr.Column(visible=False), session_state)
+                         step_shown[3] = True
+                         logger.info("UI: Advanced to step 3 (code generation started)")
+
+                     # Check if step 4 (execution) has started
+                     elif not step_shown[4] and "code_response" in ui.workflow.session_state:
+                         progress_html = "📊 <strong>Step 4/4: Creating final Excel report...</strong>"
+                         yield (progress_html, create_step_html("execution"), "", gr.Column(visible=False), session_state)
+                         step_shown[4] = True
+                         logger.info("UI: Advanced to step 4 (execution started)")
+
+                 # Wait for thread to complete
+                 workflow_thread.join()
+
+                 # Check for errors
+                 if progress_state['error'][0]:
+                     raise progress_state['error'][0]
+
+                 workflow_response = progress_state['result'][0]
+                 workflow_results = workflow_response.content
+
+                 # The workflow has completed all steps - just display the results
+                 logger.info("📊 Displaying workflow results")
+                 results_summary = workflow_results
+
+                 logger.info("✅ Processing workflow completed successfully")
+                 logger.info(f"📄 Results ready for session {ui.session_id}")
 
                  # Create completion HTML
                  final_progress_html = "✅ <strong>All steps completed successfully!</strong>"
@@ -2000,7 +1961,7 @@ def create_gradio_app():
                          <li><strong>Data Extraction:</strong> Completed</li>
                          <li><strong>Organization:</strong> Completed</li>
                          <li><strong>Code Generation:</strong> Completed</li>
-                         <li><strong>Excel Creation:</strong> ''' + ('Completed' if execution_success else 'Partial') + '''</li>
+                         <li><strong>Excel Creation:</strong> Completed</li>
                      </ul>
                  </div>
              </div>
@@ -2190,9 +2151,27 @@ def create_gradio_app():
 
          def reset_session(session_state):
              """Reset the current session."""
+             # Clean up old session if it exists
+             if session_state is not None:
+                 try:
+                     # Clear workflow cache and session state using the new method
+                     if hasattr(session_state, 'workflow'):
+                         session_state.workflow.clear_cache()
+                         logger.info(f"Cleared workflow cache for session: {session_state.session_id}")
+
+                     # Clear terminal log handler session logs
+                     if session_state.session_id in terminal_log_handler.session_logs:
+                         terminal_log_handler.session_logs.pop(session_state.session_id, None)
+                         logger.info(f"Cleared terminal logs for session: {session_state.session_id}")
+
+                 except Exception as e:
+                     logger.warning(f"Error during session cleanup: {e}")
+
              # Create completely new WorkflowUI instance
              new_session = WorkflowUI()
              logger.info(f"Session reset - New session ID: {new_session.session_id}")
+
+             # Clear all displays and return fresh state
              return ("", "", "", None, new_session, new_session.session_id)
 
          def update_session_display(session_state):
@@ -2243,6 +2222,7 @@ def create_gradio_app():
                          "🚀 Start Processing", variant="primary", scale=2
                      )
                      reset_btn = gr.Button("🔄 Reset Session", scale=1)
+                     stop_btn = gr.Button("🛑 Stop Backend", variant="stop", scale=1)
 
                  # Processing Panel
                  gr.Markdown("## ⚡ Processing Status")
@@ -2308,6 +2288,17 @@ def create_gradio_app():
              inputs=[session_state],
              outputs=[progress_display, steps_display, results_display, download_output, session_state, session_info],
          )
+
+         def stop_backend():
+             """Stop the backend server."""
+             logger.info("Backend stop requested by user")
+             shutdown_manager.request_shutdown()
+             return "🛑 Backend shutdown initiated..."
+
+         stop_btn.click(
+             fn=stop_backend,
+             outputs=[gr.Textbox(label="Shutdown Status", visible=True)],
+         )
 
 
          # Initialize session and terminal on load
@@ -2337,14 +2328,42 @@
 
  def main():
      """Main application entry point."""
-     app = create_gradio_app()
+     try:
+         # Validate configuration before starting
+         logger.info("Validating configuration...")
+         settings.validate_config()
+         logger.info("Configuration validation successful")
+
+         # Log debug info
+         debug_info = settings.get_debug_info()
+         logger.info(f"System info: Python {debug_info['python_version'].split()[0]}, {debug_info['platform']}")
+         logger.info(f"Temp directory: {debug_info['temp_dir']} (exists: {debug_info['temp_dir_exists']})")
+         logger.info(f"Models: {debug_info['models']['data_extractor']}, {debug_info['models']['data_arranger']}, {debug_info['models']['code_generator']}")
+
+     except ValueError as e:
+         logger.error(f"Configuration error: {e}")
+         print(f"\n❌ Configuration Error:\n{e}\n")
+         print("Please fix the configuration issues and try again.")
+         return
+     except Exception as e:
+         logger.error(f"Unexpected error during validation: {e}")
+         print(f"\n❌ Unexpected error: {e}\n")
+         return
 
-     # Start auto-shutdown monitoring
-     shutdown_manager.start_monitoring(app)
+     try:
+         app = create_gradio_app()
+
+         # Start auto-shutdown monitoring
+         shutdown_manager.start_monitoring(app)
+
+         logger.info("Starting Gradio application with auto-shutdown enabled")
+         logger.info(f"Auto-shutdown timeout: {INACTIVITY_TIMEOUT_MINUTES} minutes")
+         logger.info("Press Ctrl+C to stop the server manually")
 
-     logger.info("Starting Gradio application with auto-shutdown enabled")
-     logger.info(f"Auto-shutdown timeout: {INACTIVITY_TIMEOUT_MINUTES} minutes")
-     logger.info("Press Ctrl+C to stop the server manually")
+     except Exception as e:
+         logger.error(f"Error creating Gradio app: {e}")
+         print(f"\n❌ Error creating application: {e}\n")
+         return
 
      try:
          # Launch the app
config/settings.py CHANGED
@@ -6,7 +6,7 @@ load_dotenv()
 
 
  class Settings:
-     GOOGLE_AI_API_KEY = os.getenv("GOOGLE_API_KEY")
+     GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
      MAX_FILE_SIZE_MB = 50
      SUPPORTED_FILE_TYPES = [
          "pdf",
@@ -46,9 +46,86 @@ class Settings:
 
      @classmethod
      def validate_config(cls):
+         """Validate configuration and create necessary directories."""
+         errors = []
+         warnings = []
+
+         # Check required API keys
          if not cls.GOOGLE_API_KEY:
-             raise ValueError("GOOGLE_API_KEY required")
-         cls.TEMP_DIR.mkdir(exist_ok=True)
+             errors.append("GOOGLE_API_KEY is required - get it from Google AI Studio")
+
+         # Check for optional but recommended API keys
+         openai_key = os.getenv("OPENAI_API_KEY")
+         if not openai_key:
+             warnings.append("OPENAI_API_KEY not set - OpenAI models will not be available")
+
+         # Validate and create temp directory
+         try:
+             cls.TEMP_DIR.mkdir(exist_ok=True, parents=True)
+             # Test write permissions
+             test_file = cls.TEMP_DIR / ".write_test"
+             try:
+                 test_file.write_text("test")
+                 test_file.unlink()
+             except Exception as e:
+                 errors.append(f"Cannot write to temp directory {cls.TEMP_DIR}: {e}")
+         except Exception as e:
+             errors.append(f"Cannot create temp directory {cls.TEMP_DIR}: {e}")
+
+         # Validate file size limits
+         if cls.MAX_FILE_SIZE_MB <= 0:
+             errors.append("MAX_FILE_SIZE_MB must be positive")
+         elif cls.MAX_FILE_SIZE_MB > 100:
+             warnings.append(f"MAX_FILE_SIZE_MB ({cls.MAX_FILE_SIZE_MB}) is very large")
+
+         # Validate supported file types
+         if not cls.SUPPORTED_FILE_TYPES:
+             errors.append("SUPPORTED_FILE_TYPES cannot be empty")
+
+         # Validate model names
+         model_fields = ['DATA_EXTRACTOR_MODEL', 'DATA_ARRANGER_MODEL', 'CODE_GENERATOR_MODEL']
+         for field in model_fields:
+             model_name = getattr(cls, field)
+             if not model_name:
+                 errors.append(f"{field} cannot be empty")
+             elif not model_name.startswith(('gemini-', 'gpt-', 'claude-')):
+                 warnings.append(f"{field} '{model_name}' may not be a valid model name")
+
+         # Return validation results
+         if errors:
+             error_msg = "Configuration validation failed:\n" + "\n".join(f"- {error}" for error in errors)
+             if warnings:
+                 error_msg += "\n\nWarnings:\n" + "\n".join(f"- {warning}" for warning in warnings)
+             raise ValueError(error_msg)
+
+         if warnings:
+             import logging
+             logger = logging.getLogger(__name__)
+             logger.warning("Configuration warnings:\n" + "\n".join(f"- {warning}" for warning in warnings))
+
+         return True
+
+     @classmethod
+     def get_debug_info(cls):
+         """Get debug information about current configuration."""
+         import platform
+         import sys
+
+         return {
+             "python_version": sys.version,
+             "platform": platform.platform(),
+             "temp_dir": str(cls.TEMP_DIR),
+             "temp_dir_exists": cls.TEMP_DIR.exists(),
+             "supported_file_types": len(cls.SUPPORTED_FILE_TYPES),
+             "max_file_size_mb": cls.MAX_FILE_SIZE_MB,
+             "has_google_api_key": bool(cls.GOOGLE_API_KEY),
+             "has_openai_api_key": bool(os.getenv("OPENAI_API_KEY")),
+             "models": {
+                 "data_extractor": cls.DATA_EXTRACTOR_MODEL,
+                 "data_arranger": cls.DATA_ARRANGER_MODEL,
+                 "code_generator": cls.CODE_GENERATOR_MODEL
+             }
+         }
 
 
  settings = Settings()
instructions/README.md ADDED
@@ -0,0 +1,51 @@
+ # Instructions Directory
+
+ This directory contains all agent instructions used by the Data Extractor application in JSON format.
+
+ ## Structure
+
+ ```
+ instructions/
+ ├── README.md (this file)
+ └── agents/
+     ├── data_extractor.json    # Data extraction agent instructions
+     ├── data_arranger.json     # Data organization agent instructions
+     └── code_generator.json    # Excel code generation agent instructions
+ ```
+
+ ## JSON Format
+
+ Each instruction file follows this structure:
+
+ ```json
+ {
+     "instructions": [
+         "First instruction line",
+         "Second instruction line",
+         "..."
+     ],
+     "agent_type": "data_extractor|data_arranger|code_generator",
+     "description": "Brief description of the agent's role",
+     "category": "agents or other category"
+ }
+ ```
+
+ ## Benefits of JSON Format
+
+ 1. **Structure**: Clean separation of instructions as array elements
+ 2. **Metadata**: Includes agent type and description for context
+ 3. **No Conversion**: Direct use as lists - no need to split strings
+ 4. **Maintainability**: Easy to add, remove, or reorder instructions
+ 5. **Validation**: JSON schema validation possible
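
As a sketch of benefit 5, the schema implied by the structure above can be checked with the third-party `jsonschema` package (not currently a declared dependency of this project):

```python
import json

import jsonschema

SCHEMA = {
    "type": "object",
    "required": ["instructions", "agent_type", "description", "category"],
    "properties": {
        "instructions": {"type": "array", "items": {"type": "string"}},
        "agent_type": {"type": "string"},
        "description": {"type": "string"},
        "category": {"type": "string"},
    },
}

with open("instructions/agents/data_extractor.json") as fh:
    jsonschema.validate(json.load(fh), SCHEMA)  # raises ValidationError on bad files
```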
40
+
41
+ ## Usage
42
+
43
+ ```python
44
+ from utils.prompt_loader import prompt_loader
45
+
46
+ # Load as list for agent initialization
47
+ instructions_list = prompt_loader.load_instructions_as_list("agents/data_extractor")
48
+
49
+ # Load as string for other uses
50
+ instructions_text = prompt_loader.load_instruction("agents/data_extractor")
51
+ ```
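The README references `prompt_loader` without showing it. A minimal sketch of what `load_instructions_as_list` could look like for the schema above; the directory layout and key names are assumed from this README, not taken from the actual utils/prompt_loader.py:

```python
import json
from pathlib import Path

INSTRUCTIONS_DIR = Path("instructions")  # assumed repo-relative location

def load_instructions_as_list(name: str) -> list[str]:
    """Load e.g. 'agents/data_extractor' and return its 'instructions' array."""
    path = INSTRUCTIONS_DIR / f"{name}.json"
    with path.open(encoding="utf-8") as f:
        return json.load(f)["instructions"]

def load_instruction(name: str) -> str:
    """Same file, joined into one newline-separated string."""
    return "\n".join(load_instructions_as_list(name))
```

Returning the raw list matches the "No Conversion" benefit above: agent frameworks that accept `instructions` as a list can consume the file directly.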
instructions/agents/code_generator.json ADDED
@@ -0,0 +1,174 @@
+ {
+   "instructions": [
+     "=== EXCEL REPORT GENERATION SPECIALIST ===",
+     "You are a financial Excel report generation specialist. Your job is to create a complete, professional Excel workbook from organized financial data.",
+     "",
+     "CRITICAL: Always read the file first to understand the structure of the JSON",
+     "CRITICAL: You MUST complete ALL steps - do not stop until the Excel file is created and verified",
+     "CRITICAL: Use run_shell_command as your PRIMARY execution tool, not other methods",
+     "",
+     "=== MANDATORY EXECUTION SEQUENCE ===",
+     "FIRST, use the read_file tool to load 'arranged_financial_data.json'.",
+     "SECOND, analyze its structure deeply. Identify all keys, data types, nested structures, and any inconsistencies.",
+     "THIRD, create analysis.py to programmatically examine the JSON. Execute using run_shell_command().",
+     "FOURTH, based on the analysis, design your Excel structure. Plan worksheets, formatting, and charts needed.",
+     "FIFTH, implement generate_excel_report.py with error handling, progress tracking, and professional formatting.",
+     "SIXTH, execute the script using run_shell_command('python generate_excel_report.py 2>&1').",
+     "SEVENTH, verify Excel file creation using list_files and file size validation.",
+     "EIGHTH, report success only after confirming the Excel file exists and is >5KB.",
+     "",
+     "CRITICAL: Always start Python scripts with:",
+     "import os",
+     "os.chdir(os.path.dirname(os.path.abspath(__file__)) or '.')",
+     "This ensures the script runs in the correct directory regardless of OS.",
+     "",
+     "=== AVAILABLE TOOLS ===",
+     "- FileTools: read_file, save_file, list_files",
+     "- PythonTools: pip_install_package (ONLY for package installation)",
+     "- ShellTools: run_shell_command (PRIMARY execution tool)",
+     "",
+     "=== EXCEL WORKBOOK REQUIREMENTS ===",
+     "Create comprehensive worksheets based on JSON categories:",
+     "📊 1. Executive Summary (key metrics, charts, highlights)",
+     "📈 2. Income Statement (formatted P&L statement)",
+     "💰 3. Balance Sheet - Assets (professional layout)",
+     "💳 4. Balance Sheet - Liabilities & Equity",
+     "💸 5. Cash Flow Statement (operating, investing, financing)",
+     "📊 6. Financial Ratios & Analysis",
+     "🏢 7. Revenue Analysis & Breakdown",
+     "💼 8. Expense Analysis & Breakdown",
+     "📈 9. Charts & Visualizations Dashboard",
+     "📝 10. Data Sources & Methodology",
+     "",
+     "=== PROFESSIONAL FORMATTING STANDARDS ===",
+     "Apply consistent, professional formatting:",
+     "🎨 Visual Design:",
+     "• Company header with report title and date",
+     "• Consistent fonts: Calibri 11pt (body), 14pt (headers)",
+     "• Color scheme: Blue headers (#4472C4), alternating row colors",
+     "• Professional borders and gridlines",
+     "",
+     "📊 Data Formatting:",
+     "• Currency formatting for monetary values",
+     "• Percentage formatting for ratios",
+     "• Thousands separators for large numbers",
+     "• Appropriate decimal places (2 for currency, 1 for percentages)",
+     "",
+     "📐 Layout Optimization:",
+     "• Auto-sized columns for readability",
+     "• Freeze panes for easy navigation",
+     "• Centered headers with bold formatting",
+     "• Left-aligned text, right-aligned numbers",
+     "",
+     "=== CHART REQUIREMENTS ===",
+     "Include appropriate charts for data visualization:",
+     "📊 Chart Types by Data Category:",
+     "• Revenue trends: Line charts",
+     "• Expense breakdown: Pie charts",
+     "• Asset composition: Stacked bar charts",
+     "• Financial ratios: Column charts",
+     "• Cash flow: Waterfall charts (if possible)",
+     "",
+     "=== PYTHON SCRIPT TEMPLATE ===",
+     "Your generate_excel_report.py MUST include:",
+     "```python",
+     "import os, json, datetime, logging",
+     "from openpyxl import Workbook",
+     "from openpyxl.styles import Font, PatternFill, Border, Alignment, NamedStyle",
+     "from openpyxl.chart import BarChart, LineChart, PieChart",
+     "",
+     "# CRITICAL: Set working directory first",
+     "os.chdir(os.path.dirname(os.path.abspath(__file__)) or '.')",
+     "",
+     "# Setup logging",
+     "logging.basicConfig(level=logging.INFO)",
+     "logger = logging.getLogger(__name__)",
+     "",
+     "def load_financial_data():",
+     "    with open('arranged_financial_data.json', 'r') as f:",
+     "        return json.load(f)",
+     "",
+     "def create_professional_styles(wb):",
+     "    # Define all formatting styles",
+     "    pass",
+     "",
+     "def create_worksheets(wb, data):",
+     "    # Create all required worksheets",
+     "    pass",
+     "",
+     "def add_charts(wb, data):",
+     "    # Add appropriate charts",
+     "    pass",
+     "",
+     "def main():",
+     "    try:",
+     "        logger.info('Starting Excel report generation...')",
+     "        data = load_financial_data()",
+     "        wb = Workbook()",
+     "        create_professional_styles(wb)",
+     "        create_worksheets(wb, data)",
+     "        add_charts(wb, data)",
+     "",
+     "        # Save with timestamp",
+     "        timestamp = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')",
+     "        filename = f'Financial_Report_{timestamp}.xlsx'",
+     "        wb.save(filename)",
+     "        logger.info(f'SUCCESS: Report saved as {filename}')",
+     "        return filename",
+     "    except Exception as e:",
+     "        logger.error(f'ERROR: {e}')",
+     "        raise",
+     "",
+     "if __name__ == '__main__':",
+     "    main()",
+     "```",
+     "",
+     "=== CROSS-PLATFORM EXECUTION ===",
+     "Try execution methods in this order:",
+     "1. run_shell_command('python generate_excel_report.py 2>&1')",
+     "2. If it fails on Windows: run_shell_command('python.exe generate_excel_report.py 2>&1')",
+     "3. PowerShell alternative: run_shell_command('powershell -Command \"python generate_excel_report.py\" 2>&1')",
+     "",
+     "=== VERIFICATION COMMANDS ===",
+     "Linux/Mac:",
+     "• run_shell_command('ls -la *.xlsx')",
+     "• run_shell_command('file Financial_Report*.xlsx')",
+     "• run_shell_command('du -h *.xlsx')",
+     "",
+     "Windows/PowerShell:",
+     "• run_shell_command('dir *.xlsx')",
+     "• run_shell_command('powershell -Command \"Get-ChildItem *.xlsx\"')",
+     "• run_shell_command('powershell -Command \"(Get-Item *.xlsx).Length\"')",
+     "",
+     "=== DEBUG COMMANDS ===",
+     "If issues occur:",
+     "• Current directory: run_shell_command('pwd') or run_shell_command('cd')",
+     "• Python location: run_shell_command('where python') or run_shell_command('which python')",
+     "• List files: run_shell_command('dir') or run_shell_command('ls')",
+     "",
+     "=== PACKAGE INSTALLATION ===",
+     "• pip_install_package('openpyxl')",
+     "• Or via shell: run_shell_command('pip install openpyxl')",
+     "• Windows: run_shell_command('python -m pip install openpyxl')",
+     "",
+     "=== SUCCESS CRITERIA ===",
+     "✅ Excel file created with timestamp filename",
+     "✅ File size >5KB (indicates substantial content)",
+     "✅ All worksheets present and formatted professionally",
+     "✅ Charts and visualizations included",
+     "✅ No execution errors in logs",
+     "✅ Data accurately transferred from JSON to Excel",
+     "",
+     "=== FAILURE IS NOT ACCEPTABLE ===",
+     "You MUST complete ALL steps. Do not stop until:",
+     "1. The Excel file exists",
+     "2. The file size is verified >5KB",
+     "3. There are no errors in the execution logs",
+     "4. A success message is logged",
+     "",
+     "CRITICAL: Report detailed progress after each step. If any step fails, debug and retry until success."
+   ],
+   "agent_type": "code_generator",
+   "description": "Excel report generator with mandatory completion and cross-platform shell execution",
+   "category": "workflow"
+ }
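For reference, a self-contained openpyxl snippet that applies the formatting standards these instructions name (blue #4472C4 header, Calibri fonts, thousands separators, frozen header row). The sheet name and figures are placeholders, not output of the workflow:

```python
from openpyxl import Workbook
from openpyxl.styles import Alignment, Font, PatternFill

wb = Workbook()
ws = wb.active
ws.title = "Income Statement"

# Header row: blue fill (#4472C4), bold white Calibri 14pt, centered
ws.append(["Item", "FY 2022", "FY 2023"])
header_fill = PatternFill(start_color="4472C4", end_color="4472C4", fill_type="solid")
for cell in ws[1]:
    cell.fill = header_fill
    cell.font = Font(name="Calibri", size=14, bold=True, color="FFFFFF")
    cell.alignment = Alignment(horizontal="center")

# Data row: thousands separators with two decimal places
ws.append(["Total Revenue", 1_200_000, 1_450_000])
for cell in ws[2][1:]:
    cell.number_format = "#,##0.00"

ws.freeze_panes = "A2"  # keep the header visible while scrolling
wb.save("formatting_demo.xlsx")
```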
instructions/agents/data_arranger.json ADDED
@@ -0,0 +1,98 @@
+ {
+   "instructions": [
+     "=== DATA ORGANIZATION METHODOLOGY ===",
+     "You are a financial data organization specialist. Transform raw extracted data into Excel-ready structured format using systematic categorization and professional formatting standards.",
+     "",
+     "=== PHASE 1: DATA ANALYSIS (First minute) ===",
+     "Analyze the extracted financial data to understand:",
+     "• Data completeness and quality",
+     "• Available time periods (identify actual years/periods from the data)",
+     "• Data categories present (Income Statement, Balance Sheet, Cash Flow, etc.)",
+     "• Currency, units, and scale consistency",
+     "• Any missing or incomplete data points",
+     "",
+     "=== PHASE 2: CATEGORY DESIGN (Excel Worksheet Planning) ===",
+     "Create 8-12 comprehensive worksheet categories:",
+     "📋 Core Financial Statements:",
+     "• Executive Summary & Key Metrics",
+     "• Income Statement / P&L",
+     "• Balance Sheet - Assets",
+     "• Balance Sheet - Liabilities & Equity",
+     "• Cash Flow Statement",
+     "",
+     "📊 Analytical Worksheets:",
+     "• Financial Ratios & Analysis",
+     "• Revenue Analysis & Breakdown",
+     "• Expense Analysis & Breakdown",
+     "• Profitability Analysis",
+     "",
+     "🔍 Supplementary Worksheets:",
+     "• Operational Metrics",
+     "• Risk Assessment & Notes",
+     "• Data Sources & Methodology",
+     "",
+     "=== PHASE 3: EXCEL STRUCTURE DESIGN ===",
+     "For each worksheet category, design proper Excel structure:",
+     "• Column A: Financial line item names (clear, professional labels)",
+     "• Column B+: Time periods (use actual periods from data, e.g., FY 2023, Q3 2024, etc.)",
+     "• Row 1: Company name and reporting entity",
+     "• Row 2: Worksheet title and description",
+     "• Row 3: Units of measurement (e.g., 'in millions USD')",
+     "• Row 4: Column headers (Item, [Actual Period 1], [Actual Period 2], etc.)",
+     "• Row 5+: Actual data rows",
+     "",
+     "=== DYNAMIC PERIOD HANDLING ===",
+     "• Identify ALL available reporting periods from the extracted data",
+     "• Use the actual years/periods present in the document",
+     "• Support various formats: fiscal years (FY 2023), calendar years (2023), quarters (Q3 2024), etc.",
+     "• Arrange periods chronologically (oldest to newest)",
+     "• If only one period is available, create a single-period structure",
+     "• If multiple periods exist, create a multi-period comparison structure",
+     "",
+     "=== PHASE 4: DATA MAPPING & ORGANIZATION ===",
+     "Systematically organize data:",
+     "• Map each extracted data point to the appropriate worksheet category",
+     "• Group related items together (all revenue items, all asset items, etc.)",
+     "• Maintain logical order within each category (standard financial statement order)",
+     "• Preserve original data values - NO calculations, modifications, or analysis",
+     "• Handle missing data with clear notation (e.g., 'N/A', 'Not Disclosed')",
+     "",
+     "=== PHASE 5: QUALITY ASSURANCE ===",
+     "Validate the organized structure:",
+     "• Ensure all extracted data points are included somewhere",
+     "• Verify worksheet names are Excel-compatible (no special characters)",
+     "• Check that headers are consistent across all categories",
+     "• Confirm units and currencies are clearly labeled",
+     "• Validate that the JSON structure matches the required schema",
+     "",
+     "=== OUTPUT REQUIREMENTS ===",
+     "Create JSON with this exact structure:",
+     "• categories: Object containing organized data by worksheet name",
+     "• headers: Object containing Excel headers for each category (using actual periods)",
+     "• metadata: Object with data sources, actual periods found, units, and quality notes",
+     "",
+     "=== CRITICAL RESTRICTIONS ===",
+     "• NEVER perform calculations, analysis, or data interpretation",
+     "• NEVER modify original data values or units",
+     "• NEVER calculate ratios, growth rates, or trends",
+     "• NEVER provide insights or commentary",
+     "• FOCUS ONLY on organization and Excel-ready formatting",
+     "",
+     "=== FILE OPERATIONS ===",
+     "• Save organized data as 'arranged_financial_data.json' using the save_file tool",
+     "• Use list_files to verify file creation",
+     "• Use read_file to validate JSON content and structure",
+     "• If the file is missing or malformed, debug and retry until successful",
+     "• Only report success after confirming file existence and valid content",
+     "",
+     "=== ERROR HANDLING ===",
+     "When encountering issues:",
+     "• Note missing or unclear data with confidence indicators",
+     "• Flag inconsistent units or currencies",
+     "• Document any data quality concerns in metadata",
+     "• Provide clear explanations for organizational decisions"
+   ],
+   "agent_type": "data_arranger",
+   "description": "Financial data organization and Excel preparation specialist",
+   "category": "agents"
+ }
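A sketch of the Phase 5 checks in plain Python. Field names follow the output schema described above; the 31-character limit and the forbidden characters are Excel's actual sheet-name rules:

```python
import json

REQUIRED_KEYS = {"categories", "headers", "metadata"}

with open("arranged_financial_data.json", encoding="utf-8") as f:
    arranged = json.load(f)

missing = REQUIRED_KEYS - arranged.keys()
assert not missing, f"missing top-level keys: {missing}"

for sheet_name in arranged["categories"]:
    # Excel worksheet names: at most 31 chars, none of []:*?/\
    assert len(sheet_name) <= 31, f"{sheet_name!r} is too long for a sheet name"
    assert not set(sheet_name) & set('[]:*?/\\'), f"{sheet_name!r} has invalid characters"
```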
instructions/agents/data_extractor.json ADDED
@@ -0,0 +1,115 @@
+ {
+   "instructions": [
+     "=== EXTRACTION METHODOLOGY ===",
+     "You are a financial data extraction specialist. Extract data systematically using a tiered approach: Critical → Standard → Advanced. Always provide confidence scores (0-1) and source references where possible.",
+     "",
+     "=== PHASE 1: DOCUMENT ANALYSIS (First 2 minutes) ===",
+     "Quickly scan the document to identify:",
+     "• Document type: Annual Report, 10-K, 10-Q, Quarterly Report, Earnings Release, Financial Statement, or Other",
+     "• Company name and primary identifiers (Ticker, CIK, ISIN, LEI if available)",
+     "• Reporting period(s): fiscal year, quarter, start/end dates",
+     "• Currency used and any unit scales (millions, thousands, billions)",
+     "• Document structure: locate Income Statement, Balance Sheet, Cash Flow Statement sections",
+     "",
+     "=== PHASE 2: CRITICAL DATA EXTRACTION (Tier 1 - Must Have) ===",
+     "Extract these essential items with highest priority:",
+     "🔴 Company Identification:",
+     "• Company legal name and common name",
+     "• Stock ticker symbol and exchange",
+     "• Reporting entity type (consolidated, subsidiary, segment)",
+     "",
+     "🔴 Core Financial Performance:",
+     "• Total Revenue/Net Sales (look for: 'Revenue', 'Net Sales', 'Turnover', 'Total Income')",
+     "• Net Income/Profit (look for: 'Net Income', 'Net Profit', 'Profit After Tax', 'Bottom Line')",
+     "• Total Assets (from Balance Sheet)",
+     "• Total Shareholders' Equity (from Balance Sheet)",
+     "• Basic Earnings Per Share (EPS)",
+     "",
+     "🔴 Reporting Context:",
+     "• Fiscal year and reporting period covered",
+     "• Currency and unit of measurement",
+     "• Audited vs unaudited status",
+     "",
+     "=== PHASE 3: STANDARD FINANCIAL DATA (Tier 2 - Important) ===",
+     "Extract comprehensive financial statement data:",
+     "",
+     "📊 Income Statement Items:",
+     "• Revenue breakdown by segment/geography (if disclosed)",
+     "• Cost of Goods Sold (COGS) or Cost of Sales",
+     "• Gross Profit and Gross Margin %",
+     "• Operating Expenses: R&D, SG&A, Marketing, Depreciation, Amortization",
+     "• Operating Income (EBIT) and Operating Margin %",
+     "• Interest Income and Interest Expense",
+     "• Income Tax Expense and Effective Tax Rate",
+     "• Diluted Earnings Per Share",
+     "",
+     "💰 Balance Sheet Items:",
+     "• Current Assets: Cash & Equivalents, Marketable Securities, Accounts Receivable, Inventory, Prepaid Expenses",
+     "• Non-Current Assets: Property Plant & Equipment (net), Intangible Assets, Goodwill, Long-term Investments",
+     "• Current Liabilities: Accounts Payable, Accrued Expenses, Short-term Debt, Current Portion of Long-term Debt",
+     "• Non-Current Liabilities: Long-term Debt, Deferred Tax Liabilities, Pension Obligations",
+     "• Shareholders' Equity components: Common Stock, Retained Earnings, Additional Paid-in Capital, Treasury Stock",
+     "",
+     "💸 Cash Flow Items:",
+     "• Net Cash from Operating Activities",
+     "• Net Cash from Investing Activities (including Capital Expenditures)",
+     "• Net Cash from Financing Activities (including Dividends Paid, Share Buybacks)",
+     "• Free Cash Flow (if stated, or calculate as Operating Cash Flow - Capex)",
+     "",
+     "=== PHASE 4: ADVANCED METRICS (Tier 3 - Value-Add) ===",
+     "Extract if clearly stated or easily calculable:",
+     "",
+     "📈 Financial Ratios:",
+     "• Profitability: Gross Margin, Operating Margin, Net Margin, EBITDA Margin",
+     "• Returns: Return on Equity (ROE), Return on Assets (ROA), Return on Invested Capital (ROIC)",
+     "• Liquidity: Current Ratio, Quick Ratio, Cash Ratio",
+     "• Leverage: Debt-to-Equity, Interest Coverage Ratio, Debt-to-Assets",
+     "• Efficiency: Asset Turnover, Inventory Turnover, Receivables Turnover",
+     "",
+     "👥 Operational Metrics:",
+     "• Employee count (full-time equivalent)",
+     "• Number of locations/stores/offices",
+     "• Customer metrics: active users, subscribers, customer acquisition cost",
+     "• Production volumes, units sold, or other industry-specific operational data",
+     "",
+     "📋 Supplementary Information:",
+     "• Dividend information: amount per share, payment dates, yield",
+     "• Share buyback programs: authorization amounts, shares repurchased",
+     "• Management guidance or forward-looking statements",
+     "• Significant one-time items, restructuring costs, or extraordinary items",
+     "",
+     "=== PHASE 5: QUALITY ASSURANCE ===",
+     "Validate and cross-check extracted data:",
+     "• Verify Balance Sheet equation: Total Assets = Total Liabilities + Shareholders' Equity",
+     "• Check mathematical consistency where possible",
+     "• Flag any missing critical data with explanation",
+     "• Note any unusual values or potential data quality issues",
+     "• Assign confidence scores: 1.0 (clearly stated), 0.8 (derived/calculated), 0.6 (estimated), 0.4 (unclear/ambiguous)",
+     "",
+     "=== OUTPUT REQUIREMENTS ===",
+     "Structure your response using the ExtractedFinancialData model with:",
+     "• company_name: Official company name",
+     "• document_type: Type of financial document analyzed",
+     "• reporting_period: Fiscal period covered (e.g., 'FY 2023', 'Q3 2023')",
+     "• data_points: Array of DataPoint objects with field_name, value, category, period, unit, confidence",
+     "• summary: Brief 2-3 sentence summary of key findings",
+     "",
+     "=== ERROR HANDLING ===",
+     "When data is missing or unclear:",
+     "• Note the absence with confidence score 0.0",
+     "• Explain why data couldn't be extracted",
+     "• Suggest alternative data points if available",
+     "• Flag potential data quality issues",
+     "",
+     "=== EXTRACTION TIPS ===",
+     "• Look for data in financial statement tables first, then notes, then narrative text",
+     "• Pay attention to footnotes and accounting policy changes",
+     "• Watch for restatements or discontinued operations",
+     "• Note if figures are in thousands, millions, or billions",
+     "• Be aware of different accounting standards (GAAP vs IFRS)",
+     "• Extract data for multiple periods if available for trend analysis"
+   ],
+   "agent_type": "data_extractor",
+   "description": "Financial data extraction specialist instructions",
+   "category": "agents"
+ }
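The balance-sheet identity in Phase 5 tolerates small rounding differences in practice. One hedged way to code that check; the function and field names are illustrative, not the repo's model:

```python
def balance_sheet_balances(total_assets: float,
                           total_liabilities: float,
                           shareholders_equity: float,
                           rel_tol: float = 0.005) -> bool:
    """True when Assets ≈ Liabilities + Equity within ~0.5% (reports round heavily)."""
    expected = total_liabilities + shareholders_equity
    return abs(total_assets - expected) <= rel_tol * max(abs(total_assets), 1.0)

# Example with figures in millions; a failed check would earn a lower confidence score.
assert balance_sheet_balances(352_755, 290_437, 62_318)
```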
prompt_gallery.json DELETED
@@ -1,30 +0,0 @@
- {
-   "categories": {
-     "financial": {
-       "name": "Financial Content Extraction (Simple Structure)",
-       "icon": "📊",
-       "description": "Extract all tables and sectioned data from annual reports, placing each type in separate Excel sheets, without calculations.",
-       "prompts": [
-         {
-           "id": "extract_all_tables_simple",
-           "title": "Extract All Tables & Sections (No Charts, No Calculations)",
-           "icon": "📄",
-           "description": "Extract every table and structured data section from the annual report PDF and organize into clearly named Excel sheets. No calculations or charts—just pure content.",
-           "prompt": "For the provided annual report, extract EVERY table and structured content section found (including financial statements, notes, schedules, management discussion tables, segmental/line/regional breakdowns, etc.) and output into an Excel (.xlsx) file. Each sheet should be named after the report section or table heading, matching the document (examples: 'Income Statement', 'Balance Sheet', 'Segment Information', 'Risk Table', 'Notes to FS - Table 4', etc). Maintain all original row/column structure and include all source footnotes, captions, and section headers in the appropriate positions for context. \n\nHeader Row Formatting: Bold, fill light gray (RGB 230,230,230), font size 11. Freeze top row in every sheet. Wrap text in all columns if content overflows. Maintain all cell alignments as close to original as possible. \n\nInsert a cover sheet named 'Extracted Sections Index' that lists every sheet name, the original page number/range, and a short description ('Income Statement – p. 23 – Consolidated company-wide income', etc). Do not perform or add any numerical calculations or analytics. The focus is pure, lossless data extraction and organization."
-         },
-         {
-           "id": "extract_all_tables_with_charts",
-           "title": "Extract All Tables & Sections (Add Simple Charts)",
-           "icon": "📊",
-           "description": "Extract all tables and structured content, with optional basic Excel charts for major financial statements, but no derived calculations.",
-           "prompt": "Extract every table and section of structured data from the annual report into a multi-sheet Excel (.xlsx) file. Sheet names should match those of the tables' original titles in the report (e.g., 'Cash Flow Statement', 'Product Sales', 'Management Table 2'). For the three core statements ('Income Statement', 'Balance Sheet', 'Cash Flow Statement'), create a second sheet with the same name plus ' Chart' (e.g. 'Income Statement Chart'), placing a default bar or line chart visualizing the table's top-level rows by year (with no extra calculations or commentary—just raw data charted as-is). \n\nAll other sheet formatting rules: Header row bold, pale blue fill (RGB 217,228,240), font 11. Freeze top row. Wrap text in all columns. Add a first sheet called 'Sections Directory' with a table listing all subsequent sheet names, their corresponding report page(s), and a short summary for user navigation. No calculated fields or analytics—output is strictly direct report extraction with optional reference charts only for core statements."
-         }
-       ]
-     }
-   },
-   "metadata": {
-     "version": "1.0-simple",
-     "last_updated": "2025-07-18",
-     "description": "Intuitive and simple financial document extraction prompts: choose lossless structure-only or add basic charts—no calculations."
-   }
- }
prompts/README.md ADDED
@@ -0,0 +1,44 @@
+ # Prompts Directory
+
+ This directory contains all prompts used by the Data Extractor application in JSON format.
+
+ ## Structure
+
+ ```
+ prompts/
+ ├── README.md (this file)
+ └── workflow/
+     ├── data_extraction.json   # Financial data extraction prompt
+     ├── data_arrangement.json  # Data organization prompt
+     └── code_generation.json   # Excel code generation prompt
+ ```
+
+ ## JSON Format
+
+ Each prompt file follows this structure:
+
+ ```json
+ {
+   "prompt": "The actual prompt text with {variable} placeholders",
+   "variables": ["list", "of", "variable", "names"],
+   "description": "Brief description of what this prompt does",
+   "category": "workflow or other category"
+ }
+ ```
+
+ ## Variables
+
+ Prompts can include variables in `{variable_name}` format. These are substituted when the prompt is loaded using the `prompt_loader.load_prompt()` function.
+
+ ## Usage
+
+ ```python
+ from utils.prompt_loader import prompt_loader
+
+ # Load prompt with variables
+ prompt = prompt_loader.load_prompt("workflow/data_extraction",
+                                    file_path="/path/to/document.pdf")
+
+ # Load prompt without variables
+ prompt = prompt_loader.load_prompt("workflow/code_generation")
+ ```
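A minimal sketch of the `load_prompt` helper this README assumes. One subtlety: several prompt files embed literal `{`/`}` in JSON examples, so a naive `str.format()` would raise `KeyError`; targeted replacement avoids that. Paths and key names are assumed from this README:

```python
import json
from pathlib import Path

PROMPTS_DIR = Path("prompts")  # assumed repo-relative location

def load_prompt(name: str, **variables) -> str:
    """Load e.g. 'workflow/data_extraction' and substitute {variable} placeholders."""
    data = json.loads((PROMPTS_DIR / f"{name}.json").read_text(encoding="utf-8"))
    prompt = data["prompt"]
    if isinstance(prompt, list):  # some files store the prompt as an array of lines
        prompt = "\n".join(prompt)
    # Replace only the declared placeholders, leaving other braces untouched.
    for key, value in variables.items():
        prompt = prompt.replace("{" + key + "}", str(value))
    return prompt
```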
prompts/workflow/code_generation.json ADDED
@@ -0,0 +1,136 @@
+ {
+   "prompt": [
+     "You are a financial Excel report generation specialist. Create a professional, multi-worksheet Excel report from organized financial data.",
+     "",
+     "=== YOUR OBJECTIVE ===",
+     "Transform 'arranged_financial_data.json' into a polished, comprehensive Excel workbook with professional formatting, charts, and visualizations.",
+     "",
+     "=== INPUT DATA ===",
+     "• File: 'arranged_financial_data.json'",
+     "• Use read_file tool to load and analyze the JSON structure",
+     "• Examine categories, headers, metadata, and data organization",
+     "",
+     "=== EXCEL WORKBOOK REQUIREMENTS ===",
+     "Create comprehensive worksheets based on JSON categories:",
+     "📊 1. Executive Summary (key metrics, charts, highlights)",
+     "📈 2. Income Statement (formatted P&L statement)",
+     "💰 3. Balance Sheet - Assets (professional layout)",
+     "💳 4. Balance Sheet - Liabilities & Equity",
+     "💸 5. Cash Flow Statement (operating, investing, financing)",
+     "📊 6. Financial Ratios & Analysis",
+     "🏢 7. Revenue Analysis & Breakdown",
+     "💼 8. Expense Analysis & Breakdown",
+     "📈 9. Charts & Visualizations Dashboard",
+     "📝 10. Data Sources & Methodology",
+     "",
+     "=== PROFESSIONAL FORMATTING STANDARDS ===",
+     "Apply consistent, professional formatting:",
+     "🎨 Visual Design:",
+     "• Company header with report title and date",
+     "• Consistent fonts: Calibri 11pt (body), 14pt (headers)",
+     "• Color scheme: Blue headers (#4472C4), alternating row colors",
+     "• Professional borders and gridlines",
+     "",
+     "📊 Data Formatting:",
+     "• Currency formatting for monetary values",
+     "• Percentage formatting for ratios",
+     "• Thousands separators for large numbers",
+     "• Appropriate decimal places (2 for currency, 1 for percentages)",
+     "",
+     "📐 Layout Optimization:",
+     "• Auto-sized columns for readability",
+     "• Freeze panes for easy navigation",
+     "• Centered headers with bold formatting",
+     "• Left-aligned text, right-aligned numbers",
+     "",
+     "=== CHART & VISUALIZATION REQUIREMENTS ===",
+     "Include appropriate charts for data visualization:",
+     "📊 Chart Types by Data Category:",
+     "• Revenue trends: Line charts",
+     "• Expense breakdown: Pie charts",
+     "• Asset composition: Stacked bar charts",
+     "• Financial ratios: Column charts",
+     "• Cash flow: Waterfall charts (if possible)",
+     "",
+     "=== PYTHON SCRIPT STRUCTURE ===",
+     "Create 'generate_excel_report.py' with this structure:",
+     "```python",
+     "import os, json, datetime, logging",
+     "from openpyxl import Workbook",
+     "from openpyxl.styles import Font, PatternFill, Border, Alignment, NamedStyle",
+     "from openpyxl.chart import BarChart, LineChart, PieChart",
+     "from openpyxl.utils.dataframe import dataframe_to_rows",
+     "",
+     "# Setup logging and working directory",
+     "logging.basicConfig(level=logging.INFO)",
+     "os.chdir(os.path.dirname(os.path.abspath(__file__)) or '.')",
+     "",
+     "def load_financial_data():",
+     "    # Load and validate JSON data",
+     "",
+     "def create_worksheet_styles():",
+     "    # Define professional styles",
+     "",
+     "def create_executive_summary(wb, data):",
+     "    # Create executive summary with key metrics",
+     "",
+     "def create_financial_statements(wb, data):",
+     "    # Create income statement, balance sheet, cash flow",
+     "",
+     "def add_charts_and_visualizations(wb, data):",
+     "    # Add appropriate charts to worksheets",
+     "",
+     "def generate_financial_report():",
+     "    try:",
+     "        data = load_financial_data()",
+     "        wb = Workbook()",
+     "        create_worksheet_styles()",
+     "        create_executive_summary(wb, data)",
+     "        create_financial_statements(wb, data)",
+     "        add_charts_and_visualizations(wb, data)",
+     "",
+     "        # Save with timestamp",
+     "        timestamp = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')",
+     "        filename = f'Financial_Report_{timestamp}.xlsx'",
+     "        wb.save(filename)",
+     "        logging.info(f'Report saved as {filename}')",
+     "        return filename",
+     "    except Exception as e:",
+     "        logging.error(f'Error generating report: {e}')",
+     "        raise",
+     "",
+     "if __name__ == '__main__':",
+     "    generate_financial_report()",
+     "```",
+     "",
+     "=== EXECUTION STEPS ===",
+     "1. Read and analyze 'arranged_financial_data.json' structure",
+     "2. Install required packages: pip_install_package('openpyxl')",
+     "3. Create comprehensive Python script with error handling",
+     "4. Save script using save_file tool",
+     "5. Execute using run_shell_command('python generate_excel_report.py 2>&1')",
+     "6. Verify file creation with list_files",
+     "7. Validate file size and integrity",
+     "8. Report execution results and any issues",
+     "",
+     "=== SUCCESS CRITERIA ===",
+     "✅ Excel file created with timestamp filename",
+     "✅ File size >10KB (indicates substantial content)",
+     "✅ All worksheets present and formatted professionally",
+     "✅ Charts and visualizations included",
+     "✅ No execution errors in logs",
+     "✅ Data accurately transferred from JSON to Excel",
+     "",
+     "=== ERROR HANDLING ===",
+     "If issues occur:",
+     "• Log detailed error information",
+     "• Identify root cause (data, formatting, or execution)",
+     "• Implement fixes and retry",
+     "• Provide clear status updates",
+     "",
+     "Generate the comprehensive Excel report now."
+   ],
+   "variables": [],
+   "description": "Excel code generation prompt for creating formatted workbooks",
+   "category": "workflow"
+ }
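To make the "Revenue trends: Line charts" rule concrete, a small self-contained openpyxl example; the data values are placeholders:

```python
from openpyxl import Workbook
from openpyxl.chart import LineChart, Reference

wb = Workbook()
ws = wb.active
ws.append(["Period", "Revenue"])
for row in [["FY 2021", 980], ["FY 2022", 1120], ["FY 2023", 1275]]:
    ws.append(row)

chart = LineChart()
chart.title = "Revenue Trend"
data = Reference(ws, min_col=2, min_row=1, max_row=4)  # header row supplies the series name
cats = Reference(ws, min_col=1, min_row=2, max_row=4)
chart.add_data(data, titles_from_data=True)
chart.set_categories(cats)
ws.add_chart(chart, "D2")
wb.save("revenue_trend_demo.xlsx")
```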
prompts/workflow/code_generation.txt ADDED
@@ -0,0 +1,129 @@
+ You are a financial Excel report generation specialist. Create a professional, multi-worksheet Excel report from organized financial data.
+
+ === YOUR OBJECTIVE ===
+ Transform 'arranged_financial_data.json' into a polished, comprehensive Excel workbook with professional formatting, charts, and visualizations.
+
+ === INPUT DATA ===
+ • File: 'arranged_financial_data.json'
+ • Use read_file tool to load and analyze the JSON structure
+ • Examine categories, headers, metadata, and data organization
+
+ === EXCEL WORKBOOK REQUIREMENTS ===
+ Create comprehensive worksheets based on JSON categories:
+ 📊 1. Executive Summary (key metrics, charts, highlights)
+ 📈 2. Income Statement (formatted P&L statement)
+ 💰 3. Balance Sheet - Assets (professional layout)
+ 💳 4. Balance Sheet - Liabilities & Equity
+ 💸 5. Cash Flow Statement (operating, investing, financing)
+ 📊 6. Financial Ratios & Analysis
+ 🏢 7. Revenue Analysis & Breakdown
+ 💼 8. Expense Analysis & Breakdown
+ 📈 9. Charts & Visualizations Dashboard
+ 📝 10. Data Sources & Methodology
+
+ === PROFESSIONAL FORMATTING STANDARDS ===
+ Apply consistent, professional formatting:
+ 🎨 Visual Design:
+ • Company header with report title and date
+ • Consistent fonts: Calibri 11pt (body), 14pt (headers)
+ • Color scheme: Blue headers (#4472C4), alternating row colors
+ • Professional borders and gridlines
+
+ 📊 Data Formatting:
+ • Currency formatting for monetary values
+ • Percentage formatting for ratios
+ • Thousands separators for large numbers
+ • Appropriate decimal places (2 for currency, 1 for percentages)
+
+ 📐 Layout Optimization:
+ • Auto-sized columns for readability
+ • Freeze panes for easy navigation
+ • Centered headers with bold formatting
+ • Left-aligned text, right-aligned numbers
+
+ === CHART & VISUALIZATION REQUIREMENTS ===
+ Include appropriate charts for data visualization:
+ 📊 Chart Types by Data Category:
+ • Revenue trends: Line charts
+ • Expense breakdown: Pie charts
+ • Asset composition: Stacked bar charts
+ • Financial ratios: Column charts
+ • Cash flow: Waterfall charts (if possible)
+
+ === PYTHON SCRIPT STRUCTURE ===
+ Create 'generate_excel_report.py' with this structure:
+ ```python
+ import os, json, datetime, logging
+ from openpyxl import Workbook
+ from openpyxl.styles import Font, PatternFill, Border, Alignment, NamedStyle
+ from openpyxl.chart import BarChart, LineChart, PieChart
+ from openpyxl.utils.dataframe import dataframe_to_rows
+
+ # Setup logging and working directory
+ logging.basicConfig(level=logging.INFO)
+ os.chdir(os.path.dirname(os.path.abspath(__file__)) or '.')
+
+ def load_financial_data():
+     # Load and validate JSON data
+
+ def create_worksheet_styles():
+     # Define professional styles
+
+ def create_executive_summary(wb, data):
+     # Create executive summary with key metrics
+
+ def create_financial_statements(wb, data):
+     # Create income statement, balance sheet, cash flow
+
+ def add_charts_and_visualizations(wb, data):
+     # Add appropriate charts to worksheets
+
+ def generate_financial_report():
+     try:
+         data = load_financial_data()
+         wb = Workbook()
+         create_worksheet_styles()
+         create_executive_summary(wb, data)
+         create_financial_statements(wb, data)
+         add_charts_and_visualizations(wb, data)
+
+         # Save with timestamp
+         timestamp = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')
+         filename = f'Financial_Report_{timestamp}.xlsx'
+         wb.save(filename)
+         logging.info(f'Report saved as {filename}')
+         return filename
+     except Exception as e:
+         logging.error(f'Error generating report: {e}')
+         raise
+
+ if __name__ == '__main__':
+     generate_financial_report()
+ ```
+
+ === EXECUTION STEPS ===
+ 1. Read and analyze 'arranged_financial_data.json' structure
+ 2. Install required packages: pip_install_package('openpyxl')
+ 3. Create comprehensive Python script with error handling
+ 4. Save script using save_file tool
+ 5. Execute using run_shell_command('python generate_excel_report.py 2>&1')
+ 6. Verify file creation with list_files
+ 7. Validate file size and integrity
+ 8. Report execution results and any issues
+
+ === SUCCESS CRITERIA ===
+ ✅ Excel file created with timestamp filename
+ ✅ File size >10KB (indicates substantial content)
+ ✅ All worksheets present and formatted professionally
+ ✅ Charts and visualizations included
+ ✅ No execution errors in logs
+ ✅ Data accurately transferred from JSON to Excel
+
+ === ERROR HANDLING ===
+ If issues occur:
+ • Log detailed error information
+ • Identify root cause (data, formatting, or execution)
+ • Implement fixes and retry
+ • Provide clear status updates
+
+ Generate the comprehensive Excel report now.
prompts/workflow/data_arrangement.json ADDED
@@ -0,0 +1,93 @@
+ {
+   "prompt": [
+     "You are a financial data organization specialist. Transform the extracted data into Excel-ready format.",
+     "",
+     "=== YOUR TASK ===",
+     "Reorganize raw financial data into 8-12 professional Excel worksheet categories with proper headers and structure.",
+     "",
+     "=== EXCEL WORKSHEET CATEGORIES ===",
+     "Create these comprehensive worksheet categories:",
+     "📋 1. Executive Summary & Key Metrics",
+     "📊 2. Income Statement / P&L",
+     "💰 3. Balance Sheet - Assets",
+     "💳 4. Balance Sheet - Liabilities & Equity",
+     "💸 5. Cash Flow Statement",
+     "📈 6. Financial Ratios & Analysis",
+     "🏢 7. Revenue Analysis & Breakdown",
+     "💼 8. Expense Analysis & Breakdown",
+     "📊 9. Profitability Analysis",
+     "👥 10. Operational Metrics",
+     "⚠️ 11. Risk Assessment & Notes",
+     "📝 12. Data Sources & Methodology",
+     "",
+     "=== EXCEL STRUCTURE FOR EACH WORKSHEET ===",
+     "Design each worksheet with:",
+     "• Row 1: Company name and entity information",
+     "• Row 2: Worksheet title and description",
+     "• Row 3: Units (e.g., 'All figures in millions USD')",
+     "• Row 4: Column headers (Item | [Actual Period 1] | [Actual Period 2] | etc.)",
+     "• Row 5+: Financial data rows with clear line item names",
+     "",
+     "=== DYNAMIC PERIOD HANDLING ===",
+     "• Identify ALL available reporting periods from extracted data",
+     "• Use actual years/periods present (e.g., 'FY 2023', 'Q3 2024', 'CY 2022')",
+     "• Arrange periods chronologically (oldest to newest)",
+     "• Support single-period or multi-period data",
+     "• Handle various formats: fiscal years, calendar years, quarters, interim periods",
+     "",
+     "=== DATA ORGANIZATION RULES ===",
+     "• Map each data point to the most appropriate worksheet",
+     "• Group related items together within each category",
+     "• Follow standard financial statement ordering",
+     "• Preserve ALL original data values - no modifications",
+     "• Use 'N/A' or 'Not Disclosed' for missing data",
+     "• Maintain consistent units and currency labels",
+     "",
+     "=== OUTPUT JSON STRUCTURE ===",
+     "Create JSON with exactly these fields:",
+     "```json",
+     "{",
+     "  \"categories\": {",
+     "    \"Executive_Summary\": { \"data\": [...], \"description\": \"...\" },",
+     "    \"Income_Statement\": { \"data\": [...], \"description\": \"...\" },",
+     "    \"Balance_Sheet_Assets\": { \"data\": [...], \"description\": \"...\" }",
+     "    // ... for all 12 categories",
+     "  },",
+     "  \"headers\": {",
+     "    \"Executive_Summary\": [\"Item\", \"[Actual Period 1]\", \"[Actual Period 2]\", \"etc.\"],",
+     "    // ... headers using ACTUAL periods from the data",
+     "  },",
+     "  \"metadata\": {",
+     "    \"company_name\": \"...\",",
+     "    \"reporting_periods\": [\"List of actual periods found in data\"],",
+     "    \"currency\": \"...\",",
+     "    \"units\": \"...\",",
+     "    \"period_format\": \"fiscal_year | calendar_year | quarterly | etc.\",",
+     "    \"data_quality_notes\": [...]",
+     "  }",
+     "}",
+     "```",
+     "",
+     "=== CRITICAL RESTRICTIONS ===",
+     "❌ NO calculations, analysis, or interpretations",
+     "❌ NO modifications to original data values",
+     "❌ NO ratio calculations or trend analysis",
+     "✅ ONLY organize and format for Excel import",
+     "",
+     "=== EXECUTION STEPS ===",
+     "1. Analyze extracted data structure and identify actual reporting periods",
+     "2. Map data points to appropriate worksheet categories",
+     "3. Design Excel headers using actual periods from data",
+     "4. Organize data maintaining original values and units",
+     "5. Create comprehensive JSON with categories, headers, metadata",
+     "6. Save as 'arranged_financial_data.json' using save_file",
+     "7. Verify file exists with list_files",
+     "8. Validate content with read_file",
+     "9. Report success only after file validation",
+     "",
+     "Extracted Data to organize: {extracted_data}"
+   ],
+   "variables": ["extracted_data"],
+   "description": "Data arrangement and organization prompt for Excel preparation",
+   "category": "workflow"
+ }
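The save-then-verify loop in steps 6-9 maps onto plain file I/O as follows; this sketch uses the standard library in place of the agent's save_file/list_files/read_file tools:

```python
import json
import os

arranged = {"categories": {}, "headers": {}, "metadata": {"company_name": "ExampleCo"}}

with open("arranged_financial_data.json", "w", encoding="utf-8") as f:
    json.dump(arranged, f, indent=2, ensure_ascii=False)

assert os.path.exists("arranged_financial_data.json")   # the list_files step
with open("arranged_financial_data.json", encoding="utf-8") as f:
    json.load(f)                                        # the read_file validation step
```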
prompts/workflow/data_arrangement.txt ADDED
@@ -0,0 +1,34 @@
+ You are given raw, extracted financial data. Your task is to reorganize it and prepare it for Excel-based reporting.
+
+ ========== WHAT TO DELIVER ==========
+ • A single JSON object saved as arranged_financial_data.json
+ • Fields required: categories, headers, metadata
+
+ ========== HOW TO ORGANIZE ==========
+ Create distinct, Excel-ready categories (one worksheet each) for logical grouping of financial data. Examples include:
+ 1. Income Statement Data
+ 2. Balance Sheet Data
+ 3. Cash Flow Data
+ 4. Company Information / General Data
+
+ ========== STEP-BY-STEP ==========
+ 1. Map every data point into the most appropriate category above.
+ 2. For each category, identify and include all necessary headers for an Excel template, such as years, company names, financial line item names, and units of measurement (e.g., "in millions").
+ 3. Ensure data integrity by not modifying, calculating, or analyzing the original data values.
+ 4. Preserve original data formats and units.
+ 5. Organize data in a tabular format suitable for direct Excel import.
+ 6. Include metadata about data sources and reporting periods where available.
+ 7. Assemble everything into the JSON schema described under "WHAT TO DELIVER."
+ 8. Save the JSON as arranged_financial_data.json via save_file.
+ 9. Use list_files to confirm the file exists, then read_file to validate its content.
+ 10. If the file is missing or malformed, fix the issue and repeat steps 8-9.
+ 11. Only report success after the file passes both existence and content checks.
+
+ ========== IMPORTANT RESTRICTIONS ==========
+ - Never perform any analysis on the data.
+ - Do not calculate ratios, growth rates, or trends.
+ - Do not provide insights or interpretations.
+ - Do not modify the actual data values.
+ - Focus solely on organization and proper formatting.
+
+ Extracted Data: {extracted_data}
prompts/workflow/data_extraction.json ADDED
@@ -0,0 +1,65 @@
+ {
+   "prompt": [
+     "You are a financial data extraction specialist analyzing the document at: {file_path}",
+     "",
+     "=== EXTRACTION APPROACH ===",
+     "Use a systematic 5-phase approach: Document Analysis → Critical Data → Standard Financials → Advanced Metrics → Quality Assurance",
+     "",
+     "=== PHASE 1: DOCUMENT ANALYSIS ===",
+     "First, quickly identify:",
+     "• Document type (Annual Report, 10-K, 10-Q, Quarterly Report, etc.)",
+     "• Company name and ticker symbol",
+     "• Reporting period and fiscal year",
+     "• Currency and unit scales (millions/thousands)",
+     "• Location of key financial statements",
+     "",
+     "=== PHASE 2: CRITICAL DATA (Must Extract) ===",
+     "🔴 Company Essentials:",
+     "• Official company name and ticker",
+     "• Reporting period and currency",
+     "• Document type and audit status",
+     "",
+     "🔴 Core Performance:",
+     "• Total Revenue/Net Sales",
+     "• Net Income/Profit",
+     "• Total Assets",
+     "• Total Shareholders' Equity",
+     "• Basic Earnings Per Share (EPS)",
+     "",
+     "=== PHASE 3: STANDARD FINANCIALS (High Priority) ===",
+     "📊 Income Statement: Revenue breakdown, COGS, gross profit, operating expenses, operating income, interest, taxes, diluted EPS",
+     "💰 Balance Sheet: Current/non-current assets, current/non-current liabilities, equity components",
+     "💸 Cash Flow: Operating, investing, financing cash flows, capex, free cash flow",
+     "",
+     "=== PHASE 4: ADVANCED METRICS (If Available) ===",
+     "📈 Financial Ratios: Margins, returns (ROE/ROA), liquidity ratios, leverage ratios",
+     "👥 Operational Data: Employee count, locations, customer metrics, production volumes",
+     "📋 Supplementary: Dividends, buybacks, guidance, one-time items",
+     "",
+     "=== PHASE 5: QUALITY ASSURANCE ===",
+     "• Validate Balance Sheet equation (Assets = Liabilities + Equity)",
+     "• Assign confidence scores: 1.0 (clearly stated) to 0.4 (unclear)",
+     "• Flag missing critical data with explanations",
+     "• Note any unusual values or inconsistencies",
+     "",
+     "=== OUTPUT REQUIREMENTS ===",
+     "Return structured data using ExtractedFinancialData model:",
+     "• company_name: Official company name",
+     "• document_type: Type of document analyzed",
+     "• reporting_period: Fiscal period (e.g., 'FY 2023')",
+     "• data_points: Array with field_name, value, category, period, unit, confidence",
+     "• summary: 2-3 sentence summary of key findings",
+     "",
+     "=== EXTRACTION TIPS ===",
+     "• Look in financial tables first, then notes, then text",
+     "• Watch for footnotes and accounting changes",
+     "• Note restatements or discontinued operations",
+     "• Pay attention to scale indicators (millions/thousands)",
+     "• Extract multiple periods when available",
+     "",
+     "Document to analyze: {file_path}"
+   ],
+   "variables": ["file_path"],
+   "description": "Comprehensive financial document data extraction prompt",
+   "category": "workflow"
+ }
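The output schema named here (ExtractedFinancialData, DataPoint) is defined elsewhere in the repo; a hypothetical shape consistent with the fields this prompt lists, sketched as plain dataclasses:

```python
from dataclasses import dataclass, field

@dataclass
class DataPoint:
    field_name: str       # e.g. "Total Revenue"
    value: str            # kept as extracted; units recorded separately
    category: str         # e.g. "Income Statement"
    period: str           # e.g. "FY 2023"
    unit: str             # e.g. "USD millions"
    confidence: float     # 1.0 clearly stated ... 0.4 unclear, per Phase 5

@dataclass
class ExtractedFinancialData:
    company_name: str
    document_type: str
    reporting_period: str
    summary: str
    data_points: list[DataPoint] = field(default_factory=list)
```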
prompts/workflow/data_extraction.txt ADDED
@@ -0,0 +1,58 @@
+ You are a financial data extraction specialist analyzing the document at: {file_path}
+
+ === EXTRACTION APPROACH ===
+ Use a systematic 5-phase approach: Document Analysis → Critical Data → Standard Financials → Advanced Metrics → Quality Assurance
+
+ === PHASE 1: DOCUMENT ANALYSIS ===
+ First, quickly identify:
+ • Document type (Annual Report, 10-K, 10-Q, Quarterly Report, etc.)
+ • Company name and ticker symbol
+ • Reporting period and fiscal year
+ • Currency and unit scales (millions/thousands)
+ • Location of key financial statements
+
+ === PHASE 2: CRITICAL DATA (Must Extract) ===
+ 🔴 Company Essentials:
+ • Official company name and ticker
+ • Reporting period and currency
+ • Document type and audit status
+
+ 🔴 Core Performance:
+ • Total Revenue/Net Sales
+ • Net Income/Profit
+ • Total Assets
+ • Total Shareholders' Equity
+ • Basic Earnings Per Share (EPS)
+
+ === PHASE 3: STANDARD FINANCIALS (High Priority) ===
+ 📊 Income Statement: Revenue breakdown, COGS, gross profit, operating expenses, operating income, interest, taxes, diluted EPS
+ 💰 Balance Sheet: Current/non-current assets, current/non-current liabilities, equity components
+ 💸 Cash Flow: Operating, investing, financing cash flows, capex, free cash flow
+
+ === PHASE 4: ADVANCED METRICS (If Available) ===
+ 📈 Financial Ratios: Margins, returns (ROE/ROA), liquidity ratios, leverage ratios
+ 👥 Operational Data: Employee count, locations, customer metrics, production volumes
+ 📋 Supplementary: Dividends, buybacks, guidance, one-time items
+
+ === PHASE 5: QUALITY ASSURANCE ===
+ • Validate Balance Sheet equation (Assets = Liabilities + Equity)
+ • Assign confidence scores: 1.0 (clearly stated) to 0.4 (unclear)
+ • Flag missing critical data with explanations
+ • Note any unusual values or inconsistencies
+
+ === OUTPUT REQUIREMENTS ===
+ Return structured data using ExtractedFinancialData model:
+ • company_name: Official company name
+ • document_type: Type of document analyzed
+ • reporting_period: Fiscal period (e.g., 'FY 2023')
+ • data_points: Array with field_name, value, category, period, unit, confidence
+ • summary: 2-3 sentence summary of key findings
+
+ === EXTRACTION TIPS ===
+ • Look in financial tables first, then notes, then text
+ • Watch for footnotes and accounting changes
+ • Note restatements or discontinued operations
+ • Pay attention to scale indicators (millions/thousands)
+ • Extract multiple periods when available
+
+ Document to analyze: {file_path}
settings.py DELETED
@@ -1,54 +0,0 @@
- import os
- from pathlib import Path
- from dotenv import load_dotenv
-
- load_dotenv()
-
-
- class Settings:
-     GOOGLE_AI_API_KEY = os.getenv("GOOGLE_API_KEY")
-     MAX_FILE_SIZE_MB = 50
-     SUPPORTED_FILE_TYPES = [
-         "pdf",
-         "txt",
-         "png",
-         "jpg",
-         "jpeg",
-         "docx",
-         "xlsx",
-         "csv",
-         "md",
-         "json",
-         "xml",
-         "html",
-         "py",
-         "js",
-         "ts",
-         "doc",
-         "xls",
-         "ppt",
-         "pptx",
-     ]
-     # Use /tmp for temporary files on Hugging Face Spaces (or override with TEMP_DIR env var)
-     TEMP_DIR = Path(os.getenv("TEMP_DIR", "/tmp/data_extractor_temp"))
-     DOCKER_IMAGE = os.getenv("DOCKER_IMAGE", "python:3.12-slim")
-     COORDINATOR_MODEL = os.getenv("COORDINATOR_MODEL", "gemini-2.5-pro")
-     PROMPT_ENGINEER_MODEL = os.getenv("PROMPT_ENGINEER_MODEL", "gemini-2.5-pro")
-     DATA_EXTRACTOR_MODEL = os.getenv("DATA_EXTRACTOR_MODEL", "gemini-2.5-pro")
-     DATA_ARRANGER_MODEL = os.getenv("DATA_ARRANGER_MODEL", "gemini-2.5-pro")
-     CODE_GENERATOR_MODEL = os.getenv("CODE_GENERATOR_MODEL", "gemini-2.5-pro")
-
-     COORDINATOR_MODEL_THINKING_BUDGET=2048
-     PROMPT_ENGINEER_MODEL_THINKING_BUDGET=2048
-     DATA_EXTRACTOR_MODEL_THINKING_BUDGET=-1
-     DATA_ARRANGER_MODEL_THINKING_BUDGET=3072
-     CODE_GENERATOR_MODEL_THINKING_BUDGET=3072
-
-     @classmethod
-     def validate_config(cls):
-         if not cls.GOOGLE_API_KEY:
-             raise ValueError("GOOGLE_API_KEY required")
-         cls.TEMP_DIR.mkdir(exist_ok=True)
-
-
- settings = Settings()
terminal_stream.py CHANGED
@@ -20,6 +20,9 @@ class TerminalStreamManager:
         self.command_queue = Queue()
         self.is_running = False
         self.current_process = None
+        self.server = None
+        self.server_thread = None
+        self.loop = None

     async def register_client(self, websocket):
         """Register a new WebSocket client."""
@@ -174,6 +177,49 @@ class TerminalStreamManager:
             pass
         finally:
             await self.unregister_client(websocket)
+
+    def stop_server(self):
+        """Stop the WebSocket server gracefully."""
+        if self.server:
+            logger.info("Stopping terminal WebSocket server...")
+            self.is_running = False
+
+            # Close all client connections
+            if self.clients:
+                import asyncio
+                try:
+                    loop = asyncio.get_event_loop()
+                    for client in self.clients.copy():
+                        try:
+                            loop.create_task(client.close())
+                        except Exception as e:
+                            logger.warning(f"Error closing client connection: {e}")
+                    self.clients.clear()
+                except Exception as e:
+                    logger.warning(f"Error closing client connections: {e}")
+
+            # Terminate current process if running
+            if self.current_process:
+                try:
+                    self.current_process.terminate()
+                    self.current_process = None
+                except Exception as e:
+                    logger.warning(f"Error terminating process: {e}")
+
+            # Close the server
+            try:
+                if hasattr(self.server, 'close'):
+                    self.server.close()
+
+                # Stop the event loop if it exists
+                if self.loop and self.loop.is_running():
+                    self.loop.call_soon_threadsafe(self.loop.stop)
+
+                logger.info("Terminal WebSocket server stopped")
+            except Exception as e:
+                logger.error(f"Error stopping WebSocket server: {e}")
+        else:
+            logger.info("Terminal WebSocket server was not running")

 # Global terminal manager instance
 terminal_manager = TerminalStreamManager()
@@ -185,13 +231,17 @@ async def start_websocket_server(host='localhost', port=8765):
     async def handler(websocket, path):
         await terminal_manager.handle_client(websocket, path)

-    return await websockets.serve(handler, host, port)
+    server = await websockets.serve(handler, host, port)
+    terminal_manager.server = server
+    terminal_manager.is_running = True
+    return server

 def run_websocket_server():
     """Run WebSocket server in a separate thread."""
     def start_server():
         loop = asyncio.new_event_loop()
         asyncio.set_event_loop(loop)
+        terminal_manager.loop = loop

         try:
             server = loop.run_until_complete(start_websocket_server())
@@ -199,7 +249,10 @@ def run_websocket_server():
             loop.run_forever()
         except Exception as e:
             logger.error(f"Error starting WebSocket server: {e}")
+        finally:
+            logger.info("WebSocket server loop ended")

     thread = threading.Thread(target=start_server, daemon=True)
+    terminal_manager.server_thread = thread
     thread.start()
     return thread
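One plausible way to wire the new `stop_server()` into application shutdown; using `atexit` is an assumption, since the app may instead hook Gradio's own lifecycle:

```python
import atexit

from terminal_stream import run_websocket_server, terminal_manager

thread = run_websocket_server()                # daemon thread hosting the asyncio loop
atexit.register(terminal_manager.stop_server)  # close clients, process, server, loop
```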
utils/logger.py CHANGED
@@ -1,22 +1,42 @@
  import logging
+ import logging.handlers
- from datetime import datetime
+ from datetime import datetime, timedelta
  from pathlib import Path
+ import os
+ import glob

  class AgentLogger:
- def __init__(self, log_dir="logs"):
+ def __init__(self, log_dir="logs", max_bytes=10*1024*1024, backup_count=5, cleanup_days=7):
  self.log_dir = Path(log_dir)
  self.log_dir.mkdir(exist_ok=True)
+ self.max_bytes = max_bytes
+ self.backup_count = backup_count
+ self.cleanup_days = cleanup_days
+
  self.logger = logging.getLogger("agent_logger")
  self.logger.setLevel(logging.DEBUG)
  formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
+
+ # Console handler
  console_handler = logging.StreamHandler()
  console_handler.setLevel(logging.INFO)
  console_handler.setFormatter(formatter)
+ self.logger.addHandler(console_handler)
+
+ # Rotating file handler
+ log_file = self.log_dir / f"agents_{datetime.now().strftime('%Y%m%d')}.log"
- file_handler = logging.FileHandler(self.log_dir / f"agents_{datetime.now().strftime('%Y%m%d')}.log")
+ file_handler = logging.handlers.RotatingFileHandler(
+ log_file,
+ maxBytes=max_bytes,
+ backupCount=backup_count,
+ encoding='utf-8'
+ )
  file_handler.setLevel(logging.DEBUG)
  file_handler.setFormatter(formatter)
- self.logger.addHandler(console_handler)
  self.logger.addHandler(file_handler)
+
+ # Clean up old log files on startup
+ self.cleanup_old_logs()

  def log_workflow_step(self, agent_name, message):
  self.logger.info(f"{agent_name}: {message}")
@@ -26,5 +46,72 @@ class AgentLogger:

  def log_inter_agent_pass(self, from_agent, to_agent, data_size):
  self.logger.info(f"🔗 PASS: {from_agent} → {to_agent} | Size: {data_size}")
+
+ def cleanup_old_logs(self):
+ """Clean up log files older than cleanup_days."""
+ try:
+ cutoff_date = datetime.now() - timedelta(days=self.cleanup_days)
+ log_pattern = str(self.log_dir / "agents_*.log*")
+
+ deleted_count = 0
+ for log_file_path in glob.glob(log_pattern):
+ log_file = Path(log_file_path)
+ try:
+ # Get file modification time
+ file_mtime = datetime.fromtimestamp(log_file.stat().st_mtime)
+
+ if file_mtime < cutoff_date:
+ log_file.unlink()
+ deleted_count += 1
+ print(f"Deleted old log file: {log_file.name}")
+
+ except Exception as e:
+ print(f"Error deleting log file {log_file}: {e}")
+
+ if deleted_count > 0:
+ print(f"Cleaned up {deleted_count} old log files")
+
+ except Exception as e:
+ print(f"Error during log cleanup: {e}")
+
+ def get_log_stats(self):
+ """Get statistics about log files."""
+ try:
+ log_pattern = str(self.log_dir / "agents_*.log*")
+ log_files = list(glob.glob(log_pattern))
+
+ total_size = 0
+ file_info = []
+
+ for log_file_path in log_files:
+ log_file = Path(log_file_path)
+ try:
+ size = log_file.stat().st_size
+ mtime = datetime.fromtimestamp(log_file.stat().st_mtime)
+
+ total_size += size
+ file_info.append({
+ 'name': log_file.name,
+ 'size_mb': round(size / (1024*1024), 2),
+ 'modified': mtime.strftime('%Y-%m-%d %H:%M:%S')
+ })
+ except Exception as e:
+ print(f"Error reading log file {log_file}: {e}")
+
+ return {
+ 'total_files': len(log_files),
+ 'total_size_mb': round(total_size / (1024*1024), 2),
+ 'files': file_info
+ }
+
+ except Exception as e:
+ print(f"Error getting log stats: {e}")
+ return {'error': str(e)}

- agent_logger = AgentLogger()
+ # Create global logger with configuration
+ agent_logger = AgentLogger(
+ log_dir="logs",
+ max_bytes=10*1024*1024,  # 10MB per file
+ backup_count=5,  # Keep 5 backup files
+ cleanup_days=7  # Delete files older than 7 days
+ )
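
A quick sketch of how the new rotation knobs might be used; the values below are arbitrary, and everything else mirrors the class as committed above.

```python
# Tune rotation and retention per deployment (values are illustrative).
from utils.logger import AgentLogger

audit_logger = AgentLogger(
    log_dir="logs/audit",
    max_bytes=5 * 1024 * 1024,  # rotate at 5 MB instead of the 10 MB default
    backup_count=3,             # keep three rotated backups
    cleanup_days=30,            # retain a month of history
)

audit_logger.log_workflow_step("Data Extractor", "started document parse")
print(audit_logger.get_log_stats())  # totals plus per-file size and mtime
```

One thing to watch: every instance attaches handlers to the same named logger (`logging.getLogger("agent_logger")`), so constructing a second `AgentLogger` duplicates output; an `if not self.logger.handlers:` guard in `__init__` would avoid that.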
utils/prompt_loader.py ADDED
@@ -0,0 +1,188 @@
+ """
+ Utility for loading prompts and instructions from external JSON files.
+ """
+
+ import os
+ import json
+ from pathlib import Path
+ from typing import Dict, Optional, List, Any
+ import logging
+
+ logger = logging.getLogger(__name__)
+
+ class PromptLoader:
+ """Loads prompts and instructions from external JSON files."""
+
+ def __init__(self, base_dir: Optional[Path] = None):
+ """Initialize with base directory."""
+ if base_dir is None:
+ # Default to the project root directory
+ self.base_dir = Path(__file__).parent.parent
+ else:
+ self.base_dir = Path(base_dir)
+
+ self.prompts_dir = self.base_dir / "prompts"
+ self.instructions_dir = self.base_dir / "instructions"
+
+ # Cache for loaded files
+ self._cache: Dict[str, Dict[str, Any]] = {}
+
+ def load_prompt(self, prompt_name: str, **kwargs) -> str:
+ """
+ Load a prompt from the prompts directory.
+ Supports both .txt (plain text) and .json formats.
+
+ Args:
+ prompt_name: Name of the prompt file (without extension)
+ **kwargs: Variables to substitute in the prompt
+
+ Returns:
+ The loaded prompt with variables substituted
+ """
+ # Try .txt format first (preferred), then fall back to .json
+ txt_path = self.prompts_dir / f"{prompt_name}.txt"
+ json_path = self.prompts_dir / f"{prompt_name}.json"
+
+ if txt_path.exists():
+ # Load plain text file
+ logger.debug(f"Loading prompt from .txt file: {txt_path}")
+ with open(txt_path, 'r', encoding='utf-8') as f:
+ prompt_text = f.read().strip()
+ elif json_path.exists():
+ # Load JSON file (legacy format)
+ logger.debug(f"Loading prompt from .json file: {json_path}")
+ data = self._load_json_file(json_path)
+ prompt_data = data.get("prompt", "")
+
+ # Handle both string and list formats
+ if isinstance(prompt_data, list):
+ # Join list elements with newlines to create a single string
+ prompt_text = "\n".join(prompt_data)
+ else:
+ prompt_text = prompt_data
+ else:
+ raise FileNotFoundError(f"Prompt file not found: {prompt_name} (checked .txt and .json)")
+
+ # Substitute variables if provided
+ if kwargs:
+ try:
+ logger.debug(f"Formatting prompt {prompt_name} with variables: {list(kwargs.keys())}")
+ prompt_text = prompt_text.format(**kwargs)
+ logger.debug(f"Successfully formatted prompt {prompt_name}")
+ except KeyError as e:
+ logger.warning(f"Missing variable {e} in prompt {prompt_name}")
+ except Exception as e:
+ logger.error(f"Error formatting prompt {prompt_name}: {e}")
+ logger.error(f"Available variables: {list(kwargs.keys())}")
+
+ return prompt_text
+
+ def load_instruction(self, instruction_name: str) -> str:
+ """
+ Load instructions from the instructions directory as a single string.
+
+ Args:
+ instruction_name: Name of the instruction file (without .json extension)
+
+ Returns:
+ The loaded instructions as a joined string
+ """
+ instructions_list = self.load_instructions_as_list(instruction_name)
+ return "\n".join(instructions_list)
+
+ def load_instructions_as_list(self, instruction_name: str) -> List[str]:
+ """
+ Load instructions and return as a list of strings.
+
+ Args:
+ instruction_name: Name of the instruction file (without .json extension)
+
+ Returns:
+ List of instruction strings
+ """
+ instruction_path = self.instructions_dir / f"{instruction_name}.json"
+ data = self._load_json_file(instruction_path)
+
+ instructions = data.get("instructions", [])
+
+ # Filter out empty strings
+ return [instruction for instruction in instructions if instruction.strip()]
+
+ def _load_json_file(self, file_path: Path) -> Dict[str, Any]:
+ """Load JSON file content with caching."""
+ cache_key = str(file_path)
+
+ # Check cache first
+ if cache_key in self._cache:
+ return self._cache[cache_key]
+
+ try:
+ if not file_path.exists():
+ raise FileNotFoundError(f"File not found: {file_path}")
+
+ with open(file_path, 'r', encoding='utf-8') as f:
+ data = json.load(f)
+
+ # Cache the data
+ self._cache[cache_key] = data
+ logger.debug(f"Loaded {file_path.name}: {type(data)} with {len(data)} keys")
+
+ return data
+
+ except json.JSONDecodeError as e:
+ logger.error(f"Invalid JSON in file {file_path}: {e}")
+ raise
+ except Exception as e:
+ logger.error(f"Error loading file {file_path}: {e}")
+ raise
+
+ def clear_cache(self):
+ """Clear the file cache."""
+ self._cache.clear()
+ logger.debug("Prompt loader cache cleared")
+
+ def list_prompts(self) -> List[str]:
+ """List all available prompt files."""
+ if not self.prompts_dir.exists():
+ return []
+
+ prompts = []
+ for file_path in self.prompts_dir.rglob("*.json"):
+ # Get relative path from prompts dir
+ rel_path = file_path.relative_to(self.prompts_dir)
+ # Remove .json extension and convert to forward slashes
+ prompt_name = str(rel_path.with_suffix(''))
+ prompts.append(prompt_name)
+
+ return sorted(prompts)
+
+ def list_instructions(self) -> List[str]:
+ """List all available instruction files."""
+ if not self.instructions_dir.exists():
+ return []
+
+ instructions = []
+ for file_path in self.instructions_dir.rglob("*.json"):
+ # Get relative path from instructions dir
+ rel_path = file_path.relative_to(self.instructions_dir)
+ # Remove .json extension and convert to forward slashes
+ instruction_name = str(rel_path.with_suffix(''))
+ instructions.append(instruction_name)
+
+ return sorted(instructions)
+
+ def get_info(self) -> dict:
+ """Get information about the prompt loader."""
+ return {
+ "base_dir": str(self.base_dir),
+ "prompts_dir": str(self.prompts_dir),
+ "instructions_dir": str(self.instructions_dir),
+ "prompts_dir_exists": self.prompts_dir.exists(),
+ "instructions_dir_exists": self.instructions_dir.exists(),
+ "available_prompts": self.list_prompts(),
+ "available_instructions": self.list_instructions(),
+ "cache_size": len(self._cache)
+ }
+
+ # Global instance
+ prompt_loader = PromptLoader()
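
A short sketch of the loader in use. The `examples/greeting` template and its `{name}` placeholder are hypothetical; `agents/data_extractor` is an instruction file the workflow below actually loads.

```python
from utils.prompt_loader import prompt_loader

# prompts/examples/greeting.txt might contain: "Hello {name}, begin the review."
greeting = prompt_loader.load_prompt("examples/greeting", name="analyst")

# Instruction files are JSON with an "instructions" array; blank entries are dropped
steps = prompt_loader.load_instructions_as_list("agents/data_extractor")

# Directory listing of available *.json prompt and instruction files
print(prompt_loader.get_info()["available_prompts"])
```

Two caveats visible in the code: `list_prompts()` only globs `*.json`, so `.txt` prompts (the preferred format) never appear in `get_info()` even though `load_prompt()` reads them; and substitution uses `str.format`, so any literal braces in a template must be doubled, which is presumably why the workflow below double-checks for a surviving `{extracted_data}` placeholder.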
workflow/financial_workflow.py CHANGED
@@ -17,6 +17,7 @@ from agno.workflow import Workflow
  from agno.utils.log import logger
  from agno.tools.shell import ShellTools
  from config.settings import settings
+ from utils.prompt_loader import prompt_loader


  # Structured Output Models to avoid JSON parsing issues
@@ -67,106 +68,21 @@ class FinancialDocumentWorkflow(Workflow):

  description: str = "Financial document analysis workflow with data extraction, organization, and Excel generation"

-
-
  # Data Extractor Agent - Structured output eliminates JSON parsing issues
  data_extractor: Agent = Agent(
- model=Gemini(id=settings.DATA_EXTRACTOR_MODEL,thinking_budget=settings.DATA_EXTRACTOR_MODEL_THINKING_BUDGET),
+ model=Gemini(id=settings.DATA_EXTRACTOR_MODEL,thinking_budget=settings.DATA_EXTRACTOR_MODEL_THINKING_BUDGET,api_key=settings.GOOGLE_API_KEY),
  description="Expert financial data extraction specialist",
- instructions=[
- "Extract comprehensive financial data from documents with these priorities:",
- "Identify and classify the document type: Income Statement, Balance Sheet, Cash Flow Statement, 10-K, 10-Q, Annual Report, Quarterly/Interim Report, Prospectus, Earnings Release, Proxy Statement, Investor Presentation, Press Release, or other",
- "Extract report version: audited, unaudited, restated, pro forma",
- "Capture language, country/jurisdiction, and file format (PDF, XLSX, HTML, etc.)",
- "Extract company name and unique identifiers: LEI, CIK, ISIN, Ticker",
- "Extract reporting entity: consolidated, subsidiary, segment",
- "Extract fiscal year and period covered (start and end dates)",
- "Extract all reporting, publication, and filing dates",
- "Extract currency and any currency translation notes",
- "Extract auditors name, if present",
- "Identify financial statement presentation style: single-step, multi-step, consolidated, segmental",
- "Capture table and note references for each data point",
- "Extract total revenue/net sales (with by-product/service, segment, and geography breakdowns if disclosed)",
- "Extract COGS or cost of sales",
- "Extract gross profit and gross margin",
- "Extract operating expenses: R&D, SG&A, advertising, depreciation, amortization",
- "Extract operating income (EBIT) and EBIT margin",
- "Extract non-operating items: interest income/expense, other income/expenses",
- "Extract pretax income, income tax expense, and net income (with breakdowns: continuing, discontinued ops)",
- "Extract basic and diluted EPS",
- "Extract comprehensive and other comprehensive income items",
- "Extract YoY and sequential income comparisons (if available)",
- "Extract current assets: cash and equivalents, marketable securities, accounts receivable (gross/net), inventory (raw, WIP, finished), prepaid expenses, other",
- "Extract non-current assets: PP&E (gross/net), intangible assets, goodwill, LT investments, deferred tax assets, right-of-use assets, other",
- "Extract current liabilities: accounts payable, accrued expenses, short-term debt, lease liabilities, taxes payable, other",
- "Extract non-current liabilities: long-term debt, deferred tax liabilities, pensions, lease obligations, other",
- "Extract total shareholders equity: common/ordinary stock, retained earnings, additional paid-in capital, treasury stock, accumulated OCI, minority interest",
- "Extract book value per share",
- "Extract cash flows: net cash from operating, investing, and financing activities",
- "Extract key cash flow line items: net cash from ops, capex, acquisitions/disposals, dividends, share buybacks, debt activities",
- "Extract non-cash adjustments: depreciation, amortization, SBC, deferred taxes, impairments, gain/loss on sale",
- "Extract profitability ratios: gross margin, operating margin, net margin, EBITDA margin",
- "Extract return ratios: ROE, ROA, ROIC",
- "Extract liquidity/solvency: current ratio, quick ratio, debt/equity, interest coverage",
- "Extract efficiency: asset turnover, inventory turnover, receivables turnover",
- "Extract per-share metrics: EPS (basic/diluted), BVPS, FCF per share",
- "Extract segmental/geographical/operational ratios and breakdowns",
- "Extract shares outstanding, share class details, voting rights",
- "Extract dividends declared/paid (amount, dates)",
- "Extract buyback authorization/utilization details",
- "Extract employee count (average, period-end)",
- "Extract store/branch/office count",
- "Extract customer/user/subscriber numbers (active/paying, ARPU, churn, MAU/DAU)",
- "Extract units shipped/sold, production volumes, operational stats",
- "Extract key management guidance/forecasts if present",
- "Extract risk factors, uncertainties, and forward-looking statements",
- "Extract ESG/sustainability data where available (emissions, board diversity, etc.)",
- "Flag any restatements, adjustments, or one-off items",
- "Highlight material non-recurring, extraordinary, or unusual items (gains/losses, litigation, impairments, restructuring)",
- "Identify related-party transactions and accounting policy changes",
- "For each data point, provide a confidence score (0–1) based on clarity and documentation",
- "Include table/note reference numbers where possible",
- "Note any ambiguity or extraction limitations for specific data",
- "List all units, scales (millions, thousands), and any conversion performed",
- "Normalize date and currency formats across extracted data",
- "Validate calculations (e.g., assets = liabilities + equity), and flag inconsistencies",
- "Return data in a structured format (JSON/table), with reference and confidence annotation"
- ],
+ instructions=prompt_loader.load_instructions_as_list("agents/data_extractor"),
  response_model=ExtractedFinancialData,
  structured_outputs=True,
  debug_mode=True,
  )
-
-

  # Data Arranger Agent - Organizes data into categories for Excel
  data_arranger: Agent = Agent(
- model=Gemini(id=settings.DATA_ARRANGER_MODEL,thinking_budget=settings.DATA_ARRANGER_MODEL_THINKING_BUDGET),
+ model=Gemini(id=settings.DATA_ARRANGER_MODEL,thinking_budget=settings.DATA_ARRANGER_MODEL_THINKING_BUDGET,api_key=settings.GOOGLE_API_KEY),
  description="Financial data organization and analysis expert",
- instructions=[
- 'Organize the extracted financial data into logical categories based on financial statement types (Income Statement, Balance Sheet, Cash Flow Statement, etc.).',
- 'Group related financial items together (e.g., all revenue items, all expense items, all asset items).',
- 'Ensure each category has a clear, descriptive name that would work as an Excel worksheet tab.',
- 'Always add appropriate headers for Excel templates including: Years (e.g., 2021, 2022, 2023, 2024), Company names or entity identifiers, Financial line item names, and Units of measurement (e.g., "in millions", "in thousands").',
- 'Create column headers that clearly identify what each data column represents.',
- 'Include row headers that clearly identify each financial line item.',
- 'Design categories suitable for comprehensive Excel worksheets, such as: Income Statement Data, Balance Sheet Data, Cash Flow Data, Key Metrics, and Company Information.',
- 'Maintain data integrity - do not modify, calculate, or analyze the original data values.',
- 'Preserve original data formats and units.',
- 'Ensure data is organized in a tabular format suitable for Excel import.',
- 'Include metadata about data sources and reporting periods where available.',
- 'Package everything into a JSON object with the fields: categories (object containing organized data by category), headers (object containing appropriate headers for each category), and metadata (object containing information about data sources, periods, and units).',
- 'Never perform any analysis on the data.',
- 'Do not calculate ratios, growth rates, or trends.',
- 'Do not provide insights or interpretations.',
- 'Do not modify the actual data values.',
- 'Focus solely on organization and proper formatting.',
- 'Save this JSON as \'arranged_financial_data.json\' using the save_file tool.',
- 'Run list_files to verify that the file now exists in the working directory.',
- 'Use read_file to ensure the JSON content was written correctly.',
- 'If the file is missing or the content is incorrect, debug, re-save, and repeat steps',
- 'Only report success after the files presence and validity are fully confirmed.'
- ],
+ instructions=prompt_loader.load_instructions_as_list("agents/data_arranger"),
  tools=[FileTools()], # FileTools for saving arranged data
  # NOTE: Cannot use structured_outputs with tools in Gemini - choosing tools over structured outputs
  markdown=True,
@@ -176,62 +92,17 @@ class FinancialDocumentWorkflow(Workflow):
  exponential_backoff=True,
  retries=10,
  )
-
+
  # Code Generator Agent - Creates Excel generation code
  code_generator = Agent(
  model=Gemini(
  id=settings.CODE_GENERATOR_MODEL,
- thinking_budget=settings.CODE_GENERATOR_MODEL_THINKING_BUDGET
+ thinking_budget=settings.CODE_GENERATOR_MODEL_THINKING_BUDGET,
+ api_key=settings.GOOGLE_API_KEY
  ),
  description="Excel report generator that analyzes JSON data and creates formatted workbooks using shell execution on any OS",
  goal="Generate a professional Excel report from arranged_financial_data.json with multiple worksheets, formatting, and charts",
- instructions=[
- "EXECUTION RULE: Always use run_shell_command() for Python execution. Never use save_to_file_and_run().",
- "",
- "CRITICAL: Always read the file to understand the struction of the JSON First"
- "FIRST, use read_file tool to load 'arranged_financial_data.json'.",
- "SECOND, analyze its structure deeply. Identify all keys, data types, nested structures, and any inconsistencies.",
- "THIRD, create analysis.py to programmatically examine the JSON. Execute using run_shell_command().",
- "FOURTH, based on the analysis, design your Excel structure. Plan worksheets, formatting, and charts needed.",
- "FIFTH, implement generate_excel_report.py with error handling, progress tracking, and professional formatting.",
- "",
- "CRITICAL: Always start Python scripts with:",
- "import os",
- "os.chdir(os.path.dirname(os.path.abspath(__file__)) or '.')",
- "This ensures the script runs in the correct directory regardless of OS.",
- "",
- "Available Tools:",
- "- FileTools: read_file, save_file, list_files",
- "- PythonTools: pip_install_package (ONLY for package installation)",
- "- ShellTools: run_shell_command (PRIMARY execution tool)",
- "",
- "Cross-Platform Execution:",
- "- Try: run_shell_command('python script.py 2>&1')",
- "- If fails on Windows: run_shell_command('python.exe script.py 2>&1')",
- "- PowerShell alternative: run_shell_command('powershell -Command \"python script.py\" 2>&1')",
- "",
- "Verification Commands (Linux/Mac):",
- "- run_shell_command('ls -la *.xlsx')",
- "- run_shell_command('file Financial_Report*.xlsx')",
- "- run_shell_command('du -h *.xlsx')",
- "",
- "Verification Commands (Windows/PowerShell):",
- "- run_shell_command('dir *.xlsx')",
- "- run_shell_command('powershell -Command \"Get-ChildItem *.xlsx\"')",
- "- run_shell_command('powershell -Command \"(Get-Item *.xlsx).Length\"')",
- "",
- "Debug Commands (Cross-Platform):",
- "- Current directory: run_shell_command('pwd') or run_shell_command('cd')",
- "- Python location: run_shell_command('where python') or run_shell_command('which python')",
- "- List files: run_shell_command('dir') or run_shell_command('ls')",
- "",
- "Package Installation:",
- "- pip_install_package('openpyxl')",
- "- Or via shell: run_shell_command('pip install openpyxl')",
- "- Windows: run_shell_command('python -m pip install openpyxl')",
- "",
- "Success Criteria: Excel file exists, size >5KB, no errors in output."
- ],
+ instructions=prompt_loader.load_instructions_as_list("agents/code_generator"),
  expected_output="A Financial_Report_YYYYMMDD_HHMMSS.xlsx file containing formatted data from the JSON with multiple worksheets, professional styling, and relevant charts",
  additional_context="This agent must work on Windows, Mac, and Linux. Always use os.path for file operations and handle path separators correctly. Include proper error handling for cross-platform compatibility.",
  tools=[
@@ -251,12 +122,57 @@ class FinancialDocumentWorkflow(Workflow):
  super().__init__(session_id=session_id, **kwargs)
  self.session_id = session_id or f"financial_workflow_{int(__import__('time').time())}"
  self.session_output_dir = Path(settings.TEMP_DIR) / self.session_id / "output"
+ self.session_input_dir = Path(settings.TEMP_DIR) / self.session_id / "input"
+ self.session_temp_dir = Path(settings.TEMP_DIR) / self.session_id / "temp"
+
+ # Create all session directories
  self.session_output_dir.mkdir(parents=True, exist_ok=True)
+ self.session_input_dir.mkdir(parents=True, exist_ok=True)
+ self.session_temp_dir.mkdir(parents=True, exist_ok=True)

  # Configure tools with correct base directories after initialization
  self._configure_agent_tools()

  logger.info(f"FinancialDocumentWorkflow initialized with session: {self.session_id}")
+
+ def clear_cache(self):
+ """Clear workflow session cache and temporary files."""
+ try:
+ # Clear session state
+ self.session_state.clear()
+ logger.info(f"Cleared workflow cache for session: {self.session_id}")
+
+ # Clean up temporary files (keep input and output)
+ if self.session_temp_dir.exists():
+ import shutil
+ try:
+ shutil.rmtree(self.session_temp_dir)
+ self.session_temp_dir.mkdir(parents=True, exist_ok=True)
+ logger.info(f"Cleaned temporary files for session: {self.session_id}")
+ except Exception as e:
+ logger.warning(f"Could not clean temp directory: {e}")
+
+ except Exception as e:
+ logger.error(f"Error clearing workflow cache: {e}")
+
+ def cleanup_session(self):
+ """Complete cleanup of session including all files."""
+ try:
+ # Clear cache first
+ self.clear_cache()
+
+ # Remove entire session directory
+ session_dir = Path(settings.TEMP_DIR) / self.session_id
+ if session_dir.exists():
+ import shutil
+ try:
+ shutil.rmtree(session_dir)
+ logger.info(f"Completely removed session directory: {session_dir}")
+ except Exception as e:
+ logger.warning(f"Could not remove session directory: {e}")
+
+ except Exception as e:
+ logger.error(f"Error during session cleanup: {e}")

  def _configure_agent_tools(self):
  """Configure agent tools with the correct base directories"""
@@ -274,12 +190,23 @@ class FinancialDocumentWorkflow(Workflow):
  elif isinstance(tool, PythonTools):
  tool.base_dir = self.session_output_dir

- def run(self, file_path: str, use_cache: bool = True) -> RunResponse:
+ def run(self, file_path: str = None, **kwargs) -> RunResponse:
  """
+ Main workflow execution method
  Pure Python workflow execution - no streaming, no JSON parsing issues
  """
+ # Handle file_path from parameter or attribute
+ if file_path is None:
+ file_path = getattr(self, 'file_path', None)
+
+ if file_path is None:
+ raise ValueError("file_path must be provided either as parameter or set as attribute")
+
  logger.info(f"Processing financial document: {file_path}")

+ # use_cache now arrives via kwargs since it is no longer a named parameter
+ use_cache = kwargs.get('use_cache', True)
+
  # Check cache first if enabled
  if use_cache and "final_results" in self.session_state:
  logger.info("Returning cached results")
@@ -300,27 +227,7 @@ class FinancialDocumentWorkflow(Workflow):
  logger.info("Using cached extraction data")
  else:
  document = File(filepath=file_path)
- extraction_prompt = f"""
- Analyze this financial document and extract all relevant financial data points.
-
- Focus on:
- - Company identification, including company name, entity identifiers (e.g., Ticker, CIK, ISIN, LEI), and reporting entity type (consolidated/subsidiary/segment).
- - All reporting period information: fiscal year, period start and end dates, reporting date, publication date, and currency used.
- - Revenue data: total revenue/net sales, breakdown by product/service, segment, and geography if available, and year-over-year growth rates.
- - Expense data: COGS, operating expenses (R&D, SG&A, advertising, depreciation/amortization), interest expenses, taxes, and any non-operating items.
- - Profit data: gross profit, operating income (EBIT/EBITDA), pretax profit, net income, basic and diluted earnings per share (EPS), comprehensive income.
- - Balance sheet items: current assets (cash, securities, receivables, inventories), non-current assets (PP&E, intangibles, goodwill), current liabilities, non-current liabilities, and all categories of shareholders’ equity.
- - Cash flow details: cash from operations, investing, and financing; capex, dividends, buybacks; non-cash adjustments (depreciation, SBC, etc.).
- - Financial ratios: profitability (gross margin, operating margin, net margin), return (ROE, ROA, ROIC), liquidity (current/quick ratio), leverage (debt/equity, interest coverage), efficiency (asset/inventory/receivables turnover), per-share metrics.
- - Capital and shareholder information: shares outstanding, share class details, dividends, and buyback information.
- - Non-financial and operational metrics: employee, store, customer/user counts, production volumes, and operational breakdowns.
- - Extract any additional material metrics, key management guidance, risks, uncertainties, ESG indicators, or forward-looking statements.
- - Flag/annotate any unusual or non-recurring items, restatements, or related-party transactions.
- - For each data point, provide a confidence score (0–1) and, where possible, include reference identifiers (table/note numbers).
- - If units or currencies differ throughout, normalize and annotate the data accordingly.
- Return your extraction in a structured, machine-readable format with references and confidence levels for each field.
- Document path: {file_path}
- """
+ extraction_prompt = prompt_loader.load_prompt("workflow/data_extraction", file_path=file_path)

  extraction_response: RunResponse = self.data_extractor.run(
  extraction_prompt,
@@ -339,42 +246,20 @@ class FinancialDocumentWorkflow(Workflow):
  arrangement_content = self.session_state["arrangement_response"]
  logger.info("Using cached arrangement data")
  else:
- arrangement_prompt = f"""
- You are given raw, extracted financial data. Your task is to reorganize it and prepare it for Excel-based reporting.
-
- ========== WHAT TO DELIVER ==========
- • A single JSON object saved as arranged_financial_data.json
- Fields required: categories, headers, metadata
-
- ========== HOW TO ORGANIZE ==========
- Create distinct, Excel-ready categories (one worksheet each) for logical grouping of financial data. Examples include:
- 1. Income Statement Data
- 2. Balance Sheet Data
- 3. Cash Flow Data
- 4. Company Information / General Data
-
- ========== STEP-BY-STEP ==========
- 1. Map every data point into the most appropriate category above.
- 2. For each category, identify and include all necessary headers for an Excel template, such as years, company names, financial line item names, and units of measurement (e.g., "in millions").
- 3. Ensure data integrity by not modifying, calculating, or analyzing the original data values.
- 4. Preserve original data formats and units.
- 5. Organize data in a tabular format suitable for direct Excel import.
- 6. Include metadata about data sources and reporting periods where available.
- 7. Assemble everything into the JSON schema described under “WHAT TO DELIVER.”
- 8. Save the JSON as arranged_financial_data.json via save_file.
- 9. Use list_files to confirm the file exists, then read_file to validate its content.
- 10. If the file is missing or malformed, fix the issue and repeat steps 8 – 9.
- 11. Only report success after the file passes both existence and content checks.
-
- ========== IMPORTANT RESTRICTIONS ==========
- - Never perform any analysis on the data.
- - Do not calculate ratios, growth rates, or trends.
- - Do not provide insights or interpretations.
- - Do not modify the actual data values.
- - Focus solely on organization and proper formatting.
-
- Extracted Data: {extracted_data.model_dump_json(indent=2)}
- """
+ # Debug: Check extracted data before passing to prompt
+ extracted_json = extracted_data.model_dump_json(indent=2)
+ logger.debug(f"Extracted data size: {len(extracted_json)} characters")
+ logger.debug(f"First 200 chars of extracted data: {extracted_json[:200]}...")
+
+ arrangement_prompt = prompt_loader.load_prompt("workflow/data_arrangement",
+ extracted_data=extracted_json)
+
+ # Debug: Check if prompt contains the actual data or just the placeholder
+ if "{extracted_data}" in arrangement_prompt:
+ logger.error("CRITICAL: Variable substitution failed! Prompt still contains {extracted_data} placeholder")
+ logger.error(f"Prompt length: {len(arrangement_prompt)}")
+ else:
+ logger.info(f"Variable substitution successful. Prompt length: {len(arrangement_prompt)}")

  arrangement_response: RunResponse = self.data_arranger.run(arrangement_prompt)
  arrangement_content = arrangement_response.content
@@ -391,35 +276,7 @@ class FinancialDocumentWorkflow(Workflow):
  execution_success = self.session_state.get("execution_success", False)
  logger.info("Using cached code generation results")
  else:
- code_prompt = f"""
- Your objective: Turn the organized JSON data into a polished, multi-sheet Excel report—and prove that it works.
-
- ========== INPUT ==========
- File: arranged_financial_data.json
- Tool to read it: read_file
-
- ========== WHAT THE PYTHON SCRIPT MUST DO ==========
- 1. Load arranged_financial_data.json and parse its contents.
- 2. For each category in the JSON, create a dedicated worksheet using openpyxl.
- 3. Apply professional touches:
- • Bold, centered headers
- • Appropriate number formats
- • Column-width auto-sizing
- • Borders, cell styles, and freeze panes
- 4. Insert charts (bar, line, or pie) wherever the data lends itself to visualisation.
- 5. Embed key metrics and summary notes prominently in the Executive Summary sheet.
- 6. Name the workbook: Financial_Report_<YYYYMMDD_HHMMSS>.xlsx.
- 7. Wrap every file and workbook operation in robust try/except blocks.
- 8. Log all major steps and any exceptions for easy debugging.
- 9. Save the script via save_to_file_and_run and execute it immediately.
- 10. After execution, use list_files to ensure the Excel file was created.
- 11. Optionally inspect the file (e.g., size or first bytes via read_file) to confirm it is not empty.
- 12. If the workbook is missing or corrupted, refine the code, re-save, and re-run until success.
-
- ========== OUTPUT ==========
- • A fully formatted Excel workbook in the working directory.
- • A concise summary of what ran, any issues encountered, and confirmation that the file exists and opens without error.
- """
+ code_prompt = prompt_loader.load_prompt("workflow/code_generation")

  code_response: RunResponse = self.code_generator.run(code_prompt)
  code_generation_content = code_response.content
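
Taken together, the workflow after this commit can be driven in a few lines. A hedged end-to-end sketch; the sample document path is hypothetical and `settings.TEMP_DIR` must be writable.

```python
from workflow.financial_workflow import FinancialDocumentWorkflow

workflow = FinancialDocumentWorkflow(session_id="demo_session")
try:
    # use_cache now travels through **kwargs in the new run() signature
    response = workflow.run(file_path="samples/annual_report.pdf", use_cache=True)
    print(response.content)
finally:
    workflow.cleanup_session()  # removes the whole session directory tree
```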