---
license: mit
title: Data Extractor Using Gemini
sdk: docker
emoji: πŸƒ
colorFrom: yellow
colorTo: blue
---
# 📊 Financial Data Extractor Using Gemini

An AI-driven financial document analysis system that extracts and organizes data from financial documents and generates professional Excel reports using Google's Gemini models.

## 🚀 Features

### Core Functionality

- **📄 Multi-format Document Support**: PDF, DOCX, TXT, and image files
- **🔍 Intelligent Data Extraction**: AI-powered extraction of financial data points
- **📊 Smart Data Organization**: Automatic categorization into 12+ financial categories
- **💻 Excel Report Generation**: Professional multi-worksheet Excel reports with charts
- **🎯 Real-time Processing**: Live streaming interface with progress tracking

### Advanced Capabilities

- **🤖 Multi-Agent Workflow**: Specialized AI agents for extraction, arrangement, and code generation
- **💾 Session Management**: Persistent storage with SQLite caching
- **🔄 Auto-shutdown**: Intelligent resource management for cloud deployments
- **📱 Modern UI**: Beautiful Gradio-based web interface
- **🌐 Cross-platform**: Works on Windows, Mac, and Linux
- **🐳 Docker Support**: Containerized deployment ready

## 🏗️ Architecture

The system runs a three-agent workflow built on the Agno framework:

```
📄 Document Upload
    ↓
🔍 Data Extractor Agent
    ↓ (Structured Financial Data)
📊 Data Arranger Agent
    ↓ (Organized Categories)
💻 Code Generator Agent
    ↓ (Python Excel Code)
📊 Excel Report Output
```

### Agent Specialization

- **Data Extractor**: Extracts financial data points with confidence scoring
- **Data Arranger**: Organizes data into 12+ professional categories
- **Code Generator**: Creates Python code for Excel report generation
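
The hand-off between the three agents can be pictured with a minimal, framework-free sketch. It does not use the Agno API: the `DataPoint` type, the stage functions, and the keyword-based category rule are hypothetical stand-ins for the model-backed agents, shown only to illustrate the data flow.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DataPoint:
    """One extracted figure; the field names are illustrative assumptions."""
    label: str
    value: float
    confidence: float  # 0.0-1.0 score attached by the extractor stage


def extract(document_text: str) -> List[DataPoint]:
    """Stand-in for the Data Extractor agent: parse 'label: value' lines."""
    points = []
    for line in document_text.splitlines():
        if ":" not in line:
            continue
        label, raw = line.split(":", 1)
        try:
            value = float(raw.replace(",", "").strip())
        except ValueError:
            continue  # skip lines whose value is not numeric
        points.append(DataPoint(label.strip(), value, confidence=1.0))
    return points


def arrange(points: List[DataPoint]) -> Dict[str, List[DataPoint]]:
    """Stand-in for the Data Arranger agent: bucket points by keyword."""
    categories = defaultdict(list)
    for point in points:
        name = "Revenue" if "revenue" in point.label.lower() else "Other"
        categories[name].append(point)
    return dict(categories)


def generate_report_code(categories: Dict[str, List[DataPoint]]) -> str:
    """Stand-in for the Code Generator agent: emit Python source as text."""
    lines = ["rows = []"]
    for name, points in categories.items():
        for p in points:
            lines.append(f"rows.append(({name!r}, {p.label!r}, {p.value!r}))")
    return "\n".join(lines)


def run_pipeline(document_text: str) -> str:
    """Chain the three stages in the order shown in the diagram."""
    return generate_report_code(arrange(extract(document_text)))
```

Each stage consumes only the previous stage's output, which is what lets the real agents be configured and prompted independently.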

## 📋 Requirements

### System Requirements

- Python 3.8+
- Google API Key (for Gemini models)
- 2GB+ RAM recommended
- Cross-platform compatible

### Dependencies

```
agno>=1.7.4
gradio
google-generativeai
PyPDF2
Pillow
python-dotenv
pandas
matplotlib
openpyxl
python-docx
lxml
markdown
requests
seaborn
sqlalchemy
websockets
```

## 🚀 Quick Start

### 1. Clone the Repository

```bash
git clone <repository-url>
cd Data_Extractor_Using_Gemini
```

### 2. Install Dependencies

```bash
pip install -r requirements.txt
```

### 3. Configure Environment

Create a `.env` file:

```env
GOOGLE_API_KEY=your_gemini_api_key_here
```
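
A startup check along the following lines can catch a missing key before any request is made. This is a sketch, not the application's actual code: it assumes the `.env` values have already been loaded into the process environment (for example by `python-dotenv`, which is in the dependency list), and `get_api_key` is a hypothetical helper name.

```python
import os
from typing import Mapping, Optional


def get_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return GOOGLE_API_KEY, or raise a clear error if it is missing."""
    env = os.environ if env is None else env
    key = env.get("GOOGLE_API_KEY", "").strip()
    # Treat the template value from the example .env as "not configured".
    if not key or key == "your_gemini_api_key_here":
        raise RuntimeError("GOOGLE_API_KEY is not set; add it to .env")
    return key
```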

### 4. Run the Application

```bash
python app.py
```

The application will be available at `http://localhost:7860`.

## 🐳 Docker Deployment

### Build and Run

```bash
docker build -t financial-extractor .
docker run -p 7860:7860 --env-file .env financial-extractor
```

### Environment Variables

- `GOOGLE_API_KEY`: Your Google Gemini API key
- `INACTIVITY_TIMEOUT_MINUTES`: Auto-shutdown timeout (default: 30)
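
As an illustration of how the timeout variable might be consumed, the sketch below parses it with the documented default of 30 and checks whether the idle window has elapsed. The function names and the fallback for invalid values are assumptions; the application's own shutdown logic may differ.

```python
import os
import time
from typing import Mapping, Optional

DEFAULT_TIMEOUT_MINUTES = 30  # default stated in the list above


def read_timeout_minutes(env: Optional[Mapping[str, str]] = None) -> int:
    """Parse INACTIVITY_TIMEOUT_MINUTES, falling back to the default."""
    env = os.environ if env is None else env
    raw = env.get("INACTIVITY_TIMEOUT_MINUTES", "")
    try:
        minutes = int(raw)
    except ValueError:
        return DEFAULT_TIMEOUT_MINUTES  # unset or non-numeric (assumed behavior)
    return minutes if minutes > 0 else DEFAULT_TIMEOUT_MINUTES


def is_idle(last_activity: float, timeout_minutes: int,
            now: Optional[float] = None) -> bool:
    """True once at least `timeout_minutes` have passed since last activity."""
    now = time.monotonic() if now is None else now
    return now - last_activity >= timeout_minutes * 60
```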

## 📖 Usage Guide

### 1. Upload Document

- Drag and drop or select your financial document
- Supported formats: PDF, DOCX, TXT, PNG, JPG, JPEG
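
A pre-upload check against the supported formats could look like the following sketch. `is_supported` is a hypothetical helper; it inspects only the file extension, not the file contents.

```python
from pathlib import Path

# Extensions taken from the supported-formats list above.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".txt", ".png", ".jpg", ".jpeg"}


def is_supported(filename: str) -> bool:
    """Case-insensitive check of the file extension."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```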

### 2. Select Processing Mode

- **Quick Analysis**: Standard extraction and organization
- **Custom Prompts**: Use predefined prompt templates for specific document types

### 3. Monitor Progress

- Real-time streaming interface shows each processing step
- Progress indicators for all workflow stages
- Live terminal output for code execution

### 4. Download Results

- Professional Excel report with multiple worksheets
- Organized data categories with charts and formatting
- All intermediate files available for download

## 📊 Output Structure

The generated Excel reports include:

### Worksheets

- **Summary**: Executive overview with key metrics
- **Revenue**: Income and revenue streams
- **Expenses**: Operating and non-operating expenses
- **Assets**: Current and non-current assets
- **Liabilities**: Short-term and long-term liabilities
- **Equity**: Shareholder equity components
- **Cash Flow**: Cash flow statements
- **Ratios**: Financial ratio analysis
- **Charts**: Visual representations of key data
- **Raw Data**: Original extracted data points

### Features

- Professional formatting with consistent styling
- Interactive charts and visualizations
- Dynamic period handling (auto-detects years/quarters)
- Cross-referenced data validation
- Print-ready layouts
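
To give a feel for what period auto-detection involves, here is a small regex-based sketch. It is an assumption about one possible approach, not the project's implementation: it recognizes only bare four-digit years and `Q1`-`Q4` labels followed by a year, and `detect_periods` is a hypothetical name.

```python
import re
from typing import List

# Matches an optional quarter prefix ("Q1".."Q4") followed by a year 1900-2099.
_PERIOD = re.compile(r"\b(?:Q([1-4])\s*)?((?:19|20)\d{2})\b")


def detect_periods(headers: List[str]) -> List[str]:
    """Collect distinct period labels found in column headers, sorted."""
    found = []
    for header in headers:
        for quarter, year in _PERIOD.findall(header):
            label = f"{year}-Q{quarter}" if quarter else year
            if label not in found:
                found.append(label)
    return sorted(found)
```

Normalizing every match to a `YYYY` or `YYYY-Qn` label keeps annual and quarterly columns in chronological order when sorted as plain strings.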

## 🔧 Configuration

### Model Settings

Configure AI models in `config/settings.py`:

- Data Extractor Model
- Data Arranger Model
- Code Generator Model
- Thinking budgets and retry settings
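
The contents of `config/settings.py` are not reproduced here, so the following is only a guess at a workable shape: one settings object per agent. Every class, field, and key name is hypothetical, and the model ID is a placeholder to be replaced with a model your API key can access.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class AgentModelSettings:
    """Per-agent model options; every field name here is an assumption."""
    model_id: str
    thinking_budget: int = 0  # units assumed to be tokens
    max_retries: int = 3

    def __post_init__(self) -> None:
        if not self.model_id:
            raise ValueError("model_id must not be empty")
        if self.thinking_budget < 0 or self.max_retries < 0:
            raise ValueError("thinking_budget and max_retries must be >= 0")


PLACEHOLDER_MODEL = "<gemini-model-id>"  # placeholder, not a real model name

AGENT_SETTINGS: Dict[str, AgentModelSettings] = {
    "data_extractor": AgentModelSettings(PLACEHOLDER_MODEL, thinking_budget=1024),
    "data_arranger": AgentModelSettings(PLACEHOLDER_MODEL),
    "code_generator": AgentModelSettings(PLACEHOLDER_MODEL, max_retries=5),
}
```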

### Prompt Customization

Customize agent instructions in `instructions/agents/`:

- `data_extractor.md`: Data extraction prompts
- `data_arranger.md`: Data organization prompts
- `code_generator.md`: Excel generation prompts
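
Reading one of these instruction files at startup might be sketched as below. `load_instructions` is a hypothetical helper; the directory and file names come from the list above, while the loading mechanism itself is an assumption.

```python
from pathlib import Path

AGENT_NAMES = ("data_extractor", "data_arranger", "code_generator")


def load_instructions(agent: str, base_dir: str = "instructions/agents") -> str:
    """Return the Markdown instructions for one agent as a string."""
    if agent not in AGENT_NAMES:
        raise ValueError(f"unknown agent: {agent}")
    path = Path(base_dir) / f"{agent}.md"
    return path.read_text(encoding="utf-8").strip()
```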

### Workflow Configuration

Modify workflow behavior in `workflow/financial_workflow.py`:

- Agent configurations
- Tool assignments
- Output formats

## 🛠️ Development

### Project Structure

```
├── app.py                 # Main Gradio application
├── workflow/              # Core workflow implementation
├── instructions/          # Agent instruction templates
├── prompts/               # Prompt gallery configurations
├── config/                # Application settings
├── utils/                 # Utility functions
├── static/                # Static assets
├── models/                # Data models
└── terminal_stream.py     # Real-time terminal streaming
```

### Key Components

- **WorkflowUI**: Main interface controller
- **FinancialDocumentWorkflow**: Core processing pipeline
- **AutoShutdownManager**: Resource management
- **TerminalLogHandler**: Real-time logging
- **PromptGallery**: Template management

## 🔒 Security & Privacy

- **Local Processing**: File handling and report generation run locally; document content is sent to the Gemini API for analysis
- **No Data Storage**: Documents are processed and cleaned up automatically
- **API Key Security**: Environment-based configuration
- **Session Isolation**: Each session has isolated temporary directories
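
Per-session temporary directories with automatic cleanup can be sketched with the standard library alone. This is an illustration of the idea rather than the project's code; `session_workspace` is a hypothetical name, and `session_id` is assumed to contain no path separators.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def session_workspace(session_id: str) -> Iterator[Path]:
    """Create a private temp directory for one session, then delete it."""
    workdir = Path(tempfile.mkdtemp(prefix=f"session_{session_id}_"))
    try:
        yield workdir
    finally:
        # Remove uploads and intermediate files even if processing failed.
        shutil.rmtree(workdir, ignore_errors=True)
```

Because each call to `tempfile.mkdtemp` returns a fresh directory, two concurrent sessions never share files, and the `finally` clause handles the automatic cleanup.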

## 🌐 Deployment Options

### Local Development

```bash
python app.py
```

### Production (Gunicorn)

```bash
gunicorn -w 4 -b 0.0.0.0:7860 app:app
```

### Cloud Platforms

- **Hugging Face Spaces**: Ready for deployment
- **Google Cloud Run**: Containerized deployment
- **AWS/Azure**: Standard container deployment

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request

## 📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

## 🆘 Support

### Common Issues

- **API Key Errors**: Ensure your Google API key is valid and has Gemini access
- **Memory Issues**: Increase system RAM or reduce document size
- **Processing Timeouts**: Check network connectivity and API quotas
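
Transient timeouts and quota errors are often handled by retrying with exponential backoff. The sketch below shows the general pattern under stated assumptions: `call_with_retry` is a hypothetical helper, and the exception types to retry on would need to match whatever the API client actually raises.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying on the given errors with doubling delays."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x, ... base_delay
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting.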