---
title: Agno Document Analysis
emoji: 📄
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
---

# Agno Document Analysis Workflow

A document processing application built with Agno v1.7.4. It uses a multi-agent workflow to analyze documents and extract structured data.

## Features

- **5-Agent Workflow**: Coordinator, Prompt Engineer, Data Extractor, Data Arranger, Code Generator
- **Multi-format Support**: PDF, TXT, PNG, JPG, JPEG, DOCX, XLSX, CSV, MD, JSON, XML, HTML, PY, JS, TS, DOC, XLS, PPT, PPTX
- **Real-time Processing**: Streaming interface with live updates
- **Sandboxed Execution**: Safe code execution environment
- **Beautiful UI**: Modern Gradio interface with custom animations
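
As an illustration of the multi-format support listed above, the following minimal sketch checks a filename against that extension list. The helper name `is_supported` and the constant `SUPPORTED_EXTENSIONS` are hypothetical and are not taken from the application's code; the real upload validation may work differently.

```python
from pathlib import Path

# Extensions copied from the "Multi-format Support" bullet in this README.
SUPPORTED_EXTENSIONS = frozenset({
    "pdf", "txt", "png", "jpg", "jpeg", "docx", "xlsx", "csv", "md", "json",
    "xml", "html", "py", "js", "ts", "doc", "xls", "ppt", "pptx",
})


def is_supported(filename: str) -> bool:
    """Return True if the extension is in the list (hypothetical helper)."""
    # Path.suffix is ".PDF" for "report.PDF" and "" when there is no extension.
    suffix = Path(filename).suffix
    return suffix.lstrip(".").lower() in SUPPORTED_EXTENSIONS


if __name__ == "__main__":
    for name in ("report.PDF", "slides.pptx", "archive.zip"):
        print(name, is_supported(name))
```

The comparison is case-insensitive, so `report.PDF` and `report.pdf` are treated the same way.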

## Quick Start

### Automated Installation

```bash
# Clone the repository
git clone <repository-url>
cd Data_Extractor

# Quick installation (recommended)
./install.sh

# Or use Python setup script
python setup.py
```

### Manual Installation

```bash
# Create virtual environment
python -m venv data_extractor_env
source data_extractor_env/bin/activate  # On Windows: data_extractor_env\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Create environment file
cp .env.example .env  # Update with your API keys

# Run the application
python app.py
```

## Installation Options

### Requirements Files

- **`requirements-minimal.txt`**: Essential dependencies only (~50 packages)
  ```bash
  pip install -r requirements-minimal.txt
  ```

- **`requirements.txt`**: Complete feature set (200+ packages)
  ```bash
  pip install -r requirements.txt
  ```

- **`requirements-dev.txt`**: Development dependencies with testing tools
  ```bash
  pip install -r requirements-dev.txt
  ```

### System Dependencies

Some features require system-level dependencies:

**macOS:**
```bash
brew install tesseract imagemagick poppler
```

**Ubuntu/Debian:**
```bash
sudo apt-get install tesseract-ocr libmagickwand-dev poppler-utils
```

**Windows:**
```bash
choco install tesseract imagemagick poppler
```
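
To confirm that the system dependencies above are visible on `PATH`, a small check along these lines may help. It is a sketch, not part of the project: the executable names are assumptions about what each package installs (`tesseract` for Tesseract, `pdftoppm` for Poppler, `magick` or `convert` for ImageMagick), and `find_missing_tools` is a hypothetical helper.

```python
import shutil

# Assumed executable names per package; any one candidate counts as installed.
REQUIRED_TOOLS = {
    "Tesseract": ("tesseract",),
    "Poppler": ("pdftoppm",),
    "ImageMagick": ("magick", "convert"),
}


def find_missing_tools(which=shutil.which) -> list[str]:
    """Return the tools for which no candidate executable is found on PATH."""
    return [
        name
        for name, candidates in REQUIRED_TOOLS.items()
        if not any(which(exe) for exe in candidates)
    ]


if __name__ == "__main__":
    missing = find_missing_tools()
    print("Missing:", ", ".join(missing) if missing else "none")
```

On Windows, `convert` can also resolve to an unrelated system utility, so a hit on that name alone does not prove ImageMagick is installed.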

## Usage

1. **Setup Environment**: Follow installation instructions above
2. **Configure API Keys**: Update `.env` file with your API keys
3. **Upload Document**: Choose a file in one of the supported formats listed under Features
4. **Select Analysis**: Choose from predefined types or custom prompts
5. **Process**: Watch the multi-agent workflow in real-time
6. **Download Results**: Get structured data and generated Excel reports

## Environment Variables

Create a `.env` file with the following variables:

```bash
# Required API key
GOOGLE_API_KEY=your_google_api_key_here
# Optional API key (keep comments on their own line; Docker's --env-file does not strip inline comments)
OPENAI_API_KEY=your_openai_api_key_here

# Application Settings
DEBUG=False
LOG_LEVEL=INFO
SESSION_TIMEOUT=3600

# File Processing
MAX_FILE_SIZE=50MB
SUPPORTED_FORMATS=pdf,docx,xlsx,txt

# Database (Optional)
DATABASE_URL=sqlite:///data_extractor.db
```
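
The sketch below shows one way these variables could be read and converted to typed values once they are present in the process environment. It is an assumption about how such settings might be consumed, not the application's actual loader: `parse_size` and `load_settings` are hypothetical names, and 1 MB is treated as 1024 * 1024 bytes.

```python
import os
import re

# Assumed binary units: 1 KB = 1024 bytes.
_UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}


def parse_size(text: str) -> int:
    """Convert a string such as '50MB' to a number of bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*([KMG]?B)\s*", text, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"Unrecognized size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2).upper()]


def load_settings(env=None) -> dict:
    """Read the variables documented above from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    api_key = env.get("GOOGLE_API_KEY")
    if not api_key:
        raise RuntimeError("GOOGLE_API_KEY is not set")
    formats = env.get("SUPPORTED_FORMATS", "pdf,docx,xlsx,txt")
    return {
        "google_api_key": api_key,
        "openai_api_key": env.get("OPENAI_API_KEY"),  # optional, may be None
        "debug": env.get("DEBUG", "False").strip().lower() == "true",
        "log_level": env.get("LOG_LEVEL", "INFO"),
        "session_timeout": int(env.get("SESSION_TIMEOUT", "3600")),
        "max_file_size": parse_size(env.get("MAX_FILE_SIZE", "50MB")),
        "supported_formats": {f.strip().lower() for f in formats.split(",") if f.strip()},
    }
```

Failing early when `GOOGLE_API_KEY` is missing surfaces a misconfigured `.env` file at startup rather than at the first API call.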

## Advanced Features

### Financial Document Processing
- Comprehensive financial data extraction
- 13-category data organization
- Excel report generation with charts
- XBRL and SEC filing support

### OCR and Image Processing
- EasyOCR and PaddleOCR integration
- Tesseract OCR support
- Advanced image preprocessing

### Machine Learning Integration
- TensorFlow and PyTorch support
- Scikit-learn for data analysis
- XGBoost and LightGBM for predictions

## Troubleshooting

For detailed troubleshooting and installation issues, see:
- [`INSTALLATION.md`](INSTALLATION.md) - Comprehensive installation guide
- [`FIXES_SUMMARY.md`](FIXES_SUMMARY.md) - Known issues and solutions

### Common Issues

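1. **Import Errors**: Install `requirements-minimal.txt` first, then add the remaining packages once the app starts
2. **OCR Issues**: Install the system dependencies for your platform (Tesseract, ImageMagick, Poppler)
3. **Memory Issues**: Use smaller batch sizes
4. **API Errors**: Verify that the API keys in your `.env` file are set and valid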

## Docker Support

```bash
# Build and run with Docker
docker build -t data-extractor .
docker run -p 7860:7860 --env-file .env data-extractor
```