# Scratch Vision Game - Technical Documentation

## Overview

The Scratch Vision Game is an AI-powered system that converts images or PDFs of Scratch programming blocks into working Scratch 3.0 projects (`.sb3` files). It combines computer vision, OCR, and large language models to analyze, interpret, and reconstruct Scratch programs from visual input.

## System Architecture

### Core Components

1. **Image Processing Pipeline** (`app.py`)

   - PDF extraction and image preprocessing
   - Multi-stage image enhancement using OpenCV
   - OCR text extraction with Tesseract
   - Visual similarity matching using multiple algorithms

2. **Block Recognition System** (`utils/block_relation_builder.py`)

   - Scratch block catalog management
   - Pseudocode to JSON conversion
   - Block relationship building and validation
   - Project structure generation

3. **AI Processing Layer**
   - LLM-based code interpretation using Groq/LLaMA
   - Multi-modal vision models for image captioning
   - Semantic understanding of Scratch programming concepts

## Process Flow & System Tree Structure

### Complete User Journey Tree

```
USER INPUT (PDF File via Web Interface)
│
├── 📁 /process_pdf [POST] - Flask Route Handler
│   │
│   ├── 🔍 PDF Validation & Security
│   │   ├── secure_filename() - Sanitize filename
│   │   ├── tempfile.mkdtemp() - Create temp directory
│   │   └── pdf_file.save() - Save to temp location
│   │
│   ├── 📄 PDF Processing Pipeline
│   │   │
│   │   ├── 🎯 extract_images_from_pdf()
│   │   │   ├── partition_pdf() - Unstructured library extraction
│   │   │   │   ├── strategy="hi_res"
│   │   │   │   ├── extract_image_block_types=["Image"]
│   │   │   │   └── extract_image_block_to_payload=True
│   │   │   │
│   │   │   ├── 💾 Save extracted.json
│   │   │   │   └── /outputs/EXTRACTED_JSON/{pdf_name}/extracted.json
│   │   │   │
│   │   │   └── 🔄 For Each Extracted Image:
│   │   │       │
│   │   │       ├── 🖼️ Image Processing Branch
│   │   │       │   ├── base64.b64decode() - Decode image data
│   │   │       │   ├── Image.open() - PIL image creation
│   │   │       │   ├── image.save() - Save as PNG
│   │   │       │   └── /outputs/DETECTED_IMAGE/{pdf_name}/Sprite_{i}.png
│   │   │       │
│   │   │       └── 🤖 AI Analysis Branch (Parallel)
│   │   │           │
│   │   │           ├── 📝 Description Generation
│   │   │           │   ├── LangGraph Agent (Groq LLaMA)
│   │   │           │   ├── Prompt: "Give a brief Captioning."
│   │   │           │   └── response["messages"][-1].content
│   │   │           │
│   │   │           ├── 🏷️ Name Generation
│   │   │           │   ├── LangGraph Agent (Groq LLaMA)
│   │   │           │   ├── Prompt: "give a short name caption"
│   │   │           │   └── response["messages"][-1].content
│   │   │           │
│   │   │           └── 📋 Metadata Assembly
│   │   │               └── extracted_sprites.json
│   │   │                   ├── "Sprite {count}": {
│   │   │                   │   ├── "name": AI_generated_name
│   │   │                   │   ├── "base64": image_data
│   │   │                   │   ├── "file-path": pdf_directory
│   │   │                   │   └── "description": AI_description
│   │   │                   └── }
│   │
│   └── 🎮 Project Generation Pipeline
│       │
│       ├── 🔍 similarity_matching()
│       │   │
│       │   ├── 📊 Embedding Generation Branch
│       │   │   │
│       │   │   ├── 🎯 Query Processing
│       │   │   │   ├── base64.b64decode() - Decode sprite images
│       │   │   │   ├── tempfile.mkdtemp() - Create temp workspace
│       │   │   │   └── Image.save() - Save temp sprite files
│       │   │   │
│       │   │   ├── 🧠 CLIP Embeddings
│       │   │   │   ├── OpenCLIPEmbeddings() - Initialize embedder
│       │   │   │   ├── clip_embd.embed_image() - Generate embeddings
│       │   │   │   └── sprite_features = np.array()
│       │   │   │
│       │   │   └── 📈 Similarity Computation
│       │   │       ├── Load: /outputs/embeddings.json
│       │   │       ├── np.matmul(sprite_matrix, img_matrix.T)
│       │   │       └── np.argmax(similarity, axis=1)
│       │   │
│       │   ├── 🎨 Asset Matching & Collection
│       │   │   │
│       │   │   ├── 🧙‍♂️ Sprite Assets Branch
│       │   │   │   ├── Match: /blocks/sprites/{matched_folder}/
│       │   │   │   ├── Load: sprite.json
│       │   │   │   ├── Copy: All files except matched image & sprite.json
│       │   │   │   └── Append to: project_data[]
│       │   │   │
│       │   │   └── 🌄 Backdrop Assets Branch (Parallel)
│       │   │       ├── Match: /blocks/Backdrops/{matched_folder}/
│       │   │       ├── Load: project.json
│       │   │       ├── Copy: All files except matched image & project.json
│       │   │       └── Extract: Stage targets → backdrop_data[]
│       │   │
│       │   └── 🏗️ Project Assembly
│       │       │
│       │       ├── 📋 JSON Structure Creation
│       │       │   ├── final_project = {
│       │       │   │   ├── "targets": []
│       │       │   │   ├── "monitors": []
│       │       │   │   ├── "extensions": []
│       │       │   │   └── "meta": {...}
│       │       │   └── }
│       │       │
│       │       ├── 🧙‍♂️ Sprite Integration
│       │       │   └── For sprite in project_data:
│       │       │       └── if not sprite.get("isStage"):
│       │       │           └── final_project["targets"].append(sprite)
│       │       │
│       │       ├── 🌄 Stage/Backdrop Integration
│       │       │   └── if backdrop_data:
│       │       │       ├── Merge: all_costumes.extend()
│       │       │       ├── Merge: sounds from first backdrop
│       │       │       └── Create: Stage target with merged assets
│       │       │
│       │       └── 💾 Final Output
│       │           ├── /outputs/project_{uuid}/project.json
│       │           └── Return: project_json_path
│
├── 📤 Response Generation
│   └── JSON Response:
│       ├── "message": "✅ PDF processed successfully"
│       ├── "output_json": extracted_sprites_path
│       ├── "sprites": sprite_metadata
│       ├── "project_output_json": final_project_path
│       └── "test_url": download_link
│
└── 📥 /download_sb3/{project_id} [GET] - Download Endpoint
    ├── Locate: /game_samples/{project_id}.sb3
    ├── Validate: File existence
    └── send_from_directory() - Serve .sb3 file
```

### Parallel Processing Branches

```
🔄 CONCURRENT OPERATIONS DURING PDF PROCESSING:

├── 🖼️ Image Processing Thread
│   ├── OpenCV Enhancement Pipeline
│   │   ├── upscale_image_cv() - 2x cubic interpolation
│   │   ├── reduce_noise_cv() - Non-local means denoising
│   │   ├── sharpen_cv() - Kernel-based sharpening
│   │   └── enhance_contrast_cv() - Contrast enhancement
│   │
│   └── Multi-Algorithm Similarity Matching
│       ├── DINOv2 Embeddings (Semantic)
│       ├── PHash (Perceptual Hashing)
│       └── Image Signatures (Goldberg Algorithm)

├── 🤖 AI Processing Thread
│   ├── SmolVLM Vision Model
│   │   ├── Image Captioning
│   │   └── Name Generation
│   │
│   └── Groq LLaMA Language Model
│       ├── OCR Text Refinement
│       ├── Pseudocode Generation
│       └── JSON Structure Validation

└── 💾 I/O Operations Thread
    ├── File System Operations
    │   ├── Directory Creation
    │   ├── Image Saving/Loading
    │   └── JSON Serialization
    │
    └── Asset Management
        ├── Reference Asset Loading
        ├── Project Asset Copying
        └── Final Project Assembly
```

### Data Flow Diagram

```
📊 DATA TRANSFORMATION PIPELINE:

PDF Bytes → Images → Enhanced Images → Embeddings → Similarities → Assets → .sb3
    ↓         ↓             ↓              ↓             ↓           ↓       ↓
[Binary] [PIL.Image]  [np.ndarray]   [np.float32]    [indices]    [JSON]   [ZIP]
    │         │             │              │             │           │       │
    ├─ OCR ───┼─ AI ────────┼─ Models ─────┼─ Search ────┼─ Match ───┼─ Build┤
    │         │             │              │             │           │       │
    └─ Text ──┴─ Metadata ──┴─ Features ───┴─ Ranking ───┴─ Select ──┴─ Pack ┘
```

### Key Processing Functions

### 1. Input Processing

- `extract_images_from_pdf()` - Extracts images from the PDF using the `unstructured` library
- `process_image_cv2_from_pil()` - Enhances images using OpenCV (upscaling, denoising, sharpening, contrast enhancement)

### 2. Visual Similarity Matching

```
Query Image → Multi-Algorithm Matching → Asset Selection → Project Assembly
```

**Algorithms Used:**

- **DINOv2 Embeddings**: Deep learning-based semantic similarity
- **Perceptual Hashing (PHash)**: Structural image comparison
- **Image Signatures**: Goldberg algorithm for visual fingerprinting

**Implementation:**

```python
def run_query_search_flow(query_b64, embeddings_dict, hash_dict, signature_obj_map):
    # Abridged excerpt: intermediate variables such as `query_from_b64`,
    # `prepped`, and `query_hash_arr` are not shown here.

    # 1. Preprocess query image
    enhanced_query_pil = process_image_cv2_from_pil(query_from_b64, scale=2)

    # 2. Generate embeddings, perceptual hash, and image signature
    query_emb = get_dinov2_embedding_from_pil(prepped)
    query_phash = phash.encode_image(image_array=query_hash_arr)
    query_sig = gis.generate_signature(query_sig_path)

    # 3. Compute similarities against each stored reference asset
    emb_sim = cosine_similarity(query_emb, stored_emb)
    ph_sim = 1.0 - (hamming_distance / MAX_PHASH_BITS)
    im_sim = 1.0 - gis.normalized_distance(stored_sig, query_sig)

    # 4. Combine scores (unweighted mean of the three similarities)
    combined = (emb_clamped + ph_sim + im_sim) / 3.0
```

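The three similarity terms above can be illustrated with a small standard-library sketch. It is not the project's implementation: it assumes a 64-bit perceptual hash encoded as a hex string (`MAX_PHASH_BITS = 64`), assumes that `emb_clamped` means the cosine score clamped to `[0, 1]`, and treats the image-signature similarity as a precomputed number.

```python
import math

MAX_PHASH_BITS = 64  # assumption: 64-bit perceptual hash


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # undefined for zero vectors; treat as "no similarity"
    return dot / (norm_a * norm_b)


def phash_similarity(hash_a, hash_b):
    """1 minus the normalized Hamming distance of two hex-encoded hashes."""
    hamming = bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")
    return 1.0 - hamming / MAX_PHASH_BITS


def combined_score(emb_sim, ph_sim, im_sim):
    """Unweighted mean after clamping the cosine score to [0, 1] (assumed)."""
    emb_clamped = max(0.0, min(1.0, emb_sim))
    return (emb_clamped + ph_sim + im_sim) / 3.0
```

Clamping matters because cosine similarity can be negative while the other two scores lie in `[0, 1]`; without it, a single strongly dissimilar embedding could push the mean below zero.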
### 3. Code Block Recognition

```
OCR Text → LLM Processing → Pseudocode → Block Mapping → JSON Generation
```

**LLM System Prompt:**

```python
# `\\n` keeps a literal backslash-n in the prompt instead of a real line break.
SYSTEM_PROMPT = """Your task is to process OCR-extracted text from images of Scratch 3.0 code blocks and produce precisely formatted pseudocode JSON.

### Core Role
- Treat this as an OCR refinement task: the input may contain typos or spacing issues.
- Intelligently correct OCR mistakes to align with valid Scratch 3.0 block syntax.

### Universal Rules
1. Code Detection: If no Scratch blocks are detected, the `pseudocode` value must be "No Code-blocks".
2. Script Ownership: Determine the target from "Script for:". If it matches a `Stage_costumes` name, set `name_variable` to "Stage".
3. Pseudocode Structure: The pseudocode must be a single JSON string with `\\n` for newlines.
"""
```

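On the consuming side, the rules in the prompt suggest a parser along the following lines. This is a sketch under assumptions: the keys `pseudocode` and `name_variable` come from the prompt, but the flat reply shape and the function name `parse_llm_reply` are hypothetical.

```python
import json


def parse_llm_reply(raw_reply):
    """Parse the model's JSON reply into (target_name, pseudocode_lines).

    Assumes a flat JSON object holding `name_variable` and `pseudocode`.
    """
    data = json.loads(raw_reply)
    target = data.get("name_variable", "")
    pseudocode = data.get("pseudocode", "No Code-blocks")
    if pseudocode == "No Code-blocks":
        return target, []  # sentinel from rule 1: nothing to convert
    # Rule 3: the script is one string with "\n" separating the lines.
    return target, [line for line in pseudocode.split("\n") if line.strip()]
```

Returning an empty list for the `"No Code-blocks"` sentinel lets later stages skip block generation without special-casing the string.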
### 4. Project Generation

```
Pseudocode → Block Definitions → Relationship Building → .sb3 Assembly
```

## Libraries and Dependencies

### Core Libraries

#### Computer Vision & Image Processing

- **OpenCV** (`cv2`): Image enhancement, filtering, and preprocessing
- **PIL/Pillow**: Image manipulation and format conversion
- **imagededup**: Perceptual hashing for duplicate detection
- **image-match**: Visual similarity using Goldberg signatures

#### Machine Learning & AI

- **transformers**: Hugging Face models (DINOv2, SmolVLM)
- **torch**: PyTorch for deep learning inference
- **sentence-transformers**: Text and image embeddings
- **faiss-cpu**: Fast similarity search and clustering
- **open_clip_torch**: OpenCLIP (an open-source implementation of CLIP) embeddings

#### Language Models

- **langchain**: LLM orchestration and chaining
- **langchain-groq**: Groq API integration
- **langgraph**: Graph-based agent workflows

#### Document Processing

- **unstructured**: PDF parsing and content extraction
- **pdf2image**: PDF to image conversion
- **pytesseract**: OCR text extraction
- **PyPDF2**: PDF manipulation

#### Web Framework

- **Flask**: Web application framework
- **Flask-SocketIO**: Real-time communication
- **gunicorn**: WSGI HTTP server

### Model Specifications

#### Vision Models

```python
from transformers import AutoImageProcessor, AutoModel, AutoModelForVision2Seq, AutoProcessor

# DINOv2 for semantic image understanding
DINOV2_MODEL = "facebook/dinov2-small"
dinov2_processor = AutoImageProcessor.from_pretrained(DINOV2_MODEL)
dinov2_model = AutoModel.from_pretrained(DINOV2_MODEL)

# SmolVLM for image captioning
smolvlm256m_processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-256M-Instruct")
smolvlm256m_model = AutoModelForVision2Seq.from_pretrained("HuggingFaceTB/SmolVLM-256M-Instruct")
```

#### Language Model

```python
from langchain_groq import ChatGroq

# Groq LLaMA for code interpretation (reads GROQ_API_KEY from the environment)
llm = ChatGroq(
    model="meta-llama/llama-4-scout-17b-16e-instruct",
    temperature=0,
    max_tokens=None,
)
```

## Technical Approaches

### 1. Multi-Stage Image Enhancement

**OpenCV Pipeline:**

```python
def process_image_cv2_from_pil(pil_img, scale=2):
    bgr = pil_to_bgr_np(pil_img)
    bgr = upscale_image_cv(bgr, scale=scale)  # Cubic interpolation
    bgr = reduce_noise_cv(bgr)                # Non-local means denoising
    bgr = sharpen_cv(bgr)                     # Kernel-based sharpening
    bgr = enhance_contrast_cv(bgr)            # Contrast enhancement
    return bgr_np_to_pil(bgr)
```

### 2. Hybrid Similarity Scoring

**Multi-Algorithm Consensus:**

```python
def choose_top_candidates(embedding_results, phash_results, imgmatch_results):
    # Abridged excerpt: each expression below is evaluated per candidate `p`.

    # Method A: Normalized weighted average
    weighted_scores[p] = (w_emb * emb_norm[p] + w_ph * ph_norm[p] + w_im * im_norm[p])

    # Method B: Rank-sum (Borda count); a lower sum means a better consensus rank
    rank_sum[p] = rank_emb[p] + rank_ph[p] + rank_im[p]

    # Method C: Harmonic mean (dominated by the lowest score, so it penalizes
    # candidates that any one algorithm scores poorly)
    harm = 3.0 / ((1.0/a) + (1.0/b) + (1.0/c))
```

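A runnable sketch of the three consensus methods, assuming each algorithm's results arrive as a dict mapping a candidate path to a similarity score. The weights, the min-max normalization, and the `eps` floor in the harmonic mean are illustrative assumptions, not values taken from the project.

```python
def min_max_normalize(scores):
    """Rescale scores to [0, 1]; identical scores all map to 1.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 1.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def ranks(scores):
    """Rank candidates from 1 (best) downwards by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {key: position + 1 for position, key in enumerate(ordered)}


def harmonic_mean(a, b, c, eps=1e-6):
    """Harmonic mean of three scores; eps avoids division by zero."""
    a, b, c = (max(value, eps) for value in (a, b, c))
    return 3.0 / (1.0 / a + 1.0 / b + 1.0 / c)


def consensus(emb, ph, im, w_emb=0.5, w_ph=0.3, w_im=0.2):
    """Return the three consensus scores per candidate (weights are assumed)."""
    emb_n, ph_n, im_n = (min_max_normalize(s) for s in (emb, ph, im))
    r_emb, r_ph, r_im = ranks(emb), ranks(ph), ranks(im)
    return {
        p: {
            "weighted": w_emb * emb_n[p] + w_ph * ph_n[p] + w_im * im_n[p],
            "rank_sum": r_emb[p] + r_ph[p] + r_im[p],  # lower is better
            "harmonic": harmonic_mean(emb[p], ph[p], im[p]),
        }
        for p in emb
    }
```

The sketch assumes all three dicts share the same keys. A candidate missing from one algorithm would need an explicit default, for example a score of 0, which the harmonic mean then punishes heavily.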
### 3. Block Relationship Building

**Scratch Block Catalog System:**

```python
def generate_blocks_from_opcodes(opcode_counts, all_block_definitions):
    """
    Generates Scratch blocks with proper parent-child relationships
    - Hat blocks: topLevel=True, parent=None
    - Stack blocks: Linked via 'next' field
    - C-blocks: Contains SUBSTACK inputs
    - Shadow blocks: Linked as input values
    """
```

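The relationships listed in the docstring can be sketched as follows. The helper names `link_stack` and `nest_in_c_block` are hypothetical and are not functions from `block_relation_builder.py`; the dictionary layout follows the Scratch 3.0 `project.json` block format, where an input value of `[2, block_id]` references a block without a shadow.

```python
def link_stack(opcodes, id_prefix="block"):
    """Chain opcodes into one script; the first block is the top-level one."""
    ids = [f"{id_prefix}_{index}" for index in range(len(opcodes))]
    blocks = {}
    for index, opcode in enumerate(opcodes):
        blocks[ids[index]] = {
            "opcode": opcode,
            "next": ids[index + 1] if index + 1 < len(ids) else None,
            "parent": ids[index - 1] if index > 0 else None,
            "inputs": {},
            "fields": {},
            "shadow": False,
            "topLevel": index == 0,
        }
    return blocks


def nest_in_c_block(blocks, c_block_id, first_child_id):
    """Attach a child stack to a C-block through its SUBSTACK input."""
    blocks[c_block_id]["inputs"]["SUBSTACK"] = [2, first_child_id]
    blocks[first_child_id]["parent"] = c_block_id
    blocks[first_child_id]["topLevel"] = False


# Usage: "when flag clicked" -> "forever" containing "move steps".
script = link_stack(["event_whenflagclicked", "control_forever"])
script.update(link_stack(["motion_movesteps"], id_prefix="body"))
nest_in_c_block(script, "block_1", "body_0")
```

Keeping the nested stack as its own chain and attaching it afterwards mirrors how Scratch stores C-blocks: the body is not linked through `next`, only through the `SUBSTACK` input and the child's `parent` field.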
### 4. Project Assembly Pipeline

**JSON Structure Generation:**

```python
final_project = {
    "targets": [],      # Sprites and Stage
    "monitors": [],     # Variable/list monitors
    "extensions": [],   # Scratch extensions
    "meta": {
        "semver": "3.0.0",
        "vm": "11.3.0",
        "agent": "OpenAI ScratchVision Agent"
    }
}
```

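An `.sb3` file is a ZIP archive with `project.json` and the asset files at its root, so the final packaging step can be sketched with the standard library alone. The function name `pack_sb3` and the flat directory layout are assumptions for illustration; the project's own packaging code may differ.

```python
import json
import zipfile
from pathlib import Path


def pack_sb3(project_dir, output_path):
    """Zip project.json and its assets into an .sb3 archive.

    `output_path` should live outside `project_dir`, otherwise the archive
    would try to include itself.
    """
    project_dir = Path(project_dir)
    if not (project_dir / "project.json").is_file():
        raise FileNotFoundError("project.json is required at the project root")
    with zipfile.ZipFile(output_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(project_dir.iterdir()):
            if path.is_file():
                # arcname keeps every file at the archive root.
                archive.write(path, arcname=path.name)
    return Path(output_path)
```

Writing with `arcname=path.name` matters: Scratch expects `project.json` at the top level of the archive, not nested under the project folder name.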
## File System Architecture

### Project Directory Structure

```
📁 scratch-vision-game/
├── 🐍 app.py  # Main Flask application (PRIMARY)
├── 📋 requirements.txt  # Python dependencies
├── 🐳 Dockerfile  # Container configuration
├── 📖 README.md  # Basic project info
├── 📖 README2.md  # Technical documentation
│
├── 📁 utils/  # Core processing utilities
│   └── 🔧 block_relation_builder.py  # Scratch block logic & JSON generation
│
├── 📁 blocks/  # Scratch block definitions & assets
│   ├── 📊 blocks.json  # Main block catalog
│   ├── 📊 boolean_blocks.json  # Boolean/condition blocks
│   ├── 📊 cap_blocks.json  # Terminal blocks (stop, delete clone)
│   ├── 📊 c_blocks.json  # Control flow blocks (if, repeat, forever)
│   ├── 📊 control_blocks.json  # Control category blocks
│   ├── 📊 data_blocks.json  # Variables and lists blocks
│   ├── 📊 event_blocks.json  # Event/trigger blocks
│   ├── 📊 hat_blocks.json  # Script starter blocks
│   ├── 📊 looks_blocks.json  # Appearance blocks
│   ├── 📊 motion_blocks.json  # Movement blocks
│   ├── 📊 operator_blocks.json  # Math and logic operators
│   ├── 📊 reporter_blocks.json  # Value reporter blocks
│   ├── 📊 sensing_blocks.json  # Sensor blocks
│   ├── 📊 sound_blocks.json  # Audio blocks
│   ├── 📊 stack_blocks.json  # Sequential action blocks
│   │
│   ├── 📁 sprites/  # Reference sprite assets
│   │   ├── 📁 {sprite_name}/
│   │   │   ├── 🖼️ {sprite_image}.png
│   │   │   ├── 📊 sprite.json  # Sprite definition
│   │   │   └── 🎵 {sounds}.wav
│   │   └── ...
│   │
│   ├── 📁 Backdrops/  # Reference backdrop assets
│   │   ├── 📁 {backdrop_name}/
│   │   │   ├── 🖼️ {backdrop_image}.png
│   │   │   ├── 📊 project.json  # Stage definition
│   │   │   └── 🎵 {sounds}.wav
│   │   └── ...
│   │
│   └── 📁 sound/  # Audio assets library
│       └── 🎵 *.wav
│
├── 📁 templates/  # Flask HTML templates
│   └── 🌐 *.html
│
├── 📁 static/  # Web static assets
│   ├── 🎨 css/
│   ├── 📜 js/
│   └── 🖼️ images/
│
├── 📁 game_samples/  # Pre-built .sb3 files
│   └── 🎮 *.sb3
│
├── 📁 generated_projects/  # Runtime generated projects
│   └── 📁 project_{uuid}/
│       ├── 📊 project.json
│       ├── 🖼️ *.png
│       └── 🎵 *.wav
│
└── 📁 outputs/  # Processing outputs (Runtime)
    ├── 📁 DETECTED_IMAGE/  # Extracted & processed images
    │   └── 📁 {pdf_name}/
    │       └── 🖼️ Sprite_*.png
    │
    ├── 📁 SCANNED_IMAGE/  # Original scanned images
    │
    ├── 📁 EXTRACTED_JSON/  # Intermediate JSON data
    │   └── 📁 {pdf_name}/
    │       ├── 📊 extracted.json  # Raw PDF extraction
    │       └── 📊 extracted_sprites.json  # AI-processed sprites
    │
    └── 📊 embeddings.json  # Pre-computed embeddings cache
```

### Runtime Directory Creation Flow

```
🏗️ DYNAMIC DIRECTORY CREATION:

User Upload  → PDF Processing    →  Directory Structure
│              │                    │
├─ temp_dir ───┼─ pdf_filename ─────┼─ /outputs/DETECTED_IMAGE/{pdf_name}/
│              │                    ├─ /outputs/EXTRACTED_JSON/{pdf_name}/
│              │                    └─ /generated_projects/project_{uuid}/
│              │
└─ secure_filename() ──────────────────→ Sanitized paths
```

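The directory flow above could be reproduced roughly as shown below. `safe_stem` is a simplified, hypothetical stand-in for Werkzeug's `secure_filename()` (which the application itself uses), and `make_run_dirs` and its `base_dir` parameter are illustrative names.

```python
import re
import uuid
from pathlib import Path


def safe_stem(filename):
    """Reduce an uploaded filename to a safe directory name (simplified)."""
    stem = Path(filename).stem  # drops directories and the extension
    cleaned = re.sub(r"[^A-Za-z0-9_.-]+", "_", stem).strip("._")
    return cleaned or "upload"


def make_run_dirs(base_dir, pdf_filename):
    """Create the per-upload output directories and return their paths."""
    base = Path(base_dir)
    name = safe_stem(pdf_filename)
    dirs = {
        "images": base / "outputs" / "DETECTED_IMAGE" / name,
        "json": base / "outputs" / "EXTRACTED_JSON" / name,
        "project": base / "generated_projects" / f"project_{uuid.uuid4().hex}",
    }
    for directory in dirs.values():
        directory.mkdir(parents=True, exist_ok=True)
    return dirs
```

Sanitizing before building paths is the important ordering: the uploaded name never reaches the file system unfiltered, so a name such as `../../etc/passwd.pdf` collapses to a plain directory name.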
### Data Persistence Locations

```
💾 PERSISTENT DATA STORAGE:

├── 🔄 Input Processing
│   ├── /tmp/{random}/ - Temporary PDF storage
│   ├── /outputs/DETECTED_IMAGE/ - Extracted sprite images
│   ├── /outputs/EXTRACTED_JSON/ - Processing metadata
│   └── /outputs/embeddings.json - Similarity search cache
│
├── 🎯 Asset Matching
│   ├── /blocks/sprites/ - Reference sprite library
│   ├── /blocks/Backdrops/ - Reference backdrop library
│   └── /blocks/*.json - Block definition catalogs
│
└── 🎮 Final Output
    ├── /generated_projects/project_{uuid}/ - Assembled project
    ├── /game_samples/{project_id}.sb3 - Downloadable Scratch file
    └── /logs/app.log - Application logs
```

## API Endpoints

### `/process_pdf` (POST)

Processes an uploaded PDF file containing Scratch code blocks.

**Request:**

```
Content-Type: multipart/form-data
pdf_file: <PDF file>
```

**Response:**

```json
{
  "message": "✅ PDF processed successfully",
  "output_json": "path/to/extracted.json",
  "sprites": {...},
  "project_output_json": "path/to/project.json"
}
```

### `/download_sb3/<project_id>` (GET)

Downloads a generated Scratch 3.0 project file. The handler looks for `game_samples/{project_id}.sb3`, checks that the file exists, and serves it with `send_from_directory()`.

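A client call to `/process_pdf` might look like the sketch below, which uses only the standard library. The form field name `pdf_file` comes from the request description; the helper names and the base URL (for example `http://localhost:7860`, matching the port exposed in the Dockerfile) are assumptions.

```python
import json
import urllib.request
import uuid


def build_multipart(field_name, filename, payload):
    """Encode one file as a multipart/form-data body; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    disposition = (
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
    )
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        disposition.encode(),
        b"Content-Type: application/pdf\r\n\r\n",
        payload,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return body, f"multipart/form-data; boundary={boundary}"


def process_pdf(base_url, pdf_path):
    """POST a PDF to /process_pdf and return the decoded JSON response."""
    with open(pdf_path, "rb") as handle:
        body, content_type = build_multipart("pdf_file", "upload.pdf", handle.read())
    request = urllib.request.Request(
        f"{base_url}/process_pdf",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Given the processing times reported for multi-image PDFs, a real client would probably also pass a generous `timeout` to `urlopen`.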
## Processing Timeline & Performance

### Execution Timeline Tree

```
⏱️ PROCESSING TIMELINE (Typical PDF with 5 images):

📤 User Upload (0.0s)
│
├── 🔍 PDF Validation (0.1s)
│   └── File security & temp storage
│
├── 📄 PDF Extraction (2-5s)
│   ├── partition_pdf() - Unstructured processing
│   ├── Image extraction & base64 encoding
│   └── extracted.json creation
│
├── 🤖 AI Processing (10-15s per image)
│   ├── 📝 Description Generation (5-7s)
│   │   ├── LangGraph agent initialization
│   │   ├── Groq API call
│   │   └── Response processing
│   │
│   ├── 🏷️ Name Generation (5-7s)
│   │   ├── Second LangGraph agent call
│   │   ├── Groq API call
│   │   └── Response processing
│   │
│   └── 📋 Metadata Assembly (0.1s)
│       └── JSON structure creation
│
├── 🔍 Similarity Matching (3-8s)
│   ├── 🎯 Image Decoding (0.5s)
│   ├── 🧠 CLIP Embeddings (2-3s)
│   ├── 📈 Similarity Computation (0.5s)
│   └── 🎨 Asset Matching (2-4s)
│
├── 🏗️ Project Assembly (1-2s)
│   ├── JSON merging
│   ├── Asset copying
│   └── Final project creation
│
└── 📤 Response Generation (0.1s)
    └── JSON response formatting

TOTAL: ~60-90 seconds for 5-image PDF
```

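The stated total can be checked by adding up the stage ranges from the timeline. The sketch assumes the per-image AI step runs sequentially for each image, which is what the "10-15s per image" figure implies; the names `STAGES`, `AI_PER_IMAGE`, and `estimate_total` are illustrative.

```python
# (min_seconds, max_seconds) per stage, copied from the timeline above.
STAGES = {
    "pdf_validation": (0.1, 0.1),
    "pdf_extraction": (2.0, 5.0),
    "similarity_matching": (3.0, 8.0),
    "project_assembly": (1.0, 2.0),
    "response_generation": (0.1, 0.1),
}
AI_PER_IMAGE = (10.0, 15.0)  # description + name generation, per image


def estimate_total(num_images):
    """Return (low, high) end-to-end estimates in seconds."""
    low = sum(lo for lo, _ in STAGES.values()) + num_images * AI_PER_IMAGE[0]
    high = sum(hi for _, hi in STAGES.values()) + num_images * AI_PER_IMAGE[1]
    return low, high
```

For five images this gives roughly 56 to 90 seconds, consistent with the "~60-90 seconds" figure, and it shows that the two LLM calls per image account for most of the total.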
### Performance Bottlenecks & Optimizations

```
🚀 PERFORMANCE OPTIMIZATION STRATEGIES:

├── 🧠 Model Loading (Startup Cost)
│   ├── ✅ Pre-loaded global models
│   │   ├── DINOv2: ~2GB VRAM
│   │   ├── SmolVLM: ~1GB VRAM
│   │   └── CLIP: ~500MB VRAM
│   │
│   ├── ✅ GPU Acceleration (when available)
│   │   └── torch.device("cuda" if torch.cuda.is_available() else "cpu")
│   │
│   └── ✅ CPU Optimization
│       └── torch.set_num_threads(4)
│
├── 🖼️ Image Processing Pipeline
│   ├── ✅ Efficient NumPy Operations
│   │   ├── Vectorized computations
│   │   ├── In-place operations where possible
│   │   └── Memory-mapped file access
│   │
│   ├── ✅ OpenCV Optimizations
│   │   ├── Multi-threaded operations
│   │   ├── SIMD instructions
│   │   └── Optimized algorithms
│   │
│   └── ✅ Memory Management
│       ├── Garbage collection hints
│       ├── Temporary file cleanup
│       └── Buffer reuse
│
├── 🔍 Similarity Search Acceleration
│   ├── ✅ Pre-computed Embeddings Cache
│   │   └── /outputs/embeddings.json (persistent)
│   │
│   ├── ✅ Normalized Embeddings
│   │   ├── Cosine similarity via dot product
│   │   └── L2 normalization preprocessing
│   │
│   └── ✅ Parallel Algorithm Execution
│       ├── DINOv2, PHash, ImageMatch concurrent
│       └── Multi-threaded similarity computation
│
└── 🌐 API & I/O Optimizations
    ├── ✅ Async File Operations
    ├── ✅ Streaming Responses
    ├── ✅ Connection Pooling
    └── ✅ Compression (gzip)
```

### Memory Usage Profile

```
💾 MEMORY CONSUMPTION BREAKDOWN:

├── 🧠 AI Models (Peak: ~4GB)
│   ├── DINOv2 Model: ~2GB
│   ├── SmolVLM Model: ~1GB
│   ├── CLIP Embeddings: ~500MB
│   └── Groq API Client: ~100MB
│
├── 🖼️ Image Processing (Peak: ~500MB per image)
│   ├── Original PIL Images: ~50MB each
│   ├── Enhanced Images: ~100MB each
│   ├── OpenCV Buffers: ~200MB each
│   └── Embedding Vectors: ~2KB each
│
├── 📊 Data Structures (Peak: ~200MB)
│   ├── Block Definitions: ~50MB
│   ├── Asset Metadata: ~100MB
│   ├── Similarity Matrices: ~50MB
│   └── JSON Structures: ~10MB
│
└── 🌐 Web Framework (Baseline: ~100MB)
    ├── Flask Application: ~50MB
    ├── Request Buffers: ~30MB
    └── Response Caching: ~20MB

TOTAL PEAK: ~5GB (with GPU models loaded)
TOTAL BASELINE: ~1GB (CPU-only, no active processing)
```

### Performance Optimizations

#### 1. Model Caching

- Pre-loaded models with global variables
- GPU acceleration when available
- Batch processing for multiple images

#### 2. Image Processing

- Efficient NumPy operations
- OpenCV optimizations
- Memory management for large images

#### 3. Similarity Search

- FAISS indexing for fast nearest neighbor search
- Normalized embeddings for cosine similarity
- Parallel processing of multiple algorithms

## Error Handling

### 1. Graceful Degradation

```python
def process_image_cv2_from_pil(pil_img, scale=2):
    try:
        # OpenCV enhancement pipeline (upscale, denoise, sharpen, contrast)
        return enhanced_image
    except Exception as e:
        print(f"Enhancement failed: {e}")
        return pil_img  # Fall back to the original, unenhanced image
```

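The same fall-back-to-input pattern can be factored into a reusable decorator. This is a sketch rather than code from `app.py`: the decorator name `fallback_to_input`, the logger name, and the placeholder `enhance` function are all hypothetical.

```python
import functools
import logging

logger = logging.getLogger("scratch_vision")  # hypothetical logger name


def fallback_to_input(func):
    """Return the first argument unchanged if the wrapped step raises."""
    @functools.wraps(func)
    def wrapper(first_arg, *args, **kwargs):
        try:
            return func(first_arg, *args, **kwargs)
        except Exception:
            # logger.exception records the traceback, unlike a bare print().
            logger.exception("%s failed; returning input unchanged", func.__name__)
            return first_arg
    return wrapper


@fallback_to_input
def enhance(image, scale=2):
    """Placeholder for a real enhancement step."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    return f"enhanced({image}, x{scale})"
```

Calling `enhance("img", scale=0)` logs the error and returns `"img"`, so a failed enhancement degrades match quality instead of aborting the whole request.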
### 2. JSON Validation

```python
from langgraph.prebuilt import create_react_agent

# A second agent repairs malformed JSON produced by the first pass.
agent_json_resolver = create_react_agent(
    model=llm,
    tools=[],  # assumed: the corrector only rewrites JSON and calls no tools
    prompt=SYSTEM_PROMPT_JSON_CORRECTOR
)
```

## Deployment

### Docker Configuration

```dockerfile
FROM python:3.11-slim
# System dependencies: tesseract-ocr, poppler-utils, libgl1
# Python dependencies: requirements.txt
# Environment: Flask production mode
EXPOSE 7860
CMD ["python", "app.py"]
```

### Environment Variables

- `GROQ_API_KEY`: API key for the Groq language model
- `TRANSFORMERS_CACHE`: Model cache directory (recent `transformers` releases deprecate this variable in favor of `HF_HOME`)
- `HF_HOME`: Hugging Face cache directory

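A startup check for these variables could look like the following sketch. The function name `load_settings` and the choice to fail fast when `GROQ_API_KEY` is missing are assumptions, not documented behavior of the application.

```python
import os


def load_settings(environ=None):
    """Read the documented environment variables into a settings dict."""
    env = os.environ if environ is None else environ
    api_key = env.get("GROQ_API_KEY")
    if not api_key:
        # Assumed policy: fail at startup rather than on the first LLM call.
        raise RuntimeError("GROQ_API_KEY is not set")
    return {
        "groq_api_key": api_key,
        "hf_home": env.get("HF_HOME"),  # None lets the library use its default
        "transformers_cache": env.get("TRANSFORMERS_CACHE"),
    }
```

Accepting an optional `environ` mapping keeps the function easy to exercise in tests without touching the real process environment.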
## Future Enhancements

1. **Real-time Processing**: WebSocket integration for live feedback
2. **Advanced OCR**: Custom trained models for Scratch block recognition
3. **Multi-language Support**: International Scratch block recognition
4. **Collaborative Features**: Multi-user project editing
5. **Performance Monitoring**: Detailed analytics and optimization metrics

## Contributing

The system is designed with modularity in mind:

- Add new block definitions in the `blocks/` directory
- Extend similarity algorithms in the matching pipeline
- Enhance OCR accuracy with custom preprocessing
- Improve LLM prompts for better code interpretation

## License

Apache 2.0 License - See project repository for full details.