orasul committed
Commit 6ff22d6 · 1 Parent(s): eb1b90d

Load initial app

Files changed (16)
  1. .DS_Store +0 -0
  2. .gitattributes +1 -0
  3. .gitignore +5 -0
  4. EDSR_x4.pb +3 -0
  5. LICENSE +674 -0
  6. README.md +2 -2
  7. app.py +318 -3
  8. best.pt +3 -0
  9. icon-image-detection-model.keras +3 -0
  10. main.py +472 -0
  11. requirements.txt +15 -0
  12. script.py +808 -0
  13. utils/json_helpers.py +18 -0
  14. utils/pills.py +44 -0
  15. wrapper.py +140 -0
  16. yolo_script.py +185 -0
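
Several of these files (EDSR_x4.pb, best.pt, icon-image-detection-model.keras) are large binaries stored via Git LFS, so a plain archive download yields only pointer files. A hedged sketch of fetching the full contents with huggingface_hub, assuming the Space id is "orasul/deki" (inferred from the committer name and the README title, not stated in the commit):

from huggingface_hub import snapshot_download

# Downloads all Space files, resolving LFS-tracked weights to their real contents.
local_dir = snapshot_download(repo_id="orasul/deki", repo_type="space")
print(local_dir)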
.DS_Store ADDED
Binary file (6.15 kB).
 
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ *.keras filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,5 @@
+ __pycache__/
+ *.pyc
+ .venv/
+ venv/
+
EDSR_x4.pb ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dd35ce3cae53ecee2d16045e08a932c3e7242d641bb65cb971d123e06904347f
+ size 38573255
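
These three lines are not the model itself but a Git LFS pointer: oid is the SHA-256 of the real file and size its byte count. A minimal sketch (not part of the commit) for checking that a downloaded EDSR_x4.pb matches the pointer above:

import hashlib
import os

def verify_lfs_pointer(file_path: str, expected_oid: str, expected_size: int) -> bool:
    """Check a downloaded file against the sha256/size recorded in its LFS pointer."""
    if os.path.getsize(file_path) != expected_size:
        return False
    h = hashlib.sha256()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == expected_oid

# Values copied from the pointer above.
print(verify_lfs_pointer(
    "EDSR_x4.pb",
    "dd35ce3cae53ecee2d16045e08a932c3e7242d641bb65cb971d123e06904347f",
    38573255,
))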
LICENSE ADDED
@@ -0,0 +1,674 @@
+ GNU GENERAL PUBLIC LICENSE
+ Version 3, 29 June 2007
+
+ Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/>
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+
+ Preamble
+
+ The GNU General Public License is a free, copyleft license for
+ software and other kinds of works.
+
+ The licenses for most software and other practical works are designed
+ to take away your freedom to share and change the works. By contrast,
+ the GNU General Public License is intended to guarantee your freedom to
+ share and change all versions of a program--to make sure it remains free
+ software for all its users. We, the Free Software Foundation, use the
+ GNU General Public License for most of our software; it applies also to
+ any other work released this way by its authors. You can apply it to
+ your programs, too.
+
+ When we speak of free software, we are referring to freedom, not
+ price. Our General Public Licenses are designed to make sure that you
+ have the freedom to distribute copies of free software (and charge for
+ them if you wish), that you receive source code or can get it if you
+ want it, that you can change the software or use pieces of it in new
+ free programs, and that you know you can do these things.
+
+ To protect your rights, we need to prevent others from denying you
+ these rights or asking you to surrender the rights. Therefore, you have
+ certain responsibilities if you distribute copies of the software, or if
+ you modify it: responsibilities to respect the freedom of others.
+
+ For example, if you distribute copies of such a program, whether
+ gratis or for a fee, you must pass on to the recipients the same
+ freedoms that you received. You must make sure that they, too, receive
+ or can get the source code. And you must show them these terms so they
+ know their rights.
+
+ Developers that use the GNU GPL protect your rights with two steps:
+ (1) assert copyright on the software, and (2) offer you this License
+ giving you legal permission to copy, distribute and/or modify it.
+
+ For the developers' and authors' protection, the GPL clearly explains
+ that there is no warranty for this free software. For both users' and
+ authors' sake, the GPL requires that modified versions be marked as
+ changed, so that their problems will not be attributed erroneously to
+ authors of previous versions.
+
+ Some devices are designed to deny users access to install or run
+ modified versions of the software inside them, although the manufacturer
+ can do so. This is fundamentally incompatible with the aim of
+ protecting users' freedom to change the software. The systematic
+ pattern of such abuse occurs in the area of products for individuals to
+ use, which is precisely where it is most unacceptable. Therefore, we
+ have designed this version of the GPL to prohibit the practice for those
+ products. If such problems arise substantially in other domains, we
+ stand ready to extend this provision to those domains in future versions
+ of the GPL, as needed to protect the freedom of users.
+
+ Finally, every program is threatened constantly by software patents.
+ States should not allow patents to restrict development and use of
+ software on general-purpose computers, but in those that do, we wish to
+ avoid the special danger that patents applied to a free program could
+ make it effectively proprietary. To prevent this, the GPL assures that
+ patents cannot be used to render the program non-free.
+
+ The precise terms and conditions for copying, distribution and
+ modification follow.
+
+ TERMS AND CONDITIONS
+
+ 0. Definitions.
+
+ "This License" refers to version 3 of the GNU General Public License.
+
+ "Copyright" also means copyright-like laws that apply to other kinds of
+ works, such as semiconductor masks.
+
+ "The Program" refers to any copyrightable work licensed under this
+ License. Each licensee is addressed as "you". "Licensees" and
+ "recipients" may be individuals or organizations.
+
+ To "modify" a work means to copy from or adapt all or part of the work
+ in a fashion requiring copyright permission, other than the making of an
+ exact copy. The resulting work is called a "modified version" of the
+ earlier work or a work "based on" the earlier work.
+
+ A "covered work" means either the unmodified Program or a work based
+ on the Program.
+
+ To "propagate" a work means to do anything with it that, without
+ permission, would make you directly or secondarily liable for
+ infringement under applicable copyright law, except executing it on a
+ computer or modifying a private copy. Propagation includes copying,
+ distribution (with or without modification), making available to the
+ public, and in some countries other activities as well.
+
+ To "convey" a work means any kind of propagation that enables other
+ parties to make or receive copies. Mere interaction with a user through
+ a computer network, with no transfer of a copy, is not conveying.
+
+ An interactive user interface displays "Appropriate Legal Notices"
+ to the extent that it includes a convenient and prominently visible
+ feature that (1) displays an appropriate copyright notice, and (2)
+ tells the user that there is no warranty for the work (except to the
+ extent that warranties are provided), that licensees may convey the
+ work under this License, and how to view a copy of this License. If
+ the interface presents a list of user commands or options, such as a
+ menu, a prominent item in the list meets this criterion.
+
+ 1. Source Code.
+
+ The "source code" for a work means the preferred form of the work
+ for making modifications to it. "Object code" means any non-source
+ form of a work.
+
+ A "Standard Interface" means an interface that either is an official
+ standard defined by a recognized standards body, or, in the case of
+ interfaces specified for a particular programming language, one that
+ is widely used among developers working in that language.
+
+ The "System Libraries" of an executable work include anything, other
+ than the work as a whole, that (a) is included in the normal form of
+ packaging a Major Component, but which is not part of that Major
+ Component, and (b) serves only to enable use of the work with that
+ Major Component, or to implement a Standard Interface for which an
+ implementation is available to the public in source code form. A
+ "Major Component", in this context, means a major essential component
+ (kernel, window system, and so on) of the specific operating system
+ (if any) on which the executable work runs, or a compiler used to
+ produce the work, or an object code interpreter used to run it.
+
+ The "Corresponding Source" for a work in object code form means all
+ the source code needed to generate, install, and (for an executable
+ work) run the object code and to modify the work, including scripts to
+ control those activities. However, it does not include the work's
+ System Libraries, or general-purpose tools or generally available free
+ programs which are used unmodified in performing those activities but
+ which are not part of the work. For example, Corresponding Source
+ includes interface definition files associated with source files for
+ the work, and the source code for shared libraries and dynamically
+ linked subprograms that the work is specifically designed to require,
+ such as by intimate data communication or control flow between those
+ subprograms and other parts of the work.
+
+ The Corresponding Source need not include anything that users
+ can regenerate automatically from other parts of the Corresponding
+ Source.
+
+ The Corresponding Source for a work in source code form is that
+ same work.
+
+ 2. Basic Permissions.
+
+ All rights granted under this License are granted for the term of
+ copyright on the Program, and are irrevocable provided the stated
+ conditions are met. This License explicitly affirms your unlimited
+ permission to run the unmodified Program. The output from running a
+ covered work is covered by this License only if the output, given its
+ content, constitutes a covered work. This License acknowledges your
+ rights of fair use or other equivalent, as provided by copyright law.
+
+ You may make, run and propagate covered works that you do not
+ convey, without conditions so long as your license otherwise remains
+ in force. You may convey covered works to others for the sole purpose
+ of having them make modifications exclusively for you, or provide you
+ with facilities for running those works, provided that you comply with
+ the terms of this License in conveying all material for which you do
+ not control copyright. Those thus making or running the covered works
+ for you must do so exclusively on your behalf, under your direction
+ and control, on terms that prohibit them from making any copies of
+ your copyrighted material outside their relationship with you.
+
+ Conveying under any other circumstances is permitted solely under
+ the conditions stated below. Sublicensing is not allowed; section 10
+ makes it unnecessary.
+
+ 3. Protecting Users' Legal Rights From Anti-Circumvention Law.
+
+ No covered work shall be deemed part of an effective technological
+ measure under any applicable law fulfilling obligations under article
+ 11 of the WIPO copyright treaty adopted on 20 December 1996, or
+ similar laws prohibiting or restricting circumvention of such
+ measures.
+
+ When you convey a covered work, you waive any legal power to forbid
+ circumvention of technological measures to the extent such circumvention
+ is effected by exercising rights under this License with respect to
+ the covered work, and you disclaim any intention to limit operation or
+ modification of the work as a means of enforcing, against the work's
+ users, your or third parties' legal rights to forbid circumvention of
+ technological measures.
+
+ 4. Conveying Verbatim Copies.
+
+ You may convey verbatim copies of the Program's source code as you
+ receive it, in any medium, provided that you conspicuously and
+ appropriately publish on each copy an appropriate copyright notice;
+ keep intact all notices stating that this License and any
+ non-permissive terms added in accord with section 7 apply to the code;
+ keep intact all notices of the absence of any warranty; and give all
+ recipients a copy of this License along with the Program.
+
+ You may charge any price or no price for each copy that you convey,
+ and you may offer support or warranty protection for a fee.
+
+ 5. Conveying Modified Source Versions.
+
+ You may convey a work based on the Program, or the modifications to
+ produce it from the Program, in the form of source code under the
+ terms of section 4, provided that you also meet all of these conditions:
+
+ a) The work must carry prominent notices stating that you modified
+ it, and giving a relevant date.
+
+ b) The work must carry prominent notices stating that it is
+ released under this License and any conditions added under section
+ 7. This requirement modifies the requirement in section 4 to
+ "keep intact all notices".
+
+ c) You must license the entire work, as a whole, under this
+ License to anyone who comes into possession of a copy. This
+ License will therefore apply, along with any applicable section 7
+ additional terms, to the whole of the work, and all its parts,
+ regardless of how they are packaged. This License gives no
+ permission to license the work in any other way, but it does not
+ invalidate such permission if you have separately received it.
+
+ d) If the work has interactive user interfaces, each must display
+ Appropriate Legal Notices; however, if the Program has interactive
+ interfaces that do not display Appropriate Legal Notices, your
+ work need not make them do so.
+
+ A compilation of a covered work with other separate and independent
+ works, which are not by their nature extensions of the covered work,
+ and which are not combined with it such as to form a larger program,
+ in or on a volume of a storage or distribution medium, is called an
+ "aggregate" if the compilation and its resulting copyright are not
+ used to limit the access or legal rights of the compilation's users
+ beyond what the individual works permit. Inclusion of a covered work
+ in an aggregate does not cause this License to apply to the other
+ parts of the aggregate.
+
+ 6. Conveying Non-Source Forms.
+
+ You may convey a covered work in object code form under the terms
+ of sections 4 and 5, provided that you also convey the
+ machine-readable Corresponding Source under the terms of this License,
+ in one of these ways:
+
+ a) Convey the object code in, or embodied in, a physical product
+ (including a physical distribution medium), accompanied by the
+ Corresponding Source fixed on a durable physical medium
+ customarily used for software interchange.
+
+ b) Convey the object code in, or embodied in, a physical product
+ (including a physical distribution medium), accompanied by a
+ written offer, valid for at least three years and valid for as
+ long as you offer spare parts or customer support for that product
+ model, to give anyone who possesses the object code either (1) a
+ copy of the Corresponding Source for all the software in the
+ product that is covered by this License, on a durable physical
+ medium customarily used for software interchange, for a price no
+ more than your reasonable cost of physically performing this
+ conveying of source, or (2) access to copy the
+ Corresponding Source from a network server at no charge.
+
+ c) Convey individual copies of the object code with a copy of the
+ written offer to provide the Corresponding Source. This
+ alternative is allowed only occasionally and noncommercially, and
+ only if you received the object code with such an offer, in accord
+ with subsection 6b.
+
+ d) Convey the object code by offering access from a designated
+ place (gratis or for a charge), and offer equivalent access to the
+ Corresponding Source in the same way through the same place at no
+ further charge. You need not require recipients to copy the
+ Corresponding Source along with the object code. If the place to
+ copy the object code is a network server, the Corresponding Source
+ may be on a different server (operated by you or a third party)
+ that supports equivalent copying facilities, provided you maintain
+ clear directions next to the object code saying where to find the
+ Corresponding Source. Regardless of what server hosts the
+ Corresponding Source, you remain obligated to ensure that it is
+ available for as long as needed to satisfy these requirements.
+
+ e) Convey the object code using peer-to-peer transmission, provided
+ you inform other peers where the object code and Corresponding
+ Source of the work are being offered to the general public at no
+ charge under subsection 6d.
+
+ A separable portion of the object code, whose source code is excluded
+ from the Corresponding Source as a System Library, need not be
+ included in conveying the object code work.
+
+ A "User Product" is either (1) a "consumer product", which means any
+ tangible personal property which is normally used for personal, family,
+ or household purposes, or (2) anything designed or sold for incorporation
+ into a dwelling. In determining whether a product is a consumer product,
+ doubtful cases shall be resolved in favor of coverage. For a particular
+ product received by a particular user, "normally used" refers to a
+ typical or common use of that class of product, regardless of the status
+ of the particular user or of the way in which the particular user
+ actually uses, or expects or is expected to use, the product. A product
+ is a consumer product regardless of whether the product has substantial
+ commercial, industrial or non-consumer uses, unless such uses represent
+ the only significant mode of use of the product.
+
+ "Installation Information" for a User Product means any methods,
+ procedures, authorization keys, or other information required to install
+ and execute modified versions of a covered work in that User Product from
+ a modified version of its Corresponding Source. The information must
+ suffice to ensure that the continued functioning of the modified object
+ code is in no case prevented or interfered with solely because
+ modification has been made.
+
+ If you convey an object code work under this section in, or with, or
+ specifically for use in, a User Product, and the conveying occurs as
+ part of a transaction in which the right of possession and use of the
+ User Product is transferred to the recipient in perpetuity or for a
+ fixed term (regardless of how the transaction is characterized), the
+ Corresponding Source conveyed under this section must be accompanied
+ by the Installation Information. But this requirement does not apply
+ if neither you nor any third party retains the ability to install
+ modified object code on the User Product (for example, the work has
+ been installed in ROM).
+
+ The requirement to provide Installation Information does not include a
+ requirement to continue to provide support service, warranty, or updates
+ for a work that has been modified or installed by the recipient, or for
+ the User Product in which it has been modified or installed. Access to a
+ network may be denied when the modification itself materially and
+ adversely affects the operation of the network or violates the rules and
+ protocols for communication across the network.
+
+ Corresponding Source conveyed, and Installation Information provided,
+ in accord with this section must be in a format that is publicly
+ documented (and with an implementation available to the public in
+ source code form), and must require no special password or key for
+ unpacking, reading or copying.
+
+ 7. Additional Terms.
+
+ "Additional permissions" are terms that supplement the terms of this
+ License by making exceptions from one or more of its conditions.
+ Additional permissions that are applicable to the entire Program shall
+ be treated as though they were included in this License, to the extent
+ that they are valid under applicable law. If additional permissions
+ apply only to part of the Program, that part may be used separately
+ under those permissions, but the entire Program remains governed by
+ this License without regard to the additional permissions.
+
+ When you convey a copy of a covered work, you may at your option
+ remove any additional permissions from that copy, or from any part of
+ it. (Additional permissions may be written to require their own
+ removal in certain cases when you modify the work.) You may place
+ additional permissions on material, added by you to a covered work,
+ for which you have or can give appropriate copyright permission.
+
+ Notwithstanding any other provision of this License, for material you
+ add to a covered work, you may (if authorized by the copyright holders of
+ that material) supplement the terms of this License with terms:
+
+ a) Disclaiming warranty or limiting liability differently from the
+ terms of sections 15 and 16 of this License; or
+
+ b) Requiring preservation of specified reasonable legal notices or
+ author attributions in that material or in the Appropriate Legal
+ Notices displayed by works containing it; or
+
+ c) Prohibiting misrepresentation of the origin of that material, or
+ requiring that modified versions of such material be marked in
+ reasonable ways as different from the original version; or
+
+ d) Limiting the use for publicity purposes of names of licensors or
+ authors of the material; or
+
+ e) Declining to grant rights under trademark law for use of some
+ trade names, trademarks, or service marks; or
+
+ f) Requiring indemnification of licensors and authors of that
+ material by anyone who conveys the material (or modified versions of
+ it) with contractual assumptions of liability to the recipient, for
+ any liability that these contractual assumptions directly impose on
+ those licensors and authors.
+
+ All other non-permissive additional terms are considered "further
+ restrictions" within the meaning of section 10. If the Program as you
+ received it, or any part of it, contains a notice stating that it is
+ governed by this License along with a term that is a further
+ restriction, you may remove that term. If a license document contains
+ a further restriction but permits relicensing or conveying under this
+ License, you may add to a covered work material governed by the terms
+ of that license document, provided that the further restriction does
+ not survive such relicensing or conveying.
+
+ If you add terms to a covered work in accord with this section, you
+ must place, in the relevant source files, a statement of the
+ additional terms that apply to those files, or a notice indicating
+ where to find the applicable terms.
+
+ Additional terms, permissive or non-permissive, may be stated in the
+ form of a separately written license, or stated as exceptions;
+ the above requirements apply either way.
+
+ 8. Termination.
+
+ You may not propagate or modify a covered work except as expressly
+ provided under this License. Any attempt otherwise to propagate or
+ modify it is void, and will automatically terminate your rights under
+ this License (including any patent licenses granted under the third
+ paragraph of section 11).
+
+ However, if you cease all violation of this License, then your
+ license from a particular copyright holder is reinstated (a)
+ provisionally, unless and until the copyright holder explicitly and
+ finally terminates your license, and (b) permanently, if the copyright
+ holder fails to notify you of the violation by some reasonable means
+ prior to 60 days after the cessation.
+
+ Moreover, your license from a particular copyright holder is
+ reinstated permanently if the copyright holder notifies you of the
+ violation by some reasonable means, this is the first time you have
+ received notice of violation of this License (for any work) from that
+ copyright holder, and you cure the violation prior to 30 days after
+ your receipt of the notice.
+
+ Termination of your rights under this section does not terminate the
+ licenses of parties who have received copies or rights from you under
+ this License. If your rights have been terminated and not permanently
+ reinstated, you do not qualify to receive new licenses for the same
+ material under section 10.
+
+ 9. Acceptance Not Required for Having Copies.
+
+ You are not required to accept this License in order to receive or
+ run a copy of the Program. Ancillary propagation of a covered work
+ occurring solely as a consequence of using peer-to-peer transmission
+ to receive a copy likewise does not require acceptance. However,
+ nothing other than this License grants you permission to propagate or
+ modify any covered work. These actions infringe copyright if you do
+ not accept this License. Therefore, by modifying or propagating a
+ covered work, you indicate your acceptance of this License to do so.
+
+ 10. Automatic Licensing of Downstream Recipients.
+
+ Each time you convey a covered work, the recipient automatically
+ receives a license from the original licensors, to run, modify and
+ propagate that work, subject to this License. You are not responsible
+ for enforcing compliance by third parties with this License.
+
+ An "entity transaction" is a transaction transferring control of an
+ organization, or substantially all assets of one, or subdividing an
+ organization, or merging organizations. If propagation of a covered
+ work results from an entity transaction, each party to that
+ transaction who receives a copy of the work also receives whatever
+ licenses to the work the party's predecessor in interest had or could
+ give under the previous paragraph, plus a right to possession of the
+ Corresponding Source of the work from the predecessor in interest, if
+ the predecessor has it or can get it with reasonable efforts.
+
+ You may not impose any further restrictions on the exercise of the
+ rights granted or affirmed under this License. For example, you may
+ not impose a license fee, royalty, or other charge for exercise of
+ rights granted under this License, and you may not initiate litigation
+ (including a cross-claim or counterclaim in a lawsuit) alleging that
+ any patent claim is infringed by making, using, selling, offering for
+ sale, or importing the Program or any portion of it.
+
+ 11. Patents.
+
+ A "contributor" is a copyright holder who authorizes use under this
+ License of the Program or a work on which the Program is based. The
+ work thus licensed is called the contributor's "contributor version".
+
+ A contributor's "essential patent claims" are all patent claims
+ owned or controlled by the contributor, whether already acquired or
+ hereafter acquired, that would be infringed by some manner, permitted
+ by this License, of making, using, or selling its contributor version,
+ but do not include claims that would be infringed only as a
+ consequence of further modification of the contributor version. For
+ purposes of this definition, "control" includes the right to grant
+ patent sublicenses in a manner consistent with the requirements of
+ this License.
+
+ Each contributor grants you a non-exclusive, worldwide, royalty-free
+ patent license under the contributor's essential patent claims, to
+ make, use, sell, offer for sale, import and otherwise run, modify and
+ propagate the contents of its contributor version.
+
+ In the following three paragraphs, a "patent license" is any express
+ agreement or commitment, however denominated, not to enforce a patent
+ (such as an express permission to practice a patent or covenant not to
+ sue for patent infringement). To "grant" such a patent license to a
+ party means to make such an agreement or commitment not to enforce a
+ patent against the party.
+
+ If you convey a covered work, knowingly relying on a patent license,
+ and the Corresponding Source of the work is not available for anyone
+ to copy, free of charge and under the terms of this License, through a
+ publicly available network server or other readily accessible means,
+ then you must either (1) cause the Corresponding Source to be so
+ available, or (2) arrange to deprive yourself of the benefit of the
+ patent license for this particular work, or (3) arrange, in a manner
+ consistent with the requirements of this License, to extend the patent
+ license to downstream recipients. "Knowingly relying" means you have
+ actual knowledge that, but for the patent license, your conveying the
+ covered work in a country, or your recipient's use of the covered work
+ in a country, would infringe one or more identifiable patents in that
+ country that you have reason to believe are valid.
+
+ If, pursuant to or in connection with a single transaction or
+ arrangement, you convey, or propagate by procuring conveyance of, a
+ covered work, and grant a patent license to some of the parties
+ receiving the covered work authorizing them to use, propagate, modify
+ or convey a specific copy of the covered work, then the patent license
+ you grant is automatically extended to all recipients of the covered
+ work and works based on it.
+
+ A patent license is "discriminatory" if it does not include within
+ the scope of its coverage, prohibits the exercise of, or is
+ conditioned on the non-exercise of one or more of the rights that are
+ specifically granted under this License. You may not convey a covered
+ work if you are a party to an arrangement with a third party that is
+ in the business of distributing software, under which you make payment
+ to the third party based on the extent of your activity of conveying
+ the work, and under which the third party grants, to any of the
+ parties who would receive the covered work from you, a discriminatory
+ patent license (a) in connection with copies of the covered work
+ conveyed by you (or copies made from those copies), or (b) primarily
+ for and in connection with specific products or compilations that
+ contain the covered work, unless you entered into that arrangement,
+ or that patent license was granted, prior to 28 March 2007.
+
+ Nothing in this License shall be construed as excluding or limiting
+ any implied license or other defenses to infringement that may
+ otherwise be available to you under applicable patent law.
+
+ 12. No Surrender of Others' Freedom.
+
+ If conditions are imposed on you (whether by court order, agreement or
+ otherwise) that contradict the conditions of this License, they do not
+ excuse you from the conditions of this License. If you cannot convey a
+ covered work so as to satisfy simultaneously your obligations under this
+ License and any other pertinent obligations, then as a consequence you may
+ not convey it at all. For example, if you agree to terms that obligate you
+ to collect a royalty for further conveying from those to whom you convey
+ the Program, the only way you could satisfy both those terms and this
+ License would be to refrain entirely from conveying the Program.
+
+ 13. Use with the GNU Affero General Public License.
+
+ Notwithstanding any other provision of this License, you have
+ permission to link or combine any covered work with a work licensed
+ under version 3 of the GNU Affero General Public License into a single
+ combined work, and to convey the resulting work. The terms of this
+ License will continue to apply to the part which is the covered work,
+ but the special requirements of the GNU Affero General Public License,
+ section 13, concerning interaction through a network will apply to the
+ combination as such.
+
+ 14. Revised Versions of this License.
+
+ The Free Software Foundation may publish revised and/or new versions of
+ the GNU General Public License from time to time. Such new versions will
+ be similar in spirit to the present version, but may differ in detail to
+ address new problems or concerns.
+
+ Each version is given a distinguishing version number. If the
+ Program specifies that a certain numbered version of the GNU General
+ Public License "or any later version" applies to it, you have the
+ option of following the terms and conditions either of that numbered
+ version or of any later version published by the Free Software
+ Foundation. If the Program does not specify a version number of the
+ GNU General Public License, you may choose any version ever published
+ by the Free Software Foundation.
+
+ If the Program specifies that a proxy can decide which future
+ versions of the GNU General Public License can be used, that proxy's
+ public statement of acceptance of a version permanently authorizes you
+ to choose that version for the Program.
+
+ Later license versions may give you additional or different
+ permissions. However, no additional obligations are imposed on any
+ author or copyright holder as a result of your choosing to follow a
+ later version.
+
+ 15. Disclaimer of Warranty.
+
+ THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
+ APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
+ HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
+ OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
+ THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
+ IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
+ ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
+
+ 16. Limitation of Liability.
+
+ IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
+ WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
+ THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
+ GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
+ USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
+ DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
+ PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
+ EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
+ SUCH DAMAGES.
+
+ 17. Interpretation of Sections 15 and 16.
+
+ If the disclaimer of warranty and limitation of liability provided
+ above cannot be given local legal effect according to their terms,
+ reviewing courts shall apply local law that most closely approximates
+ an absolute waiver of all civil liability in connection with the
+ Program, unless a warranty or assumption of liability accompanies a
+ copy of the Program in return for a fee.
+
+ END OF TERMS AND CONDITIONS
+
+ How to Apply These Terms to Your New Programs
+
+ If you develop a new program, and you want it to be of the greatest
+ possible use to the public, the best way to achieve this is to make it
+ free software which everyone can redistribute and change under these terms.
+
+ To do so, attach the following notices to the program. It is safest
+ to attach them to the start of each source file to most effectively
+ state the exclusion of warranty; and each file should have at least
+ the "copyright" line and a pointer to where the full notice is found.
+
+ <one line to give the program's name and a brief idea of what it does.>
+ Copyright (C) <year> <name of author>
+
+ This program is free software: you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation, either version 3 of the License, or
+ (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with this program. If not, see <https://www.gnu.org/licenses/>.
+
+ Also add information on how to contact you by electronic and paper mail.
+
+ If the program does terminal interaction, make it output a short
+ notice like this when it starts in an interactive mode:
+
+ <program> Copyright (C) <year> <name of author>
+ This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
+ This is free software, and you are welcome to redistribute it
+ under certain conditions; type `show c' for details.
+
+ The hypothetical commands `show w' and `show c' should show the appropriate
+ parts of the General Public License. Of course, your program's commands
+ might be different; for a GUI interface, you would use an "about box".
+
+ You should also get your employer (if you work as a programmer) or school,
+ if any, to sign a "copyright disclaimer" for the program, if necessary.
+ For more information on this, and how to apply and follow the GNU GPL, see
+ <https://www.gnu.org/licenses/>.
+
+ The GNU General Public License does not permit incorporating your program
+ into proprietary programs. If your program is a subroutine library, you
+ may consider it more useful to permit linking proprietary applications with
+ the library. If this is what you want to do, use the GNU Lesser General
+ Public License instead of this License. But first, please read
+ <https://www.gnu.org/licenses/why-not-lgpl.html>.
README.md CHANGED
@@ -1,6 +1,6 @@
  ---
- title: Deki
- emoji: 🐢
+ title: deki
+ emoji: 📱
  colorFrom: blue
  colorTo: blue
  sdk: gradio
app.py CHANGED
@@ -1,7 +1,322 @@
  import gradio as gr
- 
- def greet(name):
-     return "Hello " + name + "!!"
- 
- demo = gr.Interface(fn=greet, inputs="text", outputs="text")
+ import os
+ import base64
+ import time
+ import json
+ import logging
+ import tempfile
+ import uuid
+ import io
+ 
+ from PIL import Image
+ from openai import OpenAI
+ from ultralytics import YOLO
+ 
+ from wrapper import process_image_description
+ from utils.pills import preprocess_image
+ import cv2
+ import cv2.dnn_superres as dnn_superres
+ import easyocr
+ from spellchecker import SpellChecker
+ 
+ logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s %(message)s')
+ 
+ GLOBAL_SR = None
+ GLOBAL_READER = None
+ GLOBAL_SPELL = None
+ YOLO_MODEL = None
+ 
+ def load_models():
+     """
+     Called once to load all necessary models into memory.
+     """
+     global GLOBAL_SR, GLOBAL_READER, GLOBAL_SPELL, YOLO_MODEL
+ 
+     logging.info("Loading all models...")
+     start_time_total = time.perf_counter()
+ 
+     # Super-resolution
+     logging.info("Loading super-resolution model...")
+     start_time = time.perf_counter()
+     sr = None
+     model_path = "EDSR_x4.pb"
+     if os.path.exists(model_path):
+         if hasattr(cv2, 'dnn_superres'):
+             try:
+                 sr = dnn_superres.DnnSuperResImpl_create()
+             except AttributeError:
+                 sr = dnn_superres.DnnSuperResImpl()
+             sr.readModel(model_path)
+             sr.setModel('edsr', 4)
+             GLOBAL_SR = sr
+             logging.info("Super-resolution model loaded.")
+         else:
+             logging.warning("cv2.dnn_superres module not available.")
+     else:
+         logging.warning(f"Super-resolution model file not found: {model_path}. Skipping SR.")
+     logging.info(f"Super-resolution init took {time.perf_counter()-start_time:.3f}s.")
+ 
+     # EasyOCR + SpellChecker
+     logging.info("Loading OCR + SpellChecker...")
+     start_time = time.perf_counter()
+     GLOBAL_READER = easyocr.Reader(['en'], gpu=True)
+     GLOBAL_SPELL = SpellChecker()
+     logging.info(f"OCR + SpellChecker init took {time.perf_counter()-start_time:.3f}s.")
+ 
+     # YOLO Model
+     logging.info("Loading YOLO model...")
+     start_time = time.perf_counter()
+     yolo_weights = "best.pt"
+     if os.path.exists(yolo_weights):
+         YOLO_MODEL = YOLO(yolo_weights)
+         logging.info("YOLO model loaded.")
+     else:
+         logging.error(f"YOLO weights file '{yolo_weights}' not found! Endpoints will fail.")
+     logging.info(f"YOLO init took {time.perf_counter()-start_time:.3f}s.")
+ 
+     logging.info(f"Total model loading time: {time.perf_counter()-start_time_total:.3f}s.")
+ 
+ 
+ def pil_to_base64_str(pil_image, format="PNG"):
+     """Converts a PIL Image to a base64 string with a data URI header."""
+     buffered = io.BytesIO()
+     pil_image.save(buffered, format=format)
+     img_str = base64.b64encode(buffered.getvalue()).decode("utf-8")
+     return f"data:image/{format.lower()};base64,{img_str}"
+ 
+ def save_base64_image(image_data: str, file_path: str):
+     """Saves a base64 encoded image to a file."""
+     if image_data.startswith("data:image"):
+         _, image_data = image_data.split(",", 1)
+     img_bytes = base64.b64decode(image_data)
+     with open(file_path, "wb") as f:
+         f.write(img_bytes)
+     return img_bytes
+ 
+ def run_wrapper(image_path: str, output_dir: str, skip_ocr: bool = False, skip_spell: bool = False, json_mini=False) -> str:
+     """Calls the main processing script and returns the result."""
+     process_image_description(
+         input_image=image_path,
+         weights_file="best.pt",
+         output_dir=output_dir,
+         no_captioning=True,
+         output_json=True,
+         json_mini=json_mini,
+         model_obj=YOLO_MODEL,
+         sr=GLOBAL_SR,
+         spell=None if skip_ocr else GLOBAL_SPELL,
+         reader=None if skip_ocr else GLOBAL_READER,
+         skip_ocr=skip_ocr,
+         skip_spell=skip_spell,
+     )
+     base_name = os.path.splitext(os.path.basename(image_path))[0]
+     result_dir = os.path.join(output_dir, "result")
+     json_file = os.path.join(result_dir, f"{base_name}.json")
+     if os.path.exists(json_file):
+         with open(json_file, "r", encoding="utf-8") as f:
+             return f.read()
+     else:
+         raise FileNotFoundError(f"Result file not generated: {json_file}")
+ 
+ def handle_action(openai_key, image, prompt):
+     if not openai_key: return "Error: OpenAI API Key is required for /action."
+     if image is None: return "Error: Please upload an image."
+     if not prompt: return "Error: Please provide a prompt."
+ 
+     try:
+         llm_client = OpenAI(api_key=openai_key)
+         image_b64 = pil_to_base64_str(image)
+ 
+         with tempfile.TemporaryDirectory() as temp_dir:
+             request_id = str(uuid.uuid4())
+             original_image_path = os.path.join(temp_dir, f"{request_id}.png")
+             yolo_updated_image_path = os.path.join(temp_dir, f"{request_id}_yolo_updated.png")
+             save_base64_image(image_b64, original_image_path)
+ 
+             image_description = run_wrapper(original_image_path, temp_dir, skip_ocr=False, skip_spell=True, json_mini=True)
+ 
+             with open(yolo_updated_image_path, "rb") as f:
+                 yolo_updated_img_bytes = f.read()
+ 
+             _, new_b64 = preprocess_image(yolo_updated_img_bytes, threshold=2000, scale=0.5, fmt="png")
+ 
+             base64_image_url = f"data:image/png;base64,{new_b64}"
+             prompt_text = f"""You are an AI agent... (rest of your long prompt)
+             The user said: "{prompt}"
+             Description: "{image_description}" """
+ 
+             messages = [{"role": "user", "content": [{"type": "text", "text": prompt_text}, {"type": "image_url", "image_url": {"url": base64_image_url, "detail": "high"}}]}]
+ 
+             response = llm_client.chat.completions.create(model="gpt-4.1", messages=messages, temperature=0.2)
+             return response.choices[0].message.content.strip()
+ 
+     except Exception as e:
+         logging.error(f"Error in /action endpoint: {e}", exc_info=True)
+         return f"An error occurred: {e}"
+ 
+ def handle_analyze(image, output_style):
+     if image is None: return "Error: Please upload an image."
+ 
+     try:
+         image_b64 = pil_to_base64_str(image)
+         with tempfile.TemporaryDirectory() as temp_dir:
+             image_path = os.path.join(temp_dir, "image_to_analyze.png")
+             save_base64_image(image_b64, image_path)
+ 
+             is_mini = (output_style == "Mini JSON")
+             description_str = run_wrapper(image_path=image_path, output_dir=temp_dir, json_mini=is_mini)
+ 
+             parsed_json = json.loads(description_str)
+             return json.dumps(parsed_json, indent=2)
+ 
+     except Exception as e:
+         logging.error(f"Error in /analyze endpoint: {e}", exc_info=True)
+         return f"An error occurred: {e}"
+ 
+ def handle_analyze_yolo(image, output_style):
+     if image is None: return None, "Error: Please upload an image."
+ 
+     try:
+         image_b64 = pil_to_base64_str(image)
+         with tempfile.TemporaryDirectory() as temp_dir:
+             request_id = str(uuid.uuid4())
+             image_path = os.path.join(temp_dir, f"{request_id}.png")
+             yolo_image_path = os.path.join(temp_dir, f"{request_id}_yolo_updated.png")
+             save_base64_image(image_b64, image_path)
+ 
+             is_mini = (output_style == "Mini JSON")
+             description_str = run_wrapper(image_path=image_path, output_dir=temp_dir, json_mini=is_mini)
+ 
+             parsed_json = json.loads(description_str)
+             description_output = json.dumps(parsed_json, indent=2)
+ 
+             yolo_image_result = Image.open(yolo_image_path)
+             return yolo_image_result, description_output
+ 
+     except Exception as e:
+         logging.error(f"Error in /analyze_and_get_yolo: {e}", exc_info=True)
+         return None, f"An error occurred: {e}"
+ 
+ def handle_generate(openai_key, image, prompt):
+     if not openai_key: return "Error: OpenAI API Key is required for /generate."
+     if image is None: return "Error: Please upload an image."
+     if not prompt: return "Error: Please provide a prompt."
+ 
+     try:
+         llm_client = OpenAI(api_key=openai_key)
+         image_b64 = pil_to_base64_str(image)
+ 
+         with tempfile.TemporaryDirectory() as temp_dir:
+             request_id = str(uuid.uuid4())
+             original_image_path = os.path.join(temp_dir, f"{request_id}.png")
+             yolo_updated_image_path = os.path.join(temp_dir, f"{request_id}_yolo_updated.png")
+             save_base64_image(image_b64, original_image_path)
+ 
+             image_description = run_wrapper(image_path=original_image_path, output_dir=temp_dir, json_mini=False)
+ 
+             with open(yolo_updated_image_path, "rb") as f:
+                 yolo_updated_img_bytes = f.read()
+ 
+             _, new_b64 = preprocess_image(yolo_updated_img_bytes, threshold=1500, scale=0.5, fmt="png")
+ 
+             base64_image_url = f"data:image/png;base64,{new_b64}"
+             messages = [
+                 {"role": "user", "content": [
+                     {"type": "text", "text": f'"Prompt: {prompt}"\nImage description:\n"{image_description}"'},
+                     {"type": "image_url", "image_url": {"url": base64_image_url, "detail": "high"}}
+                 ]}
+             ]
+ 
+             response = llm_client.chat.completions.create(model="gpt-4.1", messages=messages, temperature=0.2)
+             return response.choices[0].message.content.strip()
+ 
+     except Exception as e:
+         logging.error(f"Error in /generate endpoint: {e}", exc_info=True)
+         return f"An error occurred: {e}"
+ 
+ default_image_1 = Image.open("./res/bb_1.jpeg")
+ default_image_2 = Image.open("./res/mfa_1.jpeg")
+ 
+ def load_example_action_1(): return default_image_1, "Open and read Umico partner"
+ def load_example_action_2(): return default_image_2, "Sign up in the application"
+ def load_example_analyze_1(): return default_image_1
+ def load_example_analyze_2(): return default_image_2
+ def load_example_yolo_1(): return default_image_1
+ def load_example_yolo_2(): return default_image_2
+ def load_example_generate_1(): return default_image_1, "Generate the code for this screen for Android XML. Try to use constraint layout"
+ def load_example_generate_2(): return default_image_2, "Generate the code for this screen for Android XML. Try to use constraint layout"
+ 
+ 
+ with gr.Blocks(theme=gr.themes.Soft()) as demo:
+     gr.Markdown("# Deki Automata: UI Analysis and Generation")
+     gr.Markdown("Provide your API keys below. The OpenAI key is only required for the 'Action' and 'Generate' tabs.")
+ 
+     with gr.Row():
+         openai_key_input = gr.Textbox(label="OpenAI API Key", placeholder="Enter your OpenAI API Key", type="password", scale=1)
+ 
+     with gr.Tabs():
+         with gr.TabItem("Action"):
+             gr.Markdown("### Control a device with natural language.")
+             with gr.Row():
+                 image_input_action = gr.Image(type="pil", label="Upload Screen Image")
+                 prompt_input_action = gr.Textbox(lines=2, placeholder="e.g., 'Open whatsapp and text my friend...'", label="Prompt")
+             action_output = gr.Textbox(label="Response Command")
+             action_button = gr.Button("Run Action", variant="primary")
+             with gr.Row():
+                 example_action_btn1 = gr.Button("Load Example 1")
+                 example_action_btn2 = gr.Button("Load Example 2")
+ 
+         with gr.TabItem("Analyze"):
+             gr.Markdown("### Get a structured JSON description of the UI elements.")
+             with gr.Row():
+                 image_input_analyze = gr.Image(type="pil", label="Upload Screen Image")
+                 with gr.Column():
+                     output_style_analyze = gr.Radio(["Standard JSON", "Mini JSON"], label="Output Format", value="Standard JSON")
+                     analyze_button = gr.Button("Analyze Image", variant="primary")
+             analyze_output = gr.JSON(label="JSON Description")
+             with gr.Row():
+                 example_analyze_btn1 = gr.Button("Load Example 1")
+                 example_analyze_btn2 = gr.Button("Load Example 2")
+ 
+         with gr.TabItem("Analyze & Get YOLO"):
+             gr.Markdown("### Get a JSON description and the image with detected elements.")
+             with gr.Row():
+                 image_input_yolo = gr.Image(type="pil", label="Upload Screen Image")
+                 with gr.Column():
+                     output_style_yolo = gr.Radio(["Standard JSON", "Mini JSON"], label="Output Format", value="Standard JSON")
+                     yolo_button = gr.Button("Analyze and Visualize", variant="primary")
+             with gr.Row():
+                 yolo_image_output = gr.Image(label="YOLO Annotated Image")
+                 description_output_yolo = gr.JSON(label="JSON Description")
+             with gr.Row():
+                 example_yolo_btn1 = gr.Button("Load Example 1")
+                 example_yolo_btn2 = gr.Button("Load Example 2")
+ 
+         with gr.TabItem("Generate"):
+             gr.Markdown("### Generate code or text based on a screenshot.")
+             with gr.Row():
+                 image_input_generate = gr.Image(type="pil", label="Upload Screen Image")
+                 prompt_input_generate = gr.Textbox(lines=2, placeholder="e.g., 'Generate the Android XML for this screen'", label="Prompt")
+             generate_output = gr.Code(label="Generated Output", language="xml")
+             generate_button = gr.Button("Generate", variant="primary")
+             with gr.Row():
+                 example_generate_btn1 = gr.Button("Load Example 1")
+                 example_generate_btn2 = gr.Button("Load Example 2")
+ 
+     action_button.click(fn=handle_action, inputs=[openai_key_input, image_input_action, prompt_input_action], outputs=action_output)
+     analyze_button.click(fn=handle_analyze, inputs=[image_input_analyze, output_style_analyze], outputs=analyze_output)
+     yolo_button.click(fn=handle_analyze_yolo, inputs=[image_input_yolo, output_style_yolo], outputs=[yolo_image_output, description_output_yolo])
+     generate_button.click(fn=handle_generate, inputs=[openai_key_input, image_input_generate, prompt_input_generate], outputs=generate_output)
+ 
+     example_action_btn1.click(fn=load_example_action_1, outputs=[image_input_action, prompt_input_action])
+     example_action_btn2.click(fn=load_example_action_2, outputs=[image_input_action, prompt_input_action])
+     example_analyze_btn1.click(fn=load_example_analyze_1, outputs=image_input_analyze)
+     example_analyze_btn2.click(fn=load_example_analyze_2, outputs=image_input_analyze)
+     example_yolo_btn1.click(fn=load_example_yolo_1, outputs=image_input_yolo)
+     example_yolo_btn2.click(fn=load_example_yolo_2, outputs=image_input_yolo)
+     example_generate_btn1.click(fn=load_example_generate_1, outputs=[image_input_generate, prompt_input_generate])
+     example_generate_btn2.click(fn=load_example_generate_2, outputs=[image_input_generate, prompt_input_generate])
+ 
+ load_models()
  demo.launch()
+
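
Because the Blocks events are wired to named handler functions, Gradio also exposes them as API endpoints. A hedged sketch of driving the Analyze tab programmatically, assuming a recent gradio_client (where files are wrapped with handle_file) and the default auto-generated endpoint name; both may differ with the Gradio version or an explicit api_name:

from gradio_client import Client, handle_file

client = Client("http://127.0.0.1:7860/")  # or a Space id such as "orasul/deki"
result = client.predict(
    handle_file("screenshot.png"),  # image_input_analyze
    "Mini JSON",                    # output_style_analyze
    api_name="/handle_analyze",     # default name derived from the handler function
)
print(result)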
best.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:879fd9bd951bce6910815fbe54b66a1970d903326d276c3d7cad19db798d0c2c
+ size 33455351
icon-image-detection-model.keras ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dd08501457e8dee65d173483026744d52d0d12c58589a3ee03cc6ad3c2e9cdd3
+ size 245597405
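
None of the Python files shown in this commit reference icon-image-detection-model.keras directly, so it is presumably loaded elsewhere (e.g., in script.py). Loading a .keras archive uses the standard Keras API; a sketch, assuming a TensorFlow/Keras version that supports the .keras format:

import tensorflow as tf

# Loads the saved architecture and weights from the .keras archive.
model = tf.keras.models.load_model("icon-image-detection-model.keras")
model.summary()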
main.py ADDED
@@ -0,0 +1,472 @@
1
+ import os
2
+ import base64
3
+ import time
4
+ import easyocr
5
+ from spellchecker import SpellChecker
6
+ import cv2
7
+ try:
+ import cv2.dnn_superres as dnn_superres  # optional: only present in opencv-contrib builds
+ except ImportError:
+ dnn_superres = None
8
+ import json
9
+ import asyncio
10
+ import functools
11
+ from fastapi.responses import JSONResponse
12
+ from fastapi import FastAPI, HTTPException, Depends
13
+ from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
14
+ from pydantic import BaseModel
15
+ import openai
16
+ from openai import OpenAI
17
+ from ultralytics import YOLO
18
+ from wrapper import process_image_description
19
+ from utils.pills import preprocess_image
20
+ import logging
21
+ import tempfile
22
+ import uuid
23
+
24
+ logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s %(message)s')
25
+
26
+ app = FastAPI(title="deki-automata API")
27
+
28
+ # The global concurrency limit is kept low because the ML tasks consume almost all of the current server's RAM.
29
+ # It can be raised in the future if the hardware allows.
30
+ CONCURRENT_LIMIT = 2
31
+ concurrency_semaphore = asyncio.Semaphore(CONCURRENT_LIMIT)
32
+
33
+ def with_semaphore(timeout: float = 20):
34
+ """
35
+ Decorator to limit concurrent access by acquiring the semaphore
36
+ before the function runs, and releasing it afterward.
37
+ """
38
+ def decorator(func):
39
+ @functools.wraps(func)
40
+ async def wrapper(*args, **kwargs):
41
+ try:
42
+ await asyncio.wait_for(concurrency_semaphore.acquire(), timeout=timeout)
43
+ except asyncio.TimeoutError:
44
+ raise HTTPException(status_code=503, detail="Service busy, please try again later.")
45
+ try:
46
+ return await func(*args, **kwargs)
47
+ finally:
48
+ concurrency_semaphore.release()
49
+ return wrapper
50
+ return decorator
51
+
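+ # Minimal usage sketch of the decorator (illustrative; "/ping" is a hypothetical endpoint):
+ #
+ # @app.get("/ping")
+ # @with_semaphore(timeout=5)
+ # async def ping():
+ #     return {"status": "ok"}
+ #
+ # with_semaphore sits below the route decorator so the semaphore wraps the handler coroutine itself.
+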
52
+ OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
53
+ API_TOKEN = os.environ.get("API_TOKEN")
54
+
55
+ if not OPENAI_API_KEY or not API_TOKEN:
56
+ logging.error("OPENAI_API_KEY and API_TOKEN must be set in environment variables.")
57
+ raise RuntimeError("OPENAI_API_KEY and API_TOKEN must be set in environment variables.")
58
+
59
+ openai.api_key = OPENAI_API_KEY
60
+
61
+ GLOBAL_SR = None
62
+ GLOBAL_READER = None
63
+ GLOBAL_SPELL = None
64
+ LLM_CLIENT = None
65
+
66
+ os.makedirs("./res", exist_ok=True)
67
+ os.makedirs("./result", exist_ok=True)
68
+ os.makedirs("./output", exist_ok=True)
69
+
70
+
71
+ # for action step tracking
72
+ ACTION_STEPS_LIMIT = 10 # can be updated
73
+
74
+ @app.on_event("startup")
75
+ def load_models():
76
+ """
77
+ Called once when FastAPI starts.
78
+ """
79
+ global GLOBAL_SR, GLOBAL_READER, GLOBAL_SPELL, LLM_CLIENT
80
+
81
+ # Super-resolution
82
+ logging.info("Loading super-resolution model ...")
83
+ start_time = time.perf_counter()
84
+ sr = None
85
+ model_path = "EDSR_x4.pb"
86
+ if hasattr(cv2, 'dnn_superres'):
87
+ logging.info("dnn_superres module is available.")
88
+ try:
89
+ sr = dnn_superres.DnnSuperResImpl_create()
90
+ logging.info("Using DnnSuperResImpl_create()")
91
+ except AttributeError:
92
+ sr = dnn_superres.DnnSuperResImpl()
93
+ logging.info("Using DnnSuperResImpl()")
94
+
95
+ if os.path.exists(model_path):
96
+ sr.readModel(model_path)
97
+ sr.setModel('edsr', 4)
98
+ GLOBAL_SR = sr
99
+ logging.info("Super-resolution model loaded.")
100
+ else:
101
+ logging.warning(f"Super-resolution model file not found: {model_path}. Skipping SR.")
102
+ GLOBAL_SR = None
103
+ else:
104
+ logging.info("dnn_superres module is NOT available; skipping super-resolution.")
105
+ GLOBAL_SR = None
106
+ logging.info(f"Super-resolution initialization took {time.perf_counter()-start_time:.3f}s.")
107
+
108
+ # EasyOCR + SpellChecker
109
+ logging.info("Loading OCR + SpellChecker ...")
110
+ start_time = time.perf_counter()
111
+ GLOBAL_READER = easyocr.Reader(['en'], gpu=True)
112
+ GLOBAL_SPELL = SpellChecker()
113
+ logging.info(f"OCR + SpellChecker init took {time.perf_counter()-start_time:.3f}s.")
114
+ LLM_CLIENT = OpenAI()
115
+
116
+ class ActionRequest(BaseModel):
117
+ image: str # Base64-encoded image
118
+ prompt: str # User prompt (e.g., "Open whatsapp and tell my friend user_name that I will be 15 minutes late")
119
+ history: list[str] = []
120
+
121
+ class ActionResponse(BaseModel):
122
+ response: str
123
+ history: list[str]
124
+
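+ # Illustrative /action request body (all values are placeholders):
+ # {
+ #   "image": "data:image/png;base64,iVBORw0...",
+ #   "prompt": "Open the settings app",
+ #   "history": []
+ # }
+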
125
+
126
+ class AnalyzeRequest(BaseModel):
127
+ image: str # Base64-encoded image
128
+
129
+ security = HTTPBearer()
130
+
131
+ def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)):
132
+ if credentials.credentials != API_TOKEN:
133
+ logging.warning("Invalid API token attempt.")
134
+ raise HTTPException(status_code=401, detail="Invalid API token")
135
+ return credentials.credentials
136
+
137
+ def save_base64_image(image_data: str, file_path: str) -> bytes:
138
+ """
139
+ Decode base64 image data (removing any data URI header) and save to the specified file.
140
+ Returns the raw image bytes.
141
+ """
142
+ if image_data.startswith("data:image"):
143
+ _, image_data = image_data.split(",", 1)
144
+ try:
145
+ img_bytes = base64.b64decode(image_data)
146
+ except Exception as e:
147
+ logging.exception("Error decoding base64 image data.")
148
+ raise HTTPException(status_code=400, detail=f"Invalid base64 image data: {e}")
149
+
150
+ try:
151
+ with open(file_path, "wb") as f:
152
+ f.write(img_bytes)
153
+ except Exception as e:
154
+ logging.exception("Error saving image file.")
155
+ raise HTTPException(status_code=500, detail=f"Failed to save image: {e}")
156
+
157
+ return img_bytes
158
+
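+ # Example (illustrative): an optional data-URI header is stripped before decoding.
+ # save_base64_image("data:image/png;base64,iVBORw0...", "/tmp/in.png")
+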
159
+ def log_request_data(request, endpoint: str):
160
+ """
161
+ Log the user prompt (if any) and a preview of the image data.
162
+ """
163
+ logging.info(f"{endpoint} request received:")
164
+ if hasattr(request, 'prompt'):
165
+ logging.info(f"User prompt: {request.prompt}")
166
+ image_preview = request.image[:100] + "..." if len(request.image) > 100 else request.image
167
+ logging.info(f"User image data (base64 preview): {image_preview}")
168
+
169
+ def run_wrapper(image_path: str, output_dir: str, skip_ocr: bool = False, skip_spell: bool = False, json_mini: bool = False) -> str:
170
+ """
171
+ Calls process_image_description() to perform YOLO detection and image description,
172
+ then reads the resulting JSON or text file from ./result.
173
+ """
174
+
175
+ weights_file = "best.pt"
176
+ no_captioning = True
177
+ output_json = True
178
+
179
+ process_image_description(
180
+ input_image=image_path,
181
+ weights_file=weights_file,
182
+ output_dir=output_dir,
183
+ no_captioning=no_captioning,
184
+ output_json=output_json,
185
+ json_mini=json_mini,
186
+ sr=GLOBAL_SR,
187
+ spell=None if skip_ocr else GLOBAL_SPELL,
188
+ reader=None if skip_ocr else GLOBAL_READER,
189
+ skip_ocr=skip_ocr,
190
+ skip_spell=skip_spell,
191
+ )
192
+
193
+ base_name = os.path.splitext(os.path.basename(image_path))[0]
194
+ result_dir = os.path.join(output_dir, "result")
195
+ json_file = os.path.join(result_dir, f"{base_name}.json")
196
+ txt_file = os.path.join(result_dir, f"{base_name}.txt")
197
+
198
+ if os.path.exists(json_file):
199
+ try:
200
+ with open(json_file, "r", encoding="utf-8") as f:
201
+ return f.read()
202
+ except Exception as e:
203
+ logging.exception("Failed to read JSON description file.")
204
+ raise Exception(f"Failed to read JSON description file: {e}")
205
+ elif os.path.exists(txt_file):
206
+ try:
207
+ with open(txt_file, "r", encoding="utf-8") as f:
208
+ return f.read()
209
+ except Exception as e:
210
+ logging.exception("Failed to read TXT description file.")
211
+ raise Exception(f"Failed to read TXT description file: {e}")
212
+ else:
213
+ logging.error("No image description file was generated.")
214
+ raise FileNotFoundError("No image description file was generated.")
215
+
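+ # Illustrative shape of the mini JSON returned when json_mini=True
+ # (values are made up; see script.py for the exact fields):
+ # {
+ #   "image_size": [1080, 2400],
+ #   "bbox_format": "center_x, center_y, width, height",
+ #   "elements": [{"id": "text_1", "bbox": [160, 820, 300, 48], "text": "Settings"}]
+ # }
+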
216
+ @app.get("/")
217
+ async def root():
218
+ return {"message": "deki"}
219
+
220
+ @app.post("/action", response_model=ActionResponse)
221
+ @with_semaphore(timeout=60)
222
+ async def action(request: ActionRequest, token: str = Depends(verify_token)):
223
+ """
224
+ Processes the input image (in base64 format) and a user prompt:
225
+ 1. Decodes and saves the original image.
226
+ 2. Runs the wrapper to generate an image description and the YOLO-updated image file.
227
+ 3. Reads the YOLO-updated image file.
228
+ 4. Preprocesses the YOLO-updated image.
229
+ 5. Constructs a prompt for ChatGPT (using description + preprocessed YOLO image) and sends it.
230
+ 6. Returns the command response.
231
+ """
232
+ start_time = time.perf_counter()
233
+ logging.info("action endpoint start")
234
+ log_request_data(request, "/action")
235
+
236
+ action_step_history = request.history
237
+ action_step_count = len(action_step_history)
238
+
239
+ # Check if the step limit is reached.
240
+ if action_step_count >= ACTION_STEPS_LIMIT:
241
+ logging.warning(f"Step limit of {ACTION_STEPS_LIMIT} reached. Resetting history.")
242
+ # Return a clear response and an empty history to reset the client.
243
+ return ActionResponse(response="Step limit is reached", history=[])
244
+
245
+ # Use a temporary directory to isolate all files for this request.
246
+ with tempfile.TemporaryDirectory() as temp_dir:
247
+ request_id = str(uuid.uuid4())
248
+
249
+ original_image_path = os.path.join(temp_dir, f"{request_id}.png")
250
+ yolo_updated_image_path = os.path.join(temp_dir, f"{request_id}_yolo_updated.png")
251
+
252
+ save_base64_image(request.image, original_image_path)
253
+
254
+ try:
255
+ loop = asyncio.get_running_loop()
256
+ image_description = await loop.run_in_executor(
257
+ None,
258
+ run_wrapper,
259
+ original_image_path,
260
+ temp_dir,
261
+ False,  # skip_ocr
262
+ True,  # skip_spell
263
+ True,  # json_mini
264
+ )
265
+ except Exception as e:
266
+ logging.exception("Image processing failed in action endpoint.")
267
+ raise HTTPException(status_code=500, detail=f"Image processing failed: {e}")
268
+
269
+ try:
270
+ if not os.path.exists(yolo_updated_image_path):
271
+ logging.error(f"YOLO updated image not found at {yolo_updated_image_path}")
272
+ raise HTTPException(status_code=500, detail="YOLO updated image generation failed or not found.")
273
+ with open(yolo_updated_image_path, "rb") as f:
274
+ yolo_updated_img_bytes = f.read()
275
+ except Exception as e:
276
+ logging.exception(f"Error reading YOLO updated image from {yolo_updated_image_path}")
277
+ raise HTTPException(status_code=500, detail=f"Failed to read YOLO updated image: {e}")
278
+
279
+ try:
280
+ _, new_b64 = preprocess_image(yolo_updated_img_bytes, threshold=2000, scale=0.5, fmt="png")
281
+ except Exception as e:
282
+ logging.exception("YOLO updated image preprocessing failed.")
283
+ raise HTTPException(status_code=500, detail=f"YOLO updated image preprocessing failed: {e}")
284
+
285
+ base64_image_url = f"data:image/png;base64,{new_b64}"
286
+
287
+ current_step = action_step_count + 1
288
+ previous_steps_text = ""
289
+ if action_step_history:
290
+ previous_steps_text = "\nPrevious steps:\n" + "\n".join(f"{i+1}. {step}" for i, step in enumerate(action_step_history))
291
+
292
+ prompt_text = f"""You are an AI agent that controls a mobile device and sees the content of the screen.
293
+ The user can ask you for information or ask you to perform a task, and you need to complete these tasks.
294
+ You can only respond with one of these commands (in quotes) but some variables are dynamic
295
+ and can be changed based on the context:
296
+ 1. "Swipe left. From start coordinates 300, 400" (or other coordinates) (Goes right)
297
+ 2. "Swipe right. From start coordinates 500, 650" (or other coordinates) (Goes left)
298
+ 3. "Swipe top. From start coordinates 600, 510" (or other coordinates) (Goes bottom)
299
+ 4. "Swipe bottom. From start coordinates 640, 500" (or other coordinates) (Goes top)
300
+ 5. "Go home"
301
+ 6. "Go back"
302
+ 7. "Open com.whatsapp" (or other app)
303
+ 8. "Tap coordinates 160, 820" (or other coordinates)
304
+ 9. "Insert text 210, 820:Hello world" (or other coordinates and text)
305
+ 10. "Screen is in a loading state. Try again" (send image again)
306
+ 11. "Answer: There are no new important mails today" (or other answer)
307
+ 12. "Finished" (task is finished)
308
+ 13. "Can't proceed" (can't understand what to do or the image has a problem, etc.)
309
+
310
+ The user said: "{request.prompt}"
311
+
312
+ Current step: {current_step}
313
+ {previous_steps_text}
314
+
315
+ I will share the screenshot of the current state of the phone (with UI elements highlighted and the corresponding
316
+ index of these UI elements) and the description (sizes, coordinates and indexes) of UI elements.
317
+ Description:
318
+ "{image_description}" """
319
+
320
+ messages = [
321
+ {"role": "user", "content": [{"type": "text", "text": prompt_text}, {"type": "image_url", "image_url": {"url": base64_image_url, "detail": "high"}}]}
322
+ ]
323
+
324
+ try:
325
+ response = LLM_CLIENT.chat.completions.create(model="gpt-4.1", messages=messages, temperature=0.2)
326
+ except Exception as e:
327
+ logging.exception("OpenAI API error.")
328
+ raise HTTPException(status_code=500, detail=f"OpenAI API error: {e}")
329
+
330
+ command_response = response.choices[0].message.content.strip()
331
+
332
+ action_step_history.append(command_response)
333
+
334
+ command_lower = command_response.strip().strip('\'"').lower()
335
+ if command_lower.startswith(("answer:", "finished", "can't proceed")):
336
+ logging.info(f"Terminal command received ('{command_response}'). Resetting history for next turn.")
337
+ final_history = []
338
+ else:
339
+ final_history = action_step_history
340
+
341
+ logging.info(f"action endpoint total processing time: {time.perf_counter()-start_time:.3f} seconds.")
342
+ logging.info(f"Response: {command_response}, History length for next turn: {len(final_history)}")
343
+
344
+ return ActionResponse(response=command_response, history=final_history)
345
+
346
+
347
+ @app.post("/generate")
348
+ @with_semaphore(timeout=60)
349
+ async def generate(request: ActionRequest, token: str = Depends(verify_token)):
350
+ """
351
+ Processes the input image (in base64 format) and a user prompt:
352
+ 1. Decodes and saves the original image.
353
+ 2. Runs the wrapper to generate an image description and the YOLO updated image file.
354
+ 3. Reads the YOLO updated image file.
355
+ 4. Preprocesses the YOLO-updated image.
356
+ 5. Constructs a prompt for GPT (using description + preprocessed YOLO image) and sends it.
357
+ 6. Returns the command response.
358
+ """
359
+ start_time = time.perf_counter()
360
+ logging.info("generate endpoint start")
361
+ log_request_data(request, "/generate")
362
+
363
+ with tempfile.TemporaryDirectory() as temp_dir:
364
+ request_id = str(uuid.uuid4())
365
+ original_image_path = os.path.join(temp_dir, f"{request_id}.png")
366
+ yolo_updated_image_path = os.path.join(temp_dir, f"{request_id}_yolo_updated.png")
367
+
368
+ save_base64_image(request.image, original_image_path)
369
+
370
+ try:
371
+ image_description = run_wrapper(image_path=original_image_path, output_dir=temp_dir)
372
+ except Exception as e:
373
+ logging.exception("Image processing failed in generate endpoint")
374
+ raise HTTPException(status_code=500, detail=f"Image processing failed: {e}")
375
+
376
+ try:
377
+ if not os.path.exists(yolo_updated_image_path):
378
+ raise HTTPException(status_code=500, detail="YOLO updated image generation failed or not found")
379
+ with open(yolo_updated_image_path, "rb") as f:
380
+ yolo_updated_img_bytes = f.read()
381
+ except Exception as e:
382
+ raise HTTPException(status_code=500, detail=f"Failed to read YOLO updated image: {e}")
383
+
384
+ try:
385
+ _, new_b64 = preprocess_image(yolo_updated_img_bytes, threshold=1500, scale=0.5, fmt="png")
386
+ except Exception as e:
387
+ raise HTTPException(status_code=500, detail=f"Image preprocessing failed: {e}")
388
+
389
+ base64_image_url = f"data:image/png;base64,{new_b64}"
390
+
391
+ messages = [
392
+ {"role": "user", "content": [
393
+ {"type": "text", "text": f'"Prompt: {request.prompt}"\nImage description:\n"{image_description}"'},
394
+ {"type": "image_url", "image_url": {"url": base64_image_url, "detail": "high"}}
395
+ ]}
396
+ ]
397
+
398
+ try:
399
+ response = LLM_CLIENT.chat.completions.create(model="gpt-4.1", messages=messages, temperature=0.2)
400
+ except Exception as e:
401
+ raise HTTPException(status_code=500, detail=f"OpenAI API error: {e}")
402
+
403
+ command_response = response.choices[0].message.content.strip()
404
+ logging.info(f"generate endpoint total processing time: {time.perf_counter()-start_time:.3f} seconds")
405
+ return {"response": command_response}
406
+
407
+ @app.post("/analyze")
408
+ @with_semaphore(timeout=60)
409
+ async def analyze(request: AnalyzeRequest, token: str = Depends(verify_token)):
410
+ """
411
+ Processes the input image (in base64 format) to return the image description as a JSON object.
412
+ """
413
+ logging.info("analyze endpoint start")
414
+ log_request_data(request, "/analyze")
415
+
416
+ with tempfile.TemporaryDirectory() as temp_dir:
417
+ image_path = os.path.join(temp_dir, "image_to_analyze.png")
418
+ save_base64_image(request.image, image_path)
419
+
420
+ try:
421
+ image_description = run_wrapper(image_path=image_path, output_dir=temp_dir)
422
+ analyzed_description = json.loads(image_description)
423
+ except Exception as e:
424
+ logging.exception("Image processing failed in analyze endpoint.")
425
+ raise HTTPException(status_code=500, detail=f"Image processing failed: {e}")
426
+
427
+ return JSONResponse(content={"description": analyzed_description})
428
+
429
+
430
+ @app.post("/analyze_and_get_yolo")
431
+ @with_semaphore(timeout=60)
432
+ async def analyze_and_get_yolo(request: AnalyzeRequest, token: str = Depends(verify_token)):
433
+ """
434
+ Processes the input image (in base64 format) to:
435
+ 1. Return the image description as a JSON object.
436
+ 2. Return the YOLO-updated image (base64 encoded).
437
+ """
438
+ logging.info("analyze_and_get_yolo endpoint start")
439
+ log_request_data(request, "/analyze_and_get_yolo")
440
+
441
+ with tempfile.TemporaryDirectory() as temp_dir:
442
+ request_id = str(uuid.uuid4())
443
+ image_path = os.path.join(temp_dir, f"{request_id}.png")
444
+ yolo_image_path = os.path.join(temp_dir, f"{request_id}_yolo_updated.png")
445
+
446
+ save_base64_image(request.image, image_path)
447
+
448
+ try:
449
+ image_description = run_wrapper(image_path=image_path, output_dir=temp_dir)
450
+ analyzed_description = json.loads(image_description)
451
+ except Exception as e:
452
+ logging.exception("Image processing failed in analyze_and_get_yolo endpoint.")
453
+ raise HTTPException(status_code=500, detail=f"Image processing failed: {e}")
454
+
455
+ if not os.path.exists(yolo_image_path):
456
+ logging.error("YOLO updated image not found.")
457
+ raise HTTPException(status_code=500, detail="YOLO updated image not generated.")
458
+
459
+ try:
460
+ with open(yolo_image_path, "rb") as f:
461
+ yolo_img_bytes = f.read()
462
+ yolo_b64 = base64.b64encode(yolo_img_bytes).decode("utf-8")
463
+ yolo_image_encoded = f"data:image/png;base64,{yolo_b64}"
464
+ except Exception as e:
465
+ logging.exception("Error reading or encoding YOLO updated image.")
466
+ raise HTTPException(status_code=500, detail=f"Error handling YOLO updated image: {e}")
467
+
468
+ return JSONResponse(content={
469
+ "description": analyzed_description,
470
+ "yolo_image": yolo_image_encoded
471
+ })
472
+
requirements.txt ADDED
@@ -0,0 +1,15 @@
1
+ easyocr==1.7.2
2
+ numpy==1.26.4
3
+ openai==1.55.3
4
+ pillow==11.0.0
5
+ pydantic==2.9.2
6
+ pyspellchecker==0.8.1
7
+ requests==2.32.3
8
+ tensorflow==2.18.0
9
+ torch==2.5.1
10
+ transformers==4.50.3
11
+ ultralytics==8.3.29
12
+ webcolors==24.11.1
13
+ httpx==0.27.2
14
+ gradio==5.23.1
15
+ opencv-python-headless==4.10.0.84
script.py ADDED
@@ -0,0 +1,808 @@
1
+ import cv2
2
+ import os
3
+ import subprocess
4
+ from PIL import Image
5
+ import easyocr
6
+ from spellchecker import SpellChecker
7
+ import numpy as np
8
+ import webcolors
9
+ from collections import Counter
10
+ import torch
11
+ from transformers import AutoProcessor, Blip2ForConditionalGeneration
12
+ import tensorflow as tf
13
+ import argparse
14
+ import json
15
+ from concurrent.futures import ThreadPoolExecutor, as_completed
16
+ import time
17
+ from utils.json_helpers import NoIndent, CustomEncoder
18
+
19
+
20
+ # constants
21
+ BARRIER = "********\n"
22
+
23
+ # Check if a model is in the cache
24
+ def is_model_downloaded(model_name, cache_directory):
25
+ model_path = os.path.join(cache_directory, model_name.replace('/', '_'))
26
+ return os.path.exists(model_path)
27
+
28
+ # Convert color to the closest name
29
+ def closest_colour(requested_colour):
30
+ min_colours = {}
31
+ css3_names = webcolors.names("css3")
32
+ for name in css3_names:
33
+ hex_value = webcolors.name_to_hex(name, spec='css3')
34
+ r_c, g_c, b_c = webcolors.hex_to_rgb(hex_value)
35
+ rd = (r_c - requested_colour[0]) ** 2
36
+ gd = (g_c - requested_colour[1]) ** 2
37
+ bd = (b_c - requested_colour[2]) ** 2
38
+ distance = rd + gd + bd
39
+ min_colours[distance] = name
40
+ return min_colours[min(min_colours.keys())]
41
+
42
+ def get_colour_name(requested_colour):
43
+ """
44
+ Returns a tuple: (exact_name, closest_name).
45
+ If an exact match fails, 'exact_name' is None, use the 'closest_name' fallback.
46
+ """
47
+ try:
48
+ actual_name = webcolors.rgb_to_name(requested_colour, spec='css3')
49
+ closest_name = actual_name
50
+ except ValueError:
51
+ closest_name = closest_colour(requested_colour)
52
+ actual_name = None
53
+ return actual_name, closest_name
54
+
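+ # Example (illustrative): (255, 0, 0) has the exact CSS3 name "red", so
+ # get_colour_name((255, 0, 0)) -> ("red", "red"); an off-red such as (250, 5, 5)
+ # has no exact name and falls back to (None, "red").
+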
55
+ def get_most_frequent_color(pixels, bin_size=10):
56
+ """
57
+ Returns the most frequent color among the given pixels,
58
+ using a binning approach (default bin size=10).
59
+ """
60
+ bins = np.arange(0, 257, bin_size)
61
+ r_bins = np.digitize(pixels[:, 0], bins) - 1
62
+ g_bins = np.digitize(pixels[:, 1], bins) - 1
63
+ b_bins = np.digitize(pixels[:, 2], bins) - 1
64
+ combined_bins = r_bins * 10000 + g_bins * 100 + b_bins
65
+ bin_counts = Counter(combined_bins)
66
+ most_common_bin = bin_counts.most_common(1)[0][0]
67
+
68
+ r_bin = most_common_bin // 10000
69
+ g_bin = (most_common_bin % 10000) // 100
70
+ b_bin = most_common_bin % 100
71
+ r_value = bins[r_bin] + bin_size // 2
72
+ g_value = bins[g_bin] + bin_size // 2
73
+ b_value = bins[b_bin] + bin_size // 2
74
+
75
+ return (r_value, g_value, b_value)
76
+
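+ # Worked example (illustrative): with bin_size=10, a pixel (23, 118, 250) lands
+ # in bins (2, 11, 25), so the returned colour is the bin centres (25, 115, 255).
+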
77
+ def get_most_frequent_alpha(alphas, bin_size=10):
78
+ bins = np.arange(0, 257, bin_size)
79
+ alpha_bins = np.digitize(alphas, bins) - 1
80
+ bin_counts = Counter(alpha_bins)
81
+ most_common_bin = bin_counts.most_common(1)[0][0]
82
+ alpha_value = bins[most_common_bin] + bin_size // 2
83
+ return alpha_value
84
+
85
+ # Downscale images for OCR. TODO: tune max_dim to a suitable value.
86
+ def downscale_for_ocr(image_cv, max_dim=600):
87
+ """
88
+ If either dimension of `image_cv` is bigger than `max_dim`,
89
+ scale it down proportionally. This speeds up EasyOCR on large images.
90
+ """
91
+ h, w = image_cv.shape[:2]
92
+ if w <= max_dim and h <= max_dim:
93
+ return image_cv # No downscale needed
94
+
95
+ scale = min(max_dim / float(w), max_dim / float(h))
96
+ new_w = int(w * scale)
97
+ new_h = int(h * scale)
98
+ image_resized = cv2.resize(image_cv, (new_w, new_h), interpolation=cv2.INTER_AREA)
99
+ return image_resized
100
+
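+ # Worked example (illustrative): a 1200x800 crop with max_dim=600 gives
+ # scale = min(600/1200, 600/800) = 0.5, so it is resized to 600x400.
+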
101
+ # Worker function to process a single bounding box
102
+ def process_single_region(
103
+ idx, bounding_box, image, sr, reader, spell, icon_model,
104
+ processor, model, device, no_captioning, output_json, json_mini,
105
+ cropped_imageview_images_dir, base_name, save_images,
106
+ model_to_use, log_prefix="",
107
+ skip_ocr=False,
108
+ skip_spell=False
109
+ ):
110
+ """
111
+ Processes one bounding box (region)
112
+ Returns a dict with:
113
+ * "region_dict" (for JSON)
114
+ * "text_log" (file/captions output)
115
+ """
116
+ (x_min, y_min, x_max, y_max, class_id) = bounding_box
117
+ class_names = {0: 'View', 1: 'ImageView', 2: 'Text', 3: 'Line'}
118
+ class_name = class_names.get(class_id, f'Unknown Class {class_id}')
119
+ region_idx = idx + 1
120
+ logs = []
121
+
122
+ x_center = (x_min + x_max) // 2
123
+ y_center = (y_min + y_max) // 2
124
+ width = x_max - x_min
125
+ height = y_max - y_min
126
+
127
+ def open_and_upscale_image(img_path, cid):
128
+
129
+ if cid == 2: # Text
130
+ MAX_WIDTH, MAX_HEIGHT = 30, 30
131
+ else:
132
+ MAX_WIDTH, MAX_HEIGHT = 10, 10
133
+
134
+ def is_small(w, h):
135
+ return w <= MAX_WIDTH and h <= MAX_HEIGHT
136
+
137
+ if cid == 0: # "View" - use PIL to preserve alpha
138
+ pil_image = Image.open(img_path).convert("RGBA")
139
+ w, h = pil_image.size
140
+ if not is_small(w, h):
141
+ logs.append(f"{log_prefix}Skipping upscale for large View (size={w}×{h}).")
142
+ return pil_image
143
+
144
+ # If super-resolution is provided, use it
145
+ if sr:
146
+ image_cv = cv2.cvtColor(np.array(pil_image), cv2.COLOR_RGBA2BGR)
147
+ upscaled = sr.upsample(image_cv)
148
+ return Image.fromarray(cv2.cvtColor(upscaled, cv2.COLOR_BGR2RGBA))
149
+ else:
150
+ return pil_image.resize((w * 4, h * 4), resample=Image.BICUBIC)
151
+ else:
152
+ # For other classes, load the image with OpenCV (BGR)
153
+ cv_image = cv2.imread(img_path)
154
+ if cv_image is None or cv_image.size == 0:
155
+ logs.append(f"{log_prefix}Empty image at {img_path}, skipping.")
156
+ return None
157
+
158
+ h, w = cv_image.shape[:2]
159
+ if not is_small(w, h):
160
+ logs.append(f"{log_prefix}Skipping upscale for large region (size={w}×{h}).")
161
+ return cv_image
162
+
163
+ if sr:
164
+ return sr.upsample(cv_image)
165
+ else:
166
+ return cv2.resize(cv_image, (w * 2, h * 2), interpolation=cv2.INTER_CUBIC)
167
+
168
+ if json_mini:
169
+ simplified_class_name = class_name.lower().replace('imageview', 'image')
170
+ new_id = f"{simplified_class_name}_{region_idx}"
171
+ mini_region_dict = {
172
+ "id": new_id,
173
+ "bbox": NoIndent([x_center, y_center, width, height])
174
+ }
175
+
176
+ # only need to process text for the mini format
177
+ if class_name == 'Text' and not skip_ocr:
178
+ cropped_image_region = image[y_min:y_max, x_min:x_max]
179
+ if cropped_image_region.size > 0:
180
+ # Save the cropped image so open_and_upscale_image can use it
181
+ cropped_path = os.path.join(cropped_imageview_images_dir, f"region_{region_idx}_class_{class_id}.jpg")
182
+ cv2.imwrite(cropped_path, cropped_image_region)
183
+
184
+ upscaled = open_and_upscale_image(cropped_path, class_id)
185
+ if upscaled is not None:
186
+ if isinstance(upscaled, Image.Image):
187
+ upscaled_cv = cv2.cvtColor(np.array(upscaled), cv2.COLOR_RGBA2BGR)
188
+ else:
189
+ upscaled_cv = upscaled
190
+
191
+ gray = cv2.cvtColor(downscale_for_ocr(upscaled_cv), cv2.COLOR_BGR2GRAY)
192
+ text = ' '.join(reader.readtext(gray, detail=0, batch_size=8)).strip()
193
+
194
+ if text:
195
+ if not skip_spell and spell:
196
+ corrected_words = []
197
+ for w in text.split():
198
+ corrected_words.append(spell.correction(w) or w)
199
+ mini_region_dict["text"] = " ".join(corrected_words)
200
+ else:
201
+ mini_region_dict["text"] = text
202
+
203
+ # Clean up the temporary cropped image
204
+ if os.path.exists(cropped_path) and not save_images:
205
+ os.remove(cropped_path)
206
+
207
+ return {"mini_region_dict": mini_region_dict, "text_log": ""}
208
+
209
+ logs.append(f"\n{log_prefix}Region {region_idx} - Class ID: {class_id} ({class_name})")
210
+ x_center = (x_min + x_max) // 2
211
+ y_center = (y_min + y_max) // 2
212
+ logs.append(f"{log_prefix}Coordinates: x_center={x_center}, y_center={y_center}")
213
+ width = x_max - x_min
214
+ height = y_max - y_min
215
+ logs.append(f"{log_prefix}Size: width={width}, height={height}")
216
+
217
+ region_dict = {
218
+ "id": f"region_{region_idx}_class_{class_name}",
219
+ "x_coordinates_center": x_center,
220
+ "y_coordinates_center": y_center,
221
+ "width": width,
222
+ "height": height
223
+ }
224
+
225
+ # Crop region
226
+ cropped_image_region = image[y_min:y_max, x_min:x_max]
227
+ if cropped_image_region.size == 0:
228
+ logs.append(f"{log_prefix}Empty crop for Region {region_idx}, skipping...")
229
+ return {"region_dict": region_dict, "text_log": "\n".join(logs)}
230
+
231
+ # Save cropped region
232
+ if class_id == 0:
233
+ # Save as PNG if it's a View
234
+ cropped_path = os.path.join(
235
+ cropped_imageview_images_dir, f"region_{region_idx}_class_{class_id}.png"
236
+ )
237
+ cv2.imwrite(cropped_path, cropped_image_region)
238
+ else:
239
+ # Save as JPG
240
+ cropped_path = os.path.join(
241
+ cropped_imageview_images_dir, f"region_{region_idx}_class_{class_id}.jpg"
242
+ )
243
+ cv2.imwrite(cropped_path, cropped_image_region)
244
+
245
+ # for LLaMA (ollama)
246
+ def call_ollama(prompt_text, rid, task_type):
247
+ model_name = "llama3.2-vision:11b"
248
+ cmd = ["ollama", "run", model_name, prompt_text]
249
+ try:
250
+ result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
251
+ if result.returncode != 0:
252
+ logs.append(f"{log_prefix}Error generating {task_type} for Region {rid}: {result.stderr}")
253
+ return None
254
+ else:
255
+ response = result.stdout.strip()
256
+ logs.append(f"{log_prefix}Generated {task_type.capitalize()} for Region {rid}: {response}")
257
+ return response
258
+ except Exception as e:
259
+ logs.append(f"{log_prefix}An error occurred while generating {task_type} for Region {rid}: {e}")
260
+ return None
261
+
262
+ # for BLIP-2
263
+ def generate_caption_blip(img_path):
264
+ pil_image = Image.open(img_path).convert('RGB')
265
+ inputs = processor(images=pil_image, return_tensors="pt").to(device, torch.float16)
266
+ gen_ids = model.generate(**inputs)
267
+ return processor.batch_decode(gen_ids, skip_special_tokens=True)[0].strip()
268
+
269
+ # Handle each class type
270
+ if class_id == 1: # ImageView
271
+ if no_captioning:
272
+ logs.append(f"{log_prefix}(Icon-image detection + captioning disabled by --no-captioning.)")
273
+ if not output_json:
274
+ block = (
275
+ f"Image: region_{region_idx}_class_{class_id} ({class_name})\n"
276
+ f"Coordinates: x_center={(x_min + x_max) // 2}, y_center={(y_min + y_max) // 2}\n"
277
+ f"Size: width={width}, height={height}\n"
278
+ f"{BARRIER}"
279
+ )
280
+ logs.append(block)
281
+ else:
282
+ upscaled = open_and_upscale_image(cropped_path, class_id)
283
+ if upscaled is None:
284
+ return {"region_dict": region_dict, "text_log": "\n".join(logs)}
285
+
286
+ # Icon detection
287
+ if icon_model:
288
+ icon_input_size = (224, 224)
289
+ if isinstance(upscaled, Image.Image):
290
+ upscaled_cv = cv2.cvtColor(np.array(upscaled), cv2.COLOR_RGBA2BGR)
291
+ else:
292
+ upscaled_cv = upscaled
293
+ resized = cv2.resize(upscaled_cv, icon_input_size)
294
+ rgb_img = cv2.cvtColor(resized, cv2.COLOR_BGR2RGB) / 255.0
295
+ rgb_img = np.expand_dims(rgb_img, axis=0)
296
+ pred = icon_model.predict(rgb_img)
297
+ logs.append(f"{log_prefix}Prediction output for Region {region_idx}: {pred}")
298
+ if pred.shape == (1, 1):
299
+ probability = pred[0][0]
300
+ threshold = 0.5
301
+ predicted_class = 1 if probability >= threshold else 0
302
+ logs.append(f"{log_prefix}Probability of class 1: {probability}")
303
+ elif pred.shape == (1, 2):
304
+ predicted_class = np.argmax(pred[0])
305
+ logs.append(f"{log_prefix}Class probabilities: {pred[0]}")
306
+ else:
307
+ logs.append(f"{log_prefix}Unexpected prediction shape: {pred.shape}")
308
+ return {"region_dict": region_dict, "text_log": "\n".join(logs)}
309
+
310
+ pred_text = "Icon/Mobile UI Element" if predicted_class == 1 else "Normal Image"
311
+ region_dict["prediction"] = pred_text
312
+ if predicted_class == 1:
313
+ prompt_text = "Describe the mobile UI element on this image. Keep it short."
314
+ else:
315
+ prompt_text = "Describe what is in the image briefly. It's not an icon or typical UI element."
316
+ else:
317
+ logs.append(f"{log_prefix}Icon detection model not provided; skipping icon detection.")
318
+ region_dict["prediction"] = "Icon detection skipped"
319
+ prompt_text = "Describe what is in this image briefly."
320
+
321
+ # Caption
322
+ temp_image_path = os.path.abspath(
323
+ os.path.join(cropped_imageview_images_dir, f"imageview_{region_idx}.jpg")
324
+ )
325
+ if isinstance(upscaled, Image.Image): # TODO check optimization
326
+ upscaled.save(temp_image_path)
327
+ else:
328
+ cv2.imwrite(temp_image_path, upscaled)
329
+
330
+ response = ""
331
+ if model and processor and model_to_use == 'blip':
332
+ response = generate_caption_blip(temp_image_path)
333
+ else: # TODO check optimization
334
+ resp = call_ollama(prompt_text + " " + temp_image_path, region_idx, "description")
335
+ response = resp if resp else "Error generating description"
336
+
337
+ region_dict["description"] = response
338
+
339
+ if not output_json:
340
+ block = (
341
+ f"Image: region_{region_idx}_class_{class_id} ({class_name})\n"
342
+ f"Coordinates: x_center={(x_min + x_max) // 2}, y_center={(y_min + y_max) // 2}\n"
343
+ f"Size: width={width}, height={height}\n"
344
+ f"Prediction: {region_dict['prediction']}\n"
345
+ f"{response}\n"
346
+ f"{BARRIER}"
347
+ )
348
+ logs.append(block)
349
+
350
+ if os.path.exists(temp_image_path) and not save_images:
351
+ os.remove(temp_image_path)
352
+
353
+ elif class_id == 2: # Text
354
+ if skip_ocr or reader is None:
355
+ logs.append(f"{log_prefix}OCR skipped for Region {region_idx}.")
356
+
357
+ if not output_json:
358
+ block = (
359
+ f"Text: region_{region_idx}_class_{class_id} ({class_name})\n"
360
+ f"Coordinates: x_center={(x_min + x_max) // 2}, "
361
+ f"y_center={(y_min + y_max) // 2}\n"
362
+ f"Size: width={width}, height={height}\n"
363
+ f"OCR + spell-check disabled\n"
364
+ f"{BARRIER}"
365
+ )
366
+ logs.append(block)
367
+
368
+ return {"region_dict": region_dict, "text_log": "\n".join(logs)}
369
+
370
+ upscaled = open_and_upscale_image(cropped_path, class_id)
371
+ if upscaled is None:
372
+ return {"region_dict": region_dict, "text_log": "\n".join(logs)}
373
+
374
+ if isinstance(upscaled, Image.Image):
375
+ upscaled_cv = cv2.cvtColor(np.array(upscaled), cv2.COLOR_RGBA2BGR)
376
+ else:
377
+ upscaled_cv = upscaled
378
+
379
+
380
+ # TODO use other lib to improve the performance
381
+ upscaled_cv = downscale_for_ocr(upscaled_cv, max_dim=600)
382
+ gray = cv2.cvtColor(upscaled_cv, cv2.COLOR_BGR2GRAY)
383
+ result_ocr = reader.readtext(gray, detail=0, batch_size=8)
384
+ text = ' '.join(result_ocr).strip()
385
+
386
+ # TODO use other lib to improve performance
387
+ if skip_spell or spell is None:
388
+ corrected_text = None
389
+ logs.append(f"{log_prefix}Spell-check skipped for Region {region_idx}.")
390
+ else:
391
+ correction_cache = {}
392
+ corrected_words = []
393
+ for w in text.split():
394
+ if w not in correction_cache:
395
+ correction_cache[w] = spell.correction(w) or w
396
+ corrected_words.append(correction_cache[w])
397
+ corrected_text = " ".join(corrected_words)
398
+
399
+
400
+ logs.append(f"{log_prefix}Extracted Text for Region {region_idx}: {text}")
401
+ if corrected_text is not None:
402
+ logs.append(f"{log_prefix}Corrected Text for Region {region_idx}: {corrected_text}")
403
+
404
+ region_dict["extractedText"] = text
405
+ if corrected_text is not None:
406
+ region_dict["correctedText"] = corrected_text
407
+
408
+ if not output_json:
409
+ block = (
410
+ f"Text: region_{region_idx}_class_{class_id} ({class_name})\n"
411
+ f"Coordinates: x_center={(x_min + x_max) // 2}, y_center={(y_min + y_max) // 2}\n"
412
+ f"Size: width={width}, height={height}\n"
413
+ f"Extracted Text: {text}\n"
414
+ + (f"Corrected Text: {corrected_text}\n" if corrected_text is not None else "")
415
+ + f"{BARRIER}"
416
+ )
417
+ logs.append(block)
418
+
419
+ elif class_id == 0: # View
420
+ upscaled = open_and_upscale_image(cropped_path, class_id)
421
+ if upscaled is None:
422
+ return {"region_dict": region_dict, "text_log": "\n".join(logs)}
423
+
424
+ data = np.array(upscaled)
425
+ if data.ndim == 2:
426
+ data = cv2.cvtColor(data, cv2.COLOR_GRAY2BGRA)
427
+ elif data.shape[-1] == 3:
428
+ b, g, r = cv2.split(data)
429
+ a = np.full_like(b, 255)
430
+ data = cv2.merge((b, g, r, a))
431
+
432
+ pixels = data.reshape((-1, 4))
433
+ opaque_pixels = pixels[pixels[:, 3] > 0]
434
+
435
+ if len(opaque_pixels) == 0:
436
+ logs.append(f"{log_prefix}No opaque pixels found in Region {region_idx}, cannot determine background color.")
437
+ color_name = "Unknown"
438
+ else:
439
+ dom_color = get_most_frequent_color(opaque_pixels[:, :3], bin_size=10)
440
+ exact_name, closest_name = get_colour_name(dom_color)
441
+ color_name = exact_name if exact_name else closest_name
442
+
443
+ alphas = pixels[:, 3]
444
+ dominant_alpha = get_most_frequent_alpha(alphas, bin_size=10)
445
+ transparency = "opaque" if dominant_alpha >= 245 else "transparent"
446
+
447
+ response = (
448
+ f"1. The background color of the container is {color_name}.\n"
449
+ f"2. The container is {transparency}."
450
+ )
451
+ logs.append(f"{log_prefix}{response}")
452
+ region_dict["view_color"] = f"The background color of the container is {color_name}."
453
+ region_dict["view_alpha"] = f"The container is {transparency}."
454
+
455
+ if not output_json:
456
+ block = (
457
+ f"View: region_{region_idx}_class_{class_id} ({class_name})\n"
458
+ f"Coordinates: x_center={(x_min + x_max) // 2}, y_center={(y_min + y_max) // 2}\n"
459
+ f"Size: width={width}, height={height}\n"
460
+ f"{response}\n"
461
+ f"{BARRIER}"
462
+ )
463
+ logs.append(block)
464
+
465
+ elif class_id == 3: # Line
466
+ logs.append(f"{log_prefix}Processing Line in Region {region_idx}")
467
+ line_img = cv2.imread(cropped_path, cv2.IMREAD_UNCHANGED)
468
+ if line_img is None:
469
+ logs.append(f"{log_prefix}Failed to read image at {cropped_path}")
470
+ return {"region_dict": region_dict, "text_log": "\n".join(logs)}
471
+
472
+ hh, ww = line_img.shape[:2]
473
+ logs.append(f"{log_prefix}Image dimensions: width={ww}, height={hh}")
474
+
475
+ data = np.array(line_img)
476
+ if data.ndim == 2:
477
+ data = cv2.cvtColor(data, cv2.COLOR_GRAY2BGRA)
478
+ elif data.shape[-1] == 3:
479
+ b, g, r = cv2.split(data)
480
+ a = np.full_like(b, 255)
481
+ data = cv2.merge((b, g, r, a))
482
+
483
+ pixels = data.reshape((-1, 4))
484
+ opaque_pixels = pixels[pixels[:, 3] > 0]
485
+
486
+ if len(opaque_pixels) == 0:
487
+ logs.append(f"{log_prefix}No opaque pixels found in Region {region_idx}, cannot determine line color.")
488
+ color_name = "Unknown"
489
+ else:
490
+ dom_color = get_most_frequent_color(opaque_pixels[:, :3], bin_size=10)
491
+ exact_name, closest_name = get_colour_name(dom_color)
492
+ color_name = exact_name if exact_name else closest_name
493
+
494
+ alphas = pixels[:, 3]
495
+ dom_alpha = get_most_frequent_alpha(alphas, bin_size=10)
496
+ transparency = "opaque" if dom_alpha >= 245 else "transparent"
497
+
498
+ response = (
499
+ f"1. The color of the line is {color_name}.\n"
500
+ f"2. The line is {transparency}."
501
+ )
502
+ logs.append(f"{log_prefix}{response}")
503
+ region_dict["line_color"] = f"The color of the line is {color_name}."
504
+ region_dict["line_alpha"] = f"The line is {transparency}."
505
+
506
+ if not output_json:
507
+ block = (
508
+ f"Line: region_{region_idx}_class_{class_id} ({class_name})\n"
509
+ f"Coordinates: x_center={(x_min + x_max) // 2}, y_center={(y_min + y_max) // 2}\n"
510
+ f"Size: width={width}, height={height}\n"
511
+ f"{response}\n"
512
+ f"{BARRIER}"
513
+ )
514
+ logs.append(block)
515
+
516
+ else:
517
+ logs.append(f"{log_prefix}Class ID {class_id} not handled.")
518
+
519
+ # Remove intermediate if not saving
520
+ if os.path.exists(cropped_path) and not save_images:
521
+ os.remove(cropped_path)
522
+
523
+ return {
524
+ "region_dict": region_dict,
525
+ "text_log": "\n".join(logs),
526
+ }
527
+
528
+
529
+ # Main function
530
+ def process_image(
531
+ input_image_path,
532
+ yolo_output_path,
533
+ output_dir:str = '.',
534
+ model_to_use='llama',
535
+ save_images=False,
536
+ icon_model_path=None,
537
+ cache_directory='./models_cache',
538
+ huggingface_token='your_token', # for blip2
539
+ no_captioning=False,
540
+ output_json=False,
541
+ json_mini=False,
542
+ sr=None,
543
+ reader=None,
544
+ spell=None,
545
+ skip_ocr=False,
546
+ skip_spell=False
547
+ ):
548
+ if json_mini:
549
+ json_output = {
550
+ "image_size": None, # Will be populated later
551
+ "bbox_format": "center_x, center_y, width, height",
552
+ "elements": []
553
+ }
554
+ elif output_json:
555
+ json_output = {
556
+ "image": {"path": input_image_path, "size": {"width": None, "height": None}},
557
+ "elements": []
558
+ }
559
+ else:
560
+ json_output = None
561
+
562
+
563
+ start_time = time.perf_counter()
564
+ print("super-resolution initialization start (in script.py)")
565
+ # Super-resolution initialization
566
+ if sr is None:
567
+ print("No sr reference passed; performing local init ...")
568
+ model_path = 'EDSR_x4.pb'
569
+ if hasattr(cv2, 'dnn_superres'):
570
+ print("dnn_superres module is available.")
571
+ import cv2.dnn_superres as dnn_superres
572
+ try:
573
+ sr = cv2.dnn_superres.DnnSuperResImpl_create()
574
+ print("Using DnnSuperResImpl_create()")
575
+ except AttributeError:
576
+ sr = cv2.dnn_superres.DnnSuperResImpl()
577
+ print("Using DnnSuperResImpl()")
578
+ sr.readModel(model_path)
579
+ sr.setModel('edsr', 4)
580
+ else:
581
+ print("dnn_superres module is NOT available; skipping super-resolution.")
582
+ else:
583
+ print("Using pre-initialized sr reference.")
584
+
585
+
586
+ elapsed = time.perf_counter() - start_time
587
+ print(f"super-resolution init (in script.py) took {elapsed:.3f} seconds.")
588
+
589
+ start_time = time.perf_counter()
590
+
591
+ if skip_ocr:
592
+ print("skip_ocr flag set - skipping EasyOCR and SpellChecker.")
593
+ reader = None
594
+ spell = None
595
+
596
+ else:
597
+ print("OCR initialisation start (in script.py)")
598
+ if reader is None:
599
+ print("No EasyOCR reference passed; performing local init")
600
+ reader = easyocr.Reader(['en'], gpu=True)
601
+ else:
602
+ print("Using pre-initialised EasyOCR object.")
603
+
604
+ if skip_spell:
605
+ print("skip_spell flag set - not initialising SpellChecker.")
606
+ spell = None
607
+ else:
608
+ if spell is None:
609
+ print("No SpellChecker reference passed; performing local init")
610
+ spell = SpellChecker()
611
+ else:
612
+ print("Using pre-initialised SpellChecker object.")
613
+
614
+ elapsed = time.perf_counter() - start_time
615
+ print(f"OCR init (in script.py) took {elapsed:.3f} seconds.")
616
+
617
+
618
+ start_time = time.perf_counter()
619
+ print("icon-model init start (in script.py)")
620
+ # Load icon detection model (if provided)
621
+ if icon_model_path:
622
+ icon_model = tf.keras.models.load_model(icon_model_path)
623
+ print(f"Icon detection model loaded: {icon_model_path}")
624
+ else:
625
+ icon_model = None
626
+
627
+ elapsed = time.perf_counter() - start_time
628
+ print(f"icon-model init (in script.py) took {elapsed:.3f} seconds.")
629
+
630
+
631
+ # Load the original image
632
+ image = cv2.imread(input_image_path, cv2.IMREAD_UNCHANGED)
633
+ if image is None:
634
+ print(f"Image at {input_image_path} could not be loaded.")
635
+ return
636
+
637
+ image_height, image_width = image.shape[:2]
638
+
639
+
640
+ # Read YOLO labels
641
+ with open(yolo_output_path, 'r') as f:
642
+ lines = f.readlines()
643
+
644
+ # Check torch device
645
+ if torch.backends.mps.is_available():
646
+ device = torch.device("mps")
647
+ print("Using MPS")
648
+ elif torch.cuda.is_available():
649
+ device = torch.device("cuda")
650
+ print("Using CUDA")
651
+ else:
652
+ device = torch.device("cpu")
653
+ print("Using CPU")
654
+
655
+ # Conditionally load the captioning model
656
+ processor, model = None, None
657
+ if not no_captioning:
658
+ if model_to_use == 'blip':
659
+ print("Loading BLIP-2 model...")
660
+ blip_model_name = "Salesforce/blip2-opt-2.7b"
661
+ if not is_model_downloaded(blip_model_name, cache_directory):
662
+ print("Model not found in cache. Downloading...")
663
+ else:
664
+ print("Model found in cache. Loading...")
665
+ processor = AutoProcessor.from_pretrained(
666
+ blip_model_name,
667
+ use_auth_token=huggingface_token,
668
+ cache_dir=cache_directory,
669
+ resume_download=True
670
+ )
671
+ model = Blip2ForConditionalGeneration.from_pretrained(
672
+ blip_model_name,
673
+ device_map='auto',
674
+ torch_dtype=torch.float16,
675
+ use_auth_token=huggingface_token,
676
+ cache_dir=cache_directory,
677
+ resume_download=True
678
+ ).to(device)
679
+ else:
680
+ print("Using LLaMA model via external call (ollama).")
681
+ else:
682
+ print("--no-captioning flag is set; skipping model loading.")
683
+
684
+ # Prepare bounding boxes from YOLO
685
+ bounding_boxes = []
686
+ for line in lines:
687
+ parts = line.strip().split()
688
+ class_id = int(parts[0])
689
+ x_center_norm, y_center_norm, width_norm, height_norm = map(float, parts[1:])
690
+ x_center = x_center_norm * image_width
691
+ y_center = y_center_norm * image_height
692
+ box_width = width_norm * image_width
693
+ box_height = height_norm * image_height
694
+ x_min = int(x_center - box_width / 2)
695
+ y_min = int(y_center - box_height / 2)
696
+ x_max = int(x_center + box_width / 2)
697
+ y_max = int(y_center + box_height / 2)
698
+ x_min = max(0, x_min)
699
+ y_min = max(0, y_min)
700
+ x_max = min(image_width - 1, x_max)
701
+ y_max = min(image_height - 1, y_max)
702
+ bounding_boxes.append((x_min, y_min, x_max, y_max, class_id))
703
+
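+ # Worked example (illustrative): the label line "2 0.5 0.1 0.25 0.05" on a
+ # 1080x2400 image is class 2 (Text) with centre (540, 240) and a 270x120 box,
+ # giving clamped corners (405, 180)..(675, 300).
+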
704
+
705
+ # Create output dirs
706
+ cropped_dir = os.path.join(output_dir, "cropped_imageview_images")
707
+ os.makedirs(cropped_dir, exist_ok=True)
708
+ result_dir = os.path.join(output_dir, "result")
709
+ os.makedirs(result_dir, exist_ok=True)
710
+
711
+ base_name = os.path.splitext(os.path.basename(input_image_path))[0]
712
+ captions_file_path = None
713
+ if json_mini:
714
+ json_output["image_size"] = NoIndent([image_width, image_height])
715
+ elif output_json:
716
+ json_output["image"]["size"]["width"] = image_width
717
+ json_output["image"]["size"]["height"] = image_height
718
+ else: # Text output
719
+ captions_filename = f"{base_name}_regions_captions.txt"
720
+ captions_file_path = os.path.join(result_dir, captions_filename)
721
+ with open(captions_file_path, 'w', encoding='utf-8') as f:
722
+ f.write(f"Image path: {input_image_path}\n")
723
+ f.write(f"Image Size: width={image_width}, height={image_height}\n")
724
+ f.write(BARRIER)
725
+
726
+ # The number of workers can be increased if the hardware allows, but that needs testing.
727
+ start_time = time.perf_counter()
728
+ print("Process single region start (in script.py)")
729
+
730
+ with ThreadPoolExecutor(max_workers=1) as executor:
731
+ futures = [
732
+ executor.submit(
733
+ process_single_region,
734
+ idx, box, image, sr, reader, spell,
735
+ icon_model, processor, model, (model and device),  # device is only needed when a captioning model is loaded
736
+ no_captioning, output_json, json_mini,
737
+ cropped_dir, base_name, save_images,
738
+ model_to_use, log_prefix="",
739
+ skip_ocr=skip_ocr,
740
+ skip_spell=skip_spell
741
+ ) for idx, box in enumerate(bounding_boxes)
742
+ ]
743
+
744
+ for future in as_completed(futures):
745
+ item = future.result()
746
+ if json_mini:
747
+ if item.get("mini_region_dict"):
748
+ json_output["elements"].append(item["mini_region_dict"])
749
+ elif output_json:
750
+ if item.get("region_dict"):
751
+ json_output["elements"].append(item["region_dict"])
752
+ else: # Text output
753
+ if item.get("text_log") and captions_file_path:
754
+ with open(captions_file_path, 'a', encoding='utf-8') as f:
755
+ f.write(item["text_log"])
756
+
757
+ elapsed = time.perf_counter() - start_time
758
+ print(f"Processing regions took {elapsed:.3f} seconds.")
759
+
760
+ if json_mini or output_json:
761
+ json_file = os.path.join(result_dir, f"{base_name}.json")
762
+ with open(json_file, 'w', encoding='utf-8') as f:
763
+ json.dump(json_output, f, indent=2, ensure_ascii=False, cls=CustomEncoder)
764
+
765
+ output_type = "mini JSON" if json_mini else "JSON"
766
+ print(f"{output_type} output written to {json_file}")
767
+ else:
768
+ print(f"Text output written to {captions_file_path}")
769
+
770
+
771
+ # CLI entry point
772
+ if __name__ == "__main__":
773
+ parser = argparse.ArgumentParser(description='Process an image and its YOLO labels.')
774
+ parser.add_argument('input_image', help='Path to the input YOLO image.')
775
+ parser.add_argument('input_labels', help='Path to the input YOLO labels file.')
776
+ parser.add_argument('--output_dir', default='.',
777
+ help='Directory to save output files. Defaults to the current directory.')
778
+ parser.add_argument('--model_to_use', choices=['llama', 'blip'], default='llama',
779
+ help='Model for captioning (llama or blip).')
780
+ parser.add_argument('--save_images', action='store_true',
781
+ help='Flag to save intermediate images.')
782
+ parser.add_argument('--icon_detection_path', help='Path to icon detection model.')
783
+ parser.add_argument('--cache_directory', default='./models_cache',
784
+ help='Cache directory for Hugging Face models.')
785
+ parser.add_argument('--huggingface_token', default='your_token',
786
+ help='Hugging Face token for model downloads.')
787
+ parser.add_argument('--no-captioning', action='store_true',
788
+ help='Disable any image captioning.')
789
+ parser.add_argument('--json', dest='output_json', action='store_true',
790
+ help='Output the image data in JSON format')
791
+ parser.add_argument('--json-mini', action='store_true',
792
+ help='Output the image data in a condensed JSON format')
793
+ args = parser.parse_args()
794
+
795
+ process_image(
796
+ input_image_path=args.input_image,
797
+ yolo_output_path=args.input_labels,
798
+ output_dir=args.output_dir,
799
+ model_to_use=args.model_to_use,
800
+ save_images=args.save_images,
801
+ icon_model_path=args.icon_detection_path,
802
+ cache_directory=args.cache_directory,
803
+ huggingface_token=args.huggingface_token,
804
+ no_captioning=args.no_captioning,
805
+ output_json=args.output_json,
806
+ json_mini=args.json_mini
807
+ )
808
+
utils/json_helpers.py ADDED
@@ -0,0 +1,18 @@
1
+ import json
2
+
3
+ class NoIndent:
4
+ """ Wrapper class to mark lists that should not be indented """
5
+ def __init__(self, value):
6
+ self.value = value
7
+
8
+ class CustomEncoder(json.JSONEncoder):
9
+ """
10
+ Custom JSON encoder that handles the NoIndent class to produce
11
+ a compact string representation of the list
12
+ """
13
+ def default(self, obj):
14
+ if isinstance(obj, NoIndent):
15
+ # Return the value formatted as a compact string, without newlines
16
+ return json.dumps(obj.value, separators=(',',':'))
17
+ return super().default(obj)
18
+
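+ # Example (illustrative): NoIndent keeps short lists on one line, so
+ # json.dumps({"bbox": NoIndent([10, 20, 30, 40])}, indent=2, cls=CustomEncoder)
+ # emits the bbox value as the compact string "[10,20,30,40]".
+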
utils/pills.py ADDED
@@ -0,0 +1,44 @@
1
+ import cv2
2
+ import base64
3
+ import numpy as np
4
+ from typing import Tuple
5
+
6
+ def preprocess_image(image_bytes: bytes, threshold: int = 1000, scale: float = 0.5, fmt: str = "png") -> Tuple[bytes, str]:
7
+ """
8
+ Preprocesses the image by checking its dimensions and downscaling it if needed.
9
+
10
+ Parameters:
11
+ - image_bytes: Raw image bytes.
12
+ - threshold: Maximum allowed width or height (in pixels). If either dimension exceeds this,
13
+ the image will be downscaled.
14
+ - scale: Scale factor to use for resizing if the image is too large.
15
+ - fmt: Format for re-encoding the image (e.g., "png" or "jpg").
16
+
17
+ Returns:
18
+ - A tuple (new_image_bytes, new_base64_str) where:
19
+ new_image_bytes: The re-encoded image bytes after potential downscaling.
20
+ new_base64_str: The base64 string (without header) of the new image bytes.
21
+ """
22
+ # Convert raw bytes to a NumPy array then decode with OpenCV
23
+ nparr = np.frombuffer(image_bytes, np.uint8)
24
+ cv_image = cv2.imdecode(nparr, cv2.IMREAD_UNCHANGED)
25
+
26
+ if cv_image is None:
27
+ raise ValueError("Failed to decode image with OpenCV.")
28
+
29
+ h, w = cv_image.shape[:2]
30
+
31
+ # If either dimension is greater than threshold, resize the image.
32
+ if h > threshold or w > threshold:
33
+ new_w, new_h = int(w * scale), int(h * scale)
34
+ cv_image = cv2.resize(cv_image, (new_w, new_h), interpolation=cv2.INTER_AREA)
35
+
36
+ ret, buf = cv2.imencode(f'.{fmt}', cv_image)
37
+
38
+ if not ret:
39
+ raise ValueError("Failed to re-encode image.")
40
+
41
+ new_image_bytes = buf.tobytes()
42
+ new_base64_str = base64.b64encode(new_image_bytes).decode("utf-8")
43
+
44
+ return new_image_bytes, new_base64_str
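+
+ # Minimal usage sketch (assumes "screenshot.png" exists):
+ # with open("screenshot.png", "rb") as f:
+ #     raw = f.read()
+ # small_bytes, small_b64 = preprocess_image(raw, threshold=2000, scale=0.5, fmt="png")
+ # data_uri = f"data:image/png;base64,{small_b64}"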
wrapper.py ADDED
@@ -0,0 +1,140 @@
1
+ import os
2
+ import sys
3
+ import argparse
4
+ from ultralytics import YOLO
5
+ from os.path import basename, splitext
6
+ import time
7
+
8
+ from yolo_script import process_yolo
9
+
10
+ from script import process_image
11
+
12
+ def process_image_description(
13
+ input_image: str,
14
+ weights_file: str,
15
+ output_dir: str,
16
+ model_to_use: str = 'llama',
17
+ save_images: bool = False,
18
+ icon_detection_path: str = None,
19
+ cache_directory: str = './models_cache',
20
+ huggingface_token: str = 'your_token',
21
+ no_captioning: bool = False,
22
+ output_json: bool = False,
23
+ json_mini: bool = False,
24
+ model_obj: YOLO = None,
25
+ sr=None,
26
+ reader=None,
27
+ spell=None,
28
+ skip_ocr=False,
29
+ skip_spell=False,
30
+ ) -> None:
31
+ """
32
+ Processes an image by running YOLO detection (via the imported process_yolo function)
33
+ and then calling process_image() from script.py to do the image description work.
34
+
35
+ Parameters:
36
+ - input_image: Path to the input image.
37
+ - weights_file: Path to the YOLO weights file.
38
+ - output_dir: Directory for YOLO output
39
+ - model_to_use: Which model to use for captioning ('llama' or 'blip').
40
+ - save_images: Whether to save intermediate images.
41
+ - icon_detection_path: Optional path to an icon detection model.
42
+ - cache_directory: Cache directory for models.
43
+ - huggingface_token: Hugging Face token for model downloads.
44
+ - no_captioning: If True, disable image captioning.
45
+ - output_json: If True, output the results in JSON format.
46
+ - json_mini: same as output_json but has more compact json output.
47
+ - model_obj: YOLO object that was initialized at a startup time (optional)
48
+ - sr: Super resolution object (optional)
49
+ - reader: EasyOCR object (optional)
50
+ - spell: Spell checker object (optional)
51
+ """
52
+
53
+ base_name = splitext(basename(input_image))[0]
54
+
55
+ process_yolo(input_image, weights_file, output_dir, model_obj=model_obj)
56
+
57
+ labels_dir = os.path.join(output_dir, 'labels')
58
+ label_file = os.path.join(labels_dir, base_name + '.txt')
59
+
60
+ if not os.path.isfile(label_file):
61
+ raise FileNotFoundError(f"Labels file not found at expected path: {label_file}")
62
+
63
+ process_image(
64
+ input_image_path=input_image,
65
+ yolo_output_path=label_file,
66
+ output_dir=output_dir,
67
+ model_to_use=model_to_use,
68
+ save_images=save_images,
69
+ icon_model_path=icon_detection_path,
70
+ cache_directory=cache_directory,
71
+ huggingface_token=huggingface_token,
72
+ no_captioning=no_captioning,
73
+ output_json=output_json,
74
+ json_mini=json_mini,
75
+ sr=sr,
76
+ reader=reader,
77
+ spell=spell,
78
+ skip_ocr=skip_ocr,
79
+ skip_spell=skip_spell,
80
+ )
81
+
82
+ if __name__ == '__main__':
83
+ parser = argparse.ArgumentParser(
84
+ description='Wrapper script to run YOLO detection and image description in sequence.'
85
+ )
86
+ parser.add_argument('--input_image', required=True, help='Path to the input image.')
87
+ parser.add_argument('--weights_file', required=True, help='Path to the YOLO weights file.')
88
+ parser.add_argument('--output_dir', default='./output', help='Output directory for YOLO results.')
89
+ parser.add_argument('--model_to_use', choices=['llama', 'blip'], default='llama',
90
+ help='Model for captioning.')
91
+ parser.add_argument('--save_images', action='store_true',
92
+ help='Flag to save intermediate images.')
93
+ parser.add_argument('--icon_detection_path', help='Path to the icon detection model.')
94
+ parser.add_argument('--cache_directory', default='./models_cache',
95
+ help='Cache directory for models.')
96
+ parser.add_argument('--huggingface_token', default='your_token',
97
+ help='Hugging Face token for model downloads.')
98
+ parser.add_argument('--no-captioning', action='store_true',
99
+ help='Disable any image captioning')
100
+ parser.add_argument('--json', dest='output_json', action='store_true',
101
+ help='Output the image data in JSON format')
102
+ parser.add_argument('--json-mini', action='store_true',
103
+ help='JSON output in a more condensed format')
104
+ parser.add_argument('--skip-ocr', action='store_true',
105
+ help='Disable OCR & spell-checking (faster).')
106
+ parser.add_argument('--skip-spell', action='store_true', help='Run OCR but skip spell-check')
107
+
108
+ args = parser.parse_args()
109
+
110
+ try:
111
+ print("Running YOLO detection...")
112
+ yolo_output_dir = args.output_dir
113
+ os.makedirs(yolo_output_dir, exist_ok=True)
114
+ process_yolo(args.input_image, args.weights_file, yolo_output_dir)
115
+
116
+ base_name = splitext(basename(args.input_image))[0]
117
+ labels_dir = os.path.join(yolo_output_dir, 'labels')
118
+ label_file = os.path.join(labels_dir, base_name + '.txt')
119
+ if not os.path.isfile(label_file):
120
+ raise FileNotFoundError(f"Labels file not found: {label_file}")
121
+
122
+ print("Running image description...")
123
+ process_image(
124
+ input_image_path=args.input_image,
125
+ yolo_output_path=label_file,
126
+ model_to_use=args.model_to_use,
127
+ save_images=args.save_images,
128
+ icon_model_path=args.icon_detection_path,
129
+ cache_directory=args.cache_directory,
130
+ huggingface_token=args.huggingface_token,
131
+ no_captioning=args.no_captioning,
132
+ output_json=args.output_json,
133
+ json_mini=args.json_mini,
134
+ skip_ocr=args.skip_ocr,
135
+ skip_spell=args.skip_spell
136
+ )
137
+ except Exception as e:
138
+ print(e)
139
+ sys.exit(1)
140
+
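A usage sketch for the wrapper, as a library call and from the CLI (the input path is a placeholder; best.pt is the YOLO weights file added in this commit):

    from wrapper import process_image_description

    process_image_description(
        input_image="example.png",  # placeholder path
        weights_file="best.pt",
        output_dir="./output",
        model_to_use="blip",
        output_json=True,
        skip_ocr=True,  # skip OCR and spell-checking for a faster dry run
    )

Or, equivalently, from the shell:

    python wrapper.py --input_image example.png --weights_file best.pt --output_dir ./output --json --skip-ocr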
yolo_script.py ADDED
@@ -0,0 +1,185 @@
+ import ultralytics
+ import cv2
+ from ultralytics import YOLO
+ import os
+ import glob
+ import argparse
+ import sys
+ import numpy as np
+ import uuid
+
+ def iou(box1, box2):
+     # Convert normalized (x_center, y_center, w, h) coordinates to corners (x1, y1, x2, y2)
+     x1_1 = box1[0] - box1[2] / 2
+     y1_1 = box1[1] - box1[3] / 2
+     x2_1 = box1[0] + box1[2] / 2
+     y2_1 = box1[1] + box1[3] / 2
+
+     x1_2 = box2[0] - box2[2] / 2
+     y1_2 = box2[1] - box2[3] / 2
+     x2_2 = box2[0] + box2[2] / 2
+     y2_2 = box2[1] + box2[3] / 2
+
+     xi1 = max(x1_1, x1_2)
+     yi1 = max(y1_1, y1_2)
+     xi2 = min(x2_1, x2_2)
+     yi2 = min(y2_1, y2_2)
+     inter_area = max(0, xi2 - xi1) * max(0, yi2 - yi1)
+
+     box1_area = (x2_1 - x1_1) * (y2_1 - y1_1)
+     box2_area = (x2_2 - x1_2) * (y2_2 - y1_2)
+     union_area = box1_area + box2_area - inter_area
+
+     return inter_area / union_area if union_area > 0 else 0
+
+ def load_image(input_image, base_name: str = None):
+
+     if isinstance(input_image, str):
+         img = cv2.imread(input_image)
+         if img is None:
+             raise ValueError(f"Unable to load image from path: {input_image}")
+         if base_name is None:
+             base_name = os.path.splitext(os.path.basename(input_image))[0]
+         return img, base_name
+     else:
+         # Assume input_image is raw bytes or a file-like object
+         if isinstance(input_image, bytes):
+             image_bytes = input_image
+         else:
+             image_bytes = input_image.read()
+         nparr = np.frombuffer(image_bytes, np.uint8)
+         img = cv2.imdecode(nparr, cv2.IMREAD_COLOR)
+         if img is None:
+             raise ValueError("Unable to decode image from input bytes or file-like object.")
+         if base_name is None:
+             base_name = str(uuid.uuid4())
+         return img, base_name
+
+ def process_yolo(input_image, weights_file: str, output_dir: str = './yolo_run', model_obj: YOLO = None, base_name: str = None) -> str:
+
+     orig_image, inferred_base_name = load_image(input_image, base_name)
+     base_name = inferred_base_name
+
+     os.makedirs(output_dir, exist_ok=True)
+     # Determine the file extension: if input_image is a file path, use its extension; otherwise, default to .png.
+     if isinstance(input_image, str):
+         ext = os.path.splitext(input_image)[1]
+     else:
+         ext = ".png"
+     output_image_name = f"{base_name}_yolo{ext}"
+     updated_output_image_name = f"{base_name}_yolo_updated{ext}"
+
+     # If input_image is a file path, call YOLO with the path to preserve filename-based labeling.
+     # Otherwise, when processing in memory, YOLO may default to a generic name.
+     if isinstance(input_image, str):
+         source_input = input_image
+     else:
+         source_input = orig_image
+
+     # Use the provided model or load it from weights_file.
+     if model_obj is None:
+         model = YOLO(weights_file)
+     else:
+         model = model_obj
+
+     results = model(
+         source=source_input,
+         save_txt=True,
+         project=output_dir,
+         name='.',
+         exist_ok=True,
+     )
+
+     # Save the initial inference image.
+     img_with_boxes = results[0].plot(font_size=2, line_width=1)
+     output_image_path = os.path.join(output_dir, output_image_name)
+     cv2.imwrite(output_image_path, img_with_boxes)
+     print(f"Image saved as '{output_image_path}'")
+
+     labels_dir = os.path.join(output_dir, 'labels')
+     label_file = os.path.join(labels_dir, f"{base_name}.txt")
+
+     if not os.path.isfile(label_file):
+         raise FileNotFoundError(f"No label file found for the image '{base_name}' at path '{label_file}'.")
+
+     with open(label_file, 'r') as f:
+         lines = f.readlines()
+
+     boxes = []
+     for idx, line in enumerate(lines):
+         tokens = line.strip().split()
+         class_id = int(tokens[0])
+         x_center = float(tokens[1])
+         y_center = float(tokens[2])
+         width = float(tokens[3])
+         height = float(tokens[4])
+         boxes.append({
+             'class_id': class_id,
+             'bbox': [x_center, y_center, width, height],
+             'line': line,
+             'index': idx
+         })
+
+     boxes.sort(key=lambda b: b['bbox'][1] - (b['bbox'][3] / 2))  # sort top-to-bottom by top edge
+
+     # Perform NMS: greedily suppress same-class boxes that overlap a kept box with IoU > 0.7.
+     keep_indices = []
+     suppressed = [False] * len(boxes)
+     num_removed = 0
+     for i in range(len(boxes)):
+         if suppressed[i]:
+             continue
+         keep_indices.append(i)
+         for j in range(i + 1, len(boxes)):
+             if suppressed[j]:
+                 continue
+             if boxes[i]['class_id'] == boxes[j]['class_id']:
+                 iou_value = iou(boxes[i]['bbox'], boxes[j]['bbox'])
+                 if iou_value > 0.7:
+                     suppressed[j] = True
+                     num_removed += 1
+
+     with open(label_file, 'w') as f:
+         for idx in keep_indices:
+             f.write(boxes[idx]['line'])
+
+     print(f"Number of bounding boxes removed: {num_removed}")
+
+     # Draw updated bounding boxes on the original image (loaded in memory).
+     drawn_image = orig_image.copy()
+     h_img, w_img, _ = drawn_image.shape
+
+     for i, idx in enumerate(keep_indices):
+         box = boxes[idx]
+         x_center, y_center, w_norm, h_norm = box['bbox']
+         x_center *= w_img
+         y_center *= h_img
+         w_box = w_norm * w_img
+         h_box = h_norm * h_img
+         x1 = int(x_center - w_box / 2)
+         y1 = int(y_center - h_box / 2)
+         x2 = int(x_center + w_box / 2)
+         y2 = int(y_center + h_box / 2)
+         cv2.rectangle(drawn_image, (x1, y1), (x2, y2), (0, 255, 0), 2)
+         cv2.putText(drawn_image, str(i + 1), (x1, y1 - 5),
+                     cv2.FONT_HERSHEY_SIMPLEX, 0.8, (143, 10, 18), 1)
+
+     updated_output_image_path = os.path.join(output_dir, updated_output_image_name)
+     cv2.imwrite(updated_output_image_path, drawn_image)
+     print(f"Updated image saved as '{updated_output_image_path}'")
+
+     return updated_output_image_path
+
+ if __name__ == '__main__':
+     parser = argparse.ArgumentParser(description='Process YOLO inference and NMS on an image.')
+     parser.add_argument('input_image', help='Path to the input image.')
+     parser.add_argument('weights_file', help='Path to the YOLO weights file.')
+     parser.add_argument('output_dir', nargs='?', default='./yolo_run', help='Output directory (optional).')
+     parser.add_argument('--base_name', help='Optional base name for output files (without extension).')
+     args = parser.parse_args()
+
+     try:
+         process_yolo(args.input_image, args.weights_file, args.output_dir, base_name=args.base_name)
+     except Exception as e:
+         print(e)
+         sys.exit(1)
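To make the 0.7 NMS cutoff concrete, a quick worked example of iou() on two overlapping boxes in normalized (x_center, y_center, width, height) format:

    from yolo_script import iou

    box1 = [0.5, 0.5, 0.4, 0.4]  # corners (0.3, 0.3)-(0.7, 0.7), area 0.16
    box2 = [0.6, 0.6, 0.4, 0.4]  # corners (0.4, 0.4)-(0.8, 0.8), area 0.16
    # intersection = 0.3 * 0.3 = 0.09; union = 0.16 + 0.16 - 0.09 = 0.23
    print(iou(box1, box2))  # ~0.391, below the 0.7 cutoff, so neither box would be suppressed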