document_redaction / tools / redaction_review.py

Commit History

Further updates to line-level duplicate identification
c8ffcd4

seanpedrickcase committed on

Updated duplicate pages interface to include subdocuments and review. Updated relevant user guide. Minor package updates
f47b137

seanpedrickcase committed on

Updated duplicate pages functionality. Improved redaction efficiency a little by using the concat method. Minor modifications to documentation and interface
ab04c92

seanpedrickcase committed on

Added possibility of changing model and entity types in config file
bce761b

seanpedrickcase committed on

Added capability to redact all redactions with the same text based on the selected row. Rearranged buttons on review page a little. Improved page navigation efficiency.
c4e3724

seanpedrickcase committed on

Adobe xfdf exports can now include redacted text that is searchable in Acrobat
a91f87b

seanpedrickcase committed on

Added config options for compressing output pdfs, for whether to return redacted output pdfs at all, and for changing the length of time for which previous Textract jobs are shown
3bbf593

seanpedrickcase committed on

Improved efficiency of review page navigation, especially for large documents. Updated user guide
93b4c8a

seanpedrickcase committed on

Added compatibility with gradio_image_annotation for passing through id and text properties to annotator. Corrected csv location for Textract api calls. Other minor changes
52c1a90

seanpedrickcase committed on

Minor function documentation changes. Requirements update for new Gradio and version of Gradio annotator that allows for saving preferred redaction format and to include box id
f6e6d80

seanpedrickcase committed on

Implemented Textract document API calls and associated output tracking/download. Fixes to config and cost code implementation. General minor bug fixes.
ed5f8c7

seanpedrickcase committed on

Added workaround to issue with selectdata and dataframes for filtered dataframes. Rearranged some components.
4276db1

seanpedrickcase committed on

Corrected issue where the cropbox method was being overwritten in the review code
b805ec6

seanpedrickcase committed on

Corrected set_cropbox in redaction function. Reset cost code selection to correct method.
11eb675

seanpedrickcase committed on

Cost code dataframe should now pass over selected cost code correctly
0ceb29f

seanpedrickcase committed on

Major update. General code revision. Improved config variables. Dataframe based review frame now includes text, items can be searched and excluded. Costs now estimated. Option for adding cost codes added. Option to extract text only.
0ea8b9e

seanpedrickcase committed on

Fixed manual entry for allow, deny, and full page redaction lists
0e1a4a7

seanpedrickcase committed on

More config options. Fixed some bugs with removing elements from review page and Adobe export. Some UI rearrangements
6319afc

seanpedrickcase committed on

Added features to review dataframe to filter and exclude features based on text. Text should now appear consistently in review_df (for boxes not modified). Larger spacy model returned to use. Gradio upgrade.
66e145d

seanpedrickcase committed on

Redaction is now performed at the whole PDF mediabox size (sometimes larger than the viewable size), then converted back to cropbox size for print and Adobe review. Improved some error raising and app flow
08a3ec3

seanpedrickcase committed on

Allowed for output files to be saved into user-specific folders. Added deny list capability to xlsx/csv file redaction
dacc782

seanpedrickcase committed on

Fixed issues with log file list picking up logs from other file runs. Updated packages.
42180e4

seanpedrickcase committed on

Fixed issues with gradio version 5.16. Fixed fuzzy search error with pages with no data.
3cecbfa

seanpedrickcase committed on

Corrected image coordinate translation when the pdf mediabox is not the same size as pdf page rectangle
760ef5c

seanpedrickcase committed on

Zoom and rotate features from forked gradio_annotation package. Fixed csv/xlsx redaction. Updated guide on creating exe.
20d940b

seanpedrickcase committed on

Fuzzy match implementation for deny list. Added option to merge multiple review files. Review files from redaction step should now include text.
bde6e5b

seanpedrickcase committed on

Added capabilities to export to and import from Adobe .xfdf files
6b28cfa

seanpedrickcase committed on

Added tab to be able to compare pages across multiple documents and redact duplicates
a265560

seanpedrickcase committed on

Ensured the text ocr outputs have no line breaks at end. Multi-line custom text searches now possible. Files for review sent from redact button. Fixed image redaction (not review yet). Can get user pool details from headers. Gradio update.
cb349ad

seanpedrickcase committed on

Corrected large image reduction code
3518b67

seanpedrickcase committed on

Dropdown choices for redactions are now listed correctly
3187788

seanpedrickcase committed on

Moved review components to give more space for page. Extended zoom limits. Existing redaction labels should now appear in new redaction box dropdown.
a9dcd2e

seanpedrickcase committed on

Corrected image resizing method for instances where the image is very large.
0c2987b

seanpedrickcase committed on

App should now resize images that are too large before sending to Textract. Textract now more robust to failure. Improved reliability of json conversion to review dataframe
143e2cc

seanpedrickcase committed on

You can now have output redaction boxes in grey according to an environment variable. Review files are now saved every time page is changed.
c3a8cd7

seanpedrickcase committed on

Fixed bug where pages suggested for whole redaction were one lower than requested
e8681e8

seanpedrickcase committed on

Adapted text join options to review file to be more resilient to changes in image size. Added possibility of using client secret with AWS login
c9e23cb

seanpedrickcase committed on

Side review bar is mostly there. A couple of bugs fixed. Can now return identified text in initial review files. Still working on retaining found text throughout review process
a03496e

seanpedrickcase committed on

Hopefully finally fixed the duplicate image_annotation_object issue
59ff822

seanpedrickcase committed on

Now should correctly remove duplicate items from all_image_annotator
8183bc4

seanpedrickcase committed on

Refactor redaction functionality and enhance UI components: Added support for custom recognizers and whole page redaction options. Updated file handling to include new dropdowns for entity selection and improved dataframes for entity management. Enhanced the annotator with better state management and UI responsiveness. Cleaned up redundant code and improved overall performance in the redaction process.
1d772de

seanpedrickcase committed on

Enhance file handling and UI features: improved Gradio app layout with fill width option, and integrated new settings for deny and fully redacted lists (placeholders so far). Updated file conversion functions to handle CSV inputs and added CSV review file generation for redactions. Now retains all original and merged redaction boxes.
a770956

seanpedrickcase committed on

Fixed issue where redactions were sometimes not removing text underneath boxes. You can now redact in different colours from review page
23f8ca3

seanpedrickcase committed on

Can now specify the root path that the app will run on with an environment variable
b8e245f

seanpedrickcase committed on

Added option for running redact function through CLI (i.e. not going through Gradio UI or API). Test functions for running this through AWS Lambda.
e5dfae7

seanpedrickcase committed on

Only shows AWS options when AWS functions enabled. Can now upload previous review files to continue review later. Some review debugging.
e2aae24

seanpedrickcase committed on

Should now retain modified redactions on first use of zoom
face41c

seanpedrickcase committed on

Comprehend now uses custom spacy recognisers on top of defaults. Added zoom functionality to annotator. Fixed some pdf mediabox issues and redacted image output issues.
ec98119

seanpedrickcase committed on

Allowed for time limits on redact to avoid timeouts. Improved review interface. Now accepts only one file at a time. Upgraded Gradio version
eea5c07

seanpedrickcase committed on