Commit History

Updated save options for ocr_outputs_with_words
52e26c1

seanpedrickcase committed on

Local text redaction now produces OCR results with words in JSON, and can also output them in dataframe format
ef4000e

seanpedrickcase committed on

Further updates to line level duplicate identification
c8ffcd4

seanpedrickcase committed on

Updated packages. Corrected CSV logger headings, can now submit custom log csv names to S3. Started work on identifying and deduplicating at the line level
e424038

seanpedrickcase committed on

Minor update to ensure whole page redactions are applied correctly to a document with existing redactions
3946be6

seanpedrickcase committed on

Updated duplicate pages interface to include subdocuments and review. Updated relevant user guide. Minor package updates
f47b137

seanpedrickcase committed on

Updated duplicate pages functionality. Improved redaction efficiency a little with concat method. Minor modifications to documentation and interface
ab04c92

seanpedrickcase committed on

Added folder with CDK code and app. Updated config py file to be compatible with all temp folders needed for read only file systems
36574ae

seanpedrickcase committed on

Added possibility of changing model and entity types in config file
bce761b

seanpedrickcase committed on

Added capability to redact all redactions with the same text based on the selected row. Rearranged buttons on review page a little. Improved page navigation efficiency.
c4e3724

seanpedrickcase committed on

Now xfdf Adobe exports can export redacted text that is searchable in Acrobat
a91f87b

seanpedrickcase committed on

Expanded checks for out of range page cropboxes
5fcccbe

seanpedrickcase committed on

Updated gradio version. Minor changes to redactor function sequence. Minor formatting and wording changes.
5a21738

seanpedrickcase committed on

Added config options for compressing output PDFs, for whether to return redacted output PDFs at all, and for changing how long previous Textract jobs are shown
3bbf593

seanpedrickcase committed on

More checks on ocr outputs in redaction functions
97097ff

seanpedrickcase committed on

Corrected a couple of bugs. Loading Textract whole document API call outputs now also loads the input PDF into the app
10f46e9

seanpedrickcase committed on

Updated logging format for timestamps to be compatible with AWS. Added load_dynamo_logs.py example file.
94e514b

seanpedrickcase committed on

Minor changes for cost codes, package updates. Added pyproject.toml file
47a3a80

seanpedrickcase committed on

Now local OCR outputs can be saved to file and reloaded to save preparation time. Bug fixing in logs and tabular data redaction. Update to documentation
f93e49c

seanpedrickcase committed on

Improved logging format a little. Now possible to save logs to DynamoDB
0042e78

seanpedrickcase committed on

Improved efficiency of review page navigation, especially for large documents. Updated user guide
93b4c8a

seanpedrickcase committed on

Added button to convert Textract API outputs to ocr_output files easily. Corrected Textract job file location
46bf91e

seanpedrickcase committed on

Added compatibility with gradio_image_annotation for passing through id and text properties to annotator. Corrected csv location for Textract api calls. Other minor changes
52c1a90

seanpedrickcase committed on

Minor function documentation changes. Requirements update for new Gradio and version of Gradio annotator that allows for saving preferred redaction format and to include box id
f6e6d80

seanpedrickcase committed on

Corrected RUN_AWS_FUNCTIONS environment variable reference when downloading cost codes
818efbc

seanpedrickcase committed on

Made changes to hopefully resolve issue with downloading cost centre details from S3 to container
1418017

seanpedrickcase committed on

Fixed issue where S3 cost codes are defined but the local cost code location is not
7b345c3

seanpedrickcase committed on

Fixed issue in Docker containers built locally without correct folder permissions. Improved config file. Updated Gradio version to fix issue with selecting filtered rows. Minor bug fixes.
a33b955

seanpedrickcase committed on

Implemented Textract document API calls and associated output tracking/download. Fixes to config and cost code implementation. General minor bug fixes.
ed5f8c7

seanpedrickcase committed on

Added workaround to issue with selectdata and dataframes for filtered dataframes. Rearranged some components.
4276db1

seanpedrickcase committed on

Corrected issue where the cropbox method was being overwritten in the review code
b805ec6

seanpedrickcase committed on

Corrected set_cropbox in redaction function. Reset cost code selection to correct method.
11eb675

seanpedrickcase committed on

Cost code dataframe should now pass over selected cost code correctly
0ceb29f

seanpedrickcase committed on

Modified config entries to not assume allow list or cost codes file exists. Reduced concurrency to 3 and put input and output files in user subfolders by default
25c9832

seanpedrickcase committed on

Fixed s3 load for allow list and cost codes
e4c7d3c

seanpedrickcase committed on

Major update. General code revision. Improved config variables. Dataframe based review frame now includes text, items can be searched and excluded. Costs now estimated. Option for adding cost codes added. Option to extract text only.
0ea8b9e

seanpedrickcase committed on

Fixed manual entry for allow, deny, and full page redaction lists
0e1a4a7

seanpedrickcase committed on

More config options. Fixed some bugs with removing elements from review page and Adobe export. Some UI rearrangements
6319afc

seanpedrickcase committed on

Added features to review dataframe to filter and exclude features based on text. Text should now appear consistently in review_df (for boxes not modified). Larger spacy model returned to use. Gradio upgrade.
66e145d

seanpedrickcase committed on

Redaction is now performed on the whole PDF mediabox (sometimes larger than the viewable area), then converted back to cropbox size for print and Adobe review. Improved some error raising and app flow
08a3ec3

seanpedrickcase committed on

Integrated AWS Comprehend and fuzzy matching functions with tabular data redaction.
ff290e1

seanpedrickcase committed on

Allowed for output files to be saved into user-specific folders. Added deny list capability to xlsx/csv file redaction
dacc782

seanpedrickcase committed on

Allowed for Textract and Comprehend API calls through AWS keys. File preparation function incorporated into main redaction function so the user no longer needs to 'check in' during the redaction process
391712c

seanpedrickcase committed on

Fixed issues with log file list picking up logs from other file runs. Updated packages.
42180e4

seanpedrickcase committed on

Added concurrency limit to run options. Trying again to load in zoom/rotate options from gradio_image_annotator fork.
dea568f

seanpedrickcase committed on

Laid groundwork for passing in AWS API keys. Duplicate pages option should now work for pages with no text.
7907ad4

seanpedrickcase committed on

App now correctly updates custom fuzzy recognisers
82b9d9d

seanpedrickcase committed on

Fixed issues with gradio version 5.16. Fixed fuzzy search error with pages with no data.
3cecbfa

seanpedrickcase committed on