OCR Recognition Errors #3

tyliec · 2021-07-17T05:45:23Z

Context

Currently, we extract our data from the PDFs using OCR. This produces many OCR errors, because our image cropping is loose and there is no validation step. As a result, we may be serving false data and statistics. An easy way to see this is our "Age" chart, which shows some arrests of people under 18 (which is obviously incorrect). This isn't a major issue, since the OCR is correct a good majority of the time, and we provide the raw data and record images so anyone can validate our statistics. However, it should still be addressed, as we want to be as accurate as possible.

Potential Solutions

Add some sort of "validation" flag that distinguishes validated records from unvalidated ones. That way we aren't presenting false statistics, since we can filter on validated/unvalidated data.
Use a stricter bound around the fields we are trying to parse out of each record (line by line, instead of cropping over the entire section).
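
A rough sketch of both ideas, not the project's actual code. The `Record` fields, the 18 to 110 age range, and the helper names `validate` and `line_boxes` are all assumptions made up for illustration. The boxes use the (left, upper, right, lower) order that Pillow's `Image.crop` accepts, though this issue doesn't say which imaging library the project uses.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, upper, right, lower) in pixels

# Assumed plausible age range for an arrest record; tune as needed.
MIN_AGE, MAX_AGE = 18, 110


@dataclass
class Record:
    """Hypothetical record shape; field names are illustrative only."""
    name: str
    age_text: str               # raw OCR output for the age field
    age: Optional[int] = None   # parsed age, set only when plausible
    validated: bool = False     # True once the record passes the checks


def validate(record: Record) -> Record:
    """Flag a record as validated only if its OCR'd age is a plausible number."""
    text = record.age_text.strip()
    # Require plain ASCII digits so OCR noise like "1B" or "3_4" is rejected.
    if text.isascii() and text.isdigit() and MIN_AGE <= int(text) <= MAX_AGE:
        record.age = int(text)
        record.validated = True
    else:
        record.age = None
        record.validated = False
    return record


def line_boxes(section: Box, n_lines: int) -> List[Box]:
    """Split one section box into n_lines equal-height boxes, one per text line."""
    left, upper, right, lower = section
    height = (lower - upper) / n_lines
    return [
        (left, round(upper + i * height), right, round(upper + (i + 1) * height))
        for i in range(n_lines)
    ]


records = [validate(r) for r in (
    Record("A", "34"),   # plausible age
    Record("B", "1B"),   # OCR misread
    Record("C", "7"),    # parses, but below the assumed minimum
)]
validated_ages = [r.age for r in records if r.validated]
boxes = line_boxes((0, 100, 400, 220), 3)
```

Charts such as "Age" could then be built from `validated_ages` only, while the unvalidated records stay available for manual review against the record images.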


tyliec · 2022-01-05T21:00:01Z

This is not fully fixed. Closing anyway, since OCR accuracy is now above 95%.

tyliec added the bug label Jul 17, 2021

tyliec added a commit that referenced this issue Jan 3, 2022

feat: improve ocr recognitions (#3)

e7d2833

tyliec linked a pull request Jan 3, 2022 that will close this issue

feat: improve ocr recognitions (#3) #12

Merged

tyliec closed this as completed in #12 Jan 4, 2022

tyliec added a commit that referenced this issue Jan 4, 2022

feat: improve ocr recognitions (#3) (#12)

7a8def1

* feat: improve ocr recognitions (#3, #6)

tyliec reopened this May 1, 2023
