Currently, we collect our data from the PDFs using OCR. This produces many OCR errors, because our image cropping is loose and the extracted values are never validated. As a result, we may be serving false data and statistics. An easy place to see this is our "Age" chart, which shows some arrests of people under 18 (which is obviously not correct). This isn't a big issue, since the OCR works a good majority of the time, and we provide the raw data and record images so anyone can validate our statistics. However, it should still be dealt with, as we want to be as accurate as possible.
Potential Solutions
Add a "validation" flag to each record, separating validated records from unvalidated ones. This way we aren't presenting false statistics, because charts can be filtered on validated vs. unvalidated data.
Put a stricter bound around each field we are trying to parse out of a record (crop line by line, instead of cropping over the entire section).
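As a rough illustration of the first option, here is a minimal sketch of a validation flag for the "Age" field. The `Record` class, the `validate_age` and `validated_only` helpers, and the `MAX_AGE` bound are all hypothetical names and values chosen for this example, not part of the existing codebase. Only the lower bound of 18 comes from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional

MIN_AGE = 18   # from the issue: ages under 18 are treated as OCR errors
MAX_AGE = 110  # assumed upper bound for illustration, not from the project


@dataclass
class Record:
    """Hypothetical record shape: keeps the raw OCR text next to the parsed value."""
    raw_age: str               # text exactly as the OCR step returned it
    age: Optional[int] = None  # parsed age, set only when validation passes
    validated: bool = False    # True only if the value passed every check


def validate_age(raw_age: str) -> Record:
    """Parse an OCR'd age string and flag whether it looks plausible."""
    text = raw_age.strip()
    # Accept plain ASCII digits only; anything else is likely an OCR misread.
    if text.isascii() and text.isdigit():
        age = int(text)
        if MIN_AGE <= age <= MAX_AGE:
            return Record(raw_age=raw_age, age=age, validated=True)
    # Keep the raw text so the record can still be reviewed by hand.
    return Record(raw_age=raw_age)


def validated_only(records: List[Record]) -> List[Record]:
    """Return only the records that are safe to include in statistics."""
    return [r for r in records if r.validated]
```

With something like this in place, the charts could be built from `validated_only(...)` while the unvalidated records stay available alongside their record images for manual checking.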