I've wondered that myself. I know OCR isn't accurate enough to be passed through as fully accurate without human review, but even partial search matching can be extremely helpful. In my experience during the NASA transcription effort, Tesseract did fairly well on clean, high-DPI scans (noisy microfiche is another story, unsurprisingly...).
Could NARA set up a system where a record gets a "hidden" OCR text by default whenever the conversion passes some form of automatic check (say, some percentage of the words pass a spellcheck)? That would enable at least basic searching, and when a citizen archivist later comes in and produces a human-audited version, it could be swapped in to replace the OCR text.
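To make the idea concrete, here's a rough sketch of that kind of check in Python. Everything in it is an assumption for illustration: the tiny `SAMPLE_WORDS` set stands in for a real dictionary, the 0.8 threshold is arbitrary, and the function names are made up. It is not how NARA or any existing system does this.

```python
import re

# Hypothetical stand-in for a real dictionary; an actual check would load a full word list.
SAMPLE_WORDS = {
    "the", "of", "and", "to", "in", "a", "was", "on",
    "report", "mission", "launch", "apollo", "crew",
}


def dictionary_hit_rate(text, known_words):
    """Return the fraction of alphabetic tokens in `text` found in `known_words`."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in known_words)
    return hits / len(tokens)


def choose_search_text(ocr_text, human_text=None,
                       known_words=SAMPLE_WORDS, threshold=0.8):
    """Pick the text to index for search, or None if nothing is good enough.

    A human-audited transcription always wins. Otherwise the raw OCR text is
    used only if enough of its words look like real words.
    """
    if human_text is not None:
        return human_text
    if dictionary_hit_rate(ocr_text, known_words) >= threshold:
        return ocr_text
    return None


clean = "The crew of the Apollo mission was on the launch report"
noisy = "Tbe cr3w 0f tne Ap0llo rnission"

print(choose_search_text(clean) is not None)   # clean scan is indexed
print(choose_search_text(noisy) is not None)   # noisy scan is held back
```

The threshold would need tuning against real scans, since records full of names, acronyms, or part numbers will fail a plain dictionary check even when the OCR is good.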
We're currently developing OCR extraction for documents in the Catalog, which will be implemented in March 2019. This will include some front-end enhancements that show a user which pages in a document contain their search term and allow them to go to those pages directly. At the outset, OCR will apply only to newly added documents. However, at some point NARA will look at retroactively extracting OCR text from all documents already in the Catalog.