2 Replies Latest reply on Nov 16, 2018 9:18 AM by Jason Clingerman

    OCR for typed documents?

    Michael Ritchie Newbie

      I'm new at this, and have been transcribing for only a month. I'm just curious: for the easily read, typed documents, why doesn't the National Archives just use OCR software to pull the text out automatically? I would think that would be much faster than relying on volunteers to go in and transcribe them. Don't get me wrong, I enjoy doing this; it just seems like that would be more efficient.

        • Re: OCR for typed documents?
          techhistorynerd Adventurer

          I've wondered that myself. I know OCR isn't accurate enough to be passed through as fully accurate without human review, but even partial search matching can be extremely helpful. My experience during the NASA transcription effort indicated that Tesseract did fairly well on clean, high-DPI scans (noisy microfiche is another story, unsurprisingly...).

          Could NARA set up a system where a record gets "hidden" OCR text by default when the conversion passes some form of automatic check (say, some percentage of the words pass a spellcheck test), to enable at least basic searching? Then, when a citizen archivist comes in and makes a human-audited version, that could be swapped in to replace it.
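
          To make the idea above concrete, here is a rough sketch of what such a check might look like. Everything in it is an assumption for illustration: the tiny `KNOWN_WORDS` list stands in for a real dictionary, the 0.8 threshold is arbitrary, and the function names are made up. It is not how NARA or any OCR tool actually does this.

```python
import re

# Hypothetical stand-in word list; a real check would load a full dictionary.
KNOWN_WORDS = {
    "the", "national", "archives", "of", "united", "states",
    "record", "is", "a", "typed", "document",
}


def known_word_ratio(text, known_words=KNOWN_WORDS):
    """Return the fraction of alphabetic tokens that appear in known_words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in known_words)
    return hits / len(tokens)


def choose_search_text(ocr_text, human_text=None, threshold=0.8):
    """Pick the text to index for search.

    A human-audited transcription always wins. Otherwise the OCR text is
    used only if enough of its words pass the dictionary check (assumed
    threshold); if not, nothing is indexed.
    """
    if human_text:
        return human_text
    if known_word_ratio(ocr_text) >= threshold:
        return ocr_text
    return None
```

          With this sketch, clean OCR output such as "The record is a typed document" would be indexed as hidden text, badly garbled output would be held back, and a volunteer's transcription would replace either one as soon as it exists.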

          • Re: OCR for typed documents?
            Jason Clingerman Adventurer

            We're currently developing OCR extraction for documents in the Catalog, which will be implemented in March 2019. This will feature some front-end enhancements that show a user which pages in a document contain their search term and allow them to go to those pages directly. At the outset, the OCR will apply only to newly added documents. However, at some point, NARA will look at retroactively extracting OCR text from all documents already in the Catalog.
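
            For anyone curious what "which pages contain the search term" could look like in code, here is a minimal sketch. It assumes each document is available as a list of per-page OCR strings; the function name and the plain substring match are illustrative assumptions, not a description of the Catalog's actual implementation.

```python
def pages_containing(term, page_texts):
    """Return 1-based page numbers whose OCR text contains term.

    Matching is a case-insensitive substring test; an empty term
    matches nothing.
    """
    needle = term.casefold()
    if not needle:
        return []
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if needle in text.casefold()
    ]
```

            A front end could turn the returned page numbers into links that open the document at each matching page.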
