Potential use of AI for transcription

Based on all I have read recently, transcription seems like an ideal application for AI, with Citizen Archivists (CAs) proofreading and correcting the AI-generated drafts. There is already a significant volume of completed transcription that could train an AI model such as ChatGPT, and this could significantly expand our efforts.

Obviously, many caveats about AI would have to be factored in.

Thoughts, CAs?

NARA staff: are you already exploring this? Is there a pilot in the works? I would love to volunteer to participate!

  • Thank you for your question. Yes, we're always looking at new technology and how we can incorporate it into the National Archives Catalog. Be sure to subscribe to the National Archives Catalog Newsletter (https://www.archives.gov/research/catalog/newsletter); that is where we will announce any new projects or features.

    Community Manager, National Archives Catalog

  • I wrote a program for myself that runs the page through OCR and fixes the most common errors; after that, I mostly just need to proofread and fix mistakes (a sketch of the general idea appears below). I am hesitant to post the program itself because people really shouldn't download random programs they run across in a forum, and it is not user-friendly. However, if there is interest, I can probably make it accessible.

    You can also upload an image to Google Docs, which will run OCR on it.
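
    For anyone curious, here is a minimal sketch of that kind of pipeline, assuming Tesseract is installed locally with the pytesseract and Pillow packages available. The substitution table is invented for illustration; a real one would be tuned to the errors in your own documents.

    ```python
    # Minimal sketch of the OCR-then-fix approach described above.
    # Assumes Tesseract is installed locally and that the pytesseract
    # and Pillow packages are available. The substitution table is
    # invented for illustration only.
    import re

    import pytesseract
    from PIL import Image

    # Typical OCR confusions (hypothetical examples).
    COMMON_FIXES = [
        (r"(?<=\d)O", "0"),      # letter O where a zero belongs
        (r"(?<=\d)l", "1"),      # letter l where a one belongs
        (r"\brnore\b", "more"),  # "rn" misread for "m"
    ]

    def ocr_and_clean(image_path: str) -> str:
        """Run OCR on one page image, then apply the substitution table."""
        text = pytesseract.image_to_string(Image.open(image_path))
        for pattern, replacement in COMMON_FIXES:
            text = re.sub(pattern, replacement, text)
        return text

    # The output is a draft for a human to proofread, not a final transcription.
    print(ocr_and_clean("page_001.jpg"))
    ```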

  • I am sure there are several technical developers and researchers out there with solutions for these types of issues, or who, like me, have created or implemented OCR- and AI-augmented transcription programs and platforms. However, even the best of these solutions may only ever get you 95% of the way to a perfect transcription. I have millions of transcriptions done this way against records I would like to add via my API access, but I am uncomfortable doing so without the ability to mark them as drafts or as "needs review" to get human eyes on them. I honestly do not know if that feature already exists, but if it does not, is there a way to submit feature requests for your backlog?

  • Some of the pages that have not been transcribed have "extractedText" associated with them.

  • I also use OCR (ABBYY FineReader and Tesseract) as a starting point for printed or typewritten documents. Accuracy depends a lot on the quality of the original: sometimes the output is near-flawless; other times, not so much. A human proofreader is required either way. Even a poor scan can be helpful by providing a framework for the document, which helps avoid skipped-word and skipped-line mistakes.

    I would like to see a system where the institution runs an OCR pass that is then presented to human transcribers, who can choose whether to use it as a starting point for the final human-vetted transcription. In the meantime, the (imperfect) OCR can be indexed and used for search purposes; the recently released 1950 census did this, enabling me to find a family I might not otherwise have located. (A toy illustration of the indexing idea follows.)
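
    As a rough sketch of that "index the imperfect OCR" idea, the snippet below builds a toy inverted index, assuming one OCR string per page. A real deployment would use a proper search engine; the point is that even flawed text can make a page findable.

    ```python
    # Toy illustration of indexing imperfect OCR output for search.
    from collections import defaultdict

    def build_index(pages: dict) -> dict:
        """Map each lowercased token to the set of page IDs containing it."""
        index = defaultdict(set)
        for page_id, text in pages.items():
            for token in text.lower().split():
                index[token.strip(".,;:")].add(page_id)
        return index

    # Hypothetical pages: p1 keeps its OCR errors, p2 is clean.
    pages = {
        "p1": "Smith famly, enumerated Aprll 1950",
        "p2": "Smith family, enumerated April 1950",
    }
    index = build_index(pages)
    print(index["smith"])  # {'p1', 'p2'}: both pages found despite p1's errors
    ```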

  • Slightly off topic, but I ran a lot of tests on these types of old documents, the IDPFs in particular, which contain a large variety of both typed and handwritten material. ABBYY, Tesseract, EasyOCR, and several other packages, even when adjusted for a variety of factors or used with a variety of document thresholds and image manipulations, could not touch the accuracy of the major cloud OCR offerings from Google, AWS, and Azure, which required no manipulation for accurate results (all three performed similarly). In fact, the results were not even close with any homegrown solution. The downside is that those cloud services are all paid; they are relatively inexpensive for thousands of documents, but the costs grow steeply when we're talking millions of pages (as with the IDPFs).

    Just thought I would put that out there in case it helps! For reference, a sketch of one of those cloud calls appears below.
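
    This is a hedged sketch rather than the harness from those tests: Google Cloud Vision is shown, and AWS Textract and Azure's Read API follow a similar request/response pattern. It assumes the google-cloud-vision package is installed and application credentials are configured; the filename is hypothetical.

    ```python
    # Sketch of a cloud OCR call, with Google Cloud Vision as the example.
    # Assumes the google-cloud-vision package is installed and that
    # application credentials are already configured.
    from google.cloud import vision

    def cloud_ocr(image_path: str) -> str:
        client = vision.ImageAnnotatorClient()
        with open(image_path, "rb") as f:
            image = vision.Image(content=f.read())
        # document_text_detection targets dense document pages, versus
        # text_detection, which is aimed at text in photos.
        response = client.document_text_detection(image=image)
        if response.error.message:
            raise RuntimeError(response.error.message)
        return response.full_text_annotation.text

    print(cloud_ocr("idpf_page_017.jpg"))  # hypothetical filename
    ```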

  • The Archives already uses OCR for indexing. There is a hidden variable in the page called extractedText, and when a document is transcribed, the transcription takes the place of the OCR text in extractedText. (A hedged sketch of checking for that field via the API follows.)
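
    Here is a sketch of how one might look for that field, assuming the historically documented v1 Catalog API (which may since have changed). The exact JSON path of extractedText is not asserted here, so the code simply walks the whole response for any key by that name; the NAID is hypothetical.

    ```python
    # Hedged sketch: fetch a Catalog record and look for any field named
    # "extractedText". The endpoint and naIds parameter follow the
    # historically documented v1 API and may since have changed.
    import requests

    def find_key(obj, key):
        """Yield every value stored under `key` anywhere in nested JSON."""
        if isinstance(obj, dict):
            for k, v in obj.items():
                if k == key:
                    yield v
                yield from find_key(v, key)
        elif isinstance(obj, list):
            for item in obj:
                yield from find_key(item, key)

    naid = "123456"  # hypothetical National Archives Identifier
    resp = requests.get("https://catalog.archives.gov/api/v1",
                        params={"naIds": naid}, timeout=30)
    for text in find_key(resp.json(), "extractedText"):
        print(str(text)[:200])
    ```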

  • Tesseract is an open-source engine that Google maintained for many years. I use PyTesseract as a starting point, and then my Python program fixes common errors; it gets me most of the way. No matter what you use, the output still needs to be proofread.