Extracted Text

I have come across my first images that have "Extracted Text", working on Chinese Exclusion Act records, specifically 27160/1, Fong Mon Song. Questions:

  1. Do we cut and paste the extraction into the transcription box?
  2. Then make corrections by comparing the extraction to the document? 
  3. Do we add information that the extraction missed altogether?
  4. I can't find any instructions on how we are to handle or process extracted text.

Thanks...

  • Thanks for your questions.  We just launched this code over the weekend, so we have not yet been able to communicate instructions.  If your record contains extracted text, you are welcome to follow the steps you included in your questions.  That's exactly how we recommend you use it.  In many cases, due to the complexity of our records, the extracted text may not be completely accurate, so human eyes are really important.

    Please note that we encountered some problems with the Catalog after the code for this and other updates was added this weekend.  As a result, the new extracted text feature will be removed today (6/24/2024) and will be added back very soon.

    Please let us know if you have any additional questions.

    Sincerely,

    Community Manager, National Archives Catalog

  • Thanks for your prompt answer. I suspected that is how we'd proceed, but thought I should check first. 

  • This is a very valuable tool for transcription. It allows us to be proofreaders and layout editors instead of mere typists. Please provide us with this feature as soon as possible.

  • Extracted text suddenly appeared and then suddenly disappeared.  It was very helpful when I was transcribing Air Medal Decoration cards, and the extracted text was really good.  Will the extracted text be searchable for users?  If so, should the documents still be transcribed?  (I have since seen your explanation.)  I've always wondered why those cards weren't OCR'ed already, as I thought they would be good candidates for it.

    I look forward to the return of extracted text.  Nice job and hope you can work the bugs out.

  • We expect this feature to return (permanently) by the end of July, if not sooner.

    Community Manager, National Archives Catalog

  • We're happy to hear you like this new feature.  We encountered some problems with the Catalog after the code for this and other updates was added this weekend.  As a result, the new extracted text feature was removed on 6/25/2024, and it is expected that it will be added back in within the month.

    Once the Extracted Text panel has been returned to the Catalog, we will share information, hints, and tips about this field on the HELP page linked from the top of every page in the Catalog.  Additionally, we will share information in the National Archives Catalog Newsletter.  Be sure you are subscribed; this is where we share Catalog Hints and Tips.

    Please let us know if you have any additional questions.

    Community Manager, National Archives Catalog

  • I'm very happy to see the Extracted Text feature. It actually echoes, I think, what has been my general workflow for typed or typeset materials: (1) download the page image, (2) process the image to create a cleaner input for the OCR software to read, (3) read the image with OCR software (in my case, a rather old version of ABBYY FineReader), (4) proof the result against the original downloaded image, and (5) upload the results.

    Can you describe your workflow at the Archives? I imagine it's rather more automated than the artisanal process I described.

    In general, the extracted text I've been seeing since the feature became visible to us is impressively accurate, certainly more so than the output of the process I described above. Is your software that much better? Or is it the image prep before the read? Or both, or something else?

    Best regards
    Art