FREE OCR - ALL COLLECTIONS - TYPED AND HANDWRITTEN TEXTS

I have had good results using OCR across all the collections and I would like to share the method, because it can save a lot of time.

Google Drive is a free and easy tool for running OCR on typed/printed text and on handwriting.

Please follow the steps below.

1 Downloading the image you want to transcribe

1.1 Open the By the People page that you want to transcribe

1.2 Click the three-dot menu button in the upper right corner of the browser

1.3 Click "More tools"

1.4 Click "Save page as"

1.5 Save the page as "Webpage, Complete" (a scripted alternative follows these steps)
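If you prefer to script this download step, a minimal sketch in Python could look like the following. The URL is a hypothetical placeholder: you would copy the actual image address from the page yourself.

```python
import requests

# Hypothetical placeholder: paste the direct image address copied from the page
image_url = "https://example.org/page-image.jpg"

response = requests.get(image_url, timeout=30)
response.raise_for_status()  # stop here if the download failed

with open("page.jpg", "wb") as f:
    f.write(response.content)
```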

2 Applying Google OCR

2.1 Open Gmail

2.2 Open Google Drive

2.3 Click "New"

2.4 Click "File upload"

2.5 Select the JPG file (it's usually the first one) inside the folder created by "Save page as" in step 1.5

2.6 Wait for Google Drive to finish the upload

2.7 Select the JPG file you've just uploaded

2.8 Right-click it and select "Open with"

2.9 Select "Google Docs"

After that, a new file with the same name but a different extension (a Google Doc) will be created. This file will contain the image and the transcribed text.
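If you have many pages to process, the upload-and-convert steps above can also be scripted with the Google Drive API. The snippet below is only a rough sketch: it assumes the google-api-python-client package is installed and that `creds` holds OAuth credentials you have set up separately (see Google's Drive API quickstart). Asking Drive to store the upload as a Google Doc is what triggers the OCR.

```python
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

# 'creds' stands in for OAuth credentials obtained separately.
service = build("drive", "v3", credentials=creds)

file_metadata = {
    "name": "page.jpg",
    # Converting the upload to a Google Doc makes Drive run OCR on it
    "mimeType": "application/vnd.google-apps.document",
}
media = MediaFileUpload("page.jpg", mimetype="image/jpeg")

doc = service.files().create(
    body=file_metadata, media_body=media, fields="id"
).execute()
print("Created Google Doc with the OCR text, id:", doc["id"])
```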

REVISION

1 ALL TEXTS SHOULD BE REVISED

The computer can make mistakes, especially if the image quality is poor or if handwritten words cover the printed text.

2 THE LINE BREAKS ARE NOT THE SAME

The transcribed text will come back with different line breaks, so you should rebreak the lines to match the original text.

3 SEVERAL PAGES OR COLUMNS

If the image shows two pages, you need to split the image and place the first page above the second. To split them you can use Microsoft Paint.

The same approach works for newspaper pages. Many of them have several columns; if a page has five columns, split the image into five parts, creating five files, then upload each file and run the OCR on it separately.
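If you are comfortable with a little scripting, the splitting can be automated instead of done by hand in Paint. Here is a minimal sketch using the Pillow imaging library (my own suggestion, not part of the original steps), assuming the columns are of roughly equal width:

```python
from PIL import Image

def split_columns(path, n_columns):
    """Split an image into equal vertical slices, one file per column."""
    page = Image.open(path)
    width, height = page.size
    column_width = width // n_columns
    for i in range(n_columns):
        left = i * column_width
        # Let the last slice absorb any leftover pixels from integer division
        right = width if i == n_columns - 1 else (i + 1) * column_width
        page.crop((left, 0, right, height)).save(f"column_{i + 1}.jpg")

split_columns("newspaper_page.jpg", 5)
```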

4 HANDWRITTEN TEXTS

If the handwriting is clear, the OCR will transcribe all or most of it.

Here we have a good example

https://crowd.loc.gov/campaigns/blackwells-extraordinary-family/henry-browne-blackwell-family-correspondence/mss12880013…

Here we have a bad example

https://crowd.loc.gov/campaigns/alan-lomax/british-isles-1950-1958/afc2004004.ms230215/afc2004004ms230215-14/

  • Thank you Rodrigo, our Library of Congress colleague who kindly agreed to share his OCR method here on History Hub for volunteers to use if they like. This process will be helpful for getting text from printed documents, rather than handwritten ones. It can really save time when you're dealing with newsprint, though as Rodrigo says and demonstrates in his examples, there are times when OCR doesn't work so well; always read through the automatically generated text to check that it accurately represents the original (including any punctuation, accents or spelling, regardless of whether this is "correct" by modern standards). You can also get images of each document by clicking on the "view on loc.gov" button in the transcription interface.

    Please preserve the original line breaks as you would if you were transcribing. Make sure that words broken over two lines such as kit-

    ten

    are transcribed as a whole word on the first line, so "kitten" in this example (a small cleanup sketch follows below).
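    If you are cleaning OCR output in bulk, a small Python sketch along these lines can rejoin hyphen-broken words while keeping the line break after the rejoined word. It is deliberately naive (it will also merge genuine hyphenated compounds that fall at a line end), so the result still needs proofreading:

    ```python
    import re

    ocr_text = "Please look after the kit-\nten while I am away."

    # Move the second half of a hyphen-broken word up to the first line,
    # keeping the line break after the rejoined word.
    joined = re.sub(r"-\n(\w+) ?", r"\1\n", ocr_text)
    print(joined)  # "Please look after the kitten" / "while I am away."
    ```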

  • I have started using this OCR method and it is awesome. Thanks so much for sharing.

  • The OCR reader I have been using with good success on the iPad is called HRReader (free in the Apple App Store). You need to search for “Tahira Ghani ocr”; it doesn’t come up under HRReader. It claims to read handwriting too, but I haven’t had much luck with that. It does work well with printed text, though. As was stated above, you still need to proofread, mostly for misspelled words, before passing the document on for review.

    I have come across many documents waiting for review that were not proofread, and I had to fix minor misspellings ("on" instead of "of", etc.).

    BTW, when OCR’ing a multi-column page, I do it one column at a time.

  • Pomodoro software makes this process easier.

    https://pomodoro.semlab.io/

    Only an upload is required.

  • Can someone please explain this procedure? Where do you get All Collections? Then you do something in Google Drive? That sounds cloud-based, which is not viable for someone stuck with DSL. I need to process about 15,000 PDFs of 5 to 20 pages. They were OCR'ed at one time, but the company that did it messed up the scanning resolution (every other page is off by a factor of 4) and the cropping. I would say 90% of the text is gibberish. Will I need to extract every PDF page as a JPG, do a new OCR, and then rebundle the individual PDFs?

  • Hi Michael,

    The original message gave instructions for volunteers working on the By the People project, which is the Library of Congress' crowdsourcing transcription program, so they may not apply to your situation. "All collections" refers to Library of Congress materials. Our platform doesn't display PDFs and we're usually dealing with files that only include one page, which is why this method is pretty specific to the file types/sizes encountered on the Library of Congress sites. There may be volunteers on this message board who could provide advice on your situation, but you might also consider reaching out to the organization or department (if they are Library of Congress materials) where you sourced the PDFs to find out what they advise.

    Best,

    Abby

    By the People Community Manager

  • Thank you for your reply. Yes, I can see now that the post is in reference to a specific project. My read was that "All Collections" might be a software app to perform OCR. I've been struggling with various programs; all results are marginal to horrid. As for the source of the PDFs, they are from a City archive. I was able to get the PDFs, but I doubt they have the original JPGs. Whoever did the scanning had every other page set at a different resolution. As you view the PDF, it goes from too big to too small. They also made a mess of file naming, listing month and day prior to year. The City has a web portal that allows for searching, but the user is limited to single-word searches. That, coupled with poor OCR, means you seldom get a meaningful search.

    I thought it would be an interesting project to fix the file names (done), and the page formatting (done).  I then wanted to obtain a new OCR scan of the documents and create a searchable index that would allow searching for phrases, not just single words.  The original documents are in very large ledger style books and I'm afraid they will all need to be reshot to make this happen.  The City archive has been very helpful in my research and I took this on as a way to pay back their help.

  • Hi Michael,

    Please try Pomodoro: https://pomodoro.semlab.io/ (you can see how it works on that page).

    On that site you can discard what you don't need and then create a text file.

    You can also upload all 15,000 files to Google Drive and then open each one as a Google Doc (a rough batch sketch follows below).
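    As a rough sketch of that batch idea (assuming the google-api-python-client package and OAuth credentials, represented here by `creds`, that you would set up separately; note that Drive's PDF-to-Doc conversion has size limits, so spot-check the results):

    ```python
    import glob
    import os

    from googleapiclient.discovery import build
    from googleapiclient.http import MediaFileUpload

    # 'creds' stands in for OAuth credentials obtained separately.
    service = build("drive", "v3", credentials=creds)

    for path in glob.glob("scans/*.pdf"):
        media = MediaFileUpload(path, mimetype="application/pdf")
        # Storing the upload as a Google Doc makes Drive run OCR on it
        doc = service.files().create(
            body={"name": os.path.basename(path),
                  "mimeType": "application/vnd.google-apps.document"},
            media_body=media,
            fields="id",
        ).execute()
        # Export the recognized text back out as plain text
        text = service.files().export(
            fileId=doc["id"], mimeType="text/plain"
        ).execute()
        with open(path + ".txt", "wb") as f:
            f.write(text)
    ```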

  • Thank you so much! I will look into this. I may need to think about making this a crowdsourced project, since there is so much to process. Given the image quality and format issues of this collection, transcription will likely be the only way to get a reliable search capability. Thanks again.