I wanted to post this to give some folks an idea of the frustration you can expect when dealing with some free OCR tools. I try to use OCR (optical character recognition) to transcribe information from images I find mostly in online resources but sometimes also ones I scan or photograph. There are some extremely 'clean' resources out there that have been scanned in high-res and will look great in any OCR tool. But there are some awful scans out there as well. Let's run through an example.
In my last post I wrote about the obituary of Sarah Tuggle. I found the scan of the obit from 1883 on Ancestry.com and knew right away that this would not be an easy one to convert to text:
First of all, for some reason, Ancestry.com has recently started downloading images in PNG format. While this is a great format and is a close second to TIF, not many OCR applications can read it, so you have to convert it with some other tool. Luckily the basic Microsoft Office Picture Manager will do that in no time. But as you can see, the image is in extremely bad shape.
I tried first with PaperPort, which is a document organization tool that came with my DocuPen (an excellent handheld pen-sized scanner). PaperPort has a terrific OCR tool which works quickly and almost flawlessly when you deal with a good source image. But this is what I got with PaperPort:
sale.
Close, right? That was the original PNG. Then I tried the converted JPG:
.1 -d— of na
o~ueru~~ e siw rr.'i~'~:~ ~r ove.n~w a..n ne.. .eror..o
of a.. riu~~n r~ui:.
Not much better. I also have an OCR tool that came with my terrific HP OfficeJet Pro 8500. But I can never get it to work on images that were not scanned at a high DPI and it is clunky and not very user-friendly. I tried it anyway and just got frustrated some more.
aim . .~ me .~ aor nee
wu r~
~:~~ ° .«ac.o
Then I remembered that I had a great free OCR tool somewhere on the 70GB hard drive of my computer, but since I hadn't used it in a while I couldn't remember what it was called and couldn't find it anywhere. So I went looking for a good OCR tool online, and there are a lot of those out there.
SimpleOCR looked promising but it couldn't convert the file at all. I tried another, good image and it had a lot of errors anyway. The interesting feature was that it lets you choose from a drop-down list which word you want to use when it is not 100% sure what it scanned. Also, it has a 14-day trial for handwriting recognition, but you have to teach the system how you write and go through a whole training exercise. That might come in handy some day.
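The drop-down idea is easy to picture with a few lines of Python. This is only a sketch of the general approach, not how SimpleOCR actually works (I have no idea what it does internally). The tiny word list, the `suggest` function name, and the 0.6 cutoff are all assumptions made up for the illustration; a real tool would use a full dictionary.

```python
from difflib import get_close_matches

# Hypothetical mini-dictionary; a real tool would ship a full word list.
VOCABULARY = [
    "suddenly", "sudden", "yesterday", "residence",
    "daughter", "dinner", "died", "street",
]


def suggest(word, vocabulary=VOCABULARY, n=3, cutoff=0.6):
    """Return up to n dictionary words that resemble an uncertain OCR word.

    The list is ordered from most to least similar and is empty when
    nothing in the vocabulary is similar enough (below the cutoff).
    """
    return get_close_matches(word.lower(), vocabulary, n=n, cutoff=cutoff)


if __name__ == "__main__":
    # Two words an OCR tool garbled in the obituary scan.
    for garbled in ["suddnnly", "yesterdny"]:
        print(garbled, "->", suggest(garbled))
```

The candidates come back ranked, so the first entry is the one a drop-down would most likely preselect.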
Another free program that intrigued me was TopOCR. The interesting thing here is that it is intended for photo capture with cameras of at least 3 megapixels. I was sure it would be able to handle some bad scans but this is what I got:
A Adds Ads, Am, Carob Beef ~e, Bulb Or Err. PIU~DeJ Tugger
dled ~uddonl7 ye~lerd~^r at Me r~ld~ ace of h
46ugbler, Art. Plorco llf Inure, on Butler~l~eL 8,
nob * try dlnner *ad wry ~ppuenllr troll, 81 o^lr~n~d & IlUle ox ~^lll0~ d~, howe~ot, Al . dl^cd league -liar TV ~~ ~~
It basically found only one word right: Butler. So this was not going to work. It is a very quick tool, though, and lets you edit the outcome in a side-by-side view next to the original.

When I tried a good image I got pretty good results. But my problem is not with good images, it's the crappy ones I need help with.
So finally I found the program I had been using before. Obviously it's called FreeOCR. Doh! It also lets you view the result side-by-side with the original and open the recognized text in MS Word. I can't seem to get a screenshot of this application for some reason but here is what I got when I ran it:
A lnddan Death.
In. Earnh Tuggln, wits nt Hr. Plukncy Tuggle.
dlcd suddnnly yesterdny at the ruldcnce cl her
daughter, Mr:. Plame Mlm, nn llutlaralrael. Shu
aw s Imm dinner and wu nppu-entlr wall. Sha
rnmulnmnij A lime ou smlug clown, however, and
dlud \».|‘un; any cue could mach her. _
The recognition wasn't great, but it was the closest I could get. And there was no difference between PNG and JPG either. When I ran better scans through FreeOCR it did great too. And it's free!
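To put a rough number on how close FreeOCR came, here is a minimal Python sketch that scores its output against my hand-corrected transcription of the obituary. The function names are my own, invented for the illustration, and the similarity ratio from Python's standard `difflib` module is just one convenient yardstick, not an official OCR accuracy metric.

```python
from difflib import SequenceMatcher

# Hand-corrected transcription of the obituary.
TRUTH = """A Sudden Death.
Mrs. Sarah Tuggle, wife of Mr. Pinkney Tuggle,
died suddenly yesterday at the residence of her
daughter, Mrs. Pierce Mims, on Butler street. She
ate a hearty dinner and was apparently well. She
complained a little on sitting down, however, and
died before any one could reach her."""

# Raw FreeOCR output for the same scan (raw string because of the backslash).
FREEOCR = r"""A lnddan Death.
In. Earnh Tuggln, wits nt Hr. Plukncy Tuggle.
dlcd suddnnly yesterdny at the ruldcnce cl her
daughter, Mr:. Plame Mlm, nn llutlaralrael. Shu
aw s Imm dinner and wu nppu-entlr wall. Sha
rnmulnmnij A lime ou smlug clown, however, and
dlud \».|‘un; any cue could mach her. _"""


def char_similarity(ocr: str, truth: str) -> float:
    """Return a 0..1 character-level similarity between OCR and true text."""
    # autojunk=False: the default heuristic ignores very frequent characters
    # in texts of 200+ characters, which would distort the score here.
    return SequenceMatcher(None, ocr, truth, autojunk=False).ratio()


def exact_word_hits(ocr: str, truth: str) -> int:
    """Count words that match the true text exactly and in the same order."""
    matcher = SequenceMatcher(None, ocr.split(), truth.split(), autojunk=False)
    return sum(block.size for block in matcher.get_matching_blocks())


if __name__ == "__main__":
    total_words = len(TRUTH.split())
    print(f"character similarity: {char_similarity(FREEOCR, TRUTH):.2f}")
    print(f"words read correctly: {exact_word_hits(FREEOCR, TRUTH)} of {total_words}")
```

Pasting the output of any other tool into a second string and scoring it the same way gives a like-for-like comparison on the same scan.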
Do you have a favorite OCR program (free or not)? I'd love to hear from some of you in the comments.
Last time I wrote about the death of Pinkney J. Tuggle, and while searching for more information about him I ran across the obituary for his wife, Sarah Whitehead Battle Carter Tuggle. This one is shorter and very peculiar as it doesn't give a lot of information:
A Sudden Death.
Mrs. Sarah Tuggle, wife of Mr. Pinkney Tuggle,
died suddenly yesterday at the residence of her
daughter, Mrs. Pierce Mims, on Butler street. She
ate a hearty dinner and was apparently well. She
complained a little on sitting down, however, and
died before any one could reach her.
The Atlanta Constitution - 8 May 1883.
Once again, the name of their son-in-law, Pierce Mims, is mentioned, but this time they live on Butler street. I checked the 1883 Atlanta City Directory (page 439) and found that Pinckney J. Tuggle, a merchant, was renting at 9 Butler Street. In the address listings (page 119) there are actually 3 people listed as living at this address: P. Mims, J.P. (wrong initials) Tuggle and W. Hanley. I wonder who Hanley was.
So what does it mean that she "complained a little" and "died before any one could reach her?" This is very odd. I wonder how I can find out more about this incident.
Anyway, I just thought of another reason that Pinkney didn't want to be buried in Greene County at his father's plantation. His wife died 2 years before him and was buried at Oakland Cemetery in Atlanta.
I am going to write a follow-up to this post on two topics that annoyed me:
1. Why does Ancestry.com hide the city directories where you can't easily find them?
2. Why are some OCR products so terrible?