I wanted to post this to give some folks an idea of the frustration you can expect when dealing with some free OCR tools. I try to use OCR (optical character recognition) to transcribe information from images I find mostly in online resources but sometimes also ones I scan or photograph. There are some extremely 'clean' resources out there that have been scanned in high-res and will look great in any OCR tool. But there are some awful scans out there as well. Let's run through an example.
In my last post I wrote about the obituary of Sarah Tuggle. I found the scan of the obit from 1883 on Ancestry.com and knew right away that this would not be an easy one to convert to text:
First of all, for some reason, Ancestry.com has recently started downloading images in PNG format. While this is a great format and is a close second to TIF, not many OCR applications can read it, so you have to convert it with some other tool. Luckily the basic Microsoft Office Picture Manager will do that in no time. But as you can see, the image is in extremely bad shape.
I tried first with PaperPort, which is a document organization tool that came with my DocuPen (an excellent handheld pen-sized scanner). PaperPort has a terrific OCR tool which works quickly and almost flawlessly when you deal with a good source image. But this is what I got with PaperPort:
sale.
Close, right? That was the original PNG. Then I tried the converted JPG:
.1 -d— of na
o~ueru~~ e siw rr.'i~'~:~ ~r ove.n~w a..n ne.. .eror..o
of a.. riu~~n r~ui:.
Not much better. I also have an OCR tool that came with my terrific HP OfficeJet Pro 8500. But I can never get it to work on images that were not scanned at a high DPI, and it is clunky and not very user-friendly. I tried it anyway and just got frustrated some more.
aim . .~ me .~ aor nee
~:~~ ° .«ac.o
Then I remembered that I had a great free OCR tool somewhere on the 70GB hard drive of my computer, but since I hadn't used it in a while I couldn't remember what it was called and couldn't find it anywhere. So I went looking for a good OCR tool online. And there are a lot of those out there.
SimpleOCR looked promising but it couldn't convert the file at all. I tried a different, good-quality image and it still produced a lot of errors. The interesting feature was that it lets you choose from a drop-down list which word you want to use when it is not 100% sure what it scanned. Also, it has a 14-day trial for handwriting recognition, but you have to teach the system how you write and go through a whole training exercise. That might come in handy some day.
Another free program that intrigued me was TopOCR. The interesting thing here is that it is intended for photo capture with cameras of at least 3 megapixels. I was sure it would be able to handle some bad scans, but this is what I got:
A Adds Ads, Am, Carob Beef ~e, Bulb Or Err. PIU~DeJ Tugger
It basically found only one word right - Butler. So this was not going to work. It is a very quick tool though, and it lets you edit the outcome in a side-by-side view next to the original:
dled ~uddonl7 ye~lerd~^r at Me r~ld~ ace of h
46ugbler, Art. Plorco llf Inure, on Butler~l~eL 8,
nob * try dlnner *ad wry ~ppuenllr troll, 81 o^lr~n~d & IlUle ox ~^lll0~ d~, howe~ot, Al . dl^cd league -liar TV ~~ ~~
When I tried a good image I got pretty good results. But my problem is not with good images; it's the crappy ones I need help with.
So finally I found the program I had been using before. Obviously it's called FreeOCR. Doh! It also lets you view the result side-by-side with the original and open the recognized text in MS Word. I can't seem to get a screenshot of this application for some reason, but here is what I got when I ran it:
A lnddan Death.
In. Earnh Tuggln, wits nt Hr. Plukncy Tuggle.
dlcd suddnnly yesterdny at the ruldcnce cl her
daughter, Mr:. Plame Mlm, nn llutlaralrael. Shu
aw s Imm dinner and wu nppu-entlr wall. Sha
rnmulnmnij A lime ou smlug clown, however, and
dlud \».|‘un; any cue could mach her. _
The recognition wasn't great, but it was the closest I could get. And there was no difference between PNG and JPG either. When I ran better scans through FreeOCR it did great too. And it's free!
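For anyone who wants to put a rough number on "closest", one option is to compare each tool's output against a hand-typed reading of the same line. The sketch below uses Python's standard difflib module for this. The reference line is only an assumed reading of the scan, not a verified transcription, and the helper names similarity and rank_tools are made up for this illustration. The two OCR strings are the matching lines from the TopOCR and FreeOCR output shown in this post.

```python
from difflib import SequenceMatcher

# Assumption: a hand-typed guess at what this line of the obituary says.
# It is not a verified transcription of the original scan.
REFERENCE = "died suddenly yesterday at the residence of her"

# The same line as each tool produced it, copied from the output in this post.
OCR_OUTPUT = {
    "TopOCR": "dled ~uddonl7 ye~lerd~^r at Me r~ld~ ace of h",
    "FreeOCR": "dlcd suddnnly yesterdny at the ruldcnce cl her",
}


def similarity(reference: str, candidate: str) -> float:
    """Return a similarity ratio between 0.0 (nothing alike) and 1.0 (identical)."""
    return SequenceMatcher(None, reference, candidate).ratio()


def rank_tools(reference: str, outputs: dict) -> list:
    """Return tool names sorted from most to least similar to the reference."""
    return sorted(
        outputs,
        key=lambda name: similarity(reference, outputs[name]),
        reverse=True,
    )


if __name__ == "__main__":
    for name in rank_tools(REFERENCE, OCR_OUTPUT):
        print(f"{name}: {similarity(REFERENCE, OCR_OUTPUT[name]):.2f}")
```

This is a character-level comparison, so it only says how many keystrokes of cleanup a result might need, not whether the words are readable. A single line is also a tiny sample, so treat the ranking as a hint rather than a verdict on either tool.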
Do you have a favorite OCR program (free or not)? I'd love to hear from some of you in the comments.