
Free OCR Tools - Frustration  

Posted by Abba-Dad

I wanted to post this to give some folks an idea of the frustration you can expect when dealing with some free OCR tools. I try to use OCR (optical character recognition) to transcribe information from images I find mostly in online resources but sometimes also ones I scan or photograph. There are some extremely 'clean' resources out there that have been scanned in high-res and will look great in any OCR tool. But there are some awful scans out there as well. Let's run through an example.

In my last post I wrote about the obituary of Sarah Tuggle. I found the scan of the obit from 1883 on Ancestry.com and knew right away that this would not be an easy one to convert to text:


First of all, for some reason, Ancestry.com has recently started downloading images in PNG format. While this is a great format and is a close second to TIF, not many OCR applications can read it, so you have to convert it with some other tool. Luckily the basic Microsoft Office Picture Manager will do that in no time. But as you can see, the image is in extremely bad shape.

I tried first with PaperPort, which is a document organization tool that came with my DocuPen (an excellent handheld pen-sized scanner). PaperPort has a terrific OCR tool which works quickly and almost flawlessly when you deal with a good source image. But this is what I got with PaperPort:

sale.
.1 -d— of na
o~ueru~~ e siw rr.'i~'~:~ ~r ove.n~w a..n ne.. .eror..o
Close, right? That was the original PNG. Then I tried the converted JPG:
of a.. riu~~n r~ui:.
aim . .~ me .~ aor nee
wu r~
~:~~ ° .«ac.o
Not much better. I also have an OCR tool that came with my terrific HP OfficeJet Pro 8500. But I can never get it to work on images that were not scanned at a high DPI and it is clunky and not very user-friendly. I tried it anyway and just got frustrated some more.

Then I remembered that I had a great free OCR tool somewhere on the 70GB hard drive of my computer, but since I hadn't used it in a while I couldn't remember what it was called and couldn't find it anywhere. So I went looking for a good OCR tool online. And there are a lot of those out there.

SimpleOCR looked promising but it couldn't convert the file at all. I tried another, good image and it had a lot of errors anyway. The interesting feature was that it lets you choose from a drop-down list which word you want to use when it is not 100% sure what it scanned. Also, it has a 14-day trial for handwriting recognition, but you have to teach the system how you write and go through a whole training exercise. That might come in handy some day.
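To give an idea of how a "pick the right word" drop-down could be filled in, here is a rough sketch using Python's standard `difflib` module. This is only my illustration of the idea, not how SimpleOCR works internally, and the tiny word list and the `suggest` function are made up for the example.

```python
import difflib

# Hypothetical mini word list; a real tool would use a full dictionary.
WORD_LIST = ["daughter", "dinner", "death", "sudden",
             "suddenly", "yesterday", "residence"]

def suggest(ocr_word, word_list=WORD_LIST, max_choices=3, cutoff=0.6):
    """Return up to max_choices known words that resemble ocr_word,
    best match first. An empty list means nothing was close enough."""
    return difflib.get_close_matches(
        ocr_word.lower(), word_list, n=max_choices, cutoff=cutoff
    )

if __name__ == "__main__":
    # A misread word goes in, a short list of candidates comes out.
    print(suggest("suddnnly"))
```

The `cutoff` value controls how forgiving the matching is. Lower it and you get more candidates, but also more nonsense.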

Another free program that intrigued me was TopOCR. The interesting thing here is that it is intended for photo capture with cameras of at least 3 megapixels. I was sure it would be able to handle some bad scans but this is what I got:
A Adds Ads, Am, Carob Beef ~e, Bulb Or Err. PIU~DeJ Tugger

dled ~uddonl7 ye~lerd~^r at Me r~ld~ ace of h

46ugbler, Art. Plorco llf Inure, on Butler~l~eL 8,

nob * try dlnner *ad wry ~ppuenllr troll, 81 o^lr~n~d & IlUle ox ~^lll0~ d~, howe~ot, Al . dl^cd league -liar TV ~~ ~~
It basically found only one word right - Butler. So this was not going to work. It is a very quick tool though and lets you edit the outcome in a side-by-side view next to the original:


When I tried a good image I got pretty good results. But my problem is not with good images, it's the crappy ones I need help with.

So finally I found the program I had been using before. Obviously it's called FreeOCR. Doh! It also lets you view the result side-by-side with the original and open the recognized text in MS Word. I can't seem to get a screenshot of this application for some reason but here is what I got when I ran it:
A lnddan Death.
In. Earnh Tuggln, wits nt Hr. Plukncy Tuggle.
dlcd suddnnly yesterdny at the ruldcnce cl her
daughter, Mr:. Plame Mlm, nn llutlaralrael. Shu
aw s Imm dinner and wu nppu-entlr wall. Sha
rnmulnmnij A lime ou smlug clown, however, and
dlud \».|‘un; any cue could mach her. _
The recognition wasn't great, but it was the closest I could get. And there was no difference between PNG and JPG either. When I ran better scans through FreeOCR it did great too. And it's free!
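If you want something more objective than eyeballing the output, one option is to type up a short passage by hand and score each tool against it. Here is a minimal sketch with Python's standard `difflib` module. The `similarity` function is my own helper, and the reference headline is just my guess at what the scan says, not a verified transcription.

```python
import difflib

def similarity(ocr_text, reference_text):
    """Return a 0..1 similarity ratio between OCR output and a
    hand-typed reference. 1.0 means the two texts are identical."""
    # Normalize case and whitespace so layout differences don't count.
    a = " ".join(ocr_text.lower().split())
    b = " ".join(reference_text.lower().split())
    return difflib.SequenceMatcher(None, a, b).ratio()

if __name__ == "__main__":
    # Assumption: my reading of the headline in the scan.
    reference = "A Sudden Death."
    # First line of the FreeOCR output shown above.
    freeocr_line = "A lnddan Death."
    print(f"FreeOCR headline score: {similarity(freeocr_line, reference):.2f}")
```

Running the same passage through each tool and comparing the scores would turn "the closest I could get" into an actual number.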

Do you have a favorite OCR program (free or not)? I'd love to hear from some of you in the comments.

How to run the next #Scanfest on #Twitter  

Posted by Abba-Dad

Warning: This has nothing to do with family history except for the fact I propose an interesting way to use Twitter for the next scanfest.

Since I still think that Twitter is just a big single-channel chat room, I thought it might be a great way to run scanfest. For those who don't know, scanfest is the once-a-month multi-user chat event that brings genealogists together as they go through the mundane task of scanning old photos and documents. It's on the last Sunday of every month and is usually hosted by @kidmiff (Miriam Robins Midkiff of the Ancestories blog).

In the past scanfest has been run on several platforms with varying success. And here's why I think Twitter will be very successful:

  1. Anyone can join.
  2. No limitation on number of participants.
  3. Get genealogists more involved in Twitter.
  4. Twitter is searchable and might lead to new relative connections somewhere down the road.
  5. Immediately archived, so if someone is late and wants to catch up they can.
  6. People can get updates on their mobile devices even if they can't participate.

Now some of you might think I've lost my mind because the standard Twitter web site is static and requires constant refreshing. But there are some excellent tools out there that will allow for really easy chatting:

  1. Twitterfall - This site really turns Twitter into a chat client. The beauty is in the filtering capabilities. If you just add #scanfest to your filter on the left you will only see tweets about scanfest! You can control how quickly the tweets fall down the screen by setting the speed on the right, as well as the animation effect, theme and other settings. You can retweet, reply, favorite, direct message and follow new users!
  2. Twitpic - A simple photo sharing site for Twitter users. If you want to share one of your recently scanned images, just twitpic it and add the #scanfest tag and everyone can see it!
  3. TweetDeck - TweetDeck is an Adobe AIR Twitter desktop application that lets you monitor several columns of info and apply filters and settings to each one. You can open up a @replies column to see if someone was replying to you specifically. You can also submit an image to Twitpic right out of TweetDeck.
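For the curious, the hashtag filtering these tools do boils down to something like the sketch below. This is just my illustration in Python, not the actual code behind Twitterfall or TweetDeck, and the `has_hashtag` helper and the sample tweets are made up.

```python
import re

def has_hashtag(tweet, tag="scanfest"):
    """Return True if the tweet contains #tag as a whole hashtag,
    ignoring upper/lower case. '#scanfest2' does not count."""
    pattern = r"(?<!\w)#" + re.escape(tag) + r"(?!\w)"
    return re.search(pattern, tweet, flags=re.IGNORECASE) is not None

if __name__ == "__main__":
    # Hypothetical sample tweets, for illustration only.
    tweets = [
        "Just scanned a box of old family photos #scanfest",
        "Anyone else watching the game tonight?",
        "Halfway through grandma's letters! #Scanfest",
    ]
    for tweet in tweets:
        if has_hashtag(tweet):
            print(tweet)
```

Everyone in the chat just has to remember to include the tag, otherwise their tweet never shows up in the filtered view.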

I'm sure there are other methods that can be used, but these are just my initial thoughts. I'm sure Thomas will have something to say about this topic :-)

Well that's it. That was my idea. I have never actually participated in scanfest, because it's on a weekend and during time I usually spend with family, but I thought this would be a great way to do it. And basically this could be a way to do any sort of group chat. The only drawback is that people who post their tweets to Facebook might annoy their followers there. But one way to get around that is to use @scanfest at the start of your tweet, or some other method to keep it from updating Facebook. I'm just thinking out loud at this point.

What do you think?

BTW, you can follow me on Twitter here: @abba_dad.