Document scanning tools & techniques


Current Environment (2010)

Contex FSS 8300 COPY DSP large-format (E-size) scanner
Contex WIDEimage imaging software

I tried to throw this out, but a friend wanted me to keep it.  It is B&W only and has multiple cameras (that require occasional alignment).  I'm running this out of the VM.

Adobe Capture 3.0

Haven't found anything to improve on this yet.  It is now discontinued and unavailable (except maybe through eBay).  I'm using the personal edition and have a virtually unlimited supply of dongle counts from eBay purchases (in fact I have an unlimited dongle, but it appears to work only with the cluster edition, which I haven't bothered loading).  I'm running it and its associated tools in a VMware VM to avoid reinstalling and reconfiguring them.

TMSSequoia ScanFix 4.2

This is a reasonable app for automated page fixups on bitonal images, such as hole and speck removal.  If I'm going to continue to use Capture with grayscale, I'll need to find something to replace ScanFix.

Adobe Acrobat (including Adobe Catalog)
libtiff v3.5.6
ImageMagick (and Perl Magick)
Kodak Imaging for Windows 

Old Environment (2001)

Panasonic KV-SS55EX duplex grayscale 300dpi scanner
Contex FSS 8300 COPY DSP large-format (E-size) scanner
Contex WIDEimage imaging software
Adobe Capture 3.0
Adobe Acrobat (including Adobe Catalog)
libtiff v3.5.6
ImageMagick (and Perl Magick)
Kodak Imaging for Windows 
TMSSequoia ScanFix 4.2


Notes

Skew

Document skew is a problem for OCR software; it substantially degrades the recognition.  One source of skew is misaligned feeding through the scanner, but another common source is misalignment of the text in the original.

Adobe Acrobat contains its own skew detection and correction engine.  It's not very good, at least at the correction.  If you see a word that has a strange offset splitting it, that's what it's from.  Skew correction isn't as simple as rotating the text to un-skew it; in a bitonal image this results in character distortion that interferes with recognition.  There are various techniques for getting around this; unfortunately the one Capture uses isn't very good.  The deskew engine in ScanFix is better, and I'll be running a few tests using it, but I don't have a way to disable the internal Capture deskew, so there may be some undesirable interaction.
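
To make the distortion concrete, here is a rough sketch in Python (not what Capture or ScanFix actually do; I don't know their internals).  It takes a thin horizontal stroke tilted by 2 degrees and renders it two ways: bitonal, where each column must put its ink in one whole pixel, and grayscale, where the ink can be split between two rows.  The function names and the one-pixel-stroke model are my own simplification for illustration.

```python
import math

def stroke_positions(width, degrees, y0=10.0):
    """Ideal (sub-pixel) row of a thin stroke tilted by `degrees` about the middle column."""
    cx = (width - 1) / 2.0
    slope = math.tan(math.radians(degrees))
    return [y0 + (x - cx) * slope for x in range(width)]

def render_bitonal(ideal):
    """Bitonal: each column puts all of its ink in the single nearest row."""
    return [{math.floor(y + 0.5): 255} for y in ideal]

def render_gray(ideal):
    """Grayscale: split the ink between the two nearest rows by distance."""
    cols = []
    for y in ideal:
        top = math.floor(y)
        frac = y - top
        cols.append({top: round(255 * (1 - frac)), top + 1: round(255 * frac)})
    return cols

def max_error(cols, ideal):
    """Largest distance (in pixels) between a column's ink centroid and the ideal row."""
    worst = 0.0
    for col, y in zip(cols, ideal):
        total = sum(col.values())
        centroid = sum(row * ink for row, ink in col.items()) / total
        worst = max(worst, abs(centroid - y))
    return worst

ideal = stroke_positions(width=200, degrees=2.0)
bitonal_error = max_error(render_bitonal(ideal), ideal)
gray_error = max_error(render_gray(ideal), ideal)
print("bitonal worst-case error (px):", bitonal_error)
print("grayscale worst-case error (px):", gray_error)
```

The bitonal stroke ends up as a staircase that is off by close to half a pixel in places, while the grayscale stroke stays within a small fraction of a pixel of the ideal line.  On small type those half-pixel jumps are what deform the characters.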

My long-term plan for deskew (unless Capture 4.0 is substantially better) is to convert all of the images to grayscale, detect the skew externally, and then use true rotation to deskew them.  Characters aren't deformed in grayscale, since pixels can contain intermediate values instead of just on or off.  This grayscale conversion doesn't result in a substantially larger image after compression, since the original source is still bitonal.
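
A minimal sketch of that plan in plain Python, under my own assumptions: the page is a list of rows of 0-255 values, the skew is detected with a projection profile (try a range of angles and keep the one that gives the sharpest row histogram of dark pixels), and the rotation uses bilinear interpolation.  The projection-profile method is just one common way to detect skew, swapped in here for illustration; the function names, the 5 degree search range and the 0.25 degree step are hypothetical choices, and a real run would go through ImageMagick or similar rather than nested Python loops.

```python
import math

def rotate_gray(img, degrees):
    """Rotate a grayscale image about its centre with bilinear interpolation.
    Pixels that map outside the source are left white (255)."""
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    c, s = math.cos(math.radians(degrees)), math.sin(math.radians(degrees))
    out = [[255] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = x - cx, y - cy
            # Inverse-map the destination pixel back into the source image.
            sx = cx + c * dx + s * dy
            sy = cy - s * dx + c * dy
            if 0 <= sx <= w - 1 and 0 <= sy <= h - 1:
                x0, y0 = int(sx), int(sy)
                x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
                fx, fy = sx - x0, sy - y0
                top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
                bot = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
                out[y][x] = int(round(top * (1 - fy) + bot * fy))
    return out

def profile_score(img, degrees, threshold=128):
    """Sum of squared row counts of dark pixels, as if the page were rotated
    by `degrees` with rotate_gray.  Higher means the text lines are flatter."""
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    c, s = math.cos(math.radians(degrees)), math.sin(math.radians(degrees))
    rows = [0] * h
    for y in range(h):
        for x in range(w):
            if img[y][x] < threshold:
                ry = int(round(cy + s * (x - cx) + c * (y - cy)))
                if 0 <= ry < h:
                    rows[ry] += 1
    return sum(n * n for n in rows)

def estimate_correction(img, max_degrees=5.0, step=0.25):
    """Angle (degrees) to pass to rotate_gray to level the text lines."""
    steps = int(max_degrees / step)
    angles = [i * step for i in range(-steps, steps + 1)]
    return max(angles, key=lambda a: profile_score(img, a))

def make_page(h=64, w=128):
    """Synthetic bitonal 'page' stored as grayscale: 2-pixel text lines."""
    page = [[255] * w for _ in range(h)]
    for y in range(8, h - 8, 8):
        for x in range(14, w - 14):
            page[y][x] = 0
            page[y + 1][x] = 0
    return page

page = make_page()
skewed = rotate_gray(page, 2.0)            # simulate a crooked scan
correction = estimate_correction(skewed)   # detect the skew externally
deskewed = rotate_gray(skewed, correction) # true rotation in grayscale
print("estimated correction:", correction, "degrees")
```

The detection only looks at which pixels are dark, so it could just as well run on the bitonal original; only the final rotation needs the grayscale copy.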

Note that there is usually a corresponding .tif file for each of my .pdf files; these are pre-deskew.