Document scanning tools & techniques
Current Environment (2010)
Contex FSS 8300 COPY DSP large-format (E-size) scanner
Contex WIDEimage imaging software
I tried to throw this out, but a friend wanted me to keep it.
B&W only, with multiple cameras (that require occasional
alignment). Running this out of the VM.
Adobe Capture 3.0
I haven't found anything that improves on this yet. It is now discontinued
and unavailable (except maybe through eBay). I'm using the personal
edition and have a virtually unlimited supply of dongle counts from
eBay purchases (in fact I have an unlimited dongle, but it appears to
only work with the cluster edition, which I haven't bothered loading).
I'm running it and its associated tools in a VMware VM to avoid
reinstalling and reconfiguring them.
TMSSequoia ScanFix 4.2
This is a reasonable app for automated page fixups on bitonal images,
such as hole and speck removal. If I'm going to continue to use
Capture with grayscale, I'll need to find something to replace it.
Adobe Acrobat (including Adobe Catalog)
libtiff v3.5.6
ImageMagick (and PerlMagick)
Kodak Imaging for Windows
Old environment (2001)
Panasonic KV-SS55EX duplex grayscale 300dpi scanner
Contex FSS 8300 COPY DSP large-format (E-size) scanner
Contex WIDEimage imaging software
Adobe Capture 3.0
Adobe Acrobat (including Adobe Catalog)
libtiff v3.5.6
ImageMagick (and PerlMagick)
Kodak Imaging for Windows
TMSSequoia ScanFix 4.2
Notes
Skew
Document skew is a problem for OCR software: it substantially degrades
recognition. One source of skew is misaligned feeding by the scanner,
but another common source is misalignment of the text in the original.
Adobe Acrobat contains its own skew detection and correction engine.
It's not very good, at least at the correction. If you see a word that
has a strange offset split in it, that's what it's from. Skew correction
isn't as simple as rotating the text to un-skew it: in a bitonal image
this results in character distortion that interferes with recognition.
There are various techniques to get around this; unfortunately the one
Capture uses isn't very good. The deskew engine in ScanFix is better, and
I'll be running a few tests using it, but I don't have a way to disable
the internal Capture deskew, so there may be some undesirable interaction.
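To illustrate the distortion, here is a minimal sketch (not what Capture or ScanFix actually do internally) that rotates a bitonal image with plain nearest-neighbour sampling. The function name, the test stroke, and the 3 degree angle are all made up for the example. Because every output pixel has to be exactly on or off, a straight one-pixel stroke comes out as a staircase.

```python
import math

def rotate_nearest(img, degrees):
    """Rotate a bitonal image (list of rows of 0/1) about its centre.

    Inverse mapping with nearest-neighbour sampling: each output pixel
    copies the single closest source pixel, so it is always 0 or 1.
    """
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    a = math.radians(degrees)
    cos_a, sin_a = math.cos(a), math.sin(a)
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Map the output pixel back into the source image.
            sx = cos_a * (x - cx) + sin_a * (y - cy) + cx
            sy = -sin_a * (x - cx) + cos_a * (y - cy) + cy
            ix, iy = int(round(sx)), int(round(sy))
            if 0 <= ix < w and 0 <= iy < h:
                out[y][x] = img[iy][ix]
    return out

# Hypothetical test image: a one-pixel-thick horizontal stroke.
width, height = 41, 9
stroke = [[1 if y == 4 else 0 for x in range(width)] for y in range(height)]

rotated = rotate_nearest(stroke, 3.0)  # 3 degrees is an assumed skew
for row in rotated:
    print("".join("#" if p else "." for p in row))

# Rows that contain ink after rotation; more than one means the
# stroke has been broken into steps.
rows_used = sorted({y for y in range(height)
                    for x in range(width) if rotated[y][x]})
```

The printed picture shows the stroke jumping from one row to the next at a couple of points along its length. On real text those jumps land in the middle of characters, which is the kind of deformation that hurts recognition.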
My long-term plan for deskew (unless Capture 4.0 is substantially
better) is to convert all of the images to grayscale, detect the skew
externally, and then use true rotation to deskew them. Characters aren't
deformed in grayscale, since pixels can take intermediate values instead
of just on or off. This grayscale conversion doesn't result in a
substantially larger image after compression, since the original source
is still bitonal.
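A rough sketch of the grayscale idea follows. It assumes bilinear interpolation as the "true rotation" (my choice for illustration; any real tool may use a different resampling filter), and the function name, test stroke, and 3 degree angle are invented for the example. The skew angle is simply passed in, since detection is a separate step.

```python
import math

def rotate_bilinear(img, degrees, background=0.0):
    """Rotate a grayscale image (rows of floats in 0..1) about its centre.

    Inverse mapping with bilinear interpolation: each output pixel is a
    weighted blend of the four source pixels around the mapped position.
    """
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    a = math.radians(degrees)
    cos_a, sin_a = math.cos(a), math.sin(a)

    def sample(ix, iy):
        # Pixels outside the source count as background.
        if 0 <= ix < w and 0 <= iy < h:
            return img[iy][ix]
        return background

    out = [[background] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sx = cos_a * (x - cx) + sin_a * (y - cy) + cx
            sy = -sin_a * (x - cx) + cos_a * (y - cy) + cy
            x0, y0 = math.floor(sx), math.floor(sy)
            fx, fy = sx - x0, sy - y0
            top = (1 - fx) * sample(x0, y0) + fx * sample(x0 + 1, y0)
            bottom = ((1 - fx) * sample(x0, y0 + 1)
                      + fx * sample(x0 + 1, y0 + 1))
            out[y][x] = (1 - fy) * top + fy * bottom
    return out

# Hypothetical bitonal source: a one-pixel-thick horizontal stroke.
width, height = 41, 9
bitonal = [[1 if y == 4 else 0 for x in range(width)] for y in range(height)]

# "Convert to grayscale": same pixels, but now allowed to hold fractions.
gray = [[float(p) for p in row] for row in bitonal]
deskewed = rotate_bilinear(gray, 3.0)  # 3 degrees is an assumed skew

# Count pixels holding an intermediate value, and sum the total ink.
partial = sum(1 for row in deskewed for p in row if 0.0 < p < 1.0)
total_ink = sum(p for row in deskewed for p in row)
print(partial, round(total_ink, 2))
```

Instead of snapping to one row or the next, the stroke is spread across neighbouring rows as partial gray values, and the summed ink stays close to the original 41 pixels, so the shape is preserved rather than chopped into steps.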
Note that there is usually a corresponding .tif file for each of my
.pdf files; these are pre-deskew.