Recently I began playing around with OCR using Tesseract at work. Getting it to work proved to be a pain, since I had no administrator rights for installing software.
Here is a brief outline of the approach I took.
pypdfocr was the package I used to convert PDFs (non-PDF files can be used as input as well). Based on its documentation, the external requirements I used were:
- Portable version of Tesseract (download the “tesseract-XXX-win32-portable.zip”)
- Ghostscript Portable, which is available from PortableApps.com
- ImageMagick portable, which is available in the binary download section; look for “ImageMagick-XXX-portable-XXX.zip”
- Poppler for Windows, for which I used the unofficial binaries located here
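Since none of these portable downloads register themselves with the system, one way to make them visible to a Python session is to put their folders on PATH for that process only, which needs no admin rights. The snippet below is a minimal sketch of that idea, not necessarily how pypdfocr itself expects to be configured. The folder locations and the executable names (`tesseract`, `gswin32c`, `convert`, `pdftoppm`) are assumptions for illustration and will depend on where the zips were unpacked and which versions were downloaded.

```python
import os
import shutil


def prepend_to_path(directories, env=None):
    """Put each directory at the front of PATH, skipping ones already on it."""
    env = os.environ if env is None else env
    current = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    new = [d for d in directories if d not in current]
    env["PATH"] = os.pathsep.join(new + current)
    return env["PATH"]


def find_tools(names, path=None):
    """Map each executable name to its resolved location, or None if missing."""
    return {name: shutil.which(name, path=path) for name in names}


if __name__ == "__main__":
    # Hypothetical unpack locations; change these to your own folders.
    portable_dirs = [
        r"C:\Users\me\portable\tesseract",
        r"C:\Users\me\portable\ghostscript\bin",
        r"C:\Users\me\portable\imagemagick",
        r"C:\Users\me\portable\poppler\bin",
    ]
    prepend_to_path(portable_dirs)

    # Assumed executable names; check the unpacked folders for the real ones.
    for name, location in find_tools(
        ["tesseract", "gswin32c", "convert", "pdftoppm"]
    ).items():
        print(name, "->", location or "NOT FOUND")
```

The change only affects the running Python process and anything it launches, so nothing on the machine is modified permanently. Any tool reported as NOT FOUND points to a folder or executable name that needs correcting.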
As a side note, I really should learn how to build these unofficial binaries myself, so that I understand how to redistribute them. That would include making my own suite of portable apps for personal use. It will probably be a project I take on in the future.