A Guide on OCR with tesseract 3.03

Tesseract is tough … so tough indeed, even Chuck Norris would have to check the manual twice. Not kidding you. Okay, so this article aims to structure what I needed to learn about tesseract to OCR-convert PDFs to text, and how to train tesseract to recognize new fonts. Let me dampen your expectations: you *will* have to read further texts (esp. the official documentation) to actually perform successful training! This text describes the usage of tesseract 3.03 RC on Ubuntu 14.04. Tesseract is also available for other Linux distributions and for Windows, and the workflow will be mostly the same across OSes, though of course some of the commands I use are specific to Ubuntu. Also mind that tesseract 3.03 is considerably different from 3.02, which in turn differs from 3.01; the changes are in part more fundamental than you might expect from the version numbers.
