A Guide on OCR with tesseract 3.03

Tesseract is tough … so tough indeed, even Chuck Norris would have to check the manual twice. Not kidding you. Okay, so this article aims to structure what I needed to learn about tesseract to OCR-convert PDFs to text and to train tesseract for new fonts. Let me dampen your expectations: you *will* have to read further texts (esp. the official documentation) to actually perform successful training! This text describes the usage of tesseract 3.03 RC on Ubuntu 14.04. Tesseract is also available for other Linux distributions and Windows. The workflow will be mostly the same across OSes, though some of the commands I use are specific to Ubuntu. Also mind that tesseract 3.03 is considerably different from 3.02, which in turn differs from 3.01; the changes are more fundamental than you might expect from the version numbers.

Installation of tesseract

Installing tesseract so that you can use the training tools requires a number of potentially difficult steps on Ubuntu 14.04 (in my case, though, it worked like a charm):

  1. Compilation of Leptonica 1.7+
  2. Installation of dependencies, then download and compilation of tesseract 3.03 RC1
  3. Building of training tools

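Figure out where the configuration and traineddata files are located. The best place is /usr/local/share/tessdata. If they are somewhere else, set $TESSDATA_PREFIX so that tesseract finds that tessdata folder (tesseract 3.x expects the variable to point to the parent directory of tessdata). Custom configuration files are supposed to be placed in the configs subfolder.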
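To check which folder will actually be used, a small lookup helps. This is only a sketch: it assumes the tesseract 3.x behaviour where $TESSDATA_PREFIX names the parent directory and "tessdata" is appended to it, and it falls back to the location mentioned above. The function names are made up for illustration.

```python
import os
from pathlib import Path

# Default location named in the text above.
DEFAULT_TESSDATA = Path("/usr/local/share/tessdata")


def tessdata_dir(env=os.environ):
    """Return the tessdata folder tesseract is expected to use.

    Assumption: as in tesseract 3.x, TESSDATA_PREFIX names the *parent*
    directory and "tessdata" is appended to it.
    """
    prefix = env.get("TESSDATA_PREFIX")
    if prefix:
        return Path(prefix) / "tessdata"
    return DEFAULT_TESSDATA


def configs_dir(env=os.environ):
    """Custom configuration files go into the configs subfolder."""
    return tessdata_dir(env) / "configs"


if __name__ == "__main__":
    print("traineddata files:", tessdata_dir())
    print("custom configs:   ", configs_dir())
```

If the printed folder does not contain your .traineddata files, the environment variable is the first thing to check.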

If you don’t intend to train tesseract but only want to use it for OCR directly, installation on Ubuntu is no more and no less than: sudo apt-get install tesseract-ocr

Conversion of a PDF to an Image

For a regular sized font of about 11pt a good resolution is about 300 to 500 DPI.
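To get a feeling for those numbers: one point is 1/72 inch, so the resolution directly determines how many pixels a letter gets. The sketch below does that arithmetic and builds (but does not run) a render command. It assumes pdftoppm from poppler-utils as the converter, which is my choice for the example and not necessarily the tool used for this article; the file names are placeholders.

```python
def font_size_px(font_size_pt, dpi):
    """Nominal font size in pixels: one point is 1/72 inch."""
    return font_size_pt / 72.0 * dpi


def pdftoppm_command(pdf_path, out_prefix, dpi=300):
    """Build a pdftoppm call that renders every page to a PNG file.

    Assumption: pdftoppm (poppler-utils) is installed; -r sets the
    resolution in DPI and -png selects PNG output.
    """
    return ["pdftoppm", "-r", str(dpi), "-png", pdf_path, out_prefix]


if __name__ == "__main__":
    for dpi in (300, 500):
        print(f"11pt at {dpi} DPI is about {font_size_px(11, dpi):.1f} px")
    print(" ".join(pdftoppm_command("document.pdf", "page", dpi=300)))
```

So an 11pt font ends up somewhere between roughly 46 and 76 pixels in nominal size within the recommended range.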

Application of tesseract to an Image

The initial OCR result for …

(image: test)

might be …

… Meh!

Let me tell you though that for standard (sane) fonts like Arial or Times New Roman the out-of-the-box performance is good: expect an error rate of maybe 1% if your document is of good optical quality. That’s the good part about tesseract: most of the time you won’t have to worry about training it.
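For reference, the basic call is tesseract, then the image, then the output base name (tesseract appends .txt), then optionally -l with the language. The sketch below builds that command and adds one common way to put a number on "error rate": the edit distance between the OCR output and the true text, divided by the length of the true text. The file names are placeholders, and the metric is my assumption about what an error rate could mean here, not a definition taken from tesseract.

```python
def ocr_command(image_path, out_base, lang="eng"):
    """Build the tesseract call; the result is written to out_base + '.txt'."""
    return ["tesseract", image_path, out_base, "-l", lang]


def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def char_error_rate(expected, actual):
    """Character errors relative to the length of the true text."""
    if not expected:
        raise ValueError("expected text must not be empty")
    return edit_distance(expected, actual) / len(expected)


if __name__ == "__main__":
    print(" ".join(ocr_command("test.png", "test")))
    print(f"{char_error_rate('hello world', 'hallo world'):.1%}")
```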

Create box file

Let’s assume the following training image …

(image: training)

The initial resulting box file might be …

Some letters are identified correctly, others are not. By the way, the first four numbers are the coordinates of the box (left-x, bottom-y, right-x, top-y) with the origin at the bottom left. The fifth number is the page index in case you use a multi-page TIFF. Whether to split two characters or to keep them in one box and assign it the correct value is a source of mystery and speculation. Common sense and putting yourself mentally into a machine learning algorithm’s shoes will help.
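Each line of a box file therefore has six fields: the symbol, then left, bottom, right, top, and the page index. A minimal parser for one line could look like the sketch below; the class and function names and the sample coordinates are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Box:
    char: str    # the recognised symbol
    left: int    # x of the left edge
    bottom: int  # y of the bottom edge (origin is at the bottom left)
    right: int   # x of the right edge
    top: int     # y of the top edge
    page: int    # page index, relevant for multi-page TIFFs


def parse_box_line(line):
    """Parse one box file line such as 'T 36 92 53 115 0'."""
    # Split from the right so the symbol itself is left untouched.
    char, left, bottom, right, top, page = line.rstrip("\r\n").rsplit(" ", 5)
    return Box(char, int(left), int(bottom), int(right), int(top), int(page))


def format_box_line(box):
    """Turn a Box back into the text form used in the box file."""
    return f"{box.char} {box.left} {box.bottom} {box.right} {box.top} {box.page}"


if __name__ == "__main__":
    box = parse_box_line("T 36 92 53 115 0")
    print(box, "width:", box.right - box.left, "height:", box.top - box.bottom)
```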

Correcting the box file

I think I read in some article in The Intercept that the CIA was torturing potential terrorists in those black sites by having them correct tesseract box files for texts of handwritten Sanskrit in case waterboarding didn’t work. If you indulge in correcting box files for longer than one hour, make sure you have tissues next to you as your brain might melt and drip from your nostrils. Don’t blame me if you ruin your shirt!

Anyway, my advice is to segment training into multiple steps. The first training will be tedious because tesseract will make many mistakes and you will have to correct a lot of little boxes. But you can use what tesseract learned for the next training step and its initial creation of the box files. So with every training step you increase the complexity of your training data.

To make correction, adjustment, insertion, deletion, merging and splitting of boxes a bit easier I recommend using a box file editor. jTessBoxEditor does a good job. Download, extract and then start it.

So the box file above might initially look like this:

(screenshot: the box file opened in jTessBoxEditor)

In the above case you would have to correct the value for the marked character from “T” to “F”, you would have to split “N O P” into three separate boxes etc. When you’re done don’t forget to save the box file edits.
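If you would rather script such a split than click through it, the sketch below cuts one box into equal-width boxes, one per character. This is deliberately naive (real glyphs are not equally wide), so treat the result as a starting point to adjust in the editor. The coordinates are invented for the example.

```python
def split_box(chars, left, bottom, right, top, page=0):
    """Split one box covering several characters into equal-width boxes.

    Returns box file lines. Naive assumption: all glyphs have the same width.
    """
    count = len(chars)
    if count == 0:
        raise ValueError("need at least one character")
    width = right - left
    lines = []
    for i, char in enumerate(chars):
        box_left = left + (width * i) // count
        box_right = left + (width * (i + 1)) // count
        lines.append(f"{char} {box_left} {bottom} {box_right} {top} {page}")
    return lines


if __name__ == "__main__":
    for line in split_box("NOP", 100, 50, 190, 80):
        print(line)
```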

Training tesseract

Tesseract expects the files involved to adhere to this naming scheme:

[language].[font name].exp[num]

The language might be eng2 (as “eng” already exists). The font name is Lobster Two. So the name of the training picture and its box file might be:

  • eng2.LobsterTwo.exp0.png
  • eng2.LobsterTwo.exp0.box
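Getting these names wrong is an easy way to waste an evening, so here is a small sketch that builds and checks them. The helper names are hypothetical, and the rule of dropping spaces from the font name is simply what the example above does with "Lobster Two".

```python
import re

# [language].[font name].exp[num], without a file extension
NAME_PATTERN = re.compile(r"^(?P<lang>[^.]+)\.(?P<font>[^.]+)\.exp(?P<num>\d+)$")


def training_base(lang, font, num=0):
    """Build the common base name; spaces in the font name are dropped."""
    return f"{lang}.{font.replace(' ', '')}.exp{num}"


def parse_training_base(name):
    """Split a base name into (language, font name, num) or raise ValueError."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a valid training file name: {name!r}")
    return match.group("lang"), match.group("font"), int(match.group("num"))


if __name__ == "__main__":
    base = training_base("eng2", "Lobster Two")
    print(base + ".png")
    print(base + ".box")
    print(parse_training_base(base))
```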

Now let’s get some training done – I recommend for now to just “accept” the steps taken – don’t question, follow slavishly – as if it was a religion – or some new Apple product.
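As a rough orientation, the classic tesseract 3.x training sequence can be written down as a list of commands. The sketch below only builds and prints them; it does not run anything. It follows the official 3.x training documentation as I understand it (box.train, unicharset_extractor, mftraining, cntraining, combine_tessdata), it leaves out optional steps such as shapeclustering and dictionary data, and it assumes a font_properties file with a single line for the font. Check the details against the documentation for your version.

```python
def font_properties_line(font, italic=0, bold=0, fixed=0, serif=0, fraktur=0):
    """One line of the font_properties file: name plus five 0/1 flags."""
    return f"{font} {italic} {bold} {fixed} {serif} {fraktur}"


def training_commands(lang, font, num=0, image_ext="png"):
    """Return the training steps as argument lists (nothing is executed)."""
    base = f"{lang}.{font}.exp{num}"
    return [
        # Extract features from the image and its corrected box file -> base.tr
        ["tesseract", f"{base}.{image_ext}", base, "box.train"],
        # Collect the character set from the box file -> unicharset
        ["unicharset_extractor", f"{base}.box"],
        # Shape training -> inttemp, pffmtable, shapetable, <lang>.unicharset
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", f"{lang}.unicharset", f"{base}.tr"],
        # Character normalisation training -> normproto
        ["cntraining", f"{base}.tr"],
        # combine_tessdata expects every file to carry the language prefix
        ["mv", "inttemp", f"{lang}.inttemp"],
        ["mv", "normproto", f"{lang}.normproto"],
        ["mv", "pffmtable", f"{lang}.pffmtable"],
        ["mv", "shapetable", f"{lang}.shapetable"],
        # Bundle everything into <lang>.traineddata (note the trailing dot)
        ["combine_tessdata", f"{lang}."],
    ]


if __name__ == "__main__":
    print("# font_properties:", font_properties_line("LobsterTwo"))
    for command in training_commands("eng2", "LobsterTwo"):
        print(" ".join(command))
```

The resulting eng2.traineddata then has to be copied into the tessdata folder before it can be selected with -l eng2.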

Did it work?

Okay – chances are that it didn’t work yet – you’ll have to reread this text and draw inspiration from further blogs and even the official documentation. But let’s assume everything did work – so, if I now re-OCR the test image …

… what I will get is …

… well, it’s a bit better :) Not much, but given the oddness of the font I fear we just have to put more effort into the training and provide much more data. It’s been suggested that there should be at least 10 samples per character. Also, our training data set assumes a larger font spacing, which would have to be addressed as well.
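Whether a training set meets that rule of thumb is easy to check from the box file itself. The sketch below counts how often each symbol occurs and lists the ones below a minimum. The threshold of 10 comes from the suggestion above; the alphabet passed in and the sample lines are placeholders.

```python
from collections import Counter


def samples_per_char(box_lines):
    """Count how many boxes each symbol has in a box file."""
    counts = Counter()
    for line in box_lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        # The symbol is everything before the last five fields.
        counts[line.rsplit(" ", 5)[0]] += 1
    return counts


def undersampled(counts, alphabet, minimum=10):
    """Symbols from alphabet with fewer than minimum samples."""
    return sorted(char for char in alphabet if counts[char] < minimum)


if __name__ == "__main__":
    lines = ["a 10 10 20 30 0"] * 10 + ["b 30 10 40 30 0"] * 3
    counts = samples_per_char(lines)
    print(undersampled(counts, "abc"))
```

With a real box file you would pass the lines of that file instead of the made-up list.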

At the End of the Day

There is a lot more stuff to learn about tesseract. And chances are that many things will change if 3.04 sees the light of day. But if you need to get OCR done I think delving into tesseract is well worth it. It’s terribly documented and the community is not very active, but it’s a very powerful tool nonetheless. Good luck!

9 thoughts on “A Guide on OCR with tesseract 3.03”

  1. Hey thanks for taking the time to write this and making a tool to help make this easier!

  2. Awesome article. Just read it did not applied yet but thanks a lot for your time on doing this.

  3. Hi Raffael,

    Thanks for your post! I have made a shell script to automatically install Leptonica and Tesseract (with training tools). You can find it here: https://github.com/kz/smart-treadmill/blob/master/vagrant/install.sh

    This is made for Vagrant’s trusty64 build (Ubuntu 14.04 LTS, 64bit) with the Vagrantfile here. However, if you download install.sh and run it directly without using Vagrant on the same OS, it should still work!

    Hopefully this helps anybody coming across this post. The above code is open source, and feel free to include it inside your main post!

    Regards,
    Kelvin Z.

  4. Hi Raffael

    Great post – have been dredging through the swamps for clear instructions written in plain English and yours was the gem that stood out! Thanks for writing this.

    I have successfully (read: painfully) trained Tesseract on a new font with English (eng) training data and it works! However when presented with standard fonts, Tesseract seems to have forgotten how to recognise these. I have, of course, replaced the eng.traineddata file with my own.

    I noticed you used eng2 in your example above and also a single font_properties entry. I suppose you have gotten around my problem by specifying eng2 as your language and having it co-exist peacefully with the standard eng.traineddata file.

    Have you done anything to “combine” the new font with existing trained data? Any suggestions on best practice around this?

    Thanks, and keep up the writing!

    Andy

    • Hi Andy,

      thanks for your feedback – highly appreciated.

      But – my tinkering with Tesseract is a bit too far in the past – which is why no relevant ideas for suggestions come to my mind regarding your question – sorry!

      Cheers

      Raffael

  5. Hey,

    Thank you for your post. It helped me a lot. Unfortunately I can’t find any informations regarding the recognition results after training tesseract in comparison to before. I would be glad, if you could help me in my serial number recognition with tesseract. Therefore a stackoverflow question summarizes my actual problem. Maybe you can have a look on it,

    https://stackoverflow.com/questions/31145200/achieve-better-recognition-results-via-training-tesseract

    best regards,
    Christoph

  6. This is a very useful and easy to implement way . It works.. thanks :-)

  7. Your writing is very funny and informative. I am going to be working in data analysis and have learned much from your website. Thanks!
