11a Photo OCR Implementation
Aug 13, 2017 14:11 · 814 words · 4 minutes read
Photo OCR Implementation
Problem Description
Getting computers to pull text out of images. Rather than just being given a character or a paragraph of text, this problem involves finding text in real-world images you might take with a mobile phone.
Given an image containing text that we want to read, the pipeline looks like this:
- Text detection
- Character segmentation
- Character classification
- (optional) language and spelling correction
Each section of the pipeline might have a team of people working on it, or just one person. The sections form natural points at which to split the workload between engineers, and the performance of each section can be tested separately.
An important question to ask in machine learning is what the pipeline should look like.
Text Detection
We train a classifier to recognise characters, using a set of images that contain characters (positive examples, $y=1$) and another set that don't contain characters ($y=0$).
We then use a sliding window technique to scan over the image. Usually we'll rescale each patch taken from the image down to a more manageable, fixed size for the classifier.
From this, we can create a processed image where white patches indicate text and black indicates no text. We can then apply an expansion operator - for each pixel in the image, if it's within 2-3 pixels of a white pixel, we mark it as white. This merges nearby detections into connected components, and we can draw a bounding box around each one. Boxes whose aspect ratio doesn't look like text can be discarded - text regions range from long sentences down to just a couple of characters, but they're usually wider than they are tall.
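As a rough sketch of how this step might look in code (assuming a greyscale numpy image and a hypothetical `classify_patch` function standing in for the trained character classifier), the sliding window, expansion operator, and aspect-ratio filtering could be:

```python
import numpy as np
from scipy.ndimage import binary_dilation, label, find_objects

def detect_text_regions(image, classify_patch, patch_size=(20, 20), step=4, threshold=0.5):
    """Slide a fixed-size window over the image and mark patches that the
    (hypothetical) classify_patch function scores as containing text."""
    h, w = image.shape[:2]
    ph, pw = patch_size
    mask = np.zeros((h, w), dtype=bool)
    for y in range(0, h - ph + 1, step):
        for x in range(0, w - pw + 1, step):
            if classify_patch(image[y:y + ph, x:x + pw]) > threshold:
                mask[y:y + ph, x:x + pw] = True

    # Expansion operator: grow the white regions by a few pixels so that
    # nearby detections merge into connected components.
    expanded = binary_dilation(mask, iterations=3)

    # Draw a bounding box around each connected component, keeping only
    # boxes that are wider than they are tall (plausible lines of text).
    labelled, _ = label(expanded)
    boxes = []
    for region in find_objects(labelled):
        height = region[0].stop - region[0].start
        width = region[1].stop - region[1].start
        if width > height:
            boxes.append((region[1].start, region[0].start, width, height))
    return boxes
```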
Character segmentation
Again, we can use a 1-dimensional sliding window, trained on positive examples (images of the gaps between characters) and negative examples (complete characters, with no gap between them).
We then slide this window along the bounding box from the previous step, and determine where the splits between characters are.
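A minimal sketch of this step, assuming a hypothetical `classify_split` function that scores a narrow strip as "gap between two characters or not":

```python
def segment_characters(region, classify_split, window_width=10, step=2, threshold=0.5):
    """Slide a 1-D window across a detected text region and record the x
    positions where classify_split (hypothetical) sees a gap between
    characters; these become the segmentation boundaries."""
    _, width = region.shape[:2]
    splits = []
    for x in range(0, width - window_width + 1, step):
        if classify_split(region[:, x:x + window_width]) > threshold:
            splits.append(x + window_width // 2)  # centre of the detected gap
    return splits
```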
Character classification
This is something we've already looked into with neural networks. Because we have segmented characters, this makes our classification job much easier.
Artificial synthesis of training data
This is pretty clever.
In some cases, we can totally synthesise training data. If we're training a character recognition algorithm, we can use the fonts that are built into the computer. We can create training images by generating text in different fonts, and overlaying one or two characters on a random background.
If we have a limited data set, we can amplify it by performing transforms on it. This additional training data helps the algorithm perform better - so long as they're realistic transforms that the algorithm is likely to see in real-world data. For example, if we have a set of example characters, we can perform optical distortions such as skewing or blurring them, distortions which we're likely to see in real life. As another example, if we're looking at speech recognition, we can take an audio clip and add background noise or distortion.
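A sketch of both ideas using Pillow - the font path and the background texture are placeholders, and the exact distortions are just illustrative:

```python
import random
from PIL import Image, ImageDraw, ImageFont, ImageFilter

def synthesise_character(char, font_path, background, size=(32, 32)):
    """Render a single character in a given font onto a resized background
    texture, producing a brand-new training example from scratch."""
    example = background.resize(size).convert("L")
    font = ImageFont.truetype(font_path, size=24)
    ImageDraw.Draw(example).text((4, 2), char, fill=255, font=font)
    return example

def distort(example):
    """Amplify an existing example with realistic distortions (small
    rotations, blur) - the kind of variation seen in real photos, rather
    than meaningless per-pixel noise."""
    example = example.rotate(random.uniform(-10, 10), fillcolor=128)
    if random.random() < 0.5:
        radius = random.uniform(0.5, 1.5)
        example = example.filter(ImageFilter.GaussianBlur(radius=radius))
    return example
```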
It's worth noting that it usually doesn't help to add random noise to your data.
Getting more data
Before putting lots of effort into getting more data, make sure you have a low bias classifier by plotting learning curves. Increase the number of features or the number of hidden units in a neural network until we have a low bias classifier.
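As an illustration only (using scikit-learn, which the course doesn't use, and assuming `X` and `y` hold the training examples and labels), the learning-curve check might look like:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.neural_network import MLPClassifier

def plot_learning_curves(X, y, hidden_units=100):
    """Plot training vs. cross-validation accuracy for growing training set
    sizes. If both curves flatten out close together at a low score, the
    classifier has high bias and more data won't help; add features or
    hidden units first."""
    model = MLPClassifier(hidden_layer_sizes=(hidden_units,), max_iter=500)
    sizes, train_scores, val_scores = learning_curve(
        model, X, y, train_sizes=np.linspace(0.1, 1.0, 8), cv=5)
    plt.plot(sizes, train_scores.mean(axis=1), label="training accuracy")
    plt.plot(sizes, val_scores.mean(axis=1), label="validation accuracy")
    plt.xlabel("training set size")
    plt.ylabel("accuracy")
    plt.legend()
    plt.show()
```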
Another useful question to ask is "how much work would it be to get 10x as much data as we currently have?" Do the maths on how long it would take - if we need 10,000 examples, and each takes 10 seconds to label, that's roughly 28 hours, or about three and a half eight-hour days, to collect the data. Again, you can also use artificial data synthesis, or even crowdsourcing (eg Amazon Mechanical Turk).
Ceiling Analysis
You want to avoid putting effort in where it won't make a difference.
As a general rule for machine learning, don't trust your gut feeling! Something like ceiling analysis is much more useful.
This is a very simple, but very nice, technique. You have your pipeline as above:
- Text detection
- Character segmentation
- Character recognition
Say the overall system accuracy for our test set is 72%.
What we do is replace the text detection system, and instead supply the correct answers (which we've pre-labelled manually) to the next section of the pipeline - character segmentation. This simulates text detection having 100% accuracy, and might bump up the overall accuracy to 89%.
We then repeat and supply the correct, manually pre-labelled results to character recognition (instead of using our character segmentation section), which brings accuracy to 90%.
Finally, if you replaced character recognition with 100% accuracy, the entire system accuracy would jump to 100%.
This tells us the following:
- There's 17% overall accuracy to be gained from improving the text detection system - good area to focus on
- There's 1% overall accuracy to be gained from improving character segmentation - not a good area to focus on
- There's 10% overall accuracy to be gained from improving character recognition - a reasonable area to focus on.
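The gains are just the differences between successive accuracies; a tiny sketch that computes them from the numbers above:

```python
# Overall test-set accuracy, then accuracy with each successive stage's
# output replaced by manually labelled (perfect) results.
stages = [
    ("baseline system", 0.72),
    ("text detection", 0.89),
    ("character segmentation", 0.90),
    ("character recognition", 1.00),
]

# The potential gain from perfecting a stage is the jump in accuracy when
# its output is swapped for ground truth.
for (_, prev_acc), (name, acc) in zip(stages, stages[1:]):
    print(f"perfecting {name}: up to {(acc - prev_acc) * 100:.0f}% to gain")
```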