You may have heard of the phrase Optical Character Recognition (OCR) and may or may not know what it refers to. Admittedly, I was one of those that never heard of it before now, but learning about it definitely makes me wish I knew about it sooner. OCR is something that greatly improves people’s quality of life: saving time, money, and energy. Already it is changing the interaction between machines and the world and how we store and communicate information. As more and more people discover it, undoubtedly it’ll change the world. But, you may be asking, what is it?
The term OCR refers to identifying characters in an image. However, it’s common use covers the process of taking images and getting a digital representation of the text in them. This humble tool has a staggering variety of applications!
OCR works fairly similar to how humans do. How do you read? Most likely, you’ll answer something along the lines of “I just do,” “I learned how to in kindergarten”, or if you really think hard, “I look at the paper. My brain recognizes words. Then, it assigns meaning to the word”.
Where writing is concerned, a computer is like the kindergarten you: a blank slate. It must learn:
- What strokes are connected
- What set of strokes from a characters
- How to interpret blank space between characters.
This is a textbook example of machine learning. We present the computer with many examples of typed and handwritten characters, along with the actual character those strokes represent. From these labeled examples, a computer can learn to recognize handwritten and typed characters, much the way you likely did as a child.
So, that’s the entire process, right? Train on lots of examples of fonts or handwriting, and then you’ll be able to take an image and extract the text. Unfortunately, that’s just the start (or rather in this case the middle). While we don’t consciously think about it, we go through a much longer process than just look and read including:
- Get a clear view.
- Identify what you’re looking at.
- Find the text.
- Recognize what symbols go together to make a word or sentence.
- Identify each symbol.
- Assign meaning to those symbols.
- Make corrections.
These steps can be grouped together in what’s generally the pre-processing of the image (1, 2, 3), OCR (4, 5, 6), and processing the data from OCR (7).
If you’re near or far sighted, you put on glasses to make whatever you’re looking at clearer. If a paper askew, you right it. OCR is no different. We first clean the image using techniques like:
- deskewing (fixing orientation and angles)
- denoising (helps make the foreground and background as distinct as possible)
- binarization (turning the image black and white to make it as sharp as possible)
Afterwards, the process identifies what it’s looking at. For instance, something that takes in forms might identify if it’s a 1040 form versus a W4 form. OCR doesn’t always need this step, but it can increase accuracy. If you already know what to expect, then it’s a lot easier to identify the characters.
Regardless of whether the image is identified or not, the areas with actual text need to be found. As people, it’s extremely intuitive to differentiate these different sections of background and random markings versus the actual text. For a computer however, all they see is pixels laid across a grid. Common methods to find the areas of text include contour detection and the EAST algorithm; regardless, people today are still working on finding more accurate and better detection for text areas.
Finally, with all of this complete, it’s time to move onto the fun part: OCR.
This is the part that comes to mind when people think of OCR: the process of converting images of text into strings of characters. Recognizing characters is a textbook application of machine learning, but that doesn’t mean it’s trivial to do. Building a reliable classifier requires tens of thousands of examples of each character. However, we aren’t done! A sequence of characters is only the first step in understanding a document.
Having a giant string isn’t super pleasant to work with. As a result, once the characters are recognized and formed into a string, it’s a matter of breaking the information up into its different pieces to be stored. The location of the string, the contents of the string, and the layout of the form or document all tell us something about how to store the information OCR has recovered for us. What, exactly, they tell us varies by use.
Digitizing notes is a common use of OCR. Note-taking is important both as a method of learning and as a tool to remind us of important information later. With OCR, we can retain these benefits of note taking by hand:
- Less distractions
- Active thinking resulting in better retention
- Cheap and easy to use anywhere
- Ease of use
But digital representations and typing also have advantages.
- No hand cramps!
- Ease of tracking and storing
With OCR, we can retain these benefits of taking notes but still have the advantages of digital representation.
If text is typed already, why do you need OCR? The transition from physical to digital is currently a work in progress. With a lot of information already stored on paper, whether that’s paper forms, old books, etc. even if it’s old that information is valuable. In the past, this meant taking a picture or typing by hand. Pictures alone though can’t be manipulated, edited, or searched, and typing up records takes a lot of time, effort, and commonly results in inconsistencies. OCR foregoes the need to type information up, and instead people just take pictures and the rest will be taken care of.
This is extremely relevant for any industries involving lots of record keeping and paperwork. Some big examples include healthcare, law, and archivists, but as long as a business uses paperwork, it will come in handy. Patient records, tax forms, contract agreements: all of this can be converted to be stored in a digital database quickly using OCR.
One of the most interesting uses of OCR is a mix of typed and handwritten, or rather, using OCR to interpret and analyze the world. Think of robotics and machinery. Machines may be used to deal with items; however, that might require reading the label. Robots and technology may need to look at things like shop names, license plates, signs, etc. OCR gives technology real-time information about the environment, not only pure visuals, but the ability to interpret text in the world.
It also enables accessibility technology, specifically for those that are visually impaired. OCR can be used to create aids that will easily and automatically relay information from labels, signs, and other text around the world. It would be able to read out-loud text from textbooks and other physical books eliminating the need to find digital copies that have an audio version.
As long as the written language continues to be used for the sake of communication, OCR will continue to act as a translator between pixels to characters or images to meaningful data. If you want to learn more about OCR consider one of these tutorials:
Otherwise, watch out for future blog posts! We’ll be exploring the various steps of the OCR pipeline from different basic algorithms to how to get started yourself.