What is OCR?

OCR stands for Optical Character Recognition. The technology extracts characters and words from images of documents. One of the original goals of OCR was to build reading devices for blind people. Another use is data entry from printed paper records. By scanning your documents, you can create digital versions of them and store them for easy retrieval by keyword search. The ultimate goal is a paperless office.

OCR is a field of research in pattern recognition, artificial intelligence and computer vision.

That’s the simple description. In more detail, the pre-processing of the image includes document layout analysis, which detects features such as the number of columns, the presence of tables, paragraphs, titles, and images to skip. This analysis makes it possible to reproduce the document's text more faithfully.
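As an illustration of one layout-analysis step mentioned above, counting columns, here is a minimal sketch in Python. It assumes the page has already been binarized into rows of 0 (background) and 1 (ink), and it uses a vertical projection profile, which is one common technique and not necessarily what any particular OCR engine does. The function name `count_text_columns` and the `min_gap` parameter are hypothetical names chosen for this example.

```python
def count_text_columns(image, min_gap=2):
    """Estimate the number of text columns in a binarized page.

    image:   list of rows, each a list of 0 (background) or 1 (ink).
    min_gap: minimum width, in pixels, of an empty vertical band that
             is treated as a column separator (an assumed threshold).
    """
    if not image or not image[0]:
        return 0

    width = len(image[0])
    # Vertical projection profile: number of ink pixels at each x position.
    profile = [sum(row[x] for row in image) for x in range(width)]

    columns = 0
    in_column = False
    gap = 0
    for ink in profile:
        if ink > 0:
            if not in_column:
                # Ink after a wide enough gap starts a new column.
                columns += 1
                in_column = True
            gap = 0
        elif in_column:
            gap += 1
            if gap >= min_gap:
                # The empty band is wide enough to end the current column.
                in_column = False
    return columns


# A tiny synthetic "page": two blocks of ink separated by a 3-pixel gap.
page = [[1, 1, 1, 0, 0, 0, 1, 1, 1] for _ in range(4)]
print(count_text_columns(page))  # prints 2
```

Real scans are noisy and skewed, so a practical system would typically deskew the page and smooth the profile first; this sketch only shows the basic idea.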

