We live in a world of acronyms, that’s for sure. Two that we’re talking about today are OCR and PDF. They stand for Optical Character Recognition and Portable Document Format, respectively. But what can they do for your documents?
We’re all familiar with the PDF, thanks to the utility it has provided homes, schools, and offices across the world since 1993, when the file type was first introduced. Aside from the content-rich documents that are shared as PDF files, many offices scan documents to images and save those images into PDF files.
The problem with scanning is that the versatility of born-digital files is lost. While you get a faithful representation of your document (what could be more consistent than an image file?), you can’t select or copy text out of it, and you can’t search it.
OCR is a process whereby the characters in a scanned image are recognized and extracted, and then layered over the PDF page as text so that the document can be searched and copied like a born-digital file. Even with today’s technology, the accuracy is not 100%. Handwriting, skewed text, over-stylized text, and background colors and images all present roadblocks to a seamless and complete image-to-character conversion.
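To make the idea of a text layer a little more concrete, here is a minimal sketch in Python. It does not run a real OCR engine. The `Word` class, the `search` and `extract_text` helpers, and the sample words with their positions and confidence scores are all hypothetical, made up for illustration. The point is only to show how recognized words, stored alongside their positions on the page image, make a scanned page searchable, and how a low-confidence misread (the kind caused by stylized text) can slip in.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One recognized word in a hypothetical OCR text layer."""
    text: str          # characters the engine believes it saw
    x: float           # left edge on the page image, in pixels
    y: float           # top edge on the page image, in pixels
    width: float
    height: float
    confidence: float  # engine's confidence, from 0.0 to 1.0


def search(layer, query):
    """Return every word whose text contains the query, ignoring case."""
    q = query.lower()
    return [w for w in layer if q in w.text.lower()]


def extract_text(layer, min_confidence=0.0):
    """Join words top-to-bottom, left-to-right, skipping low-confidence ones.

    Sorting by (y, x) is a simplification that only works when words on the
    same line share the same y value, as they do in the sample data below.
    """
    ordered = sorted(layer, key=lambda w: (w.y, w.x))
    return " ".join(w.text for w in ordered if w.confidence >= min_confidence)


# Made-up sample output, standing in for what an OCR engine might return.
layer = [
    Word("Portable", 40, 100, 80, 20, 0.98),
    Word("Document", 130, 100, 90, 20, 0.97),
    Word("Format", 230, 100, 70, 20, 0.95),
    Word("F0rmat", 40, 140, 70, 20, 0.41),  # stylized text misread as a zero
]

hits = search(layer, "document")
clean_text = extract_text(layer, min_confidence=0.9)
```

Because each word keeps its position, a viewer can highlight a search hit in the right spot on the scanned image. Filtering on confidence is one simple way to keep obvious misreads out of copied text, though where to set the threshold is a judgment call.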
In the video below, we’ve scanned one of our marketing brochures and run it through an OCR engine. The result is that text can now be searched, selected, and copied for use in other documents.