One of Google’s greatest tricks is that it can develop a technology once and then apply it across several products, all of which benefit. For example, Google has been working on optical character recognition (OCR) technology for years and uses it extensively in its Google Books project. Recently, it also started testing OCR integration with Google Docs. Now it has quietly added support for converting PDF and image files to native documents using OCR.
When uploading a file to Google Docs, you now have the choice to “Convert text from PDF or image files to Google Docs documents.” Selecting this option kicks in the OCR technology, and Docs then tries to decipher the file and present its contents as plain text in the document editor. The disadvantage is that most of the formatting is dropped, but what you get in return is a document you can edit.
However, the warning you get at the beginning of any converted document makes it pretty clear that the technology isn’t perfect. “This document contains text automatically extracted from a PDF or image file. Formatting may have been lost and not all text may have been recognized,” the notification reads. At least you get a rendered image of every page in a PDF file inserted in the new document so you can compare the results with the original.
The quality of the OCR varies greatly from file to file, and some files are going to be inherently harder to convert. In my testing, Google Docs performed flawlessly, without a single error as far as I can tell. Of course, the original PDF document was fairly high quality, which helped greatly.
Others have had poorer results, so how well the conversion works will depend on the file at hand. But it’s an interesting feature and it should come in handy for those who don’t need these kinds of tools regularly enough to buy dedicated OCR software. And since this is just the first iteration, you can expect the technology and the feature itself to improve over time.
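For readers who want to put a rough number on how well a particular conversion went, here is a minimal sketch in Python. It is not part of Google Docs and says nothing about how Google's OCR works; it simply assumes you have a trusted transcription of the page and the text copied out of the converted document, and compares the two with the standard library's difflib. The function name ocr_similarity and the sample strings are made up for illustration.

```python
from difflib import SequenceMatcher


def ocr_similarity(reference: str, ocr_text: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0 for two pieces of text.

    `reference` is a trusted transcription, `ocr_text` is the converted output.
    """
    # Collapse runs of whitespace so that lost line breaks and spacing
    # (formatting the conversion is expected to drop) do not count as errors.
    ref = " ".join(reference.split())
    ocr = " ".join(ocr_text.split())
    return SequenceMatcher(None, ref, ocr).ratio()


# Hypothetical sample data: one clean result and one with typical OCR slips
# ("m" read as "rn", "l" read as "1").
reference = "Formatting may have been lost."
clean = "Formatting  may have\nbeen lost."
noisy = "Forrnatting may have been 1ost."

print(ocr_similarity(reference, clean))
print(ocr_similarity(reference, noisy))
```

A ratio of 1.0 means the two texts match once whitespace is ignored; lower values point to misread characters. This is only a rough character-level measure, so a high score does not guarantee that every word came through correctly.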
Google Docs Adds OCR Conversion of PDF and Image Files
The results can be surprising