22 June 2010

Convert your PDF files to Google Docs documents

After months of experimenting, Google Docs has released the OCR feature in the standard web interface: when you upload a document, there is a new check box – unchecked by default – labeled “Convert text from PDF or image files to Google Docs documents”. I first spotted it yesterday, but it wasn’t functional yet: nothing different happened to the files I uploaded as a test, and the check box kept disappearing when I refreshed the page. Today it looks much more stable and I was able to try the actual character recognition.

The results are mixed, as expected. The conversion is pretty fast, at least with small files. Instead of an image or PDF, you now have a document in the dashboard, containing the extracted text and, for PDF files, images of the corresponding original pages. The Download Squad article covering the news points out that if you want to share the original file you have to upload two versions of it, but that’s actually how all uploads to Google Docs are handled: you can either use it as storage or convert to their internal format for editing, with the expected loss of fidelity. What I would like to see instead is an option to perform OCR on files uploaded before the feature was introduced. Right now the only solution is to download and re-upload them with conversion on.

The accuracy of the conversion varies widely depending on the file. For a regular text PDF, the recognition of characters was almost perfect, while much of the formatting was stripped, including borders and boxes. Headers and page numbers are also mixed in with the main text, but I suppose it’s not an easy task to distinguish between them. The same result can be achieved by simply copying the text from the PDF viewer – assuming the author allowed it – and pasting it into any text editor. And this way you don’t need a Google Account or to spend time removing the images of the original pages.

For image files, the results were largely disappointing: using a screenshot of the same file, Google Docs was only able to make out the larger title; all the smaller text was simply ignored. It’s probably an issue of resolution.

The last test involved the hardest task yet: I uploaded a PDF file with two scanned pages, with Romanian characters and lots of tables. I wasn’t expecting the software to recognize the special characters, but it also failed to draw a table and discarded a large number of words and numbers. Clearly, there is a lot more to do before this becomes a viable addition to Google’s online office suite. It would probably be a good idea to transform tables into Spreadsheet files, once the accuracy improves, to help people manipulate the extracted data more easily and quickly.

