Google turns on OCR for scanned PDFs 
Sunday, November 2, 2008, 03:12 PM - Utilities, News
Posted by Administrator
By David Chartier | Published: October 31, 2008 - 01:50PM CT

Google has covered quite a lot of turf during the march toward its goal of making every last bit of the world's information searchable. But considering all the ground that has yet to be covered—especially in the realms of offline data and paper documents—we weren't surprised when Google began dabbling with OCR technologies over the last couple of years. Now, the search giant has officially launched its next attempt to handle some of this previously unsearchable content.

As announced on the Official Google Blog, the company is now performing optical character recognition (OCR) on documents that it indexes and identifies as scanned PDFs. Google has indexed documents that were saved as text-based PDFs for quite some time. But many documents wind up being made into PDFs through scans, which store the text as images. Google has now decided that its open-source OCRopus technology, based on software called "Tesseract" that HP developed, is up to the task of indexing scanned documents that can contain any mixture of text, images, and coffee stains.
