How Dropbox is using machine learning to index text from billions of images

People have stored more than 20 billion image and PDF files in Dropbox. Of those files, 10-20% are photos of documents, such as receipts and whiteboard images, rather than documents themselves. These are now candidates for automatic image text recognition. Similarly, 25% of these PDFs are scans of documents and are also candidates for automatic text recognition.

From a computer vision perspective, although a document and an image of a document might appear very similar to a person, there’s a big difference in the way computers see these files. A document can be indexed for search, allowing users to find it by entering some words from the file. An image is opaque to search indexing systems, since it appears as only a collection of pixels. Image formats (like JPEG, PNG, or GIF) are generally not indexable because they have no text content, while text-based document formats (like TXT, DOCX, or HTML) are generally indexable. PDF files fall in between because they can contain a mixture of text and image content. Automatic image text recognition can tell these kinds of files apart and categorize the data they contain.
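
The format distinction above can be sketched as a simple routing step. This is a minimal illustration, not Dropbox's actual pipeline: the function name `indexing_route`, the route labels, and the extension sets are all hypothetical, built only from the example formats named in the text.

```python
from pathlib import Path

# Extension groups taken from the examples in the text.
# These sets are illustrative assumptions, not an exhaustive list.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}
TEXT_EXTENSIONS = {".txt", ".docx", ".html"}


def indexing_route(filename: str) -> str:
    """Return a hypothetical route label for a file, based on its extension."""
    ext = Path(filename).suffix.lower()
    if ext in TEXT_EXTENSIONS:
        # Text-based formats already carry text content a search index can use.
        return "index_text"
    if ext in IMAGE_EXTENSIONS:
        # Images are only pixels, so text must first be recognized.
        return "ocr_candidate"
    if ext == ".pdf":
        # PDFs can mix text and image content, so they need a closer look.
        return "inspect_pages"
    return "unknown"


if __name__ == "__main__":
    for name in ["receipt.JPG", "notes.txt", "scan.pdf", "archive.zip"]:
        print(name, "->", indexing_route(name))
```

An extension check like this only narrows the candidates. Deciding whether a given photo or PDF page actually shows a document would still require looking at the content itself.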
