An unofficial blog that watches Google's attempts to move your operating system online since 2005. Not affiliated with Google.

June 13, 2013

How Google's Image Recognition Works

Just like Google Drive, Google+ Photos uses some amazing image recognition technology to make photos searchable, even if they don't have captions or useful filenames. "This is powered by computer vision and machine learning technology, which uses the visual content of an image to generate searchable tags for photos combined with other sources like text tags and EXIF metadata to enable search across thousands of concepts like a flower, food, car, jet ski, or turtle," explains Google.
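The combination Google describes — machine-generated visual tags merged with text tags and EXIF metadata — can be sketched as follows. This is a hypothetical illustration of the idea, not Google's actual pipeline; all function and field names here are invented.

```python
# Hypothetical sketch: merging tag sources into one searchable tag set,
# as the quoted description suggests. Names are illustrative, not Google's API.

def build_search_tags(visual_tags, text_tags, exif):
    """Combine machine-generated visual tags with user-supplied text tags
    and EXIF-derived tags into one lowercase tag set."""
    tags = {t.lower() for t in visual_tags}
    tags.update(t.lower() for t in text_tags)
    # EXIF metadata can contribute tags such as the camera model.
    if "Model" in exif:
        tags.add(exif["Model"].lower())
    return tags

def matches(query, tags):
    """A photo matches if any query term appears among its tags."""
    return any(term in tags for term in query.lower().split())

photo_tags = build_search_tags(
    visual_tags=["flower", "hibiscus"],   # produced by the vision model
    text_tags=["garden"],                 # user caption words
    exif={"Model": "Canon EOS 5D"},
)
```

With this sketch, a search for "hibiscus" finds the photo even though no caption or filename mentions it — which is the point of the feature.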

Google acquired DNNresearch, a start-up created by Professor Geoffrey Hinton and two of his graduate students at the University of Toronto. They built "a system which used deep learning and convolutional neural networks and easily beat out more traditional approaches in the ImageNet computer vision competition designed to test image understanding." Google built and trained similar large-scale models and found that this approach doubles the average precision, compared to other object recognition methods. "We took cutting edge research straight out of an academic research lab and launched it, in just a little over six months," says Chuck Rosenberg, from the Google Image Search Team.

The paper, titled "ImageNet Classification with Deep Convolutional Neural Networks" [PDF], explains how this works. It uses supervised learning, seven hidden weight layers, and feature extractors learned from the data. "Our neural net has 60 million real-valued parameters and 650,000 neurons. It overfits a lot. Therefore we train on 224x224 patches extracted randomly from 256x256 images, and also their horizontal reflections."
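The augmentation step the paper describes can be sketched in a few lines. This is a generic reimplementation of the idea (random 224x224 patches plus horizontal reflections), not code from the paper or from Google:

```python
import numpy as np

def random_augmented_patch(image, patch=224, rng=None):
    """Return a random patch-sized crop, flipped horizontally half the time.

    Sketch of the augmentation described in the AlexNet paper: random
    224x224 patches from 256x256 images, plus horizontal reflections.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    h, w = image.shape[:2]
    top = rng.integers(0, h - patch + 1)
    left = rng.integers(0, w - patch + 1)
    crop = image[top:top + patch, left:left + patch]
    if rng.random() < 0.5:
        crop = crop[:, ::-1]  # horizontal reflection
    return crop

img = np.zeros((256, 256, 3), dtype=np.uint8)  # stand-in for a real photo
patch = random_augmented_patch(img)
```

Each 256x256 image yields (256 - 224 + 1)^2 = 1089 possible crop positions, doubled by reflection to 2178 variants, which is why this cheaply multiplies the effective training set and fights the overfitting the authors mention.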

Google says that the publicly available photo search feature recognizes 1100 tags. "We came up with a set of about 2000 visual classes based on the most popular labels on Google+ Photos and which also seemed to have a visual component, that a human could recognize visually. In contrast, the ImageNet competition has 1000 classes. As in ImageNet, the classes were not text strings, but are entities, in our case we use Freebase entities which form the basis of the Knowledge Graph used in Google search. An entity is a way to uniquely identify something in a language-independent way. (...) Since we wanted to provide only high precision labels, we also refined the classes from our initial set of 2000 to the most precise 1100 classes for our launch."
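The advantage of entity-based classes over text strings is that the label is language-independent: the classifier outputs an ID, and the display name is chosen at render time. A minimal sketch of that idea follows; the Freebase MIDs shown follow the real `/m/...` format, but treat the specific mappings and function names as illustrative, not as Google's actual data.

```python
# Illustrative sketch: classes are language-independent entity IDs,
# localized only when shown to the user. The mapping below is invented
# for illustration (MIDs follow the real Freebase format).

CLASSES = {
    "/m/0c9ph5": {"en": "flower", "fr": "fleur"},
    "/m/0k4j":   {"en": "car",    "fr": "voiture"},
}

def display_label(entity_id, language="en"):
    """The classifier emits an entity ID; the UI localizes it at display time."""
    return CLASSES[entity_id][language]
```

Because the model predicts `/m/0k4j` rather than the string "car", the same prediction can be surfaced as "car", "voiture", or any other localization without retraining.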

Other examples of recognized classes: car, dance, kiss, meal, hibiscus, dahlia, sunsets, polar bear, grizzly bear. The system recognizes both generic visual concepts and specific objects. "Unlike other systems we experimented with, the errors which we observed often seemed quite reasonable to people. The mistakes were the type that a person might make - confusing things that look similar."
