January 9, 2008

Google Reveals New MapReduce Stats

An updated version of Google's paper about MapReduce (available at ACM and mirrored here) provides new information about Google's scale. MapReduce is a software framework used by Google to "support parallel computations over large (...) data sets on unreliable clusters of computers". Google uses it to index the web and compute PageRank, process geographic information for Google Maps, cluster news articles, power machine translation, Google Trends, and more.
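
The programming model itself is small: the programmer writes a map function that emits intermediate key/value pairs and a reduce function that merges all the values emitted for the same key, while the library takes care of partitioning the input, scheduling the work across machines and re-running tasks that fail. Below is a minimal word-count sketch of that model in Python; the paper's examples use C++ and Google's internal library, and the sequential runner here only stands in for the distributed execution the real system provides.

    from collections import defaultdict

    def map_fn(document):
        # map: emit an intermediate (word, 1) pair for every word in the input
        for word in document.split():
            yield word, 1

    def reduce_fn(word, counts):
        # reduce: merge all the values emitted for one key
        return word, sum(counts)

    def run_mapreduce(records, map_fn, reduce_fn):
        intermediate = defaultdict(list)
        for record in records:                    # map phase
            for key, value in map_fn(record):
                intermediate[key].append(value)   # shuffle: group values by key
        return [reduce_fn(k, v) for k, v in sorted(intermediate.items())]  # reduce phase

    print(run_mapreduce(["the quick brown fox", "the lazy dog and the cat"], map_fn, reduce_fn))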

The MapReduce jobs run in September 2007 read 403,152 TB (terabytes) of map input data, the average job was allocated 394 machines, and the average completion time was about six and a half minutes (395 seconds). The paper mentions that Google's indexing system processes more than 20 TB of raw data. Since 2003, when MapReduce was built, the indexing system has progressed from 8 MapReduce operations to a much larger number today.
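
Combined with the job count in the table below, these totals give a rough per-job picture; the divisions in this sketch are mine, not figures stated in the paper.

    # Back-of-the-envelope per-job figures from the Sep. '07 column of the table below
    jobs = 2_217_000        # 2,217 thousand jobs
    map_input_tb = 403_152
    machine_years = 11_081

    print(f"avg map input per job: {map_input_tb / jobs * 1024:,.0f} GB")
    print(f"avg machine time per job: {machine_years / jobs * 8766:,.1f} machine-hours")  # 8,766 hours per year
    # ~186 GB and ~44 machine-hours per job, which is consistent with the reported
    # averages of 394 machines and 395 seconds (394 * 395 s is about 43 machine-hours)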

Niall Kennedy calculates that the average MapReduce job runs across roughly $1 million worth of hardware, assuming that Google still uses the cluster configuration it described in 2004: machines with two 2 GHz Intel Xeon processors with Hyper-Threading enabled, 4 GB of memory, two 160 GB IDE hard drives, and a gigabit Ethernet link.
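
Taken at face value, Kennedy's total implies a per-machine price in the low thousands of dollars; the quick division below is an inference from the two published numbers, not a figure quoted from his post.

    machines_per_job = 394        # average machines allocated per job, Sep. 2007
    cluster_cost_usd = 1_000_000  # Kennedy's estimate for that slice of hardware
    print(f"implied cost per machine: ${cluster_cost_usd / machines_per_job:,.0f}")
    # roughly $2,538 per machine under Kennedy's assumptions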

Greg Linden notes that Google's infrastructure is an important competitive advantage: "Anyone at Google can process terabytes of data. And they can get their results back in about 10 minutes, so they can iterate on it and try something else if they didn't get what they wanted the first time."

                                Aug. '04    Mar. '06    Sep. '07
Number of jobs (1000s)                29         171       2,217
Avg. completion time (secs)          634         874         395
Machine years used                   217       2,002      11,081
Map input data (TB)                3,288      52,254     403,152
Map output data (TB)                 758       6,743      34,774
Reduce output data (TB)              193       2,970      14,018
Avg. machines per job                157         268         394
Unique map implementations           395       1,958       4,083
Unique reduce implementations        269       1,208       2,418

{ The screenshot illustrates a Google rack from 2007. I don't remember the exact source of the image, but it's likely to be a presentation. }

3 comments:

  1. This kind of hardware infrastructure makes sure that no "next Google" can possibly emerge through overnight popularity and lucky money.

    In the map service market, even Yahoo and Microsoft have a long way to go to catch up.

  2. You wrote, "I don't remember the exact source of the image, but it's likely to be a presentation."

    The presentation was "Handling Large Datasets at Google: Current Systems and Future Directions", presented by Jeff Dean, Google Fellow.

    Sincerely,
    Ozgur Uksal

  3. It's the future. Does anyone have experience using the Greenplum MapReduce solution?

    http://www.greenplum.com/resources/mapreduce/
