Sunday, September 10, 2006

How Much Data Does Google Store?

In case you were wondering how much information Google stores, the paper about BigTable I was talking about last week gives some interesting insights.

Google's search crawler uses 850 TB of information (1 TB = 1024 GB), so that's the amount of raw data from the web. Google Analytics uses 220 TB stored in two tables: 200 TB for the raw data and 20 TB for the summaries.

Google Earth uses 70.5 TB: 70 TB for the raw imagery and 500 GB for the index data. The second table "is relatively small (~500 GB), but it must serve tens of thousands of queries per second per datacenter with low latency".

Personalized Search doesn't need too much data: only 4 TB. "Personalized Search stores each user's data in Bigtable. Each user has a unique userid and is assigned a row named by that userid. All user actions are stored in a table."

Google Base uses 2 TB and Orkut 9 TB of data.

If we take into account that all this information is compressed (for example, the crawled data has a compression rate of 11%, so 800 TB become 88 TB), Google uses 220 TB for all the services mentioned above. It's also interesting to note that the size of the raw imagery from Google Earth is almost equal to the size of the compressed web pages crawled by Google.
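
As a rough check on the figures above, the raw sizes quoted in this post can be added up. This is only a minimal Python sketch: the dictionary name and its keys are my own labels, and the numbers are simply the ones given in the post.

```python
# Raw table sizes in TB, as quoted in the post (labels are illustrative).
raw_sizes_tb = {
    "crawl": 850,
    "analytics": 220,
    "earth": 70.5,
    "personalized_search": 4,
    "base": 2,
    "orkut": 9,
}

# Sum of all raw (uncompressed) sizes mentioned in the post.
total_raw_tb = sum(raw_sizes_tb.values())
print(f"Total raw size: {total_raw_tb} TB")

# Applying the 11% compression rate to the 800 TB crawl figure.
crawl_compressed_tb = 800 * 11 / 100
print(f"Compressed crawl data: {crawl_compressed_tb} TB")
```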
25 comments
Not a lot there, I could nearly fit that on my hard drive *sarcasm*

How did you find this out?
What about Gmail, Blogger, Googlepages, Calendar, Writely etc etc etc.

It would be nice to see the exact number.

Nice job Ionut.
For a comparison:

The U.S. Library of Congress has claimed it contains approximately 20 terabytes of text.
Rapidshare has over 360 terabytes of space used for hosting files.



If Google has 24 billion pages and the crawled data needs 850 TB, an average page should be:

934,584,883,609,600 / 24,000,000,000 = 38,941 (38 K)

Google must have more than 24 billion pages (or store multiple versions of the same page) as this value seems pretty big.
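
The division above can be reproduced in a few lines of Python. This is just a sketch of the same arithmetic, assuming 1 TB = 1024 GB as in the post and the 24 billion page count used in this comment; the variable names are my own.

```python
TB = 1024 ** 4  # bytes in a terabyte, using 1 TB = 1024 GB as in the post

crawl_bytes = 850 * TB
pages = 24_000_000_000  # page count assumed in the comment above

# Average number of bytes stored per crawled page.
avg_page_bytes = crawl_bytes / pages
print(f"{crawl_bytes:,} bytes / {pages:,} pages = {avg_page_bytes:,.0f} bytes per page")
print(f"About {avg_page_bytes / 1024:.0f} KB per page")
```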
The source is a paper [PDF] written by some Google engineers about BigTable, an interesting way of storing data.
For some perspective, big enterprises often have storage needs in the petabytes, not terabytes. While Google seems to have a lot of data from its web indexing, I am guessing it is less data (less information) than Shell Oil has.

So when we hear or think about Google setting out to index the world's information, they are perhaps only nibbling at it.

Or maybe Google is storing exabytes that aren't obvious to us...
If I could get some of that (merely a fraction, about 0.01%), how big that would be to me.
"the crawled data has compression rate of 11%, so 800 TB become 88 TB"

Er that doesn't make sense. 800TB with a compression rate of 11% means it's more like 722TB.

To compress down to 88TB means the compression rate is more like 90%.
Not true. If the file size is S and the compression rate is X%, then the compressed file will have this size S*X/100.

Compression rate = (compressed_size/original_size) * 100.

See this Wikipedia article and check the compression rate displayed by your archiver.
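
The disagreement in the last two comments comes down to two different definitions. Here is a small Python sketch of both, using the 800 TB figure from the post; the function names are hypothetical and only there for illustration.

```python
def compressed_size(original, rate_percent):
    """Size after compression when rate = compressed_size / original_size * 100."""
    return original * rate_percent / 100


def space_saved_percent(original, compressed):
    """Share of the original size that compression removed, in percent."""
    return (1 - compressed / original) * 100


original_tb = 800
compressed_tb = compressed_size(original_tb, 11)
saved = space_saved_percent(original_tb, compressed_tb)

print(f"Compressed size at an 11% rate: {compressed_tb} TB")
print(f"Space saved: about {saved:.0f}%")
```

With the first definition, an 11% rate takes 800 TB down to 88 TB. With the second, the same result is described as saving roughly 89% of the space, which is the "more like 90%" figure mentioned above.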
...and we've hardly scratched the surface... Now, can we start working on hungry children, forgotten elderly, narcissism, greed, hydrogen fuel... k?
It's rather striking that after only, what, six months of operation, Google Analytics is already using more than a quarter as much space as the raw web. Will the trend lines cross in a few years?
Well, that's a lot when thinking from a desktop perspective, I guess... but I know of some global banks that have a few petabytes of storage in their data centers...
And what about Google Video? How much does it use?
Do you know the "How Much Info" page?
http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/
11% compression??? I am from the information retrieval community and I can tell you that the index alone takes up 33% of the size of the original set of documents. Plus, Google stores all the pages and images. That should be a lot too. I'd say, no matter what Google does, it cannot be using less than 50% (425 TB) of the space to hold its web crawl plus index.
yall like boys//who gives a shit about terabytes, petabytes and all that shit?? are you kidding me??
You should not have read so much of this page to make a comment....

> Anonymous said...
> yall like boys//who gives a shit about terabytes, petabytes and all that shit?? are you kidding me??

> Tuesday, November 27, 2007 6:27:00 PM PST
To the individuals hacking me: please stop. strenght55@gmail.com, 9174778373
I think it is just amazing that Google has taken on the task of making all knowledge available on the internet. Even so, only 23 percent of the world's population has access to the internet, and two thirds of Chinese people still live on two dollars a day. Not very advanced, are we? Especially considering that only 1 percent of the people in the world ever go to college. We are dumb and dumber, though the future looks bright!
Dr. Doug Ikeler
Holy cow! That's a lot of bytes! I'm gonna go figure out how many bytes that is with a calculator! That's 2,252,800,000,000 bytes. Our hard drive only holds 250 GB.
How many computers do you use??????????????????
With an efficient algorithm, an index can actually be compressed at a greater ratio the more data it indexes. For example, a lot of websites have similar long phrases like "In case you were wondering how much" (807 results). If you have more data, it is more likely that data will repeat and can therefore be linked to a larger index.
"In case you were wondering how much" is 7 words * 807 occurrences. Linked as a 4-byte integer lookup per word: 807*7*4*8 = 180,768 bits (about 180 Kbit). Linked per phrase: 807*1*4*8 = 25,824 bits (about 26 Kbit).
Compression ratio due to the ability to compress larger phrases: 14%
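
The estimate in this comment can be checked with a short Python sketch. It assumes, as the comment's 4*8 factor suggests, that each word or phrase ID is a 4-byte (32-bit) integer; the variable names are my own.

```python
occurrences = 807        # search results for the phrase, as quoted in the comment
words_in_phrase = 7
bits_per_id = 4 * 8      # assumption: each ID is a 4-byte integer

# One integer ID per word versus one integer ID per whole phrase.
per_word_bits = occurrences * words_in_phrase * bits_per_id
per_phrase_bits = occurrences * 1 * bits_per_id

ratio = per_phrase_bits / per_word_bits
print(f"Per-word index:   {per_word_bits:,} bits")
print(f"Per-phrase index: {per_phrase_bits:,} bits")
print(f"Ratio: {ratio:.0%}")
```

The ratio is simply 1/7 here, since a single phrase ID replaces seven word IDs.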
This must be outdated as soon as I hit 'post comment'.
Wow, that's a lot of storage. I wonder what it is now. :)
What about Blogspot?
I wonder how much it is now, with YouTube as well. :D