September 10, 2006

How Much Data Does Google Store?

In case you were wondering how much information Google stores, the paper about BigTable I was talking about last week gives some interesting insights.

Google's search crawler uses 850 TB of data (1 TB = 1024 GB); that's the amount of raw data from the web. Google Analytics uses 220 TB stored in two tables: 200 TB for the raw data and 20 TB for the summaries.

Google Earth uses 70.5 TB: 70 TB for the raw imagery and 500 GB for the index data. The second table "is relatively small (~500 GB), but it must serve tens of thousands of queries per second per datacenter with low latency".

Personalized Search doesn't need too much data: only 4 TB. "Personalized Search stores each user's data in Bigtable. Each user has a unique userid and is assigned a row named by that userid. All user actions are stored in a table."

Google Base uses 2 TB and Orkut only 9 TB of data.

If we take into account that all this information is compressed (for example, the crawled data has a compression rate of 11%, so 800 TB become 88 TB), Google uses 220 TB for all the services mentioned above. It's also interesting to note that the size of the raw imagery from Google Earth is almost equal to the size of the compressed web pages crawled by Google.
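
The compression arithmetic above can be sketched in a few lines of Python. This is only an illustration, assuming that "compression rate" means the compressed size as a percentage of the original size; the function names are made up for this example and do not come from the paper.

```python
def compressed_size(original_tb: float, compression_rate_pct: float) -> float:
    """Size after compression, assuming rate = compressed / original * 100."""
    return original_tb * compression_rate_pct / 100


def space_savings_pct(compression_rate_pct: float) -> float:
    """Share of the original size that compression removes."""
    return 100 - compression_rate_pct


crawl_tb = 800  # crawl table size quoted in the post
rate_pct = 11   # compression rate quoted in the post

print(f"{crawl_tb} TB at {rate_pct}% -> {compressed_size(crawl_tb, rate_pct):.0f} TB")
print(f"space saved: {space_savings_pct(rate_pct)}%")
```

Under this definition a lower rate means better compression: an 11% rate is the same as saving 89% of the space.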

56 comments:

  1. Not a lot there, I could nearly fit that on my hard drive *sarcasm*

    How did you find this out?

  2. What about Gmail, Blogger, Googlepages, Calendar, Writely, etc.?

    It would be nice to see the exact number.

    Nice job Ionut.

  3. For a comparison:

    The U.S. Library of Congress has claimed it contains approximately 20 terabytes of text.
    Rapidshare has over 360 terabytes of space used for hosting files.



    If Google has 24 billion pages and the crawled data needs 850 TB, an average page should be:

    934,584,883,609,600 bytes / 24,000,000,000 pages = 38,941 bytes per page (about 38 KB)

    Google must have more than 24 billion pages (or store multiple versions of the same page) as this value seems pretty big.

    Replies
    1. I think the 850 TB of memory is just the RAM

  4. The source is a paper [PDF] written by some Google engineers about BigTable, an interesting way of storing data.

  5. For some perspective, big enterprises often have storage needs in the petabytes, not terabytes. While Google seems to have a lot of data from its web indexing, I am guessing it is less data (less information) than Shell Oil has.

    So when we hear or think about Google setting out to index the world's information, they are perhaps only nibbling at it.

    Or maybe Google is storing exabytes that aren't obvious to us...

  6. If I could get some of that (merely a fraction, about 0.01%), how big it would be to me!

  7. "the crawled data has compression rate of 11%, so 800 TB become 88 TB"

    Er, that doesn't make sense. 800 TB with a compression rate of 11% means it's more like 712 TB.

    To compress down to 88 TB means the compression rate is more like 90%.

  8. Not true. If the file size is S and the compression rate is X%, then the compressed file will have the size S*X/100.

    Compression rate = (compressed_size/original_size) * 100.

    See this Wikipedia article and check the compression rate displayed by your archiver.

  9. Well, that's a lot, thinking from a desktop perspective, I guess... but I know of some global banks that have a few petabytes of storage in their data centers...

  10. And what about Google Video? How much does it use?

  11. Do you know the "How Much Info" page?
    http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/

  12. 11% compression??? I am from the information retrieval community and I can tell you that the index alone takes up 33% of the size of the original set of documents. Plus, Google stores all the pages and images. That should be a lot too. I'd say, no matter what Google does, it cannot be using less than 50% (425 TB) of space to hold its web crawl + index.

  13. You should not have read so much of this page to make a comment....

    > Anonymous said...
    > yall like boys//who gives a shit about terabytes, petabytes and all that shit?? are you kidding me??
    >
    > Tuesday, November 27, 2007 6:27:00 PM PST

  15. I think it is just amazing that Google has taken on the task of making all knowledge available on the internet. Even so, only 23 percent of the world's population has access to the internet, and two thirds of Chinese people still live on two dollars a day. Not very advanced, are we? Especially considering that only 1 percent of the people in the world ever go to college. We are dumb and dumber, though the future looks bright!
    Dr. Doug Ikeler

  16. Holy cow! That's a lot of bytes! I'm gonna go figure out how many bytes that is with a calculator! That's 2,252,800,000,000 bytes. Our hard drive only holds 250 GB.

  17. How many computers do you use??????????????????

  18. With an efficient algorithm, an index can actually be compressed at a greater ratio as more data is indexed. For example, a lot of websites share long phrases like "In case you were wondering how much" (807 results). The more data you have, the more likely it is that data will repeat and can therefore be linked to a larger index.
    "In case you were wondering how much" is 7 words * 807 occurrences. Linked as a 4-byte integer lookup per word: 807*7*4*8 = 180,768 bits (about 180 kilobits). Linked per phrase: 807*1*4*8 = 25,824 bits (about 26 kilobits).
    Compression ratio due to the ability to compress larger phrases: 14%

  19. must be outdated as soon as I hit 'post comment'

  20. Wow, that's a lot of storage. I wonder what it is now. :)

  21. I wonder how much it is now, with YouTube as well. :D

  22. Just hope that Google uses our data for the good and not evil :)

  23. What, 70 terabytes for Google Earth? I'm actually shocked that it is so low. I would have thought satellite imagery for even 1/8th of the planet would take up amounts of data storage I simply cannot imagine, and that my scientific calculator cannot display!

  24. Smoke some pot and think about this...

    One day the world will be available on Virtual Reality. You will just be able to fly around in 3D. Accurate to 1cm/100 Megapixels.

    How much storage will that take?

    Any mathematicians care to take a guess? I'm sure someone can work out a formula. Mwahahahaha! God, I'll be up all night thinking about this now....

  25. 510385129 petabytes of information is all it would take to store every surface on the planet at microscopic scale: buildings, ants, people. Not too much if you think about it. Just find a scanner and get started... it should only take a few decades.

  26. 5103851292548736987450000 MB of information

  27. Haha! Brilliant question. And brilliant answers. Can you please show your work. I want to see your long division...

  28. Sorry off topic, but in response to Dr. Ikeler,

    IT aside, you have touched on very important political agendas.
    With extreme poverty there is wealth, as in your example of China, and also 400 million in poverty in India.

    Global politics dictates the 50 cent per day slave trade.
    Essential knowledge: Money Masters, Constructing Fear, Life and Debt.

    We are VERY advanced, but wealth and knowledge are NOT for everyone.

    Politics and media hype influence the young to believe that sport is the ticket to success, and not education. Just like in human society, drones in bee society work for the queen.

    http://corpau.blogspot.com

  29. I think it's all a lie. After all, we all came here from Google or some other search engine. There was previously a site, which I got from another search engine, that talked about it... it didn't interest me at that time... but now I can't find it!
    Of course it's all wrong. Think about it this way: let's say I buy 1 TB HDDs. For argument's sake, I get each one for about $300. Let's say I increase the amount they "claim" to about 1000 TB (which is a lot compared to what they say). So I can actually get the whole data of Google for 300K? So, in effect, after spending, say, a million dollars more... I can actually start up my own company?
    OF COURSE NOT!
    And what about Gmail, blogs, YouTube, etc.?

  30. I think it's all a lie!
    I had come across a good link on how much Google stores a few years back... from another search engine... but that whole site has vanished now, because it doesn't appear anywhere.

    I can get 2000 TB for a max of 0.6 million. And, say, with an additional 1 million for running costs, etc... if I could lay my hands on the data, I can actually set up my own rival company?
    OF COURSE NOT!
    This is all a lie, people! The amount of data they store is enormous... and they are scared that people will be frightened and that environmentalists will be concerned, etc.

    And what about Gmail, YouTube, etc.?!

    I like Google... but I'm a bit scared about the influence it has... if it could go into the wrong hands and stuff.

  31. @Anonymous:

    1. The data is from 2006, so obviously it's no longer accurate.

    2. The data is from a Google paper, so you can't claim it's "all a lie".

    3. Google acquired YouTube in October 2006, after this paper was published.

    4. Let's assume that Google indexed 10 billion web pages at that time and the average size of a page was 100 KB. To store all those pages, Google would've needed 931 TB, which is close to the value from Google's paper (850 TB).

  32. SERIOUSLY! THAT'S IT?! HOW FREAKIN' COOL! i totally thought google had some unfathomably huge network of data shared among multiple affiliates to make this enormous dataset that they had access to, not actually stored themselves.

    but on a side note:

    are people seriously freaking out about how much data google has? how come nobody cares about NASA, the Russians or North Korea? i mean i don't think google's ever blown anything the f**k up...

    and for those anons that are flippin out about how it must be a lie based on some point about how simply having "disk space" in the ballpark of 2000 TB must mEaN SoMetHInG, someone, please, slap them. google does more than just "store stuff". they offer special ways of accessing/indexing info, plus provide services based on productivity, marketing, communication and education.

    its not impossible to start a search engine with some cash a big HD and a dumb idea. it happens all the time.
    BUT
    then they get washed
    AND
    then we go back to google(or yahoo or msn or bing or whoever).

    i mean, think: when google started up, Yahoo was DOMINATING. when facebook started up Myspace was DOMINATING. Anybody could likely name their favorite pioneer in technology who came in w/guns blazin at just the right time while someone else was monopolizing the scene. making a new search engine now is exactly the same concept, just larger scale. dont hate just cuz ur broke with no ideas.

    And no, telling people how fake or impossible something is doesnt earn you any respect nor does it make you sound smart. unless you happen to be a talking snowball or dung pile. then u deserve your own youtube channel.

  33. any current information on that topic?

  34. Great info... I recently came to know that Google does its server backup with a battery installed in each CPU!!! Now if each CPU has a 1 TB hard drive... given the info in the article... its server farm size should not be too huge.

    -- RIA Tweet (http://twitter.com/wave_)

  35. Google is Cyberdyne

  36. Which servers does Google use to store the data?

  37. I'd like to see these figures updated... I bet in four years there have been some huge changes. Just how much sh*t are we (humans) (obviously, 'tho my dog has a Facebook account and he's never bloody off it) storing via interweb now, with Google, Facebook, Youtube (also Google), Twitter, etc. etc. There's just tons of "stuff" out there and growing fast.

  38. Alright alright, so obviously this is a great topic as it has been going on for over 2 years now... I'm impressed anyways...

    Google's compression ratio (which is somewhat impressive as well) may put a comparatively large dent in the amount of space Google uses; however, it does not save it from the petabytes upon petabytes of information that Google indexes. Think of every page across the web. Google is not text only; it does images and videos as well...

    Speaking of which, we're also dealing with YouTube, a Google-owned company. They may compress the videos into .mov files, but that isn't saving it from its mass of data either...

    Now you must consider the estimated 500GB of space in just code for Google's applications, if you will.

    Google owns server farms everywhere--each of these little farms has Terabytes among Terabytes of information, and they're spread out across the world!

    Now think of how much data we're dealing with now, and then multiply that by three. Each byte of data that Google stores is backed up three times...

    Compression or not, if you sum up every one of Google's applications and multiply it by three, you're getting petabytes upon petabytes of data...

    At this point, dare I say it sarcastically, they probably have Googlebytes of indexed and compressed information on their mass servers. Yup, give Google their own unit of space, hands down...

  39. lol, the mathematical number "1 google" (which is 10^100, or 1 followed by 100 zeroes) has a real-world application

    Replies
    1. Actually, the number 10^100 is googol. Google changed the spelling for reasons unknown.

  40. That's a lot of data.

  41. Think of it:
    if Google has a virus system for all files on the web, that is MediaFire + RapidShare + ...,
    plus the virus database...
    sooner or later Google will take over the whole world.

  42. But still, it cannot store beyond the limit. If that is exceeded, how do they manage? And what is the capacity of that storage device?

  43. Does Google's system have limited data storage? What will happen if all the 220 TB is used? This is a very interesting and informative post.

  44. @Anonymous & @Storage Melbourne

    What limit? Do you think Google has just one server with hundreds of drives plugged in?

    They have hundreds of "blade" servers with quite a few disks in each. Basically, it's like one huge disk spread out over hundreds of servers, all over the world...

    ::smiles:: I can't help smiling about you lot thinking about all those USB cables to external caddies!!

    @Kwong Tung Nan: What are you talking about? Virus? Rapidshare? What has this to do with Google?

  45. A google is a very large number... when we get to a googleplex, that's a number that's hard to even imagine... a 1 with a google of zeros after it... millions and billions of zeros, just wrap ya mind round that....++=---==== ERROR OUT OF CHEESE ERROR>>>... SYNTAX ERROR IN LINE 10... REDO FROM START =====++++_----==

  46. It's googol and googolplex, not google.

  47. Updated link to the PDF:

    http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/bigtable-osdi06.pdf

  48. Google processes about 24 petabytes of data per day. To process that kind of data per day, Google probably has a few exabytes of data altogether and probably keeps adding more every month :o

    Interesting!

  49. Sitting here in 2016 commenting on a 2006 post

  50. HOW DOES GOOGLE STORE AND ACCESS PETABYTES OF DATA? HOW DO THEY MANAGE HARD DRIVES? ARE THERE ANY BACKUP PLANS IF THE DATA IS LOST?


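
Two of the back-of-the-envelope estimates from the comments (the average page size if 24 billion pages take 850 TB, and the storage needed for 10 billion pages of 100 KB each) can be checked with a short Python sketch. The helper names are hypothetical, and the page counts and page size are the commenters' guesses, not figures from the paper. Units are binary, as in the post (1 TB = 1024 GB).

```python
BYTES_PER_KB = 1024
BYTES_PER_TB = 1024 ** 4  # binary terabyte, matching 1 TB = 1024 GB


def average_page_bytes(total_tb: float, pages: int) -> float:
    """Average bytes per page if `pages` pages occupy `total_tb` terabytes."""
    return total_tb * BYTES_PER_TB / pages


def total_storage_tb(pages: int, avg_page_kb: float) -> float:
    """Terabytes needed to store `pages` pages of `avg_page_kb` KB each."""
    return pages * avg_page_kb * BYTES_PER_KB / BYTES_PER_TB


# Assumed inputs, taken from the comments rather than from the paper.
print(round(average_page_bytes(850, 24_000_000_000)))      # 38941
print(round(total_storage_tb(10_000_000_000, 100), 1))     # 931.3
```

Both results agree with the numbers quoted in the thread: about 38 KB per page in the first case and roughly 931 TB in the second.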
