Engineering manager interview questions shared by candidates
How many computers would it take to run Gmail?
These sorts of questions don't have a single right answer. The real answer is to think out loud about what would be required in terms of bandwidth, storage, computing power, etc. for all of the different elements: receiving, sending, website, spam filtering, address book lookup, ...
Assume that Gmail has 2 billion users (there are 3 billion internet users around the world, and almost every one of them has a Gmail account). Assume that the average mailbox size is 500 MB (I've worked at many email services and know what I'm talking about). 2 billion times 500 MB is 1000 PB (petabytes).
If Gmail stores each email 3 times, the total is 3000 PB. (2 copies would be a reasonable choice, but since users' mailboxes are scattered around the storage cluster, 2 hard disks failing at the same moment could lose some data, while the probability of 3 hard disks failing in different data centers at the same time is negligibly low.)
But I'm completely sure that Gmail uses deduplication and stores the text parts of emails compressed. The first measure can save around 50% of space (most of the mail store is attachments, and attachments are usually sent by a person, so each attachment appears twice: in an inbox folder and in a sent folder). The second measure can save around 20% of space (roughly the share of the store taken up by the text parts of messages, which shrink several-fold when compressed). So 3000 PB is reduced by 70% to about 900 PB, which I round back up to 1000 PB.
A 36x4 TB storage server has a capacity of 144 TB. 1000 PB divided by 144 TB is about 7000. Why 36x4 TB? Because 36 is, as far as I know, the economically feasible maximum number of disks per server, and 4 TB is the economically feasible maximum disk size. I mean that the more disks per server, the lower the cost per byte of the server's non-disk parts, but only up to a limit, after which the non-disk parts become more expensive. The same goes for disk size: the bigger the disk, the cheaper each byte stored on it, but only up to the 4 TB limit.
Of course Gmail is not only a storage service. There are also web services, SMTP, IMAP, POP3, MX, antispam and so on.
All of these services just process data and don't have to store anything large (maybe there are some databases, negligible compared to the email storage, that hold user credentials, for example). A service does not necessarily need a separate server, and storage servers luckily have plenty of CPU time and some memory. So the storage servers' CPU time, memory and a small portion of their disks could be used for data processing and for the small databases needed by the web services, SMTP, POP3, IMAP and so on. Google would not be Google if it used a lot of additional servers to run the email system beyond the email storage servers. That's why the answer is about 7000 servers.
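The storage estimate above can be sketched as a few lines of arithmetic. This is only a back-of-envelope sketch: every constant is one of the answer's assumptions (user count, mailbox size, replication factor, savings percentages, server shape), and the function names are made up for illustration. Note that the unrounded post-savings figure is 900 PB, which gives 6250 servers; rounding up to 1000 PB, as the answer does, gives roughly 7000.

```python
import math

# Every constant below is an assumption taken from the answer, not a measured figure.
USERS = 2_000_000_000        # assumed number of Gmail accounts
MAILBOX_MB = 500             # assumed average mailbox size, in MB
REPLICAS = 3                 # copies kept of each message
DEDUP_SAVING_PCT = 50        # assumed saving from deduplication
COMPRESSION_SAVING_PCT = 20  # assumed saving from compressing text parts
DISKS_PER_SERVER = 36        # assumed economical maximum disks per server
DISK_TB = 4                  # assumed economical maximum disk size, in TB

MB_PER_PB = 10**9            # decimal units: 1 PB = 10^9 MB
TB_PER_PB = 10**3            # decimal units: 1 PB = 10^3 TB


def stored_petabytes():
    """Petabytes on disk after replication, deduplication and compression."""
    raw_pb = USERS * MAILBOX_MB / MB_PER_PB        # 1000 PB of mailboxes
    replicated_pb = raw_pb * REPLICAS              # 3000 PB with 3 copies
    kept_pct = 100 - DEDUP_SAVING_PCT - COMPRESSION_SAVING_PCT
    return replicated_pb * kept_pct / 100          # 900 PB after 70% savings


def servers_needed(total_pb):
    """Number of 36 x 4 TB servers needed to hold total_pb petabytes."""
    server_tb = DISKS_PER_SERVER * DISK_TB         # 144 TB per server
    return math.ceil(total_pb * TB_PER_PB / server_tb)


if __name__ == "__main__":
    exact_pb = stored_petabytes()
    print(exact_pb, servers_needed(exact_pb))      # unrounded figure
    print(1000, servers_needed(1000))              # the answer's rounded 1000 PB
```

Changing any one constant (say, 2 replicas instead of 3, or a different disk size) shows how sensitive the final server count is to each assumption, which is the point of talking through the estimate out loud.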