How I'd build the next Google
-or-
How to exploit the fact that only the first 10 pages of a search result are really necessary, in order to do a distributed map-reduce that doesn't bottleneck on the difficult-to-parallelise 'reduce'.
In a traditional reduce/merge operation, you'd combine the results from the disparate systems to create one massive dataset on one system. A common search term would produce a large number of results, which would be fine in the 'map' (parallel searching) step, but for the 'reduce' step (merging the results) it would cause quite a bottleneck. When I first heard of the map-reduce algorithm, I wondered how search engines got around this problem. But I was recently learning about the merge algorithm, and was thinking about how it could be combined with the fact that most people never look past the first few pages of a search result, and it made sense. So this article contains my somewhat vague ramblings (another post will have real code!) about how I'd implement the next biggest search engine (Clarkson's voice) ... in the world.
Let's say I've got a big dataset of things to search through. Split this dataset up into quarters: A, B, C and D. Let's have 4 servers, each responsible for searching through its own quarter. A user then comes along and says they want to search for 'flux capacitor'. The first part of how this works is pretty straightforward, and lovely to parallelise: each of the 4 servers searches through its own quarter, making a list of all pages that match the search.
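To make that concrete, here's a rough Python sketch of the map step. The page structure and field names ('contents', 'pagerank' and so on) are just made up for illustration:

    # Rough sketch of the 'map' step: each of the 4 servers runs something
    # like this over its own quarter of the dataset.

    def search_quarter(pages, query):
        """Return every page in this server's quarter that matches the query."""
        return [page for page in pages if query in page["contents"]]

    # e.g. on server A:
    # matches_a = search_quarter(quarter_a, "flux capacitor")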
If we then got the 4 servers to send their full results to one server to merge them together, we would quickly overrun the network in the data centre, so the next bit is where I think the 'magic' is. In hindsight, it's kind of obvious to me now, so if you're smarter than me and think 'this has been done a million times before', please don't leave rude comments. Anyway, I digress, so let's start with an overview picture:
[Overview picture]
Each of the 4 servers then sorts its results and keeps only the top 100 rows. It's as though each server did this:
select top 100 * from my_quarter where contents like '%flux capacitor%' order by pagerank desc
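In Python terms, that per-server step might look something like this (again just a sketch, reusing the made-up page dicts from above; heapq.nlargest avoids sorting the whole result set just to keep 100 rows):

    import heapq

    TOP_N = 100

    def top_results(matches, n=TOP_N):
        """Keep only the n highest-pagerank matches, sorted best-first."""
        return heapq.nlargest(n, matches, key=lambda page: page["pagerank"])

    # e.g. on server A:
    # results_a = top_results(matches_a)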
Servers B and D then send their results to A and C, respectively. Servers A and C now have 100 + 100 rows each. They each perform a merge operation on these rows to get a sorted 200-row set, and keep only the first 100 rows of that. (A merge operation is used to combine two independently sorted sets into one sorted set.)
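Here's a rough sketch of that pairwise merge, assuming the inputs are already sorted best-first (descending pagerank) as produced by the step above:

    import heapq
    from itertools import islice

    def merge_top(left, right, n=100):
        """Merge two descending-sorted result lists, keeping only the best n rows."""
        merged = heapq.merge(left, right,
                             key=lambda page: page["pagerank"], reverse=True)
        return list(islice(merged, n))

    # e.g. on server A, once B's 100 rows arrive over the network:
    # a_and_b = merge_top(results_a, results_b)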
Then server C sends its 100-row set to server A. Again, server A merges its own 100 rows with C's 100 rows to produce the final results. These results are then paginated, at 10 rows per page x 10 pages, to give the user his results.
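The final merge-and-paginate on server A might look roughly like this (still just a sketch with made-up names):

    PAGE_SIZE = 10

    def paginate(results, page_size=PAGE_SIZE):
        """Split the final result list into consecutive pages of page_size rows."""
        return [results[i:i + page_size] for i in range(0, len(results), page_size)]

    # final = merge_top(a_and_b, c_and_d)   # server A merges its set with C's
    # pages = paginate(final)               # pages[0] is the first page the user sees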
I think this would be immensely scalable. What are your thoughts?