How Search Engines Work
Before I get into the details of why I think this web accelerator is a major search move by Google, I first need to walk some of my readers through the basics of search and some of the issues involved in creating a good search product. If you already know about search indexing, you can skip to the next section.
Search engines act not only as giant card catalogs, similar to the ones you can find in a library, but also as giant libraries in and of themselves. When you type a word in a search box, what happens next includes a number of steps that allow the engine to look through a giant index, which is basically an image of all the pages the search engine knows about.
The way those indexes are created is through programs known as spiders (sometimes also referred to as web robots or crawlers). Those programs are independent pieces of software that basically surf the web at very high speed, making copies of everything they encounter and comparing what they find to what other spiders have found. That giant set of pages copied by spiders is called an index (it is also sometimes referred to as a collection). Spiders run around the clock and their sole job is to get more pages and to ensure that the pages they’ve gotten in the past still exist and have not changed (if they have changed, the spider will “re-index” the page, i.e. delete the previous version from the index and put the new one in its place).
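To make that crawl-and-index loop concrete, here is a toy sketch in Python. It is nothing like a production crawler (no politeness rules, no parallelism, no ranking), and the names crawl, spider and index are my own, not anything Google uses:

```python
# A toy illustration of the crawl-and-index loop described above.
import hashlib
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Pull the href of every <a> tag out of a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

index = {}  # url -> (content hash, copy of the page)

def crawl(url):
    """Fetch a page, (re-)index it if new or changed, return its outgoing links."""
    html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
    digest = hashlib.sha1(html.encode()).hexdigest()
    if url not in index or index[url][0] != digest:
        index[url] = (digest, html)  # "re-index": the old copy is replaced
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(url, link) for link in parser.links]

def spider(seed_urls, limit=100):
    """The spider's whole life: keep pulling URLs off a queue, around the clock."""
    queue, seen = list(seed_urls), set()
    while queue and len(seen) < limit:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            queue.extend(crawl(url))
        except Exception:
            pass  # dead links and timeouts are a fact of life on the web
```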
Size Matters
The idea is a surprisingly simple one and was first introduced in the early days of the web. At the time, creating an index of all the pages on the web was relatively easy, largely because there were not that many pages and not that many people creating them. (I actually enjoy surprising newbies by telling them that I once saw the whole web, every single page on it. What I omit until later in the story is that I did this in 1993, at a time when you could count the number of web servers without hitting 100 and when you could actually see the whole web in only a few hours.)
The amazing thing is that, although the number of web sites (and hence the number of web pages) has exploded, the basic technology to build a search index has not evolved that much. The concepts are basically the same today as they were in 1994-1995 but the web is now much, much larger.
How large, you wonder? Well, a good indicator would be to take a look at the bottom of the Google home page for a number. As of this writing, that number stands at 8,058,044,651. That’s over 8 billion pages, a very large number and one that folks at Google are appropriately proud of.
There’s only one little issue with that number: it’s on the low side. In fact, it’s estimated that it represents less than one percent of the actual number of pages on the web. In 2001, that number was estimated at over 500 billion pages in what is called the Deep Web, the part of the web that has not been indexed by search engines yet. With the growth of weblogs, which generate tons of content on a daily basis, and the connection of more systems like books, satellite maps, etc. to the web, you can only imagine how much that number has grown.
Let’s pause for a moment and assume that only as many pages were created between 2001 and now as were created in the previous four years, at the height of the dotcom boom. If the web already held some 500 billion pages in 2001, doubling that means there would be over a trillion web pages on the Internet today. Now that gets to be a much more interesting number.
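For what it’s worth, the back-of-the-envelope math looks like this (the doubling of the 2001 estimate is my own simplifying assumption, not a measured figure):

```python
# Back-of-the-envelope math behind the "over a trillion" figure.
google_index = 8_058_044_651          # the number on the Google home page
deep_web_2001 = 500_000_000_000       # estimated Deep Web pages, 2001
estimated_total = deep_web_2001 * 2   # assume as many pages again since 2001

print(f"Estimated total: {estimated_total:,} pages")           # 1 trillion
print(f"Indexed share: {google_index / estimated_total:.2%}")  # under 1%
```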
You Call THIS Fresh?!?
So we know that Google has a problem in finding a lot of the pages that already exist on the Internet. But that’s nothing compared to the other problem Google has.
Imagine an index with 1 million pages. If you assume that a spider can index that one million pages in a day, then the copy of each page in the index is at most a day old, meaning the index gets a new version of every page once a day. Now try to do the same with 8 billion pages and it becomes a pretty complicated problem. Google has solved some of that problem by basically deciding that some sites are worth more than others. As a result, sites which are known to refresh their content on a regular basis get more attention from Google than sites that do not.
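To see why, run the numbers on that hypothetical one-million-pages-a-day spider against an index the size Google advertises:

```python
# A rough sense of why freshness gets hard as the index grows.
crawl_rate = 1_000_000       # pages the hypothetical spider can index per day
small_index = 1_000_000      # every page gets revisited daily
big_index = 8_058_044_651    # Google's advertised index size

days_per_pass = big_index / crawl_rate
print(f"{days_per_pass:,.0f} days (~{days_per_pass / 365:.0f} years) per full refresh")
# Roughly 8,058 days, or about 22 years, unless you add many more spiders
# or, as Google does, prioritize the sites known to change often.
```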
With the explosion of weblogs, however, a new breed of sites has created a problem for Google. For starters, there are a lot of them, and most of them refresh their content regularly, in some cases more than once a day. This makes the job of producing fresh, relevant indexes almost impossible for Google, turning their search engine into something more akin to a library, the kind of place you go when you are looking for a reference, than to an up-to-date source.
Not only that but, if Google is to also index the deep web, keeping track of all the changes across all the web becomes impossible… Impossible, that is, if you are using crawlers.
So we now know that crawlers are no longer the right option when it comes to keeping fresh information in a proper search engine index. Faced with this, Google needs to do something radical. On the one hand, they can try to build a system that gets the most up-to-date information through notifications from the sites that are updating content. This is where services like Technorati and Feedster come in, getting updates from RSS feeds and thus building indexes with more recent information than Google’s (see the sketch below).
On the other hand, they could look at increasing the number of crawlers they are using. We know that Google has a lot of machines but trying to scale to the point where they can monitor a trillion pages via crawl would require a lot more power than that.
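To make that notification idea concrete, here is a minimal sketch of the convention most weblog tools already follow: an XML-RPC weblogUpdates.ping telling an aggregator “this site just changed.” The endpoint URL below is a placeholder, not any particular service’s address, and the response shape is only what such services typically return:

```python
# The update-notification model in miniature: instead of waiting to be
# crawled, a weblog tells the aggregators it has new content.
import xmlrpc.client

def announce_update(blog_name, blog_url, ping_endpoint="http://example.com/rpc/ping"):
    """Send a standard weblogUpdates.ping to an aggregator (endpoint is a placeholder)."""
    server = xmlrpc.client.ServerProxy(ping_endpoint)
    response = server.weblogUpdates.ping(blog_name, blog_url)
    return response  # typically something like {'flerror': False, 'message': 'Thanks for the ping.'}

# announce_update("My Weblog", "http://example.com/blog/")
```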
Enter the Web Accelerator!
Spreading the Load
In the late 90s, distributed computing took hold as a concept. Projects like SETI@home and Folding@home showed the way in terms of harnessing the power of millions of computers to solve processor-intensive problems. Google started looking at this with the rollout of their toolbar, which included a feature called Google Compute.
Now let’s move forward. What if you could get information about which pages are new and which pages have changed just by observing where people are surfing? This is the space the accelerator occupies. Sitting neatly between your web browser and the Google architecture is a mini proxy that keeps checking whether it can serve you pages faster from the Google index than from the actual site. Along the way, Google finds out which pages are missing from its index (and gets a chance to add them) and which pages in its index are not up to date.
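To illustrate the idea (and only the index-learning half of it; the speed-up side is left out), here is a rough sketch of a proxy-style fetch that notices what a cache is missing or has gone stale. The cache dictionary and report_to_indexer function are my own stand-ins, not Google’s actual design:

```python
# Conceptual sketch of the accelerator's position between browser and web:
# fetch the page, compare it to a nearby cached copy, and report what the
# cache is missing or has stale so the big index can catch up.
import hashlib
import urllib.request

prefetched_cache = {}  # url -> page copy pushed down from the big index

def report_to_indexer(url, reason):
    """Stand-in for telling the index 'this page is missing or stale'."""
    print(f"[report] {url}: {reason}")

def fetch_through_accelerator(url):
    origin_copy = urllib.request.urlopen(url, timeout=10).read()
    cached_copy = prefetched_cache.get(url)

    if cached_copy is None:
        # The index has never seen this page: no crawler had to discover it.
        report_to_indexer(url, "not in index")
        prefetched_cache[url] = origin_copy
    elif hashlib.sha1(cached_copy).hexdigest() != hashlib.sha1(origin_copy).hexdigest():
        # The index copy is out of date: freshness information for free.
        report_to_indexer(url, "index copy is stale")
        prefetched_cache[url] = origin_copy

    # The real accelerator would serve the cached copy whenever that is faster;
    # here we simply return the origin copy.
    return origin_copy
```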
Imagine a million people downloading the Google Web Accelerator and all of a sudden, you have an infrastructure that finds out about a lot of pages very quickly.
Microsoft and Yahoo! are already in competition with Google in the search space. In order to maintain its leadership, Google needs to provide an index that is not only larger than its competitors’ but also more up to date. With this accelerator, they can do that, and only one of its competitors can ever hope to match the feature: Microsoft.
The webmaster FAQ points out that the accelerator does not cover secure pages (nicely bypassing security issues) or large media files. I suspect that we will see that change in the future, with the addition of images coming first.