Gathering the data
As I had done for the previous effort, I gathered data against the same list of sites around the same date. This gave me a consistent data set that allowed for better comparison. Compare one or two sites and you may get some false positives. Compare 100 sites and things start getting a little more interesting. The Yahoo! data ended up looking like this (for people who are new to the series, I am doing the same graphs for a number of search engines):
Technorati Top 100 | Yahoo Links | Technorati Links | Technorati/Yahoo Links |
---|---|---|---|
Boing Boing | 1880000 | 22532 | 1.19851% |
InstaPundit | 2160000 | 15190 | 0.70324% |
Daily Kos | 1690000 | 15833 | 0.93686% |
Gizmodo | 1970000 | 12278 | 0.62325% |
Fark | 1420000 | 10216 | 0.71944% |
EnGadget | 2820000 | 15051 | 0.53372% |
Davenetics | 66400 | 7571 | 11.40211% |
Eschaton | 1400000 | 8713 | 0.62236% |
Dooce | 653000 | 6797 | 1.04089% |
Andrew Sullivan | 1260000 | 7680 | 0.60952% |
The Best Page In The Universe | 62000 | 6333 | 10.21452% |
Talking Points Memo: by Joshua Micah Marshall | 563000 | 7592 | 1.34849% |
lgf: anti-idiotarian | 49300 | 8275 | 16.78499% |
kottke.org | 1200000 | 7278 | 0.60650% |
WIL WHEATON DOT NET | 564000 | 6314 | 1.11950% |
Metafilter | 1160000 | 7591 | 0.65440% |
Doc Searls | 1150000 | 5690 | 0.49478% |
(In)formacao e (In)utilidade | 110000 | 6040 | 5.49091% |
Wonkette | 1370000 | 5877 | 0.42898% |
Scripting News | 1470000 | 5728 | 0.38966% |
Power Line | 344000 | 7477 | 2.17355% |
Balmasque | 40500 | 4544 | 11.21975% |
Corante | 265000 | 7686 | 2.90038% |
A list Apart | 620000 | 5536 | 0.89290% |
Something Awful | 372000 | 4512 | 1.21290% |
Megatokyo | 361000 | 4154 | 1.15069% |
Michelle Malkin | 537000 | 6091 | 1.13426% |
Arts and Letters Daily | 866000 | 3983 | 0.45993% |
Gawker | 1060000 | 4453 | 0.42009% |
Afterall it was the best I ever had | 34900 | 3591 | 10.28940% |
The Volokh Conspiracy | 1190000 | 5873 | 0.49353% |
Scobelizer | 937000 | 5524 | 0.58954% |
Jeffrey Zeldman | 528000 | 4134 | 0.78295% |
This Modern World | 813000 | 3913 | 0.48130% |
The Web Standards Project | 59800 | 3810 | 6.37124% |
Joel on Software | 966000 | 4514 | 0.46729% |
Media Matters for America | 536000 | 6809 | 1.27034% |
Television without pity | 356000 | 3859 | 1.08399% |
Kuro5hin | 866000 | 4208 | 0.48591% |
Lileks | 39700 | 3824 | 9.63224% |
Hugh Hewitt | 929000 | 4573 | 0.49225% |
Joel Veitch | 135000 | 3774 | 2.79556% |
Truthout | 371000 | 6528 | 1.75957% |
Baghdad Burning | 552000 | 3519 | 0.63750% |
Buzz machine | 1010000 | 4145 | 0.41040% |
fleugel | 201000 | 3670 | 1.82587% |
Informed Comment | 787000 | 3905 | 0.49619% |
Doppler: redefining podcasting | 607000 | 3040 | 0.50082% |
geek and proud | 9110 | 3166 | 34.75302% |
loadmemory (Asian site) | 1550 | 3324 | 214.45161% |
Photojunkie | 51200 | 2860 | 5.58594% |
Ross Rader | 48200 | 2976 | 6.17427% |
The Truth Laid Bear | 717000 | 4127 | 0.57559% |
Joi Ito | 1050000 | 5165 | 0.49190% |
ScrappleFace | 807000 | 3480 | 0.43123% |
LexText | 31200 | 2671 | 8.56090% |
Google Blog | 297000 | 3688 | 1.24175% |
Xbox | 237000 | 4221 | 1.78101% |
My life in a Bush of Ghosts | 903 | 2519 | 278.95903% |
Astronomy picture of the day | 113000 | 3498 | 3.09558% |
Crooked Timber | 67500 | 3617 | 5.35852% |
Vodka Pundit | 169000 | 3085 | 1.82544% |
Captain’s quarter | 730000 | 3671 | 0.50288% |
A small victory | 460000 | 3223 | 0.70065% |
Gato Fedorento | 126000 | 2574 | 2.04286% |
Mezzoblue | 278000 | 2952 | 1.06187% |
PostSecret | 202000 | 2707 | 1.34010% |
Samizdata.net | 18000 | 2872 | 15.95556% |
Lawrence Lessig | 959000 | 2949 | 0.30751% |
Counterpunch | 295000 | 3278 | 1.11119% |
Democratic Underground | 417000 | 3913 | 0.93837% |
Right Wing News | 794000 | 2967 | 0.37368% |
StopDesign | 255000 | 3037 | 1.19098% |
iBiblio | 197000 | 3105 | 1.57614% |
Samizdata.net (mistake?) | 697000 | 2743 | 0.39354% |
Abrupto | 44700 | 2935 | 6.56600% |
gene7299 (Asian MSNSpaces site) | 764 | 3215 | 420.81152% |
Where is Raed? | 232000 | 2409 | 1.03836% |
B3TA: We love the web | 839000 | 2614 | 0.31156% |
Talkleft | 221000 | 2901 | 1.31267% |
Wizbang | 634000 | 3358 | 0.52965% |
m1net (MSN spaces site) | 579 | 3548 | 612.78066% |
Hoder | 20900 | 5422 | 25.94258% |
CTRL+Alt+Del | 171000 | 2315 | 1.35380% |
Brad DeLong | 882000 | 2715 | 0.30782% |
Blogs for Bush | 824000 | 3560 | 0.43204% |
Neil Gaiman | 319000 | 2194 | 0.68777% |
Gothamist | 491000 | 2729 | 0.55580% |
Thought Mechanics | 190000 | 2197 | 1.15632% |
IMAO | 407000 | 2905 | 0.71376% |
Dan Gillmor (old weblog) | 298000 | 2600 | 0.87248% |
HINAGATA | 21100 | 2186 | 10.36019% |
Dean’s World | 784000 | 2985 | 0.38074% |
Defamer | 725000 | 2372 | 0.32717% |
USS Clueless | 264000 | 2570 | 0.97348% |
Dive into Mark | 235000 | 2540 | 1.08085% |
Pandagon | 743000 | 2822 | 0.37981% |
Blogging.la | 67700 | 3061 | 4.52142% |
Why are you worshipping the ground I blog on? | 85000 | 2238 | 2.63294% |
Daring Fireball | 221000 | 2573 | 1.16425% |
The first thing of interest when putting together that set of numbers was how much larger the number of links found in the Yahoo! index was, compared to the number of links found in either Technorati or Google. The second item I found interesting was a relative consistency in terms of Asian sites not figuring well in the Yahoo! index compared to the Technorati one. It seems that Technorati is getting a better handle on the Asian blogosphere than Yahoo! is, a surprising result considering how much time and effort the latter has put into its Asian operations.
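For readers who want to reproduce the last column of the table, here is a minimal Python sketch. It assumes the percentage is simply Technorati links divided by Yahoo! links; the function names `link_ratio` and `technorati_leads` are mine, and the handful of rows is copied from the table above rather than pulled from any API.

```python
# Rows copied from the table above: (site, Yahoo! links, Technorati links).
ROWS = [
    ("Boing Boing", 1_880_000, 22_532),
    ("Davenetics", 66_400, 7_571),
    ("loadmemory (Asian site)", 1_550, 3_324),
    ("m1net (MSN spaces site)", 579, 3_548),
]


def link_ratio(technorati_links, yahoo_links):
    """Technorati links expressed as a percentage of Yahoo! links."""
    return 100.0 * technorati_links / yahoo_links


def technorati_leads(rows):
    """Names of the sites where Technorati reports more links than Yahoo!."""
    return [name for name, yahoo, technorati in rows
            if link_ratio(technorati, yahoo) > 100.0]


if __name__ == "__main__":
    for name, yahoo, technorati in ROWS:
        print(f"{name}: {link_ratio(technorati, yahoo):.5f}%")
    print("Technorati ahead of Yahoo!:", technorati_leads(ROWS))
```

Any ratio above 100% means Technorati found more links than Yahoo! did, which in this sample only happens for the two Asian sites.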
In order to get some real visual comparison, I decided to draw a similar diagram of the link percentages distributed across all 100 sites. It looked like this:

The interesting story here is that there appeared to be much greater variance from site to site in the Google index than there was in the Yahoo! one. In the Yahoo! index, the vast majority of sites fall in the below-one-percent range, but what became even more interesting was that the variance was really not that high: when comparing the median and the average, the difference turned out to be less than 0.1%:
Technorati Top 100 | Yahoo Links | Technorati Links | Technorati/Yahoo Links |
---|---|---|---|
Total | 56150006 | 479580 | 0.85410% |
Median | 389500 | 3679.5 | 0.94467% |
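The two summary rows can be checked with a few lines of Python. This is a sketch under one assumption: the figures in the table are consistent with the "Total" percentage being total Technorati links over total Yahoo! links, and the "Median" percentage being the median Technorati count over the median Yahoo! count (not the median of the per-site percentages). The helper names `percent` and `summarize` are hypothetical.

```python
from statistics import median

# Summary figures copied from the table above.
TOTAL_YAHOO, TOTAL_TECHNORATI = 56_150_006, 479_580
MEDIAN_YAHOO, MEDIAN_TECHNORATI = 389_500, 3_679.5


def percent(part, whole):
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


def summarize(yahoo_counts, technorati_counts):
    """Return (pooled percentage, ratio of medians) for two raw columns."""
    pooled = percent(sum(technorati_counts), sum(yahoo_counts))
    of_medians = percent(median(technorati_counts), median(yahoo_counts))
    return pooled, of_medians


pooled = percent(TOTAL_TECHNORATI, TOTAL_YAHOO)
ratio_of_medians = percent(MEDIAN_TECHNORATI, MEDIAN_YAHOO)
gap = ratio_of_medians - pooled

if __name__ == "__main__":
    print(f"pooled: {pooled:.5f}%  medians: {ratio_of_medians:.5f}%  gap: {gap:.5f}")
```

Given the full per-site columns, `summarize` would produce both figures in one call; the module-level values only re-derive the percentages from the totals and medians already published in the table.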
While the numbers were vastly different in terms of size (it appeared Yahoo! had a lot more links), I figured the patterns would be roughly the same in terms of coverage: I expected the top sites to get better coverage in a large search engine like Yahoo! than smaller sites. Imagine my surprise then when I started to do some group analysis:
Technorati Top 100 | Yahoo Links | Technorati Links | Technorati/Yahoo Links |
---|---|---|---|
AVERAGE TOP 10 | 1531940 | 12186.1 | 0.79547% |
AVERAGE TOP 25 | 986368 | 8733.36 | 0.88541% |
AVERAGE TOP 50 | 768245.2 | 6534.36 | 0.85056% |
AVERAGE BOTTOM 50 | 354754.92 | 3057.24 | 0.86179% |
AVERAGE BOTTOM 25 | 362220.8846 | 2834.884615 | 0.78264% |
AVERAGE BOTTOM 10 | 350072.7273 | 2622.909091 | 0.74925% |
Those numbers seemed to be all over the map, a fact that became much clearer once I graphed it:

None of the nice downward curve I had with the Google set. Here was a much more disparate set, providing little support for a theory of bias from a search engine. In fact, it worked more to potentially prove such a theory wrong.
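The group rows in the table above can be recomputed from the per-site data. Here is a minimal Python sketch for the top 10, assuming each group percentage is the average Technorati count divided by the average Yahoo! count for that slice of the ranking; `group_average` is a hypothetical helper name and the rows are the first ten of the Yahoo! table.

```python
# First ten rows of the Yahoo! table: (site, Yahoo! links, Technorati links).
TOP_10 = [
    ("Boing Boing", 1_880_000, 22_532),
    ("InstaPundit", 2_160_000, 15_190),
    ("Daily Kos", 1_690_000, 15_833),
    ("Gizmodo", 1_970_000, 12_278),
    ("Fark", 1_420_000, 10_216),
    ("EnGadget", 2_820_000, 15_051),
    ("Davenetics", 66_400, 7_571),
    ("Eschaton", 1_400_000, 8_713),
    ("Dooce", 653_000, 6_797),
    ("Andrew Sullivan", 1_260_000, 7_680),
]


def group_average(rows):
    """Return (avg Yahoo! links, avg Technorati links, Technorati/Yahoo! %)."""
    n = len(rows)
    avg_yahoo = sum(yahoo for _, yahoo, _ in rows) / n
    avg_technorati = sum(technorati for _, _, technorati in rows) / n
    return avg_yahoo, avg_technorati, 100.0 * avg_technorati / avg_yahoo


if __name__ == "__main__":
    avg_yahoo, avg_technorati, ratio = group_average(TOP_10)
    print(f"AVERAGE TOP 10 | {avg_yahoo:g} | {avg_technorati:g} | {ratio:.5f}%")
```

The same function applies to any other slice of the ranking. One thing worth double-checking against the raw spreadsheet: the fractional parts of the published bottom-25 and bottom-10 averages are consistent with dividing by 26 and 11 rows rather than 25 and 10, so those two groups may include one extra row.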
Was my data set wrong? I rechecked it and it was not. So what was happening here? As dreams of long tail and power law distributions fell out, I started to wonder how Yahoo! and Google compared. So, of course, I decided to run the numbers again…
Yahoo! vs. Google
This time I decided to compare Google and Yahoo! First, I figured I would get some reference data on the subject. I was surprised not to find any actual side-by-side comparison on a large set of sites. Anecdotal evidence existed, but nothing comparable to the data set I had amassed, so I figured I would trust my own data set (note: if you have a better one, please leave a comment as to where it is located). The set ended up looking like this:
Name | Position 5/19/05 | Google Links | Yahoo Links | Google/Yahoo Links |
---|---|---|---|---|
Boing Boing | 1 | 45200 | 1880000 | 2.40% |
InstaPundit | 2 | 75000 | 2160000 | 3.47% |
Daily Kos | 3 | 59800 | 1690000 | 3.54% |
Gizmodo | 4 | 39300 | 1970000 | 1.99% |
Fark | 5 | 43600 | 1420000 | 3.07% |
EnGadget | 6 | 46800 | 2820000 | 1.66% |
Davenetics | 7 | 1780 | 66400 | 2.68% |
Eschaton | 8 | 62400 | 1400000 | 4.46% |
Dooce | 9 | 23600 | 653000 | 3.61% |
Andrew Sullivan | 10 | 41100 | 1260000 | 3.26% |
The Best Page In The Universe | 11 | 656 | 62000 | 1.06% |
Talking Points Memo: by Joshua Micah Marshall | 12 | 74600 | 563000 | 13.25% |
lgf: anti-idiotarian | 13 | 14700 | 49300 | 29.82% |
kottke.org | 14 | 32000 | 1200000 | 2.67% |
WIL WHEATON DOT NET | 15 | 16900 | 564000 | 3.00% |
Metafilter | 16 | 34500 | 1160000 | 2.97% |
Doc Searls | 17 | 33600 | 1150000 | 2.92% |
(In)formacao e (In)utilidade | 18 | 1780 | 110000 | 1.62% |
Wonkette | 19 | 28800 | 1370000 | 2.10% |
Scripting News | 20 | 39400 | 1470000 | 2.68% |
Power Line | 21 | 7510 | 344000 | 2.18% |
Balmasque | 22 | 24 | 40500 | 0.06% |
Corante | 23 | 6770 | 265000 | 2.55% |
A list Apart | 24 | 21100 | 620000 | 3.40% |
Something Awful | 25 | 9020 | 372000 | 2.42% |
Megatokyo | 26 | 7310 | 361000 | 2.02% |
Michelle Malkin | 27 | 17300 | 537000 | 3.22% |
Arts and Letters Daily | 28 | 23900 | 866000 | 2.76% |
Gawker | 29 | 23500 | 1060000 | 2.22% |
Afterall it was the best I ever had | 30 | 95 | 34900 | 0.27% |
The Volokh Conspiracy | 31 | 42000 | 1190000 | 3.53% |
Scobelizer | 32 | 21800 | 937000 | 2.33% |
Jeffrey Zeldman | 33 | 22500 | 528000 | 4.26% |
This Modern World | 34 | 32100 | 813000 | 3.95% |
The Web Standards Project | 35 | 1850 | 59800 | 3.09% |
Joel on Software | 36 | 22400 | 966000 | 2.32% |
Media Matters for America | 37 | 24800 | 536000 | 4.63% |
Television without pity | 38 | 13300 | 356000 | 3.74% |
Kuro5hin | 39 | 17300 | 866000 | 2.00% |
Lileks | 40 | | 39700 | 0.00% |
Hugh Hewitt | 41 | 26700 | 929000 | 2.87% |
Joel Veitch | 42 | 2830 | 135000 | 2.10% |
Truthout | 43 | 8780 | 371000 | 2.37% |
Baghdad Burning | 44 | 22700 | 552000 | 4.11% |
Buzz machine | 45 | 30600 | 1010000 | 3.03% |
fleugel | 46 | 1890 | 201000 | 0.94% |
Informed Comment | 47 | 27900 | 787000 | 3.55% |
Doppler: redefining podcasting | 48 | 4420 | 607000 | 0.73% |
geek and proud | 49 | 355 | 9110 | 3.90% |
loadmemory (Asian site) | 50 | 83 | 1550 | 5.35% |
Photojunkie | 51 | 1540 | 51200 | 3.01% |
Ross Rader | 52 | 1070 | 48200 | 2.22% |
The Truth Laid Bear | 53 | 23900 | 717000 | 3.33% |
Joi Ito | 54 | 23400 | 1050000 | 2.23% |
ScrappleFace | 55 | 31100 | 807000 | 3.85% |
LexText | 56 | 1970 | 31200 | 6.31% |
Google Blog | 57 | 46 | 297000 | 0.02% |
Xbox | 58 | 6600 | 237000 | 2.78% |
My life in a Bush of Ghosts | 59 | 6 | 903 | 0.66% |
Astronomy picture of the day | 60 | 5020 | 113000 | 4.44% |
Crooked Timber | 61 | 3560 | 67500 | 5.27% |
Vodka Pundit | 62 | 4520 | 169000 | 2.67% |
Captain’s quarter | 63 | 27100 | 730000 | 3.71% |
A small victory | 64 | 16700 | 460000 | 3.63% |
Gato Fedorento | 65 | 1630 | 126000 | 1.29% |
Mezzoblue | 66 | 12000 | 278000 | 4.32% |
PostSecret | 67 | 5790 | 202000 | 2.87% |
Samizdata.net | 68 | 1050 | 18000 | 5.83% |
Lawrence Lessig | 69 | 30600 | 959000 | 3.19% |
Counterpunch | 70 | 11700 | 295000 | 3.97% |
Democratic Underground | 71 | 14900 | 417000 | 3.57% |
Right Wing News | 72 | 27900 | 794000 | 3.51% |
StopDesign | 73 | 10200 | 255000 | 4.00% |
iBiblio | 74 | 9730 | 197000 | 4.94% |
Samizdata.net (mistake?) | 75 | 25500 | 697000 | 3.66% |
Abrupto | 76 | 550 | 44700 | 1.23% |
gene7299 (Asian MSNSpaces site) | 77 | 58 | 764 | 7.59% |
Where is Raed? | 78 | 10100 | 232000 | 4.35% |
B3TA: We love the web | 79 | 12000 | 839000 | 1.43% |
Talkleft | 80 | 7170 | 221000 | 3.24% |
Wizbang | 81 | 21000 | 634000 | 3.31% |
m1net (MSN spaces site) | 82 | 104 | 579 | 17.96% |
Hoder | 83 | 1480 | 20900 | 7.08% |
CTRL+Alt+Del | 84 | 2310 | 171000 | 1.35% |
Brad DeLong | 85 | 30100 | 882000 | 3.41% |
Blogs for Bush | 86 | 16200 | 824000 | 1.97% |
Neil Gaiman | 87 | 13700 | 319000 | 4.29% |
Gothamist | 88 | 15200 | 491000 | 3.10% |
Thought Mechanics | 89 | 4400 | 190000 | 2.32% |
IMAO | 90 | 23800 | 407000 | 5.85% |
Dan Gillmor (old weblog) | 91 | 10800 | 298000 | 3.62% |
HINAGATA | 92 | 10100 | 21100 | 47.87% |
Dean’s World | 93 | 30600 | 784000 | 3.90% |
Defamer | 94 | 9310 | 725000 | 1.28% |
USS Clueless | 95 | 8470 | 264000 | 3.21% |
Dive into Mark | 96 | 14600 | 235000 | 6.21% |
Pandagon | 97 | 27300 | 743000 | 3.67% |
Blogging.la | 98 | 3200 | 67700 | 4.73% |
Why are you worshipping the ground I blog on? | 99 | 1430 | 85000 | 1.68% |
Daring Fireball | 100 | 12000 | 221000 | 5.43% |
Nothing particularly impressive there. It seemed that Google, on average, ended up with only about 3% of the links Yahoo! had in its index. However, the story got more interesting when looking at divergence between the average and the median, as it seemed there was a statistical divergence (almost half a percent) between the two:
Technorati Top 100 | Google Links | Yahoo Links | Google/Yahoo Links |
---|---|---|---|
Total | 1739867 | 56150006 | 3.10% |
Median | 13700 | 389500 | 3.52% |
But wait, for the weirdness is only getting started. Next up was looking at the distributions (as I’ve done for Technorati vs. each of the engines):
Technorati Top 100 | Google Links | Yahoo Links | Google/Yahoo Links |
---|---|---|---|
AVERAGE TOP 10 | 43858 | 1531940 | 2.86% |
AVERAGE TOP 25 | 30397.6 | 986368 | 3.08% |
AVERAGE TOP 50 | 23599.04082 | 768245.2 | 3.07% |
AVERAGE BOTTOM 50 | 11443.07843 | 354754.92 | 3.23% |
AVERAGE BOTTOM 25 | 11980.07692 | 362220.8846 | 3.31% |
AVERAGE BOTTOM 10 | 13782.72727 | 350072.7273 | 3.94% |
I looked at the numbers and they did not seem right, so I ran them again and ended up with the same results. Ran them a third time and still couldn’t make sense of it. So I graphed it:

… and to my surprise, it appeared that the further down the line one went, the greater the differential. In fact, sites that are in the bottom of the top 100 are one full percent more likely to get indexed in Yahoo! than in Google.
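The size of that differential can be read straight off the group table. The sketch below is Python and assumes the percentage column is the average Google count divided by the average Yahoo! count for each group; the names `GROUPS` and `google_share` are mine, and the figures are copied from the table above.

```python
# Group averages copied from the table above: (group, Google links, Yahoo! links).
GROUPS = [
    ("AVERAGE TOP 10", 43858, 1531940),
    ("AVERAGE TOP 25", 30397.6, 986368),
    ("AVERAGE TOP 50", 23599.04082, 768245.2),
    ("AVERAGE BOTTOM 50", 11443.07843, 354754.92),
    ("AVERAGE BOTTOM 25", 11980.07692, 362220.8846),
    ("AVERAGE BOTTOM 10", 13782.72727, 350072.7273),
]


def google_share(google_links, yahoo_links):
    """Google links expressed as a percentage of Yahoo! links."""
    return 100.0 * google_links / yahoo_links


shares = [google_share(google, yahoo) for _, google, yahoo in GROUPS]

# Gap, in percentage points, between the bottom-10 and top-10 groups.
spread = shares[-1] - shares[0]

if __name__ == "__main__":
    for (label, _, _), share in zip(GROUPS, shares):
        print(f"{label}: {share:.2f}%")
    print(f"bottom 10 minus top 10: {spread:.2f} points")
```

The percentages do not rise at every single step (the top 25 sits slightly above the top 50), but the gap between the two ends of the list comes out at a little over one percentage point.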
Conclusions
From here, we can draw a few conclusions:
- Yahoo! generally does a better job at indexing the blogosphere than Google does. We know they have been working hard to improve their index, and here’s proof that they are getting results.
- Even if Google is the one with the motto about not doing evil, Yahoo! seems to be the one interested in giving equal opportunity to the little guy: smaller blogs seem to have a better chance of being recognized by Yahoo! than they do of being recognized by Google.
- While the front page of Google advertises that they are currently indexing over 8 billion pages, it is very difficult to find ways to support that claim via the link feature they are offering: this can be seen as confirmation that Google does not tell you about all the links it has in its index.
- Sure, volume counts, but in the case of search indexes it may count against sites: if one is less likely to appear in Google than it is to appear in Yahoo! and the Google index is much larger than the Yahoo! one, then, if Yahoo! and Google had the same amount of traffic, a single blog could find itself receiving more traffic from Yahoo! than it does from Google. This would be due to the fact that each individual page in Yahoo! has more weight than it does in Google.
- The top 100 blogs have over 56 million links in the Yahoo! index. That’s a lot of links and clearly shows that links are the currency of the blogging world. It would be interesting to get data that would help analyze how much interlinking exists across those sites.
Up next, we’ll take a look at how MSN plays into all of this. So stay tuned!