Technorati Yahoo and Google Too

In the last entry on the subject, we looked at how Technorati and Google compared, and discovered that Technorati was finding roughly a fourth of the links Google could locate. That raised some interesting questions: could we rely on the Google numbers? Were they so much larger than those of any other search engine that we were building an unfair comparison? And, as some alert readers pointed out in email, was Google under-reporting the number of links to a site? To answer some of those questions, I decided to build more comparisons against some of Google’s competitors. Today, I’ll go into how Yahoo! fared (hint: I was surprised by the results).

Gathering the data

As in the previous effort, I gathered data for the same list of sites around the same date, which gave me a consistent data set and allowed for better comparison. Compare one or two sites and you may get some false positives. Compare 100 sites and things start getting a little more interesting. The Yahoo! data ended up looking like this (for people who are new to the series, I am doing the same graphs for a number of search engines):

Technorati Top 100 Yahoo Links Technorati Links Technorati/Yahoo Links
Boing Boing 1880000 22532 1.19851%
InstaPundit 2160000 15190 0.70324%
Daily Kos 1690000 15833 0.93686%
Gizmodo 1970000 12278 0.62325%
Fark 1420000 10216 0.71944%
EnGadget 2820000 15051 0.53372%
Davenetics 66400 7571 11.40211%
Eschaton 1400000 8713 0.62236%
Dooce 653000 6797 1.04089%
Andrew Sullivan 1260000 7680 0.60952%
The Best Page In The Universe 62000 6333 10.21452%
Talking Points Memo: by Joshua Micah Marshall 563000 7592 1.34849%
lgf: anti-idiotarian 49300 8275 16.78499%
kottke.org 1200000 7278 0.60650%
WIL WHEATON DOT NET 564000 6314 1.11950%
Metafilter 1160000 7591 0.65440%
Doc Searls 1150000 5690 0.49478%
(In)formacao e (In)utilidade 110000 6040 5.49091%
Wonkette 1370000 5877 0.42898%
Scripting News 1470000 5728 0.38966%
Power Line 344000 7477 2.17355%
Balmasque 40500 4544 11.21975%
Corante 265000 7686 2.90038%
A list Apart 620000 5536 0.89290%
Something Awful 372000 4512 1.21290%
Megatokyo 361000 4154 1.15069%
Michelle Malkin 537000 6091 1.13426%
Arts and Letters Daily 866000 3983 0.45993%
Gawker 1060000 4453 0.42009%
Afterall it was the best I ever had 34900 3591 10.28940%
The Volokh Conspiracy 1190000 5873 0.49353%
Scobleizer 937000 5524 0.58954%
Jeffrey Zeldman 528000 4134 0.78295%
This Modern World 813000 3913 0.48130%
The Web Standards Project 59800 3810 6.37124%
Joel on Software 966000 4514 0.46729%
Media Matters for America 536000 6809 1.27034%
Television without pity 356000 3859 1.08399%
Kuro5hin 866000 4208 0.48591%
Lileks 39700 3824 9.63224%
Hugh Hewitt 929000 4573 0.49225%
Joel Veitch 135000 3774 2.79556%
Truthout 371000 6528 1.75957%
Baghdad Burning 552000 3519 0.63750%
Buzz machine 1010000 4145 0.41040%
fleugel 201000 3670 1.82587%
Informed Comment 787000 3905 0.49619%
Doppler: redefining podcasting 607000 3040 0.50082%
geek and proud 9110 3166 34.75302%
loadmemory (Asian site) 1550 3324 214.45161%
Photojunkie 51200 2860 5.58594%
Ross Rader 48200 2976 6.17427%
The Truth Laid Bear 717000 4127 0.57559%
Joi Ito 1050000 5165 0.49190%
ScrappleFace 807000 3480 0.43123%
LexText 31200 2671 8.56090%
Google Blog 297000 3688 1.24175%
Xbox 237000 4221 1.78101%
My life in a Bush of Ghosts 903 2519 278.95903%
Astronomy picture of the day 113000 3498 3.09558%
Crooked Timber 67500 3617 5.35852%
Vodka Pundit 169000 3085 1.82544%
Captain’s quarter 730000 3671 0.50288%
A small victory 460000 3223 0.70065%
Gato Fedorento 126000 2574 2.04286%
Mezzoblue 278000 2952 1.06187%
PostSecret 202000 2707 1.34010%
Samizdata.net 18000 2872 15.95556%
Lawrence Lessig 959000 2949 0.30751%
Counterpunch 295000 3278 1.11119%
Democratic Underground 417000 3913 0.93837%
Right Wing News 794000 2967 0.37368%
StopDesign 255000 3037 1.19098%
iBiblio 197000 3105 1.57614%
Samizdata.net (mistake?) 697000 2743 0.39354%
Abrupto 44700 2935 6.56600%
gene7299 (Asian MSNSpaces site) 764 3215 420.81152%
Where is Raed? 232000 2409 1.03836%
B3TA: We love the web 839000 2614 0.31156%
Talkleft 221000 2901 1.31267%
Wizbang 634000 3358 0.52965%
m1net (MSN spaces site) 579 3548 612.78066%
Hoder 20900 5422 25.94258%
CTRL+Alt+Del 171000 2315 1.35380%
Brad DeLong 882000 2715 0.30782%
Blogs for Bush 824000 3560 0.43204%
Neil Gaiman 319000 2194 0.68777%
Gothamist 491000 2729 0.55580%
Thought Mechanics 190000 2197 1.15632%
IMAO 407000 2905 0.71376%
Dan Gillmor (old weblog) 298000 2600 0.87248%
HINAGATA 21100 2186 10.36019%
Dean’s World 784000 2985 0.38074%
Defamer 725000 2372 0.32717%
USS Clueless 264000 2570 0.97348%
Dive into Mark 235000 2540 1.08085%
Pandagon 743000 2822 0.37981%
Blogging.la 67700 3061 4.52142%
Why are you worshipping the ground I blog on? 85000 2238 2.63294%
Daring Fireball 221000 2573 1.16425%
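
For anyone who wants to check the last column: each percentage is simply the Technorati count divided by the Yahoo! count. Here is a minimal Python sketch using the first three rows of the table. The function name `link_ratio_pct` is made up for illustration and is not taken from the original spreadsheet.

```python
def link_ratio_pct(technorati_links, yahoo_links):
    """Return the Technorati link count as a percentage of the Yahoo! link count."""
    return technorati_links / yahoo_links * 100

# (site, Yahoo! links, Technorati links) copied from the first rows of the table
rows = [
    ("Boing Boing", 1880000, 22532),
    ("InstaPundit", 2160000, 15190),
    ("Daily Kos", 1690000, 15833),
]

ratios = {name: link_ratio_pct(technorati, yahoo) for name, yahoo, technorati in rows}

for name, pct in ratios.items():
    print(f"{name}: {pct:.5f}%")
```

A value above 100%, as seen for a handful of sites further down the table, just means Technorati found more links than Yahoo! did for that site.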

The first thing of interest when putting together that set of numbers was how much larger the number of links found in the Yahoo! index was, compared to the number of links found in either Technorati or Google. The second item I found interesting was a relative consistency in terms of Asian sites not figuring well in the Yahoo! index compared to the Technorati one. It seems that Technorati is getting a better handle on the Asian blogosphere than Yahoo! is, a surprising result considering how much time and effort the latter has put into its Asian operations.

In order to get some real visual comparison, I decided to draw a similar diagram of the link percentages distributed across all 100 sites. It looked like this:

[Figure: Technorati vs. Yahoo – Source: TNL.net]

The interesting story here is that there appeared to be much greater variance from site to site in the Google index than in the Yahoo! one. In the Yahoo! index, close to half of the sites fall below the one percent mark, but what was even more interesting was that the overall spread was really not that high: the median and the average differed by less than 0.1 percentage points:

Technorati Top 100 Yahoo Links Technorati Links Technorati/Yahoo Links
Total 56150006 479580 0.85410%
Median 389500 3679.5 0.94467%
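
The two rows are computed differently: the Total row divides the summed link counts, while the Median row divides the two median counts. A small sketch, taking the figures from the table above as given (the variable names are illustrative only):

```python
# Summary figures copied from the table above
total_yahoo, total_technorati = 56150006, 479580
median_yahoo, median_technorati = 389500, 3679.5

# Ratio of the totals and ratio of the medians, both in percent
overall_pct = total_technorati / total_yahoo * 100
median_pct = median_technorati / median_yahoo * 100

# Gap between the two, in percentage points
gap_points = median_pct - overall_pct

print(f"overall: {overall_pct:.5f}%")
print(f"median:  {median_pct:.5f}%")
print(f"gap:     {gap_points:.5f} percentage points")
```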

While the numbers were vastly different in terms of size (Yahoo! appeared to have a lot more links), I figured the patterns would be roughly the same in terms of coverage: I expected the top sites to get better coverage in a large search engine like Yahoo! than smaller sites would. Imagine my surprise, then, when I started to do some group analysis:

Technorati Top 100 Yahoo Links Technorati Links Technorati/Yahoo Links
AVERAGE TOP 10 1531940 12186.1 0.79547%
AVERAGE TOP 25 986368 8733.36 0.88541%
AVERAGE TOP 50 768245.2 6534.36 0.85056%
AVERAGE BOTTOM 50 354754.92 3057.24 0.86179%
AVERAGE BOTTOM 25 362220.8846 2834.884615 0.78264%
AVERAGE BOTTOM 10 350072.7273 2622.909091 0.74925%
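
The Top 10 row can be reproduced from the first ten rows of the site table. Note that the last column is the ratio of the two group averages, not the average of the ten per-site percentages, which would be pulled up sharply by an outlier such as Davenetics. The helper name `group_summary` is hypothetical, and this is only a sketch of the arithmetic:

```python
def group_summary(rows):
    """Average both link counts over a group and return their ratio in percent."""
    avg_yahoo = sum(yahoo for yahoo, _ in rows) / len(rows)
    avg_technorati = sum(technorati for _, technorati in rows) / len(rows)
    return avg_yahoo, avg_technorati, avg_technorati / avg_yahoo * 100

# (Yahoo! links, Technorati links) for the first ten sites in the table
top_10 = [
    (1880000, 22532), (2160000, 15190), (1690000, 15833), (1970000, 12278),
    (1420000, 10216), (2820000, 15051), (66400, 7571), (1400000, 8713),
    (653000, 6797), (1260000, 7680),
]

avg_yahoo, avg_technorati, pct = group_summary(top_10)
print(f"{avg_yahoo:.0f} {avg_technorati:.1f} {pct:.5f}%")
```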

Those numbers seemed to be all over the map, a fact that became much clearer once I graphed them:

[Figure: Technorati vs. Yahoo – Source: TNL.net]

There was none of the nice downward curve I had seen with the Google set. Here was a much more disparate set, providing little support for a theory of bias on the part of a search engine. In fact, it did more to suggest that such a theory was wrong.

Was my data set wrong? I rechecked it and it was not. So what was happening here? As dreams of long tail and power law distributions fell out, I started to wonder how Yahoo! and Google compared. So, of course, I decided to run the numbers again…

Yahoo! vs. Google

This time I decided to compare Google and Yahoo! directly. First, I figured I would get some reference data on the subject, but I was surprised not to find any actual side-by-side comparison across a large set of sites. Anecdotal evidence existed, but nothing comparable to the data set I had amassed, so I figured I would trust my own data (note: if you have a better one, please leave a comment saying where it is located). The set ended up looking like this:

Name Position 5/19/05 Google Yahoo Google/Yahoo Links
Boing Boing 1 45200 1880000 2.40%
InstaPundit 2 75000 2160000 3.47%
Daily Kos 3 59800 1690000 3.54%
Gizmodo 4 39300 1970000 1.99%
Fark 5 43600 1420000 3.07%
EnGadget 6 46800 2820000 1.66%
Davenetics 7 1780 66400 2.68%
Eschaton 8 62400 1400000 4.46%
Dooce 9 23600 653000 3.61%
Andrew Sullivan 10 41100 1260000 3.26%
The Best Page In The Universe 11 656 62000 1.06%
Talking Points Memo: by Joshua Micah Marshall 12 74600 563000 13.25%
lgf: anti-idiotarian 13 14700 49300 29.82%
kottke.org 14 32000 1200000 2.67%
WIL WHEATON DOT NET 15 16900 564000 3.00%
Metafilter 16 34500 1160000 2.97%
Doc Searls 17 33600 1150000 2.92%
(In)formacao e (In)utilidade 18 1780 110000 1.62%
Wonkette 19 28800 1370000 2.10%
Scripting News 20 39400 1470000 2.68%
Power Line 21 7510 344000 2.18%
Balmasque 22 24 40500 0.06%
Corante 23 6770 265000 2.55%
A list Apart 24 21100 620000 3.40%
Something Awful 25 9020 372000 2.42%
Megatokyo 26 7310 361000 2.02%
Michelle Malkin 27 17300 537000 3.22%
Arts and Letters Daily 28 23900 866000 2.76%
Gawker 29 23500 1060000 2.22%
Afterall it was the best I ever had 30 95 34900 0.27%
The Volokh Conspiracy 31 42000 1190000 3.53%
Scobleizer 32 21800 937000 2.33%
Jeffrey Zeldman 33 22500 528000 4.26%
This Modern World 34 32100 813000 3.95%
The Web Standards Project 35 1850 59800 3.09%
Joel on Software 36 22400 966000 2.32%
Media Matters for America 37 24800 536000 4.63%
Television without pity 38 13300 356000 3.74%
Kuro5hin 39 17300 866000 2.00%
Lileks 40 n/a 39700 0.00%
Hugh Hewitt 41 26700 929000 2.87%
Joel Veitch 42 2830 135000 2.10%
Truthout 43 8780 371000 2.37%
Baghdad Burning 44 22700 552000 4.11%
Buzz machine 45 30600 1010000 3.03%
fleugel 46 1890 201000 0.94%
Informed Comment 47 27900 787000 3.55%
Doppler: redefining podcasting 48 4420 607000 0.73%
geek and proud 49 355 9110 3.90%
loadmemory (Asian site) 50 83 1550 5.35%
Photojunkie 51 1540 51200 3.01%
Ross Rader 52 1070 48200 2.22%
The Truth Laid Bear 53 23900 717000 3.33%
Joi Ito 54 23400 1050000 2.23%
ScrappleFace 55 31100 807000 3.85%
LexText 56 1970 31200 6.31%
Google Blog 57 46 297000 0.02%
Xbox 58 6600 237000 2.78%
My life in a Bush of Ghosts 59 6 903 0.66%
Astronomy picture of the day 60 5020 113000 4.44%
Crooked Timber 61 3560 67500 5.27%
Vodka Pundit 62 4520 169000 2.67%
Captain’s quarter 63 27100 730000 3.71%
A small victory 64 16700 460000 3.63%
Gato Fedorento 65 1630 126000 1.29%
Mezzoblue 66 12000 278000 4.32%
PostSecret 67 5790 202000 2.87%
Samizdata.net 68 1050 18000 5.83%
Lawrence Lessig 69 30600 959000 3.19%
Counterpunch 70 11700 295000 3.97%
Democratic Underground 71 14900 417000 3.57%
Right Wing News 72 27900 794000 3.51%
StopDesign 73 10200 255000 4.00%
iBiblio 74 9730 197000 4.94%
Samizdata.net (mistake?) 75 25500 697000 3.66%
Abrupto 76 550 44700 1.23%
gene7299 (Asian MSNSpaces site) 77 58 764 7.59%
Where is Raed? 78 10100 232000 4.35%
B3TA: We love the web 79 12000 839000 1.43%
Talkleft 80 7170 221000 3.24%
Wizbang 81 21000 634000 3.31%
m1net (MSN spaces site) 82 104 579 17.96%
Hoder 83 1480 20900 7.08%
CTRL+Alt+Del 84 2310 171000 1.35%
Brad DeLong 85 30100 882000 3.41%
Blogs for Bush 86 16200 824000 1.97%
Neil Gaiman 87 13700 319000 4.29%
Gothamist 88 15200 491000 3.10%
Thought Mechanics 89 4400 190000 2.32%
IMAO 90 23800 407000 5.85%
Dan Gillmor (old weblog) 91 10800 298000 3.62%
HINAGATA 92 10100 21100 47.87%
Dean’s World 93 30600 784000 3.90%
Defamer 94 9310 725000 1.28%
USS Clueless 95 8470 264000 3.21%
Dive into Mark 96 14600 235000 6.21%
Pandagon 97 27300 743000 3.67%
Blogging.la 98 3200 67700 4.73%
Why are you worshipping the ground I blog on? 99 1430 85000 1.68%
Daring Fireball 100 12000 221000 5.43%

Nothing particularly impressive there. It seemed that Google, overall, ended up with only about 3% of the links Yahoo! had in its index. However, the story got more interesting when looking at the gap between the average and the median, which came to almost half a percentage point:

Technorati Top 100 Google Yahoo Google/Yahoo Links
Total 1739867 56150006 3.10%
Median 13700 389500 3.52%

But wait, for the weirdness is only getting started. Next up was looking at the distributions (as I’ve done for Technorati vs. each of the engines):

Technorati Top 100 Google Yahoo Google/Yahoo Links
AVERAGE TOP 10 43858 1531940 2.86%
AVERAGE TOP 25 30397.6 986368 3.08%
AVERAGE TOP 50 23599.04082 768245.2 3.07%
AVERAGE BOTTOM 50 11443.07843 354754.92 3.23%
AVERAGE BOTTOM 25 11980.07692 362220.8846 3.31%
AVERAGE BOTTOM 10 13782.72727 350072.7273 3.94%
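
One way to read this table is to compare its two ends directly, in percentage points. A rough sketch, taking the group averages from the table as given (the function and variable names are illustrative only):

```python
def google_share_pct(google_links, yahoo_links):
    """Return the Google link count as a percentage of the Yahoo! link count."""
    return google_links / yahoo_links * 100

# Group averages copied from the table above: (Google links, Yahoo! links)
top_10 = (43858, 1531940)
bottom_10 = (13782.72727, 350072.7273)

top_pct = google_share_pct(*top_10)
bottom_pct = google_share_pct(*bottom_10)

# Difference between the two groups, in percentage points
spread_points = bottom_pct - top_pct

print(f"top 10: {top_pct:.2f}%  bottom 10: {bottom_pct:.2f}%  spread: {spread_points:.2f} points")
```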

I looked at the numbers and they did not seem right, so I ran them again and ended up with the same results. I ran them a third time and still couldn’t make sense of them. So I graphed them:

[Figure: Google vs. Yahoo round 2]

… and to my surprise, it appeared that the further down the line one went, the greater the differential. In fact, sites that are in the bottom of the top 100 are one full percent more likely to get indexed in Yahoo! than in Google.

Conclusions

From here, we can draw a few conclusions:

  • Yahoo! generally does a better job of indexing the blogosphere than Google does. We know they have been working hard to improve their index and here’s proof that they are getting results.
  • Even if Google is the one with the motto about not doing evil, Yahoo! seems to be the one interested in giving equal opportunity to the little guy: smaller blogs seem to have a better chance of being recognized by Yahoo! than they do of being recognized by Google
  • While the front page of Google advertises they are currently indexing over 8 billion pages, it is very difficult to find ways to support that claim via the link feature they are offering: this can be seen as confirmation that Google does not tell you about all the links it has in its index.
  • Sure, volume counts, but in the case of search indexes it may count against sites: if a blog is less likely to appear in Google than in Yahoo! and the Google index is much larger than the Yahoo! one, then, if Yahoo! and Google had the same amount of traffic, a single blog could find itself receiving more traffic from Yahoo! than from Google. This would be because each individual page carries more weight in Yahoo! than it does in Google.
  • The top 100 blogs have over 56 million links in the Yahoo! index. That’s a lot of links, and it clearly shows that links are the currency of the blogging world. It would be interesting to get data that would help analyze how much interlinking exists across those sites.

Up next, we’ll take a look at how MSN fits into all of this. So stay tuned!
