Source Check
In my initial review, I noticed that Technorati was ranking sites based on sources. However, incoming and outgoing source information is not really available from the major search engines. So, for this particular investigation, I decided to set source data aside and focus on link data. I gathered link data from the three largest search engines: Google, Yahoo! and MSN (that last one was included at the last minute just because I knew that Robert Scoble would complain about the study being biased if I didn’t include MSN).
Picking three search engines was also interesting because it provided some sort of reference check. If one of the engines did not line up with the other two, we could point to a potential flaw in that engine instead of trying to understand why the data was wrong.
Having picked that data set, I started gathering the data. Let me say that it’s a lot of information and, should I try to do this again in the future, writing software to gather the information will probably be less time-consuming than trying to get it by hand.
But enough about the process, let’s get into the numbers.
Technorati vs. Google
So the first dataset I created was a comparative index of Technorati and Google. The set was created by grabbing the number of links to a site in Google and getting the equivalent value for Technorati. The result looked like this:
Technorati Top 100 | Google Links | Technorati Links | Technorati/Google |
---|---|---|---|
Boing Boing | 45200 | 22532 | 49.8496% |
InstaPundit | 75000 | 15190 | 20.2533% |
Daily Kos | 59800 | 15833 | 26.4766% |
Gizmodo | 39300 | 12278 | 31.2417% |
Fark | 43600 | 10216 | 23.4312% |
EnGadget | 46800 | 15051 | 32.1603% |
Davenetics | 1780 | 7571 | 425.3371% |
Eschaton | 62400 | 8713 | 13.9631% |
Dooce | 23600 | 6797 | 28.8008% |
Andrew Sullivan | 41100 | 7680 | 18.6861% |
The Best Page In The Universe | 656 | 6333 | 965.3963% |
Talking Points Memo: by Joshua Micah Marshall | 74600 | 7592 | 10.1769% |
lgf: anti-idiotarian | 14700 | 8275 | 56.2925% |
kottke.org | 32000 | 7278 | 22.7438% |
WIL WHEATON DOT NET | 16900 | 6314 | 37.3609% |
Metafilter | 34500 | 7591 | 22.0029% |
Doc Searls | 33600 | 5690 | 16.9345% |
(In)formacao e (In)utilidade | 1780 | 6040 | 339.3258% |
Wonkette | 28800 | 5877 | 20.4063% |
Scripting News | 39400 | 5728 | 14.5381% |
Power Line | 7510 | 7477 | 99.5606% |
Balmasque | 24 | 4544 | 18933.3333% |
Corante | 6770 | 7686 | 113.5303% |
A list Apart | 21100 | 5536 | 26.2370% |
Something Awful | 9020 | 4512 | 50.0222% |
Megatokyo | 7310 | 4154 | 56.8263% |
Michelle Malkin | 17300 | 6091 | 35.2081% |
Arts and Letters Daily | 23900 | 3983 | 16.6653% |
Gawker | 23500 | 4453 | 18.9489% |
Afterall it was the best I ever had | 95 | 3591 | 3780.0000% |
The Volokh Conspiracy | 42000 | 5873 | 13.9833% |
Scobleizer | 21800 | 5524 | 25.3394% |
Jeffrey Zeldman | 22500 | 4134 | 18.3733% |
This Modern World | 32100 | 3913 | 12.1900% |
The Web Standards Project | 1850 | 3810 | 205.9459% |
Joel on Software | 22400 | 4514 | 20.1518% |
Media Matters for America | 24800 | 6809 | 27.4556% |
Television without pity | 13300 | 3859 | 29.0150% |
Kuro5hin | 17300 | 4208 | 24.3237% |
Lileks | 0 | 3824 | N/A |
Hugh Hewitt | 26700 | 4573 | 17.1273% |
Joel Veitch | 2830 | 3774 | 133.3569% |
Truthout | 8780 | 6528 | 74.3508% |
Baghdad Burning | 22700 | 3519 | 15.5022% |
Buzz machine | 30600 | 4145 | 13.5458% |
fleugel | 1890 | 3670 | 194.1799% |
Informed Comment | 27900 | 3905 | 13.9964% |
Doppler: redefining podcasting | 4420 | 3040 | 68.7783% |
geek and proud | 355 | 3166 | 891.8310% |
loadmemory (Asian site) | 83 | 3324 | 4004.8193% |
Photojunkie | 1540 | 2860 | 185.7143% |
Ross Rader | 1070 | 2976 | 278.1308% |
The Truth Laid Bear | 23900 | 4127 | 17.2678% |
Joi Ito | 23400 | 5165 | 22.0726% |
ScrappleFace | 31100 | 3480 | 11.1897% |
LexText | 1970 | 2671 | 135.5838% |
Google Blog | 46 | 3688 | 8017.3913% |
Xbox | 6600 | 4221 | 63.9545% |
My life in a Bush of Ghosts | 6 | 2519 | 41983.3333% |
Astronomy picture of the day | 5020 | 3498 | 69.6813% |
Crooked Timber | 3560 | 3617 | 101.6011% |
Vodka Pundit | 4520 | 3085 | 68.2522% |
Captain’s quarter | 27100 | 3671 | 13.5461% |
A small victory | 16700 | 3223 | 19.2994% |
Gato Fedorento | 1630 | 2574 | 157.9141% |
Mezzoblue | 12000 | 2952 | 24.6000% |
PostSecret | 5790 | 2707 | 46.7530% |
Samizdata.net | 1050 | 2872 | 273.5238% |
Lawrence Lessig | 30600 | 2949 | 9.6373% |
Counterpunch | 11700 | 3278 | 28.0171% |
Democratic Underground | 14900 | 3913 | 26.2617% |
Right Wing News | 27900 | 2967 | 10.6344% |
StopDesign | 10200 | 3037 | 29.7745% |
iBiblio | 9730 | 3105 | 31.9116% |
Samizdata.net (mistake?) | 25500 | 2743 | 10.7569% |
Abrupto | 550 | 2935 | 533.6364% |
gene7299 (Asian MSNSpaces site) | 58 | 3215 | 5543.1034% |
Where is Raed? | 10100 | 2409 | 23.8515% |
B3TA: We love the web | 12000 | 2614 | 21.7833% |
Talkleft | 7170 | 2901 | 40.4603% |
Wizbang | 21000 | 3358 | 15.9905% |
m1net (MSN spaces site) | 104 | 3548 | 3411.5385% |
Hoder | 1480 | 5422 | 366.3514% |
CTRL+Alt+Del | 2310 | 2315 | 100.2165% |
Brad DeLong | 30100 | 2715 | 9.0199% |
Blogs for Bush | 16200 | 3560 | 21.9753% |
Neil Gaiman | 13700 | 2194 | 16.0146% |
Gothamist | 15200 | 2729 | 17.9539% |
Thought Mechanics | 4400 | 2197 | 49.9318% |
IMAO | 23800 | 2905 | 12.2059% |
Dan Gillmor (old weblog) | 10800 | 2600 | 24.0741% |
HINAGATA | 10100 | 2186 | 21.6436% |
Dean’s World | 30600 | 2985 | 9.7549% |
Defamer | 9310 | 2372 | 25.4780% |
USS Clueless | 8470 | 2570 | 30.3424% |
Dive into Mark | 14600 | 2540 | 17.3973% |
Pandagon | 27300 | 2822 | 10.3370% |
Blogging.la | 3200 | 3061 | 95.6563% |
Why are you worshipping the ground I blog on? | 1430 | 2238 | 156.5035% |
Daring Fireball | 12000 | 2573 | 21.4417% |
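The last column of the table above is simply the Technorati count divided by the Google count. As a minimal sketch (the names `coverage` and `format_pct` are made up for illustration, and only a handful of rows are copied from the table), it could be recomputed in Python along these lines:

```python
from typing import Optional

# A few rows copied from the table above: (site, google_links, technorati_links).
ROWS = [
    ("Boing Boing", 45200, 22532),
    ("InstaPundit", 75000, 15190),
    ("Davenetics", 1780, 7571),
    ("Power Line", 7510, 7477),
    ("Lileks", 0, 3824),
    ("Crooked Timber", 3560, 3617),
]


def coverage(google: int, technorati: int) -> Optional[float]:
    """Technorati links as a percentage of Google links.

    Returns None when Google reports no links, since the ratio is undefined.
    """
    if google == 0:
        return None
    return 100.0 * technorati / google


def format_pct(value: Optional[float]) -> str:
    """Render a percentage with four decimals, or N/A when it is undefined."""
    return "N/A" if value is None else f"{value:.4f}%"


# Sites whose Technorati count exceeds their Google count (ratio above 100%).
over_100 = [
    site
    for site, google, technorati in ROWS
    if coverage(google, technorati) is not None
    and coverage(google, technorati) > 100.0
]

for site, google, technorati in ROWS:
    print(f"{site}: {format_pct(coverage(google, technorati))}")
print("Technorati larger than Google:", over_100)
```

Treating a zero Google count as N/A rather than dividing by it mirrors the Lileks row. Running the same filter over all 100 rows is one way to recount the cases where Technorati reports more links than Google.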
The last column in this table is just a quick set of calculations showing what percentage of Google's links was available in Technorati. From there, we’re already noticing some interesting trends. While most of the data ends up showing Google as having a larger set of links in its index than Technorati, there are 16 cases where the Technorati index of links is larger than the Google one. At over 15% of the dataset, that is too many to write off as noise. How Technorati ends up getting more data than Google is something that someone might want to investigate. Beyond that, it appears that Technorati gets about 30% of the links that Google gets to a particular site, as illustrated in the chart below:

The next interesting finding is that while the link count from Technorati is generally lower than it is in Google, it is consistently that way. A quick analysis of the data set shows that the percentage of Technorati links compared to Google links computed from the totals is not that far from the same percentage computed from the medians. Confused by that last sentence? Don’t worry (I was too after I wrote it) and let me show you, by pulling out another data chart:
Technorati Top 100 | Google Links | Technorati Links | Technorati/Google |
---|---|---|---|
TOTAL | 1739867 | 479580 | 27.5642% |
MEDIAN | 13500 | 3679.5 | 27.2556% |
Doesn’t it all become clearer? On average, for the top 100 bloggers, Technorati holds 27.56% of the links that Google holds. Part of the reason behind this may be that Technorati only represents the blog subset of the web while Google represents linkage for the web as a whole. From here, we could gather that for every link a blog provides, other sources on the web provide roughly 3 links. Since blogs still represent a small portion of the web, however, the importance of links in the blog world may be outpacing the importance of links in the non-blog world. Part of the reason behind this could be that links are one of the big currencies in the web space and many blogs are offering little content but are heavy on the linking. Even when an average blog entry is under 300 words, it often contains at least one link. This could mean that Technorati and other blog search engines are right to consider links as a strong measurement, but it may also show that blogs, as a medium, are not providing that much content beyond linking.
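The 27.56% figure is a ratio of totals, which is not the same thing as averaging the per-site percentages. The sketch below, written in Python with made-up helper names (`share`, `other_links_per_blog_link`, `mean_of_ratios`), recomputes the figure from the TOTAL row and shows how a single outlier row would distort a plain average of ratios. The "links from the rest of the web" estimate rests on the assumption that every link Technorati counts is also counted by Google, which the data cannot confirm.

```python
from statistics import mean

# Totals taken from the TOTAL row above.
GOOGLE_TOTAL = 1739867
TECHNORATI_TOTAL = 479580

# Four rows from the first table: (google_links, technorati_links).
# The last one ("My life in a Bush of Ghosts") is an extreme outlier.
SAMPLE = [(45200, 22532), (75000, 15190), (59800, 15833), (6, 2519)]


def share(google: int, technorati: int) -> float:
    """Technorati links as a percentage of Google links."""
    return 100.0 * technorati / google


def other_links_per_blog_link(google: int, technorati: int) -> float:
    """Non-blog links per blog link.

    Assumption: Technorati's links are a subset of Google's.
    """
    return (google - technorati) / technorati


def mean_of_ratios(rows) -> float:
    """Plain average of per-site percentages, skipping zero Google counts."""
    return mean(share(g, t) for g, t in rows if g > 0)


overall = share(GOOGLE_TOTAL, TECHNORATI_TOTAL)
per_blog_link = other_links_per_blog_link(GOOGLE_TOTAL, TECHNORATI_TOTAL)

sample_totals = share(sum(g for g, _ in SAMPLE), sum(t for _, t in SAMPLE))
sample_mean = mean_of_ratios(SAMPLE)

print(f"Ratio of totals: {overall:.4f}%")
print(f"Other links per blog link: {per_blog_link:.2f}")
print(f"Sample, ratio of totals: {sample_totals:.2f}%")
print(f"Sample, mean of ratios:  {sample_mean:.2f}%")
```

On the sample, the mean of ratios is dominated by the one site with 6 Google links, which is why totals and medians are the safer summary here.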
However, it gets even more interesting if you dig in. Looking at the data, these values are actually misleading. What is happening is not truly an egalitarian match. Doing a quick review of the distribution, we start seeing some interesting trends.
Technorati Top 100 | Google Links | Technorati Links | Technorati/Google |
---|---|---|---|
AVERAGE TOP 10 | 43858 | 12186.1 | 27.7854% |
AVERAGE TOP 25 | 30397.6 | 8733.36 | 28.7304% |
AVERAGE TOP 50 | 23127.06 | 6534.36 | 28.2542% |
AVERAGE BOTTOM 50 | 11443.07843 | 3057.24 | 26.7169% |
AVERAGE BOTTOM 25 | 11980.07692 | 2834.884615 | 23.6633% |
AVERAGE BOTTOM 10 | 13782.72727 | 2622.909091 | 19.0304% |
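For readers who want to check the bucket rows, here is a minimal Python sketch of how the top 10 line can be derived from the first table. The function name `bucket_summary` is an illustrative choice, and the rows are the first ten entries of the Technorati Top 100 as listed above.

```python
# First ten rows of the table: (site, google_links, technorati_links).
TOP_10 = [
    ("Boing Boing", 45200, 22532),
    ("InstaPundit", 75000, 15190),
    ("Daily Kos", 59800, 15833),
    ("Gizmodo", 39300, 12278),
    ("Fark", 43600, 10216),
    ("EnGadget", 46800, 15051),
    ("Davenetics", 1780, 7571),
    ("Eschaton", 62400, 8713),
    ("Dooce", 23600, 6797),
    ("Andrew Sullivan", 41100, 7680),
]


def bucket_summary(rows):
    """Return (average Google links, average Technorati links, percentage).

    The percentage is the ratio of the two averages, not an average of
    per-site percentages.
    """
    count = len(rows)
    avg_google = sum(google for _site, google, _tech in rows) / count
    avg_technorati = sum(tech for _site, _google, tech in rows) / count
    return avg_google, avg_technorati, 100.0 * avg_technorati / avg_google


avg_google, avg_technorati, pct = bucket_summary(TOP_10)
print(f"AVERAGE TOP 10 | {avg_google} | {avg_technorati} | {pct:.4f}% |")
```

Passing a different slice of the full list (for example the last ten rows) to the same function would give the other bucket lines, as long as the slice boundaries match the ones used in the table.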
Let’s graph the Technorati links as percentage of Google to see a little more of what I’m inferring:

Looking at this, it seems that our friends at Technorati have a bias. On average, Technorati holds about 27.8% of Google's link count for blogs in the top 10 but only about 19% for blogs in the bottom 10, a gap of more than 8 percentage points. Considering that Google already admits to some level of bias in its system (part of the foundation for PageRank is that sites with higher PageRanks get indexed more often), it is a bit worrisome, especially if the trend holds across the whole of Technorati’s universe. If Google favors indexing more popular sites more often, a clear opportunity for world-live-web search engines like Technorati would be in the long tail of less-often-indexed sites, but Technorati seems to ignore that opportunity and concentrate on the top sites. What that will translate into is a direct reproduction of the power laws when it comes to indexing of blogs.
But is that true of Google vs Technorati only? Or do the same rules apply for other search engines? We’ll look at that in the next entry.