Tuesday, 28 December 2010

data: list of top sites from alexa

Alexa has a free list of the top 1m websites: http://s3.amazonaws.com/alexa-static/top-1m.csv.zip

sample:

1,google.com
2,facebook.com
3,youtube.com
4,yahoo.com
5,live.com

A few curiosities:

  • While most entries are just domains, 10007 have path information:

    2760,feedproxy.google.com/~r
    5824,mail.qip.ru/~Inbox
    7108,xhamster.com/user/video
    7634,journeyplanner.tfl.gov.uk/user/XSLT_TRIP_REQUEST2


  • Two of the entries with path info contain commas:

    490727,pomoho.com/user/cmuser,1
    936298,intranet.espace-privilege.leclercvoyages.com/user/eleclerc-voyages,2

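    which causes weirdness when using R's read.csv() command: it works out the number of columns from the first few lines (two here), so the extra field on these lines spills over into a new row.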

    The script I used to find where the ranks diverged from the row indexes (before I found that the CSV had unescaped commas):

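    # the data from the CSV file is in scores (columns: rank, domain)
    onem <- seq_len(nrow(scores))      # expected ranks: 1, 2, 3, ...
    head(onem[scores$rank != onem])    # first row numbers where rank and index disagree
    scores[490728, ]                   # look at the first row that is out of step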
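A quicker way to spot these lines is probably to count the fields on each raw line before parsing anything. This is only a rough sketch: it runs on a handful of lines copied from this post rather than the real file, and the names `lines`, `n_fields`, `bad_lines` and `has_path` are just ones I made up for illustration.

```r
# A few lines copied from the post. For the real file one would use
# something like lines <- readLines("top-1m.csv") (file name assumed).
lines <- c(
  "1,google.com",
  "2760,feedproxy.google.com/~r",
  "490727,pomoho.com/user/cmuser,1",
  "936298,intranet.espace-privilege.leclercvoyages.com/user/eleclerc-voyages,2"
)

# Number of comma-separated fields on each line.
n_fields <- vapply(strsplit(lines, ",", fixed = TRUE), length, integer(1))

# A well-formed line has exactly two fields (rank, domain);
# anything with more contains an unescaped comma.
bad_lines <- lines[n_fields > 2]

# Entries with path information contain a slash.
has_path <- grepl("/", lines, fixed = TRUE)

print(bad_lines)
print(sum(has_path))
```

On the full list, `sum(has_path)` should be the count of entries with path information, assuming no rank or bare domain ever contains a slash.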




Oh, and here are the two extra rows that read.csv silently created:

> scores[scores$domain == "",]
       rank domain
490728    1
936300    2
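One way to avoid the extra rows is to skip the CSV parser and split each line on its first comma only, so that any later commas stay inside the domain. Again just a sketch under assumptions: it uses two sample lines from this post instead of the real file, and it assumes the rank itself never contains a comma.

```r
# Two lines copied from the post. For the real file one would use
# something like lines <- readLines("top-1m.csv") (file name assumed).
lines <- c(
  "1,google.com",
  "490727,pomoho.com/user/cmuser,1"
)

# Everything before the first comma is the rank ...
rank <- as.integer(sub(",.*$", "", lines))

# ... and everything after the first comma is the domain,
# including any further commas.
domain <- sub("^[^,]*,", "", lines)

scores <- data.frame(rank = rank, domain = domain, stringsAsFactors = FALSE)
print(scores)
```

With this approach each input line becomes exactly one row, so the ranks and the row indexes should stay in step.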
