sample:
1,google.com
2,facebook.com
3,youtube.com
4,yahoo.com
5,live.com
A few curiosities:
- While most entries are just domains, 10007 have path information:
2760,feedproxy.google.com/~r
5824,mail.qip.ru/~Inbox
7108,xhamster.com/user/video
7634,journeyplanner.tfl.gov.uk/user/XSLT_TRIP_REQUEST2
- Two of the entries with path info contain commas:
490727,pomoho.com/user/cmuser,1
936298,intranet.espace-privilege.leclercvoyages.com/user/eleclerc-voyages,2
which causes weirdness when parsing the file with R's read.csv() function.
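One way to sidestep the comma problem is to split each line on its first comma only, so any later commas stay part of the entry. Here is a minimal sketch in base R, using a few of the lines quoted above in place of the real file. The helper name `parse_line_pairs` is made up for illustration:

```r
# Split each "rank,entry" line on the FIRST comma only, so commas inside
# the entry (e.g. "pomoho.com/user/cmuser,1") are preserved.
# parse_line_pairs is a hypothetical helper, not part of any package.
parse_line_pairs <- function(lines) {
  comma <- as.integer(regexpr(",", lines, fixed = TRUE))  # first comma position
  data.frame(
    rank   = as.integer(substr(lines, 1, comma - 1)),
    domain = substr(lines, comma + 1, nchar(lines)),
    stringsAsFactors = FALSE
  )
}

# A few lines from the post, standing in for the real file.
lines <- c(
  "1,google.com",
  "2760,feedproxy.google.com/~r",
  "490727,pomoho.com/user/cmuser,1"
)
scores <- parse_line_pairs(lines)

# Entries with path information contain a slash.
with_path <- grepl("/", scores$domain, fixed = TRUE)
print(scores[with_path, ])
```

On the real file, `lines` would presumably come from `readLines()` on the CSV, and `sum(with_path)` would give the count of entries with path information.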
Script I used to find where the ranks diverged from the row indexes (before I found that the CSV had unescaped commas):
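# the data from the CSV file is in scores
onem = seq(1, 1000002)            # one index per row of scores
head(onem[scores$rank != onem])   # first row indexes where rank and index disagree
scores[490728,]                   # inspect a divergent row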
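With hindsight, the malformed lines can also be located directly by counting the fields on each line. This is a sketch using `count.fields()` from R's utils package, run on a small temporary file rather than the real CSV:

```r
# Write a few sample lines to a temporary file (a stand-in for the real CSV).
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "1,google.com",
  "2,facebook.com",
  "490727,pomoho.com/user/cmuser,1"
), tmp)

# count.fields() reports the number of separator-delimited fields per line.
# quote = "" and comment.char = "" stop quote and '#' characters inside an
# entry from being treated specially.
n_fields <- count.fields(tmp, sep = ",", quote = "", comment.char = "")

# Any line that does not have exactly two fields is suspect.
bad <- which(n_fields != 2)
print(bad)
print(readLines(tmp)[bad])
```

The appeal of this approach is that it reports line numbers in the file itself, rather than row indexes in a data frame that has already been shifted by the extra rows.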
Oh, and here are the two extra rows that read.csv silently created:
> scores[scores$domain == "",]
       rank domain
490728    1
936300    2