visitor_hash | timestamp | requested_url | referer_from_a_search_engine |
---|---|---|---|
a997c1950718d75c03f22ca8715e50b3 | [28/Feb/2007:23:45:47 -0800] | /group/svsa/cgi-bin/www/officers.php | "http://www.google.com/search?sourceid=navclient&ie=UTF-8&rls=HPIB,HPIB:2006-47,HPIB:en&q=sexy+random+facts" |
See http://www.stanford.edu/~antonell/tags_dataset.html for more information about how to get and use this file.
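As a rough illustration of the record layout above, here is a minimal Python sketch that splits a log line on the pipe delimiter and pulls the query out of the search-engine referer. The field order and the `q=` parameter follow the example row; the file name `access_log.txt` is only a placeholder, not the dataset's actual name.

```python
from urllib.parse import urlparse, parse_qs

def parse_log_line(line):
    """Split one pipe-delimited record into its four fields and
    extract the search query from the referer URL, if any."""
    visitor_hash, timestamp, requested_url, referer = \
        [field.strip() for field in line.split("|")[:4]]
    referer = referer.strip('"')
    query = parse_qs(urlparse(referer).query).get("q", [None])[0]
    return {
        "visitor_hash": visitor_hash,
        "timestamp": timestamp.strip("[]"),
        "requested_url": requested_url,
        "referer": referer,
        "search_query": query,  # e.g. "sexy random facts" for the row above
    }

# Placeholder file name; substitute whatever the downloaded log is called.
with open("access_log.txt") as f:
    for line in f:
        print(parse_log_line(line))
```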
Excerpt: Available for noncommercial research license from The Linguistic Data Consortium (LDC), the corpus spans 20 years of newspapers between 1987 and 2007 (that's 7,475 issues, to be exact). This collection includes the text of 1.8 million articles written at The Times (for wire service articles, you'll have to look elsewhere). Of these, more than 1.5 million have been manually annotated by The New York Times Index with distinct tags for people, places, topics and organizations drawn from a controlled vocabulary. A further 650,000 articles also include summaries written by indexers from the New York Times Index. The corpus is provided as a collection of XML documents in the News Industry Text Format and includes open source Java tools for parsing documents into memory resident objects.
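The Java parsing tools mentioned in the excerpt ship with the corpus itself. As a language-neutral sketch only, the snippet below walks a single corpus XML file with Python's standard ElementTree; the element names (`hl1` for the headline, `block`/`p` for body paragraphs) and the file path are assumptions about a typical NITF layout, not the official parser's API, so check them against the corpus documentation.

```python
import xml.etree.ElementTree as ET

def read_nitf_article(path):
    """Pull headline and body text from one NITF-style XML file.
    Tag names here are common NITF conventions and may need adjusting
    to the corpus's exact schema."""
    root = ET.parse(path).getroot()
    headline_el = root.find(".//hl1")
    headline = headline_el.text if headline_el is not None else None
    paragraphs = [p.text for p in root.findall(".//block/p") if p.text]
    return {"headline": headline, "body": "\n".join(paragraphs)}

# Hypothetical path; the LDC distribution contains many such XML files.
article = read_nitf_article("1987/01/01/0000001.xml")
print(article["headline"])
```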
If you are interested in obtaining either of these data sets, requests can be emailed to love-cs345 at cellixis dt cm.