MemeTracker phrase cluster data
The data contains phrase clusters. For each cluster, it lists all the phrases in the cluster and,
for each phrase, the URLs where that phrase appeared.
Download: clust-qt08080902w3mfq5.txt.gz (220 MB)
Data format: tab-separated file with a nested structure. Each block consists of one A record, followed by its B records, each of which is followed by its C records:
A: <ClSz> <TotFq> <Root> <ClId>
	B: <QtFq> <Urls> <QtStr> <QtId>
		C: <Tm> <Fq> <UrlTy> <Url>
- <ClSz>: number of different phrases in the cluster (number of B records).
- <TotFq>: total frequency (number of mentions) of all the phrases in the cluster.
- <Root>: root phrase of the cluster, i.e. a representative phrase chosen from the phrases in the cluster.
- <ClId>: cluster id.
- <QtFq>: total frequency (number of mentions) of the phrase.
- <Urls>: number of URLs where the phrase appeared (number of C records).
- <QtStr>: phrase string.
- <QtId>: phrase id.
- <Tm>: time when the article/post at <Url> was published.
- <Fq>: number of times the phrase <QtStr> was mentioned at <Url>.
- <UrlTy>: type of the URL: B = blog, M = mainstream media.
- <Url>: URL of the blog post/news article.
2	8	we're not commenting on that story i'm afraid	2131865
	3	3	we're not commenting on that	489007
		2008-08-18 14:23:05	1	M	http://business.theage.com.au/business/bb-chief-set-to-walk-plank-20080818-3xp7.html
		2008-11-26 01:27:13	1	B	http://sfweekly.com/2008-11-26/news/buy-line
		2008-11-27 18:55:30	1	B	http://aconstantineblacklist.blogspot.com/2008/11/re-researcher-matt-janovic.html
	5	2	we're not commenting on that story	2131864
		2008-12-08 14:50:18	3	B	http://videogaming247.com/2008/12/08/home-in-10-days-were-not-commenting-on-that-story-says-scee
		2008-12-08 19:35:31	2	B	http://jplaystation.com/2008/12/08/home-in-10-days-were-not-commenting-on-that-story-says-scee
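The nested A/B/C structure described above could be read with a short Python sketch like the one below. It rests on assumptions that are not stated in the format description: that A lines carry no leading tab, B lines one, and C lines two, and that any line not matching these shapes (for example a header) can be skipped. The names Cluster, Phrase, Mention and parse_clusters are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Optional


@dataclass
class Mention:  # one C record
    time: str
    freq: int
    url_type: str  # "B" (blog) or "M" (mainstream media)
    url: str


@dataclass
class Phrase:  # one B record
    freq: int
    n_urls: int
    text: str
    phrase_id: int
    mentions: List[Mention] = field(default_factory=list)


@dataclass
class Cluster:  # one A record
    size: int
    total_freq: int
    root: str
    cluster_id: int
    phrases: List[Phrase] = field(default_factory=list)


def parse_clusters(lines: Iterable[str]) -> Iterator[Cluster]:
    """Yield one Cluster per A record, with its B and C records attached."""
    cluster: Optional[Cluster] = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        body = line.lstrip("\t")
        depth = len(line) - len(body)  # assumed: 0 = A, 1 = B, 2 = C
        f = body.split("\t")
        if len(f) != 4:
            continue  # blank line or anything that is not a 4-field record
        if depth == 0 and f[0].isdigit() and f[1].isdigit() and f[3].isdigit():
            if cluster is not None:
                yield cluster
            cluster = Cluster(int(f[0]), int(f[1]), f[2], int(f[3]))
        elif depth == 1 and cluster is not None and f[0].isdigit() \
                and f[1].isdigit() and f[3].isdigit():
            cluster.phrases.append(Phrase(int(f[0]), int(f[1]), f[2], int(f[3])))
        elif depth == 2 and cluster is not None and cluster.phrases \
                and f[1].isdigit():
            cluster.phrases[-1].mentions.append(
                Mention(f[0], int(f[1]), f[2], f[3]))
    if cluster is not None:
        yield cluster  # flush the last block
```

Because the function only needs an iterable of lines, the compressed download can be streamed without unpacking it first, for instance with gzip.open("clust-qt08080902w3mfq5.txt.gz", "rt", encoding="utf-8", errors="replace"); the encoding here is a guess. In the example block above, the phrase frequencies 3 and 5 add up to the cluster's <TotFq> of 8, which makes a convenient sanity check on parsed output.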
Raw MemeTracker phrase data
The data contains the phrases and hyperlinks extracted from each article or blog post.
For each article we extract the timestamp, the phrases, and the hyperlinks.
Download: one file per month, starting in August 2008.
Format: each file contains records like the following, separated by blank lines:
P http://blogs.abcnews.com/politicalpunch/2008/09/obama-says-mc-1.html
T 2008-09-09 22:35:24
Q that's not change
Q you know you can put lipstick on a pig
Q what's the difference between a hockey mom and a pit bull lipstick
Q you can wrap an old fish in a piece of paper called change
L http://reuters.com/article/politicsnews/idusn2944356420080901?pagenumber=1&virtualbrandchannel=10112
L http://cbn.com/cbnnews/436448.aspx
L http://voices.washingtonpost.com/thefix/2008/09/bristol_palin_is_pregnant.html?hpid=topnews
where the first letter of the line encodes:
- P: URL of the document
- T: time of the post (timestamp)
- Q: phrase extracted from the text of the document
- L: hyperlink in the document (a link pointing to another document on the web)
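The P/T/Q/L records described above could be read with a Python sketch along these lines. It assumes that the one-letter tag is separated from its value by whitespace, that every record begins with a P line, and that timestamps follow the YYYY-MM-DD HH:MM:SS pattern seen in the example. The names Document and parse_documents are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Iterable, Iterator, List, Optional


@dataclass
class Document:
    url: str                                          # P line
    time: Optional[datetime] = None                   # T line
    phrases: List[str] = field(default_factory=list)  # Q lines
    links: List[str] = field(default_factory=list)    # L lines


def parse_documents(lines: Iterable[str]) -> Iterator[Document]:
    """Yield one Document per blank-line-separated record."""
    doc: Optional[Document] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current record, if there is one.
            if doc is not None:
                yield doc
                doc = None
            continue
        parts = line.split(None, 1)  # tag, then the rest of the line
        tag = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        if tag == "P":
            if doc is not None:  # tolerate a missing blank separator
                yield doc
            doc = Document(url=value)
        elif doc is None:
            continue  # ignore stray lines that precede any P line
        elif tag == "T":
            # Timestamp pattern assumed from the example record.
            doc.time = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
        elif tag == "Q":
            doc.phrases.append(value)
        elif tag == "L":
            doc.links.append(value)
    if doc is not None:
        yield doc  # flush the last record when the file has no trailing blank line
```

Yielding documents one at a time keeps memory use flat, which may matter for month-sized files. A document with no Q or L lines simply ends up with empty lists.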