Download MemeTracker data

MemeTracker phrase cluster data

The data contains phrase clusters. For each cluster, it lists all the phrases in the cluster and, for each phrase, the URLs where it appeared.

Download: clust-qt08080902w3mfq5.txt.gz (220 MB)

Data format: tab-separated file with a nested structure. Each block of the data consists of three record types (A, B, and C):
    A:  <ClSz>  <TotFq>  <Root>  <ClId>
    B:          <QtFq>   <Urls>  <QtStr>  <QtId>
    C:                   <Tm>    <Fq>     <UrlTy>  <Url>
            
  • <ClSz>: number of different phrases in the cluster (number of B records).
  • <TotFq>: total frequency (number of mentions) of all the phrases in the cluster.
  • <Root>: root phrase of the cluster. Representative phrase from the cluster of phrases.
  • <ClId>: cluster id.
  • <QtFq>: total frequency (number of mentions) of the phrase.
  • <Urls>: number of urls where the phrase appeared (number of C records).
  • <QtStr>: phrase string.
  • <QtId>: phrase id.
  • <Tm>: time when the article/post at <Url> was published.
  • <Fq>: number of times the phrase <QtStr> was mentioned at <Url>.
  • <UrlTy>: type of the URL: B for blog, M for mainstream media.
  • <Url>: URL of the blog post/news article.
Example of a block in the file. The lines below map to the fields above: the first line is an A record, followed by a B record with its 3 C records, then another B record with its 2 C records.
  2  8  we're not commenting on that story i'm afraid   2131865
     3  3  we're not commenting on that    489007
        2008-08-18 14:23:05  1  M  http://business.theage.com.au/business/bb-chief-set-to-walk-plank-20080818-3xp7.html
        2008-11-26 01:27:13  1  B  http://sfweekly.com/2008-11-26/news/buy-line
        2008-11-27 18:55:30  1  B  http://aconstantineblacklist.blogspot.com/2008/11/re-researcher-matt-janovic.html
     5  2  we're not commenting on that story      2131864
        2008-12-08 14:50:18  3  B  http://videogaming247.com/2008/12/08/home-in-10-days-were-not-commenting-on-that-story-says-scee
        2008-12-08 19:35:31  2  B  http://jplaystation.com/2008/12/08/home-in-10-days-were-not-commenting-on-that-story-says-scee
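
As a minimal sketch of how the nested A/B/C structure might be read, the Python below rebuilds clusters line by line. It assumes the nesting level is encoded by leading tab characters (none for A, one for B, two for C), as the layout above suggests, and that every record has exactly four tab-separated fields; lines with any other field count are skipped. The UTF-8 decoding is also an assumption. The names Cluster, Phrase, Mention, parse_clusters, and read_cluster_file are illustrative and not part of the dataset.

```python
import gzip
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List


@dataclass
class Mention:  # C record
    time: str
    freq: int
    url_type: str  # "B" (blog) or "M" (mainstream media)
    url: str


@dataclass
class Phrase:  # B record
    freq: int
    n_urls: int
    text: str
    phrase_id: int
    mentions: List[Mention] = field(default_factory=list)


@dataclass
class Cluster:  # A record
    size: int
    total_freq: int
    root: str
    cluster_id: int
    phrases: List[Phrase] = field(default_factory=list)


def parse_clusters(lines: Iterable[str]) -> Iterator[Cluster]:
    """Yield one Cluster per A record, with its B and C records attached."""
    cluster = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue
        body = line.lstrip("\t")
        depth = len(line) - len(body)  # assumed: 0 = A, 1 = B, 2 = C
        f = body.split("\t")
        if len(f) != 4:
            continue  # not a data record under the assumptions above
        if depth == 0:
            if cluster is not None:
                yield cluster
            cluster = Cluster(int(f[0]), int(f[1]), f[2], int(f[3]))
        elif depth == 1 and cluster is not None:
            cluster.phrases.append(Phrase(int(f[0]), int(f[1]), f[2], int(f[3])))
        elif depth == 2 and cluster is not None and cluster.phrases:
            cluster.phrases[-1].mentions.append(
                Mention(f[0], int(f[1]), f[2], f[3])
            )
    if cluster is not None:
        yield cluster


def read_cluster_file(path: str) -> Iterator[Cluster]:
    """Stream clusters from the gzipped file without unpacking it first."""
    # Assumption: the text decodes as UTF-8; undecodable bytes are replaced.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        yield from parse_clusters(fh)
```

Iterating over read_cluster_file("clust-qt08080902w3mfq5.txt.gz") streams clusters one at a time, so the whole file never has to fit in memory. A simple sanity check on each cluster is that the number of parsed phrases equals <ClSz> and that each phrase has <Urls> mentions.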

Raw MemeTracker phrase data

The data contains the phrases and hyperlinks extracted from each article or blog post. For each document we extract the timestamp, the phrases, and the hyperlinks.

Download: one file per month, starting in August 2008. Format: each file contains records like the following, separated by blank lines:
    P       http://blogs.abcnews.com/politicalpunch/2008/09/obama-says-mc-1.html
    T       2008-09-09 22:35:24
    Q       that's not change
    Q       you know you can put lipstick on a pig
    Q       what's the difference between a hockey mom and a pit bull lipstick
    Q       you can wrap an old fish in a piece of paper called change
    L       http://reuters.com/article/politicsnews/idusn2944356420080901?pagenumber=1&virtualbrandchannel=10112
    L       http://cbn.com/cbnnews/436448.aspx
    L       http://voices.washingtonpost.com/thefix/2008/09/bristol_palin_is_pregnant.html?hpid=topnews

where the first letter of each line encodes the type of the field:
  • P: URL of the document
  • T: time of the post (timestamp)
  • Q: phrase extracted from the text of the document
  • L: hyperlinks in the document (links pointing out to other documents on the web)
Note that some documents have zero phrases or zero links.
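
As a rough sketch of reading this record format, the Python below groups P/T/Q/L lines into one object per document. It assumes the one-letter tag is separated from its value by whitespace (a tab or spaces) and that a new P line or a blank line ends the previous record. The names Document and parse_documents are illustrative and not part of the dataset.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List


@dataclass
class Document:
    url: str                      # P line
    time: str = ""                # T line, kept as the original string
    phrases: List[str] = field(default_factory=list)  # Q lines (may be empty)
    links: List[str] = field(default_factory=list)    # L lines (may be empty)


def parse_documents(lines: Iterable[str]) -> Iterator[Document]:
    """Yield one Document per blank-line-separated record."""
    doc = None
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current record, if any.
            if doc is not None:
                yield doc
                doc = None
            continue
        parts = line.split(None, 1)  # tag, then the rest of the line
        if len(parts) != 2:
            continue  # tag without a value; ignored in this sketch
        tag, value = parts
        if tag == "P":
            if doc is not None:      # tolerate a missing blank separator
                yield doc
            doc = Document(url=value)
        elif doc is None:
            continue                 # field line before any P line
        elif tag == "T":
            doc.time = value
        elif tag == "Q":
            doc.phrases.append(value)
        elif tag == "L":
            doc.links.append(value)
    if doc is not None:
        yield doc
```

Because some documents have zero phrases or zero links, the phrases and links lists default to empty rather than being required.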