MemeTracker: Download MemeTracker data

Phrase cluster data
Raw phrases data

MemeTracker phrase cluster data

This dataset contains phrase clusters. For each cluster, it lists all the phrases in the cluster and, for each phrase, the URLs where that phrase appeared.

Download: clust-qt08080902w3mfq5.txt.gz (220 MB)

Data format: a tab-separated file with a nested structure. Each block consists of one A record, followed by its B records, each of which is followed by its own C records:

    A:  <ClSz>  <TotFq>  <Root>  <ClId>
    B:          <QtFq>   <Urls>  <QtStr>  <QtId>
    C:                   <Tm>    <Fq>     <UrlTy>  <Url>

<ClSz>: number of different phrases in the cluster (number of B records).
<TotFq>: total frequency (number of mentions) of all the phrases in the cluster.
<Root>: root phrase of the cluster, a representative phrase from the cluster.
<ClId>: cluster id.
<QtFq>: total frequency (number of mentions) of the phrase.
<Urls>: number of URLs where the phrase appeared (number of C records).
<QtStr>: phrase string.
<QtId>: phrase id.
<Tm>: time when the article/post at <Url> was published.
<Fq>: number of times the phrase <QtStr> was mentioned at <Url>.
<UrlTy>: type of the URL: B for blog, M for mainstream media.
<Url>: URL of the blog post/news article.

Example of a block in the file. The lines below map to the fields above: the first line is the A record, followed by a B record with its 3 C records, then another B record with its 2 C records.

  2  8  we're not commenting on that story i'm afraid   2131865
     3  3  we're not commenting on that    489007
        2008-08-18 14:23:05  1  M  http://business.theage.com.au/business/bb-chief-set-to-walk-plank-20080818-3xp7.html
        2008-11-26 01:27:13  1  B  http://sfweekly.com/2008-11-26/news/buy-line
        2008-11-27 18:55:30  1  B  http://aconstantineblacklist.blogspot.com/2008/11/re-researcher-matt-janovic.html
     5  2  we're not commenting on that story      2131864
        2008-12-08 14:50:18  3  B  http://videogaming247.com/2008/12/08/home-in-10-days-were-not-commenting-on-that-story-says-scee
        2008-12-08 19:35:31  2  B  http://jplaystation.com/2008/12/08/home-in-10-days-were-not-commenting-on-that-story-says-scee
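
The nested A/B/C structure described above can be read with a small streaming parser. The following Python sketch is illustrative only and is not part of the dataset distribution. It assumes that the nesting level is encoded by the number of leading tab characters (none for A, one for B, two for C) and that fields within a record are separated by single tabs; the class and function names (Cluster, Phrase, Mention, parse_clusters, counts_match) are hypothetical. Lines that do not split into exactly four fields, such as any header text, are skipped.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List


@dataclass
class Mention:            # C record
    time: str             # <Tm>, kept as the original string
    freq: int             # <Fq>
    url_type: str         # <UrlTy>: "B" (blog) or "M" (mainstream media)
    url: str              # <Url>


@dataclass
class Phrase:             # B record
    freq: int             # <QtFq>
    n_urls: int           # <Urls>
    text: str             # <QtStr>
    phrase_id: int        # <QtId>
    mentions: List[Mention] = field(default_factory=list)


@dataclass
class Cluster:            # A record
    size: int             # <ClSz>
    total_freq: int       # <TotFq>
    root: str             # <Root>
    cluster_id: int       # <ClId>
    phrases: List[Phrase] = field(default_factory=list)


def parse_clusters(lines: Iterable[str]) -> Iterator[Cluster]:
    """Yield one Cluster per A record, with its B and C records attached.

    Assumption: nesting depth equals the number of leading tabs.
    """
    cluster = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        body = line.lstrip("\t")
        depth = len(line) - len(body)
        parts = body.split("\t")
        if len(parts) != 4:
            continue  # blank line or anything that is not an A/B/C record
        if depth == 0:
            if cluster is not None:
                yield cluster
            cluster = Cluster(int(parts[0]), int(parts[1]), parts[2], int(parts[3]))
        elif depth == 1 and cluster is not None:
            cluster.phrases.append(
                Phrase(int(parts[0]), int(parts[1]), parts[2], int(parts[3]))
            )
        elif depth == 2 and cluster is not None and cluster.phrases:
            cluster.phrases[-1].mentions.append(
                Mention(parts[0], int(parts[1]), parts[2], parts[3])
            )
    if cluster is not None:
        yield cluster


def counts_match(cluster: Cluster) -> bool:
    """Check <ClSz> and <Urls> against the number of B and C records read."""
    return cluster.size == len(cluster.phrases) and all(
        p.n_urls == len(p.mentions) for p in cluster.phrases
    )
```

For the downloaded file, the lines could be supplied by gzip.open(path, "rt", encoding="utf-8") from the Python standard library; the UTF-8 encoding is an assumption. Because the parser is a generator, only one cluster is held in memory at a time.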

Raw MemeTracker phrase data

This dataset contains the phrases and hyperlinks extracted from each article or blog post. For each document we extract the timestamp, the phrases, and the hyperlinks.

Download: one file per month starting in August 2008.

Format: each file contains records like the following, separated by blank lines:

    P       http://blogs.abcnews.com/politicalpunch/2008/09/obama-says-mc-1.html
    T       2008-09-09 22:35:24
    Q       that's not change
    Q       you know you can put lipstick on a pig
    Q       what's the difference between a hockey mom and a pit bull lipstick
    Q       you can wrap an old fish in a piece of paper called change
    L       http://reuters.com/article/politicsnews/idusn2944356420080901?pagenumber=1&virtualbrandchannel=10112
    L       http://cbn.com/cbnnews/436448.aspx
    L       http://voices.washingtonpost.com/thefix/2008/09/bristol_palin_is_pregnant.html?hpid=topnews

where the first letter of each line encodes the field type:

P: URL of the document
T: time of the post (timestamp)
Q: phrase extracted from the text of the document
L: hyperlink in the document (a link pointing out to another document on the web)

Note that some documents have zero phrases or zero links.
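
The record layout above can be read with a short streaming parser. The following Python sketch is illustrative only and is not part of the dataset distribution. It assumes that the one-letter tag is separated from its value by whitespace (tabs or spaces) and that a new P line always starts a new document; the names Document and parse_documents are hypothetical. Documents with no Q or L lines simply end up with empty lists.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List


@dataclass
class Document:
    url: str                                            # P: URL of the document
    time: str = ""                                      # T: timestamp, kept as a string
    phrases: List[str] = field(default_factory=list)    # Q lines
    links: List[str] = field(default_factory=list)      # L lines


def parse_documents(lines: Iterable[str]) -> Iterator[Document]:
    """Yield one Document per P/T/Q/L record."""
    doc = None
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current record.
            if doc is not None:
                yield doc
                doc = None
            continue
        # Split the tag from the rest of the line; the value may contain spaces.
        parts = line.split(None, 1)
        tag = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        if tag == "P":
            if doc is not None:
                yield doc  # tolerate a missing blank line between records
            doc = Document(url=value)
        elif doc is None:
            continue  # ignore stray lines before the first P line
        elif tag == "T":
            doc.time = value
        elif tag == "Q":
            doc.phrases.append(value)
        elif tag == "L":
            doc.links.append(value)
    if doc is not None:
        yield doc
```

The timestamps are left as strings here; if needed, they could be converted with datetime.strptime(doc.time, "%Y-%m-%d %H:%M:%S"), which matches the format shown in the example record.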